This guide provides instructions for accessing HDFS data in OpenPAI.
Data on HDFS can be accessed in various ways, and users can choose the most suitable one for their needs. For shell access, users can use WebHDFS or the HDFS command to access HDFS data. Users can also view HDFS data in a web browser through the web portal. For accessing data from a deep learning framework, please use the HDFS API and avoid other means for best performance and robustness. Note that some deep learning frameworks have built-in HDFS support. For example, when training on large data, TensorFlow usually serializes the data into a few big files such as TFRecord files, and it supports HDFS natively. For PyTorch, it is recommended to use the HDFS Python library to access HDFS data during training.
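As a minimal sketch of the native support mentioned above, the snippet below reads a TFRecord file directly from an hdfs:// path with TensorFlow's tf.data API. It assumes TensorFlow was built with HDFS support and that libhdfs and the Hadoop classpath are available in the job container; the path is a hypothetical example that follows the URI pattern described in the entrypoint section below.

```python
import tensorflow as tf

# Hypothetical TFRecord path; 9000 is the name node port described in the
# "Where to get the HDFS entrypoint" section of this guide.
dataset = tf.data.TFRecordDataset(
    "hdfs://hdfs-name-node-address:9000/data/train.tfrecord")

# Iterate over a few serialized records; parsing depends on how the data was written.
for raw_record in dataset.take(2):
    print(len(raw_record.numpy()))
```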
WebHDFS provides a set of REST APIs and is our recommended way to access data. The WebHDFS REST API documentation contains detailed instructions for the APIs. In OpenPAI, all WebHDFS requests are redirected by Pylon, so there is no need to access the name node or data nodes directly, and the REST server URI is http://master-node-address/webhdfs. The master-node-address is the address of the machine whose pai-master label is true in the configuration file layout.yaml. The following two simple examples show how the APIs can be used to create and delete a file.
- Create a File
Suppose we want to create the file test_file under the directory /test. The first step is to submit a request without redirection and without data, using the command:
curl -i -X PUT "http://master-node-address/webhdfs/api/v1/test/test_file?op=CREATE"
This command returns the data node where the file should be written; the returned location URI can be found in the Location header of the response.
Then run the following command with this URI to write the file data:
curl -i -X PUT -T file-data-to-write "returned-location-uri"
Here, returned-location-uri is the location URI returned by the first command.
- Delete a File
To delete the file created in the example above, run the following command:
curl -i -X DELETE "http://master-node-address/webhdfs/api/v1/test/test_file?op=DELETE"
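The same flow can be scripted. Below is a minimal Python sketch (not part of the original examples) that performs the two-step upload and the delete through the Pylon endpoint using the requests library; master-node-address and the file names are the same placeholders as in the curl commands above.

```python
import requests

base = "http://master-node-address/webhdfs/api/v1"

# Step 1: ask where to write; disable redirects so we can read the data node
# location from the Location header instead of following it automatically.
resp = requests.put(f"{base}/test/test_file?op=CREATE", allow_redirects=False)
location = resp.headers["Location"]

# Step 2: send the file content to the returned data node location.
with open("file-data-to-write", "rb") as f:
    requests.put(location, data=f)

# Delete the file again, mirroring the DELETE example above.
requests.delete(f"{base}/test/test_file?op=DELETE")
```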
- Prepare HDFS cmd package:
The commands are available in the Hadoop package. Users can use this package in two ways.
Method 1 (Host env):
Please download the version you need from Hadoop Releases, then extract it on your machine by running:
tar -zxvf hadoop-package-name
All commands are located in the bin directory.
Method 2 (Docker container env):
We have uploaded a Docker image with built-in HDFS support to Docker Hub. Please refer to the HDFS commands guide for details.
All commands are located in the bin directory.
- How to use cmd:
Please refer to the HDFS Command Guide for detailed command descriptions.
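As an illustration only, the sketch below calls the HDFS CLI from Python with subprocess, which can be convenient inside a job script. It assumes the bin directory from one of the methods above is on PATH; the URI is a hypothetical example following the entrypoint pattern in the next section.

```python
import subprocess

# List a directory with the hdfs CLI; assumes the "hdfs" binary is on PATH.
result = subprocess.run(
    ["hdfs", "dfs", "-ls", "hdfs://hdfs-name-node-address:9000/test"],
    capture_output=True, text=True, check=True)
print(result.stdout)
```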
- Where to get the HDFS entrypoint:
Every file in HDFS is specified by a URI following the pattern hdfs://hdfs-name-node-address:name-node-port/parent/child. Here the name-node-port is 9000, and the default hdfs-name-node-address is the same IP address as the OpenPAI entrypoint page.
Note: hdfs-name-node-address is the address of the machine whose pai-master label is true in the configuration file layout.yaml. If you don't know where this file is, please contact the cluster administrator.
Data on HDFS can be accessed by pointing your web browser to http://hdfs-name-node-address:5070/explorer.html after the cluster is ready. The hdfs-name-node-address is the address of the machine whose pai-master label is true in the configuration file layout.yaml. From release 2.9.0, users can upload or delete files on the web portal; on earlier releases, users can only browse the data.
The Java APIs allow users to access data from Java programs. The detailed HDFS API interfaces can be found in the HDFS API Doc.
The C API is provided by the libhdfs library and supports only a subset of HDFS operations. Please follow the instructions in C APIs for details.
The Python API can be installed with the command:
pip install hdfs
Please refer to HdfsCLI for details.
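A minimal sketch with the HdfsCLI client is shown below. The WebHDFS URL and user name are assumptions, not values from this guide: it points at the name node's web port (5070, the same port as the explorer page above); adjust both for your cluster.

```python
from hdfs import InsecureClient

# Hypothetical endpoint and user; HdfsCLI talks to HDFS over WebHDFS.
client = InsecureClient("http://hdfs-name-node-address:5070", user="your-user-name")

# List a directory and read a file, mirroring the /test/test_file example above.
print(client.list("/test"))
with client.read("/test/test_file") as reader:
    content = reader.read()
```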