A set of tools to manage Parquet files.
The ParquetStats tool takes as input:
- the HDFS folder containing Parquet files. Both the URL form, such as hdfs://localhost:9000/foo/bar, and the absolute path form, such as /foo/bar, are valid;
- the configuration file that specifies the IP address and port of the namenode (optional).
It gives as output the following descriptive statistics: the total size of the directory, the standard deviation, the average size, the minimum size, the maximum size and the number of files. File sizes are reported in MB.
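As an illustration of what the tool computes, these statistics can be reproduced with the plain Hadoop FileSystem API. The sketch below is illustrative, not the tool's actual code:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ParquetStatsSketch {
    public static void main(String[] args) throws IOException {
        // args[0]: input folder, in URL form or absolute path form
        Path dir = new Path(args[0]);
        FileSystem fs = FileSystem.get(dir.toUri(), new Configuration());

        double totalMB = 0, sumSquares = 0, min = Double.MAX_VALUE, max = 0;
        long count = 0;
        for (FileStatus f : fs.listStatus(dir)) {
            if (!f.getPath().getName().endsWith(".parquet")) continue;
            double sizeMB = f.getLen() / (1024.0 * 1024.0); // bytes -> MB
            totalMB += sizeMB;
            sumSquares += sizeMB * sizeMB;
            min = Math.min(min, sizeMB);
            max = Math.max(max, sizeMB);
            count++;
        }
        if (count == 0) {
            System.out.println("No .parquet files found in " + dir);
            return;
        }
        double avg = totalMB / count;
        // Population standard deviation: sqrt(E[x^2] - E[x]^2)
        double stdDev = Math.sqrt(sumSquares / count - avg * avg);
        System.out.printf("files=%d total=%.2f MB avg=%.2f MB min=%.2f MB max=%.2f MB stddev=%.2f MB%n",
                count, totalMB, avg, min, max, stdDev);
    }
}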
Usage: ParquetStats (-i, --input path_to_HDFS_folder) [-c, --config path_to_configuration_file]
The brackets denote optional elements, while the parentheses denote mandatory ones. You can use either the long or the short version of each option. If the -c option is omitted, the configuration file is searched for in the classpath; if it cannot be found there, an error message is displayed to the user.
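A minimal sketch of how this resolution might look with Hadoop's Configuration API, assuming the standard core-site.xml/hdfs-site.xml file names (this is not necessarily the tool's internal logic):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class ConfigLoader {
    // With no -c option, the Configuration constructor picks up resources
    // such as core-site.xml from the classpath; with -c, the files from the
    // given directory are added explicitly (file names are assumptions).
    public static Configuration load(String configDir) {
        Configuration conf = new Configuration();
        if (configDir != null) {
            conf.addResource(new Path(configDir, "core-site.xml"));
            conf.addResource(new Path(configDir, "hdfs-site.xml"));
        }
        return conf;
    }
}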
Here is an example of tool usage:
ParquetStats -i /home/parquet_dataset -c /etc/hadoop/conf
The previous command will connect to the namenode, read the .parquet files in the parquet_dataset folder and compute the statistics. Moreover, the tool allows the logging level to be managed at runtime by setting the -Dlog4j=log_level system property as a VM argument. The available values of log_level are listed in the log4j library documentation.
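For example, assuming the tools are launched through the JVM with the classes on the classpath (the JAR name parquet-tools.jar below is illustrative):
java -Dlog4j=DEBUG -cp parquet-tools.jar ParquetStats -i /home/parquet_dataset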
The ParquetRepart tool takes as input:
- the HDFS folder containing NT Parquet files. Both the URL form, such as hdfs://localhost:9000/foo/bar, and the absolute path form, such as /foo/bar, are valid;
- the number NR of .parquet files we want to get in output, or the minimum file size;
- a flag indicating the overwrite mode.
It gives as output a new set of NR Parquet files, stored in a new folder (generally, NR ≤ NT).
Usage: ParquetRepart (-i, --input path_to_HDFS_folder) (-o, --output path_to_output_dir) [-p, --partitions number_of_parquet_files | -s, --size min_size_of_parquet_file] [-f, --flagoverwrite]
If the new folder already exists on the server, the overwrite flag must be specified; otherwise, an error message is displayed to the user and the redistribution of Parquet files is canceled.
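Conceptually, the redistribution can be expressed with Apache Spark. The following is a minimal sketch under the assumption that Spark is available; it is not necessarily how the tool is implemented:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class ParquetRepartSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("ParquetRepartSketch").getOrCreate();
        Dataset<Row> df = spark.read().parquet("/home/parquet_dataset"); // NT input files
        // Reduce to NR output files; coalesce avoids a full shuffle when NR <= NT
        int nr = 20;
        df.coalesce(nr)
          .write()
          .mode(SaveMode.ErrorIfExists) // SaveMode.Overwrite would correspond to the -f flag
          .parquet("/home/parquet_out_dir");
        spark.stop();
    }
}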
Here is an example of tool usage:
ParquetRepart -i /home/parquet_dataset -o /home/parquet_out_dir -p 20
The previous command will read the .parquet files in the parquet_dataset folder and then write a new set of twenty .parquet files in parquet_out_dir. It is assumed that the input folder contains more than twenty .parquet files.
The ParquetBench tool takes as input:
- the HDFS folder containing Parquet files. Both the URL form, such as hdfs://localhost:9000/foo/bar, and the absolute path form, such as /foo/bar, are valid;
- an integer n that specifies the number of repetitions of a query test.
It gives as output n execution times. Basically, the ParquetBench tool performs a benchmark to evaluate the application performance, that is, the execution time of a test query as the number of Parquet files varies.
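A minimal sketch of such a benchmark loop with Spark SQL; the test query below (a COUNT(*)) is a stand-in, since the tool's actual test query is not specified here:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ParquetBenchSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("ParquetBenchSketch").getOrCreate();
        Dataset<Row> df = spark.read().parquet(args[0]); // input folder (-i)
        df.createOrReplaceTempView("dataset");
        int n = Integer.parseInt(args[1]); // number of repetitions (-n)
        for (int i = 0; i < n; i++) {
            long start = System.nanoTime();
            // Force execution of the test query and measure its wall-clock time
            spark.sql("SELECT COUNT(*) FROM dataset").collectAsList();
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println("run " + (i + 1) + ": " + elapsedMs + " ms");
        }
        spark.stop();
    }
}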
Usage: ParquetBench (-i, --input path_to_HDFS_folder) (-n, --nloop number_of_repetitions)
Here is an example of tool usage:
ParquetBench -i /home/parquet_dataset -n 20