How to install Spark with Hive support and run TPC-DS queries:
Step 1: Install Hadoop and configure HDFS
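Hadoop installation itself is out of scope here, but as a reminder, once the configuration is in place the filesystem is initialized and started with something like the following (a sketch assuming HADOOP_HOME points at the Hadoop 2.6.2 install):
  # format the namenode and start HDFS
  $HADOOP_HOME/bin/hdfs namenode -format
  $HADOOP_HOME/sbin/start-dfs.sh
  # create the folder that will later hold the TPC-DS data
  $HADOOP_HOME/bin/hdfs dfs -mkdir -p /data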
Step 2: Compile Spark with Hive support
git clone https://github.com/apache/spark (optionally check out the desired tag)
./make-distribution.sh --name spark-with-hive --tgz -Psparkr -Phadoop-2.6 -Dhadoop.version=2.6.2 -Phive -Phive-thriftserver -Pyarn
(make sure the Hadoop version you installed is the same one passed to the build command; this takes a while...)
Install Spark in some folder (e.g. /opt/spark) by extracting the generated archive
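The extraction step looks roughly like this (the exact archive name depends on the Spark version you built, so adjust it):
  tar -xzf spark-1.6.0-bin-spark-with-hive.tgz -C /opt
  mv /opt/spark-1.6.0-bin-spark-with-hive /opt/spark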
Step 3: Install Hive
- Configure Hive to use MySQL as the metastore: http://www.cloudera.com/content/www/en-us/documentation/archive/cdh/4-x/4-2-0/CDH4-Installation-Guide/cdh4ig_topic_18_4.html
- Skip the part of the configuration about the "hive.metastore.uris" parameter
- Copy the same hive-site.xml file used in the Hive configuration into the folder with the Spark configuration (this tells Spark where to look for the metastore)
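For reference, the metastore-related part of hive-site.xml looks roughly like this (host, database name, user, and password below are placeholders to adapt to your MySQL setup):
  <configuration>
    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:mysql://localhost:3306/metastore</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionUserName</name>
      <value>hive</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionPassword</name>
      <value>hivepassword</value>
    </property>
  </configuration>
Copying it into the Spark configuration folder (assuming Hive's configuration lives in /etc/hive/conf and Spark was installed in /opt/spark):
  cp /etc/hive/conf/hive-site.xml /opt/spark/conf/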
Step 4: Generate the TPC-DS dataset (I'm using https://github.com/hortonworks/hive-testbench)
- if it was not already generated on HDFS, put it there
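A sketch of the generation and upload, assuming the hive-testbench scripts keep their documented names and that the data for scale factor 2 should end up under /data/2 on HDFS (script names and arguments may differ between testbench versions):
  git clone https://github.com/hortonworks/hive-testbench
  cd hive-testbench
  ./tpcds-build.sh        # compiles the TPC-DS data generator
  ./tpcds-setup.sh 2      # generates the data at scale factor 2
  # if the data ended up on the local filesystem, upload it to HDFS:
  hdfs dfs -mkdir -p /data/2
  hdfs dfs -put <local-data-folder>/* /data/2/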
Step 5: Load the tables into Hive to set up the metastore using the reset-metastore.sh script
- e.g. ./reset-metastore.sh 2 /data (scale factor 2, data under /data)
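I don't reproduce reset-metastore.sh here, but conceptually it does something like the following hypothetical sketch: recreate the database named after the scale factor and register external text tables over the HDFS data (only one table is shown, with an abbreviated schema; the real script covers all TPC-DS tables with their full schemas):
  #!/bin/sh
  SCALE=$1    # scale factor, e.g. 2
  DATADIR=$2  # HDFS base folder, e.g. /data
  DB="tpcds_text_${SCALE}"
  # drop and recreate the database for this scale factor
  hive -e "DROP DATABASE IF EXISTS ${DB} CASCADE; CREATE DATABASE ${DB};"
  # register an external table over the pipe-delimited text files on HDFS
  hive -e "
  USE ${DB};
  CREATE EXTERNAL TABLE date_dim (
    d_date_sk INT, d_date_id STRING, d_date STRING, d_year INT
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
  LOCATION '${DATADIR}/${SCALE}/date_dim';"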
Step 6: Build the Spark application with the embedded queries
- git clone https://github.com/GiovanniPaoloGibilisco/TPC-DS-Spark
- cd TPC-DS-Spark
- mvn clean install
Step 7: Run the query by submitting the application to Spark
- spark-submit --master spark://clusterino1:7077 --class it.polimi.spark.tpcds.Query target/uber-tcp-ds-0.0.1-SNAPSHOT.jar -i /data/2 -o /output -db tpcds_text_2 -id R1
- the database is the one created in Step 5; its name is "tpcds_text_" + <scale factor> (e.g. tpcds_text_2 for scale factor 2)
- custom queries can be executed by passing -q "query text" instead of the -id argument
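For example, a custom query run might look like this (the query text is just an illustration):
  spark-submit --master spark://clusterino1:7077 --class it.polimi.spark.tpcds.Query \
    target/uber-tcp-ds-0.0.1-SNAPSHOT.jar \
    -i /data/2 -o /output -db tpcds_text_2 \
    -q "select count(*) from store_sales"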
To change the dataset size:
- Repeat Step 4
- Repeat Step 5 (optionally dropping the other database)
- Repeat Step 7 (as many times as needed, with the required queries)
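For example, to move to scale factor 10 following the same conventions as above:
  ./tpcds-setup.sh 10             # Step 4: regenerate the data
  ./reset-metastore.sh 10 /data   # Step 5: creates tpcds_text_10
  spark-submit --master spark://clusterino1:7077 --class it.polimi.spark.tpcds.Query \
    target/uber-tcp-ds-0.0.1-SNAPSHOT.jar -i /data/10 -o /output -db tpcds_text_10 -id R1   # Step 7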