Spark is a fast and general cluster computing system for Big Data. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis.
It also supports a rich set of higher-level tools including Spark SQL for SQL and DataFrames, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for stream processing.
You can find the latest Spark documentation, including a programming guide, on the project web page at http://spark.apache.org/. This README file only contains basic setup instructions.
Spark is built using Apache Maven. To build Spark and its example programs, run:

    build/mvn -DskipTests clean package
(You do not need to do this if you downloaded a pre-built package.)
You can build Spark using more than one thread by using the -T option with Maven; see ["Parallel builds in Maven 3"](https://cwiki.apache.org/confluence/display/MAVEN/Parallel+builds+in+Maven+3). More detailed documentation is available from the project site, at ["Building Spark"](http://spark.apache.org/docs/latest/building-spark.html).
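For example, a parallel build on a machine with spare cores might look like the following; the thread count of 4 is only illustrative:

    build/mvn -T 4 -DskipTests clean package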
For general development tips, including info on developing Spark using an IDE, see ["Useful Developer Tools"](http://spark.apache.org/developer-tools.html).
The easiest way to start using Spark is through the Scala shell:

    ./bin/spark-shell
Try the following command, which should return 1000:

    scala> sc.parallelize(1 to 1000).count()
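Other RDD operations can be chained in the same session. As a minimal sketch (the filter predicate is purely illustrative), the following counts the even numbers and should return 500:

    scala> sc.parallelize(1 to 1000).filter(_ % 2 == 0).count()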
Alternatively, if you prefer Python, you can use the Python shell:

    ./bin/pyspark
And run the following command, which should also return 1000:

    >>> sc.parallelize(range(1000)).count()
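The same kind of chaining works in the Python shell. As a minimal sketch (again, the predicate is purely illustrative), the following should return 500:

    >>> sc.parallelize(range(1000)).filter(lambda x: x % 2 == 0).count()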
Testing first requires building Spark. Once Spark is built, tests can be run using:

    ./dev/run-tests
Please see the guidance on how to run tests for a module or individual tests.
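As an illustration, an individual Scala suite can also be run through Maven via the scalatest plugin's wildcardSuites property; the suite name below is just an example, so substitute the one you care about:

    build/mvn -DwildcardSuites=org.apache.spark.rdd.RDDSuite test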