JPMML-SparkML-Package

JPMML-SparkML as an Apache Spark Package.

Prerequisites

Apache Spark 1.6.X or 2.0.X.

Installation

Clone the JPMML-SparkML-Package project and enter its directory:

git clone https://github.com/jpmml/jpmml-sparkml-package.git
cd jpmml-sparkml-package

When targeting Apache Spark 1.6.X, check out the spark-1.6.X development branch:

git checkout spark-1.6.X

Scala

Build the project:

mvn clean package

The build produces an uber-JAR file target/jpmml-sparkml-package-1.1-SNAPSHOT.jar.

PySpark

Add the Python bindings of Apache Spark to the PYTHONPATH environment variable:

export PYTHONPATH=$PYTHONPATH:$SPARK_HOME/python

Build the project using the pyspark profile:

mvn -Ppyspark clean package

The build produces an EGG file target/jpmml_sparkml-1.1rc0.egg and an uber-JAR file target/jpmml-sparkml-package-1.1-SNAPSHOT.jar.

Test the uber-JAR file:

cd src/main/python
nosetests

Usage

Scala

Launch the Spark shell with JPMML-SparkML-Package; use --jars to specify the location of the uber-JAR file:

spark-shell --jars /path/to/jpmml-sparkml-package/target/jpmml-sparkml-package-1.1-SNAPSHOT.jar

Fitting an example pipeline model:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.feature.RFormula

val df = spark.read.option("header", "true").option("inferSchema", "true").csv("Iris.csv")

val formula = new RFormula().setFormula("Species ~ .")
val classifier = new DecisionTreeClassifier()
val pipeline = new Pipeline().setStages(Array(formula, classifier))
val pipelineModel = pipeline.fit(df)

Exporting the fitted example pipeline model to a PMML byte array:

val pmmlBytes = org.jpmml.sparkml.ConverterUtil.toPMMLByteArray(df.schema, pipelineModel)
println(new String(pmmlBytes, "UTF-8"))

PySpark

Add the EGG file to the PYTHONPATH environment variable:

export PYTHONPATH=$PYTHONPATH:/path/to/jpmml-sparkml-package/target/jpmml_sparkml-1.1rc0.egg
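As an alternative to exporting PYTHONPATH, the EGG file can be placed on the module search path from inside a Python session. The following is a minimal sketch, assuming the EGG is a regular zip-format egg that Python can import from directly; the add_egg_to_path helper is hypothetical and not part of JPMML-SparkML-Package.

```python
import sys


def add_egg_to_path(egg_path):
    """Hypothetical helper: append egg_path to sys.path unless it is already there."""
    if egg_path not in sys.path:
        sys.path.append(egg_path)
    return egg_path in sys.path


# The path below mirrors the build output location used in this README.
add_egg_to_path("/path/to/jpmml-sparkml-package/target/jpmml_sparkml-1.1rc0.egg")
```

This only affects where Python looks for the jpmml_sparkml module; the uber-JAR file still has to be passed to the shell with --jars.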

Launch the PySpark shell with JPMML-SparkML-Package; use --jars to specify the location of the uber-JAR file:

pyspark --jars /path/to/jpmml-sparkml-package/target/jpmml-sparkml-package-1.1-SNAPSHOT.jar

Fitting an example pipeline model:

from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import RFormula

df = spark.read.csv("Iris.csv", header = True, inferSchema = True)

formula = RFormula(formula = "Species ~ .")
classifier = DecisionTreeClassifier()
pipeline = Pipeline(stages = [formula, classifier])
pipelineModel = pipeline.fit(df)

Exporting the fitted example pipeline model to a PMML byte array:

from jpmml_sparkml import toPMMLBytes

pmmlBytes = toPMMLBytes(sc, df, pipelineModel)
print(pmmlBytes.decode("UTF-8"))
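Once the byte array is available, a typical next step is to persist it. The following is a minimal sketch, not part of JPMML-SparkML-Package: the save_pmml helper and the model.pmml file name are hypothetical. It assumes only that the bytes hold an XML document whose root element is PMML, and checks this before writing.

```python
import xml.etree.ElementTree as ET


def save_pmml(pmml_bytes, path):
    """Hypothetical helper: verify the bytes parse as XML with a PMML root, then write them to path."""
    root = ET.fromstring(pmml_bytes)
    # The root tag is namespace-qualified, e.g. "{http://www.dmg.org/PMML-4_3}PMML";
    # strip the namespace so the check does not depend on the PMML schema version.
    local_name = root.tag.rsplit("}", 1)[-1]
    if local_name != "PMML":
        raise ValueError("Unexpected root element: " + root.tag)
    # Write the original bytes unchanged, in binary mode, to preserve the encoding.
    with open(path, "wb") as f:
        f.write(pmml_bytes)
    return path
```

In the PySpark shell session above, this would be called as save_pmml(pmmlBytes, "model.pmml").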

License

JPMML-SparkML-Package is licensed under the GNU Affero General Public License (AGPL) version 3.0. Other licenses are available on request.

Additional information

For additional information, please contact info@openscoring.io.
