This repository contains code and documentation for use with Google Cloud Dataproc.
- codelabs/opencv-haarcascade provides the source code for the OpenCV Dataproc Codelab, which demonstrates a Spark job that adds facial detection to a set of images.
- codelabs/spark-bigquery provides the source code for the PySpark for Preprocessing BigQuery Data Codelab, which demonstrates using PySpark on Cloud Dataproc to process data from BigQuery.
- codelabs/spark-nlp provides the source code for the PySpark for Natural Language Processing Codelab, which demonstrates using the spark-nlp library for natural language processing.
- notebooks/python provides example Jupyter notebooks that demonstrate using PySpark with the BigQuery Storage Connector and the Spark GCS Connector.
- spark-tensorflow provides an example of using Spark as a preprocessing toolchain for TensorFlow jobs. Optionally, it demonstrates using the spark-tensorflow-connector to convert CSV files to TFRecords.
- spark-translate provides a simple demo Spark application that runs on Cloud Dataproc and translates words using Google's Translation API.
See each directory's README for more information.
You can find more Dataproc resources in these GitHub repositories:
- Hadoop/Spark GCS Connector
- Spark BigQuery Connector
- Hadoop BigQuery Connector
- Spark Pubsub Connector
- Spark Spanner Connector
- Hive BigQuery Storage Handler
- Dataproc Python examples
- Dataproc Pubsub Spark Streaming example
- Dataproc Java Bigtable sample
- Dataproc Spark-Bigtable samples
For more information, review the Dataproc documentation. You can also pose questions to the Stack Overflow community with the tag google-cloud-dataproc.
See our other Google Cloud Platform GitHub repositories for sample applications and scaffolding for other frameworks and use cases.
- See CONTRIBUTING.md
- See LICENSE