Week4Assignment.md


Part 1. Implement a data lake on GCP using Cloud Datastore, Cloud Dataproc, and BigQuery.

Created the cluster with the command below to connect Dataproc with Spark and BigQuery:

```bash
CLUSTER_NAME="clusterspark"
REGION="us-central1"

gcloud beta dataproc clusters create ${CLUSTER_NAME} \
  --region ${REGION} \
  --zone us-central1-a \
  --master-machine-type n1-standard-1 \
  --master-boot-disk-size 500 \
  --num-workers 2 \
  --worker-machine-type n1-standard-1 \
  --worker-boot-disk-size 500 \
  --image-version 1.3-debian10 \
  --project warm-skill-297413 \
  --optional-components=ANACONDA,JUPYTER \
  --enable-component-gateway \
  --metadata 'PIP_PACKAGES=google-cloud-bigquery google-cloud-storage' \
  --metadata gcs-connector-url=gs://path/to/custom/gcs/connector.jar \
  --metadata bigquery-connector-url=gs://path/to/custom/hadoop/bigquery/connector.jar \
  --metadata spark-bigquery-connector-url=gs://path/to/custom/spark/bigquery/connector.jar \
  --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/python/pip-install.sh \
  --properties "spark:spark.jars=gs://spark-lib/bigquery/spark-bigquery-latest.jar"
```
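Once the cluster is up, the Spark–BigQuery connection can be checked with a short PySpark job. This is a minimal sketch rather than part of the submitted assignment code; it reads a public BigQuery sample table through the spark-bigquery connector loaded via the `spark.jars` property above.

```python
# verify_bigquery_read.py -- minimal sketch; the table below is a public
# BigQuery sample dataset, not an assignment dataset.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bigquery-read-check").getOrCreate()

# Read through the spark-bigquery connector configured on the cluster.
df = (
    spark.read.format("bigquery")
    .option("table", "bigquery-public-data.samples.shakespeare")
    .load()
)

df.printSchema()
print(df.count())

spark.stop()
```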

Part 2. Try ELT processes (created Python scripts and submitted them as jobs in Dataproc) with all three types of storage learned in the week.
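As a rough illustration of the extract-and-load half of such an ELT job, the sketch below reads raw CSV data from Cloud Storage and loads it unchanged into a BigQuery staging table; the bucket and table names are placeholders, not the actual assignment resources.

```python
# elt_load_job.py -- sketch of the extract-and-load step of an ELT flow;
# bucket and table names are placeholders, not the assignment resources.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("elt-load").getOrCreate()

# Extract: read raw CSV dropped into Cloud Storage.
raw = spark.read.option("header", "true").csv("gs://your-bucket/raw/input.csv")

# Load: write the untransformed rows to a BigQuery staging table; the
# spark-bigquery connector stages data through a temporary GCS bucket.
(
    raw.write.format("bigquery")
    .option("table", "your_dataset.staging_table")
    .option("temporaryGcsBucket", "your-temp-bucket")
    .mode("overwrite")
    .save()
)

# Transform: in ELT the transformation then runs as SQL inside BigQuery,
# after the load, rather than in Spark.
spark.stop()
```

A script like this can be submitted to the cluster with `gcloud dataproc jobs submit pyspark elt_load_job.py --cluster=${CLUSTER_NAME} --region=${REGION}`.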

Part 3. Try with structured data (CSV, Parquet, ORC), unstructured data (text excerpts from articles, images), and semi-structured data (JSON, XML). Use PySpark for loading and querying.
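A sketch of how each class of data could be loaded and queried with PySpark follows; every `gs://` path is a placeholder, and the XML line assumes the external spark-xml package is available on the cluster.

```python
# format_loading_sketch.py -- illustrative only; every gs:// path is a placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("format-loading").getOrCreate()

# Structured data: schema is embedded (Parquet/ORC) or inferred (CSV).
csv_df = spark.read.option("header", "true").csv("gs://your-bucket/data/sample.csv")
parquet_df = spark.read.parquet("gs://your-bucket/data/sample.parquet")
orc_df = spark.read.orc("gs://your-bucket/data/sample.orc")

# Semi-structured data: nested, flexible schemas.
json_df = spark.read.json("gs://your-bucket/data/sample.json")
# XML needs the external spark-xml package on the cluster, e.g.:
# xml_df = (spark.read.format("xml").option("rowTag", "record")
#           .load("gs://your-bucket/data/sample.xml"))

# Unstructured data: plain text loads as one row per line.
text_df = spark.read.text("gs://your-bucket/data/article.txt")

# Querying: register a DataFrame as a temp view and use Spark SQL.
csv_df.createOrReplaceTempView("csv_table")
spark.sql("SELECT COUNT(*) AS row_count FROM csv_table").show()

spark.stop()
```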

Part 4. Write a Python script to read a CSV file and write it back to Cloud Storage in Parquet format using PySpark.

The script reads the CSV file from Google Cloud Storage and converts it to a Parquet file.
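A minimal sketch of what such a Part 4 script could look like (the bucket and object paths are placeholders, not the ones used in the assignment):

```python
# csv_to_parquet.py -- minimal sketch of the Part 4 script; the bucket and
# object names are placeholders, not the ones used in the assignment.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# Read the CSV file from Cloud Storage, inferring column types.
df = (
    spark.read.option("header", "true")
    .option("inferSchema", "true")
    .csv("gs://your-bucket/input/data.csv")
)

# Write it back to Cloud Storage as Parquet.
df.write.mode("overwrite").parquet("gs://your-bucket/output/data_parquet/")

spark.stop()
```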