Description • Stack • Diagram
Create a process that provisions the infrastructure, then ingests, reads, and transforms files (to Parquet) in the Data Lake, and processes them there. All resources are created and destroyed by a Terraform pipeline.
- Data Lake on GCP Cloud Storage.
- Spark job (PySpark) on GCP Cloud Dataproc (a minimal job sketch follows this list).
- GCP BigQuery can be used to gain insights by querying the Data Lake's Parquet files (see the query sketch at the end of this section).
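For reference, the transform step could look like the PySpark sketch below: it reads raw CSV files from a landing bucket and writes Parquet back to the Data Lake. The bucket paths, column handling, and app name are illustrative assumptions, not the repository's actual job.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hypothetical bucket paths; replace with the buckets created by Terraform.
RAW_ZONE = "gs://my-datalake-raw/input/"
CURATED_ZONE = "gs://my-datalake-curated/output/"

spark = (
    SparkSession.builder
    .appName("datalake-transform")
    .getOrCreate()
)

# Read raw CSV files from the landing zone of the Data Lake.
df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv(RAW_ZONE)
)

# Example transformation: normalize column names and add an ingestion timestamp.
df = df.toDF(*[c.strip().lower().replace(" ", "_") for c in df.columns])
df = df.withColumn("ingested_at", F.current_timestamp())

# Write the result back to the Data Lake as Parquet.
df.write.mode("overwrite").parquet(CURATED_ZONE)

spark.stop()
```

On Dataproc, a script like this would be submitted with `gcloud dataproc jobs submit pyspark`, against the cluster and buckets provisioned by the Terraform pipeline.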
- Python
- PySpark
- Terraform
- Google Cloud Platform (Cloud Storage, Cloud Dataproc, BigQuery)
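To illustrate the BigQuery step, the sketch below uses the `google-cloud-bigquery` client to expose the Parquet files as an external table and query them. The project, dataset, table, and bucket names are placeholders, and it assumes the dataset already exists.

```python
from google.cloud import bigquery

# Placeholder identifiers; replace with your own project, dataset, and bucket.
PROJECT = "my-gcp-project"
TABLE_ID = f"{PROJECT}.datalake.curated_events"  # assumes dataset `datalake` exists
PARQUET_URIS = ["gs://my-datalake-curated/output/*.parquet"]

client = bigquery.Client(project=PROJECT)

# Define an external table backed by the Parquet files in Cloud Storage.
external_config = bigquery.ExternalConfig("PARQUET")
external_config.source_uris = PARQUET_URIS

table = bigquery.Table(TABLE_ID)
table.external_data_configuration = external_config
client.create_table(table, exists_ok=True)

# Query the Data Lake directly through the external table.
query = f"SELECT COUNT(*) AS n FROM `{TABLE_ID}`"
for row in client.query(query).result():
    print(f"rows in data lake table: {row.n}")
```

Because the table is external, BigQuery reads the Parquet files in place, so no data is copied out of the Data Lake.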