Identify the public IP address from which you will administer the resources:
curl ifconfig.me
Create the .auto.tfvars file:
cp config/template.tfvars .auto.tfvars
Set the required variables:
subscription_id = "<subscriptionId>"
allowed_public_ips = ["<your ip>"]
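If you prefer to fill these in from the shell, here is a minimal sketch, assuming the Azure CLI is already logged in and ifconfig.me is reachable. It appends the values; if the copied template already contains placeholder lines for these variables, edit those lines instead:

```sh
# Append the required variables to .auto.tfvars using the active
# Azure CLI subscription and the machine's public IP.
echo "subscription_id = \"$(az account show --query id -o tsv)\"" >> .auto.tfvars
echo "allowed_public_ips = [\"$(curl -s ifconfig.me)\"]" >> .auto.tfvars
```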
Create the resources:
terraform init
terraform apply -auto-approve
Approve the managed private endpoints generated for the resources below; a CLI sketch for the approval follows the list.
- Data lake
- Synapse
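The approvals can also be done from the portal, but if you prefer the CLI, pending connections can be approved with az network private-endpoint-connection. This is a sketch: <resource-id> and <private-endpoint-connection-id> are placeholders for the target resource (the storage account or Synapse workspace) and the connection it lists, not values produced by the steps above:

```sh
# List pending private endpoint connections on the target resource.
az network private-endpoint-connection list --id <resource-id> -o table

# Approve one connection by its full connection ID.
az network private-endpoint-connection approve \
  --id <private-endpoint-connection-id> \
  --description "Approved for Synapse managed VNet"
```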
This project uses the NYC taxi dataset.
Create the data directory and download the database file:
mkdir nyctls
curl -L https://d37ci6vzurychx.cloudfront.net/trip-data/fhvhv_tripdata_2023-01.parquet -o nyctls/nyc-trip-records.parquet
Create the file system and upload the file, replacing the account-name option value with your storage account name:
az storage fs create --auth-mode login -n synapse --account-name <storage-name>
az storage blob upload --auth-mode login -f ./nyctls/nyc-trip-records.parquet -c synapse -n database/nyc-trip-records.parquet --account-name <storage-name>
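To confirm the upload landed where the later steps expect it, a quick check (same storage account placeholder):

```sh
# List blobs under the database/ prefix in the synapse container.
az storage blob list --auth-mode login -c synapse --prefix database/ \
  --account-name <storage-name> -o table
```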
Create a new Lake database:
- Name: Database1
- Linked service: The data lake storage
- Input folder: synapse/database
- Data format: Parquet
Create a new Table from the data lake (a verification query follows the list):
- External table name: nyc_taxi
- Linked service: The data lake storage
- Input file: synapse/database/nyc-trip-records.parquet
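Once the table exists, it can be sanity-checked through the workspace's serverless SQL endpoint. This is a sketch using sqlcmd: the -ondemand host name follows the standard Synapse serverless pattern, <workspace-name> is a placeholder, and the Azure AD authentication flags (-G) vary slightly between sqlcmd versions:

```sh
# Query the external table via the serverless SQL endpoint (Azure AD auth).
sqlcmd -S <workspace-name>-ondemand.sql.azuresynapse.net -d Database1 \
  -G -Q "SELECT TOP 10 * FROM dbo.nyc_taxi;"
```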
Upload the spark/synapse-transform.ipynb notebook to Synapse.
Connect to the Spark pool and run the notebook.
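The notebook upload can also be scripted with the Azure CLI's synapse commands; a sketch, where <workspace-name> and <pool-name> are placeholders for the workspace and Spark pool created earlier:

```sh
# Import the notebook into the workspace and attach it to the Spark pool.
az synapse notebook import --workspace-name <workspace-name> \
  --name synapse-transform \
  --file @spark/synapse-transform.ipynb \
  --spark-pool-name <pool-name>
```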
- Copy and transform data in Azure Synapse Analytics by using Azure Data Factory or Synapse pipelines
- Quickstart: Create a new lake database leveraging database templates
- Introduction to Microsoft Spark Utilities
- Quickstart: Create a serverless Apache Spark pool in Azure Synapse Analytics using web tools