
ADF ETL

Identify the public IP address from which you will administer the resources:

curl ifconfig.me

Create the .auto.tfvars file:

cp config/template.tfvars .auto.tfvars

Set the required variables:

subscription_id    = "<subscriptionId>"
allowed_public_ips = ["<your ip>"]
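
If you prefer to generate these values instead of typing them, here is a small sketch (it assumes the Azure CLI is already logged in to the target subscription):

cat > .auto.tfvars <<EOF
subscription_id    = "$(az account show --query id -o tsv)"
allowed_public_ips = ["$(curl -s ifconfig.me)"]
EOF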

Create the resources:

terraform init
terraform apply -auto-approve
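
To confirm what was created, you can inspect the Terraform state, or list the deployed resources (the resource group name below is a placeholder):

terraform state list
az resource list --resource-group <resource-group> -o table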

Private Endpoint

Approve the managed private endpoints generated for the following resources (a CLI sketch follows the list):

  • Data lake
  • Synapse
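
Approval is usually done in the Azure portal, but it can be scripted as well. A minimal sketch using the Azure CLI (resource group, resource name, and connection ID are placeholders; the --type value shown applies to the storage account):

# List pending private endpoint connections on the data lake storage account
az network private-endpoint-connection list -g <resource-group> -n <storage-name> --type Microsoft.Storage/storageAccounts

# Approve a connection by its resource ID
az network private-endpoint-connection approve --id <connection-id> --description "Approved"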

Data set

This project uses the NYC TLC trip record dataset, specifically the high-volume for-hire vehicle (FHVHV) data for January 2023.

Create the data directory and download the Parquet file:

mkdir nyctls
curl -L https://d37ci6vzurychx.cloudfront.net/trip-data/fhvhv_tripdata_2023-01.parquet -o nyctls/nyc-trip-records.parquet
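
As a quick sanity check, Parquet files start with the magic bytes PAR1:

ls -lh nyctls/nyc-trip-records.parquet
head -c 4 nyctls/nyc-trip-records.parquet   # prints "PAR1" for a valid Parquet file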

Create the file system and upload the file, replacing the --account-name option value with your storage account name:

az storage fs create --auth-mode login -n synapse --account-name <storage-name>
az storage blob upload --auth-mode login -f ./nyctls/nyc-trip-records.parquet -c synapse -n database/nyc-trip-records.parquet --account-name <storage-name>
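
Optionally, verify that the file landed in the expected path:

az storage blob list --auth-mode login -c synapse --prefix database/ --account-name <storage-name> -o table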

Synapse

Lake database

Create a new Lake database:

  • Name: Database1
  • Linked service: The data lake storage
  • Input folder: synapse/database
  • Data format: Parquet

Create a new Table from the data lake:

  • External table name: nyc_taxi
  • Linked service: The data lake storage
  • Input file: synapse/database/nyc-trip-records.parquet
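
Once created, the table can be queried through the workspace's serverless SQL endpoint, for example with sqlcmd and Azure AD authentication (the workspace name is a placeholder):

sqlcmd -S <workspace-name>-ondemand.sql.azuresynapse.net -d Database1 -G -Q "SELECT TOP 10 * FROM dbo.nyc_taxi;"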

Spark

Upload the spark/synapse-transform.ipynb notebook to Synapse.
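
If you prefer the CLI over the Studio UI, a sketch of the same import (workspace and Spark pool names are placeholders):

az synapse notebook import --workspace-name <workspace-name> --name synapse-transform --file @"spark/synapse-transform.ipynb" --spark-pool-name <spark-pool>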

Connect to the Spark pool and run the notebook.
