Lab 2: Transform Big Data using Azure Data Factory Mapping Data Flows

In this lab you will use Azure Data Factory to download large data files to your data lake and use Mapping Data Flows to generate a summary dataset and store it. The dataset you will use contains detailed New York City Yellow Taxi rides for the first half of 2019. You will generate a daily aggregated summary of all rides using Mapping Data Flows and save the resulting dataset into your Azure Synapse Analytics data warehouse. You will then use Power BI to visualize the summarized taxi ride data.

The estimated time to complete this lab is: 60 minutes.

Microsoft Learn & Technical Documentation

The following Azure services will be used in this lab. If you need further training resources or access to technical documentation, the table below provides links to Microsoft Learn and to each service's technical documentation.

Azure Service | Microsoft Learn | Technical Documentation
Azure Data Lake Gen2 | Large Scale Data Processing with Azure Data Lake Storage Gen2 | Azure Data Lake Gen2 Technical Documentation
Azure Data Factory | Data ingestion with Azure Data Factory | Azure Data Factory Technical Documentation
Azure Synapse Analytics | Implement a Data Warehouse with Azure Synapse Analytics | Azure Synapse Analytics Technical Documentation

Lab Architecture


Step | Description
1 | Build an Azure Data Factory pipeline to copy big data files from shared Azure Storage
2 | Ingest data files into your data lake
3 | Use Mapping Data Flows to generate an aggregated daily summary and save the resulting dataset into your Azure Synapse Analytics data warehouse
4 | Visualize data from your Azure Synapse Analytics using Power BI

IMPORTANT: Some of the Azure services provisioned require a globally unique name, so a “-suffix” has been added to their names to ensure uniqueness. Please take note of the generated suffix as you will need it for the following resources in this lab:

Name | Type
SynapseDataFactory-suffix | Data Factory (V2)
synapsedatalakesuffix | Storage Account
synapsesql-suffix | SQL server

Create Azure Synapse Analytics database objects

In this section you will connect to Azure Synapse Analytics to create the data warehouse objects used to host and process data.

IMPORTANT
Execute these steps inside the ADPDesktop remote desktop connection
  1. Open Azure Data Studio.

  2. If you already have a connection to SynapseSQL endpoint, then go to step 6.

  3. On the Servers panel, click New Connection.

  4. On the Connection Details panel, enter the following connection details:
    - Server: synapsesql-suffix.database.windows.net
    - Authentication Type: SQL Login
    - User Name: adpadmin
    - Password: P@ssw0rd123!
    - Database: SynapseDW

  5. Click Connect.

  6. Right click the SynapseSQL endpoint name and then click New Query.

  7. Create two new round robin distributed tables named [NYC].[TaxiDataSummary] and [NYC].[TaxiLocationLookup]. Use the script below:

    create table [NYC].[TaxiDataSummary]
    (
        [PickUpDate] [date] NULL,
        [PickUpBorough] [varchar](200) NULL,
        [PickUpZone] [varchar](200) NULL,
        [PaymentType] [varchar](11) NULL,
        [TotalTripCount] [int] NULL,
        [TotalPassengerCount] [int] NULL,
        [TotalDistanceTravelled] [decimal](38, 2) NULL,
        [TotalTipAmount] [decimal](38, 2) NULL,
        [TotalFareAmount] [decimal](38, 2) NULL,
        [TotalTripAmount] [decimal](38, 2) NULL
    )
    with
    (
        distribution = round_robin,
        clustered columnstore index
    )
    
    go
    
    create table [NYC].[TaxiLocationLookup]
    (
        [LocationID] [int] NULL,
        [Borough] [varchar](200) NULL,
        [Zone] [varchar](200) NULL,
        [ServiceZone] [varchar](200) NULL
    )
    with
    (
        distribution = round_robin,
        clustered columnstore index
    )
    go
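
    Optionally, you can confirm both tables were created with the intended distribution by querying the dedicated SQL pool catalog views. The query below is a sketch for a quick sanity check only; it assumes the NYC schema already exists in SynapseDW:

    -- Both tables should report a ROUND_ROBIN distribution policy.
    select s.name as SchemaName
        , t.name as TableName
        , p.distribution_policy_desc as DistributionPolicy
    from sys.tables t
    join sys.schemas s on t.schema_id = s.schema_id
    join sys.pdw_table_distribution_properties p on t.object_id = p.object_id
    where s.name = 'NYC'

    go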

Create the NYCTaxiData-Raw Container on Azure Data Lake Storage Gen2

In this section you will create a container in your SynapseDataLake storage account that will be used as a repository for the NYC Taxi data files. You will copy 6 files from the MDWResources storage account into your NYCTaxiData-Raw container. These files contain data for all Yellow Taxi rides in the first half of 2019, one file for each month.

IMPORTANT
Execute these steps on your host computer
  1. In the Azure Portal, go to the lab resource group and locate the Azure Storage account synapsedatalakesuffix.

  2. On the Overview panel, click Containers.

  3. On the synapsedatalakesuffix – Containers blade, click + Container. On the New container blade, enter the following details:
    - Name: nyctaxidata-raw
    - Public access level: Private (no anonymous access)

  4. Click OK to create the new container.

Create Linked Service connection to MDWResources

In this section you will create a linked service connection to a shared storage account called MDWResources hosted in an external Azure subscription. This storage account hosts the NYC Taxi data files you will copy to your data lake. As this storage account sits in an external subscription, you will connect to it using a SAS URL token.

IMPORTANT
Execute these steps on your host computer
  1. Open the Azure Data Factory portal and click the Author option (pencil icon) on the left-hand side panel. Under Connections tab, click Linked Services and then click + New to create a new linked service connection.

  2. On the New Linked Service blade, type “Azure Blob Storage” in the search box to find the Azure Blob Storage linked service. Click Continue.

  3. On the New Linked Service (Azure Blob Storage) blade, enter the following details:
    - Name: MDWResources
    - Connect via integration runtime: AutoResolveIntegrationRuntime
    - Authentication method: SAS URI
    - SAS URL:

    https://mdwresources.blob.core.windows.net/?sv=2018-03-28&ss=b&srt=sco&sp=rwl&se=2050-12-30T17:25:52Z&st=2019-04-05T09:25:52Z&spr=https&sig=4qrD8NmhaSmRFu2gKja67ayohfIDEQH3LdVMa2Utykc%3D
    
  4. Click Test connection to make sure you entered the correct connection details and then click Finish.

Create Source and Destination Data Sets

In this section you are going to create six datasets that will be used by your data pipeline:

Dataset | Role | Description
MDWResources_NYCTaxiData_Binary | Source | References the MDWResources shared storage account container that holds the source NYC Taxi data files.
SynapseDataLake_NYCTaxiData_Binary | Destination | References your synapsedatalakesuffix storage account. It acts as the destination for the NYC Taxi data files copied from MDWResources_NYCTaxiData_Binary.
NYCDataSets_NYCTaxiLocationLookup | Source | References the [NYC].[TaxiLocationLookup] table on the NYCDataSets database. This table contains records for all taxi location codes and names.
SynapseDW_NYCTaxiLocationLookup | Destination | References the destination table [NYC].[TaxiLocationLookup] in the Azure Synapse Analytics data warehouse SynapseDW. It acts as the destination for lookup data copied from NYCDataSets_NYCTaxiLocationLookup.
SynapseDataLake_NYCTaxiData_CSV | Source | References the nyctaxidata-raw container in your synapsedatalakesuffix storage account. It functions as the data source for the Mapping Data Flow.
SynapseDW_NYCTaxiDataSummary | Destination | References the table [NYC].[TaxiDataSummary] in the Azure Synapse Analytics data warehouse. It acts as the destination for the summary data generated by your Mapping Data Flow.
IMPORTANT
Execute these steps on your host computer
  1. Open the Azure Data Factory portal and click the Author (pencil icon) option on the left-hand side panel. Under Factory Resources tab, click the ellipsis (…) next to Datasets and then click Add Dataset to create a new dataset.

  2. Type “Azure Blob Storage” in the search box and select Azure Blob Storage. Click Continue.

  3. On the Select Format blade, select Binary and click Continue.

  4. On the Set Properties blade, enter the following details:
    - Name: MDWResources_NYCTaxiData_Binary
    - Linked service: MDWResources
    - File Path: Container: nyctaxidata, Directory: [blank], File: [blank]

    Alternatively you can copy and paste the Dataset JSON definition below:

    {
        "name": "MDWResources_NYCTaxiData_Binary",
        "properties": {
            "linkedServiceName": {
                "referenceName": "MDWResources",
                "type": "LinkedServiceReference"
            },
            "folder": {
                "name": "Lab2"
            },
            "annotations": [],
            "type": "Binary",
            "typeProperties": {
                "location": {
                    "type": "AzureBlobStorageLocation",
                    "container": "nyctaxidata"
                }
            }
        }
    }
  5. Leave remaining fields with default values.

  6. Repeat the process to create another Azure Storage Binary dataset, this time referencing the nyctaxidata-raw container in your SynapseDataLake storage account. This dataset acts as the destination for the NYC Taxi data files you will copy from the previous dataset.

  7. Type “Azure Blob Storage” in the search box and select Azure Blob Storage. Click Continue.

  8. On the Select Format blade, select Binary and click Continue.

  9. On the Set Properties blade, enter the following details:
    - Name: SynapseDataLake_NYCTaxiData_Binary
    - Linked Service: SynapseDataLake
    - File Path: Container: nyctaxidata-raw, Directory: [blank], File: [blank]

    Click Continue.

    Alternatively you can copy and paste the Dataset JSON definition below:

    {
        "name": "SynapseDataLake_NYCTaxiData_Binary",
        "properties": {
            "linkedServiceName": {
                "referenceName": "SynapseDataLake",
                "type": "LinkedServiceReference"
            },
            "folder": {
                "name": "Lab2"
            },
            "annotations": [],
            "type": "Binary",
            "typeProperties": {
                "location": {
                    "type": "AzureBlobStorageLocation",
                    "container": "nyctaxidata-raw"
                }
            }
        }
    }
  10. Leave remaining fields with default values.

  11. Repeat the process to create a new Azure Storage CSV dataset referencing the nyctaxidata-raw container in your SynapseDataLake storage account. This dataset acts as the source of the NYC Taxi records (CSV) that the Mapping Data Flow will read, summarize, and load into your Azure Synapse Analytics.

    IMPORTANT: You will need to download the sample file from https://aka.ms/TaxiDataSampleFile to your Desktop. This file will be used to derive the schema for the dataset.

    This step is necessary because you will work with column names in the Mapping Data Flow, but at design time the data files are not yet in your data lake.
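
    Based on the dataset schema defined below, the first line of the sample file is expected to be a header row similar to the following (shown for reference only):

    VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge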

  12. Type “Azure Blob Storage” in the search box and select Azure Blob Storage. Click Continue.

  13. On the Select Format blade, select DelimitedText and click Continue.

  14. On the Set Properties blade, enter the following details:
    - Name: SynapseDataLake_NYCTaxiData_CSV
    - Linked Service: SynapseDataLake
    - File Path: Container: nyctaxidata-raw, Directory: [blank], File: [blank]
    - First row as header: Checked
    - Import schema: From sample file > [select the sample file you downloaded in step 11]

    Click Continue.

    Alternatively you can copy and paste the Dataset JSON definition below:

    {
        "name": "SynapseDataLake_NYCTaxiData_CSV",
        "properties": {
            "linkedServiceName": {
                "referenceName": "SynapseDataLake",
                "type": "LinkedServiceReference"
            },
            "annotations": [],
            "type": "DelimitedText",
            "typeProperties": {
                "location": {
                    "type": "AzureBlobStorageLocation",
                    "container": "nyctaxidata-raw"
                },
                "columnDelimiter": ",",
                "escapeChar": "\\",
                "firstRowAsHeader": true,
                "quoteChar": "\""
            },
            "schema": [
                {
                    "name": "VendorID",
                    "type": "String"
                },
                {
                    "name": "tpep_pickup_datetime",
                    "type": "String"
                },
                {
                    "name": "tpep_dropoff_datetime",
                    "type": "String"
                },
                {
                    "name": "passenger_count",
                    "type": "String"
                },
                {
                    "name": "trip_distance",
                    "type": "String"
                },
                {
                    "name": "RatecodeID",
                    "type": "String"
                },
                {
                    "name": "store_and_fwd_flag",
                    "type": "String"
                },
                {
                    "name": "PULocationID",
                    "type": "String"
                },
                {
                    "name": "DOLocationID",
                    "type": "String"
                },
                {
                    "name": "payment_type",
                    "type": "String"
                },
                {
                    "name": "fare_amount",
                    "type": "String"
                },
                {
                    "name": "extra",
                    "type": "String"
                },
                {
                    "name": "mta_tax",
                    "type": "String"
                },
                {
                    "name": "tip_amount",
                    "type": "String"
                },
                {
                    "name": "tolls_amount",
                    "type": "String"
                },
                {
                    "name": "improvement_surcharge",
                    "type": "String"
                },
                {
                    "name": "total_amount",
                    "type": "String"
                },
                {
                    "name": "congestion_surcharge",
                    "type": "String"
                }
            ]
        }
    }
  15. Repeat the process to create an Azure SQL Database dataset. It references the NYC.TaxiLocationLookup table in the NYCDataSets database.

  16. Type “Azure SQL Database” in the search box and select Azure SQL Database. Click Continue.

  17. On the Set Properties blade, enter the following details:
    - Name: NYCDataSets_NYCTaxiLocationLookup
    - Linked Service: OperationalSQL_NYCDataSets
    - Table name: NYC.TaxiLocationLookup
    - Import schema: None.

  18. Leave remaining fields with default values.

    Alternatively you can copy and paste the Dataset JSON definition below:

    {
        "name": "NYCDataSets_NYCTaxiLocationLookup",
        "properties": {
            "linkedServiceName": {
                "referenceName": "OperationalSQL_NYCDataSets",
                "type": "LinkedServiceReference"
            },
            "folder": {
                "name": "Lab2"
            },
            "annotations": [],
            "type": "AzureSqlTable",
            "schema": [],
            "typeProperties": {
                "schema": "NYC",
                "table": "TaxiLocationLookup"
            }
        }
    }
  19. Repeat the process to create another dataset, this time referencing the NYC.TaxiDataSummary table in your Azure Synapse Analytics database.

  20. Type “Azure Synapse Analytics” in the search box and select Azure Synapse Analytics. Click Continue.

  21. On the Set Properties blade, enter the following details:
    - Name: SynapseDW_NYCTaxiDataSummary
    - Linked Service: SynapseSQL_SynapseDW
    - Table: [NYC].[TaxiDataSummary]
    - Import schema: From connection/store

    Alternatively you can copy and paste the Dataset JSON definition below:

    {
        "name": "SynapseDW_NYCTaxiDataSummary",
        "properties": {
            "linkedServiceName": {
                "referenceName": "SynapseSQL_SynapseDW",
                "type": "LinkedServiceReference"
            },
            "folder": {
                "name": "Lab2"
            },
            "annotations": [],
            "type": "AzureSqlDWTable",
            "schema": [
                {
                    "name": "PickUpDate",
                    "type": "date"
                },
                {
                    "name": "PickUpBorough",
                    "type": "varchar"
                },
                {
                    "name": "PickUpZone",
                    "type": "varchar"
                },
                {
                    "name": "PaymentType",
                    "type": "varchar"
                },
                {
                    "name": "TotalTripCount",
                    "type": "int",
                    "precision": 10
                },
                {
                    "name": "TotalPassengerCount",
                    "type": "int",
                    "precision": 10
                },
                {
                    "name": "TotalDistanceTravelled",
                    "type": "decimal",
                    "precision": 38,
                    "scale": 2
                },
                {
                    "name": "TotalTipAmount",
                    "type": "decimal",
                    "precision": 38,
                    "scale": 2
                },
                {
                    "name": "TotalFareAmount",
                    "type": "decimal",
                    "precision": 38,
                    "scale": 2
                },
                {
                    "name": "TotalTripAmount",
                    "type": "decimal",
                    "precision": 38,
                    "scale": 2
                }
            ],
            "typeProperties": {
                "schema": "NYC",
                "table": "TaxiDataSummary"
            }
        }
    }
  22. Leave remaining fields with default values.

  23. Repeat the process to create another dataset, this time referencing the [NYC].[TaxiLocationLookup] table in your Azure Synapse Analytics database.

  24. Type “Azure Synapse Analytics” in the search box and select Azure Synapse Analytics. Click Continue.

  25. On the Set Properties blade, enter the following details:
    - Name: SynapseDW_NYCTaxiLocationLookup
    - Linked Service: SynapseSQL_SynapseDW
    - Table: [NYC].[TaxiLocationLookup]
    - Import schema: From connection/store

    Alternatively you can copy and paste the Dataset JSON definition below:

    {
        "name": "SynapseDW_NYCTaxiLocationLookup",
        "properties": {
            "linkedServiceName": {
                "referenceName": "SynapseSQL_SynapseDW",
                "type": "LinkedServiceReference"
            },
            "folder": {
                "name": "Lab2"
            },
            "annotations": [],
            "type": "AzureSqlDWTable",
            "schema": [
                {
                    "name": "LocationID",
                    "type": "int",
                    "precision": 10
                },
                {
                    "name": "Borough",
                    "type": "varchar"
                },
                {
                    "name": "Zone",
                    "type": "varchar"
                },
                {
                    "name": "service_zone",
                    "type": "varchar"
                }
            ],
            "typeProperties": {
                "schema": "NYC",
                "table": "TaxiLocationLookup"
            }
        }
    }
  26. Leave remaining fields with default values.

  27. Under the Factory Resources tab, click the ellipsis (…) next to Datasets and then click New folder to create a new folder. Name it Lab2.

  28. Drag the previously created datasets into the Lab2 folder you just created.

  29. Publish your dataset changes by clicking the Publish all button.

Create a Mapping Data Flow Integration Runtime

In this section you are going to create an integration runtime for Mapping Data Flow executions. Mapping Data Flows are executed as Spark jobs and by default a new Spark cluster will be provisioned for every execution. By creating a custom integration runtime you have the option to set the compute configuration for your Spark cluster. You can also specify a TTL (time-to-live) setting that will keep the cluster active for a period of time for faster subsequent executions.

IMPORTANT
Execute these steps on your host computer
  1. Open the Azure Data Factory portal and click the Author option (pencil icon) on the left-hand side panel. Under the Connections tab, click Integration runtimes and then click + New to create a new integration runtime.

  2. On the Integration runtime setup blade, select Perform data movement and dispatch to external computes and click Continue.

  3. On the next page of the Integration runtime setup blade, select Azure as the network environment and click Continue.

  4. On the next page Integration runtime setup blade enter the following details:
    - Name: MappingDataFlowsIR
    - Region: Auto Resolve
    - Data Flow runtime > Compute type: General Purpose
    - Data Flow runtime > Core count: 8 (+ 8 Driver cores)
    - Data Flow runtime > Time to live: 10 minutes

  5. Click Create to create the integration runtime.

Create a Mapping Data Flow

In this section you are going to create a Mapping Data Flow that will transform the detailed taxi records into an aggregated daily summary. The Mapping Data Flow will read all records from the files stored in your SynapseDataLake account and apply a sequence of transformations before saving the aggregated summary into the NYC.TaxiDataSummary table in your Azure Synapse Analytics.

IMPORTANT
Execute these steps on your host computer
  1. Under Factory Resources tab, click the ellipsis (…) next to Data Flows and then click New data flow to create a new Data Flow.

  2. In the New Data Flow blade, select Mapping Data Flow and click OK.

    Note that a new tab will open with the design surface for the Mapping Data Flow.

  3. On the Data Flow properties, enter the following details:
    - General > Name: TransformNYCTaxiData

  4. On the design surface click Add Source. On the source properties enter the following details:
    - Source Settings > Output stream name: TaxiDataFiles
    - Source Settings > Source dataset: SynapseDataLake_NYCTaxiData_CSV

  5. Repeat the process above and add another data source. On the source properties enter the following details:
    - Source Settings > Output stream name: TaxiLocationLookup
    - Source Settings > Source dataset: SynapseDW_NYCTaxiLocationLookup

  6. Click on the + sign next to the TaxiDataFiles source and type "Derived Column" in the search box. Click the Derived Column schema modifier.

  7. On the Derived Column's properties, enter the following details:
    - Derived column's settings > Output stream name: TransformColumns

  8. Still on the Derived column's settings, under the Columns option add the following column name and expression:

    • Name: PaymentType

    Click the Enter expression... text box and enter the following expression for the "PaymentType" field:

    case (payment_type == '1', 'Credit card'
        , payment_type == '2', 'Cash'
        , payment_type == '3', 'No charge'
        , payment_type == '4', 'Dispute'
        , payment_type == '5', 'Unknown'
        , payment_type == '6', 'Voided trip')
    

    Click the Save and Finish button to return to the column list.

    Click the "+" sign next to the expression for PaymentType to add a new derived column. Click Add column from the menu.

    Repeat the process to create the following derived columns using the names and expressions below:

    • PickUpDate
    toDate(tpep_pickup_datetime,'yyyy-MM-dd')
    
    • PickUpLocationID
    toInteger(PULocationID)
    
    • PassengerCount
    toInteger(passenger_count)
    
    • DistanceTravelled
    toDecimal(trip_distance)
    
    • TipAmount
    toDecimal(tip_amount)
    
    • FareAmount
    toDecimal(fare_amount)
    
    • TotalAmount
    toDecimal(total_amount)
    

    Your full list of derived columns should look like this:

  9. Click on the + sign next to the TransformColumn transformation and type "Join" in the search box. Click the Join transformation.

  10. On the Join properties, enter the following details:
    - Join Settings > Output stream name: JoinPickUpLocation
    - Join Settings > Left stream: TransformColumns
    - Join Settings > Right stream: TaxiLocationLookup
    - Join Settings > Join type: Inner
    - Join Settings > Join conditions > Left: PickUpLocationID
    - Join Settings > Join conditions > Right: LocationID

  11. Click on the + sign next to the JoinPickUpLocation transformation and type "Aggregate" in the search box. Click the Aggregate schema modifier.

  12. On the Aggregate properties, enter the following details:
    - Aggregate Settings > Output stream name: AggregateDailySummary
    - Aggregate Settings > Group by: Select the following columns:

    • PickUpDate
    • PaymentType
    • Borough
    • Zone


    - Aggregate Settings > Aggregates: Add the following columns and expressions:

    • TotalTripCount
    count()
    
    • TotalPassengerCount
    sum(PassengerCount)
    
    • TotalDistanceTravelled
    sum(DistanceTravelled)
    
    • TotalTipAmount
    sum(TipAmount)
    
    • TotalFareAmount
    sum(FareAmount)
    
    • TotalTripAmount
    sum(TotalAmount)
    

    Your full list of aggregates should look like this:

  13. Click on the + sign next to the AggregateDailySummary transformation and type "Select" in the search box. Click the Select transformation.

  14. On the Select properties, enter the following details:
    - Select Settings > Output stream name: RenameColumns
    - Select Settings > Input columns: Rename the following columns:

    • Borough to PickUpBorough
    • Zone to PickUpZone

    Leave all other columns with their default values.

  15. Click on the + sign next to the RenameColumns transformation and type "Sink" in the search box. Click the Sink destination.

  16. On the Sink properties, enter the following details:
    - Sink > Output stream name: TaxiDataSummary
    - Sink > Sink dataset: SynapseDW_NYCTaxiDataSummary
    - Settings > Table action: Truncate table
    - Settings > Enable staging: Checked

  17. Save and Publish your Data Flow. Your full data flow should look like this:
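
    For reference, the transformation you just built (derived columns, join, aggregate, rename) is logically equivalent to the T-SQL query below. This is a sketch only: it assumes the raw CSV rows were loaded into a hypothetical staging table named [NYC].[TaxiDataStaging], whereas the lab actually runs the transformation in Spark through the Mapping Data Flow:

    -- Sketch: daily ride summary equivalent to the TransformNYCTaxiData data flow.
    -- [NYC].[TaxiDataStaging] is hypothetical; it stands in for the raw CSV files.
    select cast(t.tpep_pickup_datetime as date)         as PickUpDate
        , l.Borough                                     as PickUpBorough
        , l.[Zone]                                      as PickUpZone
        , case t.payment_type
              when '1' then 'Credit card'
              when '2' then 'Cash'
              when '3' then 'No charge'
              when '4' then 'Dispute'
              when '5' then 'Unknown'
              when '6' then 'Voided trip'
          end                                           as PaymentType
        , count(*)                                      as TotalTripCount
        , sum(cast(t.passenger_count as int))           as TotalPassengerCount
        , sum(cast(t.trip_distance as decimal(38, 2)))  as TotalDistanceTravelled
        , sum(cast(t.tip_amount as decimal(38, 2)))     as TotalTipAmount
        , sum(cast(t.fare_amount as decimal(38, 2)))    as TotalFareAmount
        , sum(cast(t.total_amount as decimal(38, 2)))   as TotalTripAmount
    from [NYC].[TaxiDataStaging] t
    inner join [NYC].[TaxiLocationLookup] l
        on cast(t.PULocationID as int) = l.LocationID
    group by cast(t.tpep_pickup_datetime as date)
        , l.Borough
        , l.[Zone]
        , t.payment_type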

Create and Execute Pipeline

In this section you create a data factory pipeline to copy and transform data in the following sequence:

  • Copy NYC Taxi CSV data files from the shared storage account MDWResources to the nyctaxidata-raw container in your synapsedatalakesuffix storage account;

  • Copy NYC Taxi location lookup data from the NYCDataSets database directly into the NYC.TaxiLocationLookup table in your Azure Synapse Analytics;

  • Use a Mapping Data Flow to transform the source data and generate a daily summary of taxi rides. The resulting dataset will be saved in the NYC.TaxiDataSummary table in your Azure Synapse Analytics. This table is then used as a source for the Power BI report.

IMPORTANT
Execute these steps on your host computer
  1. Open the Azure Data Factory portal and click the Author (pencil icon) option on the left-hand side panel. Under the Factory Resources tab, click the ellipsis (…) next to Pipelines and then click Add Pipeline to create a new pipeline.

  2. On the New Pipeline tab, enter the following details:
    - General > Name: Lab 2 - Transform NYC Taxi Data

  3. Leave remaining fields with default values.

  4. From the Activities panel, type “Copy Data” in the search box. Drag the Copy Data activity onto the design surface. This copy activity will copy data files from MDWResources to SynapseDataLake.

  5. Select the Copy Data activity and enter the following details:
    - General > Name: Copy Taxi Data Files
    - Source > Source dataset: MDWResources_NYCTaxiData_Binary
    - Sink > Sink dataset: SynapseDataLake_NYCTaxiData_Binary
    - Sink > Copy Behavior: Preserve Hierarchy

  6. Leave remaining fields with default values.

  7. Repeat the process to create another Copy Data activity, this time to copy taxi location lookup data from the NYCDataSets database to your Azure Synapse Analytics data warehouse.

  8. From the Activities panel, type “Copy Data” in the search box. Drag the Copy Data activity onto the design surface.

  9. Select the Copy Data activity and enter the following details:
    - General > Name: Copy Taxi Location Lookup
    - Source > Source dataset: NYCDataSets_NYCTaxiLocationLookup
    - Sink > Sink dataset: SynapseDW_NYCTaxiLocationLookup
    - Sink > Pre-copy script:

    truncate table NYC.TaxiLocationLookup


    - Settings > Enable staging: Checked
    - Settings > Staging account linked service: SynapseDataLake
    - Settings > Storage Path: polybase

  10. Leave remaining fields with default values.

  11. From the Activities panel, type “Data Flow” in the search box. Drag the Data Flow activity onto the design surface.

  12. On the Adding Data Flow blade, select Use existing Data Flow. In the Existing Data Flow drop-down list, select TransformNYCTaxiData. Click OK.

  13. On the Data Flow activity properties, enter the following details:
    - General > Name: Transform NYC Taxi Data
    - Settings > Run on (Azure IR): MappingDataFlowsIR
    - Settings > Polybase > Staging linked service: SynapseDataLake
    - Settings > Polybase > Staging storage folder: polybase / [blank]

  14. Create two Success (green) precedence constraints: one between Copy Taxi Data Files and Transform NYC Taxi Data, and one between Copy Taxi Location Lookup and Transform NYC Taxi Data. You can do this by dragging the green square from one activity onto the next one.

  15. Publish your pipeline changes by clicking the Publish all button.

  16. To execute the pipeline, click the Add trigger menu and then select Trigger Now.

  17. On the Pipeline Run blade, click Finish.

  18. To monitor the execution of your pipeline, click on the Monitor menu on the left-hand side panel.

  19. You should be able to see the Status of your pipeline execution on the right-hand side panel.

  20. Click the View Activity Runs button for detailed information about each activity execution in the pipeline. The first execution should take between 9 and 12 minutes because of the Spark cluster start-up time. Subsequent executions should be faster, provided they run within the TTL configured for the integration runtime.
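
    Once the pipeline succeeds, you can optionally verify the results from Azure Data Studio with a quick row-count check. This is a sketch only; exact counts will vary with the source data:

    -- The summary table should hold one row per day/borough/zone/payment type.
    select count(*) as SummaryRows
        , min(PickUpDate) as FirstDay
        , max(PickUpDate) as LastDay
    from [NYC].[TaxiDataSummary]

    go

    -- The lookup table should hold one row per taxi zone.
    select count(*) as LookupRows
    from [NYC].[TaxiLocationLookup]

    go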

Visualize Data with Power BI

In this section you are going to use Power BI to visualize data from your Azure Synapse Analytics. The Power BI report uses an Import connection to query Azure Synapse Analytics and visualize the NYC Taxi ride summary data from the table you loaded in the previous exercise.

IMPORTANT
Execute these steps inside the ADPDesktop remote desktop connection
  1. On ADPDesktop, download the Power BI report from https://aka.ms/ADPLab2 and save it to the Desktop.

  2. Open the file ADPLab2.pbit with Power BI Desktop.

  3. When prompted to enter the value of the SynapseSQLEndpoint parameter, type the full server name: synapsesql-suffix.database.windows.net

  4. Click Load.

  5. When prompted to enter credentials, select Database from the left-hand side panel and enter the following details:
    - User name: adpadmin
    - Password: P@ssw0rd123!

  6. Leave remaining fields with their default values.

  7. Click Connect.

  8. Once the data finishes loading, interact with the report by changing the PickUpDate slicer and by clicking the other visualizations.

  9. Save your work and close Power BI Desktop.