This repo describes how to use Azure Data Lake Storage (ADLS) Gen2 as external storage with Azure Databricks, and contains automation scripts.
There are currently four options for connecting from Databricks to ADLS Gen2:
- Using the ADLS Gen2 storage account access key directly
- Using a service principal directly (OAuth 2.0)
- Mounting an ADLS Gen2 filesystem to DBFS using a service principal (OAuth 2.0)
- Azure Active Directory (AAD) credential passthrough
We will focus on authenticating to ADLS Gen2 storage from Azure Databricks clusters.
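As an illustration of the first option, direct access with the storage account access key needs only a single Spark configuration setting. This is a minimal sketch: `<storage-account-name>`, `<file-system-name>`, `<scope-name>`, and `<storage-account-access-key-name>` are placeholders, and it assumes the access key is kept in a Databricks secret scope rather than in plain text.

```python
# Placeholders: <storage-account-name>, <scope-name>, <storage-account-access-key-name>.
spark.conf.set(
    "fs.azure.account.key.<storage-account-name>.dfs.core.windows.net",
    dbutils.secrets.get(scope="<scope-name>", key="<storage-account-access-key-name>"),
)

# List the root of the file system to confirm access.
dbutils.fs.ls("abfss://<file-system-name>@<storage-account-name>.dfs.core.windows.net/")
```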
Create and initialize an ADLS Gen2 file system with the hierarchical namespace enabled.
spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "true")
dbutils.fs.ls("abfss://@.dfs.core.windows.net/")
spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "false")
Important
- When the hierarchical namespace is enabled for an Azure Data Lake Storage Gen2 account, you do not need to create any Blob containers through the Azure portal.
- When the hierarchical namespace is enabled, Azure Blob storage APIs are not available (see the ADLS Gen2 known issues documentation). For example, you cannot use the wasb or wasbs scheme to access the blob.core.windows.net endpoint.
- If you enable the hierarchical namespace, there is no interoperability of data or operations between the Azure Blob storage and Azure Data Lake Storage Gen2 REST APIs.
Enable Azure Data Lake Storage credential passthrough for a high-concurrency cluster
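Credential passthrough is enabled on the cluster itself rather than in notebook code: in the UI, create a High Concurrency cluster and select "Enable credential passthrough for user-level data access" under Advanced Options. The sketch below does the same through the Databricks Clusters REST API; the workspace URL, personal access token, runtime version, and node type are placeholder assumptions, and the exact Spark conf keys should be verified against the Databricks documentation for your runtime.

```python
import requests

# Placeholders: replace with your workspace URL and a personal access token.
WORKSPACE_URL = "https://<databricks-instance>.azuredatabricks.net"
TOKEN = "<personal-access-token>"

cluster_spec = {
    "cluster_name": "passthrough-high-concurrency",
    "spark_version": "<runtime-version>",   # placeholder Databricks runtime
    "node_type_id": "<node-type-id>",       # placeholder VM size, e.g. Standard_DS3_v2
    "num_workers": 2,
    "spark_conf": {
        # High Concurrency profile restricted to Python and SQL, with passthrough on.
        "spark.databricks.cluster.profile": "serverless",
        "spark.databricks.repl.allowedLanguages": "python,sql",
        "spark.databricks.passthrough.enabled": "true",
    },
}

resp = requests.post(
    f"{WORKSPACE_URL}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
resp.raise_for_status()
print(resp.json())  # response contains the new cluster_id
```

Once such a cluster is running, each user's reads are executed with that user's own AAD identity, for example: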
```python
spark.read.csv("abfss://<file-system-name>@<storage-account-name>.dfs.core.windows.net/MyData.csv").collect()
```
Enable Azure Data Lake Storage credential passthrough for a standard cluster
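A standard cluster supports credential passthrough only for a single designated user. Below is a minimal sketch under the same assumptions as the previous example (Clusters REST API, placeholder workspace and node values), where `single_user_name` is assumed to name the AAD user allowed to run commands on the cluster; check the Databricks documentation for the exact fields supported by your workspace.

```python
import requests

WORKSPACE_URL = "https://<databricks-instance>.azuredatabricks.net"  # placeholder
TOKEN = "<personal-access-token>"                                     # placeholder

cluster_spec = {
    "cluster_name": "passthrough-standard",
    "spark_version": "<runtime-version>",  # placeholder Databricks runtime
    "node_type_id": "<node-type-id>",      # placeholder VM size
    "num_workers": 1,
    # Only this AAD user may attach to and run commands on the cluster.
    "single_user_name": "<user>@<your-domain>",
    "spark_conf": {
        "spark.databricks.passthrough.enabled": "true",
    },
}

resp = requests.post(
    f"{WORKSPACE_URL}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
resp.raise_for_status()
```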