-
Notifications
You must be signed in to change notification settings - Fork 121
Configuring Power BI Direct Query to Azure Cosmos DB via Apache Spark (HDI)
A powerfully fun way to visualize your data in Azure Cosmos DB is to use Power BI. While there is an ODBC Driver (refer to Connect to Azure Cosmos DB using BI analytics tools with the ODBC driver), this method requires you to download all of the data from Azure Cosmos DB into Power BI.
To workaround this issue, one technique is to use the azure-cosmosdb-spark
connector which allows you to use Apache Spark as the bridge between Power BI and Azure Cosmos DB. Power BI has direct query capabilities to Apache Spark and with the azure-cosmosdb-spark
connector, you can create direct connectivity from Power BI to Azure Cosmos DB.
Note, these are alpha working instructions and we will over time simplify how to do this so it will be easier for you to configure this.
You will need the following components
- Power BI Desktop,
- an Apache Spark service such as Azure HDInsight Spark, and
- an Azure Cosmos DB subscription.
The key configuration here is the ability to copy the azure-cosmosdb-spark
JARs to the worker nodes on your HDI cluster.
Note, these instructions are necessary as we have not yet developed a
script action
to automatically copy the JARs to the head and worker nodes. We have the item Create script action to copy the azure-cosmosdb-spark JARs to all HDI head and worker nodes #79 that once addressed, will minimize the instructions below.
To get the jars, please build the code using mvn clean package
or you can download them from the releases folder. As of this writing, the latest version of the JARS can be found in azure-cosmosdb-spark_2.1.0_2.11-1.0.0.
Grab these JARs and be prepared to upload them to your HDI cluster worker nodes.
The goal here is to copy the azure-cosmosdb-spark
JARS to the `/usr/hdp/current/spark2-client/jars on your worker and head nodes of your cluster.
To get this information, you will need to log into your Azure HDI cluster and copy down a list of your head and worker nodes. To do this, first you will log into the Azure Portal and connect to your HDI cluster such as the image below.
You will need to click on Cluster Dashboard.
From here, you click on HDInsight Cluster Dashboard.
Then click on Hosts and you see the list of head nodes (prefix of hn) and worker nodes (prefix of wn).
Record this list of nodes as you will need to copy the JARs into the each one of these nodes.
Note, configuring your HDI cluster using a private/public key will make things easier as you will be able to
scp
the files without specifying a password
You will run the scp
command below from your own box and copy the files from your own box to the active head node. The command should look something like this:
scp azure-cosmosdb-spark-0.0.3-SNAPSHOT.jar sshuser@hn1-sparki.blah.internal.cloudapp.net:~/
**IMPORTANT: Do this for all for all the JAR except the uber-jar OR just copy the uber-jar.
You will run this command from the after you ssh
into the active head node.
The command should look something this:
scp azure-cosmosdb-spark-0.0.3-SNAPSHOT.jar sshuser@hn1-sparki.blah.gx.internal.cloudapp.net:~/
scp azure-cosmosdb-spark-0.0.3-SNAPSHOT.jar sshuser@wn10-sparki.blah.gx.internal.cloudapp.net:~/
scp azure-cosmosdb-spark-0.0.3-SNAPSHOT.jar sshuser@wn12-sparki.blah.gx.internal.cloudapp.net:~/
scp azure-cosmosdb-spark-0.0.3-SNAPSHOT.jar sshuser@wn13-sparki.blah.gx.internal.cloudapp.net:~/
scp azure-cosmosdb-spark-0.0.3-SNAPSHOT.jar sshuser@wn14-sparki.blah.gx.internal.cloudapp.net:~/
Do this for all JARs for all nodes.
This next step allows you to copy the JARs from the ~/
folder to /usr/hdp/current/spark2-client/jars
. Note, you will need to ssh
into each node to perform this command. The command should look something like this:
sudo cp azure-documentdb-spark-0.0.3-SNAPSHOT.jar /usr/hdp/current/spark2-client/jars
Remember to do this for all nodes of your HDI cluster.
Now that all the JARs are copied to the nodes, you will need to restart your Spark service. To do this, go to Azure Portal > HDI Cluster > Cluster Dashboard > HDI cluster dashboard > Services > Spark2. Then click on Service Actions > Restart All.
Now that your HDI cluster can automatically connect to your Azure Cosmos DB cluster, you will need to build a table - the source of which is a Cosmos DB query. This way from the Power BI perspective, it is only connecting to a Spark SQL table (not realizing that it is a Cosmos DB data source underneath).
The easiest way to do this is to ssh
into your active head node and then run spark-shell, e.g.
spark-shell --master yarn
Then you can run a command to create an external table such as the one below:
spark.sql("create table flights using com.microsoft.azure.cosmosdb.spark options (endpoint 'https://doctorwho.documents.azure.com:443/', database 'DepartureDelays', collection 'flights_pcoll', masterkey '$masterkey')")
Thus if you run the show tables
command, it should look something like this:
spark.sql("show tables").show()
+---------------+-----------+
| tableName|isTemporary|
+---------------+-----------+
| flights| false|
|hivesampletable| false|
+---------------+-----------+
And now you can test it by running Spark SQL queries against it, such as:
-- Test the table works
spark.sql("select * from flights limit 10").show()
spark.sql("select date, delay, distance, origin, destination from flights limit 10").show()
-- Test a group by statement
spark.sql("select destination, count(1) from flights where origin = 'SEA' group by destination").show()
Now, you can connect Power BI to Spark as noted in the instructions Apache Spark BI using data visualization tools with Azure HDInsight.
The only difference is now you will connect to a table (such as flights
above) where its underlying source is an Azure Cosmos DB query.
Based on the initial tests, here are some quick tips:
- Because Power BI can ask a lot of queries (especially if you have a lot of widgets), you may want to shrink the scope of the query by creating a view. For example, if I only care about flights departing from Seattle, I can create the view below that limits Power BI to only request for flights departing Seattle. You're reducing the amount of data transferred as well as taking advantage of Cosmos DB indexing at the same time.
spark.sql("CREATE VIEW originSEA AS SELECT date, distance, delay, origin, destination FROM flights WHERE origin = 'SEA'")
- Due to the spikiness of Power BI queries, you may want to enable Azure Cosmos DB RU per minute pricing.
As noted, this is an early preview and we will be updating this document as we optimize the query performance as well as the installation process. Meanwhile, please do not hesitate to submit issues, suggest any PRs, or contact us at AskCosmosDB@microsoft.com.