Spark SQL backend (to support Elasticsearch, Cassandra, etc) #241

sscarduzio · 2016-04-02T10:40:54Z

I can't resist saying Caravel looks much neater than Kibana, plus the user management doesn't cost money and it's not an afterthought.
It would be amazing to see Caravel replacing my Kibana dashboard, using the data I've got currently in Elasticsearch.

You use an SQL interface to query the data store, is there any chance Caravel can speak to Elasticsearch through Spark SQL?
Spark has a mature Elasticsearch connector, so it should be OK.

And wait.. If you support Spark SQL, you'll be immediately able to support HDFS, Cassandra, HBase, Hive, Tachyon, and any Hadoop data source!

Is this a path worth exploring for this project? I think it's quite exciting.

gbrian · 2016-04-02T13:34:39Z

+1
I'm looking for an Apache Drill connector as well.

ariepratama · 2016-04-02T14:36:13Z

+1
on this feature too

mistercrunch · 2016-04-02T17:12:27Z

Totally worth doing. There are two paths for it: either creating a SQLAlchemy dialect (might not be possible if Spark SQL is funky), or creating a new datasource and implementing the query interface. For now we have two datasources: sqlalchemy and druid. It's totally doable to add a third one; it just needs to implement something like:
https://github.com/airbnb/caravel/blob/master/caravel/models.py#L460

Basically you need to receive these parameters and return a pandas dataframe.
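
A minimal sketch of that contract, assuming a far smaller parameter set than the real interface linked above. The names `build_sql` and `SketchDatasource`, and the three parameters `groupby`, `metrics` and `row_limit`, are hypothetical and chosen only for illustration; the point is just "parameters in, pandas dataframe out".

```python
def build_sql(table, groupby, metrics, row_limit=None):
    """Translate dashboard-style parameters into one SQL string.

    Sketch only: identifiers are interpolated without any escaping.
    """
    select_cols = list(groupby) + list(metrics)
    sql = f"SELECT {', '.join(select_cols)} FROM {table}"
    if groupby:
        sql += f" GROUP BY {', '.join(groupby)}"
    if row_limit is not None:
        sql += f" LIMIT {int(row_limit)}"
    return sql


class SketchDatasource:
    """Hypothetical third datasource: receives parameters, returns a DataFrame."""

    def __init__(self, connection, table):
        # e.g. a sqlite3 connection or a SQLAlchemy engine
        self.connection = connection
        self.table = table

    def query(self, groupby, metrics, row_limit=None):
        # Imported here so build_sql stays usable without pandas installed.
        import pandas as pd

        sql = build_sql(self.table, groupby, metrics, row_limit)
        return pd.read_sql_query(sql, self.connection)
```

With a `sqlite3` connection holding a `logs` table, `SketchDatasource(conn, "logs").query(["country"], ["COUNT(*) AS n"], row_limit=10)` would return one row per country. A backend that does not speak SQL would replace `build_sql` with whatever query language it understands and still hand back a dataframe.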

mistercrunch · 2016-04-02T17:13:06Z

We use Spark at Airbnb and have some Spark SQL in places, so we might have use cases for it internally, but I'm not sure where it fits in the priority list.

sscarduzio · 2016-04-03T11:17:57Z

Cool thanks for the pointers! This new connector would surely unlock a wealth of valuable contributions from other businesses which happen to not use Druid or a plain RDBMS.

Sounds like a good investment to me :)

joshwalters · 2016-04-06T17:46:02Z

I am really interested in adding Hive support, I may take a crack at it sometime in the next few weeks. Dropbox has a Python/Hive project that I was looking at: https://github.com/dropbox/PyHive

gbrian · 2016-04-06T18:03:56Z

Does it mean Impala as well? Thanks

guang · 2016-04-07T01:28:43Z

+1

csalperwyck · 2016-04-11T08:59:13Z

+1 for Hive

joshwalters · 2016-04-13T18:36:42Z

@gbrian Yes, the package I am looking at would add support for Hive and Impala. I opened an issue to track this: #339

OElesin · 2016-04-23T06:09:05Z

Great work guys, but can I load data from Elasticsearch?

rahulgagrani · 2016-04-23T19:14:28Z

+1 to addition of Elasticsearch support.

philippfrenzel · 2016-04-23T19:21:01Z

+1

povilasb · 2016-04-27T05:41:16Z

+1

nabilblk · 2016-05-06T09:40:36Z

+1 for Hive

bwboy · 2016-05-11T08:04:13Z

+1 for Hive and Elasticsearch

JohnOmernik · 2016-05-20T00:29:53Z

I am working on an Apache Drill SQLAlchemy dialect. I have some basic things working, and have been working with others on the Drill mailing list. There has been talk of plugging Drill into Elasticsearch, which seems a bit convoluted. However, since Elasticsearch doesn't have a SQL interface, Drill works really nicely: if we get a dialect working for Drill, then other storage plugins will (hopefully) just work. Some of the work can be found here:

Docker container with pyodbc, unixodbc, Drill ODBC, and caravel all working:

https://github.com/JohnOmernik/caraveldrill

Drill dialect (work in progress; feel free to play with it and please report issues as you find them. This is iterative brute-force programming at this point!):
https://github.com/JohnOmernik/sqlalchemy-drill

sathieu · 2016-06-01T02:05:18Z

I've taken a different approach and started a native backend.

WIP is at https://github.com/sathieu/caravel/tree/elasticsearch (beware: I'll squash commits and force push).

Not much is working yet, and I don't have dedicated time on it. We'll see what comes.

tninja · 2016-06-01T19:24:05Z

+1 to sparksql

bolkedebruin · 2016-06-10T07:07:43Z

For what it's worth: Spark 2 will be SQL compliant, so a SQLAlchemy dialect should then be feasible.

benvogan · 2016-07-06T23:44:28Z

+1 for Spark SQL. That will get you connected to most data sources these days.

giaosudau · 2016-07-11T04:16:33Z

+1 for Spark SQL, Hive.

shkr · 2016-07-20T16:54:41Z

You can connect it to Spark SQL. If it uses a Hive back-end, you can refer to this documentation page for instructions on how to connect to Spark SQL via a jdbc+hive connector: https://docs.cloud.databricks.com/docs/latest/databricks_guide/index.html#01%20Databricks%20Overview/14%20Third%20Party%20Integrations/05%20Beeline.html

The one I prefer is dropbox/pyhive, which I use to connect to Spark SQL in my Python projects. For Scala or Java, jdbc+hive will be preferable.

sbookworm · 2016-07-21T02:40:01Z

+1 for spark sql

mistercrunch · 2016-07-22T00:01:59Z

Sweet! Can others confirm that SparkSQL works for them through SQLAlchemy?!

mistercrunch · 2016-07-22T00:12:46Z

Giving hints about how to use SparkSQL in the docs: #803

giaosudau · 2016-07-24T02:19:51Z

@mistercrunch Right now it does.
But on a long query it blocks the whole process. I think it's related to threading.

maver1ck · 2016-11-24T17:52:01Z

I used Spark Thrift Server with PyHive and it almost works (I need to change one line in the Hive dialect).

kaiosama · 2017-02-06T22:05:38Z

@shkr Hi, I am trying to achieve the same thing with pyHive and have not been able to make it work. What is the URI you are using for setting up Superset data source? I am trying something like jdbc+hive://localhost:10000/, and it gives an error: "Can't load plugin: sqlalchemy.dialects:jdbc.hive". I am sure I must be missing something here.. Thanks in advance for any instructions on this.

-- update --
Looks like I had a hiveserver2 problem, I restarted it and then I was able to use this URI:
hive://user@localhost:10000/database
However I can't get what is listed on the wiki to work (jdbc+hive://), the error message is "Can't load plugin: sqlalchemy.dialects:jdbc.hive"
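
A likely reading of that error, sketched under the assumption that SQLAlchemy splits the URI scheme on `+` into a dialect name and a driver name: `jdbc+hive://` then asks for a dialect called `jdbc` with a driver called `hive`, which matches the `jdbc.hive` in the message, while `hive://` asks for the `hive` dialect directly. The helper names `dialect_and_driver` and `plugin_name` are hypothetical and only mimic that split with the standard library; they are not SQLAlchemy functions.

```python
from urllib.parse import urlsplit


def dialect_and_driver(uri):
    """Split a SQLAlchemy-style URI scheme into (dialect, driver or None)."""
    scheme = urlsplit(uri).scheme
    dialect, _, driver = scheme.partition("+")
    return dialect, driver or None


def plugin_name(uri):
    """Build the 'dialect.driver' name in the form the error message shows."""
    dialect, driver = dialect_and_driver(uri)
    return dialect if driver is None else f"{dialect}.{driver}"


print(plugin_name("jdbc+hive://localhost:10000/"))          # jdbc.hive
print(plugin_name("hive://user@localhost:10000/database"))  # hive
```

Under that reading, the working `hive://` URI and the failing `jdbc+hive://` URI differ only in which plugin gets looked up, not in the host or port.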

I have another question: what do you mean when you say to use SparkSQL as a backend? I am fairly new to this, but AFAIK I can save dataframes in SparkSQL to a Hive table, from which I can then create a Superset table/slice using the above connector. Is there more that I can do to make this process better? My overall goal is to be able to create tables/slices from Parquet files on HDFS.

ChethanChandra · 2017-02-10T05:09:56Z

+1 for Elasticsearch support.

santhavathi · 2017-02-17T13:14:53Z

@giaosudau, what SQLAlchemy URI should I give Superset to connect to SparkSQL?
I used the one below and it is not working; 172.31.12.201 is where the Spark 1.6.2 master runs:
hive://172.31.12.201:7077/test_database

shkr · 2017-02-20T05:40:38Z

@santhavathi when you open the Spark UI dashboard, there is an IP printed at the top, which is the hostname of the head of the cluster. You have to use that as the hostname in the hive URL.

Example: hive://<spark-cluster-master>/
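
A small sketch of how such a URI could be assembled, assuming the Spark Thrift Server is running on that host and listening on its default port 10000. Port 7077 is the default port of the Spark standalone master, which does not answer HiveServer2 connections. The helper `hive_uri` and its parameters are hypothetical.

```python
from urllib.parse import quote

# Default port of HiveServer2 and of the Spark Thrift Server.
THRIFT_DEFAULT_PORT = 10000


def hive_uri(host, port=THRIFT_DEFAULT_PORT, database="default", user=None):
    """Build a hive:// SQLAlchemy URI for a Thrift server endpoint."""
    auth = f"{quote(user, safe='')}@" if user else ""
    return f"hive://{auth}{host}:{port}/{database}"


print(hive_uri("localhost", database="database", user="user"))
# hive://user@localhost:10000/database
```

The resulting string is what would go into the SQLAlchemy URI field of the database form.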

santhavathi · 2017-02-20T08:57:19Z

@shkr, thanks so much for the reply.
I had to start the Hive server (Spark Thrift Server) on my Spark cluster.
Also, using hive:// gives the error below:
ERROR: Connection failed!

The error message returned was:
Could not locate column in row for column 'tab_name'

I used impala:// and it works now.
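
One plausible reading of the `tab_name` error, stated as an assumption that nothing in this thread confirms: the hive dialect reads the table name from a column called `tab_name` when listing tables, and the Spark Thrift Server returns that column under a different name. The sketch below shows a tolerant lookup over rows modelled as plain dicts; the helper `table_names` and the alternative column name `tableName` are hypothetical.

```python
def table_names(rows, candidates=("tab_name", "tableName")):
    """Return the table name from each row, trying several column names."""
    names = []
    for row in rows:
        # Pick the first candidate column that this row actually has.
        key = next((c for c in candidates if c in row), None)
        if key is None:
            raise KeyError(
                f"none of {candidates} found in row with columns {sorted(row)}"
            )
        names.append(row[key])
    return names


hive_style = [{"tab_name": "events"}]
spark_style = [{"tableName": "events", "isTemporary": False}]
print(table_names(hive_style), table_names(spark_style))
```

If that reading is right, it would also explain why a different dialect such as impala:// gets past the connection test against the same server.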

cduverne · 2017-02-20T15:06:33Z

Hello guys, I see in the documentation that SparkSQL is supported: http://airbnb.io/superset/installation.html#database-dependencies

What does this concretely mean? Which DBs can we query then?

Thanks a lot in advance.

kaiosama · 2017-02-20T19:56:40Z

@shkr according to your latest comment, I tried the following URI: hive://172.17.0.2, where 172.17.0.2 is what I got from spark UI.

It allows me to add it as a database, so far so good. However, when I query a table in this database, the job tracker shows a MapReduce job. I would expect the job to be a Spark job, though. Is that what you see in your case?
I was able to connect to a local Hive using hive://localhost:10000, and so far these two behave the same to me.

santhavathi · 2017-02-21T05:57:49Z

@kaiosama, when you said you are connecting to hive://172.17.0.2, what is the port you used here, and are you directly connecting to spark master without hiveserver running?

kaiosama · 2017-02-21T19:29:16Z

@santhavathi that is the full URI I used, without port #. I tried using some port #s from the spark UI page but none of them works.

It was with a running Hive server. Maybe I am missing something here, but it seems to me that Spark SQL is supposed to be used against Hive, i.e. you always need a running Hive server? Or can the Spark SQL connector be used against other sources? As @cduverne mentioned, it's not very clear to me. And I have not had any replies about how to get "jdbc+hive" to work as described in the documentation.

oblamine · 2017-02-28T15:11:20Z

+1 for Hbase support :)

mistercrunch · 2017-02-28T16:26:41Z

At Airbnb we can do HBase through Presto with the HBase Presto connector.

oblamine · 2017-03-02T16:18:42Z

Would you please give me a link so I can follow the install steps?

balchandra · 2017-03-03T13:21:15Z

Hi, can someone please list the steps needed to connect to Elasticsearch from Superset?
It would be a great help.

mistercrunch · 2017-03-04T01:12:13Z

@balchandra it would involve using this:
https://github.com/loverajoel/sqlalchemy-elasticquery

shkr · 2017-03-07T19:33:31Z

@kaiosama The hostname directs SQLAlchemy to send SQL to that host at the given port. It's hard to say whether a MapReduce job is the normal behavior to expect without knowing the details of your Hive, MapReduce and Spark setup.

balchandra · 2017-03-10T09:55:33Z

@mistercrunch
I tried that, connecting Superset with sqlalchemy-elasticquery.
I was able to connect when both Superset and Elasticsearch are installed on the same server.
However, I was not able to view tables/indices once connected.
Can you tell me how exactly it is supposed to be used?
It would help me a great deal.
Thanks in advance

mistercrunch · 2017-03-10T18:46:21Z

Looks like sqlalchemy-elasticquery isn't what I thought it was. Depending on how ANSI compliant ElasticSearch's SQL is, it may be possible to create your own sqlalchemy dialect. If not, someone would have to create a new connector for it. Luckily I recently refactored and formalized the connector abstraction.

xycloud · 2017-03-30T16:09:53Z

+1 for elasticsearch

zbidi · 2017-04-03T21:52:54Z

+1 for elasticsearch

hongqp · 2017-04-18T12:46:03Z

+1 for Hive and Elasticsearch

mistercrunch · 2019-10-25T05:42:14Z

Good news about ElasticSearch here! #8441

srinify · 2021-04-09T00:37:28Z

Closing since Superset now works with Elasticsearch!

https://superset.apache.org/docs/databases/elasticsearch

mistercrunch added the enhancement:request Enhancement request submitted by anyone from the community label Apr 2, 2016

jgbarah mentioned this issue Jun 21, 2016

Support more NoSQL databases #600

Closed

xrmx mentioned this issue Jul 18, 2016

Does support spark sql #770

Closed

mistercrunch added the documentation label Jul 22, 2016

mistercrunch added the validation:required A committer should validate the issue label Jul 22, 2016

xrmx mentioned this issue Mar 27, 2017

[question] : Support Spark SQL standalone #2483

Closed

apache locked and limited conversation to collaborators Apr 18, 2017

kristw added the inactive Inactive for >= 30 days label Mar 20, 2019

stale bot removed the inactive Inactive for >= 30 days label Oct 25, 2019

srinify closed this as completed Apr 9, 2021
