Storing the output of Spark on Neo4j #32

Open
d34th4ck3r opened this issue Mar 10, 2015 · 2 comments

@d34th4ck3r

Presently, Mazerunner can perform graph analysis on data that is already stored in Neo4j Server. However, one important missing feature is the ability to store data streaming out of Spark into Neo4j in real time, and then to perform operations on that data.

An example of one such use case: http://stackoverflow.com/questions/28896898/using-neo4j-with-apache-spark
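
To make the request a bit more concrete, here is a minimal sketch of how rows coming out of a Spark job could be grouped into batched Cypher writes. This is not something Mazerunner provides today. The `Result` label, the `id` / `props` row layout, and the `batched_writes` helper are all hypothetical names chosen for illustration, and the parameter syntax (`$rows`) assumes a recent Neo4j version. The sketch only builds `(query, parameters)` pairs; actually sending them with a Neo4j client is left out.

```python
from typing import Any, Dict, Iterable, Iterator, List, Tuple

# Hypothetical Cypher template: the label and property names are illustrative only.
MERGE_NODES = (
    "UNWIND $rows AS row "
    "MERGE (n:Result {id: row.id}) "
    "SET n += row.props"
)


def batched_writes(
    rows: Iterable[Dict[str, Any]], batch_size: int = 1000
) -> Iterator[Tuple[str, Dict[str, List[Dict[str, Any]]]]]:
    """Group rows into (query, parameters) pairs, one pair per batch.

    Each row is assumed to look like {"id": ..., "props": {...}}.
    """
    batch: List[Dict[str, Any]] = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield MERGE_NODES, {"rows": batch}
            batch = []
    if batch:
        # Emit the final, possibly smaller, batch.
        yield MERGE_NODES, {"rows": batch}
```

In a Spark job, a helper like this could be called from inside `foreachPartition` (or `foreachRDD` for a streaming job), so that each partition sends a handful of batched statements rather than one statement per record.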

@kbastani kbastani self-assigned this Mar 11, 2015
@kbastani
Collaborator

Can you please provide an example of how this integration might work? What is your input to Spark? What is the output? What are the acceptance criteria for this feature?

@ojairob

ojairob commented Apr 14, 2015

Hello, an example might be: I have terabytes of data in HDFS. This data comprises ad impressions, ad clicks, and ROI events driven by the interactions of impressions and clicks. There are concepts of a Browser, Ad, Impression, Click, and ROI event, with timestamps and ids for everything. Using a Spark job, I would like to create a Neo4j graph at scale. I've tried to investigate how to scale the creation / insertion of the Neo4j data. It seems Mazerunner can take the output of a GraphX job and resubmit it via some queue. It also seems like Mazerunner could build a graph from a basic Spark / GraphX query. Finally, I looked into the batch-import project, which seems really fast at creating the necessary Neo4j files. Subsequently, it would be great to be able to batch in new data as it arrives.
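
As a rough sketch of what the Spark side could produce, the function below turns event records into node rows and relationship rows that could then be written out as CSV for a bulk import. The field names (`type`, `event_id`, `browser_id`, `ad_id`, `ts`), the relationship types `GENERATED` and `OF_AD`, and the CSV column layout are all assumptions made for illustration; they are not the format batch-import or Mazerunner actually expects. It also assumes ids are unique across entity types.

```python
import csv
import io
from typing import Any, Dict, Iterable, List, Sequence, Tuple

Node = Tuple[str, str]           # (label, id)
Rel = Tuple[str, str, str, int]  # (start id, relationship type, end id, timestamp)


def events_to_rows(events: Iterable[Dict[str, Any]]) -> Tuple[List[Node], List[Rel]]:
    """Derive de-duplicated node rows and relationship rows from raw events."""
    nodes = set()
    rels: List[Rel] = []
    for e in events:
        browser = ("Browser", e["browser_id"])
        ad = ("Ad", e["ad_id"])
        event = (e["type"], e["event_id"])  # e.g. Impression, Click, or ROI
        nodes.update([browser, ad, event])
        # Hypothetical model: a browser generates an event, and the event refers to an ad.
        rels.append((browser[1], "GENERATED", event[1], e["ts"]))
        rels.append((event[1], "OF_AD", ad[1], e["ts"]))
    return sorted(nodes), rels


def to_csv(header: Sequence[str], rows: Iterable[Sequence[Any]]) -> str:
    """Serialize a header and rows to CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()
```

For example, `events_to_rows` could be applied per partition and the resulting rows written back to HDFS; nodes would still need to be de-duplicated across partitions before import.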


3 participants