A simple data pipeline for the RMS Titanic dataset that uses Python and the Pandas library for preprocessing. The processed data is loaded into a Neo4j graph database for exploration and analysis.
The pipeline fetches the RMS Titanic dataset, cleans and preprocesses it, then loads it into a Neo4j instance running in Docker. There, the relationships between passengers and other entities (other passengers, lifeboats, cabins, and so on) can be visualized, analyzed, and explored.
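To make the graph idea concrete, here is a minimal sketch of how one flat passenger record could be turned into passenger-to-entity relationships. The field names (`name`, `cabin`, `boat`), the relationship types (`STAYED_IN`, `BOARDED`), and the sample record are all assumptions for illustration; the actual model is defined by the project's create_db.cyp file and may differ.

```python
from typing import NamedTuple


class Edge(NamedTuple):
    source: str    # passenger name
    relation: str  # hypothetical relationship type
    target: str    # cabin or lifeboat identifier


def passenger_edges(row: dict) -> list[Edge]:
    """Turn one flat passenger record into graph edges, skipping blank fields."""
    edges = []
    # Assumption: a cabin field may list several space-separated cabins.
    for cabin in row.get("cabin", "").split():
        edges.append(Edge(row["name"], "STAYED_IN", cabin))
    boat = row.get("boat", "").strip()
    if boat:
        edges.append(Edge(row["name"], "BOARDED", boat))
    return edges


# Synthetic record, not taken from the real dataset.
row = {"name": "Example Passenger", "cabin": "C22 C26", "boat": "11"}
edges = passenger_edges(row)
for edge in edges:
    print(edge)
```

Each edge corresponds to one relationship that could be drawn in the Neo4j Browser; passengers sharing a cabin or lifeboat would then be connected through the same target node.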
Ensure that the Pandas library and Docker are installed. To run the pipeline, clone the repo and run:
make graph
This will perform the following steps:
- Fetch the data from a URL
- Process it and save it to the ./data folder
- Pull a Neo4j Docker image and run it
- Load the processed data using the create_db.cyp file
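The processing step above might look roughly like the following Pandas sketch. It is not the project's actual script: the column names (`name`, `age`, `cabin`, `boat`, `survived`), the helper `clean_passengers`, and the sample rows are assumptions, and an in-memory CSV stands in for the downloaded file so the example runs without network access.

```python
import io

import pandas as pd

# Synthetic stand-in for the fetched CSV; columns and values are assumed.
RAW_CSV = """name,age,cabin,boat,survived
 Passenger A ,29,B5,2,1
Passenger B,,C22 C26,11,1
Passenger C,39,,,0
"""


def clean_passengers(raw: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy: no empty rows, trimmed text, typed columns."""
    df = raw.dropna(how="all").copy()
    df["name"] = df["name"].str.strip()
    df["survived"] = df["survived"].astype(int).astype(bool)
    # Store missing cabins and boats as empty strings so later loading
    # does not have to special-case NaN values.
    for col in ("cabin", "boat"):
        df[col] = df[col].fillna("").astype(str).str.strip()
    return df.reset_index(drop=True)


# Read cabin and boat as text so identifiers like "11" are not parsed as numbers.
raw = pd.read_csv(io.StringIO(RAW_CSV), dtype={"cabin": str, "boat": str})
passengers = clean_passengers(raw)
print(passengers)
```

In a real run, `pd.read_csv` would be given the dataset URL instead of the in-memory buffer, and the result could be written with `passengers.to_csv(...)` to a file under ./data (the exact file name depends on what create_db.cyp expects).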
To explore the database, open the Neo4j Browser and run any Cypher query. For an introduction to Cypher, see the Cypher Basics guide at Neo4j.
When finished, run:
make clean_up
This will stop Neo4j, remove the container, and clean up cache files.
A complete Titanic dataset is available from https://data.world/nrippner/titanic-disaster-dataset.
This project is licensed under the MIT License. See the LICENSE file for details.