Merge pull request #6 from prrao87/neo4j
Neo4j and KuzuDB comparison
prrao87 authored Aug 11, 2023
2 parents 8a7af44 + 5ca32ae commit b86e27c
Showing 12 changed files with 1,025 additions and 163 deletions.
180 changes: 19 additions & 161 deletions README.md
@@ -1,13 +1,17 @@
# KùzuDB: Benchmark study

[Kùzu](https://kuzudb.com/) is an in-process (embedded) graph database management system (GDBMS) built for query speed and scalability. It is written in C++, optimized for handling complex join-heavy analytical workloads on very large graph databases, and is under active development. The goal of the code shown in this repo is as follows:
[Kùzu](https://kuzudb.com/) is an in-process (embedded) graph database management system (GDBMS). Because it is written in C++, it is blazing fast, and is optimized for handling complex join-heavy analytical workloads on very large graph databases. The database is under active development, but its philosophy is to become the "DuckDB of graph databases" -- a fast, lightweight, embeddable graph database for analytics use cases, with minimum setup and infrastructure effort.

* Generate an artificial dataset that can be used to build an artificial social network graph
* Ingest the data into Kùzu
* Run a set of queries in Cypher on the data to benchmark the performance of Kùzu
* Study the ingestion and query times in comparison with Neo4j, and optimize where possible
The goal of the code shown in this repo is as follows:

Python is used as the intermediary between the source data and the DB.
* Generate an artificial social network dataset, including persons, interests and locations
* Ingest the data into KùzuDB and Neo4j
* Run a set of queries in Cypher on either DB to:
* (1) Verify that the data is ingested correctly and that the results from either DB are consistent with one another
* (2) Benchmark the performance of Kùzu vs an established vendor like Neo4j
* Study the ingestion and query times for either DB, and optimize where possible

Python is used as the intermediary language between the source data and the DBs.

## Setup

@@ -26,170 +30,24 @@ An artificial social network dataset is used, generated via the [Faker](https://

### Generate all data at once

A shell script `generate_data.sh` is provided in the root directory of this repo that sequentially runs the Python scripts, generating the data for the nodes and edges of the social network. This is the recommended way to generate the data. The script takes a single positional argument: the number of person profiles to generate, specified as an integer, as shown below.

```sh
bash generate_data.sh 10000
```

Running this command generates a series of files in the `output` directory, following which we can proceed to ingesting the data into a graph database.

See [./data/README.md](./data/README.md) for more details on each script that is run sequentially to generate the data.

## Ingest the data into Neo4j or Kùzu

Navigate to the [neo4j](./neo4j) and [kuzudb](./kuzudb/) directories for instructions on how to ingest the data into each database.

## Run the queries

Some sample queries are run in each DB to verify that the data is ingested correctly, and that the results are consistent with one another.

## Performance comparison

🚧 WIP
174 changes: 174 additions & 0 deletions data/README.md
@@ -0,0 +1,174 @@
# Data generation for study

This section describes the individual data generation scripts to build the nodes and edges of the artificial social network.

## Generate all data at once

As mentioned in the root-level README, a shell script `generate_data.sh` is provided that sequentially runs the Python scripts from this directory, generating the data for the nodes and edges of the social network. This is the recommended way to generate the data. The script takes a single positional argument: the number of person profiles to generate, specified as an integer, as shown below.

```sh
# Generate data for 100K persons
bash generate_data.sh 100000
```

Running this command generates a series of files in the `output` directory, following which we can proceed to ingesting the data into a graph database.

### Nodes: Persons

First, fake male and female profile information is generated for the number of people required to be in the network.

```sh
$ cd data
# Create a dataset of fake profiles for men and women with a 50-50 split by gender
$ python create_nodes_person.py -n 100000
```
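The repo's script uses the Faker library under the hood; the following is a minimal, stdlib-only sketch of the same idea, with hypothetical name lists and an assumed `persons.csv` output path:

```python
import csv
import random
from datetime import date

# Hypothetical name lists; the real script draws names from Faker instead
MALE_NAMES = ["Kenneth Scott", "Thomas Williams", "James Carter"]
FEMALE_NAMES = ["Stephanie Lozano", "Natasha Evans", "Maria Lopez"]

def make_person(person_id: int, gender: str) -> dict:
    name = random.choice(MALE_NAMES if gender == "male" else FEMALE_NAMES)
    birthday = date(random.randint(1958, 2005), random.randint(1, 12), random.randint(1, 28))
    return {
        "id": person_id,
        "name": name,
        "gender": gender,
        "birthday": birthday.isoformat(),
        "age": 2023 - birthday.year,  # approximate age, ignores month/day
        "isMarried": random.choice(["true", "false"]),
    }

def write_persons(path: str, n: int) -> None:
    # 50-50 split by gender, pipe-separated like the repo's output files
    fields = ["id", "name", "gender", "birthday", "age", "isMarried"]
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fields, delimiter="|")
        writer.writeheader()
        for i in range(1, n + 1):
            writer.writerow(make_person(i, "male" if i % 2 else "female"))

write_persons("persons.csv", 10)
```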

The CSV file generated contains a header and fake data, as shown below.


id|name|gender|birthday|age|isMarried
---|---|---|---|---|---
1|Kenneth Scott|male|1984-04-14|39|true
2|Stephanie Lozano|female|1993-12-31|29|true
3|Thomas Williams|male|1979-02-09|44|true

Each column uses the `|` separator symbol to make it explicit what the column boundaries are (especially when the data itself contains commas, which is common if the data contains unstructured text).
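For example, the pipe-delimited output can be read back with Python's stdlib `csv` module; the field values here are illustrative:

```python
import csv
from io import StringIO

# A comma inside a field does not break parsing when '|' is the delimiter
sample = "id|name|gender\n1|Scott, Kenneth|male\n"
rows = list(csv.DictReader(StringIO(sample), delimiter="|"))
print(rows[0]["name"])  # → Scott, Kenneth
```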

### Nodes: Locations

To generate a list of cities that people live in, we use the [world cities dataset](https://www.kaggle.com/datasets/juanmah/world-cities?resource=download) from Kaggle. This is an accurate and up-to-date database of the world's cities and towns, including lat/long and population information of ~44k cities all over the world.

To make this dataset simpler and more realistic, we only consider cities from the following three countries: `US`, `UK` and `CA`.

```sh
$ python create_nodes_location.py

Wrote 7117 cities to CSV
Wrote 273 states to CSV
Wrote 3 countries to CSV
```

Three CSV files are generated accordingly for cities, states and the specified countries. Latitude, longitude and population are the additional metadata fields for each city.
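A hedged sketch of the country-filtering step, using the stdlib `csv` module on an inlined, hypothetical slice of the Kaggle file (column names such as `iso2` are assumed from that dataset; note the UK's ISO-2 code is `GB`):

```python
import csv
from io import StringIO

# Hypothetical three-row slice of the Kaggle worldcities.csv
raw = """city,lat,lng,country,iso2,population
Airdrie,51.2917,-114.0144,Canada,CA,61581
Tokyo,35.6897,139.6922,Japan,JP,37732000
London,51.5072,-0.1275,United Kingdom,GB,11262000
"""

KEEP = {"US", "GB", "CA"}  # US, UK and Canada only
cities = [row for row in csv.DictReader(StringIO(raw)) if row["iso2"] in KEEP]
print([c["city"] for c in cities])  # → ['Airdrie', 'London']
```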

#### `cities.csv`

id|city|state|country|lat|lng|population
---|---|---|---|---|---|---
1|Airdrie|Alberta|Canada|51.2917|-114.0144|61581
2|Beaumont|Alberta|Canada|53.3572|-113.4147|17396

#### `states.csv`

id|state|country
---|---|---
1|Alberta|Canada
2|British Columbia|Canada
3|Manitoba|Canada

#### `countries.csv`

id|country
---|---
1|Canada
2|United Kingdom
3|United States

### Nodes: Interests

A static list of interests/hobbies that a person could have is included in `raw/interests.csv`. This is cleaned up and formatted as required by the data generator script.

```sh
$ python create_nodes_interests.py
```

This generates data as shown below.

id|interest
--- | ---
1|Anime
2|Art & Painting
3|Biking

### Edges: `Person` follows `Person`

Edges are generated between people in a way that resembles a real social network. A `Person` follows another `Person`, with the direction of the edge signifying something meaningful. Rather than drawing edges from a uniform distribution, a small fraction of the profiles (~0.5%) is chosen during generation to be highly connected, which makes the data more interesting. This resembles the role of "influencers" in real-world graphs; in graph terminology, the nodes representing these persons are called "hubs". The rest of the nodes are connected via these hubs in a random fashion.

```sh
python create_edges_follows.py
```

This generates data as shown below, where the `from` column contains the ID of a person who is following someone, and the `to` column contains the ID of the person being followed.

from|to
---|---
50|1
152|1
271|1

The "hub" nodes can be connected to anywhere from 0.5-5% of the number of persons in the graph.
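The hub idea can be sketched as follows; this is an illustrative approximation under stated assumptions, not the repo's exact algorithm, and all names are hypothetical:

```python
import random

random.seed(42)

def make_follows(num_persons: int, hub_fraction: float = 0.005):
    # A small fraction of persons become "hubs"; everyone else follows a
    # random subset of the hubs, so hubs accumulate most of the in-degree.
    persons = list(range(1, num_persons + 1))
    hubs = random.sample(persons, max(1, int(num_persons * hub_fraction)))
    edges = []
    for p in persons:
        for h in random.sample(hubs, random.randint(1, len(hubs))):
            if h != p:
                edges.append((p, h))  # person p follows hub h
    return hubs, edges

hubs, edges = make_follows(1000)
```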

### Edges: `Person` lives in `Location`

Edges are generated between people and the cities they live in. This is done by randomly choosing a city for each person from the list of cities generated earlier.

```sh
$ python create_edges_location.py
```

The data generated contains the person ID in the `from` column and the city ID in the `to` column.

from|to
---|---
1|6015
2|6296
3|6657
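A minimal sketch of this assignment, assuming the city and person ID ranges from the earlier steps:

```python
import random

random.seed(0)

# Each person is assigned exactly one random city to live in
city_ids = range(1, 7118)      # 7117 cities, matching the output above
person_ids = range(1, 1001)    # 1000 persons, for illustration
lives_in = [(p, random.choice(city_ids)) for p in person_ids]
```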

### Edges: `Person` has `Interest`

Edges are generated between people and the interests they have. This is done by randomly choosing anywhere from 1-5 interests for each person from the list of interests generated earlier for the nodes.

```sh
python create_edges_interests.py
```

The data generated contains the person ID in the `from` column and the interest ID in the `to` column.

from|to
---|---
1|24
2|4
2|8

A person can have multiple interests, so the `from` column can have multiple rows with the same ID.
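A minimal sketch of this step, assuming a hypothetical list of 30 interest IDs:

```python
import random

random.seed(1)

# Hypothetical interest IDs; the real list comes from interests.csv
interest_ids = list(range(1, 31))

has_interest = []
for person_id in range(1, 1001):
    # 1-5 distinct interests per person
    for i in sorted(random.sample(interest_ids, random.randint(1, 5))):
        has_interest.append((person_id, i))
```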

### Edges: `City` is in `State`

Edges are generated between cities and the states they are in, as per the `cities.csv` file.

```sh
python create_edges_city_state.py
```

The data generated contains the city ID in the `from` column and the state ID in the `to` column.

from|to
---|---
1|1
2|1
3|1
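A hedged sketch of deriving these edges from `cities.csv`, assigning state IDs in first-seen order; the inline data is a tiny stand-in for the generated file:

```python
import csv
from io import StringIO

# Tiny stand-in for the generated cities.csv (pipe-separated, as above)
cities_csv = """id|city|state|country|lat|lng|population
1|Airdrie|Alberta|Canada|51.2917|-114.0144|61581
2|Beaumont|Alberta|Canada|53.3572|-113.4147|17396
3|Vancouver|British Columbia|Canada|49.25|-123.1|2264823
"""

# Assign state IDs in first-seen order, then emit (city_id, state_id) edges
state_ids = {}
edges = []
for row in csv.DictReader(StringIO(cities_csv), delimiter="|"):
    sid = state_ids.setdefault(row["state"], len(state_ids) + 1)
    edges.append((int(row["id"]), sid))
print(edges)  # → [(1, 1), (2, 1), (3, 2)]
```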

### Edges: `State` is in `Country`

Edges are generated between states and the countries they are in, as per the `states.csv` file.

```sh
python create_edges_state_country.py
```

The data generated contains the state ID in the `from` column and the country ID in the `to` column.

from|to
---|---
1|1
2|1
3|1
2 changes: 1 addition & 1 deletion generate_data.sh
@@ -10,7 +10,7 @@ echo "Generating $1 samples of data";
# Nodes
python create_nodes_person.py -n ${1-1000}
python create_nodes_location.py
-python create_nodes_interest.py
+python create_nodes_interests.py

# Edges
python create_edges_follows.py