Use Parquet format instead of CSV for data consumption #17

prrao87 · 2023-08-18T03:31:09Z

Closes #4.

This PR generates Parquet data files instead of CSVs, so that we can ingest the Parquet data into the Neo4j and Kùzu graphs. Parquet files are much smaller and keep the schema as part of the data.

The upstream Kùzu Parquet reader has been fixed, so we can now use Parquet for all data reads when building the graph.
An added benefit of Parquet is that pl.read_parquet is much less verbose than pl.read_csv, because we don't have to specify separators and other schema information.

To do

Update the doc sections that mention CSV to refer to Parquet instead.

prrao87 · 2023-08-18T14:44:34Z

@andyfengHKU and @ray6080, I've completed this stage of my benchmark study after switching to reading the data via parquet as per this PR. Here are my findings:

For ingestion, Kùzu is consistently faster than Neo4j by a factor of ~18x for a graph of 100k nodes and ~2.4M edges. I expect this speedup factor to be even higher as the dataset gets larger.
For OLAP querying, Kùzu is significantly faster than Neo4j for most types of queries, especially those that involve aggregating on many-to-many relationships.

I've left open the question of why certain types of queries are on par with Neo4j. We can take a look at those as we go along, and I can rerun the numbers then. Thanks!

prrao87 added 12 commits August 17, 2023 22:53

Change data generation to use parquet

4b47c6a

Update Kuzu graph build to use parquet

766f4f3

Update Neo4j graph build to use parquet

fb6cae9

Update batch size for edges

0397793

Fix formatting and style

2abe96c

Update docs

11d47ec

Fix formatting

5746abf

Finalize timing numbers

5f4222c

Summarize results of performance comparison

e85be6f

Add ingestion performance comparison

d1645e1

Add list of queries in main page

2c48f20

Fix docs for clarity

a89ed93

prrao87 merged commit 44d887e into main Aug 18, 2023

prrao87 deleted the parquet branch August 18, 2023 14:44
