Update timer and fix kuzudb queries #12

andyfengHKU · 2023-08-11T21:26:58Z

Changes

Closes #7
Closes #8

Timer update

Previously, the timer only recorded query execution time and ignored result iteration time. Depending on the database architecture, result iteration time may also be significant.

Disclaimer: I might be wrong about Neo4j's architecture, so feel free to point it out.

Based on my previous experience with Neo4j, its client-server model doesn't materialize the full query result by the end of session.run(). Instead, it relies on the user pulling (i.e. consuming) the result to continue execution. For example, if the query returns 10K tuples, session.run() might only materialize the first batch, say 1K. As the user requests more data, it incrementally materializes the remaining 9K tuples batch by batch. This makes a lot of sense in a client-server architecture, since we don't want to send a large amount of data over the network.

On the other hand, Kùzu is embedded and materializes the full result by the end of connection.execute(). So I think the most straightforward comparison is to measure end-to-end time, i.e. the time from passing in a query string to getting back an Arrow table.

The timer is modified in the same way for both Kùzu and Neo4j.
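To illustrate why the timer placement matters, here is a minimal sketch in plain Python. It does not use the Neo4j or Kùzu drivers: `lazy_run` is a hypothetical stand-in for a cursor that only does work as batches are pulled, and `timer` is an illustrative context manager, not the one in this repo. The batch sizes and delay are made up for the demonstration.

```python
import time
from contextlib import contextmanager


@contextmanager
def timer(label, sink):
    """Record the elapsed wall-clock time of the with-block in sink[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        sink[label] = time.perf_counter() - start


def lazy_run(n_rows, batch_size, delay):
    """Hypothetical lazy cursor: each batch costs `delay` seconds when pulled."""
    def batches():
        for start in range(0, n_rows, batch_size):
            time.sleep(delay)  # simulated per-batch execution/transfer cost
            yield list(range(start, min(start + batch_size, n_rows)))
    return batches()


timings = {}

# Timing only the call that submits the query: almost no work has happened yet.
with timer("run_only", timings):
    result = lazy_run(10_000, 1_000, 0.01)

# Timing end to end: submit the query and consume every row.
with timer("end_to_end", timings):
    result = lazy_run(10_000, 1_000, 0.01)
    rows = [row for batch in result for row in batch]

print(f"run only:   {timings['run_only']:.6f}s")
print(f"end to end: {timings['end_to_end']:.6f}s")
```

With a lazy result, the first measurement is close to zero regardless of result size, which is why the timer block needs to wrap result consumption as well.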

Kùzu query update

Query 2: Removed the secondary alias. This is a bug that should be fixed in the Kùzu repo.
Query 3: Replaced with the same query 3 as in the Neo4j folder, and added a lower bound of 1 to the variable-length pattern.
Query 4: Added a lower bound of 1 to the variable-length pattern.
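The lower-bound change for queries 3 and 4 could also be applied mechanically. The sketch below is a hypothetical helper (`add_lower_bound` is not part of this repo or of either driver) that rewrites `[*..N]` as `[*1..N]` with a regular expression. It assumes an omitted lower bound means 1, which is Cypher's default in Neo4j, and it only handles simple relationship patterns, not every Cypher construct.

```python
import re

# Matches a variable-length relationship with no lower bound, e.g. "[*..2]"
# or "[:LIVES_IN*..2]". The group captures everything between "[" and "*".
_UNBOUNDED = re.compile(r"(\[[^\]\*]*)\*\s*\.\.")


def add_lower_bound(query: str) -> str:
    """Insert an explicit lower bound of 1 into variable-length patterns."""
    return _UNBOUNDED.sub(r"\1*1..", query)


neo4j_pattern = "MATCH (p:Person)-[:LIVES_IN]->(c:City)-[*..2]->(co:Country)"
print(add_lower_bound(neo4j_pattern))
```

Patterns that already carry a lower bound, such as `[*1..2]`, are left unchanged.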

For the Kùzu team: I think we should try to be a drop-in replacement for Cypher so that users don't need to convert their queries. We should certainly be able to do this for read/write statements.

Neo4j result

Query 1 completed in 2.493329s

Query 1:

        MATCH (follower:Person)-[:FOLLOWS]->(person:Person)
        RETURN person.personID AS personID, person.name AS name, count(follower) AS numFollowers
        ORDER BY numFollowers DESC LIMIT 3

Top 3 most-followed persons:
shape: (3, 3)
┌──────────┬────────────────┬──────────────┐
│ personID ┆ name           ┆ numFollowers │
│ ---      ┆ ---            ┆ ---          │
│ i64      ┆ str            ┆ i64          │
╞══════════╪════════════════╪══════════════╡
│ 85723    ┆ Rachel Cooper  ┆ 4998         │
│ 68753    ┆ Claudia Booker ┆ 4985         │
│ 54696    ┆ Brian Burgess  ┆ 4976         │
└──────────┴────────────────┴──────────────┘
Query 2 completed in 0.911550s

Query 2:

        MATCH (follower:Person) -[:FOLLOWS]-> (person:Person)
        WITH person, count(follower) as followers
        ORDER BY followers DESC LIMIT 1
        MATCH (person) -[:LIVES_IN]-> (city:City)
        RETURN person.name AS name, followers AS numFollowers, city.city AS city, city.state AS state, city.country AS country

City in which most-followed person lives:
shape: (1, 5)
┌───────────────┬──────────────┬────────┬───────┬───────────────┐
│ name          ┆ numFollowers ┆ city   ┆ state ┆ country       │
│ ---           ┆ ---          ┆ ---    ┆ ---   ┆ ---           │
│ str           ┆ i64          ┆ str    ┆ str   ┆ str           │
╞═══════════════╪══════════════╪════════╪═══════╪═══════════════╡
│ Rachel Cooper ┆ 4998         ┆ Austin ┆ Texas ┆ United States │
└───────────────┴──────────────┴────────┴───────┴───────────────┘
Query 3 completed in 0.152607s

Query 3:

        MATCH (p:Person) -[:LIVES_IN]-> (c:City) -[*..2]-> (co:Country {country: $country})
        RETURN c.city AS city, avg(p.age) AS averageAge
        ORDER BY averageAge LIMIT 5

Cities with lowest average age in Canada:
shape: (5, 2)
┌───────────┬────────────┐
│ city      ┆ averageAge │
│ ---       ┆ ---        │
│ str       ┆ f64        │
╞═══════════╪════════════╡
│ Montreal  ┆ 37.310934  │
│ Calgary   ┆ 37.592098  │
│ Toronto   ┆ 37.705746  │
│ Edmonton  ┆ 37.931609  │
│ Vancouver ┆ 38.011002  │
└───────────┴────────────┘
Query 4 completed in 0.156949s

Query 4:

        MATCH (p:Person)-[:LIVES_IN]->(ci:City)-[*..2]->(country:Country)
        WHERE p.age > $age_lower AND p.age < $age_upper
        RETURN country.country AS countries, count(country) AS personCounts
        ORDER BY personCounts DESC LIMIT 3

Persons between ages 30-40 in each country:
shape: (3, 2)
┌────────────────┬──────────────┐
│ countries      ┆ personCounts │
│ ---            ┆ ---          │
│ str            ┆ i64          │
╞════════════════╪══════════════╡
│ United States  ┆ 24983        │
│ Canada         ┆ 2514         │
│ United Kingdom ┆ 1498         │
└────────────────┴──────────────┘
Query script completed in 3.723656s

Kùzu result

Query 1 completed in 1.177789s

Query 1:

        MATCH (follower:Person)-[:Follows]->(person:Person)
        RETURN person.id AS personID, person.name AS name, count(follower.id) AS numFollowers
        ORDER BY numFollowers DESC LIMIT 3;

Top 3 most-followed persons:
shape: (3, 3)
┌──────────┬────────────────┬──────────────┐
│ personID ┆ name           ┆ numFollowers │
│ ---      ┆ ---            ┆ ---          │
│ i64      ┆ str            ┆ i64          │
╞══════════╪════════════════╪══════════════╡
│ 85723    ┆ Rachel Cooper  ┆ 4998         │
│ 68753    ┆ Claudia Booker ┆ 4985         │
│ 54696    ┆ Brian Burgess  ┆ 4976         │
└──────────┴────────────────┴──────────────┘
Query 2 completed in 0.358639s

Query 2:

        MATCH (follower:Person)-[:Follows]->(person:Person)
        WITH person, count(follower.id) as numFollowers
        ORDER BY numFollowers DESC LIMIT 1
        MATCH (person) -[:LivesIn]-> (city:City)
        RETURN person.name AS name, numFollowers, city.city AS city, city.state AS state, city.country AS country;

City in which most-followed person lives:
shape: (1, 5)
┌───────────────┬──────────────┬────────┬───────┬───────────────┐
│ name          ┆ numFollowers ┆ city   ┆ state ┆ country       │
│ ---           ┆ ---          ┆ ---    ┆ ---   ┆ ---           │
│ str           ┆ i64          ┆ str    ┆ str   ┆ str           │
╞═══════════════╪══════════════╪════════╪═══════╪═══════════════╡
│ Rachel Cooper ┆ 4998         ┆ Austin ┆ Texas ┆ United States │
└───────────────┴──────────────┴────────┴───────┴───────────────┘
Query 3 completed in 0.014856s

Query 3:

        MATCH (p:Person) -[:LivesIn]-> (c:City)-[*1..2]-> (co:Country {country: $country})
        RETURN c.city AS city, avg(p.age) AS averageAge
        ORDER BY averageAge LIMIT 5;

Cities with lowest average age in Canada:
shape: (5, 2)
┌───────────┬────────────┐
│ city      ┆ averageAge │
│ ---       ┆ ---        │
│ str       ┆ f64        │
╞═══════════╪════════════╡
│ Montreal  ┆ 37.310934  │
│ Calgary   ┆ 37.592098  │
│ Toronto   ┆ 37.705746  │
│ Edmonton  ┆ 37.931609  │
│ Vancouver ┆ 38.011002  │
└───────────┴────────────┘
Query 4 completed in 0.017843s

Query 4:

        MATCH (p:Person)-[:LivesIn]->(ci:City)-[*1..2]->(country:Country)
        WHERE p.age > $age_lower AND p.age < $age_upper
        RETURN country.country AS countries, count(country) AS personCounts
        ORDER BY personCounts DESC LIMIT 3;

Persons between ages 30-40 in each country:
shape: (3, 2)
┌────────────────┬──────────────┐
│ countries      ┆ personCounts │
│ ---            ┆ ---          │
│ str            ┆ i64          │
╞════════════════╪══════════════╡
│ United States  ┆ 24983        │
│ Canada         ┆ 2514         │
│ United Kingdom ┆ 1498         │
└────────────────┴──────────────┘
Queries completed in 1.5745s


prrao87 · 2023-08-12T12:50:20Z

Hi @andyfengHKU, I agree with your explanation about query times - there's a fair amount of lazy evaluation and fetching going on when you submit queries to Neo4j via Python, and your explanation about not passing too much data across the network in client-server architectures makes sense. In both DBs, we are anyway materializing the query results and converting them to a polars DataFrame for display via arrow, so placing the Timer blocks as you did makes sense.

In any real situation, this is sort of what we'd be doing anyway, and the overall run time of the query request/script is what matters to the end user, so this should be good!

A couple more points:

- Query 2: I noticed that you replaced count(follower) with count(follower.id), so I presume that the segfault in "Segfault when running query 2" (#7) was caused by counting on the node directly rather than on a node property. Going forward, is this something to keep in mind when using Kùzu?
- For secondary aliases, in any case it makes sense to sanitize the Cypher query for things like this. In this case, the aliasing was not done well to begin with, so this fix is perfect 👍
- Queries 3 and 4, from "Running queries 3 and 4 with parameters doesn't work" (#8): It might make sense to document that a variable-length query needs to specify the lower bound?

prrao87 · 2023-08-12T12:51:08Z

I updated the docs and I think this PR has served its purpose in fixing the roadblocks! I'll go ahead and merge. We can continue optimizing and testing things out in future releases too. Thanks!

andyfengHKU and others added 3 commits August 11, 2023 17:09

fix timer and kuzudb queries

5ebb2ad

Add lower bound 1 to variable length query in Neo4j

12e9fe8

- To be consistent in Cypher syntax across DBs

Update timing numbers in docs

507ea66

prrao87 merged commit b47fc4c into main Aug 12, 2023

prrao87 deleted the xiyang-kuzudb-fix branch August 12, 2023 12:51
