Update timer and fix kuzudb queries #12

Merged
merged 3 commits into from
Aug 12, 2023

Conversation

andyfengHKU
Collaborator

@andyfengHKU andyfengHKU commented Aug 11, 2023

Changes

Closes #7
Closes #8

Timer update

Previously, the timer only recorded query execution time and ignored result iteration time. Depending on the database architecture, result iteration time may also be significant.

Disclaimer: I might be wrong about Neo4j architecture, so feel free to point it out.

Based on my previous observations of using Neo4j, its client-server model doesn't materialize the full query result by the end of session.run(). Instead, it relies on the user pulling (i.e. consuming) the result to continue execution. For example, if the query returns 10K tuples, session.run() might materialize only the first batch, say 1K. As the user requests more data, it incrementally materializes the remaining 9K tuples batch by batch. This makes a lot of sense in a client-server architecture, since we don't want to send a large amount of data over the network.

On the other hand, Kùzu is embedded and materializes the full result by the end of connection.execute(). So I think the most straightforward comparison is to measure end-to-end time, i.e. the time from passing in a query string to getting back an Arrow table.

The timer is modified in the same way for both Kùzu and Neo4j.
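
To illustrate why the timer placement matters, here is a minimal sketch using only the standard library. Everything in it is hypothetical: `timer`, `LazyResult`, and `run_query` are stand-ins I made up for illustration, not the benchmark's actual Timer or any real driver API. `LazyResult` simulates a cursor that has only the first batch ready when `run_query()` returns and fetches the rest as rows are consumed.

```python
import time
from contextlib import contextmanager


@contextmanager
def timer(label, sink):
    """Hypothetical helper: store the elapsed wall-clock seconds under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        sink[label] = time.perf_counter() - start


class LazyResult:
    """Hypothetical stand-in for a client-server cursor that fetches rows in batches."""

    def __init__(self, total_rows, batch_size):
        self.total_rows = total_rows
        self.batch_size = batch_size
        self.batches_fetched = 0
        # Only the first batch is materialized when the query call returns.
        self._first_batch = self._fetch(0)

    def _fetch(self, start):
        self.batches_fetched += 1  # one simulated round trip per batch
        stop = min(start + self.batch_size, self.total_rows)
        return list(range(start, stop))

    def __iter__(self):
        start = 0
        batch = self._first_batch
        while batch:
            yield from batch
            start += self.batch_size
            batch = self._fetch(start) if start < self.total_rows else []


def run_query(total_rows=10_000, batch_size=1_000):
    """Simulated query call: returns a cursor, not the full result."""
    return LazyResult(total_rows, batch_size)


timings = {}

# Old placement: only the query call is inside the timed block.
with timer("execute only", timings):
    result = run_query()
batches_after_run = result.batches_fetched

# New placement: the timed block also covers consuming every row.
with timer("end to end", timings):
    result = run_query()
    rows = list(result)
batches_after_iteration = result.batches_fetched

print(batches_after_run, batches_after_iteration, len(rows))
```

In this simulation, the first timed block sees only one of the ten batches, so the remaining work would go unmeasured. Wrapping the result consumption in the same block is what makes the measurement end-to-end.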

Kùzu query update

  • Query 2: Removed the secondary alias. This is a bug that should be fixed in the Kùzu repo.
  • Query 3: Replaced with the same query 3 as in the Neo4j folder. Added a lower bound of 1 to the variable-length query.
  • Query 4: Added a lower bound of 1 to the variable-length query.
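
The lower-bound change for queries 3 and 4 can also be applied mechanically. Below is a rough sketch of a hypothetical helper, `add_lower_bound`, that rewrites patterns like `[*..2]` into `[*1..2]`. It is a plain regex over the query string, not a Cypher parser, and it is not part of this PR. It assumes an omitted lower bound means 1, which is the Cypher default, so the rewritten query should keep the same meaning.

```python
import re

# Matches a relationship pattern whose variable-length part has no lower bound,
# e.g. "[*..2" or "[:LIVES_IN*..3". Patterns such as "[*1..2" are left alone.
_OPEN_LOWER_BOUND = re.compile(r"(\[[^\[\]]*\*)\s*\.\.\s*(\d+)")


def add_lower_bound(query: str, lower: int = 1) -> str:
    """Hypothetical helper: insert an explicit lower bound into variable-length patterns."""
    return _OPEN_LOWER_BOUND.sub(
        lambda m: f"{m.group(1)}{lower}..{m.group(2)}", query
    )


neo4j_style = "MATCH (p:Person)-[:LIVES_IN]->(ci:City)-[*..2]->(country:Country)"
print(add_lower_bound(neo4j_style))
```

A regex like this will not handle every query shape (for example, text inside string literals), so treat it as a convenience for simple benchmark queries only.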

For the Kùzu team: I think we should try to be a drop-in replacement for Cypher so that users don't need to convert their queries. We should be able to do this for read/write statements for sure.

Neo4j result

Query 1 completed in 2.493329s

Query 1:

        MATCH (follower:Person)-[:FOLLOWS]->(person:Person)
        RETURN person.personID AS personID, person.name AS name, count(follower) AS numFollowers
        ORDER BY numFollowers DESC LIMIT 3

Top 3 most-followed persons:
shape: (3, 3)
┌──────────┬────────────────┬──────────────┐
│ personID ┆ name           ┆ numFollowers │
│ ---      ┆ ---            ┆ ---          │
│ i64      ┆ str            ┆ i64          │
╞══════════╪════════════════╪══════════════╡
│ 85723    ┆ Rachel Cooper  ┆ 4998         │
│ 68753    ┆ Claudia Booker ┆ 4985         │
│ 54696    ┆ Brian Burgess  ┆ 4976         │
└──────────┴────────────────┴──────────────┘
Query 2 completed in 0.911550s

Query 2:

        MATCH (follower:Person) -[:FOLLOWS]-> (person:Person)
        WITH person, count(follower) as followers
        ORDER BY followers DESC LIMIT 1
        MATCH (person) -[:LIVES_IN]-> (city:City)
        RETURN person.name AS name, followers AS numFollowers, city.city AS city, city.state AS state, city.country AS country

City in which most-followed person lives:
shape: (1, 5)
┌───────────────┬──────────────┬────────┬───────┬───────────────┐
│ name          ┆ numFollowers ┆ city   ┆ state ┆ country       │
│ ---           ┆ ---          ┆ ---    ┆ ---   ┆ ---           │
│ str           ┆ i64          ┆ str    ┆ str   ┆ str           │
╞═══════════════╪══════════════╪════════╪═══════╪═══════════════╡
│ Rachel Cooper ┆ 4998         ┆ Austin ┆ Texas ┆ United States │
└───────────────┴──────────────┴────────┴───────┴───────────────┘
Query 3 completed in 0.152607s

Query 3:

        MATCH (p:Person) -[:LIVES_IN]-> (c:City) -[*..2]-> (co:Country {country: $country})
        RETURN c.city AS city, avg(p.age) AS averageAge
        ORDER BY averageAge LIMIT 5

Cities with lowest average age in Canada:
shape: (5, 2)
┌───────────┬────────────┐
│ city      ┆ averageAge │
│ ---       ┆ ---        │
│ str       ┆ f64        │
╞═══════════╪════════════╡
│ Montreal  ┆ 37.310934  │
│ Calgary   ┆ 37.592098  │
│ Toronto   ┆ 37.705746  │
│ Edmonton  ┆ 37.931609  │
│ Vancouver ┆ 38.011002  │
└───────────┴────────────┘
Query 4 completed in 0.156949s

Query 4:

        MATCH (p:Person)-[:LIVES_IN]->(ci:City)-[*..2]->(country:Country)
        WHERE p.age > $age_lower AND p.age < $age_upper
        RETURN country.country AS countries, count(country) AS personCounts
        ORDER BY personCounts DESC LIMIT 3

Persons between ages 30-40 in each country:
shape: (3, 2)
┌────────────────┬──────────────┐
│ countries      ┆ personCounts │
│ ---            ┆ ---          │
│ str            ┆ i64          │
╞════════════════╪══════════════╡
│ United States  ┆ 24983        │
│ Canada         ┆ 2514         │
│ United Kingdom ┆ 1498         │
└────────────────┴──────────────┘
Query script completed in 3.723656s

Kùzu result

Query 1 completed in 1.177789s

Query 1:

        MATCH (follower:Person)-[:Follows]->(person:Person)
        RETURN person.id AS personID, person.name AS name, count(follower.id) AS numFollowers
        ORDER BY numFollowers DESC LIMIT 3;

Top 3 most-followed persons:
shape: (3, 3)
┌──────────┬────────────────┬──────────────┐
│ personID ┆ name           ┆ numFollowers │
│ ---      ┆ ---            ┆ ---          │
│ i64      ┆ str            ┆ i64          │
╞══════════╪════════════════╪══════════════╡
│ 85723    ┆ Rachel Cooper  ┆ 4998         │
│ 68753    ┆ Claudia Booker ┆ 4985         │
│ 54696    ┆ Brian Burgess  ┆ 4976         │
└──────────┴────────────────┴──────────────┘
Query 2 completed in 0.358639s

Query 2:

        MATCH (follower:Person)-[:Follows]->(person:Person)
        WITH person, count(follower.id) as numFollowers
        ORDER BY numFollowers DESC LIMIT 1
        MATCH (person) -[:LivesIn]-> (city:City)
        RETURN person.name AS name, numFollowers, city.city AS city, city.state AS state, city.country AS country;

City in which most-followed person lives:
shape: (1, 5)
┌───────────────┬──────────────┬────────┬───────┬───────────────┐
│ name          ┆ numFollowers ┆ city   ┆ state ┆ country       │
│ ---           ┆ ---          ┆ ---    ┆ ---   ┆ ---           │
│ str           ┆ i64          ┆ str    ┆ str   ┆ str           │
╞═══════════════╪══════════════╪════════╪═══════╪═══════════════╡
│ Rachel Cooper ┆ 4998         ┆ Austin ┆ Texas ┆ United States │
└───────────────┴──────────────┴────────┴───────┴───────────────┘
Query 3 completed in 0.014856s

Query 3:

        MATCH (p:Person) -[:LivesIn]-> (c:City)-[*1..2]-> (co:Country {country: $country})
        RETURN c.city AS city, avg(p.age) AS averageAge
        ORDER BY averageAge LIMIT 5;

Cities with lowest average age in Canada:
shape: (5, 2)
┌───────────┬────────────┐
│ city      ┆ averageAge │
│ ---       ┆ ---        │
│ str       ┆ f64        │
╞═══════════╪════════════╡
│ Montreal  ┆ 37.310934  │
│ Calgary   ┆ 37.592098  │
│ Toronto   ┆ 37.705746  │
│ Edmonton  ┆ 37.931609  │
│ Vancouver ┆ 38.011002  │
└───────────┴────────────┘
Query 4 completed in 0.017843s

Query 4:

        MATCH (p:Person)-[:LivesIn]->(ci:City)-[*1..2]->(country:Country)
        WHERE p.age > $age_lower AND p.age < $age_upper
        RETURN country.country AS countries, count(country) AS personCounts
        ORDER BY personCounts DESC LIMIT 3;

Persons between ages 30-40 in each country:
shape: (3, 2)
┌────────────────┬──────────────┐
│ countries      ┆ personCounts │
│ ---            ┆ ---          │
│ str            ┆ i64          │
╞════════════════╪══════════════╡
│ United States  ┆ 24983        │
│ Canada         ┆ 2514         │
│ United Kingdom ┆ 1498         │
└────────────────┴──────────────┘
Queries completed in 1.5745s
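
For a quick side-by-side, the per-query timings reported above can be turned into ratios. This is just arithmetic on the numbers from this one run on one machine, so it should be read as a rough indication rather than a benchmark result. The dictionary names are made up for illustration.

```python
# Per-query times in seconds, copied from the two result listings above.
neo4j_seconds = {1: 2.493329, 2: 0.911550, 3: 0.152607, 4: 0.156949}
kuzu_seconds = {1: 1.177789, 2: 0.358639, 3: 0.014856, 4: 0.017843}

# Ratio > 1 means the Neo4j run took longer than the Kùzu run for that query.
ratios = {q: neo4j_seconds[q] / kuzu_seconds[q] for q in neo4j_seconds}

for query, ratio in ratios.items():
    print(f"Query {query}: Neo4j / Kùzu = {ratio:.1f}x")
```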

@prrao87
Owner

prrao87 commented Aug 12, 2023

Hi @andyfengHKU, I agree with your explanation about query times - there's a fair amount of lazy evaluation and fetching going on when you submit queries to Neo4j via Python, and your explanation about not passing too much data across the network in client-server architectures makes sense. In both DBs, we are anyway materializing the query results and converting them to a polars DataFrame for display via arrow, so placing the Timer blocks as you did makes sense.

In any real situation, this is sort of what we'd be doing anyway, and the overall run time of the query request/script is what matters to the end user, so this should be good!

A couple more points:

  • Query 2: I noticed that you replaced count(follower) with count(follower.id), so I presume the segfault in "Segfault when running query 2" (#7) was caused by counting on the node directly rather than on a node property. Going forward, is this something to keep in mind when using Kùzu?
    • For secondary aliases, in any case it makes sense to sanitize the Cypher query for things like this, and in this case, the aliasing was not done well to begin with, so this fix is perfect 👍
  • For queries 3 and 4 in "Running queries 3 and 4 with parameters doesn't work" (#8): it might make sense to document that a variable-length query needs to specify the lower bound?

@prrao87
Owner

prrao87 commented Aug 12, 2023

I updated the docs and I think this PR has served its purpose in fixing the roadblocks! I'll go ahead and merge. We can continue optimizing and testing things out in future releases too. Thanks!

@prrao87 prrao87 merged commit b47fc4c into main Aug 12, 2023
@prrao87 prrao87 deleted the xiyang-kuzudb-fix branch August 12, 2023 12:51