Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add the Faker connector #23691

Merged
merged 2 commits into from
Oct 24, 2024
Merged

Add the Faker connector #23691

merged 2 commits into from
Oct 24, 2024

Conversation

nineinchnick
Copy link
Member

@nineinchnick nineinchnick commented Oct 6, 2024

Description

The purpose of this connector would be similar to tpch/tpcds - generate random data to be used for all kinds of testing. The main difference is that this works with all Trino data types and any kind of schema. It's also using the Datafaker library, which allows generating more sophisticated random data in multiple languages.

For more details, including usage examples, see https://github.com/nineinchnick/trino-faker/

I'll add documentation if the general idea of adding this connector will be approved.

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

## Faker
* Add Faker connector to allow easy generation of data for testing. ({issue}`23691`)

@cla-bot cla-bot bot added the cla-signed label Oct 6, 2024
@nineinchnick
Copy link
Member Author

@raunaqmorarka thanks a lot for the review, everything looks much nicer now!

@mosabua
Copy link
Member

mosabua commented Oct 8, 2024

Personally I think this would be a great and very useful connector to add. It would enable numerous use cases around learning Trino, testing, benchmarking and other efforts. Specifically it also goes beyond the rather narrow and stale setup from TPCH/TPCDS.

Comment on lines 172 to 175
for (int i = 0; i < positions; i++) {
generator.accept(blockBuilder, completedRows + i);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally what we want is

generator.accept(blockBuilder, completedRows, positions);

Generate N positions in a batch.
This is a megamorphic call site, so looping over it per position can be slow

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know what's a megamorphic call site, can you share some references? This would change the interfaces a lot, and I'm not confident I can do this correctly, without understanding the optimization. Is this a blocker for this PR?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW I was benchmarking it like this, comparing to the sequence table function, and it was fast enough. I'm not sure what gains we can expect here.

Copy link
Member

@raunaqmorarka raunaqmorarka Oct 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The short version is that when you have a method call that can be dispatched to more than 2 possible implementations, the JIT may not do certain optimizations and the code may end up being much slower than what it would have been had there been just 1 or 2 possible implementations.
Longer version in http://www.insightfullogic.com/2014/May/12/fast-and-megamorphic-what-influences-method-invoca/
To observe the perf impact, you have to execute this code for 3 different data types and compare the throughput with what you would get if you only executed it on one data type.
This can be tackled in a follow-up, not important enough for current PR to tackle.

@ScalarFunction
@Description("Generate a random string based on the Faker expression")
@SqlType(VARCHAR)
public Slice randomExpression(@SqlType(VARCHAR) Slice expression)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generateRandom ?

@raunaqmorarka raunaqmorarka merged commit 62a0cbf into trinodb:master Oct 24, 2024
93 checks passed
@raunaqmorarka
Copy link
Member

@nineinchnick please work with @mosabua to update docs for this

@github-actions github-actions bot added this to the 464 milestone Oct 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

4 participants