
Add JOB benchmark dataset [1/N] (imdb dataset) #12497

Merged: 4 commits into apache:main on Sep 24, 2024
Conversation

@doupache (Contributor) commented Sep 17, 2024

Which issue does this PR close?

Partially closes #12311.

cd benchmarks/   
./bench.sh data imdb

All IMDB tables are now generated in benchmarks/data/imdb/*.parquet

Rationale for this change

Add the IMDB dataset for JOB (Join Order Benchmark) benchmarking.

What changes are included in this PR?

Download the dataset and convert it to Parquet files.

Are these changes tested?

Yes -- just as the TPCH dataset is generated with:
./bench.sh data tpch

the IMDB dataset is generated with:
./bench.sh data imdb
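Under the hood, the data step boils down to download-then-extract (the `imdb.tgz` archive visible in the listing later in this thread). A minimal Python sketch of that idea, not the actual bench.sh logic -- the URL and paths here are placeholders:

```python
import tarfile
import urllib.request
from pathlib import Path

# Placeholder values -- the real URL and layout live in bench.sh / the imdb binary.
IMDB_URL = "https://example.com/imdb.tgz"
DATA_DIR = Path("benchmarks/data/imdb")

def download_and_extract(url: str, dest: Path) -> list[str]:
    """Download a .tgz archive into dest and extract it there, returning member names."""
    dest.mkdir(parents=True, exist_ok=True)
    archive = dest / "imdb.tgz"
    if not archive.exists():  # skip re-downloading on reruns
        urllib.request.urlretrieve(url, archive)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest)
        return tar.getnames()
```

After this step the CSV files sit next to the archive, ready for the convert-to-Parquet pass.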

Are there any user-facing changes?

No.

@doupache (Contributor, author) commented:

Unlike TPCH, which uses table-named folders with partitioned parquet files, IMDB has smaller tables (largest is 360MB).
We can convert each IMDB table to a single, non-partitioned parquet file.

(image attachment)

@austin362667 (Contributor) commented:

Thanks @doupache for paving the way! I have a few nit suggestions:

  1. Do we prefer the benchmark name imdb or job?
  2. Could you add [1/N] at the beginning of the PR title to help us track the follow-up progress?

@doupache (Contributor, author) commented:

Thanks @austin362667 for the suggestions. IMDB is more suitable than JOB because it is specific and avoids confusion; "job" can be used in many different contexts.

Adding 'progress' to the title is also a good idea 👍

@doupache doupache changed the title Add JOB benchmark dataset (imdb dataset) Add JOB benchmark dataset 1/N (imdb dataset) Sep 17, 2024
@doupache doupache changed the title Add JOB benchmark dataset 1/N (imdb dataset) Add JOB benchmark dataset [1/N] (imdb dataset) Sep 17, 2024
benchmarks/src/imdb/convert.rs -- review thread (outdated, resolved)
@alamb (Contributor) commented Sep 19, 2024

Thanks @doupache -- I started the CI jobs, and I will try to test this out locally over the next few days.

@doupache (Contributor, author) commented Sep 20, 2024

Thanks @austin362667 and @alamb.

I have updated the PR and learned some Cargo tips from @austin362667:
compiling in debug mode during development is much faster.

# 1. Build in debug mode
cd benchmarks && cargo build

# 2. Run the converter
cargo run --bin imdb -- convert --input ./data/imdb/ --output ./data/imdb/ --format parquet

I also tested all 21 Parquet files as follows; the schema is from the original dataset.

-- create table
CREATE EXTERNAL TABLE name (
    id INTEGER NOT NULL PRIMARY KEY,
    name STRING NOT NULL,
    imdb_index STRING,
    imdb_id INTEGER,
    gender STRING,
    name_pcode_cf STRING,
    name_pcode_nf STRING,
    surname_pcode STRING,
    md5sum STRING
)
STORED AS PARQUET
LOCATION '../benchmarks/data/imdb/name.parquet';

-- read
SELECT * FROM name LIMIT 5;

@austin362667 (Contributor) commented:

> Unlike TPCH, which uses table-named folders with partitioned parquet files, IMDB has smaller tables (largest is 360MB).
> We can convert each IMDB table to a single, non-partitioned parquet file.

Sure! I think it's for historical reasons: ParquetExec didn't support parallel execution over a single Parquet file back then. That is now supported via #5057.
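The idea behind parallel scans of a single Parquet file is that row groups are independent units, so workers can each process a disjoint subset and merge partial results. A toy pure-Python illustration of that concept (not DataFusion code -- the "row groups" here are just lists standing in for Parquet row groups):

```python
from concurrent.futures import ThreadPoolExecutor

# Toy stand-in for a single Parquet file: a list of "row groups", each a list of rows.
row_groups = [list(range(i * 100, (i + 1) * 100)) for i in range(8)]

def scan_row_group(rg: list[int]) -> int:
    """Pretend 'scan': aggregate one row group independently of the others."""
    return sum(rg)

def parallel_scan(groups, workers: int = 4) -> int:
    # Each worker handles whole row groups; partial aggregates are merged at the end.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(scan_row_group, groups))
```

Because row groups never overlap, the merged result is identical regardless of how many workers scan the file.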

@alamb (Contributor) left a comment:

Thanks @doupache and @austin362667 -- I tried this out locally and it worked great; it made a bunch of parquet files 👍

(venv) andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion/benchmarks/data/imdb$ ls -l
total 12672680
-rw-r--r--@ 1 andrewlamb  staff    70M May  8  2014 aka_name.csv
-rw-r--r--@ 1 andrewlamb  staff    32M Sep 23 13:37 aka_name.parquet
-rw-r--r--@ 1 andrewlamb  staff    37M May  8  2014 aka_title.csv
-rw-r--r--@ 1 andrewlamb  staff    15M Sep 23 13:37 aka_title.parquet
-rw-r--r--@ 1 andrewlamb  staff   1.3G May  8  2014 cast_info.csv
-rw-r--r--@ 1 andrewlamb  staff   351M Sep 23 13:37 cast_info.parquet
-rw-r--r--@ 1 andrewlamb  staff   206M May  8  2014 char_name.csv
-rw-r--r--@ 1 andrewlamb  staff   105M Sep 23 13:37 char_name.parquet
-rw-r--r--@ 1 andrewlamb  staff    45B May  8  2014 comp_cast_type.csv
-rw-r--r--@ 1 andrewlamb  staff   517B Sep 23 13:37 comp_cast_type.parquet
-rw-r--r--@ 1 andrewlamb  staff    17M May  8  2014 company_name.csv
-rw-r--r--@ 1 andrewlamb  staff   8.5M Sep 23 13:37 company_name.parquet
-rw-r--r--@ 1 andrewlamb  staff    92B May  8  2014 company_type.csv
-rw-r--r--@ 1 andrewlamb  staff   650B Sep 23 13:37 company_type.parquet
-rw-r--r--@ 1 andrewlamb  staff   2.3M May  8  2014 complete_cast.csv
-rw-r--r--@ 1 andrewlamb  staff   1.1M Sep 23 13:37 complete_cast.parquet
-rw-r--r--@ 1 andrewlamb  staff   1.2G Sep 23 13:32 imdb.tgz
-rw-r--r--@ 1 andrewlamb  staff   1.9K May  8  2014 info_type.csv
-rw-r--r--@ 1 andrewlamb  staff   1.9K Sep 23 13:37 info_type.parquet
-rw-r--r--@ 1 andrewlamb  staff   3.6M May  8  2014 keyword.csv
-rw-r--r--@ 1 andrewlamb  staff   2.0M Sep 23 13:37 keyword.parquet
-rw-r--r--@ 1 andrewlamb  staff    85B May  8  2014 kind_type.csv
-rw-r--r--@ 1 andrewlamb  staff   605B Sep 23 13:37 kind_type.parquet
-rw-r--r--@ 1 andrewlamb  staff   261B May  8  2014 link_type.csv
-rw-r--r--@ 1 andrewlamb  staff   767B Sep 23 13:37 link_type.parquet
-rw-r--r--@ 1 andrewlamb  staff    89M May  8  2014 movie_companies.csv
-rw-r--r--@ 1 andrewlamb  staff    25M Sep 23 13:37 movie_companies.parquet
-rw-r--r--@ 1 andrewlamb  staff   919M May  8  2014 movie_info.csv
-rw-r--r--@ 1 andrewlamb  staff   293M Sep 23 13:37 movie_info.parquet
-rw-r--r--@ 1 andrewlamb  staff    34M May  8  2014 movie_info_idx.csv
-rw-r--r--@ 1 andrewlamb  staff    11M Sep 23 13:37 movie_info_idx.parquet
-rw-r--r--@ 1 andrewlamb  staff    89M May  8  2014 movie_keyword.csv
-rw-r--r--@ 1 andrewlamb  staff    27M Sep 23 13:37 movie_keyword.parquet
-rw-r--r--@ 1 andrewlamb  staff   641K May  8  2014 movie_link.csv
-rw-r--r--@ 1 andrewlamb  staff   274K Sep 23 13:37 movie_link.parquet
-rw-r--r--@ 1 andrewlamb  staff   306M May  8  2014 name.csv
-rw-r--r--@ 1 andrewlamb  staff   135M Sep 23 13:37 name.parquet
-rw-r--r--@ 1 andrewlamb  staff   381M May  8  2014 person_info.csv
-rw-r--r--@ 1 andrewlamb  staff   143M Sep 23 13:37 person_info.parquet
-rw-r--r--@ 1 andrewlamb  staff   160B May  8  2014 role_type.csv
-rw-r--r--@ 1 andrewlamb  staff   646B Sep 23 13:37 role_type.parquet
-rw-r--r--@ 1 andrewlamb  staff   4.2K Nov 28  2014 schematext.sql
-rw-r--r--@ 1 andrewlamb  staff   194M May  8  2014 title.csv
-rw-r--r--@ 1 andrewlamb  staff    88M Sep 23 13:37 title.parquet

The only thing I think we should do is add imdb to the list of benchmarks in the bench.sh help text, but we can do that as a follow-on PR:

**********
* Benchmarks
**********
all(default): Data/Run/Compare for all benchmarks
tpch:                   TPCH inspired benchmark on Scale Factor (SF) 1 (~1GB), single parquet file per table, hash join
tpch_mem:               TPCH inspired benchmark on Scale Factor (SF) 1 (~1GB), query from memory
tpch10:                 TPCH inspired benchmark on Scale Factor (SF) 10 (~10GB), single parquet file per table, hash join
tpch_mem10:             TPCH inspired benchmark on Scale Factor (SF) 10 (~10GB), query from memory
parquet:                Benchmark of parquet reader's filtering speed
sort:                   Benchmark of sorting speed
clickbench_1:           ClickBench queries against a single parquet file
clickbench_partitioned: ClickBench queries against a partitioned (100 files) parquet
clickbench_extended:    ClickBench "inspired" queries against a single parquet (DataFusion specific)

@austin362667 (Contributor) left a comment:

Thanks @doupache for the contribution and @alamb for the review~
In the follow-up PR [2/N] #1252 I'll:

  1. add the imdb help text and fix what @andygrove pointed out,
  2. use UInt32 for non-negative id columns, and
  3. use a single context.

@alamb alamb merged commit 6546479 into apache:main Sep 24, 2024
24 checks passed
@alamb (Contributor) commented Sep 24, 2024

Let's keep improving things in the next PR. Thanks @austin362667

bgjackma pushed a commit to bgjackma/datafusion that referenced this pull request Sep 25, 2024
* imdb dataset

* cargo fmt

* we should also extract the tar after download

* we should not skip the last col
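The last commit message ("we should not skip last col") hints at a classic CSV-conversion pitfall: a manual split can silently drop the final column, especially when it is empty. A minimal Python illustration of the hazard (the sample row is illustrative, not real IMDB data; the actual fix is in the Rust converter):

```python
import csv
import io

# Illustrative row: note the trailing empty field, which is a real
# column and must not be dropped during conversion.
line = "1,Fred Astaire,,,m,A2362,A2362,\n"

def parse_row(text: str) -> list[str]:
    """Parse one CSV line, keeping every column including a trailing empty one."""
    return next(csv.reader(io.StringIO(text)))

def buggy_parse_row(text: str) -> list[str]:
    # A manual split that accidentally skips the last column.
    return text.rstrip("\n").split(",")[:-1]
```

A proper CSV parser treats the trailing separator as delimiting a final empty field, whereas the buggy split returns one column too few.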
Successfully merging this pull request may close these issues:

Add IMDB queries (a.k.a. JOB - Join Order Benchmark) to DataFusion benchmark suite