
Add JOB benchmark dataset [1/N] (imdb dataset) #12497

Merged: 4 commits into apache:main on Sep 24, 2024
Conversation

@doupache (Contributor) commented Sep 17, 2024

Which issue does this PR close?

Partially closes #12311.

cd benchmarks/   
./bench.sh data imdb

All IMDB tables are now generated in benchmarks/data/imdb/*.parquet

Rationale for this change

Add the IMDB dataset for JOB (Join Order Benchmark) benchmarking.

What changes are included in this PR?

Download the dataset and convert it to Parquet files.

Are these changes tested?

Yes -- just as the TPCH dataset is generated with:
./bench.sh data tpch

the IMDB dataset is generated with:
./bench.sh data imdb
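Under the hood, the data step boils down to download-then-extract (the `imdb.tgz` archive visible in the listing later in this thread). A minimal Python sketch of that idea, not the actual bench.sh logic -- the URL and paths here are placeholders:

```python
import tarfile
import urllib.request
from pathlib import Path

# Placeholder values -- the real URL and layout live in bench.sh / the imdb binary.
IMDB_URL = "https://example.com/imdb.tgz"
DATA_DIR = Path("benchmarks/data/imdb")

def download_and_extract(url: str, dest: Path) -> list[str]:
    """Download a .tgz archive into dest and extract it there, returning member names."""
    dest.mkdir(parents=True, exist_ok=True)
    archive = dest / "imdb.tgz"
    if not archive.exists():  # skip re-downloading on reruns
        urllib.request.urlretrieve(url, archive)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest)
        return tar.getnames()
```

After this step the CSV files sit next to the archive, ready for the convert-to-Parquet pass.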

Are there any user-facing changes?

No.

@doupache (Contributor, author) commented:

Unlike TPCH, which uses table-named folders with partitioned parquet files, IMDB has smaller tables (largest is 360MB).
We can convert each IMDB table to a single, non-partitioned parquet file.

(image attachment)

@austin362667 (Contributor) commented:

Thanks @doupache for paving the way! I have a few nit suggestions:

  1. Do we prefer the benchmark name imdb or job?
  2. Could you add [1/N] at the beginning of the PR title to help us track the follow-up progress?

@doupache (Contributor, author) commented:

Thanks @austin362667 for the suggestions. IMDB is more suitable than JOB because it is specific and avoids confusion; "job" can be used in many different contexts.

Adding 'progress' to the title is also a good idea 👍

@doupache doupache changed the title Add JOB benchmark dataset (imdb dataset) Add JOB benchmark dataset 1/N (imdb dataset) Sep 17, 2024
@doupache doupache changed the title Add JOB benchmark dataset 1/N (imdb dataset) Add JOB benchmark dataset [1/N] (imdb dataset) Sep 17, 2024
benchmarks/src/imdb/convert.rs -- review thread (outdated, resolved)
@alamb (Contributor) commented Sep 19, 2024

Thanks @doupache -- I started the CI jobs, and I will try to test this out locally over the next few days.

@doupache (Contributor, author) commented Sep 20, 2024

Thanks @austin362667 and @alamb.

I have updated the PR and learned some Cargo tips from @austin362667:
compiling in debug mode during development is much faster.

# 1. Build in debug mode
cd benchmarks && cargo build

# 2. Run the converter
cargo run --bin imdb -- convert --input ./data/imdb/ --output ./data/imdb/ --format parquet

I also tested all 21 Parquet files as follows; the schema is from the original dataset.

-- create table
CREATE EXTERNAL TABLE name (
    id INTEGER NOT NULL PRIMARY KEY,
    name STRING NOT NULL,
    imdb_index STRING,
    imdb_id INTEGER,
    gender STRING,
    name_pcode_cf STRING,
    name_pcode_nf STRING,
    surname_pcode STRING,
    md5sum STRING
)
STORED AS PARQUET
LOCATION '../benchmarks/data/imdb/name.parquet';

-- read
SELECT * FROM name LIMIT 5;

@austin362667 (Contributor) commented:

> Unlike TPCH, which uses table-named folders with partitioned parquet files, IMDB has smaller tables (largest is 360MB).
> We can convert each IMDB table to a single, non-partitioned parquet file.

Sure! I think it's for historical reasons: ParquetExec didn't support parallel execution over a single Parquet file back then. That is now supported via #5057.
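The idea behind parallel scans of a single Parquet file is that row groups are independent units, so workers can each process a disjoint subset and merge partial results. A toy pure-Python illustration of that concept (not DataFusion code -- the "row groups" here are just lists standing in for Parquet row groups):

```python
from concurrent.futures import ThreadPoolExecutor

# Toy stand-in for a single Parquet file: a list of "row groups", each a list of rows.
row_groups = [list(range(i * 100, (i + 1) * 100)) for i in range(8)]

def scan_row_group(rg: list[int]) -> int:
    """Pretend 'scan': aggregate one row group independently of the others."""
    return sum(rg)

def parallel_scan(groups, workers: int = 4) -> int:
    # Each worker handles whole row groups; partial aggregates are merged at the end.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(scan_row_group, groups))
```

Because row groups never overlap, the merged result is identical regardless of how many workers scan the file.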

@alamb (Contributor) left a comment:

Thanks @doupache and @austin362667 -- I tried this out locally and it worked great; it made a bunch of parquet files 👍

(venv) andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion/benchmarks/data/imdb$ ls -l
total 12672680
-rw-r--r--@ 1 andrewlamb  staff    70M May  8  2014 aka_name.csv
-rw-r--r--@ 1 andrewlamb  staff    32M Sep 23 13:37 aka_name.parquet
-rw-r--r--@ 1 andrewlamb  staff    37M May  8  2014 aka_title.csv
-rw-r--r--@ 1 andrewlamb  staff    15M Sep 23 13:37 aka_title.parquet
-rw-r--r--@ 1 andrewlamb  staff   1.3G May  8  2014 cast_info.csv
-rw-r--r--@ 1 andrewlamb  staff   351M Sep 23 13:37 cast_info.parquet
-rw-r--r--@ 1 andrewlamb  staff   206M May  8  2014 char_name.csv
-rw-r--r--@ 1 andrewlamb  staff   105M Sep 23 13:37 char_name.parquet
-rw-r--r--@ 1 andrewlamb  staff    45B May  8  2014 comp_cast_type.csv
-rw-r--r--@ 1 andrewlamb  staff   517B Sep 23 13:37 comp_cast_type.parquet
-rw-r--r--@ 1 andrewlamb  staff    17M May  8  2014 company_name.csv
-rw-r--r--@ 1 andrewlamb  staff   8.5M Sep 23 13:37 company_name.parquet
-rw-r--r--@ 1 andrewlamb  staff    92B May  8  2014 company_type.csv
-rw-r--r--@ 1 andrewlamb  staff   650B Sep 23 13:37 company_type.parquet
-rw-r--r--@ 1 andrewlamb  staff   2.3M May  8  2014 complete_cast.csv
-rw-r--r--@ 1 andrewlamb  staff   1.1M Sep 23 13:37 complete_cast.parquet
-rw-r--r--@ 1 andrewlamb  staff   1.2G Sep 23 13:32 imdb.tgz
-rw-r--r--@ 1 andrewlamb  staff   1.9K May  8  2014 info_type.csv
-rw-r--r--@ 1 andrewlamb  staff   1.9K Sep 23 13:37 info_type.parquet
-rw-r--r--@ 1 andrewlamb  staff   3.6M May  8  2014 keyword.csv
-rw-r--r--@ 1 andrewlamb  staff   2.0M Sep 23 13:37 keyword.parquet
-rw-r--r--@ 1 andrewlamb  staff    85B May  8  2014 kind_type.csv
-rw-r--r--@ 1 andrewlamb  staff   605B Sep 23 13:37 kind_type.parquet
-rw-r--r--@ 1 andrewlamb  staff   261B May  8  2014 link_type.csv
-rw-r--r--@ 1 andrewlamb  staff   767B Sep 23 13:37 link_type.parquet
-rw-r--r--@ 1 andrewlamb  staff    89M May  8  2014 movie_companies.csv
-rw-r--r--@ 1 andrewlamb  staff    25M Sep 23 13:37 movie_companies.parquet
-rw-r--r--@ 1 andrewlamb  staff   919M May  8  2014 movie_info.csv
-rw-r--r--@ 1 andrewlamb  staff   293M Sep 23 13:37 movie_info.parquet
-rw-r--r--@ 1 andrewlamb  staff    34M May  8  2014 movie_info_idx.csv
-rw-r--r--@ 1 andrewlamb  staff    11M Sep 23 13:37 movie_info_idx.parquet
-rw-r--r--@ 1 andrewlamb  staff    89M May  8  2014 movie_keyword.csv
-rw-r--r--@ 1 andrewlamb  staff    27M Sep 23 13:37 movie_keyword.parquet
-rw-r--r--@ 1 andrewlamb  staff   641K May  8  2014 movie_link.csv
-rw-r--r--@ 1 andrewlamb  staff   274K Sep 23 13:37 movie_link.parquet
-rw-r--r--@ 1 andrewlamb  staff   306M May  8  2014 name.csv
-rw-r--r--@ 1 andrewlamb  staff   135M Sep 23 13:37 name.parquet
-rw-r--r--@ 1 andrewlamb  staff   381M May  8  2014 person_info.csv
-rw-r--r--@ 1 andrewlamb  staff   143M Sep 23 13:37 person_info.parquet
-rw-r--r--@ 1 andrewlamb  staff   160B May  8  2014 role_type.csv
-rw-r--r--@ 1 andrewlamb  staff   646B Sep 23 13:37 role_type.parquet
-rw-r--r--@ 1 andrewlamb  staff   4.2K Nov 28  2014 schematext.sql
-rw-r--r--@ 1 andrewlamb  staff   194M May  8  2014 title.csv
-rw-r--r--@ 1 andrewlamb  staff    88M Sep 23 13:37 title.parquet

The only thing I think we should do is add imdb to the list of benchmarks in the bench.sh help text, but we can do that as a follow-on PR:

**********
* Benchmarks
**********
all(default): Data/Run/Compare for all benchmarks
tpch:                   TPCH inspired benchmark on Scale Factor (SF) 1 (~1GB), single parquet file per table, hash join
tpch_mem:               TPCH inspired benchmark on Scale Factor (SF) 1 (~1GB), query from memory
tpch10:                 TPCH inspired benchmark on Scale Factor (SF) 10 (~10GB), single parquet file per table, hash join
tpch_mem10:             TPCH inspired benchmark on Scale Factor (SF) 10 (~10GB), query from memory
parquet:                Benchmark of parquet reader's filtering speed
sort:                   Benchmark of sorting speed
clickbench_1:           ClickBench queries against a single parquet file
clickbench_partitioned: ClickBench queries against a partitioned (100 files) parquet
clickbench_extended:    ClickBench "inspired" queries against a single parquet (DataFusion specific)

@austin362667 (Contributor) left a comment:

Thanks @doupache for the contribution and @alamb for the review~
In the follow-up PR [2/N] #1252 I'll:

  1. add the imdb help text and fix what @andygrove pointed out,
  2. use UInt32 for non-negative id columns, and
  3. use a single context.

@alamb alamb merged commit 6546479 into apache:main Sep 24, 2024
24 checks passed
@alamb (Contributor) commented Sep 24, 2024

Let's keep improving things in the next PR. Thanks @austin362667

bgjackma pushed a commit to bgjackma/datafusion that referenced this pull request Sep 25, 2024
* imdb dataset

* cargo fmt

* we should also extract the tar after download

* we should not skip the last col
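The last commit message ("we should not skip last col") hints at a classic CSV-conversion pitfall: a manual split can silently drop the final column, especially when it is empty. A minimal Python illustration of the hazard (the sample row is illustrative, not real IMDB data; the actual fix is in the Rust converter):

```python
import csv
import io

# Illustrative row: note the trailing empty field, which is a real
# column and must not be dropped during conversion.
line = "1,Fred Astaire,,,m,A2362,A2362,\n"

def parse_row(text: str) -> list[str]:
    """Parse one CSV line, keeping every column including a trailing empty one."""
    return next(csv.reader(io.StringIO(text)))

def buggy_parse_row(text: str) -> list[str]:
    # A manual split that accidentally skips the last column.
    return text.rstrip("\n").split(",")[:-1]
```

A proper CSV parser treats the trailing separator as delimiting a final empty field, whereas the buggy split returns one column too few.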
Successfully merging this pull request may close these issues:

Add IMDB queries (a.k.a. JOB - Join Order Benchmark) to DataFusion benchmark suite