Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

File engine with Parquet format produces inconsistent results #23862

Closed
dgloeckner opened this issue May 3, 2021 · 1 comment · Fixed by #33302
Closed

File engine with Parquet format produces inconsistent results #23862

dgloeckner opened this issue May 3, 2021 · 1 comment · Fixed by #33302
Assignees
Labels
bug Confirmed user-visible misbehaviour in official release

Comments

@dgloeckner
Copy link

You have to provide the following information whenever possible.

Describe the bug
When using multiple INSERT statements on a File table with format Parquet, the underlying Parquet file gets inconsistent.

  • The file's metadata does not reflect the inserted record count.
  • Parquet readers will fail to read data from the file.

Exception from Apache Spark (could be any Parquet processor) processing the file:

Caused by: java.io.IOException: Expected 27540 values in column chunk at file:/tmp/parq/data.parquet offset 645172 but got 33792 values instead over 1 pages ending at file offset 696950

Does it reproduce on recent release?
The list of releases

How to reproduce
Using ClickHouse server version 20.8.11.17.

CREATE TABLE BUG (ID String) ENGINE=File(Parquet)
INSERT INTO BUG VALUES ('Bla')
INSERT INTO BUG VALUES ('Bla')
INSERT INTO BUG VALUES ('Bla')

select count(*) from BUG
SELECT count(*)
FROM BUG
┌─count()─┐
│ 1 │
└─────────┘
1 rows in set. Elapsed: 0.003 sec

parquet-tools reports incorrect num_rows

parquet-tools inspect /var/clickhouse/data/default/BUG/data.Parquet

############ file meta data ############
created_by: parquet-cpp version 1.5.1-SNAPSHOT
num_columns: 1
num_rows: 1
num_row_groups: 1
format_version: 1.0
serialized_size: 154

############ Columns ############
ID
############ Column(ID) ############
name: ID
path: ID
max_definition_level: 0
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8

Expected behavior
The File engine with format Parquet should either support multiple INSERT statements or fail with a clear error message on the second INSERT.

@dgloeckner dgloeckner added the bug Confirmed user-visible misbehaviour in official release label May 3, 2021
@dgloeckner
Copy link
Author

Note that this issue is different from #15022 as no exception is thrown in the INSERT here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Confirmed user-visible misbehaviour in official release
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants