
Getting Error reading from alluxio://<host>:19998/<path_to_parquet_file> at position xxxxxxx in presto SQL #16833

Open
harsh9898 opened this issue Jan 31, 2023 · 12 comments
Labels
area-jobservice · area-master · area-worker · priority-high · type-bug · workload-presto

Comments


harsh9898 commented Jan 31, 2023

Alluxio Version:
2.7.1

Describe the bug
Alluxio version: 2.7.1
Presto version: 0.268
Presto JDBC version: 0.268
Presto coordinator: 1, workers: 4
Single Alluxio master
Data Source: data loaded from Azure ABFS to Alluxio via distributedLoad
UFS: Azure ABFS

  • data updated periodically
  • metadata sync enabled via alluxio.user.file.metadata.sync.interval=300s; the UFS has not been updated for a week, yet the error still occurs (see the configuration snippet below)
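
For reference, a minimal sketch of where this setting typically lives (the property name is taken from this report; the file path and placement are assumptions about a standard client deployment):

# conf/alluxio-site.properties (illustrative)
# Re-sync metadata with the UFS when cached metadata is older than 300 seconds
alluxio.user.file.metadata.sync.interval=300s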

File format: parquet
Tools used for query in presto: DBeaver, presto CLI

All the data files are cached in Alluxio, and Presto reads them from Alluxio through the Hive metastore.

When I run the query from DBeaver, it fails with the following error:
com.facebook.presto.spi.PrestoException: Error reading from alluxio://:19998/<path_to_parquet_file> at position xxxxxxx

To Reproduce
The error is random: it occurs for some of the files, and if we remove the affected file from Alluxio using the --alluxioOnly option and re-run the query, the error does not occur.
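
For context, the removal step used as a workaround looks roughly like the following (a sketch; the path is a placeholder and exact flags may vary by Alluxio version):

# Remove the affected file from Alluxio storage only, leaving the UFS copy untouched
./bin/alluxio fs rm --alluxioOnly /<path_to_parquet_file>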

Expected behavior
I expect the query to return the desired results without this error.

Urgency
This is a critical error: I am getting it almost daily, around the times when Alluxio restarts automatically.

Are you planning to fix it
Not sure how to do it.

Additional context
Important Notes:
Some of the Parquet files are large.
The error appears at random when the query reads the Parquet files stored under Alluxio; it fails on a different subset of files each day.

harsh9898 added the type-bug label on Jan 31, 2023

jja725 commented Jan 31, 2023

Hi @harsh9898, thanks for reporting the issue. The Alluxio URI looks odd to me: alluxio://:19998/<path_to_parquet_file> (the host appears to be missing).
Do you mind pasting more of the related Presto or Alluxio logs? Thanks


harsh9898 commented Feb 1, 2023

Hi @jja725,

Thank you for your quick reply. Please see my response below.

Sorry for the confusion; here is the correct URI:

  • Alluxio URI: alluxio://<ip_of_host>:19998/<path_to_parquet_file>

NOTE: Once I remove that file from Alluxio storage using --alluxioOnly, the error goes away.

I am still getting the same error. It occurred again, and I have pasted the full error trace here:

com.facebook.presto.spi.PrestoException: Error reading from alluxio://<host_name>:19998/<path_to_parquet_file> at position xxxxxxxxxx
at com.facebook.presto.hive.parquet.HdfsParquetDataSource.readInternal(HdfsParquetDataSource.java:66)
at com.facebook.presto.parquet.AbstractParquetDataSource.readFully(AbstractParquetDataSource.java:60)
at com.facebook.presto.parquet.AbstractParquetDataSource.readFully(AbstractParquetDataSource.java:51)
at com.facebook.presto.parquet.reader.ParquetReader.readPrimitive(ParquetReader.java:247)
at com.facebook.presto.parquet.reader.ParquetReader.readColumnChunk(ParquetReader.java:330)
at com.facebook.presto.parquet.reader.ParquetReader.readBlock(ParquetReader.java:313)
at com.facebook.presto.hive.parquet.ParquetPageSource$ParquetBlockLoader.load(ParquetPageSource.java:193)
at com.facebook.presto.hive.parquet.ParquetPageSource$ParquetBlockLoader.load(ParquetPageSource.java:171)
at com.facebook.presto.common.block.LazyBlock.assureLoaded(LazyBlock.java:298)
at com.facebook.presto.common.block.LazyBlock.getLoadedBlock(LazyBlock.java:289)
at com.facebook.presto.operator.ScanFilterAndProjectOperator$RecordingLazyBlockLoader.load(ScanFilterAndProjectOperator.java:320)
at com.facebook.presto.operator.ScanFilterAndProjectOperator$RecordingLazyBlockLoader.load(ScanFilterAndProjectOperator.java:306)
at com.facebook.presto.common.block.LazyBlock.assureLoaded(LazyBlock.java:298)
at com.facebook.presto.common.block.LazyBlock.getLoadedBlock(LazyBlock.java:289)
at com.facebook.presto.operator.project.InputPageProjection.project(InputPageProjection.java:69)
at com.facebook.presto.operator.project.PageProjectionWithOutputs.project(PageProjectionWithOutputs.java:56)
at com.facebook.presto.operator.project.PageProcessor$ProjectSelectedPositions.processBatch(PageProcessor.java:327)
at com.facebook.presto.operator.project.PageProcessor$ProjectSelectedPositions.process(PageProcessor.java:201)
at com.facebook.presto.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:315)
at com.facebook.presto.operator.WorkProcessorUtils$YieldingIterator.computeNext(WorkProcessorUtils.java:79)
at com.facebook.presto.operator.WorkProcessorUtils$YieldingIterator.computeNext(WorkProcessorUtils.java:65)
at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:141)
at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:136)
at com.facebook.presto.operator.project.MergingPageOutput.getOutput(MergingPageOutput.java:128)
at com.facebook.presto.operator.ScanFilterAndProjectOperator.processPageSource(ScanFilterAndProjectOperator.java:301)
at com.facebook.presto.operator.ScanFilterAndProjectOperator.getOutput(ScanFilterAndProjectOperator.java:245)
at com.facebook.presto.operator.Driver.processInternal(Driver.java:424)
at com.facebook.presto.operator.Driver.lambda$processFor$9(Driver.java:307)
at com.facebook.presto.operator.Driver.tryWithLock(Driver.java:728)
at com.facebook.presto.operator.Driver.processFor(Driver.java:300)
at com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1079)
at com.facebook.presto.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:162)
at com.facebook.presto.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:599)
at com.facebook.presto.$gen.Presto_0_268_03318e7____20230201_140638_1.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.IllegalStateException: No data is read before EOF
at alluxio.shaded.client.com.google.common.base.Preconditions.checkState(Preconditions.java:508)
at alluxio.client.file.AlluxioFileInStream.positionedReadInternal(AlluxioFileInStream.java:269)
at alluxio.client.file.AlluxioFileInStream.positionedRead(AlluxioFileInStream.java:237)
at alluxio.hadoop.HdfsFileInputStream.read(HdfsFileInputStream.java:154)
at alluxio.hadoop.HdfsFileInputStream.readFully(HdfsFileInputStream.java:171)
at org.apache.hadoop.fs.FSDataInputStream.readFully(FSDataInputStream.java:107)
at com.facebook.presto.hive.parquet.HdfsParquetDataSource.readInternal(HdfsParquetDataSource.java:58)
... 36 more

LuQQiu added the type-bug, priority-high, area-worker, and workload-presto labels and removed the type-bug label on Feb 3, 2023

LuQQiu commented Feb 8, 2023

@harsh9898 Did this happen once, or does it happen regularly? Our engineers have looked at the issue and found it may not be easy to reproduce. If it happens again, could we get involved earlier, e.g. before the file is removed from Alluxio storage using --alluxioOnly? Next time it happens, feel free to ping me (@lu QIU (Alluxio)) or @chunxu Tang in the Alluxio community Slack channel: https://slackin.alluxio.io/


harsh9898 commented Feb 8, 2023

@LuQQiu @ChunxuTang

It's happening almost every day for multiple files; it has become a regular error for us.

I'm afraid I won't be able to demonstrate it live, since this happens in production, but I will see what else I can provide beyond the full error trace. If it happens again tomorrow, I can provide a more detailed log for the error. The query is select count(*) as count from <table_name>, and the failure occurs every morning at 9:15 am EST. If you are available at that time, we can connect and discuss the scenarios; let's work together to find the root cause of this error.

LuQQiu added the area-jobservice and area-master labels on Feb 8, 2023

LuQQiu commented Feb 8, 2023

User info:
Even when ABFS is not updated, the error persists, and it is random: it occurs for different Parquet files every day, and once I remove the affected file with --alluxioOnly, the query runs.

For example, we created a Hive table called 'sampletable'. This table reads data from a single Alluxio path/directory, and that path contains multiple files. When I run select count(*) as count from sampletable, the query touches all of the files, and that is when it fails randomly.

Workflow: data in Azure ABFS -> distributedLoad the whole dataset into Alluxio -> run Presto on the whole dataset, which fails randomly.
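
A rough command-level sketch of that load step (the path is a placeholder; options such as replication or active-job limits are omitted):

# Load the dataset from the mounted ABFS path into Alluxio storage
./bin/alluxio fs distributedLoad /<path_to_dataset>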


LuQQiu commented Feb 13, 2023

stats_failed_file.txt

command:
./bin/alluxio fs ls <failed_file>
Output:
-rw-r--r--  8c643fb6-5ab8-42b0-b8e5-1d469dd72665 <admin_user_name> 2247345073       PERSISTED 01-18-2023 02:23:01:000 100% <failed_file>
command:
./bin/alluxio fs copyToLocal <failed_file> <local_path>
Output:
Block 1263760572422 is expected to be 268435456 bytes, but only 62496768 bytes are available. Please ensure its metadata is consistent between Alluxio and UFS.
command:
hdfs dfs -ls abfs://<data_lake_storage_uri>/<failed_file>
Output:
-rw-r--r--   1 8c643fb6-5ab8-42b0-b8e5-1d469dd72665 <admin_user_name> 2247345073 2023-01-18 02:23 abfs://<data_lake_storage_uri>/<failed_file>
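
Since the copyToLocal error asks to verify that metadata is consistent between Alluxio and the UFS, a consistency check along these lines may help confirm the mismatch (a sketch; the path is a placeholder and behavior may vary by version):

# List paths whose Alluxio metadata is inconsistent with the UFS
./bin/alluxio fs checkConsistency /<path_to_parquet_directory>
# Adding -r would attempt to repair the inconsistent paths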

From user:
I have observed this across multiple occurrences, and the behavior is always the same: the error is produced for a file when one of its blocks is on SSD rather than MEM. When we remove that file and run the query again, it succeeds, because Alluxio re-caches the file directly from the UFS. In short, whenever a block of the file is on SSD, Alluxio is unable to read the data and the error is produced.
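
To capture this state before removing a file next time, inspecting the file's block metadata along these lines might help (a sketch; the exact fields in the output, such as storage tier, depend on the Alluxio version):

# Print file metadata, including block IDs, lengths, and cached block locations
./bin/alluxio fs stat /<failed_file>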


LuQQiu commented Feb 14, 2023

May be related to #16597


LuQQiu commented Feb 14, 2023

From user:
Yes, I think the Alluxio cache is the issue. On my side, based on my observations, this is what I did (see the command sketch below):
  1. Formatted Alluxio using ./bin/alluxio format
  2. Ran the distributedLoad again
  3. Restarted Alluxio
After these steps, I am no longer getting the error, which suggests that Alluxio was reading stale blocks from its cache and that this caused the issue.
However, I think this could happen again in the future when the UFS changes frequently.
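
A command-level sketch of those steps, assuming a single-master deployment; the stop/start around the format is an assumption about the usual procedure, not part of the original report:

# 1. Stop the cluster and format Alluxio (clears journal and worker storage)
./bin/alluxio-stop.sh all
./bin/alluxio format
./bin/alluxio-start.sh all
# 2. Re-load the dataset from the UFS (path is a placeholder)
./bin/alluxio fs distributedLoad /<path_to_dataset>
# 3. Restart Alluxio
./bin/alluxio-stop.sh all && ./bin/alluxio-start.sh all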


LuQQiu commented Feb 14, 2023

Likely caused by #16597


LuQQiu commented Feb 14, 2023

The user is willing to try out the changes in #16597 once they are released


LuQQiu commented Feb 15, 2023

From @tcrain:
Do you know whether the problematic files are written through Alluxio to the UFS, or written directly to the UFS without using Alluxio? Or are both cases possible?

From @harsh9898:
All the files are written directly to the UFS. I am not using Alluxio to write any data; Alluxio is used only to cache UFS data.

HelloHorizon commented

@LuQQiu @tcrain Could you let me know whether #16597 solved this problem?
