How to use pyflink to read data file in S3 bucket? #22
Comments
Can you try after creating the following environment variables? AWS_ACCESS_KEY_ID
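Before running the job, it may help to confirm that both credential variables are visible to the process that launches it. The following is a minimal sketch; the helper name `missing_aws_vars` is hypothetical and only the two variable names come from this thread.

```python
import os

# The two variable names discussed in this thread.
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_aws_vars(environ=None):
    """Return the required AWS variable names that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_aws_vars()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("Both AWS credential variables are set.")
```

Running this inside the container (rather than on the host) checks the environment that the Flink process would actually see.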
Yes! I have created the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. However, I don't know how to use them in the following pyflink code, which reads data from the source CSV file in the S3 bucket (specified in source_file_path below). I also don't know how to use flink-s3-fs-hadoop-1.17.1.jar, which is in the folder "/opt/flink/plugins/s3-fs-hadoop".
The error indicates it's an S3 plugin issue. Can you check this doc? https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/filesystems/s3/
I think I might figure it out.
Now it works! (The environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY have been created and set as in the snapshot in the second post above. There is no need to update /opt/flink/conf/flink-conf.yaml.) However, I do not know why adding the jar to a subfolder of "flink/plugins", as described on the Flink doc page, does not work. Thank you so much!!!
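For readers wondering what the source definition might look like, here is a minimal sketch that builds a `CREATE TABLE` statement for Flink's `filesystem` connector with the `csv` format. It is not the code used in this thread. The table name, column list, and bucket placeholder are assumptions for illustration; only `source_file_path` and `taxi-trips.csv` are mentioned in the posts.

```python
def build_s3_csv_source_ddl(table_name, columns, source_file_path):
    """Build a CREATE TABLE statement for a CSV file read via the filesystem connector."""
    column_defs = ",\n  ".join(f"`{name}` {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE TABLE {table_name} (\n"
        f"  {column_defs}\n"
        ") WITH (\n"
        "  'connector' = 'filesystem',\n"
        f"  'path' = '{source_file_path}',\n"
        "  'format' = 'csv'\n"
        ")"
    )


# Hypothetical table name, schema, and bucket placeholder (not from the thread).
ddl = build_s3_csv_source_ddl(
    "taxi_trips_source",
    [("id", "STRING"), ("vendor_id", "INT")],
    "s3://<your-bucket>/taxi-trips.csv",
)
print(ddl)

# With PyFlink installed, the statement could then be registered like this:
#   from pyflink.table import EnvironmentSettings, TableEnvironment
#   t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
#   t_env.execute_sql(ddl)
```

Note that the statement itself carries no credentials. In the setup reported here, they are picked up from the two environment variables, provided the S3 filesystem jar is loaded.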
Can you check the "Using filesystem plugins" section of this doc? https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/deployment/resource-providers/standalone/docker/
At least, that is what the doc indicates. I haven't tried myself.
Totally understand! I also tried putting the same jar in lib and plugins//s3-fs-hadoop, but it did not work.
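When debugging where the jar actually sits, a small diagnostic can list the jars in a given plugin subfolder. This is only a sketch: the function name `find_plugin_jars` is hypothetical, and the default paths are the ones quoted in this thread. It reports what is on disk, not whether Flink loaded the plugin.

```python
from pathlib import Path


def find_plugin_jars(flink_home, plugin_dir_name="s3-fs-hadoop"):
    """List jar file names found in <flink_home>/plugins/<plugin_dir_name>."""
    plugin_dir = Path(flink_home) / "plugins" / plugin_dir_name
    if not plugin_dir.is_dir():
        return []
    return sorted(p.name for p in plugin_dir.glob("*.jar"))


if __name__ == "__main__":
    # /opt/flink is the Flink home directory mentioned in the posts.
    print(find_plugin_jars("/opt/flink"))
```

An empty list means the subfolder is missing or contains no jar, which would be worth ruling out before looking at the configuration.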
Hi Jaehyeon,
This is just a question about my own practice, not about your solution, so I am not sure whether it is proper to ask here.
I tried to follow your solution for Lab 2. To keep it simple, I just want pyflink to read the source file from the local Docker container folder, and then read the source file from the S3 bucket.
Now pyflink can read the source file in the local Docker container folder with the simplified pyflink code.
But I haven't figured out how to make it read the source file in the S3 bucket.
I didn't manage to create the same environment with the Terraform code you provide. I simply created an S3 bucket and uploaded the source file, taxi-trips.csv. With aws_access_key_id and aws_secret_access_key, I can read the source file in the S3 bucket as below.
In the Docker container, flink-s3-fs-hadoop-1.17.1.jar is in the folder "/opt/flink/plugins/s3-fs-hadoop".
I didn't create an IAM user/role for my test. Can I let pyflink read/select data from the source file in the S3 bucket with aws_access_key_id and aws_secret_access_key? How can I do it?
Thank you so much!