No such file or directory: '/home/renato/2019-Oct.csv' #6877
-
Hi @thecaptain2000, thanks for your question. You should replicate the data across all nodes in the cluster so that every worker can find the file at the same path. Alternatively, you could use a shared filesystem that all workers can read from. There is no way to restrict the read to the node where you launch the program and use the worker nodes only for other in-memory operations: all workers are involved in execution from the very beginning. We should probably add a section to the docs about replicating data across worker nodes. @anmyachev, do we have something related in the docs?
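To see which nodes are missing the file before calling `read_csv`, something like the following might help. It is only a minimal sketch, not part of Modin: the helper names `path_report`, `missing_hosts`, and `check_path_on_cluster` are made up for illustration, and it assumes a Ray version that provides `NodeAffinitySchedulingStrategy` and a cluster that is already running.

```python
import os
import socket


def path_report(path):
    """Return (hostname, exists) for `path` as seen by the calling process."""
    return socket.gethostname(), os.path.exists(path)


def missing_hosts(reports):
    """Given (hostname, exists) pairs, return the hosts that cannot see the file."""
    return sorted(host for host, exists in reports if not exists)


def check_path_on_cluster(path):
    """Run path_report once on every alive Ray node (hypothetical helper)."""
    # Imported lazily so the two helpers above also work without Ray installed.
    import ray
    from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy

    # Assumption: a cluster is already running and reachable from this machine.
    ray.init(address="auto", ignore_reinit_error=True)
    remote_report = ray.remote(path_report)
    refs = [
        remote_report.options(
            num_cpus=0,  # do not wait for a free CPU slot on a busy node
            scheduling_strategy=NodeAffinitySchedulingStrategy(
                node_id=node["NodeID"], soft=False  # pin the task to this node
            ),
        ).remote(path)
        for node in ray.nodes()
        if node["Alive"]
    ]
    return ray.get(refs)
```

Calling `missing_hosts(check_path_on_cluster("/home/renato/2019-Oct.csv"))` from the head node should list the hosts where the path does not exist. An empty list suggests the file is replicated, or shared, everywhere.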
-
I added this information in 7f2dc36.
-
Hi, I am a newbie to both Ray and Modin. I have set up a Ray cluster on my server with a head node and a couple of worker nodes, and I was curious to see how much Modin speeds things up compared to pandas and to figure out how Modin works. In my home directory on the head node I have a file whose full path is /home/renato/2019-Oct.csv. To read it, I start Jupyter in VS Code and run:
import ray
context = ray.init()
import modin.pandas as pd
column_data_types = {
    'event_type': 'category',
    'Product_id': 'int32',
    'Category_id': 'category',
    'Category_code': 'category',
    'brand': 'category',
    'price': 'float32',
    'user_id': 'int32',
    'user_session': object,
    # Add more columns and data types as needed
}
df = pd.read_csv("/home/renato/2019-Oct.csv", index_col="event_time", dtype=column_data_types)
This works fine if only the head node is active, but once I add the two worker nodes it produces a long error trace that ends in: FileNotFoundError: [Errno 2] No such file or directory: '/home/renato/2019-Oct.csv'
That would make sense if the worker nodes are used and try to read the file locally, since the file is not on each worker VM's filesystem. However, I could not find any mention in the documentation of having to replicate filesystem content. So, what is the right approach? Can you point me to any documentation I could read? Do I need to set up a common filesystem between the head and worker nodes for this purpose? Is there a way to limit the read to the node where I launch the program and use the worker nodes only for other in-memory operations? How does this work? What if, instead of a file, I were reading columns from a database?
Thank you for your help. If you are an expert Modin / Ray user you will have a laugh, and it won't take more than a minute to point me in the right direction, as this seems like fundamental knowledge to master :)
Renato