No such file or directory: '/home/renato/2019-Oct.csv #6877

thecaptain2000 · 2024-01-24T08:44:43Z

Hi, I am a newbie to both Ray and Modin. I have set up a Ray cluster on my server with a head node and a couple of worker nodes, and I was curious to see how much Modin could speed things up compared to pandas and to figure out how Modin works. In my home directory on the head node I have a file whose full path is /home/renato/2019-Oct.csv. To read it, I start Jupyter in VS Code and run:
import ray
context = ray.init()
import modin.pandas as pd

column_data_types = {
    'event_type': 'category',
    'Product_id': 'int32',
    'Category_id': 'category',
    'Category_code': 'category',
    'brand': 'category',
    'price': 'float32',
    'user_id': 'int32',
    'user_session': object,
    # Add more columns and data types as needed
}

df = pd.read_csv("/home/renato/2019-Oct.csv", index_col="event_time", dtype=column_data_types)

This works fine when only the head node is active, but once I add the two worker nodes it produces a long traceback that ends in: FileNotFoundError: [Errno 2] No such file or directory: '/home/renato/2019-Oct.csv'

That would make sense if the worker nodes are used and each tries to read the file locally, since the file is not in the worker VMs' filesystems. However, I could not find any mention in the documentation of having to replicate filesystem content. So, what is the right approach? Can you point me to any documentation I could read? Do I need to set up a common filesystem between the head and worker nodes for this purpose? Is there a way to limit the read to the node where I launch the program and use the worker nodes only for other in-memory operations? How does this work? What if, instead of a file, I were reading columns from a database?

Thank you for your help. If you are an expert Modin / Ray user you will have a laugh, and it won't take more than a minute to point me in the right direction, as this seems like fundamental knowledge to master :)

Renato

YarShev · 2024-01-24T09:28:26Z

Collaborator

Hi @thecaptain2000, thanks for your question. You should replicate the data across all the nodes in use so that every worker can find the data path. Alternatively, you could use a shared filesystem that all workers can read from. There is no way to limit the read to the node where you launch the program and use the worker nodes only for other in-memory operations. All workers are involved in execution from the very beginning.

We should probably have a section about data replication across worker nodes. @anmyachev, do we have something related in the docs?
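Before calling read_csv on a multi-node cluster, it can help to confirm that the file really is visible at the same path on every node. The following is a minimal sketch, assuming a Ray 2.x cluster that is already running and reachable with address="auto"; the helper names check_path_locally and check_path_on_all_nodes are hypothetical and not part of Ray or Modin.

```python
import os
import socket


def check_path_locally(path):
    """Report whether `path` is a regular file on the machine running this call."""
    return {
        "host": socket.gethostname(),
        "path": path,
        "exists": os.path.isfile(path),
    }


def check_path_on_all_nodes(path):
    """Run check_path_locally once on every alive node of a running Ray cluster.

    Assumption: a Ray cluster is already up and this process can attach to it.
    Ray is imported lazily so the local helper above works without it.
    """
    import ray
    from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy

    ray.init(address="auto", ignore_reinit_error=True)
    remote_check = ray.remote(check_path_locally)
    refs = [
        remote_check.options(
            # soft=False pins the task to exactly this node.
            scheduling_strategy=NodeAffinitySchedulingStrategy(
                node_id=node["NodeID"], soft=False
            )
        ).remote(path)
        for node in ray.nodes()
        if node["Alive"]
    ]
    return ray.get(refs)


# Example (requires a running cluster):
# for report in check_path_on_all_nodes("/home/renato/2019-Oct.csv"):
#     print(report)
```

Any node that reports "exists": False needs either its own copy of the file at that path or a mount of the shared filesystem.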

4 replies

thecaptain2000 Jan 24, 2024
Author

Yes, I can confirm that it works now. Looking back, it was kind of obvious; I guess I bought too much into the "you need to change just one line of code" mindset :)

One question, which will likely seem obvious once I know the answer: once the data is loaded in memory, it is partitioned across the three nodes, and all operations (say, computing the mean of a column) will be distributed across the three nodes, correct? (I am not sure whether this is a Modin question or a Ray question.)

YarShev Jan 24, 2024
Collaborator

It depends on the operation being performed and on Ray's scheduler, which submits remote tasks to worker processes. The mean operation is implemented using the TreeReduce pattern, so it should be distributed across all nodes (of course, taking into account how many partitions a Modin DataFrame has and what the initial data size is).
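To make the idea concrete, here is a conceptual sketch in plain Python of how a mean can be computed in a map-then-reduce fashion: each partition is reduced to a small (sum, count) pair, and the pairs are then combined. This is an illustration of the general pattern only, not Modin's actual implementation; all function names are hypothetical, and the work runs in a single process here, whereas in a cluster each per-partition step could run as a separate remote task.

```python
def partial_stats(partition):
    """Map step: reduce one partition to a (sum, count) pair."""
    return (sum(partition), len(partition))


def combine(left, right):
    """Reduce step: merge two (sum, count) pairs."""
    return (left[0] + right[0], left[1] + right[1])


def tree_reduce(items, combine_fn):
    """Combine items pairwise, level by level, until one remains."""
    if not items:
        raise ValueError("tree_reduce needs at least one item")
    while len(items) > 1:
        next_level = [
            combine_fn(items[i], items[i + 1])
            for i in range(0, len(items) - 1, 2)
        ]
        if len(items) % 2:  # carry an unpaired last item to the next level
            next_level.append(items[-1])
        items = next_level
    return items[0]


def distributed_mean(partitions):
    """Mean over all values, computed from per-partition partial results."""
    total, count = tree_reduce([partial_stats(p) for p in partitions], combine)
    return total / count


partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
print(distributed_mean(partitions))  # 3.5
```

Only the small (sum, count) pairs need to move between workers, which is why the number of partitions and the data size affect how much of the cluster is actually used.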

thecaptain2000 Jan 24, 2024
Author

Cool. I went to check the information published by anmyachev, and the info on NFS is indeed there. However, I would like to know what I need to manage myself. For example, suppose I am loading data from a database and the DB connectivity is set up on all nodes: will Modin (or Ray) work out by itself what data to load on each node? I faced a similar problem in PyTorch Lightning, where I needed to manually take care of loading data onto each GPU because the loading task is not handled by the system automatically.

BTW, thank you for your help. I am all for "learning to walk on my own legs", but taking on Ray and Modin at once is a big feat.

YarShev Jan 24, 2024
Collaborator

Modin will take care of what data to load on each node. You can use read_sql from the original module:

from modin.pandas.io import read_sql

or from the experimental one, which extends read_sql with additional parameters:

from modin.experimental.pandas.io import read_sql
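One plausible way a distributed SQL read can decide which rows each worker loads is to range-partition a numeric column and issue one bounded query per range. The sketch below illustrates that idea with the standard-library sqlite3 module; it is not Modin's code, and whether Modin splits a given query this way is an assumption. The table name events and the helpers partition_bounds and read_partition are hypothetical.

```python
import sqlite3


def partition_bounds(lower, upper, n_parts):
    """Split the inclusive range [lower, upper] into contiguous inclusive ranges."""
    step, extra = divmod(upper - lower + 1, n_parts)
    bounds, start = [], lower
    for i in range(n_parts):
        width = step + (1 if i < extra else 0)
        if width == 0:  # more parts requested than values available
            continue
        bounds.append((start, start + width - 1))
        start += width
    return bounds


def read_partition(con, table, column, low, high):
    """Fetch only the rows whose `column` value falls in [low, high].

    Table and column names are interpolated directly, so they must come
    from trusted code, never from user input.
    """
    query = f"SELECT * FROM {table} WHERE {column} BETWEEN ? AND ?"
    return con.execute(query, (low, high)).fetchall()


# Small in-memory table standing in for a real database.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (user_id INTEGER, price REAL)")
con.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(i, float(i)) for i in range(1, 11)],
)

# In a cluster, each bounded query could be issued by a different worker.
chunks = [
    read_partition(con, "events", "user_id", low, high)
    for low, high in partition_bounds(1, 10, 3)
]
print([len(chunk) for chunk in chunks])  # [4, 3, 3]
```

Each worker needs its own working connection to the database for this to run in parallel, which matches the setup described above where DB connectivity is configured on all nodes.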

anmyachev · 2024-01-24T12:11:12Z

Collaborator

@YarShev wrote: "We should probably have a section about data replication across worker nodes. @anmyachev, do we have something related in the docs?"

This information was added by me in 7f2dc36.

0 replies
