Out-of-memory when finetuning large datasets with graphs #146
YouCanNotKnow announced in Announcements · Replies: 1 comment
Hi CHGNet devs, I am trying to finetune a model on the Open Catalyst Project dataset (https://github.com/Open-Catalyst-Project/ocp/blob/main/DATASET.md). I've run into memory problems when converting the dataset into graphs.

I have been following fine_tuning.ipynb and make_graphs.py in examples. I can convert the structures into graphs, but due to the scale of the dataset, memory runs out before I can write a labels.json file.

I can create labels for each individual graph, or for smaller batches of the full dataset, but it looks like GraphData in data/dataset.py can only load a single labels.json. Is there a way to batch-load labels into a single dataset, or to merge smaller datasets together? Some way to get around the memory problem and train on the full dataset?

Reply:

How large is the OC20 labels file? This will require a modified implementation of …
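One possible workaround, given that labels can already be written for smaller batches: merge the per-batch label files into the single labels.json that GraphData expects. The sketch below is hypothetical and assumes each shard (e.g. labels_000.json, labels_001.json, ...) is a JSON object mapping graph file names to their label dicts, with keys disjoint across shards; the shard naming and helper name are made up for illustration.

```python
import json
from glob import glob

def merge_label_shards(shard_paths, out_path="labels.json"):
    """Merge per-batch label JSON files into one labels.json.

    Hypothetical sketch: assumes each shard maps graph file names to
    label dicts and that keys are disjoint across shards, matching the
    single-file format GraphData loads.
    """
    merged = {}
    for path in shard_paths:
        with open(path) as f:
            # Disjoint keys, so a plain dict.update is a safe union.
            merged.update(json.load(f))
    with open(out_path, "w") as f:
        json.dump(merged, f)
    return merged

if __name__ == "__main__":
    # Assumed shard naming convention; adjust the pattern to your files.
    merge_label_shards(sorted(glob("labels_*.json")))
```

Only the merged labels.json needs to fit in memory at once, not the intermediate structures used while building each batch, which may be enough to get past the OOM depending on where the memory is actually spent.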