
added checkpointing to support LLMs #114

Merged 27 commits into main from feature/checkpoint on Jan 9, 2024

Conversation

@hariharan-devarajan (Collaborator) commented Nov 21, 2023

Changes to support Microsoft's Megatron-DeepSpeed.

  • Support for checkpointing of model parameters, optimizer states, and layer parameters.
  • Support for reading indexed datasets.
  • Write a configuration file for Megatron-DeepSpeed.

@hariharan-devarajan hariharan-devarajan marked this pull request as ready for review November 22, 2023 00:21
@zhenghh04 (Member)

This PR addresses #88

@hariharan-devarajan (Collaborator, Author)

@zhenghh04 I can confirm with the profiler that this change to checkpointing accurately represents the checkpointing in DeepSpeed. Additionally, indexed_binary and mmap_indexed_binary are the two modes used in Megatron-DeepSpeed for data reading, and the calls are accurate.

You can merge this if it looks good to you.

@hariharan-devarajan hariharan-devarajan force-pushed the feature/checkpoint branch 2 times, most recently from faee51e to 87c195b Compare November 28, 2023 05:11
@hariharan-devarajan hariharan-devarajan force-pushed the feature/checkpoint branch 2 times, most recently from 8a1fb5a to 8537d35 Compare November 29, 2023 04:40
@zhenghh04 (Member)

@hariharan-devarajan could you take a look at the conflict, and make sure the checkpointing writes are performed with the storage API functions, which also apply to S3 storage.


train:
  epochs: 1
  computation_time: 0.064296
Member

Is the computation time from running the real workload?

Collaborator Author

This is based on the configuration used in PR #88.

Member

We have to validate this after merging the PR.

Collaborator Author

Agreed

Comment on lines +126 to +139
if self.model_state:
    fname = os.path.join(self.checkpoint_folder, f"model-{epoch}-{step_number}-{my_rank}.pt")
    with open(fname, "wb") as f:
        torch.save(self.model_state, f)
if self.optimization_state:
    fname = os.path.join(self.checkpoint_folder, f"optimizer-{epoch}-{step_number}-{my_rank}.pt")
    with open(fname, "wb") as f:
        torch.save(self.optimization_state, f)

if self.layer_state and self.args.num_layers > 0:
    for layer in range(self.args.num_layers):
        fname = os.path.join(self.checkpoint_folder, f"layer-{layer}-{epoch}-{step_number}-{my_rank}.pt")
        with open(fname, "wb") as f:
            torch.save(self.layer_state, f)
Member

Make sure the conflict is resolved.

Collaborator Author

Fixed

@hariharan-devarajan (Collaborator, Author)

@zhenghh04 The original code uses TensorFlow and PyTorch APIs to save. This is needed, as we are storing complex tensors.

How would this work with S3? I think we need an fsspec-type interface for abstracting storage, not a manual abstraction.

Thoughts?
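To make the trade-off concrete, here is a minimal sketch of what a storage-backend abstraction in the spirit of fsspec could look like. This is not the PR's code: the class and method names are hypothetical, `pickle` stands in for `torch.save` (both accept a file-like object), and the in-memory backend stands in for an object store such as S3.

```python
# Sketch (assumption, not the PR's implementation): checkpoint writes go
# through a backend that returns a file-like object, so the same save call
# works for POSIX files and for an object store.
import io
import pickle


class PosixBackend:
    """Writes checkpoints to the local filesystem."""
    def open_write(self, path):
        return open(path, "wb")


class _CapturingBuffer(io.BytesIO):
    """Buffer that flushes its bytes into a dict-backed object store on close."""
    def __init__(self, store, key):
        super().__init__()
        self._store, self._key = store, key

    def close(self):
        self._store[self._key] = self.getvalue()
        super().close()


class InMemoryBackend:
    """Stand-in for an object store (e.g. S3) in this sketch."""
    def __init__(self):
        self.objects = {}

    def open_write(self, path):
        return _CapturingBuffer(self.objects, path)


def save_state(backend, path, state):
    # torch.save also accepts a file-like object, so in the real code the
    # pickle.dump call below would be torch.save(state, f).
    with backend.open_write(path) as f:
        pickle.dump(state, f)


backend = InMemoryBackend()
save_state(backend, "model-0-0-0.pt", {"a": [1, 2, 3]})
```

With an fsspec-style interface, only `open_write` changes per storage system; the checkpointing logic itself stays untouched.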

@zhenghh04 (Member) left a comment

Thank you for all the implementation work. The feature implemented here is very useful. Please address the issues raised above.

@@ -17,6 +17,16 @@

from enum import Enum

class CheckpointType(Enum):
Member

Maybe rename this to IOType instead of CheckpointType?

CheckpointType sounds more like different kinds of checkpoints; we could use it, for example, to checkpoint only the model, only the optimizer state, etc.

Collaborator Author

How about CheckpointIOType? Just IOType might be confused with reading.

Collaborator Author

I named it CheckpointLocationType, with values RANK_ZERO and ALL_RANKS.
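The enum described in the comment above could look like the following sketch; the names RANK_ZERO and ALL_RANKS come from the thread, while the string payloads and comments are assumptions.

```python
# Sketch of the CheckpointLocationType enum as named in the thread.
from enum import Enum


class CheckpointLocationType(Enum):
    RANK_ZERO = "rank_zero"   # rank 0 aggregates and writes the full state
    ALL_RANKS = "all_ranks"   # every rank writes its own shard
```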


train:
  epochs: 1
  computation_time: 0.064296
Member

We have to validate this after merging the PR.

"""
super().generate()
np.random.seed(10)
GB=1024**3
Member

Please change it to GB=1073741824.

Collaborator Author

Fixed.

sample_size = dim1 * dim2
total_size = sample_size * self.num_samples
write_size = total_size
MEMORY_SIZE = 2*GB
Member

Should we allow the user to configure this using an environment variable, with a default value of 2 GB?
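The suggestion above could be sketched as follows; the environment variable name is hypothetical, and the default of 2 GB matches the hard-coded MEMORY_SIZE in the diff.

```python
# Sketch (assumption: variable name is hypothetical): read the generation
# buffer size from an environment variable, falling back to 2 GB.
import os

GB = 1073741824  # 1024**3, written as an absolute value per the review

def generation_buffer_size(default: int = 2 * GB) -> int:
    return int(os.environ.get("DLIO_GENERATION_BUFFER_SIZE", default))
```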

Collaborator Author

Under dataset, I will add a configuration option called generation_buffer_size. What do you think?

Collaborator Author

Done

Comment on lines 58 to 63
if self.args.checkpoint_type == CheckpointType.COLLECTIVE:
    rank_to_checkpoint = 0
if rank_to_checkpoint == self.args.my_rank:
    num_ranks = 1
    if self.args.checkpoint_type == CheckpointType.COLLECTIVE:
        num_ranks = self.args.comm_size
Member

What does COLLECTIVE mean? Is it every rank writing data?

Lines 62-63 and lines 58-59 are inconsistent with each other.

Collaborator Author

In the context of checkpointing, collective basically means that all data is collected by rank zero and written from there. I am open to a better word to describe it. Maybe Aggregated and Per-Process?

Collaborator Author

Called it RANK_ZERO.

if self.args.checkpoint_type == CheckpointType.COLLECTIVE:
    num_ranks = self.args.comm_size
if self.args.model_size > 0:
    self.model_state = {"a": self._get_tensor(self.args.model_size*num_ranks)}
Member

model_size is the size of the model, right?
It is confusing to have model_size * num_ranks there.

Collaborator Author

model_size is the size of the model per GPU.

We could instead define it as the absolute model size of the application, in which case:

  1. For the per-GPU case we would need to divide it by the number of ranks.

If it stays per GPU, we have to multiply it by the number of ranks for the collective case.
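A small worked example of the multiplication being discussed; the numbers are illustrative, not from the PR.

```python
# Illustrative arithmetic (assumed numbers): with model_size defined per
# GPU, the rank writing the aggregated (collective / RANK_ZERO) checkpoint
# multiplies it by the number of ranks.
GB = 1073741824
model_size_per_gpu = 4 * GB   # hypothetical 4 GB of state per GPU
num_ranks = 8

total_checkpoint_bytes = model_size_per_gpu * num_ranks
print(total_checkpoint_bytes // GB)  # 32, i.e. rank zero writes 32 GB
```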

Collaborator Author

Explained correctly in the documentation.

dlio_benchmark/utils/config.py (resolved)
Change GB to an absolute value.
@hariharan-devarajan hariharan-devarajan force-pushed the feature/checkpoint branch 3 times, most recently from b83819e to ddc92ff Compare January 8, 2024 19:29
@zhenghh04 (Member) left a comment

This PR looks good now.
But we need to validate the DLRM and Megatron-DeepSpeed config files. I'll create two issues to keep track of this.

@zhenghh04 zhenghh04 merged commit 0a6130a into main Jan 9, 2024
24 checks passed
@zhenghh04 zhenghh04 deleted the feature/checkpoint branch March 12, 2024 03:59