Replicating samples across devices (SP / TP enablement) #597
Conversation
some clarifying questions
streaming/base/world.py (outdated)

```python
rank = self.rank // ratio
num_ranks = self.num_ranks // ratio
worker = rank * self.workers_per_rank + self.worker_of_rank
if self.ranks_per_node <= num_ranks:
```
@knighton mind adding some comments to the code here? I'm trying to understand what modifications are happening, and I think there may be a bug or a better way of doing this, but it's pretty hard for me to reason about without completely understanding what you're trying to accomplish here.
iiuc `ratio` is the number of consecutive GPUs that should be sharing the same samples -- and should have the same `World` information. `num_ranks` here is the # of TP blocks, and `rank` is the TP block index. I'm not seeing why we need to check `if self.ranks_per_node <= num_ranks`, and it seems that if we set `num_nodes` to 1, then we'll have download duplication across all the nodes, which will be pretty bad. The TP ratio should be the degree of duplication of the `World` objects, and I'm not sure that this does that entirely correctly...
> iiuc `ratio` is the number of consecutive GPUs that should be sharing the same samples -- and should have the same `World` information. `num_ranks` here is the # of TP blocks, and `rank` is the TP block index.
Exactly (for iterating purposes, modulo that bug for coordinating purposes, which I am about to fix).
> I'm not seeing why we need to check `if self.ranks_per_node <= num_ranks`, and it seems if we set `num_nodes` to 1, then we'll have download duplication across all the nodes which will be pretty bad.
The PR originally applied tensor parallelism intra-node only, but then iiuc VC noted that TP blocks may be inter-node here: #597 (comment)
Spelling out the logic for my own benefit (see the sketch below):

```
if num TP blocks maps 1:1 to, or exceeds, the true local world size:
    scale down the perceived num nodes to (TP blocks / true local world size)
    (no need to scale down the perceived local world size, since we scaled nodes)
else:
    just one perceived node, and what happens in the node stays in the node:
    map ranks to fewer TP blocks by scaling down the perceived local world size
```
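A minimal Python sketch of that branching (the function name and signature are illustrative, not the actual `streaming/base/world.py` code):

```python
def perceived_topology(num_ranks: int, ranks_per_node: int) -> tuple[int, int]:
    """Rescale (num_nodes, ranks_per_node) to TP-block coordinates.

    Here `num_ranks` is the number of TP blocks (true world size // ratio)
    and `ranks_per_node` is the true local world size.
    """
    if ranks_per_node <= num_ranks:
        # Enough TP blocks to fill whole nodes: scale down the perceived
        # node count; the perceived local world size is unchanged.
        return num_ranks // ranks_per_node, ranks_per_node
    # All TP blocks fit within one node: one perceived node, and map ranks
    # to fewer TP blocks by scaling down the perceived local world size.
    return 1, num_ranks
```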
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could rewrite it to:

```python
num_nodes = (num_ranks + self.ranks_per_node - 1) // self.ranks_per_node
ranks_per_node = num_ranks // num_nodes
```
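Worked through with illustrative numbers (assuming a true local world size of 8), the ceiling division covers both branches above:

```python
ranks_per_node = 8                  # true local world size (assumed)
for num_ranks in (4, 16):           # number of TP blocks
    num_nodes = (num_ranks + ranks_per_node - 1) // ranks_per_node
    print(num_ranks, num_nodes, num_ranks // num_nodes)
# 4 TP blocks  -> 1 perceived node, 4 ranks/node (local world size scaled down)
# 16 TP blocks -> 2 perceived nodes, 8 ranks/node (node count scaled down)
```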
> modulo that bug for coordinating purposes which am about to fix
Now I get an error like:

Investigating...
@andreamad8 how about now? You may need to call …
Ok, works now :) I can also test with multiple nodes, but I would need a bit more time (an hour-ish).
We will begrudge you one sidereal hour
It was a bit longer than an hour :) Yes, it works. I tested up to 16 nodes with different TensorParallel sizes: 1, 4, and 8.
Cleaned up the PR so the naming isn't TP-specific, and fixed a bug that was preventing determinism (both elastic and non-elastic). Added tests as well. Pending regression tests, this should be good to go!
lgtm

1. `World` is just our wrapper around `torch.dist` and `torch.utils.data.get_worker_info`. `World` just tells you which one you are, out of how many (nodes, ranks/node, workers/rank). Outside of iteration there is no `World`, semantically speaking: `get_worker_info()` will say we are worker 0 of 1. One has to keep that in mind (see the first sketch below).
2. New `StreamingDataset` argument `replication: Optional[int]`: `replication` iterates the same samples for groups of adjacent GPUs (ranks) (see the usage sketch below).
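Illustrating point 1 — a minimal sketch, assuming a stock PyTorch `DataLoader`, of why "worker 0 of 1" is what you see outside of iteration:

```python
from torch.utils.data import get_worker_info

# In the main process (outside a DataLoader worker), get_worker_info()
# returns None, so a World-style wrapper must default to worker 0 of 1.
info = get_worker_info()
worker_id = info.id if info is not None else 0
num_workers = info.num_workers if info is not None else 1
print(f'worker {worker_id} of {num_workers}')  # prints: worker 0 of 1
```

And a usage sketch for point 2, with placeholder remote/local paths — a TP degree of 4 means each group of 4 adjacent ranks iterates the same samples:

```python
from streaming import StreamingDataset

dataset = StreamingDataset(
    remote='s3://my-bucket/my-dataset',  # hypothetical dataset location
    local='/tmp/my-dataset',
    replication=4,  # groups of 4 adjacent ranks see identical samples
)
```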