
group null judge fix #7122

Merged 2 commits into PaddlePaddle:develop on Sep 27, 2023

Conversation

TimeYWL (Contributor) commented on Sep 25, 2023

PR types

Bug fixes

PR changes

Others

Description

According to the API in paddle/distributed/communication/group.py (line 67):

def is_member(self):
    if self.rank < 0:
        return False
    if self.nranks < 2:
        return False
    return True

and the group build code:

if size > 1 and global_rank in ranks:
    rank = 0 if backend == 'heter' else ranks.index(global_rank)
    pg = _new_process_group_impl(
        backend,
        _default_store,
        rank,
        size,
        group_name,
        pg_options=None,
        group_id=gid,
    )
else:
    rank = -1
    pg = None
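
Taken together, these two snippets show the problem: a rank that is not in the group (or a group of size 1) still gets back a Group object, just with rank = -1, so the object itself is never None. A minimal self-contained sketch of these semantics (the Group class below is a simplified stand-in, not Paddle's real implementation):

# Simplified stand-in for paddle.distributed.communication.group.Group,
# keeping only the two fields that is_member() inspects.
class Group:
    def __init__(self, rank, nranks):
        self.rank = rank      # -1 when this process is not in the group
        self.nranks = nranks  # 1 for a trivial single-rank group

    def is_member(self):
        if self.rank < 0:
            return False
        if self.nranks < 2:
            return False
        return True

# The case described below: no model parallelism, rank=-1, nranks=1.
mp_group = Group(rank=-1, nranks=1)
assert mp_group is not None      # a None check passes...
assert not mp_group.is_member()  # ...but this rank is not a member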

The check in PaddleNLP/paddlenlp/data/dist_dataloader.py (e.g. line 155):
if self.mp_group is not None and self.pp_rank == 0:
cannot correctly determine whether mp_group is a usable group.

If there is no model parallelism, mp_group is still a Group object:
rank: -1, nranks: 1, id: 12, ranks: 0; name: _default_pg12
so the None check passes, and broadcast_data_list() then raises an error:

Traceback (most recent call last):
  File "run_pretrain.py", line 567, in <module>
    main()
  File "run_pretrain.py", line 549, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/workspace/PaddleNLP/paddlenlp/trainer/trainer.py", line 738, in train
    for step, inputs in enumerate(epoch_iterator):
  File "/workspace/PaddleNLP/paddlenlp/data/dist_dataloader.py", line 181, in __next__
    data_list = broadcast_data_list(data_list, paddle.int64, self.mp_rank, self.mp_group, self.mp_src_rank)
  File "/workspace/PaddleNLP/paddlenlp/data/dist_dataloader.py", line 210, in broadcast_data_list
    paddle.distributed.broadcast(size_cuda, src_rank, group=comm_group).wait()
AttributeError: 'NoneType' object has no attribute 'wait'
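
A minimal sketch of the shape of the fix (the should_broadcast helper is hypothetical and only illustrates the guard; the actual change merged here may differ):

def should_broadcast(mp_group, pp_rank):
    # Guard the collective on actual membership, not just on the group
    # object being non-None: without model parallelism the group has
    # rank == -1 and nranks == 1, so is_member() returns False even
    # though the object itself exists.
    return mp_group is not None and mp_group.is_member() and pp_rank == 0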

paddle-bot bot commented on Sep 25, 2023

Thanks for your contribution!

codecov bot commented on Sep 25, 2023

Codecov Report

Merging #7122 (4d043d1) into develop (9c3f8a4) will decrease coverage by 0.01%.
Report is 4 commits behind head on develop.
The diff coverage is 0.00%.

@@             Coverage Diff             @@
##           develop    #7122      +/-   ##
===========================================
- Coverage    59.64%   59.64%   -0.01%     
===========================================
  Files          563      563              
  Lines        82644    82645       +1     
===========================================
  Hits         49291    49291              
- Misses       33353    33354       +1     
Files                               Coverage Δ
paddlenlp/data/dist_dataloader.py   14.78% <0.00%> (ø)

... and 4 files with indirect coverage changes

DesmonDay (Contributor) left a comment

LGTM
DesmonDay merged commit 685d12b into PaddlePaddle:develop on Sep 27, 2023
4 checks passed