Generation with model parallel Megatron LM #2358

rakeshchada · 2020-07-22T02:26:10Z

❓ Questions and Help

What is your question?

Is there an example demonstrating how to generate using Megatron LM that was trained using model parallelism? The Megatron LM page shows how to run evaluation but there's no information on running generation.

What have you tried?

I tried running the below command but got an error.

Command:

fairseq-generate \
  $DATA_PATH \
  --path $MODEL_PATH \
  --task language_modeling \
  --gen-subset test \
  --max-sentences 8 \
  --criterion cross_entropy \
  --beam 1 \
  --sampling \
  --sampling-topp 0.9 \
  --temperature 0.01 \
  --prefix-size 200 \
  --distributed-world-size 8 \
  --results-path $RESULTS_PATH \
  --model-parallel-size 8;

Error:
/opt/conda/conda-bld/pytorch_1579022034529/work/aten/src/THC/THCTensorScatterGather.cu:100: void THCudaTensor_gatherKernel(TensorInfo<Real, IndexType>, TensorInfo<Real, IndexType>, TensorInfo<long, IndexType>, int, IndexType) [with IndexType = unsigned int, Real = float, Dims = 2]: block: [0,0,0], thread: [3,0,0] Assertion `indexValue >= 0 && indexValue < src.sizes[dim]` failed.

After some debugging, I found that this line in the code caused the above error, but I'm unsure of the root cause. There may be setup issues on my side (data, etc.). Still, an example of how to set up and run generation with model-parallel Megatron LM would be great. Thank you.
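
The assertion in the error says that some index passed to a gather operation fell outside the size of the dimension being gathered from. As a minimal sketch of that condition in plain Python: the helper name and all sizes below are made up for illustration, and the idea that target ids exceed a per-partition vocabulary slice is only one hypothesis, not a confirmed diagnosis of this error.

```python
def out_of_range_indices(indices, dim_size):
    """Return (position, value) pairs that violate 0 <= value < dim_size.

    This mirrors the condition in the CUDA assertion:
    indexValue >= 0 && indexValue < src.sizes[dim]
    """
    return [(pos, idx) for pos, idx in enumerate(indices)
            if not (0 <= idx < dim_size)]


# Hypothetical numbers: a 51200-entry vocabulary split over 8 partitions,
# so each partition would only hold 6400 rows.
full_vocab = 51200
shard_vocab = full_vocab // 8

# Hypothetical target token ids drawn from the full vocabulary.
targets = [17, 6399, 6400, 51199]

bad = out_of_range_indices(targets, shard_vocab)
print(bad)  # positions and values that would trip the assertion
```

Running a check like this on the CPU, against the actual index tensor and the actual size of the gathered dimension, can show which ids are out of range before the CUDA kernel fails.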


ngoyal2707 · 2020-07-22T02:55:17Z

I think this one is on me: I never got around to fixing the generate script for Megatron model parallelism. I can look into it, but no promises on the timeline yet.
For a reasonably sized model, you can just stitch the model parts back into a single model and run generation as usual.

Maybe give that a try here?

rakeshchada · 2020-07-23T04:19:30Z

Sure, I can try that. I'd appreciate an example that shows how to stitch the model parts into one.

AdamDanielKing · 2020-08-11T07:55:18Z

Can anyone confirm that Megatron 11b treats all contiguous spaces as a single space? With some hacky code I have it successfully generating on 2 GPUs (after merging and re-splitting the partitions) but it doesn't seem to understand line breaks. That's a little disappointing since it seems smarter than GPT-2 in a lot of other ways.

Perhaps this code was used during training?
https://github.com/pytorch/fairseq/blob/4c55744ec4cb26749cf2cf8dac89942f26ce4bd2/fairseq/tokenizer.py#L8-L14
There wouldn't seem to be any easy solution. Still appreciate Facebook making the biggest public model release.
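
To illustrate the behavior in question, here is a small sketch of a whitespace-normalizing line tokenizer. The names follow what the linked `fairseq/tokenizer.py` lines appear to define, but treat the exact code as an assumption. Whether this function was actually applied to the Megatron 11b training data is not confirmed anywhere in this thread.

```python
import re

# Assumed to resemble the linked fairseq code: any run of whitespace
# (spaces, tabs, newlines) is matched as a single unit.
SPACE_NORMALIZER = re.compile(r"\s+")


def tokenize_line(line):
    # Replace each whitespace run with one space, trim, then split.
    line = SPACE_NORMALIZER.sub(" ", line)
    line = line.strip()
    return line.split()


text = "First paragraph.\n\nSecond   paragraph."
tokens = tokenize_line(text)
print(tokens)
print(" ".join(tokens))  # the blank line and the extra spaces are gone
```

If training text went through a step like this, line breaks would never reach the model, which would be consistent with contiguous whitespace being treated as a single space.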

xeb · 2020-09-17T06:11:50Z

Does anyone have an example of stitching the model parts together? Did the approach work to generate text with megatron_11b?

thies1006 · 2020-09-18T14:42:52Z

To join the model chunks, one could try something like this:
https://github.com/facebookresearch/ParlAI/blob/abfb771ac4ed2966d6f3ea22c7a38e4ebc9cc0f0/parlai/agents/bart/convert_fairseq_to_parlai.py#L258-L307

P.S. Make sure to copy the 'version' entries as well; otherwise you might lose normalization layers.
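
The merging idea can be sketched without any framework. In Megatron-style model parallelism, column-parallel layers split the output dimension of a weight (rows in PyTorch's `[out, in]` layout), row-parallel layers split the input dimension (columns), and everything else is replicated on every partition. The code below is only an illustration under those assumptions: it uses nested Python lists in place of tensors, and the key-name patterns and the `merge_shards` helper are hypothetical, not taken from fairseq or from the linked ParlAI script.

```python
def concat_rows(parts):
    # Stand-in for concatenation along dim 0 (also works for 1-D biases).
    return [row for part in parts for row in part]


def concat_cols(parts):
    # Stand-in for concatenation along dim 1 of 2-D weights.
    return [sum((part[i] for part in parts), []) for i in range(len(parts[0]))]


# Hypothetical name patterns; real checkpoints need to be inspected.
ROW_SPLIT = ("embed_tokens.weight", "q_proj.", "k_proj.", "v_proj.", "fc1.")
COL_SPLIT = ("out_proj.weight", "fc2.weight")


def merge_shards(shards):
    """Merge per-partition state dicts (given in partition order) into one."""
    merged = {}
    for key in shards[0]:
        parts = [shard[key] for shard in shards]
        if any(pattern in key for pattern in COL_SPLIT):
            merged[key] = concat_cols(parts)
        elif any(pattern in key for pattern in ROW_SPLIT):
            merged[key] = concat_rows(parts)
        else:
            # Replicated entries: layer norms, row-parallel biases, and
            # the 'version' entries mentioned above. Copy them as-is.
            merged[key] = parts[0]
    return merged


shard0 = {"fc1.weight": [[1, 2]], "fc1.bias": [0.1],
          "fc2.weight": [[5], [6]], "fc2.bias": [0.0, 0.0],
          "layer_norm.weight": [1.0, 1.0], "version": [2]}
shard1 = {"fc1.weight": [[3, 4]], "fc1.bias": [0.2],
          "fc2.weight": [[7], [8]], "fc2.bias": [0.0, 0.0],
          "layer_norm.weight": [1.0, 1.0], "version": [2]}

merged = merge_shards([shard0, shard1])
print(merged["fc1.weight"], merged["fc2.weight"], merged["version"])
```

With real checkpoints the values would be tensors, and the two helpers would correspond to `torch.cat(parts, dim=0)` and `torch.cat(parts, dim=1)`. The partition order matters, and the split dimension of each parameter should be verified against the actual model code before trusting the merged weights.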

ngoyal2707 · 2020-09-18T15:25:38Z

I will push the scripts to glue and split partitions to master soon.
Yeah, the line-break issue might need some thinking; I'll take a look at that as well.

MasterScrat · 2020-11-13T00:22:08Z

Hello @ngoyal2707, did you get a chance to push the scripts to manage the partitions somewhere?

stale · 2021-07-21T08:04:51Z

This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!

stale · 2022-04-19T12:21:06Z

Closing this issue after a prolonged period of inactivity. If this issue is still present in the latest release, please create a new issue with up-to-date information. Thank you!

rakeshchada added the needs triage and question labels Jul 22, 2020

myleott removed the needs triage label Jul 24, 2020

myleott assigned ngoyal2707 Jul 24, 2020

stale bot added the stale label Jul 21, 2021

stale bot closed this as completed Apr 19, 2022
