
Add design doc for lookup remote table in Fluid #9068

Merged
merged 12 commits into from
Jul 5, 2018

Conversation

Yancey1989
Contributor

Fixed #9066

@@ -0,0 +1,44 @@
# Design Doc: Large Model
Contributor

Need a more meaningful name, like "remote large parameter prefetching"?

Contributor Author

Done, maybe Prefetching Parameter From Parameter Server sounds good?

@@ -0,0 +1,44 @@
# Design Doc: Large Model

## Abstract
Contributor

Need to tell about the background, why we need this feature.

Contributor Author

Done.


### Split Large Parameter

<img src="src/split_parameter.png" width="400" />
Contributor

It seems the numbers in the picture are wrong.

Contributor

@abhinavarora abhinavarora left a comment

Hi @Yancey1989, the document needed some grammar polishing, so I pushed a commit directly to this PR to refine it.


## Abstract

We propose an approach to prefetch parameter from Parameter
Contributor

pre-fetch

Contributor

It should be:
We propose an approach to pre-fetch the parameters from a Parameter Server during distributed training so that Fluid is able to train a model with a large number of parameters that cannot be stored in one trainer's memory.

@helinwang
Contributor

helinwang commented Mar 14, 2018

Thanks for the design doc! I'm curious how much time prefetch can save in our use case.

From my understanding, this is how prefetch saves time:

pre-fetch-OP -> some-OP-A -> some-OP-B -> OP-that-use-the-prefetched-value

Prefetch can overlap the fetch with the time that "some-OP-A" and "some-OP-B" take.

However, isn't it true that in our case the "OP-that-use-the-prefetched-value" is at the very beginning of the ProgramDesc, so even if we insert the "pre-fetch-OP" at the beginning of the ProgramDesc, it would not save much time? E.g.,

pre-fetch-OP -> few-OPs-that-does-not-take-long-to-run -> OP-that-use-the-prefetched-value
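
To put rough numbers on the question above, here is a minimal sketch of the idealized timing. It is not Fluid code: `step_time` and `time_saved` are hypothetical helpers, all durations are made up, and the model assumes the fetch overlaps perfectly with the ops that run before the consumer.

```python
def step_time(fetch, ops_before, consumer, prefetch=True):
    """Idealized wall time of one step.

    fetch      -- time to fetch the value from the pserver
    ops_before -- durations of the ops that run before the consumer op
    consumer   -- duration of the op that uses the fetched value
    """
    compute = sum(ops_before)
    if prefetch:
        # The fetch runs concurrently with the preceding ops, so the
        # consumer starts once the slower of the two has finished.
        return max(fetch, compute) + consumer
    # Without prefetch, the fetch is issued only when the value is needed.
    return compute + fetch + consumer


def time_saved(fetch, ops_before, consumer):
    """Time saved by prefetching; equals min(fetch, sum(ops_before))."""
    return (step_time(fetch, ops_before, consumer, prefetch=False)
            - step_time(fetch, ops_before, consumer, prefetch=True))
```

Under this model the saving is capped by the total time of the ops that precede the consumer, so a consumer near the start of the ProgramDesc leaves little to overlap with.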

@typhoonzero
Contributor

@helinwang Well, I think prefetch is not intended to save time, but to make it possible to train with a very large parameter distributed across many pservers (because the feature space is very large). This feature could definitely slow down training, but it is very useful for some CTR models.

@Yancey1989
Contributor Author

Thanks @abhinavarora, that looks better😆😆

@helinwang
Contributor

@typhoonzero I see, that makes sense, thanks for the reply!

@jacquesqiao
Member

Please review this design #9075 first, for it has something to do with Abacus migration. We should consider being compatible with both sides.

@Yancey1989 Yancey1989 changed the title Large model design doc Large parameter distributed training Mar 15, 2018
@Yancey1989 Yancey1989 changed the title Large parameter distributed training Add design doc for lookup remote table Mar 15, 2018
@Yancey1989
Contributor Author

I changed the title to Lookup Remote Table because describing the specific application scenario is clearer. This feature is mainly applied to the embedding layer.

This was referenced Mar 19, 2018
@Yancey1989 Yancey1989 changed the title Add design doc for lookup remote table Add design doc for lookup remote table in Fluid Jul 4, 2018
As an alternative design, we can implement a distributed sparse table in Fluid,
so there is no need to maintain an external storage component during training.

Prior to reading this design, it would be useful for the reader to make themselves
Contributor

You may need to read ... before going on.

![fluid lookup remote table](./src/fluid_lookup_remote_table.png)

Partition a large table into multiple pserver instances
1. `DistributeTranspiler` would split the table partitioned into some small
Contributor

I think we only use mod for now, but never mind, it's a design.

Contributor Author

The mod is used to find the right pserver for a given input Id; we also need the split to initialize the shape of the table block on each PServer.
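
As a rough illustration of the two roles described here, the sketch below routes ids with a mod and derives the per-pserver block shapes from a split. It is not the `DistributeTranspiler` implementation: the function names are hypothetical, and it assumes rows are assigned round-robin so that the block sizes agree with the mod routing.

```python
def split_table_shapes(num_rows, emb_dim, num_pservers):
    """Shape of the table block on each pserver (assumed layout).

    Row id i is assumed to live on pserver i % num_pservers, so the
    first (num_rows % num_pservers) pservers hold one extra row.
    """
    base, extra = divmod(num_rows, num_pservers)
    return [(base + (1 if p < extra else 0), emb_dim)
            for p in range(num_pservers)]


def route_ids(ids, num_pservers):
    """Group input ids by the pserver that owns them, using mod."""
    buckets = [[] for _ in range(num_pservers)]
    for i in ids:
        buckets[i % num_pservers].append(i)
    return buckets


def local_row(row_id, num_pservers):
    """Row index inside the owning pserver's block under this layout."""
    return row_id // num_pservers
```

For example, a table of 10 rows with an embedding width of 8 split across 3 pservers gives blocks of 4, 3, and 3 rows, and id 7 maps to local row 2 on pserver 1.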

Contributor

@typhoonzero typhoonzero left a comment

LGTM, we may need to remove old design sections later.

Member

@jacquesqiao jacquesqiao left a comment

I will update the outdated part of this design.

@Yancey1989 Yancey1989 merged commit 845618e into PaddlePaddle:develop Jul 5, 2018
@Yancey1989 Yancey1989 deleted the large_model_design_doc branch July 5, 2018 01:14
kuke pushed a commit to kuke/Paddle that referenced this pull request Aug 25, 2018
…gn_doc

Add design doc for lookup remote table in Fluid