Conversation
…-> InferenceSession::from_snapshot
Everything looks good. I think the way modules are split makes sense 👍
Hopefully this wouldn't cause too many conflicts with existing PRs? e.g. #125 comes to mind
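For context on the rename in the title: a minimal, self-contained sketch of what an `InferenceSession::from_snapshot` constructor might look like. The stub types and the exact signature are assumptions for illustration, not necessarily what this PR lands.

```rust
// Hypothetical sketch; `Model`, `InferenceSnapshot`, and `SnapshotError`
// are stand-ins for the crate's real types.
pub struct Model;
pub struct InferenceSnapshot;
pub struct InferenceSession;
#[derive(Debug)]
pub struct SnapshotError;

impl Model {
    /// Starts a fresh session with empty state.
    pub fn start_session(&self) -> InferenceSession {
        InferenceSession
    }
}

impl InferenceSession {
    /// Rebuilds a session from a previously serialized snapshot by
    /// starting a fresh session on the model and restoring its state.
    pub fn from_snapshot(
        snapshot: InferenceSnapshot,
        model: &Model,
    ) -> Result<InferenceSession, SnapshotError> {
        let mut session = model.start_session();
        session.restore(snapshot)?;
        Ok(session)
    }

    fn restore(&mut self, _snapshot: InferenceSnapshot) -> Result<(), SnapshotError> {
        // In the real crate this would copy the snapshot's KV cache and
        // token history into the session; here it is a no-op.
        Ok(())
    }
}
```

Making this an associated constructor on `InferenceSession`, rather than a free function, keeps the session's invariants in one place.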
-    let (_, vocabulary) = args.model_load.load();
-    let toks = match vocabulary.tokenize(&prompt, false) {
+    let model = args.model_load.load();
I agree with the change. I had considered doing this a few times before. Since both the model and the vocabulary are meant to be immutable after creation, bundling them into the same struct is unlikely to cause any issues.
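A minimal, self-contained sketch of the bundling agreed to above; the field and accessor names are assumptions, not necessarily this PR's exact API.

```rust
// Placeholder types: the crate's real `Model` also owns the weights,
// hyperparameters, and so on.
pub struct Vocabulary {
    pub tokens: Vec<String>,
}

pub struct Model {
    vocabulary: Vocabulary,
}

impl Model {
    /// Both the model and the vocabulary are immutable once loaded, so a
    /// shared accessor is all that callers need.
    pub fn vocabulary(&self) -> &Vocabulary {
        &self.vocabulary
    }
}
```

With something like this, the call site in the diff above becomes `let model = args.model_load.load();` followed by `model.vocabulary().tokenize(&prompt, false)`.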
// The size of a scratch buffer used for inference. This is used for temporary
// storage of intermediate results during inference.
//
// The specific value was copied from `llama.cpp`.
const SCRATCH_SIZE: usize = 512 * 1024 * 1024;
I think llama.cpp figured out a proper way to compute this value. We should have a look at that - not in this PR, of course 👍
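For illustration only, a sketch of what computing this value could look like: scratch sizes bucketed by model size rather than one hardcoded constant. The thresholds and sizes below are invented for the sketch; they are not llama.cpp's actual values.

```rust
/// Illustrative stand-in for a computed scratch size: bucket by layer
/// count instead of hardcoding 512 MiB for every model. All numbers
/// here are placeholders, not llama.cpp's real tables.
fn scratch_size(n_layer: usize) -> usize {
    const MIB: usize = 1024 * 1024;
    match n_layer {
        0..=32 => 256 * MIB,  // roughly 7B-class models
        33..=40 => 320 * MIB, // roughly 13B-class
        41..=60 => 384 * MIB, // roughly 30B-class
        _ => 512 * MIB,       // 65B-class and larger
    }
}
```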
Yeah, I'm a little concerned about that one myself, but I don't think it should be too bad - it'll mostly just be discarding the changes to
This is a little overdue, I think. It'll cause problems for the other open PRs, but it also makes the codebase much easier to maintain.
I've made a few controversial changes in the last three commits. The rest are pretty straightforward.