add galactica model #234

Merged: 35 commits into master from gpm_dev, Feb 27, 2023

Commits (35)
0132148 fix typo and checkpoint downloading issue (Jan 10, 2023)
07ce502 saved work 1.29 (Jan 29, 2023)
176a3b1 saved work in 2.1 (Feb 1, 2023)
0fd7d8e Merge branch 'master' of github.com:FlagAI-Open/FlagAI (Feb 1, 2023)
8c4c49e add Dockerfile (ftgreat, Feb 7, 2023)
9f9694b Update README_zh.md (ftgreat, Feb 7, 2023)
4785451 Update Dockerfile (ftgreat, Feb 7, 2023)
6695e17 Merge branch 'master' of github.com:FlagAI-Open/FlagAI (Feb 8, 2023)
db31a5a Update .gitignore (ftgreat, Feb 10, 2023)
9c327aa updated (Feb 13, 2023)
c41d3b8 merged master (Feb 13, 2023)
d0cc544 fixed errors (Feb 13, 2023)
29fc829 removed glm_seq2seq trainer (Feb 13, 2023)
651b889 fixed metrics (Feb 13, 2023)
a002d1b restore data (Feb 13, 2023)
696c779 updated (Feb 13, 2023)
681b272 restore file (Feb 13, 2023)
8df94ae upadted (Feb 13, 2023)
56dc0e4 optimized prompt (Feb 15, 2023)
34b9ab3 modified errors (Feb 15, 2023)
13c2e2d removed local path (Feb 15, 2023)
2ba795e updated (Feb 15, 2023)
c714eb6 updated (Feb 15, 2023)
ff39078 updated (Feb 15, 2023)
14ec4f2 Add docs (Feb 15, 2023)
fd1a642 Merge branch 'master' of github.com:FlagAI-Open/FlagAI (Feb 15, 2023)
ffd6f9c Merge branch 'master' into fix_issue202 (Feb 15, 2023)
1510598 updated (Feb 15, 2023)
30c7465 fixe error (Feb 15, 2023)
f8127cb Merge pull request #227 from Anhforth/opt_prompt (ftgreat, Feb 16, 2023)
a3c5e80 Merge branch 'master' of github.com:FlagAI-Open/FlagAI (Feb 16, 2023)
d117049 updated (Feb 16, 2023)
000e153 Merge pull request #224 from Anhforth/fix_issue202 (ftgreat, Feb 16, 2023)
6a3049c add galactica model (920232796, Feb 27, 2023)
9ecbf24 Merge branch 'gpm_dev' into master (ftgreat, Feb 27, 2023)
1 change: 1 addition & 0 deletions .gitignore
@@ -26,6 +26,7 @@ tensorboard*
datasets
qqp
glm_large_qqp_pytorch
+wandb
examples/AltCLIP/clip_benchmark_datasets
examples/glm_pretrain/data.lazy
examples/glm_pretrain/examples/glm_pretrain/data.lazy
53 changes: 53 additions & 0 deletions Dockerfile
@@ -0,0 +1,53 @@
#Change to your base image, such as pytorch1.11+py38
#https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel_21-02.html#rel_21-02
FROM nvcr.io/nvidia/pytorch:21.06-py3
#You can set available pypi sources
RUN /bin/bash -c "pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple"

ENV STAGE_DIR=/tmp
RUN mkdir -p ${STAGE_DIR}
#Ubuntu
RUN apt-get update && apt-get install -y openssh-server && apt-get install -y git
ARG SSH_PORT=6001
#Client Liveness & Uncomment Port 22 for SSH Daemon
RUN echo "ClientAliveInterval 30" >> /etc/ssh/sshd_config
RUN mkdir -p /var/run/sshd && cp /etc/ssh/sshd_config ${STAGE_DIR}/sshd_config && \
sed "0,/^#Port 22/s//Port 22/" ${STAGE_DIR}/sshd_config > /etc/ssh/sshd_config
RUN cat /etc/ssh/sshd_config > ${STAGE_DIR}/sshd_config && \
sed "0,/^Port 22/s//Port ${SSH_PORT}/" ${STAGE_DIR}/sshd_config > /etc/ssh/sshd_config && \
sed -i 's/#PermitRootLogin prohibit-password/PermitRootLogin yes/g' /etc/ssh/sshd_config
EXPOSE ${SSH_PORT}

#Set SSH KEY
RUN mkdir /root/.ssh
RUN printf "#StrictHostKeyChecking no\n#UserKnownHostsFile /dev/null" >> /etc/ssh/ssh_config && \
ssh-keygen -t rsa -f /root/.ssh/id_rsa -N "" && cat /root/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys && \
chmod og-wx /root/.ssh/authorized_keys

RUN echo $'Host 127.0.0.1 \n\
Hostname 127.0.0.1 \n\
Port 6001 \n\
StrictHostKeyChecking no \n\
User root' > /root/.ssh/config
RUN echo $'Host localhost \n\
Hostname localhost \n\
Port 6001 \n\
StrictHostKeyChecking no \n\
User root' >> /root/.ssh/config

RUN echo "service ssh start" >> /root/.bashrc

#Main deps
RUN pip install tensorboard
RUN pip install sentencepiece
RUN pip install boto3
RUN pip install jieba
RUN pip install ftfy
RUN pip install deepspeed==0.7.7
RUN pip install bmtrain

RUN pip install flagai
#For development usage, you can change as follows
#RUN git clone https://github.com/FlagAI-Open/FlagAI.git && cd FlagAI && python setup.py install

CMD service ssh start && tail -f /dev/null
1 change: 1 addition & 0 deletions README_zh.md
@@ -80,6 +80,7 @@ git clone https://github.com/OpenBMB/BMTrain
cd BMTrain
python setup.py install
```
- [Optional] To build a docker image, refer to the [Dockerfile](https://github.com/FlagAI-Open/FlagAI/blob/master/Dockerfile)
- [Tip] In a single-node docker environment, running multi-GPU data parallelism requires setting up the host. For example, for docker node root@127.0.0.1 with port 7110:
```
>>> vim ~/.ssh/config
```
9 changes: 5 additions & 4 deletions examples/AltDiffusion/README.md
@@ -66,9 +66,10 @@ prompt = "Anime portrait of natalie portman as an anime girl by stanley artgerm
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


-loader = AutoLoader(task_name="text2img", #contrastive learning
+loader = AutoLoader(task_name="text2img",
                     model_name="AltDiffusion-m9",
-                    model_dir="./checkpoints")
+                    model_dir="./checkpoints",
+                    use_fp16=False) # Fp16 mode

model = loader.get_model()
model.eval()
@@ -97,9 +98,9 @@ More parameters of `predict_generate_images` that you can adjust:
| C | int | Number of channels of the generated images |
| seed | int | Random seed |

-注意:模型推理要求一张至少10G以上的GPU
+注意:模型推理要求一张至少14G以上的GPU, FP16模式下则至少11G

-Note that model inference requires a GPU with at least 10 GB of memory.
+Note that model inference requires a GPU with at least 14 GB of memory, or at least 11 GB in FP16 mode.
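To stay within the smaller FP16 budget, the `use_fp16` flag introduced in this diff can simply be set to `True` at load time. A minimal sketch reusing the loader call from this README; the checkpoint directory is an assumption:

```python
import torch
from flagai.auto_model.auto_loader import AutoLoader

# Load AltDiffusion-m9 with FP16 weights, reducing GPU memory needs
# from roughly 14 GB to roughly 11 GB (per the note above).
loader = AutoLoader(task_name="text2img",
                    model_name="AltDiffusion-m9",
                    model_dir="./checkpoints",  # assumed checkpoint location
                    use_fp16=True)

model = loader.get_model()
model.eval()
model.to(torch.device("cuda" if torch.cuda.is_available() else "cpu"))
```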


# 更多生成结果/More Results
27 changes: 27 additions & 0 deletions examples/galactica/generate_galactica_1.3b.py
@@ -0,0 +1,27 @@
from flagai.model.predictor.predictor import Predictor
from flagai.auto_model.auto_loader import AutoLoader
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

loader = AutoLoader(task_name="lm",
                    model_name="galactica-1.3b-en",
                    model_dir="./checkpoints")  # checkpoints are downloaded here

model = loader.get_model()
model.to(device)
model.eval()

tokenizer = loader.get_tokenizer()

predictor = Predictor(model, tokenizer)

text = "Please write a abstract about the computer vision. \n"
out = predictor.predict_generate_randomsample(text,
out_max_length=700,
top_k=50,
repetition_penalty=1.2,
temperature=0.7
)
print(out)
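For reproducible comparisons between checkpoints, a deterministic decode can complement the random sampling above. A sketch using the Predictor's beam-search method; the method name and signature mirror other FlagAI examples, so treat them as assumptions to verify against your installed version:

```python
# Deterministic beam-search decode with the same predictor as above.
# predict_generate_beamsearch is assumed to accept these arguments.
out = predictor.predict_generate_beamsearch(text,
                                            out_max_length=700,
                                            beam_size=3)
print(out)
```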


59 changes: 59 additions & 0 deletions examples/glm_custom_pvp/README.md
@@ -0,0 +1,59 @@
# Custom prompt-verbalizer pair (PVP)

## 1. Define your own prompt-verbalizer patterns
We provide an API for users to construct their own prompt-verbalizer patterns. Here is an example:
```python
class RtePVP(PVP):
    # The verbalizer maps original labels to more meaningful words
VERBALIZER = {"not_entailment": [" No"], "entailment": [" Yes"]}

@staticmethod
def available_patterns():
return [0, 1, 2]

@property
def spell_length(self):
return self.num_prompt_tokens + self.prefix_prompt

def get_parts(self, example: InputExample):
"""
Construct patterns with input texts and mask, "None" here stands for places to insert continuous prompt tokens
"""
text_a = example.text_a
text_b = example.text_b.rstrip(string.punctuation)
if self.pattern_id == 0:
parts_a, parts_b = [None, '"',
self.shortenable(text_b), '" ?'], [
None, [self.mask], ',', None, ' "',
self.shortenable(text_a), '"'
]
elif self.pattern_id == 1:
parts_a, parts_b = [None, self.shortenable(text_b), '?'], [
None, [self.mask], ',', None,
self.shortenable(" " + text_a)
]
elif self.pattern_id == 2:
parts_a, parts_b = [
None,
self.shortenable(text_a), None, ' question:',
self.shortenable(" " + text_b), ' True or False?', None,
' answer:', [self.mask]
], []
else:
raise NotImplementedError(self.pattern_id)
parts_a, parts_b = self.replace_prompt_tokens(parts_a, parts_b)
return parts_a, parts_b

    def verbalize(self, label) -> List[str]:
        return RtePVP.VERBALIZER[label]
```

## 2. Pass the user-defined class to the collate function
```python
collate_fn = ConstructSuperglueStrategy(cl_args,
tokenizer,
task_name=task_name,
custom_pvp=RtePVP)
```
119 changes: 119 additions & 0 deletions examples/glm_custom_pvp/train_custom_pvp.py
@@ -0,0 +1,119 @@
# Copyright © 2022 BAAI. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License")
from flagai.trainer import Trainer
from flagai.model.glm_model import GLMForSequenceClassification
from flagai.model.glm_model import GLMForSingleTokenCloze
from flagai.data.tokenizer import Tokenizer

from flagai.data.dataset import SuperGlueDataset
from flagai.test_utils import CollateArguments
from flagai.data.dataset.superglue.control import DEFAULT_METRICS, MULTI_TOKEN_TASKS, CH_TASKS
from flagai.data.dataset import ConstructSuperglueStrategy
from flagai.data.dataset.superglue.pvp import PVP
from flagai.data.dataset.data_utils import build_input_from_ids, build_sample, InputExample
from flagai.data.dataset.data_utils import build_decoder_input, build_decoder_sample, num_special_tokens_to_add
from typing import Tuple, List, Union, Dict
import string

class RtePVP(PVP):
VERBALIZER = {"not_entailment": [" No"], "entailment": [" Yes"]}

    @staticmethod
    def available_patterns():
        return [0, 1, 2]

@property
def spell_length(self):
return self.num_prompt_tokens + self.prefix_prompt

def get_parts(self, example: InputExample):
# switch text_a and text_b to get the correct order
text_a = example.text_a
text_b = example.text_b.rstrip(string.punctuation)
if self.pattern_id == 0:
parts_a, parts_b = [None, '"',
self.shortenable(text_b), '" ?'], [
None, [self.mask], ',', None, ' "',
self.shortenable(text_a), '"'
]
elif self.pattern_id == 1:
parts_a, parts_b = [None, self.shortenable(text_b), '?'], [
None, [self.mask], ',', None,
self.shortenable(" " + text_a)
]
elif self.pattern_id == 2:
parts_a, parts_b = [
None,
self.shortenable(text_a), None, ' question:',
self.shortenable(" " + text_b), ' True or False?', None,
' answer:', [self.mask]
], []
else:
raise NotImplementedError(self.pattern_id)
parts_a, parts_b = self.replace_prompt_tokens(parts_a, parts_b)
return parts_a, parts_b

    def verbalize(self, label) -> List[str]:
        return RtePVP.VERBALIZER[label]


# task_name options: ['boolq', 'cb', 'copa', 'multirc', 'rte', 'wic', 'wsc', 'afqmc', 'tnews']
task_name = "rte"

trainer = Trainer(env_type='pytorch',
epochs=10,
batch_size=4,
eval_interval=100,
log_interval=50,
experiment_name='glm_large',
pytorch_device='cuda',
load_dir=None,
lr=1e-4)
print("downloading...")

cl_args = CollateArguments()
cl_args.cloze_eval = True
cl_args.multi_token = task_name in MULTI_TOKEN_TASKS

cl_args.continuous_prompt = True
cl_args.prefix_prompt = 2
cl_args.num_prompt_tokens = 5

if task_name in CH_TASKS:
    model_name = 'GLM-large-ch'
    add_block_symbols = True
else:
    model_name = 'GLM-large-en'
tokenizer = Tokenizer.from_pretrained(model_name)

# model = GLMForSequenceClassification.from_pretrain(model_name=model_name, spell_length=2,
# class_num=3, tune_prefix_layers=1)

model = GLMForSingleTokenCloze.from_pretrain(download_path="./checkpoints",
model_name=model_name, spell_length=2,
class_num=3, tune_prefix_layers=1)
train_dataset = SuperGlueDataset(task_name=task_name,
data_dir='./datasets/',
dataset_type='train',
tokenizer=tokenizer)

collate_fn = ConstructSuperglueStrategy(cl_args,
tokenizer,
task_name=task_name,
custom_pvp=RtePVP)

valid_dataset = SuperGlueDataset(task_name=task_name,
data_dir='./datasets/',
dataset_type='dev',
tokenizer=tokenizer)

metric_methods = DEFAULT_METRICS[task_name]
trainer.train(model,
collate_fn=collate_fn,
train_dataset=train_dataset,
valid_dataset=valid_dataset,
metric_methods=metric_methods)

23 changes: 23 additions & 0 deletions examples/glm_ptuning/README.md
@@ -0,0 +1,23 @@
# P-tuning

Here is an example of training a model with continuous prompts (P-tuning).

## 1. Change the parameters in config
```python
cl_args.continuous_prompt = True   # Enable continuous prompts
cl_args.prefix_prompt = 2          # Number of continuous prompt tokens at the beginning
cl_args.num_prompt_tokens = 5      # Number of continuous prompt tokens within the content
```


## 2. Change model parameters

```python
# spell_length is the total number of continuous prompt tokens in an instance;
# it is usually determined by the PVP structure.
# tune_prefix_layers is the number of transformer layers to tune; the remaining
# layers are frozen.
model = GLMForSingleTokenCloze.from_pretrain(download_path="./checkpoints",
                                             model_name=model_name, spell_length=8,
                                             tune_prefix_layers=1)
```

In this way, P-tuning is enabled during training. A combined sketch follows.
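Putting the two steps together, a minimal end-to-end sketch based on the SuperGLUE training script included in this PR; the dataset and checkpoint paths are assumptions:

```python
from flagai.trainer import Trainer
from flagai.model.glm_model import GLMForSingleTokenCloze
from flagai.data.tokenizer import Tokenizer
from flagai.data.dataset import SuperGlueDataset, ConstructSuperglueStrategy
from flagai.test_utils import CollateArguments

task_name = "rte"
tokenizer = Tokenizer.from_pretrained("GLM-large-en")

# Step 1: enable continuous prompts in the collate arguments.
cl_args = CollateArguments()
cl_args.cloze_eval = True
cl_args.continuous_prompt = True
cl_args.prefix_prompt = 2
cl_args.num_prompt_tokens = 5

# Step 2: tell the model how many prompt tokens to expect and
# how many prefix layers to tune.
model = GLMForSingleTokenCloze.from_pretrain(download_path="./checkpoints",
                                             model_name="GLM-large-en",
                                             spell_length=8,
                                             tune_prefix_layers=1)

train_dataset = SuperGlueDataset(task_name=task_name,
                                 data_dir='./datasets/',
                                 dataset_type='train',
                                 tokenizer=tokenizer)
collate_fn = ConstructSuperglueStrategy(cl_args, tokenizer, task_name=task_name)

trainer = Trainer(env_type='pytorch', epochs=10, batch_size=4,
                  pytorch_device='cuda', lr=1e-4)
trainer.train(model, collate_fn=collate_fn, train_dataset=train_dataset)
```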
48 changes: 48 additions & 0 deletions examples/glm_ptuning/deepspeed.json
@@ -0,0 +1,48 @@
{
"train_micro_batch_size_per_gpu": 456,
"gradient_accumulation_steps": 100,
"steps_per_print": 100,
"gradient_clipping": 1.0,
"zero_optimization": {
"stage": 2,
"contiguous_gradients": false,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 5e7,
"allgather_bucket_size": 5e7,
"cpu_offload": true
},
"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": 0,
"warmup_max_lr": 1e-5,
"warmup_num_steps": 2000
}
},
"zero_allow_untested_optimizer": true,
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"hysteresis": 2,
"min_loss_scale": 1
},
"optimizer": {
"type": "Adam",
"params": {
"lr": 1e-5,
"weight_decay": 0.1,
"betas": [
0.9,
0.98
],
"eps": 1e-6
}
},
"activation_checkpointing": {
"partition_activations": true,
"contiguous_memory_optimization": false
},
"wall_clock_breakdown": false
}
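This config is consumed when the FlagAI Trainer runs with the DeepSpeed backend. A hedged sketch of the hookup; the argument names mirror other FlagAI DeepSpeed examples, so treat them as assumptions to verify against your installed version:

```python
from flagai.trainer import Trainer

# Point the trainer at the JSON above: env_type selects the DeepSpeed
# backend, and deepspeed_config is the path to this file.
trainer = Trainer(env_type='deepspeed',
                  epochs=10,
                  batch_size=4,
                  experiment_name='glm_large_ptuning',
                  num_gpus=2,                           # assumed single-node, 2 GPUs
                  hostfile='./hostfile',                # assumed hostfile path
                  deepspeed_config='./deepspeed.json',  # the config shown above
                  training_script=__file__,
                  lr=1e-5)
```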