This is a University of Macau computer vision practical, authored by Zhedong Zheng. The practical explores the basics of learning pedestrian features. In this practical, we will learn to build a simple person re-ID system step by step. (8 min read) 👍 Any suggestions are welcome.
Person re-ID can be viewed as an image retrieval problem. Given one query image from Camera A, we need to find the images of the same person in other cameras. The key to person re-ID is finding a discriminative representation of the person. Many recent works apply deeply learned models to extract visual features and achieve state-of-the-art performance.
We could use this tech to help people. Check the great video by Nvidia. (https://youtu.be/GiZ7kyrwZGQ?t=60)
Person re-identification, 行人重识别, 人の再識別, 보행자 재 식별, Réidentification des piétons, Ri-identificazione pedonale, Fußgänger-Neuidentifizierung, إعادة تحديد المشاة, Re-identificación de peatones
- Install Pytorch from http://pytorch.org/
- Install required packages
pip install -r requirements.txt
- [Optional] You may skip this step, since Torchvision usually comes with PyTorch. To install Torchvision from source:
git clone https://github.com/pytorch/vision
cd vision
python setup.py install
- [Optional] You may skip this step. To install apex from source:
git clone https://github.com/NVIDIA/apex.git
cd apex
python setup.py install --cuda_ext --cpp_ext
Because PyTorch and Torchvision are ongoing projects, we note that our code has been tested with PyTorch 0.3.0/0.4.0/0.5.0/1.0.0 and Torchvision 0.2.0/0.2.1.
Check the Prerequisites. The download links for this practical are:
- Code: ReID-Baseline
- Data: Market-1501 [Google] [Baidu]
A quick command line to download Market-1501 is:
pip install gdown
gdown https://drive.google.com/uc\?id\=0B8-rUzbwVRk0c054eEozWG9COHM
You may notice that the downloaded folder is organized as:
├── Market/
│   ├── bounding_box_test/      /* Files for testing (candidate images pool)
│   ├── bounding_box_train/     /* Files for training
│   ├── gt_bbox/                /* Files for multiple query testing
│   ├── gt_query/               /* We do not use it
│   ├── query/                  /* Files for testing (query images)
│   ├── readme.txt
Open and edit the script prepare.py in the editor. Change the fifth line in prepare.py to your download path, such as /home/zzd/Download/Market. Then run the script in the terminal.
python prepare.py
We create a subfolder called pytorch under the download folder.
├── Market/
│   ├── bounding_box_test/      /* Files for testing (candidate images pool)
│   ├── bounding_box_train/     /* Files for training
│   ├── gt_bbox/                /* Files for multiple query testing
│   ├── gt_query/               /* We do not use it
│   ├── query/                  /* Files for testing (query images)
│   ├── readme.txt
│   ├── pytorch/
│   │   ├── train/              /* train
│   │   │   ├── 0002
│   │   │   ├── 0007
│   │   │   ├── ...
│   │   ├── val/                /* val
│   │   ├── train_all/          /* train + val
│   │   ├── query/              /* query files
│   │   ├── gallery/            /* gallery files
│   │   ├── multi-query/        /* multiple query files
In every subfolder, such as pytorch/train/0002, images of the same identity are grouped together. Now the data is organized so that torchvision can read it.
+ Quick Question. How do we recognize images of the same ID?
For Market-1501, the image name contains the identity label and the camera id. Check the naming rule here.
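For example, here is a minimal sketch of how the identity and camera id could be parsed from a Market-1501 file name (the file name below is just an illustrative example):
import re

# A sketch: parse the identity and camera id from a Market-1501 file name.
# Names follow the pattern "<ID>_c<camera>s<sequence>_<frame>_<bbox>.jpg",
# e.g. "0002_c1s1_000451_03.jpg" -> identity 2, camera 1.
def parse_market_name(filename):
    match = re.match(r'(-?\d+)_c(\d)', filename)
    if match is None:
        raise ValueError('Unexpected file name: %s' % filename)
    pid = int(match.group(1))     # -1 marks mis-detected (junk) images
    camera = int(match.group(2))  # camera id (1~6 in Market-1501)
    return pid, camera

print(parse_market_name('0002_c1s1_000451_03.jpg'))  # (2, 1)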
We can use pretrained networks, such as AlexNet, VGG16, ResNet and DenseNet. Generally, pretrained networks help achieve better performance, since they preserve some good visual patterns learned from ImageNet [1].
In PyTorch, we can easily import them with two lines. For example,
from torchvision import models
model = models.resnet50(pretrained=True)
You can simply check the structure of the model by:
print(model)
But we need to modify the network a little. There are 751 training classes (different people) in Market-1501, which differs from the 1,000 classes in ImageNet, so we replace the final classifier with our own.
import torch
import torch.nn as nn
from torchvision import models

# Define the ResNet50-based Model
class ft_net(nn.Module):

    def __init__(self, class_num=751):
        super(ft_net, self).__init__()
        # load the ImageNet-pretrained backbone
        model_ft = models.resnet50(pretrained=True)
        # change avg pooling to global adaptive pooling
        model_ft.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.model = model_ft
        self.classifier = ClassBlock(2048, class_num)  # define our classifier (see model.py)

    def forward(self, x):
        x = self.model.conv1(x)
        x = self.model.bn1(x)
        x = self.model.relu(x)
        x = self.model.maxpool(x)
        x = self.model.layer1(x)
        x = self.model.layer2(x)
        x = self.model.layer3(x)
        x = self.model.layer4(x)
        x = self.model.avgpool(x)
        x = torch.squeeze(x)
        x = self.classifier(x)  # use our classifier
        return x
+ Quick Question. Why do we use AdaptiveAvgPool2d? What is the difference between AvgPool2d and AdaptiveAvgPool2d?
+ Quick Question. Does the model have parameters now? How do we initialize the parameters in the new layer?
More details are in model.py. You may check it later after you have gone through this practical.
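The ClassBlock used in ft_net above is defined in model.py. As a rough sketch (the exact bottleneck size, dropout rate and initialization in model.py may differ), it is a small fully connected head whose newly added layers are explicitly initialized:
import torch.nn as nn
from torch.nn import init

# A rough sketch of a ClassBlock-style head (the real one lives in model.py):
# a linear bottleneck with batch norm, LeakyReLU and dropout, followed by the
# final class_num-way classification layer. Exact sizes/rates may differ.
class ClassBlock(nn.Module):
    def __init__(self, input_dim, class_num, num_bottleneck=512, droprate=0.5):
        super(ClassBlock, self).__init__()
        self.add_block = nn.Sequential(
            nn.Linear(input_dim, num_bottleneck),
            nn.BatchNorm1d(num_bottleneck),
            nn.LeakyReLU(0.1),
            nn.Dropout(p=droprate))
        self.classifier = nn.Linear(num_bottleneck, class_num)
        # The backbone keeps its ImageNet weights; only the new layers need
        # explicit initialization (a hint for the quick question above).
        init.kaiming_normal_(self.add_block[0].weight, mode='fan_out')
        init.constant_(self.add_block[0].bias, 0.0)
        init.normal_(self.classifier.weight, std=0.001)
        init.constant_(self.classifier.bias, 0.0)

    def forward(self, x):
        x = self.add_block(x)
        x = self.classifier(x)
        return x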
OK. Now we have prepared the training data and defined model structure.
We can train a model by
python train.py --gpu_ids 0 --name ft_ResNet50 --train_all --batchsize 32 --data_dir your_data_path
--gpu_ids
which gpu to run.
--name
the name of the model.
--data_dir
the path of the training data, e.g., /home/yourname/Market/pytorch
--train_all
use all images (train + val) for training.
--batchsize
batch size.
--erasing_p
random erasing probability.
Let's look at what we do in train.py.
The first thing is how to read data and their labels from the prepared folder.
Using torch.utils.data.DataLoader, we can obtain two iterators, dataloaders['train'] and dataloaders['val'], to read the data and labels.
import os
import torch
from torchvision import datasets

image_datasets = {}
image_datasets['train'] = datasets.ImageFolder(os.path.join(data_dir, 'train'),
                                               data_transforms['train'])
image_datasets['val'] = datasets.ImageFolder(os.path.join(data_dir, 'val'),
                                             data_transforms['val'])
dataloaders = {x: torch.utils.data.DataLoader(image_datasets[x], batch_size=opt.batchsize,
                                              shuffle=True, num_workers=8)  # 8 workers may work faster
               for x in ['train', 'val']}
dataset_sizes = {x: len(image_datasets[x]) for x in ['train', 'val']}
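The data_transforms dictionary used above is defined earlier in train.py. A minimal sketch of what it could look like (the exact image size, padding and extra augmentation such as random erasing in train.py may differ):
from torchvision import transforms

# A minimal sketch of the transform dictionary; train.py may use a different
# resolution, padding, random crop and optional random erasing (--erasing_p).
data_transforms = {
    'train': transforms.Compose([
        transforms.Resize((256, 128)),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
    ]),
    'val': transforms.Compose([
        transforms.Resize((256, 128)),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
    ]),
}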
Here is the main code to train the model. Yes. It's only about 20 lines. Make sure you can understand every line of the code.
# Iterate over data.
for data in dataloaders[phase]:
    # get a batch of inputs
    inputs, labels = data
    now_batch_size, c, h, w = inputs.shape
    if now_batch_size < opt.batchsize:  # skip the last incomplete batch
        continue
    # print(inputs.shape)
    # wrap them in Variable; if a GPU is used, move the data to cuda
    if use_gpu:
        inputs = Variable(inputs.cuda())
        labels = Variable(labels.cuda())
    else:
        inputs, labels = Variable(inputs), Variable(labels)
    # zero the parameter gradients
    optimizer.zero_grad()
    # -------- forward --------
    outputs = model(inputs)
    _, preds = torch.max(outputs.data, 1)
    loss = criterion(outputs, labels)
    # -------- backward + optimize --------
    # only if in training phase
    if phase == 'train':
        loss.backward()
        optimizer.step()
+ Quick Question. Why do we need optimizer.zero_grad()? What happens if we remove it?
+ Quick Question. The dimension of the outputs is batchsize × 751. Why?
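The loop above also relies on criterion and optimizer, which are defined earlier in train.py. A minimal sketch, assuming a cross-entropy loss and plain SGD (the actual learning rates, parameter groups and scheduler in train.py may differ):
import torch.nn as nn
import torch.optim as optim

# A minimal sketch: cross-entropy loss on the 751-way classifier and plain SGD.
# train.py may use different learning rates for the pretrained backbone and the
# newly added classifier, plus a learning-rate scheduler.
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9,
                      weight_decay=5e-4, nesterov=True)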
Every 10 training epochs, we save a snapshot and update the loss curve.
if epoch % 10 == 9:
    save_network(model, epoch)
    draw_curve(epoch)
In this part, we load the network weights (which we just trained) to extract the visual features of every image.
python test.py --gpu_ids 0 --name ft_ResNet50 --test_dir your_data_path --batchsize 32 --which_epoch 59
--gpu_ids
which gpu to run.
--name
the dir name of the trained model.
--batchsize
batch size.
--which_epoch
select the i-th model.
--test_dir
the path of the testing data.
Let's look at what we do in test.py.
First, we need to import the model structure and then load the weight to the model.
model_structure = ft_net(751)
model = load_network(model_structure)
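Here load_network simply reads a saved snapshot into the model structure. A minimal sketch, assuming the snapshot path follows the net_<epoch>.pth naming used when saving (check test.py for the exact convention):
import os
import torch

# A minimal sketch: load the weights of the selected epoch into the model.
# The checkpoint path below is an assumption based on --name and --which_epoch.
def load_network(network, name='ft_ResNet50', which_epoch=59):
    save_path = os.path.join('./model', name, 'net_%s.pth' % which_epoch)
    network.load_state_dict(torch.load(save_path))
    return network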
For every query and gallery image, we extract the feature by simply forwarding the data through the network.
outputs = model(input_img)
# ---- L2-norm Feature ------
ff = outputs.data.cpu()
fnorm = torch.norm(ff, p=2, dim=1, keepdim=True)
ff = ff.div(fnorm.expand_as(ff))
+ Quick Question. Why do we also flip the test image horizontally when testing? How can we fliplr in PyTorch?
+ Quick Question. Why do we L2-normalize the feature?
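For the quick question about horizontal flipping above, one possible way to flip a batch of images along the width dimension (a sketch; newer PyTorch versions can also use torch.flip(img, [3]) directly):
import torch

# A sketch: flip a batch of images (N, C, H, W) horizontally by reversing the
# index order along the width dimension.
def fliplr(img):
    inv_idx = torch.arange(img.size(3) - 1, -1, -1, device=img.device).long()  # W-1, ..., 0
    return img.index_select(3, inv_idx)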
Yes. Now we have the feature of every image. The only thing left to do is to match the images by their features.
python evaluate_gpu.py
Let's look at what we do in evaluate_gpu.py. We sort the predicted similarity scores.
query = qf.view(-1, 1)
# print(query.shape)
score = torch.mm(gf, query)  # cosine similarity (features are L2-normalized)
score = score.squeeze(1).cpu()
score = score.numpy()
# predict index
index = np.argsort(score)  # from small to large
index = index[::-1]        # from large to small
Note that there are two kinds of images that we do not count as correct matches.
- junk_index1 is the index of mis-detected images, which contain only parts of the body.
- junk_index2 is the index of images of the same identity taken by the same camera.
query_index = np.argwhere(gl==ql)
camera_index = np.argwhere(gc==qc)
# The images of the same identity in different cameras
good_index = np.setdiff1d(query_index, camera_index, assume_unique=True)
# Only part of body is detected.
junk_index1 = np.argwhere(gl==-1)
# The images of the same identity in same cameras
junk_index2 = np.intersect1d(query_index, camera_index)
We can use the function compute_mAP to obtain the final result. In this function, we ignore the junk indices.
CMC_tmp = compute_mAP(index, good_index, junk_index)
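As a reference, here is a simplified sketch of what a compute_mAP-style function does: drop the junk indices from the ranking list, then derive the average precision and the CMC curve from the positions of the good indices (the actual implementation in evaluate_gpu.py may differ in details):
import numpy as np

# A simplified sketch: for one query, given the ranked gallery indices, return
# the average precision (ap) and the CMC curve, ignoring the junk indices.
def compute_mAP(index, good_index, junk_index):
    cmc = np.zeros(len(index), dtype=np.int32)
    if good_index.size == 0:  # no valid ground truth for this query
        cmc[0] = -1
        return 0.0, cmc
    # remove the junk images from the ranking list
    mask = np.in1d(index, junk_index, invert=True)
    index = index[mask]
    # positions (ranks) of the true matches in the cleaned ranking list
    rows_good = np.argwhere(np.in1d(index, good_index)).flatten()
    cmc[rows_good[0]:] = 1  # CMC is 1 from the first correct match onward
    # average precision: average the precision measured at each true match
    ap = 0.0
    for i, row in enumerate(rows_good):
        precision = (i + 1) * 1.0 / (row + 1)
        ap += precision / len(rows_good)
    return ap, cmc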
To visualize the result,
python demo.py --query_index 777
--query_index
which query you want to test. You may select a number in the range of 0 ~ 3367.
It is similar to evaluate.py; we add the visualization part.
try:  # Visualize Ranking Result
    # Graphical User Interface is needed
    fig = plt.figure(figsize=(16, 4))
    ax = plt.subplot(1, 11, 1)
    ax.axis('off')
    imshow(query_path, 'query')
    for i in range(10):  # Show top-10 images
        ax = plt.subplot(1, 11, i + 2)
        ax.axis('off')
        img_path, _ = image_datasets['gallery'].imgs[index[i]]
        label = gallery_label[index[i]]
        imshow(img_path)
        if label == query_label:
            ax.set_title('%d' % (i + 1), color='green')  # true matching
        else:
            ax.set_title('%d' % (i + 1), color='red')    # false matching
        print(img_path)
except RuntimeError:
    for i in range(10):
        img_path = image_datasets.imgs[index[i]]
        print(img_path[0])
    print('If you want to see the visualization of the ranking result, a graphical user interface is needed.')
- Market-1501 is a dataset collected at Tsinghua University in summer.
Let's try another dataset called DukeMTMC-reID, which was collected at Duke University in winter.
You may download the dataset Here (GoogleDrive) or via BaiduYun (password: bhbh). Try it by yourself.
The dataset is quite similar to Market-1501. You may also check the state-of-the-art results Here.
+ Quick Question. Could we directly apply the model trained on Market-1501 to DukeMTMC-reID? Why?
- Try the verification + identification loss. You may check the code at Here.
- Try the Triplet Loss. Triplet loss is another widely used objective. You may check the code at https://github.com/layumi/Person-reID-triplet-loss. I wrote that code in a similar manner, so see if you can find what I changed. (A minimal triplet-loss sketch appears after this list.)
- Pedestrians have some specific attributes, e.g., gender and whether they carry a bag. These attributes can help the feature learning. We have annotated ID-level attributes for Market-1501 and DukeMTMC-reID. You could check this paper.
- Could we use natural language as the query? Check this paper.
- Could we use other losses (e.g., the contrastive loss) to further improve the performance? Check this paper.
- Is the person re-ID dataset not large enough to train a deep network? You may check this paper (which uses a GAN to generate more samples) and try some data augmentation methods such as random erasing.
- Is the pedestrian detection bad? Try OpenPose and the Spatial Transformer to align the images.
- Data Limitation? Generate more! Code
- 3D Person Re-identification Code
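For the triplet-loss suggestion above, here is a minimal sketch using PyTorch's built-in nn.TripletMarginLoss with random features as placeholders (the linked repository uses its own sampling and hard-example mining, which this toy example does not reproduce):
import torch
import torch.nn as nn

# A toy sketch of the triplet objective: pull the anchor feature towards a
# positive feature (same identity) and push it away from a negative feature
# (different identity) by at least a margin. Real re-ID training mines hard
# triplets inside each batch instead of using random ones.
triplet = nn.TripletMarginLoss(margin=0.3)

anchor = torch.randn(32, 2048)     # features of anchor images
positive = torch.randn(32, 2048)   # same identities as the anchors
negative = torch.randn(32, 2048)   # different identities
loss = triplet(anchor, positive, negative)
print(loss.item())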
You may check the answers to the quick questions at https://github.com/layumi/Person_reID_baseline_pytorch/blob/master/tutorial/Answers_to_Quick_Questions.md
If you like this repo, please star it. Thanks a lot!
[1] Deng, Jia, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. "Imagenet: A large-scale hierarchical image database." In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248-255. Ieee, 2009.
[2] Zheng, Zhedong, Liang Zheng, and Yi Yang. "Unlabeled samples generated by gan improve the person re-identification baseline in vitro." In Proceedings of the IEEE International Conference on Computer Vision, pp. 3754-3762. 2017.
[3] Zheng, Zhedong, Xiaodong Yang, Zhiding Yu, Liang Zheng, Yi Yang, and Jan Kautz. "Joint discriminative and generative learning for person re-identification." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2138-2147. 2019.
[4] Zheng, Zhedong, Liang Zheng, and Yi Yang. "A discriminatively learned cnn embedding for person reidentification." ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 14, no. 1 (2017): 1-20.
[5] Zheng, Zhedong, Liang Zheng, and Yi Yang. "Pedestrian alignment network for large-scale person re-identification." IEEE Transactions on Circuits and Systems for Video Technology 29, no. 10 (2018): 3037-3045.
[6] Zheng, Zhedong, Liang Zheng, Michael Garrett, Yi Yang, and Yi-Dong Shen. "Dual-path convolutional image-text embedding with instance loss." ACM TOMM 2020.