Linux OOM killer terminates the process during large-batch human segmentation inference #3486

enemy1205 · 2023-09-05T02:44:58Z

Search before asking

I have searched the question and found no related answer.

Please ask your question

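Following Readme.md and the human segmentation tutorial, I am using PaddleSeg/contrib/PP-HumanSeg/src/seg_demo.py.
Because I need to segment a large batch of videos, I simplified seg_demo.py. I only need the binary mask, so I removed most of the parts that were not required.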

# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserve.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
import os
import sys

import cv2
import numpy as np

__dir__ = os.path.dirname(os.path.abspath(__file__))
sys.path.append(os.path.abspath(os.path.join(__dir__, '../../../')))
from paddleseg.utils import get_sys_env, logger, get_image_list
from infer import Predictor


def parse_args():
    parser = argparse.ArgumentParser(
        description='PP-HumanSeg inference for video')
    parser.add_argument(
        "--config",
        help="The config file of the inference model.",
        type=str,
        required=True)
    parser.add_argument(
        '--video_path', help='Video path for inference', type=str)

    parser.add_argument(
        '--vertical_screen',
        help='The input image is generated by vertical screen, i.e. height is bigger than width.'
        'For the input image, we assume the width is bigger than the height by default.',
        action='store_true')
    parser.add_argument(
        '--use_post_process', help='Use post process.', action='store_true')
    parser.add_argument(
        '--use_optic_flow', help='Use optical flow.', action='store_true')
    parser.add_argument(
        '--test_speed',
        help='Whether to test inference speed',
        action='store_true')

    return parser.parse_args()


def makedirs(save_dir):
    dirname = save_dir if os.path.isdir(save_dir) else \
        os.path.dirname(save_dir)
    if not os.path.exists(dirname):
        os.makedirs(dirname)


def seg_video(predictor,video_name,video_path,save_folder):
    assert os.path.exists(video_path), \
        'The --video_path is not existed: {}'.format(video_path)
    folder_layers=video_name.split('-')
    save_folder = os.path.join(save_folder,folder_layers[0])
    save_folder = os.path.join(save_folder,folder_layers[1]+'-'+folder_layers[2])
    save_folder = os.path.join(save_folder,folder_layers[3])[:-4]
    os.makedirs(save_folder, exist_ok=True)
    cap_img = cv2.VideoCapture(video_path)
    assert cap_img.isOpened(), "Fail to open video:{}".format(video_path)
    frame_count=1
    while cap_img.isOpened():
        ret_img, img = cap_img.read()
        if not ret_img:
            break
        out = predictor.run(img)
        frame_filename = os.path.join(save_folder, f'{video_name[:-4]}-{frame_count:03d}.png')
        cv2.imwrite(frame_filename, out)
        frame_count +=1
    cap_img.release()




if __name__ == "__main__":
    args = parse_args()
    env_info = get_sys_env()
    args.use_gpu = True if env_info['Paddle compiled with cuda'] \
        and env_info['GPUs used'] else False
    save_folder = 'mysave_path'
    # my_video_path contains a large number of video files (>10000)
    video_folder = 'my_video_path'
    if not os.path.exists(save_folder):
        os.makedirs(save_folder)
    predictor = Predictor(args)
    video_names = os.listdir(video_folder)
    video_paths = [os.path.join(video_folder,name) for name in video_names]
    for name , path in zip(video_names,video_paths):
        seg_video(predictor,name,path,save_folder)
        print(f'{name} seg complete!')

No other files were modified. The inference model used is human_pp_humansegv1_server_512x512_inference_model_with_softmax.
Launch script:

#! /bin/bash
export CUDA_VISIBLE_DEVICES=7
python src/seg_demo.py --config inference_models/human_pp_humansegv1_server_512x512_inference_model_with_softmax/deploy.yaml

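After the job is detached into the background with tmux (running it directly from the command line behaves the same), it runs normally for a while.
GPU memory usage is 2-3 GB out of 24 GB (3090 Ti), and both GPU memory and GPU utilization look normal.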

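However, as time passes, host memory usage grows by roughly 1 GB about every 30 seconds and keeps accumulating, until the Linux OOM killer finally terminates the process.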

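I have tried this several times and debugged it. After each video finishes, some memory is indeed released, but each video adds more memory than is released.
Eventually even 250 GB+ of RAM is completely used up.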
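To put numbers on the per-video growth described above, the resident set size (RSS) of the process could be logged after each video. The following is a minimal sketch, assuming a Linux host where /proc/self/status is available; the helper names `parse_vm_rss_kb` and `current_rss_mb` are illustrative and are not part of PaddleSeg.

```python
def parse_vm_rss_kb(status_text):
    """Return the VmRSS value in kB from the text of /proc/<pid>/status.

    Returns None if the text contains no VmRSS line.
    """
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # The line looks like "VmRSS:\t  123456 kB".
            return int(line.split()[1])
    return None


def current_rss_mb(pid="self"):
    """Read the resident set size of a process in MiB (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        rss_kb = parse_vm_rss_kb(f.read())
    return None if rss_kb is None else rss_kb / 1024
```

Calling `current_rss_mb()` right after each `seg_video(...)` call in the main loop and printing the value next to the video name would show whether the growth is roughly constant per video or tied to specific files.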


enemy1205 · 2023-09-05T02:50:27Z

human_pp_humansegv1_server_512x512_inference_model_with_softmax comes from the link in readme.md. The model and its parameters have not been modified in any way.

enemy1205 · 2023-09-05T13:40:49Z

As the figure shows, memory grows steadily until OOM.
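Until the leak itself is located, one possible stopgap (not proposed anywhere in this thread) is to process the videos in batches, each in a fresh Python process, so that the operating system reclaims all memory when a batch exits. The sketch below assumes a hypothetical worker script that takes `--config` plus a list of video paths as positional arguments; the seg_demo.py posted above does not accept such arguments and would need to be adapted.

```python
import subprocess
import sys


def chunked(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [items[i:i + size] for i in range(0, len(items), size)]


def build_command(script, config, videos):
    """Build the command line for one batch.

    Assumption: `script` is a hypothetical worker that accepts video paths
    as positional arguments after --config.
    """
    return [sys.executable, script, "--config", config, *videos]


def run_in_batches(script, config, videos, batch_size=200):
    """Run each batch in its own process; its memory is freed on exit."""
    for batch in chunked(videos, batch_size):
        subprocess.run(build_command(script, config, batch), check=True)
```

The batch size trades model start-up overhead against peak memory: a smaller batch reloads the model more often but bounds how much leaked memory can accumulate before the process exits.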

shiyutang · 2023-10-09T11:36:30Z

This may be a memory leak bug. We will fix it as soon as possible.
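One way to narrow down a suspected leak like this is to compare `tracemalloc` snapshots taken before and after a unit of work. The sketch below uses only the Python standard library; the helper name `top_growth` is illustrative. Note the limitation: `tracemalloc` only sees allocations made through Python's allocator, so if the snapshots show little growth while the process RSS keeps climbing, the leak is more likely in native code (for example inside the inference engine or OpenCV) than in the Python script.

```python
import tracemalloc


def top_growth(before, after, limit=5):
    """Return the source lines whose traced memory changed the most.

    `before` and `after` are tracemalloc.Snapshot objects; the result is a
    list of StatisticDiff entries, largest absolute size change first.
    """
    return after.compare_to(before, "lineno")[:limit]


if __name__ == "__main__":
    tracemalloc.start()
    before = tracemalloc.take_snapshot()
    # Stand-in for the real work, e.g. processing one video.
    retained = [bytes(1024) for _ in range(1000)]
    after = tracemalloc.take_snapshot()
    for stat in top_growth(before, after):
        print(stat)
```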

shiyutang · 2023-11-29T11:41:50Z

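@enemy1205 We do not seem to be able to reproduce this issue on our side. Could you provide more information about your environment, such as the paddle version?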

enemy1205 · 2023-11-29T15:59:05Z

The figure shows a record taken during testing: 36% of the 256 GB of RAM is already in use and usage is still rising.

$ nvidia-smi
Wed Nov 29 23:43:48 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.89.02    Driver Version: 525.89.02    CUDA Version: 12.0     |
$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0

Python 3.10.12 (main, Jul  5 2023, 18:54:27) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

import paddle
print(paddle.__version__)
2.5.1
paddle.fluid.is_compiled_with_cuda()
True

paddle.utils.run_check()
Running verify PaddlePaddle program ...
I1129 23:33:14.719758 2525999 interpretercore.cc:237] New Executor is Running.
W1129 23:33:14.720786 2525999 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.6, Driver API Version: 12.0, Runtime API Version: 11.1
W1129 23:33:14.724933 2525999 gpu_resources.cc:149] device: 0, cuDNN Version: 8.0.

$ pip list
Package            Version
------------------ -------------
anyio              4.0.0
astor              0.8.1
Babel              2.12.1
bce-python-sdk     0.8.90
blinker            1.6.2
cachetools         5.3.1
certifi            2023.7.22
charset-normalizer 3.2.0
click              8.1.7
contourpy          1.1.0
cycler             0.11.0
decorator          5.1.1
exceptiongroup     1.1.3
filelock           3.12.3
Flask              2.3.3
flask-babel        3.1.0
fonttools          4.42.1
future             0.18.3
h11                0.14.0
httpcore           0.17.3
httpx              0.24.1
idna               3.4
itsdangerous       2.1.2
Jinja2             3.1.2
joblib             1.3.2
kiwisolver         1.4.5
MarkupSafe         2.1.3
matplotlib         3.7.2
numpy              1.25.2
nvidia-ml-py       12.535.108
nvitop             1.3.0
opencv-python      4.5.5.64
opt-einsum         3.3.0
packaging          23.1
paddle-bfloat      0.1.7
paddlepaddle-gpu   2.5.1.post112
pandas             2.1.0
Pillow             10.0.0
pip                23.2.1
prettytable        3.8.0
protobuf           4.24.2
psutil             5.9.5
pycryptodome       3.18.0
pyparsing          3.0.9
python-dateutil    2.8.2
pytz               2023.3
PyYAML             6.0.1
rarfile            4.0
requests           2.31.0
scikit-learn       1.3.0
scipy              1.11.2
setuptools         68.0.0
six                1.16.0
sniffio            1.3.0
termcolor          2.3.0
threadpoolctl      3.2.0
tqdm               4.66.1
typing_extensions  4.7.1
tzdata             2023.3
urllib3            2.0.4
visualdl           2.5.3
wcwidth            0.2.6
Werkzeug           2.3.7
wheel              0.38.4

Ubuntu 22.04.2

5.19.0-42-generic

enemy1205 added the question Further information is requested label Sep 5, 2023

shiyutang self-assigned this Oct 9, 2023

shiyutang added the bug Something isn't working label Oct 9, 2023

chenjjcccc mentioned this issue Oct 24, 2023

[Bug Fix] humanseg GPU memory leak #3543

Closed
