Linux OOM killer terminates the process during portrait segmentation inference on a large batch of data #3486

Open
1 task done
enemy1205 opened this issue Sep 5, 2023 · 5 comments
Assignees
Labels
bug Something isn't working question Further information is requested

Comments

@enemy1205

Search before asking

  • I have searched the question and found no related answer.

Please ask your question

Following the portrait segmentation tutorial in Readme.md, I use PaddleSeg/contrib/PP-HumanSeg/src/seg_demo.py.
Because I need to segment a large batch of videos, I simplified seg_demo.py. Since I only need the binary mask, I removed most of the parts that were not required.

# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserve.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
import os
import sys

import cv2
import numpy as np

__dir__ = os.path.dirname(os.path.abspath(__file__))
sys.path.append(os.path.abspath(os.path.join(__dir__, '../../../')))
from paddleseg.utils import get_sys_env, logger, get_image_list
from infer import Predictor


def parse_args():
    parser = argparse.ArgumentParser(
        description='PP-HumanSeg inference for video')
    parser.add_argument(
        "--config",
        help="The config file of the inference model.",
        type=str,
        required=True)
    parser.add_argument(
        '--video_path', help='Video path for inference', type=str)

    parser.add_argument(
        '--vertical_screen',
        help='The input image is generated by vertical screen, i.e. height is bigger than width.'
        'For the input image, we assume the width is bigger than the height by default.',
        action='store_true')
    parser.add_argument(
        '--use_post_process', help='Use post process.', action='store_true')
    parser.add_argument(
        '--use_optic_flow', help='Use optical flow.', action='store_true')
    parser.add_argument(
        '--test_speed',
        help='Whether to test inference speed',
        action='store_true')

    return parser.parse_args()


def makedirs(save_dir):
    dirname = save_dir if os.path.isdir(save_dir) else \
        os.path.dirname(save_dir)
    if not os.path.exists(dirname):
        os.makedirs(dirname)


def seg_video(predictor,video_name,video_path,save_folder):
    assert os.path.exists(video_path), \
        'The --video_path is not existed: {}'.format(video_path)
    folder_layers=video_name.split('-')
    save_folder = os.path.join(save_folder,folder_layers[0])
    save_folder = os.path.join(save_folder,folder_layers[1]+'-'+folder_layers[2])
    save_folder = os.path.join(save_folder,folder_layers[3])[:-4]
    os.makedirs(save_folder, exist_ok=True)
    cap_img = cv2.VideoCapture(video_path)
    assert cap_img.isOpened(), "Fail to open video:{}".format(video_path)
    frame_count=1
    while cap_img.isOpened():
        ret_img, img = cap_img.read()
        if not ret_img:
            break
        out = predictor.run(img)
        frame_filename = os.path.join(save_folder, f'{video_name[:-4]}-{frame_count:03d}.png')
        cv2.imwrite(frame_filename, out)
        frame_count +=1
    cap_img.release()




if __name__ == "__main__":
    args = parse_args()
    env_info = get_sys_env()
    args.use_gpu = True if env_info['Paddle compiled with cuda'] \
        and env_info['GPUs used'] else False
    save_folder = 'mysave_path'
    # my_video_path contains a large number of video files (>10000)
    video_folder = 'my_video_path'
    if not os.path.exists(save_folder):
        os.makedirs(save_folder)
    predictor = Predictor(args)
    video_names = os.listdir(video_folder)
    video_paths = [os.path.join(video_folder,name) for name in video_names]
    for name , path in zip(video_names,video_paths):
        seg_video(predictor,name,path,save_folder)
        print(f'{name} seg complete!')

No other files were modified. The inference model is human_pp_humansegv1_server_512x512_inference_model_with_softmax.
Launch script:

#! /bin/bash
export CUDA_VISIBLE_DEVICES=7
python src/seg_demo.py --config inference_models/human_pp_humansegv1_server_512x512_inference_model_with_softmax/deploy.yaml

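After the job is moved to the background with tmux (running it directly from the command line behaves the same), it runs normally for a while.
GPU memory usage is 2~3 GB out of 24 GB (3090 Ti), and both GPU memory and GPU utilization look normal.

However, as time passes, host memory usage grows by about 1 GB roughly every half minute and keeps accumulating, until the Linux OOM killer is triggered and the process is killed.

I tried this several times and also debugged it. After each video is processed, some memory is indeed released, but each video adds more memory than is released.
In the end, even 250+ GB of memory is exhausted.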
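
To make the per-video growth described above easier to quantify, resident memory can be logged after each call to seg_video. The following is a minimal sketch, not part of the original script: `rss_mib` and `log_rss` are hypothetical helper names, and the code assumes a Linux host where /proc/self/status is readable (the fallback reports peak rather than current memory).

```python
import resource


def rss_mib():
    """Return the resident set size of this process in MiB."""
    try:
        # Linux exposes the current RSS as a line like "VmRSS:   12345 kB".
        with open("/proc/self/status") as status:
            for line in status:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1]) / 1024.0
    except OSError:
        pass
    # Fallback: peak (not current) RSS, which Linux reports in kB.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0


def log_rss(label, previous=None):
    """Print the current RSS and the change since `previous`; return the current RSS."""
    current = rss_mib()
    delta = "" if previous is None else f" ({current - previous:+.1f} MiB)"
    print(f"[mem] {label}: {current:.1f} MiB{delta}")
    return current
```

In the main loop this could be used as `rss = log_rss(name, rss)` right after each `seg_video(...)` call, with `rss = log_rss("start")` before the loop. The resulting log shows which videos add memory and how much is returned afterwards.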

@enemy1205 enemy1205 added the question Further information is requested label Sep 5, 2023
@enemy1205
Author

human_pp_humansegv1_server_512x512_inference_model_with_softmax comes from the link in readme.md. The model and its parameters have not been modified in any way.

@enemy1205
Author

image
As the figure shows, memory grows steadily until OOM.

@shiyutang shiyutang self-assigned this Oct 9, 2023
@shiyutang shiyutang added the bug Something isn't working label Oct 9, 2023
@shiyutang
Collaborator

This may be a memory leak bug. We will fix it as soon as possible.
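
Until a leak like this is located, a common general-purpose workaround is to run each video in a fresh interpreter so the operating system reclaims all memory when that process exits. The sketch below is an assumption, not a fix from the maintainers: `build_cmd` and `run_isolated` are hypothetical helpers, and it presumes the script's main block is changed to process the single file passed via the existing `--video_path` argument instead of a hard-coded folder.

```python
import subprocess
import sys


def build_cmd(config, video_path, script="src/seg_demo.py"):
    """Build the command line for one video (assumes the script honors --video_path)."""
    return [sys.executable, script, "--config", config, "--video_path", video_path]


def run_isolated(config, video_paths, script="src/seg_demo.py"):
    """Run each video in its own Python process and return the paths that failed."""
    failed = []
    for path in video_paths:
        # Each child process loads the model, handles one video, then exits,
        # so any memory it leaked is returned to the OS.
        result = subprocess.run(build_cmd(config, path, script))
        if result.returncode != 0:
            failed.append(path)
    return failed
```

The trade-off is that the model is reloaded for every video, which adds startup time. Passing several videos to each child process would amortize that cost, at the price of a further change to the script.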

@shiyutang
Collaborator

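@enemy1205 We do not seem to be able to reproduce this problem on our side. Could you provide more information about your environment, such as the paddle version?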
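
For reference, the kind of environment details requested here can be gathered in one step. This is a minimal sketch using only the standard library. `collect_env` is a hypothetical helper, and the default package names are assumptions about what is relevant to this report.

```python
import platform
import sys
from importlib import metadata


def collect_env(packages=("paddlepaddle-gpu", "paddlepaddle", "opencv-python", "numpy")):
    """Return a dict with the interpreter, OS, and installed package versions."""
    info = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[name] = "not installed"
    return info


if __name__ == "__main__":
    for key, value in collect_env().items():
        print(f"{key}: {value}")
```

The output can be pasted into the issue alongside the GPU driver and CUDA versions, which this sketch does not query.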

@enemy1205
Author

image
The figure shows a record taken during testing: 36% of the 256 GB of memory is already in use, and usage is still rising.

$ nvidia-smi
Wed Nov 29 23:43:48 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.89.02    Driver Version: 525.89.02    CUDA Version: 12.0     |
$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0
Python 3.10.12 (main, Jul  5 2023, 18:54:27) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> import paddle
>>> print(paddle.__version__)
2.5.1
>>> paddle.fluid.is_compiled_with_cuda()
True
>>> paddle.utils.run_check()
Running verify PaddlePaddle program ...
I1129 23:33:14.719758 2525999 interpretercore.cc:237] New Executor is Running.
W1129 23:33:14.720786 2525999 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.6, Driver API Version: 12.0, Runtime API Version: 11.1
W1129 23:33:14.724933 2525999 gpu_resources.cc:149] device: 0, cuDNN Version: 8.0.
$ pip list
Package            Version
------------------ -------------
anyio              4.0.0
astor              0.8.1
Babel              2.12.1
bce-python-sdk     0.8.90
blinker            1.6.2
cachetools         5.3.1
certifi            2023.7.22
charset-normalizer 3.2.0
click              8.1.7
contourpy          1.1.0
cycler             0.11.0
decorator          5.1.1
exceptiongroup     1.1.3
filelock           3.12.3
Flask              2.3.3
flask-babel        3.1.0
fonttools          4.42.1
future             0.18.3
h11                0.14.0
httpcore           0.17.3
httpx              0.24.1
idna               3.4
itsdangerous       2.1.2
Jinja2             3.1.2
joblib             1.3.2
kiwisolver         1.4.5
MarkupSafe         2.1.3
matplotlib         3.7.2
numpy              1.25.2
nvidia-ml-py       12.535.108
nvitop             1.3.0
opencv-python      4.5.5.64
opt-einsum         3.3.0
packaging          23.1
paddle-bfloat      0.1.7
paddlepaddle-gpu   2.5.1.post112
pandas             2.1.0
Pillow             10.0.0
pip                23.2.1
prettytable        3.8.0
protobuf           4.24.2
psutil             5.9.5
pycryptodome       3.18.0
pyparsing          3.0.9
python-dateutil    2.8.2
pytz               2023.3
PyYAML             6.0.1
rarfile            4.0
requests           2.31.0
scikit-learn       1.3.0
scipy              1.11.2
setuptools         68.0.0
six                1.16.0
sniffio            1.3.0
termcolor          2.3.0
threadpoolctl      3.2.0
tqdm               4.66.1
typing_extensions  4.7.1
tzdata             2023.3
urllib3            2.0.4
visualdl           2.5.3
wcwidth            0.2.6
Werkzeug           2.3.7
wheel              0.38.4

Ubuntu 22.04.2

5.19.0-42-generic
