out of memory when using -map #8308

Open
stephanecharette opened this issue Dec 25, 2021 · 7 comments

@stephanecharette
Collaborator

Attempting to train a network with this command:

darknet detector -map -dont_show train /home/stephane/nn/page_orientation/page_orientation.data /home/stephane/nn/page_orientation/page_orientation.cfg

When it gets to calculating the mAP, the Linux kernel eventually kills darknet due to out-of-memory.

The darknet log shows thousands of repeating lines before the process is killed:

 calculation mAP (mean average precision)...
 Detection layer: 30 - type = 28 
 Detection layer: 37 - type = 28 

 detections_count = 9179725, unique_truth_count = 66113  
 rank = 0 of ranks = 9179725 
 rank = 100 of ranks = 9179725 
 rank = 200 of ranks = 9179725 
 rank = 300 of ranks = 9179725 
 rank = 400 of ranks = 9179725 
...
 rank = 4148400 of ranks = 9179725 
 rank = 4148500 of ranks = 9179725 
 rank = 4148600 of ranks = 9179725 
 rank = 4148700 of ranks = 9179725 
 rank = 4148800 of ranks = 9179725 
 rank = 4148900 of ranks = 9179725 
 rank = 4149000 of ranks = 9179725 
 rank = 4149100
Command terminated by signal 9

The neural network is yolov4-tiny with 180 classes. There are 6624 training images and 1810 validation images. Max batches is set to 360000.
The rig is an RTX 2070 with 8 GB of VRAM, and the system has 32 GB of RAM. At the time the Linux kernel kills darknet, dmesg reports the following:

[166238.809496] [ pid ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[166238.809597] [ 9445]  1000  9445 21308368  8010473 71512064        0             0 darknet
[166238.809602] Out of memory: Kill process 9445 (darknet) score 976 or sacrifice child
[166238.809648] Killed process 9445 (darknet) total-vm:85233472kB, anon-rss:31672024kB, file-rss:220468kB, shmem-rss:149400kB

So this is saying darknet is using 31.67 GB of RAM on a system with 32 GB installed.

Any idea why that is, or what I can do to fix it?

@stephanecharette
Collaborator Author

Found another way to replicate the exact same behaviour, by running this command:

~/src/darknet/darknet detector map ~/nn/page_orientation/page_orientation.data ~/nn/page_orientation/page_orientation.cfg page_orientation_last.weights

This results in the following:

Done! Loaded 38 layers from weights-file 

 calculation mAP (mean average precision)...
 Detection layer: 30 - type = 28 
 Detection layer: 37 - type = 28 
6624
 detections_count = 5113369, unique_truth_count = 238088  
 rank = 2093000 of ranks = 51133fish: “~/src/darknet/darknet detector…” terminated by signal SIGKILL (Forced quit)

Watching it run with htop in another window, the problem only appears once it starts printing the "rank = ..." lines on the screen. Up to that point darknet was only consuming a few MB of RAM, but whatever happens once the "rank" messages start, it takes only a few seconds before all RAM is consumed.
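
For what it's worth, a rough back-of-the-envelope estimate. This assumes the mAP code allocates one precision/recall entry per class per detection right before the "rank" loop; that is an assumption about validate_detector_map() in detector.c, not a verified reading of the code, and the struct layout below is likewise assumed. If it holds, it would explain both why memory only blows up once the "rank" lines start and why 32 GB is nowhere near enough:

/* rough estimate only; the pr_t layout is an assumption, not copied from detector.c */
#include <stdio.h>
#include <stdint.h>

typedef struct { double precision, recall; int tp, fp, fn; } pr_t;   /* ~32 bytes with padding */

int main(void) {
    const uint64_t classes = 180;               /* from this network's cfg */
    const uint64_t detections_count = 9179725;  /* from the training-run log above */
    const uint64_t bytes = classes * detections_count * sizeof(pr_t);
    printf("estimated per-class, per-detection table: %.1f GB\n", bytes / 1e9);  /* ~52.9 GB */
    return 0;
}

With the 5,113,369 detections from the standalone map run, the same estimate still comes to roughly 29 GB, so under this assumption the table alone would exceed or nearly exhaust the 32 GB of installed RAM either way.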

@stephanecharette
Collaborator Author

Some findings using valgrind; there appear to be some memory leaks:

The plist from detector.c:970:

 6,144 bytes in 12 blocks are indirectly lost in loss record 3,962 of 3,971
==00:01:42:41.863 42724==    at 0x4843839: malloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==00:01:42:41.863 42724==    by 0x18640D: xmalloc_location (utils.c:29)
==00:01:42:41.863 42724==    by 0x187C6F: fgetl (utils.c:441)
==00:01:42:41.863 42724==    by 0x1C2D04: get_paths (data.c:24)
==00:01:42:41.863 42724==    by 0x201E24: validate_detector_map (detector.c:970)
==00:01:42:41.863 42724==    by 0x2071CB: run_detector (detector.c:2023)
==00:01:42:41.863 42724==    by 0x1E96CD: main (darknet.c:493)

The array from detector.c:971:

==00:01:42:41.645 42724== 96 bytes in 1 blocks are definitely lost in loss record 3,272 of 3,971
==00:01:42:41.645 42724==    at 0x4848A23: calloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==00:01:42:41.645 42724==    by 0x186468: xcalloc_location (utils.c:37)
==00:01:42:41.645 42724==    by 0x18F037: list_to_array (list.c:108)
==00:01:42:41.645 42724==    by 0x201E3A: validate_detector_map (detector.c:971)
==00:01:42:41.645 42724==    by 0x2071CB: run_detector (detector.c:2023)
==00:01:42:41.645 42724==    by 0x1E96CD: main (darknet.c:493)

The truth boxes from detector.c:1078:

 17,040 bytes in 12 blocks are definitely lost in loss record 3,969 of 3,971
==00:01:42:41.864 42724==    at 0x4848C73: realloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==00:01:42:41.864 42724==    by 0x1864C8: xrealloc_location (utils.c:45)
==00:01:42:41.864 42724==    by 0x1C3914: read_boxes (data.c:224)
==00:01:42:41.864 42724==    by 0x2029A1: validate_detector_map (detector.c:1078)
==00:01:42:41.864 42724==    by 0x2071CB: run_detector (detector.c:2023)
==00:01:42:41.864 42724==    by 0x1E96CD: main (darknet.c:493)
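
Going by those traces, the missing cleanup might look roughly like this. This is only a sketch: free_list_contents() and free_list() are the helpers I believe darknet's list.c provides, and the exact placement inside validate_detector_map() is an assumption rather than a verified patch:

/* sketch of the cleanup suggested by the valgrind traces above;
   placement inside validate_detector_map() is assumed, not verified */

/* per validation image: the truth boxes returned by read_boxes() (detector.c:1078) */
box_label *truth = read_boxes(labelpath, &num_labels);
/* ... match the truth boxes against the detections ... */
free(truth);

/* once the mAP pass is finished: the validation path list and array (detector.c:970-971) */
free_list_contents(plist);   /* frees each path string held by the list nodes */
free_list(plist);            /* frees the nodes and the list itself */
free(paths);                 /* frees the array returned by list_to_array() */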

@AlexeyAB does this output from valgrind help?

@1000plus900plus40plus8

Facing the same issue:

total_bbox = 24137151, rewritten_bbox = 3.886664 % 
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 30 Avg (IOU: 0.894058), count: 177, class_loss = 1.757932, iou_loss = 54.255562, total_loss = 56.013493 
v3 (iou loss, Normalizer: (iou: 0.07, obj: 1.00, cls: 1.00) Region 37 Avg (IOU: 0.867980), count: 457, class_loss = 16.576008, iou_loss = 715.146118, total_loss = 731.722107 
 total_bbox = 24137785, rewritten_bbox = 3.886670 % 
Loaded: 0.000094 seconds

 (next mAP calculation at 6948 iterations) 
 Last accuracy mAP@0.5 = 99.73 %, best = 99.73 % 
 6948: 3.916746, 3.769814 avg loss, 0.002610 rate, 4.999030 seconds, 444672 images, 190.365599 hours left
13900
 calculation mAP (mean average precision)...
 Detection layer: 30 - type = 28 
 Detection layer: 37 - type = 28 

 detections_count = 6538768, unique_truth_count = 375368  
 rank = 2328600 of ranks = 6538768Command terminated by signal 9
	Command being timed: "/home/user/darknet/darknet detector -map -dont_show train /home/user/nn/example_network_name/example_network_name.data /home/user/nn/example_network_name/example_network_name.cfg /home/user/nn/example_network_name/example_network_name_best.weights -clear"

From dmesg I can see this is a memory issue:

[468171.250720] Out of memory: Killed process 2320380 (darknet) total-vm:52036032kB, anon-rss:7779456kB, file-rss:39756kB, shmem-rss:47928kB, UID:1000 pgtables:18308kB oom_score_adj:0

@stephanecharette
Collaborator Author

@AlexeyAB Can the memory leak fixes from PR #8314 be merged?

But note that those memory leak fixes are not enough to fix the problem in my case. The -map calculation still takes more RAM than I have installed. Do you have additional ideas as to why this takes up so much RAM when a project has a high number of annotations/images? Is there anything we can do as users to limit it?

AlexeyAB pushed a commit that referenced this issue Jan 9, 2022
* issue #8308: memory leaks in map

* update the window title with some training stats
@lsd1994

lsd1994 commented Jan 13, 2022

I have the same issue.

@ElHouas

ElHouas commented Jun 17, 2022

Hi @stephanecharette,

I have the same problem as you and I have your fix in my local darknet. Are you still facing the same issue when training with a large number of images?

Thanks!

@stephanecharette
Collaborator Author

I implemented a work-around of sorts in DarkMark to get around this problem:

[screenshot of the work-around in DarkMark]

While it seems to work for me, I have no idea what the real problem is, so I'm not going to guarantee this will fix the issue for you.

AlexeyAB pushed a commit that referenced this issue Sep 21, 2022
…8670)

* issue #8308: memory leaks in map

* update the window title with some training stats

* make sure _best.weights is the most recent weights with that mAP%