Optical Music Recognition using Deep Learning
The DeepScoresV2¹ Dataset for Music Object Detection contains digitally rendered images of written sheet music, together with the corresponding ground truth for fitting various types of machine learning models. A total of 151 million instances of music symbols, belonging to 135 different classes, are annotated. The full dataset contains 255,385 images. For most research purposes, the dense version, containing 1,714 of the most diverse and interesting images, should suffice.
The dataset contains ground truth in the form of:
- Non-oriented bounding boxes
- Oriented bounding boxes
- Semantic segmentation
- Instance segmentation
Download here.
The accompanying paper, *The DeepScoresV2 Dataset and Benchmark for Music Object Detection*, was published at ICPR 2020.
Other datasets.
I copied the `obb_anns` toolkit for convenient loading and inspection of the data and modified it, because some of its external dependencies have deprecated the functions it uses.
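Loading and inspecting an annotation file then looks roughly like this (a minimal sketch based on the toolkit's README; the exact method names in my modified copy may differ, and the annotation path is a placeholder):

```python
from obb_anns import OBBAnns

# Placeholder path to a DeepScoresV2 annotation file.
anns = OBBAnns("deepscores_train.json")
anns.load_annotations()

# Fetch one image together with its annotations and draw the boxes.
imgs, obbs = anns.get_img_ann_pair(idxs=[0])
anns.visualize(img_idx=0)
```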
I decided to try YOLOv10 because it was the latest model at the start of this project, and because it eliminates the need for non-maximum suppression (NMS) during inference, which reduces latency by up to 30%.
The DeepScoresV2 dataset contains high-resolution images with lots of small objects; the smallest image is 2772x1960 pixels. Using the `obb_anns` toolkit and `ultralytics.utils`, I wrote a dataset config file that transforms the dataset into YOLO format. Unfortunately, with my hardware I couldn't train a proper object detector on such large images, so I wrote a custom script that uses `sahi.utils` to slice the dataset in YOLO format (`sahi` only ships a utility for slicing datasets in COCO format). The sliced dataset currently contains 41,766 non-empty 640x640 px images.
As this is a sheet-music dataset, the augmentation parameters had to be adjusted accordingly (a training call with these overrides is sketched after the list):
- color adjustments:
- hsv_h: 0.015 # image HSV-Hue augmentation (fraction)
- hsv_s: 0.7 # image HSV-Saturation augmentation (fraction)
- hsv_v: 0.4 # image HSV-Value augmentation (fraction)
- transformation - a little rotation and translation is allowed, but no shear or perspective distortion:
- degrees: 5.0 # image rotation (+/- deg)
- translate: 0.1 # image translation (+/- fraction)
- scale: 0.3 # image scale (+/- gain)
- shear: 0.0 # image shear (+/- deg)
- perspective: 0.0 # (float) image perspective (+/- fraction), range 0-0.001
- orientation - music sheets can only be read one way, so no flips:
- flipud: 0.0 # image flip up-down (probability)
- fliplr: 0.0 # image flip left-right (probability)
- mixing - music sheets are well structured, with no hidden or half-hidden objects, so no mosaic or mixup:
- mosaic: 0.0 # image mosaic (probability)
- mixup: 0.0 # image mixup (probability)
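A minimal sketch of the training call with these overrides, assuming the `ultralytics` Python API can load YOLOv10 weights under this name (the dataset config path and epoch count are placeholders):

```python
from ultralytics import YOLO

# Placeholder weights name and dataset config for the sliced 640x640 images.
model = YOLO("yolov10s.pt")
model.train(
    data="deepscores_sliced.yaml",
    imgsz=640,
    epochs=200,
    hsv_h=0.015, hsv_s=0.7, hsv_v=0.4,      # mild color jitter
    degrees=5.0, translate=0.1, scale=0.3,  # small geometric changes only
    shear=0.0, perspective=0.0,             # no shear or perspective warps
    flipud=0.0, fliplr=0.0,                 # scores read one way: no flips
    mosaic=0.0, mixup=0.0,                  # keep the page structure intact
)
```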
I trained my object detector for 163 epochs; it reaches 0.74 mAP50 and 0.54 mAP50-95. Here are my loss/val and metrics plots:
For inference I used `get_sliced_prediction` from `sahi` with a 640x640 slice size and a 0.2 overlap ratio. `sahi` is currently not compatible with YOLOv10, so I added a custom `Yolov10DetectionModel` (usage sketched below). Here is an example inference with a 0.4 confidence threshold:
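A sketch of the sliced-inference call: `Yolov10DetectionModel` stands for the custom wrapper mentioned above, but its import path and constructor arguments here are assumptions, and the weight and image paths are placeholders.

```python
from sahi.predict import get_sliced_prediction

# Custom sahi wrapper around YOLOv10 (hypothetical module path).
from yolov10_detection_model import Yolov10DetectionModel

model = Yolov10DetectionModel(
    model_path="best.pt",      # trained weights (placeholder)
    confidence_threshold=0.4,
    device="cuda:0",
)

result = get_sliced_prediction(
    "sheet_page.png",          # full-resolution score page (placeholder)
    model,
    slice_height=640,
    slice_width=640,
    overlap_height_ratio=0.2,
    overlap_width_ratio=0.2,
)
result.export_visuals(export_dir="predictions/")
```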
The model is publicly available on Roboflow and works on images of size 640x640. To use it with the SAHI slicer, here is a Roboflow workflow.
Footnotes
1. L. Tuggener, Y. P. Satyawan, A. Pacha, J. Schmidhuber, and T. Stadelmann, "DeepScoresV2". Zenodo, Sep. 02, 2020. doi: 10.5281/zenodo.4012193.