Optical Music Recognition using Deep Learning

Object Detection

Dataset

The DeepScoresV2¹ Dataset for Music Object Detection contains digitally rendered images of written sheet music, together with the corresponding ground truth to fit various types of machine learning models. A total of 151 million instances of music symbols, belonging to 135 different classes, are annotated. The full dataset contains 255,385 images. For most research purposes, the dense version, containing 1714 of the most diverse and interesting images, should suffice.

The dataset contains ground truth in the form of:

  • Non-oriented bounding boxes
  • Oriented bounding boxes
  • Semantic segmentation
  • Instance segmentation

Download here.

The accompanying paper, The DeepScoresV2 Dataset and Benchmark for Music Object Detection, was published at ICPR 2020.

Other datasets.

obb_anns

This toolkit provides convenient loading and inspection of the data. I copied it into this repo and made some changes, because some of its external dependencies have deprecated the functions it relied on.
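
A minimal sketch of loading and inspecting the annotations with it; the method names follow the upstream obb_anns examples, and the patched copy in this repo may differ slightly:

```python
from obb_anns import OBBAnns

anns = OBBAnns("deepscores_train.json")  # a DeepScoresV2 annotation file (placeholder path)
anns.load_annotations()

# Render one page with its oriented bounding boxes drawn on top.
anns.visualize(img_idx=0, show=True)
```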

Data preparation

I decided to try YOLOv10 because it was the latest model at the start of this project and because it eliminates the need for non-maximum suppression (NMS) during inference, which reduces latency by up to 30%.

The DeepScoresV2 dataset contains high-resolution images with lots of small objects; the smallest image is 2772x1960 pixels. Using the obb_anns toolkit and ultralytics.utils, I wrote a dataset config file that transforms the dataset to YOLO format. Unfortunately, with my hardware I couldn't train a proper object detector on such large images. In order to train my detector, I wrote a custom script that uses sahi.utils to slice a dataset in YOLO format (sahi only ships a utility for slicing datasets in COCO format). Right now my sliced dataset contains 41,766 non-empty 640x640 px images.
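
For illustration, here is a minimal sketch of the slicing idea, using sahi.slicing.slice_image on a single page and remapping the YOLO labels by hand. The paths, file names and the center-inside box heuristic are illustrative assumptions, not the exact script from this repo:

```python
# Sketch: slice one YOLO-format page into 640x640 tiles with sahi.
# Paths and the box-assignment heuristic are illustrative.
from pathlib import Path

import cv2
from sahi.slicing import slice_image

IMG_DIR, LBL_DIR, OUT_DIR = Path("images"), Path("labels"), Path("sliced")

def slice_page(img_path: Path, size: int = 640, overlap: float = 0.2) -> None:
    image = cv2.imread(str(img_path))
    h, w = image.shape[:2]

    # YOLO labels: "class cx cy bw bh", normalized to the full image.
    boxes = []
    for line in (LBL_DIR / f"{img_path.stem}.txt").read_text().splitlines():
        c, cx, cy, bw, bh = line.split()
        boxes.append((c, float(cx) * w, float(cy) * h, float(bw) * w, float(bh) * h))

    result = slice_image(
        image=image,
        slice_height=size,
        slice_width=size,
        overlap_height_ratio=overlap,
        overlap_width_ratio=overlap,
    )

    for i, (tile, (sx, sy)) in enumerate(zip(result.images, result.starting_pixels)):
        lines = []
        for c, cx, cy, bw, bh in boxes:
            # Assign a box to a tile if its center falls inside it
            # (overhanging boxes are left un-clipped for brevity).
            if sx <= cx < sx + size and sy <= cy < sy + size:
                lines.append(f"{c} {(cx - sx) / size:.6f} {(cy - sy) / size:.6f} "
                             f"{bw / size:.6f} {bh / size:.6f}")
        if lines:  # keep only non-empty tiles, as described above
            cv2.imwrite(str(OUT_DIR / f"{img_path.stem}_{i}.png"), tile)
            (OUT_DIR / f"{img_path.stem}_{i}.txt").write_text("\n".join(lines))

for p in sorted(IMG_DIR.glob("*.png")):
    slice_page(p)
```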

Data augmentation

As it is a music sheet dataset, the augmentation parameters had to be adjusted accordingly (see the training sketch after this list):

  • color adjustments:
    • hsv_h: 0.015 # image HSV-Hue augmentation (fraction)
    • hsv_s: 0.7 # image HSV-Saturation augmentation (fraction)
    • hsv_v: 0.4 # image HSV-Value augmentation (fraction)
  • transformation - a little rotation and translation is fine, but no shear or perspective distortion:
    • degrees: 5.0 # image rotation (+/- deg)
    • translate: 0.1 # image translation (+/- fraction)
    • scale: 0.3 # image scale (+/- gain)
    • shear: 0.0 # image shear (+/- deg)
    • perspective: 0.0 # (float) image perspective (+/- fraction), range 0-0.001
  • orientation - music sheets can be read only in one way, no flips:
    • flipud: 0.0 # image flip up-down (probability)
    • fliplr: 0.0 # image flip left-right (probability)
  • mixing - music sheets are well structured, no hidden or half-hidden objects:
    • mosaic: 0.0 # image mosaic (probability)
    • mixup: 0.0 # image mixup (probability)
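
These values map one-to-one onto the training arguments of the ultralytics trainer. A minimal sketch of the training run, assuming an ultralytics version that ships YOLOv10; the weights file and dataset YAML names are placeholders:

```python
from ultralytics import YOLO

model = YOLO("yolov10s.pt")  # any YOLOv10 variant
model.train(
    data="deepscores_sliced.yaml",  # placeholder dataset config
    imgsz=640,
    epochs=163,                     # the run described under Training below
    # color adjustments
    hsv_h=0.015, hsv_s=0.7, hsv_v=0.4,
    # mild geometric transforms only
    degrees=5.0, translate=0.1, scale=0.3, shear=0.0, perspective=0.0,
    # no flips, no mosaic, no mixup
    flipud=0.0, fliplr=0.0, mosaic=0.0, mixup=0.0,
)
```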

Training

I trained my object detector for 163 epochs; it reaches 0.74 mAP50 and 0.54 mAP50-95. Here are my loss/val and metrics plots:

Inference

I used get_sliced_prediction from sahi with a 640x640 slice size and a 0.2 overlap ratio. Currently sahi is not compatible with YOLOv10, so I added a custom Yolov10DetectionModel. Here is an inference image example with a 0.4 confidence threshold: inference results
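
A minimal sketch of the sliced inference call. The import path and constructor arguments of the custom Yolov10DetectionModel wrapper are assumptions; check this repo's source for the real ones:

```python
from sahi.predict import get_sliced_prediction

# Custom wrapper mentioned above; import path is a placeholder.
from yolov10_model import Yolov10DetectionModel

detection_model = Yolov10DetectionModel(
    model_path="best.pt",       # trained weights (placeholder name)
    confidence_threshold=0.4,   # matches the example above
)

result = get_sliced_prediction(
    "sheet_page.png",           # full-resolution page (placeholder name)
    detection_model,
    slice_height=640,
    slice_width=640,
    overlap_height_ratio=0.2,
    overlap_width_ratio=0.2,
)
result.export_visuals(export_dir="runs/inference")  # saves the annotated page
```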

Model

The model is publicly available on Roboflow and works on images of 640x640 size. To use it with the SAHI slicer, there is a Roboflow workflow.
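
A sketch of calling the hosted model with Roboflow's inference_sdk client; the model id and API key are placeholders, not the real identifiers:

```python
from inference_sdk import InferenceHTTPClient

client = InferenceHTTPClient(
    api_url="https://detect.roboflow.com",
    api_key="YOUR_API_KEY",
)
# The model expects 640x640 input, so send one slice at a time.
result = client.infer("slice_640.png", model_id="your-workspace/your-model/1")
print(result["predictions"])
```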

Footnotes

  1. L. Tuggener, Y. P. Satyawan, A. Pacha, J. Schmidhuber, and T. Stadelmann, “DeepScoresV2”. Zenodo, Sep. 02, 2020. doi: 10.5281/zenodo.4012193.