Files (much) shorter than 30s in fma-small #8

Closed
keunwoochoi opened this issue Oct 9, 2017 · 18 comments

Comments

@keunwoochoi

keunwoochoi commented Oct 9, 2017

Hi, there are 6 files that are much shorter than 30s:

fma_small/098/098565.mp3 --> 1.6s
fma_small/098/098567.mp3 --> 0.5s
fma_small/098/098569.mp3 --> 1.5s
fma_small/099/099134.mp3 --> 0s
fma_small/108/108925.mp3 --> 0s
fma_small/133/133297.mp3 --> 0s

Reporting this in case it's not a known issue.

@mdeff
Owner

mdeff commented Oct 10, 2017

Hi, yep that's issue #4. It's due to bad length records in the https://freemusicarchive.org database. I should extract that metadata from the mp3 itself rather than relying on data from the API.

@keunwoochoi
Author

keunwoochoi commented Oct 10, 2017 via email

@mdeff
Owner

mdeff commented Oct 10, 2017

The features were extracted over windows, then statistics were computed across songs. The process is thus independent of the length of a song. Note though that the distributed features (which the baselines are based on) were computed on the full-length tracks. I thought that was most useful to users because it takes a lot of time to compute them (compared to doing it on 30s excerpts).
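To illustrate why windowed features plus summary statistics do not depend on track length, here is a minimal sketch. It is not the dataset's extraction code: the RMS feature, the window and hop sizes, the 22050 Hz rate, and the helper names `frame_rms` and `summarize` are all assumptions chosen for illustration.

```python
import math
import statistics

def frame_rms(signal, window=1024, hop=512):
    """RMS energy of each analysis window (a stand-in for any per-window feature)."""
    values = []
    for start in range(0, len(signal) - window + 1, hop):
        frame = signal[start:start + window]
        values.append(math.sqrt(sum(x * x for x in frame) / window))
    return values

def summarize(per_window):
    """Collapse a variable number of window values into a fixed-size vector."""
    return [
        statistics.fmean(per_window),
        statistics.pstdev(per_window),
        min(per_window),
        max(per_window),
    ]

# Two synthetic "tracks" of different lengths: a 440 Hz sine at an assumed 22050 Hz rate.
rate = 22050
short_track = [math.sin(2 * math.pi * 440 * n / rate) for n in range(rate * 1)]
long_track = [math.sin(2 * math.pi * 440 * n / rate) for n in range(rate * 3)]

short_summary = summarize(frame_rms(short_track))
long_summary = summarize(frame_rms(long_track))

# Both summaries have the same number of entries, whatever the track length.
print(len(short_summary), len(long_summary))
```

The number of windows grows with the track, but the summary vector keeps a fixed size, which is why a too-short clip still yields a feature vector of the usual shape.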

The length problem exists for medium and large. Full is fine as it contains the original full-length tracks.

@keunwoochoi
Author

keunwoochoi commented Oct 10, 2017 via email

@mdeff
Owner

mdeff commented Oct 10, 2017

There is. ;-) The full dataset is a verbatim copy of the MP3s from https://freemusicarchive.org. Tracks there are up to 3 hours long (figure 2). Small, medium, and large are composed of 30s excerpts (see section 2.6).

@keunwoochoi
Author

Oh.. right, the trimming-from-centre-w.r.t.-metadata was already mentioned in #4 thread. Thanks :)

@mdeff
Owner

mdeff commented Oct 10, 2017

Exactly

@keunwoochoi
Author

It would actually be very helpful for freemusicarchive themselves to know that there is incorrect metadata, btw. I'm not sure if I should close this issue at the moment, so I'll just leave it for you.

@mdeff
Owner

mdeff commented Oct 10, 2017

I think I've told them at some point. Will check. :)

@cwu307

cwu307 commented Nov 13, 2017

First of all, thank you for creating such a nice dataset for the MIR community!

As a follow-up to Keunwoo's observation, there are many songs shorter than 30 sec in the medium subset as well. I compiled a list for future reference:

song path || duration (s)
../fma_medium/001/001486.mp3 || 0.0
../fma_medium/005/005574.mp3 || 0.0
../fma_medium/065/065753.mp3 || 0.0
../fma_medium/080/080391.mp3 || 0.0
../fma_medium/098/098558.mp3 || 0.0
../fma_medium/098/098559.mp3 || 0.0
../fma_medium/098/098560.mp3 || 0.0
../fma_medium/098/098565.mp3 || 1.60761904762
../fma_medium/098/098566.mp3 || 6.23129251701
../fma_medium/098/098567.mp3 || 0.510476190476
../fma_medium/098/098568.mp3 || 6.57088435374
../fma_medium/098/098569.mp3 || 1.52925170068
../fma_medium/098/098571.mp3 || 0.0
../fma_medium/099/099134.mp3 || 0.0
../fma_medium/105/105247.mp3 || 0.0
../fma_medium/108/108924.mp3 || 27.3643537415
../fma_medium/108/108925.mp3 || 0.0
../fma_medium/126/126981.mp3 || 0.0
../fma_medium/127/127336.mp3 || 0.0
../fma_medium/133/133297.mp3 || 0.0
../fma_medium/143/143992.mp3 || 0.0

@mdeff
Owner

mdeff commented Nov 15, 2017

Thanks for your comment. :)

While there are definitely some errors, beware that some songs can legitimately be shorter than 30s (iff the original version in fma full is shorter than 30s).

BTW I just finished running a script that measures the exact number of frames (among other things) for every song. I will use the results to update the duration column and to correctly cut the songs that had wrong metadata.

@mdeff
Owner

mdeff commented Nov 18, 2017

I looked at your list, and comparing the duration reported by the API with the real duration (computed by dividing the number of decoded frames by the sample rate), the problem is clearly due to wrong metadata. As such, the tracks were cut at the wrong place, possibly even beyond their total length, which resulted in a zero-length clip.

track id: reported length --> real length
  1486:  90 -->  14.39
  5574: 563 -->  30.46
 65753: 242 -->   9.74
 80391: 600 -->  67.84
 98558: 216 -->  87.67
 98559: 193 -->  58.80
 98560: 341 -->  86.13
 98565: 158 -->  65.31
 98566: 148 -->  63.63
 98567: 191 -->  79.39
 98568: 249 --> 114.55
 98569: 184 -->  77.37
 98571: 139 -->  50.81
 99134: 308 -->  33.25
105247: 123 -->  18.91
108924: 187 --> 105.38
108925: 151 -->   7.44
126981: 216 -->  17.45
127336: 275 -->  13.27
133297: 600 --> 113.58
143992: 211 -->  26.23
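The mechanism can be sketched as follows. This is a simplified model, not the dataset's trimming code: it assumes the 30s clip is centred on the reported duration (the trimming from the centre mentioned earlier in this thread), and the function names are hypothetical. The exact cut positions are not given here, so predicted lengths are only approximate (for track 108924 the model gives about 26.9s, while the measured clip is 27.36s).

```python
CLIP_SECONDS = 30.0

def duration_from_frames(n_frames, sample_rate):
    """Real duration in seconds: decoded frames divided by the sample rate."""
    return n_frames / sample_rate

def centered_clip_length(reported, real, clip=CLIP_SECONDS):
    """Length of a clip cut around the centre of the *reported* duration.

    Assumption: the cut window is [reported/2 - clip/2, reported/2 + clip/2].
    If the real audio ends before the window starts, nothing is left.
    """
    start = max(0.0, reported / 2 - clip / 2)
    end = start + clip
    return max(0.0, min(end, real) - start)

# Track 1486: reported 90s, real 14.39s. The window starts at 30s,
# past the end of the audio, so the clip is empty.
print(centered_clip_length(90, 14.39))

# Track 108924: reported 187s, real 105.38s. The window starts at 78.5s,
# so only the tail of the audio falls inside it.
print(centered_clip_length(187, 105.38))

# Correct metadata: a full 30s clip.
print(centered_clip_length(100, 100))
```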

Now we face a choice:

  1. Keep those tracks, properly trimmed. This implies that we would have 7 tracks shorter than 30s in the medium subset (and 1 in the small).
  2. Replace those 7 tracks by other candidates, and guarantee that all tracks in medium and small are at least 30s long.
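The counts in option 1 can be checked against the table above with a short sketch. The durations are copied from that table and the small-subset ids from the first comment in this thread; the variable names are only illustrative.

```python
# track id -> real duration in seconds, copied from the table above
real_length = {
    1486: 14.39, 5574: 30.46, 65753: 9.74, 80391: 67.84, 98558: 87.67,
    98559: 58.80, 98560: 86.13, 98565: 65.31, 98566: 63.63, 98567: 79.39,
    98568: 114.55, 98569: 77.37, 98571: 50.81, 99134: 33.25, 105247: 18.91,
    108924: 105.38, 108925: 7.44, 126981: 17.45, 127336: 13.27,
    133297: 113.58, 143992: 26.23,
}

# ids of the affected files reported for fma_small in the first comment
reported_in_small = {98565, 98567, 98569, 99134, 108925, 133297}

# Tracks that stay shorter than 30s even when trimmed correctly.
short_tracks = sorted(tid for tid, secs in real_length.items() if secs < 30)
short_in_small = [tid for tid in short_tracks if tid in reported_in_small]

print(len(short_tracks), short_tracks)
print(len(short_in_small), short_in_small)
```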

What do you guys think would be the best option?

@cwu307

cwu307 commented Nov 18, 2017

Personally I am fine with either choice. However, considering the ease of use for future users, maybe the 2nd option would be slightly better?

@chaosinmotion

The 2nd choice would be good for the ease of use, which means better diffusion.

@andimarafioti

I would prefer to have all tracks in the small/medium subset be 30s long.

@hendriks73

Not sure what the status is of this, as the archives are not versioned, but three of the files mentioned by @keunwoochoi in #8 (comment) seem to be simply broken in the current download and cannot be read by soxi.

Simple test (shows all files that soxi returns an error for):

find YOUR_FMA_SMALL_DIR -name '*.mp3' -type f -exec sh -c 'soxi "$1" > /dev/null 2>&1 || echo "$1"' _ {} \;

Output:

./133/133297.mp3
./099/099134.mp3
./108/108925.mp3
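For users who prefer to run the same check from Python, a rough equivalent might look like the sketch below. It assumes SoX is installed so that `soxi` is on the PATH; `soxi_can_read` and `find_unreadable` are hypothetical helper names, and the readability check is passed in as a parameter so that another probe can be swapped in.

```python
import subprocess
from pathlib import Path

def soxi_can_read(path):
    """Return True if `soxi -D` exits successfully for this file (needs SoX installed)."""
    result = subprocess.run(["soxi", "-D", str(path)], capture_output=True)
    return result.returncode == 0

def find_unreadable(root, can_read=soxi_can_read):
    """List every .mp3 file under root for which the probe reports a failure."""
    return sorted(
        p for p in Path(root).rglob("*.mp3")
        if p.is_file() and not can_read(p)
    )

# Usage (the directory name is a placeholder, as in the command above):
# for path in find_unreadable("YOUR_FMA_SMALL_DIR"):
#     print(path)
```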

Perhaps it makes sense to update the release or find some other way of making users aware of this?
@ejhumphrey seems to have struggled with something similar in #27

@mdeff
Owner

mdeff commented Jun 13, 2020

The archives are versioned. The rc1 version (still the latest as of now) has this issue. There is code in next to fix it, and it will be used in a hypothetical data update.

I've added the pinned meta-issue #41 and a note in the README to make users aware of known issues.

@mdeff
Owner

mdeff commented Jun 17, 2020

@hendriks73, the 3 files you list cannot be read by soxi because they are empty (duration 0s). Durations can be found in the medium subset's list (#8 (comment)). (The 3 others from the small subset's list #8 (comment) have durations from 0.5 to 1.6s, hence no problem for soxi.)

#8 (comment) explains why erroneous metadata led to files of duration 0.

@mdeff mdeff closed this as completed Jun 17, 2020
@mdeff mdeff mentioned this issue Jul 20, 2020