Files (much) shorter than 30s in fma-small #8

keunwoochoi · 2017-10-09T23:33:34Z

Hi, there are 6 files that are much shorter than 30s:

fma_small/098/098565.mp3 --> 1.6s
fma_small/098/098567.mp3 --> 0.5s
fma_small/098/098569.mp3 --> 1.5s
fma_small/099/099134.mp3 --> 0s
fma_small/108/108925.mp3 --> 0s
fma_small/133/133297.mp3 --> 0s

Reporting in case it's not a known issue.

mdeff · 2017-10-10T09:17:34Z

Hi, yep that's issue #4. It's due to bad length records in the https://freemusicarchive.org database. I should extract that metadata from the mp3 itself rather than relying on data from the API.

keunwoochoi · 2017-10-10T09:27:43Z

Oh, I see. Right.. maybe I (as well as others) would like to know how the baselines are computed when it’s shorter? Also that means I guess they’re also short in Large/Full.


mdeff · 2017-10-10T10:11:27Z

The features were extracted over windows then statistics computed across songs. The process is thus independent of the length of a song. Note though that the distributed features (which the baselines are based on) were computed on the full-length tracks. I thought that was most useful to users because it takes a lot of time to compute them (compared to doing it on 30s excerpts).

The length problem exists for medium and large. Full is fine as it contains the original full-length tracks.
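To illustrate why window-based extraction does not depend on track length, here is a minimal sketch, assuming the statistics are taken over the windows of each track. The RMS feature, the window and hop sizes, and the function names are illustrative stand-ins, not the features or code used for the dataset.

```python
import math
import statistics


def frame_rms(samples, window=2048, hop=512):
    """RMS energy of each analysis window (an illustrative stand-in feature)."""
    if not samples:
        return []  # a zero-length clip yields no windows at all
    values = []
    # A clip shorter than one window is treated as a single partial window.
    for start in range(0, max(1, len(samples) - window + 1), hop):
        frame = samples[start:start + window]
        values.append(math.sqrt(sum(x * x for x in frame) / len(frame)))
    return values


def summarize(values):
    """Fixed-size summary of a per-window feature, or None if there are no windows."""
    if not values:
        return None
    return {
        "mean": statistics.fmean(values),
        "std": statistics.pstdev(values),
        "min": min(values),
        "max": max(values),
    }


# Clips of very different lengths produce summaries of the same size.
short_clip = [math.sin(0.01 * n) for n in range(8_000)]
long_clip = [math.sin(0.01 * n) for n in range(80_000)]
print(summarize(frame_rms(short_clip)))
print(summarize(frame_rms(long_clip)))
```

The summary has the same four entries whatever the input length, but a clip with no samples produces no windows, so there is nothing to summarize.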

keunwoochoi · 2017-10-10T10:14:06Z

Cool, thanks. But..

> The length problem exists for medium and large. Full is fine as it contains the original full-length tracks.

Really? How is that possible? I understood that there's no such 'full'-length signal for those files. Sorry for questions that might be answered in the paper..

mdeff · 2017-10-10T10:21:45Z

There is. ;-) The full dataset is a verbatim copy of the MP3s from https://freemusicarchive.org. Tracks there are up to 3 hours long (figure 2). Small, medium, and large are composed of 30s excerpts (see section 2.6).

keunwoochoi · 2017-10-10T10:24:12Z

Oh.. right, the trimming-from-centre-w.r.t.-metadata was already mentioned in #4 thread. Thanks :)

mdeff · 2017-10-10T10:25:03Z

Exactly

keunwoochoi · 2017-10-10T10:26:52Z

It would actually be very helpful for the freemusicarchive people themselves to know that there is incorrect metadata, btw. I'm not sure whether I should close this issue at the moment, so I'll just leave it to you.

mdeff · 2017-10-10T10:28:13Z

I think I've told them at some point. Will check. :)

cwu307 · 2017-11-13T22:30:26Z

First of all, thank you for creating such a nice dataset for the MIR community!

As a follow-up to Keunwoo's observation, there are many songs that are shorter than 30 sec in the medium subset as well. I compiled a list for future reference:

song path || duration (s)
../fma_medium/001/001486.mp3 || 0.0
../fma_medium/005/005574.mp3 || 0.0
../fma_medium/065/065753.mp3 || 0.0
../fma_medium/080/080391.mp3 || 0.0
../fma_medium/098/098558.mp3 || 0.0
../fma_medium/098/098559.mp3 || 0.0
../fma_medium/098/098560.mp3 || 0.0
../fma_medium/098/098565.mp3 || 1.60761904762
../fma_medium/098/098566.mp3 || 6.23129251701
../fma_medium/098/098567.mp3 || 0.510476190476
../fma_medium/098/098568.mp3 || 6.57088435374
../fma_medium/098/098569.mp3 || 1.52925170068
../fma_medium/098/098571.mp3 || 0.0
../fma_medium/099/099134.mp3 || 0.0
../fma_medium/105/105247.mp3 || 0.0
../fma_medium/108/108924.mp3 || 27.3643537415
../fma_medium/108/108925.mp3 || 0.0
../fma_medium/126/126981.mp3 || 0.0
../fma_medium/127/127336.mp3 || 0.0
../fma_medium/133/133297.mp3 || 0.0
../fma_medium/143/143992.mp3 || 0.0

mdeff · 2017-11-15T15:50:38Z

Thanks for your comment. :)

While there are definitely some errors, beware that some songs can legitimately be shorter than 30s (iff the original version in fma full is shorter than 30s).

BTW I just finished running a script that measures the exact number of frames (among other things) for every song. I will use the results to update the duration column and to correctly cut the songs that had wrong metadata.

mdeff · 2017-11-18T02:50:26Z

I looked at your list, and comparing the duration reported by the API with the real duration (computed by dividing the number of decoded frames by the sample rate), the problem is clearly due to wrong metadata. As such, the tracks were cut at the wrong place, possibly even beyond their total length, which resulted in a zero-length clip.

track id: reported length --> real length
  1486:  90 -->  14.39
  5574: 563 -->  30.46
 65753: 242 -->   9.74
 80391: 600 -->  67.84
 98558: 216 -->  87.67
 98559: 193 -->  58.80
 98560: 341 -->  86.13
 98565: 158 -->  65.31
 98566: 148 -->  63.63
 98567: 191 -->  79.39
 98568: 249 --> 114.55
 98569: 184 -->  77.37
 98571: 139 -->  50.81
 99134: 308 -->  33.25
105247: 123 -->  18.91
108924: 187 --> 105.38
108925: 151 -->   7.44
126981: 216 -->  17.45
127336: 275 -->  13.27
133297: 600 --> 113.58
143992: 211 -->  26.23
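The mechanism can be sketched as follows, assuming a 30s excerpt is cut around the centre of the reported duration (the centre trimming mentioned earlier in the thread). The function names are hypothetical and this is not the script used to build the dataset. For the tracks above it gives clip lengths close to, but not identical to, the measured ones, so treat it as an approximation.

```python
CLIP_SECONDS = 30.0


def real_duration(num_frames: int, sample_rate: int) -> float:
    """Real duration in seconds: decoded frames divided by the sample rate."""
    return num_frames / sample_rate


def centred_clip_length(reported: float, real: float, clip: float = CLIP_SECONDS) -> float:
    """Length of the excerpt actually obtained when the cut points are
    derived from the reported duration but the audio only lasts `real` seconds."""
    start = max(0.0, (reported - clip) / 2)  # assumed: excerpt centred on the reported length
    end = start + clip
    # Both cut points are clamped to the real end of the audio.
    return max(0.0, min(end, real) - min(start, real))


# (reported, real) pairs taken from the list above.
for track_id, reported, real in [(1486, 90, 14.39), (108924, 187, 105.38)]:
    print(track_id, round(centred_clip_length(reported, real), 2))
```

When the reported duration is far larger than the real one, the start of the excerpt already lies beyond the end of the audio and the clip comes out empty; when it is only moderately wrong, the clip is truncated.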

Now we face a choice:

1. Keep those tracks, properly trimmed. This implies that we would have 7 tracks shorter than 30s in the medium subset (and 1 in the small).
2. Replace those 7 tracks with other candidates, and guarantee that all tracks in medium and small are at least 30s long.

What do you guys think would be the best option?

cwu307 · 2017-11-18T03:02:27Z

Personally I am fine with either choice. However, considering the ease of use for future users, maybe the 2nd option would be slightly better?

chaosinmotion · 2017-12-03T12:37:30Z

The 2nd choice would be good for the ease of use, which means better diffusion.

andimarafioti · 2018-04-11T08:58:54Z

I would prefer to have all tracks in the small/medium subset be 30s long.

hendriks73 · 2019-01-20T12:19:38Z

Not sure what the status is of this, as the archives are not versioned, but three of the files mentioned by @keunwoochoi in #8 (comment) seem to be simply broken in the current download and cannot be read by soxi.

Simple test (shows all files that soxi returns an error for):

find YOUR_FMA_SMALL_DIR -name '*.mp3' -type f -exec sh -c 'soxi "$1" > /dev/null 2>&1 || echo "$1"' sh {} \;

Output:

./133/133297.mp3
./099/099134.mp3
./108/108925.mp3

Perhaps it makes sense to update the release or find some other way of making users aware of this?
@ejhumphrey seems to have struggled with something similar in #27

mdeff · 2020-06-13T03:49:26Z

The archives are versioned. The rc1 version (still the latest as of now) has this issue. There is code in next to fix it, and it will be used in a hypothetical data update.

I've added the pinned meta-issue #41 and a note in the README to make users aware of known issues.

mdeff · 2020-06-17T17:59:06Z

@hendriks73, the 3 files you list cannot be read by soxi because they are empty (duration 0s). Durations can be found in the medium subset's list (#8 (comment)). (The 3 others from the small subset's list #8 (comment) have durations from 0.5 to 1.6s, hence no problem for soxi.)

#8 (comment) explains why erroneous metadata led to files of duration 0.

andimarafioti mentioned this issue Apr 7, 2020

Possibly corrupted files in fma_small #36

Closed

mdeff mentioned this issue Jun 13, 2020

Known issues (and next release) #41

Open

mdeff closed this as completed Jun 17, 2020

mdeff mentioned this issue Jul 20, 2020

Corrupted Files? #44

Closed
