error of cut_set.compute_and_store_features #320

Closed
shanguanma opened this issue Jun 19, 2021 · 17 comments · Fixed by #321

Comments

@shanguanma
Contributor

shanguanma commented Jun 19, 2021

When I use snowfall to run an ASR model, I encounter the following problem at the step that prepares the data in Lhotse format.
The detailed error log is as follows:

# python3 ./prepare_seame.py 
# Invoked at Sat Jun 19 00:39:48 +08 2021 from node09
#
# Started at Sat Jun 19 00:39:48 +08 2021 on node09
WARNING:root:There are 15 recordings that do not have any corresponding supervisions in the SupervisionSet.
Dataset parts:   0%|          | 0/4 [00:00<?, ?it/s]Parts we will prepare:  ('dev_man', 'dev_sge', 'train_trn', 'train_cv')
Musan manifest preparation:
Feature extraction:
partition is dev_man
Processing dev_man
Extracting and storing features (chunks progress):  45%|████▌     | 9/20 [04:55<06:00, 32.80s/it]
Dataset parts:   0%|          | 0/4 [04:56<?, ?it/s]15%|█▌        | 3/20 [04:55<22:49, 80.57s/it] 
concurrent.futures.process._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home3/md510/anaconda3/envs/foo_k2/lib/python3.8/concurrent/futures/process.py", line 239, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/home3/md510/w2020/k2_fsa_2021/lhotse/lhotse/cut.py", line 2064, in compute_and_store_features
    return CutSet.from_cuts(
  File "/home3/md510/w2020/k2_fsa_2021/lhotse/lhotse/cut.py", line 1407, in from_cuts
    return CutSet(cuts=index_by_id_and_check(cuts))
  File "/home3/md510/w2020/k2_fsa_2021/lhotse/lhotse/utils.py", line 316, in index_by_id_and_check
    for m in manifests:
  File "/home3/md510/w2020/k2_fsa_2021/lhotse/lhotse/cut.py", line 2066, in <genexpr>
    cut.compute_and_store_features(
  File "/home3/md510/w2020/k2_fsa_2021/lhotse/lhotse/cut.py", line 424, in compute_and_store_features
    samples=self.load_audio(),
  File "/home3/md510/w2020/k2_fsa_2021/lhotse/lhotse/cut.py", line 394, in load_audio
    return self.recording.load_audio(
  File "/home3/md510/w2020/k2_fsa_2021/lhotse/lhotse/audio.py", line 247, in load_audio
    audio = assert_and_maybe_fix_num_samples(
  File "/home3/md510/w2020/k2_fsa_2021/lhotse/lhotse/audio.py", line 823, in assert_and_maybe_fix_num_samples
    raise ValueError("The number of declared samples in the recording diverged from the one obtained "
ValueError: The number of declared samples in the recording diverged from the one obtained when loading audio (offset=0, duration=1199.328). This could be internal Lhotse's error or a faulty transform implementation. Please report this issue in Lhotse and show the following: diff=9, audio.shape=(1, 19189239), recording=Recording(id='ui03faz_0101', sources=[AudioSource(type='file', channels=[0], source='/home4/md510/w2018/original_seame/wavdata/interview/ui03faz_0101/ui03faz_0101.wav')], sampling_rate=16000, num_samples=19189248, duration=1199.328, transforms=None)
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "./prepare_seame.py", line 167, in <module>
    main()
  File "./prepare_seame.py", line 135, in main
    cut_set = cut_set.compute_and_store_features(
  File "/home3/md510/w2020/k2_fsa_2021/lhotse/lhotse/cut.py", line 2118, in compute_and_store_features
    cuts_with_feats = combine(progress(f.result() for f in futures))
  File "/home3/md510/w2020/k2_fsa_2021/lhotse/lhotse/manipulation.py", line 27, in combine
    return reduce(add, manifests)
  File "/home3/md510/anaconda3/envs/foo_k2/lib/python3.8/site-packages/tqdm/std.py", line 1178, in __iter__
    for obj in iterable:
  File "/home3/md510/w2020/k2_fsa_2021/lhotse/lhotse/cut.py", line 2118, in <genexpr>
    cuts_with_feats = combine(progress(f.result() for f in futures))
  File "/home3/md510/anaconda3/envs/foo_k2/lib/python3.8/concurrent/futures/_base.py", line 432, in result
    return self.__get_result()
  File "/home3/md510/anaconda3/envs/foo_k2/lib/python3.8/concurrent/futures/_base.py", line 388, in __get_result
    raise self._exception
ValueError: The number of declared samples in the recording diverged from the one obtained when loading audio (offset=0, duration=1199.328). This could be internal Lhotse's error or a faulty transform implementation. Please report this issue in Lhotse and show the following: diff=9, audio.shape=(1, 19189239), recording=Recording(id='ui03faz_0101', sources=[AudioSource(type='file', channels=[0], source='/home4/md510/w2018/original_seame/wavdata/interview/ui03faz_0101/ui03faz_0101.wav')], sampling_rate=16000, num_samples=19189248, duration=1199.328, transforms=None)
# Ended (code 256) at Sat Jun 19 00:44:59 +08 2021, elapsed time 311 seconds

Could you help me to solve it? @pzelasko
Thanks a lot.

@csukuangfj
Contributor

Could you show the output of

soxi /home4/md510/w2018/original_seame/wavdata/interview/ui03faz_0101/ui03faz_0101.wav

?

@shanguanma
Contributor Author

Yes,

[md510@node02 simple_v1]$ sox --i /home4/md510/w2018/original_seame/wavdata/interview/ui03faz_0101/ui03faz_0101.wav

Input File     : '/home4/md510/w2018/original_seame/wavdata/interview/ui03faz_0101/ui03faz_0101.wav'
Channels       : 1
Sample Rate    : 16000
Precision      : 16-bit
Duration       : 00:19:59.33 = 19189239 samples ~ 89949.6 CDDA sectors
File Size      : 38.4M
Bit Rate       : 256k
Sample Encoding: 16-bit Signed Integer PCM

@shanguanma
Contributor Author

Yes,

[md510@node02 simple_v1]$ soxi /home4/md510/w2018/original_seame/wavdata/interview/ui03faz_0101/ui03faz_0101.wav

Input File     : '/home4/md510/w2018/original_seame/wavdata/interview/ui03faz_0101/ui03faz_0101.wav'
Channels       : 1
Sample Rate    : 16000
Precision      : 16-bit
Duration       : 00:19:59.33 = 19189239 samples ~ 89949.6 CDDA sectors
File Size      : 38.4M
Bit Rate       : 256k
Sample Encoding: 16-bit Signed Integer PCM

@csukuangfj
Contributor

According to the error message:

ValueError: The number of declared samples in the recording diverged from the one obtained when loading 
audio (offset=0, duration=1199.328). 
This could be internal Lhotse's error or a faulty transform implementation. 
Please report this issue in Lhotse and show the following: 
diff=9, audio.shape=(1, 19189239), 
recording=Recording(id='ui03faz_0101', 
sources=[AudioSource(type='file', 
channels=[0], 
source='/home4/md510/w2018/original_seame/wavdata/interview/ui03faz_0101/ui03faz_0101.wav')], 
sampling_rate=16000, num_samples=19189248, duration=1199.328, transforms=None)

duration * sample_rate = 1199.328 * 16000 == 19189248 == num_samples, which is not equal to
audio.shape=(1, 19189239).

transforms=None, so there are no transforms here.

audio.shape matches the output from soxi.

duration and sample_rate are from

lhotse/lhotse/audio.py

Lines 153 to 164 in ef7a037

    try:
        # Try to parse the file using pysoundfile first.
        import soundfile
        info = soundfile.info(str(path))
    except:
        # Try to parse the file using audioread as a fallback.
        info = _audioread_info(str(path))
        # If both fail, then Python 3 will display both exception messages.
    return Recording(
        id=recording_id if recording_id is not None else Path(path).stem,
        sampling_rate=info.samplerate,
        num_samples=info.frames,

Could you use the above code to print num_samples and compare the value with soxi's output? Maybe the header
of the wav is corrupted.

@shanguanma
Contributor Author

@csukuangfj , Thanks a lot, Fanjun.
I will follow your suggestion.

[md510@node02 simple_v1]$ python3
Python 3.8.8 | packaged by conda-forge | (default, Feb 20 2021, 16:22:27) 
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import soundfile
>>> info = soundfile.info(str("/home4/md510/w2018/original_seame/wavdata/interview/ui03faz_0101/ui03faz_0101.wav"))
>>> num_samples=info.frames
>>> num_samples
19189239
>>> sampling_rate=info.samplerate 
>>> sampling_rate
16000
>>> from pathlib import Path
>>> id=Path("/home4/md510/w2018/original_seame/wavdata/interview/ui03faz_0101/ui03faz_0101.wav").stem
>>> id
'ui03faz_0101'

The sample count is the same as in soxi's output.
However, the duration does not match.
soxi reports:

Duration       : 00:19:59.33 = 19189239 samples ~ 89949.6 CDDA sectors

Converted to seconds, that is 00:19:59.33 == 19*60 + 59.33 == 1199.33.
The Lhotse manifest, on the other hand, has num_samples == 19189248.

@csukuangfj
Contributor

However, the duration is not equal,
from soxi information is

Duration : 00:19:59.33 = 19189239 samples ~ 89949.6 CDDA sectors

I believe soxi's output uses rounding. 19189239/16000 = 1199.3274375
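The rounding explanation can be checked with a short sketch. `format_duration` is a hypothetical helper, not part of soxi or Lhotse; it assumes the display is hh:mm:ss with the seconds rounded to two decimals.

```python
def format_duration(num_samples: int, sampling_rate: int) -> str:
    """Format a duration as hh:mm:ss.xx with the seconds rounded to two decimals.

    Hypothetical helper that only mimics the shape of the soxi "Duration" field.
    """
    total_seconds = num_samples / sampling_rate
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{int(hours):02d}:{int(minutes):02d}:{seconds:05.2f}"


# Values reported by soxi for ui03faz_0101.wav.
exact_duration = 19189239 / 16000
displayed = format_duration(19189239, 16000)
print(exact_duration, displayed)
```

If the assumption holds, the 59.33 in the soxi output is just a display of 59.3274375, so the sample count is the reliable quantity to compare.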


>>> sampling_rate=info.samplerate 
>>> sampling_rate
16000

Can you show the value of info.duration ?


Also, the num_samples from

[md510@node02 simple_v1]$ python3
Python 3.8.8 | packaged by conda-forge | (default, Feb 20 2021, 16:22:27) 
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import soundfile
>>> info = soundfile.info(str("/home4/md510/w2018/original_seame/wavdata/interview/ui03faz_0101/ui03faz_0101.wav"))
>>> num_samples=info.frames
>>> num_samples
19189239
>>> sampling_rate=info.samplerate 
>>> sampling_rate
16000
>>> from pathlib import Path
>>> id=Path("/home4/md510/w2018/original_seame/wavdata/interview/ui03faz_0101/ui03faz_0101.wav").stem
>>> id
'ui03faz_0101'

does not match the error message. The error message says

sampling_rate=16000, num_samples=19189248, duration=1199.328, transforms=None)

Perhaps the code chooses _audioread_info. Could you show the num_samples and duration by using _audioread_info?

@shanguanma
Contributor Author

Can you show the value of info.duration ?

>>> duration=info.duration
>>> duration
1199.3274375

Perhaps the code chooses _audioread_info. Could you show the num_samples and duration by using _audioread_info?

>>> from lhotse.audio import _audioread_info
>>> info = _audioread_info(str("/home4/md510/w2018/original_seame/wavdata/interview/ui03faz_0101/ui03faz_0101.wav"))
>>> info.samplerate
16000
>>> info.duration  
1199.3274375
>>> info.frames  
19189239

In summary:

>>> from lhotse.audio import _audioread_info
>>> info = _audioread_info(str("/home4/md510/w2018/original_seame/wavdata/interview/ui03faz_0101/ui03faz_0101.wav"))
>>> info.samplerate
16000
>>> info.duration  
1199.3274375
>>> info.frames  
19189239
>>> from lhotse.audio import _audioread_info
>>> info = soundfile.info(str("/home4/md510/w2018/original_seame/wavdata/interview/ui03faz_0101/ui03faz_0101.wav"))
>>> info.samplerate
16000
>>> info.duration  
1199.3274375
>>> info.frames  
19189239
>>> info1 = _audioread_info(str("/home4/md510/w2018/original_seame/wavdata/interview/ui03faz_0101/ui03faz_0101.wav"))
>>> info1.samplerate
16000
>>> info1.duration  
1199.3274375
>>> info1.frames  
19189239

I still cannot find where the following value from the error message comes from:

num_samples=19189248

@shanguanma
Contributor Author

I found that the error is from

def compute_num_samples(duration: Seconds, sampling_rate: int, rounding=ROUND_HALF_UP) -> int:

>>> 1199.328*16000
19189248.0
>>> 1199.3274375 * 16000
19189239.0
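To make the effect of the two durations concrete, here is a minimal sketch of a duration-to-samples conversion with the same signature as the function quoted above. The body is an assumption written for illustration (it is not copied from Lhotse), and the name `compute_num_samples_sketch` is hypothetical.

```python
from decimal import ROUND_HALF_UP, Decimal


def compute_num_samples_sketch(
    duration: float, sampling_rate: int, rounding=ROUND_HALF_UP
) -> int:
    """Convert a duration in seconds to a whole number of samples.

    Assumed implementation for illustration only: multiply in decimal
    arithmetic, then round to the nearest integer.
    """
    exact = Decimal(str(duration)) * sampling_rate
    return int(exact.quantize(Decimal("1"), rounding=rounding))


# Duration stored in the manifest vs. duration reported by soundfile.
declared = compute_num_samples_sketch(1199.328, 16000)
actual = compute_num_samples_sketch(1199.3274375, 16000)
print(declared, actual, declared - actual)
```

Under this sketch the conversion itself is exact for both inputs, which suggests that the 9-sample gap comes from the duration value fed into it rather than from the rounding mode.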

@csukuangfj
Contributor

I found that the error is from

In that case, I think you can trace the program to find the place where duration is changed.

@pzelasko
Collaborator

pzelasko commented Jun 19, 2021 via email

@shanguanma
Contributor Author

@pzelasko, I have validated that my data is correct using utils/validate_data_dir.sh --no-feats folder. Then I used the lhotse command
to convert the Kaldi format to the Lhotse format. The command is as follows:

for name in train_trn train_cv dev_sge dev_man; do
  mkdir -p lhotse_data/$name
  lhotse kaldi import kaldi_format/$name 16000 lhotse_data/$name
done

I believe the above command is correct.

Cool, good debugging! BTW you might want to validate (a subset) of your manifests with „validate(recordings, read_data=True)” that checks if the manifest correctly describes the data. It could help locate the step that corrupted the metadata.

I found that the above error is caused by rounding the duration.

audio = assert_and_maybe_fix_num_samples(

I am looking for the place where the duration is rounded (e.g., 1199.3274375 -> 1199.328).

@pzelasko
Collaborator

pzelasko commented Jun 19, 2021 via email

@danpovey
Collaborator

danpovey commented Jun 19, 2021 via email

@pzelasko
Collaborator

pzelasko commented Jun 19, 2021 via email

@danpovey
Collaborator

danpovey commented Jun 19, 2021 via email

@shanguanma
Contributor Author

shanguanma commented Jun 19, 2021

Can you check reco2dur in Kaldi data dir? Lhotse simply reads it when importing and computes the num samples from that. If Kaldi does rounding in reco2dur then num samples would be incorrect.

See

num_samples=int(durations[recording_id] * sampling_rate),

You are right. We should read the audio samples directly and compute the duration and num_samples from them.
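As a rough illustration of that idea, the sketch below takes the sample count from the WAV header instead of deriving it from a rounded duration. It uses only the Python standard library; `read_wav_header`, `write_silence`, and the demo file are hypothetical and are not how Lhotse's Kaldi import is implemented.

```python
import os
import tempfile
import wave
from typing import Tuple


def read_wav_header(path: str) -> Tuple[int, int, float]:
    """Return (num_samples, sampling_rate, duration) taken from the WAV header."""
    with wave.open(path, "rb") as f:
        num_samples = f.getnframes()
        sampling_rate = f.getframerate()
    return num_samples, sampling_rate, num_samples / sampling_rate


def write_silence(path: str, num_samples: int, sampling_rate: int) -> None:
    """Write a mono 16-bit PCM file of silence (demo helper only)."""
    with wave.open(path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)
        f.setframerate(sampling_rate)
        f.writeframes(b"\x00\x00" * num_samples)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.wav")
    # 16009 samples at 16 kHz is not a whole number of milliseconds.
    write_silence(path, num_samples=16009, sampling_rate=16000)
    num_samples, sampling_rate, duration = read_wav_header(path)

# A duration rounded to milliseconds, as a reco2dur-style entry might hold.
rounded_duration = round(duration, 3)
from_rounded = int(rounded_duration * sampling_rate)
print(num_samples, from_rounded)
```

The header value matches what was written, while the count derived from the rounded duration drifts by a few samples, which is the same kind of mismatch as the diff=9 in the error message.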

Possibly the issue only occurs past a certain duration.

Yes, only some of the sentences have this error.

@danpovey
Collaborator

danpovey commented Jun 22, 2021 via email
