FYI: pandas can fail to read participants.tsv #474

Closed
yarikoptic opened this issue Aug 13, 2019 · 14 comments

Comments

@yarikoptic
Collaborator

yarikoptic commented Aug 13, 2019

/tmp/ds001868 > cat participants.tsv          
participant_id	age	sex
sub-ecog01	38	m%                                                                                      

/tmp/ds001868 > python -c 'from bids import BIDSLayout; b=BIDSLayout(".", derivatives=False); b.get_collections(level="dataset")'
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/pandas/core/arrays/categorical.py", line 345, in __init__
    codes, categories = factorize(values, sort=True)
  File "/usr/lib/python3/dist-packages/pandas/util/_decorators.py", line 178, in wrapper
    return func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/pandas/core/algorithms.py", line 630, in factorize
    na_value=na_value)
  File "/usr/lib/python3/dist-packages/pandas/core/algorithms.py", line 476, in _factorize_array
    na_value=na_value)
  File "pandas/_libs/hashtable_class_helper.pxi", line 1601, in pandas._libs.hashtable.PyObjectHashTable.get_labels
TypeError: unhashable type: 'dict'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/yoh/proj/bids/pybids/bids/layout/layout.py", line 850, in get_collections
    sampling_rate=sampling_rate)
  File "/home/yoh/proj/bids/pybids/bids/variables/entities.py", line 92, in get_collections
    nodes = self.get_nodes(unit, entities)
  File "/home/yoh/proj/bids/pybids/bids/variables/entities.py", line 161, in get_nodes
    rows = rows.sort_values(sort_cols)
  File "/usr/lib/python3/dist-packages/pandas/core/frame.py", line 4414, in sort_values
    na_position=na_position)
  File "/usr/lib/python3/dist-packages/pandas/core/sorting.py", line 207, in lexsort_indexer
    c = Categorical(key, ordered=True)
  File "/usr/lib/python3/dist-packages/pandas/core/arrays/categorical.py", line 347, in __init__
    codes, categories = factorize(values, sort=False)
  File "/usr/lib/python3/dist-packages/pandas/util/_decorators.py", line 178, in wrapper
    return func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/pandas/core/algorithms.py", line 630, in factorize
    na_value=na_value)
  File "/usr/lib/python3/dist-packages/pandas/core/algorithms.py", line 476, in _factorize_array
    na_value=na_value)
  File "pandas/_libs/hashtable_class_helper.pxi", line 1601, in pandas._libs.hashtable.PyObjectHashTable.get_labels
TypeError: unhashable type: 'dict'

/tmp/ds001868 > apt-cache policy python3-pandas
python3-pandas:
  Installed: 0.23.3+dfsg-3
  Candidate: 0.23.3+dfsg-3
  Version table:
 *** 0.23.3+dfsg-3 900
        900 http://http.debian.net/debian buster/main amd64 Packages
        900 http://http.debian.net/debian buster/main i386 Packages
        600 http://http.debian.net/debian sid/main amd64 Packages
        600 http://http.debian.net/debian sid/main i386 Packages
        100 /var/lib/dpkg/status

edit 1: additional sample, ds001810:
(git)smaug:/mnt/btrfs/datasets/datalad/crawl/openneuro/ds001810[master]
$> python -c 'from bids import BIDSLayout; b=BIDSLayout(".", derivatives=False); b.get_collections(level="dataset")'
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/pandas/core/arrays/categorical.py", line 345, in __init__
    codes, categories = factorize(values, sort=True)
  File "/usr/lib/python3/dist-packages/pandas/util/_decorators.py", line 178, in wrapper
    return func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/pandas/core/algorithms.py", line 630, in factorize
    na_value=na_value)
  File "/usr/lib/python3/dist-packages/pandas/core/algorithms.py", line 476, in _factorize_array
    na_value=na_value)
  File "pandas/_libs/hashtable_class_helper.pxi", line 1601, in pandas._libs.hashtable.PyObjectHashTable.get_labels
TypeError: unhashable type: 'dict'

...
$> head -n 2 participants.tsv
participant_id  tDCS_on_first_day       gender  age
sub-01  anodal  female  20
@tyarkoni
Collaborator

That's a new one. Will take a look.

@yarikoptic
Collaborator Author

I never deliver dated merchandise! ;)

@yarikoptic
Collaborator Author

This might be related to another failure:

[INFO   ] Aggregate metadata for dataset /mnt/btrfs/datasets/datalad/crawl/openneuro/ds001890                                                                                                                                     
[WARNING] Failed to load participants info due to: Length of values does not match length of index [series.py:_sanitize_index:4001]. Skipping the rest of file            

Here is that file's content. I thought the last row must be the culprit, but it seems to be kosher:
$> cat /mnt/btrfs/datasets/datalad/crawl/openneuro/ds001890/participants.tsv                          
participant_id	session	sex	genotype	Weight	SpO2	HR	Temperature	DOB	Experiment_Date	Age
c1NT	1	M	3xTG	32.3	98	272	35.8	2016-11-22	2017-03-23	3
c1NT	2	M	3xTG	36.2	94	311	35.8	2016-11-22	2017-05-31	6
c2NT	1	M	3xTG	24.8	98	257	36.9	2016-11-29	2017-03-23	3
c2NT	2	M	3xTG	30.6	99	450	35.8	2016-11-29	2017-05-31	6
c2R1	1	M	3xTG	25	97	360	36.1	2016-11-29	2017-03-23	3
c4NT	1	M	C57BL/6	31.5	99	250	36	2016-12-10	2017-03-29	3
c4NT	2	M	C57BL/6	35	96	208	34.8	2016-12-10	2017-06-19	6
c5NT	1	M	C57BL/6	30.1	99	427	35.2	2016-12-10	2017-03-29	3
c5NT	2	M	C57BL/6	31.9	93	223	34.6	2016-12-03	2017-06-05	6
c6NT	1	M	3xTG	26.2	99	330	36.3	2016-12-19	2017-03-30	3
c6NT	2	M	3xTG	30.2	90	250	36	2016-12-19	2017-06-20	6
c7NT	1	M	C57BL/6	28.7	36	350	35	2016-12-10	2017-04-06	3
c7NT	2	M	C57BL/6	28.3	80	266	35.8	2016-12-10	2017-06-19	6
c8NT	1	M	3xTG	25	97	379	36.3	2017-01-20	2017-04-25	3
c8NT	2	M	3xTG	25	97	379	36.3	2017-01-20	2017-04-25	6
c9NT	1	M	3xTG	23.7	99	457	35	2017-01-20	2017-04-25	3
c11L	1	M	3xTG	30.3	97	370	36.6	2016-11-22	2017-03-23	3
c11L	2	M	3xTG	34.7	97	270	35.4	2016-11-22	2017-05-31	6
c21L	1	M	3xTG	24.2	93	254	36.1	2016-11-29	2017-03-23	3
c31L	1	M	3xTG	29.6	93	331	36.1	2016-12-17	2017-03-24	3
c31R	1	M	3xTG	28.2	99	350	35.8	2016-12-17	2017-03-24	3
c31R	2	M	3xTG	32.3	88	380	35.2	2016-12-17	2017-06-20	6
c32L	1	M	3xTG	27.4	96	306	35.8	2016-12-17	2017-03-30	3
c32L	2	M	3xTG	32	82	240	35.8	2016-12-17	2017-06-20	6
c32R	1	M	3xTG	28	94	253	35.8	2016-12-17	2017-03-30	3
c32R	2	M	3xTG	34	89	254	35.8	2016-12-17	2017-06-20	6
c41L	1	M	C57BL/6	30.2	99	300	35.8	2016-12-10	2017-03-29	3
c41L	2	M	C57BL/6	32	95	248	35.4	2016-12-10	2017-06-19	6
c41R	1	M	C57BL/6	26.3	99	280	36	2016-12-10	2017-03-29	3
c41R	2	M	C57BL/6	27.2	98	276	36.6	2016-12-10	2017-06-19	6
c42L	1	M	C57BL/6	30.6	99	250	35	2016-12-10	2017-03-29	3
c42L	2	M	C57BL/6	36	99	269	35	2016-12-10	2017-06-19	6
c42R	1	M	C57BL/6	25.7	99	250	36	2016-12-10	2017-03-29	3
c42R	2	M	C57BL/6	29	96	180	34.5	2016-12-10	2017-06-19	6
c51L	1	M	C57BL/6	30.5	99	250	36.6	2016-12-10	2017-03-29	3
c51L	2	M	C57BL/6	31.6	99	211	34.8	2016-12-03	2017-06-05	6
c61L	1	M	3xTG	27	99	340	36.6	2016-12-19	2017-03-30	3
c61L	2	M	3xTG	31	89	184	36.3	2016-12-19	2017-06-20	6
c61R	1	M	3xTG	26	96	326	36.3	2016-12-19	2017-03-30	3
c61R	2	M	3xTG	30	92	238	36.3	2016-12-19	2017-06-20	6
c62L	1	M	3xTG	25.8	99	335	35.4	2016-12-19	2017-03-30	3
c71L	1	M	C57BL/6	30.1	98	298	35.5	2016-12-10	2017-04-06	3
c71L	2	M	C57BL/6	36.5	97	233	35	2016-12-10	2017-06-19	6
c71R	1	M	C57BL/6	31.4	95	416	35.8	2016-12-10	2017-04-06	3
c71R	2	M	C57BL/6	31.6	99	169	34.8	2016-12-10	2017-06-19	6
c81L	1	M	3xTG	30.1	99	336	35.2	2017-01-20	2017-04-25	3
c81L	2	M	3xTG	30.1	99	336	35.2	2017-01-20	2017-04-25	6
c81R	1	M	3xTG	27.9	89	340	35.4	2017-01-20	2017-04-25	3
c81R	2	M	3xTG	27.9	89	340	35.4	2017-01-20	2017-04-25	6
c91L	1	M	3xTG	25.3	95	613	37	2017-01-20	2017-04-25	3
c91L	2	M	3xTG	25.3	95	613	37	2017-01-20	2017-04-25	6
c91R	1	M	3xTG	26	95	266	35.2	2017-01-20	2017-04-25	3
c3NT2	1	M	3xTG_hydrocephalus	25	98	260	36.3	2016-12-17	2017-03-30	3

$> grep hydro /mnt/btrfs/datasets/datalad/crawl/openneuro/ds001890/participants.tsv | sed -e 's,\t, --- ,g'
c3NT2 --- 1 --- M --- 3xTG_hydrocephalus --- 25 --- 98 --- 260 --- 36.3 --- 2016-12-17 --- 2017-03-30 --- 3

@effigies
Collaborator

@yarikoptic For that last one, the participants.tsv is fine, but it has a list of 53 participants and there are only 30 participants in the dataset. The participant IDs also don't match, which is going to cause additional problems.

@yarikoptic
Collaborator Author

Ah! Thanks @effigies! A dedicated exception would probably be nice here as well, with a message matching the one thrown by bids-validator for such a case. (Isn't bids-validator run on those datasets upon upload, so that such issues shouldn't happen with datasets from OpenNeuro?)

@effigies
Collaborator

By the standard, "Each participant needs to be described by one and only one row.", so this definitely should be caught by the validator, but it probably isn't.

Which also means that this should be a BIDSValidationError by the #473 discussion, rather than something we handle more gracefully.

I'll open a BIDS validator issue.
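
For illustration only, the "one and only one row" rule quoted above could be checked with the standard library along these lines. This is a rough sketch, not what pybids or bids-validator actually does; the function name `duplicated_participants` is made up, and the sample rows are a shortened copy of the ds001890 file shown earlier.

```python
import csv
import io
from collections import Counter

# A few rows shaped like the ds001890 participants.tsv quoted in this thread.
TSV = (
    "participant_id\tsession\tsex\n"
    "c1NT\t1\tM\n"
    "c1NT\t2\tM\n"
    "c2R1\t1\tM\n"
)


def duplicated_participants(tsv_text):
    """Map each participant_id that appears more than once to its row count.

    Hypothetical helper for illustration; not part of any BIDS tool.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    counts = Counter(row["participant_id"] for row in reader)
    return {pid: n for pid, n in counts.items() if n > 1}


DUPLICATES = duplicated_participants(TSV)
print(DUPLICATES)
```

A non-empty result would mean the file breaks that rule and should be rejected rather than handled gracefully.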

@effigies
Collaborator

effigies commented Aug 14, 2019

Hmm. Looks like it should be an error already. Do you want to try running the validator? I'm on public wifi, so can't quickly download 9.7GB.

https://github.com/bids-standard/bids-validator/blob/492f99f8836dc6b26372eee5c795aaa405a6d735/bids-validator/utils/issues/list.js#L273-L278

@yarikoptic
Collaborator Author

Yep, will do. I did see that error from bids-validator appear for some datasets, so the question was more whether OpenNeuro accepts datasets which do not pass bids-validator.

@effigies
Collaborator

We shouldn't...

@yarikoptic
Collaborator Author

Ha, bids-validator doesn't error out since it is "too smart", but this participants.tsv is not BIDS compliant: it has multiple entries for participants with multiple sessions, one row per session:
$> head participants.tsv 
participant_id  session sex     genotype        Weight  SpO2    HR      Temperature     DOB     Experiment_Date Age
c1NT    1       M       3xTG    32.3    98      272     35.8    2016-11-22      2017-03-23      3
c1NT    2       M       3xTG    36.2    94      311     35.8    2016-11-22      2017-05-31      6
c2NT    1       M       3xTG    24.8    98      257     36.9    2016-11-29      2017-03-23      3
c2NT    2       M       3xTG    30.6    99      450     35.8    2016-11-29      2017-05-31      6
c2R1    1       M       3xTG    25      97      360     36.1    2016-11-29      2017-03-23      3
c4NT    1       M       C57BL/6 31.5    99      250     36      2016-12-10      2017-03-29      3
c4NT    2       M       C57BL/6 35      96      208     34.8    2016-12-10      2017-06-19      6
c5NT    1       M       C57BL/6 30.1    99      427     35.2    2016-12-10      2017-03-29      3
c5NT    2       M       C57BL/6 31.9    93      223     34.6    2016-12-03      2017-06-05      6
...

$> awk '{print $1;}' participants.tsv | sort | uniq -c | nl
     1        2 c11L
     2        2 c1NT
     3        1 c21L
     4        2 c2NT
     5        1 c2R1
     6        1 c31L
     7        2 c31R
     8        2 c32L
     9        2 c32R
    10        1 c3NT2
    11        2 c41L
    12        2 c41R
    13        2 c42L
    14        2 c42R
    15        2 c4NT
    16        2 c51L
    17        2 c5NT
    18        2 c61L
    19        2 c61R
    20        1 c62L
    21        2 c6NT
    22        2 c71L
    23        2 c71R
    24        2 c7NT
    25        2 c81L
    26        2 c81R
    27        2 c8NT
    28        2 c91L
    29        1 c91R
    30        1 c9NT
    31        1 participant_id

$> ls -ld sub-jgrADc* | nl | tail
    21  drwx------ 1 yoh yoh 20 Aug 14 09:57 sub-jgrADc6NT/
    22  drwx------ 1 yoh yoh 20 Aug 14 09:57 sub-jgrADc71L/
    23  drwx------ 1 yoh yoh 20 Aug 14 09:57 sub-jgrADc71R/
    24  drwx------ 1 yoh yoh 20 Aug 14 09:57 sub-jgrADc7NT/
    25  drwx------ 1 yoh yoh 20 Aug 14 09:57 sub-jgrADc81L/
    26  drwx------ 1 yoh yoh 20 Aug 14 09:57 sub-jgrADc81R/
    27  drwx------ 1 yoh yoh 20 Aug 14 09:57 sub-jgrADc8NT/
    28  drwx------ 1 yoh yoh 20 Aug 14 09:57 sub-jgrADc91L/
    29  drwx------ 1 yoh yoh 10 Aug 14 09:57 sub-jgrADc91R/
    30  drwx------ 1 yoh yoh 10 Aug 14 09:57 sub-jgrADc9NT/

The validator (1.2.5) doesn't issue any error or complaint about that. Submitted https://github.com/bids-standard/bids-validator/issues/820 with this info.

So PyBIDS was kinda right, I guess, but the exception is not informative.
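
The manual awk/ls comparison above could be sketched in Python roughly as follows. It is an illustration under assumptions: `compare_ids` is a made-up helper, and the sample IDs are copied from the outputs above (table entries such as `c6NT` versus directories such as `sub-jgrADc6NT`).

```python
def compare_ids(tsv_ids, dir_names):
    """Return (IDs only in the table, subject directories only on disk).

    Hypothetical helper: compares participant_id values verbatim against
    directory names that start with "sub-".
    """
    tsv_set = set(tsv_ids)
    dir_set = {name for name in dir_names if name.startswith("sub-")}
    return sorted(tsv_set - dir_set), sorted(dir_set - tsv_set)


# Values taken from the participants.tsv and ls output shown above.
ONLY_IN_TSV, ONLY_ON_DISK = compare_ids(
    ["c6NT", "c6NT", "c71L"],
    ["sub-jgrADc6NT", "sub-jgrADc71L"],
)
print(ONLY_IN_TSV)
print(ONLY_ON_DISK)
```

With these inputs nothing matches, which mirrors the ID mismatch pointed out earlier in the thread.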

@tyarkoni
Collaborator

Keep in mind that pybids doesn't call the JS bids-validator, it calls the (much more limited) Python BIDSValidator. The latter tests all the same regexes, but doesn't inspect files or their contents otherwise. Long term, we definitely need to either keep the two in sync, or (more realistically) wrap the JS validator and call it from pybids (which, unfortunately, introduces some heavy dependencies).

@effigies
Collaborator

Agreed that we don't call the validator, but when there are conditions that we can identify as only arising from invalid data, then raising an exception that basically says "Go run the validator for more details" could be useful.
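
A minimal sketch of what such an exception might look like, assuming a simple row-count check. `ParticipantsTableError` and `check_row_count` are hypothetical names for illustration and are not part of the pybids API; the numbers 53 and 30 come from the ds001890 case discussed above.

```python
class ParticipantsTableError(ValueError):
    """Hypothetical exception for an invalid participants.tsv (not pybids API)."""


def check_row_count(n_rows, n_subjects):
    """Raise an informative error when table rows and subject dirs disagree."""
    if n_rows != n_subjects:
        raise ParticipantsTableError(
            f"participants.tsv has {n_rows} rows but the dataset has "
            f"{n_subjects} subject directories; "
            "run bids-validator for more details."
        )


try:
    check_row_count(53, 30)
except ParticipantsTableError as exc:
    MESSAGE = str(exc)
    print(MESSAGE)
```

The point is only that the message names the file and points to the validator, instead of surfacing a pandas internals error.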

@tyarkoni
Collaborator

On further investigation, the problem here was that pybids didn't know how to handle nested metadata, so reading in participants.json was breaking things. I think #488 should have implicitly already fixed this. Can you verify that, @yarikoptic?

Assuming it now works, I think I'll close this. The issue about mismatching rows is to my mind strictly a validator issue.
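
To illustrate why nested metadata trips up sorting: a dict is unhashable, which is exactly what the `TypeError: unhashable type: 'dict'` in the tracebacks above reports. The sketch below is not the fix from #488; it only shows, with the standard library, how dict-valued sidecar entries could be spotted. The sidecar content and the helper name `unhashable_fields` are made up.

```python
import json

# Hypothetical sidecar content; real participants.json files vary.
SIDECAR = json.loads("""
{
  "age": {"Description": "age of the participant", "Units": "years"},
  "sex": {"Description": "sex", "Levels": {"m": "male", "f": "female"}},
  "note": "a plain string value, hashable"
}
""")


def unhashable_fields(metadata):
    """Return names of top-level fields whose values cannot be hashed."""
    bad = []
    for name, value in metadata.items():
        try:
            hash(value)
        except TypeError:
            # dicts and lists land here; strings and numbers do not
            bad.append(name)
    return bad


NESTED = unhashable_fields(SIDECAR)
print(NESTED)
```

Any field reported here would need flattening or skipping before its values end up in a column that gets sorted.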

@yarikoptic
Collaborator Author

> Can you verify that, @yarikoptic?

I can confirm that current master (0.9.2-66-g6751eec AKA 0.9.3-48-g6751eec since 0.9.3 was not annotated) no longer blows up, thanks!
