Open Library dump scripts
A collection of scripts to extract statistics from Open Library dump files.
Stats.py: produce statistics of a dump file
stats.py reads standard input line by line and expects each line to be a complete JSON record. Before feeding it a dump file, strip everything that precedes the JSON record on each line. For example:
    sed -nre "s/^[^{]*//p" <ol_dump_file> | python stats.py output.json
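The same preprocessing can be sketched in Python, for cases where sed is not available. This is only an illustration: strip_to_json is a hypothetical helper, not part of these scripts, and it assumes the JSON record starts at the first "{" on the line.

```python
def strip_to_json(line):
    """Return the part of a dump line starting at the first '{'.

    Hypothetical helper: lines without a '{' yield an empty string,
    which mirrors what the sed expression leaves behind.
    """
    start = line.find("{")
    return line[start:] if start != -1 else ""
```

Applied to every line of a dump file, this would produce the kind of input stats.py expects on standard input.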
During execution, it keeps the statistics in a dict. Each type found in the dump, except the ones with confused identities, gets a key in this dict. The values for these keys are dicts themselves, with keys:
count - an int count of records (of this type),
keys - a dict with keys found in the records as keys and a two-element list as value. The list holds the number of records the key is found in, followed by the number of values: where the key has a list value in a record, the length of that list is added to the total; otherwise the record adds one, so for keys that never hold a list the two numbers are equal,
identifiers - a dict with keys found in the identifiers object as keys and the number of records and the number of instances of each key as value,
si - a dict with identifiers found in the record object as keys and a list of the number of records and the number of instances of each key as value,
classifications - same as for identifiers, but for classifications,
sc - same as for si, but for classifications.
Keys and types of records with confused identities are kept in a list under the key confused. If an exception is caught while processing a record, a 2-tuple containing the complete record and the exception message is appended to the list under the key error.
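The structure described above can be sketched as follows. This is a simplified illustration, not the actual stats.py code: update_stats and collect are hypothetical names, only the record count and the keys entry are shown, and the sketch assumes the record type is stored as {"type": {"key": ...}} in each record.

```python
import json
from collections import defaultdict


def new_type_stats():
    # Per-type entry: a record count plus [records, values] per key.
    return {"count": 0, "keys": defaultdict(lambda: [0, 0])}


def update_stats(stats, record):
    """Fold one parsed record into the stats dict (hypothetical helper)."""
    # Assumption: the type looks like {"type": {"key": "/type/edition"}}.
    type_key = record["type"]["key"]
    entry = stats.setdefault(type_key, new_type_stats())
    entry["count"] += 1
    for key, value in record.items():
        counts = entry["keys"][key]
        counts[0] += 1  # one more record contains this key
        # List values add their length; any other value adds one.
        counts[1] += len(value) if isinstance(value, list) else 1
    return stats


def collect(lines):
    """Build the stats dict from an iterable of JSON lines."""
    stats = {"error": []}
    for line in lines:
        try:
            update_stats(stats, json.loads(line))
        except Exception as exc:
            # Keep the offending record together with the exception message.
            stats["error"].append((line, str(exc)))
    return stats
```

For example, two edition records that both have a title, one of which also has a two-element authors list, would give keys entries of [2, 2] for title and [1, 2] for authors.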
Exportcsv.py: export data from JSON stats file to separate CSV files
Expects a file generated by stats.py.
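As a rough idea of what such an export might look like, the sketch below writes one CSV file per record type from the keys entries of a stats file. The function name export_keys_csv, the output file naming and the column headers are all assumptions made for illustration; the real script may lay out its files differently.

```python
import csv
import json
import os


def export_keys_csv(stats_path, out_dir):
    """Write one CSV per record type from a stats JSON file (illustrative)."""
    with open(stats_path, encoding="utf-8") as fh:
        stats = json.load(fh)
    written = []
    for type_key, entry in stats.items():
        # Skip top-level lists such as "confused" and "error".
        if not isinstance(entry, dict) or "keys" not in entry:
            continue
        # "/type/edition" -> "type_edition_keys.csv" (assumed naming scheme).
        name = type_key.strip("/").replace("/", "_") + "_keys.csv"
        path = os.path.join(out_dir, name)
        with open(path, "w", newline="", encoding="utf-8") as out:
            writer = csv.writer(out)
            writer.writerow(["key", "records", "values"])
            for key, (records, values) in sorted(entry["keys"].items()):
                writer.writerow([key, records, values])
        written.append(path)
    return written
```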
Countformats.py: count the values in the physical_format field
Expects Edition JSON records; outputs a tab-separated UTF-8 file.
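The counting step can be sketched as below. This is a minimal illustration under stated assumptions, not the script itself: count_formats and write_tsv are hypothetical names, input is assumed to be one JSON record per line, and records without a physical_format field are tallied under an empty string.

```python
import json
from collections import Counter


def count_formats(lines):
    """Count physical_format values across edition records."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        record = json.loads(line)
        # Assumption: a missing field is counted as the empty string.
        counts[record.get("physical_format", "")] += 1
    return counts


def write_tsv(counts, out):
    # One "value<TAB>count" row per format, most frequent first.
    for value, n in counts.most_common():
        out.write(f"{value}\t{n}\n")
```

To get UTF-8 output, the file passed to write_tsv can be opened with open(path, "w", encoding="utf-8").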