readme.txt

Open Library dump scripts

A collection of scripts to extract statistics from Open Library dump files.

Stats.py: produce statistics of a dump file
stats.py reads standard input line by line. It expects each line to be a complete JSON record, so before feeding it a dump file, you should remove everything that precedes the JSON record on each line. For example:
    sed -nre "s/^[^{]*//p" <ol_dump_file> | python stats.py output.json
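
For readers who prefer to do the stripping in Python rather than sed, the following is a minimal sketch of the same idea. It is not part of these scripts: the function name strip_prefix and the sample line are made up for illustration, and unlike the sed command it skips lines that contain no "{" at all instead of emitting an empty line.

```python
import json


def strip_prefix(line):
    """Return the part of the line starting at the first '{', or None if there is none."""
    start = line.find("{")
    if start == -1:
        return None
    return line[start:]


# Hypothetical dump line: some tab-separated columns followed by the JSON record.
sample = '/type/edition\t/books/OL1M\t3\t2010-04-14T02:53:24\t{"key": "/books/OL1M"}\n'

# After stripping, the remainder parses as a JSON record.
record = json.loads(strip_prefix(sample))
```

The stripped lines could then be written to standard output and piped into stats.py in the same way as the sed output.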

During execution, it keeps the statistics in a dict. Each record type found in the dump, except the ones with confused identities, gets a key in this dict. The value for each of these keys is itself a dict, with keys:
countr - an int count of records (of this type),
keys - a dict with the keys found in the records as keys and a two-item list as value: the number of records the key is found in, followed by the number of values (if the key has a list value in the records, the lengths of all those lists are summed; otherwise this is the same as the number of records),
identifiers - a dict with the keys found in the identifiers object as keys and a list of the number of records and the number of instances of each key as value,
si - a dict with the identifiers found in the record object itself as keys and a list of the number of records and the number of instances of each key as value,
classifications - same as for identifiers, but for the classifications object,
sc - same as for si, but for classifications.
The keys and types of records with confused identities are collected in a list under the key confused. If an exception is caught while processing a record, a 2-tuple containing the complete record and the exception message is appended to the list under the key error.
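
The per-key counting described for the keys dict can be illustrated with a short sketch. This is not the code from stats.py; the function name update_key_stats and the sample records are assumptions, and it covers only the "records, then values" list kept for each key.

```python
def update_key_stats(key_stats, record):
    """Update a {key: [records, values]} dict in place for one record."""
    for key, value in record.items():
        entry = key_stats.setdefault(key, [0, 0])
        entry[0] += 1  # one more record contains this key
        # A list value contributes its length, any other value counts as one.
        entry[1] += len(value) if isinstance(value, list) else 1


key_stats = {}
update_key_stats(key_stats, {"title": "A", "subjects": ["x", "y", "z"]})
update_key_stats(key_stats, {"title": "B"})
print(key_stats)  # {'title': [2, 2], 'subjects': [1, 3]}
```

Here title appears in two records with one value each, while subjects appears in one record with three values, so the two numbers differ only for keys that hold lists.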

Exportcsv.py: export data from JSON stats file to separate CSV files
Expects a file generated by stats.py.

Countformats.py: count the values in the physical_format field
Expects Edition JSON records and outputs a tab-separated UTF-8 file.
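
As an illustration of this kind of count, here is a minimal sketch. It is not the code from Countformats.py: the function names, the one-record-per-line input, the column order (format, then count) and the choice to skip records without a physical_format field are all assumptions.

```python
import io
import json
from collections import Counter


def count_formats(lines):
    """Count physical_format values in an iterable of JSON record lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        fmt = record.get("physical_format")
        if fmt is not None:  # assumption: records without the field are skipped
            counts[fmt] += 1
    return counts


def write_tsv(counts, out):
    """Write one 'format<TAB>count' line per value, most frequent first."""
    for fmt, n in counts.most_common():
        out.write(f"{fmt}\t{n}\n")


# Made-up sample records for illustration.
sample = [
    '{"physical_format": "Paperback"}',
    '{"physical_format": "Hardcover"}',
    '{"physical_format": "Paperback"}',
    '{"title": "No format given"}',
]
counts = count_formats(sample)
buf = io.StringIO()
write_tsv(counts, buf)
```

To get a UTF-8 file instead of an in-memory buffer, pass write_tsv a file opened with open(path, "w", encoding="utf-8").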