Losser is a little JSON to CSV converter:
- It takes a list of JSON objects as input and produces a CSV file as output
- The JSON objects don't all have to have the same keys or structure, and they may contain sub-objects and sub-lists arbitrarily nested
- Losser can filter the JSON objects - finding and exporting some fields while ignoring others
- And it can transform the objects - renaming and reordering fields, truncating and formatting values, combining multiple values into lists, deduplicating, etc
- Losser can be used on the command line or as a Python module
Originally created for ckanapi-exporter.
To install run:
pip install losser
On the command line:
losser --column "Data Owner" --pattern '^author$' < input.json
Losser reads a list of JSON objects from stdin and writes the CSV text to
stdout. input.json
should be a JSON file containing a list of objects.
The examples in this README use this input.json file (which
contains datasets exported from demo.ckan.org).
Losser will search each object for fields matching the
regular expression
^author$
and put the values in a column titled "Data Owner".
The output will be a one-column CSV file:
Data Owner |
---|
Bundesbank |
Lucy Chambers |
... |
You can add as many columns as you want: just add a --column
and a
--pattern
argument for each column. The columns will appear in the order that
you specify them:
losser --column "Data Owner" --pattern '^author$' \
--column Maintainer --pattern '^maintainer$' \
< input.json
Data Owner | Maintainer |
---|---|
Bundesbank | Rufus Pollock |
Lucy Chambers | Someone Else |
... | ... |
Losser tries to be a good UNIX citizen. It aims to do one thing and do it well,
and to be composable with other UNIX commands using <
, >
and |
:
- Reads input from stdin
- Writes clean, undecorated CSV text to stdout
- Writes errors and help text to stderr not stdout
- Sets exit status to 0 normally or non-zero if something went wrong
- Is always non-interactive
To export fields from sub-objects use a pattern path: a pattern with more than one argument. Our example JSON objects contain a "tracking_summary" sub-object with a "total" field:
[
{
"author": "Bundesbank",
"maintainer": "Rufus Pollock",
"tracking_summary": {
"total": 456,
"recent": 19
},
...
},
...
]
To extract this field into a third column:
losser --column "Data Owner" --pattern '^author$' \
--column Maintainer --pattern '^maintainer$' \
--column "Total Views" --pattern '^tracking_summary$' "total" \
< input.json
Data Owner | Maintainer | Total Views |
---|---|---|
Bundesbank | Rufus Pollock | 456 |
Lucy Chambers | Someone Else | 200 |
... | ... | ... |
A pattern path can take any number of arguments to recurse into any depth sub-objects.
If one of the patterns in a pattern path lands on a sub-list losser will iterate over the list and recurse on each item in the list, querying the rest of the pattern path against each item and eventually returning a list of results.
Our example objects contain lists of "resource" objects each with a "format" field, among others:
[
{
"author": "Bundesbank",
"maintainer": "Rufus Pollock",
"resources": [
{
"description": "CSV file extracted and cleaned from source excel.",
"format": "CSV",
...
},
{
"description": "Original Excel version.",
"format": "XLS",
...
},
],
...
},
...
]
To extract each of the format fields from each dataset:
losser --column "Data Owner" --pattern '^author$' \
--column Maintainer --pattern '^maintainer$' \
--column "Total Views" --pattern '^tracking_summary$' "total" \
--column Formats --pattern '^resources$' 'format' \
< input.json
Data Owner | Maintainer | Total Views | |
---|---|---|---|
Bundesbank | Rufus Pollock | 456 | CSV, XLS |
Lucy Chambers | Someone Else | 200 | CSV, CSV, JSON, HTML |
... | ... | ... | ... |
List of results are combined into quoted, comma-separated lists in the output CSV.
To remove duplicates from these lists pass the --deduplicate
option to the
column: --column Formats --pattern '^resources$' 'format' --deduplicate
.
Column options like --pattern
, --deduplicate
etc apply to the preceding
--column
. See losser --help
for all the options.
A column query may return a nested list of results, for example if the input object contains a list of lists. When this happens the nested list is flattened in the output CSV.
If a pattern matches more than one key in an object, losser will recurse on each matching key's value, querying the rest of the pattern path against each value and eventually returning a list of values.
To give a simple example, the pattern license
will match a number of keys in
our example objects (license_title
, license_id
and license_url
), so
this:
losser --column "License" --pattern "license" < input.json
Will produce this:
License |
---|
odc-pddl, Open Data Commons Public Domain Dedication and License (PDDL), http://www.opendefinition.org/licenses/odc-pddl |
cc-by, Creative Commons Attribution, http://www.opendefinition.org/licenses/cc-by |
... |
Different objects in the input JSON may use different keys for the same field. For example an "Update Frequency" field that appears as "Update", "update", "Updated", "Update Frequency", etc in different objects.
To catch all of these fields and put them into a single column in the output CSV, just supply a pattern that matches all of them:
losser --column "Update Frequency" --pattern "^update.*" --unique < input.json
In this case we're assuming that each object has only one key that matches our
pattern: we don't want any lists of matching values in our CSV cells.
To enforce this we pass the --unique
option to the column, which will crash
if more than one key matches the pattern.
By default pattern matching is case-insensitive and keys are stripped of
leading and trailing whitespace before matching. To match case-sensitively
and without stripping whitespace, pass the --case-sensitive --strip false
options to the column.
You can specify your columns in a columns.json
file, instead of giving them
on the command line. For example:
{
"Data Owner": {
"pattern": "^author$"
},
"Maintainer": {
"pattern": "^maintainer$"
},
"Total Views": {
"pattern": ["^tracking_summary$", "total"]
},
"Formats": {
"pattern": ["^resources$", "format"],
"deduplicate": true
}
}
Then to use the file do:
losser --columns columns.json < input.json
To call losser from Python:
import losser.losser as losser
table = losser.table(input_objects, columns)
input_objects
should be a list of dicts (e.g. from reading a list of JSON
objects with json.loads()
).
columns
can be dict of dicts in the same format as the columns.json
file
above, or the path to a columns.json
file.
table()
will return the output CSV as a list of dicts or as a UTF8-encoded,
CSV-formatted string (if you pass csv=True
).
Losser's command line interface with --column
and related arguments is fairly
complicated to implement. You may want to offer the same command line features
in your own losser-based command without having to reimplement them.
For example ckanapi-exporter offers
the losser command line interface but adds its own --url
and --apikey
arguments to pull the input data directly from a CKAN site instead of reading
it from stdin.
losser.cli
provides make_parser()
and parse()
functions to enable
inheriting its command-line interface. Here's how you use them:
parent_parser = losser.cli.make_parser(add_help=False, exclude_args=["-i"])
parser = argparse.ArgumentParser(
description="Export datasets from a CKAN site to JSON or CSV.",
parents=[parent_parser],
)
parser.add_argument("--url", required=True)
parser.add_argument("--apikey")
try:
parsed_args = losser.cli.parse(parser=parser)
except losser.cli.CommandLineExit as err:
sys.exit(err.code)
except losser.cli.CommandLineError as err:
if err.message:
parser.error(err.message)
url = parsed_args.url
columns = parsed_args.columns
apikey = parsed_args.apikey
datasets = get_datasets_from_ckan(url, apikey)
csv_string = losser.losser.table(datasets, columns, csv=True)
See ckanapi-exporter for a working example.
To install losser for development, create and activate a Python virtual environment then do:
git clone https://github.com/ckan/losser.git
cd losser
python setup.py develop
pip install -r dev-requirements.txt
To run the tests do:
nosetests
To run the tests and produce a test coverage report do:
nosetests --with-coverage --cover-inclusive --cover-erase --cover-tests