This is provides several JSON-related tools implemented with Jackson. Its goal is to be useable with extremely large json streams, and everything needs to happen streaming.
I tried several tools implemented in python (python -m json.tool, 'jsongrep'), but those consumed very much memory when I fed them a json stream of a Gigabyte or so, and seemed not usable for that, so I implemented similar tools based on jackson2 in java. They are streaming and don’t need much memory, and can deal with huge streams of json.
All tools support a -help argument for an overview of all supported options.
The executable jars are packaged in a zip, which can be downloaded here.
This zip also contain executable scripts to call them with java -jar
, which will work in a unix or osx environment, and can be unzipped somewhere in your path. Typing this install in the current directory:
curl -o json.zip https://repo1.maven.org/maven2/org/meeuw/mihxil-json/0.10/mihxil-json-0.10-all.zip ; unzip -o json.zip ; rm json.zip
Implemented in mihxil-json-formatter
This is actually just a thin layer around Jackson’s JsonParser
and JsonGenerator
.
Usage
jsonformat [<infile>] [<outfile>] infile: defaults to stdin (can explicitely set to stdin as '-'). Can be file name but can also be a remote URL outfile: default to stdout
For a file of nearly one Gb:
michiel@belono:/tmp$ time jsonformat alldocs.json alldocs.formatted.json
real 0m27.783s
user 0m19.880s
sys 0m5.686s
michiel@belono:/tmp$ ls -lah alldocs.*
-rw-rw-r-- 1 michiel wheel 1.3G Feb 22 18:17 alldocs.formatted.json
-rw-rw-r-- 1 michiel wheel 928M Feb 22 14:19 alldocs.json
Implemented in mihxil-json-grep
This is a streaming 'jsongrep', and works a bit like grep. It e.g. can be used to produce one line abstracts of the records which can easily be processed further by a normal grep or awk or so.
The 'grep' (and 'sed') implementation is basically configured using PathMatcher
and extensions. Seperate of that there is a Parser
that can convert a string to (a set of) PathMatcher
(s). The provided command line tools depend on that. But all functionality is also available by constructing java objects in code.
Example
$ echo "{a:'b', y: {c:'x', arr:[{d:'y'}, {e:'z'}]}}" | jsongrep y.arr[1].e
y.arr[1].e=z
This just demonstrate a simple path match. It returns the matched path together with the associated value.
It can also accept a second optional parameter which is a file or an URL:
$ jsongrep y.arr[*].*~[xz] test.json
y.arr[1].e=z
Generally the available options are documented in the tools itself too
$ jsongrep --help
jsongrep - 0.10-SNAPSHOT - See https://github.com/mihxil/json
usage: jsongrep [OPTIONS] <pathMatcher expression> [<INPUT FILE>|-]
-?,--help print this message
-d,--debug Debug
-i,--ignoreArrays Ignore arrays (no need to match those)
-m,--max Max number of records
-o,--output <arg> Output format, one of [PATHANDVALUE, PATHANDFULLVALUE, KEYANDVALUE,
KEYANDFULLVALUE, PATH, KEY, VALUE, FULLVALUE]
-r,--record <arg> Record pattern (default to no matching at all). On match, a record
separator will be outputted.
-rs,--recordsep <arg> Record separator
-s,--sep <arg> Separator (defaults to newline)
-sf,--sortfields <arg> Sort the fields of a found 'record', according to the order of the
matchers.
-v,--version Print version
It is possible to specify more than one match
$ echo "{a:'b', y: {c:'x', arr:[{d:'y'}, {e:'z'}]}}" | jsongrep y.arr[1].e,a
a=b
y.arr[1].e=z
You can use wildcards in the path:
$ echo "{a:'b', y: {c:'x', arr:[{d:'y'}, {e:'z'}]}}" | jsongrep y.arr[*].e
y.arr[1].e=z
$ echo "{a:'b', y: {c:'x', arr:[{d:'y'}, {e:'z'}]}}" | jsongrep y.*[*].d
y.arr[0].d=y
This is useful for array indices. But you can also choose it completely ignore array indices in matching, which may simplify things:
$ echo "{a:'b', y: {c:'x', arr:[{d:'y'}, {e:'z'}]}}" | jsongrep -ignoreArrays y.arr.e
y.arr[1].e=z
Regex matching on key is also possible, which can e.g. be used to output different keys at the same level more easily.
echo "{a:'b', y: {c:'x', arr:[{d:'y'}, {e:'z', 'f': 'g'}]}}" | jsongrep -output PATHANDFULLVALUE -ignoreArrays '*.arr./d|e/'
y.arr[0].d=y
y.arr[1].e=z
which is equivalent to:
echo "{a:'b', y: {c:'x', arr:[{d:'y'}, {e:'z', 'f': 'g'}]}}" | jsongrep -output PATHANDFULLVALUE -ignoreArrays '*.arr.d,*.arr.e'
y.arr[0].d=y
y.arr[1].e=z
If a matcher does not match a simple value but an object or an array, it will be reported like this:
$ echo "{a:'b', y: {c:'x', arr:[{d:'y'}, {e:'z'}]}}" | jsongrep y.arr,y
y.arr=[...]
y={...}
Unless you specify a different output format:
$ echo "{a:'b', y: {c:'x', arr:[{d:'y'}, {e:'z'}]}}" | jsongrep -output PATHANDFULLVALUE y.arr,y
y.arr=[{"d":"y"},{"e":"z"}]
y={"c":"x","arr":[{"d":"y"},{"e":"z"}]}
It is possible to output less
$ jsongrep -output VALUE y.arr[*].*~[xz] test.json
z
$ jsongrep -output KEY y.arr[*].*~[xz] test.json
e
$ jsongrep -output PATH y.arr[*].*~[xz] test.json
y.arr[1].e
$ jsongrep -output KEYANDVALUE y.arr[*].*~[xz] test.json
e=z
Another example on a couchdb database (find documents where a certain field has certain value)
$ jsongrep rows.*.doc.workflow=FOR_REPUBLICATION,rows.*.doc.mid http://couchdbhost/database/_all_docs?include_docs=true |
grep -A 1 workflow
It is also possible to match on value rather than path alone:
$ echo "{a:'b', y: {c:'x', arr:[{d:'y'}, {e:'z'}]}}" | jsongrep y.arr[*].*=z
y.arr[1].e=z
That can also be done using regular expressions
$ echo "{a:'b', y: {c:'x', arr:[{d:'y'}, {e:'z'}]}}" | jsongrep y.arr[*].*~[xz]
y.arr[1].e=z
You can match directly inside the tree ('…' means 'an arbitrary path)
$ echo "{a:'b', y: {c:'x', arr:[{d:'y'}, {e:'z'}]}}" | jsongrep '...e'
y.arr[1].e=z
Matching can be implemented with a javascript function as well:
$ echo "{a:'b', y: {c:'x', arr:[{d:'y'}, {e:'z'}]}}" | jsongrep -output KEYANDFULLVALUE '...arr[*] function(doc) { return doc.d == "y"; }'
[0]={"d":"y"}
jsongrep supports the '-sep', '-recordsep' and '-record' parameters. They are intended for example to generate one line abstracts of a bunch of json records. E.g. create a file with 3 fields per line, separated by a tab. The 3 fields are 3 different keys from an array of json objects.
$ jsongrep -output VALUE -sep " " -record '*' '*.mid,*.publishDate,*.lastModified' es.all.json | sort > es.txt
The -record parameter defines what constitutes the start of a new record. If this matches a 'recordsep' will be outputted (this defaults to a newline). Normally between matches a newline is outputted, but when you use -record you’d probably don’t want that. In this example using the -sep argument a tab character is outputted between matches.
Normally, when using this 'record' functionality, the output record will be implicitely sorted like the matches. So in this case first the 'mid', then 'publishDate' then 'lastModified', independent from the order they appeared in the json document. With the '-sortfields' parameter you can disable this behaviour, and simply output in the original order.
A variant of 'jsongrep' is 'jsonsed'. This will just output the incoming json, but it will apply the replacements (which are possible in jsongrep too).
$ echo '{ "items" : [ { "a" : "abc def"}, { "a" : "xyz qwv"}]} ' | jsonsed -ignoreArrays -format 'items.a~abc\s*(.*)~def'
{
"items" : [ {
"a" : "def"
}, {
"a" : "xyz qwv"
} ]
}
Note
|
The syntax for replacement currenlty is <path>~<value>~<replacement> . This will make it hard to have
a literal ~ in the value. The parser may be changed to be more like sed itself. <path>~<ANY><value><ANY><replacement> or so (where <ANY> will be a character you can choose like / or | )
|
Implemented in mihxil-es, and contains a tool to download an entire elasticsearch database.