Skip to content
/ json Public

Json-stream utilities. Formatting of or grepping in arbitrary large json streams without running out of memory. Implemented in java/jackson.

License

Notifications You must be signed in to change notification settings

mihxil/json

Repository files navigation

Json Tools

Build Status Maven Central snapshots javadoc codecov

This is provides several JSON-related tools implemented with Jackson. Its goal is to be useable with extremely large json streams, and everything needs to happen streaming.

I tried several tools implemented in python (python -m json.tool, 'jsongrep'), but those consumed very much memory when I fed them a json stream of a Gigabyte or so, and seemed not usable for that, so I implemented similar tools based on jackson2 in java. They are streaming and don’t need much memory, and can deal with huge streams of json.

All tools support a -help argument for an overview of all supported options.

Download

The executable jars are packaged in a zip, which can be downloaded here.

This zip also contain executable scripts to call them with java -jar, which will work in a unix or osx environment, and can be unzipped somewhere in your path. Typing this install in the current directory:

curl -o json.zip  https://repo1.maven.org/maven2/org/meeuw/mihxil-json/0.10/mihxil-json-0.10-all.zip ; unzip -o json.zip ; rm json.zip

Formatter ('jsonformat')

Implemented in mihxil-json-formatter

This is actually just a thin layer around Jackson’s JsonParser and JsonGenerator.

Usage

jsonformat [<infile>] [<outfile>]

infile: defaults to stdin (can explicitely set to stdin as '-'). Can
        be file name but  can also be a remote URL
outfile: default to stdout

For a file of nearly one Gb:

michiel@belono:/tmp$ time jsonformat alldocs.json  alldocs.formatted.json

real	0m27.783s
user	0m19.880s
sys	0m5.686s

michiel@belono:/tmp$ ls -lah alldocs.*
-rw-rw-r--  1 michiel  wheel   1.3G Feb 22 18:17 alldocs.formatted.json
-rw-rw-r--  1 michiel  wheel   928M Feb 22 14:19 alldocs.json

Grep

Implemented in mihxil-json-grep

javadoc

This is a streaming 'jsongrep', and works a bit like grep. It e.g. can be used to produce one line abstracts of the records which can easily be processed further by a normal grep or awk or so.

Architecture

The 'grep' (and 'sed') implementation is basically configured using PathMatcher and extensions. Seperate of that there is a Parser that can convert a string to (a set of) PathMatcher(s). The provided command line tools depend on that. But all functionality is also available by constructing java objects in code.

Command line 'jsongrep'

Example

$ echo "{a:'b', y: {c:'x', arr:[{d:'y'}, {e:'z'}]}}"  | jsongrep  y.arr[1].e
y.arr[1].e=z

This just demonstrate a simple path match. It returns the matched path together with the associated value.

It can also accept a second optional parameter which is a file or an URL:

$ jsongrep  y.arr[*].*~[xz] test.json
y.arr[1].e=z

Generally the available options are documented in the tools itself too

$ jsongrep --help
jsongrep - 0.10-SNAPSHOT - See https://github.com/mihxil/json
usage: jsongrep [OPTIONS] <pathMatcher expression> [<INPUT FILE>|-]
 -?,--help                print this message
 -d,--debug               Debug
 -i,--ignoreArrays        Ignore arrays (no need to match those)
 -m,--max                 Max number of records
 -o,--output <arg>        Output format, one of [PATHANDVALUE, PATHANDFULLVALUE, KEYANDVALUE,
                          KEYANDFULLVALUE, PATH, KEY, VALUE, FULLVALUE]
 -r,--record <arg>        Record pattern (default to no matching at all). On match, a record
                          separator will be outputted.
 -rs,--recordsep <arg>    Record separator
 -s,--sep <arg>           Separator (defaults to newline)
 -sf,--sortfields <arg>   Sort the fields of a found 'record', according to the order of the
                          matchers.
 -v,--version             Print version

comma matching

It is possible to specify more than one match

$ echo "{a:'b', y: {c:'x', arr:[{d:'y'}, {e:'z'}]}}"  | jsongrep  y.arr[1].e,a
a=b
y.arr[1].e=z

wildcard matching

You can use wildcards in the path:

$ echo "{a:'b', y: {c:'x', arr:[{d:'y'}, {e:'z'}]}}"  | jsongrep  y.arr[*].e
y.arr[1].e=z
$ echo "{a:'b', y: {c:'x', arr:[{d:'y'}, {e:'z'}]}}"  | jsongrep  y.*[*].d
y.arr[0].d=y

This is useful for array indices. But you can also choose it completely ignore array indices in matching, which may simplify things:

$ echo "{a:'b', y: {c:'x', arr:[{d:'y'}, {e:'z'}]}}"  | jsongrep -ignoreArrays y.arr.e
y.arr[1].e=z

regex matching

Regex matching on key is also possible, which can e.g. be used to output different keys at the same level more easily.

echo "{a:'b', y: {c:'x', arr:[{d:'y'}, {e:'z', 'f': 'g'}]}}"  | jsongrep -output PATHANDFULLVALUE -ignoreArrays '*.arr./d|e/'
y.arr[0].d=y
y.arr[1].e=z

which is equivalent to:

echo "{a:'b', y: {c:'x', arr:[{d:'y'}, {e:'z', 'f': 'g'}]}}"  | jsongrep -output PATHANDFULLVALUE -ignoreArrays '*.arr.d,*.arr.e'
y.arr[0].d=y
y.arr[1].e=z

reporting

If a matcher does not match a simple value but an object or an array, it will be reported like this:

$ echo "{a:'b', y: {c:'x', arr:[{d:'y'}, {e:'z'}]}}"  | jsongrep  y.arr,y
y.arr=[...]
y={...}

Unless you specify a different output format:

$ echo "{a:'b', y: {c:'x', arr:[{d:'y'}, {e:'z'}]}}"  | jsongrep -output PATHANDFULLVALUE y.arr,y
y.arr=[{"d":"y"},{"e":"z"}]
y={"c":"x","arr":[{"d":"y"},{"e":"z"}]}

It is possible to output less

$ jsongrep  -output VALUE  y.arr[*].*~[xz] test.json
z
$ jsongrep  -output KEY  y.arr[*].*~[xz] test.json
e
$ jsongrep  -output PATH  y.arr[*].*~[xz] test.json
y.arr[1].e
$ jsongrep  -output KEYANDVALUE  y.arr[*].*~[xz] test.json
e=z

Another example on a couchdb database (find documents where a certain field has certain value)

$ jsongrep rows.*.doc.workflow=FOR_REPUBLICATION,rows.*.doc.mid  http://couchdbhost/database/_all_docs?include_docs=true  |
                grep -A 1 workflow

matching on value

It is also possible to match on value rather than path alone:

$ echo "{a:'b', y: {c:'x', arr:[{d:'y'}, {e:'z'}]}}"  | jsongrep  y.arr[*].*=z
y.arr[1].e=z

That can also be done using regular expressions

$ echo "{a:'b', y: {c:'x', arr:[{d:'y'}, {e:'z'}]}}"  | jsongrep  y.arr[*].*~[xz]
y.arr[1].e=z

arbitrary path

You can match directly inside the tree ('…' means 'an arbitrary path)

$ echo "{a:'b', y: {c:'x', arr:[{d:'y'}, {e:'z'}]}}"  | jsongrep  '...e'
y.arr[1].e=z

contains

It’s possible to match on object containing a certain key:

$ echo "{a:'b', y: {c:'x', arr:[{d:'y'}, {e:'z'}]}}"  | jsongrep  '...arr[*] contains d'
y.arr[0]={...}

or the inverse

$ echo "{a:'b', y: {c:'x', arr:[{d:'y'}, {e:'z'}]}}"  | jsongrep  '...arr[*] ! contains d'
y.arr[1]={...}

javascript matching

Matching can be implemented with a javascript function as well:

$ echo "{a:'b', y: {c:'x', arr:[{d:'y'}, {e:'z'}]}}"  | jsongrep -output KEYANDFULLVALUE '...arr[*] function(doc) { return doc.d == "y"; }'
[0]={"d":"y"}

separators

jsongrep supports the '-sep', '-recordsep' and '-record' parameters. They are intended for example to generate one line abstracts of a bunch of json records. E.g. create a file with 3 fields per line, separated by a tab. The 3 fields are 3 different keys from an array of json objects.

$ jsongrep -output VALUE -sep "     "  -record '*' '*.mid,*.publishDate,*.lastModified'  es.all.json  | sort > es.txt

The -record parameter defines what constitutes the start of a new record. If this matches a 'recordsep' will be outputted (this defaults to a newline). Normally between matches a newline is outputted, but when you use -record you’d probably don’t want that. In this example using the -sep argument a tab character is outputted between matches.

Normally, when using this 'record' functionality, the output record will be implicitely sorted like the matches. So in this case first the 'mid', then 'publishDate' then 'lastModified', independent from the order they appeared in the json document. With the '-sortfields' parameter you can disable this behaviour, and simply output in the original order.

Stream editing ('jsonsed')

A variant of 'jsongrep' is 'jsonsed'. This will just output the incoming json, but it will apply the replacements (which are possible in jsongrep too).

 $ echo '{ "items" : [ { "a" : "abc def"},  { "a" : "xyz qwv"}]} ' | jsonsed -ignoreArrays -format 'items.a~abc\s*(.*)~def'
{
  "items" : [ {
    "a" : "def"
  }, {
    "a" : "xyz qwv"
  } ]
}
Note
The syntax for replacement currenlty is <path>~<value>~<replacement>. This will make it hard to have a literal ~ in the value. The parser may be changed to be more like sed itself. <path>~<ANY><value><ANY><replacement> or so (where <ANY> will be a character you can choose like / or | )

Elasticsearch

javadoc

Implemented in mihxil-es, and contains a tool to download an entire elasticsearch database.

Json-include

This is unfinished. The idea is to have to tool to have something similar to x-include, but for json.

About

Json-stream utilities. Formatting of or grepping in arbitrary large json streams without running out of memory. Implemented in java/jackson.

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages