pika edited this page Sep 14, 2010 · 3 revisions

Introduction

JSON is a data format (RFC 4627) widely used on the internet. Jsonpat is a small command-line tool designed to perform JSON-to-JSON transformations. The transformations are mostly expressed through “patterns”, hence the name jsonpat.
Jsonpat is under active development. The syntax and/or semantics might change.

Install

Dependencies

You need the following packages to build jsonpat:

  1. make
  2. ocaml-3.11 (or higher)

Building

In the jsonpat source directory, type: make
This creates the executable jsonpat.native. Rename this file (or create a link to it) as suits you best, for example “jsonpat”.

Walking through an example

Suppose we have the following file (myexample.json):

{"name":"ouipk","gender":"M","age":20}
{"name":"Cartwright","gender":"M","age":39}
{"name":"Colbert","gender":"M","age":24}
{"name":"Saead","gender":"M","age":20}
{"name":"Kurtz","gender":"M","age":30}
{"name":"kandan","gender":"M","age":null}
{"name":"bach","gender":"M","age":23}
{"name":"Kumar","gender":"M","age":21}
{"name":"zaman","gender":"F","age":40}
{"name":"maharjan","gender":"F","age":20}

The option -type

The first question that comes to mind is usually: what is in there? In other words, one wants to know the “type” of the values present in the file. Jsonpat can infer the type of the JSON values present in a file with the option “-type”.

$ jsonpat -type myexample.json
type main = t

and t = {
age: int[20,40] | null ;
gender: gender ;
name: string ;
}

and gender =
"F" | "M"

This type description tells us several things:

  1. the field age is “optional”: it can sometimes be null or absent
  2. the field age, when present, is an integer in the range 20 .. 40
  3. the field gender is a string with only 2 possible values: “M” and “F”
  4. the field name is a string
  5. both fields name and gender are always present
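
To make the reading of `age: int[20,40] | null` concrete, here is a minimal Python sketch of how such a summary could be computed. It is not jsonpat's implementation; the helper name `int_field_type` is hypothetical, and treating "null" and "absent" the same way is an assumption taken from the list above.

```python
import json

# The records of myexample.json, one JSON value per line.
SAMPLE = """\
{"name":"ouipk","gender":"M","age":20}
{"name":"Cartwright","gender":"M","age":39}
{"name":"Colbert","gender":"M","age":24}
{"name":"Saead","gender":"M","age":20}
{"name":"Kurtz","gender":"M","age":30}
{"name":"kandan","gender":"M","age":null}
{"name":"bach","gender":"M","age":23}
{"name":"Kumar","gender":"M","age":21}
{"name":"zaman","gender":"F","age":40}
{"name":"maharjan","gender":"F","age":20}
"""

def int_field_type(records, field):
    """Describe an integer field as int[min,max]; append '| null' when the
    field is null or absent in at least one record (hypothetical helper)."""
    values = [r.get(field) for r in records]          # absent -> None
    present = [v for v in values if v is not None]    # drop null/absent
    desc = "int[%d,%d]" % (min(present), max(present))
    if len(present) < len(values):
        desc += " | null"
    return desc

records = [json.loads(line) for line in SAMPLE.splitlines()]
print("age:", int_field_type(records, "age"))  # age: int[20,40] | null
```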

The “-type” feature is useful for checking that the assumptions one makes about a file are correct. In fact, it simply automates what you would do by eye when exposed to new JSON data. More and more content on the internet uses JSON, and very little of it comes with a formal description of its type. However, most of the time, examples are available. If you copy and paste examples of valid inputs into a file and then run jsonpat -type on them, you very quickly get a sense of the domain of every value. For example, I ran the jsonpat type inference on 1000 Twitter queries (Twitter offers a JSON interface); the result was the following:

type main = t2

and t2 = {
completed_in: float ;
max_id: int[-1,8211333060] ;
next_page: string | null ;
page: int[1,1] ;
query: string ;
refresh_url: string | null ;
results: (results) array ;
results_per_page: int[15,15] ;
since_id: int[0,0] ;
total: int[4,15] | null ;
}

and results = {
created_at: string ;
from_user: string ;
from_user_id: int[933,92749170] ;
geo: geo | null ;
id: int[7763843784,8211333060] ;
iso_language_code: string | null ;
profile_image_url: string ;
source: string ;
text: string ;
to_user: string | null ;
to_user_id: int[2699,92749303] | null ;
}

and geo = {
coordinates: (float) array ;
type: t ;
}

and t =
"Point"

You may have noticed that whenever a string field has only a few possible values, its type definition enumerates those values. The option “-threshold” sets the maximum number of values that a string type can enumerate. Its default value is 5. For instance, if you change the threshold to 100 for the Twitter inputs, the iso_language_code type definition is refined and becomes:

and iso_language_code =
"da" | "de" | "en" | "eo" |
"es" | "fa" | "fi" | "fr" |
"he" | "hu" | "is" | "it" |
"ja" | "ko" | "lt" | "nl" |
"no" | "pl" | "pt" | "ru" |
"sv" | "th" | "zh"
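
The effect of the threshold can be sketched in a few lines of Python. This is only an illustration of the rule described above, not jsonpat's code: the function name `string_field_type` is hypothetical, and the choice of `<=` at the boundary (exactly 5 distinct values are still enumerated) is an assumption.

```python
def string_field_type(values, threshold=5):
    """Enumerate the distinct strings when there are at most `threshold`
    of them; otherwise fall back to the plain type "string"."""
    distinct = sorted(set(values))
    if len(distinct) <= threshold:   # boundary behaviour is an assumption
        return " | ".join('"%s"' % v for v in distinct)
    return "string"

codes = ["da", "de", "en", "eo", "es", "fa", "fi", "fr", "he", "hu", "is", "it",
         "ja", "ko", "lt", "nl", "no", "pl", "pt", "ru", "sv", "th", "zh"]

print(string_field_type(["M", "F", "M"]))        # "F" | "M"
print(string_field_type(codes))                  # string
print(string_field_type(codes, threshold=100))   # the 23 codes, enumerated
```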

A first program

Our first jsonpat program extracts the name from every JSON value present in “myexample.json”.

$ jsonpat -p '{"name":x} -> x' myexample.json
"ouipk"
"Cartwright"
"Colbert"
"Saead"
"Kurtz"
"kandan"
"bach"
"Kumar"
"zaman"
"maharjan"

Most jsonpat programs use the arrow operator. The left-hand side of an arrow is what we will call a “pattern”; its right-hand side is an “expression”. A pattern is more or less a JSON value where some parts have been replaced with identifiers. A simple way to explain what the arrow operator does is: if the pattern on the left-hand side matches the input value, then evaluate the right-hand side. In the example above, the pattern {"name":x} matches any JSON record with a field called “name”, and when the field “name” is present, its value is bound to the identifier “x”. Evaluating the expression “x” (on the right-hand side of the arrow) then returns the name found in the input.
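
As a rough mental model, the program `{"name":x} -> x` can be pictured in Python as "match, bind, evaluate". The sketch below is not how jsonpat is implemented; `match_name` and `run` are hypothetical names, and the assumption that an input rejected by the pattern simply produces no output is mine.

```python
import json

def match_name(value):
    """Mimic the pattern {"name":x}: return the bindings on success,
    None when the input is not a record with a "name" field."""
    if isinstance(value, dict) and "name" in value:
        return {"x": value["name"]}
    return None

def run(lines):
    """Apply `{"name":x} -> x` to each input line.
    Assumption: inputs the pattern rejects yield no output."""
    out = []
    for line in lines:
        bindings = match_name(json.loads(line))
        if bindings is not None:
            out.append(json.dumps(bindings["x"]))  # evaluate the expression x
    return out

print(run(['{"name":"ouipk","gender":"M","age":20}', '{"gender":"F"}']))
```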

Examples of programs

{field1:x} -> x | {field2:x} -> x
tries to extract the value of field1;
if the field is absent, extracts the
value of field2

{field1:_, x} -> x
extracts all the fields except field1

{?field1:x} -> x
extracts field1 when present,
x = null otherwise.
this pattern never fails

{field1:_} as x -> x
extracts all the records with a
field called “field1”

{field1:n} as x when n >= 0 -> x
extracts all the records with field1’s
value greater than or equal to 0

"s.*" as x -> x
extracts all the strings starting
with an ‘s’

{field1:int} as x -> x
extracts all the records where
field1 is an integer
possible types are:
(int,bool,float,array,object,string)

x :: _ -> x
extracts the first element of a list

[x] -> x
extracts the only element of a list of size 1

My_variant x -> x
Equivalent to ["My_variant", x] -> x

pi = 3.141593 ; x -> x+pi
adds pi to a float
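
Three of the constructs above (alternatives with `|`, optional fields with `?`, and guards with `when`) can be modelled in Python as follows. This is a sketch of the described behaviour only: every name in it (`NO_MATCH`, `field`, `optional_field`, `alt`, `guarded_record`) is hypothetical, and rejecting non-object inputs in `optional_field` is an assumption, since the text does not say what jsonpat does with them.

```python
NO_MATCH = object()  # sentinel: distinguishes "pattern failed" from a null result

def field(name):
    """Model of `{name:x} -> x`."""
    def rule(value):
        if isinstance(value, dict) and name in value:
            return value[name]
        return NO_MATCH
    return rule

def optional_field(name):
    """Model of `{?name:x} -> x`: x is None (null) when the field is absent."""
    def rule(value):
        if isinstance(value, dict):
            return value.get(name)
        return NO_MATCH  # assumption: only records are accepted
    return rule

def alt(*rules):
    """Model of `r1 | r2`: the first rule that matches wins."""
    def rule(value):
        for r in rules:
            result = r(value)
            if result is not NO_MATCH:
                return result
        return NO_MATCH
    return rule

def guarded_record(name, predicate):
    """Model of `{name:n} as x when predicate(n) -> x`."""
    def rule(value):
        if isinstance(value, dict) and name in value and predicate(value[name]):
            return value
        return NO_MATCH
    return rule

def non_negative(n):
    return isinstance(n, (int, float)) and n >= 0

either = alt(field("field1"), field("field2"))
print(either({"field2": 7}))                                       # 7
print(optional_field("field1")({}))                                # None
print(guarded_record("field1", non_negative)({"field1": 3}))       # {'field1': 3}
```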

Flow composition

The most important operator for “composing” transformations is >>. It works exactly like a “pipe” in a shell: the output of the left-hand side becomes the input of the right-hand side.

({field1:x} -> x) >> (x -> x + 1)
first extracts field1 and then adds 1 to it (in 2 separate steps)
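
The pipe analogy can be made concrete with a small Python sketch. Here each stage is modelled as a function returning a list of outputs (empty when its pattern does not match), which is my modelling choice rather than anything jsonpat documents; `extract_field1`, `add_one` and `pipe` are hypothetical names.

```python
def extract_field1(value):
    """Model of `{field1:x} -> x`: zero or one output."""
    if isinstance(value, dict) and "field1" in value:
        return [value["field1"]]
    return []

def add_one(value):
    """Model of `x -> x + 1`."""
    return [value + 1]

def pipe(left, right):
    """Model of `left >> right`: every output of left is fed to right."""
    def stage(value):
        return [out for mid in left(value) for out in right(mid)]
    return stage

program = pipe(extract_field1, add_one)
print(program({"field1": 41}))   # [42]
print(program({"other": 41}))    # []
```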

flatten
flattens lists of inputs: [1,2,3] becomes:
1
2
3

group
groups inputs sharing the same key (input must be sorted)
[1,"foo1"]
[1,"foo2"]
[2,"foo3"]
becomes:
[1,["foo1","foo2"]]
[2, "foo3"]
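
For readers who think in terms of stream processing, flatten and group behave much like the following Python generators. This is a sketch under assumptions, not jsonpat's code: `flatten` here unnests one level and passes non-list inputs through unchanged, and `group` always emits the members as a list, including for a key with a single member (the output shown above prints that case without the inner list). Like the description above, `group` only merges adjacent inputs, which is why the input must be sorted.

```python
from itertools import groupby

def flatten(inputs):
    """Emit each element of a list input as its own output (one level only)."""
    for value in inputs:
        if isinstance(value, list):
            yield from value
        else:
            yield value  # assumption: non-list inputs pass through

def group(inputs):
    """Merge consecutive [key, value] pairs that share the same key."""
    for key, members in groupby(inputs, key=lambda pair: pair[0]):
        yield [key, [pair[1] for pair in members]]

print(list(flatten([[1, 2, 3]])))
print(list(group([[1, "foo1"], [1, "foo2"], [2, "foo3"]])))
```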

Loading ocaml code

Jsonpat allows you to load a primitive that you wrote in OCaml. Your function must be a transformation on the jsonpat internal representation and should have the type JsonAst.value -> JsonAst.value. You must then make sure that your primitive is registered.

// In file test.ml:

let _ = JsonAst.register "capitalize"
  (function
    | String s -> String (String.uppercase s)
    | _ -> String "error")

$ ocamlopt -shared -I /jsonpat-path/_build test.ml -o test.cmxs
$ jsonpat -load ./test.cmxs -p 'x -> capitalize x.name' myexample.json

The -print option

If you are not sure about the precedence of the operators, you can check your program with the -print option. This option prints your program with all parentheses made explicit.

The -learn option

(THIS OPTION IS UNDER ACTIVE DEVELOPMENT, DO NOT USE IT FOR PRODUCTION CODE)

This option is useful only if you want to extract values of a primitive type (string, bool, int, float). It “guesses” the jsonpat program from the value you are trying to produce. You simply write what the output should be for the first input, and it automatically extracts the other values in a similar way.

$ jsonpat -learn '["ouipk","M"]' myexample.json
["ouipk","M"]
["Cartwright","M"]
["Colbert","M"]
["Saead","M"]
["Kurtz","M"]
["kandan","M"]
["bach","M"]
["Kumar","M"]
["zaman","F"]
["maharjan","F"]
You should check that the “guessed” program corresponds to what you are trying to do by using the “-print” option.

$ jsonpat -learn '["ouipk","M"]' myexample.json -print
({?name,?gender -> [x1,x2])