feat: byte slice JSON parser (#1415)
## Description

I implemented a JSON parser that operates on byte slices.

### State Machine For JSON

Each state transitions to the next when specific conditions are met,
representing the process of parsing a JSON structure sequentially. State
transitions are defined based on the conditions and actions that occur
during the parsing process.

The diagram includes the following states (all states are defined in
the `internal.gno` file):

- Parsing starts in the initial state (`__`) and returns an error if an
unexpected token is encountered.
- States for handling string (`ST`), number (`MI`, `ZE`, `IN`), boolean
(`T1`, `F1`), and null (`N1`) values.
- States for handling the start of objects (`co`) and arrays (`bo`), and
the end of objects and arrays (`ec`, `cc`, `bc`).
- States expecting keys (`KE`) and values (`VA`), and for handling
commas (`cm`) and colons (`cl`).

Each state deals with various scenarios that can occur during JSON
parsing, with transitions to the next state determined by the current
token and context. Below is a graph depicting how the states transition:

```mermaid
stateDiagram-v2
    [*] --> __: Start
    __ --> ST: String
    __ --> MI: Number
    __ --> ZE: Zero
    __ --> IN: Integer
    __ --> T1: Boolean (true)
    __ --> F1: Boolean (false)
    __ --> N1: Null
    __ --> ec: Empty Object End
    __ --> cc: Object End
    __ --> bc: Array End
    __ --> co: Object Begin
    __ --> bo: Array Begin
    __ --> cm: Comma
    __ --> cl: Colon
    __ --> OK: Success/End
    ST --> OK: String Complete
    MI --> OK: Number Complete
    ZE --> OK: Zero Complete
    IN --> OK: Integer Complete
    T1 --> OK: True Complete
    F1 --> OK: False Complete
    N1 --> OK: Null Complete
    ec --> OK: Empty Object Complete
    cc --> OK: Object Complete
    bc --> OK: Array Complete
    co --> OB: Inside Object
    bo --> AR: Inside Array
    cm --> KE: Expecting New Key
    cm --> VA: Expecting New Value
    cl --> VA: Expecting Value
    OB --> ST: String in Object (Key)
    OB --> ec: Empty Object
    OB --> cc: End Object
    AR --> ST: String in Array
    AR --> bc: End Array
    KE --> ST: String as Key
    VA --> ST: String as Value
    VA --> MI: Number as Value
    VA --> T1: True as Value
    VA --> F1: False as Value
    VA --> N1: Null as Value
    OK --> [*]: End
```
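
As a rough illustration of the table-driven approach described above,
here is a minimal sketch in plain Go. It is not the code from
`internal.gno`: the state names (`stStart`, `stInt`, `stDone`,
`stError`), the character classes, and the `run` helper are all
hypothetical, and the table only recognizes a whitespace-delimited
integer.

```go
package main

import "fmt"

// state is a hypothetical, heavily reduced state set.
// The real states and tables live in internal.gno.
type state int

const (
	stStart state = iota // waiting for a value
	stInt                // inside an integer
	stDone               // value complete
	stError              // unexpected input
)

// class groups input bytes so the transition table stays small.
type class int

const (
	clSpace class = iota
	clDigit
	clOther
)

// classify maps a byte to its character class.
func classify(c byte) class {
	switch {
	case c == ' ' || c == '\t' || c == '\n' || c == '\r':
		return clSpace
	case c >= '0' && c <= '9':
		return clDigit
	}
	return clOther
}

// transitions[current state][character class] gives the next state.
var transitions = [4][3]state{
	stStart: {clSpace: stStart, clDigit: stInt, clOther: stError},
	stInt:   {clSpace: stDone, clDigit: stInt, clOther: stError},
	stDone:  {clSpace: stDone, clDigit: stError, clOther: stError},
	stError: {clSpace: stError, clDigit: stError, clOther: stError},
}

// run feeds every byte through the table and returns the final state.
func run(input []byte) state {
	s := stStart
	for _, c := range input {
		s = transitions[s][classify(c)]
	}
	return s
}

func main() {
	fmt.Println(run([]byte(" 42 ")) == stDone)
	fmt.Println(run([]byte("4x")) == stError)
}
```

The point of the sketch is that adding a new token type means adding
rows and columns to the table rather than writing new branching code.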

## Walkthrough of the JSON Machine

Gno is not completely compatible with Go, and many functions in the
standard library are not implemented yet. This PR therefore adds some
files that are not directly related to JSON but are needed to implement
the parser.

### Float Value Handler

The `strconv` package currently provided by gno has functions injected
for parsing basic `int` and `uint` types, but does not have an
implementation for parsing floating-point numbers with `ParseFloat`.
Therefore, I have brought over the implementation of the `eisel-lemire`
algorithm from Go's strconv package
(`./p/demo/json/eisel_lemire`).<sup>[1](#footnote_1)</sup>

Additionally, the `FormatFloat` function is not implemented yet either,
so I imported the `ryu64` algorithm<sup>[2](#footnote_2)</sup> to
implement this feature (`./p/demo/json/ryu`).

I plan to add this code to the `strconv` package if possible, so that
the necessary functionality can be written entirely in gno.
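
For reference, this is the behavior the two ported algorithms stand in
for, shown with the standard Go `strconv` package rather than the gno
code in this PR. The `roundTrip` helper is a hypothetical name used only
for this sketch.

```go
package main

import (
	"fmt"
	"strconv"
)

// roundTrip parses a decimal string into a float64 and formats it back
// using the shortest representation that still parses to the same value.
func roundTrip(s string) (string, error) {
	f, err := strconv.ParseFloat(s, 64)
	if err != nil {
		return "", err
	}
	// Precision -1 asks for the minimal number of digits needed to
	// represent f uniquely.
	return strconv.FormatFloat(f, 'g', -1, 64), nil
}

func main() {
	out, err := roundTrip("0.1000")
	fmt.Println(out, err)
}
```

A JSON number such as `0.1000` is therefore decoded to a `float64` and
encoded again as `0.1`, which is the round trip the parser needs.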

### Buffer

`buffer.gno` handles the internal buffer and its interaction with the
state machine for JSON parsing. The buffer processes the JSON string
sequentially, interpreting the meaning of each character and deciding
the next action through the state machine.

Here, I'll describe the key functions and how they interact with the
state machine. The `/N` after a function name is a notation borrowed
from Elixir to indicate the number of parameters:

- `newBuffer`: Creates a new buffer instance containing the given data.
Its state is set to `GO`, the state machine's initial state, which marks
the start of parsing.

- `first`: Finds the first meaningful (non-whitespace) character.
Although the state machine is not yet activated at this stage, the
result of this function plays a crucial role in determining the first
step of parsing.

- `current`, `next`, `step`: These functions manage the current position
within the buffer, reading characters or moving to the next one.
`current` returns the character at the current index, `next` returns the
next character, and `step` only moves to the next position. These
movement functions are necessary to decide what input should be
processed when the state machine transitions to the next state.

- `getState`: Determines the next state based on the character at the
current buffer position. This function evaluates the class (type of
character) of the current character and uses a state transition table to
decide the next state. This process is central to how the state machine
interprets the JSON structure.

- `numeric/1`, `string/2`, `word/1`: These functions parse numbers,
strings, and specified word tokens, respectively. During parsing, the
state machine transitions to the next state based on the current
character's type and context, which is essential for accurately
interpreting the structure and meaning of JSON data.

- `skip`, `skipAny/1`: Functions for skipping characters that meet
certain conditions, such as moving the buffer index until a specific
character or set of tokens is encountered. These functions are mainly
used to manage the current state of the state machine while parsing
structural elements (e.g., the end of an object or array).

These functions interact closely with the state machine to recognize and
interpret the various data types and structures within the JSON string.
The current state of the state machine changes with each character or
string the buffer processes, which is what drives the parsing process.
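
The cursor-style helpers described above can be sketched as follows.
This is a simplified stand-in written in plain Go, not the code from
`buffer.gno`: the struct fields, the `errEOF` value, and the exact
signatures are assumptions, and the state machine fields are left out.
In particular, having `next` advance the index before reading is an
assumption of this sketch.

```go
package main

import (
	"errors"
	"fmt"
)

// buffer is a simplified stand-in for the type in buffer.gno.
type buffer struct {
	data  []byte
	index int
}

// errEOF is a hypothetical error for running past the end of the data.
var errEOF = errors.New("unexpected end of input")

func newBuffer(data []byte) *buffer { return &buffer{data: data} }

// current returns the byte at the current index.
func (b *buffer) current() (byte, error) {
	if b.index >= len(b.data) {
		return 0, errEOF
	}
	return b.data[b.index], nil
}

// step advances the index by one without returning a byte.
func (b *buffer) step() error {
	if b.index+1 >= len(b.data) {
		return errEOF
	}
	b.index++
	return nil
}

// next advances the index and returns the byte at the new position.
func (b *buffer) next() (byte, error) {
	if err := b.step(); err != nil {
		return 0, err
	}
	return b.current()
}

// first skips leading whitespace and returns the first meaningful byte.
func (b *buffer) first() (byte, error) {
	for ; b.index < len(b.data); b.index++ {
		switch b.data[b.index] {
		case ' ', '\t', '\n', '\r':
			continue // still whitespace, keep scanning
		}
		return b.data[b.index], nil
	}
	return 0, errEOF
}

func main() {
	b := newBuffer([]byte("  {\"a\":1}"))
	c, _ := b.first()
	n, _ := b.next()
	fmt.Printf("%c %c\n", c, n)
}
```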

### Unescape

The unescape functions process JSON strings by unescaping characters as
required by the JSON standard. This involves translating escape
sequences like `\uXXXX` for Unicode characters, as well as simpler
escapes like `\\`, `\/`, `\b`, `\f`, `\n`, `\r`, and `\t`.

Here are the key functions in this file:

- `Unescape/2`: This is the primary function that takes an input byte
slice (representing a JSON string with escape sequences) and an output
byte slice to write the unescaped version of the input. It processes
each escape sequence encountered in the input slice and translates it
into the corresponding UTF-8 character in the output slice.

- `Unquote/2`: This function is designed to remove surrounding quotes
from a JSON string and unescape its contents. It's useful for processing
JSON string values to their literal representations.
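
To make the escape handling concrete, here is a minimal sketch in plain
Go. It is not the `Unescape/2` implementation from this PR: the
`unescape` helper, its signature, and `errBadEscape` are hypothetical,
and the sketch deliberately does not combine UTF-16 surrogate pairs.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"unicode/utf8"
)

// errBadEscape is a hypothetical error for malformed escape sequences.
var errBadEscape = errors.New("invalid escape sequence")

// simple maps single-character escapes to the bytes they stand for.
var simple = map[byte]byte{
	'"': '"', '\\': '\\', '/': '/',
	'b': '\b', 'f': '\f', 'n': '\n', 'r': '\r', 't': '\t',
}

// unescape returns a copy of in with JSON escape sequences decoded.
// Surrogate pairs are not combined in this sketch.
func unescape(in []byte) ([]byte, error) {
	out := make([]byte, 0, len(in))
	for i := 0; i < len(in); i++ {
		if in[i] != '\\' {
			out = append(out, in[i])
			continue
		}
		i++ // move to the character after the backslash
		if i >= len(in) {
			return nil, errBadEscape
		}
		if c, ok := simple[in[i]]; ok {
			out = append(out, c)
			continue
		}
		// Anything else must be \u followed by four hex digits.
		if in[i] != 'u' || i+4 >= len(in) {
			return nil, errBadEscape
		}
		r, err := strconv.ParseUint(string(in[i+1:i+5]), 16, 16)
		if err != nil {
			return nil, errBadEscape
		}
		out = utf8.AppendRune(out, rune(r))
		i += 4 // skip the four hex digits
	}
	return out, nil
}

func main() {
	out, err := unescape([]byte(`caf\u00e9\tok`))
	fmt.Printf("%q %v\n", out, err)
}
```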

### Node

When a JSON string is decoded, the package converts the data into a Node
type.

```go
type Node struct {
    prev     *Node            // prev is the parent node of the current node.
    next     map[string]*Node // next is the child nodes of the current node.
    key      *string          // key holds the key of the current node in the parent node.
    data     []byte           // byte slice of JSON data
    value    interface{}      // value holds the value of the current node.
    nodeType ValueType        // NodeType holds the type of the current node. (Object, Array, String, Number, Boolean, Null)
    index    *int             // index holds the index of the current node in the parent array node.
    borders  [2]int           // borders stores the start and end index of the current node in the data.
    modified bool             // modified indicates the current node is changed or not.
}
```

This node type allows you to fetch and manipulate specific values from
JSON. For example, you can use the `GetKey/1` function to retrieve the
value stored at a specific key, and you can use `Delete` to remove a
node. Together, these operations let you process JSON data.
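
The parent and child links in the struct are what make lookups and
deletions cheap. The sketch below shows the idea with a trimmed-down
stand-in type in plain Go; `node`, `newObject`, `set`, `getKey`, and
`remove` are hypothetical names and do not mirror the package's actual
method signatures.

```go
package main

import (
	"errors"
	"fmt"
)

// node is a trimmed-down stand-in for the package's Node type.
type node struct {
	prev  *node            // parent node
	next  map[string]*node // child nodes by key
	key   *string          // key of this node in the parent
	value interface{}      // decoded value
}

// newObject returns an empty object node with no parent.
func newObject() *node { return &node{next: map[string]*node{}} }

// set attaches a child holding value under key and returns the child.
func (n *node) set(key string, value interface{}) *node {
	k := key
	child := &node{prev: n, key: &k, value: value}
	n.next[key] = child
	return child
}

// getKey returns the child stored under key, or an error if it is absent.
func (n *node) getKey(key string) (*node, error) {
	child, ok := n.next[key]
	if !ok {
		return nil, errors.New("key not found: " + key)
	}
	return child, nil
}

// remove detaches the node from its parent, if it has one.
func (n *node) remove() {
	if n.prev == nil || n.key == nil {
		return
	}
	delete(n.prev.next, *n.key)
	n.prev = nil
}

func main() {
	root := newObject()
	root.set("foo", true)

	v, err := root.getKey("foo")
	fmt.Println(v.value, err)

	v.remove()
	_, err = root.getKey("foo")
	fmt.Println(err)
}
```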

--- 
<a name="footnote_1">1</a>: The Eisel-Lemire algorithm provides a fast
way to parse floating-point numbers from strings. The core idea of this
algorithm is to minimize potential errors during the conversion process
from strings to numbers, while processing the conversion as quickly as
possible. Eisel-Lemire is particularly useful when dealing with large
amounts of numerical data, where it is typically much faster than
traditional parsing methods while still producing accurate results.

<a name="footnote_2">2</a>: The Ryu algorithm focuses on converting
floating-point numbers to strings. Ryu generally converts floating-point
numbers to the shortest possible string representation accurately, with
excellent performance and precision. A key advantage of the Ryu
algorithm is that the converted string maintains the minimum length
while precisely representing the original number. This helps save
storage space and reduces data transmission times over networks.
notJoon authored Mar 29, 2024
1 parent 8ae1e7f commit 6afab42
Showing 29 changed files with 8,694 additions and 0 deletions.
21 changes: 21 additions & 0 deletions examples/gno.land/p/demo/json/LICENSE
@@ -0,0 +1,21 @@
# MIT License

Copyright (c) 2019 Pyzhov Stepan

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
170 changes: 170 additions & 0 deletions examples/gno.land/p/demo/json/README.md
@@ -0,0 +1,170 @@
# JSON Parser

The JSON parser is a package that provides functionality for parsing and processing JSON strings. This package accepts JSON strings as byte slices.

Currently, gno does not [support the `reflect` package](https://docs.gno.land/concepts/effective-gno#reflection-is-never-clear), so type information cannot be retrieved at runtime. This package is therefore designed to infer and handle type information while parsing JSON strings, using a state machine approach.

After passing through the state machine, JSON strings are represented as the `Node` type. The `Node` type represents nodes for JSON data, including various types such as `ObjectNode`, `ArrayNode`, `StringNode`, `NumberNode`, `BoolNode`, and `NullNode`.

This package provides methods for manipulating, searching, and extracting the Node type.

## State Machine

To parse JSON strings, a [finite state machine](https://en.wikipedia.org/wiki/Finite-state_machine) approach is used. The state machine transitions to the next state based on the current state and the input character while parsing the JSON string. Through this method, type information can be inferred and processed without reflect, and the amount of parser code can be significantly reduced.

The diagram below shows the state transitions of the state machine according to the states and input characters.

```mermaid
stateDiagram-v2
[*] --> __: Start
__ --> ST: String
__ --> MI: Number
__ --> ZE: Zero
__ --> IN: Integer
__ --> T1: Boolean (true)
__ --> F1: Boolean (false)
__ --> N1: Null
__ --> ec: Empty Object End
__ --> cc: Object End
__ --> bc: Array End
__ --> co: Object Begin
__ --> bo: Array Begin
__ --> cm: Comma
__ --> cl: Colon
__ --> OK: Success/End
ST --> OK: String Complete
MI --> OK: Number Complete
ZE --> OK: Zero Complete
IN --> OK: Integer Complete
T1 --> OK: True Complete
F1 --> OK: False Complete
N1 --> OK: Null Complete
ec --> OK: Empty Object Complete
cc --> OK: Object Complete
bc --> OK: Array Complete
co --> OB: Inside Object
bo --> AR: Inside Array
cm --> KE: Expecting New Key
cm --> VA: Expecting New Value
cl --> VA: Expecting Value
OB --> ST: String in Object (Key)
OB --> ec: Empty Object
OB --> cc: End Object
AR --> ST: String in Array
AR --> bc: End Array
KE --> ST: String as Key
VA --> ST: String as Value
VA --> MI: Number as Value
VA --> T1: True as Value
VA --> F1: False as Value
VA --> N1: Null as Value
OK --> [*]: End
```

## Examples

This package provides parsing functionality along with encoding and decoding functionality. The following examples demonstrate how to use this package.

### Decoding

Decoding (or Unmarshaling) is the functionality that converts an input byte slice JSON string into a `Node` type.

The converted `Node` type allows you to modify the JSON data or search and extract data that meets specific conditions.

```go
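package main

import (
	"gno.land/p/demo/json"
	"gno.land/p/demo/ufmt"
)

func main() {
	node, err := json.Unmarshal([]byte(`{"foo": "var"}`))
	if err != nil {
		// ufmt.Errorf only builds an error value; panic to surface it.
		panic(ufmt.Errorf("error: %v", err))
	}

	// ufmt.Sprintf returns a string, so print it explicitly.
	println(ufmt.Sprintf("node: %v", node))
}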
```

### Encoding

Encoding (or Marshaling) is the functionality that converts JSON data represented as a Node type into a byte slice JSON string.

> ⚠️ Caution: Converting a large `Node` type into a JSON string may _impact performance_ or cause _unexpected behavior_.

```go
package main

import (
	"gno.land/p/demo/json"
	"gno.land/p/demo/ufmt"
)

func main() {
	// Constructors and the Node type are qualified with the package name.
	node := json.ObjectNode("", map[string]*json.Node{
		"foo": json.StringNode("foo", "bar"),
		"baz": json.NumberNode("baz", 100500),
		"qux": json.NullNode("qux"),
	})

	b, err := json.Marshal(node)
	if err != nil {
		panic(ufmt.Errorf("error: %v", err))
	}

	println(ufmt.Sprintf("json: %s", string(b)))
}
```

### Searching

Once the JSON data is converted into a `Node` type, you can **search** and **extract** data that satisfies specific conditions. For example, you can find data with a specific type or data with a specific key.

To use this functionality, use the `GetXXX`-prefixed methods. The `MustXXX` methods provide the same functionality, but they will **panic** if the data doesn't satisfy the condition.

Here is an example of finding data with a specific key. For more examples, please refer to the [node.gno](node.gno) file.

```go
package main

import (
	"gno.land/p/demo/json"
	"gno.land/p/demo/ufmt"
)

func main() {
	root, err := json.Unmarshal([]byte(`{"foo": true, "bar": null}`))
	if err != nil {
		panic(ufmt.Errorf("error: %v", err))
	}

	// "foo" exists and holds a boolean.
	value, err := root.GetKey("foo")
	if err != nil {
		panic(ufmt.Errorf("error occurred while getting key, %s", err))
	}

	if !value.MustBool() {
		panic("value is not true")
	}

	// "bar" exists and holds null.
	_, err = root.GetKey("bar")
	if err != nil {
		panic(ufmt.Errorf("error occurred while getting key, %s", err))
	}

	// "baz" is missing, so GetKey is expected to return an error.
	_, err = root.GetKey("baz")
	if err == nil {
		panic("key baz does not exist; GetKey must fail")
	}
}
```

## Contributing

Please submit any issues or pull requests for this package through the GitHub repository at [gnolang/gno](<https://github.com/gnolang/gno>).