opt: add decoder Option NoValidateJSON for skipping JSON faster #696

Merged
merged 22 commits into from
Sep 13, 2024
20 changes: 11 additions & 9 deletions .github/workflows/benchmark.yml
Expand Up @@ -27,11 +27,12 @@ jobs:
${{ runner.os }}-go-

- name: Benchmark Target
continue-on-error: true
run: |
export SONIC_NO_ASYNC_GC=1
export SONIC_BENCH_SINGLE=1
go test -run ^$ -count=10 -benchmem -bench 'Benchmark(Encoder|Decoder)_(Generic|Binding)_Sonic' ./decoder >> /var/tmp/sonic_bench_target.out
go test -run ^$ -count=10 -benchmem -bench 'Benchmark(Get|Set)One_Sonic|BenchmarkParseSeven_Sonic' ./ast >> /var/tmp/sonic_bench_target.out
go test -run ^$ -count=100 -benchmem -bench 'BenchmarkDecoder_(Generic|Binding)_Sonic' ./decoder >> /var/tmp/sonic_bench_target_${{ github.run_id }}.out
go test -run ^$ -count=100 -benchmem -bench 'BenchmarkEncoder_(Generic|Binding)_Sonic' ./encoder >> /var/tmp/sonic_bench_target_${{ github.run_id }}.out
go test -run ^$ -count=100 -benchmem -bench 'Benchmark(Get|Set)One_Sonic|BenchmarkParseSeven_Sonic' ./ast >> /var/tmp/sonic_bench_target_${{ github.run_id }}.out

- name: Clear repository
run: sudo rm -fr $GITHUB_WORKSPACE && mkdir $GITHUB_WORKSPACE
Expand All @@ -42,14 +43,15 @@ jobs:
ref: main

- name: Benchmark main
continue-on-error: true
run: |
export SONIC_NO_ASYNC_GC=1
export SONIC_BENCH_SINGLE=1
go test -run ^$ -count=10 -benchmem -bench 'Benchmark(Encoder|Decoder)_(Generic|Binding)_Sonic' ./decoder >> /var/tmp/sonic_bench_main.out
go test -run ^$ -count=10 -benchmem -bench 'Benchmark(Get|Set)One_Sonic|BenchmarkParseSeven_Sonic' ./ast >> /var/tmp/sonic_bench_main.out
UNIQUE_ID=${{ github.run_id }}
go test -run ^$ -count=100 -benchmem -bench 'BenchmarkDecoder_(Generic|Binding)_Sonic' ./decoder >> /var/tmp/sonic_bench_main_${{ github.run_id }}.out
go test -run ^$ -count=100 -benchmem -bench 'BenchmarkEncoder_(Generic|Binding)_Sonic' ./encoder >> /var/tmp/sonic_bench_main_${{ github.run_id }}.out
go test -run ^$ -count=100 -benchmem -bench 'Benchmark(Get|Set)One_Sonic|BenchmarkParseSeven_Sonic' ./ast >> /var/tmp/sonic_bench_main_${{ github.run_id }}.out

- name: Diff bench
continue-on-error: true
run: |
go get golang.org/x/perf/cmd/benchstat && go install golang.org/x/perf/cmd/benchstat
benchstat -format=csv /var/tmp/sonic_bench_target.out /var/tmp/sonic_bench_main.out
# run: ./scripts/bench.py -t 0.05 -d /var/tmp/sonic_bench_target.out,/var/tmp/sonic_bench_main.out x
./scripts/bench.py -t 0.20 -d /var/tmp/sonic_bench_target_${{ github.run_id }}.out,/var/tmp/sonic_bench_main_${{ github.run_id }}.out x
1 change: 1 addition & 0 deletions .gitignore
Expand Up @@ -52,3 +52,4 @@ fuzz/testdata
*__debug_bin*
*pprof
*coverage.txt
tools/venv/*
28 changes: 18 additions & 10 deletions README.md
Expand Up @@ -283,19 +283,21 @@ sub := root.Get("key3").Index(2).Int64() // == 3
**Tip**: since `Index()` uses an offset to locate data, which is much faster than scanning as `Get()` does, we suggest you use it as much as possible. Sonic also provides another API, `IndexOrGet()`, which uses the offset under the hood while also ensuring that the key matches.

#### SearchOption

`Searcher` provides some options to meet different user needs:

```go
opts := ast.SearchOption{ CopyReturn: true ... }
val, err := sonic.GetWithOptions(JSON, opts, "key")
```

- CopyReturn
Instructs the searcher to copy the result JSON string instead of referencing the input. This can help reduce memory usage if you cache the results.
- ConcurentRead
Since `ast.Node` uses a `Lazy-Load` design, it doesn't support concurrent reads by default. If you want to read it concurrently, specify this option.
- ValidateJSON
Instructs the searcher to validate the entire JSON. This option is enabled by default, which slows down the search a little.


#### Set/Unset

Modify the json content by Set()/Unset()
Expand All @@ -314,6 +316,7 @@ println(alias1 == alias2) // true
exist, err := root.UnsetByIndex(1) // exist == true
println(root.Get("key4").Check()) // "value not exist"
```

#### Serialize

To encode `ast.Node` as json, use `MarshalJSON()` or `json.Marshal()` (MUST pass the node's pointer)
Expand Down Expand Up @@ -381,16 +384,12 @@ See [ast/visitor.go](https://github.com/bytedance/sonic/blob/main/ast/visitor.go

## Compatibility

Sonic **DOES NOT** guarantee support for all environments, due to the difficulty of developing high-performance code. For developers who use sonic to build their applications in different environments, we have the following suggestions:

- Developing on **Mac M1**: Make sure you have Rosetta 2 installed on your machine, and set `GOARCH=amd64` when building your application. Rosetta 2 can automatically translate x86 binaries to arm64 binaries and run x86 applications on Mac M1.
- Developing on **Linux arm64**: You can install qemu and use the `qemu-x86_64 -cpu max` command to translate x86 binaries to arm64 binaries for applications built with sonic. qemu achieves a translation effect similar to Rosetta 2 on Mac M1.
For developers who want to use sonic in different scenarios, we provide some integrated configs as `sonic.API`:

For developers who want to use sonic on Linux arm64 without qemu, or those who want to handle JSON strictly consistent with `encoding/json`, we provide some compatible APIs as `sonic.API`

- `ConfigDefault`: sonic's default config (`EscapeHTML=false`,`SortKeys=false`...) to run in a sonic-supporting environment. It will fall back to `encoding/json` with the corresponding config, and some options like `SortKeys=false` will be invalid.
- `ConfigStd`: the std-compatible config (`EscapeHTML=true`,`SortKeys=true`...) to run in a sonic-supporting environment. It will fall back to `encoding/json`.
- `ConfigFastest`: the fastest config (`NoQuoteTextMarshaler=true`) to run in a sonic-supporting environment. It will fall back to `encoding/json` with the corresponding config, and some options will be invalid.
- `ConfigDefault`: sonic's default config (`EscapeHTML=false`,`SortKeys=false`...), which runs fast while still ensuring security.
- `ConfigStd`: the std-compatible config (`EscapeHTML=true`,`SortKeys=true`...).
- `ConfigFastest`: the fastest config (`NoQuoteTextMarshaler=true`), which runs sonic as fast as possible.

Sonic **DOES NOT** guarantee support for all environments, due to the difficulty of developing high-performance code. In an environment that sonic does not support, the implementation falls back to `encoding/json`, so all of the configs above behave the same as `ConfigStd`.
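
To make the fallback behavior concrete, here is a minimal sketch of what `encoding/json` itself does with HTML escaping and map key order. It uses only the standard library; `stdMarshal` and `noEscapeMarshal` are hypothetical helper names, and nothing here calls sonic.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// stdMarshal shows the standard library defaults:
// HTML characters are escaped and map keys are sorted.
func stdMarshal(v interface{}) (string, error) {
	out, err := json.Marshal(v)
	return string(out), err
}

// noEscapeMarshal disables HTML escaping through json.Encoder.
// Map keys are still sorted; encoding/json has no switch for that.
func noEscapeMarshal(v interface{}) (string, error) {
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf)
	enc.SetEscapeHTML(false)
	if err := enc.Encode(v); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	v := map[string]string{"b": "<x>", "a": "1"}

	s1, _ := stdMarshal(v)
	s2, _ := noEscapeMarshal(v)

	fmt.Println(s1) // {"a":"1","b":"\u003cx\u003e"}
	fmt.Print(s2)   // {"a":"1","b":"<x>"} (Encode appends a newline)
}
```

Because `encoding/json` always sorts map keys, an option such as `SortKeys=false` cannot take effect once the fallback is in use.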

## Tips

Expand Down Expand Up @@ -480,8 +479,17 @@ For better performance, in previous case the `ast.Visitor` will be the better ch
But `ast.Visitor` is not a very handy API. You might need to write a lot of code to implement your visitor and carefully maintain the tree hierarchy during decoding. Please read the comments in [ast/visitor.go](https://github.com/bytedance/sonic/blob/main/ast/visitor.go) carefully if you decide to use this API.

### Buffer Size

Sonic uses memory pools in many places, such as `encoder.Encode` and `ast.Node.MarshalJSON`, to improve performance, which may increase memory usage (in-use) when the server's load is high. See [issue 614](https://github.com/bytedance/sonic/issues/614). Therefore, we introduced some options to let users control the behavior of the memory pool. See the [option](https://pkg.go.dev/github.com/bytedance/sonic@v1.11.9/option#pkg-variables) package.

### Faster JSON skip

For security, sonic uses the [FSM](native/skip_one.c) algorithm to validate JSON when decoding raw JSON or encoding `json.Marshaler`, which is much slower (1~10x) than the [SIMD-searching-pair](native/skip_one_fast.c) algorithm. If you have many redundant JSON values and DO NOT NEED to strictly validate JSON correctness, you can enable the options below:

- `Config.NoValidateJSONSkip`: skips JSON faster when decoding, for values such as unknown fields, `json.Unmarshaler` (`json.RawMessage`), mismatched values, and redundant array elements
- `Config.NoValidateJSONMarshaler`: avoids validating JSON when encoding `json.Marshaler`
- `SearchOption.ValidateJSON`: indicates whether to validate the located JSON value on `Get`

## Community

Sonic is a subproject of [CloudWeGo](https://www.cloudwego.io/). We are committed to building a cloud native ecosystem.
26 changes: 17 additions & 9 deletions README_ZH_CN.md
Expand Up @@ -283,11 +283,14 @@ sub := root.Get("key3").Index(2).Int64() // == 3
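**Note**: since `Index()` locates data by offset, it is much faster than `Get()`, which scans; we suggest using `Index` whenever possible. Sonic also provides another API, `IndexOrGet()`, which is offset-based and also ensures that the key matches.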

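#### Search Options

`ast.Searcher` provides some options to meet different user needs:

```
opts := ast.SearchOption{CopyReturn: true…}
val, err := sonic.GetWithOptions(JSON, opts, "key")
```

- CopyReturn
Instructs the searcher to copy the result JSON string instead of referencing the input. This helps reduce memory usage if the user caches the results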
- ConcurentRead
Expand Down Expand Up @@ -381,16 +384,12 @@ type Visitor interface {

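## Compatibility

Due to the difficulty of developing high-performance code, Sonic does **not** guarantee support for all environments. For developers who use Sonic to build applications in different environments, we have the following suggestions:

- Developing on **Mac M1**: make sure Rosetta 2 is installed on your machine, and set `GOARCH=amd64` when building. Rosetta 2 can automatically translate x86 binaries to arm64 binaries and run x86 applications on Mac M1.
- Developing on **Linux arm64**: you can install qemu and use the `qemu-x86_64 -cpu max` command to translate x86 binaries to arm64 binaries. qemu achieves a translation effect similar to Rosetta 2 on Mac M1.

For developers who want to use sonic without qemu, or who want to handle JSON strictly consistently with `encoding/json`, we provide some compatible APIs in `sonic.API`
For developers who want to use sonic in different scenarios, we provide some integrated configs:

- `ConfigDefault`: sonic's default config (`EscapeHTML=false`, `SortKeys=false`, etc.) in a sonic-supporting environment. It behaves consistently with `encoding/json` under the corresponding config, and some options, such as `SortKeys=false`, will be invalid.
- `ConfigStd`: the std-compatible config (`EscapeHTML=true`, `SortKeys=true`, etc.) in a sonic-supporting environment. It behaves consistently with `encoding/json`.
- `ConfigFastest`: the fastest config (`NoQuoteTextMarshaler=true`) in a sonic-supporting environment. It behaves consistently with `encoding/json` under the corresponding config, and some options will be invalid.
- `ConfigDefault`: sonic's default config (`EscapeHTML=false`, `SortKeys=false`…), which ensures performance while also taking security into account.
- `ConfigStd`: the config guaranteed to be fully compatible with `encoding/json`
- `ConfigFastest`: the fastest config (`NoQuoteTextMarshaler=true...`), which gives the best performance but omits some security checks (UTF-8 validation, etc.)

Sonic does **not** guarantee support for all environments, due to the difficulty of developing high-performance code. In environments that sonic does not support, the implementation falls back to `encoding/json`, so all of the configs above are equal to `ConfigStd`.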

## Tips

Expand Down Expand Up @@ -478,8 +477,17 @@ go someFunc(user)
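However, `ast.Visitor` is not a very handy API. You may need to write a lot of code to implement your own `ast.Visitor`, and carefully maintain the tree hierarchy during parsing. If you decide to use this API, please read the comments in [ast/visitor.go](https://github.com/bytedance/sonic/blob/main/ast/visitor.go) carefully first.

### Buffer Size

Sonic uses memory pools in many places, such as `encoder.Encode` and `ast.Node.MarshalJSON`, to improve performance, which may lead to higher memory usage (in-use) when the server load is high. See [issue 614](https://github.com/bytedance/sonic/issues/614). Therefore, we introduced some options to let users configure the behavior of the memory pool. See the [option](https://pkg.go.dev/github.com/bytedance/sonic@v1.11.9/option#pkg-variables) package.

### Faster JSON Skip

For safety, when skipping raw JSON, the sonic decoder by default uses the [FSM](native/skip_one.c) algorithm, which scans to skip while also validating the JSON. It is much slower (1~10x) than skipping with the [SIMD-searching-pair](native/skip_one_fast.c) algorithm. If you have many redundant JSON values and do not need to strictly validate JSON correctness, you can enable the following options:

- `Config.NoValidateJSONSkip`: skips JSON faster when decoding, for values such as unknown fields, `json.RawMessage`, mismatched values, and redundant array elements
- `Config.NoValidateJSONMarshaler`: avoids validating JSON when encoding `json.Marshaler`
- `SearchOption.ValidateJSON`: indicates whether to validate the located JSON value on `Get`

## Community

Sonic is a subproject of [CloudWeGo](https://www.cloudwego.io/). We are committed to building a cloud native ecosystem.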
5 changes: 5 additions & 0 deletions api.go
Expand Up @@ -84,6 +84,10 @@ type Config struct {
// NoValidateJSONMarshaler indicates that the encoder should not validate the output string
// after encoding the JSONMarshaler to JSON.
NoValidateJSONMarshaler bool

// NoValidateJSONSkip indicates the decoder should not validate the JSON value when skipping it,
// such as unknown fields, mismatched types, and redundant elements.
NoValidateJSONSkip bool

// NoEncoderNewline indicates that the encoder should not add a newline after every message
NoEncoderNewline bool
Expand All @@ -109,6 +113,7 @@ var (
ConfigFastest = Config{
NoQuoteTextMarshaler: true,
NoValidateJSONMarshaler: true,
NoValidateJSONSkip: true,
}.Froze()
)

Expand Down
2 changes: 2 additions & 0 deletions decoder/decoder_compat.go
Expand Up @@ -42,6 +42,7 @@ const (
_F_use_number = types.B_USE_NUMBER
_F_validate_string = types.B_VALIDATE_STRING
_F_allow_control = types.B_ALLOW_CONTROL
_F_no_validate_json = types.B_NO_VALIDATE_JSON
)

type Options uint64
Expand All @@ -53,6 +54,7 @@ const (
OptionDisableUnknown Options = 1 << _F_disable_unknown
OptionCopyString Options = 1 << _F_copy_string
OptionValidateString Options = 1 << _F_validate_string
OptionNoValidateJSON Options = 1 << _F_no_validate_json
)

func (self *Decoder) SetOptions(opts Options) {
Expand Down
1 change: 1 addition & 0 deletions decoder/decoder_native.go
Expand Up @@ -43,6 +43,7 @@ const (
OptionDisableUnknown Options = api.OptionDisableUnknown
OptionCopyString Options = api.OptionCopyString
OptionValidateString Options = api.OptionValidateString
OptionNoValidateJSON Options = api.OptionNoValidateJSON
)

// StreamDecoder is the decoder context object for streaming input.
Expand Down
88 changes: 81 additions & 7 deletions decoder/decoder_native_test.go
@@ -1,7 +1,6 @@
//go:build (amd64 && go1.17 && !go1.24) || (arm64 && go1.20 && !go1.24)
// +build amd64,go1.17,!go1.24 arm64,go1.20,!go1.24


/*
* Copyright 2021 ByteDance Inc.
*
Expand All @@ -21,15 +20,90 @@
package decoder

import (
`encoding/json`
_`strings`
`testing`
_`reflect`
"encoding/json"
"fmt"
_ "reflect"
"strings"
_ "strings"
"testing"
"time"

`github.com/bytedance/sonic/internal/rt`
`github.com/stretchr/testify/assert`
"github.com/bytedance/sonic/internal/rt"
"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"
)


func BenchmarkSkipValidate(b *testing.B) {
type skiptype struct {
A int `json:"a"` // mismatched
B string `json:"-"` // omitted
C [1]int `json:"c"` // fast int
D struct {} `json:"d"` // empty struct
E map[string]int `json:"e"` // mismatched elem
F json.RawMessage `json:"f"` // unmarshaler
// Unknown
}
type C struct {
name string
json string
expTime float64
}
var sam = map[int]interface{}{}
for i := 0; i < 1; i++ {
sam[i] = _BindingValue
}
comptd, err := json.Marshal(sam)
if err != nil {
b.Fatal("invalid json")
}
compt := string(comptd)
var cases = []C{
{"mismatched", `{"a":`+compt+`}`, 5},
{"ommited", `{"b":`+compt+`}`, 5},
{"number", `{"c":[`+strings.Repeat("-1.23456e-19,", 1000)+`1]}`, 1.5},
{"unknown", `{"unknown":`+compt+`}`, 5},
{"empty", `{"d":`+compt+`}`, 5},
{"mismatched elem", `{"e":`+compt+`}`, 5},
{"unmarshaler", `{"f":`+compt+`}`, 3},
}
_ = NewDecoder(`{}`).Decode(&skiptype{})

var avg1, avg2 time.Duration
for _, c := range cases {
b.Run(c.name, func(b *testing.B) {
b.Run("validate", func(b *testing.B) {
b.ResetTimer()
t1 := time.Now()
for i := 0; i < b.N; i++ {
var obj1 = &skiptype{}
// validate skip
d := NewDecoder(c.json)
_ = d.Decode(obj1)
}
d1 := time.Since(t1)
avg1 = d1/time.Duration(b.N)
})
b.Run("fast", func(b *testing.B) {
b.ResetTimer()
t2 := time.Now()
for i := 0; i < b.N; i++ {
var obj2 = &skiptype{}
// fast skip
d := NewDecoder(c.json)
d.SetOptions(OptionNoValidateJSON)
_ = d.Decode(obj2)
}
d2 := time.Since(t2)
avg2 = d2/time.Duration(b.N)
})
// fast skip must be expTime x faster
require.True(b, float64(avg1)/float64(avg2) > c.expTime, fmt.Sprintf("%v/%v=%v", avg1, avg2, float64(avg1)/float64(avg2)))
})
}
}


func TestSkipMismatchTypeAmd64Error(t *testing.T) {
// t.Run("struct", func(t *testing.T) {
// println("TestSkipError")
Expand Down
21 changes: 10 additions & 11 deletions decoder/decoder_test.go
Expand Up @@ -17,16 +17,16 @@
package decoder

import (
`encoding/json`
`runtime`
`runtime/debug`
`strings`
`sync`
`testing`
`time`

`github.com/stretchr/testify/assert`
`github.com/stretchr/testify/require`
"encoding/json"
"runtime"
"runtime/debug"
"strings"
"sync"
"testing"
"time"

"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"
)

func TestMain(m *testing.M) {
Expand Down Expand Up @@ -85,7 +85,6 @@ func init() {
_ = json.Unmarshal([]byte(TwitterJson), &_BindingValue)
}


func TestSkipMismatchTypeError(t *testing.T) {
t.Run("struct", func(t *testing.T) {
println("TestSkipError")
Expand Down
1 change: 1 addition & 0 deletions internal/decoder/api/decoder.go
Expand Up @@ -44,6 +44,7 @@ const (
OptionDisableUnknown = consts.OptionDisableUnknown
OptionCopyString = consts.OptionCopyString
OptionValidateString = consts.OptionValidateString
OptionNoValidateJSON = consts.OptionNoValidateJSON
)

type (
Expand Down
3 changes: 3 additions & 0 deletions internal/decoder/consts/option.go
Expand Up @@ -12,9 +12,11 @@ const (
F_disable_unknown = 3
F_copy_string = 4


F_use_number = types.B_USE_NUMBER
F_validate_string = types.B_VALIDATE_STRING
F_allow_control = types.B_ALLOW_CONTROL
F_no_validate_json = types.B_NO_VALIDATE_JSON
)

type Options uint64
Expand All @@ -26,6 +28,7 @@ const (
OptionDisableUnknown Options = 1 << F_disable_unknown
OptionCopyString Options = 1 << F_copy_string
OptionValidateString Options = 1 << F_validate_string
OptionNoValidateJSON Options = 1 << F_no_validate_json
)

const (
Expand Down
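
The option constants in these files follow the usual bit-flag pattern: each option is `1 << position`, so several can be combined in one `uint64`. A minimal standalone sketch of that pattern follows; the bit positions and the `Has` method are hypothetical and chosen for illustration, not taken from sonic.

```go
package main

import "fmt"

// Options is a bit set; each option occupies one bit position.
type Options uint64

// Hypothetical bit positions, NOT sonic's real values.
const (
	bitCopyString = iota
	bitValidateString
	bitNoValidateJSON
)

const (
	OptionCopyString     Options = 1 << bitCopyString
	OptionValidateString Options = 1 << bitValidateString
	OptionNoValidateJSON Options = 1 << bitNoValidateJSON
)

// Has reports whether every bit in flag is set in o.
func (o Options) Has(flag Options) bool { return o&flag == flag }

func main() {
	opts := OptionCopyString | OptionNoValidateJSON

	fmt.Println(opts.Has(OptionNoValidateJSON)) // true
	fmt.Println(opts.Has(OptionValidateString)) // false

	opts &^= OptionNoValidateJSON               // clear the flag again
	fmt.Println(opts.Has(OptionNoValidateJSON)) // false
}
```

Adding a new option in this scheme only needs a fresh bit position and one constant, which matches the one-line additions in each file of this change.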