Error when fetching docs with a json field defined in the doc mapper #1411

Closed
fmassot opened this issue May 10, 2022 · 1 comment · Fixed by #1415
fmassot commented May 10, 2022

Index config

#
# Index config file for receiving logs in OpenTelemetry format.
# Link: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/logs/data-model.md
#

version: 0

index_id: otel-logs

doc_mapping:
  field_mappings:
    - name: timestamp
      type: i64
      fast: true
    - name: name
      type: text
      tokenizer: default
    - name: severity
      type: text
      tokenizer: raw
    - name: body
      type: text
      tokenizer: default
      record: position
    - name: attributes
      type: json

indexing_settings:
  timestamp_field: timestamp

search_settings:
  default_search_fields: [severity, body]

Create & Ingest some data

cargo r index create --index-config config/tutorials/otel-logs/index-config.yaml --config config/quickwit.yaml
cargo r index ingest --index otel-logs --config config/quickwit.yaml --input-path documents.json

with documents.json:

{"attributes":{"syslog":{"facility":"daemon","procid":7816,"version":2}},"body":"<26>2 2022-05-10T17:59:56.706Z for.us benefritz 7816 ID715 - A bug was encountered but not in Vector, which doesn't have bugs","name":"ID715","resource":{"host":{"hostname":"for.us"},"service":{"name":"benefritz"},"source_type":"demo_logs"},"severity":"ERROR","timestamp":1652205596}
{"attributes":{"syslog":{"facility":"daemon","procid":7816,"version":2}},"body":"<26>2 2022-05-10T17:59:56.706Z for.us benefritz 7816 ID715 - A bug was encountered but not in Vector, which doesn't have bugs","name":"ID715","resource":{"host":{"hostname":"for.us"},"service":{"name":"benefritz"},"source_type":"demo_logs"},"severity":"ERROR","timestamp":1652205596}
{"attributes":{"syslog":{"facility":"daemon","procid":7816,"version":2}},"body":"<26>2 2022-05-10T17:59:56.706Z for.us benefritz 7816 ID715 - A bug was encountered but not in Vector, which doesn't have bugs","name":"ID715","resource":{"host":{"hostname":"for.us"},"service":{"name":"benefritz"},"source_type":"demo_logs"},"severity":"ERROR","timestamp":1652205596}

Failed to search some docs

cargo r index search --index otel-logs --config ./config/quickwit.yaml --query "ERROR"

...

2022-05-10T20:09:59.490Z ERROR quickwit_search::fetch_docs: Error when fetching docs in splits. split_ids=["01G2QRHZ9AHCE25BYCZ8EAN2KW"] error=searcher-doc-async

Caused by:
    0: An IO error occurred: 'trailing characters at line 1 column 59'
    1: trailing characters at line 1 column 59

Investigation

Where is the error raised?

The error is raised when tantivy tries to deserialize the JSON value:

Ok(Value::JsonObject(serde_json::from_reader(reader)?))
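A quick check on the column number reported in the error. This is only a sketch: it assumes the bytes stored for `attributes` are the compact serialization of the object from documents.json, with keys in the order shown there, which has not been confirmed against the doc store. `ATTRIBUTES_JSON` and `first_column_after` are names made up for this check.

```rust
// The `attributes` value from documents.json, written compactly.
// Assumption: the doc store holds exactly these bytes for the field.
const ATTRIBUTES_JSON: &str =
    r#"{"syslog":{"facility":"daemon","procid":7816,"version":2}}"#;

/// 1-based column of the first byte that follows the JSON object.
fn first_column_after(json: &str) -> usize {
    json.len() + 1
}

fn main() {
    println!("object length in bytes: {}", ATTRIBUTES_JSON.len());
    println!(
        "first column after the object: {}",
        first_column_after(ATTRIBUTES_JSON)
    );
}
```

Under that assumption the object is 58 bytes long, so column 59 is the first byte after its closing brace. That is consistent with the parser finishing the object cleanly and then finding more bytes behind it.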

The error does not occur when the schema has a single field and that field is of JSON type.

I tried simplifying the index config, and search works, for example, with the following doc mapping:

version: 0

index_id: otel-logs

doc_mapping:
  field_mappings:
    - name: attributes
      type: json

Strangely, it keeps working if you add the field "name", but it breaks if you add the field "severity".

Something I don't understand about the serialization/deserialization of tantivy FieldValue

I had a look at how tantivy serializes/deserializes the document FieldValue. One strange thing is that we don't write the byte length of the JSON. This means we expect serde_json to read bytes until the end of the input... This works if the JSON field is the last one to read, but if other fields follow the JSON, it breaks.

I'm almost sure my reasoning is wrong somewhere. @fulmicoton where is the flaw in my reasoning? :)
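To make the suspected failure mode concrete, here is a minimal std-only sketch. It is not tantivy's actual format or code: `parse_whole_input` is a toy stand-in for a parser that, like `serde_json::from_reader`, rejects non-whitespace bytes left after the JSON value, and `write_unframed`, `write_framed`, `read_framed` and the u32 little-endian length prefix are hypothetical choices made for illustration. The "ERROR" payload just stands for whatever field value happens to follow the JSON.

```rust
use std::io::{self, Cursor, Read};

const ATTRIBUTES: &str =
    r#"{"syslog":{"facility":"daemon","procid":7816,"version":2}}"#;

/// Toy parser: accepts one JSON object and rejects anything but whitespace
/// after it. Brace counting only; braces inside strings are not handled.
fn parse_whole_input(input: &[u8]) -> Result<String, String> {
    let mut depth = 0usize;
    for (i, &b) in input.iter().enumerate() {
        match b {
            b'{' => depth += 1,
            b'}' if depth > 0 => {
                depth -= 1;
                if depth == 0 {
                    let rest = &input[i + 1..];
                    if rest.iter().any(|c| !c.is_ascii_whitespace()) {
                        return Err(format!("trailing characters after byte {}", i + 1));
                    }
                    return Ok(String::from_utf8_lossy(&input[..=i]).into_owned());
                }
            }
            _ => {}
        }
    }
    Err("no complete JSON object found".to_string())
}

/// Lays out the JSON bytes followed directly by the next field, with no length marker.
fn write_unframed(json: &str, next_field: &str) -> Vec<u8> {
    [json.as_bytes(), next_field.as_bytes()].concat()
}

/// Prefixes each payload with its byte length as a little-endian u32.
fn write_framed(payloads: &[&str]) -> Vec<u8> {
    let mut buf = Vec::new();
    for payload in payloads {
        buf.extend_from_slice(&(payload.len() as u32).to_le_bytes());
        buf.extend_from_slice(payload.as_bytes());
    }
    buf
}

/// Reads one length-prefixed payload, leaving the reader at the start of the next one.
fn read_framed<R: Read>(reader: &mut R) -> io::Result<Vec<u8>> {
    let mut len_bytes = [0u8; 4];
    reader.read_exact(&mut len_bytes)?;
    let len = u64::from(u32::from_le_bytes(len_bytes));
    let mut payload = Vec::new();
    // `take` caps the reader so a read-to-the-end parser cannot run into the next field.
    reader.by_ref().take(len).read_to_end(&mut payload)?;
    Ok(payload)
}

fn main() -> io::Result<()> {
    // Without framing, the parser sees the next field's bytes as trailing characters.
    let unframed = write_unframed(ATTRIBUTES, "ERROR");
    println!("unframed: {:?}", parse_whole_input(&unframed));

    // With framing, each field is parsed from its own bounded slice.
    let mut cursor = Cursor::new(write_framed(&[ATTRIBUTES, "ERROR"]));
    let json_bytes = read_framed(&mut cursor)?;
    let next_field = read_framed(&mut cursor)?;
    println!("framed: {:?}", parse_whole_input(&json_bytes));
    println!("next field: {}", String::from_utf8_lossy(&next_field));
    Ok(())
}
```

If the hypothesis holds, handing the deserializer a reader bounded to the JSON's byte length (or parsing from a slice of that length) would avoid the error regardless of where the JSON field sits among the stored fields.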

@fmassot fmassot added the bug Something isn't working label May 10, 2022
@fulmicoton fulmicoton self-assigned this May 11, 2022
@fulmicoton

Tracked in tantivy: tantivy#1366
