[Idea] Introduce new compute role / phase (aka: dynamic fields) #1912

Open
nknize opened this issue Jan 14, 2022 · 3 comments
Labels
discuss Issues intended to help drive brainstorming and decision making feature New feature or request Search:Aggregations

Comments

@nknize
Collaborator

nknize commented Jan 14, 2022

Dynamic fields (e.g., runtime fields) are useful for many applications: search-time transforms, joins, scripted field search, etc.

One way to achieve this is to build in-memory doc values at search time, which is what the FieldData implementation did before Lucene doc values were adopted. That approach is memory intensive and slow. Some users accept the performance penalty in exchange for flexibility. I postulate that "make it slow" should not be the price of flexibility when it can be avoided, even when a mechanism such as the async API is available.

I'd like to propose, and brainstorm, using IndexWriter with an fsync-free implementation of Lucene commit to write local segments to a temporary directory. These segments would hold an on-disk index or doc values representation of the search results.

In this manner the segments look just like a persisted index, with one difference: by default they are intended to be "volatile" and short-lived. The benefit is a reduced runtime view of the global index (based on a user-defined query) that can be further inspected or joined with any additional query or aggregation. Users can also choose to promote the volatile compute segments to a new index, giving an SQL-like ability to persist views as new indexes.
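To make the idea concrete, here is a toy model in plain Python. It is not Lucene and does not use IndexWriter; the "index" is a list of dicts, a "segment" is a temporary directory with one JSON file per field, and every name in it (`GLOBAL_INDEX`, `write_volatile_segment`, `read_column`, `promote`) is hypothetical. It only sketches the flow described above: materialize the query results without fsync, run a second-stage aggregation on the reduced view, and optionally promote it.

```python
import json
import shutil
import tempfile
from collections import Counter
from pathlib import Path

# Toy stand-in for the global index (hypothetical data).
GLOBAL_INDEX = [
    {"id": 1, "ip": "10.0.0.1", "status": "404"},
    {"id": 2, "ip": "10.0.0.2", "status": "200"},
    {"id": 3, "ip": "10.0.0.1", "status": "404"},
    {"id": 4, "ip": "10.0.0.3", "status": "404"},
]


def write_volatile_segment(docs, predicate, fields):
    """Write the docs matching `predicate` to a temp directory, one file per field.

    Nothing is fsynced: the segment is meant to be short-lived, so the cost of
    durability is deliberately not paid.
    """
    segment_dir = Path(tempfile.mkdtemp(prefix="volatile-segment-"))
    matching = [doc for doc in docs if predicate(doc)]
    for field in fields:
        column = [doc[field] for doc in matching]
        # Plain buffered write, no os.fsync().
        (segment_dir / f"{field}.json").write_text(json.dumps(column))
    return segment_dir


def read_column(segment_dir, field):
    """Read one field's column back from a segment directory."""
    return json.loads((Path(segment_dir) / f"{field}.json").read_text())


def promote(segment_dir, target_dir):
    """Copy a volatile segment to a longer-lived location (target must not exist).

    A real promotion would also need to make the files durable; this toy does not.
    """
    shutil.copytree(segment_dir, target_dir)
    return Path(target_dir)


# Stage 1: materialize the reduced view for a user-defined query.
segment = write_volatile_segment(
    GLOBAL_INDEX, lambda doc: doc["status"] == "404", ["id", "ip"]
)

# Stage 2: aggregate over the reduced view instead of the global index.
counts_by_ip = Counter(read_column(segment, "ip"))
print(counts_by_ip)

# Volatile by default: discard the segment once the request is done.
shutil.rmtree(segment)
```

The second stage only touches the columns written for the matching documents, which is the "reduced, runtime view" in miniature.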

to be continued...

@nknize nknize added discuss Issues intended to help drive brainstorming and decision making feature New feature or request Indexing & Search Search:Aggregations labels Jan 14, 2022
@penghuo
Contributor

penghuo commented Jan 19, 2022

One use case to add comes from the PPL parse command. For example, the parse command extracts field:value pairs from the raw_log field in raw_index, and the extracted fields are then used for filtering or aggregation.

source = raw_index | parse raw_log "[timestamp] [ip] [status]" | filter status="404" | stats count() by ip

In our current design, we rewrite the parse command as a customized Script, so the query is rewritten into the following execution plan (not the actual plan, for explanation only). One major concern is that raw_log is parsed twice, once at query time and once at aggregation time. If I understand correctly, we could define timestamp, ip, and status as dynamic fields, which could then be used at both query and aggregation time.

{
  "query": {
    "bool": {
      "filter": {
        "script": {
          "script": "return parse(raw_log, '[timestamp] [ip] [status]', status) == 404;"
        }
      }
    }
  },
  "aggs": {
    "genres": {
      "terms": {
        "script": {
          "source": "parse(raw_log, '[timestamp] [ip] [status]', ip)"
        }
      }
    }
  }
}
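The double-parsing concern can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the log layout (`<timestamp> <ip> <status>` separated by single spaces), the sample documents, and the helper names `parse`, `scripted_plan`, and `dynamic_field_plan`. It is not how PPL or the script engine is implemented; it only counts how often raw_log gets parsed under each approach.

```python
import re
from collections import Counter

# Assumed log layout: "<timestamp> <ip> <status>", separated by single spaces.
LOG_PATTERN = re.compile(r"(?P<timestamp>\S+) (?P<ip>\S+) (?P<status>\S+)")

parse_calls = 0  # counts how many times raw_log is parsed


def parse(raw_log):
    """Extract timestamp, ip and status from one raw log line."""
    global parse_calls
    parse_calls += 1
    match = LOG_PATTERN.fullmatch(raw_log)
    return match.groupdict() if match else {}


# Hypothetical contents of raw_index.
raw_index = [
    {"raw_log": "2022-01-19T10:00:00Z 10.0.0.1 404"},
    {"raw_log": "2022-01-19T10:00:01Z 10.0.0.2 200"},
    {"raw_log": "2022-01-19T10:00:02Z 10.0.0.1 404"},
    {"raw_log": "2022-01-19T10:00:03Z 10.0.0.3 404"},
]


def scripted_plan(docs):
    """Mirrors the plan above: the filter and the aggregation each parse raw_log."""
    matching = [d for d in docs if parse(d["raw_log"]).get("status") == "404"]
    return Counter(parse(d["raw_log"])["ip"] for d in matching)


def dynamic_field_plan(docs):
    """Parse once per document, then filter and aggregate on the extracted fields."""
    extracted = [parse(d["raw_log"]) for d in docs]
    return Counter(f["ip"] for f in extracted if f.get("status") == "404")


parse_calls = 0
scripted = scripted_plan(raw_index)
scripted_calls = parse_calls

parse_calls = 0
dynamic = dynamic_field_plan(raw_index)
dynamic_calls = parse_calls

print(scripted == dynamic, scripted_calls, dynamic_calls)
```

Both plans return the same counts per ip, but the scripted plan parses every document for the filter and every matching document again for the aggregation, while the dynamic-field plan parses each document once.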

@Bukhtawar
Collaborator

Is this similar to #1133 @nknize ?

@nknize
Collaborator Author

nknize commented Jan 21, 2022

> Is this similar to #1133

"Schema on read" is one use case for dynamic fields. This proposal is a mechanism for achieving "schema on read", along with query-time enrichment, joins, etc.
