
[FEATURE] PPL Observability Solution using S3 #595

Open
joshuali925 opened this issue May 2, 2022 · 0 comments
Labels
enhancement New feature or request
Overview

OpenSearch is a common choice for storing application logs. While the inverted index provides full-text search capabilities that help with log searching, using OpenSearch as an observability solution comes with a few drawbacks:

  • the inverted index grows quickly in proportion to the raw data, adding a large storage overhead
  • in the observability area, it can be unnecessary to index fields of every document, because users usually focus on a short time range of data
  • metrics are usually not stored in OpenSearch alongside logs; this makes it difficult to correlate logs with metrics, and the user experience would be inconsistent if other products are used for metrics

This document explores an alternative approach to log storage using S3 to address these issues and bring down cost.

Data flow

[Data flow diagram]

Indexing

  1. the standalone PPL library has log patterns configured by the user or extracted from existing logs
  2. the ingester receives raw logs from the application/collector
  3. the ingester uses the PPL library to process raw logs and sends derived metrics to the OpenSearch metrics index
  4. the ingester compresses logs after certain conditions are met (time, log size) and sends them to S3
  5. the ingester sends log chunk metadata to the OpenSearch S3 metadata index
  6. Dashboards Observability displays visualizations by querying metrics using the PPL plugin in OpenSearch
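Steps 4-5 above can be sketched as a small buffer that rolls raw lines into compressed chunks. The class name, thresholds, and method names here are hypothetical; the actual S3 upload and metadata write are omitted:

```python
import gzip
import time

class ChunkBuffer:
    """Illustrative sketch only: accumulate raw log lines until a size
    or time threshold is met, then compress them into one S3-ready log
    chunk. Uploading and metadata indexing are out of scope here."""

    def __init__(self, max_bytes=25 * 1024 * 1024, max_age_seconds=3600):
        self.max_bytes = max_bytes
        self.max_age = max_age_seconds
        self.lines, self.size, self.opened_at = [], 0, None

    def add(self, line, now=None):
        """Buffer one raw log line; return a compressed chunk (bytes)
        when a size/time condition triggers a flush, else None."""
        now = time.time() if now is None else now
        if self.opened_at is None:
            self.opened_at = now
        self.lines.append(line)
        self.size += len(line)
        if self.size >= self.max_bytes or now - self.opened_at >= self.max_age:
            return self.flush()
        return None

    def flush(self):
        """Compress buffered lines into a single gzip object body."""
        body = gzip.compress("\n".join(self.lines).encode())
        self.lines, self.size, self.opened_at = [], 0, None
        return body
```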

User sample workflow

  1. user notices spikes in metrics in Dashboards Observability
  2. Observability uses the S3 metadata index to locate the S3 objects that contain logs from when the spike happened
  3. user queries the S3 metadata index with a parse pattern and filters
  4. PPL pulls the objects from S3 and performs the parse with filters
  5. user identifies the root cause of the metric spike in the returned logs

Functional requirements

  1. User should be able to define log metrics patterns in PPL library
  2. User should be able to configure PPL library to connect to S3 bucket and OpenSearch endpoint
  3. User should be able to integrate PPL library with existing ingestion solutions
  4. User should be able to view metrics in Dashboards Observability and tail corresponding logs in S3 using PPL
  5. User should be able to use regular PPL commands on S3 results

Non-functional requirements

  1. Ingesting to S3 should be more CPU- and memory-efficient than ingesting to OpenSearch
  2. Latency between input and output (S3) will exist but should be small

Non-goals

  1. Full push-down to S3 Select (needs evaluation)

Terms

Log chunk: S3 object size does not impact the performance of queries with the same LIMIT, but it does impact pagination performance. As a result, logs will be divided by a fixed time period (e.g. one hour) and a maximum file size (e.g. 25 MB compressed). Each compressed S3 object is a log chunk. Chunks cannot be too small; otherwise the compression ratio decreases and the overhead of retrieving objects increases.
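The partitioning implied by the sample URI below (a YYYY/MM/DD path, the hour the chunk covers, then a per-hour sequence number) can be sketched as follows; the helper name is hypothetical and the naming scheme is inferred from the example, not specified:

```python
from datetime import datetime, timezone

def chunk_key(prefix, first_ts, seq):
    """Build an S3 object key for a log chunk: a YYYY/MM/DD path, the
    hour the chunk covers, and a per-hour sequence number. Inferred
    from the sample metadata URI; not part of the proposal itself."""
    return f"{first_ts:%Y/%m/%d}/{prefix}.{first_ts.hour}.{seq}.log.gz"

# second apache-logs chunk for the 17:00-18:00 hour on 2022-04-04
key = chunk_key("apache-logs",
                datetime(2022, 4, 4, 17, 42, 57, tzinfo=timezone.utc), 2)
```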

S3 metadata index: each log chunk corresponds to a document in the S3 metadata index on OpenSearch, containing the S3 object URI and the start and end timestamps of the logs in the chunk.

// metadata example
"_source" : {
    "meta" : {
        "type" : "s3",
        // second log chunk for apache logs between 5 PM and 6 PM, containing logs from
        // 2022-04-04 17:42:57 to 2022-04-04 17:59:59
        "uri" : "sample-s3-ppl-logs-bucket/2022/04/04/apache-logs.17.2.log.gz",
        "startTime" : "2022-04-04T17:42:57.754Z",
        "endTime" : "2022-04-04T17:59:59.185Z"
    }
}

Implementation

How to query by time range

Each document has startTime and endTime; a query to get all S3 objects overlapping a given time range (e.g. 2022-04-04 17:11:00 to 2022-04-04 19:43:00) would be

... | where 
   `startTime` <= '2022-04-04 17:11:00' and `endTime` >= '2022-04-04 17:11:00'
or `startTime` <= '2022-04-04 19:43:00' and `endTime` >= '2022-04-04 19:43:00'
or `startTime` >= '2022-04-04 17:11:00' and `endTime` <= '2022-04-04 19:43:00'

A sample response could include these objects

2022/04/04/apache-logs.17.1.log.gz
2022/04/04/apache-logs.17.2.log.gz
2022/04/04/apache-logs.18.1.log.gz
2022/04/04/apache-logs.19.1.log.gz
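The three OR-ed clauses in the query above are together equivalent to the standard interval-overlap test: a chunk matches iff it starts no later than the range end and ends no earlier than the range start. A quick sketch (function name is illustrative):

```python
from datetime import datetime

def chunk_overlaps(start, end, range_start, range_end):
    """Equivalent to the three-clause PPL filter: the three cases
    (chunk contains range start, chunk contains range end, chunk fully
    inside range) collapse into one interval-overlap condition."""
    return start <= range_end and end >= range_start

rs, re_ = datetime(2022, 4, 4, 17, 11), datetime(2022, 4, 4, 19, 43)
# the 17:42:57-17:59:59 chunk from the metadata example matches
matches = chunk_overlaps(datetime(2022, 4, 4, 17, 42, 57),
                         datetime(2022, 4, 4, 17, 59, 59), rs, re_)
```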

To exclude logs from 17:00:00 to 17:11:00 and from 19:43:00 to 20:00:00, pagination and additional metadata would be needed. One implementation could store the latest log line number after every fixed interval in the object metadata. For example, with a fixed interval of 10 minutes, the metadata of apache-logs.17.1.log.gz could have

"offset": [10318, 19908, 30631, 40710]
// 17:00:00 to 17:10:00 corresponds to log lines 0 to 10318 
// 17:10:00 to 17:20:00 corresponds to log lines 10319 to 19908
// ...

17:11:00 rounds down to 17:10:00, and PPL will use pagination to skip to the recorded offset: ... | head 10000 from 10318
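That offset lookup can be sketched as a small helper, assuming the 10-minute interval and the offset array from the example (the function name is hypothetical):

```python
from datetime import datetime

def skip_offset(offsets, chunk_start, query_start, interval_minutes=10):
    """Round the query start time down to its interval bucket and
    return the line number recorded at the previous bucket boundary,
    i.e. how many lines PPL can skip before parsing."""
    minutes = int((query_start - chunk_start).total_seconds() // 60)
    bucket = minutes // interval_minutes  # 17:11 -> bucket 1 (17:10)
    return 0 if bucket == 0 else offsets[bucket - 1]

offsets = [10318, 19908, 30631, 40710]
# query starting at 17:11 rounds down to 17:10, skipping 10318 lines
skip = skip_offset(offsets, datetime(2022, 4, 4, 17, 0),
                   datetime(2022, 4, 4, 17, 11))
```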

How to configure metrics from logs

The PPL library will use an expression to extract fields from logs and run an aggregation query to derive metrics after every fixed interval.
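A minimal sketch of that idea, assuming an Apache-style access log and a request-count metric bucketed per minute; the parse expression, metric, and function name are illustrative, not part of the proposal:

```python
import re
from collections import Counter

# Illustrative parse expression: extract a timestamp and an HTTP
# status code from each raw log line.
PATTERN = re.compile(r'(?P<ts>\S+) .* (?P<status>\d{3})$')

def derive_metrics(lines):
    """Aggregate parsed fields over a fixed interval: count requests
    per (minute, status code). Non-matching lines are skipped."""
    counts = Counter()
    for line in lines:
        m = PATTERN.match(line)
        if m:
            # truncate the ISO timestamp to the minute to bucket it
            minute = m.group("ts")[:16]
            counts[(minute, m.group("status"))] += 1
    return counts
```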
