
[FEATURE] PPL Observability Solution using S3 #595

Open
joshuali925 opened this issue May 2, 2022 · 0 comments
Labels
enhancement New feature or request
Overview

OpenSearch is a common choice for storing application logs. While the inverted index provides full-text search capabilities that help with log searching, using OpenSearch as an observability solution comes with a few drawbacks:

  • the inverted index grows quickly in proportion to the raw data, adding a large storage overhead
  • in the observability area, it can be unnecessary to index fields of every document, because users usually focus on a short time range of data
  • metrics are usually not stored in OpenSearch alongside logs; this makes it difficult to correlate logs with metrics, and the user experience would be inconsistent if other products are used for metrics

This document explores an alternative approach to log storage using S3 to address these issues and bring down cost.

Data flow

[Data flow diagram]

Indexing

  1. the standalone PPL library has log patterns configured by the user or extracted from existing logs
  2. the ingester receives raw logs from the application/collector
  3. the ingester uses the PPL library to process raw logs and sends derived metrics to the OpenSearch metrics index
  4. the ingester compresses logs after certain conditions are met (time, log size) and sends them to S3
  5. the ingester sends log chunk metadata to the OpenSearch S3 metadata index
  6. Dashboards Observability displays visualizations by querying metrics using the PPL plugin in OpenSearch
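Steps 4-5 above can be sketched as a small buffer that rolls raw lines into compressed chunks. The class name, thresholds, and method names here are hypothetical; the actual S3 upload and metadata write are omitted:

```python
import gzip
import time

class ChunkBuffer:
    """Illustrative sketch only: accumulate raw log lines until a size
    or time threshold is met, then compress them into one S3-ready log
    chunk. Uploading and metadata indexing are out of scope here."""

    def __init__(self, max_bytes=25 * 1024 * 1024, max_age_seconds=3600):
        self.max_bytes = max_bytes
        self.max_age = max_age_seconds
        self.lines, self.size, self.opened_at = [], 0, None

    def add(self, line, now=None):
        """Buffer one raw log line; return a compressed chunk (bytes)
        when a size/time condition triggers a flush, else None."""
        now = time.time() if now is None else now
        if self.opened_at is None:
            self.opened_at = now
        self.lines.append(line)
        self.size += len(line)
        if self.size >= self.max_bytes or now - self.opened_at >= self.max_age:
            return self.flush()
        return None

    def flush(self):
        """Compress buffered lines into a single gzip object body."""
        body = gzip.compress("\n".join(self.lines).encode())
        self.lines, self.size, self.opened_at = [], 0, None
        return body
```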

User sample workflow

  1. user notices spikes in metrics in Dashboards Observability
  2. Observability uses the S3 metadata index to locate the S3 objects that contain logs from when the spike happened
  3. user queries the S3 metadata index with a parse pattern and filters
  4. PPL pulls the objects from S3 and performs the parse with filters
  5. user identifies the root cause of the metric spike in the returned logs

Functional requirements

  1. User should be able to define log metrics patterns in PPL library
  2. User should be able to configure PPL library to connect to S3 bucket and OpenSearch endpoint
  3. User should be able to integrate PPL library with existing ingestion solutions
  4. User should be able to view metrics in Dashboards Observability and tail corresponding logs in S3 using PPL
  5. User should be able to use regular PPL commands on S3 results

Non-functional requirements

  1. Ingesting to S3 should be more CPU- and memory-efficient than ingesting to OpenSearch
  2. Latency between input and output (S3) will exist but should be small

Non-goals

  1. Full push-down to S3 Select (needs evaluation)

Terms

Log chunk: S3 object size does not impact the performance of queries with the same LIMIT, but it does impact pagination performance. As a result, logs will be divided by a fixed time period (e.g. one hour) and a maximum file size (e.g. 25 MB compressed). Each compressed S3 object is a log chunk. Chunks cannot be too small; otherwise the compression ratio decreases and the overhead of retrieving objects increases.
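The partitioning implied by the sample URI below (a YYYY/MM/DD path, the hour the chunk covers, then a per-hour sequence number) can be sketched as follows; the helper name is hypothetical and the naming scheme is inferred from the example, not specified:

```python
from datetime import datetime, timezone

def chunk_key(prefix, first_ts, seq):
    """Build an S3 object key for a log chunk: a YYYY/MM/DD path, the
    hour the chunk covers, and a per-hour sequence number. Inferred
    from the sample metadata URI; not part of the proposal itself."""
    return f"{first_ts:%Y/%m/%d}/{prefix}.{first_ts.hour}.{seq}.log.gz"

# second apache-logs chunk for the 17:00-18:00 hour on 2022-04-04
key = chunk_key("apache-logs",
                datetime(2022, 4, 4, 17, 42, 57, tzinfo=timezone.utc), 2)
```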

S3 metadata index: each log chunk corresponds to a document in the S3 metadata index on OpenSearch, containing the S3 object URI and the start and end timestamps of the logs in the chunk.

// metadata example
"_source" : {
    "meta" : {
        "type" : "s3",
        // second log chunk for apache logs between 5 PM and 6 PM, containing logs from
        // 2022-04-04 17:42:57 to 2022-04-04 17:59:59
        "uri" : "sample-s3-ppl-logs-bucket/2022/04/04/apache-logs.17.2.log.gz",
        "startTime" : "2022-04-04T17:42:57.754Z",
        "endTime" : "2022-04-04T17:59:59.185Z"
    }
}

Implementation

How to query by time range

Each document has startTime and endTime; a query to get all S3 objects overlapping a given time range (e.g. 2022-04-04 17:11:00 to 2022-04-04 19:43:00) would be

... | where 
   `startTime` <= '2022-04-04 17:11:00' and `endTime` >= '2022-04-04 17:11:00'
or `startTime` <= '2022-04-04 19:43:00' and `endTime` >= '2022-04-04 19:43:00'
or `startTime` >= '2022-04-04 17:11:00' and `endTime` <= '2022-04-04 19:43:00'

A sample response could include these objects

2022/04/04/apache-logs.17.1.log.gz
2022/04/04/apache-logs.17.2.log.gz
2022/04/04/apache-logs.18.1.log.gz
2022/04/04/apache-logs.19.1.log.gz
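The three OR-ed clauses in the query above are together equivalent to the standard interval-overlap test: a chunk matches iff it starts no later than the range end and ends no earlier than the range start. A quick sketch (function name is illustrative):

```python
from datetime import datetime

def chunk_overlaps(start, end, range_start, range_end):
    """Equivalent to the three-clause PPL filter: the three cases
    (chunk contains range start, chunk contains range end, chunk fully
    inside range) collapse into one interval-overlap condition."""
    return start <= range_end and end >= range_start

rs, re_ = datetime(2022, 4, 4, 17, 11), datetime(2022, 4, 4, 19, 43)
# the 17:42:57-17:59:59 chunk from the metadata example matches
matches = chunk_overlaps(datetime(2022, 4, 4, 17, 42, 57),
                         datetime(2022, 4, 4, 17, 59, 59), rs, re_)
```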

To exclude logs from 17:00:00 to 17:11:00 and from 19:43:00 to 20:00:00, pagination and additional metadata would be needed. One implementation could store the latest log line number after every fixed interval in the object metadata. For example, with a fixed interval of 10 minutes, the metadata of apache-logs.17.1.log.gz could have

"offset": [10318, 19908, 30631, 40710]
// 17:00:00 to 17:10:00 corresponds to log lines 0 to 10318 
// 17:10:00 to 17:20:00 corresponds to log lines 10319 to 19908
// ...

17:11:00 rounds down to 17:10:00, and PPL will use pagination to skip to the recorded offset: ... | head 10000 from 10318
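That offset lookup can be sketched as a small helper, assuming the 10-minute interval and the offset array from the example (the function name is hypothetical):

```python
from datetime import datetime

def skip_offset(offsets, chunk_start, query_start, interval_minutes=10):
    """Round the query start time down to its interval bucket and
    return the line number recorded at the previous bucket boundary,
    i.e. how many lines PPL can skip before parsing."""
    minutes = int((query_start - chunk_start).total_seconds() // 60)
    bucket = minutes // interval_minutes  # 17:11 -> bucket 1 (17:10)
    return 0 if bucket == 0 else offsets[bucket - 1]

offsets = [10318, 19908, 30631, 40710]
# query starting at 17:11 rounds down to 17:10, skipping 10318 lines
skip = skip_offset(offsets, datetime(2022, 4, 4, 17, 0),
                   datetime(2022, 4, 4, 17, 11))
```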

How to configure metrics from logs

The PPL library will use an expression to extract fields from logs and run an aggregation query to derive metrics after every fixed interval.
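A minimal sketch of that idea, assuming an Apache-style access log and a request-count metric bucketed per minute; the parse expression, metric, and function name are illustrative, not part of the proposal:

```python
import re
from collections import Counter

# Illustrative parse expression: extract a timestamp and an HTTP
# status code from each raw log line.
PATTERN = re.compile(r'(?P<ts>\S+) .* (?P<status>\d{3})$')

def derive_metrics(lines):
    """Aggregate parsed fields over a fixed interval: count requests
    per (minute, status code). Non-matching lines are skipped."""
    counts = Counter()
    for line in lines:
        m = PATTERN.match(line)
        if m:
            # truncate the ISO timestamp to the minute to bucket it
            minute = m.group("ts")[:16]
            counts[(minute, m.group("status"))] += 1
    return counts
```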
