Skip to content

Latest commit

 

History

History
74 lines (53 loc) · 2.13 KB

HACKING.adoc

File metadata and controls

74 lines (53 loc) · 2.13 KB

Hacking on kafka-delta-ingest

kafka-delta-ingest is designed to work well with AWS and Azure, which can add some complexity to the development environment. This document outlines how to work with kafka-delta-ingest locally for new and existing Rust developers.

Developing

Make sure the docker-compose setup has been ran, and execute cargo test to run unit and integration tests.

Environment Variables

Name Default Notes

KAFKA_BROKERS

0.0.0.0:9092

A kafka broker string which can be used during integration testing

AWS_ENDPOINT_URL

http://0.0.0.0:4056

AWS endpoint URL for something that can provide stub S3 and DynamoDB operations (e.g. Localstack)

AWS_S3_BUCKET

tests

Bucket to use for test data at the given endpoint

Example Data

A tarball containing 100K line-delimited JSON messages is included in tests/json/web_requests-100K.json.tar.gz. Running ./bin/extract-example-json.sh will unpack this to the expected location.

Pretty-printed example from the file
{
  "meta": {
    "producer": {
      "timestamp": "2021-03-24T15:06:17.321710+00:00"
    }
  },
  "method": "DELETE",
  "session_id": "7c28bcf9-be26-4d0b-931a-3374ab4bb458",
  "status": 204,
  "url": "http://www.youku.com",
  "uuid": "831c6afa-375c-4988-b248-096f9ed101f8"
}

After extracting the example data, you’ll need to play this onto the web_requests topic of the local Kafka container.

ℹ️
URLs sampled for the test data are sourced from Wikipedia’s list of most popular websites - https://en.wikipedia.org/wiki/List_of_most_popular_websites.

Inspect example output

  • List data files - ls tests/data/web_requests/date=2021-03-24

  • List delta log files - ls tests/data/web_requests/_delta_log

  • Show some parquet data (using parquet-tools)

    • parquet-tools show tests/data/web_requests/date=2021-03-24/<some file written by your example>