Allow Stream Creation from Ingestors in Distributed Deployments #825
By the above do you mean, checking that:
Is there a way to run the binary in query mode with a local store? I see
Ideally, I'd like to be able to run the distributed setup on my machine to ensure a faster feedback loop.
You'll need to run a MinIO instance locally and then run Parseable with s3-store: https://min.io/docs/minio/linux/operations/install-deploy-manage/deploy-minio-single-node-single-drive.html#download-the-minio-server
Problems
Query Server Getting Schema

Lazy Creation
The query server can lazily create the schema and metadata for a stream on demand if they don't exist, the same way an ingestion server does now.
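Roughly, the lazy path could look like the sketch below; the `StreamMetadata`, `ObjectStore`, and `QueryServer` types are hypothetical stand-ins, not Parseable's actual internals:

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// Hypothetical stand-in for Parseable's per-stream schema/metadata.
#[derive(Clone, Default)]
struct StreamMetadata { /* schema fields, created_at, ... */ }

/// Hypothetical view of the object store.
trait ObjectStore {
    /// Fetch metadata for `stream` from the bucket, if an ingestor already created it.
    fn fetch_metadata(&self, stream: &str) -> Option<StreamMetadata>;
}

struct QueryServer<S: ObjectStore> {
    store: S,
    streams: RwLock<HashMap<String, StreamMetadata>>,
}

impl<S: ObjectStore> QueryServer<S> {
    /// Lazily resolve a stream on first query: check memory, then object
    /// storage, and only create a fresh entry if neither has it.
    fn get_or_create_stream(&self, name: &str) -> StreamMetadata {
        if let Some(meta) = self.streams.read().unwrap().get(name) {
            return meta.clone();
        }
        let meta = self.store.fetch_metadata(name).unwrap_or_default();
        self.streams
            .write()
            .unwrap()
            .insert(name.to_string(), meta.clone());
        meta
    }
}
```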
Regular Scans
Have the query server regularly scan the relevant bucket to get the diff of streams created, and create the schema & metadata files for itself that aren't already present. (Discussed on call)
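A rough sketch of such a scan loop, assuming a tokio runtime and a hypothetical `list_streams_in_bucket` helper (the 60-second interval is arbitrary):

```rust
use std::collections::HashSet;
use std::time::Duration;

/// Hypothetical helper listing stream prefixes currently in the bucket.
fn list_streams_in_bucket() -> HashSet<String> {
    HashSet::new() // placeholder
}

/// Periodically diff the bucket against what the query server already knows,
/// and register any streams created by ingestors in the meantime.
async fn scan_loop(known: &mut HashSet<String>) {
    loop {
        let in_bucket = list_streams_in_bucket();
        for stream in in_bucket.difference(known).cloned().collect::<Vec<_>>() {
            // load the stream's schema & metadata into memory here
            known.insert(stream);
        }
        tokio::time::sleep(Duration::from_secs(60)).await;
    }
}
```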
Event Mechanism
Set up AWS (as S3 seems to be the only object store supported) to send an event via SNS when a new stream is created so it becomes a reactive system. Linux fs can use
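On the S3 side, the consumer would parse the standard S3 event notification payload and register the new stream. A minimal sketch using `serde_json`; the key layout (`<stream>/...`) and the `register` callback are assumptions:

```rust
use serde_json::Value;

/// Hypothetical handler for an S3 event notification (delivered via SNS/SQS).
/// Extracts the stream name from the object key of a newly created metadata
/// file and hands it to `register` for the in-memory map.
fn handle_s3_event(
    event_body: &str,
    register: &mut dyn FnMut(String),
) -> serde_json::Result<()> {
    let event: Value = serde_json::from_str(event_body)?;
    if let Some(records) = event["Records"].as_array() {
        for record in records {
            if let Some(key) = record["s3"]["object"]["key"].as_str() {
                // assume keys look like "<stream>/..."; the layout is illustrative
                if let Some(stream) = key.split('/').next() {
                    register(stream.to_string());
                }
            }
        }
    }
    Ok(())
}
```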
(Side note: Does the query server even need to create its own schema & metadata files in object storage? From what I can see the schema files are identical & the metadata files are the same except for additional fields on the ingestion server's metadata file. So, why not just load directly from that into the query server's in-memory map and not create additional files in object storage?)

Ingestion Servers Creating Schema

Redirect To Query Server
The simplest solution to the problem is to have ingestion servers redirect the request to the query server. From the docs I can see that the architecture expects a single query server which doubles as a leader, so why not just use it as a leader? I know we discussed on the call about keeping them de-coupled, but if it's designated as the leader of the setup, why not use it as that? At Tensorlake we had a similar approach with ingestion servers and coordinators.
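A sketch of the forwarding path using `reqwest`; the query server URL would come from configuration, and the `PUT /api/v1/logstream/{name}` path mirrors the documented stream-creation endpoint but is treated here as an assumption:

```rust
use reqwest::Client;

/// On an ingestion server, forward a stream-creation request to the query
/// server (acting as the de facto leader) instead of writing the schema and
/// metadata to object storage directly.
async fn forward_stream_creation(
    client: &Client,
    query_server_url: &str,
    stream_name: &str,
) -> Result<(), reqwest::Error> {
    client
        .put(format!("{query_server_url}/api/v1/logstream/{stream_name}"))
        .send()
        .await?
        .error_for_status()?;
    Ok(())
}
```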
S3 Strong Consistency For Locking
I came across this answer on SO and it tickled my brain. It uses S3's recent strong consistency to place a lock file and do a List-after-Put to determine whether the writer that placed the lock file has acquired the lock. This guarantees that only a single ingestion server will create a stream. All other ingestion servers will be expected to remove their locks.
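A rough sketch of the List-after-Put idea with `aws-sdk-s3` (recent SDK versions); the lock key layout and the "lexicographically smallest lock wins" rule are assumptions for illustration, not an existing Parseable mechanism:

```rust
use aws_sdk_s3::primitives::ByteStream;
use aws_sdk_s3::Client;

/// Try to acquire a per-stream creation lock: write a lock object named after
/// this node, then list all lock objects and treat the lexicographically
/// smallest key as the winner. Relies on S3's strong read-after-write
/// consistency for newly written objects.
async fn try_acquire_stream_lock(
    client: &Client,
    bucket: &str,
    stream: &str,
    node_id: &str,
) -> Result<bool, aws_sdk_s3::Error> {
    let my_lock = format!("{stream}/.locks/{node_id}");
    client
        .put_object()
        .bucket(bucket)
        .key(&my_lock)
        .body(ByteStream::from_static(b"lock"))
        .send()
        .await?;

    let listing = client
        .list_objects_v2()
        .bucket(bucket)
        .prefix(format!("{stream}/.locks/"))
        .send()
        .await?;

    let winner = listing
        .contents()
        .iter()
        .filter_map(|obj| obj.key())
        .min()
        .map(|key| key.to_string());

    // Only the winner creates the stream; everyone else removes their lock and backs off.
    Ok(winner.as_deref() == Some(my_lock.as_str()))
}
```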
Designated Leader/Writer
Have a designated leader/writer among the ingestion servers. All writes must be done via the leader. The simple end of this is to have a config file demarcating one server as the writer, leaving it up to the operator to ensure uptime and a backup for the leader/writer. The complex end of this is a Raft cluster of ingestion servers (overkill).
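The simple end of this option could be as small as a config field naming the writer, e.g. with `serde`; the field names here are made up for illustration:

```rust
use serde::Deserialize;

/// Hypothetical deployment config marking one ingestion server as the sole
/// stream-creation writer; all other nodes refuse or forward creation requests.
#[derive(Deserialize)]
struct ClusterConfig {
    /// Node id of the designated writer, e.g. "ingestor-0".
    writer_node: String,
    /// This node's own id.
    node_id: String,
}

impl ClusterConfig {
    fn is_writer(&self) -> bool {
        self.node_id == self.writer_node
    }
}
```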
Registry Service
Run a separate, dedicated registry service which will be responsible for stream creation. This can be made strongly consistent via sync mechanisms.
(Side note: Does it matter if the last writer wins with regards to stream creation from concurrent ingestion servers? That seems like a simple approach, and then this concurrency problem goes away.)

Conclusion
I favor the approach of forwarding the requests from the ingestion servers to the query server for the reasons I've outlined above. Query servers can create/fetch schemas lazily and store them in internal memory. It's minimal code changes and achieves the outcome. Alternatively, "locking" via S3 would work for multiple concurrent ingestion server writers if connecting to the query server will not work.
@redixhumayun Thank you for this well-articulated and wonderful write-up.
Do let us know if you need a follow-up call to discuss further. Thanks! CC: @nitisht
@nikhilsinhaparseable The proposed approach works and considers minimizing the changes to the current setup. Also, there might be other constraints I am not aware of. But, if there is a way to invest time/effort in a registry service (or metadata service), it will be beneficial in the long run IMHO. A single-node registry service with a fallback node can still handle a large cluster. Also, given this is the APM space, if there is a necessity for a multi-node registry service, it can be designed as eventually consistent (to maximize throughput) as opposed to something strongly consistent like Raft.
Current Behaviour -
In a distributed deployment, stream creation is allowed only from the querier node.
At the time of ingestion, ingestors check if the stream has been created in storage; if yes, they sync it to memory.
Enhancement Required -
Allow stream creation from the ingestor node via the Ingestion API, so that the user does not have to call the stream creation endpoint on the query node and then the ingest endpoint on the ingestor node for ingestion (see the sketch after the doc link below).
Then, verify that the querier node and other ingestor nodes are able to sync the new stream into memory and that all functionality works well.
Doc link for reference: https://www.parseable.com/docs/api/log-ingestion
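A hedged sketch of the enhancement: the ingest handler on an ingestor node creates the stream if it is not found in storage, instead of rejecting the request. The handler and type names are illustrative, not Parseable's actual code:

```rust
/// Hypothetical error type for the ingest path.
enum IngestError {
    StreamCreationFailed,
}

/// Hypothetical stand-in for an ingestor node's storage and memory handles.
struct Ingestor;

impl Ingestor {
    fn stream_exists_in_storage(&self, _stream: &str) -> bool { false }
    fn create_stream(&self, _stream: &str) -> Result<(), IngestError> { Ok(()) }
    fn sync_to_memory(&self, _stream: &str) {}
    fn write_events(&self, _stream: &str, _body: &[u8]) {}

    /// Ingest API entry point: create the stream on first ingest so the user
    /// never has to call the query node's creation endpoint beforehand.
    fn ingest(&self, stream: &str, body: &[u8]) -> Result<(), IngestError> {
        if !self.stream_exists_in_storage(stream) {
            self.create_stream(stream)?; // new behaviour: create from the ingestor
        }
        self.sync_to_memory(stream);
        self.write_events(stream, body);
        Ok(())
    }
}
```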