
NetKAN Bot Re-Architecture #2789

Closed
techman83 opened this issue Jun 19, 2019 · 58 comments · Fixed by #2798 or KSP-CKAN/NetKAN-Infra#20

@techman83
Member

techman83 commented Jun 19, 2019

History

NetKAN started its life as a Perl script in late 2014, followed by a single Python/bash script, then another Perl script which inspired NetKAN-bot: a modular Perl application with a bunch of libraries and scripts to do things like regular indexing, webhooks and mirroring to archive.org.

Problems

NetKAN-bot has served its purpose since it was written around 4 years ago; however, with the growth in the number of mods, along with the bot now generating the majority of the metadata, it is really starting to creak under the pressure. From a build perspective the application is reasonably straightforward; the infrastructure is not, and it makes deploying changes a bit of a pain.

Rationale

One of the things that really drove the development of NetKAN-bot was ensuring the metadata produced complied with our schema. @HebaruSan has done a bunch of excellent work in #2788, meaning we could drop a lot of logic, making for a far simpler indexer. Splitting these up and using a FIFO queue also means we don't have to be so defensive about our git repo getting messed up, as we don't need to check outputs or even touch the file system if the payload and the file on disk are identical.

Initial Brain Dump

NetKAN-bot will be replaced by a bunch of distinct containers with specific roles.

Inflator

@DasSkelett / @HebaruSan - How much work do you think would be involved in this? I'm happy to do the Docker/AWS architecture as I pretty much breathe this stuff at work; C# is not something I do a lot of.

  • Processes SQS queue (a message sketch follows the list below)
    • Incoming Message example
      Payload: ModIdentifier
      Attr Refresh: True/False (this is to signify a repo pull is required)
      
    • Outgoing Message example
      Payload: CKAN JSON Output (if success)
      Attr Identifier: ModIdentifier
      Attr Success: True/False
      Attr Error Message: Message/Exception if there is one
      Attr Check Time: timestamp
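
For illustration, sending those messages with Python/boto3 might look roughly like this (queue URLs, attribute names and the identifier are placeholders from the brain dump above, not a settled design):

import json
import boto3

sqs = boto3.client('sqs')
inbound_queue_url = '...'    # provided by the infrastructure
outbound_queue_url = '...'

# Hypothetical incoming message: ask the inflator to process one mod
sqs.send_message(
    QueueUrl=inbound_queue_url,
    MessageBody='Astrogator',
    MessageAttributes={
        'Refresh': {'DataType': 'String', 'StringValue': 'True'},
    },
)

# Hypothetical outgoing message: the inflated CKAN JSON plus result attributes
sqs.send_message(
    QueueUrl=outbound_queue_url,
    MessageBody=json.dumps({'identifier': 'Astrogator', 'version': 'v0.9.2'}),
    MessageAttributes={
        'ModIdentifier': {'DataType': 'String', 'StringValue': 'Astrogator'},
        'Success': {'DataType': 'String', 'StringValue': 'True'},
        'CheckTime': {'DataType': 'String', 'StringValue': '2019-06-19T00:00:00Z'},
        # an ErrorMessage attribute would be added on failure instead
    },
)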
      

Initially there will be 2 incoming and 2 outgoing

Incoming:

  • Webhooks
  • Scheduled

Outgoing:

  • Master
  • Staged

Implementation

Will require some new features added to NetKAN:

  • Ability to run as a foreground process that watches an SQS queue (a rough loop is sketched below)
  • Read SQS queues, prioritise webhooks
  • NetKAN repo pull changes when refresh is set true on an SQS message; a call out to shell would be fine at a minimum
  • Upon inflation publish output to SQS master, unless it's marked for staging
  • Manage cache aging. Currently it's a scheduled task, but netkan could age out files, and when refresh is set it could mark the cached download for refresh.
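
To make the loop concrete, here's a minimal Python sketch of the idea (the real implementation would be C# inside netkan.exe; the queue names and the inflate() stand-in are made up):

import boto3

sqs = boto3.client('sqs')
webhook_q = sqs.get_queue_url(QueueName='WebhooksDev')['QueueUrl']      # placeholder names
scheduled_q = sqs.get_queue_url(QueueName='ScheduledDev')['QueueUrl']
master_q = sqs.get_queue_url(QueueName='MasterDev')['QueueUrl']

def inflate(body):
    return body    # stand-in for the real inflation

def poll(queue_url, wait_seconds):
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=1,
        WaitTimeSeconds=wait_seconds,
        MessageAttributeNames=['All'],
    )
    return response.get('Messages', [])

while True:
    # Webhooks get priority: a quick check of that queue first,
    # then a long poll on the scheduled queue if nothing was waiting.
    for queue_url, wait_seconds in ((webhook_q, 0), (scheduled_q, 20)):
        for message in poll(queue_url, wait_seconds):
            ckan_json = inflate(message['Body'])
            sqs.send_message(QueueUrl=master_q, MessageBody=ckan_json)
            sqs.delete_message(QueueUrl=queue_url,
                               ReceiptHandle=message['ReceiptHandle'])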

Assumptions that can be made

  • As long as it'll run in the foreground, packaging it into a docker container is trivial
  • Cache storage path will be provided
  • SQS Access will not require keys, the instance will be granted access via IAM

Indexer

Takes care of processing the outgoing messages: check if the payload is different, update, commit, push + update status.

The idea behind having a separate queue is to avoid the current situation of needing to be really careful about our repo state; if we process them separately we can manage this much more easily.

It will also update the status, likely in a DynamoDB table, from which we can in the interim periodically produce a JSON file and upload it to S3.
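
A rough sketch of that flow, assuming a local CKAN-meta checkout and a hypothetical DynamoDB status table (names and commit message are illustrative only):

import json
import subprocess
from pathlib import Path
import boto3

status_table = boto3.resource('dynamodb').Table('NetKANStatus')   # hypothetical table name

def index(identifier, payload, check_time, repo=Path('CKAN-meta')):
    """Write, commit and push the inflated .ckan only if it differs from what's on disk."""
    module = json.loads(payload)
    target = repo / identifier / '{}-{}.ckan'.format(identifier, module['version'])
    if not (target.exists() and target.read_text() == payload):
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(payload)
        subprocess.run(['git', 'add', str(target.relative_to(repo))], cwd=repo, check=True)
        subprocess.run(['git', 'commit', '-m', 'NetKAN updated mod - ' + identifier],
                       cwd=repo, check=True)
        subprocess.run(['git', 'push'], cwd=repo, check=True)
    # Record the check either way so the status output stays current
    status_table.update_item(
        Key={'ModIdentifier': identifier},
        UpdateExpression='SET last_checked = :t',
        ExpressionAttributeValues={':t': check_time},
    )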

Scheduler

Periodically check instance CPU credit levels and submit all NetKANs to the queue for inflation.
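
For the CPU-credit part, a minimal sketch using CloudWatch's CPUCreditBalance metric for burstable instances (instance id and threshold are made up):

from datetime import datetime, timedelta
import boto3

def enough_cpu_credits(instance_id, minimum=150):
    """Return True if the instance has enough burst credit to run a full inflation pass."""
    cloudwatch = boto3.client('cloudwatch')
    stats = cloudwatch.get_metric_statistics(
        Namespace='AWS/EC2',
        MetricName='CPUCreditBalance',
        Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
        StartTime=datetime.utcnow() - timedelta(minutes=15),
        EndTime=datetime.utcnow(),
        Period=300,
        Statistics=['Average'],
    )
    datapoints = stats['Datapoints']
    latest = max(datapoints, key=lambda point: point['Timestamp']) if datapoints else None
    return latest is not None and latest['Average'] >= minimum

# Usage (hypothetical instance id):
# if enough_cpu_credits('i-0123456789abcdef0'):
#     submit all NetKANs to the inflation queue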

WebHook Processing

This may actually be a lambda script behind a load balancer, or it may reside on the instance (load balancers are kinda expensive for this small use case). Either way it will take care of receiving inflation requests and creating an SQS message on the webhook queue, and CKAN-meta changes can spawn messages for the mirroring to take place.
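
If it does end up as a Lambda, the handler could be as small as this sketch (assumes an API Gateway style proxy event and a queue URL injected via a hypothetical environment variable):

import os
import boto3

sqs = boto3.client('sqs')

def handler(event, context):
    """Receive an inflation request webhook and drop it onto the webhook queue."""
    sqs.send_message(
        QueueUrl=os.environ['WEBHOOK_QUEUE_URL'],   # hypothetical variable set by the infrastructure
        MessageBody=event.get('body') or '',        # pass the hook payload straight through
    )
    return {'statusCode': 204, 'body': ''}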

Mirror processing

This will take care of processing the mirror queue and submitting applicable mods to archive.org for preservation.

Summary

It's likely we can achieve this in a few steps and the idea will be to build automated processes for deployment of changes.

  1. Minimal Operation
  • NetKAN Daemon mode with Git/SQS Processing
  • Indexer
  • Scheduler
  2. Webhooks
  • They can quite happily run as they are until we've built something to process them
  3. Mirroring
  • Along with the webhooks, they'll happily run as they are for now

Extra

  • Consider how we want to implement repo splitting
  • Consider how to backfill x releases
@techman83 techman83 self-assigned this Jun 19, 2019
@techman83 techman83 added the Architecture, Discussion needed, and Infrastructure (Issues affecting everything around CKAN: the GitHub repos, build process, CI, ...) labels Jun 19, 2019
@HebaruSan
Member

How much work do you think would be involved in this? I'm happy to do the Docker/AWS architecture as I pretty much breathe this stuff at work; C# is not something I do a lot of.

I have not heard of SQS before, so for me this project would start with a lot of learning.

@techman83
Member Author

Fortunately @HebaruSan, SQS is pretty straightforward. You could leave "get_messages" "send_message" stubs and I'd almost be able to add the relevant code to do the things.


@DasSkelett
Member

One question:

Initially there will be 2 incoming and 2 outgoing [SQS queues]

Incoming:

* Webhooks

* Scheduled

Outgoing:

* Master

* Staged

Why two outgoing queues? Aren't they both processed by the indexer?
Wouldn't an Attr Staged: True/False or something suffice?

It will also update the status, likely in a DynamoDB table, from which we can in the interim periodically produce a JSON file and upload it to S3.

Don't you dare break my indexing alert script, it took a long time to get it working... :)

@techman83
Member Author

Can we use this re-architecting opportunity to make some headway on #1155? The current bot's "statelessness" effectively prevents us from tracking a release date property, which has been a pretty common enhancement request.

Thinking on this, we already technically have the information in the status.json. Maybe we want to periodically generate an updated.json à la download counts? The statelessness is a really good and reliable thing; we can guarantee that whatever is produced by the binary you download will be produced by the bot. Though as we're decoupling, we could think about putting that into the metadata (but I don't think we should put that in the first pass).

If we're going to change the structure of the bot and the interface between the infrastructure and netkan.exe, we might as well do it once instead of twice.

Actually I don't necessarily agree here. It's an ambitious task to replace, and one of the problems right now is that it's really hard to iterate and push changes into production. Utilising queues and decoupling the processes means we don't have to care about our state; we can just recycle our containers and they'll pick up where they left off.

My aim is to make it so when someone has a great idea and builds a change, pushing it out is very easy, along with rolling it back being also very easy. Or if we need to rebuild the environment, how it is built isn't locked up inside my head.

Why two outgoing queues? Aren't they both processed by the indexer?
Wouldn't an Attr Staged: True/False or something suffice?

Excellent question.

Incoming
It's not super necessary (and may not be at all necessary if I figure out whether it's possible), but I intend on using SQS FIFO; it has built-in dedupe and a bunch of other reliability features. However I couldn't see a way to force an item to the front, and I really want a webhook job to happen in near real time, so basically the inflator will check the webhook queue first before checking the scheduled queue.

Outgoing
I want to be able to process staged commits after processing all the commits to master. I may even do this with a separate indexing container. Basically it's to reduce all the branch-changing logic I've had to build into the current bot. It's a really complex task mid index run and likely responsible for some of the tree corruption that required writing a script to periodically fix.

Don't you dare break my indexing alert script, it took a long time to get it working... :)

I wouldn't dare break such a thing! I promise there will be an endpoint to replace it, but initially we'll just produce the same thing. The infrastructure to do more than a JSON file on S3 costs money, and one of my aims is to make the resources we use more efficient. Moving to SQS and making our scheduling smarter means we can run as frequently as we have the spare compute resources for, instead of carefully balancing and mostly failing. If we get a bunch of funding to "Make this run faster please", it'll be a case of opening up the taps and it will run faster without any code changes.

@techman83
Member Author

If someone would like to have a crack at the C# stuff, I'll create a couple of queues and generate API keys with access to them.

The default strategy for the SDK is to cycle through the different auth methods, so if you set the right environment variables it will pick those up automatically, and in production it will also do the right thing. So you can pretty much focus on just getting the SDK working and adding the SQS logic, and the auth will "just work".

@HebaruSan
Member

Utilising queues and decoupling the processes means we don't have to care about our state; we can just recycle our containers and they'll pick up where they left off.

Question about this, the "Inflator" will need a persistent download cache to maintain anything like the rates of processing we enjoy today. Would it have that?

@HebaruSan
Member

Thinking on this, we already technically have the information in the status.json.

Or in git; here are all the times this .ckan has been modified:

$ git log --format='%aI' -- Astrogator/Astrogator-v0.9.2.ckan
2019-06-13T20:03:50+00:00
2019-05-30T04:03:37+00:00

(Except that the bot currently does a shallow checkout of CKAN-meta; we'd need to do a full checkout to be able to retrieve the original release date.)

@techman83
Member Author

Question about this, the "Inflator" will need a persistent download cache to maintain anything like the rates of processing we enjoy today. Would it have that?

Yep. I intend to create a shared cache on the local fs that the inflator, archiver etc. can access. Managing cross-container ephemeral data is something I already do a lot in my day job, so it's a pretty easy thing to solve. Netkan can already take a cache path; pretend that path will exist and point to the right place. We can leave cache management out of Netkan for now and keep a scheduled task to clean it up (a super trivial thing to write).
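
For reference, the scheduled cleanup really is that trivial; something along these lines would do (path and age threshold are arbitrary):

import time
from pathlib import Path

def prune_cache(cache_dir='/home/netkan/ckan_cache', max_age_days=30):
    """Delete cached downloads that haven't been touched recently."""
    cutoff = time.time() - max_age_days * 86400
    for cached in Path(cache_dir).iterdir():
        if cached.is_file() and cached.stat().st_mtime < cutoff:
            cached.unlink()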

If we ever want to have multiple instances processing this data, then we'll have multiple caches, but I don't see that as a big problem. We could also use something like EFS, for such a small amount of data it wouldn't be expensive and we're not expecting ludicrously high I/O, so performance wouldn't be an issue. Either way, problem for the future.

Or in git; here are all the times this .ckan has been modified

That's a neat idea. I think we should treat populating historical data as a separate exercise, as it'll be a one-off (like how we backfilled all the hashes then uploaded compatible mods to the archive). I think we could add an indexed field to the spec and resulting metadata. As we're processing messages from a queue instead of spitting things out to a filesystem, checking git status and committing if there are changes, there is scope to be a little smarter with our comparison. It does mean that the bot is no longer producing 100% the same data every time.

Either way, with a newer more flexible architecture I'm comfortable moving that to a future us problem. We can definitely focus down to a design for that specific feature and come up with something that is reliable and achieves the stated goal.

@HebaruSan
Member

HebaruSan commented Jun 21, 2019

If someone would like to have a crack at the C# stuff, I'll create a couple of queues and generate API keys with access to them.

You could leave "get_messages" "send_message" stubs and I'd almost be able to add the relevant code to do the things.

I've started a branch built on top of #2788 to refactor Netkan to allow driving it with a queue. The messaging stubs are here:

https://github.com/HebaruSan/CKAN/blob/feature/netkan-sqs/Netkan/Processors/QueueHandler.cs

@techman83
Member Author

OMG You are amazing! I'm setting a reminder to look at this on Sunday when I have some time.

@techman83
Member Author

Just a brief comment: one of the hugely awesome things about the SDK is that authentication resolution is automatic. You can see the constructor here and what the resolution order is here. It's very unlikely we'd want to hard code a set of credentials deep into the application. What it means is that for testing you can use an API key via environment variables or a .aws/credentials file, and in production we can allow the instance access via the IAM Instance Profile without changing anything in the application.

@HebaruSan
Member

So... are you saying to use the constructor without the key parameters? I'm still completely guessing how SQS works, mostly going by sample programs as I can find them.

@techman83
Member Author

Yup, looking at the list of examples on the first link it covers that scenario. It's a pretty common pattern for AWS.

@HebaruSan
Member

Does the first link go where you meant it to? I get the root node of a huge "AWS SDK for .NET Version 3 API Reference" list which doesn't have any authentication stuff noticeably on it.
https://docs.aws.amazon.com/sdkfornet/v3/apidocs/Index.html

@HebaruSan
Member

HebaruSan commented Jun 21, 2019

EDIT: Possibly resolved this...

What about the queue URL? Is that a constant, or built into the same config file, or do we have to retrieve it at run time based on the queue name?

I'm going to look it up at startup based on the name. Partly because the above brain dump already lists out what the names are likely to be.

@HebaruSan
Member

HebaruSan commented Jun 21, 2019

EDIT: Think I found the answers to these...

Does ReceiveMessage remove the messages from the queue, or will they still be there until deleted explicitly with DeleteMessage?

Sounds like they stay till deleted.

And now that I'm thinking about it, how would multiple clients coordinate to distribute messages among themselves?

I think this is what VisibilityTimeout is for, you can specify a number of seconds for the returned messages to be hidden, after which they reappear.
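
In boto3 terms the lifecycle looks roughly like this: a received message stays on the queue, is hidden from other consumers for the visibility timeout, and only disappears when explicitly deleted (queue URL and the processing step are placeholders):

import boto3

sqs = boto3.client('sqs')
queue_url = '...'    # whichever queue is being worked

response = sqs.receive_message(
    QueueUrl=queue_url,
    MaxNumberOfMessages=1,
    WaitTimeSeconds=20,      # long poll
    VisibilityTimeout=300,   # hide the message from other consumers for 5 minutes
)

for message in response.get('Messages', []):
    # ... do the actual work with message['Body'] here ...
    # Without the delete, the message becomes visible again once the
    # visibility timeout expires and another consumer will pick it up.
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message['ReceiptHandle'])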

@HebaruSan
Member

HebaruSan commented Jun 21, 2019

Design question:
The format of the incoming message:

Payload: ModIdentifier
Attr Refresh: True/False (this is to signify a repo pull is required)

... implies that the scheduler must have a checked out copy of the NetKAN repo (because it must know all the current mod identifiers). But also that the inflator must have a copy of the same NetKAN repo (because only the identifier is sent, so it must get the metadata from outside the message, and because it can be told to do a pull).

This seems duplicative. What do you think of the idea of the scheduler sending the JSON from the .netkan file as the body of the message? Then the infrastructure would only need one copy of the NetKAN repo (in the scheduler), and a signaling mechanism for repo pulls would no longer be needed (because the scheduler could send the right version of the JSON if it knows a pull is needed).
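
Sketching that idea: the scheduler walks its checkout of the NetKAN repo and sends each .netkan's JSON as the message body, batched 10 at a time (the queue name and repo path are placeholders):

from pathlib import Path
import boto3

sqs = boto3.client('sqs')
queue_url = sqs.get_queue_url(QueueName='ScheduledDev')['QueueUrl']   # placeholder name

def schedule_all(netkan_repo='NetKAN'):
    """Queue every .netkan in the checkout for inflation, JSON body and all."""
    netkans = sorted(Path(netkan_repo).glob('**/*.netkan'))
    for start in range(0, len(netkans), 10):    # send_message_batch caps out at 10 entries
        entries = [
            {'Id': str(index), 'MessageBody': path.read_text()}
            for index, path in enumerate(netkans[start:start + 10])
        ]
        sqs.send_message_batch(QueueUrl=queue_url, Entries=entries)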

@techman83
Member Author

Does the first link go where you meant it to? I get the root node of a huge "AWS SDK for .NET Version 3 API Reference" list which doesn't have any authentication stuff noticeably on it.
https://docs.aws.amazon.com/sdkfornet/v3/apidocs/Index.html

Gah, silly page doesn't put the links in the top and I was rushing it out of my brain! https://docs.aws.amazon.com/sdkfornet/v3/apidocs/items/SQS/TSQSClient.html

We can use environment variables for everything, so in theory the client constructor doesn't require anything passed to it (this is how I do it in python all the time and the C# looks the same, but YMMV).

These keys are generally all that's needed:

AWS_ACCESS_KEY_ID – Specifies an AWS access key associated with an IAM user or role.
AWS_SECRET_ACCESS_KEY – Specifies the secret key associated with the access key. This is essentially the "password" for the access key.
AWS_DEFAULT_REGION – Specifies the AWS Region to send the request to.

The region we use for everything is: us-west-2

I'm going to look it up at startup based on the name. Partly because the above brain dump already lists out what the names are likely to be.

The queue urls should be consistent and I will provide them to the container likely via an environment variable (this information is very easy to pull from the queue and pass to the container configuration)
https://sqs.us-east-1.amazonaws.com/80398EXAMPLE/MyTestQueue

One thing I didn't consider is that every time we check, add, delete or send to a queue it is considered a message, which adds to our total messages sent. So we may want to only have an inbound/outbound queue, as every time we check one it'll use a message (they're inexpensive, and we will be able to use a smaller instance so we will save money overall); from a first pass standpoint it might be acceptable to not over-complicate the queues too much. We can use long polling to reduce the amount of unnecessary messages we consume checking for new ones.

This seems duplicative. What do you think of the idea of the scheduler sending the JSON from the .netkan file as the body of the message? Then the infrastructure would only need one copy of the NetKAN repo (in the scheduler), and a signaling mechanism for repo pulls would no longer be needed (because the scheduler could send the right version of the JSON if it knows a pull is needed).

That's actually a really good idea and it means we don't have to worry about git operations inside of netkan. Leaving the cache management outside it as well means the only thing we're adding is SQS, and the rest of the operations already exist. Wrap the inflate in a try/catch; on success send the resulting ckan as the payload, otherwise send the error details instead. Set a Success attribute with True/False as applicable for each scenario.

Tomorrow (in about 18 hours or so) I plan to generate some queues. I will start by finishing my template for creating the dev queues + IAM permissions to be able to send/receive from them. Afterwards I'll hack together some Python and test the interactions.

@HebaruSan
Member

That's actually a really good idea and it means we don't have to worry about git operations inside of netkan.

Cool, I will update my prototype code to receive the JSON in the body and not look for a Refresh attribute. I have been pushing updates here as I learn more about SQS:

https://github.com/HebaruSan/CKAN/blob/feature/netkan-sqs/Netkan/Processors/QueueHandler.cs

@techman83
Member Author

So I've spent a little more time reading the documentation and doing some testing; we have some capacity to be a little smart about how we handle messages. Now as far as webhooks go, we've never really guaranteed them to be instant, but they are currently instant. It might be worth running two instances of the inflator, one to process the webhooks and one for the scheduler. I also think a single queue for the inflated metadata and setting a 'staged' attribute makes more sense. It was hard before because of the monolithic nature of the old indexer, which is not something we need to be concerned about now.

SQS Action | Messages
Scheduler Polling (long poll 20 seconds) | 131400
Webhook Polling (long poll 20 seconds) | 131400
Inflated Polling (long poll 20 seconds) | 131400
Index Scheduled (2000 netkans, 10/batch send, bi-hourly) | 73000
Scheduled Processed (2000 netkans, 1/batch send, bi-hourly) | 730000
Scheduled Deleted (2000 netkans, 1/batch send, bi-hourly) | 730000
Inflated Scheduled (2000 netkans, 1/batch send, bi-hourly) | 730000
Processed (2000 netkans, 1/batch receive, bi-hourly) | 730000
Scheduled Deleted (2000 netkans, 1/batch, bi-hourly) | 730000
Total | 4117200

If we don't do batch deletes we'll end up around the 5 million message mark including webhooks (which is only a couple of dollars); if we process everything in the max batch size of 10 we can almost do all our processing within the free tier of a million requests per month.

Either way we'll be in a much better place and can iterate once the basics are working.

@techman83
Member Author

I also whipped up a quick python3 script for submitting NetKANs to a queue

https://gist.github.com/techman83/0604f4dc9f849aac605e46f494de9403

It requires boto3, but the rest is part of the stdlib.

@techman83
Member Author

The 2 queues I created for testing:

  • InboundDev.fifo
  • OutboundDev.fifo
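
One wrinkle with the .fifo queues: every send needs a MessageGroupId, and deduplication needs either content-based dedup enabled on the queue or an explicit MessageDeduplicationId. Roughly (identifier and body are placeholders):

import boto3

sqs = boto3.client('sqs')
queue_url = sqs.get_queue_url(QueueName='InboundDev.fifo')['QueueUrl']

sqs.send_message(
    QueueUrl=queue_url,
    MessageBody='{"identifier": "Astrogator"}',   # placeholder .netkan contents
    MessageGroupId='inflation',                   # required for FIFO queues
    MessageDeduplicationId='Astrogator',          # or enable ContentBasedDeduplication on the queue
)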

@HebaruSan
Member

I thought I read somewhere that FIFOs are more expensive than regular queues that don't guarantee the ordering. Could that be another cost-saving opportunity? While an orderly alphabetical processing sequence would be nice, I don't think it would be mission critical.

@DasSkelett
Member

(which will get resource heavy)

Not necessarily; since only the version property would need to change, we don't have to re-do the entire inflation. We could do something as simple as this:

[...]

The approximate cost per increment would be one string split, one string concatenation, and parsing one string into a ModuleVersion.

But this still leaves the problem that we will have outdated netkan data in the NetKAN repository, doesn't it?


@techman83
Member Author

Wow indeed. That is rough! I think initially I'll bring across the current webhooks, as I'm conscious about not dragging this out too much. But I absolutely think we can do better.

Current state of play:

  • Indexing logic re-implemented in python (full integration test hasn't been done, but all the logic has tests and does what it should be doing.)
  • Inflator testing (It works super nicely)
  • Cloudformation pieces mostly complete (at least the harder bits, containers/instances/clusters I'm well across)

Things left to do

  • Add ability for environment variables to populate command-line defaults (I've had a brief look, but it requires more than a coffee break's worth of time to wrap my head around C#)
    • NETKAN_QUEUES
    • GITHUB_TOKEN
    • NETKAN_CACHE
  • Have travis build a docker file when building netkan
  • Finish the container builds for the infrastructure
  • Write the scheduler
  • Finish the cloudformation templates
  • Write a process to export status -> status.json for netkan-status (rough sketch below)
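
A rough sketch of that export, assuming status lives in a DynamoDB table and the existing netkan-status frontend keeps reading a status.json from S3 (table, bucket and key names are hypothetical):

import json
import boto3

def export_status(table_name='NetKANStatus', bucket='ckan-status', key='status.json'):
    """Dump the status table to a JSON file on S3 for netkan-status to consume."""
    table = boto3.resource('dynamodb').Table(table_name)
    rows = table.scan()['Items']                      # fine while the table is small; paginate later
    status = {row['ModIdentifier']: row for row in rows}
    boto3.client('s3').put_object(
        Bucket=bucket,
        Key=key,
        Body=json.dumps(status, default=str),         # Decimal values need a default serializer
        ContentType='application/json',
    )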

It probably seems like a lot, but most of the effort has been rewriting the indexing logic, which is much, much simpler now. I'll get that tested and knock out the rest of the bits and pieces.

Am looking forward to replacing the old infrastructure. It'll make it much easier for us to make iterative improvements!

@techman83
Member Author

A quick update, as I haven't been idle. But I am getting smashed at work, we're doing a pretty big backend re-architecture, so that's absorbing a lot of cycles!

However, I have a full integration test done. I can fire a netkan at the Incoming queue, the inflator does its thing and throws it at the outgoing queue, then the indexer picks it up, processes it, stages it if necessary and updates DynamoDB with the current status.

I've also got the initial Dockerfile and compose file done for dev/production building; there are only some minor tweaks left to clean things up there.

I have had a brief look at the C#; whilst I'm pretty sure I could figure it out, I would highly appreciate it if someone could knock off these things for netkan, and I'll submit a PR with a Dockerfile and container build process after:

  • Config via Environment Variables:
    • NETKAN_QUEUES
    • GITHUB_TOKEN
    • NETKAN_CACHE
  • x_netkan_staging_reason - If this can be added as an attribute on the outgoing message that'd be super! I noticed it missing when redoing the PR code and it was one of @HebaruSan's earlier PRs to the bot's branch 🙂

Keep being awesome everyone ❤️

@HebaruSan
Member

Config via Environment Variables:

Can you not specify these in the command line? I was assuming this would look like:

netkan.exe --queues $QUEUES --github-token $TOKEN --cachedir $DIR

I will take care of x_netkan_staging_reason now.

@techman83
Member Author

Can you not specify these in the command line? I was assuming this would look like:

That involves invoking a shell. Docker runs the command directly. Now, there are a myriad of ways to achieve it, but the application consuming environment variables is the cleanest way.

It would mean that you could launch netkan in queue mode with a docker run command using the command line switches.

I'll sort something in the interim.

@techman83
Member Author

I did! Running as a plain command runs it in a shell.

FROM mono:5.20.1
RUN useradd -ms /bin/bash netkan
USER netkan
WORKDIR /home/netkan
ADD netkan.exe .
# Shell-form ENTRYPOINT runs via /bin/sh -c, so $QUEUES and $GH_Token are expanded at start-up
ENTRYPOINT /usr/bin/mono netkan.exe --queues $QUEUES \
  --github-token $GH_Token --cachedir ckan_cache -v

@techman83
Member Author

Needs some automation work, but we have a docker hub organisation!

https://hub.docker.com/r/kspckan/inflator

@techman83
Member Author

techman83 commented Jul 27, 2019

Don't worry @HebaruSan it'll all become apparent what docker is doing to really speed up our deployment to production :)

I also found out AWS SQS has released virtual queues! Certainly not something we need to worry about right now, but it's something I really desired when I was thinking about how to handle webhooks.

I'll be moving the new repo under the organisation account in the next couple of days and getting Travis sorted to build and submit the containers.

@HebaruSan
Member

The SpaceDock folks were kind enough to teach me enough docker to be dangerous while I was looking at that PR, and I definitely see the potential of this now.

When do we start writing the scheduler? 😁

@techman83
Member Author

Haha, I know right! It has already begun! KSP-CKAN/NetKAN-Infra@2e6f290

I don't expect it to take too long to write. I'm gunna keep it pretty simple to begin with and then we can add some smarts later.

@techman83
Member Author

@DasSkelett you'll be happy to know status will be taken care of via KSP-CKAN/NetKAN-Infra@18e3efa and KSP-CKAN/NetKAN-Infra@f6f259c

I'd love to replace this at some point, but frontend stuff isn't my forte. I know at work we're doing some super cool stuff with GraphQL + React. Essentially, if a query doesn't change it is cached until such time as it gets invalidated, meaning our new frontend rarely needs to hit the backend, which would be really nifty for exposing the DynamoDB stuff without needing to scale up the read credits.

@techman83
Member Author

It. Is. Done!

Also we have our first commit from the new infrastructure!

KSP-CKAN/CKAN-meta@c8ae922

.. I need to fix the bot's gitconfig it would seem 😉

@HebaruSan HebaruSan changed the title NetKAN Bot Re-Architechture NetKAN Bot Re-Architecture Oct 16, 2019