NetKAN Bot Re-Architecture #2789
I have not heard of SQS before, so for me this project would start with a lot of learning.
Fortunately @HebaruSan, SQS is pretty straightforward. You could leave "get_messages" / "send_message" stubs and I'd almost be able to add the relevant code to do the things.
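For illustration, a minimal sketch of what such stubs might look like, assuming the AWSSDK.SQS NuGet package; the class and method names here are hypothetical placeholders, not a final design:

```csharp
using System.Collections.Generic;
using Amazon.SQS;
using Amazon.SQS.Model;

public class QueueStubs
{
    // Credentials and region are resolved automatically by the SDK
    // (environment variables, ~/.aws/credentials, or an instance profile).
    private readonly IAmazonSQS client = new AmazonSQSClient();

    // Receive up to 10 messages (the SQS per-request maximum) from the given queue.
    public List<Message> GetMessages(string queueUrl)
        => client.ReceiveMessageAsync(new ReceiveMessageRequest
        {
            QueueUrl            = queueUrl,
            MaxNumberOfMessages = 10,
        }).Result.Messages;

    // Send a single message body to the given queue.
    public void SendMessage(string queueUrl, string body)
        => client.SendMessageAsync(new SendMessageRequest
        {
            QueueUrl    = queueUrl,
            MessageBody = body,
        }).Wait();
}
```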
One question:
Why two outgoing queues? Aren't they both processed by the indexer?
Don't you dare break my indexing alert script, it took a long time to get it working... :)
Thinking on this, we already technically have the information in status.json. Maybe we want to periodically generate an updated.json à la download counts? The statelessness is a really good and reliable thing; we can guarantee that whatever is produced by the binary you download will also be produced by the bot. Though as we're decoupling, we could think about putting that into the metadata (but I don't think we should do that in the first pass).
Actually, I don't necessarily agree here. Replacing it is an ambitious task, and one of the problems right now is that it's really hard to iterate and push changes into production. Utilising queues and decoupling the processes means we don't have to care about our state; we can just recycle our containers and they'll pick up where they left off. My aim is to make it so that when someone has a great idea and builds a change, pushing it out is very easy, and rolling it back is just as easy. And if we need to rebuild the environment, how it is built isn't locked up inside my head.
Excellent question. Incoming / Outgoing:
I wouldn't dare break such a thing! I promise there will be an endpoint to replace it, but initially we'll just produce the same thing. The infrastructure to do more than a JSON file on S3 costs money, and one of my aims is to make the resources we use more efficient. Moving to SQS and making our scheduling smarter means we can run as frequently as we have the spare compute resources for, instead of carefully balancing and mostly failing. If we get a bunch of funding to "make this run faster please", it'll be a case of opening up the taps and it will run faster without any code changes.
If someone would like to have a crack at the C# stuff, I'll create a couple of queues and generate API keys with access to them. The default strategy for the SDK is to cycle through the different auth methods, so if you set the right environment variables it will pick those up automatically, and in production it will also do the right thing. So you can pretty much focus on just getting the SDK working and adding the SQS logic, and the auth will "just work".
Question about this: the "Inflator" will need a persistent download cache to maintain anything like the rates of processing we enjoy today. Would it have that?
Or in git; here are all the times this .ckan has been modified:

    $ git log --format='%aI' -- Astrogator/Astrogator-v0.9.2.ckan
    2019-06-13T20:03:50+00:00
    2019-05-30T04:03:37+00:00

(Except that the bot currently does a shallow checkout of CKAN-meta; we'd need to do a full checkout to be able to retrieve the original release date.)
Yep. I intend to create a shared cache on the local filesystem that the inflator, archiver, etc. can access. Managing cross-container ephemeral data is something I already do a lot in my day job, so it's a pretty easy thing to solve. Netkan can already take a cache path; pretend that path will exist and point to the right place. We can leave cache management out of Netkan for now and keep a scheduled task to clean it up (a super trivial thing to write). If we ever want to have multiple instances processing this data, then we'll have multiple caches, but I don't see that as a big problem. We could also use something like EFS; for such a small amount of data it wouldn't be expensive, and we're not expecting ludicrously high I/O, so performance wouldn't be an issue. Either way, a problem for the future.
That's a neat idea. I think we should treat populating historical data as a separate exercise, as it'll be a one-off (like how we backfilled all the hashes and then uploaded compatible mods to the archive). I think we could add an indexed field to the spec and resulting metadata. As we're processing messages from a queue instead of spitting things out to a filesystem, checking git status, and committing if there are changes, there is scope to be a little smarter with our comparison. It does mean that the bot is no longer producing 100% the same data every time. Either way, with a newer, more flexible architecture I'm comfortable moving that to a future-us problem. We can definitely focus down to a design for that specific feature and come up with something that is reliable and achieves the stated goal.
I've started a branch built on top of #2788 to refactor Netkan to allow driving it with a queue. The messaging stubs are here: https://github.com/HebaruSan/CKAN/blob/feature/netkan-sqs/Netkan/Processors/QueueHandler.cs
OMG, you are amazing! I'm setting a reminder to look at this on Sunday when I have some time.
Just a brief comment: one of the hugely awesome things about the SDK is that authentication resolution is automatic. You can see the constructor here and what the resolution order is here. It's very unlikely we'd want to hard-code a set of credentials deep into the application. What it means is that for testing you can use an API key via environment variables or a .aws/credentials file, and in production we can allow the instance access via the IAM Instance Profile without changing anything in the application.
So... are you saying to use the constructor without the key parameters? I'm still completely guessing how SQS works, mostly going by sample programs as I can find them.
Yup, looking at the list of examples on the first link, it covers that scenario. It's a pretty common pattern for AWS.
Does the first link go where you meant it to? I get the root node of a huge "AWS SDK for .NET Version 3 API Reference" list which doesn't have any authentication stuff noticeably on it.
EDIT: Possibly resolved this...
I'm going to look it up at startup based on the name, partly because the above brain dump already lists what the names are likely to be.
EDIT: Think I found the answers to these...
Sounds like they stay till deleted.
I think this is what …
Design question:
... implies that the scheduler must have a checked-out copy of the NetKAN repo (because it must know all the current mod identifiers). But it also implies that the inflator must have a copy of the same NetKAN repo (because only the identifier is sent, so it must get the metadata from outside the message, and because it can be told to do a pull). This seems duplicative. What do you think of the idea of the scheduler sending the JSON from the .netkan file as the body of the message? Then the infrastructure would only need one copy of the NetKAN repo (in the scheduler), and a signaling mechanism for repo pulls would no longer be needed (because the scheduler could send the right version of the JSON if it knows a pull is needed).
Gah, the silly page doesn't put the links at the top and I was rushing it out of my brain! https://docs.aws.amazon.com/sdkfornet/v3/apidocs/items/SQS/TSQSClient.html We can use environment variables for everything, so in theory the client constructor doesn't require anything passed to it (this is how I do it in Python all the time and the C# looks the same, but YMMV). These keys are generally all that's needed:
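Presumably the standard AWS SDK variables; a sketch with placeholder values (not the project's real keys or region), plus the resulting parameterless constructor:

```csharp
// Standard AWS SDK environment variables (placeholder values only):
//   AWS_ACCESS_KEY_ID     = AKIA...
//   AWS_SECRET_ACCESS_KEY = ...
//   AWS_REGION            = us-east-1   // example region, not necessarily ours
using Amazon.SQS;

public static class SqsClientFactory
{
    // With those variables set, no credentials appear in code; in
    // production an IAM instance profile fills the same role.
    public static IAmazonSQS Create() => new AmazonSQSClient();
}
```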
The region we use for everything is:
The queue URLs should be consistent, and I will provide them to the container, likely via an environment variable (this information is very easy to pull from the queue and pass to the container configuration). One thing I didn't consider is that every time we check, add, delete, or send to a queue it is considered a message, which adds to our total messages sent. So we may want to have only an inbound and an outbound queue, as every time we check one it'll use a message (they're inexpensive, and we will be able to use a smaller instance, so we will save money overall); from a first-pass standpoint it might be acceptable to not overcomplicate the queues too much. We can use long polling to reduce the number of unnecessary messages we consume checking for new ones.
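A sketch of what long polling might look like on the C# side: setting WaitTimeSeconds (up to the maximum of 20) holds the receive call open until a message arrives instead of returning empty immediately, so idle polling burns far fewer billable requests. The helper name is made up:

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using Amazon.SQS;
using Amazon.SQS.Model;

public static class LongPoller
{
    public static async Task<List<Message>> ReceiveAsync(IAmazonSQS client, string queueUrl)
    {
        var response = await client.ReceiveMessageAsync(new ReceiveMessageRequest
        {
            QueueUrl            = queueUrl,
            MaxNumberOfMessages = 10,  // also the per-request maximum
            WaitTimeSeconds     = 20,  // long poll: wait up to 20s for a message
        });
        return response.Messages;      // empty list if nothing arrived in time
    }
}
```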
That's actually a really good idea, and it means we don't have to worry about git operations inside of Netkan. Leaving the cache management outside it as well means the only thing we're adding is SQS; the rest of the operations already exist. Wrap the inflate in a try/catch: on success, send the resulting ckan; on failure, send the error details as the payload instead. Set a success attribute with True/False as applicable for each scenario. Tomorrow (in about 18 hours or so) I plan to generate some queues. I will start by finishing my template for creating the dev queues + IAM permissions to be able to send/receive from them. Afterwards I'll hack together some Python and test the interactions.
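Roughly what that flow could look like in the inflator; Inflate() is a stand-in for Netkan's real transformation step, and the queue URL parameter and attribute name are illustrative:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Amazon.SQS;
using Amazon.SQS.Model;

public static class InflationSender
{
    public static async Task InflateAndSendAsync(IAmazonSQS client, string outboundUrl, string netkanJson)
    {
        string body;
        bool   success;
        try
        {
            body    = Inflate(netkanJson);  // on success, the resulting .ckan is the payload
            success = true;
        }
        catch (Exception exc)
        {
            body    = exc.Message;          // on failure, the error details are the payload
            success = false;
        }
        await client.SendMessageAsync(new SendMessageRequest
        {
            QueueUrl    = outboundUrl,
            MessageBody = body,
            MessageAttributes = new Dictionary<string, MessageAttributeValue>
            {
                ["Success"] = new MessageAttributeValue
                {
                    DataType    = "String",
                    StringValue = success.ToString(),
                },
            },
        });
    }

    // Placeholder for the actual netkan -> ckan transformation.
    private static string Inflate(string netkanJson) => throw new NotImplementedException();
}
```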
Cool, I will update my prototype code to receive the JSON in the body and not look for a Refresh attribute. I have been pushing updates here as I learn more about SQS: https://github.com/HebaruSan/CKAN/blob/feature/netkan-sqs/Netkan/Processors/QueueHandler.cs |
So I've spent a little more time reading the documentation and doing some testing; we have some capacity to be a little smart about how we handle messages. Now, as far as webhooks go, we've never really guaranteed them to be instant, but they are currently instant. It might be worth running two instances of the inflator: one to process the webhooks and one for the scheduler. I also think a singular queue for the inflated metadata, with a 'staged' attribute set on the message, makes more sense. It was hard before because of the monolithic nature of the old indexer, which is not something we need to be concerned about now.
If we don't do batch deletes we'll end up around the 5 million message mark including webhooks (which is only a couple of dollars); if we process everything in the max batch size of 10, we can do almost all our processing within the free tier of a million requests per month. Either way we'll be in a much better place and can iterate once the basics are working.
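A sketch of the batched delete, assuming the messages come from a single receive call (so at most 10, the batch limit); the helper name is invented:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Amazon.SQS;
using Amazon.SQS.Model;

public static class BatchDeleter
{
    // Acknowledge up to 10 processed messages with one billable request
    // instead of 10 individual DeleteMessage calls.
    public static Task DeleteAllAsync(IAmazonSQS client, string queueUrl, List<Message> messages)
    {
        var entries = messages
            .Select((msg, i) => new DeleteMessageBatchRequestEntry(i.ToString(), msg.ReceiptHandle))
            .ToList();
        return client.DeleteMessageBatchAsync(new DeleteMessageBatchRequest(queueUrl, entries));
    }
}
```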
I also whipped up a quick Python 3 script for submitting NetKANs to a queue: https://gist.github.com/techman83/0604f4dc9f849aac605e46f494de9403 It requires boto3, but the rest is part of the stdlib.
The 2 queues I created for testing:
I thought I read somewhere that FIFOs are more expensive than regular queues that don't guarantee the ordering. Could that be another cost-saving opportunity? While an orderly alphabetical processing sequence would be nice, I don't think it would be mission-critical.
But this still leaves the problem that we will have outdated netkan data in the NetKAN repository, doesn't it?
Wow indeed. That is rough! I think initially I'll bring across the current webhooks, as I'm conscious about not dragging this out too much. But I absolutely think we can do better. Current state of play:
Things left to do
It probably seems like a lot, but most of the effort has been rewriting the indexing logic, which is much, much simpler now. I'll get that tested and knock out the rest of the bits and pieces. I'm looking forward to replacing the old infrastructure; it'll make it much easier for us to make iterative improvements!
A quick update, as I haven't been idle, but I am getting smashed at work; we're doing a pretty big backend re-architecture, so that's absorbing a lot of cycles! However, I have a full integration test done. I can fire a netkan at the Incoming queue, the inflator does its thing and throws it at the outgoing queue, then the indexer picks it up, processes it, stages it if necessary, and updates DynamoDB with the current status. I've also got the initial Dockerfile and compose file done for dev/production building; there are only some minor tweaks needed to clean up some things there. I have had a brief look at the C#, and whilst I'm pretty sure I could figure it out, I would highly appreciate it if someone could knock off these things for Netkan, after which I'll submit a PR with a Dockerfile and container build process:
Keep being awesome everyone ❤️
Can you not specify these on the command line? I was assuming this would look like:

    netkan.exe --queues $QUEUES --github-token $TOKEN --cachedir $DIR

I will take care of
That involves invoking a shell; Docker runs the command directly. Now, there are a myriad of ways to achieve it, but the application consuming environment variables is the cleanest way. It would mean that you could launch netkan in queue mode with a docker run command using the command-line switches. I'll sort something out in the interim.
I did! Running it as a plain command runs it in a shell.
It needs some automation work, but we have a Docker Hub organisation!
Don't worry @HebaruSan, it'll all become apparent what Docker is doing to really speed up our deployment to production :) I also found out AWS SQS has released virtual queues! It's certainly not something we need to worry about right now, but it's something I really wanted when I was thinking about how to handle webhooks. I'll be moving the new repo under the organisation account in the next couple of days and getting Travis sorted to build and submit the containers.
The SpaceDock folks were kind enough to teach me enough Docker to be dangerous while I was looking at that PR, and I definitely see the potential of this now. When do we start writing the scheduler? 😁
Haha, I know, right! It has already begun! KSP-CKAN/NetKAN-Infra@2e6f290 I don't expect it to take too long to write. I'm gonna keep it pretty simple to begin with, and then we can add some smarts later.
@DasSkelett you'll be happy to know status will be taken care of via KSP-CKAN/NetKAN-Infra@18e3efa and KSP-CKAN/NetKAN-Infra@f6f259c. I'd love to replace this at some point, but frontend stuff isn't my forte. I know at work we're doing some super cool stuff with GraphQL + React: essentially, if a query doesn't change it is cached until such time as it gets invalidated, meaning our new frontend rarely needs to hit the backend. That would be really nifty for exposing the DynamoDB stuff without needing to scale up the read credits.
It. Is. Done! Also, we have our first commit from the new infrastructure! ...I need to fix the bot's gitconfig, it would seem 😉
History
NetKAN started its life as a Perl script in late 2014, followed by a single Python/bash script, then another Perl script, which inspired NetKAN-bot: a modular Perl application with a bunch of libraries and a bunch of scripts to do things like regular indexing, webhooks, and mirroring to archive.org.
Problems
NetKAN-bot has served its purpose since it was written around 4 years ago; however, with the growth in the number of mods, along with the majority of the metadata now being generated via the bot, it is really starting to creak under the pressure. From a build perspective the application is reasonably straightforward; the infrastructure is not, and that makes deploying changes a bit of a pain.
Rationale
One of the things that really drove the development of NetKAN-bot was ensuring the metadata produced complied with our schema. @HebaruSan has done a bunch of excellent work in #2788, meaning we could drop a lot of logic, making for a far simpler indexer. Splitting these up and using a FIFO queue also means we don't have to be so defensive about our git repo getting messed up, as we don't need to check outputs or even touch the filesystem if the payload and the file on disk are identical.
Initial Brain Dump
NetKAN-bot will be replaced by a bunch of distinct containers with specific roles.
Inflator
@DasSkelett / @HebaruSan - How much work do you think would be involved in this? I'm happy to do the Docker/AWS architecture, as I pretty much breathe this stuff at work; C# is not something I do a lot of.
Initially there will be 2 incoming and 2 outgoing queues.
Incoming:
Outgoing:
Implementation
Will require some new features added to NetKAN:
Assumptions that can be made
Indexer
Takes care of processing the outgoing messages: check if the payload is different, then update, commit, push, and update the status.
The idea behind having a separate queue is to avoid the current situation of needing to be really careful about our repo state; if we process them separately, we can manage this much more easily.
It will also update the status, likely in a DynamoDB table, from which we can in the interim periodically produce a JSON file and upload it to S3.
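To make the status write concrete, a hedged sketch using the .NET DynamoDB SDK; the table and attribute names are invented for illustration, not a decided schema:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Amazon.DynamoDBv2;
using Amazon.DynamoDBv2.Model;

public static class StatusUpdater
{
    public static Task UpdateAsync(string identifier, bool indexed)
    {
        // Same automatic credential resolution as the SQS client.
        var client = new AmazonDynamoDBClient();
        return client.PutItemAsync(new PutItemRequest
        {
            TableName = "NetKANStatus",  // hypothetical table name
            Item = new Dictionary<string, AttributeValue>
            {
                ["ModIdentifier"] = new AttributeValue { S = identifier },
                ["LastChecked"]   = new AttributeValue { S = DateTime.UtcNow.ToString("o") },
                ["Indexed"]       = new AttributeValue { BOOL = indexed },
            },
        });
    }
}
```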
Scheduler
Periodically check instance CPU credit levels, then submit all NetKANs to the queue for inflation.
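A rough sketch of the submission half (leaving out the CPU-credit check), batching sends in tens to keep the request count down; the directory layout and queue URL are placeholders, and a FIFO queue would additionally need a MessageGroupId per entry:

```csharp
using System.IO;
using System.Linq;
using System.Threading.Tasks;
using Amazon.SQS;
using Amazon.SQS.Model;

public static class Scheduler
{
    public static async Task SubmitAllAsync(IAmazonSQS client, string queueUrl, string netkanDir)
    {
        var files = Directory.EnumerateFiles(netkanDir, "*.netkan").ToList();
        for (int i = 0; i < files.Count; i += 10)  // 10 = max SQS batch size
        {
            var entries = files.Skip(i).Take(10)
                .Select((path, j) => new SendMessageBatchRequestEntry(
                    j.ToString(),             // entry id, unique within the batch
                    File.ReadAllText(path)))  // netkan JSON as the message body
                .ToList();
            await client.SendMessageBatchAsync(new SendMessageBatchRequest(queueUrl, entries));
        }
    }
}
```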
WebHook Processing
This may actually be a Lambda script behind a load balancer, or it may reside on the instance (load balancers are kind of expensive for this small use case). Either way, it will take care of receiving inflation requests and creating an SQS message on the webhook queue, and CKAN-meta changes can spawn messages for the mirroring to take place.
Mirror processing
This will take care of processing the mirror queue and submitting applicable mods to archive.org for preservation.
Summary
It's likely we can achieve this in a few steps and the idea will be to build automated processes for deployment of changes.
Extra