
🐞 [Bug]: wrong free ips node filtering #808

Closed
Omarabdul3ziz opened this issue Feb 11, 2024 · 11 comments
Assignees
Labels
grid-proxy belongs to grid proxy type_bug Something isn't working
Milestone

Comments

@Omarabdul3ziz
Contributor

What happened?

Filtering the nodes with free_ips returns nodes on freefarm: https://gridproxy.dev.grid.tf/nodes?free_ips=1
However, checking freefarm's IPs shows they are all reserved with a contract ID: https://gridproxy.dev.grid.tf/farms?farm_id=1

Which network/s did you face the problem on?

Dev

Twin ID/s

No response

Node ID/s

No response

Farm ID/s

No response

Contract ID/s

No response

Relevant log output

none
@Omarabdul3ziz Omarabdul3ziz added type_bug Something isn't working grid-proxy belongs to grid proxy labels Feb 11, 2024
@Omarabdul3ziz Omarabdul3ziz added this to the 1.0.0 milestone Feb 11, 2024
@Omarabdul3ziz Omarabdul3ziz self-assigned this Feb 14, 2024
@Omarabdul3ziz Omarabdul3ziz moved this to In Progress in 3.13.x Feb 14, 2024
@Omarabdul3ziz
Contributor Author

Debugging logs

  • I tried to reproduce the issue locally by running the same proxy version as on dev at the time (v0.13.21) with a fresh dump from the dev-net cluster. I checked the queries and everything looked fine; the issue wasn't reproducible, and the free_ips filter worked as expected on the locally running server.

  • The first thing I suspected was an out-of-sync state between the two tables: public_ip_cache, the one we use for querying, and public_ip, the table that comes from the processor_db, whose data is processed directly from the chain.
    The events that may happen on the public_ip table and should be reflected in the public_ip_cache table are:

    • reserve ip > free_ips decrease
    • unreserve ip > free_ips increase
    • insert new ip (expected to be free) > free_ips increase
    • remove reserved ip > free_ips does not change
    • remove free ip > free_ips decrease

    I manually tried every one of those events, and all behaved as expected.
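
    The event-to-counter mapping above can be sketched as follows (an illustrative Python rendering of the bookkeeping described here, not the actual grid-proxy code, which implements this in SQL triggers on the public_ip table):

    ```python
    def free_ips_delta(event: str, had_contract: bool = False) -> int:
        """Change to a farm's free_ips counter for one public_ip event.

        `had_contract` only matters for removals: removing a reserved ip
        leaves the free count alone, removing a free ip decreases it.
        """
        if event == "reserve":      # ip gets a contract_id
            return -1
        if event == "unreserve":    # contract_id cleared
            return 1
        if event == "insert":       # new ip, expected to be free
            return 1
        if event == "remove":
            return 0 if had_contract else -1
        raise ValueError(f"unknown public_ip event: {event}")
    ```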

  • Then I looked at the dev proxy and found that the issue only happens on freefarm; no other farm has the same issue. The farm shows 18 more free IPs than it should,
    while the running processor_db on the cluster shows that it has only one free IP:
    (screenshot)

  • I tried the whole flow by deploying a VM with a public IP on freefarm from the dashboard, and the table reflected the changes correctly, incrementing/decrementing the free_ips count on deploying/removing the VM.

  • the logic for counting the free_ips for the farm goes like this:

    • at start-up, we aggregate the ips with no contract_id for each farm
    • then on each event, we increment/decrement by one as listed above for the event

    This is more efficient than re-aggregating the whole table each time an event happens, but it also introduces a problem: if anything goes wrong during the lifetime of the proxy, the error stays there until a restart happens.
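
    The two-phase strategy can be sketched like this (a minimal Python illustration, assuming a simple list of ip rows; the real logic runs as SQL against the public_ip and public_ip_cache tables). It also shows why a single mishandled event would leave the counter wrong until the next restart:

    ```python
    from collections import defaultdict

    def aggregate_free_ips(public_ips):
        """Start-up pass: count ips with no contract_id, grouped by farm."""
        counts = defaultdict(int)
        for ip in public_ips:
            if ip["contract_id"] == 0:
                counts[ip["farm_id"]] += 1
        return counts

    def on_event(counts, farm_id, delta):
        """Incremental update: O(1) per event instead of a full re-aggregation,
        but any missed or double-counted event persists until a restart."""
        counts[farm_id] += delta

    # hypothetical data: farm 1 has one free and one reserved ip
    ips = [
        {"farm_id": 1, "contract_id": 0},
        {"farm_id": 1, "contract_id": 42},
    ]
    counts = aggregate_free_ips(ips)
    on_event(counts, 1, -1)  # farm 1's free ip gets reserved
    ```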

  • What I suspect is that the issue happened in one of two ways:

    • the data was changed manually for the farm (which I doubt), or
    • a new IP was added with a contract ID already assigned (this would break the assumption above, but it shouldn't be possible AFAIK).
  • I think restarting the proxy will fix the issue, and I will keep an eye on this one in case it happens again.

@muhamadazmy
Member

muhamadazmy commented Feb 15, 2024

So as I understand it, it goes as follows:

  • On proxy start, it reconstructs the IPs table by aggregating all IPs per farm to build a table with farm (total IPs, free IPs, etc.).
  • When an event is received (I assume contract creation/deletion), the IP count for that farm is incremented/decremented.
  • Do we also handle events for Add IP, Delete IP, and Farm create? You can create a new farm with a full set of IPs in one call.
    • Do we have other calls that can modify the IP list of a farm?
  • What happens when the event handler fails (is this part of the graphql processors, or does it run in the proxy)?

@Omarabdul3ziz
Contributor Author

  • Yes, both add/delete IP events are handled, and also reserve/unreserve. Actually, we don't handle the chain events directly, but the changes on the public_ip table.
  • Handling chain event failures is part of the graphql processor.
  • Chain events (delete/add IP or reserve/unreserve IP) >> the graphql processor dumps them into the public_ip table >> triggers on this table then update the public_ip_cache table.
  • I wasn't aware of the event that creates a farm with IPs, but I think it will eventually make an insert call to the public_ip table, which is a covered event.

@muhamadazmy
Member

Yeah, if you don't handle chain events directly and only changes to the public_ip table, then I think it should cover all changes.

How do you handle changes to the public_ip table? Do you re-aggregate the entire table to update the cache, or how is it done?

@Omarabdul3ziz
Contributor Author

No, for the free_ips and total_ips fields I just increment/decrement based on the change,
and for the ips object, yes, I do re-aggregate and build the object again.
Here is exactly the part where the logic for this lives.

@muhamadazmy
Member

I don't fully understand this code, and I don't have deep experience with SQL. I would recommend documenting the SQL procedure and getting someone to take another look.

@Omarabdul3ziz
Contributor Author

Update

  • Unfortunately, restarting didn't solve the problem. It did fix the data on the processor_db
    (screenshot)
    but the filter is still not working properly on the proxy. It's strange, because running the same query generated by the URL /nodes?free_ips=1 against the live processor_db gives the expected data.
    I'll need to double-check with the ops team whether this pod is actually the one used by the proxy, or whether any caches are causing issues.

  • Also, I had a call with @AhmedHanafy725 to discuss whether there is any event case I might have missed that could add more free_ips, e.g. whether it is possible to add a new IP that is already reserved, and he confirmed it is not possible. But we noticed a few things:

  • When trying to add a farm with a set of IPs, we found that graphql does not sync properly; it only adds one of the IPs. Here is an issue for that: Added public ips on a farm are not always synced with the chain tfchain_graphql#157

  • Also, while reading the graphql code, we found that the contract_id field may get updated even if its value hasn't changed. This means we could have an update event on the public_ip table that updates the contract ID from 0 to 0, or from one value to the same value. We assumed this was not possible, so the trigger wasn't covering this case. Here is a small PR with a stricter condition for IP changes: strict condition for free_ips changes #835

  • I will deploy this change, verify with ops that I was investigating the correct database, and then check whether the issue is still occurring.
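
The stricter trigger condition can be illustrated like this (a hypothetical Python rendering of the guard described above; the actual fix in #835 is a condition in the SQL trigger). A no-op UPDATE that leaves contract_id unchanged, e.g. 0 → 0 or x → x, must not touch the counter:

```python
def free_ips_delta_on_update(old_contract_id: int, new_contract_id: int) -> int:
    """Change to free_ips when a public_ip row's contract_id is updated."""
    if old_contract_id == new_contract_id:
        return 0        # no-op update (0 -> 0, or x -> x): ignore it
    if old_contract_id == 0:
        return -1       # free ip became reserved
    if new_contract_id == 0:
        return 1        # reserved ip became free
    return 0            # reserved -> reserved (contract changed hands)
```

Before this guard, the 0 → 0 updates emitted by graphql were counted as "unreserve" events, inflating free_ips a little more each time.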

@Omarabdul3ziz
Contributor Author

Update

  • After checking with ops, it is confirmed that the stack on the Kubernetes cluster is not the one in use right now; instead we use a docker-compose stack for dev net. So it makes sense now why restarting the proxy pod on the cluster didn't fix the data.
  • After deploying the changes (strict condition for free_ips changes #835) with version 0.14.4 and restarting the proxy, the issue is fixed on dev now.

@Omarabdul3ziz Omarabdul3ziz moved this from In Progress to Pending review in 3.13.x Feb 18, 2024
@Omarabdul3ziz Omarabdul3ziz moved this from Pending review to Pending Deployment in 3.13.x Feb 18, 2024
@Omarabdul3ziz Omarabdul3ziz moved this from Pending Deployment to In Verification in 3.13.x Feb 19, 2024
@Omarabdul3ziz
Contributor Author

issue still happening on mainnet threefoldtech/tfgrid-sdk-ts#2300 (comment)

@Omarabdul3ziz Omarabdul3ziz moved this from In Verification to Pending review in 3.13.x Mar 13, 2024
@Omarabdul3ziz Omarabdul3ziz moved this from Pending review to In Progress in 3.13.x Mar 13, 2024
@Omarabdul3ziz
Contributor Author

At first I suspected a problem with the triggers, as that had happened before, but I checked the two IP tables and they are in sync.
(screenshot)

But checking the data on the chain, I found the graphql processor and the chain out of sync.
(screenshot)
(screenshot)

I commented on an issue there: threefoldtech/tfchain_graphql#157 (comment)

@Omarabdul3ziz
Contributor Author

Summing up what Sameh explained in the mentioned issue,

out-of-sync states can happen in two ways:

  • The chain has more IPs than the db:
    the processor had some validation that prevented adding duplicated IPs (which do exist on the chain) to the db, while this validation should happen either on the chain level or from the grid client.
  • The db has more IPs than the chain:
    a previous chain migration removed some invalid IPs; these validations were only recently integrated into the processor.

Since the trigger that updates the public_ip cache table is correct, I think there is no more work we can do from the proxy side.

@Omarabdul3ziz Omarabdul3ziz moved this from Blocked to In Verification in 3.14.x Jun 6, 2024
@github-project-automation github-project-automation bot moved this from In Verification to Done in 3.14.x Jun 10, 2024
@rawdaGastan rawdaGastan modified the milestones: 1.1.0, v0.15.x Sep 25, 2024