
🐞 [Bug]: wrong free ips node filtering #808

Closed
Omarabdul3ziz opened this issue Feb 11, 2024 · 11 comments
Assignees
Labels
grid-proxy belongs to grid proxy type_bug Something isn't working
Milestone

Comments

@Omarabdul3ziz
Contributor

What happened?

Filtering the nodes with free_ips returns nodes on freefarm: https://gridproxy.dev.grid.tf/nodes?free_ips=1
However, checking freefarm's IPs shows they are all reserved with a contract ID: https://gridproxy.dev.grid.tf/farms?farm_id=1

Which network/s did you face the problem on?

Dev

Twin ID/s

No response

Node ID/s

No response

Farm ID/s

No response

Contract ID/s

No response

Relevant log output

none
@Omarabdul3ziz Omarabdul3ziz added type_bug Something isn't working grid-proxy belongs to grid proxy labels Feb 11, 2024
@Omarabdul3ziz Omarabdul3ziz added this to the 1.0.0 milestone Feb 11, 2024
@Omarabdul3ziz Omarabdul3ziz self-assigned this Feb 14, 2024
@Omarabdul3ziz Omarabdul3ziz moved this to In Progress in 3.13.x Feb 14, 2024
@Omarabdul3ziz
Contributor Author

Debugging logs

  • I tried to reproduce the issue locally by running the same proxy version as on dev at the time (v0.13.21) with a fresh dump from the dev-net cluster. I checked the queries and everything looked fine; the issue wasn't reproducible, and the free_ips filter worked as expected on the locally running server.

  • The first thing I suspected was an out-of-sync state between the two tables: public_ip_cache, the one we use for querying, and public_ip, the table that comes from the processor_db, whose data is processed directly from the chain.
    The events that may happen on the public_ip table and should be reflected in the public_ip_cache table are:

    • reserve ip > free_ips decrease
    • unreserve ip > free_ips increase
    • insert new ip (expected to be free) > free_ips increase
    • remove reserved ip > free_ips does not change
    • remove free ip > free_ips decrease

    I manually tried every one of those events, and all behaved as expected.
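
    The event-to-counter mapping above can be sketched as follows (an illustrative Python rendering of the bookkeeping described here, not the actual grid-proxy code, which implements this in SQL triggers on the public_ip table):

    ```python
    def free_ips_delta(event: str, had_contract: bool = False) -> int:
        """Change to a farm's free_ips counter for one public_ip event.

        `had_contract` only matters for removals: removing a reserved ip
        leaves the free count alone, removing a free ip decreases it.
        """
        if event == "reserve":      # ip gets a contract_id
            return -1
        if event == "unreserve":    # contract_id cleared
            return 1
        if event == "insert":       # new ip, expected to be free
            return 1
        if event == "remove":
            return 0 if had_contract else -1
        raise ValueError(f"unknown public_ip event: {event}")
    ```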

  • Then I looked at the dev proxy and found that the issue only happens on freefarm; no other farm has the same issue. The farm shows 18 more free IPs than it should,
    while the running processor_db on the cluster shows that it has only one free IP:
    (screenshot)

  • I tried the whole flow by deploying a VM with a public IP on freefarm from the dashboard, and the table reflected the changes correctly, incrementing/decrementing the free_ips count on deploying/removing the VM.

  • the logic for counting the free_ips for the farm goes like this:

    • at start-up, we aggregate the ips with no contract_id for each farm
    • then on each event, we increment/decrement by one as listed above for the event

    This is more efficient than re-aggregating the whole table each time an event happens, but it also introduces a problem: if anything goes wrong during the lifetime of the proxy, the error stays there until a restart happens.
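
    The two-phase strategy can be sketched like this (a minimal Python illustration, assuming a simple list of ip rows; the real logic runs as SQL against the public_ip and public_ip_cache tables). It also shows why a single mishandled event would leave the counter wrong until the next restart:

    ```python
    from collections import defaultdict

    def aggregate_free_ips(public_ips):
        """Start-up pass: count ips with no contract_id, grouped by farm."""
        counts = defaultdict(int)
        for ip in public_ips:
            if ip["contract_id"] == 0:
                counts[ip["farm_id"]] += 1
        return counts

    def on_event(counts, farm_id, delta):
        """Incremental update: O(1) per event instead of a full re-aggregation,
        but any missed or double-counted event persists until a restart."""
        counts[farm_id] += delta

    # hypothetical data: farm 1 has one free and one reserved ip
    ips = [
        {"farm_id": 1, "contract_id": 0},
        {"farm_id": 1, "contract_id": 42},
    ]
    counts = aggregate_free_ips(ips)
    on_event(counts, 1, -1)  # farm 1's free ip gets reserved
    ```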

  • What I suspect is that the issue happened in one of two ways:

    • the data was changed manually for the farm (which I doubt), or
    • a new IP was added with a contract ID already assigned (this would break the assumption above, but it shouldn't be possible AFAIK).
  • I think restarting the proxy will fix the issue, and I will keep an eye on this one in case it happens again.

@muhamadazmy
Member

muhamadazmy commented Feb 15, 2024

So as I understand it, it goes as follows:

  • On proxy start, it reconstructs the IPs table by aggregating all IPs per farm to build a table with farm (total IPs, free IPs, etc.).
  • When an event is received (I assume contract creation/deletion), the IP count for that farm is incremented/decremented.
  • Do we also handle events for Add IP, Delete IP, and Farm create? You can create a new farm with a full set of IPs in one call.
    • Do we have other calls that can modify the IP list of a farm?
  • What happens when the event handler fails (is this part of the graphql processors, or does it run in the proxy)?

@Omarabdul3ziz
Contributor Author

  • Yes, both add/delete IP events are handled, and also reserve/unreserve. Actually, we don't handle the chain events directly, but the changes on the public_ip table.
  • Handling chain event failures is part of the graphql processor.
  • Chain events (delete/add IP or reserve/unreserve IP) >> the graphql processor dumps them into the public_ip table >> triggers on this table then update the public_ip_cache table.
  • I wasn't aware of the event that creates a farm with IPs, but I think it will eventually make an insert call to the public_ip table, which is a covered event.

@muhamadazmy
Member

Yeah, if you don't handle chain events directly and only changes to the public_ip table, then I think it should cover all changes.

How do you handle changes to the public_ip table? Do you re-aggregate the entire table to update the cache, or how is it done?

@Omarabdul3ziz
Contributor Author

No, for the free_ips and total_ips fields I just increment/decrement based on the change,
and for the ips object, yes, I do re-aggregate and build the object again.
Here is exactly the part where the logic for this lives.

@muhamadazmy
Member

I don't fully understand this code, and I don't have deep experience with SQL. I would recommend documenting the SQL procedure and getting someone to take another look.

@Omarabdul3ziz
Contributor Author

Update

  • Unfortunately, restarting didn't solve the problem. It did fix the data on the processor_db
    (screenshot)
    but the filter is still not working properly on the proxy. It's strange, because running the same query generated by the URL /nodes?free_ips=1 against the live processor_db gives the expected data.
    I'll need to double-check with the ops team whether this pod is actually the one used by the proxy, or whether any caches are causing issues.

  • Also, I had a call with @AhmedHanafy725 to discuss whether there is any event case I might have missed that could add more free_ips, e.g. whether it is possible to add a new IP that is already reserved, and he confirmed it is not possible. But we noticed a few things:

  • When trying to add a farm with a set of IPs, we found that graphql does not sync properly; it only adds one of the IPs. Here is an issue for that: Added public ips on a farm are not always synced with the chain tfchain_graphql#157

  • Also, while reading the graphql code, we found that the contract_id field may get updated even if its value hasn't changed. This means we could have an update event on the public_ip table that updates the contract ID from 0 to 0, or from one value to the same value. We assumed this was not possible, so the trigger wasn't covering this case. Here is a small PR with a stricter condition for IP changes: strict condition for free_ips changes #835

  • I will deploy this change, verify with ops that I was investigating the correct database, and then check whether the issue is still occurring.
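
The stricter trigger condition can be illustrated like this (a hypothetical Python rendering of the guard described above; the actual fix in #835 is a condition in the SQL trigger). A no-op UPDATE that leaves contract_id unchanged, e.g. 0 → 0 or x → x, must not touch the counter:

```python
def free_ips_delta_on_update(old_contract_id: int, new_contract_id: int) -> int:
    """Change to free_ips when a public_ip row's contract_id is updated."""
    if old_contract_id == new_contract_id:
        return 0        # no-op update (0 -> 0, or x -> x): ignore it
    if old_contract_id == 0:
        return -1       # free ip became reserved
    if new_contract_id == 0:
        return 1        # reserved ip became free
    return 0            # reserved -> reserved (contract changed hands)
```

Before this guard, the 0 → 0 updates emitted by graphql were counted as "unreserve" events, inflating free_ips a little more each time.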

@Omarabdul3ziz
Contributor Author

Update

  • After checking with ops, it is confirmed that the stack on the Kubernetes cluster is not the one in use right now; instead we use a docker-compose stack for dev net. So it makes sense now why restarting the proxy pod on the cluster didn't fix the data.
  • After deploying the changes (strict condition for free_ips changes #835) with version 0.14.4 and restarting the proxy, the issue is fixed on dev now.

@Omarabdul3ziz Omarabdul3ziz moved this from In Progress to Pending review in 3.13.x Feb 18, 2024
@Omarabdul3ziz Omarabdul3ziz moved this from Pending review to Pending Deployment in 3.13.x Feb 18, 2024
@Omarabdul3ziz Omarabdul3ziz moved this from Pending Deployment to In Verification in 3.13.x Feb 19, 2024
@Omarabdul3ziz
Contributor Author

issue still happening on mainnet threefoldtech/tfgrid-sdk-ts#2300 (comment)

@Omarabdul3ziz Omarabdul3ziz moved this from In Verification to Pending review in 3.13.x Mar 13, 2024
@Omarabdul3ziz Omarabdul3ziz moved this from Pending review to In Progress in 3.13.x Mar 13, 2024
@Omarabdul3ziz
Contributor Author

At first I suspected a problem with the triggers, as that had happened before, but I checked the two IP tables and they are in sync.
(screenshot)

But checking the data on the chain, I found the graphql processor and the chain out of sync.
(screenshot)
(screenshot)

I commented on an issue there: threefoldtech/tfchain_graphql#157 (comment)

@Omarabdul3ziz
Contributor Author

Summing up what Sameh explained in the mentioned issue,

out-of-sync states can happen in two ways:

  • The chain has more IPs than the db:
    the processor had some validation that prevented adding duplicated IPs (which do exist on the chain) to the db, while this validation should happen either on the chain level or from the grid client.
  • The db has more IPs than the chain:
    a previous chain migration removed some invalid IPs; these validations were only recently integrated into the processor.

Since the trigger that updates the public_ip cache table is correct, I think there is no more work we can do from the proxy side.

@Omarabdul3ziz Omarabdul3ziz moved this from Blocked to In Verification in 3.14.x Jun 6, 2024
@github-project-automation github-project-automation bot moved this from In Verification to Done in 3.14.x Jun 10, 2024
@rawdaGastan rawdaGastan modified the milestones: 1.1.0, v0.15.x Sep 25, 2024