Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fpmsyncd] Bug Fix #12625 for Upgrade from 201911 to 202205 is failing #2544

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

nkelapur
Copy link
Contributor

@nkelapur nkelapur commented Nov 22, 2022

Fixes sonic-net/sonic-buildimage#12625

fpmsyncd crash (#12625)

    - What I did
            On warm-reboot ignore the assert if a new dB filed is
            detected in the new build

    - How I did it
            ignore the assert in case new field
            is detected during warm-reboot

    - How to verify it
            Verified by warm-reboot with
            new software  version having
            new field added for ROUTE_TABLE

    Signed-off-by: nikhil.kelapure@broadcom.com

What I did

Why I did it

How I verified it

Details if related

fpmsyncd crash (#12625)

        - What I did
                On warm-reboot ignore the assert if a new dB filed is
                detected in the new build

        - How I did it
                ignore the assert in case new field
                is detected during warm-reboot

        - How to verify it
                Verified by warm-reboot with
                new software  version having
                new field added for ROUTE_TABLE

        Signed-off-by: nikhil.kelapure@broadcom.com
@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Nov 22, 2022

CLA Signed

The committers listed above are authorized under a signed CLA.

fix typo
@nkelapur
Copy link
Contributor Author

re test please

@nkelapur
Copy link
Contributor Author

retest please

@adyeung
Copy link

adyeung commented Nov 23, 2022

/azpw run Azure.sonic-swss

1 similar comment
@nkelapur
Copy link
Contributor Author

/azpw run Azure.sonic-swss

@mssonicbld
Copy link
Collaborator

/AzurePipelines run Azure.sonic-swss

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@nkelapur
Copy link
Contributor Author

/azpw run Azure.sonic-swss

@mssonicbld
Copy link
Collaborator

/AzurePipelines run Azure.sonic-swss

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@adyeung
Copy link

adyeung commented Nov 28, 2022

@vaibhavhd @prsunny pls help review the fix

@@ -280,6 +280,12 @@ bool WarmStartHelper::compareAllFV(const std::vector<FieldValueTuple> &v1,
for (auto &v2fv : v2)
{
auto v1Iter = v1Map.find(v2fv.first);
#if 0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets not have dead code checked-in

Removed dead code
/*
* New field added for the refresh entry
* hence return 'no match'
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ignoring all new attributes sounds dangerous to me. Do we have a way to only ignore extra attributes for the objects that we have analyzed and assessed? @kcudnik any suggestions?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nkelapur please take a look at this. To be unblocked, can we add a specific skip on assert condition for the extra attribute (weight) that caused the issue?
I agree that ignoring all the extra attributes can bring us more unwarranted issues. Specific ignore helps us control what exactly is allowed/denied functionaly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vaibhavhd I don't completely agree with this change.
1> this warm-restart helper class is mainly used in fpmsyncd. Not used anywhere else.
2> We are in full control of what is going into app-db, in this case for route entries.
I think this whole point of assert was the design assumption earlier was the the fields before and after warm-restart will same.
However that design does not apply anymore, since the number of fields is different before and after warm-upgrade. So in this case the general design should be that this api returns "true" such that after warm-reboot reconciliation the new entry will get written to the app-db. This change is implementing it.

If we check the field for a particular field name ex "weight" in this case, then that will make this code non-modular and difficult to maintain.
At a later point when more new fields are added ex next-hop-group, then another check has to be added to check for that field name.
Also if we plan to use this for something other than route entries, then there will be fields specific to that table we might need to skip.

Please note. We are not ignoring all the new attributes. This api will return true if it finds a non matching attribute, meaning that the entry is different and needs to be set to the app-db at reconcillation. Thus app-db will be updated with the new entry with the extra new attribute and that will trigger OA to take further action.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nkelapur I believe this issue should not be fixed in SONiC. When an new attribute is needed, SAI/SDK should add it in during warm recovery, with SONiC adding it with db_migrator, we have a complete loop.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yxieca Completely agree with this. If db_migrator take care of adding the new attribute, then we don't need to remove this assert.

@yxieca yxieca requested review from yxieca and kcudnik December 2, 2022 16:01
@prsunny
Copy link
Collaborator

prsunny commented Jan 10, 2023

@nkelapur , @yxieca , is this PR still pursued?

@yxieca
Copy link
Contributor

yxieca commented Jan 11, 2023

@nkelapur , @yxieca , is this PR still pursued?

No, we decided to pursue the db_migrator appraoch. we should close this PR.

However, we are unable to repro the issue lately, we suspect the issue might have been caused by the service unmasking issue that was introduced and fixed recently. The timing matches.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[warm-reboot] Upgrade from 201911 to 202205 is failing due to fpmsyncd crash
6 participants