Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[sfp-refactoring] Initial support for CMIS application initialization #876

Merged
merged 13 commits into from
Dec 1, 2021

Conversation

ds952811
Copy link
Contributor

@ds952811 ds952811 commented Oct 4, 2021

Signed-off-by: Dante Su dante.su@broadcom.com

Why I did it

Add initial support for CMIS application initialization and loopback

How I did it

  1. sonic-platform-common:
    1-1. sonic-xcvr/*/cmis.py: Add support for QSFP-DD application advertising and initialization
    1-2. sonic-xcvr/sfp_optoe_base.py: Add support for error description and CMIS application selection
    1-3. sonic_platform_base/sfp_base.py: Add stub routines for CMIS application initialization
  2. sonic-platform-daemon: Add CMIS manager tasks to parallelize the per-port CMIS application initialization
  3. sonic-utilities: Update sfputil and sfpshow to support CMIS application advertisement and error detection

How to verify it

  1. Issue 'show logging xcvrd' to check if the application initialization succeeded
  2. Issue 'show interfaces transceiver eeprom --dom' to verify both the EEPROM and DOM information
  3. Issue 'sudo sfputil show eeprom --dom' to verify both the EEPROM and DOM information
  4. Issue 'sfputil show error-status -hw' to verify the error description

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106

Description for the changelog

Repo PR title State
SONiC [sfp-refactoring] Initial support for CMIS application initialization
sonic-platform-common [sfp-refactoring] Initial support for CMIS application initialization
sonic-platform-daemon [sfp-refactoring] Initial support for CMIS application initialization

A picture of a cute animal (not mandatory but encouraged)

@ds952811
Copy link
Contributor Author

ds952811 commented Oct 4, 2021

Example Unit Test Result:
CMIS_UT_IX9.log

doc/sfp-cmis/cmis.md Outdated Show resolved Hide resolved

This document describes functional behavior of the QSFPDD CMIS support in SONiC.

The QSFPDD Common Management Interface Specification (CMIS) version 4.0 was finalized

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Document above references the CMIS 5.0 spec, this section is referencing CMIS 4.0. There are minimal differences, but which is being used here? Will CMIS 3.x support be included?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CMIS 3 is not be supported, only CMIS 4 and 5 will be supported.
And thanks for the comments, the terms of 'version 4' will be removed.

in May of 2019 and provides a variety of features and support for different transceiver
form factors. From a software perspective, a CMIS application initialization sequence
is now mandatory as per the current dynamic port breakout mode activated. And these is
also a new diagnostic support tha could be useful to SONiC users and developers.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: "tha"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks


## 1.3 Warm Boot Requirements

Functionality should continue to work across warm boot.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An important aspect of the warm boot requirements is that the datapath initialization sequence should not be performed during WarmBoot.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks

doc/sfp-cmis/cmis.md Outdated Show resolved Hide resolved
"ext_identifier": "N/A",
"ext_rateselect_compliance": "N/A",
"hardware_rev": "01",
"lane_count": "4",

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is media lane count, right? (host lane count would be 8)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, this is the effective lane counts in the line side

doc/sfp-cmis/cmis.md Outdated Show resolved Hide resolved
doc/sfp-cmis/cmis.md Outdated Show resolved Hide resolved
doc/sfp-cmis/cmis.md Outdated Show resolved Hide resolved
doc/sfp-cmis/cmis.md Outdated Show resolved Hide resolved
doc/sfp-cmis/cmis.md Outdated Show resolved Hide resolved
doc/sfp-cmis/cmis.md Outdated Show resolved Hide resolved
doc/sfp-cmis/cmis.md Outdated Show resolved Hide resolved
doc/sfp-cmis/cmis.md Outdated Show resolved Hide resolved
@ds952811 ds952811 changed the title sfp: initial support for CMIS [sfp-refactoring] Initial support for CMIS application initialization Nov 3, 2021
doc/sfp-cmis/cmis-init.md Outdated Show resolved Hide resolved
With CMIS application advertisement support, **show interfaces transceiver eeprom** is now as below.

```
admin@sonic:~$ show interfaces transceiver eeprom Ethernet0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

current show command output is like this:

admin@sonic:~$ show interface transceiver eeprom Ethernet0
Ethernet0: SFP EEPROM detected
        Application Advertisement: 400G CR8 - Copper cable
                                   200GBASE-CR4 (Clause 136) - Copper cable
                                   100GBASE-CR2 (Clause 136) - Copper cable
                                   100GBASE-CR4 (Clause 92) - Copper cable
                                   50GBASE-CR (Clause 126) - Copper cable
                                   50GBASE-CR2 with no FEC - Copper cable
                                   25GBASE-CR CA-N (Clause 110) - Copper cable
                                   1000BASE -CX(Clause 39) - Copper cable
                                   
        Connector: No separable connector
        Encoding: Not supported for CMIS cables
        Extended Identifier: Power Class 1(0.25W Max)
        Extended RateSelect Compliance: Not supported for CMIS cables
        Identifier: QSFP-DD Double Density 8X Pluggable Transceiver
        Length Cable Assembly(m): 0.5
        Nominal Bit Rate(100Mbs): Not supported for CMIS cables
        Specification compliance: Not supported for CMIS cables
        Vendor Date Code(YYYY-MM-DD Lot): 2021-05-14 
        Vendor Name: Mellanox
        Vendor OUI: 00-02-c9
        Vendor PN: MCP1660-W00AE30
        Vendor Rev: A3
        Vendor SN: MT2120VS03875

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the old one directly output the raw text fetched from the DB

- The **pmon#xcvrd** will detect the module type of the attched transceivers and post the
information to the **STATE_DB**
- If the transceiver is a CMIS module, the **pmon#xcvrd** will activate the appropriate
application mode as per the current port mode.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I have commented before, I would suggest adding a knob here, not all the platforms need to config this from the NOS layer. please refer to the media setting function currently supported by xcvrd, it provides a configuration file, and only the platform which needs it will turn it on via setting the configuration file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, you're right, maybe we should enhance the media_setting.json to conditionally activate the application initialization

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the suggestion to have a per-vendor/MPN setting in media_settins or a "global" setting? I'd be fairly strongly against a per-module setting since it means we must have advance knowledge that the module will be inserted in order to do CMIS-compliant initialization, or a "default" setting for it.

Stepping back a minute - to do speed control (supporting anything other than the default) on active CMIS modules, this feature is required. I have seen some cases where the modules are not fully CMIS-compliant and we have needed to override the application parsing/selection behavior (and internally we have made some changes to media_settings.json to accommodate this).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the suggestion to have a per-vendor/MPN setting in media_settins or a "global" setting? I'd be fairly strongly against a per-module setting since it means we must have advance knowledge that the module will be inserted in order to do CMIS-compliant initialization, or a "default" setting for it.

I would suggest a "global" setting, which allows a vendor to disable(not start) this new function on its platform(s).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per the last sync up with Prince, if I'm right about it, we'll drop these exception handlers from media_settings.json until there is a true requirement/request for this. Hence the references to media_settings.json have been removed in the latest code PRs.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per the last sync up with Prince, if I'm right about it, we'll drop these exception handlers from media_settings.json until there is a true requirement/request for this. Hence the references to media_settings.json have been removed in the latest code PRs.

then how to address the requirement that some platforms may not need to activate this function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, but I'm not following you, the CMIS init is a feature of the transceiver, it's not platform-specific, why do we want to have it disabled on some certain platform?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, but I'm not following you, the CMIS init is a feature of the transceiver, it's not platform-specific, why do we want to have it disabled on some certain platform?

a possibility is that some vendors have implemented this function in a lower layer, no extra configuration is needed from SONiC, so they can skip this function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy that, thanks for the clarification, will add one additional flag into the platform.json for this

This document describes functional behavior of the CMIS application initialization support in SONiC.

The Common Management Interface Specification (CMIS) provides a variety of features
and support for different transceiver form factors. A

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: odd formatting (does not affect final markdown)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, and it's fixed

- The **pmon#xcvrd** will detect the module type of the attched transceivers and post the
information to the **STATE_DB**
- If the transceiver is a CMIS module, the **pmon#xcvrd** will activate the appropriate
application mode as per the current port mode.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the suggestion to have a per-vendor/MPN setting in media_settins or a "global" setting? I'd be fairly strongly against a per-module setting since it means we must have advance knowledge that the module will be inserted in order to do CMIS-compliant initialization, or a "default" setting for it.

Stepping back a minute - to do speed control (supporting anything other than the default) on active CMIS modules, this feature is required. I have seen some cases where the modules are not fully CMIS-compliant and we have needed to override the application parsing/selection behavior (and internally we have made some changes to media_settings.json to accommodate this).

}
}
```
As of now, only **application_selection** is supported, and it defaults to **True** if unspecified

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this interact with the module's advertising? I assume app selection would not be done for QSFP/SFP+ modules, but passive cables do not require app selection either.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a couple of checkers in the CMIS Manager

  1. Skip if the cage/form-factor type is not QSFPDD (Or OSFP, but I'm not sure if it shares the same form-factor)
  2. Skip if the media type (i.e. byte 128 or 0 of the module) is not a QSFPDD
  3. Skip if the memory type of the module is FLAT
    Upon port change events, as we support only dynamic-port-breakout, which means sub-ports will share the same config, hence the application code could be determined by the config of anyone of the sub-port in the PORT_SET events.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And we also have another sanity check to skip the application initialization if

  1. Defaults to application 1 if none of the applications matched the current speed and lane count of the host interface
  2. Skip the application init if no application code update while the current datapath is activated and module state is ready.

- PortChangeEvent.PORT_DEL
Invoke the event callbacks upon **DEL** operations.

- As of now, the number of CmisManagerTask is hard coded to 4

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should take a strong look at xcvrd and consider creating a separate processing thread per xcvr. This allows for all slow i2c operations (DOM update, etc) to be done in a non-blocking manner.

Regardless, 4 is certainly not enough. If we assume 5 seconds per application selection and 32 modules, it would still take ~40 seconds to initialize the modules when it could actually be done in ~5s.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, agree, the CMIS transactions could take several seconds to finish (while it's possible to even scale up to minutes as per the CMIS SPEC, such modules are unlikely to be deployed in the SONIC, as of now)


- As of now, the number of CmisManagerTask is hard coded to 4
- Task 0 is dedicated to listen for **PortChangeEvent.PORT_SET** events, and put requests into the shared message queue
- Task 1-N are responsible for handling the CMIS application initializations in parallel,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current design still takes lot of time in initializing all the 32 modules being served by few threads. This is because the thread is monitoring the datapath state transitioning. If you save the state machine context in sfp object, then you just need to trigger DP init and move on to next task (serve another module) and by the time you finish trigger DP init for all the module you can come to the first module to see if it has completed its DP init or failed. The basic idea is SW doesn't need to spend idle time waiting for a module to finish its DP init, instead it can do useful task.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For eg. In Xcvrd, we need to wait for 2 secs for management init to complete after each module is brought out of reset, before doing 1st i2c transaction, we don't wait 2 secs * 32 = 64 secs. We just spend 2 to 3 secs for all 32 modules to complete management init delay.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While the delay of management init is supposed to be done at per-port module level, and the accurate delay is also module-specific, hence it's better to either add a SFP check code validation in the xcvrd before kicking off the CMIS init, or just check the module state after the interrupt is asserted.

since all the requests are from the shared message queue, only one request will be processed
by a certain task at a time.

![](images/002.png)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please capture how this below CMIS spec highlighted requirement is met in your design?

image.

Please do note, during Xcvrd boot, it waits for port config done i.e the NPU or ASIC initialization for the particular port is done, when Xcvrd does accessing the module. The SET event in CONFIG_DB for a port doesn't guarantee the corresponding NPU or ASIC initialization is completed. In short, I don't see the above spec requirement for DP init to happen when Tx signal coming from the ASIC/Gearbox is valid being ensured in current design.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the new statement in the CMIS5, it adds the OutputDisableTx=0xff in the application initialization, and the current implementation did not turn off the Tx power (It's not mention in appendix C of CMIS4) hence I'll later update the code for this

- get_cmis_application_selected()
- get_cmis_application_matched()
- set_cmis_application()
- Add mutex lock to the follow routines to serialize CMIS application initializations
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per CMIS spec, each DP initialization should happen independent of other DP init, but then won't this reset() affect all Datapaths?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it will, but the main purpose of this PR is for the dynamic-port-breakout support, which is having all the lanes configured in the same application mode, hence a reset() is a simple and safest way to ensure the entire module is reset to a known state. I'll later update the code to remove the reset()

- Add support for CMIS application initialization
- get_cmis_application_selected()
- get_cmis_application_matched()
- set_cmis_application()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is set_cmis_application() API per data path or entire module?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As communicated earlier, now we use state-based initialization, hence this routine is removed


## sonic-utilities/sfputil (modified)

- Add supprot to show the CMIS application advertisement
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need a CLI to show current active application as well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, will do

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how will the CLI look like? sample output?


## sonic-platform-common/sonic_platform_base/sonic_xcvr/api/public/cmis.py (modified)

- Add support for software reset
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Module reset is already done by the platform, why is this done here? It will affect the case where we need to initialize individual datapath at different point in time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As communicated earlier, the software reset will be removed

doc/sfp-cmis/cmis-init.md Show resolved Hide resolved
**Example:**
```
{
"CMIS_MEDIA_SETTINGS":{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SONiC will be supporting all CMIS compliant module. why this platform specific check? All platforms in SONiC should be using Xcvrd for initializing the CMIS compliant modules. I am checking with NVIDIA why other platforms should be explicit in specifying the data path init.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As communicated earlier, this module-specific configuration is removed

![](images/001.png)

- Add **PortChangeEvent.PORT_SET** and **PortChangeEvent.PORT_DEL** events to **xcvrd_utilities/port_mapping.py**
- PortChangeEvent.PORT_SET
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the design for handling default application or non-default application as well ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As communicated earlier, in the latest design, we'll support default application for now

@prgeor prgeor self-assigned this Nov 15, 2021
Signed-off-by: Dante Su <dante.su@broadcom.com>
Signed-off-by: Dante Su <dante.su@broadcom.com>
Signed-off-by: Dante Su <dante.su@broadcom.com>
Signed-off-by: Dante Su <dante.su@broadcom.com>
Signed-off-by: Dante Su <dante.su@broadcom.com>
Signed-off-by: Dante Su <dante.su@broadcom.com>
Signed-off-by: Dante Su <dante.su@broadcom.com>
Signed-off-by: Dante Su <dante.su@broadcom.com>
doc/sfp-cmis/cmis-init.md Outdated Show resolved Hide resolved
Signed-off-by: Dante Su <dante.su@broadcom.com>
Signed-off-by: Dante Su <dante.su@broadcom.com>
Signed-off-by: Dante Su <dante.su@broadcom.com>
* [List of Tables](#list-of-tables)
* [Revision](#revision)
* [About This Manual](#about-this-manual)
* [Scope](#scope)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If i click the "Scope" it doesn't take me to the section "Scope". Can you fix that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, will do

return "/sys/bus/i2c/devices/{}-0050/eeprom".format(bus_id)
```

This feature could also be disabled by the platform-specific **pmon_daemon_control.json**
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed, all sonic platform will have support for CMIS. I will check with platforms why it should be disabled, but for now, it should be enabled for all platforms.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, will do


Functionality should continue to work across warm boot.
- The CMIS application initialization should not be performed during WarmBoot.
- The CMIS application initialization should be skipped if no application code updates.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should reconfigure the datapath during fast reboot?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it's not necessary, the CmisMamagerTask will always double-check the current application mode against the desired application from the CONFIG_DB, hence this will be processed when the XCVR INSERTION events are replayed after fast-reboot.

doc/sfp-cmis/cmis-init.md Outdated Show resolved Hide resolved
doc/sfp-cmis/cmis-init.md Show resolved Hide resolved

## sonic-utilities/sfputil (modified)

- Add supprot to show the CMIS application advertisement
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how will the CLI look like? sample output?

doc/sfp-cmis/cmis-init.md Show resolved Hide resolved
doc/sfp-cmis/cmis-init.md Show resolved Hide resolved
Signed-off-by: Dante Su <dante.su@broadcom.com>
prgeor
prgeor previously approved these changes Dec 1, 2021
Signed-off-by: Dante Su <dante.su@broadcom.com>
@prgeor prgeor merged commit 9d48008 into sonic-net:master Dec 1, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants