Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CLS Silverstone: enable the init sequence for all of the QSFPDD CMIS4… #95

Merged
merged 4 commits into from
Dec 1, 2020
Merged

CLS Silverstone: enable the init sequence for all of the QSFPDD CMIS4… #95

merged 4 commits into from
Dec 1, 2020

Conversation

ds952811
Copy link

… modules

Signed-off-by: Dante Su dante.su@broadcom.com

- What I did

  1. Enable the initialization sequence for all of the QSFPDD CMIS4 modules, which used to be limited to InnoLight only.
  2. Stop issuing software reset to the QSFPDD DACs even though its module state is not ModuleReady for 3+ seconds.
  3. Lower the clock rate of FPGA-I2C to have the best transceiver compatibility.

- How I did it
See What I did

- How to verify it

  1. Have both InnoLight and Eoptolink CMIS4 modules attached to the DUT
  2. Issue 'show interfaces transceiver eeprom' to verify if their application selection is '2'
  3. Intentionally disable anyone of the 4 links of the transceiver, and make sure the other 3 links are still UP

- Description for the changelog

- A picture of a cute animal (not mandatory but encouraged)

… modules

Signed-off-by: Dante Su <dante.su@broadcom.com>
@ds952811
Copy link
Author

ds952811 commented Nov 25, 2020

PLEASE DO NOT approve this review until the Broadcom QA finished the internal tests, I'll later send an all-clear flag once it's done.

This PR is early access to review what are the proposed changes to address the CMIS4 issues, while the new test image has yet generated, the DUT with corresponding changes manually applied are verified, please refer to the attached log for details.
PR95-ENG-TEST.log

@@ -1431,7 +1431,7 @@ static int smbus_access(struct i2c_adapter *adapter, u16 addr,
goto Done;
}

#if 1 /* 100 kHz */
#if 0 /* 100 kHz */
Copy link
Author

@ds952811 ds952811 Nov 25, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly, this is unlikely to be the root cause of I2C failure reported yesterday, however it's the only thing under our control without having the Celestica engaged, that's why I did this.
Please note it's always working on our test unit no matter if high-speed (BIT6-0x40) is enabled or not.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we turn off the high speed mode, how slow would it be? I don't feel we should do this unless we feel it was necessary.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. We will have ~30% delays for each I2C operations when the BIT6 (high-speed) is cleared
  2. Based on the latest test report from Broadcom QA, slowing down the clock does not improve the transceiver compatibility.
    And hence, I'll revert this change.

Note: None of the QSFPDD optics is available at my side, so I'm counting on QA for the verifications

e.g.
Dante,

It appears Eoptolink 3.0 broke. It took ~17 minutes for link up after remove/insert
Please see the logs below

Regards,
-kevin

......
Nov 25 21:47:25.015491 sonic INFO pmon#sfputil.py: PORT 31: unable to read PAGE0
Nov 25 21:47:25.015491 sonic INFO pmon#sfputil.py: PORT 31: Unable to initialize the module
Nov 25 21:48:06.595530 sonic INFO pmon#sfputil.py: PORT 31: unable to read PAGE0
Nov 25 21:48:06.595530 sonic INFO pmon#sfputil.py: PORT 31: Unable to initialize the module
Nov 25 21:49:36.731519 sonic INFO pmon#sfputil.py: PORT 31: unable to read PAGE0
Nov 25 21:49:36.731519 sonic INFO pmon#sfputil.py: PORT 31: Unable to initialize the module
Nov 25 21:50:11.383533 sonic INFO pmon#sfputil.py: PORT 31: unable to read PAGE0
Nov 25 21:50:11.383533 sonic INFO pmon#sfputil.py: PORT 31: Unable to initialize the module
Nov 25 21:50:39.103534 sonic INFO pmon#sfputil.py: PORT 31: unable to read PAGE0
Nov 25 21:50:39.103534 sonic INFO pmon#sfputil.py: PORT 31: Unable to initialize the module
Nov 25 21:51:34.543512 sonic INFO pmon#sfputil.py: PORT 31: unable to read PAGE0

# its module state could get stuck at ModuleLowPwr upon reinsertion, and a software
# reset is necessary to get it recovered.
buf = self._read_eeprom_devid(x, self.IDENTITY_EEPROM_ADDR, 85, 1)
if (buf is not None) and (int(buf[0], 16) != 0x03):
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to address the DAC failure below:
pmon#sfputil.py: write failed: [Errno 110] Connection timed out

Some of the DAC (Passive Copper) do not support software reset, and hence such a write attempt will fail

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case of DAC, should we even add failure counter and come here, See line 553:
self.mod_failure[x] += 1

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

another generic question, it seems we are keep monitoring all the settings and do reset/init etc all the time. It is a bit concerning if we that could be disruptive. Thinking maybe we could only do these steps when we got the optical insert event, and finish the bring up process once. of course, we will do it again if insert happens again. Let me know.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. We'll actually do nothing in the case of DAC, so I'm not sure why we need to keep the counter advanced? Do we want to have any other automatic recovery for DAC?

  2. The reason why this is not handled in the insertion event is that, the I2C read could fail and it was observed during the QA test last week. However, we could add a inited flag to make sure this is only performed once and only once, let me update the code logic

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1- that is exactly my point, we should not advance the counter and then avoid the write here, we should eliminate the counter advance in the code earlier so it won't even be a failure case.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copy that, and thanks, I'll update the code accordingly

rev = int(buf[1], 16)
if rev >= 0x40:
return self._init_cmis4_module(port_num, xcvr, self.hwsku)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of the CMIS3 modules we have are working fine with 4x100G in its default application mode, and hence this init sequence will only be performed on CMIS4 modules.

Note: The definition of the datapath initialiation slightly differs from CMIS3 to CMIS4, and hence we can't use _init_cmis4_module() on CMIS3 modules.

@@ -450,7 +469,7 @@ def _init_cmis_module_custom(self, port_num, xcvr, hwsku):
self._write_byte(port_num, self.IDENTITY_EEPROM_ADDR, 0x10, 143, 0xff)
# Initialize datapath
self._write_byte(port_num, self.IDENTITY_EEPROM_ADDR, 0x10, 128, 0x00)
time.sleep(0.5)
time.sleep(1)
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The UT result shows that we need a few seconds for the actual configuration status correctly reflected on the register, but we don't really want to wait too long in the xcvrd, so I slightly enlarge the delay from 500ms to 1sec, hope this could reduce the redudant errors in the syslog

return True

log_info("PORT {0}: {1}: _init_cmis_module_custom".format(port_num, xcvr))
def _get_application_code(self, port_num, host_intf):
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no guarantee that 100G PAM4 mode is always the application 2, and hence we'll need to decode the application advertisement and figure the correct application code

…nd revert the clock of FPGA-I2C to high-speed

Signed-off-by: Dante Su <dante.su@broadcom.com>
@@ -437,7 +439,7 @@ def _init_cmis4_module(self, port_num, xcvr, hwsku):
log_info("PORT {0}: {1}: _init_cmis4_module".format(port_num, xcvr))

# Allow 1s for software reset
self._write_byte(port_num, self.IDENTITY_EEPROM_ADDR, -1, 26, 0x08)
self._write_byte(port_num, self.IDENTITY_EEPROM_ADDR, -1, 26, 0x58)
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ensure the module will be placed in LowPower mode

buf = self._read_eeprom_devid(port_num, self.IDENTITY_EEPROM_ADDR, self._page_to_flat(202, 0x11), 4)
err = "".join(buf)
err = '00000000'
for t in range(10):
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eoptolink CMIS4 took 6 seconds for config status to get populated to the register, and hence 10 sec should be sufficient

# Monitoring the QSFP-DD initialization state, and reinitiate it if necessary
# monitor the QSFP-DD initialization state, and reinitiate it if necessary
if self.mod_inited[x]:
continue
buf = self._read_eeprom_devid(x, self.IDENTITY_EEPROM_ADDR, self._page_to_flat(202, 0x11), 4)
if buf is None:
continue
err = "".join(buf)
if err != '11111111':
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for a record, this checker is necessary to prevent redundant CMIS4 init upon warm-reboot, otherwise significant packet drops will be observed.

@@ -1431,7 +1431,7 @@ static int smbus_access(struct i2c_adapter *adapter, u16 addr,
goto Done;
}

#if 0 /* 100 kHz */
#if 1 /* 100 kHz */
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revert the clock to high-speed, unless it's unlikely to be the root cause, and we actually have this enabled in ICOS, and I never got any issues for this

…ecoder

Signed-off-by: Dante Su <dante.su@broadcom.com>
It's a known issue that Eoptolink CMIS3 requires a software reset after
re-insertion, and a burst of software reset messages observed when its
state != ModuleReady for 5 seconds.

And hence, we'll have the following changes in this commit

1. Update the checker from (state != ModuleReady) to (state == ModuleLowPwr),
   which is exactly the behavior observed on the Eoptolink CMIS3, and its state
   will become ModulePwrDn after reset.
2. Add 1-second delay to the software reset, and hence it's impossible to have
   a burst of reset at a time
3. Ensure the failure counter gets reset when its state != ModuleLowPwr

Signed-off-by: Dante Su <dante.su@broadcom.com>
@zhenggen-xu
Copy link
Owner

As you mentioned, please send the all clear flag once the QA is done.

@ds952811
Copy link
Author

ds952811 commented Dec 1, 2020

As you mentioned, please send the all clear flag once the QA is done.

Although the hotfix in the last commit 'CLS Silverstone: sfputil.py: fix Eoptolink CMIS3 support' has already verified by the QA, we still need to generate a new build with this change, and I'm working on it, will let you know once it's ready.

@ds952811
Copy link
Author

ds952811 commented Dec 1, 2020

As you mentioned, please send the all clear flag once the QA is done.

Although the hotfix in the last commit 'CLS Silverstone: sfputil.py: fix Eoptolink CMIS3 support' has already verified by the QA, we still need to generate a new build with this change, and I'm working on it, will let you know once it's ready.

After the internal discussion with my manager, Babu, it sounds like you're not using the binary from Broadcom, and you will use only the image rebuilt from the source package.

In this case, you don't need to wait for the new build(image) from my end, please approve and merge the changes if no any concern

@zhenggen-xu zhenggen-xu merged commit e4e7956 into zhenggen-xu:201811-TH3 Dec 1, 2020
zhenggen-xu pushed a commit that referenced this pull request Dec 15, 2020
)

**sonic-platform-common:**

Commits on Jul 07, 2020
Changes in fan and psu base classes (1.0 platform API) related to pdd… 17292e4
Commits on Jul 10, 2020
update get_pcie_check() to use sysfs instead of lspci output (#95) d4eb804
Commits on Jul 11, 2020
[eeprom] Fix UnboundLocalError (#93) 7c8bed1
Commits on Jul 14, 2020
[Transceiver] Add parser for QSFP-DD cable type and dictionaries for … be1cc24 

**sonic-platform-daemons:**

Commits on Jul 06, 2020
[psud] Store PSU temperature and voltage information to database (#61) ef9716a
Commits on Jul 14, 2020
[xcvrd] Add support for QSFP-DD cables (#66) c530587
Commits on Jul 18, 2020
Initial version of pcied (#60) e665ee8
[xcvrd] Return non-zero error code on SFP error (#67) 4f42a79
Commits on Jul 19, 2020
[README.md] Add LTGM badges (#69) 23757a3
Remove unused imports (#70) 029d5a5 

Signed-off-by: Nazarii Hnydyn <nazariig@mellanox.com>
zhenggen-xu pushed a commit that referenced this pull request Dec 15, 2020
…c-utilities] Update submodules (sonic-net#5435)

Update Python-based submodules to include updated dependencies in setup.py files.

* src/sonic-platform-common 14c6e53...7255d3a (2):
  > [setup.py] Update dependenices (sonic-net#120)
  > [sfputilbase|sfputilhelper] fix value unpack error due to port_alias_asic_map in portconfig.py file (sonic-net#116)

* src/sonic-platform-daemons e1842b2...1aaffcc (2):
  > Add 'wheel' package to 'setup_requires' list (#95)
  > [thermalctld] Fix issue: thermalctld should be auto restarted when being killed (#94)

* src/sonic-snmpagent 1a2b62a...5957460 (1):
  > [setup.py] Add 'wheel' to 'setup_requires' (sonic-net#160)

* src/sonic-utilities 2244d7b...2f79ac1 (4):
  > [show] Add missing import to fix `show ip interfaces` (sonic-net#1126)
  > Fixed config load_minigrpah not working for Multi-asic platfroms. (sonic-net#1123)
  > Add CLI for configuring synchronous mode (sonic-net#1094)
  > [setup.py] Update dependencies (sonic-net#1125)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants