Update pmon-chassis-design.md
Addressing review comments from 8/19 chassis-subgroup meeting
mprabhu-nokia authored Aug 26, 2020
1 parent 4206c54 commit 88ffe72
Showing 1 changed file with 18 additions and 15 deletions.
33 changes: 18 additions & 15 deletions doc/pmon/pmon-chassis-design.md
@@ -8,25 +8,25 @@
* [About this Manual](#about-this-manual)
* [Scope](#scope)
* [Acronyms](#acronyms)
* [1. Modular VOQ Chassis - Reference](#1-modular-voq-chassis-reference)
* [2. SONiC Platform Management & Monitoring](#2-sonic-platform-management-monitoring)
* [1. Modular VOQ Chassis Reference](#1-modular-voq-chassis-reference)
* [2. SONiC Platform Management and Monitoring](#2-sonic-platform-management-and-monitoring)
* [2.1 Functional Requirements](#21-functional-requirements)
* [2.2 Chassis Platform Stack](#22-chassis-platform-stack)
* [3. Detailed Workflow](#3-detailed-workflow)
* [3.1 Chassis Boot Process](#31-chassis-boot-sequence)
* [3.1 Chassis Boot Process](#31-chassis-boot-process)
* [3.1.1 Controlcard Boot Process](#311-controlcard-boot-process)
* [3.1.2 Linecard Boot Process](#312-linecard-boot-process)
* [3.2 Chassis Platform Management](#32-chassis-platform-management)
* [3.2.1 Midplane Ethernet](#321-midplane-ethernet)
* [3.2.2 Chassis Monitoring & ChassisD](#322-chassis-monitoring-chassisd)
* [3.2.2 Chassis Monitoring and ChassisD](#322-chassis-monitoring-and-chassisd)
* [3.2.3 Chassis Local Sonic Image Hosting Service](#323-chassis-local-sonic-image-hosting-service)
* [3.2.4 Disaggregated vs Global DB](#324-disaggregated-vs-global-db)
* [3.3 Peripheral Management](#33-peripheral-management)
* [3.3.1 PSUd](#331-psud)
* [3.3.2 Thermalctld](#332-thermalctld)
* [3.3.3 Xcvrd/SFP](#333-xcvrdsfp)
* [3.3.4 LEDd](#334-led)
* [3.3.5 Syseepromd](#335-syseeprom)
* [3.3.4 LEDd](#334-ledd)
* [3.3.5 Syseepromd](#335-syseepromd)
* [3.3.6 Midplane Ethernet](#336-midplane-ethernet)

### Revision ###
@@ -54,14 +54,12 @@ Control Plane Stack - Set of Processes, Daemons, Dockers implementing control pl

Datapath Stack - Set of Processes, Daemons, Dockers, API's implementing datapath ASIC hardware programming via SAI interface.

## 1. Modular VOQ Chassis - Reference
## 1. Modular VOQ Chassis Reference
The picture below shows the high-level hardware architecture of a reference VOQ chassis. The chassis has 1 or 2 control cards (also known as supervisor cards), 1 or more linecards, and 1 or more switch fabric cards. It also has 1 or more fan trays, 1 or more PSUs, and a midplane ethernet. In general, the control cards manage peripherals such as fans, PSUs, and the midplane ethernet.

![Modular VOQ Chassis](pmon-chassis-images/voq-chassis.png)

As an example, the Nokia modular VOQ chassis is the IXR-7250, which has control cards (e.g. CPMv1, CPMv2), linecards (e.g. imm36-400g-qsfpdd, imm36-32x100g-4x400g-qsfpdd), and fabric cards (e.g. SFMv1, SFMv2).

## 2. SONiC Platform Management & Monitoring
## 2. SONiC Platform Management and Monitoring
### 2.1. Functional Requirements
At a functional level, SONiC will manage the control cards, line cards and all other peripheral devices of the chassis as required by the chassis platform vendor's specification. The requirements below capture some of the key areas required to operate a VOQ chassis.

@@ -94,7 +92,7 @@ In a modular disaggregated SONiC software architecture, each linecard will run a

SONiC supports ONIE as a boot method and also provides vendor-specific boot methods. With either boot method, the control card of the chassis is booted first, followed by the linecards. The first phase of the design assumes that the control card must be operationally ready before the linecards boot. This is important because some of the sensor and fan settings are managed by the control card, and they have to be set to correct values while the linecards are running to keep the chassis healthy and avoid overheating.

#### 3.1.1 Control Card Boot Process
#### 3.1.1 Controlcard Boot Process

The control card can be booted using the ONIE method. Upon boot, a unique ONIE_PLATFORM string is provided in the ONIE firmware to differentiate the cards and the services/dockers each could start via systemd-generator. On a control card, dockers like BGP, LLDP, etc. are not started. This service list is included as part of the platform-specific service list file.

@@ -125,7 +123,7 @@ HW_TYPE=IOM
sonic-buildimage/device/nokia/x86_64-nokia_ixr7250_36x400g-r0$
```

#### 3.1.2 Line Card Boot Process
#### 3.1.2 Linecard Boot Process
The linecard boot process is very similar to that of the control card. The main difference is that the services started on a linecard include protocol dockers such as BGP, LLDP, etc. Also, the SyncD docker will be started for the VOQ ASIC instead of the SF ASIC.

### 3.2 Chassis Platform Management
@@ -144,8 +142,9 @@ In order to allow direct access to linecards from outside of the chassis over ex

Allowing a DHCP relay or DHCP client on these internal midplane ethernet networks isn't considered for the first phase of the design.

Approach 1 will be the preferred approach. The external management and midplane ethernet networks will be isolated.

#### 3.2.2 Chassis Monitoring & ChassisD
#### 3.2.2 Chassis Monitoring and ChassisD

Modular Chassis has control-cards, line-cards and fabric-cards along with other peripherals. The different types of cards have to be managed and monitored.

@@ -155,12 +154,14 @@ Modular Chassis has control-cards, line-cards and fabric-cards along with other
* Monitor the status of the line-card, fabric-card etc using new PMON 2.0 APIs. The assumption is that each vendor will have platform-drivers or implementation to detect the status of the cards in the chassis.

shyam77git Oct 1, 2020

Contributor

I believe this also means monitoring the status of LCs (line-cards) on the Supervisor/CC, to ensure that all cards present in the chassis show up in show platform.
Can you please confirm?
In this document, I noticed the following APIs:
get_num_linecards(); get_all_linecards(); get_linecard_presence()
I don't see them yet on the master branch of https://github.com/Azure/sonic-platform-common/blob/master/sonic_platform_base/chassis_base.py
So, will they be introduced later? Tentatively when, or on which branch?

As the chassis_mgr/chassis owner, chassisD needs to detect FCs too, i.e. the maximum number of FC slots, which fabric slots have an FC present, etc.
I recommend looking into adding the following to chassis_base.py:
get_num_fabriccards(); get_all_fabriccards(); get_fabriccard_presence()

shyam77git Oct 1, 2020

Contributor

https://github.com/Azure/sonic-platform-common/blob/master/sonic_platform_base/chassis_base.py
While looking at chassis_base.py on the master branch, I noticed module methods and component methods.
What exactly do modules and components refer to here?
I don't think they are the cards (LC/FC), as those would be added via get_num_linecards() APIs etc. (as mentioned in the point above).

shyam77git Oct 1, 2020

Contributor
  1. src/sonic-platform-common/sonic_platform_base/linecard_base.py and its API/methods
    I don't think this is in the master branch.
    Is linecard_base.py yet to be added? Can you please confirm?
    Tentatively when, or on which branch?

  2. This class and its API can be invoked/leveraged on any card (LC or CC/Supervisor) to fetch the status of the specified LC (on CC) or the local LC (on LC), or to reboot or set admin up/down on the specified LC (on CC) or the local LC (on LC). Can you please correct/confirm?

  3. ChassisD would want to know the same for FCs too.
    LineCardBase can be leveraged for an FC card/slot as well.
    Since it's generic for any card type (LC/FC/CC), I suggest a generic name, such as a CardBase class.

* The status will need to be persisted in REDIS-DB.
* PMON processes can subscribe to UP/DOWN events of these cards.
* The UP/DOWN events will be added to syslog.
* Vendor-specific API will be provided to take action on any change event.
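
The monitoring steps listed above can be illustrated with a minimal sketch. It is not the ChassisD implementation: a plain dict stands in for the REDIS-DB table, Python's `logging` stands in for syslog, and the names `diff_card_status` and `update_state_table` are hypothetical. A card that disappears from the polled status is not handled here.

```python
import logging

logger = logging.getLogger("chassisd-sketch")  # stand-in for syslog


def diff_card_status(previous, current):
    """Return {slot: 'UP' | 'DOWN'} for every slot whose status changed."""
    events = {}
    for slot, is_up in current.items():
        if previous.get(slot) != is_up:
            events[slot] = "UP" if is_up else "DOWN"
    return events


def update_state_table(state_table, current):
    """Persist the latest status per slot and log each UP/DOWN transition.

    state_table is a dict standing in for the state DB table (assumption);
    current maps slot number to True (card up) or False (card down).
    """
    previous = {slot: entry["status"] == "UP"
                for slot, entry in state_table.items()}
    events = diff_card_status(previous, current)
    for slot, event in events.items():
        state_table[slot] = {"status": event}
        logger.info("card in slot %s is %s", slot, event)
    return events
```

A caller would poll the vendor platform for the current status of each slot and pass the result to `update_state_table`; the returned dict is what subscribers to UP/DOWN events would receive.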

#### Schema

The schema for CHASSIS_CARD_INFO table in State DB is:
```
key = CHASSIS_CARD <card index> |"state_db" ;
key = CHASSIS_CARD_INFO | <card index>;
; field = value
name = STRING ; name of the card
slot = 1*2DIGIT ; slot number in the chassis
@@ -239,6 +240,8 @@ class LineCard(LineCardBase):
```

Additionally, *get_change_event()* can be implemented to handle asynchronous notification of the line-card UP/DOWN events.
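
As a rough illustration of that idea, the sketch below polls a status callback and blocks until a line-card changes state or a timeout expires. The class name `LineCardEventSource`, the `poll_status` callback, the millisecond timeout and the `(changed, events)` return shape are all assumptions made for this example, not the API defined in this design.

```python
import time


class LineCardEventSource:
    """Hypothetical helper that turns status polling into change events."""

    def __init__(self, poll_status, poll_interval=0.1):
        self._poll_status = poll_status      # callable returning {slot: bool}
        self._poll_interval = poll_interval  # seconds between polls
        self._last = dict(poll_status())     # baseline snapshot

    def get_change_event(self, timeout=0):
        """Wait for a status change.

        timeout is in milliseconds; 0 means wait indefinitely (assumption).
        Returns (True, {slot: 'UP' | 'DOWN'}) on a change and (False, {})
        if the timeout expires first.
        """
        deadline = None if timeout == 0 else time.monotonic() + timeout / 1000.0
        while True:
            current = dict(self._poll_status())
            events = {slot: ("UP" if up else "DOWN")
                      for slot, up in current.items()
                      if self._last.get(slot) != up}
            if events:
                self._last = current
                return True, events
            if deadline is not None and time.monotonic() >= deadline:
                return False, {}
            time.sleep(self._poll_interval)
```

A vendor implementation with real interrupt or netlink notifications would replace the polling loop; the sketch only shows the blocking-with-timeout contract.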

#### Show command
The *show platform* command is enhanced to show chassis information

@@ -584,6 +587,7 @@ Syseepromd will run on control and line-cards independently and monitor for any
To manage and monitor midplane ethernet, the following vendor-specific PMON 2.0 APIs can be introduced:

* API to initialize the midplane on both control and line cards - init_midplane_switch()
* This API will *not* be used to initialize the drivers or configure the IP-address. The drivers should be initialized and IP-addresses should be configured before the Database-dockers are brought up.
* APIs to check midplane connectivity:
* On line-card to check if control-card is reachable via midplane - is_midplane_controlcard_reachable()
* On control-card to check if line-card on slot is reachable via midplane - is_midplane_linecard_reachable(slot)
@@ -633,4 +637,3 @@ class midplane_monitor_task:
logger.log_info("Stop midplane task loop")
```
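
The reachability checks named in this section, is_midplane_controlcard_reachable() and is_midplane_linecard_reachable(slot), could be used along the following lines. This is a sketch under stated assumptions: the helper `unreachable_peers` and the `FakeChassis` class are hypothetical, and a real platform object would come from the vendor implementation.

```python
def unreachable_peers(chassis, is_control_card, linecard_slots=()):
    """Return the peers that cannot be reached over the midplane.

    On a control card, every slot in linecard_slots is checked.
    On a line card, only the control card is checked.
    """
    if is_control_card:
        return ["linecard-%d" % slot for slot in linecard_slots
                if not chassis.is_midplane_linecard_reachable(slot)]
    if not chassis.is_midplane_controlcard_reachable():
        return ["controlcard"]
    return []


class FakeChassis:
    """Hypothetical stand-in for a vendor platform object (testing only)."""

    def __init__(self, reachable_slots, controlcard_up=True):
        self._reachable = set(reachable_slots)
        self._controlcard_up = controlcard_up

    def is_midplane_linecard_reachable(self, slot):
        return slot in self._reachable

    def is_midplane_controlcard_reachable(self):
        return self._controlcard_up
```

A monitoring task would call `unreachable_peers` on each iteration and log or publish any non-empty result.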

shyam77git Oct 2, 2020

Contributor

Q on section 3.2.4
section 3.2.4 says:

  • Disaggregated DB - Each card updates to local REDIS-DB. Monitoring process will pull or subscribe to the table updates of each card.
  • Global DB - Each card will updated their state to a line-card-table in the Global-DB
    ======
    What all would Global REDIS-DB hold? Can you please share the schema of this DB?
    Per my understanding, definitely environmental sensors of all cards (LCs, local RP, FCs)
    Can you please clarify "Each card will updated their state to a line-card-table in the Global-DB"?
    What card state update is referred here?
    Per start of section 3.2.2., my understanding is All LC card state update (online/offline/empty) + (admin up/down) to reside in CC/Supervisor's local REDIS-DB.
