-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Mellanox] thermal control enhancement for dynamic minimum fan speed and PSU fan speed policy #4403
Conversation
…policy and PSU speed policy
This pull request introduces 1 alert when merging 3b77d71 into 768d18d - view on LGTM.com new alerts:
|
retest vsimage please |
retest vsimage please |
@@ -0,0 +1,134 @@ | |||
DEVICE_DATA = { | |||
'ACS-MSN2700': { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems like for the different SKU which are different flavour of the same Platform type we need to define the same table over and over. can you please find a better way to do so?
for example: ACS-MSN2700, LS-SN2700, Mellanox-SN2700, Mellanox-SN2700-D48C8, etc.
soon when we will have a dynamic SKU creator we will not be able to support it thus it cannot be based on SKU.
we have the same limitation in sfputil and this is wrong.
} | ||
} | ||
}, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are missing SN4700 which is also supported.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't have minimum table for 4700. We set minimum FAN speed to 60% for 4700 or any other platform who has no minimum table.
@@ -22,17 +23,24 @@ | |||
|
|||
FAN_PATH = "/var/run/hw-management/thermal/" | |||
LED_PATH = "/var/run/hw-management/led/" | |||
CONFIG_PATH = "/var/run/hw-management/config" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think we have this config path on different files. is this correct to have it this way instead of having it inherited?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is correct. The config path here is a directory and I only need it to find some i2c address related to PSU fan. I don't see difference among differenct platforms.
# fan_dir isn't supported on Spectrum 1. It is supported on Spectrum 2 and later switches | ||
FAN_DIR = "/var/run/hw-management/system/fan_dir" | ||
COOLING_STATE_PATH = "/var/run/hw-management/thermal/cooling_cur_state" | ||
|
||
# SKUs with unplugable FANs: | ||
# 1. don't have fanX_status and should be treated as always present | ||
hwsku_dict_with_unplugable_fan = ['ACS-MSN2010', 'ACS-MSN2100'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here, should not be SKU but rather a system type
if not isinstance(level, int): | ||
raise RuntimeError("Failed to set cooling level, input parameter must be integer") | ||
|
||
if level < 1 or level > 10: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggest to have min, max defines and use it for both if case and prints.
THERMAL_ZONE_TEMPERATURE = "thermal_zone_temp" | ||
THERMAL_ZONE_NORMAL_TEMPERATURE = "temp_trip_norm" | ||
|
||
MODULE_COUNTER_PATH = "/var/run/hw-management/config/module_counter" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure you are using it?
THERMAL_ZONE_NORMAL_TEMPERATURE = "temp_trip_norm" | ||
|
||
MODULE_COUNTER_PATH = "/var/run/hw-management/config/module_counter" | ||
GEARBOX_COUNTER_PATH = "/var/run/hw-management/config/gearbox_counter" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure you are using it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed.
|
||
|
||
@classmethod | ||
def _write_generic_file(cls, filename, content): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what the purpose of the generic file? can you please add more description for it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated description.
logger.log_info("Fail to write file {} due to {}".format(filename, repr(e))) | ||
|
||
@classmethod | ||
def set_thermal_algorithm_status(cls, status, force=True): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the behaviour expected when disabled?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When disable the thermal algorithm, kernel no longer adjust the FAN speed according to thermal zone temperature. We usually disable the thermal algorithm when FAN absence or PSU absence. If a fan unit or PSU is removed from switch, there will be a "air hole" at switch, we need set fan speed to 100% and prevent thermal algorithm adjust it.
@@ -159,6 +185,44 @@ def test_all_fan_presence_condition(): | |||
fan_info.collect(chassis) | |||
assert condition.is_match({'fan_info': fan_info}) | |||
|
|||
def test_any_fan_fault_condition(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should you consider to have a test file for the tests instead of the general code used for production? not must but i think it is better
…level to dynamic minimum value
Please fix conflicts |
@Junchao-Mellanox Please create PR for 201911. There is merge conflict. |
…and PSU fan speed policy (sonic-net#4403) Conflicts: platform/mellanox/mlnx-platform-api/sonic_platform/fan.py
- What I did
Support dynamic minimum fan speed policy. Before this change, the default fan speed is 60% in Mellanox platform which means that the fan speed will be set to 60% if all fans work well. But 60% might be too high for cold environment which consumes a lot power and makes noise. The idea of dynamic minimum fan speed is to set minimum fan speed dynamically according to thermal zone temperature, air flow direction. The dynamic minimum fan speed policy check air flow direction and thermal zone temperature every 60 seconds and determine a proper minimum fan speed according to a "minimum table". Set fan speed to lower than minimum fan speed is not allowed.
Support PSU fan speed policy. Before this change, the PSU fan speed is a fix value in Mellanox platform. It means that the PSU fan speed won't be adjusted according to the system's thermal zone status. The PSU fan speed policy checks the system cooling level periodically and set PSU fan speed according to it.
Adjust the current "all fan and psu presence" policy. Before this change, when all fan and psu work well, this policy just set fan speed to 60% and enable kernel thermal algorithm. Within this change, the policy will check all thermal zones and make sure all thermal zones return to their normal state before setting fan speed to 60%.
Add a fan fault policy. If any fan in fault state, set other fan speed to 100%.
- How I did it
Add a private policy for dynamic minimum fan speed. The policy contains a new condition to detect that if thermal zone temperature or air flow direction changes. The policy also contains a new action to set the minimum fan speed.
Add a private policy for PSU fan speed. The policy contains a new condition to check that if system cooling level changes. The policy also contains a new action to set PSU fan speed via i2c tool.
- How to verify it
Manual test and regression test. Added two test cases in sonic-net/sonic-mgmt#1552.
- Description for the changelog
- A picture of a cute animal (not mandatory but encouraged)