Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add TACACS server monitor design document. #1467

Open
wants to merge 11 commits into
base: master
Choose a base branch
from

Conversation

liuh-80
Copy link
Contributor

@liuh-80 liuh-80 commented Sep 11, 2023

Add TACACS server monitor design document.
The TACACS server monitor can change TACACS server priority based on syslog and can resolve following issue:
#1462

@liuh-80 liuh-80 marked this pull request as ready for review September 11, 2023 05:30
@ycoheNvidia
Copy link
Contributor

Thanks for addressing the issue.
Even though this might work - in my opinion this is not a good solution for this issue and here is few reasons:

  • user who configures authentication methods doesn't expect configuration to change without any notice or active action from his side.
  • logs can be instable, we have experienced inconsistency of Debian logs in the past, I suspect it might cause bugs for this solution
  • The main issue here is based on an issue with pam/tacplus_pam. In my opinion - it should be resolved there. Adding a monitor workaround solution might be hard to maintain and not be optimal for this issue.

@a-barboza
Copy link
Collaborator

cc: @shdasari

The idea of adjusting the priority based on the responsiveness of the server is good. As mentioned, it should alleviate issue #1462.

However, I feel adjusting the configuration based on a temporary network event might not be the best approach. How about adding some state information to the STATE_DB (COUNTER_DB, or appropriate DB)? Adjust the tacplus pam configuration to account for temporarily unreachable server. Once the network event is detected to be resolved, then the temporary adjustment could be backed out, with suitable logs to advise the admin of the actions taken.

@liuh-80
Copy link
Contributor Author

liuh-80 commented Sep 14, 2023

cc: @shdasari

The idea of adjusting the priority based on the responsiveness of the server is good. As mentioned, it should alleviate issue #1462.

However, I feel adjusting the configuration based on a temporary network event might not be the best approach. How about adding some state information to the STATE_DB (COUNTER_DB, or appropriate DB)? Adjust the tacplus pam configuration to account for temporarily unreachable server. Once the network event is detected to be resolved, then the temporary adjustment could be backed out, with suitable logs to advise the admin of the actions taken.

@a-barboza , currently if SONiC TACACS can connect to first server, it will not try connecting other low priority servers, which means can't detect low priority server network status. and if we want get connection status of low priority server, we need create a new daemon running in background to periodically check server.

Also in this design, the monitor change server priority based on a time window and event count threshold, for example more than 50% connection failed within 5 minutes. this is a simple solution can handle almost every scenario.

So, how about I update the design doc to:

  1. First stage, implement monitor to change priority based on sys event, this will be high priority.
  2. Second stage, update TACACS code to update COUNTER_DB data. create daemon to check TACACS server status. then we can update monitor to change priority based on COUNTER_DB and server reachability?

qiluo-msft
qiluo-msft previously approved these changes Oct 9, 2023
@qiluo-msft
Copy link
Contributor

@ycoheNvidia @a-barboza Would you like to review again? Do you have more comments?

@liuh-80
Copy link
Contributor Author

liuh-80 commented Oct 11, 2023

@a-barboza , I update design doc according to your suggestion, please give your comments.

doc/aaa/TACACS+ Server Monitor.md Outdated Show resolved Hide resolved
doc/aaa/TACACS+ Server Monitor.md Outdated Show resolved Hide resolved
```

### Config DB schema
#### TACPLUS_MONITOR Table schema
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can TACPLUS_MONITOR be disabled? For example, if the table is not defined, does it mean that TACPLUS Monitor is disabled, i.e. no monitoring, and effective priorities are same as configured priorities?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added 'enable' flag for disable this feature. when feature disabled, configured priorities will same as configured priorities.

@Yarden-Z
Copy link
Contributor

How does a user get notified that the order has been changed?
Should there be some warning here?
Also - what is the overhead for this item? If we have 8 tacacs servers, defined with an 8 second timeout, and monit is running every minute - we might get to starvation. (might also occur if we have less, but timeout is bugger).
This seems like an odd solution towards solving another issue where the current timeout for tacacs is not being applied well.

@liuh-80
Copy link
Contributor Author

liuh-80 commented Oct 23, 2023

How does a user get notified that the order has been changed? Should there be some warning here? Also - what is the overhead for this item? If we have 8 tacacs servers, defined with an 8 second timeout, and monit is running every minute - we might get to starvation. (might also occur if we have less, but timeout is bugger). This seems like an odd solution towards solving another issue where the current timeout for tacacs is not being applied well.

When Monitor found latency or unreachable issue, warning message will write to syslog. in prod environment there need some other service check SONiC device health by syslog and send alert to user, which is not part of SONiC.

For timeout issue, if all 8 servers not reachable, Monit will handle it by send alert event for the timeout.

qiluo-msft
qiluo-msft previously approved these changes Oct 28, 2023
@liuh-80
Copy link
Contributor Author

liuh-80 commented Oct 30, 2023

@lguohan , could you review and signoff this PR?

- When hostcfgd generate TACACS config file, server priority calculated according to following rules:
- Get server priority info from CONFIG_DB TACPLUS_SERVER table.
- Change high latency server to 1, this is because 1 is the smallest priority, and SONiC device will use high priority server first.
- Un-reachable server will not include in TACACS config file.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if un-reachable server is excluded, later if the server becomes reachable, how can we include it back?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, hostcfgd will add server back when found an unreachable server become reachable.

config_key = 'config' ; The configuration key
; Attributes
time_window = 1*5DIGIT ; Monitor time window in minute, default is 5
high_latency_threshold = 1*5DIGIT ; High latency threshold in ms, default is 20
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing yang mode design.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, yang model added.


### Functional Requirement
- Monit TACACS+ server unreachable event from COUNTER_DB.
- Monit TACACS+ server slow response event from COUNTER_DB.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which component write to the counter_db, it is not clear from the design doc

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Monit service will write COUNTER_DB, add detail to design doc.

- Hostcfgd will monitor TACPLUS_SERVER_LATENCY table, and will re-generate TACACS config file when following event happen:
- Any server latency is -1, which means the server is unreachable.
- Any server latency is bigger than high_latency_threshold.
- When hostcfgd generate TACACS config file, server priority calculated according to following rules:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need an option to maintain backward compatibility

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, add 'enable' flag, this feature can be disable by this flag.

- TACACS+ monitor also will write warning message to syslog when following event happen:
- Any server latency is -1, which means the server is unreachable.
- Any server latency is bigger than high_latency_threshold.
- Hostcfgd will monitor TACPLUS_SERVER_LATENCY table, and will re-generate TACACS config file when following event happen:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the threshold to determine high latency v.s. not. how do we choose the threshold.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The threshold is 20ms. which is based on experience when handle TACACS server latency/unreachable issue, I can share more detail in review meeting, but may not necessarily to write that in public doc.

Copy link
Contributor

@lguohan lguohan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

commented.

; Key
config_key = 'config' ; The configuration key
; Attributes
enable = BOOLEAN ; Enable Monitor feature
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

enable

There is a FEATURE table to control many optional features. And suggest we separate the TACACS monitor feature and TACACS auto downgrade feature, and control them with finer granularity.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants