Add TACACS server monitor design document. #1467

liuh-80 · 2023-09-11T05:29:14Z

Add TACACS server monitor design document.
The TACACS server monitor can change TACACS server priority based on syslog events, and can resolve the following issue:
#1462

ycoheNvidia · 2023-09-12T18:19:44Z

Thanks for addressing the issue.
Even though this might work, in my opinion it is not a good solution for this issue. Here are a few reasons:

A user who configures authentication methods doesn't expect the configuration to change without any notice or active action on their side.
Logs can be unstable. We have experienced inconsistency in Debian logs in the past, and I suspect it might cause bugs in this solution.
The main issue here stems from a problem in pam/tacplus_pam. In my opinion, it should be resolved there. Adding a monitor as a workaround might be hard to maintain and is not optimal for this issue.

a-barboza · 2023-09-14T00:39:19Z

cc: @shdasari

The idea of adjusting the priority based on the responsiveness of the server is good. As mentioned, it should alleviate issue #1462.

However, I feel that adjusting the configuration based on a temporary network event might not be the best approach. How about adding some state information to the STATE_DB (COUNTER_DB, or another appropriate DB), and adjusting the tacplus pam configuration to account for a temporarily unreachable server? Once the network event is detected to be resolved, the temporary adjustment could be backed out, with suitable logs to advise the admin of the actions taken.

liuh-80 · 2023-09-14T04:24:39Z

@a-barboza, currently if SONiC TACACS can connect to the first server, it will not try connecting to the other, lower priority servers, which means it can't detect the network status of those servers. If we want the connection status of lower priority servers, we need to create a new daemon that runs in the background and periodically checks each server.

Also, in this design the monitor changes server priority based on a time window and an event count threshold, for example more than 50% of connections failing within 5 minutes. This is a simple solution that can handle almost every scenario.

So, how about I update the design doc as follows:

First stage: implement the monitor to change priority based on syslog events. This will be high priority.
Second stage: update the TACACS code to write COUNTER_DB data and create a daemon to check TACACS server status. Then we can update the monitor to change priority based on COUNTER_DB and server reachability?
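
For illustration, the time-window rule mentioned above (more than 50% of connections failing within 5 minutes) could be sketched roughly as follows. This is only a sketch under assumptions: the class name `FailureWindow`, its methods, and the use of plain numeric timestamps are hypothetical and do not come from the design doc.

```python
from collections import deque


class FailureWindow:
    """Track connection results inside a sliding time window (hypothetical helper)."""

    def __init__(self, window_seconds=300, failure_ratio=0.5):
        self.window_seconds = window_seconds  # 5 minutes, as in the example above
        self.failure_ratio = failure_ratio    # "more than 50%" of attempts
        self._events = deque()                # (timestamp, succeeded), oldest first

    def record(self, timestamp, succeeded):
        """Store one connection attempt and drop attempts older than the window."""
        self._events.append((timestamp, succeeded))
        self._expire(timestamp)

    def _expire(self, now):
        while self._events and now - self._events[0][0] > self.window_seconds:
            self._events.popleft()

    def should_demote(self, now):
        """Return True when more than failure_ratio of recent attempts failed."""
        self._expire(now)
        if not self._events:
            return False
        failures = sum(1 for _, ok in self._events if not ok)
        return failures / len(self._events) > self.failure_ratio
```

With this shape, a server whose failures age out of the window stops being flagged without any extra reset logic.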

doc/aaa/TACACS+ Server Monitor.md

qiluo-msft · 2023-10-10T18:15:42Z

@ycoheNvidia @a-barboza Would you like to review again? Do you have more comments?

liuh-80 · 2023-10-11T05:59:26Z

@a-barboza, I have updated the design doc according to your suggestion. Please share your comments.

doc/aaa/TACACS+ Server Monitor.md

a-barboza · 2023-10-12T20:44:39Z

doc/aaa/TACACS+ Server Monitor.md

+```
+
+### Config DB schema
+#### TACPLUS_MONITOR Table schema


How can TACPLUS_MONITOR be disabled? For example, if the table is not defined, does it mean that TACPLUS Monitor is disabled, i.e. no monitoring, and effective priorities are same as configured priorities?

Added an 'enable' flag to disable this feature. When the feature is disabled, the effective priorities are the same as the configured priorities.

Yarden-Z · 2023-10-13T07:33:21Z

How does a user get notified that the order has been changed?
Should there be some warning here?
Also, what is the overhead for this item? If we have 8 TACACS servers, each defined with an 8 second timeout, and monit runs every minute, we might get starvation. (This might also occur with fewer servers if the timeout is bigger.)
This seems like an odd solution towards solving another issue where the current timeout for tacacs is not being applied well.
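
To make the overhead concern concrete, here is a small worked calculation. It assumes the servers are probed one after another and that every probe runs to its full timeout, which is the worst case; the function names are hypothetical.

```python
def worst_case_probe_seconds(server_count, timeout_seconds):
    """Upper bound on one probe cycle if servers are checked one after another."""
    return server_count * timeout_seconds


def cycle_overruns(server_count, timeout_seconds, interval_seconds):
    """True when a full cycle of timeouts cannot finish before the next run starts."""
    return worst_case_probe_seconds(server_count, timeout_seconds) > interval_seconds


print(worst_case_probe_seconds(8, 8))  # 64
print(cycle_overruns(8, 8, 60))        # True
```

With 8 servers and an 8 second timeout, the worst case is 64 seconds, which is longer than a one minute interval.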

doc/aaa/TACACS+ Server Monitor.md

liuh-80 · 2023-10-23T02:34:58Z

When the monitor finds a latency or unreachability issue, it writes a warning message to syslog. In a production environment, some other service, which is not part of SONiC, needs to check SONiC device health through syslog and send alerts to the user.

For the timeout issue, if all 8 servers are unreachable, Monit will handle it by sending an alert event for the timeout.

liuh-80 · 2023-10-30T02:04:47Z

@lguohan, could you review and sign off on this PR?

lguohan · 2023-10-31T16:21:03Z

doc/aaa/TACACS+ Server Monitor.md

+- When hostcfgd generate TACACS config file, server priority calculated according to following rules:
+    - Get server priority info from CONFIG_DB TACPLUS_SERVER table.
+    - Change high latency server to 1, this is because 1 is the smallest priority, and SONiC device will use high priority server first.
+    - Un-reachable server will not include in TACACS config file.


If an unreachable server is excluded and the server later becomes reachable, how can we include it back?

Fixed. hostcfgd will add the server back when it finds that an unreachable server has become reachable.
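
One way to read the rules quoted in this hunk is sketched below. This is an illustration, not hostcfgd code: the function name `effective_servers`, the dictionary shapes, and the choice to keep the configured priority for servers with no measurement yet are all assumptions.

```python
UNREACHABLE = -1  # latency value the design uses for an unreachable server


def effective_servers(configured, latency_ms, high_latency_threshold=20):
    """Return (address, priority) pairs for the generated TACACS config.

    configured: address -> configured priority (from CONFIG_DB TACPLUS_SERVER)
    latency_ms: address -> last measured latency in ms; -1 means unreachable
    """
    result = []
    for address, priority in configured.items():
        latency = latency_ms.get(address)
        if latency == UNREACHABLE:
            continue  # unreachable servers are left out of the config file
        if latency is not None and latency > high_latency_threshold:
            priority = 1  # demote: 1 is the smallest priority
        result.append((address, priority))
    # Highest priority first, since the device tries high priority servers first.
    result.sort(key=lambda item: item[1], reverse=True)
    return result
```

Because the list is recomputed from the configured table on every call, a server that becomes reachable again reappears with its configured priority on the next regeneration.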

lguohan · 2023-10-31T16:21:15Z

doc/aaa/TACACS+ Server Monitor.md

+config_key           = 'config'  ;  The configuration key
+; Attributes
+time_window              = 1*5DIGIT  ; Monitor time window in minute, default is 5
+high_latency_threshold   = 1*5DIGIT  ; High latency threshold in ms, default is 20


Missing YANG model design.

Fixed, YANG model added.
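
As a rough illustration of the constraints in the schema hunk above (both attributes are `1*5DIGIT`, with defaults of 5 minutes and 20 ms), a validator might look like the sketch below. It is not the YANG model; the function name `parse_monitor_config` and the assumption that values arrive as strings are hypothetical.

```python
import re

# Defaults taken from the schema comments in the hunk above.
DEFAULTS = {"time_window": 5, "high_latency_threshold": 20}
_DIGITS = re.compile(r"[0-9]{1,5}")  # mirrors the 1*5DIGIT rule


def parse_monitor_config(raw):
    """Validate monitor attributes given as strings and apply defaults."""
    parsed = dict(DEFAULTS)
    for name, value in raw.items():
        if name not in DEFAULTS:
            raise ValueError(f"unknown attribute: {name}")
        if not _DIGITS.fullmatch(value):
            raise ValueError(f"{name} must be 1 to 5 digits, got {value!r}")
        parsed[name] = int(value)
    return parsed
```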

lguohan · 2023-10-31T16:22:37Z

doc/aaa/TACACS+ Server Monitor.md

+
+### Functional Requirement
+- Monit TACACS+ server unreachable event from COUNTER_DB.
+- Monit TACACS+ server slow response event from COUNTER_DB.


Which component writes to COUNTER_DB? It is not clear from the design doc.

The Monit service will write to COUNTER_DB. Added details to the design doc.
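
For readers wondering what such a probe might measure, here is a minimal sketch. It assumes latency means TCP connect time to the TACACS+ port (49) and that -1 marks an unreachable server, as in the hunks in this thread; the design may measure something else, and the function name and the injectable `connect` parameter are hypothetical. Writing the result to COUNTER_DB is left out.

```python
import socket
import time


def measure_latency_ms(host, port=49, timeout=5.0, connect=socket.create_connection):
    """Return TCP connect time in whole milliseconds, or -1 if unreachable."""
    start = time.monotonic()
    try:
        conn = connect((host, port), timeout=timeout)
    except OSError:
        return -1  # refused, timed out, or no route: treat as unreachable
    conn.close()
    return int((time.monotonic() - start) * 1000)
```

The `connect` parameter exists only so the function can be exercised without a real server.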

lguohan · 2023-10-31T16:23:10Z

doc/aaa/TACACS+ Server Monitor.md

+- Hostcfgd will monitor TACPLUS_SERVER_LATENCY table, and will re-generate TACACS config file when following event happen:
+    - Any server latency is -1, which means the server is unreachable.
+    - Any server latency is bigger than high_latency_threshold.
+- When hostcfgd generate TACACS config file, server priority calculated according to following rules:


We need an option to maintain backward compatibility.

Fixed. Added an 'enable' flag; the feature can be disabled with this flag.

lguohan · 2023-10-31T16:23:32Z

doc/aaa/TACACS+ Server Monitor.md

+- TACACS+ monitor also will write warning message to syslog when following event happen:
+    - Any server latency is -1, which means the server is unreachable.
+    - Any server latency is bigger than high_latency_threshold.
+- Hostcfgd will monitor TACPLUS_SERVER_LATENCY table, and will re-generate TACACS config file when following event happen:


What is the threshold for determining high latency, and how do we choose it?

The threshold is 20 ms, which is based on experience handling TACACS server latency and unreachability issues. I can share more detail in the review meeting, but it may not be necessary to write that in a public doc.
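
The two warning conditions in the hunk above, combined with the 20 ms threshold, could be expressed as in the sketch below. The function name `latency_warnings` and the message wording are assumptions made for illustration; actually sending the messages to syslog is omitted.

```python
def latency_warnings(latency_ms, high_latency_threshold=20):
    """Return one warning string per server that is unreachable or too slow.

    latency_ms: address -> latency in ms; -1 means the server is unreachable.
    """
    warnings = []
    for address, latency in sorted(latency_ms.items()):
        if latency == -1:
            warnings.append(f"TACACS+ server {address} is unreachable")
        elif latency > high_latency_threshold:
            warnings.append(
                f"TACACS+ server {address} latency {latency} ms exceeds "
                f"{high_latency_threshold} ms"
            )
    return warnings
```

A non-empty result corresponds to the events that, per the hunk, would also trigger regenerating the config file.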

lguohan

commented.

qiluo-msft · 2023-12-01T02:48:57Z

doc/aaa/TACACS+ Server Monitor.md

+; Key
+config_key           = 'config'  ;  The configuration key
+; Attributes
+enable                   = BOOLEAN   ; Enable Monitor feature


enable

There is a FEATURE table to control many optional features. I also suggest we separate the TACACS monitor feature from the TACACS auto-downgrade feature and control them with finer granularity.

Add TACACS monitor design document

31c352c

liuh-80 requested review from qiluo-msft and Yarden-Z September 11, 2023 05:29

liuh-80 marked this pull request as ready for review September 11, 2023 05:30

qiluo-msft requested a review from lguohan September 20, 2023 20:26

qiluo-msft reviewed Sep 20, 2023

doc/aaa/TACACS+ Server Monitor.md

qiluo-msft reviewed Sep 20, 2023

doc/aaa/TACACS+ Server Monitor.md

qiluo-msft reviewed Sep 20, 2023

doc/aaa/TACACS+ Server Monitor.md

qiluo-msft reviewed Sep 20, 2023

doc/aaa/TACACS+ Server Monitor.md

Improve design document

e8904c2

qiluo-msft previously approved these changes Oct 9, 2023

a-barboza reviewed Oct 11, 2023

doc/aaa/TACACS+ Server Monitor.md

doc/aaa/TACACS+ Server Monitor.md

dev/liuh/tacacs-server-monitor

d880f38

liuh-80 dismissed qiluo-msft’s stale review via d880f38 October 12, 2023 02:30

a-barboza approved these changes Oct 12, 2023

a-barboza reviewed Oct 12, 2023

qiluo-msft reviewed Oct 22, 2023

doc/aaa/TACACS+ Server Monitor.md

Update TACACS+ Server Monitor.md

3be2092

qiluo-msft previously approved these changes Oct 28, 2023

lguohan reviewed Oct 31, 2023

lguohan requested changes Oct 31, 2023

Improve design doc

972e520

liuh-80 dismissed qiluo-msft’s stale review via 972e520 November 1, 2023 06:27

liuh-80 and others added 5 commits November 1, 2023 14:32

Improve design doc

99be3fa

Add more detail to design doc

9573e13

Update TACACS+ Server Monitor.md

a584a50

Update TACACS+ Server Monitor.md

4f7250a

Update TACACS+ Server Monitor.md

f5f4f2d

qiluo-msft reviewed Dec 1, 2023

Remove update TACACS server proority part

f9e4285
