[DPB][FLEX Counters] Error seen in SDK reading counters for removed ports #19105

Open
pavannaregundi opened this issue May 28, 2024 · 9 comments
Labels
Triaged this issue has been triaged

Comments

@pavannaregundi
Contributor

pavannaregundi commented May 28, 2024

Description

After a dynamic port breakout (DPB) CLI command is executed, the SDK logs errors while reading the queue and buffer counters of the deleted ports.

2024 May 28 04:17:07.173224 sonic ERR syncd#syncd: xpSaiQueue.c:2355 Error: Queue does not exist, xpStatus: 23
2024 May 28 04:17:07.173291 sonic ERR syncd#syncd: xpSaiQueue.c:2550 Could not store the statistics for the port 6 queue 19.
2024 May 28 04:17:07.173291 sonic ERR syncd#syncd: :- collectData: Failed to get stats of Queue Counter 0x15000000060013: -19
2024 May 28 04:17:07.188896 sonic ERR syncd#syncd: xpSaiQueue.c:130 Error: Entry not found |retVal: 0
2024 May 28 04:17:07.188896 sonic ERR syncd#syncd: xpSaiQueue.c:2355 Error: Queue does not exist, xpStatus: 23
2024 May 28 04:17:07.188896 sonic ERR syncd#syncd: xpSaiQueue.c:2550 Could not store the statistics for the port 6 queue 20.
2024 May 28 04:17:07.188896 sonic ERR syncd#syncd: :- collectData: Failed to get stats of Queue Counter 0x15000000060014: -19
2024 May 28 04:17:07.188946 sonic ERR syncd#syncd: xpSaiQueue.c:130 Error: Entry not found |retVal: 0
2024 May 28 04:17:07.188946 sonic ERR syncd#syncd: xpSaiQueue.c:2355 Error: Queue does not exist, xpStatus: 23
2024 May 28 04:17:07.188974 sonic ERR syncd#syncd: xpSaiQueue.c:2550 Could not store the statistics for the port 6 queue 21.
2024 May 28 04:17:07.188974 sonic ERR syncd#syncd: :- collectData: Failed to get stats of Queue Counter 0x15000000060015: -19
2024 May 28 04:17:07.188974 sonic ERR syncd#syncd: xpSaiQueue.c:130 Error: Entry not found |retVal: 0
2024 May 28 04:17:07.189002 sonic ERR syncd#syncd: xpSaiQueue.c:2355 Error: Queue does not exist, xpStatus: 23
2024 May 28 04:17:07.189056 sonic ERR syncd#syncd: xpSaiQueue.c:2550 Could not store the statistics for the port 6 queue 22.
2024 May 28 04:17:07.189056 sonic ERR syncd#syncd: :- collectData: Failed to get stats of Queue Counter 0x15000000060016: -19
2024 May 28 04:17:07.189056 sonic ERR syncd#syncd: xpSaiQueue.c:130 Error: Entry not found |retVal: 0
2024 May 28 04:17:07.189086 sonic ERR syncd#syncd: xpSaiQueue.c:2355 Error: Queue does not exist, xpStatus: 23
2024 May 28 04:17:07.189136 sonic ERR syncd#syncd: xpSaiQueue.c:2550 Could not store the statistics for the port 6 queue 23.
2024 May 28 04:17:07.189136 sonic ERR syncd#syncd: :- collectData: Failed to get stats of Queue Counter 0x15000000060017: -19
2024 May 28 04:17:09.879108 sonic ERR syncd#syncd: xpSaiBuffer.c:1855 Could not Get Ingress Priority Group Info for 7318349394477064
2024 May 28 04:17:09.879108 sonic ERR syncd#syncd: xpSaiBuffer.c:1806 Error: Ingress priority group entry does not exist: oid - 7318349394477064
2024 May 28 04:17:09.879108 sonic ERR syncd#syncd: xpSaiBuffer.c:4977 Error: Failed to get state data for ingress pg 0x1a000000000008, saiStatus: -7
2024 May 28 04:17:09.879176 sonic ERR syncd#syncd: :- collectData: Failed to get stats of Priority Group Counter 0x1a000000000008: -7
2024 May 28 04:17:09.879176 sonic ERR syncd#syncd: xpSaiBuffer.c:1855 Could not Get Ingress Priority Group Info for 7318349394477065
2024 May 28 04:17:09.879176 sonic ERR syncd#syncd: xpSaiBuffer.c:1806 Error: Ingress priority group entry does not exist: oid - 7318349394477065
2024 May 28 04:17:09.879195 sonic ERR syncd#syncd: xpSaiBuffer.c:4977 Error: Failed to get state data for ingress pg 0x1a000000000009, saiStatus: -7
2024 May 28 04:17:09.879209 sonic ERR syncd#syncd: :- collectData: Failed to get stats of Priority Group Counter 0x1a000000000009: -7
2024 May 28 04:17:09.882426 sonic ERR syncd#syncd: xpSaiBuffer.c:1855 Could not Get Ingress Priority Group Info for 7318349394477066
2024 May 28 04:17:09.882470 sonic ERR syncd#syncd: xpSaiBuffer.c:1806 Error: Ingress priority group entry does not exist: oid - 7318349394477066
2024 May 28 04:17:09.882470 sonic ERR syncd#syncd: xpSaiBuffer.c:4977 Error: Failed to get state data for ingress pg 0x1a00000000000a, saiStatus: -7
2024 May 28 04:17:09.882470 sonic ERR syncd#syncd: :- collectData: Failed to get stats of Priority Group Counter 0x1a00000000000a: -7
2024 May 28 04:17:09.882495 sonic ERR syncd#syncd: xpSaiBuffer.c:1855 Could not Get Ingress Priority Group Info for 7318349394477067
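
The decimal and hexadecimal OIDs in the log lines above refer to the same objects, which is easy to miss when scanning syslog. The sketch below converts between the two and splits an OID into fields. The field layout (object type in bits 48 and up, port in bits 16-47, index in bits 0-15) is an assumption inferred from these logs, for example `0x15000000060013` being reported as "port 6 queue 19"; it is not taken from vendor documentation, and the port field may not be meaningful for priority-group OIDs. `decode_oid` is a hypothetical helper name.

```python
# SAI object type values as defined in the SAI headers (saitypes.h).
SAI_OBJECT_TYPES = {0x01: "PORT", 0x15: "QUEUE", 0x1A: "INGRESS_PRIORITY_GROUP"}


def decode_oid(oid: int) -> dict:
    """Split an OID into the fields the log lines above appear to use.

    Assumed layout (inferred from the logs, not from vendor documentation):
    bits 48+ hold the object type, bits 16-47 the port, bits 0-15 the index.
    """
    return {
        "hex": hex(oid),
        "type": SAI_OBJECT_TYPES.get(oid >> 48, "UNKNOWN"),
        "port": (oid >> 16) & 0xFFFFFFFF,
        "index": oid & 0xFFFF,
    }


# xpSaiBuffer.c:1806 prints the OID in decimal; xpSaiBuffer.c:4977 prints it in hex.
pg = decode_oid(7318349394477064)
# collectData prints the queue OID in hex; xpSaiQueue.c:2550 prints port and queue.
queue = decode_oid(0x15000000060013)

print(pg)
print(queue)
```

Under this assumed layout, the failing queue OIDs all belong to port 6, one of the ports the breakout deleted.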

Steps to reproduce the issue:

  1. Collect redis db dumps for reference.
redis-dump -d 5 -y -o redis_flex_before_breakout.txt
redis-dump -d 1 -y -o redis_asic_before_breakout.txt
  2. Run DPB with a mode that removes more ports than it creates. The example below converts 4x100G to 1x400G.
 # config interface breakout Ethernet0 1x400G -v -f
Do you want to Breakout the port, continue? [y/N]: y
 
Running Breakout Mode : 4x100G
Target Breakout Mode : 1x400G
 
Ports to be deleted :
 {
    "Ethernet0": "100000",
    "Ethernet2": "100000",
    "Ethernet4": "100000",
    "Ethernet6": "100000"
}
Ports to be added :
 {
    "Ethernet0": "400000"
}
  3. Collect redis db dumps again.
redis-dump -d 5 -y -o redis_flex_after_breakout.txt
redis-dump -d 1 -y -o redis_asic_after_breakout.txt
  4. Check syslogs for errors.

Describe the results you received:

From the collected logs in redis-dump,

redis_asic_before_breakout.txt 

  "ASIC_STATE:SAI_OBJECT_TYPE_HOSTIF:oid:0xd000000002bb2": {
    "expireat": 1716868800.1728191,
    "ttl": -0.001,
    "type": "hash",
    "value": {
      "SAI_HOSTIF_ATTR_NAME": "Ethernet2",
      "SAI_HOSTIF_ATTR_OBJ_ID": "oid:0x10000000005a2",
      "SAI_HOSTIF_ATTR_OPER_STATUS": "false",
      "SAI_HOSTIF_ATTR_TYPE": "SAI_HOSTIF_TYPE_NETDEV"
    }
  },

"ASIC_STATE:SAI_OBJECT_TYPE_PORT:oid:0x10000000005a2": {
    "expireat": 1716868800.1954298,
    "ttl": -0.001,
    "type": "hash",
    "value": {
      "NULL": "NULL",
      "SAI_PORT_ATTR_ADMIN_STATE": "true",
      "SAI_PORT_ATTR_AUTO_NEG_MODE": "true",
      "SAI_PORT_ATTR_FEC_MODE": "SAI_PORT_FEC_MODE_RS",
      "SAI_PORT_ATTR_HW_LANE_LIST": "2:2,3",
      "SAI_PORT_ATTR_MTU": "9122",
      "SAI_PORT_ATTR_SPEED": "100000"
    }
  },

"VIDTORID": {
    "expireat": 1716868800.2920315,
    "ttl": -0.001,
    "type": "hash",
    "value": {
 

port:
      "oid:0x10000000005a2": "oid:0x1000000000002",

queue:
      "oid:0x150000000006b2": "oid:0x15000000020000",
      "oid:0x150000000006b3": "oid:0x15000000020001",
      "oid:0x150000000006b4": "oid:0x15000000020002",
      "oid:0x150000000006b5": "oid:0x15000000020003",
      "oid:0x150000000006b6": "oid:0x15000000020004",
      "oid:0x150000000006b7": "oid:0x15000000020005",
      "oid:0x150000000006b8": "oid:0x15000000020006",
      "oid:0x150000000006b9": "oid:0x15000000020007",

redis_flex_after_breakout.txt: stale entries are still present in the flex counter DB (db 5) for the queues mapped to the deleted port.

"FLEX_COUNTER_TABLE:QUEUE_STAT_COUNTER:oid:0x150000000006b2": {
    "expireat": 1716870051.3594844,
    "ttl": -0.001,
    "type": "hash",
    "value": {
      "QUEUE_COUNTER_ID_LIST": "SAI_QUEUE_STAT_DROPPED_BYTES,SAI_QUEUE_STAT_DROPPED_PACKETS,SAI_QUEUE_STAT_BYTES,SAI_QUEUE_STAT_PACKETS"
    }
  },

  "FLEX_COUNTER_TABLE:QUEUE_STAT_COUNTER:oid:0x150000000006b3": {
    "expireat": 1716870051.7402668,
    "ttl": -0.001,
    "type": "hash",
    "value": {
      "QUEUE_COUNTER_ID_LIST": "SAI_QUEUE_STAT_DROPPED_BYTES,SAI_QUEUE_STAT_DROPPED_PACKETS,SAI_QUEUE_STAT_BYTES,SAI_QUEUE_STAT_PACKETS"
    }
  },

"FLEX_COUNTER_TABLE:QUEUE_STAT_COUNTER:oid:0x150000000006b4": {
    "expireat": 1716870051.3640532,
    "ttl": -0.001,
    "type": "hash",
    "value": {
      "QUEUE_COUNTER_ID_LIST": "SAI_QUEUE_STAT_DROPPED_BYTES,SAI_QUEUE_STAT_DROPPED_PACKETS,SAI_QUEUE_STAT_BYTES,SAI_QUEUE_STAT_PACKETS"
    }
  },

  "FLEX_COUNTER_TABLE:QUEUE_STAT_COUNTER:oid:0x150000000006b5": {
    "expireat": 1716870051.7292657,
    "ttl": -0.001,
    "type": "hash",
    "value": {
      "QUEUE_COUNTER_ID_LIST": "SAI_QUEUE_STAT_DROPPED_BYTES,SAI_QUEUE_STAT_DROPPED_PACKETS,SAI_QUEUE_STAT_BYTES,SAI_QUEUE_STAT_PACKETS"
    }
  },

  "FLEX_COUNTER_TABLE:QUEUE_STAT_COUNTER:oid:0x150000000006b6": {
    "expireat": 1716870051.728027,
    "ttl": -0.001,
    "type": "hash",
    "value": {
      "QUEUE_COUNTER_ID_LIST": "SAI_QUEUE_STAT_DROPPED_BYTES,SAI_QUEUE_STAT_DROPPED_PACKETS,SAI_QUEUE_STAT_BYTES,SAI_QUEUE_STAT_PACKETS"
    }
  },

  "FLEX_COUNTER_TABLE:QUEUE_STAT_COUNTER:oid:0x150000000006b7": {
    "expireat": 1716870051.3757164,
    "ttl": -0.001,
    "type": "hash",
    "value": {
      "QUEUE_COUNTER_ID_LIST": "SAI_QUEUE_STAT_DROPPED_BYTES,SAI_QUEUE_STAT_DROPPED_PACKETS,SAI_QUEUE_STAT_BYTES,SAI_QUEUE_STAT_PACKETS"
    }
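
One way to list the stale entries from the collected dumps is to compare the OIDs in the `FLEX_COUNTER_TABLE` keys against the `VIDTORID` hash of the ASIC DB dump taken after the breakout. The following is a minimal sketch under stated assumptions: the dumps are JSON objects keyed by redis key, as in the excerpts above, and an OID missing from `VIDTORID` means the object was removed. The function names and the sample "live" OID are hypothetical.

```python
import json

FLEX_PREFIX = "FLEX_COUNTER_TABLE:"


def load_dump(path: str) -> dict:
    """Load a redis-dump JSON file (assumed to be one JSON object)."""
    with open(path) as f:
        return json.load(f)


def stale_flex_counter_keys(flex_dump: dict, asic_dump: dict) -> list:
    """Return flex counter keys whose OID is absent from VIDTORID."""
    live_vids = set(asic_dump.get("VIDTORID", {}).get("value", {}))
    stale = []
    for key in flex_dump:
        if not key.startswith(FLEX_PREFIX):
            continue
        # Key shape: FLEX_COUNTER_TABLE:<group>:oid:0x...
        _, sep, vid = key.partition(":oid:")
        if sep and "oid:" + vid not in live_vids:
            stale.append(key)
    return sorted(stale)


# In-memory example; the second OID is a made-up "live" queue for illustration.
flex_after = {
    "FLEX_COUNTER_TABLE:QUEUE_STAT_COUNTER:oid:0x150000000006b2": {"value": {}},
    "FLEX_COUNTER_TABLE:QUEUE_STAT_COUNTER:oid:0x15000000000aaa": {"value": {}},
}
asic_after = {
    "VIDTORID": {"value": {"oid:0x15000000000aaa": "oid:0x15000000000000"}}
}

stale = stale_flex_counter_keys(flex_after, asic_after)
print(stale)
```

With the real files, `load_dump("redis_flex_after_breakout.txt")` and `load_dump("redis_asic_after_breakout.txt")` would supply the two arguments.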

Describe the results you expected:

Output of show version:

port_breakout_info.txt
redis_asic_after_breakout.txt
redis_asic_before_breakout.txt
redis_flex_after_breakout.txt
redis_flex_before_breakout.txt
syslog.txt

@pavannaregundi pavannaregundi changed the title [DPB][FLEX Counters] Error seen SDK reading counters for removed ports [DPB][FLEX Counters] Error seen in SDK reading counters for removed ports May 28, 2024
@arlakshm
Contributor

arlakshm commented Jun 5, 2024

@dgsudharsan to start an offline discussion on the change needed in sairedis.

@arlakshm arlakshm added the Triaged this issue has been triaged label Jun 5, 2024
@arlakshm
Contributor

arlakshm commented Jun 5, 2024

PR sonic-net/sonic-swss#3076 has the fix for this issue. Please retest with the latest master image.

@pavannaregundi
Contributor Author

> PR sonic-net/sonic-swss#3076 has the fix for this issue. Please retest with the latest master image.

@arlakshm Thanks for your comment.
I am using the following master commit:
https://github.com/sonic-net/sonic-buildimage/tree/a7ab698f1c7218b4ddc4db63c42918a8c3eb9eb4
I see that the above PR is already part of this master commit.

@dgsudharsan
Collaborator

@pavannaregundi From the internally attached PR, I see that the backref was missing, and hence the queue removal didn't happen in the first place. PR 3076 in SWSS addresses a different race condition, which is a statistical issue. I believe we need the YANG fix that is linked to this bug.

@pavannaregundi
Contributor Author

> @pavannaregundi From the internally attached PR, I see that the backref was missing, and hence the queue removal didn't happen in the first place. PR 3076 in SWSS addresses a different race condition, which is a statistical issue. I believe we need the YANG fix that is linked to this bug.

I patched /usr/local/yang-models/sonic-buffer-queue.yang directly on the SONiC switch and tried that change. It did not work either, so we are still investigating it internally.

@stephenxs
Collaborator

> I patched /usr/local/yang-models/sonic-buffer-queue.yang directly on the SONiC switch and tried that change. It did not work either, so we are still investigating it internally.

Can you try configuring create_only_config_db_buffers in DEVICE_METADATA|localhost? I think it should work with that flag configured.
Currently, DPB doesn't remove the queue/PG counters after the port is removed if the flag is not configured.
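
For reference, the suggested setting can be written as a CONFIG_DB JSON fragment. The sketch below only builds and prints that fragment. The string value `"true"` is an assumption, `build_fragment` is a hypothetical helper, and how the fragment is loaded into CONFIG_DB is deliberately left out.

```python
import json


def build_fragment() -> dict:
    """Build the DEVICE_METADATA|localhost fragment for the flag.

    Assumption: the flag is stored as the string "true", like other
    boolean-style fields in CONFIG_DB.
    """
    return {
        "DEVICE_METADATA": {
            "localhost": {"create_only_config_db_buffers": "true"}
        }
    }


fragment = build_fragment()
print(json.dumps(fragment, indent=4))
```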

@pavannaregundi
Contributor Author

> Can you try configuring create_only_config_db_buffers in DEVICE_METADATA|localhost? I think it should work with that flag configured.
> Currently, DPB doesn't remove the queue/PG counters after the port is removed if the flag is not configured.

@stephenxs Thanks. We will try this and get back.

@pavannaregundi
Contributor Author

@stephenxs Adding create_only_config_db_buffers.json works. However, I am not sure whether this is how it is supposed to work.
In general, if a port is removed from ASIC DB, its FLEX_COUNTER entries should also be removed.
Also, are there any other implications of using 'create_only_config_db_buffers'?

@stephenxs
Collaborator

> @stephenxs Adding create_only_config_db_buffers.json works. However, I am not sure whether this is how it is supposed to work. In general, if a port is removed from ASIC DB, its FLEX_COUNTER entries should also be removed. Also, are there any other implications of using 'create_only_config_db_buffers'?

Hi,
The orchagent should remove the PG and queue counters when a port is removed. I think this logic is missing from the DPB feature.
We partially fixed it for the case where create_only_config_db_buffers is true while we were fixing another issue.
In general, though, we should expect it to be fixed by the owner of DPB, especially for the case where the flag is not set.
When create_only_config_db_buffers is set, counters are created only for the queues/PGs that are configured in the BUFFER_PG and BUFFER_QUEUE tables.
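
To illustrate what "only the queues/PGs that are configured" means, the sketch below expands BUFFER_QUEUE or BUFFER_PG style keys such as `Ethernet0|3-4` into the set of indices that would get counters for a port. It is an illustration of the described behavior, not the orchagent implementation; `configured_indices` and the sample keys are hypothetical.

```python
def configured_indices(table_keys, port):
    """Return the set of queue/PG indices configured for `port`.

    Keys are assumed to look like "<port>|<index>" or "<port>|<first>-<last>".
    """
    indices = set()
    for key in table_keys:
        name, _, spec = key.partition("|")
        if name != port or not spec:
            continue
        first, _, last = spec.partition("-")
        indices.update(range(int(first), int(last or first) + 1))
    return indices


# Sample BUFFER_QUEUE keys, made up for illustration.
buffer_queue_keys = ["Ethernet0|0-2", "Ethernet0|3-4", "Ethernet0|6", "Ethernet2|0-2"]

eth0 = configured_indices(buffer_queue_keys, "Ethernet0")
eth6 = configured_indices(buffer_queue_keys, "Ethernet6")
print(sorted(eth0))
print(sorted(eth6))
```

A port with no entries in the table, such as `Ethernet6` here, gets no queue counters under this model, so nothing is left behind for it after a breakout.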

4 participants