-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
isisd: fix crash when calculating the neighbor spanning tree based on the fragmented LSP (backport #16365) #16398
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
… the fragmented LSP 1. When the root IS regenerates an LSP, it calls lsp_build() -> lsp_clear_data() to free the TLV memory of the first fragment and all other fragments. If the number of fragments in the regenerated LSP decreases or if no fragmentation is needed, the extra LSP fragments are not immediately deleted. Instead, lsp_seqno_update() -> lsp_purge() is called to set the remaining time to zero and start aging, while also notifying other IS nodes to age these fragments. lsp_purge() usually does not reset lsp->hdr.seqno to zero because the LSP might recover during the aging process. 2. When other IS nodes receive an LSP, they always call process_lsp() -> isis_unpack_tlvs() to allocate TLV memory for the LSP. This does not differentiate whether the received LSP has a remaining lifetime of zero. Therefore, it is rare for an LSP of a non-root IS to have empty TLVs. Of course, if an LSP with a remaining time of zero and already corrupted is received, lsp_update() -> lsp_purge() will be called to free the TLV memory of the LSP, but this scenario is rare. 3. In LFA calculations, neighbors of the root IS are traversed, and each neighbor is taken as a new root to compute the neighbor SPT. During this process, the old root IS will serve as a neighbor of the new root IS, triggering a call to isis_spf_process_lsp() to parse the LSP of the old root IS and obtain its IP vertices and neighboring IS vertices. However, isis_spf_process_lsp() only checks whether the TLVs in the first fragment of the LSP exist, and does not check the TLVs in the fragmented LSP. If the TLV memory of the fragmented LSP of the old root IS has been freed, it can lead to a null pointer access, causing the current crash. Additionally, for the base SPT, there are only two places where the LSP of the root IS is parsed: 1. When obtaining the UP neighbors of the root IS via spf_adj_list_parse_lsp(). 2. When preloading the IP vertices of the root IS via isis_lsp_iterate_ip_reach(). Both of these checks ensure that frag->tlvs is not null, and they do not subsequently call isis_spf_process_lsp() to parse the root IS's LSP. It is very rare for non-root IS LSPs to have empty TLVs unless they are corrupted LSPs awaiting deletion. If it happens, a crash will occur. The backtrace is as follows: (gdb) bt #0 0x00007f3097281fe1 in raise () from /lib/x86_64-linux-gnu/libpthread.so.0 #1 0x00007f30973a2972 in core_handler (signo=11, siginfo=0x7ffce66c2870, context=0x7ffce66c2740) at ../lib/sigevent.c:261 #2 <signal handler called> #3 0x000055dfa805512b in isis_spf_process_lsp (spftree=0x55dfa950eee0, lsp=0x55dfa94cb590, cost=10, depth=1, root_sysid=0x55dfa950ef6c "", parent=0x55dfa952fca0) at ../isisd/isis_spf.c:898 #4 0x000055dfa805743b in isis_spf_loop (spftree=0x55dfa950eee0, root_sysid=0x55dfa950ef6c "") at ../isisd/isis_spf.c:1688 #5 0x000055dfa805784f in isis_run_spf (spftree=0x55dfa950eee0) at ../isisd/isis_spf.c:1808 #6 0x000055dfa8037ff5 in isis_spf_run_neighbors (spftree=0x55dfa9474440) at ../isisd/isis_lfa.c:1259 #7 0x000055dfa803ac17 in isis_spf_run_lfa (area=0x55dfa9477510, spftree=0x55dfa9474440) at ../isisd/isis_lfa.c:2300 #8 0x000055dfa8057964 in isis_run_spf_with_protection (area=0x55dfa9477510, spftree=0x55dfa9474440) at ../isisd/isis_spf.c:1827 #9 0x000055dfa8057c15 in isis_run_spf_cb (thread=0x7ffce66c38e0) at ../isisd/isis_spf.c:1889 #10 0x00007f30973bbf04 in thread_call (thread=0x7ffce66c38e0) at ../lib/thread.c:1990 #11 0x00007f309735497b in frr_run (master=0x55dfa91733c0) at ../lib/libfrr.c:1198 #12 0x000055dfa8029d5d in main (argc=5, argv=0x7ffce66c3b08, envp=0x7ffce66c3b38) at ../isisd/isis_main.c:273 (gdb) f 3 #3 0x000055dfa805512b in isis_spf_process_lsp (spftree=0x55dfa950eee0, lsp=0x55dfa94cb590, cost=10, depth=1, root_sysid=0x55dfa950ef6c "", parent=0x55dfa952fca0) at ../isisd/isis_spf.c:898 898 ../isisd/isis_spf.c: No such file or directory. (gdb) p te_neighs $1 = (struct isis_item_list *) 0x120 (gdb) p lsp->tlvs $2 = (struct isis_tlvs *) 0x0 (gdb) p lsp->hdr $3 = {pdu_len = 27, rem_lifetime = 0, lsp_id = "\000\000\000\000\000\001\000\001", seqno = 4, checksum = 59918, lsp_bits = 1 '\001'} The backtrace provided above pertains to version 8.5.4, but it seems that the same issue exists in the code of the master branch as well. I have reviewed the process for calculating the SPT based on the LSP, and isis_spf_process_lsp() is the only function that does not check whether the TLVs in the fragments are empty. Therefore, I believe that modifying this function alone should be sufficient. If the TLVs of the current fragment are already empty, we do not need to continue processing subsequent fragments. This is consistent with the behavior where we do not process fragments if the TLVs of the first fragment are empty. Of course, one could argue that lsp_purge() should still retain the TLV memory, freeing it and then reallocating it if needed. However, this is a debatable point because in some scenarios, it is permissible for the LSP to have empty TLVs. For example, after receiving an SNP (Sequence Number PDU) message, an empty LSP (with lsp->hdr.seqno = 0) might be created by calling lsp_new. If the corresponding LSP message is discarded due to domain or area authentication failure, the TLV memory wouldn't be allocated. Test scenario: In an LFA network, importing a sufficient number of static routes to cause LSP fragmentation, and then rolling back the imported static routes so that the LSP is no longer fragmented, can easily result in this issue. Signed-off-by: zhou-run <zhou.run@h3c.com> (cherry picked from commit e905177)
riw777
approved these changes
Jul 23, 2024
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
lsp_build() -> lsp_clear_data()
to free the TLV memory of the first fragment and all other fragments. If the number of fragments in the regenerated LSP decreases or if no fragmentation is needed, the extra LSP fragments are not immediately deleted. Instead,lsp_seqno_update() -> lsp_purge()
is called to set the remaining time to zero and start aging, while also notifying other IS nodes to age these fragments.lsp_purge()
usually does not resetlsp->hdr.seqno
to zero because the LSP might recover during the aging process.process_lsp() -> isis_unpack_tlvs()
to allocate TLV memory for the LSP. This does not differentiate whether the received LSP has a remaining lifetime of zero. Therefore, it is rare for an LSP of a non-root IS to have empty TLVs. Of course, if an LSP with a remaining time of zero and already corrupted is received,lsp_update() -> lsp_purge()
will be called to free the TLV memory of the LSP, but this scenario is rare.isis_spf_process_lsp()
to parse the LSP of the old root IS and obtain its IP vertices and neighboring IS vertices. However,isis_spf_process_lsp()
only checks whether the TLVs in the first fragment of the LSP exist, and does not check the TLVs in the fragmented LSP. If the TLV memory of the fragmented LSP of the old root IS has been freed, it can lead to a null pointer access, causing the current crash.Additionally, for the base SPT, there are only two places where the LSP of the root IS is parsed:
spf_adj_list_parse_lsp()
.isis_lsp_iterate_ip_reach()
. Both of these checks ensure thatfrag->tlvs
is not null, and they do not subsequently callisis_spf_process_lsp()
to parse the root IS's LSP. It is very rare for non-root IS LSPs to have empty TLVs unless they are corrupted LSPs awaiting deletion. If it happens, a crash will occur.The backtrace is as follows:
The backtrace provided above pertains to version 8.5.4, but it seems that the same issue exists in the code of the master branch as well.
I have reviewed the process for calculating the SPT based on the LSP, and
isis_spf_process_lsp()
is the only function that does not check whether the TLVs in the fragments are empty. Therefore, I believe that modifying this function alone should be sufficient. If the TLVs of the current fragment are already empty, we do not need to continue processing subsequent fragments. This is consistent with the behavior where we do not process fragments if the TLVs of the first fragment are empty.Of course, one could argue that
lsp_purge()
should still retain the TLV memory, freeing it and then reallocating it if needed. However, this is a debatable point because in some scenarios, it is permissible for the LSP to have empty TLVs. For example, after receiving an SNP packet, an empty LSP (becauselsp->hdr.seqno = 0
, which doesn't cause problems) might be created by callinglsp_new()
. If the corresponding LSP packet is discarded due to domain or area authentication failure, the TLV memory wouldn't be allocated.Test scenario:
In an LFA network, importing a sufficient number of static routes to cause LSP fragmentation, and then rolling back the imported static routes so that the LSP is no longer fragmented, can easily result in this issue.
Signed-off-by: zhou-run zhou.run@h3c.com
This is an automatic backport of pull request #16365 done by Mergify.