30 Years of Decompilation and the Unsolved Structuring Problem: Part 2 #7

2024-01-11T12:09:17Z

giscus[bot]
bot Jan 11, 2024

A two-part series on the history of decompiler research and the fight against the unsolved control flow structuring problem. In part 1, we revisit the history of foundational decompilers and techniques, concluding with a look at modern works. In part 2, we deep-dive into the fundamentals of modern control flow structuring techniques and their limitations, and look to the future.

https://mahaloz.re/dec-history-pt2

PhilWallace · 2024-01-17T09:17:54Z

Many thanks for the inspiring post, and thanks for your constructive work! May I ask for your thoughts on the new research on recovering variable names and types? Namely, the S&P '24 paper "Len or index or count, anything but v1": Predicting Variable Names in Decompilation Output with Transfer Learning, and the DIRTY and DIRE work.

These works use language models to recover stripped information, which is quite a different path to improving decompilation from your work. I felt this path is quite limited by the quality of the dataset, not to mention the other natural challenges of the decompilation task. I'm also trying these tools, and I'm curious what you think of this trend of using AI, or even SOTA LLMs, to enhance decompilation. Some papers have already focused on this topic:

Xu, Xiangzhe, et al. "LmPa: Improving Decompilation by Synergy of Large Language Model and Program Analysis." arXiv preprint arXiv:2306.02546 (2023).

Jin, Xin, et al. "Binary code summarization: Benchmarking ChatGPT/GPT-4 and other large language models." arXiv preprint arXiv:2312.09601 (2023).

Many thanks! : )

2 replies

mahaloz Jan 17, 2024
Maintainer

@PhilWallace thanks! I've messed around with using AI in decompilation, most significantly in my work on VarBERT (the first AI paper you listed) with Ati Priya. I've found recent work on using LLMs to fully decompile assembly to be very lacking. So far, none has been able to consistently generate code that even remotely makes sense. The biggest problem for AI in decompilation is that you have A LOT of things you need to verify before outputting. When trying to go from assembly to C, that can nearly kill the project.

Instead, I think the future of AI and decompilation integration lies in letting AI make decisions where the "correct" choice is a matter of opinion. That is to say, reduce the set of choices an AI agent can make until all of them are correct, so that all that remains is to choose one. In VarBERT and DIRTY, this felt obvious: every choice for a variable name is a "correct" choice. None can cause a decompiler to output incorrect code, and thus it was a perfect place for AI. I think many other places like this exist inside decompilation, where a correct choice is best chosen by an LLM.
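To make the "every choice is correct" point concrete, here is a minimal Python sketch. It is not how VarBERT or DIRTY work internally; `apply_renames`, the sample pseudocode, and the suggested names are all hypothetical. The idea it illustrates: a model may propose any names it likes, and a small checker only has to guarantee that no two variables end up merged, so the renamed output cannot differ in meaning from the original.

```python
import re


def apply_renames(pseudocode: str, renames: dict[str, str]) -> str:
    """Rename identifiers in decompiler output without changing its meaning."""
    if not renames:
        return pseudocode
    # Two variables must never collapse into one name.
    targets = list(renames.values())
    if len(set(targets)) != len(targets):
        raise ValueError("two variables mapped to the same name")
    # A new name must not collide with an identifier that is staying as-is.
    existing = set(re.findall(r"[A-Za-z_]\w*", pseudocode))
    for new in targets:
        if new in existing and new not in renames:
            raise ValueError(f"{new!r} already names something else")
    # Single pass over whole identifiers, so v1 never matches inside v10.
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, renames)) + r")\b")
    return pattern.sub(lambda m: renames[m.group(1)], pseudocode)


# Hypothetical decompiler output and hypothetical model suggestions.
decompiled = "int v1 = strlen(a1); for (int v2 = 0; v2 < v1; v2++) a1[v2] = 0;"
suggested = {"a1": "buf", "v1": "len", "v2": "i"}
print(apply_renames(decompiled, suggested))
```

This sketch works on raw text, so it would also rename matches inside string literals or comments; a real tool would rename through the decompiler's own variable objects instead.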

"I felt this path is quite limited by the quality of the dataset"

Indeed, that pain was also felt on VarBERT, but luckily we have a dataset now :).

PhilWallace Jan 18, 2024 — with giscus

Many thanks for the reply! It is great to see that DAILA has been developed on top of IDA and Ghidra. I will definitely try it.

I feel the same: LLMs do some jobs well and others not so well. In particular, I found LLMs very helpful in explaining pseudocode (e.g., from the IDA Hex-Rays decompiler), which saves a lot of effort in reverse engineering. Moreover, I think the ability to handle iterative and interactive prompting and querying is very intriguing, and it is something previous tools do not have (as we know, IDA is good because it is Interactive). Although LLMs suffer from their own limitations, like hallucinations, they should have the potential to assist with some binary analysis problems as well.

GregoryMorse · 2024-08-29T12:59:58Z

"A perfect decompiler should produce a 0 CFGED, meaning 0 graph edit distance, and the same gotos as the source."

This seems to be an incorrect statement, based on the fallacy that two different source codes cannot produce the same compiled output, especially in the context of compiler optimizations. In fact, there is ambiguity: I think it can be proven that a source littered with gotos, for example, could produce the same output as structured code. So I think the claim should be revised to target the most structured source version that would compile to produce such a byte code represented by such a CFG.

3 replies

mahaloz Aug 29, 2024
Maintainer

Hi @GregoryMorse, thanks for the read! I think you bring up some interesting points; however, I stand by my earlier claim that a "perfect" decompiler gets you the exact source code that went into the compiler, which would indicate a 0 CFGED.

"two different source codes can produce the same compiled output"

Of course, this is true; compilation is a many-to-one process, where many sources, X1, X2, ..., result in the same Y. However, I argue that of those many X's, only a handful look like something an actual human would write. We observed this phenomenon in our recent work SAILR¹; we additionally found that there are not as many choices of X as you might think. All of this boils down to the idea that there are fewer choices than previously believed and that, in many cases, there is a way to distinguish which one is more likely to be correct. For decompiler research to progress, I think we ought to target lowering CFGED across the board and approaching 0.
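As a rough illustration of what "0 CFGED" means, here is a minimal Python sketch. It is a simplification and not the CFGED implementation from SAILR: it assumes the basic blocks of the two graphs have already been matched up by label (the hard part of a real graph edit distance), and it simply counts the node and edge insertions and deletions needed to turn one graph into the other. The function name and the block labels are hypothetical.

```python
def cfg_edit_distance(source_cfg, decompiled_cfg):
    """Count node/edge insertions and deletions between two labelled CFGs.

    Each CFG is a (nodes, edges) pair; edges are (from_label, to_label) tuples.
    Assumes node labels are already aligned between the two graphs.
    """
    src_nodes, src_edges = source_cfg
    dec_nodes, dec_edges = decompiled_cfg
    # Symmetric difference = items present in exactly one of the two graphs.
    return len(set(src_nodes) ^ set(dec_nodes)) + len(set(src_edges) ^ set(dec_edges))


# CFG of a plain if/else in the original source (hypothetical labels).
source = (
    {"entry", "then", "else", "exit"},
    {("entry", "then"), ("entry", "else"), ("then", "exit"), ("else", "exit")},
)

# Decompiled output where the else branch jumps to a duplicated exit block.
decompiled = (
    {"entry", "then", "else", "exit", "exit_copy"},
    {("entry", "then"), ("entry", "else"), ("then", "exit"), ("else", "exit_copy")},
)

print(cfg_edit_distance(source, source))      # 0: identical structure
print(cfg_edit_distance(source, decompiled))  # 3: one extra node, two edge changes
```

Under this measure, two different sources that share the same CFG also score 0 against each other, which is the many-to-one ambiguity discussed in this thread.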

"the most structured source version that would compile to produce such a byte code"

It's troubling to target the most structured version of code, since "structured" is rather ambiguous. Previous works² have explored what it means to be structured, but this has led us down a path of decompilation that really looks nothing like source. That's not always bad, but it shows some weakness in the approach. As you likely know, the majority of decompilers used by people today avoid this type of approach altogether.

As for the byte-code match, it is an interesting concept and is still open to research. The one flaw, however, is that a byte-match indicates nothing about readability or closeness to source. If your decompiler's goal is to be readable, then you may have two conflicting metrics. We had some findings about this in Figure 9 (Section 7) of the SAILR paper mentioned earlier. I still like this area of research though, as it should have some cool new findings :).

¹ Basque, Zion Leonahenahe, et al. "Ahoy SAILR! There is No Need to DREAM of C: A Compiler-Aware Structuring Algorithm for Binary Decompilation." 33rd USENIX Security Symposium (USENIX Security 24). 2024.
² Yakdan, Khaled, et al. "No More Gotos: Decompilation Using Pattern-Independent Control-Flow Structuring and Semantic-Preserving Transformations." NDSS. 2015.

GregoryMorse Aug 29, 2024 — with giscus

Thanks for replying. So yes, I see what you mean. I mostly meant the many-to-one make it of course so that original source recovery is impossible, but it is also not really relevant, as an improved recovery would actually be more valuable than an original source which was poorly structured. But when you say "look like something an actual human would write", there is some subjectivity in such a definition, and it would be interesting to see how to precisely pin this one down. Humans sometimes write code generators which produce rather non-human code, for example. I agree that terms like "most structured" are pretty unclear.

In fact, I would dare to say it is many-to-many. Different compilers with different flags/optimization levels will also take the same source and produce multiple outputs.

So decompilation, in my opinion, really needs a huge pattern database which is per-compiler and per-optimization-setting. The decompiler must be given a target compiler and target optimization settings for its input, and then it applies only the relevant patterns. There are cases where a highly optimized binary could not be produced from source, without compiler optimizations, unless that source used a lot of gotos, for example. As for SAILR, I feel this generalization is possible and almost the next level: not just compiler-aware but optimization-aware. In fact, I feel this is a near precondition to even coming close to a byte-exact study.
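The per-compiler, per-optimization lookup described above could be sketched in Python roughly as follows. Everything here is hypothetical: the compiler names are placeholders, the pattern names are illustrative labels rather than functions of any real decompiler, and the table makes no claim about which optimizations any real compiler enables at which level.

```python
from typing import NamedTuple


class BuildProfile(NamedTuple):
    compiler: str   # placeholder compiler name
    opt_level: str  # e.g. "O0", "O2"


# Hypothetical pattern table: which reversal patterns to try for which build.
REVERSAL_PATTERNS = {
    BuildProfile("compiler_a", "O0"): [],
    BuildProfile("compiler_a", "O2"): ["undo_jump_threading", "undo_cross_jumping"],
    BuildProfile("compiler_b", "O2"): ["undo_jump_threading"],
}


def patterns_for(profile: BuildProfile) -> list[str]:
    """Return only the patterns relevant to the given compiler and settings."""
    patterns = REVERSAL_PATTERNS.get(profile)
    if patterns is None:
        raise ValueError(f"no pattern set for {profile.compiler} -{profile.opt_level}")
    return patterns


print(patterns_for(BuildProfile("compiler_a", "O2")))
```

Keying on the full (compiler, optimization level) pair means an unoptimized build gets no reversal patterns at all, which matches the point that patterns should only be applied where the corresponding optimization could have run.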

After all, compilers and their optimizations are just a big database with a big set of patterns, so there is no reason the reverse process would not need the same thing in reverse, at least for reversible optimizations. Obviously, things like dead code elimination or single-use variable replacement are simply lost information. But interestingly, reverse optimizations would sometimes overly structure code which was already optimized to begin with, making them more powerful than for decompilers but refactoring in general.

mahaloz Aug 29, 2024
Maintainer

"I mostly meant the many-to-one make it of course so that original source recovery is impossible"

Like many "impossible" problems, approximation is the key. As with traditional language translation (Spanish -> English -> Spanish), people have argued that it is inherently impossible to recover the exact original sentence that was translated because of the many-to-many problem. That is true for a fully sound approach. However, we all agree that an approach that does perfect recovery in 90% of cases is a great approach. If decompilation were as good as modern language translation, I think we would have decompilers that very often get you that exact source. My goal, at least with approaching 0 CFGED, is not to be right in 100% of cases, but to be right in the majority.

"how to precisely pin this one down"

I agree :), it will be tricky, but I think it has a future.

"making them more powerful than for decompilers but refactoring"

For sure! I like decompilation because it seems to be a meeting point for so many different fields of research. It may be possible to transfer things learned in decompilation to other places.
