mirror of https://github.com/albertan017/LLM4Decompile.git synced 2026-06-17 01:55:50 +00:00

Reverse Engineering: Decompiling Binary Code with Large Language Models https://aclanthology.org/2024.emnlp-main.203

binary decompile large-language-models reverse-engineering

Python 92.1%
Shell 7.2%
Dockerfile 0.7%

Find a file

Qi Luo a3ebf92fc7 update text-generation-inference script		2024-02-28 22:08:54 +08:00
data	update text-generation-inference script	2024-02-28 22:08:54 +08:00
evaluation	update text-generation-inference script	2024-02-28 22:08:54 +08:00
samples	Add files via upload	2024-02-28 17:01:14 +08:00
scripts	update text-generation-inference script	2024-02-28 22:08:54 +08:00
.gitignore	update text-generation-inference script	2024-02-28 22:08:54 +08:00
LICENSE	Initial commit	2024-02-28 11:16:45 +08:00
README.md	Update README.md	2024-02-28 21:21:09 +08:00
requirements.txt	update text-generation-inference script	2024-02-28 22:08:54 +08:00

README.md

LLM4Decompile

Reverse Engineering: Decompiling Binary Code with Large Language Models

1. Introduction of LLM4Decompile and Decompile-Eval

Our objective is to create and release the first open-source LLM dedicated to decompilation, and to assess its capabilities by constructing the first decompilation benchmark focused on re-compilability and re-executable.

We start by compiling a million C code samples from AnghaBench into assembly code using GCC with different configurations, forming a dataset of assembly-source pairs in 4 billion tokens. We then finetune the DeepSeek-Coder model, a leading-edge code LLM, using this dataset. Followed by constructing the evaluation benchmark, Decompile-Eval, based on HumanEval questions and test samples. Specifically, we formulate the evaluation from two perspectives: whether the decompiled code can recompile successfully, and whether it passes all assertions in the test cases.

Figure 1 presents the steps involved in our decompilation evaluation. First, the source code (denoted as src) is compiled by the GCC compiler with specific parameters, such as optimization levels, to produce the executable binary. This binary is then disassembled into assembly language (asm) using the objdump tool. The assembly instructions are subsequently decompiled to reconstruct the source code in a format that's readable to humans (noted as src'). To assess the quality of the decompiled code (src'), it is tested for its ability to be recompiled with the original GCC compiler (re-compilability) and for its functionality through test assertions (re-executability).

2. Evaluation Results

Metrics

Re-compilability and re-executability serve as critical indicators in validating the effectiveness of a decompilation process. When decompiled code can be recompiled, it provides strong evidence of syntactic integrity. It ensures that the decompiled code is not just readable, but also adheres to the structural and syntactical standards expected by the compiler. However, syntax alone does not guarantee semantic equivalence to the original pre-compiled program. Re-executability provides this critical measure of semantic correctness. By re-compiling the decompiled output and running the test cases, we assess if the decompilation preserved the program logic and behavior. Together, re-compilability and re-executability indicate syntax recovery and semantic preservation - both essential for usable and robust decompilation.

Results

3. How to Use The Model

Our LLM4Decompile includes models with sizes between 1.3 billion and 33 billion parameters, and we have made these models available on Hugging Face.

llm4decompile-1.3b

llm4decompile-6.7b

llm4decompile-33b

Here give an example of how to use our model.

Preprocessing: compile the C code into binary, disassemble the binary into assembly instructions.

import subprocess
import os
import re

digit_pattern = r'\b0x[a-fA-F0-9]+\b'# binary codes in Hexadecimal
zeros_pattern = r'^0+\s'#0s
OPT = ["O0", "O1", "O2", "O3"]
fileName = 'path/to/file'
with open(fileName+'.c','r') as f:#original file
    c_func = f.read()
for opt_state in OPT:
    output_file = fileName +'_' + opt_state
    input_file = fileName+'.c'
    compile_command = f'gcc -c -o {output_file}.o {input_file} -{opt_state} -lm'#compile the code with GCC on Linux
    subprocess.run(compile_command, shell=True, check=True)
    compile_command = f'objdump -d {output_file}.o > {output_file}.s'#disassemble the binary file into assembly instructions
    subprocess.run(compile_command, shell=True, check=True)
    
    input_asm = ''
    asm = read_file(output_file+'.s')
    asm = asm.split('Disassembly of section .text:')[-1].strip()
    for tmp in asm.split('\n'):
        tmp_asm = tmp.split('\t')[-1]#remove the binary code
        tmp_asm = tmp_asm.split('#')[0].strip()#remove the comments
        input_asm+=tmp_asm+'\n'
    input_asm = re.sub(zeros_pattern, '', input_asm)
    before = f"# This is the assembly code with {opt_state} optimization:\n"#prompt
    after = "\n# What is the source code?\n"#prompt
    input_asm_prompt = before+input_asm.strip()+after
    with open(fileName +'_' + opt_state +'.asm','w',encoding='utf-8') as f:
        f.write(input_asm_prompt)

Decompilation: use LLM4Decompile to translate the assembly instructions into C:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_path = 'arise-sustech/llm4decompile-1.3b'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path,torch_dtype=torch.bfloat16).cuda()

with open(fileName +'_' + opt_state +'.asm','r') as f:#original file
    asm_func = f.read()
inputs = tokenizer(asm_func, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=200)
c_func_decompile = tokenizer.decode(outputs[0][len(inputs[0]):-1])

4. How to use Decompile-Eval

Data are stored in llm4decompile/decompile-eval/*.json, using JSON list format. There are five keys:

task_id: indicates the ID of the problem.
type: the optimization stage, is one of [O0, O1, O2, O3].
c_func: C solution for HumanEval problem.
c_test: C test assertions.
input_asm_prompt: assembly instructions with prompts, can be derived as in our preprocessing example.

To run the evaluation on Single GPU:

on going

To run the evaluation using TGI:

on going

5. On Going

LLM4Binary: We plan to include larger dataset to pre-train the model with assembly code and C code.

6. License

This code repository is licensed under the MIT License.

7. Contact

If you have any questions, please raise an issue.