mirror of https://github.com/albertan017/LLM4Decompile.git synced 2026-06-17 01:55:50 +00:00

BaiRiDreamer 239cba2673 feat(sk2decompile): add BringUpBench evaluation pipeline and results

Integrate BringUpBench evaluation into sk2decompile/evaluation/bringupbench/,
corresponding to Section A.6 of the paper (arXiv:2509.22114).

BringUpBench is a benchmark suite of 90 self-contained C programs (505 functions,
O0-O3). SK2Decompile achieves 42.3% compilation rate and 27.0% re-executability
rate, compared to IDA Pro's 23.6% / 21.7%.

Contents:
- scripts/: 5-step reproduction pipeline (compile, decompile, map, infer, eval)
- data/func_maps/: pre-built function-level mappings (source <-> pseudo <-> asm)
- data/infer_results/: SK2Decompile inference outputs for all opt levels
- reports/: per-opt-level evaluation result summaries (Markdown)
- config.env: template environment configuration
- README.md: comprehensive documentation with reproduction guide

Also updated sk2decompile/README.md to reference BringUpBench evaluation.

2026-02-12 00:02:25 +08:00

10 KiB

Raw Permalink Blame History

SK²Decompile

SK²Decompile: LLM-based Two-Phase Binary Decompilation from Skeleton to Skin

🚀 Quick Start | 🤖 Training Pipeline | 📊 Evaluation | 📝 Citation

Overview

SK²Decompile is a novel two-phase framework for binary decompilation using Large Language Models (LLMs). Our approach decomposes the complex decompilation task into two manageable phases:

Phase 1 Structure Recovery (Skeleton): Transform binary/pseudo-code into obfuscated intermediate representations 🤗 HF Link
Phase 2 Identifier Naming (Skin): Generate human-readable source code with meaningful identifiers 🤗 HF Link

This repository contains the complete implementation of our paper, including data preprocessing tools, training scripts, and evaluation benchmarks.

Architecture

Our two-phase approach is inspired by the skeleton-to-skin metaphor:

Binary/Pseudo-code → [Phase 1: Skeleton] → Normalized IR → [Phase 2: Skin] → Source Code
                          ↓                                        ↓
                  (Structure Extraction)                   (Identifier Recovery)

Repository Structure

SK2Decompile/
├── Preprocess/        # Data preprocessing and normalization tools
├── LLaMA-Factory/     # Supervised Fine-Tuning (SFT) implementation
├── verl/              # Reinforcement Learning (RL) with VERL/GRPO
│   └── SK2DECOMPILE/
│       ├── data/              # Example RL training data
│       ├── reward_functions/  # Custom reward functions (4 variants)
│       ├── scripts/           # Training launch scripts
│       └── README.md          # Detailed RL documentation
├── evaluation/        # Comprehensive evaluation suite
│   ├── bringupbench/          # BringUpBench evaluation (Section A.6)
│   │   ├── scripts/           # Pipeline scripts (compile, decompile, evaluate)
│   │   ├── data/              # Pre-built function maps and inference results
│   │   ├── reports/           # Evaluation result summaries
│   │   └── README.md          # Detailed BringUpBench documentation
│   └── ...                    # HumanEval, MBPP evaluation scripts
└── README.md          # This file

Quick Start

Prerequisites

Python 3.8+
CUDA 11.0+
PyTorch 2.0+
40GB+ GPU memory (recommended)
Psychec (for data preprocessing)

Installation

git clone https://github.com/yourusername/SK2Decompile.git
cd SK2Decompile

Training Pipeline

Phase 0: Data Preprocessing

Transform raw pseudo-code into normalized representations suitable for training:

cd Preprocess

# Requirements
pip install tree-sitter==0.24.0 tree-sitter-c==0.23.4 tqdm

# Step 1: Normalize pseudo-code according to R2I standard
python3 normalize_pseudo.py --input_json exebench_c.json --output_json exebench_pseudonorm.json --key_name pseudo

# Step 2: Obfuscate source code to generate IR
python3 normalize_src_basedonpseudo.py --input_json exebench_pseudonorm.json --output_json exebench_norm_top0.json --top 0 --pseudo pseudo_norm

# Step 3: Format codes with clang-format
python3 format.py --input exebench_norm_top0.json --output exebench_format_top0.json

# Step 4: Infer types for obfuscated IR (used for compiler-based rewards)
python3 inf_type.py --input_json train_format_top0.json --output_name train_format_top0_type \
    --generator ../psychec/psychecgen --solver ../psychec/psychecsolver-exe --split 2 --idx 0

Phase 1: Supervised Fine-Tuning (SFT)

Our two-phase SFT approach trains specialized models for each transformation:

Setup LLaMA-Factory

cd ../LLaMA-Factory
# Follow installation instructions in LLaMA-Factory/README.md

Train Models

# Train Skeleton Model (pseudo2norm)
llamafactory-cli train LLaMA-Factory/SK2DECOMPILE/train/pseudo2norm-example.yaml

# Train Skin Model (norm2code)
llamafactory-cli train LLaMA-Factory/SK2DECOMPILE/train/norm2code-example.yaml

Sample Training Data:

Pseudo2Norm: LLaMA-Factory/SK2DECOMPILE/data/pseudo2norm-examples.jsonl
Norm2Code: LLaMA-Factory/SK2DECOMPILE/data/norm2code-examples.jsonl

Phase 2: Reinforcement Learning (RL)

After SFT, we apply GRPO (Group Relative Policy Optimization) to further align each model with task-specific objectives (Section 3.5 of the paper):

Structure Recovery
Identifier Naming

Our RL training is based on VERL v0.4.1 (Sheng et al., 2024).

Setup VERL

git clone https://github.com/volcengine/verl.git
cd verl && git checkout v0.4.1 && pip install -e .
pip install tree-sitter==0.24.0 tree-sitter-c==0.23.4 openai

Run RL Training

# Structure Recovery RL
bash verl/SK2DECOMPILE/scripts/run_struct_rl.sh

# Identifier Naming RL (requires embedding server)
bash verl/SK2DECOMPILE/scripts/run_ident_rl.sh

See verl/SK2DECOMPILE/README.md for the full reproduction guide, including how to integrate reward functions into VERL and prepare training data.

RL Training Data: verl/SK2DECOMPILE/data/sk2decompile-rl-examples.jsonl

Evaluation

cd ../evaluation

Inference

pip install vllm
apt install clang-format
#translate the data into reverse_sample.json format
python normalize_pseudo.py --input_json reverse_sample.json --output_json reverse_sample.json
python sk2decompile.py --dataset_path reverse_sample.json --model_path LLM4Binary/sk2decompile-struct-6.7b --recover_model_path LLM4Binary/sk2decompile-ident-6.7b

Detailed version 0. Install vllm and transformers via pip; install clang-format via apt.

Prepare a Linux-x64 executable file (ELF).
Use IDA to decompile it (you can also simply use this website: https://dogbolt.org/).
Convert the data into the corresponding format (https://huggingface.co/LLM4Binary/sk2decompile-struct-6.7b/blob/main/reverse_sample.json)

Run inference:

python normalize_pseudo.py --input_json reverse_sample.json --output_json reverse_sample.json
python sk2decompile.py --dataset_path reverse_sample.json \
    --model_path LLM4Binary/sk2decompile-struct-6.7b \
    --recover_model_path LLM4Binary/sk2decompile-ident-6.7b

Model page: https://huggingface.co/LLM4Binary/sk2decompile-struct-6.7b

Project overview: https://github.com/albertan017/LLM4Decompile/tree/main/sk2decompile

Notes:

IDA decompilation results should be preprocessed before inference.
Use vllm to recover function structure (sk2decompile-struct) and variable names (sk2decompile-ident) step by step.
Training was done on C language Linux-x64 code with IDA pseudocode; performance may degrade for other languages or architectures.

Comprehensive evaluation on standard benchmarks:

# evaluate exe_rate
python evaluate_exe.py --json_file your_json_file_path
                       --dcompilers decompiler1,decompiler2,...,decompilerN
# evaluate r2i
python evaluate_r2i.py --json_file your_json_file_path
                       --dcompilers decompiler1,decompiler2,...,decompilerN
                       --output_path your_output_path
# evaluate gpt-judge
python gpt_judge.py --json_file your_json_file_path
                    --decompilers decompiler1,decompiler2,...,decompilerN
                    --opt OPT
                    --api_key your_openai_api_key

BringUpBench Evaluation (Section A.6 of the paper)

We also evaluate on BringUpBench — 90 self-contained C programs with 505 functions across O0–O3. SK²Decompile achieves 42.3% compilation rate and 27.0% re-executability rate, compared to IDA Pro's 23.6% / 21.7%.

See evaluation/bringupbench/README.md for the full reproduction pipeline, pre-built data, and detailed results.

📊 Results

Our approach achieves state-of-the-art performance:

Metric	Dataset	Improvement
Re-executability	HumanEval	+21.6% over GPT-5-mini
R2I Score	GitHub2025	+29.4% over Idioms

🔬 Key Innovations

Two-Phase Decomposition: Separating structure recovery from identifier prediction
Compiler-Based RL: Using compiler feedback as reward signal
Generic Placeholders: Language-agnostic intermediate representation
Independent Optimization: Separate RL objectives for each phase

Citation

If you use SK²Decompile in your research, please cite our paper:

@article{tan2025sk2decompile,
  title={SK2Decompile: LLM-based Two-Phase Binary Decompilation from Skeleton to Skin},
  author={Tan, Hanzhuo and Li, Weihao and Tian, Xiaolong and Wang, Siyi and Liu, Jiaming and Li, Jing and Zhang, Yuqun},
  journal={arXiv preprint arXiv:2509.22114},
  year={2025}
}

Contributing

We welcome contributions! Areas of interest:

Support for additional architectures (ARM, RISC-V)
Integration with more decompilation tools
Improved intermediate representations
Multi-language support

License

This project is licensed under the MIT License. See LICENSE file for details.

Acknowledgments

We thank the developers of:

LLaMA-Factory for the SFT framework
VERL for the RL implementation
Psychec for C type inference

For detailed documentation on each component, please refer to the individual README files in each directory.

10 KiB Raw Permalink Blame History Unescape Escape