LLM4Decompile

mirror of https://github.com/albertan017/LLM4Decompile.git synced 2026-06-17 01:55:50 +00:00

How to Fine-tune LLM4Decompile

We provide the script finetune.py, adapted from the deepseek-coder repository.

The script supports training with DeepSpeed. Install the required packages with:

pip install -r requirements.txt

If you want to leverage FlashAttention to accelerate training, install it via:

pip install flash-attn

Download the decompile-ghidra-100k dataset to your workspace and convert it to a JSON file in which each line is a JSON-serialized object with two required fields, instruction and output.

After preparing the data, you can use the sample shell script below to fine-tune the llm4decompile model. Remember to set DATA_PATH and OUTPUT_PATH.
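The one-object-per-line layout described above can be sketched in Python as follows. This is a minimal illustration, not part of the repository: the helper names write_jsonl and validate_jsonl are hypothetical, and how you obtain the (instruction, output) pairs from the downloaded dataset depends on its own column names, which are not assumed here.

```python
import json
from pathlib import Path


def write_jsonl(pairs, path):
    """Write (instruction, output) string pairs as one JSON object per line."""
    with Path(path).open("w", encoding="utf-8") as f:
        for instruction, output in pairs:
            record = {"instruction": instruction, "output": output}
            # json.dumps escapes embedded newlines, so each record stays on one line.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def validate_jsonl(path):
    """Return the number of records; raise ValueError on a malformed line."""
    count = 0
    with Path(path).open(encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            if not isinstance(record, dict):
                raise ValueError(f"line {lineno}: expected a JSON object")
            for field in ("instruction", "output"):
                if not isinstance(record.get(field), str):
                    raise ValueError(
                        f"line {lineno}: missing or non-string field {field!r}"
                    )
            count += 1
    return count
```

Running validate_jsonl on the prepared file before training is a cheap way to catch records that lack one of the two required fields.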

WORKSPACE="/workspace"
DATA_PATH="${WORKSPACE}/decompile-ghidra-100k.json"
OUTPUT_PATH="${WORKSPACE}/output_models/llm4decompile-ref"
MODEL_PATH="deepseek-ai/deepseek-coder-1.3b-base"

CUDA_VISIBLE_DEVICES=0 deepspeed finetune.py \
    --model_name_or_path "$MODEL_PATH" \
    --data_path "$DATA_PATH" \
    --output_dir "$OUTPUT_PATH" \
    --num_train_epochs 2 \
    --model_max_length 1024 \
    --per_device_train_batch_size 16 \
    --gradient_accumulation_steps 16 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --use_flash_attention \
    --save_steps 100 \
    --save_total_limit 100 \
    --learning_rate 2e-5 \
    --max_grad_norm 1.0 \
    --weight_decay 0.1 \
    --warmup_ratio 0.025 \
    --logging_steps 1 \
    --lr_scheduler_type "cosine" \
    --gradient_checkpointing True \
    --report_to "tensorboard" \
    --bf16 True
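
For a rough sense of what the flags above imply, the following Python sketch estimates the effective batch size, the number of optimizer steps, and the warmup length. It assumes about 100,000 training examples (inferred from the dataset name only), a single GPU as in CUDA_VISIBLE_DEVICES=0, and simplified rounding; the exact step count reported by the trainer may differ slightly. The function name schedule_summary is hypothetical.

```python
import math


def schedule_summary(num_examples, per_device_batch, grad_accum,
                     num_gpus, epochs, warmup_ratio):
    """Estimate the training schedule implied by the command-line flags."""
    # Sequences consumed per optimizer step across all devices.
    effective_batch = per_device_batch * grad_accum * num_gpus
    # Simplified rounding: one partial batch at the end of each epoch counts as a step.
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


# Values taken from the sample script; num_examples is an assumption.
summary = schedule_summary(
    num_examples=100_000,
    per_device_batch=16,
    grad_accum=16,
    num_gpus=1,
    epochs=2,
    warmup_ratio=0.025,
)
print(summary)
```

Under these assumptions each optimizer step sees 256 sequences, and the run has roughly 782 steps with about 20 warmup steps. With --save_steps 100, that yields only a handful of checkpoints, so the --save_total_limit 100 cap would not be reached.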

The following is a simple demo of how to construct the training data. Note that we use ExeBench as our final dataset.

Before compiling, please clone the AnghaBench dataset.

git clone https://github.com/brenocfg/AnghaBench

Then use the following script to compile AnghaBench:

python compile.py --root Anghabench_path --output AnghaBench_compile.jsonl
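
To illustrate the kind of work a compilation step like this involves, here is a minimal Python sketch that compiles one C file at several optimization levels and disassembles the result. It is not the repository's compile.py: the choice of gcc and objdump, the O0 to O3 levels, and the function names build_commands and compile_and_disassemble are all assumptions made for illustration.

```python
import subprocess
from pathlib import Path

# Assumed optimization levels; adjust to match your setup.
OPT_LEVELS = ("O0", "O1", "O2", "O3")


def build_commands(src, opt, workdir="."):
    """Return the (compile, disassemble) command lists for one source file."""
    obj = Path(workdir) / f"{Path(src).stem}_{opt}.o"
    compile_cmd = ["gcc", "-c", f"-{opt}", str(src), "-o", str(obj)]
    disasm_cmd = ["objdump", "-d", str(obj)]
    return compile_cmd, disasm_cmd


def compile_and_disassemble(src, workdir="."):
    """Map each optimization level to its disassembly; skip levels that fail."""
    results = {}
    for opt in OPT_LEVELS:
        compile_cmd, disasm_cmd = build_commands(src, opt, workdir)
        # Some benchmark files may not compile; skip them rather than abort.
        if subprocess.run(compile_cmd, capture_output=True).returncode != 0:
            continue
        proc = subprocess.run(disasm_cmd, capture_output=True, text=True)
        if proc.returncode == 0:
            results[opt] = proc.stdout
    return results
```

Pairing each disassembly with the original C source then gives one (assembly, source) example per optimization level, which can be written out one JSON object per line.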