mirror of
https://github.com/albertan017/LLM4Decompile.git
synced 2026-06-17 01:55:50 +00:00
How to Fine-tune LLM4Decompile
We provide the script finetune.py, adapted from the deepseek-coder repository.
The script supports training with DeepSpeed. Install the required packages with:
pip install -r requirements.txt
If you want to leverage FlashAttention to accelerate training, install it via:
pip install flash-attn
Please download the decompile-ghidra-100k dataset to your workspace and process it into JSON format, one record per line.
Each line is a JSON-serialized string with two required fields, instruction and output.
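The expected file layout can be sketched as follows. This is a minimal illustration, not part of the repository: the helper names write_jsonl and check_jsonl are hypothetical, and it assumes your records are already available as Python dicts that carry instruction and output keys (how you map the raw dataset columns onto those two fields depends on the dataset release you downloaded).

```python
import json
from pathlib import Path

REQUIRED_FIELDS = {"instruction", "output"}


def write_jsonl(records, out_path):
    """Write one JSON object per line, keeping only the two required fields.

    Hypothetical helper: `records` is any iterable of dicts that already
    contain "instruction" and "output" keys.
    """
    out_path = Path(out_path)
    with out_path.open("w", encoding="utf-8") as f:
        for rec in records:
            row = {"instruction": rec["instruction"], "output": rec["output"]}
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
    return out_path


def check_jsonl(path):
    """Return the number of valid records; raise ValueError on a bad line."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            row = json.loads(line)  # JSONDecodeError is a ValueError
            if not isinstance(row, dict):
                raise ValueError(f"line {lineno}: expected a JSON object")
            missing = REQUIRED_FIELDS - row.keys()
            if missing:
                raise ValueError(f"line {lineno}: missing {sorted(missing)}")
            count += 1
    return count
```

Running check_jsonl on the file you pass as DATA_PATH before launching training is a cheap way to catch a missing field early.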
After preparing the data, you can use the sample shell script below to fine-tune the llm4decompile model.
Remember to set DATA_PATH and OUTPUT_PATH.
WORKSPACE="/workspace"
DATA_PATH="${WORKSPACE}/decompile-ghidra-100k.json"
OUTPUT_PATH="${WORKSPACE}/output_models/llm4decompile-ref"
MODEL_PATH="deepseek-ai/deepseek-coder-1.3b-base"
CUDA_VISIBLE_DEVICES=0 deepspeed finetune.py \
--model_name_or_path "$MODEL_PATH" \
--data_path "$DATA_PATH" \
--output_dir "$OUTPUT_PATH" \
--num_train_epochs 2 \
--model_max_length 1024 \
--per_device_train_batch_size 16 \
--gradient_accumulation_steps 16 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--use_flash_attention \
--save_steps 100 \
--save_total_limit 100 \
--learning_rate 2e-5 \
--max_grad_norm 1.0 \
--weight_decay 0.1 \
--warmup_ratio 0.025 \
--logging_steps 1 \
--lr_scheduler_type "cosine" \
--gradient_checkpointing True \
--report_to "tensorboard" \
--bf16 True
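To see how the batch-related flags above interact, the arithmetic can be sketched as below. This is an illustration under stated assumptions, not code from the repository: the function names are hypothetical, the single GPU comes from CUDA_VISIBLE_DEVICES=0, and the 100,000-example figure is only inferred from the dataset name. The step count is an approximation; the trainer's own rounding may differ slightly.

```python
import math


def effective_batch_size(per_device_batch, grad_accum_steps, num_gpus=1):
    """Sequences that contribute to each optimizer update."""
    return per_device_batch * grad_accum_steps * num_gpus


def approx_optimizer_steps(num_examples, epochs, batch):
    """Rough number of optimizer updates over the whole run (approximate)."""
    return math.ceil(num_examples / batch) * epochs


# Values taken from the sample script: batch size 16, accumulation 16, one GPU.
batch = effective_batch_size(16, 16, num_gpus=1)
print(batch)  # 256

# Assumed dataset size of 100,000 examples, trained for 2 epochs.
steps = approx_optimizer_steps(100_000, epochs=2, batch=batch)
print(steps)
```

If you train on more GPUs, lowering --gradient_accumulation_steps by the same factor keeps the effective batch size unchanged.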
A simple demo of constructing the training data. Note that we use ExeBench as our final dataset.
Before compiling, please clone the AnghaBench dataset.
git clone https://github.com/brenocfg/AnghaBench
Then use the following script to compile AnghaBench:
python compile.py --root Anghabench_path --output AnghaBench_compile.jsonl