Performance Comparison

mirror of https://github.com/hiyouga/LlamaFactory.git synced 2026-06-17 02:27:23 +00:00

Table of Contents

Short-sequence training

NVIDIA A100 * 1
NVIDIA A100 * 2

Long-sequence training

Short-sequence training

NVIDIA A100 * 1

Method	Bits	TGS	VRAM	Speed
HF	16	2,392	18GB	100%
HF+FA2	16	2,954	17GB	123%
Unsloth+FA2	16	4,007	16GB	168%
HF	4	2,415	9GB	101%
Unsloth+FA2	4	3,726	7GB	160%

NVIDIA A100 * 2

Method	Bits	TGS	VRAM	Speed
HF	16	2,155	29GB	100%
HF+FA2	16	2,556	28GB	119%
Unsloth+FA2	16	3,400	27GB	158%

TGS: tokens per GPU per second
Model: LLaMA2-7B
Batch size: 4
Gradient accumulation: 2
LoRA rank: 8
LoRA modules: all
Max length: 1024
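The Speed column appears to be each method's TGS expressed as a percentage of the HF 16-bit baseline in the same table. A minimal sketch of that calculation, assuming this interpretation and using the TGS values from the NVIDIA A100 * 2 table; the names `TGS_2X_A100` and `relative_speed` are hypothetical and not part of LlamaFactory:

```python
# TGS values copied from the NVIDIA A100 * 2 table (16-bit rows).
TGS_2X_A100 = {
    "HF": 2155,
    "HF+FA2": 2556,
    "Unsloth+FA2": 3400,
}


def relative_speed(tgs_by_method, baseline="HF"):
    """Return each method's TGS as a whole-number percentage of the baseline.

    Assumption: "Speed" in the table is TGS relative to the plain HF run.
    """
    base = tgs_by_method[baseline]
    return {method: round(100 * tgs / base) for method, tgs in tgs_by_method.items()}


if __name__ == "__main__":
    for method, pct in relative_speed(TGS_2X_A100).items():
        print(f"{method}: {pct}%")
```

Because TGS is already normalized per GPU, the same function can be applied to either table without adjusting for the number of devices.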

Long-sequence training

VRAM (by max length)	1,024	2,048	4,096	8,192	16,384	32,768	65,536	100,000
FlashAttention2	6GB	7GB	9GB	12GB	19GB	32GB	OOM	OOM
Unsloth+FA2	5GB	6GB	7GB	8GB	10GB	16GB	25GB	37GB

TGS (by max length)	1,024	2,048	4,096	8,192	16,384	32,768	65,536	100,000
FlashAttention2	2,295	2,741	2,926	3,128	3,542	2,216	OOM	OOM
Unsloth+FA2	2,556	3,178	3,413	3,632	4,050	2,456	1,820	1,202
Improvement	111%	116%	117%	116%	114%	111%	N/A	N/A

TGS: tokens per GPU per second
GPU: NVIDIA A100 40GB * 1
Model: LLaMA2-7B
Batch size: 1
Gradient accumulation: 4
LoRA rank: 8
LoRA modules: all
Quantization bit: 4
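The Improvement row can be reproduced from the two TGS rows as the ratio of Unsloth+FA2 throughput to FlashAttention2 throughput at each max length. A minimal sketch, assuming that reading of the row; the names below are hypothetical, and OOM entries are represented as `None` so that no ratio is computed where the baseline did not run:

```python
# Max lengths and TGS values copied from the long-sequence TGS table.
# None marks a run that ended in OOM.
LENGTHS = [1024, 2048, 4096, 8192, 16384, 32768, 65536, 100000]
FA2_TGS = [2295, 2741, 2926, 3128, 3542, 2216, None, None]
UNSLOTH_TGS = [2556, 3178, 3413, 3632, 4050, 2456, 1820, 1202]


def improvement(baseline, candidate):
    """Return candidate/baseline as whole-number percentages, or None if either ran OOM."""
    result = []
    for base, cand in zip(baseline, candidate):
        if base is None or cand is None:
            result.append(None)  # no ratio is defined without both measurements
        else:
            result.append(round(100 * cand / base))
    return result


if __name__ == "__main__":
    for length, pct in zip(LENGTHS, improvement(FA2_TGS, UNSLOTH_TGS)):
        label = "N/A" if pct is None else f"{pct}%"
        print(f"{length:>7}: {label}")
```

At 65,536 and 100,000 tokens only Unsloth+FA2 completes, so the comparison at those lengths is about whether the run fits in 40GB rather than about relative speed.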
