mirror of
https://github.com/hiyouga/LlamaFactory.git
synced 2026-06-17 02:27:23 +00:00
Performance Comparison
hoshi-hiyouga edited this page 2024-04-16 03:05:55 +08:00
Short-sequence training
NVIDIA A100 * 1
| Method | Bits | TGS | VRAM | Speed |
|---|---|---|---|---|
| HF | 16 | 2,392 | 18GB | 100% |
| HF+FA2 | 16 | 2,954 | 17GB | 123% |
| Unsloth+FA2 | 16 | 4,007 | 16GB | 168% |
| HF | 4 | 2,415 | 9GB | 101% |
| Unsloth+FA2 | 4 | 3,726 | 7GB | 160% |
NVIDIA A100 * 2
| Method | Bits | TGS | VRAM | Speed |
|---|---|---|---|---|
| HF | 16 | 2,155 | 29GB | 100% |
| HF+FA2 | 16 | 2,556 | 28GB | 119% |
| Unsloth+FA2 | 16 | 3,400 | 27GB | 158% |
- TGS: tokens per GPU per second
- Model: LLaMA2-7B
- Batch size: 4
- Gradient accumulation: 2
- LoRA rank: 8
- LoRA modules: all
- Max length: 1024
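The Speed column appears to be each method's TGS relative to the HF 16-bit baseline of the same table, rounded to a whole percent. The following is a minimal Python sketch under that assumption, using the two-GPU figures; the names `relative_speed` and `tgs_2gpu` are illustrative and not part of LLaMA-Factory.

```python
# TGS values copied from the "NVIDIA A100 * 2" table above.
tgs_2gpu = {"HF": 2155, "HF+FA2": 2556, "Unsloth+FA2": 3400}
NUM_GPUS = 2


def relative_speed(tgs, baseline="HF"):
    """Return each method's TGS as a whole-number percentage of the baseline.

    Assumption: this is how the Speed column was derived.
    """
    base = tgs[baseline]
    return {method: round(100 * value / base) for method, value in tgs.items()}


speed_2gpu = relative_speed(tgs_2gpu)

# TGS is measured per GPU, so aggregate throughput is TGS times the GPU count.
total_tokens_per_second = {m: v * NUM_GPUS for m, v in tgs_2gpu.items()}

for method, pct in speed_2gpu.items():
    total = total_tokens_per_second[method]
    print(f"{method}: {pct}% ({total:,} tokens/s across {NUM_GPUS} GPUs)")
```

Because TGS is normalized per GPU, the one-GPU and two-GPU tables can be compared directly: a lower per-GPU figure on two GPUs can still mean higher total throughput.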
Long-sequence training
| VRAM (by sequence length) | 1,024 | 2,048 | 4,096 | 8,192 | 16,384 | 32,768 | 65,536 | 100,000 |
|---|---|---|---|---|---|---|---|---|
| FlashAttention2 | 6GB | 7GB | 9GB | 12GB | 19GB | 32GB | OOM | OOM |
| Unsloth+FA2 | 5GB | 6GB | 7GB | 8GB | 10GB | 16GB | 25GB | 37GB |
| TGS (by sequence length) | 1,024 | 2,048 | 4,096 | 8,192 | 16,384 | 32,768 | 65,536 | 100,000 |
|---|---|---|---|---|---|---|---|---|
| FlashAttention2 | 2,295 | 2,741 | 2,926 | 3,128 | 3,542 | 2,216 | OOM | OOM |
| Unsloth+FA2 | 2,556 | 3,178 | 3,413 | 3,632 | 4,050 | 2,456 | 1,820 | 1,202 |
| Improvement | 111% | 116% | 117% | 116% | 114% | 111% | - | - |
- TGS: tokens per GPU per second
- GPU: NVIDIA A100 40GB * 1
- Model: LLaMA2-7B
- Batch size: 1
- Gradient accumulation: 4
- LoRA rank: 8
- LoRA modules: all
- Quantization bit: 4
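The Improvement row appears to be the Unsloth+FA2 TGS divided by the FlashAttention2 TGS at the same sequence length, which is undefined where the baseline ran out of memory. A short Python sketch under that assumption follows; the helper `improvement` and the variable names are hypothetical.

```python
# Sequence lengths and TGS values copied from the long-sequence TGS table.
seq_lens = [1024, 2048, 4096, 8192, 16384, 32768, 65536, 100000]
# None marks an out-of-memory (OOM) run.
fa2_tgs = [2295, 2741, 2926, 3128, 3542, 2216, None, None]
unsloth_tgs = [2556, 3178, 3413, 3632, 4050, 2456, 1820, 1202]


def improvement(new, base):
    """Return new/base as a whole-number percentage, or None if either run hit OOM."""
    if new is None or base is None:
        return None
    return round(100 * new / base)


improvements = [improvement(u, f) for u, f in zip(unsloth_tgs, fa2_tgs)]

for length, pct in zip(seq_lens, improvements):
    label = "n/a (baseline OOM)" if pct is None else f"{pct}%"
    print(f"{length:>7,}: {label}")
```

The last two columns have no ratio because only Unsloth+FA2 completed those runs within the 40GB of VRAM.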