mirror of https://huggingface.co/Daniel777/VTS
synced 2026-06-18 03:24:27 +00:00
Update README.md
parent 0e6579fb52
commit e7be22c99d
1 changed file with 42 additions and 25 deletions
README.md | 67
@@ -11,51 +11,68 @@ tags:
 # VTS


-VTS generates sound effects from:

+VTS (Voice To Sound) generates sound effects from:

-- a short vocal sketch
+- a short voice or audio sketch
 - a text prompt

-This repository hosts the pretrained checkpoint files for the older `voice_cond` VTS pipeline.
+This model repository hosts the pretrained checkpoint for the VTS inference
+codebase.

 ## Files

-- `model_voice_1030_24.pth`: main diffusion checkpoint
-- `vae_weight.pth`: VAE checkpoint used for decoding
+- `dynamic_v3_0415.ckpt`: main VTS checkpoint
+
+The companion inference repository downloads additional frozen components at
+runtime, including `google/flan-t5-base` and vocoder files used by the local
+`vts/vocos_custom` implementation.

 ## Download

 ```bash
 pip install -U "huggingface_hub"
-hf download Daniel777/VTS model_voice_1030_24.pth vae_weight.pth --local-dir ./checkpoints
+hf download <your-user-or-org>/<your-model-repo> dynamic_v3_0415.ckpt --local-dir ./checkpoints
 ```

 ## Usage

-Use these checkpoints with the companion `voice_text_sfx` codebase.
+Use this checkpoint with the companion `vts_inference` repository.

 ```bash
-python3 scripts/infer.py \
-  --model-ckpt ./checkpoints/model_voice_1030_24.pth \
-  --ae-ckpt ./checkpoints/vae_weight.pth \
-  --prompt-audio /path/to/prompt.wav \
-  --text "glassy swipe with rising pitch" \
-  --output /tmp/generated.wav \
-  --duration 3.0 \
-  --steps 100 \
-  --cfg-scale 6.0 \
+python -u infer.py \
+  --input-audio ./examples/voice.wav \
+  --text "scifi cannon charging and shooting" \
+  --temperature 0.7 \
+  --model-path ./checkpoints/dynamic_v3_0415.ckpt \
+  --output-dir ./outputs \
   --device cuda
 ```

-## Notes
+## Temperature Behavior

-- This checkpoint matches the older `voice_cond` path.
-- It is not a drop-in checkpoint for later `script_embed` or `voice_prompt` variants.
-- This is a research checkpoint, not a packaged Hugging Face Inference API model.
+For normal inference, use `--temperature 0.7`. This keeps the original dynamic
+conditioning from the input audio and runs the standard `generate` path.

-## SHA256
+- `< 0.6`: weak dynamic conditioning + `generate`
+- `0.6 <= temperature < 0.8`: full dynamic conditioning + `generate`
+- `>= 0.8`: input-audio latent mixing + `variation`

-- `model_voice_1030_24.pth`: `a061bfb5e4fca61d8857c3056245304d0a421b55d4f86deca3b47442b08f5287`
-- `vae_weight.pth`: `45e2d5ab17e5bbb22dc533cd70798bb4ed96dbbe3487f6f20f5528fc9915558e`
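
The two SHA256 values listed for `model_voice_1030_24.pth` and `vae_weight.pth` can still be used to check local copies of the older checkpoints. The following is a minimal sketch, assuming GNU coreutils `sha256sum` is installed; `verify_sha256` is a hypothetical helper name and is not part of the VTS codebase.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of VTS): compare a file's SHA256 digest
# with an expected hex string. Assumes GNU coreutils `sha256sum`.
verify_sha256() {
  local file="$1" expected="$2" actual
  if [ ! -f "$file" ]; then
    echo "missing: $file" >&2
    return 2
  fi
  # sha256sum prints "<digest>  <path>"; keep only the digest.
  actual="$(sha256sum "$file" | awk '{print $1}')"
  if [ "$actual" = "$expected" ]; then
    echo "OK: $file"
  else
    echo "MISMATCH: $file" >&2
    return 1
  fi
}
```

For example, `verify_sha256 ./checkpoints/vae_weight.pth 45e2d5ab17e5bbb22dc533cd70798bb4ed96dbbe3487f6f20f5528fc9915558e` would report `OK` only if the local file matches the digest listed above. No digest is listed for `dynamic_v3_0415.ckpt`, so this check applies only to the older files.
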
+The input audio is not treated as a speaker embedding. It is converted into
+frame-level dynamic features and, for high-temperature variation, also encoded
+into the vocoder latent space.
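
The temperature ranges listed above can be sketched as a small shell helper. This only mirrors the ranges as the README documents them and is not taken from the VTS code; `vts_temperature_mode` is a hypothetical name, and the actual branching inside `infer.py` may differ.

```bash
#!/usr/bin/env bash
# Illustrative only: maps a --temperature value to the behavior the README
# describes. Not the actual VTS implementation.
vts_temperature_mode() {
  # awk handles the floating-point comparison that bash arithmetic cannot.
  awk -v t="$1" 'BEGIN {
    t += 0
    if (t < 0.6)      print "weak dynamic conditioning + generate"
    else if (t < 0.8) print "full dynamic conditioning + generate"
    else              print "input-audio latent mixing + variation"
  }'
}

vts_temperature_mode 0.7   # prints: full dynamic conditioning + generate
```

A helper like this could be used to label outputs when sweeping `--temperature` across the three ranges.
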
+
+## Intended Use
+
+This checkpoint is intended for research and creative sound-effect generation
+from vocal sketches or short audio sketches plus text prompts.
+
+## Limitations
+
+- The model is optimized for short sound-effect style clips.
+- Output quality depends on checkpoint quality, input audio, prompt text, and
+  sampling settings.
+- This is not packaged as a Hugging Face Inference API pipeline.
+
+## License
+
+MIT.