Update README.md

2026-06-18 03:24:27 +00:00 · 2026-06-12 14:38:55 +00:00 · 2026-06-12 14:38:55 +00:00 · e7be22c99d
commit e7be22c99d
parent 0e6579fb52
1 changed file with 42 additions and 25 deletions
--- a/README.md
+++ b/README.md
@@ -11,51 +11,68 @@ tags:

 # VTS

-![VTS overview](./Thumbnail.png)
+VTS generates sound effects from:

-VTS (Voice To Sound) generates sound effects from:
-
- a short vocal sketch
+- a short voice or audio sketch
 - a text prompt

-This repository hosts the pretrained checkpoint files for the older `voice_cond` VTS pipeline.
+This model repository hosts the pretrained checkpoint for the VTS inference
+codebase.

 ## Files

- `model_voice_1030_24.pth`: main diffusion checkpoint
- `vae_weight.pth`: VAE checkpoint used for decoding
+- `dynamic_v3_0415.ckpt`: main VTS checkpoint
+
+The companion inference repository downloads additional frozen components at
+runtime, including `google/flan-t5-base` and vocoder files used by the local
+`vts/vocos_custom` implementation.

 ## Download

 ```bash
 pip install -U "huggingface_hub"
-hf download Daniel777/VTS model_voice_1030_24.pth vae_weight.pth --local-dir ./checkpoints
+hf download <your-user-or-org>/<your-model-repo> dynamic_v3_0415.ckpt --local-dir ./checkpoints
 ```

 ## Usage

-Use these checkpoints with the companion `voice_text_sfx` codebase.
+Use this checkpoint with the companion `vts_inference` repository.

 ```bash
-python3 scripts/infer.py \
-  --model-ckpt ./checkpoints/model_voice_1030_24.pth \
-  --ae-ckpt ./checkpoints/vae_weight.pth \
-  --prompt-audio /path/to/prompt.wav \
-  --text "glassy swipe with rising pitch" \
-  --output /tmp/generated.wav \
-  --duration 3.0 \
-  --steps 100 \
-  --cfg-scale 6.0 \
+python -u infer.py \
+  --input-audio ./examples/voice.wav \
+  --text "scifi cannon charging and shooting" \
+  --temperature 0.7 \
+  --model-path ./checkpoints/dynamic_v3_0415.ckpt \
+  --output-dir ./outputs \
  --device cuda
 ```

-## Notes
+## Temperature Behavior

- This checkpoint matches the older `voice_cond` path.
- It is not a drop-in checkpoint for later `script_embed` or `voice_prompt` variants.
- This is a research checkpoint, not a packaged Hugging Face Inference API model.
+For normal inference, use `--temperature 0.7`. This keeps the original dynamic
+conditioning from the input audio and runs the standard `generate` path.

-## SHA256
+- `< 0.6`: weak dynamic conditioning + `generate`
+- `0.6 <= temperature < 0.8`: full dynamic conditioning + `generate`
+- `>= 0.8`: input-audio latent mixing + `variation`

- `model_voice_1030_24.pth`: `a061bfb5e4fca61d8857c3056245304d0a421b55d4f86deca3b47442b08f5287`
- `vae_weight.pth`: `45e2d5ab17e5bbb22dc533cd70798bb4ed96dbbe3487f6f20f5528fc9915558e`
+The input audio is not treated as a speaker embedding. It is converted into
+frame-level dynamic features and, for high-temperature variation, also encoded
+into the vocoder latent space.
+
+## Intended Use
+
+This checkpoint is intended for research and creative sound-effect generation
+from vocal sketches or short audio sketches plus text prompts.
+
+## Limitations
+
+- The model is optimized for short sound-effect style clips.
+- Output quality depends on checkpoint quality, input audio, prompt text, and
+  sampling settings.
+- This is not packaged as a Hugging Face Inference API pipeline.
+
+## License
+
+MIT.
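
The threshold rule in the added "Temperature Behavior" section can be sketched in Python as follows. This is a minimal illustration, not code from the `vts_inference` repository: `Mode` and `select_mode` are hypothetical names, and only the thresholds and the conditioning/path labels are taken from the README text. How `infer.py` implements the switch internally is not shown in this commit.

```python
from typing import NamedTuple


class Mode(NamedTuple):
    """Hypothetical container for the behavior the README describes."""
    conditioning: str  # how the input audio is used
    path: str          # which sampling path runs: "generate" or "variation"


def select_mode(temperature: float) -> Mode:
    """Map a --temperature value to the documented behavior.

    Thresholds follow the README: < 0.6, [0.6, 0.8), >= 0.8.
    """
    if temperature < 0.6:
        return Mode("weak dynamic conditioning", "generate")
    if temperature < 0.8:
        return Mode("full dynamic conditioning", "generate")
    return Mode("input-audio latent mixing", "variation")


if __name__ == "__main__":
    # The README recommends 0.7 for normal inference.
    for t in (0.5, 0.7, 0.9):
        mode = select_mode(t)
        print(f"temperature={t}: {mode.conditioning} + {mode.path}")
```

Under this reading, the boundaries are half-open: exactly `0.6` falls in the full-conditioning band and exactly `0.8` switches to the `variation` path.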