tinygrad/examples
hooved 01f7a4fadc
tinychat in browser, Part 2: model export (#9274)
* load llama3-1B to WEBGPU device

* include compile script for loading llama3 to WEBGPU

* parametrize max_context in build_transformer fxn

* jit_model with two different args sets

* compile for webgpu, split weights

* load model weight parts in browser

* export all tensors from initialized transformer

* run transformer inference in browser

* enable tiktoken with llama bpe in browser

* count total tokens on client with tiktoken.js

* full client-side chat streaming, eliminate server

* revert change that enabled jitting with 2 argsets

* llama without Variable or cache_kv, for webgpu

* have client use mask tokens / whole context

* cleanup staged weights

* add tiktoken.js build script, README

* export CLANG for Q6_k to float32 decompression

* fix and test exported CLANG code for Q6_k to fp32

* revert changes to jit and export_model

* isolate clang export

* test Q6_K to float32 decompression in browser

* gguf_load now also returns t_infos and data_start

* prepare llama-1B Q6_K gguf chunks for browser

* cache and decompress quantized llama in browser

* enable separate deployment of large files

* fix kv cache and symbolic with llama wgpu

* eliminate browser lag during decompression

* hash metadata and weight chunks

* delete obsolete indexeddb cache to free disk

* add progress bar, track model download/decompress

* refactor progress callback

* skip buffer hash verification for speed

* Display progress for entire loading scope

* Report page load errors to user

* actually display errors

* skip prompt tokens already seen by model

* skip prefilling with last assistant message tokens

* on page load tell user if webgpu not enabled

* push deployed URL root to window.history

* make note of bug sources with TODO items

* isolate bug in CLANG with BEAM=2

* remove clang_bug.py from diff

* decompress q6k to f32 on webgpu instead of clang

* remove unused code

* inter-weight decomp with larger wgpu kernels

* parallelize decompression submissions

* refactor dequantize scheduling

* add progress bar back

* fix bug

* temp fix for loading GGUF Q6_K to fp16 not fp32

* fix rendering of exported CLANG

* remove weight casts, sketch js functions for clang

* get symbolic vars from jit_cache for model export

* include symbolic vars in exported CLANG

* render js for clang transformer

* toggle clang/webgpu deployment; refactor decomp

* compile and render clang Q6_K->fp16 and int8 quant

* fix rendered clang for abs(fp16), to work in wasm

* simplify clang js wrapping

* run compiled clang in worker

* prepare llama weights in workers, q6k to int8/fp16

* tinychat on clang in browser, f32/int8 weights

* move wasm inference to (now flexible) worker

* don't load redundant embeddings

* modest wasm perf gain with compile flags

* set default backend, enable backend choice/backup

* render symbolic vars in exported WEBGPU

* quantize webgpu llama to int8/f32

* improve UX arising from rendered WEBGPU

* clean up webgpu launch

* new weights split: smaller chunks, tinygrad quant.

* switch webgpu inference to int8 quant

* remove unneeded clang decompression

* eliminate unneeded kv cache transfer to wasm

* use 1 worker for simplified clang decompression

* display launch errors

* refactor: stream load weight chunks to WebGPU

* show loading chunk completion

* quantize embeddings to int8

* test float16 as input for quantization

* webgpu: use f16 source, int8 embed, eliminate q6k

* simplify split weights prep: all from state_dict

* revert change to nn.state.gguf_load

* remove unneeded decompression from webgpu client

* remove unneeded code

* decrease dl chunks from 47 to 16 MiB

* improve stability of webgpu loading on mobile

* autodetect mobile, improve load stability

* refactor: progress closure

* refactor: one unified progress bar

* remove unneeded code

* revert changes to tinygrad core library

* enforce ios18.3 nerfed max buf size

* BEAM=3 webgpu

* cache integrity, mobile save throttling

* improve mobile UX - no autozoom on prompt box

* clang: int8 from f16, remove q6k

* reduce concurrent dls on mobile to 2 for stability

* refactor: wasm backend with stream loading

* prevent race between wasm load and indexeddb save

* split wasm kernels into separate modules

* js wrapper for multiple wasm module inference

* revert multi-module wasm to single module

* make mobile wasm load more stable/fast

* refactor: copy weights into wasm without crashes

* fix bug in download queue; increase mobile dls

* refactor exported clang wrapper, split weights

* remove unnecessary code

* greatly improve int8 quant quality with rounding

* eliminate mobile throttling

* increase webgpu context to 4096 tokens

* export webgpu js functions

* enable separate hosted weights for mobile/pc

* enable prompt-thread switching during generation

* stop generation when max_context is reached

* show progress bar for prefill

* tell user if webgpu fails, while wasm loads

* make loading messages more concise

* update font

* revert changes to tinychat python app launch

* cleanup quantization, add scale_dtype param

* cleanup kv cache code

* cleanup compile code

* link tok_embeddings with output in webgpu export

* refactor: export_model webgpu: symbolic vars

* refactor: export_model weight loading

* forgot to commit export_model.py

* change CLANG to CPU

* deal with pylint incorrectly failing tests

* simplify f-strings for older CI python version

* fix pre-python3.12 parser errors

* [Int32Array] not Int32Array

* cleanup webgpu compile after refactor export_model

* refactor WASM export into export_model

* merge WebGPU/WASM compile scripts

* simplify max_contexts for local deployment

* fix parser issues and whitespace

* deduplicate variable defs for non-wasm clang export

* cleanup code

* cleanup compile scripts

* simplify wasm inference wrapping

* simplify webgpu symbolic vars export

* refactor: unify export of symbolic variables

* simplify WASM export

* simplify clang/wasm export

* update README and build scripts

* separate files for browser/python apps

* restore original python tinychat app files

* browser and python tinychats share assets

* minor cleanup

* isolate compile/export model

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-03-04 15:53:30 +08:00
conversation_data Whisper + LLAMA + VITS (#2332) 2023-12-02 15:03:46 -08:00
llm.c CLANG -> CPU (#9189) 2025-02-20 18:03:09 -05:00
mlperf revert buffer_view change (#9311) 2025-03-01 11:00:12 +01:00
openpilot no numpy change if no NPY (#9281) 2025-02-28 09:32:35 +08:00
other_mnist import tinygrad.frontend.torch (#9337) 2025-03-04 00:15:29 +08:00
rl more beautiful_cartpole with exposed hparams 2024-01-07 17:41:09 -08:00
sovits_helpers combine pad2d with pad (#7677) 2024-11-14 17:56:02 +08:00
tinychat tinychat in browser, Part 2: model export (#9274) 2025-03-04 15:53:30 +08:00
vgg7_helpers leakyrelu to leaky_relu (#9270) 2025-02-26 13:22:08 -05:00
webgpu WebGPU f16 support (f16 bounty part 2) (#8653) 2025-02-12 19:46:53 +08:00
__init__.py failing llama test 2023-03-11 16:28:10 -08:00
beautiful_cartpole.py tinytqdm.set_description and tinytrange (#5101) 2024-06-22 14:45:06 -04:00
beautiful_cifar.py Fix mypy examples/beautiful_*.py (#6978) 2024-10-10 11:34:29 -04:00
beautiful_mnist.py Revert "switch beautiful_mnist to use new optimizer [pr] (#8231)" (#8233) 2024-12-13 19:07:09 -08:00
beautiful_mnist_multigpu.py Fix mypy examples/beautiful_*.py (#6978) 2024-10-10 11:34:29 -04:00
benchmark_onnx.py add onnx_helpers to extra and add ort validate to benchmark_onnx (#8890) 2025-02-04 16:36:01 -05:00
coder.py apply the same fix_bf16 in llama and coder (#3789) 2024-03-17 21:25:24 -04:00
compile_efficientnet.py CLANG -> CPU (#9189) 2025-02-20 18:03:09 -05:00
compile_tensorflow.py CLANG -> CPU (#9189) 2025-02-20 18:03:09 -05:00
conversation.py Fix examples/conversation.py (#8425) 2024-12-26 12:45:19 -05:00
efficientnet.py remove clang program header (#4422) 2024-05-04 08:38:01 -07:00
flux1.py flux set model path in args (#7660) 2024-11-12 22:11:40 -05:00
flux1_seed0.png Flux.1 (#6334) 2024-09-24 10:08:04 +08:00
gpt2.py cleanup ci, split docs/autogen, testing_minimal, LLVM Speed [pr] (#8952) 2025-02-07 19:01:59 +08:00
handcode_opt.py move time_linearizer to extra.optimization.helpers [pr] (#9048) 2025-02-12 15:49:58 -05:00
hlb_cifar10.py MultiLazyBuffer is UOp [pr] (#8662) 2025-01-24 13:28:55 +09:00
llama.py validate llama quantize output (#7901) 2024-11-25 16:46:23 -05:00
llama3.py tinychat in browser, Part 1: llama (#9273) 2025-02-27 15:57:37 -05:00
mamba.py prev speed improvements (#5252) 2024-07-03 09:06:01 -07:00
mask_rcnn.py change Tensor.stack to method (#4719) 2024-05-24 17:04:19 -04:00
mixtral.py tinytqdm.set_description and tinytrange (#5101) 2024-06-22 14:45:06 -04:00
mnist_gan.py leakyrelu to leaky_relu (#9270) 2025-02-26 13:22:08 -05:00
openelm.py nn.RMSNorm (#5272) 2024-07-02 21:39:01 -04:00
qwq.py QwQ-32B-Preview support (#7962) 2024-12-04 21:46:37 -05:00
sdv2.py Stable Diffusion v2 Inference (#5283) 2024-07-03 22:47:10 -04:00
sdxl.py GlobalCounters.reset() in sdxl step [pr] (#8664) 2025-01-17 21:10:28 -05:00
sdxl_seed0.png default threefry (#6116) 2024-09-25 17:45:13 +08:00
self_tokenize.py make self_tokenize output more like a python file (#8411) 2024-12-25 14:16:30 -05:00
serious_mnist.py combine pad2d with pad (#7677) 2024-11-14 17:56:02 +08:00
simple_conv_bn.py fix various examples (#4691) 2024-05-22 20:43:21 -04:00
so_vits_svc.py leakyrelu to leaky_relu (#9270) 2025-02-26 13:22:08 -05:00
stable_diffusion.py Remove wgpu specific checks from stable diffusion example (#7991) 2024-12-02 11:31:14 +01:00
stable_diffusion_seed0.png default threefry (#6116) 2024-09-25 17:45:13 +08:00
stunning_mnist.py stunning_mnist [run_process_replay] (#6828) 2024-10-01 15:00:48 +08:00
test_onnx_imagenet.py hotfix: add replay_pkl debugging env 2025-02-17 17:34:58 +08:00
torch_cuda_kernel.py hotfix: interop example (#9237) 2025-02-25 10:32:00 +03:00
train_efficientnet.py tinytqdm.set_description and tinytrange (#5101) 2024-06-22 14:45:06 -04:00
train_resnet.py move things, clean up extra (#2292) 2023-11-13 20:18:40 -08:00
transformer.py fix onehot and jit in examples/transformer (#3073) 2024-01-10 02:22:41 -05:00
vgg7.py waifu2x vgg7: testcase, auto-RGBA->RGB, function to grab pretrained models, training "fix" (#2117) 2023-10-19 22:07:15 -07:00
vit.py move to new cached fetch (#2493) 2023-11-28 17:36:55 -08:00
vits.py leakyrelu to leaky_relu (#9270) 2025-02-26 13:22:08 -05:00
whisper.py enable whisper batch for long sequences (#6458) 2024-09-17 00:42:10 -04:00
yolov3.py leakyrelu to leaky_relu (#9270) 2025-02-26 13:22:08 -05:00
yolov8-onnx.py add onnx_helpers to extra and add ort validate to benchmark_onnx (#8890) 2025-02-04 16:36:01 -05:00
yolov8.py YoloV8 on WebGPU (#8007) 2024-12-03 15:10:41 +01:00