tinygrad

mirror of https://github.com/tinygrad/tinygrad.git synced 2026-06-24 02:14:17 +00:00


hooved 01f7a4fadc tinychat in browser, Part 2: model export (#9274)	2025-03-04 15:53:30 +08:00

* load llama3-1B to WEBGPU device
* include compile script for loading llama3 to WEBGPU
* parametrize max_context in build_transformer fxn
* jit_model with two different args sets
* compile for webgpu, split weights
* load model weight parts in browser
* export all tensors from initialized transformer
* run transformer inference in browser
* enable tiktoken with llama bpe in browser
* count total tokens on client with tiktoken.js
* full client-side chat streaming, eliminate server
* revert change that enabled jitting with 2 argsets
* llama without Variable or cache_kv, for webgpu
* have client use mask tokens / whole context
* cleanup staged weights
* add tiktoken.js build script, README
* export CLANG for Q6_k to float32 decompression
* fix and test exported CLANG code for Q6_k to fp32
* revert changes to jit and export_model
* isolate clang export
* test Q6_K to float32 decompression in browser
* gguf_load now also returns t_infos and data_start
* prepare llama-1B Q6_K gguf chunks for browser
* cache and decompress quantized llama in browser
* enable separate deployment of large files
* fix kv cache and symbolic with llama wgpu
* eliminate browser lag during decompression
* hash metadata and weight chunks
* delete obsolete indexeddb cache to free disk
* add progress bar, track model download/decompress
* refactor progress callback
* skip buffer hash verification for speed
* Display progress for entire loading scope
* Report page load errors to user
* actually display errors
* skip prompt tokens already seen by model
* skip prefilling with last assistant message tokens
* on page load tell user if webgpu not enabled
* push deployed URL root to window.history
* make note of bug sources with TODO items
* isolate bug in CLANG with BEAM=2
* remove clang_bug.py from diff
* decompress q6k to f32 on webgpu instead of clang
* remove unused code
* inter-weight decomp with larger wgpu kernels
* parallelize decompression submissions
* refactor dequantize scheduling
* add progress bar back
* fix bug
* temp fix for loading GGUF Q6_K to fp16 not fp32
* fix rendering of exported CLANG
* remove weight casts, sketch js functions for clang
* get symbolic vars from jit_cache for model export
* include symbolic vars in exported CLANG
* render js for clang transformer
* toggle clang/webgpu deployment; refactor decomp
* compile and render clang Q6_K->fp16 and int8 quant
* fix rendered clang for abs(fp16), to work in wasm
* simplify clang js wrapping
* run compiled clang in worker
* prepare llama weights in workers, q6k to int8/fp16
* tinychat on clang in browser, f32/int8 weights
* move wasm inference to (now flexible) worker
* don't load redundant embeddings
* modest wasm perf gain with compile flags
* set default backend, enable backend choice/backup
* render symbolic vars in exported WEBGPU
* quantize webgpu llama to int8/f32
* improve UX arising from rendered WEBGPU
* clean up webgpu launch
* new weights split: smaller chunks, tinygrad quant.
* switch webgpu inference to int8 quant
* remove unneeded clang decompression
* eliminate unneeded kv cache transfer to wasm
* use 1 worker for simplified clang decompression
* display launch errors
* refactor: stream load weight chunks to WebGPU
* show loading chunk completion
* quantize embeddings to int8
* test float16 as input for quantization
* webgpu: use f16 source, int8 embed, eliminate q6k
* simplify split weights prep: all from state_dict
* revert change to nn.state.gguf_load
* remove unneeded decompression from webgpu client
* remove unneeded code
* decrease dl chunks from 47 to 16 MiB
* improve stability of webgpu loading on mobile
* autodetect mobile, improve load stability
* refactor: progress closure
* refactor: one unified progress bar
* remove unneeded code
* revert changes to tinygrad core library
* enforce ios18.3 nerfed max buf size
* BEAM=3 webgpu
* cache integrity, mobile save throttling
* improve mobile UX - no autozoom on prompt box
* clang: int8 from f16, remove q6k
* reduce concurrent dls on mobile to 2 for stability
* refactor: wasm backend with stream loading
* prevent race between wasm load and indexedb save
* split wasm kernels into separate modules
* js wrapper for multiple wasm module inference
* revert multi-module wasm to single module
* make mobile wasm load more stable/fast
* refactor: copy weights into wasm without crashes
* fix bug in download queue; increase mobile dls
* refactor exported clang wrapper, split weights
* remove unnecessary code
* greatly improve int8 quant quality with rounding
* eliminate mobile throttling
* increase webgpu context to 4096 tokens
* export webgpu js functions
* enable separate hosted weights for mobile/pc
* enable prompt-thread switching during generation
* stop generation when max_context is reached
* show progress bar for prefill
* tell user if webgpu fails, while wasm loads
* make loading messages more concise
* update font
* revert changes to tinychat python app launch
* cleanup quantization, add scale_dtype param
* cleanup kv cache code
* cleanup compile code
* link tok_embeddings with output in webgpu export
* refactor: export_model webgpu: symbolic vars
* refactor: export_model weight loading
* forgot to commit export_model.py
* change CLANG to CPU
* deal with pylint incorrectly failing tests
* simplify f-strings for older CI python version
* fix pre-python3.12 parser errors
* [Int32Array] not Int32Array
* cleanup webgpu compile after refactor export_model
* refactor WASM export into export_model
* merge WebGPU/WASM compile scripts
* simplify max_contexts for local deployment
* fix parser issues and whitespace
* deduplicate variable defs for non-wasm clang export
* cleanup code
* cleanup compile scripts
* simplify wasm inference wrapping
* simplify webgpu symbolic vars export
* refactor: unify export of symbolic variables
* simplify WASM export
* simplify clang/wasm export
* update README and build scripts
* separate files for browser/python apps
* restore original python tinychat app files
* browser and python tinychats share assets
* minor cleanup
* isolate compile/export model

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
conversation_data	Whisper + LLAMA + VITS (#2332)	2023-12-02 15:03:46 -08:00
llm.c	CLANG -> CPU (#9189)	2025-02-20 18:03:09 -05:00
mlperf	revert buffer_view change (#9311)	2025-03-01 11:00:12 +01:00
openpilot	no numpy change if no NPY (#9281)	2025-02-28 09:32:35 +08:00
other_mnist	import tinygrad.frontend.torch (#9337)	2025-03-04 00:15:29 +08:00
rl	more beautiful_cartpole with exposed hparams	2024-01-07 17:41:09 -08:00
sovits_helpers	combine pad2d with pad (#7677)	2024-11-14 17:56:02 +08:00
tinychat	tinychat in browser, Part 2: model export (#9274)	2025-03-04 15:53:30 +08:00
vgg7_helpers	leakyrelu to leaky_relu (#9270)	2025-02-26 13:22:08 -05:00
webgpu	WebGPU f16 support (f16 bounty part 2) (#8653)	2025-02-12 19:46:53 +08:00
__init__.py	failing llama test	2023-03-11 16:28:10 -08:00
beautiful_cartpole.py	tinytqdm.set_description and tinytrange (#5101)	2024-06-22 14:45:06 -04:00
beautiful_cifar.py	Fix mypy examples/beautiful_*.py (#6978)	2024-10-10 11:34:29 -04:00
beautiful_mnist.py	Revert "switch beautiful_mnist to use new optimizer [pr] (#8231)" (#8233)	2024-12-13 19:07:09 -08:00
beautiful_mnist_multigpu.py	Fix mypy examples/beautiful_*.py (#6978)	2024-10-10 11:34:29 -04:00
benchmark_onnx.py	add onnx_helpers to extra and add ort validate to benchmark_onnx (#8890)	2025-02-04 16:36:01 -05:00
coder.py	apply the same fix_bf16 in llama and coder (#3789)	2024-03-17 21:25:24 -04:00
compile_efficientnet.py	CLANG -> CPU (#9189)	2025-02-20 18:03:09 -05:00
compile_tensorflow.py	CLANG -> CPU (#9189)	2025-02-20 18:03:09 -05:00
conversation.py	Fix examples/conversation.py (#8425)	2024-12-26 12:45:19 -05:00
efficientnet.py	remove clang program header (#4422)	2024-05-04 08:38:01 -07:00
flux1.py	flux set model path in args (#7660)	2024-11-12 22:11:40 -05:00
flux1_seed0.png	Flux.1 (#6334)	2024-09-24 10:08:04 +08:00
gpt2.py	cleanup ci, split docs/autogen, testing_minimal, LLVM Speed [pr] (#8952)	2025-02-07 19:01:59 +08:00
handcode_opt.py	move time_linearizer to extra.optimization.helpers [pr] (#9048)	2025-02-12 15:49:58 -05:00
hlb_cifar10.py	MultiLazyBuffer is UOp [pr] (#8662)	2025-01-24 13:28:55 +09:00
llama.py	validate llama quantize output (#7901)	2024-11-25 16:46:23 -05:00
llama3.py	tinychat in browser, Part 1: llama (#9273)	2025-02-27 15:57:37 -05:00
mamba.py	prev speed improvements (#5252)	2024-07-03 09:06:01 -07:00
mask_rcnn.py	change Tensor.stack to method (#4719)	2024-05-24 17:04:19 -04:00
mixtral.py	tinytqdm.set_description and tinytrange (#5101)	2024-06-22 14:45:06 -04:00
mnist_gan.py	leakyrelu to leaky_relu (#9270)	2025-02-26 13:22:08 -05:00
openelm.py	nn.RMSNorm (#5272)	2024-07-02 21:39:01 -04:00
qwq.py	QwQ-32B-Preview support (#7962)	2024-12-04 21:46:37 -05:00
sdv2.py	Stable Diffusion v2 Inference (#5283)	2024-07-03 22:47:10 -04:00
sdxl.py	GlobalCounters.reset() in sdxl step [pr] (#8664)	2025-01-17 21:10:28 -05:00
sdxl_seed0.png	default threefry (#6116)	2024-09-25 17:45:13 +08:00
self_tokenize.py	make self_tokenize output more like a python file (#8411)	2024-12-25 14:16:30 -05:00
serious_mnist.py	combine pad2d with pad (#7677)	2024-11-14 17:56:02 +08:00
simple_conv_bn.py	fix various examples (#4691)	2024-05-22 20:43:21 -04:00
so_vits_svc.py	leakyrelu to leaky_relu (#9270)	2025-02-26 13:22:08 -05:00
stable_diffusion.py	Remove wgpu specific checks from stable diffusion example (#7991)	2024-12-02 11:31:14 +01:00
stable_diffusion_seed0.png	default threefry (#6116)	2024-09-25 17:45:13 +08:00
stunning_mnist.py	stunning_mnist [run_process_replay] (#6828)	2024-10-01 15:00:48 +08:00
test_onnx_imagenet.py	hotfix: add replay_pkl debugging env	2025-02-17 17:34:58 +08:00
torch_cuda_kernel.py	hotfix: interop example (#9237)	2025-02-25 10:32:00 +03:00
train_efficientnet.py	tinytqdm.set_description and tinytrange (#5101)	2024-06-22 14:45:06 -04:00
train_resnet.py	move things, clean up extra (#2292)	2023-11-13 20:18:40 -08:00
transformer.py	fix onehot and jit in examples/transformer (#3073)	2024-01-10 02:22:41 -05:00
vgg7.py	waifu2x vgg7: testcase, auto-RGBA->RGB, function to grab pretrained models, training "fix" (#2117)	2023-10-19 22:07:15 -07:00
vit.py	move to new cached fetch (#2493)	2023-11-28 17:36:55 -08:00
vits.py	leakyrelu to leaky_relu (#9270)	2025-02-26 13:22:08 -05:00
whisper.py	enable whisper batch for long sequences (#6458)	2024-09-17 00:42:10 -04:00
yolov3.py	leakyrelu to leaky_relu (#9270)	2025-02-26 13:22:08 -05:00
yolov8-onnx.py	add onnx_helpers to extra and add ort validate to benchmark_onnx (#8890)	2025-02-04 16:36:01 -05:00
yolov8.py	YoloV8 on WebGPU (#8007)	2024-12-03 15:10:41 +01:00