tinygrad

mirror of https://github.com/tinygrad/tinygrad.git synced 2026-06-24 02:14:17 +00:00


hooved 01f7a4fadc tinychat in browser, Part 2: model export (#9274)

* load llama3-1B to WEBGPU device
* include compile script for loading llama3 to WEBGPU
* parametrize max_context in build_transformer fxn
* jit_model with two different args sets
* compile for webgpu, split weights
* load model weight parts in browser
* export all tensors from initialized transformer
* run transformer inference in browser
* enable tiktoken with llama bpe in browser
* count total tokens on client with tiktoken.js
* full client-side chat streaming, eliminate server
* revert change that enabled jitting with 2 argsets
* llama without Variable or cache_kv, for webgpu
* have client use mask tokens / whole context
* cleanup staged weights
* add tiktoken.js build script, README
* export CLANG for Q6_k to float32 decompression
* fix and test exported CLANG code for Q6_k to fp32
* revert changes to jit and export_model
* isolate clang export
* test Q6_K to float32 decompression in browser
* gguf_load now also returns t_infos and data_start
* prepare llama-1B Q6_K gguf chunks for browser
* cache and decompress quantized llama in browser
* enable separate deployment of large files
* fix kv cache and symbolic with llama wgpu
* eliminate browser lag during decompression
* hash metadata and weight chunks
* delete obsolete indexeddb cache to free disk
* add progress bar, track model download/decompress
* refactor progress callback
* skip buffer hash verification for speed
* Display progress for entire loading scope
* Report page load errors to user
* actually display errors
* skip prompt tokens already seen by model
* skip prefilling with last assistant message tokens
* on page load tell user if webgpu not enabled
* push deployed URL root to window.history
* make note of bug sources with TODO items
* isolate bug in CLANG with BEAM=2
* remove clang_bug.py from diff
* decompress q6k to f32 on webgpu instead of clang
* remove unused code
* inter-weight decomp with larger wgpu kernels
* parallelize decompression submissions
* refactor dequantize scheduling
* add progress bar back
* fix bug
* temp fix for loading GGUF Q6_K to fp16 not fp32
* fix rendering of exported CLANG
* remove weight casts, sketch js functions for clang
* get symbolic vars from jit_cache for model export
* include symbolic vars in exported CLANG
* render js for clang transformer
* toggle clang/webgpu deployment; refactor decomp
* compile and render clang Q6_K->fp16 and int8 quant
* fix rendered clang for abs(fp16), to work in wasm
* simplify clang js wrapping
* run compiled clang in worker
* prepare llama weights in workers, q6k to int8/fp16
* tinychat on clang in browser, f32/int8 weights
* move wasm inference to (now flexible) worker
* don't load redundant embeddings
* modest wasm perf gain with compile flags
* set default backend, enable backend choice/backup
* render symbolic vars in exported WEBGPU
* quantize webgpu llama to int8/f32
* improve UX arising from rendered WEBGPU
* clean up webgpu launch
* new weights split: smaller chunks, tinygrad quant.
* switch webgpu inference to int8 quant
* remove unneeded clang decompression
* eliminate unneeded kv cache transfer to wasm
* use 1 worker for simplified clang decompression
* display launch errors
* refactor: stream load weight chunks to WebGPU
* show loading chunk completion
* quantize embeddings to int8
* test float16 as input for quantization
* webgpu: use f16 source, int8 embed, eliminate q6k
* simplify split weights prep: all from state_dict
* revert change to nn.state.gguf_load
* remove unneeded decompression from webgpu client
* remove unneeded code
* decrease dl chunks from 47 to 16 MiB
* improve stability of webgpu loading on mobile
* autodetect mobile, improve load stability
* refactor: progress closure
* refactor: one unified progress bar
* remove unneeded code
* revert changes to tinygrad core library
* enforce ios18.3 nerfed max buf size
* BEAM=3 webgpu
* cache integrity, mobile save throttling
* improve mobile UX - no autozoom on prompt box
* clang: int8 from f16, remove q6k
* reduce concurrent dls on mobile to 2 for stability
* refactor: wasm backend with stream loading
* prevent race between wasm load and indexedb save
* split wasm kernels into separate modules
* js wrapper for multiple wasm module inference
* revert multi-module wasm to single module
* make mobile wasm load more stable/fast
* refactor: copy weights into wasm without crashes
* fix bug in download queue; increase mobile dls
* refactor exported clang wrapper, split weights
* remove unnecessary code
* greatly improve int8 quant quality with rounding
* eliminate mobile throttling
* increase webgpu context to 4096 tokens
* export webgpu js functions
* enable separate hosted weights for mobile/pc
* enable prompt-thread switching during generation
* stop generation when max_context is reached
* show progress bar for prefill
* tell user if webgpu fails, while wasm loads
* make loading messages more concise
* update font
* revert changes to tinychat python app launch
* cleanup quantization, add scale_dtype param
* cleanup kv cache code
* cleanup compile code
* link tok_embeddings with output in webgpu export
* refactor: export_model webgpu: symbolic vars
* refactor: export_model weight loading
* forgot to commit export_model.py
* change CLANG to CPU
* deal with pylint incorrectly failing tests
* simplify f-strings for older CI python version
* fix pre-python3.12 parser errors
* [Int32Array] not Int32Array
* cleanup webgpu compile after refactor export_model
* refactor WASM export into export_model
* merge WebGPU/WASM compile scripts
* simplify max_contexts for local deployment
* fix parser issues and whitespace
* deduplicate variable defs for non-wasm clang export
* cleanup code
* cleanup compile scripts
* simplify wasm inference wrapping
* simplify webgpu symbolic vars export
* refactor: unify export of symbolic variables
* simplify WASM export
* simplify clang/wasm export
* update README and build scripts
* separate files for browser/python apps
* restore original python tinychat app files
* browser and python tinychats share assets
* minor cleanup
* isolate compile/export model

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>

2025-03-04 15:53:30 +08:00
accel	move things, clean up extra (#2292)	2023-11-13 20:18:40 -08:00
amdpci	adaptive am_smi (#9319)	2025-03-02 15:45:07 +03:00
assembly	s/UOps/Ops (#7500)	2024-11-03 11:26:10 +08:00
backends	CLANG -> CPU (#9189)	2025-02-20 18:03:09 -05:00
datasets	do not construct unmasked VALID (#8759)	2025-01-28 20:51:21 +02:00
disassemblers/adreno	qcom fix disasm (#6703)	2024-09-24 15:23:43 +08:00
dsp	dsp simulator (#8869)	2025-02-04 09:45:04 +08:00
gemm	fast amd gemm (#9318)	2025-03-03 12:01:14 +08:00
hip_gpu_driver	create_schedule([x.lazydata]) -> x.schedule() in tests (#8449)	2024-12-31 03:15:52 +08:00
hiprtc	use comgr to compile (#3248)	2024-01-26 18:27:49 -08:00
junk	coder.py can write and run code (#2439)	2023-11-25 12:27:54 -08:00
models	least_upper_float is at least default_float (#9303)	2025-02-28 10:41:56 -05:00
nv_gpu_driver	nv fix shared_memory_size (#7239)	2024-10-23 21:59:47 +03:00
optimization	fix import time_linearizer [pr] (#9118)	2025-02-15 21:33:28 -05:00
qcom_gpu_driver	qcom match texture/sampler descriptors to OpenCL (#7622)	2024-11-11 21:56:51 +03:00
resnet18	beat mlx at resnet 18 (#6611)	2024-09-20 11:28:01 +08:00
torch_backend	ruff torch backend (#9341)	2025-03-03 15:15:23 -05:00
torch_hook	torch_hook fixes (#9334)	2025-03-03 23:07:30 +03:00
webgpu	Autogen webgpu dawn, removing wgpu-py dependency (f16 support part 1) (#8646)	2025-02-07 15:16:59 +08:00
archprobe.py	move dtypes to dtype.py (#2964)	2024-01-01 14:58:48 -08:00
augment.py	[ready] Replacing os with pathlib (#1708)	2023-08-30 10:41:08 -07:00
disk_read_speed.py	io_uring for copies from disk (#5035)	2024-06-21 11:36:51 +03:00
dump_cache.py	wow how did i think that was okay (#2339)	2023-11-16 21:21:11 -08:00
export_model.py	tinychat in browser, Part 2: model export (#9274)	2025-03-04 15:53:30 +08:00
f16_decompress.py	u32 to f16 in tinygrad (#8074)	2024-12-06 12:00:13 +01:00
gradcheck.py	tests from grad uop path [pr] (#8313)	2024-12-18 09:25:05 -08:00
hip_events.py	move autogen to runtime/autogen (#3254)	2024-01-26 12:44:19 -08:00
hook_cuda.py	cuda hooking (#9180)	2025-02-20 19:20:01 +08:00
introspection.py	rename LazyBuffer -> UOp [pr] (#8169)	2024-12-11 16:15:52 -08:00
lr_scheduler.py	use at least float32 for optim.lr (#4297)	2024-04-25 14:42:28 -04:00
mcts_search.py	[TIP-9] rename Opt's amt to arg 2 (#8770)	2025-01-27 14:19:04 -05:00
multitensor.py	multitensor start (#2676)	2023-12-07 17:07:05 -08:00
onnx.py	Test Onnx quantization behavior (#9301)	2025-03-01 19:21:58 -05:00
onnx_helpers.py	add test_onnx_ops.py (#8569)	2025-02-24 16:15:22 -05:00
reduce_speed.py	reduce speed example [pr] (#8978)	2025-02-09 14:13:59 +08:00
replay_pkl.py	hotfix: add replay_pkl debugging env	2025-02-17 17:34:58 +08:00
ring_copy.py	ring copy example (#3185)	2024-01-19 23:34:30 -05:00
setup_mock_amd_osx.sh	add script to install amd mockgpu on macOS (#8536)	2025-01-09 01:29:25 +03:00
setup_mock_nv_osx.sh	hotfix: setup_mock_nv_osx	2025-02-13 12:26:15 +08:00
thneed.py	new style device (#2530)	2023-11-30 17:07:16 -08:00
threefry.py	feat: make buffer (#6745)	2024-09-25 18:31:03 +08:00
to_movement_ops.py	full fix for as_strided in torch backend (#9257)	2025-02-26 22:34:05 +08:00
training.py	tinytqdm.set_description and tinytrange (#5101)	2024-06-22 14:45:06 -04:00
transfer_speed.py	hotfix: copy size is in bytes	2024-01-17 16:44:15 +00:00