mirrors/tinygrad

mirror of https://github.com/tinygrad/tinygrad.git synced 2026-06-24 02:14:17 +00:00

Author	SHA1	Message	Date
George Hotz	dbb50e4a00	knum 4	2025-03-21 15:48:50 +08:00
George Hotz	71c7c455a6	quantize	2025-03-21 14:55:29 +08:00
George Hotz	ff3438be4e	fast	2025-03-21 13:04:18 +08:00
George Hotz	bc5e23061b	diasm	2025-03-21 11:22:40 +08:00
George Hotz	5ce951fb34	l2	2025-03-21 11:14:12 +08:00
George Hotz	4a49d05a3f	Merge branch 'master' into dsp_search	2025-03-21 10:26:38 +08:00
George Hotz	c3c85c64ee	simpler	2025-03-21 09:24:33 +08:00
Sieds Lykles	3ad3ac4d1e	Change dtypes.int to dtypes.ints (#9517)	2025-03-20 17:24:26 -04:00
chenyu	b9fab9b914	pin ruff to 0.11.0 in CI (#9520) 0.11.1 had a bug https://github.com/astral-sh/ruff/issues/16874 that breaks ci	2025-03-20 13:12:50 -04:00
George Hotz	61c02ca634	cleanups	2025-03-20 23:27:06 +08:00
George Hotz	325044bcaf	okay that should actually prefetch	2025-03-20 22:59:59 +08:00
George Hotz	91ac508878	prefetch	2025-03-20 22:56:38 +08:00
George Hotz	2ed30f5366	correct flops	2025-03-20 21:46:13 +08:00
George Hotz	d0b9c7e7ca	fast like nascar?	2025-03-20 21:27:26 +08:00
George Hotz	f6ed8f4a27	8 folds	2025-03-20 21:20:46 +08:00
George Hotz	87718170d2	more generic	2025-03-20 21:14:33 +08:00
George Hotz	b67af4049c	knum 20	2025-03-20 20:59:06 +08:00
George Hotz	16e425a4c0	work	2025-03-20 20:24:21 +08:00
George Hotz	c867a48ab4	custom	2025-03-20 20:02:35 +08:00
George Hotz	2dc82c0604	should be fast	2025-03-20 19:49:04 +08:00
George Hotz	e7402e6643	KNUM=13 will be fast like roadrunner	2025-03-20 18:45:53 +08:00
George Hotz	e5ccd9e846	work	2025-03-20 15:20:03 +08:00
George Hotz	624197f169	swizzle better	2025-03-20 12:41:24 +08:00
George Hotz	d42350a401	simple test	2025-03-20 12:37:29 +08:00
George Hotz	3c5161b4cb	add validation of the bounds of Ops.INDEX (#9503) * add validation of the bounds of Ops.INDEX * do mask properly * more validation * correct * fix gated * add CAST support to vmin/vmax * fix ptx and image * ptx no diff * upat.index also stays --------- Co-authored-by: qazal <qazal.software@gmail.com>	2025-03-20 12:15:55 +08:00
qazal	0b20f91ce7	remove move_mask from the devectorizer (#9511) * remove move_mask from the devectorizer * add (wrong) ptx * reason * enable index addition in PTX, we won't have the INDEX anyways * space	2025-03-20 11:53:12 +08:00
qazal	9302738263	hotfix: more consistent wgsl.py spacing + cleanups [pr] (#9515) * hotfix: more consistent wgsl.py spacing + cleanups [pr] * free things up	2025-03-20 11:07:15 +08:00
George Hotz	223feb2118	Merge branch 'master' into dsp_search	2025-03-20 10:52:30 +08:00
George Hotz	68053d0510	dsp stuff / sniff ioctls from snpe (#9490) * sniff ioctls from snpe * dump input buffers * snpe logs from dsp * NHWC support * knum 3 * this run? * revert those --------- Co-authored-by: Comma Device <device@comma.ai>	2025-03-20 10:38:23 +08:00
qazal	2223b93338	add UPat.or_casted [pr] (#9513)	2025-03-20 10:08:32 +08:00
qazal	1839e8c9b3	place masks in INDEX for TestGatedStoreRewrite [pr] (#9512)	2025-03-20 09:46:53 +08:00
b1tg	bd731a8624	AMDCompiler refactor (no_comgr prereq) (#9497) * add amdgpu_disassemble to helpers * refactor hip compiler --------- Co-authored-by: b1tg <b1tg@users.noreply.github.com>	2025-03-20 09:44:07 +08:00
geohotstan	8c0d0a122c	Add return_indices to max_pool (#9506) * wow argmax is so good * 1 less line * clean up and better variable names * is this torch thing right...? * add more tests * slap a TODO on it * clean ups * prettier looking code and fix ceil mode test * add return types and some docs * ok that was a bad example since indices == value, just no example	2025-03-19 15:25:37 -04:00
chenyu	189f62d44f	add rounding to tqdm unit scale (#9507) fixed `AssertionError: ' 1.00/10.0 1000it/s]' != ' 1.00/10.0 1.00kit/s]'`	2025-03-19 12:08:46 -04:00
nimlgen	a5c971ff3a	am: prereqs for rdna4 1/n (#9495) * am: ip_ver rename for acc * am: refactor this * fix version * ugh	2025-03-19 17:14:57 +08:00
Francis Lam	1e5d9ad8f7	extra/gemm/max_matmul: start of custom kernels for GEMM (#6926) * extra/gemm/max_matmul: start of custom kernels for GEMM * add an unoptimized FP16/FP16 MMA example * add slow 3-stage fp16 acc example * add correct 3-stage pipeline with unswizzled/flat smem input (slow) * add acc fp16 example with 3 stages and swizzle (no bank conflicts) * add max version of NV fp16_fp16_fp16 * fix up comments and removed unused code in max variations * add start of no_xor example * fix to account for UOps to Ops	2025-03-19 15:04:57 +08:00
George Hotz	865f23dd7b	olmoe memory usage cleanups	2025-03-19 12:28:18 +08:00
b1tg	2c87a22cf2	fix prg size calculation when there are adjacent mapped ranges (#9498) Co-authored-by: b1tg <b1tg@users.noreply.github.com>	2025-03-19 11:55:03 +08:00
b1tg	1d71436e6a	use libllvm19 in ci (#9494) Co-authored-by: b1tg <b1tg@users.noreply.github.com>	2025-03-19 11:53:32 +08:00
b1tg	a95b489a55	nanoGPT train works with tiny torch backend (#9283) * train_shakespeare_char.py works * move aten.where.self_out to tiny_backend_out * fix memory leak * corealize in the backward_hook * Update backend.py --------- Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>	2025-03-19 11:51:02 +08:00
chenyu	f8976dd2eb	enable more webgpu tests (#9502) OSX has larger buffer number limit, and it supports fp16 now	2025-03-18 23:03:54 -04:00
qazal	ae688e4103	simple failing test for scheduling parallel reduce [pr] (#9501) * simple failing test for scheduling parallel reduce [pr] * atol	2025-03-19 10:52:13 +08:00
leopf	e4dad99145	nn.state docs cleanup (#8332) * doc cleanup * extension cleanup * manual definition * bring back accept_filename for gguf_load --------- Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com> Co-authored-by: chenyu <chenyu@fastmail.com>	2025-03-18 17:16:40 -04:00
chenyu	1ea4876dfa	olmoe touchups (#9499) GlobalCounters.reset() and only validate if temperature is 0	2025-03-18 15:25:45 -04:00
geohotstan	f7506c6c25	JIT OLMoE (#9396) * jit the forward * might timeout, idk just send it * this is dumb * naive bitonic lol * idk if this is correct, but that squeeze before is definitly not * vectorized bitonic sort, but still slow * yay 1 layer is correct * alright its pretty good * good enough * rerun CI * nit improve comment	2025-03-18 14:49:02 -04:00
Ignacio Sica	5c56cac0a0	MI300 mfma support (#9417) * add f16/f32 mfma support for MI300 - add 16x16 mfma shape support for f16 with f32 acc - add ops_python mfma emulation - add arch to AMDRenderer * minor cleanup * minor cleanup * add mfma emulation task to ci * add back todo * hotfix: comment * add tc=3 job to ci	2025-03-18 14:33:30 -03:00
hooved	5500887eed	improve reproducibility of WebGPU CI puppeteer test (#9496) * try to make CI test fail with slow JS import * prevent race between model import and reference * revert artificial delay in JS module import	2025-03-18 09:27:38 -04:00
qazal	cde4fd3be3	do not view_left assign + elementwise sources always have a shape [pr] (#9491)	2025-03-18 17:42:51 +08:00
George Hotz	117b7a16ef	VALIDATE_WITH_CPU [pr] (#9488) * VALIDATE_WITH_CPU [pr] * fix test	2025-03-18 15:15:04 +08:00
qazal	935cd01f56	simple failing test for graph_rewrite children [pr] (#9489) * simple failing test for graph_rewrite children [pr] * lint * update too	2025-03-18 13:07:21 +08:00

8,229 commits