Commit graph

8,229 commits

Author SHA1 Message Date
George Hotz
dbb50e4a00 knum 4 2025-03-21 15:48:50 +08:00
George Hotz
71c7c455a6 quantize 2025-03-21 14:55:29 +08:00
George Hotz
ff3438be4e fast 2025-03-21 13:04:18 +08:00
George Hotz
bc5e23061b diasm 2025-03-21 11:22:40 +08:00
George Hotz
5ce951fb34 l2 2025-03-21 11:14:12 +08:00
George Hotz
4a49d05a3f
Merge branch 'master' into dsp_search 2025-03-21 10:26:38 +08:00
George Hotz
c3c85c64ee simpler 2025-03-21 09:24:33 +08:00
Sieds Lykles
3ad3ac4d1e
Change dtypes.int to dtypes.ints (#9517) 2025-03-20 17:24:26 -04:00
chenyu
b9fab9b914
pin ruff to 0.11.0 in CI (#9520)
0.11.1 had a bug https://github.com/astral-sh/ruff/issues/16874 that breaks ci
2025-03-20 13:12:50 -04:00
George Hotz
61c02ca634 cleanups 2025-03-20 23:27:06 +08:00
George Hotz
325044bcaf okay that should actually prefetch 2025-03-20 22:59:59 +08:00
George Hotz
91ac508878 prefetch 2025-03-20 22:56:38 +08:00
George Hotz
2ed30f5366 correct flops 2025-03-20 21:46:13 +08:00
George Hotz
d0b9c7e7ca fast like nascar? 2025-03-20 21:27:26 +08:00
George Hotz
f6ed8f4a27 8 folds 2025-03-20 21:20:46 +08:00
George Hotz
87718170d2 more generic 2025-03-20 21:14:33 +08:00
George Hotz
b67af4049c knum 20 2025-03-20 20:59:06 +08:00
George Hotz
16e425a4c0 work 2025-03-20 20:24:21 +08:00
George Hotz
c867a48ab4 custom 2025-03-20 20:02:35 +08:00
George Hotz
2dc82c0604 should be fast 2025-03-20 19:49:04 +08:00
George Hotz
e7402e6643 KNUM=13 will be fast like roadrunner 2025-03-20 18:45:53 +08:00
George Hotz
e5ccd9e846 work 2025-03-20 15:20:03 +08:00
George Hotz
624197f169 swizzle better 2025-03-20 12:41:24 +08:00
George Hotz
d42350a401 simple test 2025-03-20 12:37:29 +08:00
George Hotz
3c5161b4cb
add validation of the bounds of Ops.INDEX (#9503)
* add validation of the bounds of Ops.INDEX

* do mask properly

* more validation

* correct

* fix gated

* add CAST support to vmin/vmax

* fix ptx and image

* ptx no diff

* upat.index also stays

---------

Co-authored-by: qazal <qazal.software@gmail.com>
2025-03-20 12:15:55 +08:00
qazal
0b20f91ce7
remove move_mask from the devectorizer (#9511)
* remove move_mask from the devectorizer

* add (wrong) ptx

* reason

* enable index addition in PTX, we won't have the INDEX anyways

* space
2025-03-20 11:53:12 +08:00
qazal
9302738263
hotfix: more consistent wgsl.py spacing + cleanups [pr] (#9515)
* hotfix: more consistent wgsl.py spacing + cleanups [pr]

* free things up
2025-03-20 11:07:15 +08:00
George Hotz
223feb2118
Merge branch 'master' into dsp_search 2025-03-20 10:52:30 +08:00
George Hotz
68053d0510
dsp stuff / sniff ioctls from snpe (#9490)
* sniff ioctls from snpe

* dump input buffers

* snpe logs from dsp

* NHWC support

* knum 3

* this run?

* revert those

---------

Co-authored-by: Comma Device <device@comma.ai>
2025-03-20 10:38:23 +08:00
qazal
2223b93338
add UPat.or_casted [pr] (#9513) 2025-03-20 10:08:32 +08:00
qazal
1839e8c9b3
place masks in INDEX for TestGatedStoreRewrite [pr] (#9512) 2025-03-20 09:46:53 +08:00
b1tg
bd731a8624
AMDCompiler refactor (no_comgr prereq) (#9497)
* add amdgpu_disassemble to helpers

* refactor hip compiler

---------

Co-authored-by: b1tg <b1tg@users.noreply.github.com>
2025-03-20 09:44:07 +08:00
geohotstan
8c0d0a122c
Add return_indices to max_pool (#9506)
* wow argmax is so good

* 1 less line

* clean up and better variable names

* is this torch thing right...?

* add more tests

* slap a TODO on it

* clean ups

* prettier looking code and fix ceil mode test

* add return types and some docs

* ok that was a bad example since indices == value, just no example
2025-03-19 15:25:37 -04:00
chenyu
189f62d44f
add rounding to tqdm unit scale (#9507)
fixed `AssertionError: ' 1.00/10.0  1000it/s]' != ' 1.00/10.0  1.00kit/s]'`
2025-03-19 12:08:46 -04:00
nimlgen
a5c971ff3a
am: prereqs for rdna4 1/n (#9495)
* am: ip_ver rename for acc

* am: refactor this

* fix version

* ugh
2025-03-19 17:14:57 +08:00
Francis Lam
1e5d9ad8f7
extra/gemm/max_matmul: start of custom kernels for GEMM (#6926)
* extra/gemm/max_matmul: start of custom kernels for GEMM

* add an unoptimized FP16/FP16 MMA example

* add slow 3-stage fp16 acc example

* add correct 3-stage pipeline with unswizzled/flat smem input (slow)

* add acc fp16 example with 3 stages and swizzle (no bank conflicts)

* add max version of NV fp16_fp16_fp16

* fix up comments and removed unused code in max variations

* add start of no_xor example

* fix to account for UOps to Ops
2025-03-19 15:04:57 +08:00
George Hotz
865f23dd7b olmoe memory usage cleanups 2025-03-19 12:28:18 +08:00
b1tg
2c87a22cf2
fix prg size calculation when there are adjacent mapped ranges (#9498)
Co-authored-by: b1tg <b1tg@users.noreply.github.com>
2025-03-19 11:55:03 +08:00
b1tg
1d71436e6a
use libllvm19 in ci (#9494)
Co-authored-by: b1tg <b1tg@users.noreply.github.com>
2025-03-19 11:53:32 +08:00
b1tg
a95b489a55
nanoGPT train works with tiny torch backend (#9283)
* train_shakespeare_char.py works

* move aten.where.self_out to tiny_backend_out

* fix memory leak

* corealize in the backward_hook

* Update backend.py

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-03-19 11:51:02 +08:00
chenyu
f8976dd2eb
enable more webgpu tests (#9502)
OSX has larger buffer number limit, and it supports fp16 now
2025-03-18 23:03:54 -04:00
qazal
ae688e4103
simple failing test for scheduling parallel reduce [pr] (#9501)
* simple failing test for scheduling parallel reduce [pr]

* atol
2025-03-19 10:52:13 +08:00
leopf
e4dad99145
nn.state docs cleanup (#8332)
* doc cleanup

* extension cleanup

* manual definition

* bring back accept_filename for gguf_load

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-03-18 17:16:40 -04:00
chenyu
1ea4876dfa
olmoe touchups (#9499)
GlobalCounters.reset() and only validate if temperature is 0
2025-03-18 15:25:45 -04:00
geohotstan
f7506c6c25
JIT OLMoE (#9396)
* jit the forward

* might timeout, idk just send it

* this is dumb

* naive bitonic lol

* idk if this is correct, but that squeeze before is definitely not

* vectorized bitonic sort, but still slow

* yay 1 layer is correct

* alright its pretty good

* good enough

* rerun CI

* nit improve comment
2025-03-18 14:49:02 -04:00
Ignacio Sica
5c56cac0a0
MI300 mfma support (#9417)
* add f16/f32 mfma support for MI300

- add 16x16 mfma shape support for f16 with f32 acc
- add ops_python mfma emulation
- add arch to AMDRenderer

* minor cleanup

* minor cleanup

* add mfma emulation task to ci

* add back todo

* hotfix: comment

* add tc=3 job to ci
2025-03-18 14:33:30 -03:00
hooved
5500887eed
improve reproducibility of WebGPU CI puppeteer test (#9496)
* try to make CI test fail with slow JS import

* prevent race between model import and reference

* revert artificial delay in JS module import
2025-03-18 09:27:38 -04:00
qazal
cde4fd3be3
do not view_left assign + elementwise sources always have a shape [pr] (#9491) 2025-03-18 17:42:51 +08:00
George Hotz
117b7a16ef
VALIDATE_WITH_CPU [pr] (#9488)
* VALIDATE_WITH_CPU [pr]

* fix test
2025-03-18 15:15:04 +08:00
qazal
935cd01f56
simple failing test for graph_rewrite children [pr] (#9489)
* simple failing test for graph_rewrite children [pr]

* lint

* update too
2025-03-18 13:07:21 +08:00