George Hotz
dbb50e4a00
knum 4
2025-03-21 15:48:50 +08:00
George Hotz
71c7c455a6
quantize
2025-03-21 14:55:29 +08:00
George Hotz
ff3438be4e
fast
2025-03-21 13:04:18 +08:00
George Hotz
bc5e23061b
diasm
2025-03-21 11:22:40 +08:00
George Hotz
5ce951fb34
l2
2025-03-21 11:14:12 +08:00
George Hotz
4a49d05a3f
Merge branch 'master' into dsp_search
2025-03-21 10:26:38 +08:00
George Hotz
c3c85c64ee
simpler
2025-03-21 09:24:33 +08:00
Sieds Lykles
3ad3ac4d1e
Change dtypes.int to dtypes.ints (#9517)
2025-03-20 17:24:26 -04:00
chenyu
b9fab9b914
pin ruff to 0.11.0 in CI (#9520)
0.11.1 had a bug https://github.com/astral-sh/ruff/issues/16874 that breaks ci
2025-03-20 13:12:50 -04:00
George Hotz
61c02ca634
cleanups
2025-03-20 23:27:06 +08:00
George Hotz
325044bcaf
okay that should actually prefetch
2025-03-20 22:59:59 +08:00
George Hotz
91ac508878
prefetch
2025-03-20 22:56:38 +08:00
George Hotz
2ed30f5366
correct flops
2025-03-20 21:46:13 +08:00
George Hotz
d0b9c7e7ca
fast like nascar?
2025-03-20 21:27:26 +08:00
George Hotz
f6ed8f4a27
8 folds
2025-03-20 21:20:46 +08:00
George Hotz
87718170d2
more generic
2025-03-20 21:14:33 +08:00
George Hotz
b67af4049c
knum 20
2025-03-20 20:59:06 +08:00
George Hotz
16e425a4c0
work
2025-03-20 20:24:21 +08:00
George Hotz
c867a48ab4
custom
2025-03-20 20:02:35 +08:00
George Hotz
2dc82c0604
should be fast
2025-03-20 19:49:04 +08:00
George Hotz
e7402e6643
KNUM=13 will be fast like roadrunner
2025-03-20 18:45:53 +08:00
George Hotz
e5ccd9e846
work
2025-03-20 15:20:03 +08:00
George Hotz
624197f169
swizzle better
2025-03-20 12:41:24 +08:00
George Hotz
d42350a401
simple test
2025-03-20 12:37:29 +08:00
George Hotz
3c5161b4cb
add validation of the bounds of Ops.INDEX (#9503)
* add validation of the bounds of Ops.INDEX
* do mask properly
* more validation
* correct
* fix gated
* add CAST support to vmin/vmax
* fix ptx and image
* ptx no diff
* upat.index also stays
---------
Co-authored-by: qazal <qazal.software@gmail.com>
2025-03-20 12:15:55 +08:00
qazal
0b20f91ce7
remove move_mask from the devectorizer (#9511)
* remove move_mask from the devectorizer
* add (wrong) ptx
* reason
* enable index addition in PTX, we won't have the INDEX anyways
* space
2025-03-20 11:53:12 +08:00
qazal
9302738263
hotfix: more consistent wgsl.py spacing + cleanups [pr] (#9515)
* hotfix: more consistent wgsl.py spacing + cleanups [pr]
* free things up
2025-03-20 11:07:15 +08:00
George Hotz
223feb2118
Merge branch 'master' into dsp_search
2025-03-20 10:52:30 +08:00
George Hotz
68053d0510
dsp stuff / sniff ioctls from snpe (#9490)
* sniff ioctls from snpe
* dump input buffers
* snpe logs from dsp
* NHWC support
* knum 3
* this run?
* revert those
---------
Co-authored-by: Comma Device <device@comma.ai>
2025-03-20 10:38:23 +08:00
qazal
2223b93338
add UPat.or_casted [pr] (#9513)
2025-03-20 10:08:32 +08:00
qazal
1839e8c9b3
place masks in INDEX for TestGatedStoreRewrite [pr] (#9512)
2025-03-20 09:46:53 +08:00
b1tg
bd731a8624
AMDCompiler refactor (no_comgr prereq) (#9497)
* add amdgpu_disassemble to helpers
* refactor hip compiler
---------
Co-authored-by: b1tg <b1tg@users.noreply.github.com>
2025-03-20 09:44:07 +08:00
geohotstan
8c0d0a122c
Add return_indices to max_pool (#9506)
* wow argmax is so good
* 1 less line
* clean up and better variable names
* is this torch thing right...?
* add more tests
* slap a TODO on it
* clean ups
* prettier looking code and fix ceil mode test
* add return types and some docs
* ok that was a bad example since indices == value, just no example
2025-03-19 15:25:37 -04:00
chenyu
189f62d44f
add rounding to tqdm unit scale (#9507)
fixed `AssertionError: ' 1.00/10.0 1000it/s]' != ' 1.00/10.0 1.00kit/s]'`
2025-03-19 12:08:46 -04:00
nimlgen
a5c971ff3a
am: prereqs for rdna4 1/n (#9495)
* am: ip_ver rename for acc
* am: refactor this
* fix version
* ugh
2025-03-19 17:14:57 +08:00
Francis Lam
1e5d9ad8f7
extra/gemm/max_matmul: start of custom kernels for GEMM (#6926)
* extra/gemm/max_matmul: start of custom kernels for GEMM
* add an unoptimized FP16/FP16 MMA example
* add slow 3-stage fp16 acc example
* add correct 3-stage pipeline with unswizzled/flat smem input (slow)
* add acc fp16 example with 3 stages and swizzle (no bank conflicts)
* add max version of NV fp16_fp16_fp16
* fix up comments and removed unused code in max variations
* add start of no_xor example
* fix to account for UOps to Ops
2025-03-19 15:04:57 +08:00
George Hotz
865f23dd7b
olmoe memory usage cleanups
2025-03-19 12:28:18 +08:00
b1tg
2c87a22cf2
fix prg size calculation when there are adjacent mapped ranges (#9498)
Co-authored-by: b1tg <b1tg@users.noreply.github.com>
2025-03-19 11:55:03 +08:00
b1tg
1d71436e6a
use libllvm19 in ci (#9494)
Co-authored-by: b1tg <b1tg@users.noreply.github.com>
2025-03-19 11:53:32 +08:00
b1tg
a95b489a55
nanoGPT train works with tiny torch backend (#9283)
* train_shakespeare_char.py works
* move aten.where.self_out to tiny_backend_out
* fix memory leak
* corealize in the backward_hook
* Update backend.py
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-03-19 11:51:02 +08:00
chenyu
f8976dd2eb
enable more webgpu tests (#9502)
OSX has larger buffer number limit, and it supports fp16 now
2025-03-18 23:03:54 -04:00
qazal
ae688e4103
simple failing test for scheduling parallel reduce [pr] (#9501)
* simple failing test for scheduling parallel reduce [pr]
* atol
2025-03-19 10:52:13 +08:00
leopf
e4dad99145
nn.state docs cleanup (#8332)
* doc cleanup
* extension cleanup
* manual definition
* bring back accept_filename for gguf_load
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-03-18 17:16:40 -04:00
chenyu
1ea4876dfa
olmoe touchups (#9499)
GlobalCounters.reset() and only validate if temperature is 0
2025-03-18 15:25:45 -04:00
geohotstan
f7506c6c25
JIT OLMoE (#9396)
* jit the forward
* might timeout, idk just send it
* this is dumb
* naive bitonic lol
* idk if this is correct, but that squeeze before is definitly not
* vectorized bitonic sort, but still slow
* yay 1 layer is correct
* alright its pretty good
* good enough
* rerun CI
* nit improve comment
2025-03-18 14:49:02 -04:00
Ignacio Sica
5c56cac0a0
MI300 mfma support (#9417)
* add f16/f32 mfma support for MI300
- add 16x16 mfma shape support for f16 with f32 acc
- add ops_python mfma emulation
- add arch to AMDRenderer
* minor cleanup
* minor cleanup
* add mfma emulation task to ci
* add back todo
* hotfix: comment
* add tc=3 job to ci
2025-03-18 14:33:30 -03:00
hooved
5500887eed
improve reproducibility of WebGPU CI puppeteer test (#9496)
* try to make CI test fail with slow JS import
* prevent race between model import and reference
* revert artificial delay in JS module import
2025-03-18 09:27:38 -04:00
qazal
cde4fd3be3
do not view_left assign + elementwise sources always have a shape [pr] (#9491)
2025-03-18 17:42:51 +08:00
George Hotz
117b7a16ef
VALIDATE_WITH_CPU [pr] (#9488)
* VALIDATE_WITH_CPU [pr]
* fix test
2025-03-18 15:15:04 +08:00
qazal
935cd01f56
simple failing test for graph_rewrite children [pr] (#9489)
* simple failing test for graph_rewrite children [pr]
* lint
* update too
2025-03-18 13:07:21 +08:00