chenyu
687ade119e
IMAGE hand_coded_optimizations update ( #16720 )
2026-06-23 21:55:28 -04:00
George Hotz
0a8e61d0c5
switch to the new memory coaleser [pr] ( #16716 )
...
* switch to the new memory coalese
* move that stuff
* copy in allowed length logic
* mulitple buffers
* new coalese is better
* fine
* earlier
* fixes
* work
* work
* valid
* stack on index const
2026-06-23 18:03:48 -07:00
wozeparrot
dfea9e7994
llama: fused silu mul quantize mxfp8 ( #16704 )
2026-06-23 16:59:50 -07:00
chenyu
ce87d80911
better _drop_valid_stmts [pr] ( #16719 )
...
also dropped the unused is_increasing
2026-06-23 19:35:01 -04:00
George Hotz
5a2b3b7b06
early dtype decomp ( #16718 )
...
* early dtype decomp
* simplify
* cleanup
* that goes there
* doing too much
* stupid symbolic rules
2026-06-23 16:07:20 -07:00
Christopher Milan
116045cc8e
ci: remove tensorflow from testoptim ( #16717 )
2026-06-23 18:11:48 -04:00
nimlgen
7c1d0b6d9a
hcq2: use shrink(bitcast) ( #16713 )
...
* hcq2: use shrink(bitcast)
* x
2026-06-23 18:11:39 +03:00
George Hotz
c9dc1d63cc
small changes from new codegen ( #16712 )
...
* small changes from new codegen
* shrink/flatten
2026-06-22 17:44:15 -07:00
Christopher Milan
da98fae9e1
ci: try parallelizing tc tests ( #16710 )
2026-06-22 20:43:32 -04:00
chenyu
15988b5941
contiguous to mixin and cleanups [PR] ( #16711 )
2026-06-22 20:18:18 -04:00
Christopher Milan
cbfcf36e44
ci: remove generate_dataset and CL misc ( #16709 )
2026-06-22 18:01:07 -04:00
nimlgen
f9c8c697d6
hcq2: drop args after inner deps ( #16708 )
2026-06-22 23:26:11 +03:00
chenyu
0138480910
dropout and scaled_dot_product_attention to mixin ( #16707 )
2026-06-22 16:17:45 -04:00
chenyu
33b635d23a
Tensor.train -> TRAINING [PR] ( #16705 )
...
* Tensor.train -> TRAINING [PR]
* doc
2026-06-22 15:13:22 -04:00
chenyu
625d8bbd0d
TRAINING ContextVar ( #16703 )
2026-06-22 13:03:08 -04:00
wozeparrot
fe9b19b12d
llama: more mp mem fixes ( #16701 )
...
* llama: more mp mem fixes
* clean: unused
* fix: batch
2026-06-22 10:54:35 -04:00
chenyu
267af9c601
full_like to CreationMixin [PR] ( #16702 )
2026-06-22 09:33:23 -04:00
chenyu
97da54b9d6
more method to CreationMixin [PR] ( #16698 )
2026-06-22 00:01:22 -04:00
chenyu
fd0dc40689
clean up CreationMixin and DTypeMixin [PR] ( #16697 )
2026-06-21 21:13:40 -04:00
chenyu
2d8b802958
contiguous in wino conv ( #16696 )
...
also fixed test_counters
2026-06-21 17:11:46 -04:00
chenyu
ba1d3baae8
masked_select and nonzero to mixin [PR] ( #16695 )
...
with a .data stub
2026-06-21 15:10:44 -04:00
chenyu
d80a41d559
some rand method to RandMixin [PR] ( #16693 )
2026-06-21 12:16:51 -04:00
wozeparrot
5164c21b44
gemm: keep shape thru mxfp8 quantize ( #16692 )
2026-06-20 22:28:53 -07:00
chenyu
58ff75272e
const_like and invalids to mixin [PR] ( #16690 )
...
* const_like and invalids to mixin [PR]
* empty_like
* einsum
* type
2026-06-21 00:02:29 -04:00
chenyu
b50da5c205
move Tensor.__getitem__ to mixin [PR] ( #16689 )
2026-06-20 22:01:45 -04:00
chenyu
4618d27129
final const cleanups [PR] ( #16688 )
2026-06-20 21:38:16 -04:00
chenyu
9ae0a93d0e
more const cleanups [PR] ( #16682 )
2026-06-20 20:41:43 -04:00
George Hotz
30830850a9
small changes from new codegen ( #16681 )
...
* small changes from new codegen
* revert that
2026-06-19 18:29:01 -07:00
chenyu
8b07cca9f7
invalid clone try 3+ [PR] ( #16679 )
2026-06-19 20:13:52 -04:00
Christopher Milan
b2199c54a3
ci: update actions/cache/restore to suppress warnings ( #16680 )
2026-06-19 18:27:52 -04:00
Christopher Milan
1822eed8d3
ci: only test models on cpu ( #16678 )
2026-06-19 18:16:59 -04:00
wozeparrot
bba611bb59
gemm: fix mxfp8 on more shapes ( #16677 )
2026-06-19 13:28:53 -07:00
chenyu
67c3e589a1
invalid clone tests and prereq [PR] ( #16675 )
2026-06-19 13:20:43 -04:00
George Hotz
649971f02a
remove DEFINE_LOCAL and DEFINE_REG (gpt) ( #16673 )
...
* remove define_local and define_reg (gpt)
* fix precommit
* cleanups
* regalloc fix
* cleanups 2
2026-06-19 10:07:50 -07:00
George Hotz
b05bea81ce
x86 cleanups (fable) [pr] ( #16591 )
...
* x86 cleanups (fable)
* support shrink
* remove ptr dtype
* move that
* is_lane helper
* Revert "is_lane helper"
This reverts commit ea4571254d .
2026-06-19 09:04:51 -07:00
nimlgen
97c2e7a3d9
spec: add getaddr ( #16674 )
2026-06-19 15:37:33 +03:00
George Hotz
d7b10c69bc
update placeholder to not create DEFINE_LOCAL/DEFINE_REG ( #16671 )
...
* update placeholder to not create DEFINE_LOCAL/DEFINE_REG
* simpler
* define_local
2026-06-18 21:21:06 -07:00
Christopher Milan
091ec8d10d
use tinygrad.llm in benchmarks ( #16670 )
2026-06-19 00:03:57 -04:00
George Hotz
925c49ce99
use placeholder in tests ( #16672 )
2026-06-18 20:51:44 -07:00
wozeparrot
05249466ed
llama: fused quantize mxfp8 ( #16667 )
2026-06-18 16:02:28 -07:00
George Hotz
4a4b6956df
remove DEFINE_VAR from codebase (gpt) ( #16666 )
...
* remove DEFINE_VAR from codebase
* junk
* remove junk
2026-06-18 15:33:50 -07:00
nimlgen
eda0a402d1
hcq2: fix multi ( #16661 )
2026-06-18 22:56:49 +03:00
George Hotz
5989d0b150
remove DEFINE_VAR try 2 ( #16651 )
...
* remove DEFINE_VAR try 2
* param
* null index
* fix fuzzing
* fixes
* no gather neg params
* param is just Irreducible
* fixes
* skip stack
* need to filter slots there
2026-06-18 12:34:25 -07:00
wozeparrot
d37248c3ec
gemm: fix mxfp8 on odd shapes ( #16664 )
2026-06-18 12:03:59 -07:00
chenyu
d74f488376
clean up _function.depth properly [PR] ( #16663 )
2026-06-18 14:10:22 -04:00
chenyu
d7a1022188
minor function.py cleanups [PR] ( #16662 )
2026-06-18 13:36:48 -04:00
qazal
924bece1d5
remove some old scheduler tests ( #16660 )
2026-06-18 22:15:00 +09:00
qazal
b753fb5e4c
viz: view source working even if compile failed ( #16657 )
...
* failing test
* hard
* ret_dict
* switch to _data for tests too
* update sqtt
* start work
* Ops.LINEAR looks good
* baseline with depth works
* support depth
* types
* @needs_tracked_pm
* update, marg can error too
* unwrap_or goes to many more places
* move things to soft_err
* soft_err everywhere needed
* diff cleanup
* use list
* rewrite it
* change
* update depth number
* small comment change
2026-06-18 17:34:53 +09:00
qazal
31094a794f
viz: data not sent to client side starts with _ ( #16659 )
...
* ret_dict
* switch to _data for tests too
* update sqtt
* rename to filter_keys
* not cfg
2026-06-18 15:25:22 +09:00
qazal
1720987dc7
include exception name in Ops.REWRITE_ERROR ( #16658 )
2026-06-18 14:52:48 +09:00
wozeparrot
bed0c343a3
faster mxfp8 gemm ( #16656 )
2026-06-17 22:35:36 -07:00
Christopher Milan
e0fe6e542e
ci: fewer pydeps ( #16654 )
2026-06-17 22:52:14 -04:00
chenyu
a74b7130b4
Revert "invalid clone try 2 [PR] ( #16648 )" ( #16653 )
...
This reverts commit 1bd4551ee1 .
2026-06-17 22:05:30 -04:00
chenyu
df015ad541
remove many type ignores [PR] ( #16652 )
2026-06-17 21:38:45 -04:00
chenyu
1bd4551ee1
invalid clone try 2 [PR] ( #16648 )
2026-06-17 19:44:35 -04:00
George Hotz
53a1226a49
STACK 0 is dtype void ( #16650 )
...
* STACK 0 is dtype void
* spec for stack
* fix gemm group + END shape
* bump
2026-06-17 16:28:32 -07:00
George Hotz
aef85ddc4d
addrspace special/range ( #16647 )
...
* addrspace special/range
* just include indexing
* define var is alu
* bring old ignore indexing back
* mults to fix
* fixes
* ALU
* fixes
2026-06-17 15:57:37 -07:00
chenyu
1e08c0a07c
remove NOOP from AFTER with multiple srcs ( #16646 )
2026-06-17 14:35:02 -04:00
chenyu
1acc40600d
indexing an after with all fully invalid stores is invalid ( #16643 )
...
* indexing an after with all fully invalid stores is invalid
* typing cast
2026-06-17 11:06:36 -04:00
nimlgen
0f0c622086
hcq2: multi folders ( #16642 )
2026-06-17 15:20:25 +03:00
George Hotz
be9b570cb2
late numbering of var params ( #16640 )
...
* do_number_param
* fix sort order in x86
* we don't want this
2026-06-17 00:36:08 -07:00
qazal
c7055d658f
viz: only store kernel info ( #16641 )
2026-06-17 16:21:57 +09:00
George Hotz
d631716858
remove const without STACK ( #16639 )
...
* remove const without STACK
* fix GEP rewrite
* fix null tests
* fix openpilot regression
* it's 10 in CI
2026-06-16 21:25:42 -07:00
wozeparrot
36f6d1b064
gemm: fix bf16 atb for mp sharding ( #16637 )
2026-06-16 15:58:47 -07:00
qazal
1cb6b88d37
viz: show contents of vconst ( #16636 )
...
* failing test
* render vconst
* simpler test
* reorder
2026-06-17 02:31:03 +09:00
nimlgen
5644605d92
hcq2: pack bufs ( #16635 )
...
* hcq2: pack bufs
* x
2026-06-16 18:58:16 +03:00
chenyu
d5d59a2be6
remove dead rangeify rules [PR] ( #16634 )
2026-06-16 10:03:08 -04:00
chenyu
f0998e9bba
Revert "invalid clone is anonymous buffer" ( #16613 ) ( #16633 )
2026-06-16 08:27:48 -04:00
qazal
7d2b0b697d
simple failing test for invalid extra E kernel ( #16632 )
...
* simple failing test for invalid extra E kernel
* 6 kernels
2026-06-16 17:57:44 +09:00
wozeparrot
70cac72781
llama: realize weight init ( #16623 )
2026-06-15 23:00:19 -07:00
Christopher Milan
443f976305
fix buffer overrun in dcache_flush ( #16630 )
2026-06-15 23:26:32 -04:00
chenyu
aa2bef24a8
no_vectorized_alu in cstyle does nothing now [PR] ( #16631 )
2026-06-15 23:07:20 -04:00
chenyu
efd03d7153
invalid clone is anonymous buffer [PR] ( #16613 )
2026-06-15 20:14:26 -04:00
nimlgen
4a0488ae97
hcq2: optims ( #16624 )
...
* hcq2: optims
* x
2026-06-15 23:58:28 +03:00
George Hotz
41aa2fe119
test_gemm needs .clone() on eye ( #16629 )
2026-06-15 12:48:27 -07:00
qazal
10bdb9c9d0
viz: check node exists before anchoring zoom ( #16627 )
2026-06-15 21:03:24 +09:00
qazal
f998b9930a
fp8 gemm inv_scale in epilogue ( #16625 )
...
* fuse scale
* remove python inv_scale
* more inv_scale removal
* more cleanups
* cleaner
* diff polish
* work
* rename
* simpler
* simpler
* compute
* c
* Revert "c"
This reverts commit 8941fec7ca .
* Revert "compute"
This reverts commit 9db573a6d3 .
* Revert "simpler"
This reverts commit 910ad33f87 .
* Revert "simpler"
This reverts commit bf75d235a1 .
* s_g
* update types
* less diff noise
* remove
2026-06-15 18:44:41 +09:00
nimlgen
4dc51aff6e
hcq2: jit ( #16621 )
...
* hcq2: jit
* x
* x
* minor
2026-06-15 06:35:35 +07:00
chenyu
2adedf5ccb
clean up fold_divmod_general [pr] ( #16622 )
...
genralized fold_binary_numerator in fold_divmod_congruence
2026-06-14 17:15:52 -04:00
George Hotz
a6d7fb9d4d
only SHRINK for non scalar access ( #16619 )
2026-06-14 10:08:37 -07:00
George Hotz
b1fb39502d
delete that test
2026-06-14 09:42:58 -07:00
chenyu
2e181f4259
simpler cancel_divmod [PR] ( #16616 )
2026-06-14 11:41:31 -04:00
chenyu
5d5ead78da
inline unique_const in invalids [PR] ( #16612 )
2026-06-13 10:14:32 -04:00
Sieds Lykles
b00dd754a9
Remove if-condition from nested div rule [pr] ( #16611 )
...
* add rules and test
* trigger [pr]
2026-06-13 15:47:21 +02:00
nimlgen
5a9227b30a
hcq2: rebind var params ( #16610 )
2026-06-13 14:55:52 +03:00
nimlgen
8efc8d064f
unique based on opaque in from_buffer ( #16609 )
2026-06-13 14:31:58 +03:00
nimlgen
c43091a464
fix missing cast in cstyle ( #16608 )
...
* fix missing cast in cstyle
* x
* x
2026-06-13 10:04:06 +03:00
qazal
2e77bd01db
fp8 gemm cleanup ( #16607 )
2026-06-13 13:17:32 +09:00
Christopher Milan
bcdb988df0
split comma benchmark, dsp on c4 [PR] ( #16598 )
2026-06-12 23:26:05 -04:00
George Hotz
6b8fdfe4ca
alu addrspace is where the math happens ( #16606 )
...
* alu addrspace
* fix cstyle/llvm
* on ptx, reg+alu are the same thing
2026-06-12 20:01:28 -07:00
wozeparrot
67a4f129c2
llama: fix bf16 gemm oob ( #16603 )
2026-06-12 19:43:05 -07:00
Christopher Milan
8862c7549c
new-style dcache_flush ( #16602 )
2026-06-12 22:25:08 -04:00
chenyu
9e72a6b376
more indexing cleanup [PR] ( #16600 )
2026-06-12 21:33:47 -04:00
chenyu
aa32d309db
fix rangeify indexing for pad/reduce ( #16599 )
2026-06-12 20:26:15 -04:00
George Hotz
96b86aad7b
move new style transform up more ( #16593 )
...
* move new style transform up more
* pm_move_gates_from_index works on new style
2026-06-12 17:20:12 -07:00
chenyu
a35964493e
UPat method cleanups [PR] ( #16596 )
2026-06-12 17:22:54 -04:00
chenyu
3036b15ed9
remove Tensor.ufix [PR] ( #16594 )
...
* remove Tensor.ufix [PR]
* inline _ufix_keep_dtype
2026-06-12 14:40:28 -04:00
qazal
b2e95b2db3
rangeify: no copies for write+read of same slice ( #16585 )
...
* failing test
* cleaner failing tests
* assign and read of same slice shouldn't create copies
* err in the changes
* shrink with no overlapping regions in dest is fine
2026-06-13 02:19:47 +09:00
George Hotz
833cb37574
move up new style transform ( #16592 )
...
* simpler names
* move up new style transform
* fix that rule
2026-06-12 10:13:37 -07:00
George Hotz
51100d2c5c
new style cleanups ( #16584 )
...
* spec tighten
* revert
* lin fix
* lin fix
* needed for x86
* revert
2026-06-12 08:10:38 -07:00
Philip Sinitsin
76c10cd635
jit: don't memplan buffers reachable from live tensors ( #16588 )
...
The memory planner was suballocating BUFFERs created during JIT capture that are still referenced by external lazy tensor graphs, like the .grad tensors assigned by backward(). The replay then only writes the arena slices, so realizing such a tensor after the call reads freshly allocated memory and silently returns zeros. Hold every BUFFER reachable from a live Tensor instead of only the parameters of the return value; true internals are still planned. Fixes #16571 .
2026-06-12 17:51:54 +03:00
nimlgen
2bfdf85f87
hcq2: move pre bufferize ( #16589 )
...
* hcq2: move pre bufferize
* x
2026-06-12 16:11:59 +03:00
nimlgen
fb74f75485
var params sort after global params ( #16590 )
2026-06-12 14:33:15 +03:00
qazal
4d34590b7d
llama: less E kernels ( #16517 )
2026-06-12 19:49:25 +09:00
qazal
12f4cf0e49
rename amd/test_custom_kernel.py to test_asm_kernel ( #16586 )
...
* rename amd/test_custom_kernel.py to test_asm_kernel
* update
2026-06-12 16:11:01 +09:00
wozeparrot
e770805d21
llama: mxfp8 ( #16574 )
2026-06-11 22:15:24 -07:00
George Hotz
b8aec4cce7
port x86 to new_style (fable slop) and now everything is new style ( #16581 )
...
* port x86 to new_style (fable slop)
* don't change ops
* port NIR to new_style (fable)
* lil cleanup
* fix tests, and remove new_style
2026-06-11 21:09:34 -07:00
chenyu
762f50bd52
move gradient.py to mixin/ [PR] ( #16583 )
2026-06-11 23:58:21 -04:00
chenyu
a2cec397f3
UOp cast and bitcast takes DTypeLike [PR] ( #16582 )
...
* UOp cast and bitcast takes DTypeLike [PR]
match Tensor
* fix type
2026-06-11 22:38:54 -04:00
George Hotz
b97e3e01e3
port NIR to new_style (fable) ( #16580 )
...
* port NIR to new_style (fable)
* lil cleanup
2026-06-11 18:47:30 -07:00
Christopher Milan
4d893f626a
move a bunch of test_schedule to null ( #16578 )
2026-06-11 20:26:34 -04:00
George Hotz
b57639a6cc
port python to new_style (fable) ( #16579 )
...
* port python to new_style (fable)
* doesn't have to be const in python
2026-06-11 17:26:05 -07:00
George Hotz
a04d2fa4eb
port ptx to new_style (fable) ( #16577 )
...
* port ptx to new_style (fable)
* simplify
* simpler
2026-06-11 17:05:03 -07:00
George Hotz
587333fddb
replace DEFINE_VAR with PARAM ( #16576 )
...
* replace DEFINE_VAR with PARAM
* cleanups
* cleanups
2026-06-11 15:03:20 -07:00
chenyu
5f1e2d3900
PADTO pads Invalids ( #16562 )
2026-06-11 16:54:26 -04:00
George Hotz
434a8ffc38
move llvm to new style ( #16573 )
...
* move llvm to new style
* fix wmma
* buffer is early
2026-06-11 12:59:02 -07:00
George Hotz
347608a523
put loads back on reg ( #16572 )
...
* put loads back on reg
* fix dsp
2026-06-11 11:24:50 -07:00
nimlgen
e5f498de3b
hcq2: debug=2 info ( #16569 )
...
* hcq2: debug=2 info
* t
* x
* hcq2: debug=2 info
* x
2026-06-11 19:52:01 +03:00
qazal
a83710396c
support mselect input to CALL, less kernels in allreduce ( #16567 )
...
* support mselect input to CALL, less kernels in allreduce
* resolve mstack
2026-06-11 18:10:47 +09:00
qazal
7d4a77dce4
relax comma benchmark timeout ( #16568 )
2026-06-11 18:03:37 +09:00
qazal
21f1101691
add allreduce kernel count test ( #16566 )
2026-06-11 15:54:12 +09:00
wozeparrot
c38d6a7e3a
mxfp8 part 2 ( #16561 )
2026-06-10 23:36:11 -07:00
Christopher Milan
83971860d8
ci: simplify webgpu install ( #16557 )
2026-06-10 22:57:19 -04:00
Christopher Milan
6e1b61f16f
cleanup some amd deps ( #16563 )
...
don't load hsa runtime, remove ib autogen
2026-06-10 19:01:56 -04:00
George Hotz
7e6d617935
addrspace cleanups ( #16565 )
...
* addrspace cleanups
* bumps
* eh, relax a little
2026-06-10 15:57:18 -07:00
nimlgen
2c9d2c0d31
jit: memplan before compile ( #16560 )
2026-06-10 15:05:15 +03:00
qazal
34481830f1
rangeify: fix cost function for AFTER(out, CALL) ( #16559 )
...
* simple failing test
* fix rangeify cost function
* new ops count
2026-06-10 17:30:50 +09:00
chenyu
623b66e0e4
more tensor and mixin cleanups [PR] ( #16558 )
2026-06-10 00:39:33 -04:00
chenyu
7366d32247
getitem cleanups [PR] ( #16556 )
2026-06-09 22:48:58 -04:00
George Hotz
fd76ac992e
cstyle renderer is new style [pr] ( #16484 )
...
* cstyle new style
* switch cstyle renderer to new style
* fix hip
* fixes
* fix webgpu
* correct webgpu is_packed
* fix dsp
* fixes
* fix Ops.RANGE must be CONST
* old style render access
* this is correct
* fix cstyle to good
* dl/dr
* as array
* fix spec
* remove define_local/define_reg
* buffer in shrink
* fix test_tiny
* all tests fix
* param args aren't realized
* wgsl fix
* work
* new gate
* fix opencl qcom
* process replay
* sort order
* fix render index
2026-06-09 18:36:01 -07:00
Christopher Milan
97d483350c
ci: download prebuilt ocelot ( #16554 )
2026-06-09 19:51:33 -04:00
Christopher Milan
f9d88d3c3a
fix race in test_quantize_onnx ( #16555 )
2026-06-09 18:39:48 -04:00
wozeparrot
2bdc360606
gemm: mxfp8 hipkittens gemm ( #16541 )
...
* gemm: mxfp8 hipkittens gemm
* feat: update hipkittens
* feat: kernel signature
* clean: just kernel
* feat: from tinygrad
* feat: test
* fix: add back utils
* clean: no diff
* clean: no diff
2026-06-09 15:20:05 -07:00
chenyu
12addee14f
tesnor and mixin cleanups [PR] ( #16553 )
2026-06-09 15:33:13 -04:00
nimlgen
2ab2d51099
hcq2: fix repeated calls ( #16552 )
2026-06-09 19:11:42 +03:00
chenyu
3f053a3370
move functional part of rand to RandMixin ( #16551 )
2026-06-09 09:40:48 -04:00
nimlgen
fa31c744b9
hcq2: cleaner ( #16550 )
2026-06-09 16:33:05 +03:00
qazal
598cc13ad2
more readable null graph profile in VIZ ( #16548 )
...
* more readable null graph profile in VIZ
* change
* fix flaky test
2026-06-09 18:35:05 +09:00
qazal
d18ad49f20
fix flaky test_disktensor ( #16549 )
2026-06-09 18:23:22 +09:00
qazal
fa400f9790
less E kernels in all2all ( #16546 )
2026-06-09 13:51:57 +09:00
qazal
b8931440ae
add all2all schedule test ( #16545 )
2026-06-09 12:41:35 +09:00
wozeparrot
5ef30005fa
update hipkittens ( #16544 )
2026-06-08 18:53:25 -07:00
Christopher Milan
4e2e2e9956
ocelot: use c.DLL ( #16540 )
2026-06-08 21:27:28 -04:00
chenyu
11fee53527
RandMixin [PR] ( #16543 )
2026-06-08 19:11:28 -04:00
chenyu
e2ef5cf5c9
no args and kwargs for _multi_like [PR] ( #16539 )
2026-06-08 17:35:15 -04:00
chenyu
12764161c9
UOp.shard support axis=None [PR] ( #16538 )
...
match Tensor
2026-06-08 11:36:50 -04:00
chenyu
ebc5390c9a
advance indexing to mixin [PR] ( #16532 )
2026-06-08 09:24:49 -04:00
nimlgen
95d63d6c07
hcq2: lower to ins ( #16535 )
...
* hcq2: lower to ins
* pm4
* f
2026-06-08 16:15:30 +03:00
nimlgen
8baca185d5
hcq2: add kfd ( #16537 )
2026-06-08 13:48:27 +03:00
chenyu
03943cd1a0
use more _uop for cleanup [PR] ( #16531 )
...
`t.uop if isinstance(t, Tensor) else t` -> `t._uop`
2026-06-07 17:41:36 -04:00
chenyu
937aeaec60
remove device= from UPat.const [PR] ( #16530 )
2026-06-07 16:38:43 -04:00
George Hotz
eb1238436a
more prereqs for DL/DR -> BUFFER ( #16529 )
2026-06-07 12:25:11 -07:00
George Hotz
0336ba8eb1
buffer param arg + dsp fixups ( #16528 )
2026-06-07 12:07:00 -07:00
Dmitriy Strunin
75e903d533
remove unused device arg from _get_winograd_matcols ( #16527 )
2026-06-07 08:15:09 -04:00
chenyu
90b556ca48
move gradient to mixin [PR] ( #16526 )
2026-06-07 00:05:02 -04:00
chenyu
4e7c6260b0
clean up test_tesnor_uop_mixin ( #16525 )
...
most of those don't have UNIQUE anymore
2026-06-06 23:25:44 -04:00
George Hotz
2a2f81dd3d
remove ANON from addrspace, refactor marg ( #16523 )
...
* remove ANON from addrspace, refactor marg
* as_shape
* as_shape is cached
2026-06-06 09:49:09 -07:00
qazal
e69b4189b0
viz: hide STACK on PARAM by default ( #16522 )
2026-06-06 16:41:15 +09:00
Christopher Milan
857b1f5399
ci: more parallelism, less duplication ( #16509 )
2026-06-05 21:26:19 -04:00
wozeparrot
a1ec32cfd2
llama: current grad scaling ( #16518 )
2026-06-05 15:39:41 -07:00
Christopher Milan
8c0ba1da5c
cleanup more from test/backend ( #16521 )
2026-06-05 18:38:46 -04:00
chenyu
9982185b14
remove unused AFTER rules in pm_add_buffers[PR] ( #16519 )
2026-06-05 14:58:34 -04:00
nimlgen
5ebd44aa12
hcq2: merge queues ( #16514 )
...
* hcq2: mergw queues
* cleaner
2026-06-05 21:20:25 +03:00
chenyu
a51b5ba424
remove early fixup const copy [PR] ( #16516 )
2026-06-05 11:35:34 -04:00
Nueramarcos
8274140134
uop/ops: fix ~bool deprecation warning on Python 3.12+ (ORANGE Grok helped with the patch) ( #16512 )
2026-06-05 10:54:30 -04:00
chenyu
588c759a3d
remove unused GroupOp.Buffer [PR] ( #16515 )
2026-06-05 10:38:52 -04:00
qazal
79a13310b3
viz: kernel_graph.txt unique is per schedule ( #16511 )
2026-06-05 16:17:28 +09:00
Christopher Milan
9b0f75622c
many jit tests belong in unit ( #16508 )
2026-06-04 21:36:53 -04:00
chenyu
bb407d8b3c
fix transform_precompiled_call for MULTI ( #16510 )
...
based on my understanding for https://github.com/tinygrad/tinygrad/pull/16084
2026-06-04 20:09:58 -04:00
wozeparrot
f11f63007d
llama: immediate scaling on flag ( #16494 )
2026-06-04 10:30:00 -07:00
George Hotz
4fb8ce1831
update buffer in spec ( #16507 )
2026-06-04 10:12:31 -07:00
chenyu
4a8bf07a87
remove CONST(DEVICE) ( #16506 )
2026-06-04 11:29:46 -04:00
nimlgen
3838c8df1b
hcq2: move global sync ( #16504 )
2026-06-04 17:32:40 +03:00
chenyu
0faaf6df26
remove kwargs from arange and linspace [PR] ( #16505 )
...
it used to have requires_grad and device, now both are removed
2026-06-04 10:32:37 -04:00
qazal
3b1a5f9770
llama: a_bT and aT_b bf16 gemms ( #16487 )
...
* hk_bf16_gemm
* enable in 8b
* cleanups
* rename to USE_HK_BF16_GEMM
* work
* work
* work
* work
* change the gemms
* work
* work
* set as default
* work
* change
2026-06-04 23:30:21 +09:00
chenyu
5fad87252d
no device= into arange and eye ( #16503 )
2026-06-04 09:21:50 -04:00
nimlgen
11af81f96f
hcq2: cleaner ( #16502 )
2026-06-04 15:26:37 +03:00
chenyu
2c915c61ed
no CONST(DEVICE) in torch_backend ( #16499 )
2026-06-04 00:26:47 -04:00
wozeparrot
fd13080636
deviceless const skip axis check ( #16496 )
2026-06-03 19:13:20 -07:00
qazal
f7f03bd7e5
viz: better name for src id in kernel_graph.txt ( #16495 )
...
* viz: better name for src id in kernel_graph.txt
* better order
* cleanup
2026-06-04 11:09:29 +09:00
Christopher Milan
9dac781e45
ci: use uv ( #16492 )
2026-06-03 21:38:50 -04:00
George Hotz
9fdeaa402b
no anon addrspace, don't write hacks ( #16491 )
...
* no anon addrspace, don't write hacks
* revert that
* no reg there
2026-06-03 16:19:30 -07:00
chenyu
2f83d01ccf
fix deviceless materialize device ( #16493 )
...
symbolic arange currently does not fuse, which creates a deviceless UOp post rangeify that needs a device to bufferize
2026-06-03 19:13:21 -04:00
chenyu
19eb72ff60
remove use of full with buffer=False and non-None device= ( #16489 )
2026-06-03 16:21:24 -04:00
nimlgen
6f2a2857c8
hcq2: refactor deps ( #16490 )
2026-06-03 23:20:24 +03:00
chenyu
243446b44f
remove CONST(DEVICE) from const_like ( #16488 )
2026-06-03 14:04:51 -04:00
George Hotz
cee472a0ef
renderer Estimates uses maxel ( #16485 )
2026-06-03 10:55:00 -07:00
chenyu
8a4203638a
make full with buffer=False deviceless ( #16483 )
...
affects arange and eye
2026-06-03 12:35:59 -04:00
qazal
405866f2b7
viz: improve kernel_graph.py usability ( #16486 )
...
* better default
* always format kernel output
* also show ref
* sched num
2026-06-03 21:12:44 +09:00
Christopher Milan
f43cba5765
ci: native python where possible ( #16473 )
...
linters stays at 3.11
2026-06-02 22:40:12 -04:00
wozeparrot
7dcfd144b6
llama: columnwise fp8 scaling ( #16480 )
2026-06-02 18:55:45 -07:00
George Hotz
ffadd7a315
remove intel and amx support ( #16482 )
2026-06-02 18:53:05 -07:00
George Hotz
5f439e3b7c
refactor cstyle to avoid dtype [PR] ( #16478 )
...
* refactor cstyle to avoid dtype
* clean up rules
* add new style option
2026-06-02 18:27:12 -07:00
Christopher Milan
80eeb4dd21
mockgpu: use autogen.libc ( #16479 )
2026-06-02 19:59:36 -04:00
chenyu
a43b55d480
deviceless const folding schedule test ( #16477 )
2026-06-02 18:46:30 -04:00
George Hotz
14f843737b
renderer cleanups (pt 3) [PR] ( #16475 )
...
* renderer cleanups (pt 3)
* point refactors
* fix bugs
* fix PR
2026-06-02 14:24:24 -07:00
nimlgen
99e37b1ee3
hcq2: deps ( #16459 )
...
* start
* sin
* f
2026-06-02 22:34:25 +03:00
George Hotz
82f1c983d4
clean renderer migrations [pr] ( #16472 )
...
* clean renderer migrations
* minor webgpu
* use PARAM UOp as API
* make linter happy
2026-06-02 11:19:00 -07:00
Christopher Milan
9897658895
ci: fix ocelot compilation on macos ( #16471 )
2026-06-02 12:43:31 -04:00
chenyu
6b7d2b91df
update test_uop_graph ( #16470 )
...
use UOp methods instead of constructing UOp directly, some of it violated spec
2026-06-02 08:53:54 -04:00
qazal
854eac09c6
llama: no E_ copy after bf16 GEMM ( #16458 )
2026-06-02 14:14:13 +09:00
George Hotz
7d8ed8d4d7
add store to buffer's addrspace ( #16468 )
2026-06-01 22:07:43 -07:00
George Hotz
20242fdf1d
update test + spec from shrink_in_render ( #16467 )
...
* update test + spec from shrink_in_render
* cast
2026-06-01 19:24:43 -07:00
Christopher Milan
c6cad1ad67
ci: standardize runs-on ( #16466 )
...
* ci: use macos 26
* ugh github
* stick with github for arm
2026-06-01 21:39:58 -04:00
Christopher Milan
b0ecbb34d9
ci: cleanup python backend tests ( #16465 )
2026-06-01 20:08:05 -04:00
Christopher Milan
2d0f132a3b
ci: cleanup more duplicate tests ( #16462 )
2026-06-01 18:56:29 -04:00
wozeparrot
aab9a5a8a3
llama: allow specifying layer count ( #16464 )
2026-06-01 15:36:04 -07:00
chenyu
0167401fa2
minor hcopt WHERE cleanup [PR] ( #16463 )
2026-06-01 17:58:38 -04:00
George Hotz
124d2f8227
anon addrspace from new renderer ( #16461 )
...
* anon addrspace from new renderer
* use max_numel in python renderer
* add sizes to ptrs in tests
* more
* correct fix
2026-06-01 14:42:02 -07:00
chenyu
517eea5985
no CONST(DEVICE) in create_allreduce_function ( #16460 )
2026-06-01 17:12:34 -04:00
chenyu
7e7b481ba7
less CONST(DEVICE) ( #16452 )
...
* less CONST(DEVICE)
no DEVICE for single device in const_like, multi has other issues
* maybe
* that?
2026-06-01 15:55:12 -04:00
George Hotz
556defa0f7
minor updates from vec removal ( #16456 )
2026-05-31 09:48:51 -07:00
Javier De Jesus
989f713c1b
support negative pads in circular pad mode ( #16448 )
2026-05-31 09:28:45 -07:00
nimlgen
2c2cb339e0
fix word wrap ( #16450 )
2026-05-30 23:21:24 +03:00
qazal
29b47a0057
llama: update local amax implementation after ParamArgs change ( #16446 )
...
* local amax failing test
* update _local_abs_max_fxn
2026-05-30 16:55:43 +09:00
wozeparrot
6795c2d5c9
llama: zero grad this way ( #16445 )
2026-05-29 20:25:21 -07:00
George Hotz
cf55aaf01f
python prg is pkl uops ( #16443 )
...
* python prg is pkl uops
* refactor to use uop
* refactor to u.
2026-05-29 19:13:51 -07:00
Christopher Milan
c377d01491
ci: run dsp on tinygrad[testing] ( #16442 )
2026-05-29 21:16:56 -04:00
wozeparrot
c23652e486
llama: minimize peak init mem ( #16440 )
2026-05-29 18:00:37 -07:00
Christopher Milan
d943493b79
ci: remove duplicate op compile test ( #16441 )
2026-05-29 19:20:31 -04:00
chenyu
8ac62b28e5
fix AffineGrid fusion ( #16439 )
2026-05-29 17:59:47 -04:00
Christopher Milan
ef50a49693
ci: macos dev matrix ( #16436 )
2026-05-29 17:40:32 -04:00
Christopher Milan
434cfa96a3
ci: no fetch in backend tests ( #16438 )
...
should make for less actions cache thrashing
2026-05-29 17:11:16 -04:00
chenyu
b7280705a7
limit CONST(UNIQUE) to invalids only ( #16432 )
2026-05-29 16:02:06 -04:00
George Hotz
9506b78d73
fix viz addrspace ( #16437 )
...
* fix viz addrspace
* revert that
2026-05-29 12:58:05 -07:00
nimlgen
d69aca41a9
hcq2: rework pm_bufferize ( #16431 )
2026-05-29 22:09:52 +03:00
George Hotz
e2a0434403
full derivation of addrspace ( #16433 )
...
* full derivation of addrspace
* w/e, it fixes it
2026-05-29 11:39:31 -07:00
wozeparrot
6787de9f52
llama: fix mp ( #16434 )
2026-05-29 11:21:43 -07:00
chenyu
2d7e5baab4
remove vec= from UPat.cvar [PR] ( #16430 )
2026-05-29 10:52:30 -04:00
chenyu
fa666cefe8
remove dead branch in UOp [PR] ( #16429 )
2026-05-29 10:38:49 -04:00
qazal
81bc00c006
do not require clearing method_cache in viz tests ( #16428 )
...
* update
* update test_dedup
2026-05-29 18:12:34 +09:00
qazal
54cfb794b8
viz: addrspace little colored box ( #16427 )
...
* return addrspace
* layout
* render
* addrspace encodes color
* update colors
* in input_ast all are params are green
* update stroke
2026-05-29 17:25:07 +09:00
qazal
814d414f41
viz: set label offset for asm ( #16426 )
2026-05-29 13:16:34 +09:00
wozeparrot
f86966af56
llama: optim amax margin ( #16425 )
2026-05-28 20:18:11 -07:00
Christopher Milan
6e0d5262dc
ci: autocancel outdated pr jobs ( #16424 )
2026-05-28 23:14:35 -04:00
Christopher Milan
69aa2054f6
rename clangjit to clang ( #16423 )
2026-05-28 22:41:58 -04:00
Christopher Milan
a909acb882
move llvmspeed to benchmarks ( #16422 )
2026-05-28 22:26:22 -04:00
George Hotz
1e7f1dcf49
add ParamArgs [pr] ( #16421 )
...
* add ParamArgs
* fix export
* cleanups
* fixes
* simpler
2026-05-28 19:17:17 -07:00
Christopher Milan
7d38edffdb
ci: dev matrix ( #16420 )
...
windows just runs test_tiny
2026-05-28 22:04:04 -04:00
wozeparrot
36c8ff70c1
llama: use old scale for dequant in optim ( #16417 )
2026-05-28 15:21:19 -07:00
George Hotz
c87f3433d1
use namespace runners ( #16387 )
...
Co-authored-by: Christopher Milan <chrismilan@ucla.edu>
2026-05-28 18:05:46 -04:00
George Hotz
c9adde72c1
addrspace property ( #16418 )
...
* addrspace property
* movement addrspace
* regs
2026-05-28 14:39:25 -07:00
Christopher Milan
c8af163d2b
disable process replay by default ( #16419 )
...
enable process replay with [pr] and assert with [PR]
process replay no longer captures on master
2026-05-28 17:36:28 -04:00
nimlgen
b0e49afaf1
hcq2: new multi ( #16413 )
...
* hcq2: new multi
* op
2026-05-28 22:16:10 +03:00
George Hotz
edca5df25a
flip offset and shape in pad and shrink ( #16414 )
...
* flip offset and shape in pad and shrink
* dumb test
2026-05-28 11:58:19 -07:00
chenyu
d72d8ee065
.const() should not ignore dtype ( #16412 )
...
fixed a bug in postrange, also cleaner
2026-05-28 10:49:15 -04:00
Christopher Milan
0ae957bb0a
refactor webgpu ( #16406 )
2026-05-27 23:13:08 -04:00
qazal
202adc644e
viz: make call toggle easier to click on ( #16411 )
...
* call tag is a rect
* details
* colors
* simplify, better comment
2026-05-28 11:53:36 +09:00
George Hotz
5ee6b6b79e
fix slice store to remove the index ( #16410 )
...
* fix slice store to remove the index
* fix spec
2026-05-27 19:17:53 -07:00
qazal
88e88d63d6
viz: click on +- toggles sources ( #16409 )
2026-05-28 09:12:43 +09:00
George Hotz
b21afb4883
marg line cleanup ( #16408 )
...
* marg line cleanup
* bitcast is a mop
2026-05-27 16:41:04 -07:00
wozeparrot
dac3743d75
llama: delayed scaling in optim ( #16407 )
2026-05-27 15:40:03 -07:00
George Hotz
8ee3a37524
shrink/pad use (new_shape, offset) ( #16405 )
...
* shrink uses offset and shape
* pad does too
* fix
2026-05-27 15:13:08 -07:00
Christopher Milan
171401e8df
skip modulo by zero in test_dtype_alu ( #16404 )
2026-05-27 17:09:05 -04:00
qazal
452c7d4230
llama: don't allocate grad_xw13 in bf16 ( #16359 )
2026-05-28 04:33:07 +09:00
nimlgen
0c385e31c6
hcq2 rewrite ( #16375 )
...
* hcq2 rewrite
* fi
* x
* simpler
2026-05-27 22:25:35 +03:00
chenyu
c33b767407
bring back test and torch backend change for unique const ( #16403 )
2026-05-27 15:16:08 -04:00
Christopher Milan
bacabf0866
webgpu: fix enums ( #16402 )
2026-05-27 13:09:50 -04:00
chenyu
6da785562b
test_custom_kernel_precompile_multidevice ( #16401 )
...
add a test to show what invalids need
2026-05-27 11:19:16 -04:00
chenyu
3e80f375ee
skip test_setitem_fancy_on_unrealized_view ( #16400 )
...
crashes in linux llvm ci
2026-05-27 09:50:26 -04:00
chenyu
945ed4f689
revert const unique changes ( #16395 )
2026-05-27 00:06:41 -04:00
Christopher Milan
aacc8addf4
ci: use ubuntu 24.04 ( #16393 )
2026-05-26 23:22:01 -04:00
chenyu
fa14cde05c
test update for arange and eye ( #16394 )
...
these will need explicit clone to make a buffer
2026-05-26 22:48:34 -04:00
wozeparrot
3a7a6da7d5
llama: fakedata uses real vocab size ( #16389 )
2026-05-26 18:58:55 -07:00
George Hotz
156a4438d9
rename BUFFER_VIEW to SLICE ( #16391 )
...
* rename BUFFER_VIEW to SLICE
* fix comments
2026-05-26 18:15:00 -07:00
Christopher Milan
3adf7f5d95
disable flaky cl test ( #16388 )
2026-05-26 19:56:57 -04:00
Christopher Milan
d23659d38b
cleanup some old test skips ( #16384 )
2026-05-26 19:07:22 -04:00
George Hotz
fd963038a0
remove allow_any_len from store ( #16385 )
...
* remove allow_any_len from store
* a few more
* no bv there
* more fixes
* fixes
* oh that
2026-05-26 15:26:53 -07:00
chenyu
0b88827482
remove CONST(UNIQUE) ( #16383 )
2026-05-26 14:45:22 -04:00
chenyu
d861c50dce
remove unique_const ( #16382 )
2026-05-26 13:53:31 -04:00
George Hotz
bac82d4949
fix emu bug in gfx950 ( #16381 )
...
* fix emu bug in gfx950
* fix renderer
2026-05-26 10:32:03 -07:00
chenyu
9b00defc8c
Revert "remove unique_const ( #16372 )" ( #16380 )
...
This reverts commit 09019d6761 .
2026-05-26 12:30:07 -04:00
chenyu
09019d6761
remove unique_const ( #16372 )
...
* remove unique_const
* fix SDWA thing
* that?
2026-05-26 12:18:03 -04:00
George Hotz
7f1b02854e
bufferview offset is units of input dtype ( #16378 )
2026-05-26 08:49:31 -07:00
qazal
846a809af7
viz: add +- toggle for hidden UOps ( #16368 )
...
* first
* remove
* move src toggles to client side
* line
* update viz server tests
* remove those
* logic
* cleanup
* call matches
* fix const arg
* add labels
* keep changes
* the stack on movement ops hiding change
* structure
* rename to expandedNodes
* work
* test intention
2026-05-26 22:31:54 +09:00
nimlgen
032905dec9
hcq2: simpler ( #16361 )
2026-05-26 14:28:48 +03:00
George Hotz
322693dcd3
hotfix: bump Mac pytest timeout to 4 minutes (try 2)
2026-05-25 18:23:21 -07:00
George Hotz
41ee7dab1c
script to generate testsig for DSP ( #16371 )
...
* script to generate testsig for DSP
* cleanups
2026-05-25 17:54:58 -07:00
wozeparrot
76fc39ccc0
gather to single device ( #16354 )
2026-05-25 17:27:08 -07:00
George Hotz
942cb42b97
Revert "hotfix: bump Mac pytest timeout to 4 minutes"
...
This reverts commit 695a0069ed .
2026-05-25 17:25:11 -07:00
Christopher Milan
8ddd1328df
remove getenv(CI) ( #16365 )
...
gone everywhere except test_interop, because torch MPS does not work in actions
2026-05-25 20:23:33 -04:00
George Hotz
695a0069ed
hotfix: bump Mac pytest timeout to 4 minutes
2026-05-25 17:20:19 -07:00
George Hotz
689ab6a49f
move buffer view offset to src ( #16364 )
...
* this work?
* failed
2026-05-25 17:07:55 -07:00
Christopher Milan
d8f86be613
webgpu: shader-f16 support in arch ( #16370 )
2026-05-25 19:20:59 -04:00
qazal
4bcc53eb26
viz: stable node position for +- toggle ( #16367 )
2026-05-26 06:30:47 +09:00
qazal
3506eb08ec
viz: sidebar toggles always recenter ( #16366 )
...
* viz: sidebar toggles always recenters
* python brain
2026-05-26 06:14:32 +09:00
chenyu
cdeb861828
invalids is empty [pr] ( #16353 )
2026-05-25 16:11:38 -04:00
qazal
b73d2d17b9
viz/cli: add --interval ( #16363 )
...
* interval support
* add test_interval
* llama uses interval
2026-05-26 03:35:06 +09:00
C T
2ab90f31b1
use windows-specific alias nvcuda when loading cuda on windows ( #16260 )
...
This also makes it possible to use cuda on windows by specifying 3 env
vars with direct dll paths: NVCUDA_PATH, NVRTC_PATH and NVJITLINK_PATH
without name collision with CUDA_PATH which is used for cuda headers
include path in NVRTCCompiler.
2026-05-25 08:50:50 -07:00
wozeparrot
68d2102fd2
llama: offload master weights ( #16355 )
2026-05-25 08:48:13 -07:00
qazal
eecd4706ff
fix mailbox comment, add types ( #16360 )
2026-05-25 22:24:00 +09:00
nimlgen
64095cf2e2
use get_buf in exec_kernel ( #16356 )
2026-05-25 15:13:40 +03:00
chenyu
5d5e02871f
remove Tensor.from_uop ( #16344 )
...
and no device for const in Tensor init
2026-05-24 18:53:09 -04:00
nimlgen
a891727c9f
hcq2: multi ( #16347 )
...
* hcq2: multi
* cleaner a bit
2026-05-24 19:28:33 +03:00
chenyu
926d125a63
update test_stack ( #16345 )
...
also skip COMPILE_ONLY, it was comparing 0==0
2026-05-23 10:42:35 -04:00
chenyu
149a87dac2
deviceless const cleanups ( #16341 )
2026-05-22 20:11:01 -04:00
Christopher Milan
35461d4d8f
ci: cleanup some deps [pr] ( #16340 )
2026-05-22 19:16:08 -04:00
Christopher Milan
451f38155c
start cleanup of the slowest tests ( #16339 )
2026-05-22 18:39:36 -04:00
nimlgen
26b3b3f6a2
hcq2: move submit lowering to schedule ( #16330 )
...
* hcq: move submit lowering to schedule
* Dx
2026-05-22 23:15:19 +03:00
wozeparrot
2d48fe8b7b
feat: bump version to 0.13.0 ( #16337 )
2026-05-22 13:12:45 -07:00
chenyu
acc519720b
add missing init files, add chat.html to package-data ( #16334 )
2026-05-22 13:53:34 -04:00
googlefan256
eeadf26dad
Fix no module named error ( #16305 )
...
Co-authored-by: chenyu <chenyu@fastmail.com>
2026-05-22 12:51:29 -04:00
nimlgen
90dbb45563
nv: fix boot mem ( #16332 )
...
* nv: fix boot mem
* linter
2026-05-22 19:28:38 +03:00
nimlgen
5d77a94923
am: mec_pipe0_reset on gfx12 only ( #16331 )
2026-05-22 19:02:18 +03:00
qazal
bbfe4f80ec
quantize_fp8 kernels in uops ( #16288 )
...
* add tests
* simple UOp kernel is n^2
* fast kernel matching c++, opts_to_apply=()
* remove cpp
* simple o(n) kernel, two passes
* fuse the loops
* works on DEV=CPU
* multi regression test
* fix multi, this can possibly be its own bugfix
* test cleanups
* minimal diff
* match C in UOps
* Revert "match C in UOps"
This reverts commit 0bef740c30 .
* edit test
* match speed with C try 2
* needs_second_gpu
* cleanup
2026-05-22 20:54:06 +09:00
chenyu
3115952266
more unique const removal prerequisite ( #16328 )
2026-05-21 23:51:40 -04:00
Christopher Milan
c2d06570a5
remove getenv(CI) from core tinygrad ( #16326 )
2026-05-21 22:20:33 -04:00
chenyu
9744d512d9
use more non-buffered const ( #16327 )
2026-05-21 21:37:52 -04:00
Christopher Milan
150a82de1f
start cleaning up dtype tests ( #16324 )
2026-05-21 21:11:49 -04:00
chenyu
31424cda71
Tensor.requires_grad -> is_param ( #16325 )
...
for optimizer
2026-05-21 19:39:57 -04:00
Christopher Milan
518e60534e
only load tinymesa_cpu when LVP is explicitly requested ( #16320 )
2026-05-21 19:03:13 -04:00
chenyu
720a27bed8
remove many requires_grad= args ( #16321 )
...
* remove many requires_grad= args
* doc and example
* not cifar
2026-05-21 18:37:11 -04:00
wozeparrot
0c41317a59
llama: update 405b scripts ( #16309 )
2026-05-21 14:03:34 -07:00
wozeparrot
fb718a5e9d
llama: realize amax ( #16308 )
2026-05-21 14:00:48 -07:00
chenyu
73ea36f4ac
full(buffer=True) ( #16311 )
...
make full a buffer with flag to turn off
2026-05-21 16:34:44 -04:00
George Hotz
6815f28849
dtype.vec shapes ( #16287 )
...
* dtype.vec shapes
* something
* Closer
* more passes
* shape is in spec
* fix reduce
* image dtype shape correct
* lil
* use reshape on image
* need BUFFER there
* remove that test
* fix ptx + x86
* fix nir
* x86 fix maybe
* x86 fixups
* x86 fix
* don't check that for NOOP
2026-05-21 11:56:49 -07:00
wozeparrot
afc5bfa183
llama: remove fused grad accum ( #16301 )
2026-05-21 09:38:40 -07:00
nimlgen
a321700baa
hcq2: multi prereqs ( #16304 )
2026-05-21 17:00:52 +03:00
qazal
e33e058d34
set SPLIT_W13=0 for 8b DP by default ( #16302 )
2026-05-21 22:09:10 +09:00
Christopher Milan
dd279ee25e
print dtype decomp warning in DEBUG=2 ( #16300 )
2026-05-20 22:08:48 -04:00
George Hotz
ec547250ef
don't use dtype vec for image idx ( #16298 )
...
* don't use dtype vec for image idx
* double gate
* y/x confused
* upd
* fix nir
* simplify_valid_image_load
2026-05-20 18:45:13 -07:00
Christopher Milan
172f9493e1
move is_dtype_supported to renderer ( #16226 )
2026-05-20 21:19:37 -04:00
chenyu
d548f8d0f3
use clone instead of unique_const in allreduce [pr] ( #16297 )
2026-05-20 18:58:47 -04:00
qazal
9e88b08f93
x86: don't use id ( #16296 )
...
* x86: don't use id
* diff
* more minimal change
* unique
2026-05-21 07:36:40 +09:00
Christopher Milan
da07b28998
am: override smu 13_0_7 to 13_0_0 ( #16292 )
2026-05-20 18:14:30 -04:00
chenyu
beea4633fc
UOp.clone [pr] ( #16295 )
...
generates the store after structure
2026-05-20 17:47:49 -04:00
qazal
a19fa2908f
fix x86 nondeterminism ( #16293 )
2026-05-21 05:48:05 +09:00
George Hotz
58d58c1659
remove DEVECTORIZE ( #16290 )
...
* remove DEVECTORIZE
* fully remove DEVECTORIZE
2026-05-20 13:25:49 -07:00
wozeparrot
825f30bf18
llama: apply_grad saves memory ( #16275 )
2026-05-20 13:14:06 -07:00
nimlgen
a88feef40f
hcq2: cleanups ( #16278 )
...
* s
* simpler
* simler
2026-05-20 21:48:50 +03:00
Philipp Braun
a01d5918af
fix: qlinearconv quant params ( #16234 )
...
* fix: qlinearconv quant params
* fix: simplify reshape
---------
Co-authored-by: Philipp Braun <braunphilipp@users.noreply.github.com>
2026-05-20 11:31:41 -07:00
George Hotz
19535df53c
enable broadcasting in _shape ( #16285 )
2026-05-20 11:21:51 -07:00
chenyu
4dbe6a2ee7
remove _force_unique from Tensor init ( #16277 )
2026-05-20 14:13:05 -04:00
Christopher Bradford
fe2d8d1ecf
filter by base_class in pci_scan_bus on macOS ( #16282 )
...
The Linux path of pci_scan_bus reads /sys/bus/pci/devices/.../class and
skips devices whose base class doesn't match. The macOS (IOKit) path
appended every IOPCIDevice unconditionally, so callers that supplied
base_class to narrow down to e.g. display devices would also get the
audio companion function of a multifunction GPU.
Concretely, an NVIDIA RTX Pro 6000 Blackwell exposes:
10de:2bb1 class 0x030000 (display)
10de:22e8 class 0x040300 (multimedia audio)
A PROBE for base_class=3 returned both. With the sorted() at the end of
pci_scan_bus, 22e8 (audio) came first, so the NV runtime picked the
audio function as device 0 and stalled on RESIZE_BAR.
This mirrors the Linux filter on line 70 using the existing read_prop
helper.
Co-authored-by: Christopher Bradford <christopher.bradford@joby.aero>
2026-05-20 20:09:35 +03:00
qazal
1e0fffe256
fused ce llama kernel in UOps ( #16263 )
...
* work
* using uops
* delete things
* work
* work
* higher level uops
* cleanups
2026-05-20 19:45:28 +09:00
chenyu
e1715b3b92
extent jit const error to deviceless inputs ( #16276 )
2026-05-20 02:02:45 -04:00
chenyu
170b857da9
clean up deviceless const _buffer ( #16274 )
...
process on CPU similar to multi
2026-05-19 22:47:45 -04:00
chenyu
7af7b6703a
relax policy ASSERT_MIN_STEP_TIME to 3.2 ( #16273 )
2026-05-19 22:29:09 -04:00
chenyu
188d7ec15e
clone can take device ( #16271 )
...
useful to materialize const on a specific device
2026-05-19 21:29:27 -04:00
wozeparrot
361553c0a8
llama: match flat_llama with model_train ( #16269 )
2026-05-19 17:25:56 -07:00
George Hotz
da7414d6dc
fix RUN_PICKLE and test it ( #16272 )
...
* add test for openpilot RUN_PICKLE
* fix RUN_PICKLE and test it
2026-05-19 17:00:25 -07:00
George Hotz
55515747b7
Remove Ops.VCONST ( #16267 )
...
* start removing vconst
* remove a lot of vconst
* const folding + strict ordering
* update tests
* spec from minigen
* move that
2026-05-19 16:35:24 -07:00
Christopher Milan
7cdd9cbdeb
PYTHONREMU: V_CVT_PK_BF8_F32 saturation ( #16268 )
2026-05-19 19:29:59 -04:00
Christopher Milan
bb2a51f1ea
fix mypy mockgpu and add tinygrad.renderer.isa to packages ( #16265 )
2026-05-19 16:45:03 -04:00
chenyu
890b731b1e
more prerequisuite test changed for deviceless const ( #16264 )
2026-05-19 15:43:45 -04:00
ttomsa
aa1e59ab97
X86 with Ops.INS ( #14873 )
...
* draft
* cleanup test_encodings
* cleanup test_isel
* model flag state and support rematerialization
* woops
* add vbroadcastss instruction
* don't fuse load if used multiple times in src
* add movabs instruction and fix idiv
* fixes
* add x86 backend to tests
* float16 fix
* rm TwoAddress2nd
* add BARRIER
* test windows ci
* yup isel fixes the mask stuff too and its beautiful
* add cmoves to the spec
* support storing imms
* no TUPLE_ORDER, breaks tests
* fix remaining seg faults
* add float max
* always fuse index
* minor
* fix DEFINE_VAR/SPECIAL and enable multithreading
* linter
* more linter
* more
* more
* more
* let's try this
* perhaps
* start new scheduler
* more scheduling info
* cleaner shuffle functions
* fixup isel tests
* skip bounds check when NOOPs exist
* skip inf rewrite tests
* fix const tag hack and add x86ops to _shape
* fix
* skip a few tests
* func arg order independent from op value
* x86 goes in own linearize
* switch to PARAM
* more
* add min x86op and neg in decomps
* do mulacc in isel
* use def_reg in test_encodings
* enable emulated int64 tests
* how much does this fix
* Ops becomes OpType
* fix
* rm noqa
* rm machine scheduler stuff
* and this
* allow for extending enums and move X86Ops out of uop
* fix imports
* rm X86GroupOp from ops.py
* spacing
* tell mypy to shut up
* more linter
* add x86op test
* allow set[X86Ops] in upat
* move NOOPs to pre_isel_matcher and rm NOOP from spec
* more asserts
* also this
* cleanup encode
* simplify live range
* fix idiv
* add Ops.INS to x86
* more changes
* more changes
* more changes
* fix
* fix
* fix
* fix
* print formatted assembly
* fix 8bit idiv?
* oops
* enable float16 and unaligned vector load/store
* actually no
* move x86 tests
* no more bool cast
* fix
* linter
* linter
* move X86Ops to x86.py
* fix vpbroadcast
* cleanups
* linter
* print correct reg names
* canonical max
* move max/min and add test
* support float16 vector load/store
* rm bad rewrite
* vpsrldq can't access memory
* regalloc takes renderer
* enable vector load/store on all dtypes
* more isel tests
* rm this for now
* a lot better
* fix
* fix
* fix
* deal with flags correctly
* fix
* enable gep noop rule
* fix
* fix
* fix
* add callee saved registers
* use Ops.CONST instead of X86Ops.IMM
* fix
* enable TUPLE_ORDER
* fix
* rm x86 code in linearizer
* fix
* fix
* fix
* move isa rewrites to codegen
* fix
* fix
* skip test_linearizer.py
* skip more tests
* fix
* fix for idiv/mod changes
* fix
* don't use fmadd if it duplicates fused op
* hacky
* fix
* cleanups
* cleanups
* fix
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2026-05-19 12:42:54 -07:00
George Hotz
b2e8102209
25000 lines for x86 backend
2026-05-19 11:27:41 -07:00
Sachith Shetty
74567c1958
fix: pass input device to ONNX helper internal tensors ( #16242 )
...
* fix: pass input device to onnx methods internal tensors
* test: onnx helper internal tensors use input device
2026-05-19 11:16:33 -07:00
Christopher Milan
a178301dbe
PYTHONREMU: fix CDNA VOP3 conditional writes ( #16258 )
2026-05-19 13:31:31 -04:00
nimlgen
b3dcf8f452
hcq2: split into schedule/realize ( #16216 )
...
* hcq2: split into schedule/realize
* missing
* x
* f
* clean
* cleaner
* x
* x
* x
* x
* x
2026-05-19 16:40:17 +03:00
qazal
e4350e7de9
set hipcc mac docker to 7.1 ( #16261 )
...
* set hipcc mac docker to 7.1
* pull from amd
2026-05-19 21:30:39 +09:00
George Hotz
a120709671
tighten shape spec for broadcasting ( #16206 )
...
* tighten shape spec for broadcasting
* use IndexError, not ValueError
* needs size
2026-05-18 22:12:04 -07:00
George Hotz
3f2d401464
all tests pass with NOOPT=1 ( #16257 )
...
* all tests pass with NOOPT=1
* fix a few more
* noopt 100% pass
* noopt 100% pass
2026-05-18 20:39:51 -07:00
chenyu
e694d7f222
more deviceless const prerequisites [pr] ( #16256 )
...
* more deviceless const prerequisites [pr]
* remove that
* arange.contiguous -> arange.clone in tests
arange will become deviceless const soon, update tests where it needs to be a buffer
2026-05-18 23:14:12 -04:00
chenyu
c1076ed56c
Tensor.device and UOp.device can be None ( #16255 )
2026-05-18 22:08:10 -04:00
wozeparrot
a3d59faef6
llama: don't save weight ( #16252 )
2026-05-18 17:05:45 -07:00
qazal
18b102f355
llama: also use 7.1 comgr, update startup_walltime.sh ( #16253 )
2026-05-19 08:59:02 +09:00
chenyu
d532b4f533
multi alu with deviceless const ( #16251 )
2026-05-18 19:31:53 -04:00
qazal
98b8a2b407
llama: use hipcc 7.1 version ( #16250 )
2026-05-19 08:09:57 +09:00
Christopher Milan
7515824a6d
ci: actually use clang-20, enable bfloat16 ( #16249 )
2026-05-18 19:06:43 -04:00
chenyu
754344087a
assign for deviceless const source ( #16248 )
2026-05-18 17:39:53 -04:00
chenyu
73e6b4963b
to and shard is noop for deviceless uop ( #16247 )
2026-05-18 16:11:10 -04:00
Christopher Milan
50481ec9b4
cl: check for cl_khr_fp64 ( #16246 )
2026-05-18 14:42:43 -04:00
chenyu
db639ebe3e
deviceless const from UOp ( #16243 )
2026-05-18 14:14:12 -04:00
qazal
bfb2d1f89a
Revert "fp8 gemm speedup ( #16236 )" ( #16245 )
...
This reverts commit d95bf394e1 .
2026-05-19 02:01:44 +09:00
chenyu
5ae4dbd599
make slow tests faster ( #16244 )
2026-05-18 11:42:02 -04:00
chenyu
981c12182f
remove requires_grad= in tinygrad/ ( #16241 )
2026-05-17 16:55:37 -04:00
chenyu
fcdd1af880
remove Tensor.detach override [pr] ( #16239 )
2026-05-16 23:58:12 -04:00
chenyu
dcee90aa3f
remove requires_grad use in extra/examples ( #16238 )
...
except the ones fed into optimizer
2026-05-16 18:40:26 -04:00
chenyu
8631b6f17d
remove use of requires_grad in test/ ( #16237 )
2026-05-16 17:21:07 -04:00
qazal
d95bf394e1
fp8 gemm speedup ( #16236 )
...
* add asm_gemm option
* milestone
* work
* edit
* only the fast kernel
* diff
2026-05-17 04:58:28 +09:00
chenyu
0ddc50d050
do not gate backward on requires_grad ( #16230 )
...
DETACH is filtered in _deepwalk. instead of None, it gets 0 grad now
2026-05-16 12:29:49 -04:00
nimlgen
bef5f717bc
fix nolocals and beam ( #16232 )
2026-05-16 18:09:19 +03:00
qazal
ebcb7b7cc0
fp8 gemm tests with scale args ( #16231 )
...
* update atol
* update fp8 path
* more work
* update profile.sh
2026-05-16 20:47:58 +09:00
nimlgen
e575f778f9
move debug prints ( #16218 )
...
* move debug prints
* x
2026-05-16 13:57:34 +03:00
wozeparrot
2d48d7ab09
remove more invalid ( #16227 )
2026-05-16 02:52:27 -07:00
wozeparrot
159694347e
llama: fix running flat_llama ( #16224 )
2026-05-15 20:16:48 -07:00
Christopher Milan
79c0ae5b89
metal: arch is GPU family ( #16223 )
2026-05-15 21:22:48 -04:00
Christopher Milan
2c61f65211
cl: device extensions in arch ( #16220 )
2026-05-15 18:59:20 -04:00
George Hotz
2549b14ec2
fix caformer onnx run ( #16222 )
2026-05-15 15:08:36 -07:00
George Hotz
2570bded8b
update spec for LOAD ( #16221 )
...
* add load to the spec
* can
2026-05-15 14:46:00 -07:00
chenyu
d62c1d83c0
remove Tensor.eye override ( #16219 )
...
* remove Tensor.eye override
was only needed for requires_grad arg
* README
2026-05-15 15:40:34 -04:00
chenyu
07a172dbbb
remove noop requires_grad_ calls ( #16213 )
2026-05-15 13:31:10 -04:00
chenyu
c6cf9e8f0c
remove test_svd_nonfull_5_5 ( #16217 )
...
flaky, kinda overlap with test_svd_general
2026-05-15 13:10:02 -04:00
qazal
d54fa86b71
viz/cli: select all calls in graph by default ( #16214 )
2026-05-15 21:01:44 +09:00
nimlgen
28b98e529d
nv: move structs to vram ( #16184 )
...
* nv: vram
* x
* 4090
* x
* move and sysmem on macos
* x
* remove hp
2026-05-15 13:41:42 +03:00
chenyu
409bb0c9ad
requires_grad cannot be None ( #16212 )
...
final goal is to remove requires_grad, first change the default to True, and don't allow None
2026-05-15 02:01:04 -04:00
Christopher Milan
c7870f11ff
mesa: suggest curl install tip ( #16211 )
2026-05-15 00:29:06 -04:00
chenyu
a612b88abb
better assert when setitem a refed tensor ( #16210 )
...
also decouple from requires_grad
2026-05-14 23:40:29 -04:00
chenyu
a75c14f010
some setitem tests ( #16209 )
2026-05-14 22:36:25 -04:00
Christopher Milan
891a1ae7c2
onnx: remove dtype_fallback ( #15717 )
2026-05-14 22:06:57 -04:00
wozeparrot
b4d267dfd4
llama: only save when small ( #16208 )
2026-05-14 17:46:29 -07:00
chenyu
ffa1aac7b1
gradient for STORE/AFTER ala clone ( #16205 )
2026-05-14 20:17:27 -04:00
chenyu
09096ea565
test_gradient_through_clone ( #16203 )
...
backward through clone crashes now
2026-05-14 19:26:47 -04:00
George Hotz
d4dcd8487b
aggressive shape check to prepare for broadcasting ( #16202 )
...
* add implicit broadcasting to shape
* NOOP/ALLREDUCE fixes
2026-05-14 16:15:44 -07:00
George Hotz
83ec66da34
fix a fastdiv edge case ( #16199 )
2026-05-14 13:12:18 -07:00
nimlgen
62ea73719d
hcq2: share more with graph ( #16196 )
...
* share more with graph
* comment
2026-05-14 22:28:11 +03:00
George Hotz
3b8cc31759
disable fast idiv by default, it's broken ( #16197 )
...
* disable fast idiv by default, it's broken
* fix fast idiv tests
2026-05-14 11:48:27 -07:00
Christopher Milan
8f811649ff
better compiler_cpu invalid arch errors ( #16194 )
2026-05-14 14:36:14 -04:00
qazal
f03a7fd6d1
viz/cli: readable uop json ( #16195 )
...
* viz/cli: readable uop json repr
* work
* better
2026-05-14 21:33:10 +09:00
C T
1b779a9058
add gelu approximate="none" (match pytorch) ( #16162 )
...
* add gelu approximate="none" (match pytorch)
* lint
* pass through onnx Gelu approximate
* type annotate
* explicit math.sqrt
* keep tinygrad's gelu approximate="tanh" default
2026-05-13 18:53:24 -07:00
chenyu
dd9187d9ee
minor hash cleanups ( #16190 )
...
same kernels
2026-05-13 20:59:24 -04:00
wozeparrot
88ac2ac1fd
llama: cleanups ( #16189 )
2026-05-13 17:08:06 -07:00
Christopher Milan
9a365d9978
ci: fix null image tests ( #16188 )
2026-05-13 18:00:05 -04:00
nimlgen
ad1fb7c981
hcq2: graph ( #16186 )
...
* keep this for now
* early graph
2026-05-13 22:49:43 +03:00
chenyu
3f9f6a51b2
minor image_conv2d cleanup ( #16187 )
...
remove some no-op slices
2026-05-13 15:47:40 -04:00
b1tg
59c34b9fe0
llm: precise device ( #16159 )
...
* llm: precise device
* llm: pass device to precompute_freqs_cis
2026-05-12 21:16:42 -07:00
b1tg
3c806ff406
clean up gguf ( #16160 )
2026-05-12 21:16:10 -07:00
wozeparrot
e97f2c1114
llama: only gemm + fa custom kernel ( #16180 )
...
* llama: tie store to grad directly
* llama: set mp flags
* llama: non fused grad fp8 quantize path
2026-05-12 21:03:49 -07:00
chenyu
38d407fd58
simplify svd more ( #16181 )
...
all the slowness is scheduling
2026-05-12 23:48:22 -04:00
Christopher Milan
f1fdd2ccec
ci: add IMAGE=1 compile-only tests ( #16182 )
...
* ci: add IMAGE=1 compile-only tests
* fix
2026-05-12 23:40:32 -04:00
George Hotz
faf7fb7513
update nir renderer for new image style ( #16179 )
...
* update nir renderer for new image style
* don't cast image indexes
2026-05-12 20:25:01 -07:00
Christopher Milan
7d0c5ab689
ci: ocelot needs nvcc on linux ( #16178 )
...
* ci: ocelot needs nvcc on linux
* cudart
2026-05-12 23:13:48 -04:00
chenyu
32138c2418
svd to mixin ( #16175 )
2026-05-12 22:29:01 -04:00
George Hotz
69e1f3b551
remove vec2 from image in gater ( #16165 )
...
* remove vec2 from image in gater
* only simple idx
* fix python with new image style
* fix vconst
* just vconst and stack
* cast to int there
* fix for const
* fix process replay
2026-05-12 19:25:52 -07:00
chenyu
2172363be5
don't use Tensor indexing in svd ( #16174 )
...
prepare mixin, also about 4X faster for 8x8 input
2026-05-12 21:56:19 -04:00
chenyu
420a08c6d1
qr to mixin ( #16173 )
2026-05-12 21:23:25 -04:00
chenyu
c6a82fe927
functional qr and svd ( #16172 )
...
no clone and setitem, will move to mixin next. slightly faster but still quite slow
2026-05-12 19:12:08 -04:00
Christopher Milan
3844a31f87
ci: untangle cuda/ocelot, less apt ( #16171 )
...
* ci: untangle cuda/ocelot, less apt
* ldconfig
2026-05-12 18:14:03 -04:00
Christopher Milan
316607f004
dsp: don't use docker in ci ( #16167 )
...
* dsp: don't use docker in ci
* add setup script for macos docker
2026-05-12 17:11:03 -04:00
chenyu
bdcdf1f1a1
jittable masked_select and nonzero ( #16170 )
...
* jittable masked_select and nonzero
make jittable with `size=`, matches jax
* COMPILE_ONLY
2026-05-12 16:39:36 -04:00
wozeparrot
a613bcfc6d
allow after on contiguous in spec ( #16169 )
...
* feat: allow after on contiguous
* feat: add test
2026-05-12 13:11:44 -07:00
chenyu
7c3e3fa154
fix empty input for masked_select and nonzero ( #16168 )
2026-05-12 15:36:51 -04:00
chenyu
da3b7e89a4
atol in test_custom_kernel_multi_output_backward_interacting ( #16166 )
2026-05-12 14:42:12 -04:00
chenyu
25583f6dc1
fix cumsum dtype for 0d input ( #16164 )
2026-05-12 14:18:08 -04:00
George Hotz
64c81dfd24
add all codegen stages to spec_tensor ( #16163 )
2026-05-12 10:35:38 -07:00
chenyu
f3e3c3851f
explicit args to Tensor.rand ( #16161 )
...
added requires_grad, other kwargs were silently dropped
2026-05-12 12:53:39 -04:00
nimlgen
e93fb5f9b9
hcq2: remove hcqprogram ( #16157 )
...
* hcq2 rm program
* nonbeauty
* no prog
* tiny
* f
* x
2026-05-12 18:49:13 +03:00
nimlgen
a708542308
fix ci spec ( #16156 )
2026-05-12 17:57:11 +03:00
nimlgen
e5729935c6
time_call ( #16152 )
...
* time_call
* x
* fix caches
2026-05-12 16:58:28 +03:00
qazal
fe39cf148a
add Ops.SOURCE test ( #16155 )
...
* simple failing test
* raises
* change
2026-05-12 22:49:32 +09:00
qazal
5cd0494b14
viz: canonicalize ast for schedule to codegen linking ( #16154 )
...
* simple failing test
* always null device
* viz: canonicalize ast for schedule to codegen linking
* SCACHE
2026-05-12 22:40:21 +09:00
qazal
c1d125ff3b
llm: add markers to --benchmark ( #16153 )
...
* markers in llm
* ui fix
2026-05-12 20:14:11 +09:00
wozeparrot
e9359d9e7d
more llama mp fixes ( #16151 )
...
* llama: SPLIT_W13
* llama: fix with no fused kernels
* llama: cast to bf16 on non asm_gemm patH
* llama: new mp flags
2026-05-11 21:29:23 -07:00
chenyu
09fd80fba6
fix randperm and _multi_like drop requires_grad ( #16150 )
2026-05-11 23:23:34 -04:00
George Hotz
8294d105a7
Update the spec in spec.py to match the current state ( #16132 )
...
* start work on specv2
* more spec
* more spec
* fix amd emulator
* more spec
* more
* fix test_uop_graph
* move those
* spec=2
* skip those questionable tests
* ptx fix
* more spec=2
* store
* allow custom function in tensor
* spec 2
* fix beam search for tensor cores
* delete the old specs
* fix import
2026-05-11 20:07:47 -07:00
chenyu
3942a80f66
fix wrong kwargs passed into rands ( #16149 )
...
working towards explicit args for these
2026-05-11 22:22:06 -04:00
Christopher Milan
039d84ff02
Revert "onnx: deduplicate simple proto parsers" ( #16148 )
...
This reverts commit 83eaefcd0f .
2026-05-11 21:45:17 -04:00
Christopher Milan
20f587d5d5
nv: rm _download ( #16147 )
2026-05-11 19:56:37 -04:00
chenyu
371ab2023f
clean up image_dot and image_conv2d ( #16145 )
2026-05-11 19:37:58 -04:00
Vikram Rangarajan
effa263865
Torch backend aten::cat.out fix ( #16121 )
...
* Handle empty 1D tensors in cat_out
* Undid other changes
* Fixed torch cat
* Improved cat.out, added more tests
* Cleaned code
* Type hinted dim
* Removed whitespace
2026-05-11 16:28:16 -07:00
chenyu
63c1f00b80
disable test_svd_general again ( #16146 )
...
flaky on CI
2026-05-11 19:24:32 -04:00
Christopher Milan
2dccd4a3eb
am: autogen pmc ( #16143 )
...
* am: autogen pmc
* cleanup
* fix
* type
2026-05-11 19:22:12 -04:00
Christopher Milan
7ba55ad3ba
nv: autogen regs ( #16139 )
...
* nv: autogen regs
* flcn cot
* ci
* gen
2026-05-11 18:52:24 -04:00
chenyu
0b02fb6797
Revert "[pr] match torch rmsnorm ( #16122 )" ( #16144 )
...
This reverts commit 692257dd70 .
2026-05-11 17:53:42 -04:00
chenyu
fbe8be0b8b
style cleanup to Tensor.qr and svd ( #16142 )
...
* style cleanup to Tensor.qr and svd
same kernels
* more
* enable
2026-05-11 17:16:59 -04:00
qazal
fc2cc1d77a
viz: call graph renderer example ( #16141 )
...
* work
* emits
* this
* cleaner repr for custom binaries
* --call-graph
* _ref
* this
* start
* this
* everything execpt the pyrender
* bring pyrender back
2026-05-12 05:07:30 +09:00
chenyu
f65e343fb3
spec.py cleanups ( #16140 )
...
removed END from shared_spec and NOOP from full_spec
2026-05-11 15:59:49 -04:00
Joshua James Venter
692257dd70
[pr] match torch rmsnorm ( #16122 )
...
* [pr] match rmsnorm torch
Signed-off-by: Joshua James Venter <venter.joshua@gmail.com>
* 1e-5
* ops.md
---------
Signed-off-by: Joshua James Venter <venter.joshua@gmail.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
2026-05-11 14:36:41 -04:00
Sachith Shetty
59a81559d4
fix: add self.device to qr, svd, masked_select intermediates ( #16131 )
2026-05-11 11:22:54 -04:00
nimlgen
70c2480e71
hcq2 to extra ( #16126 )
...
* hcq2 in extra
* correct
* some revert from non-extra
* cln
* cpu
* x
* attach
* min
* remove attach
* linter
2026-05-11 17:17:30 +03:00
nimlgen
ad9738892c
get_buf() for Buffer ( #16134 )
...
* p
* mypy
* x
2026-05-11 16:36:14 +03:00
qazal
2dd84416bf
viz/cli: schedule renderer ( #16101 )
...
* simpler steps
* work
* work
* iterate
* faster
* better
* simplify more
* sys stdin
* less
* work
* work and mv
* better
* seen bufs
* all call graphs
* print query
* ux
* param to buffer / buffer_view
* work
* respect NO_COLOR in uop_to_json
* less
* render uops
* rm custom renderer
* call can't pyrender.
* unrelated diff
* assert
* 5
2026-05-11 01:56:16 +09:00
George Hotz
53f9587099
add canary
2026-05-10 09:38:18 -07:00
George Hotz
28cb7f1bcc
update readme with contributing guidelines
2026-05-10 09:35:48 -07:00
George Hotz
daed602569
rename BUFFERIZE to STAGE ( #16125 )
2026-05-10 09:26:46 -07:00
qazal
39ce780907
viz/cli: emit all runs of selected kernel, json fixes ( #16124 )
...
* keep print
* --json in tests, sqtt --json err
* work
* import
* less
* line
2026-05-10 21:45:51 +09:00
qazal
51c7dafb0d
split viz cli test helpers ( #16123 )
2026-05-10 19:42:24 +09:00
chenyu
b2a682ec60
remove _shape check in pm_mops [pr] ( #16120 )
...
seems fine now
2026-05-09 17:54:22 -04:00
wozeparrot
026688f03f
llama: move to correct dir ( #16118 )
2026-05-08 19:42:16 -07:00
Christopher Milan
a7512e0d12
PYTHON: images have no alignment constraints (by default) ( #16115 )
2026-05-08 20:35:03 -04:00
Christopher Milan
105b037c3c
cl: image alignment in arch ( #16106 )
2026-05-08 19:33:33 -04:00
Charlie Kerfoot
71a8c0da09
fix: trailing space format string ( #16005 )
2026-05-08 16:31:10 -07:00
Pawan
4dd6ad3514
gradient: add TRUNC backward ( #15925 )
...
* gradient: add TRUNC backward
* test: move round quantization gradient to test_ops
2026-05-08 16:27:55 -07:00
chenyu
5152ff95e7
_pad_constant and avg_pool2d cleanups ( #16110 )
2026-05-08 18:09:47 -04:00
chenyu
e6584532f4
minor elementwise cleanups ( #16102 )
2026-05-08 13:38:34 -04:00
nimlgen
49b55af619
jit: simpler free_intermediates ( #16099 )
2026-05-08 19:08:33 +03:00
chenyu
0f46c08582
div mixin cleanups ( #16100 )
2026-05-08 12:05:37 -04:00
chenyu
235044c9d8
Ops.IDIV -> Ops.CDIV, Ops.MOD -> Ops.CMOD ( #16093 )
...
* Ops.IDIV -> Ops.CDIV, Ops.MOD -> Ops.CMOD
* ruff
2026-05-07 23:18:15 -04:00
Christopher Milan
faabe6aa42
nv: remaining firmware from /lib/firmware ( #16088 )
2026-05-07 23:07:43 -04:00
b1tg
7ef901a81d
llm: moe speedup ( #16059 )
2026-05-07 19:06:35 -07:00
George Hotz
80da8a4b9c
add spec to main tinygrad repo ( #16092 )
2026-05-07 18:52:49 -07:00
June
83eaefcd0f
onnx: deduplicate simple proto parsers ( #16085 )
...
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2026-05-07 18:44:27 -07:00
George Hotz
c106c73e51
remove the gate from index ( #16081 )
...
* remove the gate from index
* gpt says this works
* remove hanging casts
* simplify
* move that down
* move gates
* ptr
* remove that simplify
* move that
2026-05-07 18:42:00 -07:00
wozeparrot
d11f4d0ec2
fix: don't copy on slice of DP weight ( #16089 )
2026-05-07 17:58:01 -07:00
George Hotz
1d1b726cf6
hotfix: disable flaky framework pytest
2026-05-07 17:05:06 -07:00
Christopher Milan
9a6f7f7576
nv: look for fmc firmware in /lib/firmware ( #16080 )
2026-05-07 18:08:27 -04:00
George Hotz
b796bbae87
fix valid in indexing tests ( #16087 )
2026-05-07 14:11:28 -07:00
wozeparrot
4d1a9dca41
fix: don't copy precompiled custom kernel outputs ( #16084 )
2026-05-07 14:02:38 -07:00
qazal
f9083cf901
use subactions for benchmark.yml process replay [pr] ( #13396 )
2026-05-08 03:46:25 +09:00
nimlgen
2f0aa884d5
tinygpu: minimal is macos13 for resets ( #16075 )
2026-05-07 21:25:56 +03:00
chenyu
072db9924c
div to mixin ( #16078 )
...
also deleted idiv method
2026-05-07 12:52:37 -04:00
chenyu
516b00e286
mod and fmod to mixin ( #16077 )
2026-05-07 12:13:39 -04:00
qazal
a9a87ad8fd
viz/cli: less flags ( #16076 )
...
* viz/cli: merge -s and -i flags
* only -t
* merge parser
* fix
2026-05-08 00:22:40 +09:00
qazal
f813a04b3f
viz: pickle path in str ( #16073 )
2026-05-07 18:49:21 +09:00
wozeparrot
730fa66bf3
llama speed 6 ( #16071 )
2026-05-06 20:51:03 -07:00
Christopher Milan
7b91f7c90c
nv: look for gsp firmware in /lib/firmware ( #16068 )
2026-05-06 21:35:47 -04:00
George Hotz
8e84317743
the renderer part of gate moving from index to load/store ( #16064 )
...
* the renderer part of gate moving from index to load/store
* fixed
* fix gated stores
* fix spec
* better?
* Where after gated load becomes alt value
* cleaner expression
* fix python backend
* remove dead code
2026-05-06 13:47:04 -07:00
chenyu
ef085304bc
stronger divmod_recombine ( #16066 )
2026-05-06 15:41:54 -04:00
qazal
d7d32d82ee
viz/cli: print first uop with DEBUG=6 ( #16065 )
...
* viz/cli: print first uop with DEBUG=6
* rename fmt to emit
* define inst
2026-05-07 03:39:34 +09:00
chenyu
af4140f3be
fix divmod recombine for floordiv ( #16062 )
2026-05-06 14:22:42 -04:00
chenyu
c6ad3d3ac2
better divmod late rewrite ( #16061 )
...
better order
2026-05-06 11:31:48 -04:00
chenyu
aaabe42373
relax fold_divmod_general ( #16058 )
2026-05-05 21:37:56 -04:00
Christopher Milan
1de14cf33a
am: autogen soc ( #16055 )
2026-05-05 20:39:43 -04:00
chenyu
869eae6b37
fix double div rewrites ( #16054 )
2026-05-05 19:34:35 -04:00
Christopher Milan
bd06ea9f97
am: simplify import_module ( #16046 )
2026-05-05 19:25:53 -04:00
qazal
795501e1da
fix device in null graph events ( #16053 )
...
* failing test
* fix compute
* fix sdma
2026-05-06 07:44:08 +09:00
wozeparrot
ab6218bc92
llama mp fixes ( #16050 )
2026-05-05 15:35:32 -07:00
chenyu
34fe37d64e
use FLOORDIV and FLOORMOD ( #16048 )
...
* use FLOORDIV and FLOORMOD
also removed CORRECT_DIVMOD_FOLDING
* fix
* Revert "fix"
This reverts commit 86af33b88ef31943c61e67189b072eca4896409a.
* fix
* fix
2026-05-05 18:32:54 -04:00
Christopher Milan
76ff378007
autogen: fewer apt dependencies ( #16049 )
2026-05-05 17:22:41 -04:00
nimlgen
5fa0016ffc
supports_exec_item -> supports_uop ( #16033 )
2026-05-05 22:41:13 +03:00
qazal
cee17e0d2f
viz: fix diff color ( #16045 )
2026-05-06 03:40:53 +09:00
chenyu
9c37a0c75d
Ops.FLOORDIV and Ops.FLOORMOD ( #16038 )
...
* Ops.FLOORDIV and Ops.FLOORMOD
lowered into IDIV and MOD in get_late_rewrite_patterns
* still need this
* exclude
* like that?
2026-05-05 11:42:14 -04:00
qazal
d79bf356c2
viz: add CALL -> codegen link ( #16044 )
...
* work
* cleaner
* details
* rm
2026-05-05 23:34:44 +09:00
Christopher Milan
1c8cb0769a
am: autogen asic_regs ( #16004 )
2026-05-04 22:52:07 -04:00
George Hotz
26406bed83
amd uses .valid, not index src valid ( #16042 )
2026-05-04 18:35:15 -07:00
chenyu
a357a0449a
Tensor.div cleanup ( #16041 )
2026-05-04 19:27:36 -04:00
nimlgen
5b4f62519d
cache buffer_views as well ( #16039 )
...
* cache buffer_views as well
* reuse
* back
* x
2026-05-05 00:00:09 +03:00
Christopher Milan
8e99c4f097
fetch checks sha256 ( #16037 )
2026-05-04 16:08:38 -04:00
George Hotz
1884f67a39
simplify full_rewrite_to_sink spec ( #16035 )
...
* simplify full_rewrite_to_sink spec
* test cleanups
2026-05-04 11:44:13 -07:00
chenyu
a4fccd23b2
remove kwargs in UOp.vectorize [pr] ( #16034 )
2026-05-04 12:46:38 -04:00
qazal
b1d88ebf02
viz/cli: aggregate flops in -t ( #16031 )
...
* 38
* plumbing
* more flops
* flop/s and bytes/s
* arithmetic mean
* tests
* harmonic mean
* range
* better
* simplify
* fix prints
* no string parsing needed
2026-05-04 17:35:02 +03:00
qazal
c02e390c2b
viz: encode flops, mem and metadata in json ( #16032 )
...
* gate print
* update everywhere to check path
* server encodes json
* ui changes
* cli changes
* tests never need regex
* no str replace
* update test_pipes
* remove that
2026-05-04 23:06:18 +09:00
bigyoshi
4024d8438f
runtime/graph: avoid core_id runtimevar merge conflicts ( #16026 )
...
Co-authored-by: bigyoshi51 <269989564+bigyoshi51@users.noreply.github.com>
2026-05-03 19:16:02 +03:00
qazal
9684334dfe
viz: fix flops in graph, add null graph tracing ( #16024 )
...
* min repro, todos
* null graph tracing
* work
* work
* work
* only test_flops
* exec points back
* first
* better
* integral timestamps maybe
* cleanup
* simpler, update NULL to use SDMA naming
* integration test
* sdma
2026-05-03 22:32:44 +09:00
wozeparrot
419d525553
feat: handle multioutput kernel grads ( #16028 )
2026-05-02 22:31:45 -07:00
mefengl
9717d3a3a2
hotfix: prepend LD_LIBRARY_PATH to DLL posix search dirs ( #16023 )
2026-05-02 20:45:19 +03:00
qazal
7daf4b7d52
viz: split cli test ( #16015 )
...
* viz: split cli test
* arg3 is msg
2026-05-03 01:47:11 +09:00
nimlgen
d65b8ca25f
jit: remove *input_list from the graph sources ( #16021 )
2026-05-02 14:42:47 +03:00
qazal
7dae9e6f7f
viz: keep VIZ.value = 0 during python shutdown, cleanup launch ( #16022 )
...
* viz: keep VIZ.value = 0 during python shutdown, cleaner execv
* rm
2026-05-02 20:35:53 +09:00
Christopher Milan
637bdd5530
am: only support CDNA3/4 and RDNA3/4 ( #16017 )
2026-05-02 00:02:14 -04:00
George Hotz
4a2e1f1076
STORE doesn't have ranges anymore ( #16019 )
...
* STORE doesn't have ranges anymore
* fix
2026-05-01 15:00:27 -07:00
chenyu
0bffbc5f8a
onnx fmod uses fmod ( #16018 )
2026-05-01 16:47:11 -04:00
chenyu
782d1ff80f
Tensor.fmod ( #16014 )
...
c-style mod matches torch
2026-05-01 16:02:18 -04:00
nimlgen
1079441332
revoke bus master ( #16007 )
2026-05-01 18:00:01 +03:00
qazal
8b147a9ed5
minimal repro for llama copies 2 ( #16011 )
2026-05-01 22:23:47 +09:00
qazal
a29dd7b19b
Revert "cleanup: untrack wait Metal buffers ( #15954 )" ( #16010 )
...
* Revert "cleanup: untrack wait Metal buffers (#15954 )"
This reverts commit 5eb1fd5d3c .
* regression test fixes
2026-05-01 21:18:19 +09:00
qazal
65879fe1b7
metal synchronize regression test ( #16008 )
...
* add test for metal wait=True
* add self.assertRaises
2026-05-01 20:10:57 +09:00
nimlgen
f6d92b55e6
am: use per pipe reset for gfx11+ ( #16006 )
2026-05-01 12:56:43 +03:00
Christopher Milan
cee73becbe
am: ip offsets in autogen ( #16003 )
2026-05-01 00:13:52 -04:00
George Hotz
4506688285
split render to render.py ( #16002 )
...
* split render to render.py
* move more print
2026-04-30 19:41:14 -07:00
George Hotz
d651b4bbf0
SPEC=3 checks the shape ( #16001 )
...
* SPEC=3 checks the shape
* buffer view
* Revert "buffer view"
This reverts commit ffd87889a9 .
* buffer view hack
* fix ptx
2026-04-30 18:41:37 -07:00
wozeparrot
528d35e306
llama speed 4 ( #15993 )
2026-04-30 17:14:41 -07:00
George Hotz
45fd7a3668
lil_image vectorize ( #16000 )
...
* lil_image vectorize
* 0 pitch on height 1
* Revert "0 pitch on height 1"
This reverts commit 58a83e6622 .
2026-04-30 16:12:43 -07:00
wozeparrot
eddcd4723b
am_smi throttle info ( #15997 )
2026-04-30 15:28:32 -07:00
chenyu
52c92e15ae
no replacement multinomial ( #15995 )
...
* no replacement multinomial
Efraimidis–Spirakis
* num_samples == 1 can use fast path
2026-04-30 17:35:26 -04:00
chenyu
e0b09f288f
input validation for rand functions ( #15990 )
2026-04-30 14:00:44 -04:00
nimlgen
11e1a2b89f
cleaner and faster run_linear ( #15987 )
...
* cleaner and faster run_linear
* x
* assert for now
* x
* x
* sym_infer
* remove sink
2026-04-30 20:15:22 +03:00
qazal
58b34e71bd
failing test for llama useless copies ( #15989 )
2026-05-01 00:55:29 +09:00
George Hotz
0f7e296f5b
fix some indexing edge cases ( #15988 )
2026-04-30 08:05:30 -07:00
nimlgen
6f8b10d251
remove base Runner ( #15986 )
...
* remove base Runner
* linters
2026-04-30 13:04:55 +03:00
George Hotz
46a36a838a
small dtype shapes fixups ( #15984 )
2026-04-29 19:40:38 -07:00
chenyu
b73248958a
minor rand cleanups ( #15982 )
2026-04-29 22:22:29 -04:00
chenyu
53a28bafbd
rand device seed to its own function ( #15979 )
2026-04-29 17:21:40 -04:00
Christopher Milan
d07741f1d7
am: look for firmware in /lib/firmware/amdgpu ( #15974 )
2026-04-29 17:15:09 -04:00
nimlgen
c73e667fc0
remove if for precompiled programs ( #15980 )
2026-04-29 23:43:36 +03:00
qazal
55915584e5
viz: fix cfg for emulated amd on the null device ( #15976 )
...
* simple failing when i test it end to end
* pass
* linter
* assemble
2026-04-30 05:18:09 +09:00
nimlgen
dfd2d07005
remove CompiledRunner ( #15970 )
...
* rm usage of CompiledRunner
* more tests
* last
* linter
* sink
* remove
* linter
2026-04-29 22:45:48 +03:00
wozeparrot
0080489abe
llama: use env vars ( #15978 )
2026-04-29 12:37:15 -07:00
qazal
a37b605523
remove arch from asm kernel class ( #15977 )
...
* rm arch from kernel
* update other tests
* update abstractions4.py
2026-04-30 03:39:52 +09:00
Christopher Milan
7a79c2948a
DEV visible device filter supports hyphenated syntax ( #15971 )
2026-04-29 14:02:21 -04:00
Christopher Milan
6b9a45568c
autogen: better version handling for llvm and libclang ( #15975 )
2026-04-29 14:01:33 -04:00
chenyu
654e611a29
_bits_to_rand to mixin ( #15972 )
2026-04-29 13:47:25 -04:00
George Hotz
5f441ecffc
unify reduce + reduce_axis ( #15973 )
...
* unify reduce + reduce_axis
* fix all tests
* lil cleanups
2026-04-29 10:29:56 -07:00
qazal
b63e0a5f74
viz/sqtt: move amd decoder to extra, don't import from ops_amd ( #15969 )
...
* don't import from ops_amd
* start
* cleanup
2026-04-30 00:49:15 +09:00
nimlgen
7787f76dcc
get_runner -> get_runtime ( #15967 )
...
* get_runner -> get_runtime
* do not use get_runner
* fix
* remove get_tunner
* remove
* fix
* x
2026-04-29 18:29:49 +03:00
chenyu
fb188c3c23
UOp.bitcast noop early return ( #15968 )
...
matches Tensor
2026-04-29 09:41:40 -04:00
qazal
30403c1e25
viz/cli: merge DEBUG=6 and -i ( #15966 )
...
* print_step contiguous
* merge
2026-04-29 19:52:17 +09:00
qazal
86621e9e7c
gate f32_to_fp8 renderer ( #15964 )
2026-04-29 19:12:46 +09:00
wozeparrot
ef09071073
llama: speed 2 ( #15960 )
2026-04-28 20:44:37 -07:00
Christopher Milan
e6863a1cc5
autogen: fewer type: ignores ( #15956 )
2026-04-28 21:58:13 -04:00
chenyu
836af56513
some RandMixin cleanup ( #15961 )
...
cleaner to just put inside OpMixin
2026-04-28 19:58:02 -04:00
chenyu
c4bea54e9c
_threefry_random_bits to mixin ( #15959 )
...
start RandMixin
2026-04-28 19:13:57 -04:00
George Hotz
796fdf9fd8
end has no shape ( #15958 )
2026-04-28 15:15:48 -07:00
Miguel Villa Floran
b36010c55a
DGX Spark and Jetson Thor support ( #15939 )
2026-04-28 18:08:21 -04:00
Nino Risteski
5eb1fd5d3c
cleanup: untrack wait Metal buffers ( #15954 )
2026-04-28 12:54:59 -07:00
nimlgen
77965a22e5
local optimize as rewrite ( #15953 )
...
* local optimize as rewrite
* better
* x
* slighly rename
* fix
* ugh
* remove
* x
* remove
* not weak
2026-04-28 22:51:04 +03:00
qazal
b3f0f8d349
llama: fix missing label_smoothing arg ( #15955 )
2026-04-29 02:12:14 +09:00
wozeparrot
5e861cd2c4
llama: move llama kernels to llama_kernels ( #15952 )
2026-04-27 22:48:53 -07:00
Christopher Milan
987b6dd193
python -m tinygrad.device prints interface info ( #15950 )
2026-04-27 22:15:38 -04:00
qazal
54f00e1013
sqtt: correct rdna4 structs ( #15948 )
2026-04-28 07:35:50 +09:00
Charlie Kerfoot
890d7be0c3
fix: muon not using device ( #15936 )
2026-04-27 14:56:48 -07:00
qazal
c58fd85a99
sqtt: add needs_rocprof decorator ( #15947 )
...
* sqtt: add needs_rocprof decorator
* version string
2026-04-28 06:22:50 +09:00
Christopher Milan
3f508810d8
cpu: lowercase arch ( #15943 )
2026-04-27 17:05:25 -04:00
chenyu
77f9125c21
move Tensor.pad to OpMixin ( #15946 )
2026-04-27 16:56:04 -04:00
nimlgen
4164666c72
programinfo ( #15942 )
...
* programinfo
* fix
* m
* x
* x
* changes
* x
* fix
* rm
2026-04-27 23:12:03 +03:00
chenyu
fe38d6de94
_pad_circular and _pad_reflect_replicate to mixin ( #15944 )
2026-04-27 16:07:05 -04:00
qazal
8c174bdad4
viz/sqtt: correct exec pipes ( #15885 )
...
* wmma
* p2
* test
* left
* work
* pickle
* handwritten failing tests
* start work
* test the pipes
* empirical evidence
* update rdna4 enum types
* VALU pipe 1
* TRANSCENDENTAL pipe
* transcendental function units
* reorder
* wmma pipe
* cleanup and notes
* smaller
* work
* diff cleanup
* pickle
* use se:1
* int
2026-04-28 05:05:49 +09:00
qazal
eeb8d5eb0c
viz: small ui changes ( #15940 )
...
* rename colors
* keep ctrl c
2026-04-27 04:00:13 +09:00
nimlgen
96165ff0d1
validate_with_cpu as rewrite ( #15938 )
...
* validate_with_cpu as rewrite
* compil
* x
* linter
* moved
* fix
2026-04-26 19:58:53 +03:00
nimlgen
117e9e22dd
estimates from graph ( #15937 )
...
* estimates from graph
* test
* x
2026-04-26 18:22:53 +03:00
chenyu
e9983e3516
remove unused QCOMTextureInfo, QueueType [pr] ( #15935 )
2026-04-25 14:32:31 -04:00
nimlgen
ac3494a7cc
remove some runners ( #15934 )
...
* remove runners
* mypy
2026-04-25 21:27:05 +03:00
nimlgen
bb652352c7
remove execitem ( #15932 )
...
* remove execitem
* f
* x
2026-04-25 19:33:04 +03:00
chenyu
e27444a0ff
remove unused UOp.shard_size [pr] ( #15933 )
2026-04-25 12:27:58 -04:00
nimlgen
e0ff6cc15c
remove old schedule ( #15930 )
...
* remove old schedule
* tests
* r
* x
2026-04-25 16:46:36 +03:00
qazal
9a23de7d27
viz/cli: unify profile and rewrites, -s ALL default ( #15931 )
...
* work
* workg
* better
* cleanup
* better defaults
* --ls
* better
* work
* update llama
* update
2026-04-25 22:31:24 +09:00
nimlgen
768106a542
remove schedule from extra/docs/examples ( #15929 )
...
* remove schedule from extra/docs/examples
* f
2026-04-25 14:09:12 +03:00
nimlgen
a5e9ea7a60
remove schedule batch 4 ( #15927 )
...
* remove schedule batch 4
* fini
2026-04-25 12:36:55 +03:00
nimlgen
d2ab6ea7a6
remove schedule batch 3 ( #15924 )
...
* remove shcedule batch 3
* batch 6
* batch 7
2026-04-25 11:53:16 +03:00
nimlgen
3c8a2db870
remove schedule() from tests batch 2 ( #15923 )
...
* remove schedule() from tests batch 2
* batch 4
2026-04-25 10:44:41 +03:00
Denys Melnyk
1fdcb13bfb
webgpu: fix weight lookup in export_model after compile_net key change ( #15919 )
...
* fix lookup site in export_model_webgpu after refactoring
webgpu (sd): fix export_model weight lookup after compile_net changes
fix lookup site in export_model_webgpu after refactoring
* add regression test
2026-04-25 10:04:55 +03:00
Christopher Milan
8b2826ef16
nv: fix shader local memory for NAK ( #15921 )
2026-04-25 01:03:11 -04:00
Christopher Milan
57fbaa3d49
amd: fallback to llvm when comgr is not available ( #15914 )
2026-04-24 23:30:16 -04:00
wozeparrot
4b908b6e2c
llama: fused ce loss ( #15920 )
2026-04-24 20:01:24 -07:00
nimlgen
d3378010ee
schedule() -> schedule_linear() in tests (batch 1) ( #15915 )
...
* schedule_with_vars -> linear_with_vars in tests
* tests batch 1
* batch 2
* estimate_uop
* simpler
* rm
2026-04-24 23:40:53 +03:00
chenyu
b501ba3e42
nll_loss to mixin ( #15918 )
2026-04-24 15:50:31 -04:00
chenyu
2f9fdb4a37
scatter to mixin ( #15917 )
2026-04-24 15:37:37 -04:00
nimlgen
f2751955cb
remove linear_to_schedule from tests ( #15912 )
...
* remove linear_to_schedule from tests
* x
2026-04-24 20:02:10 +03:00
nimlgen
56a9f1e3ff
remove last jit_cahce ( #15911 )
...
* remove last jit_cahce
* linter
2026-04-24 19:44:52 +03:00
chenyu
03a7604f76
sort argsort topk allclose to mixin ( #15910 )
2026-04-24 10:20:46 -04:00
nimlgen
4010aa4044
jit: no jit_cache in graphrunner ( #15907 )
...
* jit: no jit_cache in graphrunner
* m
2026-04-24 16:34:26 +03:00
chenyu
7a1adfd2aa
update Tensor.allclose to return Tensor ( #15904 )
...
matches jax
2026-04-24 08:27:17 -04:00
Eitan Turok
48d7ab2695
no uv.lock ( #15893 )
2026-04-24 20:07:07 +08:00
qazal
5eb641395a
viz/cli: select kernel events in -s DEV ( #15909 )
...
* simple test
* pass
2026-04-24 21:03:34 +09:00
nimlgen
c0f77c2e1c
hcq graph to linear ( #15888 )
...
* hcq
* f
* f
* linter
2026-04-24 12:42:49 +03:00
Christopher Milan
cbf4946ea6
usb: multiple gpus and better error messages ( #15900 )
2026-04-24 01:57:19 -04:00
wozeparrot
9d134a2848
llama: fix fakedata timing ( #15905 )
2026-04-23 21:37:03 -07:00
b1tg
aab50d1bca
llm: dedup MLA cache_v ( #15887 )
2026-04-24 12:32:10 +08:00
qazal
f379b5a40a
sqtt: match amd's TS_DELTA_SHORT offset ( #15901 )
2026-04-24 06:41:22 +03:00
chenyu
c24da99d56
avg_pool2d, max_pool2d to mixin ( #15903 )
...
* avg_pool2d, max_pool2d to mixin
* fix
* just dtype
* that
2026-04-23 23:36:17 -04:00
chenyu
08d9106c9f
scatter_reduce and sparse_categorical_crossentropy to mixin ( #15902 )
...
also use `.ne` to fix `# type: ignore[comparison-overlap]`
2026-04-23 21:06:36 -04:00
chenyu
8cc2c69e21
fix isclose mixin ( #15898 )
...
use `.eq` instead of `==`
2026-04-23 20:40:43 -04:00
nimlgen
3072862e2c
metal to linear ( #15884 )
...
* metal to linear
* x
* x
* fix
2026-04-23 23:32:22 +03:00
chenyu
782bc6aece
broadcast in ElementwiseMixin.div [pr] ( #15897 )
2026-04-23 16:02:43 -04:00
qazal
7745e05a2f
sqtt: update wave end packet names ( #15896 )
...
* sqtt: update wave end packet names
* update wavestart and emu
2026-04-24 04:21:22 +09:00
qazal
ee7644932b
viz/cli: -t default number ( #15894 )
...
* viz/cli: accept one path argument
* -t default
* hm
* only the -t change
2026-04-24 04:13:16 +09:00
chenyu
11c197955b
interpolate and cross_entropy to mixin ( #15895 )
2026-04-23 14:59:45 -04:00
chenyu
f0dbc68aa9
gather to mixin ( #15891 )
2026-04-23 14:00:57 -04:00
chenyu
87223f870e
logcumsumexp, argmax, argmin, sequential to mixin ( #15890 )
2026-04-23 12:10:42 -04:00
nimlgen
5cf4ad2fb6
fix resolve param ( #15889 )
2026-04-23 17:41:44 +03:00
nimlgen
e4696185bd
cleaner cuda graph ( #15886 )
2026-04-23 16:34:29 +03:00
wozeparrot
d3cbd781d9
llama: use fused norm mul quantize for w13 ( #15878 )
2026-04-22 21:27:41 -07:00
George Hotz
0c3260d5d9
rename VECTORIZE to STACK ( #15880 )
2026-04-23 10:43:42 +08:00
chenyu
7c9bc29e44
Tensor method raise if arg is on different device ( #15879 )
...
instead of implicit `to`. this matches torch
2026-04-22 22:20:22 -04:00
chenyu
1fc4b3788c
cummax/cummin to mixin ( #15877 )
2026-04-22 21:25:39 -04:00
chenyu
684e95e1d4
UOp binary op broadcasts dtype ( #15875 )
...
* UOp binary op broadcasts dtype
matches Tensor
* fix
* fix?
2026-04-22 20:37:19 -04:00
Christopher Milan
b0dc95a390
AMX in arch, better docs ( #15871 )
2026-04-22 17:25:18 -04:00
nimlgen
e5891acab2
jit: precompile ( #15848 )
...
* x
* jit: precompile as sep step
* x
* s
* x
* x
* x
* ?
* ?
* x
* x
* viz
* f
* x
* u
* x
* x
2026-04-23 00:23:32 +03:00
chenyu
b9e2bc619e
simplify bool.cast() != const ( #15874 )
2026-04-22 17:08:09 -04:00
nimlgen
2041945f4b
cuda graph to linear ( #15870 )
...
* cuda graph to linear
* fix
* keep as old for now
* x
* x
2026-04-22 23:39:58 +03:00
chenyu
e9ebd03e86
update reduce_to_acc index dtype [pr] ( #15873 )
...
index arg should have weakint dtype
2026-04-22 16:25:50 -04:00
chenyu
3c8daa9a75
update test_where_removal ( #15872 )
...
don't use UOp.ufix for const_like, it will broadcast dtype soon
2026-04-22 14:56:37 -04:00
George Hotz
09ff3e1883
hotfix: add bytes back to llm
2026-04-23 00:46:27 +08:00
b1tg
af93a677ae
llm: glm 4.5 air ( #15771 )
...
* llm: glm 4.5 air
* clean
* clean
* remove gguf_size
2026-04-22 22:47:37 +08:00
qazal
719a7bdac5
viz: respect optional estimates in kernel info ( #15867 )
...
* simple failing test
* unpack kernel info
2026-04-22 14:24:48 +03:00
George Hotz
2d7fa58e61
fix shapes to match vecless ( #15866 )
...
* fix shapes
* need to simplify shapes
2026-04-22 18:27:46 +08:00
qazal
de8f58899e
move elf assembler to renderer ( #15855 )
...
* move elf assembler to renderer
* other
2026-04-22 19:00:36 +09:00
George Hotz
d4c344b7fd
hotfix: keep VCONST exclude in viz
2026-04-22 15:54:24 +08:00
wozeparrot
87378331e8
llama: fused mul quantize fp8 ( #15863 )
2026-04-21 20:58:37 -07:00
George Hotz
0560fa7b0f
add shape to range/special ( #15862 )
2026-04-22 11:15:02 +08:00
chenyu
3821e442eb
_one_hot_along_dim and one_hot to mixin ( #15861 )
2026-04-21 20:24:38 -04:00
chenyu
f911a63a6b
don't allow negative num_classes in one_hot ( #15859 )
...
no auto infer num_classes, matches jax
2026-04-21 19:39:29 -04:00
Christopher Milan
697e7aa819
MOCK+AMD and MOCK+NV interfaces ( #15858 )
...
MOCK+AMD is an alias for MOCKKFD+AMD, MOCKNVK+NV is renamed to MOCK+NV
2026-04-21 18:22:16 -04:00
chenyu
75ee51a446
triu tril _tri to mixin ( #15857 )
2026-04-21 17:10:55 -04:00
qazal
e36ff22538
fix dev syntax in emulated amd tests, skip test_tk ( #15856 )
...
* fix dev syntax in emulated amd tests
* skip test_tk
2026-04-21 23:47:29 +03:00
Christopher Milan
99a0debd62
Device.count() ( #15842 )
2026-04-21 16:46:38 -04:00
chenyu
1946ae8b51
linspace and eye to mixin ( #15854 )
2026-04-21 15:58:03 -04:00
qazal
0fbe0a6a99
viz/cli: ux tweaks ( #15853 )
...
* viz/cli: rename to --json
* st_ms, end confuses kimi
* remove pickle spam
* better
* comment
2026-04-21 22:18:27 +03:00
chenyu
86ceb3bd6b
arange to mixin ( #15852 )
2026-04-21 13:00:19 -04:00
chenyu
420e4c4673
zeros, ones, invalids to mixin ( #15850 )
2026-04-21 11:53:08 -04:00
chenyu
9192c93b7e
Tensor.invalid -> Tesnor.invalids ( #15849 )
...
matches ones and zeros, and to not share name with UOp.invalid
2026-04-21 11:19:51 -04:00
nimlgen
bfe28ee2ad
rm run_schedule ( #15847 )
2026-04-21 18:14:30 +03:00
chenyu
d08b5d0a3b
full to mixin ( #15840 )
...
with unique_const
2026-04-21 10:53:43 -04:00
nimlgen
ae9b84d32f
rm beam uop ( #15844 )
2026-04-21 13:10:26 +03:00
nimlgen
01ac1c8c15
remove all run_schedule from tests ( #15846 )
2026-04-21 12:02:10 +03:00
qazal
f9655af2a3
viz/cli: move to tinygrad ( #15835 )
...
* move cli
* update imports
* cleanup the readme
* edit
* work
* details
* python -m tinygrad.viz.cli
* do not execv in non tty
* option
* lint
* simpler
* gemm pmc
2026-04-21 13:35:10 +09:00
Christopher Milan
1a8ba4cbd6
CPU renderers use arch ( #15839 )
2026-04-20 23:38:29 -04:00
chenyu
cabc347066
conv2d and conv_transpose2d to mixin ( #15838 )
...
* conv2d and conv_transpose2d to mixin
* cleanup
2026-04-20 18:10:06 -04:00
nimlgen
b8d3bf8970
run_linear in jit ( #15827 )
...
* run_linear in jit
* x
* x
* f
* casts
* ugh
* f
* x
* x
* simple
2026-04-20 23:03:30 +03:00
chenyu
e00cc8ae5e
split Tensor._conv2d_winograd ( #15837 )
2026-04-20 15:19:33 -04:00
chenyu
667b30b974
tensor pad arg cleanups ( #15836 )
2026-04-20 15:03:09 -04:00
chenyu
8eeb77a905
flat_to_grouped and resolve_pool_pads to helpers ( #15834 )
2026-04-20 14:03:35 -04:00
chenyu
b01704444b
einsum to ReduceMixin ( #15833 )
2026-04-20 11:49:24 -04:00
chenyu
3a557016cb
delete UOp.get_consumer_map [pr] ( #15832 )
...
not used
2026-04-20 10:57:42 -04:00
chenyu
04e8dbd7f8
remove getitem check in get_shape ( #15830 )
...
not needed
2026-04-20 10:40:46 -04:00
chenyu
72ecc61ca8
use more UOp method [pr] ( #15821 )
...
instead of constructing UOp directly
2026-04-20 09:17:56 -04:00
qazal
601b9d3f59
viz/cli: dedup DEBUG=3 pyrender ( #15826 )
2026-04-20 19:29:09 +09:00
ayanhan
80c7327e0f
resolve Metal ARC FIXME with explanation comment ( #13688 )
2026-04-20 17:10:37 +08:00
nimlgen
c0d7135b5f
do not use jit_cache in test ( #15823 )
...
* do not use jit_cache in test
* fix
2026-04-20 11:45:17 +03:00
George Hotz
5819c0abed
fix gc in gguf ( #15820 )
...
* fix gc in gguf
* fix mypy
2026-04-20 10:15:03 +08:00
George Hotz
67ed4c4eb3
move gguf stuff from nn/state.py to llm/gguf.py ( #15783 )
...
* move gguf stuff from nn/state.py to llm/gguf.py
* docs
2026-04-20 09:41:43 +08:00
chenyu
538841d1f2
remove_tags and _remove_all_tags are the same [pr] ( #15819 )
...
also other small UOp method cleanups
2026-04-19 21:37:49 -04:00
Kartik Vashishta
a1696e8413
objc: fix _classmethods_ dispatch flag ( #14854 )
...
* objc: fix _classmethods_ dispatch flag
* test: add objc _classmethods_ regression
2026-04-20 09:35:03 +08:00
oxrinz
f551a4bded
add threefry const folding ( #15787 )
...
* prim threefry
* test fix
* clean test
* cleanup
* cleanup 2
* cleanup 3
* fix conflict markers in test_const_folding.py
* update test
* fix lint
* use const instead of value for test
2026-04-20 09:30:03 +08:00
qazal
b05b1010bf
viz/cli: ux cleanups, show user python ( #15817 )
...
* small fixes
* print python trace
* jsonl
* cleanup fmt, fix tqdm
* print mode
* types
* less
* keep those
* fix
* everyone can print json
* pmc p2
2026-04-20 03:50:48 +03:00
chenyu
8b87b3522a
more UOp empty cleanups [pr] ( #15818 )
2026-04-19 19:48:36 -04:00
chenyu
2a5a6236ac
UOp.empty and UOp.empty_like ( #15816 )
...
* UOp.empty and UOp.empty_like
Tensor.empty and Tensor.empty_like use these, and removed _buffer_like
* import line
2026-04-19 16:01:01 -04:00
qazal
c6d8753ee1
viz/cli: --json support, refine docs ( #15528 )
...
* refine
* remove
* refine
* keep
* need to say this
* back
* feedback
* feedback
* json
* dur_ms
* et_ms
* remove useless thing
* docs
* respect NO_COLOR
* DEBUG also produces valid json
2026-04-19 21:53:38 +03:00
chenyu
50a7b82372
merge untag_and_append and append_after [pr] ( #15815 )
...
reads cleaner
2026-04-19 13:13:26 -04:00
chenyu
cace07c87a
clean up untag_and_append [pr] ( #15812 )
...
replace_uop does not change, and ret.op is always AFTER
2026-04-19 11:23:59 -04:00
wozeparrot
f28ea84de2
llama: fused silu fp8 amax ( #15798 )
...
* llama: combined w13
* llama: fused swiglu+fp8
* llama: fix amax interleaving
* llama: don't need seperate matmul
2026-04-19 12:03:55 +08:00
chenyu
5bdfd4883f
update test_assign ( #15809 )
...
clean up old skips and update tests
2026-04-18 21:25:44 -04:00
nimlgen
022d8c4a11
remove jit_cache usage in extra/examples ( #15808 )
...
* remove jit_cache usage in extra/examples
* cached
2026-04-18 23:00:18 +03:00
wozeparrot
06343092c8
llama: combined w13 ( #15803 )
2026-04-17 22:27:31 -07:00
Christopher Milan
6adf4c3cd9
MOCKGPU interfaces ( #15796 )
2026-04-17 21:56:29 -04:00
chenyu
8da308573f
update test_assign_changes_alt with clone ( #15802 )
2026-04-17 20:17:37 -04:00
qazal
2581985532
viz/cli: multi device profiler output, print markers ( #15795 )
...
* yield
* all devices
* better
* add unittests
* markers like this
* profile_markers work
* less
* update README
* tiny and null
2026-04-17 23:40:10 +03:00
chenyu
0191cc73dc
update arange range check ( #15794 )
...
it was not checking negative steps correctly
2026-04-17 16:07:50 -04:00
nimlgen
23ca680a3a
run_linear ( #15784 )
...
* run_linear try 2
* x
* f
* tests
* ctx, cleaner
* r
* x
2026-04-17 22:44:16 +03:00
qazal
8fcaaede9a
fix root cause of TestVizIntegration.test_link_sched_codegen flakiness ( #15793 )
2026-04-17 20:31:52 +03:00
googlefan256
482c8c1ec8
Fix no module named error ( #15792 )
2026-04-17 19:42:35 +03:00
qazal
a227dbece1
viz/cli: reconstruct DEBUG output ( #15791 )
...
* work
* work
* ext
* padding
* at time
* work
* reorder
* less flags
* num_rows
* feedback
* pmc
2026-04-17 18:27:58 +03:00
qazal
601d137e85
viz: rename to rewrites_data, only use ContextVar ( #15790 )
...
* viz: rename to rewrites_data
* tms also 0
* gt 0
2026-04-17 17:21:51 +03:00
qazal
afc3904e58
viz/cli: unit tests in CI ( #15788 )
...
* simple failing test
* test stdout
* cleanup sqttmap
2026-04-17 22:34:44 +09:00
qazal
9f2a578e26
unskip TestCall.test_call_gemm_uop [pr] ( #15786 )
2026-04-17 16:18:51 +03:00
qazal
7bdb3adbbf
viz/cli: simplification and reordering ( #15785 )
...
* remove
* work
* this is all one thing
* the reorder
2026-04-17 15:16:07 +03:00
George Hotz
e1d13bc4fe
add GGUF IQ4_XS support ( #15766 )
...
* add GGUF IQ4_XS support
* gguf 21
* gguf 21
* use plus
* ggml_common autogen for constant arrays
* fix
* ggml_common in autogen
* inline
2026-04-17 14:43:39 +08:00
wozeparrot
9e60e4a7e7
llama: native fp8 ( #15733 )
2026-04-16 22:16:05 -07:00
George Hotz
a9b6cfece0
refactor llm into files ( #15780 )
...
* refactor llm into files
* chat.html
* tokenizer cleanup
* cleanup
* tests
2026-04-17 12:33:11 +08:00
chenyu
1fac03ce54
softmax and friends to mixin ( #15778 )
...
with detach now
2026-04-16 23:03:37 -04:00
George Hotz
ec00cefa5b
llm is the only app ( #15779 )
...
* tinygrad/llm is the only app
* upd pyproject
* claude refs
* scoping
* min diff
2026-04-17 10:44:48 +08:00
qazal
0e69388f6b
viz/cli: add DEBUG, optional number of rows ( #15777 )
...
* tabulate switch
* support DEBUG
* --top
* improve
* work
* feedback
* 0
* print_kernel both ways
* simplify
2026-04-17 04:36:47 +03:00
chenyu
2d196fb9bb
move Tensor.size to mixin ( #15775 )
2026-04-16 17:56:17 -04:00
Christopher Milan
9f4b7bed25
add pickled jit regression test ( #15774 )
2026-04-16 16:59:09 -04:00
qazal
6d9320ffb3
add NO_COLOR ( #15765 )
...
* NO_COLOR in cli
* add in helpers
* rm flags
* docs
* fix that
* temp
* Revert "temp"
This reverts commit 7522e664f6 .
2026-04-16 22:44:55 +03:00
qazal
12c653a743
remove opts arg in get_program, everything uses opts_to_apply [pr] ( #15767 )
...
* check Ops.BEAM in process replay
* remove opts from the get_program api
* lint
* simplify
* cleanup
2026-04-16 22:42:43 +03:00
chenyu
f0c12a2004
another form of assign to itself ( #15770 )
2026-04-16 15:17:19 -04:00
b1tg
4e88d875ba
llm: glm 4.7 flash ( #15738 )
...
* glm 4.7
* test
* temperature, server enable_thinking
* --no-think
* remove think stuff
2026-04-16 22:42:04 +08:00
chenyu
d147e2a549
update test_nested_after_contiguous_store ( #15763 )
...
add kernel counts and some TODOs
2026-04-16 09:59:26 -04:00
qazal
126cda45f8
viz/cli: cleanups, add memory printer ( #15762 )
...
* simple repro
* use context
* work
* memory printer
* rm
* memory printer
* pylint
2026-04-16 22:44:47 +09:00
George Hotz
f57380cbc2
simplify GatedDeltaNetBlock using two state tensors ( #15704 )
...
* test double after
* simpler ssm
* no double test
2026-04-16 21:14:00 +08:00
nimlgen
c04f3eaa70
jit: capturedjit is linear ( #15743 )
...
* jit: capturedjit is linear
* x
* new beam
* test
* imp
* clean
* spec
* linter
2026-04-16 14:54:39 +03:00
George Hotz
d1cce7a476
put the ranges on store instead of after ( #15759 )
...
* put the ranges on store instead of after
* better assert
* fix stuff
* comment out slow rules i don't understand
* simpler rule
* closer
* return false for store
* fix loop
* only a few schedule failures remain
* remove stores to self
* all tests pass locally
* remove junk
* regression test and fix
* better test, bump broken torch count
* bugfix with regression test
* new fusion is better
2026-04-16 19:06:40 +08:00
George Hotz
d24466c844
CALL with return value is FUNCTION ( #15758 )
...
* CALL with return value is FUNCTION (GPT try)
* cleanups
2026-04-16 13:25:07 +08:00
chenyu
218d6b8988
delete old UOp.size [pr] ( #15756 )
2026-04-15 23:21:00 -04:00
wozeparrot
d090732270
usbgpu: reset endpoint for custom fw ( #15754 )
2026-04-15 20:01:27 -07:00
Muzammil
983a7bb576
exclude __del__ from TRACEMETA wrapping ( #15747 )
...
Session-Id: 019d9234-2531-75a0-a252-f0302cd9931f
2026-04-16 10:49:55 +08:00
chenyu
8bd4fead26
UOp.size -> prod(max_shape) ( #15755 )
...
and more test updates
2026-04-15 22:41:30 -04:00
chenyu
10c262ced8
update tests that use UOp.size ( #15753 )
2026-04-15 21:58:27 -04:00
qazal
96092d110c
fix process_replay Ops.BEAM [pr] ( #15752 )
2026-04-16 07:35:28 +09:00
chenyu
41421c3b48
BUFFER size is their arg ( #15750 )
2026-04-15 18:08:29 -04:00
Christopher Milan
be8005c5dc
DEV: secondary targets ( #15748 )
2026-04-15 17:26:20 -04:00
chenyu
507c02cecb
fix symbolic contiguous_view_offset ( #15749 )
...
* fix symbolic contiguous_view_offset
* flatten
2026-04-15 16:54:38 -04:00
nimlgen
164495678c
test_graph to use uops ( #15746 )
...
* test_graph to use uops
* x
* n
2026-04-15 21:59:41 +03:00
qazal
1f26584b2e
viz/cli: cleanups from linter ( #15745 )
...
* run linter
* pmc
2026-04-16 03:36:24 +09:00
chenyu
7cbfa1896a
comment out unused arm, triton in toml ( #15741 )
...
fixed `PYTHONPATH=. uv run tinygrad/apps/llm.py`
2026-04-15 10:05:19 -04:00
Christopher Milan
1c36878008
DEV: suggest alternatives ( #15732 )
2026-04-14 23:42:32 -04:00
George Hotz
1ae6528bb6
move schedule into schedule ( #15736 )
...
* move schedule into schedule
* callify to root
* sched docs
2026-04-15 11:03:25 +08:00
wozeparrot
3721c60bef
llama: bs 16 ( #15737 )
2026-04-14 19:52:03 -07:00
wozeparrot
480ad264a4
llama: per device amax ( #15735 )
2026-04-14 19:01:17 -07:00
Christopher Milan
adc96cd724
qcom: synchronize for copyin ( #15731 )
...
fixes : #15698
2026-04-14 18:31:15 -04:00
chenyu
3394d18066
size*itemsize -> nbytes ( #15729 )
...
and some UOp.size removal to prep for size to mixin change
2026-04-14 16:27:54 -04:00
nimlgen
e9ecc990ea
amd: add r9700 devid ( #15721 )
2026-04-14 20:15:00 +03:00
George Hotz
2450c8cba8
rename to callify + fix mypy ( #15727 )
...
* rename to callify + fix mypy
* update test
2026-04-14 23:43:19 +08:00
chenyu
528faa18ec
update env_vars.md ( #15722 )
...
remove HCQ_VISIBLE_DEVICES, IMAGE=2 and old DEBUG=3 stuff
2026-04-14 09:13:35 -04:00
George Hotz
359b1582d6
amd: EMU DPP support ( #15719 )
...
* EMU DPP support from GPT 5.4
* cleanups
* simple
* nope
* fix
2026-04-14 14:58:41 +08:00
wozeparrot
2b8d303f75
allreduce in precast dtype ( #15689 )
2026-04-13 20:24:12 -07:00
George Hotz
5683126844
llm: support for tekken tokenizer ( #15720 )
2026-04-14 10:52:07 +08:00
chenyu
70883a6950
cat the stack to mixin ( #15715 )
2026-04-13 18:44:39 -04:00
qazal
355e2729d3
viz: keep program UOp in data ( #15714 )
...
* refactor program uop access
* c.name
2026-04-14 07:04:16 +09:00
qazal
905b8adc97
viz: cli and server cleanups ( #15713 )
...
* update get_profile arg[0]
* uop_to_json arg[0]
* data is standalone in cli
2026-04-14 06:42:29 +09:00
Christopher Milan
d83707ec29
autogen: explicit types ( #15679 )
2026-04-13 16:54:39 -04:00
chenyu
ac41f15fc1
cumsum to mixin ( #15712 )
...
built on top of getitem
2026-04-13 15:06:08 -04:00
nimlgen
eac481b67f
mlx: fix ctypes ( #15711 )
...
* mlx: fix ctypes
* x
2026-04-13 20:43:56 +03:00
nimlgen
b370f5c5ac
hcq: call free for unmap ( #15710 )
2026-04-13 20:30:21 +03:00
chenyu
931d6cc62a
basic getitem to mixin ( #15697 )
...
* basic getitem to mixin
* cleanup
* fix
* cleanup
2026-04-13 13:04:36 -04:00
George Hotz
7610bdc59e
block multistore, it's not supported ( #15708 )
2026-04-13 20:57:59 +08:00
George Hotz
84d64b5835
hotfix: abstractions4 works in mock except asm
2026-04-13 20:57:00 +08:00
George Hotz
16f50a40a5
remove REMU from tree ( #15706 )
...
* no more compare emulators
* remove remu from tree
2026-04-13 20:43:08 +08:00
qazal
ac027055ef
viz: no global state ( #15705 )
...
* start viz data
* get_full_rewrites also moves
* update ref_map
* work
* update consumers
* cleaner cli
* linter
* cleanup tests
* back
* better
* sqtt tests
2026-04-13 21:35:20 +09:00
George Hotz
4c1fb18a09
Revert "Revert "Tests for GatedDeltaNetBlock + fix multi after assign issue (…" ( #15703 )
...
This reverts commit 0cec42db71 .
2026-04-13 19:09:38 +08:00
George Hotz
0cec42db71
Revert "Tests for GatedDeltaNetBlock + fix multi after assign issue ( #15700 )" ( #15702 )
...
This reverts commit 6f5d756282 .
2026-04-13 19:06:44 +08:00
George Hotz
6f5d756282
Tests for GatedDeltaNetBlock + fix multi after assign issue ( #15700 )
...
* broken after/assign test
* test for GatedDeltaNet
* better comments
* fix issue 1 with multi kernel
* fix 2
* fix
* linter
* public api + cleanup
2026-04-13 18:43:23 +08:00
b1tg
2b5ba0095d
qwen3.5 ( #15210 )
...
* qwen3.5
* faster
* or
* rm zero hack
* less float
* T=1
* clean
* clean
* 4b
* rope_dim
* Revert "jit: captures linears, not execitems (#15399 )"
This reverts commit 9656d97d97 .
* DeltaNetBlock
* pairwise_topk
* clean
* Reapply "jit: captures linears, not execitems (#15399 )"
This reverts commit cf3deff53d .
* clean topk, _swiglu
* common
* FFNBlock
* clean
* half
* no mix
* qwen3.5 test
* fix ssm cache invalidation
* TransformerConfig
* SSMConfig
* clean
* reset_state
* llm: reuse server conversation tokens to avoid BPE roundtrip cache miss
* import error
* prefill
* none check
* put it back
* clean pairwise_topk
* symbolic: fold BIND(CONST, CONST) to CONST
* clean
* simpler pm
* _cached_msg_count
* stream decoder; ssm checkpoints
* rm checkpoint
* attn_output_gate
* conflict, attn_output_gate
* clean, less has_ssm, assert
* chunked prefill
* _reset_cache
* _reusable_prefix_len
* revert loop
---------
Co-authored-by: b1tg <b1tg@users.noreply.github.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2026-04-13 15:35:24 +08:00
qazal
2ada38f777
viz: execv after all producers complete ( #15696 )
2026-04-13 08:15:47 +09:00
chenyu
f7ff480fa6
start mixin getitem tests ( #15695 )
...
goal is to make Tensor[idx].uop equal to Tensor.uop[idx]
2026-04-12 18:54:33 -04:00
chenyu
77385ccb37
more trivial stuff to mixin ( #15693 )
2026-04-12 15:17:16 -04:00
chenyu
ff1de5ae13
normalize logsumexp contiguous_backward to mixin ( #15692 )
...
* normalize logsumexp contiguous_backward to mixin
* more
2026-04-12 13:13:00 -04:00
chenyu
0254cfe642
move usum and uprod to mixin ( #15690 )
...
and used it to clean up ops and tensor
2026-04-12 11:42:24 -04:00
nimlgen
e9b2e156b4
add jitbeam to tinygpu docs ( #15691 )
2026-04-12 18:20:26 +03:00
chenyu
e706f408cb
suppress test warnings from numpy ( #15688 )
2026-04-11 22:33:20 -04:00
nimlgen
938cba4fdf
amd: a bit faster usb, skip interrupts on sync ( #15686 )
2026-04-11 17:26:36 +03:00
qazal
054d78e6ff
fix llama profile.sh NULL source ( #15685 )
2026-04-11 22:56:05 +09:00
Graham Robbins
4ca844e96b
add Q1_0 gguf type ( #15683 )
...
* add Q1_0
* better description
* fix trailing whitespace
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2026-04-11 18:17:24 +08:00
George Hotz
5156a04cf5
add support for AM_POWER_LIMIT ( #15684 )
...
* add support for AM_POWER_LIMIT
* level None
2026-04-11 17:14:54 +08:00
wozeparrot
457508d5a0
llama: save more 2 ( #15681 )
2026-04-11 01:03:36 -07:00
George Hotz
29238b772f
AMD USB: support for 0xF3 power toggle
2026-04-11 13:04:38 +08:00
George Hotz
b5a9465b13
llm: add support for moonlight (deepseek MLA) ( #15466 )
...
* add gguf Q5_0
* it works
* rebase
* simpler test
* class
* less diff
* dicts
* normal names
* simplify
* this
* simpler
* work
* work
2026-04-11 10:32:48 +08:00
wozeparrot
590464c8d8
llama: only support wqkv path + cleanups ( #15680 )
...
* llama: only support wqkv path + cleanups
* llama: missing transpose
2026-04-11 07:39:27 +08:00
nimlgen
aa012d6f08
usb: faster custom ( #15678 )
...
* usb: _f0_out_buf for e4 cmd as well
* custom speed
* fast
2026-04-10 23:00:31 +03:00
nimlgen
58646f9569
usb fast copyout ( #15677 )
...
* usb
* fix usb
2026-04-10 21:04:49 +03:00
qazal
0d5cdc9600
viz: split draw loop ( #15676 )
...
* split draw loop
* one draw
* no functions
* inline all highlights
* cleanup
2026-04-10 23:25:50 +09:00
chenyu
e1334d3852
move canonicalize_device to device.py ( #15675 )
2026-04-10 09:43:56 -04:00
chenyu
8e7fcc8ca3
remove _include_initial in _cumalu ( #15674 )
...
handle negative pad in caller
2026-04-10 08:33:30 -04:00
George Hotz
9092f2a8c0
llm: add shared_expert and rope_dim support from qwen35 ( #15673 )
...
* llm: add shared_expert and rope_dim support from qwen35
* refactor into FFNBlock and TransformerBlock
* norms where they belong
2026-04-10 19:18:27 +08:00
b1tg
9ab1415937
llm: fix streaming UTF-8 decode ( #15653 )
2026-04-10 17:01:02 +08:00
wozeparrot
55bcd7cc9e
llama amax outside ( #15670 )
2026-04-09 23:08:03 -07:00
George Hotz
16f3448b26
Add HIP to abstractions4 ( #15672 )
...
* cleanup formatting
* add HIP option
* pass in correct
2026-04-10 14:05:52 +08:00
George Hotz
ed2a72bb23
work on abstractions4 ( #15671 )
...
* work on abstractions4
* works
* offst
* assembly works
* RAND
* cleanup
* work
2026-04-10 13:25:11 +08:00
Christopher Milan
dbc23e8a1b
move HCQ_VISIBLE_DEVICES into DEV ( #15668 )
2026-04-09 22:01:35 -04:00
George Hotz
fa02105546
hotfix: pin amd isa xml version
2026-04-10 06:47:00 +08:00
nimlgen
057dc173ab
beam uop ( #15660 )
...
* beam as uop
* x
2026-04-09 19:13:03 +03:00
nimlgen
0ff30b003d
am: reset queues from spi ( #15664 )
...
* am: reset queues from spi
* move
2026-04-09 18:25:50 +03:00
George Hotz
48a7627b04
add RDNA4 support to copy WMMA ( #15663 )
...
* add RDNA4 supportt to copy WMMA
* simpler
* simpler
* comment
* assert
2026-04-09 22:48:20 +08:00
chenyu
6837881b06
remove same_shape_noop [pr] ( #15662 )
...
no longer used
2026-04-09 09:50:26 -04:00
Christopher Milan
d08c76d9cb
c.Struct cleanup ( #15640 )
2026-04-08 20:07:16 -04:00
qazal
742b3894d7
viz/cli: add pmc printer ( #15651 )
...
* viz/cli: add pmc printer
* cli work
* s
* linter
* pack workgroups
* add : to wgp
* counter name
2026-04-09 08:50:54 +09:00
chenyu
4cf2759fc8
fix merge_reduce_ends ( #15659 )
...
* fix merge_reduce_ends
same range with different nesting should not merge, like cumsum twice should not merge
* skip that
2026-04-08 17:20:01 -04:00
chenyu
cb681da840
move UOp.pad to mixin ( #15657 )
...
the same arg works for Tensor.pad
2026-04-08 13:15:19 -04:00
nimlgen
28b14b0e38
mlx: remove to_be, use helpers ( #15655 )
2026-04-08 20:07:28 +03:00
nimlgen
1b44cb2ac6
split update stat from execitem ( #15654 )
2026-04-08 20:07:12 +03:00
qazal
71c83cc3f6
viz: put OTHER_ on the wave row ( #15650 )
...
* viz: put OTHER_ on the wave row
* update tests
* cleanup cli
2026-04-08 23:13:44 +09:00
chenyu
839d37b7bc
update median_step_time in model_train.py ( #15649 )
...
BENCHMARK=5 used to pick the 4th largest, not the middle one
2026-04-08 09:53:59 -04:00
chenyu
dae9dea903
clean up tensor random functions ( #15648 )
...
* clean up tensor random functions
* revert that
2026-04-08 09:44:37 -04:00
George Hotz
1ebeb52e59
RDNA4 asm gemm ( #15427 )
...
* sqtt: rdna4 decoder work
* diff cleanup
* more diff
* test
* 125
* r4
---------
Co-authored-by: qazal <qazal.software@gmail.com>
Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2026-04-08 21:26:44 +08:00
nimlgen
b1e52ba0c2
the slowest line in hcq graph ( #15635 )
...
* the slowest line in hcq graph
* x
2026-04-08 15:53:52 +03:00
qazal
3ac16b3bea
viz: add wmma row, update exec duration logic ( #15646 )
...
* viz: split wmma to its own row, fix duration logic
* regs
* decrease number of loops, add pickle
* assert overlaps
2026-04-08 20:24:23 +09:00
George Hotz
35e3983840
Add Q5_0, Q5_1, and bfloat16 GGUF types ( #15644 )
2026-04-08 17:16:19 +08:00
qazal
39a029ec55
remove ASM_GEMM context var ( #15645 )
2026-04-08 18:02:40 +09:00
qazal
dc6a51e44d
viz: add # of bytes to sdma ( #15639 )
...
* viz: add # of bytes to sdma
* update test_viz
2026-04-08 17:43:37 +09:00
wozeparrot
70dbd35023
llama: move custom_kernel into flat_llama ( #15643 )
2026-04-08 00:19:14 -07:00
Christopher Milan
bcf6931a4f
fix: comma 4 does not have pcie ( #15642 )
2026-04-07 23:57:03 -04:00
George Hotz
f930579b7a
llm: change the default port to 8000 so you can remember it (match vLLM)
2026-04-08 11:25:38 +08:00
b1tg
bf3763526a
llm: buffer SSE chunks to fix parse errors from split reads ( #15641 )
2026-04-08 10:26:23 +08:00
qazal
a508b8fd2a
viz: delete redundant things ( #15637 )
...
* delete that
* remove
* delete graph config
2026-04-08 07:18:04 +09:00
chenyu
9c6e925b56
move lerp to mixin ( #15634 )
...
last function of math function section
2026-04-07 15:13:00 -04:00
qazal
890286e8d6
update llama profile.sh ( #15633 )
...
* update llama profile.sh
* BENCHMARK 5
2026-04-08 03:18:45 +09:00
nimlgen
b78b384d58
mlx: graph ( #15621 )
...
* Dx
* Dx
* simpler
* mypy
* x
* f
* Dx
* x
* c
* x
2026-04-07 19:43:51 +03:00
qazal
d29f0ef721
viz: speed up profiler first render ( #15632 )
...
* viz: speed up profiler first render
* better comment
2026-04-07 23:07:09 +09:00
George Hotz
d3de63d998
improvements to apps.llm ( #15631 )
2026-04-07 20:34:05 +08:00
George Hotz
2b01ca59dd
USB driver for custom ASM firmware ( #15597 )
...
* USB driver for custom ASM firmware
* timeout
* fix mypy
* pcie mem read
* flip in f/w
* one tx
* litle endian
* autodetect custom
* mock bypass
* lint
* clean
2026-04-07 13:45:41 +08:00
wozeparrot
810d7c00cd
llama: unify scripts ( #15628 )
2026-04-06 20:28:08 -07:00
Christopher Milan
19e96497ee
interface in DEV ( #15620 )
2026-04-06 19:59:28 -04:00
qazal
8ba58304f7
viz: reenable tests ( #15626 )
2026-04-07 07:52:44 +09:00
chenyu
2f7d085450
shared _normalize_indices for getitem ( #15625 )
...
* shared _normalize_indices for getitem
* list
2026-04-06 17:45:36 -04:00
chenyu
66ec188d50
more activations to mixin ( #15624 )
2026-04-06 15:41:41 -04:00
chenyu
1483f7e71c
support shift by Tensor ( #15623 )
...
* support shift by Tensor
* use mixin
2026-04-06 15:14:57 -04:00
chenyu
6e30a5f5ea
update shifts in torch backend ( #15622 )
2026-04-06 14:08:33 -04:00
chenyu
a444be172d
lower fuzz_symbolic_symbolic_div timeout ( #15619 )
...
mitigate timeout crash due to high total time
2026-04-06 12:58:29 -04:00
chenyu
01b49c8647
support int operand for shifts ( #15618 )
...
matches torch/jax, also symbolic rule to remove mask
2026-04-06 12:32:12 -04:00
nimlgen
e2700475cf
mlx: cleaner ( #15617 )
...
* mlx: cleaner
* x
2026-04-06 17:49:47 +03:00
Valtteri Valo
86c4431d74
add gpu_family detection to Metal, target MSL 4.0 on macOS 26+ ( #15079 )
...
use supportsFamily API to detect GPU generation instead of parsing
ICB debug description strings. also adds metal4.0 compiler target.
2026-04-06 06:51:38 +08:00
13Perrius
ff0c941548
remove redundant iteration and toposort in _deepwalk ( #15532 )
2026-04-06 06:38:45 +08:00
Andrew Cappelli
e39cfe685a
validate lr, momentum, weight_decay in optimizers ( #15576 )
2026-04-06 06:37:34 +08:00
nimlgen
6a334ceb27
hotfix: fix bert ( #15613 )
2026-04-05 23:41:21 +03:00
nimlgen
e3986a6b74
mlx: init runtime ( #15612 )
...
* mlx: init
* x
* swap
2026-04-05 22:52:29 +03:00
nimlgen
e0988dbae5
hcq: support non for signal_t and compute_t ( #15611 )
...
* hcq: support non for signal_t and compute_t
* revert
* x
2026-04-05 18:56:47 +03:00
nimlgen
5e134aa087
hcq: add write/poll_bit commands ( #15610 )
...
* hcq: add write/poll_bit commands
* x
2026-04-05 18:09:44 +03:00
nimlgen
604cdbf2f7
am: large allocs aligned to 2mb to use 2mb pages ( #15609 )
2026-04-05 18:01:31 +03:00
qazal
b2d5b29f45
assembly/amd: validate dsl keyword args ( #15608 )
...
* assembly/amd: validate dsl keyword args
* hm, this should use the SOP2 s_waits
* use the sop2 s_waits
2026-04-05 23:00:24 +09:00
qazal
056fcd7758
viz: web work from rdna4 gemm ( #15607 )
...
* add rdna4 barrier
* fix realtime
2026-04-05 19:14:16 +09:00
wozeparrot
7e54992bf6
fp8 llama ( #15588 )
...
Co-authored-by: qazal <qazal.software@gmail.com>
2026-04-04 18:24:57 -07:00
qazal
4d36366717
assembly/amd: match rdna4 hw gidx init in emulator ( #15604 )
...
* simple rdna4 copy kernel with hw fault
* the trivial fix: use ttmp instead of s
* now copy kernel fails in mockgpu
* rm crashing kernel
2026-04-05 02:28:18 +09:00
chenyu
2ba5a6ddc8
remove detach in selu ( #15602 )
...
UOp does not have detach. this does not change behavior
2026-04-04 11:04:29 -04:00
qazal
f7aed180e4
viz/cli: add Other row in profiler ( #15600 )
2026-04-04 22:40:53 +09:00
Christopher Milan
74ecf6d3e6
opaque structs are also c.Struct ( #15596 )
2026-04-03 19:40:43 -04:00
Christopher Milan
645d45d968
DEV has arch ( #15577 )
...
Co-authored-by: Comma Device <device@comma.ai>
2026-04-03 19:17:19 -04:00
nimlgen
902edc3781
hcq: hcqbuf in copy ( #15595 )
2026-04-03 22:47:36 +03:00
nimlgen
2c4271209e
hcq: peer groups for remote ( #15594 )
...
* hcq: set real peer group
* x
* x
* x
2026-04-03 19:03:07 +03:00
chenyu
8fdef2d3e4
mean/std/var to mixin ( #15593 )
2026-04-03 10:42:41 -04:00
qazal
9920b42b5e
hotfix: renderer.target.arch in disasm ( #15592 )
2026-04-03 22:23:51 +09:00
nimlgen
237084b276
remote: support several hosts ( #15585 )
...
* remote: support several hossts
* f
2026-04-03 11:22:15 +03:00
Christopher Milan
0ed8d9271d
Renderers accept Target or nothing ( #15590 )
2026-04-03 01:09:41 -04:00
wozeparrot
3a26920141
feat: framework ci ( #15589 )
2026-04-02 22:03:51 -07:00
Christopher Milan
736fea8412
select_first_inited cleanup and better errors ( #15587 )
2026-04-02 19:27:58 -04:00
Christopher Milan
8c50da800d
[pr] cleanup unused ctx's in codegen ( #15586 )
2026-04-02 19:06:58 -04:00
nimlgen
694dc5a717
install script in benchmark ( #15584 )
2026-04-02 18:15:58 +03:00
nimlgen
046c3f1240
mlx: add loopback with send/recv ( #15583 )
2026-04-02 18:15:46 +03:00
chenyu
c64226e97c
fix CreationMixin doc ( #15582 )
2026-04-02 09:46:28 -04:00
qazal
fefb0ebc2a
gemm/asm: fp8 cleanups ( #15580 )
...
* normal gemm here
* s/dtypes.fp8e4m3/FP8_DTYPE
* gemm_bw
* device UOp stays NULL
2026-04-02 19:02:38 +09:00
chenyu
61bc91aa8c
Tensor cumalu cleanups ( #15579 )
...
* Tensor cumalu cleanups
* happy
2026-04-02 05:23:22 -04:00
chenyu
1aa04eab08
simple CreationMixin ( #15567 )
...
start with full_like, zeros_like, ones_like
2026-04-01 23:00:56 -04:00
wozeparrot
5b2a3251c4
mlperf system json for mi350 ( #15575 )
2026-04-01 15:30:33 -07:00
Christopher Milan
6c67bd4c14
better error message when invalid renderer is specified ( #15573 )
2026-04-01 17:12:55 -04:00
Christopher Milan
0d6fbc2355
remove flaky and redundant image test ( #15574 )
2026-04-01 16:33:13 -04:00
Christopher Milan
20f7f0be8e
nir renderers use arch ( #15556 )
...
* nir renderers use arch
* fix
* fix null
2026-04-01 16:32:51 -04:00
nimlgen
148ad09559
am: do not use dbell for ih ( #15571 )
2026-04-01 21:34:21 +03:00
nimlgen
93a85c7348
am: raise when using more sdma engines ( #15569 )
2026-04-01 21:33:42 +03:00
nimlgen
da12c2ea16
better install msg ( #15570 )
2026-04-01 20:09:37 +03:00
b1tg
20497f2840
fold BIND to CONST when min==max ( #15568 )
2026-04-01 11:19:04 -04:00
qazal
9275f283e5
viz: update flag and display names ( #15566 )
...
* rename to occ, other_simd
* se pkts
* match viz cli tool in names
2026-04-01 21:48:37 +09:00
chenyu
f5c0794df2
fix Tensor.const_like ( #15565 )
...
used to always return a 0-d tensor, now returns an expanded Tensor based on self.shape and matches UOp
2026-04-01 08:35:19 -04:00
qazal
09f60d80fd
llama: fix FP8=1 FAKEDATA=1 ( #15564 )
2026-04-01 20:53:03 +09:00
nimlgen
6d1e992e89
copyout sharded w/o ioring ( #15562 )
...
* copyout sharded w/o ioring
* x
* x
* f
2026-04-01 14:47:29 +03:00
nimlgen
150c456977
add OSError to suppress_finalizing ( #15558 )
2026-04-01 12:33:59 +03:00
chenyu
fc5b94b902
fix UOp.where(const, const) ( #15560 )
...
* fix UOp.where(const, const)
* fix
2026-04-01 05:28:49 -04:00
chenyu
5aeb2273db
add amd_copy_matmul.py to CI ( #15555 )
...
more tests before cleanup
2026-03-31 22:39:18 -04:00
Christopher Milan
034f617971
NVCCRenderer is separate from CUDARenderer ( #15554 )
2026-03-31 21:26:13 -04:00
wozeparrot
8b5b9a0e90
llama: run_and_time ( #15533 )
2026-03-31 15:46:16 -07:00
Christopher Milan
acf239e4d2
specify renderer in DEV, <dev>_<ren>=1 is deprecated ( #15551 )
2026-03-31 18:35:14 -04:00
nimlgen
5181c8e23a
llm: fix nan in kvcache ( #15552 )
2026-04-01 00:38:45 +03:00
nimlgen
3af25ccdb4
docs: minor tinygpu changes ( #15550 )
2026-03-31 21:29:15 +03:00
nimlgen
477d194630
hipcomgr and tinygpu scripts ( #15549 )
2026-03-31 20:07:52 +03:00
nimlgen
83085f103c
tinygpu docs ( #15545 )
...
* tinygpu docs
* x
* x
* fix
2026-03-31 19:49:38 +03:00
nimlgen
ca89215a59
nv: use nvcc over nak by default ( #15547 )
2026-03-31 18:54:56 +03:00
qazal
a15345a53e
viz/cli: improve --help message ( #15546 )
...
* viz/cli: improve --help message
* not the default
* more work
* -s
* respect colored
2026-03-31 22:31:33 +09:00
nimlgen
10d570b3d5
signed tinygpu ( #15541 )
2026-03-31 14:55:09 +03:00
chenyu
4ac2552642
improve ReduceMixin.all ( #15544 )
...
use prod instead of min since `mul` lowered to `and` directly
2026-03-31 07:54:27 -04:00
chenyu
89ec22131a
tests to show double negation in min is not cancelled ( #15543 )
2026-03-31 06:59:13 -04:00
qazal
8feb8edc68
gemm/asm: add fp8 support to cdna asm_gemm ( #15542 )
...
* work
* hmm, mixins
* rhs_transposed
* also fix the dtype
* check for hipcc
* Exception
* select dev
* default
2026-03-31 19:32:54 +09:00
chenyu
2939ae8b22
more mixin ( #15540 )
...
isclose is elementwise, min, any, all to OpMixin
2026-03-31 05:46:55 -04:00
chenyu
e69f5f9f69
more movement methods to mixin ( #15536 )
...
* more movement methods to mixin
* cleanups
2026-03-31 05:16:47 -04:00
nimlgen
ceb63c8c2f
new bundle id ( #15307 )
...
* new bundle id
* new profiles
2026-03-31 12:16:03 +03:00
qazal
467c0af8aa
viz: skip flaky sever tests ( #15538 )
2026-03-31 17:20:30 +09:00
qazal
f88e255cea
gemm/asm: split and parameterize dtype in llama gemm tests ( #15408 )
...
* gemm/asm: more tests for emulator, parameterize llama gemm tests
* bf16 atol
2026-03-31 17:12:44 +09:00
b1tg
a63392a565
llm: pairwise ranking topk for MoE expert selection ( #15499 )
2026-03-31 12:46:39 +08:00
wozeparrot
79cccf3003
write sz output to file ( #15534 )
2026-03-30 20:16:17 -07:00
Christopher Milan
6fb038d109
replace CompilerSet with list ( #15530 )
...
* replace CompilerSet with list
* oops
* default Renderer list
2026-03-30 23:07:52 -04:00
qazal
bc866a93f0
viz: rename exec to sqtt ( #15527 )
...
* viz: rename exec to sqtt
* more
2026-03-31 08:06:51 +09:00
Christopher Milan
adbfd82d1d
DEV is ContextVar, setting Device.DEFAULT is deprecated ( #15508 )
2026-03-30 17:10:49 -04:00
nimlgen
9583489068
add mlx driver to extra ( #15526 )
...
* mlx driver
* x
* simpler
2026-03-30 20:28:49 +03:00
qazal
ad6347f6d8
sqtt: allow mapping sopk to IMMEDIATE packets ( #15525 )
...
* work
* with s_waitcnt
* with the sopp variants, increase threads
* remove that
* sdst=NULL produces IMMEDIATE, otherwise is SALU
2026-03-30 23:12:17 +09:00
chenyu
301b2cea57
move matmul to mixin ( #15524 )
2026-03-30 07:39:09 -04:00
chenyu
f0eaac4235
reduce mixin ( #15523 )
2026-03-30 05:23:58 -04:00
chenyu
f485d0b664
UOp.sum -> usum, prod -> uprod [pr] ( #15522 )
...
rename to prep reduce mixin
2026-03-29 04:51:55 -04:00
qazal
36a925e2a2
viz: color wmma, one color map for cli and web ( #15519 )
...
* viz: color wmma, one color map for cli and web
* op_type
* like uops
* mypy cli
2026-03-29 04:53:01 +09:00
wozeparrot
0c3e438229
llama: mllog ( #15502 )
2026-03-28 11:18:25 -07:00
nimlgen
7e57e101d5
better oor message in profiles ( #15516 )
...
* better oor message
* x
2026-03-28 20:25:07 +03:00
qazal
266fb07721
viz: show exec duration ( #15484 )
...
* duration
* handwritten tests
* rdna3 pickle
* rdna4 pickle
* asserts
* rm that
* wmma work
* r4
* this shows the overlap well
* ohh okay it goes back
* are ds_load and ds_store different queues on RDNA4?
* print msg, v_mul_lo_u32 is 4 cycles?
* discover
* wmma something
* wmma comment
* less
* less
* better comments
* work
* inst st
* delay column
* better cli
* emit_alt
* update test_handwritten
* work
2026-03-28 22:48:59 +09:00
chenyu
fe705def0d
move more broadcast method to mixin [pr] ( #15513 )
...
* move more broadcast method to mixin [pr]
all but div, mod, and where
* xor -1
2026-03-28 01:48:08 -04:00
chenyu
c0753ab62f
XOR simplifcation rules ( #15512 )
...
x^-1 has good vmin/vmax, and x^y^y is x
2026-03-27 23:23:27 -04:00
qazal
ccaa6bfc19
viz/cli cleanups ( #15511 )
...
* one less function
* work
* layout
* better handling of rewrites
* mypy passes
2026-03-28 08:50:38 +09:00
qazal
dcc2a5d23b
viz/cli: simplify to --source and --item flags ( #15510 )
...
* viz/cli: simplify to --source and --item flags
* update viz cli test
2026-03-28 04:46:39 +09:00
nimlgen
0d6fc0f571
jit: graphing in uops ( #15489 )
...
* jit: graphing as rewrite rule
* f
* +metal,cuda
* x
* cl
* x
* x
* simpler
* f
* m
* x
* revert?
* revert2
* back
* back
* t
* x
* m
* x
* c
* x
* l
* x
* comment
* smaller
* rv
* x
* x
2026-03-27 19:09:02 +03:00
chenyu
30ebbe7f17
few more fold valid tests ( #15509 )
...
from remove CORRECT_DIVMOD_FOLDING attempt
2026-03-27 10:38:42 -04:00
Christopher Milan
9e0cc5c6ae
create image buffers in late codegen ( #15493 )
2026-03-27 04:50:53 -04:00
chenyu
1198d6e908
move pow to mixin ( #15507 )
2026-03-27 03:16:40 -04:00
chenyu
323fcefd7d
Revert "DEV is a ContextVar ( #15505 )" ( #15506 )
...
This reverts commit fdb30cba96 .
2026-03-27 02:22:40 -04:00
Christopher Milan
fdb30cba96
DEV is a ContextVar ( #15505 )
2026-03-27 00:57:09 -04:00
wozeparrot
a65e958be9
llama: new apply_grad ( #15503 )
2026-03-26 19:39:25 -07:00
Christopher Milan
67a50fb738
move where on load with casts ( #15492 )
2026-03-26 22:11:27 -04:00
qazal
586c49642f
viz/cli: test in CI ( #15501 )
...
* viz cli work
* baseline test
* make cli test work without subprocess
* more checks
* check itrace
* s/return/return None
* change
* minimal
* colored
2026-03-27 06:47:15 +09:00
qazal
3f9f0fa846
viz: yield sqtt alt events ( #15500 )
...
* yield other
* less
* work
* less
2026-03-27 04:43:41 +09:00
qazal
237c25031f
sqtt: construct OTHER_SIMD op types with for loop ( #15495 )
...
* other-lds from amd_copy_matmul
* more other
* other simd work
2026-03-26 23:07:18 +09:00
nimlgen
7193f90746
test view input in jit ( #15497 )
...
* will anything fail?
* add test
2026-03-26 16:59:47 +03:00
nimlgen
de24b3fe37
jit: pass init params straight to base ( #15496 )
...
* jit: pass init params straight to base
* linter
2026-03-26 16:59:10 +03:00
qazal
ec5b7a249e
viz: refactor sqtt timeline builder ( #15494 )
...
* viz: refactor sqtt timeline builder
* barrier maps to waves
* clean up cli
2026-03-26 21:16:15 +09:00
Christopher Milan
313937ad6d
fix IMAGE TestEnd2End.test_linear_mnist ( #15488 )
2026-03-26 04:12:47 -04:00
Christopher Milan
bc180a963c
deprecate <dev>=1 in favor of DEV=<dev> ( #15467 )
...
* start work on target
* add test
* update actions to use DEV
* update docs
* update readmes
* tests need that too
* update example
* update tests (comments)
* fix that test
* ruff
* mypy
* oops
* remove getenvs
* don't add Target yet
* and the test
* lint
* and docs
* more stuff
* assert
* few more fixes
* test assert
2026-03-26 03:48:03 -04:00
chenyu
8426f820a1
Tensor.sub to mixin ( #15486 )
...
also _broadcasted skipped broadcasting shape if it does not have shape
2026-03-25 23:20:56 -04:00
wozeparrot
1ca178f379
llama: stochastic rounding ( #15456 )
2026-03-25 18:16:31 -07:00
chenyu
7c8f992894
move EXPAND dtype cast back to gradient.py ( #15481 )
...
only a concern for gradient, not mixin
2026-03-25 19:25:26 -04:00
nimlgen
9d2d0774b4
remote: disk copies ( #15482 )
...
* remote: disk copies
* lineter
* r
* nv
* x
2026-03-25 22:14:25 +03:00
qazal
7c2c8d3905
viz: small ux improvements ( #15483 )
...
* test
* better
* work
2026-03-26 03:18:25 +09:00
qazal
737d5f67f9
viz: compute canvas dims for auto zoom ( #15474 )
2026-03-26 00:05:23 +09:00
qazal
60bd546593
sqtt: add cycle count to rdna3 enums ( #15473 )
...
* update rdna3 sqtt enums to include cycle_count
* dispatch_to_exec
2026-03-25 23:19:54 +09:00
chenyu
142bf11926
logical_not to mixin [pr] ( #15472 )
...
also UPat.cast skips same dtype
2026-03-25 09:16:45 -04:00
George Hotz
25ff7146f2
add a status line to REMOTE with DEBUG=1 ( #15471 )
...
* python speedups of hot paths
* add a status line to REMOTE with DEBUG=1
* pc
* t
2026-03-25 20:54:56 +08:00
qazal
c973b508b8
viz/cli: pass ctrlc ( #15470 )
2026-03-25 21:13:28 +09:00
George Hotz
c1a7d90ccc
python speedups of hot paths ( #15469 )
2026-03-25 20:02:42 +08:00
George Hotz
ae7090b13b
print function timing with DEBUG=2 ( #15468 )
...
* add DEBUG=2 function timing
* remove those functions, they aren't useful
* fix spec
2026-03-25 19:07:32 +08:00
Christopher Milan
e7f389efda
fix height=1 images on macos ( #15460 )
2026-03-25 05:59:56 -04:00
George Hotz
789628df2e
hotfix: add USE_BOT flag to ASM24 USB
2026-03-25 15:00:08 +08:00
George Hotz
cd1a276f47
llm: support gguf path or url ( #15464 )
...
* llm: support gguf path or url
* one line
2026-03-25 14:43:19 +08:00
chenyu
713b322e70
add weakint to promo_lattice ( #15463 )
...
sits between bool and smallest int
2026-03-25 00:27:34 -04:00
chenyu
02878c5a2f
move _broadcasted to OpMixin ( #15461 )
...
it needs both ElementwiseMixin and MovementMixin
2026-03-24 23:56:01 -04:00
chenyu
519ba22470
more Tensor._broadcasted cleanup ( #15459 )
...
prep moving to mixin
2026-03-24 22:55:45 -04:00
George Hotz
fe2690399b
llm: support assistant prefill + refactor to TransformerConfig ( #15457 )
...
* llm: support assistant prefill
* refactor to ModelConfig
* TransformerConfig
* more
2026-03-25 10:50:48 +08:00
Christopher Milan
fd92aec094
cleanup unused image pitch code ( #15458 )
2026-03-24 22:47:16 -04:00
chenyu
f6ed4da268
Tensor.ufix ( #15452 )
...
* Tensor.ufix
prep moving _broadcasted to mixin
* remove backward_cast
2026-03-24 22:34:43 -04:00
qazal
1b3d00d6ac
viz/cli: remove --offset and --limit flags ( #15439 )
...
* work
* also no more no-color
* reorder
* update llama
* sqtt readme
* itertools
* rm that
* signals back
2026-03-25 09:52:27 +09:00
wozeparrot
da2031266a
llama: correct 8b init ( #15397 )
2026-03-24 13:41:41 -07:00
qazal
652bab8aad
viz: support nested track_rewrites ( #15454 )
...
* simple test
* stack active groups
2026-03-25 05:01:30 +09:00
qazal
41eb2cc41b
viz: preserve zoom between re renders ( #15451 )
2026-03-25 03:11:10 +09:00
Salman Chishti
84049fdc07
Upgrade GitHub Actions to latest versions ( #15446 )
...
Signed-off-by: Salman Muin Kayser Chishti <13schishti@gmail.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
2026-03-24 10:28:49 -04:00
Salman Chishti
9567075e20
Upgrade GitHub Actions for Node 24 compatibility ( #15445 )
...
Signed-off-by: Salman Muin Kayser Chishti <13schishti@gmail.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
2026-03-24 10:28:19 -04:00
chenyu
b7960841af
support shape broadcast in UOp.alu ( #15442 )
...
i think it can integrate tighter, but now Tensor also does ufix from UOp and implicit dtype upcast
2026-03-24 10:14:57 -04:00
George Hotz
a33ac869aa
llm server: temperature + test client ( #15444 )
...
* improvements to the llm server
* eval script
* eval llm
* better eval gets 58.71
* cleanups
* add temperature, but multinomial is absurdly slow
* claude is so smart
* lint
* remove slop
* no more stop
2026-03-24 21:07:15 +08:00
nimlgen
9db5d677c7
jit in viz ( #15447 )
2026-03-24 18:23:53 +08:00
Christopher Milan
2e4fbbcc9c
ir3: fix texture mapping and benchmark ( #15443 )
2026-03-24 04:52:54 -04:00
Christopher Milan
d5320a9ddf
QCOM cleanups ( #15435 )
2026-03-23 22:18:38 -04:00
George Hotz
85dee83f5d
amd flash attention cleanups + emulator fixes ( #15431 )
...
* amd flash attention cleanups
* simpler
* params
* fix emulator bugs
* fix idiv bug
* remove that test
* more emu fixes
2026-03-24 10:10:46 +08:00
chenyu
018a9e2d3c
remove match_dtype arg in Tensor._broadcasted ( #15440 )
...
reworked Tensor.where to not need it, also updated dtypes.from_py to use isinstance because ConstFloat issues
2026-03-23 22:10:39 -04:00
qazal
a590eded87
sqtt: rdna4 decoder work ( #15434 )
...
* sqtt: rdna4 decoder work
* diff cleanup
* more diff
* test
* work
* works
* TS_DELTA_SHORT
2026-03-24 03:49:32 +09:00
qazal
109472c37e
sqtt: new s_barrier pickles, handle rdna4 barriers in emulator ( #15437 )
2026-03-24 03:25:28 +09:00
nimlgen
fa4cdb422e
memplan on linears ( #15422 )
...
* memplan
* test
* x
* arenas
* correct
* set any size
* ugh
* make hevc happy
* x
* x
* held
* rm old
* del
* x
* fu
* f
* cl
* cl
* ok
2026-03-23 19:50:16 +08:00
nimlgen
2da008ae3b
jit: rm replan ( #15433 )
2026-03-23 19:31:51 +08:00
qazal
c4c53418f8
sqtt: comment out flaky rocprof timestamp assert ( #15432 )
...
* comment out rocprof assert, add new assert
* better than > 0 assert
* string
2026-03-23 19:24:04 +09:00
chenyu
66a86f88a0
simpler Tensor._broadcasted inferred dtype ( #15430 )
2026-03-23 05:20:11 -04:00
Pham Nguyen Hung
c89576921d
Updated the APIs of mnist_gan ( #15429 )
...
Co-authored-by: pnhung1703@gmail.com <Hung Pham>
2026-03-23 17:04:00 +08:00
George Hotz
c62dea6881
ai slop flash attention (it works) ( #15401 )
...
* ai slop flash attention (it works)
* speed up, 2 TFLOPS + 7 GB/s
* simpler
* simpler
* optimize
* faster
* warp shuffle
* sqtt: link dispatch to exec (#15396 )
* sqtt packet linking infra
python
* javascript
* ~doubly linked list
* ui works
* work
* exec can also highlight the pc, coloring work
* more work
* rm sqtt/model.py, doesn't need to be upstreamed
* viz: no context enters in cli, update llama profile (#15404 )
* removed unused named arg in rules [pr] (#15414 )
* viz: sqtt printer in viz/cli.py (#15411 )
* work
* sqtt timeline in CLI
* format all printers nicely
* s/Showed/Printed
* ansistrip
* sys.exit
* keep colors in list
* work from amd_copy_matmul
* has_more always gets returned
* linter
* don't print colors
* more colors
* wow this is so deep
* work
* minor details
* selected
* improve progress bar
* remove it
* 22, global_load_vaddr is so long
* remove *0 hack in sign, gradient materializes zeros for unconnected nodes (#15416 )
Amp-Thread-ID: https://ampcode.com/threads/T-019d1612-6322-706b-a94d-a812400a55cb
Co-authored-by: Amp <amp@ampcode.com>
* works
* cnt=20
* revert that
* uop slice tests
* simpler
---------
Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
Co-authored-by: gg <ggordbegli@gmail.com>
Co-authored-by: Amp <amp@ampcode.com>
2026-03-23 16:15:10 +08:00
qazal
1568a5ed07
viz: show dispatch to exec delay in sidebar ( #15428 )
2026-03-23 16:59:59 +09:00
Christopher Milan
ddaeebb500
nir: add shift support ( #15426 )
2026-03-23 03:37:44 -04:00
nimlgen
c74fa9bbe1
fix jitbeam not triggered ( #15424 )
...
* um
* beam
* x
* f
2026-03-23 15:34:59 +08:00
qazal
fd3559103b
viz/cli: better error message for empty itrace ( #15425 )
2026-03-23 15:50:20 +09:00
nimlgen
395aacd77d
jit: prune on linear ( #15423 )
...
* jit: prune on linear
* x
* this is from the future
2026-03-23 14:10:34 +08:00
chenyu
248cd9b39f
make Tensor init the only caller of Tensor.from_uop ( #15421 )
...
* make Tensor init the only caller of Tensor.from_uop
prep broadcast cleanups
* type
2026-03-23 00:29:08 -04:00
chenyu
67dcc79fdd
push Tensor(symbolic) logic to Tensor.from_uop ( #15420 )
2026-03-22 23:49:35 -04:00
gg
2087df814f
remove *0 hack in sign, gradient materializes zeros for unconnected nodes ( #15416 )
...
Amp-Thread-ID: https://ampcode.com/threads/T-019d1612-6322-706b-a94d-a812400a55cb
Co-authored-by: Amp <amp@ampcode.com>
2026-03-22 12:49:26 -04:00
qazal
c7b18e6108
viz: sqtt printer in viz/cli.py ( #15411 )
...
* work
* sqtt timeline in CLI
* format all printers nicely
* s/Showed/Printed
* ansistrip
* sys.exit
* keep colors in list
* work from amd_copy_matmul
* has_more always gets returned
* linter
* don't print colors
* more colors
* wow this is so deep
* work
* minor details
* selected
* improve progress bar
* remove it
* 22, global_load_vaddr is so long
2026-03-23 00:17:05 +09:00
chenyu
bcc08307da
removed unused named arg in rules [pr] ( #15414 )
2026-03-22 09:25:46 -04:00
qazal
2363bceb47
viz: no context enters in cli, update llama profile ( #15404 )
2026-03-22 05:47:02 +09:00
qazal
a9ceaf3c5f
sqtt: link dispatch to exec ( #15396 )
...
* sqtt packet linking infra
python
* javascript
* ~doubly linked list
* ui works
* work
* exec can also highlight the pc, coloring work
* more work
* rm sqtt/model.py, doesn't need to be upstreamed
2026-03-21 23:48:58 +09:00
nimlgen
9656d97d97
jit: captures linears, not execitems ( #15399 )
...
* jit: captures linears, not execitems
* x
* um
* etsts
* mockcuda
2026-03-21 16:32:12 +08:00
George Hotz
c13d9d29ff
add SHAPED_WMMA ( #15400 )
...
* add SHAPED_WMMA
* shaped wmma
* less bad
2026-03-21 16:16:03 +08:00
George Hotz
41a9b09683
minimal vec in amd_copy_matmul ( #15398 )
...
* minimal vec in amd_copy_matmul
* unified
* unify
* reshape/permute
* cleanups
* simpler
* move index
* cleanups
* more shared
2026-03-21 14:57:21 +08:00
qazal
30b3054fd5
whitespace cleanups in viz and sqtt.py ( #15395 )
2026-03-21 04:46:19 +09:00
qazal
71ccc69c52
FP8=1 llama works again, hipcc can run on macos ( #15394 )
...
* hipcc macos shim
* is_dtype_supported opens devices less
2026-03-20 23:43:15 +09:00
Christopher Milan
9470d5193a
deterministic decomp apply order ( #15393 )
2026-03-20 08:10:45 -04:00
Christopher Milan
376585b003
use should_emulate for target dtype in decomp ( #15392 )
2026-03-20 07:44:57 -04:00
Christopher Milan
a12d3951de
fix test_export_model imports ( #15389 )
2026-03-20 07:27:01 -04:00
George Hotz
1a2a203f48
add wmma support to amd_copy_matmul ( #15384 )
...
* add wmma support to amd_copy_matmul
* 15 TFLOPS and merged
* unify
* simpler
* simpler
* simpler
* cleanups
* TM/TN is the full regs
* comments
* WAVES_PER_SH + SQTT_EVENT
* Add WAVERDY support
* no split warp
* 3 range
2026-03-20 19:02:19 +08:00
Christopher Milan
1560b534a5
remove IMAGE=2 ( #15312 )
2026-03-20 06:26:52 -04:00
Christopher Milan
30d609432f
ci: only xcode-select for gpuocelot on macos ( #15387 )
2026-03-20 05:58:16 -04:00
chenyu
d1b4e37dfa
remove InvalidType branch in Tensor.__init__ ( #15386 )
...
it's handled by `elif isinstance(data, get_args(ConstType)):` already
2026-03-20 05:32:33 -04:00
chenyu
c491345766
pass device into Tensor._frompy ( #15385 )
...
* pass device into Tensor._frompy
with this, canonicalize_device is the only usage of Device in tensor.py
* export_model.py
2026-03-20 05:09:01 -04:00
George Hotz
3b75d8a7a2
fix double after bug in rangeify ( #15381 )
2026-03-20 14:53:46 +08:00
Christopher Milan
0c89340a1e
automatically emulate unsupported (tiny) floats [skip_process_replay] ( #15366 )
2026-03-20 02:31:44 -04:00
George Hotz
78ad089817
make precompile the default for llm ( #15376 )
...
* make precompile the default for llm
* works
* empty is okay for kvcache
* fix cache misses
* more tests
2026-03-20 14:08:55 +08:00
chenyu
459ef41ea0
don't exclude weakint in is_dtype_supported [pr] ( #15378 )
2026-03-20 02:08:29 -04:00
qazal
cf6a429aaa
mypy emulator pre-commit passing ( #15379 )
...
* fix dict stuff
* add type: ignores
* fix pcode to put uops not ints
2026-03-20 14:44:09 +09:00
wozeparrot
87c4ec1724
llama: use flat llama ( #15353 )
2026-03-19 22:12:38 -07:00
chenyu
da1700e16b
dtypes.index -> dtypes.weakint ( #15377 )
2026-03-20 01:08:46 -04:00
nimlgen
3b04e3ea28
no gmmu mappings with GMMU=0 ( #15369 )
...
* usb
* free
* simple gmmu=0
* x
* x
* vram
* init tests
* ppg
* x
2026-03-20 12:18:34 +08:00
ridoy majumdar
c1183b8872
remove dead code in pyrender ( #15115 )
...
* remove dead code in pyrender
* retrig CI
* retrig CI
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
2026-03-19 23:59:56 -04:00
chenyu
bf33c5f796
remove gradient materialize_grads ( #15367 )
...
effectively default to True
and removed *0 hack in Tensor.copysign. now dy/dx=0 if y does not depend on x
remove
2026-03-19 23:36:03 -04:00
chenyu
45baf3ff3f
pin ci xcode version ( #15375 )
2026-03-19 23:13:16 -04:00
George Hotz
4091d37e8e
flat llama step work ( #15355 )
...
* flat llama step work
* fp8 support
* blacklisted matmul
* chestertons fence
2026-03-20 09:06:12 +08:00
qazal
176ad47d7d
cdna4 emulator testing ASM_GEMM in CI ( #15373 )
...
* cdna emulator work
* accvgprs
* cdna passes most tests
* ruff
* add cdna4 to tests
* cdna emu
* crash
* pass?
* work
* gen
* clean up wave_size access
* asm_gemm passes
* remove acc from dsl.py, emulator can keep its different reg file
it's purely an encoding here, the ASM_GEMM already encodes acc srcs with v[], this can
be cleaned up later, but not functionally required for emulator.
* split asm_gemm tests to ones fast on the emulator
* don't do that
* 124 stays null on rdna
* the segfault was because of hw regs, not this
* Revert "clean up wave_size access", it's explicitly tested
This reverts commit 1202ff5787 .
* nullcopyout
---------
Co-authored-by: George Hotz <geohot@gmail.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2026-03-20 05:51:30 +09:00
nimlgen
16daffc042
remote connection timeout ( #15370 )
2026-03-19 19:44:16 +08:00
Christopher Milan
68d7a6b7be
PYTHONREMU: fix vop3p literals ( #15372 )
2026-03-19 07:05:01 -04:00
George Hotz
70dad9d642
add PING to RemoteCmd ( #15371 )
...
* add PING to RemoteCmd
* cleanup
2026-03-19 18:57:40 +08:00
nimlgen
1c978aeedb
amd: fix aql remote ( #15368 )
2026-03-19 18:11:03 +08:00
qazal
337c684047
viz: cycle time relative to kernel start in sidebar ( #15352 )
2026-03-19 18:41:29 +09:00
chenyu
d81b03cff4
pad_to to mixin [pr] ( #15365 )
2026-03-19 05:02:01 -04:00
chenyu
1abb6297f6
more Tensor(UOp) cleanups ( #15364 )
...
* more Tensor(UOp) cleanups
* function too
2026-03-19 03:34:30 -04:00
nimlgen
cf50ca23c3
better oom msg ( #15362 )
...
* better oom msg
* s
2026-03-19 14:07:01 +08:00
nimlgen
1a53393512
remote in ci benchmark ( #15344 )
...
* remote in ci benchmark
* move to the end
* move
* ports
* own this
2026-03-19 13:49:09 +08:00
chenyu
92dfef8060
Tensor(uop) does not need explicit device ( #15361 )
2026-03-19 00:44:33 -04:00
nimlgen
f32c2e43a7
memory: use pfree ( #15360 )
2026-03-19 12:39:23 +08:00
nimlgen
86eec01f97
limit gl*lc ( #15359 )
2026-03-19 12:38:55 +08:00
chenyu
b39816e998
failed test case for Tensor(np, "bf16") ( #15358 )
2026-03-18 23:40:14 -04:00
chenyu
e407ee410c
cosmetic Tensor._do_reduction cleanups ( #15357 )
2026-03-18 22:27:50 -04:00
chenyu
6aebf95dac
move neg and invert to mixin ( #15356 )
2026-03-18 22:03:41 -04:00
wozeparrot
f6687d1ffc
feat: sd seed0 update ( #15354 )
2026-03-18 18:42:00 -07:00
wozeparrot
c45a606750
feat: no if in rand ( #15333 )
2026-03-18 15:09:51 -07:00
qazal
23e0431848
viz: switch sqtt sidebar to a simple asm list ( #15350 )
...
* work
* something like this
* Revert "something like this"
This reverts commit 6c45098d2b .
* less
* path includes
* scroll only jumps up and down
* it's only pc and line now
2026-03-19 01:40:25 +09:00
qazal
709fc52d7b
viz: fix auto zoom range in sqtt, include endpgm packet ( #15349 )
...
* viz: fix automatic zoom range in sqtt packets
* it's x+width
* include s_endpgm
* endpgm also doesn't have exec
2026-03-18 22:52:32 +09:00
nimlgen
d4836ddbb0
canonicalize device from tuple ( #15348 )
...
* will it ifx ci?
* test
* um
2026-03-18 20:35:52 +08:00
George Hotz
5524916e39
llama compute gradients explicitly + 243 GB of RAM on MP=8 ( #15343 )
...
* llama compute gradients explicitly
* apply grads
* fix multi issue
* multi BUFFER_VIEW support
* simpler
* skip the flaky test
2026-03-18 19:54:40 +08:00
nimlgen
ff004d2114
remote: fix mmio ( #15347 )
2026-03-18 18:20:39 +08:00
nimlgen
f853371c83
fix compilers autoselect ( #15346 )
2026-03-18 18:19:53 +08:00
chenyu
761ce8c0d3
fix Invalid combine rules ( #15345 )
...
* fix Invalid combine rules
wrong conditions broke setiem into invalids
* fix
2026-03-18 04:58:02 -04:00
nimlgen
c0499ca3e8
nv: use mmio iface ( #15342 )
...
* nv: use mmio iface
* nv: use mmio iface
* revert
* f
2026-03-18 16:53:09 +08:00
Christopher Milan
499ad9a356
benchmark openpilot 0.11.0 ( #15341 )
2026-03-18 03:28:43 -04:00
George Hotz
6e196195d8
add test for flat llama ( #15327 )
...
* add test for flat llama
* simpler
* back to split w1/w3
* env
* still too much ram
* invalid
2026-03-18 15:16:33 +08:00
chenyu
fceb21c315
Tensor(uop) uses device from uop ( #15340 )
2026-03-18 02:56:06 -04:00
George Hotz
6109117af1
anonymous buffers are Invalid ( #15336 )
...
* anonymous buffers are Invalid
* unique_const
* work
* remove invalid writes
* test_anonymous_buffers_in_function
2026-03-18 14:52:56 +08:00
chenyu
e644e1cb6a
less Tensor(...).uop indirection in Tensor.__init__ ( #15339 )
2026-03-18 02:17:38 -04:00
nimlgen
0315faf938
remote bench ( #15331 )
2026-03-18 14:03:51 +08:00
nimlgen
d720d50e12
memory: traverse all valid ranges only ( #15338 )
...
* memory: traverse all valid ranges only
* x
2026-03-18 14:03:39 +08:00
chenyu
ac7a348d06
dtypes.as_const -> DType.const ( #15337 )
...
does not need to be a staticmethod
2026-03-18 00:48:41 -04:00
Christopher Milan
864d3917d5
add openpilot onnx parser test ( #15334 )
2026-03-18 00:12:02 -04:00
Christopher Milan
0222bfdf69
Revert "don't use intermediate dict in onnx parse" ( #15332 )
2026-03-17 23:46:30 -04:00
chenyu
94926d00d8
fix rand > uint32.max ( #15330 )
...
need to keep low and high as 1D tensor.
`PYTHONPATH=. LLAMA3_SIZE=405B python3 examples/mlperf/models/flat_llama.py` works now
2026-03-17 22:00:01 -04:00
wozeparrot
b45edeb965
fix: rand supports large tensors ( #15329 )
2026-03-17 15:45:41 -07:00
qazal
00817cf65e
viz: all tests can run on the NULL device ( #15328 )
...
* remove that
* move to test_viz
* get_cfg
* do not use os.environ
* hm
* it's always on NULL
* import renderer
* no import *
2026-03-18 04:14:20 +09:00
George Hotz
2605840ee2
flat llama ( #15324 )
...
* FlatTransformer
* works
* pass in buffer views
* print stuff
* print
* bugfixes
2026-03-17 19:39:55 +08:00
nimlgen
0a641ce17d
system: remote ( #15318 )
...
* system: remote
* listen
* print
* fix
* minor
2026-03-17 19:25:37 +08:00
Christopher Milan
69eefdca20
images with height=1 have less strict width rules ( #15325 )
2026-03-17 07:07:22 -04:00
chenyu
14eb8170e4
skip TestRunAsModule if libclang is loaded ( #15323 )
...
reverse rule of TestAutogen skip, otherwise `NULL=1 python -m pytest test/null/test_autogen.py test/null/test_device.py` crashes for me
2026-03-17 06:02:53 -04:00
qazal
e7c26b6319
viz: rename to Start Cycle for the sqtt graph ( #15320 )
2026-03-17 18:53:06 +09:00
nimlgen
e89a103984
remove dmaref ( #15321 )
...
* remove dmaref
* imports
2026-03-17 17:52:09 +08:00
chenyu
3090d4a6e0
disallow reshape from None shape [pr] ( #15322 )
...
test_multigpu_clip_score works without it now
2026-03-17 05:46:53 -04:00
nimlgen
a50fdb0528
nvcc macos ( #15308 )
...
* fix nvcc install macos
* um
* arm
* per
* tm
2026-03-17 17:25:33 +08:00
George Hotz
9d95321be3
set allow_implicit=False by default ( #15319 )
...
* set allow_implicit=False by default
* modernize beautiful mnist
2026-03-17 17:14:38 +08:00
nimlgen
e1c2d09720
system: rebar to remote devs ( #15316 )
2026-03-17 16:09:12 +08:00
chenyu
79d2e83853
tighter ALU/variable min==max -> CONST rule [pr] ( #15317 )
...
only check Ops that can be simplified through this rule. halved the time for that rule in `PYTHONPATH=. TRACK_MATCH_STATS=2 python3 -O test/external/external_benchmark_schedule.py`
2026-03-17 03:44:24 -04:00
George Hotz
584ec75aa2
precompile backward ( #15311 )
...
* add precompile backward support
* cleanups
* fix
* compact grad
* split v not split
* simpler
* no NOOPT
2026-03-17 15:28:40 +08:00
chenyu
6b6d1814ca
update no_vectorized_index [pr] ( #15313 )
...
combine no_vectorized_index and no_vectorized_index_broadcast
2026-03-17 03:05:23 -04:00
b1tg
856a839efc
llm: fix qwen3 moe topk renormalization ( #15201 )
2026-03-17 12:57:33 +08:00
chenyu
1283b57b4e
update fix_store_after_hazard ( #15309 )
...
actual gate is just not CONTIGUOUS, also don't need to check against full backward_slice
2026-03-16 23:55:59 -04:00
Christopher Milan
575b40b93a
determine image shapes before index devectorization ( #15304 )
2026-03-16 23:16:33 -04:00
George Hotz
3ff03be413
call always has tuple ( #15297 )
...
* call always has tuple
* fix pre-commit and simplify
* update
* fix
* move that assert
* tuple
* fix multi
* cleanups
* fix merge
2026-03-17 10:58:46 +08:00
chenyu
1b8b151195
simpler Tensor.assign ( #15302 )
2026-03-16 22:37:25 -04:00
wozeparrot
674c760974
embedded bwd vocab shard ( #15001 )
...
* fix: remove more multi from call
* feat: embedding bwd vocab sharding
* clean: unused import
* clean: don't actually need this pattern
2026-03-16 19:37:16 -07:00
Christopher Milan
62bfd48d95
smarter padding in image_conv2d ( #15289 )
2026-03-16 22:17:48 -04:00
chenyu
e1fab4d2a9
UOp.store is always void [pr] ( #15301 )
2026-03-16 21:58:05 -04:00
chenyu
02afb45f29
remove UOp.assign [pr] ( #15300 )
...
* remove UOp.assign [pr]
it's all store and after, UOp is immutable
* fix test
2026-03-16 21:45:41 -04:00
qazal
33bd33e783
sqtt: add CDNA ops enum, show in viz ( #15140 )
2026-03-17 09:38:42 +09:00
chenyu
3e2b7803e6
view assign replaces at buffer identity ( #15298 )
...
matches what functions capture
2026-03-16 19:58:38 -04:00
qazal
346596cdce
viz: nanoseconds time axis in sqtt ( #15299 )
...
* ui
* secondaryTick is optional
* shader markers data
* instSt infra
* path forward
* details
2026-03-17 07:20:18 +09:00
nimlgen
1bc4cb254c
signed tinygpu as default ( #15296 )
...
* signed tinygpu as default
* f
* no sip
2026-03-16 19:29:41 +08:00
Christopher Milan
0de519c7c2
[pr] fewer simplify calls in image_fixup ( #15283 )
2026-03-16 06:57:52 -04:00
nimlgen
27e29127b5
system: remote prereqs ( #15290 )
...
* x
* new format for apl
* this
* typing
* rpc
* tuple
* linter+new tinygpu
2026-03-16 18:45:41 +08:00
chenyu
837b06c609
style cleanups in allocations.py [pr] ( #15295 )
2026-03-16 05:45:24 -04:00
George Hotz
476276f4b4
support grads on tuples ( #15287 )
...
* support grads on tuples
* simpler
* grad_fxn works
* cleanups
* unused
2026-03-16 17:39:34 +08:00
chenyu
20799df10b
remove Ops.ASSIGN [pr] ( #15294 )
...
goodbye
2026-03-16 05:22:21 -04:00
chenyu
b3378e7022
UOp.assign is store+after [pr] ( #15292 )
2026-03-16 04:51:50 -04:00
George Hotz
2e1c81c23f
allow_implicit to disable implicit params ( #15291 )
...
* allow_implicit to disable implicit params
* get both Tensor and UOp
* no implicits in llm
2026-03-16 16:40:14 +08:00
chenyu
a0d1444790
Tensor.assign is store+after [pr] ( #15288 )
...
* Tensor.assign is store+after [pr]
* put that back
2026-03-16 04:04:55 -04:00
George Hotz
08662bc4ab
add TUPLE/GETTUPLE, simple tests pass ( #15286 )
...
* simple tuple stuff passes
* resolved
2026-03-16 15:06:02 +08:00
nimlgen
e7705fe311
system: pcidev doesn't care about bars ( #15284 )
2026-03-16 14:45:43 +08:00
nimlgen
ff0bcc8de0
system: iface p1 changes ( #15278 )
2026-03-16 10:48:25 +08:00
qazal
4445f50356
viz: variable duration rdna barriers ( #15277 )
...
* viz: variable length rdna barriers
* work
* tiny changes
* simple wave simd test
* small wave sync test
* good multi barrier bug find
* simple fix
* wave_sync asserts
* rdna4 work
* more rdna4
* find more bugs in my model
* it's so much simpler
* wave_sync tests duration
* r4
* should just call this rdna4
2026-03-16 06:06:19 +09:00
qazal
5cd1daa3bc
cdna asm_gemm in one file, remove old rdna3 asm ( #15281 )
2026-03-16 04:32:30 +09:00
chenyu
cd14e8e64b
allocations contiguous is store+after ( #15280 )
2026-03-15 11:58:40 -04:00
qazal
7b6211fdd7
sqtt: remove discover_ops script ( #15279 )
2026-03-15 22:17:06 +09:00
wozeparrot
473e5e4368
feat: make USE_ATOMICS embedding bwd faster ( #15151 )
2026-03-14 21:21:10 -07:00
qazal
3858bfc83d
sqtt: CDNA inst decodes ( #15274 )
...
* sqtt: CDNA inst decodes
* JUMP packets other way
* cdna insts
* r3
* r4
* lds from simd1 and simd2
2026-03-14 21:03:46 +09:00
Christopher Milan
d753c5d7e5
IMAGE=1 image_conv2d pads for bank conflicts ( #15252 )
2026-03-14 07:59:16 -04:00
Christopher Milan
9047249a7c
m.where(x.pad_to(m.shape), Invalid) ranges shrink ( #15275 )
2026-03-14 07:26:36 -04:00
nimlgen
f392c53c66
system: merge remote into pciiface ( #15273 )
...
* system: merge remote into pciiface
* clenaer
* move
* mypy
* fix
2026-03-14 18:44:20 +08:00
chenyu
13eec8fbe8
remove unused assign rules [pr] ( #15268 )
2026-03-14 05:37:49 -04:00
Christopher Milan
dabdc986df
shrink guarded ranges, try 2 ( #15272 )
2026-03-14 04:24:05 -04:00
Christopher Milan
7cf4b16c91
Revert "shrink guarded ranges" ( #15271 )
2026-03-14 03:44:38 -04:00
Christopher Milan
d9951e2f8e
shrink guarded ranges ( #15263 )
2026-03-14 03:38:48 -04:00
qazal
43ffd66fda
viz: oneline inst list ( #15269 )
...
* viz: oneline inst list
* save 5 chars
* gradual padding
2026-03-14 15:37:18 +09:00
George Hotz
86f17468ed
store in spec + USB BOT fix ( #15265 )
...
* move spec to store
* usb bot flag
* Revert "usb bot flag"
This reverts commit 7b8b7824f0 .
* fix assert
2026-03-14 13:25:05 +08:00
George Hotz
06d7cddb33
amd_copy_matmul is cleaner ( #15248 )
...
* amd_copy_matmul is cleaner
* it runs
* replicated stuff
* add tid there
* it runs
* cleanup
* x.src[1]
* flatten
* move that
* keep that assert
2026-03-14 12:56:09 +08:00
chenyu
b3600e4774
don't emit assign in transform_precompiled_call [pr] ( #15262 )
2026-03-13 22:42:35 -04:00
qazal
4d60312f7f
viz: asm python dsl syntax highlighting ( #15259 )
2026-03-14 06:37:43 +09:00
qazal
6209ddfc90
viz: improve disasm of s_code_end ( #15258 )
...
* viz: improve amd disasm of s_code_end
* better tests
* order was good
2026-03-14 03:31:14 +09:00
wozeparrot
a191ac0566
llama: use mlperf model ( #15257 )
2026-03-13 08:08:32 -07:00
Sieds Lykles
4b59083d7c
assign into empty works ( #15256 )
2026-03-13 10:24:29 -04:00
qazal
60b1b908c6
sqtt: CDNA layout header packet is the same size ( #15255 )
2026-03-13 22:28:24 +09:00
nimlgen
4e21735f31
system: update tinygpu app ( #15247 )
2026-03-13 20:36:57 +08:00
nimlgen
1fbe1fef2c
move write_configs to drivers ( #15253 )
2026-03-13 19:02:34 +08:00
chenyu
018c01508d
test case for call precompile multi ( #15254 )
2026-03-13 06:28:43 -04:00
nimlgen
bc16f80b50
am: remove dma_regions param ( #15251 )
...
* am: remove dma_regions param
* linter
2026-03-13 18:12:48 +08:00
chenyu
576e7f985f
remove handle_assign_mops [pr] ( #15249 )
2026-03-13 01:53:21 -04:00
Christopher Milan
c251fc67c5
ci: consider arch in venv and apt caches and go back to 3.12 ( #15250 )
2026-03-13 00:36:49 -04:00
Christopher Milan
d4b947ea9a
ci: explicitly request python 3.12.10 instead of 3.12 ( #15246 )
...
3.12.10 is the most recent 3.12 version that has toolcache builds for linux, macos, and windows
2026-03-12 23:00:46 -04:00
George Hotz
a7d2429c21
amd_uop_matmul more cleanups ( #15240 )
2026-03-13 10:24:43 +08:00
qazal
d893b14193
sqtt: update cdna packet names ( #15243 )
...
* sqtt: update cdna packet names
* change
* order
2026-03-13 08:49:09 +09:00
wozeparrot
749162bd2f
llama memory tweaks ( #15223 )
2026-03-12 12:36:23 -07:00
qazal
9a7173b7a0
viz: visualize full range of shader clock frequency, auto zoom to kernel range ( #15225 )
...
* start this
* work
* rm those
* relative to start cycle
* cleanup
* cover the full range of packets
* correct event type
* start the ui change
* fit=true
* better
* always the zoom identity
* diff cleanup
* shader engine itrace can be turned off
2026-03-13 00:07:31 +09:00
chenyu
d9c09397c0
Ops.STORE is shapeless [pr] ( #15239 )
2026-03-12 09:05:30 -04:00
nimlgen
d746ccb791
system: fix vfio ( #15235 )
2026-03-12 18:31:00 +08:00
nimlgen
d104a903f8
system: print output when err ( #15230 )
2026-03-12 18:30:49 +08:00
George Hotz
e560a46f59
update amd_uop_matmul ( #15236 )
...
* update amd_uop_matmul
* use custom kernel
* simpler
* ignore
2026-03-12 17:33:12 +08:00
chenyu
90b7f4341d
failed two level divmod recombine case ( #15233 )
2026-03-12 04:04:36 -04:00
chenyu
8b8d9a443c
remove unused invalid rules [pr] ( #15231 )
2026-03-12 03:10:34 -04:00
George Hotz
bdd62fd484
remove unneeded realize map entries ( #15229 )
...
* remove unneeded realize map entries
* not that
2026-03-12 14:23:19 +08:00
chenyu
842c978df3
remove staticmethod dtypes.max/min ( #15227 )
...
always use x.dtype.max/min
2026-03-11 23:11:24 -04:00
b1tg
18dc77ccab
add fp8 fnuz dtypes with PYTHON backend support ( #14945 )
...
* add fp8 fnuz dtypes with PYTHON backend support
* rm emu related change
* clarify fp8 fnuz zero handling
* Revert "rm emu related change"
This reverts commit efa4763c22 .
---------
Co-authored-by: b1tg <b1tg@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
2026-03-11 22:30:18 -04:00
George Hotz
4f3f55328b
do not patch on invalid tensor tests ( #15226 )
...
* do not patch on invalid tensor tests
* cleanup
2026-03-12 09:35:20 +08:00
wozeparrot
4fab320abe
llama: clean ( #15224 )
2026-03-11 13:33:59 -07:00
wozeparrot
05d6d9120a
llama offload null ( #15222 )
2026-03-11 10:04:31 -07:00
qazal
d3eef70162
viz: render shader clock frequency graph ( #15197 )
2026-03-12 01:32:49 +09:00
chenyu
39b0f4bcc1
remove Ops.THREEFRY in remove_bufferize [pr] ( #15220 )
2026-03-11 05:30:33 -04:00
chenyu
6489a6f212
Revert "remove mop_cleanup [pr] ( #15217 )" ( #15218 )
...
This reverts commit 6b50df940a .
2026-03-11 04:17:56 -04:00
chenyu
6b50df940a
remove mop_cleanup [pr] ( #15217 )
...
no kernel diff, i think this was needed due to force_reshape?
test/external/external_benchmark_schedule.py is about the same speed
2026-03-11 03:54:42 -04:00
Christopher Milan
2fb8a7f60f
fix test_invalid_tensor when before values are nan ( #15215 )
2026-03-10 23:51:19 -04:00
chenyu
fce87f19a8
better fold_add_divmod_recombine ( #15214 )
2026-03-10 23:24:22 -04:00
chenyu
df8deec949
test for nest_by_factor selection ( #15213 )
2026-03-10 22:41:31 -04:00
chenyu
be6b0bce1f
variations of (x%c)+(x//c)*c ( #15212 )
...
put those into one function
2026-03-10 22:41:14 -04:00
qazal
a408d90f4f
viz: always detect sqtt packet overlaps, add timeline tests ( #15211 )
...
* test
* work
* it's called CALL, better assert
* qol
* row_ends
2026-03-11 05:32:38 +09:00
nimlgen
d9c7290eb0
nv: nvdec as NVDEC:0 device ( #15209 )
2026-03-10 14:44:50 +03:00
Christopher Milan
25d86ec9e1
start using Invalid in image_conv2d ( #15208 )
2026-03-10 07:11:06 -04:00
chenyu
ecbddfcffe
clean up gcd_with_remainder [pr] ( #15207 )
...
this can operate with int gcd directly and not through UOp
2026-03-10 06:13:20 -04:00
chenyu
bb7888b281
cleanup (x%(k*c))//c and (x%(k*c))%c ( #15206 )
...
these two are in the same family
2026-03-10 05:21:32 -04:00
chenyu
8389a8d7c5
remove_nested_mod can work with negative ( #15205 )
2026-03-10 03:10:08 -04:00
Christopher Milan
ffaafd391a
Invalid in Tensor ( #15154 )
2026-03-10 02:49:54 -04:00
chenyu
68c7c3ca84
divmod test_gcd_with_remainder ( #15204 )
...
test cases for gcd_with_remainder
2026-03-09 23:51:47 -04:00
chenyu
a53187eef7
fix TestPartialAssignToSharedBuffer ( #15202 )
...
bufferize_to_store issue with assign
2026-03-09 23:14:23 -04:00
wozeparrot
525a178966
llama: jit more ( #15199 )
2026-03-10 11:04:59 +08:00
George Hotz
315ad50a1a
make late allreduce the default ( #15125 )
...
Co-authored-by: wozeparrot <wozeparrot@gmail.com>
2026-03-09 17:42:57 -07:00
chenyu
6b354b906d
fold_divmod_general cleanups [pr] ( #15196 )
2026-03-09 19:43:16 -04:00
qazal
02ceeab3a7
viz: ui cleanups from the sqtt real time branch ( #15195 )
...
* label location for packets
* work
* OTHER_* packets always get filtered out
* less
2026-03-10 05:33:53 +09:00
qazal
a615ed8ebe
sqtt: update RDNA timestamp marker fields ( #15194 )
...
* rt:realtime field name, correct RDNA4
* share rdna4 and rdna3
2026-03-10 05:18:47 +09:00
nimlgen
8bd6d270c5
rm ops.encdec ( #15193 )
...
* rm ops.encdec
* x
2026-03-09 18:52:48 +03:00
qazal
81ab499b4b
viz: small ui code cleanups ( #15192 )
...
* less
* more work
* tabulate returns node like colored
2026-03-09 21:17:33 +09:00
chenyu
60215deb60
tiebreak in fold_divmod_congruence ( #15190 )
...
need to try both direction
2026-03-09 03:40:39 -04:00
chenyu
a8d8351e5a
match IDIV and MOD in nest_by_factor ( #15188 )
2026-03-09 00:50:38 -04:00
Christopher Milan
7592622562
fix QCOMCLRenderer pickle ( #15189 )
2026-03-09 00:36:16 -04:00
Christopher Milan
2bb0970512
QCOM CL compiler prints LLVMIR when DEBUG>=8 ( #15187 )
2026-03-09 00:15:20 -04:00
chenyu
83b80da8f3
even more divmod recombine ( #15163 )
2026-03-08 23:52:26 -04:00
chenyu
82f7734501
use backward_slice in reduce_mul_chain [pr] ( #15186 )
2026-03-08 21:44:53 -04:00
qazal
25e82a9aca
viz: exclude redundant traceback from SDMA ( #15185 )
...
* viz: exclude redundant traceback from SDMA
* ctx
* cpu_profile
2026-03-09 05:12:14 +09:00
nimlgen
6ac99fd4c9
memplanner opt copy bufs ( #15110 )
...
* mtp
* x
* tests
* ss
* simp
* less slop
* x
* cleaner
* rm
* m
* c
* x
* f
2026-03-08 22:28:01 +03:00
nimlgen
633264feae
am: flush sdma pipeline ( #15184 )
...
* am: flush sdma pipeline
* f
* f
* fix
2026-03-08 20:27:56 +03:00
b1tg
891a73befc
llm: fix chunked prefill ( #15182 )
...
* llm: fix chunked prefill
* less lines
---------
Co-authored-by: b1tg <b1tg@users.noreply.github.com>
2026-03-07 22:08:31 +08:00
chenyu
5d58b1c396
don't use intermediate dict in onnx parse ( #15181 )
...
also don't parse fields that are never used
2026-03-07 00:08:03 -05:00
nimlgen
086081e35b
tbgpu: add stapler to the script ( #15180 )
2026-03-07 00:07:27 +03:00
qazal
a03f512147
viz: clean up old / unused paths in sidebar rendering ( #15179 )
...
* src is unused
* less
2026-03-07 05:36:10 +09:00
chenyu
605b37c03f
use backward_slice in count_divmod [pr] ( #15178 )
2026-03-06 14:03:53 -05:00
Ananta Ranganathan
5bdad8ee41
update mxfp4 tests to use the same patterns as the others ( #15177 )
...
* update mxfp4 tests to use the same patterns as the others
* fix typo in test call not sure how it committed
2026-03-06 13:21:40 -05:00
qazal
d85109f9f7
viz: walk PROGRAM UOp back to source and binary only ( #15174 )
...
* work
* simpler
2026-03-07 01:39:07 +09:00
Ananta Ranganathan
5c50035e0d
avoid using arithmetic for mxfp4 ( #15172 )
...
* avoid using arithmetic for mxfp4
* update tests to use assert equal
* no longer todo
2026-03-06 11:17:56 -05:00
qazal
f064db0ac6
viz: later tooltip rendering ( #15170 )
2026-03-06 23:00:15 +09:00
Roelof van Dijk
4ed8bb7445
tie break for divmod ( #15169 )
2026-03-06 08:05:38 -05:00
qazal
83f1faa142
sqtt: update CDNA wave packet field, start unskipping tests ( #15168 )
...
* correct field names
* packet types
* packet 5 is regc
* test skips
2026-03-06 21:37:44 +09:00
Christopher Milan
7810be8d3c
compile QCOM without opening device ( #15165 )
...
Co-authored-by: Comma Device <device@comma.ai>
2026-03-06 06:24:27 -05:00
George Hotz
6fd18ef875
rename CAT to VCAT ( #15167 )
2026-03-06 18:46:28 +08:00
Roelof van Dijk
059c6326c0
metal uint32 icb offset overflow ( #15156 )
...
* metal uint32 icb offset overflow
fix: diff
supports_exec_item
GraphRunner.supports_exec_item
tests
fix: can't import on non-metal
stricter
* also test the non-metal buffer case
* imports on non-mac
2026-03-06 00:54:39 +03:00
chenyu
da61088ca4
more divmod recombine ( #15162 )
2026-03-05 12:53:22 -05:00
chenyu
167a1d56a6
improve divmod folding ( #15148 )
...
canonicalize to div than mod which enables more simplifcation
2026-03-05 10:07:36 -05:00
Christopher Milan
b824579e4d
simplify image_conv2d pitch alignment hacks ( #15158 )
2026-03-05 07:17:34 -05:00
qazal
5bf542469d
viz: python traceback for USER device ( #15160 )
...
* start
* ux
* unittests
2026-03-05 20:22:09 +09:00
Roelof van Dijk
d65923bda5
tensor.py: add normalize function ( #15159 )
...
* tensor.py: add normalize function
* p==0 should match torch
2026-03-05 18:55:53 +08:00
wozeparrot
4544da1c54
llama3 fixes part3 ( #15152 )
2026-03-05 01:17:54 -08:00
Roelof van Dijk
fc0534910c
q5k is like q4k ( #15155 )
2026-03-05 17:02:49 +08:00
Ananta Ranganathan
8ef656324e
FIXED TEST Q5_K GGUF dequant ( #15147 )
...
* q5_k gguf support as separate pr
* fix the problematic gemv test for q5_k
* add assert to make sure the gemv test cant fail with warning instead of error
2026-03-05 16:32:36 +08:00
George Hotz
e97922a57c
LLM speedup with two jits, prefill/rollout ( #15153 )
...
* START_TIME
* print cleanup
* fix tests
2026-03-05 16:21:09 +08:00
wozeparrot
be23772d43
llama3 fixes part2 ( #15150 )
2026-03-04 23:43:50 -08:00
wozeparrot
0c769289eb
llama3: more scripts ( #15107 )
2026-03-04 22:18:03 -08:00
George Hotz
fb43b415f9
fix symbolic shape call + chunked prefill ( #15149 )
...
* fix precompile for symbolic shape
* chunked prefill
* cleaner
* test that
2026-03-05 14:02:26 +08:00
George Hotz
8a82b26522
llm: print the prefill cache size ( #15146 )
...
* print the llm prefill cache size
* mock that too
2026-03-05 12:13:28 +08:00
chenyu
b5370fd52d
use copy_multi in alu_multi [pr] ( #15143 )
...
* use copy_multi in alu_multi [pr]
* copy to anything
2026-03-04 22:53:00 -05:00
George Hotz
72a9ed6e23
fix render depth bug + add warmup to serve + no realize default ( #15144 )
...
* fix render depth bug + add warmup to serve
* make realize not the default
2026-03-05 11:21:16 +08:00
George Hotz
ac1847cbf7
fully symbolic llm ( #15097 )
...
* work
* llm symbolic (almost)
* work
* revert that
* llm sym
* works
* cleanups
* cache tokens with the kv cache
* cleanups
* cleanups
2026-03-05 10:22:11 +08:00
qazal
33a1970045
sqtt: simplify inst mapping, validate JUMP processing in CI ( #15139 )
...
* jump cleanup
* assert there's a JUMP
* new example for JUMP
* regenerate examples
* rdna4 work
* new packets
* work
* less for branch handling
* less verbose
* fix err message
2026-03-05 09:53:12 +09:00
chenyu
04da527a7a
minor div_and_mod_symbolic cleanups ( #15138 )
2026-03-04 19:05:44 -05:00
chenyu
106d18b792
use UOp methods in allreduce.py [pr] ( #15137 )
...
except the one line with Ops.BUFFER and Ops.NOOP, not sure what that's for
2026-03-04 17:15:33 -05:00
chenyu
34594bcaaf
Revert "bug in metal: offset is stored as uint32, overflow ( #15129 )" ( #15136 )
...
This reverts commit 9c58db16fa .
2026-03-04 16:54:42 -05:00
Roelof van Dijk
9c58db16fa
bug in metal: offset is stored as uint32, overflow ( #15129 )
...
* metal uint32 icb offset overflow
* fix: diff
* supports_exec_item
* GraphRunner.supports_exec_item
* tests
* fix: can't import on non-metal
2026-03-04 22:52:12 +03:00
chenyu
4cce283790
relax test_tqdm_perf ( #15134 )
2026-03-04 12:58:47 -05:00
chenyu
fae400d300
update assign tests to also test the expected behavior ( #15132 )
2026-03-04 11:34:43 -05:00
chenyu
1f96cc2b51
update non-contiguous buffer error message [pr] ( #15131 )
...
* update non-contiguous buffer error message [pr]
also cleaned up the tests
* order
2026-03-04 11:13:26 -05:00
nimlgen
563d5c3211
more graph tests ( #15130 )
2026-03-04 19:01:12 +03:00
nimlgen
cdc48da9cd
hevc: assert and speed ( #15122 )
...
* hevc: assert and speed
* simpler
2026-03-04 19:01:02 +03:00
wozeparrot
4e9b85ecfd
fa: pull inputs out of call ( #15127 )
2026-03-04 03:15:49 -08:00
George Hotz
47faa2d7b4
hotfix: llm kv cache uses clone instead of realize to avoid many realize
2026-03-04 19:07:03 +08:00
George Hotz
8ebd24637b
fix fa forward building with clang 22 ( #15124 )
...
* fix fa forward building with clang 22
* fix: override rocm path
---------
Co-authored-by: Woze Parrot <wozeparrot@gmail.com>
2026-03-04 02:32:25 -08:00
Christopher Milan
592f9bf6c6
set OPENPILOT_HACKS=1 to enable replace assign ( #15123 )
2026-03-04 05:26:04 -05:00
wozeparrot
df23057984
fa: change bwd grid dim + unshuffle using mops ( #15068 )
2026-03-04 01:23:40 -08:00
Christopher Milan
5623cea7b1
move openpilot contiguous hacks to schedule ( #15120 )
2026-03-04 03:04:06 -05:00
wozeparrot
759c7fc81c
failing test for allreduce memory usage ( #15106 )
2026-03-03 23:38:38 -08:00
George Hotz
5ecfe549e7
allreduce is a function with LATE_ALLREDUCE=1 ( #15119 )
...
* allreduce as a function
* allreduce function
* support allreduce function
* LATE_ALLREDUCE
2026-03-04 15:17:58 +08:00
Christopher Milan
e7e70a3c95
simplify idx before counting backward_slice ( #15117 )
2026-03-03 23:53:50 -05:00
George Hotz
2d72a4a90c
fix copying padded const ( #15116 )
...
* fix const padding cpu
* remove comment
2026-03-04 10:39:45 +08:00
chenyu
b5ebb4d06d
contiguous_view_offset returns only offset [pr] ( #15113 )
...
size is always input.size
2026-03-03 15:23:39 -05:00
nimlgen
abd830b260
am: setup_rinf returns only doorbell ( #15112 )
2026-03-03 19:27:41 +03:00
nimlgen
4b42bb54aa
am: reset sdma to start from 0 ( #15109 )
2026-03-03 18:14:46 +03:00
George Hotz
01ddb4c267
add precompile to call ( #15099 )
...
* add precompile to call
* put get back
* something
* after structure
* alt
* keep it call
* resolve call
* resolve linear call
* precompile works with llm
* revert rangeify
* color for debugging
* getenv PRECOMPILE
* clean up deco pattern
* fully recursive sink scheduling
* revert llama
* fix SPEC=2
2026-03-03 22:32:42 +08:00
qazal
c7f908b788
sqtt: fix rdna4 structs ( #15111 )
...
* work
* DEBUG=2
2026-03-03 23:32:14 +09:00
qazal
8dd691761d
sqtt: remove old files ( #15108 )
2026-03-03 22:43:24 +09:00
Christopher Milan
de043226ba
benchmark comma usbgpu driving_vision step and load time ( #15103 )
...
Co-authored-by: Comma Device <device@comma.ai>
2026-03-03 06:08:03 -05:00
Christopher Milan
5f6b610da1
FLOAT16 logic for IMAGE==1 goes back to image_conv2d ( #15105 )
2026-03-03 05:37:57 -05:00
wozeparrot
529318259c
fix: fix null tests to actually use null device ( #15104 )
2026-03-03 02:05:47 -08:00
George Hotz
7d025089e3
no after removal ( #15102 )
...
* no after removal
* we are using walk
* null schedule test
* pytest deps
* Revert "pytest deps"
This reverts commit 5e1c5304ec .
* Revert "null schedule test"
This reverts commit 02da66053e .
* clean null tests
2026-03-03 17:50:31 +08:00
wozeparrot
92c16810ac
feat: per device mem_used ( #15100 )
2026-03-03 01:31:28 -08:00
qazal
e3a0598d0b
viz: the whole pc should be in view ( #15101 )
2026-03-03 17:17:53 +09:00
b1tg
a9ea36de79
assembly/amd: v_cmp_lg_f32 is ordered not-equal ( #14982 )
2026-03-03 15:37:48 +08:00
wozeparrot
c35de9bd68
asm_gemm: support more sharding ( #15002 )
2026-03-02 23:16:37 -08:00
wozeparrot
824ba4386a
llama3 dp fix ( #15098 )
2026-03-02 22:43:07 -08:00
chenyu
5dcf29b1a0
use clone in test_swap_slices ( #15096 )
2026-03-02 22:05:12 -05:00
Christopher Milan
c70e8af068
move IMAGE FLOAT16 logic to allocations ( #15095 )
...
* FLOAT16 logic in allocations
* cleanup
* separate that
* only apply when IMAGE == 1
* test passing now
* create image buffers earlier
2026-03-02 22:00:05 -05:00
George Hotz
d483e4153a
buffer view is like buffer ( #15082 )
...
* buffer view is like buffer
* fix
* swap_reshape_shrink
* contiguous on gguf, fix overlap
* revert that
* _device_supports_view
* this
* fix that test
* 0 buffers
* that test was wrong
* this
* check correct size
* contig BUFFER_VIEW
* this
* fix tests
* buffer view tests
* om
* fix torch
* no MOCKGPU
* skip
2026-03-03 09:52:33 +08:00
qazal
62ee976c1b
gemm/asm: cleanup repeated patterns to helper functions ( #15094 )
2026-03-03 08:14:47 +09:00
qazal
848f5cea96
viz: sqtt instruction packet trace ( #15065 )
2026-03-03 07:55:04 +09:00
chenyu
14d1c5fdfd
assign fusion tests on detach and contiguous_backward ( #15092 )
2026-03-02 15:21:51 -05:00
nimlgen
dfa180413d
tbgpu: sign nv ( #15087 )
2026-03-02 22:58:30 +03:00
chenyu
71f228f80f
test exact kernel count in torch_backend/test_kernel_fusion ( #15091 )
2026-03-02 14:26:32 -05:00
chenyu
f80b1033c5
simpler Tensor.all ( #15089 )
...
same generated kernel
2026-03-02 11:08:55 -05:00
chenyu
4008f7d4e8
move Tensor.one_hot +1 to python ( #15088 )
2026-03-02 10:56:41 -05:00
nimlgen
dafbe9733a
am: cleanup ( #15086 )
2026-03-02 17:06:21 +03:00
qazal
f7aeff6061
viz: cli.py cleanups, do not require PYTHONPATH ( #15085 )
...
* cleanup the print
* sys.exit
* equal check
* cleanup unpacker
* cli doesn't need PYTHONPATH
* no semicolons
* %s/PYTHONPATH=. //g
2026-03-02 19:24:38 +09:00
George Hotz
5ff278446c
add contiguous_view_offset ( #15084 )
...
* add contiguous_view_offset
* no int
2026-03-02 18:05:04 +08:00
Christopher Milan
977c270774
IMAGE=1 kernel count failing tests ( #15083 )
2026-03-02 04:35:26 -05:00
George Hotz
3539693555
Support triu variable on diagonal + SDPA symbolic ( #15081 )
...
* triu variable
* fails
* dumbbb
* no commutative in reshape
* real fix
* revert that
* sdpa symbolic tests
2026-03-02 12:19:48 +08:00
wozeparrot
a4f6365929
llama3: fstep takes grads ( #15069 )
2026-03-01 20:05:07 -08:00
Nick
8e8e9f6ff6
assert removal for _tri() + tests ( #15073 )
...
* assert removal for _tri() and tests
* removed import
* tests triu/tril like in prefill
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2026-03-02 10:34:28 +08:00
nimlgen
ccbbca05ef
beam: add dev_timeout for am ( #15063 )
...
* beam: add dev_timeout for am
* all covered
* fk
* x
* fuzz
* reset
* f
2026-03-01 16:57:29 +03:00
chenyu
8cb4368967
delete unused END NOOP rule [pr] ( #15077 )
2026-03-01 00:09:05 -05:00
chenyu
efce99adc9
skip isComposing key press in llm.py ( #15076 )
...
for the CJK input user
2026-02-28 20:31:53 -05:00
chenyu
103ea16ec0
add contiguous back to svd ( #15074 )
...
can cause infinite loop
2026-02-28 16:49:26 -05:00
chenyu
fe0fa8333b
Revert "improve Tensor.sort indices ( #15070 )" ( #15072 )
...
This reverts commit e3003631f2 .
2026-02-28 14:40:30 -05:00
chenyu
e3003631f2
improve Tensor.sort indices ( #15070 )
...
* improve Tensor.sort indices
instead of N^2 match at the end, have an arange to start and go through the same N(logN)^2 path
* contiguous
2026-02-28 14:16:16 -05:00
wozeparrot
cfc5cf65ad
llama3: vocab padding fix + jit copies on fakedata ( #15067 )
2026-02-28 08:44:55 -08:00
chenyu
76170d035a
relax atol for test_xlm_roberta_large ( #15066 )
2026-02-28 11:22:35 -05:00
qazal
cfb8e6922d
viz: arrow keys move through time ( #15064 )
...
* work
* automatic zoom, keeping scale
* the whole shape should be out of view
2026-02-28 23:52:36 +09:00
nimlgen
9b3450c9da
test gpu crash on cdna ( #15062 )
2026-02-28 13:17:59 +03:00
nimlgen
6bbf813dd3
ci: switch to tinygrad/amdcomgr_dylib ( #15061 )
2026-02-28 13:09:39 +03:00
nimlgen
77846300b2
am: reset vm fault ( #15060 )
2026-02-28 12:58:56 +03:00
George Hotz
dc54441e1f
add better printing to tinygrad.apps.llm ( #15059 )
...
* add better printing to tinygrad.apps.llm
* add gc.collect
* comment
2026-02-28 16:38:50 +08:00
George Hotz
bb84e389cf
functions for llama trainer ( #15045 )
...
* functions for llama trainer
* function there
* axis match
* fix multi
* lil cleaner
* there's a bug with HK_FLASH_ATTENTION
* training functions
* for commit
2026-02-28 12:15:18 +08:00
chenyu
9b4ba3f838
remove ReduceContext.range_to_ends [pr] ( #15055 )
...
* remove ReduceContext.range_to_ends [pr]
make merge_reduce_ends pure. this state is causing issue when introducing more reduce merging rewrites
* tag
2026-02-27 22:15:44 -05:00
chenyu
151608aa90
update test_multiple_to_single_device ( #15056 )
...
follow up to #14482 , add SCACHE=0 to the test
2026-02-27 21:44:33 -05:00
chenyu
5fd06f4f02
differentiable setitem ( #15054 )
...
* differentiable setitem
go through the where path for bw
* no return
2026-02-27 17:25:15 -05:00
chenyu
db6b3e1edc
fix mixed setitem with both basic and tensor indexing ( #15050 )
2026-02-27 15:35:48 -05:00
chenyu
c9f6d8751b
don't remove_bufferize for Invalid ( #15053 )
...
* don't remove_bufferize for Invalid
* replaced
2026-02-27 15:16:09 -05:00
qazal
b8a55d5f68
sqtt: new packet types, add discovery script ( #14960 )
2026-02-28 04:27:27 +09:00
nimlgen
4e12fc3fe6
am: mi3xx recovery ( #15051 )
2026-02-27 22:10:47 +03:00
chenyu
81a35cef38
rearrange Tensor.getitem code ( #15049 )
...
no-op change to prepare setitem fix
2026-02-27 12:57:16 -05:00
chenyu
1406d49eef
failed test cases for advanced setitem ( #15048 )
2026-02-27 10:50:18 -05:00
qazal
ef1017f7ed
viz: skip drawing offscreen tracks in profiler ( #15047 )
2026-02-27 22:19:08 +09:00
qazal
ad99b77f6d
assembly/amd: add gfx12_asm_vflat llvm tests, disasm fixes ( #15046 )
...
* add gfx12_asm_vflat.s
* work
2026-02-27 20:20:31 +09:00
George Hotz
010d2790ce
fix multi minimal ( #15044 )
2026-02-27 14:31:58 +08:00
George Hotz
3e1e12528c
hotfix: disable tinyfs load test
2026-02-27 12:04:41 +08:00
George Hotz
d23b79530e
remove disk from GGUF GEMV test ( #15041 )
...
* remove disk from GGUF GEMV test
* keep copy
2026-02-27 12:03:00 +08:00
chenyu
d345f7f5dc
remove _pending_assigns ( #15040 )
2026-02-26 22:38:10 -05:00
George Hotz
37e31e7da4
gguf gemv test ( #15039 )
...
* add gemv tests
* gguf big
* skip
* make realize optional
2026-02-27 10:54:43 +08:00
Nick
af94bfc401
fix retinanet shared memory race condition in parallel tests ( #15030 )
...
Append PID to shared memory names in batch_load_retinanet to prevent
FileExistsError when pytest-xdist runs multiple test workers that each
call _setup_shared_mem with the same hardcoded name.
2026-02-27 08:36:24 +08:00
George Hotz
2bbf8bbefa
improve call/param rendering ( #15023 )
2026-02-27 08:35:04 +08:00
chenyu
0f94a4bb73
failed test case for early fixup const copy ( #15038 )
...
* failed test case for early fixup const copy
wrong with PAD
* test no copy
2026-02-26 19:09:33 -05:00
chenyu
3a4db53b43
raise RuntimeError in schedule for conflicted var_val [pr] ( #15031 )
2026-02-26 15:16:01 -05:00
qazal
d65db32395
viz: only compute aggregate memory graph, defer n² per buffer graph ( #15029 )
2026-02-27 04:14:51 +09:00
qazal
c61fe57cfd
viz: fix n² tiny device linking in profiler ( #15028 )
2026-02-27 02:25:39 +09:00
qazal
88d650d606
viz: clean up call node detection check ( #15025 )
2026-02-26 19:57:56 +09:00
qazal
1c09890f66
sqtt: map instructions in the command line tool ( #15024 )
2026-02-26 12:34:24 +02:00
George Hotz
fe3ee8c27e
fix symbolic shapes in calls ( #15021 )
...
* fix symbolic shapes in calls
* fix after in the big graph
* real tests
2026-02-26 17:17:18 +08:00
qazal
12d179f5f4
viz: brighter call.src[0] edge color ( #15022 )
...
* work
* 2
* better color
2026-02-26 16:07:22 +09:00
George Hotz
2655655a0c
call gradient creates a call ( #15020 )
...
* function creates a full subgraph
* tests
* fix var
* fix tests
* implict assign/contig
* move kv init
2026-02-26 14:15:29 +08:00
Christopher Milan
94acd85285
fix typo in nn/__init__.py ( #15019 )
2026-02-25 20:01:32 -05:00
Christopher Milan
e5c0db66d1
num_batches_tracked does not need is_dtype_supported ( #15018 )
2026-02-25 19:50:57 -05:00
George Hotz
3244131f59
update dagre with more recursion fixes ( #15012 )
2026-02-26 08:35:05 +08:00
chenyu
ed9d475a12
assign tests with test_function ( #15015 )
2026-02-25 16:15:59 -05:00
nimlgen
faa66e0a61
mi350 hive_reset am repro ( #15014 )
2026-02-25 21:30:18 +03:00
nimlgen
8983830aa8
am: code style consistency ( #15013 )
2026-02-25 21:30:10 +03:00
George Hotz
0d35b67f2c
revert realize to only be buffers ( #15008 )
...
* revert realize to only be buffers
* fix that
* broken attention
* Revert "broken attention"
This reverts commit a23c3cd96c .
* and that
2026-02-25 22:43:06 +08:00
qazal
35f85c393f
viz: keep recursively nested call collapsed ( #15010 )
2026-02-25 22:45:18 +09:00
qazal
421b1d4a56
viz: monospace font for tags, no dy overrides ( #15009 )
...
* viz: monospace font for tags, no dy overrides
* str
2026-02-25 22:15:31 +09:00
qazal
448e997be4
gemm/asm: cleanup custom function args ( #15007 )
2026-02-25 22:05:56 +09:00
qazal
c58e91942c
viz: support collapsing individual CALL nodes ( #15006 )
...
* all
* contracted all by default
* simple call mask
* work
* minus not hyphen
* color / cleanup
* detail
2026-02-25 21:27:25 +09:00
George Hotz
68831cd852
add more tests to test_function ( #15003 )
...
* add more tests to test_function
* add function to llm
* function decorator on llm
* works
* symbolic fixups
* minimum change
* implicit inputs
* don't actually update llama yet
2026-02-25 18:42:06 +08:00
wozeparrot
d941dd5aeb
llama3: pad vocab when mp sharding ( #14998 )
2026-02-25 00:04:06 -08:00
wozeparrot
e1c9985715
llama3: better time keeping ( #14999 )
2026-02-24 22:42:05 -08:00
Christopher Milan
4a2fc7ecbb
autogen: cache downloads ( #14997 )
2026-02-25 01:34:27 -05:00
George Hotz
e3fa9896b7
start function and add walk rewrite ( #14992 )
...
* start function and add walk rewrite
* work
* add function on feed_forward
* llm progress
* stuff
* none of that
2026-02-25 13:56:27 +08:00
chenyu
fde7a40bb0
allow dtype mismatched assign on disk ( #14993 )
...
reverted #14473 , that was a bad idea. also added a test that safe_save only has copy
2026-02-24 20:49:55 -05:00
chenyu
46d9a9a74f
minor indexing cleanups [pr] ( #14991 )
2026-02-24 16:49:35 -05:00
chenyu
8dae9be573
move realize_map fixup into realize_assign_src [pr] ( #14990 )
2026-02-24 15:51:40 -05:00
chenyu
9d9151a21e
remove const normalization in indexing [pr] ( #14989 )
...
rangeify can create const with device, and all is normalized in to_define_global
2026-02-24 15:09:11 -05:00
chenyu
f68a472244
end range for COPY/BUFFER_VIEW [pr] ( #14987 )
2026-02-24 13:33:35 -05:00
chenyu
e5d27a3773
remove BUFFER_VIEW from ended_ranges special case [pr] ( #14986 )
...
* remove BUFFER_VIEW from ended_ranges special case [pr]
* will fix later
2026-02-24 10:37:29 -05:00
chenyu
5fd4fc0c6d
fix tinyfs ( #14974 )
...
* fix tinyfs
* fix that
2026-02-24 08:50:53 -05:00
George Hotz
8a6dffc87e
Tensor.callify will be the JIT ( #14983 )
...
* close
* simple callify, support linear in the scheduler
* all tests pass
* everyone is happy
* dumb test
* Remove unnecessary blank line in rangeify.py
2026-02-24 18:42:24 +08:00
nimlgen
6f1cb6be86
am: tiny err handling cleanups ( #14981 )
...
* am: tiny err handling cleanups
* x
* x
2026-02-24 12:43:45 +03:00
George Hotz
b643fca51e
clean up complete_create_schedule_with_vars ( #14980 )
...
* clean up complete_create_schedule_with_vars
* transform_to_call
* update viz tests
2026-02-24 16:12:36 +08:00
wozeparrot
8d9545e09e
llama3: correctly shard wqkv ( #14978 )
2026-02-23 23:57:10 -08:00
wozeparrot
a36a26d4ed
llama3: optim does grad acc in correct order ( #14965 )
2026-02-23 22:25:13 -08:00
George Hotz
e2b1f2620d
schedule is linear ( #14975 )
...
* schedule is linear
* cleanup
* cleanups
2026-02-24 11:30:41 +08:00
Christopher Milan
57ade7608a
consider indexing math cost for IMAGE=1 ( #14973 )
2026-02-23 18:57:45 -05:00
chenyu
0bda5585c7
unit test TestTinyFS ( #14972 )
...
these passed before the allocation change
2026-02-23 16:59:39 -05:00
imaolo
405d37423e
call release() in MetalAllocator._free ( #14970 )
...
* add failing test
* call MTLBuffer.release() in MetalAllocator._free()
* Update test_metal.py
---------
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
2026-02-23 23:33:31 +03:00
nimlgen
77db8e1c07
cpu: wait on dep signals ( #14862 )
...
* cpu: task_done() in case of failures
* print
* fix
* x
* f
* x
* um
* ?
* u
* f
* x
* gh
* f
* f
* virt
* x
* simpler
2026-02-23 21:09:41 +03:00
chenyu
127136421d
enable a few WEBGPU isnan tests that work now ( #14967 )
...
* enable a few WEBGPU isnan tests that work now
* still failed
2026-02-23 11:06:08 -05:00
ttomsa
0366474089
Bool cast to cmpne ( #14544 )
...
* test
* rm in llvmir
* rm in ptx and nir
* hmmmm
* rm in decompositions
* skip tests
* add test
* just this
* rm comment
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
2026-02-23 10:31:36 -05:00
George Hotz
806581f807
rename rewrites + sink filter + bump to dagre 2.0.0 ( #14966 )
...
* bump to dagre 2.0.0
* transform to call
* cleanup names
* get kernel graph
* dagre recursion fix + better error
* add toggle to hide sink nodes
* no sink by default
* revert that
* only hide final sinks
* lol
2026-02-23 22:47:22 +08:00
nimlgen
d86f1d66b5
system: apl validate dev_id bounds ( #14964 )
2026-02-23 12:18:03 +03:00
George Hotz
b824490e3f
allocate generates a call ( #14958 )
...
* allocate generates a call
* symbolic works too
* DEFINE_VAR is param
* replace param later
* apply buffers
* name
* upd
* this was a bug...
2026-02-23 15:59:20 +08:00
wozeparrot
dd8302a6d0
fix: optim device is never none here ( #14963 )
2026-02-22 23:34:57 -08:00
wozeparrot
25565b2410
fa: test for mp ( #14907 )
2026-02-22 21:47:36 -08:00
qazal
d6145736c7
sqtt: examples generator changes from inst_discovery ( #14961 )
...
* sqtt examples generator changes from inst_discovery
* rdna4
* rdna3
* cdna
* sad reality for mi300x
2026-02-23 14:42:48 +09:00
George Hotz
3acd763684
simple call in allocate ( #14962 )
...
* allocate generates a call
* symbolic works too
* add min/max to PARAM
* revert viz
2026-02-23 13:34:20 +08:00
George Hotz
f45199269b
hotfix: regress NV cifar_10steps_half to 120 ms
2026-02-23 12:29:25 +08:00
George Hotz
677145b393
all consts have shapes ( #14959 )
...
* all consts have shapes
* vconst has shape too
* use normal schedule
* cast ptrdtype
* image
* bitcast issue + hack
2026-02-23 10:26:50 +08:00
qazal
1538960002
viz: smaller view for repeated asm instructions in cfg ( #14954 )
...
* simple test
* todo
* feature
2026-02-23 10:41:43 +09:00
George Hotz
226d4a2440
hotfix: code DEBUG=1 defensively
2026-02-23 08:44:54 +08:00
chenyu
4424757b9a
update test_sharded_memory ( #14956 )
...
cleaned up and moved to test/null
2026-02-22 16:56:08 -05:00
b1tg
f9b7493e7a
cleanup fp8 conversion helpers and fp8 edge-case tests ( #14953 )
...
Co-authored-by: b1tg <b1tg@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
2026-02-22 09:16:42 -05:00
qazal
60f90dd97c
sqtt: fix jitted program deduping, failing test for graphed kernels ( #14951 )
...
* work
* hcq_profile fix, test with JIT=2 passes
* ci, -n=auto
* rm duplicate test
* less
2026-02-22 15:22:31 +09:00
chenyu
ccfd878e0f
minor fix_assign_hazard improvement [pr] ( #14949 )
...
target.base cannot be s if s.op is a movement
2026-02-21 21:21:28 -05:00
chenyu
24e8919438
raise explicitly for test_crossunder_assign ( #14948 )
2026-02-21 21:21:13 -05:00
chenyu
acf8f6b287
faster fix_assign_hazard [pr] ( #14947 )
...
one toposort. `time NULL_ALLOW_COPYOUT=1 MNISTMOCK=1 PYTHONPATH="." NULL=1 DEFAULT_FLOAT=HALF BENCHMARK=10 BS=256 GPUS=1 MODEL=resnet python3 examples/mlperf/model_train.py` 150s -> 40s
2026-02-21 19:42:13 -05:00
chenyu
9764e2561c
more assign into unrealize silent fail cases ( #14944 )
2026-02-21 18:12:57 -05:00
nimlgen
6de15dc480
mockam usb ( #14916 )
...
* mockam usb
* f
* win
* x
* x
2026-02-21 23:05:54 +03:00
chenyu
0dbcd764ad
a few assign into unrealized failed test case ( #14940 )
2026-02-21 13:18:45 -05:00
wozeparrot
3cda781876
llama optim offload ( #14901 )
2026-02-21 08:53:45 -08:00
chenyu
0255a64a27
update test_jit_init_empty ( #14938 )
...
* update test_jit_init_empty
now it fails silently
* that
2026-02-21 09:01:50 -05:00
George Hotz
8ef5544e4a
realized PYTHON copies ( #14934 )
...
* realized PYTHON copies
* comment that out
* fix that test
* append afters
* contig
* disk copies
* should be 124
* 332
2026-02-21 20:29:31 +08:00
qazal
cf23c2eee7
viz: merge readelfs, clean up toggles UI code ( #14936 )
...
* no extra readelf function
* that node can never be null, display block is wrong fix the css
2026-02-21 19:58:35 +09:00
George Hotz
639224e6e1
no call hack needed anymore ( #14935 )
2026-02-21 18:06:00 +08:00
George Hotz
d3b829a189
print schedule caller with DEBUG=1 ( #14933 )
2026-02-21 16:22:45 +08:00
qazal
8278886cf9
test_profiler cleanup, non flaky cpu_profile test ( #14932 )
...
* test_profiler cleanup, non flaky cpu_profile test
* existing device is okay
2026-02-21 16:58:10 +09:00
George Hotz
06fb35a1e5
don't graph_rewrite into calls ( #14931 )
...
* don't graph_rewrite into calls
* optional
* pm_gate_kernel_sink removed
2026-02-21 15:39:59 +08:00
qazal
c5029fa460
jit case with Tensor.empty input, realized means allocated ( #14930 )
...
* simple failing jit test case with Tensor.empty
* this used to exist in ops.py...
* Revert "removed if self.buffer.is_allocated() in realized (#14836 )"
This reverts commit 72cf603805 .
2026-02-21 16:33:55 +09:00
George Hotz
6533250246
remove more tags stuff ( #14927 )
...
* remove more tags stuff
* remove more
* unique consts aren't needed post tensor
2026-02-21 12:51:53 +08:00
chenyu
0c0d07d330
delete forced_reshape [pr] ( #14926 )
2026-02-20 22:35:31 -05:00
qazal
5b6fcd1cda
gemm/asm: smallest cdna4 asm gemm test ( #14925 )
2026-02-21 11:56:05 +09:00
George Hotz
ad3d821d63
move size 0 logic to allocations ( #14924 )
2026-02-21 09:57:40 +08:00
George Hotz
df7774661a
remove late numbering of UOps ( #14923 )
...
* remove late numbering of UOps
* stupid fix
* dead code
2026-02-21 09:18:48 +08:00
chenyu
c9b706125d
break Tensor.pad into methods ( #14922 )
2026-02-20 20:10:09 -05:00
Christopher Milan
5ee654b0d9
test IMAGE=1 driving_vision in mac pytest ( #14921 )
...
* test IMAGE=1 driving_vision in mac pytest
* don't multiply array
2026-02-20 18:28:10 -05:00
Christopher Milan
815780f72f
cl: fix multi-image arg kernels ( #14920 )
2026-02-20 17:34:17 -05:00
chenyu
24286c5593
fix clone for multi ( #14919 )
...
also update empty_like to make sure it's backed by buffers
2026-02-20 17:21:09 -05:00
chenyu
1fc1508f67
add assign to test_realize_is_realize.py ( #14918 )
2026-02-20 16:48:01 -05:00
chenyu
a4634b253a
fix empty_like for sharded tensor ( #14915 )
2026-02-20 16:30:04 -05:00
chenyu
86e7804d60
correct llm.py mem bw benchmark for moe ( #14626 )
...
only count active experts. verified on olmoe
2026-02-20 16:11:22 -05:00
Nicolas Pinto
aa905db7f7
ptx: use setp.neu for float CMPNE ( #14805 )
...
* ptx: use setp.neu for float CMPNE
* test ptx float CMPNE renders setp.neu
* check NaN behavior, not grep ptx strings...
* skip WEBGPU for test_cmpne_nan (Vulkan NaN behavior)
---------
Co-authored-by: Nicolas Pinto <41171+npinto@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
2026-02-20 16:11:04 -05:00
chenyu
f9536f3cd4
wrap UOp.__float__ with float [pr] ( #14913 )
...
fix warning
tinygrad/test/null/test_uop_resolve.py:56: DeprecationWarning: UOp.__float__ returned non-float (type ConstFloat). The ability to return an instance of a strict subclass of float is deprecated, and may be removed in a future version of Python.
self.assertEqual(float(u), 11.5)
2026-02-20 14:03:53 -05:00
chenyu
697d0b06c2
update env for testmacpytest ( #14912 )
...
CI: ""
CAPTURE_PROCESS_REPLAY: "0"
2026-02-20 13:42:50 -05:00
chenyu
07d145debd
compile3 0.10.1 driving_vision in mac pytest ( #14911 )
...
* compile3 0.10.1 driving_vision in mac pytest
* sync before re-executing onetime kernels
2026-02-20 12:23:52 -05:00
chenyu
d895713116
remove temp onnx migration CI job ( #14910 )
2026-02-20 11:38:44 -05:00
George Hotz
2611907afb
start ripping out old scheduler -- no maps ( #14909 )
...
* start ripping out old scheduler -- no maps
* no more metadata
2026-02-20 21:05:04 +08:00
nimlgen
1b3b94a72a
fix mockam mypy ( #14908 )
2026-02-20 15:15:05 +03:00
George Hotz
55d3a5def9
preallocate all realized buffers ( #14823 )
...
* preallocate all realized buffers
* contiguous
* work
* comment that out
* move to schedule
* better
* correct fix
* just buffer
* disk bufs
* fixes disk tensor stuff
* fix symbolic stuff
* fix multi
* 162 failures
* bugfixes
* don't check that anymore
* fix schedule tests
* mnist should be contiguious
* type and buffer
* fix tests
* shrink axis correction
* mypy fixes
* tests skips
* same 37 failures
* dedup
* no shrink in the graph
* 29 failures
* skips
* fix custom kernel
* fix training
* those optimizations aren't supported currently
* simpler
* more correct
* tests
* 14 failures
* works
* fix that test
* broken
* 11 failures
* only kernel counts left
* fixes
* all tests pass
* remove tensor_map
* op test
* 200 -> 230
* test fixes
* fixes
* revert test_tiny thing
* guard
* revert that
* test tiny passes
* no contigs there
* base realize back
* Revert "no contigs there"
This reverts commit c45bb9fcfd .
* revert that
* chop many assigns
* 12 failures
* fix tests
* tests
* apply after
* pre-commit
* remove old code
* delete that
* fix types
* remove extra contig
* fix dataloader
* torch fix
* disk fix
* update kernel fusion numbres
* runs on amd
* restore kernel count
* add that rule back
* that
* disable that
* wrong
* add the correct rule for that folding
* more tests
* guard c1.arg
* no newlines
* realize those
* split into a different file
* remove detach/contig back
* skip 2
* update that
2026-02-20 20:05:54 +08:00
nimlgen
dbf894215a
init mockam ( #14889 )
...
* mockam
* more tests
* linter
* x
2026-02-20 14:09:11 +03:00
wozeparrot
4b9825c829
make optim _step return update ( #14906 )
2026-02-20 02:43:56 -08:00
George Hotz
6610255654
add the correct rule for gcd div/mod folding ( #14905 )
...
* add the correct rule for that folding
* more tests
* guard c1.arg
2026-02-20 18:11:54 +08:00
George Hotz
a28fc2fba7
hotfix: remove wrong symbolic rule
2026-02-20 17:09:18 +08:00
qazal
28451a5957
viz/sqtt: rdna4 wmma, cleanup inst rows ( #14904 )
...
* valu wmma
* viz/sqtt: rdna4 wmma, cleanup inst rows
2026-02-20 17:02:09 +09:00
qazal
16ae96fa58
finish rdna4 sqtt ( #14903 )
...
* unskip
* it's a wave pair in rdna4
* work
* that
* hidden archive
* generic s_delay, mystery InstOpRDNA4.UNK_60
* branch failing test
* UNK_60 is OTHER_VMEM_STORE
* rdna4 has both s_delay_alu and s_wait_alu
* real branch failing test
* rdna4 doesn't have JUMP_NO, it's NEXT with a flag for no jump
* make inst_delay skips recursive
* all rdna4 tests pass
* simm16 unwraps
* that has a name
2026-02-20 16:06:13 +09:00
qazal
52b51a0324
test fixes from rdna4 sqtt ( #14902 )
2026-02-20 14:42:33 +09:00
qazal
32f569b573
viz/sqtt: decoder fixes pre rdna4/cdna4 work ( #14900 )
...
* viz/sqtt: decoder fixes pre rdna4/cdna4 work
* fix
* branch_inst + more tests
* smaller
2026-02-20 12:10:15 +09:00
qazal
e9ae3da711
viz: click on CALL node goes to codegen ( #14609 )
...
* viz: click on CALL node goes to codegen
* colored name
2026-02-20 11:13:11 +09:00
George Hotz
fc5677c28b
resnet dataloader + more test cleanups ( #14899 )
...
* resnet dataloader
* tests
2026-02-20 10:05:47 +08:00
chenyu
b9744ab62b
one more test_gpudims test ( #14898 )
...
failure from the bad simplification attempt
2026-02-19 18:18:44 -05:00
chenyu
9d6cf00be2
fix gpudim bug and test_split_2d_to_3d ( #14896 )
2026-02-19 16:46:24 -05:00
chenyu
2b31823ef9
update test_gpudims to prove bijectivity ( #14895 )
...
* update test_gpudims to prove bijectivity
* one more
2026-02-19 16:18:59 -05:00
chenyu
19ce7a3f7f
use z3 to verify gpudims output index ( #14894 )
...
found a bug with z3
2026-02-19 15:24:38 -05:00
chenyu
52f727738b
move test_grouped_dims to test/null ( #14893 )
...
it's a pure helper
2026-02-19 14:50:53 -05:00
chenyu
af997c1ea5
use .expr to access variable expr instead of arg[0] [pr] ( #14892 )
...
only apply when it's more readable
2026-02-19 12:24:36 -05:00
chenyu
7400362a86
remove UOp.vars [pr] ( #14891 )
2026-02-19 12:09:39 -05:00
chenyu
f54a49e733
restructure alu_multi [pr] ( #14888 )
2026-02-19 11:11:49 -05:00
chenyu
06ef8a26b7
add a test case that triggers CALL passthrough_multi ( #14887 )
2026-02-19 10:45:40 -05:00
nimlgen
071403f9a1
system: use MAP_FIXED_NOREPLACE ( #14884 )
2026-02-19 18:32:50 +03:00
nimlgen
041dc0cf85
fix typos ( #14886 )
2026-02-19 17:37:15 +03:00
Kartik Vashishta
9a9c7648e9
system: fix pci_scan_bus vendor filter ( #14885 )
...
* system: fix pci_scan_bus vendor filter
* fix: formatting
2026-02-19 17:23:32 +03:00
chenyu
877a5d4c45
improve types and simplify allgather in multi [pr] ( #14878 )
2026-02-19 09:02:15 -05:00
wozeparrot
9317e96881
fa: explicitly pass shapes ( #14857 )
2026-02-19 05:26:16 -08:00
George Hotz
f6c1cf343c
new symbolic rule from prealloc_bufs ( #14883 )
...
* new symbolic rule from prealloc_bufs
* optim
2026-02-19 20:57:30 +08:00
qazal
658c32864a
viz: show event number in track line ( #14882 )
2026-02-19 20:58:37 +09:00
qazal
911399bee5
assembly/amd: move the kernel capture stuff out of helpers ( #14881 )
2026-02-19 16:28:48 +09:00
qazal
1f34ba4511
viz: remove global amd targets mapping ( #14879 )
...
* viz: remove global amd targets mapping
* rename to amd_counters and nv_counters
* diff
2026-02-19 15:31:12 +09:00
George Hotz
2f0f8b5776
more test relaxations from prealloc_bufs ( #14880 )
2026-02-19 14:23:28 +08:00
qazal
5bc65ec669
applied_opts/estimates in program spec are aliases for the sink arg ( #14860 )
...
* remove applied_opts from programspec
* comment that out
* placement
* update tests
* p.ast.arg
* remove todo comment
* maybe this too
* it can exist as an alias, also for estimates
2026-02-19 13:08:26 +09:00
chenyu
8d8da185ec
minor handle_allreduce cleanup [pr] ( #14876 )
...
no more lbs, also use a divmod
2026-02-18 22:53:28 -05:00
Christopher Milan
b5588d341b
uop_given_valid fixes many gated reads for IMAGE=1 ( #14877 )
...
* add replay script
* pkl is arg
* that needs uop_given_valid
* cleanup
2026-02-18 22:49:47 -05:00
George Hotz
ab61c16730
fixes and test relaxations from prealloc_bufs ( #14875 )
...
* fixes and test relaxations from prealloc_bufs
* fix error type and guard _mop
* revert that
* contiguous makes extra/torch_backend/test_kernel_fusion.py fail
2026-02-19 11:37:25 +08:00
chenyu
0c85b93938
support shink sharded and non-sharded axes ( #14874 )
...
simpler to just support it
2026-02-18 20:54:10 -05:00
chenyu
e8252e6e4f
use offical gguf in test ( #14872 )
...
also deleted bad test_load_sample_mxfp4, added some hard coded simple tests
2026-02-18 19:55:09 -05:00
chenyu
8c830c5b44
test_full_like_shrink_on_shard_axis ( #14870 )
...
* test_full_like_shrink_on_shard_axis
add a test case that triggers non-copy branch in mstack_early_shrink
* 0
2026-02-18 19:23:44 -05:00
Ananta Ranganathan
4005e9db6d
Mxfp4 fix ( #14866 )
...
* double e2m1 values for mxfp4
* check if assert equal works in ci
* Revert "check if assert equal works in ci"
This reverts commit 8cf902ce0d .
* remove unnecessary whitespace change
* add test case that fails for old implementation but passes for new
* add note that the previous test is bad
* clarification on the methodology for the test
* fix the indent problem that happened to skip this test
* for now update mxfp4 block test to similarly use allclose (bad)
* add gist link and clearer explanation of process for computing test data
2026-02-18 18:50:59 -05:00
chenyu
0e4cf21a75
remove handle_allreduce_multirank and group_id [pr] ( #14869 )
...
leftovers from ops_remote
2026-02-18 16:13:54 -05:00
chenyu
f771de6738
gc.collect() to get the correct GlobalCounters.mem_used in tests ( #14868 )
...
test can be flaky if gc happens in between
2026-02-18 15:01:23 -05:00
chenyu
f84a11bb9f
delete uneven shard tests and mentions ( #14867 )
2026-02-18 14:10:33 -05:00
nimlgen
1c8c17a593
am: aca ( #14861 )
2026-02-18 21:40:09 +03:00
chenyu
b3cdb61067
clean up expand_multi [pr] ( #14865 )
...
remove dead assert, also make it more like a view
2026-02-18 12:21:13 -05:00
chenyu
0260406f49
simplify reshape_multi [pr] ( #14864 )
2026-02-18 11:46:26 -05:00
chenyu
5746a605ce
UOp.axis raises for invalid reshape ( #14863 )
...
reshape is lazy now, so better to raise from the .axis call and not have caller to handle invalid case
2026-02-18 11:28:56 -05:00
nimlgen
3b95fa0ed4
am_smi: enable mem usage back ( #14858 )
2026-02-18 19:27:27 +03:00
qazal
a212881130
viz: second profiler link goes to source code ( #14855 )
2026-02-18 19:40:34 +09:00
qazal
b0110c4469
viz: simplify shape clicking ( #14853 )
...
* setFocus is the more clear name
* do less
2026-02-18 19:03:26 +09:00
George Hotz
af839b2bd1
remove all the outerworld stuff, it was too complex ( #14852 )
2026-02-18 17:44:11 +08:00
wozeparrot
6d301ad2c4
feat: llama wqkv ( #14841 )
2026-02-17 23:01:33 -08:00
qazal
a3d516c4b5
viz: start displaying pma ( #14848 )
...
* viz: start displaying pma
* s
* work
* colors
* cleaner
* max packets
* fine
* work
* pma
* diff cleanup
2026-02-18 14:22:32 +09:00
George Hotz
d5636fba90
assign after copy shouldn't contig ( #14847 )
...
* assign after copy shouldn't contig
* fix assign copy
2026-02-18 12:23:49 +08:00
George Hotz
ab55e8c6b9
assign should be used as output buffer ( #14845 )
...
* assign should be used as buffer
* late removed
* the fix
* better fix
* backward slice
2026-02-18 09:37:46 +08:00
chenyu
e3c120c8e1
exclude 100 in test_assign_add ( #14846 )
...
this can crash, not sure why. skip 100 to see if it's better
2026-02-17 19:12:47 -05:00
Christopher Milan
7641ed61af
remove doublecast in IMAGE=1 ( #14839 )
2026-02-17 18:22:14 -05:00
Christopher Milan
5b11519d5e
LLVM actually supports ops ( #14843 )
...
LLVM should support eg, SHL/SHR, but this was never actually rendered
2026-02-17 18:21:33 -05:00
wozeparrot
95e97ec341
seperate llama optim ( #14810 )
2026-02-17 13:02:35 -08:00
chenyu
72cf603805
removed if self.buffer.is_allocated() in realized ( #14836 )
...
automatically fixes is_realized issue for empty
2026-02-17 15:35:56 -05:00
chenyu
aec8a6c85b
Revert "one run_schedule for assign realize ( #14835 )" ( #14837 )
...
This reverts commit df7c37f611 .
2026-02-17 14:34:26 -05:00
chenyu
df7c37f611
one run_schedule for assign realize ( #14835 )
...
concat schedules. separate out the execution part
2026-02-17 14:01:55 -05:00
chenyu
61867c2f35
TestRealizeIsRealized ( #14834 )
...
test after calling .realize(), uop.is_realized is True. currently not working for empty (thus disk tensor), and const
2026-02-17 13:30:35 -05:00
chenyu
f147791105
update test to reset and test kernel_count directly ( #14832 )
2026-02-17 11:48:46 -05:00
chenyu
9d4937ab5e
remove assign test @unittest.skip("this test is crashing!") ( #14831 )
2026-02-17 10:30:58 -05:00
nimlgen
dda5ccf63b
hcq: fix usb<->cpu mappings ( #14827 )
...
* hcq: fix usb<->cpu mappings
* non cpu
* um
2026-02-17 18:04:18 +03:00
nimlgen
801677cf12
am: GCVM_L2_PROTECTION_FAULT_STATUS prints device ( #14830 )
2026-02-17 18:03:52 +03:00
chenyu
f07898c68a
move assign chain fix to rangeify ( #14829 )
2026-02-17 09:40:34 -05:00
nimlgen
a2586e4c70
nv: move reset earlier ( #14824 )
2026-02-17 17:25:49 +03:00
chenyu
f2f039cc0f
fix chained full-buffer assign ( #14828 )
...
this shows issue that pm_remove_bufferize drops tags, will fix in bufferize next. this also fixed rand being different in jit vs no-jit
2026-02-17 09:11:04 -05:00
chenyu
58fa82eef5
stronger test_assign_add ( #14826 )
...
also test self add 10 and 100 times
2026-02-17 08:36:09 -05:00
George Hotz
ff60dab622
Revert "big sink is on base ( #14819 )" ( #14825 )
...
This reverts commit 5fc3d8109f .
2026-02-17 19:18:06 +08:00
qazal
f8e485ee9e
nvcc/nvdisasm macos shim ( #14822 )
...
* move to backend
* and arch
* setup_nvcc_osx
* blackwell
* min test
* now getting dumb assert is_ptx
* support cubin.
* work
* remove that
* simpler
2026-02-17 20:07:05 +09:00
qazal
d24781f45f
viz: do not, ever, open devices ( #14820 )
...
* viz: do not, ever, open devices
* unwrap
* on the kernel info
2026-02-17 19:42:44 +09:00
George Hotz
5fc3d8109f
big sink is on base ( #14819 )
...
* big sink is on base
* contiguous fixes tests
2026-02-17 18:32:56 +08:00
qazal
99a988b9d2
viz: remove ProgramSpec from trace ( #14818 )
2026-02-17 19:04:58 +09:00
qazal
f590564bf7
gemm multiple is only for cdna4 asm ( #14814 )
...
* gemm multiple is only for cdna4 asm
* move to backend
* and arch
* path
2026-02-17 14:00:02 +09:00
George Hotz
5bd2862d1a
late compile the cdna gemm ( #14783 )
...
* late compile the cdna gemm
* remove old things
* finalize inplace
---------
Co-authored-by: qazal <qazal.software@gmail.com>
2026-02-17 13:04:22 +09:00
Christopher Milan
275319c789
IMAGE=1 2d indexing ( #14809 )
...
* IMAGE=1 2d indexing
* cleanup
* oops
* go back to 'idx'
* fix vals
* fix
* ugh
2026-02-16 22:51:18 -05:00
George Hotz
f081f154ae
parameterize the CDNA asm gemm ( #14813 )
...
* parameterize the CDNA asm gemm
* fix llama test
* fix
* add more gemmt ests
* confirm all match
* test these asm gemms
2026-02-17 11:35:18 +08:00
George Hotz
bc3487d607
VIZ display cleanups ( #14811 )
...
* exclude reshape/expand broadcasts from viz
* limit src lines
2026-02-17 10:03:08 +08:00
chenyu
5bca5be2d2
test slice assign twice retains the buffer ( #14807 )
2026-02-16 20:01:47 -05:00
ridoy majumdar
ba39a19114
viz: remove duplicate Ops.PARAM color ( #14808 )
2026-02-17 09:31:47 +09:00
chenyu
9b44fbe0b8
update test_assign_add_twice ( #14806 )
...
failed test case to show that `+=1` twice returns a different buffer
2026-02-16 17:52:11 -05:00
chenyu
f290af6c7d
test_schedule always test with SPLIT_REDUCEOP=0 ( #14802 )
...
* test_schedule always test with SPLIT_REDUCEOP=0
except tests that tests SPLIT_REDUCEOP=1
* like that
2026-02-16 15:30:26 -05:00
kevvz
e41da0c396
use relative address for MOCKGPU rdna4 tracing ( #14801 )
...
* rdna3/4 trace separation
* remove comments
2026-02-16 22:59:46 +03:00
nimlgen
131bbbbfd8
am: smu_v13_0_12 ( #14800 )
2026-02-16 22:58:10 +03:00
nimlgen
7ddc888ad5
am: 48bit for gfx950 ( #14799 )
2026-02-16 22:48:07 +03:00
nimlgen
9f8afb518c
viz: sdma gb/s in graph ( #14798 )
...
* viz: sdma gb/s in graph
* f
2026-02-16 16:45:06 +03:00
qazal
db3db476ff
viz: add GB/s to SDMA ( #14795 )
...
* work
* better
* fix that
* no decimal
2026-02-16 20:09:20 +09:00
qazal
2b36708c6d
viz: split all long labels with ... ( #14794 )
2026-02-16 19:18:42 +09:00
qazal
d213fe95a0
viz: integer ticks on the x axis, fix small cycle numbers ( #14792 )
2026-02-16 18:07:40 +09:00
George Hotz
47d39a6b8b
add sqtt support to the emulator ( #14791 )
...
* add sqtt support to the emulator
* more sqtt
* cleanup
* cleanups
* simpler tests
* some decent tests
* test branch
2026-02-16 16:48:26 +08:00
wozeparrot
45aebe1572
hipkittens fa backward ( #14723 )
2026-02-16 00:38:44 -08:00
Nicolas Pinto
20b658b786
fuse MULACC after MUL->SHL ( #14788 )
...
* decompositions: fuse (x << n) + c to MULACC
MUL→SHL converts x*(2^n) to x<<n before MULACC can fuse (x*c)+y.
Add pattern to also fuse (x<<n)+c → MULACC(x, 2^n, c) for backends
that support both MULACC and SHL.
* test: add test_mulacc_shl for SHL->MULACC fusion
* test: relax test_mulacc_unrolled to >= 4
SHL->MULACC fusion now also catches power-of-2 address calculations,
increasing MULACC count from 4 to 6 on PTX. the test's intent is that
each unrolled multiply is individually fused (not grouped), so >= 4
is the correct assertion.
---------
Co-authored-by: Prithvish <deformercoding@gmail.com>
Co-authored-by: Nicolas Pinto <41171+npinto@users.noreply.github.com>
Co-authored-by: Nicolas Pinto <npinto@mbp23.local>
2026-02-16 16:26:44 +08:00
qazal
ac62d28ddc
viz: amdgpu arch cleanup ( #14790 )
...
* viz: amdgpu arch cleanup
* don't do that
* simpler sqttmap
* work
* self.arch
2026-02-16 16:48:12 +09:00
George Hotz
401095e3e7
emulator barrier tests ( #14789 )
2026-02-16 15:31:01 +08:00
qazal
c7a4dbf918
viz: get program binary from the UOp ( #14787 )
...
* viz: get program binary from the UOp
* remove that
* less
* rename View Program to View Source
* two words
* fix
2026-02-16 15:46:58 +09:00
Bautista Garcia
0f1ca8eb43
torch_load: fix shared storage slicing ( #14771 )
...
* faster zip_extract + usage in torch load
* clean zip in torch load
* working zipextract in torchload
* tar_extract in tar path
* faster tar path
* tests passing, cleanup needed
* faster tar with 1MB buffer
* comments
* unify storage_source with all paths
* use bufferedreader in zip path
* fix ruff
* clean
* removed unnecessary string conversion
* fix for tensors that share storage
* less hacky
* shared storage test
* test comment
* linter
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2026-02-16 14:30:13 +08:00
George Hotz
dff9cf35c2
amd asm emulator fixes + run it in CI ( #14786 )
...
* amd asm fix, try 2
* fix tests
2026-02-16 13:24:21 +08:00
qazal
55a4dfa2e0
cdna4 asm_gemm tests in CI on the null backend ( #14785 )
...
* cdna4 asm_gemm tests in CI on the null backend
* no .numpy() in null
* better
* gemm/asm: device comes from renderer
2026-02-16 14:06:23 +09:00
qazal
c2be31e75b
move Estimates to rewrite rules [pr] ( #14782 )
...
* move Estimates to rewrite rules [pr]
* don't need this cached_property
* tuple
* return
2026-02-16 12:59:42 +09:00
George Hotz
0abcb9aac2
move more to mixins ( #14780 )
...
* move more to mixins
* revert
* move some
* do not change
* more
* fix tests
* Revert "more"
This reverts commit d942d59fa4 .
* go
* work
* more
* work
* guard
* base
2026-02-16 11:35:00 +08:00
qazal
8e7c5f5b09
remove Tensor.training = True in test_arange ( #14781 )
2026-02-16 11:19:42 +09:00
kevvz
33b2ade8cd
Rdna4 emulator test_ops, dtypes pass ( #14773 )
...
* test_ops, test_dtypes pass
* merge cdna4
* ruff + more tests
* reorganize
* /backend
* again
* again...
* add rdna4
2026-02-16 10:13:39 +08:00
qazal
156b6cb7e4
native bf16 cast in cdna4 ( #14574 )
...
* native bf16 cast in cdna4
* don't need contig backward
* simpler
* contig bw still wins in those cases
2026-02-16 10:51:32 +09:00
chenyu
3adb5062c5
clean up assign_to_contiguous [pr] ( #14779 )
...
slice hazard is handled in fix_assign_hazard
2026-02-15 20:45:49 -05:00
George Hotz
bd18217f32
add rdna3/rdna4/cdna4 to testamd ( #14778 )
...
* add rdna3/rdna4/cdna4 to testamd
* test simplify
* ci cleanups
* mergable
* skip slow
2026-02-16 09:45:16 +08:00
George Hotz
ac079e43d7
ElementwiseMixin ( #14777 )
2026-02-16 08:50:47 +08:00
Christopher Milan
9c95a11f90
autogen: handle rocm bump and better error wording ( #14776 )
...
* autogen: handle rocm bump and better error wording
* regen
2026-02-15 19:23:47 -05:00
chenyu
1ded250bbe
remove collapse_nested_assign [pr] ( #14775 )
...
the else branch is dead code, and we can check directly with UPat
2026-02-15 18:04:47 -05:00
chenyu
17db43ab46
remove some contiguous call in frontend ( #14772 )
...
these should work without contiguous
2026-02-15 16:33:56 -05:00
nimlgen
26193cbf9a
nv: prof cpu_access for nvd only ( #14769 )
2026-02-15 21:42:04 +03:00
qazal
33b31d9cd6
tinykittens flash attention dtype fix, add CI ( #14770 )
...
* don't hardcdoe amd device
* add failing tests, ci too
* fix: fix for dtype mixin
* bump to rocm 7.1
---------
Co-authored-by: Woze Parrot <wozeparrot@gmail.com>
2026-02-16 01:15:11 +09:00
chenyu
352845d8cc
update cast to uint tests ( #14768 )
...
result in valid range should work, add intermediate cast to NIRRenderer since it's UB for [128, 256)
2026-02-15 10:55:13 -05:00
qazal
ceccc8eb86
unskip now passing multi tests [pr] ( #14759 )
2026-02-15 20:30:00 +09:00
George Hotz
713143a46a
more mixins pt 2 ( #14765 )
...
* more mixins pt 2
* lil cleanups
2026-02-15 17:57:04 +08:00
qazal
9da7f5e733
disable process replay for AMD emulator renderer [pr] ( #14766 )
...
* disable process replay for AMD emulator renderer [pr]
* line
* skip
2026-02-15 18:52:37 +09:00
George Hotz
9759fd6193
dtype mixin ( #14763 )
...
* dtype mixin
* dtype mixin methods
2026-02-15 16:03:48 +08:00
qazal
42b6bf0b7a
fix sdpa causal failing test on multi ( #14762 )
...
* simple failing test
* device is from xq
2026-02-15 16:54:33 +09:00
George Hotz
8091661df3
more more to mixins ( #14761 )
2026-02-15 15:18:37 +08:00
George Hotz
0e215c433d
remove hack from cast ( #14760 )
...
* remove hack from cast
* skip tests
* linters to 3.12, another skip
* fix rand
* m_
2026-02-15 13:56:38 +08:00
George Hotz
d176af6269
start outerworld call test, fix gate ( #14758 )
2026-02-15 12:35:01 +08:00
qazal
9bb6014900
keep existing profile trace in viz cli ( #14757 )
2026-02-15 13:16:32 +09:00
chenyu
ca68037f26
lazy basic setitem to unrealized Tensor ( #14756 )
...
undo the view and make it a mask, this fuses the setitem with any pending compute too.
one behavior change is that for target not backed by a buffer (const and arange), rangeify makes output contiguous under the hood.
this is stricter better than raise and ask user to call contiguous, as that would no longer be fuse-able.
2026-02-14 20:27:03 -05:00
George Hotz
32980c74d1
hotfix: skip flaky tests, looped many times on tinymac3
2026-02-15 07:46:29 +08:00
chenyu
902dc7c09c
fix test_numpy_parity_and_backward_2d ( #14755 )
...
test setup issue, test failed locally with `RUN_SLOW=1`
2026-02-14 17:59:00 -05:00
chenyu
043f5dbfa0
fix write-after-read tracking ( #14754 )
...
AFTER-AFTER was silently dropped, which breaks write-after-read
2026-02-14 17:23:05 -05:00
chenyu
d79c63a0ff
test_multi_step_assign_read_write_same_buffer ( #14752 )
...
pattern in LAMB that can be off subtly
2026-02-14 16:39:08 -05:00
chenyu
95f4c7e90a
fix limit_bufs to not limit index ( #14751 )
...
index is not real buffer. also made MAX_KERNEL_BUFFERS a ContextVar
2026-02-14 16:00:03 -05:00
chenyu
0ce4a55dad
clean up test_setitem_slice ( #14750 )
...
moved to test_setitem_schedule, and use contiguous zeros as scheduler handles empty differently now
2026-02-14 14:29:16 -05:00
chenyu
8f6772fd8c
more setitem kernel mem tests ( #14749 )
...
* more setitem kernel mem tests
test only the slice is accessed
* update
2026-02-14 11:01:03 -05:00
chenyu
446909fb7a
more setitem kernel tests ( #14748 )
...
check where realize happened
2026-02-14 09:57:46 -05:00
nimlgen
4ab51b55bd
stream pma decoder ( #14746 )
2026-02-14 17:40:18 +03:00
nimlgen
e1a18dadae
fix devices for copies ( #14747 )
...
* fix devices for copies
* add test
2026-02-14 17:39:41 +03:00
George Hotz
e35bd960e8
Revert "use zip_extract and tar_extract in torch load ( #14734 )" ( #14745 )
...
This reverts commit 9d9ef81608 .
2026-02-14 13:24:01 +08:00
Christopher Milan
eaa9506a00
disallow subnormals in emulated test_dtype ( #14744 )
2026-02-14 00:11:57 -05:00
Bautista Garcia
9d9ef81608
use zip_extract and tar_extract in torch load ( #14734 )
...
* faster zip_extract + usage in torch load
* clean zip in torch load
* working zipextract in torchload
* tar_extract in tar path
* faster tar path
* tests passing, cleanup needed
* faster tar with 1MB buffer
* comments
* unify storage_source with all paths
* use bufferedreader in zip path
* fix ruff
* clean
* removed unnecessary string conversion
2026-02-14 12:57:28 +08:00
qazal
c88bb075f0
hotfix: correct way to get renderer arch ( #14743 )
2026-02-14 12:38:20 +08:00
George Hotz
f9d2eca91a
clean up amd/elf.py ( #14741 )
2026-02-14 12:09:05 +08:00
qazal
6dc7ea58fd
make flash attention tests run on DEV=NULL EMULATE=AMD_CDNA4 ( #14742 )
...
* make flash attention tests run on DEV=NULL EMULATE=AMD_CDNA4
* no if CI, this is just the arch
2026-02-14 12:24:37 +09:00
George Hotz
e8bd432bf6
move amd emulator out of tree ( #14740 )
...
* move amd emulator out of tree
* move the readme too
2026-02-14 10:32:00 +08:00
chenyu
dca7819f76
more setitem into unrealized tests ( #14737 )
...
* more setitem into unrealized tests
into empty, const with alu, and arange
* typo
2026-02-13 20:28:51 -05:00
chenyu
9f607cf84f
disk setitem does not need realize either ( #14736 )
...
disk base is a COPY and is_realized is always False for now, disk assign is still eager
2026-02-13 12:57:58 -05:00
chenyu
8b205a007e
lazy setitem for realized target ( #14735 )
2026-02-13 12:20:14 -05:00
nimlgen
3bee6638e3
external_test_hive_reset ( #14729 )
...
* external_test_hive_reset
* add fault
2026-02-13 19:08:36 +03:00
nimlgen
7d88626068
nv: fix pma_bytes to be system memory ( #14733 )
2026-02-13 17:55:46 +03:00
George Hotz
c0fe78f73b
BUG: metadata is lost with partial assign ( #14732 )
2026-02-13 21:35:21 +08:00
qazal
d0543063dd
viz: wave color is locally scoped ( #14728 )
2026-02-13 18:22:20 +09:00
nimlgen
ba67425680
am: reset mi300 with pm4 ( #14727 )
2026-02-13 11:22:32 +03:00
George Hotz
c0de4f75b1
improve mmapeak, print names with sqtt ( #14726 )
2026-02-13 16:07:06 +08:00
George Hotz
5289b4e882
renderer/amd: add cdna emulator ( #14721 )
...
* renderer/amd: add cdna emulator
* fixes
* no predecode
* no early
* REMU_PATH
* delete that
* round
* Fix cache invalidation check in _compile_smem
2026-02-13 16:06:58 +08:00
Christopher Milan
08a555c875
skip test_expand_buffer_before_cast on WEBGPU metal ( #14724 )
2026-02-13 00:01:05 -05:00
Christopher Milan
7993f3a277
autogen: use snapshot.debian.org for linux src ( #14718 )
2026-02-12 23:36:38 -05:00
wozeparrot
0613c0ac0c
hipkittens fa forward ( #14692 )
2026-02-12 20:16:43 -08:00
chenyu
50cb40be88
clean up test/null/test_indexing.py ( #14720 )
2026-02-12 22:36:53 -05:00
qazal
5b624b5e93
viz: better error message for out of range timestamps ( #14722 )
...
* test_timestamp_out_of_range
* rel_ts helper
* linter
2026-02-13 12:13:40 +09:00
George Hotz
4088d686b2
remove llvm requirement from amd ( #14717 )
...
* remove llvm requirement from amd
* tests pass
* test
* sink kernarg_size
* move stuff
* amd_asm_matmul to new style
* default type
* fix tests, simpler
* cu mode is faster and simpler
* darken
2026-02-13 10:50:12 +08:00
chenyu
9e33a08adb
use more pad_to and shrink_to in tensor.py ( #14719 )
...
good wins
2026-02-12 20:10:57 -05:00
George Hotz
d3adb8428e
Revert "hotfix: skip test/amd in macpytest" ( #14704 )
...
* Revert "hotfix: skip test/amd in macpytest"
This reverts commit b7dade2adf .
* no llvm subprocess
* simpler
* sys.exec
* cleanup
* process safe
* diag
* arm ftz support
* 5 sec
* this one
2026-02-13 08:00:24 +08:00
Christopher Milan
d4bc5ab609
autogen: download linux sources ( #14714 )
2026-02-12 18:50:50 -05:00
Christopher Milan
084d0d0103
cleanup macos webgpu tests ( #14715 )
2026-02-12 17:56:34 -05:00
Christopher Milan
c30bb0f006
fix WEBGPU isnan check ( #14711 )
2026-02-12 17:01:18 -05:00
chenyu
9b3b597423
minor getitem cleanups ( #14713 )
2026-02-12 16:54:54 -05:00
chenyu
787998fac3
fix getitem tensor indexing detection ( #14712 )
...
issue with sint
2026-02-12 16:04:37 -05:00
chenyu
86352988d8
update test_uops_stats for setitem ( #14710 )
...
realize both full tensor and the slice should not add to global_mem
2026-02-12 12:26:13 -05:00
chenyu
56caf6a3a2
fix Estimate.from_uops for sliced access ( #14695 )
...
"assume all DEFINE_GLOBAL memory is accessed" is wrong for partial load. get accessed accumulated from INDEX, then cap at full size. now mem_est never exceeds lds_est
2026-02-12 11:18:07 -05:00
chenyu
8551fa50d3
support bitcast in sym_infer ( #14708 )
...
fixed `DEBUG=2 DEV=WEBGPU python -m pytest test/backend/test_tensor_variable.py::TestTensorVariable::test_symbolic_pad`
2026-02-12 10:21:05 -05:00
chenyu
212789e31e
fix long_decomp with None tag ( #14707 )
...
fixed `DEBUG=2 WEBGPU=1 python -m pytest test/null/test_tensor.py::TestIdxUpcast::test_int64_unsupported_overflow_sym`
2026-02-12 09:31:52 -05:00
chenyu
557134e1c7
model/test fix that failed with WEBGPU=1 DEBUG=2 ( #14706 )
2026-02-12 09:08:16 -05:00
nimlgen
10c94d2c2d
amd: print more info about device hang ( #14705 )
2026-02-12 15:34:08 +03:00
nimlgen
b376bd7a21
jit: fix raw in same kernel ( #14699 )
...
* jit: fix raw in same kernel
* fix
* ugh
* x
* simpler
2026-02-12 15:33:32 +03:00
George Hotz
19e68a1833
skip AMD on not AMD ( #14703 )
2026-02-12 18:56:54 +08:00
George Hotz
b7dade2adf
hotfix: skip test/amd in macpytest
2026-02-12 18:16:04 +08:00
George Hotz
4680247e35
renderer/amd: move in tree ( #14702 )
...
* renderer/amd: move in tree
* fix paths in tests
* 24000 lines
* no delete for amd files
2026-02-12 18:09:16 +08:00
George Hotz
d5fc3ea1ba
assembly/amd: mypy+ruff passes ( #14701 )
...
* assembly/amd: mypy+ruff passes
* touchups
2026-02-12 16:59:42 +08:00
George Hotz
095a064ba8
test.yml explicitly says backend ( #14700 )
...
* test.yml explicitly says backend
* 1e-5
2026-02-12 16:03:44 +08:00
nimlgen
14a1991da6
viz: sort tracks in timeline ( #14591 )
...
* viz: sort devices in timeline
* fix
* rev
* upd
* skip
2026-02-12 10:51:41 +03:00
George Hotz
025049c521
clean up sqtt / update src formatting in viz ( #14696 )
...
* update src formatting in viz
* rename to RDNA3/RDNA4 in sqtt
* wrap
* move sqttmap
* update readme
* why did that change?
* cdna
* that's just for test
2026-02-12 14:27:14 +08:00
Christopher Milan
b1a3876492
IMAGE=1 supports FLOAT16=1 ( #14693 )
...
requires 2d indexing to be actually fast
2026-02-12 00:30:55 -05:00
George Hotz
befc1e800c
assembly/amd: disasm is test only ( #14694 )
...
* assembly/amd: disasm is test only
* viz uses str
2026-02-12 12:33:46 +08:00
George Hotz
c331798201
move tests to test/backend ( #14691 )
...
* move tests to test/backend
* fix imports
* fix CI
* revert that one
* Fix formatting in README for test command
2026-02-12 11:09:44 +08:00
wozeparrot
4b5d3bda1f
llama3: data seed ( #14681 )
2026-02-11 19:04:40 -08:00
chenyu
0c63f63ee4
recursive resolve assign dependency ( #14688 )
...
remove the .realize in llm.py
2026-02-11 17:41:05 -05:00
nimlgen
869083e373
nv: pciiface pma ( #14686 )
...
* x
* w
* z
* clean
* o
* r
* x
* c
* r
* list
* deanon
* b
2026-02-11 23:29:07 +03:00
chenyu
cbbc2fdea5
update test_assign_slice_then_read ( #14687 )
...
passes locally now
2026-02-11 15:02:44 -05:00
chenyu
7465b22ba0
handle setitem target in rangeify ( #14685 )
2026-02-11 11:38:59 -05:00
chenyu
0d215b962e
few setitem test cases diff from numpy ( #14684 )
...
have claude fuzzed frontend and found some real bugs
2026-02-11 08:41:03 -05:00
nimlgen
df8b21eeb5
add real self assign test ( #14683 )
...
* self assign fix
* no
2026-02-11 12:41:53 +03:00
wozeparrot
a60220bed9
llama3: move dl to numpy & jit more ( #14677 )
...
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2026-02-10 18:16:40 -08:00
George Hotz
4565958792
some lil speedups ( #14679 )
2026-02-11 10:01:58 +08:00
George Hotz
2d4ad9e739
add a waitlist for graph rewrite ( #14678 )
...
* add a waitlist for graph rewrite
* cleaner
* one context on spec check
2026-02-11 09:30:13 +08:00
Christopher Milan
389e2eeda1
Revert "transcendental works with long decomp" ( #14676 )
2026-02-10 19:46:34 -05:00
Christopher Milan
0662c8037d
transcendental works with long decomp ( #14672 )
2026-02-10 19:30:24 -05:00
George Hotz
3fab43c57c
add cache to asm gemm ( #14675 )
2026-02-11 08:26:30 +08:00
chenyu
ebef63dba0
update test_self_assign_same_device_copy ( #14673 )
...
that test would have passed without the optimization because .to shortcut
2026-02-10 17:23:43 -05:00
nimlgen
aafa9dcb5b
eliminate same-device copy self-assigns ( #14671 )
...
* eliminate same-device copy self-assigns
* ugh
2026-02-10 22:54:51 +03:00
chenyu
494eec2694
test_setitem_const_fused ( #14668 )
...
did not realize #14640 also fixed #10690 , so added a test for it
2026-02-10 08:33:02 -05:00
nimlgen
42ded7c34d
amd: bind aql ( #14666 )
...
* amd: bind to aql
* bind
* x
* f
2026-02-10 16:28:11 +03:00
George Hotz
82974929b7
use PARAM in schedule ( #14665 )
...
* use PARAM in schedule
* create_new_buffer
2026-02-10 19:18:40 +08:00
George Hotz
8dc46dde07
everything has dtype.long now ( #14661 )
...
* everything has dtype.long now
* int64/uint64 are everywhere now
* that doesn't work
2026-02-10 15:08:50 +08:00
Christopher Milan
cdb78954cb
better cl compiler name ( #14660 )
...
cl_compiler instead of compiler because overriding Compiled.compiler seems more confusing
2026-02-10 01:03:46 -05:00
George Hotz
cc9bf8ccbc
move more to null/unit tests ( #14658 )
...
* move more to null tests
* move test_gc
* no test fusion op
2026-02-10 13:35:17 +08:00
chenyu
83f6d28579
two less realize in setitem ( #14655 )
2026-02-09 23:45:24 -05:00
wozeparrot
69574542ab
fix: use correct fa implementation in eval ( #14651 )
2026-02-09 18:20:44 -08:00
chenyu
0dedf4063c
minor test_setitem cleanup ( #14654 )
2026-02-09 20:40:29 -05:00
Christopher Milan
b36b62eb59
don't push docker cache for PRs ( #14652 )
2026-02-09 19:55:55 -05:00
Christopher Milan
e6562a5061
remove CompilerPair ( #14638 )
2026-02-09 19:51:18 -05:00
Christopher Milan
396e1320fb
bump cache version for z3 ( #14650 )
2026-02-09 19:32:07 -05:00
chenyu
9e3f24db9f
assign realize fix ( #14649 )
...
fix the need for explicit assign. track pending assigns for each buffer, and run those before the main realize in order
2026-02-09 17:46:46 -05:00
chenyu
0913c068ea
clean up setitem disk path ( #14648 )
2026-02-09 15:58:04 -05:00
chenyu
205a1212b7
delegate non Tensor src setitem to assign ( #14647 )
...
cannot do this for DISK in the unified path
2026-02-09 13:53:20 -05:00
chenyu
e9f40f49d4
explicitly check advanced setitem ( #14644 )
...
advanced setitem DISK would failed in rangeify with bad error, now it's checked directly in setitem. eventully DISK can use regular setitem path
2026-02-09 13:36:46 -05:00
chenyu
20a132b1c4
relax atol for test_uop_scan_matmul ( #14646 )
...
flaky, also log max diff
2026-02-09 13:25:19 -05:00
qazal
50d3f6cea5
EVAL_BS=0 in llama profile ( #14643 )
2026-02-10 00:49:43 +09:00
chenyu
8a2c23d3dc
raise RuntimeError for setitem dtype mismatch ( #14642 )
2026-02-09 10:37:08 -05:00
qazal
80b0119cef
llama: add new asm gemm shape ( #14611 )
...
* llama: add new asm gemm shape
* work
* cleanup
* half dtype
* more comment
2026-02-10 00:34:29 +09:00
chenyu
a49e038c0c
dont manually broadcast in setitem ( #14641 )
...
handled by assign
2026-02-09 09:34:09 -05:00
chenyu
2c3e3559eb
remove a contiguous in basic setitem ( #14640 )
...
handled in rangeify
2026-02-09 09:19:46 -05:00
chenyu
6c0c8e2ac3
setitem push a realize to basic setitem ( #14637 )
...
advanced setitem does not need it
2026-02-09 08:54:07 -05:00
nimlgen
e087c58ae0
print tables in llama/profile.sh ( #14639 )
2026-02-09 12:32:54 +03:00
Christopher Milan
27f7ea478b
new style DSP renderer ( #14636 )
...
* new style DSP renderer
* cleanup
2026-02-09 00:39:03 -05:00
Christopher Milan
efac5b9ef6
new style NV/CUDA renderers, try 2 ( #14634 )
...
* new style NV/CUDA renderers, try 2
* fix diskcache
2026-02-08 22:58:48 -05:00
Christopher Milan
0ebb508b85
new style metal compiler ( #14632 )
2026-02-08 21:58:25 -05:00
Christopher Milan
9eef9f38ad
new style python renderer ( #14631 )
2026-02-08 21:45:07 -05:00
Christopher Milan
5f2f2cc956
Revert "new style NV/CUDA renderers ( #14627 )" ( #14633 )
...
This reverts commit 0e505951b0 .
2026-02-08 21:16:03 -05:00
Christopher Milan
4ad787ece2
new style CPULLVMRenderer ( #14629 )
2026-02-08 21:05:01 -05:00
Christopher Milan
0e505951b0
new style NV/CUDA renderers ( #14627 )
...
* new style NV/CUDA renderers
* fix pickle
* oops
* fix CUDA_CC=NVCC
* mockgpu uses PTXCompiler
* oops
* ruff
* dont discard stderr
* ugh
2026-02-08 21:04:51 -05:00
Filip Brzek
1667669c46
fix: python3 -m tinygrad.device reporting on AMD/CPU ( #14622 )
...
* test: device module expects PASS in -m tinygrad.device for CPU
* fix: use device._compiler_name instead of unwrap_class_type(compiler).__name__ in enumerate_devices_str
2026-02-08 20:22:35 +03:00
nimlgen
01a4ee4d66
do not hive_reset when amdgpu ( #14624 )
2026-02-08 19:14:13 +03:00
nimlgen
a615b9d781
am: f8_mode for gfx94x only ( #14620 )
2026-02-08 17:38:48 +03:00
chenyu
c28f7d0167
remove realize in Tensor.svd ( #14623 )
2026-02-08 09:36:31 -05:00
qazal
087dab4c3b
gemm/asm: split out cdna tests from CI ( #14619 )
...
* gemm/asm: split out cdna tests from CI
* reorder
* work
2026-02-08 21:33:42 +09:00
George Hotz
183d38b128
remove CUSTOM_KERNEL / directly construct it ( #14604 )
...
* remove CUSTOM_KERNEL / directly construct it
* clean that up
* simpler multi
* custom kernel spec
* remove Kernel
* fix multi
* use sharded shape
* explicit regression test
2026-02-08 18:43:33 +08:00
nimlgen
e29a88ca09
hive_reset respects lock ( #14618 )
2026-02-08 10:47:25 +03:00
qazal
b10802eb53
use existing VIZ ContextVar instead of getenv ( #14610 )
2026-02-08 15:37:55 +09:00
chenyu
510b65489e
style change rangeify assign [pr] ( #14616 )
...
consistent naming, also a standalone fucntion to replace complicated lambda
2026-02-07 15:47:32 -05:00
chenyu
b7afd4471c
use arg instead of 3rd op for ASSIGN [pr] ( #14613 )
2026-02-07 14:17:10 -05:00
nimlgen
88c3022223
amd: kfd iface early exit ( #14612 )
...
* amd: kfd iface early exit
* l
* revert
2026-02-07 18:57:10 +03:00
nimlgen
ce7bfc6ce8
nv: use nv_flags for all fields ( #14607 )
2026-02-07 15:01:38 +03:00
qazal
c2544e2252
viz: remove outdated comment ( #14608 )
2026-02-07 20:05:24 +09:00
nimlgen
6838b35cff
mockgpu: hevc ( #14606 )
...
* mockgpu: hevc
* eng
2026-02-07 12:27:55 +03:00
chenyu
884592f6c8
pin z3-solver version ( #14605 )
...
found exact input that crashes z3 4.15.4
2026-02-06 22:49:31 -05:00
George Hotz
7a2a3b5c71
Remove Ops.KERNEL, it's all Ops.CALL now ( #14603 )
2026-02-07 10:21:54 +08:00
George Hotz
ca6604eae2
kernel is call ( #14577 )
...
* call is kernel
* closer
* fix bugs
* dedup
* pm_gate_kernel_sink
* better
* Revert "better"
This reverts commit b4c799b810 .
* Reapply "better"
This reverts commit e53f094ce7 .
* cleanups
* work
* remove junk
* subtle fix
* index
* viz cleanups
* disable assert for now
2026-02-07 10:10:14 +08:00
wozeparrot
d87ae1c84c
feat: tinyfs load test in benchmark ( #14602 )
2026-02-06 18:00:00 -08:00
ttomsa
462b455562
cleanup linearize ( #14523 )
2026-02-07 08:54:02 +08:00
ttomsa
d5652e4da2
new dtype aliases ( #14596 )
2026-02-07 08:53:35 +08:00
Christopher Milan
ad9e2f0de7
decompose bf16 ( #14601 )
2026-02-06 19:24:09 -05:00
Christopher Milan
7bb45e7df0
decompose fp8 to bigger floats [skip_process_replay] ( #14554 )
...
* decompose fp8 also
* it works
* cleanup
* no shift required
* default to float
* cleanup
* fixes
* fp8e5m2
* don't rely on behavior comparing nans
* cleanup
2026-02-06 19:05:40 -05:00
chenyu
81f6cdb4ab
delete realize_assign [pr] ( #14575 )
...
use realize and realize_srcs like COPY and STORE. src[0] always has BUFFER for base
2026-02-06 17:12:33 -05:00
chenyu
7d193a6e26
fix wgsl bitcast ( #14600 )
...
was wrong for signed int
2026-02-06 16:57:36 -05:00
chenyu
b9fe8b7591
fix opt in process replay [pr] ( #14599 )
2026-02-06 16:49:56 -05:00
chenyu
197ebcbbbc
log seed with flush=True in fuzz_symbolic ( #14597 )
...
* log seed with flush=True in fuzz_symbolic
i think z3 can crash. added reading seed from argv to see if we repro later
* fuzz_symbolic_symbolic_div
2026-02-06 15:03:57 -05:00
nimlgen
fbb67a3f95
am_smi: fix after regen ( #14594 )
2026-02-06 20:57:41 +03:00
qazal
a80fb4e641
viz: better ordering of device engines in profiler ( #14590 )
2026-02-06 23:08:09 +09:00
qazal
b7e3fbe07e
llama: add VIZ=-1 to dev_run ( #14583 )
...
* llama: add VIZ=-1 to dev_run
* readme
* cleaner
* add profile.sh script
* better grouping of options
* add other row
* readme edits
* work
2026-02-06 22:59:22 +09:00
nimlgen
fbeb978170
diff devices for sdma ( #14589 )
...
* start
* x
* fix
* sdma
* c
* clean
* x
* hm
* cleaer
2026-02-06 16:39:12 +03:00
George Hotz
7cb996e153
bottom up earliest rewrites ( #14587 )
...
* better
* bottom up earliest rewrites
* fix
2026-02-06 18:13:07 +08:00
George Hotz
03af2404e2
small changes and test fixes from kernel is call ( #14586 )
2026-02-06 17:08:33 +08:00
George Hotz
3c26ce29b2
make disk tensor tests process safe ( #14584 )
2026-02-06 15:39:55 +08:00
qazal
cf73d7e2a7
hotfix: disable slower asm gemm shape from llama seqlen 8192 ( #14582 )
2026-02-06 15:05:19 +09:00
qazal
be77873974
llama: contig backward for wk / wv matmul backward ( #14581 )
2026-02-06 14:54:00 +09:00
chenyu
15d3344d9e
use int inputs in test_assign ( #14580 )
...
int is less flaky
2026-02-06 00:07:31 -05:00
qazal
50a166a5fa
viz: cleanup amdgpu target mapping ( #14579 )
...
* viz: cleanup amdgpu target mapping
* linter
* unwraps
2026-02-06 13:51:51 +09:00
chenyu
b09dc646f5
revert some late_buffer_view change ( #14578 )
...
revert #14478 which breaks tinyfs
2026-02-05 22:51:40 -05:00
chenyu
d41836f135
remove KERNEL special case in realize_assign [pr] ( #14573 )
2026-02-05 21:55:44 -05:00
George Hotz
6cbcf98627
KernelInfo is required on get_program ( #14571 )
...
* rangeify always adds KernelInfo
* fix tests
* skip flaky test
2026-02-06 10:49:27 +08:00
George Hotz
28c56a783c
add CallInfo and viz call toggle ( #14570 )
2026-02-06 09:30:58 +08:00
wozeparrot
f73468d516
fa: block skipping for fa kv bwd ( #14569 )
2026-02-05 16:13:53 -08:00
chenyu
b7ef775677
more cleanup in create_schedule [pr] ( #14566 )
...
fixed wrong comments and simplified queue building
2026-02-05 16:12:17 -05:00
Garret Castro
cee7ef7ab2
disable threads ( #14555 )
2026-02-05 16:11:32 -05:00
chenyu
79b7799dba
clean up linearize schedule [pr] ( #14565 )
...
* clean up linearize schedule [pr]
don't mix ScheduleItem and UOp in schedule queue
* ok
2026-02-05 15:24:09 -05:00
chenyu
41a179f542
fix test_xlm_roberta_large ( #14564 )
...
onnxruntime does not allow symlink that's outside model dir. update snapshot_download to use local_dir instead of cache_dir. some ad hoc migration step to copy the existing model too
2026-02-05 14:56:06 -05:00
Christopher Milan
aa9dc50577
dtype decomps don't require bitshifts ( #14542 )
...
* dtype decomps don't require bitshifts
* simplify shr/shl
* ruff
2026-02-05 14:42:30 -05:00
Christopher Milan
b47397ab17
list ml_dtypes as dependency for DSP ( #14562 )
...
* pin onnxruntime to 1.23.2 for DSP
* list ml_dtypes instead
This reverts commit 84bb2cc0fc .
2026-02-05 14:27:50 -05:00
chenyu
2b47a9a1b5
skip test_xlm_roberta_large ( #14563 )
...
symlink model not allowed in latest onnxruntime
2026-02-05 14:00:24 -05:00
chenyu
42c18da88a
add Ops asserts in toposort sched_sink [pr] ( #14561 )
...
more explicit
2026-02-05 12:40:02 -05:00
nimlgen
483bba4f05
nv: use prof_exec_counter ( #14559 )
2026-02-05 19:00:14 +03:00
qazal
190042358f
llama: faster bf16 matmul / rope backward ( #14558 )
2026-02-05 23:57:25 +09:00
George Hotz
b398335f62
assembly/amd: fix saturation in python remu ( #14557 )
...
* PYTHONREMU: failing test for V_SUB_NC_U32_E64 clamp
* fix saturation in PYTHON_REMU
* simpler
* more tests, less lines
---------
Co-authored-by: Christopher Milan <chrismilan@ucla.edu>
2026-02-05 18:35:57 +08:00
wozeparrot
c1ea6687e5
fa: simpler is faster ( #14548 )
2026-02-05 01:13:17 -08:00
George Hotz
43e7eda4e7
grad_b uses custom gemm ( #14550 )
...
* grad_b uses custom gemm
* fix multi backward, acc is in float32
* test_gemm_batched
* square gemm
---------
Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
Co-authored-by: qazal <qazal.software@gmail.com>
2026-02-05 15:22:27 +09:00
qazal
f9cfb64cd9
test asm_gemm in CI ( #14551 )
...
* test asm_gemm in CI
* default float16
* use a smaller shape for multi
* smaller size
* smaller for CI
* smaller for ci
* need half
2026-02-05 13:32:22 +09:00
chenyu
c0ca7f9c51
use more UOp.sum and UOp.prod [pr] ( #14549 )
2026-02-04 22:05:20 -05:00
chenyu
e8dace41b6
clean up UOp.vars [pr] ( #14547 )
2026-02-04 20:52:25 -05:00
Christopher Milan
232848d086
PYTHONREMU: VOP3P integer operations with constants don't cast to fp16 ( #14546 )
...
* PYTHONREMU: VOP3P integer operations with constants don't cast to fp16
* put that back
* cleaner
* do that once
2026-02-04 20:10:59 -05:00
wozeparrot
2966619834
feat: llama uses enable_gqa during training ( #14545 )
2026-02-04 16:22:31 -08:00
chenyu
664f1bf76d
minor ops/jit cleanups [pr] ( #14543 )
2026-02-04 17:21:34 -05:00
chenyu
03d0fa9c3f
merge as_buf into buf_uop [pr] ( #14541 )
2026-02-04 16:32:23 -05:00
chenyu
43ef24a8af
remove buf_target [pr] ( #14540 )
...
not really needed
2026-02-04 15:03:47 -05:00
chenyu
8b7343b950
clean up is_realized [pr] ( #14538 )
...
base cannot be Ops.MULTI since MULTI is a view now
2026-02-04 14:24:10 -05:00
Christopher Milan
5338ce6b74
test S_PACK in extra/assembly/amd/test/hw ( #14537 )
...
* S_PACK_LL_B32_B16 in test/hw
* add rest of S_PACK instructions
2026-02-04 14:17:16 -05:00
chenyu
9052db678f
remove allow_shape_mismatch in Tensor.replace ( #14536 )
...
move all logic to torch_backend and not hacking Tensor method
2026-02-04 12:38:18 -05:00
nimlgen
ec2b6bbda8
hcq: update signal logic ( #14531 )
2026-02-04 19:32:56 +03:00
nimlgen
62786d488a
am: mi3xx perf ( #14529 )
2026-02-04 19:32:43 +03:00
chenyu
d57d24c7d4
Buffer.as_buffer -> Buffer.as_memoryview [pr] ( #14535 )
...
it casts to memoryview. also inline the as_typed_buffer checks to Tensor._data
2026-02-04 11:31:11 -05:00
chenyu
024f57ecf5
jit input_buffers cleanup [pr] ( #14532 )
2026-02-04 10:14:38 -05:00
chenyu
67f91e897b
UOp.is_contiguous -> UOp.has_buffer_identity [pr] ( #14530 )
...
one more confusing buffer related method, but it's definitely not is_contiguous
2026-02-04 09:21:26 -05:00
George Hotz
fb9df1e031
pretty print binary ( #14520 )
2026-02-04 18:04:35 +08:00
Christopher Milan
8c3c026d86
decomp float16 to float32 ( #14417 )
...
* decomp float16 to float32
* denormals arent zero
* add test
* denormals are zero
* fix
* oops
* bitcast works
* fix LOADs
* test_dtype passing
* cleanup
* mypy
* debug print
* only emulate if EMULATED
* very ugly, but passes spec
* add test_dtype_alu tests
* Revert "very ugly, but passes spec"
This reverts commit fdc3999b654d630678bf208927ab2f55e026b4ca.
* bottom up decompositions
* that should have symbolic
* simplify a bit
* SPEC really works
* run with DEBUG
* debug=4
* rm debug
2026-02-04 01:37:47 -05:00
Christopher Milan
ecbce5269e
PYTHONREMU properly supports S_PACK_LL_B32_B16 ( #14527 )
...
* PYTHONREMU properly supports S_PACK_LL_B32_B16
* default
2026-02-03 23:45:33 -05:00
wozeparrot
720c9597a9
feat: llama uses is_causal on sdpa during training ( #14528 )
2026-02-03 20:24:30 -08:00
chenyu
9c2fc118ef
relax setitem target check ( #14526 )
...
old check was too conservative
2026-02-03 22:32:49 -05:00
qazal
d1bfbe9ce3
isolate slow llama gemm ( #14525 )
2026-02-04 12:20:10 +09:00
nimlgen
2f55005ad9
qcom: sync cpu cache when from_blob ( #14518 )
...
* um
* fx
* d
* x
* x
* x
* x
* f
* ren
2026-02-03 21:51:03 +03:00
chenyu
ee9d6a1f36
remove DEFINE_VAR in to_define_global [pr] ( #14522 )
...
not needed
2026-02-03 10:12:33 -05:00
Nino Risteski
af4c74bb41
delete extra cast ( #14517 )
2026-02-03 08:29:04 -05:00
chenyu
9d1e9e643e
removed a duplicated remove_bufferize rule [pr] ( #14519 )
2026-02-03 08:28:07 -05:00
George Hotz
d59e6e7a37
move more tests to test/null, split some existing ones ( #14512 )
...
* move more tests to test/null, split some existing ones
* null work
* null work
* move more
* fixes
* move PIL
* PIL in CLIP
* don't move that
2026-02-03 20:20:20 +08:00
qazal
a98c53769a
ASM_GEMM=1 runs the UOp gemm on non cdna ( #14516 )
...
* ASM_GEMM=1 runs the UOp gemm on non cdna
tests run on mac in 3 seconds
* min diff
2026-02-03 20:42:02 +09:00
qazal
5c1d21349e
viz: profiler command line tool ( #14515 )
2026-02-03 19:51:25 +09:00
George Hotz
dd2de4f838
rename all DEFINE_GLOBAL to PARAM ( #14511 )
2026-02-03 15:09:38 +08:00
George Hotz
dc77b3318b
move files that pass with NULL=1 to test/null ( #14508 )
...
* move files that pass with NULL=1 to test/null
* fix windows
* cpu 0
* bugfix + durations
2026-02-03 13:52:36 +08:00
George Hotz
888819ee09
call autodiff gradient ( #14510 )
2026-02-03 13:51:02 +08:00
wozeparrot
bbcd3d67a3
fa: faster ( #14453 )
2026-02-02 21:34:17 -08:00
Christopher Milan
e579613b90
IR3 has aux ( #14509 )
2026-02-02 23:46:41 -05:00
George Hotz
85c7b23160
add pytest -nauto to benchmark for mac ( #14458 )
...
* add pytest -nauto to benchmark
* 3 minute timeout
* 3 min
* setup env
* comment
* fresh db
* in the pyenv
2026-02-03 12:26:09 +08:00
Christopher Milan
a5d7eb37db
IR3 works on versions earlier than 3.14 ( #14507 )
2026-02-02 23:10:19 -05:00
George Hotz
33c886cafa
disable copyout on NULL backend by default ( #14506 )
...
* disable copyout on NULL backend
* gate it
* allow copyout on some tests
2026-02-03 11:57:47 +08:00
chenyu
3c5845e8a5
remove cut_store_range ( #14505 )
...
special scheduling for CPU
2026-02-02 21:58:36 -05:00
chenyu
4f2e7aed24
fix multiple REDUCE on same RANGE ( #14504 )
...
each RANGE maps to one END, but reduce_to_acc is local and would not know this
2026-02-02 20:42:09 -05:00
chenyu
93c41a78fa
clean up NOOP [pr] ( #14503 )
...
should not be used as a COPY, started with removing from ALWAYS_RUN_OPS
2026-02-02 19:46:45 -05:00
chenyu
66d2b02f11
delete files that depends on extra.optimization.helpers ( #14499 )
2026-02-02 13:33:33 -05:00
George Hotz
ec0398fceb
test amd gpu crashes ( #14459 )
...
* test amd gpu crashes
* cleanup
* less sketch tests
2026-02-02 18:57:47 +03:00
nimlgen
6e4238c016
amd: recovery ( #14461 )
...
* rec
* ?
* rv
* cleaner
* post merge
* not used
* um
* clnr
* x
* x
* d
* move
2026-02-02 18:57:35 +03:00
chenyu
61ca19ff24
after with empty src is self [pr] ( #14496 )
2026-02-02 10:19:05 -05:00
George Hotz
6e958dbfd4
assembly/amd: add RDNA4 support to emulator ( #14341 )
...
* start new rdna4
* work
* plus works
* more pass
* rdna4
* assembly/amd: fix RDNA4 emulator for float16 and VOP3 clamp
* stale
* rev
* rr
* rdna4 emu tests
* cleanup
* cleanup
* simp
* works
* better factorizaion
* hacks
* fix mockgpu
* guard both
* cleaner
* gate
* bug fix and a few tests
* all test_tiny
2026-02-02 21:35:59 +08:00
chenyu
a908f447d5
remove disk special case in mstack_early_shrink [pr] ( #14494 )
2026-02-02 08:34:45 -05:00
qazal
965940dd00
sqtt: update examples after event field change ( #14493 )
...
* regen sqtt examples
* cdna
* rdna4
* packet counts for rdna3
* sqttmap work
2026-02-02 21:39:48 +09:00
George Hotz
965149a46d
assembly/amd: add ds perm instructions ( #14486 )
...
* assembly/amd: add ds perm instructions
* NO SKIP
* fix preexisting RDNA3 issues
* pcode
* assert
* asserts
* unify
* simp
* good fix
2026-02-02 16:02:00 +08:00
qazal
1746d1f997
remove SPEC=0 context in custom_kernel tests, pyrender always skips it ( #14489 )
2026-02-02 16:32:01 +09:00
George Hotz
d4007f36e0
remove DEFINE_GLOBAL (it is PARAM now) ( #14488 )
2026-02-02 14:56:37 +08:00
qazal
6c487656f9
viz: kernel metadata from rodata entry ( #14487 )
2026-02-02 15:41:42 +09:00
Robbe Derks
d75a1b0d5a
usbgpu: use BOT interface for patch.py ( #13644 )
...
* BOT usage
* cleanup
* fix lint
* fix ruff
* fix -7?
2026-02-02 11:54:46 +08:00
Christopher Milan
2931b52875
skip autogen if MTLCompiler is loaded ( #14466 )
2026-02-01 22:12:27 -05:00
George Hotz
9a32d6e090
add depth limit for SPEC=2 ( #14485 )
...
* make SPEC=2 work for everything
* that's a horrible fix
* add depth limit
2026-02-02 10:43:28 +08:00
George Hotz
368a692e1a
make SPEC=2 work for everything ( #14476 )
...
* make SPEC=2 work for everything
* that's a horrible fix
2026-02-02 10:30:56 +08:00
chenyu
ea1f1d2b9d
test_assign_to_bitcast_view ( #14483 )
...
currently disk allows assign same size dtype into a bitcasted view
2026-02-01 16:46:04 -05:00
chenyu
6deeccc192
fix RING with single dest ( #14482 )
2026-02-01 12:14:46 -05:00
chenyu
3ff390159b
don't implicitly change dtype in assign ( #14481 )
...
broadcast shape is fine, but implicitly cast dtype is hard to find
2026-02-01 11:48:54 -05:00
imaolo
2111762a48
failed test case for RING output device ( #14191 )
...
* Add enable/disable scheduler cache ContextVar
* add allreduce ring and naive to() tests
* clearer test comparing native vs ring allreduce
* split tests, add helper
* removing trailing whitespace
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
2026-02-01 11:48:43 -05:00
chenyu
02afae04f4
atol in test_call_gemm ( #14480 )
...
flaky
2026-02-01 11:24:58 -05:00
chenyu
5705398a1f
assign cleanup [pr] ( #14479 )
...
share more code path between disk and non-disk. also raise RuntimeError instead of Assert for mismatches
2026-02-01 09:10:22 -05:00
chenyu
da500dbe06
simplify late_buffer_view [pr] ( #14478 )
...
check the only allowed Ops in the chain, and offset cannot be negative
2026-01-31 22:38:40 -05:00
chenyu
b4f96301e0
remove unused rules [pr] ( #14477 )
2026-01-31 21:29:30 -05:00
qazal
54e78dbec8
viz: remove hardcoded strings in cfg tests ( #14462 )
2026-02-01 09:30:43 +09:00
chenyu
5d38db9da6
generic bitcast assign ( #14474 )
...
a.bitcast(X).assign(src) -> a.assign(src.bitcast(a.dtype))
2026-01-31 17:29:20 -05:00
chenyu
b38fc43b07
assert assign dtype mismatch for disk [pr] ( #14473 )
...
the disk hack is generally wrong, now force bitcast on the source before assign
2026-01-31 17:08:54 -05:00
chenyu
ced886f26c
failed test case for assign into bitcast ( #14469 )
...
* failed test case for assign into bitcast
DISK assign has custom hack for this. need to fix before we can unify assign
* test_assign_bitcast_different_size
2026-01-31 14:26:47 -05:00
chenyu
81eee5b30a
unused spec [pr] ( #14468 )
...
no BUFFER_VIEW in tensor, and no CONTIGUOUS in KERNEL
2026-01-31 13:53:16 -05:00
nimlgen
f873c7b6c5
amd: fetch_name is file_name ( #14465 )
2026-01-31 20:11:07 +03:00
chenyu
c765641215
remove unused allow_any_len [pr] ( #14464 )
...
STORE has 2 src, RESHAPE has 2 src, BUFFER has 2 src
added some tests for the untested allow_any_len
2026-01-31 11:05:42 -05:00
chenyu
b4f5a51ebb
move tests to unit ( #14463 )
...
test_uop_graph does not need device, test_memory_planner can use NULL
2026-01-31 10:49:31 -05:00
qazal
616e9c1483
CDNA assembly gemm in tensor.py with flag ( #14310 )
...
* work
* work
* the assembly
* remove the old one
* remove ws bufs, assert splitk
* notes cleanup
* work
* gemm args
* gemm in mixins would be nice
* add gemm gradient
* print counters
* the realize is for DEBUG=2 aesthetics
* dedup
* rewrite to python dsl, no list copies
* leave that
* add B, M, N, K to gemm name
* it's M0 not NULL
* fp16 support
* test cleanup + more gemms
* work from viz
* more work
* gemm batch_size
* xccg path work
* tiny comments on the label naming
* s_waitcnt
2026-01-31 22:34:14 +09:00
chenyu
55f806b713
tighter late_buffer_view match [pr] ( #14456 )
...
src must be len 2 at that point
2026-01-31 07:28:26 -05:00
qazal
d69bc5aa1a
make DEV=NULL EMULATE=AMD amd_asm_matmul run ( #14460 )
2026-01-31 20:45:24 +09:00
qazal
4976544bf9
multi ram usage tests on the NULL device ( #14457 )
2026-01-31 14:14:53 +09:00
chenyu
99b44121bc
failed test case for non-consecutive disk read ( #14455 )
...
silently fail now
2026-01-30 23:44:04 -05:00
George Hotz
b705c9143c
assembly/amd: test more instructions ( #14365 )
...
* assembly/amd: test more instructions
* more
* passing
* revert
* no const fold
* remove junk
* cleaner
2026-01-31 12:40:22 +08:00
George Hotz
c9a3ddb341
benchmark llama walltime script ( #14454 )
...
* benchmark llama walltime script
* adj layers
2026-01-31 10:21:54 +08:00
George Hotz
f5346d6a1a
fix USE_ATOMICS for non float dtypes and make it the default ( #14444 )
...
* embedded multistep test
* complex test
* with jit
* fix dtypes and reenable USE_ATOMICS
* that test didn't catch anything
2026-01-31 09:44:16 +08:00
Christopher Milan
e575dd8275
prevent UB in long decomp and more emulated tests ( #14447 )
2026-01-30 19:38:41 -05:00
chenyu
3204f94454
correct var_vals schedule filter ( #14451 )
...
complete_create_schedule_with_vars returns var_vals that's used in schedule
2026-01-30 17:10:07 -05:00
chenyu
cfcd1debb5
test schedule with multiple AFTER ( #14449 )
2026-01-30 15:59:00 -05:00
nimlgen
486d53d646
device: call free for external_ptr ( #14448 )
...
* device: call free for external_ptr
* lin
2026-01-30 23:53:17 +03:00
nimlgen
e0978498dc
amd: read_ptr/write_ptr/doorbells are not lists ( #14445 )
2026-01-30 23:11:57 +03:00
Christopher Milan
1803ee939d
EMULATED_DTYPES=long works with CPU_LLVM ( #14446 )
2026-01-30 13:54:43 -05:00
chenyu
03613e83ad
update TestTensorMetadata ( #14443 )
...
run with SCACHE=0 some more TODOs
2026-01-30 12:39:01 -05:00
George Hotz
cbb1eed57b
hotfix: partial revert of 9eb449f88, caused llama NaN
2026-01-30 17:19:27 +00:00
chenyu
26f5c00265
move TestTensorMetadata to unit ( #14442 )
2026-01-30 12:14:21 -05:00
chenyu
c05a0b85ae
flip unique const src order [pr] ( #14441 )
...
* flip unique const src order [pr]
matches buffer, simplifies replace_input_buffer
* combine rules
2026-01-30 11:44:18 -05:00
George Hotz
ee2c78709d
mlperf/llama: disable USE_ATOMICS for now
2026-01-31 00:42:08 +08:00
chenyu
beecac4d85
expand ranges -> unroll outer ranges [pr] ( #14440 )
2026-01-30 11:26:05 -05:00
chenyu
9eb449f882
clean up toposort sched_sink [pr] ( #14439 )
2026-01-30 10:18:28 -05:00
George Hotz
838cd078bc
use atomics for embedding backward ( #14400 )
...
* embedding is slow
* failing
* float is fine
* null
* it fails
* simplify embedding with broadcasting
* ATOMIC_ADD incoming
* min change
* simpler test
* better test
* fix test
* real test
* simpler
* cleanups
* types and names
* _zero_kernel
* grad multi
* hack
* none
* multi unshard
* more for call
* don't tag in call
* good
* call_multi
* call_multi wow claude is useless
* embedding backward mutli test
* test passes
* fix as_param
* shape_to_shape_arg
* add clip
* before cast
* fix spec=2, use atomics
2026-01-30 18:10:59 +08:00
nimlgen
1998e0bb28
nv: add prof props to dev ( #14437 )
2026-01-30 12:51:43 +03:00
George Hotz
7a9dee4e50
add call/param UOps ( #14433 )
...
* add call/param UOps
* resolve call
* skip that for now
* grad on call
* fix tests
2026-01-30 14:51:45 +08:00
qazal
66d6a68016
viz: sqtt work from cdna gemm ( #14434 )
...
* it's the tag
* initialize rows based on the disasm
* test_cfg with Ops.BINARY
* pyremu wants s_code_end?
* test_diamond
* diff cleanup
2026-01-30 14:00:56 +09:00
Christopher Milan
88caf57ef4
ci: unify python versions ( #14430 )
2026-01-29 21:42:03 -05:00
chenyu
86a204d22a
allow Tensor setitem input to be list/tuple ( #14432 )
...
matches assign, and generally matches numpy
2026-01-29 21:26:58 -05:00
chenyu
4a80319093
clean up split_store final logic [pr] ( #14429 )
...
explicitly check the structure
2026-01-29 18:40:07 -05:00
Christopher Milan
e47f12f671
ci: replace testing_minimal with testing_unit ( #14427 )
2026-01-29 18:02:43 -05:00
wozeparrot
c2fb8b208f
fa: 32 block size ( #14416 )
2026-01-29 13:59:13 -08:00
chenyu
a979fafae5
cleanup around disk buffer [pr] ( #14428 )
...
style change, prep for refactor
2026-01-29 16:18:44 -05:00
nimlgen
dc977a03b0
nv_pma: bw decoder ( #14424 )
...
* nv_pma: bw decoder
* decoder fix
* better
2026-01-30 00:12:39 +03:00
chenyu
ddc041854b
failed test case for disk setitem ( #14426 )
...
strided setitem is wrong
2026-01-29 14:54:19 -05:00
chenyu
31706bf6bc
add few more types [pr] ( #14425 )
2026-01-29 14:04:09 -05:00
nimlgen
2d5c24879f
nv: pma for 5090 ( #14420 )
...
* nv: pma for 5090
* hm
* 4090
2026-01-29 20:06:01 +03:00
nimlgen
c8dc6332d2
memory: read_fields is not universal ( #14348 )
2026-01-29 20:00:00 +03:00
chenyu
dbe8f034a7
pass z3.Context in validate ctx [pr] ( #14423 )
...
does not need to pass the whole solver
2026-01-29 11:11:47 -05:00
chenyu
033ce1b885
types for validate.py ( #14422 )
2026-01-29 10:56:50 -05:00
nimlgen
230d08ec70
test for am recovery and faults handling ( #14421 )
...
* test for am recovery and faults handling
* linter
2026-01-29 17:11:24 +03:00
George Hotz
793afbd473
simplify nn.Embedding, support AFTER in CUSTOM_KERNEL ( #14419 )
2026-01-29 17:22:13 +08:00
Christopher Milan
0c855d6149
ci: remove unused pydeps ( #14418 )
2026-01-29 01:51:26 -05:00
wozeparrot
4845e42135
llama3 gradacc fixes ( #14414 )
2026-01-28 19:12:39 -08:00
chenyu
37cde4a01a
add one line mypy report ( #14415 )
2026-01-28 20:39:32 -05:00
chenyu
15aed51544
return types for all math.py function ( #14413 )
...
calling int() on sint -> int, i think it's better support since some UOp can be safely cast to int
2026-01-28 20:10:11 -05:00
nimlgen
aec1ae0de1
llama: set manual_seed ( #14409 )
2026-01-28 14:40:00 -08:00
chenyu
0870ed28b1
add Self type to MathMixin ( #14411 )
...
these don't cause error
2026-01-28 16:59:38 -05:00
chenyu
079f33c208
fix type in Tensor.mean and Tensor.var ( #14410 )
...
use Tensor.from_uop to wrap UOp from symbolic shape, kernels are the same
2026-01-28 15:24:02 -05:00
chenyu
2b5e99ccc1
minor type cleanups [pr] ( #14408 )
...
mypy --warn-redundant-casts has false negative
2026-01-28 14:11:50 -05:00
chenyu
726415dbc8
import sint directly in movement.py TYPE_CHECKING ( #14406 )
...
avoid creating string TypeAlias, fixed warning in `TYPED=1 python test/test_tiny.py`
2026-01-28 12:47:26 -05:00
nimlgen
acb2fc36ba
nv_pma: add decoder ( #14404 )
...
* nv_pma: add decoder
* cl
2026-01-28 20:44:02 +03:00
chenyu
7b9bc1d8cf
_MockMemoryviewMeta for mockgpu ( #14405 )
...
fixed `PYTHONPATH=. TYPED=1 DEV=AMD MOCKGPU=1 python test/test_tiny.py`. basically make `isinstance(TrackedMemoryView_instance, memoryview)` true
2026-01-28 11:59:00 -05:00
chenyu
93793a645b
use cl.cl_mem instead of internal ctypes._CData ( #14403 )
...
fixed `CHECK_OOB=0 DEV=CL TYPED=1 python test/test_tiny.py`
2026-01-28 10:56:41 -05:00
chenyu
a9b44070a8
fix webgpu runtime types ( #14402 )
...
`CHECK_OOB=0 DEV=WEBGPU TYPED=1 python test/test_tiny.py` passed, also skip tests that failed locally
2026-01-28 10:37:25 -05:00
George Hotz
0c6b3f50aa
add marker to llama training ( #14401 )
2026-01-28 22:44:28 +08:00
Jakob Sachs
2b7c00d3d2
fix sd-example dtype for CLIP embeddings ( #14397 )
2026-01-28 09:07:19 -05:00
qazal
a5a9ce3fdf
viz: disasm cleanups from null emulate ( #14399 )
...
* it's AMDHIPRenderer
* don't need that indent
* less assignment stuff
* that arg order did not make sense
* pmc
2026-01-28 22:03:30 +09:00
nimlgen
544928766d
hcq_smi: kill mac pids ( #14398 )
2026-01-28 15:00:28 +03:00
George Hotz
202b74b369
assembly/amd: continue refactors ( #14386 )
...
* simpler
* merge
* flat
* no ctx
* use the correct apis
* dup code
* write clean code
* remove bad helpers
* bits junk remove
* junk remove
* smem test
* fix tests
* correct fix + tests
* Fmt matters it seems
* wmma refactor
* a lil more
* kimi cleanups
* line
2026-01-28 17:33:03 +08:00
qazal
5bffa17f82
llama train: better NULL=1 EMULATE=AMD_CDNA4 dev experience ( #14395 )
...
* beam opens devices
* switch to hip renderer
* amd: true?
* llvm true is for test_autogen
2026-01-28 17:31:22 +09:00
qazal
0294014108
fix bufferize cost function for multi, improve VIZ=-1 cli ( #14394 )
...
* improve cli
* remove_bufferize change
2026-01-28 15:53:18 +09:00
qazal
c158acea29
failing multi ram usage test from llama gemm ( #14392 )
2026-01-28 14:32:32 +09:00
Christopher Milan
067e27857e
nested composite actions don't work ( #14393 )
2026-01-28 00:13:30 -05:00
Christopher Milan
9dddf3d478
don't save caches for PRs, try 2 ( #14391 )
2026-01-27 23:30:17 -05:00
Christopher Milan
68fe5d8b36
Revert "don't save caches for PRs ( #14389 )" ( #14390 )
2026-01-27 23:22:26 -05:00
Christopher Milan
4ab228b498
don't save caches for PRs ( #14389 )
2026-01-27 23:21:31 -05:00
Christopher Milan
5e36482314
decompose long to ints where unsupported, try 2 ( #14383 )
2026-01-27 23:20:43 -05:00
wozeparrot
e496547720
llama3 gradacc ( #14291 )
2026-01-27 19:48:10 -08:00
George Hotz
88bc5ee212
assembly/amd: rename to better names ( #14384 )
...
* assembly/amd: rename to better names
* might help fuzzing segfault
* emu2 -> emu
2026-01-28 10:00:54 +08:00
George Hotz
065b95cfb0
Revert "add retry to fetch ( #14370 )" ( #14385 )
...
This reverts commit dc4d7f2d55 .
2026-01-28 09:35:37 +08:00
Eitan Turok
dc4d7f2d55
add retry to fetch ( #14370 )
2026-01-27 14:04:25 -08:00
chenyu
8d1f3c8885
fix copysign for inf input ( #14381 )
...
* fix copysign for inf input
* llvm olt
2026-01-27 16:45:48 -05:00
Christopher Milan
289a3e415e
also skip test_nonoverlapping_shrink_assignment ( #14382 )
2026-01-27 16:26:26 -05:00
Christopher Milan
f34efc1ad1
DISABLE_FAST_IDIV actually works as a ContextVar ( #14378 )
2026-01-27 16:12:42 -05:00
chenyu
8c899e4aaf
fix copysign for -0 ( #14380 )
...
test both x and 1/x < 0 work too. and found another big with the * 0 hack
2026-01-27 15:44:58 -05:00
chenyu
62884585a7
failed test case for copysign -0.0 ( #14379 )
...
* failed test case for copysign -0.0
* skip those
2026-01-27 14:37:17 -05:00
nimlgen
ec1b28bc2c
am: exit early in case of failures ( #14376 )
...
* am: exit early in case of failures
* sorry, pre-linter
* reset when error state
2026-01-27 22:10:02 +03:00
chenyu
cd22ee9ed0
add InvalidType to ConstType [pr] ( #14373 )
...
* add InvalidType to ConstType [pr]
TYPED=1 python test/test_tiny.py passes.
added PyConst = float|int|bool for some Tensor level input types
* hcq
2026-01-27 14:09:34 -05:00
Christopher Milan
5b42a1357b
SCACHE=0 works with DEBUG ( #14377 )
2026-01-27 13:12:43 -05:00
chenyu
db010a31be
IGNORE_OOB -> CHECK_OOB [pr] ( #14374 )
...
flip the meaning
2026-01-27 12:20:59 -05:00
chenyu
c22667b0c4
also skip test_overlapping_shrink_assignment_reverse ( #14375 )
...
crashing
2026-01-27 12:20:39 -05:00
nimlgen
e52d58b041
autogen: update amd ( #14372 )
2026-01-27 19:53:14 +03:00
nimlgen
cbf94a0a95
nv: exit early in case of failures ( #14363 )
...
* nv: exit early in case of failures
* f
* cleaner
2026-01-27 19:16:22 +03:00
nimlgen
ec691cb299
am: print sq intrs ( #14366 )
...
* am: print sq intrs
* cleaner
2026-01-27 18:28:13 +03:00
qazal
a5f3d46423
hcq: do not assume kernel names are unique ( #14371 )
...
* hcq: do not assume kernel names are unique
* colored kernel name
2026-01-27 23:03:15 +09:00
George Hotz
e5df7e640b
fix branches in amd_asm_matmul ( #14369 )
2026-01-27 20:48:42 +08:00
George Hotz
0ced258726
HOTFIX: skip crashing assign test
2026-01-27 20:35:17 +08:00
George Hotz
131ae604de
force_transcendental on sqrt ( #14368 )
2026-01-27 20:24:41 +08:00
imaolo
14574c68fa
Add ContextVar to disable the scheduler cache ( #14257 )
...
* add scheduler cache ContextVar
* test scheduler cache context var
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2026-01-27 19:55:29 +08:00
George Hotz
bfc88bcfb8
assembly/amd: emu refactors + enable PYTHON_REMU by default ( #14361 )
...
* assembly/amd: start refactors
* cleanups
* those are global
* methods on ctx
* const cleanup
* range helper
* types and imports
* cleanups
* cleanups
* remove stale name
* fix emu2 types
* more typing
* more mypy
* cleanups
* fxns
* scc cleanup
* cleanups
* cleanups
* simpler parse_pcode
* laneid
* no defaults for pcode
* pcode is not optional
* cleanups
* functions cleanup
* splat
* expr_parser functions
* single tok
* invert global loops
* try_eat
* minor
* run parser on all
* no silent 0
* tests
2026-01-27 17:42:24 +08:00
Christopher Milan
2e72625652
Revert "decompose dtypes.long to ints where unsupported ( #14261 )" ( #14362 )
2026-01-27 02:04:59 -05:00
qazal
f866b2a513
mfma loop in asm dsl ( #14349 )
...
* mfma loop in asm dsl
* work
2026-01-27 11:11:37 +09:00
Christopher Milan
0793319929
decompose dtypes.long to ints where unsupported ( #14261 )
...
* add works
* use carry not overflow
* bitwise ops
* use tag instead of vec
* cleaner
* mul somewhat works
* mul actually works
* SUB and NEG work
* SHL/SHR
* ulong support
* this should work?
* oops
* fix indexing
* all ALU mostly works
* refactor
* test_dtype passing
* signed division works
* format
* clean
* some tests
* ruff
2026-01-26 18:34:13 -05:00
wozeparrot
a987a4abc3
feat: llama8b dev_beam.sh ( #14358 )
2026-01-26 14:51:23 -08:00
Christopher Milan
c9c533fc78
libclang path is homebrew on macos ( #14357 )
...
* libclang path is homebrew macos
* typo
* ugh
* typo
* regen
* no LIBCLANG_PATH
2026-01-26 17:32:09 -05:00
chenyu
d641e63189
improve min/max for AND ( #14356 )
2026-01-26 15:44:18 -05:00
chenyu
f16372487a
fix assign hazard on shrink ( #14355 )
...
* fix assign hazard on shrink
possible to have race if both assign src and dest are shrink
* test_nonoverlapping_shrink_assignment
2026-01-26 14:46:30 -05:00
chenyu
145df879c1
find_permutes -> fix_assign_hazard [pr] ( #14354 )
...
some noop tweaks and comment updates
2026-01-26 14:05:19 -05:00
nimlgen
e152f1b0f5
llama: use ALL2ALL ( #14353 )
2026-01-26 22:01:53 +03:00
nimlgen
3f25eb3026
am: ih ( #14346 )
...
* am: ih
* um
* fix
* line
* no trap and fix ring
* keep
* fix
2026-01-26 20:11:04 +03:00
chenyu
823bc17fb5
failed test case for shrink overlap assigns ( #14350 )
...
* failed test case for shrink overlap assigns
current logic can create a race resulted in wrong output
* skip for now
2026-01-26 11:58:45 -05:00
George Hotz
204f51e739
assembly/amd: bug fixes for PYTHON_REMU ( #14347 )
...
* default PYTHON_REMU to 1
* mockgpu
* less size
* normal compile path
* uniqie
* more
* fix clamp
* Change PYTHON_REMU default to 0 in _try_dlopen_remu
2026-01-27 00:48:22 +08:00
chenyu
231305603d
remove REAL_DEV [pr] ( #14337 )
...
it's just Device.DEFAULT now
2026-01-26 10:08:16 -05:00
Martin Szewieczek
9cbe99348a
func meshgrid: change param index to type str ( #14331 )
2026-01-26 10:07:56 -05:00
George Hotz
3b43d26f10
assembly/amd: emu speed ( #14344 )
...
* assembly/amd: emu speed
* fix spec
* go
* don't do this
* simpler
* no stupid consts
* hack
* simpler
* no index
* no where
* faster linearizer
* fix spec
* no index dtype
2026-01-26 22:21:34 +08:00
George Hotz
774a454bb5
assembly/amd: fix scratch SVE ( #14340 )
...
* assembly/amd: default python REMU
* mem_used
* no lane
* sve
* remove that
* needs s_code_end in tests
2026-01-26 21:03:51 +08:00
qazal
2d91fe6310
use amdgpu dsl in mmapeak ( #14342 )
...
* use amdgpu dsl in mmapeak
* don't rely on llvm for vgpr counting
* llvm roundtrip assert
* rm it, add ci
* vgpr_count
* move emulated test to amd, it needs comgr
* env
* arch
* inst._fields -> inst.operands
* vgpr offset
2026-01-26 22:03:43 +09:00
qazal
b2e2ace85b
viz: remove ci check, it's VIZ=-1/-2 ( #14343 )
2026-01-26 20:36:23 +09:00
George Hotz
be23776ba7
assembly/amd: replace pcode with ucode ( #14002 )
...
* a bunch of todos for my boy claude
* uops have types
* lil cleanups
* simpler ucode
* isNAN
* calls
* move more
* cleanup pcode_parse
* cvt functions
* fix parser bugs
* no void
* minmax
* more pcode parse
* pretty print
* transform
* comments
* move to transform
* assign/declare
* simpler norm
* single PM
* just Uops
* simpler
* more typed
* all rewrite
* less verbose
* work
* spec
* transform
* work
* simpler spec
* less spec
* bitcast
* simpler
* simp ucode
* work
* more in pcode_transform
* remove junk
* more functions
* bug
* no void assign
* load/store
* wave
* fixes
* move denorm
* move more functions
* tests
* cat is shape None
* uop syntax
* move a few more
* program_spec
* cat stuff
* assign fix clear
* unused
* nans
* fp bits
* works with simplify
* remove junk
* special
* meh
* more
* more
* update test pcode parse
* improve parser
* parse some for loops
* merge master
* dead files
* tests pass
* emu2
* better emu2
* test_plus works
* uselessly write more instructions
* use pcode
* something
* something
* bench_emu
* progress
* ds works
* work
* work
* more passing
* run compare
* bench_emu
* more pcode
* a few more
* bugfixes
* bugfix
* test fixes
* tests pass without USE_HW
* all hw tests pass
* add more hw tests
* new hw tests
* bit
* less handcode
* parse more
* consolidate pcode
* fixes
* rsrc
* lane pcode
* cleanups
* simpler
* emu bugs
* one cmp test fails
* fix decode and upd name
* fix name and test harness
* _ftz_f32
* fix denorm
* fix VOPD and use load
* fix carry bug
* no load where / just invalid
* clean
* simpler
* merge sops
* refactoring
* simplifications
* bugfixes
* new tests
* f16 sin fix
* assertion and hw tests
* cvt functions
* one more failure
* bugfixes
* bugfix + regression
* more tests
* fmac
* no manual unrolling
* ordering
* LLVM backend is a lot faster
* compile inst
* more bugs
* f16
* bugfix
* fix regression
* one clang call
* 1M inst
* scratch works
* do scratch correctly
* cleanup
* regression
* cmp
* fmamk fixes
* merge
* fix vcmpx
* unify memory
* remove unused code
* ignore oob for test
* cleanups
* fix mbs
* unify cmp
* test
* minor cleanups
* bump timeout
* fix tests
* revert the CMPLE stuff
* remove opt
* less diff
* simpler
* revert
* support multiple backends
* memset is a lot faster
* split out in bench emu
* improve timing
* timing
* cache that
* cache that
* simpler and faster
* tokenize
* binop table
* simpler
* move to parser
* tok for lambda
* refactor
* expr_parser
* delete emu2_pcode
* import cleanup
* lil
* if parse
* work
* simpler
* no v
* trig preop is faster
* durations for tests
* fix cmp bug
* sdst
* remove scartch_size hack
* null behavior
* _MXCSRContext
* bugfixes
* DEBUG >= 3
* test smem crashes my gpu
* debug
* test
* test smem
* profiler
* full inst
* bugfix
* rtag(1)
* pc is 64-bit and word
* pc is real code now
* dynamic
* more dynamic
* fix oob access
* fix crash, more dyn
* all dyn
* really all dyn
* correct null mask
* lit + format
* 21s on the tests
* 13s on the tests
* canonical name
* simm16
* more dyn
* 14s
* proper saddr dedup
* dyn
* debug 5
* better 5
* revert dynamic stuff
* that can be dyn
* negative offsets
* dyn wmma
* f16 wmma support / ops / dtype / dtype_alu
* symbolic changes not needed
* ConstFloat
* more uop.const
* __eq__
* uop tests
* fix f16
* bf16 tensor cores
* whitespace
* remove cast roundtrip
* Revert "remove cast roundtrip"
This reverts commit c5bb0381c3 .
* just the fix
* remove dead paths
* llvm runs
2026-01-26 18:04:29 +08:00
George Hotz
984cdc4840
add wrapper class for the -0.0 != 0.0 issue ( #14339 )
...
* add wrapper class for the -0.0 != 0.0 issue
* fixes
* spec fix
* missed one
2026-01-26 16:52:37 +08:00
qazal
92bfe92138
assembly/amd: fix cdna mfma xml ( #14329 )
...
* handwritten failing test
* new amdxml
* more mfma from fixes
* ci
* move arch of test integration
* alt
* amdxml human cleanup
* _TestIntegration rename to IntegrationTestBase
* it's the same problem as _LIT
* better comment
* better variable name
2026-01-26 17:51:26 +09:00
Garret Castro
6c109f4d75
LLVM: CPU threading support ( #14320 )
...
* make generic llvmrenderer class for cpu and amd
* move `tensor_cores` back to parent
* remove empty line
* restore extra matcher position
* add threading
* dont need to add core_id here
* dont move code for workitem
* cleanup
---------
Co-authored-by: TheVanadium <claude_user@ret2022.localdomain>
2026-01-26 13:12:39 +08:00
George Hotz
cc49e47ea2
tinygrad changes from ucode ( #14336 )
...
* tinygrad changes from ucode
* dtype
2026-01-26 11:30:18 +08:00
Garret Castro
8477368d07
generic LLVMRenderer class for CPU and AMD ( #14321 )
...
* make generic llvmrenderer class for cpu and amd
* move `tensor_cores` back to parent
* remove empty line
* restore extra matcher position
* cleanup
---------
Co-authored-by: TheVanadium <claude_user@ret2022.localdomain>
2026-01-26 09:11:49 +08:00
George Hotz
11ce1e847d
llama train: null device support
2026-01-26 08:53:05 +08:00
chenyu
e3601788fa
update torch backend function ( #14333 )
...
those have tensor.py implementation
2026-01-25 16:39:34 -05:00
nimlgen
9865f51e39
cupti: ref collector ( #14330 )
...
* cupti: ref collector
* ll
2026-01-25 20:35:21 +03:00
nimlgen
21ab23ae18
nv: add pma for ada ( #14328 )
...
* nv: add pma for ada
* um
* fix
* shorter
* mock
2026-01-25 17:33:37 +03:00
George Hotz
49db266b96
ReprEnum for repr roundtrips ( #14327 )
...
* ReprEnum for repr roundtrips
* dsl
* bugfixes
* vdsty fixes
* cleaner
* fix
* fix cdna fields
* tests all pass
2026-01-25 18:58:31 +08:00
qazal
bf2d9d138f
viz: simplify amdgpu cfg ( #14326 )
...
* viz: replace llvm disasm with our disasm
* it starts with more code
* then it becomes less
* simpler, cdna disassembles with decimal simm16
* s_branch is upper case, add test
* simm16s and others
2026-01-25 15:21:45 +09:00
qazal
647e527a7e
viz: replace llvm disasm with our disasm ( #14325 )
2026-01-25 13:56:56 +09:00
nimlgen
4280a8eef2
am: update fw ( #14323 )
2026-01-25 01:08:47 +03:00
chenyu
7e41da1ae8
fix generate_dataset.sh ( #14324 )
...
added `set -e` so wrong pathes would fail the script, then fixed the path
2026-01-24 16:47:10 -05:00
chenyu
311bfd91d6
clean up where_on_load [pr] ( #14322 )
...
no repeated split_uop and general cleanup
2026-01-24 14:43:43 -05:00
nimlgen
8b282ba6d2
memory: reserved vram ( #14318 )
2026-01-24 19:39:24 +03:00
chenyu
00e9ba0b82
update type for split_uop and where_on_load [pr] ( #14319 )
...
also variable names in where_on_load, before logic update
2026-01-24 11:17:41 -05:00
chenyu
cb69b7b2b2
comment out fold_where_closure ( #14316 )
2026-01-24 10:15:42 -05:00
wozeparrot
d74587f16d
fa multi fix 2 ( #14314 )
2026-01-23 23:35:02 -08:00
chenyu
d9f0ad1d87
update return type for Tensor.tolist ( #14313 )
...
since sequence is incorrect since it can be list of list, use Any to avoid recursive type
2026-01-23 23:21:49 -05:00
qazal
807bc40931
assembly/amd: dsl and disasm cleanup ( #14311 )
...
* rdna4 inst helper
* remove dsl aliases
2026-01-24 11:36:12 +09:00
Christopher Milan
e782d44918
WEBGPU/NIR truncates ints ( #14307 )
...
* WEBGPU truncates ints
* nir has this bug too
2026-01-23 19:28:06 -05:00
nimlgen
26220a472e
no core_id ( #14265 )
...
* no core_id
* kwargs
* est
* linters
* ugh
* revert this
* deps
* glb
* should work?
* nn
* line
* fx
* ym
* z
* d
* um?
* revert
* this one?
* first half
* um p2
* all?
* um
* cleaner
* um
2026-01-23 21:30:12 +03:00
chenyu
e65bc7a7c5
where closure folding ( #14304 )
2026-01-23 10:55:13 -05:00
chenyu
d5a3b02a9c
clean up xpow ( #14295 )
...
mostly for `ret * (base < 0).where(adj, ret.const_like(1))` -> `(base < 0).where(neg_base, ret)`, since it's good for NAN neg_base but not generic
2026-01-23 10:19:47 -05:00
qazal
b913c910c5
assembly/amd: rdna4 passing test_roundtrip ( #14300 )
...
* test_roundtrip on different archs
* failing tests
* take RDNA4 xml changes from the emu branch
* work
* min diff to disasm flat
* test_add passes, rdna4 first
* correct vgpr field for the multi dword store stuff
* amdllvm
* recompile in roundtrip, get sources from emulator
* amdllvm, 2
* clean clean
* note, don't rely on that os.environ
---------
Co-authored-by: George Hotz <geohot@gmail.com>
2026-01-23 21:33:53 +09:00
qazal
f3b0e42863
remove extra sqtt pickles in gfx1200 ( #14302 )
2026-01-23 20:13:48 +09:00
George Hotz
d116312b1a
get cdna sqtt working ( #14301 )
...
* get cdna sqtt working
* cnd aprser
* wavestart/waveend
* names
* cdna
* test that
2026-01-23 18:46:15 +08:00
George Hotz
a5c4fa39d1
RDNA4 support in SQTT ( #14299 )
...
* table test
* cleanups
* dead file
* delta short
* tests
* delta test
* work
* l4 tests pass
* l0
* cnda
* print
* reverT
* wave failure
* wave failure
* test
* encs
* no l0 crap
* L4
* rdna4 sqtt
* notes
* linter
2026-01-23 16:16:45 +08:00
wozeparrot
963c59ebdb
fix: pull fixes from gradacc branch ( #14296 )
2026-01-22 23:07:54 -08:00
Christopher Milan
68668b8f28
fix WEBGPU NEG ( #14298 )
...
* fix WEBGPU NEG
* add test
* parenthesize
2026-01-23 01:44:52 -05:00
qazal
3b8a7bb8c9
use existing roc.py infra for sqtt tests ( #14297 )
...
* add pc, per kernel tracing
* work
* remove those imports
* min diff
2026-01-23 14:07:11 +09:00
chenyu
5f32f7a06b
fix winograd padding order ( #14294 )
2026-01-22 23:00:14 -05:00
George Hotz
52b989c6c8
don't place consts early + fixes from anthropic challenge ( #14286 )
...
* don't place consts early
* add anthropic challenge
* with ref
* do we still have to devectorize bools?
* tests pass
* just WHERE
* fine, revert that
* fine, revert
* only index
* z3 validator doesn't support vectorized
* Revert "z3 validator doesn't support vectorized"
This reverts commit 1b7930ecb3 .
* z3 not for vec
* no spec
* VLIWRenderer
* loop unrolling
* better comments
* cleanups
* skip cast
* renderer
* cleanups
* prints
* no hack
* hacks
* bump to 11
* reg warning
* lil clean
* cleaner renderer
2026-01-23 10:48:39 +09:00
chenyu
0903782bc0
remove few dead or unneeded codes [pr] ( #14275 )
2026-01-22 20:05:43 -05:00
chenyu
3eb5cd7d32
stronger test_rand_is_lazy ( #14293 )
2026-01-22 18:58:53 -05:00
chenyu
c15b6e6709
update test_randn_finite skipped device ( #14292 )
2026-01-22 18:26:02 -05:00
chenyu
073c6a81b5
raise if Tensor._buffer is called during jit ( #14114 )
...
* raise if Tensor._buffer is called during jit
* cleaner
2026-01-22 17:30:18 -05:00
nimlgen
8cd22df2dd
amd: alive wgps ( #14149 )
...
* amd: disabled wgps
* l
* wgp
* uoops
* mockgpu
* drm
* ad this
* fi
* reg
2026-01-23 00:08:45 +03:00
chenyu
a738c4bb22
test symbolic view broken with jit ( #14290 )
2026-01-22 13:44:47 -05:00
chenyu
f22fa6a5be
test rand is lazy ( #14289 )
2026-01-22 13:07:55 -05:00
chenyu
1726b884f2
update test_jit_v_nojit_random_regen ( #14288 )
...
current behavior is that jit and non-jit consume random seed differently, still the random values are different
2026-01-22 12:21:47 -05:00
chenyu
fbed36fa15
jit graph handle input==output aliasing ( #14287 )
...
a position that wasn't an input during capture should never become an input during execution, but graph cannot tell this by jit_cache and input_buffers only
2026-01-22 11:37:41 -05:00
chenyu
8bb61c2490
stronger test_graph_input_output_aliasing ( #14282 )
...
* stronger test_graph_input_output_aliasing
* comfirmed failure
2026-01-22 09:59:34 -05:00
qazal
d7afa02085
clean up the extra/sqtt directory ( #14284 )
...
* remove legacy test_timing stuff
* remove legacy test_pmc, update active_sqtt_parse
2026-01-22 19:10:59 +09:00
qazal
dff5f361b0
support rendering assembly kernels on the NULL backend ( #14283 )
...
* assembly custom kernels in DEV=NULL, use renderer arch
* update mmapeak
* llvm
2026-01-22 15:49:07 +09:00
qazal
dfefeddeed
add tflops to cdna gemm custom kernel ( #14281 )
2026-01-22 12:48:28 +09:00
qazal
18f408a35a
custom assembly kernel with variable tests ( #14280 )
...
* custom assembly kernel with variable tests
* different threads
* sink
* zeros like / flatten
2026-01-22 11:34:17 +09:00
chenyu
4de107b764
jit graph bug when input is output ( #14278 )
...
* jit graph bug when input is output
wrong result in llm
* not just metal
2026-01-21 18:49:52 -05:00
wozeparrot
76a9242a66
fa: merge kv bwd into one kernel ( #14277 )
2026-01-21 15:24:41 -08:00
chenyu
6279ae4a94
remove llm generate always reset start_pos ( #14276 )
...
* remove llm generate always reset start_pos
by itself seems like a bug, also added a test to repro forward_jit.reset() issue
* issue is jit graph, so revert that test
2026-01-21 16:54:30 -05:00
nimlgen
da1fedc3c8
working ioctls ( #14272 )
2026-01-21 20:29:04 +03:00
chenyu
574d171fa6
fix onnx Pad constant_value=None ( #14271 )
...
also removed a dead branch in _resolve_pool_pads
2026-01-21 11:51:34 -05:00
chenyu
a18d34be1e
simpler split_store outer range check [pr] ( #14273 )
...
also fixed comment
2026-01-21 11:51:14 -05:00
chenyu
e64111ad08
update all_same [pr] ( #14270 )
...
add type annotation and unit test
2026-01-21 11:26:15 -05:00
chenyu
9ad3c865ac
fix bug in logsumexp keepdim=True ( #14268 )
2026-01-21 09:49:55 -05:00
George Hotz
41d00a046d
add device to local, fix PCONTIG=2 ( #14266 )
...
* add device to local, fix PCONTIG=2
* regression test
* remove the device when we render
* viz slowness
* no long
2026-01-21 22:12:18 +09:00
wozeparrot
c1d14ea832
llama8b train fixes ( #14264 )
2026-01-20 20:34:47 -08:00
qazal
549dbabfcb
move ALLOW_DEVICE_USAGE=0 to get_program [pr] ( #14263 )
2026-01-21 12:56:05 +09:00
qazal
78a28227c6
assembly/amd: cdna4 mfma support ( #14206 )
2026-01-21 09:12:05 +09:00
George Hotz
1baefed530
assembly/amd: add hw tests from ucode branch ( #14259 )
...
* assembly/amd: add hw tests from ucode branch
* fix is per lane
2026-01-21 08:53:54 +09:00
wozeparrot
ba90e1b52e
feat: script to run llama8b training ( #14239 )
2026-01-20 12:44:06 -08:00
Christopher Milan
daf9414bff
fix nullptr arg to CUDA_KERNEL_NODE_PARAMS_v1 ( #14256 )
...
* fix nullptr arg to CUDA_KERNEL_NODE_PARAMS_v1
* ruff
2026-01-20 12:30:07 -05:00
chenyu
e04767e39e
run pre-commit in ci ( #14253 )
...
* run pre-commit in ci
prevents pre-commit regression
* IGNORE_OOB=1
* pytest
* unit test
* split
2026-01-20 12:24:33 -05:00
nimlgen
22af7132cd
fix test_dev_jitter_matrix ( #14255 )
2026-01-20 20:07:51 +03:00
Robbe Derks
c7fbd177d4
USBGPU: debug script for comma chestnut ( #14252 )
...
* initial debug script
* improvements
2026-01-20 18:52:25 +03:00
C T
26f8b12e01
Whisper audio helpers (mel filters in tinygrad) ( #13478 )
...
* add whisper audio helpers for stft/mel/resample
* cleanup
* add whisper stft test
* make only stft test explicitly depend on librosa
* extract sinc_window_kernel
* dehardcode device
* use same device argument
* simplify
* type annotate
* ruff format audio_helpers.py
* ruff format test_whisper.py
* add WHISPER_NEW_STFT
* rename
* undo ruff format changes
* use new stft and mel for whisper
* remove stft test that depends on librosa
* remove whitespace
* add Tensor.log10 with test\test_ops.py::TestOps::test_log10
* use Tensor.log10
* fix lint
* future: remove unused STFT class
* future: remove resample code since it isn't used (yet)
* match openai with pad_mode="reflect"
* pad_to
* future: cut resample leftovers
* cleanup
* add mel tests
* future: cut stft
* future: cut non-mel prep_audio changes
* reduce diff
* move audio_helpers.py to examples
* reduce whitespace
* fix imports
* reduce whitespace
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
2026-01-20 10:50:02 -05:00
nimlgen
dc82856084
tbgpu: shim binary + remote apl pci dev ( #14124 )
...
* shim binary + remote pci dev
* v2
* rip out apl
* cmds
* rename
* clean
* remove
* rm gitignore
* ui
* install
* linter
* um
* cleaner
* assets
* normal install in ui
* cleaner app
* install script
* support fd mmap
* cleaner
* kill server when disconn
* rename + pcidevs
* sign
* install and reinstall
* no sip install
* will trigger update
* nv
* ugh
* this
* fix
* nv
* use nosip sign
* auto install
* remove
* mypy
* upd
* ditto
* print
* simpler
* ditto
* um
* simpler
* upd
* upd
* cleaner
* autogen
* cleaner
* move
* annotations
* server cleaner
2026-01-20 16:15:18 +03:00
qazal
4548fcc1b8
amd/sqtt: add rdna4 and cdna sqtt examples ( #14251 )
...
* amd/sqtt: add rdna4 and cdna sqtt examples
* work
* comment out rdna and cdna tests
2026-01-20 21:11:48 +09:00
qazal
2dc281b32a
assembly/amd: test helpers for arch to gfx target mapping ( #14250 )
2026-01-20 20:35:09 +09:00
nimlgen
823e88c0d0
nv: request bar 3 ( #14249 )
2026-01-20 13:52:38 +03:00
qazal
dddd0e384f
ALLOW_DEVICE_USAGE=0 in codegen ( #14238 )
2026-01-20 15:15:16 +09:00
George Hotz
0243f4a0f1
clear wins from ucode branch ( #14243 )
...
* clear wins from ucode branch
* two more
* revert those
2026-01-20 15:11:09 +09:00
George Hotz
5e24643889
minor import speedups ( #14244 )
...
* minor import speedups
* server stuff in server places
* pre-commit
* fix
2026-01-20 15:05:36 +09:00
George Hotz
d60a155e48
defer compilation of upats ( #14242 )
...
* defer compilation of upats
* mypy
2026-01-20 13:50:00 +09:00
George Hotz
56c8926d32
import speedups: refactor validate to late import ( #14241 )
...
* refactor validate to late import
* preommit stuff
* fix mypy
2026-01-20 13:23:39 +09:00
chenyu
9d3b1cf1e7
simpler _cached_to_python_const ( #14236 )
2026-01-19 23:10:53 -05:00
qazal
b1c5a242b7
Revert "move is_dtype_supported logic to renderer ( #14188 )" ( #14237 )
...
This reverts commit 161fee9a48 .
2026-01-20 12:19:14 +09:00
wozeparrot
1f89eaf790
tk: fa bert mask fix + some numerical stability improvements ( #14214 )
2026-01-19 19:18:07 -08:00
chenyu
9ea63d7d52
failed test case for onnx IF with jit ( #14235 )
...
silently fails now since onnx treats IF cond as a const
2026-01-19 18:10:05 -05:00
Garret Castro
b65dc9fd8e
refactor: use generic type for ContextVar [pr] ( #13998 )
...
* use generic type for context var
removes ops_python string cast thing, allows for handling of other string vars like `_CC`
* update Context.old_context type
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
2026-01-19 13:37:54 -05:00
Martin Szewieczek
7010c176cf
pre commit: fix path to test_assign.py ( #14231 )
2026-01-19 13:36:30 -05:00
Christopher Milan
34f6192739
look for cuda in /opt/cuda ( #14230 )
...
* look for cuda in /opt/cuda
* regen
2026-01-19 11:51:00 -05:00
qazal
0f61cbd51f
viz: draw shapes directly on the canvas ( #14229 )
2026-01-20 00:57:06 +09:00
nimlgen
acb0045ba0
system: alloc_sysmem is part of interface ( #14226 )
2026-01-19 18:15:54 +03:00
qazal
ab426cb671
viz: simplify row line logic ( #14227 )
2026-01-20 00:00:28 +09:00
nimlgen
01653db4fd
nv: GPPut is mmiointerface ( #14225 )
2026-01-19 17:36:26 +03:00
nimlgen
7cb7abeeb0
amd: fix scratch_wave64_lane_byte_size ( #14223 )
2026-01-19 15:21:39 +03:00
nimlgen
979ce211f7
amd: missing self in aql's exec ( #14224 )
2026-01-19 14:27:54 +03:00
George Hotz
31bcbed6bb
AMD_DISABLE_SDMA for testing with -n12 ( #14216 )
2026-01-19 16:10:30 +09:00
qazal
578a4a50d3
viz: row lines in timeline ( #14213 )
...
* simple start, already works for memory graph
* add height to exec packets
* math.max, border-color
* borderline is in pixels
* row border color
2026-01-19 13:01:43 +09:00
Christopher Milan
161fee9a48
move is_dtype_supported logic to renderer ( #14188 )
...
* move is_dtype_supported logic to renderer
* fix CPU_COUNT
* mypy happy
* early import libclang too with llvm
* run with debug
* skip autogen tests if MTLCompiler or llvm is loaded
* run autogen tests separately in CI
* lint
2026-01-18 22:37:04 -05:00
qazal
7abe9b020f
viz: add border colors to pkts timeline ( #14211 )
...
* viz: add border colors to pkts timeline
* 10
2026-01-19 11:37:46 +09:00
chenyu
67d9712ef6
jit copy aliased output if it's read later ( #14210 )
2026-01-18 18:48:59 -05:00
chenyu
97333b1954
jit footguns test case on assign with same buffer outputs ( #14209 )
...
related https://github.com/tinygrad/tinygrad/issues/13364
2026-01-18 16:01:09 -05:00
chenyu
e7c2df9113
improve consecutive Tensor indexing ( #14208 )
...
* improve consecutive Tensor indexing
instead of O(idx_counts*src_dims), it can just be O(idx_counts)
* test correctness
2026-01-18 15:14:33 -05:00
chenyu
c7b8f6496f
remove dtypes.index_like and dtypes.fields [pr] ( #14207 )
...
barely used, so just use inline and DTYPES_DICT
2026-01-18 11:49:01 -05:00
qazal
e27a0002c5
viz: only keep the sqtt bytes for pkts ( #14203 )
...
* viz: only keep the sqtt bytes for pkts
* better option name
* work
* renames
2026-01-18 17:04:26 +09:00
qazal
d8f87ae2f2
SQTT packets to assembly mapper ( #14198 )
...
* disasm + compare to llvm
* start inst trace
* base tests pass
* work
* work
* all kernels
* qol
* refactor
* work
* work
* wave_focus
* simple
* work
* add a lot of asserts
* focus on wave0
* correct handling of IMMEDIATE_MASK
* work
* viz work
* use the metadata infra
* better
2026-01-18 16:32:13 +09:00
Christopher Milan
1eb110cd7d
fix memory corruption in NIR, reenable process replay ( #14204 )
2026-01-18 02:05:12 -05:00
George Hotz
a51e0a86db
assembly/amd: clean up disasm.py + add CDNA support ( #14200 )
...
* assembly/amd: clean up disasm.py
* cleanups
* add missing encodings
* decode is pretty
* cdna
* assert on failure
* cdna roudtrip
* cdna passing
* test
* lil cleanup
* variant cleanups
* cleanups
2026-01-18 14:48:44 +09:00
chenyu
4b18c92bc5
simpler Context.__enter__ [pr] ( #14201 )
2026-01-18 00:38:59 -05:00
qazal
feaa804158
skip lvp process replay in CI [pr] ( #14202 )
2026-01-18 13:25:04 +09:00
chenyu
b12a9fea80
runtime int call instead of cast(int) ( #14183 )
2026-01-17 20:34:45 -05:00
George Hotz
79c1559f69
amd asm can still be simpler ( #14199 )
...
* amd asm can still be simpler
* simpler
* V_LANE_ID
* simpler
* simpler
* compact vgpr
2026-01-17 18:40:10 +09:00
chenyu
5e6a72c33f
new Onnx Gather ( #14187 )
...
instead of assuming const indices, check if it showed as a const
2026-01-16 22:24:07 -05:00
George Hotz
9f7f2f0e0c
MAX_SQTT_PKTS
2026-01-17 12:05:36 +09:00
George Hotz
50554115ee
fix VALU_SALU / IMMED_MASK and improve amd_asm_matmul ( #14196 )
...
* fix VALU_SALU / IMMED_MASK and improve amd_asm_matmul
* immed
* wave override
* restore ALT
* advance sgprs correctly
* no helpers
* decrease to 192 VGPRs
2026-01-17 11:58:34 +09:00
chenyu
ab244c7f81
onnx Gather should not assume indices to be const ( #14185 )
...
* onnx Gather should not assume indices to be const
added a failed test case
* just list
2026-01-16 20:55:00 -05:00
wozeparrot
a879b54234
tk: fa jit fix ( #14170 )
2026-01-16 16:38:45 -08:00
qazal
a8ae9757dd
viz: put alts in the same row, LDS color ( #14194 )
...
* viz: put alts in the same row, coloring work
* assert if packets overlap
* lds color
2026-01-17 09:36:14 +09:00
qazal
5aa71f437b
viz: precise clock cycles in PKTS ( #14179 )
...
* viz: relative clock cycles in PKTS
* format clocks as xM yK 999 cycles
2026-01-17 09:08:13 +09:00
Christopher Milan
eafcd44d95
fix OSX image pitch ( #14193 )
2026-01-16 19:07:33 -05:00
Christopher Milan
3960e2758c
suppress_finalizing in hip ( #14189 )
2026-01-16 18:56:29 -05:00
qazal
9302ab003a
viz: show ALT/OTHER packets on second lane ( #14192 )
...
* viz: show dimmer ALT/OTHER packets
* remove todo comment
* work
* current vmem is gray
2026-01-17 08:55:24 +09:00
qazal
551454f476
viz: fix wave sort, show message if sqtt trace is empty ( #14190 )
...
* show message if sqtt trace is empty
* work
* fix wave sort
* back
2026-01-17 08:01:26 +09:00
George Hotz
8a2549d42b
improve amd_asm_matmul + minor VIZ PKTS improvements ( #14186 )
...
* improve amd_asm_matmul + minor VIZ PKTS improvements
* fix waitcnt issue
* cleanups
2026-01-17 06:56:59 +09:00
George Hotz
7d1d9d4568
assembly/amd: remove IMG instruction support and asm.py ( #14163 )
...
* assembly/amd: return IMG instruction supports
* remove asm.py
* op2dsl
2026-01-17 06:21:50 +09:00
chenyu
dc4ae7dd08
lower ASSERT_MIN_STEP_TIME for driving_policy to 3ms ( #14184 )
...
seems quite stable at 2.7ms now
2026-01-16 15:04:53 -05:00
chenyu
0a14e1fcd4
fix some type ignore ( #14182 )
2026-01-16 13:56:45 -05:00
chenyu
fc10470883
add UOp.__index__ ( #14181 )
...
Tensor slice is handled by __getitem__, so the index method is just for SupportsIndex
2026-01-16 12:28:33 -05:00
chenyu
6790165ef8
minor _apply_uop cleanup ( #14180 )
...
give fxn a return type and minor style change
2026-01-16 11:27:55 -05:00
nimlgen
e855ec8ee3
tbgpu: refactor dext to support user mappings ( #14177 )
2026-01-16 15:55:57 +03:00
qazal
bbc55962ee
viz: color SQTT INST Ops like UOps ( #14175 )
2026-01-16 21:24:43 +09:00
qazal
3751b29a3d
viz: skip OTHER_ SQTT packets ( #14178 )
2026-01-16 20:37:19 +09:00
qazal
7c1f1cb2bc
viz: fix INST packets coloring ( #14176 )
...
* viz: fix INST packets coloring
* work
2026-01-16 18:46:13 +09:00
qazal
1696991988
viz: add PKTS group to sqtt trace ( #14173 )
...
* viz: add PKTS group to sqtt trace
* soft_err for rdna4
* different itrace
2026-01-16 17:29:47 +09:00
Christopher Milan
a021b84604
autogen: fix enum ( #14171 )
2026-01-16 01:30:11 -05:00
qazal
fa5475307c
viz: collapse wave packets in one row, 1 clk per packet ( #14169 )
...
* per wave packets in one row
* work
* row_tuple
* cleaner
* one row and one lane per wave
* globals split into rows based on type
* barrier length
2026-01-16 13:52:07 +09:00
Christopher Milan
5abc262e22
fix dll.bind caching ( #14168 )
2026-01-15 20:25:42 -05:00
Christopher Milan
f9ca072b61
cuda compilers disassemble properly ( #14166 )
...
* cuda compilers disassemble properly
* this can use system
2026-01-15 19:02:40 -05:00
chenyu
14e9a71a41
move test_assign to unit ( #14165 )
...
scheduling these should not depend on device
2026-01-15 17:10:13 -05:00
nimlgen
a0dd9d2146
tbgpu: correct com.apple.developer.driverkit.transport.pci entitlements ( #14164 )
...
* tbgpu: correct com.apple.developer.driverkit.transport.pci entitlements
* format
2026-01-15 20:56:39 +03:00
qazal
32e1c267ee
viz: SQTT timeline with our decoder ( #14139 )
...
* viz: sqtt OCC/INST timeline in our decoder
* todo
* lint
* work
* cleaner
* profiling
* better timing
* keep the generic api
* more generic
* 80x -> 20x off the C decoder
* unusably slow
* rm filters
* work
* work
* other way to sort ops
* work
* first 10k
* 100K actually tells a story
* barrier INST packets get their own red color and row
* minor detail
* 50K
* soft_err
2026-01-15 20:45:16 +09:00
Christopher Milan
0cb024a5bb
remove ctypes.Structure ( #13651 )
2026-01-15 05:06:22 -05:00
George Hotz
255e0573b1
assembly/amd: clean up asm/disasm ( #14158 )
...
* assembly/amd: clean up asm/disasm
* update disasm
* revert dumb stuff
* update decode
* use fmt
2026-01-15 17:45:40 +09:00
qazal
164bc678a6
scheduler: sched_cache bugfix for different Tensor.custom_kernel schedules ( #14161 )
...
* simplest failing test
* min fix
* same function reuses the cache
* SPEC=2 never worked for custom_kernel
2026-01-15 14:59:14 +09:00
qazal
b46da603fe
codegen/custom_kernel: do not attach KernelInfo to user program ( #14160 )
2026-01-15 14:01:48 +09:00
George Hotz
fd60626ea1
assembly/amd: refactor to use op_bits/op_regs ( #14156 )
...
* assembly/amd: refactor to use op_bits/op_regs
* remove that skip
* remove another hack
* remove another hack
* precompute mask
* more reg, less hasattr
2026-01-15 11:20:21 +09:00
chenyu
add7da268f
multiple slice assign test ( #14157 )
...
GANing test cases
2026-01-14 21:08:03 -05:00
George Hotz
e9ce12028e
assembly/amd: amdxml cleanups, remove broken SDWA/DPP, merge in pdf.py ( #14154 )
...
* assembly/amd: amdxml cleanups, remove broken SDWA/DPP
* remove buf junk
* simplify
* simplify
* lil cleanup
* dead fixes
* strip non pcode extraction from pdf
* merge pdf.py into amdxml.py
* only amdxml
2026-01-15 09:23:19 +09:00
wozeparrot
7e5687f6a3
more fa multi fix ( #14152 )
2026-01-14 13:57:11 -08:00
chenyu
1381daac06
many more failed assign tests ( #14153 )
...
assign is quite broken
2026-01-14 16:20:28 -05:00
nimlgen
8c55ef4f01
amd: cleanup props ( #14145 )
...
* amd: cleanup props
* f
2026-01-14 20:27:41 +03:00
chenyu
899a56446e
failed assign test cases with write before read ( #14148 )
...
slice assign write before read fails now. this is why kv cache needs a realize
2026-01-14 10:30:50 -05:00
chenyu
986e865830
fix TINY_BACKEND=1 cumsum ( #14138 )
...
* fix TINY_BACKEND=1 cumsum
old hack was wrong, need to apply contiguous on the input
* test time
* test_linalg_svd is slow
2026-01-14 09:54:49 -05:00
qazal
434dbafab5
optional Estimates in KernelInfo ( #14147 )
...
* optional Estimates in KernelInfo
* custom asm test plumbing
* s_code_end
* estimates test
* vaddr arg in global_store
* kernel desc
* Ops.DEVICE name
2026-01-14 22:55:03 +09:00
qazal
76b577ee76
viz: only SIMD name in sqtt timeline rows ( #14146 )
2026-01-14 20:13:27 +09:00
George Hotz
e5500ae4ad
add ALU stuff to default perf counters ( #14135 )
...
* add ALU stuff to default perf counters
* lds
* add alu utilization
* cleaner
* format as percent
* cleanest
* roc
2026-01-14 19:47:59 +09:00
nimlgen
86708ccac5
hip_ioctl: dump aql ( #14142 )
2026-01-14 13:15:10 +03:00
nimlgen
f9147422a3
ci: add setcap ( #14143 )
2026-01-14 13:15:01 +03:00
nimlgen
62c1a014a6
amd: rename to be consistent ( #14141 )
2026-01-14 11:41:04 +03:00
Christopher Milan
e0eea0d833
autogen: verify all files in CI ( #14140 )
...
* autogen: verify all files in CI
* dont delete libclang
2026-01-14 02:35:54 -05:00
chenyu
2a2c1eacf6
disable fast_idiv on metal ( #14137 )
...
there's a metal compiler bug which was the root cause that keccak needs a contigous hack
2026-01-13 21:40:40 -05:00
wozeparrot
a92778aa0c
tk: fa multi fix ( #14134 )
2026-01-13 17:22:15 -08:00
George Hotz
2ab18ea7e3
assembly/amd: use xml instead of pdf ( #14118 )
...
* assembly/amd: use xml instead of pdf
* use amdxml to generate info about op sizes
* fix many tests with invalid instructions
* fix info generation
* chad xml fixes many bugs
* rename to operands
* simplify
* amdxml
* bug fix
2026-01-14 10:03:37 +09:00
qazal
002ea39da7
assembly/amd: use Tensor.custom_kernel to run assembly ( #14125 )
...
* assembly/amd: use Tensor.custom_kernel to run assembly
* PRINT_ASM=1 is DEBUG=4
2026-01-14 08:29:25 +09:00
chenyu
fe00682502
clean up svd tests ( #14133 )
...
removed from test_ops and added to TestTorchBackend
2026-01-13 16:32:21 -05:00
chenyu
84b88a0a31
more doc of newly added functions ( #14132 )
2026-01-13 15:48:45 -05:00
chenyu
e610821c52
Tensor.cummin and Tensor.nonzero ( #14131 )
2026-01-13 15:09:56 -05:00
chenyu
176a934ddd
Tensor.diagonal support offset and dims ( #14130 )
2026-01-13 14:49:06 -05:00
chenyu
2a217ba206
tinybackend isin and log10 ( #14120 )
...
can use tinygrad directly
2026-01-13 14:14:09 -05:00
qazal
79d00521f8
viz: fix cfg err when endpgm is in the middle of stream ( #14128 )
...
* kernel from beautiful_mnist
* minimal test
* correct way to do this
* rm that
2026-01-14 02:00:34 +09:00
qazal
7fe91e5db9
viz: cleanup cfg renderer ( #14127 )
...
* remove colorDomains from sqtt
* colors in js
* work
2026-01-14 01:10:42 +09:00
nimlgen
1364449cab
system: early pci perm check ( #14126 )
...
* system: early pci perm check
* l
2026-01-13 17:45:05 +03:00
George Hotz
a28c8105a5
assembly/amd: 2% faster amd_uop_matmul + SQTT ( #14122 )
...
* assembly/amd: 2% faster amd_uop_matmul
* SQTT_TOKEN_EXCLUDE + SQTT_SIMD_SEL
* sqtt printer
* fix printer
* fast decode
* fast decoder
* test packet counts
* ugh it's not faster
* dead
2026-01-13 19:55:32 +09:00
qazal
6cd318e377
viz: add link to graph from sqtt ( #14123 )
2026-01-13 17:31:03 +09:00
qazal
fd10fd245a
viz: cfg tokenizer fix and unit tests ( #14121 )
...
* output Ops.BINARY
* failing test for the cfg
* dsl renamed to offset and sz
* add better asserts
* move the note
2026-01-13 15:08:55 +09:00
chenyu
05fcb57696
also return index in Tensor.cummax ( #14117 )
...
* also return index in Tensor.cummax
* fix
2026-01-12 22:42:10 -05:00
wozeparrot
7c967399a4
tk: add failing test for fa multidevice ( #14116 )
2026-01-12 19:11:09 -08:00
George Hotz
330a0b686e
assembly/amd: clean up dsl and make type verification strict ( #14102 )
...
* assembly/amd: start newdsl
* work
* newdsl upd
* Reg is p nice
* cleaner
* work
* getting clean
* all fields
* more BitFields
* redo the pdfs with dsl2 syntax
* no lit
* cleanups
* more defaults
* fix get and remove crap
* aliases
* ugly but kind of works
* NULL, not rawimm
* clean up defaults
* only dsl
* asm fixes
* lit fixup
* more lit
* cleanups
* olddsl
* single pcode dict
* emu sort of works
* trash test
* global is global
* types property
* reg mods
* fix a few tests
* remove monkey patch
* fixes
* less hacks in tests
* less hacks in tests
* 4 test failures
* hw tests all pass
* fix compare emulator
* fix some tests
* 3 more
* fix and shorten sqtt
* handwritten
* fix validation
* test corrections
* all types validate
* fix dsl2 tests
* fix bugs in disasm
* skips on cdna
* work
* repr with reg[]
* fix bitfield tests
* merge pcodes in dsl
* remove override
* disasm uses inst.types
* simpler
2026-01-13 08:52:16 +09:00
C T
a8c821f45e
add Tensor.log10 with test\test_ops.py::TestOps::test_log10 ( #14113 )
2026-01-12 13:45:47 -05:00
chenyu
6b0a9f5ee6
don't strip sink in to_uops_list [pr] ( #14111 )
2026-01-12 11:19:03 -05:00
chenyu
cad7feec02
more onnx ops ( #14104 )
...
HannWindow, HammingWindow, BlackmanWindow, Hardmax, LpNormalization
2026-01-12 09:11:13 -05:00
nimlgen
635ed2df9d
system: use pci.PCI_VENDOR_ID instead of const ( #14109 )
2026-01-12 15:24:09 +03:00
qazal
6c0f0e29ff
Revert "viz: loading... ( #14107 )" ( #14108 )
...
This reverts commit 9347757c2d .
2026-01-12 20:45:37 +09:00
nimlgen
9347757c2d
viz: loading... ( #14107 )
2026-01-12 13:24:24 +03:00
wozeparrot
3a92df66ea
feat: bump version to 0.12.0 ( #14105 )
2026-01-11 21:19:49 -08:00
chenyu
7c234a9c7c
wgsl cleanup [pr] ( #14103 )
...
refactor common pack functions
2026-01-11 21:23:45 -05:00
George Hotz
91bde927ef
assembly/amd: split asm.py into asm.py and disasm.py ( #14101 )
...
* split asm.py into asm.py and disasm.py
* split decoder
* move to pcode
* tests
2026-01-12 07:22:02 +09:00
George Hotz
44135e2e84
assembly/amd: always use v_nop in test for rocprof-trace-decoder ( #14100 )
...
* assembly/amd: always use v_nop in test for rocprof-trace-decoder
* test touchups
2026-01-12 05:31:58 +09:00
George Hotz
8b1b15aec0
assembly/amd: SQTT support ( #14099 )
...
* assembly/amd: SQTT support
* simpler
* cmp wave
* instruction compare
* rocprof decode
* simpler
* no llvm
* no strcmp
2026-01-12 05:07:17 +09:00
nimlgen
8b5ff403fa
am: flag successful finalization ( #14097 )
...
* am: flag successful finalization
* import
2026-01-11 16:24:53 +03:00
qazal
d8aba24967
amd: use kernel descriptor struct in AMDProgram ( #14096 )
2026-01-11 18:25:16 +09:00
chenyu
9973a81356
add channels_last to QLinearGlobalAveragePool ( #14094 )
...
and other minor cleanups
2026-01-10 18:38:19 -05:00
chenyu
c5492f8f75
cstyle cleanup [pr] ( #14093 )
2026-01-10 09:44:50 -05:00
nimlgen
d5f954858d
viz: show precise timings ( #14092 )
2026-01-10 16:21:08 +03:00
nimlgen
3e2c05ee9f
hevc: decoder as iterator ( #14091 )
2026-01-10 14:57:56 +03:00
chenyu
35c9701df0
update outdated tests and comments ( #14090 )
2026-01-10 01:00:48 -05:00
chenyu
92246ea731
update tests, WEBGPU=1 pytest . passes ( #14089 )
...
* update tests, `WEBGPU=1 pytest .` passes
* minor update
2026-01-10 00:03:02 -05:00
chenyu
c34c6d9468
fix wgsl packed_store can drop valid ( #14088 )
...
* fix wgsl packed_store can drop valid
* fix
2026-01-09 15:22:06 -05:00
chenyu
eacccc5ace
more disk assign tests ( #14087 )
...
covers more edge cases
2026-01-09 14:14:52 -05:00
chenyu
ed295e74dc
don't skip gguf test if ggml is not installed ( #14086 )
...
* don't skip gguf test if ggml is not installed
should just let it fail
* fix
2026-01-09 12:05:58 -05:00
chenyu
cff33c8d78
add some disk assign tests ( #14085 )
2026-01-09 11:50:59 -05:00
chenyu
74fa3c7d09
decomp pow for LVP ( #14084 )
...
test failed due to undefined behavior, so use decomp instead
2026-01-09 10:50:28 -05:00
b1tg
0fbc551622
train bert with fp8 ( #13874 )
...
* fp8 train
* clean
* lint
* test fix from #13439
* skip first/last layer
* rm __init__, restore unroll <=32 check
* tests
* clean test, remove unused
* multi-gpu test, clean quantize_to_fp8
* remove bert contiguous
* run script
* test: better check
* run script search
* add seed in bert data shuffle
* move script to mi350x folder
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
2026-01-09 09:21:59 -05:00
nimlgen
ba209d6305
am: utc_l1_enable on all sdma inst ( #14083 )
2026-01-09 17:17:05 +03:00
nimlgen
6b308b89b7
viz: timeline time ( #14080 )
...
* viz: timeline time
* less lines
* cut
2026-01-09 16:43:45 +03:00
nimlgen
40f9fa2db4
autogen: new kfd ( #14082 )
2026-01-09 16:08:17 +03:00
qazal
2917ed1616
roc: propagate decoder errors to main thread ( #14081 )
...
* roc: propagate decoder errors to main thread
* types
* add cause
2026-01-09 21:10:45 +09:00
qazal
f3f4d9b387
viz: fix disasm node width ( #14079 )
2026-01-09 16:37:37 +09:00
anu
c70c112254
fix CUDA=1 disassembly (VIZ=1) by stripping null terminator ( #14046 )
...
* fix ptxas disassembly bug
* single '
* move fix to get_bytes
* move rstrip
---------
Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2026-01-09 15:19:59 +09:00
qazal
13e5d00d0e
viz: exclude comma in register highlight ( #14078 )
...
* viz: exclude comma in register highlight
* simplify
2026-01-09 15:10:30 +09:00
qazal
a071adffc0
viz: amdgpu disassembly register highlighting UI ( #14059 )
...
* viz: amdgpu disassembly register highlighting
* minor details
* details from IDA
* more details from IDA
* refactor token colors
* move tokenizer to python
* simplify
* minimal tokenizer for registers
* all the operand types
2026-01-09 11:27:09 +09:00
chenyu
b878f9d5a4
reuse Tensor init with const path [pr] ( #14076 )
2026-01-08 17:49:37 -05:00
chenyu
efcb32f6a9
unique const when requires_grad is set to True ( #14075 )
...
* unique const when requires_grad is set to True
* fix pyrender
2026-01-08 16:30:45 -05:00
chenyu
b34c637767
support bfloat16 for CL ( #14073 )
2026-01-08 14:14:29 -05:00
Garret Castro
16b652302e
skip bf16 test if not supported by device ( #14070 )
2026-01-08 13:37:24 -05:00
nimlgen
3f61a96d79
am: SetSoftMaxByFreq on gfx10+ ( #14068 )
2026-01-08 17:00:03 +03:00
George Hotz
e7b5d8a434
assembly/amd: more RDNA4 asm ( #14062 )
...
* rdna4 more
* asm
* fixes
* assembly/amd: handwritten wmma failing test
* passes
* wmma default hacks
* space
* 0 skips in rdna3/rdna4 disasm
* more RDNA4 tests
---------
Co-authored-by: qazal <qazal.software@gmail.com>
2026-01-08 05:09:37 -08:00
nimlgen
e372c841ba
hevc: beam in decode ( #14067 )
...
* hevc: beam in decode
* fine
* g
2026-01-08 15:47:16 +03:00
nimlgen
1732a4ec4b
am: rework set_clocks ( #14065 )
2026-01-08 15:33:32 +03:00
nimlgen
f3aceaa08b
hevc: fast decoder ( #14057 )
2026-01-08 15:20:37 +03:00
qazal
309197bca5
assembly/amd: test_roundtrip for cdna/rdna4 ( #14066 )
2026-01-08 21:03:13 +09:00
qazal
15a056715d
fix amd assembly IDE tests on macbook ( #14063 )
2026-01-08 17:27:52 +09:00
wozeparrot
027b935269
tk: fix grouped load store ( #14035 )
2026-01-07 22:38:02 -08:00
George Hotz
2db04d0696
assembly/amd: start adding RDNA4 support ( #14060 )
...
* assembly/amd: start adding RDNA4 support
* rdna4 asm
2026-01-07 21:19:30 -08:00
George Hotz
cb500466c2
assembly/amd: amd_asm_matmul ( #13989 )
...
* amd_asm_matmul
* dsl transform
* asm roundtrip
* fixed
* less
* better
* more
* simpler
* simplify
* lil
* simpler
* compact
* work
* cleanups
* simplify
* simpler
* cleanup
* name the regs
* simp
* big simp
* big simp
* simp
* acc grid
* fast
* stuff
* fast
* simpler
* owrks
* save vgprs
* save vgprs
* Compact
* less VGPRs
* after
* SQTT support
* fastest
* faster
* lil faster
* tile regs
* faster
* readable
* one more
* simpler
* lil simpler
* NO_GLOBAL skips early globals
* stock kernel
* cleanups
* cleanups
* one b reg
* safe reg changes
* acc is compact now
* remove confusing stuff
* sregs
* lds cleanups
* vopd
2026-01-07 20:11:05 -08:00
chenyu
3caa1e2c98
fix cast HALF with PYTHON backend ( #14058 )
2026-01-07 16:52:05 -05:00
chenyu
5f1ede7f7e
clean up test_dtype ( #14055 )
...
use less lambda
2026-01-07 15:45:42 -05:00
nimlgen
5bd4593eda
hevc: cleaner decoder ( #14056 )
...
* hevc: cleaner decoder
* nn
2026-01-07 18:29:30 +03:00
b1tg
241f0402b4
add seed in bert data shuffle ( #14054 )
2026-01-07 10:02:05 -05:00
nimlgen
25c82dd242
nv: profile nvdec ( #14053 )
2026-01-07 15:56:54 +03:00
qazal
35900290b2
viz: configure text height for cfg ( #14052 )
2026-01-07 18:58:56 +09:00
chenyu
87f4bc5446
update variable names around jit [pr] ( #14049 )
...
lbs, st_vars_dtype_device and rawbuffers no more
2026-01-06 22:32:41 -05:00
chenyu
2833c5a54b
few more jit tests with multi tensor inputs ( #14047 )
2026-01-06 22:05:22 -05:00
chenyu
72a3f78d19
jit includes tensor inputs in containers ( #14043 )
...
* jit includes tensor inputs in containers
* cleanup
2026-01-06 19:42:06 -05:00
chenyu
c714881832
don't allow jit input to be const ( #14045 )
...
* don't allow jit input to be unbuffered like const
* just const to fix multi
* fix rnnt
2026-01-06 18:15:22 -05:00
chenyu
a8896f28e1
test_unrealized_const_input_frozen ( #14044 )
...
unrealized const is not replaced in jit
2026-01-06 14:17:43 -05:00
nimlgen
325f4006ff
amd: copies w/o sdma ( #14036 )
...
* amd: copies w/o sdma
* as_args
* fixes
* f
2026-01-06 21:15:58 +03:00
chenyu
7fb18f7e47
raise when jit fxn returns non-Tensor output ( #14042 )
2026-01-06 12:59:20 -05:00
chenyu
4491ec0c9e
JitError ( #14041 )
...
* JitError
* test_symbolic_jit
2026-01-06 12:19:50 -05:00
chenyu
6ddddc68af
test jit tolist failure ( #14040 )
...
also moved tests to test_jit_footguns
2026-01-06 11:16:57 -05:00
chenyu
b699b9f763
test case for jit a function with item call ( #14039 )
...
* test case for jit a function with item call
output is silently wrong now
* no dtype
2026-01-06 10:40:43 -05:00
nimlgen
02084f5376
mockdsp: use dsp allocator ( #14037 )
...
* mockdsp: use dsp allocator
* fix
* ?
2026-01-06 16:04:47 +03:00
wozeparrot
2b3e01e79c
tk: support sliced local -> reg load ( #14034 )
2026-01-06 05:33:24 -05:00
George Hotz
45f7fd073d
assembly/amd: pcode bug fixes ( #14032 )
...
* bring over pcode parser
* fixes
* pdf test
* delay alu
2026-01-06 00:15:48 -08:00
wozeparrot
21d0f6bb76
tk: flat global -> local load ( #14033 )
2026-01-05 23:35:53 -08:00
qazal
3170365a5b
visualize SQTT with the same cfg infrastructure ( #13870 )
...
* start
* rough sketch
* post render dag
* art
* intro g key
* work
* custom color scale
* colors
* more blue
* better
* smaller
* use for loop in test
2026-01-06 14:53:20 +09:00
Christopher Milan
0120d69caa
autogen: avcodec (and simplify workflow) ( #14031 )
...
* simplify autogen workflow and add avcodec verification
- Consolidate all regeneration into single steps (delete + import)
- Remove continue-on-error and individual diff checks
- Use git diff at end to catch all differences
- Show artifact URL in failure message
- Add avcodec.py verification
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* patch avcodec
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-05 23:30:25 -05:00
George Hotz
20653d2996
assembly/amd: make pdf.py code shine ( #14029 )
...
* assembly/amd: make pdf.py code shine
* no merge
* pdf2 is the future
* something
* regen enums
* test
* work
* remove junk
* write
* pcode extraction
* pdf2 passes all tests
* simplify
* simpler pdf
* late filter
* remove hacks
* simplify pdf2.py
* field type
* remove defaults
* don't export srcenum
* simple pdf.py
* simpler
* cleaner
* less hack in PDF
2026-01-05 18:49:40 -08:00
qazal
ea7b149ca5
viz command line tool ( #14030 )
2026-01-06 10:19:47 +09:00
Christopher Milan
f86c728440
load libclang as 'libclang.so' too ( #14028 )
2026-01-05 16:56:16 -05:00
chenyu
eda6a73897
clean up canonicalize_device ( #14027 )
...
centralize the type check
2026-01-05 10:29:55 -05:00
chenyu
ce464b147a
clean up comments that mentioned outdated terms ( #14026 )
...
no MultiLazyBuffer and no ShapeTracker in comments
2026-01-05 09:42:58 -05:00
chenyu
83063cc3e4
onnx TensorScatter ( #14024 )
2026-01-05 09:05:22 -05:00
chenyu
9497ec00f2
fix onnx attention permute ( #14025 )
...
* fix onnx attention permute
* skip test_attention_4d_fp16_cpu too
2026-01-05 08:58:50 -05:00
qazal
5cff5698f7
viz: g key toggles graph and text view ( #14023 )
2026-01-05 22:41:45 +09:00
chenyu
7a81a3cb98
more passed onnx tests ( #14022 )
2026-01-05 07:46:27 -05:00
kim yongjin
34fe105386
remove unused LazySeq ( #14020 )
2026-01-05 07:38:33 -05:00
qazal
4f2f38bf64
viz: split cfg and table render ( #14021 )
2026-01-05 20:59:08 +09:00
nimlgen
70405b4f3c
am_smi: mi350 ( #14018 )
2026-01-05 13:10:56 +03:00
Christopher Milan
b2a0b9c551
autogen: dump patch in CI ( #14010 )
...
* autogen: don't fast-fail, produce patch artifact on differences
All verification steps now use continue-on-error to run completely.
Each job generates a patch artifact containing all differences found.
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
* add gen from header test
* fix tests
* fail if diff
* add forward decl autogen test
* remove confusing/wrong comments
* macos unittests set LIBCLANG_PATH
---------
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-04 22:38:12 -05:00
chenyu
aae08b20e0
enable passed onnx tests ( #14017 )
2026-01-04 22:12:50 -05:00
chenyu
785d04d127
simpler einsum ( #14014 )
2026-01-04 20:38:59 -05:00
chenyu
f6a78a29e0
support einsum trace ( #14012 )
...
* support einsum trace
* test_einsum_scalar_cpu
2026-01-04 19:27:27 -05:00
George Hotz
404eed6172
assembly/amd: improve tests for asm ( #14007 )
...
* assembly/amd: improve tests for asm
* upd
* skip
* tests
* re bug
* more passing
* cleanups
* cdna fixups
* improve tests, better CDNA parsing
* fix CI
* no defs
* simpler
* all pass
* from pdf
* regen
2026-01-04 15:14:08 -08:00
wozeparrot
f550f9204c
fa: failing test for bwd jit ( #14009 )
...
* tk: failing test for bwd jit
* feat: mark expectedFailure
* clean: spaces
2026-01-04 16:57:43 -05:00
George Hotz
7abf4591ba
use bitsize on dtype ( #14011 )
...
* use bitsize on dtype [pr]
* bitsize
* bitsize in js export, but might be wrong
* reverts
* revert that
2026-01-04 12:16:21 -08:00
chenyu
cfb8bf5814
faster image load ( #13977 )
...
sometimes image load does not need to init with NAN
2026-01-04 13:09:59 -05:00
George Hotz
7ebda28692
assembly/amd: add CDNA support to asm ( #13982 )
...
* add CDNA support
* more cdna tests
* something
* fix more stuff
* more work
* simpler
* simplier
* cdna
* disasm
* less skip
* fixes
* simpler
2026-01-04 08:53:56 -08:00
chenyu
ad041416ca
delete unused rewrite rule [pr] ( #14006 )
2026-01-04 09:48:52 -05:00
nimlgen
bf356ae996
am: mi300 48bit address space ( #14004 )
...
* am: mi300 48bit address space
* fix
2026-01-04 15:19:25 +03:00
nimlgen
606786e152
am: do not sleep for each hive node during resets ( #14003 )
2026-01-04 14:02:11 +03:00
George Hotz
34ea053b26
assembly/amd: clean up pcode, jit pcode instead of static ( #14001 )
...
* assembly/amd: clean up pcode
* regen
* lil
* jit the pcode
* sendmsg
* cleanups
* inst prefetch lol
2026-01-03 23:06:15 -08:00
kamilisjon
280790e438
Reuse toposort in recursive_property ( #13993 )
2026-01-03 22:04:13 -08:00
kamilisjon
9a9564118c
[pr] Delete reverse_toposort ( #13987 )
...
* Delete reverse_toposort
* Update comment and profiler name
* Update profiler name
2026-01-03 22:03:44 -08:00
George Hotz
8328511808
assembly/amd: make the emu.py code shine ( #13996 )
...
* assembly/amd: make the code shine
* lil clean
* reg back in pcode
* cleanups
* gen fma_mix
* no writelane hacks
* fn cleanup
* dead vgpr_write
* readable
* smem
* cleanup bench_emu
* speedups
* simpler and faster
* direct inst._fn
* split fxn
* Revert "simpler and faster"
This reverts commit e85f6594b3 .
* move lds to wavestate
* dispatcher
* pc in dispatch
* literal isn't wavestate
* cleanups + program
* one readlane
* exec_vop3sd in exec_vop
* cleaner exec_vopd
* fully merge VOP3P
* no special paths
* no SliceProxy
* low=0
* no bigint
* failing tests
* fma on python 3.13
2026-01-03 20:33:09 -08:00
qazal
bdb421f13e
process_replay: passthrough sink arg for Ops.PROGRAM input ( #14000 )
2026-01-04 13:09:39 +09:00
Galax
66caa9fe1d
fix: library linking for fedora systems ( #13999 )
2026-01-03 17:40:56 -08:00
chenyu
8003db2a28
test case of NOOP store load folding ( #13997 )
2026-01-03 14:39:26 -05:00
chenyu
c1b8644a3f
test removing expander rules [pr] ( #13994 )
2026-01-03 12:38:01 -05:00
Christopher Milan
35c2870b1f
gate image_conv2d pitch hacks on IMAGE==1 ( #13995 )
...
* gate image_conv2d pitch hacks on IMAGE==1
* fix opencl image copies
* cleanup
2026-01-03 12:27:31 -05:00
nimlgen
a49924a0e9
hcq: _sleep report status ( #13992 )
...
* hcq: _sleep report status
* msg
* print all
2026-01-03 14:28:28 +03:00
nimlgen
3b354bc11f
hcq: better queue managment ( #13991 )
2026-01-03 13:11:15 +03:00
nimlgen
efb2ae87c6
hcq sync aql ( #13756 )
...
* hcq sync aql
* w
2026-01-03 12:59:24 +03:00
qazal
bd55507ee4
RDNA3 fp16 assembly gemm 85 TFLOPS ( #13990 )
2026-01-03 18:34:23 +09:00
wozeparrot
6242a9d151
tk: no global copy and clear ranges ( #13988 )
2026-01-02 23:45:15 -08:00
wozeparrot
9f082e8e25
fa: split kv bwd into 2 kernels ( #13981 )
2026-01-02 18:45:51 -08:00
qazal
2cc64d71b0
simplify mi350x gemm / viz asm tests ( #13984 )
...
* mi350x gemm cleanup
* asm tests work
* simpler asm tests
2026-01-03 11:11:07 +09:00
chenyu
7cbafb2ef1
update hypothesis min version ( #13983 )
...
there was a local_constants perf regression that made hypothesis related tests slow
2026-01-02 21:01:57 -05:00
Christopher Milan
9dc524536f
IMAGE=1 creates "dynamic" images ( #13769 )
...
* remove image from BufferSpec
* cl tiny_gemm (64) works
* mypy
* padding
* openpilot CL
* reshape properly
* remove extra qcom checks
* pad output
* mypy
* update compile test
* move undo
* TestImageCopy valid images
* TestImageRealization valid images
* TestImageDType valid images
* cleanups
* test_renderer_failures
* ruff
* mypy
* simplify ops_qcom
* bump step time
* Revert "bump step time"
This reverts commit 75a037c7d0 .
* "dynamic textures" are optional
* a start
* IMAGE=1 works, no FLOAT16
* fast but wrong
* mypy
* some fixes
* better
* works
* refactor
* oops
2026-01-02 16:22:39 -05:00
Christopher Milan
61dc70f1a8
add driving_vision IMAGE=1 benchmark ( #13979 )
2026-01-02 13:58:27 -05:00
George Hotz
0e282025ff
assembly/amd: split test_emu into hw tests ( #13966 )
...
* assmebly/amd: split test_emu into hw tests
* hw tests
* bugfixes
* more tests and fix
2026-01-02 08:04:56 -08:00
chenyu
2e2b5fed12
fix misspellings ( #13976 )
2026-01-02 10:37:38 -05:00
nietras
f49e4714af
Fix spelling errors in README for AMD assembly ( #13975 )
2026-01-02 10:15:20 -05:00
b1tg
a78fcc55a4
amd tc 1616128 ( #13439 )
...
* amd tc 1616128
* fix test
* remove hardcoded check in test
2026-01-02 09:01:05 -05:00
chenyu
fcbb896e05
remove unused to_struct [pr] ( #13973 )
2026-01-02 08:54:57 -05:00
nimlgen
ff7853a65a
am: fix aid doorbells ( #13971 )
2026-01-02 15:53:44 +03:00
nimlgen
42abb0586c
am: fix aid doorbells ( #13972 )
2026-01-02 15:53:13 +03:00
nimlgen
ebbaad6bfd
am: enable all sdma engines ( #13970 )
2026-01-02 15:25:15 +03:00
qazal
5f52266225
mi350x gemm: use Tensor.custom_kernel in asm test ( #13969 )
...
* mi350x gemm: use Tensor.custom_kernel in asm test
* A @ B for baseline
2026-01-02 18:30:50 +09:00
George Hotz
5a1a561e0f
assembly/amd: rdna4 autogen ( #13967 )
...
* assembly/amd: add pcode ds ops
* refactors
* fix ds op
* update autogen
* fix flat bug
* more tests
* fix emu test
* that's a hack
* generic
* fix all tests
* two tests
* fix test failure
* better
* remove __all__
* assembly/amd: fix autogen for RDNA4
2026-01-01 23:12:18 -05:00
wozeparrot
b27527f05a
fix: missed inner tracked range ( #13964 )
2026-01-01 18:09:57 -08:00
wozeparrot
ecbac8a338
tk: fa cleanups + causal test ( #13963 )
2026-01-01 18:05:00 -08:00
chenyu
af0392efea
only set DiskDevice.size if it opens successfully ( #13962 )
2026-01-01 19:33:26 -05:00
chenyu
e036d6df89
properly fix DiskDevice reuse ( #13961 )
2026-01-01 18:08:23 -05:00
George Hotz
dfb813b760
assembly/amd: add pcode ds ops ( #13939 )
...
* assembly/amd: add pcode ds ops
* refactors
* fix ds op
* update autogen
* fix flat bug
* more tests
* fix emu test
* that's a hack
* generic
* fix all tests
* two tests
* fix test failure
* better
* remove __all__
2026-01-01 16:24:13 -05:00
chenyu
cb7c76a3bd
update test_fuzz_failure to not contruct full UOp ( #13960 )
2026-01-01 15:09:58 -05:00
chenyu
51398edf9c
fix indirect import ( #13958 )
...
also deleted old external tests
2026-01-01 14:22:45 -05:00
chenyu
8e416df438
simpler InvalidType [pr] ( #13957 )
...
simpler singleton pattern
2026-01-01 13:55:51 -05:00
nimlgen
b8ea0d779c
am: remove pipe, queue from setup_ring ( #13947 )
2026-01-01 21:06:41 +03:00
chenyu
4d5c4d256d
update tqdm for edge case ( #13956 )
...
1.00kit/s and not 1000it/s for value 999.5
2026-01-01 11:37:26 -05:00
chenyu
ed222070f7
update xlog2 fp16 decomp to not use fp32 ( #13955 )
2026-01-01 11:18:29 -05:00
chenyu
ce84a23142
remove tee in benchmark ( #13954 )
2026-01-01 10:55:36 -05:00
b1tg
24723327ac
fix tc_up in search ( #13438 )
...
* tensor_core is missing from Scheduler
* test upcast max
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
2026-01-01 10:25:08 -05:00
qazal
9726500de8
enable using assembly in Tensor.custom_kernel ( #13895 )
2026-01-02 00:12:01 +09:00
qazal
c0f52c9dcb
split assembly gemm to per arch directory ( #13953 )
2026-01-02 00:10:22 +09:00
chenyu
c69470be52
fix test_symbolic_arange_sym_step ( #13952 )
2026-01-01 09:41:07 -05:00
chenyu
b91b46091c
delete test_tensor_uop ( #13951 )
...
old test for shape tracker. also update tests that refer shapetracker
names
2026-01-01 09:25:05 -05:00
chenyu
17ef4af72c
new ceildiv that fixed symbolic conv ( #13944 )
...
* new ceildiv that fixed symbolic conv
* smaller test case
2026-01-01 09:02:41 -05:00
qazal
6a5430ab00
correct args order in mi350x gemm ( #13949 )
2026-01-01 23:01:46 +09:00
chenyu
baff10d32c
clean up Tensor.svd slices ( #13948 )
2026-01-01 08:18:45 -05:00
nimlgen
1c5ed8e8b5
am: remove doorbells from setup_ring ( #13946 )
2026-01-01 14:39:21 +03:00
haofei
526fd4ec71
Fix SVD rank‑1 Jacobi rotation when tau == 0 ( #13945 )
2026-01-01 00:30:18 -05:00
haofei
20777f30b9
Fix QR/SVD NaNs on zero/orthogonal inputs ( #13943 )
2025-12-31 23:40:09 -05:00
chenyu
0ed58c1fcd
clean up some functions in helpers [pr] ( #13942 )
2025-12-31 18:29:16 -05:00
chenyu
e2987001ee
unify pre-commit mypy and ci mypy ( #13940 )
2025-12-31 17:51:51 -05:00
chenyu
8bf7c9c1d2
no-op cleanups for ptx [pr] ( #13938 )
2025-12-31 17:28:39 -05:00
George Hotz
2bb07d4824
assembly/amd: move Reg out of the psuedocode ( #13934 )
...
* assembly/amd: move Reg out of the psuedocode
* remove extra
* fix pcode tests
* simpler pcode
* simpler
* simpler
* cleaner
* fix mypy
2025-12-31 15:34:51 -05:00
chenyu
52acadc160
consolidate IGNORE_OOB=0 tests ( #13937 )
...
add a new unit test file and add more cases
2025-12-31 15:24:20 -05:00
chenyu
c0c1c1c8c8
remove unused validate rule ( #13936 )
2025-12-31 15:02:49 -05:00
chenyu
b6d08f247d
assert z3_xor input type ( #13933 )
2025-12-31 13:37:57 -05:00
George Hotz
f14428090f
assembly/amd: speed up emulator ( #13932 )
2025-12-31 13:32:25 -05:00
Christopher Milan
13973e4dea
refactor image pitch ( #13928 )
2025-12-31 13:22:38 -05:00
chenyu
051fe6c8bc
less toposort iteration in oob validate ( #13929 )
2025-12-31 13:16:34 -05:00
chenyu
a9a7b33404
IGNORE_OOB=0 in CI ( #13903 )
2025-12-31 12:56:59 -05:00
George Hotz
29402034a1
assembly/amd: cleanups to asm and emu ( #13912 )
...
* a bunch of cleanups
* ops are back
* bug fixes
* cleanups
* a lil simpler
* more refactors
* _disasm_vop1
* sops
* more
* continue
* more
* num_srcs
* simpler
* no _is16
* op cleanups
* isinstnace
2025-12-31 12:46:11 -05:00
chenyu
ba9aa5cd6f
skip some PTX IGNORE_OOB validation ( #13927 )
2025-12-31 12:40:21 -05:00
chenyu
4968060ad4
fix IGNORE_OOB=0 for WEBGPU ( #13926 )
2025-12-31 10:41:28 -05:00
chenyu
35bd39e4ba
update mypy and torch version in ci ( #13925 )
2025-12-31 10:29:28 -05:00
George Hotz
b998a80b5d
assembly/amd: split generated stuff into enum/ins ( #13924 )
2025-12-31 10:10:52 -05:00
chenyu
404755bafd
merge ci ruff tests and update ruff version ( #13922 )
2025-12-31 09:53:49 -05:00
nimlgen
25440f0f72
all2all ( #13902 )
...
* all2all
* um
* fix
* x
* um
* simler
* mypy
* fix
* t
* cmnts
2025-12-31 16:38:32 +03:00
nimlgen
f7ee644950
amd: lazy sdma queue allocation ( #13920 )
...
* ams: lazy queue
* nv
* linter
* f
2025-12-31 15:17:13 +03:00
nimlgen
b063518ea7
am: several sdmas ( #13919 )
...
* am: several sdmas
* fix
2025-12-31 14:19:22 +03:00
qazal
b23f4517ab
prep mi350x gemm for python dsl ( #13918 )
...
* start by pruning existing asm
* better branch names
* split to template and real instructions
2025-12-31 20:00:57 +09:00
qazal
3f3786ded9
mmapeak: fix compiler import ( #13915 )
2025-12-31 16:52:23 +09:00
Christopher Milan
a14896fff2
refactor QCOM arg parsing ( #13914 )
...
* refactor QCOM arg parsing
* ruff
* mypy
2025-12-30 19:26:02 -05:00
Christopher Milan
c475c3a6d7
remove useless cast ( #13911 )
2025-12-30 19:24:29 -05:00
George Hotz
0221b96761
assembly/amd: fix all ops tests ( #13910 )
...
* assembly/amd: fix all ops tests
* test_ops with smaller sizes
* ds store/load 2addr
2025-12-30 18:01:34 -05:00
chenyu
dc27eb48ac
remove PYTHONPATH="." from test.yml ( #13909 )
2025-12-30 17:00:16 -05:00
George Hotz
efc99d0c55
assembly/amd: more refactors ( #13907 )
...
* assembly/amd: more refactors
* more refactors
* more refactors
* simpler emu
* generate.py
* regen all
* cleanups
* more
* work
* more readme
* lil
2025-12-30 16:13:24 -05:00
George Hotz
49d1bf93d6
assembly/amd: refactor asm.py to be simpler ( #13900 )
...
* assembly/amd: refactor asm.py
* assembly/amd: refactor asm.py to be simpler
* multiple fxns
* fast
* more tests pass
* regen
* stop decode
2025-12-30 13:51:40 -05:00
George Hotz
04c79505ec
no subnormal bf16 ( #13905 )
2025-12-30 13:02:53 -05:00
chenyu
39f99b207a
update IGNORE_OOB error message ( #13904 )
...
IGNORE_OOB=1 to disable
2025-12-30 12:25:55 -05:00
George Hotz
7e14cdcb06
assembly/amd: clean up clt/ctz hack ( #13901 )
...
* assembly/amd: clean up clt/ctz hack
* add breaks
2025-12-30 11:59:28 -05:00
George Hotz
69cdc8066d
assembly/amd: add dtype tests to AMD IDE CI ( #13899 )
...
* add dtype tests to AMD IDE CI
* more tests
* add trig preop
* regen done
* split to amd autogen
* simpler
2025-12-30 11:09:51 -05:00
George Hotz
9c89be5235
assembly/amd: fix v_perm_b32 + PC fixes ( #13897 )
...
* assembly/amd: fix v_perm_b32
* add pc support
2025-12-30 09:25:40 -05:00
George Hotz
2b838dc1d8
assembly/amd: fix AMD_LLVM=1 support in emulator ( #13881 )
...
* fix AMD_LLVM=1 support in emulator
* more llvm with dtype
* work
* more fixes
* fix dtype
2025-12-30 09:09:57 -05:00
nimlgen
a19d21ea9c
am: mi3xx smu clocks ( #13894 )
...
* am: mi3xx smu clocks
* x
2025-12-30 16:44:17 +03:00
qazal
b557c46233
assembly gemm clean ups, instructions for cli ( #13892 )
2025-12-30 16:14:06 +09:00
qazal
d7e1f26e3d
command line interface for sqtt viz ( #13891 )
...
* command line interface for sqtt viz
* cleanup
* api surface area
* this confuses the llms
* document
2025-12-30 12:33:21 +09:00
chenyu
ab58926b00
update sampling in test_float_cast_to_unsigned ( #13889 )
...
filter is slow for small dtypes
2025-12-29 21:35:46 -05:00
Christopher Milan
0497387e45
NIR: new-style (fix beam) ( #13887 )
...
* NIR: fix beam
* new reduce
* Revert "Revert "NIR: new-style compilers (#13875 )" (#13888 )"
This reverts commit fc4faed0b2 .
* oops
2025-12-29 18:41:29 -05:00
Christopher Milan
fc4faed0b2
Revert "NIR: new-style compilers ( #13875 )" ( #13888 )
...
This reverts commit 72236bbd3d .
2025-12-29 17:42:28 -05:00
George Hotz
94bca91f3e
assembly/amd: have asm go through the dsl ( #13886 )
...
* assembly/amd: have asm go through the dsl
* lil
2025-12-29 17:39:11 -05:00
George Hotz
7322d9ec4a
assembly/amd: add new instruction support to pcode ( #13885 )
...
* assembly/amd: add new instruction support
* more
* regen all
2025-12-29 17:30:17 -05:00
George Hotz
0d326f5b9b
fix missing instructions in psuedocode ( #13884 )
2025-12-29 16:11:22 -05:00
Christopher Milan
9c6850fc01
remove try-catches on llvm import ( #13883 )
2025-12-29 15:56:17 -05:00
George Hotz
9d8397be11
add CDNA3+RDNA4 support ( #13882 )
...
* fix CI
* remove junk
* rename lib to dsl
* correct
* cleanups
2025-12-29 15:51:29 -05:00
Christopher Milan
72236bbd3d
NIR: new-style compilers ( #13875 )
...
* NIR: new-style compilers
* mypy
* simplify NIR compilers
* lvp compiler too
* mypy
* simplify
* mypy
2025-12-29 15:31:41 -05:00
George Hotz
81cf9ea0ab
rename to extra.assembly.amd ( #13879 )
2025-12-29 14:10:55 -05:00
George Hotz
37f0fa11b6
rdna3 test cleanups ( #13878 )
...
* rdna3 test cleanups
* cleanups
* ugh DONT SKIP
2025-12-29 13:41:59 -05:00
George Hotz
35db73b231
add cdna4 support to parsers ( #13877 )
...
* add cdna4 support to parsers
* cdna4
2025-12-29 13:23:43 -05:00
Clément Verrier
d178235309
delete tree structure from CLAUDE.md ( #13876 )
...
Claude Code should be able to figure out the correct structure, and the
hardcoded tree structure might become outdated.
2025-12-29 13:23:20 -05:00
George Hotz
ff856a74cb
minor refactoring for rdna3 ( #13873 )
...
* minor refactoring for rdna3
* fix div scale stuff
* more bugfixes
2025-12-29 13:20:00 -05:00
C T
39923203ba
fix exception in cuda bindings code on windows ( #13823 )
...
* fix cuda on windows
* fix linter errors
* test github action install cuda-toolkit
* Revert "test github action install cuda-toolkit"
This reverts commit c18ad6f937 .
* Revert "fix linter errors"
This reverts commit 00aa943e91 .
* Revert "fix cuda on windows"
This reverts commit 7aea5256b1 .
* fix windows sysconfig.get_config_var("MULTIARCH") is None
2025-12-29 12:58:22 -05:00
b1tg
63a1bb8507
multi custom kernel: support input mixed with copy and shard ( #13748 )
2025-12-29 12:54:27 -05:00
chenyu
0a98fd38b3
fix tests that failed locally on mac ( #13872 )
...
keccak output was silently broken without contiguous
2025-12-29 11:23:38 -05:00
Clément Verrier
0e409ff5ce
fix indentation in UOp pretty_print for repeated references ( #13857 )
...
* fix correct indentation in UOp pretty_print for repeated references
When a UOp was referenced multiple times, the walrus operator notation
(e.g., x0:=) was correctly used for the first occurrence, but subsequent
references had misaligned indentation due to an extra space character.
Fix indentation misalignment in pretty_print() when UOps are referenced
multiple times.
* add simple unit tests for UOp repr
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-12-29 10:46:16 -05:00
George Hotz
f1471a3b99
speed up rdna3 unit tests + add to CI ( #13871 )
...
* speed up rdna3 unit tests
* add test to CI
* faster and simpler
* speedups
* bugfixes
* use helper
* fix CI maybe
* test fixes
* llvm-21 on 24.04
* upd
* llvm-21
* fix test
* bring that back
* merge gen into lib
* test generators
2025-12-29 10:26:48 -05:00
h-vetinari
37720fd6c0
also look for linux libraries in RHEL-themed paths ( #13863 )
2025-12-29 10:05:32 -05:00
George Hotz
25ef866e89
write python emulator from RDNA3 psuedocode in pdf ( #13841 )
...
* write python emulator from RDNA3 psuedocode in pdf
* emu2
* more emu
* working
* more psueod
* progress
* cleanups
* delete junk
* delete stale files
* just emu
* work
* emu compare
* bemu
* cleanups and more failures
* revert bench emu
* fix emu cmp
* four tests fail
* bugfixes
* dsl
* ext
* refactor
* dsl
* div scale fix
* test_emu
* fix emu tests
* pcode
* test pcode
* top imports
* fix test_emu to use run_asm
* emu tests on real hardware
* more tests
* more emu tests
* more
* work
* work
* bug fix
* bugfixes
* fix fp16 gemm
* all ops tests pass in emulator
* fix llvm tests
* fix a few more tests
* fix mockgpu timeout
2025-12-29 07:39:53 -05:00
nimlgen
88eb230326
memory: correct pa allocator size ( #13861 )
2025-12-29 14:49:44 +03:00
qazal
f541540129
variable N for asm gemm ( #13869 )
...
* variable N for asm gemm
* cleanup spacing
2025-12-29 19:35:50 +09:00
nimlgen
c6769badc2
mockgpu: async support ( #13868 )
...
* mockgpu: async support
* cpu
2025-12-29 13:18:37 +03:00
qazal
fc5278746f
mi350x assembly gemm cleanups ( #13867 )
2025-12-29 18:47:23 +09:00
George Hotz
f07c39cfa4
hwtest fixes for rdna3 dsl ( #13865 )
2025-12-28 20:42:29 -05:00
George Hotz
d9603c1bee
improve asm dsl syntax ( #13864 )
...
* improve asm dsl syntax
* improve asm dsl syntax
2025-12-28 20:04:59 -05:00
chenyu
f5090192c8
reorder AMD tensor core benchmark test ( #13860 )
...
* reorder AMD tensor core benchmark test
* disable that
2025-12-28 12:29:51 -05:00
qazal
066d96c397
print tflops in asm gemm test ( #13859 )
...
* print tflops in asm gemm test
* change order
2025-12-29 02:26:40 +09:00
chenyu
a03cd43e78
fix typing in compute_gradient ( #13852 )
2025-12-28 11:52:14 -05:00
chenyu
cba05acadf
re-enable TYPED=1 import test ( #13858 )
2025-12-28 11:49:06 -05:00
qazal
2cfbabdc34
mi350x 1tflop bf16 gemm in extra ( #13702 )
2025-12-28 21:45:42 +09:00
qazal
2180eee5e4
use the asm dsl in remu hwtest.py ( #13856 )
...
* remu hw test with the asm dsl
* simpler
* nthreads and exec mask
* cmp/cmpx
* assembler error in s_mov_b32
* vopd in dsl?
2025-12-28 11:32:41 +09:00
chenyu
784b919f7f
Revert "optim empty shard #13513 ( #13598 )" ( #13855 )
...
* Revert "optim empty shard #13513 (#13598 )"
This reverts commit 76d465dbc3 .
* test_arange_shrink
* update test
2025-12-27 21:10:23 -05:00
anu
9b4de8abc7
fix beam in python 3.14+ ( #13836 )
...
* fix beam search on python 3.14
* add PickleableCount class to helpers
* change name, add test, add step
* tidy count init
2025-12-27 16:24:22 -05:00
chenyu
0f74909ae9
clean up rearrange ( #13851 )
2025-12-27 11:06:10 -05:00
qazal
f6c660f7fa
simplify sqtt decoder infra ( #13849 )
...
* more work
* simpler
2025-12-28 00:31:16 +09:00
Clément Verrier
ae013beab8
handle empty VECTORIZE in UOp.render() ( #13847 )
...
`UOp.render()` crashed with `IndexError: tuple index out of range` when
the UOp graph contained a `VECTORIZE` with empty `src=()`. This occurs
when reshaping to scalar shape `()`, e.g., `Tensor.ones(4).sum()`.
The bug was in the renderer's VECTORIZE pattern: `all_same(())` returns
`True` (vacuous truth), causing the code to access `x.src[0]` on an
empty tuple.
- Fix `IndexError` when calling `UOp.render()` on graphs containing
empty `VECTORIZE` nodes.
- Add test for empty `VECTORIZE` rendering.
2025-12-27 10:09:39 -05:00
qazal
a2da61d096
use new style amd compiler in viz ( #13848 )
...
* working version, handcode gfx1100 arch
* get target from device properties
* lib in cfg test program spec
2025-12-27 23:59:30 +09:00
JINO ROHIT
1ee92003ea
minor typo ( #13846 )
2025-12-27 09:34:57 -05:00
nimlgen
276159cb87
system: add base_class to pci_scan_bus ( #13845 )
...
* system: add base_class to pci_scan_bus
* fix
2025-12-27 13:22:21 +03:00
Francis Lata
fac137779e
remove flux1 seed image ( #13843 )
2025-12-27 00:45:11 -05:00
qazal
f6de9095a0
switch asm tests to dsl ( #13840 )
...
* switch asm tests to dsl
* labeled basic blocks also work
* indenting for basic blocks
* allow define from star import
2025-12-27 02:15:16 +09:00
chenyu
ba922094f2
remove redudant check in disk_supports_fast_copyout ( #13838 )
2025-12-26 11:30:55 -05:00
George Hotz
e9f2aaba2a
simplify rdna3 asm ( #13835 )
...
* simplify rdna3 asm
* cleanups
* fix names
* fix tests
* fixes
* more test fixes
* type fixes
* tests pass + mypy passes
* 3.11 syntax
2025-12-26 11:21:03 -05:00
nimlgen
c44b4f9ae0
am: fix sdma warm boot ( #13837 )
2025-12-26 12:38:06 +03:00
George Hotz
c6937fa744
more work on RDNA3 asm ( #13833 )
...
* more llvm asm tests
* roundtrip test
* work
* more handwritten
* more handwritten
* work
* tests pass
* dual mov
* all tests pass
* all tests pass fast
2025-12-25 23:28:14 -05:00
George Hotz
f1111ac7de
move amd compilers to new style ( #13831 )
...
* move amd compilers to new style
* simplest diff
* AMDHIPrenderer
2025-12-25 13:42:24 -05:00
George Hotz
9d94b8c6b2
python asm dsl in extra + python REMU ( #13436 )
...
* having fun with python asm dsl
* rdna3
* meh
* all in rdna3
* work
* more work
* work
* integration
* tests
* simpler
* simpler
* asm
* better
* simpler
* progress
* emu
* simpler
* emu
* tests
* types
* vopd
* cleaups
* work
* memory ranges
* add tracing
* refactors
* run_asm exit
* more readable
* compare to remu
* test gemm
* bug + stale
* more tests
* refactor
* tests fix
* more ins
* more instructions
* refactor
* faster
* match case
* match case
* simpler
* work
* tests
* run_asm
* work
* bug fixes
* more emu
* alu/emu
* refactor
* no pipeline emu yet
* alu direct
* fix
* bugfixes + new test
* fix exceptions in emulators
* update gen.py
* pylint
* no pdf
* improve bench_emu
* speedups
* cleanups
* more tests
2025-12-25 13:04:14 -05:00
nimlgen
b5f3a5ad79
am: cleanup comment ( #13828 )
2025-12-25 18:00:28 +03:00
chenyu
8985a4a023
one less branch in Buffer.view [pr] ( #13829 )
2025-12-25 09:34:15 -05:00
chenyu
094753b4e0
renderer arch version cleanup [pr] ( #13830 )
2025-12-25 09:32:56 -05:00
chenyu
54af29dbdb
trange can just be a function ( #13827 )
2025-12-24 23:57:10 -05:00
qazal
a1c1684b91
set .amdhsa_kernarg_size in asm test ( #13826 )
2025-12-25 13:08:14 +09:00
chenyu
da1cb6a9ec
update llama dataloader ( #13825 )
...
separate creating dataset from itererating over the dataset to not create eval data for each eval
2025-12-24 17:42:08 -05:00
chenyu
a7fc0c288b
clean up BufferCopy init [pr] ( #13824 )
2025-12-24 10:40:15 -05:00
chenyu
903753c60c
llama wandb logging ( #13822 )
2025-12-24 10:24:59 -05:00
qazal
e3a646dce3
viz: skip plaintext disassemble for cfg ( #13821 )
2025-12-24 23:16:59 +09:00
chenyu
cb07c5d0e8
fewer import annotations ( #13819 )
2025-12-23 18:45:50 -05:00
George Hotz
43c6e973d8
add optional compiler in Renderer ( #13817 )
...
* add optional compiler in Renderer [pr]
* fix
* late init
* remove precompiled
* cleanup
2025-12-23 17:58:46 -05:00
George Hotz
8eab6175ee
get_program refactor ( #13816 )
...
* get_program refactor
* fix docs
* cleanup
2025-12-23 16:44:46 -05:00
George Hotz
3d3c5b2fb9
add device to program ( #13815 )
...
* add device to program
* from_uop
* from_uop no renderer
* simpler global_size
2025-12-23 16:15:33 -05:00
nimlgen
90b217896f
am: xgmi p2p ( #13811 )
...
* system: use addr space
* am: xgmi
* fix
* ugh
2025-12-23 20:11:38 +03:00
George Hotz
6439a515be
test fixups / speedups / var_vals refactor ( #13812 )
...
* no PYTHONPATH + llm server port 0
* llm tok speedup
* refactor var_vals
2025-12-23 12:05:59 -05:00
George Hotz
8dcba2e2cc
no full_rewrite [pr] ( #13809 )
...
* no full_rewrite [pr]
* fix
* fix docs
2025-12-22 23:20:01 -05:00
George Hotz
edce2303f4
rewrite to program ( #13808 )
2025-12-22 20:03:33 -05:00
George Hotz
2af2b4da5d
Revert "rewrites for renderer and compiler ( #13646 )" ( #13806 )
...
This reverts commit 339dadf056 .
2025-12-22 19:21:33 -05:00
George Hotz
339dadf056
rewrites for renderer and compiler ( #13646 )
...
* rewrites for renderer and compiler
* full_rewrite_to_program
* fix pre-commit
* compiler passed into get_program
* no pkl compiler
* lib on program spec
* fix spec
* fix test
* no device
* compiler_device
* nm
* fix nir
* fix
* simplest
* fix tests
* revert
2025-12-22 18:58:43 -05:00
Daniel Xu
4edaaf19e5
Handle tied embeddings for llama 3.2 1B ( #13796 )
...
Previously the output.weight layer would not be loaded, and would only
contain randomly initialized values. This led to junk when doing a
forward pass.
Signed-off-by: Daniel Xu <daniel@thinkingmachines.ai>
2025-12-22 16:31:40 -05:00
chenyu
7f1d41c9f9
delete files that import ShapeTracker ( #13805 )
2025-12-22 15:54:18 -05:00
qazal
b31373ca70
remove llvm-mca stuff from viz ( #13802 )
2025-12-23 01:41:51 +08:00
chenyu
27d899ce97
TRAIN=0 to only eval llama ( #13804 )
2025-12-22 11:55:46 -05:00
chenyu
39d962106f
update llama logging ( #13803 )
...
```
REWRITE_STACK_LIMIT=1000000 SMALL=1 BASEDIR=/raid/datasets/c4-8b SAMPLES=1000 BS=8 DP=8 DEFAULT_FLOAT=bfloat16 OPTIM_DTYPE=bfloat16 LLAMA3_SIZE=8B SEQLEN=1024 PYTHONPATH=. MODEL=llama3 python3 examples/mlperf/model_train.py
1 93.44 s run, 11.8750 loss, 0.000000000001 LR, 642.43 GB used, 19644.30 GFLOPS
2 101.78 s run, 11.8750 loss, 0.000000000001 LR, 1454.57 GB used, 17039.35 GFLOPS
3 7.34 s run, 11.8750 loss, 0.000000000002 LR, 1454.57 GB used, 236258.78 GFLOPS
4 4.32 s run, 11.8750 loss, 0.000000000002 LR, 1454.57 GB used, 401488.40 GFLOPS
5 4.36 s run, 11.9375 loss, 0.000000000003 LR, 1454.57 GB used, 398116.13 GFLOPS
6 4.32 s run, 11.8750 loss, 0.000000000003 LR, 1454.57 GB used, 401878.60 GFLOPS
7 4.34 s run, 11.8750 loss, 0.000000000004 LR, 1454.57 GB used, 399822.57 GFLOPS
8 4.35 s run, 11.8750 loss, 0.000000000004 LR, 1454.57 GB used, 398512.24 GFLOPS
9 4.36 s run, 11.8750 loss, 0.000000000005 LR, 1454.57 GB used, 397832.61 GFLOPS
10 4.40 s run, 11.8750 loss, 0.000000000005 LR, 1454.57 GB used, 394520.83 GFLOPS
```
2025-12-22 11:28:29 -05:00
qazal
389f01c7f4
viz: amdgpu assembly basic block graph ( #13755 )
2025-12-22 23:17:16 +08:00
George Hotz
df0f9d6860
add olmoe support to llm ( #13792 )
...
* add olmoe support to llm
* cleanups
* simpler
* clean
* fix mypy
* lil
* remove dumb assert
2025-12-22 10:41:35 -04:00
qazal
81d9053013
roc: cast to nullptr instead of changing header ( #13801 )
2025-12-22 22:34:06 +08:00
nimlgen
d299d30f2c
am_smi: fix with new autogen ( #13800 )
2025-12-22 16:53:26 +03:00
nimlgen
f6bda6ae4e
am: continue from saved state ( #13799 )
...
* am: gfx queue cont
* f
* reset
* f
* l
2025-12-22 15:55:07 +03:00
qazal
6237bd86f6
sqtt/pmc viz improvements ( #13797 )
2025-12-22 18:16:35 +09:00
Sitananda Prasad
3000b8d762
symbolic: add x ^ x -> 0 folding pattern ( #13794 )
2025-12-21 21:47:28 -04:00
chenyu
5cb827f7bf
clean up can_lossless_cast and add missing pairs [p] ( #13793 )
2025-12-21 12:18:33 -05:00
George Hotz
75a6a03664
add qwen3 moe support to tinygrad.apps.llm ( #13775 )
...
* qwen moe works
* simple moe
* one test
* integration
2025-12-21 12:36:02 -04:00
chenyu
29ef0809bb
can_safe_cast -> can_lossless_cast ( #13789 )
...
safe cast in numpy only means the result won't overflow, so lossless is more precise
2025-12-21 11:29:19 -05:00
chenyu
ed1fd7023b
use getattr in dtype.truncate [pr] ( #13788 )
2025-12-21 11:05:43 -05:00
qazal
9839838fdd
viz UOp layout cleanup ( #13787 )
...
* use the same names in server and client
* first layout args, then renderer args
2025-12-21 22:11:40 +08:00
nimlgen
e523971028
am: make mqd contig ( #13786 )
2025-12-21 17:00:33 +03:00
qazal
09e060eab5
simplify viz node labels ( #13784 )
2025-12-21 16:45:06 +08:00
qazal
dc660c9fc0
remove stale / untested viz related files ( #13785 )
2025-12-21 16:42:48 +08:00
George Hotz
59c02dd87f
does this fix the dtype test? ( #13779 )
...
* does this fix the dtype test?
* simpler
2025-12-20 17:31:46 -04:00
George Hotz
5228f7bd06
hotfix: opencode should not reformat files
2025-12-20 15:55:29 -04:00
chenyu
733ef0452c
update test_uop_resolve ( #13777 )
...
plain @unittest.expectedFailure is too broad
2025-12-20 12:40:59 -05:00
nimlgen
3db2104fb8
am: timeout sos start ( #13776 )
2025-12-20 17:41:33 +03:00
qazal
94f97f6988
generic viz cleanups from the basic blocks branch ( #13774 )
...
* simpler codeblock highlight
* simpler append
* status enum
2025-12-20 18:18:03 +08:00
George Hotz
a987a8ed44
add neg VIZ support to not start server ( #13772 )
2025-12-20 00:36:38 -04:00
qazal
b7c2f0dd1b
remove stale extra/sched directory ( #13770 )
2025-12-20 11:57:30 +08:00
George Hotz
86cd1e9e81
remove UPatAny for typing fix [pr] ( #13766 )
...
* remove UPatAny for typing fix [pr]
* fix dtype
2025-12-19 17:41:18 -04:00
George Hotz
4702da41d5
hotfix: mkdir for extra/disassemblers
2025-12-19 17:18:37 -04:00
George Hotz
45c459848d
remove more stale stuff ( #13765 )
...
* remove more stale stuff
* remove disassemblers/adreno
* stale
2025-12-19 17:14:56 -04:00
George Hotz
744af193f0
remove ScheduleItem and merge it with ExecItem ( #13759 )
...
* remove ExecItem and merge it with ScheduleItem
* less diff
* fix issues
* min diff
* don't change bufs in _lower
* min diff
* update
* revert
* fixes
* diff
2025-12-19 17:04:24 -04:00
George Hotz
df6cde8a00
cleanup stale examples/extra ( #13764 )
...
* cleanup stale files
* examples
* move those back
* old
* delete more
2025-12-19 16:27:37 -04:00
chenyu
80b84f5267
ruff lint tinykitten ( #13762 )
...
deleted used import and double spaces. a few ignore to not change the real code
2025-12-19 14:31:00 -05:00
Christopher Milan
97103831c5
Revert "remove image from BufferSpec ( #13636 )" ( #13761 )
...
This reverts commit 2571a1eb47 .
2025-12-19 13:54:36 -05:00
Christopher Milan
2571a1eb47
remove image from BufferSpec ( #13636 )
...
* remove image from BufferSpec
* cl tiny_gemm (64) works
* mypy
* padding
* openpilot CL
* reshape properly
* remove extra qcom checks
* pad output
* mypy
* update compile test
* move undo
* TestImageCopy valid images
* TestImageRealization valid images
* TestImageDType valid images
* cleanups
* test_renderer_failures
* ruff
* mypy
* simplify ops_qcom
* bump step time
2025-12-19 13:41:20 -05:00
chenyu
185a000882
gradient of COPY ( #13760 )
2025-12-19 13:33:59 -05:00
nimlgen
57fe4d0a59
am: no_update_ptr for master ( #13757 )
2025-12-19 19:37:37 +03:00
chenyu
7fcd3cf991
hotfix SPEC for AFTER(CONTIGUOUS) ( #13752 )
...
fixed spec error in `PYTHONPATH="." REWRITE_STACK_LIMIT=5000000 NULL=1 DEFAULT_FLOAT="HALF" BERT_LAYERS=2 BENCHMARK=10 BS=128 GPUS=1 MODEL=bert python3 examples/mlperf/model_train.py`
2025-12-19 10:05:45 -04:00
qazal
81b5815a66
viz: minimal data to render a graph ( #13754 )
2025-12-19 16:19:28 +08:00
Christopher Milan
849e46da21
DLL: _PATH variables can be parent dir ( #13753 )
2025-12-19 00:28:02 -05:00
qazal
159c0e92fa
viz: infrastructure for basic block graphs ( #13751 )
2025-12-19 13:08:19 +08:00
George Hotz
fa40df972f
fix tests for NV ( #13744 )
...
* small fix
* min diff
* bfloat16 out
2025-12-18 13:20:21 -04:00
nimlgen
77191fb744
hive_reset for mi350 ( #13746 )
2025-12-18 12:02:28 +03:00
nimlgen
ceff388f3d
am: extend va space ( #13745 )
2025-12-18 11:20:43 +03:00
wozeparrot
99e667bdcd
tk fa bwd ( #13480 )
2025-12-17 23:56:37 -08:00
George Hotz
aeb7516c8a
tests passing on tinybox h3 ( #13742 )
2025-12-17 19:04:34 -04:00
chenyu
7cd7593c5d
add script to train bert on mi350x ( #13743 )
...
adapted from mi300 config
2025-12-17 16:54:04 -05:00
George Hotz
22f3e7f995
better precommit coverage and faster ( #13740 )
...
* improve pre-commit hook speed and coverage
* remove a few
* lose that
2025-12-17 13:25:55 -04:00
George Hotz
bc78cf1197
filter warnings for nicer test output ( #13739 )
2025-12-17 13:25:27 -04:00
George Hotz
b013244c38
fix local tests for AMD_LLVM ( #13738 )
...
* fix local tests for AMD_LLVM
* fix linters
* skip that for now
* fix segfault
2025-12-17 12:23:46 -04:00
nimlgen
7081014c73
am_smi: mi300 ( #13737 )
...
* am_smi: mi300
* smi
* remo
2025-12-17 17:56:01 +03:00
George Hotz
3dbde178c1
mark slow tests as slow instead of as CI ( #13736 )
...
* mark slow tests as slow instead of as CI
* CI shouldn't have different behavior
* more skips / CI
* slow
2025-12-17 10:29:57 -04:00
George Hotz
9015a22523
make tests faster ( #13734 )
2025-12-17 09:39:44 -04:00
nimlgen
3eecb4f123
am: mi350 support ( #13733 )
2025-12-17 14:57:21 +03:00
wozeparrot
5151a341b3
tk: small changes from fa bwd ( #13732 )
2025-12-16 22:44:36 -08:00
chenyu
fda73c8180
support LAMB param offload ( #13730 )
...
also added Tensor.shard_like
2025-12-16 19:56:30 -05:00
George Hotz
cf0c28d5ae
all tests pass on strix halo ( #13728 )
2025-12-16 19:35:50 -04:00
Christopher Milan
af1d938a50
DLL: search wsl lib folder ( #13727 )
2025-12-16 18:27:09 -05:00
George Hotz
0fb645cc4c
move some methods to mixins ( #13725 )
...
* move some methods to mixins
* a few more
* math trunc
2025-12-16 19:20:04 -04:00
Christopher Milan
c6ba016da6
fix cuda check ( #13726 )
2025-12-16 18:00:09 -05:00
George Hotz
ee45669d14
pre extract afters + sched cleanups ( #13720 )
...
* pre extract afters + sched cleanups
* claude.md lesson
* tests for schedule cache
* Revert "tests for schedule cache"
This reverts commit fb3f2e800a .
2025-12-16 16:14:30 -04:00
George Hotz
4b741e893f
remove REMOTE=1 ( #13722 )
...
* remove REMOTE=1
* leave ibverbs
2025-12-16 15:58:10 -04:00
George Hotz
4d8d821f56
create schedule before the cache ( #13717 )
...
* create schedule before the cache
* move create_schedule
* simpler
* simpler
* simpler
2025-12-16 14:15:31 -04:00
George Hotz
bfe374c7f5
support symbolic shapes in split/chunk when split dim is concrete ( #13718 )
...
* support symbolic shapes in split/chunk when split dim is concrete
Previously split() and chunk() required all dimensions to be concrete.
Now they only require the dimension being split to be concrete, allowing
them to work with tensors that have symbolic shapes in other dimensions.
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* update CLAUDE.md: add pre-commit and no-amend rules
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* fix dim resolution order in split/chunk
Ensure dim_sz is retrieved after dim is resolved, not before.
The previous one-liner evaluated self.shape[dim] with the original
unresolved dim value.
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-16 13:55:06 -04:00
chenyu
e428fbfab6
verify dtype of llama model params ( #13719 )
2025-12-16 12:32:02 -05:00
George Hotz
e5a66ace80
multi custom kernel support ( #13716 )
...
* multi custom kernel support
* custom kernel xfrom
* works
* no SPEC=2 on ck
* panic
* touchups
2025-12-16 11:36:30 -04:00
nimlgen
5778722979
am: restore queues ( #13714 )
...
* am: restore queues
* l
* cmnt
2025-12-16 15:21:42 +03:00
chenyu
041e9a41c9
add contiguous in BertIntermediate ( #13713 )
...
faster step with a lot less recomputation
2025-12-15 22:37:36 -05:00
George Hotz
7589c897b2
split usbgpu tests into their own benchmark [pr] ( #13711 )
2025-12-15 21:42:40 -04:00
qazal
6bafd90248
remove unused process replay input [pr] ( #13712 )
2025-12-16 09:29:35 +08:00
George Hotz
321ab943b2
qwen model is working ( #13690 )
...
* qwen model is mostly working
* add Q4_K quantization support to GGUF parser, add qwen3:1.7b model
- Add Q4_K (type 12) dequantization in nn/state.py
- Add qwen3:1.7b model using Q4_K_M quantization (smaller than Q8_0)
- Make bos_token_id optional for models like Qwen3 that don't have it
- Fix line length issues and add preset parameter to SimpleTokenizer
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* smaller diff
* test dequant
* half split
* better
* simple tok
* mock token
* polish
* better
* fix
* replace
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-15 18:00:34 -04:00
George Hotz
d43e4c7553
llm args + lil html page ( #13710 )
...
* update llm args
* lil html page
* lil
* line size
* qol
2025-12-15 17:09:31 -04:00
George Hotz
ee4a7ee12f
rope half-split ( #13706 )
...
* rope half
* nicer
* this
* rearrange
2025-12-15 15:31:11 -04:00
Christopher Milan
2359e88f0c
wrap cdll redo ( #13705 )
...
* wrap CDLL with custom findlib
* lint
* regen
* fix
* mypy
* hardcode libc on macos
* fix frameworks
* fix webgpu win
* remove supports
* regen metal
* regen libclang
* regen
* simpler
* regen
* regen
* find nvrtc
* fix
* regen
* fix
* typo
* regen
* split
* rsplit one
* typo
* try load DLL
* string error
2025-12-15 13:15:02 -05:00
wozeparrot
5d509499b2
tk: kernel finish groups stores ( #13704 )
2025-12-15 09:16:17 -08:00
George Hotz
54a22aa298
add test for jit footguns ( #13701 )
...
* add test for jit footguns
* shorter
* notes
2025-12-15 10:47:44 -05:00
George Hotz
fd49bb512d
download cache by job ( #13703 )
2025-12-15 10:47:17 -05:00
George Hotz
a657a4e0f4
add Q4_K GGUF quantization support ( #13700 )
...
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-15 10:17:56 -05:00
nimlgen
615dcab767
am: minimal mi300 boot ( #13679 )
...
* nbio7_9
* psp
* gmc
* gfx
* sdma
* ih
* linter
* linter
* minor
* finish
* add missing
* do not allow warm boot for now
2025-12-15 15:55:03 +03:00
qazal
72e006cd59
fast VIZ=2 startup ( #13682 )
2025-12-15 19:16:43 +08:00
qazal
50d34428bd
fix viz endstream ( #13687 )
2025-12-15 16:54:18 +08:00
wozeparrot
7ef7ce2856
tk reg local store ( #13689 )
2025-12-14 23:07:30 -08:00
George Hotz
572ca80046
fast tinygrad.apps.llm ( #13685 )
...
* llm: add --benchmark support
* fix speed
* debug logging
* fix test attention
2025-12-14 21:05:21 -05:00
chenyu
6cad622f59
don't FREE_INTERMEDIATE in bert ( #13684 )
...
hangs green hcq consistently after an hour of training
2025-12-14 14:27:42 -05:00
chenyu
871ab8415f
some onnx cleanups ( #13683 )
2025-12-14 13:58:54 -05:00
nimlgen
75832ce4f6
am: psp with no autoload ( #13681 )
2025-12-14 20:20:09 +03:00
nimlgen
8bcb1038e4
am: nbio 7.9.0 ( #13680 )
2025-12-14 18:35:29 +03:00
George Hotz
013240938b
llm: add --benchmark support ( #13678 )
2025-12-14 08:35:05 -05:00
Robbe Derks
cddbdaf5e1
usbgpu: patch: auto-detect controller PID/VID ( #13645 )
...
* auto-detect controller
* fix lint?
* needs ''
* just try
2025-12-14 00:54:51 -05:00
George Hotz
d7fb5d9b62
speedups: early return from simplify ( #13665 )
...
* early return from simplify
* pm_rewrite
* more speed
* remove again
* early return from simplify
* ugh
2025-12-14 00:51:28 -05:00
George Hotz
bcbf832399
add chrism
2025-12-14 00:45:57 -05:00
chenyu
ed962786d6
use assign in Tensor.backward ( #13674 )
...
preserve the grad object so that jit works
2025-12-13 22:43:06 -05:00
chenyu
721a379c41
Revert "autogen: use wrapped CDLL with custom findlib ( #13666 )" ( #13675 )
...
This reverts commit f6cc3b13b9 .
2025-12-13 22:42:41 -05:00
nimlgen
6402dcf940
am: xccs queue creation ( #13672 )
2025-12-13 18:37:09 +03:00
nimlgen
8430ee7d5f
am: stop hqd only when active ( #13670 )
...
* am: stop hqd only when active
* this better
2025-12-13 17:41:44 +03:00
nimlgen
a49ba241bb
am: use fb_base/fb_end as mc aperture ( #13671 )
2025-12-13 17:29:03 +03:00
nimlgen
0b15c573ca
amd: xccs in PCIIface ( #13669 )
2025-12-13 17:22:11 +03:00
qazal
019e71f8ca
lds bank count tests from pmc counters ( #13667 )
...
* lds bank count tests from pmc counters
* these tests run on the RDNA3 card too
* rename duration to cycles, other rename comment
* add SQ_LDS_IDX_ACTIVE to gfx9 defaults
2025-12-13 17:39:32 +08:00
qazal
a6dfd8a672
viz server cleanups ( #13668 )
...
* viz server cleanups
* comment
2025-12-13 17:27:53 +08:00
Christopher Milan
f6cc3b13b9
autogen: use wrapped CDLL with custom findlib ( #13666 )
...
* wrap CDLL with custom findlib
* lint
* regen
* fix
* mypy
* hardcode libc on macos
* fix frameworks
* fix webgpu win
* remove supports
* regen metal
* regen libclang
* regen
* simpler
* regen
* regen
* find nvrtc
* fix
* regen
* fix
* typo
* regen
* split
* rsplit one
* typo
2025-12-13 01:31:30 -05:00
George Hotz
55845f7de7
schedule: cache unbinds for consistent cache keys ( #13664 )
...
* schedule: cache unbinds for consistent cache keys
strip BIND values before computing cache key so different bound values
(e.g. KV cache positions) hit the same schedule cache entry.
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* spec: allow single-src BIND for schedule cache key normalization
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* docs: add lessons learned to CLAUDE.md
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* more claude.md
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-12 17:27:42 -05:00
George Hotz
27845353a0
add CLAUDE.md
2025-12-12 16:50:11 -05:00
George Hotz
8c87a0bf8d
Revert "schedule: cache unbinds for consistent cache keys ( #13662 )"
...
This reverts commit af86cae10c .
2025-12-12 16:49:50 -05:00
George Hotz
443b7fea80
Revert "add notes about jit to claude.md"
...
This reverts commit 429f82e6a9 .
2025-12-12 16:49:48 -05:00
George Hotz
429f82e6a9
add notes about jit to claude.md
2025-12-12 16:48:23 -05:00
George Hotz
af86cae10c
schedule: cache unbinds for consistent cache keys ( #13662 )
...
* schedule: cache unbinds for consistent cache keys
different bound variable values (e.g. kv cache positions) now produce
the same schedule cache key by unbinding BIND(DEFINE_VAR, CONST) before
computing the cache key and rebinding after lookup.
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* schedule: cache unbinds for consistent cache keys
When scheduling, BIND(DEFINE_VAR, CONST) nodes are now unbound to
tagged DEFINE_VARs before computing the cache key. This ensures that
the same computation with different bound values (e.g., different
KV cache positions in LLM) gets the same cache key and reuses the
cached schedule.
The fix:
- pm_pre_sched_cache: replaces BIND with tagged DEFINE_VAR
- pm_post_sched_cache: restores tagged DEFINE_VAR back to original BIND
- pm_remove_rangeify_tags: excludes DEFINE_VAR to preserve tags through rangeify
- var_vals extracted from BINDs before cache key computation
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* schedule: fix BIND handling and add CLAUDE.md
- Handle BIND to RANGE in create_schedule (not matched by CONST pattern)
- Assert all BINDs on same variable have same value
- Add CLAUDE.md codebase guide
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-12 16:40:10 -05:00
chenyu
fcaed1e1dd
don't use empty in bert fake data ( #13661 )
...
somehow jit does not count empty as input
2025-12-12 15:59:50 -05:00
George Hotz
316da9f7ff
llm: add created/model fields, non-streaming support, and tests ( #13660 )
...
* llm: add created/model fields, non-streaming support, and tests
- Add `created` timestamp and `model` fields to response (required by OpenAI spec)
- Add non-streaming mode support for /v1/chat/completions
- Add `send_data` helper to HTTPRequestHandler for responses with Content-Length
- Refactor viz/serve.py to use send_data
- Add integration tests using real OpenAI client
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* add openai to testing
* toml
* Remove 'openai' from dependencies
Removed 'openai' from the dependencies list.
* bump cache
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-12 14:50:36 -05:00
George Hotz
9604773e45
add model choosing support to llm ( #13656 )
2025-12-12 11:22:11 -05:00
nimlgen
e36385e570
am: support xgmi systems ( #13659 )
...
* am: support xgmi systems
* fake_am
2025-12-12 18:55:45 +03:00
nimlgen
b4796e2d32
amd: set queue prio to normal ( #13658 )
2025-12-12 18:25:41 +03:00
nimlgen
a1de7787bf
am: xcc/inst support ( #13657 )
2025-12-12 17:40:42 +03:00
George Hotz
f0fa9bcd98
openai api for llm ( #13648 )
...
* openai api for llm
* responds to simple request
* schedule cache needs to unbind
* stream works
* share stream code
* 20k
* one print
* cid
2025-12-12 08:25:33 -05:00
qazal
93ad1f7732
viz: readable pmc print, share unpacker with tests ( #13655 )
...
* viz: readable pmc print, share unpacker with tests
* sections
* static analyzer
* rm that
2025-12-12 19:29:59 +08:00
Christopher Milan
760e508c3a
autogen: no deep walk ( #13654 )
...
* no deep walk
* reset init
* delete walk
* remove print
* regen
* linkage spec
* cleanup
2025-12-12 01:04:35 -05:00
wozeparrot
8f60b8dd1e
fix: cast on transpose ( #13653 )
2025-12-11 21:03:49 -08:00
Christopher Milan
950d8de00e
automatically inline anonymous ( #13652 )
2025-12-12 00:02:44 -05:00
chenyu
01e9ad0d52
clean up bert next_data ( #13650 )
...
train iter was designed to never stop for both real and fake data
2025-12-11 22:56:28 -05:00
Jakob Sachs
ab2220b834
Handle missing bfloat16 natives on CPU architectures ( #13553 )
...
* CPU: fix compiler-rt libcall by adding intermediate casts for bfloat16
* fix lint
* remove old manual bypass of bf16 for CPU tests, and add diversion converstion from bf16 to/from fp16
---------
Co-authored-by: Jakob Sachs <jakobs99@purelymail.com>
2025-12-11 15:38:43 -05:00
nimlgen
cbae33003d
ci: add usb4 ( #13643 )
...
* ci: add usb4
* debug=3
* undef
* revert
2025-12-11 19:41:41 +03:00
chenyu
03600aef1e
failed test case when init jit with empty inputs ( #13641 )
...
not related to bert grad acc, but still seems to be a bug
2025-12-10 22:03:06 -05:00
nimlgen
51f3c9f615
am: use va_base as base ( #13640 )
2025-12-10 21:09:35 +03:00
chenyu
5034c6fb37
reenable FREE_INTERMEDIATE for bert ( #13639 )
...
* reenable FREE_INTERMEDIATE for bert
* comment
2025-12-10 12:08:09 -05:00
qazal
be6d538351
viz: add kernel walltime to pmc scoreboard ( #13638 )
...
* viz: add kernel walltime to pmc scoreboard
* fix typing
* tiny TracingKey refactor
* key on kernel name
2025-12-10 20:16:42 +08:00
qazal
1666c4aaab
viz: fix counter names ordering ( #13637 )
2025-12-10 17:05:27 +08:00
qazal
c801bb7054
viz: show all kernel pmcs ( #13635 )
2025-12-10 07:16:02 +08:00
wozeparrot
4854a0c02c
fix: getattr returns AttributeError not ImportError when missing ( #13633 )
2025-12-09 14:26:54 -08:00
chenyu
016a59cafa
remove contiguous and use where in EmbeddingBert ( #13632 )
2025-12-09 15:49:21 -05:00
nimlgen
ddecba300f
amd: use getattr for autogen ( #13630 )
...
* amd: use getattr for autogen
* fi
2025-12-09 20:36:26 +03:00
Nino Risteski
76d465dbc3
optim empty shard #13513 ( #13598 )
...
* optim empty shard
* remove tuple
* simplify
* lint
* lint2
* test
* remove original buffer unique id
* new rule
* reset shard
* update
* reset shard
2025-12-09 12:28:36 -05:00
ayanhan
47a170be2e
test: enable cummax scalar IndexError test ( #13625 )
2025-12-09 12:25:56 -05:00
Christopher Milan
9eae9dc3be
regen smu_v13 with stdint ( #13631 )
...
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
2025-12-09 12:20:01 -05:00
nimlgen
7cd8852f60
autogen: do no return tuples ( #13629 )
2025-12-09 20:08:13 +03:00
nimlgen
9e484b5b1c
hcq: check size is None, do not read the whole size for 0s ( #13628 )
2025-12-09 19:37:44 +03:00
nimlgen
1329033b8c
am: fix hot-queue restarts, only dequeue ( #13627 )
2025-12-09 19:37:21 +03:00
nimlgen
b07839493d
proclogs with xccs ( #13626 )
2025-12-09 16:46:08 +03:00
qazal
2c333818f4
simplify UOp stringifier [pr] ( #13618 )
...
* simplify UOp stringifier [pr]
* fix tuple
2025-12-09 05:06:16 +08:00
chenyu
2471b49e45
minor bert / llama change from grad acc branch ( #13622 )
...
* minor bert / llama change from grad acc branch
* revert those
2025-12-08 16:04:14 -05:00
Christopher Milan
cb3d756547
NAK compile-only test ( #13621 )
2025-12-08 15:53:46 -05:00
Christopher Milan
a4c3d48aa9
compile-only test for IR3 actually works ( #13619 )
2025-12-08 15:07:49 -05:00
Christopher Milan
a17077d1d9
skip test_double_assign in CI LVP ( #13620 )
2025-12-08 14:54:02 -05:00
Christopher Milan
1c16b6e082
Mesa: freedreno ( #12746 )
...
* ir3 init
* got a program
* 1 + 1 works
* use isa_disasm instead of shader_disasm
* wip
* matmul works
* works on py3.14
* fix const loading
* skip QCOM failing tests
* cleanup
* args actually work
* add compile-only tests
* fix typo and install tinymesa
* IR3 NULL backend
* (float32) images work
* autogen fix
* fix compile only test
* typo
* mypy happy
* compile-only uses py3.14
* bump mesa
* unify qcom disassembler
* float16 works
* disasm shows in viz
* save a line
* add real del
* variable workgroup sizes
* simplify diff
* bump line count
* properly set wgsz
* regen mesa
* no preamble
* bump lines
2025-12-08 14:02:08 -05:00
Douglas Nyberg
947c6eefc3
add Swish op ( #13541 )
...
* add Swish ONNX operator
* add Swish regression test
* remove trailing whitespace
* upgrade ONNX to 1.20, add excludes for unimplemented ops
* upgrade ONNX to 1.19, add Swish op
* upgrade ONNX to 1.19, TensorFlow to 2.18, add Swish op
* exclude attention_3d and attention_4d_gqa tests
* exclude attention fp16 tests
* exclude all attention tests
* retrigger CI
* retrigger CI - worker crash
2025-12-08 12:41:18 -05:00
nimlgen
dd8a1a10d4
amd: tiny cleanups ( #13616 )
2025-12-08 13:15:56 +03:00
qazal
2b07336c82
viz server cleanups ( #13615 )
...
* depths start at 0
* rename the api path
2025-12-08 17:44:43 +08:00
wozeparrot
89c4206e22
fix: typing ( #13614 )
2025-12-07 20:10:30 -08:00
qazal
572dfd5506
add static amd program info to viz ( #13594 )
...
* llvm-readelf
* amd_readelf + soft_err
* cleanup
* multiple metadata
* max wgp size, may be less
2025-12-08 04:08:14 +08:00
qazal
73093314bd
viz: support list of sidebar info ( #13612 )
2025-12-08 03:09:43 +08:00
chenyu
b981b6f89e
remove old llama grad_acc ( #13611 )
...
* remove old llama grad_acc
* GRADIENT_ACC_STEPS=1
2025-12-07 13:03:47 -05:00
Christopher Milan
94d7646bdc
fix anonymous struct fields ( #13610 )
2025-12-07 12:56:38 -05:00
nimlgen
dcd50baca4
amd/nv: cleanup ( #13608 )
2025-12-07 17:05:26 +03:00
nimlgen
ac5f1e115d
autogen: repro for the bug ( #13607 )
...
* autogen: repro for the test
* mute
2025-12-07 15:51:03 +03:00
Christopher Milan
4eae4b0ce6
unify adreno autogen with mesa ( #13604 )
...
* unify adreno autogen with mesa
* gen pm4
* TestTiny::test_plus works
* add a6xx enums
* IMAGE=2 TestTiny::test_gemm works
* remove adreno from CI
* cleanup
2025-12-06 15:17:36 -05:00
kamilisjon
e20bc0b9b5
remove unused function parameter in beam search ( #13602 )
2025-12-06 11:40:47 -05:00
nimlgen
abafb96441
hcq: check all subbufs are free ( #13599 )
...
* hcq: check all subbufs are free
* fix
* Update ops_amd.py
2025-12-06 17:43:18 +03:00
nimlgen
f2b549d921
amd: refactor scratch calc ( #13595 )
...
* amd: refactor scratch calc
* fix
2025-12-06 16:41:35 +03:00
chenyu
4562f217e1
more bert updates ( #13597 )
...
prep split jit
also lower BS to 72
2025-12-06 08:32:43 -05:00
wozeparrot
93f1baca77
feat: tk fa in tensor ( #13580 )
2025-12-05 14:36:29 -08:00
chenyu
cb4c6324ef
revert bert grad accumulation ( #13596 )
...
prep for the new split jit style
2025-12-05 17:30:08 -05:00
qazal
f20212e1ec
refactor viz error handler ( #13593 )
2025-12-06 02:37:39 +08:00
Christopher Milan
dec2f50aee
reenable process replay for lvp ( #13592 )
2025-12-05 12:36:35 -05:00
chenyu
0977206b1c
Revert am ( #13591 )
...
* Revert "hotfix: amd: tmpring (#13589 )"
This reverts commit 4d8b283b36 .
* Revert "amd: use correct structs (#13583 )"
This reverts commit d8b09eda57 .
2025-12-05 11:03:12 -05:00
chenyu
ac1227575f
IMAGE=1 driving_vision in benchmark ( #13587 )
2025-12-05 10:20:54 -05:00
nimlgen
4d8b283b36
hotfix: amd: tmpring ( #13589 )
...
* hotfix: amd: tmpring
* more
2025-12-05 18:19:05 +03:00
qazal
8c332219f9
viz: remove x86asm highlighter ( #13586 )
...
* viz: remove x86asm highlighter
* formatting
2025-12-05 21:05:50 +08:00
qazal
5d8726d8d2
viz: refactor to generic sidebar ( #13584 )
2025-12-05 20:09:41 +08:00
nimlgen
d8b09eda57
amd: use correct structs ( #13583 )
2025-12-05 14:46:38 +03:00
qazal
6d92e9ffbf
hotfix: skip process replay on lvp ( #13585 )
2025-12-05 19:25:23 +08:00
Christopher Milan
8011b953c9
mesa: remove glsl type hack ( #13578 )
...
* mesa: remove glsl type hack
* lazy type access
* save a line
* fix windows?
* mypy happy
2025-12-04 21:18:56 -05:00
George Hotz
c5bd28e21d
start work on schedule cache ( #13529 )
...
* start work on schedule cache
* local unique
* schedule cache works
* schedule cache cleanup
* fix tests
* preserve metadata
* oops, fix cache
* put that there
* fix spec
* always miss
* why is that broken?
* src[0].op
* fix process replay
* delete abstractions2
* reenable the actual schedule cache
* metadata is best effort
* fix JIT in examples/gradaccum_mnist.py
* full jit
* fixed and test is real
2025-12-04 17:24:49 -08:00
wozeparrot
62e2fc5108
tk: global load/store rv ( #13577 )
2025-12-04 17:23:48 -08:00
Christopher Milan
5cfe1698e8
autogen: strip function parameter qualifiers ( #13576 )
...
* autogen: strip function parameter qualifiers
* regen hip
* re-regen hip
2025-12-04 19:54:34 -05:00
qazal
f21c9dbf4b
enable PMC with VIZ=2 ( #13575 )
2025-12-05 03:09:53 +08:00
qazal
d7caae5f61
viz: tabulate pmc ( #13574 )
...
* viz: tabulate pmc
* linter
* enable nesting
* pmc comes before waves
2025-12-05 03:08:39 +08:00
chenyu
42f6cf3a90
tighter test_real_world mem and kernel count bounds ( #13573 )
...
also check if actual usage is within 20% of set limit, the old limits are too big to be useful
2025-12-04 13:35:39 -05:00
chenyu
89f9e1dcd5
add SGD to beautiful_mnist ( #13571 )
2025-12-04 12:17:29 -05:00
qazal
512a8f3dd4
viz: start global memory PMC tests ( #13569 )
2025-12-05 00:40:27 +08:00
chenyu
7df56d3b99
Optimizer.device is a property ( #13568 )
2025-12-04 09:25:15 -05:00
nimlgen
db99a61fad
qcom: support cpu mappings ( #13565 )
...
* test
* qcom: support cpu mappings
* clean
* msg
2025-12-04 14:50:46 +03:00
George Hotz
bd6a068ef7
move track_rewrites to outer schedule cache ( #13556 )
...
Co-authored-by: qazal <qazal.software@gmail.com>
2025-12-04 19:13:45 +08:00
qazal
3eae146139
faster process replay [pr] ( #13564 )
2025-12-04 18:52:07 +08:00
Rory Clear
6eab756578
fix and test loading num_batches_tracked ( #13538 )
...
* fix and test loading num_batches_tracked
* add failing reverse case
* try reshape state dict if mismatch
* reshape for () and (1,)
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-12-04 01:22:49 -08:00
nimlgen
877a7fdd61
jit: support encdec ( #13563 )
...
* jit: support encdec
* fix
2025-12-04 11:58:34 +03:00
Douglas Nyberg
a8a62bc08e
add max/min reduction support to ScatterND ( #13562 )
2025-12-04 00:53:47 -08:00
ayanhan
edf929ec9d
fix: add __delitem__ to Tensor with proper TypeError ( #13561 )
2025-12-04 00:53:08 -08:00
Douglas Nyberg
9411ecedc4
fix CUDA half-precision trunc() type mismatch ( #13559 )
2025-12-03 21:53:16 -05:00
ayanhan
92b40290c7
fix: add test_sum_int and remove outdated TODO in test_custom_kernel ( #13560 )
2025-12-03 21:51:58 -05:00
Christopher Milan
0a54434b15
mitigate ctypes c_bool bitfield bug ( #13558 )
...
* mitigate ctypes c_bool bitfield bug
* don't delete old test
2025-12-03 20:46:04 -05:00
George Hotz
96d16675fe
update examples/gradaccum_mnist.py to use the JIT
2025-12-03 16:11:42 -08:00
George Hotz
24ca8eeaa7
small fixups from schedule_cache ( #13557 )
2025-12-03 15:41:16 -08:00
Douglas Nyberg
f5abd38132
remove tfa dependency: use keras.optimizers.Lamb and tf.raw_ops for LARS ( #13555 )
2025-12-03 17:48:27 -05:00
George Hotz
a4c4e48385
add LUNIQUE op ( #13554 )
2025-12-03 14:34:34 -08:00
George Hotz
a909cd4581
faster HEVC decode ( #13552 )
...
* faster HEVC decode
* bind to variables
* cleanups
* more cleanups
2025-12-03 11:33:05 -08:00
chenyu
22777a89ea
minor test_uop_symbolic updates ( #13551 )
2025-12-03 13:17:44 -05:00
chenyu
a205f98ef4
tighter bound for MOD ( #13550 )
2025-12-03 11:24:29 -05:00
nimlgen
fcdb01abe7
hip: fix ioctl ( #13548 )
2025-12-03 16:40:43 +03:00
qazal
aab7535805
viz: format buffer size unit ( #13547 )
2025-12-03 21:35:49 +08:00
nimlgen
daea1161cc
nv: nvdec for blackwell ( #13546 )
2025-12-03 16:30:22 +03:00
nimlgen
549f3287a8
fix caching for fetch ( #13544 )
2025-12-03 14:34:14 +03:00
qazal
8390de39e6
amd: static flag check for sqtt/pmc ( #13545 )
2025-12-03 18:36:15 +08:00
George Hotz
ddf3f2d0c4
rdna3 asm + zip_extract ( #13499 )
...
* rdna3 asm + zip_extract
* include sqtt
* fix end parsing
* disassembler working
* parsing fields
* instruction
* op
* more parsing
2025-12-02 22:56:01 -08:00
George Hotz
6bd355fa26
add needs_second_gpu decorator ( #13543 )
...
* add needs_second_gpu decorator
* more skips
* two more fixes
2025-12-02 19:08:23 -08:00
wozeparrot
0d55aec605
fix after end ( #13542 )
2025-12-02 18:42:58 -08:00
chenyu
8902781dc1
enable more benchmarks ( #13540 )
...
* enable more benchmarks
* disable some
* adjust ASSERT_MIN_STEP_TIME
* mac NOCLANG=1
2025-12-02 20:31:14 -05:00
George Hotz
055d5aeb7f
add external_test_process_count
2025-12-02 17:26:30 -08:00
chenyu
e8879f7e31
match torch clamp backward ( #13533 )
...
* match torch clamp backward
* fix PYTHON
2025-12-02 17:58:32 -05:00
qazal
7622be761f
add new remu instructions from #13533 ( #13539 )
2025-12-03 06:29:20 +08:00
wozeparrot
18640f57b2
feat: configurable timeout ( #13537 )
2025-12-02 13:35:35 -08:00
chenyu
21aac568fd
limit lift x*y out of reduce to int [pr] ( #13535 )
2025-12-02 16:11:45 -05:00
Roelof van Dijk
c158e3c988
add cifar gated uop_given_valid regression test ( #13536 )
2025-12-02 16:02:47 -05:00
Roelof van Dijk
e329baffa7
fix cifar while keeping openpilot fused ( #13528 )
...
* this works
* test now passes
2025-12-02 12:05:56 -08:00
nimlgen
0874ba8cc8
test_hevc: do not download the whole file ( #13531 )
...
* test_hevc: do not download the whole file
* fix
2025-12-02 21:31:28 +03:00
qazal
366badaa68
require renderer argument in get_program, removes device opening in process replay [pr] ( #13524 )
2025-12-03 02:05:31 +08:00
George Hotz
21184ae6b1
bump cache to 14 ( #13530 )
2025-12-02 08:02:19 -08:00
George Hotz
037edc151c
late gate for ALLOW_TF32 ( #13527 )
...
* remove ALLOW_TF32
* the right place to put that gate
2025-12-02 07:51:58 -08:00
Douglas Nyberg
6a7c58abf1
fix(onnx): unwrap list/tuple value in Pad op ( #13500 )
...
* fix(onnx): unwrap list/tuple value in Pad op
* add regression test for Pad list value
* remove trailing whitespace
* use _resolve_const for Pad constant_value
2025-12-02 07:47:20 -08:00
qazal
c65aa93081
refactor sqtt loader to enable PMC=1 SQTT=0 ( #13526 )
2025-12-02 22:50:38 +08:00
chenyu
60f7c6cce6
simpler drop_and_clauses [pr] ( #13525 )
2025-12-02 09:12:21 -05:00
nimlgen
77a76d1b13
device: respect compiler ContextVars ( #13523 )
...
* device: envvars for cc
* fix
* fix
* x
* um
* fix
* remote
* em
* cleanup
* typing
* fix
* debug
* lvp?
* ugh
* singl
* rm
* lol
* fix
* ?
* this?
* why?
* rev
* mod test
* l
2025-12-02 14:42:04 +03:00
wozeparrot
1b7dbfb37f
tk: named kernels + per kernel range id ( #13522 )
2025-12-01 22:51:04 -08:00
wozeparrot
8713ae6de9
fix: dead sdv2 download link ( #13521 )
2025-12-01 22:50:53 -08:00
George Hotz
44104b0b7f
mnist with grad acc + Adam on CPU ( #13520 )
...
* mnist with grad acc + Adam on CPU
* still broken, but closer
* works w/o jit
* this works without the jit
2025-12-01 18:27:32 -08:00
George Hotz
7307120311
shard to one device is to ( #13519 )
...
* shard to one device is to
* fst
2025-12-01 16:29:53 -08:00
chenyu
0b92fd30f5
simpler simplify_valid [pr] ( #13514 )
...
dedup instead of getting a True clause which is removed later
2025-12-01 17:36:33 -05:00
qazal
a5ec3b24be
viz: start PMC in the counters view ( #13510 )
2025-12-02 00:01:57 +08:00
nimlgen
759b41ab91
amd: fix rsrc_word3 on gfx9 ( #13509 )
2025-12-01 12:47:54 +03:00
chenyu
ebbd114885
simpler invalid alu [pr] ( #13508 )
2025-11-30 22:18:42 -05:00
George Hotz
ada6b92b2d
add a gate to rewrite if there's no rules [pr] ( #13506 )
2025-11-30 17:40:52 -08:00
George Hotz
97b56e11e0
hotfix: 32 workgroups for radeon 8050s
2025-11-30 08:20:17 -08:00
George Hotz
bd4b9de7d2
use numpy in amd_uop_matmul for simpler tracing ( #13503 )
2025-11-30 08:04:38 -08:00
qazal
9023ca30ef
show number of waves in each SE/CU ( #13491 )
...
* show number of waves in each SE/CU
* update to test_ones
2025-11-30 22:29:16 +08:00
nimlgen
455dd88236
nv: minimal hevc ( #13502 )
...
* nv: minimal hevc
* validate
* not needed
* tralin
* var
* cpu
* fxi
* desc
* move
* cleanup
2025-11-30 16:46:55 +03:00
George Hotz
fd373fea7a
fix a few tests [pr] ( #13498 )
2025-11-29 13:43:45 -08:00
George Hotz
29b11c8992
bug in device enumerate where we didn't put default back ( #13495 )
2025-11-29 13:00:55 -08:00
George Hotz
6a140f74fe
split out unique_const and cache const [pr] ( #13493 )
...
* split out unique_const
* add cache to const
* call const in unique_const
2025-11-29 10:44:28 -08:00
George Hotz
c38b7684dc
improve microbenchmarks ( #13492 )
...
* improve microbenchmarks
* bugfix + ubench
* lil
* no src in const method
2025-11-29 10:15:22 -08:00
qazal
941597db71
viz UI cleanups ( #13490 )
2025-11-29 22:07:00 +08:00
qazal
d457ee0ba4
viz: correctly handle multiple sqtt traces of the same prg ( #13460 )
2025-11-29 20:52:41 +08:00
George Hotz
6f4d7c0c70
directly create tensor in _apply_uop ( #13489 )
2025-11-28 19:51:06 -08:00
kamilisjon
3d76ef9ba8
Update tests ( #13479 )
2025-11-28 18:35:28 -08:00
nimlgen
192bf4e00a
amd,nv: remove unused env vars ( #13487 )
2025-11-28 23:12:53 +03:00
qazal
ae9c56134e
skip test_tk failing locally on macbook ( #13476 )
2025-11-29 01:15:37 +08:00
qazal
f33ccd31fd
viz: instruction deduping for SQTT inst waves ( #13482 )
2025-11-28 23:17:07 +08:00
Roelof van Dijk
eb543a91e8
perf: remove graph-in-graph from expand_index ( #13473 )
...
* remove graph-in-graph from devectorizer
* vectorize, not sink
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-11-27 11:32:16 -08:00
Roelof van Dijk
d3e125d05d
keyword changed (import reserved in python) ( #13477 )
2025-11-27 11:23:00 -08:00
qazal
72ef533d9c
tracing: use u32 for buffer args encoding ( #13472 )
2025-11-28 00:19:51 +08:00
George Hotz
18addc0a1d
process replay only get_program ( #13475 )
2025-11-27 08:18:18 -08:00
George Hotz
a8e005b095
enable process replay (non-checking) by default ( #13474 )
2025-11-27 07:28:44 -08:00
qazal
952a6a8b10
viz: add kernel buffers back to the sidebar ( #13471 )
2025-11-27 22:10:35 +08:00
Kirill R.
57869387f9
Update wording in mnist.md ( #13469 )
2025-11-27 05:59:49 -08:00
nimlgen
1d207eca3d
cuda: fix fmt in compiler ( #13470 )
2025-11-27 16:51:17 +03:00
qazal
2df8a3474e
viz: bring back flops and mem in sidebar ( #13467 )
2025-11-27 17:27:44 +08:00
George Hotz
05cd2279d0
add cache on reshape ( #13466 )
...
* remove cache on divmod, way less objects
* _apply_reshape
* reshape
* no gc on realize
* wow that cache is fast
2025-11-26 18:57:40 -08:00
George Hotz
f4123b66df
add DEBUG_GC ( #13465 )
...
* add DEBUG_GC
* fixup create_schedule_with_vars
* work
2025-11-26 17:44:44 -08:00
George Hotz
19228e8d37
test_graph is flaky
2025-11-26 16:37:42 -08:00
George Hotz
268b3eb392
factor scheduling into complete_create_schedule_with_vars ( #13464 )
2025-11-26 15:43:27 -08:00
George Hotz
e4cd649ff0
remove kernelize to prepare for refactors ( #13463 )
...
* remove kernelize to prepare for refactors
* less kernelize
* last test
2025-11-26 14:18:50 -08:00
qazal
b63e5a7568
viz: full range x axis scroll ( #13459 )
2025-11-26 21:28:07 +08:00
qazal
c12e218751
viz: double click on INST wave ( #13458 )
2025-11-26 21:12:40 +08:00
qazal
e9cb738c7a
viz: event sidebar cleanup ( #13457 )
2025-11-26 19:47:15 +08:00
qazal
2a3b665972
viz: initial zoom at first event ( #13456 )
...
* viz: initial zoom at first event
* sidebar work
2025-11-26 16:42:06 +08:00
Christopher Milan
b2af92c821
fix HCQGraph.__del__ bug when finalizing ( #13298 )
...
* fix _do_ioctl import
* fix circular import
* suppress_finalizing instead
2025-11-25 20:33:48 -08:00
qazal
8c1e2a42fd
viz: start work on profiler speed ( #13455 )
2025-11-26 07:54:04 +08:00
wozeparrot
ffc31a23f4
tk mi350 ( #13288 )
2025-11-25 15:49:44 -08:00
nimlgen
436ab6bfc7
nv: use opt mutliple vaspaces ( #13453 )
2025-11-25 23:10:21 +03:00
qazal
7238df7a94
viz: cleanup sort_fn ( #13454 )
2025-11-26 04:10:10 +08:00
qazal
5520f1fb0b
viz: per cu timeline ( #13451 )
...
* add cu_loc
* work
* WAVE -> W
2025-11-26 00:05:20 +08:00
qazal
4a9562e353
viz: draw markers on top ( #13449 )
...
* viz: draw markers on top
* create generic label drawer
* same text rendering infrastructure for markers
* minor details
* diff
2025-11-25 17:27:01 +08:00
George Hotz
5373fd2d66
add user device ( #13447 )
...
* add user device
* add device_sort_fn (#13448 )
Co-authored-by: qazal <qazal.software@gmail.com>
* linter
* order by dname
---------
Co-authored-by: qazal <qazal.software@gmail.com>
2025-11-25 15:25:45 +08:00
George Hotz
241e533451
toposort recursive_property is faster ( #13446 )
2025-11-24 22:29:15 -08:00
George Hotz
8e8fec408e
fix n^2 _apply_map_to_tensors [pr] ( #13443 )
...
* clean up slow rules
* fix rule
* non n^2 toposort
* topovisit
* state dict profile_marker
2025-11-24 18:59:16 -08:00
wozeparrot
249553a119
tinyfs tweaks ( #13444 )
2025-11-24 18:07:32 -08:00
wozeparrot
f46bc31156
tk: start and step in range ( #13442 )
2025-11-24 15:43:24 -08:00
George Hotz
cc5e6323ac
stable diffusion profiling ( #13441 )
...
* stable diffusion profiling
Signed-off-by: George Hotz <geohot@gmail.com>
* profile_marker
* profile per step
* fix slow Context
* profile that
---------
Signed-off-by: George Hotz <geohot@gmail.com>
2025-11-24 15:25:45 -08:00
nimlgen
18cfb54736
amd: a bit better se limiting ( #13440 )
...
* amd: a bit better se limiting
* SQTT_LIMIT_SE=0
2025-11-24 21:51:47 +03:00
C T
2d53029be3
Whisper less flaky tests ( #13435 )
...
* use less flaky metric for whisper long transcription
* multiline long transcription 3 reference
* fix reference transcript
see https://homepage.ntu.edu.tw/~karchung/miniconversations/MC.htm
sanitized for whisper
* try lower wer threshold
* add test for wer metric
* extract TRANSCRIPTION_3_ALT
* rename test
* rename
* add tests for high WER difference
* move tests
* sync metric
2025-11-24 09:50:49 -08:00
qazal
2a9bd12700
sqtt: add occupancy events to the timeline ( #13430 )
2025-11-24 22:28:05 +08:00
Sieds Lykles
63a931ff76
Symbolic divisor fuzzer ( #13433 )
...
* render z3 range better
* working version
* rename
* add to workflow
* factor out variable_names
* smaller expressions
* smaller
* + back
2025-11-23 20:29:32 +01:00
nimlgen
677db34eba
nv: cleanup map flags ( #13434 )
2025-11-23 19:54:52 +03:00
qazal
712c7a6448
sqtt loader cleanups from the occupancy branch ( #13431 )
...
* cleanup err handling
* from disasms
* s/wave_execs/wave_insts
2025-11-23 21:50:34 +08:00
George Hotz
9d7a17ee39
beautiful SQTT_PARSE=1 with color ( #13428 )
...
* beautiful SQTT_PARSE=1 with color
* linter
* linter 2
* a few more labels
* filter and or
* wave alloc
* a few more
2025-11-23 01:05:14 -08:00
qazal
474a631877
viz: align left offset for nested items ( #13420 )
2025-11-23 14:22:51 +08:00
George Hotz
da0aa57a3b
add cu parsing to attempt_sqtt_parse
2025-11-22 22:09:05 -08:00
qazal
320ed78803
can view wave timeline with SQTT_ITRACE_SE_MASK=0 ( #13427 )
2025-11-23 13:55:47 +08:00
Pranil
c1838c71fc
display service name typo ( #13426 )
...
its tinybox-display.service
2025-11-22 20:49:56 -08:00
George Hotz
5110409339
continue work on parse sqtt, enable with SQTT_PARSE ( #13425 )
...
* continue work on parse sqtt, enable with SQTT_PARSE
* fix timing
* delta is pre instruction
* hi8 values
* a few more
* a bit more
* let it crash if you enabled it
* figure out simd
* hide 0x11
2025-11-22 19:03:17 -08:00
George Hotz
92170d0ff1
lil op cleanup ( #13424 )
...
* track flag count and op count
* text
* more
* file count
* lil op cleanup
* cleanups
* move
2025-11-22 15:21:15 -08:00
George Hotz
423b76a852
improve sqtt format parser (saturday coffee shop project) ( #13419 )
...
* improve sqtt format parser
* actually read the trash code ChatGPT wrote
* cleanups
* hand written parser
* quality
* more
* was missing first packet
* maybe
* filt
* fixups
* label the waves
* progress
2025-11-22 15:04:10 -08:00
George Hotz
9d6cf3472e
remove op/sentinel
2025-11-22 15:01:47 -08:00
Christopher Milan
310da2a201
remove hashFiles in setup-tinygrad ( #13423 )
...
* fix hashFiles in setup-tinygrad on macos
* remove hashFiles altogether
2025-11-22 17:47:10 -05:00
qazal
c14033e10f
viz: faster startup time with SQTT=1 ( #13337 )
...
* roc.py cleanups
* direct append
* viz index cleanup
* simd row details
* add kernel arg
* late instructions decode
* more instruction decode to sep server request
* 200ms startup, 6 second to waves timeline
* sort units
* creating new http paths is easy now
* instructions unpacker
* min diff, use hyphens
* summary table
2025-11-22 22:02:30 +08:00
qazal
1655fdb6de
viz: cleanup sqtt loader ( #13417 )
2025-11-22 20:10:23 +08:00
qazal
903eec3754
fix sz.py tinygrad import in ci ( #13418 )
2025-11-22 19:20:26 +08:00
nimlgen
3a42680e22
amd: pmc generic arch for gfx10+ ( #13407 )
2025-11-22 12:31:23 +03:00
George Hotz
1f8b24a6b9
track flag count and op count ( #13416 )
...
* track flag count and op count
* text
* more
* file count
2025-11-21 22:46:33 -08:00
George Hotz
4c0f4226b9
delete the PRECAST op [p] ( #13415 )
...
* don't use PRECAST in cstyle renderer [p]
* fix in metal
* fix opencl
* __builtin_bit_cast
* precast is unused
* cuda is c99?
* lambda_union_bitcast
* helper function
* delete precast op
2025-11-21 21:47:14 -08:00
wozeparrot
1f648bb1ba
feat: reenable mobilenetv2 dsp ( #13320 )
2025-11-21 15:21:49 -08:00
chenyu
054477a44f
remove full_symbolic in simplify ( #13413 )
...
only flip one schedule in winograd backward, no functional difference
2025-11-21 15:04:00 -05:00
chenyu
cb29265f23
add test that shows the validhack regression with bad rewrite order ( #13411 )
2025-11-21 13:48:30 -05:00
qazal
fdfe83880b
viz: unique sqtt wave names ( #13410 )
...
* viz: unique sqtt wave names
* better name for the shape
* it's a per program counter now
* table view, refactor to wave:insts dict
2025-11-22 02:43:31 +08:00
chenyu
a6c9b4ff6a
fix symbolic comments [pr] ( #13408 )
2025-11-21 09:18:50 -05:00
Sieds Lykles
114bb94c55
Fix load collapse MAX to ADD ( #13406 )
...
* add Ops.ADD to pattern
* add test
2025-11-21 12:26:14 +01:00
qazal
87c248eafa
small cleanups from viz memory usage fixes ( #13405 )
...
* shape link cleanups
* cleanup findRectAtPosition
2025-11-21 17:05:08 +08:00
qazal
0de1b24154
viz: SE : CU : SIMD : WAVE in sqtt timeline ( #13404 )
...
* wave id in device rows
* SE : CU : SIMD : WAVE
* automatic width
* better styling
* rm the blue
* sort
2025-11-21 15:42:29 +08:00
George Hotz
dabb02767f
set AMD profile mode with sudo on SQTT or PMC ( #13403 )
...
* require profile mode
* add mode setter
* cleanup
* not needed
* SQTT_LIMIT_SE
2025-11-20 23:19:11 -08:00
George Hotz
e1051d00d7
multi like on full_like as well as rand_like ( #13402 )
...
* multi like on full_like as well as rand_like
* add test and fix bug
* mismatch, optim match
* one line
2025-11-20 20:46:48 -08:00
chenyu
fa3def2f12
call less simplify in simplify_valid_load [pr] ( #13401 )
2025-11-20 19:54:22 -05:00
qazal
895ec7417e
viz: enable mapping function names to colors ( #13400 )
2025-11-21 06:43:02 +08:00
George Hotz
a74f6020d5
track apply map to tensors ( #13399 )
...
* track apply map to tensors
* sub
2025-11-20 14:24:55 -08:00
chenyu
647fde64e6
no sym in pm_reduce [pr] ( #13398 )
...
* no sym in pm_reduce [pr]
* fix that
2025-11-20 16:49:09 -05:00
qazal
1313250e0d
viz: use system helper for llvm-mca ( #13395 )
2025-11-21 04:47:25 +08:00
Christopher Milan
de3593957f
Revert "Revert "autogen: fix formatting on zero-argument function-like macros…" ( #13388 )
...
This reverts commit 0901a40685 .
2025-11-20 15:36:13 -05:00
qazal
1220072328
viz: refactor to generic steps api ( #13393 )
2025-11-21 04:33:23 +08:00
George Hotz
26ccbf7040
debufferize with symbolic in one pm ( #13392 )
2025-11-20 11:47:03 -08:00
George Hotz
c46f608703
top down remove_bufferize ( #13391 )
...
* top down remove_bufferize
* removable if ALWAYS_CONTIGUOUS
2025-11-20 11:32:00 -08:00
Christopher Milan
4043489803
set curl -f in setup-tinygrad ( #13389 )
...
* set curl -f in setup-tinygrad
* test bad redirect
* Revert "test bad redirect"
This reverts commit ad945e7ffc .
2025-11-20 13:45:47 -05:00
chenyu
0251a8e628
parse_valid minor cleanup [pr] ( #13385 )
...
* stricter parse_valid [pr]
* not stricter
* no VCONST
* Revert "no VCONST"
This reverts commit 330dbdf4060562596febcbf970bda6051a35012f.
2025-11-20 13:15:06 -05:00
Christopher Milan
0901a40685
Revert "autogen: fix formatting on zero-argument function-like macros ( #13386 )" ( #13387 )
...
This reverts commit 58d85d4bab .
2025-11-20 12:45:35 -05:00
b1tg
91e289cb14
amd fp8 llvm ( #13186 )
...
* amd fp8 llvm support
* fix max
* clean
* add test_mi350.sh
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-11-20 12:35:57 -05:00
Roelof van Dijk
1058748440
torch backend: no aten.detach for torch 2.10 compat ( #13381 )
...
* this works, less cpp?
* simpler = better
* keep torch 2.9 working as well
2025-11-20 09:12:15 -08:00
Christopher Milan
58d85d4bab
autogen: fix formatting on zero-argument function-like macros ( #13386 )
...
* fix formatting on zero-argument function-like macros
* autogen tests should run
* ugh
2025-11-20 12:11:04 -05:00
qazal
9dbc550692
roc: map disassembly to prog name ( #13384 )
2025-11-20 23:47:19 +08:00
qazal
ebcdf68bab
viz: use content headers for profiler ( #13383 )
2025-11-20 23:33:16 +08:00
nimlgen
0b0ea4981c
hcq: unwrap signals ( #13382 )
2025-11-20 18:12:41 +03:00
qazal
9dcd52287a
add external_benchmark_pyrender ( #13378 )
...
* add external_benchmark_pyrender
* can ctrlc it
* cpu_profile exists
2025-11-20 17:38:28 +08:00
George Hotz
cb38c704c3
delete nonfunctional ramp.py
2025-11-19 20:43:44 -08:00
George Hotz
8919c994b7
Revert "AxisType.PLACEHOLDER in reshape to do less graph_rewrite ( #13373 )" ( #13375 )
...
This reverts commit ac7559e33d .
2025-11-19 19:34:30 -08:00
George Hotz
ac7559e33d
AxisType.PLACEHOLDER in reshape to do less graph_rewrite ( #13373 )
...
* AxisType.PLACEHOLDER in reshape to do less graph_rewrite
* _apply_movement_op cache
2025-11-19 19:19:58 -08:00
chenyu
050682ab40
use invalid_gate consistently [pr] ( #13374 )
2025-11-19 22:15:12 -05:00
Roelof van Dijk
0dc2ff431d
fix: revive torch backend ( #13280 )
...
* fix: revive torch backend
* as_strided view vs copy
* Revert "as_strided view vs copy"
This reverts commit 82a61223f2 .
* add extra tests (move inplace, add fusion tests)
* better fusion with inplace_op
* no optimizer hooks (break mnist training fusion)
* split off fusion tests in separate file, assert on resnet fusion
fix: remove comments
* cleanup, reduce diff
* reduce diff
* better fusion and identity checks
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-11-19 15:26:50 -08:00
wozeparrot
56b2540349
tk: keep extra tile data by replacing uop ( #13370 )
2025-11-19 15:11:43 -08:00
George Hotz
ab7df42c78
bring back fold_divmod_general with bugfix and test [pr] ( #13369 )
...
* Revert "Revert "merge to fold_divmod_general [p] (#13359 )""
This reverts commit 05ccc69248 .
* Revert "Revert "actually merge to fold_divmod_general [pr] (#13363 )""
This reverts commit 90e5752199 .
* Revert "Revert "add cache to fold_divmod_general (#13365 )""
This reverts commit 8e17bd6791 .
* bring back fold_divmod_general with bugfix and test
2025-11-19 14:51:51 -08:00
George Hotz
986d113024
symbolic fuzz failure ( #13367 )
...
* symbolic fuzz failure
* skip flaky test
2025-11-19 14:21:08 -08:00
George Hotz
05ccc69248
Revert "merge to fold_divmod_general [p] ( #13359 )"
...
This reverts commit 7711bbac7f .
2025-11-19 14:18:09 -08:00
George Hotz
90e5752199
Revert "actually merge to fold_divmod_general [pr] ( #13363 )"
...
This reverts commit 3d82b83cec .
2025-11-19 14:18:08 -08:00
George Hotz
8e17bd6791
Revert "add cache to fold_divmod_general ( #13365 )"
...
This reverts commit b5309a5043 .
2025-11-19 14:18:08 -08:00
George Hotz
b5309a5043
add cache to fold_divmod_general ( #13365 )
2025-11-19 13:49:18 -08:00
George Hotz
3d82b83cec
actually merge to fold_divmod_general [pr] ( #13363 )
...
* actually merge to fold_divmod_general [pr]
* one more merge
* Revert "one more merge"
This reverts commit aa79f6781c .
* avoid that case for speed
* faster and simpler
2025-11-19 13:17:56 -08:00
chenyu
a91f00925b
remove VECTORIZE and WMMA rules from sym [pr] ( #13362 )
2025-11-19 14:51:21 -05:00
George Hotz
7711bbac7f
merge to fold_divmod_general [p] ( #13359 )
...
* merge to fold_divmod_general [p]
* merge more
* merge more
* merge more
2025-11-19 11:37:45 -08:00
George Hotz
6fdbd03104
more divmod cleanup [p] ( #13358 )
...
* more divmod cleanup [p]
* lil cleanups, faster
2025-11-19 10:35:15 -08:00
George Hotz
bd88a72149
div and mod to its own file, try 2 [p] ( #13357 )
2025-11-19 10:10:06 -08:00
George Hotz
957cf717e7
Python speed ( #13355 )
...
* skip process replay by default
* work on python speed
* fix names of rewrite rules
* fix that test
2025-11-19 09:03:00 -08:00
chenyu
fc19ea76b5
clean up threefry rules ( #13354 )
2025-11-19 11:48:07 -05:00
George Hotz
385618d45b
skip process replay by default ( #13353 )
2025-11-19 08:25:34 -08:00
chenyu
fba4535289
remove hacks for threefry long removal when padded [pr] ( #13352 )
2025-11-19 11:11:39 -05:00
George Hotz
225eb1500f
generic range changes that work for str + int ( #13350 )
...
* generic range changes that work for str + int
* opt range counts up
2025-11-19 08:07:49 -08:00
chenyu
1a72ac16a6
move where same false branch rule to symbolic_simple [pr] ( #13349 )
2025-11-19 10:15:38 -05:00
chenyu
79055ddb8b
clean propagate_invalid more [pr] ( #13347 )
2025-11-19 09:47:50 -05:00
nimlgen
0c9fbf87e1
nvioctl: classes ( #13346 )
2025-11-19 16:14:15 +03:00
qazal
f2221130bb
viz: pick shape by event type ( #13279 )
2025-11-19 20:15:52 +08:00
wozeparrot
be72b78dcb
tk: small fixes ( #13345 )
...
* fix: handle case where final uop isn't a tk wrapped one
* clean: remove after from mma
2025-11-19 00:58:50 -08:00
wozeparrot
e4fbde5b3b
fix: extra options need to go on second step too ( #13344 )
2025-11-19 00:58:09 -08:00
George Hotz
1a332afa76
spec test on 3.14 ( #12957 )
2025-11-19 00:43:04 -08:00
Christopher Milan
a438c277de
autogen tests for 3.14 ( #13343 )
2025-11-18 22:16:59 -05:00
chenyu
722e7a16ed
remove rule in propagate_invalid [pr] ( #13342 )
2025-11-18 21:38:33 -05:00
George Hotz
1afa3c0877
vmap on full model ( #13340 )
...
* vmap on full model
* vmap gemm
* reduce sums on end
* outer reduce
* only if there's ranges
* put those rules in symbolic
* ranges
* do opt later
* add zero range
2025-11-18 16:06:06 -08:00
chenyu
46cb65e692
delete rules from sym [pr] ( #13339 )
2025-11-18 14:57:35 -05:00
George Hotz
9c59b3d19e
vmap grad needs reduce_backward ( #13336 )
...
* vmap grad needs reduce_backward
* fuse and outer
2025-11-18 10:08:30 -08:00
qazal
a647c9eca6
sqtt ui minor fixes ( #13335 )
...
* roc.py cleanups
* direct append
* viz index cleanup
* simd row details
2025-11-19 01:27:56 +08:00
George Hotz
06e39a88a9
outer vmap works ( #13334 )
...
* outer vmap works
* fuse works
* vmap outer works
* outer ranges work
* grad work
* should be good to merge
2025-11-18 09:27:48 -08:00
chenyu
805de27e07
no load substitute in uop_given_valid [pr] ( #13333 )
2025-11-18 11:47:58 -05:00
chenyu
05294bc648
fix some mypy cast [pr] ( #13331 )
2025-11-18 09:23:42 -05:00
qazal
5623e765c8
VIZ=2 enables SQTT ( #13330 )
2025-11-18 22:20:31 +08:00
nimlgen
331f70aa75
roc: ctrlc ( #13255 )
...
* roc: ctrl-c works
* rm
2025-11-18 19:29:28 +08:00
George Hotz
583560ab72
this is the right way to write vmap ( #13328 )
2025-11-17 20:20:52 -08:00
Christopher Milan
8e8e53c886
int8_t is c_byte ( #13326 )
2025-11-17 21:25:40 -05:00
George Hotz
e4fead8a86
write scan in uops ( #13321 )
...
* write scan in uops
* ops range
* no need for variable
* meh, later
* shorter
2025-11-17 16:58:08 -08:00
wozeparrot
8894a5409d
feat: hipcc compiler ( #13319 )
2025-11-17 15:13:32 -08:00
George Hotz
6d3385c284
print special ops in postrange ( #13318 )
...
* print special ops in postrange
* fix on OSX
2025-11-17 14:43:23 -08:00
chenyu
b637093be9
remove a few rules in pm_lower_index_dtype [pr] ( #13317 )
2025-11-17 17:04:56 -05:00
George Hotz
98e9e73286
hotfix: amd_uop_matmul getenvs
2025-11-17 13:26:01 -08:00
qazal
e7e1935225
cleanup sqtt/test_timing ( #13315 )
2025-11-18 04:28:05 +08:00
wozeparrot
33773fda87
tk initial mi350 ( #13289 )
2025-11-17 11:46:32 -08:00
nimlgen
e2cee64050
Revert "hcq: add tag to exec events ( #13311 )" ( #13314 )
...
This reverts commit f63ded5817 .
2025-11-17 22:15:31 +03:00
chenyu
646372490c
move tiktoken import in llama3 ( #13316 )
...
only Tokenizer requires that
2025-11-17 14:09:37 -05:00
qazal
a37f221e44
viz: visualize waves in the timeline ( #13292 )
...
* viz: visualize waves in the timeline
* timeline in format
* per step
* rm that
2025-11-17 22:04:21 +08:00
nimlgen
f63ded5817
hcq: add tag to exec events ( #13311 )
...
* hcq: add tag to exec events
* f
* fix
* fix
2025-11-17 16:59:30 +03:00
qazal
50a443f558
viz: add shader engine to wave exec payload ( #13310 )
...
* viz: show sqtt shader engine
* order it from smallest unit
* easier to config
2025-11-17 19:11:34 +08:00
nimlgen
9bb17c53ea
amd: timer fix ( #13267 )
2025-11-17 13:59:03 +03:00
George Hotz
55be95da15
cleanup sqtt raw parser ( #13309 )
...
* cleanup sqtt raw parser
* better names (don't merge yet)
* clean up amd
* a few more names
* one more filter
2025-11-16 13:11:51 -08:00
George Hotz
cabd4add48
more work parsing SQTT, separate VIZ/PROFILE ( #13308 )
...
* more work parsing SQTT
* more minimal runner
* sep VIZ/PROFILE
* parse print new
* improve parser
* more filter
* that
* split them
* lil cleanup
* skip flaky test
* AQL in mmapeak
2025-11-16 10:40:39 -08:00
qazal
13efdf8c31
test s_nop stall ( #13307 )
2025-11-17 00:59:39 +08:00
George Hotz
295600dc5a
saturday coffee shop work parsing the att format ( #13295 )
...
* saturday coffee shop work parsing the att format
* add examples
* parser
* classes of packets
* fully vibe coded parser
* vibing
* empty
* some vibe names
* vibes
* most of these are wrong
* more vibes
* better names
* parsing
* parse
* cleanup parser
* touchups
2025-11-16 08:25:51 -08:00
Christopher Milan
a9ed241172
properly suppress NIRRenderer.__del__ error ( #13299 )
2025-11-16 18:58:04 +03:00
qazal
c70b06ec19
sqtt test_timing work ( #13304 )
...
* sqtt test_timing cleanups
* only the instruction
* v_mfma_f32_16x16x32_f16 16 cycles, only after second one though
2025-11-16 23:49:24 +08:00
chenyu
8f0e747b3a
Tensor._tri with arange ( #13297 )
2025-11-16 10:21:16 -05:00
chenyu
6372c95094
disable benchmark MobileNetV2 on DSP ( #13305 )
...
failed on tinyc2
2025-11-16 09:42:52 -05:00
Christopher Milan
61625a3898
fix objc finalizing bug ( #13296 )
2025-11-16 12:43:04 +03:00
nimlgen
acbe6361ab
qcom: suppress_finalizing to free ( #13294 )
2025-11-16 11:49:23 +03:00
wozeparrot
ef42334239
tk: load store cleanup ( #13290 )
2025-11-15 17:08:23 -08:00
chenyu
e8844853ed
Tensor.eye with arange ( #13287 )
...
with rangify we can write these with arange
2025-11-15 12:32:27 -05:00
Christopher Milan
5b823af696
Remove (pypi) clang dep for autogen ( #13284 )
...
* no more clang
* regen comgr_3
* ci doesn't need pypi clang
* fix objc
* REGEN for libclang
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-11-15 09:05:11 -08:00
George Hotz
df53c62a9f
bump line count
2025-11-15 08:16:20 -08:00
nimlgen
d37e1fe065
nv: wait for wpr to reset ( #13282 )
...
* nv: wait for wpr to reet
* fix
* comment
* wai
* f
* fix
2025-11-15 20:00:49 +08:00
George Hotz
22c08b470c
fold using outerworld range ( #13286 )
...
* scan using outerworld range
* almost
* sched
* simple range
* mypy
* woooo outer range
* spec passes
* print the numbers
* lol it runs
* real test
2025-11-14 20:43:41 -08:00
George Hotz
567066f51f
tests for cast there and back ( #13195 )
...
* fix cast folding in llama
* dtypes that work everywhere
* Skip test_cast_there_and_back for backend casts
Skip test due to backend casting issues.
2025-11-14 16:56:09 -08:00
George Hotz
6c5fa349e1
add (unused) outer range ( #13285 )
2025-11-14 16:47:52 -08:00
Christopher Milan
d1bb08c5a1
In-tree autogen: objective c ( #13223 )
...
* checkout changes from autogen branch
* move assert
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-11-14 14:08:42 -08:00
George Hotz
e5351699bd
openpilot warp ( #13283 )
...
* openpilot image warp test
* 0.4 ms on metal, 1 ms on CPU
* new inputs each time
* reshape
2025-11-14 13:55:32 -08:00
qazal
7c110e1a57
viz: minor cleanups for sqtt ( #13275 )
...
* small prg cleanup
* test_timing
2025-11-15 01:08:56 +08:00
chenyu
888aaab151
test_tiny cleanup ( #13276 )
2025-11-14 11:11:32 -05:00
nimlgen
3e63831b98
nv: support 580+ drivers ( #13269 )
...
* nv: 580+ support
* start
* f
* fake
* fix
2025-11-14 21:44:16 +08:00
qazal
2ee701a009
roc: fix CEnum access ( #13270 )
...
* roc: add decoder to ci
* also add installer
* use CEnum syntax
* try 2
* add to setup
* revert ci change
* the other enum too
2025-11-14 21:41:24 +08:00
nimlgen
c80d459d99
autogen: fix packed args structs ( #13274 )
...
* autogen: fix packed args structs
* and test this
2025-11-14 20:24:06 +08:00
nimlgen
14eb48b13a
autogen: rename nv_gpu to nv_570 ( #13273 )
...
* autogen: rename nv_gpu to nv_570
* rename
2025-11-14 20:07:19 +08:00
nimlgen
734bfa07b4
nv: refactor uvm calls ( #13272 )
2025-11-14 19:53:04 +08:00
nimlgen
f72b1fbca4
nv: read numClasses ( #13271 )
...
* nv: read numClasses
* fix
* d
2025-11-14 19:43:25 +08:00
nimlgen
84f065f2a2
autogen: warning and msg ( #13268 )
...
* autogen: warning and msg
* f
2025-11-14 19:10:26 +08:00
George Hotz
44d84228ff
move comgr_3 logic back to the old place ( #13266 )
...
* move comgr_3 logic back to the old place
* explicit
2025-11-13 20:05:54 -08:00
Christopher Milan
09f3aae169
In-tree autogen: all C libraries ( #13220 )
...
* checkout files from autogen branch
* ioctl with payload
* fix am generations
* properly fix generations
This reverts commit b2a54f4f41 .
* revert discovery.h
* support pragma pack(1)
* typo
* better getter
* typo
* NVCEC0_QMDV05_00_RELEASE[01]_ENABLE
* align support
* anon handling fix
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-11-13 18:57:44 -08:00
wozeparrot
777cbec5b3
tk: rename rt tile dims to base ( #13265 )
2025-11-13 18:43:02 -08:00
wozeparrot
7eb0d8e744
feat: mixins on tiles ( #13246 )
2025-11-13 16:52:52 -08:00
George Hotz
ba84d415fe
work from benchmarking tinybox red v2 ( #13264 )
...
* work from benchmarking tinybox red v2
* gpuburn
2025-11-13 16:38:40 -08:00
wozeparrot
547304c471
tk: group cleanup ( #13262 )
2025-11-13 14:19:51 -08:00
wozeparrot
4ada51618f
tk: don't flatten in clear ( #13249 )
2025-11-13 13:38:01 -08:00
George Hotz
6b1bae6614
ruff format mixin ( #13261 )
2025-11-13 10:10:38 -08:00
Faizaan Gagan
3049f3edda
support _rebuild_tensor method interception ( #13253 )
2025-11-13 09:41:21 -08:00
Harald Schäfer
3af231904e
openpilot compile tests: assert pre-rangify speeds ( #12775 )
...
* assert pre-rangify speeds
* typo
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-11-13 09:39:06 -08:00
George Hotz
faf68c03a8
more mi350x matmul work ( #13138 )
...
* more mi350x matmul work
* broken compute
2025-11-13 09:09:28 -08:00
Ayman Jabr
256f81bb02
Fix tracemeta 0 ( #13049 )
...
* chore: tclesius branch resolved
* fix: indentation
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-11-13 09:07:11 -08:00
alpharush
7e0aaadecd
feat: add repro command to summary ( #10930 )
2025-11-13 08:52:27 -08:00
nimlgen
6be86dde17
nv: add timeout when repsonding to rpc ( #13260 )
2025-11-14 00:42:21 +08:00
nimlgen
f9b7586e08
roc: fix blob gc ( #13256 )
2025-11-13 23:38:35 +08:00
George Hotz
263b724143
one cache and bump it ( #13258 )
2025-11-13 07:33:31 -08:00
George Hotz
5efa727b83
move _pool to MovementMixins ( #13257 )
2025-11-13 07:28:52 -08:00
George Hotz
bcdfc109b5
hotfix: disable flaky test
2025-11-13 06:19:28 -08:00
qazal
006dea4c3e
roc: only save instruction execs ( #13254 )
2025-11-13 21:28:40 +08:00
nimlgen
f9586b38ba
system: pci mask and val ( #13251 )
2025-11-13 20:44:58 +08:00
George Hotz
7316da3253
new readme ( #13250 )
...
* new readme
* update
2025-11-13 00:48:28 -08:00
George Hotz
17aa3379e9
hotfix: improve self_tokenize
2025-11-13 00:18:57 -08:00
chenyu
4e5a9132e7
JIT_BATCH_SIZE=0 in compile3 ( #13245 )
...
fixed some enqueue time
2025-11-12 23:12:45 -05:00
wozeparrot
759557f633
feat: move tk tests to testextra ( #13242 )
2025-11-12 17:06:53 -08:00
chenyu
3f939f3d3c
update pm_simplify_valid ( #13241 )
...
* update pm_simplify_valid
fixed openpilot conv regression
* IMAGE training is broken
2025-11-12 19:40:02 -05:00
chenyu
f9851a852f
minor update to uop_given_valid [pr] ( #13243 )
...
split from #13241
2025-11-12 19:03:18 -05:00
qazal
fe2876a6d8
hotfix: second GB/s in viz ( #13240 )
2025-11-13 07:14:27 +08:00
George Hotz
a23dea202b
actually make AMD_LLVM not default ( #13238 )
2025-11-12 15:07:23 -08:00
George Hotz
ab9fa964d8
DISABLE_COMPILER_CACHE -> CCACHE ( #13234 )
...
* DISABLE_COMPILER_CACHE -> CCACHE
* Fix cachekey assignment in Compiler constructor
2025-11-12 15:07:09 -08:00
qazal
be2e24cb25
roc: requires sudo to install ( #13237 )
2025-11-12 16:59:22 -05:00
George Hotz
8f1f195b6d
hotfix: no hexdump for usbgpu patch.py
2025-11-12 12:05:37 -08:00
nimlgen
9a53fcbde4
amd: sqtt on rdna3.5 ( #13233 )
2025-11-13 03:30:42 +08:00
George Hotz
13f10a31dc
AMD_LLVM default off ( #13232 )
2025-11-12 11:06:33 -08:00
qazal
8b26cf2b3d
sqtt: update rcp timing test ( #13231 )
...
* sqtt: assert correct output in timing test
* found why
2025-11-13 02:01:54 +08:00
Jan Akhremchik
bc8e537423
Add NONZERO op to onnx backend ( #13211 )
2025-11-12 08:55:51 -08:00
nimlgen
af17e07251
viz: sqtt touchups ( #13228 )
...
* viz: sqtt touchups
* revert
* matches
2025-11-12 22:40:37 +08:00
qazal
7a6853fa40
viz: show python callstack in the first graph ( #13218 )
2025-11-12 20:52:28 +08:00
nimlgen
82eb63d3ad
qcom: auto switch idle timer when profiling ( #13230 )
...
* qcom: auto switch idle timer when profiling
* fi
2025-11-12 20:31:24 +08:00
nimlgen
fcd8d0751a
test_timing for hip ( #13229 )
2025-11-12 20:28:58 +08:00
qazal
74b9d33acb
viz: direct link to program source ( #13227 )
2025-11-12 16:27:13 +08:00
wozeparrot
371c1f2355
tk: move tiles to class ( #13224 )
2025-11-11 21:53:46 -08:00
Christopher Milan
41a098a82d
In-tree autogen: libc.py ( #13217 )
...
* checkout changes from autogen branch
* parents
* pylint happy
* move sys to system in helpers.py
* typo
* typo
2025-11-11 19:13:48 -08:00
wozeparrot
222bb12ddf
tk softmax ( #13205 )
2025-11-11 15:13:16 -08:00
wozeparrot
787f0070ed
feat: don't use output reg as local reduce reg ( #13203 )
2025-11-11 14:35:16 -08:00
chenyu
ece1415def
clean up image_dot and image_conv2d ( #13222 )
...
* clean up image_dot and image_conv2d
* those are fine
* interesting
2025-11-11 15:53:03 -05:00
nimlgen
2f0ea29b34
qcom: 48bit timestamps ( #13214 )
...
* qcom: 48bit timestamps
* f
* lol
* fix
2025-11-12 04:14:33 +08:00
qazal
bc55bc4849
cleanup test_viz profiler tests ( #13221 )
2025-11-12 03:46:48 +08:00
chenyu
23b90945c3
add a benchmark for openpilot vision with DEBUG=2 ( #13219 )
...
see per kernel speed, also disable the jobs for 0.9.9
2025-11-11 14:41:52 -05:00
George Hotz
c2075f3613
gc disable during big rewrites ( #13215 )
...
* gc disable during big rewrites
* cleaner with helper
2025-11-11 10:30:47 -08:00
Roelof van Dijk
e59313da08
migrate pytest and ruff ( #13216 )
2025-11-11 13:27:51 -05:00
Gaétan Lepage
6fd7ce3832
migrate to pyproject.toml ( #13189 )
...
* migrate to pyproject.toml
* move mypy config to pyproject.toml
2025-11-11 09:09:27 -08:00
qazal
8002921a04
viz: improve the program run tooltip ( #13212 )
...
* add tflops to tooltip format
* show if the run was batched
2025-11-12 00:56:03 +08:00
qazal
f91e366a17
viz: display the graph layout recursion error ( #13194 )
...
* viz: display the graph layout recursion error
* share styles
* +min-width
* same thing
* inline the append
2025-11-11 15:25:12 +08:00
wozeparrot
73497af4c0
clean: use np for allclose ( #13204 )
2025-11-10 23:02:43 -08:00
George Hotz
a6360fd94d
store can have shape ( #13202 )
...
* store can have shape
* _shape
2025-11-10 22:16:47 -08:00
b1tg
f3692b7406
clean up hip renderer ( #13063 )
...
* clean up hip renderer
* ocml
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-11-11 00:44:24 -05:00
chenyu
22b8579234
one last regressed dm kernel ( #13201 )
2025-11-10 23:30:52 -05:00
chenyu
58b7e4fab3
GROUPTOP heuristic on more axes ( #13206 )
...
fixed dm speed
2025-11-10 23:30:37 -05:00
chenyu
829cdafccc
update openpilot slow conv uop ast ( #13197 )
...
the two remaining slow ones
2025-11-10 17:03:20 -05:00
George Hotz
0c978d45e6
stub attention ( #13196 )
...
* stub attention
* name the kernels
2025-11-10 13:48:38 -08:00
chenyu
58c30fc7ce
minor image_conv2d cleanup ( #13193 )
2025-11-10 16:05:40 -05:00
chenyu
60e55d9a2d
line count 18500 ( #13191 )
2025-11-10 13:52:13 -05:00
nimlgen
09a59c2203
qcom: support new chip versioning ( #13185 )
...
* qcom: support new chip versioning
* ops
* nit
* fix
* f
2025-11-10 23:57:29 +08:00
qazal
50934050bc
sqtt: append all wave execs ( #13190 )
2025-11-10 23:50:08 +08:00
qazal
38a24731a1
cleanup sqtt tooling ( #13188 )
...
* cleanup viz/serve.py
* use latest profile in rgptool.py
* unwrap nullable in roc.py, fix disasms typing
2025-11-10 20:52:57 +08:00
qazal
845a24dcc6
viz: group sqtt waves by program ( #13187 )
...
* viz: group sqtt waves by program
* color the names
2025-11-10 19:25:23 +08:00
George Hotz
fd6803000e
mutmut cfg ( #13184 )
...
* mutmut cfg
* coveragerc
2025-11-09 23:29:29 -08:00
wozeparrot
6252831ceb
feat: initial tk library ( #13160 )
2025-11-09 22:54:29 -08:00
George Hotz
925231aec1
repeat does less reshape for 1s ( #13183 )
2025-11-09 19:43:02 -08:00
George Hotz
d7369de048
hotfix: update weekly commits table
2025-11-09 19:37:06 -08:00
chenyu
6c48c87e51
improved ASSERT_MIN_STEP_TIME ( #13182 )
...
* improved ASSERT_MIN_STEP_TIME
getting close, current time +1ms then round up
* relax
2025-11-09 16:41:12 -05:00
nimlgen
17715688c7
system: validate vendor for APLPCIIfaceBase ( #13181 )
2025-11-10 02:49:21 +08:00
nimlgen
614783693e
nv: remove hardcoded expansion_rom_off ( #13180 )
...
* nv: remove hardcoded expansion_rom_off
* to max size
2025-11-09 21:43:19 +08:00
chenyu
e1d46de8f8
update GROUPTOP heuristic more ( #13178 )
...
reverts #13176
2025-11-09 02:31:12 -05:00
chenyu
41e45c20ff
minor stuff reading the printed code [pr] ( #13177 )
2025-11-09 00:58:51 -05:00
chenyu
8e868dced8
only GROUPTOP one reduce kernel ( #13176 )
...
* only GROUPTOP one reduce kernel
* ALLOWED_GATED_READ_IMAGE=148
2025-11-08 22:38:44 -05:00
chenyu
834067d91f
move onnx import in compile3 ( #13172 )
...
only used in test_vs_onnx
2025-11-08 09:44:34 -08:00
nimlgen
7f3240dbfe
nv: cleanup alloc ( #13170 )
...
* nv: cleanup alloc
* okay okay
2025-11-09 00:14:46 +08:00
qazal
7250fc0354
viz: double click on kernel run goes to codegen ( #13147 )
2025-11-08 23:40:50 +08:00
qazal
8a7fa9e7b4
sqtt: show total cycles of kernel in viz ( #13169 )
2025-11-08 21:00:40 +08:00
chenyu
2ba8b4946f
external_benchmark_op_cat.py ( #13168 )
...
* external_benchmark_op_cat.py
cat kernel that's 1ms on master and 50us with no GROUP and with NOLOCALS
* fix
2025-11-08 01:54:10 -05:00
chenyu
a62496cb3d
clean up get_grouped_dims [pr] ( #13159 )
2025-11-08 01:53:54 -05:00
wozeparrot
eb0192b0bb
feat: print ranges that aren't ended ( #13167 )
2025-11-07 22:01:29 -08:00
George Hotz
b41541bc44
bounty: Remove Tensor._pool alternative implementation and verify kernels remain the same ( #13164 )
2025-11-07 16:59:48 -08:00
George Hotz
ffb9e8396f
fix indexing bug with convs
...
* minimal difference for ONE_POOL=1
* fix indexing bug
* improve indexing debugger
* more debugger improvements
* always for reshape
2025-11-07 16:45:19 -08:00
chenyu
6a509da7f3
Scheduler.reduceops helper [pr] ( #13162 )
2025-11-07 18:59:46 -05:00
George Hotz
2413311289
make _pool simpler ( #13161 )
...
* make _pool simpler
* just syntax
* more correct and smaller
* try this now
* Revert "try this now"
This reverts commit 607cdc2164 .
* ONE_POOL
2025-11-07 15:58:44 -08:00
George Hotz
70054cdb14
move backward cast to broadcasted, expand to mixins ( #13156 )
...
* shrink_to mixin
* move backward cast into _broadcasted
* expand to movement mixin
* move a few more
* fix spec issue
2025-11-07 15:07:47 -08:00
George Hotz
f2519ea0ba
shrink_to mixin ( #13155 )
2025-11-07 11:46:24 -08:00
C T
0f9d7f650d
whisper: fix oob, explicit dtype ( #13144 )
...
* fix dtype depending on numpy version
numpy v2 np.array returns int64 which Tensor passed through for the
first decode call, swallowing the <|notimestamps|> token and corrupting
the sequence
* fix whisper OOB
global limit on whisper's context length
* enforce whisper max_tokens_to_sample (match openai)
local limit on max tokens decoded
2025-11-07 12:55:01 -05:00
Ahmed Harmouche
3ecff3a8da
Fix dim splitting bug for len(dim) == len(limited) case ( #13142 )
...
* Fix gpudims bug on webgpu
* Fix split dim bug
* Remove webgpu_bug from examples
* Add test for shape correctness
* Fix 3D indexing
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-11-07 12:31:06 -05:00
nimlgen
b8e48effcb
device: no compilers message with reasons ( #13146 )
...
* device: no compilers message with reasons
* typings
* mypy
2025-11-07 23:01:45 +08:00
nimlgen
35e461ef69
hcq: use exception group ( #12616 )
...
* hcq: use exception group
* fix
2025-11-07 21:23:12 +08:00
nimlgen
10dc8335d2
tinygpu: fix teardown crash ( #13143 )
...
* tinygpu: fix crash
* um?
* double relase
* restore
2025-11-07 19:52:54 +08:00
qazal
d4a216d7d9
viz: display compiler errors ( #13141 )
2025-11-07 18:09:50 +08:00
qazal
7e94369464
add helper for test_timing custom ops ( #13140 )
2025-11-07 17:13:55 +08:00
nimlgen
95620426d5
tinygpu: unmap dma when client closed ( #13129 )
...
* tinygpu: unmap dma when client closed
* syn
* tiny fixes
2025-11-07 16:08:43 +08:00
wozeparrot
500d7661fa
feat: show range len on index in viz ( #13139 )
2025-11-06 23:21:27 -08:00
George Hotz
bb6364d7c7
tuplize from linearizer behind flag ( #13136 )
...
* remove tuplize from linearizer
* optional tuplize
2025-11-06 20:15:03 -08:00
chenyu
bb8cf948f2
variation of (x%c)+(x//c)*c = x ( #13135 )
...
when x is in the form of y//b, the idiv term might have combined
2025-11-06 18:53:28 -05:00
George Hotz
42b34cf83d
bottom up linearizer ( #13133 )
...
* bottom up linearizer
* late stores
* more complete
* remove broken heuristic
* upcast size
* opt
* more conservative
* it needs that
* disable opencl half on QCOM
* fix
* make that a real test
* cpu test okay
* ptx skip
* end is after the range
2025-11-06 15:30:32 -08:00
George Hotz
e0d828dba8
little cleanups
2025-11-06 13:58:19 -08:00
chenyu
bfb0c0391f
test custom eye function ( #13134 )
...
this version is also faster with NOOPT
2025-11-06 14:51:55 -05:00
George Hotz
290441dd44
do loads early ( #13131 )
...
* do loads early
* local and reg
2025-11-06 09:57:09 -08:00
George Hotz
097264853d
very simple priority ( #13130 )
...
* very simple priority
* still simple
2025-11-06 09:25:28 -08:00
George Hotz
07b415e831
fixup op order ( #13128 )
...
* fixup op order
* more order
* move a few more
* more
* DEBUG_LINEARIZE
2025-11-06 08:50:04 -08:00
nimlgen
b9b68bf437
amd: add kern to sqtt event ( #13126 )
...
* amd: add kern to sqtt event
* fix
2025-11-06 22:02:02 +08:00
qazal
88245d6579
qol improvements to sqtt decoder and timing tests ( #13125 )
2025-11-06 20:51:30 +08:00
nimlgen
dafdb4bfb1
test hcq open with pytest ( #13124 )
...
* test hcq open with pytest
* fi
2025-11-06 20:09:51 +08:00
nimlgen
05e2ff4d87
system: fix flock on pcidevs ( #13123 )
...
* system: fix locking of hcq devices
* rename and fullrun
* force ok
* fix
* fix
2025-11-06 19:02:13 +08:00
qazal
3126c89b84
viz: visible horizontal scrollbar in long texts ( #13122 )
2025-11-06 17:23:02 +08:00
George Hotz
91cc773397
add run count to toposort ( #13119 )
2025-11-05 22:29:34 -08:00
Adeeb Shihadeh
dca7fb0a49
qcom: make priority configurable ( #13120 )
2025-11-05 22:27:54 -08:00
qazal
b2bb3af12a
make range_color work in VIZ ( #13121 )
2025-11-06 14:26:48 +08:00
chenyu
f33c182393
test custom qkv kernel ( #13118 )
...
adding the online softmax hits infinite loop so starting with this
2025-11-05 23:32:13 -05:00
George Hotz
c65e6d8887
add ranges to print_uops ( #13116 )
...
* remove tuplize from linearizer
* try this
* simple priority
* add colored ranges to print_uops
* improve comments
* fix no const in src
* fix mypy
* fix define global
* fix var placement
* no prefer early load
* revert linearizer for now
2025-11-05 20:26:56 -08:00
George Hotz
9b2b535fa4
fix issue with multi flip ( #13115 )
2025-11-05 15:28:50 -08:00
George Hotz
4027eef264
fix test warnings ( #13114 )
...
* fix test warnings
* precommit passes
* ignore std_mean warning
2025-11-05 15:06:29 -08:00
George Hotz
bcfe42937f
move permute/flip/shrink to mixins ( #13113 )
...
* move permute to mixins
* move more stuff
* two more
* fix local mypy
* fix tests
* fix shrink
2025-11-05 14:14:15 -08:00
George Hotz
2d4f01fda0
move mixins to mixin dir ( #13105 )
...
* move mixins to mixin dir
* math
2025-11-05 10:18:33 -08:00
chenyu
52f0081e77
use where instead of mul in Embedding ( #13112 )
2025-11-05 12:49:01 -05:00
b1tg
edc4e1aede
ignore trailing nops in llvm-objdump output ( #13110 )
2025-11-06 01:10:51 +08:00
chenyu
03ee0cfe45
minor fast_idiv cleanup [pr] ( #13109 )
2025-11-05 11:44:36 -05:00
chenyu
18d4ecc1f3
lower nv test_gemm_4096 target ( #13107 )
2025-11-05 11:05:16 -05:00
nimlgen
eff80beeed
amd: props in device not sqtt ( #13106 )
...
* amd: props in device not sqtt
* fix
* f
* fix
* fix
2025-11-05 23:43:20 +08:00
nimlgen
757ceab2a2
system: allow using vidmem for uc mem ( #13104 )
2025-11-05 19:12:59 +08:00
qazal
8119d9f082
sqtt: decode each instruction exec ( #13093 )
...
* sqtt: decode each instruction exec
* start tests
* run_asm
* capture sqtt per kernel
* chaining vgprs
* test things
* inst_execs in viz
* can also configure l and g
* 1l + cleanup
* test_sleep
* test_wmma
* work
* test sleep with llvm builtin
2025-11-05 17:30:27 +08:00
chenyu
54141e9cb9
DISABLE_COMPILER_CACHE=1 in speed_v_theoretical ( #13096 )
2025-11-04 11:28:18 -05:00
chenyu
1c9f720654
remove unused type ignore [pr] ( #13095 )
2025-11-04 10:08:07 -05:00
nimlgen
c857dc5af0
autogen: try/except in try_dlopen ( #13094 )
...
* autogen: try/except in try_dlopen
* ugh
2025-11-04 22:51:53 +08:00
nimlgen
eaf7cbc178
amd: flush sqtt after each kernel ( #13092 )
...
* amd: flush sqtt after each kernel
* merge for rgp
2025-11-04 22:12:48 +08:00
qazal
96417665e8
show sqtt decoder errs in viz ( #13088 )
...
* show sqtt decoder errs in viz
* don't touch roc.py
* give hljs a default language
* work from tinyr9
* work
2025-11-04 22:05:06 +08:00
nimlgen
49191ada77
roc: install sqtt decoder ( #13091 )
...
* roc: install?
* msg
* 0.1.4
2025-11-04 18:56:01 +08:00
nimlgen
16f1f644ba
amd: remove sqtt=2 ( #13090 )
2025-11-04 18:29:24 +08:00
nimlgen
2e97eaa866
roc: no nullptr when no wave instructions ( #13087 )
2025-11-04 17:32:14 +08:00
wozeparrot
9c00c0688a
tk fa: use 16x64 tiles ( #13086 )
2025-11-03 18:25:38 -08:00
wozeparrot
4ed0f216b5
fix: make max_matmul run again ( #13085 )
2025-11-03 18:09:09 -08:00
chenyu
ca17718b6d
remove symbolic_flat ( #13083 )
...
* remove symbolic_flat
some kernels are different but sometimes it's better so not clear, will merge as long as benchmark passes
* test_location
2025-11-03 17:25:21 -05:00
chenyu
fda720e013
simpler _is_balanced [pr] ( #13082 )
...
returns False earlier
2025-11-03 16:47:14 -05:00
chenyu
ddf01fdb15
revert mlperf.yml setting ( #13080 )
2025-11-03 15:24:13 -05:00
qazal
6df34a5887
lint sqtt parser with mypy ( #13079 )
...
* llvm address table errs
* mypy likes annotated dicts
* unwrap nullable
2025-11-04 00:53:59 +08:00
qazal
2d2040bc92
viz: tabulate sqtt ( #13078 )
...
* viz: tabulate sqtt
* nomore asdict
2025-11-04 00:03:15 +08:00
nimlgen
dfde3f54d9
rocprof: use llvm disasm ( #13077 )
...
* rocprof: use llvm disasm
* rm
2025-11-03 23:58:58 +08:00
qazal
27d42fd575
sqtt decoder print behind DEBUG>=5 ( #13076 )
...
* sqtt decoder print behind DEBUG>=5
* gfx version stuff also behind 5
2025-11-03 23:20:03 +08:00
George Hotz
416b15cc59
improve uop matmul syntax ( #13074 )
...
* improve uop matmul syntax
* store takes const
* copy
* cleanups
* faster and simpler
* label them reduce
* better syntax
* touchup
2025-11-03 21:34:26 +08:00
nimlgen
08855c162b
amd: correct sqtt_read for several xccs ( #13075 )
...
* amd: correct sqtt_read for several xccs
* default mask
2025-11-03 19:59:56 +08:00
qazal
1c0d4f1cd2
viz: counters loader ( #12987 )
...
* standalone custom loader
* first iteration on the ui
* work
* add center helper
* add edge offsets
* enumerate all edge types
* try dagre layout algorithm
* simpler spec
* bring back double edges
* more work on edge paths
* aesthetics
* custom edges also works
* dimmer inactive links
* cleanup
* cleanup
* split out the ncu layout
* this is just a k/v map now
* rm that
* more cleanup and comments
* do work
* also this work
* simpler start
* rm that
* sqtt work
* view sqtt
* sqtt
* --custom is just in profile
* wrap c call
* from tinygrad install
* eg. module not found
2025-11-03 19:42:36 +08:00
George Hotz
1e3d6e49a6
index slicing + allclose ( #13071 )
...
* continue work on slicing+allclose
* Revert "Revert "slicing + allclose""
This reverts commit 6c7a12f21c .
* fix tests + better syntax
* forgot an after
* slot is an integer
2025-11-03 13:01:48 +08:00
George Hotz
6c7a12f21c
Revert "slicing + allclose"
...
This reverts commit c9a1e35b1e .
2025-11-03 12:05:44 +08:00
George Hotz
c9a1e35b1e
slicing + allclose
2025-11-03 12:00:45 +08:00
chenyu
a317d6e625
extra/amdpci/setup_python_cap.sh ( #13070 )
2025-11-02 19:19:36 -05:00
chenyu
ad501ce50a
mlperf cron install tqdm ( #13069 )
...
one more...
2025-11-02 18:09:27 -05:00
chenyu
2c8d619147
mlperf cron install influxdb3-python ( #13068 )
2025-11-02 17:55:40 -05:00
chenyu
4c22f089fc
mlperf cron install tensorflow try 2 ( #13067 )
2025-11-02 17:11:01 -05:00
chenyu
c58cf91850
mlperf cron install tensorflow ( #13066 )
2025-11-02 16:48:05 -05:00
chenyu
74db65cf72
update mlperf bert LOGMLPERF ( #13065 )
2025-11-02 15:26:37 -05:00
chenyu
b18293de96
train bert in mlperf cron ( #13064 )
...
more relevant now
2025-11-02 15:04:02 -05:00
nimlgen
be0028d3ce
amd: universal set_grbm ( #13062 )
...
* amd: universal set_grbm
* fix
2025-11-03 03:35:55 +08:00
nimlgen
37a730abce
amd: fix pmc sq gfx11+ ( #13058 )
...
* amd: fix pmc sq gfx11+
* fix
2025-11-02 21:56:47 +08:00
qazal
24054bb655
viz: check overlay width after layout ( #13060 )
2025-11-02 21:47:58 +08:00
George Hotz
962d980919
fuse hasn't worked since rangeify, remove it ( #13057 )
2025-11-02 14:01:52 +08:00
George Hotz
036ee9f84c
Self type + mixins ( #13056 )
...
* use Self type
* mixin
* fix later
2025-11-02 13:30:01 +08:00
George Hotz
8cbef912d2
move reshape to MathTraits ( #13054 )
...
* move reshape to MathTraits
* confirm it works in amd_uop_matmul
2025-11-02 12:56:15 +08:00
George Hotz
1ff341bae5
python 3.11 is now required ( #13055 )
2025-11-02 12:55:40 +08:00
George Hotz
267be7fc5e
fp16 acc
2025-11-02 12:53:04 +08:00
wozeparrot
8206eab4fc
fix: tk fa 4 workers ( #13052 )
2025-11-01 16:41:29 -07:00
Sieds Lykles
885b6dea9e
multiple reduce range arange folding ( #13047 )
...
* multi reduce arange folding
* add test
* cvar to var
* add circular_pad_bw test
2025-11-01 22:11:26 +01:00
Sieds Lykles
f97fb703c8
catch group error in matvec heuristic ( #13051 )
2025-11-01 22:09:35 +01:00
Sieds Lykles
ecb8565f67
Revert "Better cleanup of arange bufferize ( #13046 )" ( #13048 )
...
This reverts commit c99b7dfd4a .
2025-11-01 18:09:37 +01:00
Sieds Lykles
c99b7dfd4a
Better cleanup of arange bufferize ( #13046 )
...
* check for reduce and index instead of cast
* add test
2025-11-01 16:16:31 +01:00
nimlgen
051aab5481
open viz with sqtt flags ( #13001 )
2025-11-01 22:48:17 +08:00
nimlgen
2db57f3a97
amd: better msg when out of perf regs ( #13042 )
2025-11-01 22:47:50 +08:00
chenyu
bebec73471
write custom_sum with set and after ( #13045 )
2025-11-01 10:45:30 -04:00
George Hotz
e98506735b
add CONTRACT support to UOp programs ( #13043 )
...
* add contract support
* use contract
* 342 tflops
2025-11-01 19:11:32 +08:00
George Hotz
65a0a31475
AMD mi350x matmul from stream ( #13040 )
...
* works
* working mfma
* 120 TFLOPS
* regs
* 192 TFLOPS
* try pipelining
* something
* notes
* contract
* linter to 3.11
* that was a bug
2025-11-01 17:55:19 +08:00
chenyu
f396df26ea
test custom sum ( #13039 )
...
* test custom sum
this is higher level than set and after?
* only float
2025-10-31 19:25:56 -04:00
nimlgen
a23226e61e
amd: pmc for gfx9 ( #13036 )
...
* amd: pmc for gfx9
* xcc
* vmid mask
* ugh
* tiny
* minor
* sorryg
2025-11-01 04:26:34 +08:00
nimlgen
f6786c1bfd
autogen: py314 ( #13038 )
...
* autogen: py314
* bump py?
2025-11-01 04:02:19 +08:00
nimlgen
d532117df5
amd: rename set_grbm_se -> set_grbm_se_sh ( #13037 )
2025-11-01 01:37:57 +08:00
nimlgen
a9e5ffd3d1
amd: new pmc src ( #13034 )
2025-11-01 01:33:23 +08:00
Sieds Lykles
3dc593c536
add strip_params to pyrender ( #13021 )
...
* add strip_params to pyrender
* update that one too
* strip_parens fix
* cleaner
* add test
* add some more tests
* cleaner strip_parens
2025-10-31 14:15:56 +01:00
George Hotz
bc178d14a9
matmul example on metal showing off tensor core ( #13033 )
...
* matmul example on metal showing off tensor core
* flip the args of placeholder
* mat_idx
* imp
2025-10-31 19:40:36 +08:00
George Hotz
e066b3176b
hotfix: types and names for custom kernel test
2025-10-31 17:34:55 +08:00
George Hotz
54f48f93c6
working backward pass in custom kernel ( #13032 )
...
* working backward pass in custom kernel
* custom_kernel tensor method
* no SPEC=2
2025-10-31 17:26:18 +08:00
George Hotz
b791d70725
support custom UOp kernels ( #13028 )
...
* support custom UOp kernels
* no number
* multioutput works
* backward kernel runs
* move kernel class
* grad later
* work
* no tags in kernel graph
* test arange
* arange + contig
* delete comment
2025-10-31 15:51:39 +08:00
qazal
9f0c25ec48
viz: use indexing toggle for schedule graph ( #13031 )
2025-10-31 15:32:08 +08:00
George Hotz
b2caf4c2b3
prepare for custom kernel ( #13029 )
2025-10-31 14:47:37 +08:00
qazal
564e9ccc31
fix show indexing toggle default on ( #13030 )
2025-10-31 14:41:15 +08:00
qazal
6cd341354e
viz: add toggle to hide indexing UOps ( #13027 )
...
* start
* pass opts to worker
* works
* rename to showIndexing
* keep toggle through rewrites
* fix nan
* real fix for nan
* move render function
* fix firefox
* fix safari
* more work
2025-10-31 13:21:11 +08:00
George Hotz
b46229ca51
use shrink in amd_matmul_uop ( #13026 )
...
* use shrink in amd_matmul_uop
* colors
2025-10-31 10:43:41 +08:00
wozeparrot
78f7650eec
faster tk matmul ( #13006 )
2025-10-30 19:09:27 -07:00
George Hotz
512513c403
cleanup amd uop matmul ( #13025 )
...
* cleanup amd uop matmul
* remove mod
* move that out
* better variable names
* var names
* more
* render fallback
* colors
2025-10-31 10:04:45 +08:00
chenyu
f6430a0559
add script for one slow openpilot conv ( #12953 )
...
* add script for one slow openpilot conv
* fix ruff
2025-10-30 18:08:41 -04:00
chenyu
73002ebffa
print p.applied_opts with DEBUG >= 3 ( #13024 )
2025-10-30 16:51:21 -04:00
chenyu
99e76f33a0
remove unneeded TYPE_CHECKING [pr] ( #13020 )
2025-10-30 12:01:13 -04:00
nimlgen
629b177b66
amd: sqtt works in profile mode ( #13019 )
2025-10-30 23:48:52 +08:00
Sieds Lykles
4c8362128b
New symbolic renderer + strip parens ( #13017 )
...
* new uop renderer
* better tester
* strip parens
* update tests
* split method check_uop_against_string
* use ctx.update instead of add_rendered method
* strip parens based on precedence
* update test
* new symbolic renderer
* add comment
2025-10-30 16:41:32 +01:00
chenyu
c78dfcc5a1
simplify ProgramSpec __post_init__ STORE/LOAD [pr] ( #13018 )
2025-10-30 11:13:21 -04:00
b1tg
363a201cc6
fp8 amd cstyle ( #12999 )
...
* amd fp8 cstyle
* don't repeat
* space
* lint
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-10-30 10:45:52 -04:00
nimlgen
5be3a93d02
amd: enable pmc on gfx12 ( #13015 )
2025-10-30 22:43:10 +08:00
nimlgen
cf5ab93b8e
amd: pmc grbm block ( #13016 )
2025-10-30 22:42:59 +08:00
nimlgen
4d7a7096c9
am: enable perfmon ( #13013 )
...
* am: enable perfmon
* try
* msg
2025-10-30 22:28:36 +08:00
chenyu
985b6eb95f
ues less typing.cast [pr] ( #13002 )
2025-10-30 09:29:52 -04:00
George Hotz
5eb87ab131
hotfix: bump cifar time to 350
2025-10-30 17:29:20 +08:00
George Hotz
4a741e8364
modernize amd uop matmul ( #13011 )
...
* modernize amd uop matmul
* progress
* comment
* more comments
* revert that
* mac cleanups
* fix estimates
* format
2025-10-30 17:02:38 +08:00
qazal
66ea3a0be4
put DEFINE_LOCAL counter in context ( #13008 )
2025-10-30 15:49:26 +08:00
George Hotz
e456f2cb1e
more uop programs ( #13007 )
...
* more uop program
* test_matmul_relu
* tests fix
2025-10-30 14:57:59 +08:00
wozeparrot
c18b283f58
feat: timeout on stuck socket ( #13009 )
2025-10-29 23:11:26 -07:00
wozeparrot
92a87e37e4
fix: fetch_file ( #13010 )
2025-10-29 22:44:22 -07:00
George Hotz
e64d4b3b44
uops programs ( #13005 )
...
* uops programs
* work
* work
* more syntax
* more syntax
* comments
2025-10-30 12:28:10 +08:00