chenyu
687ade119e
IMAGE hand_coded_optimizations update ( #16720 )
2026-06-23 21:55:28 -04:00
George Hotz
0a8e61d0c5
switch to the new memory coalescer [pr] ( #16716 )
...
* switch to the new memory coalesce
* move that stuff
* copy in allowed length logic
* multiple buffers
* new coalesce is better
* fine
* earlier
* fixes
* work
* work
* valid
* stack on index const
2026-06-23 18:03:48 -07:00
wozeparrot
dfea9e7994
llama: fused silu mul quantize mxfp8 ( #16704 )
2026-06-23 16:59:50 -07:00
chenyu
ce87d80911
better _drop_valid_stmts [pr] ( #16719 )
...
also dropped the unused is_increasing
2026-06-23 19:35:01 -04:00
George Hotz
5a2b3b7b06
early dtype decomp ( #16718 )
...
* early dtype decomp
* simplify
* cleanup
* that goes there
* doing too much
* stupid symbolic rules
2026-06-23 16:07:20 -07:00
Christopher Milan
116045cc8e
ci: remove tensorflow from testoptim ( #16717 )
2026-06-23 18:11:48 -04:00
nimlgen
7c1d0b6d9a
hcq2: use shrink(bitcast) ( #16713 )
...
* hcq2: use shrink(bitcast)
* x
2026-06-23 18:11:39 +03:00
George Hotz
c9dc1d63cc
small changes from new codegen ( #16712 )
...
* small changes from new codegen
* shrink/flatten
2026-06-22 17:44:15 -07:00
Christopher Milan
da98fae9e1
ci: try parallelizing tc tests ( #16710 )
2026-06-22 20:43:32 -04:00
chenyu
15988b5941
contiguous to mixin and cleanups [PR] ( #16711 )
2026-06-22 20:18:18 -04:00
Christopher Milan
cbfcf36e44
ci: remove generate_dataset and CL misc ( #16709 )
2026-06-22 18:01:07 -04:00
nimlgen
f9c8c697d6
hcq2: drop args after inner deps ( #16708 )
2026-06-22 23:26:11 +03:00
chenyu
0138480910
dropout and scaled_dot_product_attention to mixin ( #16707 )
2026-06-22 16:17:45 -04:00
chenyu
33b635d23a
Tensor.train -> TRAINING [PR] ( #16705 )
...
* Tensor.train -> TRAINING [PR]
* doc
2026-06-22 15:13:22 -04:00
chenyu
625d8bbd0d
TRAINING ContextVar ( #16703 )
2026-06-22 13:03:08 -04:00
wozeparrot
fe9b19b12d
llama: more mp mem fixes ( #16701 )
...
* llama: more mp mem fixes
* clean: unused
* fix: batch
2026-06-22 10:54:35 -04:00
chenyu
267af9c601
full_like to CreationMixin [PR] ( #16702 )
2026-06-22 09:33:23 -04:00
chenyu
97da54b9d6
more method to CreationMixin [PR] ( #16698 )
2026-06-22 00:01:22 -04:00
chenyu
fd0dc40689
clean up CreationMixin and DTypeMixin [PR] ( #16697 )
2026-06-21 21:13:40 -04:00
chenyu
2d8b802958
contiguous in wino conv ( #16696 )
...
also fixed test_counters
2026-06-21 17:11:46 -04:00
chenyu
ba1d3baae8
masked_select and nonzero to mixin [PR] ( #16695 )
...
with a .data stub
2026-06-21 15:10:44 -04:00
chenyu
d80a41d559
some rand method to RandMixin [PR] ( #16693 )
2026-06-21 12:16:51 -04:00
wozeparrot
5164c21b44
gemm: keep shape thru mxfp8 quantize ( #16692 )
2026-06-20 22:28:53 -07:00
chenyu
58ff75272e
const_like and invalids to mixin [PR] ( #16690 )
...
* const_like and invalids to mixin [PR]
* empty_like
* einsum
* type
2026-06-21 00:02:29 -04:00
chenyu
b50da5c205
move Tensor.__getitem__ to mixin [PR] ( #16689 )
2026-06-20 22:01:45 -04:00
chenyu
4618d27129
final const cleanups [PR] ( #16688 )
2026-06-20 21:38:16 -04:00
chenyu
9ae0a93d0e
more const cleanups [PR] ( #16682 )
2026-06-20 20:41:43 -04:00
George Hotz
30830850a9
small changes from new codegen ( #16681 )
...
* small changes from new codegen
* revert that
2026-06-19 18:29:01 -07:00
chenyu
8b07cca9f7
invalid clone try 3+ [PR] ( #16679 )
2026-06-19 20:13:52 -04:00
Christopher Milan
b2199c54a3
ci: update actions/cache/restore to suppress warnings ( #16680 )
2026-06-19 18:27:52 -04:00
Christopher Milan
1822eed8d3
ci: only test models on cpu ( #16678 )
2026-06-19 18:16:59 -04:00
wozeparrot
bba611bb59
gemm: fix mxfp8 on more shapes ( #16677 )
2026-06-19 13:28:53 -07:00
chenyu
67c3e589a1
invalid clone tests and prereq [PR] ( #16675 )
2026-06-19 13:20:43 -04:00
George Hotz
649971f02a
remove DEFINE_LOCAL and DEFINE_REG (gpt) ( #16673 )
...
* remove define_local and define_reg (gpt)
* fix precommit
* cleanups
* regalloc fix
* cleanups 2
2026-06-19 10:07:50 -07:00
George Hotz
b05bea81ce
x86 cleanups (fable) [pr] ( #16591 )
...
* x86 cleanups (fable)
* support shrink
* remove ptr dtype
* move that
* is_lane helper
* Revert "is_lane helper"
This reverts commit ea4571254d .
2026-06-19 09:04:51 -07:00
nimlgen
97c2e7a3d9
spec: add getaddr ( #16674 )
2026-06-19 15:37:33 +03:00
George Hotz
d7b10c69bc
update placeholder to not create DEFINE_LOCAL/DEFINE_REG ( #16671 )
...
* update placeholder to not create DEFINE_LOCAL/DEFINE_REG
* simpler
* define_local
2026-06-18 21:21:06 -07:00
Christopher Milan
091ec8d10d
use tinygrad.llm in benchmarks ( #16670 )
2026-06-19 00:03:57 -04:00
George Hotz
925c49ce99
use placeholder in tests ( #16672 )
2026-06-18 20:51:44 -07:00
wozeparrot
05249466ed
llama: fused quantize mxfp8 ( #16667 )
2026-06-18 16:02:28 -07:00
George Hotz
4a4b6956df
remove DEFINE_VAR from codebase (gpt) ( #16666 )
...
* remove DEFINE_VAR from codebase
* junk
* remove junk
2026-06-18 15:33:50 -07:00
nimlgen
eda0a402d1
hcq2: fix multi ( #16661 )
2026-06-18 22:56:49 +03:00
George Hotz
5989d0b150
remove DEFINE_VAR try 2 ( #16651 )
...
* remove DEFINE_VAR try 2
* param
* null index
* fix fuzzing
* fixes
* no gather neg params
* param is just Irreducible
* fixes
* skip stack
* need to filter slots there
2026-06-18 12:34:25 -07:00
wozeparrot
d37248c3ec
gemm: fix mxfp8 on odd shapes ( #16664 )
2026-06-18 12:03:59 -07:00
chenyu
d74f488376
clean up _function.depth properly [PR] ( #16663 )
2026-06-18 14:10:22 -04:00
chenyu
d7a1022188
minor function.py cleanups [PR] ( #16662 )
2026-06-18 13:36:48 -04:00
qazal
924bece1d5
remove some old scheduler tests ( #16660 )
2026-06-18 22:15:00 +09:00
qazal
b753fb5e4c
viz: view source working even if compile failed ( #16657 )
...
* failing test
* hard
* ret_dict
* switch to _data for tests too
* update sqtt
* start work
* Ops.LINEAR looks good
* baseline with depth works
* support depth
* types
* @needs_tracked_pm
* update, marg can error too
* unwrap_or goes to many more places
* move things to soft_err
* soft_err everywhere needed
* diff cleanup
* use list
* rewrite it
* change
* update depth number
* small comment change
2026-06-18 17:34:53 +09:00
qazal
31094a794f
viz: data not sent to client side starts with _ ( #16659 )
...
* ret_dict
* switch to _data for tests too
* update sqtt
* rename to filter_keys
* not cfg
2026-06-18 15:25:22 +09:00
qazal
1720987dc7
include exception name in Ops.REWRITE_ERROR ( #16658 )
2026-06-18 14:52:48 +09:00
wozeparrot
bed0c343a3
faster mxfp8 gemm ( #16656 )
2026-06-17 22:35:36 -07:00
Christopher Milan
e0fe6e542e
ci: fewer pydeps ( #16654 )
2026-06-17 22:52:14 -04:00
chenyu
a74b7130b4
Revert "invalid clone try 2 [PR] ( #16648 )" ( #16653 )
...
This reverts commit 1bd4551ee1 .
2026-06-17 22:05:30 -04:00
chenyu
df015ad541
remove many type ignores [PR] ( #16652 )
2026-06-17 21:38:45 -04:00
chenyu
1bd4551ee1
invalid clone try 2 [PR] ( #16648 )
2026-06-17 19:44:35 -04:00
George Hotz
53a1226a49
STACK 0 is dtype void ( #16650 )
...
* STACK 0 is dtype void
* spec for stack
* fix gemm group + END shape
* bump
2026-06-17 16:28:32 -07:00
George Hotz
aef85ddc4d
addrspace special/range ( #16647 )
...
* addrspace special/range
* just include indexing
* define var is alu
* bring old ignore indexing back
* mults to fix
* fixes
* ALU
* fixes
2026-06-17 15:57:37 -07:00
chenyu
1e08c0a07c
remove NOOP from AFTER with multiple srcs ( #16646 )
2026-06-17 14:35:02 -04:00
chenyu
1acc40600d
indexing an after with all fully invalid stores is invalid ( #16643 )
...
* indexing an after with all fully invalid stores is invalid
* typing cast
2026-06-17 11:06:36 -04:00
nimlgen
0f0c622086
hcq2: multi folders ( #16642 )
2026-06-17 15:20:25 +03:00
George Hotz
be9b570cb2
late numbering of var params ( #16640 )
...
* do_number_param
* fix sort order in x86
* we don't want this
2026-06-17 00:36:08 -07:00
qazal
c7055d658f
viz: only store kernel info ( #16641 )
2026-06-17 16:21:57 +09:00
George Hotz
d631716858
remove const without STACK ( #16639 )
...
* remove const without STACK
* fix GEP rewrite
* fix null tests
* fix openpilot regression
* it's 10 in CI
2026-06-16 21:25:42 -07:00
wozeparrot
36f6d1b064
gemm: fix bf16 atb for mp sharding ( #16637 )
2026-06-16 15:58:47 -07:00
qazal
1cb6b88d37
viz: show contents of vconst ( #16636 )
...
* failing test
* render vconst
* simpler test
* reorder
2026-06-17 02:31:03 +09:00
nimlgen
5644605d92
hcq2: pack bufs ( #16635 )
...
* hcq2: pack bufs
* x
2026-06-16 18:58:16 +03:00
chenyu
d5d59a2be6
remove dead rangeify rules [PR] ( #16634 )
2026-06-16 10:03:08 -04:00
chenyu
f0998e9bba
Revert "invalid clone is anonymous buffer" ( #16613 ) ( #16633 )
2026-06-16 08:27:48 -04:00
qazal
7d2b0b697d
simple failing test for invalid extra E kernel ( #16632 )
...
* simple failing test for invalid extra E kernel
* 6 kernels
2026-06-16 17:57:44 +09:00
wozeparrot
70cac72781
llama: realize weight init ( #16623 )
2026-06-15 23:00:19 -07:00
Christopher Milan
443f976305
fix buffer overrun in dcache_flush ( #16630 )
2026-06-15 23:26:32 -04:00
chenyu
aa2bef24a8
no_vectorized_alu in cstyle does nothing now [PR] ( #16631 )
2026-06-15 23:07:20 -04:00
chenyu
efd03d7153
invalid clone is anonymous buffer [PR] ( #16613 )
2026-06-15 20:14:26 -04:00
nimlgen
4a0488ae97
hcq2: optims ( #16624 )
...
* hcq2: optims
* x
2026-06-15 23:58:28 +03:00
George Hotz
41aa2fe119
test_gemm needs .clone() on eye ( #16629 )
2026-06-15 12:48:27 -07:00
qazal
10bdb9c9d0
viz: check node exists before anchoring zoom ( #16627 )
2026-06-15 21:03:24 +09:00
qazal
f998b9930a
fp8 gemm inv_scale in epilogue ( #16625 )
...
* fuse scale
* remove python inv_scale
* more inv_scale removal
* more cleanups
* cleaner
* diff polish
* work
* rename
* simpler
* simpler
* compute
* c
* Revert "c"
This reverts commit 8941fec7ca .
* Revert "compute"
This reverts commit 9db573a6d3 .
* Revert "simpler"
This reverts commit 910ad33f87 .
* Revert "simpler"
This reverts commit bf75d235a1 .
* s_g
* update types
* less diff noise
* remove
2026-06-15 18:44:41 +09:00
nimlgen
4dc51aff6e
hcq2: jit ( #16621 )
...
* hcq2: jit
* x
* x
* minor
2026-06-15 06:35:35 +07:00
chenyu
2adedf5ccb
clean up fold_divmod_general [pr] ( #16622 )
...
generalized fold_binary_numerator in fold_divmod_congruence
2026-06-14 17:15:52 -04:00
George Hotz
a6d7fb9d4d
only SHRINK for non scalar access ( #16619 )
2026-06-14 10:08:37 -07:00
George Hotz
b1fb39502d
delete that test
2026-06-14 09:42:58 -07:00
chenyu
2e181f4259
simpler cancel_divmod [PR] ( #16616 )
2026-06-14 11:41:31 -04:00
chenyu
5d5ead78da
inline unique_const in invalids [PR] ( #16612 )
2026-06-13 10:14:32 -04:00
Sieds Lykles
b00dd754a9
Remove if-condition from nested div rule [pr] ( #16611 )
...
* add rules and test
* trigger [pr]
2026-06-13 15:47:21 +02:00
nimlgen
5a9227b30a
hcq2: rebind var params ( #16610 )
2026-06-13 14:55:52 +03:00
nimlgen
8efc8d064f
unique based on opaque in from_buffer ( #16609 )
2026-06-13 14:31:58 +03:00
nimlgen
c43091a464
fix missing cast in cstyle ( #16608 )
...
* fix missing cast in cstyle
* x
* x
2026-06-13 10:04:06 +03:00
qazal
2e77bd01db
fp8 gemm cleanup ( #16607 )
2026-06-13 13:17:32 +09:00
Christopher Milan
bcdb988df0
split comma benchmark, dsp on c4 [PR] ( #16598 )
2026-06-12 23:26:05 -04:00
George Hotz
6b8fdfe4ca
alu addrspace is where the math happens ( #16606 )
...
* alu addrspace
* fix cstyle/llvm
* on ptx, reg+alu are the same thing
2026-06-12 20:01:28 -07:00
wozeparrot
67a4f129c2
llama: fix bf16 gemm oob ( #16603 )
2026-06-12 19:43:05 -07:00
Christopher Milan
8862c7549c
new-style dcache_flush ( #16602 )
2026-06-12 22:25:08 -04:00
chenyu
9e72a6b376
more indexing cleanup [PR] ( #16600 )
2026-06-12 21:33:47 -04:00
chenyu
aa32d309db
fix rangeify indexing for pad/reduce ( #16599 )
2026-06-12 20:26:15 -04:00
George Hotz
96b86aad7b
move new style transform up more ( #16593 )
...
* move new style transform up more
* pm_move_gates_from_index works on new style
2026-06-12 17:20:12 -07:00
chenyu
a35964493e
UPat method cleanups [PR] ( #16596 )
2026-06-12 17:22:54 -04:00
chenyu
3036b15ed9
remove Tensor.ufix [PR] ( #16594 )
...
* remove Tensor.ufix [PR]
* inline _ufix_keep_dtype
2026-06-12 14:40:28 -04:00
qazal
b2e95b2db3
rangeify: no copies for write+read of same slice ( #16585 )
...
* failing test
* cleaner failing tests
* assign and read of same slice shouldn't create copies
* err in the changes
* shrink with no overlapping regions in dest is fine
2026-06-13 02:19:47 +09:00
George Hotz
833cb37574
move up new style transform ( #16592 )
...
* simpler names
* move up new style transform
* fix that rule
2026-06-12 10:13:37 -07:00
George Hotz
51100d2c5c
new style cleanups ( #16584 )
...
* spec tighten
* revert
* lin fix
* lin fix
* needed for x86
* revert
2026-06-12 08:10:38 -07:00
Philip Sinitsin
76c10cd635
jit: don't memplan buffers reachable from live tensors ( #16588 )
...
The memory planner was suballocating BUFFERs created during JIT capture that are still referenced by external lazy tensor graphs, like the .grad tensors assigned by backward(). The replay then only writes the arena slices, so realizing such a tensor after the call reads freshly allocated memory and silently returns zeros. Hold every BUFFER reachable from a live Tensor instead of only the parameters of the return value; true internals are still planned. Fixes #16571 .
2026-06-12 17:51:54 +03:00
nimlgen
2bfdf85f87
hcq2: move pre bufferize ( #16589 )
...
* hcq2: move pre bufferize
* x
2026-06-12 16:11:59 +03:00
nimlgen
fb74f75485
var params sort after global params ( #16590 )
2026-06-12 14:33:15 +03:00
qazal
4d34590b7d
llama: less E kernels ( #16517 )
2026-06-12 19:49:25 +09:00
qazal
12f4cf0e49
rename amd/test_custom_kernel.py to test_asm_kernel ( #16586 )
...
* rename amd/test_custom_kernel.py to test_asm_kernel
* update
2026-06-12 16:11:01 +09:00
wozeparrot
e770805d21
llama: mxfp8 ( #16574 )
2026-06-11 22:15:24 -07:00
George Hotz
b8aec4cce7
port x86 to new_style (fable slop) and now everything is new style ( #16581 )
...
* port x86 to new_style (fable slop)
* don't change ops
* port NIR to new_style (fable)
* lil cleanup
* fix tests, and remove new_style
2026-06-11 21:09:34 -07:00
chenyu
762f50bd52
move gradient.py to mixin/ [PR] ( #16583 )
2026-06-11 23:58:21 -04:00
chenyu
a2cec397f3
UOp cast and bitcast takes DTypeLike [PR] ( #16582 )
...
* UOp cast and bitcast takes DTypeLike [PR]
match Tensor
* fix type
2026-06-11 22:38:54 -04:00
George Hotz
b97e3e01e3
port NIR to new_style (fable) ( #16580 )
...
* port NIR to new_style (fable)
* lil cleanup
2026-06-11 18:47:30 -07:00
Christopher Milan
4d893f626a
move a bunch of test_schedule to null ( #16578 )
2026-06-11 20:26:34 -04:00
George Hotz
b57639a6cc
port python to new_style (fable) ( #16579 )
...
* port python to new_style (fable)
* doesn't have to be const in python
2026-06-11 17:26:05 -07:00
George Hotz
a04d2fa4eb
port ptx to new_style (fable) ( #16577 )
...
* port ptx to new_style (fable)
* simplify
* simpler
2026-06-11 17:05:03 -07:00
George Hotz
587333fddb
replace DEFINE_VAR with PARAM ( #16576 )
...
* replace DEFINE_VAR with PARAM
* cleanups
* cleanups
2026-06-11 15:03:20 -07:00
chenyu
5f1e2d3900
PADTO pads Invalids ( #16562 )
2026-06-11 16:54:26 -04:00
George Hotz
434a8ffc38
move llvm to new style ( #16573 )
...
* move llvm to new style
* fix wmma
* buffer is early
2026-06-11 12:59:02 -07:00
George Hotz
347608a523
put loads back on reg ( #16572 )
...
* put loads back on reg
* fix dsp
2026-06-11 11:24:50 -07:00
nimlgen
e5f498de3b
hcq2: debug=2 info ( #16569 )
...
* hcq2: debug=2 info
* t
* x
* hcq2: debug=2 info
* x
2026-06-11 19:52:01 +03:00
qazal
a83710396c
support mselect input to CALL, less kernels in allreduce ( #16567 )
...
* support mselect input to CALL, less kernels in allreduce
* resolve mstack
2026-06-11 18:10:47 +09:00
qazal
7d4a77dce4
relax comma benchmark timeout ( #16568 )
2026-06-11 18:03:37 +09:00
qazal
21f1101691
add allreduce kernel count test ( #16566 )
2026-06-11 15:54:12 +09:00
wozeparrot
c38d6a7e3a
mxfp8 part 2 ( #16561 )
2026-06-10 23:36:11 -07:00
Christopher Milan
83971860d8
ci: simplify webgpu install ( #16557 )
2026-06-10 22:57:19 -04:00
Christopher Milan
6e1b61f16f
cleanup some amd deps ( #16563 )
...
don't load hsa runtime, remove ib autogen
2026-06-10 19:01:56 -04:00
George Hotz
7e6d617935
addrspace cleanups ( #16565 )
...
* addrspace cleanups
* bumps
* eh, relax a little
2026-06-10 15:57:18 -07:00
nimlgen
2c9d2c0d31
jit: memplan before compile ( #16560 )
2026-06-10 15:05:15 +03:00
qazal
34481830f1
rangeify: fix cost function for AFTER(out, CALL) ( #16559 )
...
* simple failing test
* fix rangeify cost function
* new ops count
2026-06-10 17:30:50 +09:00
chenyu
623b66e0e4
more tensor and mixin cleanups [PR] ( #16558 )
2026-06-10 00:39:33 -04:00
chenyu
7366d32247
getitem cleanups [PR] ( #16556 )
2026-06-09 22:48:58 -04:00
George Hotz
fd76ac992e
cstyle renderer is new style [pr] ( #16484 )
...
* cstyle new style
* switch cstyle renderer to new style
* fix hip
* fixes
* fix webgpu
* correct webgpu is_packed
* fix dsp
* fixes
* fix Ops.RANGE must be CONST
* old style render access
* this is correct
* fix cstyle to good
* dl/dr
* as array
* fix spec
* remove define_local/define_reg
* buffer in shrink
* fix test_tiny
* all tests fix
* param args aren't realized
* wgsl fix
* work
* new gate
* fix opencl qcom
* process replay
* sort order
* fix render index
2026-06-09 18:36:01 -07:00
Christopher Milan
97d483350c
ci: download prebuilt ocelot ( #16554 )
2026-06-09 19:51:33 -04:00
Christopher Milan
f9d88d3c3a
fix race in test_quantize_onnx ( #16555 )
2026-06-09 18:39:48 -04:00
wozeparrot
2bdc360606
gemm: mxfp8 hipkittens gemm ( #16541 )
...
* gemm: mxfp8 hipkittens gemm
* feat: update hipkittens
* feat: kernel signature
* clean: just kernel
* feat: from tinygrad
* feat: test
* fix: add back utils
* clean: no diff
* clean: no diff
2026-06-09 15:20:05 -07:00
chenyu
12addee14f
tensor and mixin cleanups [PR] ( #16553 )
2026-06-09 15:33:13 -04:00
nimlgen
2ab2d51099
hcq2: fix repeated calls ( #16552 )
2026-06-09 19:11:42 +03:00
chenyu
3f053a3370
move functional part of rand to RandMixin ( #16551 )
2026-06-09 09:40:48 -04:00
nimlgen
fa31c744b9
hcq2: cleaner ( #16550 )
2026-06-09 16:33:05 +03:00
qazal
598cc13ad2
more readable null graph profile in VIZ ( #16548 )
...
* more readable null graph profile in VIZ
* change
* fix flaky test
2026-06-09 18:35:05 +09:00
qazal
d18ad49f20
fix flaky test_disktensor ( #16549 )
2026-06-09 18:23:22 +09:00
qazal
fa400f9790
less E kernels in all2all ( #16546 )
2026-06-09 13:51:57 +09:00
qazal
b8931440ae
add all2all schedule test ( #16545 )
2026-06-09 12:41:35 +09:00
wozeparrot
5ef30005fa
update hipkittens ( #16544 )
2026-06-08 18:53:25 -07:00
Christopher Milan
4e2e2e9956
ocelot: use c.DLL ( #16540 )
2026-06-08 21:27:28 -04:00
chenyu
11fee53527
RandMixin [PR] ( #16543 )
2026-06-08 19:11:28 -04:00
chenyu
e2ef5cf5c9
no args and kwargs for _multi_like [PR] ( #16539 )
2026-06-08 17:35:15 -04:00
chenyu
12764161c9
UOp.shard support axis=None [PR] ( #16538 )
...
match Tensor
2026-06-08 11:36:50 -04:00
chenyu
ebc5390c9a
advanced indexing to mixin [PR] ( #16532 )
2026-06-08 09:24:49 -04:00
nimlgen
95d63d6c07
hcq2: lower to ins ( #16535 )
...
* hcq2: lower to ins
* pm4
* f
2026-06-08 16:15:30 +03:00
nimlgen
8baca185d5
hcq2: add kfd ( #16537 )
2026-06-08 13:48:27 +03:00
chenyu
03943cd1a0
use more _uop for cleanup [PR] ( #16531 )
...
`t.uop if isinstance(t, Tensor) else t` -> `t._uop`
2026-06-07 17:41:36 -04:00
chenyu
937aeaec60
remove device= from UPat.const [PR] ( #16530 )
2026-06-07 16:38:43 -04:00
George Hotz
eb1238436a
more prereqs for DL/DR -> BUFFER ( #16529 )
2026-06-07 12:25:11 -07:00
George Hotz
0336ba8eb1
buffer param arg + dsp fixups ( #16528 )
2026-06-07 12:07:00 -07:00
Dmitriy Strunin
75e903d533
remove unused device arg from _get_winograd_matcols ( #16527 )
2026-06-07 08:15:09 -04:00
chenyu
90b556ca48
move gradient to mixin [PR] ( #16526 )
2026-06-07 00:05:02 -04:00
chenyu
4e7c6260b0
clean up test_tesnor_uop_mixin ( #16525 )
...
most of those don't have UNIQUE anymore
2026-06-06 23:25:44 -04:00
George Hotz
2a2f81dd3d
remove ANON from addrspace, refactor marg ( #16523 )
...
* remove ANON from addrspace, refactor marg
* as_shape
* as_shape is cached
2026-06-06 09:49:09 -07:00
qazal
e69b4189b0
viz: hide STACK on PARAM by default ( #16522 )
2026-06-06 16:41:15 +09:00
Christopher Milan
857b1f5399
ci: more parallelism, less duplication ( #16509 )
2026-06-05 21:26:19 -04:00
wozeparrot
a1ec32cfd2
llama: current grad scaling ( #16518 )
2026-06-05 15:39:41 -07:00
Christopher Milan
8c0ba1da5c
cleanup more from test/backend ( #16521 )
2026-06-05 18:38:46 -04:00
chenyu
9982185b14
remove unused AFTER rules in pm_add_buffers[PR] ( #16519 )
2026-06-05 14:58:34 -04:00
nimlgen
5ebd44aa12
hcq2: merge queues ( #16514 )
...
* hcq2: merge queues
* cleaner
2026-06-05 21:20:25 +03:00
chenyu
a51b5ba424
remove early fixup const copy [PR] ( #16516 )
2026-06-05 11:35:34 -04:00
Nueramarcos
8274140134
uop/ops: fix ~bool deprecation warning on Python 3.12+ (ORANGE Grok helped with the patch) ( #16512 )
2026-06-05 10:54:30 -04:00
chenyu
588c759a3d
remove unused GroupOp.Buffer [PR] ( #16515 )
2026-06-05 10:38:52 -04:00
qazal
79a13310b3
viz: kernel_graph.txt unique is per schedule ( #16511 )
2026-06-05 16:17:28 +09:00
Christopher Milan
9b0f75622c
many jit tests belong in unit ( #16508 )
2026-06-04 21:36:53 -04:00
chenyu
bb407d8b3c
fix transform_precompiled_call for MULTI ( #16510 )
...
based on my understanding of https://github.com/tinygrad/tinygrad/pull/16084
2026-06-04 20:09:58 -04:00
wozeparrot
f11f63007d
llama: immediate scaling on flag ( #16494 )
2026-06-04 10:30:00 -07:00
George Hotz
4fb8ce1831
update buffer in spec ( #16507 )
2026-06-04 10:12:31 -07:00
chenyu
4a8bf07a87
remove CONST(DEVICE) ( #16506 )
2026-06-04 11:29:46 -04:00
nimlgen
3838c8df1b
hcq2: move global sync ( #16504 )
2026-06-04 17:32:40 +03:00
chenyu
0faaf6df26
remove kwargs from arange and linspace [PR] ( #16505 )
...
it used to have requires_grad and device, now both are removed
2026-06-04 10:32:37 -04:00
qazal
3b1a5f9770
llama: a_bT and aT_b bf16 gemms ( #16487 )
...
* hk_bf16_gemm
* enable in 8b
* cleanups
* rename to USE_HK_BF16_GEMM
* work
* work
* work
* work
* change the gemms
* work
* work
* set as default
* work
* change
2026-06-04 23:30:21 +09:00
chenyu
5fad87252d
no device= into arange and eye ( #16503 )
2026-06-04 09:21:50 -04:00
nimlgen
11af81f96f
hcq2: cleaner ( #16502 )
2026-06-04 15:26:37 +03:00
chenyu
2c915c61ed
no CONST(DEVICE) in torch_backend ( #16499 )
2026-06-04 00:26:47 -04:00
wozeparrot
fd13080636
deviceless const skip axis check ( #16496 )
2026-06-03 19:13:20 -07:00
qazal
f7f03bd7e5
viz: better name for src id in kernel_graph.txt ( #16495 )
...
* viz: better name for src id in kernel_graph.txt
* better order
* cleanup
2026-06-04 11:09:29 +09:00
Christopher Milan
9dac781e45
ci: use uv ( #16492 )
2026-06-03 21:38:50 -04:00
George Hotz
9fdeaa402b
no anon addrspace, don't write hacks ( #16491 )
...
* no anon addrspace, don't write hacks
* revert that
* no reg there
2026-06-03 16:19:30 -07:00
chenyu
2f83d01ccf
fix deviceless materialize device ( #16493 )
...
symbolic arange currently does not fuse, which creates a deviceless UOp post rangeify that needs a device to bufferize
2026-06-03 19:13:21 -04:00
chenyu
19eb72ff60
remove use of full with buffer=False and non-None device= ( #16489 )
2026-06-03 16:21:24 -04:00
nimlgen
6f2a2857c8
hcq2: refactor deps ( #16490 )
2026-06-03 23:20:24 +03:00
chenyu
243446b44f
remove CONST(DEVICE) from const_like ( #16488 )
2026-06-03 14:04:51 -04:00
George Hotz
cee472a0ef
renderer Estimates uses maxel ( #16485 )
2026-06-03 10:55:00 -07:00
chenyu
8a4203638a
make full with buffer=False deviceless ( #16483 )
...
affects arange and eye
2026-06-03 12:35:59 -04:00
qazal
405866f2b7
viz: improve kernel_graph.py usability ( #16486 )
...
* better default
* always format kernel output
* also show ref
* sched num
2026-06-03 21:12:44 +09:00
Christopher Milan
f43cba5765
ci: native python where possible ( #16473 )
...
linters stay at 3.11
2026-06-02 22:40:12 -04:00
wozeparrot
7dcfd144b6
llama: columnwise fp8 scaling ( #16480 )
2026-06-02 18:55:45 -07:00
George Hotz
ffadd7a315
remove intel and amx support ( #16482 )
2026-06-02 18:53:05 -07:00
George Hotz
5f439e3b7c
refactor cstyle to avoid dtype [PR] ( #16478 )
...
* refactor cstyle to avoid dtype
* clean up rules
* add new style option
2026-06-02 18:27:12 -07:00
Christopher Milan
80eeb4dd21
mockgpu: use autogen.libc ( #16479 )
2026-06-02 19:59:36 -04:00
chenyu
a43b55d480
deviceless const folding schedule test ( #16477 )
2026-06-02 18:46:30 -04:00
George Hotz
14f843737b
renderer cleanups (pt 3) [PR] ( #16475 )
...
* renderer cleanups (pt 3)
* point refactors
* fix bugs
* fix PR
2026-06-02 14:24:24 -07:00
nimlgen
99e37b1ee3
hcq2: deps ( #16459 )
...
* start
* sin
* f
2026-06-02 22:34:25 +03:00
George Hotz
82f1c983d4
clean renderer migrations [pr] ( #16472 )
...
* clean renderer migrations
* minor webgpu
* use PARAM UOp as API
* make linter happy
2026-06-02 11:19:00 -07:00
Christopher Milan
9897658895
ci: fix ocelot compilation on macos ( #16471 )
2026-06-02 12:43:31 -04:00
chenyu
6b7d2b91df
update test_uop_graph ( #16470 )
...
use UOp methods instead of constructing UOp directly, some of it violated spec
2026-06-02 08:53:54 -04:00
qazal
854eac09c6
llama: no E_ copy after bf16 GEMM ( #16458 )
2026-06-02 14:14:13 +09:00
George Hotz
7d8ed8d4d7
add store to buffer's addrspace ( #16468 )
2026-06-01 22:07:43 -07:00
George Hotz
20242fdf1d
update test + spec from shrink_in_render ( #16467 )
...
* update test + spec from shrink_in_render
* cast
2026-06-01 19:24:43 -07:00
Christopher Milan
c6cad1ad67
ci: standardize runs-on ( #16466 )
...
* ci: use macos 26
* ugh github
* stick with github for arm
2026-06-01 21:39:58 -04:00
Christopher Milan
b0ecbb34d9
ci: cleanup python backend tests ( #16465 )
2026-06-01 20:08:05 -04:00
Christopher Milan
2d0f132a3b
ci: cleanup more duplicate tests ( #16462 )
2026-06-01 18:56:29 -04:00
wozeparrot
aab9a5a8a3
llama: allow specifying layer count ( #16464 )
2026-06-01 15:36:04 -07:00
chenyu
0167401fa2
minor hcopt WHERE cleanup [PR] ( #16463 )
2026-06-01 17:58:38 -04:00
George Hotz
124d2f8227
anon addrspace from new renderer ( #16461 )
...
* anon addrspace from new renderer
* use max_numel in python renderer
* add sizes to ptrs in tests
* more
* correct fix
2026-06-01 14:42:02 -07:00
chenyu
517eea5985
no CONST(DEVICE) in create_allreduce_function ( #16460 )
2026-06-01 17:12:34 -04:00
chenyu
7e7b481ba7
less CONST(DEVICE) ( #16452 )
...
* less CONST(DEVICE)
no DEVICE for single device in const_like, multi has other issues
* maybe
* that?
2026-06-01 15:55:12 -04:00
George Hotz
556defa0f7
minor updates from vec removal ( #16456 )
2026-05-31 09:48:51 -07:00
Javier De Jesus
989f713c1b
support negative pads in circular pad mode ( #16448 )
2026-05-31 09:28:45 -07:00
nimlgen
2c2cb339e0
fix word wrap ( #16450 )
2026-05-30 23:21:24 +03:00
qazal
29b47a0057
llama: update local amax implementation after ParamArgs change ( #16446 )
...
* local amax failing test
* update _local_abs_max_fxn
2026-05-30 16:55:43 +09:00
wozeparrot
6795c2d5c9
llama: zero grad this way ( #16445 )
2026-05-29 20:25:21 -07:00
George Hotz
cf55aaf01f
python prg is pkl uops ( #16443 )
...
* python prg is pkl uops
* refactor to use uop
* refactor to u.
2026-05-29 19:13:51 -07:00
Christopher Milan
c377d01491
ci: run dsp on tinygrad[testing] ( #16442 )
2026-05-29 21:16:56 -04:00
wozeparrot
c23652e486
llama: minimize peak init mem ( #16440 )
2026-05-29 18:00:37 -07:00
Christopher Milan
d943493b79
ci: remove duplicate op compile test ( #16441 )
2026-05-29 19:20:31 -04:00
chenyu
8ac62b28e5
fix AffineGrid fusion ( #16439 )
2026-05-29 17:59:47 -04:00
Christopher Milan
ef50a49693
ci: macos dev matrix ( #16436 )
2026-05-29 17:40:32 -04:00
Christopher Milan
434cfa96a3
ci: no fetch in backend tests ( #16438 )
...
should make for less actions cache thrashing
2026-05-29 17:11:16 -04:00
chenyu
b7280705a7
limit CONST(UNIQUE) to invalids only ( #16432 )
2026-05-29 16:02:06 -04:00
George Hotz
9506b78d73
fix viz addrspace ( #16437 )
...
* fix viz addrspace
* revert that
2026-05-29 12:58:05 -07:00
nimlgen
d69aca41a9
hcq2: rework pm_bufferize ( #16431 )
2026-05-29 22:09:52 +03:00
George Hotz
e2a0434403
full derivation of addrspace ( #16433 )
...
* full derivation of addrspace
* w/e, it fixes it
2026-05-29 11:39:31 -07:00
wozeparrot
6787de9f52
llama: fix mp ( #16434 )
2026-05-29 11:21:43 -07:00
chenyu
2d7e5baab4
remove vec= from UPat.cvar [PR] ( #16430 )
2026-05-29 10:52:30 -04:00
chenyu
fa666cefe8
remove dead branch in UOp [PR] ( #16429 )
2026-05-29 10:38:49 -04:00
qazal
81bc00c006
do not require clearing method_cache in viz tests ( #16428 )
...
* update
* update test_dedup
2026-05-29 18:12:34 +09:00
qazal
54cfb794b8
viz: addrspace little colored box ( #16427 )
...
* return addrspace
* layout
* render
* addrspace encodes color
* update colors
* in input_ast all params are green
* update stroke
2026-05-29 17:25:07 +09:00
qazal
814d414f41
viz: set label offset for asm ( #16426 )
2026-05-29 13:16:34 +09:00
wozeparrot
f86966af56
llama: optim amax margin ( #16425 )
2026-05-28 20:18:11 -07:00
Christopher Milan
6e0d5262dc
ci: autocancel outdated pr jobs ( #16424 )
2026-05-28 23:14:35 -04:00
Christopher Milan
69aa2054f6
rename clangjit to clang ( #16423 )
2026-05-28 22:41:58 -04:00
Christopher Milan
a909acb882
move llvmspeed to benchmarks ( #16422 )
2026-05-28 22:26:22 -04:00
George Hotz
1e7f1dcf49
add ParamArgs [pr] ( #16421 )
...
* add ParamArgs
* fix export
* cleanups
* fixes
* simpler
2026-05-28 19:17:17 -07:00
Christopher Milan
7d38edffdb
ci: dev matrix ( #16420 )
...
windows just runs test_tiny
2026-05-28 22:04:04 -04:00
wozeparrot
36c8ff70c1
llama: use old scale for dequant in optim ( #16417 )
2026-05-28 15:21:19 -07:00
George Hotz
c87f3433d1
use namespace runners ( #16387 )
...
Co-authored-by: Christopher Milan <chrismilan@ucla.edu>
2026-05-28 18:05:46 -04:00
George Hotz
c9adde72c1
addrspace property ( #16418 )
...
* addrspace property
* movement addrspace
* regs
2026-05-28 14:39:25 -07:00
Christopher Milan
c8af163d2b
disable process replay by default ( #16419 )
...
enable process replay with [pr] and assert with [PR]
process replay no longer captures on master
2026-05-28 17:36:28 -04:00
nimlgen
b0e49afaf1
hcq2: new multi ( #16413 )
...
* hcq2: new multi
* op
2026-05-28 22:16:10 +03:00
George Hotz
edca5df25a
flip offset and shape in pad and shrink ( #16414 )
...
* flip offset and shape in pad and shrink
* dumb test
2026-05-28 11:58:19 -07:00
chenyu
d72d8ee065
.const() should not ignore dtype ( #16412 )
...
fixed a bug in postrange, also cleaner
2026-05-28 10:49:15 -04:00
Christopher Milan
0ae957bb0a
refactor webgpu ( #16406 )
2026-05-27 23:13:08 -04:00
qazal
202adc644e
viz: make call toggle easier to click on ( #16411 )
...
* call tag is a rect
* details
* colors
* simplify, better comment
2026-05-28 11:53:36 +09:00
George Hotz
5ee6b6b79e
fix slice store to remove the index ( #16410 )
...
* fix slice store to remove the index
* fix spec
2026-05-27 19:17:53 -07:00
qazal
88e88d63d6
viz: click on +- toggles sources ( #16409 )
2026-05-28 09:12:43 +09:00
George Hotz
b21afb4883
marg line cleanup ( #16408 )
...
* marg line cleanup
* bitcast is a mop
2026-05-27 16:41:04 -07:00
wozeparrot
dac3743d75
llama: delayed scaling in optim ( #16407 )
2026-05-27 15:40:03 -07:00
George Hotz
8ee3a37524
shrink/pad use (new_shape, offset) ( #16405 )
...
* shrink uses offset and shape
* pad does too
* fix
2026-05-27 15:13:08 -07:00
Christopher Milan
171401e8df
skip modulo by zero in test_dtype_alu ( #16404 )
2026-05-27 17:09:05 -04:00
qazal
452c7d4230
llama: don't allocate grad_xw13 in bf16 ( #16359 )
2026-05-28 04:33:07 +09:00
nimlgen
0c385e31c6
hcq2 rewrite ( #16375 )
...
* hcq2 rewrite
* fi
* x
* simpler
2026-05-27 22:25:35 +03:00
chenyu
c33b767407
bring back test and torch backend change for unique const ( #16403 )
2026-05-27 15:16:08 -04:00
Christopher Milan
bacabf0866
webgpu: fix enums ( #16402 )
2026-05-27 13:09:50 -04:00
chenyu
6da785562b
test_custom_kernel_precompile_multidevice ( #16401 )
...
add a test to show what invalids need
2026-05-27 11:19:16 -04:00
chenyu
3e80f375ee
skip test_setitem_fancy_on_unrealized_view ( #16400 )
...
crashes in linux llvm ci
2026-05-27 09:50:26 -04:00
chenyu
945ed4f689
revert const unique changes ( #16395 )
2026-05-27 00:06:41 -04:00
Christopher Milan
aacc8addf4
ci: use ubuntu 24.04 ( #16393 )
2026-05-26 23:22:01 -04:00
chenyu
fa14cde05c
test update for arange and eye ( #16394 )
...
these will need explicit clone to make a buffer
2026-05-26 22:48:34 -04:00
wozeparrot
3a7a6da7d5
llama: fakedata uses real vocab size ( #16389 )
2026-05-26 18:58:55 -07:00
George Hotz
156a4438d9
rename BUFFER_VIEW to SLICE ( #16391 )
...
* rename BUFFER_VIEW to SLICE
* fix comments
2026-05-26 18:15:00 -07:00
Christopher Milan
3adf7f5d95
disable flaky cl test ( #16388 )
2026-05-26 19:56:57 -04:00
Christopher Milan
d23659d38b
cleanup some old test skips ( #16384 )
2026-05-26 19:07:22 -04:00
George Hotz
fd963038a0
remove allow_any_len from store ( #16385 )
...
* remove allow_any_len from store
* a few more
* no bv there
* more fixes
* fixes
* oh that
2026-05-26 15:26:53 -07:00
chenyu
0b88827482
remove CONST(UNIQUE) ( #16383 )
2026-05-26 14:45:22 -04:00
chenyu
d861c50dce
remove unique_const ( #16382 )
2026-05-26 13:53:31 -04:00
George Hotz
bac82d4949
fix emu bug in gfx950 ( #16381 )
...
* fix emu bug in gfx950
* fix renderer
2026-05-26 10:32:03 -07:00
chenyu
9b00defc8c
Revert "remove unique_const ( #16372 )" ( #16380 )
...
This reverts commit 09019d6761 .
2026-05-26 12:30:07 -04:00
chenyu
09019d6761
remove unique_const ( #16372 )
...
* remove unique_const
* fix SDWA thing
* that?
2026-05-26 12:18:03 -04:00
George Hotz
7f1b02854e
bufferview offset is units of input dtype ( #16378 )
2026-05-26 08:49:31 -07:00
qazal
846a809af7
viz: add +- toggle for hidden UOps ( #16368 )
...
* first
* remove
* move src toggles to client side
* line
* update viz server tests
* remove those
* logic
* cleanup
* call matches
* fix const arg
* add labels
* keep changes
* the stack on movement ops hiding change
* structure
* rename to expandedNodes
* work
* test intention
2026-05-26 22:31:54 +09:00
nimlgen
032905dec9
hcq2: simpler ( #16361 )
2026-05-26 14:28:48 +03:00
George Hotz
322693dcd3
hotfix: bump Mac pytest timeout to 4 minutes (try 2)
2026-05-25 18:23:21 -07:00
George Hotz
41ee7dab1c
script to generate testsig for DSP ( #16371 )
...
* script to generate testsig for DSP
* cleanups
2026-05-25 17:54:58 -07:00
wozeparrot
76fc39ccc0
gather to single device ( #16354 )
2026-05-25 17:27:08 -07:00
George Hotz
942cb42b97
Revert "hotfix: bump Mac pytest timeout to 4 minutes"
...
This reverts commit 695a0069ed .
2026-05-25 17:25:11 -07:00
Christopher Milan
8ddd1328df
remove getenv(CI) ( #16365 )
...
gone everywhere except test_interop, because torch MPS does not work in actions
2026-05-25 20:23:33 -04:00
George Hotz
695a0069ed
hotfix: bump Mac pytest timeout to 4 minutes
2026-05-25 17:20:19 -07:00
George Hotz
689ab6a49f
move buffer view offset to src ( #16364 )
...
* this work?
* failed
2026-05-25 17:07:55 -07:00
Christopher Milan
d8f86be613
webgpu: shader-f16 support in arch ( #16370 )
2026-05-25 19:20:59 -04:00
qazal
4bcc53eb26
viz: stable node position for +- toggle ( #16367 )
2026-05-26 06:30:47 +09:00
qazal
3506eb08ec
viz: sidebar toggles always recenter ( #16366 )
...
* viz: sidebar toggles always recenters
* python brain
2026-05-26 06:14:32 +09:00
chenyu
cdeb861828
invalids is empty [pr] ( #16353 )
2026-05-25 16:11:38 -04:00
qazal
b73d2d17b9
viz/cli: add --interval ( #16363 )
...
* interval support
* add test_interval
* llama uses interval
2026-05-26 03:35:06 +09:00
C T
2ab90f31b1
use windows-specific alias nvcuda when loading cuda on windows ( #16260 )
...
This also makes it possible to use cuda on windows by specifying 3 env
vars with direct dll paths: NVCUDA_PATH, NVRTC_PATH and NVJITLINK_PATH
without name collision with CUDA_PATH which is used for cuda headers
include path in NVRTCCompiler.
2026-05-25 08:50:50 -07:00
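The env-var override described in the commit above can be sketched roughly as follows. This is only an illustration: `resolve_library` is a hypothetical helper, not tinygrad's loader. The only names taken from the commit message are the variable `NVCUDA_PATH` and the alias `nvcuda`.

```python
import os

def resolve_library(env_var, default_name, environ=None):
    """Return an explicit DLL path from the environment, else a default name.

    Hypothetical helper: the real lookup in tinygrad may differ.
    """
    environ = os.environ if environ is None else environ
    # An empty or unset variable falls back to the default library name.
    return environ.get(env_var) or default_name

# No override set: fall back to the Windows alias named in the commit.
fallback = resolve_library("NVCUDA_PATH", "nvcuda", environ={})

# Override set: the direct DLL path wins (the path itself is made up).
override = resolve_library("NVCUDA_PATH", "nvcuda",
                           environ={"NVCUDA_PATH": r"C:\drivers\nvcuda.dll"})
```

Keeping these variables separate from `CUDA_PATH` is the point the commit makes: that name is already used for the header include path.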
wozeparrot
68d2102fd2
llama: offload master weights ( #16355 )
2026-05-25 08:48:13 -07:00
qazal
eecd4706ff
fix mailbox comment, add types ( #16360 )
2026-05-25 22:24:00 +09:00
nimlgen
64095cf2e2
use get_buf in exec_kernel ( #16356 )
2026-05-25 15:13:40 +03:00
chenyu
5d5e02871f
remove Tensor.from_uop ( #16344 )
...
and no device for const in Tensor init
2026-05-24 18:53:09 -04:00
nimlgen
a891727c9f
hcq2: multi ( #16347 )
...
* hcq2: multi
* cleaner a bit
2026-05-24 19:28:33 +03:00
chenyu
926d125a63
update test_stack ( #16345 )
...
also skip COMPILE_ONLY, it was comparing 0==0
2026-05-23 10:42:35 -04:00
chenyu
149a87dac2
deviceless const cleanups ( #16341 )
2026-05-22 20:11:01 -04:00
Christopher Milan
35461d4d8f
ci: cleanup some deps [pr] ( #16340 )
2026-05-22 19:16:08 -04:00
Christopher Milan
451f38155c
start cleanup of the slowest tests ( #16339 )
2026-05-22 18:39:36 -04:00
nimlgen
26b3b3f6a2
hcq2: move submit lowering to schedule ( #16330 )
...
* hcq: move submit lowering to schedule
* Dx
2026-05-22 23:15:19 +03:00
wozeparrot
2d48fe8b7b
feat: bump version to 0.13.0 ( #16337 )
2026-05-22 13:12:45 -07:00
chenyu
acc519720b
add missing init files, add chat.html to package-data ( #16334 )
2026-05-22 13:53:34 -04:00
googlefan256
eeadf26dad
Fix no module named error ( #16305 )
...
Co-authored-by: chenyu <chenyu@fastmail.com>
2026-05-22 12:51:29 -04:00
nimlgen
90dbb45563
nv: fix boot mem ( #16332 )
...
* nv: fix boot mem
* linter
2026-05-22 19:28:38 +03:00
nimlgen
5d77a94923
am: mec_pipe0_reset on gfx12 only ( #16331 )
2026-05-22 19:02:18 +03:00
qazal
bbfe4f80ec
quantize_fp8 kernels in uops ( #16288 )
...
* add tests
* simple UOp kernel is n^2
* fast kernel matching c++, opts_to_apply=()
* remove cpp
* simple o(n) kernel, two passes
* fuse the loops
* works on DEV=CPU
* multi regression test
* fix multi, this can possibly be its own bugfix
* test cleanups
* minimal diff
* match C in UOps
* Revert "match C in UOps"
This reverts commit 0bef740c30 .
* edit test
* match speed with C try 2
* needs_second_gpu
* cleanup
2026-05-22 20:54:06 +09:00
chenyu
3115952266
more unique const removal prerequisite ( #16328 )
2026-05-21 23:51:40 -04:00
Christopher Milan
c2d06570a5
remove getenv(CI) from core tinygrad ( #16326 )
2026-05-21 22:20:33 -04:00
chenyu
9744d512d9
use more non-buffered const ( #16327 )
2026-05-21 21:37:52 -04:00
Christopher Milan
150a82de1f
start cleaning up dtype tests ( #16324 )
2026-05-21 21:11:49 -04:00
chenyu
31424cda71
Tensor.requires_grad -> is_param ( #16325 )
...
for optimizer
2026-05-21 19:39:57 -04:00
Christopher Milan
518e60534e
only load tinymesa_cpu when LVP is explicitly requested ( #16320 )
2026-05-21 19:03:13 -04:00
chenyu
720a27bed8
remove many requires_grad= args ( #16321 )
...
* remove many requires_grad= args
* doc and example
* not cifar
2026-05-21 18:37:11 -04:00
wozeparrot
0c41317a59
llama: update 405b scripts ( #16309 )
2026-05-21 14:03:34 -07:00
wozeparrot
fb718a5e9d
llama: realize amax ( #16308 )
2026-05-21 14:00:48 -07:00
chenyu
73ea36f4ac
full(buffer=True) ( #16311 )
...
make full a buffer with flag to turn off
2026-05-21 16:34:44 -04:00
George Hotz
6815f28849
dtype.vec shapes ( #16287 )
...
* dtype.vec shapes
* something
* Closer
* more passes
* shape is in spec
* fix reduce
* image dtype shape correct
* lil
* use reshape on image
* need BUFFER there
* remove that test
* fix ptx + x86
* fix nir
* x86 fix maybe
* x86 fixups
* x86 fix
* don't check that for NOOP
2026-05-21 11:56:49 -07:00
wozeparrot
afc5bfa183
llama: remove fused grad accum ( #16301 )
2026-05-21 09:38:40 -07:00
nimlgen
a321700baa
hcq2: multi prereqs ( #16304 )
2026-05-21 17:00:52 +03:00
qazal
e33e058d34
set SPLIT_W13=0 for 8b DP by default ( #16302 )
2026-05-21 22:09:10 +09:00
Christopher Milan
dd279ee25e
print dtype decomp warning in DEBUG=2 ( #16300 )
2026-05-20 22:08:48 -04:00
George Hotz
ec547250ef
don't use dtype vec for image idx ( #16298 )
...
* don't use dtype vec for image idx
* double gate
* y/x confused
* upd
* fix nir
* simplify_valid_image_load
2026-05-20 18:45:13 -07:00
Christopher Milan
172f9493e1
move is_dtype_supported to renderer ( #16226 )
2026-05-20 21:19:37 -04:00
chenyu
d548f8d0f3
use clone instead of unique_const in allreduce [pr] ( #16297 )
2026-05-20 18:58:47 -04:00
qazal
9e88b08f93
x86: don't use id ( #16296 )
...
* x86: don't use id
* diff
* more minimal change
* unique
2026-05-21 07:36:40 +09:00
Christopher Milan
da07b28998
am: override smu 13_0_7 to 13_0_0 ( #16292 )
2026-05-20 18:14:30 -04:00
chenyu
beea4633fc
UOp.clone [pr] ( #16295 )
...
generates the store after structure
2026-05-20 17:47:49 -04:00
qazal
a19fa2908f
fix x86 nondeterminism ( #16293 )
2026-05-21 05:48:05 +09:00
George Hotz
58d58c1659
remove DEVECTORIZE ( #16290 )
...
* remove DEVECTORIZE
* fully remove DEVECTORIZE
2026-05-20 13:25:49 -07:00
wozeparrot
825f30bf18
llama: apply_grad saves memory ( #16275 )
2026-05-20 13:14:06 -07:00
nimlgen
a88feef40f
hcq2: cleanups ( #16278 )
...
* s
* simpler
* simpler
2026-05-20 21:48:50 +03:00
Philipp Braun
a01d5918af
fix: qlinearconv quant params ( #16234 )
...
* fix: qlinearconv quant params
* fix: simplify reshape
---------
Co-authored-by: Philipp Braun <braunphilipp@users.noreply.github.com>
2026-05-20 11:31:41 -07:00
George Hotz
19535df53c
enable broadcasting in _shape ( #16285 )
2026-05-20 11:21:51 -07:00
chenyu
4dbe6a2ee7
remove _force_unique from Tensor init ( #16277 )
2026-05-20 14:13:05 -04:00
Christopher Bradford
fe2d8d1ecf
filter by base_class in pci_scan_bus on macOS ( #16282 )
...
The Linux path of pci_scan_bus reads /sys/bus/pci/devices/.../class and
skips devices whose base class doesn't match. The macOS (IOKit) path
appended every IOPCIDevice unconditionally, so callers that supplied
base_class to narrow down to e.g. display devices would also get the
audio companion function of a multifunction GPU.
Concretely, an NVIDIA RTX Pro 6000 Blackwell exposes:
10de:2bb1 class 0x030000 (display)
10de:22e8 class 0x040300 (multimedia audio)
A PROBE for base_class=3 returned both. With the sorted() at the end of
pci_scan_bus, 22e8 (audio) came first, so the NV runtime picked the
audio function as device 0 and stalled on RESIZE_BAR.
This mirrors the Linux filter on line 70 using the existing read_prop
helper.
Co-authored-by: Christopher Bradford <christopher.bradford@joby.aero>
2026-05-20 20:09:35 +03:00
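The base_class filter described in the pci_scan_bus commit above can be sketched as below. This is a minimal illustration under stated assumptions: `scan` and `base_class_of` are hypothetical names, devices are modeled as plain `(id, class_code)` tuples, and none of it is the actual IOKit code. The device ids and class codes are the ones quoted in the commit message.

```python
def base_class_of(class_code: int) -> int:
    # The top byte of the 24-bit PCI class code is the base class.
    return (class_code >> 16) & 0xFF

def scan(devices, base_class=None):
    # Keep only devices whose base class matches (if one was requested),
    # then sort, mirroring the sorted() the commit message mentions.
    kept = [d for d in devices
            if base_class is None or base_class_of(d[1]) == base_class]
    return sorted(kept)

devices = [
    ("10de:2bb1", 0x030000),  # display function
    ("10de:22e8", 0x040300),  # multimedia audio function
]

unfiltered = scan(devices)             # audio sorts first: "22e8" < "2bb1"
filtered = scan(devices, base_class=3) # only the display function remains
```

Without the filter, the audio function sorts ahead of the display function, which matches the failure the commit describes.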
qazal
1e0fffe256
fused ce llama kernel in UOps ( #16263 )
...
* work
* using uops
* delete things
* work
* work
* higher level uops
* cleanups
2026-05-20 19:45:28 +09:00
chenyu
e1715b3b92
extend jit const error to deviceless inputs ( #16276 )
2026-05-20 02:02:45 -04:00
chenyu
170b857da9
clean up deviceless const _buffer ( #16274 )
...
process on CPU similar to multi
2026-05-19 22:47:45 -04:00
chenyu
7af7b6703a
relax policy ASSERT_MIN_STEP_TIME to 3.2 ( #16273 )
2026-05-19 22:29:09 -04:00
chenyu
188d7ec15e
clone can take device ( #16271 )
...
useful to materialize const on a specific device
2026-05-19 21:29:27 -04:00
wozeparrot
361553c0a8
llama: match flat_llama with model_train ( #16269 )
2026-05-19 17:25:56 -07:00
George Hotz
da7414d6dc
fix RUN_PICKLE and test it ( #16272 )
...
* add test for openpilot RUN_PICKLE
* fix RUN_PICKLE and test it
2026-05-19 17:00:25 -07:00
George Hotz
55515747b7
Remove Ops.VCONST ( #16267 )
...
* start removing vconst
* remove a lot of vconst
* const folding + strict ordering
* update tests
* spec from minigen
* move that
2026-05-19 16:35:24 -07:00
Christopher Milan
7cdd9cbdeb
PYTHONREMU: V_CVT_PK_BF8_F32 saturation ( #16268 )
2026-05-19 19:29:59 -04:00
Christopher Milan
bb2a51f1ea
fix mypy mockgpu and add tinygrad.renderer.isa to packages ( #16265 )
2026-05-19 16:45:03 -04:00
chenyu
890b731b1e
more prerequisite test changed for deviceless const ( #16264 )
2026-05-19 15:43:45 -04:00
ttomsa
aa1e59ab97
X86 with Ops.INS ( #14873 )
...
* draft
* cleanup test_encodings
* cleanup test_isel
* model flag state and support rematerialization
* woops
* add vbroadcastss instruction
* don't fuse load if used multiple times in src
* add movabs instruction and fix idiv
* fixes
* add x86 backend to tests
* float16 fix
* rm TwoAddress2nd
* add BARRIER
* test windows ci
* yup isel fixes the mask stuff too and its beautiful
* add cmoves to the spec
* support storing imms
* no TUPLE_ORDER, breaks tests
* fix remaining seg faults
* add float max
* always fuse index
* minor
* fix DEFINE_VAR/SPECIAL and enable multithreading
* linter
* more linter
* more
* more
* more
* let's try this
* perhaps
* start new scheduler
* more scheduling info
* cleaner shuffle functions
* fixup isel tests
* skip bounds check when NOOPs exist
* skip inf rewrite tests
* fix const tag hack and add x86ops to _shape
* fix
* skip a few tests
* func arg order independent from op value
* x86 goes in own linearize
* switch to PARAM
* more
* add min x86op and neg in decomps
* do mulacc in isel
* use def_reg in test_encodings
* enable emulated int64 tests
* how much does this fix
* Ops becomes OpType
* fix
* rm noqa
* rm machine scheduler stuff
* and this
* allow for extending enums and move X86Ops out of uop
* fix imports
* rm X86GroupOp from ops.py
* spacing
* tell mypy to shut up
* more linter
* add x86op test
* allow set[X86Ops] in upat
* move NOOPs to pre_isel_matcher and rm NOOP from spec
* more asserts
* also this
* cleanup encode
* simplify live range
* fix idiv
* add Ops.INS to x86
* more changes
* more changes
* more changes
* fix
* fix
* fix
* fix
* print formatted assembly
* fix 8bit idiv?
* oops
* enable float16 and unaligned vector load/store
* actually no
* move x86 tests
* no more bool cast
* fix
* linter
* linter
* move X86Ops to x86.py
* fix vpbroadcast
* cleanups
* linter
* print correct reg names
* canonical max
* move max/min and add test
* support float16 vector load/store
* rm bad rewrite
* vpsrldq can't access memory
* regalloc takes renderer
* enable vector load/store on all dtypes
* more isel tests
* rm this for now
* a lot better
* fix
* fix
* fix
* deal with flags correctly
* fix
* enable gep noop rule
* fix
* fix
* fix
* add callee saved registers
* use Ops.CONST instead of X86Ops.IMM
* fix
* enable TUPLE_ORDER
* fix
* rm x86 code in linearizer
* fix
* fix
* fix
* move isa rewrites to codegen
* fix
* fix
* skip test_linearizer.py
* skip more tests
* fix
* fix for idiv/mod changes
* fix
* don't use fmadd if it duplicates fused op
* hacky
* fix
* cleanups
* cleanups
* fix
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2026-05-19 12:42:54 -07:00
George Hotz
b2e8102209
25000 lines for x86 backend
2026-05-19 11:27:41 -07:00
Sachith Shetty
74567c1958
fix: pass input device to ONNX helper internal tensors ( #16242 )
...
* fix: pass input device to onnx methods internal tensors
* test: onnx helper internal tensors use input device
2026-05-19 11:16:33 -07:00
Christopher Milan
a178301dbe
PYTHONREMU: fix CDNA VOP3 conditional writes ( #16258 )
2026-05-19 13:31:31 -04:00
nimlgen
b3dcf8f452
hcq2: split into schedule/realize ( #16216 )
...
* hcq2: split into schedule/realize
* missing
* x
* f
* clean
* cleaner
* x
* x
* x
* x
* x
2026-05-19 16:40:17 +03:00
qazal
e4350e7de9
set hipcc mac docker to 7.1 ( #16261 )
...
* set hipcc mac docker to 7.1
* pull from amd
2026-05-19 21:30:39 +09:00
George Hotz
a120709671
tighten shape spec for broadcasting ( #16206 )
...
* tighten shape spec for broadcasting
* use IndexError, not ValueError
* needs size
2026-05-18 22:12:04 -07:00
George Hotz
3f2d401464
all tests pass with NOOPT=1 ( #16257 )
...
* all tests pass with NOOPT=1
* fix a few more
* noopt 100% pass
* noopt 100% pass
2026-05-18 20:39:51 -07:00
chenyu
e694d7f222
more deviceless const prerequisites [pr] ( #16256 )
...
* more deviceless const prerequisites [pr]
* remove that
* arange.contiguous -> arange.clone in tests
arange will become deviceless const soon, update tests where it needs to be a buffer
2026-05-18 23:14:12 -04:00
chenyu
c1076ed56c
Tensor.device and UOp.device can be None ( #16255 )
2026-05-18 22:08:10 -04:00
wozeparrot
a3d59faef6
llama: don't save weight ( #16252 )
2026-05-18 17:05:45 -07:00
qazal
18b102f355
llama: also use 7.1 comgr, update startup_walltime.sh ( #16253 )
2026-05-19 08:59:02 +09:00
chenyu
d532b4f533
multi alu with deviceless const ( #16251 )
2026-05-18 19:31:53 -04:00
qazal
98b8a2b407
llama: use hipcc 7.1 version ( #16250 )
2026-05-19 08:09:57 +09:00
Christopher Milan
7515824a6d
ci: actually use clang-20, enable bfloat16 ( #16249 )
2026-05-18 19:06:43 -04:00
chenyu
754344087a
assign for deviceless const source ( #16248 )
2026-05-18 17:39:53 -04:00
chenyu
73e6b4963b
to and shard is noop for deviceless uop ( #16247 )
2026-05-18 16:11:10 -04:00
Christopher Milan
50481ec9b4
cl: check for cl_khr_fp64 ( #16246 )
2026-05-18 14:42:43 -04:00
chenyu
db639ebe3e
deviceless const from UOp ( #16243 )
2026-05-18 14:14:12 -04:00
qazal
bfb2d1f89a
Revert "fp8 gemm speedup ( #16236 )" ( #16245 )
...
This reverts commit d95bf394e1 .
2026-05-19 02:01:44 +09:00
chenyu
5ae4dbd599
make slow tests faster ( #16244 )
2026-05-18 11:42:02 -04:00
chenyu
981c12182f
remove requires_grad= in tinygrad/ ( #16241 )
2026-05-17 16:55:37 -04:00
chenyu
fcdd1af880
remove Tensor.detach override [pr] ( #16239 )
2026-05-16 23:58:12 -04:00
chenyu
dcee90aa3f
remove requires_grad use in extra/examples ( #16238 )
...
except the ones fed into optimizer
2026-05-16 18:40:26 -04:00
chenyu
8631b6f17d
remove use of requires_grad in test/ ( #16237 )
2026-05-16 17:21:07 -04:00
qazal
d95bf394e1
fp8 gemm speedup ( #16236 )
...
* add asm_gemm option
* milestone
* work
* edit
* only the fast kernel
* diff
2026-05-17 04:58:28 +09:00
chenyu
0ddc50d050
do not gate backward on requires_grad ( #16230 )
...
DETACH is filtered in _deepwalk. instead of None, it gets 0 grad now
2026-05-16 12:29:49 -04:00
nimlgen
bef5f717bc
fix nolocals and beam ( #16232 )
2026-05-16 18:09:19 +03:00
qazal
ebcb7b7cc0
fp8 gemm tests with scale args ( #16231 )
...
* update atol
* update fp8 path
* more work
* update profile.sh
2026-05-16 20:47:58 +09:00
nimlgen
e575f778f9
move debug prints ( #16218 )
...
* move debug prints
* x
2026-05-16 13:57:34 +03:00
wozeparrot
2d48d7ab09
remove more invalid ( #16227 )
2026-05-16 02:52:27 -07:00
wozeparrot
159694347e
llama: fix running flat_llama ( #16224 )
2026-05-15 20:16:48 -07:00
Christopher Milan
79c0ae5b89
metal: arch is GPU family ( #16223 )
2026-05-15 21:22:48 -04:00
Christopher Milan
2c61f65211
cl: device extensions in arch ( #16220 )
2026-05-15 18:59:20 -04:00
George Hotz
2549b14ec2
fix caformer onnx run ( #16222 )
2026-05-15 15:08:36 -07:00
George Hotz
2570bded8b
update spec for LOAD ( #16221 )
...
* add load to the spec
* can
2026-05-15 14:46:00 -07:00
chenyu
d62c1d83c0
remove Tensor.eye override ( #16219 )
...
* remove Tensor.eye override
was only needed for requires_grad arg
* README
2026-05-15 15:40:34 -04:00
chenyu
07a172dbbb
remove noop requires_grad_ calls ( #16213 )
2026-05-15 13:31:10 -04:00
chenyu
c6cf9e8f0c
remove test_svd_nonfull_5_5 ( #16217 )
...
flaky, kinda overlap with test_svd_general
2026-05-15 13:10:02 -04:00
qazal
d54fa86b71
viz/cli: select all calls in graph by default ( #16214 )
2026-05-15 21:01:44 +09:00
nimlgen
28b98e529d
nv: move structs to vram ( #16184 )
...
* nv: vram
* x
* 4090
* x
* move and sysmem on macos
* x
* remove hp
2026-05-15 13:41:42 +03:00
chenyu
409bb0c9ad
requires_grad cannot be None ( #16212 )
...
final goal is to remove requires_grad, first change the default to True, and don't allow None
2026-05-15 02:01:04 -04:00
Christopher Milan
c7870f11ff
mesa: suggest curl install tip ( #16211 )
2026-05-15 00:29:06 -04:00
chenyu
a612b88abb
better assert when setitem a refed tensor ( #16210 )
...
also decouple from requires_grad
2026-05-14 23:40:29 -04:00
chenyu
a75c14f010
some setitem tests ( #16209 )
2026-05-14 22:36:25 -04:00
Christopher Milan
891a1ae7c2
onnx: remove dtype_fallback ( #15717 )
2026-05-14 22:06:57 -04:00
wozeparrot
b4d267dfd4
llama: only save when small ( #16208 )
2026-05-14 17:46:29 -07:00
chenyu
ffa1aac7b1
gradient for STORE/AFTER ala clone ( #16205 )
2026-05-14 20:17:27 -04:00
chenyu
09096ea565
test_gradient_through_clone ( #16203 )
...
backward through clone crashes now
2026-05-14 19:26:47 -04:00
George Hotz
d4dcd8487b
aggressive shape check to prepare for broadcasting ( #16202 )
...
* add implicit broadcasting to shape
* NOOP/ALLREDUCE fixes
2026-05-14 16:15:44 -07:00
George Hotz
83ec66da34
fix a fastdiv edge case ( #16199 )
2026-05-14 13:12:18 -07:00
nimlgen
62ea73719d
hcq2: share more with graph ( #16196 )
...
* share more with graph
* comment
2026-05-14 22:28:11 +03:00
George Hotz
3b8cc31759
disable fast idiv by default, it's broken ( #16197 )
...
* disable fast idiv by default, it's broken
* fix fast idiv tests
2026-05-14 11:48:27 -07:00
Christopher Milan
8f811649ff
better compiler_cpu invalid arch errors ( #16194 )
2026-05-14 14:36:14 -04:00
qazal
f03a7fd6d1
viz/cli: readable uop json ( #16195 )
...
* viz/cli: readable uop json repr
* work
* better
2026-05-14 21:33:10 +09:00
C T
1b779a9058
add gelu approximate="none" (match pytorch) ( #16162 )
...
* add gelu approximate="none" (match pytorch)
* lint
* pass through onnx Gelu approximate
* type annotate
* explicit math.sqrt
* keep tinygrad's gelu approximate="tanh" default
2026-05-13 18:53:24 -07:00
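For reference, the two GELU variants named in the commit above can be written with the standard textbook formulas. This is a scalar sketch using only `math`, not tinygrad's implementation; the function names `gelu_exact` and `gelu_tanh` are chosen here for illustration.

```python
import math

def gelu_exact(x: float) -> float:
    # approximate="none": x * Phi(x), with Phi the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x: float) -> float:
    # approximate="tanh": the usual tanh-based approximation.
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

# The two variants agree closely but not exactly.
gap_at_one = abs(gelu_exact(1.0) - gelu_tanh(1.0))
```

The small gap between the two is why a choice of default matters when comparing results against another framework.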
chenyu
dd9187d9ee
minor hash cleanups ( #16190 )
...
same kernels
2026-05-13 20:59:24 -04:00
wozeparrot
88ac2ac1fd
llama: cleanups ( #16189 )
2026-05-13 17:08:06 -07:00
Christopher Milan
9a365d9978
ci: fix null image tests ( #16188 )
2026-05-13 18:00:05 -04:00
nimlgen
ad1fb7c981
hcq2: graph ( #16186 )
...
* keep this for now
* early graph
2026-05-13 22:49:43 +03:00
chenyu
3f9f6a51b2
minor image_conv2d cleanup ( #16187 )
...
remove some no-op slices
2026-05-13 15:47:40 -04:00
b1tg
59c34b9fe0
llm: precise device ( #16159 )
...
* llm: precise device
* llm: pass device to precompute_freqs_cis
2026-05-12 21:16:42 -07:00
b1tg
3c806ff406
clean up gguf ( #16160 )
2026-05-12 21:16:10 -07:00
wozeparrot
e97f2c1114
llama: only gemm + fa custom kernel ( #16180 )
...
* llama: tie store to grad directly
* llama: set mp flags
* llama: non fused grad fp8 quantize path
2026-05-12 21:03:49 -07:00
chenyu
38d407fd58
simplify svd more ( #16181 )
...
all the slowness is scheduling
2026-05-12 23:48:22 -04:00
Christopher Milan
f1fdd2ccec
ci: add IMAGE=1 compile-only tests ( #16182 )
...
* ci: add IMAGE=1 compile-only tests
* fix
2026-05-12 23:40:32 -04:00
George Hotz
faf7fb7513
update nir renderer for new image style ( #16179 )
...
* update nir renderer for new image style
* don't cast image indexes
2026-05-12 20:25:01 -07:00
Christopher Milan
7d0c5ab689
ci: ocelot needs nvcc on linux ( #16178 )
...
* ci: ocelot needs nvcc on linux
* cudart
2026-05-12 23:13:48 -04:00
chenyu
32138c2418
svd to mixin ( #16175 )
2026-05-12 22:29:01 -04:00
George Hotz
69e1f3b551
remove vec2 from image in gater ( #16165 )
...
* remove vec2 from image in gater
* only simple idx
* fix python with new image style
* fix vconst
* just vconst and stack
* cast to int there
* fix for const
* fix process replay
2026-05-12 19:25:52 -07:00
chenyu
2172363be5
don't use Tensor indexing in svd ( #16174 )
...
prepare mixin, also about 4X faster for 8x8 input
2026-05-12 21:56:19 -04:00
chenyu
420a08c6d1
qr to mixin ( #16173 )
2026-05-12 21:23:25 -04:00
chenyu
c6a82fe927
functional qr and svd ( #16172 )
...
no clone and setitem, will move to mixin next. slightly faster but still quite slow
2026-05-12 19:12:08 -04:00
Christopher Milan
3844a31f87
ci: untangle cuda/ocelot, less apt ( #16171 )
...
* ci: untangle cuda/ocelot, less apt
* ldconfig
2026-05-12 18:14:03 -04:00
Christopher Milan
316607f004
dsp: don't use docker in ci ( #16167 )
...
* dsp: don't use docker in ci
* add setup script for macos docker
2026-05-12 17:11:03 -04:00
chenyu
bdcdf1f1a1
jittable masked_select and nonzero ( #16170 )
...
* jittable masked_select and nonzero
make jittable with `size=`, matches jax
* COMPILE_ONLY
2026-05-12 16:39:36 -04:00
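The idea behind a `size=` argument, as in the commit above, is that the output length no longer depends on the data, so the shape is known ahead of time. The sketch below assumes JAX-style semantics (truncate when there are too many hits, pad with a fill value when there are too few). `nonzero_fixed` is a hypothetical pure-Python helper, and whether tinygrad pads in exactly this way is an assumption.

```python
def nonzero_fixed(values, size, fill_value=0):
    """Indices of nonzero entries, forced to exactly `size` elements."""
    idx = [i for i, v in enumerate(values) if v != 0]
    idx = idx[:size]                                 # too many hits: truncate
    return idx + [fill_value] * (size - len(idx))    # too few hits: pad

padded = nonzero_fixed([0, 3, 0, 5], size=3)     # two hits, one pad slot
truncated = nonzero_fixed([1, 1, 1], size=2)     # three hits, keep first two
```

Because the result always has `size` entries, the caller has to track separately how many of them are real.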
wozeparrot
a613bcfc6d
allow after on contiguous in spec ( #16169 )
...
* feat: allow after on contiguous
* feat: add test
2026-05-12 13:11:44 -07:00
chenyu
7c3e3fa154
fix empty input for masked_select and nonzero ( #16168 )
2026-05-12 15:36:51 -04:00
chenyu
da3b7e89a4
atol in test_custom_kernel_multi_output_backward_interacting ( #16166 )
2026-05-12 14:42:12 -04:00
chenyu
25583f6dc1
fix cumsum dtype for 0d input ( #16164 )
2026-05-12 14:18:08 -04:00
George Hotz
64c81dfd24
add all codegen stages to spec_tensor ( #16163 )
2026-05-12 10:35:38 -07:00
chenyu
f3e3c3851f
explicit args to Tensor.rand ( #16161 )
...
added requires_grad, other kwargs were silently dropped
2026-05-12 12:53:39 -04:00
nimlgen
e93fb5f9b9
hcq2: remove hcqprogram ( #16157 )
...
* hcq2 rm program
* nonbeauty
* no prog
* tiny
* f
* x
2026-05-12 18:49:13 +03:00
nimlgen
a708542308
fix ci spec ( #16156 )
2026-05-12 17:57:11 +03:00
nimlgen
e5729935c6
time_call ( #16152 )
...
* time_call
* x
* fix caches
2026-05-12 16:58:28 +03:00
qazal
fe39cf148a
add Ops.SOURCE test ( #16155 )
...
* simple failing test
* raises
* change
2026-05-12 22:49:32 +09:00
qazal
5cd0494b14
viz: canonicalize ast for schedule to codegen linking ( #16154 )
...
* simple failing test
* always null device
* viz: canonicalize ast for schedule to codegen linking
* SCACHE
2026-05-12 22:40:21 +09:00
qazal
c1d125ff3b
llm: add markers to --benchmark ( #16153 )
...
* markers in llm
* ui fix
2026-05-12 20:14:11 +09:00
wozeparrot
e9359d9e7d
more llama mp fixes ( #16151 )
...
* llama: SPLIT_W13
* llama: fix with no fused kernels
* llama: cast to bf16 on non asm_gemm path
* llama: new mp flags
2026-05-11 21:29:23 -07:00
chenyu
09fd80fba6
fix randperm and _multi_like drop requires_grad ( #16150 )
2026-05-11 23:23:34 -04:00
George Hotz
8294d105a7
Update the spec in spec.py to match the current state ( #16132 )
...
* start work on specv2
* more spec
* more spec
* fix amd emulator
* more spec
* more
* fix test_uop_graph
* move those
* spec=2
* skip those questionable tests
* ptx fix
* more spec=2
* store
* allow custom function in tensor
* spec 2
* fix beam search for tensor cores
* delete the old specs
* fix import
2026-05-11 20:07:47 -07:00
chenyu
3942a80f66
fix wrong kwargs passed into rands ( #16149 )
...
working towards explicit args for these
2026-05-11 22:22:06 -04:00
Christopher Milan
039d84ff02
Revert "onnx: deduplicate simple proto parsers" ( #16148 )
...
This reverts commit 83eaefcd0f .
2026-05-11 21:45:17 -04:00
Christopher Milan
20f587d5d5
nv: rm _download ( #16147 )
2026-05-11 19:56:37 -04:00
chenyu
371ab2023f
clean up image_dot and image_conv2d ( #16145 )
2026-05-11 19:37:58 -04:00
Vikram Rangarajan
effa263865
Torch backend aten::cat.out fix ( #16121 )
...
* Handle empty 1D tensors in cat_out
* Undid other changes
* Fixed torch cat
* Improved cat.out, added more tests
* Cleaned code
* Type hinted dim
* Removed whitespace
2026-05-11 16:28:16 -07:00
chenyu
63c1f00b80
disable test_svd_general again ( #16146 )
...
flaky on CI
2026-05-11 19:24:32 -04:00
Christopher Milan
2dccd4a3eb
am: autogen pmc ( #16143 )
...
* am: autogen pmc
* cleanup
* fix
* type
2026-05-11 19:22:12 -04:00
Christopher Milan
7ba55ad3ba
nv: autogen regs ( #16139 )
...
* nv: autogen regs
* flcn cot
* ci
* gen
2026-05-11 18:52:24 -04:00
chenyu
0b02fb6797
Revert "[pr] match torch rmsnorm ( #16122 )" ( #16144 )
...
This reverts commit 692257dd70 .
2026-05-11 17:53:42 -04:00
chenyu
fbe8be0b8b
style cleanup to Tensor.qr and svd ( #16142 )
...
* style cleanup to Tensor.qr and svd
same kernels
* more
* enable
2026-05-11 17:16:59 -04:00
qazal
fc2cc1d77a
viz: call graph renderer example ( #16141 )
...
* work
* emits
* this
* cleaner repr for custom binaries
* --call-graph
* _ref
* this
* start
* this
* everything except the pyrender
* bring pyrender back
2026-05-12 05:07:30 +09:00
chenyu
f65e343fb3
spec.py cleanups ( #16140 )
...
removed END from shared_spec and NOOP from full_spec
2026-05-11 15:59:49 -04:00
Joshua James Venter
692257dd70
[pr] match torch rmsnorm ( #16122 )
...
* [pr] match rmsnorm torch
Signed-off-by: Joshua James Venter <venter.joshua@gmail.com>
* 1e-5
* ops.md
---------
Signed-off-by: Joshua James Venter <venter.joshua@gmail.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
2026-05-11 14:36:41 -04:00
Sachith Shetty
59a81559d4
fix: add self.device to qr, svd, masked_select intermediates ( #16131 )
2026-05-11 11:22:54 -04:00
nimlgen
70c2480e71
hcq2 to extra ( #16126 )
...
* hcq2 in extra
* correct
* some revert from non-extra
* cln
* cpu
* x
* attach
* min
* remove attach
* linter
2026-05-11 17:17:30 +03:00
nimlgen
ad9738892c
get_buf() for Buffer ( #16134 )
...
* p
* mypy
* x
2026-05-11 16:36:14 +03:00
qazal
2dd84416bf
viz/cli: schedule renderer ( #16101 )
...
* simpler steps
* work
* work
* iterate
* faster
* better
* simplify more
* sys stdin
* less
* work
* work and mv
* better
* seen bufs
* all call graphs
* print query
* ux
* param to buffer / buffer_view
* work
* respect NO_COLOR in uop_to_json
* less
* render uops
* rm custom renderer
* call can't pyrender.
* unrelated diff
* assert
* 5
2026-05-11 01:56:16 +09:00
George Hotz
53f9587099
add canary
2026-05-10 09:38:18 -07:00
George Hotz
28cb7f1bcc
update readme with contributing guidelines
2026-05-10 09:35:48 -07:00
George Hotz
daed602569
rename BUFFERIZE to STAGE ( #16125 )
2026-05-10 09:26:46 -07:00
qazal
39ce780907
viz/cli: emit all runs of selected kernel, json fixes ( #16124 )
...
* keep print
* --json in tests, sqtt --json err
* work
* import
* less
* line
2026-05-10 21:45:51 +09:00
qazal
51c7dafb0d
split viz cli test helpers ( #16123 )
2026-05-10 19:42:24 +09:00
chenyu
b2a682ec60
remove _shape check in pm_mops [pr] ( #16120 )
...
seems fine now
2026-05-09 17:54:22 -04:00
wozeparrot
026688f03f
llama: move to correct dir ( #16118 )
2026-05-08 19:42:16 -07:00
Christopher Milan
a7512e0d12
PYTHON: images have no alignment constraints (by default) ( #16115 )
2026-05-08 20:35:03 -04:00
Christopher Milan
105b037c3c
cl: image alignment in arch ( #16106 )
2026-05-08 19:33:33 -04:00
Charlie Kerfoot
71a8c0da09
fix: trailing space format string ( #16005 )
2026-05-08 16:31:10 -07:00
Pawan
4dd6ad3514
gradient: add TRUNC backward ( #15925 )
...
* gradient: add TRUNC backward
* test: move round quantization gradient to test_ops
2026-05-08 16:27:55 -07:00
chenyu
5152ff95e7
_pad_constant and avg_pool2d cleanups ( #16110 )
2026-05-08 18:09:47 -04:00
chenyu
e6584532f4
minor elementwise cleanups ( #16102 )
2026-05-08 13:38:34 -04:00
nimlgen
49b55af619
jit: simpler free_intermediates ( #16099 )
2026-05-08 19:08:33 +03:00
chenyu
0f46c08582
div mixin cleanups ( #16100 )
2026-05-08 12:05:37 -04:00
chenyu
235044c9d8
Ops.IDIV -> Ops.CDIV, Ops.MOD -> Ops.CMOD ( #16093 )
...
* Ops.IDIV -> Ops.CDIV, Ops.MOD -> Ops.CMOD
* ruff
2026-05-07 23:18:15 -04:00
Christopher Milan
faabe6aa42
nv: remaining firmware from /lib/firmware ( #16088 )
2026-05-07 23:07:43 -04:00
b1tg
7ef901a81d
llm: moe speedup ( #16059 )
2026-05-07 19:06:35 -07:00
George Hotz
80da8a4b9c
add spec to main tinygrad repo ( #16092 )
2026-05-07 18:52:49 -07:00
June
83eaefcd0f
onnx: deduplicate simple proto parsers ( #16085 )
...
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2026-05-07 18:44:27 -07:00
George Hotz
c106c73e51
remove the gate from index ( #16081 )
...
* remove the gate from index
* gpt says this works
* remove hanging casts
* simplify
* move that down
* move gates
* ptr
* remove that simplify
* move that
2026-05-07 18:42:00 -07:00
wozeparrot
d11f4d0ec2
fix: don't copy on slice of DP weight ( #16089 )
2026-05-07 17:58:01 -07:00
George Hotz
1d1b726cf6
hotfix: disable flaky framework pytest
2026-05-07 17:05:06 -07:00
Christopher Milan
9a6f7f7576
nv: look for fmc firmware in /lib/firmware ( #16080 )
2026-05-07 18:08:27 -04:00
George Hotz
b796bbae87
fix valid in indexing tests ( #16087 )
2026-05-07 14:11:28 -07:00
wozeparrot
4d1a9dca41
fix: don't copy precompiled custom kernel outputs ( #16084 )
2026-05-07 14:02:38 -07:00
qazal
f9083cf901
use subactions for benchmark.yml process replay [pr] ( #13396 )
2026-05-08 03:46:25 +09:00
nimlgen
2f0aa884d5
tinygpu: minimal is macos13 for resets ( #16075 )
2026-05-07 21:25:56 +03:00
chenyu
072db9924c
div to mixin ( #16078 )
...
also deleted idiv method
2026-05-07 12:52:37 -04:00
chenyu
516b00e286
mod and fmod to mixin ( #16077 )
2026-05-07 12:13:39 -04:00
qazal
a9a87ad8fd
viz/cli: less flags ( #16076 )
...
* viz/cli: merge -s and -i flags
* only -t
* merge parser
* fix
2026-05-08 00:22:40 +09:00
qazal
f813a04b3f
viz: pickle path in str ( #16073 )
2026-05-07 18:49:21 +09:00
wozeparrot
730fa66bf3
llama speed 6 ( #16071 )
2026-05-06 20:51:03 -07:00
Christopher Milan
7b91f7c90c
nv: look for gsp firmware in /lib/firmware ( #16068 )
2026-05-06 21:35:47 -04:00
George Hotz
8e84317743
the renderer part of gate moving from index to load/store ( #16064 )
...
* the renderer part of gate moving from index to load/store
* fixed
* fix gated stores
* fix spec
* better?
* Where after gated load becomes alt value
* cleaner expression
* fix python backend
* remove dead code
2026-05-06 13:47:04 -07:00
chenyu
ef085304bc
stronger divmod_recombine ( #16066 )
2026-05-06 15:41:54 -04:00
qazal
d7d32d82ee
viz/cli: print first uop with DEBUG=6 ( #16065 )
...
* viz/cli: print first uop with DEBUG=6
* rename fmt to emit
* define inst
2026-05-07 03:39:34 +09:00
chenyu
af4140f3be
fix divmod recombine for floordiv ( #16062 )
2026-05-06 14:22:42 -04:00
chenyu
c6ad3d3ac2
better divmod late rewrite ( #16061 )
...
better order
2026-05-06 11:31:48 -04:00
chenyu
aaabe42373
relax fold_divmod_general ( #16058 )
2026-05-05 21:37:56 -04:00
Christopher Milan
1de14cf33a
am: autogen soc ( #16055 )
2026-05-05 20:39:43 -04:00
chenyu
869eae6b37
fix double div rewrites ( #16054 )
2026-05-05 19:34:35 -04:00
Christopher Milan
bd06ea9f97
am: simplify import_module ( #16046 )
2026-05-05 19:25:53 -04:00
qazal
795501e1da
fix device in null graph events ( #16053 )
...
* failing test
* fix compute
* fix sdma
2026-05-06 07:44:08 +09:00
wozeparrot
ab6218bc92
llama mp fixes ( #16050 )
2026-05-05 15:35:32 -07:00
chenyu
34fe37d64e
use FLOORDIV and FLOORMOD ( #16048 )
...
* use FLOORDIV and FLOORMOD
also removed CORRECT_DIVMOD_FOLDING
* fix
* Revert "fix"
This reverts commit 86af33b88ef31943c61e67189b072eca4896409a.
* fix
* fix
2026-05-05 18:32:54 -04:00
Christopher Milan
76ff378007
autogen: fewer apt dependencies ( #16049 )
2026-05-05 17:22:41 -04:00
nimlgen
5fa0016ffc
supports_exec_item -> supports_uop ( #16033 )
2026-05-05 22:41:13 +03:00
qazal
cee17e0d2f
viz: fix diff color ( #16045 )
2026-05-06 03:40:53 +09:00
chenyu
9c37a0c75d
Ops.FLOORDIV and Ops.FLOORMOD ( #16038 )
...
* Ops.FLOORDIV and Ops.FLOORMOD
lowered into IDIV and MOD in get_late_rewrite_patterns
* still need this
* exclude
* like that?
2026-05-05 11:42:14 -04:00
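The commit above says the floor-style ops are lowered into truncating division and remainder. A minimal sketch of that relationship in plain Python follows; the helper names (`cdiv`, `cmod`, `floordiv_via_cdiv`, `floormod_via_cmod`) are hypothetical and this is not the actual rewrite pattern in tinygrad.

```python
def cdiv(a: int, b: int) -> int:
    """Truncating (C-style) integer division: rounds toward zero."""
    q = abs(a) // abs(b)
    return q if (a < 0) == (b < 0) else -q

def cmod(a: int, b: int) -> int:
    """C-style remainder: the result takes the sign of the dividend."""
    return a - cdiv(a, b) * b

def floordiv_via_cdiv(a: int, b: int) -> int:
    """Floor division built from truncating division plus a correction."""
    q, r = cdiv(a, b), cmod(a, b)
    # Truncation and floor differ only when the remainder is nonzero
    # and its sign differs from the divisor's sign.
    return q - 1 if r != 0 and ((r < 0) != (b < 0)) else q

def floormod_via_cmod(a: int, b: int) -> int:
    """Floor-style modulo: the result takes the sign of the divisor."""
    r = cmod(a, b)
    return r + b if r != 0 and ((r < 0) != (b < 0)) else r
```

For example, `floordiv_via_cdiv(-7, 2)` gives `-4` while `cdiv(-7, 2)` gives `-3`, which is the gap the correction term closes.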
qazal
d79bf356c2
viz: add CALL -> codegen link ( #16044 )
...
* work
* cleaner
* details
* rm
2026-05-05 23:34:44 +09:00
Christopher Milan
1c8cb0769a
am: autogen asic_regs ( #16004 )
2026-05-04 22:52:07 -04:00
George Hotz
26406bed83
amd uses .valid, not index src valid ( #16042 )
2026-05-04 18:35:15 -07:00
chenyu
a357a0449a
Tensor.div cleanup ( #16041 )
2026-05-04 19:27:36 -04:00
nimlgen
5b4f62519d
cache buffer_views as well ( #16039 )
...
* cache buffer_views as well
* reuse
* back
* x
2026-05-05 00:00:09 +03:00
Christopher Milan
8e99c4f097
fetch checks sha256 ( #16037 )
2026-05-04 16:08:38 -04:00
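As an illustration of what a checksum check on downloaded data can look like, here is a small sketch using only the standard library. The function name `verify_sha256` and its signature are assumptions for this example, not the helper changed in the commit.

```python
import hashlib

def verify_sha256(data: bytes, expected_hex: str) -> bytes:
    """Return data unchanged if its SHA-256 digest matches, else raise."""
    actual = hashlib.sha256(data).hexdigest()
    if actual != expected_hex.lower():
        raise ValueError(f"sha256 mismatch: expected {expected_hex}, got {actual}")
    return data
```

Raising on mismatch, rather than returning a flag, keeps a corrupted or tampered download from being used silently.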
George Hotz
1884f67a39
simplify full_rewrite_to_sink spec ( #16035 )
...
* simplify full_rewrite_to_sink spec
* test cleanups
2026-05-04 11:44:13 -07:00
chenyu
a4fccd23b2
remove kwargs in UOp.vectorize [pr] ( #16034 )
2026-05-04 12:46:38 -04:00
qazal
b1d88ebf02
viz/cli: aggregate flops in -t ( #16031 )
...
* 38
* plumbing
* more flops
* flop/s and bytes/s
* arithmetic mean
* tests
* harmonic mean
* range
* better
* simplify
* fix prints
* no string parsing needed
2026-05-04 17:35:02 +03:00
qazal
c02e390c2b
viz: encode flops, mem and metadata in json ( #16032 )
...
* gate print
* update everywhere to check path
* server encodes json
* ui changes
* cli changes
* tests never need regex
* no str replace
* update test_pipes
* remove that
2026-05-04 23:06:18 +09:00
bigyoshi
4024d8438f
runtime/graph: avoid core_id runtimevar merge conflicts ( #16026 )
...
Co-authored-by: bigyoshi51 <269989564+bigyoshi51@users.noreply.github.com>
2026-05-03 19:16:02 +03:00
qazal
9684334dfe
viz: fix flops in graph, add null graph tracing ( #16024 )
...
* min repro, todos
* null graph tracing
* work
* work
* work
* only test_flops
* exec points back
* first
* better
* integral timestamps maybe
* cleanup
* simpler, update NULL to use SDMA naming
* integration test
* sdma
2026-05-03 22:32:44 +09:00
wozeparrot
419d525553
feat: handle multioutput kernel grads ( #16028 )
2026-05-02 22:31:45 -07:00
mefengl
9717d3a3a2
hotfix: prepend LD_LIBRARY_PATH to DLL posix search dirs ( #16023 )
2026-05-02 20:45:19 +03:00
qazal
7daf4b7d52
viz: split cli test ( #16015 )
...
* viz: split cli test
* arg3 is msg
2026-05-03 01:47:11 +09:00
nimlgen
d65b8ca25f
jit: remove *input_list from the graph sources ( #16021 )
2026-05-02 14:42:47 +03:00
qazal
7dae9e6f7f
viz: keep VIZ.value = 0 during python shutdown, cleanup launch ( #16022 )
...
* viz: keep VIZ.value = 0 during python shutdown, cleaner execv
* rm
2026-05-02 20:35:53 +09:00
Christopher Milan
637bdd5530
am: only support CDNA3/4 and RDNA3/4 ( #16017 )
2026-05-02 00:02:14 -04:00
George Hotz
4a2e1f1076
STORE doesn't have ranges anymore ( #16019 )
...
* STORE doesn't have ranges anymore
* fix
2026-05-01 15:00:27 -07:00
chenyu
0bffbc5f8a
onnx fmod uses fmod ( #16018 )
2026-05-01 16:47:11 -04:00
chenyu
782d1ff80f
Tensor.fmod ( #16014 )
...
c-style mod matches torch
2026-05-01 16:02:18 -04:00
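The note "c-style mod matches torch" refers to a remainder whose sign follows the dividend. The sketch below contrasts that with Python's floor-style `%` using the standard library only; it does not call tinygrad, and the wrapper names are hypothetical.

```python
import math

def c_style_mod(a: float, b: float) -> float:
    """C-style remainder: the sign of the result follows the dividend a."""
    return math.fmod(a, b)

def floor_style_mod(a: float, b: float) -> float:
    """Python's % operator: the sign of the result follows the divisor b."""
    return a % b

# The two agree when the signs match and differ otherwise:
# c_style_mod(-7.0, 3.0) is -1.0, floor_style_mod(-7.0, 3.0) is 2.0
```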
nimlgen
1079441332
revoke bus master ( #16007 )
2026-05-01 18:00:01 +03:00
qazal
8b147a9ed5
minimal repro for llama copies 2 ( #16011 )
2026-05-01 22:23:47 +09:00
qazal
a29dd7b19b
Revert "cleanup: untrack wait Metal buffers ( #15954 )" ( #16010 )
...
* Revert "cleanup: untrack wait Metal buffers ( #15954 )"
This reverts commit 5eb1fd5d3c.
* regression test fixes
2026-05-01 21:18:19 +09:00
qazal
65879fe1b7
metal synchronize regression test ( #16008 )
...
* add test for metal wait=True
* add self.assertRaises
2026-05-01 20:10:57 +09:00
nimlgen
f6d92b55e6
am: use per pipe reset for gfx11+ ( #16006 )
2026-05-01 12:56:43 +03:00
Christopher Milan
cee73becbe
am: ip offsets in autogen ( #16003 )
2026-05-01 00:13:52 -04:00
George Hotz
4506688285
split render to render.py ( #16002 )
...
* split render to render.py
* move more print
2026-04-30 19:41:14 -07:00
George Hotz
d651b4bbf0
SPEC=3 checks the shape ( #16001 )
...
* SPEC=3 checks the shape
* buffer view
* Revert "buffer view"
This reverts commit ffd87889a9.
* buffer view hack
* fix ptx
2026-04-30 18:41:37 -07:00
wozeparrot
528d35e306
llama speed 4 ( #15993 )
2026-04-30 17:14:41 -07:00
George Hotz
45fd7a3668
lil_image vectorize ( #16000 )
...
* lil_image vectorize
* 0 pitch on height 1
* Revert "0 pitch on height 1"
This reverts commit 58a83e6622.
2026-04-30 16:12:43 -07:00
wozeparrot
eddcd4723b
am_smi throttle info ( #15997 )
2026-04-30 15:28:32 -07:00
chenyu
52c92e15ae
no replacement multinomial ( #15995 )
...
* no replacement multinomial
Efraimidis–Spirakis
* num_samples == 1 can use fast path
2026-04-30 17:35:26 -04:00
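The commit body names Efraimidis-Spirakis for sampling without replacement. A minimal sketch of that technique in plain Python follows, assuming non-negative weights; the function name and signature are hypothetical and this is not the tensor implementation from the commit.

```python
import math
import random

def sample_without_replacement(weights, num_samples, rng):
    """Weighted sampling without replacement (Efraimidis-Spirakis keys).

    Each item with weight w > 0 draws u uniformly from (0, 1] and gets the
    key log(u) / w, which orders items the same way as u ** (1 / w).
    The num_samples largest keys are the selected indices.
    """
    positive = sum(1 for w in weights if w > 0)
    if not 0 < num_samples <= positive:
        raise ValueError("num_samples must be in [1, number of positive weights]")
    keys = []
    for i, w in enumerate(weights):
        if w <= 0:
            continue  # zero-weight items can never be drawn
        u = 1.0 - rng.random()  # shift [0, 1) to (0, 1] so log(u) is defined
        keys.append((math.log(u) / w, i))
    keys.sort(reverse=True)
    return [i for _, i in keys[:num_samples]]
```

A call such as `sample_without_replacement([0.1, 0.0, 0.6, 0.3], 2, random.Random(0))` returns two distinct indices and never returns index 1.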
chenyu
e0b09f288f
input validation for rand functions ( #15990 )
2026-04-30 14:00:44 -04:00
nimlgen
11e1a2b89f
cleaner and faster run_linear ( #15987 )
...
* cleaner and faster run_linear
* x
* assert for now
* x
* x
* sym_infer
* remove sink
2026-04-30 20:15:22 +03:00
qazal
58b34e71bd
failing test for llama useless copies ( #15989 )
2026-05-01 00:55:29 +09:00
George Hotz
0f7e296f5b
fix some indexing edge cases ( #15988 )
2026-04-30 08:05:30 -07:00
nimlgen
6f8b10d251
remove base Runner ( #15986 )
...
* remove base Runner
* linters
2026-04-30 13:04:55 +03:00
George Hotz
46a36a838a
small dtype shapes fixups ( #15984 )
2026-04-29 19:40:38 -07:00
chenyu
b73248958a
minor rand cleanups ( #15982 )
2026-04-29 22:22:29 -04:00
chenyu
53a28bafbd
rand device seed to its own function ( #15979 )
2026-04-29 17:21:40 -04:00
Christopher Milan
d07741f1d7
am: look for firmware in /lib/firmware/amdgpu ( #15974 )
2026-04-29 17:15:09 -04:00
nimlgen
c73e667fc0
remove if for precompiled programs ( #15980 )
2026-04-29 23:43:36 +03:00
qazal
55915584e5
viz: fix cfg for emulated amd on the null device ( #15976 )
...
* simple failing when i test it end to end
* pass
* linter
* assemble
2026-04-30 05:18:09 +09:00
nimlgen
dfd2d07005
remove CompiledRunner ( #15970 )
...
* rm usage of CompiledRunner
* more tests
* last
* linter
* sink
* remove
* linter
2026-04-29 22:45:48 +03:00
wozeparrot
0080489abe
llama: use env vars ( #15978 )
2026-04-29 12:37:15 -07:00
qazal
a37b605523
remove arch from asm kernel class ( #15977 )
...
* rm arch from kernel
* update other tests
* update abstractions4.py
2026-04-30 03:39:52 +09:00
Christopher Milan
7a79c2948a
DEV visible device filter supports hyphenated syntax ( #15971 )
2026-04-29 14:02:21 -04:00
Christopher Milan
6b9a45568c
autogen: better version handling for llvm and libclang ( #15975 )
2026-04-29 14:01:33 -04:00
chenyu
654e611a29
_bits_to_rand to mixin ( #15972 )
2026-04-29 13:47:25 -04:00
George Hotz
5f441ecffc
unify reduce + reduce_axis ( #15973 )
...
* unify reduce + reduce_axis
* fix all tests
* lil cleanups
2026-04-29 10:29:56 -07:00
qazal
b63e0a5f74
viz/sqtt: move amd decoder to extra, don't import from ops_amd ( #15969 )
...
* don't import from ops_amd
* start
* cleanup
2026-04-30 00:49:15 +09:00
nimlgen
7787f76dcc
get_runner -> get_runtime ( #15967 )
...
* get_runner -> get_runtime
* do not use get_runner
* fix
* remove get_runner
* remove
* fix
* x
2026-04-29 18:29:49 +03:00
chenyu
fb188c3c23
UOp.bitcast noop early return ( #15968 )
...
matches Tensor
2026-04-29 09:41:40 -04:00
qazal
30403c1e25
viz/cli: merge DEBUG=6 and -i ( #15966 )
...
* print_step contiguous
* merge
2026-04-29 19:52:17 +09:00
qazal
86621e9e7c
gate f32_to_fp8 renderer ( #15964 )
2026-04-29 19:12:46 +09:00
wozeparrot
ef09071073
llama: speed 2 ( #15960 )
2026-04-28 20:44:37 -07:00
Christopher Milan
e6863a1cc5
autogen: fewer type: ignores ( #15956 )
2026-04-28 21:58:13 -04:00
chenyu
836af56513
some RandMixin cleanup ( #15961 )
...
cleaner to just put inside OpMixin
2026-04-28 19:58:02 -04:00
chenyu
c4bea54e9c
_threefry_random_bits to mixin ( #15959 )
...
start RandMixin
2026-04-28 19:13:57 -04:00
George Hotz
796fdf9fd8
end has no shape ( #15958 )
2026-04-28 15:15:48 -07:00
Miguel Villa Floran
b36010c55a
DGX Spark and Jetson Thor support ( #15939 )
2026-04-28 18:08:21 -04:00
Nino Risteski
5eb1fd5d3c
cleanup: untrack wait Metal buffers ( #15954 )
2026-04-28 12:54:59 -07:00
nimlgen
77965a22e5
local optimize as rewrite ( #15953 )
...
* local optimize as rewrite
* better
* x
* slightly rename
* fix
* ugh
* remove
* x
* remove
* not weak
2026-04-28 22:51:04 +03:00
qazal
b3f0f8d349
llama: fix missing label_smoothing arg ( #15955 )
2026-04-29 02:12:14 +09:00
wozeparrot
5e861cd2c4
llama: move llama kernels to llama_kernels ( #15952 )
2026-04-27 22:48:53 -07:00
Christopher Milan
987b6dd193
python -m tinygrad.device prints interface info ( #15950 )
2026-04-27 22:15:38 -04:00
qazal
54f00e1013
sqtt: correct rdna4 structs ( #15948 )
2026-04-28 07:35:50 +09:00
Charlie Kerfoot
890d7be0c3
fix: muon not using device ( #15936 )
2026-04-27 14:56:48 -07:00
qazal
c58fd85a99
sqtt: add needs_rocprof decorator ( #15947 )
...
* sqtt: add needs_rocprof decorator
* version string
2026-04-28 06:22:50 +09:00
Christopher Milan
3f508810d8
cpu: lowercase arch ( #15943 )
2026-04-27 17:05:25 -04:00
chenyu
77f9125c21
move Tensor.pad to OpMixin ( #15946 )
2026-04-27 16:56:04 -04:00
nimlgen
4164666c72
programinfo ( #15942 )
...
* programinfo
* fix
* m
* x
* x
* changes
* x
* fix
* rm
2026-04-27 23:12:03 +03:00
chenyu
fe38d6de94
_pad_circular and _pad_reflect_replicate to mixin ( #15944 )
2026-04-27 16:07:05 -04:00
qazal
8c174bdad4
viz/sqtt: correct exec pipes ( #15885 )
...
* wmma
* p2
* test
* left
* work
* pickle
* handwritten failing tests
* start work
* test the pipes
* empirical evidence
* update rdna4 enum types
* VALU pipe 1
* TRANSCENDENTAL pipe
* transcendental function units
* reorder
* wmma pipe
* cleanup and notes
* smaller
* work
* diff cleanup
* pickle
* use se:1
* int
2026-04-28 05:05:49 +09:00
qazal
eeb8d5eb0c
viz: small ui changes ( #15940 )
...
* rename colors
* keep ctrl c
2026-04-27 04:00:13 +09:00
nimlgen
96165ff0d1
validate_with_cpu as rewrite ( #15938 )
...
* validate_with_cpu as rewrite
* compil
* x
* linter
* moved
* fix
2026-04-26 19:58:53 +03:00
nimlgen
117e9e22dd
estimates from graph ( #15937 )
...
* estimates from graph
* test
* x
2026-04-26 18:22:53 +03:00
chenyu
e9983e3516
remove unused QCOMTextureInfo, QueueType [pr] ( #15935 )
2026-04-25 14:32:31 -04:00
nimlgen
ac3494a7cc
remove some runners ( #15934 )
...
* remove runners
* mypy
2026-04-25 21:27:05 +03:00
nimlgen
bb652352c7
remove execitem ( #15932 )
...
* remove execitem
* f
* x
2026-04-25 19:33:04 +03:00
chenyu
e27444a0ff
remove unused UOp.shard_size [pr] ( #15933 )
2026-04-25 12:27:58 -04:00
nimlgen
e0ff6cc15c
remove old schedule ( #15930 )
...
* remove old schedule
* tests
* r
* x
2026-04-25 16:46:36 +03:00
qazal
9a23de7d27
viz/cli: unify profile and rewrites, -s ALL default ( #15931 )
...
* work
* workg
* better
* cleanup
* better defaults
* --ls
* better
* work
* update llama
* update
2026-04-25 22:31:24 +09:00
nimlgen
768106a542
remove schedule from extra/docs/examples ( #15929 )
...
* remove schedule from extra/docs/examples
* f
2026-04-25 14:09:12 +03:00
nimlgen
a5e9ea7a60
remove schedule batch 4 ( #15927 )
...
* remove schedule batch 4
* fini
2026-04-25 12:36:55 +03:00
nimlgen
d2ab6ea7a6
remove schedule batch 3 ( #15924 )
...
* remove schedule batch 3
* batch 6
* batch 7
2026-04-25 11:53:16 +03:00
nimlgen
3c8a2db870
remove schedule() from tests batch 2 ( #15923 )
...
* remove schedule() from tests batch 2
* batch 4
2026-04-25 10:44:41 +03:00
Denys Melnyk
1fdcb13bfb
webgpu: fix weight lookup in export_model after compile_net key change ( #15919 )
...
* fix lookup site in export_model_webgpu after refactoring
webgpu (sd): fix export_model weight lookup after compile_net changes
fix lookup site in export_model_webgpu after refactoring
* add regression test
2026-04-25 10:04:55 +03:00
Christopher Milan
8b2826ef16
nv: fix shader local memory for NAK ( #15921 )
2026-04-25 01:03:11 -04:00
Christopher Milan
57fbaa3d49
amd: fallback to llvm when comgr is not available ( #15914 )
2026-04-24 23:30:16 -04:00
wozeparrot
4b908b6e2c
llama: fused ce loss ( #15920 )
2026-04-24 20:01:24 -07:00
nimlgen
d3378010ee
schedule() -> schedule_linear() in tests (batch 1) ( #15915 )
...
* schedule_with_vars -> linear_with_vars in tests
* tests batch 1
* batch 2
* estimate_uop
* simpler
* rm
2026-04-24 23:40:53 +03:00
chenyu
b501ba3e42
nll_loss to mixin ( #15918 )
2026-04-24 15:50:31 -04:00
chenyu
2f9fdb4a37
scatter to mixin ( #15917 )
2026-04-24 15:37:37 -04:00
nimlgen
f2751955cb
remove linear_to_schedule from tests ( #15912 )
...
* remove linear_to_schedule from tests
* x
2026-04-24 20:02:10 +03:00
nimlgen
56a9f1e3ff
remove last jit_cache ( #15911 )
...
* remove last jit_cache
* linter
2026-04-24 19:44:52 +03:00
chenyu
03a7604f76
sort argsort topk allclose to mixin ( #15910 )
2026-04-24 10:20:46 -04:00
nimlgen
4010aa4044
jit: no jit_cache in graphrunner ( #15907 )
...
* jit: no jit_cache in graphrunner
* m
2026-04-24 16:34:26 +03:00
chenyu
7a1adfd2aa
update Tensor.allclose to return Tensor ( #15904 )
...
matches jax
2026-04-24 08:27:17 -04:00
Eitan Turok
48d7ab2695
no uv.lock ( #15893 )
2026-04-24 20:07:07 +08:00
qazal
5eb641395a
viz/cli: select kernel events in -s DEV ( #15909 )
...
* simple test
* pass
2026-04-24 21:03:34 +09:00
nimlgen
c0f77c2e1c
hcq graph to linear ( #15888 )
...
* hcq
* f
* f
* linter
2026-04-24 12:42:49 +03:00
Christopher Milan
cbf4946ea6
usb: multiple gpus and better error messages ( #15900 )
2026-04-24 01:57:19 -04:00
wozeparrot
9d134a2848
llama: fix fakedata timing ( #15905 )
2026-04-23 21:37:03 -07:00
b1tg
aab50d1bca
llm: dedup MLA cache_v ( #15887 )
2026-04-24 12:32:10 +08:00
qazal
f379b5a40a
sqtt: match amd's TS_DELTA_SHORT offset ( #15901 )
2026-04-24 06:41:22 +03:00
chenyu
c24da99d56
avg_pool2d, max_pool2d to mixin ( #15903 )
...
* avg_pool2d, max_pool2d to mixin
* fix
* just dtype
* that
2026-04-23 23:36:17 -04:00
chenyu
08d9106c9f
scatter_reduce and sparse_categorical_crossentropy to mixin ( #15902 )
...
also use `.ne` to fix `# type: ignore[comparison-overlap]`
2026-04-23 21:06:36 -04:00
chenyu
8cc2c69e21
fix isclose mixin ( #15898 )
...
use `.eq` instead of `==`
2026-04-23 20:40:43 -04:00
nimlgen
3072862e2c
metal to linear ( #15884 )
...
* metal to linear
* x
* x
* fix
2026-04-23 23:32:22 +03:00
chenyu
782bc6aece
broadcast in ElementwiseMixin.div [pr] ( #15897 )
2026-04-23 16:02:43 -04:00
qazal
7745e05a2f
sqtt: update wave end packet names ( #15896 )
...
* sqtt: update wave end packet names
* update wavestart and emu
2026-04-24 04:21:22 +09:00
qazal
ee7644932b
viz/cli: -t default number ( #15894 )
...
* viz/cli: accept one path argument
* -t default
* hm
* only the -t change
2026-04-24 04:13:16 +09:00
chenyu
11c197955b
interpolate and cross_entropy to mixin ( #15895 )
2026-04-23 14:59:45 -04:00
chenyu
f0dbc68aa9
gather to mixin ( #15891 )
2026-04-23 14:00:57 -04:00
chenyu
87223f870e
logcumsumexp, argmax, argmin, sequential to mixin ( #15890 )
2026-04-23 12:10:42 -04:00
nimlgen
5cf4ad2fb6
fix resolve param ( #15889 )
2026-04-23 17:41:44 +03:00
nimlgen
e4696185bd
cleaner cuda graph ( #15886 )
2026-04-23 16:34:29 +03:00
wozeparrot
d3cbd781d9
llama: use fused norm mul quantize for w13 ( #15878 )
2026-04-22 21:27:41 -07:00
George Hotz
0c3260d5d9
rename VECTORIZE to STACK ( #15880 )
2026-04-23 10:43:42 +08:00
chenyu
7c9bc29e44
Tensor method raise if arg is on different device ( #15879 )
...
instead of implicit `to`. this matches torch
2026-04-22 22:20:22 -04:00
chenyu
1fc4b3788c
cummax/cummin to mixin ( #15877 )
2026-04-22 21:25:39 -04:00
chenyu
684e95e1d4
UOp binary op broadcasts dtype ( #15875 )
...
* UOp binary op broadcasts dtype
matches Tensor
* fix
* fix?
2026-04-22 20:37:19 -04:00
Christopher Milan
b0dc95a390
AMX in arch, better docs ( #15871 )
2026-04-22 17:25:18 -04:00
nimlgen
e5891acab2
jit: precompile ( #15848 )
...
* x
* jit: precompile as sep step
* x
* s
* x
* x
* x
* ?
* ?
* x
* x
* viz
* f
* x
* u
* x
* x
2026-04-23 00:23:32 +03:00
chenyu
b9e2bc619e
simplify bool.cast() != const ( #15874 )
2026-04-22 17:08:09 -04:00
nimlgen
2041945f4b
cuda graph to linear ( #15870 )
...
* cuda graph to linear
* fix
* keep as old for now
* x
* x
2026-04-22 23:39:58 +03:00
chenyu
e9ebd03e86
update reduce_to_acc index dtype [pr] ( #15873 )
...
index arg should have weakint dtype
2026-04-22 16:25:50 -04:00
chenyu
3c8daa9a75
update test_where_removal ( #15872 )
...
don't use UOp.ufix for const_like, it will broadcast dtype soon
2026-04-22 14:56:37 -04:00
George Hotz
09ff3e1883
hotfix: add bytes back to llm
2026-04-23 00:46:27 +08:00
b1tg
af93a677ae
llm: glm 4.5 air ( #15771 )
...
* llm: glm 4.5 air
* clean
* clean
* remove gguf_size
2026-04-22 22:47:37 +08:00
qazal
719a7bdac5
viz: respect optional estimates in kernel info ( #15867 )
...
* simple failing test
* unpack kernel info
2026-04-22 14:24:48 +03:00
George Hotz
2d7fa58e61
fix shapes to match vecless ( #15866 )
...
* fix shapes
* need to simplify shapes
2026-04-22 18:27:46 +08:00
qazal
de8f58899e
move elf assembler to renderer ( #15855 )
...
* move elf assembler to renderer
* other
2026-04-22 19:00:36 +09:00
George Hotz
d4c344b7fd
hotfix: keep VCONST exclude in viz
2026-04-22 15:54:24 +08:00
wozeparrot
87378331e8
llama: fused mul quantize fp8 ( #15863 )
2026-04-21 20:58:37 -07:00
George Hotz
0560fa7b0f
add shape to range/special ( #15862 )
2026-04-22 11:15:02 +08:00
chenyu
3821e442eb
_one_hot_along_dim and one_hot to mixin ( #15861 )
2026-04-21 20:24:38 -04:00
chenyu
f911a63a6b
don't allow negative num_classes in one_hot ( #15859 )
...
no auto infer num_classes, matches jax
2026-04-21 19:39:29 -04:00
Christopher Milan
697e7aa819
MOCK+AMD and MOCK+NV interfaces ( #15858 )
...
MOCK+AMD is an alias for MOCKKFD+AMD, MOCKNVK+NV is renamed to MOCK+NV
2026-04-21 18:22:16 -04:00
chenyu
75ee51a446
triu tril _tri to mixin ( #15857 )
2026-04-21 17:10:55 -04:00
qazal
e36ff22538
fix dev syntax in emulated amd tests, skip test_tk ( #15856 )
...
* fix dev syntax in emulated amd tests
* skip test_tk
2026-04-21 23:47:29 +03:00
Christopher Milan
99a0debd62
Device.count() ( #15842 )
2026-04-21 16:46:38 -04:00
chenyu
1946ae8b51
linspace and eye to mixin ( #15854 )
2026-04-21 15:58:03 -04:00
qazal
0fbe0a6a99
viz/cli: ux tweaks ( #15853 )
...
* viz/cli: rename to --json
* st_ms, end confuses kimi
* remove pickle spam
* better
* comment
2026-04-21 22:18:27 +03:00
chenyu
86ceb3bd6b
arange to mixin ( #15852 )
2026-04-21 13:00:19 -04:00
chenyu
420e4c4673
zeros, ones, invalids to mixin ( #15850 )
2026-04-21 11:53:08 -04:00
chenyu
9192c93b7e
Tensor.invalid -> Tensor.invalids ( #15849 )
...
matches ones and zeros, and to not share name with UOp.invalid
2026-04-21 11:19:51 -04:00
nimlgen
bfe28ee2ad
rm run_schedule ( #15847 )
2026-04-21 18:14:30 +03:00
chenyu
d08b5d0a3b
full to mixin ( #15840 )
...
with unique_const
2026-04-21 10:53:43 -04:00
nimlgen
ae9b84d32f
rm beam uop ( #15844 )
2026-04-21 13:10:26 +03:00
nimlgen
01ac1c8c15
remove all run_schedule from tests ( #15846 )
2026-04-21 12:02:10 +03:00
qazal
f9655af2a3
viz/cli: move to tinygrad ( #15835 )
...
* move cli
* update imports
* cleanup the readme
* edit
* work
* details
* python -m tinygrad.viz.cli
* do not execv in non tty
* option
* lint
* simpler
* gemm pmc
2026-04-21 13:35:10 +09:00
Christopher Milan
1a8ba4cbd6
CPU renderers use arch ( #15839 )
2026-04-20 23:38:29 -04:00
chenyu
cabc347066
conv2d and conv_transpose2d to mixin ( #15838 )
...
* conv2d and conv_transpose2d to mixin
* cleanup
2026-04-20 18:10:06 -04:00
nimlgen
b8d3bf8970
run_linear in jit ( #15827 )
...
* run_linear in jit
* x
* x
* f
* casts
* ugh
* f
* x
* x
* simple
2026-04-20 23:03:30 +03:00
chenyu
e00cc8ae5e
split Tensor._conv2d_winograd ( #15837 )
2026-04-20 15:19:33 -04:00
chenyu
667b30b974
tensor pad arg cleanups ( #15836 )
2026-04-20 15:03:09 -04:00
chenyu
8eeb77a905
flat_to_grouped and resolve_pool_pads to helpers ( #15834 )
2026-04-20 14:03:35 -04:00
chenyu
b01704444b
einsum to ReduceMixin ( #15833 )
2026-04-20 11:49:24 -04:00
chenyu
3a557016cb
delete UOp.get_consumer_map [pr] ( #15832 )
...
not used
2026-04-20 10:57:42 -04:00
chenyu
04e8dbd7f8
remove getitem check in get_shape ( #15830 )
...
not needed
2026-04-20 10:40:46 -04:00
chenyu
72ecc61ca8
use more UOp method [pr] ( #15821 )
...
instead of constructing UOp directly
2026-04-20 09:17:56 -04:00
qazal
601b9d3f59
viz/cli: dedup DEBUG=3 pyrender ( #15826 )
2026-04-20 19:29:09 +09:00
ayanhan
80c7327e0f
resolve Metal ARC FIXME with explanation comment ( #13688 )
2026-04-20 17:10:37 +08:00
nimlgen
c0d7135b5f
do not use jit_cache in test ( #15823 )
...
* do not use jit_cache in test
* fix
2026-04-20 11:45:17 +03:00
George Hotz
5819c0abed
fix gc in gguf ( #15820 )
...
* fix gc in gguf
* fix mypy
2026-04-20 10:15:03 +08:00
George Hotz
67ed4c4eb3
move gguf stuff from nn/state.py to llm/gguf.py ( #15783 )
...
* move gguf stuff from nn/state.py to llm/gguf.py
* docs
2026-04-20 09:41:43 +08:00
chenyu
538841d1f2
remove_tags and _remove_all_tags are the same [pr] ( #15819 )
...
also other small UOp method cleanups
2026-04-19 21:37:49 -04:00
Kartik Vashishta
a1696e8413
objc: fix _classmethods_ dispatch flag ( #14854 )
...
* objc: fix _classmethods_ dispatch flag
* test: add objc _classmethods_ regression
2026-04-20 09:35:03 +08:00
oxrinz
f551a4bded
add threefry const folding ( #15787 )
...
* prim threefry
* test fix
* clean test
* cleanup
* cleanup 2
* cleanup 3
* fix conflict markers in test_const_folding.py
* update test
* fix lint
* use const instead of value for test
2026-04-20 09:30:03 +08:00
qazal
b05b1010bf
viz/cli: ux cleanups, show user python ( #15817 )
...
* small fixes
* print python trace
* jsonl
* cleanup fmt, fix tqdm
* print mode
* types
* less
* keep those
* fix
* everyone can print json
* pmc p2
2026-04-20 03:50:48 +03:00
chenyu
8b87b3522a
more UOp empty cleanups [pr] ( #15818 )
2026-04-19 19:48:36 -04:00
chenyu
2a5a6236ac
UOp.empty and UOp.empty_like ( #15816 )
...
* UOp.empty and UOp.empty_like
Tensor.empty and Tensor.empty_like use these, and removed _buffer_like
* import line
2026-04-19 16:01:01 -04:00
qazal
c6d8753ee1
viz/cli: --json support, refine docs ( #15528 )
...
* refine
* remove
* refine
* keep
* need to say this
* back
* feedback
* feedback
* json
* dur_ms
* et_ms
* remove useless thing
* docs
* respect NO_COLOR
* DEBUG also produces valid json
2026-04-19 21:53:38 +03:00
chenyu
50a7b82372
merge untag_and_append and append_after [pr] ( #15815 )
...
reads cleaner
2026-04-19 13:13:26 -04:00
chenyu
cace07c87a
clean up untag_and_append [pr] ( #15812 )
...
replace_uop does not change, and ret.op is always AFTER
2026-04-19 11:23:59 -04:00
wozeparrot
f28ea84de2
llama: fused silu fp8 amax ( #15798 )
...
* llama: combined w13
* llama: fused swiglu+fp8
* llama: fix amax interleaving
* llama: don't need separate matmul
2026-04-19 12:03:55 +08:00
chenyu
5bdfd4883f
update test_assign ( #15809 )
...
clean up old skips and update tests
2026-04-18 21:25:44 -04:00
nimlgen
022d8c4a11
remove jit_cache usage in extra/examples ( #15808 )
...
* remove jit_cache usage in extra/examples
* cached
2026-04-18 23:00:18 +03:00
wozeparrot
06343092c8
llama: combined w13 ( #15803 )
2026-04-17 22:27:31 -07:00
Christopher Milan
6adf4c3cd9
MOCKGPU interfaces ( #15796 )
2026-04-17 21:56:29 -04:00
chenyu
8da308573f
update test_assign_changes_alt with clone ( #15802 )
2026-04-17 20:17:37 -04:00
qazal
2581985532
viz/cli: multi device profiler output, print markers ( #15795 )
...
* yield
* all devices
* better
* add unittests
* markers like this
* profile_markers work
* less
* update README
* tiny and null
2026-04-17 23:40:10 +03:00
chenyu
0191cc73dc
update arange range check ( #15794 )
...
it was not checking negative steps correctly
2026-04-17 16:07:50 -04:00
nimlgen
23ca680a3a
run_linear ( #15784 )
...
* run_linear try 2
* x
* f
* tests
* ctx, cleaner
* r
* x
2026-04-17 22:44:16 +03:00
qazal
8fcaaede9a
fix root cause of TestVizIntegration.test_link_sched_codegen flakiness ( #15793 )
2026-04-17 20:31:52 +03:00
googlefan256
482c8c1ec8
Fix no module named error ( #15792 )
2026-04-17 19:42:35 +03:00
qazal
a227dbece1
viz/cli: reconstruct DEBUG output ( #15791 )
...
* work
* work
* ext
* padding
* at time
* work
* reorder
* less flags
* num_rows
* feedback
* pmc
2026-04-17 18:27:58 +03:00
qazal
601d137e85
viz: rename to rewrites_data, only use ContextVar ( #15790 )
...
* viz: rename to rewrites_data
* tms also 0
* gt 0
2026-04-17 17:21:51 +03:00
qazal
afc3904e58
viz/cli: unit tests in CI ( #15788 )
...
* simple failing test
* test stdout
* cleanup sqttmap
2026-04-17 22:34:44 +09:00
qazal
9f2a578e26
unskip TestCall.test_call_gemm_uop [pr] ( #15786 )
2026-04-17 16:18:51 +03:00
qazal
7bdb3adbbf
viz/cli: simplification and reordering ( #15785 )
...
* remove
* work
* this is all one thing
* the reorder
2026-04-17 15:16:07 +03:00
George Hotz
e1d13bc4fe
add GGUF IQ4_XS support ( #15766 )
...
* add GGUF IQ4_XS support
* gguf 21
* gguf 21
* use plus
* ggml_common autogen for constant arrays
* fix
* ggml_common in autogen
* inline
2026-04-17 14:43:39 +08:00
wozeparrot
9e60e4a7e7
llama: native fp8 ( #15733 )
2026-04-16 22:16:05 -07:00
George Hotz
a9b6cfece0
refactor llm into files ( #15780 )
...
* refactor llm into files
* chat.html
* tokenizer cleanup
* cleanup
* tests
2026-04-17 12:33:11 +08:00
chenyu
1fac03ce54
softmax and friends to mixin ( #15778 )
...
with detach now
2026-04-16 23:03:37 -04:00
George Hotz
ec00cefa5b
llm is the only app ( #15779 )
...
* tinygrad/llm is the only app
* upd pyproject
* claude refs
* scoping
* min diff
2026-04-17 10:44:48 +08:00
qazal
0e69388f6b
viz/cli: add DEBUG, optional number of rows ( #15777 )
...
* tabulate switch
* support DEBUG
* --top
* improve
* work
* feedback
* 0
* print_kernel both ways
* simplify
2026-04-17 04:36:47 +03:00
chenyu
2d196fb9bb
move Tensor.size to mixin ( #15775 )
2026-04-16 17:56:17 -04:00
Christopher Milan
9f4b7bed25
add pickled jit regression test ( #15774 )
2026-04-16 16:59:09 -04:00
qazal
6d9320ffb3
add NO_COLOR ( #15765 )
...
* NO_COLOR in cli
* add in helpers
* rm flags
* docs
* fix that
* temp
* Revert "temp"
This reverts commit 7522e664f6.
2026-04-16 22:44:55 +03:00
qazal
12c653a743
remove opts arg in get_program, everything uses opts_to_apply [pr] ( #15767 )
...
* check Ops.BEAM in process replay
* remove opts from the get_program api
* lint
* simplify
* cleanup
2026-04-16 22:42:43 +03:00
chenyu
f0c12a2004
another form of assign to itself ( #15770 )
2026-04-16 15:17:19 -04:00
b1tg
4e88d875ba
llm: glm 4.7 flash ( #15738 )
...
* glm 4.7
* test
* temperature, server enable_thinking
* --no-think
* remove think stuff
2026-04-16 22:42:04 +08:00
chenyu
d147e2a549
update test_nested_after_contiguous_store ( #15763 )
...
add kernel counts and some TODOs
2026-04-16 09:59:26 -04:00
qazal
126cda45f8
viz/cli: cleanups, add memory printer ( #15762 )
...
* simple repro
* use context
* work
* memory printer
* rm
* memory printer
* pylint
2026-04-16 22:44:47 +09:00
George Hotz
f57380cbc2
simplify GatedDeltaNetBlock using two state tensors ( #15704 )
...
* test double after
* simpler ssm
* no double test
2026-04-16 21:14:00 +08:00
nimlgen
c04f3eaa70
jit: capturedjit is linear ( #15743 )
...
* jit: capturedjit is linear
* x
* new beam
* test
* imp
* clean
* spec
* linter
2026-04-16 14:54:39 +03:00
George Hotz
d1cce7a476
put the ranges on store instead of after ( #15759 )
...
* put the ranges on store instead of after
* better assert
* fix stuff
* comment out slow rules i don't understand
* simpler rule
* closer
* return false for store
* fix loop
* only a few schedule failures remain
* remove stores to self
* all tests pass locally
* remove junk
* regression test and fix
* better test, bump broken torch count
* bugfix with regression test
* new fusion is better
2026-04-16 19:06:40 +08:00
George Hotz
d24466c844
CALL with return value is FUNCTION ( #15758 )
...
* CALL with return value is FUNCTION (GPT try)
* cleanups
2026-04-16 13:25:07 +08:00
chenyu
218d6b8988
delete old UOp.size [pr] ( #15756 )
2026-04-15 23:21:00 -04:00
wozeparrot
d090732270
usbgpu: reset endpoint for custom fw ( #15754 )
2026-04-15 20:01:27 -07:00
Muzammil
983a7bb576
exclude __del__ from TRACEMETA wrapping ( #15747 )
...
Session-Id: 019d9234-2531-75a0-a252-f0302cd9931f
2026-04-16 10:49:55 +08:00
chenyu
8bd4fead26
UOp.size -> prod(max_shape) ( #15755 )
...
and more test updates
2026-04-15 22:41:30 -04:00
chenyu
10c262ced8
update tests that use UOp.size ( #15753 )
2026-04-15 21:58:27 -04:00
qazal
96092d110c
fix process_replay Ops.BEAM [pr] ( #15752 )
2026-04-16 07:35:28 +09:00
chenyu
41421c3b48
BUFFER size is their arg ( #15750 )
2026-04-15 18:08:29 -04:00
Christopher Milan
be8005c5dc
DEV: secondary targets ( #15748 )
2026-04-15 17:26:20 -04:00
chenyu
507c02cecb
fix symbolic contiguous_view_offset ( #15749 )
...
* fix symbolic contiguous_view_offset
* flatten
2026-04-15 16:54:38 -04:00
nimlgen
164495678c
test_graph to use uops ( #15746 )
...
* test_graph to use uops
* x
* n
2026-04-15 21:59:41 +03:00
qazal
1f26584b2e
viz/cli: cleanups from linter ( #15745 )
...
* run linter
* pmc
2026-04-16 03:36:24 +09:00
chenyu
7cbfa1896a
comment out unused arm, triton in toml ( #15741 )
...
fixed `PYTHONPATH=. uv run tinygrad/apps/llm.py`
2026-04-15 10:05:19 -04:00
Christopher Milan
1c36878008
DEV: suggest alternatives ( #15732 )
2026-04-14 23:42:32 -04:00
George Hotz
1ae6528bb6
move schedule into schedule ( #15736 )
...
* move schedule into schedule
* callify to root
* sched docs
2026-04-15 11:03:25 +08:00
wozeparrot
3721c60bef
llama: bs 16 ( #15737 )
2026-04-14 19:52:03 -07:00
wozeparrot
480ad264a4
llama: per device amax ( #15735 )
2026-04-14 19:01:17 -07:00
Christopher Milan
adc96cd724
qcom: synchronize for copyin ( #15731 )
...
fixes: #15698
2026-04-14 18:31:15 -04:00
chenyu
3394d18066
size*itemsize -> nbytes ( #15729 )
...
and some UOp.size removal to prep for size to mixin change
2026-04-14 16:27:54 -04:00
nimlgen
e9ecc990ea
amd: add r9700 devid ( #15721 )
2026-04-14 20:15:00 +03:00
George Hotz
2450c8cba8
rename to callify + fix mypy ( #15727 )
...
* rename to callify + fix mypy
* update test
2026-04-14 23:43:19 +08:00
chenyu
528faa18ec
update env_vars.md ( #15722 )
...
remove HCQ_VISIBLE_DEVICES, IMAGE=2 and old DEBUG=3 stuff
2026-04-14 09:13:35 -04:00
George Hotz
359b1582d6
amd: EMU DPP support ( #15719 )
...
* EMU DPP support from GPT 5.4
* cleanups
* simple
* nope
* fix
2026-04-14 14:58:41 +08:00
wozeparrot
2b8d303f75
allreduce in precast dtype ( #15689 )
2026-04-13 20:24:12 -07:00
George Hotz
5683126844
llm: support for tekken tokenizer ( #15720 )
2026-04-14 10:52:07 +08:00
chenyu
70883a6950
cat the stack to mixin ( #15715 )
2026-04-13 18:44:39 -04:00
qazal
355e2729d3
viz: keep program UOp in data ( #15714 )
...
* refactor program uop access
* c.name
2026-04-14 07:04:16 +09:00
qazal
905b8adc97
viz: cli and server cleanups ( #15713 )
...
* update get_profile arg[0]
* uop_to_json arg[0]
* data is standalone in cli
2026-04-14 06:42:29 +09:00
Christopher Milan
d83707ec29
autogen: explicit types ( #15679 )
2026-04-13 16:54:39 -04:00
chenyu
ac41f15fc1
cumsum to mixin ( #15712 )
...
built on top of getitem
2026-04-13 15:06:08 -04:00
nimlgen
eac481b67f
mlx: fix ctypes ( #15711 )
...
* mlx: fix ctypes
* x
2026-04-13 20:43:56 +03:00
nimlgen
b370f5c5ac
hcq: call free for unmap ( #15710 )
2026-04-13 20:30:21 +03:00
chenyu
931d6cc62a
basic getitem to mixin ( #15697 )
...
* basic getitem to mixin
* cleanup
* fix
* cleanup
2026-04-13 13:04:36 -04:00
George Hotz
7610bdc59e
block multistore, it's not supported ( #15708 )
2026-04-13 20:57:59 +08:00
George Hotz
84d64b5835
hotfix: abstractions4 works in mock except asm
2026-04-13 20:57:00 +08:00
George Hotz
16f50a40a5
remove REMU from tree ( #15706 )
...
* no more compare emulators
* remove remu from tree
2026-04-13 20:43:08 +08:00
qazal
ac027055ef
viz: no global state ( #15705 )
...
* start viz data
* get_full_rewrites also moves
* update ref_map
* work
* update consumers
* cleaner cli
* linter
* cleanup tests
* back
* better
* sqtt tests
2026-04-13 21:35:20 +09:00
George Hotz
4c1fb18a09
Revert "Revert "Tests for GatedDeltaNetBlock + fix multi after assign issue (…" ( #15703 )
...
This reverts commit 0cec42db71.
2026-04-13 19:09:38 +08:00
George Hotz
0cec42db71
Revert "Tests for GatedDeltaNetBlock + fix multi after assign issue ( #15700 )" ( #15702 )
...
This reverts commit 6f5d756282.
2026-04-13 19:06:44 +08:00
George Hotz
6f5d756282
Tests for GatedDeltaNetBlock + fix multi after assign issue ( #15700 )
...
* broken after/assign test
* test for GatedDeltaNet
* better comments
* fix issue 1 with multi kernel
* fix 2
* fix
* linter
* public api + cleanup
2026-04-13 18:43:23 +08:00
b1tg
2b5ba0095d
qwen3.5 ( #15210 )
...
* qwen3.5
* faster
* or
* rm zero hack
* less float
* T=1
* clean
* clean
* 4b
* rope_dim
* Revert "jit: captures linears, not execitems (#15399)"
This reverts commit 9656d97d97.
* DeltaNetBlock
* pairwise_topk
* clean
* Reapply "jit: captures linears, not execitems (#15399)"
This reverts commit cf3deff53d.
* clean topk, _swiglu
* common
* FFNBlock
* clean
* half
* no mix
* qwen3.5 test
* fix ssm cache invalidation
* TransformerConfig
* SSMConfig
* clean
* reset_state
* llm: reuse server conversation tokens to avoid BPE roundtrip cache miss
* import error
* prefill
* none check
* put it back
* clean pairwise_topk
* symbolic: fold BIND(CONST, CONST) to CONST
* clean
* simpler pm
* _cached_msg_count
* stream decoder; ssm checkpoints
* rm checkpoint
* attn_output_gate
* conflict, attn_output_gate
* clean, less has_ssm, assert
* chunked prefill
* _reset_cache
* _reusable_prefix_len
* revert loop
---------
Co-authored-by: b1tg <b1tg@users.noreply.github.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2026-04-13 15:35:24 +08:00
qazal
2ada38f777
viz: execv after all producers complete ( #15696 )
2026-04-13 08:15:47 +09:00
chenyu
f7ff480fa6
start mixin getitem tests ( #15695 )
...
goal is to make Tensor[idx].uop equal to Tensor.uop[idx]
2026-04-12 18:54:33 -04:00
chenyu
77385ccb37
more trivial stuff to mixin ( #15693 )
2026-04-12 15:17:16 -04:00
chenyu
ff1de5ae13
normalize logsumexp contiguous_backward to mixin ( #15692 )
...
* normalize logsumexp contiguous_backward to mixin
* more
2026-04-12 13:13:00 -04:00
chenyu
0254cfe642
move usum and uprod to mixin ( #15690 )
...
and used it to clean up ops and tensor
2026-04-12 11:42:24 -04:00
nimlgen
e9b2e156b4
add jitbeam to tinygpu docs ( #15691 )
2026-04-12 18:20:26 +03:00
chenyu
e706f408cb
suppress test warnings from numpy ( #15688 )
2026-04-11 22:33:20 -04:00
nimlgen
938cba4fdf
amd: a bit faster usb, skip interrupts on sync ( #15686 )
2026-04-11 17:26:36 +03:00
qazal
054d78e6ff
fix llama profile.sh NULL source ( #15685 )
2026-04-11 22:56:05 +09:00
Graham Robbins
4ca844e96b
add Q1_0 gguf type ( #15683 )
...
* add Q1_0
* better description
* fix trailing whitespace
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2026-04-11 18:17:24 +08:00
George Hotz
5156a04cf5
add support for AM_POWER_LIMIT ( #15684 )
...
* add support for AM_POWER_LIMIT
* level None
2026-04-11 17:14:54 +08:00
wozeparrot
457508d5a0
llama: save more 2 ( #15681 )
2026-04-11 01:03:36 -07:00
George Hotz
29238b772f
AMD USB: support for 0xF3 power toggle
2026-04-11 13:04:38 +08:00
George Hotz
b5a9465b13
llm: add support for moonlight (deepseek MLA) ( #15466 )
...
* add gguf Q5_0
* it works
* rebase
* simpler test
* class
* less diff
* dicts
* normal names
* simplify
* this
* simpler
* work
* work
2026-04-11 10:32:48 +08:00
wozeparrot
590464c8d8
llama: only support wqkv path + cleanups ( #15680 )
...
* llama: only support wqkv path + cleanups
* llama: missing transpose
2026-04-11 07:39:27 +08:00
nimlgen
aa012d6f08
usb: faster custom ( #15678 )
...
* usb: _f0_out_buf for e4 cmd as well
* custom speed
* fast
2026-04-10 23:00:31 +03:00
nimlgen
58646f9569
usb fast copyout ( #15677 )
...
* usb
* fix usb
2026-04-10 21:04:49 +03:00
qazal
0d5cdc9600
viz: split draw loop ( #15676 )
...
* split draw loop
* one draw
* no functions
* inline all highlights
* cleanup
2026-04-10 23:25:50 +09:00
chenyu
e1334d3852
move canonicalize_device to device.py ( #15675 )
2026-04-10 09:43:56 -04:00
chenyu
8e7fcc8ca3
remove _include_initial in _cumalu ( #15674 )
...
handle negative pad in caller
2026-04-10 08:33:30 -04:00
George Hotz
9092f2a8c0
llm: add shared_expert and rope_dim support from qwen35 ( #15673 )
...
* llm: add shared_expert and rope_dim support from qwen35
* refactor into FFNBlock and TransformerBlock
* norms where they belong
2026-04-10 19:18:27 +08:00
b1tg
9ab1415937
llm: fix streaming UTF-8 decode ( #15653 )
2026-04-10 17:01:02 +08:00
wozeparrot
55bcd7cc9e
llama amax outside ( #15670 )
2026-04-09 23:08:03 -07:00
George Hotz
16f3448b26
Add HIP to abstractions4 ( #15672 )
...
* cleanup formatting
* add HIP option
* pass in correct
2026-04-10 14:05:52 +08:00
George Hotz
ed2a72bb23
work on abstractions4 ( #15671 )
...
* work on abstractions4
* works
* offst
* assembly works
* RAND
* cleanup
* work
2026-04-10 13:25:11 +08:00
Christopher Milan
dbc23e8a1b
move HCQ_VISIBLE_DEVICES into DEV ( #15668 )
2026-04-09 22:01:35 -04:00
George Hotz
fa02105546
hotfix: pin amd isa xml version
2026-04-10 06:47:00 +08:00
nimlgen
057dc173ab
beam uop ( #15660 )
...
* beam as uop
* x
2026-04-09 19:13:03 +03:00
nimlgen
0ff30b003d
am: reset queues from spi ( #15664 )
...
* am: reset queues from spi
* move
2026-04-09 18:25:50 +03:00
George Hotz
48a7627b04
add RDNA4 support to copy WMMA ( #15663 )
...
* add RDNA4 support to copy WMMA
* simpler
* simpler
* comment
* assert
2026-04-09 22:48:20 +08:00
chenyu
6837881b06
remove same_shape_noop [pr] ( #15662 )
...
no longer used
2026-04-09 09:50:26 -04:00
Christopher Milan
d08c76d9cb
c.Struct cleanup ( #15640 )
2026-04-08 20:07:16 -04:00
qazal
742b3894d7
viz/cli: add pmc printer ( #15651 )
...
* viz/cli: add pmc printer
* cli work
* s
* linter
* pack workgroups
* add : to wgp
* counter name
2026-04-09 08:50:54 +09:00
chenyu
4cf2759fc8
fix merge_reduce_ends ( #15659 )
...
* fix merge_reduce_ends
same range with different nesting should not merge, like cumsum twice should not merge
* skip that
2026-04-08 17:20:01 -04:00
chenyu
cb681da840
move UOp.pad to mixin ( #15657 )
...
the same arg works for Tensor.pad
2026-04-08 13:15:19 -04:00
nimlgen
28b14b0e38
mlx: remove to_be, use helpers ( #15655 )
2026-04-08 20:07:28 +03:00
nimlgen
1b44cb2ac6
split update stat from execitem ( #15654 )
2026-04-08 20:07:12 +03:00
qazal
71c83cc3f6
viz: put OTHER_ on the wave row ( #15650 )
...
* viz: put OTHER_ on the wave row
* update tests
* cleanup cli
2026-04-08 23:13:44 +09:00
chenyu
839d37b7bc
update median_step_time in model_train.py ( #15649 )
...
BENCHMARK=5 used to pick the 4th largest, not the middle one
2026-04-08 09:53:59 -04:00
chenyu
dae9dea903
clean up tensor random functions ( #15648 )
...
* clean up tensor random functions
* revert that
2026-04-08 09:44:37 -04:00
George Hotz
1ebeb52e59
RDNA4 asm gemm ( #15427 )
...
* sqtt: rdna4 decoder work
* diff cleanup
* more diff
* test
* 125
* r4
---------
Co-authored-by: qazal <qazal.software@gmail.com>
Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2026-04-08 21:26:44 +08:00
nimlgen
b1e52ba0c2
the slowest line in hcq graph ( #15635 )
...
* the slowest line in hcq graph
* x
2026-04-08 15:53:52 +03:00
qazal
3ac16b3bea
viz: add wmma row, update exec duration logic ( #15646 )
...
* viz: split wmma to its own row, fix duration logic
* regs
* decrease number of loops, add pickle
* assert overlaps
2026-04-08 20:24:23 +09:00
George Hotz
35e3983840
Add Q5_0, Q5_1, and bfloat16 GGUF types ( #15644 )
2026-04-08 17:16:19 +08:00
qazal
39a029ec55
remove ASM_GEMM context var ( #15645 )
2026-04-08 18:02:40 +09:00
qazal
dc6a51e44d
viz: add # of bytes to sdma ( #15639 )
...
* viz: add # of bytes to sdma
* update test_viz
2026-04-08 17:43:37 +09:00
wozeparrot
70dbd35023
llama: move custom_kernel into flat_llama ( #15643 )
2026-04-08 00:19:14 -07:00
Christopher Milan
bcf6931a4f
fix: comma 4 does not have pcie ( #15642 )
2026-04-07 23:57:03 -04:00
George Hotz
f930579b7a
llm: change the default port to 8000 so you can remember it (match vLLM)
2026-04-08 11:25:38 +08:00
b1tg
bf3763526a
llm: buffer SSE chunks to fix parse errors from split reads ( #15641 )
2026-04-08 10:26:23 +08:00
qazal
a508b8fd2a
viz: delete redundant things ( #15637 )
...
* delete that
* remove
* delete graph config
2026-04-08 07:18:04 +09:00
chenyu
9c6e925b56
move lerp to mixin ( #15634 )
...
last function of math function section
2026-04-07 15:13:00 -04:00
qazal
890286e8d6
update llama profile.sh ( #15633 )
...
* update llama profile.sh
* BENCHMARK 5
2026-04-08 03:18:45 +09:00
nimlgen
b78b384d58
mlx: graph ( #15621 )
...
* Dx
* Dx
* simpler
* mypy
* x
* f
* Dx
* x
* c
* x
2026-04-07 19:43:51 +03:00
qazal
d29f0ef721
viz: speed up profiler first render ( #15632 )
...
* viz: speed up profiler first render
* better comment
2026-04-07 23:07:09 +09:00
George Hotz
d3de63d998
improvements to apps.llm ( #15631 )
2026-04-07 20:34:05 +08:00
George Hotz
2b01ca59dd
USB driver for custom ASM firmware ( #15597 )
...
* USB driver for custom ASM firmware
* timeout
* fix mypy
* pcie mem read
* flip in f/w
* one tx
* litle endian
* autodetect custom
* mock bypass
* lint
* clean
2026-04-07 13:45:41 +08:00
wozeparrot
810d7c00cd
llama: unify scripts ( #15628 )
2026-04-06 20:28:08 -07:00
Christopher Milan
19e96497ee
interface in DEV ( #15620 )
2026-04-06 19:59:28 -04:00
qazal
8ba58304f7
viz: reenable tests ( #15626 )
2026-04-07 07:52:44 +09:00
chenyu
2f7d085450
shared _normalize_indices for getitem ( #15625 )
...
* shared _normalize_indices for getitem
* list
2026-04-06 17:45:36 -04:00
chenyu
66ec188d50
more activations to mixin ( #15624 )
2026-04-06 15:41:41 -04:00
chenyu
1483f7e71c
support shift by Tensor ( #15623 )
...
* support shift by Tensor
* use mixin
2026-04-06 15:14:57 -04:00
chenyu
6e30a5f5ea
update shifts in torch backend ( #15622 )
2026-04-06 14:08:33 -04:00
chenyu
a444be172d
lower fuzz_symbolic_symbolic_div timeout ( #15619 )
...
mitigate timeout crash due to high total time
2026-04-06 12:58:29 -04:00
chenyu
01b49c8647
support int operand for shifts ( #15618 )
...
matches torch/jax, also symbolic rule to remove mask
2026-04-06 12:32:12 -04:00
nimlgen
e2700475cf
mlx: cleaner ( #15617 )
...
* mlx: cleaner
* x
2026-04-06 17:49:47 +03:00
Valtteri Valo
86c4431d74
add gpu_family detection to Metal, target MSL 4.0 on macOS 26+ ( #15079 )
...
use supportsFamily API to detect GPU generation instead of parsing
ICB debug description strings. also adds metal4.0 compiler target.
2026-04-06 06:51:38 +08:00
13Perrius
ff0c941548
remove redundant iteration and toposort in _deepwalk ( #15532 )
2026-04-06 06:38:45 +08:00
Andrew Cappelli
e39cfe685a
validate lr, momentum, weight_decay in optimizers ( #15576 )
2026-04-06 06:37:34 +08:00
nimlgen
6a334ceb27
hotfix: fix bert ( #15613 )
2026-04-05 23:41:21 +03:00
nimlgen
e3986a6b74
mlx: init runtime ( #15612 )
...
* mlx: init
* x
* swap
2026-04-05 22:52:29 +03:00
nimlgen
e0988dbae5
hcq: support non for signal_t and compute_t ( #15611 )
...
* hcq: support non for signal_t and compute_t
* revert
* x
2026-04-05 18:56:47 +03:00
nimlgen
5e134aa087
hcq: add write/poll_bit commands ( #15610 )
...
* hcq: add write/poll_bit commands
* x
2026-04-05 18:09:44 +03:00
nimlgen
604cdbf2f7
am: large allocs aligned to 2mb to use 2mb pages ( #15609 )
2026-04-05 18:01:31 +03:00
qazal
b2d5b29f45
assembly/amd: validate dsl keyword args ( #15608 )
...
* assembly/amd: validate dsl keyword args
* hm, this should use the SOP2 s_waits
* use the sop2 s_waits
2026-04-05 23:00:24 +09:00
qazal
056fcd7758
viz: web work from rdna4 gemm ( #15607 )
...
* add rdna4 barrier
* fix realtime
2026-04-05 19:14:16 +09:00
wozeparrot
7e54992bf6
fp8 llama ( #15588 )
...
Co-authored-by: qazal <qazal.software@gmail.com>
2026-04-04 18:24:57 -07:00
qazal
4d36366717
assembly/amd: match rdna4 hw gidx init in emulator ( #15604 )
...
* simple rdna4 copy kernel with hw fault
* the trivial fix: use ttmp instead of s
* now copy kernel fails in mockgpu
* rm crashing kernel
2026-04-05 02:28:18 +09:00
chenyu
2ba5a6ddc8
remove detach in selu ( #15602 )
...
UOp does not have detach. this does not change behavior
2026-04-04 11:04:29 -04:00
qazal
f7aed180e4
viz/cli: add Other row in profiler ( #15600 )
2026-04-04 22:40:53 +09:00
Christopher Milan
74ecf6d3e6
opaque structs are also c.Struct ( #15596 )
2026-04-03 19:40:43 -04:00
Christopher Milan
645d45d968
DEV has arch ( #15577 )
...
Co-authored-by: Comma Device <device@comma.ai>
2026-04-03 19:17:19 -04:00
nimlgen
902edc3781
hcq: hcqbuf in copy ( #15595 )
2026-04-03 22:47:36 +03:00
nimlgen
2c4271209e
hcq: peer groups for remote ( #15594 )
...
* hcq: set real peer group
* x
* x
* x
2026-04-03 19:03:07 +03:00
chenyu
8fdef2d3e4
mean/std/var to mixin ( #15593 )
2026-04-03 10:42:41 -04:00
qazal
9920b42b5e
hotfix: renderer.target.arch in disasm ( #15592 )
2026-04-03 22:23:51 +09:00
nimlgen
237084b276
remote: support several hosts ( #15585 )
...
* remote: support several hosts
* f
2026-04-03 11:22:15 +03:00
Christopher Milan
0ed8d9271d
Renderers accept Target or nothing ( #15590 )
2026-04-03 01:09:41 -04:00
wozeparrot
3a26920141
feat: framework ci ( #15589 )
2026-04-02 22:03:51 -07:00
Christopher Milan
736fea8412
select_first_inited cleanup and better errors ( #15587 )
2026-04-02 19:27:58 -04:00
Christopher Milan
8c50da800d
[pr] cleanup unused ctx's in codegen ( #15586 )
2026-04-02 19:06:58 -04:00
nimlgen
694dc5a717
install script in benchmark ( #15584 )
2026-04-02 18:15:58 +03:00
nimlgen
046c3f1240
mlx: add loopback with send/recv ( #15583 )
2026-04-02 18:15:46 +03:00
chenyu
c64226e97c
fix CreationMixin doc ( #15582 )
2026-04-02 09:46:28 -04:00
qazal
fefb0ebc2a
gemm/asm: fp8 cleanups ( #15580 )
...
* normal gemm here
* s/dtypes.fp8e4m3/FP8_DTYPE
* gemm_bw
* device UOp stays NULL
2026-04-02 19:02:38 +09:00
chenyu
61bc91aa8c
Tensor cumalu cleanups ( #15579 )
...
* Tensor cumalu cleanups
* happy
2026-04-02 05:23:22 -04:00
chenyu
1aa04eab08
simple CreationMixin ( #15567 )
...
start with full_like, zeros_like, ones_like
2026-04-01 23:00:56 -04:00
wozeparrot
5b2a3251c4
mlperf system json for mi350 ( #15575 )
2026-04-01 15:30:33 -07:00
Christopher Milan
6c67bd4c14
better error message when invalid renderer is specified ( #15573 )
2026-04-01 17:12:55 -04:00
Christopher Milan
0d6fbc2355
remove flaky and redundant image test ( #15574 )
2026-04-01 16:33:13 -04:00
Christopher Milan
20f7f0be8e
nir renderers use arch ( #15556 )
...
* nir renderers use arch
* fix
* fix null
2026-04-01 16:32:51 -04:00
nimlgen
148ad09559
am: do not use dbell for ih ( #15571 )
2026-04-01 21:34:21 +03:00
nimlgen
93a85c7348
am: raise when using more sdma engines ( #15569 )
2026-04-01 21:33:42 +03:00
nimlgen
da12c2ea16
better install msg ( #15570 )
2026-04-01 20:09:37 +03:00
b1tg
20497f2840
fold BIND to CONST when min==max ( #15568 )
2026-04-01 11:19:04 -04:00
qazal
9275f283e5
viz: update flag and display names ( #15566 )
...
* rename to occ, other_simd
* se pkts
* match viz cli tool in names
2026-04-01 21:48:37 +09:00
chenyu
f5c0794df2
fix Tensor.const_like ( #15565 )
...
used to always return a 0-d tensor, now returns an expanded Tensor based on self.shape and matches UOp
2026-04-01 08:35:19 -04:00
qazal
09f60d80fd
llama: fix FP8=1 FAKEDATA=1 ( #15564 )
2026-04-01 20:53:03 +09:00
nimlgen
6d1e992e89
copyout sharded w/o ioring ( #15562 )
...
* copyout sharded w/o ioring
* x
* x
* f
2026-04-01 14:47:29 +03:00
nimlgen
150c456977
add OSError to suppress_finalizing ( #15558 )
2026-04-01 12:33:59 +03:00
chenyu
fc5b94b902
fix UOp.where(const, const) ( #15560 )
...
* fix UOp.where(const, const)
* fix
2026-04-01 05:28:49 -04:00
chenyu
5aeb2273db
add amd_copy_matmul.py to CI ( #15555 )
...
more tests before cleanup
2026-03-31 22:39:18 -04:00
Christopher Milan
034f617971
NVCCRenderer is separate from CUDARenderer ( #15554 )
2026-03-31 21:26:13 -04:00
wozeparrot
8b5b9a0e90
llama: run_and_time ( #15533 )
2026-03-31 15:46:16 -07:00
Christopher Milan
acf239e4d2
specify renderer in DEV, <dev>_<ren>=1 is deprecated ( #15551 )
2026-03-31 18:35:14 -04:00
nimlgen
5181c8e23a
llm: fix nan in kvcache ( #15552 )
2026-04-01 00:38:45 +03:00
nimlgen
3af25ccdb4
docs: minor tinygpu changes ( #15550 )
2026-03-31 21:29:15 +03:00
nimlgen
477d194630
hipcomgr and tinygpu scripts ( #15549 )
2026-03-31 20:07:52 +03:00
nimlgen
83085f103c
tinygpu docs ( #15545 )
...
* tinygpu docs
* x
* x
* fix
2026-03-31 19:49:38 +03:00
nimlgen
ca89215a59
nv: use nvcc over nak by default ( #15547 )
2026-03-31 18:54:56 +03:00
qazal
a15345a53e
viz/cli: improve --help message ( #15546 )
...
* viz/cli: improve --help message
* not the default
* more work
* -s
* respect colored
2026-03-31 22:31:33 +09:00
nimlgen
10d570b3d5
signed tinygpu ( #15541 )
2026-03-31 14:55:09 +03:00
chenyu
4ac2552642
improve ReduceMixin.all ( #15544 )
...
use prod instead of min since `mul` lowered to `and` directly
2026-03-31 07:54:27 -04:00
chenyu
89ec22131a
tests to show double negation in min is not cancelled ( #15543 )
2026-03-31 06:59:13 -04:00
qazal
8feb8edc68
gemm/asm: add fp8 support to cdna asm_gemm ( #15542 )
...
* work
* hmm, mixins
* rhs_transposed
* also fix the dtype
* check for hipcc
* Exception
* select dev
* default
2026-03-31 19:32:54 +09:00
chenyu
2939ae8b22
more mixin ( #15540 )
...
isclose is elementwise, min, any, all to OpMixin
2026-03-31 05:46:55 -04:00
chenyu
e69f5f9f69
more movement methods to mixin ( #15536 )
...
* more movement methods to mixin
* cleanups
2026-03-31 05:16:47 -04:00
nimlgen
ceb63c8c2f
new bundle id ( #15307 )
...
* new bundle id
* new profiles
2026-03-31 12:16:03 +03:00
qazal
467c0af8aa
viz: skip flaky server tests ( #15538 )
2026-03-31 17:20:30 +09:00
qazal
f88e255cea
gemm/asm: split and parameterize dtype in llama gemm tests ( #15408 )
...
* gemm/asm: more tests for emulator, parameterize llama gemm tests
* bf16 atol
2026-03-31 17:12:44 +09:00
b1tg
a63392a565
llm: pairwise ranking topk for MoE expert selection ( #15499 )
2026-03-31 12:46:39 +08:00
wozeparrot
79cccf3003
write sz output to file ( #15534 )
2026-03-30 20:16:17 -07:00
Christopher Milan
6fb038d109
replace CompilerSet with list ( #15530 )
...
* replace CompilerSet with list
* oops
* default Renderer list
2026-03-30 23:07:52 -04:00
qazal
bc866a93f0
viz: rename exec to sqtt ( #15527 )
...
* viz: rename exec to sqtt
* more
2026-03-31 08:06:51 +09:00
Christopher Milan
adbfd82d1d
DEV is ContextVar, setting Device.DEFAULT is deprecated ( #15508 )
2026-03-30 17:10:49 -04:00
nimlgen
9583489068
add mlx driver to extra ( #15526 )
...
* mlx driver
* x
* simpler
2026-03-30 20:28:49 +03:00
qazal
ad6347f6d8
sqtt: allow mapping sopk to IMMEDIATE packets ( #15525 )
...
* work
* with s_waitcnt
* with the sopp variants, increase threads
* remove that
* sdst=NULL produces IMMEDIATE, otherwise is SALU
2026-03-30 23:12:17 +09:00
chenyu
301b2cea57
move matmul to mixin ( #15524 )
2026-03-30 07:39:09 -04:00
chenyu
f0eaac4235
reduce mixin ( #15523 )
2026-03-30 05:23:58 -04:00
chenyu
f485d0b664
UOp.sum -> usum, prod -> uprod [pr] ( #15522 )
...
rename to prep reduce mixin
2026-03-29 04:51:55 -04:00
qazal
36a925e2a2
viz: color wmma, one color map for cli and web ( #15519 )
...
* viz: color wmma, one color map for cli and web
* op_type
* like uops
* mypy cli
2026-03-29 04:53:01 +09:00
wozeparrot
0c3e438229
llama: mllog ( #15502 )
2026-03-28 11:18:25 -07:00
nimlgen
7e57e101d5
better oor message in profiles ( #15516 )
...
* better oor message
* x
2026-03-28 20:25:07 +03:00
qazal
266fb07721
viz: show exec duration ( #15484 )
...
* duration
* handwritten tests
* rdna3 pickle
* rdna4 pickle
* asserts
* rm that
* wmma work
* r4
* this shows the overlap well
* ohh okay it goes back
* are ds_load and ds_store different queues on RDNA4?
* print msg, v_mul_lo_u32 is 4 cycles?
* discover
* wmma something
* wmma comment
* less
* less
* better comments
* work
* inst st
* delay column
* better cli
* emit_alt
* update test_handwritten
* work
2026-03-28 22:48:59 +09:00
chenyu
fe705def0d
move more broadcast method to mixin [pr] ( #15513 )
...
* move more broadcast method to mixin [pr]
all but div, mod, and where
* xor -1
2026-03-28 01:48:08 -04:00
chenyu
c0753ab62f
XOR simplification rules ( #15512 )
...
x^-1 has good vmin/vmax, and x^y^y is x
2026-03-27 23:23:27 -04:00
qazal
ccaa6bfc19
viz/cli cleanups ( #15511 )
...
* one less function
* work
* layout
* better handling of rewrites
* mypy passes
2026-03-28 08:50:38 +09:00
qazal
dcc2a5d23b
viz/cli: simplify to --source and --item flags ( #15510 )
...
* viz/cli: simplify to --source and --item flags
* update viz cli test
2026-03-28 04:46:39 +09:00
nimlgen
0d6fc0f571
jit: graphing in uops ( #15489 )
...
* jit: graphing as rewrite rule
* f
* +metal,cuda
* x
* cl
* x
* x
* simpler
* f
* m
* x
* revert?
* revert2
* back
* back
* t
* x
* m
* x
* c
* x
* l
* x
* comment
* smaller
* rv
* x
* x
2026-03-27 19:09:02 +03:00
chenyu
30ebbe7f17
few more fold valid tests ( #15509 )
...
from remove CORRECT_DIVMOD_FOLDING attempt
2026-03-27 10:38:42 -04:00
Christopher Milan
9e0cc5c6ae
create image buffers in late codegen ( #15493 )
2026-03-27 04:50:53 -04:00
chenyu
1198d6e908
move pow to mixin ( #15507 )
2026-03-27 03:16:40 -04:00
chenyu
323fcefd7d
Revert "DEV is a ContextVar ( #15505 )" ( #15506 )
...
This reverts commit fdb30cba96.
2026-03-27 02:22:40 -04:00
Christopher Milan
fdb30cba96
DEV is a ContextVar ( #15505 )
2026-03-27 00:57:09 -04:00
wozeparrot
a65e958be9
llama: new apply_grad ( #15503 )
2026-03-26 19:39:25 -07:00
Christopher Milan
67a50fb738
move where on load with casts ( #15492 )
2026-03-26 22:11:27 -04:00
qazal
586c49642f
viz/cli: test in CI ( #15501 )
...
* viz cli work
* baseline test
* make cli test work without subprocess
* more checks
* check itrace
* s/return/return None
* change
* minimal
* colored
2026-03-27 06:47:15 +09:00
qazal
3f9f0fa846
viz: yield sqtt alt events ( #15500 )
...
* yield other
* less
* work
* less
2026-03-27 04:43:41 +09:00
qazal
237c25031f
sqtt: construct OTHER_SIMD op types with for loop ( #15495 )
...
* other-lds from amd_copy_matmul
* more other
* other simd work
2026-03-26 23:07:18 +09:00
nimlgen
7193f90746
test view input in jit ( #15497 )
...
* will anything fail?
* add test
2026-03-26 16:59:47 +03:00
nimlgen
de24b3fe37
jit: pass init params straight to base ( #15496 )
...
* jit: pass init params straight to base
* linter
2026-03-26 16:59:10 +03:00
qazal
ec5b7a249e
viz: refactor sqtt timeline builder ( #15494 )
...
* viz: refactor sqtt timeline builder
* barrier maps to waves
* clean up cli
2026-03-26 21:16:15 +09:00
Christopher Milan
313937ad6d
fix IMAGE TestEnd2End.test_linear_mnist ( #15488 )
2026-03-26 04:12:47 -04:00
Christopher Milan
bc180a963c
deprecate <dev>=1 in favor of DEV=<dev> ( #15467 )
...
* start work on target
* add test
* update actions to use DEV
* update docs
* update readmes
* tests need that too
* update example
* update tests (comments)
* fix that test
* ruff
* mypy
* oops
* remove getenvs
* don't add Target yet
* and the test
* lint
* and docs
* more stuff
* assert
* few more fixes
* test assert
2026-03-26 03:48:03 -04:00
chenyu
8426f820a1
Tensor.sub to mixin ( #15486 )
...
also _broadcasted skipped broadcasting shape if it does not have shape
2026-03-25 23:20:56 -04:00
wozeparrot
1ca178f379
llama: stochastic rounding ( #15456 )
2026-03-25 18:16:31 -07:00
chenyu
7c8f992894
move EXPAND dtype cast back to gradient.py ( #15481 )
...
only a concern for gradient, not mixin
2026-03-25 19:25:26 -04:00
nimlgen
9d2d0774b4
remote: disk copies ( #15482 )
...
* remote: disk copies
* linter
* r
* nv
* x
2026-03-25 22:14:25 +03:00
qazal
7c2c8d3905
viz: small ux improvements ( #15483 )
...
* test
* better
* work
2026-03-26 03:18:25 +09:00
qazal
737d5f67f9
viz: compute canvas dims for auto zoom ( #15474 )
2026-03-26 00:05:23 +09:00
qazal
60bd546593
sqtt: add cycle count to rdna3 enums ( #15473 )
...
* update rdna3 sqtt enums to include cycle_count
* dispatch_to_exec
2026-03-25 23:19:54 +09:00
chenyu
142bf11926
logical_not to mixin [pr] ( #15472 )
...
also UPat.cast skips same dtype
2026-03-25 09:16:45 -04:00
George Hotz
25ff7146f2
add a status line to REMOTE with DEBUG=1 ( #15471 )
...
* python speedups of hot paths
* add a status line to REMOTE with DEBUG=1
* pc
* t
2026-03-25 20:54:56 +08:00
qazal
c973b508b8
viz/cli: pass ctrlc ( #15470 )
2026-03-25 21:13:28 +09:00
George Hotz
c1a7d90ccc
python speedups of hot paths ( #15469 )
2026-03-25 20:02:42 +08:00
George Hotz
ae7090b13b
print function timing with DEBUG=2 ( #15468 )
...
* add DEBUG=2 function timing
* remove those functions, they aren't useful
* fix spec
2026-03-25 19:07:32 +08:00
Christopher Milan
e7f389efda
fix height=1 images on macos ( #15460 )
2026-03-25 05:59:56 -04:00
George Hotz
789628df2e
hotfix: add USE_BOT flag to ASM24 USB
2026-03-25 15:00:08 +08:00
George Hotz
cd1a276f47
llm: support gguf path or url ( #15464 )
...
* llm: support gguf path or url
* one line
2026-03-25 14:43:19 +08:00
chenyu
713b322e70
add weakint to promo_lattice ( #15463 )
...
sits between bool and smallest int
2026-03-25 00:27:34 -04:00
chenyu
02878c5a2f
move _broadcasted to OpMixin ( #15461 )
...
it needs both ElementwiseMixin and MovementMixin
2026-03-24 23:56:01 -04:00
chenyu
519ba22470
more Tensor._broadcasted cleanup ( #15459 )
...
prep moving to mixin
2026-03-24 22:55:45 -04:00
George Hotz
fe2690399b
llm: support assistant prefill + refactor to TransformerConfig ( #15457 )
...
* llm: support assistant prefill
* refactor to ModelConfig
* TransformerConfig
* more
2026-03-25 10:50:48 +08:00
Christopher Milan
fd92aec094
cleanup unused image pitch code ( #15458 )
2026-03-24 22:47:16 -04:00
chenyu
f6ed4da268
Tensor.ufix ( #15452 )
...
* Tensor.ufix
prep moving _broadcasted to mixin
* remove backward_cast
2026-03-24 22:34:43 -04:00
qazal
1b3d00d6ac
viz/cli: remove --offset and --limit flags ( #15439 )
...
* work
* also no more no-color
* reorder
* update llama
* sqtt readme
* itertools
* rm that
* signals back
2026-03-25 09:52:27 +09:00
wozeparrot
da2031266a
llama: correct 8b init ( #15397 )
2026-03-24 13:41:41 -07:00
qazal
652bab8aad
viz: support nested track_rewrites ( #15454 )
...
* simple test
* stack active groups
2026-03-25 05:01:30 +09:00
qazal
41eb2cc41b
viz: preserve zoom between re renders ( #15451 )
2026-03-25 03:11:10 +09:00
Salman Chishti
84049fdc07
Upgrade GitHub Actions to latest versions ( #15446 )
...
Signed-off-by: Salman Muin Kayser Chishti <13schishti@gmail.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
2026-03-24 10:28:49 -04:00
Salman Chishti
9567075e20
Upgrade GitHub Actions for Node 24 compatibility ( #15445 )
...
Signed-off-by: Salman Muin Kayser Chishti <13schishti@gmail.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
2026-03-24 10:28:19 -04:00
chenyu
b7960841af
support shape broadcast in UOp.alu ( #15442 )
...
i think it can integrate tighter, but now Tensor also does ufix from UOp and implicit dtype upcast
2026-03-24 10:14:57 -04:00
George Hotz
a33ac869aa
llm server: temperature + test client ( #15444 )
...
* improvements to the llm server
* eval script
* eval llm
* better eval gets 58.71
* cleanups
* add temperature, but multinomial is absurdly slow
* claude is so smart
* lint
* remove slop
* no more stop
2026-03-24 21:07:15 +08:00
nimlgen
9db5d677c7
jit in viz ( #15447 )
2026-03-24 18:23:53 +08:00
Christopher Milan
2e4fbbcc9c
ir3: fix texture mapping and benchmark ( #15443 )
2026-03-24 04:52:54 -04:00
Christopher Milan
d5320a9ddf
QCOM cleanups ( #15435 )
2026-03-23 22:18:38 -04:00
George Hotz
85dee83f5d
amd flash attention cleanups + emulator fixes ( #15431 )
...
* amd flash attention cleanups
* simpler
* params
* fix emulator bugs
* fix idiv bug
* remove that test
* more emu fixes
2026-03-24 10:10:46 +08:00
chenyu
018a9e2d3c
remove match_dtype arg in Tensor._broadcasted ( #15440 )
...
reworked Tensor.where to not need it, also updated dtypes.from_py to use isinstance because of ConstFloat issues
2026-03-23 22:10:39 -04:00
qazal
a590eded87
sqtt: rdna4 decoder work ( #15434 )
...
* sqtt: rdna4 decoder work
* diff cleanup
* more diff
* test
* work
* works
* TS_DELTA_SHORT
2026-03-24 03:49:32 +09:00
qazal
109472c37e
sqtt: new s_barrier pickles, handle rdna4 barriers in emulator ( #15437 )
2026-03-24 03:25:28 +09:00
nimlgen
fa4cdb422e
memplan on linears ( #15422 )
...
* memplan
* test
* x
* arenas
* correct
* set any size
* ugh
* make hevc happy
* x
* x
* held
* rm old
* del
* x
* fu
* f
* cl
* cl
* ok
2026-03-23 19:50:16 +08:00
nimlgen
2da008ae3b
jit: rm replan ( #15433 )
2026-03-23 19:31:51 +08:00
qazal
c4c53418f8
sqtt: comment out flaky rocprof timestamp assert ( #15432 )
...
* comment out rocprof assert, add new assert
* better than > 0 assert
* string
2026-03-23 19:24:04 +09:00
chenyu
66a86f88a0
simpler Tensor._broadcasted inferred dtype ( #15430 )
2026-03-23 05:20:11 -04:00
Pham Nguyen Hung
c89576921d
Updated the APIs of mnist_gan ( #15429 )
...
Co-authored-by: pnhung1703@gmail.com <Hung Pham>
2026-03-23 17:04:00 +08:00
George Hotz
c62dea6881
ai slop flash attention (it works) ( #15401 )
...
* ai slop flash attention (it works)
* speed up, 2 TFLOPS + 7 GB/s
* simpler
* simpler
* optimize
* faster
* warp shuffle
* sqtt: link dispatch to exec (#15396)
* sqtt packet linking infra
python
* javascript
* ~doubly linked list
* ui works
* work
* exec can also highlight the pc, coloring work
* more work
* rm sqtt/model.py, doesn't need to be upstreamed
* viz: no context enters in cli, update llama profile (#15404 )
* removed unused named arg in rules [pr] (#15414 )
* viz: sqtt printer in viz/cli.py (#15411 )
* work
* sqtt timeline in CLI
* format all printers nicely
* s/Showed/Printed
* ansistrip
* sys.exit
* keep colors in list
* work from amd_copy_matmul
* has_more always gets returned
* linter
* don't print colors
* more colors
* wow this is so deep
* work
* minor details
* selected
* improve progress bar
* remove it
* 22, global_load_vaddr is so long
* remove *0 hack in sign, gradient materializes zeros for unconnected nodes (#15416 )
Amp-Thread-ID: https://ampcode.com/threads/T-019d1612-6322-706b-a94d-a812400a55cb
Co-authored-by: Amp <amp@ampcode.com>
* works
* cnt=20
* revert that
* uop slice tests
* simpler
---------
Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
Co-authored-by: gg <ggordbegli@gmail.com>
Co-authored-by: Amp <amp@ampcode.com>
2026-03-23 16:15:10 +08:00
qazal
1568a5ed07
viz: show dispatch to exec delay in sidebar ( #15428 )
2026-03-23 16:59:59 +09:00
Christopher Milan
ddaeebb500
nir: add shift support ( #15426 )
2026-03-23 03:37:44 -04:00
nimlgen
c74fa9bbe1
fix jitbeam not triggered ( #15424 )
...
* um
* beam
* x
* f
2026-03-23 15:34:59 +08:00
qazal
fd3559103b
viz/cli: better error message for empty itrace ( #15425 )
2026-03-23 15:50:20 +09:00
nimlgen
395aacd77d
jit: prune on linear ( #15423 )
...
* jit: prune on linear
* x
* this is from the future
2026-03-23 14:10:34 +08:00
chenyu
248cd9b39f
make Tensor init the only caller of Tensor.from_uop ( #15421 )
...
* make Tensor init the only caller of Tensor.from_uop
prep broadcast cleanups
* type
2026-03-23 00:29:08 -04:00
chenyu
67dcc79fdd
push Tensor(symbolic) logic to Tensor.from_uop ( #15420 )
2026-03-22 23:49:35 -04:00
gg
2087df814f
remove *0 hack in sign, gradient materializes zeros for unconnected nodes ( #15416 )
...
Amp-Thread-ID: https://ampcode.com/threads/T-019d1612-6322-706b-a94d-a812400a55cb
Co-authored-by: Amp <amp@ampcode.com>
2026-03-22 12:49:26 -04:00
qazal
c7b18e6108
viz: sqtt printer in viz/cli.py ( #15411 )
...
* work
* sqtt timeline in CLI
* format all printers nicely
* s/Showed/Printed
* ansistrip
* sys.exit
* keep colors in list
* work from amd_copy_matmul
* has_more always gets returned
* linter
* don't print colors
* more colors
* wow this is so deep
* work
* minor details
* selected
* improve progress bar
* remove it
* 22, global_load_vaddr is so long
2026-03-23 00:17:05 +09:00
chenyu
bcc08307da
removed unused named arg in rules [pr] ( #15414 )
2026-03-22 09:25:46 -04:00
qazal
2363bceb47
viz: no context enters in cli, update llama profile ( #15404 )
2026-03-22 05:47:02 +09:00
qazal
a9ceaf3c5f
sqtt: link dispatch to exec ( #15396 )
...
* sqtt packet linking infra
python
* javascript
* ~doubly linked list
* ui works
* work
* exec can also highlight the pc, coloring work
* more work
* rm sqtt/model.py, doesn't need to be upstreamed
2026-03-21 23:48:58 +09:00
nimlgen
9656d97d97
jit: captures linears, not execitems ( #15399 )
...
* jit: captures linears, not execitems
* x
* um
* tests
* mockcuda
2026-03-21 16:32:12 +08:00
George Hotz
c13d9d29ff
add SHAPED_WMMA ( #15400 )
...
* add SHAPED_WMMA
* shaped wmma
* less bad
2026-03-21 16:16:03 +08:00
George Hotz
41a9b09683
minimal vec in amd_copy_matmul ( #15398 )
...
* minimal vec in amd_copy_matmul
* unified
* unify
* reshape/permute
* cleanups
* simpler
* move index
* cleanups
* more shared
2026-03-21 14:57:21 +08:00
qazal
30b3054fd5
whitespace cleanups in viz and sqtt.py ( #15395 )
2026-03-21 04:46:19 +09:00
qazal
71ccc69c52
FP8=1 llama works again, hipcc can run on macos ( #15394 )
...
* hipcc macos shim
* is_dtype_supported opens devices less
2026-03-20 23:43:15 +09:00
Christopher Milan
9470d5193a
deterministic decomp apply order ( #15393 )
2026-03-20 08:10:45 -04:00
Christopher Milan
376585b003
use should_emulate for target dtype in decomp ( #15392 )
2026-03-20 07:44:57 -04:00
Christopher Milan
a12d3951de
fix test_export_model imports ( #15389 )
2026-03-20 07:27:01 -04:00
George Hotz
1a2a203f48
add wmma support to amd_copy_matmul ( #15384 )
...
* add wmma support to amd_copy_matmul
* 15 TFLOPS and merged
* unify
* simpler
* simpler
* simpler
* cleanups
* TM/TN is the full regs
* comments
* WAVES_PER_SH + SQTT_EVENT
* Add WAVERDY support
* no split warp
* 3 range
2026-03-20 19:02:19 +08:00
Christopher Milan
1560b534a5
remove IMAGE=2 ( #15312 )
2026-03-20 06:26:52 -04:00
Christopher Milan
30d609432f
ci: only xcode-select for gpuocelot on macos ( #15387 )
2026-03-20 05:58:16 -04:00
chenyu
d1b4e37dfa
remove InvalidType branch in Tensor.__init__ ( #15386 )
...
it's handled by `elif isinstance(data, get_args(ConstType)):` already
2026-03-20 05:32:33 -04:00
chenyu
c491345766
pass device into Tensor._frompy ( #15385 )
...
* pass device into Tensor._frompy
with this, canonicalize_device is the only usage of Device in tensor.py
* export_model.py
2026-03-20 05:09:01 -04:00
George Hotz
3b75d8a7a2
fix double after bug in rangeify ( #15381 )
2026-03-20 14:53:46 +08:00
Christopher Milan
0c89340a1e
automatically emulate unsupported (tiny) floats [skip_process_replay] ( #15366 )
2026-03-20 02:31:44 -04:00
George Hotz
78ad089817
make precompile the default for llm ( #15376 )
...
* make precompile the default for llm
* works
* empty is okay for kvcache
* fix cache misses
* more tests
2026-03-20 14:08:55 +08:00
chenyu
459ef41ea0
don't exclude weakint in is_dtype_supported [pr] ( #15378 )
2026-03-20 02:08:29 -04:00
qazal
cf6a429aaa
mypy emulator pre-commit passing ( #15379 )
...
* fix dict stuff
* add type: ignores
* fix pcode to put uops not ints
2026-03-20 14:44:09 +09:00
wozeparrot
87c4ec1724
llama: use flat llama ( #15353 )
2026-03-19 22:12:38 -07:00
chenyu
da1700e16b
dtypes.index -> dtypes.weakint ( #15377 )
2026-03-20 01:08:46 -04:00
nimlgen
3b04e3ea28
no gmmu mappings with GMMU=0 ( #15369 )
...
* usb
* free
* simple gmmu=0
* x
* x
* vram
* init tests
* ppg
* x
2026-03-20 12:18:34 +08:00
ridoy majumdar
c1183b8872
remove dead code in pyrender ( #15115 )
...
* remove dead code in pyrender
* retrig CI
* retrig CI
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
2026-03-19 23:59:56 -04:00
chenyu
bf33c5f796
remove gradient materialize_grads ( #15367 )
...
effectively default to True
and removed *0 hack in Tensor.copysign. now dy/dx=0 if y does not depend on x
remove
2026-03-19 23:36:03 -04:00
chenyu
45baf3ff3f
pin ci xcode version ( #15375 )
2026-03-19 23:13:16 -04:00
George Hotz
4091d37e8e
flat llama step work ( #15355 )
...
* flat llama step work
* fp8 support
* blacklisted matmul
* chestertons fence
2026-03-20 09:06:12 +08:00
qazal
176ad47d7d
cdna4 emulator testing ASM_GEMM in CI ( #15373 )
...
* cdna emulator work
* accvgprs
* cdna passes most tests
* ruff
* add cdna4 to tests
* cdna emu
* crash
* pass?
* work
* gen
* clean up wave_size access
* asm_gemm passes
* remove acc from dsl.py, emulator can keep its different reg file
it's purely an encoding here, the ASM_GEMM already encodes acc srcs with v[], this can
be cleaned up later, but not functionally required for emulator.
* split asm_gemm tests to ones fast on the emulator
* don't do that
* 124 stays null on rdna
* the segfault was because of hw regs, not this
* Revert "clean up wave_size access", it's explicitly tested
This reverts commit 1202ff5787 .
* nullcopyout
---------
Co-authored-by: George Hotz <geohot@gmail.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2026-03-20 05:51:30 +09:00
nimlgen
16daffc042
remote connection timeout ( #15370 )
2026-03-19 19:44:16 +08:00
Christopher Milan
68d7a6b7be
PYTHONREMU: fix vop3p literals ( #15372 )
2026-03-19 07:05:01 -04:00
George Hotz
70dad9d642
add PING to RemoteCmd ( #15371 )
...
* add PING to RemoteCmd
* cleanup
2026-03-19 18:57:40 +08:00
nimlgen
1c978aeedb
amd: fix aql remote ( #15368 )
2026-03-19 18:11:03 +08:00
qazal
337c684047
viz: cycle time relative to kernel start in sidebar ( #15352 )
2026-03-19 18:41:29 +09:00
chenyu
d81b03cff4
pad_to to mixin [pr] ( #15365 )
2026-03-19 05:02:01 -04:00
chenyu
1abb6297f6
more Tensor(UOp) cleanups ( #15364 )
...
* more Tensor(UOp) cleanups
* function too
2026-03-19 03:34:30 -04:00
nimlgen
cf50ca23c3
better oom msg ( #15362 )
...
* better oom msg
* s
2026-03-19 14:07:01 +08:00
nimlgen
1a53393512
remote in ci benchmark ( #15344 )
...
* remote in ci benchmark
* move to the end
* move
* ports
* own this
2026-03-19 13:49:09 +08:00
chenyu
92dfef8060
Tensor(uop) does not need explicit device ( #15361 )
2026-03-19 00:44:33 -04:00
nimlgen
f32c2e43a7
memory: use pfree ( #15360 )
2026-03-19 12:39:23 +08:00
nimlgen
86eec01f97
limit gl*lc ( #15359 )
2026-03-19 12:38:55 +08:00
chenyu
b39816e998
failed test case for Tensor(np, "bf16") ( #15358 )
2026-03-18 23:40:14 -04:00
chenyu
e407ee410c
cosmetic Tensor._do_reduction cleanups ( #15357 )
2026-03-18 22:27:50 -04:00
chenyu
6aebf95dac
move neg and invert to mixin ( #15356 )
2026-03-18 22:03:41 -04:00
wozeparrot
f6687d1ffc
feat: sd seed0 update ( #15354 )
2026-03-18 18:42:00 -07:00
wozeparrot
c45a606750
feat: no if in rand ( #15333 )
2026-03-18 15:09:51 -07:00
qazal
23e0431848
viz: switch sqtt sidebar to a simple asm list ( #15350 )
...
* work
* something like this
* Revert "something like this"
This reverts commit 6c45098d2b .
* less
* path includes
* scroll only jumps up and down
* it's only pc and line now
2026-03-19 01:40:25 +09:00
qazal
709fc52d7b
viz: fix auto zoom range in sqtt, include endpgm packet ( #15349 )
...
* viz: fix automatic zoom range in sqtt packets
* it's x+width
* include s_endpgm
* endpgm also doesn't have exec
2026-03-18 22:52:32 +09:00
nimlgen
d4836ddbb0
canonicalize device from tuple ( #15348 )
...
* will it fix ci?
* test
* um
2026-03-18 20:35:52 +08:00
George Hotz
5524916e39
llama compute gradients explicitly + 243 GB of RAM on MP=8 ( #15343 )
...
* llama compute gradients explicitly
* apply grads
* fix multi issue
* multi BUFFER_VIEW support
* simpler
* skip the flaky test
2026-03-18 19:54:40 +08:00
nimlgen
ff004d2114
remote: fix mmio ( #15347 )
2026-03-18 18:20:39 +08:00
nimlgen
f853371c83
fix compilers autoselect ( #15346 )
2026-03-18 18:19:53 +08:00
chenyu
761ce8c0d3
fix Invalid combine rules ( #15345 )
...
* fix Invalid combine rules
wrong conditions broke setitem into invalids
* fix
2026-03-18 04:58:02 -04:00
nimlgen
c0499ca3e8
nv: use mmio iface ( #15342 )
...
* nv: use mmio iface
* nv: use mmio iface
* revert
* f
2026-03-18 16:53:09 +08:00
Christopher Milan
499ad9a356
benchmark openpilot 0.11.0 ( #15341 )
2026-03-18 03:28:43 -04:00
George Hotz
6e196195d8
add test for flat llama ( #15327 )
...
* add test for flat llama
* simpler
* back to split w1/w3
* env
* still too much ram
* invalid
2026-03-18 15:16:33 +08:00
chenyu
fceb21c315
Tensor(uop) uses device from uop ( #15340 )
2026-03-18 02:56:06 -04:00
George Hotz
6109117af1
anonymous buffers are Invalid ( #15336 )
...
* anonymous buffers are Invalid
* unique_const
* work
* remove invalid writes
* test_anonymous_buffers_in_function
2026-03-18 14:52:56 +08:00
chenyu
e644e1cb6a
less Tensor(...).uop indirection in Tensor.__init__ ( #15339 )
2026-03-18 02:17:38 -04:00
nimlgen
0315faf938
remote bench ( #15331 )
2026-03-18 14:03:51 +08:00
nimlgen
d720d50e12
memory: traverse all valid ranges only ( #15338 )
...
* memory: traverse all valid ranges only
* x
2026-03-18 14:03:39 +08:00
chenyu
ac7a348d06
dtypes.as_const -> DType.const ( #15337 )
...
does not need to be a staticmethod
2026-03-18 00:48:41 -04:00
Christopher Milan
864d3917d5
add openpilot onnx parser test ( #15334 )
2026-03-18 00:12:02 -04:00
Christopher Milan
0222bfdf69
Revert "don't use intermediate dict in onnx parse" ( #15332 )
2026-03-17 23:46:30 -04:00
chenyu
94926d00d8
fix rand > uint32.max ( #15330 )
...
need to keep low and high as 1D tensor.
`PYTHONPATH=. LLAMA3_SIZE=405B python3 examples/mlperf/models/flat_llama.py` works now
2026-03-17 22:00:01 -04:00
wozeparrot
b45edeb965
fix: rand supports large tensors ( #15329 )
2026-03-17 15:45:41 -07:00
qazal
00817cf65e
viz: all tests can run on the NULL device ( #15328 )
...
* remove that
* move to test_viz
* get_cfg
* do not use os.environ
* hm
* it's always on NULL
* import renderer
* no import *
2026-03-18 04:14:20 +09:00
George Hotz
2605840ee2
flat llama ( #15324 )
...
* FlatTransformer
* works
* pass in buffer views
* print stuff
* print
* bugfixes
2026-03-17 19:39:55 +08:00
nimlgen
0a641ce17d
system: remote ( #15318 )
...
* system: remote
* listen
* print
* fix
* minor
2026-03-17 19:25:37 +08:00
Christopher Milan
69eefdca20
images with height=1 have less strict width rules ( #15325 )
2026-03-17 07:07:22 -04:00
chenyu
14eb8170e4
skip TestRunAsModule if libclang is loaded ( #15323 )
...
reverse rule of TestAutogen skip, otherwise `NULL=1 python -m pytest test/null/test_autogen.py test/null/test_device.py` crashes for me
2026-03-17 06:02:53 -04:00
qazal
e7c26b6319
viz: rename to Start Cycle for the sqtt graph ( #15320 )
2026-03-17 18:53:06 +09:00
nimlgen
e89a103984
remove dmaref ( #15321 )
...
* remove dmaref
* imports
2026-03-17 17:52:09 +08:00
chenyu
3090d4a6e0
disallow reshape from None shape [pr] ( #15322 )
...
test_multigpu_clip_score works without it now
2026-03-17 05:46:53 -04:00
nimlgen
a50fdb0528
nvcc macos ( #15308 )
...
* fix nvcc install macos
* um
* arm
* per
* tm
2026-03-17 17:25:33 +08:00
George Hotz
9d95321be3
set allow_implicit=False by default ( #15319 )
...
* set allow_implicit=False by default
* modernize beautiful mnist
2026-03-17 17:14:38 +08:00
nimlgen
e1c2d09720
system: rebar to remote devs ( #15316 )
2026-03-17 16:09:12 +08:00
chenyu
79d2e83853
tighter ALU/variable min==max -> CONST rule [pr] ( #15317 )
...
only check Ops that can be simplified through this rule. halved the time for that rule in `PYTHONPATH=. TRACK_MATCH_STATS=2 python3 -O test/external/external_benchmark_schedule.py`
2026-03-17 03:44:24 -04:00
George Hotz
584ec75aa2
precompile backward ( #15311 )
...
* add precompile backward support
* cleanups
* fix
* compact grad
* split v not split
* simpler
* no NOOPT
2026-03-17 15:28:40 +08:00
chenyu
6b6d1814ca
update no_vectorized_index [pr] ( #15313 )
...
combine no_vectorized_index and no_vectorized_index_broadcast
2026-03-17 03:05:23 -04:00
b1tg
856a839efc
llm: fix qwen3 moe topk renormalization ( #15201 )
2026-03-17 12:57:33 +08:00
chenyu
1283b57b4e
update fix_store_after_hazard ( #15309 )
...
actual gate is just not CONTIGUOUS, also don't need to check against full backward_slice
2026-03-16 23:55:59 -04:00
Christopher Milan
575b40b93a
determine image shapes before index devectorization ( #15304 )
2026-03-16 23:16:33 -04:00
George Hotz
3ff03be413
call always has tuple ( #15297 )
...
* call always has tuple
* fix pre-commit and simplify
* update
* fix
* move that assert
* tuple
* fix multi
* cleanups
* fix merge
2026-03-17 10:58:46 +08:00
chenyu
1b8b151195
simpler Tensor.assign ( #15302 )
2026-03-16 22:37:25 -04:00
wozeparrot
674c760974
embedded bwd vocab shard ( #15001 )
...
* fix: remove more multi from call
* feat: embedding bwd vocab sharding
* clean: unused import
* clean: don't actually need this pattern
2026-03-16 19:37:16 -07:00
Christopher Milan
62bfd48d95
smarter padding in image_conv2d ( #15289 )
2026-03-16 22:17:48 -04:00
chenyu
e1fab4d2a9
UOp.store is always void [pr] ( #15301 )
2026-03-16 21:58:05 -04:00
chenyu
02afb45f29
remove UOp.assign [pr] ( #15300 )
...
* remove UOp.assign [pr]
it's all store and after, UOp is immutable
* fix test
2026-03-16 21:45:41 -04:00
qazal
33bd33e783
sqtt: add CDNA ops enum, show in viz ( #15140 )
2026-03-17 09:38:42 +09:00
chenyu
3e2b7803e6
view assign replaces at buffer identity ( #15298 )
...
matches what functions capture
2026-03-16 19:58:38 -04:00
qazal
346596cdce
viz: nanoseconds time axis in sqtt ( #15299 )
...
* ui
* secondaryTick is optional
* shader markers data
* instSt infra
* path forward
* details
2026-03-17 07:20:18 +09:00
nimlgen
1bc4cb254c
signed tinygpu as default ( #15296 )
...
* signed tinygpu as default
* f
* no sip
2026-03-16 19:29:41 +08:00
Christopher Milan
0de519c7c2
[pr] fewer simplify calls in image_fixup ( #15283 )
2026-03-16 06:57:52 -04:00
nimlgen
27e29127b5
system: remote prereqs ( #15290 )
...
* x
* new format for apl
* this
* typing
* rpc
* tuple
* linter+new tinygpu
2026-03-16 18:45:41 +08:00
chenyu
837b06c609
style cleanups in allocations.py [pr] ( #15295 )
2026-03-16 05:45:24 -04:00
George Hotz
476276f4b4
support grads on tuples ( #15287 )
...
* support grads on tuples
* simpler
* grad_fxn works
* cleanups
* unused
2026-03-16 17:39:34 +08:00
chenyu
20799df10b
remove Ops.ASSIGN [pr] ( #15294 )
...
goodbye
2026-03-16 05:22:21 -04:00
chenyu
b3378e7022
UOp.assign is store+after [pr] ( #15292 )
2026-03-16 04:51:50 -04:00
George Hotz
2e1c81c23f
allow_implicit to disable implicit params ( #15291 )
...
* allow_implicit to disable implicit params
* get both Tensor and UOp
* no implicits in llm
2026-03-16 16:40:14 +08:00
chenyu
a0d1444790
Tensor.assign is store+after [pr] ( #15288 )
...
* Tensor.assign is store+after [pr]
* put that back
2026-03-16 04:04:55 -04:00
George Hotz
08662bc4ab
add TUPLE/GETTUPLE, simple tests pass ( #15286 )
...
* simple tuple stuff passes
* resolved
2026-03-16 15:06:02 +08:00
nimlgen
e7705fe311
system: pcidev doesn't care about bars ( #15284 )
2026-03-16 14:45:43 +08:00
nimlgen
ff0bcc8de0
system: iface p1 changes ( #15278 )
2026-03-16 10:48:25 +08:00
qazal
4445f50356
viz: variable duration rdna barriers ( #15277 )
...
* viz: variable length rdna barriers
* work
* tiny changes
* simple wave simd test
* small wave sync test
* good multi barrier bug find
* simple fix
* wave_sync asserts
* rdna4 work
* more rdna4
* find more bugs in my model
* it's so much simpler
* wave_sync tests duration
* r4
* should just call this rdna4
2026-03-16 06:06:19 +09:00
qazal
5cd1daa3bc
cdna asm_gemm in one file, remove old rdna3 asm ( #15281 )
2026-03-16 04:32:30 +09:00
chenyu
cd14e8e64b
allocations contiguous is store+after ( #15280 )
2026-03-15 11:58:40 -04:00
qazal
7b6211fdd7
sqtt: remove discover_ops script ( #15279 )
2026-03-15 22:17:06 +09:00
wozeparrot
473e5e4368
feat: make USE_ATOMICS embedding bwd faster ( #15151 )
2026-03-14 21:21:10 -07:00
qazal
3858bfc83d
sqtt: CDNA inst decodes ( #15274 )
...
* sqtt: CDNA inst decodes
* JUMP packets other way
* cdna insts
* r3
* r4
* lds from simd1 and simd2
2026-03-14 21:03:46 +09:00
Christopher Milan
d753c5d7e5
IMAGE=1 image_conv2d pads for bank conflicts ( #15252 )
2026-03-14 07:59:16 -04:00
Christopher Milan
9047249a7c
m.where(x.pad_to(m.shape), Invalid) ranges shrink ( #15275 )
2026-03-14 07:26:36 -04:00
nimlgen
f392c53c66
system: merge remote into pciiface ( #15273 )
...
* system: merge remote into pciiface
* cleaner
* move
* mypy
* fix
2026-03-14 18:44:20 +08:00
chenyu
13eec8fbe8
remove unused assign rules [pr] ( #15268 )
2026-03-14 05:37:49 -04:00
Christopher Milan
dabdc986df
shrink guarded ranges, try 2 ( #15272 )
2026-03-14 04:24:05 -04:00
Christopher Milan
7cf4b16c91
Revert "shrink guarded ranges" ( #15271 )
2026-03-14 03:44:38 -04:00
Christopher Milan
d9951e2f8e
shrink guarded ranges ( #15263 )
2026-03-14 03:38:48 -04:00
qazal
43ffd66fda
viz: oneline inst list ( #15269 )
...
* viz: oneline inst list
* save 5 chars
* gradual padding
2026-03-14 15:37:18 +09:00
George Hotz
86f17468ed
store in spec + USB BOT fix ( #15265 )
...
* move spec to store
* usb bot flag
* Revert "usb bot flag"
This reverts commit 7b8b7824f0 .
* fix assert
2026-03-14 13:25:05 +08:00
George Hotz
06d7cddb33
amd_copy_matmul is cleaner ( #15248 )
...
* amd_copy_matmul is cleaner
* it runs
* replicated stuff
* add tid there
* it runs
* cleanup
* x.src[1]
* flatten
* move that
* keep that assert
2026-03-14 12:56:09 +08:00
chenyu
b3600e4774
don't emit assign in transform_precompiled_call [pr] ( #15262 )
2026-03-13 22:42:35 -04:00
qazal
4d60312f7f
viz: asm python dsl syntax highlighting ( #15259 )
2026-03-14 06:37:43 +09:00
qazal
6209ddfc90
viz: improve disasm of s_code_end ( #15258 )
...
* viz: improve amd disasm of s_code_end
* better tests
* order was good
2026-03-14 03:31:14 +09:00
wozeparrot
a191ac0566
llama: use mlperf model ( #15257 )
2026-03-13 08:08:32 -07:00
Sieds Lykles
4b59083d7c
assign into empty works ( #15256 )
2026-03-13 10:24:29 -04:00
qazal
60b1b908c6
sqtt: CDNA layout header packet is the same size ( #15255 )
2026-03-13 22:28:24 +09:00
nimlgen
4e21735f31
system: update tinygpu app ( #15247 )
2026-03-13 20:36:57 +08:00
nimlgen
1fbe1fef2c
move write_configs to drivers ( #15253 )
2026-03-13 19:02:34 +08:00
chenyu
018c01508d
test case for call precompile multi ( #15254 )
2026-03-13 06:28:43 -04:00
nimlgen
bc16f80b50
am: remove dma_regions param ( #15251 )
...
* am: remove dma_regions param
* linter
2026-03-13 18:12:48 +08:00
chenyu
576e7f985f
remove handle_assign_mops [pr] ( #15249 )
2026-03-13 01:53:21 -04:00
Christopher Milan
c251fc67c5
ci: consider arch in venv and apt caches and go back to 3.12 ( #15250 )
2026-03-13 00:36:49 -04:00
Christopher Milan
d4b947ea9a
ci: explicitly request python 3.12.10 instead of 3.12 ( #15246 )
...
3.12.10 is the most recent 3.12 version that has toolcache builds for linux, macos, and windows
2026-03-12 23:00:46 -04:00
George Hotz
a7d2429c21
amd_uop_matmul more cleanups ( #15240 )
2026-03-13 10:24:43 +08:00
qazal
d893b14193
sqtt: update cdna packet names ( #15243 )
...
* sqtt: update cdna packet names
* change
* order
2026-03-13 08:49:09 +09:00
wozeparrot
749162bd2f
llama memory tweaks ( #15223 )
2026-03-12 12:36:23 -07:00
qazal
9a7173b7a0
viz: visualize full range of shader clock frequency, auto zoom to kernel range ( #15225 )
...
* start this
* work
* rm those
* relative to start cycle
* cleanup
* cover the full range of packets
* correct event type
* start the ui change
* fit=true
* better
* always the zoom identity
* diff cleanup
* shader engine itrace can be turned off
2026-03-13 00:07:31 +09:00
chenyu
d9c09397c0
Ops.STORE is shapeless [pr] ( #15239 )
2026-03-12 09:05:30 -04:00
nimlgen
d746ccb791
system: fix vfio ( #15235 )
2026-03-12 18:31:00 +08:00
nimlgen
d104a903f8
system: print output when err ( #15230 )
2026-03-12 18:30:49 +08:00
George Hotz
e560a46f59
update amd_uop_matmul ( #15236 )
...
* update amd_uop_matmul
* use custom kernel
* simpler
* ignore
2026-03-12 17:33:12 +08:00
chenyu
90b7f4341d
failed two level divmod recombine case ( #15233 )
2026-03-12 04:04:36 -04:00
chenyu
8b8d9a443c
remove unused invalid rules [pr] ( #15231 )
2026-03-12 03:10:34 -04:00
George Hotz
bdd62fd484
remove unneeded realize map entries ( #15229 )
...
* remove unneeded realize map entries
* not that
2026-03-12 14:23:19 +08:00
chenyu
842c978df3
remove staticmethod dtypes.max/min ( #15227 )
...
always use x.dtype.max/min
2026-03-11 23:11:24 -04:00
b1tg
18dc77ccab
add fp8 fnuz dtypes with PYTHON backend support ( #14945 )
...
* add fp8 fnuz dtypes with PYTHON backend support
* rm emu related change
* clarify fp8 fnuz zero handling
* Revert "rm emu related change"
This reverts commit efa4763c22 .
---------
Co-authored-by: b1tg <b1tg@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
2026-03-11 22:30:18 -04:00
George Hotz
4f3f55328b
do not patch on invalid tensor tests ( #15226 )
...
* do not patch on invalid tensor tests
* cleanup
2026-03-12 09:35:20 +08:00
wozeparrot
4fab320abe
llama: clean ( #15224 )
2026-03-11 13:33:59 -07:00
wozeparrot
05d6d9120a
llama offload null ( #15222 )
2026-03-11 10:04:31 -07:00
qazal
d3eef70162
viz: render shader clock frequency graph ( #15197 )
2026-03-12 01:32:49 +09:00
chenyu
39b0f4bcc1
remove Ops.THREEFRY in remove_bufferize [pr] ( #15220 )
2026-03-11 05:30:33 -04:00
chenyu
6489a6f212
Revert "remove mop_cleanup [pr] ( #15217 )" ( #15218 )
...
This reverts commit 6b50df940a .
2026-03-11 04:17:56 -04:00
chenyu
6b50df940a
remove mop_cleanup [pr] ( #15217 )
...
no kernel diff, i think this was needed due to force_reshape?
test/external/external_benchmark_schedule.py is about the same speed
2026-03-11 03:54:42 -04:00
Christopher Milan
2fb8a7f60f
fix test_invalid_tensor when before values are nan ( #15215 )
2026-03-10 23:51:19 -04:00
chenyu
fce87f19a8
better fold_add_divmod_recombine ( #15214 )
2026-03-10 23:24:22 -04:00
chenyu
df8deec949
test for nest_by_factor selection ( #15213 )
2026-03-10 22:41:31 -04:00
chenyu
be6b0bce1f
variations of (x%c)+(x//c)*c ( #15212 )
...
put those into one function
2026-03-10 22:41:14 -04:00
qazal
a408d90f4f
viz: always detect sqtt packet overlaps, add timeline tests ( #15211 )
...
* test
* work
* it's called CALL, better assert
* qol
* row_ends
2026-03-11 05:32:38 +09:00
nimlgen
d9c7290eb0
nv: nvdec as NVDEC:0 device ( #15209 )
2026-03-10 14:44:50 +03:00
Christopher Milan
25d86ec9e1
start using Invalid in image_conv2d ( #15208 )
2026-03-10 07:11:06 -04:00
chenyu
ecbddfcffe
clean up gcd_with_remainder [pr] ( #15207 )
...
this can operate with int gcd directly and not through UOp
2026-03-10 06:13:20 -04:00
chenyu
bb7888b281
cleanup (x%(k*c))//c and (x%(k*c))%c ( #15206 )
...
these two are in the same family
2026-03-10 05:21:32 -04:00
chenyu
8389a8d7c5
remove_nested_mod can work with negative ( #15205 )
2026-03-10 03:10:08 -04:00
Christopher Milan
ffaafd391a
Invalid in Tensor ( #15154 )
2026-03-10 02:49:54 -04:00
chenyu
68c7c3ca84
divmod test_gcd_with_remainder ( #15204 )
...
test cases for gcd_with_remainder
2026-03-09 23:51:47 -04:00
chenyu
a53187eef7
fix TestPartialAssignToSharedBuffer ( #15202 )
...
bufferize_to_store issue with assign
2026-03-09 23:14:23 -04:00
wozeparrot
525a178966
llama: jit more ( #15199 )
2026-03-10 11:04:59 +08:00
George Hotz
315ad50a1a
make late allreduce the default ( #15125 )
...
Co-authored-by: wozeparrot <wozeparrot@gmail.com>
2026-03-09 17:42:57 -07:00
chenyu
6b354b906d
fold_divmod_general cleanups [pr] ( #15196 )
2026-03-09 19:43:16 -04:00
qazal
02ceeab3a7
viz: ui cleanups from the sqtt real time branch ( #15195 )
...
* label location for packets
* work
* OTHER_* packets always get filtered out
* less
2026-03-10 05:33:53 +09:00
qazal
a615ed8ebe
sqtt: update RDNA timestamp marker fields ( #15194 )
...
* rt:realtime field name, correct RDNA4
* share rdna4 and rdna3
2026-03-10 05:18:47 +09:00
nimlgen
8bd6d270c5
rm ops.encdec ( #15193 )
...
* rm ops.encdec
* x
2026-03-09 18:52:48 +03:00
qazal
81ab499b4b
viz: small ui code cleanups ( #15192 )
...
* less
* more work
* tabulate returns node like colored
2026-03-09 21:17:33 +09:00
chenyu
60215deb60
tiebreak in fold_divmod_congruence ( #15190 )
...
need to try both directions
2026-03-09 03:40:39 -04:00
chenyu
a8d8351e5a
match IDIV and MOD in nest_by_factor ( #15188 )
2026-03-09 00:50:38 -04:00
Christopher Milan
7592622562
fix QCOMCLRenderer pickle ( #15189 )
2026-03-09 00:36:16 -04:00
Christopher Milan
2bb0970512
QCOM CL compiler prints LLVMIR when DEBUG>=8 ( #15187 )
2026-03-09 00:15:20 -04:00
chenyu
83b80da8f3
even more divmod recombine ( #15163 )
2026-03-08 23:52:26 -04:00
chenyu
82f7734501
use backward_slice in reduce_mul_chain [pr] ( #15186 )
2026-03-08 21:44:53 -04:00
qazal
25e82a9aca
viz: exclude redundant traceback from SDMA ( #15185 )
...
* viz: exclude redundant traceback from SDMA
* ctx
* cpu_profile
2026-03-09 05:12:14 +09:00
nimlgen
6ac99fd4c9
memplanner opt copy bufs ( #15110 )
...
* mtp
* x
* tests
* ss
* simp
* less slop
* x
* cleaner
* rm
* m
* c
* x
* f
2026-03-08 22:28:01 +03:00
nimlgen
633264feae
am: flush sdma pipeline ( #15184 )
...
* am: flush sdma pipeline
* f
* f
* fix
2026-03-08 20:27:56 +03:00
b1tg
891a73befc
llm: fix chunked prefill ( #15182 )
...
* llm: fix chunked prefill
* less lines
---------
Co-authored-by: b1tg <b1tg@users.noreply.github.com>
2026-03-07 22:08:31 +08:00
chenyu
5d58b1c396
don't use intermediate dict in onnx parse ( #15181 )
...
also don't parse fields that are never used
2026-03-07 00:08:03 -05:00
nimlgen
086081e35b
tbgpu: add stapler to the script ( #15180 )
2026-03-07 00:07:27 +03:00
qazal
a03f512147
viz: clean up old / unused paths in sidebar rendering ( #15179 )
...
* src is unused
* less
2026-03-07 05:36:10 +09:00
chenyu
605b37c03f
use backward_slice in count_divmod [pr] ( #15178 )
2026-03-06 14:03:53 -05:00
Ananta Ranganathan
5bdad8ee41
update mxfp4 tests to use the same patterns as the others ( #15177 )
...
* update mxfp4 tests to use the same patterns as the others
* fix typo in test call not sure how it committed
2026-03-06 13:21:40 -05:00
qazal
d85109f9f7
viz: walk PROGRAM UOp back to source and binary only ( #15174 )
...
* work
* simpler
2026-03-07 01:39:07 +09:00
Ananta Ranganathan
5c50035e0d
avoid using arithmetic for mxfp4 ( #15172 )
...
* avoid using arithmetic for mxfp4
* update tests to use assert equal
* no longer todo
2026-03-06 11:17:56 -05:00
qazal
f064db0ac6
viz: later tooltip rendering ( #15170 )
2026-03-06 23:00:15 +09:00
Roelof van Dijk
4ed8bb7445
tie break for divmod ( #15169 )
2026-03-06 08:05:38 -05:00
qazal
83f1faa142
sqtt: update CDNA wave packet field, start unskipping tests ( #15168 )
...
* correct field names
* packet types
* packet 5 is regc
* test skips
2026-03-06 21:37:44 +09:00
Christopher Milan
7810be8d3c
compile QCOM without opening device ( #15165 )
...
Co-authored-by: Comma Device <device@comma.ai>
2026-03-06 06:24:27 -05:00
George Hotz
6fd18ef875
rename CAT to VCAT ( #15167 )
2026-03-06 18:46:28 +08:00
Roelof van Dijk
059c6326c0
metal uint32 icb offset overflow ( #15156 )
...
* metal uint32 icb offset overflow
fix: diff
supports_exec_item
GraphRunner.supports_exec_item
tests
fix: can't import on non-metal
stricter
* also test the non-metal buffer case
* imports on non-mac
2026-03-06 00:54:39 +03:00
chenyu
da61088ca4
more divmod recombine ( #15162 )
2026-03-05 12:53:22 -05:00
chenyu
167a1d56a6
improve divmod folding ( #15148 )
...
canonicalize to div than mod which enables more simplification
2026-03-05 10:07:36 -05:00
Christopher Milan
b824579e4d
simplify image_conv2d pitch alignment hacks ( #15158 )
2026-03-05 07:17:34 -05:00
qazal
5bf542469d
viz: python traceback for USER device ( #15160 )
...
* start
* ux
* unittests
2026-03-05 20:22:09 +09:00
Roelof van Dijk
d65923bda5
tensor.py: add normalize function ( #15159 )
...
* tensor.py: add normalize function
* p==0 should match torch
2026-03-05 18:55:53 +08:00
wozeparrot
4544da1c54
llama3 fixes part3 ( #15152 )
2026-03-05 01:17:54 -08:00
Roelof van Dijk
fc0534910c
q5k is like q4k ( #15155 )
2026-03-05 17:02:49 +08:00
Ananta Ranganathan
8ef656324e
FIXED TEST Q5_K GGUF dequant ( #15147 )
...
* q5_k gguf support as separate pr
* fix the problematic gemv test for q5_k
* add assert to make sure the gemv test cant fail with warning instead of error
2026-03-05 16:32:36 +08:00
George Hotz
e97922a57c
LLM speedup with two jits, prefill/rollout ( #15153 )
...
* START_TIME
* print cleanup
* fix tests
2026-03-05 16:21:09 +08:00
wozeparrot
be23772d43
llama3 fixes part2 ( #15150 )
2026-03-04 23:43:50 -08:00
wozeparrot
0c769289eb
llama3: more scripts ( #15107 )
2026-03-04 22:18:03 -08:00
George Hotz
fb43b415f9
fix symbolic shape call + chunked prefill ( #15149 )
...
* fix precompile for symbolic shape
* chunked prefill
* cleaner
* test that
2026-03-05 14:02:26 +08:00
George Hotz
8a82b26522
llm: print the prefill cache size ( #15146 )
...
* print the llm prefill cache size
* mock that too
2026-03-05 12:13:28 +08:00
chenyu
b5370fd52d
use copy_multi in alu_multi [pr] ( #15143 )
...
* use copy_multi in alu_multi [pr]
* copy to anything
2026-03-04 22:53:00 -05:00
George Hotz
72a9ed6e23
fix render depth bug + add warmup to serve + no realize default ( #15144 )
...
* fix render depth bug + add warmup to serve
* make realize not the default
2026-03-05 11:21:16 +08:00
George Hotz
ac1847cbf7
fully symbolic llm ( #15097 )
...
* work
* llm symbolic (almost)
* work
* revert that
* llm sym
* works
* cleanups
* cache tokens with the kv cache
* cleanups
* cleanups
2026-03-05 10:22:11 +08:00
qazal
33a1970045
sqtt: simplify inst mapping, validate JUMP processing in CI ( #15139 )
...
* jump cleanup
* assert there's a JUMP
* new example for JUMP
* regenerate examples
* rdna4 work
* new packets
* work
* less for branch handling
* less verbose
* fix err message
2026-03-05 09:53:12 +09:00
chenyu
04da527a7a
minor div_and_mod_symbolic cleanups ( #15138 )
2026-03-04 19:05:44 -05:00
chenyu
106d18b792
use UOp methods in allreduce.py [pr] ( #15137 )
...
except the one line with Ops.BUFFER and Ops.NOOP, not sure what that's for
2026-03-04 17:15:33 -05:00
chenyu
34594bcaaf
Revert "bug in metal: offset is stored as uint32, overflow ( #15129 )" ( #15136 )
...
This reverts commit 9c58db16fa .
2026-03-04 16:54:42 -05:00
Roelof van Dijk
9c58db16fa
bug in metal: offset is stored as uint32, overflow ( #15129 )
...
* metal uint32 icb offset overflow
* fix: diff
* supports_exec_item
* GraphRunner.supports_exec_item
* tests
* fix: can't import on non-metal
2026-03-04 22:52:12 +03:00
chenyu
4cce283790
relax test_tqdm_perf ( #15134 )
2026-03-04 12:58:47 -05:00
chenyu
fae400d300
update assign tests to also test the expected behavior ( #15132 )
2026-03-04 11:34:43 -05:00
chenyu
1f96cc2b51
update non-contiguous buffer error message [pr] ( #15131 )
...
* update non-contiguous buffer error message [pr]
also cleaned up the tests
* order
2026-03-04 11:13:26 -05:00
nimlgen
563d5c3211
more graph tests ( #15130 )
2026-03-04 19:01:12 +03:00
nimlgen
cdc48da9cd
hevc: assert and speed ( #15122 )
...
* hevc: assert and speed
* simpler
2026-03-04 19:01:02 +03:00
wozeparrot
4e9b85ecfd
fa: pull inputs out of call ( #15127 )
2026-03-04 03:15:49 -08:00
George Hotz
47faa2d7b4
hotfix: llm kv cache uses clone instead of realize to avoid many realize
2026-03-04 19:07:03 +08:00
George Hotz
8ebd24637b
fix fa forward building with clang 22 ( #15124 )
...
* fix fa forward building with clang 22
* fix: override rocm path
---------
Co-authored-by: Woze Parrot <wozeparrot@gmail.com>
2026-03-04 02:32:25 -08:00
Christopher Milan
592f9bf6c6
set OPENPILOT_HACKS=1 to enable replace assign ( #15123 )
2026-03-04 05:26:04 -05:00
wozeparrot
df23057984
fa: change bwd grid dim + unshuffle using mops ( #15068 )
2026-03-04 01:23:40 -08:00
Christopher Milan
5623cea7b1
move openpilot contiguous hacks to schedule ( #15120 )
2026-03-04 03:04:06 -05:00
wozeparrot
759c7fc81c
failing test for allreduce memory usage ( #15106 )
2026-03-03 23:38:38 -08:00
George Hotz
5ecfe549e7
allreduce is a function with LATE_ALLREDUCE=1 ( #15119 )
...
* allreduce as a function
* allreduce function
* support allreduce function
* LATE_ALLREDUCE
2026-03-04 15:17:58 +08:00
Christopher Milan
e7e70a3c95
simplify idx before counting backward_slice ( #15117 )
2026-03-03 23:53:50 -05:00
George Hotz
2d72a4a90c
fix copying padded const ( #15116 )
...
* fix const padding cpu
* remove comment
2026-03-04 10:39:45 +08:00
chenyu
b5ebb4d06d
contiguous_view_offset returns only offset [pr] ( #15113 )
...
size is always input.size
2026-03-03 15:23:39 -05:00
nimlgen
abd830b260
am: setup_rinf returns only doorbell ( #15112 )
2026-03-03 19:27:41 +03:00
nimlgen
4b42bb54aa
am: reset sdma to start from 0 ( #15109 )
2026-03-03 18:14:46 +03:00
George Hotz
01ddb4c267
add precompile to call ( #15099 )
...
* add precompile to call
* put get back
* something
* after structure
* alt
* keep it call
* resolve call
* resolve linear call
* precompile works with llm
* revert rangeify
* color for debugging
* getenv PRECOMPILE
* clean up deco pattern
* fully recursive sink scheduling
* revert llama
* fix SPEC=2
2026-03-03 22:32:42 +08:00
qazal
c7f908b788
sqtt: fix rdna4 structs ( #15111 )
...
* work
* DEBUG=2
2026-03-03 23:32:14 +09:00
qazal
8dd691761d
sqtt: remove old files ( #15108 )
2026-03-03 22:43:24 +09:00
Christopher Milan
de043226ba
benchmark comma usbgpu driving_vision step and load time ( #15103 )
...
Co-authored-by: Comma Device <device@comma.ai>
2026-03-03 06:08:03 -05:00
Christopher Milan
5f6b610da1
FLOAT16 logic for IMAGE==1 goes back to image_conv2d ( #15105 )
2026-03-03 05:37:57 -05:00
wozeparrot
529318259c
fix: fix null tests to actually use null device ( #15104 )
2026-03-03 02:05:47 -08:00
George Hotz
7d025089e3
no after removal ( #15102 )
...
* no after removal
* we are using walk
* null schedule test
* pytest deps
* Revert "pytest deps"
This reverts commit 5e1c5304ec .
* Revert "null schedule test"
This reverts commit 02da66053e .
* clean null tests
2026-03-03 17:50:31 +08:00
wozeparrot
92c16810ac
feat: per device mem_used ( #15100 )
2026-03-03 01:31:28 -08:00
qazal
e3a0598d0b
viz: the whole pc should be in view ( #15101 )
2026-03-03 17:17:53 +09:00
b1tg
a9ea36de79
assembly/amd: v_cmp_lg_f32 is ordered not-equal ( #14982 )
2026-03-03 15:37:48 +08:00
wozeparrot
c35de9bd68
asm_gemm: support more sharding ( #15002 )
2026-03-02 23:16:37 -08:00
wozeparrot
824ba4386a
llama3 dp fix ( #15098 )
2026-03-02 22:43:07 -08:00
chenyu
5dcf29b1a0
use clone in test_swap_slices ( #15096 )
2026-03-02 22:05:12 -05:00
Christopher Milan
c70e8af068
move IMAGE FLOAT16 logic to allocations ( #15095 )
...
* FLOAT16 logic in allocations
* cleanup
* separate that
* only apply when IMAGE == 1
* test passing now
* create image buffers earlier
2026-03-02 22:00:05 -05:00
George Hotz
d483e4153a
buffer view is like buffer ( #15082 )
...
* buffer view is like buffer
* fix
* swap_reshape_shrink
* contiguous on gguf, fix overlap
* revert that
* _device_supports_view
* this
* fix that test
* 0 buffers
* that test was wrong
* this
* check correct size
* contig BUFFER_VIEW
* this
* fix tests
* buffer view tests
* om
* fix torch
* no MOCKGPU
* skip
2026-03-03 09:52:33 +08:00
qazal
62ee976c1b
gemm/asm: cleanup repeated patterns to helper functions ( #15094 )
2026-03-03 08:14:47 +09:00
qazal
848f5cea96
viz: sqtt instruction packet trace ( #15065 )
2026-03-03 07:55:04 +09:00
chenyu
14d1c5fdfd
assign fusion tests on detach and contiguous_backward ( #15092 )
2026-03-02 15:21:51 -05:00
nimlgen
dfa180413d
tbgpu: sign nv ( #15087 )
2026-03-02 22:58:30 +03:00
chenyu
71f228f80f
test exact kernel count in torch_backend/test_kernel_fusion ( #15091 )
2026-03-02 14:26:32 -05:00
chenyu
f80b1033c5
simpler Tensor.all ( #15089 )
...
same generated kernel
2026-03-02 11:08:55 -05:00
chenyu
4008f7d4e8
move Tensor.one_hot +1 to python ( #15088 )
2026-03-02 10:56:41 -05:00
nimlgen
dafbe9733a
am: cleanup ( #15086 )
2026-03-02 17:06:21 +03:00
qazal
f7aeff6061
viz: cli.py cleanups, do not require PYTHONPATH ( #15085 )
...
* cleanup the print
* sys.exit
* equal check
* cleanup unpacker
* cli doesn't need PYTHONPATH
* no semicolons
* %s/PYTHONPATH=. //g
2026-03-02 19:24:38 +09:00
George Hotz
5ff278446c
add contiguous_view_offset ( #15084 )
...
* add contiguous_view_offset
* no int
2026-03-02 18:05:04 +08:00
Christopher Milan
977c270774
IMAGE=1 kernel count failing tests ( #15083 )
2026-03-02 04:35:26 -05:00
George Hotz
3539693555
Support triu variable on diagonal + SDPA symbolic ( #15081 )
...
* triu variable
* fails
* dumbbb
* no commutative in reshape
* real fix
* revert that
* sdpa symbolic tests
2026-03-02 12:19:48 +08:00
wozeparrot
a4f6365929
llama3: fstep takes grads ( #15069 )
2026-03-01 20:05:07 -08:00
Nick
8e8e9f6ff6
assert removal for _tri() + tests ( #15073 )
...
* assert removal for _tri() and tests
* removed import
* tests triu/tril like in prefill
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2026-03-02 10:34:28 +08:00
nimlgen
ccbbca05ef
beam: add dev_timeout for am ( #15063 )
...
* beam: add dev_timeout for am
* all covered
* fk
* x
* fuzz
* reset
* f
2026-03-01 16:57:29 +03:00
chenyu
8cb4368967
delete unused END NOOP rule [pr] ( #15077 )
2026-03-01 00:09:05 -05:00
chenyu
efce99adc9
skip isComposing key press in llm.py ( #15076 )
...
for the CJK input user
2026-02-28 20:31:53 -05:00
chenyu
103ea16ec0
add contiguous back to svd ( #15074 )
...
can cause infinite loop
2026-02-28 16:49:26 -05:00
chenyu
fe0fa8333b
Revert "improve Tensor.sort indices ( #15070 )" ( #15072 )
...
This reverts commit e3003631f2 .
2026-02-28 14:40:30 -05:00
chenyu
e3003631f2
improve Tensor.sort indices ( #15070 )
...
* improve Tensor.sort indices
instead of N^2 match at the end, have an arange to start and go through the same N(logN)^2 path
* contiguous
2026-02-28 14:16:16 -05:00
wozeparrot
cfc5cf65ad
llama3: vocab padding fix + jit copies on fakedata ( #15067 )
2026-02-28 08:44:55 -08:00
chenyu
76170d035a
relax atol for test_xlm_roberta_large ( #15066 )
2026-02-28 11:22:35 -05:00
qazal
cfb8e6922d
viz: arrow keys move through time ( #15064 )
...
* work
* automatic zoom, keeping scale
* the whole shape should be out of view
2026-02-28 23:52:36 +09:00
nimlgen
9b3450c9da
test gpu crash on cdna ( #15062 )
2026-02-28 13:17:59 +03:00
nimlgen
6bbf813dd3
ci: switch to tinygrad/amdcomgr_dylib ( #15061 )
2026-02-28 13:09:39 +03:00
nimlgen
77846300b2
am: reset vm fault ( #15060 )
2026-02-28 12:58:56 +03:00
George Hotz
dc54441e1f
add better printing to tinygrad.apps.llm ( #15059 )
...
* add better printing to tinygrad.apps.llm
* add gc.collect
* comment
2026-02-28 16:38:50 +08:00
George Hotz
bb84e389cf
functions for llama trainer ( #15045 )
...
* functions for llama trainer
* function there
* axis match
* fix multi
* lil cleaner
* there's a bug with HK_FLASH_ATTENTION
* training functions
* for commit
2026-02-28 12:15:18 +08:00
chenyu
9b4ba3f838
remove ReduceContext.range_to_ends [pr] ( #15055 )
...
* remove ReduceContext.range_to_ends [pr]
make merge_reduce_ends pure. this state is causing issues when introducing more reduce merging rewrites
* tag
2026-02-27 22:15:44 -05:00
chenyu
151608aa90
update test_multiple_to_single_device ( #15056 )
...
follow up to #14482 , add SCACHE=0 to the test
2026-02-27 21:44:33 -05:00
chenyu
5fd06f4f02
differentiable setitem ( #15054 )
...
* differentiable setitem
go through the where path for bw
* no return
2026-02-27 17:25:15 -05:00
chenyu
db6b3e1edc
fix mixed setitem with both basic and tensor indexing ( #15050 )
2026-02-27 15:35:48 -05:00
chenyu
c9f6d8751b
don't remove_bufferize for Invalid ( #15053 )
...
* don't remove_bufferize for Invalid
* replaced
2026-02-27 15:16:09 -05:00
qazal
b8a55d5f68
sqtt: new packet types, add discovery script ( #14960 )
2026-02-28 04:27:27 +09:00
nimlgen
4e12fc3fe6
am: mi3xx recovery ( #15051 )
2026-02-27 22:10:47 +03:00
chenyu
81a35cef38
rearrange Tensor.getitem code ( #15049 )
...
no-op change to prepare setitem fix
2026-02-27 12:57:16 -05:00
chenyu
1406d49eef
failed test cases for advanced setitem ( #15048 )
2026-02-27 10:50:18 -05:00
qazal
ef1017f7ed
viz: skip drawing offscreen tracks in profiler ( #15047 )
2026-02-27 22:19:08 +09:00
qazal
ad99b77f6d
assembly/amd: add gfx12_asm_vflat llvm tests, disasm fixes ( #15046 )
...
* add gfx12_asm_vflat.s
* work
2026-02-27 20:20:31 +09:00
George Hotz
010d2790ce
fix multi minimal ( #15044 )
2026-02-27 14:31:58 +08:00
George Hotz
3e1e12528c
hotfix: disable tinyfs load test
2026-02-27 12:04:41 +08:00
George Hotz
d23b79530e
remove disk from GGUF GEMV test ( #15041 )
...
* remove disk from GGUF GEMV test
* keep copy
2026-02-27 12:03:00 +08:00
chenyu
d345f7f5dc
remove _pending_assigns ( #15040 )
2026-02-26 22:38:10 -05:00
George Hotz
37e31e7da4
gguf gemv test ( #15039 )
...
* add gemv tests
* gguf big
* skip
* make realize optional
2026-02-27 10:54:43 +08:00
Nick
af94bfc401
fix retinanet shared memory race condition in parallel tests ( #15030 )
...
Append PID to shared memory names in batch_load_retinanet to prevent
FileExistsError when pytest-xdist runs multiple test workers that each
call _setup_shared_mem with the same hardcoded name.
2026-02-27 08:36:24 +08:00
George Hotz
2bbf8bbefa
improve call/param rendering ( #15023 )
2026-02-27 08:35:04 +08:00
chenyu
0f94a4bb73
failed test case for early fixup const copy ( #15038 )
...
* failed test case for early fixup const copy
wrong with PAD
* test no copy
2026-02-26 19:09:33 -05:00
chenyu
3a4db53b43
raise RuntimeError in schedule for conflicted var_val [pr] ( #15031 )
2026-02-26 15:16:01 -05:00
qazal
d65db32395
viz: only compute aggregate memory graph, defer n² per buffer graph ( #15029 )
2026-02-27 04:14:51 +09:00
qazal
c61fe57cfd
viz: fix n² tiny device linking in profiler ( #15028 )
2026-02-27 02:25:39 +09:00
qazal
88d650d606
viz: clean up call node detection check ( #15025 )
2026-02-26 19:57:56 +09:00
qazal
1c09890f66
sqtt: map instructions in the command line tool ( #15024 )
2026-02-26 12:34:24 +02:00
George Hotz
fe3ee8c27e
fix symbolic shapes in calls ( #15021 )
...
* fix symbolic shapes in calls
* fix after in the big graph
* real tests
2026-02-26 17:17:18 +08:00
qazal
12d179f5f4
viz: brighter call.src[0] edge color ( #15022 )
...
* work
* 2
* better color
2026-02-26 16:07:22 +09:00
George Hotz
2655655a0c
call gradient creates a call ( #15020 )
...
* function creates a full subgraph
* tests
* fix var
* fix tests
* implicit assign/contig
* move kv init
2026-02-26 14:15:29 +08:00
Christopher Milan
94acd85285
fix typo in nn/__init__.py ( #15019 )
2026-02-25 20:01:32 -05:00
Christopher Milan
e5c0db66d1
num_batches_tracked does not need is_dtype_supported ( #15018 )
2026-02-25 19:50:57 -05:00
George Hotz
3244131f59
update dagre with more recursion fixes ( #15012 )
2026-02-26 08:35:05 +08:00
chenyu
ed9d475a12
assign tests with test_function ( #15015 )
2026-02-25 16:15:59 -05:00
nimlgen
faa66e0a61
mi350 hive_reset am repro ( #15014 )
2026-02-25 21:30:18 +03:00
nimlgen
8983830aa8
am: code style consistency ( #15013 )
2026-02-25 21:30:10 +03:00
George Hotz
0d35b67f2c
revert realize to only be buffers ( #15008 )
...
* revert realize to only be buffers
* fix that
* broken attention
* Revert "broken attention"
This reverts commit a23c3cd96c .
* and that
2026-02-25 22:43:06 +08:00
qazal
35f85c393f
viz: keep recursively nested call collapsed ( #15010 )
2026-02-25 22:45:18 +09:00
qazal
421b1d4a56
viz: monospace font for tags, no dy overrides ( #15009 )
...
* viz: monospace font for tags, no dy overrides
* str
2026-02-25 22:15:31 +09:00
qazal
448e997be4
gemm/asm: cleanup custom function args ( #15007 )
2026-02-25 22:05:56 +09:00
qazal
c58e91942c
viz: support collapsing individual CALL nodes ( #15006 )
...
* all
* contracted all by default
* simple call mask
* work
* minus not hyphen
* color / cleanup
* detail
2026-02-25 21:27:25 +09:00
George Hotz
68831cd852
add more tests to test_function ( #15003 )
...
* add more tests to test_function
* add function to llm
* function decorator on llm
* works
* symbolic fixups
* minimum change
* implicit inputs
* don't actually update llama yet
2026-02-25 18:42:06 +08:00
wozeparrot
d941dd5aeb
llama3: pad vocab when mp sharding ( #14998 )
2026-02-25 00:04:06 -08:00
wozeparrot
e1c9985715
llama3: better time keeping ( #14999 )
2026-02-24 22:42:05 -08:00
Christopher Milan
4a2fc7ecbb
autogen: cache downloads ( #14997 )
2026-02-25 01:34:27 -05:00
George Hotz
e3fa9896b7
start function and add walk rewrite ( #14992 )
...
* start function and add walk rewrite
* work
* add function on feed_forward
* llm progress
* stuff
* none of that
2026-02-25 13:56:27 +08:00
chenyu
fde7a40bb0
allow dtype mismatched assign on disk ( #14993 )
...
reverted #14473 , that was a bad idea. also added a test that safe_save only has copy
2026-02-24 20:49:55 -05:00
chenyu
46d9a9a74f
minor indexing cleanups [pr] ( #14991 )
2026-02-24 16:49:35 -05:00
chenyu
8dae9be573
move realize_map fixup into realize_assign_src [pr] ( #14990 )
2026-02-24 15:51:40 -05:00
chenyu
9d9151a21e
remove const normalization in indexing [pr] ( #14989 )
...
rangeify can create const with device, and all is normalized in to_define_global
2026-02-24 15:09:11 -05:00
chenyu
f68a472244
end range for COPY/BUFFER_VIEW [pr] ( #14987 )
2026-02-24 13:33:35 -05:00
chenyu
e5d27a3773
remove BUFFER_VIEW from ended_ranges special case [pr] ( #14986 )
...
* remove BUFFER_VIEW from ended_ranges special case [pr]
* will fix later
2026-02-24 10:37:29 -05:00
chenyu
5fd4fc0c6d
fix tinyfs ( #14974 )
...
* fix tinyfs
* fix that
2026-02-24 08:50:53 -05:00
George Hotz
8a6dffc87e
Tensor.callify will be the JIT ( #14983 )
...
* close
* simple callify, support linear in the scheduler
* all tests pass
* everyone is happy
* dumb test
* Remove unnecessary blank line in rangeify.py
2026-02-24 18:42:24 +08:00
nimlgen
6f1cb6be86
am: tiny err handling cleanups ( #14981 )
...
* am: tiny err handling cleanups
* x
* x
2026-02-24 12:43:45 +03:00
George Hotz
b643fca51e
clean up complete_create_schedule_with_vars ( #14980 )
...
* clean up complete_create_schedule_with_vars
* transform_to_call
* update viz tests
2026-02-24 16:12:36 +08:00
wozeparrot
8d9545e09e
llama3: correctly shard wqkv ( #14978 )
2026-02-23 23:57:10 -08:00
wozeparrot
a36a26d4ed
llama3: optim does grad acc in correct order ( #14965 )
2026-02-23 22:25:13 -08:00
George Hotz
e2b1f2620d
schedule is linear ( #14975 )
...
* schedule is linear
* cleanup
* cleanups
2026-02-24 11:30:41 +08:00
Christopher Milan
57ade7608a
consider indexing math cost for IMAGE=1 ( #14973 )
2026-02-23 18:57:45 -05:00
chenyu
0bda5585c7
unit test TestTinyFS ( #14972 )
...
these passed before the allocation change
2026-02-23 16:59:39 -05:00
imaolo
405d37423e
call release() in MetalAllocator._free ( #14970 )
...
* add failing test
* call MTLBuffer.release() in MetalAllocator._free()
* Update test_metal.py
---------
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
2026-02-23 23:33:31 +03:00
nimlgen
77db8e1c07
cpu: wait on dep signals ( #14862 )
...
* cpu: task_done() in case of failures
* print
* fix
* x
* f
* x
* um
* ?
* u
* f
* x
* gh
* f
* f
* virt
* x
* simpler
2026-02-23 21:09:41 +03:00
chenyu
127136421d
enable a few WEBGPU isnan tests that work now ( #14967 )
...
* enable a few WEBGPU isnan tests that work now
* still failed
2026-02-23 11:06:08 -05:00
ttomsa
0366474089
Bool cast to cmpne ( #14544 )
...
* test
* rm in llvmir
* rm in ptx and nir
* hmmmm
* rm in decompositions
* skip tests
* add test
* just this
* rm comment
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
2026-02-23 10:31:36 -05:00
George Hotz
806581f807
rename rewrites + sink filter + bump to dagre 2.0.0 ( #14966 )
...
* bump to dagre 2.0.0
* transform to call
* cleanup names
* get kernel graph
* dagre recursion fix + better error
* add toggle to hide sink nodes
* no sink by default
* revert that
* only hide final sinks
* lol
2026-02-23 22:47:22 +08:00
nimlgen
d86f1d66b5
system: apl validate dev_id bounds ( #14964 )
2026-02-23 12:18:03 +03:00
George Hotz
b824490e3f
allocate generates a call ( #14958 )
...
* allocate generates a call
* symbolic works too
* DEFINE_VAR is param
* replace param later
* apply buffers
* name
* upd
* this was a bug...
2026-02-23 15:59:20 +08:00
wozeparrot
dd8302a6d0
fix: optim device is never none here ( #14963 )
2026-02-22 23:34:57 -08:00
wozeparrot
25565b2410
fa: test for mp ( #14907 )
2026-02-22 21:47:36 -08:00
qazal
d6145736c7
sqtt: examples generator changes from inst_discovery ( #14961 )
...
* sqtt examples generator changes from inst_discovery
* rdna4
* rdna3
* cdna
* sad reality for mi300x
2026-02-23 14:42:48 +09:00
George Hotz
3acd763684
simple call in allocate ( #14962 )
...
* allocate generates a call
* symbolic works too
* add min/max to PARAM
* revert viz
2026-02-23 13:34:20 +08:00
George Hotz
f45199269b
hotfix: regress NV cifar_10steps_half to 120 ms
2026-02-23 12:29:25 +08:00
George Hotz
677145b393
all consts have shapes ( #14959 )
...
* all consts have shapes
* vconst has shape too
* use normal schedule
* cast ptrdtype
* image
* bitcast issue + hack
2026-02-23 10:26:50 +08:00
qazal
1538960002
viz: smaller view for repeated asm instructions in cfg ( #14954 )
...
* simple test
* todo
* feature
2026-02-23 10:41:43 +09:00
George Hotz
226d4a2440
hotfix: code DEBUG=1 defensively
2026-02-23 08:44:54 +08:00
chenyu
4424757b9a
update test_sharded_memory ( #14956 )
...
cleaned up and moved to test/null
2026-02-22 16:56:08 -05:00
b1tg
f9b7493e7a
cleanup fp8 conversion helpers and fp8 edge-case tests ( #14953 )
...
Co-authored-by: b1tg <b1tg@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
2026-02-22 09:16:42 -05:00
qazal
60f90dd97c
sqtt: fix jitted program deduping, failing test for graphed kernels ( #14951 )
...
* work
* hcq_profile fix, test with JIT=2 passes
* ci, -n=auto
* rm duplicate test
* less
2026-02-22 15:22:31 +09:00
chenyu
ccfd878e0f
minor fix_assign_hazard improvement [pr] ( #14949 )
...
target.base cannot be s if s.op is a movement
2026-02-21 21:21:28 -05:00
chenyu
24e8919438
raise explicitly for test_crossunder_assign ( #14948 )
2026-02-21 21:21:13 -05:00
chenyu
acf8f6b287
faster fix_assign_hazard [pr] ( #14947 )
...
one toposort. `time NULL_ALLOW_COPYOUT=1 MNISTMOCK=1 PYTHONPATH="." NULL=1 DEFAULT_FLOAT=HALF BENCHMARK=10 BS=256 GPUS=1 MODEL=resnet python3 examples/mlperf/model_train.py` 150s -> 40s
2026-02-21 19:42:13 -05:00
chenyu
9764e2561c
more assign into unrealized silent fail cases ( #14944 )
2026-02-21 18:12:57 -05:00
nimlgen
6de15dc480
mockam usb ( #14916 )
...
* mockam usb
* f
* win
* x
* x
2026-02-21 23:05:54 +03:00
chenyu
0dbcd764ad
a few assign into unrealized failed test case ( #14940 )
2026-02-21 13:18:45 -05:00
wozeparrot
3cda781876
llama optim offload ( #14901 )
2026-02-21 08:53:45 -08:00
chenyu
0255a64a27
update test_jit_init_empty ( #14938 )
...
* update test_jit_init_empty
now it fails silently
* that
2026-02-21 09:01:50 -05:00
George Hotz
8ef5544e4a
realized PYTHON copies ( #14934 )
...
* realized PYTHON copies
* comment that out
* fix that test
* append afters
* contig
* disk copies
* should be 124
* 332
2026-02-21 20:29:31 +08:00
qazal
cf23c2eee7
viz: merge readelfs, clean up toggles UI code ( #14936 )
...
* no extra readelf function
* that node can never be null, display block is wrong fix the css
2026-02-21 19:58:35 +09:00
George Hotz
639224e6e1
no call hack needed anymore ( #14935 )
2026-02-21 18:06:00 +08:00
George Hotz
d3b829a189
print schedule caller with DEBUG=1 ( #14933 )
2026-02-21 16:22:45 +08:00
qazal
8278886cf9
test_profiler cleanup, non flaky cpu_profile test ( #14932 )
...
* test_profiler cleanup, non flaky cpu_profile test
* existing device is okay
2026-02-21 16:58:10 +09:00
George Hotz
06fb35a1e5
don't graph_rewrite into calls ( #14931 )
...
* don't graph_rewrite into calls
* optional
* pm_gate_kernel_sink removed
2026-02-21 15:39:59 +08:00
qazal
c5029fa460
jit case with Tensor.empty input, realized means allocated ( #14930 )
...
* simple failing jit test case with Tensor.empty
* this used to exist in ops.py...
* Revert "removed if self.buffer.is_allocated() in realized (#14836 )"
This reverts commit 72cf603805 .
2026-02-21 16:33:55 +09:00
George Hotz
6533250246
remove more tags stuff ( #14927 )
...
* remove more tags stuff
* remove more
* unique consts aren't needed post tensor
2026-02-21 12:51:53 +08:00
chenyu
0c0d07d330
delete forced_reshape [pr] ( #14926 )
2026-02-20 22:35:31 -05:00
qazal
5b6fcd1cda
gemm/asm: smallest cdna4 asm gemm test ( #14925 )
2026-02-21 11:56:05 +09:00
George Hotz
ad3d821d63
move size 0 logic to allocations ( #14924 )
2026-02-21 09:57:40 +08:00
George Hotz
df7774661a
remove late numbering of UOps ( #14923 )
...
* remove late numbering of UOps
* stupid fix
* dead code
2026-02-21 09:18:48 +08:00
chenyu
c9b706125d
break Tensor.pad into methods ( #14922 )
2026-02-20 20:10:09 -05:00
Christopher Milan
5ee654b0d9
test IMAGE=1 driving_vision in mac pytest ( #14921 )
...
* test IMAGE=1 driving_vision in mac pytest
* don't multiply array
2026-02-20 18:28:10 -05:00
Christopher Milan
815780f72f
cl: fix multi-image arg kernels ( #14920 )
2026-02-20 17:34:17 -05:00
chenyu
24286c5593
fix clone for multi ( #14919 )
...
also update empty_like to make sure it's backed by buffers
2026-02-20 17:21:09 -05:00
chenyu
1fc1508f67
add assign to test_realize_is_realize.py ( #14918 )
2026-02-20 16:48:01 -05:00
chenyu
a4634b253a
fix empty_like for sharded tensor ( #14915 )
2026-02-20 16:30:04 -05:00
chenyu
86e7804d60
correct llm.py mem bw benchmark for moe ( #14626 )
...
only count active experts. verified on olmoe
2026-02-20 16:11:22 -05:00
Nicolas Pinto
aa905db7f7
ptx: use setp.neu for float CMPNE ( #14805 )
...
* ptx: use setp.neu for float CMPNE
* test ptx float CMPNE renders setp.neu
* check NaN behavior, not grep ptx strings...
* skip WEBGPU for test_cmpne_nan (Vulkan NaN behavior)
---------
Co-authored-by: Nicolas Pinto <41171+npinto@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
2026-02-20 16:11:04 -05:00
chenyu
f9536f3cd4
wrap UOp.__float__ with float [pr] ( #14913 )
...
fix warning
tinygrad/test/null/test_uop_resolve.py:56: DeprecationWarning: UOp.__float__ returned non-float (type ConstFloat). The ability to return an instance of a strict subclass of float is deprecated, and may be removed in a future version of Python.
self.assertEqual(float(u), 11.5)
2026-02-20 14:03:53 -05:00
chenyu
697d0b06c2
update env for testmacpytest ( #14912 )
...
CI: ""
CAPTURE_PROCESS_REPLAY: "0"
2026-02-20 13:42:50 -05:00
chenyu
07d145debd
compile3 0.10.1 driving_vision in mac pytest ( #14911 )
...
* compile3 0.10.1 driving_vision in mac pytest
* sync before re-executing onetime kernels
2026-02-20 12:23:52 -05:00
chenyu
d895713116
remove temp onnx migration CI job ( #14910 )
2026-02-20 11:38:44 -05:00
George Hotz
2611907afb
start ripping out old scheduler -- no maps ( #14909 )
...
* start ripping out old scheduler -- no maps
* no more metadata
2026-02-20 21:05:04 +08:00
nimlgen
1b3b94a72a
fix mockam mypy ( #14908 )
2026-02-20 15:15:05 +03:00
George Hotz
55d3a5def9
preallocate all realized buffers ( #14823 )
...
* preallocate all realized buffers
* contiguous
* work
* comment that out
* move to schedule
* better
* correct fix
* just buffer
* disk bufs
* fixes disk tensor stuff
* fix symbolic stuff
* fix multi
* 162 failures
* bugfixes
* don't check that anymore
* fix schedule tests
* mnist should be contiguous
* type and buffer
* fix tests
* shrink axis correction
* mypy fixes
* tests skips
* same 37 failures
* dedup
* no shrink in the graph
* 29 failures
* skips
* fix custom kernel
* fix training
* those optimizations aren't supported currently
* simpler
* more correct
* tests
* 14 failures
* works
* fix that test
* broken
* 11 failures
* only kernel counts left
* fixes
* all tests pass
* remove tensor_map
* op test
* 200 -> 230
* test fixes
* fixes
* revert test_tiny thing
* guard
* revert that
* test tiny passes
* no contigs there
* base realize back
* Revert "no contigs there"
This reverts commit c45bb9fcfd .
* revert that
* chop many assigns
* 12 failures
* fix tests
* tests
* apply after
* pre-commit
* remove old code
* delete that
* fix types
* remove extra contig
* fix dataloader
* torch fix
* disk fix
* update kernel fusion numbers
* runs on amd
* restore kernel count
* add that rule back
* that
* disable that
* wrong
* add the correct rule for that folding
* more tests
* guard c1.arg
* no newlines
* realize those
* split into a different file
* remove detach/contig back
* skip 2
* update that
2026-02-20 20:05:54 +08:00
nimlgen
dbf894215a
init mockam ( #14889 )
...
* mockam
* more tests
* linter
* x
2026-02-20 14:09:11 +03:00
wozeparrot
4b9825c829
make optim _step return update ( #14906 )
2026-02-20 02:43:56 -08:00
George Hotz
6610255654
add the correct rule for gcd div/mod folding ( #14905 )
...
* add the correct rule for that folding
* more tests
* guard c1.arg
2026-02-20 18:11:54 +08:00
George Hotz
a28fc2fba7
hotfix: remove wrong symbolic rule
2026-02-20 17:09:18 +08:00
qazal
28451a5957
viz/sqtt: rdna4 wmma, cleanup inst rows ( #14904 )
...
* valu wmma
* viz/sqtt: rdna4 wmma, cleanup inst rows
2026-02-20 17:02:09 +09:00
qazal
16ae96fa58
finish rdna4 sqtt ( #14903 )
...
* unskip
* it's a wave pair in rdna4
* work
* that
* hidden archive
* generic s_delay, mystery InstOpRDNA4.UNK_60
* branch failing test
* UNK_60 is OTHER_VMEM_STORE
* rdna4 has both s_delay_alu and s_wait_alu
* real branch failing test
* rdna4 doesn't have JUMP_NO, it's NEXT with a flag for no jump
* make inst_delay skips recursive
* all rdna4 tests pass
* simm16 unwraps
* that has a name
2026-02-20 16:06:13 +09:00
qazal
52b51a0324
test fixes from rdna4 sqtt ( #14902 )
2026-02-20 14:42:33 +09:00
qazal
32f569b573
viz/sqtt: decoder fixes pre rdna4/cdna4 work ( #14900 )
...
* viz/sqtt: decoder fixes pre rdna4/cdna4 work
* fix
* branch_inst + more tests
* smaller
2026-02-20 12:10:15 +09:00
qazal
e9ae3da711
viz: click on CALL node goes to codegen ( #14609 )
...
* viz: click on CALL node goes to codegen
* colored name
2026-02-20 11:13:11 +09:00
George Hotz
fc5677c28b
resnet dataloader + more test cleanups ( #14899 )
...
* resnet dataloader
* tests
2026-02-20 10:05:47 +08:00
chenyu
b9744ab62b
one more test_gpudims test ( #14898 )
...
failure from the bad simplification attempt
2026-02-19 18:18:44 -05:00
chenyu
9d6cf00be2
fix gpudim bug and test_split_2d_to_3d ( #14896 )
2026-02-19 16:46:24 -05:00
chenyu
2b31823ef9
update test_gpudims to prove bijectivity ( #14895 )
...
* update test_gpudims to prove bijectivity
* one more
2026-02-19 16:18:59 -05:00
chenyu
19ce7a3f7f
use z3 to verify gpudims output index ( #14894 )
...
found a bug with z3
2026-02-19 15:24:38 -05:00
chenyu
52f727738b
move test_grouped_dims to test/null ( #14893 )
...
it's a pure helper
2026-02-19 14:50:53 -05:00
chenyu
af997c1ea5
use .expr to access variable expr instead of arg[0] [pr] ( #14892 )
...
only apply when it's more readable
2026-02-19 12:24:36 -05:00
chenyu
7400362a86
remove UOp.vars [pr] ( #14891 )
2026-02-19 12:09:39 -05:00
chenyu
f54a49e733
restructure alu_multi [pr] ( #14888 )
2026-02-19 11:11:49 -05:00
chenyu
06ef8a26b7
add a test case that triggers CALL passthrough_multi ( #14887 )
2026-02-19 10:45:40 -05:00
nimlgen
071403f9a1
system: use MAP_FIXED_NOREPLACE ( #14884 )
2026-02-19 18:32:50 +03:00
nimlgen
041dc0cf85
fix typos ( #14886 )
2026-02-19 17:37:15 +03:00
Kartik Vashishta
9a9c7648e9
system: fix pci_scan_bus vendor filter ( #14885 )
...
* system: fix pci_scan_bus vendor filter
* fix: formatting
2026-02-19 17:23:32 +03:00
chenyu
877a5d4c45
improve types and simplify allgather in multi [pr] ( #14878 )
2026-02-19 09:02:15 -05:00
wozeparrot
9317e96881
fa: explicitly pass shapes ( #14857 )
2026-02-19 05:26:16 -08:00
George Hotz
f6c1cf343c
new symbolic rule from prealloc_bufs ( #14883 )
...
* new symbolic rule from prealloc_bufs
* optim
2026-02-19 20:57:30 +08:00
qazal
658c32864a
viz: show event number in track line ( #14882 )
2026-02-19 20:58:37 +09:00
qazal
911399bee5
assembly/amd: move the kernel capture stuff out of helpers ( #14881 )
2026-02-19 16:28:48 +09:00
qazal
1f34ba4511
viz: remove global amd targets mapping ( #14879 )
...
* viz: remove global amd targets mapping
* rename to amd_counters and nv_counters
* diff
2026-02-19 15:31:12 +09:00
George Hotz
2f0f8b5776
more test relaxations from prealloc_bufs ( #14880 )
2026-02-19 14:23:28 +08:00
qazal
5bc65ec669
applied_opts/estimates in program spec are aliases for the sink arg ( #14860 )
...
* remove applied_opts from programspec
* comment that out
* placement
* update tests
* p.ast.arg
* remove todo comment
* maybe this too
* it can exist as an alias, also for estimates
2026-02-19 13:08:26 +09:00
chenyu
8d8da185ec
minor handle_allreduce cleanup [pr] ( #14876 )
...
no more lbs, also use a divmod
2026-02-18 22:53:28 -05:00
Christopher Milan
b5588d341b
uop_given_valid fixes many gated reads for IMAGE=1 ( #14877 )
...
* add replay script
* pkl is arg
* that needs uop_given_valid
* cleanup
2026-02-18 22:49:47 -05:00
George Hotz
ab61c16730
fixes and test relaxations from prealloc_bufs ( #14875 )
...
* fixes and test relaxations from prealloc_bufs
* fix error type and guard _mop
* revert that
* contiguous makes extra/torch_backend/test_kernel_fusion.py fail
2026-02-19 11:37:25 +08:00
chenyu
0c85b93938
support shrink sharded and non-sharded axes ( #14874 )
...
simpler to just support it
2026-02-18 20:54:10 -05:00
chenyu
e8252e6e4f
use official gguf in test ( #14872 )
...
also deleted bad test_load_sample_mxfp4, added some hard coded simple tests
2026-02-18 19:55:09 -05:00
chenyu
8c830c5b44
test_full_like_shrink_on_shard_axis ( #14870 )
...
* test_full_like_shrink_on_shard_axis
add a test case that triggers non-copy branch in mstack_early_shrink
* 0
2026-02-18 19:23:44 -05:00
Ananta Ranganathan
4005e9db6d
Mxfp4 fix ( #14866 )
...
* double e2m1 values for mxfp4
* check if assert equal works in ci
* Revert "check if assert equal works in ci"
This reverts commit 8cf902ce0d .
* remove unnecessary whitespace change
* add test case that fails for old implementation but passes for new
* add note that the previous test is bad
* clarification on the methodology for the test
* fix the indent problem that happened to skip this test
* for now update mxfp4 block test to similarly use allclose (bad)
* add gist link and clearer explanation of process for computing test data
2026-02-18 18:50:59 -05:00
chenyu
0e4cf21a75
remove handle_allreduce_multirank and group_id [pr] ( #14869 )
...
leftovers from ops_remote
2026-02-18 16:13:54 -05:00
chenyu
f771de6738
gc.collect() to get the correct GlobalCounters.mem_used in tests ( #14868 )
...
test can be flaky if gc happens in between
2026-02-18 15:01:23 -05:00
chenyu
f84a11bb9f
delete uneven shard tests and mentions ( #14867 )
2026-02-18 14:10:33 -05:00
nimlgen
1c8c17a593
am: aca ( #14861 )
2026-02-18 21:40:09 +03:00
chenyu
b3cdb61067
clean up expand_multi [pr] ( #14865 )
...
remove dead assert, also make it more like a view
2026-02-18 12:21:13 -05:00
chenyu
0260406f49
simplify reshape_multi [pr] ( #14864 )
2026-02-18 11:46:26 -05:00
chenyu
5746a605ce
UOp.axis raises for invalid reshape ( #14863 )
...
reshape is lazy now, so better to raise from the .axis call and not have caller to handle invalid case
2026-02-18 11:28:56 -05:00
nimlgen
3b95fa0ed4
am_smi: enable mem usage back ( #14858 )
2026-02-18 19:27:27 +03:00
qazal
a212881130
viz: second profiler link goes to source code ( #14855 )
2026-02-18 19:40:34 +09:00
qazal
b0110c4469
viz: simplify shape clicking ( #14853 )
...
* setFocus is the more clear name
* do less
2026-02-18 19:03:26 +09:00
George Hotz
af839b2bd1
remove all the outerworld stuff, it was too complex ( #14852 )
2026-02-18 17:44:11 +08:00
wozeparrot
6d301ad2c4
feat: llama wqkv ( #14841 )
2026-02-17 23:01:33 -08:00
qazal
a3d516c4b5
viz: start displaying pma ( #14848 )
...
* viz: start displaying pma
* s
* work
* colors
* cleaner
* max packets
* fine
* work
* pma
* diff cleanup
2026-02-18 14:22:32 +09:00
George Hotz
d5636fba90
assign after copy shouldn't contig ( #14847 )
...
* assign after copy shouldn't contig
* fix assign copy
2026-02-18 12:23:49 +08:00
George Hotz
ab55e8c6b9
assign should be used as output buffer ( #14845 )
...
* assign should be used as buffer
* late removed
* the fix
* better fix
* backward slice
2026-02-18 09:37:46 +08:00
chenyu
e3c120c8e1
exclude 100 in test_assign_add ( #14846 )
...
this can crash, not sure why. skip 100 to see if it's better
2026-02-17 19:12:47 -05:00
Christopher Milan
7641ed61af
remove doublecast in IMAGE=1 ( #14839 )
2026-02-17 18:22:14 -05:00
Christopher Milan
5b11519d5e
LLVM actually supports ops ( #14843 )
...
LLVM should support eg, SHL/SHR, but this was never actually rendered
2026-02-17 18:21:33 -05:00
wozeparrot
95e97ec341
separate llama optim ( #14810 )
2026-02-17 13:02:35 -08:00
chenyu
72cf603805
removed if self.buffer.is_allocated() in realized ( #14836 )
...
automatically fixes is_realized issue for empty
2026-02-17 15:35:56 -05:00
chenyu
aec8a6c85b
Revert "one run_schedule for assign realize ( #14835 )" ( #14837 )
...
This reverts commit df7c37f611 .
2026-02-17 14:34:26 -05:00
chenyu
df7c37f611
one run_schedule for assign realize ( #14835 )
...
concat schedules. separate out the execution part
2026-02-17 14:01:55 -05:00
chenyu
61867c2f35
TestRealizeIsRealized ( #14834 )
...
test after calling .realize(), uop.is_realized is True. currently not working for empty (thus disk tensor), and const
2026-02-17 13:30:35 -05:00
chenyu
f147791105
update test to reset and test kernel_count directly ( #14832 )
2026-02-17 11:48:46 -05:00
chenyu
9d4937ab5e
remove assign test @unittest.skip("this test is crashing!") ( #14831 )
2026-02-17 10:30:58 -05:00
nimlgen
dda5ccf63b
hcq: fix usb<->cpu mappings ( #14827 )
...
* hcq: fix usb<->cpu mappings
* non cpu
* um
2026-02-17 18:04:18 +03:00
nimlgen
801677cf12
am: GCVM_L2_PROTECTION_FAULT_STATUS prints device ( #14830 )
2026-02-17 18:03:52 +03:00
chenyu
f07898c68a
move assign chain fix to rangeify ( #14829 )
2026-02-17 09:40:34 -05:00
nimlgen
a2586e4c70
nv: move reset earlier ( #14824 )
2026-02-17 17:25:49 +03:00
chenyu
f2f039cc0f
fix chained full-buffer assign ( #14828 )
...
this shows issue that pm_remove_bufferize drops tags, will fix in bufferize next. this also fixed rand being different in jit vs no-jit
2026-02-17 09:11:04 -05:00
chenyu
58fa82eef5
stronger test_assign_add ( #14826 )
...
also test self add 10 and 100 times
2026-02-17 08:36:09 -05:00
George Hotz
ff60dab622
Revert "big sink is on base ( #14819 )" ( #14825 )
...
This reverts commit 5fc3d8109f .
2026-02-17 19:18:06 +08:00
qazal
f8e485ee9e
nvcc/nvdisasm macos shim ( #14822 )
...
* move to backend
* and arch
* setup_nvcc_osx
* blackwell
* min test
* now getting dumb assert is_ptx
* support cubin.
* work
* remove that
* simpler
2026-02-17 20:07:05 +09:00
qazal
d24781f45f
viz: do not, ever, open devices ( #14820 )
...
* viz: do not, ever, open devices
* unwrap
* on the kernel info
2026-02-17 19:42:44 +09:00
George Hotz
5fc3d8109f
big sink is on base ( #14819 )
...
* big sink is on base
* contiguous fixes tests
2026-02-17 18:32:56 +08:00
qazal
99a988b9d2
viz: remove ProgramSpec from trace ( #14818 )
2026-02-17 19:04:58 +09:00
qazal
f590564bf7
gemm multiple is only for cdna4 asm ( #14814 )
...
* gemm multiple is only for cdna4 asm
* move to backend
* and arch
* path
2026-02-17 14:00:02 +09:00
George Hotz
5bd2862d1a
late compile the cdna gemm ( #14783 )
...
* late compile the cdna gemm
* remove old things
* finalize inplace
---------
Co-authored-by: qazal <qazal.software@gmail.com>
2026-02-17 13:04:22 +09:00
Christopher Milan
275319c789
IMAGE=1 2d indexing ( #14809 )
...
* IMAGE=1 2d indexing
* cleanup
* oops
* go back to 'idx'
* fix vals
* fix
* ugh
2026-02-16 22:51:18 -05:00
George Hotz
f081f154ae
parameterize the CDNA asm gemm ( #14813 )
...
* parameterize the CDNA asm gemm
* fix llama test
* fix
* add more gemm tests
* confirm all match
* test these asm gemms
2026-02-17 11:35:18 +08:00
George Hotz
bc3487d607
VIZ display cleanups ( #14811 )
...
* exclude reshape/expand broadcasts from viz
* limit src lines
2026-02-17 10:03:08 +08:00
chenyu
5bca5be2d2
test slice assign twice retains the buffer ( #14807 )
2026-02-16 20:01:47 -05:00
ridoy majumdar
ba39a19114
viz: remove duplicate Ops.PARAM color ( #14808 )
2026-02-17 09:31:47 +09:00
chenyu
9b44fbe0b8
update test_assign_add_twice ( #14806 )
...
failed test case to show that `+=1` twice returns a different buffer
2026-02-16 17:52:11 -05:00
chenyu
f290af6c7d
test_schedule always test with SPLIT_REDUCEOP=0 ( #14802 )
...
* test_schedule always test with SPLIT_REDUCEOP=0
except tests that test SPLIT_REDUCEOP=1
* like that
2026-02-16 15:30:26 -05:00
kevvz
e41da0c396
use relative address for MOCKGPU rdna4 tracing ( #14801 )
...
* rdna3/4 trace separation
* remove comments
2026-02-16 22:59:46 +03:00
nimlgen
131bbbbfd8
am: smu_v13_0_12 ( #14800 )
2026-02-16 22:58:10 +03:00
nimlgen
7ddc888ad5
am: 48bit for gfx950 ( #14799 )
2026-02-16 22:48:07 +03:00
nimlgen
9f8afb518c
viz: sdma gb/s in graph ( #14798 )
...
* viz: sdma gb/s in graph
* f
2026-02-16 16:45:06 +03:00
qazal
db3db476ff
viz: add GB/s to SDMA ( #14795 )
...
* work
* better
* fix that
* no decimal
2026-02-16 20:09:20 +09:00
qazal
2b36708c6d
viz: split all long labels with ... ( #14794 )
2026-02-16 19:18:42 +09:00
qazal
d213fe95a0
viz: integer ticks on the x axis, fix small cycle numbers ( #14792 )
2026-02-16 18:07:40 +09:00
George Hotz
47d39a6b8b
add sqtt support to the emulator ( #14791 )
...
* add sqtt support to the emulator
* more sqtt
* cleanup
* cleanups
* simpler tests
* some decent tests
* test branch
2026-02-16 16:48:26 +08:00
wozeparrot
45aebe1572
hipkittens fa backward ( #14723 )
2026-02-16 00:38:44 -08:00
Nicolas Pinto
20b658b786
fuse MULACC after MUL->SHL ( #14788 )
...
* decompositions: fuse (x << n) + c to MULACC
MUL→SHL converts x*(2^n) to x<<n before MULACC can fuse (x*c)+y.
Add pattern to also fuse (x<<n)+c → MULACC(x, 2^n, c) for backends
that support both MULACC and SHL.
* test: add test_mulacc_shl for SHL->MULACC fusion
* test: relax test_mulacc_unrolled to >= 4
SHL->MULACC fusion now also catches power-of-2 address calculations,
increasing MULACC count from 4 to 6 on PTX. the test's intent is that
each unrolled multiply is individually fused (not grouped), so >= 4
is the correct assertion.
---------
Co-authored-by: Prithvish <deformercoding@gmail.com>
Co-authored-by: Nicolas Pinto <41171+npinto@users.noreply.github.com>
Co-authored-by: Nicolas Pinto <npinto@mbp23.local>
2026-02-16 16:26:44 +08:00
qazal
ac62d28ddc
viz: amdgpu arch cleanup ( #14790 )
...
* viz: amdgpu arch cleanup
* don't do that
* simpler sqttmap
* work
* self.arch
2026-02-16 16:48:12 +09:00
George Hotz
401095e3e7
emulator barrier tests ( #14789 )
2026-02-16 15:31:01 +08:00
qazal
c7a4dbf918
viz: get program binary from the UOp ( #14787 )
...
* viz: get program binary from the UOp
* remove that
* less
* rename View Program to View Source
* two words
* fix
2026-02-16 15:46:58 +09:00
Bautista Garcia
0f1ca8eb43
torch_load: fix shared storage slicing ( #14771 )
...
* faster zip_extract + usage in torch load
* clean zip in torch load
* working zipextract in torchload
* tar_extract in tar path
* faster tar path
* tests passing, cleanup needed
* faster tar with 1MB buffer
* comments
* unify storage_source with all paths
* use bufferedreader in zip path
* fix ruff
* clean
* removed unnecessary string conversion
* fix for tensors that share storage
* less hacky
* shared storage test
* test comment
* linter
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2026-02-16 14:30:13 +08:00
George Hotz
dff9cf35c2
amd asm emulator fixes + run it in CI ( #14786 )
...
* amd asm fix, try 2
* fix tests
2026-02-16 13:24:21 +08:00
qazal
55a4dfa2e0
cdna4 asm_gemm tests in CI on the null backend ( #14785 )
...
* cdna4 asm_gemm tests in CI on the null backend
* no .numpy() in null
* better
* gemm/asm: device comes from renderer
2026-02-16 14:06:23 +09:00
qazal
c2be31e75b
move Estimates to rewrite rules [pr] ( #14782 )
...
* move Estimates to rewrite rules [pr]
* don't need this cached_property
* tuple
* return
2026-02-16 12:59:42 +09:00
George Hotz
0abcb9aac2
move more to mixins ( #14780 )
...
* move more to mixins
* revert
* move some
* do not change
* more
* fix tests
* Revert "more"
This reverts commit d942d59fa4 .
* go
* work
* more
* work
* guard
* base
2026-02-16 11:35:00 +08:00
qazal
8e7c5f5b09
remove Tensor.training = True in test_arange ( #14781 )
2026-02-16 11:19:42 +09:00
kevvz
33b2ade8cd
Rdna4 emulator test_ops, dtypes pass ( #14773 )
...
* test_ops, test_dtypes pass
* merge cdna4
* ruff + more tests
* reorganize
* /backend
* again
* again...
* add rdna4
2026-02-16 10:13:39 +08:00
qazal
156b6cb7e4
native bf16 cast in cdna4 ( #14574 )
...
* native bf16 cast in cdna4
* don't need contig backward
* simpler
* contig bw still wins in those cases
2026-02-16 10:51:32 +09:00
chenyu
3adb5062c5
clean up assign_to_contiguous [pr] ( #14779 )
...
slice hazard is handled in fix_assign_hazard
2026-02-15 20:45:49 -05:00
George Hotz
bd18217f32
add rdna3/rdna4/cdna4 to testamd ( #14778 )
...
* add rdna3/rdna4/cdna4 to testamd
* test simplify
* ci cleanups
* mergable
* skip slow
2026-02-16 09:45:16 +08:00
George Hotz
ac079e43d7
ElementwiseMixin ( #14777 )
2026-02-16 08:50:47 +08:00
Christopher Milan
9c95a11f90
autogen: handle rocm bump and better error wording ( #14776 )
...
* autogen: handle rocm bump and better error wording
* regen
2026-02-15 19:23:47 -05:00
chenyu
1ded250bbe
remove collapse_nested_assign [pr] ( #14775 )
...
the else branch is dead code, and we can check directly with UPat
2026-02-15 18:04:47 -05:00
chenyu
17db43ab46
remove some contiguous call in frontend ( #14772 )
...
these should work without contiguous
2026-02-15 16:33:56 -05:00
nimlgen
26193cbf9a
nv: prof cpu_access for nvd only ( #14769 )
2026-02-15 21:42:04 +03:00
qazal
33b31d9cd6
tinykittens flash attention dtype fix, add CI ( #14770 )
...
* don't hardcode amd device
* add failing tests, ci too
* fix: fix for dtype mixin
* bump to rocm 7.1
---------
Co-authored-by: Woze Parrot <wozeparrot@gmail.com>
2026-02-16 01:15:11 +09:00
chenyu
352845d8cc
update cast to uint tests ( #14768 )
...
result in valid range should work, add intermediate cast to NIRRenderer since it's UB for [128, 256)
2026-02-15 10:55:13 -05:00
qazal
ceccc8eb86
unskip now passing multi tests [pr] ( #14759 )
2026-02-15 20:30:00 +09:00
George Hotz
713143a46a
more mixins pt 2 ( #14765 )
...
* more mixins pt 2
* lil cleanups
2026-02-15 17:57:04 +08:00
qazal
9da7f5e733
disable process replay for AMD emulator renderer [pr] ( #14766 )
...
* disable process replay for AMD emulator renderer [pr]
* line
* skip
2026-02-15 18:52:37 +09:00
George Hotz
9759fd6193
dtype mixin ( #14763 )
...
* dtype mixin
* dtype mixin methods
2026-02-15 16:03:48 +08:00
qazal
42b6bf0b7a
fix sdpa causal failing test on multi ( #14762 )
...
* simple failing test
* device is from xq
2026-02-15 16:54:33 +09:00
George Hotz
8091661df3
move more to mixins ( #14761 )
2026-02-15 15:18:37 +08:00
George Hotz
0e215c433d
remove hack from cast ( #14760 )
...
* remove hack from cast
* skip tests
* linters to 3.12, another skip
* fix rand
* m_
2026-02-15 13:56:38 +08:00
George Hotz
d176af6269
start outerworld call test, fix gate ( #14758 )
2026-02-15 12:35:01 +08:00
qazal
9bb6014900
keep existing profile trace in viz cli ( #14757 )
2026-02-15 13:16:32 +09:00
chenyu
ca68037f26
lazy basic setitem to unrealized Tensor ( #14756 )
...
undo the view and make it a mask, this fuses the setitem with any pending compute too.
one behavior change is that for target not backed by a buffer (const and arange), rangeify makes output contiguous under the hood.
this is strictly better than raising and asking the user to call contiguous, as that would no longer be fuse-able.
2026-02-14 20:27:03 -05:00
George Hotz
32980c74d1
hotfix: skip flaky tests, looped many times on tinymac3
2026-02-15 07:46:29 +08:00
chenyu
902dc7c09c
fix test_numpy_parity_and_backward_2d ( #14755 )
...
test setup issue, test failed locally with `RUN_SLOW=1`
2026-02-14 17:59:00 -05:00
chenyu
043f5dbfa0
fix write-after-read tracking ( #14754 )
...
AFTER-AFTER was silently dropped, which breaks write-after-read
2026-02-14 17:23:05 -05:00
chenyu
d79c63a0ff
test_multi_step_assign_read_write_same_buffer ( #14752 )
...
pattern in LAMB that can be off subtly
2026-02-14 16:39:08 -05:00
chenyu
95f4c7e90a
fix limit_bufs to not limit index ( #14751 )
...
index is not real buffer. also made MAX_KERNEL_BUFFERS a ContextVar
2026-02-14 16:00:03 -05:00
chenyu
0ce4a55dad
clean up test_setitem_slice ( #14750 )
...
moved to test_setitem_schedule, and use contiguous zeros as scheduler handles empty differently now
2026-02-14 14:29:16 -05:00
chenyu
8f6772fd8c
more setitem kernel mem tests ( #14749 )
...
* more setitem kernel mem tests
test only the slice is accessed
* update
2026-02-14 11:01:03 -05:00
chenyu
446909fb7a
more setitem kernel tests ( #14748 )
...
check where realize happened
2026-02-14 09:57:46 -05:00
nimlgen
4ab51b55bd
stream pma decoder ( #14746 )
2026-02-14 17:40:18 +03:00
nimlgen
e1a18dadae
fix devices for copies ( #14747 )
...
* fix devices for copies
* add test
2026-02-14 17:39:41 +03:00
George Hotz
e35bd960e8
Revert "use zip_extract and tar_extract in torch load ( #14734 )" ( #14745 )
...
This reverts commit 9d9ef81608 .
2026-02-14 13:24:01 +08:00
Christopher Milan
eaa9506a00
disallow subnormals in emulated test_dtype ( #14744 )
2026-02-14 00:11:57 -05:00
Bautista Garcia
9d9ef81608
use zip_extract and tar_extract in torch load ( #14734 )
...
* faster zip_extract + usage in torch load
* clean zip in torch load
* working zipextract in torchload
* tar_extract in tar path
* faster tar path
* tests passing, cleanup needed
* faster tar with 1MB buffer
* comments
* unify storage_source with all paths
* use bufferedreader in zip path
* fix ruff
* clean
* removed unnecessary string conversion
2026-02-14 12:57:28 +08:00
qazal
c88bb075f0
hotfix: correct way to get renderer arch ( #14743 )
2026-02-14 12:38:20 +08:00
George Hotz
f9d2eca91a
clean up amd/elf.py ( #14741 )
2026-02-14 12:09:05 +08:00
qazal
6dc7ea58fd
make flash attention tests run on DEV=NULL EMULATE=AMD_CDNA4 ( #14742 )
...
* make flash attention tests run on DEV=NULL EMULATE=AMD_CDNA4
* no if CI, this is just the arch
2026-02-14 12:24:37 +09:00
George Hotz
e8bd432bf6
move amd emulator out of tree ( #14740 )
...
* move amd emulator out of tree
* move the readme too
2026-02-14 10:32:00 +08:00
chenyu
dca7819f76
more setitem into unrealized tests ( #14737 )
...
* more setitem into unrealized tests
into empty, const with alu, and arange
* typo
2026-02-13 20:28:51 -05:00
chenyu
9f607cf84f
disk setitem does not need realize either ( #14736 )
...
disk base is a COPY and is_realized is always False for now, disk assign is still eager
2026-02-13 12:57:58 -05:00
chenyu
8b205a007e
lazy setitem for realized target ( #14735 )
2026-02-13 12:20:14 -05:00
nimlgen
3bee6638e3
external_test_hive_reset ( #14729 )
...
* external_test_hive_reset
* add fault
2026-02-13 19:08:36 +03:00
nimlgen
7d88626068
nv: fix pma_bytes to be system memory ( #14733 )
2026-02-13 17:55:46 +03:00
George Hotz
c0fe78f73b
BUG: metadata is lost with partial assign ( #14732 )
2026-02-13 21:35:21 +08:00
qazal
d0543063dd
viz: wave color is locally scoped ( #14728 )
2026-02-13 18:22:20 +09:00
nimlgen
ba67425680
am: reset mi300 with pm4 ( #14727 )
2026-02-13 11:22:32 +03:00
George Hotz
c0de4f75b1
improve mmapeak, print names with sqtt ( #14726 )
2026-02-13 16:07:06 +08:00
George Hotz
5289b4e882
renderer/amd: add cdna emulator ( #14721 )
...
* renderer/amd: add cdna emulator
* fixes
* no predecode
* no early
* REMU_PATH
* delete that
* round
* Fix cache invalidation check in _compile_smem
2026-02-13 16:06:58 +08:00
Christopher Milan
08a555c875
skip test_expand_buffer_before_cast on WEBGPU metal ( #14724 )
2026-02-13 00:01:05 -05:00
Christopher Milan
7993f3a277
autogen: use snapshot.debian.org for linux src ( #14718 )
2026-02-12 23:36:38 -05:00
wozeparrot
0613c0ac0c
hipkittens fa forward ( #14692 )
2026-02-12 20:16:43 -08:00
chenyu
50cb40be88
clean up test/null/test_indexing.py ( #14720 )
2026-02-12 22:36:53 -05:00
qazal
5b624b5e93
viz: better error message for out of range timestamps ( #14722 )
...
* test_timestamp_out_of_range
* rel_ts helper
* linter
2026-02-13 12:13:40 +09:00
George Hotz
4088d686b2
remove llvm requirement from amd ( #14717 )
...
* remove llvm requirement from amd
* tests pass
* test
* sink kernarg_size
* move stuff
* amd_asm_matmul to new style
* default type
* fix tests, simpler
* cu mode is faster and simpler
* darken
2026-02-13 10:50:12 +08:00
chenyu
9e33a08adb
use more pad_to and shrink_to in tensor.py ( #14719 )
...
good wins
2026-02-12 20:10:57 -05:00
George Hotz
d3adb8428e
Revert "hotfix: skip test/amd in macpytest" ( #14704 )
...
* Revert "hotfix: skip test/amd in macpytest"
This reverts commit b7dade2adf .
* no llvm subprocess
* simpler
* sys.exec
* cleanup
* process safe
* diag
* arm ftz support
* 5 sec
* this one
2026-02-13 08:00:24 +08:00
Christopher Milan
d4bc5ab609
autogen: download linux sources ( #14714 )
2026-02-12 18:50:50 -05:00
Christopher Milan
084d0d0103
cleanup macos webgpu tests ( #14715 )
2026-02-12 17:56:34 -05:00
Christopher Milan
c30bb0f006
fix WEBGPU isnan check ( #14711 )
2026-02-12 17:01:18 -05:00
chenyu
9b3b597423
minor getitem cleanups ( #14713 )
2026-02-12 16:54:54 -05:00
chenyu
787998fac3
fix getitem tensor indexing detection ( #14712 )
...
issue with sint
2026-02-12 16:04:37 -05:00
chenyu
86352988d8
update test_uops_stats for setitem ( #14710 )
...
realize both full tensor and the slice should not add to global_mem
2026-02-12 12:26:13 -05:00
chenyu
56caf6a3a2
fix Estimate.from_uops for sliced access ( #14695 )
...
"assume all DEFINE_GLOBAL memory is accessed" is wrong for partial load. get accessed accumulated from INDEX, then cap at full size. now mem_est never exceeds lds_est
2026-02-12 11:18:07 -05:00
chenyu
8551fa50d3
support bitcast in sym_infer ( #14708 )
...
fixed `DEBUG=2 DEV=WEBGPU python -m pytest test/backend/test_tensor_variable.py::TestTensorVariable::test_symbolic_pad`
2026-02-12 10:21:05 -05:00
chenyu
212789e31e
fix long_decomp with None tag ( #14707 )
...
fixed `DEBUG=2 WEBGPU=1 python -m pytest test/null/test_tensor.py::TestIdxUpcast::test_int64_unsupported_overflow_sym`
2026-02-12 09:31:52 -05:00
chenyu
557134e1c7
model/test fix that failed with WEBGPU=1 DEBUG=2 ( #14706 )
2026-02-12 09:08:16 -05:00
nimlgen
10c94d2c2d
amd: print more info about device hang ( #14705 )
2026-02-12 15:34:08 +03:00
nimlgen
b376bd7a21
jit: fix raw in same kernel ( #14699 )
...
* jit: fix raw in same kernel
* fix
* ugh
* x
* simpler
2026-02-12 15:33:32 +03:00
George Hotz
19e68a1833
skip AMD on not AMD ( #14703 )
2026-02-12 18:56:54 +08:00
George Hotz
b7dade2adf
hotfix: skip test/amd in macpytest
2026-02-12 18:16:04 +08:00
George Hotz
4680247e35
renderer/amd: move in tree ( #14702 )
...
* renderer/amd: move in tree
* fix paths in tests
* 24000 lines
* no delete for amd files
2026-02-12 18:09:16 +08:00
George Hotz
d5fc3ea1ba
assembly/amd: mypy+ruff passes ( #14701 )
...
* assembly/amd: mypy+ruff passes
* touchups
2026-02-12 16:59:42 +08:00
George Hotz
095a064ba8
test.yml explicitly says backend ( #14700 )
...
* test.yml explicitly says backend
* 1e-5
2026-02-12 16:03:44 +08:00
nimlgen
14a1991da6
viz: sort tracks in timeline ( #14591 )
...
* viz: sort devices in timeline
* fix
* rev
* upd
* skip
2026-02-12 10:51:41 +03:00
George Hotz
025049c521
clean up sqtt / update src formatting in viz ( #14696 )
...
* update src formatting in viz
* rename to RDNA3/RDNA4 in sqtt
* wrap
* move sqttmap
* update readme
* why did that change?
* cdna
* that's just for test
2026-02-12 14:27:14 +08:00
Christopher Milan
b1a3876492
IMAGE=1 supports FLOAT16=1 ( #14693 )
...
requires 2d indexing to be actually fast
2026-02-12 00:30:55 -05:00
George Hotz
befc1e800c
assembly/amd: disasm is test only ( #14694 )
...
* assembly/amd: disasm is test only
* viz uses str
2026-02-12 12:33:46 +08:00
George Hotz
c331798201
move tests to test/backend ( #14691 )
...
* move tests to test/backend
* fix imports
* fix CI
* revert that one
* Fix formatting in README for test command
2026-02-12 11:09:44 +08:00
wozeparrot
4b5d3bda1f
llama3: data seed ( #14681 )
2026-02-11 19:04:40 -08:00
chenyu
0c63f63ee4
recursive resolve assign dependency ( #14688 )
...
remove the .realize in llm.py
2026-02-11 17:41:05 -05:00
nimlgen
869083e373
nv: pciiface pma ( #14686 )
...
* x
* w
* z
* clean
* o
* r
* x
* c
* r
* list
* deanon
* b
2026-02-11 23:29:07 +03:00
chenyu
cbbc2fdea5
update test_assign_slice_then_read ( #14687 )
...
passes locally now
2026-02-11 15:02:44 -05:00
chenyu
7465b22ba0
handle setitem target in rangeify ( #14685 )
2026-02-11 11:38:59 -05:00
chenyu
0d215b962e
few setitem test cases diff from numpy ( #14684 )
...
had claude fuzz the frontend and found some real bugs
2026-02-11 08:41:03 -05:00
nimlgen
df8b21eeb5
add real self assign test ( #14683 )
...
* self assign fix
* no
2026-02-11 12:41:53 +03:00
wozeparrot
a60220bed9
llama3: move dl to numpy & jit more ( #14677 )
...
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2026-02-10 18:16:40 -08:00
George Hotz
4565958792
some lil speedups ( #14679 )
2026-02-11 10:01:58 +08:00
George Hotz
2d4ad9e739
add a waitlist for graph rewrite ( #14678 )
...
* add a waitlist for graph rewrite
* cleaner
* one context on spec check
2026-02-11 09:30:13 +08:00
Christopher Milan
389e2eeda1
Revert "transcendental works with long decomp" ( #14676 )
2026-02-10 19:46:34 -05:00
Christopher Milan
0662c8037d
transcendental works with long decomp ( #14672 )
2026-02-10 19:30:24 -05:00
George Hotz
3fab43c57c
add cache to asm gemm ( #14675 )
2026-02-11 08:26:30 +08:00
chenyu
ebef63dba0
update test_self_assign_same_device_copy ( #14673 )
...
that test would have passed without the optimization because .to shortcut
2026-02-10 17:23:43 -05:00
nimlgen
aafa9dcb5b
eliminate same-device copy self-assigns ( #14671 )
...
* eliminate same-device copy self-assigns
* ugh
2026-02-10 22:54:51 +03:00
chenyu
494eec2694
test_setitem_const_fused ( #14668 )
...
did not realize #14640 also fixed #10690 , so added a test for it
2026-02-10 08:33:02 -05:00
nimlgen
42ded7c34d
amd: bind aql ( #14666 )
...
* amd: bind to aql
* bind
* x
* f
2026-02-10 16:28:11 +03:00
George Hotz
82974929b7
use PARAM in schedule ( #14665 )
...
* use PARAM in schedule
* create_new_buffer
2026-02-10 19:18:40 +08:00
George Hotz
8dc46dde07
everything has dtype.long now ( #14661 )
...
* everything has dtype.long now
* int64/uint64 are everywhere now
* that doesn't work
2026-02-10 15:08:50 +08:00
Christopher Milan
cdb78954cb
better cl compiler name ( #14660 )
...
cl_compiler instead of compiler because overriding Compiled.compiler seems more confusing
2026-02-10 01:03:46 -05:00
George Hotz
cc9bf8ccbc
move more to null/unit tests ( #14658 )
...
* move more to null tests
* move test_gc
* no test fusion op
2026-02-10 13:35:17 +08:00
chenyu
83f6d28579
two less realize in setitem ( #14655 )
2026-02-09 23:45:24 -05:00
wozeparrot
69574542ab
fix: use correct fa implementation in eval ( #14651 )
2026-02-09 18:20:44 -08:00
chenyu
0dedf4063c
minor test_setitem cleanup ( #14654 )
2026-02-09 20:40:29 -05:00
Christopher Milan
b36b62eb59
don't push docker cache for PRs ( #14652 )
2026-02-09 19:55:55 -05:00
Christopher Milan
e6562a5061
remove CompilerPair ( #14638 )
2026-02-09 19:51:18 -05:00
Christopher Milan
396e1320fb
bump cache version for z3 ( #14650 )
2026-02-09 19:32:07 -05:00
chenyu
9e3f24db9f
assign realize fix ( #14649 )
...
fix the need for explicit assign. track pending assigns for each buffer, and run those before the main realize in order
2026-02-09 17:46:46 -05:00
chenyu
0913c068ea
clean up setitem disk path ( #14648 )
2026-02-09 15:58:04 -05:00
chenyu
205a1212b7
delegate non Tensor src setitem to assign ( #14647 )
...
cannot do this for DISK in the unified path
2026-02-09 13:53:20 -05:00
chenyu
e9f40f49d4
explicitly check advanced setitem ( #14644 )
...
advanced setitem on DISK would fail in rangeify with a bad error, now it's checked directly in setitem. eventually DISK can use the regular setitem path
2026-02-09 13:36:46 -05:00
chenyu
20a132b1c4
relax atol for test_uop_scan_matmul ( #14646 )
...
flaky, also log max diff
2026-02-09 13:25:19 -05:00
qazal
50d3f6cea5
EVAL_BS=0 in llama profile ( #14643 )
2026-02-10 00:49:43 +09:00
chenyu
8a2c23d3dc
raise RuntimeError for setitem dtype mismatch ( #14642 )
2026-02-09 10:37:08 -05:00
qazal
80b0119cef
llama: add new asm gemm shape ( #14611 )
...
* llama: add new asm gemm shape
* work
* cleanup
* half dtype
* more comment
2026-02-10 00:34:29 +09:00
chenyu
a49e038c0c
dont manually broadcast in setitem ( #14641 )
...
handled by assign
2026-02-09 09:34:09 -05:00
chenyu
2c3e3559eb
remove a contiguous in basic setitem ( #14640 )
...
handled in rangeify
2026-02-09 09:19:46 -05:00
chenyu
6c0c8e2ac3
setitem push a realize to basic setitem ( #14637 )
...
advanced setitem does not need it
2026-02-09 08:54:07 -05:00
nimlgen
e087c58ae0
print tables in llama/profile.sh ( #14639 )
2026-02-09 12:32:54 +03:00
Christopher Milan
27f7ea478b
new style DSP renderer ( #14636 )
...
* new style DSP renderer
* cleanup
2026-02-09 00:39:03 -05:00
Christopher Milan
efac5b9ef6
new style NV/CUDA renderers, try 2 ( #14634 )
...
* new style NV/CUDA renderers, try 2
* fix diskcache
2026-02-08 22:58:48 -05:00
Christopher Milan
0ebb508b85
new style metal compiler ( #14632 )
2026-02-08 21:58:25 -05:00
Christopher Milan
9eef9f38ad
new style python renderer ( #14631 )
2026-02-08 21:45:07 -05:00
Christopher Milan
5f2f2cc956
Revert "new style NV/CUDA renderers ( #14627 )" ( #14633 )
...
This reverts commit 0e505951b0 .
2026-02-08 21:16:03 -05:00
Christopher Milan
4ad787ece2
new style CPULLVMRenderer ( #14629 )
2026-02-08 21:05:01 -05:00
Christopher Milan
0e505951b0
new style NV/CUDA renderers ( #14627 )
...
* new style NV/CUDA renderers
* fix pickle
* oops
* fix CUDA_CC=NVCC
* mockgpu uses PTXCompiler
* oops
* ruff
* dont discard stderr
* ugh
2026-02-08 21:04:51 -05:00
Filip Brzek
1667669c46
fix: python3 -m tinygrad.device reporting on AMD/CPU ( #14622 )
...
* test: device module expects PASS in -m tinygrad.device for CPU
* fix: use device._compiler_name instead of unwrap_class_type(compiler).__name__ in enumerate_devices_str
2026-02-08 20:22:35 +03:00
nimlgen
01a4ee4d66
do not hive_reset when amdgpu ( #14624 )
2026-02-08 19:14:13 +03:00
nimlgen
a615b9d781
am: f8_mode for gfx94x only ( #14620 )
2026-02-08 17:38:48 +03:00
chenyu
c28f7d0167
remove realize in Tensor.svd ( #14623 )
2026-02-08 09:36:31 -05:00
qazal
087dab4c3b
gemm/asm: split out cdna tests from CI ( #14619 )
...
* gemm/asm: split out cdna tests from CI
* reorder
* work
2026-02-08 21:33:42 +09:00
George Hotz
183d38b128
remove CUSTOM_KERNEL / directly construct it ( #14604 )
...
* remove CUSTOM_KERNEL / directly construct it
* clean that up
* simpler multi
* custom kernel spec
* remove Kernel
* fix multi
* use sharded shape
* explicit regression test
2026-02-08 18:43:33 +08:00
nimlgen
e29a88ca09
hive_reset respects lock ( #14618 )
2026-02-08 10:47:25 +03:00
qazal
b10802eb53
use existing VIZ ContextVar instead of getenv ( #14610 )
2026-02-08 15:37:55 +09:00
chenyu
510b65489e
style change rangeify assign [pr] ( #14616 )
...
consistent naming, also a standalone function to replace complicated lambda
2026-02-07 15:47:32 -05:00
chenyu
b7afd4471c
use arg instead of 3rd op for ASSIGN [pr] ( #14613 )
2026-02-07 14:17:10 -05:00
nimlgen
88c3022223
amd: kfd iface early exit ( #14612 )
...
* amd: kfd iface early exit
* l
* revert
2026-02-07 18:57:10 +03:00
nimlgen
ce7bfc6ce8
nv: use nv_flags for all fields ( #14607 )
2026-02-07 15:01:38 +03:00
qazal
c2544e2252
viz: remove outdated comment ( #14608 )
2026-02-07 20:05:24 +09:00
nimlgen
6838b35cff
mockgpu: hevc ( #14606 )
...
* mockgpu: hevc
* eng
2026-02-07 12:27:55 +03:00
chenyu
884592f6c8
pin z3-solver version ( #14605 )
...
found exact input that crashes z3 4.15.4
2026-02-06 22:49:31 -05:00
George Hotz
7a2a3b5c71
Remove Ops.KERNEL, it's all Ops.CALL now ( #14603 )
2026-02-07 10:21:54 +08:00
George Hotz
ca6604eae2
kernel is call ( #14577 )
...
* call is kernel
* closer
* fix bugs
* dedup
* pm_gate_kernel_sink
* better
* Revert "better"
This reverts commit b4c799b810 .
* Reapply "better"
This reverts commit e53f094ce7 .
* cleanups
* work
* remove junk
* subtle fix
* index
* viz cleanups
* disable assert for now
2026-02-07 10:10:14 +08:00
wozeparrot
d87ae1c84c
feat: tinyfs load test in benchmark ( #14602 )
2026-02-06 18:00:00 -08:00
ttomsa
462b455562
cleanup linearize ( #14523 )
2026-02-07 08:54:02 +08:00
ttomsa
d5652e4da2
new dtype aliases ( #14596 )
2026-02-07 08:53:35 +08:00
Christopher Milan
ad9e2f0de7
decompose bf16 ( #14601 )
2026-02-06 19:24:09 -05:00
Christopher Milan
7bb45e7df0
decompose fp8 to bigger floats [skip_process_replay] ( #14554 )
...
* decompose fp8 also
* it works
* cleanup
* no shift required
* default to float
* cleanup
* fixes
* fp8e5m2
* don't rely on behavior comparing nans
* cleanup
2026-02-06 19:05:40 -05:00
chenyu
81f6cdb4ab
delete realize_assign [pr] ( #14575 )
...
use realize and realize_srcs like COPY and STORE. src[0] always has BUFFER for base
2026-02-06 17:12:33 -05:00
chenyu
7d193a6e26
fix wgsl bitcast ( #14600 )
...
was wrong for signed int
2026-02-06 16:57:36 -05:00
chenyu
b9fe8b7591
fix opt in process replay [pr] ( #14599 )
2026-02-06 16:49:56 -05:00
chenyu
197ebcbbbc
log seed with flush=True in fuzz_symbolic ( #14597 )
...
* log seed with flush=True in fuzz_symbolic
i think z3 can crash. added reading seed from argv to see if we repro later
* fuzz_symbolic_symbolic_div
2026-02-06 15:03:57 -05:00
nimlgen
fbb67a3f95
am_smi: fix after regen ( #14594 )
2026-02-06 20:57:41 +03:00
qazal
a80fb4e641
viz: better ordering of device engines in profiler ( #14590 )
2026-02-06 23:08:09 +09:00
qazal
b7e3fbe07e
llama: add VIZ=-1 to dev_run ( #14583 )
...
* llama: add VIZ=-1 to dev_run
* readme
* cleaner
* add profile.sh script
* better grouping of options
* add other row
* readme edits
* work
2026-02-06 22:59:22 +09:00
nimlgen
fbeb978170
diff devices for sdma ( #14589 )
...
* start
* x
* fix
* sdma
* c
* clean
* x
* hm
* cleaner
2026-02-06 16:39:12 +03:00
George Hotz
7cb996e153
bottom up earliest rewrites ( #14587 )
...
* better
* bottom up earliest rewrites
* fix
2026-02-06 18:13:07 +08:00
George Hotz
03af2404e2
small changes and test fixes from kernel is call ( #14586 )
2026-02-06 17:08:33 +08:00
George Hotz
3c26ce29b2
make disk tensor tests process safe ( #14584 )
2026-02-06 15:39:55 +08:00
qazal
cf73d7e2a7
hotfix: disable slower asm gemm shape from llama seqlen 8192 ( #14582 )
2026-02-06 15:05:19 +09:00
qazal
be77873974
llama: contig backward for wk / wv matmul backward ( #14581 )
2026-02-06 14:54:00 +09:00
chenyu
15d3344d9e
use int inputs in test_assign ( #14580 )
...
int is less flaky
2026-02-06 00:07:31 -05:00
qazal
50a166a5fa
viz: cleanup amdgpu target mapping ( #14579 )
...
* viz: cleanup amdgpu target mapping
* linter
* unwraps
2026-02-06 13:51:51 +09:00
chenyu
b09dc646f5
revert some late_buffer_view change ( #14578 )
...
revert #14478 which breaks tinyfs
2026-02-05 22:51:40 -05:00
chenyu
d41836f135
remove KERNEL special case in realize_assign [pr] ( #14573 )
2026-02-05 21:55:44 -05:00
George Hotz
6cbcf98627
KernelInfo is required on get_program ( #14571 )
...
* rangeify always adds KernelInfo
* fix tests
* skip flaky test
2026-02-06 10:49:27 +08:00
George Hotz
28c56a783c
add CallInfo and viz call toggle ( #14570 )
2026-02-06 09:30:58 +08:00
wozeparrot
f73468d516
fa: block skipping for fa kv bwd ( #14569 )
2026-02-05 16:13:53 -08:00
chenyu
b7ef775677
more cleanup in create_schedule [pr] ( #14566 )
...
fixed wrong comments and simplified queue building
2026-02-05 16:12:17 -05:00
Garret Castro
cee7ef7ab2
disable threads ( #14555 )
2026-02-05 16:11:32 -05:00
chenyu
79b7799dba
clean up linearize schedule [pr] ( #14565 )
...
* clean up linearize schedule [pr]
don't mix ScheduleItem and UOp in schedule queue
* ok
2026-02-05 15:24:09 -05:00
chenyu
41a179f542
fix test_xlm_roberta_large ( #14564 )
...
onnxruntime does not allow symlink that's outside model dir. update snapshot_download to use local_dir instead of cache_dir. some ad hoc migration step to copy the existing model too
2026-02-05 14:56:06 -05:00
Christopher Milan
aa9dc50577
dtype decomps don't require bitshifts ( #14542 )
...
* dtype decomps don't require bitshifts
* simplify shr/shl
* ruff
2026-02-05 14:42:30 -05:00
Christopher Milan
b47397ab17
list ml_dtypes as dependency for DSP ( #14562 )
...
* pin onnxruntime to 1.23.2 for DSP
* list ml_dtypes instead
This reverts commit 84bb2cc0fc .
2026-02-05 14:27:50 -05:00
chenyu
2b47a9a1b5
skip test_xlm_roberta_large ( #14563 )
...
symlink model not allowed in latest onnxruntime
2026-02-05 14:00:24 -05:00
chenyu
42c18da88a
add Ops asserts in toposort sched_sink [pr] ( #14561 )
...
more explicit
2026-02-05 12:40:02 -05:00
nimlgen
483bba4f05
nv: use prof_exec_counter ( #14559 )
2026-02-05 19:00:14 +03:00
qazal
190042358f
llama: faster bf16 matmul / rope backward ( #14558 )
2026-02-05 23:57:25 +09:00
George Hotz
b398335f62
assembly/amd: fix saturation in python remu ( #14557 )
...
* PYTHONREMU: failing test for V_SUB_NC_U32_E64 clamp
* fix saturation in PYTHON_REMU
* simpler
* more tests, less lines
---------
Co-authored-by: Christopher Milan <chrismilan@ucla.edu>
2026-02-05 18:35:57 +08:00
wozeparrot
c1ea6687e5
fa: simpler is faster ( #14548 )
2026-02-05 01:13:17 -08:00
George Hotz
43e7eda4e7
grad_b uses custom gemm ( #14550 )
...
* grad_b uses custom gemm
* fix multi backward, acc is in float32
* test_gemm_batched
* square gemm
---------
Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
Co-authored-by: qazal <qazal.software@gmail.com>
2026-02-05 15:22:27 +09:00
qazal
f9cfb64cd9
test asm_gemm in CI ( #14551 )
...
* test asm_gemm in CI
* default float16
* use a smaller shape for multi
* smaller size
* smaller for CI
* smaller for ci
* need half
2026-02-05 13:32:22 +09:00
chenyu
c0ca7f9c51
use more UOp.sum and UOp.prod [pr] ( #14549 )
2026-02-04 22:05:20 -05:00
chenyu
e8dace41b6
clean up UOp.vars [pr] ( #14547 )
2026-02-04 20:52:25 -05:00
Christopher Milan
232848d086
PYTHONREMU: VOP3P integer operations with constants don't cast to fp16 ( #14546 )
...
* PYTHONREMU: VOP3P integer operations with constants don't cast to fp16
* put that back
* cleaner
* do that once
2026-02-04 20:10:59 -05:00
wozeparrot
2966619834
feat: llama uses enable_gqa during training ( #14545 )
2026-02-04 16:22:31 -08:00
chenyu
664f1bf76d
minor ops/jit cleanups [pr] ( #14543 )
2026-02-04 17:21:34 -05:00
chenyu
03d0fa9c3f
merge as_buf into buf_uop [pr] ( #14541 )
2026-02-04 16:32:23 -05:00
chenyu
43ef24a8af
remove buf_target [pr] ( #14540 )
...
not really needed
2026-02-04 15:03:47 -05:00
chenyu
8b7343b950
clean up is_realized [pr] ( #14538 )
...
base cannot be Ops.MULTI since MULTI is a view now
2026-02-04 14:24:10 -05:00
Christopher Milan
5338ce6b74
test S_PACK in extra/assembly/amd/test/hw ( #14537 )
...
* S_PACK_LL_B32_B16 in test/hw
* add rest of S_PACK instructions
2026-02-04 14:17:16 -05:00
chenyu
9052db678f
remove allow_shape_mismatch in Tensor.replace ( #14536 )
...
move all logic to torch_backend and not hacking Tensor method
2026-02-04 12:38:18 -05:00
nimlgen
ec2b6bbda8
hcq: update signal logic ( #14531 )
2026-02-04 19:32:56 +03:00
nimlgen
62786d488a
am: mi3xx perf ( #14529 )
2026-02-04 19:32:43 +03:00
chenyu
d57d24c7d4
Buffer.as_buffer -> Buffer.as_memoryview [pr] ( #14535 )
...
it casts to memoryview. also inline the as_typed_buffer checks to Tensor._data
2026-02-04 11:31:11 -05:00
chenyu
024f57ecf5
jit input_buffers cleanup [pr] ( #14532 )
2026-02-04 10:14:38 -05:00
chenyu
67f91e897b
UOp.is_contiguous -> UOp.has_buffer_identity [pr] ( #14530 )
...
one more confusing buffer related method, but it's definitely not is_contiguous
2026-02-04 09:21:26 -05:00
George Hotz
fb9df1e031
pretty print binary ( #14520 )
2026-02-04 18:04:35 +08:00
Christopher Milan
8c3c026d86
decomp float16 to float32 ( #14417 )
...
* decomp float16 to float32
* denormals arent zero
* add test
* denormals are zero
* fix
* oops
* bitcast works
* fix LOADs
* test_dtype passing
* cleanup
* mypy
* debug print
* only emulate if EMULATED
* very ugly, but passes spec
* add test_dtype_alu tests
* Revert "very ugly, but passes spec"
This reverts commit fdc3999b654d630678bf208927ab2f55e026b4ca.
* bottom up decompositions
* that should have symbolic
* simplify a bit
* SPEC really works
* run with DEBUG
* debug=4
* rm debug
2026-02-04 01:37:47 -05:00
Christopher Milan
ecbce5269e
PYTHONREMU properly supports S_PACK_LL_B32_B16 ( #14527 )
...
* PYTHONREMU properly supports S_PACK_LL_B32_B16
* default
2026-02-03 23:45:33 -05:00
wozeparrot
720c9597a9
feat: llama uses is_causal on sdpa during training ( #14528 )
2026-02-03 20:24:30 -08:00
chenyu
9c2fc118ef
relax setitem target check ( #14526 )
...
old check was too conservative
2026-02-03 22:32:49 -05:00
qazal
d1bfbe9ce3
isolate slow llama gemm ( #14525 )
2026-02-04 12:20:10 +09:00
nimlgen
2f55005ad9
qcom: sync cpu cache when from_blob ( #14518 )
...
* um
* fx
* d
* x
* x
* x
* x
* f
* ren
2026-02-03 21:51:03 +03:00
chenyu
ee9d6a1f36
remove DEFINE_VAR in to_define_global [pr] ( #14522 )
...
not needed
2026-02-03 10:12:33 -05:00
Nino Risteski
af4c74bb41
delete extra cast ( #14517 )
2026-02-03 08:29:04 -05:00
chenyu
9d1e9e643e
removed a duplicated remove_bufferize rule [pr] ( #14519 )
2026-02-03 08:28:07 -05:00
George Hotz
d59e6e7a37
move more tests to test/null, split some existing ones ( #14512 )
...
* move more tests to test/null, split some existing ones
* null work
* null work
* move more
* fixes
* move PIL
* PIL in CLIP
* don't move that
2026-02-03 20:20:20 +08:00
qazal
a98c53769a
ASM_GEMM=1 runs the UOp gemm on non cdna ( #14516 )
...
* ASM_GEMM=1 runs the UOp gemm on non cdna
tests run on mac in 3 seconds
* min diff
2026-02-03 20:42:02 +09:00
qazal
5c1d21349e
viz: profiler command line tool ( #14515 )
2026-02-03 19:51:25 +09:00
George Hotz
dd2de4f838
rename all DEFINE_GLOBAL to PARAM ( #14511 )
2026-02-03 15:09:38 +08:00
George Hotz
dc77b3318b
move files that pass with NULL=1 to test/null ( #14508 )
...
* move files that pass with NULL=1 to test/null
* fix windows
* cpu 0
* bugfix + durations
2026-02-03 13:52:36 +08:00
George Hotz
888819ee09
call autodiff gradient ( #14510 )
2026-02-03 13:51:02 +08:00
wozeparrot
bbcd3d67a3
fa: faster ( #14453 )
2026-02-02 21:34:17 -08:00
Christopher Milan
e579613b90
IR3 has aux ( #14509 )
2026-02-02 23:46:41 -05:00
George Hotz
85c7b23160
add pytest -nauto to benchmark for mac ( #14458 )
...
* add pytest -nauto to benchmark
* 3 minute timeout
* 3 min
* setup env
* comment
* fresh db
* in the pyenv
2026-02-03 12:26:09 +08:00
Christopher Milan
a5d7eb37db
IR3 works on versions earlier than 3.14 ( #14507 )
2026-02-02 23:10:19 -05:00
George Hotz
33c886cafa
disable copyout on NULL backend by default ( #14506 )
...
* disable copyout on NULL backend
* gate it
* allow copyout on some tests
2026-02-03 11:57:47 +08:00
chenyu
3c5845e8a5
remove cut_store_range ( #14505 )
...
special scheduling for CPU
2026-02-02 21:58:36 -05:00
chenyu
4f2e7aed24
fix multiple REDUCE on same RANGE ( #14504 )
...
each RANGE maps to one END, but reduce_to_acc is local and would not know this
2026-02-02 20:42:09 -05:00
chenyu
93c41a78fa
clean up NOOP [pr] ( #14503 )
...
should not be used as a COPY, started with removing from ALWAYS_RUN_OPS
2026-02-02 19:46:45 -05:00
chenyu
66d2b02f11
delete files that depends on extra.optimization.helpers ( #14499 )
2026-02-02 13:33:33 -05:00
George Hotz
ec0398fceb
test amd gpu crashes ( #14459 )
...
* test amd gpu crashes
* cleanup
* less sketch tests
2026-02-02 18:57:47 +03:00
nimlgen
6e4238c016
amd: recovery ( #14461 )
...
* rec
* ?
* rv
* cleaner
* post merge
* not used
* um
* clnr
* x
* x
* d
* move
2026-02-02 18:57:35 +03:00
chenyu
61ca19ff24
after with empty src is self [pr] ( #14496 )
2026-02-02 10:19:05 -05:00
George Hotz
6e958dbfd4
assembly/amd: add RDNA4 support to emulator ( #14341 )
...
* start new rdna4
* work
* plus works
* more pass
* rdna4
* assembly/amd: fix RDNA4 emulator for float16 and VOP3 clamp
* stale
* rev
* rr
* rdna4 emu tests
* cleanup
* cleanup
* simp
* works
* better factorization
* hacks
* fix mockgpu
* guard both
* cleaner
* gate
* bug fix and a few tests
* all test_tiny
2026-02-02 21:35:59 +08:00
chenyu
a908f447d5
remove disk special case in mstack_early_shrink [pr] ( #14494 )
2026-02-02 08:34:45 -05:00
qazal
965940dd00
sqtt: update examples after event field change ( #14493 )
...
* regen sqtt examples
* cdna
* rdna4
* packet counts for rdna3
* sqttmap work
2026-02-02 21:39:48 +09:00
George Hotz
965149a46d
assembly/amd: add ds perm instructions ( #14486 )
...
* assembly/amd: add ds perm instructions
* NO SKIP
* fix preexisting RDNA3 issues
* pcode
* assert
* asserts
* unify
* simp
* good fix
2026-02-02 16:02:00 +08:00
qazal
1746d1f997
remove SPEC=0 context in custom_kernel tests, pyrender always skips it ( #14489 )
2026-02-02 16:32:01 +09:00
George Hotz
d4007f36e0
remove DEFINE_GLOBAL (it is PARAM now) ( #14488 )
2026-02-02 14:56:37 +08:00
qazal
6c487656f9
viz: kernel metadata from rodata entry ( #14487 )
2026-02-02 15:41:42 +09:00
Robbe Derks
d75a1b0d5a
usbgpu: use BOT interface for patch.py ( #13644 )
...
* BOT usage
* cleanup
* fix lint
* fix ruff
* fix -7?
2026-02-02 11:54:46 +08:00
Christopher Milan
2931b52875
skip autogen if MTLCompiler is loaded ( #14466 )
2026-02-01 22:12:27 -05:00
George Hotz
9a32d6e090
add depth limit for SPEC=2 ( #14485 )
...
* make SPEC=2 work for everything
* that's a horrible fix
* add depth limit
2026-02-02 10:43:28 +08:00
George Hotz
368a692e1a
make SPEC=2 work for everything ( #14476 )
...
* make SPEC=2 work for everything
* that's a horrible fix
2026-02-02 10:30:56 +08:00
chenyu
ea1f1d2b9d
test_assign_to_bitcast_view ( #14483 )
...
currently disk allows assigning a same-size dtype into a bitcasted view
2026-02-01 16:46:04 -05:00
chenyu
6deeccc192
fix RING with single dest ( #14482 )
2026-02-01 12:14:46 -05:00
chenyu
3ff390159b
don't implicitly change dtype in assign ( #14481 )
...
broadcasting the shape is fine, but an implicit dtype cast is hard to find
2026-02-01 11:48:54 -05:00
imaolo
2111762a48
failed test case for RING output device ( #14191 )
...
* Add enable/disable scheduler cache ContextVar
* add allreduce ring and naive to() tests
* clearer test comparing native vs ring allreduce
* split tests, add helper
* removing trailing whitespace
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
2026-02-01 11:48:43 -05:00
chenyu
02afae04f4
atol in test_call_gemm ( #14480 )
...
flaky
2026-02-01 11:24:58 -05:00
chenyu
5705398a1f
assign cleanup [pr] ( #14479 )
...
share more code path between disk and non-disk. also raise RuntimeError instead of Assert for mismatches
2026-02-01 09:10:22 -05:00
chenyu
da500dbe06
simplify late_buffer_view [pr] ( #14478 )
...
check the only allowed Ops in the chain, and offset cannot be negative
2026-01-31 22:38:40 -05:00
chenyu
b4f96301e0
remove unused rules [pr] ( #14477 )
2026-01-31 21:29:30 -05:00
qazal
54e78dbec8
viz: remove hardcoded strings in cfg tests ( #14462 )
2026-02-01 09:30:43 +09:00
chenyu
5d38db9da6
generic bitcast assign ( #14474 )
...
a.bitcast(X).assign(src) -> a.assign(src.bitcast(a.dtype))
2026-01-31 17:29:20 -05:00
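Illustration (not part of the commit): the rewrite `a.bitcast(X).assign(src) -> a.assign(src.bitcast(a.dtype))` can be sketched in plain Python with the standard library instead of tinygrad. The helper names `assign_through_view` and `assign_bitcast_source` are hypothetical, and the sketch assumes a platform where the `"I"` and `"f"` type codes are both 4 bytes wide.

```python
import struct
from array import array

def assign_through_view(bits):
    """Write uint32 bit patterns into a float32 buffer through a uint32 view."""
    a = array("f", [0.0] * len(bits))          # float32 destination buffer
    view = memoryview(a).cast("B").cast("I")   # reinterpret the same bytes as uint32
    for i, b in enumerate(bits):
        view[i] = b
    view.release()
    return a

def assign_bitcast_source(bits):
    """Reinterpret the source as float32 first, then assign it directly."""
    n = len(bits)
    as_floats = struct.unpack(f"<{n}f", struct.pack(f"<{n}I", *bits))
    return array("f", as_floats)

bits = [0x3F800000, 0x40000000]  # IEEE 754 bit patterns of 1.0 and 2.0
print(assign_through_view(bits) == assign_bitcast_source(bits))
```

Both paths leave the same bytes in the destination, which is why the bitcast can be moved from the assign target to the source.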
chenyu
b38fc43b07
assert assign dtype mismatch for disk [pr] ( #14473 )
...
the disk hack is generally wrong, now force bitcast on the source before assign
2026-01-31 17:08:54 -05:00
chenyu
ced886f26c
failed test case for assign into bitcast ( #14469 )
...
* failed test case for assign into bitcast
DISK assign has custom hack for this. need to fix before we can unify assign
* test_assign_bitcast_different_size
2026-01-31 14:26:47 -05:00
chenyu
81eee5b30a
unused spec [pr] ( #14468 )
...
no BUFFER_VIEW in tensor, and no CONTIGUOUS in KERNEL
2026-01-31 13:53:16 -05:00
nimlgen
f873c7b6c5
amd: fetch_name is file_name ( #14465 )
2026-01-31 20:11:07 +03:00
chenyu
c765641215
remove unused allow_any_len [pr] ( #14464 )
...
STORE has 2 src, RESHAPE has 2 src, BUFFER has 2 src
added some tests for the untested allow_any_len
2026-01-31 11:05:42 -05:00
chenyu
b4f5a51ebb
move tests to unit ( #14463 )
...
test_uop_graph does not need device, test_memory_planner can use NULL
2026-01-31 10:49:31 -05:00
qazal
616e9c1483
CDNA assembly gemm in tensor.py with flag ( #14310 )
...
* work
* work
* the assembly
* remove the old one
* remove ws bufs, assert splitk
* notes cleanup
* work
* gemm args
* gemm in mixins would be nice
* add gemm gradient
* print counters
* the realize is for DEBUG=2 aesthetics
* dedup
* rewrite to python dsl, no list copies
* leave that
* add B, M, N, K to gemm name
* it's M0 not NULL
* fp16 support
* test cleanup + more gemms
* work from viz
* more work
* gemm batch_size
* xccg path work
* tiny comments on the label naming
* s_waitcnt
2026-01-31 22:34:14 +09:00
chenyu
55f806b713
tighter late_buffer_view match [pr] ( #14456 )
...
src must be len 2 at that point
2026-01-31 07:28:26 -05:00
qazal
d69bc5aa1a
make DEV=NULL EMULATE=AMD amd_asm_matmul run ( #14460 )
2026-01-31 20:45:24 +09:00
qazal
4976544bf9
multi ram usage tests on the NULL device ( #14457 )
2026-01-31 14:14:53 +09:00
chenyu
99b44121bc
failed test case for non-consecutive disk read ( #14455 )
...
silently fails now
2026-01-30 23:44:04 -05:00
George Hotz
b705c9143c
assembly/amd: test more instructions ( #14365 )
...
* assembly/amd: test more instructions
* more
* passing
* revert
* no const fold
* remove junk
* cleaner
2026-01-31 12:40:22 +08:00
George Hotz
c9a3ddb341
benchmark llama walltime script ( #14454 )
...
* benchmark llama walltime script
* adj layers
2026-01-31 10:21:54 +08:00
George Hotz
f5346d6a1a
fix USE_ATOMICS for non float dtypes and make it the default ( #14444 )
...
* embedded multistep test
* complex test
* with jit
* fix dtypes and reenable USE_ATOMICS
* that test didn't catch anything
2026-01-31 09:44:16 +08:00
Christopher Milan
e575dd8275
prevent UB in long decomp and more emulated tests ( #14447 )
2026-01-30 19:38:41 -05:00
chenyu
3204f94454
correct var_vals schedule filter ( #14451 )
...
complete_create_schedule_with_vars returns var_vals that's used in schedule
2026-01-30 17:10:07 -05:00
chenyu
cfcd1debb5
test schedule with multiple AFTER ( #14449 )
2026-01-30 15:59:00 -05:00
nimlgen
486d53d646
device: call free for external_ptr ( #14448 )
...
* device: call free for external_ptr
* lin
2026-01-30 23:53:17 +03:00
nimlgen
e0978498dc
amd: read_ptr/write_ptr/doorbells are not lists ( #14445 )
2026-01-30 23:11:57 +03:00
Christopher Milan
1803ee939d
EMULATED_DTYPES=long works with CPU_LLVM ( #14446 )
2026-01-30 13:54:43 -05:00
chenyu
03613e83ad
update TestTensorMetadata ( #14443 )
...
run with SCACHE=0, some more TODOs
2026-01-30 12:39:01 -05:00
George Hotz
cbb1eed57b
hotfix: partial revert of 9eb449f88, caused llama NaN
2026-01-30 17:19:27 +00:00
chenyu
26f5c00265
move TestTensorMetadata to unit ( #14442 )
2026-01-30 12:14:21 -05:00
chenyu
c05a0b85ae
flip unique const src order [pr] ( #14441 )
...
* flip unique const src order [pr]
matches buffer, simplifies replace_input_buffer
* combine rules
2026-01-30 11:44:18 -05:00
George Hotz
ee2c78709d
mlperf/llama: disable USE_ATOMICS for now
2026-01-31 00:42:08 +08:00
chenyu
beecac4d85
expand ranges -> unroll outer ranges [pr] ( #14440 )
2026-01-30 11:26:05 -05:00
chenyu
9eb449f882
clean up toposort sched_sink [pr] ( #14439 )
2026-01-30 10:18:28 -05:00
George Hotz
838cd078bc
use atomics for embedding backward ( #14400 )
...
* embedding is slow
* failing
* float is fine
* null
* it fails
* simplify embedding with broadcasting
* ATOMIC_ADD incoming
* min change
* simpler test
* better test
* fix test
* real test
* simpler
* cleanups
* types and names
* _zero_kernel
* grad multi
* hack
* none
* multi unshard
* more for call
* don't tag in call
* good
* call_multi
* call_multi wow claude is useless
* embedding backward multi test
* test passes
* fix as_param
* shape_to_shape_arg
* add clip
* before cast
* fix spec=2, use atomics
2026-01-30 18:10:59 +08:00
nimlgen
1998e0bb28
nv: add prof props to dev ( #14437 )
2026-01-30 12:51:43 +03:00
George Hotz
7a9dee4e50
add call/param UOps ( #14433 )
...
* add call/param UOps
* resolve call
* skip that for now
* grad on call
* fix tests
2026-01-30 14:51:45 +08:00
qazal
66d6a68016
viz: sqtt work from cdna gemm ( #14434 )
...
* it's the tag
* initialize rows based on the disasm
* test_cfg with Ops.BINARY
* pyremu wants s_code_end?
* test_diamond
* diff cleanup
2026-01-30 14:00:56 +09:00
Christopher Milan
88caf57ef4
ci: unify python versions ( #14430 )
2026-01-29 21:42:03 -05:00
chenyu
86a204d22a
allow Tensor setitem input to be list/tuple ( #14432 )
...
matches assign, and generally matches numpy
2026-01-29 21:26:58 -05:00
chenyu
4a80319093
clean up split_store final logic [pr] ( #14429 )
...
explicitly check the structure
2026-01-29 18:40:07 -05:00
Christopher Milan
e47f12f671
ci: replace testing_minimal with testing_unit ( #14427 )
2026-01-29 18:02:43 -05:00
wozeparrot
c2fb8b208f
fa: 32 block size ( #14416 )
2026-01-29 13:59:13 -08:00
chenyu
a979fafae5
cleanup around disk buffer [pr] ( #14428 )
...
style change, prep for refactor
2026-01-29 16:18:44 -05:00
nimlgen
dc977a03b0
nv_pma: bw decoder ( #14424 )
...
* nv_pma: bw decoder
* decoder fix
* better
2026-01-30 00:12:39 +03:00
chenyu
ddc041854b
failed test case for disk setitem ( #14426 )
...
strided setitem is wrong
2026-01-29 14:54:19 -05:00
chenyu
31706bf6bc
add few more types [pr] ( #14425 )
2026-01-29 14:04:09 -05:00
nimlgen
2d5c24879f
nv: pma for 5090 ( #14420 )
...
* nv: pma for 5090
* hm
* 4090
2026-01-29 20:06:01 +03:00
nimlgen
c8dc6332d2
memory: read_fields is not universal ( #14348 )
2026-01-29 20:00:00 +03:00
chenyu
dbe8f034a7
pass z3.Context in validate ctx [pr] ( #14423 )
...
does not need to pass the whole solver
2026-01-29 11:11:47 -05:00
chenyu
033ce1b885
types for validate.py ( #14422 )
2026-01-29 10:56:50 -05:00
nimlgen
230d08ec70
test for am recovery and faults handling ( #14421 )
...
* test for am recovery and faults handling
* linter
2026-01-29 17:11:24 +03:00
George Hotz
793afbd473
simplify nn.Embedding, support AFTER in CUSTOM_KERNEL ( #14419 )
2026-01-29 17:22:13 +08:00
Christopher Milan
0c855d6149
ci: remove unused pydeps ( #14418 )
2026-01-29 01:51:26 -05:00
wozeparrot
4845e42135
llama3 gradacc fixes ( #14414 )
2026-01-28 19:12:39 -08:00
chenyu
37cde4a01a
add one line mypy report ( #14415 )
2026-01-28 20:39:32 -05:00
chenyu
15aed51544
return types for all math.py function ( #14413 )
...
calling int() on sint -> int, i think it's better support since some UOp can be safely cast to int
2026-01-28 20:10:11 -05:00
nimlgen
aec1ae0de1
llama: set manual_seed ( #14409 )
2026-01-28 14:40:00 -08:00
chenyu
0870ed28b1
add Self type to MathMixin ( #14411 )
...
these don't cause error
2026-01-28 16:59:38 -05:00
chenyu
079f33c208
fix type in Tensor.mean and Tensor.var ( #14410 )
...
use Tensor.from_uop to wrap UOp from symbolic shape, kernels are the same
2026-01-28 15:24:02 -05:00
chenyu
2b5e99ccc1
minor type cleanups [pr] ( #14408 )
...
mypy --warn-redundant-casts has false negative
2026-01-28 14:11:50 -05:00
chenyu
726415dbc8
import sint directly in movement.py TYPE_CHECKING ( #14406 )
...
avoid creating string TypeAlias, fixed warning in `TYPED=1 python test/test_tiny.py`
2026-01-28 12:47:26 -05:00
nimlgen
acb2fc36ba
nv_pma: add decoder ( #14404 )
...
* nv_pma: add decoder
* cl
2026-01-28 20:44:02 +03:00
chenyu
7b9bc1d8cf
_MockMemoryviewMeta for mockgpu ( #14405 )
...
fixed `PYTHONPATH=. TYPED=1 DEV=AMD MOCKGPU=1 python test/test_tiny.py`. basically make `isinstance(TrackedMemoryView_instance, memoryview)` true
2026-01-28 11:59:00 -05:00
chenyu
93793a645b
use cl.cl_mem instead of internal ctypes._CData ( #14403 )
...
fixed `CHECK_OOB=0 DEV=CL TYPED=1 python test/test_tiny.py`
2026-01-28 10:56:41 -05:00
chenyu
a9b44070a8
fix webgpu runtime types ( #14402 )
...
`CHECK_OOB=0 DEV=WEBGPU TYPED=1 python test/test_tiny.py` passed, also skip tests that failed locally
2026-01-28 10:37:25 -05:00
George Hotz
0c6b3f50aa
add marker to llama training ( #14401 )
2026-01-28 22:44:28 +08:00
Jakob Sachs
2b7c00d3d2
fix sd-example dtype for CLIP embeddings ( #14397 )
2026-01-28 09:07:19 -05:00
qazal
a5a9ce3fdf
viz: disasm cleanups from null emulate ( #14399 )
...
* it's AMDHIPRenderer
* don't need that indent
* less assignment stuff
* that arg order did not make sense
* pmc
2026-01-28 22:03:30 +09:00
nimlgen
544928766d
hcq_smi: kill mac pids ( #14398 )
2026-01-28 15:00:28 +03:00
George Hotz
202b74b369
assembly/amd: continue refactors ( #14386 )
...
* simpler
* merge
* flat
* no ctx
* use the correct apis
* dup code
* write clean code
* remove bad helpers
* bits junk remove
* junk remove
* smem test
* fix tests
* correct fix + tests
* Fmt matters it seems
* wmma refactor
* a lil more
* kimi cleanups
* line
2026-01-28 17:33:03 +08:00
qazal
5bffa17f82
llama train: better NULL=1 EMULATE=AMD_CDNA4 dev experience ( #14395 )
...
* beam opens devices
* switch to hip renderer
* amd: true?
* llvm true is for test_autogen
2026-01-28 17:31:22 +09:00
qazal
0294014108
fix bufferize cost function for multi, improve VIZ=-1 cli ( #14394 )
...
* improve cli
* remove_bufferize change
2026-01-28 15:53:18 +09:00
qazal
c158acea29
failing multi ram usage test from llama gemm ( #14392 )
2026-01-28 14:32:32 +09:00
Christopher Milan
067e27857e
nested composite actions don't work ( #14393 )
2026-01-28 00:13:30 -05:00
Christopher Milan
9dddf3d478
don't save caches for PRs, try 2 ( #14391 )
2026-01-27 23:30:17 -05:00
Christopher Milan
68fe5d8b36
Revert "don't save caches for PRs ( #14389 )" ( #14390 )
2026-01-27 23:22:26 -05:00
Christopher Milan
4ab228b498
don't save caches for PRs ( #14389 )
2026-01-27 23:21:31 -05:00
Christopher Milan
5e36482314
decompose long to ints where unsupported, try 2 ( #14383 )
2026-01-27 23:20:43 -05:00
wozeparrot
e496547720
llama3 gradacc ( #14291 )
2026-01-27 19:48:10 -08:00
George Hotz
88bc5ee212
assembly/amd: rename to better names ( #14384 )
...
* assembly/amd: rename to better names
* might help fuzzing segfault
* emu2 -> emu
2026-01-28 10:00:54 +08:00
George Hotz
065b95cfb0
Revert "add retry to fetch ( #14370 )" ( #14385 )
...
This reverts commit dc4d7f2d55 .
2026-01-28 09:35:37 +08:00
Eitan Turok
dc4d7f2d55
add retry to fetch ( #14370 )
2026-01-27 14:04:25 -08:00
chenyu
8d1f3c8885
fix copysign for inf input ( #14381 )
...
* fix copysign for inf input
* llvm olt
2026-01-27 16:45:48 -05:00
Christopher Milan
289a3e415e
also skip test_nonoverlapping_shrink_assignment ( #14382 )
2026-01-27 16:26:26 -05:00
Christopher Milan
f34efc1ad1
DISABLE_FAST_IDIV actually works as a ContextVar ( #14378 )
2026-01-27 16:12:42 -05:00
chenyu
8c899e4aaf
fix copysign for -0 ( #14380 )
...
test that both x and 1/x < 0 work too. also found another bug with the * 0 hack
2026-01-27 15:44:58 -05:00
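Illustration (not part of the commit): a plain-Python sketch of why negative zero is a corner case for copysign. It does not reproduce the tinygrad implementation; `naive_copysign` and `sign_bit` are hypothetical helpers written for this example.

```python
import math
import struct

def sign_bit(x: float) -> int:
    """Return the IEEE 754 sign bit of a double (1 for negative, including -0.0)."""
    return struct.unpack("<Q", struct.pack("<d", x))[0] >> 63

def naive_copysign(mag: float, sgn: float) -> float:
    """Copysign built on a `< 0` comparison; this misses -0.0 because -0.0 < 0 is False."""
    return -abs(mag) if sgn < 0 else abs(mag)

def bitwise_copysign(mag: float, sgn: float) -> float:
    """Copysign that reads the sign bit directly, so -0.0 counts as negative."""
    return -abs(mag) if sign_bit(sgn) else abs(mag)

print(naive_copysign(1.0, -0.0), bitwise_copysign(1.0, -0.0), math.copysign(1.0, -0.0))
```

The comparison-based version treats `-0.0` as positive, while the sign-bit version agrees with `math.copysign`.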
chenyu
62884585a7
failed test case for copysign -0.0 ( #14379 )
...
* failed test case for copysign -0.0
* skip those
2026-01-27 14:37:17 -05:00
nimlgen
ec1b28bc2c
am: exit early in case of failures ( #14376 )
...
* am: exit early in case of failures
* sorry, pre-linter
* reset when error state
2026-01-27 22:10:02 +03:00
chenyu
cd22ee9ed0
add InvalidType to ConstType [pr] ( #14373 )
...
* add InvalidType to ConstType [pr]
TYPED=1 python test/test_tiny.py passes.
added PyConst = float|int|bool for some Tensor level input types
* hcq
2026-01-27 14:09:34 -05:00
Christopher Milan
5b42a1357b
SCACHE=0 works with DEBUG ( #14377 )
2026-01-27 13:12:43 -05:00
chenyu
db010a31be
IGNORE_OOB -> CHECK_OOB [pr] ( #14374 )
...
flip the meaning
2026-01-27 12:20:59 -05:00
chenyu
c22667b0c4
also skip test_overlapping_shrink_assignment_reverse ( #14375 )
...
crashing
2026-01-27 12:20:39 -05:00
nimlgen
e52d58b041
autogen: update amd ( #14372 )
2026-01-27 19:53:14 +03:00
nimlgen
cbf94a0a95
nv: exit early in case of failures ( #14363 )
...
* nv: exit early in case of failures
* f
* cleaner
2026-01-27 19:16:22 +03:00
nimlgen
ec691cb299
am: print sq intrs ( #14366 )
...
* am: print sq intrs
* cleaner
2026-01-27 18:28:13 +03:00
qazal
a5f3d46423
hcq: do not assume kernel names are unique ( #14371 )
...
* hcq: do not assume kernel names are unique
* colored kernel name
2026-01-27 23:03:15 +09:00
George Hotz
e5df7e640b
fix branches in amd_asm_matmul ( #14369 )
2026-01-27 20:48:42 +08:00
George Hotz
0ced258726
HOTFIX: skip crashing assign test
2026-01-27 20:35:17 +08:00
George Hotz
131ae604de
force_transcendental on sqrt ( #14368 )
2026-01-27 20:24:41 +08:00
imaolo
14574c68fa
Add ContextVar to disable the scheduler cache ( #14257 )
...
* add scheduler cache ContextVar
* test scheduler cache context var
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2026-01-27 19:55:29 +08:00
George Hotz
bfc88bcfb8
assembly/amd: emu refactors + enable PYTHON_REMU by default ( #14361 )
...
* assembly/amd: start refactors
* cleanups
* those are global
* methods on ctx
* const cleanup
* range helper
* types and imports
* cleanups
* cleanups
* remove stale name
* fix emu2 types
* more typing
* more mypy
* cleanups
* fxns
* scc cleanup
* cleanups
* cleanups
* simpler parse_pcode
* laneid
* no defaults for pcode
* pcode is not optional
* cleanups
* functions cleanup
* splat
* expr_parser functions
* single tok
* invert global loops
* try_eat
* minor
* run parser on all
* no silent 0
* tests
2026-01-27 17:42:24 +08:00
Christopher Milan
2e72625652
Revert "decompose dtypes.long to ints where unsupported ( #14261 )" ( #14362 )
2026-01-27 02:04:59 -05:00
qazal
f866b2a513
mfma loop in asm dsl ( #14349 )
...
* mfma loop in asm dsl
* work
2026-01-27 11:11:37 +09:00
Christopher Milan
0793319929
decompose dtypes.long to ints where unsupported ( #14261 )
...
* add works
* use carry not overflow
* bitwise ops
* use tag instead of vec
* cleaner
* mul somewhat works
* mul actually works
* SUB and NEG work
* SHL/SHR
* ulong support
* this should work?
* oops
* fix indexing
* all ALU mostly works
* refactor
* test_dtype passing
* signed division works
* format
* clean
* some tests
* ruff
2026-01-26 18:34:13 -05:00
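Illustration (not part of the commit): the "use carry not overflow" step can be sketched in plain Python, assuming a 64-bit value is held as two unsigned 32-bit halves. `split64`, `join64`, and `add64` are hypothetical names and are not tinygrad APIs.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def split64(x: int) -> tuple[int, int]:
    """Split an unsigned 64-bit value into (low, high) 32-bit halves."""
    x &= MASK64
    return x & MASK32, x >> 32

def join64(lo: int, hi: int) -> int:
    """Rebuild the 64-bit value from its halves."""
    return (hi << 32) | lo

def add64(a: tuple[int, int], b: tuple[int, int]) -> tuple[int, int]:
    """Add two 64-bit values using only 32-bit additions plus a carry."""
    a_lo, a_hi = a
    b_lo, b_hi = b
    lo = (a_lo + b_lo) & MASK32
    # An unsigned add wrapped around exactly when the result is smaller than an operand.
    carry = 1 if lo < a_lo else 0
    hi = (a_hi + b_hi + carry) & MASK32
    return lo, hi

print(hex(join64(*add64(split64(0xFFFFFFFF), split64(1)))))
```

The carry is derived from an unsigned comparison on the low halves, so no wider integer type is needed on the target.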
wozeparrot
a987a4abc3
feat: llama8b dev_beam.sh ( #14358 )
2026-01-26 14:51:23 -08:00
Christopher Milan
c9c533fc78
libclang path is homebrew on macos ( #14357 )
...
* libclang path is homebrew macos
* typo
* ugh
* typo
* regen
* no LIBCLANG_PATH
2026-01-26 17:32:09 -05:00
chenyu
d641e63189
improve min/max for AND ( #14356 )
2026-01-26 15:44:18 -05:00
chenyu
f16372487a
fix assign hazard on shrink ( #14355 )
...
* fix assign hazard on shrink
possible to have race if both assign src and dest are shrink
* test_nonoverlapping_shrink_assignment
2026-01-26 14:46:30 -05:00
chenyu
145df879c1
find_permutes -> fix_assign_hazard [pr] ( #14354 )
...
some noop tweaks and comment updates
2026-01-26 14:05:19 -05:00
nimlgen
e152f1b0f5
llama: use ALL2ALL ( #14353 )
2026-01-26 22:01:53 +03:00
nimlgen
3f25eb3026
am: ih ( #14346 )
...
* am: ih
* um
* fix
* line
* no trap and fix ring
* keep
* fix
2026-01-26 20:11:04 +03:00
chenyu
823bc17fb5
failed test case for shrink overlap assigns ( #14350 )
...
* failed test case for shrink overlap assigns
current logic can create a race resulting in wrong output
* skip for now
2026-01-26 11:58:45 -05:00
George Hotz
204f51e739
assembly/amd: bug fixes for PYTHON_REMU ( #14347 )
...
* default PYTHON_REMU to 1
* mockgpu
* less size
* normal compile path
* unique
* more
* fix clamp
* Change PYTHON_REMU default to 0 in _try_dlopen_remu
2026-01-27 00:48:22 +08:00
chenyu
231305603d
remove REAL_DEV [pr] ( #14337 )
...
it's just Device.DEFAULT now
2026-01-26 10:08:16 -05:00
Martin Szewieczek
9cbe99348a
func meshgrid: change param index to type str ( #14331 )
2026-01-26 10:07:56 -05:00
George Hotz
3b43d26f10
assembly/amd: emu speed ( #14344 )
...
* assembly/amd: emu speed
* fix spec
* go
* don't do this
* simpler
* no stupid consts
* hack
* simpler
* no index
* no where
* faster linearizer
* fix spec
* no index dtype
2026-01-26 22:21:34 +08:00
George Hotz
774a454bb5
assembly/amd: fix scratch SVE ( #14340 )
...
* assembly/amd: default python REMU
* mem_used
* no lane
* sve
* remove that
* needs s_code_end in tests
2026-01-26 21:03:51 +08:00
qazal
2d91fe6310
use amdgpu dsl in mmapeak ( #14342 )
...
* use amdgpu dsl in mmapeak
* don't rely on llvm for vgpr counting
* llvm roundtrip assert
* rm it, add ci
* vgpr_count
* move emulated test to amd, it needs comgr
* env
* arch
* inst._fields -> inst.operands
* vgpr offset
2026-01-26 22:03:43 +09:00
qazal
b2e2ace85b
viz: remove ci check, it's VIZ=-1/-2 ( #14343 )
2026-01-26 20:36:23 +09:00
George Hotz
be23776ba7
assembly/amd: replace pcode with ucode ( #14002 )
...
* a bunch of todos for my boy claude
* uops have types
* lil cleanups
* simpler ucode
* isNAN
* calls
* move more
* cleanup pcode_parse
* cvt functions
* fix parser bugs
* no void
* minmax
* more pcode parse
* pretty print
* transform
* comments
* move to transform
* assign/declare
* simpler norm
* single PM
* just Uops
* simpler
* more typed
* all rewrite
* less verbose
* work
* spec
* transform
* work
* simpler spec
* less spec
* bitcast
* simpler
* simp ucode
* work
* more in pcode_transform
* remove junk
* more functions
* bug
* no void assign
* load/store
* wave
* fixes
* move denorm
* move more functions
* tests
* cat is shape None
* uop syntax
* move a few more
* program_spec
* cat stuff
* assign fix clear
* unused
* nans
* fp bits
* works with simplify
* remove junk
* special
* meh
* more
* more
* update test pcode parse
* improve parser
* parse some for loops
* merge master
* dead files
* tests pass
* emu2
* better emu2
* test_plus works
* uselessly write more instructions
* use pcode
* something
* something
* bench_emu
* progress
* ds works
* work
* work
* more passing
* run compare
* bench_emu
* more pcode
* a few more
* bugfixes
* bugfix
* test fixes
* tests pass without USE_HW
* all hw tests pass
* add more hw tests
* new hw tests
* bit
* less handcode
* parse more
* consolidate pcode
* fixes
* rsrc
* lane pcode
* cleanups
* simpler
* emu bugs
* one cmp test fails
* fix decode and upd name
* fix name and test harness
* _ftz_f32
* fix denorm
* fix VOPD and use load
* fix carry bug
* no load where / just invalid
* clean
* simpler
* merge sops
* refactoring
* simplifications
* bugfixes
* new tests
* f16 sin fix
* assertion and hw tests
* cvt functions
* one more failure
* bugfixes
* bugfix + regression
* more tests
* fmac
* no manual unrolling
* ordering
* LLVM backend is a lot faster
* compile inst
* more bugs
* f16
* bugfix
* fix regression
* one clang call
* 1M inst
* scratch works
* do scratch correctly
* cleanup
* regression
* cmp
* fmamk fixes
* merge
* fix vcmpx
* unify memory
* remove unused code
* ignore oob for test
* cleanups
* fix mbs
* unify cmp
* test
* minor cleanups
* bump timeout
* fix tests
* revert the CMPLE stuff
* remove opt
* less diff
* simpler
* revert
* support multiple backends
* memset is a lot faster
* split out in bench emu
* improve timing
* timing
* cache that
* cache that
* simpler and faster
* tokenize
* binop table
* simpler
* move to parser
* tok for lambda
* refactor
* expr_parser
* delete emu2_pcode
* import cleanup
* lil
* if parse
* work
* simpler
* no v
* trig preop is faster
* durations for tests
* fix cmp bug
* sdst
* remove scratch_size hack
* null behavior
* _MXCSRContext
* bugfixes
* DEBUG >= 3
* test smem crashes my gpu
* debug
* test
* test smem
* profiler
* full inst
* bugfix
* rtag(1)
* pc is 64-bit and word
* pc is real code now
* dynamic
* more dynamic
* fix oob access
* fix crash, more dyn
* all dyn
* really all dyn
* correct null mask
* lit + format
* 21s on the tests
* 13s on the tests
* canonical name
* simm16
* more dyn
* 14s
* proper saddr dedup
* dyn
* debug 5
* better 5
* revert dynamic stuff
* that can be dyn
* negative offsets
* dyn wmma
* f16 wmma support / ops / dtype / dtype_alu
* symbolic changes not needed
* ConstFloat
* more uop.const
* __eq__
* uop tests
* fix f16
* bf16 tensor cores
* whitespace
* remove cast roundtrip
* Revert "remove cast roundtrip"
This reverts commit c5bb0381c3 .
* just the fix
* remove dead paths
* llvm runs
2026-01-26 18:04:29 +08:00
George Hotz
984cdc4840
add wrapper class for the -0.0 != 0.0 issue ( #14339 )
...
* add wrapper class for the -0.0 != 0.0 issue
* fixes
* spec fix
* missed one
2026-01-26 16:52:37 +08:00
qazal
92bfe92138
assembly/amd: fix cdna mfma xml ( #14329 )
...
* handwritten failing test
* new amdxml
* more mfma from fixes
* ci
* move arch of test integration
* alt
* amdxml human cleanup
* _TestIntegration rename to IntegrationTestBase
* it's the same problem as _LIT
* better comment
* better variable name
2026-01-26 17:51:26 +09:00
Garret Castro
6c109f4d75
LLVM: CPU threading support ( #14320 )
...
* make generic llvmrenderer class for cpu and amd
* move `tensor_cores` back to parent
* remove empty line
* restore extra matcher position
* add threading
* dont need to add core_id here
* dont move code for workitem
* cleanup
---------
Co-authored-by: TheVanadium <claude_user@ret2022.localdomain>
2026-01-26 13:12:39 +08:00
George Hotz
cc49e47ea2
tinygrad changes from ucode ( #14336 )
...
* tinygrad changes from ucode
* dtype
2026-01-26 11:30:18 +08:00
Garret Castro
8477368d07
generic LLVMRenderer class for CPU and AMD ( #14321 )
...
* make generic llvmrenderer class for cpu and amd
* move `tensor_cores` back to parent
* remove empty line
* restore extra matcher position
* cleanup
---------
Co-authored-by: TheVanadium <claude_user@ret2022.localdomain>
2026-01-26 09:11:49 +08:00
George Hotz
11ce1e847d
llama train: null device support
2026-01-26 08:53:05 +08:00
chenyu
e3601788fa
update torch backend function ( #14333 )
...
those have tensor.py implementation
2026-01-25 16:39:34 -05:00
nimlgen
9865f51e39
cupti: ref collector ( #14330 )
...
* cupti: ref collector
* ll
2026-01-25 20:35:21 +03:00
nimlgen
21ab23ae18
nv: add pma for ada ( #14328 )
...
* nv: add pma for ada
* um
* fix
* shorter
* mock
2026-01-25 17:33:37 +03:00
George Hotz
49db266b96
ReprEnum for repr roundtrips ( #14327 )
...
* ReprEnum for repr roundtrips
* dsl
* bugfixes
* vdsty fixes
* cleaner
* fix
* fix cdna fields
* tests all pass
2026-01-25 18:58:31 +08:00
qazal
bf2d9d138f
viz: simplify amdgpu cfg ( #14326 )
...
* viz: replace llvm disasm with our disasm
* it starts with more code
* then it becomes less
* simpler, cdna disassembles with decimal simm16
* s_branch is upper case, add test
* simm16s and others
2026-01-25 15:21:45 +09:00
qazal
647e527a7e
viz: replace llvm disasm with our disasm ( #14325 )
2026-01-25 13:56:56 +09:00
nimlgen
4280a8eef2
am: update fw ( #14323 )
2026-01-25 01:08:47 +03:00
chenyu
7e41da1ae8
fix generate_dataset.sh ( #14324 )
...
added `set -e` so wrong paths would fail the script, then fixed the path
2026-01-24 16:47:10 -05:00
chenyu
311bfd91d6
clean up where_on_load [pr] ( #14322 )
...
no repeated split_uop and general cleanup
2026-01-24 14:43:43 -05:00
nimlgen
8b282ba6d2
memory: reserved vram ( #14318 )
2026-01-24 19:39:24 +03:00
chenyu
00e9ba0b82
update type for split_uop and where_on_load [pr] ( #14319 )
...
also variable names in where_on_load, before logic update
2026-01-24 11:17:41 -05:00
chenyu
cb69b7b2b2
comment out fold_where_closure ( #14316 )
2026-01-24 10:15:42 -05:00
wozeparrot
d74587f16d
fa multi fix 2 ( #14314 )
2026-01-23 23:35:02 -08:00
chenyu
d9f0ad1d87
update return type for Tensor.tolist ( #14313 )
...
Sequence is incorrect because the result can be a list of lists, so use Any to avoid a recursive type
2026-01-23 23:21:49 -05:00
qazal
807bc40931
assembly/amd: dsl and disasm cleanup ( #14311 )
...
* rdna4 inst helper
* remove dsl aliases
2026-01-24 11:36:12 +09:00
Christopher Milan
e782d44918
WEBGPU/NIR truncates ints ( #14307 )
...
* WEBGPU truncates ints
* nir has this bug too
2026-01-23 19:28:06 -05:00
nimlgen
26220a472e
no core_id ( #14265 )
...
* no core_id
* kwargs
* est
* linters
* ugh
* revert this
* deps
* glb
* should work?
* nn
* line
* fx
* ym
* z
* d
* um?
* revert
* this one?
* first half
* um p2
* all?
* um
* cleaner
* um
2026-01-23 21:30:12 +03:00
chenyu
e65bc7a7c5
where closure folding ( #14304 )
2026-01-23 10:55:13 -05:00
chenyu
d5a3b02a9c
clean up xpow ( #14295 )
...
mostly for `ret * (base < 0).where(adj, ret.const_like(1))` -> `(base < 0).where(neg_base, ret)`, since it's good for NAN neg_base but not generic
2026-01-23 10:19:47 -05:00
qazal
b913c910c5
assembly/amd: rdna4 passing test_roundtrip ( #14300 )
...
* test_roundtrip on different archs
* failing tests
* take RDNA4 xml changes from the emu branch
* work
* min diff to disasm flat
* test_add passes, rdna4 first
* correct vgpr field for the multi dword store stuff
* amdllvm
* recompile in roundtrip, get sources from emulator
* amdllvm, 2
* clean clean
* note, don't rely on that os.environ
---------
Co-authored-by: George Hotz <geohot@gmail.com>
2026-01-23 21:33:53 +09:00
qazal
f3b0e42863
remove extra sqtt pickles in gfx1200 ( #14302 )
2026-01-23 20:13:48 +09:00
George Hotz
d116312b1a
get cdna sqtt working ( #14301 )
...
* get cdna sqtt working
* cdna parser
* wavestart/waveend
* names
* cdna
* test that
2026-01-23 18:46:15 +08:00
George Hotz
a5c4fa39d1
RDNA4 support in SQTT ( #14299 )
...
* table test
* cleanups
* dead file
* delta short
* tests
* delta test
* work
* l4 tests pass
* l0
* cdna
* print
* revert
* wave failure
* wave failure
* test
* encs
* no l0 crap
* L4
* rdna4 sqtt
* notes
* linter
2026-01-23 16:16:45 +08:00
wozeparrot
963c59ebdb
fix: pull fixes from gradacc branch ( #14296 )
2026-01-22 23:07:54 -08:00
Christopher Milan
68668b8f28
fix WEBGPU NEG ( #14298 )
...
* fix WEBGPU NEG
* add test
* parenthesize
2026-01-23 01:44:52 -05:00
qazal
3b8a7bb8c9
use existing roc.py infra for sqtt tests ( #14297 )
...
* add pc, per kernel tracing
* work
* remove those imports
* min diff
2026-01-23 14:07:11 +09:00
chenyu
5f32f7a06b
fix winograd padding order ( #14294 )
2026-01-22 23:00:14 -05:00
George Hotz
52b989c6c8
don't place consts early + fixes from anthropic challenge ( #14286 )
...
* don't place consts early
* add anthropic challenge
* with ref
* do we still have to devectorize bools?
* tests pass
* just WHERE
* fine, revert that
* fine, revert
* only index
* z3 validator doesn't support vectorized
* Revert "z3 validator doesn't support vectorized"
This reverts commit 1b7930ecb3 .
* z3 not for vec
* no spec
* VLIWRenderer
* loop unrolling
* better comments
* cleanups
* skip cast
* renderer
* cleanups
* prints
* no hack
* hacks
* bump to 11
* reg warning
* lil clean
* cleaner renderer
2026-01-23 10:48:39 +09:00
chenyu
0903782bc0
remove few dead or unneeded codes [pr] ( #14275 )
2026-01-22 20:05:43 -05:00
chenyu
3eb5cd7d32
stronger test_rand_is_lazy ( #14293 )
2026-01-22 18:58:53 -05:00
chenyu
c15b6e6709
update test_randn_finite skipped device ( #14292 )
2026-01-22 18:26:02 -05:00
chenyu
073c6a81b5
raise if Tensor._buffer is called during jit ( #14114 )
...
* raise if Tensor._buffer is called during jit
* cleaner
2026-01-22 17:30:18 -05:00
nimlgen
8cd22df2dd
amd: alive wgps ( #14149 )
...
* amd: disabled wgps
* l
* wgp
* uoops
* mockgpu
* drm
* ad this
* fi
* reg
2026-01-23 00:08:45 +03:00
chenyu
a738c4bb22
test symbolic view broken with jit ( #14290 )
2026-01-22 13:44:47 -05:00
chenyu
f22fa6a5be
test rand is lazy ( #14289 )
2026-01-22 13:07:55 -05:00
chenyu
1726b884f2
update test_jit_v_nojit_random_regen ( #14288 )
...
current behavior is that jit and non-jit consume random seed differently, still the random values are different
2026-01-22 12:21:47 -05:00
chenyu
fbed36fa15
jit graph handle input==output aliasing ( #14287 )
...
a position that wasn't an input during capture should never become an input during execution, but graph cannot tell this by jit_cache and input_buffers only
2026-01-22 11:37:41 -05:00
chenyu
8bb61c2490
stronger test_graph_input_output_aliasing ( #14282 )
...
* stronger test_graph_input_output_aliasing
* confirmed failure
2026-01-22 09:59:34 -05:00
qazal
d7afa02085
clean up the extra/sqtt directory ( #14284 )
...
* remove legacy test_timing stuff
* remove legacy test_pmc, update active_sqtt_parse
2026-01-22 19:10:59 +09:00
qazal
dff5f361b0
support rendering assembly kernels on the NULL backend ( #14283 )
...
* assembly custom kernels in DEV=NULL, use renderer arch
* update mmapeak
* llvm
2026-01-22 15:49:07 +09:00
qazal
dfefeddeed
add tflops to cdna gemm custom kernel ( #14281 )
2026-01-22 12:48:28 +09:00
qazal
18f408a35a
custom assembly kernel with variable tests ( #14280 )
...
* custom assembly kernel with variable tests
* different threads
* sink
* zeros like / flatten
2026-01-22 11:34:17 +09:00
chenyu
4de107b764
jit graph bug when input is output ( #14278 )
...
* jit graph bug when input is output
wrong result in llm
* not just metal
2026-01-21 18:49:52 -05:00
wozeparrot
76a9242a66
fa: merge kv bwd into one kernel ( #14277 )
2026-01-21 15:24:41 -08:00
chenyu
6279ae4a94
remove llm generate always reset start_pos ( #14276 )
...
* remove llm generate always reset start_pos
by itself seems like a bug, also added a test to repro forward_jit.reset() issue
* issue is jit graph, so revert that test
2026-01-21 16:54:30 -05:00
nimlgen
da1fedc3c8
working ioctls ( #14272 )
2026-01-21 20:29:04 +03:00
chenyu
574d171fa6
fix onnx Pad constant_value=None ( #14271 )
...
also removed a dead branch in _resolve_pool_pads
2026-01-21 11:51:34 -05:00
chenyu
a18d34be1e
simpler split_store outer range check [pr] ( #14273 )
...
also fixed comment
2026-01-21 11:51:14 -05:00
chenyu
e64111ad08
update all_same [pr] ( #14270 )
...
add type annotation and unit test
2026-01-21 11:26:15 -05:00
chenyu
9ad3c865ac
fix bug in logsumexp keepdim=True ( #14268 )
2026-01-21 09:49:55 -05:00
George Hotz
41d00a046d
add device to local, fix PCONTIG=2 ( #14266 )
...
* add device to local, fix PCONTIG=2
* regression test
* remove the device when we render
* viz slowness
* no long
2026-01-21 22:12:18 +09:00
wozeparrot
c1d14ea832
llama8b train fixes ( #14264 )
2026-01-20 20:34:47 -08:00
qazal
549dbabfcb
move ALLOW_DEVICE_USAGE=0 to get_program [pr] ( #14263 )
2026-01-21 12:56:05 +09:00
qazal
78a28227c6
assembly/amd: cdna4 mfma support ( #14206 )
2026-01-21 09:12:05 +09:00
George Hotz
1baefed530
assembly/amd: add hw tests from ucode branch ( #14259 )
...
* assembly/amd: add hw tests from ucode branch
* fix is per lane
2026-01-21 08:53:54 +09:00
wozeparrot
ba90e1b52e
feat: script to run llama8b training ( #14239 )
2026-01-20 12:44:06 -08:00
Christopher Milan
daf9414bff
fix nullptr arg to CUDA_KERNEL_NODE_PARAMS_v1 ( #14256 )
...
* fix nullptr arg to CUDA_KERNEL_NODE_PARAMS_v1
* ruff
2026-01-20 12:30:07 -05:00
chenyu
e04767e39e
run pre-commit in ci ( #14253 )
...
* run pre-commit in ci
prevents pre-commit regression
* IGNORE_OOB=1
* pytest
* unit test
* split
2026-01-20 12:24:33 -05:00
nimlgen
22af7132cd
fix test_dev_jitter_matrix ( #14255 )
2026-01-20 20:07:51 +03:00
Robbe Derks
c7fbd177d4
USBGPU: debug script for comma chestnut ( #14252 )
...
* initial debug script
* improvements
2026-01-20 18:52:25 +03:00
C T
26f8b12e01
Whisper audio helpers (mel filters in tinygrad) ( #13478 )
...
* add whisper audio helpers for stft/mel/resample
* cleanup
* add whisper stft test
* make only stft test explicitly depend on librosa
* extract sinc_window_kernel
* dehardcode device
* use same device argument
* simplify
* type annotate
* ruff format audio_helpers.py
* ruff format test_whisper.py
* add WHISPER_NEW_STFT
* rename
* undo ruff format changes
* use new stft and mel for whisper
* remove stft test that depends on librosa
* remove whitespace
* add Tensor.log10 with test\test_ops.py::TestOps::test_log10
* use Tensor.log10
* fix lint
* future: remove unused STFT class
* future: remove resample code since it isn't used (yet)
* match openai with pad_mode="reflect"
* pad_to
* future: cut resample leftovers
* cleanup
* add mel tests
* future: cut stft
* future: cut non-mel prep_audio changes
* reduce diff
* move audio_helpers.py to examples
* reduce whitespace
* fix imports
* reduce whitespace
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
2026-01-20 10:50:02 -05:00
nimlgen
dc82856084
tbgpu: shim binary + remote apl pci dev ( #14124 )
...
* shim binary + remote pci dev
* v2
* rip out apl
* cmds
* rename
* clean
* remove
* rm gitignore
* ui
* install
* linter
* um
* cleaner
* assets
* normal install in ui
* cleaner app
* install script
* support fd mmap
* cleaner
* kill server when disconn
* rename + pcidevs
* sign
* install and reinstall
* no sip install
* will trigger update
* nv
* ugh
* this
* fix
* nv
* use nosip sign
* auto install
* remove
* mypy
* upd
* ditto
* print
* simpler
* ditto
* um
* simpler
* upd
* upd
* cleaner
* autogen
* cleaner
* move
* annotations
* server cleaner
2026-01-20 16:15:18 +03:00
qazal
4548fcc1b8
amd/sqtt: add rdna4 and cdna sqtt examples ( #14251 )
...
* amd/sqtt: add rdna4 and cdna sqtt examples
* work
* comment out rdna and cdna tests
2026-01-20 21:11:48 +09:00
qazal
2dc281b32a
assembly/amd: test helpers for arch to gfx target mapping ( #14250 )
2026-01-20 20:35:09 +09:00
nimlgen
823e88c0d0
nv: request bar 3 ( #14249 )
2026-01-20 13:52:38 +03:00
qazal
dddd0e384f
ALLOW_DEVICE_USAGE=0 in codegen ( #14238 )
2026-01-20 15:15:16 +09:00
George Hotz
0243f4a0f1
clear wins from ucode branch ( #14243 )
...
* clear wins from ucode branch
* two more
* revert those
2026-01-20 15:11:09 +09:00
George Hotz
5e24643889
minor import speedups ( #14244 )
...
* minor import speedups
* server stuff in server places
* pre-commit
* fix
2026-01-20 15:05:36 +09:00
George Hotz
d60a155e48
defer compilation of upats ( #14242 )
...
* defer compilation of upats
* mypy
2026-01-20 13:50:00 +09:00
George Hotz
56c8926d32
import speedups: refactor validate to late import ( #14241 )
...
* refactor validate to late import
* pre-commit stuff
* fix mypy
2026-01-20 13:23:39 +09:00
chenyu
9d3b1cf1e7
simpler _cached_to_python_const ( #14236 )
2026-01-19 23:10:53 -05:00
qazal
b1c5a242b7
Revert "move is_dtype_supported logic to renderer ( #14188 )" ( #14237 )
...
This reverts commit 161fee9a48 .
2026-01-20 12:19:14 +09:00
wozeparrot
1f89eaf790
tk: fa bert mask fix + some numerical stability improvements ( #14214 )
2026-01-19 19:18:07 -08:00
chenyu
9ea63d7d52
failed test case for onnx IF with jit ( #14235 )
...
silently fails now since onnx treats IF cond as a const
2026-01-19 18:10:05 -05:00
Garret Castro
b65dc9fd8e
refactor: use generic type for ContextVar [pr] ( #13998 )
...
* use generic type for context var
removes ops_python string cast thing, allows for handling of other string vars like `_CC`
* update Context.old_context type
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
2026-01-19 13:37:54 -05:00
Martin Szewieczek
7010c176cf
pre commit: fix path to test_assign.py ( #14231 )
2026-01-19 13:36:30 -05:00
Christopher Milan
34f6192739
look for cuda in /opt/cuda ( #14230 )
...
* look for cuda in /opt/cuda
* regen
2026-01-19 11:51:00 -05:00
qazal
0f61cbd51f
viz: draw shapes directly on the canvas ( #14229 )
2026-01-20 00:57:06 +09:00
nimlgen
acb0045ba0
system: alloc_sysmem is part of interface ( #14226 )
2026-01-19 18:15:54 +03:00
qazal
ab426cb671
viz: simplify row line logic ( #14227 )
2026-01-20 00:00:28 +09:00
nimlgen
01653db4fd
nv: GPPut is mmiointerface ( #14225 )
2026-01-19 17:36:26 +03:00
nimlgen
7cb7abeeb0
amd: fix scratch_wave64_lane_byte_size ( #14223 )
2026-01-19 15:21:39 +03:00
nimlgen
979ce211f7
amd: missing self in aql's exec ( #14224 )
2026-01-19 14:27:54 +03:00
George Hotz
31bcbed6bb
AMD_DISABLE_SDMA for testing with -n12 ( #14216 )
2026-01-19 16:10:30 +09:00
qazal
578a4a50d3
viz: row lines in timeline ( #14213 )
...
* simple start, already works for memory graph
* add height to exec packets
* math.max, border-color
* borderline is in pixels
* row border color
2026-01-19 13:01:43 +09:00
Christopher Milan
161fee9a48
move is_dtype_supported logic to renderer ( #14188 )
...
* move is_dtype_supported logic to renderer
* fix CPU_COUNT
* mypy happy
* early import libclang too with llvm
* run with debug
* skip autogen tests if MTLCompiler or llvm is loaded
* run autogen tests separately in CI
* lint
2026-01-18 22:37:04 -05:00
qazal
7abe9b020f
viz: add border colors to pkts timeline ( #14211 )
...
* viz: add border colors to pkts timeline
* 10
2026-01-19 11:37:46 +09:00
chenyu
67d9712ef6
jit copy aliased output if it's read later ( #14210 )
2026-01-18 18:48:59 -05:00
chenyu
97333b1954
jit footguns test case on assign with same buffer outputs ( #14209 )
...
related https://github.com/tinygrad/tinygrad/issues/13364
2026-01-18 16:01:09 -05:00
chenyu
e7c2df9113
improve consecutive Tensor indexing ( #14208 )
...
* improve consecutive Tensor indexing
instead of O(idx_counts*src_dims), it can just be O(idx_counts)
* test correctness
2026-01-18 15:14:33 -05:00
chenyu
c7b8f6496f
remove dtypes.index_like and dtypes.fields [pr] ( #14207 )
...
barely used, so just use inline and DTYPES_DICT
2026-01-18 11:49:01 -05:00
qazal
e27a0002c5
viz: only keep the sqtt bytes for pkts ( #14203 )
...
* viz: only keep the sqtt bytes for pkts
* better option name
* work
* renames
2026-01-18 17:04:26 +09:00
qazal
d8f87ae2f2
SQTT packets to assembly mapper ( #14198 )
...
* disasm + compare to llvm
* start inst trace
* base tests pass
* work
* work
* all kernels
* qol
* refactor
* work
* work
* wave_focus
* simple
* work
* add a lot of asserts
* focus on wave0
* correct handling of IMMEDIATE_MASK
* work
* viz work
* use the metadata infra
* better
2026-01-18 16:32:13 +09:00
Christopher Milan
1eb110cd7d
fix memory corruption in NIR, reenable process replay ( #14204 )
2026-01-18 02:05:12 -05:00
George Hotz
a51e0a86db
assembly/amd: clean up disasm.py + add CDNA support ( #14200 )
...
* assembly/amd: clean up disasm.py
* cleanups
* add missing encodings
* decode is pretty
* cdna
* assert on failure
* cdna roundtrip
* cdna passing
* test
* lil cleanup
* variant cleanups
* cleanups
2026-01-18 14:48:44 +09:00
chenyu
4b18c92bc5
simpler Context.__enter__ [pr] ( #14201 )
2026-01-18 00:38:59 -05:00
qazal
feaa804158
skip lvp process replay in CI [pr] ( #14202 )
2026-01-18 13:25:04 +09:00
chenyu
b12a9fea80
runtime int call instead of cast(int) ( #14183 )
2026-01-17 20:34:45 -05:00
George Hotz
79c1559f69
amd asm can still be simpler ( #14199 )
...
* amd asm can still be simpler
* simpler
* V_LANE_ID
* simpler
* simpler
* compact vgpr
2026-01-17 18:40:10 +09:00
chenyu
5e6a72c33f
new Onnx Gather ( #14187 )
...
instead of assuming const indices, check if it showed as a const
2026-01-16 22:24:07 -05:00
George Hotz
9f7f2f0e0c
MAX_SQTT_PKTS
2026-01-17 12:05:36 +09:00
George Hotz
50554115ee
fix VALU_SALU / IMMED_MASK and improve amd_asm_matmul ( #14196 )
...
* fix VALU_SALU / IMMED_MASK and improve amd_asm_matmul
* immed
* wave override
* restore ALT
* advance sgprs correctly
* no helpers
* decrease to 192 VGPRs
2026-01-17 11:58:34 +09:00
chenyu
ab244c7f81
onnx Gather should not assume indices to be const ( #14185 )
...
* onnx Gather should not assume indices to be const
added a failed test case
* just list
2026-01-16 20:55:00 -05:00
wozeparrot
a879b54234
tk: fa jit fix ( #14170 )
2026-01-16 16:38:45 -08:00
qazal
a8ae9757dd
viz: put alts in the same row, LDS color ( #14194 )
...
* viz: put alts in the same row, coloring work
* assert if packets overlap
* lds color
2026-01-17 09:36:14 +09:00
qazal
5aa71f437b
viz: precise clock cycles in PKTS ( #14179 )
...
* viz: relative clock cycles in PKTS
* format clocks as xM yK 999 cycles
2026-01-17 09:08:13 +09:00
Christopher Milan
eafcd44d95
fix OSX image pitch ( #14193 )
2026-01-16 19:07:33 -05:00
Christopher Milan
3960e2758c
suppress_finalizing in hip ( #14189 )
2026-01-16 18:56:29 -05:00
qazal
9302ab003a
viz: show ALT/OTHER packets on second lane ( #14192 )
...
* viz: show dimmer ALT/OTHER packets
* remove todo comment
* work
* current vmem is gray
2026-01-17 08:55:24 +09:00
qazal
551454f476
viz: fix wave sort, show message if sqtt trace is empty ( #14190 )
...
* show message if sqtt trace is empty
* work
* fix wave sort
* back
2026-01-17 08:01:26 +09:00
George Hotz
8a2549d42b
improve amd_asm_matmul + minor VIZ PKTS improvements ( #14186 )
...
* improve amd_asm_matmul + minor VIZ PKTS improvements
* fix waitcnt issue
* cleanups
2026-01-17 06:56:59 +09:00
George Hotz
7d1d9d4568
assembly/amd: remove IMG instruction support and asm.py ( #14163 )
...
* assembly/amd: return IMG instruction supports
* remove asm.py
* op2dsl
2026-01-17 06:21:50 +09:00
chenyu
dc4ae7dd08
lower ASSERT_MIN_STEP_TIME for driving_policy to 3ms ( #14184 )
...
seems quite stable at 2.7ms now
2026-01-16 15:04:53 -05:00
chenyu
0a14e1fcd4
fix some type ignore ( #14182 )
2026-01-16 13:56:45 -05:00
chenyu
fc10470883
add UOp.__index__ ( #14181 )
...
Tensor slice is handled by __getitem__, so the index method is just for SupportsIndex
2026-01-16 12:28:33 -05:00
chenyu
6790165ef8
minor _apply_uop cleanup ( #14180 )
...
give fxn a return type and minor style change
2026-01-16 11:27:55 -05:00
nimlgen
e855ec8ee3
tbgpu: refactor dext to support user mappings ( #14177 )
2026-01-16 15:55:57 +03:00
qazal
bbc55962ee
viz: color SQTT INST Ops like UOps ( #14175 )
2026-01-16 21:24:43 +09:00
qazal
3751b29a3d
viz: skip OTHER_ SQTT packets ( #14178 )
2026-01-16 20:37:19 +09:00
qazal
7c1f1cb2bc
viz: fix INST packets coloring ( #14176 )
...
* viz: fix INST packets coloring
* work
2026-01-16 18:46:13 +09:00
qazal
1696991988
viz: add PKTS group to sqtt trace ( #14173 )
...
* viz: add PKTS group to sqtt trace
* soft_err for rdna4
* different itrace
2026-01-16 17:29:47 +09:00
Christopher Milan
a021b84604
autogen: fix enum ( #14171 )
2026-01-16 01:30:11 -05:00
qazal
fa5475307c
viz: collapse wave packets in one row, 1 clk per packet ( #14169 )
...
* per wave packets in one row
* work
* row_tuple
* cleaner
* one row and one lane per wave
* globals split into rows based on type
* barrier length
2026-01-16 13:52:07 +09:00
Christopher Milan
5abc262e22
fix dll.bind caching ( #14168 )
2026-01-15 20:25:42 -05:00
Christopher Milan
f9ca072b61
cuda compilers disassemble properly ( #14166 )
...
* cuda compilers disassemble properly
* this can use system
2026-01-15 19:02:40 -05:00
chenyu
14e9a71a41
move test_assign to unit ( #14165 )
...
scheduling these should not depend on device
2026-01-15 17:10:13 -05:00
nimlgen
a0dd9d2146
tbgpu: correct com.apple.developer.driverkit.transport.pci entitlements ( #14164 )
...
* tbgpu: correct com.apple.developer.driverkit.transport.pci entitlements
* format
2026-01-15 20:56:39 +03:00
qazal
32e1c267ee
viz: SQTT timeline with our decoder ( #14139 )
...
* viz: sqtt OCC/INST timeline in our decoder
* todo
* lint
* work
* cleaner
* profiling
* better timing
* keep the generic api
* more generic
* 80x -> 20x off the C decoder
* unusably slow
* rm filters
* work
* work
* other way to sort ops
* work
* first 10k
* 100K actually tells a story
* barrier INST packets get their own red color and row
* minor detail
* 50K
* soft_err
2026-01-15 20:45:16 +09:00
Christopher Milan
0cb024a5bb
remove ctypes.Structure ( #13651 )
2026-01-15 05:06:22 -05:00
George Hotz
255e0573b1
assembly/amd: clean up asm/disasm ( #14158 )
...
* assembly/amd: clean up asm/disasm
* update disasm
* revert dumb stuff
* update decode
* use fmt
2026-01-15 17:45:40 +09:00
qazal
164bc678a6
scheduler: sched_cache bugfix for different Tensor.custom_kernel schedules ( #14161 )
...
* simplest failing test
* min fix
* same function reuses the cache
* SPEC=2 never worked for custom_kernel
2026-01-15 14:59:14 +09:00
qazal
b46da603fe
codegen/custom_kernel: do not attach KernelInfo to user program ( #14160 )
2026-01-15 14:01:48 +09:00
George Hotz
fd60626ea1
assembly/amd: refactor to use op_bits/op_regs ( #14156 )
...
* assembly/amd: refactor to use op_bits/op_regs
* remove that skip
* remove another hack
* remove another hack
* precompute mask
* more reg, less hasattr
2026-01-15 11:20:21 +09:00
chenyu
add7da268f
multiple slice assign test ( #14157 )
...
GANing test cases
2026-01-14 21:08:03 -05:00
George Hotz
e9ce12028e
assembly/amd: amdxml cleanups, remove broken SDWA/DPP, merge in pdf.py ( #14154 )
...
* assembly/amd: amdxml cleanups, remove broken SDWA/DPP
* remove buf junk
* simplify
* simplify
* lil cleanup
* dead fixes
* strip non pcode extraction from pdf
* merge pdf.py into amdxml.py
* only amdxml
2026-01-15 09:23:19 +09:00
wozeparrot
7e5687f6a3
more fa multi fix ( #14152 )
2026-01-14 13:57:11 -08:00
chenyu
1381daac06
many more failed assign tests ( #14153 )
...
assign is quite broken
2026-01-14 16:20:28 -05:00
nimlgen
8c55ef4f01
amd: cleanup props ( #14145 )
...
* amd: cleanup props
* f
2026-01-14 20:27:41 +03:00
chenyu
899a56446e
failed assign test cases with write before read ( #14148 )
...
slice assign write before read fails now. this is why kv cache needs a realize
2026-01-14 10:30:50 -05:00
chenyu
986e865830
fix TINY_BACKEND=1 cumsum ( #14138 )
...
* fix TINY_BACKEND=1 cumsum
old hack was wrong, need to apply contiguous on the input
* test time
* test_linalg_svd is slow
2026-01-14 09:54:49 -05:00
qazal
434dbafab5
optional Estimates in KernelInfo ( #14147 )
...
* optional Estimates in KernelInfo
* custom asm test plumbing
* s_code_end
* estimates test
* vaddr arg in global_store
* kernel desc
* Ops.DEVICE name
2026-01-14 22:55:03 +09:00
qazal
76b577ee76
viz: only SIMD name in sqtt timeline rows ( #14146 )
2026-01-14 20:13:27 +09:00
George Hotz
e5500ae4ad
add ALU stuff to default perf counters ( #14135 )
...
* add ALU stuff to default perf counters
* lds
* add alu utilization
* cleaner
* format as percent
* cleanest
* roc
2026-01-14 19:47:59 +09:00
nimlgen
86708ccac5
hip_ioctl: dump aql ( #14142 )
2026-01-14 13:15:10 +03:00
nimlgen
f9147422a3
ci: add setcap ( #14143 )
2026-01-14 13:15:01 +03:00
nimlgen
62c1a014a6
amd: rename to be consistent ( #14141 )
2026-01-14 11:41:04 +03:00
Christopher Milan
e0eea0d833
autogen: verify all files in CI ( #14140 )
...
* autogen: verify all files in CI
* dont delete libclang
2026-01-14 02:35:54 -05:00
chenyu
2a2c1eacf6
disable fast_idiv on metal ( #14137 )
...
there's a metal compiler bug which was the root cause that keccak needs a contiguous hack
2026-01-13 21:40:40 -05:00
wozeparrot
a92778aa0c
tk: fa multi fix ( #14134 )
2026-01-13 17:22:15 -08:00
George Hotz
2ab18ea7e3
assembly/amd: use xml instead of pdf ( #14118 )
...
* assembly/amd: use xml instead of pdf
* use amdxml to generate info about op sizes
* fix many tests with invalid instructions
* fix info generation
* chad xml fixes many bugs
* rename to operands
* simplify
* amdxml
* bug fix
2026-01-14 10:03:37 +09:00
qazal
002ea39da7
assembly/amd: use Tensor.custom_kernel to run assembly ( #14125 )
...
* assembly/amd: use Tensor.custom_kernel to run assembly
* PRINT_ASM=1 is DEBUG=4
2026-01-14 08:29:25 +09:00
chenyu
fe00682502
clean up svd tests ( #14133 )
...
removed from test_ops and added to TestTorchBackend
2026-01-13 16:32:21 -05:00
chenyu
84b88a0a31
more doc of newly added functions ( #14132 )
2026-01-13 15:48:45 -05:00
chenyu
e610821c52
Tensor.cummin and Tensor.nonzero ( #14131 )
2026-01-13 15:09:56 -05:00
chenyu
176a934ddd
Tensor.diagonal support offset and dims ( #14130 )
2026-01-13 14:49:06 -05:00
chenyu
2a217ba206
tinybackend isin and log10 ( #14120 )
...
can use tinygrad directly
2026-01-13 14:14:09 -05:00
qazal
79d00521f8
viz: fix cfg err when endpgm is in the middle of stream ( #14128 )
...
* kernel from beautiful_mnist
* minimal test
* correct way to do this
* rm that
2026-01-14 02:00:34 +09:00
qazal
7fe91e5db9
viz: cleanup cfg renderer ( #14127 )
...
* remove colorDomains from sqtt
* colors in js
* work
2026-01-14 01:10:42 +09:00
nimlgen
1364449cab
system: early pci perm check ( #14126 )
...
* system: early pci perm check
* l
2026-01-13 17:45:05 +03:00
George Hotz
a28c8105a5
assembly/amd: 2% faster amd_uop_matmul + SQTT ( #14122 )
...
* assembly/amd: 2% faster amd_uop_matmul
* SQTT_TOKEN_EXCLUDE + SQTT_SIMD_SEL
* sqtt printer
* fix printer
* fast decode
* fast decoder
* test packet counts
* ugh it's not faster
* dead
2026-01-13 19:55:32 +09:00
qazal
6cd318e377
viz: add link to graph from sqtt ( #14123 )
2026-01-13 17:31:03 +09:00
qazal
fd10fd245a
viz: cfg tokenizer fix and unit tests ( #14121 )
...
* output Ops.BINARY
* failing test for the cfg
* dsl renamed to offset and sz
* add better asserts
* move the note
2026-01-13 15:08:55 +09:00
chenyu
05fcb57696
also return index in Tensor.cummax ( #14117 )
...
* also return index in Tensor.cummax
* fix
2026-01-12 22:42:10 -05:00
wozeparrot
7c967399a4
tk: add failing test for fa multidevice ( #14116 )
2026-01-12 19:11:09 -08:00
George Hotz
330a0b686e
assembly/amd: clean up dsl and make type verification strict ( #14102 )
...
* assembly/amd: start newdsl
* work
* newdsl upd
* Reg is p nice
* cleaner
* work
* getting clean
* all fields
* more BitFields
* redo the pdfs with dsl2 syntax
* no lit
* cleanups
* more defaults
* fix get and remove crap
* aliases
* ugly but kind of works
* NULL, not rawimm
* clean up defaults
* only dsl
* asm fixes
* lit fixup
* more lit
* cleanups
* olddsl
* single pcode dict
* emu sort of works
* trash test
* global is global
* types property
* reg mods
* fix a few tests
* remove monkey patch
* fixes
* less hacks in tests
* less hacks in tests
* 4 test failures
* hw tests all pass
* fix compare emulator
* fix some tests
* 3 more
* fix and shorten sqtt
* handwritten
* fix validation
* test corrections
* all types validate
* fix dsl2 tests
* fix bugs in disasm
* skips on cdna
* work
* repr with reg[]
* fix bitfield tests
* merge pcodes in dsl
* remove override
* disasm uses inst.types
* simpler
2026-01-13 08:52:16 +09:00
C T
a8c821f45e
add Tensor.log10 with test\test_ops.py::TestOps::test_log10 ( #14113 )
2026-01-12 13:45:47 -05:00
chenyu
6b0a9f5ee6
don't strip sink in to_uops_list [pr] ( #14111 )
2026-01-12 11:19:03 -05:00
chenyu
cad7feec02
more onnx ops ( #14104 )
...
HannWindow, HammingWindow, BlackmanWindow, Hardmax, LpNormalization
2026-01-12 09:11:13 -05:00
nimlgen
635ed2df9d
system: use pci.PCI_VENDOR_ID instead of const ( #14109 )
2026-01-12 15:24:09 +03:00
qazal
6c0f0e29ff
Revert "viz: loading... ( #14107 )" ( #14108 )
...
This reverts commit 9347757c2d .
2026-01-12 20:45:37 +09:00
nimlgen
9347757c2d
viz: loading... ( #14107 )
2026-01-12 13:24:24 +03:00
wozeparrot
3a92df66ea
feat: bump version to 0.12.0 ( #14105 )
2026-01-11 21:19:49 -08:00
chenyu
7c234a9c7c
wgsl cleanup [pr] ( #14103 )
...
refactor common pack functions
2026-01-11 21:23:45 -05:00
George Hotz
91bde927ef
assembly/amd: split asm.py into asm.py and disasm.py ( #14101 )
...
* split asm.py into asm.py and disasm.py
* split decoder
* move to pcode
* tests
2026-01-12 07:22:02 +09:00
George Hotz
44135e2e84
assembly/amd: always use v_nop in test for rocprof-trace-decoder ( #14100 )
...
* assembly/amd: always use v_nop in test for rocprof-trace-decoder
* test touchups
2026-01-12 05:31:58 +09:00
George Hotz
8b1b15aec0
assembly/amd: SQTT support ( #14099 )
...
* assembly/amd: SQTT support
* simpler
* cmp wave
* instruction compare
* rocprof decode
* simpler
* no llvm
* no strcmp
2026-01-12 05:07:17 +09:00
nimlgen
8b5ff403fa
am: flag successful finalization ( #14097 )
...
* am: flag successful finalization
* import
2026-01-11 16:24:53 +03:00
qazal
d8aba24967
amd: use kernel descriptor struct in AMDProgram ( #14096 )
2026-01-11 18:25:16 +09:00
chenyu
9973a81356
add channels_last to QLinearGlobalAveragePool ( #14094 )
...
and other minor cleanups
2026-01-10 18:38:19 -05:00
chenyu
c5492f8f75
cstyle cleanup [pr] ( #14093 )
2026-01-10 09:44:50 -05:00
nimlgen
d5f954858d
viz: show precise timings ( #14092 )
2026-01-10 16:21:08 +03:00
nimlgen
3e2c05ee9f
hevc: decoder as iterator ( #14091 )
2026-01-10 14:57:56 +03:00
chenyu
35c9701df0
update outdated tests and comments ( #14090 )
2026-01-10 01:00:48 -05:00
chenyu
92246ea731
update tests, WEBGPU=1 pytest . passes ( #14089 )
...
* update tests, `WEBGPU=1 pytest .` passes
* minor update
2026-01-10 00:03:02 -05:00
chenyu
c34c6d9468
fix wgsl packed_store can drop valid ( #14088 )
...
* fix wgsl packed_store can drop valid
* fix
2026-01-09 15:22:06 -05:00
chenyu
eacccc5ace
more disk assign tests ( #14087 )
...
covers more edge cases
2026-01-09 14:14:52 -05:00
chenyu
ed295e74dc
don't skip gguf test if ggml is not installed ( #14086 )
...
* don't skip gguf test if ggml is not installed
should just let it fail
* fix
2026-01-09 12:05:58 -05:00
chenyu
cff33c8d78
add some disk assign tests ( #14085 )
2026-01-09 11:50:59 -05:00
chenyu
74fa3c7d09
decomp pow for LVP ( #14084 )
...
test failed due to undefined behavior, so use decomp instead
2026-01-09 10:50:28 -05:00
b1tg
0fbc551622
train bert with fp8 ( #13874 )
...
* fp8 train
* clean
* lint
* test fix from #13439
* skip first/last layer
* rm __init__, restore unroll <=32 check
* tests
* clean test, remove unused
* multi-gpu test, clean quantize_to_fp8
* remove bert contiguous
* run script
* test: better check
* run script search
* add seed in bert data shuffle
* move script to mi350x folder
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
2026-01-09 09:21:59 -05:00
nimlgen
ba209d6305
am: utc_l1_enable on all sdma inst ( #14083 )
2026-01-09 17:17:05 +03:00
nimlgen
6b308b89b7
viz: timeline time ( #14080 )
...
* viz: timeline time
* less lines
* cut
2026-01-09 16:43:45 +03:00
nimlgen
40f9fa2db4
autogen: new kfd ( #14082 )
2026-01-09 16:08:17 +03:00
qazal
2917ed1616
roc: propagate decoder errors to main thread ( #14081 )
...
* roc: propagate decoder errors to main thread
* types
* add cause
2026-01-09 21:10:45 +09:00
qazal
f3f4d9b387
viz: fix disasm node width ( #14079 )
2026-01-09 16:37:37 +09:00
anu
c70c112254
fix CUDA=1 disassembly (VIZ=1) by stripping null terminator ( #14046 )
...
* fix ptxas disassembly bug
* single '
* move fix to get_bytes
* move rstrip
---------
Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2026-01-09 15:19:59 +09:00
qazal
13e5d00d0e
viz: exclude comma in register highlight ( #14078 )
...
* viz: exclude comma in register highlight
* simplify
2026-01-09 15:10:30 +09:00
qazal
a071adffc0
viz: amdgpu disassembly register highlighting UI ( #14059 )
...
* viz: amdgpu disassembly register highlighting
* minor details
* details from IDA
* more details from IDA
* refactor token colors
* move tokenizer to python
* simplify
* minimal tokenizer for registers
* all the operand types
2026-01-09 11:27:09 +09:00
chenyu
b878f9d5a4
reuse Tensor init with const path [pr] ( #14076 )
2026-01-08 17:49:37 -05:00
chenyu
efcb32f6a9
unique const when requires_grad is set to True ( #14075 )
...
* unique const when requires_grad is set to True
* fix pyrender
2026-01-08 16:30:45 -05:00
chenyu
b34c637767
support bfloat16 for CL ( #14073 )
2026-01-08 14:14:29 -05:00
Garret Castro
16b652302e
skip bf16 test if not supported by device ( #14070 )
2026-01-08 13:37:24 -05:00
nimlgen
3f61a96d79
am: SetSoftMaxByFreq on gfx10+ ( #14068 )
2026-01-08 17:00:03 +03:00
George Hotz
e7b5d8a434
assembly/amd: more RDNA4 asm ( #14062 )
...
* rdna4 more
* asm
* fixes
* assembly/amd: handwritten wmma failing test
* passes
* wmma default hacks
* space
* 0 skips in rdna3/rdna4 disasm
* more RDNA4 tests
---------
Co-authored-by: qazal <qazal.software@gmail.com>
2026-01-08 05:09:37 -08:00
nimlgen
e372c841ba
hevc: beam in decode ( #14067 )
...
* hevc: beam in decode
* fine
* g
2026-01-08 15:47:16 +03:00
nimlgen
1732a4ec4b
am: rework set_clocks ( #14065 )
2026-01-08 15:33:32 +03:00
nimlgen
f3aceaa08b
hevc: fast decoder ( #14057 )
2026-01-08 15:20:37 +03:00
qazal
309197bca5
assembly/amd: test_roundtrip for cdna/rdna4 ( #14066 )
2026-01-08 21:03:13 +09:00
qazal
15a056715d
fix amd assembly IDE tests on macbook ( #14063 )
2026-01-08 17:27:52 +09:00
wozeparrot
027b935269
tk: fix grouped load store ( #14035 )
2026-01-07 22:38:02 -08:00
George Hotz
2db04d0696
assembly/amd: start adding RDNA4 support ( #14060 )
...
* assembly/amd: start adding RDNA4 support
* rdna4 asm
2026-01-07 21:19:30 -08:00
George Hotz
cb500466c2
assembly/amd: amd_asm_matmul ( #13989 )
...
* amd_asm_matmul
* dsl transform
* asm roundtrip
* fixed
* less
* better
* more
* simpler
* simplify
* lil
* simpler
* compact
* work
* cleanups
* simplify
* simpler
* cleanup
* name the regs
* simp
* big simp
* big simp
* simp
* acc grid
* fast
* stuff
* fast
* simpler
* works
* save vgprs
* save vgprs
* Compact
* less VGPRs
* after
* SQTT support
* fastest
* faster
* lil faster
* tile regs
* faster
* readable
* one more
* simpler
* lil simpler
* NO_GLOBAL skips early globals
* stock kernel
* cleanups
* cleanups
* one b reg
* safe reg changes
* acc is compact now
* remove confusing stuff
* sregs
* lds cleanups
* vopd
2026-01-07 20:11:05 -08:00
chenyu
3caa1e2c98
fix cast HALF with PYTHON backend ( #14058 )
2026-01-07 16:52:05 -05:00
chenyu
5f1ede7f7e
clean up test_dtype ( #14055 )
...
use less lambda
2026-01-07 15:45:42 -05:00
nimlgen
5bd4593eda
hevc: cleaner decoder ( #14056 )
...
* hevc: cleaner decoder
* nn
2026-01-07 18:29:30 +03:00
b1tg
241f0402b4
add seed in bert data shuffle ( #14054 )
2026-01-07 10:02:05 -05:00
nimlgen
25c82dd242
nv: profile nvdec ( #14053 )
2026-01-07 15:56:54 +03:00
qazal
35900290b2
viz: configure text height for cfg ( #14052 )
2026-01-07 18:58:56 +09:00
chenyu
87f4bc5446
update variable names around jit [pr] ( #14049 )
...
lbs, st_vars_dtype_device and rawbuffers no more
2026-01-06 22:32:41 -05:00
chenyu
2833c5a54b
few more jit tests with multi tensor inputs ( #14047 )
2026-01-06 22:05:22 -05:00
chenyu
72a3f78d19
jit includes tensor inputs in containers ( #14043 )
...
* jit includes tensor inputs in containers
* cleanup
2026-01-06 19:42:06 -05:00
chenyu
c714881832
don't allow jit input to be const ( #14045 )
...
* don't allow jit input to be unbuffered like const
* just const to fix multi
* fix rnnt
2026-01-06 18:15:22 -05:00
chenyu
a8896f28e1
test_unrealized_const_input_frozen ( #14044 )
...
unrealized const is not replaced in jit
2026-01-06 14:17:43 -05:00
nimlgen
325f4006ff
amd: copies w/o sdma ( #14036 )
...
* amd: copies w/o sdma
* as_args
* fixes
* f
2026-01-06 21:15:58 +03:00
chenyu
7fb18f7e47
raise when jit fxn returns non-Tensor output ( #14042 )
2026-01-06 12:59:20 -05:00
chenyu
4491ec0c9e
JitError ( #14041 )
...
* JitError
* test_symbolic_jit
2026-01-06 12:19:50 -05:00
chenyu
6ddddc68af
test jit tolist failure ( #14040 )
...
also moved tests to test_jit_footguns
2026-01-06 11:16:57 -05:00
chenyu
b699b9f763
test case for jit a function with item call ( #14039 )
...
* test case for jit a function with item call
output is silently wrong now
* no dtype
2026-01-06 10:40:43 -05:00
nimlgen
02084f5376
mockdsp: use dsp allocator ( #14037 )
...
* mockdsp: use dsp allocator
* fix
* ?
2026-01-06 16:04:47 +03:00
wozeparrot
2b3e01e79c
tk: support sliced local -> reg load ( #14034 )
2026-01-06 05:33:24 -05:00
George Hotz
45f7fd073d
assembly/amd: pcode bug fixes ( #14032 )
...
* bring over pcode parser
* fixes
* pdf test
* delay alu
2026-01-06 00:15:48 -08:00
wozeparrot
21d0f6bb76
tk: flat global -> local load ( #14033 )
2026-01-05 23:35:53 -08:00
qazal
3170365a5b
visualize SQTT with the same cfg infrastructure ( #13870 )
...
* start
* rough sketch
* post render dag
* art
* intro g key
* work
* custom color scale
* colors
* more blue
* better
* smaller
* use for loop in test
2026-01-06 14:53:20 +09:00
Christopher Milan
0120d69caa
autogen: avcodec (and simplify workflow) ( #14031 )
...
* simplify autogen workflow and add avcodec verification
- Consolidate all regeneration into single steps (delete + import)
- Remove continue-on-error and individual diff checks
- Use git diff at end to catch all differences
- Show artifact URL in failure message
- Add avcodec.py verification
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* patch avcodec
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-05 23:30:25 -05:00
George Hotz
20653d2996
assembly/amd: make pdf.py code shine ( #14029 )
...
* assembly/amd: make pdf.py code shine
* no merge
* pdf2 is the future
* something
* regen enums
* test
* work
* remove junk
* write
* pcode extraction
* pdf2 passes all tests
* simplify
* simpler pdf
* late filter
* remove hacks
* simplify pdf2.py
* field type
* remove defaults
* don't export srcenum
* simple pdf.py
* simpler
* cleaner
* less hack in PDF
2026-01-05 18:49:40 -08:00
qazal
ea7b149ca5
viz command line tool ( #14030 )
2026-01-06 10:19:47 +09:00
Christopher Milan
f86c728440
load libclang as 'libclang.so' too ( #14028 )
2026-01-05 16:56:16 -05:00
chenyu
eda6a73897
clean up canonicalize_device ( #14027 )
...
centralize the type check
2026-01-05 10:29:55 -05:00
chenyu
ce464b147a
clean up comments that mentioned outdated terms ( #14026 )
...
no MultiLazyBuffer and no ShapeTracker in comments
2026-01-05 09:42:58 -05:00
chenyu
83063cc3e4
onnx TensorScatter ( #14024 )
2026-01-05 09:05:22 -05:00
chenyu
9497ec00f2
fix onnx attention permute ( #14025 )
...
* fix onnx attention permute
* skip test_attention_4d_fp16_cpu too
2026-01-05 08:58:50 -05:00
qazal
5cff5698f7
viz: g key toggles graph and text view ( #14023 )
2026-01-05 22:41:45 +09:00
chenyu
7a81a3cb98
more passed onnx tests ( #14022 )
2026-01-05 07:46:27 -05:00
kim yongjin
34fe105386
remove unused LazySeq ( #14020 )
2026-01-05 07:38:33 -05:00
qazal
4f2f38bf64
viz: split cfg and table render ( #14021 )
2026-01-05 20:59:08 +09:00
nimlgen
70405b4f3c
am_smi: mi350 ( #14018 )
2026-01-05 13:10:56 +03:00
Christopher Milan
b2a0b9c551
autogen: dump patch in CI ( #14010 )
...
* autogen: don't fast-fail, produce patch artifact on differences
All verification steps now use continue-on-error to run completely.
Each job generates a patch artifact containing all differences found.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
* add gen from header test
* fix tests
* fail if diff
* add forward decl autogen test
* remove confusing/wrong comments
* macos unittests set LIBCLANG_PATH
---------
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-04 22:38:12 -05:00
chenyu
aae08b20e0
enable passed onnx tests ( #14017 )
2026-01-04 22:12:50 -05:00
chenyu
785d04d127
simpler einsum ( #14014 )
2026-01-04 20:38:59 -05:00
chenyu
f6a78a29e0
support einsum trace ( #14012 )
...
* support einsum trace
* test_einsum_scalar_cpu
2026-01-04 19:27:27 -05:00
George Hotz
404eed6172
assembly/amd: improve tests for asm ( #14007 )
...
* assembly/amd: improve tests for asm
* upd
* skip
* tests
* re bug
* more passing
* cleanups
* cdna fixups
* improve tests, better CDNA parsing
* fix CI
* no defs
* simpler
* all pass
* from pdf
* regen
2026-01-04 15:14:08 -08:00
wozeparrot
f550f9204c
fa: failing test for bwd jit ( #14009 )
...
* tk: failing test for bwd jit
* feat: mark expectedFailure
* clean: spaces
2026-01-04 16:57:43 -05:00
George Hotz
7abf4591ba
use bitsize on dtype ( #14011 )
...
* use bitsize on dtype [pr]
* bitsize
* bitsize in js export, but might be wrong
* reverts
* revert that
2026-01-04 12:16:21 -08:00
chenyu
cfb8bf5814
faster image load ( #13977 )
...
sometimes image load does not need to init with NAN
2026-01-04 13:09:59 -05:00
George Hotz
7ebda28692
assembly/amd: add CDNA support to asm ( #13982 )
...
* add CDNA support
* more cdna tests
* something
* fix more stuff
* more work
* simpler
* simpler
* cdna
* disasm
* less skip
* fixes
* simpler
2026-01-04 08:53:56 -08:00
chenyu
ad041416ca
delete unused rewrite rule [pr] ( #14006 )
2026-01-04 09:48:52 -05:00
nimlgen
bf356ae996
am: mi300 48bit address space ( #14004 )
...
* am: mi300 48bit address space
* fix
2026-01-04 15:19:25 +03:00
nimlgen
606786e152
am: do not sleep for each hive node during resets ( #14003 )
2026-01-04 14:02:11 +03:00
George Hotz
34ea053b26
assembly/amd: clean up pcode, jit pcode instead of static ( #14001 )
...
* assembly/amd: clean up pcode
* regen
* lil
* jit the pcode
* sendmsg
* cleanups
* inst prefetch lol
2026-01-03 23:06:15 -08:00
kamilisjon
280790e438
Reuse toposort in recursive_property ( #13993 )
2026-01-03 22:04:13 -08:00
kamilisjon
9a9564118c
[pr] Delete reverse_toposort ( #13987 )
...
* Delete reverse_toposort
* Update comment and profiler name
* Update profiler name
2026-01-03 22:03:44 -08:00
George Hotz
8328511808
assembly/amd: make the emu.py code shine ( #13996 )
...
* assembly/amd: make the code shine
* lil clean
* reg back in pcode
* cleanups
* gen fma_mix
* no writelane hacks
* fn cleanup
* dead vgpr_write
* readable
* smem
* cleanup bench_emu
* speedups
* simpler and faster
* direct inst._fn
* split fxn
* Revert "simpler and faster"
This reverts commit e85f6594b3 .
* move lds to wavestate
* dispatcher
* pc in dispatch
* literal isn't wavestate
* cleanups + program
* one readlane
* exec_vop3sd in exec_vop
* cleaner exec_vopd
* fully merge VOP3P
* no special paths
* no SliceProxy
* low=0
* no bigint
* failing tests
* fma on python 3.13
2026-01-03 20:33:09 -08:00
qazal
bdb421f13e
process_replay: passthrough sink arg for Ops.PROGRAM input ( #14000 )
2026-01-04 13:09:39 +09:00
Galax
66caa9fe1d
fix: library linking for fedora systems ( #13999 )
2026-01-03 17:40:56 -08:00
chenyu
8003db2a28
test case of NOOP store load folding ( #13997 )
2026-01-03 14:39:26 -05:00
chenyu
c1b8644a3f
test removing expander rules [pr] ( #13994 )
2026-01-03 12:38:01 -05:00
Christopher Milan
35c2870b1f
gate image_conv2d pitch hacks on IMAGE==1 ( #13995 )
...
* gate image_conv2d pitch hacks on IMAGE==1
* fix opencl image copies
* cleanup
2026-01-03 12:27:31 -05:00
nimlgen
a49924a0e9
hcq: _sleep report status ( #13992 )
...
* hcq: _sleep report status
* msg
* print all
2026-01-03 14:28:28 +03:00
nimlgen
3b354bc11f
hcq: better queue management ( #13991 )
2026-01-03 13:11:15 +03:00
nimlgen
efb2ae87c6
hcq sync aql ( #13756 )
...
* hcq sync aql
* w
2026-01-03 12:59:24 +03:00
qazal
bd55507ee4
RDNA3 fp16 assembly gemm 85 TFLOPS ( #13990 )
2026-01-03 18:34:23 +09:00
wozeparrot
6242a9d151
tk: no global copy and clear ranges ( #13988 )
2026-01-02 23:45:15 -08:00
wozeparrot
9f082e8e25
fa: split kv bwd into 2 kernels ( #13981 )
2026-01-02 18:45:51 -08:00
qazal
2cc64d71b0
simplify mi350x gemm / viz asm tests ( #13984 )
...
* mi350x gemm cleanup
* asm tests work
* simpler asm tests
2026-01-03 11:11:07 +09:00
chenyu
7cbafb2ef1
update hypothesis min version ( #13983 )
...
there was a local_constants perf regression that made hypothesis related tests slow
2026-01-02 21:01:57 -05:00
Christopher Milan
9dc524536f
IMAGE=1 creates "dynamic" images ( #13769 )
...
* remove image from BufferSpec
* cl tiny_gemm (64) works
* mypy
* padding
* openpilot CL
* reshape properly
* remove extra qcom checks
* pad output
* mypy
* update compile test
* move undo
* TestImageCopy valid images
* TestImageRealization valid images
* TestImageDType valid images
* cleanups
* test_renderer_failures
* ruff
* mypy
* simplify ops_qcom
* bump step time
* Revert "bump step time"
This reverts commit 75a037c7d0 .
* "dynamic textures" are optional
* a start
* IMAGE=1 works, no FLOAT16
* fast but wrong
* mypy
* some fixes
* better
* works
* refactor
* oops
2026-01-02 16:22:39 -05:00
Christopher Milan
61dc70f1a8
add driving_vision IMAGE=1 benchmark ( #13979 )
2026-01-02 13:58:27 -05:00
George Hotz
0e282025ff
assembly/amd: split test_emu into hw tests ( #13966 )
...
* assembly/amd: split test_emu into hw tests
* hw tests
* bugfixes
* more tests and fix
2026-01-02 08:04:56 -08:00
chenyu
2e2b5fed12
fix misspellings ( #13976 )
2026-01-02 10:37:38 -05:00
nietras
f49e4714af
Fix spelling errors in README for AMD assembly ( #13975 )
2026-01-02 10:15:20 -05:00
b1tg
a78fcc55a4
amd tc 1616128 ( #13439 )
...
* amd tc 1616128
* fix test
* remove hardcoded check in test
2026-01-02 09:01:05 -05:00
chenyu
fcbb896e05
remove unused to_struct [pr] ( #13973 )
2026-01-02 08:54:57 -05:00
nimlgen
ff7853a65a
am: fix aid doorbells ( #13971 )
2026-01-02 15:53:44 +03:00
nimlgen
42abb0586c
am: fix aid doorbells ( #13972 )
2026-01-02 15:53:13 +03:00
nimlgen
ebbaad6bfd
am: enable all sdma engines ( #13970 )
2026-01-02 15:25:15 +03:00
qazal
5f52266225
mi350x gemm: use Tensor.custom_kernel in asm test ( #13969 )
...
* mi350x gemm: use Tensor.custom_kernel in asm test
* A @ B for baseline
2026-01-02 18:30:50 +09:00
George Hotz
5a1a561e0f
assembly/amd: rdna4 autogen ( #13967 )
...
* assembly/amd: add pcode ds ops
* refactors
* fix ds op
* update autogen
* fix flat bug
* more tests
* fix emu test
* that's a hack
* generic
* fix all tests
* two tests
* fix test failure
* better
* remove __all__
* assembly/amd: fix autogen for RDNA4
2026-01-01 23:12:18 -05:00
wozeparrot
b27527f05a
fix: missed inner tracked range ( #13964 )
2026-01-01 18:09:57 -08:00
wozeparrot
ecbac8a338
tk: fa cleanups + causal test ( #13963 )
2026-01-01 18:05:00 -08:00
chenyu
af0392efea
only set DiskDevice.size if it opens successfully ( #13962 )
2026-01-01 19:33:26 -05:00
chenyu
e036d6df89
properly fix DiskDevice reuse ( #13961 )
2026-01-01 18:08:23 -05:00
George Hotz
dfb813b760
assembly/amd: add pcode ds ops ( #13939 )
...
* assembly/amd: add pcode ds ops
* refactors
* fix ds op
* update autogen
* fix flat bug
* more tests
* fix emu test
* that's a hack
* generic
* fix all tests
* two tests
* fix test failure
* better
* remove __all__
2026-01-01 16:24:13 -05:00
chenyu
cb7c76a3bd
update test_fuzz_failure to not construct full UOp ( #13960 )
2026-01-01 15:09:58 -05:00
chenyu
51398edf9c
fix indirect import ( #13958 )
...
also deleted old external tests
2026-01-01 14:22:45 -05:00
chenyu
8e416df438
simpler InvalidType [pr] ( #13957 )
...
simpler singleton pattern
2026-01-01 13:55:51 -05:00
nimlgen
b8ea0d779c
am: remove pipe, queue from setup_ring ( #13947 )
2026-01-01 21:06:41 +03:00
chenyu
4d5c4d256d
update tqdm for edge case ( #13956 )
...
1.00kit/s and not 1000it/s for value 999.5
2026-01-01 11:37:26 -05:00
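The tqdm edge case above (a rate of 999.5 printing as `1000it/s` rather than `1.00kit/s`) is what happens when the SI prefix is chosen before rounding. A minimal sketch of the idea, assuming hypothetical helpers `three_sig`, `naive_rate` and `si_rate`; this is not tinygrad's actual tqdm code:

```python
PREFIXES = ["", "k", "M", "G", "T"]

def three_sig(v: float) -> str:
    # Roughly three significant digits for 0 <= v < 1000 (illustrative only).
    if v < 10: return f"{v:.2f}"
    if v < 100: return f"{v:.1f}"
    return f"{v:.0f}"

def naive_rate(x: float) -> str:
    # Picks the prefix from the unrounded value, so 999.5 stays unprefixed
    # and is then rounded up to a four-digit "1000".
    i, v = 0, float(x)
    while i < len(PREFIXES) - 1 and v >= 1000:
        v, i = v / 1000, i + 1
    return f"{three_sig(v)}{PREFIXES[i]}it/s"

def si_rate(x: float) -> str:
    # Picks the prefix from the rounded value, so 999.5 moves up to "k".
    i, v = 0, float(x)
    while i < len(PREFIXES) - 1 and round(v) >= 1000:
        v, i = v / 1000, i + 1
    return f"{three_sig(v)}{PREFIXES[i]}it/s"
```

The only difference between the two functions is the loop condition: comparing `round(v)` against 1000 instead of `v` itself.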
chenyu
ed222070f7
update xlog2 fp16 decomp to not use fp32 ( #13955 )
2026-01-01 11:18:29 -05:00
chenyu
ce84a23142
remove tee in benchmark ( #13954 )
2026-01-01 10:55:36 -05:00
b1tg
24723327ac
fix tc_up in search ( #13438 )
...
* tensor_core is missing from Scheduler
* test upcast max
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
2026-01-01 10:25:08 -05:00
qazal
9726500de8
enable using assembly in Tensor.custom_kernel ( #13895 )
2026-01-02 00:12:01 +09:00
qazal
c0f52c9dcb
split assembly gemm to per arch directory ( #13953 )
2026-01-02 00:10:22 +09:00
chenyu
c69470be52
fix test_symbolic_arange_sym_step ( #13952 )
2026-01-01 09:41:07 -05:00
chenyu
b91b46091c
delete test_tensor_uop ( #13951 )
...
old test for shape tracker. also update tests that refer to shapetracker names
2026-01-01 09:25:05 -05:00
chenyu
17ef4af72c
new ceildiv that fixed symbolic conv ( #13944 )
...
* new ceildiv that fixed symbolic conv
* smaller test case
2026-01-01 09:02:41 -05:00
qazal
6a5430ab00
correct args order in mi350x gemm ( #13949 )
2026-01-01 23:01:46 +09:00
chenyu
baff10d32c
clean up Tensor.svd slices ( #13948 )
2026-01-01 08:18:45 -05:00
nimlgen
1c5ed8e8b5
am: remove doorbells from setup_ring ( #13946 )
2026-01-01 14:39:21 +03:00
haofei
526fd4ec71
Fix SVD rank‑1 Jacobi rotation when tau == 0 ( #13945 )
2026-01-01 00:30:18 -05:00
haofei
20777f30b9
Fix QR/SVD NaNs on zero/orthogonal inputs ( #13943 )
2025-12-31 23:40:09 -05:00
chenyu
0ed58c1fcd
clean up some functions in helpers [pr] ( #13942 )
2025-12-31 18:29:16 -05:00
chenyu
e2987001ee
unify pre-commit mypy and ci mypy ( #13940 )
2025-12-31 17:51:51 -05:00
chenyu
8bf7c9c1d2
no-op cleanups for ptx [pr] ( #13938 )
2025-12-31 17:28:39 -05:00
George Hotz
2bb07d4824
assembly/amd: move Reg out of the pseudocode ( #13934 )
...
* assembly/amd: move Reg out of the pseudocode
* remove extra
* fix pcode tests
* simpler pcode
* simpler
* simpler
* cleaner
* fix mypy
2025-12-31 15:34:51 -05:00
chenyu
52acadc160
consolidate IGNORE_OOB=0 tests ( #13937 )
...
add a new unit test file and add more cases
2025-12-31 15:24:20 -05:00
chenyu
c0c1c1c8c8
remove unused validate rule ( #13936 )
2025-12-31 15:02:49 -05:00
chenyu
b6d08f247d
assert z3_xor input type ( #13933 )
2025-12-31 13:37:57 -05:00
George Hotz
f14428090f
assembly/amd: speed up emulator ( #13932 )
2025-12-31 13:32:25 -05:00
Christopher Milan
13973e4dea
refactor image pitch ( #13928 )
2025-12-31 13:22:38 -05:00
chenyu
051fe6c8bc
less toposort iteration in oob validate ( #13929 )
2025-12-31 13:16:34 -05:00
chenyu
a9a7b33404
IGNORE_OOB=0 in CI ( #13903 )
2025-12-31 12:56:59 -05:00
George Hotz
29402034a1
assembly/amd: cleanups to asm and emu ( #13912 )
...
* a bunch of cleanups
* ops are back
* bug fixes
* cleanups
* a lil simpler
* more refactors
* _disasm_vop1
* sops
* more
* continue
* more
* num_srcs
* simpler
* no _is16
* op cleanups
* isinstance
2025-12-31 12:46:11 -05:00
chenyu
ba9aa5cd6f
skip some PTX IGNORE_OOB validation ( #13927 )
2025-12-31 12:40:21 -05:00
chenyu
4968060ad4
fix IGNORE_OOB=0 for WEBGPU ( #13926 )
2025-12-31 10:41:28 -05:00
chenyu
35bd39e4ba
update mypy and torch version in ci ( #13925 )
2025-12-31 10:29:28 -05:00
George Hotz
b998a80b5d
assembly/amd: split generated stuff into enum/ins ( #13924 )
2025-12-31 10:10:52 -05:00
chenyu
404755bafd
merge ci ruff tests and update ruff version ( #13922 )
2025-12-31 09:53:49 -05:00
nimlgen
25440f0f72
all2all ( #13902 )
...
* all2all
* um
* fix
* x
* um
* simpler
* mypy
* fix
* t
* cmnts
2025-12-31 16:38:32 +03:00
nimlgen
f7ee644950
amd: lazy sdma queue allocation ( #13920 )
...
* ams: lazy queue
* nv
* linter
* f
2025-12-31 15:17:13 +03:00
nimlgen
b063518ea7
am: several sdmas ( #13919 )
...
* am: several sdmas
* fix
2025-12-31 14:19:22 +03:00
qazal
b23f4517ab
prep mi350x gemm for python dsl ( #13918 )
...
* start by pruning existing asm
* better branch names
* split to template and real instructions
2025-12-31 20:00:57 +09:00
qazal
3f3786ded9
mmapeak: fix compiler import ( #13915 )
2025-12-31 16:52:23 +09:00
Christopher Milan
a14896fff2
refactor QCOM arg parsing ( #13914 )
...
* refactor QCOM arg parsing
* ruff
* mypy
2025-12-30 19:26:02 -05:00
Christopher Milan
c475c3a6d7
remove useless cast ( #13911 )
2025-12-30 19:24:29 -05:00
George Hotz
0221b96761
assembly/amd: fix all ops tests ( #13910 )
...
* assembly/amd: fix all ops tests
* test_ops with smaller sizes
* ds store/load 2addr
2025-12-30 18:01:34 -05:00
chenyu
dc27eb48ac
remove PYTHONPATH="." from test.yml ( #13909 )
2025-12-30 17:00:16 -05:00
George Hotz
efc99d0c55
assembly/amd: more refactors ( #13907 )
...
* assembly/amd: more refactors
* more refactors
* more refactors
* simpler emu
* generate.py
* regen all
* cleanups
* more
* work
* more readme
* lil
2025-12-30 16:13:24 -05:00
George Hotz
49d1bf93d6
assembly/amd: refactor asm.py to be simpler ( #13900 )
...
* assembly/amd: refactor asm.py
* assembly/amd: refactor asm.py to be simpler
* multiple fxns
* fast
* more tests pass
* regen
* stop decode
2025-12-30 13:51:40 -05:00
George Hotz
04c79505ec
no subnormal bf16 ( #13905 )
2025-12-30 13:02:53 -05:00
chenyu
39f99b207a
update IGNORE_OOB error message ( #13904 )
...
IGNORE_OOB=1 to disable
2025-12-30 12:25:55 -05:00
George Hotz
7e14cdcb06
assembly/amd: clean up clt/ctz hack ( #13901 )
...
* assembly/amd: clean up clt/ctz hack
* add breaks
2025-12-30 11:59:28 -05:00
George Hotz
69cdc8066d
assembly/amd: add dtype tests to AMD IDE CI ( #13899 )
...
* add dtype tests to AMD IDE CI
* more tests
* add trig preop
* regen done
* split to amd autogen
* simpler
2025-12-30 11:09:51 -05:00
George Hotz
9c89be5235
assembly/amd: fix v_perm_b32 + PC fixes ( #13897 )
...
* assembly/amd: fix v_perm_b32
* add pc support
2025-12-30 09:25:40 -05:00
George Hotz
2b838dc1d8
assembly/amd: fix AMD_LLVM=1 support in emulator ( #13881 )
...
* fix AMD_LLVM=1 support in emulator
* more llvm with dtype
* work
* more fixes
* fix dtype
2025-12-30 09:09:57 -05:00
nimlgen
a19d21ea9c
am: mi3xx smu clocks ( #13894 )
...
* am: mi3xx smu clocks
* x
2025-12-30 16:44:17 +03:00
qazal
b557c46233
assembly gemm clean ups, instructions for cli ( #13892 )
2025-12-30 16:14:06 +09:00
qazal
d7e1f26e3d
command line interface for sqtt viz ( #13891 )
...
* command line interface for sqtt viz
* cleanup
* api surface area
* this confuses the llms
* document
2025-12-30 12:33:21 +09:00
chenyu
ab58926b00
update sampling in test_float_cast_to_unsigned ( #13889 )
...
filter is slow for small dtypes
2025-12-29 21:35:46 -05:00
Christopher Milan
0497387e45
NIR: new-style (fix beam) ( #13887 )
...
* NIR: fix beam
* new reduce
* Revert "Revert "NIR: new-style compilers (#13875 )" (#13888 )"
This reverts commit fc4faed0b2 .
* oops
2025-12-29 18:41:29 -05:00
Christopher Milan
fc4faed0b2
Revert "NIR: new-style compilers ( #13875 )" ( #13888 )
...
This reverts commit 72236bbd3d .
2025-12-29 17:42:28 -05:00
George Hotz
94bca91f3e
assembly/amd: have asm go through the dsl ( #13886 )
...
* assembly/amd: have asm go through the dsl
* lil
2025-12-29 17:39:11 -05:00
George Hotz
7322d9ec4a
assembly/amd: add new instruction support to pcode ( #13885 )
...
* assembly/amd: add new instruction support
* more
* regen all
2025-12-29 17:30:17 -05:00
George Hotz
0d326f5b9b
fix missing instructions in pseudocode ( #13884 )
2025-12-29 16:11:22 -05:00
Christopher Milan
9c6850fc01
remove try-catches on llvm import ( #13883 )
2025-12-29 15:56:17 -05:00
George Hotz
9d8397be11
add CDNA3+RDNA4 support ( #13882 )
...
* fix CI
* remove junk
* rename lib to dsl
* correct
* cleanups
2025-12-29 15:51:29 -05:00
Christopher Milan
72236bbd3d
NIR: new-style compilers ( #13875 )
...
* NIR: new-style compilers
* mypy
* simplify NIR compilers
* lvp compiler too
* mypy
* simplify
* mypy
2025-12-29 15:31:41 -05:00
George Hotz
81cf9ea0ab
rename to extra.assembly.amd ( #13879 )
2025-12-29 14:10:55 -05:00
George Hotz
37f0fa11b6
rdna3 test cleanups ( #13878 )
...
* rdna3 test cleanups
* cleanups
* ugh DONT SKIP
2025-12-29 13:41:59 -05:00
George Hotz
35db73b231
add cdna4 support to parsers ( #13877 )
...
* add cdna4 support to parsers
* cdna4
2025-12-29 13:23:43 -05:00
Clément Verrier
d178235309
delete tree structure from CLAUDE.md ( #13876 )
...
Claude Code should be able to figure out the correct structure, and the
hardcoded tree structure might become outdated.
2025-12-29 13:23:20 -05:00
George Hotz
ff856a74cb
minor refactoring for rdna3 ( #13873 )
...
* minor refactoring for rdna3
* fix div scale stuff
* more bugfixes
2025-12-29 13:20:00 -05:00
C T
39923203ba
fix exception in cuda bindings code on windows ( #13823 )
...
* fix cuda on windows
* fix linter errors
* test github action install cuda-toolkit
* Revert "test github action install cuda-toolkit"
This reverts commit c18ad6f937 .
* Revert "fix linter errors"
This reverts commit 00aa943e91 .
* Revert "fix cuda on windows"
This reverts commit 7aea5256b1 .
* fix windows sysconfig.get_config_var("MULTIARCH") is None
2025-12-29 12:58:22 -05:00
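The final bullet of the commit above notes that `sysconfig.get_config_var("MULTIARCH")` is `None` on Windows. A minimal sketch of guarding against that, assuming a hypothetical `lib_dirs_for` helper and illustrative search paths; the actual change in the CUDA bindings is not shown in this log and may look different:

```python
import sysconfig
from typing import List, Optional

def lib_dirs_for(multiarch: Optional[str]) -> List[str]:
    # Hypothetical helper: build a library search list, adding the multiarch
    # directory only when the config variable is actually set.
    dirs = ["/usr/lib", "/usr/local/lib"]
    if multiarch:
        dirs.insert(0, f"/usr/lib/{multiarch}")
    return dirs

# On platforms without a multiarch triple this call returns None,
# which the helper treats as "no extra directory".
search_dirs = lib_dirs_for(sysconfig.get_config_var("MULTIARCH"))
```

Treating the missing value as "nothing to add" keeps the same code path usable on every platform without a platform check.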
b1tg
63a1bb8507
multi custom kernel: support input mixed with copy and shard ( #13748 )
2025-12-29 12:54:27 -05:00
chenyu
0a98fd38b3
fix tests that failed locally on mac ( #13872 )
...
keccak output was silently broken without contiguous
2025-12-29 11:23:38 -05:00
Clément Verrier
0e409ff5ce
fix indentation in UOp pretty_print for repeated references ( #13857 )
...
* fix correct indentation in UOp pretty_print for repeated references
When a UOp was referenced multiple times, the walrus operator notation
(e.g., x0:=) was correctly used for the first occurrence, but subsequent
references had misaligned indentation due to an extra space character.
Fix indentation misalignment in pretty_print() when UOps are referenced
multiple times.
* add simple unit tests for UOp repr
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-12-29 10:46:16 -05:00
George Hotz
f1471a3b99
speed up rdna3 unit tests + add to CI ( #13871 )
...
* speed up rdna3 unit tests
* add test to CI
* faster and simpler
* speedups
* bugfixes
* use helper
* fix CI maybe
* test fixes
* llvm-21 on 24.04
* upd
* llvm-21
* fix test
* bring that back
* merge gen into lib
* test generators
2025-12-29 10:26:48 -05:00
h-vetinari
37720fd6c0
also look for linux libraries in RHEL-themed paths ( #13863 )
2025-12-29 10:05:32 -05:00
George Hotz
25ef866e89
write python emulator from RDNA3 pseudocode in pdf ( #13841 )
...
* write python emulator from RDNA3 pseudocode in pdf
* emu2
* more emu
* working
* more psueod
* progress
* cleanups
* delete junk
* delete stale files
* just emu
* work
* emu compare
* bemu
* cleanups and more failures
* revert bench emu
* fix emu cmp
* four tests fail
* bugfixes
* dsl
* ext
* refactor
* dsl
* div scale fix
* test_emu
* fix emu tests
* pcode
* test pcode
* top imports
* fix test_emu to use run_asm
* emu tests on real hardware
* more tests
* more emu tests
* more
* work
* work
* bug fix
* bugfixes
* fix fp16 gemm
* all ops tests pass in emulator
* fix llvm tests
* fix a few more tests
* fix mockgpu timeout
2025-12-29 07:39:53 -05:00
nimlgen
88eb230326
memory: correct pa allocator size ( #13861 )
2025-12-29 14:49:44 +03:00
qazal
f541540129
variable N for asm gemm ( #13869 )
...
* variable N for asm gemm
* cleanup spacing
2025-12-29 19:35:50 +09:00
nimlgen
c6769badc2
mockgpu: async support ( #13868 )
...
* mockgpu: async support
* cpu
2025-12-29 13:18:37 +03:00
qazal
fc5278746f
mi350x assembly gemm cleanups ( #13867 )
2025-12-29 18:47:23 +09:00
George Hotz
f07c39cfa4
hwtest fixes for rdna3 dsl ( #13865 )
2025-12-28 20:42:29 -05:00
George Hotz
d9603c1bee
improve asm dsl syntax ( #13864 )
...
* improve asm dsl syntax
* improve asm dsl syntax
2025-12-28 20:04:59 -05:00
chenyu
f5090192c8
reorder AMD tensor core benchmark test ( #13860 )
...
* reorder AMD tensor core benchmark test
* disable that
2025-12-28 12:29:51 -05:00
qazal
066d96c397
print tflops in asm gemm test ( #13859 )
...
* print tflops in asm gemm test
* change order
2025-12-29 02:26:40 +09:00
chenyu
a03cd43e78
fix typing in compute_gradient ( #13852 )
2025-12-28 11:52:14 -05:00
chenyu
cba05acadf
re-enable TYPED=1 import test ( #13858 )
2025-12-28 11:49:06 -05:00
qazal
2cfbabdc34
mi350x 1tflop bf16 gemm in extra ( #13702 )
2025-12-28 21:45:42 +09:00
qazal
2180eee5e4
use the asm dsl in remu hwtest.py ( #13856 )
...
* remu hw test with the asm dsl
* simpler
* nthreads and exec mask
* cmp/cmpx
* assembler error in s_mov_b32
* vopd in dsl?
2025-12-28 11:32:41 +09:00
chenyu
784b919f7f
Revert "optim empty shard #13513 ( #13598 )" ( #13855 )
...
* Revert "optim empty shard #13513 (#13598 )"
This reverts commit 76d465dbc3.
* test_arange_shrink
* update test
2025-12-27 21:10:23 -05:00
anu
9b4de8abc7
fix beam in python 3.14+ ( #13836 )
...
* fix beam search on python 3.14
* add PickleableCount class to helpers
* change name, add test, add step
* tidy count init
2025-12-27 16:24:22 -05:00
chenyu
0f74909ae9
clean up rearrange ( #13851 )
2025-12-27 11:06:10 -05:00
qazal
f6c660f7fa
simplify sqtt decoder infra ( #13849 )
...
* more work
* simpler
2025-12-28 00:31:16 +09:00
Clément Verrier
ae013beab8
handle empty VECTORIZE in UOp.render() ( #13847 )
...
`UOp.render()` crashed with `IndexError: tuple index out of range` when
the UOp graph contained a `VECTORIZE` with empty `src=()`. This occurs
when reshaping to scalar shape `()`, e.g., `Tensor.ones(4).sum()`.
The bug was in the renderer's VECTORIZE pattern: `all_same(())` returns
`True` (vacuous truth), causing the code to access `x.src[0]` on an
empty tuple.
- Fix `IndexError` when calling `UOp.render()` on graphs containing
empty `VECTORIZE` nodes.
- Add test for empty `VECTORIZE` rendering.
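The vacuous-truth failure described above can be reproduced in isolation. This is a minimal sketch, not the renderer's code: `all_same` here is a hypothetical stand-in for the helper named in the message, and `first_if_uniform` is an illustrative name for the guarded access.

```python
def all_same(items):
    # Hypothetical stand-in: for an empty sequence the generator yields
    # nothing, so all() returns True and items[0] is never evaluated.
    return all(x == items[0] for x in items)

def first_if_uniform(src):
    # Check for emptiness before indexing; all_same(()) alone is not enough.
    if len(src) > 0 and all_same(src):
        return src[0]
    return None

print(all_same(()))              # vacuously true
print(first_if_uniform(()))      # guarded: no IndexError
print(first_if_uniform((3, 3)))  # uniform, non-empty
```

Without the `len(src) > 0` guard, `src[0]` on the empty tuple is what raises `IndexError: tuple index out of range`.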
2025-12-27 10:09:39 -05:00
qazal
a2da61d096
use new style amd compiler in viz ( #13848 )
...
* working version, handcode gfx1100 arch
* get target from device properties
* lib in cfg test program spec
2025-12-27 23:59:30 +09:00
JINO ROHIT
1ee92003ea
minor typo ( #13846 )
2025-12-27 09:34:57 -05:00
nimlgen
276159cb87
system: add base_class to pci_scan_bus ( #13845 )
...
* system: add base_class to pci_scan_bus
* fix
2025-12-27 13:22:21 +03:00
Francis Lata
fac137779e
remove flux1 seed image ( #13843 )
2025-12-27 00:45:11 -05:00
qazal
f6de9095a0
switch asm tests to dsl ( #13840 )
...
* switch asm tests to dsl
* labeled basic blocks also work
* indenting for basic blocks
* allow define from star import
2025-12-27 02:15:16 +09:00
chenyu
ba922094f2
remove redundant check in disk_supports_fast_copyout ( #13838 )
2025-12-26 11:30:55 -05:00
George Hotz
e9f2aaba2a
simplify rdna3 asm ( #13835 )
...
* simplify rdna3 asm
* cleanups
* fix names
* fix tests
* fixes
* more test fixes
* type fixes
* tests pass + mypy passes
* 3.11 syntax
2025-12-26 11:21:03 -05:00
nimlgen
c44b4f9ae0
am: fix sdma warm boot ( #13837 )
2025-12-26 12:38:06 +03:00
George Hotz
c6937fa744
more work on RDNA3 asm ( #13833 )
...
* more llvm asm tests
* roundtrip test
* work
* more handwritten
* more handwritten
* work
* tests pass
* dual mov
* all tests pass
* all tests pass fast
2025-12-25 23:28:14 -05:00
George Hotz
f1111ac7de
move amd compilers to new style ( #13831 )
...
* move amd compilers to new style
* simplest diff
* AMDHIPrenderer
2025-12-25 13:42:24 -05:00
George Hotz
9d94b8c6b2
python asm dsl in extra + python REMU ( #13436 )
...
* having fun with python asm dsl
* rdna3
* meh
* all in rdna3
* work
* more work
* work
* integration
* tests
* simpler
* simpler
* asm
* better
* simpler
* progress
* emu
* simpler
* emu
* tests
* types
* vopd
* cleanups
* work
* memory ranges
* add tracing
* refactors
* run_asm exit
* more readable
* compare to remu
* test gemm
* bug + stale
* more tests
* refactor
* tests fix
* more ins
* more instructions
* refactor
* faster
* match case
* match case
* simpler
* work
* tests
* run_asm
* work
* bug fixes
* more emu
* alu/emu
* refactor
* no pipeline emu yet
* alu direct
* fix
* bugfixes + new test
* fix exceptions in emulators
* update gen.py
* pylint
* no pdf
* improve bench_emu
* speedups
* cleanups
* more tests
2025-12-25 13:04:14 -05:00
nimlgen
b5f3a5ad79
am: cleanup comment ( #13828 )
2025-12-25 18:00:28 +03:00
chenyu
8985a4a023
one less branch in Buffer.view [pr] ( #13829 )
2025-12-25 09:34:15 -05:00
chenyu
094753b4e0
renderer arch version cleanup [pr] ( #13830 )
2025-12-25 09:32:56 -05:00
chenyu
54af29dbdb
trange can just be a function ( #13827 )
2025-12-24 23:57:10 -05:00
qazal
a1c1684b91
set .amdhsa_kernarg_size in asm test ( #13826 )
2025-12-25 13:08:14 +09:00
chenyu
da1cb6a9ec
update llama dataloader ( #13825 )
...
separate creating the dataset from iterating over it, so eval data is not recreated for each eval
2025-12-24 17:42:08 -05:00
chenyu
a7fc0c288b
clean up BufferCopy init [pr] ( #13824 )
2025-12-24 10:40:15 -05:00
chenyu
903753c60c
llama wandb logging ( #13822 )
2025-12-24 10:24:59 -05:00
qazal
e3a646dce3
viz: skip plaintext disassemble for cfg ( #13821 )
2025-12-24 23:16:59 +09:00
chenyu
cb07c5d0e8
fewer import annotations ( #13819 )
2025-12-23 18:45:50 -05:00
George Hotz
43c6e973d8
add optional compiler in Renderer ( #13817 )
...
* add optional compiler in Renderer [pr]
* fix
* late init
* remove precompiled
* cleanup
2025-12-23 17:58:46 -05:00
George Hotz
8eab6175ee
get_program refactor ( #13816 )
...
* get_program refactor
* fix docs
* cleanup
2025-12-23 16:44:46 -05:00
George Hotz
3d3c5b2fb9
add device to program ( #13815 )
...
* add device to program
* from_uop
* from_uop no renderer
* simpler global_size
2025-12-23 16:15:33 -05:00
nimlgen
90b217896f
am: xgmi p2p ( #13811 )
...
* system: use addr space
* am: xgmi
* fix
* ugh
2025-12-23 20:11:38 +03:00
George Hotz
6439a515be
test fixups / speedups / var_vals refactor ( #13812 )
...
* no PYTHONPATH + llm server port 0
* llm tok speedup
* refactor var_vals
2025-12-23 12:05:59 -05:00
George Hotz
8dcba2e2cc
no full_rewrite [pr] ( #13809 )
...
* no full_rewrite [pr]
* fix
* fix docs
2025-12-22 23:20:01 -05:00
George Hotz
edce2303f4
rewrite to program ( #13808 )
2025-12-22 20:03:33 -05:00
George Hotz
2af2b4da5d
Revert "rewrites for renderer and compiler ( #13646 )" ( #13806 )
...
This reverts commit 339dadf056.
2025-12-22 19:21:33 -05:00
George Hotz
339dadf056
rewrites for renderer and compiler ( #13646 )
...
* rewrites for renderer and compiler
* full_rewrite_to_program
* fix pre-commit
* compiler passed into get_program
* no pkl compiler
* lib on program spec
* fix spec
* fix test
* no device
* compiler_device
* nm
* fix nir
* fix
* simplest
* fix tests
* revert
2025-12-22 18:58:43 -05:00
Daniel Xu
4edaaf19e5
Handle tied embeddings for llama 3.2 1B ( #13796 )
...
Previously the output.weight layer would not be loaded, and would only
contain randomly initialized values. This led to junk when doing a
forward pass.
Signed-off-by: Daniel Xu <daniel@thinkingmachines.ai>
2025-12-22 16:31:40 -05:00
chenyu
7f1d41c9f9
delete files that import ShapeTracker ( #13805 )
2025-12-22 15:54:18 -05:00
qazal
b31373ca70
remove llvm-mca stuff from viz ( #13802 )
2025-12-23 01:41:51 +08:00
chenyu
27d899ce97
TRAIN=0 to only eval llama ( #13804 )
2025-12-22 11:55:46 -05:00
chenyu
39d962106f
update llama logging ( #13803 )
...
```
REWRITE_STACK_LIMIT=1000000 SMALL=1 BASEDIR=/raid/datasets/c4-8b SAMPLES=1000 BS=8 DP=8 DEFAULT_FLOAT=bfloat16 OPTIM_DTYPE=bfloat16 LLAMA3_SIZE=8B SEQLEN=1024 PYTHONPATH=. MODEL=llama3 python3 examples/mlperf/model_train.py
1 93.44 s run, 11.8750 loss, 0.000000000001 LR, 642.43 GB used, 19644.30 GFLOPS
2 101.78 s run, 11.8750 loss, 0.000000000001 LR, 1454.57 GB used, 17039.35 GFLOPS
3 7.34 s run, 11.8750 loss, 0.000000000002 LR, 1454.57 GB used, 236258.78 GFLOPS
4 4.32 s run, 11.8750 loss, 0.000000000002 LR, 1454.57 GB used, 401488.40 GFLOPS
5 4.36 s run, 11.9375 loss, 0.000000000003 LR, 1454.57 GB used, 398116.13 GFLOPS
6 4.32 s run, 11.8750 loss, 0.000000000003 LR, 1454.57 GB used, 401878.60 GFLOPS
7 4.34 s run, 11.8750 loss, 0.000000000004 LR, 1454.57 GB used, 399822.57 GFLOPS
8 4.35 s run, 11.8750 loss, 0.000000000004 LR, 1454.57 GB used, 398512.24 GFLOPS
9 4.36 s run, 11.8750 loss, 0.000000000005 LR, 1454.57 GB used, 397832.61 GFLOPS
10 4.40 s run, 11.8750 loss, 0.000000000005 LR, 1454.57 GB used, 394520.83 GFLOPS
```
2025-12-22 11:28:29 -05:00
qazal
389f01c7f4
viz: amdgpu assembly basic block graph ( #13755 )
2025-12-22 23:17:16 +08:00
George Hotz
df0f9d6860
add olmoe support to llm ( #13792 )
...
* add olmoe support to llm
* cleanups
* simpler
* clean
* fix mypy
* lil
* remove dumb assert
2025-12-22 10:41:35 -04:00
qazal
81d9053013
roc: cast to nullptr instead of changing header ( #13801 )
2025-12-22 22:34:06 +08:00
nimlgen
d299d30f2c
am_smi: fix with new autogen ( #13800 )
2025-12-22 16:53:26 +03:00
nimlgen
f6bda6ae4e
am: continue from saved state ( #13799 )
...
* am: gfx queue cont
* f
* reset
* f
* l
2025-12-22 15:55:07 +03:00
qazal
6237bd86f6
sqtt/pmc viz improvements ( #13797 )
2025-12-22 18:16:35 +09:00
Sitananda Prasad
3000b8d762
symbolic: add x ^ x -> 0 folding pattern ( #13794 )
2025-12-21 21:47:28 -04:00
chenyu
5cb827f7bf
clean up can_lossless_cast and add missing pairs [p] ( #13793 )
2025-12-21 12:18:33 -05:00
George Hotz
75a6a03664
add qwen3 moe support to tinygrad.apps.llm ( #13775 )
...
* qwen moe works
* simple moe
* one test
* integration
2025-12-21 12:36:02 -04:00
chenyu
29ef0809bb
can_safe_cast -> can_lossless_cast ( #13789 )
...
safe cast in numpy only means the result won't overflow, so lossless is more precise
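A small illustration of why "no overflow" and "lossless" differ, using plain Python instead of NumPy. It assumes Python's `float` is an IEEE-754 double (53-bit significand); `is_lossless_int_to_float` is an illustrative helper, not part of tinygrad.

```python
def is_lossless_int_to_float(value: int) -> bool:
    """True if `value` survives a round trip through a 64-bit float unchanged."""
    return int(float(value)) == value

big = 2**53 + 1  # fits in a 64-bit integer, but needs 54 significant bits

# Neither conversion overflows, yet only the first one is exact.
print(is_lossless_int_to_float(2**53))
print(is_lossless_int_to_float(big))
```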
2025-12-21 11:29:19 -05:00
chenyu
ed1fd7023b
use getattr in dtype.truncate [pr] ( #13788 )
2025-12-21 11:05:43 -05:00
qazal
9839838fdd
viz UOp layout cleanup ( #13787 )
...
* use the same names in server and client
* first layout args, then renderer args
2025-12-21 22:11:40 +08:00
nimlgen
e523971028
am: make mqd contig ( #13786 )
2025-12-21 17:00:33 +03:00
qazal
09e060eab5
simplify viz node labels ( #13784 )
2025-12-21 16:45:06 +08:00
qazal
dc660c9fc0
remove stale / untested viz related files ( #13785 )
2025-12-21 16:42:48 +08:00
George Hotz
59c02dd87f
does this fix the dtype test? ( #13779 )
...
* does this fix the dtype test?
* simpler
2025-12-20 17:31:46 -04:00
George Hotz
5228f7bd06
hotfix: opencode should not reformat files
2025-12-20 15:55:29 -04:00
chenyu
733ef0452c
update test_uop_resolve ( #13777 )
...
plain @unittest.expectedFailure is too broad
2025-12-20 12:40:59 -05:00
nimlgen
3db2104fb8
am: timeout sos start ( #13776 )
2025-12-20 17:41:33 +03:00
qazal
94f97f6988
generic viz cleanups from the basic blocks branch ( #13774 )
...
* simpler codeblock highlight
* simpler append
* status enum
2025-12-20 18:18:03 +08:00
George Hotz
a987a8ed44
add neg VIZ support to not start server ( #13772 )
2025-12-20 00:36:38 -04:00
qazal
b7c2f0dd1b
remove stale extra/sched directory ( #13770 )
2025-12-20 11:57:30 +08:00
George Hotz
86cd1e9e81
remove UPatAny for typing fix [pr] ( #13766 )
...
* remove UPatAny for typing fix [pr]
* fix dtype
2025-12-19 17:41:18 -04:00
George Hotz
4702da41d5
hotfix: mkdir for extra/disassemblers
2025-12-19 17:18:37 -04:00
George Hotz
45c459848d
remove more stale stuff ( #13765 )
...
* remove more stale stuff
* remove disassemblers/adreno
* stale
2025-12-19 17:14:56 -04:00
George Hotz
744af193f0
remove ScheduleItem and merge it with ExecItem ( #13759 )
...
* remove ExecItem and merge it with ScheduleItem
* less diff
* fix issues
* min diff
* don't change bufs in _lower
* min diff
* update
* revert
* fixes
* diff
2025-12-19 17:04:24 -04:00
George Hotz
df6cde8a00
cleanup stale examples/extra ( #13764 )
...
* cleanup stale files
* examples
* move those back
* old
* delete more
2025-12-19 16:27:37 -04:00
chenyu
80b84f5267
ruff lint tinykitten ( #13762 )
...
deleted unused import and double spaces. a few ignores to not change the real code
2025-12-19 14:31:00 -05:00
Christopher Milan
97103831c5
Revert "remove image from BufferSpec ( #13636 )" ( #13761 )
...
This reverts commit 2571a1eb47.
2025-12-19 13:54:36 -05:00
Christopher Milan
2571a1eb47
remove image from BufferSpec ( #13636 )
...
* remove image from BufferSpec
* cl tiny_gemm (64) works
* mypy
* padding
* openpilot CL
* reshape properly
* remove extra qcom checks
* pad output
* mypy
* update compile test
* move undo
* TestImageCopy valid images
* TestImageRealization valid images
* TestImageDType valid images
* cleanups
* test_renderer_failures
* ruff
* mypy
* simplify ops_qcom
* bump step time
2025-12-19 13:41:20 -05:00
chenyu
185a000882
gradient of COPY ( #13760 )
2025-12-19 13:33:59 -05:00
nimlgen
57fe4d0a59
am: no_update_ptr for master ( #13757 )
2025-12-19 19:37:37 +03:00
chenyu
7fcd3cf991
hotfix SPEC for AFTER(CONTIGUOUS) ( #13752 )
...
fixed spec error in `PYTHONPATH="." REWRITE_STACK_LIMIT=5000000 NULL=1 DEFAULT_FLOAT="HALF" BERT_LAYERS=2 BENCHMARK=10 BS=128 GPUS=1 MODEL=bert python3 examples/mlperf/model_train.py`
2025-12-19 10:05:45 -04:00
qazal
81b5815a66
viz: minimal data to render a graph ( #13754 )
2025-12-19 16:19:28 +08:00
Christopher Milan
849e46da21
DLL: _PATH variables can be parent dir ( #13753 )
2025-12-19 00:28:02 -05:00
qazal
159c0e92fa
viz: infrastructure for basic block graphs ( #13751 )
2025-12-19 13:08:19 +08:00
George Hotz
fa40df972f
fix tests for NV ( #13744 )
...
* small fix
* min diff
* bfloat16 out
2025-12-18 13:20:21 -04:00
nimlgen
77191fb744
hive_reset for mi350 ( #13746 )
2025-12-18 12:02:28 +03:00
nimlgen
ceff388f3d
am: extend va space ( #13745 )
2025-12-18 11:20:43 +03:00
wozeparrot
99e667bdcd
tk fa bwd ( #13480 )
2025-12-17 23:56:37 -08:00
George Hotz
aeb7516c8a
tests passing on tinybox h3 ( #13742 )
2025-12-17 19:04:34 -04:00
chenyu
7cd7593c5d
add script to train bert on mi350x ( #13743 )
...
adapted from mi300 config
2025-12-17 16:54:04 -05:00
George Hotz
22f3e7f995
better precommit coverage and faster ( #13740 )
...
* improve pre-commit hook speed and coverage
* remove a few
* lose that
2025-12-17 13:25:55 -04:00
George Hotz
bc78cf1197
filter warnings for nicer test output ( #13739 )
2025-12-17 13:25:27 -04:00
George Hotz
b013244c38
fix local tests for AMD_LLVM ( #13738 )
...
* fix local tests for AMD_LLVM
* fix linters
* skip that for now
* fix segfault
2025-12-17 12:23:46 -04:00
nimlgen
7081014c73
am_smi: mi300 ( #13737 )
...
* am_smi: mi300
* smi
* remo
2025-12-17 17:56:01 +03:00
George Hotz
3dbde178c1
mark slow tests as slow instead of as CI ( #13736 )
...
* mark slow tests as slow instead of as CI
* CI shouldn't have different behavior
* more skips / CI
* slow
2025-12-17 10:29:57 -04:00
George Hotz
9015a22523
make tests faster ( #13734 )
2025-12-17 09:39:44 -04:00
nimlgen
3eecb4f123
am: mi350 support ( #13733 )
2025-12-17 14:57:21 +03:00
wozeparrot
5151a341b3
tk: small changes from fa bwd ( #13732 )
2025-12-16 22:44:36 -08:00
chenyu
fda73c8180
support LAMB param offload ( #13730 )
...
also added Tensor.shard_like
2025-12-16 19:56:30 -05:00
George Hotz
cf0c28d5ae
all tests pass on strix halo ( #13728 )
2025-12-16 19:35:50 -04:00
Christopher Milan
af1d938a50
DLL: search wsl lib folder ( #13727 )
2025-12-16 18:27:09 -05:00
George Hotz
0fb645cc4c
move some methods to mixins ( #13725 )
...
* move some methods to mixins
* a few more
* math trunc
2025-12-16 19:20:04 -04:00
Christopher Milan
c6ba016da6
fix cuda check ( #13726 )
2025-12-16 18:00:09 -05:00
George Hotz
ee45669d14
pre extract afters + sched cleanups ( #13720 )
...
* pre extract afters + sched cleanups
* claude.md lesson
* tests for schedule cache
* Revert "tests for schedule cache"
This reverts commit fb3f2e800a.
2025-12-16 16:14:30 -04:00
George Hotz
4b741e893f
remove REMOTE=1 ( #13722 )
...
* remove REMOTE=1
* leave ibverbs
2025-12-16 15:58:10 -04:00
George Hotz
4d8d821f56
create schedule before the cache ( #13717 )
...
* create schedule before the cache
* move create_schedule
* simpler
* simpler
* simpler
2025-12-16 14:15:31 -04:00
George Hotz
bfe374c7f5
support symbolic shapes in split/chunk when split dim is concrete ( #13718 )
...
* support symbolic shapes in split/chunk when split dim is concrete
Previously split() and chunk() required all dimensions to be concrete.
Now they only require the dimension being split to be concrete, allowing
them to work with tensors that have symbolic shapes in other dimensions.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* update CLAUDE.md: add pre-commit and no-amend rules
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* fix dim resolution order in split/chunk
Ensure dim_sz is retrieved after dim is resolved, not before.
The previous one-liner evaluated self.shape[dim] with the original
unresolved dim value.
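A rough sketch of the two points above: resolve `dim` first, then require only that one dimension to be concrete. This is not tinygrad's implementation; `chunk_sizes` is a hypothetical helper, symbolic dimensions are modelled as strings, and `chunks` is assumed to be positive.

```python
def chunk_sizes(shape, dim, chunks):
    """Sizes of the pieces when splitting `shape[dim]` into at most `chunks` parts."""
    # Resolve a negative dim before reading the shape, so dim_sz always
    # refers to the resolved axis.
    dim = dim + len(shape) if dim < 0 else dim
    dim_sz = shape[dim]
    # Only the split dimension has to be a concrete int; others may be symbolic.
    if not isinstance(dim_sz, int):
        raise ValueError(f"dimension {dim} is symbolic: {dim_sz!r}")
    step = max(1, -(-dim_sz // chunks))  # ceiling division, at least 1
    return [min(step, dim_sz - start) for start in range(0, dim_sz, step)]

# "n" stands in for a symbolic dimension that is not being split.
print(chunk_sizes(("n", 10), -1, 3))
```

Splitting along the symbolic axis, for example `chunk_sizes(("n", 10), 0, 2)`, raises `ValueError` in this sketch.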
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-16 13:55:06 -04:00
chenyu
e428fbfab6
verify dtype of llama model params ( #13719 )
2025-12-16 12:32:02 -05:00
George Hotz
e5a66ace80
multi custom kernel support ( #13716 )
...
* multi custom kernel support
* custom kernel xfrom
* works
* no SPEC=2 on ck
* panic
* touchups
2025-12-16 11:36:30 -04:00
nimlgen
5778722979
am: restore queues ( #13714 )
...
* am: restore queues
* l
* cmnt
2025-12-16 15:21:42 +03:00
chenyu
041e9a41c9
add contiguous in BertIntermediate ( #13713 )
...
faster step with a lot less recomputation
2025-12-15 22:37:36 -05:00
George Hotz
7589c897b2
split usbgpu tests into their own benchmark [pr] ( #13711 )
2025-12-15 21:42:40 -04:00
qazal
6bafd90248
remove unused process replay input [pr] ( #13712 )
2025-12-16 09:29:35 +08:00
George Hotz
321ab943b2
qwen model is working ( #13690 )
...
* qwen model is mostly working
* add Q4_K quantization support to GGUF parser, add qwen3:1.7b model
- Add Q4_K (type 12) dequantization in nn/state.py
- Add qwen3:1.7b model using Q4_K_M quantization (smaller than Q8_0)
- Make bos_token_id optional for models like Qwen3 that don't have it
- Fix line length issues and add preset parameter to SimpleTokenizer
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* smaller diff
* test dequant
* half split
* better
* simple tok
* mock token
* polish
* better
* fix
* replace
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-15 18:00:34 -04:00
George Hotz
d43e4c7553
llm args + lil html page ( #13710 )
...
* update llm args
* lil html page
* lil
* line size
* qol
2025-12-15 17:09:31 -04:00
George Hotz
ee4a7ee12f
rope half-split ( #13706 )
...
* rope half
* nicer
* this
* rearrange
2025-12-15 15:31:11 -04:00
Christopher Milan
2359e88f0c
wrap cdll redo ( #13705 )
...
* wrap CDLL with custom findlib
* lint
* regen
* fix
* mypy
* hardcode libc on macos
* fix frameworks
* fix webgpu win
* remove supports
* regen metal
* regen libclang
* regen
* simpler
* regen
* regen
* find nvrtc
* fix
* regen
* fix
* typo
* regen
* split
* rsplit one
* typo
* try load DLL
* string error
2025-12-15 13:15:02 -05:00
wozeparrot
5d509499b2
tk: kernel finish groups stores ( #13704 )
2025-12-15 09:16:17 -08:00
George Hotz
54a22aa298
add test for jit footguns ( #13701 )
...
* add test for jit footguns
* shorter
* notes
2025-12-15 10:47:44 -05:00
George Hotz
fd49bb512d
download cache by job ( #13703 )
2025-12-15 10:47:17 -05:00
George Hotz
a657a4e0f4
add Q4_K GGUF quantization support ( #13700 )
...
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-15 10:17:56 -05:00
nimlgen
615dcab767
am: minimal mi300 boot ( #13679 )
...
* nbio7_9
* psp
* gmc
* gfx
* sdma
* ih
* linter
* linter
* minor
* finish
* add missing
* do not allow warm boot for now
2025-12-15 15:55:03 +03:00
qazal
72e006cd59
fast VIZ=2 startup ( #13682 )
2025-12-15 19:16:43 +08:00
qazal
50d34428bd
fix viz endstream ( #13687 )
2025-12-15 16:54:18 +08:00
wozeparrot
7ef7ce2856
tk reg local store ( #13689 )
2025-12-14 23:07:30 -08:00
George Hotz
572ca80046
fast tinygrad.apps.llm ( #13685 )
...
* llm: add --benchmark support
* fix speed
* debug logging
* fix test attention
2025-12-14 21:05:21 -05:00
chenyu
6cad622f59
don't FREE_INTERMEDIATE in bert ( #13684 )
...
hangs green hcq consistently after an hour of training
2025-12-14 14:27:42 -05:00
chenyu
871ab8415f
some onnx cleanups ( #13683 )
2025-12-14 13:58:54 -05:00
nimlgen
75832ce4f6
am: psp with no autoload ( #13681 )
2025-12-14 20:20:09 +03:00
nimlgen
8bcb1038e4
am: nbio 7.9.0 ( #13680 )
2025-12-14 18:35:29 +03:00
George Hotz
013240938b
llm: add --benchmark support ( #13678 )
2025-12-14 08:35:05 -05:00
Robbe Derks
cddbdaf5e1
usbgpu: patch: auto-detect controller PID/VID ( #13645 )
...
* auto-detect controller
* fix lint?
* needs ''
* just try
2025-12-14 00:54:51 -05:00
George Hotz
d7fb5d9b62
speedups: early return from simplify ( #13665 )
...
* early return from simplify
* pm_rewrite
* more speed
* remove again
* early return from simplify
* ugh
2025-12-14 00:51:28 -05:00
George Hotz
bcbf832399
add chrism
2025-12-14 00:45:57 -05:00
chenyu
ed962786d6
use assign in Tensor.backward ( #13674 )
...
preserve the grad object so that jit works
2025-12-13 22:43:06 -05:00
chenyu
721a379c41
Revert "autogen: use wrapped CDLL with custom findlib ( #13666 )" ( #13675 )
...
This reverts commit f6cc3b13b9.
2025-12-13 22:42:41 -05:00
nimlgen
6402dcf940
am: xccs queue creation ( #13672 )
2025-12-13 18:37:09 +03:00
nimlgen
8430ee7d5f
am: stop hqd only when active ( #13670 )
...
* am: stop hqd only when active
* this better
2025-12-13 17:41:44 +03:00
nimlgen
a49ba241bb
am: use fb_base/fb_end as mc aperture ( #13671 )
2025-12-13 17:29:03 +03:00
nimlgen
0b15c573ca
amd: xccs in PCIIface ( #13669 )
2025-12-13 17:22:11 +03:00
qazal
019e71f8ca
lds bank count tests from pmc counters ( #13667 )
...
* lds bank count tests from pmc counters
* these tests run on the RDNA3 card too
* rename duration to cycles, other rename comment
* add SQ_LDS_IDX_ACTIVE to gfx9 defaults
2025-12-13 17:39:32 +08:00
qazal
a6dfd8a672
viz server cleanups ( #13668 )
...
* viz server cleanups
* comment
2025-12-13 17:27:53 +08:00
Christopher Milan
f6cc3b13b9
autogen: use wrapped CDLL with custom findlib ( #13666 )
...
* wrap CDLL with custom findlib
* lint
* regen
* fix
* mypy
* hardcode libc on macos
* fix frameworks
* fix webgpu win
* remove supports
* regen metal
* regen libclang
* regen
* simpler
* regen
* regen
* find nvrtc
* fix
* regen
* fix
* typo
* regen
* split
* rsplit one
* typo
2025-12-13 01:31:30 -05:00
George Hotz
55845f7de7
schedule: cache unbinds for consistent cache keys ( #13664 )
...
* schedule: cache unbinds for consistent cache keys
strip BIND values before computing cache key so different bound values
(e.g. KV cache positions) hit the same schedule cache entry.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* spec: allow single-src BIND for schedule cache key normalization
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* docs: add lessons learned to CLAUDE.md
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* more claude.md
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-12 17:27:42 -05:00
George Hotz
27845353a0
add CLAUDE.md
2025-12-12 16:50:11 -05:00
George Hotz
8c87a0bf8d
Revert "schedule: cache unbinds for consistent cache keys ( #13662 )"
...
This reverts commit af86cae10c.
2025-12-12 16:49:50 -05:00
George Hotz
443b7fea80
Revert "add notes about jit to claude.md"
...
This reverts commit 429f82e6a9.
2025-12-12 16:49:48 -05:00
George Hotz
429f82e6a9
add notes about jit to claude.md
2025-12-12 16:48:23 -05:00
George Hotz
af86cae10c
schedule: cache unbinds for consistent cache keys ( #13662 )
...
* schedule: cache unbinds for consistent cache keys
different bound variable values (e.g. kv cache positions) now produce
the same schedule cache key by unbinding BIND(DEFINE_VAR, CONST) before
computing the cache key and rebinding after lookup.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* schedule: cache unbinds for consistent cache keys
When scheduling, BIND(DEFINE_VAR, CONST) nodes are now unbound to
tagged DEFINE_VARs before computing the cache key. This ensures that
the same computation with different bound values (e.g., different
KV cache positions in LLM) gets the same cache key and reuses the
cached schedule.
The fix:
- pm_pre_sched_cache: replaces BIND with tagged DEFINE_VAR
- pm_post_sched_cache: restores tagged DEFINE_VAR back to original BIND
- pm_remove_rangeify_tags: excludes DEFINE_VAR to preserve tags through rangeify
- var_vals extracted from BINDs before cache key computation
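The idea behind the fix can be sketched on a toy graph. This is an illustration under stated assumptions, not the scheduler's code: nodes are plain nested tuples, and `unbind` / `cache_key` are hypothetical names that only mirror the unbind-before-keying step, without the tagging and rebinding done by the real pattern matchers.

```python
def unbind(node, var_vals):
    """Replace BIND(DEFINE_VAR, CONST) with the bare DEFINE_VAR, recording the value."""
    if not isinstance(node, tuple):
        return node
    if node[0] == "BIND":
        _, (_, name), (_, value) = node
        # All BINDs on the same variable must agree on the bound value.
        assert var_vals.setdefault(name, value) == value, f"conflicting values for {name}"
        return ("DEFINE_VAR", name)
    return tuple(unbind(child, var_vals) for child in node)

def cache_key(graph):
    """Return (value-independent key, extracted var_vals) for a toy graph."""
    var_vals = {}
    key = unbind(graph, var_vals)
    return key, var_vals

# Same computation, different bound position (e.g. a KV cache offset).
g1 = ("ADD", ("BIND", ("DEFINE_VAR", "pos"), ("CONST", 3)), ("CONST", 1))
g2 = ("ADD", ("BIND", ("DEFINE_VAR", "pos"), ("CONST", 7)), ("CONST", 1))
k1, v1 = cache_key(g1)
k2, v2 = cache_key(g2)
print(k1 == k2, v1, v2)
```

Both graphs map to the same key, while the bound values travel separately in `var_vals` and can be reapplied after the cache lookup.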
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* schedule: fix BIND handling and add CLAUDE.md
- Handle BIND to RANGE in create_schedule (not matched by CONST pattern)
- Assert all BINDs on same variable have same value
- Add CLAUDE.md codebase guide
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-12 16:40:10 -05:00
chenyu
fcaed1e1dd
don't use empty in bert fake data ( #13661 )
...
somehow jit does not count empty as input
2025-12-12 15:59:50 -05:00
George Hotz
316da9f7ff
llm: add created/model fields, non-streaming support, and tests ( #13660 )
...
* llm: add created/model fields, non-streaming support, and tests
- Add `created` timestamp and `model` fields to response (required by OpenAI spec)
- Add non-streaming mode support for /v1/chat/completions
- Add `send_data` helper to HTTPRequestHandler for responses with Content-Length
- Refactor viz/serve.py to use send_data
- Add integration tests using real OpenAI client
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* add openai to testing
* toml
* Remove 'openai' from dependencies
Removed 'openai' from the dependencies list.
* bump cache
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-12 14:50:36 -05:00
George Hotz
9604773e45
add model choosing support to llm ( #13656 )
2025-12-12 11:22:11 -05:00
nimlgen
e36385e570
am: support xgmi systems ( #13659 )
...
* am: support xgmi systems
* fake_am
2025-12-12 18:55:45 +03:00
nimlgen
b4796e2d32
amd: set queue prio to normal ( #13658 )
2025-12-12 18:25:41 +03:00
nimlgen
a1de7787bf
am: xcc/inst support ( #13657 )
2025-12-12 17:40:42 +03:00
George Hotz
f0fa9bcd98
openai api for llm ( #13648 )
...
* openai api for llm
* responds to simple request
* schedule cache needs to unbind
* stream works
* share stream code
* 20k
* one print
* cid
2025-12-12 08:25:33 -05:00
qazal
93ad1f7732
viz: readable pmc print, share unpacker with tests ( #13655 )
...
* viz: readable pmc print, share unpacker with tests
* sections
* static analyzer
* rm that
2025-12-12 19:29:59 +08:00
Christopher Milan
760e508c3a
autogen: no deep walk ( #13654 )
...
* no deep walk
* reset init
* delete walk
* remove print
* regen
* linkage spec
* cleanup
2025-12-12 01:04:35 -05:00
wozeparrot
8f60b8dd1e
fix: cast on transpose ( #13653 )
2025-12-11 21:03:49 -08:00
Christopher Milan
950d8de00e
automatically inline anonymous ( #13652 )
2025-12-12 00:02:44 -05:00
chenyu
01e9ad0d52
clean up bert next_data ( #13650 )
...
train iter was designed to never stop for both real and fake data
2025-12-11 22:56:28 -05:00
Jakob Sachs
ab2220b834
Handle missing bfloat16 natives on CPU architectures ( #13553 )
...
* CPU: fix compiler-rt libcall by adding intermediate casts for bfloat16
* fix lint
* remove old manual bypass of bf16 for CPU tests, and add diversion conversion from bf16 to/from fp16
---------
Co-authored-by: Jakob Sachs <jakobs99@purelymail.com>
2025-12-11 15:38:43 -05:00
nimlgen
cbae33003d
ci: add usb4 ( #13643 )
...
* ci: add usb4
* debug=3
* undef
* revert
2025-12-11 19:41:41 +03:00
chenyu
03600aef1e
failed test case when init jit with empty inputs ( #13641 )
...
not related to bert grad acc, but still seems to be a bug
2025-12-10 22:03:06 -05:00
nimlgen
51f3c9f615
am: use va_base as base ( #13640 )
2025-12-10 21:09:35 +03:00
chenyu
5034c6fb37
reenable FREE_INTERMEDIATE for bert ( #13639 )
...
* reenable FREE_INTERMEDIATE for bert
* comment
2025-12-10 12:08:09 -05:00
qazal
be6d538351
viz: add kernel walltime to pmc scoreboard ( #13638 )
...
* viz: add kernel walltime to pmc scoreboard
* fix typing
* tiny TracingKey refactor
* key on kernel name
2025-12-10 20:16:42 +08:00
qazal
1666c4aaab
viz: fix counter names ordering ( #13637 )
2025-12-10 17:05:27 +08:00
qazal
c801bb7054
viz: show all kernel pmcs ( #13635 )
2025-12-10 07:16:02 +08:00
wozeparrot
4854a0c02c
fix: getattr returns AttributeError not ImportError when missing ( #13633 )
2025-12-09 14:26:54 -08:00
chenyu
016a59cafa
remove contiguous and use where in EmbeddingBert ( #13632 )
2025-12-09 15:49:21 -05:00
nimlgen
ddecba300f
amd: use getattr for autogen ( #13630 )
...
* amd: use getattr for autogen
* fi
2025-12-09 20:36:26 +03:00
Nino Risteski
76d465dbc3
optim empty shard #13513 ( #13598 )
...
* optim empty shard
* remove tuple
* simplify
* lint
* lint2
* test
* remove original buffer unique id
* new rule
* reset shard
* update
* reset shard
2025-12-09 12:28:36 -05:00
ayanhan
47a170be2e
test: enable cummax scalar IndexError test ( #13625 )
2025-12-09 12:25:56 -05:00
Christopher Milan
9eae9dc3be
regen smu_v13 with stdint ( #13631 )
...
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
2025-12-09 12:20:01 -05:00
nimlgen
7cd8852f60
autogen: do not return tuples ( #13629 )
2025-12-09 20:08:13 +03:00
nimlgen
9e484b5b1c
hcq: check size is None, do not read the whole size for 0s ( #13628 )
2025-12-09 19:37:44 +03:00
nimlgen
1329033b8c
am: fix hot-queue restarts, only dequeue ( #13627 )
2025-12-09 19:37:21 +03:00
nimlgen
b07839493d
proclogs with xccs ( #13626 )
2025-12-09 16:46:08 +03:00
qazal
2c333818f4
simplify UOp stringifier [pr] ( #13618 )
...
* simplify UOp stringifier [pr]
* fix tuple
2025-12-09 05:06:16 +08:00
chenyu
2471b49e45
minor bert / llama change from grad acc branch ( #13622 )
...
* minor bert / llama change from grad acc branch
* revert those
2025-12-08 16:04:14 -05:00
Christopher Milan
cb3d756547
NAK compile-only test ( #13621 )
2025-12-08 15:53:46 -05:00
Christopher Milan
a4c3d48aa9
compile-only test for IR3 actually works ( #13619 )
2025-12-08 15:07:49 -05:00
Christopher Milan
a17077d1d9
skip test_double_assign in CI LVP ( #13620 )
2025-12-08 14:54:02 -05:00
Christopher Milan
1c16b6e082
Mesa: freedreno ( #12746 )
...
* ir3 init
* got a program
* 1 + 1 works
* use isa_disasm instead of shader_disasm
* wip
* matmul works
* works on py3.14
* fix const loading
* skip QCOM failing tests
* cleanup
* args actually work
* add compile-only tests
* fix typo and install tinymesa
* IR3 NULL backend
* (float32) images work
* autogen fix
* fix compile only test
* typo
* mypy happy
* compile-only uses py3.14
* bump mesa
* unify qcom disassembler
* float16 works
* disasm shows in viz
* save a line
* add real del
* variable workgroup sizes
* simplify diff
* bump line count
* properly set wgsz
* regen mesa
* no preamble
* bump lines
2025-12-08 14:02:08 -05:00
Douglas Nyberg
947c6eefc3
add Swish op ( #13541 )
...
* add Swish ONNX operator
* add Swish regression test
* remove trailing whitespace
* upgrade ONNX to 1.20, add excludes for unimplemented ops
* upgrade ONNX to 1.19, add Swish op
* upgrade ONNX to 1.19, TensorFlow to 2.18, add Swish op
* exclude attention_3d and attention_4d_gqa tests
* exclude attention fp16 tests
* exclude all attention tests
* retrigger CI
* retrigger CI - worker crash
2025-12-08 12:41:18 -05:00
nimlgen
dd8a1a10d4
amd: tiny cleanups ( #13616 )
2025-12-08 13:15:56 +03:00
qazal
2b07336c82
viz server cleanups ( #13615 )
...
* depths start at 0
* rename the api path
2025-12-08 17:44:43 +08:00
wozeparrot
89c4206e22
fix: typing ( #13614 )
2025-12-07 20:10:30 -08:00
qazal
572dfd5506
add static amd program info to viz ( #13594 )
...
* llvm-readelf
* amd_readelf + soft_err
* cleanup
* multiple metadata
* max wgp size, may be less
2025-12-08 04:08:14 +08:00
qazal
73093314bd
viz: support list of sidebar info ( #13612 )
2025-12-08 03:09:43 +08:00
chenyu
b981b6f89e
remove old llama grad_acc ( #13611 )
...
* remove old llama grad_acc
* GRADIENT_ACC_STEPS=1
2025-12-07 13:03:47 -05:00
Christopher Milan
94d7646bdc
fix anonymous struct fields ( #13610 )
2025-12-07 12:56:38 -05:00
nimlgen
dcd50baca4
amd/nv: cleanup ( #13608 )
2025-12-07 17:05:26 +03:00
nimlgen
ac5f1e115d
autogen: repro for the bug ( #13607 )
...
* autogen: repro for the test
* mute
2025-12-07 15:51:03 +03:00
Christopher Milan
4eae4b0ce6
unify adreno autogen with mesa ( #13604 )
...
* unify adreno autogen with mesa
* gen pm4
* TestTiny::test_plus works
* add a6xx enums
* IMAGE=2 TestTiny::test_gemm works
* remove adreno from CI
* cleanup
2025-12-06 15:17:36 -05:00
kamilisjon
e20bc0b9b5
remove unused function parameter in beam search ( #13602 )
2025-12-06 11:40:47 -05:00
nimlgen
abafb96441
hcq: check all subbufs are free ( #13599 )
...
* hcq: check all subbufs are free
* fix
* Update ops_amd.py
2025-12-06 17:43:18 +03:00
nimlgen
f2b549d921
amd: refactor scratch calc ( #13595 )
...
* amd: refactor scratch calc
* fix
2025-12-06 16:41:35 +03:00
chenyu
4562f217e1
more bert updates ( #13597 )
...
prep split jit
also lower BS to 72
2025-12-06 08:32:43 -05:00
wozeparrot
93f1baca77
feat: tk fa in tensor ( #13580 )
2025-12-05 14:36:29 -08:00
chenyu
cb4c6324ef
revert bert grad accumulation ( #13596 )
...
prep for the new split jit style
2025-12-05 17:30:08 -05:00
qazal
f20212e1ec
refactor viz error handler ( #13593 )
2025-12-06 02:37:39 +08:00
Christopher Milan
dec2f50aee
reenable process replay for lvp ( #13592 )
2025-12-05 12:36:35 -05:00
chenyu
0977206b1c
Revert am ( #13591 )
...
* Revert "hotfix: amd: tmpring (#13589 )"
This reverts commit 4d8b283b36 .
* Revert "amd: use correct structs (#13583 )"
This reverts commit d8b09eda57 .
2025-12-05 11:03:12 -05:00
chenyu
ac1227575f
IMAGE=1 driving_vision in benchmark ( #13587 )
2025-12-05 10:20:54 -05:00
nimlgen
4d8b283b36
hotfix: amd: tmpring ( #13589 )
...
* hotfix: amd: tmpring
* more
2025-12-05 18:19:05 +03:00
qazal
8c332219f9
viz: remove x86asm highlighter ( #13586 )
...
* viz: remove x86asm highlighter
* formatting
2025-12-05 21:05:50 +08:00
qazal
5d8726d8d2
viz: refactor to generic sidebar ( #13584 )
2025-12-05 20:09:41 +08:00
nimlgen
d8b09eda57
amd: use correct structs ( #13583 )
2025-12-05 14:46:38 +03:00
qazal
6d92e9ffbf
hotfix: skip process replay on lvp ( #13585 )
2025-12-05 19:25:23 +08:00
Christopher Milan
8011b953c9
mesa: remove glsl type hack ( #13578 )
...
* mesa: remove glsl type hack
* lazy type access
* save a line
* fix windows?
* mypy happy
2025-12-04 21:18:56 -05:00
George Hotz
c5bd28e21d
start work on schedule cache ( #13529 )
...
* start work on schedule cache
* local unique
* schedule cache works
* schedule cache cleanup
* fix tests
* preserve metadata
* oops, fix cache
* put that there
* fix spec
* always miss
* why is that broken?
* src[0].op
* fix process replay
* delete abstractions2
* reenable the actual schedule cache
* metadata is best effort
* fix JIT in examples/gradaccum_mnist.py
* full jit
* fixed and test is real
2025-12-04 17:24:49 -08:00
wozeparrot
62e2fc5108
tk: global load/store rv ( #13577 )
2025-12-04 17:23:48 -08:00
Christopher Milan
5cfe1698e8
autogen: strip function parameter qualifiers ( #13576 )
...
* autogen: strip function parameter qualifiers
* regen hip
* re-regen hip
2025-12-04 19:54:34 -05:00
qazal
f21c9dbf4b
enable PMC with VIZ=2 ( #13575 )
2025-12-05 03:09:53 +08:00
qazal
d7caae5f61
viz: tabulate pmc ( #13574 )
...
* viz: tabulate pmc
* linter
* enable nesting
* pmc comes before waves
2025-12-05 03:08:39 +08:00
chenyu
42f6cf3a90
tighter test_real_world mem and kernel count bounds ( #13573 )
...
also check that actual usage is within 20% of the set limit; the old limits were too big to be useful
2025-12-04 13:35:39 -05:00
chenyu
89f9e1dcd5
add SGD to beautiful_mnist ( #13571 )
2025-12-04 12:17:29 -05:00
qazal
512a8f3dd4
viz: start global memory PMC tests ( #13569 )
2025-12-05 00:40:27 +08:00
chenyu
7df56d3b99
Optimizer.device is a property ( #13568 )
2025-12-04 09:25:15 -05:00
nimlgen
db99a61fad
qcom: support cpu mappings ( #13565 )
...
* test
* qcom: support cpu mappings
* clean
* msg
2025-12-04 14:50:46 +03:00
George Hotz
bd6a068ef7
move track_rewrites to outer schedule cache ( #13556 )
...
Co-authored-by: qazal <qazal.software@gmail.com>
2025-12-04 19:13:45 +08:00
qazal
3eae146139
faster process replay [pr] ( #13564 )
2025-12-04 18:52:07 +08:00
Rory Clear
6eab756578
fix and test loading num_batches_tracked ( #13538 )
...
* fix and test loading num_batches_tracked
* add failing reverse case
* try reshape state dict if mismatch
* reshape for () and (1,)
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-12-04 01:22:49 -08:00
nimlgen
877a7fdd61
jit: support encdec ( #13563 )
...
* jit: support encdec
* fix
2025-12-04 11:58:34 +03:00
Douglas Nyberg
a8a62bc08e
add max/min reduction support to ScatterND ( #13562 )
2025-12-04 00:53:47 -08:00
ayanhan
edf929ec9d
fix: add __delitem__ to Tensor with proper TypeError ( #13561 )
2025-12-04 00:53:08 -08:00
Douglas Nyberg
9411ecedc4
fix CUDA half-precision trunc() type mismatch ( #13559 )
2025-12-03 21:53:16 -05:00
ayanhan
92b40290c7
fix: add test_sum_int and remove outdated TODO in test_custom_kernel ( #13560 )
2025-12-03 21:51:58 -05:00
Christopher Milan
0a54434b15
mitigate ctypes c_bool bitfield bug ( #13558 )
...
* mitigate ctypes c_bool bitfield bug
* don't delete old test
2025-12-03 20:46:04 -05:00
George Hotz
96d16675fe
update examples/gradaccum_mnist.py to use the JIT
2025-12-03 16:11:42 -08:00
George Hotz
24ca8eeaa7
small fixups from schedule_cache ( #13557 )
2025-12-03 15:41:16 -08:00
Douglas Nyberg
f5abd38132
remove tfa dependency: use keras.optimizers.Lamb and tf.raw_ops for LARS ( #13555 )
2025-12-03 17:48:27 -05:00
George Hotz
a4c4e48385
add LUNIQUE op ( #13554 )
2025-12-03 14:34:34 -08:00
George Hotz
a909cd4581
faster HEVC decode ( #13552 )
...
* faster HEVC decode
* bind to variables
* cleanups
* more cleanups
2025-12-03 11:33:05 -08:00
chenyu
22777a89ea
minor test_uop_symbolic updates ( #13551 )
2025-12-03 13:17:44 -05:00
chenyu
a205f98ef4
tighter bound for MOD ( #13550 )
2025-12-03 11:24:29 -05:00
nimlgen
fcdb01abe7
hip: fix ioctl ( #13548 )
2025-12-03 16:40:43 +03:00
qazal
aab7535805
viz: format buffer size unit ( #13547 )
2025-12-03 21:35:49 +08:00
nimlgen
daea1161cc
nv: nvdec for blackwell ( #13546 )
2025-12-03 16:30:22 +03:00
nimlgen
549f3287a8
fix caching for fetch ( #13544 )
2025-12-03 14:34:14 +03:00
qazal
8390de39e6
amd: static flag check for sqtt/pmc ( #13545 )
2025-12-03 18:36:15 +08:00
George Hotz
ddf3f2d0c4
rdna3 asm + zip_extract ( #13499 )
...
* rdna3 asm + zip_extract
* include sqtt
* fix end parsing
* disassembler working
* parsing fields
* instruction
* op
* more parsing
2025-12-02 22:56:01 -08:00
George Hotz
6bd355fa26
add needs_second_gpu decorator ( #13543 )
...
* add needs_second_gpu decorator
* more skips
* two more fixes
2025-12-02 19:08:23 -08:00
wozeparrot
0d55aec605
fix after end ( #13542 )
2025-12-02 18:42:58 -08:00
chenyu
8902781dc1
enable more benchmarks ( #13540 )
...
* enable more benchmarks
* disable some
* adjust ASSERT_MIN_STEP_TIME
* mac NOCLANG=1
2025-12-02 20:31:14 -05:00
George Hotz
055d5aeb7f
add external_test_process_count
2025-12-02 17:26:30 -08:00
chenyu
e8879f7e31
match torch clamp backward ( #13533 )
...
* match torch clamp backward
* fix PYTHON
2025-12-02 17:58:32 -05:00
qazal
7622be761f
add new remu instructions from #13533 ( #13539 )
2025-12-03 06:29:20 +08:00
wozeparrot
18640f57b2
feat: configurable timeout ( #13537 )
2025-12-02 13:35:35 -08:00
chenyu
21aac568fd
limit lift x*y out of reduce to int [pr] ( #13535 )
2025-12-02 16:11:45 -05:00
Roelof van Dijk
c158e3c988
add cifar gated uop_given_valid regression test ( #13536 )
2025-12-02 16:02:47 -05:00
Roelof van Dijk
e329baffa7
fix cifar while keeping openpilot fused ( #13528 )
...
* this works
* test now passes
2025-12-02 12:05:56 -08:00
nimlgen
0874ba8cc8
test_hevc: do not download the whole file ( #13531 )
...
* test_hevc: do not download the whole file
* fix
2025-12-02 21:31:28 +03:00
qazal
366badaa68
require renderer argument in get_program, removes device opening in process replay [pr] ( #13524 )
2025-12-03 02:05:31 +08:00
George Hotz
21184ae6b1
bump cache to 14 ( #13530 )
2025-12-02 08:02:19 -08:00
George Hotz
037edc151c
late gate for ALLOW_TF32 ( #13527 )
...
* remove ALLOW_TF32
* the right place to put that gate
2025-12-02 07:51:58 -08:00
Douglas Nyberg
6a7c58abf1
fix(onnx): unwrap list/tuple value in Pad op ( #13500 )
...
* fix(onnx): unwrap list/tuple value in Pad op
* add regression test for Pad list value
* remove trailing whitespace
* use _resolve_const for Pad constant_value
2025-12-02 07:47:20 -08:00
qazal
c65aa93081
refactor sqtt loader to enable PMC=1 SQTT=0 ( #13526 )
2025-12-02 22:50:38 +08:00
chenyu
60f7c6cce6
simpler drop_and_clauses [pr] ( #13525 )
2025-12-02 09:12:21 -05:00
nimlgen
77a76d1b13
device: respect compiler ContextVars ( #13523 )
...
* device: envvars for cc
* fix
* fix
* x
* um
* fix
* remote
* em
* cleanup
* typing
* fix
* debug
* lvp?
* ugh
* singl
* rm
* lol
* fix
* ?
* this?
* why?
* rev
* mod test
* l
2025-12-02 14:42:04 +03:00
wozeparrot
1b7dbfb37f
tk: named kernels + per kernel range id ( #13522 )
2025-12-01 22:51:04 -08:00
wozeparrot
8713ae6de9
fix: dead sdv2 download link ( #13521 )
2025-12-01 22:50:53 -08:00
George Hotz
44104b0b7f
mnist with grad acc + Adam on CPU ( #13520 )
...
* mnist with grad acc + Adam on CPU
* still broken, but closer
* works w/o jit
* this works without the jit
2025-12-01 18:27:32 -08:00
George Hotz
7307120311
shard to one device is to ( #13519 )
...
* shard to one device is to
* fst
2025-12-01 16:29:53 -08:00
chenyu
0b92fd30f5
simpler simplify_valid [pr] ( #13514 )
...
dedup instead of getting a True clause which is removed later
2025-12-01 17:36:33 -05:00
qazal
a5ec3b24be
viz: start PMC in the counters view ( #13510 )
2025-12-02 00:01:57 +08:00
nimlgen
759b41ab91
amd: fix rsrc_word3 on gfx9 ( #13509 )
2025-12-01 12:47:54 +03:00
chenyu
ebbd114885
simpler invalid alu [pr] ( #13508 )
2025-11-30 22:18:42 -05:00
George Hotz
ada6b92b2d
add a gate to rewrite if there's no rules [pr] ( #13506 )
2025-11-30 17:40:52 -08:00
George Hotz
97b56e11e0
hotfix: 32 workgroups for radeon 8050s
2025-11-30 08:20:17 -08:00
George Hotz
bd4b9de7d2
use numpy in amd_uop_matmul for simpler tracing ( #13503 )
2025-11-30 08:04:38 -08:00
qazal
9023ca30ef
show number of waves in each SE/CU ( #13491 )
...
* show number of waves in each SE/CU
* update to test_ones
2025-11-30 22:29:16 +08:00
nimlgen
455dd88236
nv: minimal hevc ( #13502 )
...
* nv: minimal hevc
* validate
* not needed
* tralin
* var
* cpu
* fix
* desc
* move
* cleanup
2025-11-30 16:46:55 +03:00
George Hotz
fd373fea7a
fix a few tests [pr] ( #13498 )
2025-11-29 13:43:45 -08:00
George Hotz
29b11c8992
bug in device enumerate where we didn't put default back ( #13495 )
2025-11-29 13:00:55 -08:00
George Hotz
6a140f74fe
split out unique_const and cache const [pr] ( #13493 )
...
* split out unique_const
* add cache to const
* call const in unique_const
2025-11-29 10:44:28 -08:00
George Hotz
c38b7684dc
improve microbenchmarks ( #13492 )
...
* improve microbenchmarks
* bugfix + ubench
* lil
* no src in const method
2025-11-29 10:15:22 -08:00
qazal
941597db71
viz UI cleanups ( #13490 )
2025-11-29 22:07:00 +08:00
qazal
d457ee0ba4
viz: correctly handle multiple sqtt traces of the same prg ( #13460 )
2025-11-29 20:52:41 +08:00
George Hotz
6f4d7c0c70
directly create tensor in _apply_uop ( #13489 )
2025-11-28 19:51:06 -08:00
kamilisjon
3d76ef9ba8
Update tests ( #13479 )
2025-11-28 18:35:28 -08:00
nimlgen
192bf4e00a
amd,nv: remove unused env vars ( #13487 )
2025-11-28 23:12:53 +03:00
qazal
ae9c56134e
skip test_tk failing locally on macbook ( #13476 )
2025-11-29 01:15:37 +08:00
qazal
f33ccd31fd
viz: instruction deduping for SQTT inst waves ( #13482 )
2025-11-28 23:17:07 +08:00
Roelof van Dijk
eb543a91e8
perf: remove graph-in-graph from expand_index ( #13473 )
...
* remove graph-in-graph from devectorizer
* vectorize, not sink
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-11-27 11:32:16 -08:00
Roelof van Dijk
d3e125d05d
keyword changed (import reserved in python) ( #13477 )
2025-11-27 11:23:00 -08:00
qazal
72ef533d9c
tracing: use u32 for buffer args encoding ( #13472 )
2025-11-28 00:19:51 +08:00
George Hotz
18addc0a1d
process replay only get_program ( #13475 )
2025-11-27 08:18:18 -08:00
George Hotz
a8e005b095
enable process replay (non-checking) by default ( #13474 )
2025-11-27 07:28:44 -08:00
qazal
952a6a8b10
viz: add kernel buffers back to the sidebar ( #13471 )
2025-11-27 22:10:35 +08:00
Kirill R.
57869387f9
Update wording in mnist.md ( #13469 )
2025-11-27 05:59:49 -08:00
nimlgen
1d207eca3d
cuda: fix fmt in compiler ( #13470 )
2025-11-27 16:51:17 +03:00
qazal
2df8a3474e
viz: bring back flops and mem in sidebar ( #13467 )
2025-11-27 17:27:44 +08:00
George Hotz
05cd2279d0
add cache on reshape ( #13466 )
...
* remove cache on divmod, way less objects
* _apply_reshape
* reshape
* no gc on realize
* wow that cache is fast
2025-11-26 18:57:40 -08:00
George Hotz
f4123b66df
add DEBUG_GC ( #13465 )
...
* add DEBUG_GC
* fixup create_schedule_with_vars
* work
2025-11-26 17:44:44 -08:00
George Hotz
19228e8d37
test_graph is flaky
2025-11-26 16:37:42 -08:00
George Hotz
268b3eb392
factor scheduling into complete_create_schedule_with_vars ( #13464 )
2025-11-26 15:43:27 -08:00
George Hotz
e4cd649ff0
remove kernelize to prepare for refactors ( #13463 )
...
* remove kernelize to prepare for refactors
* less kernelize
* last test
2025-11-26 14:18:50 -08:00
qazal
b63e5a7568
viz: full range x axis scroll ( #13459 )
2025-11-26 21:28:07 +08:00
qazal
c12e218751
viz: double click on INST wave ( #13458 )
2025-11-26 21:12:40 +08:00
qazal
e9cb738c7a
viz: event sidebar cleanup ( #13457 )
2025-11-26 19:47:15 +08:00
qazal
2a3b665972
viz: initial zoom at first event ( #13456 )
...
* viz: initial zoom at first event
* sidebar work
2025-11-26 16:42:06 +08:00
Christopher Milan
b2af92c821
fix HCQGraph.__del__ bug when finalizing ( #13298 )
...
* fix _do_ioctl import
* fix circular import
* suppress_finalizing instead
2025-11-25 20:33:48 -08:00
qazal
8c1e2a42fd
viz: start work on profiler speed ( #13455 )
2025-11-26 07:54:04 +08:00
wozeparrot
ffc31a23f4
tk mi350 ( #13288 )
2025-11-25 15:49:44 -08:00
nimlgen
436ab6bfc7
nv: use opt multiple vaspaces ( #13453 )
2025-11-25 23:10:21 +03:00
qazal
7238df7a94
viz: cleanup sort_fn ( #13454 )
2025-11-26 04:10:10 +08:00
qazal
5520f1fb0b
viz: per cu timeline ( #13451 )
...
* add cu_loc
* work
* WAVE -> W
2025-11-26 00:05:20 +08:00
qazal
4a9562e353
viz: draw markers on top ( #13449 )
...
* viz: draw markers on top
* create generic label drawer
* same text rendering infrastructure for markers
* minor details
* diff
2025-11-25 17:27:01 +08:00
George Hotz
5373fd2d66
add user device ( #13447 )
...
* add user device
* add device_sort_fn (#13448 )
Co-authored-by: qazal <qazal.software@gmail.com>
* linter
* order by dname
---------
Co-authored-by: qazal <qazal.software@gmail.com>
2025-11-25 15:25:45 +08:00
George Hotz
241e533451
toposort recursive_property is faster ( #13446 )
2025-11-24 22:29:15 -08:00
George Hotz
8e8fec408e
fix n^2 _apply_map_to_tensors [pr] ( #13443 )
...
* clean up slow rules
* fix rule
* non n^2 toposort
* topovisit
* state dict profile_marker
2025-11-24 18:59:16 -08:00
wozeparrot
249553a119
tinyfs tweaks ( #13444 )
2025-11-24 18:07:32 -08:00
wozeparrot
f46bc31156
tk: start and step in range ( #13442 )
2025-11-24 15:43:24 -08:00
George Hotz
cc5e6323ac
stable diffusion profiling ( #13441 )
...
* stable diffusion profiling
Signed-off-by: George Hotz <geohot@gmail.com>
* profile_marker
* profile per step
* fix slow Context
* profile that
---------
Signed-off-by: George Hotz <geohot@gmail.com>
2025-11-24 15:25:45 -08:00
nimlgen
18cfb54736
amd: a bit better se limiting ( #13440 )
...
* amd: a bit better se limiting
* SQTT_LIMIT_SE=0
2025-11-24 21:51:47 +03:00
C T
2d53029be3
Whisper less flaky tests ( #13435 )
...
* use less flaky metric for whisper long transcription
* multiline long transcription 3 reference
* fix reference transcript
see https://homepage.ntu.edu.tw/~karchung/miniconversations/MC.htm
sanitized for whisper
* try lower wer threshold
* add test for wer metric
* extract TRANSCRIPTION_3_ALT
* rename test
* rename
* add tests for high WER difference
* move tests
* sync metric
2025-11-24 09:50:49 -08:00
qazal
2a9bd12700
sqtt: add occupancy events to the timeline ( #13430 )
2025-11-24 22:28:05 +08:00
Sieds Lykles
63a931ff76
Symbolic divisor fuzzer ( #13433 )
...
* render z3 range better
* working version
* rename
* add to workflow
* factor out variable_names
* smaller expressions
* smaller
* + back
2025-11-23 20:29:32 +01:00
nimlgen
677db34eba
nv: cleanup map flags ( #13434 )
2025-11-23 19:54:52 +03:00
qazal
712c7a6448
sqtt loader cleanups from the occupancy branch ( #13431 )
...
* cleanup err handling
* from disasms
* s/wave_execs/wave_insts
2025-11-23 21:50:34 +08:00
George Hotz
9d7a17ee39
beautiful SQTT_PARSE=1 with color ( #13428 )
...
* beautiful SQTT_PARSE=1 with color
* linter
* linter 2
* a few more labels
* filter and or
* wave alloc
* a few more
2025-11-23 01:05:14 -08:00
qazal
474a631877
viz: align left offset for nested items ( #13420 )
2025-11-23 14:22:51 +08:00
George Hotz
da0aa57a3b
add cu parsing to attempt_sqtt_parse
2025-11-22 22:09:05 -08:00
qazal
320ed78803
can view wave timeline with SQTT_ITRACE_SE_MASK=0 ( #13427 )
2025-11-23 13:55:47 +08:00
Pranil
c1838c71fc
display service name typo ( #13426 )
...
it's tinybox-display.service
2025-11-22 20:49:56 -08:00
George Hotz
5110409339
continue work on parse sqtt, enable with SQTT_PARSE ( #13425 )
...
* continue work on parse sqtt, enable with SQTT_PARSE
* fix timing
* delta is pre instruction
* hi8 values
* a few more
* a bit more
* let it crash if you enabled it
* figure out simd
* hide 0x11
2025-11-22 19:03:17 -08:00
George Hotz
92170d0ff1
lil op cleanup ( #13424 )
...
* track flag count and op count
* text
* more
* file count
* lil op cleanup
* cleanups
* move
2025-11-22 15:21:15 -08:00
George Hotz
423b76a852
improve sqtt format parser (saturday coffee shop project) ( #13419 )
...
* improve sqtt format parser
* actually read the trash code ChatGPT wrote
* cleanups
* hand written parser
* quality
* more
* was missing first packet
* maybe
* filt
* fixups
* label the waves
* progress
2025-11-22 15:04:10 -08:00
George Hotz
9d6cf3472e
remove op/sentinel
2025-11-22 15:01:47 -08:00
Christopher Milan
310da2a201
remove hashFiles in setup-tinygrad ( #13423 )
...
* fix hashFiles in setup-tinygrad on macos
* remove hashFiles altogether
2025-11-22 17:47:10 -05:00
qazal
c14033e10f
viz: faster startup time with SQTT=1 ( #13337 )
...
* roc.py cleanups
* direct append
* viz index cleanup
* simd row details
* add kernel arg
* late instructions decode
* more instruction decode to sep server request
* 200ms startup, 6 second to waves timeline
* sort units
* creating new http paths is easy now
* instructions unpacker
* min diff, use hyphens
* summary table
2025-11-22 22:02:30 +08:00
qazal
1655fdb6de
viz: cleanup sqtt loader ( #13417 )
2025-11-22 20:10:23 +08:00
qazal
903eec3754
fix sz.py tinygrad import in ci ( #13418 )
2025-11-22 19:20:26 +08:00
nimlgen
3a42680e22
amd: pmc generic arch for gfx10+ ( #13407 )
2025-11-22 12:31:23 +03:00
George Hotz
1f8b24a6b9
track flag count and op count ( #13416 )
...
* track flag count and op count
* text
* more
* file count
2025-11-21 22:46:33 -08:00
George Hotz
4c0f4226b9
delete the PRECAST op [p] ( #13415 )
...
* don't use PRECAST in cstyle renderer [p]
* fix in metal
* fix opencl
* __builtin_bit_cast
* precast is unused
* cuda is c99?
* lambda_union_bitcast
* helper function
* delete precast op
2025-11-21 21:47:14 -08:00
wozeparrot
1f648bb1ba
feat: reenable mobilenetv2 dsp ( #13320 )
2025-11-21 15:21:49 -08:00
chenyu
054477a44f
remove full_symbolic in simplify ( #13413 )
...
only flip one schedule in winograd backward, no functional difference
2025-11-21 15:04:00 -05:00
chenyu
cb29265f23
add test that shows the validhack regression with bad rewrite order ( #13411 )
2025-11-21 13:48:30 -05:00
qazal
fdfe83880b
viz: unique sqtt wave names ( #13410 )
...
* viz: unique sqtt wave names
* better name for the shape
* it's a per program counter now
* table view, refactor to wave:insts dict
2025-11-22 02:43:31 +08:00
chenyu
a6c9b4ff6a
fix symbolic comments [pr] ( #13408 )
2025-11-21 09:18:50 -05:00
Sieds Lykles
114bb94c55
Fix load collapse MAX to ADD ( #13406 )
...
* add Ops.ADD to pattern
* add test
2025-11-21 12:26:14 +01:00
qazal
87c248eafa
small cleanups from viz memory usage fixes ( #13405 )
...
* shape link cleanups
* cleanup findRectAtPosition
2025-11-21 17:05:08 +08:00
qazal
0de1b24154
viz: SE : CU : SIMD : WAVE in sqtt timeline ( #13404 )
...
* wave id in device rows
* SE : CU : SIMD : WAVE
* automatic width
* better styling
* rm the blue
* sort
2025-11-21 15:42:29 +08:00
George Hotz
dabb02767f
set AMD profile mode with sudo on SQTT or PMC ( #13403 )
...
* require profile mode
* add mode setter
* cleanup
* not needed
* SQTT_LIMIT_SE
2025-11-20 23:19:11 -08:00
George Hotz
e1051d00d7
multi like on full_like as well as rand_like ( #13402 )
...
* multi like on full_like as well as rand_like
* add test and fix bug
* mismatch, optim match
* one line
2025-11-20 20:46:48 -08:00
chenyu
fa3def2f12
call less simplify in simplify_valid_load [pr] ( #13401 )
2025-11-20 19:54:22 -05:00
qazal
895ec7417e
viz: enable mapping function names to colors ( #13400 )
2025-11-21 06:43:02 +08:00
George Hotz
a74f6020d5
track apply map to tensors ( #13399 )
...
* track apply map to tensors
* sub
2025-11-20 14:24:55 -08:00
chenyu
647fde64e6
no sym in pm_reduce [pr] ( #13398 )
...
* no sym in pm_reduce [pr]
* fix that
2025-11-20 16:49:09 -05:00
qazal
1313250e0d
viz: use system helper for llvm-mca ( #13395 )
2025-11-21 04:47:25 +08:00
Christopher Milan
de3593957f
Revert "Revert "autogen: fix formatting on zero-argument function-like macros…" ( #13388 )
...
This reverts commit 0901a40685 .
2025-11-20 15:36:13 -05:00
qazal
1220072328
viz: refactor to generic steps api ( #13393 )
2025-11-21 04:33:23 +08:00
George Hotz
26ccbf7040
debufferize with symbolic in one pm ( #13392 )
2025-11-20 11:47:03 -08:00
George Hotz
c46f608703
top down remove_bufferize ( #13391 )
...
* top down remove_bufferize
* removable if ALWAYS_CONTIGUOUS
2025-11-20 11:32:00 -08:00
Christopher Milan
4043489803
set curl -f in setup-tinygrad ( #13389 )
...
* set curl -f in setup-tinygrad
* test bad redirect
* Revert "test bad redirect"
This reverts commit ad945e7ffc .
2025-11-20 13:45:47 -05:00
chenyu
0251a8e628
parse_valid minor cleanup [pr] ( #13385 )
...
* stricter parse_valid [pr]
* not stricter
* no VCONST
* Revert "no VCONST"
This reverts commit 330dbdf4060562596febcbf970bda6051a35012f.
2025-11-20 13:15:06 -05:00
Christopher Milan
0901a40685
Revert "autogen: fix formatting on zero-argument function-like macros ( #13386 )" ( #13387 )
...
This reverts commit 58d85d4bab .
2025-11-20 12:45:35 -05:00
b1tg
91e289cb14
amd fp8 llvm ( #13186 )
...
* amd fp8 llvm support
* fix max
* clean
* add test_mi350.sh
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-11-20 12:35:57 -05:00
Roelof van Dijk
1058748440
torch backend: no aten.detach for torch 2.10 compat ( #13381 )
...
* this works, less cpp?
* simpler = better
* keep torch 2.9 working as well
2025-11-20 09:12:15 -08:00
Christopher Milan
58d85d4bab
autogen: fix formatting on zero-argument function-like macros ( #13386 )
...
* fix formatting on zero-argument function-like macros
* autogen tests should run
* ugh
2025-11-20 12:11:04 -05:00
qazal
9dbc550692
roc: map disassembly to prog name ( #13384 )
2025-11-20 23:47:19 +08:00
qazal
ebcdf68bab
viz: use content headers for profiler ( #13383 )
2025-11-20 23:33:16 +08:00
nimlgen
0b0ea4981c
hcq: unwrap signals ( #13382 )
2025-11-20 18:12:41 +03:00
qazal
9dcd52287a
add external_benchmark_pyrender ( #13378 )
...
* add external_benchmark_pyrender
* can ctrlc it
* cpu_profile exists
2025-11-20 17:38:28 +08:00
George Hotz
cb38c704c3
delete nonfunctional ramp.py
2025-11-19 20:43:44 -08:00
George Hotz
8919c994b7
Revert "AxisType.PLACEHOLDER in reshape to do less graph_rewrite ( #13373 )" ( #13375 )
...
This reverts commit ac7559e33d .
2025-11-19 19:34:30 -08:00
George Hotz
ac7559e33d
AxisType.PLACEHOLDER in reshape to do less graph_rewrite ( #13373 )
...
* AxisType.PLACEHOLDER in reshape to do less graph_rewrite
* _apply_movement_op cache
2025-11-19 19:19:58 -08:00
chenyu
050682ab40
use invalid_gate consistently [pr] ( #13374 )
2025-11-19 22:15:12 -05:00
Roelof van Dijk
0dc2ff431d
fix: revive torch backend ( #13280 )
...
* fix: revive torch backend
* as_strided view vs copy
* Revert "as_strided view vs copy"
This reverts commit 82a61223f2 .
* add extra tests (move inplace, add fusion tests)
* better fusion with inplace_op
* no optimizer hooks (break mnist training fusion)
* split off fusion tests in separate file, assert on resnet fusion
fix: remove comments
* cleanup, reduce diff
* reduce diff
* better fusion and identity checks
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-11-19 15:26:50 -08:00
wozeparrot
56b2540349
tk: keep extra tile data by replacing uop ( #13370 )
2025-11-19 15:11:43 -08:00
George Hotz
ab7df42c78
bring back fold_divmod_general with bugfix and test [pr] ( #13369 )
...
* Revert "Revert "merge to fold_divmod_general [p] (#13359 )""
This reverts commit 05ccc69248 .
* Revert "Revert "actually merge to fold_divmod_general [pr] (#13363 )""
This reverts commit 90e5752199 .
* Revert "Revert "add cache to fold_divmod_general (#13365 )""
This reverts commit 8e17bd6791 .
* bring back fold_divmod_general with bugfix and test
2025-11-19 14:51:51 -08:00
George Hotz
986d113024
symbolic fuzz failure ( #13367 )
...
* symbolic fuzz failure
* skip flaky test
2025-11-19 14:21:08 -08:00
George Hotz
05ccc69248
Revert "merge to fold_divmod_general [p] ( #13359 )"
...
This reverts commit 7711bbac7f .
2025-11-19 14:18:09 -08:00
George Hotz
90e5752199
Revert "actually merge to fold_divmod_general [pr] ( #13363 )"
...
This reverts commit 3d82b83cec .
2025-11-19 14:18:08 -08:00
George Hotz
8e17bd6791
Revert "add cache to fold_divmod_general ( #13365 )"
...
This reverts commit b5309a5043 .
2025-11-19 14:18:08 -08:00
George Hotz
b5309a5043
add cache to fold_divmod_general ( #13365 )
2025-11-19 13:49:18 -08:00
George Hotz
3d82b83cec
actually merge to fold_divmod_general [pr] ( #13363 )
...
* actually merge to fold_divmod_general [pr]
* one more merge
* Revert "one more merge"
This reverts commit aa79f6781c .
* avoid that case for speed
* faster and simpler
2025-11-19 13:17:56 -08:00
chenyu
a91f00925b
remove VECTORIZE and WMMA rules from sym [pr] ( #13362 )
2025-11-19 14:51:21 -05:00
George Hotz
7711bbac7f
merge to fold_divmod_general [p] ( #13359 )
...
* merge to fold_divmod_general [p]
* merge more
* merge more
* merge more
2025-11-19 11:37:45 -08:00
George Hotz
6fdbd03104
more divmod cleanup [p] ( #13358 )
...
* more divmod cleanup [p]
* lil cleanups, faster
2025-11-19 10:35:15 -08:00
George Hotz
bd88a72149
div and mod to its own file, try 2 [p] ( #13357 )
2025-11-19 10:10:06 -08:00
George Hotz
957cf717e7
Python speed ( #13355 )
...
* skip process replay by default
* work on python speed
* fix names of rewrite rules
* fix that test
2025-11-19 09:03:00 -08:00
chenyu
fc19ea76b5
clean up threefry rules ( #13354 )
2025-11-19 11:48:07 -05:00
George Hotz
385618d45b
skip process replay by default ( #13353 )
2025-11-19 08:25:34 -08:00
chenyu
fba4535289
remove hacks for threefry long removal when padded [pr] ( #13352 )
2025-11-19 11:11:39 -05:00
George Hotz
225eb1500f
generic range changes that work for str + int ( #13350 )
...
* generic range changes that work for str + int
* opt range counts up
2025-11-19 08:07:49 -08:00
chenyu
1a72ac16a6
move where same false branch rule to symbolic_simple [pr] ( #13349 )
2025-11-19 10:15:38 -05:00
chenyu
79055ddb8b
clean propagate_invalid more [pr] ( #13347 )
2025-11-19 09:47:50 -05:00
nimlgen
0c9fbf87e1
nvioctl: classes ( #13346 )
2025-11-19 16:14:15 +03:00
qazal
f2221130bb
viz: pick shape by event type ( #13279 )
2025-11-19 20:15:52 +08:00
wozeparrot
be72b78dcb
tk: small fixes ( #13345 )
...
* fix: handle case where final uop isn't a tk wrapped one
* clean: remove after from mma
2025-11-19 00:58:50 -08:00
wozeparrot
e4fbde5b3b
fix: extra options need to go on second step too ( #13344 )
2025-11-19 00:58:09 -08:00
George Hotz
1a332afa76
spec test on 3.14 ( #12957 )
2025-11-19 00:43:04 -08:00
Christopher Milan
a438c277de
autogen tests for 3.14 ( #13343 )
2025-11-18 22:16:59 -05:00
chenyu
722e7a16ed
remove rule in propagate_invalid [pr] ( #13342 )
2025-11-18 21:38:33 -05:00
George Hotz
1afa3c0877
vmap on full model ( #13340 )
...
* vmap on full model
* vmap gemm
* reduce sums on end
* outer reduce
* only if there's ranges
* put those rules in symbolic
* ranges
* do opt later
* add zero range
2025-11-18 16:06:06 -08:00
chenyu
46cb65e692
delete rules from sym [pr] ( #13339 )
2025-11-18 14:57:35 -05:00
George Hotz
9c59b3d19e
vmap grad needs reduce_backward ( #13336 )
...
* vmap grad needs reduce_backward
* fuse and outer
2025-11-18 10:08:30 -08:00
qazal
a647c9eca6
sqtt ui minor fixes ( #13335 )
...
* roc.py cleanups
* direct append
* viz index cleanup
* simd row details
2025-11-19 01:27:56 +08:00
George Hotz
06e39a88a9
outer vmap works ( #13334 )
...
* outer vmap works
* fuse works
* vmap outer works
* outer ranges work
* grad work
* should be good to merge
2025-11-18 09:27:48 -08:00
chenyu
805de27e07
no load substitute in uop_given_valid [pr] ( #13333 )
2025-11-18 11:47:58 -05:00
chenyu
05294bc648
fix some mypy cast [pr] ( #13331 )
2025-11-18 09:23:42 -05:00
qazal
5623e765c8
VIZ=2 enables SQTT ( #13330 )
2025-11-18 22:20:31 +08:00
nimlgen
331f70aa75
roc: ctrlc ( #13255 )
...
* roc: ctrl-c works
* rm
2025-11-18 19:29:28 +08:00
George Hotz
583560ab72
this is the right way to write vmap ( #13328 )
2025-11-17 20:20:52 -08:00
Christopher Milan
8e8e53c886
int8_t is c_byte ( #13326 )
2025-11-17 21:25:40 -05:00
George Hotz
e4fead8a86
write scan in uops ( #13321 )
...
* write scan in uops
* ops range
* no need for variable
* meh, later
* shorter
2025-11-17 16:58:08 -08:00
wozeparrot
8894a5409d
feat: hipcc compiler ( #13319 )
2025-11-17 15:13:32 -08:00
George Hotz
6d3385c284
print special ops in postrange ( #13318 )
...
* print special ops in postrange
* fix on OSX
2025-11-17 14:43:23 -08:00
chenyu
b637093be9
remove a few rules in pm_lower_index_dtype [pr] ( #13317 )
2025-11-17 17:04:56 -05:00
George Hotz
98e9e73286
hotfix: amd_uop_matmul getenvs
2025-11-17 13:26:01 -08:00
qazal
e7e1935225
cleanup sqtt/test_timing ( #13315 )
2025-11-18 04:28:05 +08:00
wozeparrot
33773fda87
tk initial mi350 ( #13289 )
2025-11-17 11:46:32 -08:00
nimlgen
e2cee64050
Revert "hcq: add tag to exec events ( #13311 )" ( #13314 )
...
This reverts commit f63ded5817 .
2025-11-17 22:15:31 +03:00
chenyu
646372490c
move tiktoken import in llama3 ( #13316 )
...
only Tokenizer requires that
2025-11-17 14:09:37 -05:00
qazal
a37f221e44
viz: visualize waves in the timeline ( #13292 )
...
* viz: visualize waves in the timeline
* timeline in format
* per step
* rm that
2025-11-17 22:04:21 +08:00
nimlgen
f63ded5817
hcq: add tag to exec events ( #13311 )
...
* hcq: add tag to exec events
* f
* fix
* fix
2025-11-17 16:59:30 +03:00
qazal
50a443f558
viz: add shader engine to wave exec payload ( #13310 )
...
* viz: show sqtt shader engine
* order it from smallest unit
* easier to config
2025-11-17 19:11:34 +08:00
nimlgen
9bb17c53ea
amd: timer fix ( #13267 )
2025-11-17 13:59:03 +03:00
George Hotz
55be95da15
cleanup sqtt raw parser ( #13309 )
...
* cleanup sqtt raw parser
* better names (don't merge yet)
* clean up amd
* a few more names
* one more filter
2025-11-16 13:11:51 -08:00
George Hotz
cabd4add48
more work parsing SQTT, separate VIZ/PROFILE ( #13308 )
...
* more work parsing SQTT
* more minimal runner
* sep VIZ/PROFILE
* parse print new
* improve parser
* more filter
* that
* split them
* lil cleanup
* skip flaky test
* AQL in mmapeak
2025-11-16 10:40:39 -08:00
qazal
13efdf8c31
test s_nop stall ( #13307 )
2025-11-17 00:59:39 +08:00
George Hotz
295600dc5a
saturday coffee shop work parsing the att format ( #13295 )
...
* saturday coffee shop work parsing the att format
* add examples
* parser
* classes of packets
* fully vibe coded parser
* vibing
* empty
* some vibe names
* vibes
* most of these are wrong
* more vibes
* better names
* parsing
* parse
* cleanup parser
* touchups
2025-11-16 08:25:51 -08:00
Christopher Milan
a9ed241172
properly suppress NIRRenderer.__del__ error ( #13299 )
2025-11-16 18:58:04 +03:00
qazal
c70b06ec19
sqtt test_timing work ( #13304 )
...
* sqtt test_timing cleanups
* only the instruction
* v_mfma_f32_16x16x32_f16 16 cycles, only after second one though
2025-11-16 23:49:24 +08:00
chenyu
8f0e747b3a
Tensor._tri with arange ( #13297 )
2025-11-16 10:21:16 -05:00
chenyu
6372c95094
disable benchmark MobileNetV2 on DSP ( #13305 )
...
failed on tinyc2
2025-11-16 09:42:52 -05:00
Christopher Milan
61625a3898
fix objc finalizing bug ( #13296 )
2025-11-16 12:43:04 +03:00
nimlgen
acbe6361ab
qcom: suppress_finalizing to free ( #13294 )
2025-11-16 11:49:23 +03:00
wozeparrot
ef42334239
tk: load store cleanup ( #13290 )
2025-11-15 17:08:23 -08:00
chenyu
e8844853ed
Tensor.eye with arange ( #13287 )
...
with rangify we can write these with arange
2025-11-15 12:32:27 -05:00
Christopher Milan
5b823af696
Remove (pypi) clang dep for autogen ( #13284 )
...
* no more clang
* regen comgr_3
* ci doesn't need pypi clang
* fix objc
* REGEN for libclang
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-11-15 09:05:11 -08:00
George Hotz
df53c62a9f
bump line count
2025-11-15 08:16:20 -08:00
nimlgen
d37e1fe065
nv: wait for wpr to reset ( #13282 )
...
* nv: wait for wpr to reset
* fix
* comment
* wai
* f
* fix
2025-11-15 20:00:49 +08:00
George Hotz
22c08b470c
fold using outerworld range ( #13286 )
...
* scan using outerworld range
* almost
* sched
* simple range
* mypy
* woooo outer range
* spec passes
* print the numbers
* lol it runs
* real test
2025-11-14 20:43:41 -08:00
George Hotz
567066f51f
tests for cast there and back ( #13195 )
...
* fix cast folding in llama
* dtypes that work everywhere
* Skip test_cast_there_and_back for backend casts
Skip test due to backend casting issues.
2025-11-14 16:56:09 -08:00
George Hotz
6c5fa349e1
add (unused) outer range ( #13285 )
2025-11-14 16:47:52 -08:00
Christopher Milan
d1bb08c5a1
In-tree autogen: objective c ( #13223 )
...
* checkout changes from autogen branch
* move assert
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-11-14 14:08:42 -08:00
George Hotz
e5351699bd
openpilot warp ( #13283 )
...
* openpilot image warp test
* 0.4 ms on metal, 1 ms on CPU
* new inputs each time
* reshape
2025-11-14 13:55:32 -08:00
qazal
7c110e1a57
viz: minor cleanups for sqtt ( #13275 )
...
* small prg cleanup
* test_timing
2025-11-15 01:08:56 +08:00
chenyu
888aaab151
test_tiny cleanup ( #13276 )
2025-11-14 11:11:32 -05:00
nimlgen
3e63831b98
nv: support 580+ drivers ( #13269 )
...
* nv: 580+ support
* start
* f
* fake
* fix
2025-11-14 21:44:16 +08:00
qazal
2ee701a009
roc: fix CEnum access ( #13270 )
...
* roc: add decoder to ci
* also add installer
* use CEnum syntax
* try 2
* add to setup
* revert ci change
* the other enum too
2025-11-14 21:41:24 +08:00
nimlgen
c80d459d99
autogen: fix packed args structs ( #13274 )
...
* autogen: fix packed args structs
* and test this
2025-11-14 20:24:06 +08:00
nimlgen
14eb48b13a
autogen: rename nv_gpu to nv_570 ( #13273 )
...
* autogen: rename nv_gpu to nv_570
* rename
2025-11-14 20:07:19 +08:00
nimlgen
734bfa07b4
nv: refactor uvm calls ( #13272 )
2025-11-14 19:53:04 +08:00
nimlgen
f72b1fbca4
nv: read numClasses ( #13271 )
...
* nv: read numClasses
* fix
* d
2025-11-14 19:43:25 +08:00
nimlgen
84f065f2a2
autogen: warning and msg ( #13268 )
...
* autogen: warning and msg
* f
2025-11-14 19:10:26 +08:00
George Hotz
44d84228ff
move comgr_3 logic back to the old place ( #13266 )
...
* move comgr_3 logic back to the old place
* explicit
2025-11-13 20:05:54 -08:00
Christopher Milan
09f3aae169
In-tree autogen: all C libraries ( #13220 )
...
* checkout files from autogen branch
* ioctl with payload
* fix am generations
* properly fix generations
This reverts commit b2a54f4f41.
* revert discovery.h
* support pragma pack(1)
* typo
* better getter
* typo
* NVCEC0_QMDV05_00_RELEASE[01]_ENABLE
* align support
* anon handling fix
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-11-13 18:57:44 -08:00
wozeparrot
777cbec5b3
tk: rename rt tile dims to base ( #13265 )
2025-11-13 18:43:02 -08:00
wozeparrot
7eb0d8e744
feat: mixins on tiles ( #13246 )
2025-11-13 16:52:52 -08:00
George Hotz
ba84d415fe
work from benchmarking tinybox red v2 ( #13264 )
...
* work from benchmarking tinybox red v2
* gpuburn
2025-11-13 16:38:40 -08:00
wozeparrot
547304c471
tk: group cleanup ( #13262 )
2025-11-13 14:19:51 -08:00
wozeparrot
4ada51618f
tk: don't flatten in clear ( #13249 )
2025-11-13 13:38:01 -08:00
George Hotz
6b1bae6614
ruff format mixin ( #13261 )
2025-11-13 10:10:38 -08:00
Faizaan Gagan
3049f3edda
support _rebuild_tensor method interception ( #13253 )
2025-11-13 09:41:21 -08:00
Harald Schäfer
3af231904e
openpilot compile tests: assert pre-rangify speeds ( #12775 )
...
* assert pre-rangify speeds
* typo
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-11-13 09:39:06 -08:00
George Hotz
faf68c03a8
more mi350x matmul work ( #13138 )
...
* more mi350x matmul work
* broken compute
2025-11-13 09:09:28 -08:00
Ayman Jabr
256f81bb02
Fix tracemeta 0 ( #13049 )
...
* chore: tclesius branch resolved
* fix: indentation
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-11-13 09:07:11 -08:00
alpharush
7e0aaadecd
feat: add repro command to summary ( #10930 )
2025-11-13 08:52:27 -08:00
nimlgen
6be86dde17
nv: add timeout when responding to rpc ( #13260 )
2025-11-14 00:42:21 +08:00
nimlgen
f9b7586e08
roc: fix blob gc ( #13256 )
2025-11-13 23:38:35 +08:00
George Hotz
263b724143
one cache and bump it ( #13258 )
2025-11-13 07:33:31 -08:00
George Hotz
5efa727b83
move _pool to MovementMixins ( #13257 )
2025-11-13 07:28:52 -08:00
George Hotz
bcdfc109b5
hotfix: disable flaky test
2025-11-13 06:19:28 -08:00
qazal
006dea4c3e
roc: only save instruction execs ( #13254 )
2025-11-13 21:28:40 +08:00
nimlgen
f9586b38ba
system: pci mask and val ( #13251 )
2025-11-13 20:44:58 +08:00
George Hotz
7316da3253
new readme ( #13250 )
...
* new readme
* update
2025-11-13 00:48:28 -08:00
George Hotz
17aa3379e9
hotfix: improve self_tokenize
2025-11-13 00:18:57 -08:00
chenyu
4e5a9132e7
JIT_BATCH_SIZE=0 in compile3 ( #13245 )
...
fixed some enqueue time
2025-11-12 23:12:45 -05:00
wozeparrot
759557f633
feat: move tk tests to testextra ( #13242 )
2025-11-12 17:06:53 -08:00
chenyu
3f939f3d3c
update pm_simplify_valid ( #13241 )
...
* update pm_simplify_valid
fixed openpilot conv regression
* IMAGE training is broken
2025-11-12 19:40:02 -05:00
chenyu
f9851a852f
minor update to uop_given_valid [pr] ( #13243 )
...
split from #13241
2025-11-12 19:03:18 -05:00
qazal
fe2876a6d8
hotfix: second GB/s in viz ( #13240 )
2025-11-13 07:14:27 +08:00
George Hotz
a23dea202b
actually make AMD_LLVM not default ( #13238 )
2025-11-12 15:07:23 -08:00
George Hotz
ab9fa964d8
DISABLE_COMPILER_CACHE -> CCACHE ( #13234 )
...
* DISABLE_COMPILER_CACHE -> CCACHE
* Fix cachekey assignment in Compiler constructor
2025-11-12 15:07:09 -08:00
qazal
be2e24cb25
roc: requires sudo to install ( #13237 )
2025-11-12 16:59:22 -05:00
George Hotz
8f1f195b6d
hotfix: no hexdump for usbgpu patch.py
2025-11-12 12:05:37 -08:00
nimlgen
9a53fcbde4
amd: sqtt on rdna3.5 ( #13233 )
2025-11-13 03:30:42 +08:00
George Hotz
13f10a31dc
AMD_LLVM default off ( #13232 )
2025-11-12 11:06:33 -08:00
qazal
8b26cf2b3d
sqtt: update rcp timing test ( #13231 )
...
* sqtt: assert correct output in timing test
* found why
2025-11-13 02:01:54 +08:00
Jan Akhremchik
bc8e537423
Add NONZERO op to onnx backend ( #13211 )
2025-11-12 08:55:51 -08:00
nimlgen
af17e07251
viz: sqtt touchups ( #13228 )
...
* viz: sqtt touchups
* revert
* matches
2025-11-12 22:40:37 +08:00
qazal
7a6853fa40
viz: show python callstack in the first graph ( #13218 )
2025-11-12 20:52:28 +08:00
nimlgen
82eb63d3ad
qcom: auto switch idle timer when profiling ( #13230 )
...
* qcom: auto switch idle timer when profiling
* fi
2025-11-12 20:31:24 +08:00
nimlgen
fcd8d0751a
test_timing for hip ( #13229 )
2025-11-12 20:28:58 +08:00
qazal
74b9d33acb
viz: direct link to program source ( #13227 )
2025-11-12 16:27:13 +08:00
wozeparrot
371c1f2355
tk: move tiles to class ( #13224 )
2025-11-11 21:53:46 -08:00
Christopher Milan
41a098a82d
In-tree autogen: libc.py ( #13217 )
...
* checkout changes from autogen branch
* parents
* pylint happy
* move sys to system in helpers.py
* typo
* typo
2025-11-11 19:13:48 -08:00
wozeparrot
222bb12ddf
tk softmax ( #13205 )
2025-11-11 15:13:16 -08:00
wozeparrot
787f0070ed
feat: don't use output reg as local reduce reg ( #13203 )
2025-11-11 14:35:16 -08:00
chenyu
ece1415def
clean up image_dot and image_conv2d ( #13222 )
...
* clean up image_dot and image_conv2d
* those are fine
* interesting
2025-11-11 15:53:03 -05:00
nimlgen
2f0ea29b34
qcom: 48bit timestamps ( #13214 )
...
* qcom: 48bit timestamps
* f
* lol
* fix
2025-11-12 04:14:33 +08:00
qazal
bc55bc4849
cleanup test_viz profiler tests ( #13221 )
2025-11-12 03:46:48 +08:00
chenyu
23b90945c3
add a benchmark for openpilot vision with DEBUG=2 ( #13219 )
...
see per kernel speed, also disable the jobs for 0.9.9
2025-11-11 14:41:52 -05:00
George Hotz
c2075f3613
gc disable during big rewrites ( #13215 )
...
* gc disable during big rewrites
* cleaner with helper
2025-11-11 10:30:47 -08:00
Roelof van Dijk
e59313da08
migrate pytest and ruff ( #13216 )
2025-11-11 13:27:51 -05:00
Gaétan Lepage
6fd7ce3832
migrate to pyproject.toml ( #13189 )
...
* migrate to pyproject.toml
* move mypy config to pyproject.toml
2025-11-11 09:09:27 -08:00
qazal
8002921a04
viz: improve the program run tooltip ( #13212 )
...
* add tflops to tooltip format
* show if the run was batched
2025-11-12 00:56:03 +08:00
qazal
f91e366a17
viz: display the graph layout recursion error ( #13194 )
...
* viz: display the graph layout recursion error
* share styles
* +min-width
* same thing
* inline the append
2025-11-11 15:25:12 +08:00
wozeparrot
73497af4c0
clean: use np for allclose ( #13204 )
2025-11-10 23:02:43 -08:00
George Hotz
a6360fd94d
store can have shape ( #13202 )
...
* store can have shape
* _shape
2025-11-10 22:16:47 -08:00
b1tg
f3692b7406
clean up hip renderer ( #13063 )
...
* clean up hip renderer
* ocml
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-11-11 00:44:24 -05:00
chenyu
22b8579234
one last regressed dm kernel ( #13201 )
2025-11-10 23:30:52 -05:00
chenyu
58b7e4fab3
GROUPTOP heuristic on more axes ( #13206 )
...
fixed dm speed
2025-11-10 23:30:37 -05:00
chenyu
829cdafccc
update openpilot slow conv uop ast ( #13197 )
...
the two remaining slow ones
2025-11-10 17:03:20 -05:00
George Hotz
0c978d45e6
stub attention ( #13196 )
...
* stub attention
* name the kernels
2025-11-10 13:48:38 -08:00
chenyu
58c30fc7ce
minor image_conv2d cleanup ( #13193 )
2025-11-10 16:05:40 -05:00
chenyu
60e55d9a2d
line count 18500 ( #13191 )
2025-11-10 13:52:13 -05:00
nimlgen
09a59c2203
qcom: support new chip versioning ( #13185 )
...
* qcom: support new chip versioning
* ops
* nit
* fix
* f
2025-11-10 23:57:29 +08:00
qazal
50934050bc
sqtt: append all wave execs ( #13190 )
2025-11-10 23:50:08 +08:00
qazal
38a24731a1
cleanup sqtt tooling ( #13188 )
...
* cleanup viz/serve.py
* use latest profile in rgptool.py
* unwrap nullable in roc.py, fix disasms typing
2025-11-10 20:52:57 +08:00
qazal
845a24dcc6
viz: group sqtt waves by program ( #13187 )
...
* viz: group sqtt waves by program
* color the names
2025-11-10 19:25:23 +08:00
George Hotz
fd6803000e
mutmut cfg ( #13184 )
...
* mutmut cfg
* coveragerc
2025-11-09 23:29:29 -08:00
wozeparrot
6252831ceb
feat: initial tk library ( #13160 )
2025-11-09 22:54:29 -08:00
George Hotz
925231aec1
repeat does less reshape for 1s ( #13183 )
2025-11-09 19:43:02 -08:00
George Hotz
d7369de048
hotfix: update weekly commits table
2025-11-09 19:37:06 -08:00
chenyu
6c48c87e51
improved ASSERT_MIN_STEP_TIME ( #13182 )
...
* improved ASSERT_MIN_STEP_TIME
getting close, current time +1ms then round up
* relax
2025-11-09 16:41:12 -05:00
nimlgen
17715688c7
system: validate vendor for APLPCIIfaceBase ( #13181 )
2025-11-10 02:49:21 +08:00
nimlgen
614783693e
nv: remove hardcoded expansion_rom_off ( #13180 )
...
* nv: remove hardcoded expansion_rom_off
* to max size
2025-11-09 21:43:19 +08:00
chenyu
e1d46de8f8
update GROUPTOP heuristic more ( #13178 )
...
reverts #13176
2025-11-09 02:31:12 -05:00
chenyu
41e45c20ff
minor stuff reading the printed code [pr] ( #13177 )
2025-11-09 00:58:51 -05:00
chenyu
8e868dced8
only GROUPTOP one reduce kernel ( #13176 )
...
* only GROUPTOP one reduce kernel
* ALLOWED_GATED_READ_IMAGE=148
2025-11-08 22:38:44 -05:00
chenyu
834067d91f
move onnx import in compile3 ( #13172 )
...
only used in test_vs_onnx
2025-11-08 09:44:34 -08:00
nimlgen
7f3240dbfe
nv: cleanup alloc ( #13170 )
...
* nv: cleanup alloc
* okay okay
2025-11-09 00:14:46 +08:00
qazal
7250fc0354
viz: double click on kernel run goes to codegen ( #13147 )
2025-11-08 23:40:50 +08:00
qazal
8a7fa9e7b4
sqtt: show total cycles of kernel in viz ( #13169 )
2025-11-08 21:00:40 +08:00
chenyu
2ba8b4946f
external_benchmark_op_cat.py ( #13168 )
...
* external_benchmark_op_cat.py
cat kernel that's 1ms on master and 50us with no GROUP and with NOLOCALS
* fix
2025-11-08 01:54:10 -05:00
chenyu
a62496cb3d
clean up get_grouped_dims [pr] ( #13159 )
2025-11-08 01:53:54 -05:00
wozeparrot
eb0192b0bb
feat: print ranges that aren't ended ( #13167 )
2025-11-07 22:01:29 -08:00
George Hotz
b41541bc44
bounty: Remove Tensor._pool alternative implementation and verify kernels remain the same ( #13164 )
2025-11-07 16:59:48 -08:00
George Hotz
ffb9e8396f
fix indexing bug with convs
...
* minimal difference for ONE_POOL=1
* fix indexing bug
* improve indexing debugger
* more debugger improvements
* always for reshape
2025-11-07 16:45:19 -08:00
chenyu
6a509da7f3
Scheduler.reduceops helper [pr] ( #13162 )
2025-11-07 18:59:46 -05:00
George Hotz
2413311289
make _pool simpler ( #13161 )
...
* make _pool simpler
* just syntax
* more correct and smaller
* try this now
* Revert "try this now"
This reverts commit 607cdc2164.
* ONE_POOL
2025-11-07 15:58:44 -08:00
George Hotz
70054cdb14
move backward cast to broadcasted, expand to mixins ( #13156 )
...
* shrink_to mixin
* move backward cast into _broadcasted
* expand to movement mixin
* move a few more
* fix spec issue
2025-11-07 15:07:47 -08:00
George Hotz
f2519ea0ba
shrink_to mixin ( #13155 )
2025-11-07 11:46:24 -08:00
C T
0f9d7f650d
whisper: fix oob, explicit dtype ( #13144 )
...
* fix dtype depending on numpy version
numpy v2 np.array returns int64 which Tensor passed through for the
first decode call, swallowing the <|notimestamps|> token and corrupting
the sequence
* fix whisper OOB
global limit on whisper's context length
* enforce whisper max_tokens_to_sample (match openai)
local limit on max tokens decoded
2025-11-07 12:55:01 -05:00
Ahmed Harmouche
3ecff3a8da
Fix dim splitting bug for len(dim) == len(limited) case ( #13142 )
...
* Fix gpudims bug on webgpu
* Fix split dim bug
* Remove webgpu_bug from examples
* Add test for shape correctness
* Fix 3D indexing
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-11-07 12:31:06 -05:00
nimlgen
b8e48effcb
device: no compilers message with reasons ( #13146 )
...
* device: no compilers message with reasons
* typings
* mypy
2025-11-07 23:01:45 +08:00
nimlgen
35e461ef69
hcq: use exception group ( #12616 )
...
* hcq: use exception group
* fix
2025-11-07 21:23:12 +08:00
nimlgen
10dc8335d2
tinygpu: fix teardown crash ( #13143 )
...
* tinygpu: fix crash
* um?
* double release
* restore
2025-11-07 19:52:54 +08:00
qazal
d4a216d7d9
viz: display compiler errors ( #13141 )
2025-11-07 18:09:50 +08:00
qazal
7e94369464
add helper for test_timing custom ops ( #13140 )
2025-11-07 17:13:55 +08:00
nimlgen
95620426d5
tinygpu: unmap dma when client closed ( #13129 )
...
* tinygpu: unmap dma when client closed
* syn
* tiny fixes
2025-11-07 16:08:43 +08:00
wozeparrot
500d7661fa
feat: show range len on index in viz ( #13139 )
2025-11-06 23:21:27 -08:00
George Hotz
bb6364d7c7
tuplize from linearizer behind flag ( #13136 )
...
* remove tuplize from linearizer
* optional tuplize
2025-11-06 20:15:03 -08:00
chenyu
bb8cf948f2
variation of (x%c)+(x//c)*c = x ( #13135 )
...
when x is in the form of y//b, the idiv term might have combined
2025-11-06 18:53:28 -05:00
George Hotz
42b34cf83d
bottom up linearizer ( #13133 )
...
* bottom up linearizer
* late stores
* more complete
* remove broken heuristic
* upcast size
* opt
* more conservative
* it needs that
* disable opencl half on QCOM
* fix
* make that a real test
* cpu test okay
* ptx skip
* end is after the range
2025-11-06 15:30:32 -08:00
George Hotz
e0d828dba8
little cleanups
2025-11-06 13:58:19 -08:00
chenyu
bfb0c0391f
test custom eye function ( #13134 )
...
this version is also faster with NOOPT
2025-11-06 14:51:55 -05:00
George Hotz
290441dd44
do loads early ( #13131 )
...
* do loads early
* local and reg
2025-11-06 09:57:09 -08:00
George Hotz
097264853d
very simple priority ( #13130 )
...
* very simple priority
* still simple
2025-11-06 09:25:28 -08:00
George Hotz
07b415e831
fixup op order ( #13128 )
...
* fixup op order
* more order
* move a few more
* more
* DEBUG_LINEARIZE
2025-11-06 08:50:04 -08:00
nimlgen
b9b68bf437
amd: add kern to sqtt event ( #13126 )
...
* amd: add kern to sqtt event
* fix
2025-11-06 22:02:02 +08:00
qazal
88245d6579
qol improvements to sqtt decoder and timing tests ( #13125 )
2025-11-06 20:51:30 +08:00
nimlgen
dafdb4bfb1
test hcq open with pytest ( #13124 )
...
* test hcq open with pytest
* fi
2025-11-06 20:09:51 +08:00
nimlgen
05e2ff4d87
system: fix flock on pcidevs ( #13123 )
...
* system: fix locking of hcq devices
* rename and fullrun
* force ok
* fix
* fix
2025-11-06 19:02:13 +08:00
qazal
3126c89b84
viz: visible horizontal scrollbar in long texts ( #13122 )
2025-11-06 17:23:02 +08:00
George Hotz
91cc773397
add run count to toposort ( #13119 )
2025-11-05 22:29:34 -08:00
Adeeb Shihadeh
dca7fb0a49
qcom: make priority configurable ( #13120 )
2025-11-05 22:27:54 -08:00
qazal
b2bb3af12a
make range_color work in VIZ ( #13121 )
2025-11-06 14:26:48 +08:00
chenyu
f33c182393
test custom qkv kernel ( #13118 )
...
adding the online softmax hits infinite loop so starting with this
2025-11-05 23:32:13 -05:00
George Hotz
c65e6d8887
add ranges to print_uops ( #13116 )
...
* remove tuplize from linearizer
* try this
* simple priority
* add colored ranges to print_uops
* improve comments
* fix no const in src
* fix mypy
* fix define global
* fix var placement
* no prefer early load
* revert linearizer for now
2025-11-05 20:26:56 -08:00
George Hotz
9b2b535fa4
fix issue with multi flip ( #13115 )
2025-11-05 15:28:50 -08:00
George Hotz
4027eef264
fix test warnings ( #13114 )
...
* fix test warnings
* precommit passes
* ignore std_mean warning
2025-11-05 15:06:29 -08:00
George Hotz
bcfe42937f
move permute/flip/shrink to mixins ( #13113 )
...
* move permute to mixins
* move more stuff
* two more
* fix local mypy
* fix tests
* fix shrink
2025-11-05 14:14:15 -08:00
George Hotz
2d4f01fda0
move mixins to mixin dir ( #13105 )
...
* move mixins to mixin dir
* math
2025-11-05 10:18:33 -08:00
chenyu
52f0081e77
use where instead of mul in Embedding ( #13112 )
2025-11-05 12:49:01 -05:00
b1tg
edc4e1aede
ignore trailing nops in llvm-objdump output ( #13110 )
2025-11-06 01:10:51 +08:00
chenyu
03ee0cfe45
minor fast_idiv cleanup [pr] ( #13109 )
2025-11-05 11:44:36 -05:00
chenyu
18d4ecc1f3
lower nv test_gemm_4096 target ( #13107 )
2025-11-05 11:05:16 -05:00
nimlgen
eff80beeed
amd: props in device not sqtt ( #13106 )
...
* amd: props in device not sqtt
* fix
* f
* fix
* fix
2025-11-05 23:43:20 +08:00
nimlgen
757ceab2a2
system: allow using vidmem for uc mem ( #13104 )
2025-11-05 19:12:59 +08:00
qazal
8119d9f082
sqtt: decode each instruction exec ( #13093 )
...
* sqtt: decode each instruction exec
* start tests
* run_asm
* capture sqtt per kernel
* chaining vgprs
* test things
* inst_execs in viz
* can also configure l and g
* 1l + cleanup
* test_sleep
* test_wmma
* work
* test sleep with llvm builtin
2025-11-05 17:30:27 +08:00
chenyu
54141e9cb9
DISABLE_COMPILER_CACHE=1 in speed_v_theoretical ( #13096 )
2025-11-04 11:28:18 -05:00
chenyu
1c9f720654
remove unused type ignore [pr] ( #13095 )
2025-11-04 10:08:07 -05:00
nimlgen
c857dc5af0
autogen: try/except in try_dlopen ( #13094 )
...
* autogen: try/except in try_dlopen
* ugh
2025-11-04 22:51:53 +08:00
nimlgen
eaf7cbc178
amd: flush sqtt after each kernel ( #13092 )
...
* amd: flush sqtt after each kernel
* merge for rgp
2025-11-04 22:12:48 +08:00
qazal
96417665e8
show sqtt decoder errs in viz ( #13088 )
...
* show sqtt decoder errs in viz
* don't touch roc.py
* give hljs a default language
* work from tinyr9
* work
2025-11-04 22:05:06 +08:00
nimlgen
49191ada77
roc: install sqtt decoder ( #13091 )
...
* roc: install?
* msg
* 0.1.4
2025-11-04 18:56:01 +08:00
nimlgen
16f1f644ba
amd: remove sqtt=2 ( #13090 )
2025-11-04 18:29:24 +08:00
nimlgen
2e97eaa866
roc: no nullptr when no wave instructions ( #13087 )
2025-11-04 17:32:14 +08:00
wozeparrot
9c00c0688a
tk fa: use 16x64 tiles ( #13086 )
2025-11-03 18:25:38 -08:00
wozeparrot
4ed0f216b5
fix: make max_matmul run again ( #13085 )
2025-11-03 18:09:09 -08:00
chenyu
ca17718b6d
remove symbolic_flat ( #13083 )
...
* remove symbolic_flat
some kernels are different but sometimes it's better so not clear, will merge as long as benchmark passes
* test_location
2025-11-03 17:25:21 -05:00
chenyu
fda720e013
simpler _is_balanced [pr] ( #13082 )
...
returns False earlier
2025-11-03 16:47:14 -05:00
chenyu
ddf01fdb15
revert mlperf.yml setting ( #13080 )
2025-11-03 15:24:13 -05:00
qazal
6df34a5887
lint sqtt parser with mypy ( #13079 )
...
* llvm address table errs
* mypy likes annotated dicts
* unwrap nullable
2025-11-04 00:53:59 +08:00
qazal
2d2040bc92
viz: tabulate sqtt ( #13078 )
...
* viz: tabulate sqtt
* nomore asdict
2025-11-04 00:03:15 +08:00
nimlgen
dfde3f54d9
rocprof: use llvm disasm ( #13077 )
...
* rocprof: use llvm disasm
* rm
2025-11-03 23:58:58 +08:00
qazal
27d42fd575
sqtt decoder print behind DEBUG>=5 ( #13076 )
...
* sqtt decoder print behind DEBUG>=5
* gfx version stuff also behind 5
2025-11-03 23:20:03 +08:00
George Hotz
416b15cc59
improve uop matmul syntax ( #13074 )
...
* improve uop matmul syntax
* store takes const
* copy
* cleanups
* faster and simpler
* label them reduce
* better syntax
* touchup
2025-11-03 21:34:26 +08:00
nimlgen
08855c162b
amd: correct sqtt_read for several xccs ( #13075 )
...
* amd: correct sqtt_read for several xccs
* default mask
2025-11-03 19:59:56 +08:00
qazal
1c0d4f1cd2
viz: counters loader ( #12987 )
...
* standalone custom loader
* first iteration on the ui
* work
* add center helper
* add edge offsets
* enumerate all edge types
* try dagre layout algorithm
* simpler spec
* bring back double edges
* more work on edge paths
* aesthetics
* custom edges also works
* dimmer inactive links
* cleanup
* cleanup
* split out the ncu layout
* this is just a k/v map now
* rm that
* more cleanup and comments
* do work
* also this work
* simpler start
* rm that
* sqtt work
* view sqtt
* sqtt
* --custom is just in profile
* wrap c call
* from tinygrad install
* eg. module not found
2025-11-03 19:42:36 +08:00
George Hotz
1e3d6e49a6
index slicing + allclose ( #13071 )
...
* continue work on slicing+allclose
* Revert "Revert "slicing + allclose""
This reverts commit 6c7a12f21c.
* fix tests + better syntax
* forgot an after
* slot is an integer
2025-11-03 13:01:48 +08:00
George Hotz
6c7a12f21c
Revert "slicing + allclose"
...
This reverts commit c9a1e35b1e.
2025-11-03 12:05:44 +08:00
George Hotz
c9a1e35b1e
slicing + allclose
2025-11-03 12:00:45 +08:00
chenyu
a317d6e625
extra/amdpci/setup_python_cap.sh ( #13070 )
2025-11-02 19:19:36 -05:00
chenyu
ad501ce50a
mlperf cron install tqdm ( #13069 )
...
one more...
2025-11-02 18:09:27 -05:00
chenyu
2c8d619147
mlperf cron install influxdb3-python ( #13068 )
2025-11-02 17:55:40 -05:00
chenyu
4c22f089fc
mlperf cron install tensorflow try 2 ( #13067 )
2025-11-02 17:11:01 -05:00
chenyu
c58cf91850
mlperf cron install tensorflow ( #13066 )
2025-11-02 16:48:05 -05:00
chenyu
74db65cf72
update mlperf bert LOGMLPERF ( #13065 )
2025-11-02 15:26:37 -05:00
chenyu
b18293de96
train bert in mlperf cron ( #13064 )
...
more relevant now
2025-11-02 15:04:02 -05:00
nimlgen
be0028d3ce
amd: universal set_grbm ( #13062 )
...
* amd: universal set_grbm
* fix
2025-11-03 03:35:55 +08:00
nimlgen
37a730abce
amd: fix pmc sq gfx11+ ( #13058 )
...
* amd: fix pmc sq gfx11+
* fix
2025-11-02 21:56:47 +08:00
qazal
24054bb655
viz: check overlay width after layout ( #13060 )
2025-11-02 21:47:58 +08:00
George Hotz
962d980919
fuse hasn't worked since rangeify, remove it ( #13057 )
2025-11-02 14:01:52 +08:00
George Hotz
036ee9f84c
Self type + mixins ( #13056 )
...
* use Self type
* mixin
* fix later
2025-11-02 13:30:01 +08:00
George Hotz
8cbef912d2
move reshape to MathTraits ( #13054 )
...
* move reshape to MathTraits
* confirm it works in amd_uop_matmul
2025-11-02 12:56:15 +08:00
George Hotz
1ff341bae5
python 3.11 is now required ( #13055 )
2025-11-02 12:55:40 +08:00
George Hotz
267be7fc5e
fp16 acc
2025-11-02 12:53:04 +08:00
wozeparrot
8206eab4fc
fix: tk fa 4 workers ( #13052 )
2025-11-01 16:41:29 -07:00
Sieds Lykles
885b6dea9e
multiple reduce range arange folding ( #13047 )
...
* multi reduce arange folding
* add test
* cvar to var
* add circular_pad_bw test
2025-11-01 22:11:26 +01:00
Sieds Lykles
f97fb703c8
catch group error in matvec heuristic ( #13051 )
2025-11-01 22:09:35 +01:00
Sieds Lykles
ecb8565f67
Revert "Better cleanup of arange bufferize ( #13046 )" ( #13048 )
...
This reverts commit c99b7dfd4a.
2025-11-01 18:09:37 +01:00
Sieds Lykles
c99b7dfd4a
Better cleanup of arange bufferize ( #13046 )
...
* check for reduce and index instead of cast
* add test
2025-11-01 16:16:31 +01:00
nimlgen
051aab5481
open viz with sqtt flags ( #13001 )
2025-11-01 22:48:17 +08:00
nimlgen
2db57f3a97
amd: better msg when out of perf regs ( #13042 )
2025-11-01 22:47:50 +08:00
chenyu
bebec73471
write custom_sum with set and after ( #13045 )
2025-11-01 10:45:30 -04:00
George Hotz
e98506735b
add CONTRACT support to UOp programs ( #13043 )
...
* add contract support
* use contract
* 342 tflops
2025-11-01 19:11:32 +08:00
George Hotz
65a0a31475
AMD mi350x matmul from stream ( #13040 )
...
* works
* working mfma
* 120 TFLOPS
* regs
* 192 TFLOPS
* try pipelining
* something
* notes
* contract
* linter to 3.11
* that was a bug
2025-11-01 17:55:19 +08:00
chenyu
f396df26ea
test custom sum ( #13039 )
...
* test custom sum
this is higher level than set and after?
* only float
2025-10-31 19:25:56 -04:00
nimlgen
a23226e61e
amd: pmc for gfx9 ( #13036 )
...
* amd: pmc for gfx9
* xcc
* vmid mask
* ugh
* tiny
* minor
* sorryg
2025-11-01 04:26:34 +08:00
nimlgen
f6786c1bfd
autogen: py314 ( #13038 )
...
* autogen: py314
* bump py?
2025-11-01 04:02:19 +08:00
nimlgen
d532117df5
amd: rename set_grbm_se -> set_grbm_se_sh ( #13037 )
2025-11-01 01:37:57 +08:00
nimlgen
a9e5ffd3d1
amd: new pmc src ( #13034 )
2025-11-01 01:33:23 +08:00
Sieds Lykles
3dc593c536
add strip_params to pyrender ( #13021 )
...
* add strip_params to pyrender
* update that one too
* strip_parens fix
* cleaner
* add test
* add some more tests
* cleaner strip_parens
2025-10-31 14:15:56 +01:00
George Hotz
bc178d14a9
matmul example on metal showing off tensor core ( #13033 )
...
* matmul example on metal showing off tensor core
* flip the args of placeholder
* mat_idx
* imp
2025-10-31 19:40:36 +08:00
George Hotz
e066b3176b
hotfix: types and names for custom kernel test
2025-10-31 17:34:55 +08:00
George Hotz
54f48f93c6
working backward pass in custom kernel ( #13032 )
...
* working backward pass in custom kernel
* custom_kernel tensor method
* no SPEC=2
2025-10-31 17:26:18 +08:00
George Hotz
b791d70725
support custom UOp kernels ( #13028 )
...
* support custom UOp kernels
* no number
* multioutput works
* backward kernel runs
* move kernel class
* grad later
* work
* no tags in kernel graph
* test arange
* arange + contig
* delete comment
2025-10-31 15:51:39 +08:00
qazal
9f0c25ec48
viz: use indexing toggle for schedule graph ( #13031 )
2025-10-31 15:32:08 +08:00
George Hotz
b2caf4c2b3
prepare for custom kernel ( #13029 )
2025-10-31 14:47:37 +08:00
qazal
564e9ccc31
fix show indexing toggle default on ( #13030 )
2025-10-31 14:41:15 +08:00
qazal
6cd341354e
viz: add toggle to hide indexing UOps ( #13027 )
...
* start
* pass opts to worker
* works
* rename to showIndexing
* keep toggle through rewrites
* fix nan
* real fix for nan
* move render function
* fix firefox
* fix safari
* more work
2025-10-31 13:21:11 +08:00
George Hotz
b46229ca51
use shrink in amd_matmul_uop ( #13026 )
...
* use shrink in amd_matmul_uop
* colors
2025-10-31 10:43:41 +08:00
wozeparrot
78f7650eec
faster tk matmul ( #13006 )
2025-10-30 19:09:27 -07:00
George Hotz
512513c403
cleanup amd uop matmul ( #13025 )
...
* cleanup amd uop matmul
* remove mod
* move that out
* better variable names
* var names
* more
* render fallback
* colors
2025-10-31 10:04:45 +08:00
chenyu
f6430a0559
add script for one slow openpilot conv ( #12953 )
...
* add script for one slow openpilot conv
* fix ruff
2025-10-30 18:08:41 -04:00
chenyu
73002ebffa
print p.applied_opts with DEBUG >= 3 ( #13024 )
2025-10-30 16:51:21 -04:00
chenyu
99e76f33a0
remove unneeded TYPE_CHECKING [pr] ( #13020 )
2025-10-30 12:01:13 -04:00
nimlgen
629b177b66
amd: sqtt works in profile mode ( #13019 )
2025-10-30 23:48:52 +08:00
Sieds Lykles
4c8362128b
New symbolic renderer + strip parens ( #13017 )
...
* new uop renderer
* better tester
* strip parens
* update tests
* split method check_uop_against_string
* use ctx.update instead of add_rendered method
* strip parens based on precedence
* update test
* new symbolic renderer
* add comment
2025-10-30 16:41:32 +01:00
chenyu
c78dfcc5a1
simplify ProgramSpec __post_init__ STORE/LOAD [pr] ( #13018 )
2025-10-30 11:13:21 -04:00
b1tg
363a201cc6
fp8 amd cstyle ( #12999 )
...
* amd fp8 cstyle
* don't repeat
* space
* lint
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-10-30 10:45:52 -04:00
nimlgen
5be3a93d02
amd: enable pmc on gfx12 ( #13015 )
2025-10-30 22:43:10 +08:00
nimlgen
cf5ab93b8e
amd: pmc grbm block ( #13016 )
2025-10-30 22:42:59 +08:00
nimlgen
4d7a7096c9
am: enable perfmon ( #13013 )
...
* am: enable perfmon
* try
* msg
2025-10-30 22:28:36 +08:00
chenyu
985b6eb95f
use less typing.cast [pr] ( #13002 )
2025-10-30 09:29:52 -04:00
George Hotz
5eb87ab131
hotfix: bump cifar time to 350
2025-10-30 17:29:20 +08:00
George Hotz
4a741e8364
modernize amd uop matmul ( #13011 )
...
* modernize amd uop matmul
* progress
* comment
* more comments
* revert that
* mac cleanups
* fix estimates
* format
2025-10-30 17:02:38 +08:00
qazal
66ea3a0be4
put DEFINE_LOCAL counter in context ( #13008 )
2025-10-30 15:49:26 +08:00
George Hotz
e456f2cb1e
more uop programs ( #13007 )
...
* more uop program
* test_matmul_relu
* tests fix
2025-10-30 14:57:59 +08:00
wozeparrot
c18b283f58
feat: timeout on stuck socket ( #13009 )
2025-10-29 23:11:26 -07:00
wozeparrot
92a87e37e4
fix: fetch_file ( #13010 )
2025-10-29 22:44:22 -07:00
George Hotz
e64d4b3b44
uops programs ( #13005 )
...
* uops programs
* work
* work
* more syntax
* more syntax
* comments
2025-10-30 12:28:10 +08:00
George Hotz
5894df059c
hotfix: prevent inf loop if reduce splits
2025-10-30 11:21:40 +08:00
George Hotz
2da02f1ae1
add loads at the end ( #12988 )
...
* add loads at the end
* simpler
* late load
* tests passing
* fix matvec
* spec test passes
* fix where on load
* fix abs2
* fix more tests
2025-10-30 10:42:19 +08:00
nimlgen
4b001ec723
amd: pmc in mockgpu ( #13000 )
...
* amd: pmc in mockgpu
* fix
* do not open in ci
2025-10-30 01:52:02 +08:00
nimlgen
a6f5b1482e
amd: perf counters ( #12975 )
...
* amd: perf counters
* sq
* cleaner
* fix
* if enabled
* ruff
* mypy
* counters
* reset
* fix
* no cpu
2025-10-30 00:10:31 +08:00
b1tg
457602b350
fix fp8 cast folding ( #12997 )
2025-10-29 09:27:42 -04:00
Sieds Lykles
70bce62c67
dont collapse possibly empty symbolic range ( #12994 )
...
* dont collapse a symbolic range based on min/max
* refactor z3 renderer
* include sink explicitly instead of dtypes.void
* use dtype.scalar()
2025-10-29 12:17:09 +01:00
Sieds Lykles
79903ae2be
refactor z3 renderer ( #12996 )
...
* refactor z3 renderer
* include sink explicitly instead of dtypes.void
* use dtype.scalar()
2025-10-29 12:01:07 +01:00
George Hotz
819592ee67
hotfix: disable DoubleMatmul for PTX
2025-10-29 16:37:17 +08:00
George Hotz
30ca3f2af8
all double matmul ( #12993 )
...
* fix more double matmuls
* a few more
* all double matmul passes
* opts for flash attention
* fix spec
* comment
2025-10-29 16:25:27 +08:00
Sieds Lykles
9f39f6391c
shared_codegen_spec and fix index spec ( #12967 )
...
* split shared_codegen_spec and fix index
* add VCONST to program_spec and move index to shared_codegen_spec
* working ignore_oob=0
* cleanup
* fix spec
* undo that
* move barrier and special earlier
* fix more spec issues
* more updates
* remove special from program_spec
* cleanup and fixes
* move more to shared
* special is not in shared_spec
* some comments
* dont do bounds check there
2025-10-29 09:14:11 +01:00
George Hotz
1c362736aa
fix more double matmuls ( #12991 )
...
* fix more double matmuls
* a few more
2025-10-29 16:09:48 +08:00
George Hotz
e42b4edf8c
remove if stuff ( #12992 )
2025-10-29 15:29:35 +08:00
George Hotz
8c47cf4323
pcontig double matmul works ( #12899 )
...
* pcontig double matmul works
* tests
* contract
* closer
* works-ish
* add that broadcast
* 2 more work
* something
* disable broken ones
* llvm
* align 16
2025-10-29 13:06:43 +08:00
George Hotz
35b6f4148d
delete untested quantize ( #12990 )
2025-10-29 12:46:32 +08:00
Sieds Lykles
5ce8a1d2f2
Merge adjacent try all permutations for reduce ( #12972 )
2025-10-29 05:04:54 +01:00
George Hotz
b147e7e8e6
flatten bufferize ( #12984 )
...
* flatten bufferize
* simpler
* tests pass
* flat
* not flat
2025-10-29 11:23:43 +08:00
qazal
a7dac11aad
viz: keep rewrite step in back button history ( #12986 )
2025-10-29 11:09:43 +08:00
qazal
37967fa17b
viz: add integer query param helper and more typing ( #12985 )
...
* viz: query param helper
* json.dumps once
2025-10-29 10:44:01 +08:00
chenyu
fb53bdad5d
unused propagate_invalid rules [pr] ( #12983 )
...
named is not used, so you know it never matched
2025-10-28 22:16:50 -04:00
chenyu
ef16e6c68c
unwrap instead of cast [pr] ( #12982 )
2025-10-28 21:29:23 -04:00
chenyu
f55fcfecf9
ProgramSpec uops must end with SINK [pr] ( #12981 )
2025-10-28 17:12:22 -04:00
chenyu
9442442cb1
update variable names in search [pr] ( #12979 )
...
no lin nor linearize
2025-10-28 15:37:52 -04:00
wozeparrot
d66c997a39
feat: thunderkittens fa2 ( #12955 )
2025-10-28 11:27:45 -07:00
b1tg
bb307b9e81
fix fp8 vectorization ( #12977 )
...
* fix fp8 vectorization
* add fp8 tc to benchmark
2025-10-28 13:55:30 -04:00
nimlgen
c11dd56956
amd: cleanup import urls ( #12976 )
2025-10-29 00:43:02 +08:00
George Hotz
5e01cc299b
zero len ranges fail ( #12974 )
...
* zero len ranges fail
* fix Python backend
* fix llvm
* fix ptx
* yolo fix nir
* this works...
* always store...
* always store...
* Revert "always store..."
This reverts commit 0816cf344d.
2025-10-28 22:49:55 +08:00
George Hotz
e936aa7974
cleanups from if range branch ( #12973 )
2025-10-28 20:58:47 +08:00
qazal
901d27b3ba
viz: optional text dims try 2 ( #12971 )
2025-10-28 18:54:28 +08:00
George Hotz
f5a3b33d33
add fun with nhwc convs
2025-10-28 17:12:22 +08:00
George Hotz
907499b02c
clean up GROUP/SINK ( #12969 )
...
* clean up GROUP/SINK
* fix end
* range_str color
2025-10-28 16:08:10 +08:00
Sieds Lykles
e22c5e7e73
process_replay uses opts argument for KernelInfo.opts_to_apply ( #12946 )
...
* opts_to_apply is opts
* skip beamed kernels
* simpler change
* fix the tensor cores tests for process replay
* use opts
2025-10-28 09:00:28 +01:00
George Hotz
6c9560a846
more syntactic sugar for pyrender ( #12968 )
2025-10-28 15:24:33 +08:00
George Hotz
b0da173f2f
add unique to const, fix longstanding bug ( #12965 )
...
* add unique to const, fix longstanding bug
* _force_unique=True
* fix tests
* fix more tests
2025-10-28 15:11:37 +08:00
Sieds Lykles
e110f4632a
split cat (on cpu) ( #12864 )
...
* split ranges but only on cpu
* except KernelOptError for threads
* use GROUP and END
* no more flatten_range needed
* remove noop end
* always process replay for openpilot
* update test
* skip test
* fix in outs calculation
With the new linearizer the toposort is a problem, this matches the spec now
* undo that
2025-10-28 07:55:19 +01:00
qazal
3b82dee625
viz: match DEBUG=2 for exec item metadata ( #12966 )
...
* viz: match DEBUG=2 for exec item metadata
* remove repr from kernel
2025-10-28 14:53:57 +08:00
qazal
99589dea81
move viz edge tagging to UOp graph ( #12964 )
2025-10-28 12:46:23 +08:00
George Hotz
bbe0bebbf3
no range tags in kernels ( #12962 )
2025-10-28 12:33:48 +08:00
George Hotz
39c2117dea
cleanup pyrender ( #12961 )
2025-10-28 10:47:39 +08:00
George Hotz
2832954bcb
test with IGNORE_OOB=0 ( #12960 )
2025-10-28 10:32:19 +08:00
George Hotz
7784cec48e
pytest-split on spec ( #12959 )
2025-10-28 10:09:01 +08:00
George Hotz
4d817a289e
simplify spec ( #12958 )
...
* simplify spec
* more
2025-10-28 09:52:32 +08:00
George Hotz
62e62d8760
move verify to spec / cleanup ( #12956 )
...
* move verify to spec / cleanup
* lil
* more explicit
2025-10-28 08:58:10 +08:00
wozeparrot
24884c6768
fix: don't use KITTENS_HOPPER for 4090 ( #12954 )
2025-10-27 17:19:53 -07:00
nimlgen
372d9e5753
hcq: helper for visible devices ( #12950 )
...
* hcq: helper for visible devices
* fix
* f
2025-10-28 02:27:56 +08:00
Justin Erenkrantz
f2ffe9c8cf
Apply an override for nbio 7.3.0 to 7.2.0. ( #12949 )
2025-10-27 11:10:10 -07:00
qazal
63484d837e
Revert "viz graph drawing cleanups ( #12933 )" ( #12947 )
...
This reverts commit 189582db5e.
2025-10-28 00:39:37 +08:00
chenyu
a79832b01f
control_flow.py -> linearizer.py [pr] ( #12948 )
2025-10-27 12:38:13 -04:00
b1tg
45e2f916a3
add quantize fp8 in llama3 ( #12893 )
...
* add quantize fp8 in llama3
* don't truncate fp8 alu result
* cast to float32 before matmul
* --model weights/LLaMA-3/8B-SF-DPO/
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-10-27 10:22:57 -04:00
George Hotz
25c2da1579
check SPEC=2 in CI ( #12945 )
...
* check SPEC=2 in CI
* split SPEC=2
* fast enough
2025-10-27 21:53:57 +08:00
Sieds Lykles
072f7c35c5
fix in/outs calculation in ProgramSpec ( #12937 )
...
With the new linearizer the toposort is a problem, this matches the spec now
2025-10-27 12:31:41 +01:00
qazal
e93c9bf6a7
viz: extend main code block to full height ( #12944 )
2025-10-27 18:43:49 +08:00
George Hotz
273b1f914d
new pyrender, tested with SPEC=2 ( #12934 )
...
* pyrender always works with SPEC=3
* test pyrender
* work
* work
* work
* .sintify
* v const
* kernelize
* pyrender
* viz always
* optional forced_reshape
* cleanups
2025-10-27 18:41:51 +08:00
George Hotz
701a632907
move VECTORIZE/CONST ( #12942 )
2025-10-27 17:37:13 +08:00
nimlgen
95748a4518
nv: map vram after resets ( #12938 )
2025-10-27 17:17:07 +08:00
George Hotz
8fb545c475
don't late simplify on marg ( #12941 )
2025-10-27 17:07:41 +08:00
George Hotz
7139e036c5
bugfixes from pyrender ( #12940 )
2025-10-27 16:56:53 +08:00
George Hotz
804133cffd
rename RECIP to RECIPROCAL ( #12939 )
2025-10-27 16:53:13 +08:00
nimlgen
f4da94af28
system: reset is a method of pcidevice ( #12936 )
2025-10-27 16:21:10 +08:00
wozeparrot
6b54378eba
working kitten matmul ( #12935 )
2025-10-26 23:40:49 -07:00
qazal
189582db5e
viz graph drawing cleanups ( #12933 )
...
* viz: make node label dims optional
* inplace edge updates
* change that
2025-10-27 13:59:32 +08:00
qazal
70ba84eb04
viz: generic node label centering ( #12925 )
...
* viz: correct node label centering
* matches
* overlay
* the other way
2025-10-27 12:02:34 +08:00
Sieds Lykles
eaeaea2f9c
pyrender Ops.SPECIAL and use correct dtype for Ops.RANGE rendering ( #12931 )
2025-10-27 03:21:34 +01:00
nimlgen
8c1368cab6
system: class PCIBarInfo ( #12930 )
...
* system: class PCIBarInfo
* fix
2025-10-27 03:57:42 +08:00
nimlgen
f00009c731
hcq: drivers take pcidev ( #12929 )
...
* hcq: drivers take pcidev
* fix nv
2025-10-26 20:43:51 +08:00
ttomsa
99a519f068
linearizer cleanup ( #12923 )
...
* cleanup
* comments
* also this
2025-10-26 18:30:12 +08:00
George Hotz
c0c24d3a70
cleanup wmma ( #12927 )
...
* cleanup wmma
* fix test_ops failures on android
2025-10-26 18:26:47 +08:00
George Hotz
0a32ab0006
nitpicks from typecheckers ( #12926 )
...
* nitpicks from the typechecker
* more
2025-10-26 17:52:55 +08:00
George Hotz
db5c918215
source extra/cl_android.sh to fix opencl on android
2025-10-26 15:27:51 +08:00
qazal
c94e597b3e
viz ui selector cleanups ( #12924 )
2025-10-26 14:40:47 +08:00
chenyu
94701d4838
clean up divide_exact order [pr] ( #12919 )
...
do the const first since ADD can also call into that
2025-10-25 18:47:57 -04:00
chenyu
e18922f111
limit AND const min max to ints [pr] ( #12918 )
2025-10-25 16:07:52 -04:00
nimlgen
92324172be
amd: refactor usb into usbdevice ( #12916 )
...
* amd: refactor usb into usbdevice
* nu
* my bad
* ops
* my bad
2025-10-26 01:00:19 +08:00
qazal
3b192f5eac
split viz graph rendering from dag layout ( #12914 )
2025-10-25 15:36:44 +08:00
George Hotz
6415e3e8a7
use Ops.GROUP instead of Ops.NOOP for merging stores ( #12912 )
...
* use Ops.GROUP instead of Ops.NOOP for merging stores
* fs noop
2025-10-25 12:26:12 +08:00
George Hotz
b4f6a2c7a3
add kernel spec ( #12911 )
...
* add kernel spec
* fix kernel spec
2025-10-25 11:49:20 +08:00
George Hotz
8a941d95a4
SPEC=2 is full spec, SPEC=1 is default ( #12910 )
...
* SPEC=1 passes all tests
* just use SPEC, not __debug__
2025-10-25 11:10:43 +08:00
wozeparrot
456560c1ff
stateless tinyfs copyin ( #12908 )
2025-10-24 19:18:38 -07:00
wozeparrot
a5b0f57067
clean: cleanup tinyfs copyout ( #12907 )
2025-10-24 18:32:55 -07:00
chenyu
4b7329001d
clean up test_avg_pool3d ( #12905 )
2025-10-24 14:31:36 -04:00
George Hotz
6b35467f53
stores don't end ranges ( #12902 )
...
* early endrange
* bugfixes
2025-10-24 23:05:03 +08:00
nimlgen
5b5ba31a86
amd: make sqtt bufs uc ( #12898 )
2025-10-24 18:55:14 +08:00
Sieds Lykles
e1f8c82938
Onnx Layer/Group/RMS/Batch-Norm ReduceL2 fp32 intermediates for fp16 ( #12109 )
...
* match onnx spec
* use least_upper_dtype
* promote the square
* just cast before the square
2025-10-24 12:26:11 +02:00
George Hotz
0bde87d8d7
cleanups from flash attention branch ( #12897 )
2025-10-24 14:14:56 +08:00
wozeparrot
9dac505565
variable bs keccak ( #10731 )
2025-10-23 14:10:21 -07:00
chenyu
154b4f9f40
test FUSE_OPTIM=1 test/test_optim.py ( #12895 )
2025-10-23 15:54:27 -04:00
chenyu
6e4ee8deea
small heuristic cleanup [pr] ( #12892 )
2025-10-23 10:50:15 -04:00
nimlgen
f835566e27
sqtt: correct header ( #12891 )
...
* sqtt: correct header
* f
2025-10-23 22:37:17 +08:00
Sieds Lykles
c1db62ff7c
move reduce collapse to rangeify ( #12845 )
2025-10-23 15:44:17 +02:00
Sieds Lykles
04b3e51f1b
remove old reduce collapse rule ( #12889 )
...
* comment this out
* remove
2025-10-23 13:51:49 +02:00
qazal
cdfb8e31ae
hotfix: correct viz rewrite step counter reset ( #12890 )
2025-10-23 19:47:16 +08:00
George Hotz
6df19a4ac6
lil qol improvements to viz ( #12887 )
2025-10-23 18:41:07 +08:00
George Hotz
ff68a6263b
move locals into codegen (dedup works) ( #12885 )
...
* move locals into codegen (dedup works)
* move in optimize
2025-10-23 17:07:39 +08:00
George Hotz
ddb53d1d48
PCONTIG=3 both saves ram and flops ( #12884 )
...
* PCONTIG=3 both saves ram and flops
* group
* gate locals
* should be correct
2025-10-23 16:37:26 +08:00
qazal
2a5c22436e
remove outdated docs ( #12881 )
2025-10-23 12:52:36 +08:00
qazal
bcc30e5e10
viz: add linearized UOp list view ( #12883 )
...
* viz: add linearized UOp list view
* lang
2025-10-23 12:52:14 +08:00
George Hotz
e85cee0aad
flip Ops.END srcs ( #12882 )
...
* flip Ops.END srcs
* backward
* late end split
2025-10-23 12:47:50 +08:00
George Hotz
74b4cfe44b
Ops.GROUP + range check ( #12880 )
...
* simpler
* fix that
* Ops.GROUP + range check
* fix bugs
* fix linter
* fix test
2025-10-23 12:05:21 +08:00
Sieds Lykles
914defd55d
give endrange priority ( #12870 )
...
* uncomment line
* try giving endrange priority
2025-10-23 05:19:13 +02:00
qazal
2f95c10702
remu new instructions / use volatile in emulator tests ( #12862 )
...
* remu new instructions
* start moving to volatile
* test_simple works
* test_exec_mov works and lid is still here
* test_exec_cmp_vopc
* clang did s_mov_b32 exec_lo, 1
* don't hardcode v1
* support volatile in tests
* hw_test passes
* only the volatile version
* subrev saturating behavior
2025-10-23 11:13:43 +08:00
George Hotz
e718254004
simpler end ( #12879 )
...
* simpler
* fix that
2025-10-23 10:35:58 +08:00
wozeparrot
6e00dec95d
feat: pin openpilot 0.10.1 models ( #12878 )
2025-10-22 14:57:54 -07:00
wozeparrot
3a9aa05359
feat: extra nvcc options ( #12876 )
2025-10-22 13:21:11 -07:00
chenyu
f0831c8c30
add 0.10.0 to comma benchmark ( #12875 )
...
* add 0.10.0 to comma benchmark
disabled the 0.10.1 ones which are pinned to master. it does not work because benchmark uses the cached old version
* that's pinned
2025-10-22 15:18:21 -04:00
nimlgen
e7e535cd53
amd: sqtt for gfx9 ( #12844 )
...
* amd: start sqtt for gfx9
* writes something, but sometimes zeroes
* HEADER!
* w
* tiny
* mypy
2025-10-23 02:31:07 +08:00
b1tg
81108f91ee
amd tc: 16x16x32 ( #12874 )
...
* amd tc: 16x16x32
* test
* clean, test amd_cdna4
2025-10-22 13:48:01 -04:00
George Hotz
bf173c0a37
we don't support multi end yet ( #12869 )
2025-10-22 23:43:32 +08:00
nimlgen
a7bc0104c2
amd: clean up sqtt_stop ( #12872 )
2025-10-22 22:17:03 +08:00
nimlgen
b6eb9172ea
amd: fix ip offsets ( #12867 )
2025-10-22 20:50:18 +08:00
George Hotz
174811fc0f
hotfix: slightly looser load spec for AMD bfloat16
2025-10-22 19:55:59 +08:00
George Hotz
7762b3558b
clean up the spec ( #12868 )
...
* tighten up the spec
* move validate into a different file
* that moved to validate
* after(barr)
2025-10-22 19:50:42 +08:00
George Hotz
726988fa4b
late ifs try 2 ( #12865 )
...
* late ifs try 2
* fix image
* fix that test
* panic
* ptx fixups
* preserve toposort
* those pass locally
* Revert "those pass locally"
This reverts commit 063409f828.
* no ls
* make that explicit
2025-10-22 18:49:27 +08:00
George Hotz
6abe90fb7c
fix linearizer non-determinism ( #12866 )
2025-10-22 17:51:35 +08:00
qazal
cebc2b5721
cleanup viz profiler metadata ui ( #12860 )
...
* cleanup viz profiler metadata ui
* text
* select over .args
* space
2025-10-22 17:31:12 +08:00
Sieds Lykles
8d0256c46b
Move gate to load for loaded index ( #12861 )
...
* change condition
* change test to better represent how the uop looks irl
2025-10-22 09:53:07 +02:00
chenyu
6d86e962c7
update ASSERT_MIN_STEP_TIME ( #12857 )
...
0.10.1 driving_policy is good now, still need driving_vision and dmonitoring to be fast
2025-10-21 22:46:07 -04:00
George Hotz
92778c7a8b
rename opts to ren, add store ranges back ( #12856 )
...
* rename opts to ren
* fix docs and bring store back
2025-10-22 09:15:38 +08:00
chenyu
c5cee74706
remove BLOCK_REORDER ( #12854 )
...
not used
2025-10-21 19:10:14 -04:00
chenyu
0b673eddec
simpler newton_schulz transpose ( #12853 )
2025-10-21 17:21:45 -04:00
b1tg
60d7e232f2
cuda fp8 ( #12782 )
...
* cuda fp8
* tensor core
* tc test
* clean
* clean pm
2025-10-21 15:05:25 -04:00
Harald Schäfer
587ccc0e5c
compile3: make selftests opt-in ( #12851 )
2025-10-21 11:32:27 -07:00
wozeparrot
c3149c618a
feat: nvcc compiler ( #12852 )
2025-10-21 11:31:23 -07:00
chenyu
8baa61bd67
use torch 2.9 and its Muon in test ( #12773 )
...
* use torch 2.9 and its Muon in test
* relax and disable
2025-10-21 13:35:17 -04:00
chenyu
f51f9aaa16
muon ns_params -> ns_coefficients ( #12850 )
...
match the official torch one
2025-10-21 12:35:52 -04:00
wozeparrot
62e7b8b870
feat: just use compile3 ( #12849 )
2025-10-21 07:56:50 -07:00
nimlgen
c7336c3e31
amd: sqtt for aql ( #12846 )
2025-10-21 22:35:01 +08:00
George Hotz
8960ac54f3
remove RewriteStep premature optimization ( #12840 )
...
* remove RewriteStep premature optimization
* fix ebs
* core line count
2025-10-21 21:45:20 +08:00
Sieds Lykles
7f798a9630
Cleanup const buffers ( #12829 )
...
* split pm_cleanups
* update test_schedule
* shrink when we remove bufferize
* dont do shrink if shape is empty
* update tests
* remove *1 from metadata
* deal with the noop bufferize
* only noop on cvar
* cleanup
* fix if
* rename
2025-10-21 14:53:49 +02:00
nimlgen
1ad6598963
amd: trace all instructions ( #12831 )
2025-10-21 20:52:24 +08:00
Christopher Milan
cdc72556a1
no more brew ( #12839 )
2025-10-21 08:12:46 -04:00
George Hotz
20a232f1c5
bugfixes from multioutput + PCONTIG=3 for fa bw memory fix ( #12837 )
...
* bugfixes from multioutput
* PCONTIG=3 fixes fa memory usage
* that's base
2025-10-21 19:21:02 +08:00
qazal
0435d31f1c
viz: generic back button functionality ( #12838 )
2025-10-21 18:52:00 +08:00
George Hotz
7d9551ce2e
move to late/control_flow.py ( #12835 )
2025-10-21 18:15:06 +08:00
George Hotz
d711a4b933
delete old linearizer ( #12834 )
...
* new linearizer with early endrange
* cleanups
* second stage removal
* not store
* do that later
* end cleanup
* fix globals
* end
* multi end
* fix ends earlier
* work
* do_merge_ends
* mini change
* range_gate
* fix cpu
* test fixups
* ranges on index
* not for ptx
* delete linearizer
* remove more junk
* delete that test
* we insert endif
* all ends
2025-10-21 17:52:18 +08:00
qazal
40633ab34d
list buffer args to kernel in profiler ( #12826 )
...
* list buffer args to kernel in profiler
* stable order
* back button works
* deselect also works
2025-10-21 17:51:36 +08:00
George Hotz
c780cd9abb
new linearizer with early endrange ( #12823 )
...
* new linearizer with early endrange
* cleanups
* second stage removal
* not store
* do that later
* end cleanup
* fix globals
* end
* multi end
* fix ends earlier
* work
* do_merge_ends
* mini change
* range_gate
* fix cpu
* test fixups
* ranges on index
* not for ptx
2025-10-21 17:37:48 +08:00
George Hotz
d59d4cdbe4
lil less is okay
2025-10-21 17:09:44 +08:00
qazal
32af1ff84b
viz graph drawing small cleanups ( #12830 )
...
* viz graph drawing small cleanups
* str literal
2025-10-21 15:51:32 +08:00
Sieds Lykles
367fbabc30
remove Ops.SUBSTITUTE ( #12827 )
...
* remove Ops.SUBSTITUTE
* remove from viz
2025-10-21 08:19:42 +02:00
qazal
57f6b6f229
style view codegen like a link in profiler ( #12825 )
2025-10-21 12:15:13 +08:00
qazal
154cdfe46d
viz state cleanups ( #12821 )
...
* viz state cleanups
* more generic
2025-10-21 11:44:51 +08:00
George Hotz
a71a41f6d1
rename Ops.ENDRANGE -> Ops.END ( #12824 )
2025-10-21 11:32:18 +08:00
qazal
8521fd5263
viz: hierarchical rewrites ( #12805 )
...
* viz: hierarchical rewrites
* count of subrewrites
* arrows
* better keyboard things
* add select and deselect utils
* works
* diff
* event stopPropagation
* work
* don't change the rewrite
* walk tree back
2025-10-21 10:55:41 +08:00
George Hotz
df2f8b9295
use after on locals ( #12815 )
...
* use after on locals
* fix estimates
* too much compute
* correct for both ptx and normal
* err, that
* tighter spec
* keep that
2025-10-21 10:29:12 +08:00
Christopher Milan
68c045bf0a
NIR: Check for brew packages tinymesa and tinymesa_cpu ( #12739 )
...
* brew install tinymesa_cpu
* brew --prefix tinygrad_cpu too
* fix brew paths
* check both brew paths
* better errors
* handle failure
2025-10-21 09:38:43 +08:00
wozeparrot
990e8b97ee
feat: log openpilot 0.10.1 times ( #12816 )
2025-10-20 18:30:34 -07:00
George Hotz
565a7a6218
num_batches_tracked has shape () ( #12820 )
2025-10-21 09:22:39 +08:00
George Hotz
25beea5769
hotfix: suppress_finalizing on device __del__
2025-10-21 09:04:36 +08:00
chenyu
c7c59e6dd7
unused UPat.or_broadcasted and GroupOp.Block [pr] ( #12819 )
2025-10-20 12:24:58 -04:00
nimlgen
e284f6325a
llvm: fix compile key for different processors ( #12812 )
2025-10-20 19:46:48 +08:00
George Hotz
203a93363c
Revert "after clean up of locals ( #12813 )" ( #12814 )
...
This reverts commit 5d0d3d7aac.
2025-10-20 19:33:35 +08:00
George Hotz
5d0d3d7aac
after clean up of locals ( #12813 )
2025-10-20 19:24:24 +08:00
George Hotz
d1e2c393f8
after in sym, axis_letters in range ( #12811 )
...
* after in sym, axis_letters in range
* this is better
* this work?
2025-10-20 18:54:37 +08:00
Sieds Lykles
a8e4614436
remove REAL_SUBSTITUTE=0 and make it fast ( #12809 )
...
* fast REAL_substitute
* remove REAL_SUBSTITUTE=0
2025-10-20 12:44:20 +02:00
Sieds Lykles
1e93d19ee3
stable diffusion --fakeweights ( #12810 )
2025-10-20 12:41:06 +02:00
nimlgen
b5e36e3c6c
nv: check if jitlink is avail ( #12808 )
...
* nv: check if jitlink is avail
* why
* fix
* fix
2025-10-20 18:13:16 +08:00
George Hotz
b8a9cce783
replace NOOP with AFTER in reg init ( #12804 )
...
* after op
* fix tests
* replace NOOP with AFTER in reg init
* closer
* or_after there
* fix device
* fix all renderers
* better spec for after
2025-10-20 15:34:32 +08:00
qazal
12fd2c9c7b
explicitly set ignore_indexing for schedule only ( #12803 )
2025-10-20 13:11:57 +08:00
qazal
734c99f722
viz: show indexing rewrites during run_rangeify ( #12802 )
...
* viz: show indexing rewrites during run_rangeify
* sinking index
2025-10-20 12:37:03 +08:00
George Hotz
2e9082e0bc
after op ( #12801 )
...
* after op
* fix tests
2025-10-20 12:27:56 +08:00
qazal
339e6edb7d
viz: ui prereqs for hierarchical rewrites ( #12799 )
2025-10-20 12:15:15 +08:00
wozeparrot
357dac8425
feat: allow tuple indexing on uops ( #12797 )
2025-10-19 19:11:05 -07:00
George Hotz
ba593f7b98
don't render index ( #12796 )
...
* don't render index
* update to ignore_indexing
---------
Co-authored-by: qazal <qazal.software@gmail.com>
2025-10-20 09:48:36 +08:00
George Hotz
cad3ada909
tinygpu: build with SIP off works
2025-10-20 09:11:09 +08:00
nimlgen
9cd35deae7
amd: fix alignment + pointers for aql over usb ( #12793 )
2025-10-19 23:55:57 +08:00
nimlgen
59784a5972
amd: ensure ts is written ( #12794 )
2025-10-19 23:55:49 +08:00
chenyu
63a23dfe80
test step 0 in TestTrainingOnnxOps ( #12790 )
...
and tighter rtol
2025-10-19 09:15:49 -04:00
chenyu
e8158afd4b
update test_qlinear_add_round_half_to_even ( #12789 )
...
this does not pass locally
2025-10-19 08:47:27 -04:00
Sieds Lykles
1df9c7d7e7
reduce_collapse uses symbolic_flat ( #12766 )
...
* sym->symbolic_flat
* cast invalid drops invalid
2025-10-19 12:27:47 +02:00
Sieds Lykles
fd6ef4801c
rangeify uses symbolic_flat ( #12786 )
...
* symbolic_simple -> symbolic_flat
* remove expected failures
2025-10-19 12:27:14 +02:00
George Hotz
89e7f2fa00
mmapeak: gfx1103 support
2025-10-19 16:57:28 +08:00
George Hotz
617614beb7
add mi350x support to mmapeak ( #12784 )
2025-10-19 16:11:07 +08:00
qazal
c8ef4b60f6
viz: share match tracing and TINY device profiler ( #12783 )
...
* set a default name for the traces
* set profile_matches + renames
* profile_matches test
* traces 4 steps total
2025-10-19 14:30:07 +08:00
chenyu
350a4754a9
Update openpilot models ( #12780 )
...
* Update openpilot models
* Update slower model
* fix that
---------
Co-authored-by: Bruce Wayne <harald.the.engineer@gmail.com>
2025-10-18 20:32:35 -04:00
chenyu
30ff84d050
update test_conv2d_ceildiv_edge_case ( #12779 )
2025-10-18 16:43:32 -04:00
nimlgen
442218266d
qcom: fix profiler ( #12778 )
...
* qcom: fix profiler
* this way
2025-10-19 01:27:59 +08:00
Harald Schäfer
addc54b96c
Simplify openpilot compile3.py ( #12748 )
...
* Simpler compile3
* tests
* remove default args
* onnx file is still fp16
* self-test FP16 too
* allow test disable
* absurd tolerance
* Just do latest
* Try simplest
* use later models
* kernel count not relevant if speed is good
* dead imports
* Revert "dead imports"
This reverts commit f68c2cd15d.
* Revert "kernel count not relevant if speed is good"
This reverts commit 0955ca4ee0.
* add back kernel count check on latest model
2025-10-18 10:12:22 -04:00
nimlgen
037f6e8fa0
qcom: ioctl for 7xx ( #12777 )
2025-10-18 20:33:14 +08:00
wozeparrot
82f10cfe2e
feat: assert on bufferview math ( #12772 )
2025-10-17 14:20:08 -07:00
chenyu
fcdf4ab37e
remove a contiguous in LARS ( #12770 )
2025-10-17 17:07:30 -04:00
nimlgen
910d698b78
system: cleanup page sizes ( #12771 )
...
* system: cleanup page sizes
* ooops
2025-10-18 02:06:42 +08:00
George Hotz
062a6d68d7
test flash attention backward ( #12762 )
...
* test flash attention backward
* TODO: fix pcontig
* end ranges
* render colors
* very big
* multiout at every level
* reset ending ranges
* fix tests
* ugh
2025-10-17 23:15:59 +08:00
George Hotz
33025b99f6
small changes from fa backward ( #12769 )
2025-10-17 22:41:18 +08:00
chenyu
e0d0d4372d
fix shape of m and v in onnx Adam with FUSE_OPTIM ( #12768 )
...
value is still slightly off but that's not onnx specific
2025-10-17 10:32:41 -04:00
qazal
bd662bea67
viz: light up program runs ( #12764 )
...
* basics work
* fix the color
* light up program events
* swap a with p
* better
2025-10-17 19:33:18 +08:00
George Hotz
c9a3464f76
those decimals never mattered ( #12760 )
...
* those decimals never mattered
* this
* improve debug
* real substitute fixes pcontig
* locals are different buffers
2025-10-17 17:16:24 +08:00
qazal
0160f034d6
viz: show display name for copy runners ( #12761 )
...
* viz: show display name for copy runners
* more u32
2025-10-17 16:59:51 +08:00
qazal
253d32b065
viz: add metadata to buffer user list ( #12758 )
...
* simple failing test
* encodings
* test passing
* key is deduped
2025-10-17 16:28:54 +08:00
George Hotz
935a60db72
bring back partial contig and flash attention ( #12756 )
...
* bring back partial contig and flash attention
* why not 2
* work
* that
* fix pcontig
2025-10-17 16:19:05 +08:00
Sieds Lykles
f6bc620169
UOp.prod and UOp.sum methods ( #12755 )
2025-10-17 10:02:01 +02:00
Sieds Lykles
d1bb5c0426
slightly flatter symbolic ( #12757 )
2025-10-17 09:58:45 +02:00
qazal
5417e4b099
viz helper cleanups ( #12754 )
2025-10-17 15:20:24 +08:00
qazal
3196a7aae3
viz: pre reqs for lighting up programs ( #12753 )
2025-10-17 15:03:21 +08:00
qazal
dfb8f9fc9e
viz: annotate buffer mutability in the memory graph ( #12750 )
2025-10-17 11:53:02 +08:00
Sieds Lykles
79c2f1ae26
remove reduce_rangless and replace with reduce_unparented ( #12749 )
2025-10-17 04:46:05 +02:00
chenyu
9561803cb0
fix assert in test_schedule ( #12745 )
...
* fix assert in test_schedule
updated kernel counts and some old tests
* fix
2025-10-16 15:39:50 -04:00
chenyu
285534ce64
delete DONT_REALIZE_EXPAND and DONT_GROUP_REDUCES ( #12744 )
...
does nothing now
2025-10-16 14:11:33 -04:00
chenyu
98239f1156
few shapetracker cleanups ( #12741 )
2025-10-16 12:43:27 -04:00
chenyu
53478c741d
relax ASSERT_MIN_STEP_TIME for space lab policy ( #12742 )
2025-10-16 11:40:36 -04:00
geohotstan
5d209ee7ec
onnx helper intermediate node output validation ( #12740 )
...
* start
* update comments
* good
* add comments and better printing
* done
2025-10-16 11:17:47 -04:00
Christopher Milan
bce2bc0465
Revert "use RTLD_GLOBAL on macos" ( #12738 )
...
This reverts commit 89fe3e574d.
2025-10-16 10:07:21 -04:00
chenyu
f34f26bca0
fix gpt2 with benchmark ( #12736 )
...
`CPU=1 python3 examples/gpt2.py --benchmark 128` works now
2025-10-16 09:55:20 -04:00
Sieds Lykles
55db1b0e0e
reduce where that is cut from two sides ( #12733 )
...
* better rule
* correct pattern
* shorten line
2025-10-16 15:25:15 +02:00
nimlgen
cf9baeea61
Revert "nv: check if jitlink is avail ( #12731 )" ( #12735 )
...
This reverts commit a069a45d14.
2025-10-16 20:41:49 +08:00
George Hotz
8be7844b2e
use apply uop for assign to fix assign metadata ( #12732 )
...
* use apply uop for assign
* fix metadata for assign
* fix backward metadata
* those aren't real tests
2025-10-16 20:34:12 +08:00
nimlgen
3aa2277b8f
nv: usb4 ( #12696 )
...
* hackish
* prog
* match
* l
* simpler
* refactor
* not osx
* apple things
* tiny changes
* fix mask
* match fix
* nn
2025-10-16 20:11:19 +08:00
nimlgen
a069a45d14
nv: check if jitlink is avail ( #12731 )
2025-10-16 19:58:50 +08:00
George Hotz
a498ec9c18
cleanup names of postrange + fast FUSE_OPTIM ( #12730 )
...
* cleanup names of postrange
* make FUSE_OPTIM not slow
* delete junk in def r
2025-10-16 19:38:31 +08:00
Sieds Lykles
8f740e07ff
no broadcasting/vectors in reduce collapse ( #12729 )
2025-10-16 13:22:57 +02:00
qazal
533f18b22c
viz: add trace data for inflight buffers ( #12728 )
...
* viz: add trace data for inflight buffers
* add test_inflight_buf
* temp stores the keys
* update tests / use Tensor.ones
2025-10-16 19:15:03 +08:00
George Hotz
af4479c169
faster stable diffusion load ( #12725 )
...
* faster stable diffusion load
* failing tests
2025-10-16 18:31:59 +08:00
nimlgen
e7c057d5dc
system: alloc_sysmem return view ( #12724 )
...
* system: alloc_sysmem return view
* e
2025-10-16 17:55:01 +08:00
nimlgen
b86a33a312
ptx: support bw ( #12722 )
2025-10-16 15:38:08 +08:00
nimlgen
b8cd66c7a2
nv: support all gb20x and small bar ( #12721 )
2025-10-16 15:37:54 +08:00
George Hotz
1d1e1d9d88
delete the ShapeTracker ( #12720 )
...
* delete the ShapeTracker
* fix tests
* fix more
* fix gc test
2025-10-16 15:36:22 +08:00
George Hotz
592e86f6f5
remove UOp.st ( #12716 )
...
* remove UOp.st
* fix tests
* torch backend disable
2025-10-16 14:44:09 +08:00
wozeparrot
cc2dfe22f5
tinyfs: fetch file utility ( #12719 )
2025-10-15 23:38:56 -07:00
nimlgen
3ed543f956
system: reorder funcs + barrier on macos ( #12714 )
2025-10-16 14:38:01 +08:00
qazal
b77bdbbc62
viz: count unpickle in server startup time ( #12715 )
...
* viz: count unpickle in server startup time
* type checking
2025-10-16 13:07:46 +08:00
George Hotz
7c19db00f1
remove st from jit/split_reduceop ( #12713 )
...
* remove st from jit
* fix by merging reshapes
* no st usage in rangeify
* hmm, stop early works
* fix speed regressions
2025-10-16 12:50:58 +08:00
qazal
069177c1be
trace buffer producer and consumers ( #12639 )
...
* trace buffer producer and consumers
* work
* generic colored util
* fix batched
* basic clicking works
* generic javascript that works for producer and consumers
* keep focused shape
* idle time
* timings for producer and consumers dedup
* from sd test
* tiny cleanups
* timeline
* work
* up to here
* assert
* list it
* work
2025-10-16 11:11:31 +08:00
George Hotz
4a151e7533
make xcode signing happy, waiting for entitlement ( #12712 )
2025-10-16 10:20:34 +08:00
chenyu
c3278e5622
clean up old tests ( #12708 )
2025-10-15 17:53:17 -04:00
chenyu
b8cf35fb77
print macOS version in CI ( #12705 )
2025-10-15 15:05:33 -04:00
Daniel
d65bd669f8
update tiny torch backend hook ( #12575 )
...
* update the backend to fix torch deprecation warning
* use param_hook to avoid full backward hook needlessly firing on inputs which do not require gradients
* fix indentation
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-10-15 14:02:33 -04:00
nimlgen
db5ae846aa
nv: do not use va_addr for cpu accesses ( #12697 )
...
* nv: do not use va_addr for cpu accesses
* mypy
2025-10-15 22:48:12 +08:00
nimlgen
3ab23af829
nv: copy prog with copyin ( #12701 )
...
* nv: copy prog with copyin
* to bytes
* fix test
2025-10-15 22:48:01 +08:00
nimlgen
fafbf3daea
memory: reserve ptable ( #12702 )
2025-10-15 22:47:50 +08:00
George Hotz
85a907605c
hotfix: only 20 steps of beautiful_mnist_torch, some CI machines are slow
2025-10-15 22:29:34 +08:00
Christopher Milan
e1996d358c
use RTLD_GLOBAL on macos ( #12699 )
2025-10-15 22:24:50 +08:00
chenyu
312c622d35
support None in pad_to and shrink_to ( #12700 )
2025-10-15 09:25:31 -04:00
George Hotz
612e3d6143
replace mop arg with vectorized index ( #12695 )
...
* replace mop arg with vectorized index
* tests passing
* better viz
* no compile4
2025-10-15 20:50:06 +08:00
wozeparrot
9ec4c06d7d
feat: one request per device ( #12698 )
2025-10-15 05:22:07 -07:00
Sieds Lykles
99aa3bd5f9
reduce collapse reduce only the cut range ( #12687 )
2025-10-15 13:57:41 +02:00
Sieds Lykles
91ac4f1f92
late merging of where and load ( #12694 )
2025-10-15 13:33:06 +02:00
qazal
768dc952de
viz ui cleanups / renaming ( #12691 )
...
* better viz names
* delete unused
* don't use opacity, it's multiplicative
* keep styles
* scrollbar coloring
* pyrender doesn't work here
beautiful_mnist r_64_16_32_36@lower all index dtypes
2025-10-15 18:40:22 +08:00
chenyu
2e50ed0767
increase timeout of resnet cron ( #12693 )
...
does not finish in 6 hours now
2025-10-15 06:08:58 -04:00
Christopher Milan
0aabc1e938
Mesa NIR backend (NAK/LLVMpipe) ( #12089 )
...
* nak works
* TestOps::test_add works
* testop has no crashes
* fix bool casts
* fix typo
* add disassemble
* RANGE and locals/regs
* simplify NAKCompiler
* disass cleanup
* cleanup nir codegen
* almost all tests passing
* cleanup notes in extra/
* old notes
* only import nak if NIR=1
* fix new SPECIAL syntax
* fix local/shared memory
* more tests passing
* add DEFINE_VAR support
* llvmpipe kinda works
* diskcache
* some mypy stuff
* lvp passing test_ops.py
* fix imports
* actually fix imports
* remove 'stdout'
* fix llvm import
* fix mypy issues
* nicer errors
* simpler test_dtype skips
* test lvp in CI
* fix github action syntax
* fix more actions typos
* switch to mesa 25.1.0
* diskcache_put
* better generation for lvp nir_options
* b64encode shader blobs
* Revert diskcache changes
This reverts commits 930fa3de8a and 8428c694b3 .
* general cleanup
* better error messages
* fix llvm import
* fix windows tests
* link with libm and libgcc_s
* fix some errors
* dont check for 'float4'
* NIR uses pointer arithmetic
* use tinymesa
* bump tinymesa
* bump tinymesa again
* update lvp nir_options
* print nir shader with DEBUG
* simplify LVPCompiler
* more tests
* "gated" STORE
* NAK is cacheable
* more tests
* all tests pass locally for NAK
* test autogen in CI
* autogen deps
* more deps
* fix uop_gc
* fix macos
* mypy
* save 2 lines
* save two more lines
* save 1 line
* save 4 lines
* save more lines
* Revert "save more lines"
This reverts commit dd3a720c5a .
* save more lines
* fix LVP on windows
* refactor
* reorganize some code
* refactor lib_gpu
* move LVP check
* out of order loads
* remove support.mesa
* bump tinymesa version
* simplify LVP jit
* macos
* macos ci
* shell: bash
* testing
* more testing
* compute brew prefix
* stupid typo
* actually fix
* lib
* stdout on macos
* inline gallivm_compile_module
* Revert "inline gallivm_compile_module"
This reverts commit b65983b151 .
* elf macos
* semicolon
* inherit from CPULLVMCompiler
* ruff
* disas test
* fix libm linking
* default is fine actually
* arm works
* add elf loader link test
* fix NAK beam
* pylint is too smart by half
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
2025-10-15 17:38:33 +08:00
qazal
f0268d13f6
cleanup viz server ( #12688 )
2025-10-15 15:58:36 +08:00
nimlgen
aa81bde150
amd: usb4/thunderbolt on macs ( #12641 )
...
* tbgpu
* works
* cleaner
* this
* zero size
* h
* fix
* simpler
* prio over usb
* c
* not needed
* linter
* this way
* mappings
* mypy
* mypy
* mypy 2
* nn
2025-10-15 13:02:01 +08:00
George Hotz
236c4590c3
use margs as intermediate for new style mops ( #12686 )
...
* use marg to prepare for movement op change
* clean up forced reshape
* move marg
* more marg
* more
2025-10-15 12:43:00 +08:00
qazal
7597e1dcac
pyrender in viz ( #12682 )
...
* pyrender in viz
* keep profile still print_tree
* keep special in render
2025-10-15 11:53:30 +08:00
qazal
60e03eec37
viz: add View Program option ( #12683 )
2025-10-15 11:37:51 +08:00
George Hotz
a59439d013
use UOp.shape property instead of UOp.st ( #12664 )
...
* work on shape property
* reshape causing issues
* more mops
* all mops
* need to cache it
* _shape is like _device
* mostly works
* shape is good
* const uses _shape
* fix tests
* size doesn't use st
* close
* test is broken
* one less st
* hack for 3 op assign
* oops, i didn't mean to change that
* support emulate in the NullDevice
* reproed failure in emulation
* fix wmma
2025-10-15 10:01:34 +08:00
chenyu
89df6f611d
reenable sdxl mac benchmark ( #12680 )
...
also updated faster sd step times
2025-10-14 17:36:17 -04:00
chenyu
d25ceffe8d
update padto opts tests ( #12679 )
2025-10-14 17:00:42 -04:00
chenyu
e8380968f2
add venv_sd_mlperf to gitignore ( #12676 )
...
training stable diffusion stuff
2025-10-14 12:51:36 -04:00
wozeparrot
f228c03f9f
fetch raid from cloud ( #10799 )
...
* feat: initial tinyfs device
* feat: don't allow compute on tinyfs device
* feat: tensor helpers to load and store
* feat: bufferview for tinyfs
* fix: keep copy sizes correct
* fix: recv large
* clean: unneeded
* feat: comment
* clean: unneeded
* clean: remove
* clean: remove
* feat: get request tag
* feat: rename to cloud
* feat: send request_id
* feat: start computing tree
* feat: compute store tree on this side
* feat: jank chunked load
* feat: more debugging
* feat: rename to just load and store
* feat: correct chunk count
* fix: fix load for < 1mb
* feat: comments
* feat: don't truncate on block devices
* feat: better way of testing block device
* feat: don't need to pad that much
* feat: connect to nodes directly on load
* feat: cache connections
* feat: don't hard code chunk size
* feat: close mmap when closing file handle
* feat: don't overwrite stuff on disk if storing from disk
* clean: debug print
* fix: close mmap
* feat: await workers
* feat: fast copy from tinyfs to disk
* feat: don't copy to device on last
* feat: use single socket per device
* feat: raid in tinyfs
* clean: remove import
* clean: type
* feat: maintain single event loop
* feat: lower worker count
* feat: use connection pool
* feat: fetch mapping in its own process
* fix: release lock
* feat: don't fetch if exists
* feat: req id only on stores
* feat: always fetch
* fix: rangeify
* feat: allow specifying raid root
* fix: dealloc buffer
* feat: start support non 0 offset
* clean: use cleaner
* feat: don't pass to threadpool
* clean: typing
2025-10-14 07:53:55 -07:00
chenyu
70dd297a05
BS=96 for bert ( #12675 )
...
96 trains fine now
2025-10-14 09:07:43 -04:00
Sieds Lykles
852d80dff9
better where on load folding ( #12651 )
...
* move where clauses to load
* shorten line
* drop clauses if they are duplicated
* add rule for swapped where branch
* where on ungated load
* dont move clause if load is in the clause
* parse_valid returns None
* no data dependent branches
* fix rule
* enable swapped rule
* remove those
2025-10-14 13:30:47 +02:00
nimlgen
c7e63601fd
gfx1200 tc for AMD_LLVM ( #12673 )
2025-10-14 19:17:48 +08:00
George Hotz
db4a359374
fix up some slow tests that launch python ( #12672 )
...
* fix up some slow tests that launch python
* svd nonfull in parallel
* split test_advancedindex
2025-10-14 19:13:55 +08:00
nimlgen
4918c827c2
amd: lib_gpu does not need cpu_access ( #12670 )
2025-10-14 18:34:34 +08:00
nimlgen
0c9d47deab
hcq: add alignment to kernargs ( #12669 )
2025-10-14 18:33:12 +08:00
qazal
d3bfcd3277
minor patches for SQTT over usb on gfx12 ( #12627 )
...
* disable cpu_access in the sqtt buffer allocation
not sure if this is required, it results in a very slow call to
pcie_mem_write over USB GPU, removing it worked fine.
* fix itrace_se_mask on gfx12
on gfx11 it gave 6 se, on gfx11 this value is 2 so no instructions were
traced.
* Revert "fix itrace_se_mask on gfx12"
This reverts commit 0644adbcd1 .
2025-10-14 18:07:46 +08:00
Sieds Lykles
1e6e5a0efd
parse_valid returns None instead of raising ( #12663 )
...
* parse_valid returns None
* change there too
2025-10-14 11:57:38 +02:00
qazal
471bd30d16
cleanup viz/serve.py ( #12665 )
...
* use load_pickle
* update comment
2025-10-14 17:50:39 +08:00
George Hotz
fb61f3519f
remove assign contiguous hack ( #12659 )
...
* remove assign contiguous hack
* remove bad contiguous usage in torch backend
* assign
2025-10-14 16:42:14 +08:00
George Hotz
30ee7c4c26
cleanup Device usage in Tensor ( #12662 )
2025-10-14 16:22:22 +08:00
Sieds Lykles
e06cbfcb8a
combine pm_drop_and_clauses ( #12660 )
...
* combine those
* wino kernels decreased
2025-10-14 10:09:41 +02:00
George Hotz
84d4589ed4
remove pylint from pre-commit and CI ( #12658 )
...
* remove pylint from pre-commit and CI
* multidevice test is fast
* faster pre-commit
* 8 is faster than 4
* better name
* how did that typecheck?
2025-10-14 15:39:59 +08:00
qazal
8ecaf839e2
cleanup UOp tracing [pr] ( #12657 )
2025-10-14 14:50:59 +08:00
George Hotz
b9eb5b5d49
clean up the LLM tokenizer ( #12653 )
...
* clean up the LLM tokenizer
* simple tokenizer is actually simple
* ugh write good code
2025-10-14 14:22:01 +08:00
qazal
a9ef93176f
viz: add colored text helper ( #12654 )
2025-10-14 13:05:26 +08:00
George Hotz
ecdc7539a2
add typing to MathTraits ( #12650 )
...
* add typing to MathTraits
* fix assign
2025-10-14 12:35:20 +08:00
qazal
9bf032de69
viz: keep focused shape in view ( #12648 )
2025-10-14 10:49:08 +08:00
chenyu
77b5e6774e
fix bert training config ( #12647 )
...
FREE_INTERMEDIATE=0 REWRITE_STACK_LIMIT=500000
2025-10-13 15:03:47 -04:00
nimlgen
f1041dc0ac
pylint 4.0.0 ( #12642 )
...
* cpu: fix spacing
* fix pylint
* fix pylint
* pylint 4.0.0
* lambda
* keep eval for now
* im so sorry
2025-10-13 23:28:36 +08:00
wozeparrot
47e0c43976
feat: Tensor.{load, store} ( #12629 )
2025-10-13 08:04:41 -07:00
chenyu
0f776c6e46
examples/mlperf/training_submission_v6.0 ( #12644 )
...
copied from v5.1
2025-10-13 09:58:25 -04:00
Sieds Lykles
e0139fafc1
UOp symbolic tests use eval to check against string ( #12643 )
2025-10-13 14:19:42 +02:00
b1tg
218225e8d0
pylint error ( #12630 )
...
Co-authored-by: wozeparrot <wozeparrot@gmail.com>
2025-10-13 05:05:12 -07:00
nimlgen
9096d7cc2e
amd: support for rx9060 ( #12640 )
2025-10-13 19:44:15 +08:00
qazal
066d25f5fb
refactor to trace_num property in buffers ( #12638 )
2025-10-13 18:06:55 +08:00
qazal
cd6aeebfee
sqtt: osx decoder installer ( #12637 )
2025-10-13 17:26:12 +08:00
Sieds Lykles
e537e895b1
drop unused invalid conditions ( #12635 )
...
* drop where conditions if the ranges are not used inside the index
* remove allow_any_len
2025-10-13 10:52:21 +02:00
wozeparrot
9ab06dffad
hotfix: block from env ( #12628 )
2025-10-12 08:07:32 -07:00
wozeparrot
12435a2dab
actual tinyfs device ( #12620 )
2025-10-12 07:51:17 -07:00
chenyu
8f5f57c7d9
smaller CNT fuzz shapetracker ( #12626 )
2025-10-12 08:52:30 -04:00
George Hotz
1ecf403294
cleanup long lines [pr] ( #12623 )
...
* cleanup long lines
* more
* a few more
* all noqa fixed
* fix amd + cuda
* clean that up
2025-10-12 20:18:05 +08:00
qazal
fd51ecf983
process_replay for get_rangeify_map ( #12624 )
2025-10-12 15:14:40 +03:00
qazal
b5afa3848e
viz: fix memory graph total nbytes ( #12622 )
...
* viz: fix memory graph total nbytes
* post increment
* simple regression test
* loop with markers + slightly off text baseline
* cpu events clear
2025-10-12 14:32:46 +03:00
nimlgen
822eab057f
cpu: respect taskset + allow all cores ( #12619 )
...
* cpu: account taskset + allow all cores
* spaces
2025-10-12 14:31:40 +08:00
chenyu
7ac74d1550
remove unused type ignore [pr] ( #12618 )
2025-10-11 21:24:04 -04:00
Sieds Lykles
772a8dfe31
reshape uses valid when simplifying ( #12597 )
...
* reshape uses valid when simplifying
* try with IGNORE_OOB=0
* is it this test?
* skipif gpuocelot
2025-10-11 17:02:54 +02:00
nimlgen
08e62454b6
amd: use cpu_view() in sqtt ( #12610 )
2025-10-11 18:11:25 +08:00
Sieds Lykles
a2ae56674a
uop_given_valid try multiple clauses ( #12615 )
...
* uop_given_valid uses less simplify
* enable test
* try all expressions together
* enable test
2025-10-11 11:53:42 +02:00
Sieds Lykles
dccdd190aa
uop_given_valid uses less simplify ( #12612 )
...
* uop_given_valid uses less simplify
* enable test
2025-10-11 10:57:39 +02:00
qazal
9205527db0
viz: draw highlights above shapes ( #12613 )
2025-10-11 11:39:13 +03:00
George Hotz
cab034b863
improve typing ( #12611 )
...
* improve typing and bump to 3.11
* no need for Self yet
* improve typing
* binop also
2025-10-11 16:20:23 +08:00
Sieds Lykles
4300ebc455
cache apply_movement_op ( #12609 )
...
* cache apply_movement_op
* pylint and clear cache
* fix types
* ignore
* cleanup
2025-10-11 08:53:10 +02:00
George Hotz
7596c1b8f5
TestOuterworldReduce works ( #12608 )
2025-10-10 20:06:41 +08:00
chenyu
001b3710d3
enable some test_ops tests ( #12607 )
2025-10-10 07:23:21 -04:00
qazal
a62dc9ceb5
viz: light up buffer path ( #12603 )
2025-10-10 14:07:30 +03:00
qazal
464c56862f
viz: update ansi regex ( #12605 )
...
* viz: update ansi regex
* better
* add ansi_colors_light
* javascript
2025-10-10 13:58:58 +03:00
George Hotz
ac96d98745
GROUP_REDUCE is now bright RED instead of green ( #12604 )
2025-10-10 18:23:57 +08:00
nimlgen
89be3590aa
amd: sqtt on gfx12 ( #12564 )
...
* amd: sqtt on gfx12
* cleaner
* thi
* and this
* ops
* ugh
* back
* rm this
* rm
2025-10-10 17:54:14 +08:00
chenyu
95ad047445
do not use sint_to_uop in renderer [pr] ( #12601 )
2025-10-10 05:29:10 -04:00
Sieds Lykles
e625c27598
update min step times openpilot ( #12600 )
2025-10-10 11:24:27 +02:00
nimlgen
6ec96f6088
amd: remove dup flags in sqtt ( #12595 )
2025-10-10 17:23:33 +08:00
wozeparrot
9471157346
feat: bump llvm version ( #12598 )
2025-10-10 02:20:22 -07:00
qazal
36c753bd63
viz: switch llvm mca info to tabulate ( #12596 )
2025-10-10 11:54:34 +03:00
qazal
b27470b6db
viz: add buffer details in the timeline sidebar ( #12591 )
2025-10-10 11:36:08 +03:00
chenyu
03ef5197fc
move get_contraction to helpers [pr] ( #12594 )
2025-10-10 04:28:57 -04:00
Sieds Lykles
965bd194f2
uop_given_valid cleanup ( #12592 )
...
* cleanup
* cleanup there
2025-10-10 10:18:53 +02:00
chenyu
af90dc00de
remove some View add logic [pr] ( #12584 )
...
no longer simplify the case of v0+v1 where v0 has a mask
2025-10-10 03:47:56 -04:00
wozeparrot
f12e2a75db
feat: add thunderkittens ( #12590 )
2025-10-10 00:32:33 -07:00
qazal
caae46cfba
fix process replay progress update ( #12587 )
2025-10-10 10:20:55 +03:00
nimlgen
1309cea247
rocprof parser in extra ( #12569 )
...
* rocprof parser
* viewer
* vw
* skip
2025-10-10 14:56:42 +08:00
Sieds Lykles
cbdc13279d
fix openpilot gated reads ( #12570 )
...
* fix gated image counts
* slice correctly
2025-10-10 04:52:57 +02:00
chenyu
c8dfd10257
ShapeTracker.real_strides -> is_expanded [pr] ( #12579 )
...
only keep the used part
2025-10-09 22:52:45 -04:00
qazal
88ce63a49a
remove outdated comment in multi [pr] ( #12580 )
2025-10-10 05:50:49 +03:00
George Hotz
5977df267f
outerworld uses expand ( #12578 )
2025-10-10 10:25:25 +08:00
chenyu
f2c3a72b0c
remove RANGEIFY flag [pr] ( #12577 )
2025-10-09 21:52:54 -04:00
George Hotz
9b66c2b0b7
fix weekly commits table (i didn't know we linted extra)
2025-10-10 09:23:33 +08:00
George Hotz
658b96cbfb
weekly commits table
2025-10-10 09:15:41 +08:00
qazal
b86ad6053a
test_schedule independent of RANGEIFY flag ( #12568 )
...
* test_schedule independent of RANGEIFY flag
* comment for expectedFailure + test_cast_padded_view
* test_cast_padded_const works
* don't use full_shape it's fine
* add todos for the rest
2025-10-09 20:00:50 +03:00
nimlgen
502e613c9c
amd: clean up uppercased vars ( #12571 )
2025-10-09 19:39:27 +08:00
George Hotz
840d2bf1ea
fix div rules ( #12567 )
...
* group div rules
* merge those pattern matchers
* revert
2025-10-09 19:28:21 +08:00
nimlgen
8a1c3dc1bf
amd: use soc headers from rocm ( #12566 )
2025-10-09 19:10:46 +08:00
qazal
e0694fdb8e
remove UPat.__repr__ [pr] ( #12565 )
2025-10-09 12:35:34 +03:00
chenyu
678f83e41b
delete ShapeTracker to_valid_uop and substitute [pr] ( #12563 )
2025-10-09 05:06:10 -04:00
nimlgen
a11b686c71
amd: sqtt for all gfx11 ( #12546 )
...
* amd: general sqtt for gfx11
* target
* ops
* no gfx12 here
2025-10-09 17:04:06 +08:00
chenyu
a0cbbc35ad
remove LLAMA_LAYERS in ci ( #12562 )
2025-10-09 04:46:41 -04:00
chenyu
fe94453d52
delete CONTIGUOUS with RANGE in st [pr] ( #12561 )
2025-10-09 04:32:31 -04:00
chenyu
f793cdeb87
clean up shape changing logic to not use st [pr] ( #12560 )
2025-10-09 04:13:02 -04:00
chenyu
1bcea19846
remove ShapeTracker.reduce [pr] ( #12559 )
2025-10-09 03:54:11 -04:00
chenyu
c1cc277fc3
don't call src[0].shape multiple times in MULTI st [pr] ( #12558 )
2025-10-09 03:40:17 -04:00
qazal
2551a60d97
viz: split out shape links ( #12557 )
2025-10-09 10:34:55 +03:00
George Hotz
e7aa26ed29
make remove bufferize fast ( #12555 )
...
* add more uop gc test
* make remove bufferize fast
* substitute is fast too
* fix tests
2025-10-09 15:20:02 +08:00
chenyu
cf8232ec6a
clean up more RANGEIFY flag ( #12556 )
2025-10-09 03:06:48 -04:00
nimlgen
658c566e22
vars in gated_read_image_count ( #12486 )
...
* vars in gated_read_image_count
* nc
2025-10-09 14:54:15 +08:00
George Hotz
a8a9ac0e95
add more uop gc test ( #12553 )
2025-10-09 14:49:32 +08:00
chenyu
250f05a776
run some hashing test only on METAL ( #12554 )
...
quite slow on CPU
2025-10-09 02:39:49 -04:00
qazal
da9425c1a7
viz: sum all buffers in zoomed out memory graph ( #11898 )
...
* viz: switch to transformation matrix
* simpler axes domains
* less domain
* split loops
* flatten
* tiny rects
* solid proxy but still too big
* cache FileNotFound
* gridlines instead of padding
* not this
* like METAL -> METAL memory -> graph
* less colors
* better
* more grid work
* glitch
* clamp
* add range index
* pixel grids
* set min width
* y cords
* pruning
* test: clip in world units
* keep linear scan
* switch to interval tree
* fps counter
* work
* visible is the easiest
* shapes api
* math
* test bitgrid
* checkout
* work
* simpler
* work
* draw
* it's just a polygon
* merge polygons
* cleanup old stuff
* switch to hashmap there too
* add tooltips
* fix that
* better color
* better
2025-10-09 09:30:37 +03:00
chenyu
ae51bdd06a
remove trivial use of RANGEIFY flag ( #12550 )
...
some tests need update still
2025-10-09 02:29:38 -04:00
George Hotz
80d99d52a5
reduce_unparented only checks ranges ( #12548 )
2025-10-09 14:14:03 +08:00
nimlgen
375ee2c576
faster backward_slice ( #12515 )
...
* not cached backward_slice
* mypy
* just speed
* faster
2025-10-09 14:12:20 +08:00
George Hotz
1dc500426e
remove restrictions on range ending in indexing ( #12543 )
...
* remove restrictions on range ending in indexing
* early simplify
* Revert "early simplify"
This reverts commit 657d9972c2 .
* disable const folding tests
2025-10-09 13:53:08 +08:00
chenyu
585bd95b50
fix ruff 0.14.0 [pr] ( #12547 )
2025-10-09 01:52:30 -04:00
qazal
6af29b913b
viz: format rewrite time as a comment ( #12545 )
...
* viz: format rewrite time as a comment
* put above
2025-10-09 07:14:27 +03:00
qazal
baab7e334d
put match times in viz ( #12544 )
...
* put match times in viz
* float
2025-10-09 06:56:10 +03:00
George Hotz
51420d1f99
rangeify profiling ( #12540 )
...
* clean up stable diffusion weight loading
* add profiling to run_rangeify
* fix tests
2025-10-09 11:32:34 +08:00
chenyu
43bce1f39f
delete View minify [pr] ( #12538 )
2025-10-08 23:25:53 -04:00
qazal
9f9a8b0b5b
viz: fix tiny device linking ( #12541 )
2025-10-09 06:25:33 +03:00
George Hotz
6e6059dde0
clean up stable diffusion weight loading ( #12452 )
2025-10-09 11:13:11 +08:00
chenyu
20d98b19c3
delete more unused ShapeTracker stuff ( #12536 )
2025-10-08 23:09:44 -04:00
qazal
bb5671a837
some more ops.py cleanups ( #12525 )
...
* remove GroupOp.Meta and st_arg
* inline axis_arg
* only allow .buffer on reshapes (or the buffer)
* gate is the other way
* still want can_pad?
* use op_in_backward_slice_with_self
* .buffer is recursive
* lint
* pathlib there
2025-10-09 06:06:44 +03:00
chenyu
be05028419
move ASSERT_MIN_STEP_TIME to compile3 ( #12535 )
...
threshold is current time +20%
2025-10-08 22:16:59 -04:00
George Hotz
615ec6acf0
refactor to apply_movement_op ( #12533 )
...
* refactor to apply_movement_op
* new pm_mops is fine
* make mypy happy
* cleanup apply_movement_op function
2025-10-09 10:16:09 +08:00
chenyu
c4732a18bd
update tests that depend on SPLIT_REDUCEOP ( #12534 )
2025-10-08 21:53:30 -04:00
chenyu
5986d656a2
tighter ASSERT_MIN_STEP_TIME ( #12531 )
...
set to about 1.2x of actual time now
2025-10-08 21:22:54 -04:00
George Hotz
fc2bd53700
chatgpt nits ( #12529 )
...
* tsink_base wasn't needed
* nits from chatgpt
2025-10-09 07:34:44 +08:00
nimlgen
89ec2b3a74
memory: move bump allocator ( #12505 )
2025-10-08 23:12:04 +08:00
George Hotz
84fc34b274
tsink_base wasn't needed ( #12528 )
2025-10-08 22:46:06 +08:00
chenyu
28edea5d67
delete FUSE_CONV_BW ( #12527 )
2025-10-08 10:41:38 -04:00
George Hotz
2653147cb7
delete the lowerer ( #12526 )
2025-10-08 21:58:18 +08:00
George Hotz
0774575442
delete the old rangeify path and all the children stuff ( #12524 )
...
* delete the old rangeify path and all the children stuff
* remove the on_stack stuff and any retries
* don't use the p word
* Revert "remove the on_stack stuff and any retries"
This reverts commit 49a2b328b9 .
2025-10-08 21:24:04 +08:00
Rudeus
a65ec5c693
fix fromarray deprecation ( #12512 )
2025-10-08 09:13:26 -04:00
qazal
b6835f4134
remove Ops.VIEW and related UOp methods ( #12522 )
...
* remove Ops.VIEW and related UOp methods
* update abstractions2.py
* no ShapeTrackers in abstractions2.py
* it's a size 1
2025-10-08 14:47:02 +03:00
George Hotz
3b0b3a2e64
fast RANGEIFY ( #12504 )
...
* rtoposort is fast, can replace rangeify with this
* fast rangeify
* work
* fast rangeify works for mnist
* should work
* progress
* pad fix
* FAST
* tests passing
* don't delete those shape ops
* put in rangeify map
* ending ranges fix
* tests
* mstack/mselect no hacks
* move to indexing.py
* touch up tests + add comments
* disable failing test
* actually make the file readable
* failing
* error
2025-10-08 19:38:06 +08:00
qazal
9448924d9e
update gpt2 kernel count tests in CI=0 ( #12523 )
2025-10-08 14:29:11 +03:00
qazal
c5a1f9f5f9
no ShapeTrackers in multi.py ( #12521 )
...
* switch multi to all movement ops
* inline dvars
2025-10-08 14:04:05 +03:00
chenyu
ee0382ad99
remove ShapeTracker.invert ( #12520 )
2025-10-08 18:37:34 +08:00
chenyu
d5058427ea
remove ShapeTracker.real_size ( #12519 )
2025-10-08 06:15:29 -04:00
qazal
6f26603f06
delete swizzler.py ( #12518 )
...
* delete swizzler
* remove merge_views tests
* don't need rewrites_for_views
* apply_rewrites
2025-10-08 13:02:34 +03:00
qazal
7e0b14243e
delete grouper and kernelize ( #12517 )
...
* delete grouper and kernelize
* +sys.setrecursionlimit
2025-10-08 12:27:26 +03:00
chenyu
942022c309
smaller LLAMA_LAYER in Test llama 3 training ( #12516 )
...
very slow now
2025-10-08 05:10:51 -04:00
chenyu
e701106a64
remove FUSE_ARANGE ( #12511 )
...
it was the default already
2025-10-08 04:54:07 -04:00
qazal
291a19650b
move Kernel dataclass to rangeify ( #12510 )
2025-10-08 11:30:06 +03:00
qazal
ad49f8148b
switch process_replay to rangeify ( #12509 )
2025-10-08 11:26:43 +03:00
chenyu
da1f46ff3f
remove RANGEIFY specific test jobs ( #12507 )
2025-10-08 04:12:04 -04:00
George Hotz
1e567a5cf8
make RANGEIFY=1 the default ( #12161 )
...
Co-authored-by: chenyu <chenyu@fastmail.com>
Co-authored-by: Sieds Lykles <93992551+S-Lykles@users.noreply.github.com>
Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2025-10-08 03:46:09 -04:00
nimlgen
9e7103647d
amd: rename cmd_id to sqtt_next_cmd_id ( #12503 )
...
* amd: rename cmd_id to sqtt_next_cmd_id
* and typo
2025-10-08 15:16:19 +08:00
nimlgen
4a756a37d8
amd: support rocm7 ( #12502 )
...
* amd: support rocm7
* mock
2025-10-08 14:30:39 +08:00
qazal
60b6dca5ba
update some tests instead of expect_rangeify_fails ( #12500 )
...
* update test_clone_doesnt_dedup to use base
* new_flat_buffer passes
* fix test_reorder_expand
* remove the view stuff
* remove that test, we don't want this view const behavior
* test_setitem_becomes_subbuffer is good
2025-10-08 07:42:31 +03:00
qazal
84597ed53c
early assert for device mismatched asts in rangeify ( #12499 )
...
* early assert for device mismatched asts in rangeify
* alt also passes
2025-10-08 07:19:36 +03:00
qazal
2e19354c1c
viz: reorder timeline graphs ( #12498 )
...
* viz: reorder timeline graphs
* update test_viz with the new order
2025-10-08 07:10:23 +03:00
George Hotz
d06226b575
fix SPEC and all_tensors iterator ( #12496 )
2025-10-07 23:18:17 -04:00
qazal
a7cb80bfab
use recursive_property in UOp device ( #12477 )
...
* simple failing test with RecursionError
* switch to @recursive_property
* merge 2
* diff
2025-10-08 06:15:05 +03:00
George Hotz
a6d59a0b45
backward_slice to get srcs recursively ( #12494 )
...
* change name to backward_slice
* faster check
* clean up comments and names
* comment
2025-10-08 10:31:42 +08:00
chenyu
eb3bc277b3
remove ASSERT_MIN_STEP_TIME in external_benchmark_openpilot ( #12495 )
...
should add for compile3 and compile 3 only
2025-10-07 22:13:42 -04:00
qazal
239f9a3029
update viz to not use children [pr] ( #12493 )
2025-10-08 04:35:01 +03:00
Sieds Lykles
b465c17b56
Revert "UOp.factor and add chain sorting ( #12413 )" ( #12492 )
...
This reverts commit e74be4a140 .
2025-10-08 03:20:23 +02:00
George Hotz
945cc46475
delete children tracking from uop ( #12491 )
...
* delete children tracking from uop
* uop children no longer exists
* no tracked children
* that test is flaky too
2025-10-08 09:04:14 +08:00
nimlgen
648e5bb223
hcq: do not raise when fini ( #12487 )
...
* hcq: do not raise when fini
* Revert "hcq: do not raise when fini"
This reverts commit 44af5f7d05 .
* this way
* runtime is fine
* nn
2025-10-07 23:27:03 +08:00
George Hotz
a2345787b9
parents is faster than sparents ( #12490 )
2025-10-07 21:31:50 +08:00
George Hotz
12c4963489
add more rangeify pm tests ( #12488 )
2025-10-07 05:45:38 -04:00
George Hotz
403fdfcfd4
check spec in test, cleanup vectorize render ( #12484 )
2025-10-07 17:05:50 +08:00
qazal
22674798df
assert correctness in test_permuted_assignment [pr] ( #12483 )
2025-10-07 11:42:22 +03:00
George Hotz
75ce11593c
test_reshape_match should match ( #12479 )
2025-10-07 16:07:21 +08:00
chenyu
fe774a4319
more skip WINO on benchmark ( #12482 )
2025-10-07 03:43:51 -04:00
chenyu
8ad5f9e74f
skip slow benchmarks ( #12481 )
...
* skip slow benchmarks
padded tc is already slow, rest are slow with rangeify (correct if run locally)
* relax more
2025-10-07 03:28:56 -04:00
George Hotz
ea7672931f
fix test_matmul_relu_cat ( #12478 )
2025-10-07 02:32:23 -04:00
George Hotz
514d2a0774
merge tagless reshapes ( #12474 )
...
* merge tagless reshapes
* cleanup
2025-10-07 13:57:58 +08:00
chenyu
7b48f3cc45
failed test case repro for openpilot model ( #12475 )
...
* failed test case repro for openpilot model
* assertEqual
2025-10-07 13:46:43 +08:00
chenyu
a5484b767e
remove skipping cast in simplify_valid [pr] ( #12472 )
...
* remove skipping cast in simplify_valid [pr]
unsupported statements are handled in uop_given_valid already. the test failed because (100%x) somehow got simplified
* better test
2025-10-07 00:10:04 -04:00
George Hotz
b4509fba31
thundermittens ( #12471 )
...
* thundermittens
* give device a type
2025-10-07 11:47:39 +08:00
George Hotz
0f25b4b289
move frontend dir to nn [pr] ( #12470 )
2025-10-07 10:42:22 +08:00
qazal
f664bcc8bd
use recursive_property in UOp tracing ( #12469 )
...
* test
* simple passing
2025-10-06 21:10:52 +03:00
qazal
1af05dae77
fix rangeify in compile4.py ( #12467 )
...
* fix rangeify in compile4.py
* fix type_verify
2025-10-06 13:37:46 +03:00
qazal
76e8a3250c
rangeify: late zero folding ( #12464 )
...
* rangeify: late zero folding
* early
* not kernels
* none
* multi
* linter
* mstack is sink comment
* more comment
2025-10-06 12:52:33 +03:00
George Hotz
0c015a24fe
use recursive_property to prevent RecursionError ( #12465 )
...
* use recursive_property to prevent RecursionError
* not slower
* fix tests
* faster
* simpler
2025-10-06 15:59:18 +08:00
chenyu
a1881b0c17
update test_chicken ( #12466 )
...
logits are close, just numerical
2025-10-06 03:58:44 -04:00
qazal
1b1978b9c0
early copy fixup ( #12463 )
...
* simple failing test
* early copy fixup
2025-10-06 06:38:29 +03:00
chenyu
c1e85f699c
multi test case for sharded ring allreduce ( #12462 )
...
* multi test case for sharded ring allreduce
triggers `children not making progress` with RANGEIFY
* expect_rangeify_fails
2025-10-05 23:18:24 -04:00
chenyu
1823a5043f
don't check MAX_BUFFER_SIZE on NULL ( #12461 )
2025-10-05 22:09:29 -04:00
George Hotz
46e8ea15c1
split pm_substitute_recurse ( #12460 )
2025-10-05 21:35:50 -04:00
nimlgen
1216fff781
remote: raise runtimeerror in checkz ( #12453 )
2025-10-05 21:22:53 +08:00
qazal
6ad9a688ed
add failing test after "pend substitutes for speed" ( #12457 )
...
* add failing substitute test
* expect_rangeify_fails
2025-10-05 16:10:04 +03:00
chenyu
74b04f7dca
test beautiful_mnist_multigpu ( #12455 )
...
* test beautiful_mnist_multigpu
another example that fails with RANGEIFY
* now i remember
* MAX_BUFFER_SIZE=0
2025-10-05 08:45:01 -04:00
hooved
69857d0ab0
Stable Diffusion mlperf training ( #11304 )
...
* entrypoint for sd mlperf train development
* match sd-v2 mlperf reference unet
* implement dataloader from mlperf ref
* update dataloader reference
* implement LambdaLR scheduler from mlperf ref
* match tokenizer from mlperf reference
* sample latent
* add noise to latent
* complete training epoch
* run full training step
* jit training loop
* replicate mlperf ref. losses over 11 train steps
* save tinygrad loss checkpoints properly
* match out.2.bias.grad to reference
* match weights to ref after 1 step
* compare out.2.bias to ref over three train steps
* implement attn_mask; cleanup closeness testing
* correct mse loss
* update dev_run / dependencies
* setup validation config/checkpointing
* implement validation sampling
* test closeness of eval denoise step to mlperf ref
* test closeness of decoder to mlperf ref
* confirm inception matches mlperf ref
* resize w/ bicubic interpolation, test closeness
* confirm closeness of clip preprocess to mlperf ref
* confirm clip score matches mlperf ref
* confirm fid/clip scores match mlperf ref
* cleanup
* cleanup
* zero-init some unet params as in mlperf reference
* revert jit change
* uncomment dependencies
* move to tinybox red
* implement GradScaler from torch but jittable
* simplify lr_scheduler, ensure jittability
* instantiate GradScaler
* only check if grads are finite with fp16
* implement fp16 training loop
* refactor UNet: norm, gelu, mixed precision
* refactor clip_tokenizer to enable versioning
* make fp16 attention closer to torch
* remove comparisons to torch fp16 attention
* add globvars.py for reference
* confirm closeness of fp16 unet forward to mlperf
* test norm closeness to torch with precast
* remeasure e2e with master attention
* more detailed softmax upcast comparison to torch
* parameterize softmax upcast in attention and unet
* use fp32 weights with autocast to fp16
* cleanup
* add data/checkpoint download script
* debug kernel timeout on AMD
* fix finite grads check; start multigpu
* pass numpy arrays from dataloader
* include text encoder in jit train step
* use int32 for tokens instead of int64
* prevent multi bug in reshape within clip
* corealize more, del refs before
* add more logging and wandb
* use erf gelu in clip encoder
* minor changes to train step and logging
* save checkpoints for eval or resuming
* add eval-only logic to training script
* multigpu eval
* remove PARALLEL=0
* cleanup
* pad eval batches of size < EVAL_BS
* workaround silent multigpu bug in jit
* cleanup
* tokenize captions
* verify correctness of multigpu eval
* cleanup
* verify correctness of grads in train step
* verify correctness of training (20 steps)
* don't shard in the training jit
* training settings
* minor cleanup
* overfit train w/ eval on 6 samples
* offload to enable combined train and eval
* download to raid; use local rclone
* misc changes for mi300x / logging
* refactor eval for larger BS, verify correctness
* cleanup
* ckpt resuming and remove eval cats
* eval BEAM config on mi300x and red
* resume eval after crash
* confirm eval correctness (one iteration, 6 samples)
* verify eval correctness at full scale
* cleanup correctness testing
* training correctness (20 steps, BS=248 uniform)
* cleanup
* remove eval cache at end of run
* switch f16 for bf16, del grad scaler
* confirm bf16 training correctness
* timestamps, new jits
* merge jits in training
* realize loss/lr on CPU
* training correctness
* post-bf16 train/eval
* implement grad_acc with timing/logging
* beam offline; debug gradacc; use float32
* fix gradacc in jit, correctness test
* prepare f32 BS=512 gradacc=4 run
* workaround jit problem in diffusion eval
* scale lr by BS
* revert gradacc, prepare bf16 BS=336 lr*=BS train
* make checkpointing faster
* resume bf16 BS=336 base_lr=1.25e-7 run
* jit ckpt at beginning
* don't alloc more gpu mem in ckpt
* cleanup
* move script to mi300x dir
* cleanup
* cleanup unneeded files
* revert beam search to master
* minor changes
* fix regression: realize before assign in eval
* cleanup mlperf SD data/ckpt downloads
* workaround BEAM failure
* workaround bug in Tensor.stack
* minor changes
* revert gradscaler
* cleanup
* cleanup/validate dataloader
* ensure checksum of laion data
* simplify config
* load training state to jitted bufs
* simplify lr scheduler
* simplify train script
* cleanup comments
* refactor stable diffusion/unet init
* more refactoring of stable diffusion init
* fix import errors in tests
* refactor: separate train/eval
* fix import errors
* eval checkpoints in reverse chron. order
* save/load cycle in sd init
* refactor and verify eval
* verify training correctness
* prepare repro train run
* cleanup
* integrate beam retry, train, eval
* simplify wandb
* kill orphaned processes
* better logging
* train to 10 ckpts instead of 7
* remove optimizer/scheduler checkpointing/resume
* cleanup
* BEAM=2 7 ckpts
* add test to compare with torch softmax in amp
* cleanup
* stop eval early if checkpoint converged
* add test for lr scheduler
* add proper test method
* add test for training
* use venv name that is ignored by .gitignore
* linting
* add simple f32 softmax fxn
* revert change to scaled_dot_product_attention
* refactor gelu_erf init
* simplify mixed precision in unet
* add norm autocasting to fp32
* rm extra test
* test eval with NULL backend
* fix venv name
* simplify norm autocast
* use temp dir for training test
* actually add eval test
* remove parallel env variable from tests
* update clip with tests
* reorg init functions
* use np for testing
* remove unused var
* factor out GPUS
* add sd model init tests
* more unet tests
* match master
* rerun CI due to linux (remote) hang
* explain UNET_CKPTDIR
* rerun CI due to linux (remote) timeout
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-10-05 07:56:05 -04:00
George Hotz
a976ace404
minor improvements to rewrite ( #12454 )
...
* minor improvements to rewrite
* need that continue
* faster
2025-10-05 18:09:32 +08:00
qazal
4b60121498
fix bmnist torch with RANGEIFY=1 ( #12442 )
...
* fix bmnist torch with RANGEIFY=1
* alt
* test and comment
* this was always wrong
* simple failing test for rangeify
* simple upat to match the old behavior
2025-10-05 12:34:27 +03:00
George Hotz
b5f31d7505
earlier seen children ( #12451 )
2025-10-05 15:55:13 +08:00
qazal
865d5796f8
add a test for untested Tensor.assign behavior ( #12448 )
...
* add a test for untested Tensor.assign behavior
* better
2025-10-04 12:44:56 +03:00
Sieds Lykles
e74be4a140
UOp.factor and add chain sorting ( #12413 )
...
* add ordering
* fix some tests
* fix more tests
* shorten comment
* update test
* add rule and test
* add rule and test
* remove check
* use fold_divmod_congruence instead of simplify
* adjust tests
* shorten line
* new algo
* add test
* add function to un-nest the div
* add UOp.factor
* test UOp.factor
* uop_given_valid tries to factor simplex expression
* shorten line
* symbolic_flat is back
* change that back
* fix those new tests
* new rule for ordering
* factor multiple factors
* no symbolic_flat
* symbolic_flat to there
* move that back
* fix imports
* merge correctly
* linter happy
* add rule
* add a test
* cleanup
* revert that for now
* UOp.factor returns self instead of None
* try all_candidates
* remove or_else
* post index symbolic
* add test
* make this closer to the original
* increase mac hlb_cifar min step time
* add some ordering tests
* cleanup
* increase pytest timeout time
* check dtype
2025-10-04 06:05:38 +02:00
Sieds Lykles
394dc24110
post index symbolic ( #12446 )
...
* post index symbolic
* add test
2025-10-03 23:23:03 +02:00
chenyu
9f2b69b870
enable few tests for PTX test_dtype ( #12445 )
2025-10-03 08:56:30 -04:00
George Hotz
0b534f71c2
recursive substitute should be O(n) ( #12444 )
...
* recursive substitute
* even faster
* make that a single rewrite
2025-10-03 18:29:59 +08:00
chenyu
b087663c35
RANGEIFY test_bert uses more ram somehow ( #12443 )
2025-10-03 04:38:53 -04:00
chenyu
940a8d5ba9
default IGNORE_OOB=1 ( #12441 )
...
* default IGNORE_OOB=1
z3 can get very slow with RANGEIFY, also update some kernel numbers to what it is
* add to test
2025-10-03 04:16:19 -04:00
George Hotz
d290e77a5b
pend substitutes for speed ( #12440 )
2025-10-03 15:49:19 +08:00
nimlgen
23d310bcc1
ptx: handle i8/u8 casts correctly ( #12439 )
...
* ptx: handle casts correctly
* notsetp
2025-10-03 15:34:15 +08:00
hooved
1e8945a28c
Training loop for Stable Diffusion mlperf ( #12315 )
...
* add diff
* fix edit error
* match master
* point reference to specific commit
* simplify wandb logging
* remove lr test, dehardcode device
* increase stack size limit
2025-10-03 02:45:38 -04:00
George Hotz
c7849ac593
fix test lil model ( #12437 )
...
* fix test lil model
* 4 not 3
2025-10-03 02:28:37 -04:00
chenyu
0f82d92b9d
use float for softmax in llm.py ( #12438 )
...
fixed numerical issue in `CPU=1 RANGEIFY=1 python3 -m tinygrad.apps.llm`
2025-10-03 02:27:56 -04:00
George Hotz
4c63f7e786
skip copies of reshaped buffers ( #12430 )
...
* skip copies of reshaped buffers
* always run NOOP
* comment
* comment
2025-10-03 13:05:47 +08:00
Sieds Lykles
0047bcc535
undo loaded comparison swap ( #12436 )
...
* add rule
* add a test
2025-10-03 06:57:29 +02:00
chenyu
f203d8b221
update RANGEIFY kernel count and test_masked_select ( #12435 )
2025-10-03 00:41:34 -04:00
wozeparrot
a6dd5a224b
skip webgpu tests ( #12433 )
2025-10-02 21:31:07 -07:00
chenyu
bf99de7b1e
update a few more tests for RANGEIFY ( #12434 )
2025-10-03 00:16:58 -04:00
George Hotz
9cd365c12e
little changes from double gemm ( #12429 )
...
* little changes from double gemm
* split pm_group_for_reduce
* pm_add_buffers_local
* Revert "pm_add_buffers_local"
This reverts commit 4d30a91db2 .
2025-10-03 10:31:51 +08:00
Sieds Lykles
16a65b4fd0
fix test_symbolic_gcd_div hang ( #12427 )
2025-10-03 04:21:16 +02:00
chenyu
2d24af888b
REWRITE_STACK_LIMIT ( #12426 )
2025-10-02 21:51:04 -04:00
hooved
1b58ef0d60
Increase stack size limit in unified_rewrite ( #12424 )
...
* increase stack size limit
* rerun CI due to random tqdm test fail
2025-10-03 09:06:47 +08:00
qazal
17d36d0952
don't tag MSTACK/MSELECT on global buffers ( #12423 )
...
* don't tag MSTACK/MSELECT
* fix
2025-10-02 13:32:15 +03:00
chenyu
7b3912d8e4
relax atol for some tests ( #12422 )
2025-10-02 05:04:44 -04:00
chenyu
98163832e4
update RANGEIFY test_cast_padded ( #12421 )
...
* update RANGEIFY test_cast_padded
* update test
2025-10-02 04:37:35 -04:00
chenyu
37beef6de3
add null bert training test in ci ( #12420 )
...
fails with RANGEIFY `RuntimeError: children not making progress`
2025-10-02 04:05:19 -04:00
qazal
f21851b099
ops: n^2 .device property fix ( #12419 )
...
* test case for a long rand chain
currently failing with RANGEIFY because device propagates too deep
* skip
* ops: n^2 .device property fix
* unskip
---------
Co-authored-by: Chen-Yu Yang <chenyu@fastmail.com>
2025-10-02 03:28:12 -04:00
b1tg
ec177c80c2
rangeify: fix test_where_fold (llvm) ( #12416 )
...
* rangeify: fix test_where_fold (AMD_LLVM)
* rm comment
2025-10-02 02:57:49 -04:00
qazal
13a25b2e67
rangeify: don't shape INDEX on kernelize ( #12417 )
2025-10-02 09:45:37 +03:00
hooved
5d9035f5a6
Eval for Stable Diffusion mlperf ( #12316 )
...
* add diff
* rerun ci
* refactor beam workaround, add test
* fix conflict
* linting
2025-10-02 02:35:38 -04:00
hooved
0f804c9a83
Stable Diffusion model init for mlperf ( #12314 )
...
* include clip pr diff
* updated unet and sd init
* dehardcode default device
* revert beam hang workaround
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-10-02 02:28:41 -04:00
George Hotz
0eee93f0c0
hotfix: disable split ranges for non rangeify
2025-10-02 13:15:24 +08:00
George Hotz
583553f467
split ranges ( #12411 )
...
* split ranges
* simpler
* split ranges
* range str
* fix test
* oops
* faster
* no group 2
* tests
* dont_sub_ranges_for_image
* revert that
2025-10-02 12:57:22 +08:00
qazal
6fc6b51b59
fix limit_bufs with kernelize ( #12415 )
2025-10-02 07:49:11 +03:00
qazal
d1c868f990
fix limit_bufs with multi ( #12414 )
2025-10-02 05:51:56 +03:00
qazal
2fcd55583f
allow less kernels in external_test_opt ( #12412 )
...
* allow less kernels in external_test_opt
* this was always 2
2025-10-02 05:05:42 +03:00
qazal
8b48e19ce2
skip more multi remote tests ( #12410 )
2025-10-02 04:50:46 +03:00
George Hotz
3770dd9d80
annotate bufferize in viz
2025-10-02 09:20:50 +08:00
qazal
5b649616ff
rangeify: detect and assert cycles ( #12405 )
...
* rangeify: assert cycles
* rng=2
* any
2025-10-02 03:39:43 +03:00
Sieds Lykles
9a64fc0d28
Load alt value with cast try 2 ( #12407 )
...
* add or_casted
* add tests and fix old tests
* cast load
* move that to pm_render
* add allow_any_len to gated load patterns in renderers
* slice [:2]
2025-10-02 00:55:29 +02:00
nimlgen
3e0e0290ce
increase timeout in test_module_runs ( #12408 )
2025-10-01 22:01:44 +03:00
Sieds Lykles
2f8ac77c25
add allow_any_len to gated load patterns in renderers ( #12406 )
2025-10-01 20:35:32 +02:00
George Hotz
89bed28716
split reduceop ( #12404 )
...
* some rangeify tests fixed
* bring split reduceop to rangeify
* fix tests
2025-10-01 18:45:16 +08:00
George Hotz
74ee305948
some rangeify tests fixed ( #12403 )
2025-10-01 18:23:37 +08:00
qazal
f198a9e1ba
skip test_multihost_aware_schedule, assign devices mismatch ( #12396 )
...
* minimal failing remote test
* this should've never worked?
* skip that test
2025-10-01 13:09:15 +03:00
b1tg
ac3d457d5e
rangeify: TestReduceOpsConstFolding ( #12397 )
...
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-10-01 17:58:19 +08:00
George Hotz
60e52fbe36
support opts in contig, simpler ( #12400 )
2025-10-01 17:20:04 +08:00
chenyu
6c95b1f39d
explicitly set device for CI unit test ( #12399 )
2025-10-01 05:16:54 -04:00
chenyu
6ba8bf282f
skip test_masked_select for RANGEIFY PYTHON ( #12395 )
2025-10-01 04:13:31 -04:00
chenyu
689ab9151b
more RANGEIFY tests ( #12393 )
...
would have caught the load alt regression without adding too many tests
2025-10-01 03:43:58 -04:00
chenyu
adc8c3b28f
Revert "load alt value with cast ( #12384 )" ( #12392 )
...
This reverts commit 05e91a248d .
2025-10-01 03:20:04 -04:00
b1tg
154d114364
rangeify: fix abstractions2.py ( #12386 )
...
* rangeify: fix abstractions2.py
* tests
* lint
* only abstractions2
* base
2025-10-01 09:58:56 +03:00
George Hotz
fe96c8d345
add HALF flag to tinygrad.apps.llm
2025-10-01 14:44:59 +08:00
George Hotz
f205352cd7
remove ranges with 1s ( #12388 )
...
* use op_in_parents
* remove the ranges of 1
* fix CL image thing
* fix realize
2025-10-01 14:43:29 +08:00
qazal
90b1c0dd96
rangeify: test_where_fold kernel count ( #12379 )
...
* rangeify: test_where_fold kernel count
* get these from the index
* replace ranges
* fine
* movement ops
* diff
* better
2025-10-01 09:35:12 +03:00
b1tg
42748ccb92
rangeify: fix test_prequant_conv2d_1x1 ( #12391 )
2025-10-01 02:33:47 -04:00
Sieds Lykles
05e91a248d
load alt value with cast ( #12384 )
...
* add or_casted
* add tests and fix old tests
* cast load
* move that to pm_render
2025-10-01 07:14:26 +02:00
qazal
714500edfd
viz: add font-weight to OffscreenCanvas config ( #12390 )
2025-10-01 08:08:47 +03:00
b1tg
57ad46c6e4
rangeify: increase atol for test_two_binops_no_rerun passing on real windows machine ( #12389 )
...
CPU_LLVM=1
2025-10-01 00:56:45 -04:00
George Hotz
e02da8f5ac
use op_in_parents ( #12385 )
2025-10-01 12:37:29 +08:00
chenyu
0662946fac
atol in test_two_binops_no_rerun ( #12387 )
...
for RANGEIFY LLVM
2025-10-01 00:05:47 -04:00
b1tg
da52006bde
rangeify: fix test_scatter_reduce ( #12380 )
...
* rangeify: fix test_scatter_reduce
* ext_vector_type
* set alignment=1 on boolean
2025-09-30 23:26:36 -04:00
George Hotz
1c1b4d14e9
minor cleanups in rangeify ( #12382 )
...
* minor cleanups in rangeify
* op_in_parents
* don't use toposort
* Revert "don't use toposort"
This reverts commit 257d8e2529 .
2025-10-01 11:19:48 +08:00
wozeparrot
4204edc60b
feat: skip test_long ( #12383 )
2025-09-30 20:07:39 -07:00
chenyu
8def8145e4
ALLOWED_KERNEL_COUNT openpilot 0.9.4 with RANGEIFY ( #12381 )
2025-09-30 22:58:59 -04:00
George Hotz
4c9a930de2
rangeify attn tests ( #12377 )
2025-10-01 09:59:19 +08:00
qazal
26247573e1
rangeify multi tests on gpu ( #12376 )
...
* rangeify multi tests on gpu
* fix limit_bufs
2025-10-01 04:53:04 +03:00
qazal
f2eb92948d
rangeify: ban view pushing ( #12371 )
...
* rangeify: ban view pushing
* don't shape INDEX
* fix the codegen cache
* make space
2025-10-01 04:37:52 +03:00
George Hotz
a128fa0f8a
removing double reshapes was wrong ( #12375 )
2025-10-01 09:25:35 +08:00
hooved
969a1b35ca
LR scheduler for Stable Diffusion mlperf training ( #12201 )
...
* add lr scheduler for stable diffusion training
* add lr scheduler test
* rerun ci
* rerun CI
* use np for testing
* move test to CI path
* remove unneeded copy
2025-09-30 21:21:08 -04:00
George Hotz
9ef319f349
bad conv in rangeify ( #12373 )
...
* bad conv with broken rangeify
* no maxpool needed
* add empty_like
* typo
* no self
* issue remains for test
2025-10-01 08:56:22 +08:00
nimlgen
080b26e7d7
use suppress_finalizing to not mute all exceptions ( #12372 )
2025-09-30 21:24:31 +03:00
George Hotz
44558a37f7
fix some rangeify tests ( #12370 )
...
* fix bad range merges
* fix rng
* fix uop gc
* fix some rangeify tests
* now that needs rangeify 2 also
2025-09-30 20:12:08 +08:00
nimlgen
2c397eb2a2
rangeify: buf limit ( #12336 )
...
* limit bufs
* g
* fix buffer limit
* um?
* fix
* only these?
* typo
* f
* cleaner
2025-09-30 14:59:47 +03:00
George Hotz
a83f219253
fix bad range merges ( #12368 )
...
* fix bad range merges
* fix rng
* fix uop gc
2025-09-30 19:30:21 +08:00
qazal
a95159d579
remove TestShapeSpec, it relies on ShapeTracker [pr] ( #12369 )
2025-09-30 14:20:35 +03:00
George Hotz
9cf5e66899
minimal rangeify stable diffusion fix ( #12367 )
...
* minimal rangeify stable diffusion fix
* more minimal
2025-09-30 18:48:35 +08:00
chenyu
b4a4817c9c
fix rangeify test_linalg ( #12365 )
2025-09-30 06:28:35 -04:00
qazal
de1d562b69
rangeify: update test_pickle asserts ( #12366 )
...
* realized exists on the base
* use is_realized
2025-09-30 13:27:41 +03:00
b1tg
c9ef5d8fe5
rangeify: fix test_tensor_index_overflow (CPU_LLVM=1) ( #12362 )
...
* rangeify: fix test_tensor_index_overflow (CPU_LLVM=1)
* add test
---------
Co-authored-by: b1tg <b1tg@users.noreply.github.com>
2025-09-30 05:55:15 -04:00
qazal
e8c595c29e
remu: add new instructions introduced in RANGEIFY ( #12363 )
...
* add v_mad_i64_i32 for test_output_padded_conv_transpose2d
* run amd test_ops
* skip test_masked_select
2025-09-30 12:36:29 +03:00
George Hotz
360980f1a3
work on rangeify cost function heuristics ( #12360 )
...
* work on rangeify cost function heuristics
* dedup
* better cost function
2025-09-30 16:44:29 +08:00
qazal
109c63b904
update Tensor unit tests for RANGEIFY ( #12359 )
...
* update test_kernelize for RANGEIFY
* also kernelizes user contiguous
* skip that test
* tensor uop repr
* 4 kernels, still realizes a float
2025-09-30 11:17:21 +03:00
George Hotz
7129419500
fix cifar training in RANGEIFY ( #12355 )
...
* fix cifar training in RANGEIFY
* even more wino fuse
* bugfix
* test to show issue
2025-09-30 15:59:19 +08:00
qazal
4ff7f20b9d
rangeify: fix kernelize ( #12357 )
2025-09-30 10:10:08 +03:00
chenyu
86c5c969ea
linalg cosmetic change ( #12356 )
2025-09-30 03:00:59 -04:00
qazal
6a56d3c859
rangeify: only test correctness in multi ( #12339 )
...
* work
* more work
* back here
* skip tests
* work
2025-09-30 09:55:59 +03:00
George Hotz
ab6b0d3a21
enable cleanup_dead_axes ( #12351 )
...
* enable cleanup_dead_axes
* don't mess with user contig
* correct tag behavior
* double reshape isn't correct
* block on assign too
* skip messing with symbolic
* Fix tests
* disable RANGEIFY=2
* test w rangeify
2025-09-30 14:09:39 +08:00
qazal
2a7310ab59
rangeify: fix remaining multi correctness issue ( #12354 )
2025-09-30 08:08:27 +03:00
Sieds Lykles
73b25bf47d
z3 fix loaded mask ( #12353 )
...
* z3 fix loaded mask
* indentation
2025-09-30 06:55:50 +02:00
wozeparrot
2a0caa09c2
push copy to disk ( #12348 )
2025-09-29 21:55:05 -07:00
chenyu
881709cd33
don't skip rangeify test_instancenorm_3d ( #12350 )
...
seems fine now
2025-09-30 00:05:59 -04:00
hooved
39aae679e4
Support bfloat16 on NULL backend ( #12340 )
...
* add failing test
* move test
* only run test with NULL default
* add skip reason
* add fix
2025-09-30 00:02:30 -04:00
chenyu
af935e7d32
Revert "reduce const folding ( #12344 )" ( #12349 )
...
This reverts commit 8e508a9927 .
2025-09-29 23:45:30 -04:00
George Hotz
f522e83a02
fix rangeify elu fusion for openpilot ( #12341 )
...
* fix rangeify elu fusion for openpilot
* flip the metadata
* copy over permuted contiguous support
* this is correct
* update that
2025-09-30 11:41:52 +08:00
qazal
d95d018bb5
add name to multi rewrite [pr] ( #12346 )
2025-09-30 06:34:58 +03:00
qazal
05275c9ec3
rangeify: enable assign to mstack target ( #12345 )
2025-09-30 06:27:57 +03:00
chenyu
8e508a9927
reduce const folding ( #12344 )
2025-09-29 23:08:56 -04:00
chenyu
3a480b858f
use more getitem in gpt2 ( #12343 )
2025-09-29 23:08:03 -04:00
qazal
32d69d07d7
rangeify: enable multitensor TestBatchNorm ( #12342 )
2025-09-30 06:05:00 +03:00
Sieds Lykles
d55d829635
Lower index dtype spec fix ( #12337 )
...
* new pm_lower_index_dtype
* load_store_indexing after index lowering
* shorten line
* separate rule for long removal
* fix test
* fix index_to_concrete_int
* minor fixes
* add sink there
* update types in linearizer test
2025-09-30 04:26:50 +02:00
Sieds Lykles
c38f6ce140
unified_rewrite: use deque and dont add nodes to the stack multiple times ( #12320 )
...
* use deque instead of list
* increase ctx.progress and max stack_len
* add openpilot
* prevent placing uops on stack many times
* revert increasing ctx.progress and stack length limit
* dont block adding to the stack there
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-09-30 10:02:28 +08:00
hooved
c2689c505e
Clip model updates for Stable Diffusion mlperf training ( #12313 )
...
* stable diffusion mlperf clip changes
* add clip tests
* set gelu as attribute
* add more tests
* factor out GPUS
* rerun CI
* add imports to if blocks
* remove unneeded axis
* add clip tests to CI
* move clip tests
* add deps, disable max buf size
2025-09-29 21:50:14 -04:00
George Hotz
cdfa0f29fd
add rendering to index ( #12338 )
2025-09-30 09:18:05 +08:00
George Hotz
baf3b60cfb
fix gpt2 on rangeify ( #12335 )
2025-09-29 19:16:44 +08:00
qazal
9513f025c5
apply multi before rangeify ( #12298 )
...
* it doesn't realize it when i reshape
* cleaner graph
* map out
* REDUCE_AXIS also gives the wrong answer
* maybe
* work
* back here
* try
* more
* refactor tests
* check MultiBuffer
* or copy
* fine with this
* don't need graph_rewrite_map in rangeify
2025-09-29 14:16:31 +03:00
George Hotz
b899392f30
fix llm app with rangeify ( #12334 )
...
* fix llm app with rangeify
* add gpt2 contiguous also
2025-09-29 18:42:44 +08:00
wozeparrot
7ae6898e31
better late bufferview ( #12333 )
2025-09-29 03:08:34 -07:00
George Hotz
3291e00df7
fix efficientnet slowness on rangeify ( #12332 )
2025-09-29 18:01:01 +08:00
chenyu
9d2f2b8e34
skip test_mean_half_precision_overflow ( #12331 )
...
it only works with SPLIT_REDUCEOP=1
2025-09-29 05:15:04 -04:00
qazal
9915bcf2b4
remove no-op contiguous from rand ( #12329 )
2025-09-29 11:53:16 +03:00
chenyu
76c87d81b3
delete test_backward_sum_acc_dtype ( #12330 )
...
this test tests the wrong thing; it was only working because of the expand realize rule
2025-09-29 04:46:17 -04:00
George Hotz
fd2e4f2353
failing rng test ( #12328 )
...
* tighten spec: fixup devectorizer types / rangeify
* tighten assign
* failing rangeify test
* simpler
* otherwise contig
* more tolerance cause rng seed changed
2025-09-29 16:06:45 +08:00
George Hotz
29469577e8
tighten spec: fixup devectorizer types / rangeify ( #12327 )
...
* tighten spec: fixup devectorizer types / rangeify
* tighten assign
2025-09-29 15:41:11 +08:00
wozeparrot
a982480512
feat: late to_bufferview ( #12271 )
2025-09-29 00:29:43 -07:00
qazal
e01a3eb59a
rangeify whitespace cleanups [pr] ( #12326 )
...
* rangeify whitespace cleanups
* this is a noop
2025-09-29 10:04:51 +03:00
George Hotz
cf925d1ac5
remove metadata for rangeify codegen ( #12325 )
2025-09-29 14:29:28 +08:00
George Hotz
b252f890da
add support for SPEC=1 ( #12322 )
...
* add support for SPEC=1
* cleaner place for it
* non rangeify spec
* split non rangeify
2025-09-29 12:55:01 +08:00
qazal
292cb6ae26
viz: 404 if the requested rewrite doesn't exist ( #12323 )
2025-09-29 07:51:10 +03:00
qazal
250cb10e8f
rangeify permuted assign ( #12299 )
...
* enable RANGEIFY=1 test_assign
* work
* rangeify=0 asserts this ast
* remove that
* beta test, it's correct though
* skip multi
* matches torch/np output
* memcopy without memcopy
* can remove this
* rangeify isn't silently wrong anymore
* diff cleanup
* use UOp toposort instead of global tags
* actual assert TestRangeifyAssign
* step
* work
* this isn't optimizing away now
* some todos
* test fusion schedule
* typo
* dedup idxs
* cleaner
* pre
* work
* diff
2025-09-29 07:27:57 +03:00
Sieds Lykles
ed90de6583
Revert "Bufferize early, fix "children not making progress" on big graphs (#1…" ( #12318 )
...
This reverts commit 6f1cf717de .
2025-09-28 19:10:21 +02:00
Sieds Lykles
29f0886395
skip test_softmax_fusion tests if RANGEIFY==1 ( #12310 )
2025-09-27 05:57:40 +02:00
Sieds Lykles
b98f1881ef
dsp opt test has different axis number on rangeify ( #12309 )
2025-09-27 05:06:11 +02:00
Sieds Lykles
6f1cf717de
Bufferize early, fix "children not making progress" on big graphs ( #12308 )
...
* bufferize children early
* cleaner
* fix types
* lower number of reduceops
* test openpilot
2025-09-27 04:17:15 +02:00
qazal
0104b16b9b
rangeify: fix empty tags in reshapes ( #12307 )
2025-09-26 16:32:48 +03:00
nimlgen
f5eb46a3d9
fix limit buf metal on non rangeify ( #12303 )
...
* add failure test for limit buf on non rangeify
* correct metal
* correct
* hm
2025-09-26 11:06:28 +03:00
qazal
8b2e0930d7
rangeify: enable passing multi test ( #12301 )
2025-09-26 08:31:13 +03:00
Sieds Lykles
74411984fc
Rangeify IMAGE ( #12304 )
...
* add imagedtype to rangeify
* enable some image tests
* move the tests
* image upcast before locals
* add if statement
* rangeify image_dtype test
* decrease read_image count
2025-09-26 07:21:02 +02:00
wozeparrot
d2cd269e28
fix: try close mmap ( #12306 )
2025-09-25 20:54:27 -07:00
chenyu
17cec8d645
RANGEIFY winograd test ( #12297 )
...
speed seems fine
2025-09-24 23:42:32 -04:00
nimlgen
476a2a0a96
test_qcom: update ( #12293 )
2025-09-24 21:45:58 +03:00
qazal
38ecefaacb
RANGEIFY=1 allreduce ( #12260 )
...
* ci
* extract mops
* work
* assert early
* port this?
* can realize shard
* allreduce passing
* notes
* better handling of shard
* err
* outerworld allreduce twice
* work
* don't tag movement ops
* don't tag movement ops
* delete old logic
* 19 failing + ram
* cleanup
* reset stuff
* simplest failing test
* diff
* test_ones
* allreduce work
* allreduce more work
* down to 22 failing tests
* port _device_num
* replace creates a new UOp here
* pour symbolic everywhere
* 7 failing
* focus on allreduce
* work
* cleanup
* more ci
* fix test_schedule_ring
* post index const shape
* much better
* diff cleanup
2025-09-24 18:13:08 +03:00
qazal
0e778296be
rangeify: refactor const folding ( #12291 )
...
* rangeify: refactor const folding [pr]
* it got better
2025-09-24 17:58:39 +03:00
qazal
6c9d8c7e41
rangeify: simplify noop copy ( #12289 )
2025-09-24 17:01:23 +03:00
qazal
1400ce105f
rangeify: fix sharding ( #12288 )
2025-09-24 14:33:56 +03:00
qazal
154c865966
rangeify: fix ram usage in multi ( #12286 )
2025-09-24 13:48:58 +03:00
Sieds Lykles
e8945c74de
fix infinite symbolic loop with VCONST ( #12285 )
2025-09-24 07:06:22 +02:00
Sieds Lykles
45c7252aed
Better div nesting 2 ( #11812 )
...
* remove check
* use fold_divmod_congruence instead of simplify
* adjust tests
* shorten line
* new algo
* add test
* cleanup
* update tests
* ALLOWED_GATED_READ_IMAGE from 16 -> 12
* only remove the call to simplify
* add option to simplify with factor_remainder
* Allowed readimage gates back to 16
2025-09-24 04:50:26 +02:00
Sieds Lykles
6146c64d81
lower the invalid gate last ( #12164 )
...
* lowering invalid gate is part of lower_index_dtype
* update test
* remove import
* put that back
* reduce_collapse uses invalid
* fix that pattern to use invalid_pat
* valid creates the right dtype count
* separate rule for lowering invalid gate
* dont unvectorize Invalid gate
* image_fixup uses Invalid
* update tests
* cleanup
* update split_load_store
* add .scalar() there
2025-09-24 04:27:35 +02:00
qazal
ad7c8c21ea
rangeify: INDEX doesn't passthrough MSELECT ( #12279 )
2025-09-23 21:36:50 +03:00
nimlgen
02a7b7fe48
rangeify: fix test_setitem ( #12269 )
...
* rangeify: fix test_setitem
* um?
* better?
* simple where folding
* f
* revert
* x
2025-09-23 20:42:36 +03:00
qazal
2f145a98e0
rangeify: fix contiguous multi ( #12278 )
...
* rangeify: fix contiguous multi
* when it's changing root, it should construct a new UOp
2025-09-23 20:05:29 +03:00
nimlgen
5f4eeb054c
rangeify: passes now ( #12277 )
2025-09-23 18:46:49 +03:00
qazal
680ce54dd4
add types to replace_dnum ( #12276 )
2025-09-23 14:43:04 +03:00
chenyu
fffce0a6b4
use more no_range in simplify [pr] ( #12275 )
2025-09-23 02:33:56 -04:00
chenyu
51b88b2265
process replay tests in rangeify ( #12274 )
2025-09-23 01:30:06 -04:00
chenyu
b54cb272d0
move test_qcom to test/device ( #12272 )
2025-09-22 21:07:10 -04:00
Sieds Lykles
d21e34e617
enable test_sum_twice ( #12270 )
...
* remove skip
* remove import
2025-09-23 00:57:29 +02:00
Sieds Lykles
5a4b244e6b
Check for group inside another reduce ( #12268 )
...
* add check
* get the ranges correctly
* add test
* comment and better check
2025-09-23 00:32:41 +02:00
qazal
a6fd96f620
rangeify: don't tag movement ops ( #12267 )
...
* don't tag movement ops
* delete old logic
2025-09-22 16:40:17 +03:00
chenyu
b03ceb806e
move test_sample to test_randomness ( #12266 )
2025-09-21 21:11:32 -04:00
qazal
25e0b725d1
cleanup section 0 rangeify ( #12264 )
2025-09-22 00:30:44 +03:00
qazal
1aba668a37
cleanup buffer_view matcher ( #12263 )
2025-09-21 23:45:48 +03:00
nimlgen
b53a266254
rangeify: fix test_optim ( #12262 )
...
* rangeify: fix test_optim
* add to cl?
* these are good now
2025-09-21 18:08:35 +03:00
qazal
461e9becec
srender UOp in movement op arg ( #12261 )
2025-09-21 13:55:45 +03:00
Sieds Lykles
9569fdfa36
use str for AxisType and AddrSpace __repr__ ( #12252 )
2025-09-21 05:24:41 +02:00
qazal
8365c28cd5
viz: put a limit of brightness scale ( #12259 )
2025-09-20 18:52:55 +03:00
nimlgen
4762a24022
test_free_intermediates force buffers ( #12255 )
...
* test_free_intermediates force buffers
* f
* fix for rangeify
* xx
2025-09-20 18:14:39 +03:00
qazal
57c7e0a8f8
RANGEIFY=1 test_jit ( #12254 )
...
* RANGEIFY=1 test_jit
* don't do any of that
* disk
* simple disk tensor
* more work
* run more tests
* it also doesn't copy every time
* skip tests that hang everything
2025-09-20 17:34:32 +03:00
chenyu
393c6b236c
test case to sum twice in different order ( #12253 )
...
* test case to sum twice in different order
fixed by #12251
* try metal
2025-09-20 10:11:57 -04:00
qazal
4756971c88
skip test_bf16_disk_write_read on CL=1 ( #12256 )
2025-09-20 17:11:06 +03:00
chenyu
5e794be8af
tighter spec for RANGE ( #12250 )
2025-09-20 07:59:50 -04:00
Sieds Lykles
73c8dae60d
add missing remove_blockend case ( #12251 )
...
* add missing remove_blockend case
* remove expectedFailure
* better comment
2025-09-20 06:29:19 +02:00
wozeparrot
dc4dd898b7
fix: close mmap ( #12249 )
2025-09-19 14:09:12 -07:00
Sieds Lykles
bb1f376ae6
profile z3 ( #12248 )
2025-09-19 22:52:06 +02:00
Sieds Lykles
7e06d3ebba
enable test_symbolic_jit ( #12245 )
...
Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2025-09-19 20:23:42 +02:00
qazal
bb59eed82f
rangeify: don't tag consts, they are global ( #12247 )
...
* rangeify: don't tag consts, they are global
* don't map movement ops
* sym failing test
* remove that
* update comment
* simpler test
* work
2025-09-19 15:25:03 +03:00
Sieds Lykles
cc038b31b6
Shrink instead of reshape to unregister symbolic ( #12241 )
...
* Slice to unbind symbolic
* use vmax for now
* assert shape in reshape is valid
* update test_symbolic_ops to use shrink instead of reshape
* remove infer_with_bound_values for now
* symbolic output doesnt have symbolic strides
* symbolic jit tests use shrink to unregister symbolic
* update test
* update more tests
* wrap vmax in int()
* only create a new st if the store is not an assign
* unwrap st
* comments
2025-09-19 06:04:35 +02:00
chenyu
a531a649fb
test_resize_upsample_scales_cubic_align_corners_cpu is fixed ( #12244 )
2025-09-18 20:55:26 -04:00
Sieds Lykles
8d703a6369
z3 xor doesnt use bitcast ( #12243 )
2025-09-19 00:31:44 +02:00
chenyu
0dad6cc518
good RANGEIFY kernel counts in external_test_opt ( #12242 )
...
no push permute stuff. the model ones are less clear if it's good, some got slower
2025-09-18 17:58:54 -04:00
chenyu
cff1065f5e
test CL=1 RANGEIFY=1 onnx ( #12240 )
...
all except test_resize_upsample_scales_cubic_align_corners_cpu runs
2025-09-18 16:49:46 -04:00
Sieds Lykles
ef05178855
fix 0//0 infinite rewrite in rangeify onnx ( #12239 )
2025-09-18 21:59:50 +02:00
chenyu
87707ef0b8
unify range_start [pr] ( #12236 )
2025-09-18 13:52:54 -04:00
qazal
825f148469
rangeify: fix copy size mismatch errs ( #12232 )
...
* rangeify: fix copy size mismatch errs
* const folding can happen in sym
assert it
* shippable
* rangeify copy is completely wrong
* pre_bufferize
* tag bufferize
* pre back
2025-09-18 18:23:32 +03:00
chenyu
f82b16a0e9
RANGEIFY test_tensor ( #12235 )
2025-09-18 10:35:43 -04:00
chenyu
7487c13b61
truncate_fp16 -> float_to_fp16 ( #12234 )
...
match float_to_bf16 and float_to_fp8
2025-09-18 09:48:27 -04:00
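A minimal sketch of what a float_to_fp16-style helper could look like, assuming it rounds a Python float to the nearest half-precision value and maps out-of-range inputs to a signed infinity. This is an illustration built on the standard library only, not the code from the commit; the overflow handling is an assumption.

```python
import math
import struct

def float_to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value (illustrative helper)."""
    try:
        # struct's 'e' format is binary16: packing rounds, unpacking returns the rounded value
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct raises when |x| is too large for fp16; assumed behavior: return a signed infinity
        return math.copysign(math.inf, x)

print(float_to_fp16(1.0))    # exactly representable in fp16
print(float_to_fp16(1e10))   # out of fp16 range
```

Round-tripping through struct keeps the sketch dependency-free; a real implementation may handle rounding and special values differently.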
b1tg
54c15d74a4
python float8 support ( #11960 )
...
* basic support
* alu
* nan in exec_alu
* rand_for_dtype
* inf + 0.0
* finfo
* revert rand_for_dtype
* clean
* truncate fp8s inf
* spec ok
* float_to_fp8 nan/inf
* least_upper_dtype
* clean up
---------
Co-authored-by: b1tg <b1tg@users.noreply.github.com>
2025-09-18 09:17:09 -04:00
qazal
dbbc261075
rangeify: fix COPY simplifier ( #12233 )
2025-09-18 14:35:33 +03:00
Sieds Lykles
f1108f1cbe
Enable test_symbolic_ops on rangeify ( #12230 )
...
* enable
* merge correctly
2025-09-18 02:12:36 +02:00
Sieds Lykles
812f485cd7
Enable threefry_doesnt_use_long test on rangeify ( #12229 )
...
* dont bufferize rangeify
* enable doesnt_use_long test
2025-09-18 01:58:34 +02:00
nimlgen
3c5b8bf50c
am: bump fw to rocm7 ( #12226 )
2025-09-17 21:20:22 +03:00
qazal
525f80e0d2
rangeify: enable putting consts back in the tensor graph ( #12225 )
...
* rangeify: enable putting consts back in the tensor graph
* work
* sym in ci
2025-09-17 19:45:04 +03:00
chenyu
edffc246ed
MUL in reduce_unparented ( #12223 )
...
* MUL in reduce_unparented
* some test
2025-09-17 11:56:39 -04:00
qazal
7733c217c5
remove spam comments in test_schedule ( #12224 )
2025-09-17 18:24:55 +03:00
qazal
d917895569
map out rangeify errors in test_schedule ( #12211 )
...
* map out rangeify errors in test_schedule
* skip that
* add to ci
2025-09-17 09:10:28 +03:00
Sieds Lykles
158506b91e
Upgrade some divmod folding for symbolic divs ( #12216 )
...
* use const_factor() instead of arg
* add test
* change div min_max
* add tests
* add divide_by_symbolic_gcd
* add tests
* one more test
* Slice to unbind symbolic
* deal with const factor properly
* minor cleanup
* divide_by_symbolic_gcd becomes UOp.gcd and UOp.divide_exact
* add tests
* add gcd_without_const
* fix divide_exact bug
* add factor_remainder
* add tests
* fix imports
* elif -> if
* remove expectedFailure
* add more tests
* add more unwrap
* fix signature of pop_const
* remove that
* remove that
2025-09-17 03:00:50 +02:00
Sieds Lykles
328bfe6b9b
fix map_expand for symbolic shapes ( #12218 )
...
fix incorrect default argument in resolve
2025-09-17 01:20:18 +02:00
chenyu
5b12764b83
add arange cat arange test ( #12217 )
...
simple test case to catch wrong reduce const folding. also clean up the old arange complexity test
2025-09-16 17:12:32 -04:00
nimlgen
53655a4ee5
cuda: cleanup old comment ( #12215 )
2025-09-16 23:11:32 +03:00
chenyu
6b808c5fe6
update TestSymbolicJit.test_plus1_pad ( #12214 )
...
was failing because movement was not captured
2025-09-16 15:57:50 -04:00
Shun Usami
2a72b00679
Add test for 2D tensor indexing in setitem ( #12193 )
...
* Add test for 2D tensor indexing in setitem
* Fix _masked_setitem to handle multi dim indexing correctly
* Fix indent
* Add fuzz test for 3D tensor indexing in setitem
* Skip indexing fuzz test (slow)
2025-09-16 14:57:25 -04:00
chenyu
c7b03457d7
Revert "Revert "more llvm intrinsics ( #11961 )" ( #12194 )" ( #12195 )
...
This reverts commit df1c183e46 .
2025-09-16 14:55:31 -04:00
chenyu
494bb12500
skip slow cifar bf16 on red benchmark ( #12213 )
...
very slow to compile the fake bf16
2025-09-16 14:55:01 -04:00
chenyu
419e997187
increase benchmark timeout ( #12212 )
...
account for compile cache; it's also annoying that a job dying from a timeout messes up the machine
2025-09-16 14:09:02 -04:00
chenyu
84d2d047ea
Tensor.pad_to and Tensor.shrink_to ( #12210 )
...
most of the time i want this instead of spelling out the args
also add more input validation to shrink
2025-09-16 12:24:55 -04:00
qazal
122a50fe8c
assert kernel count ( #12205 )
2025-09-16 14:24:39 +03:00
chenyu
e555748807
test rangeify const folding ( #12200 )
...
* test rangeify const folding
reduce i know how to fix, multi and test_cast_padded tbd
* test_instancenorm_3d is very slow
2025-09-15 20:03:48 -04:00
chenyu
f732f66709
rangeify test_nn almost pass ( #12198 )
...
* rangeify test_nn almost pass
* issue with jit
* flaky
2025-09-15 17:49:20 -04:00
chenyu
82e037aad5
ci test.yml updates ( #12197 )
...
* ci test.yml updates
move docs together and external_benchmark_schedule to unit
* torch
2025-09-15 17:09:02 -04:00
chenyu
146c31586d
split RANGEIFY ci ( #12196 )
...
one CPU and one CL for speed
2025-09-15 15:41:10 -04:00
chenyu
df1c183e46
Revert "more llvm intrinsics ( #11961 )" ( #12194 )
...
This reverts commit d01e3d7719 .
2025-09-15 13:56:43 -04:00
b1tg
d01e3d7719
more llvm intrinsics ( #11961 )
...
* more llvm intrinsics
* assert nan
* skip test_log_nan on metal
---------
Co-authored-by: b1tg <b1tg@users.noreply.github.com>
2025-09-15 13:05:23 -04:00
nimlgen
b63bd02969
update runtime docs ( #12191 )
2025-09-15 17:46:20 +03:00
qazal
57e8bf61e8
viz: fix Specificity for rect styling ( #12190 )
2025-09-15 17:33:37 +03:00
chenyu
72e010d816
fix rangeify ci ( #12189 )
...
CL=1, and multitensor needs to test with CPU since CL does not support multi in CI
2025-09-15 10:24:57 -04:00
qazal
f1bd06134d
test fuse with RANGEIFY=2 ( #12187 )
2025-09-15 15:51:23 +03:00
qazal
ef0ef705fe
viz: remove async from event listener ( #12186 )
2025-09-15 15:08:28 +03:00
qazal
d8855ec266
viz/serve.py cleanups ( #12185 )
...
* don't assign unused variable
* *path to
2025-09-15 13:43:26 +03:00
qazal
b8a74c1569
cpu: add disassembler err message ( #12184 )
...
* cpu: add disassembler err message
* print msg
2025-09-15 13:29:44 +03:00
qazal
a388d2cb1a
remove PROFILE=1 option, it's just VIZ=1 [pr] ( #12176 )
...
* remove PROFILE=1 option, it's just VIZ=1 [pr]
* sqtt
* sqtt 2
* return last
* rename
2025-09-15 12:51:50 +03:00
George Hotz
65397bfdeb
set testpath on pytest ( #12183 )
2025-09-15 16:13:05 +08:00
George Hotz
ae0edc8a67
renumber ranges ( #12182 )
...
* enable rangeify const folding
* renumber ranges for kernel deduping
2025-09-15 13:03:39 +08:00
hooved
e1fef895b1
don't hardcode weights path ( #12171 )
2025-09-15 00:33:47 -04:00
hooved
3a9db08b49
download data and ckpts for sd train/eval ( #12170 )
2025-09-15 00:31:45 -04:00
chenyu
bdb3afd566
failed test case for symbolic pad ( #12179 )
2025-09-15 00:25:21 -04:00
George Hotz
9fcc87761e
enable rangeify const folding ( #12181 )
2025-09-15 12:02:19 +08:00
George Hotz
1353250b6c
tags on bufferize are the tensor tags ( #12180 )
2025-09-15 11:46:03 +08:00
George Hotz
60d7db093e
delete bufferized consts + output noops ( #12163 )
...
* bring const folding to rangeify
* comment that
2025-09-15 11:07:44 +08:00
qazal
525c20dc7e
viz: remove unused runtime_stats feature ( #12177 )
2025-09-15 02:53:05 +03:00
qazal
75ff9b7a9a
viz: add buffer lifetime to tooltip ( #12175 )
2025-09-15 02:33:50 +03:00
chenyu
15b166ce6d
bump test_module_runs to 30 seconds ( #12174 )
...
25 seconds sometimes
2025-09-14 16:48:40 -04:00
ttomsa
943236ef74
move cast pat out of symbolic_simple ( #11945 )
...
* move pat
* move it here
* rm extra check
---------
Co-authored-by: Sieds Lykles <93992551+S-Lykles@users.noreply.github.com>
2025-09-14 21:39:48 +02:00
Steven Shi
25b1bc8eff
added top k sampling to examples/mamba ( #12061 )
2025-09-14 15:27:34 -04:00
Shun Usami
34a05b31fe
Fix advanced tensor indexing setitem ( #12128 )
...
* Add failure test case for advanced tensor indexing setitem
* Fix advanced tensor indexing setitem when permuted
* Reduce line count
* Revert unnecessary change
* Combine two lines into one
2025-09-14 15:22:40 -04:00
chenyu
d09c0f28c5
increase test_module_runs ( #12173 )
...
timed out on ci windows llvm
2025-09-14 15:19:21 -04:00
chenyu
12a910f1d2
update torch 2.8 ( #12172 )
...
support _reshape_alias. something is wrong with one case of unfold
2025-09-14 15:19:03 -04:00
chenyu
98ecab7563
remove ml_dtypes ( #12169 )
2025-09-14 14:20:05 -04:00
qazal
02054b53fe
remove tests that pre date the uop spec ( #12168 )
...
* remove tests that pre date the uop spec
* const src
* for RANGEIFY=1
* update with bind
* remove import
2025-09-14 18:47:42 +03:00
qazal
1591e4f66b
update outbufs selection in test_linearizer [pr] ( #12166 )
2025-09-14 13:46:49 +03:00
nimlgen
d1ae30f7ef
hcq: do not spam with errors in -m device ( #12150 )
...
* hcq: do not spam with errors in -m device
* um?
* um?
* nn
* helps?
* um?
* no gc?
* fix
2025-09-14 10:56:59 +03:00
George Hotz
d5bc27797b
fix some multitensor on rangeify ( #12162 )
...
* fix some multitensor on rangeify
* rangeify multi hacks
* copy on const
2025-09-14 14:31:57 +08:00
Meng Zhuo
4b7904eca9
add cpu support for riscv64 ( #12136 )
2025-09-14 11:40:58 +08:00
George Hotz
bcafa72b7f
use tags instead of graph_rewrite_map in rangeify ( #12110 )
...
* use tags instead of graph_rewrite_map in rangeify
* new style, add realize
* metadata works
* simple failure
* fix
* loops
* stuff becomes a NOOP when you remove it
* stuff becomes a NOOP when you remove it
* tags on bufferize
* bmnist works
* locals don't work
* shippable
* fix some tests
* simpler map_realize
* remove const hack
* debuggable test
* broke
* assign test
* straight up bug
* wooo it passes
* sink shouldn't be there
* fix ops
* bmnist
* kv cache ish
* Set RANGEIFY context variable to 0
* should work normal
* better
* types
* hacks to fix test_symbolic
* pm_add_buffers
* tests should pass
2025-09-14 11:39:01 +08:00
chenyu
d2316ba91a
don't validate output in sdxl with fakeweights ( #12160 )
...
NULL backend passed validation before because both desired and actual went through NULL backend
2025-09-13 21:47:51 -04:00
nimlgen
b1d1816f43
device: fix envvars ( #12159 )
2025-09-13 23:38:09 +03:00
nimlgen
19d9d29b7e
device: compilers in tinygrad.device ( #12151 )
...
* hcq: do not spam with errors in -m device
* -m tinygrad p2
* fix
* ugh
* comp in ckey
* fix
* one more
* print defaults
* xx
2025-09-13 21:45:29 +03:00
qazal
6410dcb7c2
viz: less verbose render loop ( #12158 )
...
* define visible once
* move y offsets to one place
2025-09-13 19:04:37 +03:00
nimlgen
92df52d79a
make method_cache account for compiler ( #12156 )
...
* make method_cache account for compiler
* sorry
2025-09-13 17:00:11 +03:00
chenyu
0c392089d9
update mypy ( #12155 )
2025-09-13 09:48:38 -04:00
qazal
fbca6183ad
do not launch BEAM when opts_to_apply exists [pr] ( #12152 )
2025-09-13 14:57:46 +03:00
George Hotz
b2a95d32bb
check clSetKernelArg ( #12149 )
2025-09-13 17:24:55 +08:00
George Hotz
0695e322a8
fix android cpu device ( #12148 )
2025-09-13 15:42:04 +08:00
Sieds Lykles
e3a3764917
delete fold_unrolled_divs ( #12146 )
2025-09-13 03:09:36 +02:00
Sieds Lykles
51ed6e94b2
AxisType __repr__ method ( #12145 )
2025-09-13 01:15:38 +02:00
Sieds Lykles
0757a9a819
add pytest-timeout of 3 min per item ( #12144 )
...
* add pytest-timeout with timeout of 3 min
* func_only
2025-09-13 00:48:41 +02:00
Sieds Lykles
2fc0bd150b
Arange overflow raises error and one_hot upcast ( #11975 )
...
* add error
* to_dtype
* shorten line
* add test
* upcast one hot dim if overflows
2025-09-13 00:18:25 +02:00
chenyu
aac3dceaf6
merge two PYTHON backend ci job ( #12143 )
...
* merge two PYTHON backend ci job
and mark anything that takes > 10 in test_ops slow
* two more
2025-09-12 17:36:46 -04:00
ttomsa
a12d0933c1
fix vec dtype in fast idiv ( #12080 )
...
* fix
* add vec dtypes to fuzzer
* add vec=False
---------
Co-authored-by: Sieds Lykles <93992551+S-Lykles@users.noreply.github.com>
2025-09-12 23:00:43 +02:00
chenyu
25091951ba
update test/models ( #12142 )
...
minor fix and run more stuff in tinygrad for speed
2025-09-12 16:43:28 -04:00
Sieds Lykles
62376c8b2b
update store load noop pattern to use Invalid ( #12141 )
...
* update pattern
* add test
2025-09-12 22:25:53 +02:00
chenyu
647965fb09
test_train cleanup ( #12140 )
...
* test_train cleanup
remove skipIf due to buffer sizes, runs locally
* those are slow
2025-09-12 13:21:30 -04:00
chenyu
0fad07c684
viz serve default path ( #12139 )
...
`python tinygrad/viz/serve.py` shows last session instead of an empty page
2025-09-12 18:32:44 +03:00
nimlgen
81e33b8439
system: cpu memory mappings are uncached ( #12137 )
...
* system: cpu memory mappings are uncached
* adm amd
2025-09-12 13:28:25 +03:00
qazal
68b0ad05a4
viz: format tuple tags ( #12135 )
...
* viz: format tuple tags
* use python repr
2025-09-12 11:36:53 +03:00
qazal
e80c8a7548
merge TestIndexing with TestSchedule + remove duplicate tests ( #12134 )
...
* merge TestIndexing with TestSchedule
* remove the arange_copy tests
* no FUSE_ARANGE import
2025-09-12 10:35:14 +03:00
Sieds Lykles
b5a3b8de20
remove where on gated load if gates are the same ( #12129 )
...
* add rules
* add tests
2025-09-12 06:52:35 +02:00
George Hotz
a2f502b89e
fix rangeify=1 ops on GPU ( #12130 )
2025-09-12 11:17:37 +08:00
George Hotz
0766616962
isolate the const hacks in the old kernelize ( #12126 )
...
* isolate the const hacks in the old kernelize
* if rangeify, don't waste time
2025-09-12 08:35:35 +08:00
Sieds Lykles
1f3950a484
Invalid idx ( #12067 )
...
* merge index_dtype_3
* new lowering with Invalid idx
* remove that dtype from range
* finish merge
* annotate better
* indentation
* dont need that anymore
* always process replay for openpilot
* more uop_given_valid for idx
* valid past index_child
* fix bug preventing load getting an alt value
* add track_match_stats back in in shapetracker and remove cache
* get_valid_idx -> get_valid and get_idx
* fix heuristics with new idx
* split line
* fix typo
* fix signature
* dont skip idx if stride is 0
the idx may still be invalid
* lower const with new valid
* delete to_indexed_uops
* update shapetracker test
* delete axis_is_masked
* add cache back
* move around comment
* fix get_valid bug
* move invalid fold to symbolic so its earlier
* cleanup
* update applying padto to new idx
* add unit tests
* cleanup
* fold line
* improve spec
* dont try to render Invalid as a float
* more consistent invalid index
* update some tests
* Fold index with true cond
* skip test
* vconst min max if Invalid in arg
* fix signature of UOp.const
* add test for min/max of Invalid CONST/VCONST
* add InvalidType to as_const signature
* is Invalid to isinstance
* Add InvalidType to ConstLike
* index gate is a where gate
* make that a metaclass
* fix heuristics for new idx
* mypy happy
2025-09-12 01:42:02 +02:00
chenyu
544eb2c402
clean up test_scatter_reduce ( #12125 )
2025-09-11 16:36:58 -04:00
chenyu
9ad6a56d17
smaller test_simple_reduce ( #12124 )
2025-09-11 15:45:38 -04:00
chenyu
e5ef9ec5b1
remove IGNORE_OOB=0 in ci tests ( #12117 )
2025-09-11 15:05:04 -04:00
chenyu
3a83b56da5
fix test_dequantization_mxfp4 ( #12123 )
...
* fix test_dequantization_mxfp4
* assert_allclose
* rtol
2025-09-11 14:22:06 -04:00
chenyu
520e2e0727
actually run unit tests in ci MacOS (unit) ( #12122 )
...
* actually run unit tests in ci MacOS (unit)
* that's always wrong
2025-09-11 13:32:30 -04:00
nimlgen
acb700fc26
ci: fix ptx env ( #12120 )
2025-09-11 12:42:15 -04:00
chenyu
20cd7177de
delete test_bert_fuse_arange ( #12121 )
...
* delete test_bert_fuse_arange
it's the default now and we are not interested in FUSE_ARANGE=0 version
* remove -v
2025-09-11 12:35:51 -04:00
chenyu
b07f962058
split metal model tests ( #12119 )
...
* split metal model tests
* llama too
2025-09-11 12:20:12 -04:00
chenyu
66593f135f
remove duplicated test_real_world ( #12118 )
...
included in the test/models right below
2025-09-11 11:57:14 -04:00
qazal
e76211fcbc
viz: specify all rect styles in parent ( #12115 )
...
* viz: specify all rect styles in parent
Visually a no-op, but it's easier to reason about when the rect's coloring comes from `g` parent that holds UOp data.
* this stays
2025-09-11 13:48:59 +03:00
nimlgen
400ad93892
ci: gate boost paths for macos only ( #12114 )
2025-09-11 12:48:34 +03:00
George Hotz
3ef0e5e01e
rangeify: use Ops.REALIZE and not Ops.CONTIGUOUS if it's added by system ( #12111 )
...
* rangeify: use Ops.REALIZE and not Ops.CONTIGUOUS if it's added by system
* fix contig + BufferizeOpts
* no outerworld
2025-09-11 11:56:59 +08:00
b1tg
52ebed991e
change dtype promo lattice when fp8s is supported ( #12088 )
...
* change dtype promo lattice when fp8s is supported
* no device check
* int64 + uint64 => fp8
2025-09-10 22:09:11 -04:00
George Hotz
d4eba5800d
rangeify cost function infrastructure ( #12091 )
...
* one call to hc opt
* does that pass?
* add cost function to rangeify
* test
* more test
* gate thread
* bufferize has shape
* ish
* match old behavior
* no ci there
2025-09-11 07:19:53 +08:00
qazal
78610b681e
viz: light up children ( #12107 )
...
* viz: light up children
* keep tag coloring
2025-09-11 01:28:01 +03:00
Sieds Lykles
3989f5b559
Revert "Simplify valid in symbolic ( #12104 )" ( #12108 )
...
This reverts commit 73d479a016 .
2025-09-10 23:36:40 +02:00
Sieds Lykles
73d479a016
Simplify valid in symbolic ( #12104 )
...
* cleanup cast_folding
* from sym to symbolic
* no more sym in dtype lowering
* move around simplify_valid
* update test
2025-09-10 23:26:19 +02:00
chenyu
e306650d39
remove GPUDevice ( #12106 )
2025-09-10 16:35:00 -04:00
George Hotz
d8a7a1c9c7
BUFFERIZE shape should be each range, not the product ( #12105 )
...
* BUFFERIZE shape should be each range, not the product
* fix tests
* resolve
2025-09-11 04:02:24 +08:00
Sieds Lykles
3730172c10
cleanup cast_folding ( #12101 )
...
* cleanup cast_folding
* from sym to symbolic
* no more sym in dtype lowering
2025-09-10 21:30:20 +02:00
chenyu
0e266f376c
ops_gpu -> ops_cl ( #12103 )
2025-09-10 15:15:48 -04:00
chenyu
0599e86186
replace hardcoded GPU in llama debug msg ( #12102 )
2025-09-10 13:56:40 -04:00
qazal
5a84d86db7
viz: fix buffer tooltip offset ( #12100 )
...
* fixup offsets
* add buffer num to tooltip
2025-09-10 20:12:20 +03:00
nimlgen
fb96394ff5
auto-select available compilers ( #12094 )
...
* device: auto select compilers
* fix
* metal+opencl
* nv/cuda
* test without ptx
* ptx
* fix tests
* fix
* fix test
* rename
* test + cleaner
* xx
* ops
* better test
* win?
* um?
* types
* debug
* win??
* sep rung
* wtf?
* debug
* skip win
* revert this
* types
2025-09-10 19:52:01 +03:00
chenyu
bb67829e99
raise KernelOptError in TC _apply_tc_opt ( #12099 )
...
currently getting
```
2025-09-10 13:18:19 File "/home/chenyu/tinygrad/tinygrad/codegen/opt/search.py", line 149, in beam_search
2025-09-10 13:18:19 acted_lins: list[Scheduler] = flatten([get_kernel_actions(lin, include_0=False).values() for lin,_ in beam])
2025-09-10 13:18:19 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-09-10 13:18:19 File "/home/chenyu/tinygrad/tinygrad/codegen/opt/search.py", line 107, in get_kernel_actions
2025-09-10 13:18:19 lin2.apply_opt(a)
2025-09-10 13:18:19 File "/home/chenyu/tinygrad/tinygrad/codegen/opt/postrange.py", line 169, in apply_opt
2025-09-10 13:18:19 ret = self._apply_tc_opt(use_tensor_cores, cast(int, opt.axis), tc_select, tc_opt)
2025-09-10 13:18:19 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-09-10 13:18:19 File "/home/chenyu/tinygrad/tinygrad/codegen/opt/postrange.py", line 235, in _apply_tc_opt
2025-09-10 13:18:19 idx = self.rngs.index(a)
2025-09-10 13:18:19 ^^^^^^^^^^^^^^^^^^
2025-09-10 13:18:19 ValueError: UOp(Ops.RANGE, dtypes.index, arg=(1002, <AxisType.REDUCE: 6>), src=(
2025-09-10 13:18:19 UOp(Ops.CONST, dtypes.index, arg=15, src=()),)) is not in list
```
2025-09-10 12:32:19 -04:00
George Hotz
84b249ef0e
move simplify reduce out of devectorizer ( #12098 )
2025-09-10 21:24:57 +08:00
qazal
5d66a2d885
viz: refactor range clipping ( #12097 )
2025-09-10 16:23:46 +03:00
George Hotz
9789337722
early reduce simplify ( #12046 )
...
* early reduce simplify
* min changes
* need that
* that goes in simplify
* no more arange reduce opt
2025-09-10 21:02:46 +08:00
nimlgen
21e6926a6a
HostLLVMCompiler -> CPULLVMCompiler ( #12096 )
2025-09-10 14:04:16 +03:00
nimlgen
551560b87c
do not use getenv('PTX') in tests ( #12095 )
...
* test without ptx
* fix tests
* fix test
* linters
2025-09-10 14:04:07 +03:00
Sieds Lykles
0e420e68b4
delete axis_is_masked ( #12092 )
2025-09-10 05:26:19 +02:00
George Hotz
ef53a6fc19
one call to hc opt ( #12074 )
...
* one call to hc opt
* does that pass?
* Clean up postrange.py by removing comments
2025-09-10 11:18:18 +08:00
Sieds Lykles
499f50483b
x | !x -> True ( #12090 )
2025-09-10 03:26:01 +02:00
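The rule named in this title is the boolean identity that a value OR-ed with its own negation is always true. A minimal sketch of such a fold on a toy expression tree follows; the tuple encoding and the name fold_or_not are hypothetical and are not tinygrad's pattern matcher, and matching both operand orders is an assumption.

```python
def fold_or_not(expr):
    """Rewrite ('or', a, ('not', a)) or ('or', ('not', a), a) to True; return other expressions unchanged."""
    if isinstance(expr, tuple) and len(expr) == 3 and expr[0] == "or":
        _, lhs, rhs = expr
        # either operand may be the negation of the other
        if rhs == ("not", lhs) or lhs == ("not", rhs):
            return True
    return expr

# the identity itself holds for both boolean values
assert all((x or (not x)) is True for x in (False, True))

print(fold_or_not(("or", "x", ("not", "x"))))
print(fold_or_not(("or", "x", ("not", "y"))))
```

The fold only fires when the two operands are structurally identical up to the negation, so unrelated operands are left alone.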
Sieds Lykles
5b73076e48
assert benchmark times ( #12042 )
...
* assert jitted times in openpilot
* better error
* better error
* add ASSERT_MIN_STEP_TIME to more models
* t is step_times
* update benchmark times
* update times
2025-09-09 23:40:02 +02:00
b1tg
58d13a6e3e
remove redundant check ( #12087 )
2025-09-09 15:15:39 -04:00
qazal
71fcb23d4a
viz: cleanup renderDag ( #12086 )
2025-09-09 19:19:45 +03:00
b1tg
82e955fe79
fix inf bug in float_to_fp8 ( #12085 )
2025-09-09 12:02:56 -04:00
b1tg
14faf7a5c0
AutoCastType tests for fp8s/bf16 ( #12084 )
2025-09-09 11:33:01 -04:00
qazal
5e76eff26d
viz: pre fetch workers ( #12083 )
...
* viz: pre fetch workers
* move check
2025-09-09 15:56:39 +03:00
qazal
5fde033794
viz: prune worker payload ( #12082 )
2025-09-09 14:45:13 +03:00
nimlgen
1c6c42715f
unify cpu and llvm ( #11982 )
...
* try unify cpu and llvm
* fixes
* fix
* ops
* no llvm
* fix
* rm
* lvmm is ot
* oops
* override
* no llvm
* ignore
* skip llvm
* ooops
2025-09-09 13:54:44 +03:00
qazal
50cc7175cb
viz: use complete progress helper ( #12081 )
...
* viz: use complete progress helper
* min diff
* rename show to start
2025-09-09 11:00:52 +03:00
Sieds Lykles
239091d111
numba>=0.55 for uv resolution ( #12079 )
...
* force numba version
* update comment
2025-09-09 01:43:32 +02:00
chenyu
2bd1fff79c
ci GPU misc cleanups ( #12078 )
2025-09-08 16:47:29 -04:00
chenyu
1781d5bced
remove PYTHONPATH in test.yml ( #12077 )
...
set globally already
2025-09-08 15:41:47 -04:00
nimlgen
9182948951
remove llvm_bf16_cast ( #12075 )
2025-09-08 20:51:15 +03:00
chenyu
11213398b9
reorder amdremote in test yml ( #12073 )
2025-09-08 13:43:04 -04:00
nimlgen
ebbcdd6577
cpu: use suppress_finalizing ( #12071 )
2025-09-08 18:28:09 +03:00
qazal
73ca0e870c
viz: index visible rects ( #12070 )
2025-09-08 17:37:17 +03:00
chenyu
d40f5b766b
default BEAM_PADTO to 0 ( #12069 )
...
seems incorrect, disable by default now
2025-09-08 10:17:03 -04:00
Sieds Lykles
75b58fe2d3
move simplify_valid pat to sym ( #12065 )
...
* move simplify_valid pat to sym
* fix expectedfailure
2025-09-08 07:01:26 +02:00
chenyu
56861852be
enable IMAGE for test_mnist and test_mnist_backward ( #12064 )
...
passes now
2025-09-07 09:06:39 -04:00
nimlgen
ef71acc88a
hcq: cleanup fileio iface ( #12063 )
...
* hcq: cleanup fileio iface
* typo
* _
2025-09-07 15:43:27 +03:00
nimlgen
35ddfc3d39
change default cpu_count ( #12062 )
2025-09-06 23:30:20 +03:00
nimlgen
97187bf8b6
cleanup win and arch checks ( #12060 )
...
* cleanup win and arch checks
* stupid mypy
2025-09-06 23:08:46 +03:00
Sieds Lykles
f326df8ae8
add type: ignore ( #12059 )
2025-09-06 21:17:35 +02:00
George Hotz
c66935f7b9
only run hcopts once ( #12053 )
...
* only run hcopts once
* same?
2025-09-06 11:14:52 -07:00
qazal
801be5f7b9
viz: memory graph cleanups ( #12057 )
...
* delete the total nbytes tooltip
* split pixel rescaling from layout
2025-09-06 19:44:53 +03:00
nimlgen
10ac427aaa
cpu threading ( #11951 )
...
* start cpu threading
* fix
* fix2
* fix
* hacks?
* threads
* minor
* no dsp
* dsp 2
* n
* more
* test
* xm
* cleaner
* readable
* f
* reorder
* when no threads
* rangeify
* typos
* not needed
* reapply
* remove this
* linter
* fixed cpu count in ci
* fix
* fixes
* rm
* typo
* sort based on speed
* test if test works in ci
* Revert "test if test works in ci"
This reverts commit 1f05edb531 .
* do not pad thread
2025-09-06 16:13:43 +03:00
nimlgen
2b1844da27
cpu: support several threads in runtime ( #12055 )
2025-09-06 13:29:31 +03:00
nimlgen
f37b836618
factor out _globalizable_rngs ( #12054 )
2025-09-06 13:29:23 +03:00
nimlgen
1630c87d0e
run optimize_local_size only when locals supported ( #12056 )
2025-09-06 13:29:09 +03:00
Jordan Chalupka
48ec5efad9
only run autogen tests on change ( #12049 )
...
* only run autogen tests on change
* example change
* rm example change
2025-09-05 23:53:01 -07:00
Sieds Lykles
581b2388c2
add dtypes.index ( #12015 )
...
* add dtypes.index
* cast shape, stride and mask to dtypes.index in view.create
* move pm_lower_index_dtype to ops
* DEFINE_VAR is dtype.index by default
* merge var_val_using_str
* remove int from commutative
* fix test_rewrite_map
* change that to dtypes.index
* change some int to index
* shorten those
* remove old cast in renderer
* cleanup
* change that back
* add comment
* delete comment
* just delete those
* view doesnt have to cast anymore
* adjust comment
2025-09-06 06:03:44 +02:00
Sieds Lykles
c6c16b2946
var_vals uses str for var ( #12011 )
...
* var_vals is str,int
* remove imports
* remove print
* fix test
* change var_vals in hcq
* update test_hcq
* fix multitensor _device_num var
* fix syminfer test
* shorten line
* p.vars stays list[Variable]
* shorten line
* vars is back to tuple[Variable, ...]
* change var_vals in extra
* change var_vals from shapetracker
* var_vals is str:int
* fix signature
2025-09-06 04:16:12 +02:00
George Hotz
8658a97197
hotfix: name the shift rewrite better + no ctx there
2025-09-05 19:01:59 -07:00
George Hotz
6ef3270fc8
fix opt gate ( #12050 )
2025-09-05 18:59:54 -07:00
George Hotz
66c5206b42
hotfix: minimal scheduler copy
2025-09-05 18:24:00 -07:00
George Hotz
478e758755
Revert "fix scheduler copy ( #12048 )"
...
This reverts commit 51b7c40788 .
2025-09-05 18:21:55 -07:00
George Hotz
51b7c40788
fix scheduler copy ( #12048 )
...
* fix scheduler copy
* hand coded opt only runs once
2025-09-05 17:17:49 -07:00
George Hotz
0123c394e5
early simplify_merge_adjacent ( #12045 )
...
* do simplify_merge_adjacent before schedule
* do simplify_merge_adjacent before schedule
* disable that slow test
2025-09-05 16:39:20 -07:00
George Hotz
8423c06144
delete unused bufs_from_lin ( #12044 )
2025-09-05 16:08:28 -07:00
George Hotz
38dcadf07b
delete kernel.py ( #12040 )
...
* delete kernel.py
* delete that file
* rip and tear
* don't test search
* imports
* fix torch frontend
* not a part of regen
2025-09-05 15:52:07 -07:00
George Hotz
ee4f696086
delete more tests ( #12043 )
...
* delete more tests
* delete and simplify
* flaky on windows
* a few more, those remained
2025-09-05 15:31:30 -07:00
George Hotz
12c7b1bb01
cleanup lin tests without Kernel ( #12041 )
...
* cleanup lin tests without Kernel
* no kernel.py there
* remove that test
2025-09-05 15:13:14 -07:00
Sieds Lykles
8435d2d23b
fix openpilot speed regression ( #12039 )
...
* set local_size=None if special.arg[0]=='i'
* add cast back
2025-09-06 00:05:45 +02:00
George Hotz
e00858a2c3
only POSTOPT ( #12038 )
2025-09-05 14:46:33 -07:00
George Hotz
433581f8ed
make POSTOPT=2 the default ( #12034 )
...
* make POSTOPT=2 the default
* more matching tc
* fix winograd
* fix that test
* add matvec to Scheduler
* flip tc sort order
* similar speed
* fix beam on image
* disable slow tests
* slow
2025-09-05 14:34:05 -07:00
chenyu
3b41a04b96
remove test_openpilot in test_onnx ( #12037 )
...
openpilot is tested in compile3
2025-09-05 16:20:03 -04:00
Sieds Lykles
290521f68e
add check for z3>=4.12.4 ( #12035 )
2025-09-05 20:33:26 +02:00
George Hotz
870f63d9cc
add WARP axistype, fix postopt bugs ( #12033 )
...
* postopt is 83% match
* warp is bright CYAN
* beautiful mnist beam works
* fix shutdown bug
2025-09-05 10:36:55 -07:00
chenyu
4c2d4f683a
lower universal_test_unary cos domain ( #12032 )
...
flaky
2025-09-05 12:19:44 -04:00
chenyu
a340723bf1
SKIP_SLOW_TEST=1 for nv CI ( #12031 )
2025-09-05 11:52:02 -04:00
chenyu
ce7163e9b4
clean up skip slow tests in PYTHON ( #12028 )
...
skip with SKIP_SLOW_TEST and decorators
2025-09-05 11:35:26 -04:00
qazal
f08299d2ec
viz: small profiler resizing improvements ( #12026 )
...
* switch to ResizeObserver
* set a fixed size for device-list
* less
* height from devices
* int
* side rect, more const
2025-09-05 18:29:03 +03:00
chenyu
5dcc4c7f1b
skip test_linalg in windows unit test ( #12030 )
2025-09-05 11:28:40 -04:00
George Hotz
f8e2dd4dd1
investigate opts mismatches ( #12020 )
2025-09-05 07:40:29 -07:00
chenyu
e0da644171
lower sample count in test_multinomial ( #12027 )
2025-09-05 10:10:28 -04:00
chenyu
9b6f1b86cb
add Tensor.maximum in test_dtype_alu ( #12025 )
...
works except nan
2025-09-05 09:48:39 -04:00
nimlgen
3e1c04bcdf
jit: noopt for copy buffers ( #12023 )
2025-09-05 16:04:35 +03:00
qazal
ab413ce72f
viz: give tooltips a max-width ( #12022 )
...
* viz: give tooltips a max-width
* better
2025-09-05 14:25:38 +03:00
qazal
f461ccf407
exclude op2 nan lt in test_dtype_alu ( #12024 )
...
failure: https://github.com/tinygrad/tinygrad/actions/runs/17490320000/job/49679581331?pr=12022#step:6:125
2025-09-05 14:14:22 +03:00
nimlgen
4fcea8493d
viz: add label to tooltip ( #12021 )
2025-09-05 13:06:33 +03:00
George Hotz
2b5a73ac65
improve test_linearizer ( #12016 )
...
* improve test_linearizer
* tweaks
* simpler
* get_prg
* that one doesn't have to return
* fix postopt bugs
* fix rng
2025-09-04 20:44:05 -07:00
chenyu
7f3df6ea21
exclude nan in test_dtype_alu lt ( #12019 )
2025-09-04 23:38:37 -04:00
Sieds Lykles
f5404ca53c
Divmod combine - associative variations ( #12017 )
...
* add rule and test
* more rules and tests
* add all four variations
* fix test
* test fixed!
* adjust comment
* add new variations
* disable intel tensor core ops count test for bigger_matmul_half
2025-09-05 03:44:02 +02:00
chenyu
677220ae7e
test_tensor_data to unit/ ( #12013 )
2025-09-04 19:58:27 -04:00
George Hotz
431666da74
POSTOPT=2 work ( #12012 )
...
* POSTOPT=2 work
* bugfixes
* add chain in one place
* tensor cores match
* better hcopt check
* match from old
* Change POSTOPT ContextVar value to 0
* we didn't need to check that
2025-09-04 16:55:56 -07:00
George Hotz
30eb42a69e
fix POSTOPT pad ( #11999 )
...
* fix POSTOPT=1
* fix some tests
* Revert "fix some tests"
This reverts commit 8ee058e206 .
* fix padding restrictions
* cuda has two tensor cores
* Set POSTOPT ContextVar to 0 in helpers.py
2025-09-04 14:28:58 -07:00
qazal
da61b40604
some viz tests don't need track_rewrites ( #12010 )
2025-09-04 23:59:32 +03:00
qazal
be364a1adb
viz: add default tracing group ( #12009 )
...
This enables seeing rewrites in unit tests like `VIZ=1 python3 test/test_uop_graph.py TestUOpGraph.test_in_bounds_access_gated_local` that call graph_rewrite directly.
`@track_rewrites` keeps existing as an optional helper to organize larger traces.
2025-09-04 23:29:56 +03:00
chenyu
52166fd7eb
smaller test_ops inputs ( #12007 )
2025-09-04 16:22:33 -04:00
chenyu
dc8501af30
clean up wino tests ( #12008 )
...
removed the one that tests hcopt and added one for backward kernel counts
2025-09-04 16:14:55 -04:00
chenyu
8c720e8760
less iterations for symbolic double for loops ( #12006 )
2025-09-04 15:09:17 -04:00
George Hotz
70ce29b630
test pyrender ( #12005 )
...
* test pyrender
* make them print
* switch to pyrendered
2025-09-04 11:48:40 -07:00
George Hotz
560df206cc
split tc test ( #12003 )
...
* split tc test
* split hand coded opts
* remove some skipped tests
* skips on emulated
2025-09-04 11:47:56 -07:00
qazal
4996bb668b
load all traces before asserting in test_viz ( #12004 )
2025-09-04 21:34:48 +03:00
George Hotz
9dee724fc4
make EMULATE a context var ( #12002 )
...
* make EMULATE a context var
* fix test amx
2025-09-04 11:15:43 -07:00
George Hotz
09106e4aae
refactor and split test_linearizer ( #12001 )
...
* refactor and split test_linearizer
* forget that file
* imports
* remove from docs
* test gen float4
2025-09-04 10:53:07 -07:00
chenyu
fb71d1e5fd
delete some test_search tests ( #11998 )
...
TC_SEARCH_OVER_SHAPE was removed so should the tests
2025-09-04 11:19:49 -04:00
chenyu
ca7574cb2d
ci set PYTHONPATH for all ( #11997 )
2025-09-04 10:06:04 -04:00
nimlgen
e213b85810
cpu: add thread_id to worker ( #11995 )
2025-09-04 14:58:13 +03:00
qazal
35f37a64a9
viz: remove useless ctx.save and restore calls ( #11996 )
...
It's a UI no-op since we always set the styles right before drawing.
2025-09-04 14:56:41 +03:00
Sieds Lykles
572a3c15c6
Move Ops.SPECIAL arg to src ( #11918 )
...
* initial moving bound to src
* arg to src
* remove import
* fixup linearizer
* arg to src
* fix test_uop_graph
* fix more tests
* fix python renderer
* get const value from const uop
* ssimplify uop estimates
* fix webgpu locals
* fix old test
* gate Ops.SPECIAL in linearizer
* use ssimplify() for local/global_size
* remove toposort gate_parents_instead_of_self
* fix rendering in comment
* cleanup
* rename and add comments
* add BottomUpGate with test
2025-09-04 09:31:44 +02:00
George Hotz
5cf42dc4db
add Scheduler to replace Kernel with POSTOPT=2 ( #11924 )
...
* ** simple kernel to replace Kernel for postopt
* support old
* fix beam
* beaming
* beam on old
* bring tensor cores back
* raise
* postbeam
* test ops passes on mac
* skip that
* postopt default
* gate that
* fix tensor cores
* a few test fixes
* dsp fix
* tc fix
* loop
* support swap
* test_gemv
* fix beam for variable
* test opts from high level stuff
* range annoying
* compile slow
* metal slow
* better beam
* no POSTBEAM
* fix nolocals
* hc opt mostly works
* put that back
* lil
* some work
* fix that
* POSTOPT 2
* fix tests
* no postopt 2
* work
* back
* padded tensors cores
* shift_to
* postopt 0 passes?
* write PADTO
* fix padded tensor cores
* compare hcopt
* 18000 lines
* should pass tests
* fix rangeify
* put types back
2025-09-03 19:23:30 -07:00
chenyu
b13e071463
move test_winograd to unit test ( #11993 )
2025-09-03 21:47:32 -04:00
chenyu
edc8b99853
more tests that pass PTX now ( #11992 )
2025-09-03 21:18:14 -04:00
chenyu
ed2f45712b
remove skip PTX in test_arange ( #11991 )
...
all passes now
2025-09-03 20:45:19 -04:00
George Hotz
a5f2b4872a
use_tensor_cores is a heuristic ( #11989 )
...
* use_tensor_cores is a heuristic
* context
2025-09-03 17:05:10 -07:00
George Hotz
63e930fec3
apply_tensor_cores is a heuristic ( #11988 )
...
* apply_tensor_cores is a heuristic
* delete extra_opts
2025-09-03 16:39:33 -07:00
chenyu
d0e739453e
update many einsum tests ( #11981 )
...
correct the exception testing, and raise ValueError instead of assert when checking args
2025-09-03 15:40:20 -04:00
George Hotz
55e4bdd353
split_uop is a method ( #11984 )
2025-09-03 10:46:17 -07:00
ttomsa
1877eddde4
broadcast for upat ( #11940 )
2025-09-03 10:04:23 -07:00
George Hotz
5ed262982a
remove some tc hacks from BEAM ( #11980 )
...
* remove some tc hacks from BEAM
* cosmetic changes
* revert that
2025-09-03 09:59:10 -07:00
b1tg
6d53cac457
dtype fuzz: log need input > 0 ( #11979 )
...
Co-authored-by: b1tg <b1tg@users.noreply.github.com>
2025-09-03 12:10:42 -04:00
Jordan Chalupka
68e83b850f
nbytes should raise an exception when size is unlimited ( #11928 )
...
* nbytes should raise an exception when size is unlimited
* adding a test
2025-09-03 07:06:20 -07:00
Sieds Lykles
86e908db57
cast parents of int64 alu to int32 if possible ( #11977 )
...
* add overflows helper
* add rules
* x -> y
* check overflow of u too
* cleaner
* use alu instead of replace to preserve vectorization
* just one rule
* add test
2025-09-03 11:05:04 +02:00
Sieds Lykles
033184b3cb
parse_valid with non const rhs ( #11957 )
...
* const to using vmin/vmax
* add test
* convert to int
* remove left over part of and
2025-09-03 08:08:46 +02:00
Sieds Lykles
53eff8970a
add Ops.GEP to _min_max ( #11976 )
2025-09-03 07:07:54 +02:00
Sieds Lykles
d1d0960e6e
remove intermediate cast using bounds - weaker pattern ( #11974 )
2025-09-03 06:24:40 +02:00
Sieds Lykles
8a2846b31a
assert embedding input is integer dtype ( #11963 )
...
* cast embedding input
* raise error if not using int for index embedding
2025-09-03 01:44:26 +02:00
wozeparrot
d16cc6c012
feat: resume ckpt ( #11970 )
2025-09-02 15:47:48 -07:00
George Hotz
1b73993521
pyrender to render uops ( #11968 )
...
* pyrender to render uops
* new pyrender style
* pyrender works
* list str
* store render
2025-09-02 15:44:01 -07:00
chenyu
e921fb44ee
clean up testnvidia env ( #11969 )
2025-09-02 18:29:00 -04:00
chenyu
69dd1817d0
raise RuntimeError in merge_dicts instead of assert [pr] ( #11965 )
2025-09-02 17:18:44 -04:00
qazal
f750c15965
viz: add python marker ( #11952 )
...
* viz: add python marker
* remove duplicate
2025-09-02 23:44:00 +03:00
George Hotz
550cf2ca7f
tests from postopt ( #11964 )
...
* tests from postopt
* reraise is fine
2025-09-02 13:34:17 -07:00
qazal
b977ec0813
viz: axes domains cleanup ( #11962 )
2025-09-02 19:30:45 +03:00
nimlgen
897254ad6c
ci: add dev<->cpu copy speeds ( #11959 )
2025-09-02 15:22:44 +03:00
George Hotz
74040663bf
make ptrdtype a UOp property ( #11955 )
2025-09-01 16:35:43 -07:00
George Hotz
0dfca4e74b
add failing test for rangeify setitem ( #11954 )
2025-09-01 16:24:35 -07:00
wozeparrot
7c21271a5f
feat: end_lr envvar ( #11953 )
2025-09-01 14:53:07 -07:00
chenyu
6a40216724
correct bf16 fuzz input in test_dtype_alu ( #11933 )
...
it was using float16 inputs; now it generates uint16 inputs and converts them to bf16
2025-09-01 10:52:26 -04:00
chenyu
965ea59b16
test_dtype_alu use AMD_LLVM from helpers ( #11950 )
2025-09-01 10:03:17 -04:00
b1tg
a9f07c31bc
fix amd llvm sqrt ( #11936 )
...
* fix amd llvm sqrt
* lint
---------
Co-authored-by: b1tg <b1tg@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-09-01 09:31:14 -04:00
qazal
0a53e72f70
viz: fix trace duration in python test decoder ( #11949 )
2025-09-01 14:32:25 +03:00
qazal
27c9ed5a84
viz: more consistent naming of events ( #11948 )
...
* s/shapes/events in test_viz
* s/bufs/events in the memory packer
2025-09-01 14:16:47 +03:00
qazal
c7bb561ef9
remu: add v_rsq_f32_e32 instruction ( #11947 )
...
https://github.com/tinygrad/tinygrad/pull/11936 introduces a change to
the AMD LLVM renderer that outputs this instruction. Adding both 32 and
64 bit variants.
2025-09-01 11:29:31 +03:00
Sieds Lykles
d9560a631c
remove cast between ints if safe ( #11946 )
2025-09-01 05:56:49 +02:00
Sieds Lykles
a19d689481
fix vec dtype _min_max ( #11944 )
2025-09-01 03:24:07 +02:00
Sieds Lykles
f32f3464d6
Can safe cast from certain ints to floats ( #11941 )
...
* add rule
* add some tests
* prevent infinite loop with bfloat16
* add some ints to double and float can_safe_cast
* add tests
2025-09-01 00:51:24 +02:00
Sieds Lykles
1c6e43c203
Double cast is one cast if intermediate cast is safe ( #11939 )
...
* add rule
* add some tests
* prevent infinite loop with bfloat16
* prevent more infinite rewrite
2025-09-01 00:36:29 +02:00
wozeparrot
7e68045fb2
feat: small llama3 training ( #11829 )
2025-08-31 13:41:47 -07:00
nimlgen
020abe0556
hcq: finalize without synchronization when in error state ( #11872 )
...
* hcq: finalize without synchronization when in error state
* ooops
* fix
* fix
* fix
2025-08-31 18:39:13 +03:00
qazal
2004c9757d
tracing: add default clock ( #11935 )
2025-08-31 18:24:44 +03:00
b1tg
c1eeb3b99c
only skip AMD_LLVM ( #11934 )
...
Co-authored-by: b1tg <b1tg@users.noreply.github.com>
2025-08-31 18:15:47 +03:00
b1tg
75d380a77c
fix transcendentals in python renderer ( #11932 )
...
* fix transcendentals in python renderer
* add test
---------
Co-authored-by: b1tg <b1tg@users.noreply.github.com>
2025-08-31 09:37:17 -04:00
Sieds Lykles
61e4dc6ad5
render special arg in cstyle if arg is UOp ( #11931 )
2025-08-31 07:01:29 +02:00
Sieds Lykles
d3252ccd85
fix special vmax when arg is UOp ( #11930 )
2025-08-31 06:54:39 +02:00
qazal
0bacd9fc9b
viz: give disassembly its own node ( #11927 )
2025-08-31 00:28:52 +03:00
chenyu
af89be317e
relax rtol for bfloat16 test_dtype_alu ( #11926 )
2025-08-30 17:16:08 -04:00
George Hotz
632c2fb119
lowerer works on rangeifed + print exception ( #11925 )
2025-08-30 12:05:44 -07:00
qazal
c27b99d68f
viz: refactor to indexed rewrite traces ( #11923 )
2025-08-30 20:01:47 +03:00
qazal
9aff00a6ea
switch viz command line args to pathlib ( #11922 )
2025-08-30 18:13:47 +03:00
qazal
c86ee5bfaf
viz: canonicalize device name colors ( #11921 )
2025-08-30 18:12:30 +03:00
nimlgen
a4f05ebd1a
ci: rebuild gpuocelot with boost libs ( #11920 )
2025-08-30 17:24:19 +03:00
qazal
bf0d055b39
viz: color by name ( #11919 )
2025-08-30 16:04:58 +03:00
Sieds Lykles
0bc34c000f
simplify range mod its own upper bound ( #11917 )
...
* add rules
* add tests
2025-08-30 08:37:35 +02:00
chenyu
561318fea7
Tensor.cos in test_dtype_alu ( #11916 )
...
* Tensor.cos in test_dtype_alu
* need this fix anyway
2025-08-29 20:26:36 -04:00
NoahKusaba
0838021753
remove np from beautiful_cifar ( #10988 )
...
* remove np from beautiful_cifar
* remove np from cifar
* rename variable and rename tensor.arrange to just tensor.randperm
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-08-29 19:34:16 -04:00
nimlgen
cf9d8c8142
ci: pin boost for macos runners ( #11910 )
2025-08-30 01:38:06 +03:00
nimlgen
c6e342cdac
mockgpu: no hang if gpuocelot failed ( #11915 )
2025-08-30 00:44:49 +03:00
chenyu
26d03a86a1
test_symbolic_ops.py cleanup ( #11895 )
2025-08-29 17:11:59 -04:00
b1tg
b2cc06218a
python bfloat16 ( #11912 )
...
* python bf16
* _to_torch_storage_type
---------
Co-authored-by: b1tg <b1tg@users.noreply.github.com>
2025-08-29 15:18:02 -04:00
George Hotz
afad7d0cd1
remove dtype from range, it will be dtypes.index soon [pr] ( #11914 )
...
* remove dtype from range, it will be dtypes.index soon [pr]
* a few more
2025-08-29 09:52:07 -07:00
qazal
30e72d5820
multi device and copy tracing for NULL device ( #11913 )
...
* add device name to NULL programs
* trace transfers
2025-08-29 15:31:00 +03:00
qazal
d8e1e4dc61
tracing: show NULL programs ( #11911 )
2025-08-29 14:09:33 +03:00
nimlgen
75678b2cbe
amd: retire pm4 xcc sync ( #11835 )
...
* amd: aql default when several xccs
* amd: retire pm4 xcc sync
* remove more
* more
* more
2025-08-29 09:56:27 +03:00
George Hotz
394c2d1db1
update Kernel API in tests + move optimize_local_size ( #11907 )
2025-08-28 15:12:47 -07:00
nimlgen
fa695ac1ce
ci: mac gpuocelot ( #11906 )
...
* gm
* fix?
* ops
* imp
* xx
* add file
2025-08-28 23:29:43 +03:00
George Hotz
b9b438c516
small updates from postopt ( #11903 )
...
* tests from postopt
* modernize
* skip lin tests
* that's fixed?
* skip, not failure
2025-08-28 12:34:52 -07:00
nimlgen
bb55a3001f
nv: flush reset message ( #11897 )
2025-08-28 22:17:20 +03:00
nimlgen
e8289c75b1
ci: do not reinstall existing pkgs in macos ( #11900 )
2025-08-28 21:20:15 +03:00
chenyu
134cf56904
update cache name for gpuocelot ( #11896 )
2025-08-28 13:11:10 -04:00
Ben Waldron
ea1be2e4cd
[bounty] Remove using reshape to register symbolic shape ( #11771 )
...
* Modify tests and start work towards removing symbolic reshape
* Refactor symbolic reshape
* fix small error
* much cleaner + fix more tests
* Can remove this now
* Update test_symbolic_ops and test_tiny
* Couple more tests
* Unused import
* More tests and add EXPAND to Tensor.empty
* Fix test beam search
* all int
* Fix rangeify by adding shrink
* Remove OOB check and so fix test_symbolic_jit
* test_symbolic_jit doesn't need OOB Context anymore either
* Should remove that test now
* Cleanups part 1
* fix linters
* Final cleanups
* Don't reassign inside for loop
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-08-28 12:30:49 -04:00
qazal
53853ae49b
viz: switch to Path2D ( #11892 )
2025-08-28 18:58:16 +03:00
nimlgen
874c1db4af
am: init support for aql ( #11888 )
2025-08-28 18:41:46 +03:00
Ben Waldron
17ecaf4682
Add test_variable_empty ( #11889 )
...
* Add test_variable_empty
* Move test and add TODO
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-08-28 11:38:27 -04:00
Nino Risteski
54be477152
rope cache optim for jit prune in llm.py ( #11678 )
...
* rope cache optim for jit prune
* rope test
* tests in test attention
* Revert "rope test"
This reverts commit 69ede543d0 .
* lint
2025-08-28 08:31:29 -07:00
quortus
5f8fe9a331
Replace ASSIGN with STORE in test_linearizer ( #11821 )
2025-08-28 07:33:20 -07:00
geohotstan
4e8370309c
Support onnx If OP ( #11648 )
...
* start
* tiny clean up
* whoops, didn't mean to accidentally fix this
* fix .to(device), kinda hacky and this fix makes it slower?
* merge properly
* FINALLY figured out slowness, also hack pylint for now
* add DEBUGONNX print for subgraph
* oops
* WOOOOOOOO SHAPE CACHE 50% SPEED INCREASE
* small fix, but maybe all deterministic Tensor creation in fp should be cached
* cache condition
* sliiiightly cleaner
* better abstraction?
* remove sam from model_benchmark
* remove shape cache speed up for now
* less lines
* isinstance fix
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-08-28 10:17:35 -04:00
George Hotz
6d6f0dada7
support for tuple ranges ( #11890 )
...
* support for tuple ranges
* breaks it
2025-08-28 07:02:31 -07:00
nimlgen
60dd9a162c
memory: tiny tlsf cleanup ( #11887 )
2025-08-28 14:07:18 +03:00
chenyu
beb5982165
FUSE_ATTENTION ( #11884 )
2025-08-27 19:59:17 -04:00
George Hotz
cb5295168d
postrange boilerplate work ( #11881 )
2025-08-27 15:22:59 -07:00
George Hotz
fd579433bc
pre expander shouldn't go in gpudims ( #11880 )
2025-08-27 14:52:24 -07:00
nimlgen
44816218b5
memplan: fix large buffers planning ( #11878 )
...
* memplan: fix large buffers planning
* fix
* fix dsp
2025-08-27 23:54:27 +03:00
nimlgen
4006366752
Revert "memplan: fix large buffers planning ( #11876 )" ( #11877 )
...
This reverts commit 7f90497efc .
2025-08-27 22:36:14 +03:00
nimlgen
7f90497efc
memplan: fix large buffers planning ( #11876 )
...
* memplan: fix large buffers planning
* fix
2025-08-27 22:04:15 +03:00
George Hotz
e4afdf9ea1
improve DEBUG=2 string with TB/s and TFLOPS [pr] ( #11875 )
2025-08-27 11:42:41 -07:00
Jordan Chalupka
e9789d8a70
Add mxfp4 support ( #11873 )
...
* bump ggml url
* map mxfp4 to tensor
* tests
2025-08-27 10:56:56 -07:00
qazal
884eb53e89
tracing: fix types ( #11871 )
...
* tracing: fix types
* /profiler isn't a thing
* return list
2025-08-27 15:50:43 +03:00
Sieds Lykles
d39365809a
add ctx to z3_renderer arg ( #11867 )
...
* add ctx to z3_renderer arg
* update symbolic fuzzer
* rewrite u1,u2,u3
* update fuzz_fast_idiv
* remove imports
2025-08-27 03:38:15 +02:00
George Hotz
24c00a4061
darken hex on viz ( #11865 )
...
* darken hex on viz
* more readable
2025-08-26 15:57:50 -07:00
qazal
f38e4af226
viz: add custom zoom filter ( #11861 )
2025-08-27 01:30:29 +03:00
nimlgen
62df6c39af
amd: correct handling of relocations ( #11863 )
...
* amd: correct handling of relocations
* ops
* add
2025-08-27 01:26:45 +03:00
George Hotz
d261458ecd
add colors to range ( #11860 )
2025-08-26 14:32:12 -07:00
Sieds Lykles
7dfc7e4abc
uops_to_z3 helper ( #11859 )
2025-08-26 22:58:05 +02:00
chenyu
1bbb578afd
named expression for POW and MAX gradient ( #11858 )
2025-08-26 16:03:03 -04:00
chenyu
7028cb4167
clean up TestBitcastConstFolding ( #11856 )
2025-08-26 15:26:47 -04:00
George Hotz
d4154e0349
split devectorizing of buf/index ( #11855 )
2025-08-26 12:05:48 -07:00
George Hotz
b268755d51
small changes from postopt ( #11854 )
2025-08-26 11:56:16 -07:00
Sieds Lykles
a3aeef45cc
associative variation of where branch-merging ( #11851 )
...
* add rule and test
* change comment
2025-08-26 19:27:05 +02:00
chenyu
aabe7756be
fix type in fold_bitcast [pr] ( #11853 )
2025-08-26 13:22:30 -04:00
Jordan Chalupka
4785cd959a
[TYPED=1] cvar should allow dtype as a tuple ( #11770 )
...
* cvar dtype:DType|tuple[DType, ...]|None=None
* fmt
* add a test
* list typeguard as a dep for CI
* extra step to install mypy
* fix venv
* ci fixes
* mv typeguard to testing install group
* simpler TYPED=1 test
* add typeguard to lint group
2025-08-26 12:49:51 -04:00
qazal
b111076301
viz: fixup click on overlay rect ( #11850 )
2025-08-26 19:25:42 +03:00
b1tg
1dd613cb89
test float_to_bf16 round-to-even behavior ( #11849 )
...
Co-authored-by: b1tg <b1tg@users.noreply.github.com>
2025-08-26 12:16:10 -04:00
b1tg
409399c609
fix nan in float_to_bf16 ( #11843 )
...
Co-authored-by: b1tg <b1tg@users.noreply.github.com>
2025-08-26 11:42:25 -04:00
qazal
43d5d66d34
viz: add UOp ports to edges ( #11847 )
...
* viz: add UOp ports to edges
* one edge label
* g.tag styling
* replace with NodeList
2025-08-26 18:31:52 +03:00
chenyu
f28f613f85
improved float_to_bf16 ( #11848 )
...
round instead of truncate
2025-08-26 11:14:06 -04:00
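Note on "improved float_to_bf16": the commit body only says the conversion now rounds instead of truncating. The sketch below illustrates round-to-nearest-even on the float32 bit pattern; it is not taken from the tinygrad source. The function name `float_to_bf16_bits`, the choice to return raw bits, and the NaN handling are assumptions made for illustration.

```python
import math
import struct

def float_to_bf16_bits(x: float) -> int:
    """Hypothetical helper: return the 16-bit bfloat16 pattern for x (x must fit in float32)."""
    # Reinterpret the float32 encoding of x as an unsigned 32-bit integer.
    (u,) = struct.unpack("<I", struct.pack("<f", x))
    if math.isnan(x):
        # Assumption: force a mantissa bit so the result is still a NaN after dropping 16 bits.
        return ((u >> 16) | 0x0040) & 0xFFFF
    # Round to nearest, ties to even: add 0x7FFF plus the lowest bit that will be kept.
    u += 0x7FFF + ((u >> 16) & 1)
    return (u >> 16) & 0xFFFF

def truncate_to_bf16_bits(x: float) -> int:
    """Plain truncation, shown only for comparison."""
    (u,) = struct.unpack("<I", struct.pack("<f", x))
    return u >> 16
```

With this sketch, a value exactly halfway between two bfloat16 neighbours rounds to the one whose lowest mantissa bit is zero, whereas truncation always rounds toward zero.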
nimlgen
afe14ccbfa
amd: aql default when several xccs ( #11832 )
2025-08-26 15:16:36 +03:00
qazal
3674c0754e
viz: small uop click changes ( #11846 )
...
* also highlight self
* can always unselect by clicking outside
* less layout
2025-08-26 14:56:13 +03:00
qazal
f2a3c27372
viz: g.edges() once ( #11845 )
2025-08-26 13:29:59 +03:00
qazal
b0df3e62a8
viz: light up srcs and paths on UOp click ( #11844 )
...
* viz: light up srcs and paths on UOp click
* safari doesn't have context-stroke
* safari also has a bug
* safari acceptance
2025-08-26 09:03:09 +03:00
qazal
6236749867
viz: move rect styles to classes ( #11842 )
...
* viz: move rect styles to classes
* add rect
2025-08-26 07:55:34 +03:00
qazal
81ffa07439
viz: pass through nodes without a link ( #11841 )
2025-08-26 07:00:43 +03:00
Sieds Lykles
265d287615
add decomp for !x&!y -> !(x|y) ( #11836 )
2025-08-26 05:21:06 +02:00
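Note on "add decomp for !x&!y -> !(x|y)": the rewrite is De Morgan's law, which replaces two negations and an AND with one OR and one negation. The check below is a standalone sketch in plain Python, not tinygrad pattern-matcher code; the function names are illustrative.

```python
from itertools import product

def lhs(x: bool, y: bool) -> bool:
    # Original form: two NOTs and one AND.
    return (not x) and (not y)

def rhs(x: bool, y: bool) -> bool:
    # Rewritten form: one OR and one NOT.
    return not (x or y)

# Exhaustive check over all boolean inputs.
BOOL_EQUIVALENT = all(lhs(x, y) == rhs(x, y) for x, y in product([False, True], repeat=2))

# The same identity holds bitwise for Python integers.
INT_EQUIVALENT = all((~a & ~b) == ~(a | b) for a, b in product(range(-4, 5), repeat=2))
```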
chenyu
337e979a59
call dtypes.as_const in Tensor(list) ( #11840 )
2025-08-25 22:08:26 -04:00
George Hotz
215818379b
new (post) group for reduce ( #11837 )
...
* new (post) group for reduce
* fixes
* leave if
* fix locals
* size
* no vectorized buf
* image fixes
* don't track that
* fix ptx
* name buffer with reduce range
* remove unused in lowerer
* yay DEFINE_REG refactor
2025-08-25 18:03:00 -07:00
chenyu
ac3449b0c8
truncate_fp16 cleanup ( #11838 )
...
native `@` is default
2025-08-25 19:03:41 -04:00
qazal
e146418f65
hotfix: profiler content-type is application/octet-stream ( #11831 )
2025-08-25 15:56:42 +03:00
qazal
a1f6823060
viz: memory layout in client side ( #11830 )
...
* viz: memory layout in client side
* update test_viz
2025-08-25 14:49:33 +03:00
George Hotz
a6dbb09058
changes for postrange ( #11828 )
2025-08-24 17:37:07 -07:00
George Hotz
27701ef823
add locals support to rangeify ( #11826 )
2025-08-24 14:03:12 -07:00
Sieds Lykles
a286a1a6f7
Fast idiv try removing factors of two before cast ( #11824 )
...
* try removing factors of two
* dont return if None
* add test
2025-08-24 20:04:25 +02:00
George Hotz
a03b930339
hotfix: green v2 in docs
2025-08-24 10:25:14 -07:00
George Hotz
6540bb32a6
move into codegen late [pr] ( #11823 )
2025-08-24 10:23:25 -07:00
nimlgen
bba088ef11
amd aql queue ( #11708 )
...
* amd aql queue
* xcc
* fiz
* aql better
* llvm
* no for aql
* wrap
* is_sql
* am support
* complete
* fix
* mypy
* minor
2025-08-24 19:53:00 +03:00
George Hotz
1fa09d9ede
BLOCK_REORDER is context var, heuristic cleanups [pr] ( #11819 )
...
* BLOCK_REORDER is context var, heuristic cleanups [pr]
* split get opt and do opt
* oops, should be on
2025-08-24 09:41:34 -07:00
qazal
8b18cc2a94
viz memory layout cleanup ( #11820 )
...
* rename to dtype_size
* cleaner memory shape creator
2025-08-24 19:37:31 +03:00
Sieds Lykles
dd69114573
Revert "Better div nesting ( #11811 )" ( #11818 )
...
This reverts commit 952f729b07 .
2025-08-24 18:11:24 +02:00
nimlgen
e19f901330
amd: rptr/wptr in create_queue ( #11817 )
2025-08-24 18:03:45 +03:00
nimlgen
d71444857e
amd: apply relocs for kernel_code_entry_byte_offset for AMD_LLVM ( #11816 )
...
* amd: apply relocs for kernel_code_entry_byte_offset for AMD_LLVM
* fix
2025-08-24 17:48:40 +03:00
George Hotz
44bc7dc73d
remove KernelInfo from GROUP_REDUCE ( #11814 )
2025-08-23 19:55:41 -07:00
George Hotz
229adfb7c3
Revert "remove KernelInfo from gpudims ( #11809 )" ( #11813 )
...
This reverts commit 846753f343 .
2025-08-23 19:37:10 -07:00
Sieds Lykles
952f729b07
Better div nesting ( #11811 )
...
* remove check
* use fold_divmod_congruence instead of simplify
* adjust tests
* shorten line
2025-08-24 04:17:40 +02:00
Sieds Lykles
e652062f92
tweak divmod_folding condition ( #11810 )
2025-08-24 02:59:02 +02:00
George Hotz
846753f343
remove KernelInfo from gpudims ( #11809 )
...
* remove KernelInfo from gpudims
* that's good in there
2025-08-23 16:32:45 -07:00
Sieds Lykles
07d4ed7e4c
one more symbolic add variation ( #11807 )
2025-08-24 01:15:04 +02:00
qazal
759ebea4eb
viz: reflect timeline API boundary in names ( #11808 )
...
* define shapes once
* depth isn't an event property
* update server naming
2025-08-24 02:12:12 +03:00
George Hotz
132f09fab7
global/locals from AxisType in range ( #11806 )
2025-08-23 15:49:17 -07:00
qazal
0d86288bd7
viz: calculate timeline fixed points in client side ( #11805 )
...
* viz: calculate timeline fixed points in client side
* 26 bytes / event
* math
2025-08-24 01:44:40 +03:00
George Hotz
a75da49951
use AxisType for UPCAST/UNROLL ( #11800 )
...
* use AxisType for UPCAST/UNROLL
* fixes
* fix the bug
* fix hack
* bad test
* flaky test
2025-08-23 14:44:48 -07:00
qazal
2407fecdae
viz bytepack format ( #11792 )
...
* viz bytepack format
Training a 1B llama yields ~20M profiler events.
With JSON serialization, the browser tries to load 6GB to memory. This OOMs since each tab is limited to <3-4GB memory usage. Using a packed format, we only need ~600MB.
**Design decisions:**
- Timestamps are in microseconds relative to start time. They're stored in u32, which can express up to ~1 hr of trace events.
- Strings (kernel names, metadata, etc) are deduped.
- Buffer sizes are in u64 nbytes.
More optimization possible:
- The string lookup is a JSON dumped array, we can compress this.
- Can store less for memory by moving the layout to client.
**Results**
| | Events | JSON | bytepack |
|----------------|---------|-------------|-------------|
| DP=8 llama 1B train (`command: [1]`) | 24M | 5.8GB | 640MB |
| examples/beautiful_mnist.py | 16K | 3.7MB | 745KB |
| examples/gpt2.py | 55K | 12.54MB | 1.40MB |
`[1]`: `VIZ=1 FAKEDATA=1 OFFLOAD_OPTIM=1 DP=8 BS=8 GRADIENT_ACC_STEPS=2 BLOCK_REORDER=0 LR=3e-4 TRAIN_ON_VAL=1 DEFAULT_FLOAT=bfloat16 OPTIM_DTYPE=bfloat16 LLAMA3_SIZE=1B WARMUP_STEPS=36 DECAY_STEPS=360 SEQLEN=8192 PYTHONPATH=. AMD=1 AMD_LLVM=0 MODEL=llama3 python3 examples/mlperf/model_train.py`
* python reference decoder
* 27 bytes / event, 1hr hard limit
2025-08-23 23:50:21 +03:00
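Note on "viz bytepack format": the design decisions listed in the commit (u32 timestamps in microseconds relative to the start time, deduped strings, a JSON-dumped string lookup) can be sketched with the standard `struct` module. The record layout below (string index, start, duration, 12 bytes per event) is a hypothetical one chosen for illustration; it is not the layout used by the commit, which reports 27 bytes per event.

```python
import json
import struct

# Hypothetical record: u32 string index, u32 start (us), u32 duration (us).
EVENT = struct.Struct("<III")

def pack_events(events):
    """events: list of (name, start_us, dur_us), with start_us relative to trace start."""
    strings, index = [], {}
    body = bytearray()
    for name, start_us, dur_us in events:
        if name not in index:  # dedupe: each distinct string is stored once
            index[name] = len(strings)
            strings.append(name)
        body += EVENT.pack(index[name], start_us, dur_us)
    table = json.dumps(strings).encode()  # string lookup as a JSON-dumped array
    return struct.pack("<I", len(table)) + table + bytes(body)

def unpack_events(blob):
    """Reference decoder for the hypothetical layout above."""
    (table_len,) = struct.unpack_from("<I", blob, 0)
    strings = json.loads(blob[4:4 + table_len])
    return [(strings[i], s, d) for i, s, d in EVENT.iter_unpack(blob[4 + table_len:])]

# Largest offset a u32 microsecond timestamp can hold, in minutes (about 71.6).
U32_LIMIT_MINUTES = (2**32 - 1) / 1e6 / 60
```

The u32 limit works out to roughly 71.6 minutes, which matches the "~1 hr" bound stated in the commit.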
qazal
b12d1d866c
count bytes per kernel in test_viz ( #11801 )
...
Currently at ~100 bytes/kernel with JSON.
2025-08-23 23:35:27 +03:00
Sieds Lykles
6a50ab6b87
adjust idiv min_max ( #11802 )
...
* change div min_max
* add tests
2025-08-23 22:25:51 +02:00
chenyu
9d4cccd0f9
test_dtype_alu cleanups ( #11799 )
2025-08-23 15:11:17 -04:00
George Hotz
aefabaf774
add AxisType to range ( #11798 )
...
* add AxisType to range
* missed them
* fix that test
* fix that test
2025-08-23 11:15:00 -07:00
qazal
b975830424
add profile loader helper in test_viz ( #11797 )
2025-08-23 19:20:29 +03:00
chenyu
7123df3928
Use Tensor.logaddexp to implement Tensor.softplus ( #11796 )
...
Instead of the piecewise-linear form, numerical stability is handled by logaddexp. jax does this, and I think it's more elegant than torch's approach.
2025-08-23 11:52:29 -04:00
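Note on "Use Tensor.logaddexp to implement Tensor.softplus": the identity behind the commit is softplus(x) = log(1 + exp(x)) = logaddexp(x, 0). The sketch below uses scalar Python floats and the `math` module rather than tinygrad tensors, and it ignores any scaling parameter such as beta; both functions are illustrative stand-ins, not the library implementation.

```python
import math

def logaddexp(a: float, b: float) -> float:
    """Stable log(exp(a) + exp(b)): factor out the larger argument first."""
    m = max(a, b)
    return m + math.log1p(math.exp(-abs(a - b)))

def softplus(x: float) -> float:
    """softplus(x) = log(1 + exp(x)) = logaddexp(x, 0)."""
    return logaddexp(x, 0.0)
```

Because the exponent passed to `math.exp` is never positive, large inputs such as `softplus(1000.0)` do not overflow, so no separate linear branch is needed.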
qazal
aaea6b97ad
viz memory: compute nbytes ( #11795 )
...
* viz memory: compute nbytes
* local map
2025-08-23 17:34:07 +03:00
qazal
58653b5eae
viz: store memory scale ( #11794 )
2025-08-23 16:19:44 +03:00
chenyu
fb8ee02424
Tensor.logaddexp ( #11793 )
2025-08-23 09:15:00 -04:00
Sieds Lykles
5a6817d5f8
Fix z3 rendering of floats in indexing ( #11740 )
...
* Fix floating point comparison in indexing
* wrap in noop
* update tests
* improve rules for loading and comparing floats
* add test cast to bool
2025-08-23 05:56:19 +02:00
chenyu
4267c45db3
non-supported dtype in transcendental ( #11754 )
...
* non-supported dtype in transcendental
`CPU=1 python3 test/test_dtype_alu.py TestDTypeALU.test_bfloat16_unary` works
* test
* works on real mac
2025-08-22 23:13:45 -04:00
chenyu
e39b25cd36
upcast float exp to at least float32 ( #11758 )
...
* upcast float exp to at least float32
* unlucky seed
2025-08-22 20:16:34 -04:00
nimlgen
b057a90d49
memory: rename is_huge_page -> is_page ( #11786 )
2025-08-22 20:08:58 +03:00
qazal
38f0fa7bde
viz: only send trace duration ( #11789 )
...
* viz: only send trace duration
* can unwrap
2025-08-22 20:00:48 +03:00
qazal
1c81ec9248
viz: rename to start/end timestamp ( #11788 )
2025-08-22 19:47:49 +03:00
qazal
9ff03680ba
viz: store relative timestamps ( #11787 )
...
* viz: store relative timestamps
* err
* update test
2025-08-22 19:30:21 +03:00
nimlgen
698392334f
system: message for eaccess as well ( #11785 )
2025-08-22 18:21:32 +03:00
geohotstan
1e679bd789
fix max_unpool2d inf ( #11784 )
...
* start
* add regression test for maxunpool2d
2025-08-22 08:31:24 -04:00
George Hotz
9832599c9e
test_vmap + permute isn't a sint ( #11783 )
...
* test_vmap + permute isn't a sint
* order
2025-08-21 22:39:35 -07:00
George Hotz
bb8de51e5f
remove unused early cleanups + contig w range [pr] ( #11780 )
...
* remove unused early cleanups [pr]
* contiguous with range
* woah, this works
2025-08-21 20:04:45 -07:00
chenyu
91a4de4ca7
fix getitem with inf in tensor ( #11781 )
2025-08-21 21:55:32 -04:00
George Hotz
66e9d54eed
RANGEIFY=2 is partial contig ( #11777 )
2025-08-21 16:53:58 -07:00
Jordan Chalupka
8de6db15ac
exclude .git from ruff ( #11773 )
2025-08-21 15:37:50 -07:00
George Hotz
5954a0975f
fix some assigns on rangeify ( #11774 )
...
* fix some assigns
* llvm test
* more tests
* upd test
2025-08-21 15:15:54 -07:00
qazal
2e0eb88549
viz: add metadata to UOp tracing ( #11772 )
...
* viz: add metadata to UOp tracing
* place after tag
* optional field
* err, refcount of root must be 0
2025-08-22 00:18:45 +03:00
George Hotz
d6f9606e93
small cleanups to rangeify ( #11769 )
2025-08-21 11:15:09 -07:00
uuuvn
bd4a9473b0
Multihost exception handling ( #11729 )
...
Co-authored-by: wozeparrot <wozeparrot@gmail.com>
2025-08-21 13:51:49 -04:00
George Hotz
a2c7b807e0
don't bufferize 0s ( #11766 )
2025-08-21 10:10:56 -07:00
nimlgen
9eff7cd1d8
am: support 64bit discovery ( #11768 )
2025-08-21 18:28:13 +03:00
b1tg
56cd47a159
fix amd llvm bf16 tc ( #11713 )
...
* fix amd llvm bf16 tc
* is_cdna
---------
Co-authored-by: b1tg <b1tg@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-08-21 09:33:28 -04:00
George Hotz
a044648111
rangeify load cleanups + multi support ( #11765 )
...
* use the old buf_uop + cleanups
* simpler handling of load
* everything needed for multi too
2025-08-20 20:55:49 -07:00
George Hotz
9f94c25a25
fix symbolic usage. use shrink, not reshape ( #11762 )
...
* fix test_var
* revert those things
* fix the ones in test tiny
* use better syntax
* it's the same, but that's clearer
* fix pad
2025-08-20 18:35:42 -07:00
chenyu
5276fbc9c5
fix gather with inf values ( #11760 )
...
`mask * x` is wrong because `0 * inf` is NaN. There are probably still many cases like this elsewhere.
2025-08-20 20:35:40 -04:00
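Note on "fix gather with inf values": the failure mode described in the commit is that multiplying by a 0/1 mask turns an unselected `inf` into NaN, which then poisons the sum. The sketch below reproduces this with plain Python floats; the two helper functions are hypothetical and only contrast masking by multiplication with masking by selection.

```python
import math

def select_by_mul(mask, xs):
    # Masking by multiplication: 0 * inf evaluates to NaN.
    return sum(m * x for m, x in zip(mask, xs))

def select_by_where(mask, xs):
    # Masking by selection: unselected values never enter the arithmetic.
    return sum(x if m else 0.0 for m, x in zip(mask, xs))

xs = [1.0, math.inf, 3.0]
mask = [1, 0, 0]  # select only the first element
```

Here `select_by_mul(mask, xs)` is NaN even though the selected element is finite, while `select_by_where(mask, xs)` returns the selected value.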
wozeparrot
b979162c5d
llama3 eval train ( #11706 )
2025-08-20 19:56:35 -04:00
chenyu
dbd3b67657
clamp GRAD_CLIP_NORM in llama ( #11761 )
2025-08-20 19:55:50 -04:00
George Hotz
9635592141
** rangeify, try 3 ( #11683 )
...
* ** rangeify, try 3
* bring that over
* bufferize, don't use contig tag
* work
* ish
* fix rangeify
* flash attention is back
* fix rangeify tests
* stuff passes
* fix test_log_softmax
* more stuff passes
* progress children
* new endrange solution
* progress
* progress counter
* basic assign
* contigs only
* symbolic in schedule
* unbind_kernel
* late children
* ops fixed
* beautiful mnist is close
* that seems to work
* mnist works
* improve names
* fix bmnist
* no pcontig
* testing backward
* work
* clone movement ops
* new_range helper
* MBLOCK/MERGE
* ops tests pass
* revert mblock stuff
* cleanups...but it breaks ops
* remove reindex
* hack for relu
* disable the hacks
* more hacks
* upd
* mostly works with cleanups disabled
* ndr
* ops tests pass
* terrible hacks for indexing to work
* context mismatch
* pcontig
* split pcontig v contig
* z3 trunc
* null
* no fuse in rangeify
* ops test passes
* lnorm
* fix assign
* nd rangeify
* both should work
* tests for rangeify
* cleanups
* stores pass the pointer through
* disable pcontig for now
* PARTIAL_CONTIG is a flag
2025-08-20 14:22:44 -07:00
chenyu
d7553721d1
clean up test_dtype_alu ( #11757 )
...
remove the check that looks into schedule, only test if output matches
2025-08-20 14:36:18 -04:00
chenyu
5f08a3e928
hotfix: cast half to float in Tensor.tolist ( #11755 )
...
workaround for python < 3.12
2025-08-20 12:18:35 -04:00
qazal
de4cb722a4
viz: add metadata and var_vals tracing ( #11753 )
...
* viz: add metadata and var_vals tracing
* add test_trace_metadata
* set TRACEMETA=1
2025-08-20 18:39:51 +03:00
nimlgen
6589c9e643
hcq: better errors for ifaces ( #11751 )
...
* hcq: better errors for ifaces
* fix linter
* typo
* space
2025-08-20 17:50:51 +03:00
chenyu
be7b0b6970
TRANSCENDENTAL_SUPPORTED_DTYPES->TRANSCENDENTAL_DTYPES ( #11752 )
2025-08-20 10:29:36 -04:00
ttomsa
220a2a88d7
a*(1/b) -> a/b on LLVM, CPU ( #11743 )
...
* add fdiv rewrite
* :)
* use float_lop
* use reciprocal()
* revert
* move to decompositions
2025-08-20 09:35:10 -04:00
George Hotz
12ab3f8b06
correct row_count in process replay ( #11748 )
2025-08-19 22:21:07 -07:00
George Hotz
8af8808c61
cleanup tests, bump caches ( #11746 )
2025-08-19 21:21:07 -07:00
George Hotz
00391db628
no ast for mem estimate ( #11744 )
...
* no ast for mem estimate
* skip for webgpu
2025-08-19 20:18:45 -07:00
chenyu
dd413e1208
remove a Ops.REDUCE check in reduce_collapse [pr] ( #11734 )
2025-08-19 19:21:28 -04:00
ttomsa
70c3f1fb29
x.where(False, True) -> !x ( #11738 )
...
* add pat
* add test
2025-08-19 19:08:16 -04:00
George Hotz
1d307f568c
move device tests to test/device + test cleanups ( #11735 )
...
* move device tests to test/device
* test speedups
* test device
* linalg to unit
* upd
* so pytest just works
* more divide and skip
* speed
* test devectorize
* add pillow
2025-08-19 16:02:20 -07:00
wozeparrot
bcc7623025
feat: bump version to 0.11.0 ( #11736 )
2025-08-19 17:08:56 -04:00
qazal
8c987b3293
DISABLE_FAST_IDIV is a context var [pr] ( #11733 )
2025-08-19 23:30:50 +03:00
George Hotz
bf467c623d
changes from rangeify + better NullRenderer ( #11732 )
...
* changes from rangeify + better NullRenderer
* fix test
2025-08-19 12:51:54 -07:00
chenyu
02353588cb
small getitem cleanup ( #11730 )
2025-08-19 12:25:58 -04:00
chenyu
712a5c651a
minor Tensor.triu cleanup ( #11728 )
...
less confusing dtype
2025-08-19 08:07:38 -04:00
nimlgen
9c9e337c78
amd: parse soc enums ( #11727 )
...
* amd: parse soc enums
* remove from mock
* fix
* minimal amd_gpu
2025-08-19 15:06:09 +03:00
qazal
57ad69160a
viz: inline memory shape spec ( #11725 )
2025-08-19 08:03:29 +03:00
chenyu
c5b52e9321
onnx RotaryEmbedding cleanup ( #11724 )
2025-08-18 23:34:42 -04:00
George Hotz
31619774a9
Revert "Revert "fix the misused cast in amd llvm tc ( #11711 )" ( #11715 )" ( #11723 )
...
This reverts commit ca28db5a97 .
2025-08-18 19:44:35 -07:00
George Hotz
2ea54d7337
improve syntax of UPats using f [pr] ( #11717 )
...
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-08-18 20:49:45 -04:00
chenyu
b67345caa3
use truncate in onnx read_int64 [pr] ( #11720 )
2025-08-18 20:49:35 -04:00
qazal
50e789e290
hotfix: add device to decompositions ctx ( #11721 )
...
fast_idiv needs the device to check whether a dtype is supported.
Without it, codegen output is not reproducible unless os.environ is
complete, because `is_dtype_supported` opens devices based on the
env var when the caller does not specify a device.
2025-08-19 03:31:16 +03:00
George Hotz
4b3fcb4064
Revert "REDUCE_AXIS keepdim=False ( #11311 )" ( #11718 )
...
This reverts commit b518a7378a .
2025-08-18 13:28:53 -07:00
George Hotz
67d0ba5bd8
new ops from rangeify ( #11716 )
2025-08-18 13:13:11 -07:00
George Hotz
4afa0b86bb
hotfix: ls -lh on wheel size
2025-08-18 11:52:59 -07:00
George Hotz
ca28db5a97
Revert "fix the misused cast in amd llvm tc ( #11711 )" ( #11715 )
...
This reverts commit 799a637b03 .
2025-08-18 11:51:28 -07:00
chenyu
c10e4c4e20
print wheel build size ( #11714 )
2025-08-18 14:29:47 -04:00
b1tg
b518a7378a
REDUCE_AXIS keepdim=False ( #11311 )
...
* progress
* fix tests
* fix tests
* remove hack for test_symfold
* fix test_conv.py on llvm
* hack test_cache_speed
* lint
* remove hack for helper_linearizer_opt
* tests
* fix DSP
* clean up
* remove hack for kernelize.py
* hack for test/test_multitensor.py TestMultiTensor.test_matmul_shard_none
* clean
* uop.r need reshape?
* lower_store cause fail
* fix lower?
* avoid contiguous hack
* 2134
* conv2d count
* remove unused
* hack lower
* reduced and clean up
* fix TestMultiTensor.test_matmul_shard_none
* src sync + fix TestMultiTensor.test_matmul_shard_none
* remove excluded in mop
---------
Co-authored-by: b1tg <b1tg@users.noreply.github.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
2025-08-18 10:09:17 -07:00
b1tg
61884f2057
add cstyle renderer to the NULL device ( #11709 )
...
Co-authored-by: b1tg <b1tg@users.noreply.github.com>
2025-08-18 09:52:22 -07:00
uuuvn
18db8fa311
Allow choosing leaders in multinode reduce ( #11506 )
...
Co-authored-by: wozeparrot <wozeparrot@gmail.com>
2025-08-18 12:43:20 -04:00
b1tg
799a637b03
fix the misused cast in amd llvm tc ( #11711 )
...
Co-authored-by: b1tg <b1tg@users.noreply.github.com>
2025-08-18 09:15:34 -07:00
qazal
fef97547f9
viz: preset the final timestamp ( #11712 )
2025-08-18 17:51:21 +03:00
chenyu
c30a113b2a
support bf16 and fp8 in Tensor.tolist ( #11704 )
...
memoryview does not support it, but casting works fine so cast is fine
2025-08-17 15:11:13 -04:00
nimlgen
1c62a3833b
am: add versioned_header to load_fw ( #11702 )
...
* am: add versioned_header to load_fw
* fix mypy
2025-08-17 20:11:57 +03:00
qazal
eb3c918c5b
viz: s/area/height ( #11703 )
2025-08-17 19:20:01 +03:00
qazal
d762edd694
viz: define tracks in python ( #11701 )
...
* viz: defines tracks in python
* update unittests
* figuring it out
* works
* diff cleanup
* math
* y axis is back
2025-08-17 18:19:13 +03:00
qazal
eeeea29171
viz: device list refactor ( #11700 )
...
* viz: device list refactor
* paddingTop/padding-top
2025-08-17 15:08:54 +03:00
George Hotz
9366a23eb0
test backward in test_tiny ( #11697 )
...
* test backward in test_tiny
* empty
2025-08-16 20:29:39 -07:00
chenyu
4666df71c1
fix test_fuse_and_tc_opt ( #11699 )
2025-08-16 21:10:53 -04:00
geohotstan
3d7c35d615
add fuse and tc opt bug repro ( #11695 )
...
* FINALLY HAVE A SMALL REPRO OH BOY
* show failure in CI
* cleaner?
* 1 possible fix
* Revert "1 possible fix"
This reverts commit 9e0fd215dd .
2025-08-16 18:24:49 -04:00
nimlgen
d1224a7c4a
am: check both signatures ( #11694 )
...
* am: check both signatures
* fix
2025-08-16 20:01:07 +03:00
qazal
58c8991fa4
add Ops.REWRITE_ERROR ( #11689 )
2025-08-16 00:56:53 +03:00
qazal
ec4fccb1da
viz: pass through RewriteNotReady ( #11690 )
2025-08-16 00:33:59 +03:00
qazal
e954decb44
viz: pass UOp.st errors ( #11688 )
2025-08-16 00:07:56 +03:00
nimlgen
bf0c45fd16
system: resource_resize might be unavail ( #11680 )
2025-08-15 22:03:23 +03:00
George Hotz
4ab9fb2edd
explicit fixed point rewrite ( #11685 )
...
* explicit fixed point rewrite
* local cache
* fix that
2025-08-15 11:08:41 -07:00
chenyu
5d6963c968
RuntimeError for unsupported dtype in PYTHON ( #11686 )
2025-08-15 13:59:27 -04:00
nimlgen
b970cd6895
am: fix psp ring completion ( #11679 )
...
* am: psp ring timeout + fix 0 fence_value
* no sleep
2025-08-15 20:15:49 +03:00
qazal
c8ba48b223
show rewrite errors in viz ( #11684 )
2025-08-15 19:09:47 +03:00
George Hotz
560984fd8d
small changes from rangeify ( #11682 )
...
* small changes from rangeify
* const like thing
* ksym
2025-08-15 08:45:52 -07:00
chenyu
d0d39885c3
onnx in tinygrad ( #11675 )
2025-08-14 19:57:21 -04:00
wozeparrot
71260a5ea4
feat: only bench openpilot 0.9.9 models ( #11664 )
2025-08-14 19:27:18 -04:00
chenyu
4ddefbccb4
update setup packages ( #11674 )
...
sorted, and added missing 'tinygrad.frontend' and 'tinygrad.runtime.autogen.nv'
2025-08-14 19:24:57 -04:00
chenyu
48c4033ae1
fix pylint for onnx ( #11673 )
...
* fix pylint for onnx
* too long
2025-08-14 18:48:02 -04:00
chenyu
e9d0027591
llama MP realize weight after shard ( #11672 )
...
* llama MP realize weight after shard
prevents memory spike on device 0
* empty weight for FAKEDATA
2025-08-14 16:17:46 -04:00
nimlgen
4176b24264
amd: support xcc in regs ( #11670 )
...
* amd: support xcc in regs
* mockamd
* typong
2025-08-14 21:20:11 +03:00
Sieds Lykles
f399d0d75d
Render mod in terms of idiv ( #11668 )
...
* Render mod in terms of idiv
* cvar -> var
2025-08-14 19:59:39 +02:00
nimlgen
d747eeed32
amd logs parser based on device ( #11669 )
2025-08-14 19:49:33 +03:00
geohotstan
1e904155e3
Add Onnx Huggingface to test/models/test_onnx.py ( #11468 )
...
* BOOM
* cache extra/huggingface/models/
* why max buffer size is not 0
* override MAX_BUFFER_SIZE
* less models
* remove more models and change cache dir to already cached dir
* only metal
* less is more?
* remove check ops
* why is this not setting the ENVVAR
* ughhhhh just test in models
* only cpu and gpu
* only cpu actually
* just override it idk
* final
* move extra dependencies up top
* simplification
* fix print
* make README better
* revert ops_disk fix for now
* clean up test_onnx
* remove testing fashion clip model cuz sloooowwwwww
* actually let METAL run this
* fix comment mistake
* fix download path in run_models
* does this work?
* cleanup setup and teardown
* contextvar like this?
* prove model is cached
* do I need to increment DOWNLOAD_CACHE_VERSION?
* see if cached with incremented DOWNLOAD_CACHE_VERSION
* use warnings to see if the model exists
* revert DOWNLOAD_CACHE_VERSION stuff and clean up
* add retry to download
* nit
2025-08-14 11:16:41 -04:00
Sieds Lykles
06beeb6e13
Nest div even if factor is negative ( #11666 )
2025-08-14 13:58:59 +02:00
Sieds Lykles
661e9a2d5d
div_and_mod_folding refactor ( #11585 )
...
* divmod const folding is its own function
* split nested mod optimization out of div and mod folding
* make `fold_binary_numerator` its own function
* factor out `fold_divmod_congruence`
* check sign of numerator
* add tests
* assert int on vmin and vmax
* add type: ignore
* factor out more rules
* remove div_and_mod_folding
* cached_property to property
* remove import
* add returns
* restore old order
* check sign of x.vmin and newx.vmin
* check more signs
* add some test that would have caught bugs
* better test if the div simplified
* shorten line
* replace terms_factors_const with pop_const
* move that back
* minor cleanup
* remove comments
* some cleanup
2025-08-14 11:52:42 +02:00
chenyu
0fc43c2e54
fix test_const_tensor_index index ( #11660 )
...
index should be ints
2025-08-13 19:50:16 -04:00
chenyu
4fe19eec72
Ops.TRUNC ( #11659 )
2025-08-13 18:40:48 -04:00
qazal
eb10a9c76a
viz: always left align timeline values ( #11658 )
2025-08-13 23:55:28 +03:00
George Hotz
22bdf48cdd
render ranges in viz, name gbufs with sizes. changes from rangeify ( #11656 )
...
* render ranges in viz, name gbufs with sizes. changes from rangeify
* fix unit test dtypes
2025-08-13 12:46:16 -07:00
George Hotz
9b4da590bb
remove need for cast_vec ( #11653 )
...
* remove need for cast_vec
* fix amdllvm
2025-08-13 12:09:47 -07:00
kevvz
e2873a3a41
[bounty] Muon optim ( #11414 )
...
* newton schulz
* add muon + move newton schulz to tensor
* compact newton schulz
* better tests
* cleanup
* add comments for muon
* cleanup
* add export with tests
* match muon optim with test optim
* cleanup
* unused import
* correct comment
* whitespace
* move export
* muon test fix
* match reference impl + tests
* remove export by moving muon device
* add credit
* cleanup
* remove print
* spacing
* spacing
* comma
* cleanup
* removal
* fix tests + optim momentum
* consistent is not/ not
* more consistency
* fix test
* cleanup
* fix the nones
* remove comment
* cast
* comment
* comment
* muon teeny test
* muon flag beautiful mnist
* set steps
* steps as hyperparam
* match default test steps
* name
* large cleanup
* dont care about steps
* nesterov false default
* match each other impl
* steps
* switch nest
* swap defaults
* update docstring
* add no nesterov test
* ban fuse_optim
* prints
* classical momentum
* alternative condition
* recon
* pre + post wd
* false default
* detach
* signature changes
* context
* swap order
* big cleanup
* 0 step instead
* parity
* remove fuse
* remove fused
* better paper
* assert message
* correct shape check + eps
* multidim
* add eps
* cleanup
* correct assert message
* lint
* better tests
* naming
* ns_steps,ns_params
* update docstring
* docstring
* match sgd and muon together
* sandwich
* add back fused
* parity
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-08-13 14:27:55 -04:00
chenyu
94e6d84e32
rewrite Tensor.round to not use cast int ( #11654 )
2025-08-13 13:51:08 -04:00
George Hotz
d2521d828a
transcendental+idiv+threefry are uop decompositions ( #11636 )
...
* transcendental+idiv+threefry are uop decompositions [pr]
* threefry decomp
* fix randomness tests
* fix webgpu
* unneeded now
* fix
* move prematcher
* all cast should probably be cast_vec
2025-08-13 09:37:12 -07:00
geohotstan
cf7224ce3e
fully lint onnx.py ( #11647 )
...
* mypy
* ruff ruff ruff
2025-08-13 08:22:06 -07:00
geohotstan
925555b62a
Fix onnx Domain bug ( #11650 )
2025-08-13 08:20:50 -07:00
Sieds Lykles
67df617fe1
add launch bounds to ptx ( #11646 )
2025-08-13 13:05:39 +02:00
qazal
88f95e9f59
viz: minor fixups for firefox ( #11645 )
...
* fix circle attr
* set fill color
2025-08-13 12:59:28 +03:00
qazal
6f88eac0fc
viz: refactor node and edge tagging ( #11644 )
2025-08-13 12:41:01 +03:00
qazal
8140bf9778
viz: create layout once ( #11643 )
...
* start
* work
* works
* diff cleanup
2025-08-13 09:24:58 +03:00
chenyu
3fb79bb43a
minor onnx cleanups ( #11642 )
2025-08-13 01:05:19 -04:00
chenyu
e9e5a08a04
simplify onnx cubic ( #11641 )
...
we can drop the double where and abs since we know which ranges the inputs map into
2025-08-12 19:57:31 -04:00
George Hotz
18cdbec447
split decompositions pass ( #11638 )
...
* split decompositions pass
* fix ptx
* pack load store early
* restore that
2025-08-12 12:56:05 -07:00
chenyu
0d8a0d7a96
update test_multi_const_folding_tensor to include pow ( #11635 )
...
pow folds now
2025-08-12 13:35:37 -04:00
Sieds Lykles
4d6e407eb0
Extend fast_idiv to negative ints ( #11632 )
...
* fast idiv for signed ints
* Add rule and test
* fix tests
* redo fuzz_fast_idiv to do negative ints as well
* adjust comments
* remove unused imports
2025-08-12 19:34:49 +02:00
qazal
17adbe86d8
hotfix: do not default to capturing args in track_rewrites ( #11634 )
2025-08-12 20:01:24 +03:00
geohotstan
ad9dec25b3
combine onnx parser and onnx ( #11485 )
...
* start
* more
* fix onnx_runner test
* pass
* patch for disk and add domains from huggingface
* simpler docs
* revert domain changes
* rerun ci
* revert onnx ops test change
* add fix from strenum stuff
* correct way
* revert correct way to leave the fix for another PR
* test segfault
* Revert "test segfault"
This reverts commit 4e1aaf41e7 .
* remove some unnecessary documentation
* test segfault again
* Revert "test segfault again"
This reverts commit 56fc5f03e7 .
* try gemini suggested patch for sys._getframe
* keep trying with gemini
* revert not working gemini suggestions and try faulthandler
* remove pythonfaulthandler
* trigger CI a few times
* minimize diff
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-08-12 12:56:39 -04:00
Sieds Lykles
4c3982c44e
Take sign out of mod ( #11631 )
...
* Add rule and test
* fix tests
2025-08-12 18:44:36 +02:00
qazal
e28605e324
rename profile point event fields [pr] ( #11633 )
2025-08-12 19:11:21 +03:00
nimlgen
8a7be0a747
metal: workaround for transfers sync issue ( #11622 )
...
* metal: workaround for transfers sync issue
* metal transfer sync is broken
* hm
* rm it?
* keep it
2025-08-12 16:16:34 +03:00
qazal
efe8b5611d
move ProfilePointEvent out of device.py [pr] ( #11630 )
...
Generic profiling events exist in helpers so they can be imported from
everywhere in tinygrad.
2025-08-12 09:58:32 +03:00
chenyu
0d7075f2de
assign should broadcast input tensor ( #11629 )
...
fixed test_assign_broadcast
2025-08-11 23:36:35 -04:00
Joshua Kissoon
c44760c89d
torch backend: fix arange, add linalg.cross, add tests ( #11628 )
2025-08-11 23:34:41 -04:00
George Hotz
ca41b5e38b
skip_0 in graph rewrite [pr] ( #11627 )
...
* skip_0 in graph rewrite [pr]
* no track_rewrites on test
* use dict instead of set
2025-08-11 18:29:04 -07:00
Sardor
ca7a641442
fix bugs at examples/yolov3.py ( #11614 )
...
* Update load_weight. Give valid model url
* Fix bug in iou function
2025-08-11 21:14:47 -04:00
chenyu
0c97d6de1b
don't round pow output for int pow int ( #11625 )
...
also added atol=0 and big pows for the tests
2025-08-11 20:57:47 -04:00
chenyu
d623f6d850
support int Tensor pow to const non-negative int ( #11624 )
...
matches torch
2025-08-11 19:50:19 -04:00
chenyu
857a830dcc
fix test_arange_float_step ( #11623 )
2025-08-11 16:58:42 -04:00
chenyu
0806677b51
rewrite sort idx ( #11613 )
2025-08-11 16:20:56 -04:00
George Hotz
700c11597b
switch contextvars.ContextVar to _ContextVar ( #11621 )
2025-08-11 12:20:09 -07:00
ttomsa
ae0c3cfff6
change clang -march flag to -mcpu on arm ( #10970 )
...
Co-authored-by: wozeparrot <wozeparrot@gmail.com>
2025-08-11 13:38:48 -04:00
geohotstan
27bcb9fd1c
Support cubic mode for ONNX Resize OP ( #11612 )
...
* start
* add reference
* this is so much slower
* this makes sense but differs from official impl, but results are still correct..?
* add a comment
* Just keep it simple for now since I don't fully get it yet
* address comments
* correct
* teeny clean up
* another small comment improvement lol
2025-08-11 11:49:30 -04:00
nimlgen
d2bb1bcb97
cloud: a bit better err handling ( #11616 )
...
* cloud: err propagation to client
* fix
* print exc
* linter
* excs
* fix
* hm
* flaky
2025-08-11 15:51:22 +03:00
qazal
6a232ccdac
viz: add tiny range drawing helper ( #11620 )
...
* viz: add tiny range drawing helper
* less
2025-08-11 15:15:43 +03:00
qazal
e768773e13
viz: use colors helper ( #11618 )
2025-08-11 13:10:15 +03:00
qazal
7d6c0a8cc7
viz: refactor progress msg ( #11617 )
2025-08-11 13:01:36 +03:00
chenyu
630edcffd8
remove .float calls in olmoe ( #11610 )
...
still matches torch
2025-08-10 20:33:22 -04:00
chenyu
a67e0917c3
list indexing can normalize in python ( #11609 )
...
* list indexing can normalize in python
list index does not need to be normalized in tensor
* update those
2025-08-10 20:02:38 -04:00
chenyu
1181ec0cd2
few more tensor indexing test cases ( #11608 )
2025-08-10 18:56:42 -04:00
George Hotz
996c907c0b
rewrite not ready + children machinery ( #11607 )
...
* rewrite not ready + children machinery
* it doesn't like track rewrites
2025-08-10 15:28:30 -07:00
Sieds Lykles
1875bc69f9
Late rewrite rules for CMPLT ( #11591 )
...
* add rules
* more rules
* fix comment spelling
* remove two rules
2025-08-10 22:18:13 +02:00
nimlgen
5403a4aeaf
null dev: support offset on buffers ( #11606 )
...
* null dev: support offset on buffers
* nolimit
2025-08-10 21:58:37 +03:00
geohotstan
b0dab6a4cd
onnx Resize OP clean up ( #11603 )
...
* start
* slight clean up
2025-08-10 14:10:39 -04:00
Sieds Lykles
10540414cd
Add Ops.CMPEQ ( #10431 )
...
* Add op
* add to GroupOp.ALU
* fix spec
* fix ptx
* temporary pickle by name to see process replay
* add Ops.EQ to binary ops
* Actually rename properly
* add test to assert CMPEQ is being used
* Ops.CMPEQ is automatic cast to bool
* add Ops.CMPEQ to llvm
* add Ops.CMPEQ to llvm
2025-08-10 13:13:16 +02:00
chenyu
f7aa1b85fe
minor sort cleanups ( #11602 )
2025-08-10 01:51:23 -04:00
chenyu
dfb702ef33
fix sort for small dim ( #11601 )
...
* fix sort for small dim
* fixed test_sort_empty
2025-08-10 01:17:41 -04:00
chenyu
ef17af85c6
remove .float call in llama logit ( #11598 )
...
* remove .float call in llama logit
* bfloat item
2025-08-10 00:02:18 -04:00
chenyu
dd3d2eb36c
add training llama3 test in ci ( #11599 )
2025-08-09 22:35:39 -04:00
chenyu
3e64467322
remove freqs_cis contiguous in llama ( #11597 )
2025-08-09 21:11:12 -04:00
chenyu
7338ffead0
small beautiful_mnist update ( #11596 )
...
gather is fast now. there's a conv/bw kernel that only gets fast with BEAM, but whole thing runs < 5 seconds now regardless
2025-08-09 19:51:14 -04:00
chenyu
45baec1aab
model parallel llama ( #11588 )
...
MP=8 GRADIENT_ACC_STEPS=3 BS=1 DEFAULT_FLOAT=bfloat16 OPTIM_DTYPE=bfloat16 LLAMA3_SIZE=70B SEQLEN=512 PYTHONPATH=. MODEL=llama3 python3 examples/mlperf/model_train.py
2025-08-09 16:54:27 -04:00
nimlgen
09bc377da3
search: print runtime failures on debug ( #11593 )
2025-08-09 23:01:19 +03:00
nimlgen
14f99ff1a1
amd: doorbell_cpu_addr is not used ( #11592 )
...
* amd: doorbell_cpu_addr is not used
* hm
2025-08-09 20:03:21 +03:00
Sieds Lykles
01c770c77b
Fix z3 float cast in indexing ( #11590 )
...
* adjust dtype of z3_renderer and add rule for cast
* dtypes.bool is also cast noop
* add regression test
* make embedding smaller
* even smaller test
2025-08-09 17:59:23 +02:00
Sieds Lykles
10d388499d
Refactor optional.py ( #11578 )
...
* move fast_idiv to transcendental
* move optional.py
* adjust comment
* change import
* mypy needs this?
2025-08-09 17:35:05 +02:00
nimlgen
20e46a175c
do not use disk with usb ( #11119 )
...
* not use disk with usb
* better name
2025-08-09 11:58:02 +03:00
qazal
53179953fc
viz: factor out memory graph render ( #11586 )
2025-08-08 20:18:11 +03:00
qazal
8ce72d3fad
simpler disassembly table spec ( #11583 )
...
* simpler disassembly table spec
* update ui
* move to scalar/vec render
2025-08-08 17:59:26 +03:00
qazal
44a222a9b2
viz: move resource usage summary to server ( #11582 )
2025-08-08 17:08:28 +03:00
qazal
793ace530e
update amd_uop_matmul.py import ( #11581 )
...
Using this for testing SQTT
2025-08-08 17:07:35 +03:00
chenyu
b232c60def
benchmark openpilot 0.9.9 ( #11575 )
...
* benchmark openpilot 0.9.9
not sure what to do with the 0.9.7 ones with IMAGE=2 and validate
* name
2025-08-08 01:26:14 -04:00
qazal
16f0edbe90
pass opts arg in get_program process replay [pr] ( #11571 )
...
* fix ptx process replay
* keyword arg
* renderer is also optional [pr]
* test_linearizer fixup
* name function order is args,ret,kwargs
* can use opts_to_apply
* pass through p.applied_opts
* sink_arg
* now it opens devices too
2025-08-08 03:05:09 +03:00
qazal
960cc6533a
pass through name function args in track_rewrites ( #11572 )
2025-08-08 02:28:52 +03:00
wozeparrot
1826004ef9
feat: add tinyos builder link ( #11570 )
2025-08-07 17:42:18 -04:00
George Hotz
82be8abfd2
move opt under codegen ( #11569 )
2025-08-07 14:19:17 -07:00
chenyu
702e38dc19
remove FUSE_ARANGE_UINT ( #11567 )
...
also add IGNORE_OOB=1 to bert runs. lowered BS on tinybox to 90 since 96 oom during eval without reset
2025-08-07 16:49:06 -04:00
George Hotz
6ed2dfd187
delete the arange dim mismatch restriction ( #11568 )
...
* delete the arange dim mismatch restriction
* skip that test race
2025-08-07 13:46:17 -07:00
wozeparrot
7ae4335127
feat: generate blend index ( #11566 )
2025-08-07 14:20:28 -04:00
chenyu
594cbdc66f
skip AM ResNet50 benchmark ( #11565 )
...
hanging with FUSE_ARANGE?
2025-08-07 14:07:01 -04:00
chenyu
aa1a6f2132
support threshold in Tensor.softplus ( #11564 )
...
fix gradient for large input
2025-08-07 13:43:18 -04:00
chenyu
7ee3770961
FUSE_ARANGE=1 ( #11427 )
...
* FUSE_ARANGE=1
* fix test
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-08-07 13:32:34 -04:00
George Hotz
4dfcfb1ae5
Revert "Revert "viz: align-center checkbox ( #11555 )""
...
This reverts commit c52facfd29 .
2025-08-07 08:15:57 -07:00
George Hotz
7e42427a7b
Revert "Revert "viz: remove color for unbind step ( #11554 )""
...
This reverts commit 5650c7b86c .
2025-08-07 08:15:51 -07:00
George Hotz
dc765fbeb7
Revert "viz: timeline perf ( #11533 )"
...
This reverts commit 031f26632b .
2025-08-07 08:08:51 -07:00
George Hotz
5650c7b86c
Revert "viz: remove color for unbind step ( #11554 )"
...
This reverts commit 1e205775bd .
2025-08-07 08:08:50 -07:00
George Hotz
c52facfd29
Revert "viz: align-center checkbox ( #11555 )"
...
This reverts commit 91ec093464 .
2025-08-07 08:08:50 -07:00
George Hotz
974cfbe76d
Revert "viz: add support for colored tooltip text ( #11556 )"
...
This reverts commit b3f7ea6f93 .
2025-08-07 08:08:49 -07:00
George Hotz
3bf0db80ef
Revert "viz: pick the largest rect for proxy fillColor ( #11558 )"
...
This reverts commit 76079bc7f2 .
2025-08-07 08:08:48 -07:00
George Hotz
9764c6cdee
fix mismatch reduce, try 2 ( #11560 )
...
* fix mismatch reduce, try 2
* fix heuristic
* delete that test
* don't start allowing ones
2025-08-07 07:57:58 -07:00
qazal
76079bc7f2
viz: pick the largest rect for proxy fillColor ( #11558 )
2025-08-07 16:40:17 +03:00
nimlgen
4f29a2c441
fix flaky test on macos ( #11557 )
2025-08-07 15:55:35 +03:00
qazal
b3f7ea6f93
viz: add support for colored tooltip text ( #11556 )
2025-08-07 15:04:43 +03:00
qazal
91ec093464
viz: align-center checkbox ( #11555 )
2025-08-07 14:22:02 +03:00
qazal
1e205775bd
viz: remove color for unbind step ( #11554 )
2025-08-07 14:16:21 +03:00
nimlgen
031f26632b
viz: timeline perf ( #11533 )
...
* viz: timeline perf
* progress
* fast
* less lines
* less lines
* less lines
* fix chrome
2025-08-07 13:16:17 +03:00
George Hotz
a1aa5670aa
Revert "fix mismatch reduce ( #11547 )" ( #11549 )
...
This reverts commit 49d21a9055 .
2025-08-06 22:43:15 -07:00