12,738 commits

Author SHA1 Message Date
George Hotz
e401546bf0 deepseek work 2026-03-25 10:53:39 +08:00
George Hotz
a33ac869aa
llm server: temperature + test client (#15444)
* improvements to the llm server

* eval script

* eval llm

* better eval gets 58.71

* cleanups

* add temperature, but multinomial is absurdly slow

* claude is so smart

* lint

* remove slop

* no more stop
2026-03-24 21:07:15 +08:00
nimlgen
9db5d677c7
jit in viz (#15447) 2026-03-24 18:23:53 +08:00
Christopher Milan
2e4fbbcc9c
ir3: fix texture mapping and benchmark (#15443) 2026-03-24 04:52:54 -04:00
Christopher Milan
d5320a9ddf
QCOM cleanups (#15435) 2026-03-23 22:18:38 -04:00
George Hotz
85dee83f5d
amd flash attention cleanups + emulator fixes (#15431)
* amd flash attention cleanups

* simpler

* params

* fix emulator bugs

* fix idiv bug

* remove that test

* more emu fixes
2026-03-24 10:10:46 +08:00
chenyu
018a9e2d3c
remove match_dtype arg in Tensor._broadcasted (#15440)
reworked Tensor.where to not need it, also updated dtypes.from_py to use isinstance because ConstFloat issues
2026-03-23 22:10:39 -04:00
qazal
a590eded87
sqtt: rdna4 decoder work (#15434)
* sqtt: rdna4 decoder work

* diff cleanup

* more diff

* test

* work

* works

* TS_DELTA_SHORT
2026-03-24 03:49:32 +09:00
qazal
109472c37e
sqtt: new s_barrier pickles, handle rdna4 barriers in emulator (#15437) 2026-03-24 03:25:28 +09:00
nimlgen
fa4cdb422e
memplan on linears (#15422)
* memplan

* test

* x

* arenas

* correct

* set any size

* ugh

* make hevc happy

* x

* x

* held

* rm old

* del

* x

* fu

* f

* cl

* cl

* ok
2026-03-23 19:50:16 +08:00
nimlgen
2da008ae3b
jit: rm replan (#15433) 2026-03-23 19:31:51 +08:00
qazal
c4c53418f8
sqtt: comment out flaky rocprof timestamp assert (#15432)
* comment out rocprof assert, add new assert

* better than > 0 assert

* string
2026-03-23 19:24:04 +09:00
chenyu
66a86f88a0
simpler Tensor._broadcasted inferred dtype (#15430) 2026-03-23 05:20:11 -04:00
Pham Nguyen Hung
c89576921d
Updated the APIs of mnist_gan (#15429)
Co-authored-by: Hung Pham <pnhung1703@gmail.com>
2026-03-23 17:04:00 +08:00
George Hotz
c62dea6881
ai slop flash attention (it works) (#15401)
* ai slop flash attention (it works)

* speed up, 2 TFLOPS + 7 GB/s

* simpler

* simpler

* optimize

* faster

* warp shuffle

* sqtt: link dispatch to exec (#15396)

* sqtt packet linking infra

python

* javascript

* ~doubly linked list

* ui works

* work

* exec can also highlight the pc, coloring work

* more work

* rm sqtt/model.py, doesn't need to be upstreamed

* viz: no context enters in cli, update llama profile (#15404)

* removed unused named arg in rules [pr] (#15414)

* viz: sqtt printer in viz/cli.py (#15411)

* work

* sqtt timeline in CLI

* format all printers nicely

* s/Showed/Printed

* ansistrip

* sys.exit

* keep colors in list

* work from amd_copy_matmul

* has_more always gets returned

* linter

* don't print colors

* more colors

* wow this is so deep

* work

* minor details

* selected

* improve progress bar

* remove it

* 22, global_load_vaddr is so long

* remove *0 hack in sign, gradient materializes zeros for unconnected nodes (#15416)

Amp-Thread-ID: https://ampcode.com/threads/T-019d1612-6322-706b-a94d-a812400a55cb

Co-authored-by: Amp <amp@ampcode.com>

* works

* cnt=20

* revert that

* uop slice tests

* simpler

---------

Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
Co-authored-by: gg <ggordbegli@gmail.com>
Co-authored-by: Amp <amp@ampcode.com>
2026-03-23 16:15:10 +08:00
qazal
1568a5ed07
viz: show dispatch to exec delay in sidebar (#15428) 2026-03-23 16:59:59 +09:00
Christopher Milan
ddaeebb500
nir: add shift support (#15426) 2026-03-23 03:37:44 -04:00
nimlgen
c74fa9bbe1
fix jitbeam not triggered (#15424)
* um

* beam

* x

* f
2026-03-23 15:34:59 +08:00
qazal
fd3559103b
viz/cli: better error message for empty itrace (#15425) 2026-03-23 15:50:20 +09:00
nimlgen
395aacd77d
jit: prune on linear (#15423)
* jit: prune on linear

* x

* this is from the future
2026-03-23 14:10:34 +08:00
chenyu
248cd9b39f
make Tensor init the only caller of Tensor.from_uop (#15421)
* make Tensor init the only caller of Tensor.from_uop

prep broadcast cleanups

* type
2026-03-23 00:29:08 -04:00
chenyu
67dcc79fdd
push Tensor(symbolic) logic to Tensor.from_uop (#15420) 2026-03-22 23:49:35 -04:00
gg
2087df814f
remove *0 hack in sign, gradient materializes zeros for unconnected nodes (#15416)
Amp-Thread-ID: https://ampcode.com/threads/T-019d1612-6322-706b-a94d-a812400a55cb

Co-authored-by: Amp <amp@ampcode.com>
2026-03-22 12:49:26 -04:00
qazal
c7b18e6108
viz: sqtt printer in viz/cli.py (#15411)
* work

* sqtt timeline in CLI

* format all printers nicely

* s/Showed/Printed

* ansistrip

* sys.exit

* keep colors in list

* work from amd_copy_matmul

* has_more always gets returned

* linter

* don't print colors

* more colors

* wow this is so deep

* work

* minor details

* selected

* improve progress bar

* remove it

* 22, global_load_vaddr is so long
2026-03-23 00:17:05 +09:00
chenyu
bcc08307da
removed unused named arg in rules [pr] (#15414) 2026-03-22 09:25:46 -04:00
qazal
2363bceb47
viz: no context enters in cli, update llama profile (#15404) 2026-03-22 05:47:02 +09:00
qazal
a9ceaf3c5f
sqtt: link dispatch to exec (#15396)
* sqtt packet linking infra

python

* javascript

* ~doubly linked list

* ui works

* work

* exec can also highlight the pc, coloring work

* more work

* rm sqtt/model.py, doesn't need to be upstreamed
2026-03-21 23:48:58 +09:00
nimlgen
9656d97d97
jit: captures linears, not execitems (#15399)
* jit: captures linears, not execitems

* x

* um

* tests

* mockcuda
2026-03-21 16:32:12 +08:00
George Hotz
c13d9d29ff
add SHAPED_WMMA (#15400)
* add SHAPED_WMMA

* shaped wmma

* less bad
2026-03-21 16:16:03 +08:00
George Hotz
41a9b09683
minimal vec in amd_copy_matmul (#15398)
* minimal vec in amd_copy_matmul

* unified

* unify

* reshape/permute

* cleanups

* simpler

* move index

* cleanups

* more shared
2026-03-21 14:57:21 +08:00
qazal
30b3054fd5
whitespace cleanups in viz and sqtt.py (#15395) 2026-03-21 04:46:19 +09:00
qazal
71ccc69c52
FP8=1 llama works again, hipcc can run on macos (#15394)
* hipcc macos shim

* is_dtype_supported opens devices less
2026-03-20 23:43:15 +09:00
Christopher Milan
9470d5193a
deterministic decomp apply order (#15393) 2026-03-20 08:10:45 -04:00
Christopher Milan
376585b003
use should_emulate for target dtype in decomp (#15392) 2026-03-20 07:44:57 -04:00
Christopher Milan
a12d3951de
fix test_export_model imports (#15389) 2026-03-20 07:27:01 -04:00
George Hotz
1a2a203f48
add wmma support to amd_copy_matmul (#15384)
* add wmma support to amd_copy_matmul

* 15 TFLOPS and merged

* unify

* simpler

* simpler

* simpler

* cleanups

* TM/TN is the full regs

* comments

* WAVES_PER_SH + SQTT_EVENT

* Add WAVERDY support

* no split warp

* 3 range
2026-03-20 19:02:19 +08:00
Christopher Milan
1560b534a5
remove IMAGE=2 (#15312) 2026-03-20 06:26:52 -04:00
Christopher Milan
30d609432f
ci: only xcode-select for gpuocelot on macos (#15387) 2026-03-20 05:58:16 -04:00
chenyu
d1b4e37dfa
remove InvalidType branch in Tensor.__init__ (#15386)
it's handled by `elif isinstance(data, get_args(ConstType)):` already
2026-03-20 05:32:33 -04:00
chenyu
c491345766
pass device into Tensor._frompy (#15385)
* pass device into Tensor._frompy

with this, canonicalize_device is the only usage of Device in tensor.py

* export_model.py
2026-03-20 05:09:01 -04:00
George Hotz
3b75d8a7a2
fix double after bug in rangeify (#15381) 2026-03-20 14:53:46 +08:00
Christopher Milan
0c89340a1e
automatically emulate unsupported (tiny) floats [skip_process_replay] (#15366) 2026-03-20 02:31:44 -04:00
George Hotz
78ad089817
make precompile the default for llm (#15376)
* make precompile the default for llm

* works

* empty is okay for kvcache

* fix cache misses

* more tests
2026-03-20 14:08:55 +08:00
chenyu
459ef41ea0
don't exclude weakint in is_dtype_supported [pr] (#15378) 2026-03-20 02:08:29 -04:00
qazal
cf6a429aaa
mypy emulator pre-commit passing (#15379)
* fix dict stuff

* add type: ignores

* fix pcode to put uops not ints
2026-03-20 14:44:09 +09:00
wozeparrot
87c4ec1724
llama: use flat llama (#15353) 2026-03-19 22:12:38 -07:00
chenyu
da1700e16b
dtypes.index -> dtypes.weakint (#15377) 2026-03-20 01:08:46 -04:00
nimlgen
3b04e3ea28
no gmmu mappings with GMMU=0 (#15369)
* usb

* free

* simple gmmu=0

* x

* x

* vram

* init tests

* ppg

* x
2026-03-20 12:18:34 +08:00
ridoy majumdar
c1183b8872
remove dead code in pyrender (#15115)
* remove dead code in pyrender

* retrig CI

* retrig CI

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2026-03-19 23:59:56 -04:00
chenyu
bf33c5f796
remove gradient materialize_grads (#15367)
effectively default to True

and removed *0 hack in Tensor.copysign. now dy/dx=0 if y does not depend on x

remove
2026-03-19 23:36:03 -04:00