257 commits

Author SHA1 Message Date
wozeparrot
5164c21b44
gemm: keep shape thru mxfp8 quantize (#16692) 2026-06-20 22:28:53 -07:00
George Hotz
649971f02a
remove DEFINE_LOCAL and DEFINE_REG (gpt) (#16673)
* remove define_local and define_reg (gpt)

* fix precommit

* cleanups

* regalloc fix

* cleanups 2
2026-06-19 10:07:50 -07:00
wozeparrot
36f6d1b064
gemm: fix bf16 atb for mp sharding (#16637) 2026-06-16 15:58:47 -07:00
qazal
f998b9930a
fp8 gemm inv_scale in epilogue (#16625)
* fuse scale

* remove python inv_scale

* more inv_scale removal

* more cleanups

* cleaner

* diff polish

* work

* rename

* simpler

* simpler

* compute

* c

* Revert "c"

This reverts commit 8941fec7ca.

* Revert "compute"

This reverts commit 9db573a6d3.

* Revert "simpler"

This reverts commit 910ad33f87.

* Revert "simpler"

This reverts commit bf75d235a1.

* s_g

* update types

* less diff noise

* remove
2026-06-15 18:44:41 +09:00
wozeparrot
67a4f129c2
llama: fix bf16 gemm oob (#16603) 2026-06-12 19:43:05 -07:00
wozeparrot
e770805d21
llama: mxfp8 (#16574) 2026-06-11 22:15:24 -07:00
wozeparrot
c38d6a7e3a
mxfp8 part 2 (#16561) 2026-06-10 23:36:11 -07:00
wozeparrot
2bdc360606
gemm: mxfp8 hipkittens gemm (#16541)
* gemm: mxfp8 hipkittens gemm

* feat: update hipkittens

* feat: kernel signature

* clean: just kernel

* feat: from tinygrad

* feat: test

* fix: add back utils

* clean: no diff

* clean: no diff
2026-06-09 15:20:05 -07:00
wozeparrot
a1ec32cfd2
llama: current grad scaling (#16518) 2026-06-05 15:39:41 -07:00
qazal
3b1a5f9770
llama: a_bT and aT_b bf16 gemms (#16487)
* hk_bf16_gemm

* enable in 8b

* cleanups

* rename to USE_HK_BF16_GEMM

* work

* work

* work

* work

* change the gemms

* work

* work

* set as default

* work

* change
2026-06-04 23:30:21 +09:00
wozeparrot
7dcfd144b6
llama: columnwise fp8 scaling (#16480) 2026-06-02 18:55:45 -07:00
George Hotz
ffadd7a315
remove intel and amx support (#16482) 2026-06-02 18:53:05 -07:00
George Hotz
58d58c1659
remove DEVECTORIZE (#16290)
* remove DEVECTORIZE

* fully remove DEVECTORIZE
2026-05-20 13:25:49 -07:00
qazal
bfb2d1f89a
Revert "fp8 gemm speedup (#16236)" (#16245)
This reverts commit d95bf394e1.
2026-05-19 02:01:44 +09:00
qazal
d95bf394e1
fp8 gemm speedup (#16236)
* add asm_gemm option

* milestone

* work

* edit

* only the fast kernel

* diff
2026-05-17 04:58:28 +09:00
wozeparrot
e97f2c1114
llama: only gemm + fa custom kernel (#16180)
* llama: tie store to grad directly

* llama: set mp flags

* llama: non fused grad fp8 quantize path
2026-05-12 21:03:49 -07:00
wozeparrot
730fa66bf3
llama speed 6 (#16071) 2026-05-06 20:51:03 -07:00
wozeparrot
528d35e306
llama speed 4 (#15993) 2026-04-30 17:14:41 -07:00
nimlgen
dfd2d07005
remove CompiledRunner (#15970)
* rm usage of CompiledRunner

* more tests

* last

* linter

* sink

* remove

* linter
2026-04-29 22:45:48 +03:00
qazal
a37b605523
remove arch from asm kernel class (#15977)
* rm arch from kernel

* update other tests

* update abstractions4.py
2026-04-30 03:39:52 +09:00
wozeparrot
ef09071073
llama: speed 2 (#15960) 2026-04-28 20:44:37 -07:00
nimlgen
77965a22e5
local optimize as rewrite (#15953)
* local optimize as rewrite

* better

* x

* slightly rename

* fix

* ugh

* remove

* x

* remove

* not weak
2026-04-28 22:51:04 +03:00
nimlgen
4164666c72
programinfo (#15942)
* programinfo

* fix

* m

* x

* x

* changes

* x

* fix

* rm
2026-04-27 23:12:03 +03:00
nimlgen
bb652352c7
remove execitem (#15932)
* remove execitem

* f

* x
2026-04-25 19:33:04 +03:00
nimlgen
768106a542
remove schedule from extra/docs/examples (#15929)
* remove schedule from extra/docs/examples

* f
2026-04-25 14:09:12 +03:00
nimlgen
f2751955cb
remove linear_to_schedule from tests (#15912)
* remove linear_to_schedule from tests

* x
2026-04-24 20:02:10 +03:00
chenyu
9192c93b7e
Tensor.invalid -> Tensor.invalids (#15849)
matches ones and zeros, and to not share name with UOp.invalid
2026-04-21 11:19:51 -04:00
nimlgen
bfe28ee2ad
rm run_schedule (#15847) 2026-04-21 18:14:30 +03:00
wozeparrot
9e60e4a7e7
llama: native fp8 (#15733) 2026-04-16 22:16:05 -07:00
qazal
12c653a743
remove opts arg in get_program, everything uses opts_to_apply [pr] (#15767)
* check Ops.BEAM in process replay

* remove opts from the get_program api

* lint

* simplify

* cleanup
2026-04-16 22:42:43 +03:00
chenyu
3394d18066
size*itemsize -> nbytes (#15729)
and some UOp.size removal to prep for size to mixin change
2026-04-14 16:27:54 -04:00
wozeparrot
55bcd7cc9e
llama amax outside (#15670) 2026-04-09 23:08:03 -07:00
George Hotz
48a7627b04
add RDNA4 support to copy WMMA (#15663)
* add RDNA4 support to copy WMMA

* simpler

* simpler

* comment

* assert
2026-04-09 22:48:20 +08:00
George Hotz
1ebeb52e59
RDNA4 asm gemm (#15427)
* sqtt: rdna4 decoder work

* diff cleanup

* more diff

* test

* 125

* r4

---------

Co-authored-by: qazal <qazal.software@gmail.com>
Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2026-04-08 21:26:44 +08:00
wozeparrot
70dbd35023
llama: move custom_kernel into flat_llama (#15643) 2026-04-08 00:19:14 -07:00
wozeparrot
7e54992bf6
fp8 llama (#15588)
Co-authored-by: qazal <qazal.software@gmail.com>
2026-04-04 18:24:57 -07:00
Christopher Milan
0ed8d9271d
Renderers accept Target or nothing (#15590) 2026-04-03 01:09:41 -04:00
qazal
fefb0ebc2a
gemm/asm: fp8 cleanups (#15580)
* normal gemm here

* s/dtypes.fp8e4m3/FP8_DTYPE

* gemm_bw

* device UOp stays NULL
2026-04-02 19:02:38 +09:00
chenyu
1aa04eab08
simple CreationMixin (#15567)
start with full_like, zeros_like, ones_like
2026-04-01 23:00:56 -04:00
qazal
8feb8edc68
gemm/asm: add fp8 support to cdna asm_gemm (#15542)
* work

* hmm, mixins

* rhs_transposed

* also fix the dtype

* check for hipcc

* Exception

* select dev

* default
2026-03-31 19:32:54 +09:00
George Hotz
85dee83f5d
amd flash attention cleanups + emulator fixes (#15431)
* amd flash attention cleanups

* simpler

* params

* fix emulator bugs

* fix idiv bug

* remove that test

* more emu fixes
2026-03-24 10:10:46 +08:00
George Hotz
c62dea6881
ai slop flash attention (it works) (#15401)
* ai slop flash attention (it works)

* speed up, 2 TFLOPS + 7 GB/s

* simpler

* simpler

* optimize

* faster

* warp shuffle

* sqtt: link dispatch to exec (#15396)

* sqtt packet linking infra

python

* javascript

* ~doubly linked list

* ui works

* work

* exec can also highlight the pc, coloring work

* more work

* rm sqtt/model.py, doesn't need to be upstreamed

* viz: no context enters in cli, update llama profile (#15404)

* removed unused named arg in rules [pr] (#15414)

* viz: sqtt printer in viz/cli.py (#15411)

* work

* sqtt timeline in CLI

* format all printers nicely

* s/Showed/Printed

* ansistrip

* sys.exit

* keep colors in list

* work from amd_copy_matmul

* has_more always gets returned

* linter

* don't print colors

* more colors

* wow this is so deep

* work

* minor details

* selected

* improve progress bar

* remove it

* 22, global_load_vaddr is so long

* remove *0 hack in sign, gradient materializes zeros for unconnected nodes (#15416)

Amp-Thread-ID: https://ampcode.com/threads/T-019d1612-6322-706b-a94d-a812400a55cb

Co-authored-by: Amp <amp@ampcode.com>

* works

* cnt=20

* revert that

* uop slice tests

* simpler

---------

Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
Co-authored-by: gg <ggordbegli@gmail.com>
Co-authored-by: Amp <amp@ampcode.com>
2026-03-23 16:15:10 +08:00
George Hotz
c13d9d29ff
add SHAPED_WMMA (#15400)
* add SHAPED_WMMA

* shaped wmma

* less bad
2026-03-21 16:16:03 +08:00
George Hotz
41a9b09683
minimal vec in amd_copy_matmul (#15398)
* minimal vec in amd_copy_matmul

* unified

* unify

* reshape/permute

* cleanups

* simpler

* move index

* cleanups

* more shared
2026-03-21 14:57:21 +08:00
George Hotz
1a2a203f48
add wmma support to amd_copy_matmul (#15384)
* add wmma support to amd_copy_matmul

* 15 TFLOPS and merged

* unify

* simpler

* simpler

* simpler

* cleanups

* TM/TN is the full regs

* comments

* WAVES_PER_SH + SQTT_EVENT

* Add WAVERDY support

* no split warp

* 3 range
2026-03-20 19:02:19 +08:00
chenyu
da1700e16b
dtypes.index -> dtypes.weakint (#15377) 2026-03-20 01:08:46 -04:00
George Hotz
4091d37e8e
flat llama step work (#15355)
* flat llama step work

* fp8 support

* blacklisted matmul

* chestertons fence
2026-03-20 09:06:12 +08:00
George Hotz
6e196195d8
add test for flat llama (#15327)
* add test for flat llama

* simpler

* back to split w1/w3

* env

* still too much ram

* invalid
2026-03-18 15:16:33 +08:00
qazal
5cd1daa3bc
cdna asm_gemm in one file, remove old rdna3 asm (#15281) 2026-03-16 04:32:30 +09:00
George Hotz
06d7cddb33
amd_copy_matmul is cleaner (#15248)
* amd_copy_matmul is cleaner

* it runs

* replicated stuff

* add tid there

* it runs

* cleanup

* x.src[1]

* flatten

* move that

* keep that assert
2026-03-14 12:56:09 +08:00