Commit graph

13,471 commits

Author SHA1 Message Date
George Hotz
3b43d26f10
assembly/amd: emu speed (#14344)
* assembly/amd: emu speed

* fix spec

* go

* don't do this

* simpler

* no stupid consts

* hack

* simpler

* no index

* no where

* faster linearizer

* fix spec

* no index dtype
2026-01-26 22:21:34 +08:00
George Hotz
774a454bb5
assembly/amd: fix scratch SVE (#14340)
* assembly/amd: default python REMU

* mem_used

* no lane

* sve

* remove that

* needs s_code_end in tests
2026-01-26 21:03:51 +08:00
qazal
2d91fe6310
use amdgpu dsl in mmapeak (#14342)
* use amdgpu dsl in mmapeak

* don't rely on llvm for vgpr counting

* llvm roundtrip assert

* rm it, add ci

* vgpr_count

* move emulated test to amd, it needs comgr

* env

* arch

* inst._fields -> inst.operands

* vgpr offset
2026-01-26 22:03:43 +09:00
qazal
b2e2ace85b
viz: remove ci check, it's VIZ=-1/-2 (#14343) 2026-01-26 20:36:23 +09:00
George Hotz
be23776ba7
assembly/amd: replace pcode with ucode (#14002)
* a bunch of todos for my boy claude

* uops have types

* lil cleanups

* simpler ucode

* isNAN

* calls

* move more

* cleanup pcode_parse

* cvt functions

* fix parser bugs

* no void

* minmax

* more pcode parse

* pretty print

* transform

* comments

* move to transform

* assign/declare

* simpler norm

* single PM

* just Uops

* simpler

* more typed

* all rewrite

* less verbose

* work

* spec

* transform

* work

* simpler spec

* less spec

* bitcast

* simpler

* simp ucode

* work

* more in pcode_transform

* remove junk

* more functions

* bug

* no void assign

* load/store

* wave

* fixes

* move denorm

* move more functions

* tests

* cat is shape None

* uop syntax

* move a few more

* program_spec

* cat stuff

* assign fix clear

* unused

* nans

* fp bits

* works with simplify

* remove junk

* special

* meh

* more

* more

* update test pcode parse

* improve parser

* parse some for loops

* merge master

* dead files

* tests pass

* emu2

* better emu2

* test_plus works

* uselessly write more instructions

* use pcode

* something

* something

* bench_emu

* progress

* ds works

* work

* work

* more passing

* run compare

* bench_emu

* more pcode

* a few more

* bugfixes

* bugfix

* test fixes

* tests pass without USE_HW

* all hw tests pass

* add more hw tests

* new hw tests

* bit

* less handcode

* parse more

* consolidate pcode

* fixes

* rsrc

* lane pcode

* cleanups

* simpler

* emu bugs

* one cmp test fails

* fix decode and upd name

* fix name and test harness

* _ftz_f32

* fix denorm

* fix VOPD and use load

* fix carry bug

* no load where / just invalid

* clean

* simpler

* merge sops

* refactoring

* simplifications

* bugfixes

* new tests

* f16 sin fix

* assertion and hw tests

* cvt functions

* one more failure

* bugfixes

* bugfix + regression

* more tests

* fmac

* no manual unrolling

* ordering

* LLVM backend is a lot faster

* compile inst

* more bugs

* f16

* bugfix

* fix regression

* one clang call

* 1M inst

* scratch works

* do scratch correctly

* cleanup

* regression

* cmp

* fmamk fixes

* merge

* fix vcmpx

* unify memory

* remove unused code

* ignore oob for test

* cleanups

* fix mbs

* unify cmp

* test

* minor cleanups

* bump timeout

* fix tests

* revert the CMPLE stuff

* remove opt

* less diff

* simpler

* revert

* support multiple backends

* memset is a lot faster

* split out in bench emu

* improve timing

* timing

* cache that

* cache that

* simpler and faster

* tokenize

* binop table

* simpler

* move to parser

* tok for lambda

* refactor

* expr_parser

* delete emu2_pcode

* import cleanup

* lil

* if parse

* work

* simpler

* no v

* trig preop is faster

* durations for tests

* fix cmp bug

* sdst

* remove scartch_size hack

* null behavior

* _MXCSRContext

* bugfixes

* DEBUG >= 3

* test smem crashes my gpu

* debug

* test

* test smem

* profiler

* full inst

* bugfix

* rtag(1)

* pc is 64-bit and word

* pc is real code now

* dynamic

* more dynamic

* fix oob access

* fix crash, more dyn

* all dyn

* really all dyn

* correct null mask

* lit + format

* 21s on the tests

* 13s on the tests

* canonical name

* simm16

* more dyn

* 14s

* proper saddr dedup

* dyn

* debug 5

* better 5

* revert dynamic stuff

* that can be dyn

* negative offsets

* dyn wmma

* f16 wmma support / ops / dtype / dtype_alu

* symbolic changes not needed

* ConstFloat

* more uop.const

* __eq__

* uop tests

* fix f16

* bf16 tensor cores

* whitespace

* remove cast roundtrip

* Revert "remove cast roundtrip"

This reverts commit c5bb0381c3.

* just the fix

* remove dead paths

* llvm runs
2026-01-26 18:04:29 +08:00
George Hotz
984cdc4840
add wrapper class for the -0.0 != 0.0 issue (#14339)
* add wrapper class for the -0.0 != 0.0 issue

* fixes

* spec fix

* missed one
2026-01-26 16:52:37 +08:00
qazal
92bfe92138
assembly/amd: fix cdna mfma xml (#14329)
* handwritten failing test

* new amdxml

* more mfma from fixes

* ci

* move arch of test integration

* alt

* amdxml human cleanup

* _TestIntegration rename to IntegrationTestBase

* it's the same problem as _LIT

* better comment

* better variable name
2026-01-26 17:51:26 +09:00
Garret Castro
6c109f4d75
LLVM: CPU threading support (#14320)
* make generic llvmrenderer class for cpu and amd

* move `tensor_cores` back to parent

* remove empty line

* restore extra matcher position

* add threading

* dont need to add core_id here

* dont move code for workitem

* cleanup

---------

Co-authored-by: TheVanadium <claude_user@ret2022.localdomain>
2026-01-26 13:12:39 +08:00
George Hotz
cc49e47ea2
tinygrad changes from ucode (#14336)
* tinygrad changes from ucode

* dtype
2026-01-26 11:30:18 +08:00
Garret Castro
8477368d07
generic LLVMRenderer class for CPU and AMD (#14321)
* make generic llvmrenderer class for cpu and amd

* move `tensor_cores` back to parent

* remove empty line

* restore extra matcher position

* cleanup

---------

Co-authored-by: TheVanadium <claude_user@ret2022.localdomain>
2026-01-26 09:11:49 +08:00
George Hotz
11ce1e847d llama train: null device support 2026-01-26 08:53:05 +08:00
chenyu
e3601788fa
update torch backend function (#14333)
those have tensor.py implementation
2026-01-25 16:39:34 -05:00
nimlgen
9865f51e39
cupti: ref collector (#14330)
* cupti: ref collector

* ll
2026-01-25 20:35:21 +03:00
nimlgen
21ab23ae18
nv: add pma for ada (#14328)
* nv: add pma for ada

* um

* fix

* shorter

* mock
2026-01-25 17:33:37 +03:00
George Hotz
49db266b96
ReprEnum for repr roundtrips (#14327)
* ReprEnum for repr roundtrips

* dsl

* bugfixes

* vdsty fixes

* cleaner

* fix

* fix cdna fields

* tests all pass
2026-01-25 18:58:31 +08:00
qazal
bf2d9d138f
viz: simplify amdgpu cfg (#14326)
* viz: replace llvm disasm with our disasm

* it starts with more code

* then it becomes less

* simpler, cdna disassembles with decimal simm16

* s_branch is upper case, add test

* simm16s and others
2026-01-25 15:21:45 +09:00
qazal
647e527a7e
viz: replace llvm disasm with our disasm (#14325) 2026-01-25 13:56:56 +09:00
nimlgen
4280a8eef2
am: update fw (#14323) 2026-01-25 01:08:47 +03:00
chenyu
7e41da1ae8
fix generate_dataset.sh (#14324)
added `set -e` so wrong pathes would fail the script, then fixed the path
2026-01-24 16:47:10 -05:00
chenyu
311bfd91d6
clean up where_on_load [pr] (#14322)
no repeated split_uop and general cleanup
2026-01-24 14:43:43 -05:00
nimlgen
8b282ba6d2
memory: reserved vram (#14318) 2026-01-24 19:39:24 +03:00
chenyu
00e9ba0b82
update type for split_uop and where_on_load [pr] (#14319)
also variable names in where_on_load, before logic update
2026-01-24 11:17:41 -05:00
chenyu
cb69b7b2b2
comment out fold_where_closure (#14316) 2026-01-24 10:15:42 -05:00
wozeparrot
d74587f16d
fa multi fix 2 (#14314) 2026-01-23 23:35:02 -08:00
chenyu
d9f0ad1d87
update return type for Tensor.tolist (#14313)
since sequence is incorrect since it can be list of list, use Any to avoid recursive type
2026-01-23 23:21:49 -05:00
qazal
807bc40931
assembly/amd: dsl and disasm cleanup (#14311)
* rdna4 inst helper

* remove dsl aliases
2026-01-24 11:36:12 +09:00
Christopher Milan
e782d44918
WEBGPU/NIR truncates ints (#14307)
* WEBGPU truncates ints

* nir has this bug too
2026-01-23 19:28:06 -05:00
nimlgen
26220a472e
no core_id (#14265)
* no core_id

* kwargs

* est

* linters

* ugh

* revert this

* deps

* glb

* should work?

* nn

* line

* fx

* ym

* z

* d

* um?

* revert

* this one?

* first half

* um p2

* all?

* um

* cleaner

* um
2026-01-23 21:30:12 +03:00
chenyu
e65bc7a7c5
where closure folding (#14304) 2026-01-23 10:55:13 -05:00
chenyu
d5a3b02a9c
clean up xpow (#14295)
mostly for `ret * (base < 0).where(adj, ret.const_like(1))` -> `(base < 0).where(neg_base, ret)`, since it's good for NAN neg_base but not generic
2026-01-23 10:19:47 -05:00
qazal
b913c910c5
assembly/amd: rdna4 passing test_roundtrip (#14300)
* test_roundtrip on different archs

* failing tests

* take RDNA4 xml changes from the emu branch

* work

* min diff to disasm flat

* test_add passes, rdna4 first

* correct vgpr field for the multi dword store stuff

* amdllvm

* recompile in roundtrip, get sources from emulator

* amdllvm, 2

* clean clean

* note, don't rely on that os.environ

---------

Co-authored-by: George Hotz <geohot@gmail.com>
2026-01-23 21:33:53 +09:00
qazal
f3b0e42863
remove extra sqtt pickles in gfx1200 (#14302) 2026-01-23 20:13:48 +09:00
George Hotz
d116312b1a
get cdna sqtt working (#14301)
* get cdna sqtt working

* cnd aprser

* wavestart/waveend

* names

* cdna

* test that
2026-01-23 18:46:15 +08:00
George Hotz
a5c4fa39d1
RDNA4 support in SQTT (#14299)
* table test

* cleanups

* dead file

* delta short

* tests

* delta test

* work

* l4 tests pass

* l0

* cnda

* print

* reverT

* wave failure

* wave failure

* test

* encs

* no l0 crap

* L4

* rdna4 sqtt

* notes

* linter
2026-01-23 16:16:45 +08:00
wozeparrot
963c59ebdb
fix: pull fixes from gradacc branch (#14296) 2026-01-22 23:07:54 -08:00
Christopher Milan
68668b8f28
fix WEBGPU NEG (#14298)
* fix WEBGPU NEG

* add test

* parenthesize
2026-01-23 01:44:52 -05:00
qazal
3b8a7bb8c9
use existing roc.py infra for sqtt tests (#14297)
* add pc, per kernel tracing

* work

* remove those imports

* min diff
2026-01-23 14:07:11 +09:00
chenyu
5f32f7a06b
fix winograd padding order (#14294) 2026-01-22 23:00:14 -05:00
George Hotz
52b989c6c8
don't place consts early + fixes from anthropic challenge (#14286)
* don't place consts early

* add anthropic challenge

* with ref

* do we still have to devectorize bools?

* tests pass

* just WHERE

* fine, revert that

* fine, revert

* only index

* z3 validator doesn't support vectorized

* Revert "z3 validator doesn't support vectorized"

This reverts commit 1b7930ecb3.

* z3 not for vec

* no spec

* VLIWRenderer

* loop unrolling

* better comments

* cleanups

* skip cast

* renderer

* cleanups

* prints

* no hack

* hacks

* bump to 11

* reg warning

* lil clean

* cleaner renderer
2026-01-23 10:48:39 +09:00
chenyu
0903782bc0
remove few dead or unneeded codes [pr] (#14275) 2026-01-22 20:05:43 -05:00
chenyu
3eb5cd7d32
stronger test_rand_is_lazy (#14293) 2026-01-22 18:58:53 -05:00
chenyu
c15b6e6709
update test_randn_finite skipped device (#14292) 2026-01-22 18:26:02 -05:00
chenyu
073c6a81b5
raise if Tensor._buffer is called during jit (#14114)
* raise if Tensor._buffer is called during jit

* cleaner
2026-01-22 17:30:18 -05:00
nimlgen
8cd22df2dd
amd: alive wgps (#14149)
* amd: disabled wgps

* l

* wgp

* uoops

* mockgpu

* drm

* ad this

* fi

* reg
2026-01-23 00:08:45 +03:00
chenyu
a738c4bb22
test symbolic view broken with jit (#14290) 2026-01-22 13:44:47 -05:00
chenyu
f22fa6a5be
test rand is lazy (#14289) 2026-01-22 13:07:55 -05:00
chenyu
1726b884f2
update test_jit_v_nojit_random_regen (#14288)
current behavior is that jit and non-jit consume random seed differently, still the random values are different
2026-01-22 12:21:47 -05:00
chenyu
fbed36fa15
jit graph handle input==output aliasing (#14287)
a position that wasn't an input during capture should never become an input during execution, but graph cannot tell this by jit_cache and input_buffers only
2026-01-22 11:37:41 -05:00
chenyu
8bb61c2490
stronger test_graph_input_output_aliasing (#14282)
* stronger test_graph_input_output_aliasing

* comfirmed failure
2026-01-22 09:59:34 -05:00
qazal
d7afa02085
clean up the extra/sqtt directory (#14284)
* remove legacy test_timing stuff

* remove legacy test_pmc, update active_sqtt_parse
2026-01-22 19:10:59 +09:00