Commit graph

63 commits

Author SHA1 Message Date
gswangg
df44a4e861
Make vectorization of CONST explicit (#5322)
* remove test_const_vectorize_fold

* remove const folding UPat for VECTORIZE

* refactor cstyle render_const

* remove calls to dtype.scalar() in render_const

* add assert

* add vectorized const to UOp.const

* add UPat GEP-VECTORIZE-CONST -> CONST

* render_vectorize for DEFINE_ACC in cstyle

* add back missing render_cast in render_const

* generate vectorized consts as UOps for DEFINE_ACC

* update asserts for DEFINE_ACC with VECTORIZE src

* add UPats for PHI with VECTORIZE src

* use prev rendered vectorize in DEFINE_ACC render

* update DEFINE_ACC in python runtime

* update vectorized DEFINE_ACC in PTXRenderer

* rebase DEFINE_ACC changes on lowerer

* verbose rewrite of bad UPats

* simplify UOps.CONST implementation in ops_python

* update sum_collapse UPats for DEFINE_ACC-VECTORIZE

* revert linearizer to TOT

* fix DEFINE_ACC implementation in ops_python

* simplify DEFINE_ACC in cstyle

* Fix linter error

* support VECTORIZE in fold gated load/store UPat

* support VECTORIZE in other fold gated load UPats

* rewrite VECTORIZE in UPat for no input DEFINE_ACC

* simplify DEFINE_ACC render in cstyle

* make VECTORIZE rules more concise

* add more vectorize fold tests

* inline VECTORIZE-CONSTs in cstyle render

* revert VECTORIZE/GEP rule refactor

* revert cstyle render_const refactor

* inline VECTORIZE-CONSTs in cstyle render

* implicitly vectorized const rendering -> explicit

* WMMA VECTORIZE CONST process replay hacks

* VECTORIZE CONST NAN process_replay hacks

* more VECTORIZE CONST NAN hacks

* cleanup process_replay hacks

* isnan() -> not isfinite() cstyle VECTORIZE CONST

* tweak isnan and isfinite checks VECTORIZE CONST

* tweak for positive vs negative infinity VECTORIZE CONST

* add assert to PTX CONST render

* process_replay VECTORIZE CONST render parity for PTX STORE

* vmin/vmax for VECTORIZE'd CONST

* update WMMA folding rules

* add tests for WMMA VECTORIZE fold

* hack for cstyle half4 CONST zero process_replay parity

* revert PTX backend changes

* add back minimal DEFINE_ACC PTX change

* remove cstyle process_replay hacks

* remove dead code in PTX CONST render

* cleanup vmin/vmax logic for VECTORIZE'd CONSTs

* update vectorize fold tests to use DEFINE_VAR

* fix long line formatting in test

* remove unwanted merge artifact

* more vmin/vmax cleanup

* remove unnecessary asserts

* yet more vmin/vmax cleanup

* get rid of explicit VECTORIZE CONST logic in _min_max

* reuse CONST instead of creating a new one

* remove unneeded cast

* handle DType correctly in sconst

* improve readability of tests

* save a line

* save another line

* tuplize pats in src

* remove GEP-VECTORIZE pats

* add vec +0 fold

* HACK: fold only vec8 +0

* remove vectorized ALU fold hack

---------

Co-authored-by: qazal <qazal.software@gmail.com>
Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2024-08-08 20:59:05 +03:00
George Hotz
d73bc85ba9
UOpGraph not in renderer or Program [run_process_replay] (#5867)
* UOpGraph not in renderer or Program [run_process_replay]

* fix some tests

* fix ptx
2024-08-01 16:20:30 -07:00
George Hotz
693990a346
swap src[2] and src[3] in load [run_process_replay] (#5821)
* swap src[2] and src[3] in load [run_process_replay]

* cleanups + bugfix

* fix ptx
2024-07-30 14:04:13 -07:00
George Hotz
4df46eac67
clean up tensor cores [run_process_replay] (#5736)
* clean up tensor cores [run_process_replay]

* remove tuple(wmma_sz), self.opts.device

* remove tls, leave DEVICE
2024-07-26 13:21:23 -07:00
chenyu
16c27ae400
update UOp.SPECIAL arg spec [run_process_replay] (#5661)
* update UOp.SPECIAL arg spec [run_process_replay]

from `(0, "gid0", 4)` to just `("gid0", 4)`. closer to a Variable

* fix ptx
2024-07-23 16:58:12 -04:00
chenyu
fdc72ba102
reorder UOps.DEFINE_VAR in runtime [run_process_replay] (#5659)
prep rewrite SPECIAL using DEFINE_VAR
2024-07-23 14:32:10 -04:00
George Hotz
d13654a820
move uopgraph to file [run_process_replay] (#5364)
* move uopgraph to file [run_process_replay]

* fix print tree test
2024-07-10 17:34:50 -07:00
George Hotz
6972a2569f
Linearizer -> Lowerer (#4957)
* st to uops function

* lowerer

* uops reduce

* uops reduce

* acc_number correct

* reduce unroll

* complete unroll

* do upcasts

* handle multioutput

* define_accs

* fix valid

* get grouped dims

* revert lin

* minor

* fixup_ast

* group for reduce

* group works now

* all forwards pass

* all ops tests pass

* fix clang

* mypy

* lil cleanups, no image yet

* ugh, variables everywhere

* bugfix

* counters and name fix

* use symbolic, not uops

* cleanups

* Fix tests

* linearizer tests

* expands

* float4 expand load

* tests pass

* woooo, float4 test

* test ops works again

* one more lin test

* more lin tests

* bypass

* fix tests

* something like this

* const in defineacc

* uops get_reduce_acc

* move around

* allow consts in the LOAD/STORE

* each axis should only appear once, 21 failures

* 16 failures

* fix some image

* optional float4

* onnx tests

* gate the stores

* add reorder

* fix terrible skip function

* tc work

* opt add/mul merge

* fix float4 tests

* tiny tweak, 9 failing

* 7 test failures

* start tc, but i don't think this will work

* progress on tensorcores

* note

* fix ops tests

* closer on tc

* weeee...one tensor core works

* still works, more generic

* large WMMA works

* tc test passes

* use WMMA as accumulator

* basic tc tests passing

* small gemm padded works

* 4 failures

* 3 tests failing

* super barrier

* now two tests failing

* one test failing

* cleanpus, add reduce to UopGraph

* remove the linearizer

* remove unused

* lil cleanups

* Lowerer everywhere

* remove test that doesn't exist now

* image indexing

* llvm fix

* fix metal

* fix image

* fix images

* might fix ptx

* fix image type mismatch

* more tests pass

* CAST -> VECTORIZE

* forgot that one

* fix TestOps.test_flip_eye_crash

* locals shouldn't be image dtype

* change less files

* test fix

* fix recursive expands

* touches

* MULACC support in python

* delete unneeded

* alu before contract

* bug fixes

* tests

* no var multireduce

* simpler tc

* metal works in new style

* working on AMD and METAL

* fix amd

* shot in the dark, fix amd

* something for CUDA

* CUDA WORKS from the docs

* comment

* correct merge

* cleanups + ptx fix + get_reduce_acc

* local alias isn't used anymore

* add store sanity check

* fix for AMD

* cleanups and single expand pass

* more correct with acc_cache

* tests should pass

* block on WMMA

* tests pass

* merge contract and reduce

* contractor fixes issue

* multicontract

* pre expand wmma (same as a reduce)

* expand wmma and only take one

* all expands

* comments and whitespace
2024-07-10 15:07:42 -07:00
qazal
ae10e936e7
UOps.VECTORIZE cleanups [run_process_replay] (#5314)
* still render_cast

* one extra line ok

* these are all just vectorize

* save space

* behavior change can go in a different diff
2024-07-07 10:49:08 +03:00
greg-niemeyer
77b2ce9fc9
Add UOps.VECTORIZE [run_process_replay] (#5289)
* Add UOps.VECTORIZE to core

* Update vectorized cast tests

* Addresses code review comments

- Removes VECTORIZE from LLVMRenderer
- Add line breaks to unduly long lines
- Add noop CAST rule back
- Update asserts and add render_vectorize in
  CSytleLanguage renderer

* Add missing const folding rule for VECTORIZE

Also adds corresponding test

* Fixes test_const_vectorize_fold and add assert

- Use sane types with VECTORIZE in test_const_vectorize_fold
- Add assert that sanity checks the types for VECTORIZE

* Rename test_cast_vectorized_fold

Renames test_cast_vectorized_fold to test_noop_vectorize_fold
because the test targets a very specific rule and there are
other tests for VECTORIZE.

* Revert unrelated changes

---------

Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
Co-authored-by: qazal <qazal.software@gmail.com>
2024-07-07 09:59:57 +03:00
Roelof van Dijk
26e254c42b
ruff: else-raise and else-return (#5175)
* ruff: enable else-raise and else-return

* ruff: add error names

* fix order

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-06-27 07:54:59 -04:00
Jhenner Tigreros
dfa562dbc1
DEFINE_ACC takes UOps.CONST in vin instead of arg (#4975)
* Change DEFINE_ACC to receive UOps.CONST in vin

* Use localtype instead of acc dtype

* Fix idp

* Fix copy list

* Fix warp

* Fix error

* Fix merge

* Fix testing

* Fix merge

* Use deepcopy

* Change to copy of inp

* Fix lint

* Move const to first place

* Fix issue upat

* Fix upat patterns

* Change to list, to test permutations

* Add condition

* Change pm

* Revert change pm

* Remove unused rule

* Fix

* Change of float4 DEFINE_ACC values

* Cast on PM to correct dtype

* Improve assert message

* Move IFs

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-06-24 09:25:33 -07:00
chenyu
4c7e316ded
update pylint for ops_python (#5046)
the two errors (cell-var-from-loop and arguments-out-of-order) does not apply as we use it as intended.
2024-06-18 20:15:34 -04:00
chenyu
a8e9307e0b
pylint runtime/ and shape/ (#5044)
as pointed out by #4877, need to add `__init__.py` to trigger pylint. fixed some errors except ops_python (will do in a separate pr, it has a lot of errors), and sub-folders in runtime
2024-06-18 19:48:18 -04:00
kormann
7c3b877216
rename uop [run_process_replay] (#5031)
* rename

* fix unittests

* rename vin

* fix test

* fix type [run_process_replay]

* rm pre commit hook change
2024-06-18 21:34:05 +03:00
chenyu
03b367c014
handle float16 overflow in PYTHON (#5022)
* handle float16 overflow in PYTHON

use `truncate` when constructing tensor from list to make sure all values are packable (might be slow, but should be correct). add truncate_fp16 to cast overflowed values to inf/-inf.

* all valid fmt supports truncate
2024-06-17 21:12:52 -04:00
chenyu
013c73c3b3
minor refactor overflow handing in python backend (#5015)
made it clear that it's only handing int now. need to handle float inf next
2024-06-17 12:18:38 -04:00
nimlgen
654a8b9ef7
retire hsa (#4885)
* retire hsa

* EMULATE_AMD
2024-06-09 11:33:03 +03:00
Roelof van Dijk
1785a70e77
fix: else-return on runtime (#4881)
* fix: add init file

* fix: no else-return

* fix: remove file again
2024-06-08 14:44:24 +02:00
chenyu
3afc914617
CMPEQ -> CMPNE and make it safe to pad (#4818)
* CMPNE

* new dataset
2024-06-03 18:02:15 -04:00
George Hotz
4753283221
LOOP -> RANGE (#4650) 2024-05-19 06:40:20 -07:00
George Hotz
07b350a8f4
new uops is an actual graph (#4560)
* new uops is an actual graph

* it's way slower

* simpler

* fix define acc

* render_loop unique

* ops test pass

* add pattern matcher back, there's bugs

* rewrite

* use priority queue

* recursive children

* fix tests

* fix tests with SINK

* fix abstractions

* fix assembly

* simpler

* link define_acc

* fix DEFINE_ACC placement

* type verify

* full cmp

* fix cmp

* ACCESS_ACC

* insert DEFINE_ACC

* fix PHI

* recursive rewrite

* fix many tests

* sum collapse

* more patterns

* correct change

* fold arange

* fix that lin test

* space

* big folding rule works

* close

* has more maxes, meh

* cached node replace

* set changed

* simplest folding yet

* works

* works

* DIV

* all tests pass

* del

* fuzz linearizer fails

* sum_collapse

* test depth 2 cf

* fix lin test 14

* fix clang depth

* disable that

* failure 14 is fixed

* fix ptx

* failure 27 is fixed

* fix llama

* run_cnt

* Revert "Optimize PTX gated loads index calculation (#4304)"

This reverts commit d97d5a7689.

* fix uops loop

* fix ptx bugs

* add barrier

* print

* mem_type in ptx direct

* bypass tests that fail in CI but pass locally

* ptx remove ptr_ar

* more ptx passing

* fix ptx tests

* assert compile support

* remove  model inference benchmark from red
2024-05-17 18:00:18 -07:00
nimlgen
daf57af3eb
move tc to renderers (#4631)
* move tc to renderers

* missed import

* fix typo

* fix

* fix imports

* remove from tests

* fix 4607

* nv emulate timestamp

* time is int

* correct time
2024-05-18 00:36:29 +03:00
George Hotz
02327b8adf
simple stuff from new_uops branch (#4563) 2024-05-12 22:18:05 -07:00
George Hotz
347a3acb37
add renderer class (#4524)
* add renderer class

* tests pass

* fix pylint

* fix tensor cores
2024-05-10 21:40:02 -07:00
George Hotz
d438d5698d
bring buffer back to device (#4517) 2024-05-10 11:22:31 -07:00
George Hotz
4eef1ee9bf
move renderer into options (#4514)
* move renderer into options

* fix tests

* renders are functions
2024-05-10 10:01:51 -07:00
George Hotz
89e119bc58
move Allocator to buffer.py (#4502)
* move Allocator to buffer.py

* move those to realize

* memory file

* cleanup
2024-05-09 19:45:56 -07:00
George Hotz
f635c4d273
fix define global (#4383)
* fix define global

* remove name from DEFINE_GLOBAL

* fix fuzzing

* fix ptx

* fix python
2024-05-01 22:32:56 -04:00
chenyu
0a34d6016b
move exec_alu from uops to ops (#4033)
will use this for const folding in lazy too
2024-04-01 17:20:53 -07:00
chenyu
b47f6cebb2
LinearizerOptions -> CompilerOptions (#3978) 2024-03-28 17:50:23 -04:00
Francis Lam
7c5729a3bd
wmma: refactor to remove wmma_func and create TC funcs as needed (#3945)
* wmma: refactor to remove wmma_func and create TC funcs as needed

* test_linearizer: disable bf16 CUDA during emulation testing

* cstyle: clean up creation of CUDA vec dtypes

* extra/gemm: add option to accumulate to bfloat16

* cleanups

* benchmark: add CUDA bfloat16 matmul

* more cleanups
2024-03-27 16:43:09 -04:00
chenyu
6c7df1445b
enforce UOps.CONST arg has python type based on dtype (#3952)
added an assert in uops, remove the cast in renderer
2024-03-27 01:41:38 -04:00
nimlgen
e2d6f76723
_alloc and _free with options (#3934)
* _alloc has options

* linter

* fix hsa
2024-03-26 09:11:41 -07:00
chenyu
2e39f57594
move lines around in ops_python wmma (#3911) 2024-03-24 17:14:26 -04:00
chenyu
8c8b57fd5f
cleanup ops python (#3908)
i just want to merge lars!
2024-03-24 11:36:31 -04:00
chenyu
1c51d586ea
replace raise Exception with specific errors (#3874) 2024-03-22 12:32:21 -04:00
chenyu
5dd048a378
remove HIP in core tinygrad (#3810)
* remove HIP in core tinygrad

ci test uses device RHIP and HSA compiler (LinearizerOpt), so fine to remove HIP from tc.
Also updated README and EMULATE tc test flag

* EMULATE_CUDA
2024-03-18 18:19:27 -04:00
George Hotz
ca19eb3e82
where fold try 2 (#3748)
* where fold try 2

* assign fold

* test_where_fold works

* add gated store support to ops_python

---------

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
2024-03-15 07:46:26 -07:00
chenyu
75d4344cda
UOps.BITCAST (#3747)
* UOps.BITCAST

implicitly fixed no const folding for bitcast

* python backend

* ptx

* consistent llvm
2024-03-14 21:00:35 -04:00
Patrick Tsai
971d7f5d7c
O(n) arange attempt (#3530)
* It works?

* Clamp correctly

* Refactor

* Make code better

* Undo some stuff

* First step to trying to make floats work

* Floats work in Python op but not metal because int div is different

Python integerdivision was implemented as // which rounds towards
negative infinity, but C integer division rounds towards 0 so there
is an off-by-1 division error

* arange does cumsum with ints and then multiplies by step

This is so loop optimization can remain int only

* Undo a lot of symbolic changes

* Final check

* Cleanup

* There can be multiple phis

* Fix multiple phi op removal

* const sets dtype correctly

* Fix bugs

* Fix a couple bugs and add loop vars to resolve

* missed one

* Don't trim too many ops

* Fix symbolic test

* Use ones instead of full

* Delete test

* Lint passes

* max node error

* Small updates to loop logic

* Remove unnecessary changes

* We are getting somewhere

* Simple case

* Fix

* rm, prn

* Better

* If NumNode doesn't work then continue

* clamp is needed for arange(256)

* Move everything into the optim fn

* Replace correctly

* Order optimizations better

* Delete

* mypy

* Test for simplification

* Rename

* Fix test

* update test description

* Undo more

* Cleanup

* No replaced_ops map

* Fix lint

* AssertionError

* back again

* Reinstate assertion

* Return true and make diff not as big

* Bigger range for test

* Change cumsum impl

* fix bug

* make big cumsum work

* lint

* Undo cumsum 2-stage removal

* No while helper

* optional min/max clamping

* floats work

* rm giant arange test

* fix python cast None

* Check phi parents

* one phi allowed per where

* Fix one phi per where

* Rework iteration

* Delete assertions

* convert to int

* Try mul -1 instead of neg for hip..?

* Remove one phi per where requirements

* one accum only

* Lint

* should simplify a loop at a time

* Don't get rid of loop explcitly

* Need to iterate backwards

* lint

* unary neg

* Make optim work for onnx and sum_pad_collapse

* Better message

* filter alu ops correctly

* Fix the limiter

* lint and simplify

* Add it back

* off by one error

* test wheres and phis

* test max ops and non-if stuff

* <=

* cast_scalar

* Oops

* Change test

* Pass loop uops instead of a modified map

* Cut param transfer between linearizer and uops

* Fix issues

* Fix lint

* fix efficientnet python 3.8 invalid syntax

* distinct vars in seen_vars

* accurate var names

---------

Co-authored-by: Patrick Tsai <patosai@users.noreply.github.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-03-11 16:09:20 -07:00
George Hotz
81baf3eed3
bring ptx back (#3623)
* bring ptx back

* ptx back

* fix define var

* fix a few bugs

* bugfixes

* fixes

* fix llvm bug

* fix test bug
2024-03-06 13:34:21 -08:00
George Hotz
aa9b013d79
add constant folding for WHERE in uops (#3584)
* add constant folding for WHERE in uops

* prereqs for generic constant folding

* fix test

* disable slow overflow logic

* make that test faster
2024-03-02 10:37:14 -08:00
Francis Lam
e17f1821a7
wmma: add CUDA tensor core and fix test_speed_v_torch failure (#3544) 2024-03-01 17:51:02 -08:00
chenyu
b7e555f6c0
run test_linearizer_failures on PYTHON backend (#3565)
* run test_linearizer_failures on PYTHON backend

only test 1, some have hanging issues and gated store is not implemented

* --durations=20

* two less slow ones
2024-03-01 17:00:18 -05:00
George Hotz
2c19ab6561
define var (#3548)
* define var

* remove vars from there

* fix python symbolic ops

* fix llvm

* pypath
2024-02-29 16:43:27 -08:00
George Hotz
83cdc85790
add index to DEFINE_GLOBAL (#3542)
* remove DEFINE_GLOBAL from uops with side effects

* add index to DEFINE_GLOBAL

* bugfix

* better var name
2024-02-29 15:22:26 -08:00
geohotstan
9268a8b154
remove MULACC (#3459)
* init

* removed mulacc

* is uoptimize the problem?

* lol hax make work temporarily fix l8er

* revert extra/ changes

* clean up

* flaky metal tests?

* add back mulacc for metal

* revert last commit

* try skipping linearizer_failure tests

* skip flammit tests... cuz tests all work locally

* try narrow down exact linearizer failure test

* try 2

* try 4

* generated code is the exact same wtf why CI fails

* code for 15 and 17 are exact same with or without mulacc, this should pass

* try only 1 failure

* try garbage collecting lol...

* try del variables lol

* try gcing after del lol...

* is diskcache the problem???

* try disabling opts cache idk

* try remove hack

* try disable github metal cache...

* try CACHELEVEL=0 :D idk anymore

* try increase newCommandQueueWithMaxCommandBufferCount_, im almost out of ideas...

* revert

* actually not a HACK

* oops
2024-02-29 07:40:40 -05:00
Carson Radtke
15df9406d6
fix exec_alu(UnaryOps.SQRT, <...>, (0,)) + add test (#3487)
* fix exec_alu(UnaryOps.SQRT, <...>, (0,)) + add test

* sqrt(0) != nan

* fix tabs
2024-02-23 18:28:00 +01:00
George Hotz
7698781389
Revert "wmma: add CUDA tensor core (#3464)" (#3474)
This reverts commit e9cef13f0b.
2024-02-22 11:58:16 +01:00