* remove test_const_vectorize_fold
* remove const folding UPat for VECTORIZE
* refactor cstyle render_const
* remove calls to dtype.scalar() in render_const
* add assert
* add vectorized const to UOp.const
* add UPat GEP-VECTORIZE-CONST -> CONST
* render_vectorize for DEFINE_ACC in cstyle
* add back missing render_cast in render_const
* generate vectorized consts as UOps for DEFINE_ACC
* update asserts for DEFINE_ACC with VECTORIZE src
* add UPats for PHI with VECTORIZE src
* use prev rendered vectorize in DEFINE_ACC render
* update DEFINE_ACC in python runtime
* update vectorized DEFINE_ACC in PTXRenderer
* rebase DEFINE_ACC changes on lowerer
* verbose rewrite of bad UPats
* simplify UOps.CONST implementation in ops_python
* update sum_collapse UPats for DEFINE_ACC-VECTORIZE
* revert linearizer to TOT
* fix DEFINE_ACC implementation in ops_python
* simplify DEFINE_ACC in cstyle
* Fix linter error
* support VECTORIZE in fold gated load/store UPat
* support VECTORIZE in other fold gated load UPats
* rewrite VECTORIZE in UPat for no input DEFINE_ACC
* simplify DEFINE_ACC render in cstyle
* make VECTORIZE rules more concise
* add more vectorize fold tests
* inline VECTORIZE-CONSTs in cstyle render
* revert VECTORIZE/GEP rule refactor
* revert cstyle render_const refactor
* inline VECTORIZE-CONSTs in cstyle render
* implicitly vectorized const rendering -> explicit
* WMMA VECTORIZE CONST process replay hacks
* VECTORIZE CONST NAN process_replay hacks
* more VECTORIZE CONST NAN hacks
* cleanup process_replay hacks
* isnan() -> not isfinite() cstyle VECTORIZE CONST
* tweak isnan and isfinite checks VECTORIZE CONST
* tweak for positive vs negative infinity VECTORIZE CONST
* add assert to PTX CONST render
* process_replay VECTORIZE CONST render parity for PTX STORE
* vmin/vmax for VECTORIZE'd CONST
* update WMMA folding rules
* add tests for WMMA VECTORIZE fold
* hack for cstyle half4 CONST zero process_replay parity
* revert PTX backend changes
* add back minimal DEFINE_ACC PTX change
* remove cstyle process_replay hacks
* remove dead code in PTX CONST render
* cleanup vmin/vmax logic for VECTORIZE'd CONSTs
* update vectorize fold tests to use DEFINE_VAR
* fix long line formatting in test
* remove unwanted merge artifact
* more vmin/vmax cleanup
* remove unnecessary asserts
* yet more vmin/vmax cleanup
* get rid of explicit VECTORIZE CONST logic in _min_max
* reuse CONST instead of creating a new one
* remove unneeded cast
* handle DType correctly in sconst
* improve readability of tests
* save a line
* save another line
* tuplize pats in src
* remove GEP-VECTORIZE pats
* add vec +0 fold
* HACK: fold only vec8 +0
* remove vectorized ALU fold hack
---------
Co-authored-by: qazal <qazal.software@gmail.com>
Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
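
The `GEP-VECTORIZE-CONST -> CONST` rewrite listed in this commit can be pictured with a toy graph node. This is only a sketch of the idea: `Node` and `fold_gep` are hypothetical names made up for illustration and are not the tinygrad `UOp`/`UPat` API.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Node:
    # Hypothetical stand-in for a graph node; not tinygrad's UOp.
    op: str                                   # "CONST", "VECTORIZE", or "GEP"
    src: Tuple["Node", ...] = ()
    arg: Union[int, float, None] = None       # CONST value, or GEP lane index


def fold_gep(n: Node) -> Node:
    """Rewrite GEP(VECTORIZE(c0, ..., ck), i) -> ci when lane i is a CONST."""
    if n.op == "GEP" and len(n.src) == 1 and n.src[0].op == "VECTORIZE":
        lane = n.src[0].src[n.arg]
        if lane.op == "CONST":
            return lane
    return n  # anything else is left untouched


# Build a 4-lane vector of constants and extract lane 2.
vec = Node("VECTORIZE", tuple(Node("CONST", arg=float(i)) for i in range(4)))
folded = fold_gep(Node("GEP", (vec,), 2))
```

Here `folded` is the lane-2 `CONST` node itself, so the `GEP` and `VECTORIZE` nodes drop out of the graph.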
* st to uops function
* lowerer
* uops reduce
* uops reduce
* acc_number correct
* reduce unroll
* complete unroll
* do upcasts
* handle multioutput
* define_accs
* fix valid
* get grouped dims
* revert lin
* minor
* fixup_ast
* group for reduce
* group works now
* all forwards pass
* all ops tests pass
* fix clang
* mypy
* lil cleanups, no image yet
* ugh, variables everywhere
* bugfix
* counters and name fix
* use symbolic, not uops
* cleanups
* Fix tests
* linearizer tests
* expands
* float4 expand load
* tests pass
* woooo, float4 test
* test ops works again
* one more lin test
* more lin tests
* bypass
* fix tests
* something like this
* const in defineacc
* uops get_reduce_acc
* move around
* allow consts in the LOAD/STORE
* each axis should only appear once, 21 failures
* 16 failures
* fix some image
* optional float4
* onnx tests
* gate the stores
* add reorder
* fix terrible skip function
* tc work
* opt add/mul merge
* fix float4 tests
* tiny tweak, 9 failing
* 7 test failures
* start tc, but i don't think this will work
* progress on tensorcores
* note
* fix ops tests
* closer on tc
* weeee...one tensor core works
* still works, more generic
* large WMMA works
* tc test passes
* use WMMA as accumulator
* basic tc tests passing
* small gemm padded works
* 4 failures
* 3 tests failing
* super barrier
* now two tests failing
* one test failing
* cleanups, add reduce to UopGraph
* remove the linearizer
* remove unused
* lil cleanups
* Lowerer everywhere
* remove test that doesn't exist now
* image indexing
* llvm fix
* fix metal
* fix image
* fix images
* might fix ptx
* fix image type mismatch
* more tests pass
* CAST -> VECTORIZE
* forgot that one
* fix TestOps.test_flip_eye_crash
* locals shouldn't be image dtype
* change less files
* test fix
* fix recursive expands
* touches
* MULACC support in python
* delete unneeded
* alu before contract
* bug fixes
* tests
* no var multireduce
* simpler tc
* metal works in new style
* working on AMD and METAL
* fix amd
* shot in the dark, fix amd
* something for CUDA
* CUDA WORKS from the docs
* comment
* correct merge
* cleanups + ptx fix + get_reduce_acc
* local alias isn't used anymore
* add store sanity check
* fix for AMD
* cleanups and single expand pass
* more correct with acc_cache
* tests should pass
* block on WMMA
* tests pass
* merge contract and reduce
* contractor fixes issue
* multicontract
* pre expand wmma (same as a reduce)
* expand wmma and only take one
* all expands
* comments and whitespace
* Add UOps.VECTORIZE to core
* Update vectorized cast tests
* Addresses code review comments
- Removes VECTORIZE from LLVMRenderer
- Add line breaks to unduly long lines
- Add noop CAST rule back
- Update asserts and add render_vectorize in
CStyleLanguage renderer
* Add missing const folding rule for VECTORIZE
Also adds corresponding test
* Fixes test_const_vectorize_fold and add assert
- Use sane types with VECTORIZE in test_const_vectorize_fold
- Add assert that sanity checks the types for VECTORIZE
* Rename test_cast_vectorized_fold
Renames test_cast_vectorized_fold to test_noop_vectorize_fold
because the test targets a very specific rule and there are
other tests for VECTORIZE.
* Revert unrelated changes
---------
Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
Co-authored-by: qazal <qazal.software@gmail.com>
as pointed out by #4877, need to add `__init__.py` to trigger pylint. fixed some errors except ops_python (will do in a separate pr, it has a lot of errors), and sub-folders in runtime
* handle float16 overflow in PYTHON
use `truncate` when constructing tensor from list to make sure all values are packable (might be slow, but should be correct). add truncate_fp16 to cast overflowed values to inf/-inf.
* all valid fmt supports truncate
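
The float16 overflow handling described above can be sketched as follows. This is an assumption about how such a helper might look, not a copy of the code in the commit: it relies on the standard library's `struct` raising `OverflowError` when a value does not fit the half-precision `"e"` format.

```python
import math
import struct


def truncate_fp16(x: float) -> float:
    """Round x to half precision; map overflowed values to inf/-inf."""
    try:
        # Round-trip through the IEEE 754 binary16 ("e") format.
        return struct.unpack("e", struct.pack("e", float(x)))[0]
    except OverflowError:
        # Value is too large for float16: keep the sign, saturate to infinity.
        return math.copysign(math.inf, x)


big = truncate_fp16(1e5)
neg_big = truncate_fp16(-1e5)
one = truncate_fp16(1.0)
```

Values that fit in float16 come back rounded to the nearest representable half, and anything out of range becomes a signed infinity instead of raising.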
* wmma: refactor to remove wmma_func and create TC funcs as needed
* test_linearizer: disable bf16 CUDA during emulation testing
* cstyle: clean up creation of CUDA vec dtypes
* extra/gemm: add option to accumulate to bfloat16
* cleanups
* benchmark: add CUDA bfloat16 matmul
* more cleanups
* remove HIP in core tinygrad
ci test uses device RHIP and HSA compiler (LinearizerOpt), so fine to remove HIP from tc.
Also updated README and EMULATE tc test flag
* EMULATE_CUDA
* where fold try 2
* assign fold
* test_where_fold works
* add gated store support to ops_python
---------
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
* It works?
* Clamp correctly
* Refactor
* Make code better
* Undo some stuff
* First step to trying to make floats work
* Floats work in Python op but not metal because int div is different
Python integer division was implemented as //, which rounds towards
negative infinity, but C integer division rounds towards 0, so there
is an off-by-1 division error when the quotient is negative and inexact
* arange does cumsum with ints and then multiplies by step
This is so loop optimization can remain int only
* Undo a lot of symbolic changes
* Final check
* Cleanup
* There can be multiple phis
* Fix multiple phi op removal
* const sets dtype correctly
* Fix bugs
* Fix a couple bugs and add loop vars to resolve
* missed one
* Don't trim too many ops
* Fix symbolic test
* Use ones instead of full
* Delete test
* Lint passes
* max node error
* Small updates to loop logic
* Remove unnecessary changes
* We are getting somewhere
* Simple case
* Fix
* rm, prn
* Better
* If NumNode doesn't work then continue
* clamp is needed for arange(256)
* Move everything into the optim fn
* Replace correctly
* Order optimizations better
* Delete
* mypy
* Test for simplification
* Rename
* Fix test
* update test description
* Undo more
* Cleanup
* No replaced_ops map
* Fix lint
* AssertionError
* back again
* Reinstate assertion
* Return true and make diff not as big
* Bigger range for test
* Change cumsum impl
* fix bug
* make big cumsum work
* lint
* Undo cumsum 2-stage removal
* No while helper
* optional min/max clamping
* floats work
* rm giant arange test
* fix python cast None
* Check phi parents
* one phi allowed per where
* Fix one phi per where
* Rework iteration
* Delete assertions
* convert to int
* Try mul -1 instead of neg for hip..?
* Remove one phi per where requirements
* one accum only
* Lint
* should simplify a loop at a time
* Don't get rid of loop explicitly
* Need to iterate backwards
* lint
* unary neg
* Make optim work for onnx and sum_pad_collapse
* Better message
* filter alu ops correctly
* Fix the limiter
* lint and simplify
* Add it back
* off by one error
* test wheres and phis
* test max ops and non-if stuff
* <=
* cast_scalar
* Oops
* Change test
* Pass loop uops instead of a modified map
* Cut param transfer between linearizer and uops
* Fix issues
* Fix lint
* fix efficientnet python 3.8 invalid syntax
* distinct vars in seen_vars
* accurate var names
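
The integer-division mismatch noted in this commit (Python `//` floors, C truncates) can be reproduced with a small helper. `c_style_idiv` is a hypothetical name used only for this sketch.

```python
def c_style_idiv(a: int, b: int) -> int:
    """Integer division that truncates toward zero, as C does."""
    q = abs(a) // abs(b)                       # magnitude of the quotient
    return q if (a < 0) == (b < 0) else -q     # restore the sign


floor_result = -7 // 2              # Python floors toward negative infinity
trunc_result = c_style_idiv(-7, 2)  # C-style truncation toward zero
```

The two results differ by one only when the operands have opposite signs and the division is inexact, which matches the off-by-1 error described above.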
---------
Co-authored-by: Patrick Tsai <patosai@users.noreply.github.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
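
The "arange does cumsum with ints and then multiplies by step" note can be illustrated in plain Python. This is a sketch under assumptions: the function name, the use of `itertools.accumulate` as the cumsum, and adding `start` at the end are illustrative choices, not the tensor code from the commit.

```python
import math
from itertools import accumulate
from typing import List, Union

Number = Union[int, float]


def arange_via_cumsum(start: Number, stop: Number, step: Number) -> List[Number]:
    """Build an arange from an integer cumsum, applying step afterwards."""
    n = max(0, math.ceil((stop - start) / step))     # number of elements
    idx = [c - 1 for c in accumulate([1] * n)]       # int cumsum of ones: 0..n-1
    return [start + i * step for i in idx]           # scale only after the int loop


ints = arange_via_cumsum(0, 5, 1)
floats = arange_via_cumsum(0.0, 1.0, 0.25)
```

The running sum stays integer-only, so any loop optimization sees just ints, and the float `step` is applied in a separate elementwise multiply.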
* run test_linearizer_failures on PYTHON backend
only test 1, some have hanging issues and gated store is not implemented
* --durations=20
* two less slow ones
* init
* removed mulacc
* is uoptimize the problem?
* lol hax make work temporarily fix l8er
* revert extra/ changes
* clean up
* flaky metal tests?
* add back mulacc for metal
* revert last commit
* try skipping linearizer_failure tests
* skip flammit tests... cuz tests all work locally
* try narrow down exact linearizer failure test
* try 2
* try 4
* generated code is the exact same wtf why CI fails
* code for 15 and 17 are exact same with or without mulacc, this should pass
* try only 1 failure
* try garbage collecting lol...
* try del variables lol
* try gcing after del lol...
* is diskcache the problem???
* try disabling opts cache idk
* try remove hack
* try disable github metal cache...
* try CACHELEVEL=0 :D idk anymore
* try increase newCommandQueueWithMaxCommandBufferCount_, im almost out of ideas...
* revert
* actually not a HACK
* oops