* Revert "Revert "ops rdna""
This reverts commit 0400315078.
* Revert "Revert "writing 2""
This reverts commit 325a3bf2cf.
* no dump
* 2x 2
* simple asm
* local size
* sub
* lil work
* support args != 3
* assembler work
* generate that
* ptx assembler
* begin index renderer
* max
* ptx loops
* gemms work
* valid works
* asm working a bit more
* close
* passing all ops tests
* ptx is a codegen only, not a backend
* ptx
* float16 support
* rdna goes here
* install types
* make amd disassemble
* ansilen for pretty print
* fix ptx log2/exp2
* assemblyinstruction
* new asm
* working gemm
* fix cmp
* more passing
* mod
* ptx works again
* rdan3 add works
* log exp
* sin is sin 2pi
* fix types
* progress
* loops work
* rdna xyz
* better addressing
* cleanups
* handle exception in early process
* div support
* rdna float4
* locals work
* fix neg index
* cast
* smaller diff
* yaml
* import only if selected
* fromimport
* types
* this all needs rewriting
* a few more
* add cumsum with n-dim inputs, over arbitrary axis + relevant tests
* increased rtol for cumsum test
* move test_cumsum into test_ops
* skip arange test for images as relies on cumsum
* Fix typo
* rewrite cumsum to work with images
* add and reorganize test_slice_* tests
* refactor Tensor.__getitem__()
* preliminary tests for 1) 0D tensors and 2) varargs for Tensor.zeros and Tensor.ones
* always compare shapes of the numpy arrays obtained from tinygrad and torch tensors
* add more tests for 0D support
* remove test_tensor.test_slicing(). All slicing tests at test/test_ops.py
* add zero-dim support
* make test_end2end.py consistent with 0dim support
* add test for tensor with zero in shape
* don't simplify ones if shape is ()
* skip tests that need zero-size tensor support.
- zero-size tensor support not related to 0dim tensors.
* add tests for __getitem__() supporting strides >= 1
* refactor __getitem__: support for strides >= 1
* minor refactors and add comments to __getitem__
* add tests for slices with negative steps
* add support for slices with negative strides
* Don't collapse dimensions during batched matmul (FIX#799)
* Avoid reshaping tensor to the same shape
* Skip batched matrix multiply when IMAGE is set
* make maximum split grad
* added test for maximum split grad when equal
* minor expr simplification
* (2-eq)/2 only once
* update test bc one more sum output child stays
* activation ops
* type hints + more testing
* formatting correction + parameter testing
* fixes to shape testing
* hardtanh to use clip + removed type hints
* assign val fix
* fix binop, other tests failure
* that was a bad idea
* better layernorm
* inference kernel count tests
* new style reshape pushing
* fixup replacement
* 199 kernels is okay. fix flops
* push reshape through unaryops only
* GRAPH=2 draws the phantom ops
* found resnet issue
* non working test
* mul is cheaper than div
* OPT inflation
* SHUFFLE_PAD_OPS in OPT=2
* linearizer outputs something
* working ish
* cstyle codegen
* clang mostly works
* fix load valid
* fix numberless loop
* fancy gen
* working
* fix enet compiler
* cleanups
* float4 upcasting
* less lines
* supports_float4
* constant folding
* mulacc
* internet tests flaky in CI
* 90% image support
* fix image generic
* bugs exposed with shapetracker and single view
* new llvm
* use vload, remove OLD
* that's really poorly done
* ending up being more lines
* runs one metal kernel
* conv2d works
* ops tests are passing
* const folding
* all ops work
* pre commit always passes
* torch works
* working still
* fix graph test
* tests passing
* image almost works
* image conv works
* most images
* fix custom
* fix assignment
* fix compile enet
* clean up comments
* fix realize return value
* include shapetracker in LB repr
* copy should make a copy
* reenable method cache
* fix lna
* dtypes in graph
* forward only for IMAGE=2
* simple realize
* getting close
* fixup new api, it's good except the kernel count
* back to 197 kernels
* tests should pass
* go to a real float
* no type_on_cpu
* fix the docs
* put shapetracker back in it's proper place
* clean up opt
* don't let global kernels get too small
* 8192 -> 1024
* disable local shape for clang
* fix can_merge
* unroll the 5x5 depthwise convs in op
* load float4 check
* Less, LessOrEqual, Greater, GreaterOrEqual, Equal
* lint fix
* using built in functions
* overriding __eq__ breaks things
* backwards pass for less - foward only tests
* one other spot
* removing backwards for comparison ops to match pytorch
* raise runtime error
* more tests for comparison ops
* fixed the lineup
* added number upcast tests