* why does max_unpool2d feel slower than out.gradient ...
* slightly cleaner
* what happened to ruff
* need to think about this some more
* slightly faster now?
* clean up, 1 more failing edge case
* ok good
* working TINY_BACKEND
* nit doc wording
* retry CI
* wow argmax is so good
* 1 less line
* clean up and better variable names
* is this torch thing right...?
* add more tests
* slap a TODO on it
* clean ups
* prettier looking code and fix ceil mode test
* add return types and some docs
* ok that was a bad example since indices == value, just no example
* extra/gemm/max_matmul: start of custom kernels for GEMM
* add an unoptimized FP16/FP16 MMA example
* add slow 3-stage fp16 acc example
* add correct 3-stage pipeline with unswizzled/flat smem input (slow)
* add acc fp16 example with 3 stages and swizzle (no bank conflicts)
* add max version of NV fp16_fp16_fp16
* fix up comments and removed unused code in max variations
* add start of no_xor example
* fix to account for UOps to Ops
* train_shakespeare_char.py works
* move aten.where.self_out to tiny_backend_out
* fix memory leak
* corealize in the backward_hook
* Update backend.py
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* poc
* repeated values fail, sigh
* is this being timed out?
* fix up down names
* bitonic v2, does this run?
* bitonic v3, faster
* bitonic v3.1, faster
* bitonic v3.1.1, same speed unlucky
* support dim and indices
* bitonic v3.2, simpler code, TODO repeated indices
* bruv gimme green for once cmon
* cat (stack) implementation, slow but maybe one day when cat is fast meow
* revert to v3.2
* bitonic v4, who let the cats out edition
* clean up variable names
* figured out repeated indices :D
* ruff check --fix
* use sort for topk
* add Tensor.sort everywhere
* fix docs and add some types
* slightly better variable names
* am I doing torch inplace correctly?
* delegate sort to values_stable
* add a contig, faster first sort
* maybe don't test_inplace
---------
Co-authored-by: chenyu <chenyu@fastmail.com>