Commit graph

1,049 commits

Author SHA1 Message Date
George Hotz
2eb9241329 better conv 2025-03-24 13:07:14 +08:00
George Hotz
554a490751 Merge branch 'master' into dsp_search 2025-03-24 12:29:22 +08:00
George Hotz
74d98eafb8 add onnx frontend stub [pr] (#9558) 2025-03-24 12:24:34 +08:00
George Hotz
651c678edf work 2025-03-24 09:49:53 +08:00
George Hotz
3274bd2d81 output 2025-03-23 15:13:00 +08:00
geohotstan
309afa20b7
add Tensor.max_unpool2d (#9518)
* why does max_unpool2d feel slower than out.gradient ...

* slightly cleaner

* what happened to ruff

* need to think about this some more

* slightly faster now?

* clean up, 1 more failing edge case

* ok good

* working TINY_BACKEND

* nit doc wording

* retry CI
2025-03-22 12:11:33 -04:00
George Hotz
30f4d64148 rules 2025-03-22 19:17:16 +08:00
George Hotz
2634975d5a 5 and 8 2025-03-22 19:14:04 +08:00
George Hotz
fd73ec2b1b knum 2025-03-22 18:59:54 +08:00
George Hotz
e1d2bec4a4 opt 2025-03-22 18:52:56 +08:00
George Hotz
1b4e9f5e91 more opt rules 2025-03-22 18:07:31 +08:00
George Hotz
25c023bcbe more 2025-03-22 17:49:34 +08:00
George Hotz
26b02a037c fix 33 2025-03-22 17:17:47 +08:00
George Hotz
dca95428a5 touch 2025-03-22 11:05:36 +08:00
Francis Lata
eb95825eea
RetinaNet dataloader (#9442)
* retinanet dataloader

* remove batch_size from generate_anchors

* refactor kits19 dataset tests

* add tests for dataloader

* fix testing setup and cleanups

* remove unused import
2025-03-21 13:36:41 -04:00
George Hotz
af94addb3a ish 2025-03-21 17:46:45 +08:00
George Hotz
0416b0998d revert those 2025-03-21 17:15:38 +08:00
George Hotz
8e555c586c
switch quantization to unsigned/unsigned + add Ops.REDUCE (#9527)
* switch quantization to unsigned/unsigned + add Ops.REDUCE

* tests

* nhwc + replay pkl
2025-03-21 17:02:37 +08:00
George Hotz
f66b03f0a6 dsp ish 2025-03-21 16:28:08 +08:00
George Hotz
2729a46ca6 don't do that 2025-03-21 16:04:21 +08:00
George Hotz
dbb50e4a00 knum 4 2025-03-21 15:48:50 +08:00
George Hotz
71c7c455a6 quantize 2025-03-21 14:55:29 +08:00
George Hotz
ff3438be4e fast 2025-03-21 13:04:18 +08:00
George Hotz
c3c85c64ee simpler 2025-03-21 09:24:33 +08:00
George Hotz
f6ed8f4a27 8 folds 2025-03-20 21:20:46 +08:00
George Hotz
87718170d2 more generic 2025-03-20 21:14:33 +08:00
George Hotz
b67af4049c knum 20 2025-03-20 20:59:06 +08:00
George Hotz
16e425a4c0 work 2025-03-20 20:24:21 +08:00
George Hotz
e7402e6643 KNUM=13 will be fast like roadrunner 2025-03-20 18:45:53 +08:00
George Hotz
e5ccd9e846 work 2025-03-20 15:20:03 +08:00
George Hotz
223feb2118 Merge branch 'master' into dsp_search 2025-03-20 10:52:30 +08:00
George Hotz
68053d0510
dsp stuff / sniff ioctls from snpe (#9490)
* sniff ioctls from snpe

* dump input buffers

* snpe logs from dsp

* NHWC support

* knum 3

* this run?

* revert those

---------

Co-authored-by: Comma Device <device@comma.ai>
2025-03-20 10:38:23 +08:00
geohotstan
8c0d0a122c
Add return_indices to max_pool (#9506)
* wow argmax is so good

* 1 less line

* clean up and better variable names

* is this torch thing right...?

* add more tests

* slap a TODO on it

* clean ups

* prettier looking code and fix ceil mode test

* add return types and some docs

* ok that was a bad example since indices == value, just no example
2025-03-19 15:25:37 -04:00
Francis Lam
1e5d9ad8f7
extra/gemm/max_matmul: start of custom kernels for GEMM (#6926)
* extra/gemm/max_matmul: start of custom kernels for GEMM

* add an unoptimized FP16/FP16 MMA example

* add slow 3-stage fp16 acc example

* add correct 3-stage pipeline with unswizzled/flat smem input (slow)

* add acc fp16 example with 3 stages and swizzle (no bank conflicts)

* add max version of NV fp16_fp16_fp16

* fix up comments and removed unused code in max variations

* add start of no_xor example

* fix to account for UOps to Ops
2025-03-19 15:04:57 +08:00
b1tg
a95b489a55
nanoGPT train works with tiny torch backend (#9283)
* train_shakespeare_char.py works

* move aten.where.self_out to tiny_backend_out

* fix memory leak

* corealize in the backward_hook

* Update backend.py

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-03-19 11:51:02 +08:00
George Hotz
117b7a16ef
VALIDATE_WITH_CPU [pr] (#9488)
* VALIDATE_WITH_CPU [pr]

* fix test
2025-03-18 15:15:04 +08:00
nimlgen
a82c9332d3 am: rename soc21 to soc (#9482) 2025-03-18 08:54:26 +08:00
Anish Umale
5e58f4b65b
Tiny backend test_ops fix part 3 (#9483)
* extract straightforward things from https://github.com/tinygrad/tinygrad/pull/9302

* pass dtype and device for ones_like
2025-03-17 18:01:51 -04:00
TJ
9fcef4d009
add masked_select to tensor.py (#9468)
* add masked_select to tensor.py

* fix tests

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-03-17 16:05:36 -04:00
geohotstan
53d6f1e1bb
Add bitonic cat sort (#9422)
* poc

* repeated values fail, sigh

* is this being timed out?

* fix up down names

* bitonic v2, does this run?

* bitonic v3, faster

* bitonic v3.1, faster

* bitonic v3.1.1, same speed unlucky

* support dim and indices

* bitonic v3.2, simpler code, TODO repeated indices

* bruv gimme green for once cmon

* cat (stack) implementation, slow but maybe one day when cat is fast meow

* revert to v3.2

* bitonic v4, who let the cats out edition

* clean up variable names

* figured out repeated indices :D

* ruff check --fix

* use sort for topk

* add Tensor.sort everywhere

* fix docs and add some types

* slightly better variable names

* am I doing torch inplace correctly?

* delegate sort to values_stable

* add a contig, faster first sort

* maybe don't test_inplace

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2025-03-17 12:01:23 -04:00
George Hotz
8eb9093fb8 lil 2025-03-17 19:57:15 +08:00
George Hotz
45f7c08111 work 2025-03-17 19:22:12 +08:00
George Hotz
e57258b17b prettier rendering 2025-03-17 18:46:25 +08:00
George Hotz
14c9f14125 dsp beam search 2025-03-17 16:42:32 +08:00
George Hotz
824c5f41ac
dsp work try 3 (#9475)
* dsp work try 3

* padding
2025-03-17 16:42:12 +08:00
George Hotz
cc0041cb8c padding 2025-03-17 16:30:29 +08:00
George Hotz
52ae9af4dd
Fast DSP for MobileNetV2 (try 2) (#9467)
* Fast DSP for MobileNetV2 (try 2)

* enable fast path on uchar

* fix tests
2025-03-17 15:10:36 +08:00
George Hotz
09e7708b49 minimum change for rdna4 [pr] (#9455) 2025-03-16 13:39:24 +08:00
George Hotz
cb7a7f69c7
quantization preprocessor from DSP, should be universal (#9437)
* quantization preprocessor from DSP, should be universal

* touchups

* fix tests
2025-03-15 07:49:37 +08:00
chenyu
0e591baf43
redo simple_matmul change (#9450)
numpy does not support bfloat16
2025-03-14 17:53:52 -04:00