Commit graph

8,282 commits

Author SHA1 Message Date
George Hotz
6f792e8045 vmemu 2025-03-24 15:02:40 +08:00
George Hotz
b1f8018bf4 unaligned load 2025-03-24 14:54:11 +08:00
George Hotz
2eb9241329 better conv 2025-03-24 13:07:14 +08:00
George Hotz
554a490751 Merge branch 'master' into dsp_search 2025-03-24 12:29:22 +08:00
George Hotz
74d98eafb8 add onnx frontend stub [pr] (#9558) 2025-03-24 12:24:34 +08:00
George Hotz
de7d6cec3a hotfix: DEBUG 5 prints the ast 2025-03-24 11:43:11 +08:00
George Hotz
651c678edf work 2025-03-24 09:49:53 +08:00
chenyu
ba41076e94 update embedding test to not use dtypes.long [pr] (#9556) 2025-03-23 21:33:38 -04:00
chenyu
c965f4c20b
update bert config (#9555)
BEAM 4->5 for green, 2% faster
use AMD driver instead of AM for red, 5% faster
2025-03-23 16:14:41 -04:00
chenyu
d734e24c01
minor WEBGPU_PATH cleanup [pr] (#9552)
also mypy recognizes `sys.platform == 'win32'` but does not recognize it if wrapped inside a helper...
2025-03-23 09:10:02 -04:00
Ahmed Harmouche
7ce7fe0574
Refactor webgpu_dawn lib finding (#9547)
* Refactor webgpu_dawn lib finding

* Fix ruff
2025-03-23 08:23:29 -04:00
uuuvn
c631c72f22
HCQ: Increment timeline signal before submitting (#9550)
`AMDComputeQueue.__del__` frees `hw_page`. This is meant to be safe because
`AMDAllocator._free` calls `self.dev.synchronize()`, which is supposed to
wait for the IB to finish executing. That wait does not happen if the
AMDComputeQueue is dropped right after submit, before the timeline signal
is incremented, which is the case in most places. The result is a race if
`.bind()` is also used (required for multi-xcc, because a bug in the MEC
firmware treats all PACKET3_PRED_EXECs outside IBs as if they had an
EXEC_COUNT of zero).
2025-03-23 18:30:38 +07:00
nimlgen
d5667419af
am: move out pte creation logic (#9548)
* am: move out pte creation logic

* emu

* ops
2025-03-23 18:29:10 +07:00
George Hotz
3274bd2d81 output 2025-03-23 15:13:00 +08:00
geohotstan
309afa20b7
add Tensor.max_unpool2d (#9518)
* why does max_unpool2d feel slower than out.gradient ...

* slightly cleaner

* what happened to ruff

* need to think about this some more

* slightly faster now?

* clean up, 1 more failing edge case

* ok good

* working TINY_BACKEND

* nit doc wording

* retry CI
2025-03-22 12:11:33 -04:00
George Hotz
30f4d64148 rules 2025-03-22 19:17:16 +08:00
George Hotz
2634975d5a 5 and 8 2025-03-22 19:14:04 +08:00
George Hotz
fd73ec2b1b knum 2025-03-22 18:59:54 +08:00
George Hotz
e1d2bec4a4 opt 2025-03-22 18:52:56 +08:00
George Hotz
1b4e9f5e91 more opt rules 2025-03-22 18:07:31 +08:00
George Hotz
25c023bcbe more 2025-03-22 17:49:34 +08:00
George Hotz
07abf9e6bc multi_add_int32 2025-03-22 17:33:56 +08:00
George Hotz
26b02a037c fix 33 2025-03-22 17:17:47 +08:00
George Hotz
5089a601c6 name it 2025-03-22 14:44:01 +08:00
George Hotz
6b49a63c48 linearizer workaround 2025-03-22 14:18:02 +08:00
quortus
bdd44d4255 Fix DSP transcendentals (#9542) 2025-03-22 11:08:18 +08:00
George Hotz
dca95428a5 touch 2025-03-22 11:05:36 +08:00
Ignacio Sica
eddafb84e5
Bugfix for TC=3 (#9464)
* wrong but uses less shared

* for size 8 tc1 with devectorize in 0 loads into local before wmma and works

* improvements over tc1 devectorize

* fix tc=3

* works for handcoded tc opts

* clean bugfix tc=3

* fix

* revert changes
2025-03-21 16:43:42 -07:00
chenyu
6da78164f9
assert Kernel ast.op to be Ops.SINK [pr] (#9539)
rest of the code assumes self.ast is defined anyway
2025-03-21 18:09:44 -04:00
chenyu
c33679c47b
increase size in test_multinomial_counterexample (#9540)
should be less flaky
2025-03-21 17:46:52 -04:00
Francis Lata
1a1087e3a0 cleanups on losses and dataset tests (#9538) 2025-03-21 17:03:18 -04:00
Francis Lata
8cbe4009fc
RetinaNet losses (#9536)
* add sigmoid_focal_loss and l1_loss

* update ref implementation comment
2025-03-21 15:52:54 -04:00
Francis Lata
e6389184c5
update comment for retinanet dataloader implementations (#9534)
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-03-21 15:07:45 -04:00
chenyu
ee3d313b34
Revert "update ruff to 0.11.2 (#9531)" (#9535)
This reverts commit d8d65e2747.
2025-03-21 14:52:25 -04:00
chenyu
b46b8ee15e add a flag to log when beam surpassed max limit [pr] (#9533) 2025-03-21 13:37:02 -04:00
Francis Lata
eb95825eea
RetinaNet dataloader (#9442)
* retinanet dataloader

* remove batch_size from generate_anchors

* refactor kits19 dataset tests

* add tests for dataloader

* fix testing setup and cleanups

* remove unused import
2025-03-21 13:36:41 -04:00
b1tg
58206fa8a9
add amd llvm compiler (#9519)
Co-authored-by: b1tg <b1tg@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-03-21 23:13:27 +08:00
chenyu
d8d65e2747
update ruff to 0.11.2 (#9531)
0.11.2 fixed the false alert from 0.11.1. also pinned the version in setup for now to prevent broken CI from ruff upgrade
2025-03-21 10:32:59 -04:00
George Hotz
8a477ba4e1 knum 3 2025-03-21 20:36:18 +08:00
George Hotz
264dd91b8a 70 GFLOPS 2025-03-21 20:31:14 +08:00
George Hotz
bdf716b915 mul work 2025-03-21 20:05:29 +08:00
George Hotz
cf41c803d0 fast 13 2025-03-21 18:10:59 +08:00
George Hotz
3cf9224df5 a scale and b scale 2025-03-21 18:07:53 +08:00
George Hotz
af94addb3a ish 2025-03-21 17:46:45 +08:00
qazal
ee3ed73ed1 add reorder_view matcher to scheduler [pr] (#9528) 2025-03-21 17:46:20 +08:00
George Hotz
dc1469a188 double reduce 2025-03-21 17:33:48 +08:00
George Hotz
0416b0998d revert those 2025-03-21 17:15:38 +08:00
George Hotz
c715c25420 Merge branch 'master' into dsp_search 2025-03-21 17:13:10 +08:00
George Hotz
8e555c586c
switch quantization to unsigned/unsigned + add Ops.REDUCE (#9527)
* switch quantization to unsigned/unsigned + add Ops.REDUCE

* tests

* nhwc + replay pkl
2025-03-21 17:02:37 +08:00
George Hotz
f66b03f0a6 dsp ish 2025-03-21 16:28:08 +08:00