Commit graph

8,514 commits

Author SHA1 Message Date
George Hotz
37a40bf975 early lower cat 2026-03-07 11:39:20 +08:00
George Hotz
af1db22b25 simpler 2026-03-07 10:11:21 +08:00
George Hotz
be0f9d1055 min 2026-03-07 10:00:29 +08:00
George Hotz
5b9a6c5520 Add Ops.CAT movement op (ai slop) 2026-03-06 18:51:25 +08:00
George Hotz
6fd18ef875
rename CAT to VCAT (#15167) 2026-03-06 18:46:28 +08:00
Roelof van Dijk
059c6326c0
metal uint32 icb offset overflow (#15156)
* metal uint32 icb offset overflow

fix: diff

supports_exec_item

GraphRunner.supports_exec_item

tests

fix: can't import on non-metal

stricter

* also test the non-metal buffer case

* imports on non-mac
2026-03-06 00:54:39 +03:00
chenyu
da61088ca4
more divmod recombine (#15162) 2026-03-05 12:53:22 -05:00
chenyu
167a1d56a6
improve divmod folding (#15148)
canonicalize to div than mod which enables more simplification
2026-03-05 10:07:36 -05:00
Christopher Milan
b824579e4d
simplify image_conv2d pitch alignment hacks (#15158) 2026-03-05 07:17:34 -05:00
qazal
5bf542469d
viz: python traceback for USER device (#15160)
* start

* ux

* unittests
2026-03-05 20:22:09 +09:00
Roelof van Dijk
d65923bda5
tensor.py: add normalize function (#15159)
* tensor.py: add normalize function

* p==0 should match torch
2026-03-05 18:55:53 +08:00
Roelof van Dijk
fc0534910c
q5k is like q4k (#15155) 2026-03-05 17:02:49 +08:00
Ananta Ranganathan
8ef656324e
FIXED TEST Q5_K GGUF dequant (#15147)
* q5_k gguf support as separate pr

* fix the problematic gemv test for q5_k

* add assert to make sure the gemv test can't fail with warning instead of error
2026-03-05 16:32:36 +08:00
George Hotz
e97922a57c
LLM speedup with two jits, prefill/rollout (#15153)
* START_TIME

* print cleanup

* fix tests
2026-03-05 16:21:09 +08:00
wozeparrot
be23772d43
llama3 fixes part2 (#15150) 2026-03-04 23:43:50 -08:00
George Hotz
fb43b415f9
fix symbolic shape call + chunked prefill (#15149)
* fix precompile for symbolic shape

* chunked prefill

* cleaner

* test that
2026-03-05 14:02:26 +08:00
George Hotz
8a82b26522
llm: print the prefill cache size (#15146)
* print the llm prefill cache size

* mock that too
2026-03-05 12:13:28 +08:00
chenyu
b5370fd52d
use copy_multi in alu_multi [pr] (#15143)
* use copy_multi in alu_multi [pr]

* copy to anything
2026-03-04 22:53:00 -05:00
George Hotz
72a9ed6e23
fix render depth bug + add warmup to serve + no realize default (#15144)
* fix render depth bug + add warmup to serve

* make realize not the default
2026-03-05 11:21:16 +08:00
George Hotz
ac1847cbf7
fully symbolic llm (#15097)
* work

* llm symbolic (almost)

* work

* revert that

* llm sym

* works

* cleanups

* cache tokens with the kv cache

* cleanups

* cleanups
2026-03-05 10:22:11 +08:00
qazal
33a1970045
sqtt: simplify inst mapping, validate JUMP processing in CI (#15139)
* jump cleanup

* assert there's a JUMP

* new example for JUMP

* regenerate examples

* rdna4 work

* new packets

* work

* less for branch handling

* less verbose

* fix err message
2026-03-05 09:53:12 +09:00
chenyu
04da527a7a
minor div_and_mod_symbolic cleanups (#15138) 2026-03-04 19:05:44 -05:00
chenyu
106d18b792
use UOp methods in allreduce.py [pr] (#15137)
except the one line with Ops.BUFFER and Ops.NOOP, not sure what that's for
2026-03-04 17:15:33 -05:00
chenyu
34594bcaaf
Revert "bug in metal: offset is stored as uint32, overflow (#15129)" (#15136)
This reverts commit 9c58db16fa.
2026-03-04 16:54:42 -05:00
Roelof van Dijk
9c58db16fa
bug in metal: offset is stored as uint32, overflow (#15129)
* metal uint32 icb offset overflow

* fix: diff

* supports_exec_item

* GraphRunner.supports_exec_item

* tests

* fix: can't import on non-metal
2026-03-04 22:52:12 +03:00
chenyu
1f96cc2b51
update non-contiguous buffer error message [pr] (#15131)
* update non-contiguous buffer error message [pr]

also cleaned up the tests

* order
2026-03-04 11:13:26 -05:00
George Hotz
47faa2d7b4 hotfix: llm kv cache uses clone instead of realize to avoid many realize 2026-03-04 19:07:03 +08:00
George Hotz
8ebd24637b
fix fa forward building with clang 22 (#15124)
* fix fa forward building with clang 22

* fix: override rocm path

---------

Co-authored-by: Woze Parrot <wozeparrot@gmail.com>
2026-03-04 02:32:25 -08:00
Christopher Milan
592f9bf6c6
set OPENPILOT_HACKS=1 to enable replace assign (#15123) 2026-03-04 05:26:04 -05:00
Christopher Milan
5623cea7b1
move openpilot contiguous hacks to schedule (#15120) 2026-03-04 03:04:06 -05:00
George Hotz
5ecfe549e7
allreduce is a function with LATE_ALLREDUCE=1 (#15119)
* allreduce as a function

* allreduce function

* support allreduce function

* LATE_ALLREDUCE
2026-03-04 15:17:58 +08:00
Christopher Milan
e7e70a3c95
simplify idx before counting backward_slice (#15117) 2026-03-03 23:53:50 -05:00
George Hotz
2d72a4a90c
fix copying padded const (#15116)
* fix const padding cpu

* remove comment
2026-03-04 10:39:45 +08:00
chenyu
b5ebb4d06d
contiguous_view_offset returns only offset [pr] (#15113)
size is always input.size
2026-03-03 15:23:39 -05:00
nimlgen
abd830b260
am: setup_rinf returns only doorbell (#15112) 2026-03-03 19:27:41 +03:00
nimlgen
4b42bb54aa
am: reset sdma to start from 0 (#15109) 2026-03-03 18:14:46 +03:00
George Hotz
01ddb4c267
add precompile to call (#15099)
* add precompile to call

* put get back

* something

* after structure

* alt

* keep it call

* resolve call

* resolve linear call

* precompile works with llm

* revert rangeify

* color for debugging

* getenv PRECOMPILE

* clean up deco pattern

* fully recursive sink scheduling

* revert llama

* fix SPEC=2
2026-03-03 22:32:42 +08:00
qazal
c7f908b788
sqtt: fix rdna4 structs (#15111)
* work

* DEBUG=2
2026-03-03 23:32:14 +09:00
qazal
8dd691761d
sqtt: remove old files (#15108) 2026-03-03 22:43:24 +09:00
Christopher Milan
5f6b610da1
FLOAT16 logic for IMAGE==1 goes back to image_conv2d (#15105) 2026-03-03 05:37:57 -05:00
George Hotz
7d025089e3
no after removal (#15102)
* no after removal

* we are using walk

* null schedule test

* pytest deps

* Revert "pytest deps"

This reverts commit 5e1c5304ec.

* Revert "null schedule test"

This reverts commit 02da66053e.

* clean null tests
2026-03-03 17:50:31 +08:00
wozeparrot
92c16810ac
feat: per device mem_used (#15100) 2026-03-03 01:31:28 -08:00
qazal
e3a0598d0b
viz: the whole pc should be in view (#15101) 2026-03-03 17:17:53 +09:00
Christopher Milan
c70e8af068
move IMAGE FLOAT16 logic to allocations (#15095)
* FLOAT16 logic in allocations

* cleanup

* separate that

* only apply when IMAGE == 1

* test passing now

* create image buffers earlier
2026-03-02 22:00:05 -05:00
George Hotz
d483e4153a
buffer view is like buffer (#15082)
* buffer view is like buffer

* fix

* swap_reshape_shrink

* contiguous on gguf, fix overlap

* revert that

* _device_supports_view

* this

* fix that test

* 0 buffers

* that test was wrong

* this

* check correct size

* contig BUFFER_VIEW

* this

* fix tests

* buffer view tests

* om

* fix torch

* no MOCKGPU

* skip
2026-03-03 09:52:33 +08:00
qazal
848f5cea96
viz: sqtt instruction packet trace (#15065) 2026-03-03 07:55:04 +09:00
chenyu
f80b1033c5
simpler Tensor.all (#15089)
same generated kernel
2026-03-02 11:08:55 -05:00
chenyu
4008f7d4e8
move Tensor.one_hot +1 to python (#15088) 2026-03-02 10:56:41 -05:00
nimlgen
dafbe9733a
am: cleanup (#15086) 2026-03-02 17:06:21 +03:00
George Hotz
5ff278446c
add contiguous_view_offset (#15084)
* add contiguous_view_offset

* no int
2026-03-02 18:05:04 +08:00