Commit graph

3,756 commits

Author SHA1 Message Date
George Hotz
411392dfb7
move files into uop dir (#10399)
* move files into uop dir [pr]

* tinygrad.uop is a thing

* fix uop docs, no pr

* fix viz
2025-05-18 11:38:28 -07:00
uuuvn
27c12be471
amd mockgpu graph support (#10385)
For testing remote graph stuff (prompted by #10371) in ci
2025-05-18 09:43:16 -07:00
qazal
04b23087d8
grouper tests from fuse_arange_default [pr] (#10394) 2025-05-18 18:42:43 +03:00
qazal
9e2089dcd4
don't raise Exception in process replay [pr] (#10392)
* don't raise Exception in process replay [pr]

* continue generating diffs unless [pr] is set, exit(1) otherwise

* change

* works
2025-05-18 11:23:23 +03:00
qazal
0294bfe507
simpler can_pad (#10364)
* simpler can_pad [pr]

* 3 kernels

* tests

* less kernels
2025-05-18 10:00:07 +03:00
George Hotz
6f77b938d7
Move getbits tests into test_helpers (#10382) 2025-05-17 17:04:00 -07:00
George Hotz
6ec88d94df
add tests for multi ram usage [pr] (#10376) 2025-05-17 15:33:40 -07:00
वेदांत
2453d99050
rms matching pytorch implementation (#10319)
* rms matching pytorch implementation

* pre commit fix

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2025-05-17 08:23:11 -07:00
qazal
e054b53a75
kernel count tests for pad [pr] (#10369)
* kernel count tests for pads

* handcoded rand one kernel

* comment

* prerealize device rng counter

* test_rand_handcoded generates /0

* remove track_rewrites
2025-05-17 17:20:46 +03:00
George Hotz
e13f2a3092
multi is O(1) (#10183)
* multi is O(1)

* allreduce

* no new uops needed

* junk

* something

* simple

* that's really what i want

* closer

* inject _device_num

* pretty print

* cleanups

* this

* early dnum

* ops allreduce is good

* ish

* device is the tuple and this is fine

* simpler

* progress

* copy_multi

* work

* more tests

* more tests pass

* work

* no None axis

* tests

* no none multi

* type fixes

* pre commit passes

* lil

* remove this

* mlperf dataloader on mac

* that test was wrong

* unbind

* support DEBUG=2

* realize

* only unbind bound vars

* don't include fixedvars

* graph test

* one test

* fixedvars in hcq

* new ring reduce

* ring reduce

* simpler ring

* mselect

* mselect doesn't work

* Revert "mselect doesn't work"

This reverts commit c78b77bd7d.

* Revert "mselect"

This reverts commit bb2e430ac3.

* simpler

* fixups

* no optional

* fix jit

* move things around

* cleanup multi

* simpler multi

* simpler reshape
2025-05-16 23:14:23 -07:00
George Hotz
e1a40e8040
add hcq fixedvars support [pr] (#10356)
* add hcq fixedvars support [pr]

* different test

* fixedvars are only for comp_queues

* fix hcq varvals
2025-05-16 22:05:53 -07:00
George Hotz
876d2275a1
changes from new multi (#10353)
* changes from new multi

* revert hcq change
2025-05-16 13:07:29 -07:00
wozeparrot
66e00c04dd
fix: skip kernel timing tests on ci cuda (#10348) 2025-05-16 11:48:06 -07:00
qazal
e9e5b54e43
grouper cleanups and merge with insert_kernels [pr] (#10349)
* grouper cleanups and merge with insert_kernels [pr]

* remove that
2025-05-16 14:39:56 +03:00
b1tg
caded2f413
llvm diagnostic error (#10267)
* llvm diagnostic info

* use decorator

* better error reporting

* fix mypy

* collect all diag msgs

* test diag error

---------

Co-authored-by: b1tg <b1tg@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-05-16 02:03:20 -04:00
George Hotz
a4a25720b2
add test_multitensor_jit_input [pr] (#10347) 2025-05-15 20:47:57 -07:00
wozeparrot
1ed04f993b
move benchmark stat tracking to influxdb (#10185) 2025-05-15 16:14:56 -07:00
wozeparrot
f59ecf2116
fix: mockgpu cuda timing (#10343) 2025-05-15 14:14:14 -07:00
qazal
7cfe367c07
failing test for slow embedding kernel with FUSE_ARANGE=1 [pr] (#10330) 2025-05-15 14:58:11 +03:00
qazal
0a45cd0cbe
grouper: merge views in fuse elementwise (#10325)
* grouper: merge views in fuse elementwise

* with gradient api
2025-05-15 13:17:09 +03:00
qazal
89d8d5b25e
add dims check in FUSE_ARANGE (#10323) 2025-05-15 11:33:21 +03:00
qazal
8fad0f0124
grouper: check for unsafe PAD in FUSE (#10322) 2025-05-15 10:53:44 +03:00
chenyu
f008e5f233
test_dtype_alu should cast bf16 input (#10320)
When testing ALU ops for bfloat16, the inputs should be cast to bfloat16 first; otherwise the numpy reference contains both the input rounding error and the ALU error, which makes the comparison less accurate.
2025-05-15 01:11:39 -04:00
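Note: the effect described in this commit message can be reproduced without any bfloat16 library. The sketch below is not the tinygrad test; `to_bf16` is a hypothetical helper that emulates bfloat16 rounding (round to nearest even on the upper 16 bits of a float32) and assumes finite, non-NaN inputs.

```python
import struct

def to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value and return it as a Python float.

    Hypothetical helper for illustration; assumes x is finite and not NaN.
    """
    # Reinterpret the float32 encoding of x as an unsigned 32-bit integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest even: add a bias, then drop the low 16 bits.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

x = 1.01

# Reference computed from the uncast input: includes the input rounding error.
ref_uncast = to_bf16(x * x)

# Reference computed from inputs cast to bfloat16 first: only the ALU error remains.
xb = to_bf16(x)
ref_cast = to_bf16(xb * xb)

print(xb, ref_uncast, ref_cast)
```

Here the two references differ by one bfloat16 step, which is the mismatch the cast is meant to remove.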
George Hotz
568d6d96e7
small changes from new multi [pr] (#10318) 2025-05-14 20:50:59 -07:00
chenyu
f6cf25fce4
cleanup test_conv2d_ceildiv_edge_case [pr] (#10317) 2025-05-14 23:35:28 -04:00
Kirill R.
50d7162acd
Add conv2d ceildiv edge case (#10303) 2025-05-14 22:50:23 -04:00
wozeparrot
9bbc2bc2a7
hotfix: filter_too_much (#10308) 2025-05-14 15:31:51 -07:00
George Hotz
42e70193c9
multi: instead of real, just copy (#10289)
* multi: instead of real, just copy

* fix test

* remove real
2025-05-14 10:36:55 -07:00
qazal
043efc6ec4
do not require self for track_rewrites [pr] (#10302) 2025-05-14 18:23:32 +03:00
qazal
d342f7688d
remove some skips in test_schedule + use assertRaisesRegex [pr] (#10296) 2025-05-14 14:54:07 +03:00
qazal
40f4ce3390
enable AMD CI for TestRandomness.test_multinomial [pr] (#10295) 2025-05-14 14:32:22 +03:00
qazal
1770e00c41
only CAPTURE_PROCESS_REPLAY=1 + add filterwarnings back [pr] (#10292) 2025-05-14 11:58:42 +03:00
qazal
1c97338be5
enable process replay assert for schedule [pr] (#10280)
* enable process replay assert for schedule

* start at unique+1
2025-05-14 11:10:47 +03:00
uuuvn
7bc4864bc4
Make dev a property of Allocator (#10286)
* Make `dev` a property of `Allocator`

(this is a prereq refactor for #10285)

At least `BufferXfer.copy` accesses it assuming it's always present,
currently most devices just add this property on their own repeating
the same code over and over again.

This is also a bit footguny, see `RemoteAllocator` that named this
property `device` instead of `dev`, i could obviously just change that
in one place but doing it globally seems like a better solution (and it
reduces code duplication too).

`MallocAllocator` is a bit special, but passing `None` works just fine.

* typing

* ignore type instead of cast
2025-05-13 17:01:01 -07:00
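Note: a minimal sketch of the refactor described in this commit message, assuming nothing about the real tinygrad classes beyond what the message says. All class names and the `name` field below are hypothetical; the point is only that the base class owns `dev`, so subclasses stop redefining it (or misnaming it `device`).

```python
from typing import Generic, Optional, TypeVar

D = TypeVar("D")

class Allocator(Generic[D]):
    """Base allocator that always carries the device it belongs to."""
    def __init__(self, dev: D):
        # Defined once here instead of in every subclass.
        self.dev = dev

class FakeDevice:
    """Hypothetical stand-in for a device object."""
    def __init__(self, name: str):
        self.name = name

class FakeAllocator(Allocator[FakeDevice]):
    """Device-backed allocator: inherits `dev` from the base class."""
    def describe(self) -> str:
        return f"allocator for {self.dev.name}"

class HostAllocator(Allocator[None]):
    """Allocator with no device, analogous to passing `None` for a malloc-style allocator."""
    def __init__(self):
        super().__init__(None)

gpu_alloc = FakeAllocator(FakeDevice("GPU:0"))
host_alloc = HostAllocator()
print(gpu_alloc.describe(), host_alloc.dev)
```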
uuuvn
ddff9857b8
Remote properties is a dataclass (#10283)
Not strictly required for anything, but soon there will be about 4 new
properties, and keeping them in one huge json just seems like bad taste.

It also seems right to not have a separate endpoint for this, just
`GetProperties` request that returns a repr of this similar to how
requests are sent in `BatchRequest`.

This will also make a switch to anything other than http much simpler
if it will be required for any reason, like just a tcp stream of
`BatchRequest`s
2025-05-13 11:56:58 -07:00
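Note: the idea of returning a repr of a dataclass and rebuilding it on the other side can be sketched as follows. This is an illustration under stated assumptions, not the tinygrad implementation: the field names, `encode_properties`, and `decode_properties` are hypothetical, and `eval` is only acceptable here because the namespace is restricted and the peer is assumed to be trusted.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RemoteProperties:
    # Hypothetical fields, chosen only for illustration.
    device_name: str
    max_buffer_bytes: int
    supports_graph: bool = False

def encode_properties(props: RemoteProperties) -> str:
    """Serialize via the auto-generated dataclass repr."""
    return repr(props)

def decode_properties(text: str) -> RemoteProperties:
    """Rebuild the dataclass from its repr.

    Only the RemoteProperties name is visible to eval; this still
    assumes the text comes from a trusted peer.
    """
    allowed = {"RemoteProperties": RemoteProperties}
    return eval(text, {"__builtins__": {}}, allowed)

props = RemoteProperties("gpu0", 1024, supports_graph=True)
wire = encode_properties(props)
print(wire)
print(decode_properties(wire) == props)
```

Adding a new property then means adding one field to the dataclass rather than touching a JSON schema on both ends.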
uuuvn
ba87eca0f1
Remote multi (basic) (#10269)
* Basic remote multi support

Simplest thing to be able to use remote with multiple gpus, very slow
because no transfers (copyin copyout for cross-device copies)

* tests
2025-05-13 09:52:47 -07:00
George Hotz
5f64bbc63d
improve multi tests + add support for fixedvars [pr] (#10281)
* improve multi tests + add support for fixedvars [pr]

* add support for fixedvars
2025-05-13 09:27:00 -07:00
chenyu
8a906cb124
Tensor.randn_like (#10276) 2025-05-13 11:53:59 -04:00
chenyu
c4988bc07b
only run test_u32_to_f16 if it supports fp16 (#10277)
* only run test_u32_to_f16 if it supports fp16

* cleanup
2025-05-13 11:16:14 -04:00
uuuvn
1900c3c68a
Metal multi in ci is fine actually (#10274)
Useful for testing remote multi stuff
2025-05-13 10:07:35 -04:00
nimlgen
6f42bf8b54
usbgpu: 10 steps in benchmark to hit cache (#10273) 2025-05-13 17:06:50 +03:00
qazal
a2d6b0afe0
fix FUSE pushing through SHRINK (#10271) 2025-05-13 11:38:53 +03:00
geohotstan
1c4ab6b991
ONNX add tests against ORT (#10270)
* start

* clean up

* indicate file location too
2025-05-13 04:03:52 -04:00
Sieds Lykles
02208565de
add check (#10257) 2025-05-12 11:03:01 -04:00
Kirill R.
4c7c139102
Use cmod/cdiv in sym_infer (#10258)
* Use cmod/cdiv in sym_infer

* test

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2025-05-12 09:07:28 -04:00
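Note: "cdiv" and "cmod" here refer to C-style integer division and remainder, which truncate toward zero, whereas Python's `//` and `%` round toward negative infinity. The sketch below uses its own definitions for illustration and is not copied from tinygrad; unlike a production helper it does not special-case a zero divisor.

```python
def cdiv(x: int, y: int) -> int:
    """Integer division that truncates toward zero, as in C."""
    q = abs(x) // abs(y)
    # The quotient is negative exactly when the operands have different signs.
    return -q if (x < 0) != (y < 0) else q

def cmod(x: int, y: int) -> int:
    """Remainder matching cdiv, so that x == cdiv(x, y) * y + cmod(x, y)."""
    return x - cdiv(x, y) * y

# The two conventions agree for non-negative operands and differ otherwise.
for x, y in [(7, 2), (-7, 2), (7, -2), (-7, -2)]:
    print(x, y, "python:", x // y, x % y, "c-style:", cdiv(x, y), cmod(x, y))
```

The difference only shows up when an operand is negative, which is why evaluating a symbolic expression with the wrong convention can go unnoticed until a negative value appears.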
qazal
95c6a736a9
fix FUSE_ARANGE=1 for bert (#10255) 2025-05-12 14:44:05 +03:00
Sieds Lykles
7c4b381fbf
Extra simplify valid test [pr] (#10256)
* add test

* Change the range

* add todo test
2025-05-12 07:32:03 -04:00
chenyu
70c797b107
train bert tests (#10248)
added a working bert tiny test, and a failed bert FUSE_ARANGE test
2025-05-11 08:42:08 -04:00
nimlgen
2145bce3f9
usbgpu: copyin size is 16k (#10240)
* usbgpu: copyin size is 16k

* ush
2025-05-09 22:12:54 +03:00
Sieds Lykles
74e40aafa0
use cdiv in div and mod folding (#10216)
* use cdiv

* use cdiv and cmod there as well

* Add tests

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2025-05-09 12:37:24 -04:00