Commit graph

11,106 commits

Author SHA1 Message Date
George Hotz
c3f99a727e
objc fast msg (#8922)
* benchmark kernel launch

* don't realize unneeded

* faster

* faster metal

* fix mypy

* new objc message style [pr]

* without sync

* no div 0

* lru cache that

* no sync in the profile

* fix

* update all to new style

* remove comment

* graph one kernel

* fix graph one kernel

* remove that sync
2025-02-06 17:49:06 +08:00
qazal
a2e7e49fe1
prepickle scheduler process replay [pr] (#8924) 2025-02-06 10:16:36 +01:00
qazal
89d7480b0c
hotfix: don't sink views [pr] (#8923) 2025-02-06 09:15:12 +01:00
George Hotz
0cbb7d7f1e
hotfix: metal has known sync issue 2025-02-06 14:29:41 +08:00
George Hotz
a8e54df363
benchmark single kernel launch (#8921)
* benchmark kernel launch

* don't realize unneeded

* faster

* faster metal

* fix mypy

* without sync

* no div 0

* lru cache that

* no sync in the profile
2025-02-06 13:35:34 +08:00
George Hotz
3e082d4a9d
add float4 support to LLVM (#8920)
* add float4 support to LLVM

* is_bool
2025-02-06 12:15:50 +08:00
George Hotz
b05c536f74
cleanup some llvm stuff [pr] (#8919)
* cleanup some llvm stuff [pr]

* debug

* default to newer llvm

* repr
2025-02-06 11:45:03 +08:00
Josh Moore
44e0eab8fd
Fix AttributeError occurring after ValueError in _apply_uop (#8905)
* Fix AttributeError occurring after ValueError in _apply_uop

* Update tensor.py

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-02-06 10:56:29 +08:00
chenyu
30695da256
remove Tensor._to_const_val (#8917)
* remove Tensor._to_const_val

added a TODO for advanced indexing on const, which was the last place that checks const in Tensor

* that is not folding now

* one more
2025-02-05 21:44:39 -05:00
George Hotz
d09b5f801c
don't use Tensor new, add to all_tensors after constructions [pr] (#8918) 2025-02-06 10:21:32 +08:00
FICTURE7
759b3f86bf
Pass host CPU features to LLVM target (#8909)
* Pass host CPU features to LLVM target

This gets `test_gemm_fp16` to pass on Windows. It would fail because the
generated machine code would call compiler-rt functions to perform
truncating. This gets the test to pass on some hardware, because LLVM
gets access to more instructions. Essentially this is similar to
`-march=native`.

Unless this was intentionally left as is to be re-implemented fully in
LLVM IR or something.

* Fix linter complaints
2025-02-06 10:19:30 +08:00
uuuvn
09ec33a578
Better errors when relocating against undefined symbol (#8902) 2025-02-06 10:13:44 +08:00
chenyu
488200f16c
move more pow const to rewrite (#8916)
* move more pow const to rewrite

one less use of _to_const_val

* fix
2025-02-05 20:30:12 -05:00
chenyu
76671381aa
move positive const ** t to a rewrite rule (#8914)
* move positive const ** t to a rewrite rule

* one more test
2025-02-05 19:30:12 -05:00
Ignacio Sica
cad44f5f42
add Half-Precision Accumulation Support for Tensor Cores in NV, CUDA, and PTX (#8680)
* ptx and nv rendering refactor to work with half acc

* ptx fix!

* use same reg for acc and out

* fix comment

* another fix

* minor change in comment

* fix

---------

Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2025-02-05 16:56:37 -05:00
nimlgen
17f9b1cef6
am: load fw based on versions (#8913)
* am: load fw based on versions

* ops

* ops2
2025-02-06 00:02:09 +03:00
chenyu
189bfa164e
enable backward test for pow(neg const ** x) (#8912)
backward works now. 0**x still does not work because it's a special case fixed in transcendental
2025-02-05 15:35:21 -05:00
chenyu
9307572fe3
Ops.POW and transcendental (#8911) 2025-02-05 15:15:59 -05:00
nimlgen
bff7c70eef
hcq: better var check (#8908) 2025-02-05 22:38:59 +03:00
Ignacio Sica
aec3b8d515
add regression test: test_get_kernel_actions_preserves_actions_state (#8907)
* test_get_kernel_actions_preserves_actions_state

* simplify

* simplify

* refactor assert message
2025-02-05 14:13:01 -05:00
qazal
e71497aabc
move assign ShapeTracker check to pattern matcher [pr] (#8906)
* move assign ShapeTracker check to pattern matcher [pr]

* rename the st uop to view
2025-02-05 19:47:20 +01:00
Ignacio Sica
0f6109ec00
hotfix bug in get_kernel_actions after TC_SEARCH_OVER_SHAPE was introduced (#8904)
* hotfix search bug

* copy actions
2025-02-05 13:10:05 -05:00
Ignacio Sica
15f94ac964
TC_SEARCH_OVER_SHAPE to search multiple TC shapes (#8793)
* squash search over search

* refactor assert

* init benchmark

* cleaner get_kernel_actions

* cleaner get_kernel_actions

* add comment
2025-02-05 11:03:46 -05:00
qazal
e7edadda54
construct the sched_sink with graph_rewrite [pr] (#8903)
* construct the sched_sink with graph_rewrite

* diff

* move break_sched
2025-02-05 15:16:48 +01:00
qazal
ef7ad3f077
simpler subbuffer construction + copyin is always base (#8900)
* realize copy

* cleanup buffer_view

* smaller
2025-02-05 09:10:20 +01:00
qazal
6f0cc2e9c5
rename to KernelContext and move the linearize_sched comment [pr] (#8899)
* rename to KernelContext and move that comment [pr]

* 500
2025-02-05 07:49:58 +01:00
geohotstan
6fb0e5751b
hotfix test_onnx_imagenet (#8897)
* start

* log severity

* only change this

* change abstraction so it's more usable for huggingface

* WHOOPS

* actually this is more correct
2025-02-05 14:39:55 +08:00
George Hotz
c1c5227acb
preserve size in dtype ptr [pr] (#8898) 2025-02-05 14:38:57 +08:00
George Hotz
5844883e59
bump master version v0.10.1 2025-02-05 09:08:28 +08:00
uuuvn
a51c688f39
Cleanup llvm cleanup (and some clang things too) (#8871)
* Cleanup llvm cleanup (and some clang things too)

* Tests

* Tests 2

* forgot mockgpu

* more print some sources
2025-02-05 07:49:05 +08:00
eliotgolding
bb5ded85cc
Don't rewrite idiv to rshift when numerator is negative (#8885)
* more conditions for shift rewrite mul/idiv

* make ptx test uint so the new condition is true

* delete idiv test

* rewrite to 0 is wrong for idiv, as denominator is cast to 0 before division

* mul/div by 2**(large count) is unsupported anyway
2025-02-05 07:47:33 +08:00
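The hazard behind the commit above can be illustrated with plain Python integers. This is only a sketch and not tinygrad code: it assumes that idiv rounds toward zero, as C integer division does, and `trunc_idiv` and `shift_matches_idiv` are hypothetical helpers introduced for the illustration. An arithmetic right shift rounds toward negative infinity, so the two agree for non-negative numerators but can differ for negative ones.

```python
def trunc_idiv(a: int, b: int) -> int:
    """Integer division rounding toward zero (C-style). Hypothetical helper."""
    q = abs(a) // abs(b)
    # The quotient is positive when both signs match, negative otherwise.
    return q if (a < 0) == (b < 0) else -q


def shift_matches_idiv(a: int, k: int) -> bool:
    """True when a >> k gives the same result as truncating a / 2**k."""
    return (a >> k) == trunc_idiv(a, 1 << k)


if __name__ == "__main__":
    for a in (7, -7, -8):
        print(a, a >> 1, trunc_idiv(a, 2), shift_matches_idiv(a, 1))
```

For a negative numerator the shift is only safe when the division is exact, which is why a rewrite of this kind needs a condition on the sign of the numerator.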
pedro
666b6149bc
Use full soname for libgcc_s in CPUProgram (#8642) (#8896)
The number after .so is the ABI version; it is always 1 for libgcc_s.
Most Linux systems set default library versions via symlinks, which are
simply followed to reach the actual ELF file. Conda, however, does it via
linker scripts, which ctypes doesn't follow (contents of libgcc_s.so below):
```
/* GNU ld script
   Use the shared library, but some functions are only in
   the static library.  */
GROUP ( libgcc_s.so.1 -lgcc )
```
ctypes.util.find_library thinks that this is the actual elf and
ctypes.CDLL just loads this text file as a shared library. The result
is:
```
  File "/home/me/src/tinygrad/tinygrad/device.py", line 223, in CPUProgram
    helper_handle = ctypes.CDLL(ctypes.util.find_library('System' if OSX else 'gcc_s'))
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/me/miniforge3/envs/tinygrad/lib/python3.12/ctypes/__init__.py", line 379, in __init__
    self._handle = _dlopen(self._name, mode)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: /home/me/miniforge3/envs/tinygrad/lib/libgcc_s.so: invalid ELF header
```

Co-authored-by: uuuvn <83587632+uuuvn@users.noreply.github.com>
2025-02-05 07:45:48 +08:00
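The failure described in the commit above (a GNU ld linker script being handed to `ctypes.CDLL` as if it were a shared library) can be detected by looking at the file's first bytes. The following is a minimal sketch and not the fix applied in the commit, which uses the full soname instead. `looks_like_elf` and `demo` are hypothetical names, and the file contents written in `demo` are stand-ins rather than real libraries.

```python
import os
import tempfile

ELF_MAGIC = b"\x7fELF"  # first four bytes of every ELF file


def looks_like_elf(path: str) -> bool:
    """Return True if the file at `path` starts with the ELF magic number."""
    with open(path, "rb") as f:
        return f.read(4) == ELF_MAGIC


def demo() -> tuple:
    """Classify a fake linker script and a fake ELF header written to temp files."""
    with tempfile.TemporaryDirectory() as d:
        script = os.path.join(d, "libgcc_s.so")
        elf = os.path.join(d, "libgcc_s.so.1")
        with open(script, "wb") as f:
            f.write(b"/* GNU ld script */\nGROUP ( libgcc_s.so.1 -lgcc )\n")
        with open(elf, "wb") as f:
            # Only the magic number matters here; the rest is padding.
            f.write(ELF_MAGIC + bytes(12))
        return looks_like_elf(script), looks_like_elf(elf)


if __name__ == "__main__":
    print(demo())
```

A check like this only reports the problem. Asking for the full soname (`libgcc_s.so.1`), as the commit title describes, avoids it by naming the versioned library rather than the unversioned `libgcc_s.so` file that conda ships as a linker script.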
chenyu
48349efdc1
copy is already contiguous (#8886) 2025-02-04 17:53:33 -05:00
nimlgen
4c28235bd1
am: remove hardcodes (#8895)
* am: remove hardcodes for 7900

* h
2025-02-05 00:52:53 +03:00
geohotstan
057c70b05f
add onnx_helpers to extra and add ort validate to benchmark_onnx (#8890)
* start

* log severity

* only change this

* change abstraction so it's more usable for huggingface

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2025-02-04 16:36:01 -05:00
chenyu
89eebd4bfb
pow cleanups (#8894)
more readable
2025-02-04 15:52:57 -05:00
qazal
7a9e3247c2
simple start to the Kernel UOp [pr] (#8893)
* simple start to a kernel [pr]

* add the sched_sink and spec

* rename kernels to sinks

* pylint complains
2025-02-04 21:48:15 +01:00
qazal
b4e8878e01
remove tensor_uops tracking from ScheduleContext [pr] (#8892)
* remove tensor_uops tracking from ScheduleContext [pr]

* cleaner
2025-02-04 20:34:15 +01:00
qazal
6a0da51ed0
truncate process replay logs [pr] (#8891)
* truncate process replay logs [pr]

* work

* max_lines

* bump to 1K
2025-02-04 20:26:48 +01:00
qazal
c7c279a6bd
unbind ShapeTrackers without maintaining a cache [pr] (#8889)
* replace with a try [pr]

* check vars

* ahaa
2025-02-04 19:43:41 +01:00
chenyu
61de654efa
minor shard cleanup [pr] (#8888) 2025-02-04 13:22:31 -05:00
qazal
6ec7f1b00f
replace UPat(name="x") with UPat.var("x") [pr] (#8887)
* replace UPat(name="x") with UPat.var("x") [pr]

* a few more
2025-02-04 19:12:40 +01:00
qazal
c26b06eaeb
delete fold_img_cast [pr] (#8875) 2025-02-04 18:43:45 +01:00
qazal
acf0baefee
process replay from tensor uops to kernel ast (#8883)
* process replay from tensor uops to kernel ast

* this dedups

* switch back to string key
2025-02-04 18:09:20 +01:00
Ignacio Sica
dcf104ee68
ptx wmma render refactor (#8873)
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-02-04 11:01:23 -05:00
qazal
b92f36179d
don't use set in schedule + add GroupOp.All [pr] (#8882)
* don't use set in schedule + add GroupOp.All [pr]

* update that
2025-02-04 08:19:27 +01:00
George Hotz
56fa5c1191
dsp simulator (#8869)
* dsp simulator

* progress

* fix

* close on test tiny

* working

* less waste

* line savings

* Device DSP compiler

* mock DSP at the bottom

* DSP tests

* docker caching

* test update

* need load

* skip that test for CI DSP

* last touch

* ugh
2025-02-04 09:45:04 +08:00
chenyu
836cf42c2e
fix rand_like for multi (#8880) 2025-02-03 19:00:14 -05:00
chenyu
746d899dbd
move multi axis to property (#8879)
also updated tests so that axis is known prior to realize
2025-02-03 16:02:09 -05:00
nimlgen
fa90079370
amd: reallocate scratch (#8872)
* amd: reallocate scratch

* use it

* oops

* allocate default

* mypy

* ops

* address realloc from none better

* types correct

* this better

* ops

* rm
2025-02-03 23:21:37 +03:00