Commit graph

4,667 commits

Author SHA1 Message Date
Arseny Kapoulkine
514c43201d
Fix issues with pointer provenance in load/store through ALU (#3916)
* Track pointer provenance in load/store through ALU

Previously load/store could be incorrectly rendered into
ld.global/st.global when the input was an ALU op that performed an
address computation with DEFINE_LOCAL on one of the arguments.

* Simplify the load provenance workaround

The issue is that we can render the same code twice, and on the second
run the opstream is already modified so that vin[0] isn't a DEFINE_*,
which overwrites initially correct .shared wth .global.

* Add a couple tests for basic local use

* Skip local tests on LLVM since it doesn't implement DEFINE_LOCAL
2024-03-25 14:41:05 -07:00
chenyu
83f39a8ceb
env var to change default float (#3902)
* env var to change default float to fp16 or bf16

looking for standard names for these. we have FLOAT16 that does something to IMAGE and HALF to convert weights.

working on default bf16 too.
```
RuntimeError: compile failed: <null>(6): error: identifier "__bf16" is undefined
    __bf16 cast0 = (nv_bfloat16)(val0);
```

remove that in cifar

* DEFAULT_FLOAT

* default of default

* unit test

* don't check default

* tests work on linux
2024-03-24 20:33:57 -04:00
George Hotz
03899a74bb increase atol on reset train 2024-03-24 15:17:31 -07:00
qazal
d8fafca13a
assign regression (#3907)
* infra

* track mutations

* assign levels

* add seen back

* add test

* infra 2.0

* add assign targets

* dont need levels

* delete

* Update test_assign.py

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-03-24 15:12:31 -07:00
Patrick Tsai
e27129a798
Fix linearizer failure 26 test (#3906)
* Adjust adds between WHERE and PHI

* Not much better

* undo recursive change

* hm

* iterate over where, not factored op

* oo

* consts only for loop

* UNdo var name change

* update

---------

Co-authored-by: Patrick Tsai <patosai@users.noreply.github.com>
2024-03-24 16:34:13 -04:00
wozeparrot
9a9cac58f9
add lars to nn (#3750)
* feat: add lars

* feat: don't remove this comment

* clean: smaller diff

* clean: shorter line

* feat: remove mlperf lars, switch resnet

* fix: fully remove mlperf lars

* clean: comment

* feat: contiguous

* feat: no weight decay on skip params

* feat: optimizergroup

* feat: classic momentum

* fix: pylint

* clean: move comment

* fix: correct algo

* feat: lrschedulergroup

* feat: skip list tests

* feat: :| forgot that params are a thing

* feat: remove skip_list params from main params

* feat: set moment

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-03-24 11:43:12 -04:00
chenyu
2c69888654
include negative float in test_dtype (#3884)
* include negative float in test_dtype

* that is ub

* too annoying

* pack can overflow
2024-03-24 02:39:15 -04:00
Francis Lam
0145366323
wmma: fix the AMD TC threads to split the first 16 threads (#3904)
previously it was incorrectly aliasing 16 into the size 8 upcast
on the store alias.  now it splits it properly into 8 and the
remaining 2 into the correct local stride
2024-03-23 21:17:42 -04:00
chenyu
a2b2597fc2
replace dtype.name str with render_dtype (#3903)
fixed some bf16 cast issue since it does not have `.name`.
also more robust if there are lang specific type override
2024-03-23 19:25:48 -04:00
Alejandro F Queiruga
556dcfb8f2
Fix the result permutation in einsum (#3895)
* Fix permutation of result indices in einsum.

* Delete stray line used for breaking tests

* Fix linter error by renaming twice-used variable

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-03-23 15:48:19 -04:00
chenyu
2d3ce53348
touchup test_dtype.test_gradient_dtype (#3887)
add back bad merge from #3613 and add float.double and float.bfloat16 to test
2024-03-22 20:56:45 -04:00
David Hou
fc11808a79
initialize Tensor grad same type as self (#3613)
* initialize Tensor grad same type as self

* also test different default float

* check dtype + try/finally

* don't test_gradient_dtype if f16 is not supported

* fix bad merge

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-03-22 20:33:18 -04:00
Francis Lam
8db7a6bbcc
debug: add optional detailed BEAM_LOG logging (#3883)
* debug: add optional detailed BEAM_LOG logging

show uop count, compile and run times for each candidate in search

also add --timing to verify_kernel.py to make it easier to explore
hand-crafted applied opts

* fix linter
2024-03-22 19:23:31 -04:00
George Hotz
54dc48aa47
fix assign (#3878)
* fix assign

* remove terrible optimizer hack

* oops, not realized assigns
2024-03-22 11:48:48 -07:00
Francis Lam
5587594a00
fuzz_linearizer: add --ast and --file params to read kernels (#3877)
also fix up ast_str_to_str to support the new tuple of LazyOps
2024-03-22 14:27:40 -04:00
chenyu
c5467e5bd6
diverse test value in test_dtype DATA based on dtype (#3864)
* diverse test value in test_dtype DATA based on dtype

* eh fix typo

* that too?

* PTX does not support i8 and s8

* skip that

* unused line

* pus the hack back

* remove that
2024-03-22 14:22:06 -04:00
George Hotz
86ee36e697
preschedule all (#3875) 2024-03-22 11:20:06 -07:00
Szymon Ożóg
d8c3f1894a
Use UOpGraph in test (#3876) 2024-03-22 14:12:38 -04:00
qazal
fe6ceff15f
proposal: multioutput JIT spec (#3856)
* corealize JIT

* requirements
2024-03-21 21:28:30 -07:00
uuuvn
6729f20aab
Ring allreduce try 2 (#3852)
* Ring allreduce v3

* Configurable size, number of gpus and jit in benchmark

* ScheduleBarrier v0

* GB/s that make sense

* ScheduleBarrier v0.1

* Fallback on 2 GPUs

* ScheduleBarrier v0.2

* ScheduleBarrier v0.3

* ScheduleBarrier v0.3.1

* ScheduleBarrier v0.3.2

* Replace ScheduleBarrier with automatic optimization

* unused import

* fix comment

* typing

* better fallback

* python 3.8

* RING=2 and use ContextVar

* DEBUG >= 2 and change name

* linter

* type

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
2024-03-21 19:17:51 -04:00
Francis Lam
3c0478bfab
fuzz_linearizer: add additional DEBUG info for comparison errors (#3866) 2024-03-21 18:58:10 -04:00
chenyu
e50b7abe4f
diversed buf inputs based on dtype in fuzz_linearizer (#3863) 2024-03-21 16:23:11 -04:00
chenyu
30fa03243e
reuse fuzz_linearizer.compare_linearizer in test_linearizer_failures (#3861) 2024-03-21 14:12:27 -04:00
chenyu
33dd99acf4
remove helper_add_store from test_linearizer_failures (#3860) 2024-03-21 12:53:31 -04:00
chenyu
6bf0b82267
alloc new output in fuzz_linearizer between baseline and real one (#3859)
if the kernel is an assign `a += 1`, the rawbufs[0] is updated twice and gives false compare_error
2024-03-21 11:36:05 -04:00
nimlgen
85691c8e20
fix hsa sync issue (#3847)
* fix hsa sync issue

* linter
2024-03-21 04:00:30 +03:00
chenyu
f271cd682b
user _resolve_dim in argmax (#3846)
also added comment of the behavior if there are multple, and more tests
2024-03-20 20:17:30 -04:00
Francis Lam
6d5dec2fef
log optimized kernels and a script to compare with non-optimized ones (#3829)
* search: add BEAM_VERIFY option to validate search results

refactor fuzz_linearizer comparison to allow it to be used in for
BEAM_VERIFY in device.py

* search: fix to verify the beam_search result and not the fastest

* search: fix typing and clean up

* device: remove imports from test and add LOGKERN options

LOGKERN output can be used with test/external/verify_kernel.py
to validate correctness

* fix example in verify_kernel.py

* cleanup fixes

* fix to use f-strings
2024-03-20 19:22:08 -04:00
chenyu
519336cfea
factor out partial in SumNode div int (#3841)
* factor out partial in SumNode div int

* div not rem

* space
2024-03-20 16:34:33 -04:00
George Hotz
8cb5215885
Revert "Ring allreduce in multitensor (#3000)" (#3840)
This reverts commit c5bf9e4c96.
2024-03-20 11:41:49 -07:00
uuuvn
c5bf9e4c96
Ring allreduce in multitensor (#3000)
* Ring allreduce v3

* Configurable size, number of gpus and jit in benchmark

* ScheduleBarrier v0

* GB/s that make sense

* ScheduleBarrier v0.1

* Fallback on 2 GPUs

* ScheduleBarrier v0.2

* ScheduleBarrier v0.3

* ScheduleBarrier v0.3.1

* ScheduleBarrier v0.3.2

* Replace ScheduleBarrier with automatic optimization

* unused import

* fix comment

* typing

* better fallback

* python 3.8

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
2024-03-20 11:20:01 -07:00
chenyu
455f7bea9b
test example from half resnet that idx has number outside of int32 (#3838)
* test example from half resnet that idx has number outside of int32

* ruff
2024-03-20 13:44:20 -04:00
chenyu
d17900bc45
use int32 instead of default_int in simplify_phi_loops (#3828)
* use int32 instead of default_int in simplify_phi_loops

indices are in int32 now and is separated from buffer dtype. fix #3823

* return early if not supported

* it's not that

* why is it failing for RHIP
2024-03-19 17:49:58 -04:00
chenyu
99cbc24390
use dtypes.int32 as return dtype for functions that return indices (#3827)
behavior matches jax. It's fine to have a tensor greater than max int8 size even if we set default int to int8
2024-03-19 17:06:57 -04:00
chenyu
fa1921ec7d
move test_dtype tests to test dtype and output value (#3826) 2024-03-19 16:31:27 -04:00
Francis Lam
131bbb6563
test_linearizer_failure: add failure 27 from a gpt2 kernel (#3825)
* test_linearizer_failure: add failure 27 from a gpt2 kernel

found during a full fuzz test of applied_opts combos to a
depth of 4 on the gpt2 kernels w/o GROUPTOP.

added additional examples to failure 26 that don't have GROUPTOP

* add other platform failure
2024-03-19 16:29:50 -04:00
Francis Lam
9851e2c3b9
test_linearizer_failure: add failure 26 from a gpt2 kernel (#3821)
found during a full fuzz test of all applied_opts combos to a
depth of 3 on the gpt2 kernels
2024-03-19 13:19:54 -04:00
Patrick Tsai
b436c9792f
Fix factoring bug (O(n) arange related) (#3817)
* Factoring bug

* Another one in case

* It works now so change tests back

* large arange cumsum optimization

* More cleanup

* symbolic no factor div test

* name change

* Rename test

---------

Co-authored-by: Patrick Tsai <patosai@users.noreply.github.com>
2024-03-19 11:49:42 -04:00
chenyu
a6ed2ae3c6
use old cumsum optimization for arange (#3813)
revert to old cumsum opt while phi simplification is disabled.

added a flops complexity test for this
2024-03-18 20:01:03 -04:00
chenyu
ac866eaf5a
disable simplify_phi_loops (#3812)
* disble simplify_phi_loops

this breaks BEAM search GPT2.

* skip that
2024-03-18 19:25:26 -04:00
George Hotz
4c4d3cb3e3
restrict assignment to base (#3809)
* restrict assignment to base

* add some restrictions there

* more restrictions
2024-03-18 15:33:06 -07:00
chenyu
20681d5c4a
remove old dist multigpu (#3811) 2024-03-18 18:31:05 -04:00
chenyu
5dd048a378
remove HIP in core tinygrad (#3810)
* remove HIP in core tinygrad

ci test uses device RHIP and HSA compiler (LinearizerOpt), so fine to remove HIP from tc.
Also updated README and EMULATE tc test flag

* EMULATE_CUDA
2024-03-18 18:19:27 -04:00
Francis Lam
a7afd2f6bf
test_linearizer_failures: add failing kernel from GPT2 CUDA (#3808)
* test_linearizer_failures: add failing kernel from GPT2 CUDA

* test_linearizer_failure: remove "HIP" from failed_platforms
2024-03-18 17:16:40 -04:00
George Hotz
d8296d4a3f
simple assign tests (#3807) 2024-03-18 13:57:01 -07:00
wozeparrot
a0ab755317
threefry again (#3785)
* feat: initial xor

* feat: initial threefly

* feat: remove custom random

* fix: really need to install precommit

* feat: lmao forgot that this is rotate not a shift

* clean: put that there

* feat: numpy xor

* feat: quick test for xor

* feat: llvm xor

* feat: slightly working xor in torch

* feat: rand works in jit

* clean: save a line

* feat: match jax

* feat: maybe test against jax

* feat: requires_grad

* fix: fix test_symbolic_ops

* feat: lower alpha

* feat: just pad

* fix: maybe fix training tests?

* fix: fix some llvm stuff

* feat: cursed realize on the way out

* feat: testing jax

* fix: why is the jax install process not simple

* fix: maybe passing test

* fix: symbolic workarounds

* clean: still need that precommit

* fix: aaaa

* fix: more test fixes

* fix: quick fix for wgsl

* feat: need to set requires_grad on the final tensor

* feat: one more tensor

* feat: don't take forever

* feat: seeing y ci is brok

* feat: can't allocate 64GiB lmao

* fix: fix this

* feat: hope this doesn't break smth before i go to bed

* feat: don't destroy ram

* feat: int

* feat: remove jax

* feat: properish workaround?

* feat: skip slow webgpu tests

* feat: no longer fails

* feat: use dtypes

* feat: real number

* fix: torch

* fix: don't test against reference for torch

* feat: to device

* feat: fix advanced indexing

* feat: correct casting

* feat: even rng_counter

* feat: match master

* feat: this was actually bad

* fix: maybe?

* feat: store

* feat: remove realizes

* feat: somehow this is important

* feat: somehow this is also important

* feat: save a line

* fix: don't need that anymore

* feat: restore this

* fix: linter

* feat: remove realizes

* fix: realized is in base now

* fix: add back cast

* fix: bump deadline

* fix: bump deadline

* fix: bump deadline

* fix: bump deadline

* fix: bump deadline

* fix: :(

* fix: :(

* fix: not being dumb

* feat: try changing less tests

* feat: shouldn't have to change that

* feat: contiguous bumps it by one

* fix: hmm

* fix: numpy memory moment

* fix: cl_khr_fp16

* fix: torch has different tensor count

* fix: missing contiguous

* hmm: hmm

* fix: some fixes

* fix: typing

* feat: dont do that

* feat: typing fixes

* feat: why is this realize required?

* feat: ngl kinda odd typing

* feat: oh

* feat: remove realizes

* feat: why is this realize required?

* fix: hacky patch for cudacpu

* fix: without this realize pytest crashes?????

* fix: shorter line

* fix: cudacpu fixes

* fix: cudacpu fixes

* feat: real buffer

* feat: don't search when searching lmao

* fix: can't use contiguous things

* fix: no more 100GB arrays

* fix: revert

* fix: skip 7 and 10

* feat: working ish beam

* feat: minimize changes

* feat: seed 0 stable diffusion example changed

* fix: different on ci

* fix: no beam

* feat: make threefry optional

* fix: check value

* fix: unused import

* feat: threefry default

* fix: 5d

* feat: allow non upcast div

* fix: 5d better

* fix: 5d better

* fix: save all dtype

* feat: proper error

* feat: lazyop key

* fix: check float

* feat: try removing this realize now

* feat: disable threefry for uops hip tensor cores

* feat: don't need that

* feat: only check upcast

* fix: disable threefry for some metal tests

* feat: disable for metal tensor uops as well

* feat: disable for most uops

* fix: disable threefry for new uops tests

* feat: multitensor

* fix: typing

* feat: threefry default off

* feat: skip threefry half rand

* feat: restore old

* fix: bad git

* clean: ruff

* feat: bfloat16 fix

* fix: :|

* feat: restore old

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-03-18 16:47:07 -04:00
nimlgen
629757eaa1
hotfix: update inputs of correct transfers in hsagraph (#3800)
* hotfix: update inputs of correct transfers in hsagraph

* test it

* run in ci?
2024-03-18 15:52:27 -04:00
George Hotz
0183a05f0a
test assign (#3798)
* Reapply "add failing assign test (#3796)" (#3797)

This reverts commit 1e1beb888c.

* no realized check
2024-03-18 08:58:04 -07:00
George Hotz
1e1beb888c
Revert "add failing assign test (#3796)" (#3797)
This reverts commit 2dea12832c.
2024-03-18 08:55:36 -07:00
George Hotz
2dea12832c
add failing assign test (#3796)
* that was a hack

* tests to reveal the issue

* add assign for realized assign
2024-03-18 08:47:30 -07:00