Commit graph

60 commits

Author SHA1 Message Date
chenyu
1fa0351acb
fix DEFINE_ACC invalid_value to have same type as localtype (#3980) 2024-03-28 19:21:17 -04:00
Patrick Tsai
e27129a798
Fix linearizer failure 26 test (#3906)
* Adjust adds between WHERE and PHI

* Not much better

* undo recursive change

* hm

* iterate over where, not factored op

* oo

* consts only for loop

* UNdo var name change

* update

---------

Co-authored-by: Patrick Tsai <patosai@users.noreply.github.com>
2024-03-24 16:34:13 -04:00
Francis Lam
0145366323
wmma: fix the AMD TC threads to split the first 16 threads (#3904)
previously it was incorrectly aliasing 16 into the size 8 upcast
on the store alias.  now it splits it properly into 8 and the
remaining 2 into the correct local stride
2024-03-23 21:17:42 -04:00
chenyu
a2b2597fc2
replace dtype.name str with render_dtype (#3903)
fixed some bf16 cast issue since it does not have `.name`.
also more robust if there are lang specific type override
2024-03-23 19:25:48 -04:00
chenyu
30fa03243e
reuse fuzz_linearizer.compare_linearizer in test_linearizer_failures (#3861) 2024-03-21 14:12:27 -04:00
chenyu
33dd99acf4
remove helper_add_store from test_linearizer_failures (#3860) 2024-03-21 12:53:31 -04:00
Francis Lam
131bbb6563
test_linearizer_failure: add failure 27 from a gpt2 kernel (#3825)
* test_linearizer_failure: add failure 27 from a gpt2 kernel

found during a full fuzz test of applied_opts combos to a
depth of 4 on the gpt2 kernels w/o GROUPTOP.

added additional examples to failure 26 that don't have GROUPTOP

* add other platform failure
2024-03-19 16:29:50 -04:00
Francis Lam
9851e2c3b9
test_linearizer_failure: add failure 26 from a gpt2 kernel (#3821)
found during a full fuzz test of all applied_opts combos to a
depth of 3 on the gpt2 kernels
2024-03-19 13:19:54 -04:00
chenyu
ac866eaf5a
disable simplify_phi_loops (#3812)
* disble simplify_phi_loops

this breaks BEAM search GPT2.

* skip that
2024-03-18 19:25:26 -04:00
Francis Lam
a7afd2f6bf
test_linearizer_failures: add failing kernel from GPT2 CUDA (#3808)
* test_linearizer_failures: add failing kernel from GPT2 CUDA

* test_linearizer_failure: remove "HIP" from failed_platforms
2024-03-18 17:16:40 -04:00
qazal
e3e89c244b
multioutput uoping infra (#3706)
* linearize multioutput

* add vars to copy
2024-03-15 21:56:59 -07:00
chenyu
a2d3cf64a5
move is_dtype_supported to test.helpers (#3762)
* move is_dtype_supported to test.helpers

updated all places that check if float16 is supports

* fix tests
2024-03-15 14:33:26 -04:00
nimlgen
6b8c66e04f
fix broken loops in llvm (#3751) 2024-03-15 11:57:51 +03:00
nimlgen
6bf11a2ce3
fix incorrect direct store with gep (#3735)
* fix incorrect direct store with gep

* better comment

* phi as well

* dtype check there

* mypy happy?

* not used

* renames

* phi in phi
2024-03-14 20:58:50 +03:00
qazal
43953c0ba9
skip grouped store for umatching upcasts (#3723)
* skip if upcasts dont match

* outputs match now

* this ast is hardcoded

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-03-14 01:18:31 -04:00
nimlgen
08064a0e29
add SEED env to fuzz_linearizer (#3713)
* add SEED env to test/external/fuzz_linearizer.py

* found some

* more platforms
2024-03-13 18:08:42 +03:00
chenyu
e1b2a82d89
fix st.real_size can be nagative if valid is always false (#3708)
two followups after this. (1) if a buffer is never accessed in kernel, it can be removed from input (2) real_size can be smaller conditional on valid being true (the old validhack stuff)
2024-03-12 20:34:07 -04:00
Francis Lam
b6e2495fdd
kernel: limit shared memory usage when adding opts (#3705)
* kernel: limit shared memory usage when adding opts

* search: remove unnecessary limit on search space

apply_opt will do the more correct check
2024-03-12 17:06:21 -04:00
Patrick Tsai
971d7f5d7c
O(n) arange attempt (#3530)
* It works?

* Clamp correctly

* Refactor

* Make code better

* Undo some stuff

* First step to trying to make floats work

* Floats work in Python op but not metal because int div is different

Python integerdivision was implemented as // which rounds towards
negative infinity, but C integer division rounds towards 0 so there
is an off-by-1 division error

* arange does cumsum with ints and then multiplies by step

This is so loop optimization can remain int only

* Undo a lot of symbolic changes

* Final check

* Cleanup

* There can be multiple phis

* Fix multiple phi op removal

* const sets dtype correctly

* Fix bugs

* Fix a couple bugs and add loop vars to resolve

* missed one

* Don't trim too many ops

* Fix symbolic test

* Use ones instead of full

* Delete test

* Lint passes

* max node error

* Small updates to loop logic

* Remove unnecessary changes

* We are getting somewhere

* Simple case

* Fix

* rm, prn

* Better

* If NumNode doesn't work then continue

* clamp is needed for arange(256)

* Move everything into the optim fn

* Replace correctly

* Order optimizations better

* Delete

* mypy

* Test for simplification

* Rename

* Fix test

* update test description

* Undo more

* Cleanup

* No replaced_ops map

* Fix lint

* AssertionError

* back again

* Reinstate assertion

* Return true and make diff not as big

* Bigger range for test

* Change cumsum impl

* fix bug

* make big cumsum work

* lint

* Undo cumsum 2-stage removal

* No while helper

* optional min/max clamping

* floats work

* rm giant arange test

* fix python cast None

* Check phi parents

* one phi allowed per where

* Fix one phi per where

* Rework iteration

* Delete assertions

* convert to int

* Try mul -1 instead of neg for hip..?

* Remove one phi per where requirements

* one accum only

* Lint

* should simplify a loop at a time

* Don't get rid of loop explcitly

* Need to iterate backwards

* lint

* unary neg

* Make optim work for onnx and sum_pad_collapse

* Better message

* filter alu ops correctly

* Fix the limiter

* lint and simplify

* Add it back

* off by one error

* test wheres and phis

* test max ops and non-if stuff

* <=

* cast_scalar

* Oops

* Change test

* Pass loop uops instead of a modified map

* Cut param transfer between linearizer and uops

* Fix issues

* Fix lint

* fix efficientnet python 3.8 invalid syntax

* distinct vars in seen_vars

* accurate var names

---------

Co-authored-by: Patrick Tsai <patosai@users.noreply.github.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-03-11 16:09:20 -07:00
chenyu
915f98791c
use custom KernelOptError in kernel opt (#3661)
be more specific about invalid kernel opt, used that in test_linearizer_failures.

make BEAM kernel search work even with assertion disabled.

`BEAM=2 python3 -O examples/llama.py  --temperature=0 --count=10 --prompt="Hello." --timing`
2024-03-08 15:36:16 -05:00
chenyu
1130c73844
add FUZZ_NTH to fuzz_linearizer (#3656)
* add FUZZ_NTH to fuzz_linearizer

also update tests in test_linearizer_failures to not just run on METAL

* update failures for HIP/HSA

* test_failure_21 LLVM PADTO
2024-03-08 09:16:49 -05:00
chenyu
b282a45e39
fix direct store float4 with same vin (#3652)
In a kernel that stores expanded value, the vin of float4 can come from same source, and we only remove once in that case.
2024-03-07 18:11:50 -05:00
chenyu
d33311ebe0
remove parens of ALU if it has associative property (#3635)
need to remove SUB since it's possible to have (const - (const - const)) in test/test_ops.py::TestOps::test_cos,
in which case cannot remove the parens of children
2024-03-06 21:12:11 -05:00
chenyu
fe6b6e38c1
remove parentheses of GEP if it's from SSA (#3634)
fixed some bracket nesting level exceeded maximum of 256 errors
2024-03-06 20:22:46 -05:00
chenyu
48d22067ca
clean up test_linearizer_failures (#3562)
* cleanup test_linearizer_failures

* fix test_failure_8

* fix that

* better assert message
2024-03-01 15:57:17 -05:00
chenyu
1136e2a82a
skipIf(not( -> skipUnless( in test_linearizer_failures (#3519)
if these behaves weirdly in CI might need to disable them in CI
2024-02-28 13:48:47 -05:00
Francis Lam
39d75f0d58
test_linearizer_failures: add more METAL examples (#3495)
these were obtained from running fuzz_linearizer on METAL
2024-02-26 10:19:05 +01:00
George Hotz
871ba73e65
_reduce_op is axis based now (#3462)
* _reduce_op is axis based now

* axis_

* update lin failures

* disable that

* fix shape
2024-02-21 16:36:31 +01:00
xarkes
28a8b72024
Remove Interpreted device & remaining CPU/TORCH ref (#3423)
* Remove Interpreted device & remaining CPU/TORCH ref

* Oops

* supports_device was useful

* Fix doc wording

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-02-16 00:30:21 -05:00
Francis Lam
668324d92b
wmma: protect TC locals from modification and use only LOCAL (#3379)
also remove unnecesssary upcast_dim from tensor_core and calculate
it from the dimensions and thread sizes
2024-02-13 10:19:35 +01:00
Francis Lam
f1ad01fd91
test_linearizer_failures: add new linearizer compile failure on METAL (#3380) 2024-02-12 20:28:34 -05:00
Francis Lam
2266152b28
linearizer: added FUZZ_BEAM to fuzz_linearizer and additional tests (#3340)
Fixed test_tensor_core_opts to test all the TCs.

Added commented out failing tests in test_color_shapes_with_local.
2024-02-08 16:12:58 +01:00
nimlgen
5097d5b808
fix padto when with late reduce (#3180)
* fix padto test

* no long comment
2024-01-19 14:01:44 -05:00
nimlgen
f87ecbb0f3
fuzzer validates outputs + (partially) oob accesses (#3178)
* fuzzer validates outputs + (partially) oob accesses

* +random

* oob check only for compiled

* type cmp fixes

* fix zeroing

* no prints

* add seed
2024-01-19 13:34:51 -05:00
George Hotz
c003be7309
Revert "track size in shapetracker" (#3043)
* Revert "track size in shapetracker (#3026)"

This reverts commit a8ba1ac08f.

* st.size
2024-01-08 13:13:39 -08:00
George Hotz
a8ba1ac08f
track size in shapetracker (#3026)
* track size in shapetracker

* shapetracker adapter

* size is an int

* create Buffer with st.size

* only compare the views for the jit

* fix webgpu
2024-01-05 20:15:53 -08:00
George Hotz
a280cfe169
move dtypes to dtype.py (#2964)
* move dtypes to dtype.py

* fix urllib
2024-01-01 14:58:48 -08:00
qazal
5f07ef455e
update dtypes (#2872) 2023-12-20 15:04:02 -05:00
George Hotz
051402625e
remove pushing contig + fix linearizer bug (#2798)
* remove that logic

* fix test, move LOADs

* fix repeat issue on LLVM

* with_phi
2023-12-16 09:36:31 -08:00
chenyu
765f8b05e5
TernaryOps.WHERE has vin[0] as bool and BinaryOps.CMPLT always outputs bool (#2782)
* vin[0] to where is always bool

* due to better hack

* update test

* fix test_uops
2023-12-15 14:51:51 -05:00
George Hotz
c6eb618013
tests from new lazy branch (#2774)
* tests from new lazy branch

* fix lin 11

* that was needed

* doesn't fail

* mark

* meant that

* llvm passes
2023-12-14 23:06:39 -08:00
George Hotz
6d6eb9302d
ruff checks the max line length is 150 (#2734)
* ruff checks the max line length is 150

* fix tensor.py

* a lot more

* done
2023-12-12 17:34:47 -08:00
chenyu
0fb1d47aa0
two linearizer fuzzer failed test case for webgpu (#2685)
* add a linearizer fuzzer failed for webgpu

* CI specific
2023-12-08 22:52:34 -05:00
Ahmed Harmouche
4b01839774
support vals on WebGPU, run more tests (#2668)
* Vals on webgpu, run more tests

* Skip slow tests, run symbolic ops tests

* Balance out tests
2023-12-07 16:45:21 -08:00
qazal
4380ccb169
Non fp32 math (#2264)
* `global_load` and `global_store` using buffer dtype

* `UOps.PHI` in all dtypes

* `UOps.ALU` in all dtypes

* `UOps.CONST` & `UOps.DEFINE_ACC` in all dtypes

* -- endof implementation --
+tiny lint changes

* these tests require the fp16 extention

you can run them locally to confirm they're green: (GPT2 test is broken in master for mac, see [this](https://discord.com/channels/1068976834382925865/1069001075828469790/1177993277958533261)

`GPU=1 python3 -m pytest test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_dequantizelinear_e4m3fn_float16_cpu test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_max_float16_cpu test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_min_float16_cpu test/models/test_real_world.py::TestRealWorld::test_llama test/models/test_real_world.py::TestRealWorld::test_gpt2 test/models/test_whisper.py test/test_specific_conv.py::TestSpecific::test_big_vec_mul`

skip the new test_linearizer_failures in CI GPU because of the fp16 extention

This passes on a real GPU since the extention is available:
`GPU=1 python3 -m pytest test/test_linearizer_failures.py::TestLinearizerFailures::test_failure_8`

see CI logs [here](https://github.com/tinygrad/tinygrad/actions/runs/6996590597/job/19032641427#step:14:644)

* these tests fail in CI due to segfaults and CPU crashes

To confirm they're green locally, you can run the following commands:

1. For the tests skipped in test_ops.py (note: CLANG is very slow)

`for var in GPU CUDA CLANG; do export $var=1; for test in test/test_ops.py::TestOps::test_slice_fancy_indexing_no_dim_collapse test/test_ops.py::TestOps::test_slice_fancy_indexing_dim_collapse_int test/test_ops.py::TestOps::test_slice_fancy_indexing_dim_inject_none test/test_ops.py::TestOps::test_slice_fancy_indexing_dim_inject_and_collapse; do python3 -m pytest $test; done; unset $var; done`

2. For the ONNX tests skipped in CLANG:

```
CLANG=1 python3 -m pytest test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_ai_onnx_ml_array_feature_extractor_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_gather_elements_0_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_mean_weight_ii_3d_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_gather_elements_1_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_NCd1_mean_weight_negative_ii_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1_weight_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2d3_none_no_weight_negative_ii_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1_ii_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_mean_weight_ii_4d_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_mean_weight_ii_3d_log_prob_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_gather_elements_negative_indices_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_NCd1d2d3d4d5_mean_weight_log_prob_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_NCd1_mean_weight_negative_ii_log_prob_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_no_weight_reduction_mean_ii_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_NCd1d2d3d4d5_mean_weight_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2d3d4d5_mean_weight_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1_mean_weight_negative_ii_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_mean_weight_ii_4d_log_prob_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_with_weight_reduction_mean_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1_weight_ii_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_with_weight_reduction_sum_ii_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_with_weight_reduction_sum_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_reduction_sum_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2d3d4d5_none_no_weight_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2d3_sum_weight_high_ii_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_reduction_mean_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_with_weight_expanded_cpu
```

3. The LLVM test I skipped here is already [skipped in master for all backends](https://github.com/tinygrad/tinygrad/blob/master/test/external/external_test_onnx_backend.py#L186), I just made it more specific

`LLVM=1 python3 -m pytest test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_dequantizelinear_e4m3fn_float16_cpu`

* Revert "these tests fail in CI due to segfaults and CPU crashes"

This reverts commit 15db570143.

* merge with cleanup-vectorized-hip-renders

* barely working HIP P1, ALU ops need a refactor?

* manage the fact that in HIP [half2 is actually an unsigned int vec](f921880387/hip/include/hip/amd_detail/amd_hip_fp16.h (L59)) and half is a totally different __half that [has an unsigned int element in it](f921880387/hip/include/hip/amd_detail/amd_hip_fp16.h (L50)) but can't be accessed [because it's private](f921880387/hip/include/hip/amd_detail/amd_hip_fp16.h (L86)). If you just do this:

```
half2 val0 = // ...
half val1 = // ...
```
then you can't do:
```
val0.x + val1 // error: use of overloaded operator '+' is ambiguous (with operand types 'unsigned short' and 'half' (aka '__half'))
```

* update the sign definition to avoid division by zero in all dtypes

* diff cleanup p1: why were these in the diff anyways

* less hacky HIP, enable CIFAR fp16 benchmark, test ops for HIP in CI!

add ALU ops overloads for HIP

this will make HIP max work

handle mod

Revert "handle mod"

This reverts commit 370fd4b3fbe99b6ae8cc293d005b106628205933.

update max to use hmax

add HIP GEP render logic

enable CIFAR fp16 benchmark

test ops for HIP

back to store as float because this only works for float4 grouping right now

test_ops for hip!!

always sign

* back to the sign we had before because we cant do a backward pass on a Less node

* remove old hacks

HIP compiling test_ops in CI takes ~9 mins, not doing it for now

new HIP ALUs

* reduce accs done right

* refactor to function

* no device hacks

hacks p2

the other way

* LLVM ALU ops

half, float and double are all float

update max

* update test_uops, cmplt is always a bool in the real linearizer. assertAlmostEqual is wrong when ret is bool

* cleanup LLVM wrong code

* dummy change for the CUDA install glitch

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2023-12-03 13:45:49 -08:00
chenyu
3eb3c74675
metal ci tests everything (#2499)
* metal ci tests everything

* pretty good

* METAL
2023-11-29 12:04:37 -05:00
George Hotz
5629fc368c
Use Buffer.STORE at the end of ASTs (#2494)
* work

* store broken

* interpreteds work

* this passes

* symbolic cpu

* fix tests

* fix opt tests

* images fail

* fix InterpretedFlopCounter

* stupid hack for images
2023-11-28 20:11:37 -08:00
George Hotz
ab5d14d4ba
MEM -> LOAD (#2492)
* MEM -> LOAD

* keep legacy working
2023-11-28 16:46:37 -08:00
Christopher Mauri Milan
7f01dd04f0
Apply ruff linting rules to tests (#2473)
* everything except F821

* enable F821 with noqa

* dumb fix

* fix remaining imports and (former) lambdas

* replace _ with noqa to avoid gc
2023-11-27 21:24:06 -08:00
George Hotz
9e07824542
move device to device.py (#2466)
* move device to device.py

* pylint test --disable R,C,W,E --enable E0611

* fix tests
2023-11-27 11:34:37 -08:00