4,667 commits

Author SHA1 Message Date
chenyu
1d7f01bc6d
fix gpt2 with empty prompt (#3100)
logits would be empty so need to replace that with ones before sampling, also cannot reshape with -1 when there's 0 in other axes
2024-01-12 14:18:17 -05:00
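
The reshape remark in #3100 can be illustrated with a small shape-only sketch. It is not tinygrad's code: `infer_reshape` is a hypothetical helper that resolves a single -1 from the element count, and it shows why -1 cannot be inferred once another axis is 0 (any value would satisfy 0 * x == 0).

```python
from math import prod

def infer_reshape(old_shape, new_shape):
    """Resolve one -1 in new_shape (hypothetical helper, shapes only)."""
    total = prod(old_shape)
    if -1 not in new_shape:
        return tuple(new_shape)
    known = prod(d for d in new_shape if d != -1)
    if known == 0:
        # Every candidate size gives a product of 0, so -1 is ambiguous.
        raise ValueError("cannot infer -1 when another axis is 0")
    if total % known != 0:
        raise ValueError("element count does not match the requested shape")
    return tuple(total // known if d == -1 else d for d in new_shape)

print(infer_reshape((2, 3), (-1,)))     # one axis holding all 6 elements
print(infer_reshape((0, 4), (-1, 4)))   # -1 resolves to 0 here
```

A caller with a possibly empty tensor would therefore pass explicit sizes instead of -1.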
chenyu
f3a50b4e40
fix broadcasted logic if there's 0 in shapes (#3097)
* fix broadcasted logic if there's 0 in shapes

should always expand into 0, not the other way around. fixed matmul with 0 in input shapes.
for forwards for now though, backward is more involved and would need to change 0 size shortcuts

* fix tests
2024-01-12 13:32:43 -05:00
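
The rule stated in #3097, that a size-1 axis expands into 0 and not the other way around, can be sketched on plain shape tuples. This is an illustration under NumPy-style broadcasting assumptions, not the tinygrad implementation; `broadcast_shape` is a hypothetical name.

```python
def broadcast_shape(a, b):
    """Broadcast two shapes; a size-1 axis stretches to the other size, including 0."""
    n = max(len(a), len(b))
    # Left-pad the shorter shape with 1s so both have the same rank.
    a = (1,) * (n - len(a)) + tuple(a)
    b = (1,) * (n - len(b)) + tuple(b)
    out = []
    for x, y in zip(a, b):
        if x == y or y == 1:
            out.append(x)
        elif x == 1:
            out.append(y)  # 1 expands into y, even when y == 0
        else:
            raise ValueError(f"cannot broadcast {x} with {y}")
    return tuple(out)

print(broadcast_shape((3, 1), (0,)))  # the size-1 axis becomes 0
```

Taking `max(x, y)` per axis would pick 1 over 0 and produce a non-empty result from an empty input, which is the direction the commit message rules out.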
SnakeOnex
025fbf4e80
One hot in tensor.py (#3093)
* onehot in Tensor.py

* one_hot tests

* works for all shapes, not just 1

* pylint

* not a static method

* moved around, num_classes mandatory

* pylint

* pylint

* space & moving

* formatting

* moved tests
2024-01-12 13:31:18 -05:00
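
As a rough picture of what #3093 describes (one-hot encoding for inputs of any shape, with a mandatory `num_classes`), here is a sketch on nested Python lists. It only mirrors the behaviour listed in the commit bullets and says nothing about how the method in tensor.py is written.

```python
def one_hot(indices, num_classes):
    """One-hot encode an int or an arbitrarily nested list of ints."""
    if isinstance(indices, list):
        # Recurse so every input shape gains one trailing axis of size num_classes.
        return [one_hot(i, num_classes) for i in indices]
    return [1 if c == indices else 0 for c in range(num_classes)]

print(one_hot([0, 2], 3))
print(one_hot([[1], [0]], 2))
```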
chenyu
7086d77db1
bugfix do not reset shapetracker of 0 size lazybuffer (#3096)
it might be coming from an expand, and resetting results in an incorrect stride. caught by interpreted backend
2024-01-11 23:22:52 -05:00
Yixiang Gao
13e872b53f
add multigpu support for llama attention (#3064)
* add llama attention test for multigpu

* test fails

* kv cache trying to shrink on sharded axis

* mask None works for scale dot product

* kv cache seems to be working but scale dot product breaks

* scaled dot product works, but the last linear layer failed

* running into the reshape case where it could be wrong for multigpu

* making sure it was the reshape

* adding contiguous doesn't solve

* need to shard more properly

* remove reshape test

* minor adjustment to scale dot product attention test

* weights are sharded wrong

* continue fix new weight sharding

* clean up

* fix attention when start_pos is 0

* remove print

* add TODOs for the best multigpu interface
2024-01-11 16:31:02 -08:00
Yixiang Gao
adcc844755
cat works (#3086) 2024-01-11 08:25:20 -08:00
Yixiang Gao
6842476ca6
better test demonstration (#3077)
* a better test demonstration

* fix white space
2024-01-10 10:50:52 -08:00
George Hotz
2495ca95c7
early gate the graph (#3070) 2024-01-09 20:17:13 -08:00
George Hotz
ff0d6e4551
jit autorealizes output (#3069) 2024-01-09 20:10:22 -08:00
chenyu
145718a90f
unbind view or shapetracker also returns var_val (#3067)
* unbind view or shapetracker also returns var_val

4% faster for llama compile time

* one line less

* unbound_views
2024-01-09 21:45:05 -05:00
George Hotz
ac3f246c11
cached size (#3060)
* cached size

* simplify simplify

* 0 doesn't have base

* fix test

* cleaner cache

* hmm, metal is flaky on this...might be real(ish) but useless as test

* short circuit reshape/expand properly

* better reshape bypass
2024-01-09 16:37:37 -08:00
Yixiang Gao
73b72b8de2
test scaled dot product attention (#3063)
* add test

* add initial test for scaled dot product attention

* test pass for scaled dot product attention
2024-01-09 14:30:57 -08:00
chenyu
55ac2a2cf7
Tensor.cat with 0 shape tensors (#3062)
* Tensor.cat with 0 shape tensors

supported both 0 in cat axis (for a subset of input), or 0 in non-cat axis (all needs to be 0)

* no shp
2024-01-09 16:54:06 -05:00
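
The two cases named in #3062 can be checked with a shape-only sketch: a 0 on the cat axis simply contributes nothing to the sum, while a 0 on any other axis must be shared by every input. `cat_shape` is a hypothetical helper written for this illustration.

```python
def cat_shape(shapes, dim=0):
    """Output shape of concatenating tensors with the given shapes along dim."""
    first = shapes[0]
    for s in shapes:
        if len(s) != len(first):
            raise ValueError("all inputs need the same number of axes")
        for ax, (x, y) in enumerate(zip(s, first)):
            # Non-cat axes must match exactly, so a 0 there has to be 0 everywhere.
            if ax != dim and x != y:
                raise ValueError(f"axis {ax} differs: {x} vs {y}")
    out = list(first)
    out[dim] = sum(s[dim] for s in shapes)  # 0-sized inputs add nothing
    return tuple(out)

print(cat_shape([(0, 3), (2, 3)], dim=0))
print(cat_shape([(2, 0), (3, 0)], dim=0))
```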
chenyu
f0d7ad8aaa
fix gpt2 attention with start_pos = 0 (#3061)
* fix gpt2 attention with start_pos size 1

test cases taken from ll_transformer branch

* fix interpreted
2024-01-09 16:14:55 -05:00
George Hotz
2c6f2e899d
No extra vars call (#3054)
* remove unused reciprocal

* comment

* remove unneeded call to vars

* free speedup
2024-01-09 09:52:58 -08:00
Yixiang Gao
259bf9bffc
add multigpu test for RMSNorm (#3056)
* need all gather

* add two multigpu test scenarios for RMSNorm
2024-01-09 09:52:51 -08:00
chenyu
dab8214103
unit tests for Device.canonicalize (#3055) 2024-01-09 12:47:20 -05:00
George Hotz
374f7659a7
remove unused reciprocal (#3053)
* remove unused reciprocal

* comment
2024-01-09 08:59:04 -08:00
Yixiang Gao
a686663657
make Embedding device aware for multigpu (#3051)
* make Embedding device aware for multigpu

* split line instead of ignore because that's cheating

* add test incomplete

* add test complete

* remove comment

* fix white space

* remove nn.Embedding
2024-01-08 20:09:26 -08:00
chenyu
19298e7a3f
Device._buffers -> Device._devices (#3052)
backend devices used to be called buffers
2024-01-08 21:30:38 -05:00
chenyu
ee6a73826b
clean up test_nn.py (#3049)
used Tensor.train decorator, reordered to always tinygrad instances first, and removed redundant idx cast
2024-01-08 18:45:03 -05:00
chenyu
3eb3664074
fix nn.Embedding with empty length input (#3048) 2024-01-08 18:08:36 -05:00
George Hotz
7ea2e0035b
move children for speed (#3047)
* move children for speed

* no children anymore
2024-01-08 15:02:32 -08:00
George Hotz
655c6f61d3
St real size (#3046)
* track the size in the lazybuffer

* shapetracker real size

* lint
2024-01-08 14:44:53 -08:00
George Hotz
c003be7309
Revert "track size in shapetracker" (#3043)
* Revert "track size in shapetracker (#3026)"

This reverts commit a8ba1ac08f.

* st.size
2024-01-08 13:13:39 -08:00
George Hotz
c5a941d466
webgl backend in extra (#3041)
* WebGL WIP

* 84% of ops passing test

* tests passing 100%

* Cleanup, refactor

* Shave off some lines

* Work on dtypes

* TestOps at 100% again

* Efficient net shaders compile in browser webgl2

* Compile all efficientnet shaders in browser

* Create empty textures for tensor buffers

* Run program. Up next weight loading

* Exported WebGL model working

* Add tests, refactor

* Explicit cast alu for GLSL

* Fix CI tests

* WebGL efficientnet demo

* Compile and run yolov8 in browser

* Fix imports

* Simplify yolo compile

* Fix bool*bool and cast cmplt to float

* More tests

* Do std tests pass on CI?

* Skip std tests on CI

* Remove explicit_cast_alu hack, and solve it in code_for_op

* Move to new dtype-less alloc api

* Remove local size hack: optimize local_size only if device has local

* Remove glsl.py, and move content to cstyle

* dont_use_locals in opts

* Fix dtype tests

* type_map in CStyleLanguage

* Make core changes smaller, cleaner, refactor export_model and demo

* Skip pad_slice

* Simplify: render_const, render_conditional

* solve bool alu for other binops, cleaner ops_webgl

* Fix noopt hack

* Remove some skipIfs

* WebGL image hack

* type_names is a better name

* global_max

* Fix dtype import

* Fix type_names -> type_map

* Fix lint

* Remove webgpu, back to 5k lines (#3040)

* remove webgpu

* max 5000 lines

* revert those to master

* retain that cstyle

---------

Co-authored-by: Ahmed Harmouche <ahmedharmouche92@gmail.com>
2024-01-08 09:29:13 -08:00
chenyu
ef5f545fd8
add more Tensor.clip test cases (#3034)
* add more Tensor.clip test cases

add cases for same low/high and both negative etc

* case min > max
2024-01-07 13:08:59 -05:00
George Hotz
a8ba1ac08f
track size in shapetracker (#3026)
* track size in shapetracker

* shapetracker adapter

* size is an int

* create Buffer with st.size

* only compare the views for the jit

* fix webgpu
2024-01-05 20:15:53 -08:00
chenyu
138c17c094
enable argmax tests for METAL/WEBGPU in CI (#3027)
not sure why it was skipped but works now in CI
2024-01-05 21:43:00 -05:00
George Hotz
2a2d3233d2
add test that the compiler isn't used (#3025)
* add test that the compiler isn't used

* one print_tree

* improve speed with st size cache

* switch to gpt-2
2024-01-05 17:24:01 -08:00
chenyu
520406cf3a
add Tensor.unflatten and Tensor.flatten(end_dim) (#3023)
simplified cases when splitting a dim, or merge dims in prefix
2024-01-05 17:55:29 -05:00
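
The shape effect of the two methods added in #3023 can be sketched as below, assuming they follow the PyTorch-style meaning of `flatten(start_dim, end_dim)` and `unflatten(dim, sizes)`. The commit message does not spell out the signatures, so the parameter names here are assumptions, and the helpers work on non-empty shape tuples only.

```python
from math import prod

def flatten_shape(shape, start_dim=0, end_dim=-1):
    """Merge axes start_dim..end_dim (inclusive) into a single axis."""
    n = len(shape)
    start, end = start_dim % n, end_dim % n  # allow negative indices
    return shape[:start] + (prod(shape[start:end + 1]),) + shape[end + 1:]

def unflatten_shape(shape, dim, sizes):
    """Split axis dim into the given sizes; their product must equal that axis."""
    dim %= len(shape)
    if prod(sizes) != shape[dim]:
        raise ValueError("sizes do not multiply to the axis length")
    return shape[:dim] + tuple(sizes) + shape[dim + 1:]

print(flatten_shape((2, 3, 4, 5), end_dim=1))   # merges the first two axes
print(unflatten_shape((6, 4, 5), 0, (2, 3)))    # splits them again
```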
George Hotz
f432ec9c33
Bitcast hip fix + fix mixtral (#3022)
* fix bitcast in hip

* wrong dtype for precast, double COPY
2024-01-05 14:51:25 -08:00
George Hotz
60abc62a3f
fast hip read (#3014)
* fast hip read

* hip read faster

* fix tests

* to_mv

* simplify

* bump to 6k lines
2024-01-05 10:33:13 -08:00
chenyu
4465ef28c5
add test_softmax to test_ops (#3020)
* add test_softmax to test_ops

somehow it was not tested

* too many buffers in softmax backward for WEBGPU
2024-01-05 11:19:49 -05:00
chenyu
f88506e630
move gpt2/llama sampling inside the model call (#3013)
* move gpt2/llama sampling inside the model call

* argmax uses one more kernel
2024-01-04 17:01:50 -05:00
Yixiang Gao
8a63f26a0f
make LR scheduler work with multigpu (#3011)
* add a failing test for LR scheduler when using multigpu

* fix calculation order and unnecessary tensor created for float

* min_lr is no longer tensor
2024-01-04 12:10:56 -08:00
chenyu
91665ef143 rewrite MUL CAST SUM to CAST MULACC 2024-01-04 13:12:22 -05:00
chenyu
ab7dfd637b use float for acc dtype for half tensor sum
we previously only upcast uint and int, and half was using half for acc.
change to acc in float for precision. but cast the result back to half to match torch/jax output dtype
2024-01-04 13:12:22 -05:00
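
The precision argument in this commit can be reproduced with the standard library alone. The sketch below rounds to IEEE 754 half precision through `struct`'s "e" format and compares a half accumulator with a wider one; Python's 64-bit float stands in for the float accumulator, which is an assumption of the sketch, not what the backend uses.

```python
import struct

def to_half(x):
    """Round a Python float to the nearest half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

def sum_half_acc(values):
    """Accumulate in half: every partial sum is rounded back to half."""
    acc = 0.0
    for v in values:
        acc = to_half(acc + to_half(v))
    return acc

def sum_float_acc(values):
    """Accumulate in a wider type, then cast only the final result to half."""
    acc = 0.0
    for v in values:
        acc += to_half(v)
    return to_half(acc)

ones = [1.0] * 4096
print(sum_half_acc(ones), sum_float_acc(ones))
```

With 4096 ones the half accumulator stalls at 2048, because 2049 is not representable in half and the tie rounds back to 2048, while the wider accumulator reaches 4096 and still casts cleanly to half.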
geohotstan
57817028bb
removed redundant dtype hacks in onnx_ops (#2939)
* updated most dtype hacks in onnx_ops

* temporarily revert dequantizelinear change

* I think this is right...

* MORE FIXES WOOOO NEW DTYPE IS AWESOME

* ok

* oops missed a print

* half -> float32 for CI

* is npdtype

* some more

* fix if ordering

* more clean ups

* final cleanups

* casting to half not allowed

* k nvm

* revert ArgMax change

* only GPU

* llvm begone

* teeny tiny change

* fix: attempt to add cast tests

* try this

* fix dequantizelinear

* revert some stuff

* tests pass pls

* less lines in onnx_tests

* oops missed string tensor tests

* clean up

* try: revert default behavior changes

* fix: disabled Cast and Castlike tests

* docs: small changes

* fix: fixed isNaN op and enabled associated tests

* fix: forgot about float16

* done

* update disabled test

* gah missed another float16

* disable rest of failing tests

* rm extra line

* try...

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-01-04 01:45:24 -05:00
chenyu
9f39165188
correct (dtype, device) in test_dtype.is_dtype_supported (#3007)
corrected dtypes for TORCH and float64 support
2024-01-04 00:25:37 -05:00
chenyu
ae112c9dbe
fix some long lines in tests (#3006)
* fix some long lines in tests

* better
2024-01-03 23:53:33 -05:00
George Hotz
9699c8c90b
don't alloc for InterpretedASTRunner (#2999) 2024-01-03 17:05:53 -08:00
chenyu
74cc6fd3c2
remove AndNode.__floordiv__ special case (#2996)
* remove AndNode.__floordiv__

AndNode produces a Node that min/max is bounded by [0, 1] so `//` on top of that is almost always 0.
we don't really use that either

* keep the test
2024-01-03 17:44:55 -05:00
Yixiang Gao
5663dd46b6 Merge branch 'master' of github.com:tinygrad/tinygrad into cifar_fp16 2024-01-03 10:11:46 -08:00
chenyu
81b97cd2c6
canonicalize device in LazyBuffer constructor (#2991)
fixed the multitensor +1 then sum bug
2024-01-03 12:55:25 -05:00
chenyu
db525cf8c2
multitensor failed test case with +1 then sum on DEVICE:0 (#2990) 2024-01-03 12:17:11 -05:00
George Hotz
5dbaaa7061 hotfix: make multitensor shard contiguous 2024-01-03 08:48:30 -08:00
Yixiang Gao
84eb6dd32a skip GPU cause opencl on intel can't compile half 2024-01-03 07:07:21 -08:00
Yixiang Gao
73879b50ad only need to check the min_lr for the nan bug 2024-01-03 07:00:50 -08:00
Yixiang Gao
99f8740c60 running half in CI CPU is slow 2024-01-02 18:44:35 -08:00