Commit graph

58 commits

Author SHA1 Message Date
Francis Lam
8cf0bb9351
optimizer: simplify GROUP and LOCAL to have one of each (#2162)
* optimizer: simplify GROUP and LOCAL to have one of each

Now that tensor cores only use LASTLOCAL, we can simplify to use
only that op everywhere.

The only use of GROUP is in matvec hand-coded opts and it doesn't
make a performance difference so switching to use only the top
behavior.

Also adds additional asserts to prevent tensor core dims from
being altered which causes bad kernels to be generated.

* search: remove duplicated actions
2023-10-27 11:37:44 -10:00
Francis Lam
bf3490cdf9
wmma: refactor tensor cores using existing local dims (#2097)
* wmma: refactor tensor cores using existing local dims

* optimizer: fix bad rebase and break after one late local

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2023-10-25 13:10:46 -04:00
chenyu
4444e6d4b3
stable diffusion < 324ms (#2129)
* stable diffusion < 324ms

* revert swap action

* fix tests due to more sum splitting

* REDUCEOP_SPLIT_THRESHOLD env var
2023-10-24 14:56:12 -04:00
George Hotz
6dc8eb5bfd
universal disk cache (#2130)
* caching infra for tinygrad

* nons tr key

* fix linter

* no shelve in beam search

* beam search caching

* check tensor cores with beam too

* pretty print

* LATEBEAM in stable diffusion
2023-10-22 10:56:57 -07:00
Francis Lam
ace6b2a151
optimizer: add test for correctness of opts (#2124)
* optimizer: add test for correctness of opts

Also added OptOps.UPCASTMID to constrain valid axes for opts with
group_for_reduce.

* llvm: fix LinearizerOptions to correctly not has_shared

* optimizer: remove premature test scaffold for TC opts

* search: fix the action space
2023-10-22 08:02:22 -07:00
David Hou
95e17ff0d4
fix wino mask upcast calculation (#2057)
* fix wino mask upcast calculation

* add tests for wino upcast hcopt

* add info to note

* real world wino hcopt test

* wino backward test

* whitespace
2023-10-18 16:54:48 -07:00
Szymon Ożóg
4bef1591f0
Disable ocelot cache + fix matvec in triton (#2010)
* Revert "disable flaky triton test"

This reverts commit 1e15fdaee7.

* Update test.yml

* check if has shared for matvec

* disable ocelot cache for triton

* disable ocelot cache

* disable ocelot cache

* pass shared to triton uops tests

* temporary debugs for CI crash

* Revert "temporary debugs for CI crash"

This reverts commit fee3ea96c8.

* Revert "triton isn't tested, and allows this refactor (#2007)"

This reverts commit dea8bb0938.

* add runtime_args to every renderer, move triton local size override to runtime args

* Add binary to args, correct type returned

* update to new loops

* Update test.yml
2023-10-17 10:33:32 -07:00
George Hotz
1bf4aef0f5
fix image dtype cmp (#2089)
* fix image dtype cmp

* print that with debug 3
2023-10-16 17:52:38 -07:00
George Hotz
5472a14544
openpilot compile2 (#1977)
* start compile2

* tweak

* why are there two more kernels?

* minor cleanups

* don't break onnx tests

* add __metadata__ support to safetensors

* no early realize in onnx

* cleanups

* bugfix

* clean up image type, add optimize

* opt to match old

* try that

* opt work

* run compile2

* optimizer

* prt more

* prerealize

* imp

* NOLOCALS works

* no locals means no locals

* support fractional globals

* all locals welcome

* int that

* cleanups

* show gemv regression

* clean up diff

* use idx for the cond

* nolocals

---------

Co-authored-by: Comma Device <device@comma.ai>
2023-10-15 20:39:46 -07:00
George Hotz
49bcfec383
0s in the action space (#2070)
* 0s in the action space

* simpler

* skip duplicate actions
2023-10-14 11:22:48 -07:00
George Hotz
4124cf1df5
cleanup tensor cores, expose exclude local upcast (#2064)
* expose exclude_local_upcast

* convert apply tensor cores to ops

* update comment

* put LOCAL back to what it was, BEAM is better than way
2023-10-14 09:21:03 -07:00
George Hotz
90c777d815
remove apply_auto_opt (#2063) 2023-10-13 07:44:14 -07:00
George Hotz
6f1810af2d
with unroll, the action space goes from 161 -> 127 (#2060)
* with unroll, the action space goes from 161 -> 127

* more reliable instrumentation

* beam search is so op

* beam bugfix
2023-10-12 20:52:23 -07:00
George Hotz
0c3b6f13a8
Latest opt (#2044)
* split out actions

* rl algorithm
2023-10-11 15:46:14 -07:00
George Hotz
41bfeb2c1e
start work on auto opt (#2034)
* start work on auto opt

* lin failure

* not beating hcopt

* greedy

* timing is fast

* codegen.search

* greedy search in handcode_opt

* track running gflops

* clean up those files

* no failure
2023-10-11 12:54:53 -07:00
George Hotz
f139060103
Rewrite hand coded opt with action space (#2030)
* tests passing

* hand coded opt with new abstractions

* simpler opts

* split out tensor cores
2023-10-10 07:38:38 -07:00
George Hotz
f1f64bc88d
remove val_vars from the linearizer (#2009)
* remove val_vars from the linearizer

* no need to store var vals
2023-10-07 07:47:28 -07:00
George Hotz
ffa33d743a
good changes from openpilot_compile2 (#2000)
* good changed from openpilot_compile2

* float32 image type was wrong

* cleaner way to write that + a test
2023-10-06 13:33:24 -07:00
George Hotz
7a68060422
Revert "allow local + grouped reduce in hand_coded (#1996)" (#1998)
This reverts commit 219a1f7063.
2023-10-06 06:43:28 -07:00
nimlgen
219a1f7063
allow local + grouped reduce in hand_coded (#1996)
* allow local + grouped reduce in hand_coded

* allowed loop size based on global_dims

* fix const

* fix const one more time

* better divisor

* a bit fix

* can take 2, why not

* fix linter

* better comments

* start with 2

* not always pick group reduce

* fix images

* better images

* better
2023-10-06 06:11:28 -07:00
George Hotz
1862e14a4f
fix gemv triggering for gemm (#1975) 2023-10-05 05:23:00 -07:00
Francis Lam
0ba75c4370
optimizer: add matvec optimizations (#1972)
* optimizer: add matvec optimizations

* renderer: fix alignment of shared memory in opencl
2023-10-04 14:16:27 -07:00
George Hotz
88b6ed6945 disable broken optim_conv2d 2023-10-04 07:33:50 -07:00
nimlgen
2ea1dd3e87
no process() in Linearizer (#1966)
* no process() in Linearizer

* more process() clean up
2023-10-04 07:18:42 -07:00
George Hotz
717451a244
Revert "optimizer: add matvec optimizations (#1753)" (#1959)
This reverts commit f520323054.
2023-10-03 00:28:42 -07:00
Francis Lam
f520323054
optimizer: add matvec optimizations (#1753)
* optimizer: add matvec optimizations

* Update optimizer.py

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2023-10-03 00:01:59 -07:00
George Hotz
90326dbdc3
resnet50 hand coded optimization (#1945)
* resnet50 hand coded opt

* hand optimize one kernel

* opt in both places to fix test
2023-09-29 09:34:51 -07:00
George Hotz
c907efbf4a
reorder a few things (#1915)
* reorder a few things

* huh, that has to be there

* move apply shapetracker

* BufferOps

* only for type checking
2023-09-25 10:17:21 +08:00
George Hotz
20059dc55b
Make ShapeTracker Immutable (#1909)
* ugh

* ops test pass

* fix shapetracker tests

* sym shapetracker

* shapetracker is a tuple of views now

* from_shape

* fix has variable shape

* key isn't needed

* post init assert
2023-09-24 21:09:03 +08:00
George Hotz
7ff7aacdb4
LazyOp out of Linearizer (#1908)
* loadop buffer on cpu

* works for GPU

* sort of working

* has bugs

* gpu tests pass

* fix some tests

* fix tensor cores

* fix test linearizer

* fix symbolic

* fix has_variable_shape

* non symbolic size

* disable weird test

* simple cache fix

* fix custom function

* fix kopt

* cleanups

* a bit broken on the assign

* contig check

* only buffer

* need that order

* idx

* dedup buffers

* hmm, bugfix

* fix tensor cores

* opts device
2023-09-24 14:30:53 +08:00
George Hotz
97dc813329
Revert "All LazyOps in the Linearizer (#1905)" (#1907)
This reverts commit a5820390db.
2023-09-24 11:51:22 +08:00
George Hotz
a5820390db
All LazyOps in the Linearizer (#1905)
* loadop buffer on cpu

* works for GPU

* sort of working

* has bugs

* gpu tests pass

* fix some tests

* fix tensor cores

* fix test linearizer

* fix symbolic

* fix has_variable_shape

* non symbolic size

* disable weird test

* simple cache fix

* fix custom function

* fix kopt

* cleanups

* a bit broken on the assign

* contig check

* only buffer

* need that order

* idx
2023-09-24 11:50:00 +08:00
Szymon Ożóg
58296c079d
Make Triton work again (#1547)
* Move ops_triton to runtime and remove errors from deprecated code

* Remove deprecated AST Kernel

* Remove deprecated buffer

* Add TritonProgram

* Triton Buffer

* Use RawCUDABuffer

* triton_compile

* Added new parameter

* pass _buf to program

* remove deprecated include

* Added triton tests

* Deprecated includes removed

* remove double print

* Disable float4 support

* Disable float4 support

* variable load fix

* Track local size

* Add pycuda to triton dependencies

* Merge test.yml

* install cuda packages for testing

* merge double package install

* remove emulated from triton tests

* upscale local index to power of 2 and add masking

* cuda envs

* Add TernaryOps

* ConstOp loading

* proper function name

* remove deprecated variables

* get global program from name

* const ops match local shape

* Enable test_nn

* remove deprecated import

* fix linter error

* Add wait logic

* Add local size override

* accumulate local shapes instead of using max shape

* Merge triton tests into global tests

* fix envs in testing

* Old testing routine

* split file into renderer and program

* remove print and starting whitespace

* pretty ptx print on debug 5

* linter errors

* ignore triton saturation tests

* ignore test example

* remove pytorch cpu extra index

* Add triton to existing testing routine

* use triton tests

* disable cuda backend in triton tests

* use cudacpu in tests

* print used device

* Print device default

* Remove print

* ensure we are running triton backend

* update variable signatures

* update dtypes for load

* infinity render fixed

* limit global size

* negative infinity now properly rendered

* split chain with parentheses for and node

* Add option to disable shared memory, disable for triton

* missing import

* Properly index and mask conditional load

* use mask only if not loading a block pointer

* nan support

* fix symbolic tests to include chain split

* proper masking for stores

* Implemented bool dtype

* Add mod

* fix loads for variables with valid range

* merge triton with cuda runtime

* merge from master

* run triton tests with cuda

* Correct target when running from triton

* conftest with triton compiler config

* use triton nightly

* verbose tests for triton

* capture stdout

* fix function depth when exiting multiple loops

* add render valid function for readabilty

* fix mask for local loops

* add _arg_int32 datatype

* fix dims for conditional loads

* enable non float stores

* correct variable dtypes

* fix type for arg_int32

* remove junk

* Added get max function for range based var.max

* remove deprecated code

* Fix triton ptxas path

* Fix testing for CI

* clamp local size by max local size instead of always running max

* Disable matmul test in triton cpu

* rerun tests

* Disable broken test in triton cpu

* whitespace removed

* rerun tests again

* Disable TestSymbolicOps for triton

* update to new uops

* linter fix

* ignore test/extra

* linting fix

* Update tinygrad/renderer/triton.py

Co-authored-by: Gijs Koning <gijs-koning@live.nl>

* remove deprecated line

* quotes type fix

* linter

* Remove unnecesary lines

* UnaryOps.NEG

* dont define constants

* Linting fix

* Disable tests that are broken in ocelot

* remove trailing whitespace

* reduce line count

* linting fix

* update to new uast

* New looping style

* Update to new uast

* make AST runner work with triton

* linting fix

* set renderer var for testing

* disable local for ocelot

* reenable all tests for ocelot

* Pass shared to cuda

* Don't group if the backend doesn't support shared mem

* use working gpuocelot branch

* enable all tests

* enable local for ocelot

* cleanup

* Update test.yml

* update cache key

* reenable test symbolic and extra

* Update test.yml

* Revert "Update test.yml" (rerun tests)

This reverts commit 98c0630ee5.

* Revert "fix symbolic tests to include chain split"

This reverts commit 22a9a4c9cd.

* Revert "split chain with parentheses for and node"

This reverts commit 7499a7004e.

* use global size from linearizer

* rename newvar to dtype to match other renderers

* join program start lines

* simplify code that adds axis to local dims

* assign r[u] in ssa

* We no longer need to replace target in src

* we no longer need to cast indices to int by hand

* Update triton.py(rerun tests)

* Update triton.py(rerun tests)

* Update triton.py(rerun tests)

---------

Co-authored-by: Gijs Koning <gijs-koning@live.nl>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2023-09-23 14:17:12 +08:00
Gijs Koning
9eb6310686
Fix gpt optimization (#1885)
* fix for gpt

* the actual fix

* Remove change in symbolic

* small comment
2023-09-21 10:28:18 +08:00
chenyu
3ec301c2d7
apply view.py patch (#1844) 2023-09-10 17:32:15 -07:00
Roelof van Dijk
1bc52c60df
fix: minor tweaks to view (#1842)
Co-authored-by: Roelof van Dijk <roelof.van.dijk@vitestro.com>
2023-09-10 15:55:57 -07:00
nimlgen
31fca43706
kopt works with local+grouped reduce and tests (#1824) 2023-09-09 13:22:09 -07:00
George Hotz
6100d7425f
add 2 to locals, uops debug 5 (#1782) 2023-09-05 19:44:43 -07:00
Adrian Kretz
3473c9e88d
Metal conv tensor cores (#1696)
* Benchmark 5x5 conv kernel which is optimized

* Use Metal tensor cores in 2d convs
2023-09-04 15:14:46 -07:00
David Hou
3151d91f6e
3x3 winograd convs (#1675)
* winograd

* simplify local groups code

* comment

* respects self.opts.has_local

* always simplify ones

* make mypy happy

* move reshape, WINO flag

* wino flag, simple forward backward test for wino

* extra wino test

* merge oops

* comments

* axis_needs_valid -> axis_is_masked

* don't delete needs_valid (it's unused though)

* make linter happy

* make linter happy

* smaller test

* change number

* make wino tests very small
2023-09-03 07:29:43 -07:00
nimlgen
a96e54d8bb
search for grouped reduces (#1732) 2023-09-01 14:21:10 -07:00
George Hotz
5c403d43b9
New >3 indexing (#1729)
* move reindexing into linearizer

* get_grouped_dims

* don't limit for clang
2023-08-31 21:24:15 -07:00
George Hotz
c18a497dde
minor global dim cleanup (#1724) 2023-08-31 12:23:39 -07:00
Roelof van Dijk
62536d6000
perf: use enumerate where possible (#1692)
Co-authored-by: Roelof van Dijk <roelof.van.dijk@vitestro.com>
2023-08-30 10:41:51 -07:00
George Hotz
fdd7f282cb
Reenable tensor cores for self-hosted Mac CI (#1717)
* debug 5 matmul

* allow tensor cores in CI

* tensor cores on arm64

* put debug back
2023-08-30 07:53:04 -07:00
George Hotz
d37d092c14
split linearizer into 3 files (#1654) 2023-08-23 14:58:47 -07:00
George Hotz
643cbdfd50
make embedding and GPT-2 fast (#1631)
* make embedding fast

* jit more, variable shape support

* print mem bw
2023-08-22 15:14:38 -07:00
George Hotz
696e4d20a1 fix KOPT=2 with variable shape 2023-08-22 11:34:34 -07:00
Roelof van Dijk
109100656f
refactor: no len if it is not needed (#1598)
Co-authored-by: Roelof van Dijk <roelof.van.dijk@vitestro.com>
2023-08-21 14:06:32 -07:00
George Hotz
739f327d2d
Shorter (#1582)
* deleting lines

* remove insert dims

* if statement is never hit

* bug fixes
2023-08-20 08:12:16 -07:00