tinygrad/extra
Vyacheslav Pachkov 4c33192a8b
add qcom runtime (#5213)
* qcom: driver init

* autogen stubs for msm_kgsl also fixup ioctls to show numbers instead of _IOW macros

* autogen: add adreno commands and registers

* ops_qcom: QcomAllocator + signals

* fix EDEADLK in hwqueue, init timestamps, use opencl compiler for qcom

* qcom: we do not really need all these constants input/output is enough

* qcom: perfctr for CS (do not really need all the rest)

* qcom: HALFREGFOOTPRINT and FULLREGFOOTPRINT are set to be around max

* qcom: explicitly set instruction len based on the shader size

* ops_qcom: Program init

extracts shader from open cl binary
sets input/output buffers
allocates stack
sets cs mode
runs shader

* use data64_le from helpers

* ops_qcom: use fill_kernargs for filling i/o buffers

* ops_qcom: add QcomCopyQueue just for api & set kernargs_args_offset

* new signals & fix exec

* add QCOM to the list of supported devices

* correct QcomComputeQueue._wait using CP_WAIT_REG_MEM

* fix exec, synchronize before copyout

* correct setting num_units for ST_SHADER

* fix gpu hangs on sigs with CP_MEM_WRITE, it is uncached mem anyway

* extract offsets to kernel arguments from opencl binary

* extract constants values and offsets from opencl binary

* handle KGSL_MEMFLAGS_USE_CPU_MAP correctly

* align kernel name to 4 bytes when skipping kernel opencl struct

* skip to consts directly using an offset from opencl binary header

* fix alloc

* get halfreg and fullreg from opencl bin

* set unmultipled global sizes as kernel group in HLSQ_CS_NDRANGE

* parse prg offset from open cl binary

* save loc with HLSQ_CS_CNTL. set this with HLSQ_CONTROL_2_REG

* support for vals in _fill_kernargs

* support 16-bit constants

* use KGSL_CONTEXT_NO_FAULT_TOLERANCE for contexts

this helps to not fall down when executing big kernels

    /* Don't time out if the context has disabled it */
    if (drawobj->context->flags & KGSL_CONTEXT_NO_FAULT_TOLERANCE)
        return;

* minor changes of _exec

* QCOMRenderer

* disable HCQGraph for demo. TOOD: support HCQ update api

* support HCQ

- remove copy queue
- add updates
- add strides for buffs and vars for QCOM

* bufs_stride

* clean ups

* linter

* call super().__init__(value) in QcomSignal

* disable=unused-import

* mypy

* type ignore when queue is on the device

* fix

* query gpu_id.
Will be useful for selecting commands e.g. CP_EVENT_WRITE vs
CP_EVENT_WRITE7

* working timestamps

* free context after device is done

* move gpu stack to the device

* reserve some space with lib_gpu for gpu to write to

this fixes test_interpolate_bilinear

* exclude tests that fails with GPU=1 on qualcomm

* lint

* unmap mem in _gpu_free

* ctxt priority and preemtion policy

* remove old qcom

* pass size to self.device.allocator.free

* skip tests only on qcom

* use kgsl and adreno defines instead of numeric vals

* use allocator for allocating lib_gpu

* update to QcomArgsState from master

* intermediate commit while conquering images

* enable image tests on qcom

* fix shader disasm size, dump textures stuff

* working images

* allow signals to be 0

* set branchstack from OpenCL binary

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* set shared memory size from OpenCL binary

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* update images in QcomArgsState & less loc for images

* set stack sizes from OpenCL binary

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* stack allocation based on OpenCL binary

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* better autogen for kgsl and adreno. no more bitshifts

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* cleanup commit for parse cl lib

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* dont forget actual generated files

* refactor + less loc

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* device.py back

* lint

* ruff

* timestamp divisor

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* fix tex fmt & round global size

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* dtypes

* 19.2MHz

* -1 loc in _update_exec

* remove noqa

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
2024-09-02 19:35:47 +03:00
..
accel move things, clean up extra (#2292) 2023-11-13 20:18:40 -08:00
assembly lowerer is kernel [run_process_replay] (#5437) 2024-07-12 18:50:55 -07:00
backends move GraphException to jit.py (#5744) 2024-07-26 19:01:12 -04:00
datasets init REDUCE_AXIS with BinaryOps (#6256) 2024-08-24 11:28:41 +03:00
disassemblers/adreno move disassemblers and openpilot (#4592) 2024-05-14 19:30:02 -07:00
gemm extra/gemm/triton_nv_matmul: fix Program arguments (#6212) 2024-08-20 14:05:38 -07:00
hip_gpu_driver feat: autogen from kernel register offset headers (#6056) 2024-08-12 14:08:35 -07:00
hiprtc use comgr to compile (#3248) 2024-01-26 18:27:49 -08:00
junk coder.py can write and run code (#2439) 2023-11-25 12:27:54 -08:00
mockgpu autogen cleanup (#6064) 2024-08-14 20:20:35 +03:00
models sdxl batched inference fixes (#6293) 2024-08-28 07:44:58 -04:00
nv_gpu_driver update pylint path to check indent/space for all (#6022) 2024-08-10 14:41:09 -04:00
optimization update CI tests in extra with UOp AST (#6290) 2024-08-28 22:26:50 +03:00
qcom_gpu_driver add qcom runtime (#5213) 2024-09-02 19:35:47 +03:00
archprobe.py move dtypes to dtype.py (#2964) 2024-01-01 14:58:48 -08:00
augment.py [ready] Replacing os with pathlib (#1708) 2023-08-30 10:41:08 -07:00
disk_read_speed.py io_uring for copies from disk (#5035) 2024-06-21 11:36:51 +03:00
dump_cache.py wow how did i think that was okay (#2339) 2023-11-16 21:21:11 -08:00
export_model.py all realize 2 (#4527) 2024-05-10 22:43:09 -07:00
f16_w_uint32.py fix various examples (#4691) 2024-05-22 20:43:21 -04:00
gradcheck.py Fix: Jacobian tests [WIP] (#1126) 2023-07-05 15:36:22 -07:00
hip_events.py move autogen to runtime/autogen (#3254) 2024-01-26 12:44:19 -08:00
introspection.py bring buffer back to device (#4517) 2024-05-10 11:22:31 -07:00
lr_scheduler.py use at least float32 for optim.lr (#4297) 2024-04-25 14:42:28 -04:00
mcts_search.py AST is UOp (#6030) 2024-08-16 22:09:00 +03:00
multitensor.py multitensor start (#2676) 2023-12-07 17:07:05 -08:00
onnx.py fix some type error in onnx [run_process_replay] (#6153) 2024-08-17 19:54:20 -04:00
onnx_ops.py Tensor.prod (#6250) 2024-08-23 10:06:32 -04:00
ops.py init REDUCE_AXIS with BinaryOps (#6256) 2024-08-24 11:28:41 +03:00
ring_copy.py ring copy example (#3185) 2024-01-19 23:34:30 -05:00
thneed.py new style device (#2530) 2023-11-30 17:07:16 -08:00
threefry.py feat: example and extra tweaks (#6310) 2024-08-28 19:26:11 -07:00
to_movement_ops.py update CI tests in extra with UOp AST (#6290) 2024-08-28 22:26:50 +03:00
training.py tinytqdm.set_description and tinytrange (#5101) 2024-06-22 14:45:06 -04:00
transfer_speed.py hotfix: copy size is in bytes 2024-01-17 16:44:15 +00:00