Personal Page
https://www.youtube.com/watch?v=0WjfKQdfeMU
https://www.exampro.co/nca-aiio
https://habr.com/ru/articles/996610/ От MNIST к Transformer. Hello CUDA.
https://habr.com/ru/articles/1001886/ От MNIST к Transformer. Часть 2. Основы работы с памятью
Параллельные и Распределенные Вычисления 1. История паралеллизма. Внутренний параллелизм Лекторий ФПМИ https://www.youtube.com/watch?v=lIht6WZJL1s&list=PL4_hYwCyhAvbhhT3F80vt1FdB2TqklRsM&index=1
https://habr.com/ru/articles/994956/
https://habr.com/ru/articles/995170/
https://github.com/infatoshi/cuda-course
https://www.youtube.com/watch?v=86FAWCzIe_4
https://modal.com/gpu-glossary
https://jax-ml.github.io/scaling-book/
https://jamesakl.com/posts/cuda-ontology/
https://news.ycombinator.com/item?id=45947437
https://huggingface.co/spaces/nanotron/ultrascale-playbook?section=high-level_overview
https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf
https://docs.nvidia.com/cuda/pdf/CUDA_Runtime_API.pdf
https://www.bitsand.cloud/posts/profiling-gpus
https://leanpub.com/pythongpuprogramming Practical GPU Programming
https://damek.github.io/random/basic-facts-about-gpus/
https://enccs.github.io/gpu-programming/2-gpu-ecosystem/
https://habr.com/ru/companies/ruvds/articles/921024/ TPU from Google
https://stem-c.com/product/programming-massively-parallel-processors-a-hands-on-approach
https://talkpython.fm/episodes/show/509/gpu-programming-in-pure-python
https://habr.com/ru/companies/timeweb/articles/909122/
https://news.ycombinator.com/item?id=44216123
https://thechipletter.substack.com/p/demystifying-gpu-compute-architectures
https://habr.com/ru/companies/selectel/articles/912108/
https://blog.codingconfessions.com/p/gpu-computing
https://www.youtube.com/watch?v=h9Z4oGN89MU
https://medium.com/@muhammedashraf2661/cuda-programming-an-introduction-to-gpu-architecture-dfd8dfffa13f
https://blog.codingconfessions.com/p/gpu-computing
https://github.com/srush/GPU-Puzzles
https://www.pyspur.dev/blog/introduction_cuda_programming
https://news.ycombinator.com/item?id=43121059
https://www.i-programmer.info/news/91-hardware/17763-demystifying-gpu-terminology.html
https://www.youtube.com/watch?v=v_q2JTIqE20
https://github.com/cuda-mode/lectures
https://arxiv.org/abs/2503.20481
https://alexarmbr.github.io/2024/08/10/How-To-Write-A-Fast-Matrix-Multiplication-From-Scratch-With-Tensor-Cores.html
https://thechipletter.substack.com/p/demystifying-gpu-compute-architectures
https://news.ycombinator.com/item?id=39279163
https://medium.com/@penberg/demystifying-gpus-for-cpu-centric-programmers-e24934a620f1
https://arxiv.org/pdf/2103.13937.pdf CUDA Tutorial
https://www.youtube.com/watch?v=smlDvXVbyTo
https://www.youtube.com/playlist?list=PL4_hYwCyhAvbhhT3F80vt1FdB2TqklRsM RU
https://jax-ml.github.io/scaling-book/gpus/
https://news.ycombinator.com/item?id=44943666
Great news — your CUDA knowledge transfers well conceptually. Here’s the lay of the land:
Apple Silicon GPUs (M1/M2/M3/M4 in Mac Studio) use Metal as the low-level GPU framework. There’s no CUDA on Apple silicon — NVIDIA and Apple parted ways — so the stack looks like this:
| CUDA Concept | Apple Equivalent |
|---|---|
| CUDA C++ kernels | Metal Shading Language (MSL) — C++14-based |
__global__ kernels |
kernel functions in MSL |
| cuBLAS / cuDNN | Metal Performance Shaders (MPS) |
| CUDA streams | Metal command queues / command buffers |
| Unified memory (newer CUDA) | Unified memory is default on Apple Silicon |
| nvcc | Xcode + Metal compiler (xcrun -sdk macosx metal) |
| Nsight | Xcode GPU Frame Capture / Instruments |
Write compute kernels in Metal Shading Language — it looks very much like CUDA C++:
// CUDA: __global__ void add(float* a, float* b, float* c, int n)
kernel void add(device float* a [[buffer(0)]],
device float* b [[buffer(1)]],
device float* c [[buffer(2)]],
uint idx [[thread_position_in_grid]]) {
c[idx] = a[idx] + b[idx];
}
Host-side dispatch uses Objective-C or Swift via the Metal API (MTLDevice, MTLCommandQueue, etc.). You can also use C++ with Metal-cpp (Apple provides a header-only C++ wrapper).
Pre-built GPU primitives for ML/compute — analogous to cuBLAS+cuDNN. Great for matrix ops, convolutions, etc. without writing kernels manually.
Apple’s own NumPy/JAX-like framework for Apple Silicon, open-sourced in late 2023. If you’re doing ML or numerical work, this is the fastest path:
import mlx.core as mx
a = mx.array([1.0, 2.0, 3.0])
b = a * 2 # runs on GPU lazily
mx.eval(b)
It has automatic differentiation and feels like PyTorch/JAX.
PyTorch supports Apple GPU via the MPS backend — easiest migration if you already have CUDA PyTorch code:
device = torch.device("mps")
tensor = torch.randn(1000, 1000).to(device)
Most ops work; some less common ops fall back to CPU.
Deprecated by Apple in macOS 10.14. Don’t bother.
simdgroup)cudaMemcpy needed; CPU and GPU share the same physical memory__shared__ exactly — use threadgroup memory instead (same idea, different keyword)thread_position_in_grid = global thread indexIf you’re doing ML/data work → start with PyTorch MPS or MLX
If you want custom kernels → learn Metal compute shaders (your CUDA knowledge maps almost directly)
If you want the best performance → Metal + MPS together
The Apple Metal documentation and the MLX GitHub are both excellent starting points.