Gradtuity
Gradtuity is a tensor autograd engine I built from scratch to
understand what PyTorch is doing under the hood. It exposes a minimal
PyTorch-like API (Tensor, Module, Linear, Conv2d, MaxPool2d, MLP, CNN),
but everything below that surface is mine: GPU memory via raw CUDA
(ctypes + libcudart.so), forward and backward kernels written in
Triton, and an autograd engine that walks the op graph in reverse
topological order on backward().
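
To make "reverse topological order" concrete, here is a minimal sketch on plain Python floats rather than GPU tensors. The Node class and its fields are hypothetical names chosen for illustration and are not Gradtuity's actual Tensor implementation; the sketch only shows the traversal idea: build a topological order with a depth-first walk, then run each node's backward function from the output back toward the inputs.

```python
class Node:
    """Hypothetical graph node: a float value, its gradient, and its parents."""

    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        self.parents = parents      # nodes this one was computed from
        self.backward_fn = None     # pushes self.grad into the parents

    def __add__(self, other):
        out = Node(self.value + other.value, (self, other))

        def backward_fn():
            # d(a + b)/da = 1 and d(a + b)/db = 1
            self.grad += out.grad
            other.grad += out.grad

        out.backward_fn = backward_fn
        return out

    def __mul__(self, other):
        out = Node(self.value * other.value, (self, other))

        def backward_fn():
            # d(a * b)/da = b and d(a * b)/db = a
            self.grad += other.value * out.grad
            other.grad += self.value * out.grad

        out.backward_fn = backward_fn
        return out

    def backward(self):
        # Depth-first post-order gives a topological order (inputs first).
        order, seen = [], set()

        def visit(node):
            if id(node) in seen:
                return
            seen.add(id(node))
            for parent in node.parents:
                visit(parent)
            order.append(node)

        visit(self)
        self.grad = 1.0  # seed: d(output)/d(output)
        # Reversed order guarantees a node's grad is complete before it is used.
        for node in reversed(order):
            if node.backward_fn is not None:
                node.backward_fn()


x = Node(3.0)
y = Node(4.0)
z = x * y + x   # dz/dx = y + 1, dz/dy = x
z.backward()
```

Because x feeds into z along two paths, its gradient is accumulated with += rather than assigned. The recursive walk is fine for small graphs; a deep graph would need an iterative traversal to avoid Python's recursion limit.
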

The hardest part wasn't the math or the kernels. It was the memory
plumbing. PyTorch lets you forget that "a tensor on the GPU" is really
a pointer into an opaque allocator with reference-counted lifetimes.
Once you do it yourself with cudaMalloc / cudaFree and have to track
who owns what across graph nodes, the elegance of requires_grad stops
feeling like magic and starts feeling like engineering.
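
The ownership problem can be sketched roughly as follows. This is an illustration under assumptions, not Gradtuity's actual memory code: CudaRuntime, FakeRuntime, and DeviceBuffer are hypothetical names, and manual reference counting is just one plausible way to track who owns what. The only real CUDA calls used are cudaMalloc and cudaFree from libcudart.so, loaded through ctypes. FakeRuntime stands in for the GPU so the lifetime logic can be exercised on a machine without CUDA.

```python
import ctypes


class CudaRuntime:
    """Hypothetical thin wrapper over cudaMalloc / cudaFree via ctypes."""

    def __init__(self, path="libcudart.so"):
        self.lib = ctypes.CDLL(path)  # raises OSError if the library is missing
        self.lib.cudaMalloc.argtypes = [ctypes.POINTER(ctypes.c_void_p),
                                        ctypes.c_size_t]
        self.lib.cudaMalloc.restype = ctypes.c_int
        self.lib.cudaFree.argtypes = [ctypes.c_void_p]
        self.lib.cudaFree.restype = ctypes.c_int

    def malloc(self, nbytes):
        ptr = ctypes.c_void_p()
        err = self.lib.cudaMalloc(ctypes.byref(ptr), nbytes)
        if err != 0:  # 0 is cudaSuccess
            raise RuntimeError(f"cudaMalloc failed with error code {err}")
        return ptr.value

    def free(self, ptr):
        err = self.lib.cudaFree(ctypes.c_void_p(ptr))
        if err != 0:
            raise RuntimeError(f"cudaFree failed with error code {err}")


class FakeRuntime:
    """CPU-only stand-in (an assumption for this sketch) that records live pointers."""

    def __init__(self):
        self.live = {}
        self._next = 0x1000

    def malloc(self, nbytes):
        ptr = self._next
        self._next += max(nbytes, 1)
        self.live[ptr] = nbytes
        return ptr

    def free(self, ptr):
        del self.live[ptr]  # KeyError here would signal a double free


class DeviceBuffer:
    """Device allocation freed when the last holder releases it."""

    def __init__(self, runtime, nbytes):
        self.runtime = runtime
        self.nbytes = nbytes
        self.ptr = runtime.malloc(nbytes)
        self.refcount = 1

    def retain(self):
        if self.ptr is None:
            raise RuntimeError("buffer already freed")
        self.refcount += 1
        return self

    def release(self):
        if self.ptr is None:
            raise RuntimeError("release on a freed buffer")
        self.refcount -= 1
        if self.refcount == 0:
            self.runtime.free(self.ptr)
            self.ptr = None


rt = FakeRuntime()                # swap in CudaRuntime() on a machine with CUDA
buf = DeviceBuffer(rt, 1024)      # e.g. storage created by a forward op
held = buf.retain()               # e.g. a graph node keeping it for backward
buf.release()                     # the forward-side holder lets go
alive_after_first_release = buf.ptr in rt.live
held.release()                    # last holder gone, memory is freed
```

The point of the sketch is the failure modes it makes explicit: releasing too early hands a backward kernel a dangling pointer, and never releasing leaks device memory, which is exactly the bookkeeping PyTorch's allocator hides.
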
flowchart TB
subgraph api [Public API]
Tensor
Module
functional
end
subgraph core [Core]
tensor_impl[Tensor + autograd]
nn_layers[Linear, Conv2d, etc.]
end
subgraph kernels [Triton Kernels]
matmul
elemwise
conv_pool
reduce_optim
end
subgraph cuda [GPU]
cuda_mem[cuda_mem: malloc, copy]
end
api --> core
core --> kernels
kernels --> cuda

What's next: more optimizers (SGD only right now), better convolution kernels, and a benchmark suite that actually compares against PyTorch on equal hardware.