Gradtuity - Dylan Norquist

Gradtuity is a tensor autograd engine I built from scratch to understand what PyTorch is doing under the hood. It exposes a minimal PyTorch-like API (Tensor, Module, Linear, Conv2d, MaxPool2d, MLP, CNN), but everything below that surface is mine: GPU memory is managed through the raw CUDA runtime (ctypes bindings to libcudart.so), the forward and backward kernels are written in Triton, and the autograd engine walks the op graph in reverse topological order when backward() is called.
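
To make the reverse topological walk concrete, here is a minimal scalar sketch of the idea. It is not Gradtuity's actual code: the `Value` class, its `parents` and `backward_fn` fields, and the use of plain floats instead of GPU tensors are all assumptions chosen to keep the example small.

```python
class Value:
    """Hypothetical scalar stand-in for a tensor: data, gradient, graph links."""

    def __init__(self, data, parents=()):
        self.data = float(data)
        self.grad = 0.0
        self.parents = parents      # nodes this value was computed from
        self.backward_fn = None     # pushes self.grad into the parents

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))

        def backward_fn():
            # d(a + b)/da = 1 and d(a + b)/db = 1
            self.grad += out.grad
            other.grad += out.grad

        out.backward_fn = backward_fn
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))

        def backward_fn():
            # d(a * b)/da = b and d(a * b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out.backward_fn = backward_fn
        return out

    def backward(self):
        # Depth-first search yields a topological order (inputs first).
        order, seen = [], set()

        def visit(node):
            if id(node) in seen:
                return
            seen.add(id(node))
            for parent in node.parents:
                visit(parent)
            order.append(node)

        visit(self)
        self.grad = 1.0
        # Reversed, every node runs only after all of its consumers have
        # finished accumulating into its gradient.
        for node in reversed(order):
            if node.backward_fn is not None:
                node.backward_fn()


x = Value(2.0)
y = Value(3.0)
z = x * y + x    # dz/dx = y + 1, dz/dy = x
z.backward()
```

Here `x` feeds two ops, so its gradient is the sum of two contributions. The reverse topological order is what guarantees both have arrived before anything upstream of `x` would run.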

The hardest part wasn't the math or the kernels. It was the memory plumbing. PyTorch lets you forget that "a tensor on the GPU" is really a pointer into an opaque allocator with reference-counted lifetimes. Once you do it yourself with cudaMalloc / cudaFree and have to track who owns what across graph nodes, the elegance of requires_grad stops feeling like magic and starts feeling like engineering.
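
The ownership problem can be sketched as below. This is an illustration under assumptions, not Gradtuity's implementation: `DeviceBuffer`, `FakeAllocator`, and `CudartAllocator` are hypothetical names, and the manual `retain` / `release` scheme is just one way to track who owns what. `cudaMalloc` and `cudaFree` are the real CUDA runtime entry points. The fake allocator is there so the sketch runs without a GPU.

```python
import ctypes


class FakeAllocator:
    """Host-side stand-in for cudaMalloc/cudaFree (hypothetical, for testing)."""

    def __init__(self):
        self.live = {}            # pointer -> size in bytes
        self._next_ptr = 0x1000

    def malloc(self, nbytes):
        ptr = self._next_ptr
        self._next_ptr += max(nbytes, 1)
        self.live[ptr] = nbytes
        return ptr

    def free(self, ptr):
        del self.live[ptr]        # a KeyError here would mean a double free


class CudartAllocator:
    """Same interface over the CUDA runtime; needs libcudart.so and a GPU."""

    def __init__(self, path="libcudart.so"):
        self.lib = ctypes.CDLL(path)
        # cudaError_t cudaMalloc(void** devPtr, size_t size)
        self.lib.cudaMalloc.argtypes = [ctypes.POINTER(ctypes.c_void_p), ctypes.c_size_t]
        self.lib.cudaMalloc.restype = ctypes.c_int
        # cudaError_t cudaFree(void* devPtr)
        self.lib.cudaFree.argtypes = [ctypes.c_void_p]
        self.lib.cudaFree.restype = ctypes.c_int

    def malloc(self, nbytes):
        ptr = ctypes.c_void_p()
        err = self.lib.cudaMalloc(ctypes.byref(ptr), nbytes)
        if err != 0:              # 0 is cudaSuccess
            raise RuntimeError(f"cudaMalloc failed with error code {err}")
        return ptr.value

    def free(self, ptr):
        err = self.lib.cudaFree(ctypes.c_void_p(ptr))
        if err != 0:
            raise RuntimeError(f"cudaFree failed with error code {err}")


class DeviceBuffer:
    """Owns one allocation and frees it when the last holder releases it."""

    def __init__(self, allocator, nbytes):
        self.allocator = allocator
        self.ptr = allocator.malloc(nbytes)
        self.refcount = 1

    def retain(self):
        self.refcount += 1
        return self

    def release(self):
        self.refcount -= 1
        if self.refcount == 0:
            self.allocator.free(self.ptr)
            self.ptr = None


alloc = FakeAllocator()
buf = DeviceBuffer(alloc, 1024)
buf.retain()                      # a graph node saves the buffer for backward
buf.release()                     # the forward-pass owner drops its reference
live_after_forward = len(alloc.live)
buf.release()                     # the graph node is done, memory is freed
```

The point of the example is the middle step: the buffer must outlive the forward pass because a graph node still needs it for backward. That lifetime is what PyTorch normally tracks for you.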

flowchart TB
    subgraph api [Public API]
        Tensor
        Module
        functional
    end
    subgraph core [Core]
        tensor_impl[Tensor + autograd]
        nn_layers[Linear, Conv2d, etc.]
    end
    subgraph kernels [Triton Kernels]
        matmul
        elemwise
        conv_pool
        reduce_optim
    end
    subgraph cuda [GPU]
        cuda_mem[cuda_mem: malloc, copy]
    end
    api --> core
    core --> kernels
    kernels --> cuda

What's next: more optimizers (SGD only right now), better convolution kernels, and a benchmark suite that actually compares against PyTorch on equal hardware.