Location: In-person only, San Francisco (no remote)
We're early-stage and moving fast. We research and build autonomous systems that make ML models run as fast as physically possible: systems that learn to program GPUs, profile and evaluate performance, generate and fuse kernels, and push hardware to its limits.
What we're looking for:
GPU Fundamentals: Deep understanding of GPU architectures, CUDA programming, and parallel computing patterns.
Deep Learning Frameworks: Proficiency in PyTorch, TensorFlow, or JAX, particularly for GPU-accelerated workloads.
LLM/AI Knowledge: Strong grounding in large language models (training, fine-tuning, prompting, evaluation).
Systems Engineering: Skilled in C++ and Python for building tooling around CUDA; Rust or Go is a plus.
ML for Code: Familiarity with program synthesis, reinforcement learning for code generation, and automated testing and verification of generated code.