ai& is a new global AI technology company dedicated to meeting the world's growing demand for AI. Our vision is twofold: to serve as a premier AI lab specializing in localization, and to act as a global infrastructure and compute provider. We are building a unified, optimized global platform that integrates next-generation data centers and infrastructure, heterogeneous compute serving, and advanced model services. We believe that the most effective way to build and scale AI is to own the stack from top to bottom.
At ai&, we empower small teams with the autonomy needed to tackle significant challenges. Our approach is to deconstruct large problems into manageable components and solve complex issues collaboratively. We seek highly motivated, mission-driven individuals who demonstrate strong personal agency. We value curiosity as the foundation of talent, and we are looking for people eager to develop alongside our evolving technology and expanding business.
We are actively hiring worldwide, with a presence in Tokyo, San Francisco, Austin, and Toronto. We are more than happy to meet exceptional talent where they are.
Role overview
As a Kernel Optimization Engineer, your objective is to extract maximum performance from heterogeneous GPU hardware. This means going below the framework layer to write, profile, and tune the custom CUDA and ROCm/HIP kernels that sit at the heart of our inference and training stack. You will work across NVIDIA and AMD silicon, understanding the deep architectural differences between the two and writing code that is optimal for each.
This is not a role about deploying existing kernels; it is about authoring them. You will identify bottlenecks in the execution loop, such as memory bandwidth saturation, warp divergence, occupancy limits, and cache thrashing, and build solutions from first principles. You will work closely with our inference and serving team to ensure that the kernels you build translate into real-world performance gains, but your domain is the kernel layer and everything below it.
The scope spans attention mechanisms, quantization primitives, custom activation functions, fused operators, and the communication kernels that tie multi-GPU systems together. The ideal candidate has a hardware-first intuition: they think in warps, tiles, and memory hierarchies before they think in frameworks. They are equally comfortable reading PTX and roofline charts. And they are never done optimizing.
Responsibilities
Custom Kernel Development: Design and implement high-performance kernels for core AI primitives, including GEMM, attention, normalization, and convolution. Own the full cycle from profiling to production deployment across LLM inference, training, and generative model workloads.
Cross-Vendor Hardware Optimization: Develop deep expertise across NVIDIA and AMD GPU architectures. Understand the micro-architectural differences, including memory subsystems, scheduler behavior, and cache hierarchies, and write kernels that are genuinely optimal for each target. Optimize across heterogeneous compute units, including SIMD units, matrix engines, and DMA engines.
Attention & Linear Algebra Primitives: Build and tune fused attention kernels (Flash Attention variants, MLA, paged attention), GEMM primitives, and quantized compute paths (INT8, FP8, AWQ, GPTQ) that push the hardware to its limits.
Precision & Numerical Stability: Prototype and evaluate precision formats such as FP16, BF16, and FP8 (including E5M2), along with techniques such as stochastic rounding. Understand the accuracy and performance trade-offs at a deep level and make principled decisions about where each format belongs.
Profiling & Bottleneck Analysis: Use Nsight Compute, rocprof, Perfetto, VTune, and custom instrumentation to identify and eliminate performance bottlenecks. Translate profiling data into concrete architectural improvements.
Operator Fusion: Identify opportunities to fuse multi-step operations into single kernel launches, reducing memory round-trips and kernel launch overhead across the inference and training execution graphs.
Communication Kernel Optimization: Optimize collective communication primitives (AllReduce, AllGather, ReduceScatter) for multi-GPU and multi-node topologies, working closely with the infrastructure team.
Compiler & Runtime Integration: Collaborate with compiler and runtime teams to integrate kernels into Triton, PyTorch, or SYCL pipelines. Contribute to micro-architecture feedback loops, helping co-design ISA and memory features with the hardware team where relevant.
Cross-Team Collaboration: Work closely with the inference and serving team to ensure kernel-level performance translates into system-level gains. Share profiling insights, align on optimization priorities, and contribute to architectural decisions across teams.
Technical Leadership: Maintain a high degree of personal agency. Write production code, review kernel implementations, and contribute to architectural decisions in a flat, fast-moving team environment.
You may be a fit if you have the following skills:
Deep Kernel Authorship: You have written production CUDA or ROCm kernels from scratch. You understand warp execution, shared memory bank conflicts, occupancy, and instruction-level parallelism at an intuitive level. Strong proficiency in C++11 or later, CUDA, and Triton, and ideally LLVM/MLIR.
Hardware Architecture Knowledge: Strong familiarity with NVIDIA Hopper/Ampere and AMD CDNA architectures. You know the differences between HBM bandwidth profiles, cache sizes, and execution units, and you write code that reflects that knowledge. Deep understanding of memory layout, vectorization, thread and block scheduling, and cache behavior.
Precision & Numerical Fluency: Solid grasp of numerical stability, mixed-precision arithmetic, and modern precision formats. Experience making principled trade-offs between precision and performance in production systems.
Profiling Fluency: Comfortable with Nsight Compute, rocprof, Perfetto, VTune, and roofline modeling. You do not guess where the bottleneck is; you measure it.
Parallel Programming Breadth: Strong background across parallel programming models, including CUDA, Triton, SYCL, OpenCL, or OpenMP. Experience optimizing irregular algorithms such as sparse linear algebra or graph computations.
Systems Thinking: Ability to reason about how individual kernels compose into larger execution graphs, and how kernel-level decisions propagate up through the inference or training stack.
Great Team Spirit: A mission-driven approach to engineering, valuing clear communication, hands-on execution, and collective success over individual silos.