ai& is a new global AI technology company dedicated to meeting the world's growing demand for AI. Our vision is twofold: to serve as a premier AI lab specializing in localization, and to act as a global infrastructure and compute provider. We are building a unified, optimized global platform that integrates next-generation data centers and infrastructure, heterogeneous compute serving, and advanced model services. We believe that the most effective way to build and scale AI is to own the stack from top to bottom.
At ai&, we empower small teams with the autonomy needed to tackle significant challenges. Our approach is to deconstruct large problems into manageable components and solve complex issues collaboratively. We seek highly motivated, mission-driven individuals who demonstrate strong personal agency. We value curiosity as the foundation of talent, and we are looking for people eager to develop alongside our evolving technology and expanding business.
We are actively hiring worldwide, with a presence in Tokyo, San Francisco, Austin, and Toronto. We are more than happy to meet exceptional talent where they are.
Role overview
As a Kernel Optimization Engineer, your objective is to extract maximum performance from heterogeneous GPU hardware. This means going below the framework layer to write, profile, and tune the custom CUDA and ROCm/HIP kernels that sit at the heart of our inference and training stack. You will work across NVIDIA and AMD silicon, understanding the deep architectural differences between the two and writing code that is optimal for each.
This is not a role about deploying existing kernels; it is about authoring them. You will identify bottlenecks in the execution loop, such as memory bandwidth saturation, warp divergence, occupancy limits, and cache thrashing, and build solutions from first principles. You will work closely with our inference and serving team to ensure that the kernels you build translate into real-world performance gains, but your domain is the kernel layer and everything below it.
The scope spans attention mechanisms, quantization primitives, custom activation functions, fused operators, and the communication kernels that tie multi-GPU systems together. The ideal candidate has a hardware-first intuition: they think in warps, tiles, and memory hierarchies before they think in frameworks. They are equally comfortable reading PTX and roofline charts. And they are never done optimizing.
Responsibilities
Custom Kernel Development: Design and implement high-performance kernels for core AI primitives, including GEMM, attention, normalization, and convolution. Own the full cycle from profiling to production deployment across LLM inference, training, and generative model workloads.
Cross-Vendor Hardware Optimization: Develop deep expertise across NVIDIA and AMD GPU architectures. Understand the micro-architectural differences, including memory subsystems, scheduler behavior, and cache hierarchies, and write kernels that are genuinely optimal for each target. Optimize across heterogeneous compute units, including SIMD units, matrix engines, and DMA engines.
Attention & Linear Algebra Primitives: Build and tune fused attention kernels (Flash Attention variants, MLA, paged attention), GEMM primitives, and quantized compute paths (INT8, FP8, AWQ, GPTQ) that push the hardware to its limits.
Precision & Numerical Stability: Prototype and evaluate precision formats such as FP16, BF16, and FP8 (including E5M2), along with techniques such as stochastic rounding. Understand the accuracy and performance trade-offs at a deep level and make principled decisions about where each format belongs.
Profiling & Bottleneck Analysis: Use Nsight Compute, rocprof, Perfetto, VTune, and custom instrumentation to identify and eliminate performance bottlenecks. Translate profiling data into concrete architectural improvements.
Operator Fusion: Identify opportunities to fuse multi-step operations into single kernel launches, reducing memory round-trips and kernel launch overhead across the inference and training execution graphs.
Communication Kernel Optimization: Optimize collective communication primitives (AllReduce, AllGather, ReduceScatter) for multi-GPU and multi-node topologies, working closely with the infrastructure team.
Compiler & Runtime Integration: Collaborate with compiler and runtime teams to integrate kernels into Triton, PyTorch, or SYCL pipelines. Contribute to micro-architecture feedback loops, helping co-design ISA and memory features with the hardware team where relevant.
Cross-Team Collaboration: Work closely with the inference and serving team to ensure kernel-level performance translates into system-level gains. Share profiling insights, align on optimization priorities, and contribute to architectural decisions across teams.
Technical Leadership: Maintain a high degree of personal agency. Write production code, review kernel implementations, and contribute to architectural decisions in a flat, fast-moving team environment.
You may be a fit if you have the following skills:
Deep Kernel Authorship: You have written production CUDA or ROCm kernels from scratch. You understand warp execution, shared memory bank conflicts, occupancy, and instruction-level parallelism at an intuitive level. Strong proficiency in C++11 or later, CUDA, and Triton, and ideally LLVM/MLIR.
Hardware Architecture Knowledge: Strong familiarity with NVIDIA Hopper/Ampere and AMD CDNA architectures. You know the differences between HBM bandwidth profiles, cache sizes, and execution units, and you write code that reflects that knowledge. Deep understanding of memory layout, vectorization, thread and block scheduling, and cache behavior.
Precision & Numerical Fluency: Solid grasp of numerical stability, mixed-precision arithmetic, and modern precision formats. Experience making principled trade-offs between precision and performance in production systems.
Profiling Fluency: Comfortable with Nsight Compute, rocprof, Perfetto, VTune, and roofline modeling. You do not guess where the bottleneck is; you measure it.
Parallel Programming Breadth: Strong background across parallel programming models, including CUDA, Triton, SYCL, OpenCL, or OpenMP. Experience optimizing irregular algorithms such as sparse linear algebra or graph computations.
Systems Thinking: Ability to reason about how individual kernels compose into larger execution graphs, and how kernel-level decisions propagate up through the inference or training stack.
Great Team Spirit: A mission-driven approach to engineering, valuing clear communication, hands-on execution, and collective success over individual silos.