Customer Engineer

ABOUT BASETEN

Baseten powers mission-critical inference for the world's most dynamic AI companies, like Cursor, Notion, OpenEvidence, Abridge, Clay, Gamma, and Writer. By uniting applied AI research, flexible infrastructure, and seamless developer tooling, we enable companies operating at the frontier of AI to bring cutting-edge models into production. We're growing quickly and recently raised our $300M Series E, backed by investors including BOND, IVP, Spark Capital, Greylock, and Conviction. Join us and help build the platform engineers rely on to ship AI products.

THE ROLE

We're building a Customer Engineering team to own the post-sales technical relationship with our most strategic and enterprise customers. As a Sr. Customer Engineer, you'll be the technical front door for accounts running production ML workloads on Baseten — the person customers trust to keep their models healthy, their incidents short, and their roadmap heard.

This role blends deep infrastructure debugging, AI/ML performance expertise, incident command, and proactive account ownership. You'll triage and resolve issues across Kubernetes, GPUs, networking, and model serving, lead war rooms during P0 escalations, and translate recurring pain points into product improvements. You're not just reactive — you'll monitor customer health, drive QBRs, set up proactive alerting, and identify expansion opportunities before customers have to ask.

You'll partner closely with Solutions Architecture, SRE, Infra, Product, and Forward Deployed Engineering, but you own the customer outcome end-to-end: from first response to root-cause analysis to the follow-up that reinforces trust.

RESPONSIBILITIES

Technical Support & Debugging

Serve as the first responder to all post-sales customer issues via ticketing (Pylon) and Slack, triaging and resolving Tier 1 and Tier 2 issues independently.
Diagnose runtime issues related to latency, memory behavior, GPU utilization, concurrency, and model lifecycle management.
Debug infrastructure problems across Kubernetes (pods, controllers), networking, observability, and alerting systems.
Pull logs, read error traces, and correlate signals across Grafana, Loki, and Prometheus to pinpoint root causes — even when the real issue is buried layers deep.

Incident Response & Escalation

Lead incident response during outages and escalations, coordinating across Product, SRE, Sales, and Engineering.
Own customer communication through resolution — even when the fix is handed off to SRE or Infra — including delivering root-cause analyses after every P0/P1.
Escalate to SRE or other engineering teams with structured context (customer, affected models, what you've already ruled out, specific ask) so nothing gets lost in translation.
Drive post-incident alerting reviews: why did the customer find this before we did, and what instrumentation or process change prevents it next time?

Proactive Account Ownership

Serve as the technical owner for top enterprise accounts with strict SLAs and high responsiveness expectations.
Set up and maintain proactive monitoring and alerts for all customer production models within 24 hours of handoff from the SA (Solutions Architect).
Drive the QBR process and proactive reengagement for expansion opportunities.
Track recurring failure patterns across accounts and push for durable fixes — not just incident closure.
Monitor internal feedback channels and route product-level issues to the right teams.

Cross-Functional Collaboration

Own the SA-to-CE handoff for new customers: validate architecture, confirm production-readiness milestones, and establish escalation paths.
Maintain and improve runbooks, knowledge bases, and diagnostic best practices so the team scales with the customer base.
Translate user feedback into roadmap signals, documentation improvements, and product enhancements.
Coordinate end-to-end on projects spanning feature requests, new deployments, and operational debugging — scoping, execution, communication, and stakeholder alignment.

REQUIREMENTS

Deep Kubernetes troubleshooting expertise, including resource debugging, pod/runtime analysis, and log-based diagnostics with observability tooling (Grafana, Loki, Prometheus).
Strong infrastructure debugging across container orchestration, networking, and service dependencies, with hands-on production cluster experience.
Experience managing high-severity incidents with major customers — SLAs, war rooms, post-incident reviews, and clear executive-level communication throughout.
Proven project management skills with an ownership mindset: you can run multiple complex, multi-stakeholder initiatives in parallel without dropping threads.
Ability to translate recurring technical pain points into roadmap-level insights and product improvements.
Strong communication skills and executive presence during high-visibility situations, ensuring both technical clarity and customer confidence.
3+ years of experience in a fast-paced, high-growth, or customer-facing engineering environment.

NICE TO HAVE

Familiarity with high-performance AI model serving, including troubleshooting ML pipelines from preprocessing through inference.
Experience with ticketing and incident-response platforms such as Pylon or Zendesk.
Hands-on experience with Helm, Flux, CI/CD tooling, or scripting automations for deployment and operational workflows.
Background in SRE, DevOps, or forward-deployed engineering roles at an infrastructure company.

BENEFITS

Competitive compensation, including meaningful equity.
100% coverage of medical, dental, and vision insurance for employees and dependents.
Generous PTO policy, including a company-wide Winter Break (our offices are closed from Christmas Eve to New Year's Day!).
Paid parental leave.
Company-facilitated 401(k).
Exposure to a variety of ML startups, offering unparalleled learning and networking opportunities.

Apply now to embark on a rewarding journey in shaping the future of AI! If you are a motivated individual with a passion for machine learning and a desire to be part of a collaborative and forward-thinking team, we would love to hear from you.

At Baseten, we are committed to fostering a diverse and inclusive workplace. We provide equal employment opportunities to all employees and applicants without regard to race, color, religion, gender, sexual orientation, gender identity or expression, national origin, age, genetic information, disability, or veteran status.

We are an Equal Opportunity Employer and will consider qualified applicants with criminal histories in a manner consistent with applicable law (for example, the requirements of the San Francisco Fair Chance Ordinance, where applicable).

Average salary estimate: $195,000 per year (est.), range $160,000 to $230,000

If an employer mentions a salary or salary range on their job, we display it as an "Employer Estimate". If a job has no salary data, Rise displays an estimate if available.
