Browse 35 exciting jobs hiring in Llm Evaluation now. Check out companies hiring such as Spotify, LanguageWire, EQL Tech in Providence, Houston, Chattanooga.
Lead the architecture and productionization of Spotify’s shared Agent Engine to power scalable, reliable agent-based experiences across the platform.
LanguageWire is hiring an AI Engineer to design and productionize LLM-based translation workflows and bridge ML experimentation with production engineering.
Work on a mission-driven fintech team to build and ship core AI products (LLM/VLM and evaluation pipelines) that power eligibility and compliance for education savings accounts.
Lead the product vision and engineering for clinician-facing AI tools at knownwell, building and operating RAG-based clinical decision support with full product ownership and direct clinician partnership.
Experienced technical product leader needed to own prioritization, quality, and stakeholder alignment for LLM-driven products while staying hands-on with architecture, code reviews, and AI cost optimization.
Help build and deploy production AI agent platforms that power personalized financial advisory workflows for institutional clients at Arta.
Welo Data is building a flexible, remote contributor network of native English speakers to annotate, evaluate, and create prompts that improve AI systems.
Lead Slack's search and AI platform as VP Product to set strategy, drive model and infrastructure decisions, and deliver reliable, scalable AI-powered search and knowledge services for enterprise users.
NiCE is hiring a Forward Deployed Engineer to design, ship, and operate production-scale conversational AI agents that solve high-impact enterprise problems.
Contract opportunity to evaluate and improve LLM conversational responses in Hindi and English by performing fact-checking, annotation, and qualitative assessment.
Lead the design and production of LLM-driven coaching systems at Valence, applying deep ML and engineering expertise to build enterprise-grade, context-aware AI experiences.
Senior engineering leader to design, evaluate and productionize agentic AI systems, prompt architectures and multi-agent orchestration for critical banking workflows at Deutsche Bank in Cary, NC.
Crosby AI is hiring a Data Scientist to develop NLP/LLM models, evaluation frameworks, and data strategies that power its AI-driven legal platform.
Generative AI Analyst at Welocalize to craft prompts, annotate and evaluate LLM outputs, and lead labeling workflows in a remote full-time role.
Lead the design and implementation of secure, scalable Generative AI and ML architectures for an EdTech organization focused on building production-ready RAG, retrieval, and MLOps solutions.
Build the internal tooling and evaluation infrastructure that empowers engineers and researchers to iterate quickly and reliably on Crosby’s LLM-powered legal platform.
Handshake is hiring an ML Research Scientist to drive open scientific research, create public benchmarks, and collaborate with top AI labs to advance data and evaluation methods for frontier models.
Finny, an AI-first fintech in Chelsea, NYC, is hiring a Lead QA Engineer to build and scale automated testing, CI/CD integration, and LLM evaluation across our Python backend and TypeScript frontend.
Lead the design and evaluation of agentic LLM systems that power a fintech's financial intelligence platform, ensuring correctness, scalability, and production reliability.
Experienced software engineers with strong system-design and ML/LLM experience are needed to build and productionize LLM-powered agents, evaluation pipelines, and scalable AI infrastructure at Permute.
Fullscript is looking for a Staff Machine Learning Engineer to architect and ship production LLM-driven clinical features that improve clinician workflows and patient outcomes.
Khan Academy is hiring a Senior AI Engineer (24-month fixed-term) to lead integration, evaluation, and quality improvements of generative AI features that support learning at scale.
Lead the AI product portfolio for marketing to turn enterprise AI strategy into a cohesive MarTech roadmap, measurable productivity gains, and durable automation at scale.
Lead the AI MarTech product portfolio at ServiceNow to convert AI strategy into scalable agentic workflows, measurable productivity gains, and sustained marketing leverage.
Work on TRM’s AI Engineering team to design and ship agentic LLM systems and scalable infrastructure that augment investigations and ensure safe, auditable behavior in high-sensitivity environments.
Varick seeks an AI Engineer to architect and ship production-grade agent systems, evaluation pipelines, and retrieval-driven context strategies for enterprise AI deployments.
Lead the design, production deployment, and continual improvement of AI-powered features for Savvas's flagship K-12 platform, applying deep LLM, cloud, and software engineering expertise to improve student learning at scale.
Virtue AI is seeking a hands-on Testing Engineer to lead product and backend QA, automate system testing, and perform model red-teaming for a cutting-edge AI security platform.
Help shape Baseten's model ecosystem by combining hands-on engineering, developer education, and product thinking to improve model discovery, evaluation, and adoption.
Lead the engineering of agentic ML systems and AI-native developer tooling at NVIDIA's Cosmos team to accelerate model development through agents, pipelines, and scalable evaluation.
Lead architecture and delivery of enterprise-scale LLMs, agent orchestration, and retrieval systems to build safe, scalable AI workflows for IFS Nexus Black.
TRM Labs is hiring a Senior AI Research Engineer to drive model evaluation, fine-tuning, and production orchestration for large-scale LLM and ML systems that power blockchain intelligence.
Work as a bilingual English–Mandarin evaluator to assess, fact-check, and annotate LLM responses for a remote contract role covering Taiwan, Malaysia, and the USA.
Project Lion seeks a US-based Prompt Engineer to drive template-to-autorater migrations, optimize prompts using APG/APO tooling, and validate autorater quality versus human baselines.
A US-based Senior Prompt Engineer (part-time) to design, optimize, and validate prompts and autorater workflows that ensure high-quality, structured LLM outputs across complex template architectures.
Below 50k*
1
|
50k-100k*
2
|
Over 100k*
1
|