Browse 56 AI Evaluation jobs hiring now. Companies hiring include Cover Whale, LanguageWire, and EQL Tech, with roles in Greensboro, Huntsville, and Ontario.
Lead and build the agentic AI platform that enables pods of engineers and AI agents to safely and reliably deliver production software at scale.
LanguageWire is hiring an AI Engineer to design and productionize LLM-based translation workflows and bridge ML experimentation with production engineering.
Work on a mission-driven fintech team to build and ship core AI products (LLM/VLM and evaluation pipelines) that power eligibility and compliance for education savings accounts.
Lead and grow an Applied AI engineering team at Mercor to build scalable evaluation and data systems that measurably improve frontier model performance.
Lead the product vision and engineering for clinician-facing AI tools at knownwell, building and operating RAG-based clinical decision support with full product ownership and direct clinician partnership.
Experienced technical product leader needed to own prioritization, quality, and stakeholder alignment for LLM-driven products while staying hands-on with architecture, code reviews, and AI cost optimization.
Help build and deploy production AI agent platforms that power personalized financial advisory workflows for institutional clients at Arta.
Lead Slack's search and AI platform as VP Product to set strategy, drive model and infrastructure decisions, and deliver reliable, scalable AI-powered search and knowledge services for enterprise users.
NiCE is hiring a Forward Deployed Engineer to design, ship, and operate production-scale conversational AI agents that solve high-impact enterprise problems.
Experienced domain experts in Business Operations & Communications or Education and Academic Research are needed for a remote, retainer-based 2‑week role evaluating and crafting prompts for AI writing models with US-contextual standards.
Join an early-stage AI safety startup as a founding Forward Deployed Engineer to design rigorous AI evals, lead customer implementations, and shape product strategy for certification of real-world AI agents.
Epoch AI is hiring remote Researchers and Senior Researchers to conduct data-driven investigations, build benchmarks, and forecast AI capabilities and trends.
Visa is hiring a Product Analyst to define and scale generative AI platform capabilities, combining product analytics, prototyping, and cross-functional collaboration to deliver responsible, enterprise-grade AI solutions.
Colibri Group is hiring an AI Engineering Intern to help design and evaluate AI-driven educational tools, focusing on model behavior, alignment, and responsible AI practices under senior mentorship.
Unstructured is hiring an AI Engineer to architect and ship production-grade RAG and agentic systems that process messy multimodal data for high-impact government and military contracts.
Contract opportunity to evaluate and improve LLM conversational responses in Hindi and English by performing fact-checking, annotation, and qualitative assessment.
Lead the design and production of LLM-driven coaching systems at Valence, applying deep ML and engineering expertise to build enterprise-grade, context-aware AI experiences.
LinkedIn seeks a Hybrid Machine Learning Engineer to build and deploy scalable relevance and evaluation models for recommender systems and generative/NLP-driven product features.
A selective, eight-week (mostly virtual) unpaid bootcamp at ServiceNow for undergraduate students to learn agentic AI, build and evaluate agents, and present a capstone project during an in-person finale.
AIR is hiring a Technical Assistance Consultant to develop and deliver workforce-focused TA, training, and capacity-building to advance economic mobility, workforce development, and future-of-work strategies including AI integration.
Lead the strategic integration of AI across ServiceNow marketing by owning the MarTech and agentic product portfolio to drive adoption, efficiency, and measurable business impact.
Senior engineering leader to design, evaluate and productionize agentic AI systems, prompt architectures and multi-agent orchestration for critical banking workflows at Deutsche Bank in Cary, NC.
Generative AI Analyst at Welocalize to craft prompts, annotate and evaluate LLM outputs, and lead labeling workflows in a remote full-time role.
Lead the design and implementation of secure, scalable Generative AI and ML architectures for an EdTech organization focused on building production-ready RAG, retrieval, and MLOps solutions.
Build the internal tooling and evaluation infrastructure that empowers engineers and researchers to iterate quickly and reliably on Crosby’s LLM-powered legal platform.
Neighbors Bank is looking for a decisive, process-improvement focused Recruiting Coordinator to manage hiring pipelines, conduct candidate evaluations, and help evolve recruiting practices in a fully remote role.
Handshake is hiring an ML Research Scientist to drive open scientific research, create public benchmarks, and collaborate with top AI labs to advance data and evaluation methods for frontier models.
Lead the design and evaluation of agentic LLM systems that power a fintech's financial intelligence platform, ensuring correctness, scalability, and production reliability.
SweetRush is hiring an Instructional Designer/eLearning Developer to create and deliver IT-focused learning solutions (AI, cybersecurity, workplace apps) for a global enterprise in a remote, Eastern Time–preferred contract role.
Experienced software engineers with strong system-design and ML/LLM experience are needed to build and productionize LLM-powered agents, evaluation pipelines, and scalable AI infrastructure at Permute.
Fullscript is looking for a Staff Machine Learning Engineer to architect and ship production LLM-driven clinical features that improve clinician workflows and patient outcomes.
Khan Academy is hiring a Senior AI Engineer (24-month fixed-term) to lead integration, evaluation, and quality improvements of generative AI features that support learning at scale.
Handshake seeks experienced 3D Slicer users to remotely evaluate AI-generated medical imaging content and provide expert feedback on segmentation, DICOM workflows, and clinical research relevance.
Handshake seeks experienced Shotcut users to evaluate AI-generated video edits and create tool-focused assessment materials on a flexible, remote, hourly contract basis.
Lead the AI product portfolio for marketing to turn enterprise AI strategy into a cohesive MarTech roadmap, measurable productivity gains, and durable automation at scale.
Lead the AI MarTech product portfolio at ServiceNow to convert AI strategy into scalable agentic workflows, measurable productivity gains, and sustained marketing leverage.
Work on TRM’s AI Engineering team to design and ship agentic LLM systems and scalable infrastructure that augment investigations and ensure safe, auditable behavior in high-sensitivity environments.
aiEDU is hiring a Senior Lead, Research & Evaluation to design and run impact measurement, lead research strategy, and build data systems that inform program decisions across the organization.
Varick seeks an AI Engineer to architect and ship production-grade agent systems, evaluation pipelines, and retrieval-driven context strategies for enterprise AI deployments.
Lead the design, production deployment, and continual improvement of AI-powered features for Savvas's flagship K-12 platform, applying deep LLM, cloud, and software engineering expertise to improve student learning at scale.
Rwazi is hiring a Decision Intelligence Analyst to validate and improve AI-driven decision outputs by identifying failure modes, formalizing evaluation rubrics, and refining judgment frameworks.
Lead the AI product portfolio for marketing at ServiceNow, defining and delivering a unified MarTech and agentic roadmap that drives measurable productivity and enterprise-scale adoption.
Lead architecture and delivery of scalable, secure AI and agentic systems at PointClickCare to drive measurable clinical and operational outcomes across the platform.
Contract reviewers are needed to compare AI-generated English text pairs, choose the clearer response, and provide concise explanations to help improve model output quality.
Virtue AI is seeking a hands-on Testing Engineer to lead product and backend QA, automate system testing, and perform model red-teaming for a cutting-edge AI security platform.
Lead architecture and delivery of enterprise-scale LLMs, agent orchestration, and retrieval systems to build safe, scalable AI workflows for IFS Nexus Black.
TRM Labs is hiring a Senior AI Research Engineer to drive model evaluation, fine-tuning, and production orchestration for large-scale LLM and ML systems that power blockchain intelligence.
Handshake AI seeks Physics PhDs to perform flexible, hourly contract work evaluating AI-generated physics content for scientific accuracy and physical reasoning.
Handshake seeks Math PhDs for flexible, remote hourly contracts to design domain-relevant math questions and evaluate AI-generated mathematical reasoning and proofs.
Handshake seeks doctoral-level biology experts to review and critique AI-generated biological content on a flexible, remote, hourly contract basis.
Below 50k*: 2 | 50k-100k*: 1 | Over 100k*: 10