Job details

Lead Site Reliability Engineer

Join us as we work to create a thriving ecosystem that delivers accessible, high-quality, and sustainable healthcare for all.

About Our Company:

At athenahealth, we deliver high quality and affordable healthcare solutions and drive growth across industries. Our success is powered by our talented team and the strategic leadership of our corporate managers. We foster a collaborative and dynamic environment where employees are encouraged to innovate, grow, and excel in their careers. As part of our team, you’ll be empowered to make a significant impact, lead strategic initiatives, and drive business results.

Position Overview:

We are looking for a Lead Site Reliability Engineer to join our Cloud Engineering division. Cloud Engineering ensures the continuous availability of the technologies and systems that are the foundation of athenahealth’s services. We are directly responsible for thousands of servers and petabytes of storage, and we handle thousands of web requests per second, all while sustaining growth at a meteoric rate. We enable an operating system for the medical office that abstracts away administrative complexity, leaving doctors free to practice medicine.

But enough about us; let’s talk about you!

You’re a seasoned engineer with a passion for identifying and resolving reliability and scalability challenges. You are a curious team player, someone who loves to explore, learn, and make things better. You are excited to uncover inefficiencies in business processes, creative in finding ways to automate solutions, and relentless in your pursuit of greatness. You’re a nimble learner capable of quickly absorbing complex solutions and an excellent communicator who can help evangelize engineering excellence.

The Team:

We are a bunch of Site Reliability Engineers who are passionate about reliability, automation, and scalability. We use an agile-based framework to execute our work, ensuring we are always focused on the most important and impactful needs of the business. We support systems in both private and public cloud and make data-driven decisions about which one best suits the needs of the business. We are relentless in automating away manual, repetitive work so we can focus on projects that help move the business forward.

Job Responsibilities:

Reliability and Availability:

Define, measure, and maintain Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for cloud services and infrastructure components.
Lead efforts to continuously improve system availability, fault tolerance, and disaster recovery capabilities.
Ensure proactive incident detection, efficient root cause analysis, and timely resolution of production incidents.
Participate in a 12x7 on-call rotation. We have a peer team in India that manages the overnight on-call.

Automation and Infrastructure as Code (IaC):

Drive automation efforts to reduce manual intervention and streamline cloud infrastructure management.
Implement Infrastructure as Code (IaC) using tools like Terraform, AWS CloudFormation, and Ansible to provision, manage, and scale cloud resources.
Automate deployment, scaling, and monitoring processes to improve efficiency and reduce operational complexity.

Monitoring, Observability, and Performance Tuning:

Design and implement monitoring, logging, and alerting solutions to track cloud infrastructure health, performance, and security.
Use observability tools (e.g., Prometheus, Grafana, CloudWatch) to ensure continuous visibility into cloud infrastructure performance and capacity.
Identify bottlenecks and performance issues, proposing and implementing improvements to ensure optimal resource usage.

Security and Compliance:

Ensure that cloud infrastructure is built with security best practices in mind and meets all relevant compliance and regulatory requirements.
Collaborate with security teams to implement security controls and risk mitigation strategies across cloud environments.
Regularly audit and review cloud infrastructure for security vulnerabilities and compliance gaps.

Collaboration and Cross-Functional Leadership:

Work closely with development, DevOps, and operations teams to ensure cloud infrastructure aligns with application and business requirements.
Lead and mentor a team of Site Reliability Engineers, promoting best practices and fostering a culture of operational excellence.
Act as a key technical point of contact for cloud-related infrastructure and operations issues.

Incident Management and Post-Mortem:

Lead the incident response efforts for cloud infrastructure-related issues, ensuring that all incidents are managed effectively.
Conduct post-incident reviews (PIRs) to identify root causes and implement preventive measures.
Continuously refine incident management processes to reduce downtime and enhance recovery times.

Required Skills and Qualifications:

10 years of hands-on experience with cloud automation and configuration management tools (e.g., Terraform, AWS CloudFormation, Ansible, Puppet) in a hybrid cloud setup.
7+ years of experience in a Site Reliability Engineering (SRE), Infrastructure Engineering, or DevOps role, with at least 3 years in a technical leadership capacity.
Deep knowledge of cloud services and technologies (e.g., EC2, S3, Lambda, Kubernetes).
Proficiency in scripting or programming languages (Python, Go, Bash, etc.).
Experience with monitoring, logging, and observability tools (e.g., Prometheus, Grafana, Datadog, ELK stack).
Familiarity with Continuous Integration/Continuous Deployment (CI/CD) pipelines and cloud-native development practices.
Strong expertise in managing cloud infrastructure (AWS, Google Cloud, Azure) in production environments.
Experience with cloud-native architectures, microservices, and containerized environments (Kubernetes, Docker).
Proven experience in building and managing highly available, scalable, and fault-tolerant systems in the cloud.
Strong understanding of cloud networking, storage, compute services, on-premises infrastructure, and security best practices.
Strong knowledge of Linux administration and internals.
Effective communication skills, with the ability to translate technical concepts to non-technical stakeholders.

Preferred Qualifications:

Bachelor’s degree in Computer Science, Engineering, or a related field
Knowledge of database systems such as MySQL, Oracle or PostgreSQL
Experience with managing on-prem infrastructure at scale
Certifications in AWS, Red Hat, or other relevant technologies are a plus
Experience running containerized workloads (Kubernetes, Docker) in production

Expected Compensation

$119,000 - $203,000

The base salary range shown reflects the full range for this role from minimum to maximum. At athenahealth, base pay depends on multiple factors, including job-related experience, relevant knowledge and skills, how your qualifications compare to others in similar roles, and geographical market rates. Base pay is only one part of our competitive Total Rewards package. Depending on role eligibility, we offer both short- and long-term incentives by way of an annual discretionary bonus plan, variable compensation plan, and equity plans.

About athenahealth

Our vision: In an industry that becomes more complex by the day, we stand for simplicity. We offer IT solutions and expert services that eliminate the daily hurdles preventing healthcare providers from focusing entirely on their patients — powered by our vision to create a thriving ecosystem that delivers accessible, high-quality, and sustainable healthcare for all.

Our company culture: Our talented employees (or athenistas, as we call ourselves) spark the innovation and passion needed to accomplish our vision. We are a diverse group of dreamers and do-ers with unique knowledge, expertise, backgrounds, and perspectives. We unite as mission-driven problem-solvers with a deep desire to achieve our vision and make our time here count. Our award-winning culture is built around shared values of inclusiveness, accountability, and support.

Our DEI commitment: Our vision of accessible, high-quality, and sustainable healthcare for all requires addressing the inequities that stand in the way. That's one reason we prioritize diversity, equity, and inclusion in every aspect of our business, from attracting and sustaining a diverse workforce to maintaining an inclusive environment for athenistas, our partners, customers and the communities where we work and serve.

What we can do for you:

Along with health and financial benefits, athenistas enjoy perks specific to each location, including commuter support, employee assistance programs, tuition assistance, employee resource groups, and collaborative workspaces; some offices even welcome dogs.

We also encourage a better work-life balance for athenistas with our flexibility. While we know in-office collaboration is critical to our vision, we recognize that not all work needs to be done within an office environment, full-time. With consistent communication and digital collaboration tools, athenahealth enables employees to find a balance that feels fulfilling and productive for each individual situation.

In addition to our traditional benefits and perks, we sponsor events throughout the year, including book clubs, external speakers, and hackathons. We provide athenistas with a company culture based on learning, the support of an engaged team, and an inclusive environment where all employees are valued.

Learn more about our culture and benefits here: athenahealth.com/careers

https://www.athenahealth.com/careers/equal-opportunity



