Site Reliability Engineer (SRE)

About Us

Tala Health was built to transform a healthcare system that remains slow, expensive and inefficient. Today’s patients often face a fragmented system that requires juggling doctor visits, lab work, referrals and long wait times just to reach a diagnosis. With Tala Health, patients will receive a new kind of care experience that brings AI agents and clinicians together from the start to deliver accurate, personalized care faster. We are building AI agents to support the full arc of the patient journey.

The Opportunity: Site Reliability Engineer

Patients count on our platform 24/7. You’ll build and maintain the tooling, alerts and incident‑response playbooks that keep latency low, data safe and uptime high as we scale from thousands to millions of sessions.

What You’ll Do

Define and uphold SLOs/SLAs for critical services, and run capacity planning and chaos drills.
Build observability pipelines—metrics, logs, traces—so issues surface before users notice.
Automate infra provisioning and config with Terraform, Helm and Kubernetes Operators.
Lead on‑call rotations, drive root‑cause analyses, and document post‑mortems for continuous learning.
Optimize cost and performance of microservices, databases and real‑time model endpoints.
Manage backups, disaster‑recovery plans, secrets and certificate lifecycles in line with HIPAA.

What You Bring

5+ years of SRE/DevOps experience running production workloads on AWS, GCP or Azure.
Mastery of Kubernetes, service meshes, load balancers and CDN tuning.
Hands‑on experience with monitoring & logging stacks (Prometheus, Grafana, OpenTelemetry, ELK).
Strong scripting or programming skills in Python, Go or Bash.
Familiarity with relational and NoSQL databases, caches and message brokers.
Security‑first mindset and experience in regulated or high‑stakes environments.

Ready to build the future of healthcare? Let’s get in touch.