Senior DevOps & Infrastructure Engineer

Our client, a US-based AI company, is building infrastructure that powers the next generation of AI. Their platform helps companies create high-quality post-training data and run reinforcement fine-tuning workflows at scale. At the heart of their product are fast, isolated, and massively parallel developer sandboxes that let AI teams test and improve their models.

Joining their team as a Senior DevOps & Infrastructure Engineer means owning critical cloud and containerized infrastructure and optimizing performance, reliability, and scalability for cutting-edge AI workloads.

The company is trusted by foundation labs, Fortune 500s, and fast-growing startups.

You will be part of a high-caliber team: former founders, published ML researchers, Olympiad medalists, and engineers who have built products with real adoption. They run lean, move fast, and hold an extremely high bar.

 

The Role

The company runs a platform plus SDK/dev tools for creating RL environments and post-training data and for running reinforcement fine-tuning at scale. A key part of that experience is their infra and developer sandboxes: fast, reliable, observable, Dockerized compute environments with massive parallelization.

They are looking for an infrastructure owner who is obsessed with performance and reliability – someone who treats shaving seconds off sandbox lifecycle and runtime performance as a sport.

You will own DevOps, infrastructure, and architecture decisions as the company reaches its next order of magnitude in scale.

 

What you’ll work on:

Developer sandbox infrastructure

  • Own the AWS + EKS-based sandbox platform that runs Dockerized workloads for customers and internal teams.
  • Optimize sandbox lifecycle end-to-end: provisioning, scheduling, image pulls, startup, execution, teardown, and caching.
  • Design for massive parallelism while maintaining reliability, fairness, and predictable performance.

 

Kubernetes + AWS excellence

  • Evolve the cluster architecture: node groups, autoscaling strategies, spot/on-demand mixes, scheduling policies, and workload isolation.
  • Build safe-by-default patterns: quotas, resource limits, network policies, pod security, secrets management, and guardrails.
  • Improve cluster resiliency and operational ergonomics (upgrades, rollouts, disaster recovery, fail-safes).

 

Cross-stack DevOps ownership

  • Address infrastructure bottlenecks as the company scales.
  • Improve developer experience for internal teams: safer deploys, better CI/CD, smoother local/dev workflows, faster iteration.
  • Provide architectural input and raise the infra maturity of the team via docs, patterns, and coaching.
  • Interface with their backend/workers (Railway), frontend (Vercel/Next.js), and data (Supabase/Postgres) to ensure the whole system is cohesive.

 

Performance engineering and ruthless measurement

  • Establish “infra product metrics” and instrument everything: P50/P95/P99 sandbox startup times, queue times, job success rates, noisy-neighbor rates, image pull latencies, cluster saturation, and cost-per-run.
  • Build benchmarking harnesses for sandboxes and workloads to track regressions and validate improvements.
  • Treat efficiency as a first-class metric: optimize utilization without sacrificing latency or reliability.

 

Observability + incident readiness

  • Implement gold-standard observability across logs/metrics/traces with actionable dashboards and alerting tied to SLOs.
  • Create runbooks, incident processes, and a postmortem culture that meaningfully improve the system each time.

 

Requirements

  • Deep AWS experience, including operating production systems at scale (networking, IAM, compute, storage, observability, cost).
  • Strong Kubernetes/EKS experience: cluster design, workload isolation, autoscaling (cluster + pod), upgrades, reliability practices.
  • Excellent Docker + container runtime knowledge: image optimization, build pipelines, caching strategies, and runtime security considerations.
  • Systems-level competence: Linux fundamentals, networking, performance debugging, resource contention, concurrency basics.
  • Infrastructure automation: strong ability to implement infrastructure as code (Terraform/CDK/CloudFormation) and repeatable environments.
  • Observability expertise: metrics/logging/tracing design, SLOs/SLIs, alerting that avoids noise and catches real issues.
  • Security + compliance mindset: experience working in SOC 2-aligned environments; ability to implement least privilege, auditability, and operational controls.
  • Strong engineering communication: can write clear docs, propose designs, and upskill the team.

 

Nice-to-have

  • Experience migrating from mixed hosting providers into a more cohesive platform architecture.
  • Security depth beyond basics: threat modeling, hardening, secure supply chain for containers, audit-readiness workflows.
  • Ability to contribute across the stack:
    • Python (our SDK and backend systems) and Next.js/TypeScript, enough to collaborate effectively with other engineers.
  • Strong fluency with AI coding tools (using them to accelerate debugging, automation, and implementation without sacrificing correctness).

 

The offer

  • Remote or on-site work (relocation package available) on a B2B basis
  • Compensation of 10 500 EUR gross per month
  • Meaningful equity 
  • Full healthcare
  • Daily team meals (if on-site)

 

Hiring process

  • Screening interview with Tech Recruitment
  • A meeting with the CEO
  • A technical interview

San Francisco, US
Contract, Full-time, Remote
Posted February 16, 2026