Senior DevOps & Infrastructure Engineer

Our client, a US-based AI company, is building infrastructure that powers the next generation of AI. Their platform helps companies create high-quality post-training data and run reinforcement fine-tuning workflows at scale. At the heart of their product are fast, isolated, and massively parallel developer sandboxes that let AI teams test and improve their models.

Joining their team as a Senior DevOps & Infrastructure Engineer means owning critical cloud and containerized infrastructure and optimizing performance, reliability, and scalability for cutting-edge AI workloads.

The company is trusted by foundation labs, Fortune 500s, and fast-growing startups.

You will be part of a high-caliber team: former founders, published ML researchers, Olympiad medalists, and engineers who have built products with real adoption. They run lean, move fast, and hold an extremely high bar.

The Role

The company runs a platform plus SDK/dev tools for creating RL environments and post-training data and for running reinforcement fine-tuning at scale. A key part of that experience is their infra and developer sandboxes: fast, reliable, observable, Dockerized compute environments with massive parallelization.

They are looking for an infrastructure owner who is obsessed with performance and reliability – someone who treats shaving seconds off sandbox lifecycle and runtime performance as a sport.

You will own DevOps, infrastructure, and architecture decisions as the company hits its next order of scale.

What you’ll work on:

Developer sandbox infrastructure

Own the AWS + EKS-based sandbox platform that runs Dockerized workloads for customers and internal teams.
Optimize sandbox lifecycle end-to-end: provisioning, scheduling, image pulls, startup, execution, teardown, and caching.
Design for massive parallelism while maintaining reliability, fairness, and predictable performance.

Kubernetes + AWS excellence

Evolve their cluster architecture: node groups, autoscaling strategies, spot/on-demand mixes, scheduling policies, and workload isolation.
Build safe-by-default patterns: quotas, resource limits, network policies, pod security, secrets management, and guardrails.
Improve cluster resiliency and operational ergonomics (upgrades, rollouts, disaster recovery, fail-safes).

Cross-stack DevOps ownership

Address infrastructure bottlenecks as the company scales.
Improve developer experience for internal teams: safer deploys, better CI/CD, smoother local/dev workflows, faster iteration.
Provide architectural input and raise the infra maturity of the team via docs, patterns, and coaching.
Interface with their backend/workers (Railway), frontend (Vercel/Next.js), and data (Supabase/Postgres) to ensure the whole system is cohesive.

Performance engineering and ruthless measurement

Establish “infra product metrics” and instrument everything: P50/P95/P99 sandbox startup times, queue times, job success rates, noisy-neighbor rates, image pull latencies, cluster saturation, and cost-per-run.
Build benchmarking harnesses for sandboxes and workloads to track regressions and validate improvements.
Treat efficiency as a first-class metric: optimize utilization without sacrificing latency or reliability.

Observability + incident readiness

Implement gold-standard observability across logs/metrics/traces with actionable dashboards and alerting tied to SLOs.
Create runbooks, incident processes, and a postmortem culture that meaningfully improve the system each time.

Requirements

Deep AWS experience, including operating production systems at scale (networking, IAM, compute, storage, observability, cost).
Strong Kubernetes/EKS experience: cluster design, workload isolation, autoscaling (cluster + pod), upgrades, reliability practices.
Excellent Docker + container runtime knowledge: image optimization, build pipelines, caching strategies, and runtime security considerations.
Systems-level competence: Linux fundamentals, networking, performance debugging, resource contention, concurrency basics.
Infrastructure automation: strong ability to implement infrastructure as code (Terraform/CDK/CloudFormation) and repeatable environments.
Observability expertise: metrics/logging/tracing design, SLOs/SLIs, alerting that avoids noise and catches real issues.
Security + compliance mindset: experience working in SOC 2-aligned environments; ability to implement least privilege, auditability, and operational controls.
Strong engineering communication: can write clear docs, propose designs, and upskill the team.

Nice-to-have

Experience migrating from mixed hosting providers into a more cohesive platform architecture.
Security depth beyond basics: threat modeling, hardening, secure supply chain for containers, audit-readiness workflows.
Ability to contribute across the stack: Python (their SDK and backend systems) and Next.js/TypeScript, enough to collaborate effectively with other engineers.
Strong fluency with AI coding tools (using them to accelerate debugging, automation, and implementation without sacrificing correctness).

The offer

Remote or on-site work (relocation package available) on a B2B basis
Compensation from 10 500 EUR gross per month
Meaningful equity
Full healthcare
Daily team meals (if on-site)

Hiring process

Screening interview with Tech Recruitment
A meeting with the CEO
A technical interview

San Francisco, US

Contract, Full-time, Remote

Posted

February 16, 2026

