Senior DevOps Engineer
QBench is a fast-growing, fully remote SaaS company powering modern laboratory operations through our cloud-based LIMS platform. As our platform scales and supports increasingly sophisticated customers, reliability, availability, and performance are mission-critical to our business and our customers’ operations.
The Senior DevOps Engineer is responsible for ensuring QBench’s production systems are reliable, observable, secure, and resilient. This role applies software engineering and systems thinking to infrastructure and operations, helping balance product velocity with operational excellence through clear standards, strong automation, and measurable outcomes.
This is a hands-on, high-ownership role focused on production stability, incident reduction, and continuous improvement—while enabling engineering teams to ship safely and confidently.
The Role
We are looking for an experienced DevOps Engineer who thrives in a remote-first environment and is passionate about building and operating reliable systems at scale.
You will play a key role in improving the reliability and scalability of our AWS-based platform, strengthening deployment and infrastructure patterns, and continuously reducing operational risk and toil. You will work closely with Engineering, Operations, and Leadership to ensure reliability is treated as a first-class product feature.
You’ll help us prioritize and execute a reliability and infrastructure roadmap—focusing first on the highest-impact improvements to production stability, delivery automation, and AWS infrastructure standardization.
Key Responsibilities
Production Ownership & Operational Excellence
- Own the reliability, availability, and performance of production systems.
- Lead incident response for production issues, including coordination, mitigation, and communication.
- Conduct blameless postmortems and ensure follow-up actions are completed.
- Proactively identify reliability risks and eliminate single points of failure.
- Define reliability standards and operational practices that enable teams to ship safely.
Cloud Infrastructure & Infrastructure as Code (IaC)
- Design, provision, and manage AWS infrastructure using Infrastructure as Code (IaC) tools such as AWS CDK and AWS SAM (and/or Terraform).
- Maintain clean, version-controlled, and auditable infrastructure across all environments.
- Manage and improve core AWS services including:
  - ECS Fargate
  - Elastic Beanstalk
  - AWS Lambda
  - RDS Aurora MySQL
  - S3, VPC, IAM, and networking components
- Lead infrastructure modernization initiatives, including re-architecting services for reliability and scalability.
CI/CD, Release Engineering & Deployment Automation
- Own and continuously improve CI/CD pipelines to ensure fast, reliable, and secure builds, tests, and deployments.
- Standardize deployment workflows across services and environments (dev/staging/prod), including safe rollout strategies and automated rollback where appropriate.
- Implement quality and security gates in the delivery pipeline (unit/integration tests, linting, dependency scanning, IaC checks).
- Improve deployment observability by integrating release markers, build metadata, and change tracking into monitoring and alerting.
- Partner with Engineering to reduce friction in the delivery process and improve developer velocity while maintaining reliability standards.
Performance, Capacity & Scalability
- Perform capacity planning and load analysis to ensure systems scale predictably.
- Optimize AWS resource usage for performance and cost efficiency.
- Tune and optimize MongoDB Atlas and RDS Aurora MySQL for throughput, latency, and reliability.
- Identify and remediate performance bottlenecks across the stack, including application-level configuration and runtime tuning.
Migrations, Upgrades & Risk Reduction
- Lead and execute infrastructure and platform migrations, including:
  - Migrating workloads from Elastic Beanstalk to ECS Fargate
  - Migrating AWS resources to more resilient architectures
- Perform server, runtime, and dependency upgrades, including:
  - Python version upgrades
  - Library and dependency updates
  - Server platform upgrades (EC2, RDS, etc.)
- Design and execute changes using safe rollout strategies, monitoring, and rollback plans.
Observability & Incident Prevention
- Build and maintain strong observability using metrics, logs, dashboards, and alerts.
- Ensure alerting is actionable and aligned with user-impacting symptoms.
- Reduce alert fatigue by continuously tuning thresholds and signals.
- Identify and eliminate sources of operational toil through automation.
Security, Compliance & Secrets Management
- Perform operational tasks supporting SOC 2 Security and Availability controls.
- Manage secrets and credentials using AWS Secrets Manager and AWS Systems Manager Parameter Store.
- Rotate credentials and enforce least-privilege access across infrastructure.
- Partner with Operations to support audits, evidence collection, and remediation.
- Ensure production changes follow auditable and repeatable processes.
Automation & Continuous Improvement
- Automate manual and repetitive operational tasks.
- Improve deployment safety and reliability through tooling and process improvements.
- Document runbooks, operational procedures, and reliability standards.
- Act as a DevOps and infrastructure advisor to Engineering teams during design and implementation.
How We Measure Reliability
At QBench, reliability is measured explicitly and transparently using operational and service-level indicators such as:
- Request success rate
- Latency (p95 / p99)
- Availability of critical workflows
- Incident metrics including:
  - Frequency and severity of incidents
  - Mean Time to Detect (MTTD)
  - Mean Time to Recovery (MTTR)
- Operational toil, tracked and reduced over time through automation
Reliability is treated as a shared responsibility, with the DevOps function providing guardrails, tooling, and accountability.
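For readers less familiar with the indicators listed above, the following is a minimal illustrative sketch of how they are commonly computed. It is not QBench's tooling: the function names, the choice to count only 5xx responses as failures, the nearest-rank percentile method, and measuring recovery from the start of an incident are all assumptions made for this example.

```python
from datetime import datetime, timedelta


def success_rate(status_codes):
    """Fraction of requests without a server error (assumption: only 5xx counts as failure)."""
    ok = sum(1 for code in status_codes if code < 500)
    return ok / len(status_codes)


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def mean_duration(pairs):
    """Average gap between each (start, end) pair of timestamps."""
    total = sum((end - start for start, end in pairs), timedelta())
    return total / len(pairs)


# Hypothetical incident records: (started, detected, resolved).
t0 = datetime(2024, 1, 1, 9, 0)
t1 = datetime(2024, 1, 8, 14, 0)
incidents = [
    (t0, t0 + timedelta(minutes=4), t0 + timedelta(minutes=30)),
    (t1, t1 + timedelta(minutes=6), t1 + timedelta(minutes=50)),
]

# MTTD: start of impact to detection. MTTR: start of impact to recovery.
mttd = mean_duration([(started, detected) for started, detected, _ in incidents])
mttr = mean_duration([(started, resolved) for started, _, resolved in incidents])
```

Teams differ on these definitions. Some measure MTTR from detection rather than from the start of impact, so the definition in use should be written down alongside the numbers.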
Measurable Outcomes / Success Metrics
- Improved production stability and reduced frequency/severity of incidents
- Faster detection and recovery from failures
- Improved deployment safety, consistency, and developer velocity
- Successful migration from Elastic Beanstalk to ECS Fargate with no reliability regressions
- Improved database performance and stability
- Successful SOC 2 audits with no high-severity reliability-related findings
- Clear, actionable postmortems and completed follow-up work
Required Qualifications
- 5+ years of experience in DevOps, Site Reliability Engineering, or Cloud Infrastructure roles.
- Must be located in the U.S.
- Strong hands-on experience with AWS, including ECS, Lambda, RDS, VPC, and IAM.
- Experience defining and operating service reliability metrics (SLIs/SLOs) and using them to drive operational improvements.
- Experience using Infrastructure as Code (AWS CDK, AWS SAM, Terraform, or similar).
- Experience managing and optimizing MySQL / Aurora in production.
- Strong incident response and postmortem experience.
- Solid understanding of cloud security and secrets management.
- Experience operating systems in SOC 2–aligned environments.
- Excellent written communication and documentation skills.
- Comfortable operating independently in a remote, async-first company.
- Legal U.S. resident currently residing in the United States.
Preferred / Nice-to-Have
- Experience migrating workloads from Elastic Beanstalk to ECS Fargate
- Experience building or operating observability platforms
- Familiarity with SOC 2 tooling (Drata, Vanta, Secureframe, or similar)
- Experience supporting high-availability SaaS platforms
- Background in Python application infrastructure
Who You Are
- Reliability-First: You treat reliability as a core product feature.
- Calm Under Pressure: You handle incidents methodically and communicate clearly.
- Data-Driven: You rely on metrics—not intuition—to guide decisions.
- Automation-Oriented: You aggressively eliminate manual work.
- Collaborative: You partner with Engineering to build reliable systems from the start.
Why QBench?
- Fully remote, US-based role
- High ownership and autonomy
- Direct impact on production reliability and customer trust
- Competitive compensation and benefits
- Opportunity to shape DevOps practices as the company scales
Salary: $100,000 - $118,000 USD annually (includes variable compensation)