Auriga is looking for a DevOps / SRE Engineer to own the reliability, scalability, and operational health of our production infrastructure. You will manage infrastructure end-to-end through code, build and maintain CI/CD pipelines, respond to production incidents as part of an on-call rotation, and partner with engineering teams to embed reliability into every stage of our software delivery lifecycle. This is a hands-on ownership role, not a support or ticket-queue position.
KEY RESPONSIBILITIES
On-Call & Incident Management
- Participate in or lead production incident triage, mitigation, and resolution.
- Define and track SLIs / SLOs and manage error budgets for critical services.
- Author blameless postmortems with root cause analysis and corrective action plans for all significant incidents.
- Build and maintain runbooks and incident response playbooks.
Infrastructure as Code & Infra Management
- Own the full infrastructure lifecycle using Terraform exclusively — no manual console provisioning in production.
- Design modular Terraform modules for cloud resources: compute, networking, storage, IAM, DNS, databases.
- Maintain environment parity (dev / staging / prod), detect and remediate drift, and apply immutable infrastructure principles.
- Use Ansible or equivalent for configuration management, patch management, and compliance at scale.
- Lead capacity planning, scaling, and cost optimization initiatives.
Kubernetes & Containers
- Provision, configure, upgrade, and maintain production Kubernetes clusters (EKS / GKE / AKS).
- Manage Helm charts, ArgoCD / FluxCD GitOps workflows, RBAC, network policies, and secrets management.
- Troubleshoot pod failures, resource constraints, scheduling issues, and node-level problems.
CI/CD & Release Engineering
- Build and maintain CI/CD pipelines (GitHub Actions / GitLab CI / Jenkins) with quality gates and automated rollback.
- Implement progressive delivery: blue-green, canary deployments, and feature flags.
- Conduct production readiness reviews before new service go-lives.
Observability & Monitoring
- Implement and manage full-stack observability: metrics (Prometheus + Grafana), logs (Loki), traces (OpenTelemetry / Jaeger).
- Build SLO-driven alerting and dashboards covering the four golden signals; route via PagerDuty / OpsGenie.
Linux, Networking & Security
- Administer production Linux systems (Ubuntu / RHEL / CentOS): performance tuning, hardening, troubleshooting.
- Manage VPCs, subnets, security groups, VPNs, load balancers, Nginx / HAProxy, CDNs, and SSL/TLS certificates.
- Manage secrets with Vault / AWS Secrets Manager; integrate security scanning into CI/CD pipelines (DevSecOps).
Database & Application Operations
- Manage operational health, backups, replication, and failover for MySQL / PostgreSQL, MongoDB, Redis, Memcached.
- Support deployments for both monolithic and microservices architectures; validate DR procedures regularly.
REQUIREMENTS
- 3–5 years of hands-on production DevOps / SRE / Platform Engineering experience.
- Proven on-call experience: incident ownership, RCA, and postmortem authorship in a real production environment.
- Strong Terraform proficiency: writing and maintaining modules, managing state, and working in team IaC workflows.
- Solid Linux administration skills at the OS level (not just application-layer).
- Kubernetes production experience: cluster admin, workload management, debugging, RBAC, network policies.
- CI/CD pipeline ownership experience (built pipelines from scratch, not just run them).
- Hands-on Prometheus + Grafana: alert rule design and dashboard creation.
- Python and/or Bash scripting for production automation.
- Working knowledge of AWS (primary), GCP, or Azure — able to architect and troubleshoot independently.
- Strong networking fundamentals: TCP/IP, DNS, TLS, VPCs, load balancers, firewalls.
PREFERRED QUALIFICATIONS
- CKA / CKAD, AWS DevOps Engineer Professional, or HashiCorp Terraform Associate certification.
- GitOps experience with ArgoCD or FluxCD in production.
- Service mesh experience: Istio or Linkerd.
- Message queue / streaming experience: Kafka, RabbitMQ, or AWS SQS / SNS.
- Chaos engineering or DR game day experience.
- HashiCorp Vault or cloud-native secrets management experience.
- Exposure to FinOps, cost visibility tooling, and cloud cost optimization.
PREFERRED TECHNICAL SKILL SET
- Cloud: AWS (primary) | GCP / Azure
- IaC & Config: Terraform, Ansible
- Containers: Docker, Kubernetes, Helm, ArgoCD / FluxCD
- CI/CD: GitHub Actions, GitLab CI, Jenkins
- Observability: Prometheus, Grafana, Loki, OpenTelemetry, PagerDuty / OpsGenie
- Scripting: Python, Bash, Go
- Databases: MySQL / PostgreSQL, MongoDB, Redis, Memcached
- Messaging: Kafka, RabbitMQ, AWS SQS (preferred)
- Networking & Security: Nginx, HAProxy, CloudFront, Cloudflare, Vault, WAF
- Workflow: Argo Workflows
About the Company
Hi there! We are Auriga IT.
We power businesses across the globe through digital experiences, data and insights. From the apps we design to the platforms we engineer, we're driven by an ambition to create world-class digital solutions and make an impact. Our team has been part of building solutions for the likes of Zomato, Yes Bank, Tata Motors, Amazon, Snapdeal, Ola, Practo, Vodafone, Meesho, Volkswagen, Droom and many more.
We are a group of people who just could not leave our college life behind. Auriga was founded solely out of a desire to keep working together with friends and enjoying an extended college life.
Who Has Not Dreamt of Working with Friends for a Lifetime?
Come Join In!