Rancher Kubernetes expert

Posted on July 28, 2025

Job Description

  • Experience: 7+ years
  • Mandatory skills:
  • Rancher, Kubernetes (RKE2 and K3s), Terraform, Python
  • Strong background operating Prometheus, Grafana, and Elasticsearch/Fluentd/Kibana (ELK/EFK) stacks
  • Scope: We're looking for a Rancher Kubernetes expert to lead the design, automation, and reliability of our on-prem and hybrid container platform. Sitting at the intersection of the Platform Engineering and Infrastructure Reliability teams, this role owns the lifecycle of Rancher-managed clusters, from bare-metal provisioning and performance tuning to observability, security, and automated operations.
  • You'll apply SRE principles to ensure high availability, scalability, and resilience across environments supporting mission-critical workloads.
  • Core Responsibilities:
  • Platform & Infrastructure Engineering: Design, deploy, and maintain Rancher-managed Kubernetes clusters (RKE2/K3s) at enterprise scale.
  • Architect highly available clusters integrated with on-prem infrastructure: UCS, VXLAN, storage, DNS, and load balancers.
  • Lead Rancher Fleet implementations for GitOps-driven cluster and workload management.
  • Performance Engineering & Optimization: Tune clusters for high-performance workloads on bare-metal hardware, optimizing CPU, memory, and I/O paths.
  • Align cluster scheduling and resource profiles with physical infrastructure topologies (NUMA, NICs, etc.).
  • Optimize CNI, kubelet, and scheduler settings for low-latency, high-throughput applications.
  • Security & Compliance: Implement security-first Kubernetes patterns: RBAC, Pod Security Standards, network policies, and image validation.
  • Drive shift-left security using Terraform, Helm, and CI/CD pipelines; align to PCI, FIPS, and CIS benchmarks.
  • Lead infrastructure risk reviews and implement guardrails for regulated environments.
  • Automation & Tooling: Build and maintain IaC stacks using Terraform, Helm, and Argo CD.
  • Develop platform automation and observability tooling using Python or Go.
  • Ensure declarative management of infrastructure and applications through GitOps pipelines.
  • SRE & Observability: Apply SRE best practices for platform availability, capacity, latency, and incident response.
  • Operate and tune Prometheus, Grafana, and ELK/EFK stacks for complete platform observability.
  • Drive actionable alerting, automated recovery mechanisms, and clear operational documentation.
  • Lead postmortems and drive systemic improvements to reduce MTTR and prevent recurrence.
  • Required Skills
  • 7+ years in infrastructure, platform, or SRE roles
  • Deep hands-on experience with Rancher (RKE2/K3s) in production environments
  • Proficient with Terraform, Helm, Argo CD, Python, and/or Go
  • Demonstrated performance tuning in bare-metal Kubernetes environments (UCS, VXLAN, MetalLB)
  • Expert in Linux systems (systemd, networking, kernel tuning), Kubernetes internals, and container runtimes
  • Real-world application of SRE principles in high-stakes, always-on environments
  • Strong background operating Prometheus, Grafana, and Elasticsearch/Fluentd/Kibana (ELK/EFK) stacks
  • Preferred Qualifications
  • Experience integrating Kubernetes with OpenStack and Magnum
  • Knowledge of Rancher add-ons: Fleet, Longhorn, CIS Scanning
  • Familiarity with compliance-driven infrastructure (PCI, FedRAMP, SOC2)
  • Certifications: CKA, CKS, or Rancher Kubernetes Administrator
  • Strategic thinker with strong technical judgment and execution ability
  • Calm and clear communicator, especially during incidents or reviews
  • Mentorship-oriented; supports team learning and cross-functional collaboration
  • Self-motivated, detail-oriented, and thrives in a fast-moving, ownership-driven culture

Required Skills

Rancher, Kubernetes (RKE2 and K3s), Terraform, Python; strong background operating Prometheus, Grafana, and Elasticsearch/Fluentd/Kibana (ELK/EFK) stacks