Rancher Kubernetes expert
Posted on July 28, 2025
Job Description
- Experience: 7+ years
- Mandatory skills:
- Rancher, Kubernetes (RKE2 and K3s), Terraform, Python
- Strong background operating Prometheus, Grafana, and Elasticsearch/Fluentd/Kibana (ELK/EFK) stacks
- Scope: We're looking for a Rancher Kubernetes expert to lead the design, automation, and reliability of our on-prem and hybrid container platform. Sitting at the intersection of the Platform Engineering and Infrastructure Reliability teams, this role owns the lifecycle of Rancher-managed clusters, from bare-metal provisioning and performance tuning to observability, security, and automated operations.
- You'll apply SRE principles to ensure high availability, scalability, and resilience across environments supporting mission-critical workloads.
- Core Responsibilities:
- Platform & Infrastructure Engineering: Design, deploy, and maintain Rancher-managed Kubernetes clusters (RKE2/K3s) at enterprise scale.
- Architect highly available clusters integrated with on-prem infrastructure: UCS, VXLAN, storage, DNS, and load balancers.
- Lead Rancher Fleet implementations for GitOps-driven cluster and workload management.
- Performance Engineering & Optimization: Tune clusters for high-performance workloads on bare-metal hardware, optimizing CPU, memory, and I/O paths.
- Align cluster scheduling and resource profiles with physical infrastructure topologies (NUMA, NICs, etc.).
- Optimize CNI, kubelet, and scheduler settings for low-latency, high-throughput applications.
- Security & Compliance: Implement security-first Kubernetes patterns: RBAC, Pod Security Standards, network policies, and image validation.
- Drive shift-left security using Terraform, Helm, and CI/CD pipelines; align to PCI, FIPS, and CIS benchmarks.
- Lead infrastructure risk reviews and implement guardrails for regulated environments.
- Automation & Tooling: Build and maintain IaC stacks using Terraform, Helm, and Argo CD.
- Develop platform automation and observability tooling using Python or Go.
- Ensure declarative management of infrastructure and applications through GitOps pipelines.
- SRE & Observability: Apply SRE best practices for platform availability, capacity, latency, and incident response.
- Operate and tune Prometheus, Grafana, and ELK/EFK stacks for complete platform observability.
- Drive actionable alerting, automated recovery mechanisms, and clear operational documentation.
- Lead postmortems and drive systemic improvements to reduce MTTR and prevent recurrence.
- Required Skills
- 7+ years in infrastructure, platform, or SRE roles
- Deep hands-on experience with Rancher (RKE2/K3s) in production environments
- Proficient with Terraform, Helm, Argo CD, Python, and/or Go
- Demonstrated performance tuning in bare-metal Kubernetes environments (UCS, VXLAN, MetalLB)
- Expert in Linux systems (systemd, networking, kernel tuning), Kubernetes internals, and container runtimes
- Real-world application of SRE principles in high-stakes, always-on environments
- Strong background operating Prometheus, Grafana, and Elasticsearch/Fluentd/Kibana (ELK/EFK) stacks
- Preferred Qualifications
- Experience integrating Kubernetes with OpenStack and Magnum
- Knowledge of Rancher add-ons: Fleet, Longhorn, CIS Scanning
- Familiarity with compliance-driven infrastructure (PCI, FedRAMP, SOC2)
- Certifications: CKA, CKS, or Rancher Kubernetes Administrator
- Strategic thinker with strong technical judgment and execution ability
- Calm and clear communicator, especially during incidents or reviews
- Mentorship-oriented; supports team learning and cross-functional collaboration
- Self-motivated, detail-oriented, and thrives in a fast-moving, ownership-driven culture
Required Skills
rancher
kubernetes (rke2 and k3s)
terraform
python
strong background operating prometheus, grafana, and elasticsearch/fluentd/kibana (elk/efk) stacks