Site Reliability Engineer

LinkedIn FarmGPU Rancho Cordova, CA
Mid-Senior level · Posted April 17, 2026
Requirements
  • Strong working knowledge of Linux systems: comfortable with the command line, process/service management, log analysis, and hands-on troubleshooting in a production environment.
  • Experience with monitoring and observability tools, particularly Grafana and Prometheus: able to navigate dashboards, interpret metric trends, and act on alerts.
  • Proficiency in scripting and automation: Python and/or Bash for operational task automation; experience running Ansible playbooks in production.
  • Solid understanding of distributed system concepts and the ability to troubleshoot complex issues across multiple layers of the stack.
  • Familiarity with datacenter networking fundamentals (IP addressing, VLANs, switching, OSI layers 3/4), enough to diagnose and resolve common connectivity issues.
  • Experience with bare-metal server environments, including hardware diagnostics, BMC/IPMI management, and routine maintenance.
  • Working knowledge of containerization: Docker and/or Kubernetes at an operational level.
  • Solid troubleshooting methodology and attention to detail; comfortable following and improving documented runbooks.
  • Willingness to work on-site in Rancho Cordova, CA, including shift rotations covering evenings, nights, and weekends.
  • 3+ years in a production SRE, DevOps, or infrastructure operations role.
  • Experience implementing and tracking SLIs and SLOs for production services.
Preferred Skills
  • Familiarity with GPU server environments (NVIDIA H100/H200/B200) or HPC infrastructure.
  • Experience with storage platforms such as NVMe, NAS, or VAST Data in a production setting.
  • Exposure to security and compliance practices: secret management, access control, Linux hardening, SOC 2 familiarity.
  • Experience with cloud platforms (AWS, GCP, or Azure) or hybrid datacenter/cloud environments.
  • Relevant certifications such as RHCSA, CKA, or AWS Certified DevOps Engineer.