Lead DevOps/Platform Engineer IV

Pacific Northwest National Laboratory, Seattle, WA (via LinkedIn)
Posted March 27, 2026
Requirements
  • Lead distributed system design processing data from hundreds of sources simultaneously
  • Mentor engineering teams and guide junior scientists/engineers
  • Platform Architecture & Infrastructure Leadership
  • Expert-level proficiency in Python and at least one additional language (Go, C#/.NET, C++) with proven ability to establish infrastructure automation standards, architect scalable tooling platforms, and guide teams in developing sophisticated automation frameworks
  • Mastery of Infrastructure as Code principles with deep expertise in Terraform, CloudFormation, Pulumi, or ARM templates and demonstrated ability to design enterprise-wide IaC strategies, module libraries, and governance frameworks that enable consistent and secure infrastructure deployment
  • Proven track record of architecting and leading implementation of enterprise-grade CI/CD platforms with ability to define build/release strategies, establish deployment patterns, and drive continuous delivery adoption while designing internal developer platforms that abstract complexity and accelerate team velocity
  • Expert proficiency with GitOps methodologies (ArgoCD, Flux), infrastructure testing frameworks (Terratest, InSpec), and policy-as-code (OPA, Sentinel) with strategic application of AI assist tools to drive team productivity, accelerate automation development, and optimize operational efficiency
  • Cloud Architecture & Orchestration Expertise
  • Demonstrated expertise architecting and leading multi-cloud infrastructure strategies across AWS, Azure, and GCP with deep expertise in containerization and Kubernetes ecosystem including production-grade container platforms, custom operators, CRDs, and multi-cluster strategies at organizational scale
  • Expert ability to architect sophisticated event-driven systems using cloud-native services (EventBridge, Event Grid, Pub/Sub, SNS/SQS) with advanced knowledge of service mesh architectures (Istio, Linkerd, Consul) and API gateway patterns for zero-trust networking and complex microservice environments
  • Mastery of cloud and container networking including CNI design, custom ingress implementations, advanced load balancing, service discovery patterns, and network security policies with ability to troubleshoot complex distributed system networking issues
  • Experience architecting edge computing solutions, hybrid cloud strategies, and secure enclave deployments with understanding of data sovereignty, latency optimization, and security requirements for geographically distributed infrastructure
  • Reliability Engineering & Security Leadership
  • Proven ability to architect comprehensive observability platforms integrating metrics (Prometheus, Thanos, Cortex), distributed tracing (Jaeger, Tempo), and logging systems (ELK, Loki, Splunk) with deep expertise in SRE principles including SLO/SLI frameworks, error budgets, and incident management
  • Expert implementation of security-first infrastructure including secrets management (Vault, AWS Secrets Manager, Azure Key Vault), automated vulnerability scanning, DevSecOps toolchains, and security policy enforcement across all infrastructure layers
  • Strategic capability to design enterprise disaster recovery and business continuity strategies including multi-region architectures, automated backup systems, RPO/RTO optimization, and regular DR testing with advanced chaos engineering practices to systematically improve system resilience
  • Deep understanding of compliance frameworks (SOC 2, HIPAA, FedRAMP, PCI-DSS, GDPR) with proven ability to implement automated compliance controls, audit logging, and infrastructure hardening standards that meet regulatory requirements
  • MLOps & Data Platform Engineering
  • Expertise in architecting end-to-end MLOps platforms with proven ability to design and implement model lifecycle management infrastructure including experiment tracking (MLflow, Weights & Biases), model versioning, model registries, feature stores (Feast, Tecton), and automated ML pipeline orchestration supporting continuous training and deployment
  • Deep expertise in building infrastructure for ML model serving and deployment including real-time inference APIs, batch prediction systems, A/B testing frameworks, model monitoring for drift detection, and automated model retraining pipelines with canary deployments and rollback capabilities
  • Advanced knowledge of distributed ML training infrastructure including multi-GPU and multi-node training orchestration, resource scheduling, and optimization for frameworks like PyTorch, TensorFlow, and JAX on Kubernetes-based platforms (Kubeflow, Ray, Spark ML) with deep understanding of compute resource management and cost optimization
  • Proven ability to architect cloud-native data platforms with expertise in ETL/ELT orchestration frameworks (Airflow, Prefect, Dagster, AWS Step Functions), production data storage systems (S3, Redshift, Databricks Delta Lake, PostgreSQL, MongoDB, Snowflake), and distributed data processing frameworks (Spark/Databricks, Kafka, Flink, Ray) supporting petabyte-scale data systems and real-time ML feature pipelines
  • Technical Leadership & Strategic Impact
  • Exceptional problem-solving and troubleshooting abilities with proven track record of resolving complex infrastructure incidents spanning ML pipelines, data platforms, and distributed systems while leading incident response and root cause analysis, combined with outstanding communication skills to translate technical complexity into business impact for executive leadership and stakeholders
  • Demonstrated ability to establish infrastructure and MLOps documentation standards, create comprehensive runbooks for ML system operations and DR procedures, develop technical training programs, and build knowledge sharing practices while mentoring and developing platform engineering teams through technical guidance and architecture reviews
  • PhD and 3 years of software engineering experience -OR
  • MS/MA or higher and 5 years of software engineering experience -OR
  • BS/BA and 7 years of software engineering experience -OR
  • AA and 16 years of software engineering experience in designing, architecting, programming, deploying, and automating software solutions in support of scientific research or consumer digital product development -OR
  • HS/GED and 18 years of software engineering experience in designing, architecting, programming, deploying, and automating software solutions in support of scientific research or consumer digital product development
  • This position requires the ability to obtain and maintain a federal security clearance.
Preferred Skills
  • Degree in computer science, software engineering, or related field
  • Track record of architecting and operating large-scale infrastructure supporting significant user bases, high-volume transaction systems, petabyte-scale data platforms, or production ML systems serving millions of predictions
  • Experience building and leading high-performing platform engineering, DevOps, or MLOps teams through hiring, mentoring, technical guidance, and career development
  • Experience establishing infrastructure practices, platform strategies, MLOps frameworks, and DevOps transformation initiatives at organizational scale
  • Background in mission-critical, regulated, or high-security environments (government, defense, financial services, healthcare) with understanding of compliance requirements for both traditional systems and ML/AI applications
  • Demonstrated success leading complex, multi-team infrastructure and MLOps initiatives from architecture through production deployment, operational handoff, and continuous improvement
Education
  • (Not required) – PhD and 3 years of software engineering experience -OR
  • (Not required) – MS/MA or higher and 5 years of software engineering experience -OR
  • (Not required) – BS/BA and 7 years of software engineering experience -OR
  • (Not required) – AA and 16 years of software engineering experience in designing, architecting, programming, deploying, and automating software solutions in support of scientific research or consumer digital product development -OR
  • (Not required) – HS/GED and 18 years of software engineering experience in designing, architecting, programming, deploying, and automating software solutions in support of scientific research or consumer digital product development
  • (Not required) – Degree in computer science, software engineering, or related field