
Senior/Staff Software Engineer, Super Compute Memory

Pryon · Boston, MA (via LinkedIn)
Posted April 5, 2026
Requirements
  • Extensive experience in software development, with a proven track record of delivering complex, large-scale systems (8+ years for Senior, 12+ years for Staff)
  • Proven experience building distributed systems at 100M+ scale (documents, vectors, or equivalent)
  • Deep knowledge of parallel and distributed computing concepts including consensus algorithms, distributed coordination, and fault tolerance
  • Hands-on experience with vector databases (pgvector, Pinecone, Weaviate, Milvus, or equivalent)
  • Proficiency in systems programming languages such as C++, Go, or Rust
  • Experience with parallel programming models (e.g., MPI, OpenMP, CUDA)
  • Production experience optimizing GPU workloads for inference including batch optimization, quantization (INT8, FP16), and GPU memory management
  • Experience managing large-scale GPU infrastructure (10+ GPUs in production) including cluster orchestration, resource scheduling, and cost optimization
  • Deep understanding of GPU architectures (NVIDIA A100/H100, tensor cores) and inference frameworks (vLLM, TensorRT, Triton)
  • Deep understanding of memory hierarchies, cache optimization, and NUMA architectures
  • Experience with container orchestration (Kubernetes) and distributed computing frameworks (Ray, Dask, Spark, or equivalent)
  • Familiarity with performance analysis and optimization tools and techniques (profilers, tracers, benchmarking frameworks)
  • Strong systems programming background with evidence of performance-critical contributions (open source, papers, or production systems)
Preferred Skills
  • Experience with cloud-based HPC, including services on AWS (EC2 P4/P5 instances), GCP (A100/H100 VMs), or Azure (ND-series)
  • Knowledge of networking and storage technologies in the context of high-performance computing (RDMA, NVMe, distributed filesystems, GPU-Direct Storage)
  • Advanced GPU optimization experience including multi-GPU inference (model parallelism, pipeline parallelism), mixed-precision training/inference, and GPU profiling tools (NVIDIA Nsight, nvprof, PyTorch Profiler)
  • Experience with ML infrastructure including model serving frameworks (vLLM, TensorRT-LLM, Triton Inference Server), GPU resource management (NVIDIA MIG, GPU time-slicing), and inference optimization (continuous batching, speculative decoding)
  • Production experience with GPU monitoring and observability (DCGM, GPU metrics dashboards, cost-per-query optimization)
  • Background in information retrieval or vector search (FAISS, HNSW, IVF indices, approximate nearest neighbor algorithms)
  • Production experience with object storage (MinIO, S3, GCS) at petabyte scale
  • Familiarity with specific technologies: Kafka, PostgreSQL, pgvector, Kubernetes, FluxCD, Yugabyte
  • Contributions to open-source projects in the HPC, distributed systems, or vector search space
  • Experience with on-premises enterprise deployments and air-gapped environments (government/defense sector)