Machine Learning Data Engineer - Systems & Retrieval

LinkedIn Zyphra San Francisco, CA
Posted March 14, 2026
Requirements
  • Experience ensuring compliance and access control in data handling, with security and auditability in mind
  • Strong software engineering background with fluency in Python
  • Experience designing, building, and maintaining data pipelines in production environments
  • Deep understanding of data structures, storage formats, and distributed data systems
  • Familiarity with indexing and retrieval techniques for large-scale document corpora
  • Understanding of database systems (SQL and NoSQL), their internals, and performance characteristics
  • Strong attention to security, access controls, and compliance best practices (e.g., GDPR, SOC2)
  • Excellent debugging, observability, and logging practices to support reliability at scale
  • Strong communication skills and experience collaborating across ML, infra, and product teams
Preferred Skills
  • Experience building or maintaining LLM-integrated retrieval systems (e.g., RAG pipelines)
  • Academic or industry background in data mining, search, recommendation systems, or IR literature
  • Experience with large-scale ETL systems and tools like Apache Beam, Spark, or similar
  • Familiarity with vector databases (e.g., FAISS, Weaviate, Pinecone) and embedding-based retrieval
  • Understanding of data validation and quality assurance in machine learning workflows
  • Experience working on cross-functional infra and MLOps teams
  • Knowledge of how data infrastructure supports training pipelines, inference serving, and feedback loops
  • Comfort working across raw, unstructured data, structured databases, and model-ready formats
Education
  • (Not required) – Academic or industry background in data mining, search, recommendation systems, or IR literature