AI/LLM Data Engineer

Posted on February 26, 2025

Job Description

  • AI/LLM Data Engineer
  • Experience: 3-5 Years
  • Remote
  • Role Overview:
  • We are seeking an AI/LLM Data Engineer to build and maintain data pipelines for our Generative AI platform. This position requires expertise in Large Language Model (LLM) technologies and a strong background in data engineering with a focus on Retrieval-Augmented Generation (RAG) and knowledge base techniques. The role involves collaborating with cross-functional teams and working on high-impact AI projects.
  • Key Responsibilities:
  • ● Design, implement, and maintain an end-to-end multi-stage data pipeline for LLMs, including:
  • ○ Supervised Fine Tuning (SFT) processes
  • ○ Reinforcement Learning from Human Feedback (RLHF)
  • ● Evaluate and integrate diverse data sources to support Generative AI platforms
  • ● Develop and optimize workflows for:
  • ○ Chunking, indexing, ingestion, and vectorization of text and non-text data
  • ● Benchmark and implement various vector stores, embedding techniques, and retrieval methods
  • ● Build a flexible pipeline that supports multiple embedding algorithms, vector stores, and search types (vector search, hybrid search)
  • ● Implement and maintain auto-tagging systems and data preparation processes
  • ● Develop tools for text and image data crawling, cleaning, and refinement
  • ● Collaborate with teams to ensure data quality and relevance for AI/ML models
  • ● Work with data lakehouse architectures to optimize data storage and processing
  • ● Integrate Snowflake and vector store technologies to optimize workflows
  • Required Qualifications:
  • ● Education: Master's degree in Computer Science, Data Science, or a related field
  • ● Experience:
  • ○ 3-5 years of work experience in data engineering, with a focus on AI/ML
  • ○ Hands-on experience with data cleaning, tagging, annotation, and data crawling
  • ● Skills:
  • ○ Proficiency in Python, JSON, HTTP, and related tools
  • ○ Strong understanding of LLM architectures, training processes, and data requirements
  • ○ Experience with RAG systems, knowledge base construction, and vector databases
  • ○ Familiarity with embedding techniques, similarity search algorithms, and information retrieval
  • ○ Experience with data lakehouse concepts and architectures
  • ○ Knowledge of Snowflake and its integration in AI/ML pipelines
  • ○ Hands-on experience with vector store technologies and their applications in AI
  • ○ Collaborative communication skills, with the ability to work in a cross-functional team environment
  • ○ Ability to translate business needs into technical solutions
  • ○ Passion for innovation and ethical AI development
  • Preferred Qualifications:
  • ● Experience with LLM/RAG frameworks such as LangChain, LlamaIndex, Semantic Kernel, or OpenAI Functions
  • ● Familiarity with distributed computing platforms (e.g., Apache Spark, Dask)
  • ● Knowledge of data versioning and experiment tracking tools
  • ● Experience with cloud platforms (AWS, GCP, Azure) for large-scale data processing
  • ● Understanding of data privacy and security best practices
  • ● Experience implementing data lakehouse solutions
  • ● Proficiency in optimizing queries and data processes in Snowflake or Databricks
  • ● Experience tuning LLM sampling parameters (temperature, top-k, repeat penalty) and working with evaluation metrics

Required Skills

AI/ML engineer