AI/LLM Data Engineer
Posted on February 26, 2025
Job Description
- Experience: 3-5 Years
- Remote
- Role Overview:
- We are seeking an AI/LLM Data Engineer to build and maintain data pipelines for our Generative AI platform. This position requires expertise in Large Language Model (LLM) technologies and a strong background in data engineering with a focus on Retrieval-Augmented Generation (RAG) and knowledge base techniques. The role involves collaborating with cross-functional teams and working on high-impact AI projects.
- Key Responsibilities:
- ● Design, implement, and maintain an end-to-end multi-stage data pipeline for LLMs, including:
- ○ Supervised Fine-Tuning (SFT) processes
- ○ Reinforcement Learning from Human Feedback (RLHF)
- ● Evaluate and integrate diverse data sources to support Generative AI platforms
- ● Develop and optimize workflows for:
- ○ Chunking, indexing, ingestion, and vectorization of text and non-text data
- ● Benchmark and implement various vector stores, embedding techniques, and retrieval methods
- ● Build a flexible pipeline that supports multiple embedding algorithms, vector stores, and search types (vector search, hybrid search)
- ● Implement and maintain auto-tagging systems and data preparation processes
- ● Develop tools for text and image data crawling, cleaning, and refinement
- ● Collaborate with teams to ensure data quality and relevance for AI/ML models
- ● Work with data lakehouse architectures to optimize data storage and processing
- ● Integrate Snowflake and vector store technologies to optimize workflows
- Required Qualifications:
- ● Education: Master's degree in Computer Science, Data Science, or a related field
- ● Experience:
- ○ 3-5 years of work experience in data engineering, with a focus on AI/ML
- ○ Hands-on experience with data cleaning, tagging, annotation, and data crawling
- ● Skills:
- ○ Proficiency in Python, JSON, HTTP, and related tools
- ○ Strong understanding of LLM architectures, training processes, and data requirements
- ○ Experience with RAG systems, knowledge base construction, and vector databases
- ○ Familiarity with embedding techniques, similarity search algorithms, and information retrieval
- ○ Experience with data lakehouse concepts and architectures
- ○ Knowledge of Snowflake and its integration in AI/ML pipelines
- ○ Hands-on experience with vector store technologies and their applications in AI
- ○ Collaborative communication skills, with the ability to work in a cross-functional team environment
- ○ Ability to translate business needs into technical solutions
- ○ Passion for innovation and ethical AI development
- Preferred Qualifications:
- ● Experience with LLM/RAG frameworks such as LangChain, LlamaIndex, Semantic Kernel, or OpenAI Functions
- ● Familiarity with distributed computing platforms (e.g., Apache Spark, Dask)
- ● Knowledge of data versioning and experiment tracking tools
- ● Experience with cloud platforms (AWS, GCP, Azure) for large-scale data processing
- ● Understanding of data privacy and security best practices
- ● Experience implementing data lakehouse solutions
- ● Proficiency in optimizing queries and data processes in Snowflake or Databricks
- ● Experience with LLM sampling parameters (temperature, top-k, repeat penalty) and evaluation metrics
Required Skills
ai/ml
engineer