MLOps Framework for Dynamic Dataset Selection using Semantic Search and LLM

Mattjesc/MLOps-Framework-DDS-SS-LLM


Overview

This project provides an MLOps framework for dynamically selecting datasets to augment Large Language Models (LLMs) using semantic search techniques. The framework supports both API-based and local model approaches, allowing for flexible deployment in cloud, hybrid, or local environments.

Features

  • Dynamic Dataset Selection: Automatically selects relevant datasets based on keywords detected in user queries.
  • Semantic Search Integration: Enhances LLM responses with external data retrieved using semantic search techniques.
  • MLOps Practices: Incorporates automation, monitoring, and reproducibility for efficient model management.
  • Flexible Deployment: Supports cloud, hybrid, and local architectures.

Prerequisites

Before you begin, ensure you have met the following requirements:

  • Python 3.7+
  • API-based Approach:
    • OpenAI API Key
  • Local Model Approach:
    • CUDA-enabled GPU (recommended for performance)
    • PyTorch with CUDA support

Installation

API-based Approach

  1. Clone the repository:

    git clone https://github.com/Mattjesc/MLOps-Framework-DDS-SS-LLM.git
    cd MLOps-Framework-DDS-SS-LLM
  2. Install the required packages:

    pip install -r requirements_API.txt

    Note: Future releases of the listed packages may not be compatible with this project; adjust the versions in requirements_API.txt if installation or startup fails.

  3. Set your OpenAI API key:

    export OPENAI_API_KEY=your_api_key_here
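A missing key tends to surface only when the first request fails, so it can help to check for it at startup. The snippet below is a minimal sketch of such a check; `get_api_key` is a hypothetical helper and is not part of the repository.

```python
import os


def get_api_key(env=os.environ):
    """Return the OpenAI API key from the environment, or raise a clear error.

    Hypothetical helper for illustration; the app may read the key differently.
    """
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it before starting the app."
        )
    return key
```

Calling `get_api_key()` before launching the UI would turn a missing key into an immediate, readable error.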

Local Model Approach

  1. Clone the repository:

    git clone https://github.com/Mattjesc/MLOps-Framework-DDS-SS-LLM.git
    cd MLOps-Framework-DDS-SS-LLM
  2. Install the required packages:

    pip install -r requirements_local.txt

    Note: Future releases of the listed packages may not be compatible with this project; adjust the versions in requirements_local.txt if installation or startup fails.

Usage

API-based Approach

  1. Run the Streamlit app:

    streamlit run app_API.py
  2. Open your web browser and navigate to the URL displayed in the terminal.

Local Model Approach

  1. Run the Streamlit app:

    streamlit run app_local.py
  2. Open your web browser and navigate to the URL displayed in the terminal.

Customization

  • UI Framework: This project includes a simple Streamlit UI as an example. You are free to customize the UI or use any other framework that suits your needs.

Configuration

  • Dataset Mapping: Modify the DATASET_MAPPING dictionary in app_API.py or app_local.py to include your dataset paths and keywords.
  • Model Configuration:
    • Local Model Approach: Choose a model from Hugging Face's model hub and update the load_model_and_tokenizer function in app_local.py accordingly.
    • API-based Approach: While this example uses the OpenAI API, you can modify the run_rag_pipeline function in app_API.py to use any other API provider of your choice.
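As an illustration of the dataset mapping described above, the sketch below assumes that DATASET_MAPPING maps a lowercase keyword to a dataset file path. The keywords, the paths, and the `find_missing_datasets` helper are hypothetical examples, not values taken from the repository.

```python
from pathlib import Path

# Hypothetical entries: keyword -> dataset path. Keys are lowercase because
# the query is lowercased before it is compared against them.
DATASET_MAPPING = {
    "finance": "data/finance.csv",
    "health": "data/health.csv",
}


def find_missing_datasets(mapping):
    """Return the keywords whose dataset file does not exist on disk."""
    return [
        keyword
        for keyword, path in mapping.items()
        if not Path(path).is_file()
    ]
```

Running `find_missing_datasets(DATASET_MAPPING)` after editing the mapping is one way to catch a mistyped path before a query triggers it.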

Architecture and Workflow

Keyword Detection

The framework uses simple keyword matching to identify relevant datasets. When a user submits a query, the system converts it to lowercase and checks it against the keys of the DATASET_MAPPING dictionary. If a key matches, the corresponding dataset is loaded and used for semantic search.
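The matching step can be sketched as follows. This version assumes substring matching and returns the first key found in the query; the actual matching rule in app_API.py and app_local.py may differ, and `select_dataset` is a hypothetical name.

```python
def select_dataset(query, mapping):
    """Return the dataset path for the first mapping key found in the query.

    Assumes substring matching on the lowercased query; returns None when
    no key matches.
    """
    lowered = query.lower()
    for keyword, path in mapping.items():
        if keyword in lowered:
            return path
    return None


# Hypothetical mapping used only for this example.
mapping = {"finance": "data/finance.csv", "health": "data/health.csv"}
print(select_dataset("What are the latest FINANCE trends?", mapping))
# prints "data/finance.csv"
```

Substring matching is simple but coarse: a key such as "art" would also match "start", so whole-word matching may be preferable for short keywords.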

Semantic Search

Semantic search is performed using a pre-trained model from the Hugging Face Transformers library. The query and dataset entries are converted into embeddings, and cosine similarity is used to find the most relevant documents. The top results are then used to augment the LLM response.
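The ranking step can be illustrated without a model. The sketch below takes embeddings as plain lists of floats and ranks documents by cosine similarity; in the real application the vectors would come from the pre-trained model, and the function names here are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_embedding, doc_embeddings, k=3):
    """Return (index, score) pairs for the k documents most similar to the query."""
    scored = [
        (cosine_similarity(query_embedding, emb), idx)
        for idx, emb in enumerate(doc_embeddings)
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(idx, score) for score, idx in scored[:k]]
```

For larger datasets a vectorized implementation (for example with NumPy or PyTorch) would be faster, but the ranking logic stays the same.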

LLM Augmentation

For the API-based approach, the augmented prompt is sent to the OpenAI API, which returns a response generated by the LLM. For the local model approach, the augmented prompt is processed by a locally hosted model from Hugging Face, generating a response based on the augmented context.
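Both approaches send the model an augmented prompt built from the retrieved documents and the user query. The sketch below shows one way to assemble such a prompt; the template wording and the `build_augmented_prompt` name are assumptions, not the repository's actual prompt.

```python
def build_augmented_prompt(query, documents):
    """Combine retrieved documents and the user query into a single prompt.

    The template is a hypothetical example; adjust it to suit the model used.
    """
    context = "\n".join(f"- {doc}" for doc in documents)
    return (
        "Use the context below to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
```

The resulting string can be passed to either backend, which keeps prompt construction independent of whether the response comes from an API or a local model.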
