This project provides an MLOps framework for dynamically selecting datasets to augment Large Language Models (LLMs) using semantic search. The framework supports both API-based and local model approaches, allowing flexible deployment in cloud, hybrid, or local environments.
- Dynamic Dataset Selection: Automatically selects relevant datasets based on user queries using semantic search.
- Semantic Search Integration: Enhances LLM responses with external data retrieved using semantic search techniques.
- MLOps Practices: Incorporates automation, monitoring, and reproducibility for efficient model management.
- Flexible Deployment: Supports cloud, hybrid, and local architectures.
Before you begin, ensure you have met the following requirements:
- Python 3.7+
- API-based Approach:
  - OpenAI API Key
- Local Model Approach:
  - CUDA-enabled GPU (recommended for performance)
  - PyTorch with CUDA support
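If you plan to use the local model approach, you can verify that your PyTorch build can see a CUDA GPU before going further:

```python
import torch

print(torch.__version__)
print(torch.cuda.is_available())  # True means PyTorch can use your CUDA GPU
```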
- Clone the repository:

  ```bash
  git clone https://github.com/yourusername/your-repo.git
  cd your-repo
  ```
- Install the required packages:

  ```bash
  pip install -r requirements_API.txt
  ```

  Note: you may need to adjust the dependencies, as future package versions might not be compatible.
- Set your OpenAI API key:

  ```bash
  export OPENAI_API_KEY=your_api_key_here
  ```
- Clone the repository:

  ```bash
  git clone https://github.com/yourusername/your-repo.git
  cd your-repo
  ```
- Install the required packages:

  ```bash
  pip install -r requirements_local.txt
  ```

  Note: you may need to adjust the dependencies, as future package versions might not be compatible.
- Run the Streamlit app:

  ```bash
  streamlit run app_API.py
  ```
- Open your web browser and navigate to the URL displayed in the terminal (by default, http://localhost:8501).
- Run the Streamlit app:

  ```bash
  streamlit run app_local.py
  ```
- Open your web browser and navigate to the URL displayed in the terminal.
- UI Framework: This project includes a simple Streamlit UI as an example. You are free to customize the UI or use any other framework that suits your needs.
- Dataset Mapping: Modify the `DATASET_MAPPING` dictionary in `app_API.py` or `app_local.py` to include your dataset paths and keywords (see the sketch after this list).
- Model Configuration:
  - Local Model Approach: Choose a model from Hugging Face's model hub and update the `load_model_and_tokenizer` function in `app_local.py` accordingly.
  - API-based Approach: While this example uses the OpenAI API, you can modify the `run_rag_pipeline` function in `app_API.py` to use any other API provider of your choice.
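As a concrete illustration, a `DATASET_MAPPING` of the following shape maps query keywords to dataset paths. The keywords and paths below are hypothetical placeholders, not files shipped with this repository:

```python
# Hypothetical example -- replace the keywords and paths with your own datasets.
DATASET_MAPPING = {
    "finance": "datasets/finance_news.csv",
    "medical": "datasets/medical_abstracts.csv",
    "climate": "datasets/climate_reports.csv",
}
```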
The framework uses a simple keyword-based detection mechanism to identify relevant datasets. When a user query is submitted, the system converts the query to lowercase and checks it against the keys in the `DATASET_MAPPING` dictionary. If a keyword from the query matches a key in the dictionary, the corresponding dataset is loaded and used for semantic search.
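In code, that lookup might look like the following minimal sketch (the function name `detect_dataset` is illustrative, not taken from the repository):

```python
from typing import Optional

def detect_dataset(query: str, mapping: dict) -> Optional[str]:
    """Return the path of the first dataset whose keyword appears in the query."""
    query_lower = query.lower()
    for keyword, dataset_path in mapping.items():
        if keyword in query_lower:
            return dataset_path
    return None  # no keyword matched; fall back to the plain LLM response
```

Given the hypothetical mapping above, a query such as "latest medical findings" would return `datasets/medical_abstracts.csv`.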
Semantic search is performed using a pre-trained model from the Hugging Face Transformers library. The query and dataset entries are converted into embeddings, and cosine similarity is used to find the most relevant documents. The top results are then used to augment the LLM response.
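The retrieval step can be sketched as below. This uses the `sentence-transformers` wrapper around Hugging Face models for brevity, with `all-MiniLM-L6-v2` as an assumed checkpoint; the actual app may load a different model and apply its own pooling:

```python
from sentence_transformers import SentenceTransformer, util

# Assumed checkpoint; substitute the embedding model your app is configured with.
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def top_k_documents(query, documents, k=3):
    """Rank dataset entries by cosine similarity to the query embedding."""
    query_emb = embedder.encode(query, convert_to_tensor=True)
    doc_embs = embedder.encode(documents, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, doc_embs)[0]  # one similarity score per document
    top_indices = scores.argsort(descending=True)[:k]
    return [documents[int(i)] for i in top_indices]
```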
For the API-based approach, the augmented prompt is sent to the OpenAI API, which returns a response generated by the LLM. For the local model approach, the augmented prompt is processed by a locally hosted model from Hugging Face, generating a response based on the augmented context.
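The two generation paths might look like the following sketch. The model names (`gpt-3.5-turbo`, `gpt2`) are placeholders, and `augmented_prompt` stands for the user query concatenated with the retrieved documents:

```python
import torch
from openai import OpenAI
from transformers import AutoModelForCausalLM, AutoTokenizer

def generate_api(augmented_prompt: str) -> str:
    """API-based path: send the augmented prompt to the OpenAI API."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # placeholder; substitute your preferred model
        messages=[{"role": "user", "content": augmented_prompt}],
    )
    return response.choices[0].message.content

def generate_local(augmented_prompt: str, model_name: str = "gpt2") -> str:
    """Local path: run a Hugging Face causal LM on the augmented prompt."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)
    inputs = tokenizer(augmented_prompt, return_tensors="pt").to(device)
    outputs = model.generate(**inputs, max_new_tokens=200)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```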