Stars
Disaggregated serving system for Large Language Models (LLMs).
SGLang is a fast serving framework for large language models and vision language models.
Implementing a ChatGPT-like LLM in PyTorch from scratch, step by step
Video+code lecture on building nanoGPT from scratch
DashInfer is a native LLM inference engine aiming to deliver industry-leading performance atop various hardware architectures, including x86 and ARMv9.
Deep learning inference nodes for ROS / ROS2 with support for NVIDIA Jetson and TensorRT
Agent framework and applications built upon Qwen>=2.0, featuring Function Calling, Code Interpreter, RAG, and Chrome extension.
haileyschoelkopf / vllm
Forked from vllm-project/vllm. A high-throughput and memory-efficient inference and serving engine for LLMs.
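As a usage note for the vLLM entry above: the engine exposes a simple offline-inference Python API in the style of the upstream quickstart. A minimal sketch (the model name and sampling values are illustrative placeholders):

```python
# Minimal offline-inference sketch in the style of vLLM's quickstart.
# Model name and sampling values are illustrative, not prescribed by this list.
from vllm import LLM, SamplingParams

prompts = ["The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=32)

llm = LLM(model="facebook/opt-125m")            # any Hugging Face-compatible causal LM
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```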
Adlik / smoothquantplus
Forked from mit-han-lab/smoothquant. [ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models.
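For the SmoothQuant entry above, the core idea is to migrate activation outliers into the weights via a per-channel smoothing scale before quantization, leaving the layer's output unchanged. A minimal NumPy sketch of that rescaling step (alpha and tensor shapes are illustrative, not the repository's code):

```python
import numpy as np

def smooth_scales(act_absmax, weight, alpha=0.5):
    """Per-input-channel smoothing scales: s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)."""
    w_absmax = np.abs(weight).max(axis=0)                 # weight is [out, in]
    s = act_absmax ** alpha / (w_absmax ** (1.0 - alpha) + 1e-8)
    return np.clip(s, 1e-5, None)

# Toy check: X @ W.T is numerically unchanged after X / s and W * s,
# but the rescaled activations are much easier to quantize.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8)) * np.array([1, 1, 50, 1, 1, 1, 1, 1])   # channel 2 has outliers
W = rng.normal(size=(16, 8))
act_absmax = np.abs(X).max(axis=0)

s = smooth_scales(act_absmax, W)
X_smooth, W_smooth = X / s, W * s
assert np.allclose(X @ W.T, X_smooth @ W_smooth.T)
```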
TinyChatEngine: On-Device LLM Inference Library
Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with …
[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.
📖 A curated list of awesome LLM inference papers with code: TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, continuous batching, FlashAttention, PagedAttention, etc.
Fast inference from large language models via speculative decoding.
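For the speculative decoding entry above, the basic loop is: a small draft model cheaply proposes a few tokens, and the large target model verifies them, keeping the longest agreeing prefix. A toy greedy-verification sketch with stand-in model functions (target_next and draft_next are hypothetical placeholders, not the repository's API; real systems verify all k drafts in one target forward pass and use probabilistic acceptance for sampling):

```python
def speculative_generate(prompt, target_next, draft_next, n_new=32, k=4):
    """Toy greedy speculative decoding.

    target_next(tokens) / draft_next(tokens) each return the next token id under
    greedy decoding for their respective model.
    """
    tokens = list(prompt)
    while len(tokens) < len(prompt) + n_new:
        # 1) Draft k tokens cheaply with the small model.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2) Verify drafts against the target model; keep the agreeing prefix.
        accepted, correction = [], None
        for i in range(k):
            t = target_next(tokens + accepted)
            if t == draft[i]:
                accepted.append(t)
            else:
                correction = t            # target's own token replaces the first mismatch
                break
        tokens.extend(accepted)
        if correction is not None:
            tokens.append(correction)
    return tokens

# Tiny demo: both "models" just count upward, so every draft is accepted.
nxt = lambda toks: toks[-1] + 1
print(speculative_generate([0], target_next=nxt, draft_next=nxt, n_new=8, k=4))
```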
An annotated implementation of the Transformer paper.
Official Implementation of EAGLE-1 (ICML'24) and EAGLE-2 (EMNLP'24)
REST: Retrieval-Based Speculative Decoding, NAACL 2024
Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads
Transformer: PyTorch Implementation of "Attention Is All You Need"
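Since several of the entries above are Transformer implementations, here is a minimal PyTorch sketch of the scaled dot-product attention at their core (shapes and masking follow the paper; this is not any particular repository's code):

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, as in "Attention Is All You Need"."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)      # (..., seq_q, seq_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v, weights

# Toy check: batch of 2, 5 tokens, head dim 16, with a causal mask.
q = k = v = torch.randn(2, 5, 16)
causal = torch.tril(torch.ones(5, 5))
out, attn = scaled_dot_product_attention(q, k, v, mask=causal)
print(out.shape, attn.shape)    # torch.Size([2, 5, 16]) torch.Size([2, 5, 5])
```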
20+ high-performance LLMs with recipes to pretrain, finetune and deploy at scale.
Implementation of the LLaMA language model based on nanoGPT. Supports flash attention, Int8 and GPTQ 4bit quantization, LoRA and LLaMA-Adapter fine-tuning, pre-training. Apache 2.0-licensed.
The simplest, fastest repository for training/finetuning medium-sized GPTs.
This repository contains demos I made with the Transformers library by HuggingFace.