Stars
FlashInfer: Kernel Library for LLM Serving
MSCCL++: A GPU-driven communication stack for scalable AI applications
A throughput-oriented high-performance serving framework for LLMs
🚴 Call stack profiler for Python. Shows you why your code is slow!
A large-scale simulation framework for LLM inference
Machnet provides applications such as databases and financial services with an easy way to access low-latency DPDK-based messaging on public cloud VMs: 750K RPS on Azure at 61 µs P99.9.
A tensor-aware point-to-point communication primitive for machine learning
MII makes low-latency and high-throughput inference possible, powered by DeepSpeed.
Alpaca dataset from Stanford, cleaned and curated
A ChatGPT(GPT-3.5) & GPT-4 Workload Trace to Optimize LLM Serving Systems
FlexFlow Serve: Low-Latency, High-Performance LLM Serving
S-LoRA: Serving Thousands of Concurrent LoRA Adapters
[ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently.
Free downloads of English-language magazines including The Economist (with audio), The New Yorker, The Guardian, Wired, and The Atlantic; epub, mobi, and pdf formats supported; updated weekly.
libcubwt is a library for GPU-accelerated suffix array and Burrows-Wheeler transform construction.
An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
depyf is a tool to help you understand and adapt to the PyTorch compiler, torch.compile.
Open-source software for volunteer computing and grid computing.
🌸 Run LLMs at home, BitTorrent-style. Fine-tuning and inference up to 10x faster than offloading