Stars
Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores via the WMMA API and MMA PTX instructions.
Optimized primitives for collective multi-GPU communication
ncnn is a high-performance neural network inference framework optimized for mobile platforms
Dynamic Memory Management for Serving LLMs without PagedAttention
NVIDIA Linux open GPU kernel module source
High-performance Transformer implementation in C++.
Paella: Low-latency Model Serving with Virtualized GPU Scheduling
FP16xINT4 LLM inference kernel that achieves near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.
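The core idea behind FP16xINT4 kernels is that weights are stored as 4-bit integers with a per-group FP16 scale and dequantized on the fly before the matmul. A minimal NumPy sketch of that scheme (illustrative only — not the actual kernel, and all function names here are assumptions):

```python
import numpy as np

def quantize_int4(w, group_size=32):
    """Symmetric 4-bit quantization with one FP16 scale per group of weights."""
    groups = w.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7.0  # int4 range is [-8, 7]
    q = np.clip(np.round(groups / scale), -8, 7).astype(np.int8)
    return q, scale.astype(np.float16)

def dequant_matmul(x, q, scale, w_shape):
    """Dequantize INT4 weights to FP16, then run a plain FP16 matmul."""
    w = (q.astype(np.float16) * scale).reshape(w_shape)
    return x @ w

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float16)
x = rng.standard_normal((4, 64)).astype(np.float16)
q, s = quantize_int4(w)
y = dequant_matmul(x, q, s, w.shape)
```

A real kernel fuses the dequantization into the matmul so the 4-bit weights are expanded in registers, which is what makes the memory-bound low-batch regime fast.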
BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment.
[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
A list of awesome papers on edge-AI inference.
An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
⚡LLM Zoo is a project that provides data, models, and evaluation benchmarks for large language models.⚡
Tutorials for creating and using ONNX models
Open standard for machine learning interoperability
Tensors and Dynamic neural networks in Python with strong GPU acceleration
Fast and memory-efficient exact attention
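For reference, the "exact attention" that FlashAttention computes is softmax(QKᵀ/√d)V; FlashAttention's contribution is computing it tile by tile without materializing the full n×n score matrix in slow memory. A naive single-head NumPy version, useful only as a correctness baseline (shapes and names are illustrative):

```python
import numpy as np

def naive_attention(q, k, v):
    """Exact (unfused) scaled dot-product attention: softmax(QK^T / sqrt(d)) V."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                 # (n, n) scores, O(n^2) memory
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max to stabilize softmax
    p = np.exp(scores)
    p /= p.sum(axis=-1, keepdims=True)            # rows sum to 1
    return p @ v                                  # (n, d) output

rng = np.random.default_rng(0)
n, d = 8, 16
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
out = naive_attention(q, k, v)
```

Each output row is a convex combination of the rows of `v`, which is a handy invariant when checking a fused kernel against this baseline.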
Inferflow is an efficient and highly configurable inference engine for large language models (LLMs).