Repositories of GitHub user hochen1

Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores via the WMMA API and MMA PTX instructions.

Cuda · 270 stars · 62 forks · Updated Sep 8, 2024
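As a hedged illustration of the operation such a kernel computes (not code from the repo above): tensor-core HGEMM takes FP16 inputs but accumulates in FP32, which this NumPy sketch emulates. The function name is hypothetical.

```python
import numpy as np

def hgemm_reference(a_fp16, b_fp16):
    """Reference semantics of tensor-core HGEMM (illustrative sketch).

    WMMA fragments multiply half-precision tiles but accumulate in
    float32, so emulating that here means upcasting to float32 before
    the matmul rather than multiplying in float16 end to end.
    """
    return a_fp16.astype(np.float32) @ b_fp16.astype(np.float32)

rng = np.random.default_rng(0)
a = rng.standard_normal((64, 128)).astype(np.float16)
b = rng.standard_normal((128, 32)).astype(np.float16)

c = hgemm_reference(a, b)  # FP32 result of shape (64, 32)
```

An actual WMMA kernel does the same math per 16x16x16 tile, with the FP32 accumulator held in registers.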

NCCL Tests

Cuda · 837 stars · 232 forks · Updated Jul 30, 2024

Optimized primitives for collective multi-GPU communication

C++ · 3,152 stars · 793 forks · Updated Sep 17, 2024
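A hedged pure-Python sketch of the ring all-reduce schedule that NCCL's allreduce collective is built on (a simulation of the communication pattern, not NCCL's implementation):

```python
def ring_allreduce(ranks):
    """Simulate ring all-reduce over P ranks (illustrative sketch).

    `ranks` is a list of P buffers, each pre-split into P chunks
    (a list of P lists of numbers).  Returns buffers after all-reduce:
    every rank holds the element-wise sum across all ranks.
    """
    p = len(ranks)
    bufs = [[list(c) for c in rank] for rank in ranks]

    # Phase 1: reduce-scatter.  In step s, rank r sends chunk (r - s) % p
    # to rank (r + 1) % p, which adds it in.  After p - 1 steps, rank r
    # owns the fully reduced chunk (r + 1) % p.
    for s in range(p - 1):
        sends = [(r, (r - s) % p, bufs[r][(r - s) % p]) for r in range(p)]
        for r, ci, chunk in sends:
            dst = (r + 1) % p
            bufs[dst][ci] = [x + y for x, y in zip(bufs[dst][ci], chunk)]

    # Phase 2: all-gather.  The reduced chunks circulate around the ring,
    # overwriting instead of adding, until every rank has every chunk.
    for s in range(p - 1):
        sends = [(r, (r + 1 - s) % p, bufs[r][(r + 1 - s) % p]) for r in range(p)]
        for r, ci, chunk in sends:
            bufs[(r + 1) % p][ci] = list(chunk)

    return bufs
```

Each rank sends and receives only 2·(P-1)/P of the data, which is why the ring schedule is bandwidth-optimal for large buffers.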

ncnn is a high-performance neural network inference framework optimized for mobile platforms

C++ · 20,204 stars · 4,149 forks · Updated Sep 25, 2024

Dynamic Memory Management for Serving LLMs without PagedAttention

C · 192 stars · 13 forks · Updated Sep 24, 2024

NVIDIA Linux open GPU kernel module source

C · 15,063 stars · 1,254 forks · Updated Sep 26, 2024

DietCode Code Release

Cuda · 59 stars · 9 forks · Updated Jul 21, 2022

High-performance Transformer implementation in C++.

C++ · 69 stars · 7 forks · Updated Sep 14, 2024

Paella: Low-latency Model Serving with Virtualized GPU Scheduling

C++ · 55 stars · 5 forks · Updated May 1, 2024

FP16xINT4 LLM inference kernel achieving near-ideal ~4x speedups at batch sizes up to 16-32 tokens.

Python · 573 stars · 45 forks · Updated Sep 4, 2024
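As a hedged sketch of what an FP16xINT4 kernel fuses (an illustration of the math, not the repo's kernel; all function names are hypothetical): weights are stored two 4-bit values per byte and dequantized as (q - zero) * scale before the FP16 matmul.

```python
import numpy as np

def pack_int4(w_q):
    """Pack pairs of unsigned 4-bit values (0..15) along the last axis."""
    w_q = w_q.astype(np.uint8)
    return w_q[..., 0::2] | (w_q[..., 1::2] << 4)

def unpack_int4(packed):
    """Inverse of pack_int4: recover the 4-bit values from each byte."""
    out = np.empty(packed.shape[:-1] + (packed.shape[-1] * 2,), dtype=np.uint8)
    out[..., 0::2] = packed & 0x0F
    out[..., 1::2] = packed >> 4
    return out

def int4_fp16_matmul(x_fp16, packed_w, scales, zeros):
    """Dequantize-then-multiply, the operation an FP16xINT4 kernel fuses.

    w = (q - zero) * scale per output column; a real kernel dequantizes
    tiles in registers instead of materializing the dense weight.
    """
    q = unpack_int4(packed_w).astype(np.float32)
    w = (q - zeros) * scales
    return x_fp16.astype(np.float32) @ w
```

Storing weights in 4 bits quarters memory traffic, which is why such kernels speed up the memory-bound small-batch regime.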

BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment.

Python · 360 stars · 29 forks · Updated Sep 29, 2024

[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Python · 2,375 stars · 184 forks · Updated Jul 16, 2024
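A hedged, heavily simplified sketch of AWQ's core idea (an assumption-laden illustration, not the repo's algorithm): weight channels whose activations are large are scaled up before round-to-nearest quantization, with the inverse scale folded back, so quantization error shrinks on the salient channels.

```python
import numpy as np

def quantize_rtn(w, n_bits=4):
    """Per-output-column symmetric round-to-nearest quantization."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(w).max(axis=0) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero
    return np.round(w / scale) * scale

def awq_style_quantize(w, act_scale, alpha=0.5):
    """Activation-aware scaling before quantization (simplified sketch).

    `act_scale` holds the average activation magnitude per input
    channel; s = act_scale**alpha scales salient rows of w up before
    quantization and is divided back out afterward, so the result is a
    mathematically equivalent weight with lower error where it matters.
    """
    s = act_scale ** alpha            # per input channel (row of w)
    w_q = quantize_rtn(w * s[:, None])
    return w_q / s[:, None]
```

At runtime the real method fuses the 1/s factor into the preceding operator instead of dividing the weight; alpha is searched per layer in the actual paper.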

My study notes for MLSys

Jupyter Notebook · 12 stars · 1 fork · Updated Sep 21, 2024

Tuned OpenCL BLAS

C++ · 1,046 stars · 205 forks · Updated Jun 13, 2024

Tile primitives for speedy kernels

Cuda · 1,506 stars · 58 forks · Updated Sep 29, 2024

LLM training in simple, raw C/CUDA

Cuda · 23,597 stars · 2,639 forks · Updated Sep 27, 2024

A multi-level tensor algebra superoptimizer

C++ · 319 stars · 18 forks · Updated Sep 28, 2024

Python · 5 stars · 1 fork · Updated Mar 8, 2024

learning how CUDA works

Cuda · 153 stars · 19 forks · Updated Aug 16, 2024

C++ · 11 stars · Updated Jun 4, 2022

This is a list of awesome papers on edge-AI inference.

85 stars · 9 forks · Updated Dec 21, 2023

Serving multiple LoRA-finetuned LLMs as one

Python · 959 stars · 45 forks · Updated May 8, 2024
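A hedged sketch of what multi-LoRA serving batches over (an illustration of the LoRA decomposition, not the repo's implementation; the function name is hypothetical): every request shares one dense base weight W, and each row of the batch adds its own low-rank delta A_i @ B_i.

```python
import numpy as np

def lora_forward(x, w, adapters, adapter_ids):
    """Batched LoRA forward: y_i = x_i @ W + x_i @ A_i @ B_i (sketch).

    `adapters` maps an adapter id to its low-rank pair (A, B); each row
    of `x` may use a different adapter.  Serving systems batch the
    shared x @ W and compute the small per-request deltas separately.
    """
    out = (x @ w).copy()                  # shared dense compute
    for i, aid in enumerate(adapter_ids):
        a, b = adapters[aid]              # shapes (d, r), (r, n)
        out[i] += x[i] @ a @ b            # per-request low-rank delta
    return out
```

Because rank r is tiny compared to d and n, the per-adapter work is cheap, so many adapters can share one GPU-resident base model.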

An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.

Python · 36,553 stars · 4,509 forks · Updated Sep 25, 2024

⚡LLM Zoo is a project that provides data, models, and evaluation benchmarks for large language models.⚡

Python · 2,926 stars · 200 forks · Updated Nov 26, 2023

Tutorials for creating and using ONNX models

Jupyter Notebook · 3,347 stars · 627 forks · Updated Jul 15, 2024

Open standard for machine learning interoperability

Python · 17,701 stars · 3,651 forks · Updated Sep 29, 2024

CUDA Templates for Linear Algebra Subroutines

C++ · 5,434 stars · 918 forks · Updated Sep 25, 2024

Tensors and Dynamic neural networks in Python with strong GPU acceleration

Python · 82,541 stars · 22,214 forks · Updated Sep 29, 2024

Fast and memory-efficient exact attention

Python · 13,583 stars · 1,244 forks · Updated Sep 28, 2024
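A hedged NumPy sketch of the tiled online-softmax recurrence behind exact memory-efficient attention (an illustration of the algorithm, not the repo's CUDA code): K/V are processed in blocks while only a running max, a running normalizer, and an output accumulator are kept per query row, so the full score matrix is never materialized.

```python
import numpy as np

def naive_attention(q, k, v):
    """Standard softmax attention: materializes the full score matrix."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    return p @ v

def tiled_attention(q, k, v, block=4):
    """Tiled attention with the online-softmax recurrence (sketch).

    m: running row max, l: running softmax normalizer, o: unnormalized
    output.  When a new block raises the max, the old partial sums are
    rescaled by exp(m_old - m_new) before the block is folded in.
    """
    n, d = q.shape
    o = np.zeros((n, v.shape[-1]))
    m = np.full(n, -np.inf)
    l = np.zeros(n)
    for j in range(0, k.shape[0], block):
        s = q @ k[j:j + block].T / np.sqrt(d)   # scores for this K/V block
        m_new = np.maximum(m, s.max(axis=-1))
        p = np.exp(s - m_new[:, None])
        corr = np.exp(m - m_new)                # rescale old partial sums
        l = l * corr + p.sum(axis=-1)
        o = o * corr[:, None] + p @ v[j:j + block]
        m = m_new
    return o / l[:, None]
```

The real kernel additionally keeps each tile in on-chip SRAM and fuses the rescaling into the same pass, which is where the memory-bandwidth savings come from.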

Inferflow is an efficient and highly configurable inference engine for large language models (LLMs).

C++ · 235 stars · 24 forks · Updated Mar 15, 2024