Stars
Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores via the WMMA API and MMA PTX instructions.
Optimized primitives for collective multi-GPU communication
ncnn is a high-performance neural network inference framework optimized for mobile platforms
Dynamic Memory Management for Serving LLMs without PagedAttention
NVIDIA Linux open GPU kernel module source
High-performance Transformer implementation in C++.
Paella: Low-latency Model Serving with Virtualized GPU Scheduling
FP16xINT4 LLM inference kernel that achieves near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.
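The core idea behind FP16xINT4 kernels is that weights are stored as 4-bit integers with a per-group FP16 scale and dequantized on the fly before the matmul. A minimal NumPy sketch of that scheme (illustrative only — not the actual kernel, and all function names here are assumptions):

```python
import numpy as np

def quantize_int4(w, group_size=32):
    """Symmetric 4-bit quantization with one FP16 scale per group of weights."""
    groups = w.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7.0  # int4 range is [-8, 7]
    q = np.clip(np.round(groups / scale), -8, 7).astype(np.int8)
    return q, scale.astype(np.float16)

def dequant_matmul(x, q, scale, w_shape):
    """Dequantize INT4 weights to FP16, then run a plain FP16 matmul."""
    w = (q.astype(np.float16) * scale).reshape(w_shape)
    return x @ w

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float16)
x = rng.standard_normal((4, 64)).astype(np.float16)
q, s = quantize_int4(w)
y = dequant_matmul(x, q, s, w.shape)
```

A real kernel fuses the dequantization into the matmul so the 4-bit weights are expanded in registers, which is what makes the memory-bound low-batch regime fast.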
BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment.
[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
A list of awesome papers on edge-AI inference.
An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
⚡LLM Zoo is a project that provides data, models, and evaluation benchmarks for large language models.⚡
Tutorials for creating and using ONNX models
Open standard for machine learning interoperability
Tensors and Dynamic neural networks in Python with strong GPU acceleration
Fast and memory-efficient exact attention
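For reference, the "exact attention" that FlashAttention computes is softmax(QKᵀ/√d)V; FlashAttention's contribution is computing it tile by tile without materializing the full n×n score matrix in slow memory. A naive single-head NumPy version, useful only as a correctness baseline (shapes and names are illustrative):

```python
import numpy as np

def naive_attention(q, k, v):
    """Exact (unfused) scaled dot-product attention: softmax(QK^T / sqrt(d)) V."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                 # (n, n) scores, O(n^2) memory
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max to stabilize softmax
    p = np.exp(scores)
    p /= p.sum(axis=-1, keepdims=True)            # rows sum to 1
    return p @ v                                  # (n, d) output

rng = np.random.default_rng(0)
n, d = 8, 16
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
out = naive_attention(q, k, v)
```

Each output row is a convex combination of the rows of `v`, which is a handy invariant when checking a fused kernel against this baseline.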
Inferflow is an efficient and highly configurable inference engine for large language models (LLMs).