
Disaggregated serving system for Large Language Models (LLMs).

Jupyter Notebook 288 28 Updated Aug 19, 2024

SGLang is a fast serving framework for large language models and vision language models.

Python 5,302 380 Updated Sep 25, 2024
Python 104 6 Updated Jun 12, 2024

Implementing a ChatGPT-like LLM in PyTorch from scratch, step by step

Jupyter Notebook 27,727 3,133 Updated Sep 26, 2024

Video+code lecture on building nanoGPT from scratch

Python 3,427 473 Updated Aug 13, 2024

DashInfer is a native LLM inference engine aiming to deliver industry-leading performance atop various hardware architectures, including x86 and ARMv9.

C++ 132 14 Updated Aug 27, 2024

Deep learning inference nodes for ROS / ROS2 with support for NVIDIA Jetson and TensorRT

C++ 886 258 Updated Jul 13, 2024

Agent framework and applications built upon Qwen>=2.0, featuring Function Calling, Code Interpreter, RAG, and Chrome extension.

Python 3,199 312 Updated Sep 25, 2024

LLM training in simple, raw C/CUDA

Cuda 23,534 2,635 Updated Sep 26, 2024

A high-throughput and memory-efficient inference and serving engine for LLMs

Python 3 Updated Mar 5, 2024

[ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

Python 19 Updated Mar 15, 2024
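A minimal NumPy sketch of the core SmoothQuant idea, per-channel scale migration that moves quantization difficulty from activations to weights; the variable names and alpha=0.5 are illustrative, not the repo's actual API.

```python
# Sketch of SmoothQuant-style scale migration (illustrative, not the repo's API).
import numpy as np

def smooth_scales(X, W, alpha=0.5):
    # s_j = max|X[:, j]|^alpha / max|W[j, :]|^(1 - alpha), per input channel j
    act_max = np.abs(X).max(axis=0)        # (in_features,)
    w_max = np.abs(W).max(axis=1)          # (in_features,) for W of shape (in, out)
    return (act_max ** alpha) / (w_max ** (1 - alpha) + 1e-8)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8)) * np.array([1, 1, 1, 1, 1, 1, 1, 50.0])  # one outlier channel
W = rng.normal(size=(8, 16))

s = smooth_scales(X, W)
X_smooth, W_smooth = X / s, W * s[:, None]     # X @ W == X_smooth @ W_smooth exactly
assert np.allclose(X @ W, X_smooth @ W_smooth)
```

After smoothing, the activation outlier channel is tamed while the product is mathematically unchanged, which is what makes subsequent INT8 quantization of both operands tractable.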

TinyChatEngine: On-Device LLM Inference Library

C++ 708 68 Updated Jul 4, 2024

Design pattern demo code

C++ 1,053 269 Updated Apr 17, 2024

This project aims to share the technical principles behind large language models along with hands-on practical experience.

HTML 9,332 914 Updated Sep 22, 2024

Inference Llama 2 in one file of pure C

C 17,215 2,053 Updated Aug 6, 2024

Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with …

Python 4,917 391 Updated Sep 24, 2024
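To illustrate the "change a single line of code" claim: an OpenAI-style client can be pointed at a locally served model by swapping the base URL, assuming the serving endpoint is OpenAI-compatible. The URL, port, and model name below are placeholders.

```python
# Hypothetical example: redirect an OpenAI-style client to a local serving endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:9997/v1", api_key="not-needed")  # the one changed line
resp = client.chat.completions.create(
    model="my-local-llm",  # placeholder model name
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```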

[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Python 2,363 182 Updated Jul 16, 2024
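A rough sketch of group-wise 4-bit round-to-nearest weight quantization, the building block AWQ starts from; the activation-aware per-channel scale search that protects salient weights, and gives AWQ its name, is omitted. Names and the group size are illustrative.

```python
# Group-wise symmetric INT4 round-to-nearest quantization (simplified; no scale search).
import numpy as np

def quantize_int4_groupwise(W, group_size=128):
    out_features, in_features = W.shape
    Wg = W.reshape(out_features, in_features // group_size, group_size)
    scale = np.abs(Wg).max(axis=-1, keepdims=True) / 7.0   # symmetric int4 range [-8, 7]
    q = np.clip(np.round(Wg / scale), -8, 7)
    return (q * scale).reshape(out_features, in_features)  # dequantized view

W = np.random.default_rng(0).normal(size=(16, 256)).astype(np.float32)
W_q = quantize_int4_groupwise(W)
print("mean abs error:", np.abs(W - W_q).mean())
```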

An easy-to-use LLMs quantization package with user-friendly apis, based on GPTQ algorithm.

Python 4,356 465 Updated Aug 19, 2024

📖A curated list of Awesome LLM Inference Paper with codes, TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, Continuous Batching, FlashAttention, PagedAttention etc.

2,550 174 Updated Sep 26, 2024

Fast inference from large language models via speculative decoding

Python 521 51 Updated Aug 22, 2024
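A toy greedy draft-and-verify loop to illustrate the speculative decoding idea: a cheap draft model proposes several tokens, and the target model keeps the longest agreeing prefix plus one token of its own. The `draft_next` and `target_next` callables are hypothetical stand-ins for small/large model calls; real implementations use probabilistic acceptance rather than greedy matching.

```python
# Toy greedy speculative decoding step (conceptual sketch, not the repo's algorithm).
def speculative_step(prefix, draft_next, target_next, k=4):
    # 1) The draft model proposes k tokens autoregressively (cheap).
    ctx = list(prefix)
    draft = []
    for _ in range(k):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)

    # 2) Conceptually the target model verifies all k positions in one forward
    #    pass; this toy version calls it position by position for clarity.
    accepted = []
    ctx = list(prefix)
    for t in draft:
        t_star = target_next(ctx)
        if t_star != t:            # disagreement: take the target's token and stop
            accepted.append(t_star)
            break
        accepted.append(t)
        ctx.append(t)
    else:
        accepted.append(target_next(ctx))  # all drafts accepted: add one bonus token
    return accepted
```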

An annotated implementation of the Transformer paper.

Jupyter Notebook 5,604 1,212 Updated Apr 7, 2024

Official Implementation of EAGLE-1 (ICML'24) and EAGLE-2 (EMNLP'24)

Python 774 76 Updated Sep 21, 2024

REST: Retrieval-Based Speculative Decoding, NAACL 2024

C 163 10 Updated Sep 25, 2024

Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads

Jupyter Notebook 2,222 152 Updated Jun 25, 2024
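A minimal PyTorch sketch of the Medusa idea: extra lightweight heads on top of the base model's last hidden state, each predicting a token one step further ahead, so several candidate tokens can be proposed per forward pass. The plain `nn.Linear` heads and the layer sizes are simplifications of the actual architecture.

```python
# Simplified Medusa-style extra decoding heads (illustrative sizes and layers).
import torch
import torch.nn as nn

class MedusaHeads(nn.Module):
    def __init__(self, hidden_size=4096, vocab_size=32000, num_heads=4):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_size, vocab_size) for _ in range(num_heads)]
        )

    def forward(self, last_hidden):                # (batch, hidden_size)
        # head i predicts the token at offset i+1 beyond the base model's next token
        return [head(last_hidden) for head in self.heads]

heads = MedusaHeads(hidden_size=64, vocab_size=100, num_heads=3)
candidates = heads(torch.randn(2, 64))
print([logits.shape for logits in candidates])     # 3 x torch.Size([2, 100])
```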

Transformer: PyTorch Implementation of "Attention Is All You Need"

Python 2,837 423 Updated Aug 6, 2024
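The core formula from "Attention Is All You Need", softmax(QK^T / sqrt(d_k))V, fits in a few lines of PyTorch; this sketch omits masking and the multi-head split.

```python
# Scaled dot-product attention, the core of the Transformer (no masking, single head).
import math
import torch

def scaled_dot_product_attention(q, k, v):
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # (..., seq_q, seq_k)
    weights = torch.softmax(scores, dim=-1)
    return weights @ v                                 # (..., seq_q, d_v)

q = torch.randn(2, 5, 16)   # (batch, seq, d_k)
k = torch.randn(2, 5, 16)
v = torch.randn(2, 5, 16)
print(scaled_dot_product_attention(q, k, v).shape)     # torch.Size([2, 5, 16])
```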

20+ high-performance LLMs with recipes to pretrain, finetune and deploy at scale.

Python 9,994 996 Updated Sep 26, 2024

Implementation of the LLaMA language model based on nanoGPT. Supports flash attention, Int8 and GPTQ 4bit quantization, LoRA and LLaMA-Adapter fine-tuning, pre-training. Apache 2.0-licensed.

Python 5,967 518 Updated Sep 6, 2024

The simplest, fastest repository for training/finetuning medium-sized GPTs.

Python 36,392 5,711 Updated Aug 19, 2024

This repository contains demos I made with the Transformers library by HuggingFace.

Jupyter Notebook 9,119 1,414 Updated Aug 8, 2024