- NVIDIA
- Hangzhou, Zhejiang
- https://fanshiqing.github.io/
Stars
A PyTorch Extension: Tools for easy mixed precision and distributed training in PyTorch
Provides end-to-end model development pipelines for LLMs and Multimodal models that can be launched on-prem or cloud-native.
Code for the paper "HyperRouter: Towards Efficient Training and Inference of Sparse Mixture of Experts via HyperNetwork"
Visualize expert firing frequencies across sentences in the Mixtral MoE model
Zero Bubble Pipeline Parallelism
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
Source code examples from the Parallel Forall Blog
fanshiqing / grouped_gemm
Forked from tgale96/grouped_gemm. PyTorch bindings for CUTLASS grouped GEMM.
High Performance Grouped GEMM in PyTorch
RDMA and SHARP plugins for nccl library
Flexible and powerful tensor operations for readable and reliable code (for pytorch, jax, TF and others)
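As a rough illustration of why einops is listed here for readable tensor operations, a minimal sketch (the shapes and pattern strings below are arbitrary examples, not taken from any repository above):

```python
import torch
from einops import rearrange, reduce

x = torch.randn(2, 3, 32, 32)  # batch, channels, height, width

# Split each image into 8x8 patches and flatten each patch into a vector.
patches = rearrange(x, 'b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1=8, p2=8)

# Global average pooling expressed as a named reduction.
means = reduce(x, 'b c h w -> b c', 'mean')

print(patches.shape, means.shape)  # torch.Size([2, 16, 192]) torch.Size([2, 3])
```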
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization.
Fast and memory-efficient exact attention
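A minimal usage sketch, assuming the flash-attn package is installed with CUDA support and exposes the flash_attn_func entry point; tensor sizes are illustrative only:

```python
import torch
from flash_attn import flash_attn_func

# q, k, v: (batch, seqlen, num_heads, head_dim) in fp16/bf16 on the GPU
q = torch.randn(2, 1024, 16, 64, dtype=torch.float16, device='cuda')
k = torch.randn(2, 1024, 16, 64, dtype=torch.float16, device='cuda')
v = torch.randn(2, 1024, 16, 64, dtype=torch.float16, device='cuda')

# Fused, memory-efficient exact attention with a causal mask.
out = flash_attn_func(q, k, v, causal=True)
print(out.shape)  # torch.Size([2, 1024, 16, 64])
```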
Monitor Memory usage of Python code
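A sketch of line-by-line memory profiling with the memory_profiler package (assumes `pip install memory_profiler`; the function and allocation sizes are hypothetical):

```python
from memory_profiler import profile

@profile
def build_lists():
    a = [0] * (10 ** 6)       # roughly 8 MB of integers
    b = [0] * (2 * 10 ** 7)   # roughly 160 MB
    del b                     # the release shows up as a negative increment in the report
    return a

if __name__ == '__main__':
    # Run as: python -m memory_profiler this_script.py
    build_lists()
```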
A latent text-to-image diffusion model
A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
Development repository for the Triton language and compiler
Implementation of the specific Transformer architecture from PaLM - Scaling Language Modeling with Pathways
This repository contains the results and code for the MLPerf™ Training v2.0 benchmark.
Example models using DeepSpeed
🧑🏫 60+ Implementations/tutorials of deep learning papers with side-by-side notes 📝; including transformers (original, xl, switch, feedback, vit, ...), optimizers (adam, adabelief, sophia, ...), ga…
Tutel MoE: An Optimized Mixture-of-Experts Implementation
PKU-DAIR / Hetu
Forked from Hsword/Hetu. A high-performance distributed deep learning system targeting large-scale and automated distributed training.