This folder contains two kinds of tutorials:
- end-to-end examples of how to use the library
- guides to onboard team members and contributors to this project
- XNLI classification: classification with / without optimizations (RoBERTa + XNLI classification task); a minimal baseline is sketched after this list
- text generation: text generation with / without optimizations (T5)
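To give a flavor of what the end-to-end examples cover, below is a minimal, *unoptimized* baseline sketch using the Hugging Face `transformers` API. The checkpoint name is an illustrative assumption, and the library's own optimization step is intentionally not shown here.

```python
# Hypothetical unoptimized baseline for XNLI classification with a RoBERTa
# family model; this sketch only shows the plain transformers path, not the
# library's optimizations.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "joeddav/xlm-roberta-large-xnli"  # illustrative XNLI checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name).eval()

premise = "The cat sat on the mat."
hypothesis = "An animal is resting."
inputs = tokenizer(premise, hypothesis, return_tensors="pt")

with torch.inference_mode():
    logits = model(**inputs).logits  # shape: (1, num_labels)

print(logits.softmax(dim=-1))  # probabilities over entailment classes
```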
The tutorials below show you how to implement a GPU kernel.
They require basic knowledge of how a GPU works, in particular its memory hierarchy.
If you are not familiar with that, check this article first.
To ease learning, the tutorials below are written in PyTorch in the style of Triton (rewriting them in Triton is trivial).
- tiled matmul: matrix multiplication implemented in CUDA style (sketched after this list)
- online softmax: parallelized softmax computation, a key ingredient of Flash Attention (also sketched after this list)
- Flash Attention: attention computation without saving the attention matrix to global memory
- matmul offsets: detailed explanations of a performance trick used in the Triton matmul tutorial
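As a taste of the tiled matmul tutorial, here is a minimal PyTorch sketch of the blocking idea: each output tile is computed independently (one tile per CUDA/Triton program instance) by accumulating over tiles of the shared dimension. The block sizes and divisibility assumption are simplifications; the tutorial covers the general case.

```python
# Minimal sketch of tiled (blocked) matrix multiplication in PyTorch,
# mimicking what each GPU program instance computes: one output tile,
# accumulated over tiles of the shared K dimension. Assumes M, N, K are
# multiples of the block sizes; real kernels mask the edges.
import torch

def tiled_matmul(a, b, bm=16, bn=16, bk=16):
    M, K = a.shape
    K2, N = b.shape
    assert K == K2 and M % bm == 0 and N % bn == 0 and K % bk == 0
    c = torch.zeros(M, N, dtype=a.dtype)
    for i in range(0, M, bm):          # each (i, j) pair is one "program"
        for j in range(0, N, bn):
            acc = torch.zeros(bm, bn, dtype=a.dtype)
            for k in range(0, K, bk):  # walk the K dimension tile by tile
                acc += a[i:i + bm, k:k + bk] @ b[k:k + bk, j:j + bn]
            c[i:i + bm, j:j + bn] = acc
    return c

a, b = torch.randn(64, 32), torch.randn(32, 48)
assert torch.allclose(tiled_matmul(a, b), a @ b, atol=1e-4)
```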
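Similarly, the online softmax trick can be sketched in a few lines of PyTorch: the running maximum and normalizer are maintained incrementally while streaming over chunks, so the full row never has to be materialized before exponentiation. This is a simplification of what the tutorial covers.

```python
# Minimal sketch of online softmax in PyTorch: the max and normalizer are
# computed together in a single streaming pass, rescaling past work whenever
# a larger maximum appears — the key trick reused by Flash Attention.
import torch

def online_softmax(x, chunk=4):
    m = torch.tensor(float("-inf"))  # running max
    d = torch.tensor(0.0)            # running normalizer (sum of exp)
    for start in range(0, x.numel(), chunk):
        block = x[start:start + chunk]
        m_new = torch.maximum(m, block.max())
        # rescale the old normalizer to the new max, then add the new terms
        d = d * torch.exp(m - m_new) + torch.exp(block - m_new).sum()
        m = m_new
    return torch.exp(x - m) / d  # final pass uses the converged (m, d)

x = torch.randn(10)
assert torch.allclose(online_softmax(x), torch.softmax(x, dim=0), atol=1e-6)
```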
The Flash Attention tutorial covers most of what you need to know; a condensed sketch of the core recurrence follows.
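The sketch below combines the tiling and online softmax ideas above into a minimal single-head attention; it assumes no masking or dropout and a block size that divides the sequence length, all of which the tutorial relaxes.

```python
# Minimal single-head sketch of the Flash Attention recurrence in PyTorch:
# keys/values are processed in blocks, and the output is rescaled with an
# online softmax so the full attention matrix is never stored. Simplified:
# no masking, no dropout, block size assumed to divide the sequence length.
import math
import torch

def flash_attention(q, k, v, block=16):
    n, d = q.shape
    scale = 1.0 / math.sqrt(d)
    o = torch.zeros_like(q)
    m = torch.full((n, 1), float("-inf"))  # running row maxima
    l = torch.zeros(n, 1)                  # running row normalizers
    for start in range(0, n, block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = (q @ kb.T) * scale                         # scores for this block
        m_new = torch.maximum(m, s.max(dim=1, keepdim=True).values)
        p = torch.exp(s - m_new)                       # block softmax numerator
        alpha = torch.exp(m - m_new)                   # rescale old statistics
        l = l * alpha + p.sum(dim=1, keepdim=True)
        o = o * alpha + p @ vb                         # unnormalized output
        m = m_new
    return o / l

q, k, v = (torch.randn(64, 32) for _ in range(3))
expected = torch.softmax((q @ k.T) / math.sqrt(32), dim=-1) @ v
assert torch.allclose(flash_attention(q, k, v), expected, atol=1e-5)
```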