
Running an LLM on the ESP32

[Images: the LLM running on the ESP32, and a sample of its output]

Summary

I wanted to see if it was possible to run a Large Language Model (LLM) on the ESP32. Surprisingly, it is possible, though probably not very useful.

The "Large" Language Model used is actually quite small. It is a 260K parameter tinyllamas checkpoint trained on the tiny stories dataset.

The LLM implementation is based on llama2.c, with minor optimizations to make it run faster on the ESP32.

Hardware

LLMs require a great deal of memory. Even this small model still requires about 1 MB of RAM. I used the ESP32-S3FH4R2 because it has 2 MB of embedded PSRAM.
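
Since 1 MB of weights will not fit in the S3's internal SRAM, they have to live in PSRAM. Below is a minimal sketch of how that looks with ESP-IDF's capability-aware allocator; the size and variable name are illustrative, not the repo's exact code:

#include "esp_heap_caps.h"

// Request ~1 MB for the model weights explicitly from external PSRAM.
float *weights = heap_caps_malloc(1024 * 1024, MALLOC_CAP_SPIRAM);
if (weights == NULL) {
    // PSRAM is missing, disabled in sdkconfig, or exhausted.
}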

Optimizing Llama2.c for the ESP32

With the following changes to llama2.c, I am able to achieve 19.13 tok/s:

  1. Using both cores of the ESP32-S3 during math-heavy operations.
  2. Using the dot-product functions from the ESP-DSP library that are optimized for the ESP32-S3. These take advantage of the handful of SIMD instructions the chip provides. (Items 1 and 2 are sketched in the matmul example after this list.)
  3. Maxing out the CPU speed at 240 MHz and the PSRAM speed at 80 MHz, and increasing the instruction cache size (see the sdkconfig sketch after this list).
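
For illustration, here is a minimal sketch (not the repo's exact code) of how items 1 and 2 can be combined in llama2.c's matmul: a worker task pinned to core 1 computes the upper half of the output rows while core 0 computes the lower half, and each row is reduced with ESP-DSP's dsps_dotprod_f32, which dispatches to the ESP32-S3 SIMD implementation when available. The task name, stack size, and job struct are assumptions for the sketch:

#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "freertos/semphr.h"
#include "dsps_dotprod.h"

typedef struct {
    float *xout;   // output vector, length d
    float *x;      // input vector, length n
    float *w;      // weight matrix (d x n), row-major
    int n, d;
} matmul_job_t;

static matmul_job_t s_job;
static SemaphoreHandle_t s_start, s_done;

// Worker pinned to core 1: computes the upper half of the output rows.
static void matmul_worker(void *arg)
{
    for (;;) {
        xSemaphoreTake(s_start, portMAX_DELAY);
        for (int i = s_job.d / 2; i < s_job.d; i++) {
            dsps_dotprod_f32(s_job.w + i * s_job.n, s_job.x,
                             &s_job.xout[i], s_job.n);
        }
        xSemaphoreGive(s_done);
    }
}

// Called from core 0: computes the lower half while core 1 does the rest.
static void matmul(float *xout, float *x, float *w, int n, int d)
{
    s_job = (matmul_job_t){ xout, x, w, n, d };
    xSemaphoreGive(s_start);
    for (int i = 0; i < d / 2; i++) {
        dsps_dotprod_f32(w + i * n, x, &xout[i], n);
    }
    xSemaphoreTake(s_done, portMAX_DELAY);  // wait for core 1's half
}

// Call once at startup, before the first matmul.
void matmul_init(void)
{
    s_start = xSemaphoreCreateBinary();
    s_done  = xSemaphoreCreateBinary();
    xTaskCreatePinnedToCore(matmul_worker, "matmul1", 4096, NULL, 5, NULL, 1);
}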
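
The clock and cache settings from item 3 live in sdkconfig. The entries below are a sketch of the relevant options; exact option names vary between ESP-IDF versions:

CONFIG_ESP_DEFAULT_CPU_FREQ_MHZ_240=y
CONFIG_SPIRAM_SPEED_80M=y
CONFIG_ESP32S3_INSTRUCTION_CACHE_32KB=y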

Setup

Building requires the ESP-IDF toolchain to be installed:

idf.py build
idf.py -p /dev/{DEVICE_PORT} flash
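
For example, if the board shows up as /dev/ttyUSB0 (the port name is system-dependent; this one is just an illustration), you can flash and open the serial console in one step:

idf.py -p /dev/ttyUSB0 flash monitor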
