Serving LLMs from First Principles
Description
In this workshop, we'll build not one, but two (two!) systems capable of serving AI models in production, from the ground up. The first system will be based on established technologies like PyTorch and FastAPI; we'll write everything from scratch to showcase the basics of AI model serving. For the second system, we'll use NVIDIA's Triton model serving technology to build a performant, production-ready serving system.
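As a small taste of the first, from-scratch system, the sketch below shows what serving a model behind a FastAPI endpoint can look like. It is an illustrative assumption rather than the workshop's actual code: it presumes fastapi, uvicorn, torch, and a recent transformers release with Gemma 3 support are installed, and that you have access to the gated gemma-3-1b-it weights.

```python
# Illustrative sketch (not the workshop code): a tiny text-generation service.
# Assumes fastapi, uvicorn, torch and a recent transformers release are installed,
# and that the gated google/gemma-3-1b-it weights are accessible.
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
generator = pipeline("text-generation", model="google/gemma-3-1b-it")

class GenerateRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 64

@app.post("/generate")
def generate(req: GenerateRequest):
    # Run the model on a single request; batching and streaming are what the
    # workshop builds on top of this baseline.
    out = generator(req.prompt, max_new_tokens=req.max_new_tokens)
    return {"completion": out[0]["generated_text"]}

# Run with: uvicorn app:app --host 0.0.0.0 --port 8000
```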
Key takeaway
There are plenty of small, fun details that come together when serving LLMs, and anyone can do it.
Prerequisites
- Some experience writing (pure) Python code; familiarity with AI-related topics like tensors and batching is a plus, but these will be explained and demonstrated.
- Participants are encouraged to have access to their own GPU machine with Docker installed. However, the workshop will provide the means to create a temporary cloud-based GPU instance, which will be destroyed at the end of the session.
Preparation instructions
- You should have uv installed: https://github.com/astral-sh/uv
- Alternatively, if you want to run things locally, make sure that:
  - your device has an NVIDIA RTX GPU,
  - it can run Docker containers with CUDA,
  - you have predownloaded gemma-3-1b-it from HuggingFace: https://huggingface.co/google/gemma-3-1b-it (see the sketch after this list).
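If you go the local route, the sketch below shows one way to pre-download the model into your local Hugging Face cache. This is an assumption on my part, not official workshop material: it presumes the huggingface_hub package is installed (e.g. via uv) and that you have accepted the Gemma license and authenticated, since the repository is gated.

```python
# Sketch: pre-download gemma-3-1b-it into the local Hugging Face cache so it is
# available offline during the session. Assumes huggingface_hub is installed and
# you are logged in with a token that can access the gated repository.
from huggingface_hub import snapshot_download

local_path = snapshot_download(repo_id="google/gemma-3-1b-it")
print(f"Model files cached at: {local_path}")
```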
Speaker
