Serving LLMs from First Principles
Description
In this workshop, we'll build not one, but two (two!) systems capable of serving AI models in production, from the ground up. The first system will be based on established technologies like PyTorch and FastAPI; we'll write everything from scratch to showcase the basics of AI model serving. For the second system, we'll use NVIDIA's Triton model serving technology to build a performant, production-ready serving system.
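As a small taste of the first, from-scratch system, the sketch below shows what serving a model behind a FastAPI endpoint can look like. It is an illustrative assumption rather than the workshop's actual code: it presumes fastapi, uvicorn, torch, and a recent transformers release with Gemma 3 support are installed, and that you have access to the gated gemma-3-1b-it weights.

```python
# Illustrative sketch (not the workshop code): a tiny text-generation service.
# Assumes fastapi, uvicorn, torch and a recent transformers release are installed,
# and that the gated google/gemma-3-1b-it weights are accessible.
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
generator = pipeline("text-generation", model="google/gemma-3-1b-it")

class GenerateRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 64

@app.post("/generate")
def generate(req: GenerateRequest):
    # Run the model on a single request; batching and streaming are what the
    # workshop builds on top of this baseline.
    out = generator(req.prompt, max_new_tokens=req.max_new_tokens)
    return {"completion": out[0]["generated_text"]}

# Run with: uvicorn app:app --host 0.0.0.0 --port 8000
```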
Key takeaway
There are plenty of small, fun details that come together when serving LLMs, and anyone can do it.
Prerequisites
- Some experience writing (pure) Python code; familiarity with AI-related topics like tensors and batching is a plus, but these will be explained and demonstrated.
- Participants are encouraged to have access to their own GPU machine with Docker installed. However, the workshop will provide the means to create a temporary cloud-based GPU instance, which will be destroyed at the end of the session.
Preparation instructions
- You should have uv installed: https://github.com/astral-sh/uv
- Alternatively, if you want to run things locally, make sure that:
  - your device has an NVIDIA RTX GPU,
  - it can run Docker containers with CUDA,
  - you have predownloaded gemma-3-1b-it from HuggingFace: https://huggingface.co/google/gemma-3-1b-it (see the sketch after this list).
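If you go the local route, the sketch below shows one way to pre-download the model into your local Hugging Face cache. This is an assumption on my part, not official workshop material: it presumes the huggingface_hub package is installed (e.g. via uv) and that you have accepted the Gemma license and authenticated, since the repository is gated.

```python
# Sketch: pre-download gemma-3-1b-it into the local Hugging Face cache so it is
# available offline during the session. Assumes huggingface_hub is installed and
# you are logged in with a token that can access the gated repository.
from huggingface_hub import snapshot_download

local_path = snapshot_download(repo_id="google/gemma-3-1b-it")
print(f"Model files cached at: {local_path}")
```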
Speaker
