Workshop Duration: 2h 30min

Serving LLMs from First Principles

Marijan Smetko
Description

In this workshop, we'll develop not one, but two (two!) systems capable of serving AI models in production, from the ground up. The first system will be built on established technologies like PyTorch and FastAPI, with everything written from scratch to showcase the basics of AI model serving. For the second, we'll use NVIDIA's Triton model serving technology to build a performant, production-grade model serving system.

Prerequisites

- Participants should have some experience writing (pure) Python code; familiarity with AI-related topics such as tensors and batching is a plus, but these will be explained and demonstrated.
- Participants are encouraged to have access to their own GPU machine with Docker installed. However, the workshop will provide the means to create a temporary cloud-based GPU instance, which will be destroyed at the end of the session.
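Batching, one of the concepts the workshop promises to explain, can be illustrated without any GPU at all: group incoming requests into chunks so a single model call amortizes per-request overhead. The sketch below is a hypothetical pure-Python illustration; `echo_model` stands in for a real batched forward pass.

```python
# Hypothetical illustration of request batching: run the model once per
# chunk of requests instead of once per request.
from typing import Callable, List

def batched(model_fn: Callable[[List[str]], List[str]], max_batch: int):
    """Wrap model_fn so a stream of requests is served in chunks of at
    most max_batch items."""
    def serve(requests: List[str]) -> List[str]:
        outputs: List[str] = []
        for i in range(0, len(requests), max_batch):
            # One "forward pass" covers the whole chunk.
            outputs.extend(model_fn(requests[i:i + max_batch]))
        return outputs
    return serve

# Stand-in "model": a real one would run a batched tensor forward pass.
def echo_model(batch: List[str]) -> List[str]:
    return [s[::-1] for s in batch]

serve = batched(echo_model, max_batch=4)
# serve(["ab", "cd"]) returns ["ba", "dc"]
```

On a GPU the payoff is much larger than this toy suggests, since kernel launch and memory-transfer overheads are paid once per batch rather than once per request.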

Key takeaway: There are plenty of small, fun details that come together when serving LLMs, and anyone can do it.

Speaker
Marijan Smetko

SWE @ Google
Marijan Smetko is a technology enthusiast from Croatia. As an SWE at Google, he works closely with Gemini LLMs, making them explain and teach STEM-related topics to interested Google users.
