Serving LLMs from First Principles
Description
In this workshop, we'll build not one but two systems capable of serving AI models in production, from the ground up. The first system will be based on established technologies like PyTorch and FastAPI; we'll write everything from scratch to showcase the basics of AI model serving. For the second system, we'll use NVIDIA's Triton Inference Server to build a performant, production-grade model serving system.
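
To give a flavor of what the first system starts from, here is a minimal sketch of a FastAPI endpoint wrapping a PyTorch model. This is illustrative only, not the workshop's actual code: the /predict route, the PredictRequest schema, and the toy Linear model are all placeholder assumptions.

    # Toy serving endpoint: load a model once, expose it over HTTP.
    import torch
    import torch.nn as nn
    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()

    # Stand-in model (illustrative): a real server would load trained
    # weights here, once, at startup.
    model = nn.Linear(4, 2)
    model.eval()

    class PredictRequest(BaseModel):
        features: list[float]  # four values in this toy setup

    @app.post("/predict")  # hypothetical route name
    def predict(req: PredictRequest):
        with torch.no_grad():  # inference only, skip autograd bookkeeping
            x = torch.tensor(req.features).unsqueeze(0)  # add batch dimension
            logits = model(x)
        return {"logits": logits.squeeze(0).tolist()}

Run it with an ASGI server such as uvicorn (e.g. "uvicorn main:app") and POST a JSON body like {"features": [1.0, 2.0, 3.0, 4.0]} to /predict.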
Prerequisites:
- Participants should have some experience writing (pure) Python code; familiarity with AI-related topics such as tensors and batching is a plus, but these will be explained and demonstrated.
- Participants are encouraged to have access to their own GPU machine with Docker installed; however, the workshop will provide the means to create a temporary cloud-based GPU instance, which will be destroyed at the end of the session.
Key takeaway: There are plenty of small, fun details that come together when serving LLMs, and anyone can do it.
Speaker
