SREcon25 Americas: Optimizing Machine Learning - Detailed Analysis & Overview
Fully Automated HW SKU Selection System
An SRE Approach to Monitoring ML in Production (Daria Barteneva, Microsoft Azure)
Safe Evaluation and Rollout of AI Models (Brendan Burns, Microsoft): More and more online services and systems depend on ...
The Search for Speed (Scott Laird): What do you do when you're new to a service and all you know is that you're spending huge ...
Transformers in SRE Land: Evolving to Manage AI Infrastructure (Qian Ding, Ant Group): The rapid advancement of AI has ...
Stopping Performance Regression via Changepoint Detection (Joseph Cirella and Shanthini Velan, Bloomberg): Bloomberg's ...
Challenges of Making Large AI Clusters Reliable (John Looney and Panos Christeas, Crusoe.ai): High Performance Computing ...
Systems Thinking with Poisoned Systems (Hazel Weakly, Nivenly Foundation; Sandeep Kanabar, Gen): AI is often said to be a ...
Logan McDonald, BuzzFeed: The talk is about the most powerful observability system SREs have at their disposal: the human ...
Resilience for AI Workloads at Scale: The Fast and the Finicky! (Lerna Ekmekcioglu, Clockwork.io): A Formula 1 car at high speeds ...
CPU Utilization: The Hidden Cost of Running Hot (Andreas Strikos, GitHub): As we are in an AI era where demand for computation ...