SREcon25 Americas Optimizing Machine Learning - Detailed Analysis & Overview

Fully Automated HW SKU Selection System to ...
An SRE Approach to Monitoring ML in Production. Daria Barteneva, Microsoft Azure.
Safe Evaluation and Rollout of AI Models. Brendan Burns, Microsoft. More and more online services and systems depend on ...
The Search for Speed. Scott Laird. What do you do when you're new to a service and all you know is that you're spending huge ...
Transformers in SRE Land: Evolving to Manage AI Infrastructure. Qian Ding, Ant Group. The rapid advancement of AI has ...
Stopping Performance Regression via Changepoint Detection. Joseph Cirella and Shanthini Velan, Bloomberg. Bloomberg's ...

Challenges of Making Large AI Clusters Reliable. John Looney and Panos Christeas, Crusoe.ai. High Performance Computing ...
Systems Thinking with Poisoned Systems. Hazel Weakly, Nivenly Foundation; Sandeep Kanabar, Gen AI is often said to be a ...
Optimizing for Learning. Logan McDonald, BuzzFeed. The talk is about the most powerful observability system SREs have at their disposal: the human ...
Resilience for AI Workloads at Scale: The Fast and the Finicky! Lerna Ekmekcioglu, Clockwork.io. A Formula 1 car at high speeds ...
CPU Utilization: The Hidden Cost of Running Hot. Andreas Strikos, GitHub. As we are in an AI era where demand for computation ...

Photo Gallery

SREcon25 Americas - Optimizing Machine Learning Training Infrastructure: A Governance Approach
SREcon25 Americas - Fully Automated HW SKU Selection System to Optimize Apache Pinot’s Cost-to...
SREcon25 Americas - An SRE Approach to Monitoring ML in Production
SREcon25 Americas - Safe Evaluation and Rollout of AI Models
SREcon25 Americas - The Search for Speed
SREcon25 Americas - Transformers in SRE Land: Evolving to Manage AI Infrastructure
SREcon25 Americas - Stopping Performance Regression via Changepoint Detection
SREcon25 Europe/Middle East/Africa - Challenges of Making Large AI Clusters Reliable
SREcon25 Americas - Systems Thinking with Poisoned Systems
SREcon19 Americas - Optimizing for Learning
SREcon21 - Automating Performance Tuning with Machine Learning
SREcon25 Europe/Middle East/Africa - Resilience for AI Workloads at Scale: The Fast and the Finicky!
SREcon25 Europe/Middle East/Africa - CPU Utilization: The Hidden Cost of Running Hot