Apparate: Early-Exit Models for ML Latency and Throughput Optimization - Abstract and Introduction
Apparate is a system that automatically applies and manages early exits (EEs) in ML models, whereby certain inputs can exit with results at intermediate layers. Apparate lowers median response latencies by 40.5-91.5% and 10.0-24.2% for diverse CV and NLP workloads, respectively.
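To make the early-exit idea concrete, here is a minimal sketch in plain Python. It is not Apparate's implementation. The names (`run_with_early_exits`, `LAYERS`, `RAMPS`, `toy_ramp`), the toy numeric "layers", and the single confidence threshold are all assumptions chosen for illustration. It shows only the general pattern: an intermediate ramp produces a prediction and a confidence, and the input exits there if the confidence clears the threshold.

```python
from typing import Callable, Dict, List, Tuple

Layer = Callable[[float], float]
# A ramp maps an intermediate activation to (prediction, confidence).
Ramp = Callable[[float], Tuple[str, float]]


def run_with_early_exits(
    x: float,
    layers: List[Layer],
    ramps: Dict[int, Ramp],
    threshold: float,
    final_head: Callable[[float], str],
) -> Tuple[str, int]:
    """Return (prediction, number of layers executed).

    Hypothetical helper: ramps[i] is consulted after layer i runs.
    """
    h = x
    for i, layer in enumerate(layers):
        h = layer(h)
        ramp = ramps.get(i)
        if ramp is not None:
            pred, conf = ramp(h)
            if conf >= threshold:
                # Confident enough: exit without running later layers.
                return pred, i + 1
    # No ramp was confident: fall through to the full model's output.
    return final_head(h), len(layers)


def label(h: float) -> str:
    return "pos" if h >= 0 else "neg"


def toy_ramp(h: float) -> Tuple[str, float]:
    # Toy confidence: larger magnitude means more confident, capped at 1.0.
    return label(h), min(1.0, abs(h) / 8.0)


# Toy model: four "layers" that each double the activation,
# with one ramp attached after the second layer (index 1).
LAYERS: List[Layer] = [lambda h: h * 2.0 for _ in range(4)]
RAMPS: Dict[int, Ramp] = {1: toy_ramp}

easy = run_with_early_exits(3.0, LAYERS, RAMPS, 0.9, label)
hard = run_with_early_exits(0.5, LAYERS, RAMPS, 0.9, label)
print("easy input:", easy)
print("hard input:", hard)
```

In this toy setup, an "easy" input exits at the ramp and skips the remaining layers, while a "hard" input runs the whole model. Choosing where ramps go and how high the threshold should be is what the article says Apparate manages automatically.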
Table of Links
2 Background and Motivation and 2.1 Model Serving Platforms
3.1 Preparing Models with Early Exits
3.2 Accuracy-Aware Threshold Tuning
3.3 Latency-Focused Ramp Adjustments
5 Evaluation and 5.1 Methodology
5.3 Comparison with Existing EE Strategies
7 Conclusion, References, Appendix
Abstract
Machine learning (ML) inference platforms are tasked with balancing two competing goals: ensuring high throughput given many ...
Copyright of this story belongs solely to hackernoon.com.