Apparate: Early-Exit Models for ML Latency and Throughput Optimization - Abstract and Introduction


by @textmodels

Apparate is a system that automatically applies and manages early exits (EEs) in ML models, whereby certain inputs can exit with results at intermediate layers. Apparate lowers median response latencies by 40.5-91.5% for diverse computer vision (CV) workloads and by 10.0-24.2% for diverse natural language processing (NLP) workloads.
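To make the early-exit idea concrete, here is a minimal toy sketch in plain Python. It is not Apparate's implementation: the names `run_with_early_exits`, `double`, `sign_ramp`, and `final_head`, the confidence formula, and the threshold value are all hypothetical, chosen only to illustrate how an input can return a result at an intermediate layer once a ramp is confident enough.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy types: a "layer" transforms a scalar activation,
# and a "ramp" maps an activation to (label, confidence in [0, 1]).
Layer = Callable[[float], float]
Ramp = Callable[[float], Tuple[str, float]]


def double(h: float) -> float:
    """Toy layer (assumption): doubles the activation."""
    return 2.0 * h


def sign_ramp(h: float) -> Tuple[str, float]:
    """Toy ramp (assumption): label by sign, confidence grows with |h|."""
    label = "pos" if h > 0 else "neg"
    confidence = min(abs(h) / 10.0, 1.0)
    return label, confidence


def final_head(h: float) -> str:
    """Toy final classifier used when no ramp exits early."""
    return "pos" if h > 0 else "neg"


def run_with_early_exits(
    x: float,
    layers: List[Layer],
    ramps: Dict[int, Ramp],
    threshold: float,
) -> Tuple[str, int]:
    """Run layers in order; exit at the first ramp whose confidence
    reaches the threshold. Returns (label, number of layers executed)."""
    h = x
    for i, layer in enumerate(layers):
        h = layer(h)
        ramp = ramps.get(i)
        if ramp is not None:
            label, confidence = ramp(h)
            if confidence >= threshold:
                return label, i + 1  # early exit: skip remaining layers
    return final_head(h), len(layers)


layers = [double, double, double, double]
ramps = {1: sign_ramp}  # a single ramp after the second layer

easy = run_with_early_exits(3.0, layers, ramps, threshold=0.8)
hard = run_with_early_exits(0.5, layers, ramps, threshold=0.8)
print(easy, hard)
```

In this toy setup the "easy" input leaves after two of the four layers, while the "hard" input runs the full model. The latency saving for any one input comes from the layers it skips.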

Table of Links

Abstract and 1 Introduction

2 Background and Motivation and 2.1 Model Serving Platforms

2.2 Early-Exit Models

2.3 Challenges

3 Design

3.1 Preparing Models with Early Exits

3.2 Accuracy-Aware Threshold Tuning

3.3 Latency-Focused Ramp Adjustments

4 Implementation

5 Evaluation and 5.1 Methodology

5.2 Overall Results

5.3 Comparison with Existing EE Strategies

5.4 Microbenchmarks

6 Additional Related Work

7 Conclusion, References, Appendix

Abstract

Machine learning (ML) inference platforms are tasked with balancing two competing goals: ensuring high throughput given many ...


Copyright of this story solely belongs to hackernoon.com.