Author manuscript; available in PMC: 2019 Dec 23.
Published in final edited form as: Adv Neural Inf Process Syst. 2019 Dec;32:9392–9402.

Table 1: Application Datasets.

We compare our model to baselines on natural language understanding and computer vision applications, averaging over 5 runs with different seeds, and note each method's relative increase in number of parameters (Param Inc.). For each slice-aware baseline we report the overall score and the maximum and average relative improvement (i.e. Lift) over the Vanilla model. For some trials of MoE, our system ran out of GPU memory (OOM).

COLA (Matthews Corr. [23]):

| Method   | Param Inc. | Overall (std) | Slice Lift Max | Slice Lift Avg |
|----------|------------|---------------|----------------|----------------|
| Vanilla  | -          | 57.8 (±1.3)   | -              | -              |
| HPS [7]  | 12%        | 57.4 (±2.1)   | +12.7          | 1.1            |
| Manual   | 12%        | 57.9 (±1.2)   | +6.3           | +0.4           |
| MoE [17] | 100%       | 57.2 (±0.9)   | +20.0          | +1.3           |
| Ours     | 12%        | 58.3 (±0.7)   | +19.0          | +2.5           |

RTE (F1 Score):

| Method   | Param Inc. | Overall (std) | Slice Lift Max | Slice Lift Avg |
|----------|------------|---------------|----------------|----------------|
| Vanilla  | -          | 67.0 (±1.6)   | -              | -              |
| HPS [7]  | 10%        | 67.9 (±1.8)   | +12.7          | +2.9           |
| Manual   | 10%        | 69.4 (±1.8)   | +10.7          | +4.2           |
| MoE [17] | 100%       | 69.2 (±1.5)   | +10.9          | +3.9           |
| Ours     | 10%        | 69.5 (±0.8)   | +10.9          | +4.6           |

CYDET (F1 Score):

| Method   | Param Inc. | Overall (std) | Slice Lift Max | Slice Lift Avg |
|----------|------------|---------------|----------------|----------------|
| Vanilla  | -          | 39.4 (±5.4)   | -              | -              |
| HPS [7]  | 10%        | 37.4 (±3.6)   | +6.3           | −0.7           |
| Manual   | 10%        | 36.9 (±4.2)   | +6.3           | −1.7           |
| MoE [17] | 100%       | OOM           | OOM            | OOM            |
| Ours     | 10%        | 40.9 (±3.9)   | +15.6          | +2.3           |
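As a minimal sketch of how the Lift columns can be read: assuming "Lift" denotes the relative improvement of a method's per-slice score over the Vanilla model's score on the same slice, the Max and Avg entries would be the maximum and mean of those per-slice improvements. The function names and the example scores below are hypothetical, for illustration only; they are not values from the paper.

```python
# Hedged sketch: one plausible computation of the Max/Avg "Lift" columns,
# assuming lift is the relative improvement of a method's per-slice score
# over the Vanilla baseline's score on the same slice.

def slice_lifts(method_scores, vanilla_scores):
    """Relative per-slice improvement (%) over the Vanilla baseline."""
    return [100.0 * (m - v) / v for m, v in zip(method_scores, vanilla_scores)]

def max_and_avg_lift(method_scores, vanilla_scores):
    """Return (max lift, average lift) across slices."""
    lifts = slice_lifts(method_scores, vanilla_scores)
    return max(lifts), sum(lifts) / len(lifts)

# Hypothetical per-slice scores (NOT from the paper), purely to show usage.
vanilla = [60.0, 50.0, 40.0]
ours = [62.0, 55.0, 41.0]
mx, avg = max_and_avg_lift(ours, vanilla)
```

Whether the paper reports lift in relative percent or in absolute score points is an assumption here; only the max/mean aggregation over slices is implied by the Max/Avg column pair.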