Author manuscript; available in PMC: 2019 Dec 23.
Published in final edited form as: Adv Neural Inf Process Syst. 2019 Dec;32:9392–9402.

Table 1: Application Datasets.

We compare our model to baselines on natural language understanding and computer vision applications, averaging over 5 runs with different seeds, and note each method's relative increase in number of parameters (Param Inc.). For each slice-aware baseline we report the overall score and the maximum and average relative improvement (i.e. Lift) over the Vanilla model. For some trials of MoE, our system ran out of GPU memory (OOM).

COLA (Matthews Corr. [23]):

| Method   | Param Inc. | Overall (std) | Slice Lift Max | Slice Lift Avg |
|----------|------------|---------------|----------------|----------------|
| Vanilla  | -          | 57.8 (±1.3)   | -              | -              |
| HPS [7]  | 12%        | 57.4 (±2.1)   | +12.7          | 1.1            |
| Manual   | 12%        | 57.9 (±1.2)   | +6.3           | +0.4           |
| MoE [17] | 100%       | 57.2 (±0.9)   | +20.0          | +1.3           |
| Ours     | 12%        | 58.3 (±0.7)   | +19.0          | +2.5           |

RTE (F1 Score):

| Method   | Param Inc. | Overall (std) | Slice Lift Max | Slice Lift Avg |
|----------|------------|---------------|----------------|----------------|
| Vanilla  | -          | 67.0 (±1.6)   | -              | -              |
| HPS [7]  | 10%        | 67.9 (±1.8)   | +12.7          | +2.9           |
| Manual   | 10%        | 69.4 (±1.8)   | +10.7          | +4.2           |
| MoE [17] | 100%       | 69.2 (±1.5)   | +10.9          | +3.9           |
| Ours     | 10%        | 69.5 (±0.8)   | +10.9          | +4.6           |

CYDET (F1 Score):

| Method   | Param Inc. | Overall (std) | Slice Lift Max | Slice Lift Avg |
|----------|------------|---------------|----------------|----------------|
| Vanilla  | -          | 39.4 (±5.4)   | -              | -              |
| HPS [7]  | 10%        | 37.4 (±3.6)   | +6.3           | −0.7           |
| Manual   | 10%        | 36.9 (±4.2)   | +6.3           | −1.7           |
| MoE [17] | 100%       | OOM           | OOM            | OOM            |
| Ours     | 10%        | 40.9 (±3.9)   | +15.6          | +2.3           |
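As a minimal sketch of how the Lift columns can be read: assuming "Lift" denotes the relative improvement of a method's per-slice score over the Vanilla model's score on the same slice, the Max and Avg entries would be the maximum and mean of those per-slice improvements. The function names and the example scores below are hypothetical, for illustration only; they are not values from the paper.

```python
# Hedged sketch: one plausible computation of the Max/Avg "Lift" columns,
# assuming lift is the relative improvement of a method's per-slice score
# over the Vanilla baseline's score on the same slice.

def slice_lifts(method_scores, vanilla_scores):
    """Relative per-slice improvement (%) over the Vanilla baseline."""
    return [100.0 * (m - v) / v for m, v in zip(method_scores, vanilla_scores)]

def max_and_avg_lift(method_scores, vanilla_scores):
    """Return (max lift, average lift) across slices."""
    lifts = slice_lifts(method_scores, vanilla_scores)
    return max(lifts), sum(lifts) / len(lifts)

# Hypothetical per-slice scores (NOT from the paper), purely to show usage.
vanilla = [60.0, 50.0, 40.0]
ours = [62.0, 55.0, 41.0]
mx, avg = max_and_avg_lift(ours, vanilla)
```

Whether the paper reports lift in relative percent or in absolute score points is an assumption here; only the max/mean aggregation over slices is implied by the Max/Avg column pair.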