Journal of King Saud University - Computer and Information Sciences. 2026 Jan 20;38(2):76. doi: 10.1007/s44443-025-00381-z

Online label aggregation with incomplete crowd responses

Yuyang Liu 1,2, Haoyu Liu 2, Runze Wu 3, Chengliang Chai 3, Minmin Lin 2, Renyu Zhu 2, Hui Liu 1, Tangjie Lv 2, Changjie Fan 1,2,3
PMCID: PMC12948928  PMID: 41768854

Abstract

Crowdsourcing delivers responses that are asynchronous and incomplete, making offline aggregators that assume complete response sets impractical. Prior online methods often either require per-step completeness or repeatedly reload historical responses, which is storage- and privacy-unfriendly and susceptible to forgetting. We present OLA-Incomplete, an online label-aggregation framework designed for incomplete response streams. It integrates a variational-inference aggregator with a generative replay module that preserves historical information without reloading prior responses and explicitly models unknown worker reliability. At each update step, the generator replays cumulative responses and side information for previously observed instances to mitigate catastrophic forgetting, while the aggregator infers current truths by maximizing the evidence lower bound over a mixture of replayed and newly received labels. Across three public datasets—Duck, RTE, and PostSent—OLA-Incomplete attains final accuracies of 90.74%, 92.50%, and 95.99%, respectively, delivering at least 7.79% relative improvement over the strongest baseline. The approach further exhibits strong instantaneous online accuracy and robustness across response-chunk sizes and arrival orders, underscoring its practical utility for real-world crowdsourcing workflows.

Keywords: Crowdsourcing, Online label aggregation, Incomplete response, Generative replay

Introduction

Crowdsourcing has become a paradigmatic approach for harnessing collective wisdom (Mahotra and Majchrzak 2024; Kodjiku et al. 2023). It enables the flexible acquisition of tacit knowledge from large-scale data, spanning tasks such as image classification (Balayn et al. 2021) and disaster forecasting (Wang et al. 2024). The resulting crowdsourced labels can be used directly for time-sensitive decision making (Grassi et al. 2023) or as essential supervision for machine intelligence (Li 2010). Fueled by demand, public platforms such as Amazon Mechanical Turk (AMT) and CrowdFlower allow requesters to post tasks composed of multiple instances. Each instance is redundantly assigned to several workers by an automatic task-allocation policy. Because worker quality and response time are uncertain and largely opaque to requesters and platforms, aggregating potentially conflicting responses is required to infer the ground-truth label of each instance (Li et al. 2020; Wu et al. 2023).

Conventional label-aggregation methods operate offline (Yin et al. 2017): once all responses have been collected, the methods infer the truth once and for all (Zheng et al. 2017; Fu et al. 2024). However, many crowdsourcing campaigns span long periods, making complete collection of responses impractical. For example, labeling ImageNet by a single worker was estimated to take 25 years; even with approximately 25,000 workers, the construction still required 21 months (Li 2010). In such settings, online label aggregation prevents slow responses from delaying the entire workflow. It enables early stopping when interim aggregates pass manual spot checks. The intermediate inference results can also support auxiliary functions such as task assignment and annotator filtering (Daniel et al. 2018; Zhang et al. 2023; Huang et al. 2024). Without an online design, traditional approaches must repeatedly reload all historical responses to produce interim results, a resource-consuming operation that strains storage systems and, under privacy or maintenance constraints, may be infeasible (Hong et al. 2021). Hence, an online aggregation paradigm is both important and necessary.

Several methods have been explored for online aggregation, yet practical deployment remains challenging. Feng et al. (2014) proposed an incremental approach with a probabilistic worker model that updates as new responses arrive. However, without modeling the interplay between worker reliabilities and instance truth probabilities, its performance is limited. More recently, Hong et al. (2021) introduced an online variational Bayesian inference-based label aggregation model (BiLA) that achieved state-of-the-art results. BiLA propagates worker reliability across answered instances, akin to many offline methods. Nonetheless, BiLA requires complete responses at each time step. For clarity, we distinguish two cases. A complete response means that, at a given time step, the target instance has been labeled by all assigned workers. In contrast, an incomplete response means that the received chunk may contain arbitrary responses from any assigned instances and workers. As illustrated in Fig. 1, suppose three instances are each assigned to three workers. At a time step, the platform may receive an incomplete chunk containing several responses returned asynchronously. For example, the response $A_{24}^{1}$ for instance $Q_2$ from worker $w_4$ may arrive at $t_1$. Such a chunk cannot be handled by BiLA, which assumes completeness (e.g., $A_{12}^{1}$, $A_{13}^{1}$, $A_{22}^{1}$, and $A_{23}^{1}$ must also be present). In real deployments, the ordering and timing of returns are not controllable, so this assumption is rarely met.

Fig. 1. Online label aggregation under incomplete responses. Left: task assignment distributes each instance ($Q_1,\dots,Q_3$) redundantly to workers ($w_1,\dots,w_5$). Along the timeline ($t_1, t_2, \dots, t_n$), the platform receives incomplete chunks containing arbitrary responses $\{A_{11}^{1}, A_{24}^{1}\}$ at $t_1$, $\{A_{22}^{2}, A_{34}^{2}\}$ at $t_2$, and so on. At each time step, the generator replays historical cumulative responses (grids; checkmarks indicate newly observed entries; shaded cells are unobserved), and the variational aggregator updates on the mixture of replayed and newly received responses to infer current truths $\{Y_1^{1}, Y_2^{1}\}$, $\{Y_1^{2}, Y_2^{2}, Y_3^{2}\}$, etc. After inference, the arriving chunk is discarded (use-and-discard principle), and the process repeats at the next time step

In this paper, we address online label aggregation under incomplete responses. Our goals are threefold. (1) Preserve prior knowledge without reloading historical data. Because a time step may include only a subset of responses for the current instance, we need to reuse historical information for the current inference. Inspired by memory replay in lifelong learning (Chen et al. 2024), we design a generator that consolidates and replays historical responses to prevent catastrophic forgetting. (2) Handle uncontrolled worker reliability and unknown instance truth. Incorporating the generator raises the question of how to formulate the inference procedure. Building on Yin et al. (2017), we construct an online variational aggregator with a posterior-parameterized encoder that maximizes the evidence lower bound (ELBO) of observed responses. The aggregator leverages replayed history and side information to strengthen inference. (3) Couple the two modules in a controllable manner. We introduce a replay-accuracy threshold that balances replay fidelity against generator cost. The aggregator is then trained on a mixture of replayed data and current responses. The implementation of OLA-Incomplete is available at https://anonymous.4open.science/r/Online-label-aggregation-4C61.

Unlike prior online aggregation methods that either presume complete responses at each time step or repeatedly reload historical responses, OLA-Incomplete admits arbitrary incomplete chunks and consolidates history via generative replay. It then performs truth inference with an ELBO-driven variational aggregator that leverages replayed responses and instance-conditioned side information, thereby aligning the model with real, asynchronous platforms and yielding strong accuracy across datasets and chunk sizes. The main contributions can be summarized as follows:

  • We introduce OLA-Incomplete, an online label-aggregation framework explicitly tailored to incomplete, cross-instance response streams, removing the completeness assumption common in prior work.

  • We replace costly and privacy-sensitive historical reloads with generative memory replay to prevent catastrophic forgetting. We then optimize an ELBO-driven variational aggregator that couples worker reliability with instance truth while exploiting replayed history and instance-level side information.

  • Across three public datasets and five competitive baselines, OLA-Incomplete achieves state-of-the-art instantaneous and final accuracy, with robustness to chunk size and return-order variability.

Related work

Background

Crowdsourcing has become an indispensable paradigm for large-scale data annotation. Platforms such as Amazon Mechanical Turk (AMT) distribute tasks redundantly to multiple workers in order to mitigate label noise. However, worker reliability and latency vary significantly, making it necessary to aggregate the collected responses into reliable ground-truth estimates. While early studies demonstrated the potential of crowdsourced data across natural language processing, computer vision, and audio tasks (Pyatkin et al. 2023; Pavlichenko et al. 2021; Groenen et al. 2023; Lu et al. 2023), the underlying challenge has always been how to integrate noisy, incomplete, and asynchronous labels into a consistent truth-inference process. This section reviews prior work in three categories: offline label aggregation, online aggregation under asynchronous responses, and lifelong-learning techniques that enable history reuse. We also highlight efficiency, privacy, and reliability concerns that are increasingly important in real-world systems.

Offline label aggregation

Early aggregation methods were designed in an offline regime, assuming that all responses for all tasks are available before inference begins. Majority voting is the most straightforward approach: the label that appears most frequently among workers is selected as the truth (Aydin et al. 2014; Li et al. 2014). Although easy to implement, this method neglects worker heterogeneity and is easily disrupted by spammers or systematic bias (Chen et al. 2022). To address this issue, trust-propagation methods introduced explicit credibility measures. For instance, corroboration-based algorithms (Galland et al. 2010) and reputation frameworks (Sun et al. 2021; Bahutair et al. 2023; Zhan et al. 2024) propagate reliability across workers and tasks, thereby filtering out malicious or low-quality contributors. These models are effective in dense data scenarios but suffer in sparse or cold-start conditions, as they require sufficient overlap in responses to estimate trust reliably. Community-based extensions also attempt to cluster workers to mitigate sparsity (Li et al. 2014).

Another influential direction treats truth inference as a generative process. The Dawid–Skene model (Dawid and Skene 1979) formulates worker reliability via confusion matrices and estimates them with expectation–maximization (EM). Subsequent Bayesian extensions such as BCCWords and CommunityBCC (Simpson et al. 2015; Venanzi et al. 2014) further capture community structure among workers, while recent advances combine probabilistic graphical models with deep neural networks to model complex reliability patterns (Cai et al. 2020; Li et al. 2021; Luo et al. 2018). Neural architectures like the Label-Aware Autoencoder (LAA) (Yin et al. 2017) reformulate aggregation as an encoding–decoding process, with the latent representation corresponding to the inferred truth. Although these methods have significantly improved aggregation accuracy, they share a common assumption: the availability of complete response sets prior to inference. This design means they must reload and process the entire history whenever updated results are required, which is impractical in long-running or real-time applications where responses arrive sequentially.

Online label aggregation under asynchrony

To overcome the latency of offline methods, online aggregation techniques have been proposed that update truth estimates as responses stream in. Online EM (OEM) (Cappé and Moulines 2009) adapts the classic EM algorithm by incrementally updating sufficient statistics whenever new labels arrive. In the context of crowdsourcing, OEM continually refines worker confusion matrices with each incoming response, thus avoiding repeated batch training. However, OEM inherits EM’s sensitivity to initialization and local optima, and experiments show that its performance can vary widely depending on data order. Stochastic EM (SEM) (Chen et al. 2018) reduces computation by updating expectations with one sample at a time, which stabilizes incremental updates and avoids storing full data. Yet, SEM may be vulnerable to bursts of noisy labels, since each new response has global impact on the estimate.

Neural approaches extend aggregation to the online regime. Yang et al. (2024) proposed one-pass and two-pass Bayesian algorithms that traverse the data stream once or twice to infer worker reliabilities and task truths. These methods achieve accuracy close to iterative models at the cost of only linear-time complexity, and unlike EM-style algorithms, they can integrate new labels on the fly without revisiting history. This makes them highly efficient for large-scale or long-running projects. A more sophisticated solution is BiLA (Hong et al. 2021), which employs a variational Bayesian framework to propagate reliability across workers and tasks. BiLA achieves strong accuracy and has been regarded as a state-of-the-art online aggregator. It not only flexibly accommodates different noise distributions but also provides theoretical guarantees: its incremental optimizer comes with a convergence bound. Nevertheless, its effectiveness relies on the assumption that, at each time step, all assigned workers have completed their labels for the target instance. This “complete response” requirement rarely holds in practice, since response order and timing are unpredictable on real-world platforms. As a result, BiLA and similar methods are not well-suited to asynchronous environments where updates must be performed with arbitrary subsets of responses. Early efforts such as incremental EM (IEM) (Wang et al. 2022a) attempted to extend EM to dynamic truth discovery in related domains, but these methods remain unstable when only partial or delayed responses are available. Task-level strategies like setting deadlines or adaptive redundancy can approximate complete batches, yet they increase cost and do not address the modeling gap. In contrast, OLA-Incomplete explicitly targets the incomplete-response scenario. By integrating a generative replay mechanism with a variational aggregator, it consumes arbitrary partial arrivals without requiring historical reloading, thereby maintaining stability and accuracy even under asynchronous conditions.

Continual learning for history reuse

A central difficulty in online aggregation lies in retaining knowledge of past responses without storing or reprocessing the full history. This challenge is closely related to the problem of catastrophic forgetting in continual learning. Regularization-based methods alleviate forgetting by penalizing parameter drift from previous optima (Zhang et al. 2023; Li et al. 2023), while dynamic-architecture approaches expand model capacity to capture new tasks without overwriting old ones (Yan et al. 2021). These strategies improve stability but cannot reconstruct instance-level cumulative responses, which are necessary in label aggregation. Exemplar replay methods attempt to store a subset of past data for rehearsal, yet this raises storage and privacy concerns, particularly in crowdsourcing contexts where raw responses may be sensitive.

Generative replay has emerged as a promising alternative, synthesizing past data distributions through a generative model rather than storing original samples. It has proven effective in continual learning (Wang et al. 2022b; Rao et al. 2019; Chen et al. 2024), where models must adapt to evolving tasks while preserving prior knowledge. Chen et al. (2024) demonstrated its use in dynamic energy modeling, showing that synthetic replay can maintain accuracy without retaining the full dataset. Building on this idea, OLA-Incomplete employs a generative memory replay module that replays cumulative responses and side information tied to instance identities. The replayed signals, combined with current responses, allow the variational aggregator to maximize the evidence lower bound and update truth estimates without direct access to historical data. This design not only enhances robustness under incomplete and asynchronous inputs but also aligns with privacy and efficiency requirements in large-scale, long-running response pipelines. In particular, unlike exemplar replay, generative replay does not require retaining raw responses, which provides a natural degree of privacy protection in compliance with modern data regulations (Hou et al. 2025). As crowdsourcing systems scale up, this synergy between continual learning and online label aggregation becomes critical for sustaining accuracy, efficiency, and ethical standards over extended periods.

Preliminaries

Consider a set of instances $\mathcal{N} = \{1,\dots,N\}$ and a set of workers $\mathcal{M} = \{1,\dots,M\}$. Each instance is redundantly assigned to the $M$ workers. In the ideal case, a total of $N \times M$ responses would be observed. We denote the $i$-th arriving response by the tuple $(Q_n, A_{nm}^{i})$, where $Q_n$ denotes the identifier of instance $n$ and $A_{nm}^{i}$ is the corresponding categorical response provided by worker $m$ for instance $n$ at time step $i$ ($n \in \mathcal{N}$, $m \in \mathcal{M}$). The response takes values in $\{1,\dots,L\}$, where $L$ is the number of classes.

To formalize memory consolidation and replay, we introduce notation for historical data. Let $H_a^{i} = \{A_{uv}^{k} \mid u \in \mathcal{N}, v \in \mathcal{M}, k \in \{1,\dots,i\}\}$ be the set of all responses observed up to and including time step $i$. The set of instances that have received at least one response by time $i$ is $H_q^{i} = \{Q_u \mid A_{uv}^{k} \in H_a^{i}, u \in \mathcal{N}, v \in \mathcal{M}, k \in \{1,\dots,i\}\}$. For any instance $Q_u \in H_q^{i}$, its cumulative responses up to time step $i$ are $C_u^{i} = \{A_{uv}^{k} \mid v \in \mathcal{M}, k \in \{1,\dots,i\}\} \subseteq H_a^{i}$. For clarity of exposition, we first present the theoretical derivation of OLA-Incomplete under the setting where the response chunk size equals 1, i.e., exactly one response from one worker for one specific instance arrives at each time step. The extension to chunk sizes greater than 1 is straightforward by treating each chunk as a mini-batch in the same framework. All symbols used in OLA-Incomplete are summarized in Table 1.

Table 1.

All notations used in the proposed OLA-Incomplete framework

Notation  Description
$i$  the time step at which data are received
$n$  the index of the instance returned at the $i$-th time step
$u$  any index of a processed instance other than $n$ at the $i$-th time step
$e$  any index of a processed instance at the $i$-th time step
$Q_n$  the question identifier of the data received at the $i$-th time step
$A_{nm}^{i}$  the received response corresponding to $Q_n$ at the $i$-th time step
$H_a^{i}$  all historical responses received by the $i$-th time step
$H_q^{i}$  all historical questions that have been labeled by the $i$-th time step
$C_n^{i}$  the received historical responses of instance $Q_n$ by the $i$-th time step
$\tilde{C}_n^{i}$  the generated (replayed) historical responses of instance $Q_n$ by the $i$-th time step
$\tilde{S}_n^{i}$  the generated side information related to instance $Q_n$
$\tilde{G}_n^{i}$  the supervisory signals for label aggregation
$Y_n^{i}$  the inferred label of instance $Q_n$ at the $i$-th time step
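
To make this notation concrete, the sketch below tracks $H_a^{i}$, $H_q^{i}$, and $C_n^{i}$ as responses stream in. It is purely illustrative: the class name and tuple layout are our assumptions, and OLA-Incomplete itself never stores this raw history (it is consolidated into the generator instead).

```python
from collections import defaultdict

class ResponseHistory:
    """Bookkeeping for H_a^i, H_q^i, and C_n^i (illustrative only)."""

    def __init__(self):
        self.H_a = []               # all responses observed so far (H_a^i)
        self.C = defaultdict(list)  # C[n]: cumulative responses of instance n (C_n^i)

    def observe(self, n, m, label, step):
        """Record response A_{nm}^i arriving at time step `step`."""
        self.H_a.append((n, m, label, step))
        self.C[n].append((m, label, step))

    @property
    def H_q(self):
        """Instances that have received at least one response (H_q^i)."""
        return set(self.C.keys())

hist = ResponseHistory()
hist.observe(n=2, m=4, label=1, step=1)  # e.g., A_{24}^1 arrives at t_1
hist.observe(n=1, m=1, label=0, step=1)
print(hist.H_q, dict(hist.C))            # {1, 2} and per-instance cumulative sets
```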

Method

We first present the overall architecture of OLA-Incomplete as shown in Fig. 2, followed by the generative replay mechanism, the variational aggregator, and the joint training strategy.

Fig. 2. Illustration of the OLA-Incomplete workflow at time step $t_2$. At $t_1$, partial responses (e.g., $\{A_{11}^{1}, A_{24}^{1}\}$) are received. When new incomplete responses arrive at $t_2$ (e.g., $\{A_{22}^{2}, A_{34}^{2}\}$), the generator replays historical knowledge of previously seen instances (e.g., $\{\tilde{C}_1^{1}, \tilde{C}_2^{1}, \tilde{C}_3^{1}\}$) without reloading past data. The generator is then updated using both replayed and new responses, yielding refined cumulative responses and updated side information (e.g., $\{\tilde{S}_1^{2}, \tilde{S}_2^{2}, \tilde{S}_3^{2}\}$). Finally, the variational aggregator integrates replayed knowledge and side information to infer the updated truth labels $\{\hat{Y}_1^{2}, \hat{Y}_2^{2}, \hat{Y}_3^{2}\}$. This process illustrates how OLA-Incomplete preserves historical information while dynamically incorporating new, incomplete responses

Overall framework

OLA-Incomplete comprises two cooperating components: a generator and a variational aggregator. At time step $i$, consider the arrival of an incomplete response $A_{nm}^{i}$ from worker $m$ on instance $n$. The generator replays the cumulative responses for any previously seen instance $Q_n \in H_q^{i}$ as:

$$\tilde{C}_n^{i} \sim P_{\mathrm{gen}}(Q_n). \tag{1}$$

Within the replay process, the generator also produces instance-level side information represented by $\tilde{S}_n$. The replay distribution in (1) then factorizes as:

$$P(\tilde{C}_n^{i} \mid Q_n) = P(\tilde{C}_n^{i} \mid \tilde{S}_n, Q_n)\, P(\tilde{S}_n \mid Q_n). \tag{2}$$

The side information $\tilde{S}_n$ is treated as a latent variable within the replay module but is passed to the variational aggregator as an observed conditioning signal (i.e., the generator’s output). To perform online label aggregation, we adopt a truth-aware variational architecture in which the instantaneous ground truth is a latent variable denoted by $Y_n^{i}$. The aggregation model is:

$$Y_n^{i} \sim P_{\mathrm{agg}}(\tilde{S}_n, \tilde{C}_n^{i}). \tag{3}$$

Algorithm 1. Online Incomplete Label Aggregation.

Generative memory replay

Crowdsourcing platforms often follow a practical “use-and-discard” policy for raw data, so an online aggregator should learn from new responses without repeatedly loading the full history and without forgetting what has already been learned. To this end, we employ a generative memory replay module that summarizes past responses into a compact, reusable form and replays this summary when a new incomplete response arrives.

Let $A_{nm}^{i}$ denote the $i$-th arriving (incomplete) response for instance $Q_n$ from worker $m$. We examine the memory state immediately before and after this arrival. Before observing $A_{nm}^{i}$, the generator produces replayed, cumulative responses for each previously seen instance $Q_u \in H_q^{i-1}$:

$$\tilde{C}_u^{i-1} \sim P_{\mathrm{gen}}(Q_u), \quad Q_u \in H_q^{i-1}. \tag{4}$$

Intuitively, $\tilde{C}_u^{i-1}$ acts as the generator’s memory of what has been observed so far for $Q_u$. After receiving the new response, we extend the history to include both the current instance and its latest response:

$$Q_n \in H_q^{i}, \qquad A_{nm}^{i} \in C_n^{i}. \tag{5}$$

We build supervision signals by (i) keeping the previous replay target for all instances other than $Q_n$, and (ii) appending the just-arrived response to the target of $Q_n$:

$$\tilde{G}_e^{i} = \begin{cases} \tilde{C}_e^{i-1}, & \text{if } Q_e \in H_q^{i} \setminus \{Q_n\}, \\ \tilde{C}_e^{i-1} \oplus A_{em}^{i}, & \text{if } Q_e = Q_n, \end{cases} \tag{6}$$

where $\oplus$ indicates appending the new piece of evidence to the cumulative set. The generator is then trained to reproduce these targets via a standard cross-entropy objective:

$$\ell_1 = \frac{1}{|H_q^{i}|} \sum_{Q_e \in H_q^{i}} \big[ -\tilde{G}_e^{i} \log \tilde{C}_e^{i} \big], \qquad \tilde{C}_e^{i} \sim P_{\mathrm{gen}}(Q_e). \tag{7}$$

At $i=1$, no instance has been observed ($H_q^{0} = \emptyset$). When $A_{nm}^{1}$ arrives, (6) collapses to

$$\tilde{G}_n^{1} = \tilde{C}_n^{0} \oplus A_{nm}^{1}. \tag{8}$$

Because $\tilde{C}_n^{0}$ comes from a random initialization and carries no information, we simply discard it and supervise the generator with the new response alone:

$$\tilde{G}_n^{1} = A_{nm}^{1}. \tag{9}$$

This yields a clean, data-driven first update under (7). Afterwards, (6) and (7) continue to consolidate information as additional responses arrive.
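
The target construction in (6), together with the bootstrap rule (8)-(9), can be sketched as a small pure-Python routine; the data layout (lists of (worker, label) pairs) is our assumption for illustration.

```python
def build_targets(replayed, new_response, first_step=False):
    """Supervision targets G~_e^i per Eqs. (6), (8)-(9) for chunk size 1.

    replayed: instance id -> replayed cumulative responses C~_e^{i-1},
              each a list of (worker, label) pairs.
    new_response: the arriving tuple (n, m, label), i.e., A_{nm}^i.
    """
    n, m, label = new_response
    # Eq. (6), first case: all other instances keep their previous replay.
    targets = {e: list(resp) for e, resp in replayed.items()}
    if first_step:
        # Eq. (9): the random-init replay C~_n^0 is uninformative, so the
        # first target is the new response alone.
        targets[n] = [(m, label)]
    else:
        # Eq. (6), second case: append the just-arrived evidence for Q_n.
        targets.setdefault(n, []).append((m, label))
    return targets

# Instance 2 already has one replayed response; a new response from worker 5 arrives.
targets = build_targets({2: [(4, 1)]}, new_response=(2, 5, 0))
assert targets[2] == [(4, 1), (5, 0)]
```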

Replay is accurate if two natural conditions hold: (i) before each update, the generator reproduces the correct cumulative responses for all previously seen instances, and (ii) training under (7) reaches a reusable state at that step. Formally, for any historical instance $Q_u \in H_q^{i-1}$,

$$C_u^{i-1} \approx \tilde{C}_u^{i-1}, \tag{10}$$

and with the new observation $A_{nm}^{i}$ incorporated, the constructed targets match the updated cumulative set:

$$\tilde{G}_e^{i} \approx C_e^{i}, \quad \forall\, Q_e \in H_q^{i}. \tag{11}$$

To check these conditions in practice, we compute a replay accuracy that measures the agreement between the replayed outputs $\tilde{C}_e^{i}$ and the targets $\tilde{G}_e^{i}$. When replay accuracy exceeds a preset threshold, we early stop the generator’s update at step $i$ to avoid overfitting noise and to keep training efficient (see Section 4.4). Under unbiased consolidation, the above routine applies recursively to the entire response stream, letting the system absorb new information over time without forgetting prior responses.
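
Under one possible encoding of the replayed outputs (a class-logit block per worker, with unobserved worker slots masked out), the cross-entropy objective $\ell_1$ in (7) reduces to a masked cross-entropy. The tensor shapes below are assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def generator_replay_loss(replayed_logits, targets):
    """Replay objective l1 (Eq. 7), averaged over the instances in H_q^i.

    replayed_logits: (|H_q|, M, L) per-worker class logits for C~_e^i.
    targets:         (|H_q|, M) target labels G~_e^i, with -1 marking
                     worker slots that are still unobserved (ignored).
    """
    _, _, n_classes = replayed_logits.shape
    return F.cross_entropy(replayed_logits.reshape(-1, n_classes),
                           targets.reshape(-1),
                           ignore_index=-1)

logits = torch.randn(3, 5, 2, requires_grad=True)  # 3 instances, 5 workers, 2 classes
g_targets = torch.tensor([[1, -1, -1, 0, -1],
                          [0, 0, -1, -1, 1],
                          [-1, 1, 1, -1, -1]])
loss = generator_replay_loss(logits, g_targets)
loss.backward()  # a real step would then update the generator's parameters
```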

Truth inference with side information

At time step $i$, a new incomplete response $A_{nm}^{i}$ arrives for instance $Q_n$. Updating the model using $A_{nm}^{i}$ alone can be fragile, because a single response may be noisy or unrepresentative. To stabilize learning, the variational aggregator is fed with supervision signals $\{\tilde{G}_e^{i} \mid Q_e \in H_q^{i}\}$ that combine (i) replayed cumulative responses for previously seen instances, $\{\tilde{C}_u^{i-1} \mid Q_u \in H_q^{i-1}\}$, with (ii) the newly received response $A_{nm}^{i}$. Intuitively, $\tilde{G}_e^{i}$ summarizes everything the system knows up to step $i$ and is treated as fully observed supervision at that step.

In addition to supervision, the generator supplies instance-level side information $\tilde{S}_e$ (e.g., embeddings or features for $Q_e$), which we use as a conditioning signal for the aggregator. Given $(\tilde{S}_e, \tilde{G}_e^{i})$, the aggregator performs variational inference over the latent truth $Y_e^{i}$ by maximizing the evidence lower bound (ELBO) on the joint log-likelihood $\log P(\tilde{S}_e, \tilde{G}_e^{i})$:

$$\log P(\tilde{S}_e, \tilde{G}_e^{i}) \geq \mathcal{L} = \mathbb{E}_{P(Y_e^{i} \mid \tilde{S}_e, \tilde{G}_e^{i})}\big[\log P(\tilde{S}_e, \tilde{G}_e^{i} \mid Y_e^{i})\big] - D_{\mathrm{KL}}\big(P(Y_e^{i} \mid \tilde{S}_e, \tilde{G}_e^{i}) \,\big\|\, P(Y_e^{i})\big), \tag{12}$$

where the first term encourages the latent truth to explain the observed signals and the KL term regularizes the posterior toward the prior, improving stability under partial observations. Here $P(Y_e^{i} \mid \cdot)$ is the aggregator-parameterized variational posterior and $P(Y_e^{i})$ is the prior over latent truths.

The expectation in (12) is not available in closed form. Following the reparameterization trick used in variational autoencoders (Kingma and Welling 2014), we approximate it by categorical sampling over the L classes:

$$\mathcal{L} \approx \sum_{l=1}^{L} P(Y_e^{i} = l \mid \tilde{S}_e, \tilde{G}_e^{i}) \log P(\tilde{S}_e, \tilde{G}_e^{i} \mid Y_e^{i} = l) - D_{\mathrm{KL}}\big(P(Y_e^{i} \mid \tilde{S}_e, \tilde{G}_e^{i}) \,\big\|\, P(Y_e^{i})\big). \tag{13}$$

This yields a tractable objective that balances (i) data fit given a candidate label l and (ii) deviation from the prior.

We extend the ELBO from a single instance to all instances in $H_q^{i}$ and introduce a coefficient $\lambda$ to control the strength of the KL regularizer:

$$\ell_2 = \frac{1}{|H_q^{i}|} \sum_{Q_e \in H_q^{i}} \Big[ \sum_{l=1}^{L} q(Y_e^{i} = l \mid \tilde{S}_e, \tilde{G}_e^{i}) \log P(\tilde{S}_e, \tilde{G}_e^{i} \mid Y_e^{i} = l) - \lambda\, D_{\mathrm{KL}}\big(q(Y_e^{i} \mid \tilde{S}_e, \tilde{G}_e^{i}) \,\big\|\, P(Y_e^{i})\big) \Big]. \tag{14}$$

Here q(·) denotes the aggregator’s variational posterior and λ trades off fit and regularization; larger λ places more weight on prior consistency, which is helpful when arrivals are sparse or noisy.

The variational aggregator maintains worker-specific reliability parameters that modulate the likelihood term in the ELBO and determine how strongly each worker’s response informs the latent truth. Concretely, each worker is associated with a set of unconstrained parameters that we map to a normalized reliability representation (via a worker-wise softmax) to ensure identifiability and keep the parameters on a probabilistic simplex. These parameters are part of the aggregator and are updated online at every time step by maximizing the ELBO objective in (14) over the current mixture of replayed supervision and newly arrived responses. In practice, only workers that appear in the replayed targets or in the current incomplete response receive nonzero gradients at a given step, while others remain unchanged. The replayed cumulative supervision prevents a single noisy arrival from dominating the reliability estimates and provides a consistent historical context for gradient-based updates.
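
A minimal sketch of this reliability parameterization; the confusion-matrix-style layout is our assumption, since the text specifies only unconstrained parameters mapped through a worker-wise softmax.

```python
import torch

M, L = 5, 2                            # workers and classes (example sizes)
# Unconstrained per-worker reliability parameters (assumed layout).
theta = torch.zeros(M, L, L, requires_grad=True)

def worker_reliability(theta):
    """Map unconstrained parameters onto the probability simplex with a
    worker-wise softmax; row l can be read as P(response | true label = l)."""
    return torch.softmax(theta, dim=-1)

rel = worker_reliability(theta)
assert torch.allclose(rel.sum(dim=-1), torch.ones(M, L))
# When (14) is optimized, only workers present in the replayed targets or the
# current chunk receive nonzero gradients, as described above.
```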

Once trained at step $i$, the instantaneous prediction for $Q_e$ is obtained by choosing the most probable class:

$$\hat{Y}_e^{i} = \operatorname*{arg\,max}_{l \in \{1,\dots,L\}} q(Y_e^{i} = l \mid \tilde{S}_e, \tilde{G}_e^{i}). \tag{15}$$

In practice, this corresponds to selecting the label with the highest posterior probability under the aggregator conditioned on $(\tilde{S}_e, \tilde{G}_e^{i})$.
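
For categorical truths, the objective in (13)-(14) and the prediction rule (15) admit a compact tensor form. The sketch below assumes the aggregator exposes per-class posteriors and log-likelihoods as tensors; shapes and names are illustrative, not the authors' exact code.

```python
import torch

def neg_elbo(posterior, log_lik, prior, lam=1e-2):
    """Negative of the objective l2 in Eq. (14) for one mini-batch.

    posterior: (B, L) variational probabilities q(Y = l | S~, G~).
    log_lik:   (B, L) values of log P(S~, G~ | Y = l).
    prior:     (L,)  prior P(Y); lam is the KL weight lambda.
    """
    eps = 1e-8
    # Expected log-likelihood: explicit sum over the L candidate labels.
    expected_ll = (posterior * log_lik).sum(dim=1)
    # Closed-form KL(q || prior) for categorical distributions.
    kl = (posterior * (torch.log(posterior + eps)
                       - torch.log(prior + eps))).sum(dim=1)
    return -(expected_ll - lam * kl).mean()

posterior = torch.tensor([[0.9, 0.1], [0.3, 0.7]])
log_lik = torch.log(torch.tensor([[0.8, 0.2], [0.4, 0.6]]))
prior = torch.tensor([0.5, 0.5])
loss = neg_elbo(posterior, log_lik, prior)
y_hat = posterior.argmax(dim=1)   # Eq. (15): tensor([0, 1])
```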

Online training strategy

Recalling (7), for each $Q_e \in H_q^{i}$, let $\tilde{G}_e^{i}$ be the supervision target and $\tilde{C}_e^{i}$ the current generator output. To assess generator learning after each newly received response, we define the replay accuracy as:

$$\eta^{i} = \frac{1}{|H_q^{i}|} \sum_{Q_e \in H_q^{i}} \mathbb{I}\big(\tilde{G}_e^{i}, \tilde{C}_e^{i}\big), \tag{16}$$

where $\mathbb{I}(\cdot,\cdot) = 1$ if $\tilde{G}_e^{i}$ and $\tilde{C}_e^{i}$ are element-wise equal and $0$ otherwise. We compute $\eta^{i}$ at fixed intervals and apply early stopping to the generator once $\eta^{i}$ exceeds a replay threshold $\gamma$. The threshold can simply be set to 1.0 for unbiased replay. Furthermore, early stopping based on replay accuracy prevents over-convergence of the memory module. The detailed online procedure is summarized in Algorithm 1, where the generator and the variational aggregator are updated in alternation.
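
The replay-accuracy check and the early-stopping loop can be sketched as follows; `generator_step` is a hypothetical callable standing in for one optimization epoch of the generator.

```python
def replay_accuracy(targets, outputs):
    """eta^i (Eq. 16): fraction of instances whose replayed output matches
    the supervision target element-wise."""
    matches = sum(1 for e in targets if targets[e] == outputs.get(e))
    return matches / max(len(targets), 1)

def train_generator(generator_step, targets, gamma=1.0, max_epochs=200, every=5):
    """Run generator updates, checking eta^i at fixed intervals and stopping
    early once it reaches the replay threshold gamma."""
    for epoch in range(1, max_epochs + 1):
        outputs = generator_step()          # one epoch; returns current C~_e^i
        if epoch % every == 0 and replay_accuracy(targets, outputs) >= gamma:
            break                           # replay is accurate enough to reuse

# Toy usage with a dummy step that already reproduces the targets.
targets = {1: [(1, 0)], 2: [(4, 1)]}
train_generator(lambda: dict(targets), targets, gamma=1.0, every=1)
```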

Experiments

Experimental setup

We evaluate OLA-Incomplete and baseline methods on three real-world crowdsourcing datasets. Table 2 summarizes the statistics of all datasets.

  • Duck (Welinder et al. 2010) is an image classification dataset in which workers indicate whether an image contains a duck. It comprises 4,212 incomplete responses from 39 workers.

  • RTE (Snow et al. 2008) targets the textual entailment recognition task with 800 items and 164 workers. Each item presents a premise–hypothesis pair, and workers choose whether the hypothesis can be inferred from the premise.

  • PostSent (Zheng et al. 2017) is a sentiment dataset of 1,000 company-related tweets. Workers label whether a tweet expresses positive sentiment toward the company.

Table 2.

Statistics of all datasets.

Dataset Instances Workers Responses Domain
Duck 108 39 4212 Duck Image Identification
RTE 800 164 8000 Textual Entailment Understanding
PostSent 1000 85 20000 Sentiment Analysis

Baselines

We compare OLA-Incomplete against five representative baselines spanning majority voting, EM-based methods, and neural approaches, including the recent state-of-the-art BiLA.

  • Majority Voting (MV) infers the truth label as the most frequent worker-provided label for each instance.

  • Online EM (OEM) (Cappé and Moulines 2009) incrementally updates worker-specific confusion matrices and class priors as new instances arrive, and carries parameter estimates across time steps.

  • Stochastic EM (SEM) (Chen et al. 2018) performs EM updates using single-sample sufficient statistics, enabling stable incremental optimization without access to the full dataset.

  • Label-aware Autoencoders (LAA) (Yin et al. 2017) cast label aggregation as an autoencoding problem in which the encoder serves as a classifier that predicts latent truths.

  • Variational Bayesian label aggregation (BiLA) (Hong et al. 2021) optimizes a variational objective by minimizing the KL divergence between a recognition (neural) posterior and a generative posterior.

For MV and OEM, we treat all responses observed up to the current time step as available evidence and update the model accordingly. The resulting parameters are propagated to the next step. For LAA, we feed only the current incomplete chunk in order to avoid reloading historical responses. SEM and BiLA are executed according to their original online update protocols.
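
For concreteness, the MV protocol above (re-aggregating all responses observed so far at each step) can be sketched in a few lines:

```python
from collections import Counter, defaultdict

def majority_vote(responses):
    """Aggregate (instance, worker, label) tuples seen so far; per instance,
    the most frequent label wins (ties broken by first occurrence)."""
    votes = defaultdict(Counter)
    for n, _, label in responses:
        votes[n][label] += 1
    return {n: c.most_common(1)[0][0] for n, c in votes.items()}

print(majority_vote([(1, 1, 0), (1, 2, 1), (1, 3, 1), (2, 4, 0)]))  # {1: 1, 2: 0}
```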

Performance metrics

On crowdsourcing platforms, label aggregation proceeds continuously as responses arrive. To capture performance in this online setting, we adopt two complementary metrics: instantaneous accuracy and final accuracy. Instantaneous accuracy measures the proportion of correctly inferred labels at each time step, providing a view of how quickly and stably the model adapts during the response stream. Final accuracy is the accuracy achieved after all responses have been received, which corresponds to the performance of an offline label aggregation method on the complete dataset. Importantly, ground-truth labels are used solely for evaluation and are never used for training.
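
Both metrics reduce to the same computation evaluated at different points of the stream; a minimal sketch, assuming predictions and ground truth are dictionaries keyed by instance id:

```python
def accuracy(predictions, ground_truth):
    """Instantaneous accuracy over instances labeled so far; evaluating this
    after the final chunk yields the final accuracy. Ground truth is used
    only for evaluation, never for training."""
    scored = [n for n in predictions if n in ground_truth]
    if not scored:
        return 0.0
    return sum(predictions[n] == ground_truth[n] for n in scored) / len(scored)

print(accuracy({1: 1, 2: 0, 3: 1}, {1: 1, 2: 1, 3: 1}))  # 0.666...
```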

Implementation details

Instances are encoded as one-hot vectors by their identities. For a given instance, each incomplete response is represented as a one-hot block where the position corresponding to the observed response is set to one. The supervisory signal of cumulative responses is constructed by concatenating multiple incomplete responses in worker-identity order. The generator uses a single hidden layer of size 64 with a sigmoid activation. The variational aggregator uses a single hidden layer whose size equals the number of classes L. The output layers of both modules apply a worker-wise softmax, i.e., the softmax operation is performed independently on each worker-specific block within the concatenated cumulative-answer embedding (of size N×M).

Learning rates are set to $1\times10^{-2}$ for the generator and $5\times10^{-4}$ for the aggregator. We sweep the replay threshold $\gamma \in \{0.990, 0.992, 0.994, 0.996, 0.998, 1.000\}$ and the KL weight $\lambda \in [10^{-4}, 10^{-1}]$ as reported in the sensitivity analyses. Chunk sizes follow $\{1, 20, 40, 60, 80, 100\}$. To simulate asynchronous arrivals, we randomly permute all responses once per run and stream them sequentially. For reproducibility, we run each experiment with three random seeds $\{1, 2, 3\}$: each seed controls the response-order shuffle and parameter initialization.

All models are implemented in PyTorch and trained on two NVIDIA Tesla V100 GPUs. Early stopping for the generator follows the replay-accuracy criterion and training stops when the replay accuracy reaches the threshold γ or the maximum epoch budget is met (Table 3).
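
The worker-wise softmax described above can be realized by reshaping the flat embedding into per-worker blocks and normalizing each block independently; the block layout below is our assumption.

```python
import torch

def worker_wise_softmax(x, n_workers, n_classes):
    """Apply softmax independently to each worker-specific block of a flat
    (batch, n_workers * n_classes) embedding."""
    blocks = x.view(-1, n_workers, n_classes)
    return torch.softmax(blocks, dim=-1).view(x.shape)

x = torch.randn(2, 5 * 2)   # 2 instances, 5 workers, 2 classes
probs = worker_wise_softmax(x, n_workers=5, n_classes=2)
# Each worker block now sums to one.
assert torch.allclose(probs.view(2, 5, 2).sum(dim=-1), torch.ones(2, 5))
```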

Table 3.

Summary of hyperparameters

Component Hyperparameter Value / Range
Generator Hidden layer size 64
Generator Activation Sigmoid
Aggregator Hidden layer size L (number of classes)
Both modules Output normalization Worker-wise softmax
Optimization Learning rate (generator) $1\times10^{-2}$
Optimization Learning rate (aggregator) $5\times10^{-4}$
Regularization KL weight $[10^{-4}, 10^{-1}]$
Replay control Replay threshold {0.990, 0.992, 0.994, 0.996, 0.998, 1.000}
Streaming setup Chunk size {1, 20, 40, 60, 80, 100}
Stochasticity Random seeds {1, 2, 3} (report mean ± std)
System Framework / GPUs PyTorch / 2× Tesla V100
Early stopping Criterion Stop when replay accuracy ≥ γ

Ranges correspond to sweeps used in sensitivity and ablation studies

Performance

Final accuracy

Table 4 summarizes the final accuracy of OLA-Incomplete and all baselines across different datasets and chunk sizes. Overall, OLA-Incomplete achieves the best final accuracy on all three datasets and remains stable across chunk sizes. On Duck, it attains 90.74% at chunk size 40; on RTE, 92.50% at chunk size 100; and on PostSent, 95.99% at chunk size 20. These results yield at least a 7.79% relative improvement over the second-best baseline.

Table 4.

Overall performance of the aggregated labels among different methods in terms of the final accuracy (%)

Datasets Chunk size MV OEM SEM LAA BiLA OLA-Incomplete
Duck 1 63.89 75.92 71.29 57.40 69.44 87.96
20 63.89 55.56 64.81 54.63 59.26 89.81
40 64.81 57.41 62.04 58.33 59.26 90.74
60 65.74 64.81 62.96 52.78 59.26 87.04
80 64.81 79.63 71.30 51.85 64.81 88.89
100 61.11 71.30 68.52 53.70 56.48 90.74
RTE 1 74.12 71.06 74.12 74.12 50.13 91.75
20 73.88 64.13 74.75 74.13 59.75 92.13
40 73.75 59.38 73.75 74.25 62.13 92.25
60 73.75 59.13 71.25 72.75 58.38 92.12
80 73.63 63.25 69.25 73.50 61.00 92.00
100 74.00 59.75 68.50 73.63 60.13 92.50
PostSent 1 78.60 59.56 78.60 66.50 47.80 91.10
20 67.40 80.60 60.80 65.30 65.60 95.99
40 67.20 83.40 64.20 67.00 64.10 89.90
60 67.20 84.50 66.70 67.30 64.50 93.50
80 67.10 85.30 67.00 66.60 65.10 93.90
100 67.10 87.70 67.30 65.40 64.60 95.20

The bold numbers are the best results and the underlined numbers are the second best results

Among baselines, MV exhibits relatively stable but lower performance, consistent with the setting where many workers are reliable but worker heterogeneity is not modeled. OEM performs weakly on Duck and RTE yet is competitive on PostSent, suggesting sensitivity to dataset characteristics and response distributions. The neural or stochastic variants (SEM, LAA, BiLA) do not consistently counteract noisy or imbalanced response chunks in the online, incomplete setting, which can hinder convergence to strong optima. By contrast, OLA-Incomplete leverages generative replay to reuse historical evidence without reloading past data and, together with the variational aggregator, maintains robustness to the arrival pattern and chunk size.

Accuracy at each chunk

Instantaneous accuracy provides a finer-grained view of online performance as response chunks arrive (Fig. 3). OLA-Incomplete improves rapidly in the early stages of the stream and then stabilizes as more responses are processed, consistently outperforming all baselines across datasets. The only exception is that OEM approaches our curve on PostSent; however, it lags notably on Duck and RTE, indicating sensitivity to dataset characteristics and arrival patterns. Overall, the results highlight the robustness of OLA-Incomplete to both chunk size and response order.

Fig. 3. Instantaneous accuracy on different chunk sizes and datasets

Beyond predictive performance, it is also important to examine whether such accuracy gains come at the cost of higher computational demand. To this end, we measured the training and inference times of OLA-Incomplete and compared them with all baselines under the same hardware setup (two Tesla V100 GPUs).

Table 5 reports wall-clock training time, measured end-to-end over the entire stream, as well as the average per-chunk inference time in seconds. As expected, MV, OEM, and SEM are lightweight in terms of computation but achieve substantially lower accuracy. In contrast, the deep neural baselines LAA and BiLA incur considerably higher training costs. OLA-Incomplete introduces the replay mechanism yet remains close to BiLA in efficiency: its training time is within approximately 5–8% of BiLA on Duck and RTE, and it is comparable on PostSent. The inference time differs by no more than 0.2 seconds across datasets; OLA-Incomplete is slightly faster than BiLA on RTE but slightly slower on Duck and PostSent. Importantly, OLA-Incomplete does not store raw historical responses. Instead, the generative replay is parameterized, so the memory footprint arises mainly from model parameters and the tensors of the current chunk, which scale with the number of workers and classes, rather than from an explicit replay buffer. In summary, OLA-Incomplete achieves superior instantaneous and final accuracy while maintaining training and inference costs comparable to existing variational baselines.

Table 5.

Computational efficiency: wall-clock training time and average per-chunk inference time (seconds)

Method Duck (s) RTE (s) PostSent (s)
 Training Inference Training Inference Training Inference
MV / 0.01 / 0.07 / 0.07
OEM 0.79 0.21 1.29 0.64 3.28 0.83
SEM 0.84 0.27 2.40 0.77 4.45 0.86
LAA 64.15 0.71 224.63 4.72 437.49 5.87
BiLA 71.83 0.83 256.17 6.49 460.03 7.37
OLA-Incomplete 75.54 0.88 276.56 6.31 462.86 7.54

“/” indicates that the method has no training phase

Accuracy for one response at each step

To stress-test the online setting, we set the chunk size to 1, i.e., exactly one worker–instance response arrives per time step. As shown in Fig. 4, OLA-Incomplete maintains strong performance in this extreme streaming regime, achieving final accuracy above 90% on all three datasets and surpassing the baselines throughout most of the stream.

Fig. 4. Instantaneous accuracy for one response at each step

Accuracy for offline label aggregation

Although online aggregation is the practical deployment scenario, when all responses are available the task reduces to offline label aggregation. We therefore evaluate all methods in the offline setting (Table 6). As expected, MV underperforms because it ignores worker reliability. OLA-Incomplete attains the best results or is on par with the best across datasets, demonstrating that the proposed online design does not sacrifice offline accuracy and transfers effectively to the fully observed case.

Table 6.

The overall performance of offline label aggregation is reported in terms of the final accuracy (%)

Datasets Duck RTE PostSent
MV 86.93 90.14 87.50
OEM 90.71 92.75 95.74
SEM 90.71 92.70 95.68
LAA 90.72 92.70 95.78
BiLA 90.71 92.66 95.75
OLA-Incomplete 90.72 92.70 95.75

Bold numbers indicate the best results, and underlined numbers denote the second-best results

Sensitivity analysis

Sensitivity to the replay threshold

We evaluate how the replay threshold γ affects aggregation performance (Table 7). The threshold controls when the generator stops training on an incoming chunk: a larger γ enforces more accurate replay of historical responses but typically incurs higher training cost. We sweep γ from 0.990 to 1.000 across all datasets.

Table 7.

Sensitivity analysis on the replay threshold γ in terms of final accuracy (%) with chunk size equal to 100

Replay Threshold Duck RTE PostSent
0.990 87.93 69.88 58.30
0.992 89.81 73.87 71.10
0.994 89.81 79.50 73.20
0.996 90.74 84.87 83.30
0.998 90.74 89.25 90.20
1.000 90.74 92.50 95.20

Overall, increasing γ improves final accuracy, especially when chunk sizes are large relative to the dataset. When the chunk size is small (i.e., each chunk covers a small fraction of the dataset), lowering γ has a limited impact because fewer replays are required and replay errors are less likely to accumulate. In contrast, for larger chunks, stricter replay (higher γ) yields consistent gains, illustrating that the constraint on the generator helps prevent error amplification. In practice, γ should be selected based on deployment priorities: choose values close to 1.0 for accuracy-sensitive settings, and slightly smaller values for efficiency-sensitive scenarios.

Sensitivity to random seeds

Our online experiments randomize the response order once per run to simulate asynchronous arrivals; each random seed governs both the permutation of responses and parameter initialization. Table 8 reports the variability across seeds. We observe that MV, OEM, and LAA are noticeably affected by the shuffled arrival order, indicating sensitivity to streaming permutations. In contrast, OLA-Incomplete remains robust across seeds and chunk sizes, maintaining consistently strong performance. This robustness suggests that generative replay, coupled with the variational aggregator, effectively mitigates order-induced variance in the online, incomplete setting.

Table 8.

Sensitivity analysis on the arrival order of responses in terms of final accuracy (%) with chunk sizes of 1 and 100

Datasets Chunk Size Random Seed MV OEM SEM LAA BiLA OLA-Incomplete
Duck 1 1 63.89 75.92 71.29 57.40 69.44 87.96
  2 60.72 72.31 71.26 54.92 69.44 87.96
  3 63.48 66.48 71.29 52.77 69.44 87.96
  Mean±Std 62.70±1.41 71.57±3.89 71.28±0.00 55.03±1.89 69.44±0.00 87.96±0.00
  95% CI [58.41, 66.98] [59.74, 83.40] [71.24, 71.32] [49.27, 60.79] [69.44, 69.44] [87.96, 87.96]
 100 1 61.11 71.30 68.52 53.70 56.48 90.74
  2 58.16 72.18 68.52 50.74 55.78 90.74
  3 59.11 67.39 68.49 50.36 56.48 90.74
  Mean±Std 59.46±1.23 70.29±2.08 68.51±0.02 51.60±1.49 56.25±0.32 90.74±0.00
  95% CI [55.72, 63.20] [63.96, 76.62] [68.47, 68.55] [47.06, 56.14] [55.24, 57.25] [90.74, 90.74]
RTE 1 1 74.12 71.06 74.12 74.12 50.13 91.75
  2 71.11 68.58 71.08 72.28 49.34 91.75
  3 73.34 63.75 71.12 69.83 50.13 91.64
  Mean±Std 72.86±1.28 67.80±3.04 72.11±1.42 72.08±1.76 49.87±0.37 91.71±0.05
  95% CI [68.98, 76.74] [58.56, 77.03] [67.77, 76.44] [66.73, 77.42] [48.98, 50.76] [91.57, 91.85]
 100 1 74.00 59.75 68.50 73.63 60.13 92.50
  2 71.14 56.75 68.50 70.45 60.13 92.50
  3 72.76 58.95 68.50 71.04 60.13 92.50
  Mean±Std 72.63±1.28 58.48±1.64 68.50±0.00 71.71±0.72 60.13±0.00 92.50±0.00
  95% CI [69.13, 76.14] [54.77, 62.19] [68.50, 68.50] [70.02, 73.41] [60.13, 60.13] [92.50, 92.50]
PostSent 1 1 78.60 59.56 78.60 66.50 47.80 91.10
  2 74.86 58.53 78.54 64.36 47.81 91.10
  3 76.13 59.21 78.54 66.43 47.80 91.07
  Mean±Std 76.53±1.55 59.10±0.43 78.56±0.03 65.76±0.50 47.80±0.00 91.09±0.02
  95% CI [71.81, 81.25] [57.80, 60.40] [78.47, 78.65] [62.74, 68.78] [47.79, 47.82] [91.05, 91.13]
 100 1 67.10 87.70 67.30 65.40 64.40 95.20
  2 66.93 85.26 67.30 65.35 64.41 95.20
  3 67.04 83.71 67.37 64.67 64.41 95.20
  Mean±Std 67.02±0.06 85.56±1.64 67.32±0.03 65.14±0.33 64.41±0.00 95.20±0.00
  95% CI [66.81, 67.24] [80.56, 90.55] [67.22, 67.42] [64.06, 66.32] [64.41, 64.41] [95.20, 95.20]

Results are reported for three random seeds as well as mean ± standard deviation; the bracketed term reports the 95% CI across seeds

Ablation study

We evaluate the contribution of the side information produced by the memory module by comparing the full OLA-Incomplete with a variant whose variational aggregator operates without side information. Table 9 reports final accuracies across datasets and replay thresholds. Overall, side information yields consistent gains, and the improvement is most pronounced when the replay threshold γ < 1.0 and the chunk size is small relative to the dataset. In this regime, the generator may imperfectly replay historical responses. Because replayed outputs become supervisory signals at subsequent time steps, small inaccuracies can accumulate over time. The side information provides an additional instance-specific signal that regularizes the aggregator and mitigates error propagation, leading to higher final accuracy and more stable online trajectories. As expected, when γ = 1.0 the generator is required to reproduce historical responses exactly, and the two variants converge to similar performance, indicating that side information is particularly valuable in realistic, efficiency-oriented settings where strict replay is relaxed.

Table 9.

Ablation study on side information in terms of final accuracy (%) with chunk size equal to 100

Replay Threshold Duck RTE PostSent
w/o si w/ si w/o si w/ si w/o si w/ si
0.990 87.04 87.93 66.21 69.88 54.27 58.30
0.992 87.04 89.81 68.50 73.87 69.50 71.10
0.994 87.96 89.81 77.25 79.50 70.80 73.20
0.996 87.96 89.81 79.25 84.87 74.21 83.30
0.998 87.96 90.74 88.88 89.25 89.60 90.20
1.000 90.74 90.74 92.13 92.50 95.50 95.80

Conclusion

This work tackled the problem of online label aggregation under incomplete crowdsourced responses, a setting where only a subset of workers reply at each time step and historical responses may not be accessible. We proposed OLA-Incomplete, a variational framework that integrates a generative replay module with a truth-aware aggregator enhanced by side information, and introduced a replay-accuracy criterion to balance efficiency and fidelity. Extensive experiments on three public datasets demonstrated that OLA-Incomplete consistently achieves state-of-the-art performance, while maintaining robust instantaneous accuracy across different chunk sizes and random arrival orders. Notably, the method remains effective even in the extreme case of receiving one response per step and performs on par with or better than offline baselines when all responses are available. These findings highlight the practical value of OLA-Incomplete for real-time crowdsourcing applications, and point to several promising directions for future work, including adaptive replay control, compute-efficient replay strategies, robustness to adversarial workers, integration with task assignment, and theoretical guarantees on stability and uncertainty.

Data Availability

Data will be made available on request.

Declarations

Competing interests

The authors certify that there is NO conflict of interest in relation to this work.

Footnotes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Haoyu Liu, Email: liuhaoyu03@corp.netease.com.

Runze Wu, Email: wurunze1@corp.netease.com.

Hui Liu, Email: liuhui@pumc.edu.cn.

References

  1. Aydin B, Yilmaz YS, Li Y, Li Q, Gao J, Demirbas M (2014) Crowdsourcing for multiple-choice question answering. In: Proceedings of the AAAI conference on artificial intelligence, pp 2946–2953
  2. Bahutair M, Bouguettaya A, Neiat AG (2023) Multi-use trust in crowdsourced IoT services. IEEE Trans Serv Comput 16:1268–1281. 10.1109/TSC.2022.3160469
  3. Balayn A, Soilis P, Lofi C, Yang J, Bozzon A (2021) What do you mean? Interpreting image classification with crowdsourced concept extraction and analysis. In: Proceedings of the web conference 2021, Association for Computing Machinery, New York, NY, USA, pp 1937–1948
  4. Cai D, Nguyen DT, Lim SH, Wynter L (2020) Variational Bayesian inference for crowdsourcing predictions. In: 2020 59th IEEE Conference on Decision and Control (CDC), IEEE, pp 3166–3172
  5. Cappé O, Moulines E (2009) On-line expectation-maximization algorithm for latent data models. J R Stat Soc Ser B Stat Methodol 71:593–613
  6. Chen J, Zhu J, Teh YW, Zhang T (2018) Stochastic expectation maximization with variance reduction. Adv Neural Inf Process Syst 31
  7. Chen S, Ge W, Liang X, Jin X, Du Z (2024) Lifelong learning with deep conditional generative replay for dynamic and adaptive modeling towards net zero emissions target in building energy system. Appl Energy 353:122189
  8. Chen Z, Jiang L, Li C (2022) Label augmented and weighted majority voting for crowdsourcing. Inf Sci 606:397–409. 10.1016/j.ins.2022.05.066
  9. Daniel F, Kucherbaev P, Cappiello C, Benatallah B, Allahbakhsh M (2018) Quality control in crowdsourcing: a survey of quality attributes, assessment techniques, and assurance actions. ACM Comput Surv 51:1–40
  10. Dawid AP, Skene AM (1979) Maximum likelihood estimation of observer error-rates using the EM algorithm. J R Stat Soc Ser C (Appl Stat) 28:20–28
  11. Feng J, Li G, Wang H, Feng J (2014) Incremental quality inference in crowdsourcing. In: Database Systems for Advanced Applications. Springer International Publishing, Cham, pp 453–467
  12. Fu M, Zhang Z, Wang Z, Chen D (2024) The multi-objective task assignment scheme for software crowdsourcing platforms involving new workers. J King Saud Univ Comput Inf Sci 36:102237. https://www.sciencedirect.com/science/article/pii/S1319157824003264
  13. Galland A, Abiteboul S, Marian A, Senellart P (2010) Corroborating information from disagreeing views. In: Proceedings of the third ACM international conference on web search and data mining, Association for Computing Machinery, New York, NY, USA, pp 131–140. 10.1145/1718487.1718504
  14. Grassi L, Ciranni M, Baglietto P, Recchiuto CT, Maresca M, Sgorbissa A (2023) Emergency management through information crowdsourcing. Inf Process Manag 60:103386. https://www.sciencedirect.com/science/article/pii/S0306457323001231, 10.1016/j.ipm.2023.103386
  15. Groenen I, Rudinac S, Worring M (2023) Panorams: automatic annotation for detecting objects in urban context. IEEE Trans Multimedia 26:1281–1294
  16. Hong C, Ghiassi A, Zhou Y, Birke R, Chen LY (2021) Online label aggregation: a variational Bayesian approach. In: Proceedings of the Web Conference 2021, pp 1904–1915
  17. Hou S, Li S, Jahani-Nezhad T, Caire G (2025) Priroagg: achieving robust model aggregation with minimum privacy leakage for federated learning. IEEE Trans Inf Forensics Secur 20:5690–5704
  18. Huang W, Li P, Li B, Liu Q, Nie L, Bao H (2024) Three-sided online stable task assignment in spatial crowdsourcing. Inf Sci 654:119878. 10.1016/j.ins.2023.119878
  19. Kingma DP, Welling M (2014) Auto-encoding variational Bayes. In: 2nd International Conference on Learning Representations (ICLR 2014)
  20. Kodjiku SL, Han T, Fang Y, Aggrey ESE, Sey C, Asamoah KO, Fiasam LD, Aidoo E, Wang X (2023) Wqcrowd: secure blockchain-based crowdsourcing framework with multi-tier worker quality evaluation. J King Saud Univ Comput Inf Sci 35:101843. 10.1016/j.jksuci.2023.101843
  21. Li FF (2010) ImageNet: crowdsourcing, benchmarking & other cool things. CMU VASC Seminar 16:18–25
  22. Li H, Wu J, Braverman V (2023) Fixed design analysis of regularization-based continual learning. In: Conference on lifelong learning agents, PMLR, pp 513–533
  23. Li Q, Li Y, Gao J, Su L, Zhao B, Demirbas M, Fan W, Han J (2014) A confidence-aware approach for truth discovery on long-tail data. Proc VLDB Endow 8:425–436
  24. Li Q, Li Y, Gao J, Zhao B, Fan W, Han J (2014) Resolving conflicts in heterogeneous data by truth discovery and source reliability estimation. In: Proceedings of the 2014 ACM SIGMOD international conference on management of data, pp 1187–1198
  25. Li SY, Huang SJ, Chen S (2021) Crowdsourcing aggregation with deep Bayesian learning. Sci China Inf Sci 64:130104
  26. Li Y, Sun H, Wang WH (2020) Towards fair truth discovery from biased crowdsourced answers. In: Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, Association for Computing Machinery, New York, NY, USA, pp 599–607
  27. Lu X, Ratcliffe D, Kao TT, Tikhonov A, Litchfield L, Rodger C, Wang K (2023) Rethinking quality assurance for crowdsourced multi-ROI image segmentation. In: Proceedings of the AAAI conference on human computation and crowdsourcing, pp 103–114
  28. Luo Y, Tian T, Shi J, Zhu J, Zhang B (2018) Semi-crowdsourced clustering with deep generative models. In: Advances in Neural Information Processing Systems. https://proceedings.neurips.cc/paper_files/paper/2018/file/3c1e4bd67169b8153e0047536c9f541e-Paper.pdf
  29. Mahotra A, Majchrzak A (2024) Digital innovations in crowdsourcing using AI tools. Technovation 133:102997
  30. Pavlichenko N, Stelmakh I, Ustalov D (2021) CrowdSpeech and VoxDIY: benchmark datasets for crowdsourced audio transcription. arXiv:2107.01091
  31. Pyatkin V, Yung F, Scholman MC, Tsarfaty R, Dagan I, Demberg V (2023) Design choices for crowdsourcing implicit discourse relations: revealing the biases introduced by task design. Trans Assoc Comput Linguistics 11:1014–1032
  32. Rao D, Visin F, Rusu A, Pascanu R, Teh YW, Hadsell R (2019) Continual unsupervised representation learning. In: Advances in neural information processing systems
  33. Simpson ED, Venanzi M, Reece S, Kohli P, Guiver J, Roberts SJ, Jennings NR (2015) Language understanding in the wild: combining crowdsourcing and machine learning. In: Proceedings of the 24th international conference on world wide web, pp 992–1002. 10.1145/2736277.2741689
  34. Snow R, O’Connor B, Jurafsky D, Ng AY (2008) Cheap and fast - but is it good? Evaluating non-expert annotations for natural language tasks. In: Proceedings of the 2008 conference on empirical methods in natural language processing, pp 254–263
  35. Sun L, Yang Q, Chen X, Chen Z (2021) Rc-chain: reputation-based crowdsourcing blockchain for vehicular networks. J Netw Comput Appl 176:102956
  36. Venanzi M, Guiver J, Kazai G, Kohli P, Shokouhi M (2014) Community-based Bayesian aggregation models for crowdsourcing. In: Proceedings of the 23rd international conference on world wide web, pp 155–164. 10.1145/2566486.2567989
  37. Wang X, Dao F, Ji Y, Qiu S, Zhu X, Dong W, Wang H, Zhang W, Zheng X (2024) Crowdsourcing intelligence for improving disaster forecasts. The Innovation 5
  38. Wang Z, Chen C, Dong D (2022a) Lifelong incremental reinforcement learning with online Bayesian inference. IEEE Trans Neural Netw Learn Syst 33:4003–4016
  39. Wang Z, Zhang Z, Ebrahimi S, Sun R, Zhang H, Lee CY, Ren X, Su G, Perot V, Dy J et al (2022b) Dualprompt: complementary prompting for rehearsal-free continual learning. In: European conference on computer vision, Springer, pp 631–648
  40. Welinder P, Branson S, Perona P, Belongie S (2010) The multidimensional wisdom of crowds. In: Advances in Neural Information Processing Systems
  41. Wu G, Zhou L, Xia J, Li L, Bao X, Wu X (2023) Crowdsourcing truth inference based on label confidence clustering. ACM Trans Knowl Discov Data 17
  42. Yan S, Xie J, He X (2021) DER: dynamically expandable representation for class incremental learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp 3014–3023
  43. Yang Y, Zhao ZQ, Wu G, Zhuo X, Liu Q, Bai Q, Li W (2024) A lightweight, effective, and efficient model for label aggregation in crowdsourcing. ACM Trans Knowl Discov Data 18
  44. Yin L, Han J, Zhang W, Yu Y (2017) Aggregating crowd wisdoms with label-aware autoencoders. In: Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17), pp 1325–1331
  45. Zhan Z, Wang Y, Duan P, Sai AMVV, Liu Z, Xiang C, Tong X, Wang W, Cai Z (2024) Enhancing worker recruitment in collaborative mobile crowdsourcing: a graph neural network trust evaluation approach. IEEE Trans Mobile Comput, pp 1–18
  46. Zhang L, Wang S, Yuan F, Geng B, Yang M (2023) Lifelong language learning with adaptive uncertainty regularization. Inf Sci 622:794–807
  47. Zhang P, Cheng X, Su S, Wang N (2023) Task allocation under geo-indistinguishability via group-based noise addition. IEEE Trans Big Data 9:860–877. 10.1109/TBDATA.2022.3215467
  48. Zheng Y, Li G, Li Y, Shan C, Cheng R (2017) Truth inference in crowdsourcing: is the problem solved? Proc VLDB Endow 10:541–552


