Author manuscript; available in PMC: 2020 Dec 17.
Published in final edited form as: Smart Health (Amst). 2020 Nov 13;18:100142. doi: 10.1016/j.smhl.2020.100142

Topic modeling for systematic review of visual analytics in incomplete longitudinal behavioral trial data

Joshua Rumbut a,c, Hua Fang a,c,*, Honggong Wang b
PMCID: PMC7745978  NIHMSID: NIHMS1649927  PMID: 33344744

Abstract

Longitudinal observational and randomized controlled trials (RCTs) are widely applied in biomedical behavioral studies and increasingly implemented in smart health systems. These trials frequently produce data that are high-dimensional, correlated, and contain missing values, posing significant analytic challenges. Notably, visual analytics are underdeveloped in this area. In this paper, we developed a longitudinal topic model to conduct a systematic review of visual analytic methods presented at the IEEE VIS conference over its 28-year history, in comparison with MIFuzzy, an integrated and comprehensive soft computing tool for behavioral trajectory pattern recognition, validation, and visualization of incomplete longitudinal data. The findings of our longitudinal topic modeling highlight the trend patterns of visual analytics development in longitudinal behavioral trials and underscore the substantial gap between existing robust visual analytic methods and actual working algorithms for longitudinal behavioral trial data. Future research areas for visual analytics in behavioral trial studies and smart health systems are discussed.

Keywords: Missing data, MIFuzzy, Visualization, Topic modeling, Systematic review, Visual analytics

1. Introduction

Behavioral data from longitudinal observational studies and randomized controlled trials create the need for specialized analytic tools. Attrition, one of the major factors leading to missing data, is nearly inevitable in longitudinal smart health studies and often depends on factors under study (for instance, smokers may not return surveys about tobacco usage) or on the outcome Sackett (1979); Fang et al. (2009); Rubin (2004). Behavioral research often relies on high-dimensional data with complex intercorrelation Kim et al. (2017); Mellins et al. (2009). The interplay of these challenges creates myriad variations within groups or strata, complicating comparisons between groups Fang (2017).

In longitudinal trials, clustering may be able to identify groups that share trajectories toward an outcome Fang et al. (2011). In order to perform a cluster analysis, it is necessary to optimize various parameters, such as the number of clusters sought. Robust validation is essential when applying any machine learning technique, both for parameter selection and for determining the quality of the outcome. Running validation procedures and interpreting outcomes, both visually and with numerical metrics, are specialist tasks requiring training and additional time and computational resources Kwon et al. (2017); Fang (2017). Additionally, different clustering methods generally have different levels of sensitivity to missing data. Many are entirely unable to deal with missing data, ignore the missing data problem, or apply simple missing data preprocessing techniques without assessing imputation uncertainty or the clustering uncertainty resulting from these imputations.

Data may be missing completely at random (MCAR), when there is no relationship between the propensity of a value to be missing and any observed or unobserved factors; missing at random (MAR), when missingness depends only on other observed values; or missing not at random (MNAR), when missingness depends on the missing value itself. In the MNAR scenario, the missingness is informative and therefore presents a particular challenge, since excluding these values will introduce bias Fang et al. (2009); Rubin (2004).
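The three mechanisms can be illustrated with a small simulation (a hypothetical sketch; the variable names, distributions, and thresholds are invented for illustration and do not come from the paper's data). Under MNAR, summary statistics computed on only the observed values are biased.

```python
import random

random.seed(0)

# Simulated repeated measure (e.g. a smoking-related score) plus an
# always-observed covariate, for 200 invented observations.
values = [random.gauss(10, 3) for _ in range(200)]
covariate = [random.gauss(0, 1) for _ in range(200)]

def mask(values, covariate, mechanism):
    """Return a copy of `values` with entries set to None under the given mechanism."""
    out = []
    for v, c in zip(values, covariate):
        if mechanism == "MCAR":      # missingness unrelated to anything
            drop = random.random() < 0.2
        elif mechanism == "MAR":     # missingness depends only on the observed covariate
            drop = c > 0.8
        else:                        # MNAR: missingness depends on the value itself,
            drop = v > 13            # e.g. heavier smokers skip the survey
        out.append(None if drop else v)
    return out

for mech in ("MCAR", "MAR", "MNAR"):
    observed = [v for v in mask(values, covariate, mech) if v is not None]
    print(mech, round(sum(observed) / len(observed), 2))
```

With these settings, the MCAR and MAR observed means stay close to the true mean, while the MNAR observed mean is systematically pulled downward, which is why simply excluding MNAR values introduces bias.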

There are three main options for dealing with missing data. The first is to either leave the data as-is or delete those samples containing missing data, which we call non-imputation (NI). The next is to use the values that are present to predict the missing values, called single imputation (SI). The last is multiple imputation (MI), where the missing values are imputed multiple times with each iteration producing a new data set upon which the desired analysis (in this case, clustering) is performed Fang et al. (2009); Rubin and Little (2019).
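The contrast between SI and MI can be sketched as follows (invented data and a deliberately simple normal imputation model, not the imputation method used by MIFuzzy): MI produces m completed datasets whose analyses are then pooled, so the spread of the m estimates reflects imputation uncertainty, which SI discards.

```python
import random
import statistics

random.seed(1)

data = [random.gauss(5, 2) for _ in range(100)]
incomplete = [v if random.random() > 0.3 else None for v in data]

def single_impute(xs):
    """SI sketch: replace each missing value with the observed mean (one dataset)."""
    mu = statistics.mean([v for v in xs if v is not None])
    return [mu if v is None else v for v in xs]

def multiple_impute(xs, m=5):
    """MI sketch: draw each missing value from a normal fitted to the observed
    data, yielding m completed datasets that reflect imputation uncertainty."""
    obs = [v for v in xs if v is not None]
    mu, sd = statistics.mean(obs), statistics.stdev(obs)
    return [[random.gauss(mu, sd) if v is None else v for v in xs] for _ in range(m)]

si_mean = statistics.mean(single_impute(incomplete))
mi_means = [statistics.mean(d) for d in multiple_impute(incomplete)]
pooled = statistics.mean(mi_means)  # pooled point estimate (Rubin's rules)
print(round(si_mean, 2), round(pooled, 2))
```

In MI, the desired analysis (here just the mean; in this paper, clustering) is run on each completed dataset, and the between-dataset variability of the results feeds into the final uncertainty estimate.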

Human subjects may fail to provide information based on unobserved phenomena, making the missing values informative and challenging to predict by single imputation Fang (2017); Zhang et al. (2016). This can complicate clustering and trajectory pattern recognition in longitudinal trials. Different systems have been proposed for using visual analytics to guide users through the clustering process. The systems described in the literature show enormous variety in the clustering algorithms used, the validation indices applied, and the visualizations presented to the user, but very few handle incomplete longitudinal data.

One system that has been designed to handle longitudinal data with missingness or zero-inflation is multiple-imputation based fuzzy clustering (MIFuzzy), a soft computing and semi-supervised learning method specially designed for longitudinal trial studies. MIFuzzy has been applied in two longitudinal tobacco exposure studies which used self-reported measures and bioassays to examine the effects of prenatal tobacco exposure on perinatal or developmental neuropsychological outcomes. Missingness in the two studies was less than 43%. The authors found strong support for the reliability and validity of the MIFuzzy approach by comparing results from these studies, despite their being conducted almost 20 years apart and using different exposure measures (e.g. monthly vs. trimester for self-report; GC/MS vs. radioimmunoassay for bioassays; 22 vs. 14 total attributes). MIFuzzy detected 3 distinct exposure clusters (heavier-, lighter-, and non-exposed) in each study, with similar amounts of self-reported smoking in the analogous clusters. Also in both datasets, neonates’ weight differences averaged around 200 g between the heavier-tobacco-exposed (about 3200 g) and non-exposed groups (about 3400 g); in neither study did neonates in the lighter-exposed and non-exposed groups differ in weight. Finally, the MIFuzzy-derived pattern approach had more predictive power than conventional grouping approaches (e.g. based on self-reported smoking status, smoker vs. non-smoker) Pickett et al. (2009); Wakschlag et al. (2011).

In addition, MIFuzzy was applied to characterize complex online engagement and cognitive response behaviors over the course of internet-based and culturally tailored randomized controlled trials (RCTs), and to untangle the efficacy of such trials for substance use cessation (e.g. smoking cessation). Missingness in the two longitudinal behavioral RCTs was less than 18%. The internet-based RCT had incomplete, correlated, and high-dimensional data with zero-inflation. MIFuzzy detected four types of online engagement patterns from three correlated intervention components with six monthly repeated measures. Engagement with each intervention component significantly differed between the near-zero engagement pattern group and the other three. Interestingly, no significant differences were detected among age, education, gender, ethnicity, intervention, or internet-use subgroups across these patterns. The MIFuzzy-derived pattern approach, which captured variations in subjects’ longitudinal engagement during intervention, again demonstrated its predictive power; it significantly predicted differences in the 6-month cessation rates, while the conventional approach could not Zhang et al. (2016). For the TDTA trial, MIFuzzy identified three patterns with substantial variations in cognitive behavioral response; subjects with these patterns differed in marital status and levels of acculturation and depression, further explaining their differences in cessation rates, where conventional grouping (treated vs. not) did not detect differences Kim et al. (2017). Additionally, MIFuzzy was examined on the dietary data of RCT participants with metabolic syndromes, where the longitudinal dietary data appeared more heterogeneous, with missingness less than 23% Fang and Zhang (2018). MIFuzzy detected five patterns using 4 repeatedly measured diet-quality scores derived from 8 food components (fruit, vegetables, nuts and legumes, ratio of white to red meat, cereal fiber, trans-fat, ratio of polyunsaturated fat to saturated fat, and alcohol) on the Alternate Healthy Eating Index. Preliminary tests showed that patients with the consistently low diet-quality pattern were significantly heavier and had larger waists and lower HDL at 1-year follow-up than those with the consistently high diet-quality pattern. This suggests that future studies can investigate the patients with this consistently low diet-quality pattern, exploring their characteristics, food intake, and disease outcomes for possible adaptive interventions. MIFuzzy has since been expanded to more longitudinal RCTs and observational studies.

Unlike a classic systematic review, we leverage advanced text mining: specifically, we apply and develop longitudinal topic modeling to analyze a large corpus of papers published in the field of visual analytics, in order to understand the trends and state of the art of visual analytics in longitudinal trials. We extracted text from the IEEE VIS conference, the premier conference for data visualization and visual analytics, from 1990 to 2019 in order to perform a systematic review of the literature. This field has changed enormously over 29 years. Typical models that fail to take into account the variation and evolution of topics over time may be unable to adequately describe the corpus. To capture these changes, we used a longitudinal topic model, the Dynamic Topic Model (DTM) Blei and Lafferty (2006). Our DTM model, along with a detailed review of systems described in the literature, showed that the problem of missing data is rarely addressed in systems presented at IEEE VIS, and is most often associated with spatial data rather than longitudinal trial data.

Meanwhile, we studied statistical learning methods published outside the IEEE VIS community, specifically MIFuzzy. We examine and compare the available methods and tools from IEEE VIS with the MIFuzzy system, in development since 2011 and currently available for use, in an effort to guide ongoing enhancements to visualization and visual analytics in longitudinal trial systems and to establish the utility of longitudinal topic modeling as a tool for systematic review Fang Lab (2018); Fang (2017); Zhang et al. (2016). The use of DTM for systematic review, the description of the evolution of topics in the visual analytics literature, and the detailed comparisons between visual analytics systems are the primary contributions of this work.

In Section II we will explain the methodology used to collect documents for our corpus from the IEEE VIS publications. Section III will contain details of the text and data preprocessing required for performing systematic review on the visual analytics literature. Section IV will describe the use of Latent Dirichlet Allocation (LDA, a method of cross-sectional topic modeling) to select an appropriate number of topics for the longitudinal topic model, as well as showing results from the LDA model. Section V will show the results of the DTM model, describing the evolution of topics in the systematic review literature from the early years of the IEEE VIS conference to the present day. Section VI will provide detailed comparisons between MIFuzzy and other visual analytics systems previously published in IEEE VIS. Lastly, Section VII will contain discussions, conclusions, and directions for further research for visual analytics for longitudinal trials.

2. Methodology for longitudinal topic modeling systematic review

We performed an extensive examination of all proceedings of the IEEE VIS conference and associated venues dating back to 1990 by querying the IEEE Xplore® database IEEE (2020). Since there is no existing collection of these proceedings, we first needed to locate the various publications that have served as venues for the conference proceedings. Once the appropriate venues were found, we searched for relevant papers using search terms relating to missing data and longitudinal trials. For the papers that matched our queries, we extracted metadata including the title, year, and abstract, as well as the full text. We used an LDA topic model to determine the most suitable number of topics for the systematic review, then fit a DTM model to the corpus of visual analytics papers. Finally, based on our topic model, we performed detailed examinations of those papers featuring systems that may be suitable for clustering longitudinal trial data with visualizations and validation. See Fig. 1 for the flow diagram.

Fig. 1. Workflow for systematic review by performing longitudinal topic modeling on the visual analytics literature.

2.1. Inclusion and exclusion criteria

To be included, papers had to match predetermined queries to ensure they had been published as part of the IEEE VIS conference or associated event and mention missing, trajectory, longitudinal trial, or behavioral data. Records found in the database for tables of content or front matter for the conference proceedings were excluded. The remaining documents were included in the dynamic topic model, along with previous work describing the methodology and applications of MIFuzzy.

In order to search through the history of the IEEE VIS conference, the IEEE Xplore® database, Google searches, and links from other IEEE pages were used to identify all publications that contained proceedings from the IEEE VIS conference and associated events (such as the 2017 Workshop on Visual Analytics in Healthcare). This was necessary because no single query condition selects all IEEE VIS proceedings. In addition to conference proceedings, IEEE Transactions on Visualization and Computer Graphics has been a venue for relevant IEEE VIS papers and so was also included. See Fig. 2 for the number of corpus documents published in each year.

Fig. 2. Included visualization papers from the IEEE VIS Conference per year.

Citations and abstracts returned from the queries were loaded into an EndNote X8 library. The titles, abstracts, and publication information were reviewed according to the inclusion and exclusion criteria above. Full text manuscripts were obtained for all papers that appeared to warrant close examination due to output from the longitudinal topic model. A total of 6015 papers appear in the publications selected. Of those, 798 matched one of our search terms. We also included 7 papers concerning MIFuzzy, none of which were published in IEEE VIS. The 798 IEEE VIS papers and 7 MIFuzzy papers combined for 805 documents in the corpus that was used to develop the DTM model for systematic review.

2.2. Criteria for evaluating visual analytics systems

Interconnected challenges may be best overcome by an integrated and comprehensive approach, encoding best practices established through empirical experience and a coherent theoretical foundation. While no single approach is appropriate for every situation, we have identified eight crucial principles governing the design and implementation of a system to handle longitudinal data in the health and behavior domain:

  1. Handles missing data in a way that minimizes bias and threats to the generalizability of results

  2. Provides appropriate validation metrics, preferably an integrated validation process

  3. Deals with data without imposing distributional or prior assumptions

  4. Does not require the user to manually set model parameters

  5. Handles high-dimensional data and large numbers of clusters efficiently

  6. Produces replicable results

  7. Provides visualizations of data and clustering, including projections of high-dimensional data and trajectory data

  8. Demonstrates utility empirically in the health and behavior domains

We have used these criteria to carefully evaluate the visual analytics systems described in the conference papers and compare them to MIFuzzy. In order to understand these systems, we obtained and reviewed the full text of the original conference papers and, as needed, additional methods and applications papers published in other venues.

2.3. Topic modeling

In order to examine general trends and structure within the literature, we perform a longitudinal analysis using a Dynamic Topic Model (DTM). In brief, DTM is an extension of the Latent Dirichlet Allocation (LDA) topic modeling algorithm. The result of DTM is an LDA model for each time step, with the evolution from one time step to the next modeled as a Gaussian process.

LDA is a generative model, capable of assigning probabilities to unseen documents and creating new simulated documents. It assumes that each document consists of a random mixture of topics, each of which is characterized by a distribution over the vocabulary of the corpus. Specifically, to generate a document (a sequence of words) w from corpus D with vocabulary V using a previously trained LDA model with k topics:

  1. Choose N~Poisson(ξ).

  2. Choose θ~Dir(α).

  3. For each of the N words wn:
    • 3.1. Choose a topic zn~Multinomial(θ).
    • 3.2. Choose a word wn from p(wn|zn,β), a multinomial probability conditioned on the topic zn.

where N is the length of the generated document, ξ is the average number of words in a document in the training set (or any other value the user may want; the document length is not connected to any other component of the model and so can be set freely), θ is a k-dimensional Dirichlet random variable representing the topic mixture, z is the discrete random topic variable, α is a k-dimensional vector where for all αi ∈ α, αi > 0, and β is a k × V matrix of conditional probabilities of each word given each topic. The number of topics k is chosen by the user; in practice, multiple values are tested to see which is most suitable for the particular application at hand Blei et al. (2003).
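The generative steps above can be sketched directly in code (a toy sketch: the vocabulary, β values, α, and ξ are invented for illustration; a trained model would supply α and β):

```python
import math
import random

random.seed(2)

vocab = ["data", "visual", "cluster", "impute", "trial", "topic"]
k = 3
# Toy beta (k x |V|): P(word | topic); each row sums to 1.
beta = [[0.40, 0.30, 0.10, 0.05, 0.05, 0.10],
        [0.05, 0.10, 0.40, 0.30, 0.10, 0.05],
        [0.10, 0.05, 0.05, 0.10, 0.30, 0.40]]
alpha = [0.5] * k  # Dirichlet prior over topic mixtures
xi = 8             # mean document length

def poisson(lam):
    """Step 1: N ~ Poisson(xi), via Knuth's algorithm."""
    limit, n, p = math.exp(-lam), 0, 1.0
    while p > limit:
        n += 1
        p *= random.random()
    return n - 1

def dirichlet(a):
    """Step 2: theta ~ Dir(alpha), sampled as normalized Gamma draws."""
    g = [random.gammavariate(ai, 1.0) for ai in a]
    s = sum(g)
    return [x / s for x in g]

def generate_document():
    n = max(1, poisson(xi))
    theta = dirichlet(alpha)
    words = []
    for _ in range(n):
        z = random.choices(range(k), weights=theta)[0]           # step 3.1
        words.append(random.choices(vocab, weights=beta[z])[0])  # step 3.2
    return words

print(generate_document())
```

Each run draws a fresh topic mixture θ per document, so documents from the same model differ both in length and in topical emphasis.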

The following steps model the longitudinal change in topics across a series of LDA models, one for each time slice:

  1. βt | βt−1 ~ N(βt−1, σ²I)

  2. αt | αt−1 ~ N(αt−1, δ²I)
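A minimal sketch of this evolution for a single topic's word distribution (toy sizes; σ and the vocabulary size are assumed values for illustration): the natural parameters follow a Gaussian random walk across time slices and are mapped to word probabilities via softmax, which is how topic content can drift smoothly over the years of a corpus.

```python
import math
import random

random.seed(3)

V = 5        # toy vocabulary size
T = 4        # number of time slices
sigma = 0.1  # evolution standard deviation (variance sigma^2 * I)

def softmax(xs):
    """Map natural parameters to a probability distribution over the vocabulary."""
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [x / s for x in e]

# beta_t | beta_{t-1} ~ N(beta_{t-1}, sigma^2 I): a Gaussian random walk
# in natural-parameter space, one step per time slice.
beta = [0.0] * V
for t in range(T):
    beta = [b + random.gauss(0, sigma) for b in beta]
    probs = softmax(beta)
    print(f"t={t}", [round(p, 3) for p in probs])
```

Small σ keeps adjacent time slices similar, which is what lets DTM track gradual topic evolution rather than refitting independent LDA models per year.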

Longitudinal topic models are fit using a variational expectation maximization (EM) algorithm to optimize a loss function for which no closed-form solution exists. This presents the possibility of bias, in addition to the possibility of fitting too closely to noise in the dataset. As seen in the results, artifacts such as typos and encoding errors can cause the model to find topics that lack semantic value. Therefore, the input data was carefully prepared to prevent this outcome.

3. Topic modeling of visualization text data for systematic review

In order to examine the visual analytics literature holistically, we present the results of our analysis of the paper abstracts. These abstracts were obtained using the export function on the IEEE Xplore site, compiled into a CSV file, and loaded into R. After pre-processing, the abstracts contained 17,138 words. Abstracts were used for the longitudinal model because we found that the longitudinal topic model was too sensitive to the lower quality of text converted from PDFs compared with the relatively cleaner abstracts from the database.

The cross-sectional topic model was trained on the complete text of the visual analytics papers. To acquire this dataset, PDFs of each article from the search results were downloaded, and all text was extracted from the PDFs using GhostScript 9.23 Artifex Software, Inc. (2018). Data was cleaned and formatted using the tidytext library for the R Statistical Computing environment Silge and Robinson (2016); R Core Team et al. (2013). LDA was performed using the topicmodels library. DTM was performed with the gensim library Rehurek and Sojka (2010).

For both models, stop words (very common words such as “a”, “the”, etc.) were removed along with frequent nuisance text. Words of one letter, or longer than 25 letters, were also excluded. The size of the vocabulary, and the presence of many names and other unusual words not found in dictionaries, made it difficult to decide whether to include or exclude certain words. Longer words were examined in case they represented multiple words joined where spaces were missing, whether due to typos or database quality issues (19 of these were found; it is possible others remain in the dataset). A handful of cases were found where the characters were encoded in a form incompatible with UTF-8, possibly Windows-1252. In ambiguous cases, leaving words as they appeared in the text was generally favored.
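The filters described above can be sketched as follows (a simplified sketch: the stop-word list is abbreviated and the tokenizer is minimal; the actual pipeline used tidytext in R):

```python
import re

# Abbreviated stop-word list for illustration; real lists contain hundreds of words.
STOPWORDS = {"a", "an", "the", "of", "and", "in", "to", "is", "we", "for"}

def preprocess(text):
    """Tokenize, lowercase, and drop stop words plus words of length 1 or > 25,
    mirroring the filters described above."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS and 1 < len(t) <= 25]

print(preprocess("We present a visualization of the incomplete longitudinal trial data."))
```

The upper length bound is what flags candidate run-together words (missing spaces) for manual inspection, as described above.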

4. Results of topic modeling for systematic review

In the entire corpus, the most common word was data followed by visualization. The word data appears in the complete text of almost every document in the corpus, and is the most common word (besides stop words) in the abstracts as well. Table 1 shows the most common words in the entire corpus after pre-processing.

Table 1.

Most common words in corpus.

Word Count of Occurrences
Data 26,542
Visualization 9,746
Time 8,676
Based 6,558
Analysis 6,506
Visual 6,396
Information 4,811
Set 4,469
Results 4,195
Model 4,102

In order to choose the best model for our analysis, we used the Bayesian Information Criterion (BIC), a metric often used in LDA modeling. Metrics for validating topic models have been an active research area, with many new metrics proposed, but BIC remains a common choice Arun et al. (2010); Cao et al. (2009); Griffiths and Steyvers (2004); Murzintcev (2016); Deveaud et al. (2014). We sought a model that fit the data well, with a preference for fewer topics in order to present a more digestible snapshot of the visual analytics literature examined. We began by training incrementally larger LDA models on our full-text dataset, beginning with 2 topics, adding a topic at each step, and stopping at 20 topics. The resulting BIC values show a clear inflection point between the 7-topic and 8-topic models. We present the 7-topic LDA model and chose 7 as the number of topics for our DTM model.
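The inflection-point selection can be sketched as follows (the BIC values below are invented for illustration, not the paper's actual scores): we pick the k after which the per-topic BIC improvement falls off most sharply.

```python
# Hypothetical BIC values for k = 2..10 topics (lower is better; invented numbers).
bic = {2: 9800, 3: 9500, 4: 9250, 5: 9050, 6: 8900, 7: 8800, 8: 8780, 9: 8775, 10: 8773}

def elbow(scores):
    """Return the k at the inflection point: the k after which the BIC
    improvement from adding one more topic drops off most sharply."""
    ks = sorted(scores)
    # Improvement gained by moving from k to k+1 topics.
    drop = {ks[i]: scores[ks[i]] - scores[ks[i + 1]] for i in range(len(ks) - 1)}
    best_k, best_bend = None, float("-inf")
    for i in range(len(ks) - 2):
        bend = drop[ks[i]] - drop[ks[i + 1]]  # how much the improvement shrinks
        if bend > best_bend:
            best_k, best_bend = ks[i + 1], bend
    return best_k

print(elbow(bic))
```

With these toy numbers the improvement collapses after k = 7, matching the kind of inflection described above; in practice the candidate models would each be trained and scored before applying such a rule.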

The LDA model produced topics that overlap substantially in the words that most distinguish each topic (see Fig. 3). All topics contain the word data, and most contain visualization. This reduces the model’s utility for describing the differences between the topics, and there is no tracking of longitudinal trends. These drawbacks are reduced in DTM.

Fig. 3. Top 5 words from each of the 7 topics.

Using the results from LDA, a 7-topic DTM model was examined. The results appear heavily influenced by the later years, which is unsurprising given the increase in the quantity of papers published after 2005. The top terms (in terms of their ability to discriminate between topics) for 1992 (the first year) and 2019 (the last year available) are shown in Table 2.

Table 2.

Most likely words in DTM.

Topic 1992 Terms 2019 Terms
1 visualization, datasets, tiled, flowstring^a, tech datasets, visualization, analytic, vispedia^b, tiled
2 usage, sourced, resultmaps^c, tracked, fictitious usage, tracked, resultmaps^c, viral, appropriately
3 vicon^d, basically, tiled, meter, synthetic vicon^d, tradition, basically, tiled, tease
4 objectives, algorithmically, designing, coloring, visualization designing, visualization, vispedia^b, coloring, resultmaps^c
5 basically, modate^e, supports, veach^f, datasets basically, appropriately, modate^e, datasets, supports
6 viral, envision, humanlike, motility, readings viral, motility, usage, reading, envision
7 visualization, datasets, vispedia, synthetic, analytic visualization, datasets, analytic, vispedia, synthetic
a. A system for flow visualization.
b. A visualization system using Wikipedia.
c. A hierarchical search visualization.
d. A motion capture system.
e. Occasionally, the word “accommodate” had a space inserted in the middle.
f. Author of a paper about using MCMC methods in transport simulations.

Based on an examination of the top terms in each topic, and the papers the model considers to be most heavily generated by them, we can (subjectively) identify themes to which they correspond. The first topic, whose purest expression the model finds in “TransGraph: Hierarchical Exploration of Transition Relationships in Time-Varying Volumetric Data,” relates to data mining and visual analytics generally Gu and Wang (2011). “Ball-Morph: Definition, Implementation, and Comparative Evaluation” has the greatest proportion of topic 2, which describes simulations and other training systems Whited and Rossignac (2011). Topic 3 deals with spatial data as well as missing data; its top document is “Visualizing Incomplete and Partially Ranked Data” Kidwell et al. (2008). Topic 4 concerns aesthetics and design, as seen in its top paper “Mapping Color to Meaning in Colormap Data Visualizations” Schloss et al. (2019). The fifth topic concerns theory and numerical methods for visualization, exemplified by “Hierarchy of Stable Morse Decompositions” Szymczak (2012). The sixth topic is human-computer interaction, with its top paper being “Effects of Virtual Human Appearance Fidelity on Emotion Contagion in Affective Inter-Personal Simulations” Volante et al. (2016). The last topic involves meta-analysis of visualizations, exemplified by “ScatterNet: A Deep Subjective Similarity Model for Visual Analysis of Scatterplots” Ma et al. (2020).

5. Detailed comparison of visual analytics systems

In the previous section, we described and traced the evolution of broad topics in the visual analytics literature. In addition, five papers from our longitudinal topic modeling were examined in full-text form, allowing a detailed comparison of papers that describe visual analytics systems with useful features for longitudinal trial data. These papers describe the systems SOMFlow, TimeStitch, Clustrophile 2, Clustervision, and Hierarchical Clustering Explorer. The systems were evaluated on the eight criteria introduced above, as well as on their unique advantages and disadvantages. See Table 3 for an overview of this comparison. We also describe MIFuzzy and how it was designed and validated as a comprehensive solution to the real-world problems of analysts in the smart health domain.

Table 3.

Visual analytics system comparison.

System Missing Data Validation Parameter Setting Efficient Replicable Visualizations Behavior Data
MIFuzzy MI XB, viz. aided Automatic Yes Yes Multiple Yes Kim et al. (2017); Fang et al. (2011, 2012)
SOMFlow Yes Qe, viz. aided Guided Varies Yes Multiple No
TimeStitch None None None Unknown No Interactive Flow Yes
Clustrophile 2 None Multiple Guided Yes Yes Multiple Yes
Clustervision SI Multiple Guided Yes Yes Multiple Yes
HCE None Multiple Guided Yes Yes Multiple No

5.1. SOMFlow

SOMFlow is an interactive visual analytics tool aimed at helping domain experts iteratively find and examine interesting subsets in large time-series datasets matching a bespoke JSON format. Self-organizing maps (SOM) have previously been used as the basis for single imputation schemes García-Laencina, Sancho-Gómez, and Figueiras-Vidal (2010); SOMFlow appears to handle data with missing labels so long as the time-series inputs are the same length. The authors of SOMFlow focus on the linguistics domain, working with prosodic data, though demographic information appears to accompany the audio signal.

Users may manually adjust parameters within certain guidelines, with results set based on rules of thumb, as well as actively guide the clustering through interactions. Errors are visualized, though it is not clear if the exact numbers are provided to users for use in their own publications.

Like many neural computing algorithms, the computational resources required by SOMFlow are reportedly high. This may impose limitations on users working with larger datasets who lack access to, or the ability to operate, specialized hardware Sacha et al. (2018).

5.2. TimeStitch

TimeStitch is a system designed around an interactive process of defining sequences of events that may lead to a specific behavior. It was inspired by work looking for triggers that cause study subjects to lapse in their efforts to quit smoking. Latent groups may or may not be found by manually examining sequences of interest to the user. No validation facilities are mentioned, and extensive demos and documentation are not available Polack et al. (2015).

5.3. Clustrophile 2

Clustrophile 2 is a system designed to guide users through the process of selecting an optimal clustering in the face of the myriad options and pitfalls inherent in the process. Its creation was guided by a set of design criteria that partially overlaps with those of MIFuzzy; in particular, the authors seek reproducibility, analysis of large datasets, and visualization of high-dimensional datasets. They provide the user with numerous options for clustering algorithms and validation metrics, and support labeling the clusters that were identified. Additionally, educational material about the theoretical advantages and disadvantages of different choices is provided within the application.

The key shortcoming of Clustrophile 2 is that any handling of missing data must be done outside the application, if it is done at all. Some clustering algorithms included in Clustrophile 2 cannot be used directly on datasets with missing data, which presents a problem. Additionally, the opportunity to use clustering and multiple imputation in a virtuous cycle of validation is missed Cavallo and Demiralp (2019).

Clustrophile 2 was tested in a user study in which data scientists were asked to analyze a publicly available dataset. Footage of its use is available, showing an intriguing interface that allows the user to comment on cluster quality (see Fig. 4). We are not aware of any further published uses of the software, nor does it appear to be publicly available.

Fig. 4. Clustrophile 2 screenshot.

5.4. Clustervision

Clustervision (see Fig. 5) was developed with the goal of assisting data scientists in selecting the right clustering algorithm and finding optimal parameter values for their particular data set. It accomplishes this task by running multiple algorithms and ranking their performance based on validation indices. Its interface also allows users to set constraints on the algorithms in order to make the results line up with their prior knowledge or intuition. The authors also present a case study describing a team of data scientists and clinicians using Clustervision to extract useful insights from a medical dataset Kwon et al. (2017).

Fig. 5. Clustervision screenshot.

While Clustervision provides a variety of validation metrics and demonstrates the principles of visual validation of clustering results, it does not handle missing data. Additionally, the user may not make optimal decisions when faced with different clustering choices; the system depends on the analytical savvy of the user. This may not be appropriate for some users, nor does it promote a consistent process across multiple studies.

5.5. Hierarchical Clustering Explorer

The Hierarchical Clustering Explorer (HCE) is a tool for exploratory data analysis that is particularly oriented toward microarray gene expression data. The user is shown distribution information for each variable, followed by a heatmap of the correlations between each pair of variables. HCE then presents a dendrogram, a visual representation of hierarchical clustering. In the genomics context, the tool serves to check whether there are irregularities in the data (by checking for unusual distributions) and to highlight clusters of genes worthy of more detailed study. Although its current development status is unclear, its usefulness was established via case studies, surveys, and usage by researchers unaffiliated with the original developers. Unfortunately, its model does not account for longitudinal data Seo and Shneiderman (2006).
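The hierarchical clustering that HCE renders as a dendrogram can be sketched as follows, with synthetic data standing in for a gene-expression matrix and SciPy used in place of HCE's own implementation:

```python
# Agglomerative clustering of "genes" (rows) over 5 conditions (columns),
# followed by cutting the resulting tree into a fixed number of clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(2)
# Two well-separated synthetic expression profiles, 20 rows each.
X = np.vstack([rng.normal(0, 1, (20, 5)), rng.normal(8, 1, (20, 5))])

Z = linkage(pdist(X), method="average")          # the dendrogram's linkage matrix
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
```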

5.6. MIFuzzy

In the statistical learning community, MIFuzzy, one of the most recent systems, is specifically designed for longitudinal trial data pattern recognition. This system addresses the visualization and validation of latent trajectory clusters for non-normal, zero-inflated, high-dimensional, correlated, and incomplete longitudinal data through an approach that integrates multiple imputation (MI) techniques into fuzzy clustering Fang (2017). This integrated approach is intelligent and accounts for imputation uncertainty, and thus the uncertainty of clustering accuracy. MIFuzzy visualization and validation is iterative and automatic, built upon the numeric validation indices for MI-based fuzzy models, such as Fuzzy C-Means (FCM), as well as trajectory visualization and non-linear fuzzy Sammon mapping. The MI component of MIFuzzy is robust to three missing data mechanisms (MCAR, MAR, and MNAR) and integrates into the MIFuzzy visualization process Zhang and Fang (2016). The fuzzy logic component of MIFuzzy handles real-world longitudinal trial data where participants over time can have membership in multiple clusters, i.e., when these clusters touch or overlap. Overall, MIFuzzy is computationally tractable and replicable, and its software tool is available for longitudinal trial data pattern recognition, visualization, and validation Fang et al. (2009, 2011); Wang et al. (2015). MIFuzzy has also been used successfully to prepare data for classification with a neuro-fuzzy classifier Gurugubelli et al. (2019).
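To convey the flavor of the MI-then-fuzzy-cluster idea — a toy sketch, not the published MIFuzzy algorithm, whose imputation, validation indices, and visualization are far more sophisticated — the following clusters each of M crudely imputed copies of an incomplete dataset with standard fuzzy c-means, so that membership agreement across copies can be inspected:

```python
# Toy MI + fuzzy clustering workflow: impute M times, cluster each copy,
# and keep the fuzzy membership matrices for cross-copy comparison.
import numpy as np

def fuzzy_c_means(X, c=2, m=2.0, iters=100, seed=0):
    """Standard fuzzy c-means; returns the n x c membership matrix U."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(iters):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        U = 1.0 / (d ** (2 / (m - 1)))          # standard FCM membership update
        U /= U.sum(axis=1, keepdims=True)
    return U

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(6, 1, (30, 2))])
mask = rng.random(X.shape) < 0.1                 # ~10% of entries "missing"

# Crude stand-in for proper multiple imputation: M noisy mean imputations.
memberships = []
for i in range(5):
    Xi = X.copy()
    col_means = np.nanmean(np.where(mask, np.nan, X), axis=0)
    noise = rng.normal(0, 0.5, X.shape)
    Xi[mask] = (col_means + noise)[mask]
    memberships.append(fuzzy_c_means(Xi, c=2, seed=i))
```

Agreement of the membership matrices across the five copies gives a rough, visualizable sense of how much clustering uncertainty the missingness induces, which is the intuition behind MIFuzzy's MI-based validation.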

6. Discussion

To our knowledge, this is the first time longitudinal topic modeling has been used for a systematic review. Our proposed longitudinal topic model uncovered that systems were developing on separate tracks for visual analytics of missing data and of longitudinal trial data. Methods for the analysis of incomplete data were rarely integrated into the development of visual analytics systems, which often avoided the issue entirely. This was underscored by the qualitative review of SOMFlow, Clustrophile, and other systems, which found that missing data received little emphasis in system design or in the full text of the accompanying papers. Many of these systems aimed to be operable by non-experts, but few would be suitable for a non-expert with incomplete data.

MIFuzzy has been tested in randomized controlled trials and observational studies and has shown itself to be a useful tool to support analyses. The validation metrics that MIFuzzy is able to provide, as a result of the folds created by the MI process, would be more difficult to achieve in a piecewise process where a possibly inexperienced user is left to make decisions. By narrowing down the number of possible workflows, MIFuzzy ensures the user follows a well-validated process suited to large, complex, zero-inflated, and incomplete data.

We found that MI was very rare among the visual analytics systems we examined, and even SI was only slightly more common, even though many authors mentioned the possibility of using an MI approach and acknowledged that missing data is a widespread problem affecting every domain. The MI component of MIFuzzy has been extensively tested to verify that its output matches that of SAS's PROC MI at the level of precision provided by the underlying floating-point number representation.
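For readers unfamiliar with MI, the standard way point estimates and variances are combined across imputed datasets is Rubin's rules. A minimal sketch with toy numbers (this is not MIFuzzy's validation code) makes the mechanics concrete:

```python
# Rubin's rules: combine M per-imputation estimates into one pooled estimate
# whose variance reflects both within- and between-imputation uncertainty.
import numpy as np

def pool_rubin(estimates, variances):
    """Pool M point estimates and their within-imputation variances."""
    M = len(estimates)
    q_bar = np.mean(estimates)            # pooled point estimate
    w_bar = np.mean(variances)            # average within-imputation variance
    b = np.var(estimates, ddof=1)         # between-imputation variance
    total_var = w_bar + (1 + 1 / M) * b   # Rubin's total variance
    return q_bar, total_var

# Toy example: an estimate computed on M = 5 imputed copies of a dataset.
q, t = pool_rubin([2.1, 1.9, 2.0, 2.2, 1.8], [0.04, 0.05, 0.04, 0.06, 0.05])
```

The between-imputation term is what a single-imputation workflow discards, which is why MI-based validation can quantify how much conclusions depend on the missing values.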

This level of testing was also uncommon. Other systems have largely been tested purely by looking at results, often having users rate the clustering quality. This process could prove unwieldy for large data sets. Software that encodes a predefined set of best practices from raw data to clustering results avoids bias caused by users’ subjective preferences and preconceived beliefs about what the data will show.

In this review of visual analytics tools from the IEEE VIS conference, we identified a number of interesting systems that aim to guide users through the process of identifying interesting structures and features in their data. These systems demonstrate a diverse array of clustering algorithms accompanied by innovative user interface design. We have also described MIFuzzy, one of the state-of-the-art techniques for visual analytics of longitudinal trial data. MIFuzzy takes an integrated approach to fuzzy clustering and multiple imputation, leveraging a theoretically and empirically supported method of using MI to validate the outcome of soft clustering. In comparing these systems, we were unable to identify any other that provides such a comprehensive and robustly validated visual analytics system for data that are high-dimensional, correlated, longitudinal, zero-inflated, or incomplete.

7. Conclusion

Longitudinal trials generate data that are high-dimensional, non-normal, and often contain missing or zero-inflated values. Using a longitudinal topic model for systematic review, we establish that the visual analytics literature has often treated these problems separately; integrated approaches are less common despite offering advantages. In particular, the systems described in the literature rarely handle missing data. We also describe MIFuzzy, a replicable trajectory pattern recognition tool with a standalone and robustly validated visual analytic feature designed to deal with the problems presented by longitudinal trial data. We identified further user studies and additional guidance in the user interface as possible directions for future work. More user-friendly, automated, and streamlined visualization tools need to be provided, with replicable algorithms as well as comprehensive validation and simulation procedures.

Acknowledgments

This research was partly supported by NIH grants R01DA033323-01A1 and 1R56DK114514-01A1 to Dr. Fang.

Footnotes

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

  1. R Core Team. (2013). R: A language and environment for statistical computing.
  2. Artifex Software, Inc. (2018). Ghostscript. http://ghostscript.com
  3. Arun R, Suresh V, Madhavan CEV, & Murthy MNN (2010). On finding the natural number of topics with latent Dirichlet allocation: Some observations. In Advances in knowledge discovery and data mining (pp. 391–402). Springer Berlin Heidelberg. 10.1007/978-3-642-13657-3_43
  4. Blei DM, & Lafferty JD (2006). Dynamic topic models. In Proceedings of the 23rd international conference on machine learning (pp. 113–120).
  5. Blei DM, Ng AY, & Jordan MI (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.
  6. Cao J, Xia T, Li J, Zhang Y, & Tang S (2009). A density-based method for adaptive LDA model selection. Neurocomputing, 72, 1775–1781. 10.1016/j.neucom.2008.06.011
  7. Cavallo M, & Demiralp C (2019). Clustrophile 2: Guided visual clustering analysis. IEEE Transactions on Visualization and Computer Graphics, 25, 267–276. 10.1109/tvcg.2018.2864477
  8. Deveaud R, SanJuan E, & Bellot P (2014). Accurate and effective latent concept modeling for ad hoc information retrieval. Document Numérique, 17, 61–84. 10.3166/dn.17.1.61-84
  9. Fang H (2017). MIFuzzy clustering for incomplete longitudinal data in smart health. Smart Health, 1–2, 50–65. 10.1016/j.smhl.2017.04.002
  10. Fang L (2018). DISC MIFuzzy software. https://www.umassmed.edu/fanglab/DISCMIFuzzy-software/
  11. Fang H, Dukic V, Pickett KE, Wakschlag L, & Espy KA (2012). Detecting graded exposure effects: A report on an East Boston pregnancy cohort. Nicotine & Tobacco Research, 14, 1115–1120. 10.1093/ntr/ntr272
  12. Fang H, Espy KA, Rizzo ML, Stopp C, Wiebe SA, & Stroup WW (2009). Pattern recognition of longitudinal trial data with nonignorable missingness: An empirical case study. International Journal of Information Technology & Decision Making, 8, 491–513.
  13. Fang H, Johnson C, Stopp C, & Espy K (2011). A new look at quantifying tobacco exposure during pregnancy using fuzzy clustering. Neurotoxicology and Teratology, 33, 155–165. 10.1016/j.ntt.2010.08.003
  14. Fang H, & Zhang Z (2018). An enhanced visualization method to aid behavioral trajectory pattern recognition infrastructure for big longitudinal data. IEEE Transactions on Big Data, 4, 289–298. 10.1109/tbdata.2017.2653815
  15. García-Laencina PJ, Sancho-Gómez J-L, & Figueiras-Vidal AR (2010). Pattern classification with missing data: A review. Neural Computing & Applications, 19, 263–282.
  16. Griffiths TL, & Steyvers M (2004). Finding scientific topics. Proceedings of the National Academy of Sciences, 101, 5228–5235. 10.1073/pnas.0307752101
  17. Gurugubelli VS, Fang H, & Wang H (2019). Neuro-fuzzy classifier for longitudinal behavioral intervention data. In International conference on computing, networking and communications (ICNC). IEEE. 10.1109/iccnc.2019.8685574
  18. Gu Y, & Wang C (2011). TransGraph: Hierarchical exploration of transition relationships in time-varying volumetric data. IEEE Transactions on Visualization and Computer Graphics, 17, 2015–2024.
  19. IEEE. (2020). IEEE Xplore digital library. https://ieeexplore.ieee.org/Xplore/home.jsp
  20. Kidwell P, Lebanon G, & Cleveland W (2008). Visualizing incomplete and partially ranked data. IEEE Transactions on Visualization and Computer Graphics, 14, 1356–1363.
  21. Kim SS, Fang H, Bernstein K, Zhang Z, DiFranza J, Ziedonis D, & Allison J (2017). Acculturation, depression, and smoking cessation: A trajectory pattern recognition approach. Tobacco Induced Diseases, 15. 10.1186/s12971-017-0135-x
  22. Kwon BC, Eysenbach B, Verma J, Ng K, De Filippi C, Stewart WF, & Perer A (2017). Clustervision: Visual supervision of unsupervised clustering. IEEE Transactions on Visualization and Computer Graphics, 24, 142–151.
  23. Ma Y, Tung AKH, Wang W, Gao X, Pan Z, & Chen W (2020). ScatterNet: A deep subjective similarity model for visual analysis of scatterplots. IEEE Transactions on Visualization and Computer Graphics, 26, 1562–1576. 10.1109/tvcg.2018.2875702
  24. Mellins CA, Elkington KS, Bauermeister JA, Brackis-Cott E, Dolezal C, McKay M, Wiznia A, Bamji M, & Abrams EJ (2009). Sexual and drug use behavior in perinatally HIV-infected youth: Mental health and family influences. Journal of the American Academy of Child & Adolescent Psychiatry, 48, 810–819.
  25. Murzintcev N (2016). ldatuning: Tuning of the latent Dirichlet allocation models parameters. R package version 0.2.0. CRAN.
  26. Pickett KE, Kasza K, Biesecker G, Wright RJ, & Wakschlag LS (2009). Women who remember, women who do not: A methodological study of maternal recall of smoking in pregnancy. Nicotine & Tobacco Research, 11, 1166–1174. 10.1093/ntr/ntp117
  27. Polack PJ, Chen S-T, Kahng M, Sharmin M, & Chau DH (2015). TimeStitch: Interactive multi-focus cohort discovery and comparison. In IEEE conference on visual analytics science and technology (VAST). IEEE. 10.1109/vast.2015.7347682
  28. Rehurek R, & Sojka P (2010). Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 workshop on new challenges for NLP frameworks.
  29. Rubin D (2004). Multiple imputation for nonresponse in surveys. John Wiley & Sons.
  30. Rubin D, & Little R (2019). Statistical analysis with missing data (Vol. 793). Wiley.
  31. Sacha D, Kraus M, Bernard J, Behrisch M, Schreck T, Asano Y, & Keim DA (2018). SOMFlow: Guided exploratory cluster analysis with self-organizing maps and analytic provenance. IEEE Transactions on Visualization and Computer Graphics, 24, 120–130. 10.1109/tvcg.2017.2744805
  32. Sackett DL (1979). Bias in analytic research. In The case-control study: Consensus and controversy (pp. 51–63). Elsevier. 10.1016/b978-0-08-024907-0.50013-4
  33. Schloss KB, Gramazio CC, Silverman AT, Parker ML, & Wang AS (2019). Mapping color to meaning in colormap data visualizations. IEEE Transactions on Visualization and Computer Graphics, 25, 810–819. 10.1109/tvcg.2018.2865147
  34. Seo J, & Shneiderman B (2006). Knowledge discovery in high-dimensional data: Case studies and a user survey for the rank-by-feature framework. IEEE Transactions on Visualization and Computer Graphics, 12, 311–322.
  35. Silge J, & Robinson D (2016). tidytext: Text mining and analysis using tidy data principles in R. Journal of Open Source Software, 1, 37.
  36. Szymczak A (2012). Hierarchy of stable Morse decompositions. IEEE Transactions on Visualization and Computer Graphics, 19, 799–810.
  37. Volante M, Babu SV, Chaturvedi H, Newsome N, Ebrahimi E, Roy T, Daily SB, & Fasolino T (2016). Effects of virtual human appearance fidelity on emotion contagion in affective inter-personal simulations. IEEE Transactions on Visualization and Computer Graphics, 22, 1326–1335.
  38. Wakschlag LS, Henry DB, Blair RJR, Dukic V, Burns J, & Pickett KE (2011). Unpacking the association: Individual differences in the relation of prenatal exposure to cigarettes and disruptive behavior phenotypes. Neurotoxicology and Teratology, 33, 145–154. 10.1016/j.ntt.2010.07.002
  39. Wang CJ, Fang H, Kim S, Moormann A, & Wang H (2015). A new integrated fuzzifier evaluation and selection (NIFEs) algorithm for fuzzy clustering. Journal of Applied Mathematics and Physics, 3, 802–807. 10.4236/jamp.2015.37098
  40. Whited B, & Rossignac J (2011). Ball-morph: Definition, implementation, and comparative evaluation. IEEE Transactions on Visualization and Computer Graphics, 17, 757–769. 10.1109/tvcg.2010.115
  41. Zhang Z, & Fang H (2016). Multiple- vs. non- or single-imputation based fuzzy clustering for incomplete longitudinal behavioral intervention data. In IEEE first international conference on connected health: Applications, systems and engineering technologies (CHASE). IEEE. 10.1109/chase.2016.19
  42. Zhang Z, Fang H, & Wang H (2016). A new MI-based visualization aided validation index for mining big longitudinal web trial data. IEEE Access, 4, 2272–2280. 10.1109/access.2016.2569074
