Abstract
Principal component analysis (PCA) discovers patterns in multivariate data that include spectra, microscopy, and other biophysical measurements. Direct application of PCA to crowded spectra, images, and movies (without selecting peaks or features) was shown recently to identify their equilibrium or temporal changes. To enable the community to utilize these capabilities with a wide range of measurements, we have developed multiplatform software named TREND to Track Equilibrium and Nonequilibrium population shifts among two-dimensional Data frames. TREND can also carry this out by independent component analysis. We highlight a few examples of finding concurrent processes. TREND extracts dual phases of binding to two sites directly from the NMR spectra of the titrations. In a cardiac movie from magnetic resonance imaging, TREND resolves principal components (PCs) representing breathing and the cardiac cycle. TREND can also reconstruct the series of measurements from selected PCs, as illustrated for a biphasic, NMR-detected titration and the cardiac MRI movie. Fidelity of reconstruction of series of NMR spectra or images requires more PCs than needed to plot the largest population shifts. TREND reads spectra from many spectroscopies in the most common formats (JCAMP-DX and NMR) and multiple movie formats. The TREND package thus provides convenient tools to resolve the processes recorded by diverse biophysical methods.
Introduction
Plotting the course of biomolecular or physiological processes typically uses procedures specific to the field. In the case of spectroscopy and imaging, tracking the process can be laborious because of the steps of assigning the peaks of the spectra or features in the images, manually choosing peaks or image features subjectively judged optimal for monitoring the process of interest, and managing complications from any concurrent processes. Spectral overlap and peak broadening (e.g., from chemical exchange in NMR) can prevent correct fitting (1). A more elegant alternative to such efforts is to apply unsupervised, multivariate statistical pattern recognition such as principal component analysis (PCA). PCA has provided insight from series of measurements from diverse techniques of molecular biophysics that include magnetic resonance, vibrational, optical, and dichroic spectroscopies; x-ray scattering and diffraction; mass spectrometry; calorimetry; hydrodynamics; atomic force microscopy; electron microscopy; and imaging by fluorescence, Raman or light scattering, as well as functional magnetic resonance imaging (Table S1 in the Supporting Material and references therein). Biophysical studies have often used PCA to determine dependencies of various reactions upon time, concentration, or other conditions, e.g., in protein folding (Table S1). PCA applied directly to spectra, images, and movies appears to be a convenient and general way to determine the main trends of change among measurement frames that record many localized changes. PCA is much more accommodating of many data distributions than is often appreciated (2). It transforms many measured variables to far fewer and uncorrelated principal components (PCs) that each capture part of the trends of covariation among measured variables (2).
PCA of NMR peak lists was used to track equilibrium transitions of proteins due to pH (3) and binding of partners (4, 5, 6). Closely related singular value decomposition (SVD) of NMR peak pick lists was used to reconstruct filtered basis spectra for use in fitting biphasic ligand binding (7) or for identifying binding sites (8). The applications of PCA were recently extended directly to NMR spectra, images, and movies without choosing any peaks or features for analysis. Moreover, applying PCA directly to NMR spectra makes binding isotherms easily accessible in all chemical exchange regimes, including intermediate exchange where severe broadening and nonlinearity of peak shifts ordinarily mask the true course of molecular association (1, 9). Application of SVD to series of time-dependent two-dimensional (2D) images or spectra extracted the dominant time course as PC1. The approach also detected multiple time-evolving processes in magnetic resonance imaging (MRI) movies as PCs. Similarly, when two sequential steps of binding were monitored by NMR, PCA detected both binding steps and the intermediate state with a single ligand bound (9). Accomplishments with SVD (PCA) have usually been limited to the laboratories that wrote task-specific code to perform the calculations, however.
Independent component analysis (ICA) can complement PCA. ICA aims instead to find independent components (ICs) (2). The quest of ICA for statistical independence is more demanding than PCA’s aim of correlation coefficients of zero. These objectives are equivalent for Gaussian (normal) distributions. ICA can be regarded as more general than PCA and is effective for non-Gaussian data and situations where PCA fails (2). However, ICA can be much slower to compute than PCA, can converge less readily, and can require repeated calculations. Like PCA, ICA has been used to reduce dimensionality and filter or separate data in processing signals, images (10), large biological data sets (11), and NMR spectra of mixtures (12, 13, 14).
To make these capabilities available for application to a variety of spectroscopic and imaging techniques used in biophysics, we have developed a software package named TREND (Tracking and Resolving Equilibrium and Nonequilibrium population shifts in Data). Its main means of tracking the shifts is PCA implemented with SVD. Its secondary means is an ICA algorithm, which recapitulates the PCA results we examined, provided the correct number of ICs is specified. We first sought to extract binding isotherms, equilibrium shifts, and time courses (all potentially with multiple components) from series of NMR spectra. Because of the suitability of PCA for many other kinds of series of 2D digital data frames, we utilized the Python community’s support of file I/O in multiple data formats (e.g., movies and spreadsheets) and wrote code for additional spectroscopic formats, enabling wide application (e.g., JCAMP-DX, Sparky peak list). For example, we analyzed a cardiac MRI movie (15) with TREND to isolate multiple aspects of the cardiac cycle and to reconstruct movies from combinations of PCs. This software package can resolve biologically relevant reactions and processes with relative ease from many biophysical sources of complicated spectral and imaging data.
Materials and Methods
Implementation of TREND
TREND was written in Python 2.7 and calls NumPy for linear algebra and random number generation. It implements PCA (SVD) with function calls to NumPy. We first wrote TREND for operation at the command line. We added a graphical user interface (GUI), supported by Gooey, using function calls to wxPython. Most users will prefer to use the GUI to operate TREND. TREND comprises three programs, each with both interfaces (Table 1). The executable files trendmaingui and trendmain compute the PCs or ICs across the 2D series of measurements, create temporary files used by the plotting or reconstruction programs run afterward, and plot the first three components plus benchmarks of their significance. Trendplotgui and trendplot provide optional plotting that is customizable in terms of the number and choice of normalization of the components. Optional reconstructions of the measurement series are available from trendreconstructgui and trendreconstruct (Table 1). Explanations of the flags and parameters for the command line versions are available online in the manual for TREND (https://trendmizzou.gitbooks.io/trend-manual/content/). For convenience of installation, we packaged TREND and the public domain software it depends upon using PyInstaller. Consequently, TREND does not need Python on the host system. Distributions are available for Windows 7 and later, Mac OS X 10.7 and later, and these versions of Linux: Ubuntu 14.04/Fedora 23, Ubuntu 16.04, and Red Hat 7.1/CentOS 7 (http://biochem.missouri.edu/trend).
Table 1.
Executable File | Roles | Interface |
---|---|---|
trendmain.exe | preprocess and compute PCs or ICs | CLI |
trendmaingui.exe^a | preprocess and compute PCs or ICs | GUI |
trendplot.exe | plot selected PCs or ICs with choice of normalization | CLI |
trendplotgui.exe^a | plot selected PCs or ICs with choice of normalization | GUI |
trendreconstruct.exe | reconstruct spectra, images, or movies from PCs | CLI |
trendreconstructgui.exe^a | reconstruct spectra, images, or movies from PCs | GUI |
CLI, command-line interface; GUI, graphical user interface.
^a Executable files with GUI are trendmaingui.app, trendplotgui.app, and trendreconstructgui.app for OS X or macOS platforms.
Conversion of a stack of 2D measurements into a matrix for analysis
A wide variety of 2D measurements can be read and analyzed by trendmaingui and trendmain. These include images or movie frames comprising pixels, one-dimensional (1D) and 2D spectra from many spectroscopies, lists of peak positions and heights, and unprocessed NMR spectroscopic data in the time domain (free induction decays, FIDs; Fig. 1). The program reads NMR spectra in NMRPipe, Sparky, and Bruker Topspin formats (as well as FIDs in NMRPipe and Topspin formats) (Table 2) using code from Nmrglue (16). Trendmaingui and trendmain also read Agilent (Varian) VnmrJ format and JCAMP-DX formats of Bruker, Agilent, and JEOL spectrometers. To analyze the measurements from many other kinds of spectroscopy and biophysical measurements (Table S1), the program reads the most common JCAMP-DX formats, as well as spreadsheet and text formats commonly written by instruments (Table 2). Trendmaingui and trendmain read NMR peak lists either in the format of Sparky peak lists (17) or as plain text files, before converting them into column vectors (3, 7). Movies are read in multiple formats (e.g., avi, mov, mp4, ogv, webm) by the MoviePy module into three-dimensional (3D) arrays with color layers. Trendmaingui and trendmain convert the movie frames to gray-scale (8-bit depth) and rearrange them into 2D matrices (Fig. 2). Time-lapse series of PNG images are read, using the SciPy package of Python, and handled similarly.
Table 2.
Choice in Trendmaingui | Format | Reconstruction Support | Comment |
---|---|---|---|
NMR Data Formats | |||
fid | NMRPipe FID | yes | |
ft2 | NMRPipe Ft2 | yes | |
ucsf | Sparky UCSF | yes | |
brukerfid | Bruker Topspin FID | yes | fid, ser in /1/pdata/ subfolder^b |
brukerft2 | Bruker Topspin spectra | yes | 1r, 2rr files |
agilentfid | VnmrJ, OpenVnmrJ FID | yes | fid |
agilentspectra | VnmrJ, OpenVnmrJ spectra | no | Phasefile^a |
sparkylist | Sparky peak list | yes | duplicate peaks not allowed |
JCAMP-DX (Joint Committee on Atomic and Molecular Physical Data - Data Exchange format) | |||
jcamp | JCAMP-DX | no | Only supports (X++(Y..Y)) and (XY..XY)^c |
Text File Formats | |||
txt | floating point | yes | for series of text files |
complextxt | complex numbers | yes | for series of text files |
singletxt | complex or floating point | yes | for single .TXT file containing entire series |
Spreadsheet Formats | |||
csv | comma-separated floating point | no | for series of .CSV files |
complexcsv | comma-separated complex numbers | no | for series of .CSV files |
singlecsv | comma-separated complex or floating point | no | for single .CSV file containing entire series |
excel | Excel format | no | for series of Excel files |
singleexcel | Excel format with tabs | no | for single file with single or multiple tabs |
Images and Movies | |||
png | images in PNG format | yes | for series of .PNG files |
movie | common video formats | yes | .ogv, .mp4, .mpeg, .avi, .mov, .webm |
^a See the online TREND manual (https://trendmizzou.gitbooks.io/trend-manual/content/).
^b Currently the processed spectra must be saved by setting processed directory to 1.
JCAMP-DX is a general format for exchanging and archiving data from many instruments, including but not limited to infrared (IR), Raman, ultraviolet-visible (UV-Vis), fluorescence, NMR, and electron paramagnetic resonance (EPR). The data stored in JCAMP-DX files can be spectral plots, contours, or peak tables. TREND supports the most common JCAMP-DX formats. The digital data in JCAMP-DX can be AFFN (ASCII FREE FORMAT NUMERIC) form or ASDF (ASCII SQUEEZED DIFFERENCE FORM). TREND supports decoding compressed data, including PAC, SQZ, DIF, SQZDUP, and DIFDUP. Two most common tabular data forms, (X++(Y..Y)) and (XY..XY) are supported. TREND reads a series of JCAMP-DX files, or a single JCAMP-DX file with one or multiple blocks. TREND supports NTUPLE format (introduced by JCAMP-DX 5.0), which is designed for multidimensional techniques with data sets with multiple variables. For example, JCAMP-DX NMR uses NTUPLE to show mixed real/imaginary FID data sets. See format details in http://www.jcamp-dx.org/, https://badc.nerc.ac.uk/help/formats/jcamp_dx/, and http://wwwchem.uwimona.edu.jm:1104/spectra/testdata/index.html.
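As an illustration of the (X++(Y..Y)) tabular form, the following minimal Python sketch decodes uncompressed AFFN rows. It is not TREND's parser: the function name `parse_xyy_affn` and its arguments are hypothetical, the values for first X, last X, number of points, and Y factor are assumed to have been read already from the file header, and the compressed forms (PAC, SQZ, DIF, and the DUP variants) are deliberately not handled.

```python
import numpy as np

def parse_xyy_affn(lines, firstx, lastx, npoints, yfactor=1.0):
    """Decode uncompressed (X++(Y..Y)) rows into x and y arrays.

    Hypothetical helper: handles only whitespace-separated AFFN numbers,
    not the PAC, SQZ, DIF, or DUP compressed forms.
    """
    y_raw = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        # The first field is the abscissa of the first ordinate on the row;
        # the remaining fields are consecutive Y values.
        y_raw.extend(float(value) for value in fields[1:])
    if len(y_raw) != npoints:
        raise ValueError("expected %d points, found %d" % (npoints, len(y_raw)))
    x = np.linspace(firstx, lastx, npoints)  # X values are equally spaced
    y = np.array(y_raw) * yfactor            # apply the Y scaling factor
    return x, y

# Two made-up data rows holding five points in total.
rows = ["400 10 20 30", "403 40 50"]
x, y = parse_xyy_affn(rows, firstx=400.0, lastx=404.0, npoints=5, yfactor=0.5)
```

Each decoded spectrum would then become one column of the data matrix, in the same way as the other formats.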
In the case of NMR data, spectra very recently emerged as probably the preferred format for application of PCA (9). In the examples below, NMR spectra (collected with a uniform set of parameters) were processed with NMRPipe (18) and converted to the UCSF format of Sparky (17, 19). NMR spectra in UCSF format were read by trendmaingui or trendmain for conversion into 2D matrices (Fig. 2). Unprocessed NMR data in the time domain (FIDs) can also be read and processed, with the solvent signal subtracted. (Analysis of time domain data is justified by Parseval’s theorem regarding the equivalency of signals in the time and frequency domains (20).)
Preprocessing
Regardless of original data format, columns from each 2D measurement read are positioned end-to-end into a single 1D vector for convenience (9) (Fig. 2). These 1D columns are arrayed over the experimental variable (concentration, pH, time, etc.) into the data matrix X, which has F1 × F2 points in the column dimension and n points per row for the n experimental conditions. To expedite manipulations of this matrix X and facilitate calculations on a modest laptop computer, each vector is compressed by deleting unchanging positions, resulting in matrix X′ (Fig. 2). For SVD of spectra, the user is encouraged to use a threshold that is three- to sevenfold the noise level to filter out low intensity regions of the spectra, which compresses matrix X′ further. However, it is better to use a lower threshold where intermediate exchange broadening significantly weakens NMR peaks.
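The unfolding and compression just described can be sketched with NumPy as follows. This is an illustrative sketch rather than TREND's own code: the function name `unfold_and_compress`, the argument `threshold`, and the choice to test the maximum absolute intensity of each position against the threshold are assumptions.

```python
import numpy as np

def unfold_and_compress(frames, threshold=0.0):
    """Build X (F1*F2 rows, n columns) and the compressed matrix X'.

    frames: array of shape (n, F1, F2), one 2D measurement per condition.
    Rows that never change, or never exceed `threshold` in absolute
    intensity, are dropped (the threshold test is an assumption).
    """
    frames = np.asarray(frames, dtype=float)
    # order="F" places the columns of each 2D frame end-to-end.
    X = np.stack([f.flatten(order="F") for f in frames], axis=1)
    changes = (X.max(axis=1) - X.min(axis=1)) > 0
    intense = np.abs(X).max(axis=1) > threshold
    keep = changes & intense
    return X, X[keep], keep

# Three made-up 2 x 2 frames: one pixel grows strongly, one is constant,
# one is always zero, and one varies only weakly.
frames = np.array([[[1.0, 5.0], [0.0, 0.1]],
                   [[2.0, 5.0], [0.0, 0.2]],
                   [[3.0, 5.0], [0.0, 0.3]]])
X, X_prime, keep = unfold_and_compress(frames, threshold=0.5)
```

The boolean mask `keep` records which positions were retained, which is what allows a reconstruction to be mapped back onto the original 2D layout.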
As required by PCA and ICA algorithms, the rows of compressed matrix X′ are centered and then optionally scaled. Scaling enlarges weaker signals relative to large signals. The options for scaling methods include autoscaling, Pareto scaling, or no scaling (21). No scaling appears acceptable in most titrations, but autoscaling generally enhances fits to the binding isotherms. Autoscaling obviates the systematic scaling of 15N NMR peak shifts down by several-fold relative to 1H shifts that were used in PCA of lists in (3). Autoscaling also generalizes to 1H-13C correlation spectra. Pareto scaling is recommended for NMR titrations with substantial intermediate exchange broadening (9). Range, vast, and level scaling (21) are also implemented in trendmain but do not work well with NMR spectra. No scaling has been used for MRI movies. Column centering and scaling are not necessary in our experience, but are available in trendmaingui and trendmain as they are sometimes used for PCA (22) and ICA (11). The descriptions of the data scaling and centering methods (21) are listed in Table S2.
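A minimal sketch of row centering with the three main scaling options follows. It assumes the definitions of (21), in which autoscaling divides each centered row by its standard deviation and Pareto scaling divides by the square root of the standard deviation. The function name `center_and_scale`, the use of the sample standard deviation (`ddof=1`), and the guard for constant rows are assumptions, not a description of TREND's internals.

```python
import numpy as np

def center_and_scale(X_prime, method="none"):
    """Center each row of X', then optionally scale it.

    method: "none", "auto" (divide by the row standard deviation), or
    "pareto" (divide by the square root of the row standard deviation).
    """
    X_prime = np.asarray(X_prime, dtype=float)
    centered = X_prime - X_prime.mean(axis=1, keepdims=True)
    if method == "none":
        return centered
    std = X_prime.std(axis=1, ddof=1, keepdims=True)
    std[std == 0] = 1.0  # leave constant rows unscaled (assumed guard)
    if method == "auto":
        return centered / std
    if method == "pareto":
        return centered / np.sqrt(std)
    raise ValueError("unknown scaling method: %s" % method)

# A weak and a tenfold stronger signal following the same trend.
X_prime = np.array([[1.0, 2.0, 3.0],
                    [10.0, 20.0, 30.0]])
auto = center_and_scale(X_prime, "auto")
pareto = center_and_scale(X_prime, "pareto")
```

In this example autoscaling makes the two rows identical, whereas Pareto scaling leaves the stronger row larger by a factor of the square root of ten, which shows how Pareto scaling only partly equalizes weak and strong signals.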
Calculating principal components via SVD
The compressed, preprocessed matrix X′ has m points per column and n points or experimental conditions in each row, with m > n. X′ can be decomposed into three matrices as follows:
$$X' = U S V^{T} \tag{1}$$
where $U$ and $V^{T}$ are orthogonal matrices and $S$ is a diagonal matrix that contains the square roots of eigenvalues for vectors in $U$ or $V$ in descending order. To obtain the trends of change across the measurements, we are interested in $V^{T}$, whose rows span $X'$ and are called the right singular vectors. The $V^{T}$ matrix has row vectors $v_1^{T}, v_2^{T}, \ldots, v_n^{T}$. Importantly, the first row in the $V^{T}$ matrix is PC1 and the second row PC2, i.e., the two largest trends of change among the series of spectra or images measured. To obtain these PCs that record the relationships among columns in $X'$ (Fig. 2), it suffices to calculate $V^{T}$. The rows of $V^{T}$ are orthonormal eigenvectors of the symmetric matrix $X'^{T} X'$ (Fig. 2). (The normalized form of $X'^{T} X'$ is equivalent to the covariance matrix, the alternative algorithm for computing PCA (2).) The normalized PC1 values from the first row of $V^{T}$ indicate the fractional population of the main change at each measurement in the series of measurements. When obtained from a typical titration of ligand binding, PC1 represents the binding isotherm; a dissociation constant may be fitted to it (9).
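The decomposition can be sketched with NumPy's SVD routine, as below. The synthetic titration is invented for illustration: a hyperbolic saturation curve with an arbitrary midpoint of 0.5 stands in for a binding isotherm. The helper names are hypothetical, and rescaling PC1 to the range 0 to 1 is only an assumed form of normalization. The sign of each singular vector returned by SVD is arbitrary, so a PC may need to be inverted before plotting.

```python
import numpy as np

def principal_components(X_prime):
    """Return U, singular values, V^T, and each PC's share of the variance."""
    U, s, Vt = np.linalg.svd(X_prime, full_matrices=False)
    contribution = s**2 / np.sum(s**2)
    return U, s, Vt, contribution

def normalize_unit_range(pc):
    """Rescale a PC to span 0 to 1 (an assumed normalization)."""
    return (pc - pc.min()) / (pc.max() - pc.min())

# Synthetic series: 200 retained points observed at n = 12 conditions.
rng = np.random.default_rng(0)
n = 12
ligand = np.linspace(0.0, 5.0, n)
isotherm = ligand / (0.5 + ligand)        # illustrative saturation curve
loadings = rng.normal(size=(200, 1))      # response of each point
X_prime = loadings @ isotherm[np.newaxis, :]
X_prime = X_prime + 0.01 * rng.normal(size=(200, n))   # small noise
X_prime = X_prime - X_prime.mean(axis=1, keepdims=True)  # center the rows

U, s, Vt, contribution = principal_components(X_prime)
pc1 = normalize_unit_range(Vt[0])  # first row of V^T is PC1
```

With a single underlying process, PC1 carries nearly all of the variance and tracks the simulated isotherm up to sign and offset.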
Reconstruction of spectra, images, or movies by PCA
The reconstructed data set Xreconst, with size of m × n, can be calculated as follows:
$$X_{\mathrm{reconst}} = \sum_{i \in \{a, b, c, d, e, \ldots\}} s_i \, u_i \, v_i^{T} \tag{2}$$
where $s_i$ is the $i$th diagonal element of $S$ and a, b, c, d, e … refer to the index of PCs generated by trendmain or trendmaingui to use in the reconstruction by trendreconstruct or trendreconstructgui. (Note a, b, c, d, e … can be nonconsecutive integers. To enable this, the “reconst” box should be selected in trendmaingui. When using trendmain, the –reconst flag should be included.) The $U$ matrix is used for the reconstruction. It can be rewritten as column vectors: $u_1, u_2, \ldots, u_n$, which lie in the column space of $X'$. $U$ can be calculated similarly to $V^{T}$, by solving eigenvectors of the matrix $X' X'^{T}$. To recover the original 2D data series, the preprocessing steps of centering, scaling, and compression (filtering) can be reversed as described in the manual for TREND. The user can choose to reconstruct the centered and scaled matrix, matrix X′, or matrix X in the format of the original data (Fig. 2).
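Reconstruction from a chosen subset of PCs amounts to summing the corresponding rank-one terms of the SVD. The sketch below shows this for a random centered matrix; the function name `reconstruct` and the convention of passing 1-based PC numbers are assumptions, and the reversal of centering, scaling, and compression is not covered.

```python
import numpy as np

def reconstruct(U, s, Vt, pcs):
    """Sum the rank-one terms s_i * u_i * v_i^T for the selected PCs.

    pcs: 1-based PC numbers, which may be nonconsecutive, e.g. [1, 3].
    """
    idx = np.asarray(pcs, dtype=int) - 1
    # Scale the selected columns of U by their singular values, then
    # multiply by the matching rows of V^T.
    return (U[:, idx] * s[idx]) @ Vt[idx, :]

# Random stand-in for a centered, preprocessed matrix X' (m = 50, n = 6).
rng = np.random.default_rng(1)
X_prime = rng.normal(size=(50, 6))
X_prime = X_prime - X_prime.mean(axis=1, keepdims=True)
U, s, Vt = np.linalg.svd(X_prime, full_matrices=False)

full = reconstruct(U, s, Vt, [1, 2, 3, 4, 5, 6])  # recovers X'
partial = reconstruct(U, s, Vt, [1, 3])           # PC1 and PC3 only
```

Because the terms are additive, a reconstruction that omits a PC (for example, PC1) is simply the full data minus that PC's rank-one contribution.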
ICA calculations
ICA is available in TREND and implemented using scikit-learn (http://scikit-learn.org/stable/modules/decomposition.html#ica). Despite the potential generality of ICA, two limitations need to be respected. Since the magnitudes of ICs cannot be determined, their contributions cannot be ranked. ICA is also prone to local minima during optimization, requiring comparisons of repeated calculations (10, 11). TREND implements the FastICA algorithm for computational efficiency. FastICA preprocesses data by PCA to reduce dimensions and avoid overlearning (23, 24, 25). (Overlearning is an underdetermined situation that interferes in obtaining parameters and introduces artifacts to ICs (24, 25)).
ICA decomposes the data matrix X as follows:
$$X = A S \tag{3}$$
where A is the unknown mixing matrix that is invertible, square, and mixes the components in X, and S is the matrix containing underlying independent sources. The aim of ICA is to solve for the mixing matrix A because it contains the ICs that may contain the meaningful trends sought. However, A and S both being unknown makes ICA calculations challenging (10). The equation can be rewritten as follows:
$$S = W X \tag{4}$$
where $W$ is the unmixing matrix that is calculated as $A^{-1}$. To simplify and improve convergence of ICA, X is preprocessed to remove correlations and to normalize it, a process called whitening, which generates $X_w$. FastICA implements this whitening step using PCA to calculate the whitened data matrix $X_w$ as follows:
$$X_w = D^{-1/2} E^{T} X \tag{5}$$
where $E$ is the matrix whose columns are normalized eigenvectors of the covariance matrix $XX^{T}$, and $D$ is the diagonal matrix of the corresponding eigenvalues. The preprocessing with PCA also removes noise and reduces dimensions for ICA. The whitening simplifies the ICA problem to finding the unknown rotation matrix $V$ that is defined as $V = W E D^{1/2}$. In FastICA, $V$ is estimated by maximizing non-Gaussian character. The equations lead to the following:
$$S = V X_w = V D^{-1/2} E^{T} X \tag{6}$$
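The whitening step can be checked numerically with a short NumPy sketch. It is not the FastICA implementation of scikit-learn: the function name `whiten` is hypothetical, the rows of X are taken to be the mixed signals and the columns the samples, and the two sources and the mixing matrix are invented for illustration.

```python
import numpy as np

def whiten(X, n_components):
    """Whiten X by PCA: X_w = D^(-1/2) E^T X, keeping the largest components."""
    Xc = X - X.mean(axis=1, keepdims=True)   # center each mixed signal
    cov = Xc @ Xc.T / Xc.shape[1]            # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    D = eigvals[order]                       # leading eigenvalues
    E = eigvecs[:, order]                    # matching eigenvectors as columns
    # Dividing each column of E by sqrt(D) and transposing gives D^(-1/2) E^T.
    return (E / np.sqrt(D)).T @ Xc

# Two invented sources mixed into three observed signals.
t = np.linspace(0.0, 8.0, 500)
sources = np.vstack([np.sin(2.0 * t), np.sign(np.sin(3.0 * t))])
A = np.array([[1.0, 0.5],
              [0.5, 1.0],
              [1.0, -1.0]])
X = A @ sources
X_w = whiten(X, n_components=2)
```

After whitening, the rows of `X_w` are uncorrelated with unit variance, so only a rotation remains to be estimated by ICA.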
Results and Discussion
Workflows of TREND
For wide application of SVD or ICA to diverse series of 2D measurements, we wrote TREND in Python to read and analyze multiple types of data. These include diverse spectra, images, movies, or lists in text or spreadsheet formats available from many modern instruments (Fig. 1). The spectral formats include widely used JCAMP-DX standards and NMR formats. TREND can also apply PCA or ICA to a single 2D data matrix read in from a text file, spreadsheet file, or multiblock JCAMP-DX file containing multiple spectra (Table 2). The algorithm of the trendmain and trendmaingui executable files includes the following steps:
1) Convert each 2D measurement into a 1D vector arrayed by the experimental condition varied, in the data matrix X.
2) Preprocess X with compression to X′ and optional scaling.
3) Perform streamlined SVD or ICA to identify components (PC1, PC2, … or IC1, IC2, …) representing the major trend(s) varying with the experimental variable (Fig. 1).
The TREND package provides additional executable files for plotting the course of selected PCs or ICs or for rebuilding spectra, images, or movies from selected PCs (Table 1). For convenience, the plotting and reconstruction routines read temporary files just created by trendmain or trendmaingui, so specifying input files is optional. The user may operate and customize these computations through the GUI or through command line arguments described in the documentation for the software.
Although we used the Python routine NumPy to implement PCA (SVD) and scikit-learn (http://scikit-learn.org/stable/modules/decomposition.html#ica) to implement ICA calculations, corresponding routines are available in R (https://mran.microsoft.com/packages/), MATLAB (The MathWorks, Natick, MA, https://www.mathworks.com/matlabcentral/fileexchange/38300-pca-and-ica-package), and the MATLAB Statistics Toolbox. Recreating the workflows and functions depicted in Figs. 1 and 2 in an R or MATLAB environment would require code to parse the file formats of interest, reduce their dimensionality (i.e., “unfold” them), preprocess for readiness for the SVD or ICA routine, and interpret or reconstruct the results in the appropriate format. TREND spares the user this effort with a package that is user-friendly for NMR and other measurements from a variety of instrumentation, including spectroscopies and imaging; see Table 2 for data formats handled. TREND is free for academics, avoiding the cost of licensing MATLAB. TREND requires <200 MB of disk space whereas the MATLAB environment occupies 2 to 3 GB. TREND is portable and its installation lacks dependencies, other than the need for Internet access upon first usage.
We present examples of uses of TREND that illustrate 1) its performance in resolving two or more processes, which is nonroutine by conventional means, and 2) its wide applicability to trace and reconstruct concurrent, complex transformations recorded by biophysical means such as spectra or imaging.
Examples of ligand binding to two sites detected by NMR
Antecedents to TREND’s direct application of PCA to spectra and images were previous PCA studies of NMR peak lists. SVD was used to filter noise out of the lists, in turn used to reconstruct clean basis spectra to resolve three pH transitions (3) or two binding events (4, 7). With TREND we demonstrate a direct spectrum-driven approach to the latter examples of two biphasic associations. Fig. 3 A plots a two-site binding scheme, where P and L denote [protein] and [ligand], respectively. KD1 and KD2 are dissociation constants from site 1 and 2. PLn1 and PLn2 are intermediates with ligand at site 1 or 2, where n1 and n2 indicate the numbers of ligand molecules that bind cooperatively to site 1 and 2, respectively. PLn1Ln2 stands for the fully bound state. Equations 3 to 6 from (7) were used to simulate populations of species from the two-site binding scheme in a series of 15N HSQC spectra (Fig. 3 B) using methods given in Supporting Material. The curvature in the simulated shifts of several peaks (red arrows in Fig. 3 B) accompanies more than one mode of binding (7). PCA on the peak lists (chemical shifts) captures two smooth components, PC1 and PC2 (purple in Fig. 3 C), contributing 90% and 6% of the variance, respectively. The PC1 and PC2 components of the peak lists were recreated using trendreconstruct. PC1 captures from the curved trajectories of peak movements the main linear paths of change (Fig. S1, A and B). PC2 identifies the peak shifts orthogonal to PC1 (Fig. S1 C). Computing PC1 and PC2 instead directly from the simulated HSQC spectra using trendmain (green in Fig. 3 C) reproduces their counterparts extracted from peak lists very well, although each component contributes much less of the variance (38% and 15%, respectively). However, when PC2 values are normalized by PC1, there is a systematic difference in amplitude of PC2 consistent with its percentage of the variances listed above (inset in Fig. 3 C). 
This simulated two-site binding example and a number of 1:1 ligand-binding examples (2, 9) suggest that normalized PCs extracted from lists of picked peaks in the fast-exchange regime can be reproduced well by applying PCA to the series of spectra. However, PC1 and PC2 extracted by TREND from the FIDs from the simulated two-site binding example are skewed with sigmoidal deviation from the PCs obtained from either the peak lists or spectra (Fig. S2, A and B). In the investigation of a titration of β-lactoglobulin with 1-anilinonaphthalene-8-sulfonate (ANS), Konuma et al. resolved two binding components using PCA of the assigned peaks from the NMR spectra of the titration (4). They observed curved trajectories (red arrows in Fig. 3 D) and linear trajectories (blue arrows in Fig. 3 D), suggesting the presence of multiple binding sites. Fast exchange behavior supported reliable PCA of the chemical shift data in peak lists, which provided binding isotherms (4). TREND extracted the PCs from the spectra (green in Fig. 3 E) and unprocessed FIDs from the titration (green in Fig. S2, A and B). These PCs are compared with the previously reported binding isotherms (purple in Fig. 3 E). The binding populations of (4) are reproduced well by the normalized PC1 and PC2 derived from the spectra despite the t1-noise present (Fig. S3), and less well by PC1 and PC2 obtained from the FIDs. (The residual solvent signal was subtracted on-resonance from the FIDs using the trendmaingui option of a convolution difference window (26). In cases of especially poor solvent suppression, this subtraction might not be enough for reliable PCs.) When choosing the form of NMR data to analyze, application of TREND directly to spectra appears to be the most consistently accurate.
Reconstruction of the spectra of the ANS titration with trendreconstructgui using only PC1 and PC2 introduces artifacts that are ghosts of the peaks from each spectrum of the titration (not shown). The cumulative contribution ratio (reported by trendmaingui) saturates at eight PCs, suggesting eight is sufficient to represent the series of spectra. Using eight PCs in the reconstruction removed the ghosts of peaks and reproduced well the spectra and their biphasic trajectories of peak shifts upon additions of ANS (Fig. S3). The need for eight or more PCs is typical of the need for faithful reconstruction of series of spectra and images. Nonlinearity is typical of such series and spreads their variances across many PCs; see Fig. S7 in (9). This spreading of variances to many PCs could account for the need for many PCs for faithful reconstruction. Inspection of the reconstructed and original spectra finds both fast and fast-intermediate exchange regimes (Fig. S3). Application of PCA directly to the spectra, followed by reconstruction, accommodated this mixture of behaviors, as recently proposed (9).
ICA for confirming components
TREND supports optional use of ICA. If the number or significance of PCs obtained comes into question, ICA can be used to test the significance and validity of the PCs. It is also conceivable that ICA may be able to resolve components from some experiments that are not resolvable by PCA. ICA of peak pick lists from the two-site binding example of Fig. 3 B yields ICs equivalent to PC1 and PC2 (Fig. S4). We tested ICA with various numbers K of trial components with series of spectra containing N true components. When K ≤ N, ICA derives components that are very similar to those from PCA (Fig. S5). However, when K > N, which means trying to extract more “independent components” than true components, ICA always fails in our experience, as evident from components that are jagged and meaningless (Fig. S5, E and F). Consequently, we propose that this failure of ICA can be used to count the meaningful components. The ICA should be repeated with incrementally higher K trial components. The lowest value of K at which ICA fails implies K − 1 significant components (see Fig. S5 for two examples of the iterative process). The drawback of ICA validation of components is the need to run FastICA for N + 1 trial numbers of components (K = 1 to N + 1), preferably with three to five repetitions of each to escape local minima. Though the process is repetitive, it requires no previous knowledge of the number of components. Deciding the PCs that are significant may be quicker by identifying the PCs that contribute the most to scree plots (the convention) and which have large autocorrelation coefficients (smoothness) (7). However, recapitulation of PCs by ICs may engender more confidence in the reproducibility of the analysis.
Cardiac MRI movie resolved into components
Real-time imaging by MRI generates complex movies that are suitable to showcase the capabilities of TREND. An MRI movie of a slice through the four chambers of the heart (15; http://www.biomednmr.mpg.de/images/stories/movies/Media18.ogv) was analyzed by TREND. A movie for each of the first four individual PCs was reconstructed using trendreconstructgui, aiding interpretation of the PCs. PC1 follows the time course of breathing where the trough represents inhalation (Fig. 4 A; Movie S1). Fig. 4 B plots a frame from the PC2 movie (Movie S2) where the left ventricle is relaxed and open, known as diastole. Fig. 4 C plots a frame from the PC2 movie where the left ventricle and heart overall are contracted in systole. The time course of PC2 follows the alternation between the crests representing diastole and narrow troughs representing systole (Fig. 4 A). In the crests of PC2, the phases of rapid filling and subsequent slower filling of the ventricles can be observed. (An overview of the cardiac cycle is provided by Cardiovascular Physiology Concepts, Indianapolis, IN, http://www.cvphysiology.com/Heart%20Disease/HD002b.htm). The troughs of PC3 coincide with the isovolumetric contraction phase that begins systole (Fig. 4 A). The left ventricle and atrium walls and interiors alternate in appearance in the PC3 movie (Movie S3). Bright density between the left ventricle and atrium in the PC3 movie at the troughs in the PC3 time course suggests the closed state of the left atrioventricular (mitral) valve. Coinciding with this is detectable rotation of the right atrium and ventricle. The PC4 movie represents sudden overall rotations of the heart (Movie S4). The time courses indicate synchronization of these rotations (PC4) with both the cardiac cycle (PC2) and each inspiration of a breath (PC1); see Fig. S6 A. The rotations appear largest when a breath in begins and ends. 
These observations illustrate the ability of TREND to resolve and aid interpretation of concurrent processes.
A movie reconstructed from all four of these PCs using trendreconstructgui captures the major morphological changes of the cardiac cycle (Movie S5), but is not as smooth and nuanced as the original (15; http://www.biomednmr.mpg.de/images/stories/movies/Media18.ogv). Trendmaingui reports autocorrelation coefficients exceeding 0.7 for the first 44 PCs, suggesting their information content. Inspection of the scree plot and the cumulative contribution plot generated by trendmaingui indicates that the first four PCs account for ∼69% of the statistical variance across the movie, 10 PCs account for 85%, and 20 account for 93% (Fig. S6 B). Reconstruction of the cardiac MRI movie using the first 10 PCs imparts much increased realism to the depiction of the turbulent blood flow in the cardiac chambers and smoothness to the cardiac movement (Movie S6). Doubling the PCs to the first 20 enhances the fidelity further but more subtly (Movie S7). Omission of PC1 removes the largest background of breathing changes to the chest cavity, while preserving the cardiac cycle portrayal (Movie S8).
In reconstruction of other movies and NMR spectra, we also observed that the fidelity of the reconstruction increases with the number of PCs. Eight or more PCs may often be desirable for satisfactory reconstruction of a measurement series. The scree plot and, secondarily, the autocorrelation coefficients appear useful for anticipating the number of PCs beneficial for reconstruction of the measurement series.
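The use of autocorrelation to anticipate how many PCs are worth keeping can be sketched as below. The lag of the autocorrelation computed by trendmaingui is not stated here, so the lag-1 coefficient in this example is an assumption, as are the function names; the 0.7 threshold is taken from the cardiac example above.

```python
import numpy as np

def lag1_autocorrelation(series):
    """Lag-1 autocorrelation coefficient of a 1D series (assumed lag)."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    if denom == 0.0:
        return 0.0  # a constant series carries no trend
    return float(np.dot(x[:-1], x[1:]) / denom)

def count_informative_pcs(scores, threshold=0.7):
    """Count leading PCs whose time course autocorrelation exceeds threshold."""
    n = 0
    for k in range(scores.shape[1]):
        if lag1_autocorrelation(scores[:, k]) <= threshold:
            break
        n += 1
    return n

# A smooth time course is strongly autocorrelated; white noise is not.
rng = np.random.default_rng(2)
t = np.linspace(0.0, 4.0 * np.pi, 400)
smooth = np.sin(t)
noise = rng.standard_normal(400)
scores = np.column_stack([smooth, noise])

print(f"smooth: {lag1_autocorrelation(smooth):.3f}")
print(f"noise:  {lag1_autocorrelation(noise):.3f}")
print("informative PCs:", count_informative_pcs(scores))
```

A PC time course that varies smoothly from frame to frame scores near 1, whereas a noise-dominated one scores near 0, which is why the coefficient can complement the scree plot.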
Conclusions
Direct application of PCA (or ICA) to 2D measurements using TREND will expand the accessibility of equilibrium and time-evolving processes measured by spectroscopy and imaging. No curation, selection, assignment, or resolution of specific spectral peaks or image features is necessary with this unsupervised statistical approach. TREND can be applied “on-the-fly” on an instrument host computer during data collection to assess whether the process or reaction has progressed far enough. Multiple concurrent processes, measured by biophysical techniques, have been readily resolved into principal or independent components. Movies and spectra can be reconstructed with TREND from the user’s choice of principal components. These capabilities will introduce, to our knowledge, new convenience and insight to analyses of spectrally detected reactions and imaging-detected processes studied by biophysics, physiology, and other disciplines.
Author Contributions
J.X. wrote the code. J.X. and S.R.V.D. designed the approach, performed the research, analyzed the results, and wrote the manuscript.
Acknowledgments
We are grateful to K. Sakurai, T. Konuma, and Y. Goto for spectra of the ANS titration of β-lactoglobulin; J. Frahm and his group for real-time MRI movies; M. D. Stanley for setting up the TREND website; A. G. Roberts, K. Stiers, and reviewers for beta-testing; and Y. Fulcher for discussion of PCA.
The work was supported by NSF grant MCB1409898.
Access to the TREND licensing (free for academics) and software downloads is available at http://biochem.missouri.edu/trend and https://nmrbox.org/.
Editor: Jeff Peng
Footnotes
Supporting Materials and Methods, two tables, six figures, and eight movies are available at http://www.biophysj.org/biophysj/supplemental/S0006-3495(16)34321-1.
Supporting Citations
References 27–58 appear in the Supporting Material.
References
- 1. Williamson M.P. Using chemical shift perturbation to characterise ligand binding. Prog. Nucl. Magn. Reson. Spectrosc. 2013;73:1–16. doi: 10.1016/j.pnmrs.2013.02.001.
- 2. Jolliffe I.T. Springer-Verlag; New York: 2002. Principal Component Analysis.
- 3. Sakurai K., Goto Y. Principal component analysis of the pH-dependent conformational transitions of bovine beta-lactoglobulin monitored by heteronuclear NMR. Proc. Natl. Acad. Sci. USA. 2007;104:15346–15351. doi: 10.1073/pnas.0702112104.
- 4. Konuma T., Lee Y.H., Sakurai K. Principal component analysis of chemical shift perturbation data of a multiple-ligand-binding system for elucidation of respective binding mechanism. Proteins. 2013;81:107–118. doi: 10.1002/prot.24166.
- 5. Majumder S., DeMott C.M., Shekhtman A. Using singular value decomposition to characterize protein-protein interactions by in-cell NMR spectroscopy. ChemBioChem. 2014;15:929–933. doi: 10.1002/cbic.201400030.
- 6. Cembran A., Kim J., Veglia G. NMR mapping of protein conformational landscapes using coordinated behavior of chemical shifts upon ligand binding. Phys. Chem. Chem. Phys. 2014;16:6508–6518. doi: 10.1039/c4cp00110a.
- 7. Arai M., Ferreon J.C., Wright P.E. Quantitative analysis of multisite protein-ligand interactions by NMR: binding of intrinsically disordered p53 transactivation subdomains with the TAZ2 domain of CBP. J. Am. Chem. Soc. 2012;134:3792–3803. doi: 10.1021/ja209936u.
- 8. Cobbert J.D., DeMott C., Shekhtman A. Caught in action: selecting peptide aptamers against intrinsically disordered proteins in live cells. Sci. Rep. 2015;5:9402. doi: 10.1038/srep09402.
- 9. Xu J., Van Doren S.R. Binding isotherms and time courses readily from magnetic resonance. Anal. Chem. 2016;88:8172–8178. doi: 10.1021/acs.analchem.6b01918.
- 10. Shlens, J. “A Tutorial on Independent Component Analysis.” Preprint, submitted April 11, 2014. arXiv:1404.2986.
- 11. Yao F., Coquery J., Lê Cao K.-A. Independent principal component analysis for biologically meaningful dimension reduction of large biological data sets. BMC Bioinformatics. 2012;13:24. doi: 10.1186/1471-2105-13-24.
- 12. Nuzillard D., Bourg S., Nuzillard J. Model-free analysis of mixtures by NMR using blind source separation. J. Magn. Reson. 1998;133:358–363. doi: 10.1006/jmre.1998.1481.
- 13. Ladroue C., Howe F.A., Tate A.R. Independent component analysis for automated decomposition of in vivo magnetic resonance spectra. Magn. Reson. Med. 2003;50:697–703. doi: 10.1002/mrm.10595.
- 14. Monakhova Y.B., Tsikin A.M., Mushtakova S.P. Independent component analysis (ICA) algorithms for improved spectral deconvolution of overlapped signals in 1H NMR analysis: application to foods and related products. Magn. Reson. Chem. 2014;52:231–240. doi: 10.1002/mrc.4059.
- 15. Zhang S., Joseph A.A., Frahm J. Real-time magnetic resonance imaging of cardiac function and flow-recent progress. Quant. Imaging Med. Surg. 2014;4:313–329. doi: 10.3978/j.issn.2223-4292.2014.06.03.
- 16. Helmus J.J., Jaroniec C.P. Nmrglue: an open source Python package for the analysis of multidimensional NMR data. J. Biomol. NMR. 2013;55:355–367. doi: 10.1007/s10858-013-9718-x.
- 17. Goddard T.D., Kneller D.G. University of California, San Francisco; San Francisco: 2000. SPARKY.
- 18. Delaglio F., Grzesiek S., Bax A. NMRPipe: a multidimensional spectral processing system based on UNIX pipes. J. Biomol. NMR. 1995;6:277–293. doi: 10.1007/BF00197809.
- 19. Lee W., Tonelli M., Markley J.L. NMRFAM-SPARKY: enhanced software for biomolecular NMR spectroscopy. Bioinformatics. 2015;31:1325–1327. doi: 10.1093/bioinformatics/btu830.
- 20. Cavanagh J., Fairbrother W.J., Skelton N.J. Preface to the First Edition. In: Cavanagh J., Fairbrother W.J., Palmer A.G., Rance M., Skelton N.J., editors. Protein NMR Spectroscopy. 2nd. Academic Press; Burlington, VT: 2007. pp. vii–x.
- 21. van den Berg R.A., Hoefsloot H.C., van der Werf M.J. Centering, scaling, and transformations: improving the biological information content of metabolomics data. BMC Genomics. 2006;7:142. doi: 10.1186/1471-2164-7-142.
- 22. Selvaratnam R., Chowdhury S., Melacini G. Mapping allostery through the covariance analysis of NMR chemical shifts. Proc. Natl. Acad. Sci. USA. 2011;108:6133–6138. doi: 10.1073/pnas.1017311108.
- 23. Hyvärinen A., Oja E. Independent component analysis: algorithms and applications. Neural Netw. 2000;13:411–430. doi: 10.1016/s0893-6080(00)00026-5.
- 24. Hyvärinen A., Karhunen J., Oja E. Independent Component Analysis. John Wiley; New York: 2002. Practical considerations; pp. 269–271.
- 25. Särelä J., Vigario R. Overlearning in marginal distribution-based ICA: analysis and solutions. J. Mach. Learn. Res. 2003;4:1447–1469.
- 26. Marion D., Ikura M., Bax A. Improved solvent suppression in one-and two-dimensional NMR spectra by convolution of time-domain data. J. Magn. Reson. 1989;84:425–430.
- 27. Gualfetti P.J., Bilsel O., Matthews C.R. The progressive development of structure and stability during the equilibrium folding of the alpha subunit of tryptophan synthase from Escherichia coli. Protein Sci. 1999;8:1623–1635. doi: 10.1110/ps.8.8.1623.
- 28. Rüther A., Pfeifer M., Lüdeke S. Reaction monitoring using mid-infrared laser-based vibrational circular dichroism. Chirality. 2014;26:490–496. doi: 10.1002/chir.22307.
- 29. Kakitani Y., Fujii R., Angerhofer A. Triplet-state conformational changes in 15-cis-spheroidene bound to the reaction center from Rhodobacter sphaeroides 2.4.1 as revealed by time-resolved EPR spectroscopy: strengthened hypothetical mechanism of triplet-energy dissipation. Biochemistry. 2006;45:2053–2062. doi: 10.1021/bi0511538.
- 30. Kim-Shapiro D.B., King S.B., Ballas S.K. Time resolved absorption study of the reaction of hydroxyurea with sickle cell hemoglobin. Biochim. Biophys. Acta. 1998;1380:64–74. doi: 10.1016/s0304-4165(97)00132-3.
- 31. Isin E.M., Guengerich F.P. Multiple sequential steps involved in the binding of inhibitors to cytochrome P450 3A4. J. Biol. Chem. 2007;282:6863–6874. doi: 10.1074/jbc.M610346200.
- 32. Frank G.A., Goomanovsky M., Haran G. Out-of-equilibrium conformational cycling of GroEL under saturating ATP concentrations. Proc. Natl. Acad. Sci. USA. 2010;107:6270–6274. doi: 10.1073/pnas.0910246107.
- 33. Shapiro D.B., Esquerra R.M., Kliger D.S. A study of the mechanisms of slow religation to sickle cell hemoglobin polymers following laser photolysis. J. Mol. Biol. 1996;259:947–956. doi: 10.1006/jmbi.1996.0372.
- 34. Esquerra R.M., Goldbeck R.A., Kliger D.S. Spectroscopic evidence for nanosecond protein relaxation after photodissociation of myoglobin-CO. Biochemistry. 1998;37:17527–17536. doi: 10.1021/bi9814437.
- 35. Hendler R.W., Bose S.K., Shrager R.I. Multiwavelength analysis of the kinetics of reduction of cytochrome aa3 by cytochrome c. Biophys. J. 1993;65:1307–1317. doi: 10.1016/S0006-3495(93)81170-6.
- 36. Chung H.S., Khalil M., Tokmakoff A. Nonlinear infrared spectroscopy of protein conformational change during thermal unfolding. J. Phys. Chem. B. 2004;108:15332–15342.
- 37. Uy D., O’Neill A.E. Principal component analysis of Raman spectra from phosphorus-poisoned automotive exhaust-gas catalysts. J. Raman Spectrosc. 2005;36:988–995.
- 38. Zapata A.L., Kumar M.R., Farmer P.J. A singular value decomposition approach for kinetic analysis of reactions of HNO with myoglobin. J. Inorg. Biochem. 2013;118:171–178. doi: 10.1016/j.jinorgbio.2012.10.005.
- 39. Hendriks J., Hellingwerf K.J. pH dependence of the photoactive yellow protein photocycle recovery reaction reveals a new late photocycle intermediate with a deprotonated chromophore. J. Biol. Chem. 2009;284:5277–5288. doi: 10.1074/jbc.M805904200.
- 40. Martínez J.C., Chequer N.A., Cordova T. Alternative metodology for gold nanoparticles diameter characterization using PCA technique and UV-Vis spectrophotometry. Nanosci. Nanotech. 2012;2:184–189.
- 41. Wasserman S.R., Allen P.G., Edelstein N.M. EXAFS and principal component analysis: a new shell game. J. Synchrotron Radiat. 1999;6:284–286. doi: 10.1107/S0909049599000965.
- 42. Kalinin S.V., Rodriguez B.J., Ye Z.-G. Spatial distribution of relaxation behavior on the surface of a ferroelectric relaxor in the ergodic phase. Appl. Phys. Lett. 2009;95:142902.
- 43. Lichtert S., Verbeeck J. Statistical consequences of applying a PCA noise filter on EELS spectrum images. Ultramicroscopy. 2013;125:35–42. doi: 10.1016/j.ultramic.2012.10.001.
- 44. Seo J., An Y., Choi C. Principal component analysis of dynamic fluorescence images for diagnosis of diabetic vasculopathy. J. Biomed. Opt. 2016;21:46003. doi: 10.1117/1.JBO.21.4.046003.
- 45. Cohen A.E., Moerner W.E. Principal-components analysis of shape fluctuations of single DNA molecules. Proc. Natl. Acad. Sci. USA. 2007;104:12622–12627. doi: 10.1073/pnas.0610396104.
- 46. Hansen L.K., Larsen J., Paulson O.B. Generalizable patterns in neuroimaging: how many principal components? Neuroimage. 1999;9:534–544. doi: 10.1006/nimg.1998.0425.
- 47. Hashimoto A., Yamaguchi Y., Tamiya E. Time-lapse Raman imaging of osteoblast differentiation. Sci. Rep. 2015;5:12529. doi: 10.1038/srep12529.
- 48. Rector D.M., Rogers R.F., George J.S. Scattered-light imaging in vivo tracks fast and slow processes of neurophysiological activation. Neuroimage. 2001;14:977–994. doi: 10.1006/nimg.2001.0897.
- 49. Furnival T., Leary R.K., Midgley P.A. Denoising time-resolved microscopy image sequences with singular value thresholding. Ultramicroscopy. 2016 doi: 10.1016/j.ultramic.2016.05.005. Published online May 10, 2016.
- 50. Kim T.W., Yang C., Ihee H. Combined probes of x-ray scattering and optical spectroscopy reveal how global conformational change is temporally and spatially linked to local structural perturbation in photoactive yellow protein. Phys. Chem. Chem. Phys. 2016;18:8911–8919. doi: 10.1039/c6cp00476h.
- 51. Pérez J., Vachette P., Durand D. Heat-induced unfolding of neocarzinostatin, a small all-beta protein investigated by small-angle x-ray scattering. J. Mol. Biol. 2001;308:721–743. doi: 10.1006/jmbi.2001.4611.
- 52. Malmerberg E., Omran Z., Neutze R. Time-resolved WAXS reveals accelerated conformational changes in iodoretinal-substituted proteorhodopsin. Biophys. J. 2011;101:1345–1353. doi: 10.1016/j.bpj.2011.07.050.
- 53. Haldrup K. Singular value decomposition as a tool for background corrections in time-resolved XFEL scattering data. Philos. Trans. R. Soc. B. 2014;369:20130336. doi: 10.1098/rstb.2013.0336.
- 54. Boetker J.P., Rantanen J., Boyd B.J. Anhydrate to hydrate solid-state transformations of carbamazepine and nitrofurantoin in biorelevant media studied in situ using time-resolved synchrotron x-ray diffraction. Eur. J. Pharm. Biopharm. 2016;100:119–127. doi: 10.1016/j.ejpb.2016.01.004.
- 55. Oka T., Yagi N., Kataoka M. Time-resolved x-ray diffraction reveals multiple conformations in the M-N transition of the bacteriorhodopsin photocycle. Proc. Natl. Acad. Sci. USA. 2000;97:14278–14282. doi: 10.1073/pnas.260504897.
- 56. Macnaughtan D., Rogers L.B., Wernimont G. Principal-component analysis applied to chromatographic data. Anal. Chem. 1972;44:1421–1427.
- 57. Maggio R.M., Cerretani L., Chiavaro E. Application of differential scanning calorimetry-chemometric coupled procedure to the evaluation of thermo-oxidation on extra virgin olive oil. Food Biophys. 2012;7:114–123.
- 58. Idborg H., Edlund P.O., Jacobsson S.P. Multivariate approaches for efficient detection of potential metabolites from liquid chromatography/mass spectrometry data. Rapid Commun. Mass Spectrom. 2004;18:944–954. doi: 10.1002/rcm.1432.