Author manuscript; available in PMC: 2020 Jan 29.
Published in final edited form as: J Open Source Softw. 2019 May 31;4(37):1279. doi: 10.21105/joss.01279

tsfeaturex: An R Package for Automating Time Series Feature Extraction

Nelson A Roque 1, Nilam Ram 2
PMCID: PMC6988501  NIHMSID: NIHMS1066998  PMID: 31998860

Statement of Need

In today’s digital world, data collection and storage costs are quite low. Humans collectively generate 2.5 quintillion bytes of data every day; by 2020, each person will generate ~1.7 MB every second (Cloud 2017). At this scale, intensive longitudinal data about human behavior facilitate new discoveries about the patterning of thought and action, and potentially better prediction and optimization of health and well-being. In raw form, however, these data are noisy time-series that are difficult to interpret. Extraction of features from the time-series allows:

  1. Researchers to reduce the dimensionality of their time-series data (e.g., reducing millions of time-stamped observations to a summary feature vector of length 100);

  2. Summary characterizations of time-series data that may be used as predictors, correlates, or outcomes in the study of between-person differences; and

  3. Improved and detailed description of human behavior streams (e.g., characterizing a behavioral time series in terms of its features; the mean is ‘X’, the range is ‘Y’, the peaks are at ‘T12’ and ‘T30’).
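As a minimal illustration of the third point, the sketch below reduces a short series to a mean, a range, and a list of peak positions. It is written in plain Python rather than with the tsfeaturex API, and the function name `describe_series` and the simple "greater than both neighbors" peak rule are assumptions made for this example only.

```python
from statistics import mean

def describe_series(values):
    """Summarize a numeric series as a small dictionary of features."""
    # Assumed peak rule: a point strictly greater than both of its neighbors.
    peaks = [
        i for i in range(1, len(values) - 1)
        if values[i - 1] < values[i] > values[i + 1]
    ]
    return {
        "mean": mean(values),
        "range": max(values) - min(values),
        "peak_positions": peaks,  # zero-based positions in the series
    }

series = [1, 3, 2, 5, 4, 4, 6, 2]
features = describe_series(series)
print(features)
```

Eight observations are reduced here to three features; the same idea scales to much longer series and larger feature sets.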

Short data streams are easily summarized using basic features (e.g., mean, standard deviation, IQR). However, as the time-series get longer, numerous other features may be needed and/or can be accessed. The study of intraindividual variability has outlined the wide variety of time-series features that can be used to characterize between-person differences and within-person change, with features such as probability of acute change (PAC) or mean square of successive differences (MSSD) providing useful information about individuals’ cognitive, emotional, and behavioral dynamics (for more information on intraindividual variability metrics, see Jahng, Wood, and Trull 2008).
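The two variability features named above can be sketched in a few lines of plain Python. MSSD is the mean of the squared differences between consecutive observations. For PAC, this sketch assumes an acute change is any successive change whose absolute size meets a user-chosen threshold; both the use of absolute change and the threshold value are assumptions of the example, not the definition used by tsfeaturex.

```python
def successive_differences(values):
    """Differences between each observation and the one before it."""
    return [b - a for a, b in zip(values, values[1:])]

def mssd(values):
    """Mean square of successive differences."""
    diffs = successive_differences(values)
    return sum(d * d for d in diffs) / len(diffs)

def pac(values, threshold):
    """Proportion of successive changes with absolute size >= threshold.

    The threshold and the use of absolute change are assumptions of this sketch.
    """
    diffs = successive_differences(values)
    return sum(1 for d in diffs if abs(d) >= threshold) / len(diffs)

steady = [4, 5, 4, 5, 4, 5]   # small, regular fluctuations
spiky = [5, 5, 5, 9, 5, 5]    # mostly flat with one abrupt excursion

print(mssd(steady), pac(steady, threshold=3))
print(mssd(spiky), pac(spiky, threshold=3))
```

The steady series has a low MSSD and no acute changes, while the single excursion in the second series raises both measures.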

Summary

Functionality

tsfeaturex (Roque 2019) is an R package for automating time series feature extraction, inspired by and modeled after the Python package tsfresh (Christ et al. 2018; blue-yonder 2016b). The R language (R Core Team 2019) offers an easy-to-use interface with the underlying processing-speed advantage of C (and the flexibility to run on the web with the help of the shiny package; Chang et al. 2019). The API for tsfeaturex is a wrapper around the widely used dplyr package (Wickham et al. 2019), mainly to draw on the flexibility of its grammar of data manipulation and its shortcuts for non-standard evaluation. The API was designed to facilitate the extraction of features from any dataset in long format, including grouping of summaries by other factors. For example, if every person in your dataset has 1 observation each day for 8 days, and they do this in two bursts spaced 6 months apart, you can calculate features of the overall series (16 observations across both bursts) or separately for each burst (8 observations each). Some features are integrated from other packages: e1071 (Meyer et al. 2019), Hmisc (Harrell Jr et al. 2019), forecast (Hyndman and Khandakar 2008), zoo (Zeileis and Grothendieck 2005), viridis (Garnier 2018), psych (Revelle 2018), entropy (Hausser and Strimmer 2014), and Langevin (Friedrich et al. 2011; Rinn et al. 2016).
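The burst example can be made concrete with a small sketch of grouped summaries over long-format data. It uses plain Python rather than dplyr or the tsfeaturex API; the column names (`id`, `burst`, `day`, `value`), the helper `grouped_mean`, and the generated values are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical long-format data: one row per person, burst, and day.
rows = [
    {"id": pid, "burst": burst, "day": day, "value": float(offset + burst + day)}
    for pid, offset in [("p1", 0), ("p2", 10)]
    for burst in (1, 2)
    for day in range(1, 9)  # 8 daily observations per burst
]

def grouped_mean(rows, keys):
    """Mean of `value` within each combination of the grouping keys."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row["value"])
    return {group: mean(values) for group, values in groups.items()}

overall = grouped_mean(rows, ["id"])             # 16 observations per person
per_burst = grouped_mean(rows, ["id", "burst"])  # 8 observations per group

print(overall)
print(per_burst)
```

Changing only the list of grouping keys switches between the overall series and the per-burst series, which is the flexibility that grouping on long-format data provides.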

By design, tsfeaturex is able to cope with missing data (in R, values coded as NA), a key deviation from tsfresh (blue-yonder 2016a). In addition to feature extraction, the package also calculates correlations among the extracted features.
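One common way to tolerate missing observations is to drop them before computing each feature, sketched below in plain Python. How tsfeaturex treats NA values internally is not described here, so the drop-then-compute strategy and the helper names are assumptions of this example.

```python
import math
from statistics import mean

def drop_missing(values):
    """Keep only observed values; None and NaN are treated as missing."""
    return [v for v in values if v is not None and not math.isnan(v)]

def mean_ignoring_missing(values):
    """Mean of the observed values, or None if nothing was observed."""
    observed = drop_missing(values)
    return mean(observed) if observed else None

series = [2.0, None, 4.0, float("nan"), 6.0]
n_missing = len(series) - len(drop_missing(series))

print(mean_ignoring_missing(series), n_missing)
```

Reporting the number of missing observations alongside each feature helps a reader judge how much data a summary actually rests on.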

tsfeaturex is capable of outputting both long and wide data structures – both of use for different purposes (e.g., long format preferred for plotting in ggplot2) and analyses (e.g., wide format preferred for repeated measures ANOVA in most statistical software).
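The relationship between the two output shapes can be sketched as a reshape from long (one row per person and feature) to wide (one row per person, one column per feature). The field names `id`, `feature`, and `value`, the helper `long_to_wide`, and the numbers are hypothetical and do not reflect the actual column names produced by tsfeaturex.

```python
# Hypothetical long-format feature output: one row per person and feature.
long_rows = [
    {"id": "p1", "feature": "mean", "value": 5.0},
    {"id": "p1", "feature": "sd", "value": 1.2},
    {"id": "p2", "feature": "mean", "value": 5.0},
    {"id": "p2", "feature": "sd", "value": 3.4},
]

def long_to_wide(rows):
    """Collect each person's features into a single row."""
    wide = {}
    for row in rows:
        person = wide.setdefault(row["id"], {"id": row["id"]})
        person[row["feature"]] = row["value"]
    return list(wide.values())

wide_rows = long_to_wide(long_rows)
print(wide_rows)
```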

Purpose & Audience

tsfeaturex is intended for use by researchers with time-series data, and will be of most interest to those developing their statistical and coding skills, allowing them to extract many features from their time-series data with easy-to-use code and without the need for a high-level mathematics background. The desire for feature extraction tools is widespread across domains of data science, including, but not limited to, applications in biological systems, finance, and psychology.

Feature Roadmap

The current expectation is that, over time, tsfeaturex will allow for two levels of feature extraction from almost any data form (e.g., text, audio, images): (1) extracting time-series descriptive features from numerical data (already implemented); (2) extracting numerical features from nonnumerical data (e.g., number of exclamation points in Twitter data; coming soon).
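The second level, which is described as not yet implemented, amounts to mapping each nonnumerical observation to a number so that the resulting series can be summarized like any other. A minimal sketch of the exclamation-point example follows; the function name and the sample messages are invented for illustration.

```python
def count_exclamations(text):
    """Number of exclamation points in a piece of text."""
    return text.count("!")

# Invented messages, ordered in time.
messages = ["Great news!!", "See you at 5.", "Wow! Really?!"]

# One number per message: a numeric series derived from text.
counts = [count_exclamations(m) for m in messages]
print(counts)
```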

Mentions of Ongoing Projects

tsfeaturex is currently being used in analysis of experience sampling and multi-trial performance data in a variety of projects at the interface of data science and psychological science, including:

Figure 1 depicts example wide (top) and long (bottom) data structures for a dataset containing two (2) measurements from two (2) individuals. Notice that there is one row for each individual in the wide format, and two (2) rows for each individual in the long format, one for each column.

Figure 2 depicts sample time series data from two participants, both with a mean value of 5. Although the two series have identical means, their shapes and the locations of their peaks differ. tsfeaturex calculates features to better characterize differences such as these.
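The point of the second figure can be reproduced numerically. The two series below are invented for illustration (they are not the data plotted in the figure): both have a mean of 5, yet their spread and the position of their maximum differ.

```python
from statistics import mean, pstdev

# Invented series, both with mean 5.
participant_a = [4, 5, 6, 5, 4, 6]
participant_b = [1, 9, 5, 2, 10, 3]

for name, series in [("A", participant_a), ("B", participant_b)]:
    summary = {
        "mean": mean(series),
        "sd": round(pstdev(series), 2),          # population standard deviation
        "max_at": series.index(max(series)),     # zero-based position of the first maximum
    }
    print(name, summary)
```

A mean alone would treat these two participants as identical; adding even two more features separates them.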

Acknowledgements

Nelson A. Roque was supported by National Institute on Aging Grant T32 AG049676 to The Pennsylvania State University.

We thank GitHub user @blue-yonder, and other contributors, for creating tsfresh (https://github.com/blue-yonder/tsfresh) and inspiring tsfeaturex. We would also like to thank GitHub user @stas-g for code on finding peaks (stas-g 2017), and Dr. Nilam Ram for code on calculating probability of acute change (PAC).

We gratefully acknowledge contributions from Dr. Nilam Ram, Dr. Anthony Ong, Dr. Martin Sliwinski, and the Sliwinski lab throughout the early development process.

References

  1. blue-yonder. 2016a. “Allow Nan or None Values to Be Passed in, and Silently Ignored.” GitHub Repository. GitHub. https://github.com/blue-yonder/tsfresh/issues/90
  2. ——. 2016b. “Tsfresh.” GitHub Repository. GitHub. https://github.com/blue-yonder/tsfresh
  3. Chang Winston, Cheng Joe, Allaire JJ, Xie Yihui, and McPherson Jonathan. 2019. Shiny: Web Application Framework for R. https://CRAN.R-project.org/package=shiny
  4. Christ Maximilian, Braun Nils, Neuffer Julius, and Kempa-Liehr Andreas W. 2018. “Time Series Feature Extraction on Basis of Scalable Hypothesis Tests (Tsfresh: A Python Package).” Neurocomputing 307: 72–77. doi: 10.1016/j.neucom.2018.03.067
  5. Cloud, IBM Marketing. 2017. “10 Key Marketing Trends for 2017 and Ideas for Exceeding Customer Expectations.” IBM. https://www.ibm.com/downloads/cas/XKBEABLN
  6. Friedrich Rudolf, Peinke Joachim, Sahimi Muhammad, and Tabar Mohammed Reza Rahimi. 2011. “Approaching Complexity by Stochastic Methods: From Biological Systems to Turbulence.” Physics Reports 506 (5). Elsevier BV: 87–162. doi: 10.1016/j.physrep.2011.05.003
  7. Garnier Simon. 2018. Viridis: Default Color Maps from ‘Matplotlib’. https://CRAN.R-project.org/package=viridis
  8. Harrell Jr, Frank E, with contributions from Charles Dupont, and many others. 2019. Hmisc: Harrell Miscellaneous. https://CRAN.R-project.org/package=Hmisc
  9. Hausser Jean, and Strimmer Korbinian. 2014. Entropy: Estimation of Entropy, Mutual Information and Related Quantities. https://CRAN.R-project.org/package=entropy
  10. Hyndman Rob J, and Yeasmin Khandakar. 2008. “Automatic Time Series Forecasting: The Forecast Package for R.” Journal of Statistical Software 26 (3): 1–22. http://www.jstatsoft.org/article/view/v027i03
  11. Jahng Seungmin, Wood Phillip K., and Trull Timothy J. 2008. “Analysis of Affective Instability in Ecological Momentary Assessment: Indices Using Successive Difference and Group Comparison via Multilevel Modeling.” Psychological Methods 13 (4): 354–75. doi: 10.1037/a0014173
  12. Meyer David, Dimitriadou Evgenia, Hornik Kurt, Weingessel Andreas, and Leisch Friedrich. 2019. E1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), Tu Wien. https://CRAN.R-project.org/package=e1071
  13. R Core Team. 2019. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/
  14. Revelle William. 2018. Psych: Procedures for Psychological, Psychometric, and Personality Research. Evanston, Illinois: Northwestern University. https://CRAN.R-project.org/package=psych
  15. Rinn Philip, Lind Pedro G., Wächter Matthias, and Peinke Joachim. 2016. “The Langevin Approach: An R Package for Modeling Markov Processes.” Journal of Open Research Software 4 (1). Ubiquity Press: e34. doi: 10.5334/jors.123
  16. Roque Nelson. 2019. “Tsfeaturex.” Zenodo. doi: 10.5281/zenodo.2574990
  17. stas-g. 2017. “FindPeaks.” GitHub Repository. GitHub. https://github.com/stas-g/findPeaks
  18. Wickham Hadley, Romain François, Lionel Henry, and Kirill Müller. 2019. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr
  19. Zeileis Achim, and Grothendieck Gabor. 2005. “Zoo: S3 Infrastructure for Regular and Irregular Time Series.” Journal of Statistical Software 14 (6): 1–27. doi: 10.18637/jss.v014.i06
