Author manuscript; available in PMC: 2016 May 20.
Published in final edited form as: Cytometry A. 2014 Nov 18;87(1):86–88. doi: 10.1002/cyto.a.22586

ISAC’s Classification Results File Format (CLR)*

Josef Spidlen 1, Chris Bray 2; ISAC Data Standards Task Force, Ryan R Brinkman 1,3
PMCID: PMC4874736  NIHMSID: NIHMS776985  PMID: 25407887

Abstract

Identifying homogeneous sets of cell populations in flow cytometry is an important process for sorting and selecting populations of interest for further data acquisition and analysis. Many computational methods are now available to automate this process, with several algorithms partitioning cells based on high-dimensional separation rather than the traditional pairwise two-dimensional visualization approach of manual gating. ISAC’s Classification Results File Format (CLR) was developed to exchange the results of both manual gating and algorithmic classification approaches in a standardized way based on per-event classifications, including the potential for soft classifications expressed as the probability of an event being a member of a class.

Keywords: flow cytometry, classification, clustering, standard, software interoperability, file format, analysis interchange

Introduction

Traditionally, manual gating has been the core of flow cytometry data analysis and is supported by virtually all flow cytometry analysis software. With manual gating, boundaries are drawn to select populations of interest. A standardized way of exchanging unambiguous descriptions of these boundaries is crucial for interoperability among flow cytometry software. This need is being addressed by efforts to develop the Gating-ML [1] specification, which allows for unambiguous gate definitions based on population boundaries in multidimensional space. Recently, the increased amount of high-throughput and high-content flow cytometry data [2] motivated the development of various automated methods to supplement manual gating [3]. The results of these methods are often per-event classifications assigning events (cells) to a certain class (i.e., a cell type). These assignments can include soft classifications expressed as the probability of an event being a member of a class. Often, there are no unambiguous boundaries that would enclose these events. Consequently, Gating-ML is not suitable to capture these algorithmically classified events.

The Classification Results File Format has been developed by the International Society for Advancement of Cytometry (ISAC) Data Standards Task Force to address the need for a standard means for the exchange of the results of cell population classifications, both manual and automated. This format has been developed to be simple to process by any software application written in any programming language as well as to be editable by humans using common spreadsheet programs.

Materials and Methods

The CLR specification was developed reusing the methodology and best practices from international standardization bodies, such as the World Wide Web Consortium (W3C), the Institute of Electrical and Electronics Engineers (IEEE), and the Internet Engineering Task Force (IETF). This is reflected in the standardized terminology used throughout the specification, the structure of the specification, as well as the process of development and ISAC’s approval of the standard. The CLR file format is based on the widely used comma-separated values (CSV) spreadsheet format [4], which ensures compatibility with common spreadsheet programs and keeps the format simple enough for easy integration into existing tools.

Results

The full specification of the CLR file format is included in the supplemental material and can also be obtained from ISAC. A valid CLR file shall be named with the .csv file name extension and shall follow the requirements of a valid CSV spreadsheet file as specified by RFC 4180 [4]. In the CLR CSV spreadsheet, columns correspond to classes, column headings to class names, and rows to events (e.g., cells). The order of rows in the spreadsheet shall correspond to the order of events in the list mode data file (e.g., FCS file [5]) that has been classified, and there shall be a one-to-one correspondence between rows in the spreadsheet and events in the classified data file. The data in the spreadsheet shall express the probability of the particular event being a member of the particular class.

Line endings

The CSV specification requires that line endings be encoded as a sequence of two characters: CR (Carriage Return, ASCII code 0D hex) and LF (Line Feed, ASCII code 0A hex). Therefore, in order to be fully compliant, line endings in a CLR file should be encoded as a sequence of these two characters. This is the common way of encoding line endings on the MS Windows platform. Unfortunately, Mac OS X and Linux/Unix-like operating systems use only a single LF character to encode line endings in text-based files. Older Apple computers (up to Mac OS version 9) use a single CR character. These platform-specific line endings are also used by some spreadsheet tools when saving CSV files. Consequently, in order to increase interoperability, software applications reading CLR files should be able to handle line endings encoded either as a sequence of CR+LF, or as single LF or CR characters.
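As an illustration of the tolerant reading recommended above, the following Python sketch normalizes all three line-ending conventions before handing the text to a standard CSV parser. The function name `read_clr_rows` is hypothetical and not part of the specification. As a simplifying assumption, a line break inside a quoted class name would also be normalized to LF.

```python
import csv
import io


def read_clr_rows(raw: bytes) -> list[list[str]]:
    """Parse CLR bytes into rows, accepting CR+LF, LF, or CR line endings."""
    text = raw.decode("utf-8")
    # Replace CR+LF first so that it is not turned into two line breaks.
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    return list(csv.reader(io.StringIO(normalized)))


# The same content saved on three platforms yields identical rows.
windows_style = b"A,B\r\n1,0\r\n0,1\r\n"
unix_style = b"A,B\n1,0\n0,1\n"
classic_mac_style = b"A,B\r1,0\r0,1\r"
rows = [read_clr_rows(r) for r in (windows_style, unix_style, classic_mac_style)]
```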

Class names

The names of the classes that events are assigned to shall be stated as column headings in the first row of the CLR file. UTF-8 [6] encoding shall be used if characters outside of the standard ASCII [7] character set are required (i.e., if international characters are part of a class name). Class names shall be unique within a single CLR file.

Line breaks, double quotes, or commas that are part of a class name shall be handled according to the CSV specification. Specifically, class names containing line breaks, double quotes, or commas shall be enclosed in double quotes. If double quotes are used to enclose a class name, then a double quote appearing inside the class name shall be escaped by preceding it with another double quote.
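The quoting rules above match the default behavior of many CSV libraries. As a minimal sketch, assuming Python's standard `csv` module is used to write the file, a header row could be produced as follows. The function name `format_clr_header` and the example class names are hypothetical.

```python
import csv
import io


def format_clr_header(class_names: list[str]) -> str:
    """Return the first row of a CLR file for the given class names."""
    if len(set(class_names)) != len(class_names):
        raise ValueError("class names must be unique within a CLR file")
    buffer = io.StringIO(newline="")
    # The default dialect quotes only fields that need it and doubles
    # any embedded double quote; CR+LF is requested explicitly.
    csv.writer(buffer, lineterminator="\r\n").writerow(class_names)
    return buffer.getvalue()


header = format_clr_header(["T cells", "CD4+, CD8-", 'B "naive"'])
# header is: T cells,"CD4+, CD8-","B ""naive"""  followed by CR+LF
```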

Class assignments

Starting on the second row (i.e., after the headings), the fields (cell values) in the spreadsheet shall express the probability of the particular event being a member of the class stated in the corresponding column heading. Let c_1, c_2, …, c_k be the class names stated in the header of the CLR file and let e_1, e_2, …, e_n be the events as in the original data file (e.g., FCS file) used to perform the classification. Then, the field f_{i+1,j} (i, j ∈ ℕ, i ∈ [1, n], j ∈ [1, k]) in the CLR file (i.e., the field in row i + 1 and column j) shall express the probability that the event e_i is a member of class c_j.
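The indexing convention can be made concrete with a short Python sketch. It assumes the CLR content is already available as text and that the number of events n in the classified data file is known. The function names and the example class names are hypothetical.

```python
import csv
import io


def load_clr_fields(text: str, n_events: int):
    """Split CLR text into class names and one row of raw fields per event."""
    rows = list(csv.reader(io.StringIO(text)))
    class_names, event_rows = rows[0], rows[1:]
    # One-to-one correspondence between rows and events is required.
    if len(event_rows) != n_events:
        raise ValueError("expected one row per event in the data file")
    for row in event_rows:
        if len(row) != len(class_names):
            raise ValueError("each row needs one field per class")
    return class_names, event_rows


def field_for(class_names, event_rows, i: int, j: int):
    """Return (class name of column j, raw field for event i), both 1-based.

    Event i sits in file row i + 1 because the header occupies row 1.
    """
    return class_names[j - 1], event_rows[i - 1][j - 1]


text = "Lymphocytes,Monocytes\r\n1,0\r\n0.2,0.7\r\n0,1\r\n"
names, events = load_clr_fields(text, n_events=3)
```

With this example, `field_for(names, events, 2, 2)` refers to the third line of the file and returns the class name `Monocytes` with the raw field `0.7`.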

For definite class assignments, the value zero (0) shall be used to state that the event is not a member of a specific class according to these classification results. The value shall be ASCII-encoded, i.e., the character ASCII code 30 hex shall be used. The value one (1) encoded as the ASCII code 31 hex shall be used to specify that the event is a member of the class.

For soft classifications, a floating-point number from the interval [0, 1] shall be used to express that an event is a member of a specific class with specified probability. Specifically, the value v, v ∈ ℝ, v ∈ [0, 1], shall be used to state that a specific event is a member of a specific class with the probability of v. Note that the sum of values in any particular row may differ from 1 as the classes are not necessarily mutually exclusive (which can result in a sum greater than 1), and also, classes defined in the CLR file do not necessarily enumerate all possible options (which can lead to a sum less than 1). A floating-point number from the interval [0, 1] may also be used to express fractional class membership.

The value v shall be encoded in ASCII with the point character (ASCII code 2E hex) used as a decimal separator. No other separators shall be used. A leading zero may or may not be used (i.e., the value may start with the decimal separator). There shall be no white space characters in the ASCII representation of the value. The value v may be expressed using the so-called E notation [8]. In this form of scientific notation, values are expressed in the form of aEb, where a ∈ ℝ is any real number, b ∈ ℤ is an integer, and the construct shall be interpreted as a × 10^b. Either the character “E” (ASCII code 45 hex) or the character “e” (ASCII code 65 hex) may be used to separate the coefficient a from the exponent b. Both a and b shall be encoded in ASCII with the point character used as a decimal separator. No other separators or white space characters shall be used. Consequently, only the following ASCII characters may be used to encode v in a CLR file: “-”, “.”, “E”, “e”, “0”, “1”, “2”, “3”, “4”, “5”, “6”, “7”, “8” and “9”.
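One possible way to check a non-empty field against these encoding rules is sketched below in Python. The regular expression is our reading of the rules rather than a grammar taken from the specification. In particular, it assumes that the exponent is an integer without a decimal point and that a plus sign is not permitted because it is absent from the list of allowed characters. The function name `parse_clr_value` is hypothetical.

```python
import re

# Optional minus, digits with an optional point (leading zero optional),
# then an optional exponent introduced by "E" or "e".
_CLR_VALUE = re.compile(r"-?(?:[0-9]+\.?[0-9]*|\.[0-9]+)(?:[Ee]-?[0-9]+)?")


def parse_clr_value(field: str) -> float:
    """Convert one non-empty CLR field to a probability in [0, 1]."""
    if _CLR_VALUE.fullmatch(field) is None:
        raise ValueError(f"not a valid CLR value encoding: {field!r}")
    value = float(field)
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"value outside the interval [0, 1]: {field!r}")
    return value


quarter = parse_clr_value("2.5E-1")   # E notation
no_leading_zero = parse_clr_value(".25")
```

Fields such as `" 0.5"` (white space), `"0,5"` (comma as separator), `"1e+0"` (plus sign), and `"1.5"` (out of range) are rejected by this sketch.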

An empty value shall be used to state that the probability of an event being a member of a specific class is not known. An empty value shall be separated from preceding and following values by a comma. Examples of various CLR files with definite and soft classifications, distinct and overlapping classes, E notation, as well as unknown values are included in the supplemental information of this manuscript.
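For readers implementing a parser, unknown values can be represented by a dedicated marker. The following Python sketch maps empty fields to `None` and all other fields to floating-point numbers. The function name `parse_clr` and the example content are hypothetical, and strict validation of the value encoding is omitted for brevity.

```python
import csv
import io


def parse_clr(text: str):
    """Return (class names, per-event probabilities); None marks unknown."""
    rows = list(csv.reader(io.StringIO(text)))
    class_names = rows[0]
    events = [
        [None if field == "" else float(field) for field in row]
        for row in rows[1:]
    ]
    return class_names, events


text = "A,B,C\r\n1,,0\r\n,0.5,\r\n"
names, events = parse_clr(text)
# First event: member of A, unknown for B, not a member of C.
# Second event: unknown for A and C, probability 0.5 for B.
```

Note that Python's `csv` module returns an empty list for a completely blank line, so a CLR file with a single class and an unknown value would need special handling that this sketch does not include.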

Discussion

A floating-point number may be used to express either membership probability or fractional class membership. However, in flow cytometry (and in other domains), distinguishing between membership probability and fractional membership can become a grey area, or a matter of personal view. For example, with respect to T cells, since CD3 is a “defining” antigen, one could make a good case that we are looking at the probability of membership: a cell either expresses CD3, or it does not, and in that sense, it is either a T cell, or it is not. But CD3 becomes downregulated when the cells are activated and therefore, it may be preferable to ask “how T-ish the cell is”. Cell differentiation and oncogenic transformation are other examples where fractional membership can be meaningful. During differentiation, a cell can start in one class and differentiate into another class. During this period, the cell would be expected to show antigens that are associated with the parent as well as the child classes. Similarly, there are intermediate forms during a malignant transformation. Even cell death could be seen as an example of fractional class membership. While a cell can be completely dead or fresh ex vivo and very much alive, cell death is a process with intermediate events that may be better thought of as how far along the cells are in the dying process rather than the probability that they are dead. Generally, the distinction between membership probability and fractional membership can be made on conceptual grounds pertinent to what is being measured, which may be captured by choosing an appropriate class label in the CLR file.

The CLR format does not distinguish between events with unknown class labels and events with intentionally unassigned class labels. The former may indicate that there is not enough information to classify an event, while the latter typically indicates an outlying event that is being ignored in the classification. We recommend defining a separate “outlier” class in applications where such a distinction is required by simply adding a separate “outlier” column in the CLR file.
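The recommendation above amounts to appending one more column. A minimal Python sketch follows, assuming the header and rows are held as lists of strings and that outlying events are identified by their 1-based position in the data file. The function name `add_outlier_column` is hypothetical, and the values in the other class columns are deliberately left untouched.

```python
def add_outlier_column(header, rows, outlier_events, name="outlier"):
    """Append a definite 0/1 column flagging the given 1-based event numbers."""
    if name in header:
        raise ValueError("class names must be unique within a CLR file")
    new_header = header + [name]
    new_rows = [
        row + ["1" if event_number in outlier_events else "0"]
        for event_number, row in enumerate(rows, start=1)
    ]
    return new_header, new_rows


header = ["Lymphocytes", "Monocytes"]
rows = [["1", "0"], ["", ""], ["0", "1"]]
# Event 2 was intentionally left unassigned as an outlier.
new_header, new_rows = add_outlier_column(header, rows, outlier_events={2})
```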

The CLR file format has been designed as a simple file format that is compatible with common spreadsheet tools, editable by humans using basic text editors (such as Notepad) and easily supported by third party software tools due to the existing support of CSV in virtually every programming language [9]. The format has not been optimized for space or performance purposes (e.g., a quick lookup of all events belonging to a specific class). This is not a problem as CLR is a file interchange format, not the internal representation in software. The use of lossless compression tools, such as ZIP, is recommended in order to reduce the file size for communication and storage purposes. CLR files compress extremely well. For example, a CLR file with a definite classification of 30,000 events into 3 overlapping classes requires approximately 210 kB (7 bytes per event), but it compresses to as little as 3.4 kB using ZIP. ZIP is also automatically used when CLR files are bundled inside the proposed Archival Cytometry Standard (ACS) [10] file format, which is the preferred mechanism of defining a link between FCS data files and CLR results.
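The size estimate follows from the encoding: with three classes and definite assignments, each row consists of three digits, two commas, and the CR+LF line ending, i.e., 7 bytes, so 30,000 events require 210,000 bytes plus the header. The Python sketch below builds such a file in memory and compresses it with the standard `zipfile` module. The function names, the class names, and the archive member name `results.csv` are hypothetical. The sketch uses random assignments, which compress considerably worse than real classification results, so it is not expected to reproduce the 3.4 kB figure.

```python
import io
import random
import zipfile


def build_definite_clr(n_events, class_names, seed=0):
    """Create CLR bytes with random 0/1 assignments (illustration only)."""
    rng = random.Random(seed)
    lines = [",".join(class_names)]
    for _ in range(n_events):
        lines.append(",".join(rng.choice("01") for _ in class_names))
    return ("\r\n".join(lines) + "\r\n").encode("ascii")


def zip_clr(raw, member_name="results.csv"):
    """Return a ZIP archive (as bytes) containing the CLR file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(member_name, raw)
    return buffer.getvalue()


raw = build_definite_clr(30_000, ["A", "B", "C"])
zipped = zip_clr(raw)
print(len(raw), len(zipped))  # uncompressed versus compressed size in bytes
```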

The utility of this file format has already been shown by the use of CLR by participants of FlowCAP [3] to submit results of their automated clustering algorithms. While the format is very simple, there are several aspects that need to be unified and formally specified in order to avoid unnecessary variations that would lead to incompatibility. Early standardization in this respect is critical, especially now that we are experiencing a spike in the development of computational methods to supplement manual gating [11, 12]. Finally, while the CLR format has been developed to address event-based classification in the field of flow cytometry, it is generally applicable in any biological or non-biological domain that needs to capture either soft or unambiguous classifications of any objects.

By publishing this standard, we are aiming to spread awareness of this file format to facilitate its adoption by additional analysis tools. In addition, we wish to avoid unnecessary variations and incompatibility issues as new computational algorithms and tools are being developed in flow cytometry.

Footnotes

* This work was supported by NSERC, NIH/R01EB008400, the International Society for the Advancement of Cytometry and the Wallace H. Coulter Foundation.

References

  • 1. Spidlen J, Leif RC, Moore W, Roederer M, Brinkman RR; International Society for the Advancement of Cytometry Data Standards Task Force. Gating-ML: XML-based gating descriptions in flow cytometry. Cytometry Part A. 2008;73(12):1151–1157. doi: 10.1002/cyto.a.20637.
  • 2. Chattopadhyay PK, Gierahn TM, Roederer M, Love JC. Single-cell technologies for monitoring immune systems. Nature Immunology. 2014;15(2):128–135. doi: 10.1038/ni.2796.
  • 3. Aghaeepour N, Finak G, Consortium TF, Dougall D, Khodabakhshi AH, Mah P, Obermoser G, Spidlen J, Taylor I, Wuensch SA, Bramson J, Eaves C, Weng AP, Iii ES, Ho K, Kollmann T, Rogers W, Rosa SD, Dalal B, Azad A, Pothen A, Brandes A, Bretschneider H, Bruggner R, Finck R, Jia R, Zimmerman N, Linderman M, Dill D, Nolan G, Chan C, Khettabi FE, O’Neill K, Chikina M, Ge Y, Sealfon S, Sugar I, Gupta A, Shooshtari P, Zare H, Jager PLD, Jiang M, Keilwagen J, Maisog JM, Luta G, Barbo AA, Majek P, Vilcek J, Manninen T, Huttunen H, Ruusuvuori P, Nykter M, McLachlan GJ, Wang K, Naim I, Sharma G, Nikolic R, Pyne S, Qian Y, Qiu P, Quinn J, Roth A, Consortium TD, Meyer P, Stolovitzky G, Saez-Rodriguez J, Norel R, Bhattacharjee M, Biehl M, Bucher P, Bunte K, Camillo BD, Sambo F, Sanavia T, Trifoglio E, Toffolo G, Dimitrieva S, Dreos R, Ambrosini G, Grau J, Grosse I, Posch S, Guex N, Keilwagen J, Kursa M, Rudnicki W, Liu B, Maienschein-Cline M, Manninen T, Huttunen H, Ruusuvuori P, Nykter M, Schneider P, Seifert M, Strickert M, Vilar JM, Hoos H, Mosmann TR, Brinkman R, Gottardo R, Scheuermann RH. Critical assessment of automated flow cytometry data analysis techniques. Nature Methods. 2013;10(3):228–238. doi: 10.1038/nmeth.2365.
  • 4. Shafranovich Y. Internet Engineering Task Force (IETF). RFC 4180: Common Format and MIME Type for Comma-Separated Values (CSV) Files. 2005 Oct. URL: http://tools.ietf.org/html/rfc4180.
  • 5. Spidlen J, Moore W, Parks D, Goldberg M, Bray C, Bierre P, Gorombey P, Hyun B, Hubbard M, Lange S, Lefebvre R, Leif R, Novo D, Ostruszka L, Treister A, Wood J, Murphy RF, Roederer M, Sudar D, Zigon R, Brinkman RR. Data File Standard for Flow Cytometry, version FCS 3.1. Cytometry Part A. 2010;77(1):97–100. doi: 10.1002/cyto.a.20825.
  • 6. Yergeau F. Internet Engineering Task Force (IETF). RFC 3629: UTF-8, a transformation format of ISO 10646. 2003 Nov. URL: http://tools.ietf.org/html/rfc3629. [Accessed June 18, 2014]
  • 7. Cerf V. Internet Engineering Task Force (IETF). RFC 20: ASCII format for Network Interchange. 1969 Oct. URL: http://tools.ietf.org/html/rfc20. [Accessed June 18, 2014]
  • 8. Wikipedia. Scientific notation. URL: http://en.wikipedia.org/wiki/Scientific_notation. [Accessed June 18, 2014]
  • 9. Wikipedia. CSV application support. URL: http://en.wikipedia.org/wiki/CSV_application_support. [Accessed June 18, 2014]
  • 10. International Society for Advancement of Cytometry Data Standards Task Force. Archival Cytometry Standard (ACS). 2010. URL: http://flowcyt.sf.net/acs/ACS.v1.0.101013.pdf. [Accessed June 10, 2014]
  • 11. Robinson JP, Rajwa B, Patsekin V, Davisson VJ. Computational analysis of high-throughput flow cytometry data. Expert Opinion on Drug Discovery. 2012;7(8):679–693. doi: 10.1517/17460441.2012.693475.
  • 12. O’Neill K, Aghaeepour N, Spidlen J, Brinkman RR. Flow Cytometry Bioinformatics. PLoS Comput Biol. 2013;9(12):e1003365. doi: 10.1371/journal.pcbi.1003365.
