PLOS One. 2021 Oct 28;16(10):e0259201. doi: 10.1371/journal.pone.0259201

Pathway-targeting gene matrix for Drosophila gene set enrichment analysis

Jack Cheng 1,2,#, Lee-Fen Hsu 3,4,5,#, Ying-Hsu Juan 1,#, Hsin-Ping Liu 6,*, Wei-Yong Lin 1,2,7,*
Editor: Katherine James 8
PMCID: PMC8553153  PMID: 34710184

Abstract

Gene Set Enrichment Analysis (GSEA) is a powerful algorithm for identifying pathways that differ between groups based on expression profiling. However, for the fruit fly, a popular animal model, gene matrixes for GSEA are unavailable. This study provides pathway-targeting gene matrixes for the fruit fly based on the Reactome and KEGG databases. An expression profile of fruit fly neurons and glia was used to validate the feasibility of the generated gene matrixes. We validated the gene matrixes and identified characteristic neuronal and glial pathways, including mRNA splicing and endocytosis. In conclusion, we generated and validated the feasibility of Reactome and KEGG gene matrix files, which may benefit future profiling studies using Drosophila.

Introduction

Gene Set Enrichment Analysis (GSEA) is an algorithm [1] that determines whether a previously defined set of genes shows significant differences between two groups of biological samples. Since its publication in 2005, GSEA has been widely applied in profiling studies and has received more than 20,000 citations. In contrast to conventional fold-change (FC) ranking methods, GSEA does not require a manually defined cutoff, e.g., FC > 2, but determines whether members of a gene set tend to occupy the top (or bottom) of the FC list. Therefore, GSEA may reveal information that conventional FC ranking methods neglect because of cutoff bias.
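The idea of testing whether a gene set clusters at the top or bottom of a ranked list can be illustrated with a running-sum statistic. The following is a minimal sketch only, not the GSEA software itself: it omits the permutation-based significance testing and score normalization of the published method, and the function name `enrichment_score` and its parameters are hypothetical.

```python
def enrichment_score(ranked_genes, gene_set, weights=None):
    """Return the maximum deviation from zero of a running-sum statistic.

    ranked_genes: gene IDs ordered from most up- to most down-regulated.
    gene_set:     collection of gene IDs forming one gene set.
    weights:      optional dict mapping gene ID -> ranking metric; when
                  omitted, every hit contributes equally (unweighted case).
    """
    members = set(gene_set)
    hits = [gene in members for gene in ranked_genes]
    n_miss = len(ranked_genes) - sum(hits)

    def hit_weight(gene):
        return abs(weights[gene]) if weights else 1.0

    total_hit_weight = sum(hit_weight(g) for g, h in zip(ranked_genes, hits) if h)
    if total_hit_weight == 0 or n_miss == 0:
        raise ValueError("need at least one weighted hit and one miss")

    running, best = 0.0, 0.0
    for gene, is_hit in zip(ranked_genes, hits):
        if is_hit:
            running += hit_weight(gene) / total_hit_weight  # step up at a set member
        else:
            running -= 1.0 / n_miss                         # step down otherwise
        if abs(running) > abs(best):
            best = running
    return best
```

A set whose members all sit at the top of the list scores close to +1, and a set concentrated at the bottom scores close to -1; no FC cutoff is involved at any point.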

To apply GSEA, a Gene Matrix file (e.g., *.gmt) or one of its alternatives, which describes the composition of the gene sets, is required in addition to the profiling data. Although 28,705 Homo sapiens gene sets are available in the Molecular Signatures Database (MSigDB) (http://www.gsea-msigdb.org/gsea/msigdb/collections.jsp), gene sets for other organisms are largely limited (http://ge-lab.org/gskb/) and mostly generated from Gene Ontology data.

Gene Ontology (GO) is a structured, precisely defined, controlled vocabulary for describing the roles of genes with three independent ontologies, i.e., biological process, molecular function, and cellular component [2]. Briefly, biological process indicates the biological objective of the gene/protein; molecular function describes its biochemical activity, while cellular component refers to its subcellular distribution. Although GO allows an easy and quick understanding of the roles of a gene, “it describes only what is done without specifying where or when the event actually occurs,” as stated in the original GO paper [2]. This knowledge gap is exactly what KEGG [3] and Reactome [4] try to fill. Both databases describe the order of the reactions in which a gene/protein takes part and its reaction partners. On the other hand, their dependence on published, curated scientific literature largely limits the application of KEGG/Reactome to genes of unknown function, whereas GO may cover such genes through similarity-based prediction.

Thus, the choice of gene sets in GSEA largely depends on the purpose of the study. For example, two previous GSEA papers [5, 6] may be improved by adopting KEGG/Reactome gene sets to provide more details of the affected pathways, while for another study [7], GO gene sets are well suited to its goal of predicting the functions of unknown genes.

As an experimental animal model, the fruit fly (Drosophila melanogaster) is a powerful tool to decipher genetic mechanisms in both behavior and human disease [8], owing to the convenience of gene manipulation [9]. Furthermore, Reactome and KEGG are both manually curated and peer-reviewed pathway databases. Therefore, the objective of this study is to provide pathway-targeting gene matrix files based on Reactome and KEGG for fruit fly profiling studies using GSEA. After generating the gene matrix files, the expression data of two distinct fruit fly cell phenotypes, i.e., neurons and glia, are run in the GSEA software to identify enriched gene sets typical for the respective cell types, which would support the validity of the gene matrix files.

Method

Generation of Reactome and KEGG gene matrix files

The curated pathway-gene information was retrieved from the Reactome and KEGG websites. The gene matrix files (S1 and S2 Files) were generated according to the GSEA data formats, i.e., a tab-delimited file, and each row represents a gene set.

Specifically, the Drosophila-specific genes of KEGG pathways were downloaded from “KEGG Pathway Maps—Drosophila melanogaster (fruit fly)” at https://www.genome.jp/brite/query=00190&htext=br08901.keg&option=-a&node_proc=br08901_org&proc_enabled=dme&panel=collapse. Clicking the “triangle symbol” of each parent category expands its sub-categories. Clicking the number preceding a sub-category, e.g., 00010 for Glycolysis / Gluconeogenesis, opens the map of that pathway (https://www.genome.jp/kegg-bin/show_pathway?dme00010). Clicking the title of the pathway map in the upper left corner then opens the detail page of the pathway (https://www.genome.jp/entry/dme00010). At the upper right corner of that page, an “all links” box contains the “KEGG GENES” list. The process was repeated for every Drosophila KEGG pathway.

The Drosophila-specific genes of Reactome pathways were downloaded from the Reactome Pathway Browser (https://reactome.org/PathwayBrowser/#/R-DME-XXXXXXX, where XXXXXXX denotes the seven digits of a specific pathway, e.g., 9612973 for autophagy). The page has three panels. The left panel shows the hierarchy of Drosophila pathways in Reactome; in the lower right panel, clicking the “Molecules” tab and then the “protein” link displays the gene/protein list. The process was repeated for every Drosophila Reactome pathway in the left panel.
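Because the pathway pages follow fixed URL patterns, the addresses to visit can be generated rather than typed. The sketch below only builds the URLs quoted in the two paragraphs above; it does not download or parse anything, and the helper names `kegg_urls` and `reactome_url` are hypothetical.

```python
KEGG_MAP_URL = "https://www.genome.jp/kegg-bin/show_pathway?dme{number}"
KEGG_ENTRY_URL = "https://www.genome.jp/entry/dme{number}"
REACTOME_URL = "https://reactome.org/PathwayBrowser/#/R-DME-{number}"


def kegg_urls(number):
    """Return (map URL, detail-page URL) for a five-digit KEGG pathway number."""
    number = str(number).zfill(5)  # e.g. 10 -> "00010"
    if not (number.isdigit() and len(number) == 5):
        raise ValueError(f"not a five-digit KEGG pathway number: {number!r}")
    return KEGG_MAP_URL.format(number=number), KEGG_ENTRY_URL.format(number=number)


def reactome_url(number):
    """Return the Pathway Browser URL for a seven-digit Reactome pathway number."""
    number = str(number)
    if not (number.isdigit() and len(number) == 7):
        raise ValueError(f"not a seven-digit Reactome pathway number: {number!r}")
    return REACTOME_URL.format(number=number)
```

For example, `kegg_urls("00010")` gives the Glycolysis / Gluconeogenesis map and detail pages, and `reactome_url(9612973)` gives the autophagy page.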

A gmt file is a tab-separated plain text file in which each row describes one gene set. In each row, the first column contains the name of the gene set, and the second column contains additional details, e.g., the KEGG ID of the pathway (gene set). The gene set members, i.e., FlyBase IDs in this study, are listed from the third column onward, one gene per column. Thus, once the gene lists of the pathways are available, the gmt file can be generated by placing the pathway elements into the corresponding cells with any plain text editor or Microsoft Excel. After saving the file as tab-separated plain text, change the filename extension from *.txt to *.gmt in the file browser.
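As an alternative to a text editor or spreadsheet, the same row layout can be produced with a few lines of code. This is a minimal sketch assuming the gene lists are already in memory; the function names are hypothetical and the gene IDs in the usage comment are placeholders, not real pathway members.

```python
def format_gmt_line(name, description, genes):
    """Build one gmt row: set name, description, then one gene per column."""
    fields = [name, description, *genes]
    for field in fields:
        # A tab or newline inside a field would shift or break the columns.
        if "\t" in field or "\n" in field:
            raise ValueError(f"field contains a tab or newline: {field!r}")
    return "\t".join(fields)


def gmt_text(gene_sets):
    """gene_sets: iterable of (name, description, genes) tuples."""
    return "".join(format_gmt_line(*gene_set) + "\n" for gene_set in gene_sets)


# Usage (placeholder IDs):
# with open("drosophila_kegg.gmt", "w", encoding="utf-8") as handle:
#     handle.write(gmt_text([
#         ("Glycolysis / Gluconeogenesis", "dme00010", ["CG0001", "CG0002"]),
#     ]))
```

Writing the file directly with the .gmt extension removes the need to rename it afterwards.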

Collection of validation data

The expression profile of GFP-positive cells sorted from Drosophila brains carrying the Repo or Elav driver (accession ID GSE45344), provided by DeSalvo MK & Bainton RJ [10], was downloaded from the NCBI GEO database. Elav encodes an RNA-binding protein that regulates mRNA processing and is expressed exclusively in neurons, and Repo encodes a transcription factor specifically expressed in glia. With the GAL4-UAS reporter system, Repo-GAL4 drives the expression of UAS-GFP specifically in glia, while Elav-GAL4 drives UAS-GFP exclusively in neurons. Fluorescence-activated cell sorting (FACS) is a technique to separate cells as they flow past stimulating lasers [11]. The downloaded file was saved as a tab-delimited txt file (S3 File) as the input to the GSEA 4.1.0 software. Notably, the probe IDs were converted to FlyBase annotation IDs (CG_ID) according to the conversion table (S4 File). The annotation of the microarray probe IDs is available from the “SOFT formatted family file” at https://ftp.ncbi.nlm.nih.gov/geo/series/GSE45nnn/GSE45344/soft/. Specifically, from line 110 of the file, the 1st column is the microarray probe ID, and the 3rd column is the FlyBase annotation ID, i.e., CG_ID. However, the entries must be cleaned to extract the correct FlyBase ID, e.g., for “CG16844-RA”, the “-RA” must be trimmed to get the correct ID “CG16844”. If the dataset of interest does not provide the probe ID annotation, the gene ID conversion tool at https://david.abcc.ncifcrf.gov/conversion.jsp or the website of the gene chip manufacturer may help. Probes that did not correspond to a FlyBase symbol (e.g., those from a cDNA library, pseudogene, or transposon) were omitted. The intensity values were log2-based, and they were exponentially transformed before enrichment analysis. When redundant intensities were present for an identical CG_ID, i.e., multiple probes mapped to the same CG_ID, only the record with the strongest (highest) average intensity was used for the validation.
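The preprocessing steps described above (trimming the transcript suffix, dropping probes without a CG_ID, reversing the log2 transform, and keeping one record per CG_ID) might be sketched as follows. Several details are assumptions rather than the authors' procedure: the suffix pattern `-R` followed by capital letters, the choice to compare averages after the exponential transform, and all function and variable names.

```python
import re

# Assumed transcript-suffix pattern, e.g. "-RA" or "-RB".
_TRANSCRIPT_SUFFIX = re.compile(r"-R[A-Z]+$")


def clean_cg_id(annotation):
    """Trim a transcript suffix; return None when no CG_ID remains."""
    trimmed = _TRANSCRIPT_SUFFIX.sub("", annotation.strip())
    return trimmed if re.fullmatch(r"CG\d+", trimmed) else None


def collapse_probes(probe_to_annotation, log2_intensities):
    """Return {CG_ID: linear intensities}, keeping the strongest probe per CG_ID.

    probe_to_annotation: dict probe ID -> raw annotation string.
    log2_intensities:    dict probe ID -> list of log2 values, one per sample.
    """
    best = {}
    for probe, values in log2_intensities.items():
        cg_id = clean_cg_id(probe_to_annotation.get(probe, ""))
        if cg_id is None:
            continue  # probe without a usable FlyBase annotation ID
        linear = [2.0 ** value for value in values]  # undo the log2 transform
        mean = sum(linear) / len(linear)
        if cg_id not in best or mean > best[cg_id][0]:
            best[cg_id] = (mean, linear)
    return {cg_id: linear for cg_id, (_, linear) in best.items()}
```

Ranking probes by their average log2 value instead of the linear average can occasionally select a different probe, so the order of the two steps is worth fixing explicitly in any reimplementation.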

Validation of generated gene matrix files

When running the GSEA 4.1.0 software according to the GSEA user guide, S3 File was assigned as the expression dataset, while S1 and S2 Files were assigned as the Gene set database. Notably, the “Collapse/Remap to gene symbols” parameter must be set to “No_collapse”, so that the identifiers in the dataset are used as is, in their original format. Phenotype labels were entered manually by first selecting the “Create an on-the-fly phenotype” pull-down, then typing in the sample IDs of the two groups, i.e., Elav_1 and Repo_1, in “Samples for class A” and “Samples for class B”, respectively.

Results

Characteristics of the generated Reactome and KEGG gene matrix files

The Drosophila Reactome pathways are presently classified into nine categories: circadian clock pathway, Hedgehog pathway, Hippo/Warts pathway, Imd pathway, insulin receptor-mediated signaling, JAK/STAT pathway, planar cell polarity pathway, Toll pathway, and Wingless pathway. There are currently 1450 pathways annotated and supplied with genetic information. The generated Drosophila Reactome gene matrix file is provided as S1 File.

In contrast, the Drosophila KEGG pathways are presently classified into five categories, including genetic information processing, environmental information processing, cellular processes, organismal systems, and metabolism. Although there are 137 pathways annotated, only 131 of them are currently supplied with genetic information. The generated Drosophila KEGG gene matrix file is provided as S2 File.

Validation of Reactome gene matrix file

A profiling dataset comparing ELAV-GFP (representing neurons) and REPO-GFP (representing glia) sorted cells from the Drosophila brain was used to test the feasibility of the generated gene matrix files. With the default Gene set size filters (min = 15, max = 500), 835 out of the 1450 gene sets were filtered out, and the remaining 615 gene sets were used in the analysis. All 13615 genes of the profiling dataset were included.
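The effect of the size filter can be reproduced outside the GSEA software. The sketch below assumes that set size is counted after restricting each gene set to the genes present in the expression dataset; the function name and arguments are hypothetical.

```python
def filter_gene_sets(gene_sets, dataset_genes, min_size=15, max_size=500):
    """Keep gene sets whose overlap with the dataset lies within the size limits.

    gene_sets:     dict mapping set name -> iterable of gene IDs.
    dataset_genes: iterable of gene IDs present in the expression dataset.
    """
    dataset = set(dataset_genes)
    kept = {}
    for name, genes in gene_sets.items():
        present = set(genes) & dataset  # members actually measured in the dataset
        if min_size <= len(present) <= max_size:
            kept[name] = sorted(present)
    return kept
```

Under this assumption, a pathway can be dropped even when its curated gene list has 15 or more members, if too few of them appear in the profiling data.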

The Reactome gene matrix file worked well: 5616 (41.2%) of the genes were associated with phenotype Class A (ELAV group) with a correlation area of 40.7%, while 7999 (58.8%) of the genes were associated with Class B (REPO group) with a correlation area of 59.3%. The heat map of the top 50 genes (Fig 1A) and the ranked gene list correlation profile (Fig 1B) are shown. The global enrichment scores across gene sets (Fig 1C) show a typical 2-peak separation for ELAV and REPO groups.

Fig 1. Global characteristics of the ELAV vs REPO profiling analyzed with the Reactome gmt.

Fig 1

A) Heat Map of the top 50 features for each phenotype. B) Ranked Gene List Correlation Profile. C) Global enrichment scores across gene sets (ES histogram).

For the Class A (ELAV group), under the criterion of false discovery rate (FDR) below 25%, 214 or 148 gene sets were upregulated significantly at a nominal p-value of 5% or 1%, respectively. Representative enrichment plots (Fig 2) highlighted mRNA splicing, transport of mature transcript to cytoplasm, synthesis of PIPs at the plasma membrane, and RAC1 GTPase cycle. For the Class B (REPO group), under the criterion of false discovery rate (FDR) below 25%, 68 or 52 gene sets were upregulated significantly at a nominal p-value of 5% or 1%, respectively. Representative enrichment plots (Fig 3) highlighted peptide hormone metabolism, mitochondrial fatty acid beta-oxidation, assembly of active LPL & LIPC lipase complexes, and activation of matrix metalloproteinase. The detailed Reactome enrichment results for Class A & B are provided as S5 and S6 Files, respectively.
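Counts such as those above can be recomputed from an enrichment report by applying the two thresholds to each row. The sketch below assumes the report has been read into dictionaries (for example with `csv.DictReader` and a tab delimiter) and that the columns are named `NAME`, `NOM p-val`, and `FDR q-val`; these names and the function name are assumptions to check against the actual files.

```python
def significant_sets(rows, fdr_cutoff=0.25, nominal_p_cutoff=0.05):
    """Return names of gene sets below both the FDR and nominal p-value cutoffs.

    rows: iterable of dicts with "NAME", "NOM p-val" and "FDR q-val" entries.
    """
    return [
        row["NAME"]
        for row in rows
        if float(row["FDR q-val"]) < fdr_cutoff
        and float(row["NOM p-val"]) < nominal_p_cutoff
    ]


# Illustrative rows only; these values are not taken from the study.
example_rows = [
    {"NAME": "SET_A", "NOM p-val": "0.000", "FDR q-val": "0.010"},
    {"NAME": "SET_B", "NOM p-val": "0.030", "FDR q-val": "0.200"},
    {"NAME": "SET_C", "NOM p-val": "0.040", "FDR q-val": "0.300"},
]
```

Calling `significant_sets(example_rows)` applies the 5% nominal threshold, and passing `nominal_p_cutoff=0.01` applies the stricter 1% threshold to the same rows.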

Fig 2. Representative Reactome enrichment plots of the ELAV group.

Fig 2

For each plot, the upper part shows the enrichment score, while the lower part shows the ranked list metric of the gene set. The middle part shows the ranked gene list, with red indicating upregulation, blue indicating downregulation, and black vertical lines marking the genes of the set.

Fig 3. Representative Reactome enrichment plots of the REPO group.

Fig 3

For each plot, the upper part shows the enrichment score, while the lower part shows the ranked list metric of the gene set. The middle part shows the ranked gene list, with red indicating upregulation, blue indicating downregulation, and black vertical lines marking the genes of the set.

Validation of KEGG gene matrix file

The same dataset was also used to test the feasibility of the generated KEGG gene matrix file, which also worked well. With the default Gene set size filters, 33 out of the 131 gene sets were filtered out, and the remaining 98 gene sets were used in the analysis. The global enrichment scores across gene sets (S1 Fig) also show a typical 2-peak separation for ELAV and REPO groups. For the Class A (ELAV group), under the criterion of false discovery rate (FDR) below 25%, 27 or 20 gene sets were upregulated significantly at a nominal p-value of 5% or 1%, respectively. Representative enrichment plots (Fig 4) highlighted spliceosome, endocytosis, WNT signaling, and mTOR signaling pathway. For the Class B (REPO group), under the criterion of false discovery rate (FDR) below 25%, 31 or 19 gene sets were upregulated significantly at a nominal p-value of 5% or 1%, respectively. Representative enrichment plots (Fig 5) highlighted fatty acid degradation, glutathione metabolism, ABC transporters, and glycine, serine, & threonine metabolism. The detailed KEGG enrichment results for Class A & B are provided as S7 and S8 Files, respectively.

Fig 4. Representative KEGG enrichment plots of the ELAV group.

Fig 4

For each plot, the upper part shows the enrichment score, while the lower part shows the ranked list metric of the gene set. The middle part shows the ranked gene list, with red indicating upregulation, blue indicating downregulation, and black vertical lines marking the genes of the set.

Fig 5. Representative KEGG enrichment plots of the REPO group.

Fig 5

For each plot, the upper part shows the enrichment score, while the lower part shows the ranked list metric of the gene set. The middle part shows the ranked gene list, with red indicating upregulation, blue indicating downregulation, and black vertical lines marking the genes of the set.

Discussion

This study has generated gene matrix transposed files for Drosophila gene set enrichment analysis based on the curated pathway databases Reactome and KEGG, containing 1450 and 131 gene sets, respectively. We validated their feasibility with GSEA 4.1.0 software by running analyses on a publicly available profiling dataset of brain neurons and glia from Drosophila. Both gene matrixes separated the two groups well and highlighted the typical pathways.

Neurons are electrically excitable cells that communicate with other cells through synapses by releasing or receiving neurotransmitters and by dynamically modulating the density and composition of membrane receptors. In mature neurons, mRNA splicing controls ion channels, the exocytosis apparatus, and neurotransmitter recycling [12]. Protein synthesis at the synapse requires transport of mature transcripts [13], while exocytosis requires RAC1 GTPase [14], and trafficking of synaptic vesicle membranes during the exocytic-endocytic cycle requires phosphoinositides [15]. These features are highlighted by the representative enrichment plots of neurons (Fig 2). In addition to confirming the spliceosome and endocytosis pathways, the enrichment plots (Fig 4) also highlight WNT signaling, which regulates presynaptic assembly and neurotransmitter release [16], and mTOR signaling, which participates in dendritic spine and synapse formation [17].

Glia support the neurons in the central nervous system by forming myelin to protect and insulate neurons, by supplying nutrients and oxygen to neurons, and by destroying pathogens and removing neuronal debris [18]. The glia-neuron lactate shuttle fuels neural activity and reduces the glucose-stimulated ROS burden of neurons [19]. Meanwhile, neural lactate consumption fuels lipid production, and the lipids are shuttled back to glia through ABC transporters [19]. The representative enrichment plots of glia (Figs 3 and 5) highlight glutathione metabolism, which protects neurons from oxidative stress [20], as well as ABC transporters, fatty acid degradation, and lipase complexes. Moreover, glial cells also serve as a reservoir of neurotransmitters and active ligands. Among them, D-serine functions as a coagonist of NMDA receptors and controls synaptic memory [21]. Furthermore, glia induce the activity of matrix metalloproteinases 3 & 9 [22], which are involved in proteolysis in response to neural debris. These features are also highlighted in Fig 5.

Besides the typical pathways, the complete lists of enriched gene sets (S5–S8 Files) are resources to discover potential novel pathways in neurons or glia. For example, the gene set of “SRP-dependent cotranslational protein targeting to membrane” was enriched in glia (S6 File), but its role in glial function has not been elucidated yet.

This study may benefit future profiling studies of Drosophila, including GeneChip microarray and NGS studies, by enabling pathway-targeting gene set enrichment analysis. The generated Reactome and KEGG gene matrix files are compatible with GSEA 4.1.0 software; their compatibility with older versions has not been tested. Moreover, before using the gene matrix files, users have to convert the gene symbols or probe IDs in their profiling dataset into FlyBase symbols (CG_ID). FlyBase ID Converter (https://www.biotools.fr/drosophila/fbgn_converter) or the Gene ID Conversion Tool of NIH [23] (https://david.ncifcrf.gov/conversion.jsp) may help with batch ID conversion.

Conclusion

To facilitate pathway-targeting gene set enrichment analysis for Drosophila, we generated and validated the feasibility of Reactome and KEGG gene matrix files, which may benefit future profiling studies using Drosophila. The gene sets are available in the supplementary files or at https://github.com/JackCheng-TW/GeneMatrix. Furthermore, we have clarified the exact download specifications, ID conversion, and gmt file generation procedure in the Method section so that any researcher can generate their own gene matrix files according to the latest KEGG and Reactome databases.

Supporting information

S1 File. Drosophila Reactome.

(GMT)

S2 File. Drosophila KEGG.

(GMT)

S3 File. Processed profiling of GSE45344.

(TSV)

S4 File. Probe ID to Flybase symbols (CG_ID) conversion table.

(TSV)

S5 File. Detailed Reactome enrichment results for Class A (ELAV).

(TSV)

S6 File. Detailed Reactome enrichment results for Class B (REPO).

(TSV)

S7 File. Detailed KEGG enrichment results for Class A (ELAV).

(TSV)

S8 File. Detailed KEGG enrichment results for Class B (REPO).

(TSV)

S1 Fig. The global enrichment scores across KEGG gene sets.

(PDF)

Data Availability

All relevant data are within the manuscript and its Supporting Information files.

Funding Statement

This work was supported by grants from the Ministry of Science and Technology in Taiwan (MOST108-2320-B-039-031-MY3 to HPL, MOST 109-2314-B-039-030 and MOST 110-2314-B-039-009 to WYL), from Chang Gung Memorial Hospital (CMRPF6H009 and CMRPF6L0011 to LFH), and from China Medical University & Hospital (CMU109-MF-85, CMU108-MF-68, and DMR-109-150 to WYL, CMU108-MF-61 to HPL). The funders had no role in this study. MOST Taiwan: www.most.gov.tw; Chang Gung Memorial Hospital: www.cgmh.org.tw; China Medical University: www.cmu.edu.tw; China Medical University Hospital: www.cmuh.cmu.edu.tw.

References

  • 1. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences. 2005;102(43):15545–50. doi: 10.1073/pnas.0506580102
  • 2. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al. Gene ontology: tool for the unification of biology. Nature genetics. 2000;25(1):25–9. doi: 10.1038/75556
  • 3. Kanehisa M, Furumichi M, Sato Y, Ishiguro-Watanabe M, Tanabe M. KEGG: integrating viruses and cellular organisms. Nucleic Acids Research. 2021;49(D1):D545–D51. doi: 10.1093/nar/gkaa970
  • 4. Jassal B, Matthews L, Viteri G, Gong C, Lorente P, Fabregat A, et al. The reactome pathway knowledgebase. Nucleic acids research. 2020;48(D1):D498–D503. doi: 10.1093/nar/gkz1031
  • 5. Martin M, Hiroyasu A, Guzman RM, Roberts SA, Goodman AG. Analysis of Drosophila STING reveals an evolutionarily conserved antimicrobial function. Cell reports. 2018;23(12):3537–50. e6. doi: 10.1016/j.celrep.2018.05.029
  • 6. Palu RA, Ong E, Stevens K, Chung S, Owings KG, Goodman AG, et al. Natural genetic variation screen in Drosophila identifies Wnt signaling, mitochondrial metabolism, and redox homeostasis genes as modifiers of apoptosis. G3: Genes, Genomes, Genetics. 2019;9(12):3995–4005.
  • 7. Costello JC, Dalkilic MM, Beason SM, Gehlhausen JR, Patwardhan R, Middha S, et al. Gene networks in Drosophila melanogaster: integrating experimental data to predict gene function. Genome biology. 2009;10(9):1–29. doi: 10.1186/gb-2009-10-9-r97
  • 8. Pandey UB, Nichols CD. Human disease models in Drosophila melanogaster and the role of the fly in therapeutic drug discovery. Pharmacological reviews. 2011;63(2):411–36. doi: 10.1124/pr.110.003293
  • 9. Venken KJ, Bellen HJ. Emerging technologies for gene manipulation in Drosophila melanogaster. Nature Reviews Genetics. 2005;6(3):167–78. doi: 10.1038/nrg1553
  • 10. DeSalvo MK, Hindle SJ, Rusan ZM, Orng S, Eddison M, Halliwill K, et al. The Drosophila surface glia transcriptome: evolutionary conserved blood-brain barrier processes. Frontiers in neuroscience. 2014;8:346. doi: 10.3389/fnins.2014.00346
  • 11. Bonner W, Hulett H, Sweet R, Herzenberg L. Fluorescence activated cell sorting. Review of Scientific Instruments. 1972;43(3):404–9. doi: 10.1063/1.1685647
  • 12. Li Q, Lee J-A, Black DL. Neuronal regulation of alternative pre-mRNA splicing. Nature Reviews Neuroscience. 2007;8(11):819–31. doi: 10.1038/nrn2237
  • 13. Schuman EM. mRNA trafficking and local protein synthesis at the synapse. Neuron. 1999;23(4):645–8. doi: 10.1016/s0896-6273(01)80023-4
  • 14. Harada A, Furuta B, Takeuchi K-i, Itakura M, Takahashi M, Umeda M. Nadrin, a novel neuron-specific GTPase-activating protein involved in regulated exocytosis. Journal of Biological Chemistry. 2000;275(47):36885–91. doi: 10.1074/jbc.M004069200
  • 15. Cremona O, De Camilli P. Phosphoinositides in membrane traffic at the synapse. Journal of cell science. 2001;114(6):1041–52.
  • 16. Ahmad-Annuar A, Ciani L, Simeonidis I, Herreros J, Fredj NB, Rosso SB, et al. Signaling across the synapse: a role for Wnt and Dishevelled in presynaptic assembly and neurotransmitter release. The Journal of cell biology. 2006;174(1):127–39. doi: 10.1083/jcb.200511054
  • 17. Lee C-C, Huang C-C, Hsu K-S. Insulin promotes dendritic spine and synapse formation by the PI3K/Akt/mTOR and Rac1 signaling pathways. Neuropharmacology. 2011;61(4):867–79. doi: 10.1016/j.neuropharm.2011.06.003
  • 18. Perea G, Sur M, Araque A. Neuron-glia networks: integral gear of brain function. Frontiers in cellular neuroscience. 2014;8:378. doi: 10.3389/fncel.2014.00378
  • 19. Liu L, MacKenzie KR, Putluri N, Maletić-Savatić M, Bellen HJ. The glia-neuron lactate shuttle and elevated ROS promote lipid synthesis in neurons and lipid droplet accumulation in glia via APOE/D. Cell Metab. 2017;26(5):719–37. e6. doi: 10.1016/j.cmet.2017.08.024
  • 20. Shih AY, Johnson DA, Wong G, Kraft AD, Jiang L, Erb H, et al. Coordinate regulation of glutathione biosynthesis and release by Nrf2-expressing glia potently protects neurons from oxidative stress. Journal of Neuroscience. 2003;23(8):3394–406. doi: 10.1523/JNEUROSCI.23-08-03394.2003
  • 21. Panatier A, Theodosis DT, Mothet J-P, Touquet B, Pollegioni L, Poulain DA, et al. Glia-derived D-serine controls NMDA receptor activity and synaptic memory. Cell. 2006;125(4):775–84. doi: 10.1016/j.cell.2006.02.051
  • 22. Gottschall PE, Deb S. Regulation of matrix metalloproteinase expression in astrocytes, microglia and neurons. Neuroimmunomodulation. 1996;3:69–75. doi: 10.1159/000097229
  • 23. Jiao X, Sherman BT, Huang DW, Stephens R, Baseler MW, Lane HC, et al. DAVID-WS: a stateful web service to facilitate gene/protein list analysis. Bioinformatics. 2012;28(13):1805–6. doi: 10.1093/bioinformatics/bts251

Decision Letter 0

Katherine James

13 Aug 2021

PONE-D-21-18414

Pathway-targeting Gene Matrix for Drosophila Gene Set Enrichment Analysis

PLOS ONE

Dear Dr. Lin,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

You will see that while the reviewers are persuaded of the importance of your GSEA datasets and validation study, there are several points highlighted that require clarification before acceptance. In particular, PLOS ONE requires methods to be described in sufficient detail for another researcher to reproduce the experiments described, so further details of the gene set generation, as suggested by reviewer 2, are needed. I also agree that submission of these gene sets to MSigDB would increase their visibility and reuse. Reviewer 1 has also highlighted some relevant studies of Gene Ontology GSEA in Drosophila which will add to the contextualisation of your work.

Please submit your revised manuscript by Sep 27 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Katherine James, Ph.D.

Academic Editor

PLOS ONE

Journal requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Thank you for stating the following in the Acknowledgments Section of your manuscript:

“This work was supported by grants from the Ministry of Science and Technology in Taiwan (MOST107-2314-B-039-042-MY2, MOST106-2314-B-039-009-, MOST108-2320-B-039-031- MY3, MOST 109-2314-B-039-030), form Chang Gung Memorial Hospital (CMRPF6H009), and from China Medical University & Hospital (CMU109-MF-85, CMU108-MF-68, CMU108-MF-61, CMU107-S-08, DMR-109-150, DMR-106-119). The funders had no role in this study.”

Funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form.

Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows:

 “This This work was supported by grants from the Ministry of Science and Technology in Taiwan (MOST107-2314-B-039-042-MY2 to WYL, MOST106-2314-B-039-009 to WYL, MOST108-2320-B-039-031-MY3 to HPL, MOST 109-2314-B-039-030 to WYL), form Chang Gung Memorial Hospital (CMRPF6H009 to LFH), and from China Medical University & Hospital (CMU109-MF-85 to WYL, CMU108-MF-68 to WYL, CMU108-MF61 to HPL, CMU107-S-08 to WYL, DMR-109-150 to WYL, DMR-106-119 to WYL). The funders had no role in this study.”

Please include your amended statements within your cover letter; we will change the online submission form on your behalf.

3. Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Summary

In this manuscript, the authors present two alternative ways to characterize expression in Drosophila through GSEA. They develop matrices that draw on the KEGG and REACTOME pathways as opposed to Gene Ontology datasets. They then validate their matrices using datasets from Drosophila brains consisting of either neuronal or glial cells sorted by GFP+ status. Their results suggest that their matrices can effectively identify pathways that are enriched in Drosophila cells based on expression.

Major Points

• The matrices generated are specific to KEGG and REACTOME, and do seem to be novel in their use of these databases for GSEA in Drosophila melanogaster. The abstract and introduction suggest that the innovation is the use of GSEA in melanogaster in general. There are in fact matrices that have been used in melanogaster to assign GO categories. These are just a few publications that have taken advantage of GSEA for GO terms in Drosophila. It would be helpful to cite these sources and then explain what is different in these matrices (the use of different databases) and what they will provide that is different from previous GO databases in GSEA. Below are a few papers that have used GO terms in GSEA in Drosophila melanogaster.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6114933/

https://www.g3journal.org/content/9/12/3995#ref-64

https://genomebiology.biomedcentral.com/articles/10.1186/gb-2009-10-9-r97

• Are there any REACTOME or KEGG pathways/gene sets that would have been expected to be enriched that were not? If so, what are they, and is there an explanation for these “false negatives?”

Minor Points

• “Method” is the only section with a numeral designation.

• Citation is needed in the results section for the datasets mined for validation of the matrices

Reviewer #2: GENERAL PERCEPTION

I am in general favor of the paper. The idea is obvious, as Drosophila is one of the most widely used model organisms and the availability of associated gene sets for gene set enrichment analysis is essential. The approach is as sound as it is simple and backed by the findings, which are well discussed. Another plus is that all data used in the study are provided in the supplement.

INTRODUCTION / METHODS

The paper is very brief, which I generally favor; however, I find that some basic explanations are missing and should be added. When terms such as “GFP positive cells”, “Repo”, “Elav” and “sorted” (FACS) are first mentioned, they should be briefly explained, e.g. that elav is a gene encoding an RNA binding / splicing protein and exclusively expressed in neurons, and that repo encodes a transcription factor specifically expressed in glia, etc. Only later in Results there is a very brief part of a sentence about this: “ELAV-GFP (representing neurons) and REPO-GFP (representing glia) sorting cells from the Drosophila brain”. The introduction concludes with a statement about the objective of the study; however, it should be elaborated more on the exact approach, which is that, after generating the gene matrix files, the expression data of two distinct fruit fly cell phenotypes, i.e. neurons and glia, are run in the GSEA software to identify enriched gene sets typical for the respective cell types, which would then support the validity of the gene matrix files.

A minor discrepancy in the introduction is that, when querying the MSigDB site, there are not 32,284 gene sets of Homo sapiens, as the authors claim, but only 28,705, while the other gene sets are of four other model organisms.

According to the authors, the objective of the study is to provide two gene matrix files, which they do. However, the exact details of how these were generated are not given, as the authors merely state "The curated pathway-gene information was retrieved from the Reactome and KEGG websites. The gene matrix files [...] were generated according to the GSEA data formats". In order to reproduce these gene matrix files, a link or exact download specifications (filters, options, etc.) should be provided and information given about what gene identifiers/symbols the downloaded data came in and how these were converted to the FlyBase annotation IDs, i.e. the CG numbers. Apparently, one needs to subscribe to KEGG to download their data: https://www.kegg.jp/kegg/download/ (?). As for Reactome Pathways it is also unclear how exactly the Drosophila specific gene sets were retrieved (whereas the pathway gene sets for Homo sapiens (HSA) are easily found at https://reactome.org/download/current/ReactomePathways.gmt.zip). It should also be mentioned how the conversion table in Supplementary File 4 was obtained or generated to convert the Microarray probe IDs of the validation data to FlyBase annotation IDs. Speaking of which, the meaning of “strangest average intensity” is a bit unclear in the sentence “Only the record with the strangest average intensity was used for the validation, in the case of redundant intensities presented for an identical CG_ID.”. Does this mean that, in a many-to-one mapping from probe IDs to CG numbers, the most extreme value was chosen in case of multiple probe mappings to the same CG number?

The steps of how the validation data and gene matrix files are run in the GSEA software are well explained.

RESULTS

The authors report the features of the generated gene matrix files and the results of the GSEA analyses correctly, except that they go by the nominal p-value when reporting significantly enriched gene sets, whereby the GSEA documentation states that due to this value not being adjusted for gene set size and multiple hypothesis testing, it is of limited value and that the FDR (false discovery rate) should be used as well (with below 25% as significant).

The highlighted representative gene sets typical for the respective cell types appear to be coherent. A minor seemingly deviating observation is that, according to the KEGG gene matrix, “ABC transporters” are (correctly) significantly enriched in the glial profile, while according to the REACTOME gene matrix, “ABC transporter in lipid homeostasis” is, though up-regulated, not significantly so. But since these two gene sets differ in genes, this might not be a meaningful comparison.

DISCUSSION

In the Discussion part, the biological implications of the highlighted enriched gene sets on neurons and glia are well discussed and backed by literature. The perspective of using the GSEA results from expression profiles of known phenotypes to explore novel pathways is a very valid point.

CONCLUSION

In the Conclusion part, it should be mentioned how/where the two gene matrix files can be accessed. Will the fruit fly researcher always have them available in the supplement of the paper or will they be submitted to the MSigDB site? On the MSigDB website, under Browse Gene Sets, there is already a category C2 for curated gene sets, specifically CP for canonical pathway, which already includes KEGG and REACTOME gene sets, just not specifically for Drosophila:

http://www.gsea-msigdb.org/gsea/msigdb/genesets.jsp?collection=CP:REACTOME

http://www.gsea-msigdb.org/gsea/msigdb/genesets.jsp?collection=CP:KEGG

Gene sets can be submitted at genesets@broadinstitute.org

As mentioned earlier, if the exact download specifications and ID conversion and .gmt file generation procedure were provided, the researcher could generate their own gene matrix files according to the authors’ procedure. This would ensure one could always obtain the latest data from the KEGG and REACTOME databases.

LANGUAGE

The text is written in well articulated language with merely a few minor wording issues, which possibly need some rephrasing; for instance, “wildly applied” probably meant “widely applied”, or “The intensity was transformed as 2 to the power of the log2-base value.” (the “log2-base value” is 2; easier: "intensity values were log2-transformed” or “log-transformed at base 2."), or “[…] citations of more than 20 thousand.” (i.e. “more than 20,000 citations”). There are very few minor misspellings, for example, “Collaspe”, a few plural/singular issues, e.g. “The […] scores […] shows […]”, and a few extra or missing “the” articles and unnecessary capitalizations, but nothing major. On a last note, it probably suffices to say "we validated the gene matrix files" and not "we validated the feasibility of the gene matrix files".

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Gerrit Bostelmann

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2021 Oct 28;16(10):e0259201. doi: 10.1371/journal.pone.0259201.r002

Author response to Decision Letter 0


15 Sep 2021

Editor’s comments

You will see that while the reviewers are persuaded of the importance of your GSEA datasets and validation study, there are several points highlighted that require clarification before acceptance. In particular, PLOS ONE requires methods to be described in sufficient detail for another researcher to reproduce the experiments described, so further details of the gene set generation, as suggested by reviewer 2, are needed.

A: The methods are now described in sufficient detail for another researcher to reproduce the results.

I also agree that submission of these gene sets to MSigDB would increase their visibility and reuse.

A: We have contacted MSigDB to include the gene sets. However, MSigDB currently only supports Human, Mouse, and Rat gene sets, although previously MSigDB has allowed some limited deposition of gene sets from other species. Instead, MSigDB recommended making the gene sets available somewhere publicly accessible, like through a GitHub page.

We have made the gene sets available on https://github.com/JackCheng-TW/GeneMatrix.

The following sentence is added to the conclusion: “The gene sets are available in the supplementary files or on https://github.com/JackCheng-TW/GeneMatrix.”

Reviewer 1 has also highlighted some relevant studies of Gene Ontology GSEA in Drosophila which will add to the contextualisation of your work.

A: The advantages and disadvantages of GO GSEA compared with KEGG/Reactome GSEA are addressed, especially in the context of the suggested relevant studies.

Reviewers' comments:

Reviewer #1: Summary

In this manuscript, the authors present two alternative ways to characterize expression in Drosophila through GSEA. They develop matrices that draw on the KEGG and REACTOME pathways as opposed to Gene Ontology datasets. They then validate their matrices using datasets from Drosophila brains consisting of either neuronal or glial cells sorted by GFP+ status. Their results suggest that their matrices can effectively identify pathways that are enriched in Drosophila cells based on expression.

Major Points

• The matrices generated are specific to KEGG and REACTOME, and do seem to be novel in their use of these databases for GSEA in Drosophila melanogaster. The abstract and introduction suggest that the innovation is the use of GSEA in melanogaster in general. There are in fact matrices that have been used in melanogaster to assign GO categories. These are just a few publications that have taken advantage of GSEA for GO terms in Drosophila. It would be helpful to cite these sources and then explain what is different in these matrices (the use of different databases) and what they will provide that is different from previous GO databases in GSEA. Below are a few papers that have used GO terms in GSEA in Drosophila melanogaster.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6114933/

https://www.g3journal.org/content/9/12/3995#ref-64

https://genomebiology.biomedcentral.com/articles/10.1186/gb-2009-10-9-r97

A: Gene Ontology is a structured, precisely defined, controlled vocabulary for describing the roles of genes with three independent ontologies, i.e., biological process, molecular function, and cellular component (Ashburner, Michael, et al. "Gene ontology: tool for the unification of biology." Nature genetics 25.1 (2000): 25-29.). Briefly, biological process indicates the biological objective of the gene/protein, molecular function describes its biochemical activity, and cellular component refers to its subcellular distribution. Although GO allows an easy and quick understanding of the roles of a gene, “it describes only what is done without specifying where or when the event actually occurs”, as stated in the original GO paper (Ashburner et al., 2000). This knowledge gap is exactly what KEGG and REACTOME try to fill: both databases provide the sequence of reactions and the interaction partners of the gene/protein. Conversely, the dependence on published/curated scientific literature largely limits the application of KEGG/REACTOME to genes of unknown function, whereas GO may cover such genes through similarity-based prediction.

Thus, the choice of gene sets in GSEA largely depends on the purpose of the study. For example, the first two GSEA papers (PMC6114933 and 3995#ref-64) may be improved by adopting KEGG/REACTOME gene sets to provide more details of the affected pathways, while for the third paper (gb-2009-10-9-r97), GO gene sets are well suited to its goal of predicting the functions of unknown genes.

• Are there any REACTOME or KEGG pathways/gene sets that would have been expected to be enriched that were not? If so, what are they, and is there an explanation for these "false negatives?"

A: There are indeed crucial pathways that remain unannotated in both databases, such as Draper-dependent glial phagocytic activity, Draper-mediated JNK signaling, and glial phagocytosis. These are representative pathways characterizing glial activity. Another problem is that critical proteins are missing from annotated pathways; for example, serine racemase (CG8129) is not annotated in the serine biosynthesis pathway (R-DME-977347). Pathway databases are not perfect but still represent the state-of-the-art knowledge of the research community. These "false negatives" represent knowledge gaps for us to explore.

Minor Points

• "Method" is the only section with a numeral designation.

A: Thanks for this. The numeral designation of “Method” is removed.

• Citation is needed in the results section for the datasets mined for validation of the matrices

A: The citation is added: DeSalvo, Michael K., et al. "The Drosophila surface glia transcriptome: evolutionary conserved blood-brain barrier processes." Frontiers in neuroscience 8 (2014): 346.

Reviewer #2: GENERAL PERCEPTION

I am in general favor of the paper. The idea is obvious, as Drosophila is one of the most widely used model organisms and the availability of associated gene sets for gene set enrichment analysis is essential. The approach is as sound as it is simple and backed by the findings, which are well discussed. Another plus is that all data used in the study are provided in the supplement.

INTRODUCTION / METHODS

The paper is very brief, which I generally favor; however, I find that some basic explanations are missing and should be added. When terms such as "GFP positive cells", "Repo", "Elav" and "sorted" (FACS) are first mentioned, they should be briefly explained, e.g. that elav is a gene encoding an RNA binding / splicing protein and exclusively expressed in neurons, and that repo encodes a transcription factor specifically expressed in glia, etc. Only later in Results there is a very brief part of a sentence about this: "ELAV-GFP (representing neurons) and REPO-GFP (representing glia) sorting cells from the Drosophila brain".

A: These sentences are added. Elav is a gene encoding an RNA-binding protein that is capable of regulating mRNA processing and is expressed exclusively in neurons, and Repo encodes a transcription factor specifically expressed in glia. Using the GAL4-UAS reporter system, Repo-GAL4 drives the expression of UAS-GFP specifically in glia, while Elav-GAL4 drives UAS-GFP exclusively in neurons. Fluorescence-activated cell sorting (FACS) is a technique to separate cells as they flow past stimulating lasers (Bonner, W. A., et al. "Fluorescence activated cell sorting." Review of Scientific Instruments 43.3 (1972): 404-409.).

The introduction concludes with a statement about the objective of the study; however, it should be elaborated more on the exact approach, which is that, after generating the gene matrix files, the expression data of two distinct fruit fly cell phenotypes, i.e. neurons and glia, are run in the GSEA software to identify enriched gene sets typical for the respective cell types, which would then support the validity of the gene matrix files.

A: This sentence is added. After generating the gene matrix files, the expression data of two distinct fruit fly cell phenotypes, i.e., neurons and glia, are run in the GSEA software to identify enriched gene sets typical for the respective cell types, which would support the validity of the gene matrix files.

A minor discrepancy in the introduction is that, when querying the MSigDB site, there are not 32,284 gene sets of Homo sapiens, as the authors claim, but only 28,705, while the other gene sets are of four other model organisms.

A: Thanks a lot. The number is modified.

According to the authors, the objective of the study is to provide two gene matrix files, which they do. However, the exact details of how these were generated are not given, as the authors merely state "The curated pathway-gene information was retrieved from the Reactome and KEGG websites. The gene matrix files [...] were generated according to the GSEA data formats".

In order to reproduce these gene matrix files, a link or exact download specifications (filters, options, etc.) should be provided and information given about what gene identifiers/symbols the downloaded data came in and how these were converted to the FlyBase annotation IDs, i.e. the CG numbers. Apparently, one needs to subscribe to KEGG to download their data: https://www.kegg.jp/kegg/download/ (?). As for Reactome Pathways it is also unclear how exactly the Drosophila specific gene sets were retrieved (whereas the pathway gene sets for Homo sapiens (HSA) are easily found at https://reactome.org/download/current/ReactomePathways.gmt.zip).

A: The Drosophila-specific genes of KEGG pathways were downloaded from “KEGG Pathway Maps - Drosophila melanogaster (fruit fly)” at https://www.genome.jp/brite/query=00190&htext=br08901.keg&option=-a&node_proc=br08901_org&proc_enabled=dme&panel=collapse. Clicking the triangle symbol of each parent category expands its sub-categories. Clicking the number preceding a sub-category, e.g., 00010 for Glycolysis / Gluconeogenesis, opens the map of that pathway (https://www.genome.jp/kegg-bin/show_pathway?dme00010). Clicking the title of the pathway map in the upper left corner then opens the detail page of that pathway (https://www.genome.jp/entry/dme00010). In the upper right corner of that page, an “all links” box contains the “KEGG GENES” list. The process is repeated until all Drosophila KEGG pathways are covered.

The Drosophila-specific genes of REACTOME pathways were downloaded from the Pathway Browser (https://reactome.org/PathwayBrowser/#/R-DME-XXXXXXX, where XXXXXXX denotes the seven-digit identifier of a specific pathway, e.g., 9612973 for autophagy). The page has three panels. The left panel shows the hierarchy of Drosophila pathways in REACTOME; in the lower right panel, clicking the “Molecules” tab and then the “protein” link displays the gene/protein list. The process is repeated for every Drosophila REACTOME pathway in the left panel.

A gmt file is a tab-separated plain-text file in which each row describes one gene set. In each row, the first column contains the name of the gene set and the second column contains additional details, e.g., the KEGG ID of the pathway (gene set). The gene set members, i.e., FlyBase IDs in this study, are listed from the third column onward, one gene per column. Thus, once the gene list of each pathway is available, the gmt file can be generated by placing the pathway elements into the corresponding cells with any plain text editor or Microsoft Excel. After saving the file as tab-separated plain text, change the filename extension from *.txt to *.gmt in the file browser.
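The row layout described above can also be produced programmatically. The following is a minimal Python sketch, not the procedure used in the study; the function name `write_gmt` and the gene members in the example are hypothetical placeholders, while the two pathway identifiers are those mentioned in this response.

```python
import io

def write_gmt(gene_sets, handle):
    """Write one GMT row per gene set: name, description, then one gene per column."""
    for name, (description, genes) in gene_sets.items():
        handle.write("\t".join([name, description, *genes]) + "\n")

# Hypothetical example: the gene members below are placeholders,
# not the actual content of these pathways.
example_sets = {
    "Glycolysis / Gluconeogenesis": ("dme00010", ["CG0001", "CG0002", "CG0003"]),
    "Autophagy": ("R-DME-9612973", ["CG0004", "CG0005"]),
}

buffer = io.StringIO()
write_gmt(example_sets, buffer)
gmt_text = buffer.getvalue()
```

To write to disk instead, an opened file such as `open("example.gmt", "w")` can be passed as `handle`, which gives the file the .gmt extension directly.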

It should also be mentioned how the conversion table in Supplementary File 4 was obtained or generated to convert the Microarray probe IDs of the validation data to FlyBase annotation IDs.

A: The annotation of the microarray probe IDs is available from the “SOFT formatted family file” at https://ftp.ncbi.nlm.nih.gov/geo/series/GSE45nnn/GSE45344/soft/. Specifically, from line 110 of the file onward, the 1st column is the microarray probe ID and the 3rd column is the FlyBase annotation ID. However, the entries must be cleaned to extract the correct FlyBase ID; e.g., for “CG16844-RA”, the “-RA” must be trimmed to obtain the correct ID “CG16844”. If the dataset of interest does not provide the probe ID annotation, the gene ID conversion tool at https://david.abcc.ncifcrf.gov/conversion.jsp or the website of the gene chip manufacturer may be used instead.
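The trimming step can be sketched in Python as follows. This is an illustration only: the function names, the probe IDs in the sample rows, and the assumption that every transcript suffix has the form "-R" followed by capital letters are not taken from the study; only the column positions and the “CG16844-RA” example are.

```python
import re

# Assumption: transcript suffixes look like "-RA", "-RB", ... at the end of the ID.
_TRANSCRIPT_SUFFIX = re.compile(r"-R[A-Z]+$")

def clean_cg_id(raw):
    """Strip a trailing transcript suffix, e.g. 'CG16844-RA' -> 'CG16844'."""
    return _TRANSCRIPT_SUFFIX.sub("", raw.strip())

def probe_to_cg(lines, probe_col=0, id_col=2):
    """Map probe IDs (1st column) to cleaned CG IDs (3rd column) of tab-separated rows."""
    mapping = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= max(probe_col, id_col):
            continue  # row too short to hold both columns
        cg_id = clean_cg_id(fields[id_col])
        if cg_id.startswith("CG"):
            mapping[fields[probe_col]] = cg_id
    return mapping

# Hypothetical rows; the probe IDs and the middle column are placeholders.
sample_rows = ["probe_1\tx\tCG16844-RA\n", "probe_2\tx\t\n"]
sample_mapping = probe_to_cg(sample_rows)
```

Rows without a CG-style annotation, such as the second sample row, are skipped rather than mapped to an empty ID.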

Speaking of which, the meaning of "strangest average intensity" is a bit unclear in the sentence "Only the record with the strangest average intensity was used for the validation, in the case of redundant intensities presented for an identical CG_ID.". Does this mean that, in a many-to-one mapping from probe IDs to CG numbers, the most extreme value was chosen in case of multiple probe mappings to the same CG number?

A: Yes. The following sentence is added: “The highest value was chosen in case of multiple probe mappings to the same CG ID.”
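A minimal Python sketch of this many-to-one rule is given below. The function name `collapse_to_highest` and the intensity values are hypothetical; the sketch assumes each record is a pair of a CG ID and its average intensity.

```python
def collapse_to_highest(records):
    """Keep, for each CG ID, the highest intensity among its probe records."""
    best = {}
    for cg_id, intensity in records:
        if cg_id not in best or intensity > best[cg_id]:
            best[cg_id] = intensity
    return best

# Hypothetical intensities: two probes map to CG16844, one to CG8129.
sample_records = [("CG16844", 5.0), ("CG16844", 9.5), ("CG8129", 3.2)]
collapsed = collapse_to_highest(sample_records)
```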

The steps of how the validation data and gene matrix files are run in the GSEA software are well explained.

A: Thanks.

RESULTS

The authors report the features of the generated gene matrix files and the results of the GSEA analyses correctly, except that they go by the nominal p-value when reporting significantly enriched gene sets, whereby the GSEA documentation states that due to this value not being adjusted for gene set size and multiple hypothesis testing, it is of limited value and that the FDR (false discovery rate) should be used as well (with below 25% as significant).

A: We included the FDR 25% criterion; the “significant” gene sets therefore now have to meet both the FDR and the nominal p-value criteria. In this study, the gene sets with a significant nominal p-value also meet the FDR 25% criterion (Supp File 5, 6, 7, 8), so the list of significant gene sets does not change with the inclusion of the FDR criterion. The description “under the criterion of false discovery rate (FDR) below 25%” is added where applicable.
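The combined criterion can be expressed as a simple filter. The Python sketch below is illustrative only: the nominal p-value cutoff of 0.05 is an assumption (the cutoff is not stated in this response), the dictionary keys are assumed to mirror the column names of a GSEA report table, and the gene set names and values are made up.

```python
def significant_gene_sets(results, fdr_cutoff=0.25, p_cutoff=0.05):
    """Return names of gene sets that pass both the FDR and nominal p-value cutoffs.

    p_cutoff=0.05 is an assumed default, not a value taken from the study.
    """
    return [
        row["NAME"]
        for row in results
        if row["FDR q-val"] < fdr_cutoff and row["NOM p-val"] < p_cutoff
    ]

# Hypothetical report rows with made-up values.
sample_results = [
    {"NAME": "SET_A", "NOM p-val": 0.01, "FDR q-val": 0.10},
    {"NAME": "SET_B", "NOM p-val": 0.03, "FDR q-val": 0.40},
    {"NAME": "SET_C", "NOM p-val": 0.20, "FDR q-val": 0.20},
]
passing = significant_gene_sets(sample_results)
```

In this made-up example only SET_A passes, because SET_B fails the FDR cutoff and SET_C fails the nominal p-value cutoff.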

The highlighted representative gene sets typical for the respective cell types appear to be coherent. A minor seemingly deviating observation is that, according to the KEGG gene matrix, "ABC transporters" are (correctly) significantly enriched in the glial profile, while according to the REACTOME gene matrix, "ABC transporter in lipid homeostasis" is, though up-regulated, not significantly so. But since these two gene sets differ in genes, this might not be a meaningful comparison.

A: ABC transporters (KEGG dme02010) couple ATP hydrolysis to the active transport of a wide variety of substrates, such as ions, sugars, lipids, sterols, peptides, and proteins, whereas ABC transporter in lipid homeostasis (Reactome R-DME-1369062) covers only a part of the “ABC transporters”.

DISCUSSION

In the Discussion part, the biological implications of the highlighted enriched gene sets on neurons and glia are well discussed and backed by literature. The perspective of using the GSEA results from expression profiles of known phenotypes to explore novel pathways is a very valid point.

A: Thanks a lot.

CONCLUSION

In the Conclusion part, it should be mentioned how/where the two gene matrix files can be accessed. Will the fruit fly researcher always have them available in the supplement of the paper or will they be submitted to the MSigDB site? On the MSigDB website, under Browse Gene Sets, there is already a category C2 for curated gene sets, specifically CP for canonical pathway, which already includes KEGG and REACTOME gene sets, just not specifically for Drosophila:

http://www.gsea-msigdb.org/gsea/msigdb/genesets.jsp?collection=CP:REACTOME

http://www.gsea-msigdb.org/gsea/msigdb/genesets.jsp?collection=CP:KEGG

Gene sets can be submitted at genesets@broadinstitute.org

A: We have contacted MSigDB to include the gene sets. However, MSigDB currently only supports Human, Mouse, and Rat gene sets, although previously MSigDB has allowed some limited deposition of gene sets from other species. Instead, MSigDB recommended making the gene sets available somewhere publicly accessible, like through a GitHub page.

We have made the gene sets available on https://github.com/JackCheng-TW/GeneMatrix.

The following sentence is added to the conclusion: “The gene sets are available in the supplementary files or on https://github.com/JackCheng-TW/GeneMatrix.”

As mentioned earlier, if the exact download specifications and ID conversion and .gmt file generation procedure were provided, the researcher could generate their own gene matrix files according to the authors' procedure. This would ensure one could always obtain the latest data from the KEGG and REACTOME databases.

A: The following sentence is added to the conclusion: “We have clarified the exact download specifications, ID conversion, and gmt file generation procedure in the method section so that any researcher could generate his/her own gene matrix files according to the latest KEGG and REACTOME databases.”

LANGUAGE

The text is written in well articulated language with merely a few minor wording issues, which possibly need some rephrasing; for instance, "wildly applied" probably meant "widely applied", or "The intensity was transformed as 2 to the power of the log2-base value." (the "log2-base value" is 2; easier: "intensity values were log2-transformed" or "log-transformed at base 2."), or "[...] citations of more than 20 thousand." (i.e. "more than 20,000 citations"). There are very few minor misspellings, for example, "Collaspe", a few plural/singular issues, e.g. "The [...] scores [...] shows [...]", and a few extra or missing "the" articles and unnecessary capitalizations, but nothing major. On a last note, it probably suffices to say "we validated the gene matrix files" and not "we validated the feasibility of the gene matrix files".

A: Thank you very much. Modified as suggested.

Attachment

Submitted filename: ResponseReviewerCommnets_20210903.docx

Decision Letter 1

Katherine James

15 Oct 2021

Pathway-targeting Gene Matrix for Drosophila Gene Set Enrichment Analysis

PONE-D-21-18414R1

Dear Dr. Lin,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Katherine James, Ph.D.

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The authors have satisfactorily addressed all of my concerns. They have included an explanation of how their analysis differs from previous analyses, which gives context to the audience.

Reviewer #2: (No Response)

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Gerrit Bostelmann

Acceptance letter

Katherine James

19 Oct 2021

PONE-D-21-18414R1

Pathway-targeting Gene Matrix for Drosophila Gene Set Enrichment Analysis

Dear Dr. Lin:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Katherine James

Academic Editor

PLOS ONE

Associated Data

    Supplementary Materials

    S1 File. Drosophila Reactome.

    (GMT)

    S2 File. Drosophila KEGG.

    (GMT)

    S3 File. Processed profiling of GSE45344.

    (TSV)

    S4 File. Probe ID to Flybase symbols (CG_ID) conversion table.

    (TSV)

    S5 File. Detailed Reactome enrichment results for Class A (ELAV).

    (TSV)

    S6 File. Detailed Reactome enrichment results for Class B (REPO).

    (TSV)

    S7 File. Detailed KEGG enrichment results for Class A (ELAV).

    (TSV)

    S8 File. Detailed KEGG enrichment results for Class B (REPO).

    (TSV)

    S1 Fig. The global enrichment scores across KEGG gene sets.

    (PDF)

    Data Availability Statement

    All relevant data are within the manuscript and its Supporting Information files.


    Articles from PLoS ONE are provided here courtesy of PLOS
