A Scientist’s Guide for Submitting Data to ZFIN

Douglas G Howe; Yvonne M Bradford; Anne Eagle; David Fashena; Ken Frazer; Patrick Kalita; Prita Mani; Ryan Martin; Sierra Taylor Moxon; Holly Paddock; Christian Pich; Sridhar Ramachandran; Leyla Ruzicka; Kevin Schaper; Xiang Shao; Amy Singer; Sabrina Toro; Ceri Van Slyke; Monte Westerfield

doi:10.1016/bs.mcb.2016.04.010

Author manuscript; available in PMC: 2019 Jan 4.

Published in final edited form as: Methods Cell Biol. 2016 May 12;135:451–481. doi: 10.1016/bs.mcb.2016.04.010

PMCID: PMC6319372 NIHMSID: NIHMS1003660 PMID: 27443940

Abstract

The Zebrafish Model Organism Database (ZFIN; zfin.org) serves as the central repository for genetic and genomic data produced using zebrafish (Danio rerio). Data in ZFIN are either manually curated from peer-reviewed publications or submitted directly to ZFIN from various data repositories. Data types currently supported include mutants, transgenic lines, DNA constructs, gene expression, phenotypes, antibodies, morpholinos, TALENs, CRISPRs, disease models, movies, and images. The rapidly changing methods of genomic science have increased the production of data that cannot readily be represented in standard journal publications. These large data sets require web-based presentation. As the central repository for zebrafish research data, it has become increasingly important for ZFIN to provide the zebrafish research community with support for their data sets and guidance on what is required to submit these data to ZFIN. Regardless of their volume, all data that are submitted for inclusion in ZFIN must include a minimum set of information that describes the data. The aim of this chapter is to identify data types that fit into the current ZFIN database and explain how to provide those data in the optimal format for integration. We identify the required and optional data elements, define jargon, and present tools and templates that can help with the acquisition and organization of data as they are being prepared for submission to ZFIN. This information will also appear in the ZFIN wiki, where it will be updated as our services evolve over time.

Introduction

ZFIN is the central repository of genetic and genomic information for the zebrafish research community. Granting agencies increasingly require that large data sets be submitted to an appropriate database repository such as ZFIN. In light of this, we aim to integrate as much of the zebrafish mutant, transgenic, expression and phenotype data from the research community as possible and to provide links back to source databases when possible. To accomplish that, there must be clear guidance and documentation of the process and requirements for adding data to the ZFIN database. The process of preparing, submitting, and loading data into the ZFIN database is called a “data submission”. Planning ahead for data submission to ZFIN will result in more efficient and timely addition of data to the database. This publication, the associated ZFIN wiki pages, and data submission templates referenced herein provide an up-to-date reference resource to support submission of large and small data sets to ZFIN.

Why Load Data Into ZFIN?

In this era of big data, it is increasingly important to integrate data from multiple sources to provide a unified view of data at a single location. This integration maximizes the value of each piece of data by allowing queries to return accurate and more complete results, accelerating research and reducing redundant effort and research cost. ZFIN supports a diverse collection of data types including mutants, transgenic lines, expression, phenotypes, constructs, morpholinos, TALENs, CRISPRs, antibodies, and disease models. Data curated from publications and from prior data loads are integrated to provide as complete a picture as possible of the role and function of each gene based on all the information available in the ZFIN database. Often, data stored in lab-specific databases lack the long-term stability, accessibility, and data integration that they will have in ZFIN. The goal at ZFIN is to capture the essential core of the data and any additional details the ZFIN database is able to support. In many cases it is possible to link from ZFIN back to the laboratory web pages, providing easy access to any further details that are not currently included at ZFIN. In the long term, the most important services ZFIN can provide related to data loads are to integrate data from disparate resources, to provide a central location presenting a complete picture of what is known about a topic of interest, and to provide critical long-term data stability and accessibility.

The Structure of the ZFIN Database

Data at ZFIN are stored in a complex relational database consisting of over 300 database tables (http://zfin.org/schemaSpy). This database structure allows disparate pieces of data to relate to each other and to be presented in an integrated format. Aligning incoming data with existing ZFIN data and associated database constraints is one of the major challenges for data loads, particularly if alignment is considered only after data are collected. In contrast, if the structure of a potential data submission is understood early in the data gathering process, data collection can be optimized to facilitate a smooth data submission process. Below we describe the data submission process and each major data type we currently support, as well as their components, and we identify which components are required and which are optional.

The Data Submission Process

Data submission requests are typically initiated by an inquiry from a researcher. Once submission of data has been agreed upon, there are several steps to a typical data submission process (Figure 1). A curator assigned to the data load will provide guidance on data gathering, establish necessary records in ZFIN, and assist in getting the data into a format that can be loaded into the ZFIN database.

Figure 1. Summary of the data submission process. Black boxes indicate work done by the data submitter, light gray boxes indicate work done by ZFIN, and gradient-filled boxes indicate iterative steps where work is shared between ZFIN and the data submitter.

Once submitted, data are subject to a number of quality control and data validation steps. Inconsistencies in the data are resolved through discussion with the data submitter. Once the data are free of errors, they are loaded into a test database for final review by the submitter. Once approved by the submitter, the data are loaded into the ZFIN database.

ZFIN does not hold private data. Once submitted, the data load will proceed as part of the normal software release cycle. When data cannot be released to the public until they are published, an initial submission of other high-quality data from the same experiments, which are not part of the publication, can be considered. This may include data such as mutations with no obvious or early lethal phenotypes, gene expression where there is either no expression at a particular developmental stage or expression is ubiquitous, or enhancer traps that trap already characterized enhancers. This allows the researcher to become familiar with the data submission process and to validate the data submission file format while protecting the integrity of the unpublished data. Once published, those data can be submitted with confidence, knowing that similar data have already been integrated into ZFIN.

Data Submissions

The data required for a submission often involve multiple files that together provide the information needed to represent the data fully at ZFIN. In this section, each of the data types that can be loaded into the ZFIN database is described along with its required and optional components.

Mutant and Transgenic Line Submission

Mutant features are genomic alterations often generated by an applied mutagen, whereas transgenic features are genomic alterations generated by insertion of one or more copies of a transgenic construct. These may or may not result in alleles of genes. Mutant and transgenic lines are strains of fish that contain one or more heritable transgenic or mutant features. Each transgenic feature may contain one or more transgenic constructs. The exact insertion site or sites may or may not be known. Lines that contain multiple distinct known transgenic insertion loci will have a distinct genomic feature designated for each insertion site. The transgenic line is then composed of a combination of those distinct insertions. Data submissions containing genomic features not yet in ZFIN include information to create new feature records. Below we describe the various elements of data that can be included in a mutant or transgenic line data submission.

Genotypes

The genotype represents the primary genomic sequence alterations present in a fish. The genotype conveys zygosity information about specific loci having known sequence variants or transgenic insertions as well as the genetic background. To define a specific genotype in ZFIN completely and uniquely, each genomic feature and the zygosity of the locus where it resides are required. To define the genotype further, information about the parental zygosity of each locus and the genetic background can also optionally be provided (Table 1). The required and optional data elements for transgenic line submission (Table 2) and for mutant submission (Table 3) are similar, but distinct.

Table 1.

Data for Submitting Genotypes to ZFIN

REQUIRED	OPTIONAL
Genomic Feature	Feature Maternal Zygosity
Feature Zygosity	Feature Paternal Zygosity
	Genetic Background
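As a rough illustration of how a submitter might pre-check genotype rows before sending them, the sketch below tests one row against the required and optional fields of Table 1. The dictionary-based row layout, the function name `check_genotype_row`, and the sample feature name `xx123` are assumptions made for this example; they are not part of any ZFIN template or tool.

```python
# Field names are taken from Table 1; everything else here is hypothetical.
REQUIRED_GENOTYPE_FIELDS = ("Genomic Feature", "Feature Zygosity")
OPTIONAL_GENOTYPE_FIELDS = (
    "Feature Maternal Zygosity",
    "Feature Paternal Zygosity",
    "Genetic Background",
)


def check_genotype_row(row):
    """Return a list of problems found in one genotype row (empty if none)."""
    problems = []
    # Every required field must be present and non-blank.
    for field in REQUIRED_GENOTYPE_FIELDS:
        value = row.get(field)
        if value is None or not str(value).strip():
            problems.append(f"missing required field: {field}")
    # Flag column names that are neither required nor optional.
    allowed = set(REQUIRED_GENOTYPE_FIELDS) | set(OPTIONAL_GENOTYPE_FIELDS)
    for field in row:
        if field not in allowed:
            problems.append(f"unexpected field: {field}")
    return problems


# "xx123" is a made-up feature name used only for illustration.
example = {"Genomic Feature": "xx123", "Feature Zygosity": "homozygous"}
print(check_genotype_row(example))
```

A check like this only confirms that the fields are filled in; whether the values themselves are acceptable is still settled with the curator assigned to the data load.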

REQUIRED	OPTIONAL
Genomic Feature	Link to Alternate Resource
Affected Gene Symbol	Insertion Accession
Affected Gene Accession	Note
Transgene Type
Mutagen
Subject
Construct
Laboratory of Origin
Citation

Abbreviation	Full Name	Description
AB	AB	Any of the inbred lines derived from the original Streisinger A and B incrosses. Includes AB* and ABC.
AB/TL	AB/Tupfel long fin	Mixed AB/Tupfel long fin line, either a maintained inbred line or a novel cross.
AB/TU	AB/Tuebingen	Mixed AB/Tuebingen line, either a maintained inbred line or a novel cross.
C32	C32	C32 derivatives from either the Steve Johnson laboratory or the Kimmel laboratory.
KOLN	Cologne	A wild-type line originally from the Campos-Ortega laboratory that has short fins.
DAR	Darjeeling	Wild-type line collected in Darjeeling, India by Heiko Bleher in 1987. Line maintained by inbreeding.
EKW	Ekkwill	Wild-type line from Ekkwill Breeders in Florida.
HK	Hong Kong	Stock obtained from a Hong Kong fish dealer.
IND	India	Stock obtained from an expedition to Darjeeling (wild isolate).
NA	Nadia	Wild-type line from the Nadia district. Original stock collected from stagnant ponds and flood plain. Inbred.
NHGR-1	NHGR-1	Fully sequenced inbred line derived from a Tuebingen/AB cross (1).
RW	RIKEN WT	Wild-type line distributed by RIKEN.
SAT	Sanger AB Tuebingen	The AB/Tuebingen line derived from double haploid fish used by Sanger for genomic sequencing.
SJA	SJA	AB derived line that is bred to reduce polymorphism.
SJD	SJD	Sibling line to Darjeeling. Inbred to reduce polymorphisms.
TU	Tuebingen	Short fins, original stock from a Tuebingen pet shop.
TL	Tupfel long fin	Homozygous for leo^t1 and lof^dt2.
TLN	Tupfel long fin nacre	The TL-derived TLN wild-type strain carries a mix of molecularly uncharacterized mitfa(nacre)^s170 and mitfa(nacre)^s184 in the background. TL is homozygous for cx41.8(leo)^t1 and lof^dt2.
WIK	WIK	The WIK line is very polymorphic relative to the TU line.
WT	Wild type	Used to denote any wild type not listed above.
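Because genetic background is drawn from a fixed list, a submitter could check background values against the abbreviations in the table above before submission. The following is a minimal sketch under stated assumptions: the function name `normalize_background` is hypothetical, and accepting lowercase input is a convenience chosen for this example rather than documented ZFIN behavior.

```python
# Abbreviations copied from the genetic background table above.
STANDARD_BACKGROUNDS = {
    "AB", "AB/TL", "AB/TU", "C32", "KOLN", "DAR", "EKW", "HK", "IND", "NA",
    "NHGR-1", "RW", "SAT", "SJA", "SJD", "TU", "TL", "TLN", "WIK", "WT",
}


def normalize_background(value):
    """Return the standard abbreviation for value, or None if unrecognized."""
    # Case-insensitive matching is an assumption made for this sketch.
    candidate = value.strip().upper()
    return candidate if candidate in STANDARD_BACKGROUNDS else None


print(normalize_background(" ab/tl "))
print(normalize_background("Tuebingen"))
```

The second call returns `None` because the check matches abbreviations only, not full names. Note also that some data-frame tools treat the literal string `NA` as a missing value by default (pandas `read_csv` is one example), so the Nadia abbreviation may need care when submission files are read programmatically.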

Required Data	Description
Genomic Feature	The unique identifier for the mutant
Affected Gene	The symbol for the affected gene
Affected Gene Accession	A ZDB-GENE ID or sequence accession number for the affected gene
Relationship	The relationship between the genomic feature and the affected gene. One of: gene missing, gene present, gene moved, is allele of gene

Mutation type	Definition	Notes	Data to provide	SO ID
point mutation	A single nucleotide change which has occurred at the same position of a corresponding nucleotide in a reference sequence.	Can be an allele of a single gene.	Affected gene and its ZDB-GENE ID	SO:1000008
small deletion	The point at which one or more contiguous nucleotides were excised.	The excision is within a single gene.	Affected gene and its ZDB-GENE ID	SO:0000159
insertion	The sequence of one or more nucleotides added between two adjacent nucleotides in the sequence.	Usually an allele of a single gene.	Affected gene and its ZDB-GENE ID	SO:0000667
indel	A sequence alteration which includes an insertion and a deletion, affecting 2 or more bases.	Usually an allele of a single gene.	Affected gene and its ZDB-GENE ID	SO:1000032
translocation	A region of nucleotide sequence that has translocated to a new position. The observed adjacency of two previously separated regions.	Has at least one breakpoint, frequently within a gene. Has genes that are in a new genomic context	Genes at breakpoint and their ZDB-GENE IDs. Genes that have been relocated and their associated ZDB-GENE IDs	SO:0000199
inversion	A continuous nucleotide sequence is inverted in the same position.	Has two break points which may occur in one or more genes and may have additional genes within the inverted sequence.	Genes at breakpoint and their ZDB-GENE IDs. Genes that have been relocated and their associated ZDB-GENE IDs	SO:1000036
deficiency	An incomplete chromosome.	The chromosome is missing more than a single gene. Has two break points. Other genes existing between the break points may also have been lost.	Genes at breakpoint and their ZDB-GENE IDs. Genes that have been lost and their associated ZDB-GENE IDs	SO:1000029
unknown	A mutation where the lesion type is unknown.	May be an allele of a gene or in an unknown location.	Affected gene and its ZDB-GENE ID if known	NA
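The mutation types and Sequence Ontology (SO) identifiers in the table above lend themselves to a simple lookup, sketched here. The dictionary contents are copied from the table; the function name `so_id_for` and the choice to raise an error on unrecognized types are assumptions made for illustration.

```python
# Mutation type -> SO ID, copied from the table above.
SO_ID_BY_MUTATION_TYPE = {
    "point mutation": "SO:1000008",
    "small deletion": "SO:0000159",
    "insertion": "SO:0000667",
    "indel": "SO:1000032",
    "translocation": "SO:0000199",
    "inversion": "SO:1000036",
    "deficiency": "SO:1000029",
    "unknown": None,  # the table lists NA for unknown lesions
}


def so_id_for(mutation_type):
    """Look up the SO ID for a mutation type; raise ValueError if unlisted."""
    key = mutation_type.strip().lower()
    if key not in SO_ID_BY_MUTATION_TYPE:
        raise ValueError(f"unrecognized mutation type: {mutation_type!r}")
    return SO_ID_BY_MUTATION_TYPE[key]


print(so_id_for("Inversion"))
```

Failing loudly on an unlisted type, rather than silently mapping it to "unknown", keeps typos from being mistaken for lesions of unknown type.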

Mutagen Type	Description
TALEN	Transcription activator-like effector nucleases (TALENs) are nucleases specifically designed to cleave a DNA sequence of interest. Provide the name of the TALEN.
CRISPR	Clustered regularly interspaced short palindromic repeats (CRISPRs) are specifically designed to recruit the Cas9 enzyme to cleave DNA at a targeted DNA locus. Mutations often result from the subsequent DNA repair event. Provide the name of the CRISPR.
ENU	N-ethyl-N-nitrosourea, ENU, is a chemical alkylating agent and mutagen when applied to animals.
TMP	4,5′,8-trimethylpsoralen is a DNA cross-linking agent which often produces deletion mutations.
Gamma Rays	Ionizing electromagnetic radiation used to induce mutations. Often produces large deletions and chromosomal aberrations.
Spontaneous	De novo mutations not generated by the application of an external mutagen.
Zinc Finger Nuclease	Zinc finger nucleases are artificial restriction enzymes designed to target and cleave specific DNA sequences.
DNA	DNA sequence, usually a transgenic construct, injected into embryos to create heritable transgenic insertions.

REQUIRED	OPTIONAL
Author list	PubMed ID
Publication title
Abstract describing data and methods

REQUIRED	OPTIONAL
Construct Name	Link to Alternate Resource
Promoter Gene Symbol	Construct Accession
Promoter Accession	Construct Map Image Name
Coding Sequence Gene Symbol	Note
Coding Sequence Accession
Engineered Region Name
Citation

REQUIRED	OPTIONAL
MO/TALEN/CRISPR Name	Link to Alternate Resource
Target Sequence 1	Note
Target Sequence 2 (TALEN Only)	Citation PMID
Target Gene Symbol
Target Gene Accession
Data Load Citation

REQUIRED	OPTIONAL
Expressed Gene Symbol	Image/Movie File Name
Expressed Gene Accession	Antibody Name (for Immunoassays)
Genotype	Probe GenBank Accession # (for Hybridization Assays)
Morpholinos, TALENs, CRISPRs	Antibody Name (for Immunoassays)
Anatomical Structure
Developmental Stage
Experimental Conditions
Citation
Assay Type

Condition	Description
standard	Experimental condition that is the standard environment for zebrafish husbandry, as described in The Zebrafish Book. In general, the standard environment uses contaminant-free tank water heated to 28.5°C, a normal contaminant-free diet, standard osmolarity and pH, and a normal light cycle of 14 hr light/10 hr dark.
generic control	Experimental condition that is used as a reference point to compare with results of treated zebrafish. Generic experimental controls often use sham injections, injections of vehicle, injections of control MOs, etc. This environment is used for non-standard conditions used in control treatments.
chemical	Experimental condition in which the fish is treated in tank water, or by injection or consumption, with a chemical substance. The ChEBI ID for the chemical should be included in the data submission.
pH, acidic	Experimental condition in which the pH of the water is lower than the pH of the controlled conditions.
pH, basic	Experimental condition in which the pH of the water is higher than the pH of the controlled conditions.
electric field	Experimental condition in which an electric field is applied to the fish, fish cells, or organs as compared to control conditions.
gravity	Experimental condition in which the fish is exposed to forces that simulate low or high gravity as compared to earth’s gravity.
hyperoxia	Experimental condition in which the oxygen (O₂) concentration is higher than the one in controlled conditions.
hypoxia	Experimental condition in which the oxygen (O₂) concentration is lower than the one in controlled conditions.
light	Experimental condition in which the intensity, wavelength, and/or duration of illumination is (are) different from the one in controlled conditions.
magnetic field	Experimental condition in which the fish is exposed to a magnetic field as compared to control conditions. A magnetic field is a region in which the force of magnetism is applied.
mechanical stress	Experimental condition in which an external force is applied to the fish or part of the fish.
radiation	Experimental condition in which the fish is exposed to ionizing and/or non-ionizing radiation. Ionizing examples include gamma rays, alpha particles, UV, and X-rays; non-ionizing examples include infrared and microwaves.
bacterial infection	Experimental condition in which fish have been infected with bacteria. Infection can be achieved by adding bacteria to the water, by injecting bacteria (for example into the brain ventricle, the caudal vein, or the yolk sac), by ingestion, or by other means.
cancer	Experimental condition in which cancer cells are introduced to the fish via injection of tumor cells.
fungal infection	Experimental condition in which fish have been infected with a fungus.
germ free	Experimental condition in which fish were raised in the absence of bacteria.
high calorie diet	Experimental condition in which fish are fed a high calorie diet as compared to the normal diet.
low calorie diet	Experimental condition in which fish are fed a low calorie diet as compared to the normal diet.
organ culture	Experimental condition in which an organ is dissected/isolated/collected from the fish and placed in culture. The analysis of the experiment is done on this organ in culture.
primary cell culture	Experimental condition in which an embryo or adult fish is dissociated to a single cell suspension. The analysis is made on this cell culture.
regeneration/healing	Experimental condition in which an organ (e.g. heart) or anatomical structure (e.g. fin) of the fish was wounded or amputated.
starvation	Experimental condition in which fish were deprived of food.
salinity, hypertonic	Experimental condition in which the salt concentration is higher than the one in controlled conditions.
salinity, hypotonic	Experimental condition in which the salt concentration is lower than the one in controlled conditions.
temperature, cold shock	Experimental condition in which fish are subjected for a short period of time to a temperature lower than the controlled temperature. The standard controlled temperature (according to The Zebrafish Book) is 28.5°C.
temperature, heat shock	Experimental condition in which fish are subjected for a short period of time to a temperature higher than the controlled temperature. The standard controlled temperature (according to The Zebrafish Book) is 28.5°C.
temperature, stable	Experimental condition in which fish are raised at a temperature different from (lower or higher than) the controlled temperature. The standard controlled temperature (according to The Zebrafish Book) is 28.5°C.

Data	Description
Host Organism	Organism from which the antibody was made.
Immunogen Organism	Species from which the immunogen was obtained. If it is a peptide based on a sequence from a particular organism list that organism.
Antibody Type	List whether antibody is polyclonal or monoclonal.
Antibody Isotype	List isotype if known. (optional)
Source	If the antibody was purchased from a commercial supplier, list the supplier.
Catalog Number	If the antibody is from a commercial supplier please provide the catalog number. (optional)
Name	Include clone names if known. (optional)
Note	Include the sequence of the peptide or the accession number of the sequence used to produce the antibody if the antibody was custom made. Also include any usage notes here. (optional)
Target Gene	Provide ZDB-GENE ID or sequence accession number for the target gene if known (optional).
Citation for Original Source	If this is a previously published antibody please provide a reference PubMed ID. Otherwise the antibody will be attributed to the data load publication.

Label	Description
track	The track file name
shortLabel	A brief (17 character) label to describe the track in the genome browser. Visible to the left of the track in the genome browser. Example: 4 day methylome
longLabel	A longer label (76 character) to describe the track in the genome browser. Visible above the track in the genome browser. Provide enough detail to uniquely identify the track. Example: Howe et al. 2015 male 4 day methylome
type	States the track file format (bigWig, bigBed, etc.)
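The character limits on track labels are easy to exceed by accident, so a quick length check before submission can help. This is a sketch only: treating all four entries as mandatory is an assumption, as are the function name `check_track_info` and the sample file name `methylome_4day.bw`; the 17 and 76 character limits and the example labels come from the table above.

```python
SHORT_LABEL_MAX = 17  # limit stated for shortLabel in the table above
LONG_LABEL_MAX = 76   # limit stated for longLabel in the table above
TRACK_KEYS = ("track", "shortLabel", "longLabel", "type")


def check_track_info(info):
    """Return a list of problems with one track description (empty if none)."""
    problems = []
    # Assumption: every entry in the table must be supplied.
    for key in TRACK_KEYS:
        if not str(info.get(key, "")).strip():
            problems.append(f"missing {key}")
    # Enforce the label length limits.
    for key, limit in (("shortLabel", SHORT_LABEL_MAX), ("longLabel", LONG_LABEL_MAX)):
        length = len(str(info.get(key, "")))
        if length > limit:
            problems.append(f"{key} is {length} characters; limit is {limit}")
    return problems


# The file name is hypothetical; the labels are the examples from the table.
track = {
    "track": "methylome_4day.bw",
    "shortLabel": "4 day methylome",
    "longLabel": "Howe et al. 2015 male 4 day methylome",
    "type": "bigWig",
}
print(check_track_info(track))
```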

Data Type	Source of Valid Values
Anatomy	Zebrafish Anatomy Ontology
Developmental Stage	Zebrafish Developmental Stages
Human Disease	Human Disease Ontology
Biological Process	Gene Ontology
Experimental Condition	Constrained list of experimental conditions (Table 12)
Mutagen	Constrained list of mutagens (Table 6)
Subject	Constrained list of subjects (Table 7)
Phenotype Entity	Gene Ontology or Zebrafish Anatomy Ontology
Phenotype Quality	Phenotypic Trait Ontology
Genetic Background	The list of standard lines at ZFIN

Data Type Being Submitted	Data Sheets to Submit	Other Files
Mutants	Mutants/Transgenics, Genotypes, Citations
Transgenics	Constructs, Mutants/Transgenics, Genotypes, Citations	Construct Image
Phenotype	Mutants/Transgenics, Genotypes, Citations, Constructs, Phenotypes	Media Files
Expression	Mutants/Transgenics, Genotypes, Citations, Constructs, Expression	Media Files
Morpholinos, TALENs, CRISPRs	MO/TAL/CRSP, Citations
Genome Browser Tracks	TrackInfo, Citations	Track File
Disease Models	Mutants/Transgenics, Genotypes, Disease Models, Citations
Antibodies	Antibodies
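When a submission covers several data types, the sheets listed in the table above overlap, and it can help to compute the combined set once. The sketch below encodes the table as a dictionary. The split of each cell into individual sheet names reflects our reading of the table, the Disease Models entry is assumed to use the same Citations sheet as the other rows, and the function name `sheets_needed` is hypothetical.

```python
# Data type -> data sheets, as read from the table above.
SHEETS_BY_DATA_TYPE = {
    "Mutants": ["Mutants/Transgenics", "Genotypes", "Citations"],
    "Transgenics": ["Constructs", "Mutants/Transgenics", "Genotypes", "Citations"],
    "Phenotype": ["Mutants/Transgenics", "Genotypes", "Citations", "Constructs", "Phenotypes"],
    "Expression": ["Mutants/Transgenics", "Genotypes", "Citations", "Constructs", "Expression"],
    "Morpholinos, TALENs, CRISPRs": ["MO/TAL/CRSP", "Citations"],
    "Genome Browser Tracks": ["TrackInfo", "Citations"],
    "Disease Models": ["Mutants/Transgenics", "Genotypes", "Disease Models", "Citations"],
    "Antibodies": ["Antibodies"],
}


def sheets_needed(data_types):
    """Return the sorted, de-duplicated sheets needed for the given data types."""
    sheets = set()
    for data_type in data_types:
        # A KeyError here means the data type is not in the table.
        sheets.update(SHEETS_BY_DATA_TYPE[data_type])
    return sorted(sheets)


print(sheets_needed(["Mutants", "Antibodies"]))
```

The "Other Files" column (construct images, media files, track files) is left out of this sketch; those files accompany the sheets rather than replace them.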

REQUIRED	OPTIONAL
Genotype	Image or Movie File Name
Morpholinos, TALENs, CRISPRs
Developmental Stage
Experimental Conditions
Phenotype Entity
Phenotype Quality
Tag
Citation
