EMBL Nucleotide Sequence Database
Object Model

Carsten Helgesen and Nicole Redaschi

EMBL - EBI

Introduction

The International Nucleotide Sequence Database Collaboration

The International Nucleotide Sequence Database Collaboration, comprising DDBJ (DNA Data Bank of Japan) at CIB, GenBank at NCBI and EMBL (European Molecular Biology Laboratory) at EBI, maintains a comprehensive database of nucleotide sequences that are submitted by researchers, sequencing groups and patent offices or taken from the scientific literature. Each of the three collaborating groups collects a portion of the total sequence data reported worldwide, and all new and updated database entries are exchanged on a daily basis.

Distributing and Querying Data

The EBI distributes quarterly releases of the database on CD-ROM. The releases, as well as incremental and cumulative updates, are also available at EBI's ftp site, which is mirrored at many EMBnet nodes. Data can presently be queried via several types of mail and WWW servers, such as SRS, BLAST, FASTA and Blitz.

The EBI has now begun to use the Common Object Request Broker Architecture (CORBA) for the distribution and querying of biological data. The Radiation Hybrid Database is already available through CORBA. The next project is to make the EMBL Nucleotide Sequence Database accessible in the same way.

Project

Goals

The first step in developing a CORBA server is to create an object model of the system. In our case the system is a database, so the model needs to describe the structure and constraints present in the data, and how the data can be accessed (queried). It can be seen as a specification of the data in the problem domain, independent of how the actual database is implemented. The model can be mapped to an IDL (Interface Definition Language) specification of a CORBA server, and to a database schema for the underlying data management and storage.

The primary goal of this project is to create an object model of the EMBL database that defines the data in detail as a basis for an IDL interface to a CORBA server. A secondary goal is to develop a more general model of annotated sequence databases that can be re-used for other databases maintained at EBI. Such a general model will not be described in this document, however.

Legacy

The EMBL object model reflects the concepts and terminology of "The DDBJ/EMBL/GenBank Feature Table Definition" with respect to sequence annotation. Other ways of defining annotation are certainly viable, but representing sequence annotation through a feature table is by now a well-established practice and can therefore be seen as a de facto standard.

The EMBL Nucleotide Sequence Database is implemented in Oracle. To reflect the database precisely, the object model will inevitably contain legacy from the relational schema. For instance, features have been organized into groups based on their biological meaning. We have not considered alternative groupings, but are aware that the assignment of features to groups can be debated.

Methodology

We use the UML (Unified Modeling Language) notation for this project, since it is now a widely accepted standard for object modeling. It is based on several other modeling languages, and is particularly close to OMT (described in Rumbaugh et al. (1991) and Blaha and Premerlani (1998)). This report does not describe object modeling further. For an introduction to object modeling in general, the reader is referred to the above-mentioned sources.

Although the UML supports full object-oriented systems design, we have mainly used its static, data structuring parts in this project. Thus most classes described so far do not yet contain methods. To define how data are accessed, methods need to be provided as well, most likely together with new classes representing factory objects that create instances of the "data classes" returned as query results.

Object Model

The model is drawn with RationalRose. We are still making minor modifications to the model and working on the documentation, but meanwhile you can have a look at our latest version:

Class documentation is extracted from the RationalRose *.mdl file and converted to HTML format.

Guidelines

Naming

Classes have capitalized names (e.g. Book). For composite terms, each word of the term is capitalized (e.g. JournalArticle).
Attributes, methods, associations and roles have lowercase names (e.g. id). For composite terms, the first word is in lower case and the following words are capitalized (e.g. scientificName).
Roles of an association are normally given the same name as the class at that end of the association, but not capitalized. If misunderstandings can arise, e.g. in reflexive relationships like node links in a tree, the roles are given other suitable names (e.g. parent, children).
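
The role-naming rule for reflexive associations can be sketched in Python as follows. The class Node and its fields are hypothetical, chosen only to mirror the tree example in the text; they are not taken from the EMBL model. The method name follows Python conventions rather than the naming guideline above.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # identity comparison; avoids recursing through the tree
class Node:
    """Hypothetical tree node: both ends of the association are Node."""

    name: str
    # The default role name would be "node" at both ends, which is ambiguous.
    # The roles are therefore named "parent" and "children" instead.
    parent: Optional[Node] = None
    children: List[Node] = field(default_factory=list)

    def add_child(self, child: Node) -> None:
        """Link child below this node, keeping both role ends consistent."""
        child.parent = self
        self.children.append(child)
```

Keeping both ends of the link updated in one method means that a client can navigate the association in either direction by role name.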

Data Structures

A structured (composite) data type is represented as a class (as usual in OO modeling). A class is used as an attribute type mainly if its role is solely to bundle simple data into a composite type (e.g. Date). A link to another class instance is usually represented using an association. An important exception to this rule occurs in the feature meta model, where associations are replaced by attributes representing cross-references to other classes/parts of the model.
Multi-valued attributes are represented by instances of a parameterized class Coll{Type} (e.g. Coll{string}). Alternatively, an association can represent a multi-valued attribute. This would, however, "drown" the model in many extra classes that exist only because some attributes are multi-valued.
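
As a rough illustration of these two rules, the following Python sketch shows a composite type used directly as an attribute type and a parameterized collection used for a multi-valued attribute. Only the names Date and Coll come from the text; the class Entry, its fields and the methods of Coll are assumptions made for this example and do not describe the actual model.

```python
from dataclasses import dataclass, field
from typing import Generic, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Date:
    """Composite type whose only role is to bundle simple data."""

    year: int
    month: int
    day: int


class Coll(Generic[T]):
    """Stand-in for the parameterized class Coll{Type} (methods are assumed)."""

    def __init__(self, items: Iterable[T] = ()) -> None:
        self._items: List[T] = list(items)

    def add(self, item: T) -> None:
        self._items.append(item)

    def __iter__(self) -> Iterator[T]:
        return iter(self._items)

    def __len__(self) -> int:
        return len(self._items)


@dataclass
class Entry:
    """Hypothetical class, for illustration only."""

    id: str
    created: Date  # composite type used as an attribute type
    # Multi-valued attribute: Coll{string} instead of an extra association class.
    keywords: Coll[str] = field(default_factory=Coll)
```

A client would then fill the multi-valued attribute with calls such as entry.keywords.add("kinase"), without the model needing a separate Keyword class.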

Model Organization

The object model is organized into 5 main packages, where each package holds a set of closely related classes with a common purpose:

Sequence Info: classes representing biological sequences, general information about those sequences and administrative data associated with database entries
Feature Info: classes representing detailed sequence annotation (i.e. sequence features)
Reference Info: classes representing literature and other references that hold information about the sequences
Taxonomy Info: classes representing the taxonomy of the organisms from which the sequences were obtained
Location Info: classes representing locations on sequences

There is one additional package, Types, that holds classes representing all the special data types used in various parts of the model.

Each package contains a relatively isolated part of the entire object model, and is a clear candidate for re-use in models for other databases.

Two Models for Features

The independent development of formats and sequence annotation standards at EMBL and GenBank (later adopted by DDBJ) created significant difficulties for data exchange. To overcome those problems, the collaboration began to devise a common format and standards for annotation practice, which are published in the document "The DDBJ/EMBL/GenBank Feature Table Definition". The feature table is defined by a set of features and associated qualifiers expressing valid data values for each feature.

As a consequence of the rapid developments in molecular biology, the definition is prone to change and the collaboration revises it accordingly at the annual meetings. This poses a challenge to the object model. Any change made to the structure of the model needs to be propagated to both the IDL defining the CORBA server interface and the underlying relational schema. This increases the maintenance work for database management, and also for client programs that have to be updated accordingly. Our solution to this problem is to define two models for the feature table:

The Explicit Model

The Meta Model

max / min   0                        1
0           invalid                  invalid
1           optional single-valued   mandatory single-valued
2           optional multi-valued    mandatory multi-valued

For each feature/qualifier combination, the minimum and maximum number of qualifier values for that feature are given. This setup covers all possible cases of validity (valid / invalid), participation (optional / mandatory) and multiplicity (single-valued / multi-valued) constraints that can exist between features and qualifiers. The structure of this model is not affected by changes to the feature table definition, which makes it suitable for defining a stable IDL.
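
The min/max scheme can be sketched in Python as follows. This is an illustration under stated assumptions, not the model's actual definition: the feature and qualifier names are placeholders, a maximum of 2 is read as "more than one value allowed", an unlisted combination is treated as invalid, and an invalid combination is taken to mean that no values may be supplied.

```python
from typing import Dict, Sequence, Tuple

# Hypothetical constraint data: (feature, qualifier) -> (min, max).
# A change to the feature table definition only changes this data,
# not the structure of the code that interprets it.
CONSTRAINTS: Dict[Tuple[str, str], Tuple[int, int]] = {
    ("featureA", "qualifierX"): (1, 1),
    ("featureA", "qualifierY"): (0, 2),
    ("featureB", "qualifierX"): (0, 0),
}


def classify(minimum: int, maximum: int) -> str:
    """Map a (min, max) pair to the label used in the table."""
    if maximum == 0:
        return "invalid"
    participation = "mandatory" if minimum == 1 else "optional"
    multiplicity = "single-valued" if maximum == 1 else "multi-valued"
    return f"{participation} {multiplicity}"


def count_is_allowed(feature: str, qualifier: str, values: Sequence[str]) -> bool:
    """Check the number of qualifier values against the constraint data."""
    # Assumption: combinations missing from the data are treated as invalid.
    minimum, maximum = CONSTRAINTS.get((feature, qualifier), (0, 0))
    count = len(values)
    if maximum == 0:
        return count == 0  # invalid combination: no values may be given
    if count < minimum:
        return False  # a mandatory qualifier is missing
    if maximum == 1 and count > 1:
        return False  # single-valued qualifier given several values
    return True  # maximum 2 is read as "no upper limit" (assumption)
```

Under these assumptions, adding a new feature or qualifier amounts to adding entries to the constraint data, which is the property that keeps an interface built on this model stable.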

