EMBL Nucleotide Sequence Database
Object Model

Carsten Helgesen and Nicole Redaschi

EMBL - EBI


Introduction

The International Nucleotide Sequence Database Collaboration

The International Nucleotide Sequence Database Collaboration, consisting of DDBJ (DNA Data Bank of Japan) at CIB, GenBank at NCBI and EMBL (European Molecular Biology Laboratory) at EBI, maintains a comprehensive database of nucleotide sequences that are submitted by researchers, sequencing groups and patent offices or taken from the scientific literature. Each of the three collaborating groups collects a portion of the total sequence data reported worldwide, and all new and updated database entries are exchanged on a daily basis.

Distributing and Querying Data

The EBI distributes quarterly releases of the database on CD-ROM. The releases, as well as incremental and cumulative updates, are also available at EBI's FTP site, which is mirrored at many EMBnet nodes. Data can presently be queried via several types of mail and WWW servers, such as SRS, BLAST, FASTA and Blitz.

The EBI has now begun to make use of the Common Object Request Broker Architecture (CORBA) for the distribution and querying of biological data. The Radiation Hybrid Database is already available through CORBA. The next project is to make the EMBL Nucleotide Sequence Database accessible in the same way.


Project

Goals

The first step in developing a CORBA server is to create an object model of the system. In our case the system is a database, so the model needs to describe the structure and constraints present in the data, and how the data can be accessed (queried). The model can be seen as a specification of the data in the problem domain, independent of how the actual database is implemented. It can be mapped to an IDL (Interface Definition Language) specification of a CORBA server, and to a database schema for the underlying data management and storage.
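As a rough illustration of how one model can be mapped in two directions, the sketch below renders a single class description both as an IDL interface and as a relational table definition. The class `Entry`, its attributes and the type mappings are assumptions made for this example; they are not taken from the EMBL object model or the actual Oracle schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribute:
    name: str
    kind: str  # "string" or "integer" in this sketch


@dataclass(frozen=True)
class ModelClass:
    """Hypothetical, simplified description of one class in an object model."""
    name: str
    attributes: tuple


# Assumed type mappings, chosen only for illustration.
IDL_TYPES = {"string": "string", "integer": "long"}
SQL_TYPES = {"string": "VARCHAR(255)", "integer": "INTEGER"}


def to_idl(cls):
    """Render the class as an IDL interface with read-only attributes."""
    lines = [f"interface {cls.name} {{"]
    for attr in cls.attributes:
        lines.append(f"  readonly attribute {IDL_TYPES[attr.kind]} {attr.name};")
    lines.append("};")
    return "\n".join(lines)


def to_sql(cls):
    """Render the same class as a CREATE TABLE statement."""
    columns = ",\n".join(
        f"  {attr.name} {SQL_TYPES[attr.kind]}" for attr in cls.attributes
    )
    return f"CREATE TABLE {cls.name} (\n{columns}\n);"


# Hypothetical class used only for this example.
entry = ModelClass(
    "Entry",
    (Attribute("accession", "string"), Attribute("sequence_length", "integer")),
)

print(to_idl(entry))
print(to_sql(entry))
```

Both outputs are derived from the same description, which is the sense in which the model is independent of either implementation.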

The primary goal of this project is to create an object model of the EMBL database that defines the data in detail as a basis for an IDL interface to a CORBA server. A secondary goal is to develop a more general model of annotated sequence databases that can be re-used for other databases maintained at EBI. Such a general model will not be described in this document, however.

Legacy

The EMBL object model reflects the concepts and terminology of "The DDBJ/EMBL/GenBank Feature Table Definition" with respect to sequence annotation. Other ways of defining annotation are certainly viable, but it could be argued that representing sequence annotation through a feature table is a well-established practice by now, and can therefore be seen as a de facto standard.

The EMBL Nucleotide Sequence Database is implemented in Oracle. To reflect the database precisely, the object model will inevitably contain legacy elements from the relational schema. For instance, features have been organized into groups, based on their biological meaning. We have not considered alternative groupings, but are aware that the assignment of features to groups can be debated.

Methodology

We use the UML (Unified Modeling Language) notation for this project, since this is now a widely accepted standard for object modeling. It is based on several other modeling languages, and is particularly close to OMT (described in Rumbaugh et al. (1991) and Blaha and Premerlani (1998)). This report will not describe object modeling further. For an introduction to object modeling in general, the reader is referred to the above-mentioned sources.

Although the UML supports full object-oriented systems design, we have mainly used its static, data-structuring parts in this project. Thus most classes described so far do not yet contain methods. To define how data is accessed, methods need to be provided as well, and most likely also new classes representing factory objects that create instances of the "data classes" returned as query results.
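The split between method-free data classes and factory objects can be sketched as follows. This is a minimal illustration under stated assumptions: `SequenceEntry`, `EntryFactory`, the two query methods and the sample rows are all invented for the example and do not correspond to classes in the EMBL object model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SequenceEntry:
    """A "data class": structure only, no query logic of its own."""
    accession: str
    description: str
    sequence: str


class EntryFactory:
    """Hypothetical factory that answers queries by creating SequenceEntry objects."""

    def __init__(self, rows):
        # rows: iterable of (accession, description, sequence) tuples,
        # standing in for rows fetched from the underlying database.
        self._rows = list(rows)

    def find_by_accession(self, accession):
        """Return the matching entry, or None if there is no such accession."""
        for acc, description, sequence in self._rows:
            if acc == accession:
                return SequenceEntry(acc, description, sequence)
        return None

    def find_by_keyword(self, keyword):
        """Return all entries whose description contains the keyword."""
        needle = keyword.lower()
        return [
            SequenceEntry(*row) for row in self._rows if needle in row[1].lower()
        ]


# Made-up rows for illustration only; these are not real database entries.
factory = EntryFactory([
    ("TEST0001", "example kinase gene", "atgcatgc"),
    ("TEST0002", "example tRNA", "ggccaatt"),
])
hit = factory.find_by_accession("TEST0001")
matches = factory.find_by_keyword("trna")
```

Keeping the query methods on the factory leaves the data classes as plain structures, which matches the static, method-free state of the classes described above.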


Object Model

The model is drawn with Rational Rose. We are still making minor modifications to the model and working on the documentation, but meanwhile you can have a look at our latest version:

Class documentation is extracted from the Rational Rose *.mdl file and converted to HTML format.

Guidelines

Naming

Data Structures

Model Organization

The object model is organized into 5 main packages, where each package holds a set of closely related classes with a common purpose:

There is one additional package, Types, that holds classes representing all the special data types used in various parts of the model.

Each package contains a relatively isolated part of the entire object model, and is a clear candidate for re-use in models for other databases.

Two Models for Features

The independent development of formats and sequence annotation standards at EMBL and GenBank (later adopted by DDBJ) created significant difficulties for data exchange. To overcome those problems, the collaboration began to devise a common format and standards for annotation practice, which are published in the document "The DDBJ/EMBL/GenBank Feature Table Definition". The feature table is defined by a set of features and associated qualifiers expressing valid data values for each feature.
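To make the idea of features with associated qualifiers and valid values concrete, here is a small sketch that checks a feature against a definition table. The table is a tiny illustrative excerpt chosen for this example, not a complete or authoritative copy of the published definition, and `FEATURE_DEFINITION` and `validate_feature` are hypothetical names.

```python
# Illustrative excerpt only. Each feature key maps to its permitted qualifiers;
# a qualifier maps to a set of valid values, or to None for free text.
FEATURE_DEFINITION = {
    "CDS": {
        "product": None,
        "gene": None,
        "codon_start": {"1", "2", "3"},
    },
    "gene": {
        "gene": None,
    },
}


def validate_feature(key, qualifiers):
    """Return a list of problems; an empty list means the feature is valid here."""
    allowed = FEATURE_DEFINITION.get(key)
    if allowed is None:
        return [f"unknown feature key: {key}"]
    problems = []
    for name, value in qualifiers.items():
        if name not in allowed:
            problems.append(f"qualifier /{name} is not defined for {key}")
        elif allowed[name] is not None and value not in allowed[name]:
            problems.append(f"invalid value {value!r} for /{name}")
    return problems


ok = validate_feature("CDS", {"gene": "exampleA", "codon_start": "1"})
bad = validate_feature("CDS", {"codon_start": "4", "colour": "red"})
```

Because the definition is held as data in this sketch, adding a feature or qualifier changes the table and not the checking code.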

As a consequence of the rapid developments in molecular biology, the definition is prone to change, and the collaboration revises it accordingly at its annual meetings. This poses a challenge to the object model. Any change made to the structure of the model needs to be propagated to both the IDL defining the CORBA server interface and the underlying relational schema. This increases the maintenance work for database management, and also for client programs, which have to be updated accordingly. Our solution to this problem is to define two models for the feature table:


This document was last modified on Apr 30 20:36:53 BST by redaschi@ebi.ac.uk
© EBI 1998