EMBL Nucleotide Sequence Database
Object Model
Carsten Helgesen and Nicole Redaschi
EMBL - EBI
Introduction
The International Nucleotide Sequence Database Collaboration
The International Nucleotide Sequence Database Collaboration, consisting of
DDBJ (DNA Data Bank of Japan) at CIB, GenBank at NCBI and EMBL
(European Molecular Biology Laboratory) at EBI, maintains a
comprehensive database of nucleotide sequences that are submitted by
researchers, sequencing groups and patent offices or taken from the
scientific literature. Each of the three collaborating groups collects
a portion of the total sequence data reported worldwide, and all new
and updated database entries are exchanged on a daily basis.
Distributing and Querying Data
The EBI distributes quarterly releases of the database on CD-ROM. The
releases, as well as incremental and cumulative updates, are also available at
EBI's FTP site, which is mirrored at many EMBnet nodes. Data can presently be
queried via several types of mail and WWW servers such as SRS, BLAST, FASTA
and Blitz.
The EBI has now begun to make use of the Common Object Request Broker
Architecture (CORBA) for the distribution and querying of biological data. The
Radiation Hybrid Database is already available through CORBA. The next project
is to make the EMBL Nucleotide Sequence Database accessible in the same way.
Project
Goals
The first step in developing a CORBA server is to create an object model of
the system. In our case the system is a database, i.e. the model needs to
describe the structure and constraints present in the data, and how the data can
be accessed (queried). It can be seen as a specification of the data in the
problem domain, independent of how the actual database is implemented. The model
can be mapped to an IDL (Interface Definition Language) specification of a
CORBA server, and to a database schema for the underlying data management and
storage.
The primary goal of this project is to create an object model of the EMBL
database that defines the data in detail, as a basis for an IDL interface to a
CORBA server. A secondary goal is to develop a more general model of annotated
sequence databases that can be re-used for other databases maintained at EBI.
Such a general model will not be described in this document, however.
Legacy
The EMBL object model reflects the concepts and terminology of
"The DDBJ/EMBL/GenBank Feature Table Definition"
with respect to sequence annotation. Other ways of defining annotation are
certainly viable, but representing sequence annotation through a feature table
is by now a well-established practice and can therefore be seen as a de facto
standard.
The EMBL Nucleotide Sequence Database is implemented in Oracle. To reflect the
database precisely, the object model will inevitably contain legacy from the
relational schema. For instance, features have been organized into
groups, based on their biological meaning. We have not considered alternative grouping,
but are aware that the assignment of features to groups can be debated.
Methodology
We use the UML (Unified Modeling Language) notation for this project, since
this is now a widely accepted standard for object modeling. It is based on
several other modeling languages, and is particularly close to OMT (described
in Rumbaugh et al. (1991), Blaha and Premerlani (1998)).
This report will not describe object modeling further. For an introduction
to object modeling in general, the reader is referred to the above-mentioned
sources.
Although the UML supports full object-oriented systems design, we have mainly
used its static, data structuring parts in this project. Thus most classes
described so far do not yet contain methods. To define how to access data,
methods need to be provided as well, and most likely also new classes
representing factory objects that are able to create instances of the "data
classes" that are returned as results of queries.
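The split between method-free data classes and factory objects can be illustrated with a minimal Python sketch. It is not part of the EMBL model: the names SequenceEntry, SequenceEntryFactory, find_by_accession and find_by_keyword, the in-memory dictionary standing in for the database, and the accession numbers are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class SequenceEntry:
    """Data class: holds state only, no query logic (hypothetical name)."""
    accession: str
    description: str


class SequenceEntryFactory:
    """Factory object: answers queries by creating data class instances."""

    def __init__(self, rows: Dict[str, str]):
        # Stand-in for the underlying database: accession -> description.
        self._rows = rows

    def find_by_accession(self, accession: str) -> SequenceEntry:
        # Raises KeyError if the accession is unknown.
        return SequenceEntry(accession, self._rows[accession])

    def find_by_keyword(self, word: str) -> List[SequenceEntry]:
        # Case-insensitive substring match, results ordered by accession.
        return [
            SequenceEntry(acc, desc)
            for acc, desc in sorted(self._rows.items())
            if word.lower() in desc.lower()
        ]


# Made-up rows, for illustration only.
factory = SequenceEntryFactory({
    "X00001": "Example kinase mRNA",
    "X00002": "Example tRNA gene",
})
hits = factory.find_by_keyword("mrna")
print([entry.accession for entry in hits])
```

In this arrangement the data classes stay free of access logic, so they could be mapped to an interface definition without exposing how the queries are executed.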
Object Model
The model is drawn with Rational Rose. We are still making minor
modifications to the model and working on the documentation, but
meanwhile you can have a look at our latest version:
Class documentation is extracted from the Rational Rose *.mdl file and converted to HTML format.
Guidelines
Naming
- Classes have capitalized names (e.g. Book). For composite terms, each
word of the term is capitalized (e.g. JournalArticle).
- Attributes, methods, associations and roles have lowercase names (e.g. id).
For composite terms, the first word is in lower case and the following words
are capitalized (e.g. scientificName).
- Roles of an association are normally given the same name as the class at
that end of the association, but not capitalized. If misunderstandings can
arise, e.g. in reflexive relationships like node links in a tree, the roles
are given other suitable names (e.g. parent, children).
Data Structures
- A structured (composite) data type is represented as a class (as usual in
OO modeling). A class is used as an attribute type mainly if its role is
solely to bundle simple data into a composite type (e.g. Date). A link to
another class instance is usually represented using an association. An
important exception to this rule occurs in the feature meta model, where
associations are replaced by attributes representing cross-references
to other classes/parts of the model.
- Multi-valued attributes are represented by instances of a parameterized
class Coll{Type} (e.g. Coll{string}). Alternatively, an association can
represent a multi-valued attribute. This would, however, "drown" the model
in many extra classes that are only present for the sake of attributes
being multi-valued.
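The two guidelines above can be sketched in Python as follows. This is an illustration under assumptions, not the model itself: Coll here is a hypothetical generic class standing in for Coll{Type}, and the Entry class with its created and keywords attributes is invented for the example.

```python
from dataclasses import dataclass, field
from typing import Generic, Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


class Coll(Generic[T]):
    """Hypothetical stand-in for the parameterized class Coll{Type}."""

    def __init__(self, items: Optional[Iterable[T]] = None):
        self._items: List[T] = list(items) if items is not None else []

    def add(self, item: T) -> None:
        self._items.append(item)

    def __iter__(self) -> Iterator[T]:
        return iter(self._items)

    def __len__(self) -> int:
        return len(self._items)


@dataclass
class Date:
    """Composite type whose only role is to bundle simple data."""
    year: int
    month: int
    day: int


@dataclass
class Entry:
    """Illustrative class; the attribute names are assumptions."""
    created: Date                                      # class used as attribute type
    keywords: Coll[str] = field(default_factory=Coll)  # multi-valued attribute


entry = Entry(created=Date(1998, 4, 30))
entry.keywords.add("CORBA")
entry.keywords.add("UML")
print(list(entry.keywords))
```

A single parameterized collection type covers every multi-valued attribute, which avoids introducing one extra class per attribute.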
Model Organization
The object model is organized into 5 main packages, where each package holds a set
of closely related classes with a common purpose:
- Sequence Info: classes representing biological sequences, general
information about those sequences and administrative data associated with
database entries
- Feature Info: classes representing detailed sequence annotation (i.e.
sequence features)
- Reference Info: classes representing literature and other references that
hold information about the sequences
- Taxonomy Info: classes representing the taxonomy of the organisms from which
the sequences were obtained
- Location Info: classes representing locations on sequences
There is one additional package, Types, that holds classes
representing all the special data types used in various parts of the model.
Each package contains a relatively isolated part of the entire object model,
and is a clear candidate for re-use in models for other databases.
Two Models for Features
The independent development of formats and sequence annotation standards at EMBL
and GenBank (later adopted by DDBJ) created significant difficulties for data
exchange. To overcome those problems, the collaboration began to devise a common
format and standards for annotation practice, which are published in the document
"The DDBJ/EMBL/GenBank Feature Table Definition".
The feature table is defined by a set of features and associated qualifiers
expressing valid data values for each feature.
As a consequence of the rapid developments in molecular biology, the definition
is prone to change and the collaboration revises it accordingly at the
annual meetings. This poses a challenge to the object model. Any change made to
the structure of the model needs to be propagated to both the IDL defining the
CORBA server interface and the underlying relational schema. This increases the
maintenance work for database management, and also for client programs that have
to be updated accordingly. Our solution to this problem is to define
two models for the feature table:
- The Explicit Model
This model represents the feature table definitions explicitly in classes
with attributes and associations. Each feature is represented by a
class, and qualifiers are class attributes or associations to other
classes. Biologically similar features inherit from the same
superclass, which defines their common attributes. Thus the set of
valid features and qualifiers, as well as the rules defining which
qualifiers can be used with which features, are represented explicitly
through the model structure. Every revision of the feature table
definition therefore changes the class structure, which makes the model
unstable and unsuitable for defining an IDL. Also, the model is very big:
with 63 feature classes plus their superclasses, the IDL would become
difficult to manage and require a lot of casting operations. The model
is useful, however, for understanding the structure and constraints
present in the data and for seeing how the model maps onto the existing
relational schema.
- The Meta Model
This model defines only one class for all features and one class for all
qualifiers and an association between them to represent the feature table.
Meta data is used to describe the rules for combining features and
qualifiers as shown in the following table:
max / min | 0                      | 1
0         | invalid                | invalid
1         | optional single-valued | mandatory single-valued
2         | optional multi-valued  | mandatory multi-valued
For each feature/qualifier combination the minimum and maximum number of
qualifier values for that feature are given.
This setup covers all possible cases of validity (valid / invalid),
participation (optional / mandatory) and multiplicity (single-valued /
multi-valued) constraints that can exist between features and qualifiers.
The structure of this model is not affected by changes to the feature
table definition, which makes it suitable for defining a stable IDL.
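A minimal Python sketch of the meta model idea follows: one Feature class, one Qualifier class, and the combination rules held as data rather than as class structure. Several points are assumptions made for illustration: the names RULES, classify and violations are hypothetical, a maximum of 2 is read as "more than one value", and the min/max values in RULES are illustrative rather than taken from the Feature Table Definition.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

MANY = 2  # assumption: a maximum of 2 stands for "more than one value"


def classify(min_count: int, max_count: int) -> str:
    """Map a (min, max) pair to the category given in the table above."""
    if max_count == 0:
        return "invalid"
    participation = "mandatory" if min_count >= 1 else "optional"
    multiplicity = "multi-valued" if max_count >= MANY else "single-valued"
    return f"{participation} {multiplicity}"


@dataclass
class Qualifier:
    name: str
    value: str


@dataclass
class Feature:
    key: str
    qualifiers: List[Qualifier] = field(default_factory=list)


# Meta data: (feature key, qualifier name) -> (min, max).
# The values are illustrative assumptions, not the official rules.
RULES: Dict[Tuple[str, str], Tuple[int, int]] = {
    ("CDS", "product"): (0, 1),
    ("CDS", "db_xref"): (0, MANY),
    ("source", "organism"): (1, 1),
}


def violations(feature: Feature) -> List[str]:
    """Check one feature against the meta data and list rule violations."""
    problems: List[str] = []
    counts: Dict[str, int] = {}
    for qualifier in feature.qualifiers:
        counts[qualifier.name] = counts.get(qualifier.name, 0) + 1
    for name, found in counts.items():
        # Combinations without a rule are treated as (0, 0), i.e. invalid.
        _, maximum = RULES.get((feature.key, name), (0, 0))
        if maximum == 0:
            problems.append(f"{name}: not valid for {feature.key}")
        elif maximum < MANY and found > maximum:
            problems.append(f"{name}: at most {maximum} value allowed, found {found}")
    for (key, name), (minimum, maximum) in RULES.items():
        if key == feature.key and maximum > 0 and counts.get(name, 0) < minimum:
            problems.append(f"{name}: mandatory for {feature.key}")
    return problems


cds = Feature("CDS", [Qualifier("product", "a"), Qualifier("product", "b")])
print(classify(0, 1))
print(violations(cds))
```

With this layout, a revision of the feature table would only change entries in RULES; the Feature and Qualifier classes, and any interface derived from them, stay as they are.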
This document was last modified on Apr 30 20:36:53 BST
by redaschi@ebi.ac.uk
© EBI 1998