The NCBI Handbook, 1st edition
Editors: Jo McEntyre and Jim Ostell
National Center for Biotechnology Information (NCBI), National Library of Medicine, National Institutes of Health, Bethesda, MD 20892-6510
Nov. 2002

The Databases: The Taxonomy Project
Scott Federhen
Created: October 9, 2002; Updated: August 13, 2003

Summary

The NCBI Taxonomy database is a curated set of names and classifications for all of the organisms that are represented in GenBank. When new sequences are submitted to GenBank, the submission is checked for new organism names, which are then classified and added to the Taxonomy database. As of April 2003, there were 176,890 total taxa represented.

There are two main tools for viewing the information in the Taxonomy database: the Taxonomy Browser, and Taxonomy Entrez. Both systems allow searching of the Taxonomy database for names, and both link to the relevant sequence data. However, the Taxonomy Browser provides a hierarchical view of the classification (the best display for most casual users interested in exploring our classification), whereas Entrez Taxonomy provides a uniform indexing, search, and retrieval engine with a common mechanism for linking between the Taxonomy and other relevant Entrez databases.

Introduction

Box 1. History of the Taxonomy Project.

By the time the NCBI was created in 1988, the nucleotide sequence databases (GenBank, EMBL, and DDBJ) each maintained their own taxonomic classifications. All three classifications derived from the one developed at the Los Alamos National Lab (LANL) but had diverged considerably. Furthermore, the protein sequence databases (SWISS-PROT and PIR) each developed their own taxonomic classifications that were very different from each other and from the nucleotide database taxonomies. To add to the mix, in 1990 the NCBI and the NLM initiated a journal-scanning program to capture and annotate sequences reported in the literature that had not been submitted to any of the sequence databases. We, of course, began to assign our own taxonomic classifications for these records.

The Taxonomy Project started in 1991, in association with the launch of Entrez (Chapter 15). The goal was to combine the many taxonomies that existed at the time into a single classification that would span all of the organisms represented in any of the GenBank source databases (Chapter 1).

To represent, manipulate, and store versions of each of the different database taxonomies, we wrote a stand-alone, tree-structured database manager, TaxMan. This also allowed us to merge the taxonomies into a single composite classification. The resulting hybrid was, at first, a bigger mess than any of the pieces had been, but it gave us a starting point that spanned all of the names in all of the sequence databases. For many years, we cleaned up and maintained the NCBI Taxonomy database with TaxMan.

After the initial unification and clean-up of the taxonomy for Entrez was complete, Mitch Sogin organized a workshop to give us advice on the clean-up and recommendations for the long-term maintenance of the taxonomy. This was held at the NCBI in 1993 and included: Mitch Sogin (protists), David Hillis (chordates), John Taylor (fungi), S.C. Jong (fungi), John Gunderson (protists), Russell Chapman (algae), Gary Olsen (bacteria), Michael Donoghue (plants), Ward Wheeler (invertebrates), Rodney Honeycutt (invertebrates), Jack Holt (bacteria), Eugene Koonin (viruses), Andrzej Elzanowski (PIR taxonomy), Lois Blaine (ATCC), and Scott Federhen (NCBI). Many of these attendees went on to serve as curators for different branches of the classification. In particular, David Hillis, John Taylor, and Gary Olsen put in long hours to help the project move along.

In 1995, as more demands were made on the Taxonomy database, the system was moved to a Sybase relational database (TAXON), originally developed by Tim Clark. Hierarchical organism indexing was added to the Nucleotide and Protein domains of Entrez, and the Taxonomy browser made its first appearance on the Web.

In 1997, the EMBL and DDBJ databases agreed to adopt the NCBI taxonomy as the standard classification for the nucleotide sequence databases. Before that, we would see new organism names from the EMBL and DDBJ only after their entries were released to the public, and any corrections (in spelling, or nomenclature, or classification) would have to be made after the fact. We now receive taxonomy consults on new names from the EMBL and DDBJ before the release of their entries, just as we do from our own GenBank indexers. SWISS-PROT has also recently (2001) agreed to use our Taxonomy database and send us taxonomy consults.

Organismal taxonomy is a powerful organizing principle in the study of biological systems. Inheritance, homology by common descent, and the conservation of sequence and structure in the determination of function are all central ideas in biology that are directly related to the evolutionary history of any group of organisms. Because of this, taxonomy plays an important cross-linking role in many of the NCBI tools and databases.

The NCBI Taxonomy database is a curated set of names and classifications for all of the organisms that are represented in GenBank. When new sequences are submitted to GenBank, the submission is checked for new organism names, which are then classified and added to the taxonomy database. As of April 1, 2003, there were 4,653 families, 26,427 genera, 130,207 species, and 176,890 total taxa represented.

Of the several different ways to build a taxonomy, our group maintains a phylogenetic taxonomy. In a phylogenetic classification scheme, the structure of the taxonomic tree approximates the evolutionary relationships among the organisms included in the classification (the “tree of life”; see Figure 1).

Figure 1. A phylogenetic classification scheme.

If two organisms (A and B) are listed more closely together in the taxonomy than either is to organism C, the assertion is that C diverged from the lineage leading to A+B earlier in evolutionary history, and that A and B share a common ancestor that is not in the direct line of evolutionary descent to species C. For example, the current consensus is that the closest living relatives of the birds are the crocodiles; therefore, our classification does not include the familiar taxon Reptilia (turtles, lizards and snakes, and crocodiles), which excludes the birds and would break the phylogenetic principle outlined above.
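The principle can be sketched in a few lines of Python. This is an illustration only: the toy tree, the placement of turtles and lizards directly under the root, and the names `root`, `clade_X`, and `is_complete_subtree` are assumptions made for the example, not NCBI data structures. The check asks whether a named group is exactly the set of leaves below its most recent common ancestor.

```python
# Toy tree stored as child -> parent. Only the bird/crocodile pairing comes
# from the text; "root" and "clade_X" are hypothetical placeholder nodes.
PARENT = {
    "clade_X": "root",
    "turtles": "root",
    "lizards and snakes": "root",
    "birds": "clade_X",
    "crocodiles": "clade_X",
}


def lineage(node):
    """Return the path from a node up to the root (node first)."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def common_ancestor(nodes):
    """Return the deepest node that lies on the lineage of every given node."""
    nodes = list(nodes)
    shared = set(lineage(nodes[0]))
    for node in nodes[1:]:
        shared &= set(lineage(node))
    # Walking upward from any member, the first shared node is the deepest one.
    for ancestor in lineage(nodes[0]):
        if ancestor in shared:
            return ancestor


def leaves_below(node):
    """Return the set of terminal nodes at or below a node."""
    children = [c for c, p in PARENT.items() if p == node]
    if not children:
        return {node}
    found = set()
    for child in children:
        found |= leaves_below(child)
    return found


def is_complete_subtree(group):
    """True if the group contains every leaf below its common ancestor."""
    return leaves_below(common_ancestor(group)) == set(group)


reptilia = {"turtles", "lizards and snakes", "crocodiles"}
print(is_complete_subtree(reptilia))                 # False: birds are left out
print(is_complete_subtree({"birds", "crocodiles"}))  # True
```

In this toy tree, a group made of turtles, lizards and snakes, and crocodiles fails the check because its common ancestor also has the birds as descendants.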

Our classification represents an assimilation of information from many different sources (see Box 1). Much of the success of the project is attributable to the flood of new molecular data that has revolutionized our understanding of the phylogeny of many groups, especially of previously poorly understood groups such as Bacteria, Archaea, and Fungi. Users should be aware that some parts of the classification are better developed than others and that the primary systematic and phylogenetic literature is the most reliable information source.

We do not rely on sequence data alone to build our classification, and we do not perform phylogenetic analysis ourselves as part of the taxonomy project. Most of the organisms in GenBank are represented by only a snippet of sequence; therefore, sequence information alone is not enough to build a robust phylogeny. The vast majority of species are not there at all, although about 50% of the birds and the mammals are represented. We therefore also rely on analyses from morphological studies; the challenge of modern systematics is to unify molecular and morphological data to elucidate the evolutionary history of life on earth.

Adding to the Taxonomy Database

Currently, more than 100 new species are added to the database daily, and the rate is accelerating as sequence analysis becomes an ever more common component of systematic research and the taxonomic description of new species.

Sources of New Names

The EMBL and DDBJ databases, as well as GenBank, now use the NCBI Taxonomy as the standard classification for nucleotide sequences (see Box 1). Nearly all of the new species in the Taxonomy database arrive via sequences submitted to one of these databases from species that are not yet represented. In these cases, the NCBI taxonomy group is consulted, and any problems with the nomenclature and classifications are resolved before the sequence entries are released to the public. We also receive consults for submissions that are not identified to the species level (e.g., “Hantavirus” or “Bacillus sp.”) and for anything that looks confusing, incorrect, or incomplete to the database indexers. All consults include information on the problem organism names, source features, and publication titles (if any). The email addresses for the submitters are also included in case we need to contact them about the nomenclature, classification, or annotation of their entries.

The number and complexity of organisms in a submission can vary enormously. Many contain a single new name, others may include 100 species, all from the same familiar genus, whereas others may include 100 names (only half identified at the species level) from 100 genera (all of which are new to the Taxonomy database) without any other identifying information at all.

Some new organism names are found by software when the protein sequence databases (SWISS-PROT, PIR, and the PRF) are added to Entrez; because most of the entries in the protein databases have been derived from entries in the nucleotide database, this is a small number. The NCBI structure group may also find new names in the PDB protein structure database. Finally, because we made the Taxonomy database publicly accessible on the Web, we have had a steady stream of comments and corrections to our spellings and classification from outside users.

More on Submission

We often receive consults on submissions with explicitly new species names that will be published as part of the description of a new species. These sequence entries (like any other) may be designated “hold until published” (HUP) and will not be released until the corresponding journal article has been published. These species names will not appear on any of our taxonomy Web sites until the corresponding sequence entries have been released.

Occasionally, the same new genus name is proposed simultaneously for different taxa; in one case, two papers with conflicting new names had been submitted to the same journal, and both had gone through one round of review and revision without detection of the duplication. Although these duplications would have been discovered in time, the increasingly common practice of including some sequence analysis in the description of a new species can lead to earlier detection of these problems. In many cases, the new species name proposed in the submitted manuscript is changed during the editorial review process, and a different name appears in the publication. Submitters are encouraged to inform us when their new descriptions have been published, particularly if the proposed names have been changed.

We strongly encourage the submission of strain names for cultured bacteria, algae, and fungi and for sequences from laboratory animals in biochemical and genetic studies; of cultivar names for sequences from cultivated plants; and of specimen vouchers (something that definitively ties the sequence to its source) for sequences from phylogenetic studies. There are many other kinds of useful information that may be contained within the sequence submission, but these data are the bare minimum necessary to maintain a reliable link between an entry in the sequence database and the biological source material.

Using the Taxonomy Browser

The Taxonomy Browser (TaxBrowser) provides a hierarchical view of the classification from any particular place in the taxonomy. This is probably the display of choice for most casual users (browsers) of the taxonomy who are interested in exploring our classification. The TaxBrowser displays only the subset of taxa from the taxonomy database that is linked to public sequence entries. About 15% of the full Taxonomy database is not displayed on the public Web pages because the names are from sequence entries that have not yet been released.

TaxBrowser is updated continuously. New species will appear on a daily basis as the new names appear in sequence entries indexed during the daily release cycle of the Entrez databases. New taxa in the classification appear in TaxBrowser on an ongoing basis, as sections of the taxonomy already linked to public sequence entries are revised.

The Hierarchical Display

The browser produces two different kinds of Web pages:

hierarchy pages, which present a familiar indented flatfile view of the taxonomic classification, centered on a particular taxon in the database; and

taxon-specific pages, which summarize all of the information that we associate with any particular taxonomic entry in the database.

For example, “hominidae” as a search term from the TaxBrowser homepage finds our human family (Figure 2).

Figure 2. The Taxonomy Browser hierarchical display for the family Hominidae.

(a) There are four genera listed in this family (Gorilla, Homo, Pan, and Pongo) with six species-level names (Gorilla gorilla, Homo sapiens, Pan paniscus, Pan troglodytes, Pongo pygmaeus, and Pongo sp.) and 2 subspecies. Common names are shown in parentheses if they are available in the Taxonomy database. The lineage above Hominidae is shown in the line at the top of the display; selecting the word Lineage will toggle back and forth between the abbreviated lineage (the display used in GenBank flatfiles) and the full lineage (as it appears in the Taxonomy database). Selecting any of the taxa above Hominidae (in the lineage) or below Hominidae (in the hierarchical display) will refocus the browser on that taxon instead of the Hominidae. Selecting Hominidae itself, however, will display the taxon-specific page for the Hominidae. (b) The default setting displays three levels of the classification on the hierarchy pages. To change this, enter a different number in the Display levels box and select Accept. If any of the check boxes to the right of the Display levels box are selected (i.e., Nucleotide, Protein, Structures, ...), the numbers of records in the corresponding Entrez database that are associated with that taxon will appear as hyperlinks. Selecting a link retrieves those records.

The Taxon-specific Display

The taxon-specific browser display page shows all of the information that is associated with a particular taxon in the Taxonomy database and some information collected through links with related databases (Figure 3).

Figure 3. The Taxonomy Browser taxon-specific display.

The name, taxid, rank, genetic codes, and other names (if any) associated with this taxon are all listed. The full lineage is shown; selecting the word Lineage toggles between the abbreviated and full versions. There may also be citation information and comments hyperlinked to the appropriate sources. The numbers of Entrez database records that link to this Taxonomy record are displayed and can be retrieved via the hotlinked entry counts. LinkOut links to external resources appear at the bottom of the page.

There are two sets of links to Entrez records from the Taxonomy Browser. The “subtree links” are accumulated up the tree in a hierarchical fashion; for example, there are 16 million nucleotide records and a half million protein records associated with the Chordata (Figure 3a). These are all linked into the taxonomy at or below the species level and can be retrieved en masse via the subtree hotlink.

“Direct links” will retrieve Entrez records that are linked directly to this particular node in the taxonomy database. Many of the Entrez domains (e.g., sequences and structures) are linked into the taxonomy at or below the species level; it is a data error when a sequence entry is directly linked into the taxonomy at a taxon somewhere above the species level. For other Entrez domains (e.g., literature and phylogenetic sets), this is not the case. A journal article may talk about several different species but may also refer directly to the phylum Chordata. We have searched the full text of the articles in the PubMed Central archive with the scientific names from the taxonomy database. Twenty-seven articles in PubMed Central refer directly to the phylum Chordata; 9,299 articles are linked into the taxonomy somewhere in the Chordata subtree. The PopSet domain contains population studies, phylogenetic sets, and alignments. We have recently changed the way that we index phylogenetic sets in Entrez. The five “Direct links” at the Chordata will retrieve the phylogenetic sets that explicitly span the Chordata; the “Subtree links” will also include phylogenetic sets that are completely contained within the Chordata.
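The difference between the two kinds of links can be illustrated with a small Python sketch. The tree, the record counts, and the function name `subtree_counts` are hypothetical, and the tree is heavily abbreviated; the point is only that a subtree count is the sum of the direct counts at a node and at every node beneath it.

```python
from collections import defaultdict

# Hypothetical, abbreviated child -> parent map.
PARENT = {
    "Chordata": "root",
    "Mammalia": "Chordata",
    "Homo sapiens": "Mammalia",
    "Mus musculus": "Mammalia",
}

# Hypothetical numbers of records linked directly to each node. For sequence
# domains, the direct counts above the species level are expected to be zero.
DIRECT = {"Chordata": 0, "Mammalia": 0, "Homo sapiens": 5, "Mus musculus": 3}


def subtree_counts(parent, direct):
    """Add each node's direct count to the node itself and to every ancestor."""
    totals = defaultdict(int)
    for node, count in direct.items():
        current = node
        while current is not None:
            totals[current] += count
            current = parent.get(current)
    return dict(totals)


totals = subtree_counts(PARENT, DIRECT)
print(DIRECT["Chordata"], totals["Chordata"])  # 0 8
```

For a literature-like domain, the same function applies with nonzero direct counts at internal nodes.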

The taxon-specific browser pages now also show the NCBI LinkOut links to external resources. These include links to a broad range of different kinds of resources and are provided for the convenience of our users; the NCBI does not vouch for the content of these resources, although we do make an effort to ensure that they are of good scientific quality. A complete list of external resources can be found here. Groups interested in participating in the LinkOut program should visit the LinkOut homepage.

Search Options

Box 2. TaxBrowser results from using the wild card search, C* elegans.

Cunninghamella elegans

Caenorhabditis elegans

Codonanthe elegans

Cyclamen coum subsp. elegans

Cestrum elegans

Chaerophyllum elegans

Chalara elegans

Chrysemys scripta elegans

Ceuthophilus elegans

Carpolepis elegans

Cylindrocladiella elegans

Coluria elegans

Cymbidium elegans

Coronilla elegans

Gymnothamnion elegans (synonym: Callithamnion elegans)

Centruroides elegans

There are several different ways to search for names in the Taxonomy database. If the search results in a terminal node in our taxonomy, the taxon-specific browser page is displayed; if the search returns with an internal (non-terminal) node, the hierarchical classification page is displayed.

Complete Name. By default, TaxBrowser looks for the complete name when a term is typed into the search box. It looks for a case-insensitive, full-length string match to all of the nametypes stored in the Taxonomy database. For example, Homo sapiens, Escherichia, Tetrapoda, and Embryophyta would all retrieve results.

Names can be duplicated in the Taxonomy database, but the taxonomy browser can only be focused on a single taxon at any one time. If a complete name search retrieves more than one entry from the taxonomy, an intermediate name selection screen appears (Figure 4). Each duplicated name includes a manually curated suffix that differentiates between the duplicated names.

Figure 4. Examples of search results for duplicated names.

Searches for Bacillus (a), Proboscidea (b), Anopheles (c), and reptiles (d) result in the options shown above. If the duplicated name is not the primary (scientific) name of the node [as in (d)], the primary name is given first, followed by the nametype and the duplicated name in square brackets.

Wild Card. This is a regular expression search in which * acts as a wild card. It is useful when the correct spelling of a scientific name is uncertain or to find ambiguous combinations for abbreviated species names. For example, C* elegans results in a list of 16 species and subspecies (Box 2). Note: there is still only one H. sapiens.

Token Set. This treats the search string as an unordered set of tokens, each of which must be found in one of the names associated with a particular node. For example, “sapiens” retrieves Homo sapiens and Homo sapiens neanderthalensis.

Phonetic Name. This search qualifier can be used when the user has exhausted all other search options to find the organism of interest. The results using this function can be patchy, however. For example, “drozofila” and “kaynohrhabdietees” retrieve respectable results; however, “seenohrabdietees” and “eshereesheeya” are not found.

Taxonomy ID. This allows searching by the numerical unique identifier (taxid) of the NCBI Taxonomy database, e.g., 9606 or 666.
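The first three search modes can be approximated in a short Python sketch. This is not the TaxBrowser implementation: the taxids in `NAMES` are made up, the function names are hypothetical, shell-style matching from `fnmatch` stands in for the browser's wild card engine, and the token set rule follows one reading of the description (every token must occur somewhere among the names of a node).

```python
from fnmatch import fnmatchcase

# Hypothetical taxid -> names table; the taxids are invented for this sketch.
NAMES = {
    101: ["Caenorhabditis elegans"],
    102: ["Gymnothamnion elegans", "Callithamnion elegans"],
    103: ["Homo sapiens", "human"],
    104: ["Homo sapiens neanderthalensis"],
}


def complete_name(query):
    """Case-insensitive, full-length match against every name of every node."""
    q = query.lower()
    return sorted(t for t, names in NAMES.items()
                  if any(n.lower() == q for n in names))


def wild_card(pattern):
    """Case-insensitive match in which * stands for any run of characters."""
    p = pattern.lower()
    return sorted(t for t, names in NAMES.items()
                  if any(fnmatchcase(n.lower(), p) for n in names))


def token_set(query):
    """Every query token must appear among the tokens of a node's names."""
    wanted = set(query.lower().split())
    hits = []
    for taxid, names in NAMES.items():
        tokens = {tok for n in names for tok in n.lower().split()}
        if wanted <= tokens:
            hits.append(taxid)
    return sorted(hits)


print(complete_name("HOMO SAPIENS"))  # [103]
print(wild_card("C* elegans"))        # [101, 102]
print(token_set("sapiens"))           # [103, 104]
```

Node 102 is found by the wild card search through its synonym, which mirrors the Gymnothamnion elegans entry in Box 2.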

How to Link to the TaxBrowser

There is a help page that describes how to make hyperlinks to the Taxonomy Browser pages.

The Taxonomy Database: TAXON

The NCBI Taxonomy database is stored as a Sybase relational database, called TAXON. The NCBI taxonomy group maintains the database with a customized software tool, the Taxonomy Editor. Each entry in the database is a “taxon”, also referred to as a “node” in the database. The “root node” (taxid 1) is at the top of the hierarchy. The path from the root node to any other particular taxon in the database is called its “lineage”; the collection of all of the nodes beneath any particular taxon is called its “subtree”. Each node in the database may be associated with several names, of several different nametypes. For indexing and retrieval purposes, the nametypes are essentially equivalent.
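The terms lineage and subtree can be made concrete with a minimal Python sketch over a parent table. Apart from the root taxid of 1, the taxids below are hypothetical, and the two helper functions are illustrative rather than part of TAXON.

```python
# Hypothetical child taxid -> parent taxid table; only the root (1) is real.
PARENT = {2: 1, 3: 2, 4: 2, 5: 3}


def lineage(taxid):
    """Return the path from the root node down to the given taxid."""
    path = [taxid]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path[::-1]


def subtree(taxid):
    """Return every node beneath the given taxid (the taxid itself excluded)."""
    below = []
    stack = [taxid]
    while stack:
        node = stack.pop()
        for child, parent in PARENT.items():
            if parent == node:
                below.append(child)
                stack.append(child)
    return sorted(below)


print(lineage(5))  # [1, 2, 3, 5]
print(subtree(2))  # [3, 4, 5]
```

A production system would index children by parent instead of scanning the whole table on every step; the scan keeps the sketch short.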

The Taxonomy database is populated with species names that have appeared in a sequence record from one of the nucleotide or protein databases. If a name has ever appeared in a sequence record at any time (even if it is not found in the current version of the record), we try to keep it in the Taxonomy database for tracking purposes (as a synonym, a misspelling, or other nametype), unless there are good reasons for removing it completely (for example, if it might cause a future submission to map to the wrong place in the taxonomy).

Taxids

Table 1. Files on the taxonomy FTP site.

File | Uncompresses to | Description
taxdump.tar.Z (a) | readme.txt | A terse description of the dmp files
 | nodes.dmp | Structure of the database; lists each taxid with its parent taxid, rank, and other values associated with each node (genetic codes, etc.)
 | names.dmp | Lists all the names associated with each taxid
 | delnodes.dmp | Deleted taxid list
 | merged.dmp | Merged nodes file
 | division.dmp | GenBank division files
 | gencode.dmp | Genetic codes files
 | gc.prt | Print version of genetic codes
gi_taxid_nucl.dmp.gz | gi_taxid_nucl.dmp | A list of gi_taxid pairs for every live gi-identified sequence in the nucleotide sequence database
gi_taxid_prot.dmp.gz | gi_taxid_prot.dmp | A list of gi_taxid pairs for every live gi-identified sequence in the protein sequence database
gi_taxid_nucl_diff.dmp | gi_taxid_nucl_diff | List of differences between latest gi_taxid_nucl and previous listing
gi_taxid_prot_diff.dmp | gi_taxid_prot_diff | List of differences between latest gi_taxid_prot and previous listing

(a) For non-UNIX users, the file taxdmp.zip includes the same (zip compressed) data.
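As a rough guide to working with the dump files, the Python sketch below loads nodes.dmp-style and names.dmp-style rows into dictionaries. It rests on assumptions that should be checked against the readme.txt in the archive: that fields are separated by a tab, a vertical bar, and a tab; that each row ends with a tab and a vertical bar; and that names.dmp rows carry the taxid, the name, a unique-name field, and the nametype, in that order. The sample rows and the name "Examplea prima" are invented.

```python
from collections import defaultdict

# Invented sample rows; real files have more columns and many more rows.
NODES_SAMPLE = "10\t|\t1\t|\tgenus\t|\n11\t|\t10\t|\tspecies\t|\n"
NAMES_SAMPLE = (
    "11\t|\tExamplea prima\t|\t\t|\tscientific name\t|\n"
    "11\t|\tfirst example\t|\t\t|\tcommon name\t|\n"
)


def parse_dmp(text):
    """Split dump text into rows of fields (delimiter layout is assumed)."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.endswith("\t|"):
            line = line[:-2]  # drop the assumed end-of-row marker
        rows.append(line.split("\t|\t"))
    return rows


# taxid -> parent taxid, and taxid -> rank, from the first three columns.
parent, rank = {}, {}
for taxid, parent_taxid, node_rank, *rest in parse_dmp(NODES_SAMPLE):
    parent[int(taxid)] = int(parent_taxid)
    rank[int(taxid)] = node_rank

# taxid -> list of (nametype, name) pairs.
names = defaultdict(list)
for taxid, name, unique_name, nametype, *rest in parse_dmp(NAMES_SAMPLE):
    names[int(taxid)].append((nametype, name))

print(parent)     # {10: 1, 11: 10}
print(names[11])
```

To read the real files, the sample strings would be replaced by the contents of nodes.dmp and names.dmp after unpacking the archive.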

Each taxon in the database has a unique identifier, its taxid. Taxids are assigned sequentially. When a taxon is deleted, its taxid disappears and is not reassigned (Table 1; see the FTP for a list of deleted taxids). When one taxon is merged with another taxon (e.g., if the names were determined to be synonyms or one was a misspelling), the taxid of the node that has disappeared is listed as a “secondary taxid” to the taxid of the node that remains (see the merged taxid file on the FTP site). In either case, the taxid that has disappeared will never be assigned to a new entry in the database.
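The bookkeeping for retired taxids can be sketched as a simple lookup in Python. The taxids and the names `LIVE`, `MERGED`, `DELETED`, and `resolve` are hypothetical; in practice the merged and deleted lists would come from the corresponding files on the FTP site.

```python
# Hypothetical data for the sketch.
LIVE = {42, 43}        # taxids currently present in the database
MERGED = {500: 42}     # secondary taxid -> taxid of the node that remains
DELETED = {777}        # taxids that were removed and never reassigned


def resolve(taxid):
    """Return (status, current_taxid) for a possibly outdated taxid."""
    if taxid in LIVE:
        return ("current", taxid)
    if taxid in MERGED:
        return ("merged", MERGED[taxid])
    if taxid in DELETED:
        return ("deleted", None)
    return ("unknown", None)


print(resolve(42))   # ('current', 42)
print(resolve(500))  # ('merged', 42)
print(resolve(777))  # ('deleted', None)
```

Because a retired taxid is never reassigned, a lookup of this kind stays unambiguous for identifiers stored in older records.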

Nomenclature Issues

TAXON Nametypes

There are many possible types of names that can be associated with an organism taxid in TAXON. To track and display the names correctly, the various names associated with a taxid are tagged with a nametype, for example “scientific name”, “synonym”, or “common name”. Each taxid must have one (and only one) scientific name but may have zero or many other names (for example, several synonyms, several common names, along with only one “GenBank common name”).
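The cardinality rules for nametypes can be expressed as a small consistency check in Python. The function `check_names` and the sample records are illustrative assumptions; the nametype strings are the ones quoted in the text.

```python
from collections import Counter


def check_names(names):
    """Check one taxid's (nametype, name) pairs; return a list of problems."""
    counts = Counter(nametype for nametype, _ in names)
    problems = []
    if counts["scientific name"] != 1:
        problems.append("expected exactly one scientific name, found %d"
                        % counts["scientific name"])
    if counts["GenBank common name"] > 1:
        problems.append("expected at most one GenBank common name, found %d"
                        % counts["GenBank common name"])
    return problems


# Illustrative records; any number of synonyms and common names is allowed.
good = [("scientific name", "Homo sapiens"),
        ("GenBank common name", "human"),
        ("common name", "man")]
bad = [("synonym", "Examplea prima"), ("synonym", "Examplea secunda")]

print(check_names(good))  # []
print(check_names(bad))   # ['expected exactly one scientific name, found 0']
```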

When sequences are submitted to GenBank, usually only a scientific name is included; most other names are added by NCBI taxonomists at the time of submission or later, when further information is discovered. For a complete description of each nametype used in TAXON, see Appendix 1.

Classes of TAXON Scientific Names

Scientific names, the only required nametype for a taxid, can be further qualified into different classes. Not all “scientific names” that accompany sequence submissions are true Linnaean Latin binomial names; if the taxon is not identified to the species level, it is not possible to assign a binomial name to it. For indexing and retrieval purposes, TAXON needs to know whether the scientific name is a Latin binomial name, or otherwise. A full listing of the classes of TAXON scientific names can be viewed in Appendix 2.

Duplicated Names

The treatment of duplicated names was discussed briefly in the section on the Taxonomy browser. For our purposes, there are four main classes of duplicated scientific names: (1) real duplicate names, (2) structural duplicates, (3) polyphyletic genera, and (4) other duplicate names.

Real Duplicate Names

There are several main codes of nomenclature for living organisms: the Zoological Code (International Code of Zoological Nomenclature, ICZN; for animals), the Botanical Code (International Code of Botanical Nomenclature, ICBN; for plants), the Bacteriological Code (International Code of Nomenclature of Bacteria, ICNB; for prokaryotes), and the Viral Code (International Code of Virus Classification and Nomenclature, ICVCN; for viruses). Within each code, names are required to be unique. When duplicate names are discovered within a code, one of them is changed (generally, the newer duplicate name). However, the codes are complex, and not all names are subject to these restrictions. For example, Polyphaga is both a genus of cockroaches and a suborder of beetles, and the damselfly genus Lestoidea is listed within the superfamily Lestoidea.

There is no real effort to make the scientific names of taxa unique among Codes, and among the relatively small set of names represented in the NCBI taxonomy database (20,000 genera), there are approximately 200 duplicate names (or about 1%), mostly at the genus level.

Early in 2002, the first duplicate species name was recorded in the Taxonomy database. Agathis montana is both a wasp and a conifer. In this case, we have used the full species names (with authorities) to provide unambiguous scientific names for the sequence entries. (The conifer is listed as Agathis montana de Laub; the wasp, Agathis montana Shest).

Structural Duplicates

In the Zoological and Bacteriological Codes, the subgenus that includes the type species is required to have the same name as the genus. This is a systematic source of duplicate names. For these duplicates, we use the associated rank in the unique name, e.g., Drosophila <genus> and Drosophila <subgenus>. Duplicated genera/subgenera also occur in the Botanical Code, e.g., Pinus <genus> and Pinus <subgenus>.
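For structural duplicates, the rank-based suffix is mechanical enough to sketch in Python. The function `unique_names` is hypothetical and covers only this one case; suffixes for other kinds of duplicated names are curated by hand. The Drosophila and Pinus pairs are taken from the text, and Homo is added as a name that needs no suffix.

```python
from collections import Counter

# (scientific name, rank) pairs.
TAXA = [
    ("Drosophila", "genus"),
    ("Drosophila", "subgenus"),
    ("Pinus", "genus"),
    ("Pinus", "subgenus"),
    ("Homo", "genus"),
]


def unique_names(taxa):
    """Append the rank in angle brackets to any name used more than once."""
    seen = Counter(name for name, _ in taxa)
    return [f"{name} <{rank}>" if seen[name] > 1 else name
            for name, rank in taxa]


for label in unique_names(TAXA):
    print(label)
```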

Polyphyletic Genera

Certain genera, especially among the asexual forms of Ascomycota and Basidiomycota, are polyphyletic, i.e., they do not share a common ancestor. Pending taxonomic revisions that will transfer species assigned to “form” genera such as Cryptococcus to more natural genera, we have chosen to duplicate such polyphyletic genera in different branches of the Taxonomy database. This will maintain a phylogenetic classification and ensure that all species assigned to a polyphyletic genus can be retrieved when searching on the genus. Therefore, for example, the basidiomycete genus Sporobolomyces is represented in three different branches of the Basidiomycota: Sporobolomyces <Sporidiobolaceae>, Sporobolomyces <Agaricostilbomycetidae>, and Sporobolomyces <Erythrobasidium clade>.

Other Duplicate Names

We list many duplicate names in other nametypes (apart from our preferred “scientific name” for each taxon). Most of these are included for retrieval purposes; they are common names or the names of familiar paraphyletic taxa that we have not included in our classification, e.g., Osteichthyes, Coelenterata, and reptiles.

Other TAXON Data Types

Aside from names, there are several optional types of information that may be associated with a taxid. These are (1) rank, such as species, genus or family; (2) genetic code, for translating proteins; (3) GenBank division; (4) literature citations; and (5) abbreviated lineage, for display in GenBank flat files. For more details on these data types, see Appendix 3.

Taxonomy in Entrez: A Quick Tour

The TAXON database is a node within the Entrez integrated retrieval system (Chapter 15) that provides an important organizing principle for other Entrez databases. Taxonomy Entrez provides an alternative view of TAXON to that of TaxBrowser. Entrez adds some very powerful capabilities (for example, Boolean queries, search history, and both internal and external links) to TAXON, but in many ways it is an unnatural way to represent the hierarchical data in Taxonomy. (TaxBrowser is the way to view the taxonomy hierarchically.)

Taxonomy was the first Entrez database to have an internal hierarchical structure. Because Entrez deals with unordered sets of objects in a given domain, an alternative way to represent these hierarchical relationships in Entrez was required (see the section Hierarchy Fields, below).

The main focus of the Entrez Taxonomy homepage is the search bar, but also worth noting are the Help and TaxBrowser hotlinks, which lead to the generic Entrez help documentation and the Taxonomy browser, respectively.

The default Entrez search is case insensitive and can be for any of the names that can be found in the Taxonomy database. Thus, any of the following search terms, Homo sapiens, homo sapiens, human, or Man, will retrieve the node for Homo sapiens.

As for other Entrez databases, Taxonomy supports Boolean searching, a History function, and searches limited by field. The Taxonomy fields can be browsed under Preview/Index; some are specific to Taxonomy (such as Lineage or Rank), and others are found in all Entrez databases (such as Entrez Date).

Each search result, listed in document summary (DocSum) format, may have several links associated with it. For example, for the search result Homo sapiens, the Nucleotide link will retrieve all the human sequences from the nucleotide databases, and the Genome link will retrieve the human genome from the Genomes database.

Search Tips and Tricks

A helpful list follows:

A search for Hominidae retrieves a single, hyperlinked entry. Selecting the link shows the structure of the taxon. On the other hand, a search for Hominidae[subtree] will retrieve a nonhierarchical list of all of the taxa listed within the Hominidae.

A search for species[rank] yields a list of all species in the Taxonomy database (108,020 in May 2002).

Find the Taxonomy update frequency by selecting Entrez Date from the pull-down menu under Preview/Index, typing “2002/02” in the box, and selecting Index. The resulting index listing, 2002/01 (5176); 2002/01/03 (478); 2002/01/08 (2); 2002/01/10 (2260); 2002/01/14 (7); 2002/01/16 (239), shows that in January 2002, 5,176 new taxa were added, the bulk of which appeared in Entrez for the first time on January 10, 2002. These taxa can be retrieved by selecting 2002/01/10, then selecting the AND button above the window, followed by Go.

An overview of the distribution of taxa in the DocSum list can be seen if Summary is changed to Common Tree, followed by selecting Display.

To filter out less interesting names from a DocSum list, add some terms to the query, e.g., 2002/01/10[date] NOT uncultured[prop] NOT unspecified[prop].
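The fielded queries in these tips can also be assembled programmatically. The Python sketch below only builds query strings and a request URL; it makes no network call. The helper names are hypothetical, and the use of the E-utilities `esearch.fcgi` endpoint with `db` and `term` parameters is an assumption brought in for illustration, since this chapter describes only the Web interface.

```python
from urllib.parse import urlencode

# Assumed endpoint of the NCBI E-utilities search service (not covered here).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def field(term, name):
    """Qualify a term with an Entrez field, e.g. Hominidae[subtree]."""
    return f"{term}[{name}]"


def exclude(base, *unwanted):
    """Append a NOT clause to the base query for each unwanted term."""
    return " ".join([base] + [f"NOT {u}" for u in unwanted])


def esearch_url(term, db="taxonomy"):
    """Build (but do not send) a search request URL for the given query."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term})


query = exclude(field("2002/01/10", "date"),
                field("uncultured", "prop"),
                field("unspecified", "prop"))
url = esearch_url(field("Hominidae", "subtree"))

print(query)  # 2002/01/10[date] NOT uncultured[prop] NOT unspecified[prop]
print(url)
```

Building the query separately from the URL keeps the Boolean logic readable and leaves the percent-encoding of the square brackets to the standard library.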

Displays in Taxonomy Entrez

There are a variety of choices regarding how search results can be displayed in Taxonomy Entrez.

Summary. This is the default display view. There are as many as four pieces of information in this display, if they are all present in the Taxonomy database: (1) scientific name of the taxon; (2) common name, if one is available; (3) taxonomic rank, if one is assigned; and (4) BLAST name, inherited from the taxonomy, e.g., Homo sapiens (human), species, mammals.

Brief. Shows only the scientific names of the taxa. This view can be used to download lists of species names from Entrez.

Tax ID List. Shows only the taxids of the taxa. This view can be used to download taxid lists from Entrez.

Info. Shows a summary of most of the information associated with each taxon in the Taxonomy database (similar to the TaxBrowser taxon-specific display; Figure 3). This can be downloaded as a text file; an XML representation of these data is under development.

Common Tree. A special display that shows a skeleton view of the relationships among the selected set of taxa and is described in the section below.

LinkOut. Displays a list of the linkout links (if any) for each of the selected sets of taxa (see Chapter 17).

Entrez Links. The remaining views follow Entrez links from the selected set of taxa to the other Entrez databases (Nucleotide, Protein, Genome, etc.). The Display view allows all links for a whole set of taxa to be viewed at once.

The Common Tree Viewer

The Common Tree view shows an abbreviated view of the taxonomic hierarchy and is designed to highlight the relationships between a selected set of organisms. Figure 5 shows the Common Tree view for a familiar set of model organisms.

Figure 5. The Common Tree view for some model organisms.

The ten species shown in bold are the ones that were selected as input to the Common Tree display. The other taxa displayed show the taxonomic relationships between the selected taxa. For example, Eutheria is included because it is the smallest taxonomic group in our classification that includes both Homo sapiens and Mus musculus. A “+” box in the tree indicates that part of the taxonomic classification has been suppressed in this abbreviated view; selecting the “+” will fill in the missing lineage (and change the “+” to a “-”). The Expand All and Collapse All buttons at the top of the display will do this globally. The Search for box at the top of the display can be used to add taxa to the Common Tree display; taxa can be removed using the list at the bottom of the page.
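The idea of keeping only the selected taxa and the smallest groups that join them can be sketched in a few lines. This is an illustration under stated assumptions, not the viewer's actual implementation: the lineages below are abbreviated toy lineages, and the function names are hypothetical.

```python
# Toy, heavily abbreviated lineages (root first); not the full NCBI classification.
LINEAGES = {
    "Homo sapiens": ["root", "Eukaryota", "Metazoa", "Chordata", "Mammalia",
                     "Eutheria", "Primates", "Homo sapiens"],
    "Mus musculus": ["root", "Eukaryota", "Metazoa", "Chordata", "Mammalia",
                     "Eutheria", "Rodentia", "Mus musculus"],
    "Drosophila melanogaster": ["root", "Eukaryota", "Metazoa", "Arthropoda",
                                "Drosophila melanogaster"],
}


def smallest_common_group(path_a, path_b):
    """Return the deepest node shared by two root-first lineages."""
    last = None
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        last = a
    return last


def common_tree(lineages):
    """Return the selected taxa plus every node where their lineages branch."""
    children = {}
    for path in lineages.values():
        for parent, child in zip(path, path[1:]):
            children.setdefault(parent, set()).add(child)
    keep = set(lineages)  # the selected taxa themselves
    for node, kids in children.items():
        if len(kids) > 1:  # a branch point between selected taxa
            keep.add(node)
    return keep


print(smallest_common_group(LINEAGES["Homo sapiens"], LINEAGES["Mus musculus"]))
print(sorted(common_tree(LINEAGES)))
```

In this toy data, Eutheria is kept because the human and mouse lineages diverge there, and Metazoa is kept because the fly lineage diverges from the other two at that node. The intermediate nodes that are dropped correspond to the lineage segments hidden behind a “+” box.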

If there are more than a few dozen taxa selected for the Common Tree view, the display becomes visually complex and generally less useful. When a large list of taxa is sent to the Common Tree display, a summary screen is displayed first. For example, we currently list 727 families in the Viridiplantae (plants and green algae) (Figure 6).

Figure 6. The Common Tree summary page for the plants and green algae.

The taxa are aggregated at the predetermined set of nodes in the Taxonomy database that have been assigned “BLAST names”. This serves as an informal, very abbreviated, vernacular classification that gives a convenient overview. The BLAST names will often not provide complete coverage for all species at all levels in the tree. Here, not all of our flowering plants are flagged as eudicots or monocots. The Common Tree summary display recognizes cases such as these and lists the remaining taxa as other flowering plants. The full Common Tree for some or all of the taxa can be seen by selecting the check box next to monocots on the summary page and then Choose. This will display the full Common Tree view for 108 families of monocots.

There are several formatting options for saving the common tree display to a text file: text tree, phylip tree, and taxid list.
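PHYLIP programs read trees in the parenthesized Newick notation. The following sketch shows how a small tree could be written in that notation. The exact layout of the file that the Common Tree viewer saves is not specified here, so the label handling (spaces replaced by underscores, internal nodes labeled) is an assumption, and `to_newick` is a hypothetical helper.

```python
def to_newick(name, children):
    """Serialize the subtree rooted at `name` as a Newick string (without ';').

    `children` maps a node name to the list of its child names.
    """
    label = name.replace(" ", "_")  # assumed convention for names with spaces
    kids = children.get(name, [])
    if not kids:
        return label
    return "(" + ",".join(to_newick(k, children) for k in kids) + ")" + label


# Toy tree: only the branch point and two selected species.
TREE = {"Eutheria": ["Homo sapiens", "Mus musculus"]}
print(to_newick("Eutheria", TREE) + ";")
```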

Hyperlinks to a common tree display can be made in two ways:

by specifying the common tree view in an Entrez query URL (for example, this link, which displays the common tree view of all of the taxonomy nodes with LinkOut links to the Butterfly Net International Web site); or

by providing a list of taxids directly to the common tree cgi function (for example, this link, which will display a live version of Figure 5).

Using Batch Taxonomy Entrez

The Batch Entrez page allows you to upload a file of taxids or taxon names into Taxonomy Entrez.

Indexing Taxonomy in Entrez

As for any Entrez database, the contents are indexed by creating term lists for each field of each database record (or taxid). For TAXON, the types of fields include name fields, hierarchy fields, inherited fields, and generic Entrez fields.

Name Fields

There are five different index fields for names in Taxonomy Entrez.

All names, [name] in an Entrez search – this is the default search field in Taxonomy Entrez. This is different from most Entrez databases, where the default search field is the composite [All Fields].

Scientific name, [sname] – using [sname] as a qualifier in a search restricts it to the nametype “scientific name”, the single preferred name for each taxon.

Common name, [cname] – restricts the search to common names.

Synonym, [synonym] – restricts the search to the “synonym” nametype.

Taxid, [uid] – restricts the search to taxonomy IDs, the unique numerical identifiers for taxa in the database. Taxids are not indexed in the other Entrez name fields.

Hierarchy Fields

Box 3. Examples of combining the subtree and lineage field limits with Boolean operators for searching Taxonomy Entrez.

Mammalia[subtree] AND Mammalia[lineage] returns the taxon Mammalia.

Mammalia[subtree] OR Mammalia[lineage] returns all of the taxa in a direct parent–child relationship with the taxon Mammalia.

root[subtree] NOT (Mammalia[subtree] OR Mammalia[lineage]) returns all of the taxa not in a direct parent–child relationship with the taxon Mammalia.

Sauropsida[subtree] NOT Aves[subtree] will retrieve the members of the classical taxon Reptilia, excluding the birds.

The [lineage] and [subtree] index fields are a way to superimpose the hierarchical relationships represented in the taxonomy on top of the Entrez data model. For an example of how to use these field limits for searching Taxonomy Entrez, see Box 3.

Lineage. For each node, the [lineage] index field retrieves all of the taxa listed at or above that node in the taxonomy. For example, the query Mammalia[lineage] retrieves 18 taxa from Entrez.

Subtree. For each node, the [subtree] index field retrieves all of the taxa listed at or below that node in the taxonomy. For example, the query Mammalia[subtree] retrieves 4,021 taxa from Entrez (as of March 9, 2002).

Next level. Returns all of the direct children of a given taxon.

Rank. Returns all of the taxa of a given Linnaean rank. The query Aves[subtree] AND species[rank] retrieves all of the species of birds with public sequence entries (there are 2,459, approximately half of the currently described species of extant birds).
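The behavior of the hierarchy fields under Boolean operators follows directly from treating each query as a set of taxa. Here is a minimal sketch on a toy tree, in which intermediate ranks are omitted and all function names are hypothetical. Entrez AND, OR, and NOT correspond to set intersection, union, and difference.

```python
# Toy parent pointers; intermediate ranks are omitted for brevity.
PARENT = {
    "Eukaryota": "root",
    "Mammalia": "Eukaryota",
    "Primates": "Mammalia",
    "Homo sapiens": "Primates",
    "Aves": "Eukaryota",
}
ALL_TAXA = set(PARENT) | {"root"}


def lineage(taxon):
    """Taxa at or above `taxon` (the [lineage] field)."""
    result = {taxon}
    while taxon in PARENT:
        taxon = PARENT[taxon]
        result.add(taxon)
    return result


def subtree(taxon):
    """Taxa at or below `taxon` (the [subtree] field)."""
    return {t for t in ALL_TAXA if taxon in lineage(t)}


def next_level(taxon):
    """Direct children of `taxon`."""
    return {t for t, p in PARENT.items() if p == taxon}


print(subtree("Mammalia") & lineage("Mammalia"))   # AND: the taxon itself
print(subtree("Mammalia") | lineage("Mammalia"))   # OR: its ancestors and descendants
print(subtree("root") - (subtree("Mammalia") | lineage("Mammalia")))  # NOT: everything else
```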

Inherited Fields

The genetic code [gc], mitochondrial genetic code [mgc], and GenBank division [division] fields are all inherited within the taxonomy. The information in these fields refers to the genetic code used by a taxon or in which GenBank division it resides. Because whole families or branches may use the same code or reside in the same GenBank division, this property is usually indexed with a taxon high in the taxonomic tree, and the information is inherited by all those taxa below it. If there is no [gc] field associated with a taxon in the database, it is assumed that the standard genetic code is used. A genetic code may be referred to by either name or translation table number. For example, the two equivalent queries, standard[gc] and translation table 1[gc], each retrieves the set of organisms that use the standard genetic code for translating genomic sequences. Likewise, these two queries echinoderm mitochondrial[mgc] and translation table 9[mgc] will each retrieve the set of organisms that use the echinoderm mitochondrial genetic code for translating their mitochondrial sequences.
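Inheritance of this kind can be resolved by walking from a taxon toward the root until an explicit assignment is found. The sketch below assumes a toy tree and a hypothetical assignment point (translation table 9 set at Echinodermata); the actual nodes at which values are set in the database may differ. The default of translation table 1 for [gc] follows the rule stated above.

```python
# Toy parent pointers; the real lineages contain many more nodes.
PARENT = {
    "Metazoa": "root",
    "Echinodermata": "Metazoa",
    "Strongylocentrotus purpuratus": "Echinodermata",
    "Chordata": "Metazoa",
}

# Hypothetical assignment points: values are set once, high in the tree.
MITO_CODE = {"Echinodermata": 9}   # translation table 9, echinoderm mitochondrial
GENOMIC_CODE = {}                  # nothing set explicitly in this toy example


def inherited(taxon, assigned, default=None):
    """Return the value set at `taxon` or at its nearest ancestor, else `default`."""
    while taxon is not None:
        if taxon in assigned:
            return assigned[taxon]
        taxon = PARENT.get(taxon)
    return default


print(inherited("Strongylocentrotus purpuratus", MITO_CODE))
print(inherited("Strongylocentrotus purpuratus", GENOMIC_CODE, default=1))
```

Storing the value once at a high node and resolving it on lookup avoids repeating the same assignment on every species beneath that node.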

Generic Entrez Fields

Box 4. The properties [prop] field of Taxonomy in Entrez.

There are several useful terms and phrases indexed in the [prop] field. Possible search strategies that specify the prop field are explained below.

Using functional nametypes and classifications

unspecified [prop] not identified at the species level

uncultured [prop] environmental sample sequences

unclassified [prop] listed in an “unclassified” bin

incertae sedis [prop] listed in an “incertae sedis” bin

We do not explicitly flag names as “unspecified” in TAXON; rather, we rely on heuristics to index names as “unspecified” in the properties field. Many are missed. Taxa are indexed as “uncultured” if they are listed within an environmental samples bin or if their scientific names begin with the word “uncultured”.

Using rank level of taxon

All of the search strategies below are valid. Taxonomy Entrez displays only taxa that are linked to public sequence entries, and because sequence entries are supposed to correspond to the Taxonomy database at or below the species level, the Entrez query: terminal [prop] NOT “at or below species level” [prop] should only retrieve problem cases.

above genus level [prop]

above species level [prop]

“at or below species level” [prop] (needs explicit quotes)

below species level [prop]

terminal [prop]

non terminal [prop]

Inherited value assignment points

genetic code [prop]

mitochondrial genetic code [prop]

standard [prop]

invertebrate mitochondrial [prop]

translation table 5 [prop]

The query “genetic code [prop]” retrieves all of the taxa at which one of the genomic genetic codes is explicitly set. The second query retrieves all of the taxa at which one of the mitochondrial genetic codes is explicitly set, and so on.

division [prop]

INV [prop]

invertebrates [prop]

The above terms index the assignments of the GenBank division codes, which are divided along crude taxonomic categories (see Chapter 1). We have placed the division flags in the database so as to preserve the original assignment of species to GenBank divisions.

The remaining index fields are common to most or all Entrez domains, although some have special features in the taxonomy domain. For example, the field text word, [word], indexes words from the Taxonomy Entrez name indexes. Most punctuation is ignored, and the index is searched one word at a time; therefore, the search “homo sapiens[word]” will retrieve nothing.
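The reason a two-word phrase finds nothing in a one-word-at-a-time index can be seen in a small model. This is a sketch under assumptions: the tokenization rule (lowercase, split on anything that is not a letter or digit) is a guess at "most punctuation is ignored", and the numeric identifiers are toy values rather than real taxids.

```python
import re
from collections import defaultdict

# Toy records; the identifiers are illustrative, not real taxids.
NAMES = {
    101: ["Homo sapiens", "human"],
    102: ["Mus musculus", "house mouse"],
}


def build_word_index(names):
    """Map each single word to the set of record ids whose names contain it."""
    index = defaultdict(set)
    for record_id, name_list in names.items():
        for name in name_list:
            for word in re.findall(r"[a-z0-9]+", name.lower()):
                index[word].add(record_id)
    return index


index = build_word_index(NAMES)
print(index.get("homo sapiens", set()))       # no single term contains a space
print(index["homo"] & index["sapiens"])       # searching one word at a time succeeds
```

In this model, the equivalent of a successful search is homo[word] AND sapiens[word].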

Several useful terms are indexed in the properties field, [prop], including functional nametypes and classifications, the rank level of a taxon, and inherited values. See Box 4 for a detailed discussion of searches using the [prop] field.

More information on using the generic Entrez fields can be found in the Entrez Help documents.

Taxonomy Fields in Other Entrez Databases

Many of the Entrez databases (Nucleotide, Protein, Genome, etc.) include an Organism field, [orgn], that indexes entries in that database by taxonomic group. All of the names associated with a taxon (scientific name, synonyms, common names, and so on) are indexed in the Organism field and will retrieve the same set of entries. A search in the Organism field retrieves all of the entries linked to the named taxon and to any of its descendants.

To avoid retrieving such “exploded” terms, use the unexploded index. The query Homo sapiens[orgn:noexp] retrieves only the entries that are linked directly to Homo sapiens; it does not retrieve entries that are linked to the subordinate node Homo sapiens neanderthalensis.

Taxids are indexed with the prefix txid: txid9606 [orgn].

Source organism modifiers are indexed in the [properties] field, and such queries would be in the form: src strain[prop], src variety[prop], or src specimen voucher[prop]. These queries will retrieve all entries with a strain qualifier, a variety qualifier, or a specimen_voucher qualifier, respectively.

All of the organism source feature modifiers (/clone, /serovar, /variety, etc.) are indexed in the text word field, [text word]. For example, one could query GenBank for: “strain k-12” [text word]. Because strain information is inconsistent in the sequence databases (as in the literature), a better query would be: “strain k 12”[word] OR “strain k12”[word]. Note: explicit double-quotes may be necessary for some of these queries.
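The query patterns in this section are regular enough to generate. The helpers below are hypothetical and only format strings; they are a sketch of the patterns shown above, not part of any NCBI library.

```python
def organism_term(name, explode=True):
    """Format an Organism-field term, optionally using the unexploded index."""
    field = "orgn" if explode else "orgn:noexp"
    return f"{name}[{field}]"


def strain_query(letters, digits, field="word"):
    """Cover the spaced and unspaced spellings of a strain name such as K-12."""
    variants = [f"strain {letters} {digits}", f"strain {letters}{digits}"]
    return " OR ".join(f'"{v}"[{field}]' for v in variants)


print(organism_term("Homo sapiens", explode=False))
print(strain_query("k", "12"))
```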

The Taxonomy Statistics Page

The Taxonomy Statistics page displays tables of counts of the number of taxa in the public subset of the Taxonomy database. The numbers displayed are hyperlinks that will retrieve the corresponding set of entries. The table can be configured to display data based on three criteria: Entrez release date, rank, and taxa. The default setting shows the counts by rank for a pre-selected set of taxa (across all dates).

The checkboxes unclassified, uncultured, and unspecified will exclude the corresponding sets of taxa from the count. These work by appending the terms “NOT unclassified[prop]”, etc., to the statistics query. Checking uncultured and unspecified removes about 20% of taxa in the database and gives a much better count of the number of formally described species. As of April 1, 2003, the count was as follows:

Archaea: 82 genera, 364 species

Bacteria: 1,163 genera, 9,927 species

Eukaryota: 24,939 genera, 74,832 species

Selecting one of the rank categories (e.g., species) loads a new table that shows, in this example, the number of new species added to each taxon each year, starting with 1993. The Interval pull-down menu shows release statistics in finer detail. The list of taxa in the display can be customized through the Customize link.

Other Relevant References

Taxonomy FTP

A complete copy of the public NCBI taxonomy database is deposited several times a day on our FTP site. See Table 1 for details.
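As an illustration of working with the downloaded files, the sketch below parses lines in the layout used by the `names.dmp` file of the taxonomy dump. The layout is an assumption, because the file format is not described in this section: four fields (taxid, name, unique name, nametype) separated by a tab, a vertical bar, and a tab, with each line ending in a tab and a vertical bar. The sample lines are illustrative and `parse_names` is a hypothetical helper.

```python
# Illustrative sample in the assumed names.dmp layout.
SAMPLE = (
    "9606\t|\tHomo sapiens\t|\t\t|\tscientific name\t|\n"
    "9606\t|\thuman\t|\t\t|\tgenbank common name\t|\n"
)


def parse_names(text):
    """Return {taxid: {nametype: [names]}} from names.dmp-style text."""
    names = {}
    for line in text.splitlines():
        if not line:
            continue
        if line.endswith("\t|"):
            line = line[:-2]  # drop the end-of-row marker
        taxid, name, _unique_name, name_class = line.split("\t|\t")
        names.setdefault(int(taxid), {}).setdefault(name_class, []).append(name)
    return names


print(parse_names(SAMPLE))
```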

Tax BLAST

Taxonomy BLAST reports (Tax BLAST) are available from the BLAST results page and from the BLink pages. Tax BLAST post-processes the BLAST output results according to the source organisms of the sequences in the BLAST results page. A help page is available that describes the three different views presented on the Tax BLAST page (Lineage Report, Organism Report, and Taxonomy Report).

Toolkit Function Libraries

The function library for the taxonomy application software in the NCBI Toolkit is ncbitxc2.a (or libncbitxc2.a). The source code can be found in the NCBI Toolkit Source Browser and can be downloaded from the toolbox directory on the FTP site.

NCBI Taxonomists

In the early years of the project, Scott Federhen did all of the software and database development. In recent years, Vladimir Soussov and his group have been responsible for software and database development.

Scott Federhen (1990–present)

Andrzej Elzanowski (1994–1997)

Detlef Leipe (1994–present)

Mark Hershkovitz (1996–1997)

Carol Hotton (1997–present)

Mimi Harrington (1999–2000)

Ian Harrison (1999–2002)

Sean Turner (2000–present)

Rick Sternberg (2001–present)

Contact Us

If you have a comment on or correction to our Taxonomy database, such as a misspelling or misclassification, or if something looks wrong, please send a message to info@ncbi.nlm.nih.gov.

Appendix 1. <abbrev>TAXON</abbrev> nametypes.

Scientific Name

Every node in the database is required to have exactly one “scientific name”. Wherever possible, this is a validly published name with respect to the relevant code of nomenclature. Formal names that are subject to a code of nomenclature and are associated with a validly published description of the taxon will be Latinized uninomials above the species level, binomials (e.g. Homo sapiens) at the species level, and trinomials for the formally described infraspecific categories (e.g., Homo sapiens neanderthalensis). For many of our taxa, it is not possible to find an appropriate formal scientific name; these nodes are given an informal “scientific name”. The different classes of informal names are discussed in Appendix 2. Functional Classes of TAXON Scientific Names.

The scientific name is the one that will be used in all of the sequence entries that map to this node in the Taxonomy database. Entries that are submitted with any of the other names associated with this node will be replaced with this name. When we change the scientific name of a node in the Taxonomy database, the corresponding entries in the sequence databases will be updated to reflect the change. For example, we list Homo neanderthalensis as a synonym for Homo sapiens neanderthalensis. Both are in common use in the literature. We try to impose consistent usage on the entries in the sequence databases, and resolving the nomenclatural disputes that inevitably arise between submitters is one of the most difficult challenges that we face.

Synonym

The “synonym” nametype is applied to both synonyms in the formal nomenclatural sense (objective, nomenclatural, homotypic versus subjective, heterotypic) and more loosely to include orthographic variants and a host of names that have found their way into the taxonomy database over the years, because they were found in sequence entries and later merged into the same taxon in the Taxonomy database.

Acronym

The “acronym” nametype is used primarily for the viruses. The International Committee on Taxonomy of Viruses (ICTV) maintains an official list of acronyms for viral species, but the literature is often full of common variants, and it is convenient to list these as well. For example, we list HIV, LAV-1, HIV1, and HIV-1 as acronyms for the human immunodeficiency virus type 1.

Anamorph

The term “anamorph” is reserved for names applied to asexual forms of fungi, which present some special nomenclatural challenges. Many fungi are known to undergo both sexual and asexual reproduction at different points in their life cycle (so-called “perfect” fungi); for many others, however, only the asexually reproducing (anamorphic or mitosporic) form is known (in some, perhaps many, asexual species, the sexual cycle may have been lost altogether). These anamorphs, often with simple and not especially diagnostic morphology, were given Linnaean binomial names. A number of named anamorphic species have subsequently been found to be associated with sexual forms (teleomorphs) with a different name (for example, Aspergillus nidulans is the name given to the asexual stage of the teleomorphic species Emericella nidulans). In these cases, the teleomorphic name is given precedence in the GenBank Taxonomy database as the “scientific” name, and the anamorphic name is listed as an “anamorph” nametype.

Misspelling

The “misspelling” nametype is for simple misspellings. Some of these are included because the misspelling is present in the literature, but most of them are there because they were once found in a sequence entry (which has since been corrected). We keep them in the database for tracking purposes, because copies of the original sequence entry can still be retrieved. Misspellings are not listed on the TaxBrowser pages nor on the Taxonomy Entrez Info display views, but they are indexed in the Entrez search fields (so that searches and Entrez queries with the misspelling will find the appropriate node).

Misnomer

“Misnomer” is a rarely used nametype. It is used for names that might otherwise be listed as “misspellings” but which we want to appear on the browser and Entrez display pages.

Common Name

The “common name” nametype is used for vernacular names associated with a particular taxon. These may be found at any level in the hierarchy; for example, “human”, “reptiles”, and “pale devil's-claw” are all used. Common names should be in lowercase letters, except where part of the name is derived from a proper noun, for example, “American butterfish” and “Robert's arboreal rice rat”.

The use of common names is inherently variable, regional, and often inconsistent. There is generally no authoritative reference that regulates the use of common names, and there is often not perfect correspondence between common names and formally described scientific taxa; therefore, there are some caveats to their use. For scientific discourse, there is no substitute for formal scientific names. Nevertheless, common names are invaluable for many indexing, retrieval, and display purposes. The combination “Oecomys roberti (Robert's arboreal rice rat)” conveys much more information than either name by itself. Issues raised by the variable, regional, and inexact use of common names are partly addressed by the “genbank common name” nametype (below) and the ability to customize names in the GenBank flatfile.

<abbrev>BLAST</abbrev> Name

The “BLAST names” are a specially designated set of common names selected from the Taxonomy database. These were chosen to provide a pool of familiar names for large groups of organisms (such as “insects”, “mammals”, “fungi”, and others) so that any particular species (which may not have an informative common name of its own) could inherit a meaningful collective common name from the Taxonomy database. This was originally developed for BLAST, because a list of BLAST results will typically include entries from many species identified by Latin binomials, which may not be familiar to all users. BLAST names may be nested; for example, “eukaryotes”, “animals”, “chordates”, “mammals”, and “primates” are all flagged as “blast names”.

BLAST names are now used in several other applications, for example the Tax BLAST displays, the Summary view in Entrez Taxonomy, and in the Summary display of the Common Tree format.

In-part

The “in-part” nametype is included for retrieval terms that have a broader range of application than the taxon or taxa at which they appear. For example, we list reptiles and Reptilia as in-part nametypes at our nodes Testudines (the turtles), Lepidosauria (the lizards and snakes), and Crocodylidae (the crocodilians).

Includes

The “includes” nametype is the opposite of the in-part nametype and is included for retrieval terms that have a narrower scope of application than the taxon at which they appear. For example, we could list “reptiles” as an “includes” nametype for the Amniota (or at any higher node in the lineage).

Equivalent Name

The “equivalent name” nametype is a catch-all category, used for names that we would like to associate with a particular node in the database (for indexing or tracking purposes) but which do not seem to fit well into any of the other existing nametypes.

GenBank Common Name

The “genbank common name” was introduced to provide a mechanism by which, when there is more than one common name associated with a particular node in the taxonomy, one of them could be designated to be the common name that should be used by default in the GenBank flatfiles and other applications that are trying to find a common name to use for display (or other) purposes. This is not intended to confer any special status or blessing on this particular common name over any of the other common names that might be associated with the same node, and we have developed mechanisms to override this choice for a common name on a case-by-case basis if another name is more appropriate or desirable for a particular sequence entry. Each node may have at most one “genbank common name”.

GenBank Acronym

There may be more than one acronym associated with a particular node in the Taxonomy database (particularly if several virus names have been synonymized in a single species). Just as with the “genbank common name”, the “genbank acronym” provides a mechanism to designate one of them to be the acronym that should be used for display (or other) purposes. Each node may have at most one “genbank acronym”.

GenBank Synonym

The “genbank synonym” nametype is intended for those special cases in which there is more than one name commonly used in the literature for a particular species, and it is informative to have both names displayed prominently in the corresponding sequence record. Each node may have at most one “genbank synonym”. For example:

SOURCE Takifugu rubripes (Fugu rubripes)
ORGANISM Takifugu rubripes

GenBank Anamorph

Although the use of either the anamorph or teleomorph name is formally correct under the International Code of Botanical Nomenclature, we prefer to give precedence to the teleomorphic name as the “scientific name” in the Taxonomy database, both to emphasize their commonality and to avoid having two (or more) taxids that effectively apply to the same organism. However, in many cases, the anamorphic name is much more commonly used in the literature, especially when sequences are normally derived from the asexual form of the species. In these cases, the “genbank anamorph” nametype can be used to annotate the corresponding sequence records with both names. Each node may have at most one “genbank anamorph”. For example:

SOURCE Emericella nidulans (anamorph: Aspergillus nidulans)
ORGANISM Emericella nidulans
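The way the “genbank” nametypes feed the flatfile display can be sketched as a small formatting function. This is an illustration only: the order of preference among the three nametypes, the column padding, and the function name `flatfile_source` are assumptions, and the case-by-case override mechanism mentioned above is not modeled.

```python
def flatfile_source(names):
    """Format SOURCE/ORGANISM lines from a node's nametypes.

    `names` maps a nametype to the single name of that type for one node.
    The preference order below is an assumption made for this sketch.
    """
    sci = names["scientific name"]
    if "genbank anamorph" in names:
        extra = f" (anamorph: {names['genbank anamorph']})"
    elif "genbank synonym" in names:
        extra = f" ({names['genbank synonym']})"
    elif "genbank common name" in names:
        extra = f" ({names['genbank common name']})"
    else:
        extra = ""
    return [f"SOURCE      {sci}{extra}", f"  ORGANISM  {sci}"]


fugu = {"scientific name": "Takifugu rubripes", "genbank synonym": "Fugu rubripes"}
print("\n".join(flatfile_source(fugu)))
```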

Appendix 2. Functional classes of <abbrev>TAXON</abbrev> scientific names.

Formal Names

Whenever possible, formal scientific names are used for taxa. There are several codes of nomenclature that regulate the description and use of names in different branches of the tree of life. These are: the International Code of Zoological Nomenclature (ICZN), the International Code of Botanical Nomenclature (ICBN), the International Code of Nomenclature for Cultivated Plants (ICNCP), the International Code of Nomenclature of Bacteria (ICNB), and the International Code of Virus Classification and Nomenclature (ICVCN).

The viral code is less well developed than the others, but it includes an official classification for the viruses as well as a list of approved species names. Viral names are generally not Latin binomials (as required by the other codes), although there are some that take this form (e.g., Herpesvirus papio or Herpesvirus sylvilagus). When possible, we try to use ICTV-approved names for viral taxa, but new viral species names appear in the literature (and therefore in the sequence databases) much faster than they are approved into the ICTV lists. We are working to set up taxonomy LinkOut links (see Chapter 17) to the ICTV database, which will make the subset of ICTV-approved names explicit.

The zoological, botanical, and bacteriological codes mandate Latin binomials for species names. They do not describe an official classification (as the viral code does), with the exception that the binomial species nomenclature itself makes the classification to the genus level explicit. If a genus is found to be polyphyletic, the classification cannot be corrected without formally renaming at least some of the species in the genus. (This is somewhat reminiscent of the “smart identifier” problem in computer science.)

The fungi are subject to the botanical code. The cyanobacteria (blue-green algae) have been subject to both the botanical and the bacteriological codes, and the issue is still controversial.

Authorities

“Authorities” appear at the end of the formal species name and include at least the name or standard abbreviation of the taxonomist who first described that name in the scientific literature. Other information may appear in the authority as well, often the year of description, and can become quite complicated if the taxon has been transferred or amended by other taxonomists over the years. We do not use authorities in our taxon names, although many are included in the database listed as synonyms. We have made an exception to this rule in the case of our first duplicated species name in the database, Agathis montana, to provide unambiguous names for the corresponding sequence entries.

Subspecies

All three of the codes of nomenclature for cellular organisms provide for names at the subspecies level. The botanical and bacteriological codes include the string “subsp.” in the formal name; the zoological code does not, e.g., Homo sapiens neanderthalensis, Zea mays subsp. mays, and Klebsiella pneumoniae subsp. ozaenae.

Varietas and Forma

The botanical code (but none of the others) provides for two additional formal ranks beneath the subspecies level, varietas and forma. These names will include the strings “var.” and “f.”, respectively, e.g., Marchantia paleacea var. diptera, Penicillium aurantiogriseum var. neoechinulatum, Salix babylonica f. rokkaku, or Fragaria vesca subsp. vesca f. alba.

Other Subspecific Names

We list taxa with other subspecific names where it seems useful and appropriate and where it is necessary to find places for names in the sequence databases. For indexing purposes in the Genomes division of Entrez, it is convenient to have strain-level nodes for bacterial species with a complete genome sequence, particularly when there are two or more complete genome sequences available for different strains of the same species, e.g., Escherichia coli K12, Escherichia coli O157:H7, Escherichia coli O157:H7 EDL933, Mycobacterium tuberculosis CDC1551, and Mycobacterium tuberculosis H37Rv.

Several other classes of subspecific groups do not have formal standing in the nomenclature but represent well-characterized and biologically meaningful groups, e.g., serovar, pathovar, forma specialis, and others. In many cases, these may eventually be promoted to a species; therefore, it is convenient to represent them independently from the outset, e.g., Xanthomonas campestris pv. campestris, Xanthomonas campestris pv. vesicatoria, Pneumocystis carinii f. sp. hominis, Pneumocystis carinii f. sp. mustelae, Salmonella enterica subsp. enterica serovar Dublin, and Salmonella enterica subsp. enterica serovar Panama.

Many other names below the species level have been added to the Taxonomy database to accommodate SWISS-PROT entries, where strain (and other) information is annotated with the organism name for some species.

Informal Names

In general, we try to avoid unqualified species names such as Bacillus sp., although many of them exist in the Taxonomy database because of earlier sequence entries. Bacillus sp. is a particularly egregious example, because Bacillus is a duplicated genus name and could refer to either a bacterium or an insect. In our database, Bacillus sp. is assumed to be a bacterium, but Bacillus sp. P-4-N, on the other hand, is classified with the insects.

When entries are not identified at the species level, multiple sequences can be from the same unidentified species. Sequences from multiple different unidentified species in the same genus are also possible. To keep track of this, we add unique informal names to the Taxonomy database, e.g., a meaningful identifier from the submitters could be used. This could be a strain name, a culture collection accession, a voucher specimen, an isolate name or location—anything that could tie the entry to the literature (or even to the lab notebook). If nothing else is available, we may construct a unique name using a default formula such as the submitter's initials and year of submission. This way, if a formal name is ever determined or described for any of these organisms, we can synonymize the informal name with the formal one in the Taxonomy database, and the corresponding entries in the sequence databases will be updated automatically. For example, AJ302786 was originally submitted (in November 2000) as Agathis sp. and was added to the Taxonomy database as Agathis sp. RDB-2000. In January 2002, this wasp was identified as belonging to the species Agathis montana, and the node was renamed; the informal name Agathis sp. RDB-2000 was listed as a synonym. A separate member of the genus, Agathis sp. DMA-1998, is still listed with an informal name.
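The default naming formula and the later renaming step can be sketched as follows. The record layout (a dictionary with a scientific name and a list of synonyms) and both function names are hypothetical; only the name pattern and the Agathis example come from the text above.

```python
def default_informal_name(genus, initials, year):
    """Build a unique informal name from the submitter's initials and the year."""
    return f"{genus} sp. {initials}-{year}"


def assign_formal_name(node, formal_name):
    """Rename a node and keep the old informal name as a synonym."""
    node["synonyms"].append(node["scientific name"])
    node["scientific name"] = formal_name
    return node


node = {"scientific name": default_informal_name("Agathis", "RDB", 2000), "synonyms": []}
print(node["scientific name"])
assign_formal_name(node, "Agathis montana")
print(node)
```

Because the old name remains attached to the same node as a synonym, searches that use the informal name continue to find the renamed taxon.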

Here are some examples of informal names in the Taxonomy database:

Anabaena sp. PCC 7108

Anabaena sp. M14-2

Calophyllum sp. 'Fay et al. 1997'

Scutellospora sp. Rav1/RBv2/RCv3

Ehrlichia-like sp. 'Schotti variant'

Gilia sp. Porter and Heil 7991

Camponotus n. sp. BGW-2001

Camponotus sp. nr. gasseri BGW-2001

Drosophila sp. 'white tip scutellum'

Chrysoperla sp. 'C.c.2 slow motorboat'

Saranthe aff. eichleri Chase 3915

Agabus cf. nitidus IR-2001

Simulium damnosum s.l. 'Kagera'

Amoebophrya sp. ex Karlodinium micrum

We use single quotes when it seems appropriate to group a phrase into a single lexical unit. Some of these names include abbreviations with special meanings.

“n. sp.” indicates that this is a new, undescribed species and not simply an unidentified species.

“sp. nr.” indicates “species near”. In the example above, the specimen is similar to Camponotus gasseri.

“aff.”, affinis, indicates that the specimen is related to but not identical to the species given.

“cf.”, confer; literally, “compare with”. This conveys resemblance to a given species without implying that the specimen is necessarily related to it.

“s.l.”, sensu lato; literally, “in the broad sense”.

“ex”, “from” or “out of”, indicates the biological host of the specimen.

Note that names with cf., aff., nr., and n. sp. are not unique and should have unique identifiers appended to the name.
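The naming pattern described in this section can be sketched in a few lines of Python. This is only an illustration, not the curation software used for the Taxonomy database: the function names `default_identifier` and `informal_name` are hypothetical, and the initials-plus-year format is inferred from examples such as Agathis sp. RDB-2000.

```python
def default_identifier(initials, year):
    """Fallback identifier (assumed format): submitter initials plus submission year."""
    return f"{initials}-{year}"


def informal_name(genus, identifier, qualifier="sp.", epithet=None):
    """Assemble an informal name from its parts.

    qualifier is an abbreviation such as 'sp.', 'n. sp.', 'sp. nr.', 'aff.' or 'cf.';
    epithet is the species the specimen is being compared with, if any.
    The unique identifier always comes last, so that names carrying
    non-unique qualifiers still end up distinct.
    """
    parts = [genus, qualifier]
    if epithet:
        parts.append(epithet)
    parts.append(identifier)
    return " ".join(parts)


name_a = informal_name("Agathis", default_identifier("RDB", 2000))
name_b = informal_name("Camponotus", "BGW-2001", qualifier="sp. nr.", epithet="gasseri")
```

Here `name_a` reproduces the Agathis example and `name_b` the Camponotus sp. nr. example from the list of informal names.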

Cultured bacterial strains and other specimens that have not been identified to the genus level are given informal names as well, e.g., Desulfurococcaceae str. SRI-465; crenarchaeote OlA-6.

Names such as Camponotus sp. 1 are avoided, because different submitters might easily use the same name to refer to different species. See Box 4 for how to retrieve these names in Taxonomy Entrez.

Uncultured Names

Sequences from environmental samples are given “uncultured” names. In these studies, nucleotide sequences are cloned directly from the environment and come from varied sources, such as Antarctic sea ice, activated sewer sludge, and dental plaque. Apart from the sequence itself, there is no way to identify the source organisms or to recover them for further studies. These studies are particularly important in bacterial systematics work, which shows that the vast majority of environmental bacteria are not closely related to laboratory cultured strains (as measured by 16S rRNA sequences). Many of the deepest-branching groups in our bacterial classification are defined only by anonymous sequences from these environmental sample studies, e.g., candidate division OP5, candidate division Termite group 1, candidate subdivision kps59rc, phosphorous removal reactor sludge group, and marine archaeal group 1.

These samples vary widely in length and in quality, from short single-read sequences of a few hundred base pairs to high-quality, full-length 16S sequences. We now give all of these samples anonymous names, which may indicate the phylogenetic affiliation of the sequence, as far as it may be determined, e.g., uncultured archaeon, uncultured crenarchaeote, uncultured gamma proteobacterium, or uncultured enterobacterium. See Box 4 for how to retrieve these names in Taxonomy Entrez.

Candidatus Names

Some groups of bacteria have never been cultured but can be characterized and reliably recovered from the environment by other means. These include endosymbiotic bacteria and organisms similar to the phytoplasmas, which can be identified by the plant diseases that they cause. We do not give these “uncultured” names, as above. These represent a special challenge for bacterial nomenclature, because a formal species description requires the designation of a cultured type strain. The bacteriological code has a special provision for names of this sort, Candidatus, e.g., Candidatus Endobugula or Candidatus Endobugula sertula; Candidatus Phlomobacter or Candidatus Phlomobacter fragariae. These often appear in the literature without the Candidatus prefix; therefore, we list the unqualified names as synonyms for retrieval purposes.

Informal Names above the Species Level

We allow informal names for unranked nodes above the species level as well. These should all be phylogenetically meaningful groups, e.g., the Fungi/Metazoa group, eudicotyledons, Erythrobasidium clade, RTA clade, and core jakobids. In addition, there are several other classes of nodes and names above the species level that explicitly do not represent phylogenetically meaningful groups. These are outlined below.

Unclassified Bins

We are expected to add new species names to the database in a timely manner, preferably within a day or two. If we are able to find only a partial classification for a new taxon in the database, we place it as deeply as we can and list it in an explicit “unclassified” bin. As more information becomes available, these bins are emptied, and we give full classifications to the taxa listed there. In general, we suppress the names of the unclassified bins themselves so they do not appear in the abbreviated lineages that appear in the GenBank flatfiles, e.g., unclassified Salticidae, unclassified Bacteria, and unclassified Myxozoa.

Incertae Sedis Bins

If the best taxonomic opinion available is that the position of a particular taxon is uncertain, then we will list it in an “incertae sedis” bin. This is a more permanent assignment than for taxa that are listed in unclassified bins, e.g., Neoptera incertae sedis, Chlorophyceae incertae sedis, and Riodininae incertae sedis.

Mitosporic Bins

Fungi that were known only in the asexual (mitosporic, anamorphic) state were placed formerly in a separate, highly polyphyletic category of “imperfect” fungi, the Deuteromycota. Spurred especially by the development of molecular phylogenetics, current mycological practice is to classify anamorphic species as close to their sexual relatives as available information will support. Mitosporic categories can occur at any rank, e.g., mitosporic Ascomycota, mitosporic Hymenomycetes, mitosporic Hypocreales, and mitosporic Coniochaetaceae. The ultimate goal is to fully incorporate anamorphs into the natural phylogenetic classification.

Other Names

The requirement that the Taxonomy database includes names from all of the entries in the sequence database introduces a number of names that are not typically treated in a taxonomic database. These are listed in the top-level group “Other”. Plasmids are typically annotated with their host organism, using the /plasmid source organism qualifier. Broad-host-range plasmids that are not associated with any single species are listed in their own bin. Plasmid and transposon names from very old sequence entries are listed in separate bins here as well. Plasmids that have been artificially engineered are listed in the “vectors” bin.

Appendix 3. Other TAXON data types. Ranks

We do not require that Linnaean ranks be assigned to all of our taxa, but we do include a standard rank table that allows us to assign formal ranks where it seems appropriate. We do not require that sibling taxa all have the same rank, but we do not allow taxa of higher rank to be listed beneath taxa of lower rank. We allow unranked nodes to be placed at any point in our classification.
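The ordering rule can be illustrated with a short check over a lineage. This is a sketch under stated assumptions: `RANK_ORDER` is a simplified, hypothetical rank table (the real table has more ranks), and `lineage_is_valid` is an invented helper name. Unranked nodes are represented by `None` and are skipped, since they may appear anywhere.

```python
# Simplified rank table, highest rank first (an assumption for illustration).
RANK_ORDER = ["superkingdom", "kingdom", "phylum", "class",
              "order", "family", "genus", "species"]
RANK_INDEX = {rank: i for i, rank in enumerate(RANK_ORDER)}


def lineage_is_valid(lineage):
    """lineage: list of (name, rank-or-None) pairs, root first.

    Returns False if a taxon of higher rank is listed beneath
    a taxon of lower rank; unranked nodes are ignored.
    """
    lowest_seen = -1
    for _name, rank in lineage:
        if rank is None:
            continue  # unranked nodes may be placed at any point
        idx = RANK_INDEX[rank]
        if idx < lowest_seen:
            return False  # a higher rank appears below a lower one
        lowest_seen = idx
    return True


ok = lineage_is_valid([
    ("Bacteria", "superkingdom"),
    ("Proteobacteria", "phylum"),
    ("some unranked clade", None),
    ("Escherichia", "genus"),
    ("Escherichia coli", "species"),
])
bad = lineage_is_valid([("Escherichia", "genus"), ("Enterobacteriaceae", "family")])
```

Sibling taxa are never compared with each other here, which matches the statement that siblings need not share a rank.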

The one rank that we particularly care about is “species”. We try to ensure that all of the sequence entries map into the Taxonomy database at or below a species-level node.

Genetic Codes

The genetic codes and mitochondrial genetic codes that are appropriate for translating protein sequences in different branches of the tree of life are assigned at nodes in the Taxonomy database and inherited by species at the terminal branches of the tree. Plastid sequences are all translated with the standard genetic code, but many of the mRNAs undergo extensive RNA editing, making it difficult or impossible to translate sequences from the plastid genome directly. The genetic codes are listed on our Web site.

GenBank Divisions

GenBank taxonomic division assignments are made in the Taxonomy database and inherited by species at the terminal branches of the tree, just as with the genetic codes.
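Inheritance of this kind can be modeled by walking from a node toward the root until an explicit assignment is found. The sketch below is an assumption about how such a lookup might work, not a description of the actual database schema; the `Node` class, its field names, and the specific assignments are hypothetical and chosen only for illustration.

```python
class Node:
    """A taxon with optional, explicitly assigned properties."""

    def __init__(self, name, parent=None, genetic_code=None, division=None):
        self.name = name
        self.parent = parent
        self.genetic_code = genetic_code
        self.division = division


def inherited(node, attr):
    """Return the nearest explicit value of attr on node or its ancestors."""
    while node is not None:
        value = getattr(node, attr)
        if value is not None:
            return value
        node = node.parent
    return None


# Illustrative assignments only: values are set on internal nodes.
root = Node("root", genetic_code=1)
mammalia = Node("Mammalia", parent=root, division="MAM")
primates = Node("Primates", parent=mammalia, division="PRI")
human = Node("Homo sapiens", parent=primates)

human_division = inherited(human, "division")
human_code = inherited(human, "genetic_code")
```

The species node carries no assignment of its own; it picks up the division from the closest ancestor that has one and the genetic code from the root.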

References

The Taxonomy database allows us to store comments and references at any taxon. These may include hotlinks to abstracts in PubMed, as well as links to external addresses on the Web.

Abbreviated Lineage

Some branches of our taxonomy are many levels deep, e.g., the bony fish (as we moved to a phylogenetic classification) and the drosophilids (a model taxon for evolutionary studies). In many cases, the classification lines in the GenBank flatfiles became longer than the sequences themselves. This became a storage and update issue, and the classification lines themselves became less helpful as generally familiar taxa names became buried within less recognizable taxa.

To address this problem, the Taxonomy database allows us to flag taxa that should (or should not) appear in the abbreviated classification line in the GenBank flatfiles. The full lineages are indexed in Entrez and displayed in the Taxonomy Browser.
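A minimal sketch of the flagging idea follows. The lineage, the per-taxon flags, and the function name `abbreviated_lineage` are all hypothetical; the semicolon separator is an assumption about how a classification line might be joined.

```python
def abbreviated_lineage(lineage):
    """lineage: list of (taxon name, show-in-flatfile flag), root first."""
    return "; ".join(name for name, show in lineage if show)


def full_lineage(lineage):
    """The complete lineage, ignoring the display flags."""
    return "; ".join(name for name, _show in lineage)


# Flags below are invented for illustration.
example = [
    ("Eukaryota", True),
    ("Fungi/Metazoa group", False),
    ("Metazoa", True),
    ("Arthropoda", True),
    ("Drosophilidae", True),
]

short_line = abbreviated_lineage(example)
long_line = full_lineage(example)
```

Both forms come from the same stored lineage, so the abbreviated line can be shortened or lengthened by changing flags without touching the classification itself.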

Glossary 3-D or 3D

Three-dimensional.

Accession number

An Accession number is a unique identifier given to a sequence when it is submitted to one of the DNA repositories (GenBank, EMBL, DDBJ). The initial deposition of a sequence record is referred to as version 1. If the sequence is updated, the version number is incremented, but the Accession number will remain constant.
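The relationship between the accession and its version can be shown with a small parser. This is a sketch that assumes identifiers are written as accession, a dot, and an integer version; the helper names are hypothetical, and the version suffix attached to AJ302786 below is used only as an example.

```python
def parse_accession_version(identifier):
    """Split 'ACCESSION.VERSION' into (accession, version).

    The version is None when the identifier has no '.VERSION' suffix.
    """
    accession, sep, version = identifier.partition(".")
    return accession, (int(version) if sep else None)


def next_version(identifier):
    """Identifier after a sequence update: same accession, version + 1."""
    accession, version = parse_accession_version(identifier)
    if version is None:
        raise ValueError("identifier has no version suffix")
    return f"{accession}.{version + 1}"


parsed = parse_accession_version("AJ302786.1")
updated = next_version("AJ302786.1")
```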

Alu

The Alu repeat family comprises short interspersed elements (SINEs) present in multiple copies in the genomes of humans and other primates. The Alu sequence is approximately 300 bp in length and is found commonly in introns, 3′ untranslated regions of genes, and intergenic genomic regions. Alu repeats are mobile elements and are present in the human genome in extremely high copy number. Almost 1 million copies of the Alu sequence are estimated to be present, making it the most abundant mobile element. The Alu sequence is so named because of the presence of a recognition site for the AluI endonuclease in the middle of the Alu sequence. Because of the widespread occurrence of the Alu repeat in the genome, the Alu sequence is used as a universal primer for PCR in animal cell lines; it binds in both forward and reverse directions. The Alu universal primer sequence is as follows: 5′-GTG GAT CAC CTG AGG TCA GGA GTT TC-3′ (26-mer).

allele

One of the variant forms of a gene at a particular locus on a chromosome. Different alleles produce variation in inherited characteristics such as hair color or blood type. In an individual, one form of the allele (the dominant one) may be expressed more than another form (the recessive one). When “genes” are considered simply as segments of a nucleotide sequence, allele refers to each of the possible alternative nucleotides at a specific position in the sequence. For example, a CT polymorphism such as CCT[C/T]CCAT would have two alleles: C and T.
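The bracket notation in the example can be expanded mechanically into one sequence per allele. The sketch below assumes a single bracketed site with alleles separated by a slash; `expand_alleles` is a hypothetical helper name.

```python
import re


def expand_alleles(pattern):
    """Expand e.g. 'CCT[C/T]CCAT' into a dict mapping allele -> full sequence."""
    match = re.fullmatch(r"([ACGT]*)\[([ACGT/]+)\]([ACGT]*)", pattern)
    if match is None:
        raise ValueError("expected a pattern with one [X/Y] site")
    left, alleles, right = match.groups()
    return {allele: left + allele + right for allele in alleles.split("/")}


variants = expand_alleles("CCT[C/T]CCAT")
```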

API

Application Programming Interface. An API is a set of routines that an application uses to request and carry out lower-level services performed by a computer's operating system. For computers running a graphical user interface, an API manages an application's windows, icons, menus, and dialog boxes.

ASN.1

Abstract Syntax Notation 1 is an international standard data-representation format used to achieve interoperability between computer platforms. It allows for the reliable exchange of data in terms of structure and content by computer and software systems of all types.

BAC

Bacterial Artificial Chromosome. A BAC is a large segment of DNA (100,000–200,000 bp) from another species cloned into bacteria. Once the foreign DNA has been cloned into the host bacteria, many copies of it can be made.

BankIt

BankIt is a tool for the online submission of one or a few sequences into GenBank and is designed to make the submission process quick and easy. (BankIt also automatically uses VecScreen to identify segments of nucleic acid sequence that may be of vector, adapter, or linker origin to combat the problem of vector contamination in GenBank.)

bit score

The value S′ is derived from the raw alignment score S in which the statistical properties of the scoring system used have been taken into account. By normalizing a raw score using the formula S′ = (λS − ln K) / ln 2, a “bit score” S′ is attained, which has a standard set of units, and where K and λ (lambda) are the statistical parameters of the scoring system. Because bit scores have been normalized with respect to the scoring system, they can be used to compare alignment scores from different searches.
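As a worked illustration, the normalization can be written as a one-line function. The sketch assumes the commonly cited Karlin-Altschul form S′ = (λS − ln K) / ln 2; the parameter values in the example are invented and do not correspond to any particular scoring matrix.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalize a raw alignment score S to a bit score S'.

    lam and k are the statistical parameters (lambda and K)
    of the scoring system used for the search.
    """
    return (lam * raw_score - math.log(k)) / math.log(2)


# Invented parameters: two scoring systems with different scales.
score_a = bit_score(raw_score=100, lam=0.30, k=0.10)
score_b = bit_score(raw_score=50, lam=0.60, k=0.10)
```

The two raw scores differ by a factor of two, but because the second scoring system has twice the lambda, both normalize to the same bit score, which is what makes bit scores comparable across searches.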

BLAST

Basic Local Alignment Search Tool (Altschul et al., J Mol Biol 215:403-410; 1990). A sequence comparison algorithm that is optimized for speed and used to search sequence databases for optimal local alignments to a query. See the BLAST chapter (Chapter 15) or the tutorial or the narrative guide to BLAST.

blastn

nucleotide–nucleotide BLAST. blastn takes nucleotide sequences in FASTA format, GenBank Accession numbers, or GI numbers and compares them against the NCBI Nucleotide databases.

blastp

protein–protein BLAST. blastp takes protein sequences in FASTA format, GenBank Accession numbers, or GI numbers and compares them against the NCBI Protein databases.

BLAT

A DNA/Protein sequence analysis program to quickly find sequences of 95% and greater similarity of length 40 bases or more. It may miss more divergent or shorter sequence alignments. BLAT on proteins finds sequences of 80% and greater similarity of length 20 amino acids or more. BLAT is not BLAST. (See the BLAT web page.)

BLink

BLAST Link. BLink displays the results of BLAST searches that have been done for every protein sequence in the Entrez Protein data domain. It can be accessed by following the BLink link displayed beside any hit in the results of an Entrez Protein search. In contrast to Entrez's Related Sequences feature, which lists the titles of similar sequences, BLink displays the graphical output of precomputed blastp results against the non-redundant (nr) protein database. The output includes the positions of up to 200 BLAST hits on the query sequence, scores, and alignments. BLink offers a variety of display options, including the distribution of hits by taxonomic grouping, the best hit to each organism, the protein domains in the query sequence, similar sequences that have known 3D structures, and more. Additional options allow you to specify which taxa you would like to exclude, increase or decrease the BLAST cutoff score, or filter the BLAST hits to show only those from a specific source database, such as RefSeq or SWISS-PROT. See the BLink help document for additional information.

BLOB

Binary Large Object (or binary data object). BLOB refers to a large piece of data, such as a bitmap. A BLOB is characterized by large field values, an unpredictable table size, and data that are formless from the perspective of a program. It is also a keyword designating the BLOB structure, which contains information about a block of data.

BLOSUM 62

Blocks Substitution Matrix. A substitution matrix in which scores for each position are derived from observations of the frequencies of substitutions in blocks of local alignments in related proteins. Each matrix is tailored to a particular evolutionary distance. In the BLOSUM 62 matrix, for example, the alignment from which scores were derived was created using sequences sharing no more than 62% identity. Sequences more identical than 62% are represented by a single sequence in the alignment to avoid overweighting closely related family members (Henikoff and Henikoff, Proc Natl Acad Sci USA 89:10915-10919; 1992).

Boolean

This term refers to binary algebra that uses the logical operators AND, OR, XOR, and NOT; the outcomes consist of logical values (either TRUE or FALSE). The keyword boolean indicates that the expression or constant expression associated with the identifier takes the value TRUE or FALSE. The logical-AND (&&) operator produces the value 1 if both operands have nonzero values; otherwise, it produces the value 0. The logical-OR (||) operator produces the value 1 if either of its operands has a nonzero value. The logical-NOT (!) operator produces the value 0 if its operand is TRUE (nonzero) and the value 1 if its operand is FALSE (0). The exclusive OR (XOR) operator yields TRUE only if one of its operands is TRUE and the other is FALSE. If both operands are the same (either TRUE or FALSE), the operation yields FALSE.
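The four operators can be tabulated directly. The sketch below uses Python's `and`, `or`, and `not` in place of the C-style `&&`, `||`, and `!` symbols, and expresses XOR on truth values as inequality; the function name `truth_table` is hypothetical.

```python
from itertools import product


def truth_table():
    """Rows of (a, b, a AND b, a OR b, a XOR b, NOT a) for all inputs."""
    rows = []
    for a, b in product([False, True], repeat=2):
        rows.append((a, b, a and b, a or b, a != b, not a))
    return rows


table = truth_table()
for row in table:
    print(row)
```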

build

A run of the genome assembly and annotation process, or the set of products generated by that run.

CCAP

Cancer Chromosome Aberration Project. CCAP was designed to expedite the definition and detailed characterization of the distinct chromosomal alterations that are associated with malignant transformation. The project is a collaboration among the NCI, the NCBI, and numerous research labs.

CD

Conserved Domain. CD refers to a domain (a distinct functional and/or structural unit of a protein) that has been conserved during evolution. During evolution, changes at specific positions of an amino acid sequence in the protein have occurred in a way that preserves the physico-chemical properties of the original residues, and hence the structural and/or functional properties of that region of the protein.

CDART

Conserved Domain Architecture Retrieval Tool. When given a protein query sequence, CDART displays the functional domains that make up the protein and lists proteins with similar domain architectures. The functional domains for a sequence are found by comparing the protein sequence to a database of conserved domain alignments, CDD, using RPS-BLAST.

CDD

Conserved Domain Database. This database is a collection of sequence alignments and profiles representing protein domains conserved during molecular evolution.

cDNA

complementary DNA. A DNA sequence obtained by reverse transcription of a messenger RNA (mRNA) sequence.

CDS

coding region, coding sequence. CDS refers to the portion of a genomic DNA sequence that is translated, from the start codon to the stop codon, inclusively, if complete. A partial CDS lacks part of the complete CDS (it may lack either or both the start and stop codons). Successful translation of a CDS results in the synthesis of a protein.

CEPH

Centre d'Etude du Polymorphisme Humain

CGAP

Cancer Genome Anatomy Project. CGAP is an interdisciplinary program to identify the human genes expressed in different cancerous states, based on cDNA (EST) libraries, and to determine the molecular profiles of normal, precancerous, and malignant cells. The project is a collaboration among the NCI, the NCBI, and numerous research labs.

CGH

Comparative Genomic Hybridization. CGH is a fluorescent molecular cytogenetic technique that identifies chromosomal aberrations and maps these changes to metaphase chromosomes. CGH can be used to generate a map of DNA copy number changes in tumor genomes. CGH is based on quantitative two-color fluorescence in situ hybridization (FISH). DNA extracted from tumor cells is labeled in one color (e.g., green) and mixed in a 1:1 ratio with DNA from normal cells, which is labeled in a different color (e.g., red). The mixture is then applied to normal metaphase chromosomes. Portions of the genome that are equally represented in normal and tumor cells will appear orange, regions that are deleted in the tumor sample relative to the normal sample will appear red, and regions that are present in higher copy number in the tumor sample (because of amplification) will appear green. Special image analysis tools are necessary to quantitate the ratio of green-to-red fluorescence to determine whether a given region is more highly represented in the normal or in the tumor sample.

CGI

Common Gateway Interface. A mechanism that allows a Web server to run a program or script on the server and send the output to a Web browser.

cluster

A group that is created based on certain criteria. For example, a gene cluster may include a set of genes whose expression profiles are found to be similar according to certain criteria, or a cluster may refer to a group of clones that are related to each other by homology.

Cn3D

“See in 3-D” is a structure and sequence alignment viewer for NCBI databases. It allows viewing of 3-D structures and sequence–structure or structure–structure alignments. Cn3D can work as a helper application to the browser or as a client–server application that retrieves structure records from the Molecular Modeling Database (MMDB, see below) directly from the internet. The Cn3D homepage provides access to information on how to install the program, a tutorial to get started, and a comprehensive help document.

codon

Sequence of three nucleotides in DNA or mRNA that specifies a particular amino acid during protein synthesis; also called a triplet. Of the 64 possible codons, 3 are stop codons, which do not specify amino acids.
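The counts in this definition are easy to verify by enumeration. The sketch below assumes the standard genetic code, in which the stop codons are TAA, TAG, and TGA (written here in DNA letters); other genetic codes assign stops differently.

```python
from itertools import product

# All triplets over the four DNA bases.
CODONS = ["".join(triplet) for triplet in product("TCAG", repeat=3)]

# Stop codons of the standard genetic code (an assumption of this sketch).
STOP_CODONS = {"TAA", "TAG", "TGA"}

SENSE_CODONS = [codon for codon in CODONS if codon not in STOP_CODONS]

print(len(CODONS), len(STOP_CODONS), len(SENSE_CODONS))
```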

COGs

Clusters of Orthologous Groups (of proteins) were delineated by comparing protein sequences from completely sequenced genomes. Each COG consists of individual proteins or groups of paralogs from at least three lineages and thus corresponds to an ancient conserved domain.

consensus sequence

The nucleotides or amino acids found most commonly at each position in the sequences of homologous DNAs, RNAs, or proteins.
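For sequences that are already aligned and of equal length, a consensus can be computed column by column. This is a minimal sketch with a hypothetical function name; it ignores gaps and ambiguity codes, and when two residues tie in a column it simply keeps the one that was seen first.

```python
from collections import Counter


def consensus(sequences):
    """Most common residue at each position of equal-length aligned sequences."""
    if len({len(seq) for seq in sequences}) != 1:
        raise ValueError("sequences must be non-empty and of equal length")
    return "".join(
        Counter(column).most_common(1)[0][0]
        for column in zip(*sequences)
    )


result = consensus(["ACGT", "ACGA", "TCGT"])
```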

contig

A contiguous segment of the genome made by joining overlapping clones or sequences. A clone contig consists of a group of cloned (copied) pieces of DNA representing overlapping regions of a particular chromosome. A sequence contig is an extended sequence created by merging primary sequences that overlap. A contig map shows the regions of a chromosome where contiguous DNA segments overlap. Contig maps provide the ability to study a complete and often large segment of the genome by examining a series of overlapping clones, which then provide an unbroken succession of information about that region.

Coriell

Coriell Institute of Aging Cell Repository

CPU

Central Processing Unit. The CPU is the computational and control unit of a computer, the device that interprets and executes instructions.

CSS

Cascading Style Sheets. CSS specify the formatting details that control the presentation and layout of HTML and XML elements. CSS can be used for describing the formatting behavior and text decoration of simply structured XML documents but cannot display structure that varies from the structure of the source data.

Cubby

A tool of Entrez, the Cubby stores search strategies that may be updated at any time, stores LinkOut preferences to specify which LinkOut providers have to be displayed in PubMed, and changes the default document delivery service.

DCMS

Data Creation and Maintenance System

DDBJ

DNA Data Bank of Japan

definition line

A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The definition line or description line is distinguished from the sequence data by a “greater than” (>) symbol in the first column (see example); also DEFLINE, as in a flatfile.
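The layout described here can be read with a few lines of code. The sketch below is a simplified reader, not a full implementation of any official parser: `parse_fasta` is a hypothetical name, the record in the example is invented, and blank lines are skipped.

```python
def parse_fasta(text):
    """Return a list of (definition line, sequence) pairs from FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A '>' in the first column starts a new record.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>seq1 invented example record
ACGTACGT
ACGT
>seq2 another invented record
TTGA
"""
records = parse_fasta(example)
```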

DNA

Deoxyribonucleic acid is the chemical inside the nucleus of a cell that carries the genetic instructions for making living organisms. DNA is composed of two anti-parallel strands, each a linear polymer of nucleotides. Each nucleotide has a phosphate group linked by a phosphoester bond to a pentose (a five-carbon sugar molecule, deoxyribose), that in turn is linked to one of four organic bases, adenine, guanine, cytosine, or thymine, abbreviated A, G, C, and T, respectively. The bases are of two types: purines, which have two rings and are slightly larger (A and G); and pyrimidines, which have only one ring (C and T). Each nucleotide is joined to the next nucleotide in the chain by a covalent phosphodiester bond between the 5′ carbon of one deoxyribose group and the 3′ carbon of the next. DNA is a helical molecule with the sugar–phosphate backbone on the outside and the nucleotides extending toward the central axis. There is specific base-pairing between the bases on opposite strands in such a way that A always pairs with T and G always pairs with C.
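Because A pairs with T and G pairs with C, and the two strands run anti-parallel, the sequence of one strand determines the other. A minimal sketch, with a hypothetical function name and no handling of ambiguity codes:

```python
# Base-pairing rules: A<->T and G<->C.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Sequence of the opposite strand, read 5' to 3'."""
    return sequence.upper().translate(COMPLEMENT)[::-1]


opposite = reverse_complement("ATGC")
```

Applying the function twice returns the original sequence, which is a convenient sanity check.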

domain

A “domain” refers to a discrete portion of a protein assumed to fold independently of the rest of the protein and which possesses its own function.

draft sequence

Draft sequence refers to DNA sequence that is not yet finished but is generally of high quality (i.e., an accuracy of greater than 90%). Draft sequence data are mostly in the form of 10,000 base pair-sized fragments, the approximate chromosomal locations of which are known. The following keywords are associated with draft sequence: phase 0, light-pass coverage of a clone, generally only 1× coverage; phase 1, 4–10× coverage of a BAC clone (order and orientation of the fragments are unknown); and phase 2, 4–10× coverage of a BAC clone (order and orientation of the fragments are known). Phase 3 refers to the completely finished sequence.

DTD

Document Type Definition. The DTD is an optional part of the prolog of an XML document that defines the rules of the document. It sets constraints for an XML document by specifying which elements are present in the document and the relationships between elements, e.g., which tags can contain other tags, the number and sequence of the tags, and attributes of the tags. The DTD helps to validate the data when the receiving application does not have a built-in description of the incoming data.

DUST

A program for filtering low-complexity regions from nucleic acid sequences.

E-value

Expect value. The E-value is a parameter that describes the number of hits one can “expect” to see by chance when searching a database of a particular size. It decreases exponentially with the score (S) that is assigned to a match between two sequences. Essentially, the E-value describes the random background noise that exists for matches between sequences. For example, an E-value of 1 assigned to a hit can be interpreted as meaning that in a database of the current size, one might expect to see one match with a similar score simply by chance. This means that the lower the E-value, or the closer it is to “0”, the higher is the “significance” of the match. However, it is important to note that matches to short query sequences can be virtually identical and still have relatively high E-values. This is because the calculation of the E-value also takes into account the length of the query sequence, and shorter sequences have a higher probability of occurring in the database purely by chance. For more information, see the following tutorial.
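The exponential dependence on score can be made concrete with the relation E = m · n · 2^(−S′) that is commonly given for BLAST statistics, where S′ is the bit score, m the query length, and n the total database length. The sketch below assumes that relation and uses raw lengths rather than the corrected "effective" lengths a real search would use; the numbers are invented.

```python
def expect_value(bit_score, query_length, database_length):
    """Expected number of chance hits with at least this bit score."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Invented search: a 100-residue query against a 1,000,000-residue database.
e_at_20 = expect_value(20, 100, 1_000_000)
e_at_21 = expect_value(21, 100, 1_000_000)
e_at_40 = expect_value(40, 100, 1_000_000)
```

Each additional bit of score halves the E-value. A short query can only reach a modest bit score even when it matches perfectly, which is one way to see why such hits may still carry a relatively high E-value.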

EC number

A number assigned to a type of enzyme according to a scheme of standardized enzyme nomenclature developed by the Enzyme Commission of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (IUBMB). EC numbers may be found in ENZYME, the Enzyme nomenclature database, maintained at the ExPASy molecular biology server.

EMBL

European Molecular Biology Laboratory

Entrez

Entrez is a retrieval system for searching several linked databases. It provides access to the following NCBI databases: PubMed, GenBank, Protein, Structure, Genome, PopSet, OMIM, Taxonomy, Books, ProbeSet, 3D Domains, UniSTS, SNP, and CDD. (See the Entrez chapter or the Entrez web page.)

EST

Expressed Sequence Tag. ESTs are short (usually approximately 300–500 base pairs), single-pass sequence reads from cDNA. Typically, they are produced in large batches. They represent the genes expressed in a given tissue and/or at a given developmental stage. They are tags (some coding, others not) of expression for a given cDNA library. They are useful in identifying full-length genes and in mapping.

e-PCR

Electronic PCR is used to compare a query sequence to mapped sequence-tagged sites (STSs) to find a possible map location for the query sequence. e-PCR finds STSs in DNA sequences by searching for subsequences that closely match the PCR primers present in mapped markers. The subsequences must have the correct order, orientation, and spacing such that they could plausibly prime the amplification of a PCR product of the correct molecular weight.
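The order, orientation, and spacing test can be sketched as a simple string search. This is a deliberately simplified illustration, not the e-PCR program itself: it requires exact primer matches (the real tool tolerates close matches), searches one strand only, and the function name, the `tolerance` parameter, and the example sequences are all invented.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def find_sts(sequence, forward, reverse, expected_size, tolerance=50):
    """Return (start, product size) for primer pairs that could amplify an STS.

    forward must occur first; reverse must occur downstream as its
    reverse complement; the implied product size must be within
    tolerance of expected_size.
    """
    reverse_site = reverse.translate(COMPLEMENT)[::-1]
    hits = []
    start = sequence.find(forward)
    while start != -1:
        end = sequence.find(reverse_site, start + len(forward))
        while end != -1:
            size = end + len(reverse_site) - start
            if abs(size - expected_size) <= tolerance:
                hits.append((start, size))
            end = sequence.find(reverse_site, end + 1)
        start = sequence.find(forward, start + 1)
    return hits


# Invented template: forward primer, 10-base spacer, reverse primer site.
template = "TT" + "ACGTAC" + "G" * 10 + "TTCAAG" + "AA"
hits = find_sts(template, "ACGTAC", "CTTGAA", expected_size=22, tolerance=2)
```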

epub citation

“Ahead-of-print” citation. PubMed now accepts citations from publishers for articles that have been published electronically ahead of the printed issue. PubMed displays the category “[epub ahead of print]” in the part of the citation where the volume and pagination would ordinarily display. For example: Proc Natl Acad Sci USA. 2000 May 2 [epub ahead of print].

ExoFish

Exon Finding by Sequence Homology. Exofish is a tool based on homology searches for the rapid and reliable identification of human genes. It relies on the sequence of another vertebrate, the pufferfish Tetraodon nigroviridis (similar to Fugu), to detect conserved sequences with a very low background. The genome of T. nigroviridis is eight times more compact than the human genome and has been used in the comparative identification of human genes from the rough draft of the human genome ( Roest Crollius et al., Nat Genet 25:235-238; 2000).

exon

Refers to the portion of a gene that encodes for a part of that gene's mRNA. A gene may comprise many exons, some of which may include only protein-coding sequence; however, an exon may also include 5' or 3' untranslated sequence. Each exon codes for a specific portion of the complete protein. In some species (including humans), a gene's exons are separated by long regions of DNA (called introns or sometimes “junk DNA”) that often have no apparent function but have been shown to encode small untranslated RNAs or regulatory information. (See also splice sites.)

exon-trapped

Exon trapping is a technique for cloning exon sequences from genomic DNA by selecting for functional splice sites, relying on the cellular splicing machinery. The genomic DNA containing the putative exon(s) is cloned into an exon-trap vector, which has a promoter, polyadenylation signals, and splice sites, and then transfected into a cell line. If there are functional splice sites in the genomic DNA fragment, the segments of DNA between the splice sites will be removed. Total RNA is isolated and reverse-transcribed. After cDNA synthesis and PCR amplification, the exon of interest is cloned.

ExPASy

Expert Protein Analysis System is a proteomics server of the Swiss Institute of Bioinformatics (SIB).

FASTA

The first widely used algorithm for similarity searching of protein and DNA sequence databases. The program looks for optimal local alignments by scanning the sequence for small matches called “words”. Initially, the scores of segments in which there are multiple word hits are calculated (“init1”). Later, the scores of several segments may be summed to generate an “initn” score. An optimized alignment that includes gaps is shown in the output as “opt”. The sensitivity and speed of the search are inversely related and controlled by the “k-tup” variable, which specifies the size of a “word” (Pearson and Lipman). Also refers to a format for a nucleic acid or protein sequence.

fingerprint

The pattern of bands on a gel produced by a clone when restricted by a particular enzyme, such as HindIII.

finished sequence

High-quality, low-error DNA sequence that is free of gaps. To qualify as a finished sequence, only a single error out of every 10,000 bases (i.e., an accuracy of 99.99%) is allowed.

FISH

Fluorescence in situ hybridization. In this technique, fluorescent molecules are used to label a DNA probe, which can then hybridize to a specific DNA sequence in a chromosome spread so that the site becomes visible through a microscope. FISH has been used to highlight the locations of genes, subchromosome regions, entire chromosomes, or specific DNA sequences. It has been used for mapping and the detection of genomic rearrangements, as well as studies on DNA replication.

flatfile or flat file

A flat file is a data file that contains records (each corresponding to a row in a table); however, these records have no structured relationships. To interpret these files, the format properties of the file should be known. For example, a database management system may allow the user to export data to a comma-delimited file. Such a file is called a flat file because it has no inherent information about the data, and interpretation requires additional information. Files in a database management system have more complex storage structures.

freeze

To copy changing data so as to preserve the dataset as it existed at a particular point in time. Also used to refer to the resulting set of frozen data.

FTP

File Transfer Protocol. A method of retrieving files over a network directly to the user's computer or to his/her home directory using a set of protocols that govern how the data are to be transported.

gap

A gap is a space introduced into an alignment to compensate for insertions and deletions in one sequence relative to another. To prevent the accumulation of too many gaps in an alignment, introduction of a gap causes the deduction of a fixed amount (the gap score) from the alignment score. Extension of the gap to encompass additional nucleotides or amino acid is also penalized in the scoring of an alignment. (See the figure for more information.)
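The scoring scheme described here, a fixed deduction for introducing a gap plus a further deduction for each position it is extended, is often called an affine gap penalty. The sketch below follows the wording of this entry; the default values are invented, and individual programs may define the opening and extension costs slightly differently.

```python
def gap_penalty(length, gap_open=11, gap_extend=1):
    """Total deduction for one gap spanning `length` positions.

    gap_open is charged once when the gap is introduced;
    gap_extend is charged for each additional position.
    """
    if length <= 0:
        return 0
    return gap_open + gap_extend * (length - 1)


one_long_gap = gap_penalty(3)
three_short_gaps = 3 * gap_penalty(1)
```

With these values, one gap of three positions costs far less than three separate single-position gaps, which is how the scheme discourages the accumulation of many gaps in an alignment.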

GB

gigabytes

GBFF

GenBank Flat File. Refers to the GenBank flat file format; files in this format carry the extension .gbff.

GenBank

GenBank is a database of nucleotide sequences from more than 100,000 organisms. Records that are annotated with coding region features also include amino acid translations. GenBank belongs to an international collaboration of sequence databases that also includes EMBL and DDBJ. [See the GenBank chapter (Chapter 1) or the GenBank web page.]

genetic code

The instructions in a gene that tell the cell how to make a specific protein. A, T, G, and C are the “letters” of the DNA code; they stand for the chemicals adenine, thymine, guanine, and cytosine, respectively, that make up the nucleotide bases of DNA. Each gene's code combines the four chemicals in various ways to spell out three-letter “words” that specify which amino acid is needed at every position for making a protein.

GenomeScan

A gene identification algorithm that is used to identify exon–intron structures in genomic DNA sequence.

genotype

The genetic identity of an individual that does not show as outward characteristics. The genotype refers to the pair of alleles for a given region of the genome that an individual carries.

GEO

Gene Expression Omnibus. GEO is a gene expression data repository and online resource for the retrieval of gene expression data from any organism or artificial source. Many types of gene expression data from platform types, such as spotted microarray, high-density oligonucleotide array, hybridization filter, and serial analysis of gene expression (SAGE) data, are accepted, accessioned, and archived as a public dataset. [See the GEO chapter (Chapter 6) or the GEO web page.]

GI

The GenInfo Identifier is a sequence identification number for a nucleotide sequence. If a nucleotide sequence changes in any way, a new GI number will be assigned. A separate GI number is also assigned to each protein translation within a nucleotide sequence record, and a new GI is assigned if the protein translation changes in any way. GI sequence identifiers run parallel to the new accession.version system of sequence identifiers (see the description of Version).

GSS

Genome Survey Sequences are analogous to ESTs except that the sequences are genomic in origin, rather than cDNA (mRNA). The GSS division of GenBank contains (but is not limited to) the following types of data: random “single-pass read” genome survey sequences, cosmid/BAC/YAC end sequences, exon-trapped genomic sequences, and Alu-PCR sequences.

heterozygosity

The probability that a diploid individual will have two different alleles at a particular genome locus. These individuals are defined as heterozygous, whereas individuals who have two identical alleles at the locus are defined as homozygous. The probability can be estimated by sampling a representative number of individuals from the population and dividing the number of heterozygotes by the total number sampled.
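The sampling estimate described above amounts to a simple proportion. A minimal sketch, assuming each sampled individual is represented as a pair of allele labels at the locus of interest (the function name is hypothetical):

```python
def estimate_heterozygosity(genotypes):
    """Estimate heterozygosity at one locus from sampled genotypes.

    genotypes is a sequence of (allele_1, allele_2) pairs, one pair per
    diploid individual. The estimate is the number of heterozygotes
    divided by the total number of individuals sampled.
    """
    if not genotypes:
        raise ValueError("at least one genotype is required")
    heterozygotes = sum(1 for a, b in genotypes if a != b)
    return heterozygotes / len(genotypes)


if __name__ == "__main__":
    sample = [("A", "G"), ("A", "A"), ("G", "G"), ("G", "A")]
    print(estimate_heterozygosity(sample))
```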

HIV

Human Immunodeficiency Virus. HIV-1 is a retrovirus that is recognized as the causative agent of AIDS (Acquired Immunodeficiency Syndrome).

HNPCC

Hereditary nonpolyposis colon cancer

homogeneously staining region

A region of the chromosome identified cytologically by DNA staining or the FISH technique because of the presence of multiple copies of a subchromosomal region resulting from amplification.

homologous

The term refers to similarity attributable to descent from a common ancestor. Homologous chromosomes are members of a pair of essentially identical chromosomes, each derived from one parent. They have the same or allelic genes with genetic loci arranged in the same order. Homologous chromosomes synapse during meiosis.

HTGS

High-Throughput Genomic Sequences. HTGS come from large-scale genome sequencing centers; unfinished sequences are in phases 0, 1, and 2, and finished sequences are in phase 3.

HTGS_CANCELLED

A keyword added to GenBank entries by sequencing centers to indicate that work has stopped on a clone and that the existing sequence will not be finished. Sequencing centers may stop work because the clone is redundant or for various other reasons.

HTGS_PHASE0, HTGS_PHASE1, HTGS_PHASE2, HTGS_PHASE3

Keywords added to GenBank entries by sequencing centers to indicate the status (phase) of the sequence (see phase definitions described under draft sequence).

HTML

Hypertext Markup Language. HTML is derived from SGML. It is a text-based mark-up language and is used primarily to display information using a web browser and to link pieces of information via hyperlinks. The tags used in an HTML document provide information only on how the content is to be displayed but do not provide information about the content they encompass.

HUP

Hold Until Published. HUP is the release category for electronically submitted data that is to be withheld from the public until the associated work is published.

ICBN

International Code of Botanical Nomenclature

ICD

International Classification of Diseases

ICD-O-3

International Classification of Diseases for Oncology, 3rd edition

ICNB

International Code of Nomenclature of Bacteria

ICNCP

International Code of Nomenclature for Cultivated Plants

ICTV

International Committee on Taxonomy of Viruses

ICVCN

International Code of Virus Classification and Nomenclature

ICZN

International Code of Zoological Nomenclature

ideogram

A diagrammatic representation of the karyotype of an organism.

IMAGE Consortium

Integrated Molecular Analysis of Genomes and their Expression. A consortium of academic groups that share high-quality, arrayed cDNA libraries and place sequence, map, and expression data of the clones in these arrays into the public domain. With the use of this information, unique clones can be rearrayed to form a “master array”, with the aim of ultimately having a representative cDNA from every gene in the genome under study. To date, human, mouse, rat, zebrafish, and Xenopus laevis genomes have been studied.

intron

Refers to that portion of the DNA sequence that is present in the primary transcript and that is removed by splicing during RNA processing and is not included in the mature, functional mRNA, rRNA, or tRNA. Also called an intervening sequence. (See also splice sites.)

ISAM

Indexed Sequential-Access Method. ISAM is a database access method. It allows data records in a database to be accessed either sequentially (in the order in which they were entered) or randomly (using an index). In the index, each record has a unique key that enables its rapid location. The key is the field used to reference the record.
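The two access paths can be illustrated with a small in-memory sketch. This is not an implementation of any real ISAM file format; the class `TinyIsamTable` and its methods are hypothetical and only mirror the idea of records kept in entry order plus an index of unique keys.

```python
class TinyIsamTable:
    """Toy table with sequential and indexed (keyed) access."""

    def __init__(self):
        self._records = []  # (key, record) pairs in the order entered
        self._index = {}    # unique key -> position in self._records

    def insert(self, key, record):
        if key in self._index:
            raise KeyError(f"duplicate key: {key!r}")
        self._index[key] = len(self._records)
        self._records.append((key, record))

    def scan(self):
        """Sequential access: yield (key, record) in entry order."""
        yield from self._records

    def lookup(self, key):
        """Random access: use the index to jump straight to a record."""
        return self._records[self._index[key]][1]


if __name__ == "__main__":
    table = TinyIsamTable()
    table.insert("rec2", "second key, entered first")
    table.insert("rec1", "first key, entered second")
    print(list(table.scan()))
    print(table.lookup("rec1"))
```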

ISCN

International System for Human Cytogenetic Nomenclature

ISO

International Organization for Standardization

ISSN

International Standard Serial Number. The ISSN is an eight-digit number that identifies periodical publications, including electronic serials.
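The eighth character of an ISSN is a check character, which makes a simple validity test possible. The weighting rule used below (weights 8 down to 2, modulus 11, with X standing for 10) comes from the ISSN standard rather than from this glossary, and the function name is hypothetical.

```python
def issn_is_valid(issn):
    """Return True if the ISSN's check character matches its first 7 digits."""
    chars = issn.replace("-", "").upper()
    if len(chars) != 8 or not chars[:7].isdigit():
        return False
    # Weight the first seven digits 8, 7, ..., 2 and sum the products.
    total = sum(int(d) * w for d, w in zip(chars[:7], range(8, 1, -1)))
    check = (11 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    return chars[7] == expected


if __name__ == "__main__":
    print(issn_is_valid("0378-5955"))
    print(issn_is_valid("0378-5954"))
```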

karyotype

The particular chromosome complement of an individual or a related group of individuals, as defined by both the number and morphology of the chromosomes, usually in mitotic metaphase, and arranged by pairs according to the standard classification.

LANL

Los Alamos National Lab

LIMS

Laboratory Information Management Systems. LIMS comprise software that helps biological and chemical laboratories handle data generation, information management, and data archiving.

LinkOut

A registry service to create links from specific articles, journals, or biological data in Entrez to resources on external web sites. Third parties can provide a URL, resource name, brief description of their web sites, and specification of the NCBI data from which they would like to establish links. The specification can be written as a valid Boolean query to Entrez or as a list of identifiers for specific articles or sequences. Entrez PubMed users can then select which external links are visible in their searches through the NCBI Cubby service (see above). (See the LinkOut chapter or web page.)

locus

In a genomic context, locus refers to a position on a chromosome. It may, therefore, refer to a marker, a gene, or any other landmark that can be described.

LocusID

Each new LocusLink record is assigned a unique identifying number—a LocusID (although coding regions on genomic sequences found by gene prediction software are an exception to this).

LocusLink

LocusLink provides a single query interface to curated sequence and descriptive information about genetic loci. LocusLink issues a stable ID (LocusID) for each locus and presents information on official nomenclature, aliases, sequence Accession numbers, phenotypes, EC numbers, OMIM numbers, UniGene clusters, map information, and relevant web sites. LocusLink is a collaborative effort among NCBI, Human Gene Nomenclature Committee, OMIM, and others. LocusLink currently contains human, mouse, rat, zebrafish, and fruit fly loci; organisms can be searched together or separately. (See the LocusLink chapter or web page.)

MACAW

Multiple Alignment Construction and Analysis Workbench. MACAW is a program for locating, analyzing, and editing blocks of localized sequence similarity among multiple sequences and linking them into a composite multiple alignment.

Map Viewer

The Map Viewer is a software component of Entrez Genomes that provides special browsing capabilities for a subset of organisms. It allows one to view and search an organism's complete genome, display chromosome maps, and zoom into progressively greater levels of detail, down to the sequence data for a region of interest. If multiple maps are available for a chromosome, it displays them aligned to each other based on shared marker and gene names and, for the sequence maps, based on a common sequence coordinate system. The organisms currently represented in the Map Viewer are listed in the Entrez Map Viewer help document, which provides general information on how to use that tool. The number and types of available maps vary by organism and are described in the “data and search tips” file provided for each organism.

MB

megabytes

MEDLINE

MEDLINE is NLM's database of indexed journal citations and abstracts in the fields of biomedicine and healthcare. It encompasses nearly 4,500 journals published in the United States and more than 70 other countries. (For more information, see the Fact Sheet.)

MegaBLAST

MegaBLAST is a program for aligning sequences that differ slightly as a result of sequencing or other similar “errors”. When larger word size is used, it is up to 10 times faster than more common sequence-similarity programs. MegaBLAST is also able to efficiently handle much longer DNA sequences than the blastn program of the traditional BLAST algorithm. It uses the GREEDY algorithm for a nucleotide sequence alignment search.

MeSH

Medical Subject Headings. MeSH refers to the controlled vocabulary of NLM used for indexing articles in PubMed. MeSH terminology provides a consistent way to retrieve information that may use different terminology for the same concepts. (See the MeSH homepage.)

Metathesaurus

Metathesaurus is a National Cancer Institute browser containing different biomedical vocabularies, including the International Classification of Diseases for Oncology (ICD-O-3).

mFASTA

Multi-FASTA format.

MGC

Mammalian Gene Collection. MGC is a project of the NIH to provide a complete set of full-length (open reading frame) sequences and cDNA clones of expressed genes for human and mouse.

MGD

Mouse Genome Database. MGD contains information on mouse genetic markers, molecular segments, phenotypes, comparative mapping data, experimental mapping data, and graphical displays for genetic, physical, and cytogenetic maps.

MGI

Mouse Genome Informatics. MGI houses a database that provides integrated access to data on the genetics, genomics, and biology of the laboratory mouse.

microsatellite

Repetitive stretches of short sequences of DNA used as genetic markers to track inheritance in families (e.g., CC[TATATATA]CCCT). Also known as short tandem repeats (STRs).
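A rough way to spot such repeats in a sequence string is a regular expression with a backreference. The unit-length range (2 to 6 bases) and the minimum of three copies are assumptions made for this sketch, not thresholds stated in this glossary, and the function name is hypothetical.

```python
import re


def find_tandem_repeats(seq, unit_min=2, unit_max=6, min_copies=3):
    """Return (start, unit, copies) for each run of a short tandem repeat.

    A run is a unit of unit_min..unit_max bases repeated back to back
    at least min_copies times. The shortest repeating unit is preferred.
    """
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (unit_min, unit_max, min_copies - 1)
    )
    return [
        (m.start(), m.group(1), len(m.group(0)) // len(m.group(1)))
        for m in pattern.finditer(seq.upper())
    ]


if __name__ == "__main__":
    # The example from the definition: CC[TATATATA]CCCT
    print(find_tandem_repeats("CCTATATATACCCT"))
```

For the example sequence in the definition, the function reports the TA unit repeated four times starting at offset 2.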

MIM

Mendelian Inheritance in Man. First published in 1966, Mendelian Inheritance in Man (MIM) is a genetic knowledge base that serves clinical medicine and biomedical research, including the Human Genome Project.

minimal tiling path

An ordered list or map that defines the minimal set of overlapping clones needed to provide complete coverage of a chromosome or other extended segment of DNA (compare with tiling path).

MMDB

Molecular Modeling Database. MMDB is a database of three-dimensional biomolecular structures derived from X-ray crystallography and nuclear magnetic resonance (NMR) spectroscopy.

MMDB-ID

Molecular Modeling Database Accession number.

mRNA

messenger RNA. mRNA describes the section of a genomic DNA sequence that is transcribed, and can include the 5' untranslated region (5'UTR), CDS, and 3' untranslated region (3'UTR). Successful translation of the CDS section of an mRNA results in the synthesis of a protein.

mutation

A permanent structural alteration in DNA. In most cases, DNA changes either have no effect or cause harm, but occasionally a mutation can improve an organism's chance of surviving, and the beneficial change is passed on to the organism's descendants. Typically, mutations are rarer than polymorphisms in population samples because natural selection acts against their lower fitness and removes them from the population.

NCBI

National Center for Biotechnology Information

NCBI Toolkit

Contains supported software tools from the Information Engineering Branch (IEB) of the NCBI. The NCBI Toolkit describes the three components of the ToolBox: data model, data encoding, and programming libraries. Provides access to documentation for the DataModel, C Toolkit, C++ Toolkit, NCBI C Toolkit Source Browser, XML Demo Program, XML DTDs, and the FTP site.

NCI

National Cancer Institute

NEXUS

NEXUS refers to a file format designed to contain data for processing by computer programs. NEXUS files should end with .nxs or .nex for purposes of clarity (Maddison et al., Syst Biol 46:590-621; 1997).

NIH

National Institutes of Health

NLM

National Library of Medicine

NMR

Nuclear Magnetic Resonance. NMR is a spectroscopic technique used for the determination of protein structure.

nr-PDB

non-redundant Protein Data Bank

OMIM

Online Mendelian Inheritance in Man. OMIM is a directory of human genes and genetic disorders, with links to literature references, sequence records, maps, and related databases.

ortholog

Orthologs are genes in different species that derive from a single ancestral gene in the last common ancestor of the respective species.

orthology

Orthology describes genes in different species that derive from a common ancestor, i.e., they are direct evolutionary counterparts.

paralog

A paralog is one of a set of homologous genes that have diverged from each other as a consequence of gene duplication. For example, the mouse α-globin and β-globin genes are paralogs. The relationship between mouse α-globin and chick β-globin is also considered paralogous (see the figure).

paralogy

Paralogy describes the relationship of homologous genes that arose by gene duplication.

PCR

Polymerase Chain Reaction. A technique for amplifying a specific DNA segment in a complex mixture. Also present in the DNA mixture are short oligonucleotide primers to the DNA segment of interest and reagents for DNA synthesis. PCR relies on the ability of DNA to separate into its two complementary strands at high temperature (a process called denaturation) and for the two strands to anneal at an optimal lower temperature (annealing). The annealing phase is followed by a DNA synthesis step at an optimal temperature for a heat-stable DNA polymerase. After multiple rounds of denaturation, annealing, and DNA synthesis, the DNA sequence specified by the oligonucleotide primers is amplified.

PDB

Protein Data Bank. The PDB is a database for 3D macromolecular structure data.

Pfam

Pfam is a database housing a large collection of multiple sequence alignments and hidden Markov models covering many common protein domains.

phenotype

The observable traits or characteristics of an organism, e.g., hair color, weight, or the presence or absence of a disease. Phenotypic traits are not necessarily genetic.

PHRAP

A computer program that assembles raw sequence into sequence contigs (see above) and assigns to each position in the sequence an associated “quality score”, on the basis of the PHRED scores of the raw sequence reads. A PHRAP quality score of X corresponds to an error probability of approximately 10^(-X/10). Thus, a PHRAP quality score of 30 corresponds to 99.9% accuracy for a base in the assembled sequence.

PHRED

A computer program that analyses raw sequence to produce a “base call” with an associated “quality score” for each position in the sequence. A PHRED quality score of X corresponds to an error probability of approximately 10^(-X/10). Thus, a PHRED quality score of 30 corresponds to 99.9% accuracy for the base call in the raw read.
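The relationship between a quality score and accuracy, which applies to both PHRED and PHRAP scores, can be checked with a few lines of arithmetic. The function names are hypothetical; the formula is the one given in the definitions, an error probability of 10 raised to the power -X/10.

```python
def quality_to_error_probability(quality):
    """Error probability for a PHRED/PHRAP-style quality score."""
    return 10 ** (-quality / 10)


def quality_to_accuracy_percent(quality):
    """Per-base accuracy, as a percentage, for a quality score."""
    return 100 * (1 - quality_to_error_probability(quality))


if __name__ == "__main__":
    for q in (10, 20, 30):
        print(q, quality_to_error_probability(q), quality_to_accuracy_percent(q))
```

Each increase of 10 in the score lowers the error probability tenfold.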

phyletic pattern

Pattern of presence–absence of a cluster of orthologs (COG) in different species.

PHYLIP

PHYLogeny Inference Package. A package of programs for various computer platforms to infer phylogenies or evolutionary trees, freely available from the Web.

PIR

Protein Information Resource

PMC

PubMed Central. NLM's digital archive of life sciences journal literature.

PMID

PubMed ID number

PNG

Portable Network Graphics. An extensible file format for the lossless, well-compressed storage of raster images (images that are composed of horizontal lines of pixels, such as those created by a computer screen). Compression of image, media, and application files is necessary to reduce the transmission time across the web. The technique of lossless compression reduces the size of the file without sacrificing any original data, and the image after expansion is exactly as it was before compression. PNG overcomes the patent issues of GIF (Graphic Interchange Format) and can replace many common uses of TIFF (Tagged Image File Format). Several features such as indexed color, grayscale, and truecolor are supported, as well as an optional alpha-channel. PNG is designed to work well in online viewing applications and is supported as an image standard by the WWW.

poly A

A string of adenylic acid residues that are added to the 3′ end of the primary mRNA transcript. Poly(A) polymerase is the enzyme that adds the poly A tail, which is between 100 and 250 bases long.

polymorphism

A common variation in the sequence of DNA among individuals. Genetic variations occurring in more than 1% of the population would be considered useful polymorphisms for genetic linkage analysis.

polypeptide

Linear polymer of amino acids connected by peptide bonds. Proteins are large polypeptides, and the two terms are commonly used interchangeably.

PRF

Protein Research Foundation

private polymorphism

Variations that are only common in specific populations. Usually such populations are reproductively isolated from other, larger groups. These variations may be completely absent in other groups.

ProtEST

A database of protein sequences from eight organisms: human (Homo sapiens), mouse (Mus musculus), rat (Rattus norvegicus), fruitfly (Drosophila melanogaster), worm (Caenorhabditis elegans), yeast (Saccharomyces cerevisiae), plant (Arabidopsis thaliana), and bacteria (Escherichia coli). (See the ProtEST web page.)

PROW

Protein Reviews On the Web. An online resource that features PROW Guides—authoritative, short, structured reviews on proteins and protein families. The Guides provide approximately 20 standardized categories of information (abstract, biochemical function, ligands, references, etc.) for each protein.

pseudogene

A sequence of DNA that is very similar to a normal gene but that has been altered slightly so that it is not expressed. Such genes were probably once functional but, over time, acquired one or more mutations that rendered them incapable of producing a protein product.

PSI-BLAST

Position-Specific Iterated BLAST. PSI-BLAST (Altschul et al., J Mol Biol 215:403-410; 1990) is used for iterative protein–sequence similarity searches using a position-specific score matrix (PSSM). It is a program for searching protein databases using protein queries to find other members of the same protein family. All statistically significant alignments found by BLAST are combined into a multiple alignment, from which a PSSM is constructed. This matrix is used to search the database for additional significant alignments, and the process may be iterated until no new alignments are found.

PSSM

Position-Specific Score Matrix. The PSSM gives the log-odds score for finding a particular matching amino acid in a target sequence.

PubMed

A retrieval system containing citations, abstracts, and indexing terms for journal articles in the biomedical sciences. It includes literature citations supplied directly to NCBI by publishers as well as URLs to full text articles on the publishers' web sites. PubMed contains the complete contents of the MEDLINE and PREMEDLINE databases. It also contains some articles and journals considered out of scope for MEDLINE, based on either content or on a period of time when the journal was not indexed and, therefore, is a superset of MEDLINE.

PXML

PubMed Central XML file

QBLAST

A queuing system to BLAST that allows users to retrieve their results at their convenience and format their results multiple times with different formatting options.

QTL

Quantitative Trait Locus. A QTL is a hypothesis that a certain region of the chromosome contains genes that contribute significantly to the expression of a complex trait. QTLs are generally identified by comparing the linkage of polymorphic molecular markers and phenotypic trait measurements. The density of the linkage map is important in the accurate and precise location of QTLs; the higher the map density, the more precise the location of the putative QTL, although there is increased likelihood that false positives will be detected. Once QTLs have been mapped to a relatively small chromosomal region, other molecular methods can be used to isolate specific genes.

RCSB

Research Collaboratory for Structural Bioinformatics. RCSB is a nonprofit consortium that works toward the elucidation of biological, macromolecular, 3-D structures.

Reciprocal best hits

Reciprocal best hits are proteins from different organisms that are each other's top BLAST hit, when the proteomes from those organisms are compared to each other. For example, proteins A–Z in organism 1 are compared against proteins AA–ZZ in organism 2. If protein A has a best hit to protein RR, and RR's best hit, when it is compared to all the proteins in organism 1, also turns out to be protein A, then A and RR are reciprocal best hits. However, if RR's best hit is to B rather than to A, then A and RR are not reciprocal best hits.
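The test in the example above can be sketched as a lookup in both directions. The sketch assumes the best hit for each protein has already been extracted from the BLAST results into two dictionaries; the function name and the dictionary layout are hypothetical.

```python
def reciprocal_best_hits(best_1_to_2, best_2_to_1):
    """Return sorted (protein_1, protein_2) pairs that are mutual best hits.

    best_1_to_2 maps each protein in organism 1 to its top hit in
    organism 2; best_2_to_1 maps in the opposite direction.
    """
    return sorted(
        (a, b) for a, b in best_1_to_2.items() if best_2_to_1.get(b) == a
    )


if __name__ == "__main__":
    # A's best hit is RR and RR's best hit is A: a reciprocal pair.
    print(reciprocal_best_hits({"A": "RR"}, {"RR": "A"}))
    # RR's best hit is B instead: A and RR are not reciprocal best hits.
    print(reciprocal_best_hits({"A": "RR"}, {"RR": "B"}))
```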

RefSeq

RefSeq is the NCBI database of reference sequences; a curated, non-redundant set including genomic DNA contigs, mRNAs and proteins for known genes, and entire chromosomes.

RepeatMasker

Program that screens DNA sequences for interspersed repeats and low-complexity DNA sequences.

RFLP

Restriction Fragment Length Polymorphism. Genetic variations at the site where a restriction enzyme cuts a piece of DNA. Such variations affect the size of the resulting fragments. These sequences can be used as markers on physical maps and linkage maps. RFLP is also pronounced “rif lip”.

RH map

Radiation Hybrid map. A genome map in which STSs are positioned relative to one another on the basis of the frequency with which they are separated by radiation-induced breaks. The frequency is assayed by analyzing a panel of human–hamster hybrid cell lines. These hybrids are produced by irradiating human cells, which damages the cells and fragments the DNA. The dying human cells are fused with thymidine kinase negative (TK−) live hamster cells. The fused cells are grown under conditions that select against hamster cells and favor the growth of hybrid cells that have taken up the human TK gene. In the RH maps, the unit of distance is centirays (cR), denoting a 1% chance of a break occurring between two loci.

RNA

Ribonucleic Acid. A single-stranded nucleic acid, similar to DNA, but having a ribose sugar, instead of deoxyribose, and uracil instead of thymine as one of its bases.

RPS-BLAST

Reverse Position-Specific BLAST. A program used to identify conserved domains in a protein query sequence. It does this by comparing a query protein sequence to position-specific score matrices (PSSMs) that have been prepared from conserved domain alignments. RPS-BLAST is a “reverse” version of position-specific iterated BLAST (PSI-BLAST); however, RPS-BLAST compares a query sequence against a database of profiles prepared from ready-made alignments, whereas PSI-BLAST builds alignments starting from a single protein sequence.

SAGE

Serial Analysis of Gene Expression. An experimental technique designed to quantitatively measure gene expression.

Sequin

Sequin is a stand-alone software tool developed by the NCBI for submitting and updating entries to the GenBank, EMBL, or DDBJ sequence databases. It is capable of handling simple submissions that contain a single, short mRNA sequence and complex submissions containing long sequences, multiple annotations, segmented sets of DNA, or phylogenetic and population studies.

SGD

Saccharomyces Genome Database. A database for the molecular biology and genetics of Saccharomyces cerevisiae, also known as baker's yeast.

SGML

Standard Generalized Markup Language. The international standard for specifying the structure and content of electronic documents. SGML is used for the markup of data in a way that is self-describing. SGML is not a language but a way of defining languages that are developed along its general principles. A subset of SGML called XML is more widely used for the markup of data. HTML (Hypertext Markup Language) is based on SGML and uses some of its concepts to provide a universal markup language for the display of information and the linking of different pieces of that information.

SKY

Spectral Karyotyping. SKY is a technique that allows for the visualization of all of an organism's chromosomes together, each labeled with a different color. This is achieved by using chromosome-specific, single-stranded DNA probes (each labeled with a different fluorophore) to hybridize or bind to the chromosomes of a cell, resulting in each chromosome being painted a different color. This technique is useful for identifying chromosome abnormalities because it is easy to spot instances where a chromosome painted in one color has a small piece of another chromosome, painted in a different color, attached to it. (Also see FISH, CGH.)

SKYGRAM

1. A software tool to automatically convert the short-form karyotype into an image representation of a cell or clone, with each chromosome displayed in a different color, with band overlay. The program will also incorporate the number of cells for each structural abnormality, which is displayed in brackets. 2. The full ideogram of a cell or clone, with each chromosome displayed in a different color, with band overlay.

SMART

Simple Modular Architecture Research Tool. A tool to allow automatic identification and annotation of domains in user-supplied protein sequences. For example, the SWISS-PROT database is an extensively annotated and nonredundant collection of protein sequences. SWISS-PROT annotations have been mined for SMART-derived annotations of alignments.

SMD

Stanford Microarray Database. SMD stores raw and normalized data from microarray experiments, as well as their corresponding image files. In addition, the SMD provides interfaces for data retrieval, analysis, and visualization. Data are released to the public at the researcher's discretion or upon publication.

SNP

Single Nucleotide Polymorphism. Common, but minute, variations that occur in human DNA at a frequency of about 1 in every 1,000 bases. An SNP is a single base-pair site within the genome at which more than one of the four possible base pairs is commonly found in natural populations. Several hundred thousand SNP sites are being identified and mapped on the sequence of the genome, providing the densest possible map of genetic differences. SNP is pronounced “snip”.

SOFT

Simple Omnibus Format in Text. SOFT is an ASCII text format that was designed to be a machine-readable representation of data retrieved from, or submitted to, the Gene Expression Omnibus (GEO). SOFT is also a line-based format, making it easy to parse, using commonly available text processing and formatting languages. (For examples of SOFT, see the guide.)

splice sites

Refers to the location of the exon-intron junctions in a pre-mRNA (i.e., the primary transcript that must undergo additional processing to become a mature RNA for translation into a protein). Splice sites can be determined by comparing the sequence of genomic DNA with that of the cDNA sequence. In mRNA, introns (non-protein coding regions) are removed by the splicing machinery; however, exons can also be removed. Depending on which exons (or parts of exons) are removed, different proteins can be made from the same initial RNA or gene. Different proteins created in this way are “splice variants” or “alternatively spliced”.

SSAHA

Sequence Search and Alignment by Hashing Algorithm. SSAHA is a software tool for very fast matching and alignment of DNA sequences and is used for searching databases containing large amounts (gigabases) of genome sequence. It achieves its fast search speed by converting sequence information into a “hash table” data structure, which can then be searched very rapidly for matches (Ning et al., Genome Res 11:1725-1729; 2001).
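The general idea of a hash-table lookup for sequence matching can be sketched as follows. This is a simplified illustration of k-mer indexing, not the published SSAHA algorithm: it indexes every overlapping k-mer, uses a plain Python dictionary, and reports only exact seed matches. The function names and the value of k are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(sequences, k):
    """Map every k-mer to the (sequence_name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in sequences.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index


def find_seed_matches(index, query, k):
    """Return (query_pos, sequence_name, subject_pos) for each shared k-mer."""
    hits = []
    for qpos in range(len(query) - k + 1):
        for name, spos in index.get(query[qpos:qpos + k], []):
            hits.append((qpos, name, spos))
    return hits


if __name__ == "__main__":
    index = build_kmer_index({"s1": "ACGTACGA"}, k=4)
    print(find_seed_matches(index, "TACGA", k=4))
```

Building the index takes one pass over the database; each query k-mer is then found with a single dictionary lookup instead of a scan of the database.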

SSLP

Simple Sequence Length Polymorphisms. SSLPs are markers based on the variation in the number of short tandem repeats in DNA.

STS

Sequence Tagged Site. A short DNA segment that occurs only once in the human genome and whose exact location and order of bases are known. Because each is unique, STSs are helpful for chromosome placement of mapping and sequencing data from many different laboratories. STSs serve as landmarks on the physical map of the human genome.

substitution matrix

A substitution matrix contains values proportional to the probability that amino acid i mutates into amino acid j for all pairs of amino acids. Such matrices are constructed by assembling a large and diverse sample of verified pairwise alignments of amino acids. If the sample is large enough to be statistically significant, the resulting matrices should reflect the true probabilities of mutations occurring through a period of evolution. (See also BLOSUM 62.)

SWISS-PROT

SWISS-PROT is a curated protein sequence database that provides a high level of annotation (such as the description of protein function, domain structures, post-translational modifications, variants, etc.), a minimal level of redundancy, and high level of integration with other databases.

Sybase

A trademarked family of products that include databases, development tools, integration middleware, enterprise portals, and mobile and wireless servers.

synteny

Literally, “on the same strand”; syntenic loci lie on the same chromosome. The phrase “conserved synteny” refers to conserved gene order on chromosomes of different, related species.

Tax BLAST

BLAST Taxonomy Reports page. Tax BLAST groups BLAST hits by source organism, according to information in NCBI's Taxonomy database. Species are listed in order of sequence similarity with the query sequence, the strongest match listed first.

taxID

Taxonomy Identifier. The taxID is a stable unique identifier for each taxon (for a species, a family, an order, or any other group in the taxonomy database). The taxID is seen in the GenBank records as a “source” feature table entry; for example, /db_xref="taxon:9606" carries the taxID for Homo sapiens, and the line is therefore found in all recent human sequence records.
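A taxID can be pulled out of the text of a GenBank record with a short regular expression. This sketch assumes the qualifier appears with straight double quotes and a bare integer, as in /db_xref="taxon:9606"; the function name and the sample record fragment are hypothetical.

```python
import re

# Matches the taxon cross-reference in a "source" feature.
_TAXON_XREF = re.compile(r'/db_xref="taxon:(\d+)"')


def extract_taxids(record_text):
    """Return every taxID found in the given GenBank flat file text."""
    return [int(taxid) for taxid in _TAXON_XREF.findall(record_text)]


if __name__ == "__main__":
    fragment = (
        '     source          1..1000\n'
        '                     /organism="Homo sapiens"\n'
        '                     /db_xref="taxon:9606"\n'
    )
    print(extract_taxids(fragment))
```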

taxid

See taxID.

termination codon or stop codon

One of three codons that do not specify any amino acid and hence cause translation of mRNA into protein to be terminated. These codons mark the end of a protein coding sequence.
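
The scan for a termination codon can be sketched in a few lines of Python. This is an illustration only: it assumes the standard genetic code, in which the three stop codons are UAA, UAG, and UGA, and the function name first_stop_codon is hypothetical.

```python
# Stop codons of the standard genetic code (an assumption of this sketch;
# some organisms and organelles use a different code).
STOP_CODONS = {"UAA", "UAG", "UGA"}

def first_stop_codon(mrna, start=0):
    """Return the 0-based index of the first in-frame stop codon, or None.

    The reading frame is set by `start`; codons are read three bases at a time.
    """
    for i in range(start, len(mrna) - 2, 3):
        if mrna[i:i + 3] in STOP_CODONS:
            return i
    return None

print(first_stop_codon("AUGGCCUAAGGG"))  # prints 6 (AUG GCC UAA ...)
```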

TIGR

The Institute for Genomic Research

tiling path

An ordered list or map that defines a set of overlapping clones that covers a chromosome or other extended segment of DNA.
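
One way to picture a tiling path is as a list of clone intervals along a chromosome. The Python sketch below checks such a list for uncovered stretches. The clone names, the coordinates, and the function find_gaps are hypothetical; the sketch assumes 1-based inclusive coordinates and that every clone lies within the region being checked.

```python
def find_gaps(clones, region_start, region_end):
    """Return (gap_start, gap_end) intervals of the region not covered by any clone.

    `clones` is a list of (name, start, end) tuples with 1-based inclusive
    coordinates. Clones are assumed to lie inside the region.
    """
    gaps = []
    covered_to = region_start - 1  # last position known to be covered
    for _name, start, end in sorted(clones, key=lambda clone: clone[1]):
        if start > covered_to + 1:
            gaps.append((covered_to + 1, start - 1))
        covered_to = max(covered_to, end)
    if covered_to < region_end:
        gaps.append((covered_to + 1, region_end))
    return gaps

# Hypothetical clones: A and B overlap, C starts after an uncovered stretch.
path = [("cloneA", 1, 100), ("cloneB", 80, 200), ("cloneC", 251, 400)]
print(find_gaps(path, 1, 450))  # prints [(201, 250), (401, 450)]
```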

TPA

Third-Party Annotation

TPF

Tiling Path Format. A table format used to specify the set of clones that will provide the best possible sequence coverage for a particular chromosome, the order of the clones along the chromosome, and the location of any gaps in the clone tiling path. Also used to refer to a file (Tiling Path File) in which the minimal tiling path of clones covering a chromosome is specified in Tiling Path Format or to the minimal tiling path of clones so defined.

translation start site

The position within an mRNA at which synthesis of a protein begins. The translation start site is usually an AUG codon, but occasionally, GUG or CUG codons are used to initiate protein synthesis.

UID

Unique Identifier

UMLS

Unified Medical Language System. A project of the National Library of Medicine for the development and distribution of multipurpose, electronic “Knowledge Sources”, and associated lexical programs. The purpose of the UMLS is to aid the development of systems that help health professionals and researchers retrieve and integrate electronic biomedical information from a variety of sources and to make it easy for users to link disparate information systems, including computer-based patient records, bibliographic databases, factual databases, and expert systems.

unfinished sequence

See draft sequence.

UniGene cluster

ESTs and full-length mRNA sequences organized into clusters such that each represents a unique known or putative gene within the organism from which the sequences were obtained. UniGene clusters are annotated with mapping and expression information when possible (e.g., for human) and include cross-references to other resources. Sequence data can be downloaded by cluster through the UniGene web pages, or the complete dataset can be downloaded from the repository/UniGene directory of the FTP site.

UniSTS

UniSTS presents a unified, non-redundant view of sequence-tagged sites (STSs). UniSTS integrates marker and mapping data from a variety of public resources. If two or more markers have different names but the same primer pair, a single STS record is presented for the primer pair, and all the marker names are shown.

UNIX

UNIX is an operating system that was developed by Dennis Ritchie and Kenneth Thompson at Bell Labs more than 30 years ago. It allows multitasking and multiuser capabilities and offers portability with other operating systems. It comes with hundreds of programs that are of two types: integral utilities, such as the command line interpreter; and tools such as email, which are not necessary for the operation of UNIX but provide additional capabilities to the user. It is functionally organized at three levels: the kernel, which schedules tasks and manages storage; the shell, which connects and interprets users' commands, calls programs from memory, and executes them; and tools and applications, which offer additional functionality to the operating system, such as word processing and business applications. UNIX® was registered by Bell Laboratories as a trademark for computer operating systems. Today, this mark is owned by The Open Group.

URL

Uniform Resource Locator. The address of a resource on the Internet. URL syntax is in the form of protocol://host/localinfo, where “protocol” specifies the means of fetching the object (such as HTTP, used by WWW browsers and servers to exchange information, or FTP), “host” specifies the remote location where the object resides, and “localinfo” is a string (often a file name) passed to the protocol handler at the remote location. A URL is one kind of Uniform Resource Identifier (URI).
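
The three parts named above can be separated with Python's standard urllib.parse module, as in the sketch below. The address is a made-up example on a reserved example domain, and url_parts is a hypothetical helper name.

```python
from urllib.parse import urlsplit

def url_parts(url):
    """Split a URL into (protocol, host, localinfo)."""
    parts = urlsplit(url)
    # urlsplit calls these fields scheme, netloc, and path.
    return parts.scheme, parts.netloc, parts.path

# Hypothetical address used only for illustration.
print(url_parts("http://www.example.org/handbook/glossary.html"))
# prints ('http', 'www.example.org', '/handbook/glossary.html')
```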

UTF-8

UCS (Universal Character Set) Transformation Format, 8-bit. An ASCII-preserving encoding method for Unicode (a standard to provide a unique number for every character irrespective of the platform, program, or language).
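
What it means for the encoding to preserve ASCII can be shown with a short Python sketch: ASCII characters keep their single-byte values, whereas other Unicode characters take more than one byte. The helper name utf8_byte_lengths is hypothetical.

```python
def utf8_byte_lengths(text):
    """Return the number of bytes UTF-8 uses for each character in `text`."""
    return [len(character.encode("utf-8")) for character in text]

# Pure ASCII text encodes to the same bytes in ASCII and in UTF-8.
print("taxon".encode("utf-8") == "taxon".encode("ascii"))  # prints True

# The prime sign (U+2032) is outside ASCII and needs three bytes.
print(utf8_byte_lengths("3\u2032 UTR"))  # prints [1, 3, 1, 1, 1, 1]
```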

UTR

Untranslated Region. The 3′ UTR is that portion of an mRNA from the position of the last codon that is used in translation to the 3′ end. The 5′ UTR is that portion of an mRNA from the 5′ end to the position of the first codon used in translation.
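
Given the coordinates of the coding region, the two UTRs are simply the flanking slices of the mRNA. The Python sketch below illustrates this under stated assumptions: coordinates are 0-based, cds_end is the index just past the last codon that the caller counts as part of the coding region, and the function name and sequence are hypothetical.

```python
def split_utrs(mrna, cds_start, cds_end):
    """Return (five_prime_utr, coding_region, three_prime_utr).

    `cds_start` is the 0-based index of the first base of the first codon;
    `cds_end` is the index just past the last base of the coding region.
    """
    return mrna[:cds_start], mrna[cds_start:cds_end], mrna[cds_end:]

# Made-up mRNA: 6 bases of 5' UTR, a 9-base coding region, 6 bases of 3' UTR.
mrna = "GGCACC" + "AUGGCCUAA" + "UUUAAA"
print(split_utrs(mrna, 6, 15))  # prints ('GGCACC', 'AUGGCCUAA', 'UUUAAA')
```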

VAST

Vector Alignment Search Tool. A computer algorithm used to identify similar protein 3D structures.

weight

An assignment of importance to a term in a search query. If a term in a search query is found to match a word in a document, that word is given a “weight”. The exact weight of the word will depend on the emphasis given to the word by the author or its position in the document. For example, a word that occurs in a chapter title will have a higher weight than the same word if it occurs in the body of the chapter. Similarly, words that occur in data collections are also assigned weights, depending on how frequently the terms occur in the collection.

WGS sequence

Whole Genome Shotgun sequence. In this semi-automated sequencing technique, high-molecular-weight DNA is sheared into random fragments, size selected (usually 2, 10, 50, and 150 kb), and cloned into an appropriate vector. The clones are then sequenced from both ends. The two ends of the same clone are referred to as mate pairs. The distance between the two reads of a mate pair can be inferred if the library size is known and has a narrow window of deviation. The sequences are aligned using sequence assembly software. Proponents of this approach argue that it is possible to sequence the whole genome at once using large arrays of sequencers, which makes the whole process much more efficient than the traditional approaches.

WHO

World Health Organization

WWW

World Wide Web. The World Wide Web Consortium (W3C) develops technologies such as specifications, guidelines, software, and tools for the Web.

XML

Extensible Markup Language. XML describes a class of data objects called XML documents and partially describes the behavior of computer programs that process them. XML is a subset of SGML, and XML documents are conforming SGML documents. XML documents are made up of storage units called entities, which contain either parsed or unparsed data. Parsed data is made up of characters (a unit of text), some of which form character data, and some of which form markup. Markup includes tags that provide information about the data, i.e., a description of the structure and content of the document. Character data comprises all the text that is not markup. XML provides a mechanism to impose constraints on the storage layout and logical structure.
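
The distinction between markup and character data can be seen by parsing a small document with Python's standard xml.etree.ElementTree module. In this sketch the element and attribute names (taxon, id, name, rank) are hypothetical and do not represent any NCBI schema; summarize is likewise an illustrative helper.

```python
import xml.etree.ElementTree as ET

def summarize(xml_text):
    """Parse an XML document and report its markup and character data."""
    root = ET.fromstring(xml_text)
    return {
        "tag": root.tag,                      # markup: the element name
        "attributes": dict(root.attrib),      # markup: attributes on the tag
        "children": {child.tag: child.text    # character data inside child tags
                     for child in root},
    }

# Hypothetical document written for this example.
doc = '<taxon id="9606"><name>Homo sapiens</name><rank>species</rank></taxon>'
print(summarize(doc))
```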

XSL

Extensible Stylesheet Language. XSL is used for the transformation of XML-based data into HTML or other presentation formats, for display in a web browser. This is a two-part process. First, the structure of the input XML tree must be transformed into a new tree (e.g., HTML), allowing reordering of the elements, addition of text, and calculations—all without modification to the source document. This process is described by XSLT. Second, XSL-FO (XSL Formatting Objects, an XML vocabulary for formatting) is used for formatting the output, defining areas of the display page and their properties. In this way, the source XML document can be maintained from the perspective of “pure content” and can be separated from the presentation. An XML document can be delivered in different formats to different target audiences by simply switching style sheets.

XSLT

Extensible Stylesheet Language: Transformations. XSLT is a language for transforming the structure of an XML document. XSLT is designed for use as part of XSL, the stylesheet language for XML. A transformation expressed in XSLT describes a sequence of template rules for transforming a source tree into a result tree; elements from the source tree can be filtered and reordered, and a different structure can be added. A template rule has two parts: a pattern that is matched against nodes in the source tree; and a template that can be instantiated to form part of the result tree. This makes XSLT a declarative language because it is possible to specify what output should be produced when specific patterns occur in the input, which distinguishes it from procedural programming languages, where it is necessary to specify what tasks have to be performed in what order. XSLT makes use of the expression language defined by XPath (a language for addressing the parts of an XML document) for selecting elements for processing, for conditional processing, and for generating text.

YAC

Yeast Artificial Chromosome. Extremely large segments of DNA from another species spliced into the DNA of yeast. YACs are used to clone up to one million bases of foreign DNA into a host cell, where the DNA is propagated along with the other chromosomes of the yeast cell.

ZFIN

Zebrafish Information Network. ZFIN is a database for the zebrafish model organism that holds information on wild-type stocks, mutants, genes, gene expression data, and map markers.