Bioinformatics consists of a computational approach to biomedical information management and analysis. It is being used increasingly as a component of research within both academic and industrial settings and is becoming integrated into both undergraduate and postgraduate curricula. The new generation of biology graduates is emerging with experience in using bioinformatics resources and, in some cases, programming skills.
The National Center for Biotechnology Information (NCBI) is one of the world's premier Web sites for biomedical and bioinformatics research. Based within the National Library of Medicine at the National Institutes of Health, USA, the NCBI hosts many databases used by biomedical and research professionals. The services include PubMed, the bibliographic database; GenBank, the nucleotide sequence database; and the BLAST algorithm for sequence comparison, among many others. The NCBI Web site is visited by about 250,000 people per day.
Although each NCBI resource has online help documentation associated with it, there is no cohesive approach to describing the databases and search engines, nor any significant information on how the databases work or how they can be leveraged for bioinformatics research on a larger scale. The NCBI Handbook is designed to address this information gap.
All of our users know how to execute a straightforward PubMed or BLAST search. However, feedback from help desk personnel and booth staff at scientific meetings suggests that people often want to know how to use our resources in a more sophisticated manner and are frequently unaware of less well-known databases that might be helpful to them. The intended audience for The NCBI Handbook is, therefore, the growing number of scientists and students who would like a more in-depth guide to NCBI resources: power users and aspiring power users.
The NCBI Handbook focuses on the relatively stable information about each resource; it is not a point-and-click user guide (this type of information can be found in the online help documents, which are referred to frequently, but not repeated, in the Handbook). Each chapter is devoted to one service; after a brief overview on using the resource, there is an account of how the resource works, including topics such as how data are included in a database, database design, query processing, and how the different resources relate to each other. For example, the BLAST chapter briefly describes what to use BLAST for, the varieties of the BLAST algorithm, and BLAST statistics, before discussing output formats, query processing, and tips for setting up a BLAST database. A certain amount of biological knowledge is assumed.
The online content will be updated when necessary, although major changes are not expected to occur more than once every few years. (For example, PubMed query processing does not change dramatically year after year.) We hope that The NCBI Handbook will provide a valuable reference for anyone who wants to use our resources more effectively.
The GenBank sequence database is an annotated collection of all publicly available nucleotide sequences and their protein translations. This database is produced at National Center for Biotechnology Information (
Direct submissions are made to GenBank using
Initially, GenBank was built and maintained at Los Alamos National Laboratory (
In the mid-1990s, the GenBank database became part of the International Nucleotide Sequence Database Collaboration with the EMBL database (
The Collaboration created a
When scientists submit data to GenBank, they have the opportunity to keep their data confidential for a specified period of time. This helps to allay concerns that the availability of their data in GenBank before publication may compromise their work. When the article containing the citation of the sequence or its Accession number is published, the sequence record is released. The database staff request that submitters notify GenBank of the date of publication so that the sequence can be released without delay. The request to release should be sent to
The typical GenBank submission consists of a single, contiguous stretch of DNA or RNA sequence with annotations. The annotations are meant to provide an adequate representation of the biological information in the record. The GenBank
Currently, only nucleotide sequences are accepted for direct submission to GenBank. These include mRNA sequences with coding regions, fragments of genomic DNA with a single gene or multiple genes, and ribosomal RNA gene clusters. If part of the nucleotide sequence encodes a protein, a conceptual translation, called a
Multiple sequences can be submitted together. Such batch submissions of non-related sequences may be processed together but will be displayed in Entrez (
What defines a set? Environmental sample, population, phylogenetic, and mutation sets all contain a group of sequences that spans the same gene or region of the genome. Environmental samples are derived from a group of unclassified or unknown organisms. A population set contains sequences from different isolates of the same organism. A phylogenetic set contains sequences from different organisms that are used to determine the phylogenetic relationship between them. Sequencing multiple mutations within a single gene gives rise to a mutation set.
All sets, except segmented sets, may contain an alignment of the sequences within them and might include external sequences already present in the database. In fact, the submitter can begin with an existing alignment to create a submission to the database using the Sequin submission tool. Currently, Sequin accepts
Segmented sets are a collection of noncontiguous sequences that cover a specified genetic region. The most common example is a set of genomic sequences containing
Diagram showing the orientation and gaps that might be expected in high-throughput sequence from phases 1, 2, and 3.
Phase 0, 1, and 2 records are in the HTG division of GenBank, whereas phase 3 entries go into the taxonomic division of the organism, for example, PRI (primate) for human. An entry keeps its Accession number as it progresses from one phase to another but receives a new Accession.Version number and a new gi number each time there is a sequence change.
To submit sequences in bulk to the HTG processing system, a center or group must set up an FTP account by writing to
fa2htgs is a command-line program that is downloaded to the user's computer. The submitter invokes a script with a series of parameters (arguments) to create a submission. It has an advantage over Sequin in that it can be set up by the user to create submissions in bulk from multiple files.
Submissions to HTG must contain three identifiers that are used to track each HTG record: the genome center tag, the sequence name, and the Accession number. The genome center tag is assigned by NCBI and is generally the FTP account login name. The sequence name is a unique identifier that is assigned by the submitter to a particular clone or entry and must be unique within the group's submissions. When a sequence is first submitted, it has only a sequence name and genome center tag; the Accession number is assigned during processing. All updates to that entry must include the center tag, sequence name, and Accession number, or processing will fail.
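The identifier rules above can be sketched as a small check. This is only an illustration: the `HtgSubmission` class, its field names, and the `is_update` flag are hypothetical and are not part of the HTG processing system.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class HtgSubmission:
    """Hypothetical container for the three HTG tracking identifiers."""
    center_tag: Optional[str]
    sequence_name: Optional[str]
    accession: Optional[str] = None  # assigned during processing of a new entry
    is_update: bool = False          # True when updating an existing entry


def identification_problems(sub: HtgSubmission) -> List[str]:
    """Return the identification problems that would make processing fail."""
    problems = []
    if not sub.center_tag:
        problems.append("missing genome center tag")
    if not sub.sequence_name:
        problems.append("missing sequence name")
    # A first submission has no Accession number yet; an update must carry one.
    if sub.is_update and not sub.accession:
        problems.append("update is missing its Accession number")
    return problems
```

A first submission with only a center tag and sequence name passes this check, whereas an update without an Accession number is reported as a problem.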
Submitters deposit HTGS sequences in the form of Seq-submit files generated by Sequin, fa2htgs, or their own
Entries can fail HTG processing because of three types of problems:
Formatting: submissions are not in the proper Seq-submit format.
Identification: submissions may be missing the genome center tag, sequence name, or Accession number, or this information is incorrect.
Data: submissions have problems with the data and therefore fail the validator checks.
When submissions fail HTG processing, a GenBank annotator sends email to the sequencing center, describing the problem and asking the center to submit a corrected entry. Annotators do not fix incorrect submissions; this ensures that the staff of the submitting genome center fixes the problems in their database as well.
The processing pathway also generates reports. For successful submissions, two files are generated: one contains the submission in GenBank flat file format (without the sequence); and another is a status report file. The status report file, ac4htgs, contains the genome center, sequence name, Accession number, phase, create date, and update date for the submission. Submissions that fail processing receive an error file with a short description of the error(s) that prevented processing. The GenBank annotator also sends email to the submitter, explaining the errors in further detail.
When successful submissions are loaded into GenBank, they undergo additional validation checks. If GenBank annotators find errors, they write to the submitters, asking them to fix these errors and submit an update.
Genome centers are taking multiple approaches to sequencing complete genomes from a number of organisms. In addition to the traditional clone-based sequencing whose data are being submitted to HTGS, these centers are also using a
Each sequencing project is assigned a stable project ID, which is made up of four letters. The Accession number for a WGS sequence contains the project ID, a two-digit version number, and six digits for the contig ID. For instance, a project would be assigned an Accession number AAAX00000000. The first assembly version would be AAAX01000000. The last six digits of this ID identify individual contigs. A master record for each assembly is created. This master record contains information that is common among all records of the sequencing project, such as the biological source, submitter, and publication information. There is also a link to the range of Accession numbers for the individual contigs in this assembly.
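The layout of a WGS Accession number described above (four-letter project ID, two-digit version, six-digit contig ID) can be illustrated with a short parser. This is a minimal sketch; the function and class names are hypothetical, and the pattern assumes uppercase project letters as in the AAAX example.

```python
import re
from typing import NamedTuple


class WgsAccession(NamedTuple):
    project_id: str        # four-letter project ID, e.g. "AAAX"
    assembly_version: int  # two-digit assembly version
    contig_id: int         # six-digit contig ID


# Four letters, two digits, six digits, as described in the text.
_WGS_PATTERN = re.compile(r"^([A-Z]{4})(\d{2})(\d{6})$")


def parse_wgs_accession(accession: str) -> WgsAccession:
    """Split a WGS Accession number into its three components."""
    match = _WGS_PATTERN.match(accession)
    if match is None:
        raise ValueError(f"not a WGS-style Accession number: {accession!r}")
    project, version, contig = match.groups()
    return WgsAccession(project, int(version), int(contig))


def format_wgs_accession(project_id: str, assembly_version: int, contig_id: int) -> str:
    """Build the Accession string from its components, zero-padding the numbers."""
    return f"{project_id}{assembly_version:02d}{contig_id:06d}"
```

For example, `parse_wgs_accession("AAAX01000000")` yields project `AAAX`, assembly version 1, and contig ID 0.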
WGS submissions can be created using tbl2asn, a utility that is packaged with the Sequin submission software. Information on submitting these sequences can be found at
Expressed Sequence Tags (
EST, STS, and GSS sequences reside in their respective divisions within GenBank, rather than in the taxonomic division of the organism. The sequences are maintained within GenBank in the dbEST, dbSTS, and dbGSS databases.
Because of the large numbers of sequences that are submitted at once, dbEST, dbSTS, and dbGSS entries are stored in relational databases where information that is common to all sequences can be shared. Submissions consist of several files containing the common information, plus a file of the sequences themselves. The three types of submissions have different requirements, but all include a Publication file and a Contact file. See the
In general, users generate the appropriate files for the submission type and then email the files to
HTC records are High-Throughput cDNA/mRNA submissions that are similar to ESTs but often contain more information. For example, HTC entries often have a systematic gene name (not necessarily an official gene name) that is related to the lab or center that submitted them, and the longest open reading frame is often annotated as a coding region.
FLIC (Full-Length Insert cDNA) records contain the entire sequence of a cloned cDNA/mRNA. Therefore, FLICs are generally longer, and sometimes even full-length, mRNAs. They are usually annotated with genes and coding regions, although these may be lab systematic names rather than functional names.
HTC entries are usually generated with
HTC entries undergo the same validation and processing as non-bulk submissions. Once processing is complete, the records are loaded into GenBank and are available in Entrez and other retrieval systems.
FLICs are processed via an automated FLIC processing system that is based on the HTG automated processing system. Submitters use the program tbl2asn to generate their submissions. As with HTG submissions, submissions to the automated FLIC processing system must contain three identifiers: the genome center tag, the sequence name (SeqId), and the Accession number. The genome center tag is assigned by NCBI and is generally the FTP account login name. The sequence name is a unique identifier that is assigned by the submitter to a particular clone or entry and must be unique within the group's FLIC submissions. When a sequence is first submitted, it has only a sequence name and genome center tag; the Accession number is assigned during processing. All updates to that entry must include the center tag, sequence name, and Accession number, or processing will fail.
The FLIC processing system is analogous to the HTG processing system. Submitters deposit their submissions in the FLICSEQSUBMIT directory of their FTP account and notify us that the submissions are there. We then run the scripts to pick up the files from the FTP site and copy them to the processing pathway, as well as to an archive. Once processing is complete and if there are no errors in the submission, the files are automatically loaded into GenBank.
As with HTG submissions, FLIC entries can fail for three reasons: problems with the format, problems with the identification of the record (the genome center, the SeqId, or the Accession number), or problems with the data itself. When submissions fail FLIC processing, a GenBank annotator sends email to the sequencing center, describing the problem and asking the center to submit a corrected entry. Annotators do not fix incorrect submissions; this ensures that the staff of the submitting genome center fixes the problems in their database as well. At the completion of processing, reports are generated and deposited in the submitter's FTP account, as described for HTG submissions.
Direct submissions to GenBank are prepared using one of two submission tools, BankIt or Sequin.
Completed Sequin submissions should be emailed to GenBank at
All direct submissions to GenBank, created either by Sequin or BankIt, are processed by the GenBank annotation staff. The first step in processing submissions is called triage. Within 48 hours of receipt, the database staff reviews the submission to determine whether it meets the minimal criteria for incorporation into GenBank and then assigns an Accession number to each sequence. All sequences must be >50 bp in length and be sequenced by, or on behalf of, the group submitting the sequence. GenBank will not accept sequences constructed
Triaged submissions are subjected to a thorough examination, referred to as the indexing phase. Here, entries are checked for:
Biological validity. For example, does the conceptual translation of a coding region match the amino acid sequence provided by the submitter? Annotators also ensure that the source organism name and lineage are present, and that they are represented in NCBI's taxonomy database. If either of these is not true, the submitter is asked to correct the problem. Entries are also subjected to a series of BLAST similarity searches to compare the annotation with existing sequences in GenBank.
Vector contamination. Entries are screened against NCBI's
Publication status. If there is a published citation,
Formatting and spelling.
If there are problems with the sequence or annotation, the annotator works with the submitter to correct them.
Completed entries are sent to the submitter for a final review before release into the public database. If the submitters requested that their sequences be released after processing, they have 5 days to make changes prior to release. The submitter may also request that GenBank hold their sequence until a future date. The sequence must become publicly available once the Accession number or the sequence has been published. The GenBank annotation staff currently processes about 1,900 submissions per month, corresponding to approximately 20,000 sequences.
GenBank annotation staff must also respond to email inquiries that arrive at the rate of approximately 200 per day. These exchanges address a range of topics including:
updates to existing GenBank records, such as new annotation or sequence changes
problem resolution during the indexing phase
requests for release of the submitter's sequence data or an extension of the hold date
requests for release of sequences that have been published but are not yet available in GenBank
lists of Accession numbers that are due to appear in upcoming issues of a publisher's journals
reports of potential annotation problems with entries in the public database
requests for information on how to submit data to GenBank
One annotator is responsible for handling all email received in a 24-hour period, and all messages must be acted upon and replied to in a timely fashion. Replies to previous emails are forwarded to the appropriate annotator.
The annotation staff uses a variety of tools to process and update sequence submissions. Sequence records are edited with Sequin, which allows staff to annotate large sets of records by global editing rather than changing each record individually. This is truly a time saver because more than 100 entries can be edited in a single step (see
The GenBank direct submissions group has processed more than 50 complete microbial genomes since 1996. These genomes are relatively small in size compared with their eukaryotic counterparts, ranging from five hundred thousand to five million bases. Nonetheless, these genomes can contain thousands of genes, coding regions, and structural RNAs; therefore, processing and presenting them correctly is a challenge. Currently, the DDBJ/EMBL/GenBank Nucleotide Sequence Database Collaboration has a 350-kilobase (kb) upper size limit for sequence entries. Because a complete bacterial genome is larger than this arbitrary limit, it must be split into pieces. GenBank routinely splits complete microbial genomes into 10-kb pieces with a 60-bp overlap between pieces. Each piece contains approximately 10 genes. A CON entry, containing instructions on how to put the pieces back together, is also made. The CON entry contains descriptor information, such as source organism and references, as well as a join statement providing explicit instructions on how to generate the complete genome from the pieces. The Accession number assigned to the CON record is also added as a secondary Accession number on each of the pieces that make up the complete genome (see The information toward the
Submitters of complete genomes are encouraged to contact us at
Complete genome submissions are reviewed by a member of the GenBank annotation staff to ensure that the annotation and gene and protein identifiers are correct, and that the entry is in proper GenBank format. Any problems with the entry are resolved through communication with the submitter. Once the record is complete, the genome is carefully split into its component pieces. The genome is split so that none of the breaks occurs within a gene or coding region. A member of the annotation staff performs quality assurance checks on the set of genome pieces to ensure that they are correct and representative of the complete genome. The pieces are then loaded into GenBank, and the CON record is created.
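The arithmetic behind splitting a genome into 10-kb pieces with a 60-bp overlap can be sketched as follows. This is a simplified illustration only: the function name is hypothetical, and it cuts at fixed positions, whereas the actual process places breaks so that none falls within a gene or coding region.

```python
from typing import List, Tuple


def split_coordinates(
    genome_length: int, piece_size: int = 10_000, overlap: int = 60
) -> List[Tuple[int, int]]:
    """Return 1-based, inclusive (start, end) coordinates for each piece.

    Consecutive pieces share `overlap` bases. Gene boundaries are ignored
    here, which is a simplification of the real procedure.
    """
    if genome_length < 1:
        raise ValueError("genome_length must be positive")
    if piece_size <= overlap:
        raise ValueError("piece_size must be larger than overlap")
    pieces = []
    start = 1
    while True:
        end = min(start + piece_size - 1, genome_length)
        pieces.append((start, end))
        if end == genome_length:
            break
        # The next piece begins `overlap` bases before the end of this one.
        start = end - overlap + 1
    return pieces
```

For a 25,000-base sequence this gives three pieces, (1, 10000), (9941, 19940), and (19881, 25000); the ordered list of such ranges is the kind of information a CON record's join statement has to encode.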
The microbial genome records in GenBank are the building blocks for the
The vast amount of publicly available data from the human genome project and other genome sequencing efforts is a valuable resource for scientists throughout the world. A laboratory studying a particular gene or gene family may have sequenced numerous cDNAs but has neither the resources nor inclination to sequence large genomic regions containing the genes, especially when the sequence is available in public databases. The researcher might choose then to download genomic sequences from GenBank and perform experimental analyses on these sequences. However, because this researcher did not perform the sequencing, the sequence, with its new annotations, cannot be submitted to DDBJ/EMBL/GenBank. This is unfortunate because important scientific information is being excluded from the public databases. To address this problem, the International Nucleotide Sequence Database Collaboration established a separate section of the database for such TPA (see
All sequences in the TPA database are derived from the publicly available collection of sequences in DDBJ/EMBL/GenBank. Researchers can submit both new and alternative annotations of genomic sequence to GenBank. TPA entries can also be created by combining the exon sequences from genomic sequences or by making contigs of EST sequences to make mRNA sequences. TPA submissions must use sequence data that are already represented in DDBJ/EMBL/GenBank, have annotation that is experimentally supported, and appear in a peer-reviewed scientific journal. TPA sequences will be released to the public database only when their Accession numbers and/or sequence data appear in a peer-reviewed publication in a biological journal.
PubMed is a database developed by the National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM), one of the institutes of the National Institutes of Health (NIH). The database was designed to provide access to citations (with abstracts) from biomedical journals. Subsequently, a linking feature was added to provide access to full-text journal articles at Web sites of participating publishers, as well as to other related Web resources. PubMed is the bibliographic component of the NCBI's Entrez retrieval system.
In addition to MEDLINE citations, PubMed® provides access to non-MEDLINE resources, such as out-of-scope citations, citations that precede MEDLINE selection, and PubMed Central (PMC; see
In response to new approaches to electronic publishing, PubMed can now also accommodate articles published electronically in advance of being collected into an issue. We refer to these citations as “ahead of print” or “epub” citations.
All content in PubMed ultimately comes from publishers of biomedical journals, and journals that are to be included in MEDLINE are subject to a selection process. The
Electronic data submission benefits everyone: publishers, the NLM, and users. For the NLM, it eliminates the tremendous costs associated with entering data by hand. For publishers and users, it means that newly published data appear rapidly and accurately in PubMed. Some publishers are now making pre-publication material available before it is formally published (“ahead of print” or “epub” citations); others are publishing electronic-only journals. By close collaboration with the publisher, the citations for these publications can appear in PubMed on the same day as the article is published.
Furthermore, electronic data submission allows publishers to create links from abstracts in PubMed to the full text of the appropriate articles available on their own Web site. This can be achieved using LinkOut (
Although the NLM works with many publishers directly, some publishers contract with commercial data aggregators, companies that prepare and submit the publisher's data to the NLM. Many aggregators also host publisher data on their Web sites.
All electronic data are supplied via
NCBI staff will guide new data providers through the approval process for file submission. New providers are asked to submit test files, which are then checked for XML formatting and syntax and for bibliographic accuracy and completeness. The files are revised and resubmitted as many times as necessary until all criteria are met. Once approved, a private account is set up on our FTP site to receive new journal issues, or in the case of online publications, individual articles as they are added to the publisher's Web site. We run a file-loading script that automatically processes the files daily, Monday through Friday at approximately 9:00 a.m. (Eastern Time). The new citations are assigned a PubMed ID number (PMID), a confirmation report is sent to the provider, and the new citations usually become available in PubMed sometime after 11:00 a.m. the next day, Tuesday through Saturday.
After posting in PubMed, the citations are forwarded to NLM's Indexing Section for bibliographic data verification and for the addition of subject indexing terms from Medical Subject Headings [
PubMed is one of the NCBI databases within the relational database management system,
Requests for NCBI services, including PubMed, are first proxied through three load-balanced Dell PowerEdge 1650 servers, each with two central processing units. The proxy servers, in turn, load-balance requests forwarded on to the Web servers for PubMed and other NCBI services.
The PubMed Web servers comprise eight Dell PowerEdge 8450 servers. Each server has eight central processing units, 8 GB of memory, and about 300 GB of disk space and runs the Linux operating system.
The Web servers retrieve PubMed records from two Sybase SQL database servers, which run on Sun Enterprise 450s. To accommodate the data volume output by PubMed and other Web-based services, the NLM has a high-speed connection (OC-3, up to 155 Mbits/sec) to the Internet, as well as a 622 Mbits/sec connection (OC-12) to Internet2, the noncommercial network used by many leading research universities.
Citations in PubMed are assigned one of three citation status tags that display next to the PubMed ID (PMID) numbers on all PubMed citations. The citation status tags indicate the citation's stage in the MEDLINE indexing process. The three tags are:
Most citations that are received electronically from publishers progress through “in process” status to MEDLINE status. Those citations not indexed for MEDLINE remain tagged [PubMed - as supplied by publisher]. Citations with “in process” status proceed to MEDLINE status after MeSH terms, publication types, sequence Accession numbers, and other indexing data are added.
All records are added to PubMed Monday through Friday and become available for viewing Tuesday through Saturday. For additional information, please see the NLM
The aim of the computer indexing process is to automatically create multiple machine-readable access points that refer to the different components of the journal citations for use when searching PubMed. The citations are loaded into PubMed from both the NLM Data Creation and Maintenance System (DCMS) and directly from journal publishers (
During the computer indexing process, the citation information is broken down into index fields such as Journal Name, Author Name, and Title/Abstract. The words in each of the fields are checked against the corresponding index (i.e., title words in a new citation are looked up in the Title/Abstract Index). If the word already exists, the PMID of the citation is listed with that index term. If the word is a new one for the Index, it is added as a new Index term, and the PMID is listed alongside it. (In the latter case, the new term will have only this one citation associated with it; this is how the PubMed indexes grow.)
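The indexing step described here is essentially the construction of an inverted index per field. The following is a minimal sketch under simplifying assumptions: the function names are hypothetical, words are lowercased and split on non-alphanumeric characters (PubMed's real tokenization rules are not described here), and the PMIDs in any usage are made up.

```python
import re
from collections import defaultdict
from typing import Dict, Set

# Maps field name -> index term -> set of PMIDs that contain the term.
FieldIndexes = Dict[str, Dict[str, Set[int]]]


def new_indexes() -> FieldIndexes:
    """Create an empty set of field indexes."""
    return defaultdict(lambda: defaultdict(set))


def add_citation(indexes: FieldIndexes, pmid: int, fields: Dict[str, str]) -> None:
    """Index one citation: list its PMID under every word of every field."""
    for field_name, text in fields.items():
        index = indexes[field_name]
        for word in re.findall(r"[a-z0-9']+", text.lower()):
            # An existing term gains one more PMID; a new term is created
            # with this citation as its only entry.
            index[word].add(pmid)
```

After two citations whose titles both contain "cell" have been added, the Title/Abstract index entry for "cell" lists both PMIDs, while a word that occurs in only one title lists a single PMID.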
Each PubMed citation is, therefore, associated with several indexes, and in cases similar to the Title/Abstract Index, many different index terms can refer back to a single citation. Likewise, commonly used terms will refer to thousands of citations (the term “cell”, for example, is found in the Title/Abstract of 1,092,124 citations at the time of this writing). The Field Indexes can be browsed by using PubMed's
PubMed uses Automatic Term Mapping, which checks search terms against the following, in order:
1. MeSH Translation Table
2. Journals Translation Table
3. Author Index
The MeSH Terms
See-Reference mappings (also known as entry terms) for MeSH terms
Mappings derived from the Unified Medical Language System (
Names of Substances and synonyms to the Names of Substances (now known as Supplementary Concept Substance Names)
If the search term is found in this translation table, the term will be mapped to the appropriate MeSH term, and the Indexes will be searched as both the text word entered by the user and the MeSH term:
“Gallstones” is an entry term for the MeSH term “cholelithiasis” in the MeSH translation table. Search translated to: “cholelithiasis” [MeSH Terms] OR gallstones [Text Word]
When a term is searched as a MeSH term, PubMed automatically searches that term plus the more specific terms underneath in the “Breast cancer” is an entry term for the MeSH term “breast neoplasms” in the MeSH translation table. “Breast neoplasms” has the specific headings “breast neoplasms, male”, “mammary neoplasms”, “mammary neoplasms, experimental”, and “phyllodes tumor”, all of which are also searched.
If the search term(s) is not found in the MeSH Translation Table, the PubMed search algorithm then looks up the term in the “New England Journal of Medicine” maps to N Engl J Med. Search translated to: “N Engl J Med” [Journal Name]
If a journal name is also a MeSH term, PubMed will search the term as both a MeSH term and as a Text Word, but not as a Search translated as: “cells” [MeSH Terms] OR cell [Text Word] Search translated as: “Cell” [Journal]
If the phrase is not found in MeSH or the Journals Translation Table and is a word with one or two letters after it, PubMed then checks the
If only one initial is used, PubMed finds all names with that first initial, and if only an author's last name is entered, PubMed will search that name in All Fields. It will not default to the Author Index because an initial does not follow the last name:
Search translated as: o'malley fa, o'malley fb, o'malley fc, o'malley fd, o'malley f jr, etc. Search translated as: “o'malley” [All Fields]
A history of the NLM's author indexing policy regarding the number of authors to include in a citation is outlined in
Dates | Policy |
---|---|
1966–1984 | MEDLINE did not limit the number of authors. |
1984–1995 | The NLM limited the number of authors to 10, with “et al.” as the eleventh occurrence. |
1996–1999 | The NLM increased the limit from 10 to 25. If there were more than 25 authors, the first 24 were listed, the last author was used as the 25th, and the twenty-sixth and beyond became “et al.”. |
2000–present | MEDLINE does not limit the number of authors. |
It is possible to override PubMed's Automatic Term Mapping by using search rules, syntax, and qualifying terms with search field abbreviations.
The
Search term: o'malley [au] will search only the author field. Specifying the field precludes the Automatic Term Mapping, which would result in the search o'malley[All Fields] if the field were not specified. Similarly, using the search term Cell [Journal] avoids using the MeSH Translation Table, which would interpret Cell as only a text word and MeSH term.
A simple search can be conducted from the
If more than one term is entered in the query box, PubMed will go through the Automatic Term Mapping protocol described in the previous section, first looking for all the terms, as typed, to find an exact match. If the exact phrase is not found, PubMed clips a term off the end and repeats Automatic Term Mapping, again looking for an exact match, but this time to the abbreviated query. This continues until none of the words are found in any one of the translation tables. In this case, PubMed combines terms (with the AND Boolean operator) and applies the Automatic Term Mapping process to each individual word. PubMed ignores
Translated as: ((“ascorbic acid” [MeSH Terms] OR vitamin c [Text Word]) AND (“common cold” [MeSH Terms] OR common cold [Text Word]))
Translated as: (((“single person” [MeSH Terms] OR single [Text Word]) AND (“cell separation” [MeSH Terms] OR cell separation [Text Word])) AND (“brain” [MeSH Terms] OR brain [Text Word]))
If a phrase of more than two terms is not found in any translation table, then the last word of the phrase is dropped, and the remainder of the phrase is sent through the entire process again. This continues, removing one word at a time, until a match is found.
If there is no match found during the Automatic Term Mapping process, the individual terms will be combined with AND and searched in All Fields.
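The phrase-clipping behavior described above can be sketched as a longest-match loop. This is an illustration under stated assumptions, not PubMed's implementation: only a MeSH table is consulted (the Journals Translation Table and Author Index steps are omitted), the `MESH_TABLE` contents are a tiny hypothetical stand-in, and the function name is invented.

```python
from typing import Dict, List

# Hypothetical, tiny stand-in for the MeSH Translation Table:
# entry term (as typed) -> MeSH term.
MESH_TABLE: Dict[str, str] = {
    "vitamin c": "ascorbic acid",
    "common cold": "common cold",
}


def map_terms(query: str, mesh_table: Dict[str, str]) -> str:
    """Translate a multi-word query by clipping words off the end until a match is found."""
    words = query.lower().split()
    clauses: List[str] = []
    start = 0
    while start < len(words):
        # Try the longest remaining phrase first, then drop the last word and retry.
        for end in range(len(words), start, -1):
            phrase = " ".join(words[start:end])
            if phrase in mesh_table:
                clauses.append(
                    f'("{mesh_table[phrase]}"[MeSH Terms] OR {phrase}[Text Word])'
                )
                start = end
                break
        else:
            # Nothing beginning at this word matched: search the word in All Fields.
            clauses.append(f"{words[start]}[All Fields]")
            start += 1
    return " AND ".join(clauses)
```

With this table, the query `vitamin c common cold` is split into the two phrases "vitamin c" and "common cold", each mapped to its MeSH term and joined with AND, which mirrors the translated searches shown above.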
One can see how PubMed interpreted a search by selecting
There are a variety of ways that PubMed can be searched in a more sophisticated manner than simply typing search terms into the search box and selecting
The following resources are available to facilitate effective searches:
PubMed retrieves and displays search results in the Summary format in the order the record was initially added to PubMed, with the most recent first. (Note that this date can differ widely from the publication date.) Citations can be viewed in several other
A variety of links can be found on PubMed citations including:
NCBI databases, as well as other resources, may be available from the
The Entrez system provides three distinct ways to create Web
The Entrez Programming Utilities can be used to create URL links directly to all Entrez data, including PubMed citations and their link information, without using the front-end Entrez query engine. These Utilities provide a fast, efficient way to search and download citation data.
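As a rough illustration of such a URL, the sketch below builds an ESearch request for PubMed using only the Python standard library. The base URL is the one currently published for the E-utilities and may differ from the address in use when this text was written; the helper name `esearch_url` is hypothetical, and no network request is made.

```python
from urllib.parse import urlencode

# Base URL of the Entrez Programming Utilities (assumed current address).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, db="pubmed", retmax=20):
    """Build an ESearch URL; fetching it returns the matching record IDs."""
    params = {"db": db, "term": term, "retmax": retmax}
    # urlencode percent-escapes characters such as the apostrophe and brackets.
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


url = esearch_url("o'malley[au]")
```

The field tag in the search term is passed through unchanged, so the same qualifiers that work in the PubMed query box can be used here.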
If you need more assistance, please contact our
Additional information is also available in the
The resources provided by NCBI for studying the three-dimensional (3D) structures of proteins center around two databases: the Molecular Modeling Database (MMDB), which provides structural information about individual proteins; and the Conserved Domain Database (CDD), which provides a directory of sequence and structure alignments representing conserved functional domains within proteins(
To enable scientists to accomplish these tasks, NCBI has integrated MMDB and CDD into the Entrez retrieval system (
Protein structures can be visualized using Cn3D, an interactive 3D graphic modeling tool. Details of the structure, such as ligand-binding sites, can be scrutinized and highlighted. Cn3D can also display multiple sequence alignments based on sequence and/or structural similarity among related sequences, 3D domains, or members of a CDD family. Cn3D images and alignments can be manipulated easily and exported to other applications for presentation or further analysis.
The Structure This page can be found by selecting the
Often used in conjunction with
The structures within MMDB are now being linked to the NCBI Taxonomy database (
The second database within the
All the above databases and tools are discussed in more detail in other parts of this Chapter, including tips on how to make the best use of them.
MMDB records have several types of Accession numbers associated with them, representing the following data types:
Each MMDB record has at least three Accession numbers: the PDB code of the corresponding PDB record (e.g., 1CYO, 1B8G); a unique MMDB-ID (e.g., 645, 12342); and a gi number for each protein chain. A new MMDB-ID is assigned whenever PDB updates either the sequence or coordinates of a structure record, even if the PDB code is retained. If an MMDB record contains more than one polypeptide or nucleotide chain, each chain in the MMDB record is assigned an Accession number in Entrez Protein or Nucleotide consisting of the PDB code followed by the letter designating that chain (e.g., 1B8GA, 3TATB, 1MUHB). Each 3D Domain identified in an MMDB record is assigned a unique integer identifier that is appended to the Accession number of the chain to which it belongs (e.g., 1B8G A 2). This new Accession number becomes its identifier in Entrez 3D Domains. New 3D Domain identifiers are assigned whenever a new MMDB-ID is assigned. For conserved domains, the Accession number is based on the source database:
To build MMDB (
The data are converted into
After initial processing, 3D domains are automatically identified within each MMDB record. 3D domains are annotations on individual MMDB structures that define the boundaries of compact substructures contained within them. In this way, they are similar to secondary structure annotations that define the boundaries of helical or β-strand substructures. Because proteins are often similar at the level of domains, VAST compares each 3D domain to every other one and to complete polypeptide chains. The results are stored in Entrez as a
To identify 3D domains within a polypeptide chain, MMDB's domain parser searches for one or more breakpoints in the structure. These breakpoints fall between major secondary structure elements such that the ratio of intra- to interdomain contacts remains above a set threshold. The 3D domains identified in this way provide a means both to increase the sensitivity of structure neighbor calculations and to present 3D superpositions based on compact domains as well as on complete polypeptide chains. They are not intended to represent domains identified by comparative sequence and structure analysis, nor do they represent modules that recur in related proteins, although there is often good agreement between domain boundaries identified by these methods.
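The contact-ratio criterion can be illustrated with a toy calculation. The sketch below is not MMDB's domain parser: it assumes a single breakpoint that splits a chain into two domains, takes a precomputed list of residue-residue contacts as input, and uses hypothetical function names and an arbitrary threshold.

```python
def contact_ratio(contacts, breakpoint):
    """Ratio of intra- to interdomain contacts for one candidate breakpoint.

    contacts: iterable of (i, j) residue-index pairs that are in contact.
    breakpoint: residues with index < breakpoint form one domain, the rest the other.
    """
    intra = inter = 0
    for i, j in contacts:
        if (i < breakpoint) == (j < breakpoint):
            intra += 1  # both residues on the same side of the breakpoint
        else:
            inter += 1  # the contact crosses the breakpoint
    return float("inf") if inter == 0 else intra / inter


def acceptable_breakpoints(contacts, candidates, threshold):
    """Keep the candidate breakpoints whose contact ratio exceeds the threshold."""
    return [b for b in candidates if contact_ratio(contacts, b) > threshold]
```

A breakpoint that cuts through a densely packed region produces many interdomain contacts and a low ratio, so it is rejected; a breakpoint between two compact substructures scores high.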
After initially processing the PDB record, structure staff add a number of links and other information that further integrate the MMDB record with other NCBI resources. To begin, the sequence information extracted from the PDB record is entered into the Entrez Protein and/or Nucleotide databases as appropriate, providing a means to retrieve the structure information from sequence searches. As with all sequences in Entrez, precomputed BLAST searches are then performed on these sequences, linking them to other molecules of similar sequence. For proteins, these BLAST neighbors may be different from those determined by VAST; although VAST uses a conservative significance threshold, the structural similarities it detects often represent remote relationships not detectable by sequence comparison. The literature citations in the PDB record are linked to PubMed so that Entrez searches can allow access to the original descriptions of the structure determinations. Finally, semiautomatic processing of the “source” field of the PDB record provides links to the NCBI Taxonomy database. Although these links normally follow the genus and species information given, in some cases this information is either absent in the PDB record or refers only to how a sample was obtained. In these cases, the staff manually enters the appropriate taxonomy links.
The Structure Summary page for each MMDB record summarizes the database content for that record and serves as a starting point for analyzing the record using the NCBI structure tools. The page consists of three parts: the header, the view bar, and the graphic display. The header contains basic identifying information about the record: a description of the protein (
Although VAST itself is not a database, the VAST results computed for each MMDB record are stored with this record and are summarized on a separate page for the whole polypeptide chain as well as for each 3D domain found in the protein ( The
The non-redundant PDB database (nr-PDB) is a collection of four sets of sequence-dissimilar PDB polypeptide chains assembled by NCBI Structure staff. The four sets differ only in their respective levels of non-redundancy. The staff assembles each set by comparing all the chains available from PDB with each other using the BLAST algorithm. The chains are then clustered into groups of similar sequence using a single-linkage clustering procedure. Chains within a sequence-similar group are automatically ranked according to the quality of their structural data. Details of the measures used to determine structure precision and completeness and the methodology of assembling the nr-PDB clusters can be found on the nr-PDB
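Single-linkage clustering places two chains in the same group whenever they are connected by any path of pairwise similarities. The sketch below shows the idea with a small union-find structure; it is an illustration under stated assumptions, not the procedure used by the Structure staff. The input pairs stand in for BLAST hits above some cutoff, and the quality scores used for ranking are hypothetical numbers.

```python
def single_linkage_clusters(chains, similar_pairs):
    """Group chains so that any two linked by a path of similar pairs share a cluster."""
    parent = {c: c for c in chains}

    def find(c):
        # Follow parent pointers to the cluster representative (with path halving).
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for a, b in similar_pairs:
        parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for c in chains:
        clusters.setdefault(find(c), []).append(c)
    return list(clusters.values())


def rank_clusters(clusters, quality):
    """Order each cluster so that the chain with the highest quality score is first."""
    return [sorted(cl, key=lambda c: quality[c], reverse=True) for cl in clusters]
```

Note that with single linkage, chains A and C end up together if A resembles B and B resembles C, even when A and C were never directly matched.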
CDs are recurring units in polypeptide chains (sequence and structure motifs), the extents of which can be determined by comparative analysis. Molecular evolution uses such domains as building blocks and these may be recombined in different arrangements to make different proteins with different functions. The CDD contains sequence alignments that define the features that are conserved within each domain family. Therefore, the CDD serves as a classification resource that groups proteins based on the presence of these predefined domains. CDD entries often name the domain family and describe the role of conserved residues in binding or catalysis. Conserved domains are displayed in MMDB Structure summaries and link to a sequence alignment showing other proteins in which the domain is conserved, which may provide clues on protein function.
The collections of domain alignments in the CDD are imported either from two databases outside of the NCBI, named Pfam (
Once imported and constructed, each domain alignment in CDD is used to calculate a model sequence, called a
RPS-BLAST (
Analogous to the Structure Summary page, the CD Summary page displays the available information about a given CD and offers various links for either viewing the CD alignment or initiating further searches ( The
In 2002, NCBI released the first group of curated CD records, a new and expanding set of annotated protein multiple sequence alignments and corresponding structure alignments. These new records have Accession numbers beginning with “cd” and have been added to the default CD-Search database. Most curated CD records are based on existing family descriptions from SMART and Pfam, but the alignments may have been revised extensively by quantitatively using three-dimensional structures and by re-examining the domain extent. In addition, CDD curators annotate conserved functional residues, ligands, and co-factors contained within the structures. They also record evidence for these sites as pointers to relevant literature or to three-dimensional structures exemplifying their properties. These annotations may be viewed using Cn3D and thus provide a direct way of visualizing functional properties of a protein domain in the context of its three-dimensional structure. (See
The term “domain” refers in general to a distinct functional and/or structural unit of a protein. Each polypeptide chain in MMDB is analyzed for the presence of two classes of domains, and it is important for users to understand the difference between them. One class, called 3D Domains, is based solely on similar, compact substructures, whereas the second class, called Conserved Domains (CDs), is based solely on conserved sequence motifs. These two classifications often agree, because the compact substructures within a protein often correspond to domains joined by recombination in the evolutionary history of a protein. Note that CD links can be identified even when no 3D structures within a family are known. Moreover, 3D Domain links may also indicate relationships either to structures not included in CDD entries or to structures so distantly related that no significant similarity can be found by sequence comparisons.
Suppose that we are interested in the biosynthesis of aminocyclopropanes and would like to find structural information on important active site residues in any available aminocyclopropane synthases. To begin, we would go to the
Once we have found the Structure Summary page, viewing the 3D structure is straightforward. To view the structure in Cn3D, we simply select the
Upon inspecting the structure, we immediately notice that a small molecule is bound to the protein, likely at the active site of the enzyme. How do we find out what that molecule is? One easy way is to return to the Structure Summary page and select the link to the PDB code, which takes us to the PDB Structure Explorer page for 1B8G. Quickly, we see that pyridoxal-5′-phosphate (PLP) is a HET group, or heterogen, in the structure. Our interest piqued, we would now like to know more about the structural domain containing the active site. Returning to Cn3D, we manipulate the structure so that PLP is easily visible and then use the mouse to double-click on any PLP atom. The molecule becomes selected and turns yellow. Now from the
Given that this enzyme is a dimer, we arbitrarily choose domain 3 in chain A, the accession of which is thus 1B8GA3. By clicking on the 3D Domain bar at a point within domain 3, we are taken to the VAST Structure Neighbors page for this domain, where we find nearly 200 structure neighbors.
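Accessions such as 1B8GA3 follow the composition rule described earlier: PDB code, then chain letter, then the 3D Domain integer. A minimal parsing sketch is shown below. It assumes four-character PDB codes and single-character chain identifiers, as in the examples in this chapter; the names `StructureAccession` and `parse_accession` are hypothetical.

```python
from typing import NamedTuple, Optional


class StructureAccession(NamedTuple):
    pdb_code: str
    chain: str
    domain: Optional[int]  # None for a whole-chain accession


def parse_accession(accession):
    """Split a chain or 3D Domain accession such as 1B8GA or 1B8GA3."""
    compact = accession.replace(" ", "").upper()
    if len(compact) < 5:
        raise ValueError(f"too short for a chain accession: {accession!r}")
    pdb_code, chain, rest = compact[:4], compact[4], compact[5:]
    if rest and not rest.isdigit():
        raise ValueError(f"domain identifier is not an integer: {accession!r}")
    return StructureAccession(pdb_code, chain, int(rest) if rest else None)
```

Spaces are ignored, so the spaced form "1B8G A 2" and the compact form "1B8GA2" parse to the same result.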
Perhaps we would now like to identify some of the most evolutionarily distant structure neighbors of domain 1B8GA3 as a means of finding conserved residues that may be associated with its binding and/or catalytic function. One powerful way of doing this is to choose structure neighbors from phylogenetically distant organisms. We therefore need to combine our present search with a Taxonomy search. Given that 1B8G is derived from the superkingdom Eukaryota, we would like to find structure neighbors in other superkingdom taxa, such as Eubacteria and Archaea. Returning to the Structure Summary page, select the 3D Domains link in the graphic display to open the list of 3D Domains in Entrez. Finding 1B8GA3 in the list, selecting the
Looking at the Archaea results, we find among them 1DJUA3, a domain from an aromatic aminotransferase from
Returning to the VAST Structure Neighbors page for 1B8GA3, we want to select 1DJUA3 and 3TATA2 to display in a structural alignment. One way to do this is to enter these two Accession numbers in the
Cn3D again displays the aligned residues in red, and we can highlight these further by selecting The backbone atoms of the aligned residues of the three structures are shown colored according to their sequence conservation of each position in the alignment. Highly conserved positions are colored more
For an example query on finding and viewing structures, see
To determine the overall shape and size of a protein
To locate a residue of interest in the overall structure
To locate residues in close proximity to a residue of interest
To develop or test chemical hypotheses regarding an enzyme mechanism
To locate or predict possible binding sites of a ligand
To interpret mutation studies
To find areas of positive or negative charge on the protein surface
To locate particularly hydrophobic or hydrophilic regions of a protein
To infer the 3D structure and related properties of a protein with unknown structure from the structure of a
To study evolutionary processes at the level of molecular structure
To study the function of a protein
To study the molecular basis of disease and design novel treatments
The first step to any structural analysis at NCBI is to find the structure records for the protein of interest or for proteins similar to it. One may search MMDB directly by entering search terms such as PDB code, protein name, author, or journal in the Entrez Structure
By using the full array of Entrez search tools, the resulting list of MMDB records can be honed, ideally, to a workable list from which a record can be selected. Users should note that multiple records may exist for a given protein, reflecting different experimental techniques, conditions, and the presence or absence of various ligands or metal ions. Records may also contain different fragments of the full-length molecule. In addition, many structures of mutant proteins are also available. The PDB record for a given structure generally contains some description of the experimental conditions under which the structure was determined, and this file can be accessed by selecting the PDB code link at the top of the Structure Summary page.
Structure Summary pages can also be found from the following NCBI databases and tools:
Select the Structure Select the Choose the PDB database from a Select the From the results of any protein BLAST search, click on a red 'S' linkout to view the sequence alignment with a structure record.
The 3D domains of a protein are displayed on the Structure Summary page. It is useful to know how many 3D domains a protein contains and whether they are continuous in sequence when viewing the full 3D structure of the molecule.
Knowing the secondary structure of a protein can also be a useful prelude to viewing the 3D structure of the molecule. The secondary structure can be viewed easily by first selecting the
Cn3D is a software package for displaying 3D structures of proteins. Once it has been
For an example query on finding and viewing structure neighbors, see
To determine structurally conserved regions in a protein family
To locate the structural equivalent of a residue of interest in another related protein
To gain insights into the allowable structural variability in a particular protein family
To develop or test chemical hypotheses regarding an enzyme mechanism
To predict possible binding sites of a ligand from the location of a binding site in a related protein
To identify sites where conformational changes are concentrated
To interpret mutation studies
To find areas of conserved positive or negative charge on the protein surface
To locate conserved hydrophobic or hydrophilic regions of a protein
To identify evolutionary relationships across protein families
To identify functionally equivalent proteins with little or no sequence conservation
The Vector Alignment Search Tool (VAST) is used to identify similar structures for each protein contained in the MMDB. The graphic display on each Structure Summary page ( The 3D Domains link transfers the user to Entrez 3D Domains, showing a list of the VAST neighbors. Selecting the chain bar displays the VAST Structure Neighbors page for the entire chain. Selecting a 3D Domain bar displays the VAST Structure Neighbors page for the selected domain.
From any Entrez search, select
A graphic 2D HTML alignment of VAST neighbors can be viewed as follows:
On the lower portion of the VAST Structure Neighbors page ( On the Select
Alignments of VAST structure neighbors can be viewed as a 3D image using Cn3D.
On the lower portion of the VAST Structure Neighbors page (
On the
Select
Cn3D automatically launches and displays the aligned structures. Each displayed chain has a unique color; however, the portions of the structures involved in the alignment are shown in red. These same colors are also reflected in the Sequence/Alignment Viewer. Among the many viewing options provided by Cn3D, of particular use is the
Suppose that we are interested in topoisomerase enzymes and would like to find human topoisomerases that most closely resemble those found in eubacteria and thus may share a common ancestor. Further suppose that through a colleague, we are aware of a recent and particularly interesting crystal structure of a topoisomerase from
The Structure Summary page displays only the CDs that give the best match to the protein sequence. To see all of the matching CDs, we can easily perform a full CD-Search. Select the
We now would like to find human proteins that have these same CDs. To perform a CDART search, simply select the
We now would like to view the alignment of the topoisomerase in the human protein to other members of this CD. On the CD-Search page, select the colored bar of this CD to see a CD-Browser window displaying the alignment. Because this is a curated CD record, we are able to view functional features of the protein domain on a structural template. The rightmost menu in the View Alignment bar shows the available features for this domain, whereas the topmost row in the alignment itself marks the residues involved in this feature with # symbols. The second row of the alignment is the consensus sequence of the CD record, whereas the third row contains the NP_004609 sequence, labeled “query”. At the bottom of the page, buttons allow Cn3D to be launched with various structural features highlighted. For example, if we are interested in nucleotide binding site II, Cn3D will launch with the view depicted in The
For an example query on finding and viewing conserved domains, see
To locate functional domains within a protein
To predict the function of a protein whose function is unknown
To establish evolutionary relationships across protein families
To interpret mutation studies
To predict the structure of a protein of unknown structure
Following the Domains link for any protein in Entrez, one can find the conserved domains within that protein. The
Information on the CDs contained within a protein can also be found from these databases and tools:
From any Entrez search: select the
From the Structure Summary page of a MMDB record: this page displays the CDs within each protein chain immediately below the 3D Domain bar in the graphic display. Selecting the
From an Entrez Domains search: choose
From the CDD page: locate CDs by entering text terms into the search box and proceed as for an Entrez CD search.
From a BLink report: select the
From the BLAST main page: follow the RPS-BLAST link to load the CD-Search page.
Results from a CD search are displayed as colored bars underneath a sequence ruler. Moving the mouse over these bars reveals the identity of each domain; domains are also listed in a format similar to BLAST summary output (
These can be displayed by clicking a CD bar within a MMDB Structure Summary page or from a hyperlinked CD name on a CD-Search results page.
If members of a CD have MMDB records, one of these records can be viewed as a 3D image along with the sequence alignment using Cn3D (launched by selecting the pink dot on a CD-Search results page). As in other alignment views, colored capital letters indicate aligned residues, allowing the sequence of the protein of interest to be mapped onto the available 3D structure.
For an example query on finding and viewing proteins with similar domain architectures, see
To locate related functional domains in other protein families
To gain insights into how a given CD is situated within a protein relative to other CDs
To explore functional links between different CDs
To predict the function of a protein whose function is unknown
To establish evolutionary relationships across protein families
Following the
From a CD-Search results page, click
From a CD-Summary page, click the
From an Entrez Domains search, click the
These are described in At the
As illustrated in the sections above, there are numerous connections between the Structure resources and other databases and tools available at the NCBI. What follows is a listing of major tools that support connections.
Because Entrez is an integrated database system (
Although the BLAST service is designed to find matches based solely on sequence, the sequences of Structure records are included in the BLAST databases, and by selecting the PDB search database, BLAST searches only the protein sequences provided by MMDB records. A new
The BLink report represents a precomputed list of similar proteins for many proteins (see, for example, links from LocusLink records;
A particularly useful interface with the structural databases is provided on the
As stated elsewhere, all records in the MMDB are obtained originally from the Protein Data Bank (PDB) (
The CDD staff imports CD collections from both the Pfam and SMART databases. Links to the original records in these databases are located on the appropriate CD Summary page. Both Pfam and SMART are updated several times per year in roughly bimonthly intervals, and the CDD staff update CDD accordingly.
Structures displayed in Cn3D can be exported as a Portable Network Graphics (
Individual MMDB records can be saved/downloaded to a local computer directly from the Structure Summary page for that record.
Alignments of VAST neighbors can be saved/downloaded from the VAST Structure Neighbors page of any MMDB record. By selecting options in the
Users can download the NCBI Structure databases from the NCBI FTP site:
mmdbdata: the current MMDB database (NOTE: these files cannot be read directly by Cn3D)
vastdata: the current set of VAST neighbor annotations to MMDB records
nrtable: the current non-redundant PDB database
pdbeast: table listing the taxonomic classification of MMDB records
CDD data can be downloaded from
The NCBI Taxonomy database is a curated set of names and classifications for all of the organisms that are represented in GenBank. When new sequences are submitted to GenBank, the submission is checked for new organism names, which are then classified and added to the Taxonomy database. As of April 2003, there were 176,890 total taxa represented.
There are two main tools for viewing the information in the Taxonomy database: the Taxonomy Browser and Entrez Taxonomy. Both systems allow searching of the Taxonomy database for names, and both link to the relevant sequence data. However, the Taxonomy Browser provides a hierarchical view of the classification (the best display for most casual users interested in exploring our classification), whereas Entrez Taxonomy provides a uniform indexing, search, and retrieval engine with a common mechanism for linking between the Taxonomy and other relevant Entrez databases.
By the time the NCBI was created in 1988, the nucleotide
sequence databases (GenBank,
The Taxonomy Project started in 1991, in association with
the launch of Entrez (
To represent, manipulate, and store versions of each of the different database taxonomies, we wrote a stand-alone, tree-structured database manager, TaxMan. This also allowed us to merge the taxonomies into a single composite classification. The resulting hybrid was, at first, a bigger mess than any of the pieces had been, but it gave us a starting point that spanned all of the names in all of the sequence databases. For many years, we cleaned up and maintained the NCBI Taxonomy database with TaxMan.
After the initial unification and clean-up of the taxonomy for Entrez was complete, Mitch Sogin organized a workshop to give us advice on the clean-up and recommendations for the long-term maintenance of the taxonomy. This was held at the NCBI in 1993 and included: Mitch Sogin (protists), David Hillis (chordates), John Taylor (fungi), S.C. Jong (fungi), John Gunderson (protists), Russell Chapman (algae), Gary Olsen (bacteria), Michael Donoghue (plants), Ward Wheeler (invertebrates), Rodney Honeycutt (invertebrates), Jack Holt (bacteria), Eugene Koonin (viruses), Andrzej Elzanowski (PIR taxonomy), Lois Blaine (ATCC), and Scott Federhen (NCBI). Many of these attendees went on to serve as curators for different branches of the classification. In particular, David Hillis, John Taylor, and Gary Olsen put in long hours to help the project move along.
In 1995, as more demands were made on the Taxonomy database,
the system was moved to a Sybase relational database
(
In 1997, the
Organismal taxonomy is a powerful organizing principle in the
study of biological systems. Inheritance, homology by common descent,
and the conservation of sequence and structure in the determination
of function are all central ideas in biology that are directly related
to the evolutionary history of any group of organisms. Because of this,
taxonomy plays an important cross-linking role in many of the
The NCBI Taxonomy database is a curated set of names and classifications
for all of the organisms that are represented in
Of the several different ways to build a taxonomy, our group maintains
a If two organisms (A and B) are listed more closely together in the taxonomy
than either is to organism C, the assertion is that C diverged from the
lineage leading to A+B earlier in evolutionary history, and that A and B
share a common ancestor that is not in the direct line of evolutionary
descent to species C. For example, the current consensus is that the closest
living relatives of the birds are the crocodiles; therefore, our classification
does not include the familiar taxon Reptilia (turtles, lizards and snakes,
and crocodiles), which excludes the birds, and would break the phylogenetic
principle outlined above.
Our classification represents an assimilation of information from many
different sources (see
We do not rely on sequence data alone to build our classification, and we do not perform phylogenetic analysis ourselves as part of the taxonomy project. Most of the organisms in GenBank are represented by only a snippet of sequence; therefore, sequence information alone is not enough to build a robust phylogeny. The vast majority of species are not there at all, although about 50% of the birds and the mammals are represented. We therefore also rely on analyses from morphological studies; the challenge of modern systematics is to unify molecular and morphological data to elucidate the evolutionary history of life on earth.
Currently, more than 100 new species are added to the database daily, and the rate is accelerating as sequence analysis becomes an ever more common component of systematic research and the taxonomic description of new species.
The
The number and complexity of organisms in a submission can vary enormously. Many contain a single new name, others may include 100 species, all from the same familiar genus, whereas others may include 100 names (only half identified at the species level) from 100 genera (all of which are new to the Taxonomy database) without any other identifying information at all.
Some new organism names are found by software when the protein
sequence databases (
We often receive consults on submissions with explicitly new species
names that will be published as part of the description of a new species.
These sequence entries (like any other) may be designated “hold
until published” (
Occasionally, the same new genus name is proposed simultaneously for different taxa; in one case, two papers with conflicting new names had been submitted to the same journal, and both had gone through one round of review and revision without detection of the duplication. Although these duplications would have been discovered in time, the increasingly common practice of including some sequence analysis in the description of a new species can lead to earlier detection of these problems. In many cases, the new species name proposed in the submitted manuscript is changed during the editorial review process, and a different name appears in the publication. Submitters are encouraged to inform us when their new descriptions have been published, particularly if the proposed names have been changed.
We strongly encourage the submission of strain names for cultured bacteria, algae, and fungi and for sequences from laboratory animals in biochemical and genetic studies; of cultivar names for sequences from cultivated plants; and of specimen vouchers (something that definitively ties the sequence to its source) for sequences from phylogenetic studies. There are many other kinds of useful information that may be contained within the sequence submission, but these data are the bare minimum necessary to maintain a reliable link between an entry in the sequence database and the biological source material.
The
TaxBrowser is updated continuously. New species will appear on a daily basis as the new names appear in sequence entries indexed during the daily release cycle of the Entrez databases. New taxa in the classification appear in TaxBrowser on an ongoing basis, as sections of the taxonomy already linked to public sequence entries are revised.
The browser produces two different kinds of Web pages:
hierarchy pages, which present a familiar indented
flatfile view of the taxonomic classification,
centered on a particular taxon in the database; and taxon-specific pages, which summarize all of the information that
we associate with any particular taxonomic entry in the database. For
example, “hominidae” as a search term from the (
The taxon-specific browser display page shows all of the information
that is associated with a particular taxon in the Taxonomy database and
some information collected through links with related databases
( The name, taxid, rank,
There are two sets of links to Entrez records from the Taxonomy
Browser. The “subtree links” are accumulated up the tree
in a hierarchical fashion; for example, there are 16 million nucleotide
records and a half million protein records associated with the Chordata
(
“Direct links” will retrieve Entrez records that are
linked directly to this particular node in the taxonomy database. Many
of the Entrez domains (e.g., sequences and structures) are linked into
the taxonomy at or below the species level; it is a data error when a
sequence entry is directly linked into the taxonomy at a taxon somewhere
above the species level. For other Entrez domains (e.g., literature and
phylogenetic sets), this is not the case. A journal article may talk
about several different species but may also refer directly to the
The taxon-specific browser pages now also show the NCBI LinkOut links
to external resources. These include links to a broad range of different
kinds of resources and are provided for the convenience of our users; the
NCBI does not vouch for the content of these resources, although we do
make an effort to ensure that they are of good scientific quality. A
complete list of external resources can be found
There are several different ways to search for names in the Taxonomy database. If the search results in a terminal node in our taxonomy, the taxon-specific browser page is displayed; if the search returns an internal (non-terminal) node, the hierarchical classification page is displayed.
Names can be duplicated in the Taxonomy database, but the taxonomy
browser can only be focused on a single taxon at any one time. If a
complete name search retrieves more than one entry from the taxonomy,
an intermediate name selection screen appears ( Searches for
There is a
The NCBI Taxonomy database is stored as a Sybase relational database, called TAXON. The NCBI taxonomy group maintains the database with a customized software tool, the Taxonomy Editor. Each entry in the database is a “taxon”, also referred to as a “node” in the database. The “root node” (taxid 1) is at the top of the hierarchy. The path from the root node to any other particular taxon in the database is called its “lineage”; the collection of all of the nodes beneath any particular taxon is called its “subtree”. Each node in the database may be associated with several names, of several different nametypes. For indexing and retrieval purposes, the nametypes are essentially equivalent.
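The lineage and subtree definitions can be illustrated with a minimal sketch. It assumes a hypothetical in-memory map from each taxid to its parent taxid, with the root listed as its own parent; this is not the TAXON schema, and every taxid other than the root (taxid 1) is made up for illustration.

```python
from collections import defaultdict

# Toy parent map (child taxid -> parent taxid). Only the root, taxid 1,
# comes from the text; the other ids are invented for this sketch.
PARENT = {1: 1, 10: 1, 20: 10, 21: 10, 30: 20, 31: 20}


def lineage(taxid, parent=PARENT):
    """Return the path from the root node down to `taxid`."""
    path = [taxid]
    while parent[taxid] != taxid:  # assumption: the root is its own parent
        taxid = parent[taxid]
        path.append(taxid)
    return path[::-1]


def subtree(taxid, parent=PARENT):
    """Return the set of all nodes beneath `taxid` (excluding `taxid`)."""
    children = defaultdict(list)
    for child, par in parent.items():
        if child != par:
            children[par].append(child)
    found, stack = set(), [taxid]
    while stack:
        for child in children[stack.pop()]:
            found.add(child)
            stack.append(child)
    return found
```

With this toy data, `lineage(30)` walks up through nodes 20 and 10 to the root, and `subtree(10)` collects the four nodes beneath node 10.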
The Taxonomy database is populated with species names that have appeared in a sequence record from one of the nucleotide or protein databases. If a name has ever appeared in a sequence record at any time (even if it is not found in the current version of the record), we try to keep it in the Taxonomy database for tracking purposes (as a synonym, a misspelling, or other nametype), unless there are good reasons for removing it completely (for example, if it might cause a future submission to map to the wrong place in the taxonomy).
File | Uncompresses to | Description |
---|---|---|
taxdump.tar.Z | readme.txt | A terse description of the dmp files |
taxdump.tar.Z | nodes.dmp | Structure of the database; lists each taxid with its parent taxid, rank, and other values associated with each node (genetic codes, etc.) |
taxdump.tar.Z | names.dmp | Lists all the names associated with each taxid |
taxdump.tar.Z | delnodes.dmp | Deleted taxid list |
taxdump.tar.Z | merged.dmp | Merged nodes file |
taxdump.tar.Z | division.dmp | GenBank division files |
taxdump.tar.Z | gencode.dmp | Genetic codes files |
taxdump.tar.Z | gc.prt | Print version of genetic codes |
gi_taxid_nucl.dmp.gz | gi_taxid_nucl.dmp | A list of gi_taxid pairs for every live gi-identified sequence in the nucleotide sequence database |
gi_taxid_prot.dmp.gz | gi_taxid_prot.dmp | A list of gi_taxid pairs for every live gi-identified sequence in the protein sequence database |
gi_taxid_nucl_diff.dmp | gi_taxid_nucl_diff | List of differences between latest gi_taxid_nucl and previous listing |
gi_taxid_prot_diff.dmp | gi_taxid_prot_diff | List of differences between latest gi_taxid_prot and previous listing |
For non-UNIX users, the file taxdmp.zip includes the same (zip compressed) data.
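As a rough illustration of how the dump files might be read, the sketch below parses rows from nodes.dmp and names.dmp. It assumes the layout described in the readme.txt that ships with the dump: fields separated by a tab, a vertical bar, and a tab, with each row ending in a tab and a vertical bar. It also assumes the column order taxid, parent taxid, rank for nodes.dmp and taxid, name, unique name, nametype for names.dmp. Check these assumptions against your copy of the readme before relying on them; the function names are hypothetical.

```python
import io


def read_dmp(handle):
    """Yield each row of a .dmp file as a list of stripped field strings."""
    for line in handle:
        line = line.rstrip("\n")
        if line.endswith("\t|"):  # assumed row terminator
            line = line[:-2]
        if line:
            yield [field.strip() for field in line.split("\t|\t")]


def load_nodes(handle):
    """Map taxid -> (parent taxid, rank) from nodes.dmp rows."""
    return {int(row[0]): (int(row[1]), row[2]) for row in read_dmp(handle)}


def load_scientific_names(handle):
    """Map taxid -> scientific name from names.dmp rows."""
    return {
        int(row[0]): row[1]
        for row in read_dmp(handle)
        if row[3] == "scientific name"
    }


# Two illustrative rows in the assumed names.dmp layout.
SAMPLE_NAMES = (
    "9606\t|\tHomo sapiens\t|\t\t|\tscientific name\t|\n"
    "9606\t|\thuman\t|\t\t|\tgenbank common name\t|\n"
)
print(load_scientific_names(io.StringIO(SAMPLE_NAMES)))
```

Reading from a file handle rather than a path keeps the parser usable with either the uncompressed files or an in-memory sample such as the one shown.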
Each taxon in the database has a unique identifier, its taxid.
Taxids are assigned sequentially. When a taxon is deleted, its taxid
disappears and is not reassigned (
There are many possible types of names that can be associated
with an organism taxid in TAXON. To track and display the names
correctly, the various names associated with a taxid are tagged
with a nametype, for example “scientific name”,
“synonym”, or “common name”. Each taxid
When sequences are submitted to GenBank, usually only a scientific
name is included; most other names are added by NCBI taxonomists at
the time of submission or later, when further information is discovered.
For a complete description of each nametype used in TAXON, see
Scientific names, the only required nametype for a taxid, can be
further qualified into different classes. Not all “scientific
names” that accompany sequence submissions are true Linnaean
Latin binomial names; if the taxon is not identified to the species
level, it is not possible to assign a binomial name to it. For
indexing and retrieval purposes, TAXON needs to know whether the
scientific name is a Latin binomial name, or otherwise. A full
listing of the classes of TAXON scientific names can be viewed in
The treatment of duplicated names was discussed briefly in the section on the Taxonomy browser. For our purposes, there are four main classes of duplicated scientific names: (1) real duplicate names, (2) structural duplicates, (3) polyphyletic genera, and (4) other duplicate names.
There are several main codes of nomenclature for living organisms:
the Zoological Code (International Code of Zoological Nomenclature,
ICZN; for animals), the Botanical Code (International Code of Botanical
Nomenclature, ICBN; for plants), the Bacteriological Code (International
Code of Nomenclature of Bacteria, ICNB; for prokaryotes), and the Viral
Code (International Code of Virus Classification and Nomenclature, ICVCN;
for viruses). Within each code, names are required to be unique. When
duplicate names are discovered within a code, one of them is changed
(generally, the newer duplicate name). However, the codes are complex,
and not all names are subject to these restrictions. For example,
There is no real effort to make the scientific names of taxa unique among Codes, and among the relatively small set of names represented in the NCBI taxonomy database (20,000 genera), there are approximately 200 duplicate names (or about 1%), mostly at the genus level.
Early in 2002, the first duplicate species name was recorded in the
Taxonomy database.
In the Zoological and Bacteriological Codes, the subgenus that
includes the type species is required to have the same name as the genus.
This is a systematic source of duplicate names. For these duplicates,
we use the associated rank in the unique name, e.g.,
Certain genera, especially among the asexual forms of Ascomycota
and Basidiomycota, are polyphyletic, i.e., they do not share a common
ancestor. Pending taxonomic revisions that will transfer species assigned
to “form” genera such as
We list many duplicate names in other nametypes (apart from our preferred “scientific name” for each taxon). Most of these are included for retrieval purposes and are either common names or the names of familiar paraphyletic taxa that we have not included in our classification, e.g., Osteichthyes, Coelenterata, and reptiles.
Aside from names, there are several optional types of information
that may be associated with a taxid. These are (1) rank, such as
species, genus or family; (2) genetic code, for translating proteins;
(3) GenBank division; (4) literature citations; and (5) abbreviated
lineage, for display in GenBank flat files. For more details on these
data types, see
The TAXON database is a node within the Entrez integrated retrieval
system (
Taxonomy was the first Entrez database to have an internal
hierarchical structure. Because Entrez deals with unordered sets
of objects in a given domain, an alternative way to represent these
hierarchical relationships in Entrez was required (see the section
The main focus of the Entrez Taxonomy
The default Entrez search is case insensitive and can be for any of
the names that can be found in the Taxonomy database. Thus, any of the
following search terms, Homo sapiens, homo sapiens, human, or Man, will
retrieve the node for
As for other Entrez databases, Taxonomy supports Boolean searching,
a
Each search result, listed in document summary (DocSum) format, may
have several links associated with it. For example, for the search
result Homo sapiens, the
A helpful list follows:
A search for Hominidae retrieves a single, hyperlinked entry.
Selecting the link shows the structure of the taxon. On the other
hand, a search for Hominidae[subtree] will retrieve a
nonhierarchical list of all of the taxa listed within the
Hominidae. A search for species[rank] yields a list of all species in
the Taxonomy database (108,020 in May 2002). Find the Taxonomy update frequency by selecting Entrez
An overview of the distribution of taxa in the DocSum list can be
seen if
To filter out less interesting names from a DocSum list, add some
terms to the query, e.g., 2002/01/10[date] NOT uncultured[prop]
NOT unspecified[prop].
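Queries like these can also be assembled programmatically. The sketch below only builds query strings and URLs; it does not contact any server. It assumes the NCBI E-utilities ESearch endpoint, which this chapter does not describe, and the helper names are hypothetical.

```python
from urllib.parse import urlencode

# Assumption: the E-utilities ESearch service, not covered in this chapter.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def exclude_props(term, props=("uncultured", "unspecified")):
    """Append NOT clauses that filter out taxa carrying the given properties."""
    return term + "".join(f" NOT {prop}[prop]" for prop in props)


def taxonomy_query_url(term, retmax=20):
    """Build (but do not send) an ESearch URL for the taxonomy database."""
    params = {"db": "taxonomy", "term": term, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)


query = exclude_props("2002/01/10[date]")
print(query)
print(taxonomy_query_url("Hominidae[subtree]"))
```

Keeping query construction separate from URL construction makes it easy to check the query text in the Entrez search box before automating it.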
There are a variety of choices regarding how search results can be displayed in Taxonomy Entrez.
The Common Tree view shows an abbreviated view of the taxonomic
hierarchy and is designed to highlight the relationships between a
selected set of organisms. The ten species shown in
If there are more than a few dozen taxa selected for the common
tree view, the display becomes visually complex and generally less
useful. When a large list of taxa is sent to the Common Tree display,
a summary screen is displayed first. For example, we currently list
727 families in the Viridiplantae (plants and green algae)
( The taxa are aggregated at the predetermined set of nodes in the
Taxonomy database that have been assigned “BLAST names”.
This serves as an informal, very abbreviated, vernacular classification
that gives a convenient overview. The BLAST names will often not
provide complete coverage for all species at all levels in the tree.
Here, not all of our
There are several formatting options for saving the common tree display to a text file: text tree, phylip tree, and taxid list.
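To make the idea of a common tree concrete, here is a minimal sketch that merges the lineages of a few selected taxa, drops internal nodes that have only one child, and writes the result in the parenthesized notation used by PHYLIP-style tree files. It is an illustration under stated assumptions, not the algorithm used by the Common Tree display; the lineages are heavily abbreviated and the function names are hypothetical.

```python
def merge_lineages(lineages):
    """Build a nested-dict tree from root-to-taxon lists of names."""
    tree = {}
    for path in lineages:
        node = tree
        for name in path:
            node = node.setdefault(name, {})
    return tree


def collapse(tree, keep):
    """Remove internal nodes that have a single child and are not selected."""
    out = {}
    for name, sub in tree.items():
        sub = collapse(sub, keep)
        if len(sub) == 1 and name not in keep:
            out.update(sub)  # splice the only child into the parent level
        else:
            out[name] = sub
    return out


def newick(tree):
    """Serialize a nested-dict tree as a parenthesized tree string."""
    parts = []
    for name, sub in sorted(tree.items()):
        label = name.replace(" ", "_")
        parts.append(f"({newick(sub)}){label}" if sub else label)
    return ",".join(parts)


# Abbreviated, illustrative lineages for three selected species.
LINEAGES = [
    ["root", "Hominidae", "Homo", "Homo sapiens"],
    ["root", "Hominidae", "Pan", "Pan troglodytes"],
    ["root", "Muridae", "Mus", "Mus musculus"],
]
selected = {path[-1] for path in LINEAGES}
print(newick(collapse(merge_lineages(LINEAGES), selected)) + ";")
```

In this sketch the genus nodes and Muridae disappear because each leads to a single selected species, while Hominidae is kept because it is where two selected species join.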
Hyperlinks to a common tree display can be made in two ways:
by specifying the common tree view in an Entrez query
by providing a list of taxids directly to the common tree
cgi function (for example,
The
As for any Entrez database, the contents are indexed by creating
term lists for each field of each database record (or taxid). For
There are five different index fields for names in Taxonomy Entrez.
The [lineage] and [subtree] index fields are a way to superimpose
the hierarchical relationships represented in the taxonomy on top of
the Entrez data model. For an example of how to use these field
limits for searching Taxonomy Entrez, see
The genetic code [gc], mitochondrial genetic code [mgc], and GenBank division [division] fields are all inherited within the taxonomy. The information in these fields refers to the genetic code used by a taxon or the GenBank division in which it resides. Because whole families or branches may use the same code or reside in the same GenBank division, this property is usually indexed with a taxon high in the taxonomic tree, and the information is inherited by all those taxa below it. If there is no [gc] field associated with a taxon in the database, it is assumed that the standard genetic code is used. A genetic code may be referred to by either name or translation table number. For example, the two equivalent queries, standard[gc] and translation table 1[gc], each retrieve the set of organisms that use the standard genetic code for translating genomic sequences. Likewise, the two queries echinoderm mitochondrial[mgc] and translation table 9[mgc] each retrieve the set of organisms that use the echinoderm mitochondrial genetic code for translating their mitochondrial sequences.
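The inheritance rule can be sketched as a walk up the tree from a taxon to the nearest ancestor at which a code is set explicitly, falling back to the standard code (translation table 1) if none is found. The parent map, the taxids, and the non-standard table number in the example are toy values chosen for illustration, and the function name is hypothetical.

```python
STANDARD_CODE = 1  # translation table 1, the standard genetic code


def genetic_code(taxid, parent, explicit):
    """Return the genetic code in effect at `taxid`.

    parent:   child taxid -> parent taxid (assumption: root maps to itself)
    explicit: taxid -> translation table number set directly at that node
    """
    while True:
        if taxid in explicit:
            return explicit[taxid]
        if parent[taxid] == taxid:  # reached the root without finding a code
            return STANDARD_CODE
        taxid = parent[taxid]


# Toy tree: node 10 sets a code explicitly; nodes 20 and 30 inherit it.
parent = {1: 1, 10: 1, 20: 10, 30: 20, 40: 1}
explicit = {10: 11}  # arbitrary non-standard table number for the toy data

print(genetic_code(30, parent, explicit))
print(genetic_code(40, parent, explicit))
```

Storing the value once at a high-level node and resolving it on demand mirrors the description in the text; a production system might instead precompute the inherited value for every node at indexing time.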
There are several useful terms and phrases indexed in the [prop] field. Possible search strategies that specify the prop field are explained below.
unspecified [prop]: not identified at the species level
uncultured [prop]: environmental sample sequences
unclassified [prop]: listed in an “unclassified” bin
incertae sedis [prop]: listed in an “incertae sedis” bin
We do not explicitly flag names as “unspecified” in
All of the search strategies below are valid. Taxonomy Entrez displays only taxa that are linked to public sequence entries, and because sequence entries are supposed to correspond to the Taxonomy database at or below the species level, the Entrez query terminal [prop] NOT “at or below species level” [prop] should retrieve only problem cases.
above genus level [prop]
above species level [prop]
“at or below species level” [prop] (needs explicit quotes)
below species level [prop]
terminal [prop]
non terminal [prop]
genetic code [prop]
mitochondrial genetic code [prop]
standard [prop]
invertebrate mitochondrial [prop]
translation table 5 [prop]
The query “genetic code [prop]” retrieves all of the taxa at which one of the genomic genetic codes is explicitly set. The second query retrieves all of the taxa at which one of the mitochondrial genetic codes is explicitly set, and so on.
division [prop]
INV [prop]
invertebrates [prop]
The above terms index the assignments of the GenBank division
codes, which are divided along crude taxonomic categories (see
The remaining index fields are common to most or all Entrez domains, although some have special features in the taxonomy domain. For example, the field text word, [word], indexes words from the Taxonomy Entrez name indexes. Most punctuation is ignored, and the index is searched one word at a time; therefore, the search “homo sapiens[word]” will retrieve nothing.
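A small sketch shows why a two-word term finds nothing in a one-word index. The tokenization here (lowercase, split on anything that is not a letter or digit) is an assumption standing in for the actual Entrez indexing rules, and the function names are hypothetical.

```python
import re
from collections import defaultdict


def words(name):
    """Split a name into lowercase words, ignoring punctuation."""
    return re.findall(r"[a-z0-9]+", name.lower())


def build_word_index(names):
    """names: taxid -> list of names. Returns word -> set of taxids."""
    index = defaultdict(set)
    for taxid, name_list in names.items():
        for name in name_list:
            for word in words(name):
                index[word].add(taxid)
    return index


def search_word(index, term):
    """Look up `term` as a single index entry, as a [word] query would."""
    return index.get(term.lower(), set())


names = {9606: ["Homo sapiens", "human"], 9605: ["Homo"]}
index = build_word_index(names)
print(search_word(index, "homo"))
print(search_word(index, "homo sapiens"))
```

Because every index key is a single word, the phrase "homo sapiens" matches no key, whereas "homo" and "sapiens" each do.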
Several useful terms are indexed in the properties field, [prop],
including functional nametypes and classifications, the rank level of
a taxon, and inherited values. See
More information on using the generic Entrez fields can be found
in the
Many of the Entrez databases (Nucleotide, Protein, Genome, etc.)
include an
To not retrieve such “exploded” terms, the unexploded
indexes should be used. This query will only retrieve the entries that
are linked directly to
Taxids are indexed with the prefix txid: txid9606 [orgn].
Source organism modifiers are indexed in the [properties] field, and such queries would be in the form: src strain[prop], src variety[prop], or src specimen voucher[prop]. These queries will retrieve all entries with a strain qualifier, a variety qualifier, or a specimen_voucher qualifier, respectively.
All of the organism source feature modifiers (/clone, /serovar, /variety, etc.) are indexed in the text word field, [text word]. For example, one could query GenBank for: “strain k-12” [text word]. Because strain information is inconsistent in the sequence databases (as in the literature), a better query would be: “strain k 12”[word] OR “strain k12”[word]. Note: explicit double-quotes may be necessary for some of these queries.
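Because strain designations are written inconsistently, it can help to generate the spelling variants mechanically. The sketch below builds an OR query of the kind shown above from a single strain designation; the helper names are hypothetical, and the set of separators handled (spaces, hyphens, underscores) is an assumption.

```python
import re


def strain_variants(strain):
    """Return spaced and joined spellings of a designation such as 'K-12'."""
    parts = [p for p in re.split(r"[\s\-_]+", strain.lower()) if p]
    variants = [" ".join(parts), "".join(parts)]
    return list(dict.fromkeys(variants))  # de-duplicate, keep order


def strain_query(strain, field="word"):
    """Build an Entrez OR query covering the spelling variants."""
    terms = [f'"strain {variant}"[{field}]' for variant in strain_variants(strain)]
    return " OR ".join(terms)


print(strain_query("K-12"))
```

A designation with no separator yields a single term, so the query never contains duplicate clauses.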
The Taxonomy
The checkboxes
Selecting one of the
A complete copy of the public NCBI taxonomy database is deposited
several times a day on our
Taxonomy BLAST reports (
The function library for the taxonomy application software in the
NCBI Toolkit is ncbitxc2.a (or libncbitxc2.a). The source code can be
found in the
In the early years of the project, Scott Federhen did all of the software and database development. In recent years, Vladimir Soussov and his group have been responsible for software and database development.
Scott Federhen (1990–present)
Andrzej Elzanowski (1994–1997)
Detlef Leipe (1994–present)
Mark Hershkovitz (1996–1997)
Carol Hotton (1997–present)
Mimi Harrington (1999–2000)
Ian Harrison (1999–2002)
Sean Turner (2000–present)
Rick Sternberg (2001–present)
If you have a comment on or correction to our Taxonomy database,
such as a misspelling or misclassification, or if something looks wrong,
please send a message to
Every node in the database is required to have exactly one
“scientific name”. Wherever possible, this is a
validly published name with respect to the relevant code of
nomenclature. Formal names that are subject to a code of
nomenclature and are associated with a validly published
description of the taxon will be Latinized uninomials above
the species level, binomials (e.g.
The scientific name is the one that will be used in all of
the sequence entries that map to this node in the Taxonomy database.
Entries that are submitted with any of the other names associated
with this node will be replaced with this name. When we change
the scientific name of a node in the Taxonomy database, the
corresponding entries in the sequence databases will be updated
to reflect the change. For example, we list
The “synonym” nametype is applied to both synonyms
in the formal nomenclatural sense (objective, nomenclatural,
homotypic
The “acronym” nametype is used primarily for the
viruses. The International Committee on Taxonomy of Viruses
(
The term “anamorph” is reserved for names applied
to asexual forms of fungi, which present some special nomenclatural
challenges. Many fungi are known to undergo both sexual and asexual
reproduction at different points in their life cycle (so-called
“perfect” fungi); for many others, however, only the
asexually reproducing (anamorphic or mitosporic) form is known (in
some, perhaps many, asexual species, the sexual cycle may have been
lost altogether). These anamorphs, often with simple and not especially
diagnostic morphology, were given Linnaean binomial names. A number of
named anamorphic species have subsequently been found to be associated
with sexual forms (teleomorphs) with a different name (for example,
The “misspelling” nametype is for simple misspellings. Some of these are included because the misspelling is present in the literature, but most of them are there because they were once found in a sequence entry (which has since been corrected). We keep them in the database for tracking purposes, because copies of the original sequence entry can still be retrieved. Misspellings are not listed on the TaxBrowser pages nor on the Taxonomy Entrez Info display views, but they are indexed in the Entrez search fields (so that searches and Entrez queries with the misspelling will find the appropriate node).
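The distinction between names that are indexed and names that are displayed can be sketched with a small name table. This is a hypothetical data structure for illustration, not the TAXON design, and the misspelling in the example is invented.

```python
from collections import defaultdict

HIDDEN_NAMETYPES = {"misspelling"}  # indexed for searching, not displayed


class NameTable:
    """Toy store of (name, nametype) pairs per taxid with a search index."""

    def __init__(self):
        self._by_taxid = defaultdict(list)  # taxid -> [(name, nametype)]
        self._index = defaultdict(set)      # lowercased name -> taxids

    def add(self, taxid, name, nametype):
        self._by_taxid[taxid].append((name, nametype))
        self._index[name.lower()].add(taxid)

    def search(self, name):
        """Case-insensitive lookup over every nametype."""
        return set(self._index.get(name.lower(), ()))

    def displayed_names(self, taxid):
        """Names shown on display pages: everything except hidden nametypes."""
        return [
            (name, nametype)
            for name, nametype in self._by_taxid[taxid]
            if nametype not in HIDDEN_NAMETYPES
        ]


table = NameTable()
table.add(9606, "Homo sapiens", "scientific name")
table.add(9606, "human", "common name")
table.add(9606, "Homo sapeins", "misspelling")  # invented misspelling

print(table.search("homo sapeins"))
print(table.displayed_names(9606))
```

A search for the misspelled name still reaches the node, while the display list omits it, which matches the behavior described for the browser and Entrez views.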
“Misnomer” is a rarely used nametype. It is used for names that might otherwise be listed as “misspellings” but which we want to appear on the browser and Entrez display pages.
The “common name” nametype is used for vernacular names associated with a particular taxon. These may be found at any level in the hierarchy; for example, “human”, “reptiles”, and “pale devil's-claw” are all used. Common names should be in lowercase letters, except where part of the name is derived from a proper noun, for example, “American butterfish” and “Robert's arboreal rice rat”.
The use of common names is inherently variable, regional, and often
inconsistent. There is generally no authoritative reference that
regulates the use of common names, and there is often not perfect
correspondence between common names and formally described scientific
taxa; therefore, there are some caveats to their use. For scientific
discourse, there is no substitute for formal scientific names.
Nevertheless, common names are invaluable for many indexing, retrieval,
and display purposes. The combination “
The “
The “in-part” nametype is included for retrieval
terms that have a broader range of application than the taxon or
taxa at which they appear. For example, we list reptiles and Reptilia
as in-part nametypes at our nodes
The “includes” nametype is the opposite of the in-part
nametype and is included for retrieval terms that have a narrower
scope of application than the taxon at which they appear. For example,
we could list “reptiles” as an “includes”
nametype for the
The “equivalent name” nametype is a catch-all category, used for names that we would like to associate with a particular node in the database (for indexing or tracking purposes) but which do not seem to fit well into any of the other existing nametypes.
The “genbank common name” was introduced to provide a mechanism by which, when there is more than one common name associated with a particular node in the taxonomy, one of them could be designated to be the common name that should be used by default in the GenBank flatfiles and other applications that are trying to find a common name to use for display (or other) purposes. This is not intended to confer any special status or blessing on this particular common name over any of the other common names that might be associated with the same node, and we have developed mechanisms to override this choice for a common name on a case-by-case basis if another name is more appropriate or desirable for a particular sequence entry. Each node may have at most one “genbank common name”.
There may be more than one acronym associated with a particular node in the Taxonomy database (particularly if several virus names have been synonymized in a single species). Just as with the “genbank common name”, the “genbank acronym” provides a mechanism to designate one of them to be the acronym that should be used for display (or other) purposes. Each node may have at most one “genbank acronym”.
The “genbank synonym” nametype is intended for those
special cases in which there is more than one name commonly used in
the literature for a particular species, and it is informative to have
both names displayed prominently in the corresponding sequence record.
Each node may have at most one “genbank synonym”. For example,
Although the use of either the anamorph or teleomorph name is
formally correct under the International Code of Botanical Nomenclature,
we prefer to give precedence to the teleomorphic name as the
“scientific name” in the Taxonomy database, both to
emphasize their commonality and to avoid having two (or more) taxids
that effectively apply to the same organism. However, in many cases,
the anamorphic name is much more commonly used in the literature,
especially when sequences are normally derived from the asexual form
of the species. In these cases, the “genbank anamorph”
nametype can be used to annotate the corresponding sequence records
with both names. Each node may have at most one “genbank
anamorph”. For example:
Whenever possible, formal scientific names are used for taxa.
There are several codes of nomenclature that regulate the description
and use of names in different branches of the tree of life. These are:
the International Code of Zoological Nomenclature (
The viral code is less well developed than the others, but it includes
an official classification for the viruses as well as a list of approved
species names. Viral names are not Latin binomials (as required by the
other codes), although there are some instances (e.g.,
The zoological, botanical, and bacteriological codes mandate Latin
binomials for species names. They do not describe an official
classification (such as the
The fungi are subject to the botanical code. The cyanobacteria (blue-green algae) have been subject to both the botanical and the bacteriological codes, and the issue is still controversial.
“Authorities” appear at the end of the formal species
name and include at least the name or standard abbreviation of the
taxonomist who first described that name in the scientific literature.
Other information may appear in the authority as well, often the year
of description, and can become quite complicated if the taxon has been
transferred or amended by other taxonomists over the years. We do not
use authorities in our taxon names, although many are included in the
database listed as synonyms. We have made an exception to this rule in
the case of our first duplicated species name in the database,
All three of the codes of nomenclature for cellular organisms
provide for names at the subspecies level. The botanical and
bacteriological codes include the string “subsp.” in the
formal name; the zoological code does not, e.g.,
The botanical code (but none of the others) provides for two
additional formal ranks beneath the subspecies level, varietas and
forma. These names will include the strings “var.” and
“f.”, respectively, e.g.,
We list taxa with other subspecific names where it seems useful
and appropriate and where it is necessary to find places for names
in the sequence databases. For indexing purposes in the Genomes
division of Entrez, it is convenient to have strain-level nodes for
bacterial species with a complete genome sequence, particularly when
there are two or more complete genome sequences available for
different strains of the same species, e.g.,
Several other classes of subspecific groups do not have formal
standing in the nomenclature but represent well-characterized and
biologically meaningful groups, e.g., serovar, pathovar, forma
specialis, and others. In many cases, these may eventually be promoted
to a species; therefore, it is convenient to represent them
independently from the outset, e.g.,
Many other names below the species level have been added to the Taxonomy database to accommodate SWISS-PROT entries, where strain (and other) information is annotated with the organism name for some species.
In general, we try to avoid unqualified species names such as
When entries are not identified at the species level, multiple
sequences can be from the same unidentified species. Sequences
from multiple different unidentified species in the same genus
are also possible. To keep track of this, we add unique informal
names to the Taxonomy database, e.g., a meaningful identifier
from the submitters could be used. This could be a strain name,
a culture collection accession, a voucher specimen, an isolate
name or location—anything that could tie the entry to the
literature (or even to the lab notebook). If nothing else is
available, we may construct a unique name using a default formula
such as the submitter's initials and year of submission. This way,
if a formal name is ever determined or described for any of these
organisms, we can synonymize the informal name with the formal
one in the Taxonomy database, and the corresponding entries in
the sequence databases will be updated automatically. For example,
AJ302786 was originally submitted (in November 2000) as
Here are some examples of informal names in the Taxonomy database:
We use single quotes when it seems appropriate to group a phrase into a single lexical unit. Some of these names include abbreviations with special meanings.
“n. sp.” indicates that this is a new, undescribed species and not simply an unidentified species.
“sp. nr.” indicates “species near”. In the example above, this indicates that this is similar to Camponotus gasseri.
“aff.”, affinis, indicates that this is related to but not identical to the species given.
“cf.”, confer (literally, “compare with”), conveys resemblance to a given species but not necessarily a relationship to it.
“s.l.”, sensu lato, means literally “in the broad sense”.
“ex” means “from” or “out of” the biological host of the specimen.
Note that names with cf., aff., nr., and n. sp. are not unique and should have unique identifiers appended to the name.
Cultured bacterial strains and other specimens that have not been identified to the genus level are given informal names as well, e.g., Desulfurococcaceae str. SRI-465; crenarchaeote OlA-6.
Names such as Camponotus sp. 1 are avoided, because different
submitters might easily use the same name to refer to different species.
See
Sequences from environmental samples are given “uncultured” names. In these studies, nucleotide sequences are cloned directly from the environment and come from varied sources, such as Antarctic sea ice, activated sewer sludge, and dental plaque. Apart from the sequence itself, there is no way to identify the source organisms or to recover them for further studies. These studies are particularly important in bacterial systematics work, which shows that the vast majority of environmental bacteria are not closely related to laboratory cultured strains (as measured by 16S rRNA sequences). Many of the deepest-branching groups in our bacterial classification are defined only by anonymous sequences from these environmental sample studies, e.g., candidate division OP5, candidate division Termite group 1, candidate subdivision kps59rc, phosphorous removal reactor sludge group, and marine archaeal group 1.
These samples vary widely in length and in quality, from short
single-read sequences of a few hundred base pairs to high-quality,
full-length 16S sequences. We now give all of these samples anonymous
names, which may indicate the phylogenetic affiliation of the sequence,
as far as it may be determined, e.g., uncultured archaeon, uncultured
crenarchaeote, uncultured gamma proteobacterium, or uncultured enterobacterium.
See
Some groups of bacteria have never been cultured but can be characterized and reliably recovered from the environment by other means. These include endosymbiotic bacteria and organisms similar to the phytoplasmas, which can be identified by the plant diseases that they cause. We do not give these “uncultured” names, as above. These represent a special challenge for bacterial nomenclature, because a formal species description requires the designation of a cultured type strain. The bacteriological code has a special provision for names of this sort, Candidatus, e.g., Candidatus Endobugula or Candidatus Endobugula sertula; Candidatus Phlomobacter or Candidatus Phlomobacter fragariae. These often appear in the literature without the Candidatus prefix; therefore, we list the unqualified names as synonyms for retrieval purposes.
We allow informal names for unranked nodes above the species level as well. These should all be phylogenetically meaningful groups, e.g., the Fungi/Metazoa group, eudicotyledons, Erythrobasidium clade, RTA clade, and core jakobids. In addition, there are several other classes of nodes and names above the species level that explicitly do not represent phylogenetically meaningful groups. These are outlined below.
We are expected to add new species names to the database in a timely manner, preferably within a day or two. If we are able to find only a partial classification for a new taxon in the database, we place it as deeply as we can and list it in an explicit “unclassified” bin. As more information becomes available, these bins are emptied, and we give full classifications to the taxa listed there. In general, we suppress the names of the unclassified bins themselves so they do not appear in the abbreviated lineages that appear in the GenBank flatfiles, e.g., unclassified Salticidae, unclassified Bacteria, and unclassified Myxozoa.
If the best taxonomic opinion available is that the position of
a particular taxon is uncertain, then we will list it in an
“incertae sedis” bin. This is a more permanent assignment
than for taxa that are listed in unclassified bins, e.g.,
Fungi that were known only in the asexual (mitosporic, anamorphic)
state were placed formerly in a separate, highly polyphyletic category
of “imperfect” fungi, the Deuteromycota. Spurred especially
by the development of molecular phylogenetics, current mycological
practice is to classify anamorphic species as close to their sexual
relatives as available information will support. Mitosporic categories
can occur at any rank, e.g.,
The requirement that the Taxonomy database includes names from all of the entries in the sequence database introduces a number of names that are not typically treated in a taxonomic database. These are listed in the top-level group “Other”. Plasmids are typically annotated with their host organism, using the /plasmid source organism qualifier. Broad-host-range plasmids that are not associated with any single species are listed in their own bin. Plasmid and transposon names from very old sequence entries are listed in separate bins here as well. Plasmids that have been artificially engineered are listed in the “vectors” bin.
We do not require that Linnaean ranks be assigned to all of our taxa, but we do include a standard rank table that allows us to assign formal ranks where it seems appropriate. We do not require that sibling taxa all have the same rank, but we do not allow taxa of higher rank to be listed beneath taxa of lower rank. We allow unranked nodes to be placed at any point in our classification.
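The rank rules above can be sketched as a small consistency check. This is only an illustration, not the Taxonomy database's own code: the abbreviated rank table, the function name, and the dictionary-based tree are all assumptions made for the example.

```python
# Hypothetical, abbreviated rank table ordered from highest to lowest rank.
# The real standard rank table is longer; this list is an assumption.
RANK_ORDER = ["superkingdom", "kingdom", "phylum", "class", "order",
              "family", "genus", "species", "subspecies"]
RANK_LEVEL = {rank: level for level, rank in enumerate(RANK_ORDER)}


def find_rank_violations(parent_of, rank_of):
    """Return (taxon, ancestor) pairs where a taxon of higher rank is
    listed beneath a taxon of lower rank.

    parent_of maps each taxon to its parent (the root has no entry).
    rank_of maps each taxon to a rank name, or None for unranked nodes.
    Unranked nodes are skipped, so they may sit anywhere in the tree.
    """
    violations = []
    for taxon, rank in rank_of.items():
        if rank is None:
            continue
        # Walk up to the nearest ancestor that carries a formal rank.
        ancestor = parent_of.get(taxon)
        while ancestor is not None and rank_of.get(ancestor) is None:
            ancestor = parent_of.get(ancestor)
        if ancestor is None:
            continue
        if RANK_LEVEL[rank] < RANK_LEVEL[rank_of[ancestor]]:
            violations.append((taxon, ancestor))
    return violations
```

Checking each taxon only against its nearest ranked ancestor is enough, because a chain of such checks covers every ancestor further up. Sibling taxa are never compared, which matches the statement that siblings need not share a rank.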
The one rank that we particularly care about is “species”. We try to ensure that all of the sequence entries map into the Taxonomy database at or below a species-level node.
The genetic codes and mitochondrial genetic codes that are
appropriate for translating protein sequences in different branches
of the tree of life are assigned at nodes in the Taxonomy database
and inherited by species at the terminal branches of the tree. Plastid
sequences are all translated with the standard genetic code, but many
of the mRNAs undergo extensive RNA editing, making it difficult or
impossible to translate sequences from the plastid genome directly.
The genetic codes are listed on our
GenBank taxonomic division assignments are made in the Taxonomy database and inherited by species at the terminal branches of the tree, just as with the genetic codes.
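Inheritance of a property such as a genetic code or a GenBank division can be pictured as a walk from a species toward the root until a node with an assignment is found. The sketch below assumes a simple parent map; the function name, the toy tree, and the placement of the assignments are hypothetical and chosen only for illustration.

```python
def inherited_value(taxon, parent_of, assigned):
    """Return the value assigned at the nearest node on the path from
    `taxon` up to the root, or None if no node on that path has one.

    parent_of maps each taxon to its parent (the root has no entry).
    assigned maps the nodes that carry an explicit assignment to a value.
    """
    node = taxon
    while node is not None:
        if node in assigned:
            return assigned[node]
        node = parent_of.get(node)
    return None


# Toy tree and toy assignments (assumptions, not the actual database content).
PARENT_OF = {
    "Homo sapiens": "Vertebrata",
    "Vertebrata": "Metazoa",
    "Drosophila melanogaster": "Metazoa",
    "Metazoa": "root",
}
# Translation table numbers: 1 = standard, 2 = vertebrate mitochondrial,
# 5 = invertebrate mitochondrial.
MITOCHONDRIAL_CODE = {"root": 1, "Metazoa": 5, "Vertebrata": 2}
```

With these toy data, `inherited_value("Homo sapiens", PARENT_OF, MITOCHONDRIAL_CODE)` picks up the assignment made at Vertebrata, while the fly inherits the one made at Metazoa. The same lookup works unchanged for division assignments.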
The Taxonomy database allows us to store comments and references
at any taxon. These may include hotlinks to abstracts in
Some branches of our taxonomy are many levels deep, e.g., the bony fish (as we moved to a phylogenetic classification) and the drosophilids (a model taxon for evolutionary studies). In many cases, the classification lines in the GenBank flatfiles became longer than the sequences themselves. This became a storage and update issue, and the classification lines themselves became less helpful as generally familiar taxa names became buried within less recognizable taxa.
To address this problem, the Taxonomy database allows us to flag taxa that should (or should not) appear in the abbreviated classification line in the GenBank flatfiles. The full lineages are indexed in Entrez and displayed in the Taxonomy Browser.
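As a minimal sketch of the flagging idea, the abbreviated line can be derived from the full lineage by filtering on a per-taxon display flag. The function names, the flag dictionary, and the default of showing unflagged taxa are assumptions for this example, not a description of the database schema.

```python
def abbreviated_lineage(full_lineage, shown_in_flatfile):
    """Keep only taxa flagged for display, preserving root-to-leaf order.

    shown_in_flatfile maps a taxon name to True or False; taxa without an
    entry are shown (an assumption made for this sketch).
    """
    return [name for name in full_lineage if shown_in_flatfile.get(name, True)]


def format_lineage(names):
    """Join names in the semicolon-separated style of a classification line."""
    return "; ".join(names) + "."
```

For example, hiding the unranked "Fungi/Metazoa group" node from the lineage Eukaryota, Fungi/Metazoa group, Metazoa leaves "Eukaryota; Metazoa." in the abbreviated line, while the full lineage remains available for indexing.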
Sequence variations exist at defined positions within genomes and are responsible for individual phenotypic characteristics, including a person's propensity toward complex disorders such as heart disease and cancer. As tools for understanding human variation and molecular genetics, sequence variations can be used for gene mapping, definition of population structure, and performance of functional studies.
The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of
simple genetic
The dbSNP accepts submissions for variations in any species and from any part of a genome. This document will provide you with options for finding SNPs in dbSNP, discuss dbSNP content and organization, and furnish instructions to help you create your own (local) copy of dbSNP.
The structure of the flanking sequence in dbSNP is a composite of bases
either assayed for variation or included from published sequence. We make the distinction to distinguish regions of sequence
that have been experimentally surveyed for variation (
In the physical mapping of nucleotide sequences, variations are used as positional markers. When mapped to
a unique location in a genome, variation markers work with the same logic as Sequence Tagged Sites
(
Variations that occur in functional regions of genes or in conserved non-coding regions can cause
significant changes in the complement of transcribed sequences, leading to changes in protein expression that can affect
aspects of the
The associations between variations and complex genetic traits are more ambiguous than simple, single-gene mutations that lead to a phenotypic change. When multiple genes are involved in a trait, then the identification of the genetic causes of the trait requires the identification of the chromosomal segment combinations, or haplotypes, that carry the putative gene variants.
The variations in dbSNP currently represent an uneven but large sampling of genome diversity. The human
data in dbSNP include submissions from the SNP Consortium, variations mined from genome sequence as part of the human genome
project, and individual lab contributions of variations in specific genes, mRNAs,
Systematic surveys of sequence variation will undoubtedly reveal sequences that are invariant in the sample. These observations can be submitted to dbSNP as NoVariation records that record the sequence, the population, and the sample size that were used in the survey.
The SNP database can be queried from the dbSNP homepage. We organized the dbSNP homepage with links to documentation, FTP, and sub-query
pages on the
The dbSNP is now a part of the Entrez integrated information retrieval system (
Use this query module to select SNPs based on dbSNP record identifiers. These include reference SNP (refSNP) cluster ID numbers (rs#), submitted SNP Accession numbers (ss#), and local (or submitter) IDs for the same variations.
Method class | Class code in Sybase, ASN.1, and XML |
---|---|
Denaturing high pressure liquid chromatography (DHPLC) | 1 |
DNA hybridization | 2 |
Computational analysis | 3 |
Single-stranded conformational polymorphism (SSCP) | 5 |
Other | 6 |
Unknown | 7 |
Restriction fragment length polymorphism (RFLP) | 8 |
Direct DNA sequencing | 9 |
Population class | Description | Population class in Sybase, ASN.1, and XML |
---|---|---|
Central Asia | Samples from Russia and its satellite Republics and from nations bordering the Indian Ocean between East Asia and the Persian Gulf regions. | 8 |
Central/South Africa | Samples from nations south of the Equator, Madagascar, and neighboring island nations. | 4 |
Central/South America | Samples from Mainland Central and South America and island nations of the western Atlantic, Gulf of Mexico, and Eastern Pacific. | 10 |
East Asia | Samples from eastern and south eastern Mainland Asia and from Northern Pacific island nations. | 6 |
Europe | Samples from Europe north and west of Caucasus Mountains, Scandinavia, and Atlantic islands. | 5 |
Multi-National | Samples that were designated to maximize measures of heterogeneity or sample human diversity in a global fashion. Examples include OEFNER|GLOBAL and the CEPH repository. | 1 |
North America | All samples north of the Tropic of Cancer, including defined samples of United States Caucasians, African Americans, Hispanic Americans, and the NHGRI polymorphism discovery resource (NCBI|NIHPDR). | 9 |
North/East Africa and Middle East | Samples collected from North Africa (including the Sahara desert), East Africa (south to the Equator), Levant, and the Persian Gulf. | 2 |
Pacific | Samples from Australia, New Zealand, Central and Southern Pacific Islands, and Southeast Asian peninsular/island nations. | 7 |
Unknown | Samples with unknown geographic provinces that are not global in nature. | 11 |
West Africa | Sub-Saharan nations bordering the Atlantic north of the Congo River and central/southern Atlantic island nations. | 3 |
Use this module to construct a query that will select SNPs based on submission records by
laboratory (submitter), new data (called “new batches”, this query limitation is more recent than a
user-specified date), methods used to assay for variation (
Use sets of variation IDs collected from other queries to generate a variety of SNP reports.
The links in this section all point to the
Free Form is the most flexible query structure in dbSNP. Modeled on the NCBI Entrez retrieval system, queries can be conducted using multiple database field values to restrict a query to specific subsets of data. The Easy Form query is identical to the Free Form query, with the exception that the Easy Form query has a series of pull-down menus from which a value can be selected for the most popular query fields.
Use this query approach if you are interested in retrieving variations that have been mapped to a
specific region of the genome bounded by two STS markers. Other map-based queries are available through the NCBI
All links located on the left sidebar of the dbSNP homepage are also provided in text format at
the bottom of the page to support browsing by text-based Web browsers. Suggestions for improving database access by disabled
persons should be sent to:
The SNP database has two major classes of content: the first class is submitted data, i.e., original
observations of sequence variation ( The major sections of the report are described in the
A complete copy of the SNP database is publicly available and can be downloaded from the SNP
The essential component of a submission to dbSNP is the nucleotide sequence itself. dbSNP accepts
submissions as either genomic DNA or cDNA (i.e., sequenced mRNA transcript) sequence. Sequence submissions have a minimum
length requirement to maximize the specificity of the sequence in larger contexts, such as a reference genome sequence. We
also structure submissions so that the user can distinguish regions of sequence actually surveyed for variation from regions
of sequence that are cut and pasted from a published reference sequence to satisfy the minimum-length requirements.
dbSNP variation class | Rules for assigning allele classes | Sample allele definition | Class code in Sybase, ASN.1, and XML |
---|---|---|---|
Single Nucleotide Polymorphisms (SNPs) | Strictly defined as single base substitutions involving A, T, C, or G. | A/T | 1 |
Deletion/Insertion Polymorphisms (DIPs) | Designated using the full sequence of the insertion as one allele, and either a fully defined string for the variant allele or a “-” character to specify the deleted allele. This class will be assigned to a variation if the variation alleles are of different lengths or if one of the alleles is deleted (“-”). | T/-/CCTA/G | 2 |
Heterozygous sequence | The term heterozygous is used to specify a region detected by certain methods that do not resolve the polymorphism into a specific sequence motif. In these cases, a unique flanking sequence must be provided to define a sequence context for the variation. | (heterozygous) | 3 |
Microsatellite or short tandem repeat (STR) | Alleles are designated by providing the repeat motif and the copy number for each allele. Expansion of the allele repeat motif designated in dbSNP into full-length sequence will be only an approximation of the true genomic sequence because many microsatellite markers are not fully sequenced and are resolved as size variants only. | (CAC)8/9/10/11 | 4 |
Named variant | Applies to insertion/deletion polymorphisms of longer sequence features, such as retroposon dimorphism for | (alu) / - | 5 |
No-variation | Reports may be submitted for segments of sequence that are assayed and determined to be invariant in the sample. | (NoVariation) | 6 |
Mixed | Mix of other classes | | 7 |
Multi-Nucleotide Polymorphism (MNP) | Assigned to variations that are multi-base variations of a single, common length. | GGA/AGT | 8 |
Seven of the classes apply to both submissions of variations (submitted SNP assay, or ss#) and the non-redundant refSNP clusters (rs#'s) created in dbSNP.
The “Mixed” class is assigned to refSNP clusters that group submissions from different variation classes.
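The allele rules in the table can be approximated by a small classifier over the allele string of a single submission. This is a hedged sketch, not dbSNP's implementation: the function name, the regular expressions, and the order of the checks are assumptions, and the “Mixed” class is left out because it describes a cluster of submissions rather than one allele string.

```python
import re

# Class codes as listed in the table above.
CLASS_CODES = {"snp": 1, "dip": 2, "heterozygous": 3, "str": 4,
               "named": 5, "no-variation": 6, "mixed": 7, "mnp": 8}

_BASES = re.compile(r"^[ACGT]+$")
_STR = re.compile(r"^\([ACGT]+\)\d+(/\d+)*$")   # e.g. (CAC)8/9/10/11
_NAMED = re.compile(r"^\([A-Za-z0-9_]+\)$")     # e.g. (alu)


def variation_class(allele_string):
    """Guess the variation class of one slash-delimited allele string."""
    text = allele_string.replace(" ", "")
    if text.lower() == "(heterozygous)":
        return "heterozygous"
    if text.lower() == "(novariation)":
        return "no-variation"
    if _STR.match(text):
        return "str"
    alleles = text.split("/")
    # Named variants are checked before the deletion rule, because their
    # allele strings also contain the "-" character.
    if any(_NAMED.match(allele) for allele in alleles):
        return "named"
    if not all(allele == "-" or _BASES.match(allele) for allele in alleles):
        raise ValueError("unrecognized allele string: %r" % allele_string)
    lengths = {len(allele) for allele in alleles if allele != "-"}
    if "-" in alleles or len(lengths) > 1:
        return "dip"
    if lengths == {1}:
        return "snp"
    return "mnp"
```

Applied to the sample allele definitions in the table, the sketch returns the class named in the same row, and `CLASS_CODES` then gives the numeric code.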
Class codes have a numeric representation in the
database itself and in the export versions of the data (
Alleles define variation class (
Each submitter defines the methods in their submission as either the techniques used to assay
variation or the techniques used to estimate allele frequencies. We group methods by method class (
Each submitter defines population samples either as the group used to initially identify
variations or as the group used to identify population-specific measures of allele frequencies. These populations may be one
and the same in some experimental designs. We assign populations a population class (
There are two sample-size fields in dbSNP. One field is called the
Validation evidence | Description | Code in database for ss# | Code in FTP dumps for ss# | Code in database for rs# | Code in FTP dumps for rs# |
---|---|---|---|---|---|
Not validated | For ss#, no batch update or validation data, no frequency data (or frequency is 0 or 1). rs# status code is OR'd from the ss# codes. | 0 | Not present | 0 | Not present |
Multiple reporting | Status = 1 for an rs# with at least two ss# numbers; having at least one ss# is validated by a non-computational method. For a ss#, status = 1 if the method is non-computational. | 1 | 1 | 1,0 | 1 |
With frequency | Frequency data is present with a value between 0 and 1. | 2 | 2 | 2 | 2 |
Both frequency | For ss#, the method is non-computational and frequency data is present. If the ss# is a single cluster member, then the rs# code is set to 2. | 3 | 3 | 3/2 | 3 |
Submitter validation | Submission of a batch update or validation section that reports a second validation method on the assay. | 4 | 4 | 4 | 4 |
If the rs# has a single ss# with code 1, then rs# is set to code 0.
For a single member rs where the ss# validation status = 1, the rs# validation status is set to 0.
Alleles typically exist at different frequencies in different populations; a very common allele in
one population may be quite rare in another population. Also, allelic variants can emerge as
Similar to alleles,
Some methods for detection of variation (e.g., denaturing high-pressure liquid chromatography or
DHPLC) can recognize when DNA fragments contain a variation without resolving the precise nature of the sequence change.
These data define an empirical measure of
dbSNP accepts individual genotypes for samples from publicly available repositories such as
dbSNP accepts individual assay records (ss# numbers) without validation evidence. When possible,
however, we try to distinguish high-quality validated data from unconfirmed (usually computational) variation reports. Assays
validated directly by the submitter through the
We release the content of dbSNP to the public in periodic “builds” that we synchronize with
the release of new genome assemblies. The dbSNP build cycle starts with the close of data for new submissions. We map all
data, including existing refSNP clusters and new submissions, to reference genome sequence if available for the organism.
Otherwise, we map them to non-redundant DNA sequences from GenBank. We then use map data on co-occurrence of hit locations to
either merge submissions into existing clusters or to create new clusters. We then annotate the new non-redundant refSNP (rs)
set on reference sequences and dump the contents of dbSNP in a variety of comprehensive and denormalized formats on the dbSNP
FTP site for release with the online build of the database.
Each build starts with a “close of data” that defines the set of new submissions that
will be mapped to genome sequence by
We annotate the non-redundant set of variations (refSNP cluster set) on reference genome sequence
Public release of a new build involves an update to the public database and the production of a
new set of files on the dbSNP FTP site. We make an announcement to the
Data submitted to dbSNP are clustered and provide a non-redundant set of variations for each
organism in the database. We maintain these clusters as refSNPs in dbSNP in parallel to the underlying submitted data. We
distinguish refSNPs from assay submissions by using an
refSNPs are compact sets of identifiers that are used to annotate variations on other NCBI
resources. A refSNP has a number of summary properties that are computed over all cluster members. For example, rs7412 has an average heterozygosity of 12.7% based on the frequency data
provided by the three submissions, and the cluster as a whole is validated because one of the underlying submissions has been
experimentally validated. rs7412 is annotated as a variation feature on RefSeq contigs, mRNAs, and proteins. Pointers in the
refSNP summary record direct the user to additional information on the three submitter Web sites, through the linkout
We compute summary measures for each refSNP to integrate data provided by each independent submitter.
Submitters can arbitrarily define variations on either strand of DNA sequence; therefore,
submissions in a refSNP cluster can be reported on the forward or reverse strand. The orientation of the refSNP and, hence,
its sequence and allele string, is set by a cluster exemplar. By convention, the clustering process We define clusters on shared locations (refSNPs) when we BLAST all
existing refSNPs against contig sequence. In cases where contig sequence is not available or the variation is defined in a
mRNA flanking sequence that will not map to a contig, we compute the refSNP set based on hits to the RefSeq or GenBank
sequences for the organism. We rank map hits as either LO or HI quality and parse the hits to assign a
Once the clustering process determines the orientation of all member sequences in a cluster, it will gather a comprehensive set of alleles for a refSNP cluster.
When the alleles of a submission appear to be different from the alleles of its parent refSNP, check the orientation of the submission for reverse orientation.
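The orientation check described above amounts to comparing a submission's alleles with the refSNP alleles on both strands. The sketch below is illustrative only; the function names are hypothetical, and it handles plain nucleotide alleles and the “-” deletion character, not repeat or named alleles.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence):
    """Return the reverse complement of a nucleotide sequence."""
    return sequence.translate(_COMPLEMENT)[::-1]


def flip_alleles(allele_string):
    """Report a slash-delimited allele string on the opposite strand."""
    return "/".join(allele if allele == "-" else reverse_complement(allele)
                    for allele in allele_string.split("/"))


def submission_orientation(submission_alleles, refsnp_alleles):
    """Return "forward" or "reverse" depending on which strand makes the
    submission's alleles a subset of the refSNP's alleles, or None if
    neither strand does. Palindromic pairs such as A/T or C/G match on
    both strands and are reported as "forward" here.
    """
    reference = set(refsnp_alleles.split("/"))
    if set(submission_alleles.split("/")) <= reference:
        return "forward"
    if set(flip_alleles(submission_alleles).split("/")) <= reference:
        return "reverse"
    return None
```

A submission reported as C/T against a refSNP with alleles A/G is therefore recognized as the same variation in reverse orientation.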
The best single measure of a variation's diversity in different populations is its average
Additional summary measures of variation include counts of populations and individuals sampled for this variation.
When reference genome assemblies are available, we use them as anchor sequence to place refSNP
clusters into a genomic context. We clean dbSNP flanking sequence with
We define a refSNP operationally as a variation at a location on an ideal reference chromosome. Such reference chromosomes are the goal of genome assemblies. However, work is still in progress in cases such as the human genome project; therefore, we must currently define a refSNP as a variation in the interim reference contig sequence. Every time there is a genomic assembly update, the interim reference contig sequence changes, and refSNPs must be updated or reclustered.
The reclustering process begins when NCBI updates the genomic assembly. We BLAST all existing refSNPs as well as any newly submitted SNPs (not yet bound to a refSNP cluster) against the genome assembly. Then, we cluster SNPs that co-locate at the same place on the genome into a single refSNP. Usually, new clusters are composed entirely of new submitted SNPs, or else the newly submitted SNPs cluster to an already existing refSNP. When newly submitted SNPs cluster among themselves, they are assigned to a new refSNP ID#, and when they cluster with an already existing refSNP, they are assigned to the cluster for that refSNP.
Sometimes a refSNP will co-locate with another existing refSNP. In this case, the refSNP with a higher ID number is retired, and all the submitted SNPs in its cluster are reassigned to the refSNP with the lower ID number.
Once the clusters are formed, the variation of a refSNP is the union of all possible alleles
defined in the set of submitted SNPs that composed the cluster.
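The reclustering rules can be sketched as follows, assuming each existing refSNP and each new submitted SNP has already been mapped to a single location. The function names, the location tuples, and the way new ID numbers are handed out are assumptions for this illustration, not the production pipeline.

```python
from collections import defaultdict


def recluster(refsnp_locations, new_ss_locations, next_rs_id):
    """Assign new submitted SNPs to refSNPs and merge co-located refSNPs.

    refsnp_locations maps existing rs IDs to a location.
    new_ss_locations maps unclustered ss IDs to a location.
    next_rs_id is the first unused rs ID number (hypothetical allocator).
    Returns (ss_to_rs, merged), where merged maps each retired rs ID to
    the surviving refSNP with the lower ID number.
    """
    rs_at = defaultdict(list)
    for rs_id, location in refsnp_locations.items():
        rs_at[location].append(rs_id)

    merged = {}
    survivor_at = {}
    for location, rs_ids in rs_at.items():
        keep = min(rs_ids)  # the refSNP with the lower ID number is kept
        survivor_at[location] = keep
        for rs_id in rs_ids:
            if rs_id != keep:
                merged[rs_id] = keep

    ss_to_rs = {}
    for ss_id, location in sorted(new_ss_locations.items()):
        if location not in survivor_at:
            # New submissions that cluster only among themselves get a new ID.
            survivor_at[location] = next_rs_id
            next_rs_id += 1
        ss_to_rs[ss_id] = survivor_at[location]
    return ss_to_rs, merged


def cluster_alleles(member_allele_strings):
    """Union of all alleles in a cluster; members are assumed to have been
    put on the same strand already."""
    alleles = set()
    for allele_string in member_allele_strings:
        alleles.update(allele_string.split("/"))
    return "/".join(sorted(alleles))
```

In this sketch, submitted SNPs that belonged to a retired refSNP would be reassigned by following the `merged` mapping, and `cluster_alleles(["A/G", "A/T"])` yields the combined allele set A/G/T.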
We annotate
The two hits that define a weight 2 variation may not reflect
We do not believe that weight 3 and weight 10 variations have sufficient utility to warrant their annotation, but the mapping results for these variations are still available in dbSNP.
We annotate NoVariation records on NCBI RefSeq chromosomes, contig sequences, mRNAs, and proteins as a miscellaneous feature, or misc_feat. All dbSNP annotations also include a db_Xref cross-reference pointer back to dbSNP that uses the refSNP ID number.
GenBank records can be annotated only by their original authors. Therefore, when we find
high-quality hits of refSNP records to the
We annotate RefSeq mRNAs with variation features when the refSNP has a high-quality hit to the mRNA sequence. If the variation is in the coding region of the transcript and has a non-synonymous allele that changes the protein sequence, we also annotate the variation on the protein translation of the mRNA. The alleles in protein annotations are the amino acid translations of the affected codons.
The Map Viewer (
The summary values can be viewed or downloaded directly as a tab-delimited table if you
select the
Functional class | Description | Database code |
---|---|---|
Locus region | Variation is within 2 Kb 5′ or 500 bp 3′ of a gene feature (on either strand), but the variation is not in the transcript for the gene. This class is indicated with an L in graphical summaries. | 1 |
Coding | Variation is in the coding region of the gene. This class is assigned if the allele-specific class is unknown. This class is indicated with a C in graphical summaries. | 2 |
Coding-synon | The variation allele is synonymous with the contig codon in a gene. An allele receives this class when substitution and translation of the allele into the codon makes no change to the amino acid specified by the reference sequence. A variation is a synonymous substitution if all alleles are classified as contig reference or coding-synon. This class is indicated with a C in graphical summaries. | 3 |
Coding-nonsynon | The variation allele is nonsynonymous for the contig codon in a gene. An allele receives this class when substitution and translation of the allele into the codon changes the amino acid specified by the reference sequence. A variation is a nonsynonymous substitution if any alleles are classified as coding-nonsynon. This class is indicated with a C or N in graphical summaries. | 4 |
mRNA-UTR | The variation is in the transcript of a gene but not in the coding region of the transcript. This class is indicated by a T in graphical summaries. | 5 |
Intron | The variation is in the intron of a gene but not in the first two or last two bases of the intron. This class is indicated by an L in graphical summaries. | 6 |
Splice-site | The variation is in the first two or last two bases of the intron. This class is indicated by a T in graphical summaries. | 7 |
Contig-reference | The variation allele is identical to the contig nucleotide. Typically, one allele of a variation is the same as the reference genome. The letter used to indicate the variation is a C or N, depending on the state of the alternative allele for the variation. | 8 |
Coding-exception | The variation is in the coding region of a gene, but the precise location cannot be resolved because of an error in the alignment of the exon. The class is indicated by a C in graphical summaries. | 9 |
Most gene features are defined by the location of the variation with respect to
transcript
We compute a functional context for sequence variations by inspecting the flanking sequence for gene features during the contig annotation process. We are also currently developing a method to do the same analysis on RefSeq/GenBank mRNAs.
Typically, one allele of a variation will be the same as the contig (contig reference), and the other allele will be either a synonymous change or a nonsynonymous change. In some cases, one allele will be a synonymous change, and the other allele will be a nonsynonymous change. If any allele is a nonsynonymous change, then the variation is classified as a nonsynonymous variation. Otherwise, the variation is classified as a synonymous variation.
The allele is the same as the contig (contig reference) and hence causes no change to the translated sequence.
The allele, when substituted for the reference sequence, yields a new codon that encodes the same amino acid. This is termed a synonymous substitution.
The allele, when substituted for the reference sequence, yields a new codon that encodes a different amino acid. This is termed a nonsynonymous substitution.
A problem with the annotated coding region feature prohibits conceptual translation. In this case, we note the variation class as coding, based solely on position.
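For single-base substitutions, the allele classes above can be reproduced by substituting each allele into the reference codon and comparing translations. The sketch assumes the standard genetic code and a codon already extracted in coding orientation; the function names are hypothetical, and the coding-exception case is not modeled.

```python
# Standard genetic code, built from the conventional TCAG ordering.
_BASES = "TCAG"
_AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: _AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(_BASES)
    for j, b in enumerate(_BASES)
    for k, c in enumerate(_BASES)
}


def classify_allele(ref_codon, offset, allele):
    """Classify one single-base allele at position `offset` (0, 1, or 2)
    of the reference codon."""
    if allele == ref_codon[offset]:
        return "contig-reference"
    variant = ref_codon[:offset] + allele + ref_codon[offset + 1:]
    if CODON_TABLE[variant] == CODON_TABLE[ref_codon]:
        return "coding-synon"
    return "coding-nonsynon"


def classify_variation(ref_codon, offset, alleles):
    """A variation is nonsynonymous if any allele is nonsynonymous;
    otherwise it is synonymous."""
    classes = [classify_allele(ref_codon, offset, allele) for allele in alleles]
    if "coding-nonsynon" in classes:
        return "coding-nonsynon"
    return "coding-synon"
```

With the reference codon GAA (glutamic acid), an A/G variation at the third position is synonymous because GAG also encodes glutamic acid, whereas a G/A variation at the first position is nonsynonymous because AAA encodes lysine.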
Because functional classification is defined by positional and sequence parameters, two
facts emerge: (
When a SNP results in amino acid sequence change, knowing where that amino acid lies in
the protein structure is valuable. We provide this information using the following procedure. To find the location of a SNP
within a particular protein, we attempt to identify similar proteins whose structure is known by comparing the protein
sequence against proteins from the
The SNP database supports and encourages connections between assay records (ss#'s) and
supplementary data on the submitter's Web site. This connection is made using the
We make the following connections between refSNP clusters and other NCBI resources during the contig annotation process:
There are two methods by which we localize variations to known genes: (
When an original submitted SNP record shows a relationship between a SNP and a STS, we share the data with dbSTS and establish a link between the SNP and the STS record. We also examine refSNPs for proximity to STS features during contig annotation. When we determine that a variation needs to be placed within an STS feature, we note the relationship in the dbSNP table SnpInSts.
The contig annotation pipeline relates refSNPs to UniGene EST clusters based on shared
chromosomal location. We store Variation/
We connect individual submissions to PubMed record(s) of publications cited at the time of
submission. If you want to view links from PubMed to dbSNP, select
dbSNP stores the underlying variation data that define HLA alleles at the nucleotide level. The combinations of alleles that define specific HLA alleles are stored in dbMHC. dbSNP points to dbMHC at the haplotype level, and dbMHC points to dbSNP at both the haplotype and variation level.
dbSNP is a relational database with about 100 tables. NCBI deploys dbSNP in both MSSQL and Sybase
environments, and the public can download the full contents of the database from the dbSNP
A schema is a necessary part of constructing your own copy of dbSNP because it is a visual
representation of dbSNP and shows the logical relationship between data in dbSNP. It is available as a printable PDF
Data in dbSNP are organized into “zones” or boxes, depending on the nature of the
data. Each zone is color coded to allow the viewer to find the data more easily. The current color groupings are
The
The /schema subdirectory contains the files dbSNP_table.atx.gz and dbSNP_index.atx.gz. These files use standard SQL DDL language to create tables and indexes.
There are many utilities available to generate table/index creation statements
from a database. We use a tool called
For example, on UNIX operating systems, use gunzip to decompress the files (gunzip dbSNP_table.atx.gz, then gunzip dbSNP_index.atx.gz) to get the files dbSNP_table.atx and dbSNP_index.atx.
If you are using Sybase and the SQL server, use the command
Use “ftp -i” to turn off interactive prompting during multiple file transfers
to avoid having to hit “yes” to confirm transfer hundreds of times.
Because of the sheer volume of data, our tests to load data and then create indexes on a
Sun Sparc Sybase server using “fast bcp” took about 8 hours.
The file docsum.asn is the ASN structure definition file for both formats of the data. The 00readme file, located in the /specs subdirectory of the
The XML DTD comprises three files located in the /specs subdirectory of the
The FASTA report format provides the flanking sequence for each report of variation in dbSNP, as well as all submitted sequences that have no variation. ss FASTA contains all submitted SNP sequences in FASTA format, whereas rs FASTA contains all the reference SNP sequences in FASTA format. The FASTA data format is typically used for sequence comparisons using BLAST. BLAST SNP is useful for conducting a few sequence comparisons in the FASTA format, whereas multiple FASTA sequence comparisons will require the construction of a local BLAST database of FASTA-formatted data and the installation of a local stand-alone version of BLAST.
The rs docsum flatfile report is generated from the ASN.1 datafiles and is provided in the
files “/ASN1_flat/ds_flat_chXX.flat”. As with all of the large report dumps, files are generated per chromosome (chXX in file name). Because flatfile reports are compact, they will not provide you with as much information as the ASN.1 binary report, but they are useful for scanning human SNP data manually because they provide detailed information at a glance. A full description of the information provided in the rs docsum flatfile format is available in the 00readme file, located in the SNP directory of the
The chromosome reports format provides an ordered list of RefSNPs in approximate chromosome coordinates. Each chromosome report is a small file to download, but it contains a great deal of information that can help identify SNPs useful as markers on maps or contigs, because the coordinate system used in this format is the same as that used for the NCBI genome Map Viewer. The chromosome reports directory might also contain the multi/ file and/or the noton/ files. These files are lists (in chromosome report format) of SNPs that hit multiple chromosomes in the genome and those that did not hit any chromosomes in the genome, respectively. A full description of the information provided in the chromosome reports format is available in the 00readme file, located in the SNP directory of the
A cycle of flank sequence masking and MegaBLAST alignment to the NCBI genome assembly of an organism is initiated either by the appearance of FASTA-formatted genome sequence for a new build of the assembly or by the significant accrual of newly submitted SNP data for that organism.
We prepare FASTA-formatted sequence for unclustered ss assays of a particular organism and run RepeatMasker (with the mixed case option set) on the sequence, using the appropriate ALU-repeat library. For human, we use
Both refSNP and subSNP FASTA sets are aligned to the genome assembly using MegaBLAST, and the SNP position is computed from the list of ungapped alignments returned with the
We also apply tailored BLAST parameters to specialized subsets of the database that are known to fail or perform poorly in the standard pathway. Short sequences of 75 bases or fewer are BLASTed with word size = 22. To capture hits in cases where the variation length might break the alignment, each flank of a microsatellite or indel is BLASTed independently, and the flanks are paired up in post-processing to provide the SNP position. We perform high-stringency BLASTs for many of the heavily masked SNPs using word size = 40 and 99% quality in any alignment. This is an attempt to capture low-multiplicity hits of repetitive sequence without overburdening the BLAST algorithm with a large number of seeded alignments.
To reduce hit multiplicity in cases where SNPs align several times on the genome, we examine the relative hit quality of multiple hits. Hits are discarded when the mismatch count is greater than the minimum mismatch count. The mismatch count parameter is currently set to 3. We BLAST refSNPs and subSNPs against GenBank mRNA, RefSeq mRNA, and GenBank clone accessions by a similar procedure.
We send the output from the genome alignment (or clone alignment where a genome is not available) through a clustering procedure that assigns rs IDs to unclustered new submissions. In some cases, this clustering procedure will re-cluster existing refSNP clusters based on colocation on the genome. After clustering, we update BLAST output to synchronize the ss IDs and the re-clustered rs IDs to the associated refSNP cluster defined in the database. Location data are loaded into the database, and cumulative hit data for each mapped SNP are computed and loaded as well. Finally, we update SNPFlankStatus.
When a protein is known to have a structure neighbor, dbSNP projects the RefSNPs located in that protein sequence onto sequence structures.
First, contig annotation results provide the SNP ID (snp_id), protein accession (protein_acc), contig and SNP amino acid residue (residue), as well as the amino acid position (aa_position) for a particular RefSNP. These data can be found in the dbSNP table, SNPContigLocusId. FASTA sequence is then obtained for each protein accession using the program idfetch, with the command line parameters set to:
We BLAST these sequences against the PDB database using blastall with the command line parameters set to:
Each SNP position in the protein sequence is used to determine its corresponding amino acid and amino acid position in the 3D structure from the BLAST result. These data are stored in the SNP3D table.
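The position lookup described above can be sketched as a walk along a gapped pairwise alignment. This is only an illustration under stated assumptions: the function name, its arguments, and the example alignment strings are hypothetical, and the alignment is assumed to be given as two equal-length gapped strings plus the 1-based start positions reported by BLAST.

```python
def map_query_to_subject(q_aln, s_aln, q_start, s_start, q_pos):
    """Map a 1-based query (protein) position to the aligned subject
    (structure) position and residue.

    Returns (subject_position, subject_residue), or None when the query
    position lies outside the alignment or is aligned to a gap.
    """
    q_i, s_i = q_start - 1, s_start - 1
    for q_char, s_char in zip(q_aln, s_aln):
        if q_char != "-":
            q_i += 1          # advance only on real query residues
        if s_char != "-":
            s_i += 1          # advance only on real subject residues
        if q_char != "-" and q_i == q_pos:
            return None if s_char == "-" else (s_i, s_char)
    return None


# Hypothetical alignment: query starts at residue 10, subject at residue 3.
query_aln = "MKT-AYIAK"
sbjct_aln = "MRTGA-IAK"
mapped = map_query_to_subject(query_aln, sbjct_aln, 10, 3, 13)
```

A position that falls opposite a gap in the structure sequence has no 3D counterpart, so the sketch returns `None` for it rather than guessing a neighbor.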
The Gene Expression Omnibus (
High-throughput hybridization array- and sequencing-based experiments have become increasingly common in molecular biology laboratories in recent years (
Because of the plethora of measuring techniques for molecular abundance in use, our primary goal in creating the Gene Expression Omnibus (
This chapter is both more current and more detailed than the previous literature report on GEO (
The entity–relationship diagram for GEO.
An actual example of three samples referencing one platform and contained in a single series.
Accession prefix | Entity type | Subtype | Description |
---|---|---|---|
GPL | Platform | Commercial nucleotide array | Commercially available nucleotide hybridization array |
GPL | Platform | Commercial tissue array | Commercially available tissue array |
GPL | Platform | Commercial antibody array | Commercially available antibody array |
GPL | Platform | Non-commercial nucleotide array | Nucleotide array that is not commercially available |
GPL | Platform | Non-commercial tissue array | Tissue array that is not commercially available |
GPL | Platform | Non-commercial antibody array | Antibody array that is not commercially available |
GSM | Sample | Dual channel | Dual mRNA target sample hybridization |
GSM | Sample | Single channel | Single mRNA target sample hybridization |
GSM | Sample | Dual channel genomic | Dual DNA target sample hybridization, e.g., array CGH |
GSM | Sample | SAGE | Serial analysis of gene expression |
GSE | Series | Time–course | Time–course experiment, e.g., yeast cell cycle |
GSE | Series | Dose–response | Dose–response experiment, e.g., response to drug dosage |
GSE | Series | Other ordered | Ordered, but unspecified |
GSE | Series | Other | Unordered |
The three principal components (or entities) of GEO are modeled after the three organizational units common to high-throughput gene expression and array-based methodologies. These entities are called
The GEO repository is a relational database, which required that some fundamental implementation decisions be made:
(
(
Once a platform, sample, or series is defined by a submitter, an Accession number (i.e., a unique, stable identifier) is assigned (
Daily usage statistics evaluated over a 4-week period January 24 to February 20, 2002. Web server
It is very easy to use the Type in a valid public or private Select desired display options. Press the
Three types of display options are currently available:
To view one's own private, currently unreleased accessions, login with username and password at the bottom
A GEO Accession number is required to retrieve data from the GEO repository database (
Given a valid GEO Accession number, the Accession Display tool available on the GEO Web site provides a number of options for the retrieval and display of repository contents (see
Cumulative individual sample measurements submitted to GEO are shown. Data are presented by quarter since operations began on July 25, 2000.
Simple Omnibus Format in Text (SOFT) is a line-based, ASCII text format that allows for the representation of multiple GEO platforms, samples, and series in one file. In SOFT, metadata appear as label-value pairs and are associated with the tab-delimited text tables of platforms and samples. SOFT has been designed for easy manipulation by readily available line-scanning software and may be quite readily produced from, and imported into, spreadsheet, database, and analysis software. More information about SOFT and the submission process is available from the
There are several formats in which data can be deposited in and retrieved from GEO. For deposit: (1) a file containing an ASCII-encoded text table of data can be uploaded, and metadata fields can be interactively entered through a series of Web forms; or (2) both data and metadata for one or more platforms, samples, or series can be uploaded directly in a format we call Simple Omnibus Format in Text, or SOFT (
Interactive and direct modes of communication are available for new data submissions and updating data submissions. The interactive Web form route is straightforward and most suited for occasional submissions of a relatively small number of samples. Bulk submissions of large data sets may be rapidly incorporated into GEO via direct deposit of SOFT formatted data.
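Because SOFT is line based, a few lines of code are enough to read it. The sketch below is a minimal reader, not a validator. It assumes the line prefixes used in GEO's published SOFT documentation (`^` opens an entity, `!` carries a label-value pair, `#` describes a table column, and remaining lines are tab-delimited table rows); the function name and the sample content are hypothetical.

```python
def parse_soft(text):
    """Split SOFT text into entities with metadata, column notes, and rows."""
    entities = []
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("^"):                      # new platform/sample/series
            kind, _, name = line[1:].partition("=")
            current = {"type": kind.strip(), "id": name.strip(),
                       "meta": {}, "columns": {}, "rows": []}
            entities.append(current)
        elif current is None:
            continue                                  # ignore lines before any entity
        elif line.startswith("!"):                    # label-value metadata
            label, sep, value = line[1:].partition("=")
            if sep:                                   # table begin/end markers have no value
                current["meta"].setdefault(label.strip(), []).append(value.strip())
        elif line.startswith("#"):                    # column description
            label, _, value = line[1:].partition("=")
            current["columns"][label.strip()] = value.strip()
        else:                                         # tab-delimited table row
            current["rows"].append(line.split("\t"))
    return entities


# Hypothetical sample record, for illustration only.
SAMPLE_TEXT = "\n".join([
    "^SAMPLE = GSM_EXAMPLE",
    "!Sample_title = hypothetical sample",
    "#ID_REF = probe identifier",
    "#VALUE = normalized signal",
    "!sample_table_begin",
    "ID_REF\tVALUE",
    "probe_1\t1.5",
    "probe_2\t0.25",
    "!sample_table_end",
])
parsed = parse_soft(SAMPLE_TEXT)
```

The first row collected for each entity is the table header; a spreadsheet or database import would treat it as the column names.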
Submissions may be held private for a maximum of 6 months; this policy allows data release concordant with manuscript publication. Such submissions are given a final Accession number at the time of submission, which may be quoted in a publication.
Currently, submissions are validated according to a limited set of criteria (see the
A quarterly, cumulative graph of the number of individual molecular abundance measurements in public submissions made through the first quarter of 2002 is shown in
Field name | Description |
---|---|
Accession | GEO accession identifier |
Author | Author of GEO sample |
CloneID | Clone identifier of GEO sample's platform |
Country | Country of GEO sample's submitter |
Email | Email of submitter |
GBAcc | GenBank Accession of GEO sample's platform |
Institute | Institute of GEO sample's submitter |
Keyword | Keyword of GEO sample |
ORF | Open reading frame (ORF) designation of GEO sample's platform |
Organism | Organism of GEO sample and its parent taxonomic nodes |
RefSeq | RefSeq accession of GEO sample's platform |
SAGEtag | Serial analysis of gene expression (SAGE) 10-bp tag of GEO sample |
Subtype | Subtype of GEO sample |
Target ref | Target reference of GEO sample |
Target src | Target source of GEO sample |
Text Word | Word from description of GEO sample or sample's platform, and word from the titles of sample and its platform |
Title | Titles of GEO sample and its platform |
The basic unit (defined by a unique identifier, or UID, in Entrez parlance) in Entrez ProbeSet is the GEO sample, fused with its affiliated platform and series information. The indexing process iterates through all platforms in the GEO database, extracting metadata and the data table and fishing for any sequence-based identifiers such as GenBank Accession, ORFs, Clone IDs, or SAGE tags. Each sample belonging to that platform is in turn assigned a new UID and indexed with the above platform information plus any related series metadata (
GenBank Accessions, PubMed references, and taxonomy information are also linked to the appropriate Entrez databases for cross-reference and appear in the
Extensive indexing and linking on the data in GEO are performed periodically and can be queried through Entrez ProbeSet (
Because samples are oftentimes organized into meaningful data sets within series, an example of retrieving a series and all the data of its associated samples and platform(s) is illustrative of the retrieval capabilities of the GEO Web site. For this example, to select a series of interest, we scan down a list of series in the GEO repository. However, to arrive at our series of interest, we could have just as well performed an Entrez ProbeSet query and followed GEO accession links to a sample and then to its related series, or followed links from PubMed to Entrez ProbeSet, and then to GEO. A step-by-step example of selecting a series of data and retrieving the data for this series from the GEO repository follows:
Select the linked number of public series from the table of Repository Contents given on the Scan down the list of The description of GSE27 on the In the A browser dialog states that it took 19 seconds to download the 5 MB SOFT file of data and metadata for one series (GSE27), seven samples (GSM992 to GSM1000), and one platform (GPL67).
Anticipated development of gene expression resources at NCBI is shown.
The GEO resource is under constant development and aims to improve its indexing, linking, searching, and display capabilities to allow vigorous data mining. Because the data sets stored within GEO are from heterogeneous techniques and sources, they are not necessarily comparable. For this reason, we have defined a ProbeSet to be a collection of GEO samples that contains comparable data. The selection of GEO samples into ProbeSets is necessary before integrating data in the GEO repository into other NCBI resources (see
How do I submit my data?
To submit data, an identity within the GEO resource must first be established. On first login, authentication and contact information must be provided. Authentication information (username and password) is used to identify users making submissions and updates to submissions. Contact information is displayed when repository contents are retrieved by others. This information is entered only once and can be updated at any time.
Is there a “hold until date” feature in GEO?
Yes. This feature allows a submitter to submit data to GEO and receive a GEO Accession number before the data become public. There is currently a 6-month limit to this hold period. All private data are publicly released eventually.
What kinds of data will GEO accept?
GEO was designed around the common features of most of the high-throughput gene expression and array-based measuring technologies in use today. These technologies include hybridization filters, spotted microarrays, high-density oligonucleotide arrays, serial analysis of gene expression (SAGE), comparative genomic hybridization (CGH), and protein (antibody) arrays; the list may be expanded in the future.
Does GEO archive raw data images?
No. However, a reference image will be optionally accepted (limited to 100 Kb in size in JPEG format). In combination with optional references to horizontal and vertical coordinates, this image can be used to provide the user of the data with a qualitative assessment of the data.
Are there any Quality Assurance (QA) measurements that are required by GEO?
Not at this time. These requirements may be added in the future.
How can I submit QA measurements to GEO?
QA measurements are currently optional. If QA measurements are performed at the image-analysis step, these can be submitted as additional sample data.
How can I make corrections to data that I have already submitted?
After logging in with a username and password, you are given the option to update a previous submission or your contact information. Accession updates can also be made through a link from the Accession Display after logging in. Updating the data of an existing, valid GEO Accession number creates a new version of that data element. Alterations of metadata do not create a new version. All versions of a data element remain in the database.
How are submitters authenticated?
In their first submission to GEO, submitters will be asked to select a username and password. This username and password can be used to submit additional data in the future without reentering contact information, as well as to authenticate the submitter when updating or resubmitting data elements under an existing GEO Accession number.
How do I get data from GEO?
You need not log in to retrieve data. All the data are available for downloading. NCBI places no restrictions on the use of data whatsoever but does not guarantee that no restrictions exist from others. You should carefully read NCBI's data disclaimer, available on the GEO Web site.
What kind of queries and retrievals will be possible in GEO?
Currently, there are three ways to retrieve submissions. One way is by entering a valid GEO Accession number into the query box on the header bar of this page; this will take you to the Accession Display. Another is to use the platform, sample, and series lists, located on the GEO Statistics page. Sophisticated queries of GEO data and linking to other Entrez databases can be accomplished by using Entrez ProbeSet.
What does
GEO platforms (GPL prefix) may have related samples and, through those related samples, related series. GEO samples (GSM prefix) will always have one related platform and may have multiple, related series. GEO series (GSE prefix) will have at least one related sample and, through those related samples, will have at least one related platform. The
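The cardinalities described in this answer can be sketched as a small data model. The class names, helper function, and placeholder accessions below are hypothetical and are not taken from the GEO schema; the sketch only encodes the stated rules that a sample has exactly one platform and a series has at least one sample.

```python
from dataclasses import dataclass, field


@dataclass
class Platform:
    accession: str                 # GPL prefix


@dataclass
class Sample:
    accession: str                 # GSM prefix
    platform: Platform             # always exactly one related platform


@dataclass
class Series:
    accession: str                 # GSE prefix
    samples: list = field(default_factory=list)

    def __post_init__(self):
        if not self.samples:
            raise ValueError("a series needs at least one related sample")

    def platforms(self):
        """Platforms are related to a series only through its samples."""
        return sorted({s.platform.accession for s in self.samples})


def series_of_platform(platform, all_series):
    """Series related to a platform, reached through the shared samples."""
    return [se.accession for se in all_series
            if platform.accession in se.platforms()]


# Placeholder accessions, for illustration only.
gpl = Platform("GPL_X")
gse = Series("GSE_X", [Sample("GSM_A", gpl), Sample("GSM_B", gpl)])
```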
What is SOFT?
SOFT stands for Simple Omnibus Format in Text. SOFT is an ASCII text format that was designed to be a machine-readable representation of data retrieved from, or submitted to, GEO. SOFT output is obtained by using the Accession Display, and SOFT can be used to submit data to GEO. Please see Box 2 for more details.
What does the word “taxon” mean?
The NCBI's Taxonomy group has constructed and maintains a taxonomic hierarchy based upon the most recent information, which is described in
We gratefully acknowledge the work of Vladimir Soussov, as well as the entire NCBI Entrez team, especially Grisha Starchenko, Vladimir Sirotinin, Alexey Iskhakov, Anton Golikov, and Pramod Paranthaman. We thank Jim Ostell for guidance, Lou Staudt for discussions during our initial planning for GEO, and Brian Oliver, Wolfgang Huber, and Gavin Sherlock for the extreme patience they showed when making the first data submissions. Admirable patience was also exhibited by Al Zhong during the development of the direct deposit validator. Special thanks go to Manish Inala and Wataru Fujibuchi for their continuing work on future features and tools.
Source | Accessions | Description |
---|---|---|
NHGRI melanoma study | GSE1 | This series represents a group of cutaneous malignant melanomas and unrelated controls that were clustered based on correlation coefficients calculated through a comparison of gene expression profiles. |
Stanford Microarray Database | GSE4 to GSE9, and GSE18 to GSE29 | These series represent microarray studies from the public collection of the Stanford Microarray Database ( |
Cancer Genome Anatomy Project | GSE14 | This series represents the Cancer Genome Anatomy Project SAGE library collection. Libraries contained herein were either produced through CGAP funding or donated to CGAP. |
Affymetrix Gene Chips™ | GPL71 to GPL101 | These platforms represent the latest probe attributes of the commercially available Affymetrix Gene Chips™ high density oligonucleotide arrays. |
National Children's Medical Center Microarray Center | GSM1131 to GSM1345 | These samples represent direct deposits of data derived from Affymetrix Gene Chip™ arrays and come from the Microarray Center database at the National Children's Medical Center. |
Online Mendelian Inheritance in Man (
Trademark status. OMIM™ and Online Mendelian Inheritance in Man™ are trademarks of The Johns Hopkins University.
MIM number range | Explanation |
---|---|
100000-199999 | Autosomal loci or phenotypes (entries created before May 15, 1994) |
200000-299999 | Autosomal loci or phenotypes (entries created before May 15, 1994) |
300000-399999 | X-linked loci or phenotypes |
400000-499999 | Y-linked loci or phenotypes |
500000-599999 | Mitochondrial loci or phenotypes |
600000- | Autosomal loci or phenotypes (entries created after May 15, 1994) |
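The numbering scheme in the table above lends itself to a simple range lookup. The sketch below is illustrative only: the function name and the shortened labels are our own, and the upper bound of 999999 is an assumption that follows from MIM numbers being six digits long.

```python
# Ranges transcribed from the table above; labels are shortened paraphrases.
MIM_RANGES = [
    (100000, 299999, "autosomal (created before May 15, 1994)"),
    (300000, 399999, "X-linked"),
    (400000, 499999, "Y-linked"),
    (500000, 599999, "mitochondrial"),
    (600000, 999999, "autosomal (created after May 15, 1994)"),
]


def mim_catalog(mim_number):
    """Return the catalog that a six-digit MIM number falls into."""
    for low, high, label in MIM_RANGES:
        if low <= mim_number <= high:
            return label
    raise ValueError(f"{mim_number} is not a six-digit MIM number")
```

For example, `mim_catalog(300000)` returns `"X-linked"`, whereas a number such as 99999 raises `ValueError`.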
MIM numbers are frequently preceded by a symbol. An asterisk (*) before a MIM number indicates that the entry describes a distinct gene or phenotype and that the mode of inheritance of the phenotype has been proved (in the judgment of the authors and editors) and that the phenotype described is not known to be determined by a gene represented by other asterisked entries in MIM.
A number sign (#) before a MIM number describing a phenotype indicates that the phenotype is caused by mutation in a gene represented by another entry and usually in any of two or more genes represented by other entries. The number sign is also used for phenotypes that result from specific chromosomal aberrations, such as Down syndrome, and for contiguous gene syndromes, such as Langer-Giedion syndrome. Whenever a number sign is used, the reason is stated at the outset of the entry.
The absence of an asterisk (or other sign) preceding the number indicates that the distinctness of the phenotype as a mendelizing entity or the characterization of the gene in the human is not established.
OMIM entries are authored and edited by experts in the field and by the OMIM staff, based on information in the published literature. All entries are assigned a unique, stable, six-digit ID number and provide names and symbols used for the disorder and/or gene, a literature-based description, citations, contributor information, and creation and editing dates.
Because MIM is derived from the primary literature, the text is replete with citations and links to PubMed. OMIM authors create entries for each unique gene or Mendelian disorder for which sufficient information exists and do not wittingly create more than one entry for each gene
MIM is organized into autosomal, X-linked, Y-linked, and mitochondrial catalogs, and MIM numbers are assigned sequentially within each catalog (
With the increasing complexity of biological information, OMIM makes a critical contribution by distilling what is known about a gene or disease into a single, searchable entry. The rich text of the OMIM entry, along with the source reference citations, makes it easy to retrieve data of interest. The OMIM entry can then serve as a gateway to other sources of related information via the many curated and computed links within each entry.
The OMIM
The Synopsis of the human gene map has also been sorted alphabetically by disorder and is referred to as the Morbid Map.
OMIM can be found either by direct query via
Queries can also be entered in the query box shown on all Entrez database pages, selecting OMIM as the database in the
The Gene Map can be accessed from an individual OMIM entry via the cytogenetic location displayed when appropriate under the entry titles. The Gene Map may also be queried
Title (default)
Details
Clinical Synopsis
Allelic Variants
mini-MIM
ASN.1
LinkOut
Related Entries
Genome Links
Nucleotide Links
Protein Links
PubMed Links
SNP Links
Structure Links
UniSTS Links
1. Enter query term or terms (example: renal failure hypertension).
2. Default display is
3. Select
4. Similarly, select
NOTE: In the same bar, the number of entries to display and the format in which to display them can be configured by use of the
OMIM is queried via a standard Entrez query bar. The mechanics of selecting entries to display, choosing how to display them, and identifying related entries either within Entrez or from external resources also follow the Entrez/LinkOut standard. The display options (
The OMIM
When viewing the text of an OMIM entry, however, the navigational links serve as an electronic table of contents. The section headings within the entry are listed similar to a table of contents, and selecting one moves the display to that section. Within an entry, selecting the
OMIM staff actively contributes to the curation of data in LocusLink. Thus, if the MIM number is represented in LocusLink, a reciprocal LocusLink link is provided in the section to the left. Other links provided by the LocusLink collaboration may also be listed in this section, e.g., links to Nomenclature, Reference Sequences, or UniGene clusters that are specific to the subject OMIM entry.
The sequence links in the LocusLink section may be different from the Entrez indexing links available via the
OMIM entries may also contain a link to LinkOut for resources external to NCBI (see
At the top of any report page, or associated with each entry in the query result page, are the links to related data generated from Entrez (
Each OMIM entry has a unique number and is given a primary title and symbol. This is the title that is displayed in the document retrieval list. Alternative designations are listed below the primary title. Some entries contain information that is related but not synonymous to the primary topic and is not addressed in another entry (e.g., splice variants, phenotypic variants, etc.). This information is set off by the word “included” in the title. The first “included” title is displayed in the document retrieval list. The cytogenetic map location when known is given for each entry. When a disease shows genetic heterogeneity, more than one map location may be given. The “light bulb” icon at the end of text paragraphs links to related articles in PubMed. References within the text are linked to the complete citation at the end of the entry. There, the PubMed ID is linked to the PubMed abstract.
Some entries contain an
The OMIM data are available for
The OMIM™ database, including the collective data contained therein, is the property of The Johns Hopkins University, which holds the copyright thereto. The OMIM database is made available to the general public subject to certain restrictions. You may use the OMIM database and data obtained from this site for your personal use, for educational or scholarly use, or for research purposes only. The OMIM database may not be copied, distributed, transmitted, duplicated, reduced, or altered in any way for commercial purposes or for the purpose of redistribution without a license from The Johns Hopkins University. Requests for information regarding a license for commercial use or redistribution of the OMIM database may be sent via email to
OMIM is funded by a contract from the National Library of Medicine and the National Human Genome Research Institute and by licensing fees paid to the Johns Hopkins University by commercial entities for adaptations of the database. The terms of these licenses are being managed by the Johns Hopkins University in accordance with its conflict of interest policies.
The
The online books are displayed one section at a time, with navigation provided to other parts of the current chapter or to other chapters within the book. Many of the books on the BookShelf can be browsed without any restriction at all; others have less flexibility for navigating the complete content. The publisher (or the owner of the content) defines the rules for access.
The books are linked to PubMed through research paper citations within the text. In the future, more links may be established between the BookShelf and other resources at NCBI, such as gene and protein sequences, genomes, and macromolecular structures.
The BookShelf provides a venue through which publishers and authors can make the full text of biomedical books available to the scientific community. As the BookShelf grows, we welcome proposals from authors, editors, or publishers of any “in-scope” texts, from undergraduate textbooks to more specialized publications, including collections of review articles or workshop proceedings. The scope of the BookShelf is broadly biomedical, including clinical works and those concerning basic biological and chemical sciences.
All books made available on the BookShelf have been provided in electronic format to
The complete contents of each book will be converted into
Any book that was printed from
Figures should be supplied in TIFF format, although GIF and JPEG formats may be accepted. The submitted text files are converted into XML according to the NCBI Book DTD; graphic files are converted into GIF and JPEG formats. Three hard copies of the book are also required, along with the electronic files.
The XML files are stored in a database. When a reader requests a book, chapter, or section, the XML is retrieved from the database and converted into HTML on the fly using Extensible Stylesheet Language Transformations (
There are three ways to access the content in BookShelf:
1. Through hyperlinked terms in PubMed abstracts
2. By a direct search using search terms or phrases (in the same way as the bibliographic database of PubMed is searched)
3. Through the Table of Contents of the book (note: some publishers restrict browsing through the entire book by disabling hyperlinks in the Table of Contents)
The BookShelf can be accessed from all PubMed abstract pages. When viewing a full PubMed abstract, select the This view was generated by selecting
A statistical weighting system based on the frequency of each phrase in a book section, relative to the rest of the book, is used to identify “good” phrases. A phrase that appears repeatedly in only a few sections and rarely in other parts of the book indicates a definitive phrase for those few sections; therefore, it ranks highly. Furthermore, the appearance of a phrase in the title, for example, has a greater value in the weighting system than one appearing solely in the text.
Each PubMed abstract can thus be linked to the appropriate book pages. This method allows two very dissimilar types of text—the dense, focused PubMed abstracts and the more descriptive book text—to find common ground.
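The weighting idea can be illustrated with a toy score. The formula below is an assumption made for illustration, not the scoring actually used on the BookShelf: a phrase is scored for a section by the share of its book-wide occurrences that fall in that section, multiplied by a hypothetical bonus when the phrase also appears in the section title.

```python
def phrase_scores(sections, phrase, title_bonus=2.0):
    """Score one phrase for every section of a book.

    sections maps a section id to {"title": str, "text": str}.
    """
    phrase = phrase.lower()
    counts = {sid: sec["text"].lower().count(phrase)
              for sid, sec in sections.items()}
    total = sum(counts.values())
    scores = {}
    for sid, sec in sections.items():
        share = counts[sid] / total if total else 0.0   # concentration in this section
        if phrase in sec["title"].lower():
            share *= title_bonus                        # title hits weigh more
        scores[sid] = share
    return scores


# Hypothetical three-section book.
book = {
    "s1": {"title": "Apoptosis pathways",
           "text": "Apoptosis is tightly regulated. Apoptosis requires caspases."},
    "s2": {"title": "The cell cycle",
           "text": "A stalled cell cycle can end in apoptosis."},
    "s3": {"title": "Mitosis",
           "text": "Chromosomes segregate during mitosis."},
}
scores = phrase_scores(book, "apoptosis")
```

In this toy book, the phrase is concentrated in the first section and appears in its title, so that section ranks highest.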
Field | Use |
---|---|
[Author] | Search for the authors of books or chapters. |
[Book] | Typically used with a Boolean expression to limit a search to a particular book. |
[PmId] | Locate a journal article citation in a book by its PubMed ID. |
[Rid] | Locate a particular book element (such as a figure or table) by its reference ID. |
[Secondary Text] | Search for secondary text, e.g., units (mg/l, etc.) |
[Title] | Search for words used in any title (book, chapter, section, subsection, figure, etc.). |
[Type] | Locate a division of a book such as a section, chapter, or figure group. |
Filters are applied immediately following a search term, with no separating spaces, e.g., watson[author] AND cmed[book].
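The filter syntax is easy to assemble programmatically. The helper below is a hypothetical convenience function, not part of any NCBI toolkit; it simply appends each bracketed field to its term without a separating space and joins the terms with a Boolean operator.

```python
def book_query(*terms, operator="AND"):
    """Build a search string from (text, field) pairs.

    A field of None leaves the term unrestricted.
    """
    parts = [f"{text}[{field}]" if field else text for text, field in terms]
    return f" {operator} ".join(parts)


query = book_query(("watson", "author"), ("cmed", "book"))
```

Here `query` holds the example string given above, `watson[author] AND cmed[book]`.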
Book contents may be searched directly from the
Results are shown as a list of books in which the term is found, along with the number of sections, figures, and tables that contain the term ( If more than 20 sections, tables, and/or figures are found that contain the query term, a summary page, such as the one above, is displayed.
The document summary list is sorted with the most relevant documents shown at the top of the page. The sorting makes use of scores allocated to phrases as a measure of how relevant they are to a given section (a part of the statistical weighting system also used for linking PubMed abstracts to the books). For each book section found, the title, along with some context regarding the hierarchy of the section location (e.g., the chapter and book), is given. An icon is used to distinguish figure legends and tables from text sections ( The most relevant sections to the query appear at the top of the list. Note the icons that designate figure and table hits. The list may also be displayed in a brief format that lists only the section names that contain the term by choosing
Each HTML page of content seen in a Web browser represents one section of one chapter of a book, i.e., all of the content (including subsections and so on) within the first-level heading of a chapter. The amount of content this represents varies according to the structure of the original book. Some books have very long sections, some short, some a mixture; although on the whole, most chapters are divided into 3-10 sections.
The top of every page contains links to both short and detailed Tables of Contents and a description of the current location within the book ( Hyperlinks to various sections within a chapter appear within a navigation bar to the
All book content submitted to the BookShelf is converted to XML according to the public NCBI Book DTD.
Any files submitted in SGML or XML are converted to the Book DTD using
The XML for a book is generally a set of files. Each chapter and appendix is an independent file, as is the frontmatter. The book is pulled together by the book.xml file, which defines the structure of the book. For example, a book with two chapters and a bibliography at the end of the book would be structured as follows:
The book.xml file is composed of two parts. The first part defines all of the components that are required to build the book. These definitions occur within the <!DOCTYPE [ ] > tag. The second part builds the structure of the book. The root element is <book>. <book> contains whatever is in &frontmatter;, <body>, and <back>.
The book.xml example above refers to five external files: fm.xml, which contains all of the frontmatter of the book; ch1.xml, which contains chapter 1; ch2.xml, which contains chapter 2; biblist.xml, which contains the bibliography for the book; and graphics.xml, which defines the images. If any of these files is not valid according to the DTD or if the files are not found where they are defined in their <!ENTITY > declaration, then the book will not be valid.
All of the images, including figures, icons, and book-specific character graphics (see Special Characters below), are called out in the text as entities. The entities are defined in the graphics.xml file.
<!ENTITY ch2fu6 SYSTEM "data/mga/pictures/ch2/ch2fu6.gif" NDATA GIF>
<!ENTITY ch2fu7 SYSTEM "data/mga/pictures/ch2/ch2fu7.jpg" NDATA JPG>
<!ENTITY ch2fu8 SYSTEM "data/mga/pictures/ch2/ch2fu8.gif" NDATA GIF>
<!ENTITY ch2fu9 SYSTEM "data/mga/pictures/ch2/ch2fu9.gif" NDATA GIF>
<!ENTITY ch2fu10 SYSTEM "data/mga/pictures/ch2/ch2fu10.gif" NDATA GIF>
<!ENTITY ch2e1 SYSTEM "data/mga/pictures/ch2/ch2e1.gif" NDATA GIF>
<!ENTITY ch2e2 SYSTEM "data/mga/pictures/ch2/ch2e2.gif" NDATA GIF>
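Declarations of this form are regular enough to be read with a short script, for example to check that every referenced image file exists. The sketch below is an illustration only; the function name is hypothetical, and the regular expression assumes each declaration follows exactly the `SYSTEM "path" NDATA format` pattern shown above.

```python
import re

# Matches: <!ENTITY name SYSTEM "path" NDATA FORMAT>
ENTITY_RE = re.compile(
    r'<!ENTITY\s+(\S+)\s+SYSTEM\s+"([^"]+)"\s+NDATA\s+(\w+)\s*>'
)


def parse_graphic_entities(text):
    """Map each entity name to its (file path, notation) pair."""
    return {name: (path, fmt) for name, path, fmt in ENTITY_RE.findall(text)}


declarations = "\n".join([
    '<!ENTITY ch2fu6 SYSTEM "data/mga/pictures/ch2/ch2fu6.gif" NDATA GIF>',
    '<!ENTITY ch2fu7 SYSTEM "data/mga/pictures/ch2/ch2fu7.jpg" NDATA JPG>',
])
entities = parse_graphic_entities(declarations)
```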
Graphic files are converted into GIF and JPEG formats and optimized for display on the Web. The images are not loaded into the database; they are retrieved from a file server when called by the HTML page.
Math expressions and chemical formulae and structures are handled as images.
The BookShelf uses the same character sets that PubMed Central uses (see
The XML files for each book are stored in a
The XML is converted to HTML using XSLT stylesheets. The look of the HTML pages is controlled further using CSS, which allows manipulation of colors, fonts, and typefaces (
The Table of Contents for a book is created from actual elements within the book content, rather than from the Table of Contents given in the book frontmatter of the hard copy. This ensures that the Table of Contents represents the content accurately as it is organized on BookShelf.
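Deriving a Table of Contents from the content itself amounts to collecting titles from the structural elements. The sketch below illustrates the idea with Python's standard XML parser. The element names (`chapter`, `sect1`, `title`) and the sample chapter are hypothetical and are not claimed to match the NCBI Book DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical chapter file, for illustration only.
CHAPTER_XML = """<chapter id="ch1">
  <title>Chapter One</title>
  <sect1 id="ch1.s1"><title>First Section</title><para>Text.</para></sect1>
  <sect1 id="ch1.s2"><title>Second Section</title></sect1>
</chapter>"""


def table_of_contents(chapter_xml):
    """Return (depth, title) pairs built from the chapter's own elements."""
    root = ET.fromstring(chapter_xml)
    toc = [(0, root.findtext("title"))]          # the chapter title
    for section in root.findall("sect1"):        # first-level sections only
        toc.append((1, section.findtext("title")))
    return toc


toc = table_of_contents(CHAPTER_XML)
```

Because the entries are read from the same elements that are rendered as pages, the resulting list cannot drift out of step with the online organization of the book.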
In the first version of the NCBI BookShelf project, Quark files were converted directly to HTML for display online. The result was effective, illustrating the value of having a textbook online and linked to PubMed; however, it was labor intensive, limiting, and not scalable.
To simplify the delivery of books online and to allow for the expansion of linking within the Entrez system, NCBI decided to convert all content into a centralized XML format. The normalized XML content is easier to render, allows added value such as the addition of links to other NCBI databases, and simplifies the addition of new volumes.
PMC created a new DTD for the BookShelf project, which was based on the ISO 12083 DTD. As more books were converted to the NCBI Book DTD, changes had to be made to accommodate the data.
The NCBI Book DTD is a public DTD available on request from
How do I access the books at NCBI?
The online books can be accessed by direct searching in Entrez or through PubMed abstracts. After performing a general PubMed search, click on the author name of one of the search results to view the abstract. A hypertext link called
Which books are available at NCBI?
The book list is updated on a regular basis and can be viewed on the BookShelf homepage:
Can I search the books at NCBI?
Yes, the books can be searched either as a complete collection or as a single, selected book (restricted using search options found under
Can I browse the whole book?
The system has been designed so that the user is delivered to the most relevant book sections for a particular term or concept. Although navigation is possible in the immediate vicinity of the page to which you are delivered, it may not be possible to browse the complete book on BookShelf. The range of navigation for each book is determined on a case-by-case basis, in agreement with the publisher.
I am the publisher/author/editor of a book. How can I participate?
Please email
PubMed Central (PMC) is the National Library of Medicine's digital archive of full-text journal literature. Journals deposit material in PMC on a voluntary basis. Articles in PMC may be retrieved either by browsing a table of contents for a specific journal or by searching the database. Certain journals allow the full text of their articles to be viewed directly in PMC. These are always free, although there may be a time lag of a few weeks to a year or more between publication of a journal issue and when it is available in PMC. Other journals require that PMC direct users to the journal's own Web site to see the full text of an article. In this case, the material will always be available free to any user no more than 1 year after publication but will usually be available only to the journal's subscribers for the first 6 months to 1 year.
To increase the functionality of the database, a variety of links are added to the articles in PMC: between an article correction and the original article; from an article to other articles in PMC that cite it; from a citation in the references section to the corresponding abstract in PubMed and to its full text in PMC; and from an article to related records in other Entrez databases such as Reference Sequences, OMIM, and Books.
The
Every article citation in a table of contents includes one or more links (
In addition to the header information for the article itself, the upper part of a Full Text or PubLink page contains a variety of links, including links to other forms of the article, to related information in PubMed and other
An Abstract page is identical to a Full Text page that has been cut off at the end of the abstract.
A PMC PubLink page (
When an article has been cited by other articles in PMC, a “cited-in” link displays just under the article header information on both the Abstract and Full Text pages. Selecting this cited-in link gives you a list of the articles that have referenced the subject article (
PMC article citations may also be retrieved by doing a search in PMC or through a PubMed search. (In PubMed, use the
Participation by publishers in PMC is voluntary, although participating journals must meet certain editorial standards. A participating journal is expected to include all of its peer-reviewed primary research articles in PMC. Journals are encouraged to also deposit other content such as review articles, essays, and editorials. Review journals, and similar publications that have no primary research articles, are also invited to include their contents in PMC. However, primary research papers without peer review are not accepted.
Journals that deposit material in PMC may make the full text viewable directly in PMC or may require that PMC link to the journal site for viewing the complete article. In the latter case, the full text must be freely available at the journal site no more than 1 year after publication. In the case of full text that is viewable directly in PMC, which by definition is free, the journal may delay the release of its material for more than 1 year after publication, although all current journals have delays of 1 year or less.
In either case, the journal must provide
The rationale behind the insistence on free access is that continued use of the material, which is encouraged by open access, serves as the best test of the durability and utility of the archive as technology changes over time. PMC does not claim copyright on any material deposited in the archive. Copyright remains with the journal publisher or with individual authors, whichever is applicable.
Refer to
Abstract and Full Text pages in PMC carry links to related articles in PubMed and to related records in other Entrez databases, such as Nucleotides or Books. These are identical to the links between databases that you can find in any Entrez record.
PubMed Central is an XML-based publishing system for full-text journal articles. All journal content in the archive was either supplied in, or has been converted to, a Document Type Definition (
The content is displayed dynamically on the PMC site by journal, volume, and issue (if applicable). XML, Web graphics, PDFs, and supplemental data are stored in a
We receive journal content either directly from publishers or from publishers' vendors. This content includes:
SGML or XML of the articles to be deposited
High-resolution images
Supplemental data associated with the articles
PDF versions of the articles
All of the text is converted to a central DTD, the PMC DTD, and the images are converted to Web format (GIF and JPEG). These files, along with any supplemental data or PDFs, are loaded into the database for linking, indexing, and retrieval (
Once all of the files in an issue or batch have been validated, they are converted to PMC XML (referred to as a PMC XML file or PXML) using XSLT (
For each publisher that submits SGML, an XML version of the DTD is created. This is used to parse the output of SX before the XML conversion is started. The XML version of the publisher's DTD is not used to validate the source data (because this has already been done using the original DTD).
XSLT requires that the input file be valid XML. If a DTD is not available for validation, the parser will check the syntax; it will also replace all of the character entities with the appropriate
After the XSLT conversion, the original character entities are converted to character entities that are valid under the PMC DTD. Character translation tables for each source DTD regulate this conversion.
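The role of a character translation table can be illustrated with a short Python sketch. It is not PMC's implementation: the table contents, the function name `translate_entities`, and the choice to report unmapped names rather than fail are all assumptions made for illustration.

```python
import re

# Hypothetical translation table for one source DTD:
# source entity name -> entity name assumed valid in the target DTD.
# The pairs below are illustrative only.
TRANSLATION_TABLE = {
    "agr": "alpha",
    "bgr": "beta",
}

# Entities predefined by XML itself are passed through unchanged.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9.]*);")


def translate_entities(text, table):
    """Rewrite each &name; using the table and collect names with no mapping."""
    unmapped = []

    def replace(match):
        name = match.group(1)
        if name in table:
            return "&" + table[name] + ";"
        if name not in XML_PREDEFINED:
            unmapped.append(name)
        return match.group(0)  # leave the entity as it was

    return ENTITY_RE.sub(replace, text), unmapped
```

Keeping one table per source DTD, as the text describes, means that the conversion code stays the same and only the mapping changes when a new publisher format is added.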
The resulting XML is then validated against the PMC DTD.
Several other items are created along with the PXML file. These are:
1. An entity file (articlename.ent). This file lists all of the character entities (from the PMC DTD-defined entity sets) that are in the article. One entity file is created for each article. This information is loaded into the database and is used to prepare the final HTML file for display. A sample is below:
2. A PubMedID file (articlename.pmid). This file includes a set of reference citations in the format
When the article is converted, this information is collected from each journal citation in the bibliography and sent in a query to PubMed (using the
3. A source node file (articlename.src). When an article is passed through the XSLT conversion, a list is made of each node, or named piece of information, that is included in the file. As the conversion is running, each node that is being processed is recorded. When the conversion is complete, the processed node list is compared with the list of nodes in the source file, and any piece of information that was not processed is reported in a conversion log. A sample is below:
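Separately from the sample file mentioned above, the comparison step itself can be sketched in Python as a set difference. This is an illustration only, not the PMC conversion code: it assumes that a "node" can be approximated by an XML element name, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def source_node_names(xml_text):
    """Return the set of element names that occur in the source document."""
    root = ET.fromstring(xml_text)
    return {element.tag for element in root.iter()}


def unprocessed_nodes(xml_text, processed):
    """Return, sorted, the names found in the source but never processed."""
    return sorted(source_node_names(xml_text) - set(processed))
```

Anything returned by `unprocessed_nodes` would correspond to an entry in the conversion log, that is, information present in the source that the conversion did not handle.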
To accommodate the archiving requirements of the PubMed Central project, it is important that figures be submitted at the highest resolution possible, in TIFF or EPS format. Figures in these formats will be available for data migration when formats change in the future, and PubMed Central will be able to keep all of the figures current.
For display on the PMC site, two copies of each figure are made: a GIF thumbnail (100 pixels wide) and a JPEG file that will be displayed with the figure caption when the figure is requested.
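As a small worked example, the dimensions of a 100-pixel-wide thumbnail can be computed as follows. The sketch assumes that the aspect ratio of the original figure is preserved, which the text does not state; the function name is hypothetical.

```python
def thumbnail_size(width, height, target_width=100):
    """Scale (width, height) to target_width, keeping the aspect ratio.

    Assumption: the thumbnail keeps the proportions of the original figure.
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    # Never let the height collapse to zero for very wide figures.
    scaled_height = max(1, round(height * target_width / width))
    return target_width, scaled_height
```

For instance, a 1200 x 900 pixel figure would yield a 100 x 75 pixel thumbnail under this assumption.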
Supplemental data include any supporting information that accompanies the article but is not part of the article. They may be text files, Word document files, spreadsheet files, executables, video, and others.
Sometimes a journal has a Web site where all of this supplemental information is stored. In this case, PubMed Central establishes links from the article to the supplemental information on the publisher's site.
In other cases, the supplemental data files are submitted with the article to be loaded into the PMC database. Either way, the information concerning this supplemental data is collected in a Supplemental Data file, which includes the location of the supplemental file(s), the type of information that is available, and how the link should be built from the article. PMC does not validate any of the supplemental data files that are supplied.
Mathematical symbols and notations can be difficult to display in HTML because of built-up expressions and unusual characters. For the most part, expressions that are simple enough to display using HTML are not handled as math unless they are tagged as math specifically. Publishers that supply content to PMC handle math expressions in one of two ways: supplied images or encoding in SGML.
Any expression that cannot be tagged by the source DTD is supplied as an image. In this case, PMC will pass the image callout through to the PXML file and display the supplied image in the HTML file.
Several of the Source DTDs used by publishers to submit data to PMC are robust enough to allow coding of almost any mathematical expression in SGML. Most of these were derived from the Elsevier DTD; therefore, many of the elements are similar.
During article conversion, any items that are recognized as math are translated into Teχ. This would include any expression tagged specifically as a “formula” or “display formula,” as well as any free-standing expression that cannot be represented in HTML. These expressions include radicals, fractions, and anything with an overbar (other than accented characters). For example:
“1/2” would not be recognized as a math expression, but <fraction><numerator>1</numerator><denominator>2</denominator></fraction> would be. “<radical>2x</radical>” would be recognized as a math expression, as would “<overbar>47X</overbar>.”
The SGML:
Converts to:
When the articles are loaded into the database, the equation markup is written into an Equation table. This table will also include the equation image, which will be created from the Teχ markup.
The image for the equation shown above in SGML and PXML (with Teχ) is:
Because all of the content in PubMed Central is in the same format—PMC DTD—loading articles into the database is relatively straightforward. Once an article is loaded into the production (public) database, it will retain its ArticleId (article ID number) in perpetuity. On loading, each article is validated against the PMC DTD. Also, any external files that are referenced by the XML are checked. If any file, such as a figure, is missing, the loading will be aborted.
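The external-file check can be pictured with the following Python sketch. It is a simplified illustration under stated assumptions, not the actual loading software: file references are assumed to appear in an `href` attribute, the set of available files is passed in directly rather than read from disk, and all names are hypothetical.

```python
import xml.etree.ElementTree as ET


class LoadAborted(Exception):
    """Raised when an article cannot be loaded."""


def missing_files(xml_text, available_files, attribute="href"):
    """List referenced files (in document order) that are not available."""
    root = ET.fromstring(xml_text)
    referenced = [el.get(attribute) for el in root.iter() if el.get(attribute)]
    return [name for name in referenced if name not in available_files]


def load_article(xml_text, available_files):
    """Abort the load if any referenced file, such as a figure, is missing."""
    missing = missing_files(xml_text, available_files)
    if missing:
        raise LoadAborted("missing files: " + ", ".join(missing))
    return True  # a real loader would write to the database here
```

Checking every reference before anything is written mirrors the all-or-nothing behavior described above: an article with a missing figure is not loaded at all.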
The database loading software and daily maintenance programs perform several other tests to ensure the accuracy and vitality of the archive:
Journal identity. The Journal title being loaded is verified against the
Duplicate articles. An article may not be loaded more than once. Any changes to the article must be submitted as a replacement article, which will use the same ArticleId.
Publication date/delay. Rules for delay of publication embargo can be set up in the database to ensure that an issue will not be released to the public before a certain amount of time has passed since the publisher made the issue available.
PubMed IDs. PubMedIDs for the article being loaded or any bibliographic citation in the article that are not defined in the PXML are looked up upon loading.
Link updates. Links between related articles and from articles to external sources are updated daily.
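The publication delay rule lends itself to a brief example. The Python sketch below assumes that the delay is expressed as a number of days counted from the date on which the publisher made the issue available; how the rules are actually stored in the database is not described here, and the function names are hypothetical.

```python
from datetime import date, timedelta


def release_date(publisher_available, delay_days):
    """First day on which the issue may be shown to the public."""
    return publisher_available + timedelta(days=delay_days)


def is_releasable(publisher_available, delay_days, today):
    """True once the embargo period has fully elapsed."""
    return today >= release_date(publisher_available, delay_days)
```

With a 180-day delay, an issue made available by the publisher on January 15, 2002 would be released on July 14, 2002 and not before.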
The database has been designed to allow multiple versions of articles. In addition to article information, the database also stores information on content suppliers and publishers and journal-specific information.
PubMed Central uses a number of standard
Each publisher DTD defines a set of characters that may be used in their articles. Generally, these publisher DTDs use the same standard ISO character sets listed in
The supplied data also include groups of entities that are to be combined in the final document. Sometimes these are grouped in a tag such as:
and sometimes they are just positioned next to each other in the text. These combined entities must be mapped either to an ISO character or to a character in the PMC character set.
For the most flexibility in displaying characters across platforms, PMC uses UTF-8 encoding whenever possible. Because not all browsers support the same subset of UTF-8 characters and some characters cannot be represented in UTF-8, PMC displays characters as a combination of GIFs and UTF-8 characters, depending on the Browser/OS combination and the character to be displayed.
In the first version of the PMC project, the SGML and XML were loaded into a database in their native formats. The HTML rendering software was then required to convert content from different sources into normalized HTML on the fly when a reader requested an article.
This was slow and cumbersome on the rendering side. At that time, PMC was receiving content for about five journals in two DTDs, the keton.dtd from HighWire Press and the article.dtd from BioMed Central. The set-up for a new journal was difficult, and it soon became obvious that this solution would not scale easily.
To satisfy the archiving requirement for the PMC project and to simplify the delivery of articles online, PubMed Central decided to convert all content into a centralized format. The normalized content is easier to render, allows enhanced value such as links to other NCBI databases to be added, and simplifies content archiving.
PMC created a new DTD, which was strongly influenced by the BioMedCentral article.dtd and the keton.dtd. The original emphasis was on simplicity. As more and more articles from more and more journals were converted to the PMC DTD, changes had to be made to accommodate the data. The
Because the PMC DTD grew rapidly, it was feared that the original “simplicity” of its design would lead to confusing data structures. With more and more publishers inquiring about submitting content directly in the PMC DTD, PubMed Central decided that an independent review was necessary.
At approximately the same time, under the auspices of a Mellon Grant to explore ejournal archiving, Harvard University Library contracted with
Can a common DTD be designed and developed into which publishers' proprietary SGML files can be transformed to meet the requirements of an archiving institution?
If such a structure can be developed, what are the issues that will be encountered when transforming publishers' SGML files into the archive structure for deposit into the archive?
The requirement of the archival article DTD was defined as the ability to represent the intellectual content of journal articles. This
The NLM Archiving DTD will not be backwards-compatible with the pmc-1.dtd. It should be publicly available by the end of 2002, along with complete documentation for publishers and authors. A draft version is available (
Please refer to the
The RGB image demonstrates cytogenetic abnormalities in a cell from a secondary leukemia cell line.
Spectral Karyotyping (
The goal of the SKY/CGH database is to allow investigators to submit and analyze both clinical and research (e.g., cell lines) SKY and CGH data. The database is growing and currently has a total of about 700 datasets, some of which are being held private until published. Several hundred labs around the world use this technique, with many more looking at the data they generate. Submitters can enter data from their own cases in either of two formats, public or private; the public data are generally those that have already been published, whereas the private data can be viewed only by the submitters, who can transfer them to the public format at their discretion. The results are stored under the name of the submitter and are listed according to case number. The
Detailed information on how to submit data either to the SKY or CGH sectors of the database can be found through links on the
The submitter enters the written Chromosome images on the The written karyotype and a one-line case summary are also provided.
The CGH database displays gains, losses, and amplification of chromosomes and chromosome segments. The data are entered either by hand or automatically from various CGH software programs. In the manual format, the submitter enters the band information for each affected chromosome, describing the start band and stop band for each gain or loss, and the computer program displays the final karyotype with vertical bars to the left or right of each chromosome, indicating loss or gain, respectively (
Clinical data submitted include case identification, World Health Organization (WHO) disease classification code, diagnosis, organ, tumor type, and disease stage. To obtain the correct classification code, a link is provided to the NCI's
The references for the published cases are entered into the Case Information page and are linked to their abstracts in PubMed.
A colored karyotype with band overlay is presented to the submitter, who then builds each aberrant chromosome by cutting and pasting (by clicking with the mouse at appropriate breakpoints) ( Clicking on a chromosome brings up that chromosome with band overlay. Using the cursor, the operator cuts and pastes together each abnormal chromosome. The abnormal chromosome shown is a combination of chromosomes 10, 16, and 22.
To speed up the entry of cytogenetic data into the database, NCBI has built a computer program to automatically read short-form karyotypes, extract the information therein, and insert it into the SKY database ( The short-form written karyotype, entered in the karyotype field in (
Data submitters must use the same terminology for diagnosis (morphology) and organ site (topography) to permit comparison or combination of the data in the SKY/CGH database. From the many different disease classification systems, the
Quick Search can be found at the top of the SKY/CGH
The query results page displays information on all relevant cases, clones, and cells, along with details of SKY and/or CGH studies and clinical information for each case.
All of the public clinical and cytogenetic information can be searched. This format is currently under development.
This
All chromosomal bands, including breakpoints involved in chromosomal abnormalities, are linked to the Map Viewer database (
Links are provided to related Web sites including: chromosome databases (e.g., the Mitelman database); other NCI (e.g.,
One of the most intensely studied regions of the human genome is the Major Histocompatibility Complex (MHC), a group of genes that occupies approximately 4–6 megabases on the short arm of chromosome 6. The MHC genes, known in humans as Human Leukocyte Antigen (HLA) genes, are highly polymorphic and encode molecules involved in the immune response. The MHC database,
Detailed single nucleotide polymorphism (SNP) mapping data of the HLA region
KIR gene analysis data
HLA diversity/anthropology data
Multigene haplotype data
HLA/disease association data
Peptide binding prediction data
The MHC database is fully integrated with other NCBI resources, as well as with the International Histocompatibility Working Group (IHWG) Web site, and provides links to the IMmunoGeneTics HLA (IMGT/HLA) database.
This chapter provides a detailed description of current dbMHC resources and reviews dbMHC content and data computation protocols.
The most important known function of HLA genes is the presentation of processed peptides (antigens) to T cells, although there are many genes in the HLA region yet to be characterized, and work continues to locate and analyze new genes in and around the MHC (
The staff of the MHC database are currently accepting online submissions for the Primer/Probe/Mix component of dbMHC. Submissions can include typing data from any of the following: Sequence Specific Oligonucleotides (SSOs), Sequence Specific Primers (SSPs), SSO and/or SSP mixes, HLA typing kits, and Sequencing Based Typing (SBT). dbMHC allows submitters to edit their submissions online at any time.
A user can access dbMHC resources as a guest, that is, without holding an account or being a member of one. However, dbMHC guests will be unable to submit data to dbMHC or edit existing data. Data from a guest session will not be saved from session to session or even from frame to frame. Guests do have the option to download data from a particular session.
To create a dbMHC account, select “Create an Account” from the left sidebar of the dbMHC homepage. Here, you will provide institutional information and specify an account administrator. Only the account administrator is allowed to do the following:
Enter new users.
View or edit existing users.
Change user permissions. This includes permission to modify allele reactivities and to enter new primers/probes/mixes or typing kits and to modify existing ones.
Edit institutional information.
The
dbMHC can be used as a tool to help design primers/probes, to evaluate the reactivity patterns of potential probes, and to evaluate the polymorphism content of a particular gene.
Physical location
Number and length of known alleles
Allelic motifs
Informativity
Heterozygosity
Primer sequences
Users will find the markers shown on the dbMHCms Web page useful because they provide evidence for genetic linkage and/or association of a genomic region with disease susceptibility. This resource was developed by NCBI in collaboration with A. Foissac, M. Salhi, and Anne Cambon-Thomsen, who provided the original data on MHC microsatellites in a series of updates (
The
Submitted sequences are aligned to reference locus sequences located in the dbMHC allele database based on a user-defined degree of nucleotide mismatch. After a brief analysis, the SBT interface will display exons, introns, and untranslated regions for each sequence. Allele assignments are listed according to matching order. Mismatched nucleotide positions are listed separately (in the lower frame of the SBT interface) after selecting
This
Probe hybridization using Sequence Specific Oligonucleotides (SSOs).
DNA amplification using Sequence Specific Primers (SSPs).
Sequencing Based Typing (SBT) protocols.
The Primer/Probe database can be used as a reference resource for probe design and is intended to provide necessary tools for the exchange of DNA typing data based on the above technologies. This database is composed of a Primer/Probe Interface and a Typing Kit Interface. Each interface has a series of interactive frames that allow the user to access different database functions.
The Primer/Probe Interface provides information on individual reagents used for the typing of MHC or MHC-related loci. Users can access this information through the multiple functional frames within the interface. The words “primers” and “probes” represent SSOs, SSPs, SSP mixes, SSO mixes, and their combinations, which include nested PCR or SSO hybridization on group-specific amplifications. Nested PCR and group-specific amplification are both based on a primary amplification with an SSP mix.
The process of selecting a primer/probe begins by entering information in the upper frame of the Primer/Probe Interface page. Users enter search options for primers, probes, and mixes, which are grouped by type (SSO, SSP, SSO mix, and SSP mix), locus, and source (submitting institution). Primers/probes can also be selected by entering either the global or local name into the search field. Once this information is entered, the lower frame fills in with the Primer/Probe Listing. Primers/probes that are then selected by the user within the Primer/Probe Listing will be displayed in the large scroll box (in the upper frame) and can be downloaded in XML format.
Upon submission, dbMHC creates a unique global ID number for each primer/probe/mix/kit submitted. This global ID consists of three letters for the submitting institution and seven digits.
dbMHC uses the global ID number to store the entire history of a reagent. The global ID number will never change, but every time a user edits the sequence of a submitted primer/probe/mix/kit, that edit is given an incremental version number. Thus, all previous versions of a particular primer/probe/mix/kit remain accessible, and submitted primers/probes/mixes/kits can never be deleted.
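The global ID format and the append-only version history can be modeled in a few lines of Python. This is a sketch under assumptions, not the dbMHC schema: the class and function names are hypothetical, version numbers are assumed to start at 1, letters of either case are accepted, and the identifier `ABC0000001` in the usage note is invented.

```python
import re
from dataclasses import dataclass, field

# Three letters for the submitting institution followed by seven digits.
GLOBAL_ID_RE = re.compile(r"[A-Za-z]{3}[0-9]{7}")


def is_global_id(value):
    """Check that a string has the three-letter, seven-digit shape."""
    return GLOBAL_ID_RE.fullmatch(value) is not None


@dataclass
class Reagent:
    global_id: str
    versions: list = field(default_factory=list)  # oldest first, never removed

    def __post_init__(self):
        if not is_global_id(self.global_id):
            raise ValueError("malformed global ID: " + self.global_id)

    def edit(self, sequence):
        """Record an edited sequence and return its new version number."""
        self.versions.append(sequence)
        return len(self.versions)

    def sequence(self, version=None):
        """Return the sequence for a version (default: the latest one)."""
        index = len(self.versions) if version is None else version
        if not 1 <= index <= len(self.versions):
            raise LookupError("no such version: " + str(index))
        return self.versions[index - 1]
```

After `r = Reagent("ABC0000001")`, two calls to `r.edit(...)` leave both sequences retrievable through `r.sequence(1)` and `r.sequence(2)`; because nothing is ever removed from `versions`, earlier states stay accessible.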
A “local name” is the identifier given by submitters to their primers and probes.
This function is found in the lower frame of the Primer/Probe Interface page. Results of a Primer/Probe selection are displayed here and will include a unique global ID, local name, source, locus, and type for each primer/probe result. The Global Name (global ID) of each result provides a link to the Primer/Probe View or Mix View functions, and the check box associated with each primer/probe/mix can be used to generate a list of items in the Primer/Probe Interface for subsequent download in XML format.
The Primer/Probe View is located in the upper frame of the Primer/Probe Interface and appears as a result of selecting a global ID. This view will display the entire set of data of the SSP and/or SSO primers and/or probes that the user had selected in the Primer/Probe Interface. The data displayed include:
Local name
Locus as specified by the submitter
Global ID
Date of last change (“last modified”)
Probe orientation as specified by the submitter
Corresponding allele sequence (Allele Seq:)
Reagent type
Optional filter, i.e., pre-amplification
Version number
Probe sequence (Probe Seq:)
The annealing position of the 3' end of the probe as specified by the submitter
Probe stringency
Primer/Probe View offers links that enable a user to edit primer/probe information, change primer/probe stringency, and list the alleles detected by a probe together with a sequence alignment. A list of alleles detected by a primer/probe without sequence alignment is also available from Primer/Probe View. Select
The primer/probe edit function allows users to enter new primers/probes into dbMHC or to edit existing primer/probe data. All but the following three fields can be edited (these fields are set by dbMHC):
Global ID
Version number
Date of last change
Within the primer/probe edit frame, there is an alternative input field for primer/probe sequence called Allele Sequence. If primer or probe orientation is set to “reverse”, the Allele Sequence field will reverse complement the allele sequence as displayed in the alignment to generate the appropriate primer/probe sequence.
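The reverse complement operation is standard and can be written compactly in Python. The sketch below handles only the four unambiguous DNA bases; the function `probe_sequence` and its treatment of any orientation other than “reverse” as “leave unchanged” are assumptions for illustration.

```python
# Translation table for the four unambiguous DNA bases, both cases.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence):
    """Complement every base, then reverse the result."""
    return sequence.translate(COMPLEMENT)[::-1]


def probe_sequence(allele_sequence, orientation):
    """Derive the probe sequence from the allele sequence as aligned.

    Assumption: only the orientation "reverse" triggers the reverse
    complement; any other value returns the sequence unchanged.
    """
    if orientation == "reverse":
        return reverse_complement(allele_sequence)
    return allele_sequence
```

For example, the allele sequence `AAGT` entered with orientation “reverse” would give the probe sequence `ACTT`.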
Before submitting primer/probe data, a user should define the matching stringency of the primer or probe in the Annealing Stringency field. Once the user has selected the matching stringency, the probe can be submitted. Submission of a primer/probe triggers dbMHC to begin an allele reactivity calculation that is based on the primer/probe sequence and the selected matching stringency. The result of this calculation is a list of alleles that might be detected by the submitted primer/probe. The user will find this detectable allele list in the Reactive Allele Listing (lower frame).
Reactivity scores characterize the reactivity between a primer/probe and an allele. The scores are listed below:
Positive: probe anneals with allele.
Weak: probe anneals sometimes with allele.
Unclear: annealing cannot be predicted, no empirical information, allele has not been sequenced at the annealing position.
Negative: probe does not anneal.
dbMHC calculates primer/probe reactivity based on sequence similarity only and does not take into account laboratory experimental conditions such as magnesium concentration, temperature, etc.
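A sequence-similarity score of this kind might be sketched as follows. The details of the dbMHC calculation are not given here, so every specific in this Python example is an assumption: stringency is modeled as a maximum number of mismatches, an unsequenced position is marked with `*`, and the Weak score is left out because the text gives no rule for deriving it from sequence alone.

```python
def reactivity(probe, allele_region, max_mismatches=0, unknown="*"):
    """Score a probe against the allele sequence at its annealing position.

    Hypothetical rules: "unclear" if any position of the allele is
    unsequenced, "positive" if the mismatch count is within the allowed
    maximum, and "negative" otherwise.
    """
    if len(probe) != len(allele_region):
        raise ValueError("probe and allele region must have equal length")
    if unknown in allele_region:
        return "unclear"
    mismatches = sum(1 for p, a in zip(probe, allele_region) if p != a)
    return "positive" if mismatches <= max_mismatches else "negative"
```

Running such a function over every known allele of a locus would produce a list comparable in spirit to the Reactive Allele Listing, although the real calculation may weigh positions differently.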
A submitter's reactivity score can differ from the dbMHC's calculated reactivity score. If there is disparity between the submitter's score and the dbMHC score, users should regard the submitter's score as reliable. In cases of unexplainable disagreement, however, users should contact the submitting institution for further information.
The Reactive Allele Listing is located in the lower frame of the Primer/Probe Interface. It lists the alleles that might be detected by a selected primer/probe or mix (
A submitter can set individual allele reactivity scores either one-by-one or in a batch. The allele reactivity score for all alleles or for individual alleles that are unedited can be set the same as dbMHC's proposed allele reactivity score.
A user who sets the reactivity scores as a batch should be aware that alleles within the batch that are not currently displayed will be scored as well. If the user makes a mistake in the scoring,
Submitting allele reactivity scores will trigger a new dbMHC allele reactivity calculation. If the submitted primer/probe sequence is shorter than 10 nucleotides, dbMHC will use the score information of the alleles to extend the probe sequence.
Selecting
The alignment position of the 3' end of the probe is recorded for each allele. Both the sense and the reverse alignment positions are displayed if the alignment represents an SSP mix.
Only users with permission from an account administrator will be able to edit or add additional alleles to a Reactive Allele Listing for a particular primer/probe or mix. To edit the allele reactivity list, mark the “edit reactivity list” check box in the reactive allele alignment frame.
The Mix View function is located in the upper frame of the Primer/Probe Interface and can be accessed via the link on the global ID of the Primer/Probe Listing page.
It displays an entire set of mix data for selected SSO and SSP mixes:
Local name
Mix type
Locus as specified by the submitter
Optional filter, i.e., pre-amplification
Global ID
Version number
Date of last change
List of probes as mix elements
Mix stringency
The Mix View function contains links to the Mix Element function, where users can change or add elements of a mix. Users will also find links to the Reactive Allele Listing, where they may list alleles detected by a selected mix or selected individual elements of a mix that do not have a sequence alignment. Finally, users will find links to the reactive allele alignment function, where they can view alleles with sequence alignments that are detected by a selected mix or selected individual elements of a mix.
The Mix Edit function is located in the upper frame of the Primer/Probe Interface and allows users to enter new mixes or to edit existing mix data. Users can edit all but the following three fields (these fields are set by dbMHC):
Global ID
Version number
Date of last change
Annealing stringency defines the cumulative matching stringency of the elements of the mix. For SSP mixes, both the sense and the reverse primer must react with the allele with at least the defined stringency.
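The rule for SSP mixes amounts to a logical AND over the two primers. In the Python sketch below, stringency is modeled as the minimum fraction of matching bases; this measure, like the function names, is an assumption, since the scale dbMHC uses for stringency is not specified here.

```python
def match_fraction(primer, target):
    """Fraction of positions at which primer and target agree."""
    if not primer or len(primer) != len(target):
        raise ValueError("primer and target must be non-empty and equal in length")
    matches = sum(1 for p, t in zip(primer, target) if p == t)
    return matches / len(primer)


def ssp_mix_reacts(sense, sense_target, reverse, reverse_target, stringency):
    """An SSP mix reacts only if both primers reach the stringency."""
    return (match_fraction(sense, sense_target) >= stringency
            and match_fraction(reverse, reverse_target) >= stringency)
```

A perfect sense primer therefore cannot compensate for a reverse primer that falls below the threshold, which is the point of requiring both.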
The Mix Element function is located in the lower frame of the Primer/Probe Interface and allows users to add elements to a mix or edit existing elements of a mix. To use the Mix Element function, the user must first specify the mix to be altered in the Mix Edit function and define a source for the primers/probes that the user wants listed. The Mix Element function displays only SSO probes for SSO mixes; for SSP mixes, it displays SSP probes in a sense column and a reverse orientation column. For mixes that contain probes from different sources, users must enter the mix elements separately.
All typing kits are identified by a global ID that is created by dbMHC, which will store the entire history of changes made to a typing kit. An incremental version number is given to every editing session of a typing kit; thus, even if some or all elements of a kit were deleted or altered by a user during an editing session, previous versions of the kit and kit elements will still be available.
The
Use the Typing Kit Interface to enter search parameters for typing kits. Typing kits are grouped by:
Type: SSO, SSP
Locus
Source: the submitting institution
Typing kits can also be selected by typing either the global or local name into the search field.
Search results, based on user-selected parameters, will be displayed in the Typing Kit Listing, which appears in the lower frame of the Typing Kit Interface. Typing kits selected from the Typing Kit Listing will be displayed in the large scroll box (of the upper frame). They either can be used for combined pattern interpretation of multiple kits or downloaded in XML format.
The Typing Kit Listing is displayed in the lower frame of the Typing Kit Interface. On the basis of the criteria selected, kits are listed with their unique global ID, local name, source, locus, and type. The global ID for each kit provides a link to the Typing Kit View (upper frame of the Typing Kit Interface). The check boxes associated with each typing kit are used to generate a list of items in the Kit Select view for subsequent interpretation of a pattern or for download in XML format.
If a primer/probe within an existing typing kit malfunctions and a user creates a modified version of it, dbMHC considers the modified primer/probe to be an entirely new probe.
dbMHC also considers the kit that contains this modified primer/probe to be an entirely new kit. Therefore, when entering a sequence modification of a primer/probe within a kit, a submitter must create a new kit by using
The Typing Kit View (upper frame of the Typing Kit Interface) lists the entire set of primer/probe data for selected SSO and SSP kits. Typing kit probe data include:
- Local name
- Kit type
- Locus (with or without group-specific pre-amplifications)
- Global ID
- Version number
- Batch
If
Typing Kit Elements is displayed in the lower frame of the Typing Kit Interface and is used to display all of the elements (components) within a particular typing kit. This function will group the kit elements according to the kit locus groups and will display them as an ordered list. The global ID of each kit element serves as a link to either the Mix View or the Primer/Probe View of this element.
HLA typing kits are usually used to detect alleles in one or more loci. Within a typing kit, a particular locus and an optional pre-amplification define what is termed a “group”. Thus, one typing kit can contain several groups, with each group either consisting of the same locus and a different pre-amplification (or no pre-amplification) or consisting of different loci (with or without pre-amplification).
The Typing Kit Interface
The kit group function is located in the lower frame of the Typing Kit Interface and allows a user to add single elements to, or remove single elements from, a kit group. The kit group frame displays the elements for a particular locus selected by the user in the
Access to the kit edit function,
Access to the typing kit interpretation function,
The reactivity string or pattern entered will be interpreted as a heterozygote allele combination. Multiple kits can be combined. Users can set the degree of tolerance, which limits the number of false-positive or false-negative reactivities per locus. Each locus is analyzed separately. Allele assignments for each locus are listed according to the number of false-positive/false-negative reactivities.
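The interpretation step described above can be pictured with a small sketch. It is not dbMHC's implementation. It assumes that the expected pattern of a heterozygote is the union of the reactivities of its two alleles, and that the tolerance caps the combined number of false-positive and false-negative reactivities for the locus. The function name, the allele names, and the kit-element names are all hypothetical.

```python
from itertools import combinations_with_replacement


def interpret_pattern(observed, allele_reactivity, tolerance=0):
    """Rank allele pairs for one locus by how well they explain a pattern.

    observed          -- set of kit-element IDs that reacted
    allele_reactivity -- dict: allele name -> set of kit-element IDs it reacts with
    tolerance         -- assumed rule: maximum false positives + false negatives
    """
    results = []
    for a, b in combinations_with_replacement(sorted(allele_reactivity), 2):
        expected = allele_reactivity[a] | allele_reactivity[b]
        false_pos = len(observed - expected)  # reacted, but not predicted
        false_neg = len(expected - observed)  # predicted, but did not react
        total = false_pos + false_neg
        if total <= tolerance:
            results.append((total, a, b, false_pos, false_neg))
    # Best-fitting assignments (fewest discrepancies) come first.
    return sorted(results)


# Hypothetical reactivities, for illustration only.
reactivity = {
    "allele_1": {"p1", "p2"},
    "allele_2": {"p2", "p3"},
    "allele_3": {"p4"},
}
assignments = interpret_pattern({"p1", "p2", "p3"}, reactivity, tolerance=1)
```

With a tolerance of 0, only the pair that explains the pattern exactly is reported; raising the tolerance adds assignments that need one or more discrepancies, listed after the exact ones.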
Because of operating system limitations, the Alignment Viewer can only display up to 300 nucleotides in one line if the browser is Internet Explorer, whereas Netscape is not restricted by this limitation.
Several essential parts of dbMHC are based on JavaScript interaction and dynamic text generation within a page. Users must be aware that many browsers are unable to properly interpret and display JavaScript-generated text.
Netscape version 4.76 does not check for browser content size changes; therefore, users must manually resize to trigger the correct size recognition. Users may resize contents by using a “post” command, which will lead to a new download of the initial request instead of simply resizing the window.
Netscape 4.7 may sometimes cause fatal errors and does not allow users to copy and paste sequences from the alignment to the probe sequence field in the probe edit function.
Internet Explorer version 5.5 and Netscape version 6.2 will correctly interface with dbMHC.
Selecting
A “+” with a green background signals a detection of an allele by a kit element, a “w” with an orange background signals a weak detection, a “?” with a yellow background signals a lack of information, and an “r” with a red background signals an interaction that was originally suggested by the prediction algorithm but has since been rejected. If a kit is designed to detect only a subset of alleles, the display will be limited to this subset (see
Table name | Column name | Data type | Column comment |
---|---|---|---|
allele | allele_nr | int | Unique identifier for allele instance. |
allele | allele_id | int | Identifier for each allele. |
allele | allele_name | varchar(30) | Full allele name defined by IHWG. |
allele | allele_short | varchar(30) | Common use allele name. |
allele | allele_group | int | Group associated with allele. |
allele | compound_nr | int | Number identifying the compound of which the allele is a member. |
allele | locus_id | int | Number identifying the locus of which the allele is a member. |
allele | active | tinyint | Status labeling current active allele for this allele_id. |
allele | db_version_nr | int | IHWG database version to which this allele belongs. |
allele | user_id | int | Identifier of user who created/updated allele. |
allele | user_date | datetime | Date when record was created/updated. |
block | block_nr | int | Unique identifier for block instance. |
block | block_id | int | Identifier for each block. |
block | block_ord | int | Sequence order for each block. |
block | block_type | varchar(1) | Defines block type. |
block | locus_id | int | Identifier for locus of which block is a member. |
block | ref_pos | int | Start reference position. |
block | ref_length | int | Reference length. |
block | working_pos | int | Start working position. |
block | working_length | int | Working length. |
block | block_name | varchar(20) | Common name. |
block | active | tinyint | Status of row. |
block | db_version_nr | int | IHWG database version to which this block belongs. |
compound | locus_id | int | Locus identifier. |
compound | compound_nr | int | Compound number. |
job_trans | trans_nr | int | Unique identifier for transaction. |
job_trans | app_name | varchar(30) | Application name that processes transaction. |
job_trans | app_arg | varchar(30) | Arguments to processing application. |
job_trans | create_date | datetime | Time created. |
job_trans | start_date | datetime | Time application started processing transaction. |
job_trans | end_date | datetime | Time application completed processing transaction. |
job_trans | status | int | Current status of transaction. |
job_trans | priority | int | Defines priority level. |
job_trans | message | varchar(255) | Any message that the processing application wants to record. |
kit | kit_nr | int | Unique identifier for kit instance. |
kit | kit_id | int | Kit identifier. |
kit | kit_local_id | int | Submitter identifier. |
kit | version_nr | int | Version number of kit. |
kit | kit_name | varchar(30) | Submitter name. |
kit | kit_global_id | varchar(30) | NCBI-defined name. |
kit | kit_batch | varchar(30) | Submitter batch. |
kit | type | int | Type. |
kit | active | tinyint | Status labeling current active kit for this kit_id. |
kit | source_id | int | Source identifier. |
kit | user_id | int | User identifier. |
kit | user_date | datetime | Date created/updated. |
kit_locus | kit_locus_nr | int | Unique identifier for kit locus instance. |
kit_locus | kit_nr | int | |
kit_locus | kit_id | int | Kit identifier. |
kit_locus | locus_id | int | Locus identifier. |
kit_locus | locus_order_nr | int | Order for loci to be used. |
kit_locus | filter_id | int | Filter identifier. |
kit_locus | tube_id | int | Tube identifier. |
kit_locus | tube_order_nr | int | Order for tube to be used. |
kit_locus | active | tinyint | Status labeling current active kit locus for this kit id. |
kit_locus | user_id | int | User identifier. |
kit_locus | user_date | datetime | Date created/updated. |
locus | locus_id | int | Identifier for locus. |
locus | locus_NCBI_id | varchar(30) | NCBI locus identifier. |
locus | locus_name | varchar(30) | Common name. |
locus | display_id | int | Identifier of allele that should be used as the display reference. |
locus | ref_id | int | Identifier of reference allele. |
locus | locus_MIM_id | int | MIM locus identifier. |
locus | locus_pub_id | int | PubMed locus identifier. |
locus | display_order | int | Order number for display. |
mhc_snp | locus_id | int | Identifier for locus. |
mhc_snp | ref_pos | int | Reference position. |
mhc_snp | ref_offset | int | Reference offset. |
mhc_snp | subsnp_id | int | dbSNP's ss number. |
mhc_snp | snp_length | int | SNP's length. |
pos_conv | compound_nr | int | Compound number. |
pos_conv | working_pos | int | Working position. |
pos_conv | ref_pos | int | Reference position. |
pos_conv | ref_offset | int | Reference offset. |
pos_conv | blast_pos | int | BLAST position. |
probe_working | probe_working_nr | int | Unique identifier for each row. |
probe_working | tube_nr | int | Unique identifier for tube instance. |
probe_working | probe_id | int | Tube (probe) identifier. |
probe_working | user_id | int | User identifier. |
probe_working | user_date | datetime | Date created/updated. |
probe_working | active | tinyint | Active status. |
probe_working | sequence | varchar(255) | Working sequence. |
sequence | seq_type | int | Sequence type. |
sequence | seq_nr | int | Sequence number. |
sequence | seq_ord | int | Sequence order. |
sequence | sequence | varchar(255) | Character sequence. |
session | cur_session_nr | int | Current session number. |
session | next_session_nr | int | Next available session number. |
session | user_id | int | User identifier. |
session | create_date | datetime | Date created/updated. |
source | source_nr | int | Unique identifier for source instance. |
source | source_id | int | Source identifier. |
source | source_code | varchar(3) | Source code for use in naming tubes and kits. |
source | institution | varchar(50) | Source name. |
source | admin_id | int | User identifier for administrator for source. |
source | address | varchar(50) | Address. |
source | city | varchar(50) | City. |
source | state | varchar(50) | State. |
source | postal_code | varchar(20) | Postal code. |
source | country | varchar(50) | Country. |
source | phone | varchar(30) | Phone number. |
source | fax | varchar(30) | Fax number. |
source | | varchar(50) | Email address. |
source | active | tinyint | Status labeling current active source for this source id. |
source | user_id | int | User identifier. |
source | user_date | datetime | Date created/updated. |
source | options | int | Not used. |
source | info | varchar(255) | Comments/general information. |
tube | tube_nr | int | Unique identifier for tube instance. |
tube | tube_id | int | Tube identifier. |
tube | tube_name | varchar(30) | Submitter name. |
tube | tube_global_id | varchar(30) | NCBI-defined name. |
tube | filter_id | int | Filter identifier. |
tube | source_id | int | Source identifier. |
tube | source_date | datetime | Not used. |
tube | type | int | Type. |
tube | sense | tinyint | Orientation. |
tube | locus_id | int | Locus identifier. |
tube | ref_pos | int | Submitted reference position. |
tube | ref_offset | int | Submitted reference offset. |
tube | stringency | int | Stringency set for annealing results. |
tube | active | tinyint | Status labeling current active tube for this tube_id. |
tube | user_id | int | User identifier. |
tube | user_date | datetime | Date created/updated. |
tube | recalc_date | datetime | Not used. |
tube | db_version | int | IHWG database version to which this tube belongs. |
tube | info | varchar(255) | Comments/general information. |
tube | sequence | varchar(255) | Submitted sequence. |
tube | source_nr | int | |
tube_allele | tube_allele_nr | int | Unique identifier for tube allele instance. |
tube_allele | tube_id | int | Tube identifier. |
tube_allele | locus_id | int | Locus identifier. |
tube_allele | allele_id | int | Allele identifier. |
tube_allele | user_status | int | User status. |
tube_allele | system_status | int | System status. |
tube_allele | ref_pos | int | Reference position. |
tube_allele | ref_offset | int | Reference offset. |
tube_allele | block_nr | int | Not used. |
tube_allele | score | real | Annealing score. |
tube_allele | working_pos | int | Working position. |
tube_allele | for_tube_id | int | Forward tube identifier. |
tube_allele | for_working_pos | int | Forward working position. |
tube_allele | rev_tube_id | int | Reverse tube identifier. |
tube_allele | rev_working_pos | int | Reverse working position. |
tube_allele | sense | tinyint | Orientation. |
tube_allele | seq_known | int | Sequence known (sequence fully defined). |
tube_allele | active | tinyint | Status labeling current active tube allele for this tube_id allele_id combination. |
tube_allele | user_id | int | User identifier. |
tube_allele | user_date | datetime | Date created/updated. |
tube_probe | tube_probe_nr | int | Unique identifier for tube probe instance. |
tube_probe | tube_nr | int | Unique identifier for tube instance. |
tube_probe | probe_id | int | Tube identifier for sub tube in mix. |
tube_probe | tube_id | int | Tube identifier. |
tube_probe | active | tinyint | Status labeling current active tube probe for this tube_id probe_id combination. |
tube_probe | user_id | int | User identifier. |
tube_probe | user_date | datetime | Date created/updated. |
users | user_nr | int | Unique identifier for user instance. |
users | user_id | int | User identifier. |
users | first_name | varchar(30) | First name. |
users | last_name | varchar(30) | Last name. |
users | login | varchar(30) | Login. |
users | pwd | varchar(30) | Password. |
users | source_id | int | Source identifier. |
users | phone | varchar(30) | Phone number. |
users | fax | varchar(30) | Fax number. |
users | | varchar(50) | Email address. |
users | create_probe | tinyint | Permission to create probe. |
users | create_kit | tinyint | Permission to create kit. |
users | modify_probe | tinyint | Permission to modify probe. |
users | modify_react | tinyint | Permission to modify reactivity. |
users | modify_kit | tinyint | Permission to modify kit. |
users | info | varchar(255) | Comments/general information. |
users | active | tinyint | Status labeling current active user for this user_id. |
users | createdby_id | int | Created by identifier (another user_id). |
users | created_date | datetime | Date created/updated. |
users | source_admin | tinyint | Is user_id an administrator for source? |
The data submitted to dbMHC are stored in a Microsoft SQL Server (MSSQL) relational database.
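The way the schema keeps every version of a kit while flagging one as current can be illustrated with a small query sketch. It uses Python's built-in `sqlite3` module purely as a stand-in for the production database, reproduces only a few columns of the `kit` table, and assumes that `active = 1` marks the current version of a `kit_id`; the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE kit (
           kit_nr     INTEGER PRIMARY KEY,  -- unique identifier for kit instance
           kit_id     INTEGER,              -- kit identifier shared by all versions
           version_nr INTEGER,              -- version number of kit
           kit_name   TEXT,                 -- submitter name
           active     INTEGER               -- assumed: 1 = current version
       )"""
)
# Invented sample data: kit 10 has been edited once, kit 11 never.
conn.executemany(
    "INSERT INTO kit VALUES (?, ?, ?, ?, ?)",
    [
        (1, 10, 1, "demo kit", 0),
        (2, 10, 2, "demo kit", 1),
        (3, 11, 1, "other kit", 1),
    ],
)


def current_version(kit_id):
    """Return (kit_nr, version_nr) of the active row for a kit, or None."""
    return conn.execute(
        "SELECT kit_nr, version_nr FROM kit WHERE kit_id = ? AND active = 1",
        (kit_id,),
    ).fetchone()


def history(kit_id):
    """Return every stored version number of a kit, oldest first."""
    rows = conn.execute(
        "SELECT version_nr FROM kit WHERE kit_id = ? ORDER BY version_nr",
        (kit_id,),
    )
    return [row[0] for row in rows]
```

Because old rows are never deleted, `history(10)` still lists version 1 after version 2 has become the active row.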
The relationships between dbMHC database tables are depicted in
dbMHC's primer/probe database or interface is not curated. As such, the accuracy of the information presented in this database is dependent entirely upon the accuracy of the data submitted to dbMHC.
We suggest that each primer or probe submitted should be characterized by its complete sequence. In cases where submission of the complete sequence is impossible because the sequence information is considered proprietary, dbMHC offers the option to submit partial sequences. Submitted primer/probe specifications should comply with American Society for Histocompatibility and Immunogenetics (ASHI) (
The allele reactivity prediction tool for primers and probes available through dbMHC presents calculations based solely on sequence similarity and can therefore offer only suggestions for the possible allele reactivities of individual primers and probes. Users should not consider this tool to be 100% accurate. A reliable prediction of primer/probe allele reactivities requires complete sequence information, as well as reaction data that include annealing temperature and magnesium concentration. Because these data are not consistently available to dbMHC, we are unable to offer more than a suggestion for primer/probe allele reactivity.
dbMHC uses sequence comparisons to create a sequence interaction match or stringency grade that is compiled within a penalizing system. dbMHC's starting point for grading the interaction between a primer/probe and an allele is 100%, with each difference in sequence between the primer/probe and the allele causing a reduction in the remaining match grade by a certain percentage. The primary factors dbMHC uses to compute stringency grading include:
- Nucleotide differences
- Nucleotide position and primer/probe type
Probe | DNA template A | DNA template T | DNA template G | DNA template C |
---|---|---|---|---|
A | | | | |
T | | | | |
G | | | | |
C | | | | |
Penalties calculated according to Peyret et al. (
dbMHC divides nucleotide interactions into five categories: perfect match, high match, medium match, poor match, and no match. dbMHC defines these categories by using the purine and pyrimidine interaction between the allele and the primer/probe, as well as the number of hydrogen bonds affected during the virtual pairing of the allele with the primer/probe. See
5' | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 3' |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Primer | T | T | T | C | T | T | C | A | C | | T | C | C | G | T | G | T | C | |
DNA template | T | T | T | C | T | T | C | A | C | | T | C | C | G | T | G | T | C | |
An 18-mer primer anneals to a mismatched DNA. Primer and template are shown in the sense orientation. The substitution of C with A leads to the mismatch C-T with a penalty of 0.94 (refer to
This position has a 6% influence on the overall probe reactivity.
The overall probe score is 1 − (0.06 × 0.94) = 0.94.
| | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Probe | T | T | T | C | T | | C | A | C | A | T | C | C | G | T | G | T | C | C | C | |
DNA template | T | T | T | C | T | | C | A | C | A | T | C | C | G | T | G | T | C | C | C | |
A 20-mer SSO anneals to a mismatched DNA. Both probe and template are shown in the sense orientation. The substitution of T with C leads to the probe-template mismatch T-G with a penalty of 0.42 (refer to
This position has a 70% influence on the overall probe reactivity.
The overall probe score is 1 − (0.7 × 0.42) = 0.71.
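The arithmetic in the two captions above can be reproduced in a few lines. This is a minimal sketch rather than dbMHC's code: the function name `match_grade` is hypothetical, and combining several mismatches by multiplying the remaining grade is an assumption, based on the statement that each difference reduces the remaining match grade by a certain percentage.

```python
def match_grade(mismatches):
    """Return the match grade (1.0 = perfect match) after applying mismatches.

    mismatches -- iterable of (position_influence, mismatch_penalty) pairs,
                  each value expressed as a fraction between 0 and 1.
    """
    grade = 1.0
    for influence, penalty in mismatches:
        # Each mismatch removes influence * penalty of the grade that is left.
        grade *= 1.0 - influence * penalty
    return grade


# Values taken from the two captions: (position influence, mismatch penalty).
ssp_example = match_grade([(0.06, 0.94)])  # 18-mer primer, C-T mismatch
sso_example = match_grade([(0.70, 0.42)])  # 20-mer SSO, T-G mismatch
```

Rounded to two decimal places, the two results are 0.94 and 0.71, which agree with the captions. A worst-case mismatch (penalty 1.0) at a position with full influence drives the grade to 0.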
dbMHC uses a penalty system based on nucleotide position in its calculation of SSO/SSP interactions with different alleles. The system dbMHC uses indicates the extent to which an individual nucleotide position will affect the interaction stringency grade in a worst-case scenario, where guanines are in opposing sequence positions or cytosines are in opposing sequence positions. The maximum penalty given in the system is in the middle of an SSO and at the 3' end of an SSP, as shown in
If the initial interaction stringency grade for a particular primer/probe–allele interaction is 100% and a mismatch occurs in a position that carries a penalty of 100%, then dbMHC will reduce the interaction stringency grade to 0%. If the mismatch occurs in a position that carries a penalty of 95%, dbMHC will reduce the interaction stringency grade to 5%. A more detailed
dbMHC's reactive allele alignment algorithm searches for sequences with a match grade above the stringency level set by the submitter of each probe. Currently, the algorithm searches within all loci that are part of dbMHC. Within each locus, the algorithm constructs compound sequences that represent the combined polymorphic positions of alleles of the same length. Alleles with insertions or deletions are handled separately. If a primer or probe matches a certain position within the compound, every contributing allele is then checked at that position to determine whether it actually matches.
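The idea of a compound sequence can be sketched as follows. The sketch assumes that a compound is simply the set of bases observed at each position across equal-length alleles, and it uses exact matching in place of dbMHC's stringency grading; all function names and sequences are hypothetical.

```python
def build_compound(alleles):
    """Combine equal-length allele sequences into per-position base sets."""
    lengths = {len(seq) for seq in alleles.values()}
    if len(lengths) != 1:
        raise ValueError("alleles in one compound must have the same length")
    return [set(column) for column in zip(*alleles.values())]


def compound_hits(probe, compound):
    """Start positions where every probe base occurs in at least one allele."""
    return [
        start
        for start in range(len(compound) - len(probe) + 1)
        if all(base in compound[start + i] for i, base in enumerate(probe))
    ]


def reactive_alleles(probe, alleles):
    """For each compound hit, list the alleles that really match there."""
    compound = build_compound(alleles)
    hits = {}
    for start in compound_hits(probe, compound):
        hits[start] = sorted(
            name
            for name, seq in alleles.items()
            if seq[start:start + len(probe)] == probe
        )
    return hits


# Invented sequences, for illustration only.
alleles = {
    "allele_1": "ACGTAC",
    "allele_2": "ACCTAC",
    "allele_3": "ACGTTC",
}
result = reactive_alleles("CCT", alleles)
```

Searching the compound is a cheap first pass; a hit there only means that each base is seen in some allele, so the per-allele check is still needed. For example, the probe `CTT` matches the compound at position 2 although no single allele contains it.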
The reactive allele alignment algorithm will check within a specified locus for sequences that have a match grade in accordance with primer/probe specifications at a user-indicated position. The algorithm then extends SSPs toward the 5' end of the probe, to a maximum of 10 nucleotides, and extends both sides of SSOs a maximum of 15 nucleotides, observing all polymorphic positions in matching alleles. The resulting probe extension is then used by the alignment algorithm to check for cross-reactivities within other loci.
If the submitted primer/probe sequence is shorter than 10 nucleotides and the user submits a score that accepts certain alleles and rejects others, the reactive allele alignment algorithm will use this information to refine the probe extension. If the probe submitter rejects all alleles containing a unique sequence motif in the vicinity of the probe sequence, the algorithm will generate the extended probe sequence such that it does not match the unique sequence motif.
dbMHC provides a set of internal links to other NCBI resource homepages, either through the black “quick link” bar located at the top of the dbMHC homepage or through the rotating HLA molecule located at the top left-hand corner of the dbMHC homepage. These links will take users to the following NCBI resources:
- PubMed
- Nucleotide
- Protein
- OMIM
- BLAST
- Molecular Modeling database (MMdb)
The dbMHC Alignment Viewer provides locus-level linkage to NCBI's LocusLink and to dbSNP at the individual SNP level.
The Graphic View page is a hyperlink-enabled representation of the MHC region on chromosome 6. Locus-level links from the Graphic View page include:
- dbSNP (at the SNP haplotype level)
- Map View
- LocusLink
- OMIM
- Nucleotide
- Protein
- PubMed
- Structure
- Books
The header and link column sections of dbMHC's Graphic View are the same as those on the main page. The content section of this page contains a selection box, called Choose Linked Resource, and three horizontally arranged sections, called Chromosome 6, HLA Class I, and HLA Class II. Genes listed in the HLA Class II section act as hyperlinks to the Web resource selected in Choose Linked Resource.
Currently, dbMHC provides links (from the left sidebar) to the following external sites:
Sequin is a stand-alone sequence record editor, designed for preparing new sequences for submission to GenBank and for editing existing records. Sequin runs on the most popular computer platforms found in biology laboratories, including PC, Macintosh, UNIX, and Linux. It can handle a wide range of sequence lengths and complexities, including entire chromosomes and large datasets from population or phylogenetic studies. Sequin is also used within NCBI by the GenBank and Reference Sequence indexers for routine processing of records before their release.
Sequin has a modular construction, which simplifies its use, design, and implementation. Because Sequin relies on many components of the NCBI Toolkit, it also serves as a quality assurance check that these components are working properly.
Detailed information on how to use Sequin to submit records to GenBank or edit sequence records can be found in the
As input, Sequin takes one or more biological sequences from a scientist who wants to submit or edit sequence data. The sequence (or set of sequences) can be new information that has not yet been assigned a GenBank Accession number, or it can be an existing GenBank sequence record. If Sequin is being used to submit sequences to GenBank, the scientist is prompted at the start of the submission process to provide contact information, information about other authors, and the sequence. Once all the necessary information has been entered, the sequence can be viewed in a variety of displays and edited using Sequin's suite of editing tools.
Sequin is designed for use by people with different levels of expertise. Thus, it has several built-in functions that can, for example, ensure that a new user submits a valid sequence record to GenBank, or it can be prompted to automatically generate a sequence
Sequin's versatility is based on its design: (
Sequin is used to edit and submit sequences to GenBank and handles a wide range of sequence lengths and complexities. After
Sequin expects sequence data in
A sequence in FASTA format consists of a definition line, which starts with a “>”, and the sequence itself, which starts on a new line directly below the definition line. The definition line should contain the name or identifier of the sequence but may also include other useful information. In the case of nucleotides, the name of the source organism and strain should be included; for proteins, it is useful to include the gene and protein names. Given all this information, Sequin can automatically assemble a record suitable for inclusion in GenBank (see below). Detailed information on how to prepare FASTA files for Sequin can be found in the
For single nucleotide sequence submissions to GenBank, the submitter supplies Sequin with the nucleotide sequence and any translated protein sequence(s). For example, a submission consisting of a nucleotide from mouse strain BALB/c that contains the β-hemoglobin gene, encoding the adult major chain β-hemoglobin protein, would have two sequences with the following definition lines, where “BALB23g” and “BALB23p” are nucleotide and protein IDs provided by the submitter:
The organism name is essential to make a legal GenBank
Although it is not necessary to include a protein translation with the nucleotide submission, scientists are strongly encouraged to do so because this, along with the source organism information, enables Sequin to automatically calculate the coding region (
A segmented nucleotide entry is a set of non-contiguous sequences that has a defined order and orientation. For example, a genomic DNA segmented set could include encoding exons along with fragments of their flanking introns. An example of an mRNA segmented pair of records would be the 5′ and 3′ ends of an mRNA where the middle region has not been sequenced. To import nucleotides in a segmented set, each individual sequence must be in FASTA format with an appropriate definition line, and all sequences should be in the same file.
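A file that holds several FASTA records, as required for a segmented set, can be read with a short routine like the one below. It is a generic FASTA reader, not Sequin's parser, and it does not interpret any Sequin-specific information in the definition line; the identifiers `seg1` and `seg2` and the sequences are invented.

```python
def read_fasta(lines):
    """Yield (definition_line, sequence) pairs from FASTA-formatted lines."""
    defline, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines between records
        if line.startswith(">"):
            if defline is not None:
                yield defline, "".join(chunks)
            defline, chunks = line[1:], []
        elif defline is None:
            raise ValueError("sequence data found before a definition line")
        else:
            chunks.append(line)  # a sequence may span several lines
    if defline is not None:
        yield defline, "".join(chunks)


example = """\
>seg1 first segment (hypothetical)
ACGTACGT
ACGT
>seg2 second segment (hypothetical)
TTGACA
"""
records = list(read_fasta(example.splitlines()))
```

The order of the records in the file is preserved, which matters for a segmented set because the segments have a defined order.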
Genome Sequencing Centers use automated sequencing machines to rapidly produce large quantities of “unfinished” DNA sequence, called high-throughput genomic sequence (
The sequencing machines produce intensity traces for the four fluorescent dyes that correspond to the four bases adenine, cytosine, guanine, and thymine. Software such as
Some Genome Centers now analyze their sequences and record the base positions of a number of sequence features such as the gene, mRNA, or coding regions. Sequin can capture this information and include it in a GenBank submission as long as it is formatted correctly in a
Population, phylogenetic, and mutation studies all involve the alignment of a number of sequences with each other so that regions of sequence similarity are emphasized. Sometimes it is necessary to introduce gaps into the sequences to give the best alignment. Sequin reads several output formats from sequence and phylogenetic analysis programs, including PHYLIP, NEXUS, PAUP, and FASTA+GAP.
The submitted sequence alignment represents the relationship between sequences. This inferred relationship allows Sequin to propagate features annotated on one sequence to the equivalent positions on the remaining sequences in the alignment. Feature propagation is one of the many editing functions possible in Sequin. Using this tool significantly reduces the time required to annotate an alignment submission.
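Feature propagation amounts to mapping coordinates from one sequence to another through the shared alignment columns. The sketch below shows that mapping for a single interval. It is an illustration under simple assumptions (gaps written as `-`, 0-based inclusive coordinates, and an interval trimmed to the aligned part of the target), not Sequin's own routine; the function names and sequences are hypothetical.

```python
def column_maps(aligned):
    """Return (seq_to_col, col_to_seq) for one gapped sequence ('-' = gap)."""
    seq_to_col, col_to_seq = [], {}
    for col, char in enumerate(aligned):
        if char != "-":
            col_to_seq[col] = len(seq_to_col)
            seq_to_col.append(col)
    return seq_to_col, col_to_seq


def propagate(feature, source_aligned, target_aligned):
    """Map a (start, stop) interval from the source to the target sequence.

    Returns None if the whole interval falls in a gap of the target.
    """
    start, stop = feature
    src_to_col, _ = column_maps(source_aligned)
    _, col_to_tgt = column_maps(target_aligned)
    columns = range(src_to_col[start], src_to_col[stop] + 1)
    mapped = [col_to_tgt[c] for c in columns if c in col_to_tgt]
    return (mapped[0], mapped[-1]) if mapped else None


# Two rows of an invented alignment.
source = "ATGC--GTTA"
target = "A-GCAAGTTA"
mapped_interval = propagate((2, 5), source, target)
```

Here the interval 2 to 5 on the source spans a gap in the source row, so it maps to the longer interval 1 to 6 on the target.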
tbl2asn is a program that automates the submission of sequence records to GenBank. It uses many of the same functions as Sequin but is driven entirely by data files, and records need no additional manual editing before submission. Entire genomes, consisting of many chromosomes with feature annotation, can be processed in seconds using this method.
Sequences given to Sequin in the input data formats described in this chapter are retained within Sequin memory, allowing them to be manipulated in real time. For example, for submission to GenBank, the sequence is transformed from the Sequin internal structure to Abstract Syntax Notation 1 (
Most sequence submissions are packaged into a BioseqSet, which contains one or more sequences (Bioseqs), along with supporting information that has been included by the submitter, such as source organism, type of molecule, sequence length, and so on.
The display can be understood as a Venn diagram. Selecting the up or down arrow expands or contracts, respectively, the level of detail shown. In a typical submission of a protein-coding gene, a BioseqSet (of class “nuc-prot”) contains two Bioseqs, one for the nucleotide and one for the protein. Descriptors, such as BioSrc, can be packaged on the set and thus apply to all Bioseqs within the set. Features allow annotation on specific regions of a sequence. For example, the CDS location provides instructions to translate the DNA sequence into the protein product.
Features are usually packaged on the sequence indicated by their location. For example, the gene feature is packaged on the nucleotide Bioseq, and a protein feature is packaged on the protein Bioseq. Proteins are real sequences, and features such as mature peptides are annotated on the proteins in protein coordinates (although they can be mapped to nucleotide coordinates for display in a GenBank flatfile). A CDS (coding region) feature location points to the nucleotide, but the feature product points to the protein. For historical reasons, the CDS is usually packaged on the nuc-prot set instead of on the nucleotide sequence.
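The packaging rules above can be made concrete with a toy data model. The Python classes below are purely illustrative and are not the NCBI data model or its ASN.1 definitions; the field names, identifiers, and sequences are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Feature:
    kind: str                      # e.g. "gene", "CDS", "protein"
    location: tuple                # (seq_id, start, stop) the feature sits on
    product: Optional[str] = None  # seq_id of the product sequence, if any


@dataclass
class Bioseq:
    seq_id: str
    molecule: str                  # "nucleotide" or "protein"
    sequence: str
    features: List[Feature] = field(default_factory=list)


@dataclass
class BioseqSet:
    set_class: str                                   # e.g. "nuc-prot"
    descriptors: dict = field(default_factory=dict)  # apply to every member
    bioseqs: List[Bioseq] = field(default_factory=list)
    features: List[Feature] = field(default_factory=list)


nuc = Bioseq("nuc1", "nucleotide", "ATGGCCTAA")
prot = Bioseq("prot1", "protein", "MA")
nuc.features.append(Feature("gene", ("nuc1", 0, 8)))       # on the nucleotide
prot.features.append(Feature("protein", ("prot1", 0, 1)))  # protein coordinates

record = BioseqSet(
    "nuc-prot",
    descriptors={"BioSrc": "hypothetical organism"},
    bioseqs=[nuc, prot],
    # The CDS is packaged on the set: its location points to the nucleotide,
    # its product points to the protein.
    features=[Feature("CDS", ("nuc1", 0, 8), product="prot1")],
)
```

Keeping the descriptor on the set means it is stated once and applies to both the nucleotide and the protein.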
Format | Notes |
---|---|
Alignment | For sets of aligned sequences |
ASN.1 | Abstract Syntax Notation 1 format |
Desktop | The internal structure of a record |
EMBL | As the record would appear in the EMBL database |
FASTA | FASTA format |
GBSeq | XML structured representation of GenBank format |
GenBank | As the record would appear in GenBank or DDBJ |
GenPept | Flatfile view of a protein |
Graphic | Graphical representation of the sequence (several styles are available) |
Quality | Displays the quality scores for each base in biological order |
Sequence | Nucleotide sequence as letters plus any annotated features |
Summary | Similar to graphical view but with no labels |
Table | Sequin's 5-column feature table format |
XML | eXtensible Markup Language, representation of ASN.1 data |
After the record has been constructed, the features can be viewed in a variety of display formats (
The different format generators all work independently from one another. When Sequin starts up, it registers a set of function procedures used to generate each display format. As the user issues Sequin commands while manipulating the sequence, appropriate messages (for example, “generate the view from the internal sequence record”, “highlight this feature”, “export the view to the clipboard”, etc.) are sent to the viewer by calling one of these procedures. Separate lists of registered formats are maintained for nucleotide, protein, and genome record types.
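This registration scheme resembles a common registry pattern, sketched below. The sketch only illustrates the idea of independent generators that are looked up by record type and format name; it does not reflect how Sequin or the NCBI Toolkit is actually coded, and every name in it is hypothetical.

```python
from collections import defaultdict

# record type -> {format name -> generator function}
_registry = defaultdict(dict)


def register_format(record_type, name):
    """Decorator that registers a display generator for one record type."""
    def decorator(func):
        _registry[record_type][name] = func
        return func
    return decorator


@register_format("nucleotide", "FASTA")
def nucleotide_fasta(record):
    return ">{}\n{}".format(record["id"], record["sequence"])


@register_format("nucleotide", "Summary")
def nucleotide_summary(record):
    return "{}: {} bases".format(record["id"], len(record["sequence"]))


@register_format("protein", "FASTA")
def protein_fasta(record):
    return ">{}\n{}".format(record["id"], record["sequence"])


def generate_view(record_type, name, record):
    """Look up a generator by name and call it; generators never call each other."""
    return _registry[record_type][name](record)


def available_formats(record_type):
    """Return the formats registered for one record type."""
    return sorted(_registry[record_type])
```

Adding a new display format then only requires registering one more function; no existing generator has to change.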
Just as the different format generators do not need to know about each other, Sequin's viewer windows do not need to know about other Sequin viewer or editor windows that are active at the same time. When editing a sequence, the user may have several different views of the same sequence open at the same time (for example, a GenBank flatfile and a graphical view). Clicking on a feature in the graphical view will select the same feature in the GenBank flatfile, and double-clicking on a feature launches the specific editor for that feature. This type of communication between different windows is orchestrated by the NCBI Toolkit's object manager.
The sequence editor is used like a text editor, with new sequence added at the position of the cursor. Furthermore, the sequence editor automatically adjusts the biological feature intervals as editing proceeds. For example, if 60 bases are pasted or typed onto the 5′ end of a sequence record, the sequence editor will shift all the features by 60 bases. This means that interval correction does not need to be done by hand. Prior to Sequin, it was usually easier to resubmit from scratch than to edit all of the feature intervals manually.
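The interval adjustment can be sketched in a few lines. The example assumes 0-based, inclusive feature coordinates and assumes that an insertion inside a feature lengthens that feature; the function name and the sample sequence are hypothetical, and Sequin's own rules may differ.

```python
def insert_bases(sequence, features, position, new_bases):
    """Insert new_bases before `position` and adjust feature intervals.

    features -- list of (name, start, stop) with 0-based inclusive coordinates.
    Returns the edited sequence and the adjusted feature list.
    """
    shift = len(new_bases)
    edited = sequence[:position] + new_bases + sequence[position:]
    adjusted = []
    for name, start, stop in features:
        if position <= start:
            # Insertion lies upstream of the feature: move the whole interval.
            start, stop = start + shift, stop + shift
        elif position <= stop:
            # Insertion lies inside the feature: assumed to extend it.
            stop += shift
        adjusted.append((name, start, stop))
    return edited, adjusted


# Pasting 60 bases onto the 5' end shifts every feature by 60.
edited, shifted = insert_bases("ATGAAATAA", [("CDS", 0, 8)], 0, "N" * 60)
```

In this example the CDS moves from 0..8 to 60..68, matching the 60-base shift described in the text.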
Feature editor windows have a common structure, organized by tabs (Figures 2–4). The first tab is for elements specific to the given feature (
Sequin combines several NCBI Toolkit functions to perform many useful computations on the data in Sequin's memory.
When a nucleotide is submitted to GenBank using Sequin, it is essential to give the name of the source organism. The submitter is also strongly encouraged to supply the translated protein sequence(s) for the nucleotide.
Supplying the organism name allows Sequin to automatically find and use the correct genetic code for translating the nucleotide sequence to protein for the most frequently sequenced organisms. On the basis of only the genetic code and sequences of the nucleotide and protein products, Sequin will then calculate the location of the protein-coding region(s) on the nucleotide sequence. This is an extremely powerful function of Sequin. The ability to do this automatically, instead of by hand, has made sequence submission much faster and less error-prone.
Sequin uses a reverse translating alignment algorithm, called Suggest Intervals, to locate the protein-coding region(s) on the nucleotide sequence. The algorithm builds a table of the positions of all possible stretches of three amino acids in the protein. It then translates the nucleotide in all six reading frames and searches for a match to one of these triplets. When it finds one, it attempts to extend the match on each side of the initial hit. If the extension hits a mismatch or an intron, it stops. Given these candidate regions of matching, Sequin then tries to find the best set of other identical regions that will generate a complete protein. While doing this, the algorithm takes
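The seeding and extension steps can be illustrated with a simplified sketch. It is not the Suggest Intervals code: it uses only the standard genetic code, expects uppercase DNA, stops each extension at the first mismatch, and does not choose a best set of regions or refine splice boundaries. All names and sequences are hypothetical.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built from the conventional TCAG ordering.
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Yield (strand, frame_offset, translation) for all six reading frames."""
    reverse = dna.translate(COMPLEMENT)[::-1]
    for strand, seq in (("+", dna), ("-", reverse)):
        for offset in range(3):
            yield strand, offset, translate(seq[offset:])


def suggest_intervals(dna, protein, word=3):
    """Seed on shared tripeptides, then extend each seed until a mismatch.

    Returns sorted (strand, frame_offset, nuc_start, nuc_end, prot_start,
    prot_end) tuples; coordinates are 0-based, end-exclusive, and nucleotide
    coordinates refer to the strand that was translated.
    """
    index = {}
    for p in range(len(protein) - word + 1):
        index.setdefault(protein[p:p + word], []).append(p)
    hits = set()
    for strand, offset, frame in six_frames(dna):
        for f in range(len(frame) - word + 1):
            for p in index.get(frame[f:f + word], []):
                lo_f, lo_p = f, p
                while lo_f > 0 and lo_p > 0 and frame[lo_f - 1] == protein[lo_p - 1]:
                    lo_f, lo_p = lo_f - 1, lo_p - 1
                hi_f, hi_p = f + word, p + word
                while (hi_f < len(frame) and hi_p < len(protein)
                       and frame[hi_f] == protein[hi_p]):
                    hi_f, hi_p = hi_f + 1, hi_p + 1
                hits.add((strand, offset, offset + 3 * lo_f, offset + 3 * hi_f,
                          lo_p, hi_p))
    return sorted(hits)


# Invented gene: two exons separated by a 12-base intron.
dna = "ATGAAAACCGCC" + "GTAAGTTTTCAG" + "TACATCGCCAAGTAA"
candidates = suggest_intervals(dna, "MKTAYIAK")
```

For this toy gene the sketch reports two candidate regions on the forward strand, one per exon, each covering four residues of the protein.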
The ability to propagate features through an alignment and the way the sequence editor can adjust feature positions as the sequence is edited are combined in Sequin to provide a simple and automatic method for updating an existing sequence.
The final version of the sequence, complete with all the annotated features, can be checked using the validator. This function checks for consistency and for the presence of required information for submission to GenBank. The validator searches for missing organism information, incorrect coding-region lengths (compared to the submitted protein sequence), internal stop codons in coding regions, mismatched amino acids, and non-consensus splice sites. Double-clicking on an item in the error report launches an editor on the “offending” feature. The NCBI Toolkit has a program (testval), which is a stand-alone version of the validator.
The validator also checks for inconsistency between nucleotide and protein sequences, especially in coding regions, the protein product, and the protein feature on the product. For example, if the coding region is marked as incomplete at the 5′ end, the protein product and protein feature should be marked as incomplete at the amino end. (Unless told otherwise, the CDS editor will automatically synchronize these anomalies, facilitating the correction of this kind of inconsistency.)
Additional checks include ensuring that all features are annotated within the range of the sequence, all feature location intervals are noted on the same DNA strand, tRNA codons conform to the given genetic code, and that there are no duplicate features or different genes with the same names. The validator even checks that the sequence letters are valid for the indicated alphabet (e.g., the letter “E” may appear in proteins but not in nucleotides).
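A few of these checks are simple enough to sketch. The following is only an illustration and not the validator's implementation; the function names, the simplified alphabets, and the `(start, end, strand)` interval layout are assumptions.

```python
# Simplified alphabets (assumption): IUPAC nucleotide codes and common protein codes.
NUCLEOTIDE_LETTERS = set("ACGTUNRYKMSWBDHV")
PROTEIN_LETTERS = set("ACDEFGHIKLMNPQRSTVWYBZXU*")


def invalid_letters(sequence, alphabet):
    """Return the letters in sequence that are not part of alphabet."""
    return sorted(set(sequence.upper()) - alphabet)


def check_feature(seq_length, intervals):
    """Check a feature given as (start, end, strand) tuples, 0-based half-open."""
    errors = []
    for start, end, _ in intervals:
        if start < 0 or end > seq_length or start >= end:
            errors.append("interval out of range")
            break
    if len({strand for _, _, strand in intervals}) > 1:
        errors.append("intervals on different strands")
    return errors
```

For example, `invalid_letters("ACGTE", NUCLEOTIDE_LETTERS)` reports the letter "E", whereas the same letter passes the protein check.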
In cases where an exception has been flagged in a feature editor, specific validator tests can be disabled. For example, if the reason given for an exception is “RNA editing”, this turns off CDS translation checking in the validator. Likewise, “ribosomal slippage” disables exon splice checking, and “trans splicing” suppresses the error message that usually appears when feature intervals are indicated on different DNA strands.
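One way to picture this behavior is as a lookup from the exception text to the tests it suppresses. The sketch below is an assumption about how such a table could look; the test names are hypothetical and are not the validator's own identifiers.

```python
# Hypothetical test names; only the three exceptions named in the text are covered.
ALL_TESTS = {"cds_translation", "exon_splice", "mixed_strand"}

SUPPRESSED_BY_EXCEPTION = {
    "RNA editing": {"cds_translation"},
    "ribosomal slippage": {"exon_splice"},
    "trans splicing": {"mixed_strand"},
}


def tests_to_run(exception_text=None):
    """Return the tests that remain enabled for a feature with this exception."""
    return ALL_TESTS - SUPPRESSED_BY_EXCEPTION.get(exception_text, set())
```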
NCBI has a preferred format for the definition line in the GenBank format. It starts with the organism name, then the names of the protein products of coding regions (with the gene name in parentheses), with “complete” or “partial” at the end:
There is also a standard style for explaining alternative splice products. Sequin's Automatic Definition Line Generator collects CDS, RNA, and exon features in the order that they appear on the nucleotide sequence, finds the relevant genes (usually by location overlap), and prepares a definition line that conforms to GenBank policy.
In spite of all the safeguards built into Sequin, a submitter sometimes uses an incorrect genetic code for an organism. This means that the protein products of CDS translations may be incorrect. Sequin can retranslate all CDS features with a single command. Even so, if the sequence being edited is large, for example, a whole chromosome, this can be a time-consuming operation. To speed it up, the NCBI Toolkit uses a finite state machine (an efficient pattern search algorithm) for rapid translation. The machine is primed with a given genetic code, and then nucleotide sequence letters are fed into the algorithm one at a time, in the order they appear in the sequence. This allows all six frames (three frames on each strand) to be translated in the least possible time.
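The idea of feeding letters one at a time can be sketched as follows. This is a simplified illustration, not the Toolkit's implementation: the machine's state here is just the last two bases seen, the standard genetic code is assumed as the default, and the function name is hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def six_frame_translate(sequence, codon_table=CODON_TABLE):
    """Translate all six frames in a single pass over the sequence."""
    forward = ["", "", ""]
    reverse = ["", "", ""]
    state = ""  # the last two bases seen
    for i, base in enumerate(sequence.upper()):
        if len(state) == 2:
            codon = state + base
            frame = (i - 2) % 3  # frame is set by where the codon starts
            forward[frame] += codon_table.get(codon, "X")
            rc = "".join(COMPLEMENT.get(b, "N") for b in reversed(codon))
            reverse[frame] += codon_table.get(rc, "X")
        state = (state + base)[-2:]
    # Reverse-strand residues were collected in forward order, so flip them.
    return forward, [r[::-1] for r in reverse]
```

Each input letter completes one codon on each strand, so every base is examined once regardless of how many frames are wanted. The reverse frames are indexed by the forward offset of the codon, which is a choice made for this sketch.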
The Special Menu of Sequin encompasses a powerful set of tools that are available to GenBank and Reference Sequence indexers only. The Special Menu is not available to the public in the standard release because without a thorough understanding of the NCBI Data Model, use of the functions can cause irreparable damage to a record. It allows indexers to globally edit features, qualifiers, or descriptors in all sequences in a record, so that the same correction does not have to be made at each occurrence of the error. For example, all CDS features with internal stop codons can be converted to pseudogenes. Another common error made by submitters is to enter a repeat unit (a rpt_unit; e.g., ATTGG) in the repeat-type field (a rpt_type; e.g., tandem repeat). The Special Menu allows indexing staff to convert rpt_type to rpt_unit throughout the record.
Although Sequin has editors for changing specific fields and Special Menu functions for doing bulk changes on several features, it is not possible to anticipate all of the manipulations NCBI indexers might need to do to clean up a problem record. The NCBI Desktop window shows the internal structure of a record (
For technically adept Sequin users, the Desktop is where additional analysis functions can be added to Sequin without building a complicated user interface. With a feature or sequence selected, items in the Filter menu perform specific analyses on the selected objects. The standard filters include reverse, complement, and reverse complement of a sequence, and reverse complement of a sequence and all of its features. These are needed to repair the occasional record that came in on the wrong strand or in 3′→5′ direction. Adding new filter functions requires adding code to one of the Sequin source code files (one is provided with no other code in it for this purpose) and recompiling the program.
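The standard filters are easy to state precisely. The sketch below is written in Python rather than in Sequin's own source language, and the function names and the `(name, start, end, strand)` feature layout are assumptions for illustration.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse(seq):
    return seq[::-1]


def complement(seq):
    return seq.translate(COMPLEMENT)


def reverse_complement(seq):
    return complement(seq)[::-1]


def reverse_complement_with_features(seq, features):
    """Flip the sequence and remap (name, start, end, strand) features.

    Coordinates are 0-based half-open; the strand of every feature is swapped.
    """
    n = len(seq)
    flipped = [
        (name, n - end, n - start, "-" if strand == "+" else "+")
        for name, start, end, strand in features
    ]
    return reverse_complement(seq), flipped
```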
The functions of Sequin can be expanded by the addition of a configuration file that specifies the URLs for other programs (CGI scripts) available from the Internet. For example,
At a minimum, the CGI should be able to read FASTA format and should return either sequence data or the five-column feature table (discussed above) as its result. The external programs that Sequin knows about appear in the
As sequencing of complete bacterial genomes and eukaryotic chromosomes becomes more commonplace, the demand to break up long sequences into more manageable bites has increased (although Sequin is perfectly capable of editing these large records). Some genomes at NCBI are thus represented as segmented (or delta) Bioseqs, which are composed of pointers to other raw sequences in GenBank. Obtaining the entire sequence and all of the features requires fetching the individual components from a network service.
The object manager allows Sequin to know about different fetch functions that can be used. When a sequence is needed, these functions will be called until one of them satisfies the request. For example, the lsqfetch configuration file can be edited to point to a directory containing sequence files on a user's disk. The SeqFetch function calls a network service at NCBI to obtain sequences and to look up Accession numbers given gi numbers or gi numbers given Accession numbers.
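The "call until one satisfies the request" behavior resembles a chain of fetchers tried in order. The sketch below illustrates that pattern only; the function names and the dictionary-backed sources are hypothetical and do not reflect the object manager's actual interface.

```python
def make_dict_fetcher(records):
    """Build a fetch function backed by an in-memory dict (stand-in for a real source)."""
    def fetch(seq_id):
        return records.get(seq_id)  # None signals "not found here"
    return fetch


def fetch_sequence(seq_id, fetchers):
    """Try each registered fetch function in turn until one returns a sequence."""
    for fetch in fetchers:
        result = fetch(seq_id)
        if result is not None:
            return result
    raise LookupError(f"no fetch function could supply {seq_id!r}")


local_files = make_dict_fetcher({"local1": "ACGT"})
network = make_dict_fetcher({"remote1": "GGCC"})
sequence = fetch_sequence("remote1", [local_files, network])
```

Ordering the list with local sources first means the network is contacted only when a sequence cannot be found on disk.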
When used internally by NCBI indexers, Sequin can also fetch records from the DirSub and TMSmart databases. To ensure the confidentiality of pre-released records, this access requires the indexer to have a database password and to be working from a computer within NCBI. For additional protection, the paths to the database scripts are stored in a configuration file and are not encoded in the public Sequin source code.
Sequin has an important dual role as a primary submission tool and as a full-featured sequence record editor. It is designed to be modular on several levels, which simplifies the design and implementation of its components. Sequin sits at the top of the NCBI software Toolkit, relying on many of the underlying components, and thus acts as a quality assurance that these functions are working properly.
The biological sequence information that makes up the foundation of NCBI's databases and curated resources comes from many sources. How are these data managed and processed once they reach NCBI? This chapter will discuss the flow of sequence data, from the management of data submission to the making of data available for use.
Access | URL |
---|---|
FTP | |
BLAST | |
Entrez | |
The sequence database GenBank (
The sequence data available at NCBI comes from many different sources (
GenBank sequences (
Reference sequences (
Sequences from other databases, such as SWISS-PROT, PIR, PRF, and PDB
Sequences from United States patents
The submission pathway depends on the data source (see
The data received are subject to some quality control. The submission tools BankIt, Sequin, and fa2htgs have built-in validation mechanisms that ensure that the data submitted have the correct structure and contain the essential information. The work of the GenBank indexing staff, who also use Sequin, adds an additional layer of quality control and provides assistance with problems or complex submissions and updates.
The ID database is a group of standard relational databases that holds both ASN.1 objects and sequence identifier-related information. ASN.1 objects follow the specifications in the asn.all file for NCBI sequence data objects. ID holds data for GenBank and other data, all of which are included in Entrez. All of the sequences for the International Nucleic Acid Database Collaboration are in GenBank and must have Accession numbers assigned to them. These Accession numbers define an entity, the sequence of which is described. When the understanding of that sequence changes, the sequence can have a new version. This leads to two parallel ways of tracking sequence versions for an object. In the GenBank flatfile format, there is an Accession.Version, where the version gives the ordinal instance (version) of the sequence. Within ID, each unique sequence is assigned a gi number; therefore, the chain of gi numbers for an Accession also gives all of the instances of the sequence for that Accession. The chain identifier within ID can be used to link these gi numbers. (Not all sequences within ID are in GenBank and not all have sequence versions, but all sequences have a chain of gi numbers. For this reason, internally, the gi number is the universal pointer to a particular sequence, as opposed to the Accession.Version, which would work only for versioned sequences.) The ID database is also the controller for allowed “takeovers” of one Accession by another. A takeover would occur, for example, when two clones' sequences were now to be merged into a single clone. One or both of the two older clones' Accessions could be taken over by the new clone. The actual details of the architecture of this group of databases and software associated with it are described later in this chapter.
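The two parallel ways of tracking versions can be made concrete with a toy model. This sketch is an assumption-laden illustration, not the ID schema: gi numbers are handed out sequentially, the version is taken as the position in the gi chain, and the class and accession used are hypothetical.

```python
class IdRegistry:
    """Toy model of Accession.Version and gi chains."""

    def __init__(self):
        self._next_gi = 1
        self._chains = {}     # accession -> list of gi numbers, oldest first
        self._sequences = {}  # gi -> sequence

    def load(self, accession, sequence):
        """Record a sequence; a changed sequence gets a new gi and a new version."""
        chain = self._chains.setdefault(accession, [])
        if chain and self._sequences[chain[-1]] == sequence:
            return chain[-1]  # unchanged sequence keeps its gi
        gi = self._next_gi
        self._next_gi += 1
        self._sequences[gi] = sequence
        chain.append(gi)
        return gi

    def chain(self, accession):
        return list(self._chains[accession])

    def accession_version(self, accession):
        return f"{accession}.{len(self._chains[accession])}"


registry = IdRegistry()
first_gi = registry.load("X00001", "ACGT")
second_gi = registry.load("X00001", "ACGA")  # sequence changed
```

After the second load, the chain for the Accession holds both gi numbers, and the Accession.Version reports the second instance.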
Once all incoming data have been converted to ASN.1 format and have been entered into ID, the data are then replicated to several different servers and also transformed into several different formats (
The IQ database is a
For example, as part of the processing of incoming sequences, each protein and nucleotide sequence is searched (using BLAST,
As well as storing the relationships between nucleotide sequences or between protein sequences, IQ also holds information on the links between entries in different Entrez databases. This might include, for example, links from a sequence record to the publications it cites (in PubMed) or to its organism in the Taxonomy database. Some of this information comes from the analysis of the ASN.1 in ID by e2index.
The BLAST Control database receives information from ID and uses it to generate BLAST databases (
Many people who use NCBI services think of the GenBank flatfile as the archetypal sequence data format. However, in terms of the ID internal data flow system, ASN.1 is considered the original format from which reports such as the GenBank flatfile can be generated. Generally, the GenBank flatfile is generated on demand from the ASN.1. However, for certain products, such as complete GenBank releases, a GenBank flatfile image is made for each active sequence and is stored in a database called FF4Release, which consists of the latest transformation of ASN.1 to the GenBank flatfile format.
This database also doubles as the place where internal error reports are captured. The reports can be analyzed and displayed for different time points in the data processing pathway: (
Type | Source | ASN.1 | GBFF | Qscore | GenPept | Protein FASTA |
---|---|---|---|---|---|---|
Cumulative | GenBank | X | X | X | X | |
Incremental | GenBank | X | X | X | X | |
Incremental | GenBank | X | X | | | |
Cumulative | RefSeq | X | X | X | X | |
Incremental | RefSeq | X | X | X | X | |
GBFF, GenBank flatfile.
NCBI records only.
When sequences are submitted to GenBank or one of the collaborating databases, useful information about the sequence is often also included. This might include a brief description of a gene in the definition line, along with annotated sequence features, e.g., the source organism name. To make this information searchable via Entrez, these words have to be indexed. They are therefore extracted from the ASN.1 (using a tool called e2index) and are then stored in the Entrez posting files, which are optimized for Boolean queries by the Entrez system (see
All of these products from the ID system are listed in
ID has a distributed architecture, the features of which are shown in
IDProdOS is an open server (commonly called “middleware”), which sits between the clients and the database system. It hides details of the underlying complexity from the client API. For example, the previous version of the ID system consisted of a single database. Because the clients used the open server, only the open server had to be changed when the conversion to the current system took place, and none of the clients. IDProdOS begins the complex process of checking that the actions implied by the load are allowed and changes the SeqId pointers to gi numbers to be used as sequence-specific pointers. For example, in a record that has DNA and protein sequences, including annotation and sequence identifiers, the identifier on the protein has to be unique. If the identifier has been used previously on another DNA sequence and the current sequence is not replacing the previous sequence, then this is not allowed because proteins are not allowed to move between GenBank records. As an exception, this rule is sometimes relaxed for proteins moving between segments of a complete genome submission.
The IdMain database contains the sequence identifiers for each of the sequence records, including all those for ASN.1 blobs that contain multiple sequences. It enforces sequence version rules, among others.
Relational satellite databases are fully normalized databases that hold records for which there is only one sequence per intended ASN.1 blob. Few, if any, features are allowed on records intended for relational satellite databases. (The PubSeqOS produces the ASN.1 by querying the relational tables.) This is in contrast to the Blob satellite databases, from which ASN.1 is retrieved pre-made.
Blob satellite databases, although relational databases, contain ASN.1 objects as unnormalized data objects.
The SnpAnnot database contains only feature information. It has simple mutation information from dbSNP (
To visualize the role of replication, the rectangle in the middle of
PubSeqOS is an open server (commonly called “middleware”) that sits between the clients and the database system. It hides details of the underlying complexity from the client API. It has an almost identical code base to IDProdOS because it serves similar functions. When the ASN.1 that has been requested is to be presented in a format other than ASN.1, psansconvert is called to do the conversion. Running the conversion as a separate process both insulates PubSeqOS from any possible instability and allows multiple central processing units to be used in a natural way.
The EntrezControl and The Graveyards databases are found only on the query side. The EntrezControl database tracks which records await indexing and which have been indexed by e2index. The Graveyards are blob databases that contain blobs killed by replacement or taken over by other blobs. Once taken over, blobs are not changed; therefore, they need not be on the loading side.
NCBI may assemble a genome prior to annotation, add annotations to a genome assembled elsewhere, or simply process an annotated genome to produce RefSeqs and maps for display in Map Viewer (
The basic procedures used to annotate other eukaryotic genomes are essentially the same as those used to annotate the human genome. However, the overall process is adjusted to accommodate the different types of input data that are available for each organism. Genes can be annotated on any genome for which a significant number of mRNA, EST, or protein sequences are available. Other features, such as clones, STS markers, and SNPs, can also be annotated whenever the relevant data are available for an organism.
For example, genes and other features are placed on the mouse Whole Genome Shotgun (WGS) assembly from the
The primary data produced by genome sequencing projects are often highly fragmented and sparsely annotated. This is especially true for the
This chapter describes the series of steps, the “pipeline”, that produces NCBI's annotated genome assembly from data deposited in the public sequence databases. A variant of the annotation process developed for the human genome is used to annotate the mouse genome, and similar procedures will be applied to other genomes (
NCBI constantly strives to improve the accuracy of its human genome assembly and annotation, to make the data displays more informative, and to enhance the utility of our access tools. Each run through the assembly and annotation procedure, together with feedback from outside groups and individual users, is used to improve the process, refine the parameters for individual steps, and add new features. Consequently, the details of the assembly and annotation process change from one run to the next. This chapter, therefore, describes the overall human genome assembly and annotation process and provides short descriptions of the key steps, but it does not detail specific procedures or parameters. However, sufficient detail is provided to enable users of our assembly and annotations to become familiar with the complexities and possible limitations of the data we provide.
New sequence data that could be used to improve the genome assembly and annotation become available on a daily basis. Since the assembly and annotation process takes several weeks to complete, the data are “frozen” at the start of the build process by making a copy of all of the data available for use at that time. Freezing the data provides a stable set of inputs for the remainder of the build process. Additional or revised data that become available during the period taken to complete the process are not used until the next build.
A build begins with a freeze of the input data and ends with the public release of an annotated assembly of genomic sequences (
The early steps of the genome assembly and annotation pipeline involve many computationally intensive processes, including masking the repetitive sequences and aligning each genomic sequence to the other genomic sequences, mRNAs, and Expressed Sequence Tags (
The manual refinement of the set of assembled genomic
Because some data change infrequently, some relatively quick steps are executed on time frames that are not tied to the build cycle. For example, the list of special cases used to override the automatic process is updated whenever the need becomes apparent.
The main inputs for the genome assembly and annotation process are genomic sequences, transcript sequences, and Sequence Tagged Site (STS) maps.
Genomic sequences from the following five data sets are processed for use in the assembly:
Sequences known to come from both ends of the same cloned genomic fragment provide valuable linking information that helps to order and orient sequence contigs in the assembly step (
Human genomic sequences that are not used for either assembly or annotation are processed so that their relationships to the assembled genome can be displayed in Map Viewer (
Human transcript sequences are used to help order and orient genomic fragments in the assembly step, for feature annotation and also to produce maps that show the locations of the transcripts on the assembled genome. Transcripts used include: (
Transcripts from other organisms may be aligned to the genome being processed. These data may reveal the location of potential genes not identified by other means. RefSeq mRNAs, GenBank mRNAs, and ESTs, obtained from the same sources that provide the human transcripts, are used. Their alignments are processed for display in Map Viewer but are not used in the assembly step or for feature annotation.
Map type | Map | Contig assembly | Contig placement | Display |
---|---|---|---|---|
Genetic linkage | Genethon | X | X | X |
Genetic linkage | Marshfield | X | X | X |
Genetic linkage | Decode | X | | |
Radiation hybrid | GeneMap99-G3 | X | X | X |
Radiation hybrid | GeneMap99-GB4 | X | X | X |
Radiation hybrid | NCBI RH | X | | |
Radiation hybrid | | X | X | X |
Radiation hybrid | Stanford TNG | X | X | |
Radiation hybrid | | X | X | X |
YAC | | X | X | X |
Genetic linkage maps, radiation hybrid (
The maps listed in
Our own review of previous genome assemblies, or feedback from users, sometimes identifies particular cases in which bad data or overlooked data prevent the automated processes from producing the best possible assembly of a particular segment of the genome. To help guide the assembly process, a list of such special cases is maintained. The list is used to provide supplemental data that override the automatic processes that assign a particular input genomic sequence to a chromosome or determine whether it is used for assembly.
The raw input genomic sequences are screened for contaminants, the repetitive sequences are masked, and the draft genomic sequences are split into fragments in preparation for alignment to other sequences. The input transcript sequences are also screened for contaminants before they are aligned to the genomic sequences. The STS content of the input genomic sequences is determined.
Draft-quality HTGSs sometimes contain segments of sequence derived from foreign sources, most commonly the cloning vector or bacterial host. Finished sequences are usually, but not always, free of such contaminants. Common contaminants introduce artificial blocks of homologous sequence that can give rise to misleading alignments between two unrelated genomic sequences.
Sequences that occur in many copies in the genome will align to many different clones. Such repetitive sequences include interspersed repeats (SINEs, LINEs, LTR elements, and DNA transposons), satellite sequences, and low-complexity sequences (
Draft HTGSs consist of a set of sequence contigs derived from a particular clone artificially linked together to form a single sequence. The masked, draft genomic sequences are split at the gaps between their constituent contigs to create separate sequence fragments that can be aligned independently. Vector sequences and other contaminants are also removed at this stage by trimming or further splitting the sequence fragments.
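Splitting a draft sequence at its internal gaps can be sketched as below. The representation is an assumption: gaps between contigs are taken to be runs of at least `min_gap` N characters, and the function name and default threshold are hypothetical rather than pipeline parameters.

```python
import re


def split_at_gaps(sequence, min_gap=10):
    """Split a draft sequence into (offset, fragment) pairs at long runs of N.

    Shorter runs of N are treated as ambiguous bases inside a contig and kept.
    """
    gap = re.compile(r"[Nn]{%d,}" % min_gap)
    fragments, start = [], 0
    for match in gap.finditer(sequence):
        if match.start() > start:
            fragments.append((start, sequence[start:match.start()]))
        start = match.end()
    if start < len(sequence):
        fragments.append((start, sequence[start:]))
    return fragments
```

Keeping the offset of each fragment makes it possible to map alignments on a fragment back to coordinates on the original clone sequence.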
Any STS markers contained within the input genomic sequences are identified by
Sequences from other clones being sequenced at the same institution can occasionally cross-contaminate draft HTGSs. The contaminating sequences may come from another clone from the same organism or from another organism. The raw genomic sequences are screened in several ways to detect cross-contamination: (
At this stage, draft sequences composed of fragments that are too small to contribute significantly to the assembly or that are tagged with the HTGS_CANCELLED keyword are also flagged for removal. Another filter rejects sequences annotated as being from another organism or as being RNA, erroneously included in the input sequences.
To improve assembly of the genomic sequences, the input genomic sequences are assigned to a specific chromosome before attempting to merge the sequences. Genomic sequences that appear on any of the chromosome tiling paths are automatically assigned to the designated chromosome. Other genomic sequences are assigned to a chromosome based on: (
Transcript sequences that contain sequences derived from vectors or other common contaminants can produce artificial alignments to the assembled genomic sequence. The input transcript sequences are therefore compared with a database of contaminants using MegaBLAST (
mRNA sequences shorter than 300 bases are excluded from the set of sequences that are aligned to the genomic sequences because they are too small to contribute significantly to genome assembly or annotation. Also excluded are any mRNA sequences flagged because they do not represent the true sequence of a transcript, e.g., those that are chimeric or contain genomic sequences.
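The exclusion rule amounts to a simple filter. In this sketch, the record layout (a dict with `sequence` and an optional `flags` list) and the flag names are assumptions; only the 300-base threshold comes from the text.

```python
MIN_MRNA_LENGTH = 300  # bases; shorter mRNAs are excluded


def select_mrnas(records):
    """Keep mRNAs that are long enough and carry no problem flags."""
    return [
        r for r in records
        if len(r["sequence"]) >= MIN_MRNA_LENGTH and not r.get("flags")
    ]


records = [
    {"id": "m1", "sequence": "A" * 299},
    {"id": "m2", "sequence": "A" * 300},
    {"id": "m3", "sequence": "A" * 500, "flags": ["chimeric"]},
]
kept = [r["id"] for r in select_mrnas(records)]
```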
Alignment of the input genomic sequences to each other and to various other sequences is essential for both genome assembly and genome annotation. All relevant sequences are initially aligned to the unassembled genomic sequences because this means that the computationally intensive alignment processes can be run incrementally at an early stage in the pipeline. If necessary, these alignments are remapped to the sequence of the assembled genome at a later stage by a process that requires relatively little computation.
Assembly of the genomic sequences from individual clones into longer contiguous sequences (contigs) requires knowledge of which sequences overlap. The overlaps between genomic sequences are evaluated by aligning the sequences from individual genomic clones to each other. After masking of repeats, decontamination and fragmentation, each fragment of genomic sequence is aligned pairwise to all of the other fragments using MegaBLAST (
The pairs of short genomic sequences derived from the ends of plasmid clones help to order and orient sequence fragments in the assembly step. These clone end sequences are aligned to the processed genomic sequences, as described for
Annotation of genes requires knowledge of where the sequences for known transcripts align to the assembled genomic sequences. RefSeq RNA sequences, mRNA sequences from GenBank, and EST sequences from dbEST are aligned to the processed genomic sequences, as described for
Curated genomic regions provide accurate annotation for regions of the genome that are difficult to annotate correctly by automated processes. Sequences from curated genomic regions are initially aligned to the unassembled genomic sequences and later remapped to the sequence of the assembled genome, as described for
Homologies between the polypeptides encoded by the genomic sequences and known proteins/conserved protein domains provide hints for the gene prediction process. The repeat-masked genomic sequences are compared with a non-redundant database of vertebrate proteins and to the NCBI Conserved Domain Database (CDD; Ref.
The input genomic sequences are assembled into a series of genomic sequence contigs. These are then ordered, oriented with respect to each other, and placed along each chromosome with appropriately sized gaps inserted between adjacent contigs. The resulting genome assembly thus consists of a set of genomic sequence contigs and a specification for how to arrange the sequence contigs along each chromosome.
A chromosome sequence is considered
Genomic sequence contigs for unfinished chromosomes are assembled and laid out based largely on the clone
Before the tiling paths are used in the genome assembly, the order of the finished clone sequences included in the tiling paths is compared with the specifications used to assemble the curated contigs of finished sequence. Discrepancies are resolved before proceeding with the assembly. Sequence from any clone should appear at just one place in the assembled genome; therefore, if a clone is listed more than once in the tiling paths, only the location with the best evidence is used in the assembly step.
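Choosing a single location for a clone that appears more than once can be sketched as follows. How the evidence is actually weighed is not described here, so the numeric `evidence_score` is a placeholder assumption, as are the tuple layout and the clone names.

```python
def resolve_duplicate_clones(placements):
    """Keep one placement per clone: the one with the highest evidence score.

    placements: list of (clone, chromosome, position, evidence_score) tuples.
    """
    best = {}
    for placement in placements:
        clone, score = placement[0], placement[3]
        if clone not in best or score > best[clone][3]:
            best[clone] = placement
    return sorted(best.values(), key=lambda p: (p[1], p[2]))


placements = [
    ("cloneA", "1", 100, 0.4),
    ("cloneA", "2", 50, 0.9),
    ("cloneB", "1", 10, 0.5),
]
resolved = resolve_duplicate_clones(placements)
```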
Clone sequences that consist only of unassembled reads (HTGS_PHASE0) or that were flagged because of suspected cross-contamination or other problems detected in the pre-assembly screens are not used in the assembly step.
Adjacent, finished clone sequences from the chromosome tiling path that have good sequence overlap are merged. Tiling path draft sequences that are adjacent to and overlap the finished clone sequences or other draft clone sequences are added to extend the initial genomic sequence contigs. After that, genomic sequences from clones not on any chromosome tiling path are added, provided they have good overlaps with the assembled tiling path clones. Genomic sequences from additional clones may be added if they provide the sequence for a known gene that is missing from the existing genomic sequence contigs. Finally, the individual fragments of draft sequences are ordered and oriented.
After all of the chromosomes have been assembled, any remaining genomic clone sequence that contains a known gene not present in the other contigs is added to the assembly as a separate contig.
After the genomic sequence contigs are assembled, they are oriented and placed in order along each chromosome with appropriately sized gaps inserted between adjacent contigs. The chromosome
Gaps between the clone contigs laid out in the chromosome tiling paths are arbitrarily set at 50 Kbp, and 3 Mbp for the centromere, unless another gap size is specified in the tiling path. Any remaining gaps between genomic sequence contigs are arbitrarily set to 10 Kbp.
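The effect of these default gap sizes on chromosome coordinates can be worked through in a short sketch. The gap values come from the text; the function name, the input layout, and the gap labels are assumptions for illustration.

```python
# Default gap sizes in base pairs, as stated in the text.
GAP_SIZES = {"clone": 50_000, "centromere": 3_000_000, "sequence": 10_000}


def place_contigs(contigs):
    """Assign chromosome coordinates to contigs laid end to end with gaps.

    contigs: list of (name, length, gap_after), where gap_after is a key of
    GAP_SIZES, an explicit size in bp (a tiling path override), or None for
    the last contig.
    """
    placed, offset = [], 0
    for name, length, gap_after in contigs:
        placed.append((name, offset, offset + length))
        offset += length
        if gap_after is not None:
            offset += GAP_SIZES[gap_after] if isinstance(gap_after, str) else gap_after
    return placed


layout = place_contigs([
    ("c1", 1000, "sequence"),
    ("c2", 2000, "clone"),
    ("c3", 500, "centromere"),
    ("c4", 100, None),
])
```

Because the gap sizes are arbitrary, coordinates downstream of a gap are estimates that shift whenever a gap is closed or resized in a later build.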
A set of sequences and data files is produced to represent the provisional assembly. This set includes: (
The provisional assembly is checked for consistency with the chromosome tiling paths and various STS maps. The order in which the component clone sequences appear in the assembled chromosomes is compared with their order in the tiling paths on which the assembly was based. The STS marker order along each chromosome in the provisional assembly, as determined by e-PCR (
Identification of genes within the genome assembly reveals the functional significance of particular stretches of genomic sequence. Genes are found using three complementary approaches: (
Alignments between known transcripts and the assembled genomic sequences are processed to produce gene models. Each gene model consists of an ordered series of exons. The transcripts defining each gene model are used as evidence to support that model.
The alignments between RefSeq RNA sequences, mRNA and EST sequences from GenBank and the component genomic sequences are remapped to produce alignments of these transcripts to the assembled genomic contigs.
A candidate gene model is produced from each set of alignments between a particular transcript and one strand of a particular genomic contig as follows: (1) putative exons are identified by looking for mRNA splice sites near the ends of those alignments that satisfy minimum length and percentage identity criteria; (2) a mutually compatible set of exons for the model is selected by applying rules, such as restrictions on the size of an intron, that define plausible exon–intron structures; and (3) BLAST (
Each RefSeq RNA represents a distinct transcript produced from a particular gene (
Many gene models may be produced for the same gene because the input data set frequently contains multiple EST or mRNA sequences representing the same transcript. This redundancy is used to refine the splice sites defining a particular exon. Similar exons are clustered, and splice sites may be adjusted in some models to match those used by the majority of models containing the same exon. Inconsistent models may be discarded at this stage, unless they have sufficient support to be retained as likely splice variants.
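The majority rule for splice sites can be sketched for a single cluster of similar exons. This is a simplification under stated assumptions: the exons are already clustered, each is a `(start, end)` pair, and the function name is hypothetical. The pipeline's actual clustering and support criteria are not reproduced.

```python
from collections import Counter


def consensus_exon(exons):
    """Return the most frequently used start and end among clustered exons."""
    starts = Counter(start for start, _ in exons)
    ends = Counter(end for _, end in exons)
    return starts.most_common(1)[0][0], ends.most_common(1)[0][0]


# Four models contain the same exon; two of them disagree at one boundary each.
cluster = [(100, 200), (100, 200), (102, 200), (100, 198)]
boundaries = consensus_exon(cluster)
```

Models whose boundaries differ from the consensus would then either be adjusted to match it or, if well supported, kept as candidate splice variants.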
Many of the mRNAs and most of the ESTs used to generate the initial gene models provide sequence for only part of the native transcript. Overlapping gene models that are compatible with each other are combined into an extended model. This chaining step produces models more likely to represent the full gene.
GenomeScan produces better results when long genomic sequences are broken into shorter segments at putative gene boundaries. The locations of gene models based on RefSeq RNA alignments are, therefore, used to divide the assembled genomic contigs into segments. Repetitive sequences are masked by remapping the repeats found in the component genomic sequences.
GenomeScan can use data on protein homologies to improve its gene predictions (
Each segment of genomic sequence is processed by GenomeScan using the combined set of protein homology-based hints as an additional input. This produces one model containing all of the predicted exons for each putative gene. Models with coding sequences shorter than 90 amino acids are discarded. Each remaining model is aligned to proteins from SWISS-PROT and NCBI RefSeq proteins using blastp. The eukaryotic protein with the best match to any model is used as evidence for that model and to provide a clue as to the possible function of that model.
Consolidation of the transcript-based gene models and the predicted gene models forms a single set of models. Models are clustered into genes if they share one or more exons or if LocusLink (
Some gene models are discarded for one of three reasons: superior gene annotation is available from a curated genomic region, the models are likely to represent pseudogenes, or the models are incompatible with other gene models.
The manually reviewed annotations from curated genomic region RefSeqs are used in preference to any corresponding gene models generated by automated processing. The curated genomic regions are aligned to the assembled genomic contigs by remapping the alignments between these RefSeqs and the component genomic sequences. Any gene model that significantly overlaps a segment of the assembled sequence that corresponds to a curated genomic region is discarded.
When transcripts from a particular gene are aligned to the genomic sequences, they will align not only to the active copy of the gene but also to any segment of the genome containing a pseudogene derived from the active gene. Because model transcripts or model proteins that represent nontranscribed pseudogenes are undesirable, an attempt is made to identify and remove such models.
Whenever possible, alignments of RefSeqs for pseudogenes, either curated genomic regions or RNAs, are used to annotate pseudogenes. Some additional models derived from pseudogenes that are not yet represented by RefSeqs are eliminated by the following mechanism. All models based on the same supporting mRNA are compared with respect to the percent identity of the alignments and the number of exons. Only the model with the strongest evidence is retained.
When two gene models are found to have an extensive overlap, then in general only the model with the stronger evidence is retained. However, models based on RefSeqs are always retained. Whereas any model not based on a RefSeq is discarded if it overlaps a model that is RefSeq based, two RefSeq-based models that overlap are both retained.
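The precedence rules above can be expressed as a simple greedy filter. The sketch below makes several assumptions that are not part of the text: the `Model` fields, a single numeric `evidence` score, and the definition of "extensive overlap" as at least half of the shorter model.

```python
from dataclasses import dataclass

@dataclass
class Model:
    name: str
    start: int       # half-open interval [start, end) on the contig
    end: int
    refseq: bool     # True if the model is based on a RefSeq
    evidence: float  # hypothetical strength-of-evidence score

def overlap_fraction(a, b):
    """Shared length as a fraction of the shorter of the two models."""
    shared = min(a.end, b.end) - max(a.start, b.start)
    shorter = min(a.end - a.start, b.end - b.start)
    return max(shared, 0) / shorter

def resolve(models, threshold=0.5):
    """Keep every RefSeq-based model; keep any other model only if it does
    not extensively overlap a model that has already been kept."""
    kept = []
    # RefSeq-based models are considered first, then the rest by evidence.
    for m in sorted(models, key=lambda m: (not m.refseq, -m.evidence)):
        clash = any(overlap_fraction(m, k) >= threshold for k in kept)
        if m.refseq or not clash:
            kept.append(m)
    return kept

models = [
    Model("A", 100, 500, True, 1.0),
    Model("B", 300, 700, True, 1.0),     # overlaps A, but both are RefSeq based
    Model("C", 120, 480, False, 9.0),    # overlaps RefSeq-based A: discarded
    Model("D", 1000, 1500, False, 2.0),  # weaker than E: discarded
    Model("E", 1100, 1400, False, 5.0),
]
kept_names = [m.name for m in resolve(models)]
```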
Initially, the longest open reading frame from each gene model is annotated as the protein coding sequence. This annotation can be revised if evidence associated with that model provides support for an alternative coding region. The protein coding sequence from any transcript used as evidence for a gene model is compared with the longest open reading frame in that model using BLAST (
The set of gene models produced by the preceding steps is a mixture of models for predicted genes and for known genes. To help identify models representing known genes, the model transcripts are compared with known transcripts. To help name the predicted genes, the proteins encoded by the models are also compared with known proteins.
To provide continuity from build to build and to identify genes based on their predicted transcripts, MegaBLAST (
The eukaryotic proteins with the best match to each protein predicted by the annotation process are used to identify the best model for a possible gene and to assign a name to gene models that are novel. The proteins encoded by the models are aligned to proteins from SWISS-PROT (
Gene models are attributed to known genes whenever the correspondence is clear. If a model RNA has a reciprocal best hit with a known RNA, then the annotation of the known RNA is used to identify the gene. The first models to be assigned to genes are those that have reciprocal best hits with RefSeq RNAs. This is followed by assignment of those models that have reciprocal best hits to models from the previous build or to GenBank mRNAs. Gene data for models that match an mRNA not yet represented by a RefSeq are obtained from NCBI gene-specific databases (currently LocusLink,
Multiple models based on alternative transcripts for some genes may be produced. In most of these cases, one transcript model is selected to represent the product of the gene for annotation purposes. Any homology between eukaryotic proteins and proteins encoded by the models guides the choice between alternative models. Multiple transcripts are annotated only if the models are based on RefSeq mRNAs representing alternative transcripts from the same gene.
Although alternative transcript models are not annotated, the alignments between the transcripts that represent alternative splicing and genomic contigs are processed for display in Map Viewer, Evidence Viewer, and Model Maker (see
The transcripts and protein products of any models that have been assigned to a known gene are given the product names that appear in the LocusLink entry for that gene. The gene products from other genes are named based on any significant homology to other eukaryotic proteins, provided that the matching protein has a meaningful name (i.e., names such as “Hypothetical...” are ignored).
The genomic contig RefSeqs are annotated with features that provide information about the location of genes, mRNAs, and coding regions. Features from curated genomic region RefSeqs are copied to the contigs based on the alignment between the curated sequence and the corresponding contig. Protein domains from the Conserved Domain Database (CDD; Ref.
Reference sequences produced by the genome assembly process are annotated with features that provide landmarks valuable for making connections between maps based on different coordinate systems and for associating genes with diseases.
Placement of STSs on the genome assembly allows sequence-based data to be integrated with non-sequence-based maps that contain STS markers, such as genetic and radiation hybrid maps. STSs are identified by using e-PCR (
Placement on the genome assembly of clones that have been mapped to cytogenetic bands by FISH provides the means to determine the correspondence between the sequence and cytogenetic coordinate systems (
Placement of Single Nucleotide Polymorphisms (SNPs) and other variations on the genome provides numerous landmarks that are valuable for associating genes with diseases (
The products of our assembly and annotation process are made available to the public as RefSeqs of assembled chromosome sequences, genomic sequence contigs, model transcripts, and model proteins. RefSeqs are produced in alternative formats so that they can be retrieved by Entrez, BLAST, or FTP.
A fully annotated RefSeq entry is made for each genomic sequence contig. Separate RefSeq model RNA and protein entries are also made for any of the transcripts and coding regions annotated on genomic contigs not identified as existing RefSeqs. Finally, a RefSeq entry is made for each chromosome by combining the annotated sequences of the genomic contigs in the appropriate order and with the appropriate spacing. The resulting RefSeqs can be retrieved through Entrez.
The assembled genomic contig RefSeqs are formatted as a BLAST database (
The annotated genomic sequence contig, model transcript, and model protein RefSeqs are saved in GenBank flatfile and ASN.1 formats. The same sets of sequences that are used to make BLAST databases are also saved in FASTA format. All of these data files, together with files that specify the construction of the genomic contigs and their arrangement along the chromosomes, are made available for download by
We produce many maps showing the locations of various features annotated on our genome assembly. Maps containing whatever combination of features that interests the user can be selected and displayed side-by-side using
Basic map data are prepared for each map to identify each feature, delineate its position on the chromosome, and specify how it is to be displayed. For many maps, supplemental data are prepared to provide more information about each feature. Map Viewer displays this map-specific supplemental information when users select a particular map as the Master Map (
Maps that display those features annotated on the genomic sequence contigs (genes, STSs, clones, and SNPs) are generated by translating the positions of the features on the contigs into chromosome coordinates. Contig coordinates are translated into chromosome coordinates using the positions of the contigs along each chromosome, as determined in the genome assembly step. Using this same method, alignments between various sequences and the genomic contigs are translated into chromosome coordinates to produce additional maps that show the locations of the aligned sequences on the chromosomes. Maps generated from sequence alignments include maps that show the genomic positions of mRNA plus EST sequences, or genomic sequences from GenBank. The specifications used to build each genomic contig are also translated into chromosome coordinates to produce one map that shows the component sequences used to assemble each contig and another that simply shows the finished and draft sections of the contigs.
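The coordinate translation itself is simple arithmetic once the placement of each contig is known. The following sketch assumes 1-based coordinates and adds an orientation argument for contigs placed on the reverse strand; the orientation handling, the function name, and the example numbers are assumptions for illustration and are not taken from the text.

```python
def contig_to_chromosome(pos, contig_start, contig_len, orientation="+"):
    """Translate a 1-based position on a contig into a 1-based chromosome
    position, given where the contig starts on the chromosome."""
    if not 1 <= pos <= contig_len:
        raise ValueError("position lies outside the contig")
    if orientation == "+":
        return contig_start + pos - 1
    # Reverse-oriented contig: position 1 maps to the contig's last base.
    return contig_start + contig_len - pos

# A 500-base contig placed at chromosome position 1001.
first = contig_to_chromosome(1, 1001, 500)
last = contig_to_chromosome(500, 1001, 500)
flipped = contig_to_chromosome(1, 1001, 500, orientation="-")
```

A feature with a start and an end is translated by applying the function to both coordinates and, for a reverse-oriented contig, swapping them afterwards.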
Cytogenetic maps, genetic linkage maps, and radiation hybrid maps use different coordinate systems that are not based on sequence. To generate data for these types of maps, the locations of the map elements are listed in the coordinate system appropriate to each map. Map Viewer can scale maps defined in different coordinate systems so that they can be displayed side by side.
All of the map data for the new genome assembly are loaded into the Map Viewer database. Next, the objects in the new maps are indexed so that users can search for and then display specific features (
To ensure that a consistent view of the annotated genome assembly is presented, the release of databases and FTP files is coordinated. When everything is ready for release, the Map Viewer
The products of the genome assembly and annotation process are linked extensively to various NCBI resources. These links provide different views of the data and more information for researchers as they follow a particular line of investigation.
Map object | Linked NCBI resource | Resource description |
---|---|---|
Accession number | Entrez | |
Clone | Clone Registry | |
Disease gene | OMIM | |
EST or mRNA | UniGene | |
Gene | LocusLink | |
Gene or transcript | Evidence Viewer | |
Gene or transcript | Model Maker | |
Gene | Human–mouse homology map | |
STS | UniSTS | |
Variation (SNP) | dbSNP | |
The maps displayed by Map Viewer have embedded links between map objects and relevant NCBI resources (
During the production of RefSeqs, links between the annotated features (clones, genes, SNPs, and STSs) and the relevant resources listed in
A customized BLAST
Richa Agarwala, Jonathan Baker, Hsiu-Chuan Chen, Vyacheslav Chetvernin, Deanna Church, Cliff Clausen, Dmitry Dernovoy, Olga Ermolaeva, Wratko Hlavina, Wonhee Jang, Philip Johnson, Jonathan Kans, Paul Kitts, Alex Lash, David Lipman, Donna Maglott, Jim Ostell, Keith Oxenrider, Kim Pruitt, Sergei Resenchuk, Victor Sapojnikov, Greg Schuler, Steve Sherry, Andrei Shkeda, Alexandre Souvorov, Tugba Suzek, Tatiana Tatusova, Lukas Wagner, and Sarah Wheelan
Entrez is the text-based search and retrieval system used at NCBI for all of the major databases, including PubMed, Nucleotide and Protein Sequences, Protein Structures, Complete Genomes, Taxonomy, OMIM, and many others. Entrez is at once an indexing and retrieval system, a collection of data from many sources, and an organizing principle for biomedical information. These general concepts are the focus of this chapter. Other chapters cover the details of a specific Entrez database (e.g., PubMed in
The first version of Entrez was distributed by NCBI in 1991 on CD-ROM. At that time, it consisted of nucleotide sequences from GenBank and PDB; protein sequences from translated GenBank, PIR, SWISS-PROT, PDB, and PRF; and associated citations and abstracts from MEDLINE (now PubMed and referred to as PubMed below). We will use this first design to illustrate the principles behind Entrez.
An Entrez “node” is a collection of data that is grouped together and indexed together. It is usually referred to as an Entrez database. In the first version of Entrez, there were three nodes: published articles, nucleotide sequences, and protein sequences. Each node represents specific data objects of the same type, e.g., protein sequences, which are each given a unique ID (UID) within that logical Entrez Proteins node. Records in a node may come from a single source (e.g., all published articles are from PubMed) or many sources (e.g., proteins are from translated GenBank sequences,
Note that the UID identifies a single, well-defined object (i.e., a particular protein sequence or PubMed citation). There may be other information about objects in nodes, such as protein names or
Another criterion for selecting a particular data type to be an Entrez node is to enable linking to other Entrez nodes in a useful and reliable way. For example, given a protein sequence, it is very useful to quickly find the nucleotide sequence that encodes it. Or given a research article, it is useful to find the sequences it describes, if any.
One way to achieve this is to put all of the information into one record. For example, many GenBank records contain pertinent article citations. However, PubMed also contains the article abstract and additional index terms (e.g., MeSH terms); furthermore, the bibliographic information is also more carefully curated than the citation in a GenBank entry. It therefore makes much more sense to search for articles in PubMed rather than in GenBank.
When a subset of articles has been retrieved from PubMed, it may be useful to link to sequence information associated with the abstracts. The article citation in the GenBank record can be used to establish the link to PubMed and, conversely, to make the reciprocal link from the PubMed article back to the GenBank record. Treating each Entrez node separately but enabling linking between related data in different nodes means that the retrieval characteristics for each node can be optimized for the characteristics and strengths of that node, whereas related data can be reached in nodes with different strengths.
This approach also means that new connections between data can be made. In the example above, the GenBank record cited the published article, but there was no link from that article in PubMed to the sequence until Entrez made the reciprocal link from PubMed. Now, when searching articles in PubMed, it is possible to find this sequence, although no PubMed records have been changed. Because of this design principle, the Entrez system is richly interconnected, although any particular association may originate from only one record in one node.
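The idea of deriving reciprocal links without touching the target records can be illustrated by inverting a link table. The identifiers below are invented placeholders, and the dictionary-based representation is an assumption for illustration, not a description of how Entrez stores its links.

```python
from collections import defaultdict

def reciprocal_links(forward):
    """Given links from one node to another (source UID -> set of target
    UIDs), build the links that point back in the opposite direction."""
    back = defaultdict(set)
    for source, targets in forward.items():
        for target in targets:
            back[target].add(source)
    return dict(back)

# Sequence records that cite articles (all identifiers are made up).
sequence_to_article = {
    "seq_A": {"article_1"},
    "seq_B": {"article_1", "article_2"},
}
article_to_sequence = reciprocal_links(sequence_to_article)
```

The article records themselves are never modified; the back-links exist only in the derived table, which mirrors the point made above about PubMed records remaining unchanged.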
Another type of linking in Entrez is between records of the same type, often called “neighbors”, in sequence and structure nodes. Most often these associations are computed at NCBI. For example, in Entrez Proteins, all of the protein sequences are “
Again, associations that may not be present in the original records can be made. For example, a well-annotated SWISS-PROT record for a particular protein may have fields that describe other protein or GenBank records from which it was derived. At a later date, a closely related protein may appear in GenBank that will not be referenced by the SWISS-PROT record. However, if a scientist finds an article in PubMed that has a link to the new GenBank record, that person can look at the protein and then use the BLAST-computed neighbors to find the SWISS-PROT record (as well as many others), although neither the SWISS-PROT record nor the new GenBank record refers to each other anywhere.
There are many advantages to establishing new associations by computational methods (as in the GenBank–SWISS-PROT example above), especially for large, rapidly changing data sets such as those in biomedicine.
As computers get faster and cheaper, this type of association can be made more efficiently. As data sets get bigger, the problem remains tractable or may even improve because of better statistics. If a new algorithm or approach is found to be an improvement, it is possible to apply it over the whole data set within a practical timescale and by using a reasonable number of resources. Any associations that require human curation, such as the application of controlled vocabularies, do not scale well with rapidly growing sets of data or evolving data interpretations. Although these manual kinds of approaches certainly add value, computational approaches can often produce good results more objectively and efficiently.
A data-retrieval system succeeds when you can retrieve the same data you put in. A discovery system is intended to let you find more information than appears in the original data. By making links between selected nodes and making computed associations within the same node, Entrez is designed to infer relationships between different data that may suggest future experiments or assist in interpretation of the available information, although it may come from different sources.
The ability to compare genotype information across a huge range of organisms is a powerful tool for molecular biologists. For example, this technique was used in the discovery of a gene associated with hereditary nonpolyposis colon cancer (HNPCC). The tumor cells from most familial cases of HNPCC had altered, short, repeated DNA sequences, suggesting that DNA replication errors had occurred during tumor development. This information caused a group of investigators to look for human homologs of the well-characterized
The researchers could connect the functional data about the yeast and bacterial genes with the genetic mapping and clinical phenotype information in humans. Entrez is designed to support this kind of process when the underlying data are available electronically. In PubMed, the research paper about the discovery of
The original three-node Entrez system has evolved over the past 10 years to include more nodes, among them Taxonomy, which is organized around the names and phylogenetic relationships of organisms; Structure, organized around the three-dimensional structures of proteins and nucleic acids; Genomes, in which each record represents a chromosome of an organism; PopSet, consisting of collections of aligned sequences from a single population study; and Books, representing published books in biomedicine.
More nodes are planned for addition in the near future. Each one of these nodes is richly connected to others. Each offers unique information and unique new relationships among its members. The combination of new links and new relationships increases the chances for discovery. The addition of each new node creates different paths through the data that may lead to new connections, without more work on the old nodes.
Entrez integrates data from a large number of sources, formats, and databases into a uniform information model and retrieval system. The actual databases from which records are retrieved and on which the Entrez indexes are based have different designs, based on the type of data, and reside on different machines. These will be referred to as the “source databases”. A common theme in the implementation of Entrez is that some functions are unique to each source database, whereas others are common to all Entrez databases.
Some of the common routines and formats for every Entrez node include the term lists and posting files (i.e., the retrieval engine) used for Boolean queries, the links within and between nodes, and the summary format used for listing search results in which each record is called a DocSum. Generally, an Entrez query is a
The software that tracks the addition of new or updated records or identifies those that should be deleted from Entrez may be unique for each source database. Each database must also have accompanying software to gather index terms, DocSums, and links from the source data and present them to the common Entrez indexer. This can be achieved through either a set of C++ libraries or by generating an XML document in a specific DTD that contains the terms, DocSums, and links. Although the common engine retrieves a DocSum(s) given a UID(s), the retrieval of a full, formatted record is directed to the source database, where software unique to that database is used to format the record correctly. All of this software is written by the NCBI group that runs the database.
This combination of database-specific software and a common set of Entrez routines and applications allows code sharing and common large-retrieval server administration but enables flexibility and simplicity for a wide variety of data sources.
Although the basic principles of Entrez have remained the same for almost a decade, the software implementation has been through at least three major redesigns and many minor ones.
Currently, Entrez is written using the NCBI C++ Toolkit. The indexing fields (which for PubMed, for example, would be Title, Author, Publication Date, Journal, Abstract, and so on) and DocSum fields (which for PubMed are Author, Title, Journal, Publication Date, Volume, and Page Number) for each node are defined in a configuration file; but for performance at runtime, the configuration files are used to automatically generate base classes for each database. These are the basic pieces of information used by Entrez that can also be inherited and used by more database-specific, hand-coded features. The term indexes are based on the Indexed Sequential-Access Method (ISAM) and are in large, shared, memory-mapped files. The postings are large bitmaps, with one bit per document in the node. Depending on how sparsely populated the posting is, the bit array is adaptively compressed on disk using one of four possible schemes. Boolean operations are performed by ANDing or ORing the posting bit arrays into a result bit array. DocSums are small, fielded data structures stored on the same machines as the postings to support rapid retrieval.
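The way Boolean queries reduce to bitwise operations on postings can be shown with a small sketch. Python integers stand in for the bit arrays here; the terms, document numbers, and function names are invented for illustration, and nothing in this sketch reflects the on-disk compression schemes or the actual C++ implementation.

```python
def posting(doc_ids):
    """Build a bitmap with one bit per document; bit i is set if document i
    contains the term."""
    bits = 0
    for doc_id in doc_ids:
        bits |= 1 << doc_id
    return bits

def documents(bits):
    """List the document numbers whose bits are set in a bitmap."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]

# Hypothetical postings for two index terms.
human = posting([1, 4, 7, 9])
kinase = posting([2, 4, 9, 12])

both = documents(human & kinase)    # query: human AND kinase
either = documents(human | kinase)  # query: human OR kinase
```

Because each Boolean operator is a single pass over two bit arrays, the cost of a query depends on the number of documents in the node rather than on how many documents match.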
The Web-based Entrez retrieval program, called
Boolean query processing, DocSum retrieval, and other common functions are supported on a number of load-balanced “front-end” UNIX machines. Because Entrez can support session context (for example, in the use of query history, NCBI Cubby, Filters, etc.), a “history server” has been implemented on the front-end machines so that if a user is sent to machine “A” by the load balancer for their first query but to machine “B” for the second query, Entrez can quickly locate the user's query history and obtain it from machine “A”. Other than that, the front-end machines are completely independent of each other and can be added and removed readily from
The comparison of nucleotide or protein sequences from the same or different organisms is a very powerful tool in molecular biology. By finding similarities between sequences, scientists can infer the function of newly sequenced genes, predict new members of gene families, and explore evolutionary relationships. Now that whole genomes are being sequenced, sequence similarity searching can be used to predict the location and function of protein-coding and transcription-regulation regions in genomic DNA.
Basic Local Alignment Search Tool (BLAST) (
The way most people use BLAST is to input a nucleotide or protein sequence as a query against all (or a subset of) the public sequence databases, pasting the sequence into the textbox on one of the BLAST
There are many different
This chapter will first describe the BLAST architecture—how it works at the NCBI site—and then go on to describe the various BLAST outputs. The best known of these outputs is the default display from BLAST Web pages, the so-called “traditional report”. As well as obtaining BLAST results in the traditional report, results can also be delivered in structured output, such as a hit table (see below), XML, or ASN.1. The optimal choice of output format depends upon the application. The final part of the chapter discusses stand-alone BLAST and describes possibilities for customization. There are many interfaces to BLAST that are often not exploited by users but can lead to more efficient and robust applications.
The BLAST algorithm is a heuristic program, which means that it relies on some smart shortcuts to perform the search faster. BLAST performs “local” alignments. Most proteins are modular in nature, with functional domains often being repeated within the same protein as well as across different proteins from different species. The BLAST algorithm is tuned to find these domains or shorter stretches of sequence similarity. The local alignment approach also means that an mRNA can be aligned with a piece of genomic DNA, as is frequently required in genome assembly and analysis. If instead BLAST started out by attempting to align two sequences over their entire lengths (known as a global alignment), fewer similarities would be detected, especially with respect to domains and motifs.
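To make the notion of a local alignment concrete, the sketch below computes the best local alignment score with the Smith-Waterman dynamic programming method. This is not how BLAST works internally: BLAST uses heuristics precisely to avoid this exhaustive computation. The scoring values (match 2, mismatch -1, linear gap -2) and the test sequences are arbitrary choices for illustration.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between strings a and b (Smith-Waterman
    with a linear gap penalty). Only the score is returned, not the path."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A score never drops below zero: a poor region simply ends the
            # alignment, which is what makes the alignment local.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best

# Two otherwise unrelated sequences that share the segment GATTACA.
shared = local_alignment_score("TTTTGATTACATTTT", "CCCCGATTACACCCC")
unrelated = local_alignment_score("AAAA", "CCCC")
```

The shared seven-residue segment scores 7 × 2 = 14 regardless of the dissimilar flanking sequence, whereas a global alignment would be penalized for every mismatched flanking position.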
When a query is submitted via one of the BLAST Web pages, the sequence, plus any other input information such as the database to be searched, word size, expect value, and so on, are fed to the
BLAST does not search GenBank flatfiles (or any subset of GenBank flatfiles) directly. Rather, sequences are made into BLAST databases. Each entry is split, and two files are formed, one containing just the header information and one containing just the sequence information. These are the data that the algorithm uses. If BLAST is to be run in “stand-alone” mode, the data file could consist of local, private data, downloaded NCBI BLAST databases, or a combination of the two.
After the algorithm has looked up all possible “words” from the query sequence and extended them maximally, it assembles the best alignment for each query–sequence pair and writes this information to a SeqAlign data structure (in ASN.1). The QBLAST system located on the BLAST server executes the search, writing information about the sequence alignment in ASN.1. The results can then be formatted by fetching the ASN.1 (
The BLAST Formatter, which sits on the BLAST server, can use the information in the SeqAlign to retrieve the similar sequences found and display them in a variety of ways. Thus, once a query has been completed, the results can be reformatted without having to re-execute the search. This is possible because of the
Once BLAST has found a similar sequence to the query in the database, it is helpful to have some idea of whether the alignment is “good” and whether it portrays a possible biological relationship, or whether the similarity observed is attributable to chance alone. BLAST uses
The bit score gives an indication of how good the alignment is; the higher the score, the better the alignment. In general terms, this score is calculated from a formula that takes into account the alignment of similar or identical residues, as well as any gaps introduced to align the sequences. A key element in this calculation is the “
The E-value gives an indication of the statistical significance of a given pairwise alignment and reflects the size of the database and the scoring system used. The lower the E-value, the more significant the hit. Strictly, the E-value is the number of alignments with an equal or better score that would be expected by chance in a search of that size; for small values it is close to a probability. A sequence alignment that has an E-value of 0.05 therefore means that this similarity has roughly a 5 in 100 (1 in 20) chance of occurring by chance alone. Although a statistician might consider this to be significant, it still may not represent a biologically meaningful result, and analysis of the alignments (see below) is required to determine “biological” significance.
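The relationships among bit score, search space, E-value, and probability can be worked through numerically. The sketch below uses the standard Karlin-Altschul relations, E = m·n·2^(−S′) for a bit score S′ and P = 1 − e^(−E). It is a simplification: BLAST applies length corrections to m and n ("effective lengths"), which are ignored here, and the example lengths are invented.

```python
import math

def evalue_from_bit_score(bit_score, query_len, db_len):
    """Expected number of chance alignments scoring at least bit_score in a
    search space of query_len x db_len (no effective-length correction)."""
    return query_len * db_len * 2.0 ** (-bit_score)

def pvalue_from_evalue(evalue):
    """Probability of seeing at least one chance alignment this good."""
    return -math.expm1(-evalue)  # equals 1 - exp(-evalue), stable for small E

# A 100-residue query against a database of one billion residues.
e = evalue_from_bit_score(40, 100, 10**9)
p = pvalue_from_evalue(0.05)
```

For E = 0.05 the probability is about 0.049, which is why small E-values can be read almost directly as probabilities, while an E-value of 10 clearly cannot.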
Most BLAST users are familiar with the so-called “traditional” BLAST report. The report consists of three major sections: (1) the header, which contains information about the query sequence, the database searched ( The The query sequence is represented by the Each line is composed of four fields: ( The alignment is preceded by the sequence identifier, the full definition line, and the length of the matched sequence, in amino acids. Next comes the bit score (the raw score is in
The traditional report is really designed for human readability, as opposed to being parsed by a program. For example, the one-line descriptions are useful for people to get a quick overview of their search results, but they are rarely complete descriptors because of limited space. Also, for convenience, there are several pieces of information that are displayed in both the one-line descriptions and alignments (for example, the E-values, scores, and descriptions); therefore, the person viewing the search output does not need to move back and forth between sections.
New features may be added to the report, e.g., the addition of links to LocusLink records (
By default, a maximum of 500 sequence matches are displayed, which can be changed on the advanced BLAST page with the
Although the traditional report is ideal for investigating the characteristics of one gene or protein, often scientists want to make a large number of BLAST runs for a specialized purpose and need only a subset of the information contained in the traditional BLAST report. Furthermore, in cases where the BLAST output will be processed further, it can be unreliable to parse the traditional report. The traditional report is merely a display format with no formal structure or rules, and improvements may be made at any time, changing the underlying This shows the results of a search of an
The screening of many newly sequenced human Expressed Sequence Tags (
For these purposes, the hit table output is more useful than the traditional report; it contains only the information required in a more formal structure. The hit table output contains no sequences or definition lines, but for each sequence matched, it lists the sequence identifier, the start and stop points for stretches of sequence similarity (in one-offset coordinates, that is, numbered from 1), the percent identity of the match, and the E-value.
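A hit table is straightforward to read programmatically. The sketch below assumes the widely used 12-column, tab-separated layout (query id, subject id, percent identity, alignment length, mismatches, gap openings, query start, query end, subject start, subject end, E-value, bit score) with optional comment lines beginning with `#`; that layout is an assumption to check against the BLAST version in use, and the sample line is invented.

```python
from typing import NamedTuple

class Hit(NamedTuple):
    query_id: str
    subject_id: str
    percent_identity: float
    alignment_length: int
    mismatches: int
    gap_openings: int
    query_start: int    # one-offset
    query_end: int
    subject_start: int  # one-offset
    subject_end: int
    evalue: float
    bit_score: float

def parse_hit_table(text):
    """Parse tab-separated hit-table lines, skipping comments and blanks."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        f = line.split("\t")
        if len(f) != 12:
            # Fail loudly rather than silently mis-assigning columns.
            raise ValueError(f"expected 12 fields, found {len(f)}: {line!r}")
        hits.append(Hit(f[0], f[1], float(f[2]), *map(int, f[3:10]),
                        float(f[10]), float(f[11])))
    return hits

sample = (
    "# made-up example line\n"
    "query1\tsubjectA\t98.50\t200\t3\t0\t1\t200\t501\t700\t1e-100\t370\n"
)
hits = parse_hit_table(sample)
significant = [h.subject_id for h in hits if h.evalue < 1e-10]
```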
There are drawbacks to parsing both the BLAST report and even the simpler hit table. There is no way to automatically check for truncated or otherwise corrupted output in cases when a large number of sequences are being screened. (This may happen if the disk is full, for example.) Also, there is no rigorous check for syntax changes in the output, such as the addition of new features, which can lead to erroneous parsing. Structured output allows for automatic and rigorous checks for syntax errors and changes. Both
As well as the hit table and traditional report shown in HTML, BLAST results can also be formatted in plain text, XML, and ASN.1 ( Note that some nodes can be viewed as both HTML and text. XML is also structured output but can be produced from ASN.1 because it has equivalent information.
A change in BLAST format without re-executing the search is possible because when a scientist looks at a Web page of BLAST results at NCBI, the HTML that makes that page has been created from ASN.1 (
SeqAlign is the ASN.1 object that contains the alignment information about the BLAST search. The SeqAlign does not contain the actual sequence that was found in the match but does contain the start, stop, and gap information, as well as scores, E-values, sequence identifiers, and (DNA) strand information.
As mentioned above, the actual database sequences are fetched from the BLAST databases when needed. This means that an identifier must uniquely identify a sequence in the database. Furthermore, the query sequence cannot have the same identifier as any sequence in the database unless the query sequence itself is in the database. If one is using stand-alone BLAST with a custom database, it is possible to specify that every sequence is uniquely identified by using the
Any BLAST database or FASTA file from the NCBI Web site that contains gi numbers already satisfies the uniqueness criterion. Unique identifiers are normally a problem only when custom databases are produced and care is not taken in assigning identifiers. The identifier for a FASTA entry is the first token (meaning the letters up to the first space) after the > sign on the definition line. The simplest approach is to give each entry a unique token (e.g., 1, 2, and so on), but it is possible to construct more complicated identifiers that might, for example, describe the data source. For the FASTA identifiers to be reliably parsed, it is necessary for them to follow a specific syntax (see Appendix 1).
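Before building a custom database, the uniqueness requirement can be checked with a short script. The following Python sketch extracts the first token of each definition line and reports any identifier that occurs more than once. The function names are hypothetical, and the sketch splits on any whitespace rather than only on a space character, which is an assumption of this example.

```python
from collections import Counter
from typing import Iterable, List


def fasta_identifier(defline: str) -> str:
    """Return the first token after '>' on a FASTA definition line."""
    if not defline.startswith(">"):
        raise ValueError(f"not a definition line: {defline!r}")
    tokens = defline[1:].split(None, 1)  # split on the first run of whitespace
    if not tokens:
        raise ValueError("definition line has no identifier")
    return tokens[0]


def duplicate_identifiers(fasta_lines: Iterable[str]) -> List[str]:
    """Return, sorted, the identifiers that appear on more than one defline."""
    counts = Counter(
        fasta_identifier(line) for line in fasta_lines if line.startswith(">")
    )
    return sorted(ident for ident, n in counts.items() if n > 1)
```

An empty list from `duplicate_identifiers` means that every definition line in the file starts with a distinct token.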
More information on the SeqAlign produced by BLAST can be found
XML and ASN.1 are both structured languages and can express the same information; therefore, it is possible to produce a SeqAlign in XML. Some users do not find the format of the information in the SeqAlign to be convenient because it does not contain actual sequence information, and when the sequence is fetched from the BLAST database, it is packed two or four bases per byte. Typically, these users are familiar with the BLAST report and want something similar but in a format that can be parsed reliably. The XML produced by BLAST meets this need, containing the query and database sequences, sequence definition lines, the start and stop points of the alignments (one offset), as well as scores, E-values, and percent identity. There is a public
The BLAST code is part of the NCBI Toolkit, which has many low-level functions to make it platform independent; the Toolkit is supported under Linux and many varieties of UNIX, NT, and MacOS. To use the Toolkit, developers should write a function “Main”, which is called by the Toolkit “main”. The BLAST code is contained mostly in the tools directory (see Appendix 2 for an example).
The BLAST code has a modular design. For example, the Application Programming Interface (API) for retrieval from the BLAST databases is independent of the compute engine. The compute engine is independent from the formatter; therefore, it is possible (as mentioned above) to compute results once but view them in many different modes.
The readdb API can be used to easily extract information from the BLAST databases. Among the data available are the date the database was produced, the title, the number of letters, number of sequences, and the longest sequence. Also available are the sequence and description of any entry. The latest version of the BLAST databases also contains a taxid (an integer specifying some node of the NCBI taxonomy tree; see
Only a few function calls are needed to perform a BLAST search. Appendix 3 shows an excerpt from a Demonstration Program doblast.c.
MySeqAlignPrint (called in the example in Appendix 3) is a simple function to print a view of a SeqAlign (see Appendix 4).
Database name | Identifier syntax |
---|---|
GenBank | gb|accession|locus |
EMBL Data Library | emb|accession|locus |
DDBJ, DNA Database of Japan | dbj|accession|locus |
NBRF PIR | pir||entry |
Protein Research Foundation | prf||name |
SWISS-PROT | sp|accession|entry name |
Brookhaven Protein Data Bank | pdb|entry|chain |
Patents | pat|country|number |
GenInfo Backbone Id | bbs|number |
General database identifier | gnl|database|identifier |
NCBI Reference Sequence | ref|accession|locus |
Local Sequence identifier | lcl|identifier |
gnl allows databases not included in this list to use the same identifying syntax. This is used for sequences in the
The syntax of the FASTA definition lines used in the NCBI BLAST databases depends upon the database from which each sequence was obtained (see
For example, if the identifier of a sequence in a BLAST result is gb|M73307|AGMA13GT, the gb tag indicates that sequence is from GenBank, M73307 is the GenBank Accession number, and AGMA13GT is the GenBank locus.
The bar (|) separates different fields. In some cases, a field is left empty, although the original specification called for including this field. To make these identifiers backwards-compatible for older parsers, the empty field is denoted by an additional bar (||).
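The identifier syntax in the table above lends itself to a small parser. The following Python sketch splits an identifier on the bar character and labels each field according to that table; an empty field (as in pir||entry) is simply skipped. The names `SYNTAX` and `parse_identifier` are hypothetical. Padding missing trailing fields with empty strings is an assumption of this sketch, and identifiers that begin with a gi number are not handled here.

```python
from typing import Dict, List, Optional

# Field names per database tag, taken from the identifier syntax table.
# None marks a field that is left empty in the syntax (e.g., pir||entry).
SYNTAX: Dict[str, List[Optional[str]]] = {
    "gb": ["accession", "locus"],
    "emb": ["accession", "locus"],
    "dbj": ["accession", "locus"],
    "pir": [None, "entry"],
    "prf": [None, "name"],
    "sp": ["accession", "entry name"],
    "pdb": ["entry", "chain"],
    "pat": ["country", "number"],
    "bbs": ["number"],
    "gnl": ["database", "identifier"],
    "ref": ["accession", "locus"],
    "lcl": ["identifier"],
}


def parse_identifier(identifier: str) -> Dict[str, str]:
    """Split a bar-separated identifier into a dict of named fields."""
    tag, *values = identifier.split("|")
    if tag not in SYNTAX:
        raise ValueError(f"unknown database tag: {tag!r}")
    names = SYNTAX[tag]
    if len(values) > len(names):
        raise ValueError(f"too many fields for {tag!r}: {identifier!r}")
    # Assumption: trailing fields may be omitted; treat them as empty.
    values += [""] * (len(names) - len(values))
    parsed = {"tag": tag}
    for name, value in zip(names, values):
        if name is not None:
            parsed[name] = value
    return parsed
```

For the example in the text, `parse_identifier("gb|M73307|AGMA13GT")` gives the tag gb, the accession M73307, and the locus AGMA13GT.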
A gi identifier has been assigned to each sequence in NCBI's sequence databases. If the sequence is from an NCBI database, then the gi number appears at the beginning of the identifier in a traditional report. For example, gi|16760827|ref|NP_456444.1 indicates an NCBI reference sequence with the gi number 16760827 and Accession number NP_456444.1. (In stand-alone BLAST, or when running BLAST from the command line, the
The reason for adding the gi identifier is to provide a uniform, stable naming convention. If a nucleotide or protein sequence changes (for example, if it is edited by the original submitter of the sequence), a new gi identifier is assigned, but the Accession number of the record remains unchanged. Thus, the gi identifier provides a mechanism for identifying the exact sequence that was used or retrieved in a given search. This is also useful when creating crosslinks between different Entrez databases (
A simple program (db2fasta.c) that demonstrates the use of the readdb API.
Note that:
Readdb_new allocates an object for reading the database.
Readdb_acc2fasta fetches the ordinal number (zero offset) of the record given a FASTA identifier (e.g., gb|AAH06776.1|AAH0676).
Readdb_get_bioseq fetches the BioseqPtr (which contains the sequence, description, and identifiers) for this record.
BioseqRawToFasta dumps the sequence as FASTA.
Note also that Main is called, rather than “main”, and a call to GetArgs is used to get the command-line arguments. db2fasta.c is contained in the tar archive ftp://ftp.ncbi.nih.gov/blast/demo/blast_demo.tar.gz.
Type | Element | Description |
---|---|---|
Nlm_FloatHi | expect_value | Expect value cutoff |
Int2 | wordsize | Number of letters used in making words for lookup table |
Int2 | penalty | Mismatch penalty (only blastn and MegaBLAST) |
Int2 | reward | Match reward (only blastn and MegaBLAST) |
CharPtr | matrix | Matrix used for comparison (not blastn or MegaBLAST) |
Int4 | gap_open | Cost for gap existence |
Int4 | gap_extend | Cost to extend a gap one more letter (including first) |
CharPtr | filter_string | Filtering options (e.g., L, mL) |
Int4 | hitlist_size | Number of database sequences to save hits for |
Int2 | number_of_cpus | Number of CPUs to use |
The types are given in terms of those in the NCBI Toolkit. Nlm_FloatHi is a double, Int2/Int4 are 2- or 4-byte integers, and CharPtr is just char*.
The main steps here are:
BLASTOptionNew allocates a BLASTOptionBlk with default values for the specified program (e.g., blastp); the Boolean argument specifies a gapped search.
The expect_value member of the BLASTOptionBlk is changed to a non-default value specified on the command-line.
BioseqBlastEngine performs the search of the BioseqPtr (query_bsp). The BioseqPtr could have been obtained from the BLAST databases, Entrez, or from FASTA using the function call FastaToSeqEntry.
The BLASTOptionBlk structure contains a large number of members. The most useful ones and a brief description for each are listed in
Note that:
SeqAlignId gets the sequence identifier for the zero-th sequence (zero offset). This is actually a C structure.
SeqIdWrite formats the information in query_id into a FASTA identifier (e.g., gi|129295) and places it into query_buf.
SeqAlignStart and SeqAlignStop return the start and stop values of the zero-th and first sequences (that is, the first and second sequences in the alignment).
All of this is done by high-level function calls, and it is not necessary to write low-level function calls to parse the ASN.1.
The power of linking is one of the most important developments that the World Wide Web offers to the scientific and research community. By providing a convenient and effective means for sharing ideas, linking helps scientists and scholars promote their research goals.
Any Entrez database record, e.g., a nucleotide sequence, a taxonomic record, a protein structure, or a PubMed abstract, can be linked to Web resources external to NCBI via LinkOut. The
The URLs to LinkOut resources are all provided by the person or organization that owns or created the resource. Links can be provided in any URL syntax, and providers of links may choose as much or as little access to their resource as they wish. Providers use one format to submit links to all Entrez databases.
LinkOut is in itself an Entrez database that holds all the linking information to external resources. The separation of the data records (e.g., PubMed abstracts) from the external linking information (e.g., URLs to journal articles on a publisher's Web site) enables both the external link providers and NCBI to manage linking in a flexible manner. This means that if links to external resources change, such as in the case of a Web site redesign, this will not affect the Entrez database records, and linking information can be updated as frequently as necessary.
The LinkOut database contains information on the relationship between a link and all of the applicable unique Entrez ID numbers (UIDs). By taking advantage of the interconnectivity among Entrez nodes, the linking information is presented seamlessly and efficiently.
LinkOut information is submitted in XML, defined by the LinkOut Document Type Definition (
Linking information is supplied in two elements: the Provider element, which specifies information about a link provider; and the LinkSet element, which describes information about the link. Each element should be submitted to NCBI in a separate file. Identity files contain the Provider element, and Resource files contain the LinkSet element.
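The division of labor between the two file types can be checked mechanically from the root element of each file. The following Python sketch, which uses only the standard library, classifies a LinkOut XML document as an Identity file or a Resource file. The function name and the return values "identity" and "resource" are hypothetical labels chosen for this example, and the sketch does not validate the document against the LinkOut DTD.

```python
import xml.etree.ElementTree as ET

# Root element of each LinkOut file type, as described in the text.
ROOT_TO_KIND = {"Provider": "identity", "LinkSet": "resource"}


def linkout_file_kind(xml_text: str) -> str:
    """Return 'identity' or 'resource' according to the root element.

    Raises ValueError for any other root element. Malformed XML raises
    xml.etree.ElementTree.ParseError from the parser itself.
    """
    root = ET.fromstring(xml_text)
    try:
        return ROOT_TO_KIND[root.tag]
    except KeyError:
        raise ValueError(f"unexpected root element <{root.tag}>") from None
```

A check of this kind could be run before submission to confirm, for instance, that the file named providerinfo.xml really contains a Provider element.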
The Identity file is always called providerinfo.xml. It describes the identity of a provider, including an ID (ProviderId) and an abbreviated name (NameAbbr) assigned by NCBI, the provider's name, and other general information about the provider. There should be only one providerinfo.xml file for each provider (see
The Resource file, which contains the LinkSet information, specifies a set of Entrez records with a valid Entrez query, a specific rule to build the link to an external resource, and description of the resource using the
Terms used in SubjectType and Attribute elements are controlled so that LinkOut resources are described in a systematic manner. Resources are presented to users by SubjectType on the LinkOut display page (and within the Cubby system), which makes it easier to browse and access available resources. Attributes can be used to describe the nature of a LinkOut resource (e.g., whether the resource requires a subscription or registration to access the content). A short text string may be used in the UrlName element to describe a resource. UrlName is typically used when the allowed SubjectType and Attribute terms cannot describe the resource adequately or when multiple links are available from one provider for a single Entrez record.
All links from Entrez are generated on a daily basis so that new or modified Entrez records will have accurate LinkOut resources connected to them. Once a day, all LinkOut files are parsed according to the LinkOut DTD, and the LinkOut database is rebuilt, relating the Entrez UIDs with the link information specified in the LinkSet XML files.
A LinkOut record consists of a link and the associated information, including its URL and all descriptive terms (SubjectType, Attribute, and UrlName) pertaining to the link. The Entrez UIDs applicable to the link are indexed to associate this information to the corresponding Entrez databases. As explained in
To facilitate search and retrieval of LinkOut resources, there are a number of filters in the LinkOut-enabled Entrez databases. These filters, although not part of the LinkOut database, use the result generated in the LinkOut indexing process.
The filters are all prefixed with
To use these filters to retrieve a set of Entrez records with LinkOut resources, the filter term can be entered as a search. For example, in PubMed, searching
LinkOut resources should be directly relevant to specific subjects of the Entrez records to which they will be linked, thus providing further research resources for Entrez users. The information and its delivery system should be of high quality and must not, through typographic or factual errors, omissions, or other flaws or inconsistencies, mislead, hinder, or frustrate the research efforts of Entrez users. The resources should be easy to use and navigate. Resources from professional societies, government agencies, educational institutions, or individuals and organizations that have received grants from major funding organizations are preferred.
Participation in LinkOut is voluntary. Providers need to submit two types of files to describe the LinkOut resources, Identity files and Resource files (see Boxes
A list of
All access restrictions will still apply. For example, if access to a database is limited by the user IP address, access will only be allowed via computers within an approved IP range; if access is password protected, the password must still be entered.
Interested parties can consult the following guides for more details:
A number of tools are available to facilitate participation in LinkOut:
Additional tools are being developed to assist other groups of providers. Interested parties can subscribe to announcement lists described in
LinkOut resource providers can communicate with NCBI's LinkOut team in a number of ways. Users and providers can write to
The Reference Sequence (RefSeq) database provides a biologically non-redundant collection of DNA, RNA, and protein sequences. Each RefSeq represents a single, naturally occurring molecule from a particular organism. RefSeqs are frequently based on GenBank records but differ in that each RefSeq is a synthesis of information, not a piece of primary research data in itself. Similar to a review article in the literature, a RefSeq is an interpretation by a particular group at a particular time. RefSeqs can be retrieved in several different ways: by searching the Entrez Nucleotide or Protein database, by BLAST searching, by FTP, or through links from other NCBI resources.
The goal of the NCBI
As a non-redundant collection of sequences, RefSeq offers a significant advantage during database searches or when identifying sequences (whether by BLAST, text, or accession queries, or inclusion in a local custom database). RefSeq represents an objective and experimentally verifiable definition of non-redundancy by providing one example of each natural biological molecule per organism. For some organisms, the RefSeq collection includes alternatively spliced transcripts that share some identical exons, or identical proteins expressed from these alternatively spliced transcripts, or close paralogs or homologs. RefSeq provides the substrate for a variety of objective conclusions about non-redundancy based on clustering identical sequences or families of related sequences.
RefSeq is unique in providing a large, multi-species, curated sequence database that explicitly links chromosome, transcript, and protein information; it establishes a baseline for integrating sequence, genetic, expression, and functional information into a single, consistent framework. RefSeq is substantially based on GenBank sequence records (
Note that although based upon GenBank, RefSeq is distinct from GenBank. GenBank represents the sequence and annotations that are supplied by the original authors and is never changed by others. GenBank remains the primary sequence repository. RefSeq is one of many possible “review articles” based on that essential archive.
The RefSeq collection establishes a consistent baseline and clear model of the central dogma. RefSeq standards support genome annotation, gene characterization, mutation analysis, expression studies, and polymorphism discovery. The RefSeq collection offers advantages in:
Facile identification of sequence standards for genomes, transcripts, or proteins
Genome annotation
Comparative genomics
Reduction of redundancy in clustering approaches
Providing a foundation for unambiguous association of functional information (supports navigation)
Accession prefix | Molecule type |
---|---|
NC_ | Complete genomic molecule |
NG_ | Genomic region |
NM_ | mRNA |
NP_ | Protein |
NR_ | RNA |
NT_ [a] | Genomic contig |
NW_ [a] | Genomic contig (WGS [b]) |
XM_ [a] | mRNA |
XP_ [a] | Protein |
XR_ [a] | RNA |
NZ_ [c] | Genomic (WGS) |
ZP_ [a] | Protein, on NZ_ |
[a] Computed.
[b] Assembly of Whole Genome Shotgun (WGS) sequence data.
[c] An ordered collection of WGS for a genome.
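Because the prefix determines the molecule type, a script can classify an accession without retrieving the record. The following Python sketch encodes the prefix table above as a dictionary. The names `REFSEQ_PREFIXES` and `refseq_molecule_type` are hypothetical, and the sketch assumes that the prefix is always the first three characters of the accession.

```python
from typing import Optional

# Accession prefix -> molecule type, copied from the table above.
REFSEQ_PREFIXES = {
    "NC_": "Complete genomic molecule",
    "NG_": "Genomic region",
    "NM_": "mRNA",
    "NP_": "Protein",
    "NR_": "RNA",
    "NT_": "Genomic contig",
    "NW_": "Genomic contig (WGS)",
    "XM_": "mRNA",
    "XP_": "Protein",
    "XR_": "RNA",
    "NZ_": "Genomic (WGS)",
    "ZP_": "Protein, on NZ_",
}


def refseq_molecule_type(accession: str) -> Optional[str]:
    """Return the molecule type for a RefSeq accession, or None.

    None means the first three characters are not a known RefSeq prefix,
    as is the case for GenBank accessions, which contain no underscore.
    """
    return REFSEQ_PREFIXES.get(accession[:3].upper())
```

For example, `refseq_molecule_type("NP_456444.1")` returns "Protein", whereas the GenBank accession M73307 yields None.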
The current RefSeq collection includes sequences from over 2,000 distinct taxonomic identifiers, ranging from viruses to bacteria to eukaryotes. It represents chromosomes, organelles, plasmids, transcripts, and over 700,000 proteins. Every sequence is assigned a stable Accession number, version number, and gi number; older versions remain available if a sequence is updated over time.
RefSeq updates are provided on a daily basis, as needed. New records may be added to the collection, or existing records may be updated to reflect sequence or annotation changes or as part of a bulk update from a collaborator. New and updated records are made available in Entrez as soon as possible. The FTP site also provides daily update information (see below).
RefSeq records appear similar in format to the GenBank records from which they are derived; distinguishing features include a unique accession prefix that includes an underscore, which is never present in a GenBank accession (
Code | Description |
---|---|
GENOME ANNOTATION | The RefSeq record is provided via automated processing and is not subject to individual review or revision between builds. |
INFERRED | The RefSeq record has been predicted by genome sequence analysis, but it is not yet supported by experimental evidence. The record may be partially supported by homology data. |
PREDICTED | The RefSeq record has not yet been subject to individual review, and some aspect of the RefSeq record is predicted. |
PROVISIONAL | The RefSeq record has not yet been subject to individual review. The initial sequence-to-gene name associations have been established by outside collaborators or NCBI staff. |
REVIEWED | The RefSeq record has been reviewed by NCBI staff or by a collaborator. The NCBI review process includes assessing available sequence data and the literature. Some RefSeq records may incorporate expanded sequence and annotation information. |
VALIDATED | The RefSeq record has undergone an initial review to provide the preferred sequence standard. The record has not yet been subject to final review, at which time additional functional information may be provided. |
WGS | The RefSeq record is provided to represent a collection of whole genome shotgun sequences. These records are not subject to individual review or revisions between genome updates. |
The RefSeq database is compiled through several processes including collaboration, extraction from GenBank, and computation. Each molecule is annotated as accurately as possible with the correct organism name, correct gene symbol for that organism, and informative protein names whenever possible. Collaborations with authoritative groups outside of NCBI provide a variety of information ranging from curated sequence data, nomenclature, feature annotations, and links to external organism-specific resources. If a collaboration has not been established, then NCBI staff assembles the data from GenBank. Each record has a tag indicating the level of curation it has received (
In cases when a molecule is represented by multiple sequences for an organism in GenBank, an effort is made by NCBI staff to select the “best” sequence to instantiate as a RefSeq. The goal is to avoid known mutations, sequencing errors, cloning artifacts, and erroneous annotation; should an existing RefSeq be identified with a problem of this type, it is corrected. Sequences are validated to confirm that the genomic sequence corresponding to an annotated mRNA feature matches the mRNA sequence record, and that coding region features really can be translated into the corresponding protein sequence.
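The check that a coding region translates into the annotated protein can be illustrated with a few lines of Python. This is a minimal sketch and not the validation code used by NCBI: it assumes the standard genetic code, a coding sequence that starts in frame, and a plain string comparison. The function names are hypothetical. Cases listed elsewhere in this chapter, such as selenocysteine proteins, non-AUG translation start sites, and RNA editing, would need special handling that is not attempted here.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(cds: str) -> str:
    """Translate an in-frame coding sequence; '*' marks a stop codon.

    Codons containing anything other than A, C, G, T (or U) become 'X'.
    """
    cds = cds.upper().replace("U", "T")
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length is not a multiple of 3")
    return "".join(
        CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, len(cds), 3)
    )


def cds_matches_protein(cds: str, protein: str) -> bool:
    """True if the CDS translates to the protein (one trailing stop ignored)."""
    translated = translate(cds)
    if translated.endswith("*"):
        translated = translated[:-1]
    return translated == protein.upper()
```

A mismatch reported by `cds_matches_protein` does not say which record is wrong; it only flags the pair for review.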
RefSeq records may be added to the collection, or existing records may be updated, on a daily basis. Separate working groups that use distinct process pipelines compile the RefSeq collection for different organisms. RefSeq records are provided by collaboration and the following three pipelines:
Genome Annotation pipeline
LocusLink pipeline
Entrez Genomes pipeline
Organism | Collaborator |
---|---|
|
Saccharomyces Genome Database ( |
|
The Institute for Genomic Research ( |
|
|
|
Drosophila Sequencing Consortium, |
We welcome collaborations whenever authoritative groups outside NCBI are willing to provide sequences, nomenclature, annotations, or links to phenotypic or organism-specific resources. For some species, the RefSeq collection is curated entirely by a collaborating authoritative group that provides both the sequences and annotations. Other species may be provided via varying levels of collaborative efforts. For example, a
NCBI is providing annotation of genomic sequence data for some genomes including some microbial species, human, and mouse. These pipelines are automated and yield genomic, transcript, and protein RefSeq records (records that are provided vary by organism). Data are refreshed periodically, and for the eukaryotic annotation pipeline, records are not subject to individual incremental updates or manual curation (see
Collaboration with authoritative groups including:
FlyBase
Human Gene Nomenclature Committee (HGNC)
Mouse Genome Informatics (MGI)
OMIM
RATMAP
Rat Genome Database (RGD)
WormBase
Zebrafish Information Network (ZFIN)
Gene Family Authorities
In-House Processing:
Extraction from GenBank
Genome Annotation pipeline
Homology analysis
In-house curation
UniGene analysis
The RefSeq curation process may result in correction of errors as well as provision of additional sequence information and feature annotation, as indicated below.
Remove chimeric sequences.
Remove vector and linker sequence.
Remove sequences annotated with incorrect organism information.
Resolve sequence-to-locus mis-associations.
Correct apparent sequence errors as identified through sequence analysis and personal communication; an attempt is made to reconcile sequence differences to finished genomic sequence.
Modify the extent of the original GenBank CDS annotation, as determined through sequence analysis (including homology considerations), literature review, and personal communication.
Extend UTRs based on publicly available sequence data or literature review.
Provide RefSeq with full-length CDS from a series of overlapping partial GenBank sequences.
Provide splice variants and corresponding protein variants.
Alias symbols and names
Brief description of transcript and protein variants
Publications
Summary of gene, protein function
Alternate translation start sites
Enzyme Commission number
Mature peptide products
Non-AUG translation start sites
Protein domains
Selenocysteine proteins
Indication of transcript completeness, as known
Poly(A) signal, site
RNA editing
Variation features
LocusLink supports the generation of RefSeq records for human, mouse, rat, fly, zebrafish, and cow; the RefSeq process for these organisms takes advantage of the descriptive information available in LocusLink. Multiple collaborations support the collection of this descriptive information (
This data set consists of genomic regions, transcripts, and proteins. Records representing genomic regions are provided primarily to support more comprehensive genome-level annotation and may represent gene clusters, single genes, or pseudogenes. Each record is annotated with the level of curation it has undergone; records may have an INFERRED, PREDICTED, PROVISIONAL, REVIEWED, or VALIDATED status.
Sequences in LocusLink records enter RefSeq by a mixture of computational analysis process, collaboration, and in-house curation. As illustrated in
Sequence-to-locus association conflicts (e.g., close paralogs)
Vector contamination
The GenBank record used is a genomic record with “not experimental” annotation
The protein sequence is annotated as partial
The RefSeq transcript would be suspiciously long
Once a gene is defined and associated with sufficient sequence information in LocusLink, it can be pushed into the RefSeq pipeline. New genes are added to LocusLink by collaborators and in-house review. The RefSeq process is initiated by an automated BLAST step, which uses the sequence information in LocusLink as a query against GenBank to identify the longest mRNA for each locus. This sequence is represented as a provisional, predicted, or inferred RefSeq record. Subsequent review and curation may result in a sequence or annotation update (as described in
Conflicts and problems of these types must be resolved before the RefSeq can become public. Records are subject to validation to correct annotation errors and to provide annotation in a more consistent format. LocusLink descriptive information including official nomenclature, alias symbols, alternate descriptive names, map location, and additional citations are applied to the records. Records at this stage have a PROVISIONAL, PREDICTED, or INFERRED status.
These initial records then enter into the in-house review pipeline, where additional manual curation may be applied. The review process prioritizes known genes, gene families, and problem cases that are identified through user input or analysis. The curation process includes analysis of a suite of precomputed BLAST results, literature review, review of available database and Web resources (both in-house and external), and collaboration to curate the nucleotide and protein sequence data and to apply additional annotation and descriptive information to both the RefSeq sequence record and to the LocusLink database.
Additional ongoing review is applied to identified problem sets; for example, periodic analysis may be carried out to identify sequences that include repeats, that have poor-quality splice sites, are very short or very long, or that are extremely similar to sequences associated with a different LocusID. Review of problem sets may result in discontinuing a RefSeq, LocusID, or both. A RefSeq is suppressed if it is found to represent a transcribed repeat element, to be derived from the wrong organism (i.e., the GenBank sequence it was based on does not have accurate organism annotation), or to not represent a “gene”. An Entrez query will still retrieve a suppressed record, with a disclaimer appearing on the query result document summary ( (
Input from the research community is welcome to further improve the quality of this data set. Interested parties are invited to contact us by sending an email to the NCBI Help Desk (
Web Page | Web Site |
---|---|
Entrez Genomes Home | |
Prominent Organisms | |
Microbes | |
Viral Genomes | |
Organelle Genome Resources | |
Plant Genomes Central | |
The Entrez
In general, these RefSeq records undergo an initial automated validation step before being released. The resulting record is a copy of a GenBank entry, but validation may make some corrections and provides more consistent feature annotation. Records provided via collaboration have a status of REVIEWED and are attributed to the collaborating group. Records provided by in-house processing have a PROVISIONAL, VALIDATED, or REVIEWED status.
Entrez Genomes record processing falls into four primary categories: chromosomes, microbial genomes, small complete genomes, and viruses.
RefSeq records in this category are usually submitted directly to Entrez Genomes as a complete chromosome sequence representing an assembly of individual clones that are themselves available in GenBank. Examples of this type of RefSeq include
These records are curated by the organism-specific collaborating group and undergo NCBI validation before being released. The validation process checks for logical conflicts in the annotation (which are reported back to the submitting group) and makes small modifications to format the submission as a RefSeq record. This category of RefSeq record is displayed using the NCBI Map Viewer (
RefSeq | Organism |
---|---|
Microbial complete genomes are submitted to GenBank, but because the GenBank/EMBL/DDBJ collaboration agreement limits the size of a record to 350 kb, they are made available in GenBank as a series of Accession numbers. RefSeq does not need to adhere to this upper limit, and therefore complete genome sequences are available as RefSeq records in Entrez Genomes.
These records are subject to additional automated validation and computational analysis; the computational analysis results are then manually reviewed, resulting in more complete annotation of the RefSeq record (see
Smaller complete genomic sequences, including organelles, plasmids, and viruses, are based on single GenBank records. Automatic processing scans GenBank daily for complete genome updates and new submissions; identified records are candidates for a complete genome RefSeq. These records are manually evaluated to make the final decision; if more than one genomic sequence is available for the genome, then only one is selected to become the RefSeq standard. This selection takes into account various factors including the level of annotation, strain information, and community input.
Some of these RefSeq records undergo manual curation and have a REVIEWED status. For example, viral genomes are re-annotated using GeneMarkS in collaboration with Mark Borodovsky's Bioinformatics Group at the Georgia Institute of Technology. Following
RefSeq records can be accessed by direct query, BLAST and FTP download, or indirectly through links provided from several NCBI resources. In addition, RefSeqs are included in some computed resources, and therefore links may be found from those pages to individual RefSeq records. For example, the RefSeq collection is included in Clusters of Orthologous Groups of proteins (COG;
BLAST results
BLink (precomputed protein BLAST results)
CDD
dbSNP
Entrez
LocusLink
Map Viewer
OMIM
UniGene
UniSTS
The distinct Accession number format used for RefSeq records (
Query | Accession prefix | RefSeq category retrieved |
---|---|---|
 | NC_, NG_, NT_, NW_, NZ_, NM_, NR_, XM_, XR_, NP_, XP_, ZP_ | All |
 | NC_, NG_, NM_, NR_, NP_ | REVIEWED, PROVISIONAL, PREDICTED, INFERRED, and VALIDATED |
 | NC_, NG_, NM_, NR_, NP_ | REVIEWED records |
 | NC_, NM_, NR_, NP_ | VALIDATED records |
 | NC_, NG_, NM_, NR_, NP_ | PROVISIONAL records |
 | NM_, NR_, NP_ | PREDICTED records |
 | NM_, NR_, NP_ | INFERRED records |
 | NT_, NW_, XM_, XR_, XP_, ZP_ | Genome annotation model records |
Retrieves those NT_ and NW_ records that have gene annotation.
RefSeq records can be retrieved by querying various databases in the Entrez system (
Queries against the nucleotide and protein databases can be restricted to the RefSeq collection by using the The
A subset of the RefSeq collection is represented in LocusLink, a gene-centered database, including human, mouse, rat, Drosophila, zebrafish, cow, and HIV-1.
Alternate names
Enzyme Commission numbers
Gene summaries
Publications
Related GenBank accessions
Transcript variant descriptions
RefSeq transcript and protein records are included in the non-redundant nucleotide and protein BLAST databases, and genomic sequences are included in the “chromosome” database; therefore, when a query sequence matches a RefSeq record, the hit is included in the BLAST results (see In a BLAST summary list of results, the abbreviation
Currently, RefSeq records generated by different pipelines are available in different FTP areas [
Besemer J, Lomsadze A, Borodovsky M. GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions. Nucleic Acids Res 29:2607-2618; 2001.
Blake JA, Eppig JT, Richardson JE, Davisson MT. The Mouse Genome Database (MGD): expanding genetic and genomic resources for the laboratory mouse. The Mouse Genome Database Group. Nucleic Acids Res 28:108-111; 2000.
Boguski MS, Schuler GD. ESTablishing a human transcript map. Nat Genet 10:369-371; 1995.
Coffin JM, Hughes SH, Varmus HE. Retroviruses. Plainview (NY): Cold Spring Harbor Laboratory Press; 1997.
FlyBase Consortium. The FlyBase database of the Drosophila Genome Projects and community literature. The FlyBase Consortium. Nucleic Acids Res 27:85-88; 1999.
Hamosh A, Scott AF, Amberger J, Valle D, McKusick VA. Online Mendelian Inheritance in Man (OMIM). Hum Mutat 15:57-61; 2000.
Lukashin A, Borodovsky M. GeneMark.hmm: new solutions for gene finding. Nucleic Acids Res 26:1107-1115; 1998.
Marchler-Bauer A, Panchenko AR, Shoemaker BA, Thiessen PA, Geer LY, Bryant SH. CDD: a database of conserved domain alignments with links to domain three-dimensional structure. Nucleic Acids Res 30:281-283; 2002.
Pruitt KD, Katz KS, Sicotte H, Maglott DR. Introducing RefSeq and LocusLink: curated human genome resources at the NCBI. Trends Genet 16:44-47; 2000.
Tatusova TA, Karsch-Mizrachi I, Ostell JA. Complete genomes in WWW Entrez: data representation and analysis. Bioinformatics 15:536-543; 1999.
Twigger S, Lu J, Shimoyama M, Chen D, Pasko D, Long H, Ginster J, Chen CF, Nigam R, Kwitek A, Eppig J, Maltais L, Maglott D, Schuler G, Jacob H, Tonellato PJ. Rat Genome Database (RGD): mapping disease into the genome. Nucleic Acids Res 30:125-128; 2002.
Westerfield M, Doerry E, Kirkpatrick AE, Douglas SA. Zebrafish informatics and the ZFIN database. Methods Cell Biol 60:339-355; 1999.
White JA, McAlpine PJ, Antonarakis S, Cann H, Eppig JT, Frazer K, Frezal J, Lancet D, Nahmias J, Pearson P, Peters J, Scott A, Scott H, Spurr N, Talbot C Jr, Povey S. Guidelines for human gene nomenclature (1997). HUGO Nomenclature Committee. Genomics 45:468-471; 1997.
LocusLink organizes information from collaborating public databases and from other groups within NCBI to provide a locus-centered view of genomic information from human, mouse, rat, zebrafish,
LocusLink serves as a hub of information for
LocusLink gathers information from databases both within and external to NCBI. These are scanned using a combination of automatic and manual techniques. LocusLink collaborates with organism-specific and nomenclature databases, from which a combination of sequence data and names are used to initiate new LocusLink records. LocusLink also finds new records by a weekly review of submissions to GenBank and new versions of UniGene (which are released periodically). Each new LocusLink record is assigned a unique identifying number that is tracked, a LocusID.
LocusLink records are used in turn by UniGene, dbSNP, and the organism-specific and nomenclature databases. For example, a new LocusLink record created on the basis of a sequence submitted to GenBank can be the basis for a new entry in an organism-specific database. This new defining sequence of the LocusLink record is “
LocusLink also feeds new sequences into the Reference Sequence (RefSeq) project. The prerequisite for this process is that the sequence encodes a complete protein. For more details about how eukaryotic RefSeq mRNA and protein records are initiated, see
Although historically many of the GenBank sequences used to initiate LocusLink records represented characterized genes sequenced by individual research labs, an increasing number of records in LocusLink are generated as a part of NCBI's genome annotation pipeline (see
In summary, no matter whether the data are stored as a result of curation or computation, the central function of LocusLink is to establish an accurate connection between the defining sequence for a locus and other descriptors for that locus. With such a connection in place, it is possible to:
Establish a RefSeq for that locus.
Identify or validate putative
Support the NCBI annotation pipeline based on mRNA sequence alignment.
As will be discussed in more detail in the following sections, other NCBI resources make use of the LocusLink LocusID-to-sequence connection to provide appropriate nomenclature and other identifiers for sequences within their scope [see also
Field | Meaning | Example of search term |
---|---|---|
[chr] | Chromosome number | 21[chr] |
[loc] | LocusLink ID | 4292[loc] |
[mim] | OMIM number | 300200[mim] |
[sym] | Gene symbol | abc*[sym] |
[pm] | PubMed ID | 123456[pm] |
[ngi] | Nucleotide gi number | 223344[ngi] |
[pgi] | Protein gi number | 1234567[pgi] |
No space is allowed between the value and the field name.
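As an illustration of the rule above, the following sketch assembles a fielded search term from a value and one of the field names in the table. The function name `fielded_term` is hypothetical; only the field abbreviations and the no-space rule come from the table and its note.

```python
# Field abbreviations copied from the table above.
FIELDS = {"chr", "loc", "mim", "sym", "pm", "ngi", "pgi"}


def fielded_term(value, field):
    """Join a value and a field name with no space between them, e.g. 21[chr]."""
    if field not in FIELDS:
        raise ValueError(f"unknown field: {field!r}")
    return f"{str(value).strip()}[{field}]"


print(fielded_term(21, "chr"))       # 21[chr]
print(fielded_term("abc*", "sym"))   # abc*[sym]
```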
Term | Meaning |
---|---|
disease_known | Human locus associated with a phenotype defined by a MIM number (may be only that phenotype) |
has_homol | Associated with a HomoloGene record |
has_omim | Associated with an OMIM record |
has_refseq | Associated with a RefSeq |
has_seq | Associated with nucleotide sequence |
has_snp | Associated with a dbSNP record |
seq_map | Either the gene or an STS in this record has been localized to the sequence-based map available from Map Viewer |
type_dseg | A DNA segment. May include BAC or YAC ends. |
type_gene_other | A gene that does not fall into the category type_gene_protein |
type_gene_protein | A gene that encodes a protein product. Does not include unreviewed, putative genes based only on modeling; named genes that encode only part of a protein product, such as immunoglobulin variable, diversity, joining, or constant regions; or genes that exhibit somatic rearrangement. |
type_pheno | Characterized as a mapped phenotype only |
type_pseudo | A pseudogene |
type_qtl | A phenotype characterized as a QTL only |
type_region | A region on the genome. Examples are named gene clusters and viral integration sites. |
The
LocusLink supports several types of queries, including gene names, GenBank Accession numbers, and other resource-specific ID numbers such as
A query on the LocusLink homepage returns a report page that is organized in a tabular format, as shown in
This section is organized according to the features and subdivisions seen on a LocusLink report page, from top to bottom. We will illustrate the descriptions of the features by using screen shots from the LocusLink report of BRCA2 and CDKN1A interacting protein (BCCIP; LocusID
On each LocusLink report page, the first set of links is a table of contents for the sections included in the report being displayed, which are hyperlinked to the appropriate section.
As part of the genome annotation process at NCBI, GenBank and RefSeq mRNAs are aligned to genomic
The diagram at the top of a LocusLink report represents the intron/exon structure of the gene as determined by the Genome Annotation Pipeline (see
This abbreviated output displays the longest alignment, where tick marks indicate the positions of the exons. Genes with a single exon are represented as a solid thick bar with flanking horizontal lines. The diagram is hyperlinked to the Evidence Viewer, which in turn is linked to the appropriate Entrez Nucleotide records by the Accession number of each sequence. Mousing over the individual entries in the Evidence Viewer displays a definition line from that sequence record.
where (from left to right) they represent links to: PubMed publications, UniGene, Map Viewer, dbSNP variation database, the Genome Database, Ensembl, and UCSC Human Genome Browser. The buttons are used for connecting to related resources within NCBI or to external genomic databases. Others in this category include Ace Viewer, OMIM, MGC, MGI, and RGD. More links for any record may be listed at the end of the report page in the
LocusLink uses symbols and gene names from official authority lists when available. If no connection to official nomenclature can be made, symbols and names are selected as available from the defining sequence record. If sequence and positional homology (synteny) suggest that a locus not named officially in one species is orthologous to a named gene in another species, the symbol from the ortholog may be included in the LocusLink record. If no symbol can be identified for a new locus, the letters LOC are prepended to the LocusID. Once an official or meaningful symbol has been identified, that LOC symbol is discontinued (because the record will still be searchable and identified by the LocusID itself).
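The fallback order described above can be sketched as follows. This is an illustration only: the function and argument names are hypothetical, and ranking a symbol from the defining sequence record ahead of an ortholog symbol is an assumption taken from the order in which the paragraph lists them.

```python
def choose_symbol(locus_id, official=None, from_sequence=None, from_ortholog=None):
    """Pick a display symbol, falling back to LOC + LocusID when none is known."""
    # Assumed precedence: official list, then defining sequence record, then ortholog.
    for candidate in (official, from_sequence, from_ortholog):
        if candidate:
            return candidate
    return f"LOC{locus_id}"


print(choose_symbol(123456))                   # LOC123456
print(choose_symbol(123456, official="ABC1"))  # ABC1
```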
Information on the nomenclature authorities that collaborate with LocusLink can be viewed by following the
Resource | Keys into LocusLink | Method of matching resource to LocusID |
---|---|---|
HomoloGene | mRNA | Alignment |
Map Viewer | mRNA, gene feature | Alignment |
UniGene | mRNA, protein gi | Clustering |
UniSTS | mRNA | e-PCR |
| | Genomic annotation | e-PCR |
| | Marker name | Publications |
Several NCBI databases use the nomenclature maintained by LocusLink. These names are incorporated into other databases based primarily on name–LocusID–sequence connections, i.e., sequence comparisons identify similar sequences in LocusLink, all of which have a LocusID; from this, the associated nomenclature can be extracted and applied to the original sequence from the collaborating database (
This section of the LocusLink report page may include any or all of the following categories of information.
The
D segment
RNA, ribosomal
RNA, small cytoplasmic
RNA, small nuclear
RNA, small nucleolar
RNA, transfer
gene with no protein product
gene with protein product, demonstrates somatic rearrangement
gene with protein product, function known or inferred
gene with protein product, function unknown
gene, segment
model,
model,
model, supported by EST alignments
model, supported by mRNA alignments
model, supported by mRNA and EST alignments
phenotype only
pseudogene
pseudogene, transcribed
quantitative trait locus (QTL)
region
regulatory element
repetitive element
unknown
Information about the function of a gene and its RNA or protein products is gathered from several sources.
For human genes, links to OMIM, if available, are given under the heading Phenotype. For all genomes, links to the published literature are provided under the heading Gene References into Function (GeneRIF). Any user can submit a reference to a paper they think is important for a locus, but beginning in February 2002, these links are also supplied by
Increasingly, the Gene Ontology™ (GO) vocabulary terms are incorporated from GO's
This section reports other loci, and/or the proteins that they encode, that have a defined relationship to the locus being displayed.
At present, this section includes: (
We plan to expand this section to include other types of pairwise relationships. For example, “overlap”, “interspersed”, and “protein binds” will refer to genes that overlap, that are contained in or contain another gene, and for which the encoded proteins interact, respectively. Although primarily for relationships within a species, when proteins of one species interact with those of another (e.g., infectious agents), such a relationship will be reported in this section as well.
This section reports map data for the locus. The location listed is the same as that on the LocusLink query result page, but if there are any conflicts, additional locations may be reported, along with the source of the conflicting data and a link to that resource.
For the human and mouse genomes, if no published map location has been identified and if the gene has been aligned to NCBI's genome assembly, the cytogenetic position is recalculated with each build. The conversion files used in the process are
Genetic and physical map positions are incorporated from the published maps used in Map Viewer. Rather than report all position data for any locus in any coordinate system, links are provided to Map Viewer via the
In LocusLink, markers are defined as sequence tagged sites (STSs) associated with the locus. LocusLink reports markers either as the locus itself or as a marker that has some relationship with a gene. LocusLink does not store all of the markers available for a genome, which is the function of
The marker data that LocusLink reports come primarily from any of the following paths: (1) a report from a genome-specific database that states that a marker is within a gene or locus; (2) for genes, a calculation based on
A LocusLink record includes several categories of
A second category in the RefSeq section reports sequences generated or annotated as part of an NCBI genome annotation project. The first Accession number(s) displayed is the genomic record (contig), as well as links to the genomic sequence containing the gene (
The
This section provides a list of sites that may have additional information. The OMIM and UniGene links are redundant, with the button links at the top of the page, but provide the respective ID numbers that are hidden in the button links. Other links shown here, such as
Maintenance | Type of locus |
---|---|
Tracked | Officially named genes and pseudogenes, for both nuclear and mitochondrial genomes, whether or not the final gene product is known to be a protein or an RNA. |
Tracked | Probable protein-coding genes, defined by one or more mRNAs. The function of the encoded protein is not necessarily known. |
Tracked | Mapped phenotypes, such as disease susceptibility loci or QTLs |
Tracked | Gene segments (such as coding regions for variable regions of immunoglobulin or T-cell receptors) |
Not tracked | Gene predictions from NCBI's genome annotation pipeline. |
LocusLink records can be categorized by two major criteria: the type of locus being described and the tracking maintained for the record (
LocusLink supports retrieval of inactive records in the following ways:
If a non-interim record has been discontinued and the record appeared to have been created in error (i.e., could not be merged into or made secondary to another record), then the withdrawn record can be retrieved. These records are clearly noted with the term “withdrawn” on the query result table and the report page. The report page also includes an explanation for why the record was discontinued.
If a locus record has been made secondary to another record, a query on the secondary ID will take you to the current record.
Merges are reported in the FTP file LocusID_history, and the status of all queryable LocusIDs is reported in the file LL_tmpl.gz (
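Redirecting a secondary LocusID to its current record can be sketched as a lookup that follows merges until it reaches an ID that has not itself been merged. The layout of LocusID_history is not described here, so the merges are represented as a plain dictionary (old ID to new ID); that representation, the function name, and the ID values are assumptions made for illustration.

```python
def current_locus_id(locus_id, merged_into):
    """Follow old -> new merges until reaching an ID that was not merged further."""
    seen = set()
    while locus_id in merged_into:
        if locus_id in seen:  # guard against a malformed, circular history
            raise ValueError(f"circular merge history at {locus_id}")
        seen.add(locus_id)
        locus_id = merged_into[locus_id]
    return locus_id


# Hypothetical merge history: 101 was merged into 202, which was merged into 303.
merges = {101: 202, 202: 303}
print(current_locus_id(101, merges))  # 303
print(current_locus_id(999, merges))  # 999 (never merged)
```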
Records are added to LocusLink in several ways:
External resources and collaborators provide information on new, officially named genes and the sequences that define them.
New sequences are released from GenBank/DDBJ/EMBL and are identified as being from a gene not yet described in LocusLink.
If sequence alignments indicate that a new sequence matches an interim locus annotated in the Annotation pipeline, then the interim locus is converted to a “curated” one, and the sequence accession is added to that record.
Communications from the public. LocusLink provides three mechanisms for users: (
File or directory | Description |
---|---|
README | Documentation for the directory |
HomologyMaps | Directory of data files used in the human/mouse comparative map |
LL.out.gz | Tab-delimited file of descriptors for current LocusLink records |
LL.out_dm.gz | Drosophila subset of LL.out |
LL.out_dr.gz | Zebrafish subset of LL.out |
LL.out_hs.gz | Human subset of LL.out |
LL.out_mm.gz | Mouse subset of LL.out |
LL.out_rn.gz | Rat subset of LL.out |
LL_tmpl.gz | Current text file for displays on the LocusLink site |
LocusID_history | Report of locus_id merges |
homol_seq_pairs.gz | mRNA accession pairs determined by MegaBlast |
loc2UG | Current LocusID/UniGene cluster conversion table |
loc2acc | Current LocusID/GenBank accession report |
loc2cit | Current LocusID/PubMed ID/MedLine ID report |
loc2ref | Current LocusID/RefSeq accession report |
loc2sts | Current LocusID/UniSTS ID relationships |
mim2loc | Current LocusID/MIM ID relationships |
LocusLink can be obtained by
The file LL_tmpl.gz represents the text file for displays of the complete LocusLink site and is in a semistructured tag:value format, which makes it challenging to parse. A subset of these data is available in tab-delimited format, either for all species covered by LocusLink (LL.out) or in species-specific files (e.g., LL.out_hs and LL.out_dm). These files report the LocusID, symbols, names, map location, and identity of the contributing database.
The file loc2UG (the LocusID/UniGene cluster conversion table) is refreshed with each UniGene build, and homology reports are refreshed with new genome annotation builds. These files can be used to obtain names and sequences connected to a locus.
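A tab-delimited file such as LL.out can be read with a few lines of code. The sketch below is hedged: the column order in `COLUMNS` is an assumption based only on the list of contents given above (LocusID, symbol, name, map location, contributing database), and the sample rows are invented. Check the README in the FTP directory for the actual layout before relying on it.

```python
import csv
import io

# Assumed column order; the real file layout should be taken from the README.
COLUMNS = ["locus_id", "symbol", "name", "map_location", "source"]


def read_ll_out(handle):
    """Parse tab-delimited rows into dictionaries keyed by the assumed columns."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [dict(zip(COLUMNS, row)) for row in reader if row]


# Invented sample data standing in for a decompressed file.
sample = (
    "1\tSYM1\tExample gene one\t1p31\tSourceDB\n"
    "2\tSYM2\tExample gene two\t7q22\tSourceDB\n"
)
records = read_ll_out(io.StringIO(sample))
print(records[0]["symbol"], records[1]["map_location"])  # SYM1 7q22
```

For a compressed file, the same function could be given a handle opened with `gzip.open(path, "rt")`.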
Resource | Connection made | Basis |
---|---|---|
dbSNP | LocusID to Reference SNP ID | dbSNP accessions aligned to mRNAs, or within the boundary of 2000 nt upstream through 500 nt downstream of known exons |
Map Viewer | LocusID to cytogenetic, genetic, or sequence position | Reports of cytogenetic position or calculated from sequence assembly; reports of genetic position, connecting LocusIDs to sequence position based on alignment of accessions associated with LocusIDs |
RefSeq mRNAs and proteins | LocusID to mRNA and protein accessions | Calculated and curated LocusID/mRNA or protein_id relationships |
UniGene | LocusID to UniGene cluster ID | Based on the LocusID/mRNA or protein_id relationships, identifying the cluster ID and requiring that the LocusID/UniGene cluster ID relationship be 1:1 |
UniSTS | LocusID to UniSTS sts_id | Based on the LocusID/mRNA relationships or overlap in the genomic assembly, after positioning the STS by e-PCR |
The database supporting LocusLink houses more than only the unique loci identifiers for the genomes in its scope. It also records, whenever possible, the public sequence Accession numbers that define these loci and, along with its collaborators, applies several tests for data consistency. Thus, the relationships between LocusID and sequence Accession numbers are used by other databases at NCBI to convert information about mRNA or protein sequences to the LocusID and all other information associated with that LocusID (name, database cross-references, protein product, and others). These relationships are outlined in
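The dbSNP row of the table states a positional rule: a SNP is connected to a locus if it lies from 2000 nt upstream through 500 nt downstream of known exons. A minimal sketch of that test follows. It reads the rule as a single window around the span from the first to the last exon, and it assumes that "upstream" means lower coordinates on the plus strand and higher coordinates on the minus strand; both readings, and all names and coordinates, are assumptions for illustration.

```python
UPSTREAM = 2000    # nt upstream of the exons, from the table
DOWNSTREAM = 500   # nt downstream of the exons, from the table


def snp_in_gene_window(snp_pos, exons, strand="+"):
    """Return True if snp_pos falls within the flanked span of the exons.

    exons is a list of (start, end) pairs in chromosome coordinates.
    """
    first = min(start for start, _ in exons)
    last = max(end for _, end in exons)
    if strand == "+":
        low, high = first - UPSTREAM, last + DOWNSTREAM
    else:  # on the minus strand, upstream lies at higher coordinates
        low, high = first - DOWNSTREAM, last + UPSTREAM
    return low <= snp_pos <= high


exons = [(10000, 10200), (10500, 10800)]  # invented coordinates
print(snp_in_gene_window(8500, exons))         # True
print(snp_in_gene_window(7999, exons))         # False
print(snp_in_gene_window(12500, exons, "-"))   # True
```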
On LocusLink's static HTML pages, such as the homepage, there are links to general resources. Specific sites provide information for
There are many different approaches to starting a genomic analysis. These include literature searching, searching databases for gene names and other genomic features, performing sequence comparisons, or using map data to find gene information by position relative to other landmarks. The NCBI Map Viewer has been developed to facilitate this latter approach.
The purpose of this chapter is to provide a foundation for gaining maximum benefit from using the Map Viewer and related resources at NCBI. It is important to note that in this document, the term “map” refers to a position of a particular type of object in a particular coordinate system. This means, for example, that there is not one sequence map but a set of maps in sequence coordinates. Readers interested in precisely how sequence-based maps are annotated and assembled should refer to
First launched with the release of the sequence of
Map Viewer integrates map and sequence data from a variety of sources. The basic architecture and principle of Map Viewer can be applied to any complete or incomplete genome as long as map data exist to support it. Map Viewer is a powerful tool because it provides: (1) a mechanism to compare maps in different coordinate systems; (2) a robust query interface; (3) diverse options for configuring the display; (4) multiple functions to report and download maps and annotated information; (5) tools to manipulate nucleotide sequence such as ModelMaker (for constructing mRNAs from putative exon sequences); (6) connections to comprehensive data files for transfer by FTP; and (7) detailed descriptions of the objects displayed on the maps.
Feature | Coordinate system | Representative maps |
---|---|---|
| | Sequence (Mb), Radiation hybrid (cRay), Genetic (cM), Clone content (ordinal), Cytogenetic | STS, STSnw, G3, GM4, GeneMap'99, TNG, Marshfield, Genethon, deCode, Whitehead YAC, phenotype maps such as Quantitative Trait Loci (QTL) |
| | Sequence, Cytogenetic | Clone, BES, Components |
Expression | Sequence | SAGE tag, UniGene |
| | Sequence (Mb), Cytogenetic (band names) | Genes_seq, Genes_cyto |
Gene-related | Sequence, Cytogenetic | UniGene, GenomeScan, Mitelman recurrent breakpoint, morbid |
| | Sequence (Mb) | Variation |
Published accessions | Sequence (Mb) | GenBank |
Phenotype | Cytogenetic, Cytogenetic (abnormalities), Sequence | OMIM's morbid map, Mitelman's recurrent breakpoint, QTL (in progress) |
| | Sequence (Mb) | Component |
Homology | Sequence (Mb) | Indirectly via LocusLink or UniGene. For mouse and human, through the homology (hm) link to the mouse–human homology map |
The feature column lists the types of objects annotated on maps seen in Map Viewer. Those features in bold type are annotated on the RefSeqs; the rest are provided only from the Map Viewer, and the files are available for FTP transfer.
The different map types and coordinate systems that may contain a particular type of feature.
A partial enumeration of named maps that represents positions of this feature type.
Some of the annotation of genomic sequence carried out by NCBI is included in the genomic reference sequences (NC, NT, and NW Accession number format); however, other annotation is represented only in the Map Viewer and in the associated reports (
Resource | Description |
---|---|
| | Clone sequencing sequence status, STS content, and availability |
| | Single Nucleotide Polymorphisms (SNPs), polymorphisms, small-scale insertions/deletions, polymorphic repetitive elements |
| | Directory of key resources for the genome, with links to related resources and tutorials. The directory to guide pages is available from Genomic Biology. |
| | Locus-specific data for a subset of organisms with extensive links to related resources and sequence data |
| | Human genes and Mendelian disorders |
| | NCBI's curated, non-redundant RefSeqs |
| | Computed clusters of cDNA and Expressed Sequence Tag (EST) sequences from the same gene, with tissue expression information and links to related resources |
| | Unified, nonredundant database of sequence tagged sites (STSs) |
Feature annotation is computed primarily in two ways: (1) by alignment of the defining sequence to the genome; or (2) for sequence tagged sites (STSs), by e-PCR (
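The idea behind positioning an STS by e-PCR can be illustrated with a deliberately simplified sketch: find where the forward primer and the reverse complement of the reverse primer both occur, and keep placements whose product size is close to the expected size. This is not the e-PCR program itself. It assumes exact matches, searches one strand only, and uses invented primers and sequence; the function names and the `margin` parameter are hypothetical.

```python
def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_all(text, pattern):
    """Return every start index of pattern in text, including overlapping hits."""
    hits, i = [], text.find(pattern)
    while i != -1:
        hits.append(i)
        i = text.find(pattern, i + 1)
    return hits


def simple_epcr(genome, forward, reverse, expected_size, margin=50):
    """Return (start, end, size) for each primer-pair placement of plausible size."""
    rc_reverse = revcomp(reverse)
    products = []
    for f in find_all(genome, forward):
        for r in find_all(genome, rc_reverse):
            end = r + len(rc_reverse)
            size = end - f
            if r > f and abs(size - expected_size) <= margin:
                products.append((f, end, size))
    return products


# Invented sequence: forward primer at index 4, product of 32 nt.
genome = "TTTT" + "ACGTAC" + "A" * 20 + "GGATTC" + "TTTT"
print(simple_epcr(genome, "ACGTAC", "GAATCC", expected_size=32, margin=0))
# [(4, 36, 32)]
```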
In some cases, the position of these features may suggest the location of other genomic regions of interest. For example, the position of STS markers can help define the position of phenotypes such as quantitative trait loci (QTL). Although the best annotation of a gene or region is always through annotation by an expert researcher, automated annotation of genomes and comparison to that provided by experts can provide significant useful information. Experts interested in analyzing or assisting with genome annotation should contact us at
In addition to supporting the display of multiple maps in the same coordinate system (e.g., multiple sequence-based maps), Map Viewer also displays maps in different coordinate systems by calculating the correspondences among them (e.g., sequence to genetic). This is accomplished by: (
The identification of known genes within the genome assembly provides critical landmarks and functional context to the sequence data, which in turn makes it easier to traverse to other rich sources of gene and protein information, including publications, OMIM, RefSeq, Conserved Domain Database (CDD), and LocusLink.
The power of calculating correspondences between coordinate systems may be more apparent when considering a common application of Map Viewer, i.e., identifying candidate genes within a region defined by genetic markers. When markers are placed on both genetic and sequence maps, it is then possible to use the gene-related maps (gene, UniGene/EST, or
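The candidate-gene application can be sketched in a few lines once the flanking markers have sequence positions: take the interval between the two markers and report the genes that overlap it. The data structures, names, and coordinates below are invented for illustration and do not reflect how Map Viewer stores its maps.

```python
def genes_between_markers(marker_positions, genes, left, right):
    """List genes overlapping the sequence interval bounded by two markers.

    marker_positions maps marker name -> sequence position;
    genes maps gene name -> (start, end) in the same coordinates.
    """
    low, high = sorted((marker_positions[left], marker_positions[right]))
    ordered = sorted(genes.items(), key=lambda item: item[1][0])
    return [name for name, (start, end) in ordered if start <= high and end >= low]


# Invented markers and genes on one chromosome.
markers = {"M1": 1000, "M2": 5000}
genes = {"geneA": (500, 1200), "geneB": (2000, 3000), "geneC": (6000, 7000)}
print(genes_between_markers(markers, genes, "M2", "M1"))  # ['geneA', 'geneB']
```

Genes that only partly overlap the interval (geneA here) are kept, on the assumption that a candidate search should err toward inclusion.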
For many genomes, identifying and positioning chromosomes and genes within sequence blocks is an ongoing process. In those cases, the Map Viewer can be used to evaluate the evidence that supports the current representation of the sequence and visualize possible conflicts. Inconsistencies in map order or in the placement of any object can be seen in the Map Viewer; this is assisted in some cases by the use of color coding (Figures Potential inconsistencies in the order or orientation of sequence blocks can be investigated by displaying a genetic map ( A comparison of cDNA alignments (UniGene, RNA) and gene predictions (GenomeScan) to the genomic contig annotation can be achieved by displaying three maps simultaneously. The genomic contig (NT_024981.9) annotation is shown in the
For some genomes, the color-coded contig map displays whether the annotation is based on sequence assembled from draft or finished clones (blue, finished; green, whole genome shotgun; orange, draft). This is helpful when evaluating the level of confidence in the completeness of the annotation of a gene and/or its coding region.
Map Viewer also uses color coding or diagrams to represent the level of confidence in the placement of any mapped object. For example, SNPs or STSs that are placed at more than one position in a given map are noted by color (yellow) in the detailed labels (
Although maps provided from external sources are updated when new data are available, the maps dependent on NCBI's annotation process are updated periodically in versions called “builds”. Thus, mRNA or other supporting evidence that becomes available after the data “freeze” date for one build will not be incorporated into the display until the next build. However, some of the supporting databases linked from the Map Viewer may have more updated information. For example, UniSTS may provide more recent e-PCR results, or LocusLink may show a newer name or additional sequence data. dbSNP may make major data releases between builds; in this case, the variation map is updated.
Although most of this chapter discusses the human genome Map Viewer, there is a growing number of organisms for which there is Map Viewer access to the genome. To identify the taxa that have Map Viewer access to the genome, query the taxonomy database by typing “loprovmapviewer”[filter] into the query box on the Entrez Taxonomy
Many NCBI databases are now integrated into Map Viewer (
Genome-specific resource pages also support queries via chromosome diagrams ( Note that there are two ways to connect: (1) by selecting
Genome-specific BLAST pages that restrict a search to a specific genome are provided for several organisms and allow the results of the search to be displayed in a genomic context (provided by Map Viewer). Genome-specific BLAST searches can be accessed from the Selecting the
When already at a genome-specific Map Viewer page, any combination of query terms can be entered into a Map Viewer Note the links to the help documentation and related resources. Also note the check boxes to use the advanced query page and/or to display objects calculated to have links to any object returned by the query. A link to the genome-specific BLAST site is also provided at the
Queries may include any unique identifier for a database record, e.g., a sequence Accession number or OMIM (MIM) number, or a text term or phrase, e.g., a gene symbol (BRCA2) or descriptor (p53-binding), or disease name (lung cancer). The Boolean AND operator is used automatically if multiple terms are entered. Therefore, a query for “fanconi anemia” will automatically be interpreted as “fanconi AND anemia”. The wildcard operator (*) provides a convenient mechanism to retrieve genes that share a common symbol or name, as is often found for gene families. For example, a query for ABC* will return matches to the ATP-binding cassette superfamily.
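The two query rules described above, implicit AND between terms and the trailing wildcard, can be mimicked with a short matching function. This is only a model of the behavior, not the Entrez indexing code: the function name and the idea of matching against a list of descriptor words are assumptions for illustration.

```python
from fnmatch import fnmatchcase


def matches(query, descriptors):
    """True if every query term matches at least one descriptor word.

    Terms are combined with an implicit AND; '*' acts as a wildcard.
    Matching is case-insensitive.
    """
    terms = query.lower().split()
    words = [word.lower() for word in descriptors]
    return all(any(fnmatchcase(word, term) for word in words) for term in terms)


print(matches("fanconi anemia", ["Fanconi", "anemia", "complementation", "group"]))  # True
print(matches("fanconi anemia", ["anemia"]))   # False: 'fanconi' is missing
print(matches("ABC*", ["ABCA1"]))              # True
```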
The advanced query page, accessed by checking the
The same options for wild cards and Boolean operators for your query term(s) apply when starting at the Map Viewer
To use Map Viewer to display a particular section of a genome by using a range of positions as a query, it is first necessary to select a particular chromosome for display from either a genome-specific Map Viewer page or a Genome Guide page.
Once a single chromosome is displayed, position-based queries can be defined by: (1) entering a value into the
Tutorials in Chapter 23, particularly #2, provide more examples of querying Map Viewer by position.
The results from a query are displayed both graphically and in a summary table (
General information on the chromosome being viewed is summarized at the top of the map page: the species and chromosome currently being viewed, the query term, and the name of the focal map, termed the Master Map ( (
The summary also includes the following statistics on the objects on the Master Map:
the number of objects localized (positioned) on the chromosome
the number of objects not localized but present on the chromosome
the number of objects localized in the region displayed (i.e., the number decreases as you zoom in)
the number of objects for which text descriptions are shown (dependent on user-defined page length)
A thumbnail map on the left of the page provides a coarse indication of the region displayed; by default, this is a cytogenetic map, although the Master Map can be selected (
Maps are displayed vertically, with the name of each map hyperlinked to a description of it (
The positions of BLAST hits are highlighted on the Contig map, and a text summary of the BLAST hit is provided with links to regional alignment reports. All of the options described previously for configuring your display are still available. Thus, it is possible to evaluate the sequence match by the location (possible intron/exon structure, percent identity) as well as to determine whether the matching genomic region contains all of the query sequence in the expected order. Adding other maps to the display using the
A tabular report of the region and maps being displayed can be generated by selecting the
If any of the maps are in sequence coordinates, an option is presented to report data for any sequence map in the region. Note: Links are provided for downloading tab-delimited files for any or all maps.
The Map Viewer display can be customized with regard to the region shown, the number and coordinate systems of maps, the number of objects labeled on the Master Map, and whether to show connections between objects. Each of these will be described in this section.
The Map Viewer provides zoom, navigation, and other map display controls. These can be found on the display page itself and in the The menus and boxes in this window allow definition of the range of the chromosome to display (
As the resolution of a view is changed, the chromosome diagram is updated. The view automatically centers on a highlighted query term, or on the middle of the chromosome if browsing only. The chromosome view can be moved up and down or zoomed in and out. Zooming can be achieved in several ways: (1) by using the zoom control, located in the left column; (2) by providing a range or bounding markers in the
Maps are categorized by the coordinate system as well as type of feature. The maps available for a genome can be seen by scrolling through the
Some basic map controls are available directly on the display including removal of a map from the display by clicking on the
The
There has been considerable effort to integrate data on the sequence-based maps with data from non-sequence-based maps. Map connections provide a unique and powerful mechanism to identify features in a relevant region of the sequence map when starting with information from a different coordinate system (see
The features that are available with Map Viewer are summarized in
Text
Text, advanced
Nucleotide query (by alignment or Accession number)
Protein query (by alignment)
By position in genome
Graphical
Tabular
Assembled sequence
Annotated feature sequence
Sequence region
Other map data for region
Custom model (Model Maker)
Zoom
Scroll along chromosome
Add/Remove tracks/maps
Scalebar (ruler)
Change order of track/map
Specify coordinates to view
Jump to different chromosome
Show links
Alter number of rows displayed
Assembled sequence
Model mRNA sequence
Model protein sequence
Contig/chromosome conversion tables
Map location, sequence-based
Map location, non-sequence-based
Help documentation
Statistics
FAQs
Map Viewer provides links to several tools to display, download, or manipulate the sequence in a user-defined region. Whenever a sequence-based map is the master (the one at the right), the link
The Evidence Viewer (
The Sequence Viewer (
Sequence Download (
Model Maker (
The data displayed in Map Viewer are freely available. In addition to the view-specific reports, all of the data are available by
Dynamic links to Map Viewer can be generated by constructing URLs with arguments that define the species, chromosome, range, types of maps (with or without units), display order, number of labels, query string, how to center a display around a query result, and the type of label for the display. The most current documentation is provided in the
Query terms are indexed for retrieval using the Entrez system. Thus, wild cards, Boolean operators, filters, and properties are managed as for other Entrez databases.
Each distinct object on the map is assigned a unique identifier that is specific to a particular build. Each object may have other secondary identifiers, such as IDs in the sequence, Clone Repository, dbSNP, LocusLink, UniGene, or UniSTS databases. All descriptors are indexed as text. In addition, some are indexed by specific field values or by pre-identified properties, such as genes with associated diseases, SNPs with heterozygosity values in pre-defined ranges, or evidence type for genes. These field names or properties can be applied to restrict a query either in the Web-based query form or within a URL. The complete listings of current implementations for field qualifiers and properties are provided in the online help documentation.
Data for each map are retrieved for display from a relational database based on the IDs returned from the Entrez query. The database is used only to support display; it is refreshed with each NCBI build or update of any other map but is not used to track changes from build to build. Data from previous builds are archived at NCBI, but direct access is not currently supported.
|
|
Ensembl | www.ensembl.org |
NCBI MapViewer | www.ncbi.nlm.nih.gov/mapview |
UCSC Genome Browser | www.genome.ucsc.edu |
Sequencing Information | |
NHGRI Sequencing Information | www.nhgri.nih.gov/Data/ |
Celera Genomics | www.celera.com |
|
|
BLAT | http://genome.ucsc.edu/cgi-bin/hgBlat?command=start |
BLAST | http://www.ncbi.nlm.nih.gov/BLAST/ |
e-PCR | http://www.ncbi.nlm.nih.gov/sts/epcr.cgi |
Sim4 (mRNA to genomic alignment tool) | http://globin.cse.psu.edu/ |
Spidey (mRNA to genomic alignment tool) | http://www.ncbi.nlm.nih.gov/spidey |
SSAHA | http://www.sanger.ac.uk/Software/analysis/SSAHA/ |
RepeatMasker | http://repeatmasker.genome.washington.edu/cgi-bin/RepeatMasker |
Maps | |
BAC FingerPrint Map | http://genome.wustl.edu/gsc/human/Mapping/ |
Other Annotation Sources and Viewers | |
Celera Genomics | http://www.celera.com |
DAS | http://www.biodas.org |
DoubleTwist | http://www.doubletwist.com |
The Genome Channel | http://compbio.ornl.gov/channel/ |
Incyte Genomics | http://www.incyte.com |
FTP Sites | |
Ensembl | ftp.ensembl.org/pub/current/data/ |
NCBI | ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/ |
UCSC | ftp.genome.cse.ucsc.edu/goldenPath |
Map Viewer displays represent the current synthesis of information available at the time of the data freeze (
Means of reviewing reliability include: (
The task of assembling an inventory of all genes of
At a time when the genomes of many species have been sequenced completely, a fundamental resource expected by many researchers is a simple list of all of an organism's genes. A gene list, together with associated physical reagents and electronic information, allows one to begin to investigate the ways in which many genes interact in the complex system of the organism. However, many species of medical and agricultural importance have not yet been prioritized for genomic sequencing, and expressed cDNAs have provided the primary source of gene sequences. Furthermore, when the genomic sequence of an organism becomes available, a collection of cDNA sequences provides the best tool for identifying genes within the DNA sequence. Thus, we can anticipate that the sequencing of transcribed products will remain a significant area of interest well into the future.
The era of high-throughput cDNA sequencing was initiated in 1991 by a landmark study from Venter and his colleagues (
Despite their fragmentary and inaccurate nature, ESTs were found to be an invaluable resource for the discovery of new genes, particularly those involved in human disease processes (
Organism | ESTs |
---|---|
4,070,035 | |
2,522,776 | |
326,707 | |
255,456 | |
234,900 | |
230,256 | |
197,630 | |
197,565 | |
191,268 | |
148,338 | |
147,658 | |
137,588 | |
113,330 | |
112,489 | |
104,803 | |
104,284 | |
103,321 | |
88,963 | |
88,742 | |
84,712 |
One avenue to gene discovery is to use a database search tool, such as BLAST (
For EST sequencing to be maximally productive, certain details of the library construction require some attention. For example, normalization procedures have been used to reduce the abundance of highly expressed genes so as to favor the sampling of rarer transcripts (
Although ESTs are a useful way to identify clones of interest and provide guidance in identifying gene structure, a full-insert sequence of cDNA clones is preferable for both purposes. High-throughput full-insert cDNA sequencing projects have been the source of over 80,000 sequence submissions accessioned to date (August 2002). The full-insert cDNA sequence can allow identification of the translation product of the sequenced transcript, as well as potentially providing evidence for gene structure. Moreover, for the investigator wanting to use the clone as a reagent, having the accurate and complete sequence of the clone's insert at hand makes complete resequencing unnecessary, if the full-insert cDNA sequencing project makes clones available. Verifying that the full-insert sequence corresponds to either the complete transcript of interest or to its complete, uncorrupted coding sequence is possible without committing laboratory resources and time to a clone that produced an EST. cDNA libraries do not generally include the entire transcript sequence; therefore, many full-insert sequences do not contain the entire transcription unit. Large transcripts (>6 kb) are particularly difficult to obtain.
The sheer number of transcribed sequences is extraordinary, indeed for most organisms much larger than the number of genes. A major challenge is to make putative gene assignments for these sequences, recognizing that many of these genes will be anonymous, defined only by the sequences themselves. Computationally, this can be thought of as a clustering problem in which the sequences are vertices that may be coalesced into clusters by establishing connections among them.
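The clustering formulation above can be illustrated with a small sketch. It assumes that pairwise connections between sequences have already been established by some other means and uses a union-find structure to coalesce connected sequences into clusters. The function name, sequence identifiers, and connections are hypothetical and serve only as an illustration; this is not the UniGene implementation.

```python
from collections import defaultdict


def cluster_sequences(sequence_ids, connections):
    """Group sequences (vertices) into clusters given pairwise connections."""
    # Every sequence starts out as its own cluster.
    parent = {seq_id: seq_id for seq_id in sequence_ids}

    def find(seq_id):
        # Walk up to the cluster representative, shortening the path as we go.
        while parent[seq_id] != seq_id:
            parent[seq_id] = parent[parent[seq_id]]
            seq_id = parent[seq_id]
        return seq_id

    # Each connection coalesces the two clusters that it touches.
    for a, b in connections:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = defaultdict(set)
    for seq_id in sequence_ids:
        clusters[find(seq_id)].add(seq_id)
    return sorted(clusters.values(), key=sorted)


# Hypothetical identifiers: two ESTs connect to one mRNA, two ESTs stay alone.
example_clusters = cluster_sequences(
    ["est1", "est2", "est3", "mrna1", "est4"],
    [("est1", "est2"), ("est2", "mrna1")],
)
```

Because clusters are formed transitively, a single false connection merges two otherwise unrelated clusters, which is one reason why noisy sequences need to be removed before clustering.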
Experience has shown that it is important to eliminate low-quality or apparently artifactual sequences before clustering because even a small level of noise can have a large corrupting effect on a result. Thus, procedures are in place to eliminate sequences of foreign origin (most commonly
Given a set of sequences, a variety of different sources of information may be used as evidence that any pair of them is or is not derived from the same gene. The most obvious type of relationship would be one in which the sequences overlap and can form a near-perfect sequence alignment. One dilemma is that some level of mismatching should be tolerated because of known levels of base substitution errors in ESTs, whereas allowing too much mismatching will cause highly similar paralogous genes to cluster together. One way to improve the results is to require that alignments show an approximate “dovetail” relationship, which is to say that they extend about as far to the ends of the sequences as possible. Values of specific parameters governing acceptable sequence alignments are chosen by examining ratios of true to false connections in curated test sets. It is important to note that the resulting clusters may contain more than one alternative-splice form.
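As a rough illustration of the dovetail requirement, the sketch below tests whether an alignment between two sequences extends to within a few bases of the sequence ends. The coordinate convention, the overhang tolerance, and the identity threshold are all assumptions made for this example; the text does not give the actual parameter values, which are tuned on curated test sets.

```python
def is_dovetail(a_start, a_end, a_len, b_start, b_end, b_len, max_overhang=10):
    """Return True if an alignment runs (nearly) to the ends of the sequences.

    Coordinates are 0-based and half-open, and both sequences are assumed to
    be on the same strand. max_overhang is a hypothetical tolerance in bases.
    """
    # On each side of the alignment, at least one of the two sequences must
    # have (almost) no unaligned bases left over.
    left_unaligned = min(a_start, b_start)
    right_unaligned = min(a_len - a_end, b_len - b_end)
    return left_unaligned <= max_overhang and right_unaligned <= max_overhang


def accept_connection(alignment, min_identity=0.96, max_overhang=10):
    """Accept a pairwise connection only for near-perfect dovetail alignments.

    `alignment` is a dict with hypothetical keys; min_identity is an assumed
    threshold, not a documented value.
    """
    if alignment["identity"] < min_identity:
        return False
    return is_dovetail(
        alignment["a_start"], alignment["a_end"], alignment["a_len"],
        alignment["b_start"], alignment["b_end"], alignment["b_len"],
        max_overhang,
    )


# The 3' end of sequence A overlaps the 5' end of sequence B: a dovetail.
end_overlap = is_dovetail(300, 500, 500, 0, 200, 400)
# A short match in the middle of both sequences: not a dovetail.
internal_match = is_dovetail(100, 200, 500, 150, 250, 400)
```

An internal match that stops well short of both ends is the kind of alignment that a shared repeat or a conserved domain in paralogous genes might produce, so rejecting it reduces false connections.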
Multiple incomplete but non-overlapping fragments of the same gene are frequently recognized in hindsight when the gene's complete sequence is submitted. To minimize the frequency of multiple clusters being identified for a single gene, UniGene clusters are required to contain at least one sequence carrying readily identifiable evidence of having reached the 3′ terminus. In other words, UniGene clusters must be anchored at the 3′ end of a transcription unit. This evidence can be either a canonical polyadenylation signal (
The UniGene Web site allows the user to view UniGene information on a per cluster, per sequence, or per library basis. Each UniGene Web page ( A Web view of the UniGene cluster representing the human serine proteinase inhibitor gene
The UniGene Cluster page summarizes the sequences in the cluster and a variety of derived information that may be used to infer the identity of the gene.
Possible protein products for the gene are suggested by providing protein similarities between one representative sequence from the cluster and protein sequences from eight selected model organisms. For each organism, the protein with the highest degree of sequence similarity to the nucleotide sequence is listed, with its title and GenBank Accession number. The sequence alignment is described using the percent identity and length of the aligned region. Also provided is a link to ProtEST, which summarizes the UniGene protein similarities on a per protein basis.
The next section summarizes information on the inferred map position of the gene. In some cases, chromosome assignments can be drawn from other databases, such as OMIM or MGI. In other cases, radiation hybrid (RH) maps have been constructed using Sequence Tagged Site (STS) markers derived from ESTs. In these cases, the UniGene cluster can be associated with a marker in the UniSTS database, and a map position can be assigned from the RH map. More recently, map positions have been derived by alignment of the cDNA sequences to the finished or draft genomic sequences present in the NCBI
Although ESTs are a poor probe of gene expression, both the total number of ESTs and the tissues from which they originated are often useful. Both of these are displayed in the cluster browser. The tissues are listed under Expression Information, which includes the tissue source of libraries of the component sequences and, for human, links to the SAGE resource. Moreover, if genomic sequence is available, the UniGene map view displays expression for each exon (more precisely, for each portion of genome similar to a transcript; because incompletely processed mRNAs are not unheard of, the presence of a transcript is insufficient to identify an exon).
The component sequences of the cluster are listed, with a brief description of each one and a link to its UniGene Sequence page. The Sequence page provides more detailed information about the individual sequence, and in the case of ESTs, includes a link to its corresponding UniGene Library page. On the Cluster page, the EST clones that are considered by the Mammalian Gene Collection (MGC) project to be putatively full length are listed at the top, whereas others follow in order of their reported insert length. At the bottom of the UniGene Cluster page is an option for the user to download the sequences of the cluster in FASTA format.
The
The ProtEST Web site has three features: information describing the amino acid sequence; information describing the nucleotide–protein alignments; and the ability for the user to modify various display options. The sequence alignments in ProtEST are summarized in tabular form ( A view of protein similarities for the human
To further refine the view, the sequence alignments in the table can be sorted by: (
DDD is a tool for comparing EST-based expression profiles among the various libraries, or pools of libraries, represented in UniGene. These comparisons allow the identification of those genes that differ among libraries of different tissues, making it possible to determine which genes may be contributing to a cell's unique characteristics, e.g., those that make a muscle cell different from a skin or liver cell. Along similar lines, DDD can be used to try to identify genes for which the expression levels differ between normal, premalignant, and cancerous tissues or different stages of embryonic development.
As in UniGene, the DDD resource is organism specific and is available from the UniGene
DDD uses the Fisher Exact test to restrict the output to statistically significant differences (
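To make the statistical test concrete, the following sketch computes a two-sided Fisher Exact p-value for one gene from a 2 × 2 table of EST counts in two library pools. It uses only the Python standard library. How DDD itself sets up the table, chooses sidedness, or corrects for multiple comparisons is not specified here, so those choices, along with the function name and the counts, are assumptions.

```python
from math import comb


def fisher_exact_two_sided(count_a, total_a, count_b, total_b):
    """Two-sided Fisher Exact p-value for a gene's EST counts in two pools.

    count_a of total_a ESTs in pool A, and count_b of total_b ESTs in pool B,
    belong to the gene's cluster.
    """
    hits = count_a + count_b      # ESTs from this gene in both pools
    total = total_a + total_b     # all ESTs in both pools
    denominator = comb(total, total_a)

    def probability(k):
        # Hypergeometric probability that k of the gene's ESTs fall in pool A.
        return comb(hits, k) * comb(total - hits, total_a - k) / denominator

    observed = probability(count_a)
    lowest = max(0, hits - total_b)
    highest = min(hits, total_a)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    p_value = sum(
        probability(k)
        for k in range(lowest, highest + 1)
        if probability(k) <= observed * (1 + 1e-9)
    )
    return min(1.0, p_value)


# Made-up counts: 30 of 10,000 ESTs in pool A versus 5 of 10,000 in pool B.
p_example = fisher_exact_two_sided(30, 10_000, 5, 10_000)
```

A small p-value suggests that the difference in EST representation between the two pools is unlikely to be due to sampling alone.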
HomoloGene is a resource for exploring putative homology relationships among genes, bringing together curated homology information and results from automated sequence comparisons. UniGene clusters, supplemented by data from genome sequencing projects, have been used as a source of gene sequences for automated comparisons.
Homology relationships, according to the experts who judge these, have been obtained from several sources. Collaborations with MGI and ZFIN at the University of Oregon have provided a large body of literature-derived data centered around
MegaBLAST (
The connections made by these methods result in a complex web of relationships. To simplify the Web view, it is useful to have each report page focus on an individual gene, called the “key gene”, and to show connections that follow from it. An example of the report for the Homology information for the mouse
The protein database of Clusters of Orthologous Groups (COGs) is an attempt to phylogenetically classify the complete complement of proteins (both predicted and characterized) encoded by complete genomes. Each COG is a group of three or more proteins that are inferred to be orthologs, i.e., they are direct evolutionary counterparts. The current release of the COGs database consists of 4,873 COGs, which include 136,711 proteins (~71% of all encoded proteins) from 50 bacterial genomes, 13 archaeal genomes, and 3 genomes of unicellular eukaryotes, the yeasts
The recent progress in genome sequencing has led to a rapid enrichment of protein databases with an unprecedented variety of deduced protein sequences, most of them without a documented functional role. Computational biology strives to extract the maximal possible information from these sequences by classifying them according to their
The COGs database has been designed as an attempt to classify proteins from completely sequenced genomes on the basis of the orthology concept (
COGs have been identified on the basis of an all-against-all sequence comparison of the proteins encoded in complete genomes using the gapped BLAST program ( The
Briefly, COG construction includes the following steps:
1. Perform the all-against-all protein sequence comparison.
2. Detect and collapse obvious paralogs, i.e., proteins from the same genome that are more similar to each other than to any proteins from other species.
3. Detect triangles of mutually consistent, genome-specific best hits (BeTs), taking into account the paralogous groups detected at step 2.
4. Merge triangles with a common side to form COGs.
5. Perform a case-by-case analysis of each COG. This analysis serves to eliminate false-positives and to identify groups that contain multidomain proteins by examining the pictorial representation of the BLAST search outputs. The sequences of detected multidomain proteins are split into single-domain segments, and steps 1–4 are repeated with the resulting shorter sequences, which assigns individual domains to COGs in accordance with their distinct evolutionary affinities.
6. Examine large COGs that include multiple members from all or several of the genomes using phylogenetic trees, cluster analysis, and visual inspection of alignments. As a result, some of these groups are split into two or more smaller ones that are included in the final set of COGs.
By the design of this procedure, a minimal COG includes three genes from distinct phylogenetic lineages; protein sets from closely related species were merged before COG construction. The approach used for the construction of COGs does not supplant a comprehensive phylogenetic analysis. Nevertheless, it provides a fast and convenient shortcut to delineate a large number of families that most likely consist of orthologs.
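The triangle detection and merging steps of the procedure can be sketched as follows. This is a simplified illustration: it assumes that best hits have already been computed, treats "mutually consistent" as strictly reciprocal best hits, and leaves out paralog collapsing, domain splitting, and manual curation. All function names and protein identifiers are hypothetical.

```python
from itertools import combinations


def find_triangles(genome_of, best_hit):
    """Find trios of proteins from three genomes that are mutual best hits.

    genome_of maps protein -> genome; best_hit maps (protein, genome) -> the
    protein's best hit in that genome.
    """
    def reciprocal(p, q):
        return (best_hit.get((p, genome_of[q])) == q
                and best_hit.get((q, genome_of[p])) == p)

    triangles = []
    for trio in combinations(sorted(genome_of), 3):
        from_three_genomes = len({genome_of[p] for p in trio}) == 3
        if from_three_genomes and all(reciprocal(p, q) for p, q in combinations(trio, 2)):
            triangles.append(frozenset(trio))
    return triangles


def merge_triangles(triangles):
    """Merge triangles that share a side (two proteins) into larger groups."""
    parent = list(range(len(triangles)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(triangles)), 2):
        if len(triangles[i] & triangles[j]) >= 2:  # a common side
            parent[find(j)] = find(i)

    groups = {}
    for i, triangle in enumerate(triangles):
        groups.setdefault(find(i), set()).update(triangle)
    return sorted(groups.values(), key=sorted)


# Toy data: d1 is a reciprocal best hit of a1 and b1 but not of c1.
genome_of = {"a1": "A", "b1": "B", "c1": "C", "d1": "D",
             "a2": "A", "b2": "B", "c2": "C"}
reciprocal_pairs = [("a1", "b1"), ("a1", "c1"), ("b1", "c1"),
                    ("a1", "d1"), ("b1", "d1"),
                    ("a2", "b2"), ("a2", "c2"), ("b2", "c2")]
best_hit = {}
for p, q in reciprocal_pairs:
    best_hit[(p, genome_of[q])] = q
    best_hit[(q, genome_of[p])] = p

example_cogs = merge_triangles(find_triangles(genome_of, best_hit))
```

In the toy data, the triangles (a1, b1, c1) and (a1, b1, d1) share the side a1–b1 and are merged into a single four-member group, whereas (a2, b2, c2) remains a minimal group of three.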
New proteins can be assigned to the COGs using the COGnitor program, the principal tool associated with the COGs database. COGnitor “BLASTs” the query sequences against all protein sequences encoded in the genomes that are classified in the current release of the COG system. To assign proteins to COGs, COGnitor applies the same principle that is embedded in the COG construction procedure, i.e., the consistency of genome-specific BeTs. For any given query protein, if the number of BeTs for a particular COG exceeds a predefined cut-off (three by default; the cut-off value can be changed by the user), the query protein is assigned to that COG; in cases where there are more than three BeTs to two different COGs, an ambiguous result is reported.
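The assignment rule can be illustrated with a short sketch. It assumes that the genome-specific BeTs for a query have already been found and that each BeT has been mapped to its COG, if it has one. The function name and COG labels are hypothetical, and whether the cut-off comparison is inclusive is an assumption: here, a COG qualifies when the number of BeTs is at least the cut-off.

```python
from collections import Counter


def assign_to_cog(bet_cogs, cutoff=3):
    """Decide a COG assignment from the COGs of a query's genome-specific BeTs.

    bet_cogs holds one entry per genome: the COG of the query's best hit in
    that genome, or None if that best hit does not belong to any COG.
    """
    counts = Counter(cog for cog in bet_cogs if cog is not None)
    # Assumption: a COG qualifies when its BeT count reaches the cut-off.
    qualifying = sorted(cog for cog, n in counts.items() if n >= cutoff)
    if not qualifying:
        return "unassigned", []
    if len(qualifying) == 1:
        return "assigned", qualifying
    return "ambiguous", qualifying


# Hypothetical COG labels for the best hits found in six genomes.
clear_case = assign_to_cog(["COG_X", "COG_X", "COG_X", None, "COG_Y", "COG_X"])
split_case = assign_to_cog(["COG_X", "COG_X", "COG_X", "COG_Y", "COG_Y", "COG_Y"])
```

An ambiguous result of this kind may point to a multidomain query whose domains belong to different COGs.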
Once the COGs have been identified using the above procedure, new members can be added using the COGnitor program. The assignments are further checked and curated by hand to eliminate potential false-positives. It has been shown that 95–97% of the COGnitor assignments typically require no correction (
In bacterial and archaeal genomes, approximately 70% of the proteins typically belong to the COGs. Because each COG includes proteins from at least three distantly related species, this reveals the generally high level of evolutionary conservation of protein sequences, making the COGs a powerful tool for functional annotation of uncharacterized proteins. The COGs were classified into 18 functional categories that loosely follow those introduced by Riley (
A phyletic pattern is the pattern of species that are represented or not represented in a given COG; alternatively, phyletic patterns can be described in terms of the sets of COGs that are represented in a given range of species. The COGs show a broad diversity of phyletic patterns; only a small fraction are universal COGs, i.e., they are represented in all sequenced genomes, whereas COGs present in only three or four species are most abundant. This patchy distribution of phyletic patterns probably reflects the major role of horizontal gene transfer and lineage-specific gene loss in the evolution of prokaryotes, as well as the rapid evolution of certain genes in specific lineages, which may be linked to functional changes. Phyletic patterns are informative not only as indicators of probable evolutionary scenarios but also functionally; most often, different steps of the same pathway are associated with proteins that have the same phyletic pattern, whereas on some occasions, complementary patterns indicate that distinct (sometimes unrelated) proteins are responsible for the same function in different sets of species. The COG system includes a simple phyletic pattern search tool that allows the selection of COGs according to any given pattern of species. This tool effectively provides the functionality of “differential genome display” (for example, allowing the selection of all COGs that are present in one, but not the other, of a pair of genomes of interest) and can be helpful for delineating sets of candidate proteins for a particular range of functional features, e.g., virulence or hyperthermophily.
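A phyletic pattern search of the kind described above amounts to simple set operations. The sketch below assumes that each COG is stored with the set of species represented in it; the COG labels and species names are made up, and the function is not the interface of the COG site's search tool.

```python
def select_by_pattern(cog_species, present=(), absent=()):
    """Select COGs represented in all `present` species and in no `absent` species.

    cog_species maps a COG label to the set of species represented in it.
    """
    present, absent = set(present), set(absent)
    return sorted(
        cog
        for cog, species in cog_species.items()
        if present <= species and not (absent & species)
    )


# Made-up phyletic patterns for four COGs across four species.
cog_species = {
    "COG_1": {"sp_A", "sp_B", "sp_C", "sp_D"},   # universal in this toy set
    "COG_2": {"sp_A", "sp_C", "sp_D"},
    "COG_3": {"sp_B", "sp_C", "sp_D"},
    "COG_4": {"sp_A", "sp_B", "sp_C"},
}

# "Differential genome display": COGs present in sp_A but missing from sp_B.
only_in_a = select_by_pattern(cog_species, present=["sp_A"], absent=["sp_B"])
```

Swapping the two arguments gives the complementary set, which can help in finding distinct proteins that may perform the same function in different species.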
The main COGs
The individual COG pages can be reached from any of the COG lists mentioned above or by searching the site (see, for example, the COG for
The COG data set and the COGnitor program also are available by anonymous ftp at
Substantial evolution of the COGs is expected in the near future in terms of both growth by adding more genomes and the addition of new functionalities and layers of presentation. Quantitatively, the main forthcoming addition is the COGs for eukaryotic genomes, which are expected to approximately double the size of the COG system. Many of the COGs include paralogous proteins, and this will be addressed by introducing hierarchical organization into the COG system, whereby related COGs will be unified at a higher level. In addition, partial integration of the COGs with the NCBI's Conserved Domains Database (CDD) is expected (
The COG system is developed and maintained by a team of programmers and expert biologists.
Project leader: Eugene V. Koonin.
The programming group: Roman L. Tatusov (group leader), Boris Kiryutin, Victor Smirnov, and Alexander Sverdlov (student)
The annotation group: Darren A. Natale (group leader), Natalie Fedorova, Anastasia Nikolskaya, Aviva Jacobs, Jodie Yin, B. Sridhar Rao, Dmitri M. Krylov, Sergei Mekhedov, John Jackson, Raja Mazumder, and Sona Vasudevan
The User Services team is the primary liaison between the public and the resources and data at NCBI. User Services disseminates information through outreach training programs and exhibits at scientific conferences and responds to incoming questions by email and telephone assistance. The team instructs people in the use of NCBI resources, responds to a wide range of questions, receives comments and suggestions, and coordinates with the NCBI resource developers to implement suggestions from users. In addition, User Services develops documentation, tutorials, and other support materials; produces the NCBI News; and publishes articles on NCBI resources.
User Services consists of a staff of scientists and information specialists with diverse backgrounds and experiences. Scientist members of the staff hold Masters or Ph.D. degrees in an area of molecular biology, biochemistry, or biotechnology. Information specialists have Masters degrees in Library and Information Science and extensive experience using online databases of scientific information.
Help Desk assistance is available from 8:30 a.m. to 5:30 p.m. Eastern Time, Monday through Friday. Two email addresses are available, info@ncbi.nlm.nih.gov and blast-help@ncbi.nlm.nih.gov. The
Email questions are answered as expeditiously as possible, usually within a day of receipt of the question. However, those that require extended investigation may take longer. Questions are usually handled directly by members of the User Services staff, although some are referred to a specific database development team for attention.
Examples of question topics include: data submission protocols, including the use of
Because of the genetic focus of many NCBI resources, we receive a number of questions from the general public regarding medical issues. The NCBI Help Desk staff can neither provide direct answers to medical questions nor give medical advice or guidance. However, we do provide suggestions on how to search our resources for information on the gene or condition of interest and refer users to the National Library of Medicine (NLM) customer service group for further assistance with PubMed (
Questions about
Because of its ongoing personal contact with our users, the User Services group plays an important role in communicating with database development and production teams, making suggestions, testing new releases and new features, and keeping them informed of problems that people are having with the services. The team also collaborates with developers in creating help documents, frequently asked questions (FAQs), tutorials, and workshop materials.
Web-based tutorials for
In keeping with the “plain language initiative” at NIH, the
The NCBI
The
The User Services group also prepares fact sheets, brochures, and other public information materials to describe and illustrate NCBI services. A
Overview articles entitled
NCBI's continuing emphasis on outreach to the scientific community is evident in its multifaceted program that includes exhibiting its services at scientific meetings, offering a variety of training courses, and developing Web-based tutorials and workshops.
NCBI exhibits at approximately 15 scientific meetings per year, providing an opportunity for a wide range of researchers, students, and teachers to see demonstrations of NCBI resources and interact directly with NCBI staff. The current exhibit
Workshops are offered at select scientific meetings and include the standing workshops described below in the Training section, but workshops also can be customized for particular audiences. Meeting organizers who would like to invite NCBI to offer a workshop are encouraged to do so.
NCBI has a growing training program consisting of full-day, half-day, and two-hour courses that are usually a combination of lecture and computer-based formats. There are also advanced courses that are given over a more extended time period. Each is described briefly below, and further information on the training programs can be found in the Education section of the Web site, under
The course is offered by invitation at academic institutions as well as at selected scientific conferences. If you are interested in hosting a course at your institution or conference, write to
The course is also offered four times a year at the NLM on the NIH campus in Bethesda, Maryland, and is free and open to anyone who would like to attend.
More information on
This course is designed primarily for medical and science librarians or other professionals who are providing support services for molecular biology information resources. It provides an introduction to four categories of molecular biology information available from NCBI: nucleotide sequences, protein sequences, three-dimensional structures, and complete genomes and maps. An overview of search systems available at the NCBI, particularly Entrez and BLAST, emphasizes how search skills related to other types of information resources also apply to molecular biology databases. The course concludes with a discussion of various levels of molecular biology information services provided by librarians.
The Medical Library Association approves this course for eight continuing education credit hours. The course has been given at 24 locations since May 1997. Because of the increase in NCBI services, courses are being revised and are not being scheduled at this time.
More information on
A new 5-day advanced course on NCBI resources has been developed as part of a collaborative project with a group of scientists and librarians who currently provide bioinformatics support services at their universities. The course provides detailed descriptive information as well as hands-on experience with handling a wide range of user questions. The course is designed for bioinformatics support staff based in university medical libraries so that they can, in turn, assist students, faculty, staff, and clinicians at their institutions in the use of molecular biology information resources. Additional information on
The Service Desk staff also offers four mini-courses: BLAST QuickStart, Unmasking Genes in the Human Genome, Making Sense of DNA and Protein Sequences, and GenBank and PubMed Searching. Each is described briefly below. The purpose of the mini-courses is to focus on specific research application areas and address how to use multiple NCBI resources together to answer a research question. Additional problem-oriented mini-courses are under development.
The courses are 2 hours each in length. An overview is given during the first hour in lecture format, followed by a 1-hour hands-on session. Although the courses are primarily given on the NIH campus, NCBI is beginning to offer these workshops at outside institutions as well. The mini-courses were originally designed to be presented by an instructor, but they are constructed in an online notebook format; therefore, it is possible to take the course on your own. Revisions to augment the online notebooks with lecture material and make the courses completely self-guided are currently under way.
This mini-course is a practical introduction to the BLAST family of sequence-similarity search programs. Exercises range from simple searches to creative uses of the BLAST programs.
This mini-course covers how to find genes, promoters, and transcription factor-binding sites in human DNA sequences. It is designed around a program developed within User Services called Greengene, which integrates the output of several gene-finding tools and allows a coding sequence and accompanying protein translation to be assembled from the exons detected by these programs. Because the output of several programs is integrated, there is increased reliability in exon selection.
In this course, participants find a gene within a eukaryotic DNA sequence. They then predict the function of the derived protein by seeking sequence similarities to proteins with documented function using BLAST and other tools. Finally, a 3D modeling template is located for the protein sequence using the Conserved Domain Search (CDD-Search).
During the first hour, an instructor walks the class through an analysis of an uncharacterized
This mini-course provides an overview of literature searching and sequence retrieval using the PubMed and Entrez database search interfaces. Exercises illustrate advanced search tips for using Entrez, many of which explore the use of the
In addition to communication by email and phone provided through the Help Desk, a regular research consultation service provides one-on-one support for researchers in the NIH community. The consults are available by appointment and are provided in 1-hour time slots at the NIH Library as well as the NCBI training facility. Because of the success of the program, this type of service may be offered by appointment at selected scientific meetings in the future.
An innovative training program that began in 2001 aims to train molecular biologists for a new type of career as bioinformatics specialists who provide institutional support for users of computational biology tools. The NCBI Core Bioinformatics Facility (referred to as the CoreBio program) currently functions to train and support a network of bioinformatics specialists serving individual Institutes at NIH. NCBI's CoreBio facility trains Core members identified by their respective institutes in the use of its bioinformatics tools. The Core members, in turn, support the use of NCBI tools and databases by researchers at their institutes.
The training is provided over a 9-week period, with students attending lectures and completing practical exercises in the morning and returning to their regular workplace in the afternoon. The coursework centers on one major topic each week and follows the rough schedule given below:
WEEK 1: Introduction to the Sequence Databases
WEEK 2: BLAST
WEEK 3: The Human Genome
WEEK 4: Genomic Biology
WEEK 5: Molecular Modeling
WEEK 6: Web Page Development
WEEK 7: Setting Up a BLAST Web Server
WEEK 8: Interaction with Users
WEEK 9: Practicum
During week 9, the students pursue an institute-related project with the assistance of NCBI instructors. These projects run the gamut from the compilation of specialized datasets and data mining to the creation of novel BLAST interfaces and the construction of new data display tools. Students also develop a Web page to support the services they are developing for the respective Institutes at the NIH.
Although this is currently an NIH-based program, other organizations are welcome to use it as a model for developing similar initiatives to meet their own bioinformatics support needs.
At NCBI, we encourage our users to contact us with questions, suggestions, and requests for training or presentations on NCBI services. We invite feedback on tutorials, FAQs, and other support materials and welcome suggestions regarding additional materials that would be useful in guiding users through the wide range of services offered by NCBI.
This chapter contains tutorials for using Map Viewer. Step-by-step instructions are provided for several common biological research problems that can be addressed by exploiting the whole-genome and positional perspectives of Map Viewer. Please be aware that the examples in these tutorials may return different results when you execute them, because the underlying data may have been updated, but we hope that the framework for obtaining, interpreting, and processing your results will be sufficiently clear if that happens. Most of the examples are for human genes, but the same logic applies to other genomes as well.
Please note that each of these tutorials is accompanied by a figure. If you are using this tutorial on the Web, we suggest that you open another browser window so that you can view the figure as you are reading the text. You are also encouraged to use Map Viewer interactively.
We welcome any suggestions that you have for improving the existing tutorials or for adding new ones.
There are many instances in molecular biological research when you may have only a cDNA sequence but need to have the nucleotide sequence that lies 5′ or 3′ to a gene or the introns for additional analyses. Because genomic sequence available from the public database may not have this annotation or may be so large as to make it difficult to retrieve only a region of interest, tools have been added to Map Viewer to make it easier to define, view, and download genomic sequence in multiple formats.
From the Map Viewer
The most informative maps, for the purpose of this example, are the Contig, Component, and Genes_seq maps. Selecting the
There are two links on the Map Viewer display that are used to view and download the region of interest: the
Selecting the
Let us assume that we would like to download 5.0 kb of upstream DNA and 1.0 kb of downstream DNA. To define this region, we will need to follow the
Another tool for obtaining the desired sequence is the Sequence Viewer, available via the
In this example, we will use the Map Viewer to look for human candidate genes in a region. The types of queries that can be posted to the Map Viewer that will address this type of question are queries by genetic marker or STS.
Please note that Map Viewer supports queries by any named object positioned on a map so that it is possible to query by gene symbol or GenBank Accession number or any other object that might define your range of interest.
To refine our search, we will enter the names of two STSs. In the text box, enter “sWXD113 OR DXS52” on chromosome X. Select
The maps that are displayed include the UniG_Hs, Genes_seq, and STS maps. The UniG_Hs map shows the density of ESTs and mRNAs that align to the current assembly of the human genome. The Genes_seq map displays known and predicted genes that are annotated on the genomic contigs. The
At the current resolution, it is not possible to view all of the information displayed on the three maps. Therefore, some adjustment will be necessary.
Narrow the region further using the ruler adjacent to the STS map as a guide. It is not necessary to display a ruler alongside the other two maps because they are all on the same coordinate system. In instances where sequence, genetic, cytogenetic, or radiation hybrid maps are being displayed in the same view, it is advisable to display additional rulers because the different maps show the mapped element on different coordinate systems (Kbp, cM, banding position, or centiRays, respectively). Enter the range 133.0 M to 147.0 M in the
To see all of the genes in the region defined by the two markers, select the
We can also change the page length under
At this point, you can now browse the description of the genes that are being displayed. Each gene or locus name is hyperlinked to LocusLink, where a detailed report about the gene or locus is provided. If the gene or locus of interest has supporting EST and mRNA data, then you can select the UniGene cluster number and link to UniGene, where more detailed information is provided about this gene, including its pattern of expression. LocusLink also provides connections to BLink and thus indirectly to reports of related proteins in the protein database and to viewers of protein structure, if your protein of interest is related to a protein for which the structure is known (see also Exercise 8 in this chapter).
The example above summarizes the approach taken when defining a region of interest by entering names of markers in the query box. Gene symbols, reference SNP names, and GenBank Accession numbers for ESTs could also be used. It should also be noted that when a chromosome is displayed, you may also submit a query using the
In this example, we will locate and display the human gene implicated in Fragile X syndrome using the Map Viewer. We can find the gene beginning with several types of data. Refer to
If we are fortunate enough to know the official gene symbol for the Fragile X gene,
Genes that are linked to a disease in Online Mendelian Inheritance in Man (OMIM) are referenced on the Morbid map and can be found by searching with a disease name or phenotype. In our case, “fragile-X” can be used. Using this query, we pick up hits to genes related to
The
Often, a gene is known to reside only in a particular region. Suppose that we know only that
Suppose that we have the sequence of the mouse homolog of the human
Please note that other resources within NCBI also support querying for genes. Consider also LocusLink, UniGene, and Entrez Nucleotide. When a record of interest has been retrieved, each of these provides links to Map Viewer.
We will analyze the
To begin the analysis, we can select the link in the
We are now ready to select the maps to display. The maps displayed will depend on the sort of analysis intended; however, one useful set of maps includes the Genes_seq, Contig, Comp, GScan, UniG_Hs, RNA, and gbDNA maps. This set of maps can be selected for viewing using the panel invoked by the
In
The RNA or Transcript map shows the alignment of a single mRNA sequence to the genome, and in this case, the pattern of exons produced matches exactly that shown on the Genes_seq map. If additional splice variants are sequenced, multiple alignments will be shown on the RNA track, and the gene model given on the Genes_seq track will be a composite model made up of all the exons implied by these alignments.
The GenomeScan track shows gene predictions made using GenomeScan that are independent of supporting mRNA alignments. The GenomeScan model for
Both the predicted model (GScan) and the alignment-based model (Genes_seq) can be compared to the mapping of ESTs on the UniG_Hs track. The
This map is comparable to the UniG_Hs track but is based instead on alignment of mouse cDNA sequences (conventional and EST). In this example, the exons suggested by the alignment of mouse cDNAs are comparable to those based on alignment of human cDNAs.
SAGE_tag provides another view of expression levels and connections to more information about the tissue of origin of the expressed sequences. The SAGE_tag map also provides a histogram of expression, and each tag is connected to a tag-specific report page.
Looking across to the Contig track, we can see that the gene maps to a contig that is drawn in
Additional GenBank sequences that align to the genome but were not part of the assembly are shown on the gbDNA track. In this case, a number of short sequences for the individual exons of
The Map Viewer displays the alignment of transcripts, such as mRNA GenBank sequences and RefSeqs, to genomic sequence and shows the positions of predicted genes, but it does not stop there. By using a utility called the ModelMaker, it is possible to combine the alignment evidence with the results of gene prediction to construct novel transcripts.
Beginning with the standard Map Viewer display for the gene
To attempt this synthesis, we first select the
In this example, there were multiple, putative, full-length mRNAs. Please note that ESTs can be added to the display by selecting
This assembly can be displayed by using Map Viewer for the mouse. In this particular case, the official symbol for the human and mouse genes is the same; therefore, a query by symbol returns the expected result. If this were not the case, however, it is also possible to search the mouse genome by the human sequence using
Consider an example in which we want to know whether there is a human/mouse synteny region that includes our favorite gene,
Let us choose the NCBI
The most complete assembly of the mouse genome available at NCBI is the Mouse Genome Consortium Version 3 WGS assembly. This assembly can be displayed using the mouse version of the Map Viewer; however, the mouse homolog of
Finding members of a gene family is not straightforward by any means. However, the Map Viewer can be used to flag sets of genes that are related, either by nomenclature or by sequence similarity.
Consider the gene
Selecting the gene name (the rows for the FMR* query hits are highlighted) invokes a corresponding LocusLink page that serves as a portal to available information for the gene, including the precomputed results of a similarity search against the nr database. Go to the NCBI Reference Sequences section of the LocusLink page and select the BLAST Link (
When we look at the BLAST summary for
The BLink page lists two annotated human homologs of the
To see whether there are undocumented homologs of
The results of such a search are shown in The summary of significant BLAST hits is shown in the
The Map Viewer displays a graphic representation of genomic information with links to related resources that allow it to serve as a springboard for many types of analyses.
For example, one might be interested in the domain structure of the gene product. Let us use the human
To see the functional domains that have been identified in the FMR1 protein, select the Conserved Domains Database (CDD)
We can easily see whether there exists a three-dimensional structure that includes these conserved domains. The
Cut and paste the sequence of the KH-domain from the FMR1 protein and run a genome-specific BLAST search. We will use the tblastn program to compare the 44-amino acid sequence of the KH domain to the nucleotide sequence of the human genome. The results will show us other regions of the genome with the potential to code for this domain. Of course, we already know from the BLink page that there are some autosomal homologs, but we do not know whether they actually contain the KH RNA-binding domain. The obvious caveat is that some pseudogenes might contain the domain as well. Our tblastn search returns 4 hits, one being the
Three-dimensional.
An Accession number is a unique identifier given to a sequence when it is submitted to one of the DNA repositories (GenBank, EMBL, DDBJ). The initial deposition of a sequence record is referred to as version 1. If the sequence is updated, the version number is incremented, but the Accession number will remain constant.
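The accession/version relationship described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the familiar `accession.version` text form (for example `AB000001.2`); the identifiers and the helper names `parse_accession_version` and `is_update_of` are made up for this example and are not part of any NCBI tool.

```python
from typing import NamedTuple


class SeqId(NamedTuple):
    accession: str  # stays constant for the life of the record
    version: int    # incremented each time the sequence is updated


def parse_accession_version(identifier: str) -> SeqId:
    """Split an 'accession.version' string into its two parts."""
    accession, sep, version = identifier.strip().rpartition(".")
    if not sep or not accession or not version.isdigit():
        raise ValueError(f"not in accession.version form: {identifier!r}")
    return SeqId(accession, int(version))


def is_update_of(newer: SeqId, older: SeqId) -> bool:
    """True if both IDs name the same record and 'newer' is a later version."""
    return newer.accession == older.accession and newer.version > older.version


# Hypothetical identifiers, used only to exercise the functions.
first = parse_accession_version("AB000001.1")
second = parse_accession_version("AB000001.2")
print(is_update_of(second, first))
```

Keeping the accession and the version as separate fields makes it easy to ask whether two identifiers refer to the same record at different points in its history.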
The
One of the variant forms of a gene at a particular
Application Programming Interface. An API is a set of routines that an application uses to request and carry out lower-level services performed by a computer's operating system. For computers running a graphical user interface, an API manages an application's windows, icons, menus, and dialog boxes.
Abstract Syntax Notation 1 is an international standard data-representation format used to achieve interoperability between computer platforms. It allows for the reliable exchange of data in terms of structure and content by computer and software systems of all types.
Bacterial Artificial Chromosome. A BAC is a large segment of DNA (100,000–200,000 bp) from another species cloned into bacteria. Once the foreign DNA has been cloned into the host bacteria, many copies of it can be made.
BankIt is a tool for the online submission of one or a few sequences into
The value S′ is derived from the raw alignment score S in which the statistical properties of the scoring system used have been taken into account. By normalizing a raw score using the formula:
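In the standard Karlin-Altschul formulation, the normalization is S′ = (λS − ln K) / ln 2, where λ and K are statistical parameters of the scoring system. The sketch below simply evaluates that expression. The values of λ and K shown are illustrative assumptions (they are of the magnitude typically reported for gapped protein searches) and should be replaced with the values reported in the footer of an actual BLAST report.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Normalize a raw alignment score S into a bit score S'.

    S' = (lambda * S - ln K) / ln 2
    """
    return (lam * raw_score - math.log(k)) / math.log(2)


# Illustrative parameters only; real values depend on the scoring
# matrix and gap costs and are printed with each search result.
EXAMPLE_LAMBDA = 0.267
EXAMPLE_K = 0.041

for raw in (50, 100, 200):
    print(raw, round(bit_score(raw, EXAMPLE_LAMBDA, EXAMPLE_K), 1))
```

Because the bit score already accounts for λ and K, bit scores from searches that used different scoring systems can be compared directly, which is not true of raw scores.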
Basic Local Alignment Search Tool (Altschul et al., J Mol Biol 215:403-410; 1990). A sequence comparison
nucleotide–nucleotide BLAST. blastn takes nucleotide sequences in
protein–protein BLAST. blastp takes protein sequences in
A DNA/Protein sequence analysis program to quickly find sequences of 95% and greater similarity of length 40 bases or more. It may miss more divergent or shorter sequence alignments. BLAT on proteins finds sequences of 80% and greater similarity of length 20 amino acids or more. BLAT is not BLAST. (See the
BLAST Link. BLink displays the results of
Binary Large Object (or binary data object). BLOB refers to a large piece of data, such as a bitmap. A BLOB is characterized by large field values, an unpredictable table size, and data that are formless from the perspective of a program. It is also a keyword designating the BLOB structure, which contains information about a block of data.
Blocks Substitution Matrix. A substitution matrix in which scores for each position are derived from observations of the frequencies of substitutions in blocks of local alignments in related proteins. Each matrix is tailored to a particular evolutionary distance. In the
This term refers to binary algebra that uses the logical operators AND, OR, XOR, and NOT; the outcomes consist of logical values (either TRUE or FALSE). The keyword boolean indicates that the expression or constant expression associated with the identifier takes the value TRUE or FALSE. The logical-AND (&&) operator produces the value 1 if both operands have nonzero values; otherwise, it produces the value 0. The logical-OR (||) operator produces the value 1 if either of its operands has a nonzero value. The logical-NOT (!) operator produces the value 0 if its operand is TRUE (nonzero) and the value 1 if its operand is FALSE (0). The exclusive OR (XOR) operator yields TRUE only if one of its operands is TRUE and the other is FALSE. If both operands are the same (either TRUE or FALSE), the operation yields FALSE.
A run of the genome assembly and annotation process or the set of products generated by that run.
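The operators described in this entry can be tabulated with a short Python sketch. Python spells the operators `and`, `or`, and `not` rather than using the C-style symbols, and has no dedicated logical XOR keyword, so inequality of the two Boolean operands is used here as an equivalent.

```python
from itertools import product


def truth_table():
    """Return rows of (a, b, a AND b, a OR b, a XOR b, NOT a)."""
    rows = []
    for a, b in product([False, True], repeat=2):
        rows.append((a, b, a and b, a or b, a != b, not a))
    return rows


print("a      b      AND    OR     XOR    NOT a")
for row in truth_table():
    # Pad each value so the columns line up.
    print(" ".join(f"{str(value):<6}" for value in row))
```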
Cancer Chromosome Aberration Project. CCAP was designed to expedite the definition and detailed characterization of the distinct chromosomal alterations that are associated with malignant transformation. The project is a collaboration among the
Conserved Domain. CD refers to a domain (a distinct functional and/or structural unit of a protein) that has been conserved during evolution. During evolution, changes at specific positions of an amino acid sequence in the protein have occurred in a way that preserves the physico-chemical properties of the original residues, and hence the structural and/or functional properties of that region of the protein.
Conserved Domain Architecture Retrieval Tool. When given a protein query sequence, CDART displays the functional domains that make up the protein and lists proteins with similar domain architectures. The functional domains for a sequence are found by comparing the protein sequence to a database of conserved domain alignments,
Conserved Domain Database. This database is a collection of sequence alignments and profiles representing protein domains conserved during molecular evolution.
complementary DNA. A
coding region, coding sequence. CDS refers to the portion of a genomic DNA sequence that is translated, from the start codon to the stop codon, inclusively, if complete. A partial CDS lacks part of the complete CDS (it may lack either or both the start and stop codons). Successful translation of a CDS results in the synthesis of a protein.
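To make the "start codon to stop codon, inclusively" definition concrete, the following sketch extracts a CDS from a transcript sequence. It is a deliberate simplification and rests on several assumptions: it takes the first ATG as the start codon, reads only the given strand, and uses the three stop codons of the standard genetic code. Real annotation pipelines use far more evidence than this. The function name `find_cds` is invented for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def find_cds(transcript: str):
    """Return the CDS from the first ATG through the first in-frame stop.

    Returns None if there is no start codon or no in-frame stop codon
    (the latter would correspond to a partial CDS).
    """
    seq = transcript.upper().replace("U", "T")
    start = seq.find("ATG")
    if start == -1:
        return None
    # Walk codon by codon in the reading frame set by the start codon.
    for i in range(start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return seq[start:i + 3]  # include the stop codon
    return None


# Toy transcript: 2 nt of 5' UTR, a 4-codon CDS, 2 nt of 3' UTR.
print(find_cds("GGATGAAATTTTAGCC"))
```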
Cancer Genome Anatomy Project. CGAP is an interdisciplinary program to identify the human genes expressed in different cancerous states, based on cDNA (
Comparative Genomic Hybridization. CGH is a fluorescent molecular cytogenetic technique that identifies chromosomal aberrations and maps these changes to metaphase chromosomes. CGH can be used to generate a map of DNA copy number changes in tumor genomes. CGH is based on quantitative two-color fluorescence
Common Gateway Interface. A mechanism that allows a Web server to run a program or script on the server and send the output to a Web browser.
A group that is created based on certain criteria. For example, a gene cluster may include a set of genes whose expression profiles are found to be similar according to certain criteria, or a cluster may refer to a group of clones that are related to each other by homology.
“See in 3-D” is a structure and sequence alignment viewer for NCBI databases. It allows viewing of 3-D structures and sequence–structure or structure–structure alignments. Cn3D can work as a helper application to the browser or as a client–server application that retrieves structure records from the Molecular Modeling Database (MMDB, see below) directly from the internet. The
Sequence of three nucleotides in DNA or mRNA that specifies a particular amino acid during protein synthesis; also called a triplet. Of the 64 possible codons, 3 are stop codons, which do not specify amino acids.
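The counts given in this entry can be checked by enumerating the standard genetic code. The sketch below builds the codon table from the conventional compact encoding (bases in the order T, C, A, G, with `*` marking a stop); the variable names are arbitrary.

```python
from itertools import product

BASES = "TCAG"
# One letter per codon, in TCAG order for each position; '*' = stop.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)

CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

stop_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")

print(len(CODON_TABLE), "codons in total")
print(len(stop_codons), "stop codons:", ", ".join(stop_codons))
print(len(set(AMINO_ACIDS) - {"*"}), "amino acids encoded")
```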
Clusters of Orthologous Groups (of proteins) were delineated by comparing protein sequences from completely sequenced genomes. Each COG consists of individual proteins or groups of paralogs from at least three lineages and thus corresponds to an ancient conserved domain.
The nucleotides or amino acids found most commonly at each position in the sequences of homologous DNAs, RNAs, or proteins.
A contiguous segment of the genome made by joining overlapping clones or sequences. A clone contig consists of a group of cloned (copied) pieces of DNA representing overlapping regions of a particular chromosome. A sequence contig is an extended sequence created by merging primary sequences that overlap. A contig map shows the regions of a chromosome where contiguous DNA segments overlap. Contig maps provide the ability to study a complete and often large segment of the genome by examining a series of overlapping clones, which then provide an unbroken succession of information about that region.
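The idea of building a sequence contig by merging overlapping sequences can be sketched as follows. This toy version assumes error-free reads and requires an exact suffix/prefix match, which real assemblers do not; the function name and the `min_overlap` parameter are illustrative choices, not part of any NCBI software.

```python
def merge_overlapping(left: str, right: str, min_overlap: int = 3):
    """Merge two sequences if a suffix of 'left' equals a prefix of 'right'.

    The longest exact overlap of at least 'min_overlap' bases is used.
    Returns the merged sequence, or None if no such overlap exists.
    """
    longest_possible = min(len(left), len(right))
    for size in range(longest_possible, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return left + right[size:]
    return None


# Two toy reads that share the four bases ACGG.
print(merge_overlapping("ACGTACGG", "ACGGTTA"))
```

Repeating this merge across an ordered set of overlapping sequences yields the extended, unbroken sequence that the entry calls a sequence contig.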
Central Processing Unit. The CPU is the computational and control unit of a computer, the device that interprets and executes instructions.
Cascading Style Sheets. CSS specify the formatting details that control the presentation and layout of
A tool of
Data Creation and Maintenance System
A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The definition line or description line is distinguished from the sequence data by a “greater than” (>) symbol in the first column (see
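A minimal reader for the layout just described might look like the sketch below. It assumes plain text input in which every record starts with a `>` line; the record names in the sample are invented, and the function `parse_fasta` is an illustration rather than an NCBI utility.

```python
def parse_fasta(text: str):
    """Parse FASTA text into a list of (description, sequence) tuples."""
    records = []
    description, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new definition line closes the previous record, if any.
            if description is not None:
                records.append((description, "".join(chunks)))
            description, chunks = line[1:], []
        elif description is not None:
            chunks.append(line)  # sequence data may span many lines
    if description is not None:
        records.append((description, "".join(chunks)))
    return records


SAMPLE = """>seq1 hypothetical example record
ACGTACGT
ACGT
>seq2 another made-up record
TTGGCC
"""

for description, sequence in parse_fasta(SAMPLE):
    print(description, len(sequence))
```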
Deoxyribonucleic acid is the chemical inside the nucleus of a cell that carries the genetic instructions for making living organisms. DNA is composed of two anti-parallel strands, each a linear polymer of nucleotides. Each nucleotide has a phosphate group linked by a phosphoester bond to a pentose (a five-carbon sugar molecule, deoxyribose), that in turn is linked to one of four organic bases, adenine, guanine, cytosine, or thymine, abbreviated A, G, C, and T, respectively. The bases are of two types: purines, which have two rings and are slightly larger (A and G); and pyrimidines, which have only one ring (C and T). Each nucleotide is joined to the next nucleotide in the chain by a covalent phosphodiester bond between the 5′ carbon of one deoxyribose group and the 3′ carbon of the next. DNA is a helical molecule with the sugar–phosphate backbone on the outside and the nucleotides extending toward the central axis. There is specific base-pairing between the bases on opposite strands in such a way that A always pairs with T and G always pairs with C.
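Because the two strands are anti-parallel and pair A with T and G with C, the sequence of one strand determines the other. The short sketch below derives the opposite strand, read 5′ to 3′, from a given strand; it handles only the four unambiguous bases.

```python
# Map each base to its pairing partner.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(strand: str) -> str:
    """Return the opposite strand, written 5' to 3'."""
    # Complement every base, then reverse to account for anti-parallel strands.
    return strand.upper().translate(COMPLEMENT)[::-1]


print(reverse_complement("ATGC"))
```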
A “domain” refers to a discrete portion of a protein assumed to fold independently of the rest of the protein and which possesses its own function.
Draft sequence refers to DNA sequence that is not yet finished but is generally of high quality (i.e., an accuracy of greater than 90%). Draft sequence data are mostly in the form of 10,000 base pair-sized fragments, the approximate chromosomal locations of which are known. The following keywords are associated with draft sequence: phase 0, light-pass coverage of a clone, generally only 1× coverage; phase 1, 4–10× coverage of a
Document Type Definition. The DTD is an optional part of the prolog of an XML document that defines the rules of the document. It sets constraints for an XML document by specifying which elements are present in the document and the relationships between elements, e.g., which tags can contain other tags, the number and sequence of the tags, and attributes of the tags. The DTD helps to validate the data when the receiving application does not have a built-in description of the incoming data.
A program for filtering low-complexity regions from nucleic acid sequences.
Expect value. The E-value is a parameter that describes the number of hits one can “expect” to see by chance when searching a database of a particular size. It decreases exponentially with the score (S) that is assigned to a match between two sequences. Essentially, the E-value describes the random background noise that exists for matches between sequences. For example, an E-value of 1 assigned to a hit can be interpreted as meaning that in a database of the current size, one might expect to see one match with a similar score simply by chance. This means that the lower the E-value, or the closer it is to “0”, the higher is the “significance” of the match. However, it is important to note that a short query can match a database sequence almost identically and still receive a relatively high E-value. This is because the calculation of the E-value also takes into account the length of the query sequence: shorter sequences have a higher probability of occurring in the database purely by chance. For more information, see the following
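The dependence of the E-value on score and on search-space size can be illustrated with the simplified Karlin-Altschul relation E = m · n · 2^(−S′), where S′ is the bit score, m the query length, and n the total database length. The sketch below is an approximation under that assumption; actual BLAST programs use "effective" lengths that correct for edge effects, so reported values will differ somewhat. The lengths used are arbitrary.

```python
def expect_value(bits: float, query_len: int, db_len: int) -> float:
    """Approximate E-value: E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bits)


DB_LEN = 10**9  # arbitrary database size, in residues

# Each additional bit of score halves the expected number of chance hits.
for bits in (30, 40, 50):
    print(bits, expect_value(bits, query_len=300, db_len=DB_LEN))

# For a fixed bit score, E scales with the size of the search space.
print(expect_value(40, 300, DB_LEN) / expect_value(40, 300, DB_LEN // 10))
```

A short query can only accumulate a limited score even when it matches perfectly, so its best achievable E-value stays comparatively high.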
A number assigned to a type of enzyme according to a scheme of standardized enzyme nomenclature developed by the Enzyme Commission of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (IUBMB). EC numbers may be found in
Entrez is a retrieval system for searching several linked databases. It provides access to the following NCBI databases:
Expressed Sequence Tag. ESTs are short (usually approximately 300–500 base pairs), single-pass sequence reads from
Electronic
“Ahead-of-print” citation.
Exon Finding by Sequence Homology. Exofish is a tool based on homology searches for the rapid and reliable identification of human genes. It relies on the sequence of another vertebrate, the pufferfish
Refers to the portion of a gene that encodes for a part of that gene's mRNA. A gene may comprise many exons, some of which may include only protein-coding sequence; however, an exon may also include 5' or 3' untranslated sequence. Each exon codes for a specific portion of the complete protein. In some species (including humans), a gene's exons are separated by long regions of DNA (called
Exon trapping is a technique for cloning exon sequences from genomic DNA by selecting for functional splice sites, relying on the cellular splicing machinery. The genomic DNA containing the putative exon(s) is cloned into an exon-trap vector, which has a promoter, polyadenylation signals, and splice sites, and then transfected into a cell line. If there are functional splice sites in the genomic DNA fragment, the segments of DNA between the splice sites will be removed. Total RNA is isolated and reverse-transcribed. After
The first widely used algorithm for similarity searching of protein and DNA sequence databases. The program looks for optimal local alignments by scanning the sequence for small matches called “words”. Initially, the scores of segments in which there are multiple word hits are calculated (“init1”). Later, the scores of several segments may be summed to generate an “initn” score. An optimized alignment that includes gaps is shown in the output as “opt”. The sensitivity and speed of the search are inversely related and controlled by the “k-tup” variable, which specifies the size of a “word” (Pearson and Lipman). Also refers to a
The pattern of bands on a gel produced by a clone when restricted by a particular enzyme, such as
High-quality, low-error DNA sequence that is free of gaps. To qualify as a finished sequence, only a single error out of every 10,000 bases (i.e., an accuracy of 99.99%) is allowed.
Fluorescence
A flat file is a data file that contains records (each corresponding to a row in a table); however, these records have no structured relationships. To interpret these files, the format properties of the file should be known. For example, a database management system may allow the user to export data to a comma-delimited file. Such a file is called a flat file because it has no inherent information about the data, and interpretation requires additional information. Files in a database management system have more complex storage structures.
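The point that a flat file carries no inherent description of its own fields can be seen in a small sketch. Here a comma-delimited export is read with Python's standard `csv` module, and the column names have to be supplied separately. The file contents and column names are invented for illustration.

```python
import csv
import io

# A made-up comma-delimited export: one record per line, no header.
FLAT_FILE = "ID0001,1200,Homo sapiens\nID0002,850,Mus musculus\n"

# The meaning of each column is not stored in the file itself; it must
# come from documentation or from the system that produced the export.
COLUMNS = ("identifier", "length", "organism")


def read_flat_file(text: str, columns):
    """Return each row as a dict keyed by the externally supplied columns."""
    reader = csv.reader(io.StringIO(text))
    return [dict(zip(columns, row)) for row in reader]


for record in read_flat_file(FLAT_FILE, COLUMNS):
    print(record)
```

Note that every value comes back as a string; even the data types are part of the "additional information" needed to interpret the file.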
To copy changing data so as to preserve the dataset as it existed at a particular point in time. Also used to refer to the resulting set of frozen data.
File Transfer Protocol. A method of retrieving files over a network directly to the user's computer or to his/her home directory using a set of protocols that govern how the data are to be transported.
A gap is a space introduced into an alignment to compensate for insertions and deletions in one sequence relative to another. To prevent the accumulation of too many gaps in an alignment, introduction of a gap causes the deduction of a fixed amount (the gap score) from the alignment score. Extension of the gap to encompass additional nucleotides or amino acid is also penalized in the scoring of an alignment. (See the
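The two-part penalty described here (a fixed cost for introducing a gap plus a cost for each position it spans) is often called an affine gap cost. The sketch below assumes one common convention, in which a gap of length k costs `gap_open + k * gap_extend`; the default values of 11 and 1 are illustrative, and other programs may count the first position differently.

```python
def gap_penalty(length: int, gap_open: int = 11, gap_extend: int = 1) -> int:
    """Affine cost of a single gap spanning 'length' positions."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * length


def alignment_score(substitution_scores, gap_lengths,
                    gap_open: int = 11, gap_extend: int = 1) -> int:
    """Sum the per-column substitution scores and subtract all gap costs."""
    total = sum(substitution_scores)
    for length in gap_lengths:
        total -= gap_penalty(length, gap_open, gap_extend)
    return total


# One long gap is cheaper than several short gaps of the same total length.
print(gap_penalty(3), 3 * gap_penalty(1))
```

Charging much more to open a gap than to extend it favors alignments with a few longer gaps over many scattered single-residue gaps.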
gigabytes
GenBank is a database of nucleotide sequences from more than 100,000 organisms. Records that are annotated with coding region features also include amino acid translations. GenBank belongs to an international collaboration of sequence databases that also includes
The instructions in a gene that tell the cell how to make a specific protein. A, T, G, and C are the “letters” of the
A gene identification algorithm that is used to identify exon–intron structures in genomic DNA sequence.
The genetic identity of an individual that does not show as outward characteristics. The genotype refers to the pair of alleles for a given region of the genome that an individual carries.
Gene Expression Omnibus. GEO is a gene expression data repository and online resource for the retrieval of gene expression data from any organism or artificial source. Many types of gene expression data from platform types, such as spotted microarray, high-density oligonucleotide array, hybridization filter, and serial analysis of gene expression (
The GenInfo Identifier is a sequence identification number for a nucleotide sequence. If a nucleotide sequence changes in any way, a new GI number will be assigned. A separate GI number is also assigned to each protein translation within a nucleotide sequence record, and a new GI is assigned if the protein translation changes in any way. GI sequence identifiers run parallel to the new accession.version system of sequence identifiers (see the description of
Genome Survey Sequences are analogous to
The probability that a diploid individual will have two different alleles at a particular genome locus. These individuals are defined as heterozygous, whereas individuals who have two identical alleles at the locus are defined as homozygous. The probability can be estimated by sampling a representative number of individuals from the population and dividing the number of heterozygotes by the total number sampled.
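The estimate described in the last sentence is a simple proportion, sketched below. Each sampled individual is represented as a pair of alleles at the locus; the sample data are invented.

```python
def observed_heterozygosity(genotypes) -> float:
    """Fraction of sampled individuals carrying two different alleles."""
    if not genotypes:
        raise ValueError("at least one genotype is required")
    heterozygotes = sum(1 for allele_1, allele_2 in genotypes
                        if allele_1 != allele_2)
    return heterozygotes / len(genotypes)


# Four hypothetical individuals genotyped at one biallelic locus.
sample = [("A", "G"), ("A", "A"), ("G", "G"), ("G", "A")]
print(observed_heterozygosity(sample))
```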
Human Immunodeficiency Virus. HIV-1 is a retrovirus that is recognized as the causative agent of AIDS (Acquired Immunodeficiency Syndrome).
Hereditary nonpolyposis colon cancer
A region of the chromosome identified cytologically by DNA staining or the
The term refers to similarity attributable to descent from a common ancestor. Homologous chromosomes are members of a pair of essentially identical chromosomes, each derived from one parent. They have the same or allelic genes with genetic loci arranged in the same order. Homologous chromosomes synapse during meiosis.
High-Throughput Genomic Sequences. The source of HTGS are large-scale genome sequencing centers;
A keyword added to GenBank entries by sequencing centers to indicate that work has stopped on a clone and that the existing sequence will not be finished. Sequencing centers may stop work because the clone is redundant or for various other reasons.
Keywords added to GenBank entries by sequencing centers to indicate the status (phase) of the sequence (see phase definitions described under
Hypertext Markup Language. HTML is derived from
Hold Until Published. HUP is the category for electronically submitted data that specifies when the data should be released to the public.
International Code of Botanical Nomenclature
International Classification of Diseases
International Code of Nomenclature of Bacteria
International Code of Nomenclature for Cultivated Plants
A diagrammatic representation of the karyotype of an organism.
Integrated Molecular Analysis of Genomes and their Expression. A consortium of academic groups that share high-quality, arrayed cDNA libraries and place sequence, map, and expression data of the clones in these arrays into the public domain. With the use of this information, unique clones can be rearrayed to form a “master array”, with the aim of ultimately having a representative cDNA from every gene in the genome under study. To date, human, mouse, rat, zebrafish, and
Refers to that portion of the DNA sequence that is present in the primary transcript and that is removed by splicing during RNA processing and is not included in the mature, functional
Indexed Sequential-Access Method. ISAM is a database access method. It allows data records in a database to be accessed either sequentially (in the order in which they were entered) or randomly (using an index). In the index, each record has a unique key that enables its rapid location. The key is the field used to reference the record.
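The two access paths described in this entry can be modeled in memory with a list (for entry order) and a dictionary (for the index). This is only a conceptual sketch: a real ISAM implementation stores records and index on disk, and the class and method names here are invented.

```python
class IsamTable:
    """Toy table offering sequential and indexed (random) access."""

    def __init__(self):
        self._records = []  # records in the order they were entered
        self._index = {}    # unique key -> position in self._records

    def insert(self, key, record):
        if key in self._index:
            raise KeyError(f"duplicate key: {key!r}")
        self._index[key] = len(self._records)
        self._records.append((key, record))

    def scan(self):
        """Sequential access: yield records in entry order."""
        yield from self._records

    def lookup(self, key):
        """Random access: jump straight to a record through the index."""
        return self._records[self._index[key]][1]


table = IsamTable()
table.insert("rec3", "third gene")
table.insert("rec1", "first gene")
print(list(table.scan()))
print(table.lookup("rec1"))
```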
International System for Human Cytogenetic Nomenclature
The particular chromosome complement of an individual or a related group of individuals, as defined by both the number and morphology of the chromosomes, usually in mitotic metaphase, and arranged by pairs according to the standard classification.
Laboratory Information Management Systems. LIMS comprise software that helps biological and chemical laboratories handle data generation, information management, and data archiving.
A registry service to create links from specific articles, journals, or biological data in
In a genomic context, locus refers to a position on a chromosome. It may, therefore, refer to a marker, a gene, or any other landmark that can be described.
Each new
LocusLink provides a single query interface to curated sequence and descriptive information about genetic loci. LocusLink issues a stable ID (
Multiple Alignment Construction and Analysis Workbench. MACAW is a program for locating, analyzing, and editing blocks of localized sequence similarity among multiple sequences and linking them into a composite multiple alignment.
The Map Viewer is a software component of
megabytes
MEDLINE is
MegaBLAST is a program for aligning sequences that differ slightly as a result of sequencing or other similar “errors”. When a larger word size is used, it is up to 10 times faster than more common sequence-similarity programs. MegaBLAST is also able to efficiently handle much longer DNA sequences than the
Medical Subject Headings. MeSH refers to the controlled vocabulary of
Multi-FASTA format.
Mammalian Gene Collection.
Repetitive stretches of short sequences of DNA used as genetic markers to track inheritance in families (e.g., CC[TATATATA]CCCT). Also known as short tandem repeats (STRs).
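As a rough illustration of how such short tandem repeats might be located in a sequence, the following sketch uses a regular expression with a backreference. The function name and the default thresholds (repeat units of 2 to 6 bases, at least three copies) are assumptions made for this example, not a standard definition.

```python
import re

def find_tandem_repeats(seq, unit_min=2, unit_max=6, min_copies=3):
    """Return (start, unit, copies) for each run of a short repeated unit.

    The regex captures a candidate unit and then requires it to be
    repeated at least (min_copies - 1) more times immediately after.
    """
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (unit_min, unit_max, min_copies - 1)
    )
    return [
        (m.start(), m.group(1), len(m.group(0)) // len(m.group(1)))
        for m in pattern.finditer(seq)
    ]

# The example from the definition: CC[TATATATA]CCCT
print(find_tandem_repeats("CCTATATATACCCT"))  # [(2, 'TA', 4)]
```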
Mendelian Inheritance in Man. First published in 1966,
An ordered list or map that defines the minimal set of overlapping clones needed to provide complete coverage of a chromosome or other extended segment of DNA (compare with
Molecular Modeling Database. MMDB is a database of three-dimensional biomolecular structures derived from X-ray crystallography and nuclear magnetic resonance (NMR) spectroscopy.
Molecular Modeling Database Accession number.
messenger RNA. mRNA describes the section of a genomic DNA sequence that is transcribed, and can include the 5' untranslated region (5'UTR),
A permanent structural alteration in DNA. In most cases, DNA changes have either no effect or cause harm, but occasionally a mutation can improve an organism's chance of surviving, and the beneficial change is passed on to the organism's descendants. Typically, mutations are more rare than polymorphisms in population samples because natural selection recognizes their lower fitness and removes them from the population.
Contains supported software tools from the Information Engineering Branch (IEB) of the NCBI. The NCBI Toolkit describes the three components of the ToolBox: data model, data encoding, and programming libraries. Provides access to documentation for the DataModel, C Toolkit, C++ Toolkit, NCBI C Toolkit Source Browser, XML Demo Program, XML DTDs, and the
NEXUS refers to a file format designed to contain data for processing by computer programs. NEXUS files should end with .nxs or .nex for purposes of clarity (Maddison et al., Syst Biol 46:590-621; 1997).
Nuclear Magnetic Resonance. NMR is a spectroscopic technique used for the determination of protein structure.
non-redundant Protein Data Bank
Online Mendelian Inheritance in Man. OMIM is a directory of human genes and genetic disorders, with links to literature references, sequence records, maps, and related databases.
Orthology describes genes in different species that derive from a single ancestral gene in the last common ancestor of the respective species.
Orthology describes genes in different species that derive from a common ancestor, i.e., they are direct evolutionary counterparts.
A paralog is one of a set of homologous genes that have diverged from each other as a consequence of gene duplication. For example, the mouse α-
Paralogy describes the relationship of homologous genes that arose by gene duplication.
Polymerase Chain Reaction. A technique for amplifying a specific DNA segment in a complex mixture. Also present in the DNA mixture are short oligonucleotide primers to the DNA segment of interest and reagents for DNA synthesis. PCR relies on the ability of DNA to separate into its two complementary strands at high temperature (a process called denaturation) and for the two strands to anneal at an optimal lower temperature (annealing). The annealing phase is followed by a DNA synthesis step at an optimal temperature for a heat-stable DNA polymerase. After multiple rounds of denaturation, annealing, and DNA synthesis, the DNA sequence specified by the oligonucleotide primers is amplified.
The observable traits or characteristics of an organism, e.g., hair color, weight, or the presence or absence of a disease. Phenotypic traits are not necessarily genetic.
A computer program that assembles raw sequence into sequence contigs (see above) and assigns to each position in the sequence an associated “quality score”, on the basis of the
A computer program that analyses raw sequence to produce a “base call” with an associated “quality score” for each position in the sequence. A PHRED quality score of
Pattern of presence–absence of a cluster of orthologs (COG) in different species.
PubMed ID number
Portable Network Graphics. An extensible file format for the lossless, well-compressed storage of raster images (images that are composed of horizontal lines of pixels, such as those created by a computer screen). Compression of image, media, and application files is necessary to reduce the transmission time across the web. The technique of lossless compression reduces the size of the file without sacrificing any original data, and the image after expansion is exactly as it was before compression. PNG overcomes the patent issues of GIF (Graphic Interchange Format) and can replace many common uses of TIFF (Tagged Image File Format). Several features such as indexed color, grayscale, and truecolor are supported, as well as an optional alpha-channel. PNG is designed to work well in online viewing applications and is supported as an image standard by the
A string of adenylic acid residues that are added to the 3′ end of the primary
A common variation in the sequence of
Linear polymer of amino acids connected by peptide bonds. Proteins are large polypeptides, and the two terms are commonly used interchangeably.
Variations that are only common in specific populations. Usually such populations are reproductively isolated from other, larger groups. These variations may be completely absent in other groups.
A database of protein sequences from eight organisms: human (
A sequence of DNA that is very similar to a normal gene but that has been altered slightly so that it is not expressed. Such genes were probably once functional but, over time, acquired one or more mutations that rendered them incapable of producing a protein product.
Position-Specific Iterated BLAST. PSI-BLAST (Altschul et al., J Mol Biol 215:403-410; 1990) is used for iterative protein–sequence similarity searches using a position-specific score matrix (
Position-Specific Score Matrix. The PSSM gives the log-odds score for finding a particular matching amino acid in a target sequence.
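A minimal sketch of how log-odds scores of this kind might be computed from a gap-free alignment is given below. The function names, the simple additive pseudocount, and the caller-supplied background frequencies are assumptions for illustration; PSI-BLAST itself uses a more elaborate weighting scheme.

```python
import math

def build_pssm(aligned, alphabet, background, pseudocount=1.0):
    """Build one {residue: log-odds score} dict per alignment column."""
    pssm = []
    for i in range(len(aligned[0])):
        column = [seq[i] for seq in aligned]
        total = len(column) + pseudocount * len(alphabet)
        scores = {}
        for residue in alphabet:
            # Observed frequency (smoothed) relative to background frequency.
            freq = (column.count(residue) + pseudocount) / total
            scores[residue] = math.log2(freq / background[residue])
        pssm.append(scores)
    return pssm

def score(pssm, target):
    """Sum the position-specific scores for a target of the same length."""
    return sum(col[res] for col, res in zip(pssm, target))
```

A residue that is over-represented in a column relative to the background receives a positive score at that position; an under-represented residue receives a negative one.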
A retrieval system containing citations, abstracts, and indexing terms for journal articles in the biomedical sciences. It includes literature citations supplied directly to NCBI by publishers as well as
PubMed Central XML file
A queuing system to BLAST that allows users to retrieve their results at their convenience and format their results multiple times with different formatting options.
Quantitative Trait Locus. A QTL is a hypothesis that a certain region of the chromosome contains genes that contribute significantly to the expression of a complex trait. QTLs are generally identified by comparing the linkage of polymorphic molecular markers and phenotypic trait measurements. The density of the linkage map is important in the accurate and precise location of QTLs; the higher the map density, the more precise the location of the putative QTL, although there is increased likelihood that false positives will be detected. Once QTLs have been mapped to a relatively small chromosomal region, other molecular methods can be used to isolate specific genes.
Reciprocal best hits are proteins from different organisms that are each other's top BLAST hit, when the proteomes from those organisms are compared to each other. For example, proteins A–Z in organism 1 are compared against proteins AA–ZZ in organism 2. If protein A has a best hit to protein RR, and RR's best hit, when it is compared to all the proteins in organism 1, also turns out to be protein A, then A and RR are reciprocal best hits. However, if RR's best hit is to B rather than to A, then A and RR are not reciprocal best hits.
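The reciprocal best hit test can be sketched as follows, assuming the top BLAST hit for every protein has already been extracted into two dictionaries (one per search direction). The function and variable names are hypothetical.

```python
def reciprocal_best_hits(best_1_to_2, best_2_to_1):
    """Return sorted (protein_in_1, protein_in_2) pairs that are each other's top hit.

    best_1_to_2 maps each organism-1 protein to its best hit in organism 2;
    best_2_to_1 maps each organism-2 protein to its best hit in organism 1.
    """
    return sorted(
        (a, b)
        for a, b in best_1_to_2.items()
        if best_2_to_1.get(b) == a
    )

# A's best hit is RR and RR's best hit is A, so they pair up.
print(reciprocal_best_hits({"A": "RR"}, {"RR": "A"}))  # [('A', 'RR')]
# RR's best hit is B rather than A, so A and RR do not pair up.
print(reciprocal_best_hits({"A": "RR"}, {"RR": "B"}))  # []
```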
RefSeq is the NCBI database of reference sequences; a curated, non-redundant set including genomic DNA contigs, mRNAs and proteins for known genes, and entire chromosomes.
Restriction Fragment Length Polymorphism. Genetic variations at the site where a restriction enzyme cuts a piece of DNA. Such variations affect the size of the resulting fragments. These sequences can be used as markers on physical maps and linkage maps. RFLP is also pronounced “rif lip”.
Radiation Hybrid map. A genome map in which
Ribonucleic Acid. A single-stranded nucleic acid, similar to
Reverse Position-Specific BLAST. A program used to identify conserved domains in a protein query sequence. It does this by comparing a query protein sequence to position-specific score matrices (
Serial Analysis of Gene Expression. An experimental technique designed to quantitatively measure gene expression.
Sequin is a stand-alone software tool developed by the
Saccharomyces Genome Database. A database for the molecular biology and genetics of
Standard Generalized Markup Language. The international standard for specifying the structure and content of electronic documents. SGML is used for the markup of data in a way that is self-describing. SGML is not a language but a way of defining languages that are developed along its general principles. A subset of SGML called
Spectral Karyotyping. SKY is a technique that allows for the visualization of all of an organism's chromosomes together, each labeled with a different color. This is achieved by using chromosome-specific, single-stranded DNA probes (each labeled with a different fluorophore) to hybridize or bind to the chromosomes of a cell; resulting in each chromosome being painted a different color. This technique is useful for identifying chromosome abnormalities because it is easy to spot instances where a chromosome painted in one color has a small piece of another chromosome, painted in a different color, attached to it. (Also see
1. A software tool to automatically convert the short-form karyotype into an image representation of a cell or clone, with each chromosome displayed in a different color, with band overlay. The program will also incorporate the number of cells for each structural abnormality, which is displayed in brackets. 2. The full ideogram of a cell or clone, with each chromosome displayed in a different color, with band overlay.
Simple Modular Architecture Research Tool. A tool to allow automatic identification and annotation of domains in user-supplied protein sequences. For example, the
Common, but minute, variations that occur in human DNA at a frequency of 1 in every 1,000 bases. An SNP is a single base-pair site within the genome at which more than one of the four possible base pairs is commonly found in natural populations. Several hundred thousand SNP sites are being identified and mapped on the sequence of the genome, providing the densest possible map of genetic differences. SNP is pronounced “snip”.
Simple Omnibus Format in Text. SOFT is an ASCII text format that was designed to be a machine-readable representation of data retrieved from, or submitted to, the Gene Expression Omnibus (
Refers to the location of the exon-intron junctions in a pre-mRNA (i.e., the primary transcript that must undergo additional processing to become a mature RNA for translation into a protein). Splice sites can be determined by comparing the sequence of genomic DNA with that of the
Sequence Search and Alignment by Hashing Algorithm. SSAHA is a software tool for very fast matching and alignment of DNA sequences and is used for searching databases containing large amounts (gigabases) of genome sequence. It achieves its fast search speed by converting sequence information into a “hash table” data structure, which can then be searched very rapidly for matches (Ning et al., Genome Res 11:1725-1729; 2001).
Simple Sequence Length Polymorphisms. SSLPs are markers based on the variation in the number of short tandem repeats in DNA.
A short DNA segment that occurs only once in the human genome, the exact location and order of bases of which are known. Because each is unique, STSs are helpful for chromosome placement of mapping and sequencing data from many different laboratories. STSs serve as landmarks on the physical map of the human genome.
A substitution matrix containing values proportional to the probability that amino acid i mutates into amino acid j for all pairs of amino acids. Such matrices are constructed by assembling a large and diverse sample of verified pairwise alignments of amino acids. If the sample is large enough to be statistically significant, the resulting matrices should reflect the true probabilities of mutations occurring through a period of evolution. (See also
A trademarked family of products that include databases, development tools, integration middleware, enterprise portals, and mobile and wireless servers.
On the same strand. The phrase “conserved synteny” refers to conserved gene order on chromosomes of different, related species.
BLAST Taxonomy Reports page. Tax BLAST groups BLAST hits by source organism, according to information in
Taxonomy Identifier. The taxID is a stable unique identifier for each taxon (for a species, a family, an order, or any other group in the taxonomy database). The taxID is seen in the
See
One of three codons that do not specify any amino acid and hence causes translation of mRNA into protein to be terminated. These codons mark the end of a protein coding sequence.
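As an illustration, the end of a coding sequence can be located by reading an mRNA three bases at a time until one of the three stop codons (UAA, UAG, or UGA in the standard genetic code) is reached. The function name and the choice to return a 0-based index are assumptions for this sketch.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}  # standard genetic code

def first_stop_codon(mrna, start=0):
    """Return the index of the first in-frame stop codon, or None.

    Reading begins at `start` and advances one codon (3 bases) at a time,
    so only stop codons in that reading frame are reported.
    """
    for i in range(start, len(mrna) - 2, 3):
        if mrna[i:i + 3] in STOP_CODONS:
            return i
    return None

print(first_stop_codon("AUGGCUUAAGG"))  # 6 (codons: AUG GCU UAA)
```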
An ordered list or map that defines a set of overlapping clones that covers a chromosome or other extended segment of DNA.
Third-Party Annotation
Tiling Path Format. A table format used to specify the set of clones that will provide the best possible sequence coverage for a particular chromosome, the order of the clones along the chromosome, and the location of any gaps in the clone tiling path. Also used to refer to a file (Tiling Path File) in which the
The position within an mRNA at which synthesis of a protein begins. The translation start site is usually an AUG codon, but occasionally, GUG or CUG codons are used to initiate protein synthesis.
Unique Identifier
See
UNIX is an operating system that was developed by Dennis Ritchie and Kenneth Thompson at Bell Labs more than 30 years ago. It allows multitasking and multiuser capabilities and offers portability with other operating systems. It comes with hundreds of programs that are of two types: integral utilities, such as the command line interpreter; and tools such as email, which are not necessary for the operation of UNIX but provide additional capabilities to the user. It is functionally organized at three levels: the kernel, which schedules tasks and manages storage; the shell, which connects and interprets users' commands, calls programs from memory, and executes them; and tools and applications, which offer additional functionality to the operating system, such as word processing and business applications. UNIX® was registered by
Uniform Resource Locator. The address of a resource on the Internet. URL syntax is in the form of protocol://host/localinfo, where “protocol” specifies the means of fetching the object (such as HTTP, used by
UCS (Universal Character Set) Transformation Format. An ASCII-preserving encoding method for Unicode (a standard to provide a unique number for every character irrespective of the platform, program, or language).
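The ASCII-preserving property can be checked directly, assuming the UTF-8 variant of the encoding: ASCII characters are encoded as the same single bytes, whereas other characters take two or more bytes. The helper name `utf8_bytes` is hypothetical.

```python
def utf8_bytes(text):
    """Return the UTF-8 encoding of `text` as a list of byte values."""
    return list(text.encode("utf-8"))

# Plain ASCII text is byte-for-byte identical in ASCII and UTF-8.
print("BLAST".encode("utf-8") == "BLAST".encode("ascii"))  # True

# A non-ASCII character such as Greek alpha (U+03B1) needs two bytes.
print(utf8_bytes("A"))   # [65]
print(utf8_bytes("α"))   # [206, 177]
```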
Untranslated Region. The 3′ UTR is that portion of an
An assignment of importance to a term in a search query. If a term in a search query is found to match a word in a document, that word is given a “weight”. The exact weight of the word will depend on the emphasis given to the word by the author or its position in the document. For example, a word that occurs in a chapter title will have a higher weight than the same word if it occurs in the body of the chapter. Similarly, words that occur in data collections are also assigned weights, depending on how frequently the terms occur in the collection.
Whole Genome Shotgun sequence. In this semi-automated sequencing technique, high-molecular-weight DNA is sheared into random fragments, size selected (usually 2, 10, 50, and 150 kb), and cloned into an appropriate vector. The clones are then sequenced from both ends. The two ends of the same clone are referred to as mate pairs. The distance between two mate pairs can be inferred if the library size is known and has a narrow window of deviation. The sequences are aligned using sequence assembly software. Proponents of this approach argue that it is possible to sequence the whole genome at once using large arrays of sequencers, which makes the whole process much more efficient than the traditional approaches.
World Wide Web. A
Extensible Markup Language. XML describes a class of data objects called XML documents and partially describes the behavior of computer programs that process them. XML is a subset of SGML, and XML documents are conforming SGML documents. XML documents are made up of storage units called entities, which contain either parsed or unparsed data. Parsed data is made up of characters (a unit of text), some of which form character data, and some of which form markup. Markup includes tags that provide information about the data, i.e., a description of the structure and content of the document. Character data comprises all the text that is not markup. XML provides a mechanism to impose constraints on the storage layout and logical structure.
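The distinction between markup and character data can be seen by parsing a small document with Python's standard library. The element names (`entry`, `term`, `definition`) and the `id` attribute are made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical glossary record; the tags are markup, the text between them is character data.
doc = "<entry id='1'><term>mRNA</term><definition>messenger RNA</definition></entry>"

root = ET.fromstring(doc)

# Markup: element names describing the structure of the document.
markup = [element.tag for element in root.iter()]

# Character data: the text that is not markup.
character_data = [element.text for element in root.iter() if element.text]

print(markup)          # ['entry', 'term', 'definition']
print(character_data)  # ['mRNA', 'messenger RNA']
```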
Extensible Stylesheet Language. XSL is used for the transformation of XML-based data into HTML or other presentation formats, for display in a web browser. This is a two-part process. First, the structure of the input XML tree must be transformed into a new tree (e.g., HTML), allowing reordering of the elements, addition of text, and calculations—all without modification to the source document. This process is described by
Extensible Stylesheet Language: Transformations. XSLT is a language for transforming the structure of an XML document. XSLT is designed for use as part of
Yeast Artificial Chromosome. Extremely large segments of DNA from another species spliced into the DNA of yeast. YACs are used to clone up to one million bases of foreign DNA into a host cell, where the DNA is propagated along with the other chromosomes of the yeast cell.
Zebrafish Information Network.