Using Genomic Databases for Sequence-Based Biological Discovery
© Feinstein Institute for Medical Research 2003
- Received: 27 August 2003
- Accepted: 1 October 2003
- Published: 30 May 2004
The potential of the sequence data produced by the International Human Genome Sequencing Consortium and other systematic sequencing projects is tremendous. It is therefore increasingly important that all biologists be able to navigate through and cull important information from key publicly available databases. The continued rapid rise in available sequence information, particularly as model organism data are generated at breakneck speed, also underscores the need for all biologists to learn how to make their way effectively through the expanding “sequence information space.” This review discusses some of the more commonly used tools that have been developed for the effective and efficient mining of sequence information. These include LocusLink, which provides a gene-centric view of sequence-based information, as well as the 3 major genome browsers: the National Center for Biotechnology Information Map Viewer, the University of California Santa Cruz Genome Browser, and the European Bioinformatics Institute’s Ensembl system. An overview of the types of information available through each of these front-ends is given, as well as information on tutorials and other documentation intended to increase the reader’s familiarity with these tools.
In April 2003, the scientific community celebrated the achievement of the Human Genome Project’s major goal: completion of a high-accuracy sequence of the human genome. The significance of attaining this goal, which many have compared with landing a man on the moon, cannot be overstated. This milestone firmly marks the entrance of modern biology into the genomic era (and not the post-genomic era, as many have stated), changing the way in which biological and clinical research will be conducted in the future. The intelligent use of sequence data from human and model organisms, along with technological innovations fostered by the Human Genome Project, will lead to significant advances in our understanding of diseases that have a genetic basis and, more importantly, in how health care is delivered from this point forward.
The completion of human genome sequencing has given the biological community an opportunity to look forward and begin to think about how to use genomic approaches in ways that will lead to tangible health benefits. To that end, the National Human Genome Research Institute led a 2-year process involving hundreds of scientists and members of the public in more than a dozen workshops and individual consultations. This process led to the publication of a document entitled A Vision for the Future of Genomics Research (1). This “vision document” sets forth a number of “grand challenges” organized around 3 major themes: genomics to biology, genomics to health, and genomics to society. These grand challenges are intended to provide ambitious, interdisciplinary research goals for the scientific community that will eventually translate the promise of the Human Genome Project into improved human health.
As part of this vision, 6 critical “cross-cutting elements” were identified as being relevant to all 3 of the thematic areas. One of these elements is computational biology, an area whose importance will continue to increase as more sequence data become available, as data sets continue to grow, and as both the data and the kinds of questions being addressed become more complex. The focus on computational biology (or, as it is more often called, bioinformatics) underscores that both laboratory-based and computational approaches will be necessary to do cutting-edge research in the future. Just as investigators are trained in basic biochemistry and molecular biology techniques, they will need a basic understanding of bioinformatic techniques as an indispensable part of their arsenal.
The database that most biologists are familiar with is GenBank (2), an annotated collection of all publicly available DNA and protein sequences maintained by the National Center for Biotechnology Information (NCBI) at the National Institutes of Health. At the time of this writing, GenBank contained 35.6 billion nucleotide bases, representing 29.8 million sequences in more than 119,000 species (2). Whereas the inherent value of these data cannot be overstated, the sheer magnitude of data presents a conundrum to the inexperienced user, not just because of the size of the “sequence information space,” but because the information space continues to grow at an exponential pace. GenBank’s size doubles once every 12 to 14 months; this translates to 45 new sequences being deposited every minute and 7 new structures becoming available every day. This exponential growth rate is expected to continue well into the future, particularly because of the September 2002 announcement of “high priority” model organisms earmarked for sequencing. The continued, rapid rise in available sequence information underscores the necessity for all biologists to learn how to make their way effectively through this sequence space. GenBank (or any other biological database, for that matter) serves little purpose unless the data can be easily searched and entries retrieved in a usable, meaningful format. Otherwise, sequencing efforts will have had no useful end, because the biological community as a whole would not be able to make use of the information hidden within these millions of bases and amino acids. Much effort has gone into making these data available to the biological community, and several of the most highly used interfaces resulting from these efforts are the focus of this review.
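The growth figures quoted above can be checked with a little arithmetic. The following Python sketch assumes idealized exponential growth with a fixed doubling time, which is a simplification; the function names are illustrative only and are not part of any GenBank or NCBI tool.

```python
def annual_growth_factor(doubling_months: float) -> float:
    """Factor by which a database grows in 12 months, given its doubling time."""
    return 2 ** (12 / doubling_months)


def projected_size(current: float, months_ahead: float, doubling_months: float) -> float:
    """Projected size after `months_ahead` months of steady exponential growth."""
    return current * 2 ** (months_ahead / doubling_months)


def deposits_per_year(per_minute: float) -> float:
    """Convert a per-minute deposition rate into a per-year count (365-day year)."""
    return per_minute * 60 * 24 * 365


if __name__ == "__main__":
    bases_now = 35.6e9  # nucleotide bases in GenBank, as quoted in the text
    for doubling in (12, 14):  # the 12- to 14-month doubling time quoted in the text
        factor = annual_growth_factor(doubling)
        in_two_years = projected_size(bases_now, 24, doubling)
        print(f"doubling every {doubling} months: x{factor:.2f} per year, "
              f"{in_two_years / 1e9:.1f} billion bases after 24 months")
    print(f"45 sequences per minute is {deposits_per_year(45):,.0f} sequences per year")
```

Under these assumptions, a 14-month doubling time corresponds to roughly 1.8-fold growth per year, and 45 sequences per minute amounts to about 23.7 million sequences per year.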
The next 3 alphabet blocks would take the user to actual sequence information for that gene. The R stands for RefSeq, and clicking on the R would take the user to the “reference sequence” for that entry. The RefSeq project at NCBI is geared toward reducing redundancy in the public databases, with the goal of representing each molecule in the central dogma (DNA, mRNA, or protein) by 1 and only 1 sequence. Often, a query returns a long list of sequences, all representing the same biological entity, and it is unclear which entry should be used; by using the curated RefSeq entry, users can be assured that they are using the most accurate sequence information available. The G and P are for GenBank and Protein, respectively, and will return all nucleotide and protein entries available for the gene of interest.
Table 1. Web-based resources for sequence analysis
Major Public Sequence Databases
DNA Databank of Japan (DDBJ)
Gene-Centric Information Retrieval
NCBI Map Viewer
UCSC Genome Browser
A User’s Guide to the Human Genome
Current Topics in Genome Analysis
LocusLink obviously provides a very easy-to-use, gene-centric view of the “sequence information space,” but what if a scientist is more interested in seeing the gene of interest in context, particularly now that human genome sequencing is complete? A number of portals called genome browsers have been developed that allow users to access genomic data and, more importantly, view annotations that have been made on the underlying sequence data.
Three maps are shown in the main window, to the left, as long, vertical bars. The map marked Genes_cyto (for “genes-cytogenetic”) shows the cytogenetic locations of genes as reported in LocusLink. Twenty genes, in addition to MLH1, have been cytogenetically mapped to this region of chromosome 3. The next map, marked HsUniG (for “Human UniGene”) shows the positions of UniGene clusters (described above); put another way, mRNA and EST sequences that comprise a UniGene cluster map to this region. On the left side of this particular map are gray bars that form what appears to be a histogram. These bars are intended to illustrate the density of aligned mRNAs and ESTs in this region.
The thick blue lines to the right of this map are intended to illustrate exons. The final, right-most map is labeled Genes_seq (for “gene sequence”). The map occupying the right-most position in any view is called the “master map,” and the information appearing to the right of all the maps pertains to that master map. Three genes are plotted on the master map in this particular view: an EPM2A-interacting protein 1, then the MLH1 gene, which was the basis of the query (highlighted in red), and finally a leucine-rich repeat interacting protein 2 (LRRFIP2). For each gene, an indication of the gene’s structure is given by the blue line running along the right side of the map, with exons being represented as thick blue bars and introns being represented as the thinner, intervening blue lines. Finally, note the arrow immediately to the right of each gene name; this arrow represents the direction of transcription for each gene.
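The exon and intron display described above can be thought of as a small data structure: a gene has a direction of transcription and a list of exons, and the introns are simply the gaps between consecutive exons. The Python sketch below illustrates this idea only; the `GeneModel` class and the coordinates are hypothetical and do not reflect how Map Viewer stores its data or the real structure of MLH1.

```python
from dataclasses import dataclass
from typing import List, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive (start, end) on the chromosome


@dataclass
class GeneModel:
    name: str
    strand: str            # "+" or "-": the direction of transcription
    exons: List[Interval]  # drawn as thick bars in the browser

    def introns(self) -> List[Interval]:
        """Return the gaps between consecutive exons (the thin intervening lines)."""
        ordered = sorted(self.exons)
        return [
            (prev_end + 1, next_start - 1)
            for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
        ]


if __name__ == "__main__":
    # Made-up coordinates, for illustration only.
    gene = GeneModel("demo_gene", "+", [(100, 200), (300, 400), (500, 600)])
    print(gene.name, gene.strand, "introns:", gene.introns())
```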
The master map (the right-most map) is now the Variation map, giving a different display than before. As with the UniGene map in Figure 10, the gray bars shown to the left of the Variation map indicate the density of SNPs at any given position. Some positions are simply marked with the number of variations (for example, “11 variations”), indicating that the map is too dense to display information on each individual SNP; simply zoom in to get more information at those positions (see below). In this view, numerous SNPs can be seen, each marked with an “rs” number. Clicking on that rs number would bring the user to the dbSNP page for that particular SNP, which is similar in appearance (but not identical) to the Variation page shown in the LocusLink example above (see Figure 5). Moving across from the rs number is a series of columns of interest. The column labeled Map indicates whether a particular SNP has been mapped to the genome. If the SNP has been mapped to a single position, a single green down-arrow would be shown (as in Figure 11); if the SNP has been mapped to multiple positions, a double down-arrow would be shown. The column labeled Gene indicates whether the SNP of interest is associated with a particular genomic feature. In each row of the Gene column, notice that there is an L, T, and C either “lit up” or “grayed out.” If the L (“locus,” blue) is lit, as it is for most of the SNPs in Figure 11, that indicates that the SNP lies within 2 kb of the 5′ end of a gene or within 500 bases of the 3′ end of a gene. If the T (“transcript,” green) is lit, the marker overlaps with a known mRNA. Finally, if the C (“coding,” orange) is lit, part or all of the SNP marker position overlaps with the coding region of a gene. The columns that follow provide additional information about the quality of the SNP marker, and more information on each of these can be found by clicking on the blue column headers.
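The L, T, and C rules described above amount to three interval tests. The following Python sketch shows one way to express them; it is not dbSNP's or Map Viewer's actual implementation. It assumes that a SNP is a single position, that the "locus" window runs from 2 kb upstream of the gene's 5′ end through 500 bases downstream of its 3′ end (gene body included), and that upstream and downstream are determined by the strand. The `Gene` class, the `snp_flags` function, and all coordinates are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive (start, end) on the chromosome


@dataclass
class Gene:
    start: int                  # leftmost genomic coordinate of the gene
    end: int                    # rightmost genomic coordinate of the gene
    strand: str                 # "+" or "-"
    mrna_exons: List[Interval]  # regions covered by a known mRNA
    cds: List[Interval]         # coding portions of those exons


def _overlaps(pos: int, intervals: List[Interval]) -> bool:
    return any(lo <= pos <= hi for lo, hi in intervals)


def snp_flags(pos: int, gene: Gene, upstream: int = 2000, downstream: int = 500) -> Dict[str, bool]:
    """Return which of the L, T, and C indicators would be lit for a SNP at `pos`."""
    if gene.strand == "+":
        # 5' end is at gene.start, 3' end is at gene.end
        locus = (gene.start - upstream, gene.end + downstream)
    else:
        # on the minus strand the 5' end is at gene.end
        locus = (gene.start - downstream, gene.end + upstream)
    return {
        "L": locus[0] <= pos <= locus[1],     # locus: near or within the gene
        "T": _overlaps(pos, gene.mrna_exons), # transcript: overlaps a known mRNA
        "C": _overlaps(pos, gene.cds),        # coding: overlaps the coding region
    }


if __name__ == "__main__":
    # Made-up gene model, for illustration only.
    gene = Gene(10000, 20000, "+",
                mrna_exons=[(10000, 10500), (15000, 15400), (19000, 20000)],
                cds=[(10200, 10500), (15000, 15400), (19000, 19300)])
    for pos in (8500, 10100, 15100, 20600):
        print(pos, snp_flags(pos, gene))
```

In this sketch a SNP in an untranslated region lights L and T but not C, whereas a SNP in an intron lights only L.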
In addition to changing the maps shown in any given view, the user can navigate by clicking anywhere on the ideogram on the left, or zoom in and out by clicking on the “out-zoom-in” picture above the ideogram. There are also short, gray bars at the top and bottom of each map that allow the user to “scroll up” or “scroll down,” moving to the next genomic segment.
While space precludes an in-depth treatment of any of the 3 major genome browsers, a number of useful guides and papers have been published to help biologists make intelligent use of these powerful tools. Recently, the National Human Genome Research Institute published A User’s Guide to the Human Genome (12). The majority of the guide is devoted to a series of worked examples, providing an overview of the types of data available, details on how these data can be browsed, and step-by-step instructions and strategies for using many of the most commonly used tools for sequence-based discovery. In addition, each browser’s Web site provides instructional information intended to assist the novice user in using the browsers to their best advantage. The URLs for these resources are given in Table 1.
Obviously, the range of publicly available data goes well beyond just the types of data discussed in this review. Because major public sequence databases like GenBank need to be able to store data in a generalized fashion, they often do not contain more specialized types of information that would be of interest to specific groups within the biological community. Many smaller, specialized databases have emerged to fill this gap, often developed and curated by biologists “in the trenches” to address the needs of their fellow investigators. These databases, which contain information ranging from strain crosses to gene expression data, provide a valuable supplement to the major sequence repositories, and the reader is encouraged to make intelligent use of both types of databases in their searches. An annotated list of such databases can be found in the yearly database issue of Nucleic Acids Research (13).
As is undoubtedly apparent by this point, there is no substitute for actually placing one’s hands on the keyboard to learn how to effectively search and use genomic sequence data. Readers are strongly encouraged to take advantage of the resources presented in this review, grow in confidence and capability by working with the available tools, and begin to apply bioinformatic methods and strategies toward advancing their own research interests.
- Collins FS, Green ED, Guttmacher AE, Guyer MS. (2003) A vision for the future of genomics research. Nature 422:835–47.
- Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL. (2003) GenBank. Nucleic Acids Res. 31:23–7.
- Baxevanis AD. Information retrieval from biological databases. In: Bioinformatics: a practical guide to the analysis of genes and proteins. 2nd edition. Baxevanis AD and Ouellette BFF (eds.) John Wiley and Sons, New York, pp. 155–85.
- Hamosh A et al. (2002) Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res. 30:52–5.
- Wolfsberg TG, Landsman D. Expressed sequence tags. In: Bioinformatics: a practical guide to the analysis of genes and proteins. 2nd edition. Baxevanis AD and Ouellette BFF (eds.) John Wiley and Sons, New York, pp. 283–302.
- Velculescu VE, Vogelstein B, Kinzler KW. (2000) Analyzing uncharted transcriptomes with SAGE. Trends Genet. 16:423–5.
- Blake JA et al. (2003) MGD: the Mouse Genome Database. Nucleic Acids Res. 31:193–5.
- Sprague J et al. (2003) The Zebrafish Information Network (ZFIN): the zebrafish model organism database. Nucleic Acids Res. 31:241–3.
- Yeh RF, Lim LP, Burge CB. (2001) Computational inference of homologous gene structures in the human genome. Genome Res. 11:803–16.
- Karolchik D et al. (2003) The UCSC Genome Browser database. Nucleic Acids Res. 31:51–4.
- Clamp M et al. (2003) Ensembl 2002: accommodating comparative genomics. Nucleic Acids Res. 31:38–42.
- Wolfsberg TG, Wetterstrand KA, Guyer MS, Collins FS, Baxevanis AD. (2002) A user’s guide to the human genome. Nat. Genet., vol. 32 supplement.
- Baxevanis AD. (2003) The Molecular Biology Database Collection: 2003 update. Nucleic Acids Res. 31:1–12.