Bioinformatics

Computers and the internet have revolutionized everything from agriculture and architecture to research, and biological research is no exception. Biology is the science of studying living beings. Bioinformatics is the use of techniques from applied mathematics, informatics, statistics, and computer science to solve biological problems. It is the science of using information to understand biology.

Long before the invention of the word bioinformatics, researchers tried to use computers to assist in their research. These researchers identified three concepts which are still fundamental to bioinformatics today:

  • data representation
  • the concept of similarity
  • bioinformatics as a data-driven, rather than a theoretical, science

To make it possible for a computer to work on a problem, the problem must first be abstracted into a format the computer can process. This often requires simplifying and encoding the problem. Computers can be very good at detecting similarity, and similarity allows us to infer that two seemingly different entities share a certain property. Bioinformatics is a data-driven science, meaning that we require lots of data. Fortunately, the biggest problem is not a lack of data but the quality of the data, its meaningful classification, and our insufficient capacity to interpret it.

Biological Databases

The invention of various techniques and instruments for analyzing living beings at the molecular level has led to an explosion of scientific data generated by the scientific community. This data cannot be stored on paper. It must be stored, organized, and indexed in an electronic database. In addition, we need tools to view, verify, and analyze this data and to interface it with other databases.

An electronic biological database is a large, organized body of persistent data that can be queried to add, update, extract, and remove data. Biological databases have to respond to the needs of their various users. The same biological data often means very different things to different researchers. For example, a physicist, a biochemist, and a biologist sitting in the same room would be interested in different aspects of the same protein. They might even use different nomenclature to refer to the same protein. Even two biologists might look at the protein from different perspectives.

Biological data is often highly connected, and these connections are essential for comprehension and discovery. A nucleotide sequence is linked to the protein it codes for. Nucleotide sequences are grouped into genes. A gene may code for one protein, several proteins, or none at all. A protein might have different names in different species. A protein belongs to a protein family and must be linked to its evolutionary relatives. We would also like links to scientific publications related to our protein, to the methods and instruments used for its discovery, and even to the parameters of the instruments used. Researchers frequently repeat experiments conducted by others to verify and improve their processes.

Why do we need biological databases? Back in the 1970s, researchers referred to the "Atlas of Protein Sequence and Structure" by Margaret Dayhoff to find information on their protein of interest. Since then, biological data has exploded to the point that we can no longer imagine publishing it all on paper. One of the earliest electronic databases was PIR, which was essentially run by a group of researchers. This was a significant improvement, since it offered the ability to add, update, delete, and, most importantly, search the data in a much more efficient manner. Today PIR is no longer in service as an independent database; the site is still live, but it serves only as an archive. It could not cope with growing demands, while databases such as Swiss-Prot were built to do so.

Today, biology is a data-rich science in which each experiment generates enormous amounts of data. We can no longer analyze all this data by eye. We need powerful data-analysis tools to help us interpret this data and understand its significance. Biological databases offer data storage facilities and various tools that help us understand and analyze the data.

Bioinformatics Research Centers

Several research centers are dedicated to bioinformatics research. The following are the most significant.

Nucleotide Sequence Databases

EMBL Nucleotide Sequence Database

The EMBL Nucleotide Sequence Database constitutes Europe's primary nucleotide sequence resource. The main sources of DNA and RNA sequences are direct submissions from individual researchers, genome sequencing projects, and patent applications.

NCBI - National Center for Biotechnology Information

The database is produced in an international collaboration with GenBank (USA) and the DNA Data Bank of Japan (DDBJ). Each of the three groups collects a portion of the total sequence data reported worldwide, and all new and updated database entries are exchanged between the groups on a daily basis.

Established in 1988 as a national resource for molecular biology information, NCBI creates public databases, conducts research in computational biology, develops software tools for analyzing genome data, and disseminates biomedical information - all for the better understanding of molecular processes affecting human health and disease.

DDBJ - DNA Data Bank of Japan

DDBJ (DNA Data Bank of Japan) began DNA data bank activities in earnest in 1986 at the National Institute of Genetics (NIG). DDBJ has functioned as an international nucleotide sequence database in collaboration with EBI/EMBL and NCBI/GenBank. DNA sequence records organismic evolution more directly than other biological materials and is thus invaluable not only for research in the life sciences, but also for human welfare in general. The databases are, so to speak, a common treasure of humankind.

UniGene

Each UniGene entry is a set of transcript sequences that appear to come from the same transcription locus (gene or expressed pseudogene), together with information on protein similarities, gene expression, cDNA clone reagents, and genomic location.

Nucleotide Sequence Databases

Each database is different; however, a nucleotide sequence entry is expected to contain at least the following:

  • id and/or accession number
  • taxonomic data
  • references
  • annotation/curation
  • keywords
  • cross references
  • sequences
  • documentation

Annotation refers to adding extra information to a record in a database. Curation refers to evaluating what is fit to go into the database and what is not.

First Generation Nucleotide Sequence Databases

The first-generation nucleotide sequence databases are essentially sequence archives. The data is present in the database as it was determined and interpreted by its publisher. The original author retains full control of the information they submitted. As one can imagine, this results in a multitude of problems, such as:

  • data of varying quality and lengths
  • highly redundant data
  • errors in sequence, annotations, etc.
  • lack of consistency

Second Generation Nucleotide Sequence Databases

The second-generation nucleotide sequence databases were built with an eye on lessons learned from the first generation. The goal is to have one sequence entry for every naturally occurring molecule. In RefSeq, a second-generation database, chromosome, gene, mRNA, and protein data are curated. Other data, such as contigs, model mRNAs, and model proteins, is computed. A gene can result in multiple products; in such a case, a separate RefSeq id is used for each product and all are linked by a Locus Id. Second-generation nucleotide sequence databases are essentially gene-centric databases.
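The curated/computed split is visible directly in RefSeq accession prefixes. As a rough sketch (the prefix meanings follow NCBI's documented conventions, but this table is a small subset, not the full list):

```python
# Sketch: classify RefSeq accessions by their prefix.
# Prefix meanings per NCBI conventions (subset):
# NC_ = complete genomic molecule (curated), NM_/NR_ = curated mRNA/ncRNA,
# NP_ = curated protein, NT_/NW_ = assembled contigs (computed),
# XM_/XR_/XP_ = model mRNA/ncRNA/protein from automated pipelines.

REFSEQ_PREFIXES = {
    "NC": ("genomic", "curated"),
    "NM": ("mRNA", "curated"),
    "NR": ("ncRNA", "curated"),
    "NP": ("protein", "curated"),
    "NT": ("contig", "computed"),
    "NW": ("contig", "computed"),
    "XM": ("model mRNA", "computed"),
    "XR": ("model ncRNA", "computed"),
    "XP": ("model protein", "computed"),
}

def classify_refseq(accession):
    """Return (molecule type, curation status) for a RefSeq accession."""
    prefix = accession.split("_")[0]
    return REFSEQ_PREFIXES.get(prefix, ("unknown", "unknown"))
```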

Gene-Centric Databases

In a gene-centric database, all information relevant to a given gene is made accessible at once. Entrez Gene and RefSeq are the most commonly used; Entrez Gene is tightly linked to RefSeq. RefSeq, the Reference Sequence collection, aims to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript RNA, and protein products.

Gene-centric databases contain gene-specific information, focusing on genomes that have been completely sequenced, that have an active research community contributing gene-specific information, or that are scheduled for intense analysis. The content of Entrez Gene represents the result of curation and automated integration of data from NCBI's RefSeq and other collaborating databases.

Genome-Centric Databases

Genome-centric databases contain information about gene sequence, relative position, strand orientation, biochemical function, etc. Ensembl and TIGR are information management systems that connect specialized sequence collections and browsing tools.

GenBank: case study

GenBank is a comprehensive public database of nucleotide sequences built and distributed by the NCBI. It is primarily built from sequence data submitted by individual authors and from bulk submissions of ESTs, GSSs, and other high-throughput data from sequencing centers.

EST: Expressed Sequence Tag, produced by one-shot sequencing of a cloned cDNA. GSS: Genome Survey Sequence, similar to an EST except that most of the sequences are genomic in origin.

GenBank doubles in size roughly every 18 months. WGS and environmental sequences now occupy a significant portion of the database.

WGS: Whole Genome Shotgun sequences are the contigs of a sequencing project. WGS data can contain annotation and is updated as sequencing progresses. Contig: a DNA sequence assembled from overlapping DNA fragments of 100-300 base pairs. Environmental sequences: all DNA sequences present in a sample. The sample often contains many different organisms, which are very often unknown and unidentified.

Each GenBank entry includes a concise description of:

  • sequence
  • scientific name and taxonomy of the source organism
  • bibliographic references
  • listing of areas of biological significance such as coding regions and their protein translations, transcription units, repeat regions and sites of mutations or modifications.
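These fields appear in a GenBank entry as top-level keywords (LOCUS, DEFINITION, ACCESSION, ...) followed by indented continuation lines, which makes them easy to pick out mechanically. A minimal sketch (the record below is a toy example, not a real entry; serious work should use a full parser such as Biopython's SeqIO):

```python
def parse_flat_record(text):
    """Tiny sketch of reading top-level keywords from a GenBank-style
    flat-file record. Lines starting with a letter begin a new
    keyword; indented lines continue the current one. Illustrative
    only; it ignores FEATURES sub-structure and the sequence block."""
    fields = {}
    key = None
    for line in text.splitlines():
        if line[:1].isalpha():           # a new top-level keyword
            key, _, value = line.partition(" ")
            fields[key] = value.strip()
        elif key and line.strip():       # indented continuation line
            fields[key] += " " + line.strip()
    return fields

# Hypothetical record, for illustration only.
record = """LOCUS       AB000001     1200 bp    DNA     linear   BCT
DEFINITION  hypothetical example record for
            illustration only.
ACCESSION   AB000001
"""
fields = parse_flat_record(record)
```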

GenBank partitions sequence into divisions that roughly correspond to:

  • taxonomic groups such as bacteria (BCT), viruses (VRL), and rodents (ROD).
  • sequencing strategies such as EST, GSS, HTG, HTC and environmental sample (ENV) sequences

HTC: High-throughput cDNA. HTG: High-throughput genomic sequences (single-pass, unfinished genomic sequences).

EST and HTC are RNA or cDNA. GSS, HTG, WGS, and ENV are DNA.

The data in GenBank, and the collaborating databases EMBL and DDBJ, are submitted primarily by individual authors to one of the three databases, or by sequencing centers as batches of EST, STS, GSS, HTC, WGS or HTG sequences. Data are exchanged daily with DDBJ and EMBL so that the daily updates from NCBI servers incorporate the most recently available sequence data from all sources. Virtually all records enter GenBank as direct electronic submissions.

EMBL, GenBank, DDBJ, and Swiss-Prot use both identifiers and accession numbers to identify each entry. To make things more complicated, identifiers and accession numbers mean different things in different databases. In Swiss-Prot, identifiers are alphanumeric terms that are meaningful to a human being. For example, HBA_HUMAN refers to the human haemoglobin alpha chain. Identifiers can change, but they rarely do. The accession number of HBA_HUMAN is P69905. Accession numbers are primary keys, so they never change. If two entries are merged, the new entry keeps both accession numbers: one becomes the primary key and the other the secondary key. When an entry is split, new accession numbers are assigned to each resulting entry and the old accession number is noted as the secondary key.
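The merge rule described above (old accessions stay resolvable as secondary keys) can be sketched as follows; the accession numbers here are made up for illustration:

```python
# Sketch of accession-number handling on a merge: one accession stays
# primary, the other becomes secondary, so old accessions still resolve.

class Entry:
    def __init__(self, primary, secondaries=None):
        self.primary = primary
        self.secondaries = list(secondaries or [])

def merge(a, b):
    """Merge entry b into entry a; b's accessions become secondary."""
    return Entry(a.primary, a.secondaries + [b.primary] + b.secondaries)

def resolve(entries, accession):
    """Find an entry by primary or secondary accession number."""
    for e in entries:
        if accession == e.primary or accession in e.secondaries:
            return e
    return None

# Hypothetical accessions, for illustration only.
merged = merge(Entry("P12345"), Entry("Q99999"))
```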

GenBank data can be retrieved with Entrez. Entrez covers over 30 biological databases containing DNA and protein sequence data, genome mapping data, population sets, phylogenetic sets, environmental sample sets, gene expression data, the NCBI taxonomy, protein domain information, protein structures from the Molecular Modeling Database (MMDB), and MEDLINE references via PubMed. Entrez is a very good system to use since it returns much more information than is available in GenBank alone.

Biological databases often come with useful tools. BLAST is a very powerful tool for sequence-similarity searches.

GenBank database can be downloaded by ftp at

This page is a brief summary of descriptions of Swiss-Prot, GenBank, and EMBL available on their websites.

Protein Sequences

There are two major protein sequence resources:

  • UniProt = Swiss-Prot + TrEMBL + PIR
  • NCBI-nr = Swiss-Prot + GenPept + PIR + RefSeq + PDB + PRF

In addition, there are several different specialized protein databases.


UniProt is a central resource for protein sequence and function. The UniProt consortium (since 2003) consists of EMBL, SIB, and PIR. PIR is no longer being updated; it now functions only as an archive. UniProt itself is divided into several components.


UniProtKB/TrEMBL contains computer-annotated protein sequences. TrEMBL entries are produced by automatically translating the coding sequences (CDS) in EMBL. In addition, it includes data from PIR. TrEMBL suffers from poorly annotated CDS submissions.

TrEMBL is a platform for the improvement of automated annotation tools. A TrEMBL entry is created after applying many annotation tools such as SignalP, TMHMM, REP, etc. Then evidence tags are added to any part of a TrEMBL entry not derived from the original EMBL entry.


UniProtKB/Swiss-Prot contains manually annotated protein sequences. Swiss-Prot entries are produced by manually annotating TrEMBL entries. Before a Swiss-Prot entry is created, the sequence is checked and analyzed, and the data is cross-checked against the literature and external scientific expertise. Once an entry is moved to Swiss-Prot, it is deleted from TrEMBL; data in Swiss-Prot does not migrate back to TrEMBL. Together, Swiss-Prot and TrEMBL cover all known protein sequences in the public domain.

The goals of Swiss-Prot are:

  • Non-redundant: one entry - one gene - one species
  • Maximum manual annotation: maximum annotation of protein diversity
  • Maximum links to other databases

A Swiss-Prot Entry contains:

  • ID and accession number
  • names and taxonomy
  • references
  • comments
  • cross-references
  • keywords
  • features
  • sequence


A UniRef100 entry contains all identical sequences, including fragments. A UniRef90 entry contains sequences that share at least 90% identity. A UniRef50 entry contains sequences that share at least 50% identity.
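The idea of identity-threshold clusters can be sketched with a naive greedy procedure. This is not how UniRef is actually built (the real pipelines use dedicated clustering tools), and the identity function below is a crude stand-in for alignment-based identity:

```python
def identity(a, b):
    """Fraction of identical positions over the shorter sequence.
    A crude stand-in for a real alignment-based identity measure."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n

def cluster(seqs, threshold):
    """Greedy clustering sketch: each sequence joins the first cluster
    whose representative it matches at >= threshold identity,
    otherwise it founds a new cluster."""
    clusters = []  # each cluster: [representative, members...]
    for s in seqs:
        for c in clusters:
            if identity(s, c[0]) >= threshold:
                c.append(s)
                break
        else:
            clusters.append([s])
    return clusters

# Toy sequences, for illustration only.
seqs = ["MKTAYIAK", "MKTAYIAR", "GGGGGGGG"]
```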


UniParc is an archive of raw protein sequences.

Sequences and information in UniProt are accessible via text search, BLAST similarity search, and FTP.

Pairwise Alignment

The most basic sequence analysis is to ask whether two sequences are related. This involves aligning the two sequences and then deciding whether they are related or whether the similarity is merely due to chance. The key issues to consider are:

  1. what sorts of alignments should be considered
  2. the scoring system used to rank alignments
  3. the algorithm used to find optimal (or good) scoring alignments
  4. the statistical methods used to evaluate the significance

Why are we interested in knowing the degree of similarity between two sequences?

Two similar sequences are probably biologically related. Very often, similar sequences have similar 3D structures. This is important since the 3D structure of a protein defines its function. In addition, similar sequences can come from two species that share a common ancestor, indicating an evolutionary relationship. In other words, residues occupying similar positions may have similar functional roles. Evolution tends to conserve the more efficient functional units; therefore, sequences which code for important proteins are conserved among organisms in nature.

In the absence of an understanding of the underlying biological mechanisms, it is indispensable to compare a new, unknown sequence against sequences that are already well characterized. The discovery of efficient and reliable algorithms is therefore becoming more and more important as the number of sequences grows exponentially.

Similar, Identical, Homologous

Understanding the difference between similar and identical is crucial for sequence alignment. An identical pair is a pair of the same two amino acids. A similar pair is a pair of amino acids that can be considered chemically similar at that position. Two amino acids are considered similar if one can be substituted for the other with a positive log-odds score from a scoring matrix.


For example, in an alignment, G-G, V-V, and K-K are identical pairs, while S-N and Q-K are similar pairs.
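A tiny illustration of this classification, using a hand-picked subset of BLOSUM62 log-odds scores (a real application would load the full matrix):

```python
# Toy "identical vs similar" classifier. The scores below are a small
# subset in the spirit of BLOSUM62 (S/N and Q/K score positively there);
# pairs not listed default to a negative score.

SCORES = {
    ("G", "G"): 6, ("V", "V"): 4, ("K", "K"): 5,
    ("S", "N"): 1, ("Q", "K"): 1, ("L", "D"): -4,
}

def classify_pair(a, b):
    """Identical if the residues match; similar if their substitution
    score is positive; dissimilar otherwise."""
    if a == b:
        return "identical"
    score = SCORES.get((a, b), SCORES.get((b, a), -1))
    return "similar" if score > 0 else "dissimilar"
```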

Similarity can often be misleading. It can reveal evolutionarily related sequences or it can align two sequences with completely different function and structure. The challenge is to differentiate between the former and the latter.

Alignment simply refers to placing one symbol against another. It does not involve judging the quality of the alignment. Sequence identity refers to the occurrence of exactly the same nucleic acid or amino acid in the same position in two aligned sequences. Sequence similarity is meaningful only when possible substitutions are scored according to the probability with which they occur. Sequence homology indicates evolutionary relatedness among sequences. Two sequences are said to be homologous if they are both derived from a common ancestral sequence. Similarity refers to the presence of identical and similar sites in two sequences, while homology reflects a stronger claim that the two sequences share a common ancestor.

Similarity is not defined in a unique and exact manner; it is a mix of biological knowledge with mathematical and heuristic concepts. Sequence similarity is not about comparing two texts to decide whether they are similar or different: a sequence-similarity measure must be able to tolerate gaps and substitutions. This is an optimization problem which can be formulated as a dynamic programming problem. The idea is to give a score to each pair of residues using a substitution matrix, and then search for the insertions and deletions that maximize the global score. In addition, the degree of similarity must be validated both biologically and statistically; it is important to be able to distinguish between accidental similarity and similarity based on biological factors.
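The dynamic-programming formulation can be sketched as a minimal Needleman-Wunsch global alignment scorer. For brevity it uses flat match/mismatch scores and a linear gap penalty; real tools use a substitution matrix (e.g. BLOSUM) and affine gap penalties:

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Minimal Needleman-Wunsch: returns the optimal global alignment
    score (no traceback). F[i][j] holds the best score for aligning
    a[:i] with b[:j]."""
    m, n = len(a), len(b)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = i * gap              # a[:i] aligned against gaps only
    for j in range(1, n + 1):
        F[0][j] = j * gap              # b[:j] aligned against gaps only
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # substitution
                          F[i - 1][j] + gap,     # gap in b
                          F[i][j - 1] + gap)     # gap in a
    return F[m][n]
```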


An alignment consists of writing the two sequences one above the other and inserting gap characters such that the two sequences have the same length. Any arrangement is permitted as long as the order of the symbols within each sequence is not modified. There is no quality evaluation in the alignment step.

Note: Parts of this post are a summary of Durbin.

Multiple Sequence Alignment

Multiple sequence alignment techniques are most commonly applied to protein sequences; ideally they are a statement of both evolutionary and structural similarity among the proteins encoded by each sequence in the alignment.

Multiple alignments must usually be inferred from primary sequences alone. Biologists produce high-quality multiple sequence alignments by hand using expert knowledge of protein sequence evolution. This knowledge comes from experience. Important factors include:

  • specific sorts of columns in alignments, such as highly conserved residues or buried hydrophobic residues
  • the influence of secondary and tertiary structure, such as the alternation of hydrophobic and hydrophilic columns in exposed beta sheet
  • expected patterns of insertions and deletions, which tend to alternate with blocks of conserved sequence

The phylogenetic relationships between sequences dictate constraints on the changes that occur in columns and in the patterns of gaps.

Manual alignment is tedious, but automating it is not straightforward: it is hard to define exactly what an optimal multiple sequence alignment is, and impossible to set a standard for a single correct multiple alignment. In theory, there is one underlying evolutionary process and one evolutionarily correct alignment for any group of sequences. However, the differences between sequences can be so great in parts of an alignment that there isn't an apparent, unique solution to be found by an alignment algorithm. Those same divergent regions are often structurally unalignable as well. Most of the insight that we derive from multiple alignments comes from analyzing the regions of similarity, not from attempting to align highly diverged regions.

In general, an automatic method must have a way to assign a score so that better multiple alignments get better scores. We should carefully distinguish the problem of scoring a multiple alignment from the problem of searching over possible multiple alignments to find the best one. Descriptions of multiple alignment programs tend to emphasize the alignment algorithm rather than the scoring function. However, the scoring function is our primary concern in probabilistic modeling. We wish to incorporate an expert’s evaluation criteria into our scoring procedure.
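One simple way to turn pairwise scores into a multiple-alignment score is the sum-of-pairs (SP) scheme: score every pair of residues in each column and sum over all columns. A sketch with an arbitrary toy scoring function (real SP scoring would use a substitution matrix and proper gap penalties):

```python
def sum_of_pairs(column, score):
    """Score one alignment column as the sum over all residue pairs,
    using any pairwise scoring function score(a, b)."""
    total = 0
    for i in range(len(column)):
        for j in range(i + 1, len(column)):
            total += score(column[i], column[j])
    return total

def sp_alignment_score(rows, score):
    """Sum the SP score over all columns of an alignment given as
    equal-length strings."""
    return sum(sum_of_pairs(col, score) for col in zip(*rows))

def toy(a, b):
    """Toy pairwise score: +1 for identical non-gap residues, -1
    otherwise (including any pair involving a gap)."""
    return 1 if a == b and a != "-" else -1
```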

To automate multiple alignment, we need to do the following:

  • look at what is needed for automatic multiple alignment, both structurally and evolutionarily
  • consider how to turn the biological criteria into a numerical scoring scheme, so that a program will recognize a good multiple alignment
  • examine the various approaches taken by different multiple alignment programs
  • describe a fully probabilistic multiple alignment approach based on profile HMMs

Note: This post is a summary of Durbin.

Biological Data

Biology is now a data-intensive science, and fortunately most of the data is available freely over the Internet. Before beginning, one needs to know what kind of data is available, where, in what format, and how it can be accessed. Most databases provide very useful and powerful tools to help users access, manipulate, and analyze the data. Knowing and using these tools helps the user avoid a lot of unnecessary work.

Non-coding DNA

A remarkable variability exists in genome size among eukaryotes that has little correlation with organismal complexity, size, or number of coding genes. Even a unicellular organism can have a larger genome than a mammal! This striking disparity is due to non-coding DNA.

Non-coding DNA is DNA that does not contain instructions for making cell products. It constitutes a large portion of the genome of eukaryotes. Some of this non-coding DNA is involved in regulating the coding regions of the DNA; the functions of the remaining non-coding DNA are still unknown.

The genome contains several types of non-coding regions (regions not coding for proteins). Non-coding regions fall into three categories:

  • genic DNA,
  • genic DNA coding for ncRNA, and
  • intergenic DNA

Genic DNA is involved directly in gene expression. UTR regions (untranslated regions of mRNA), and introns are genic DNA.

The intergenic region contains mostly repetitive sequence. Functional regions, which constitute about 15% of intergenic regions, contain SARs (scaffold attachment regions), telomeres, and centromeres. The functions of the remaining 85% are unknown.

An SAR (scaffold attachment region) is an AT-rich segment of a eukaryotic genome that acts as an attachment point to the nuclear matrix. The nuclear matrix is a proteinaceous, scaffold-like network that permeates the nucleus.

A telomere is a region of highly repetitive DNA at the end of a chromosome that functions as a disposable buffer. Every time linear eukaryotic chromosomes are replicated, the DNA polymerase complex is incapable of replicating all the way to the end of the chromosome; if it were not for telomeres, this would quickly result in the loss of useful genetic information.

The centromere is the site where spindle fibers of the mitotic spindle attach to the chromosome during mitosis. In most eukaryotes, the centromere has no defined DNA sequence. It typically consists of large arrays of repetitive DNA where the sequence within individual repeat elements is similar but not identical.

Repetitive DNA sequence classes

Much of this variation in genome size is due to non-coding, tandemly repeated DNA. A substantial fraction of the eukaryote genomes is often composed of repetitive DNA.

1. Simple Repeats

Simple repeats are duplications of simple sets of DNA bases, typically 1-5 bp. CpG dinucleotides are among the most important simple repeats. A CpG island is a short stretch of DNA in which the frequency of the CG dinucleotide is higher than in other regions. The "p" simply indicates that the C and G are connected by a phosphodiester bond. To be classified as a CpG island, a sequence must be at least 200 bases long.

DNA methylation occurs at CG-rich sites. Over evolutionary time, methylated cytosines may be converted to thymine by deamination (CpG -> TpG); methylated (inactive) regions are thus poor in CpG. CpG islands are unmethylated regions of the genome that are associated with the 5' ends of genes that are frequently switched on. Often a CpG island overlaps the promoter and extends about 1000 base pairs downstream into the transcription unit.
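This genome-wide CpG depletion is why CpG islands are usually defined by an observed/expected CpG ratio rather than by raw CG counts. A sketch of the classic criteria (at least 200 bp, GC content over 50%, observed/expected CpG over 0.6, after Gardiner-Garden and Frommer):

```python
def cpg_observed_expected(seq):
    """Observed/expected CpG ratio for a window: count(CG) divided by
    count(C) * count(G) / len(seq). In the classic CpG-island
    definition, a region qualifies if it is >= 200 bp with GC
    content > 50% and an obs/exp CpG ratio > 0.6."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    observed = seq.count("CG")
    expected = c * g / len(seq)
    return observed / expected
```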

2. Tandem Repeats - DNA satellites

Tandem repeats are typically found at the centromeres and telomeres of chromosomes. They are duplications of more complex 100-200 base sequences. DNA satellites can be further divided into satellites, minisatellites, and microsatellites, based on the number of nucleotides involved.
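Perfect short tandem repeats (microsatellite-style, with units of a few bp) can be located with a back-referencing regular expression. This sketch finds exact repeats only; real satellite-detection tools such as Tandem Repeats Finder also tolerate imperfect copies:

```python
import re

def find_tandem_repeats(seq, unit_min=1, unit_max=5, min_copies=3):
    """Find perfect tandem repeats of short units using a regex with
    a backreference inside a lookahead (so overlapping runs are all
    reported). Returns (start, unit, copies) tuples."""
    pattern = re.compile(
        r"(?=(([ACGT]{%d,%d})\2{%d,}))" % (unit_min, unit_max, min_copies - 1))
    hits = []
    for m in pattern.finditer(seq):
        unit = m.group(2)                      # the repeat unit
        copies = len(m.group(1)) // len(unit)  # total copies in the run
        hits.append((m.start(), unit, copies))
    return hits

hits = find_tandem_repeats("TTCACACACAGG")
```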

3. Segmental Duplications

Segmental duplications are large blocks of 10-300 kbp that have been copied to another region of the genome.

4. Interspersed Repeats (Transposons)

Interspersed repeats are repeated DNA sequences located at dispersed positions in a genome. They are also known as mobile elements or transposable elements. LINEs are long interspersed elements; SINEs are short interspersed elements.

5. Pseudogenes

Pseudogenes are defined as nonfunctional sequences of DNA originally derived from functional genes (evolutionary relics). There are 2 major classes:

  • unprocessed pseudogenes, derived from gene duplication, and
  • processed pseudogenes, derived from retrotransposition of mRNA

Pseudogenes may be transcribed but are not translated. Their chromosomal distribution appears random and dispersed. Pseudogenes can be considered "potogenes", i.e. DNA sequences with a probability of becoming new genes.

Processed pseudogenes are very similar to their closest corresponding human gene, being 94% complete in coding regions, with sequence similarity of 75% for amino acids and 86% for nucleotides.

Non-coding RNA

Non-coding RNAs represent ~10% of genes but ~98% of all human transcripts. snRNAs participate in post-transcriptional chemical modification or processing of various RNAs.

Micro RNAs (miRNAs) are a class of non-coding RNA gene. They play an important role in the regulation of translation and degradation of mRNAs through base pairing to partially complementary sites in the untranslated regions (UTRs) of the messenger.

Antisense transcription is transcription from the strand opposite to a protein-coding, or sense, strand. Computational analysis suggests that between 15 and 25% of mammalian genes overlap, giving rise to pairs of sense and antisense RNAs. These pairs are almost universally associated with candidate imprinted loci, and also occur on the autosomes. Antisense transcripts play roles in gene regulation, including degradation of the corresponding sense transcripts (RNA interference) and gene silencing at the chromatin level. The challenge is to determine the correct orientation of an expressed sequence, especially an expressed sequence tag (EST).

Antisense mRNA is an mRNA transcript that is complementary to endogenous mRNA. It is the noncoding strand complementary to the coding sequence of mRNA. Introducing a transgene coding for antisense mRNA is a strategy used to block expression of a gene of interest. A strand of antisense mRNA can also be introduced into the cytosol by microinjection. Radioactively-labelled antisense mRNA can be used to hybridise to endogenous sense mRNA, which can show the level of transcription of genes in various cell types.

ncRNA genes are found in genomic sequences by their sequence or structural homology.

tRNAs have conserved sequence elements. Programs use a combination of pattern searches, probabilistic methods, and (for eukaryotes) searches for Pol III promoters. tRNAscan is a very good program for finding tRNAs.

Protein Coding DNA

In prokaryotes, one gene codes for one protein. Eukaryotes use a much more elaborate mechanism to increase sequence diversity and to enable themselves to produce newer proteins.

Alternative promoter usage

Several exons are involved in coding for a single protein, and any one of several exons can be used to initiate expression. The choice of the initiating exon can generate a different isoform of the same protein. In other words, alternative usage of promoters results in different isoforms of a protein.

Alternative splicing

RNA splicing is a precisely regulated co- and post- transcriptional process (occurring prior to mRNA translation) that removes introns and joins exons in a primary transcript.

During RNA splicing, exons can either be retained in the mature message or targeted for removal in different combinations to create a diverse array of mRNAs from a single pre-mRNA, a process referred to as alternative RNA splicing (tissue and cell specific).

There are four known modes of alternative splicing:

  1. Alternative selection of promoters: This is the only method of splicing which can produce an alternative N-terminus domain in proteins. In this case, different sets of promoters can be spliced with certain sets of other exons.
  2. Alternative selection of cleavage/polyadenylation sites: This is the only method of splicing which can produce an alternative C-terminus domain in proteins. In this case, different sets of polyadenylation sites can be spliced with the other exons.
  3. Intron retention mode: In this case, instead of being spliced out, an intron is retained in the mRNA transcript. The retained intron must properly encode amino acids; otherwise a stop codon or a shift in the reading frame will cause the protein to be non-functional.
  4. Exon cassette mode: In this case, certain exons are spliced out to alter the sequence of amino acids in the expressed protein.

mRNA editing

…~15 % of disease-causing mutations involve misregulation of alternative splicing (missplicing)…

Exon order is not conserved; it can be scrambled, a technique used in alternative promoter usage.

Trans-splicing vs. Cis-splicing

Splicing processes pre-mRNA in eukaryotes to produce mature mRNA. This mature messenger RNA then undergoes translation as part of protein synthesis to produce proteins. When the joined exons come from the SAME RNA transcript, the splicing is called cis-splicing.

Trans-splicing is a form of splicing that joins two exons that are not within the same RNA transcript.

Exonic splicing enhancers (ESEs) – pre-mRNA cis-acting elements

ESEs are discrete sequences within exons that promote both constitutive and regulated splicing. The precise mechanism by which ESEs facilitate the assembly of splicing complexes has been controversial. However, recent studies have provided insights into this question and have led to a new model for ESE function. Other recent work has suggested that ESEs are comprised of diverse sequences and occur frequently within exons. Ominously, these latter studies predict that many human genetic diseases linked to mutations within exons might be caused by the inactivation of ESEs.

Exon sequence enhancers prediction -

Alternative splicing database project -

Gene Prediction

Gene prediction refers to algorithmically identifying stretches of DNA sequence that are biologically functional. In the old days, gene prediction was a very painstaking and difficult process. Today, thanks to comprehensive genome sequencing and powerful computational resources, gene prediction is largely a computational problem.

Gene prediction is used to find functional sequences, in other words, regions of DNA that code for a protein or an mRNA. Regulatory regions, which regulate gene expression, are also considered functional. Gene prediction does not tell us which genes code for which proteins.

There are two primary approaches for predicting genes:

  • Intrinsic approach – ab initio
  • Extrinsic approach – homology-based

Prerequisite Knowledge

A gene is the fundamental physical and functional unit of heredity. It is an ordered sequence of nucleotides located in a particular position on a particular chromosome that encodes a specific functional product (RNA or protein).

An Open Reading Frame (ORF) is a series of DNA codons that does not contain any stop codons.

A Coding Sequence (CDS) is a region of DNA or RNA whose sequence determines the sequence of amino acids in a protein.

Frames always read from 5’ to 3’.
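These definitions can be made concrete with a short sketch. The code below is our own illustration, not taken from any gene-prediction tool: it scans a DNA string for ATG-to-stop ORFs in all six reading frames, reading each frame 5' to 3'.

```python
# Illustrative ORF scanner (our own sketch, not from any published tool).
# Finds ATG-to-stop open reading frames in all six frames, reading each
# frame 5' to 3'. Coordinates are 0-based on the given strand.

STOP_CODONS = {"TAA", "TAG", "TGA"}

def revcomp(seq):
    """Reverse complement, so the three reverse frames also read 5' to 3'."""
    comp = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(comp[base] for base in reversed(seq))

def orfs_in_frame(seq, frame):
    """Return (start, end) pairs of ATG...stop ORFs in one frame of seq."""
    orfs = []
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon == "ATG" and start is None:
            start = i                      # first start codon opens the ORF
        elif codon in STOP_CODONS and start is not None:
            orfs.append((start, i + 3))    # end includes the stop codon
            start = None
    return orfs

def find_orfs(seq):
    """Collect ORFs from all six reading frames (3 forward + 3 reverse)."""
    found = []
    for strand in (seq, revcomp(seq)):
        for frame in range(3):
            found.extend(orfs_in_frame(strand, frame))
    return found
```

For example, `find_orfs("ATGAAATAA")` reports the single ORF `(0, 9)` spanning the whole string.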

Prokaryotic gene model

Prokaryotes have small genomes with high gene density. They contain operons, meaning that one transcript can encode many genes. Since there are no introns, one gene produces one protein, and there is one ORF per gene. ORFs begin with a start codon and end with a stop codon. There are conserved promoter regions around the start sites of transcription and translation. Genes often overlap in prokaryotes.

The principal difficulties with prokaryote gene prediction are overlapping ORFs, short genes, and finding promoters. In spite of these difficulties, gene prediction in prokaryotes is 99% accurate.

Eukaryotic gene structure

Ab Initio Gene Prediction

Ab initio gene prediction is an intrinsic method based on gene content and signal detection. In ab initio methods, the genomic DNA sequence is systematically searched for signs of protein-coding genes; a signal indicates the presence of a coding region in its vicinity. Ab initio methods make predictions from the sequence information alone. They identify only the coding exons of protein-coding genes; the transcription start site and the 5' and 3' UTRs are ignored. These methods can detect new genes with no similarity to known sequences or domains.

Ab initio methods are based on rules, using coding statistics and signal detection. Statistical properties of coding regions are also taken into consideration. Training sets of known gene structures are used to generate statistical tests for the likelihood of a prediction being real. Since these statistical properties are unique to each species, this knowledge is usually not transferable between species.

Gene Content

Ab initio methods use information in the gene content, such as GC content, codon bias, and hexamer frequency, to discriminate coding regions from non-coding regions. Codon bias refers to the unusually high usage of certain codons over their synonymous alternatives. For example, leucine (L) can be encoded by six different codons, yet human genes prefer CTG over the others.

Coding statistics

A coding statistic is a function that, for a given DNA sequence, computes the likelihood that the sequence codes for a protein. We know that intergenic regions, introns, and exons have different nucleotide content, and this information helps the function discriminate between the regions. For example, the probability of finding a stop codon in a random sequence is different from that of finding one in a coding sequence.

Intergenic regions are DNA sequences located between genes that comprise a large percentage of the human genome with no known function.

Unequal usage of codons in coding regions (codon bias) is a universal feature of genomes. Uneven usage of amino acids, uneven usage of synonymous codons (which correlates with the abundance of the corresponding tRNAs), and hexamer usage also help discriminate coding regions from non-coding regions.
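The codon-bias idea can be turned into a toy coding statistic: score a sequence by summing log-likelihood ratios of codon frequencies in coding DNA versus a null model. The frequency values below are made-up illustrative numbers, not measured human codon frequencies; a real predictor would train them on known genes of the species in question.

```python
import math

# Toy coding statistic based on codon bias. The frequencies below are
# made-up illustrative values, NOT measured human codon frequencies;
# a real predictor would estimate them from a training set of known
# coding sequences for the species of interest.
CODING_FREQ = {"CTG": 0.040, "TTA": 0.005, "ATG": 0.022}
NULL_FREQ = 1.0 / 64  # uniform codon usage as the non-coding null model

def coding_score(seq):
    """Sum of per-codon log-likelihood ratios in frame 0.

    Positive scores suggest codon usage resembling coding DNA; codons
    absent from the toy table fall back to the null frequency (ratio 1).
    """
    score = 0.0
    for i in range(0, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        score += math.log(CODING_FREQ.get(codon, NULL_FREQ) / NULL_FREQ)
    return score
```

With these toy numbers, a CTG-rich stretch scores positive (coding-like) while a TTA-rich stretch scores negative, which is the discrimination the text describes.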

Gene identification in prokaryotes

Gene prediction is easier and more accurate in prokaryotes than eukaryotes since prokaryote gene structure is much simpler. In prokaryotes, ab initio methods look for:

  • The presence of an ORF (start + stop codon) of a statistically significant size to code for a protein
  • Codon usage bias
  • RBS (ribosome binding site) and terminator identification

Locating ORFs is much simpler in prokaryotes. DNA sequences encoding proteins are generally transcribed into mRNA which is translated into protein with very little modification. Locating an ORF from a start codon to a stop codon may suggest protein-coding regions. Longer ORFs are more likely to predict protein-coding regions than shorter ORFs.

Ab initio gene prediction has certain advantages largely due to the simplicity of prokaryote genomes: the genomes are small, with high gene density and simple structure (no exons/introns).

The principal difficulties are:

  • detection of the initiation site (AUG)
  • alternative start codons
  • gene overlap
  • undetected small proteins

In spite of these difficulties, prokaryote gene prediction can reach 99% accuracy.

Gene prediction in Eukaryotes

Gene identification in eukaryotes is much more complicated, difficult and a lot less accurate. In eukaryotes, we look for the following patterns:

  • upstream promoter sequences,
  • the Kozak sequence, and
  • exon-intron boundaries

We use this information to predict the poly-A signal and the start/stop sites. In eukaryotes, the signals are not as clearly defined as in prokaryotes, so simple pattern matching techniques cannot be used. The problems with eukaryote gene prediction are numerous, and prediction accuracy is about 50% at best. Modern gene prediction tools use advanced techniques such as hidden Markov models. GENSCAN is a notable program in this domain.

Locating ORFs is less effective for eukaryotic genomes. There are large non-coding regions between genes and introns within genes. mRNA undergoes processing before translation (splicing and alternative splicing). A protein-encoding gene may contain stop codons within its intronic regions. PTMs make gene prediction even more difficult. Several tools attempt to locate ORFs or help do so, such as SpliceView and ORF Finder.

Gene Prediction Methods

Various pattern recognition methods are used to identify signals:

  • weight matrices
  • decision trees
  • HMMs
  • artificial neural networks
  • linear discriminant analysis

An algorithm can be:

  • Rule-based
  • Neural network based
  • HMM based

GENSCAN is a general-purpose gene identification program that analyzes genomic DNA sequences from a variety of organisms, including human, other vertebrates, invertebrates, and plants. GENSCAN:

  • Identifies the complete exon/intron structure of genes in genomic DNA
  • Predicts multiple genes, partial and complete
  • Uses an HMM to model gene structure

Genscan takes the following things into account to make a prediction:

  • Transcription signals
  • Translation signals
  • Splicing signals (donors, acceptors, and branch points)
  • Exon length distributions
  • Compositional features such as G+C content and hexamer frequency

Weaknesses of ab initio prediction

Ab initio methods are not reliable enough, especially in eukaryotes. They are not specific enough (too many false positives), although exon sensitivity can be good. They are generally used to point sequence similarity searches in the right direction.

Proteomics

Swiss-Prot defines proteomics as "the qualitative and quantitative comparison of proteomes under different conditions to further unravel biological processes."

A genome is the collection of all the genes present in a living being. Genomics is the field that concerns itself with the study of a genome. The proteome is the collection of all the proteins. Someone doing proteomics is doing research on the proteome.

Proteomics is a far more complicated branch of science than genomics. All cells capable of storing DNA carry the same genome. However, the proteomes of different cells can be completely different. For example, the proteomes of a neuron and a red blood cell in the same organism would have very little similarity. Proteins are the functional units of a living being. Obviously, two cells performing completely different tasks would have completely different sets of functional units (proteomes).

The complexities and problems involved in proteomics are enormous in comparison with genomic research. The genome gives us the code for making proteins. Then why do we need to study proteins themselves? The answer is that the genome just gives us a list of parts. It does not tell us what the parts do, how they interact, when they are produced and in what quantity.

Unlike differential display PCR, cDNA microarrays, and serial analysis of gene expression, proteomics techniques directly investigate the functional molecules. Protein abundance often does not correspond to mRNA abundance; therefore, protein quantitation is essential.

Post-translational modifications (PTMs) are very common in proteins. A PTM can modify both protein function and structure. PTMs cannot be predicted from the genetic code. Proteomics is capable of PTM identification and quantitation.

Before a protein can be examined, it must be extracted from tissues, cells, and organelles. It has to be separated and profiled in a manner that preserves its functional and structural integrity.

Proteomics aims to:

  • Localize proteins within the cells
  • Define the functions of subcellular complexes
  • Define the functions of proteins

Proteomics research often involves the following steps:

  1. fractionation
  2. protein separation and validation
  3. validation of proteomics data

Body Fluids

All living beings, including human beings, have different bodily fluids. Each fluid performs several functions or makes the execution of several functions possible. Generally, a fluid contains a mixture of compounds, including many proteins. Our bodies make sure that our various bodily fluids remain separate. For example, blood and CSF do not mix.

When studying a disease, researchers often study the proteins involved in the disease or disorder. In a diagnostic experiment, the choice of body fluid is crucial. For example, CSF is a better choice than blood plasma for many brain diseases since blood and neurons are separated by the blood brain barrier.

There are hundreds of different fluids present in our body. We cannot cover all of them in this book. We will only be looking at the most common and most important ones.


Plasma

Plasma is the liquid portion of the blood, making up about 55% to 60% of the blood content. To collect blood plasma, a syringe containing a suitable anti-coagulant is used; the anti-coagulant prevents the blood from clotting. Whole blood contains red blood cells (RBCs), white blood cells (WBCs), and platelets; plasma itself contains fibrinogen, lipids, salts, urea, antibodies, etc. Plasma serves as a transport medium for nutrients, waste, and cells. In proteomics, plasma is fractionated before being examined since it contains too many substances and too many proteins.


Serum

To collect serum, blood is allowed to clot. Once the clot is removed, we are left with serum. Serum contains water, electrolytes, albumin, antibodies, etc. It does not contain red blood cells (RBCs), white blood cells (WBCs), platelets, or fibrinogen.

Cerebrospinal fluid (CSF)

CSF is a clear body fluid found in the subarachnoid space of the brain. The subarachnoid space lies between the skull and the cerebral cortex; the CSF within it cushions and buffers the cortex. CSF is collected by lumbar puncture. It is used in diagnosing cerebrospinal and neurological diseases such as Creutzfeldt-Jakob disease.


Urine

In animals, urine is produced by filtering blood through the kidneys. It is then collected in the bladder and excreted through the urethra. Urine contains excess compounds and undesirable substances that are not needed by the body or are harmful to it.

The major interest in the analysis of urine is to discriminate between glomerular and tubular diseases. In glomerular diseases, additional high molecular weight plasma proteins may be detected in the urine due to alteration of the glomerulus. In contrast, tubular diseases show only additional low molecular weight proteins.

Other fluids

Several other fluids are also routinely used in medicine, diagnostics, and research, but we will not be discussing them in this book.

  • Amniotic fluid
  • Bile
  • Cowper's fluid
  • Female ejaculate
  • Interstitial fluid
  • Lymph
  • Pleural fluid
  • Saliva
  • Semen
  • Sweat
  • Tears

Fractionation

In proteomics, fractionation is a separation process in which a mixture of compounds is divided into smaller fractions according to a gradient. The gradient can be based on a specific property or a set of properties. Fractionation is widely employed to separate substances of interest from other substances. Fractionation techniques can be applied at different levels.

  1. homogenization: disruption of the cellular organization
  2. fractionation: isolation of functional complexes

An ideal homogenate contains a suspension of intact and individualized subcellular compartments. To collect individualized cells, we use chelating agents such as EDTA for enzymatic and mechanical disaggregation. Then we disrupt the plasma membrane with detergents or mechanical methods such as ultrasonication.

Cell Fractionation or Subcellular Fractionation

While some claim that cell fractionation and subcellular fractionation are different, most researchers use the terms interchangeably. Cell fractionation refers to fractionating the different components of the cell. This is among the first steps in protein profiling.

Subcellular compartments can be separated based on properties such as size, density, and charge. The first step is to separate the nuclei and unbroken cells from the cytoplasmic organelles by differential sedimentation at low centrifugal force, yielding the postnuclear supernatant (PNS). The supernatant can then be separated using differential centrifugation based on size, weight, density, or even shape. Differential centrifugation is a time-dependent technique. Isopycnic centrifugation separates by density by passing the organelles through a sucrose gradient. Free-flow electrophoresis (FFE) separates by charge.[3] Immunoisolation techniques can be used to separate organelles based on their biological properties. Several commercial and non-commercial devices and techniques exist for this purpose.

Sample Fractionation

The goal of sample fractionation is to improve the detection of low-abundance proteins. The dynamic range of 2DE is limited to 10^4, while the protein expression range spans 10^7 to 10^12. Therefore, only the most abundant proteins are detected.

Increasing the amount of sample is not always a good solution; it is usually better to further subdivide the mixture. Fractionation methods are based on the physico-chemical properties of proteins. Following is a list of some properties and the techniques exploiting them.

Property Fractionation Method
Size/shape size exclusion chromatography
Surface charge ion-exchange chromatography
Isoelectric point electrophoretic methods
Surface hydrophobicity reverse phase chromatography
Binding specificity affinity chromatography
Solubility solvent extraction

Depletion of abundant proteins

22 proteins comprise 99% of the protein mass in serum. Affinity chromatography can be used to remove albumin, immunoglobulins, etc. Antibodies and proteins A, G, and L are often used to remove these proteins.

Protein precipitation

Protein precipitation can be induced by organic solvents such as acetone, by salts such as ammonium sulfate, or by changing the pH. Protein precipitation is used for the removal of large, abundant proteins. This method lacks specificity.


Ultrafiltration

Ultrafiltration is a pressure-driven, semipermeable membrane-based separation process. It achieves separation on the basis of size: the membrane retains larger molecules. This method lacks specificity.

Mass Spectrometry

A mass spectrometer determines the mass of a molecule by measuring the mass-to-charge ratio (m/z) of its ion. Ions are generated by inducing either the loss or gain of a charge from a neutral species. Once formed, ions are electrostatically directed into a mass analyzer, where they are separated according to m/z and finally detected. The result of molecular ionization, ion separation, and ion detection is a spectrum that can provide highly accurate molecular mass and structural information.

An analyte is a collection of peptides derived from a protein after digestion. Two types of analyses are carried out on an analyte:

  • Analysis of intact peptide ions – PMF (peptide mass fingerprinting)
  • Analysis of fragmented ions – PFF (peptide fragmentation fingerprinting)
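To illustrate where such an analyte comes from, here is a sketch of an in-silico tryptic digest. It encodes the standard trypsin rule (cleave on the C-terminal side of K or R, but not when the next residue is proline); ignoring missed cleavages is our simplification, and the example sequence in the usage note is arbitrary.

```python
# Sketch of an in-silico tryptic digest (our own illustration). Trypsin
# cleaves on the C-terminal side of lysine (K) or arginine (R), but not
# when the next residue is proline (P). Missed cleavages are ignored
# here as a simplification.

def tryptic_digest(protein):
    """Return the list of peptides produced by a complete tryptic digest."""
    peptides = []
    start = 0
    for i, residue in enumerate(protein):
        if residue in "KR" and protein[i + 1:i + 2] != "P":
            peptides.append(protein[start:i + 1])  # peptide ends at K/R
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])           # C-terminal remainder
    return peptides
```

For the arbitrary sequence `"AAKGGRPCCKDD"`, this yields `["AAK", "GGRPCCK", "DD"]`: the R is not cleaved because a proline follows it.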

There are several different mass spectrometers. However, all sample molecules undergo the same processes regardless of instrument configuration. Sample molecules are introduced into the instrument through a sample inlet. Once inside the instrument, the sample molecules are converted to ions in the ionization source before being electrostatically propelled into the mass analyzer. Ions are then separated according to their m/z within the mass analyzer. The detector converts the ion energy into electrical signals, which are then transmitted to a computer.

Bibliography

[1] O’Farrell, P. Z., J. Biol. Chem. 1975, 250, 4007-4021.
[2] Rabilloud, T., Valette, C., Lawrence, J., Electrophoresis 1994, 15, 1552-1558.
[3] Sanchez, ARBF 1998, Sample preparation and solubilization: crucial steps preceding the two-dimensional gel electrophoresis process.
[4] R. M. Twyman, Principles of Proteomics.
[5] Peter J. Wirth, Alfredo Romano, 1995, Staining methods in gel electrophoresis, including use of multiple detection methods.
[6] Gurd, F. R., Methods Enzymol. 1967, 11, 532-541.
[7] Griffith, O. W., Anal. Biochem. 1980, 106, 207-212.
[8] Brune, D. R., Anal. Biochem. 1992, 207, 285-290.
[9] Yohann Couté, Geneva University Hospital, mpb lecture notes.

Bioinformatics Databases

What is a database?

In simple terms, a database is an electronic filing system. It allows a user to quickly store, search, retrieve, exchange and remove data. An application that manages a database (DB) is called a DBMS (Database Management System). The big biological databases can be queried through the Internet.

Why are there so many biological databases?

Biological data is very diverse and is growing at an exponential rate. Therefore, no single database can handle all the data and serve the diverse needs of the scientific community. As a result, many different databases exist, each with different capabilities and often redundant data. Right now, there is a large effort underway by different groups around the world to link and interface all the important databases and the data contained within them.

What will I find on this website?

We do not run or maintain any bioinformatics database; we simply lack the expertise and the funds. Here you will find links to and brief descriptions of the various important databases. Our list is not exhaustive, nor is it meant to be. Our goal is to list the best and most respected databases while offering links to pages or websites that offer comprehensive lists.

How do I use a database listed here?

All biological databases listed on this website come with a set of tools to help users retrieve, submit, and analyze the data contained within. Tools evolve over time; new tools are introduced and obsolete ones are removed. These tools often have to be learned, and the database website usually offers help or tutorials to assist its users.

Genome Databases


Ensembl is a joint project between EMBL-EBI and the Sanger Institute to develop a software system that produces and maintains automatic annotation on selected eukaryotic genomes.


The Institute for Genomic Research (TIGR) is a not-for-profit center dedicated to deciphering and analyzing genomes – the complex molecular chains that constitute each organism’s unique genetic heritage.

Protein Databases

PMD - Protein Mutant Database

Compilations of protein mutant data are valuable as a basis for protein engineering. They provide information on what kinds of functional and/or structural influences are brought about by an amino acid mutation at a specific position of a protein. The Protein Mutant Database (PMD) covers natural as well as artificial mutants, including random and site-directed ones, for all proteins except members of the globin and immunoglobulin families. The PMD is based on literature, not on proteins; that is, each entry in the database corresponds to one article, which may describe one, several, or a number of protein mutants.


Structural and Functional Annotation of Protein Families


The PANTHER (Protein ANalysis THrough Evolutionary Relationships) Classification System is a unique resource that classifies genes by their functions, using published scientific experimental evidence and evolutionary relationships to predict function even in the absence of direct experimental evidence. Proteins are classified by expert biologists into families and subfamilies of shared function, which are then categorized by molecular function and biological process ontology terms. For an increasing number of proteins, detailed biochemical interactions in canonical pathways are captured and can be viewed interactively.

DIP - Database of Interacting Proteins

The DIP database catalogs experimentally determined interactions between proteins. It combines information from a variety of sources to create a single, consistent set of protein-protein interactions. The data stored within the DIP database are curated both manually by expert curators and automatically using computational approaches that utilize knowledge about the protein-protein interaction networks extracted from the most reliable, core subset of the DIP data. Please check the reference page to find articles describing the DIP database in greater detail.

HPRD - Human Protein Reference Database

The Human Protein Reference Database represents a centralized platform to visually depict and integrate information pertaining to domain architecture, post-translational modifications, interaction networks and disease association for each protein in the human proteome. All the information in HPRD has been manually extracted from the literature by expert biologists who read, interpret and analyze the published data. HPRD has been created using an object oriented database in Zope, an open source web application server, that provides versatility in query functions and allows data to be displayed dynamically.

For a more comprehensive list, please refer to ExPASy.

Protein Structure Databases

Protein Data Bank

The most authentic resource for protein structure information.

BMRB - Biological Magnetic Resonance Data Bank

Repository for data on proteins, peptides, and nucleic acids from NMR spectroscopy

Swiss-Model Repository

The SWISS-MODEL Repository is a database of annotated three-dimensional comparative protein structure models generated by the fully automated homology-modelling pipeline SWISS-MODEL. The repository is developed at the Biozentrum Basel within the Swiss Institute of Bioinformatics.


CATH is a hierarchical classification of protein domain structures which clusters proteins at four major levels: Class (C), Architecture (A), Topology (T), and Homologous superfamily (H).

Class, derived from secondary structure content, is assigned for more than 90% of protein structures automatically. Architecture, which describes the gross orientation of secondary structures, independent of connectivities, is currently assigned manually. The topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The homologous superfamilies cluster proteins with highly similar structures and functions. The assignments of structures to fold groups and homologous superfamilies are made by sequence and structure comparisons.

The boundaries and assignments for each protein domain are determined using a combination of automated and manual procedures. These include computational techniques, empirical and statistical evidence, literature review and expert analysis.


Nearly all proteins have structural similarities with other proteins and, in some of these cases, share a common evolutionary origin. The SCOP database, created by manual inspection and abetted by a battery of automated methods, aims to provide a detailed and comprehensive description of the structural and evolutionary relationships between all proteins whose structure is known. As such, it provides a broad survey of all known protein folds, detailed information about the close relatives of any particular protein, and a framework for future research and classification.

For a more comprehensive list, please refer to ExPASy.

Protein Sequence Databases


UniProt (Universal Protein Resource) is the world's most comprehensive catalog of information on proteins. It is a central repository of protein sequence and function created by joining the information contained in Swiss-Prot, TrEMBL, and PIR.

UniProt comprises three components, each optimized for different uses. The UniProt Knowledgebase (UniProtKB) is the central access point for extensive curated protein information, including function, classification, and cross-references. The UniProt Reference Clusters (UniRef) databases combine closely related sequences into a single record to speed searches. The UniProt Archive (UniParc) is a comprehensive repository reflecting the history of all protein sequences.

Swiss-Prot and TrEMBL

UniProtKB/Swiss-Prot: a curated protein sequence database which strives to provide a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc.), a minimal level of redundancy, and a high level of integration with other databases.

UniProtKB/TrEMBL: a computer-annotated supplement of Swiss-Prot that contains all the translations of EMBL nucleotide sequence entries not yet integrated in Swiss-Prot.

Ionization Methods

The ionization method refers to the mechanism of ionization, while the ionization source is the mechanical device that allows ionization to occur. Ionization methods work by ionizing a neutral molecule through electron ejection, electron capture, protonation, cationization, or deprotonation.

Protonation - basic residues

A proton is added to a molecule, producing a net positive charge of 1+ for every proton added. Positive charges tend to reside on the more basic residues of the molecule, such as amines, to form stable cations. Protonation can be achieved by MALDI and ESI.

Deprotonation - acidic residues

The negative charge of 1- is achieved through the removal of a proton from a molecule. Deprotonation is more useful with acidic residues. Deprotonation can be achieved by MALDI or ESI.
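The arithmetic behind these charge states is simple: add or subtract the proton mass once per charge and divide by the number of charges. The sketch below is our own illustration (function names are ours; the proton mass is rounded).

```python
PROTON = 1.00728  # approximate proton mass in daltons (Da)

def mz_protonated(mass, n):
    """m/z of an [M + nH]n+ ion: add n protons, divide by the charge."""
    return (mass + n * PROTON) / n

def mz_deprotonated(mass, n):
    """m/z of an [M - nH]n- ion: remove n protons, divide by the charge."""
    return (mass - n * PROTON) / n

# A hypothetical 1000 Da molecule observed at three charge states:
# [M+H]+ near 1001.0, [M+2H]2+ near 501.0, [M-H]- near 999.0
```

Note how a doubly protonated ion appears at roughly half the m/z of the singly protonated one, which is why ESI's multiple charging brings large molecules into the analyzer's range.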

Cationization - carbohydrates

Cationization involves adding a positively charged cation adduct, such as an alkali metal or ammonium ion, to a molecule. It is very suitable for carbohydrates. Cationization can be achieved by MALDI or ESI.

Transfer of a charged molecule to gas phase

This is generally achieved through the desorption or ejection of charged species from the condensed phase into the gas phase. This transfer is commonly achieved via MALDI or ESI.

Ionization Source

MALDI and ESI are now the most common ionization sources for biomolecular mass spectrometry, offering excellent mass range and sensitivity.

Ionization Source Ionization Event
ESI Evaporation of charged droplets
nanoESI Evaporation of charged droplets
MALDI Photon absorption/proton transfer

The most important considerations for both MALDI and ESI are the physical state of the analyte and the ionization energy. Both instruments can produce positive or negative ions.

Importance of vacuum

A good vacuum is needed to allow ions to reach the detector without undesirable collisions. Unwanted collisions would result in reduced resolution and sensitivity. A vulnerable spot is the point of sample insertion. ESI uses a capillary column to maintain the vacuum; MALDI evacuates the sample chamber with a vacuum lock.

ESI (flow technique: LC, CE) MALDI (pulse technique)
Not very tolerant to salts (better off-axis) More tolerant to impurities (wash)
Multiple charging: complex but useful Generally singly charged
1 fmol/µl possible with NanoSpray Can consume less sample
Very high dynamic range Lower dynamic range

Actual results depend on the sample, the impurities, and the mass analyzer. Avoid or minimize salts, chaotropes, detergents, polymers, and non-volatile compounds.

MALDI

MALDI – Matrix-assisted laser desorption/ionization

MALDI transfers a sample from the condensed phase to the gas phase and ionizes it via laser excitation. The analyte is mixed with a large excess of an aromatic ‘matrix compound’ that can absorb energy from the laser. The analyte and matrix are dissolved in an organic solvent and placed on a metallic probe or multiple-sample target. The solvent evaporates, leaving matrix crystals in which the analyte is embedded.

Sample-matrix preparation procedures greatly influence the quality of MALDI mass spectra of peptides/proteins. Among the variety of reported preparation methods, the dried-droplet method is the most frequently used. MALDI is generally used with singly-charged ions with a high mass range analyzer such as TOF. MALDI can be used for high throughput microarrays on silicon chips, imaging of tissue or selection of individual cells or microorganisms.

MALDI requires samples to be dissolved in a matrix for analysis. Proteins are least soluble at their pI (isoelectric point) and therefore precipitate.

Detergents and salts should be reduced or removed from samples. Never use proteases such as trypsin or cleavage reagents such as cyanogen bromide in matrix preparation; they cleave proteins.

Matrix crystal formation


Dried Droplet

In this method, the analyte and matrix solution are mixed together then loaded onto the MALDI sample plate. Solvents are dried by air drying. The disadvantage of this technique is the lack of signal reproducibility due to variability in drying conditions.

Fast Evaporation

The analyte is mixed with a specially prepared matrix. This produces small crystals with a very uniform surface, improving mass accuracy.


Two-layer method

This method involves a fast solvent evaporation step to form a first layer of small matrix crystals, followed by deposition of a mixture of matrix and analyte solution on top of the crystal layer. This technique keeps the benefits of fast evaporation while reducing its disadvantages; it is therefore more accurate and sensitive.

Matrix with nitrocellulose

When nitrocellulose is mixed with the matrix solution, peptide ionization and signal reproducibility increase.

Ionization Mechanism

  1. Matrix absorbs UV or IR energy from pulsed laser
  2. Matrix ionizes and dissociates
  3. Ions released by expanding plume

Matrix dependencies

  • Different lasers work with different matrices
  • The matrix must absorb energy at the laser wavelength
  • Most matrices are acidic by nature
  • A matrix can be hot (likely to fragment peptides) or cold (unlikely to fragment peptides)
  • A high matrix-to-analyte ratio is required
  • Several crystallization methods exist

Hot or Cold

  • Cold: DHB – larger molecules and proteins – creates fewer fragments
  • Hot: CHCA (alpha-cyano) – better with peptides – creates more fragments

Delayed Extraction

Delayed extraction is a technique which allows ions to be extracted from ionization source after a cooling period of ~150 nanoseconds. This narrows the kinetic energy distribution of the ions, thus providing higher resolution than in continuous extraction techniques.

In delayed extraction mode, no potential gradient exists when the sample is ionized. The accelerating voltage is pulsed after a user-set time delay, and only then are the ions accelerated.

In continuous extraction mode, the accelerating voltage is continuously applied, so the potential gradient exists when the sample is ionized. Ions are accelerated immediately.

Delayed extraction improves mass resolution by reducing the spread in arrival times.


MALDI is widely used as a tool for peptides, proteins, and most other biomolecules (oligonucleotides, carbohydrates, natural products, and lipids). Its tolerance of heterogeneous samples makes it very attractive for mass analysis of complex biological samples such as proteolytic digests. Even so, MALDI is predominantly used for the analysis of simple peptide mixtures, such as the peptides derived from a single spot on a 2D gel. The utility of MALDI for biomolecule analysis lies in its ability to provide molecular weight information on intact molecules.

Advantages:
  • Practical mass range up to 300,000 Da
  • Good sensitivity (femtomole)
  • Soft ionization method with little or no fragmentation
  • Very tolerant of salts
  • Suitable for analysis of complex mixtures

Disadvantages:
  • Background interference from matrix material
  • Possibility of photo-degradation by the laser
  • Acidic matrix may degrade some compounds

Soft ionization: not tearing the analyte apart. Soft ionization is capable of maintaining macromolecular complexes during ionization.

When MALDI does not work

Most charged or ionisable molecules interfere with the ionization of the analyte (i.e. compete for charges), causing signal suppression and/or elevated background noise. These molecules include salts, chaotropes, detergents, polymers, and all non-volatile and ionic compounds.

Good crystallization is essential for good results. If salt or detergent impedes crystallization, you might get no signal.

ESI – Electrospray ionization

The analyte is dissolved and forced through a narrow needle held at high voltage. A fine spray of charged droplets emerges from the needle. These droplets are attracted to the entrance of the mass spectrometer due to a high opposite voltage at the mass analyzer’s entrance. As they enter the mass spectrometer, the droplets are dried using a stream of inert gas, resulting in gas-phase ions that are accelerated through the analyzer towards the detector.

ESI is conducive to the formation of singly charged small molecules, but can also produce multiply charged species of larger molecules. Multiple charging makes it possible to observe very large molecules.
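As a rough illustration of how multiple charging is deconvoluted, two adjacent peaks in a charge-state series determine both the charge and the neutral mass. The peak values below are hypothetical, computed for a 10,000 Da protein:

```python
# Sketch: deconvoluting an ESI charge-state series (illustrative values).
# Adjacent peaks at m/z m1 > m2 come from charges z and z+1 of the same
# molecule M, since observed m/z = (M + z*H)/z with H ~ 1.00728 Da (proton).

PROTON = 1.00728  # Da

def deconvolute(m1, m2):
    """Given adjacent multiply-charged peaks (m1 > m2), return (z, M)."""
    z = round((m2 - PROTON) / (m1 - m2))   # charge of the m1 peak
    M = z * (m1 - PROTON)                  # neutral mass
    return z, M

# Hypothetical peaks from a 10,000 Da protein at z = 10 and z = 11:
m1 = (10000 + 10 * PROTON) / 10   # ~1001.007
m2 = (10000 + 11 * PROTON) / 11   # ~910.098
z, M = deconvolute(m1, m2)
```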

Many solvents can be used in ESI and are chosen based on the solubility of the compound of interest, the volatility of the solvent and the solvent’s ability to donate a proton. Better sensitivity is obtained when a volatile organic solvent is added.

ESI is generally used with multiply charged ions with quadrupoles and quadrupole ion traps. These instruments are more readily configured as tandem mass spectrometers for mass-selecting and fragmenting single components of a mixture.

As a liquid-phase technique, ESI is compatible with online chromatographic methods such as RP-HPLC, anion exchange chromatography, and capillary electrophoresis.


ESI is a method routinely used with peptides, proteins, carbohydrates, small oligonucleotides, synthetic polymers, and lipids.

Advantages:
  • Practical mass range up to 70,000 Da
  • Good sensitivity (femtomole)
  • Softest ionization method
  • Easily adaptable to LC
  • Easily adaptable to tandem mass analyzers such as ion traps and triple quadrupoles
  • Multiple charging allows high-mass ion analysis
  • No matrix interference

Disadvantages:
  • Salts and ion-pairing agents reduce sensitivity
  • Complex mixtures can reduce sensitivity
  • Simultaneous mixture analysis can be poor
  • Multiple charging can be confusing
  • Sample purity is important
  • Carryover from sample to sample


nanoESI has a very small needle positioned close to the entrance of the mass analyzer, resulting in more efficient ion transmission. Effusing the sample at very low rates allows for high sensitivity. The end result of this rather simple adjustment is increased efficiency, which includes a reduction in the amount of sample needed.

Since nanoESI droplets are smaller, the amount of evaporation necessary to obtain ion formation is much less. As a consequence, nanoESI is more tolerant of salts and other impurities: less evaporation means that impurities are not concentrated as much as they are in conventional ESI.

Mass Analyzers

Analytical instruments in general vary in their capabilities as a result of their individual design and intended purpose. Mass analyzers are no exception, each variation having its own strengths and weaknesses. A mass analyzer measures gas-phase ions with respect to their mass-to-charge ratio (m/z). It is important to remember that mass analyzers measure the m/z ratio, not mass. Quadrupoles and TOFs separate ions in space; ion traps separate ions in time.

Mass Analyzers Detection Method
Quadrupole Scan radio frequency field (RF)
Quadrupole ion trap Scan radio frequency field (RF)
TOF TOF correlated directly to ion’s m/z
TOF Reflectron TOF correlated directly to ion’s m/z
Quad-TOF RF scanning + TOF
FT-ICR Translates ion cyclotron motion to m/z

Mass Analysis

Performance characteristics: The performance of a mass analyzer is judged by accuracy, resolution, mass range, tandem analysis capabilities, and scan speed.

Accuracy: The ability with which the analyzer can accurately provide m/z information. This is largely a function of an instrument’s stability and resolution.

Resolution: The ability of a mass spectrometer to distinguish between ions of different m/z ratios; greater resolution means a greater ability to differentiate ions. Increasing the data acquisition rate generally comes at the cost of resolution.
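Resolution is conventionally quantified as m/Δm, the mass divided by the smallest separable mass difference (this convention is standard, though not defined in these notes). A quick sketch:

```python
# Resolution expressed as m / delta_m: the resolving power needed to
# distinguish two peaks that differ by delta_m at mass m.

def required_resolution(m, delta_m):
    return m / delta_m

# Separating the isotope peaks of a 2+ peptide (0.5 m/z apart) at
# m/z 1000 requires a resolving power of about 2000:
r = required_resolution(1000.0, 0.5)
```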

Mass Range: The m/z range of the mass analyzer.

Tandem Mass Analysis: The ability of the analyzer to separate different molecular ions, generate fragment ions from a selected ion, and then measure the mass of fragmented ions. Fragmented ions are used for structural determination.

Scan speed: This refers to the rate at which the analyzer scans over a particular mass range.

Sensitivity: Sensitivity is an absolute quantity; resolution is a relative quantity. Sensitivity describes the smallest absolute amount of change that can be detected by a measurement. Sensitivity should not be confused with accuracy—they are entirely different parameters.

Purity: How well you can separate complex mixtures.

Cleanliness: Eliminate molecules that interfere with ionization/detection.

Isotopes: Isotopes are forms of an element whose nuclei have the same atomic number - the number of protons in the nucleus - but different mass numbers because they contain different numbers of neutrons. Isotopes have the same charge but different mass.

Charge-based data acquisition

Mass spectrometers can be made to record either positive or negative ions by making the source voltage positive or negative.
  • Peptides are best analyzed as positive ions
  • Phosphorylated or sulphated peptides may be analyzed in negative ion mode
  • Fatty acids are best analyzed as fatty acyl anions
  • Carbohydrates are more easily protonated than deprotonated

Tandem Analysis

Space vs. Time

Tandem in space refers to precursor selection, dissociation, and fragment separation taking place in different compartments of the mass spectrometer, e.g. TOF/TOF, QTOF, triple quadrupole (3Q), etc.

Tandem in time refers to precursor selection, dissociation, and fragment separation taking place in the same space but at different times, e.g. ion trap, FT-ICR.

Quadrupole

The quadrupole is the most widely used analyzer due to its ease of use, the mass range covered, good linearity for quantitative work, resolution, and quality of mass spectra. It is also reasonably priced.

The main characteristics are:

  • Working mass range: 10 to 4000 A.M.U.
  • Resolution: usually operated at a resolution = 1000, but resolution can be reasonably pushed up to 4000
  • Mass accuracy: 0.1 to 0.2 A.M.U.
  • Scan speed: up to 5000 A.M.U per second

A quadrupole can be operated in RF-only mode, which allows ions of any m/z ratio to pass through, or in scanning mode, where a potential difference is applied and the instrument acts as a mass filter.

A triple quadrupole has three quadrupoles arranged in a series. It can be set either for the analysis of intact peptides or their fragment ions.

  • Q1 is used to scan across a preset m/z range and select an ion of interest.
  • Q2 focuses and transmits the ions while introducing a collision gas into the flight path of the selected ion.
  • Q3 serves to analyze ion fragments.

Quadrupoles separate ions in space.

Quadrupoles offer 3 main advantages:

  1. they tolerate relatively high pressures
  2. they cover a significant mass range (up to m/z 4000)
  3. they are relatively low cost

U is the DC voltage; V is the RF amplitude. A mass spectrum is generated by increasing U and V at a constant ratio.

Operation mode and dwell time

The quadrupole can be used in two modes: SIM (selected ion monitoring) or Scan. In SIM mode, the parameters (amplitudes of the DC and RF voltages) are set to observe only a specific mass, or a selection of specific masses. This mode provides the highest sensitivity for users interested in specific ions or fragments, since more time can be spent on each mass. That time can be adjusted and is called the dwell time.

The mass window for observing an ion in SIM mode can be adjusted to compensate for small mass calibration shifts. This is the span factor.

In Scan mode, the amplitudes of the DC and RF voltages are ramped (while keeping a constant RF/DC ratio) to obtain the mass spectrum over the required mass range. The sensitivity is a function of the scanned mass range, scan speed, and resolution.

Scanning Mode in Tandem MS

Single MS scan. Ions pass from the ionization source to the detector without collision. Gives molecular weight information.

(1) Product ion scan. The precursor ion is focused in Q1 and transferred into Q2 - the collision cell - where it interacts with a collision gas and fragments. The fragments are then measured by scanning Q3. This results in the typical MS/MS spectrum and is the method most commonly employed with ESI ionization and/or LC-MS. Q1 fixed, Q3 scan. Gives structural information.

(2) Precursor ion scan. Q3 is held to measure the occurrence of a particular fragment ion while Q1 is scanned. This results in a spectrum of precursor ions that yield that particular product ion. The goal is to find all occurrences of a certain ion. Especially useful for EI and CI ionization. Q1 scan, Q3 fixed. Structural information & screening for analogues.

(3) Neutral loss scan. Q1 is scanned as in (2) but this time Q3 is also scanned to produce a spectrum of precursor ions that undergo a particular neutral loss. Again this mode is especially useful for EI and CI ionization. Q1 scan, Q3 scan – neutral offset. Structural information & screening for conjugates.

(4) Selected reaction monitoring. Q1 and Q3 are set to fixed masses. The goal is to detect a specific reaction, i.e. which peptide fragments into which fragment ions. Both Q1 & Q3 are fixed. Target analysis & highest sensitivity.

TOF

Time-of-flight (TOF) mass analyzers are the simplest mass analyzers. TOF analysis is based on accelerating a group of ions toward a detector, where all of the ions are given the same amount of kinetic energy through an accelerating potential. Given the same push, lighter ions reach the detector before the heavier ones. The mass and charge of an ion determine its arrival time at the detector.
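The flight-time relation can be sketched numerically: every ion receives kinetic energy zeU, so the flight time over a drift length L is t = L·sqrt(m/(2zeU)). The drift length and accelerating voltage below are illustrative values, not from these notes:

```python
import math

# Sketch of the TOF relation t = L * sqrt(m / (2*z*e*U)):
# lighter ions arrive first; time scales with sqrt(mass).

E = 1.602176634e-19      # elementary charge, C
DA = 1.66053906660e-27   # one dalton in kg

def flight_time(mass_da, z, L=1.0, U=20000.0):
    """Flight time (s) over drift length L (m) at accelerating voltage U (V)."""
    m = mass_da * DA
    return L * math.sqrt(m / (2 * z * E * U))

t_light = flight_time(1000, 1)   # ~16 microseconds
t_heavy = flight_time(4000, 1)   # 4x the mass -> exactly 2x the time
```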

Unlike quadrupole instruments, no scanning electric field is required to separate ions. MALDI-TOF/TOF and hybrid analyzers are extremely sensitive. TOF separates ions in space.


MALDI-TOF with a reflectron is capable of detecting PSD. There are 3 kinds of reflectrons:

  1. Single-stage mirror: poor fragment separation
  2. Dual-stage mirror: High mass fragments enter the second stage offering good separation. Low mass fragments are reflected by the first stage practically without flight time dispersion.
  3. Curved-field: High order energy correction, but low sensitivity and resolution (due to radial potential gradient).

PSD – Post Source Decay

Fragment analysis can be carried out on a MALDI TOF/TOF by detecting PSD. PSD is achieved by applying roughly twice the usual laser intensity. The reflectron takes advantage of the fact that the fragment ions have different kinetic energies and separates them based on how deeply they penetrate the reflectron field, thus producing a fragment ion spectrum.

Prompt fragmentation is fragmentation that occurs before the push through the TOF.

Ion Trap

Both ESI and MALDI can be used with ion trap analyzers. An ion trap consists of a chamber surrounded by a ring electrode and two end-cap electrodes. It can trap ions in a radio frequency quadrupole field.

Ions above a certain m/z threshold remain in the trap. Ions are ejected based on the applied voltage, so a mass spectrum can be obtained by gradually increasing the voltage. Alternatively, an inert gas can be introduced to fragment the ions. Multiple rounds of fragmentation can be used.

Ion traps are capable of isolating an ion species by ejecting all others from the trap. This is usually done to repeatedly fragment ions of interest, which significantly increases the amount of structural information that can be gathered.

Ion traps separate ions in time.

Ion trap is ideal for glycosylation analysis since it can break down sugars sequentially.

Linear Ion Trap

When we talk about an ion trap, we usually mean the 3D ion trap. A linear ion trap uses a 2D RF field with potentials applied to the ends of the quadrupole electrodes. This offers a larger analyzer volume with an improved range of quantitative analysis.

Limitations

  • Cannot perform precursor ion scanning
  • the 1/3 rule (fragment ions below roughly one third of the precursor m/z are not trapped)
  • dynamic range is limited (cannot hold too many ions in the trap)
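The 1/3 rule above (the low-mass cutoff below which fragment ions are not stably trapped) can be sketched as a simple filter; the exact cutoff fraction varies by instrument and is approximate here:

```python
# Sketch of the ion trap "1/3 rule": fragment ions below roughly one
# third of the precursor m/z fall outside the stable trapping region.

def observable_fragments(precursor_mz, fragment_mzs, cutoff_fraction=1/3):
    cutoff = precursor_mz * cutoff_fraction
    return [mz for mz in fragment_mzs if mz >= cutoff]

# For a precursor at m/z 900, fragments below m/z 300 are lost:
kept = observable_fragments(900.0, [150.0, 299.0, 300.0, 450.0, 880.0])
```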
FT-ICR

The FT-ICR mass analyzer is the most complex and difficult to operate. It offers the highest resolution, mass accuracy, and sensitivity.

FTMS is based on the principle of monitoring a charged particle’s orbiting motion in a magnetic field. While ions are orbiting, a pulsed RF signal is used to excite them. This allows the ions to produce a detectable image current by bringing them into coherent motion and enlarging the radius of the orbit. The image current generated by the ions can then be Fourier-transformed to obtain the component frequencies of the different ions, which correspond to their m/z. All ions with the same m/z value orbit with the same cyclotron frequency in a uniform magnetic field. Since frequencies can be measured at high accuracy, m/z can also be determined at high accuracy. In addition to high resolution, FTMS offers the ability to perform MSn. It is capable of ejecting all but the ions of interest.

Unlike double-sector instruments, FT-ICR does not suffer from loss of sensitivity at high resolutions.

Hybrid Mass Analyzers

A hybrid mass analyzer combines two or more mass analyzers. Done correctly, a hybrid couples the benefits of the different analyzers.


qTOF combines the stability of a quadrupole analyzer with the high efficiency, sensitivity, and accuracy of a TOF reflectron mass analyzer. The quadrupole can act as any simple quadrupole analyzer to scan across a specified m/z range. However, it can also be used to selectively isolate a precursor ion and direct that ion into the collision cell. The resultant fragment ions are then analyzed by the TOF reflectron mass analyzer.

qTOF exploits the quadrupole’s ability to select a particular ion and the ability of TOF to achieve simultaneous and accurate measurement of ions across the full mass range. qTOF offers significantly higher sensitivity and accuracy than tandem quadrupole instruments when acquiring full fragment mass spectra.

Detectors

Once separated by the mass analyzer, ions reach the ion detector, which generates a current signal from incident ions. The most commonly used detector is the electron multiplier. Three types of detectors are common: electron multipliers, dynolyte photomultipliers, and microchannel plates.

Electron multiplier

A conversion dynode is used to convert either negative or positive ions into electrons. These electrons are amplified by a cascade effect in a horn-shaped device to produce a current. This device, also called a channeltron, is widely used in quadrupole and ion trap instruments.

Dynolyte photomultiplier

Ions exiting the quadrupole are converted to electrons by a conversion dynode. These electrons strike a phosphor which, when excited, emits photons. The photons strike a photocathode at the front of the photomultiplier to produce electrons, and the signal is amplified by the photomultiplier. The photomultiplier is sealed in glass and held under vacuum. This prevents contamination and allows the detector to maintain its performance for a considerably longer period than conventional electron multipliers.

Microchannel plate

Most TOF spectrometers employ microchannel plate (MCP) detectors, which have a time response < 1 ns and a high sensitivity (single-ion signal > 50 mV). The large, planar detection area of MCPs results in a large acceptance volume for the spectrometer system. Only a few MCP channels out of thousands are affected by the detection of a single ion, i.e. it is possible to detect many ions at the same time, which is important for laser ionisation, where hundreds of ions can be created within a few nanoseconds.

PMF - Peptide Mass Fingerprinting

PMF is an analytical technique for protein identification using data from intact peptide masses. A protease such as trypsin is used to cleave a protein of interest, and the masses of the resulting peptides are measured with a mass spectrometer. Each protein can be uniquely identified by the masses of its constituent peptides, since peptide masses are extremely discriminatory. The accuracy of PMF depends on the quality and relative intensities of the peaks, the mass accuracy of the instrument, and interfering factors such as PTMs. PMF can only identify proteins whose sequences are known; it is therefore best suited to organisms whose cDNA-derived protein sequence data is available in a database. It must be noted that even small differences in mass can produce faulty results, since PMF accuracy depends entirely on the accurate correlation of measured and predicted masses.

PMF relies on proteases to digest the protein into smaller peptides. Different proteases cut proteins at different amino acids. An enzyme of low specificity, which digests a protein into too many peptides or results in many missed cleavages, should not be used; a complex mixture results in overlapping peaks. A missed cleavage is when a protease fails to cleave where it is expected to. Trypsin, or an enzyme of similar or higher specificity, is a good choice. One should expect 1 or 2 missed cleavages from trypsin per protein. The mass of a peptide, regardless of the protease used to generate it, is the sum of the residue masses of its amino acids plus one water. The effects of modifications that might be present on the amino acids must also be taken into account. For example, phosphorylation adds a phosphate group, an addition of about 80 Daltons.
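The peptide-mass arithmetic above can be sketched as a minimal in-silico tryptic digest. The residue masses are standard monoisotopic values; the example sequence is arbitrary:

```python
import re

# Sketch: in-silico tryptic digest and peptide mass calculation for PMF.
# Standard monoisotopic residue masses (Da); peptide mass = residues + water.
RESIDUE = {
    'G': 57.02146, 'A': 71.03711, 'S': 87.03203, 'P': 97.05276,
    'V': 99.06841, 'T': 101.04768, 'C': 103.00919, 'L': 113.08406,
    'I': 113.08406, 'N': 114.04293, 'D': 115.02694, 'Q': 128.05858,
    'K': 128.09496, 'E': 129.04259, 'M': 131.04049, 'H': 137.05891,
    'F': 147.06841, 'R': 156.10111, 'Y': 163.06333, 'W': 186.07931,
}
WATER = 18.01056      # one water per peptide
PHOSPHO = 79.96633    # mass added by one phosphorylation (~80 Da)

def tryptic_digest(sequence):
    """Cleave C-terminal to K or R, except when followed by P."""
    return [p for p in re.split(r'(?<=[KR])(?!P)', sequence) if p]

def peptide_mass(peptide, n_phospho=0):
    return sum(RESIDUE[aa] for aa in peptide) + WATER + n_phospho * PHOSPHO

peptides = tryptic_digest('MKWVTFISLLLR')      # ['MK', 'WVTFISLLLR']
masses = [peptide_mass(p) for p in peptides]   # 'MK' -> ~277.146 Da
```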

Mascot is a tool which uses protein mass spectrometry data to identify proteins from primary sequence databases. It supports peptide mass fingerprinting and allows you to specify the enzyme used and the number of missed cleavages expected.

PFF - Peptide Fragment Fingerprinting

When PMF fails, fragments in the CID spectrum can provide crucial information. The data can be used in two ways:

  • Uninterpreted fragment ion masses can be used in correlative database searching to identify proteins whose peptides would likely yield similar CID spectra under the same fragmentation conditions. Probability-based matching is used here.
  • Peaks of the mass spectrum can be interpreted, either manually or automatically, to derive partial de novo peptide sequences that can be used as standard database queries.

Peptide Fragmentation

In order to obtain peptide sequence information by mass spectrometry, fragments of an ion must be produced that reflect structural features of the original compound. Fortunately, most peptides are linear molecules, which allows for relatively straightforward interpretation of the fragmentation data. Fragmentation is accomplished by colliding the ions with an inert gas; the fragments are then monitored via mass analysis.

Tandem mass spectrometry allows a heterogeneous mixture of peptides to be analyzed: by filtering the ion of interest into the collision cell, structural information can be derived for each peptide in a complex mixture. The fragment ions produced in this process fall into two classes. One class retains the charge on the N-terminus, with fragmentation occurring at the a, b, and c positions. The second class retains the charge on the C-terminus, with fragmentation at the x, y, and z positions. Most fragments result from cleavage of the amide (peptide) bond adjacent to the carbonyl group.

  • b ions – charge retained by N-terminus
  • y ions – charge retained by C-terminus

In determining the amino acid sequence of a peptide, it is not possible to distinguish between L and I because they have the same mass.

Although a complete ion series (y or b) is rarely observed, the combination of the two series can provide enough information for protein identification.
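The b/y ladder arithmetic can be sketched as follows, using a handful of standard monoisotopic residue masses (the example peptide is arbitrary):

```python
# Sketch: b- and y-ion m/z ladders for a singly-protonated peptide.
# b_i = sum of first i residue masses + proton;
# y_i = sum of last i residue masses + water + proton.
RESIDUE = {'G': 57.02146, 'A': 71.03711, 'S': 87.03203, 'P': 97.05276}
PROTON, WATER = 1.00728, 18.01056

def b_ions(pep):
    total, out = 0.0, []
    for aa in pep[:-1]:            # the full-length b ion is rarely observed
        total += RESIDUE[aa]
        out.append(total + PROTON)
    return out

def y_ions(pep):
    total, out = 0.0, []
    for aa in reversed(pep[1:]):   # y1 is the C-terminal residue
        total += RESIDUE[aa]
        out.append(total + WATER + PROTON)
    return out

bs = b_ions('GASP')   # b1..b3; b1 = G + proton ~ 58.029
ys = y_ions('GASP')   # y1..y3; y1 = P + water + proton ~ 116.071
```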

Singly charged vs. multiply charged peptides

In multiply-charged peptides, the proton is strongly delocalized. This results in rich fragmentation and good sequence coverage. In singly-charged peptides, the proton is localized on basic residues. This results in poor fragmentation and low sequence coverage.

Biomolecular Analysis

Important considerations for obtaining a high quality signal are sample solubility, matrix selection, ionization characteristics, salt content, and purity. Factors affecting data interpretation include quantitation, molecular weight calculation, isotope patterns, calibration/accuracy, sensitivity, and speed.


Do the ion intensities correlate to the relative amounts of each component? MALDI does not provide quantitative information unless the compound has been calibrated against an internal standard. ESI provides some quantitative information based on external calibration but internal calibration is more accurate. Following are some factors that affect the ability of a mass spectrometer to perform quantitative measurements:

  • Averaging: More averaging results in fewer errors associated with random noise
  • Quantity of material: common sense
  • Dynamic signal range: ion traps have small dynamic ranges. Quadrupoles and TOFs have much larger range.
  • Ionization technique: ESI’s stable signal provides better quantitation than MALDI.
  • Compound’s functional groups: Functional groups on a molecule can drastically affect the ionization properties. For instance, an amine will pick up a proton far more efficiently than an amide. Therefore in order to obtain good quantitative data an internal standard with comparable ionization characteristics is desirable.
  • Choice of internal standards: An internal standard with comparable ionization characteristics to the compound of interest allows for consistent relative signal stability. The best choice is an isotopically labeled internal standard.
  • Consistent sample handling: common sense

Ability of a molecule to become ionized is closely related to its functional groups. The best quantitation is obtained when a compound is calibrated against an internal standard similar to the molecule in question.

Calculating Molecular Weight

There are 3 different ways to calculate mass from molecular mass formula. Each is used for a specific reason.

  1. Monoisotopic Mass: The mass of an ion calculated using the exact mass of the most abundant isotope of each element. Monoisotopic mass is used when the individual isotope peaks are distinguishable.
  2. Average Mass: The mass of an ion calculated using average atomic weight of all the isotopes. Used when individual masses are not distinguishable.
  3. Nominal Mass: The mass of an ion calculated using the integer mass of the most abundant isotope of each element, e.g. C = 12, N = 14. Nominal mass is not used very often.

The resolving power of a mass spectrometer is very important in calculating mass of a molecule.
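The three conventions can be illustrated for a small molecule such as glycine (C2H5NO2), using standard isotope masses and average atomic weights:

```python
# Sketch: monoisotopic vs. average vs. nominal mass of glycine, C2H5NO2.
# Standard exact isotope masses and average atomic weights (Da).
MONO = {'C': 12.0, 'H': 1.0078250319, 'N': 14.0030740052, 'O': 15.9949146221}
AVG  = {'C': 12.011, 'H': 1.008, 'N': 14.007, 'O': 15.999}
NOM  = {'C': 12, 'H': 1, 'N': 14, 'O': 16}

def mass(formula, table):
    return sum(n * table[el] for el, n in formula.items())

glycine = {'C': 2, 'H': 5, 'N': 1, 'O': 2}
mono = mass(glycine, MONO)   # ~75.032 Da
avg  = mass(glycine, AVG)    # ~75.067 Da
nom  = mass(glycine, NOM)    # 75
```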

Isotope Patterns

Isotope patterns can be a great source of information. The spacing of isotope peaks indicates the charge state: 1/2 spacing = 2+ charge state, 1/3 spacing = 3+ charge state. Certain elements have distinct isotope patterns which help in identifying them.
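Reading the charge state from isotope spacing can be sketched in a few lines (the peak values below are hypothetical):

```python
# Sketch: isotope peaks are ~1 Da apart in mass, so their observed m/z
# spacing is 1/z; the charge state is the reciprocal of the spacing.

def charge_from_spacing(mz_peaks):
    spacing = mz_peaks[1] - mz_peaks[0]
    return round(1 / spacing)

z2 = charge_from_spacing([500.50, 501.00, 501.50])      # 0.5 spacing -> 2+
z3 = charge_from_spacing([400.20, 400.5333, 400.8667])  # ~1/3 spacing -> 3+
```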


Sample solubility is absolutely critical in obtaining quality data. The solvent or matrix allows the sample to be transferred to the gas phase, thus playing a critical role in ionization.


Analyzing the sample ASAP after it has been prepared is important, as it is quite common for compounds to decompose or even react with the solvent in a relatively short time. A problem associated with hydrophobic compounds is loss of the sample to the container’s surface.


Protein identification depends on the accuracy of mass measurements. For high accuracy, it is often necessary to have an internal standard present, or at least some reference compound. Standard compounds are used to calibrate a mass spectrometer’s mass analyzer with respect to how it measures m/z. Calibration is generally performed using a standard mixture, such as PPG, that generates a reliable source of known ions covering the mass range of interest.

It is important to distinguish between internal and external calibration. External calibration refers to the instrument being calibrated, followed by analysis without a calibrant present. Internal calibration refers to analyses performed with a calibrant present to improve accuracy.

Sample purity

Sample purity maximizes sensitivity. MALDI samples can be cleaned and desalted with ZipTip or a droplet of cold water. Dialysis for ESI.


If too little sample is used, the instrument will be unable to detect a signal. Too much sample can skew the intensity profile of the ions. Higher concentrations can also amplify the effects of impurities. It is important to be within the correct range of your instrument.

Ionization characteristics

The types of functional groups on a molecule will often determine how a compound should be analyzed.

Protein Structure Characterization

Mass spectrometry can be used to determine both primary and higher-order structures of proteins. The basis for these investigations lies in the ability of mass analysis techniques to detect changes in protein conformation under differing conditions. These experiments include:

  • Monitoring charge states of proteins
  • Monitoring charge states in combination with proteolytic digestion
  • Monitoring charge states with chemical modification

The accuracy and sensitivity of PMF allows for the exploration of protein structure and even structural dynamics.

PMF combines enzymatic digestion, mass spectrometry, and sequence-specific data analysis to produce and examine proteolytic fragments. This information can then be used to identify the protein and to obtain information about protein structure.

Higher order structure of a protein can be evaluated when PMF techniques are combined with limited proteolytic digestion. Limited proteolysis refers to the exposure of a protein or complex to digestion conditions that last for a brief period. This is performed to gain information on the parts of the protein exposed to the surface.

The sequence specificity of the proteolytic enzyme plays a major role in the application of mass spectrometry to protein structure. A sequence-specific protease reduces the number of fragments that are produced and, concomitantly, improves the likelihood of statistically significant matches. The accessibility and flexibility of a protein are also very important. Surface regions are usually hydrophilic, so proteases that cleave at hydrophilic sites are preferred.

Recognizing conformational changes

PMF can be used to recognize simple conformational differences between protein states. After conformational changes, the same protein would digest into different mass maps.

ESI has been used to monitor protein folding and protein complexes. Some proteins exhibit a distinct difference in their charge state distribution which reflects their solution conformation. ESI is a simple but highly sensitive and informative method to characterize the functional shape(s) of proteins (globular or extended) prior to more material-intensive and time-consuming spectroscopic or crystallographic studies.

Protein Quantitation

Protein quantitation by chemical incorporation of isotopes

Absolute quantitation: determine the absolute concentration of proteins/peptides in a selected fluid/cell/tissue; can be applied to a series of samples.

Relative quantitation: determine the concentration of proteins/peptides by comparison to an internal standard and/or a similar fluid/cell/tissue at a different physiological stage.


Biological incorporation

Pre-harvesting. Labeling of the peptide/protein is achieved by growing cells in media enriched in stable-isotope-containing amino acids, e.g. SILAC.

Chemical incorporation

Post-harvesting. A derivatisation reagent chemically modifies proteins in a site-specific manner after the proteins are harvested.

Enzymatic incorporation

Labeling is achieved during enzymatic cleavage, where proteolysis incorporates an oxygen atom from the solvent (H2O) into the C-terminus.

ICAT – Isotope-coded affinity tagging

In MS, protein quantitation is often based on the use of stable isotopes. Gel-based methods use proteins for quantitation while MS uses peptides for quantitation. The general approach is to label alternative samples with equivalent reagents, one of which contains a light isotope. The samples are mixed, separated into fractions, and analyzed by MS. The ratio of 2 isotopic variants can be determined from the heights of the peaks in the mass spectra and used to identify proteins with differential abundance. MS methods are more reproducible and sensitive than gel-based methods for protein quantitation.

ICAT uses stable isotope labeling to perform quantitative analysis of paired protein samples, followed by separation and identification of proteins within these complex mixtures by LC-MS. The isotopic tags bind covalently to Cys within a protein. The tags are almost identical, possessing the same structure and chemical properties, but exist in two isotopic forms:

  • Light - possessing eight hydrogens
  • Heavy - possessing eight deuteriums

When bound to the same peptide, a mass difference of exactly 8 mass units will be evident when the samples are analyzed by MS.

The tag has three functional elements:

  1. a biotin tag, used during affinity capture
  2. an isotopically encoded linker chain
  3. a reactive group that binds to and modifies Cys residues

Using ICAT is a 4 step process:

  1. free cysteines in a protein are reacted with a special affinity tag
  2. labeled proteins are enzymatically digested
  3. labeled peptides are separated from bulk using LC prior to MS
  4. MS detects the mass differences in the same peptides

Two samples are separately treated with affinity tags. One with light ICAT and the other with heavy ICAT. The samples are mixed, digested and passed through MS.

The strength of this technique lies in its ability to allow quantification and identification within a single analysis. It can also be applied to samples from any source, as it does not require metabolic labeling. Its advantage over 2D gels is speed and ease of automation.

Weaknesses of this method include the frequent need for extensive sample fractionation before MS/MS analysis. Since the procedure targets Cys residues, proteins that do not contain Cys cannot be quantified. This represents about 10% of the proteins.

Heavy Oxygen Labeling

An alternative to ICAT labeling of proteins that is not selective for cysteine-containing peptides is to label the peptides after digestion. When trypsin cleaves a protein and generates a peptide with a new C-terminus, it introduces an oxygen atom derived from a molecule of water into the carboxyl group of the peptide. This can be exploited for the identification of y-series ions in fragment ion spectra, but it can also be used to differentially label peptides derived from two protein samples if normal water is used in one buffer and water substituted with heavy oxygen (O-18) is used in the other. The abundance of the peptides can then be compared, since they will appear as doublets separated by two mass units.

Peptide Cleavage

Why cleave proteins to peptides?

Peptides must be generated prior to MS identification since peptides fragment well in CID. For proteomics, peptides are typically generated from proteins in solution, in gel pieces, or on membranes.

Peptide Generation

Peptides can be efficiently generated by enzymatic or chemical cleavage methods compatible with MS. The most common technique is proteolytic cleavage; chemical cleavage is a common alternative.

Proteolytic Cleavage

Proteolytic cleavage uses proteinases (also called proteases or peptidases), enzymes which break the peptide bonds of proteins.


Trypsin is the most widely used endoproteinase for protein identification by PMF for the following reasons:

• It cleaves on the C-terminal side of the basic amino acids K and R when they are not followed by P.
• It yields a wide distribution of peptide masses (500 - 4000 Da) useful for MS analysis.
• It creates 2 charge sites, which are very useful for ESI.
• The positioning of charges on the N and C termini simplifies interpretation of CID spectra.
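The cleavage rule in the first point can be expressed as a short in-silico digest. A minimal sketch: the example sequence is invented, and real digestion tools also model missed cleavages, which this ignores.

```python
import re

def trypsin_digest(protein):
    """In-silico tryptic digest: cut after K or R, except when followed by P."""
    # zero-width split: lookbehind for K/R, negative lookahead for P
    return [p for p in re.split(r'(?<=[KR])(?!P)', protein) if p]

# toy sequence: cleavage occurs after K and after R, but the KP site stays intact
print(trypsin_digest("MKWVTFISLLRKPAGK"))  # ['MK', 'WVTFISLLR', 'KPAGK']
```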


Chymotrypsin mainly cleaves on C-terminal sites of T, Y, F, and L residues. Its cleavage fragments are difficult to predict, and it is poorly specific due to the large number of possible cleavage sites, so it is not very useful for PMF. However, because it yields high sequence coverage, it is considered very useful for characterizing PTMs.


Pepsin is a non-specific acidic protease. It can cleave on both the N-terminal and C-terminal sides of aromatic and hydrophobic residues. It is not useful for PMF but is very useful for characterization of PTMs. It can also be used to digest hydrophobic membrane proteins that cannot be attacked by trypsin due to a lack of basic residues.

Chemical Cleavage

Chemical cleavage of proteins is very important for Edman sequencing. Although chemical cleavage is very specific, it requires cleanup before MS.

Cyanogen Bromide

• It cleaves very specifically at methionyl residues in the peptide backbone.
• This method has been successfully used with membrane proteins.
• Since M is often situated in hydrophobic regions of proteins, digestion using CNBr can generate peptides with decreased hydrophobicity, ultimately enabling protein analysis.

Acidic Hydrolysis

External factors can have an enormous influence on the cleavage.

Meta Databases

A meta-database is a DBMS which is either linked to or collects information from various other databases. A meta-database allows users to access information related to a specific topic from several databases on one page.


The MetaDB metadatabase is a sorted, searchable collection of biological databases. Most entries in the metadatabase include a relevant peer-reviewed abstract or excerpt along with a link to the abstract or full text article. Database descriptions surrounded by quotation marks were borrowed from the database websites. It contains links to over 1200 databases.


Entrez is the integrated, text-based search and retrieval system used at NCBI for the major databases, including PubMed, Nucleotide and Protein Sequences, Protein Structures, Complete Genomes, Taxonomy, and others.


euGenes provides a common summary of gene and genomic information from eukaryotic organism databases. This includes:

  • Gene symbol and full name,
  • Chromosome, genetic, and molecular map information,
  • Gene product information (function, structure, and homologies),
  • Links to extended gene information.


The GeneCards project defines its goal as integrating the fragments of information scattered over a variety of specialized databases into a coherent picture.


SOURCE is a unification tool which dynamically collects and compiles data from many scientific databases, and thereby attempts to encapsulate the genetics and molecular biology of genes from the genomes of Homo sapiens, Mus musculus, and Rattus norvegicus into easy-to-navigate GeneReports. The mission of SOURCE is to provide a unique scientific resource that pools publicly available data commonly sought after for any clone, GenBank accession number, or gene. SOURCE is specifically designed to facilitate the analysis of large sets of data that biologists can now produce using genome-scale experimental approaches.



Scoring

Scoring Model

When we compare sequences, we are looking for evidence that they have diverged from a common ancestor by a process of mutation and selection. Basic mutational processes are:

• Substitutions: residue changes in the sequence
• Insertions: addition of a residue
• Deletions: removal of a residue

Insertions and deletions, together, are called gaps.

The total score we assign to an alignment is the sum of terms for each aligned pair of residues, plus terms for each gap. In the probabilistic interpretation, this corresponds to the logarithm of the relative likelihood that the sequences are related, compared to being unrelated; equivalently, it is the log probability of the alignment under the related model minus its log probability under the random model.

Identities and conservative substitutions are expected to occur more often in true alignments than by chance. Thus aligned pairs in related sequences are more likely to receive positive scores, while random substitutions are expected to contribute negative scores.

Using an additive score corresponds to assuming that mutations at different sites of the sequence occurred independently, with a gap of several residues counted as a single mutational event. This is a reasonable assumption for DNA and protein sequences, but it is seriously inaccurate for structural RNA, where base-pairing creates strong dependencies between distant sites.

Substitution Matrices / Scoring Matrices

What you really want to learn when evaluating a sequence alignment is whether it is random or meaningful. To assess the meaningfulness of an alignment, we construct a scoring matrix.

A scoring matrix is a table of values that describe the probability of a residue pair occurring in an alignment. The values in a scoring matrix are logarithms of ratios of two probabilities: the probability of a meaningful occurrence of the residue pair in a true alignment, divided by the probability of the pair occurring at random.

In order to score an alignment, the alignment program needs to know whether a given amino acid pair is more or less likely to have occurred by chance. A negative log-odds ratio indicates a chance pairing, while a positive one indicates an evolutionary relationship. It is important to note that the scores are logarithms: adding them multiplies the underlying likelihood ratios, so a long stretch of well-matched residues is very unlikely to be a coincidence.

Formula for the log likelihood ratio of the residue pair (a, b):

s(a,b) = log( pab / (qa qb) )

where pab is the probability that a and b are aligned in a true alignment, and qa and qb are the background frequencies of the two residues. There are several ways to derive substitution scores; however, substitution scoring based on probabilistic models seems to be the most accurate.

Given a pair of aligned sequences, we want to assign a score to the alignment that gives a measure of the relative likelihood that the sequences are related as opposed to being unrelated. We do this by having models that assign a probability to the alignment in each of the two cases; we then consider the ratio of the two probabilities.

The random model is the simplest. It assumes that each amino acid a occurs independently with some frequency qa. Hence the probability of a pair of sequences is the product of the probabilities of each amino acid.

In the alternative model, aligned pairs of residues occur with a joint probability pab. This value can be thought of as the probability that the residues a and b have each independently been derived from some unknown original residue c in their common ancestor.

The ratio between the probabilities of the two models is called the odds ratio:

P(alternative) / P(random)

In order to arrive at an additive scoring system, we take the log of this ratio. The log likelihood ratios can be arranged in a matrix: 4 x 4 for DNA, 20 x 20 for proteins. This matrix is called the score matrix or substitution matrix. Blosum50 and the PAM series are among the most commonly used matrices.
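The log-odds construction can be illustrated numerically. The frequencies below are invented for the example (not taken from any real matrix), and real Blosum matrices additionally round the scaled values to integers.

```python
from math import log2

def log_odds(pab, qa, qb, scale=2.0):
    """s(a,b) = scale * log2(pab / (qa * qb)), the log likelihood ratio."""
    return scale * log2(pab / (qa * qb))

# hypothetical pair: the observed joint frequency is twice the chance
# expectation (qa * qb), so the score is positive; a pair seen less often
# than chance would score negative
print(round(log_odds(pab=0.0108, qa=0.09, qb=0.06), 2))  # 2.0
```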

Substitution matrices essentially make a statement about the probability of observing ab pairs in real alignments.

Gap Penalties

DNA sequences change not only by point mutation, but by insertion and deletion of residues as well. Consequently, it is often necessary to introduce gaps into one or both of the sequences being aligned to produce a meaningful alignment between them.

Gaps have to be penalized. The standard cost associated with a gap of length g is given either by a linear score or an affine score.

V(g) = -g d (linear)
V(g) = -d - (g - 1) e (affine)

where d is called the gap-open penalty and e is called the gap-extension penalty; in the linear case, d is simply the cost per gapped residue.
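The two penalty schemes can be compared directly. A small sketch; d = 8 and e = 2 are illustrative values, not defaults of any particular program.

```python
def linear_gap(g, d=8):
    """V(g) = -g*d: every gapped residue costs the same."""
    return -g * d

def affine_gap(g, d=8, e=2):
    """V(g) = -d - (g-1)*e: opening costs d, each extension only e."""
    return -d - (g - 1) * e

# one gap of length 10 versus ten gaps of length 1
print(linear_gap(10))      # -80
print(affine_gap(10))      # -26: a single long gap is penalized far less
print(10 * affine_gap(1))  # -80: ten separate gaps each pay the opening cost
```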

Most sequence alignment models use affine gap penalties, where the cost of opening a gap in a sequence is different from the cost of extending a gap that has already been started. The extension penalty e is usually set to a number less than the gap-open penalty d. This allows long insertions and deletions to be penalized less than they would be under a linear gap cost, which is desirable when gaps of a few residues are expected almost as often as gaps of a single residue.

Gap penalties also correspond to a probabilistic model of alignment, although this is less widely recognized than the probabilistic basis of substitution matrices. We assume that the probability of a gap occurring at a particular site in a given sequence is the product of a function f(g) of the length of the gap and the combined probability of the set of inserted residues; in other words, the length of a gap is not correlated with the residues it contains. The gap penalty then corresponds to the log probability of a gap of that length.

On the other hand, if there is evidence for a different distribution of residues in gap regions, then there should be residue-specific scores for the unaligned residues in gap regions, equal to the logs of the ratios of their frequencies in gapped versus aligned regions. For example, gap regions tend to fall in surface loops of a protein, where the residue composition differs from that of the hydrophobic core.

Gap penalties are intimately tied to the scoring matrix that aligns the sequences. The best pair of gap opening and extension penalties for one scoring matrix doesn’t necessarily work with another.

Linear Gap Penalty

Linear gap penalties are the simplest type of gap penalty. The only parameter, d, is a penalty per gapped residue. Since each gapped position reduces the score, an alignment with fewer gaps is favored over an alignment with more gaps. Under a linear gap penalty, the overall penalty for one large gap is the same as for many small gaps of equal total length.

Affine Gap Penalty

Affine gap penalties attempt to overcome this problem. In biological sequences, it is much more likely that one big gap of length 10 occurs, due to a single insertion or deletion event, than that 10 small gaps of length 1 occur independently. Therefore, affine gap penalties charge a gap-opening penalty d and a gap-extension penalty e, and a gap of length g is penalized by d + (g - 1)e. Since a few large gaps are preferable to many small ones, e is almost always smaller than d.

Alignment Algorithms

Given a scoring system, we need an algorithm for finding an optimal alignment for a pair of sequences. Even when both sequences have the same length n, there is only one possible gapless global alignment of the complete sequences, but things get complicated once gaps are allowed: it is not computationally feasible to enumerate all possible alignments.

The algorithm for finding optimal alignments given an additive alignment score of the type mentioned above is called dynamic programming. Dynamic programming is crucial for computational sequence analysis. Unlike heuristic methods, dynamic programming algorithms are guaranteed to find the optimal-scoring alignment or set of alignments. Dynamic programming works by dividing the problem into smaller subproblems and storing their results in a table.

In a log-odds scoring scheme, better alignments produce higher scores, so to find the optimal alignment we maximize the total score, for example the sum of Blosum50 substitution values and gap penalties.

Global Alignment: Needleman-Wunsch Algorithm

The Needleman-Wunsch algorithm performs a global alignment on two sequences. A global alignment between two sequences is an alignment in which all the characters in both sequences participate in the alignment. Global alignments are useful mostly for finding closely related sequences.

The Needleman-Wunsch algorithm is an example of dynamic programming and is guaranteed to find the alignment with the maximum score. The goal is to maximize a similarity score, giving the 'maximum match' (the largest number of residues of one sequence that can be matched with the other, allowing for all possible insertions and deletions), and thereby find the best global alignment of the two sequences.

The idea is to build an optimal alignment using previous solutions for optimal alignments of smaller subsequences. We construct a matrix F indexed by i and j, one index for each sequence, where the value F(i, j) is the score of the best alignment between the first i characters of x and the first j characters of y. If we know:

F(i - 1, j - 1), F(i - 1, j), and F(i, j - 1) - in other words, the diagonal, top, and left neighbors of the cell -

it is possible to calculate F(i, j). There are three ways the best score F(i, j) of an alignment up to xi, yj could be obtained:

  1. xi aligned to yj: F(i - 1, j - 1) + s(xi, yj)
  2. xi aligned to a gap: F(i - 1, j) - d
  3. yj aligned to a gap: F(i, j - 1) - d

The best score is the largest of the three options. The value F(n, m) is by definition the best score for an alignment of x1..n to y1..m, which is what we want: the score of the best global alignment of x and y. To find the alignment itself, we must find the path of choices, max(choice1, choice2, choice3), which led to this final value. This procedure is referred to as traceback.

Traceback works by building the alignment in reverse, starting from the final cell and following the pointers that we stored when building the matrix. At each step we move back to (i - 1, j - 1), (i - 1, j), or (i, j - 1); whenever we move up or left rather than diagonally, we add a gap to the sequence that did not advance.

The reason that the algorithm works is that the score is made of a sum of independent pieces, so the best score up to some point in the alignment is the best score up to the point one step before, plus the score of the new step.

This algorithm computes 3 sums and a max for each cell, and requires an (n + 1) x (m + 1) matrix for storage. Thus it takes O(nm) time and O(nm) memory; for sequences of similar length, this is O(n^2).

In programming terms, N&W involves an iterative matrix method of calculation. All possible pairs of residues (bases or amino acids) - one from each sequence - are represented in a 2-dimensional array. All possible alignments (comparisons) are represented by pathways through this array.

The following four steps are necessary to align sequence1 of N positions with sequence2 of M positions:

  1. Build a matrix of size N * M;
  2. Assign similarity values;
  3. For each cell, look at all possible pathways back to the beginning of the sequence and give that cell the value of the maximum scoring pathway;
  4. Construct an alignment (pathway) back from the highest scoring cell to give the highest scoring alignment.
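The four steps above can be condensed into a compact implementation. This is a minimal sketch with an illustrative match/mismatch score and a linear gap penalty d; real aligners would use a substitution matrix such as Blosum50 and usually affine gaps.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, d=2):
    """Global alignment of x and y with linear gap penalty d."""
    n, m = len(x), len(y)
    # F[i][j] = best score for aligning x[:i] with y[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -i * d          # x-prefix aligned entirely to gaps
    for j in range(1, m + 1):
        F[0][j] = -j * d          # y-prefix aligned entirely to gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,  # xi aligned to yj
                          F[i - 1][j] - d,      # xi aligned to a gap
                          F[i][j - 1] - d)      # yj aligned to a gap
    # traceback: rebuild the alignment in reverse from the final cell
    ax, ay, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and \
           F[i][j] == F[i - 1][j - 1] + (match if x[i - 1] == y[j - 1] else mismatch):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append('-'); i -= 1
        else:
            ax.append('-'); ay.append(y[j - 1]); j -= 1
    return F[n][m], ''.join(reversed(ax)), ''.join(reversed(ay))

score, a1, a2 = needleman_wunsch("GAT", "GT")
print(score, a1, a2)  # 0 GAT G-T
```

With these toy parameters, aligning GAT against GT matches the G and T, inserts one gap opposite the A, and yields a total score of 1 + 1 - 2 = 0.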