BioPerl is a collection of Perl modules that facilitate the development of Perl scripts for bioinformatics. It is open source and widely used in the bioinformatics community. BioPerl provides modules for many of the typical tasks of bioinformatics programming. These include:

  • Accessing sequence data from local and remote databases
  • Converting database and file records between formats
  • Manipulating individual sequences
  • Searching for similar sequences
  • Creating and manipulating sequence alignments
  • Searching for genes and other structures on genomic DNA
  • Developing machine-readable sequence annotations

Retrieving sequences from Swiss-Prot

The following script retrieves a sequence from Swiss-Prot:

#!/usr/bin/perl
use strict;
use Bio::Perl;

# Retrieves a sequence from Swiss-Prot and writes it to a file in FASTA format.
# This script needs an Internet connection to work.

my $s = get_sequence('swiss',$ARGV[0]);
write_sequence(">$ARGV[1]",'fasta',$s);

usage:

$ perl get_swissprot_sequence.pl P11217 a.fasta
$ perl get_swissprot_sequence.pl PYGM_HUMAN b.fasta

Perl stores command-line arguments in the @ARGV array. The first argument, $ARGV[0], is the Swiss-Prot accession number or identifier. The second argument, $ARGV[1], is the name of the output file.
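The script above does not check its arguments, so a missing argument only shows up later as a confusing error. A minimal sketch of validating them first is shown below. The helper name parse_args is hypothetical and not part of BioPerl, and the literal values stand in for @ARGV.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# parse_args is a hypothetical helper, not a BioPerl function.
# It expects exactly two values: an accession/identifier and an output file.
sub parse_args {
    my @args = @_;
    die "usage: $0 <accession-or-id> <output-file>\n" unless @args == 2;
    my ($id, $outfile) = @args;
    return ($id, $outfile);
}

# In a real script, pass @ARGV instead of the literal values used here.
my ($id, $outfile) = parse_args('P11217', 'a.fasta');
print "id=$id outfile=$outfile\n";
```

Calling parse_args(@ARGV) at the top of the script stops it with a usage message before any network request is made.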

For more on using SeqIO with files, see http://www.bioperl.org/wiki/Sequence_formats

Blasting a sequence

Aligning sequences with BLAST is one of the most common tasks in bioinformatics. Here is how you can do it using BioPerl:

#!/usr/bin/perl 
use strict;
use warnings;
use Bio::Perl;

# get sequence given an identifier or accession number
my $so = get_sequence('swiss',$ARGV[0]);

# BLAST the sequence (the search runs remotely, so this needs an Internet connection)
my $blast_result = blast_sequence($so);

# write blast results into a file
write_blast(">$ARGV[1]",$blast_result);

usage:

$ perl blastseq.pl PYGM_HUMAN pygm.blast
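The object returned by blast_sequence can also be inspected directly instead of only being written to a file. The sketch below lists the best hits, assuming the result object offers next_hit and each hit offers name and significance, as BioPerl search results do. The helper name top_hits and the limit of 10 are illustrative choices.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Bio::Perl;

# top_hits is a hypothetical helper. It walks a result object with
# next_hit and returns up to $max lines of "name<TAB>significance".
sub top_hits {
    my ($result, $max) = @_;
    my @lines;
    while (my $hit = $result->next_hit) {
        push @lines, join("\t", $hit->name, $hit->significance);
        last if @lines >= $max;
    }
    return @lines;
}

# Runs only when an identifier is given; needs an Internet connection.
if (@ARGV == 1) {
    my $seq    = get_sequence('swiss', $ARGV[0]);
    my $result = blast_sequence($seq);
    print "$_\n" for top_hits($result, 10);
}
```

Keeping the hit listing in its own subroutine means it can be reused on any result object, not only on one that came from a remote search.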

Creating sequence objects

When working with FASTA or other sequence files, you first have to create sequence objects from them.

#!/usr/bin/perl
use strict;
use warnings;
use Bio::Perl;
use Bio::SeqIO;

# open the FASTA file as a SeqIO stream
my $s = Bio::SeqIO->new( -file => "pygm.fasta", -format => "fasta");
# read the first record as a sequence object and print its sequence
my $st = $s->next_seq;
print $st->seq, "\n";

usage:

$ perl create_sequence_object.pl

The script opens a file named pygm.fasta, which must be in the current directory, as a SeqIO stream. The call to next_seq reads the first record as a sequence object, and the last line prints its sequence.
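Once you have a sequence object, you can do more than print it. The sketch below builds a sequence object in memory from a short made-up DNA string, used only for illustration, and calls a few standard Bio::Seq methods on it.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Bio::Seq;

# A short made-up DNA sequence, used only for illustration.
my $dna = Bio::Seq->new(
    -id       => 'example',
    -seq      => 'ATGGCCTAA',
    -alphabet => 'dna',
);

print "id:      ", $dna->display_id, "\n";
print "length:  ", $dna->length, "\n";
print "first 3: ", $dna->subseq(1, 3), "\n";    # positions are 1-based
print "revcom:  ", $dna->revcom->seq, "\n";     # reverse complement
print "protein: ", $dna->translate->seq, "\n";  # translation of the DNA
```

Note that revcom and translate return new sequence objects, which is why seq is called on their result.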

If a file contains several FASTA sequences, SeqIO creates a sequence object for each of them as you call next_seq repeatedly. To print all the sequences, use a while loop, as in the last line of the following script.

#!/usr/bin/perl
use strict;
use warnings;
use Bio::Perl;
use Bio::SeqIO;

# create sequence objects
my $s = Bio::SeqIO->new( -file => $ARGV[0], -format => "fasta");

# print sequences
while (my $st = $s->next_seq) { print $st->seq; print "\n"; }

usage:

$ perl create_sequence_objects.pl pygm.fasta
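The same loop can convert a file from one format to another: read each record with next_seq and write it to a second SeqIO stream with write_seq. This is a minimal sketch. The helper name convert_file, the script name convert.pl and the choice of GenBank as output format are assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Bio::SeqIO;

# convert_file is a hypothetical helper: it copies every record from
# one SeqIO stream to another and returns the number of records written.
sub convert_file {
    my ($infile, $informat, $outfile, $outformat) = @_;
    my $in  = Bio::SeqIO->new(-file => $infile,     -format => $informat);
    my $out = Bio::SeqIO->new(-file => ">$outfile", -format => $outformat);
    my $count = 0;
    while (my $seq = $in->next_seq) {
        $out->write_seq($seq);
        $count++;
    }
    return $count;
}

# Example: perl convert.pl pygm.fasta pygm.gb
if (@ARGV == 2) {
    my $n = convert_file($ARGV[0], 'fasta', $ARGV[1], 'genbank');
    print "converted $n record(s)\n";
}
```

As in the earlier scripts, the leading > in the output file name tells SeqIO to open the file for writing.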

Retrieving sequences from GenBank

The following code retrieves a GenBank sequence by accession number and creates a sequence object from it.

#!/usr/bin/perl
use strict;
use warnings;
use Bio::DB::GenBank;
use Data::Dumper;

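# create a GenBank database handle and fetch the sequence
# ($a and $b are avoided as variable names because Perl's sort uses them)
my $gb  = Bio::DB::GenBank->new;
my $seq = $gb->get_Seq_by_acc($ARGV[0]);

# dump the internal structure of the sequence object
print Dumper($seq);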

usage:

$ perl retrieve_genbank_sequence.pl EW695397

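Dumper, from the Data::Dumper module, turns the internal data structure of an object into a readable string, which print then writes to standard output.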
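A Dumper listing is useful for exploring an object but is long and hard to read. In everyday scripts you would normally call the accessor methods of the sequence object instead. The sketch below prints a one-line summary and the type of each annotated feature. The helper name seq_summary is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Bio::DB::GenBank;

# seq_summary is a hypothetical helper that builds a one-line summary
# from the standard accessors of a sequence object.
sub seq_summary {
    my ($seq) = @_;
    my $desc = defined $seq->desc ? $seq->desc : '';
    return join("\t", $seq->display_id, $seq->length, $desc);
}

# Runs only when an accession is given; needs an Internet connection.
if (@ARGV == 1) {
    my $gb  = Bio::DB::GenBank->new;
    my $seq = $gb->get_Seq_by_acc($ARGV[0]);
    print seq_summary($seq), "\n";

    # List the type of each annotated feature (source, gene, CDS, ...).
    print $_->primary_tag, "\n" for $seq->get_SeqFeatures;
}
```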

Installing ClustalW on Linux

ClustalW can be downloaded from ftp://ftp.ebi.ac.uk/pub/software/clustalw2/. Choose the source (src) archive, for example clustalw-2.0.10-src.tar.gz.

  1. wget ftp://ftp.ebi.ac.uk/pub/software/clustalw2/2.0.10/clustalw-2.0.10-src.tar.gz
  2. tar xzvf clustalw-2.0.10-src.tar.gz
  3. cd clustalw-2.0.10
  4. ./configure
  5. make
  6. su
  7. make install
  8. clustalw2

The last step runs clustalw2 to check that ClustalW is properly installed and working.

Perl CGI not running

If your Perl CGI script does not run, check the ScriptAlias setting in your httpd.conf. It defines which directories Apache treats as containing CGI scripts.
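To tell a server configuration problem from a bug in your own script, it can help to try the smallest possible CGI script first. The sketch below assumes you save it in the directory named by ScriptAlias and make it executable. The subroutine name cgi_response is only illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build a minimal CGI response: one header line, a blank line, then the body.
sub cgi_response {
    return "Content-Type: text/plain\n\n" . "CGI is working\n";
}

print cgi_response();
```

If this script runs from the command line but not through the web server, the server configuration is the more likely cause.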

ClustalW with BioPerl

The BioPerl documentation on ClustalW is great, but I ran into some problems as a beginner. The following code is taken from the BioPerl ClustalW documentation and modified to make life easier for beginners.

  1. Make sure bioperl-run is installed in addition to BioPerl.
  2. Make sure clustalw is installed and executable.
  3. Set the path using the following command (assuming that clustalw is installed at /usr/local/bin/clustalw2): export CLUSTALDIR=/usr/local/bin/clustalw2
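The second check can be scripted. The sketch below looks for an executable on your PATH using only core Perl modules. The helper name find_executable is hypothetical, and clustalw2 is the program name produced by the installation steps above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Spec;

# find_executable is a hypothetical helper: it returns the full path of
# the first executable file called $name on PATH, or undef if none exists.
sub find_executable {
    my ($name) = @_;
    for my $dir (File::Spec->path) {
        my $candidate = File::Spec->catfile($dir, $name);
        return $candidate if -f $candidate && -x _;
    }
    return;
}

my $path = find_executable('clustalw2');
print defined $path ? "found: $path\n" : "clustalw2 is not on PATH\n";
```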

code

#!/usr/bin/perl
use Bio::AlignIO;
use Bio::Root::IO;
use Bio::Seq;
use Bio::SeqIO;
use Bio::SimpleAlign;
use Bio::TreeIO;

BEGIN { $ENV{CLUSTALDIR} = '/usr/local/bin/clustalw2/' }
use Bio::Tools::Run::Alignment::Clustalw;

# Build a clustalw alignment factory
@params = ('ktuple' => 2, 'matrix' => 'BLOSUM');
$factory = Bio::Tools::Run::Alignment::Clustalw->new(@params);

# Pass the factory a list of sequences to be aligned.
$inputfilename = 'blastdump/input.fasta';
$aln = $factory->align($inputfilename); # $aln is a SimpleAlign object.
# or
$seq_array_ref = \@seq_array;
# where @seq_array is an array of Bio::Seq objects
$aln = $factory->align($seq_array_ref);

# Or one can pass the factory a pair of (sub)alignments
# to be aligned against each other, e.g.:
$aln = $factory->profile_align($aln1,$aln2);
# where $aln1 and $aln2 are Bio::SimpleAlign objects.

# Or one can pass the factory an alignment and one or more unaligned
# sequences to be added to the alignment. For example:
$aln = $factory->profile_align($aln1,$seq); # $seq is a Bio::Seq object.

# Get a tree of the sequences
$tree = $factory->tree(\@seq_array);

# Get both an alignment and a tree
($aln, $tree) = $factory->run(\@seq_array);

# Do a footprinting analysis on the supplied sequences, getting back the
# most conserved sub-alignments
my @results = $factory->footprint(\@seq_array);
foreach my $result (@results) {
  print $result->consensus_string, "\n";
}

# There are various additional options and input formats available.
# See the DESCRIPTION section of the Bio::Tools::Run::Alignment::Clustalw
# documentation for additional details.
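The listing above shows alternatives rather than one complete script: @seq_array, $aln1, $aln2 and $seq have to be defined before those lines can run. Below is a minimal end-to-end sketch of the most common case, aligning the sequences of one FASTA file and saving the alignment. It assumes that clustalw can be found, either on your PATH or through CLUSTALDIR as set in step 3. The helper name load_sequences and the script name align.pl are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Bio::SeqIO;
use Bio::AlignIO;
use Bio::Tools::Run::Alignment::Clustalw;

# load_sequences is a hypothetical helper that reads every record of a
# FASTA file into an array of sequence objects.
sub load_sequences {
    my ($file) = @_;
    my $in = Bio::SeqIO->new(-file => $file, -format => 'fasta');
    my @seqs;
    while (my $seq = $in->next_seq) {
        push @seqs, $seq;
    }
    return @seqs;
}

# Example: perl align.pl input.fasta output.aln
if (@ARGV == 2) {
    my @seq_array = load_sequences($ARGV[0]);
    die "need at least two sequences to align\n" if @seq_array < 2;

    # Same parameters as in the listing above.
    my @params  = ('ktuple' => 2, 'matrix' => 'BLOSUM');
    my $factory = Bio::Tools::Run::Alignment::Clustalw->new(@params);
    my $aln     = $factory->align(\@seq_array);   # a Bio::SimpleAlign object

    # Save the alignment in ClustalW format.
    my $out = Bio::AlignIO->new(-file => ">$ARGV[1]", -format => 'clustalw');
    $out->write_aln($aln);
    printf "aligned %d sequences, alignment length %d\n",
        scalar(@seq_array), $aln->length;
}
```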