2013年4月16日星期二

Stothard Research Group

http://www.ualberta.ca/~stothard/software.html


align_learn.pl - this Perl script converts a multiple sequence alignment into a format that can be readily analyzed using common machine learning algorithms. Specifically, the program accepts a sequence alignment in FASTA format and converts it into an ARFF (attribute-relation file format) file containing data attributes and data instances. The ARFF format can then be read by the Weka machine learning software, which provides implementations of many machine learning algorithms.
Developed by: Paul Stothard.
Availability: align_learn.zipalign_learn.tar.gz.
annotate_SNPs.pl - this Perl script annotates SNPs identified by the next-generation sequencing of genomic DNA or transcripts. It is designed to accept SNPs from the AB diBayes SNP package, Maq, or any other program that can provide a reference sequence identifier, a SNP position, the base found at that position in the reference sequence, and the bases found at that position among the sequencing reads. The script examines each SNP record provided as input and uses Ensembl, NCBI, and EBI to provide a detailed description of its expected functional significance, regardless of whether or not the SNP has been previously described. Included in this description is a functional class ('NON_SYNONYMOUS_CODING' for example), and information about the affected transcript, if applicable, including transcript ID, position of the SNP, the alleles as they would appear in the transcript sequence, and Gene Ontology information. For protein-altering SNPs, each of the resulting protein sequences is compared to orthologous proteins, to determine which of the alleles most drastically changes the protein. Using a locally installed version of Ensembl this script can annotate 4,000,000 SNPs in about two days on a standard desktop system. Thus this script is suitable for annotating the SNPs arising from genome resequencing projects.
Developed by: Paul Stothard.
Availability: Included in the NGS-SNP package.
backup.sh - this shell script archives directories of interest on a Linux-based system. When it is first run, and on the first of each month, this script generates a full backup of the files and directories listed in the include.conf file. Files and directories listed in the exclude.conf file are not included in the archive. These full backups are not overwritten by future backups. Each Sunday the script performs a full backup that is overwritten the following Sunday. Every day the script performs an incremental backup, storing the files that have changed since the last full backup. These incremental backup files are named after the day of the week they are performed, and are overwritten each week. The script sends an email on the first of each month, and whenever any backup fails. The script splits each full backup into a series of smaller files, suitable for burning to CD or DVD. When a full backup is generated, the MD5 hash value of the complete backup file is written to a README file in the same directory as the backup files. Included in the README are directions for assembling the split backup files into the original file.
Developed by: Paul Stothard.
Availability: backup_script.zipbackup_script.tar.gz.
blast_hit_features.pl - this Perl script accepts BLAST results obtained from local_blast_client.pl or remote_blast_client.pl. The results must have been obtained using blastn, tblastn, or tblastx searches (i.e. nucleotide databases), since this script uses GenBank files to obtain feature information for sequence hits. For each entry in the BLAST results, the GI number of the hit, if available, is used to obtain the corresponding sequence record from NCBI, in GenBank format. The features in the GenBank file are compared to the coordinates of the HSP, and features overlapping with the HSP are added to the existing BLAST results. The modified results are written to a new file. The three nearest features preceding the HSP (located to the left of the HSP) and the three nearest features located after the HSP (located to the right of the HSP) are also added to the output.
Developed by: Paul Stothard.
Availability: blast_hit_features.zipblast_hit_features.tar.gz.
blast_hit_flanking_sequence.pl - this Perl script accepts blastn search results obtained from local_blast_client.pl or remote_blast_client.pl. In addition to the BLAST results file, the script requires the query sequences and database sequences in FASTA format. For each BLAST result, the script constructs a modified query sequence, in which the query is extended using sequence extracted from the hit sequence. The amount of hit sequence added to the ends of the query can be specified using the -u and -d options.
Developed by: Paul Stothard.
Availability: blast_hit_flanking_sequence.zipblast_hit_flanking_sequence.tar.gz.
blast_hits_in_ucsc_genome_browser.pl - this Perl script accepts BLAST results obtained from local_blast_client.pl or remote_blast_client.pl. The results must have been obtained using blastn, tblastn, or tblastx searches (i.e. nucleotide databases) and database sequences downloaded from the UCSC Genome Browser site (http://hgdownload.cse.ucsc.edu/downloads.html). The BLAST results are converted to annotation files for the UCSC Genome Browser, and a separate HTML file containing links to each feature in the annotation files is created. Clicking on a link in the HTML file loads the genome region involving the BLAST HSP into the UCSC Genome Browser and passes the annotations in the relevant annotation file to the browser for inclusion in the view.
Developed by: Paul Stothard.
Availability: blast_hits_in_ucsc_genome_browser.zipblast_hits_in_ucsc_genome_browser.tar.gz.
build_cluster_script.pl - this Perl script creates an executable shell script with the specified command repeated n number of times. Every occurrence of the '$' symbol in the command is replaced by a number, from 1 to n. Alternatively, the "-l" option causes letters to be used in place of numbers (eg 'aa' instead of '1', 'ab' instead of '2'). This script can be used to generate scripts for batch processing on a computer cluster.
Developed by: Jason Grant and Paul Stothard.
Availability: build_cluster_script.zipbuild_cluster_script.tar.gz.
cDNA_library_entropy.pl - this Perl script accepts a directory containing one or more sequence files in multi-FASTA format. Typically each file will contain the sequences obtained from a single EST library or tissue type. By default the script looks for a UniGene annotation identifier in each sequence title, for example 'Bt.22094'. A different ID type can be specified using the -m option. The number of sequences present for each ID is determined. The script uses these counts to calculate the information entropy of the library in bits. This value increases as the number of distinct sequences in a library increases, and decreases as the number of replicates of a particular sequence increases. The -d option can be used to obtain the information entropy of combinations of libraries. For example, specifying '-d 2' causes all possible combinations of two libraries to be evaluated. This script is intended to aid in the selection of tissues for SNP discovery by mRNA sequencing.
Developed by: Paul Stothard.
Availability: Included in the NGS-SNP package.
CGView - a Java package for generating high quality, zoomable maps of circular genomes. Its primary purpose is to serve as a component of sequence annotation pipelines. Feature information and rendering options are supplied to the program using an XML file, a tab delimited file, or an NCBI ptt file. CGView converts the input into a graphical map (PNG, JPG, or Scalable Vector Graphics format), complete with labels, a title, and legends. In addition to the default full view map, the program can generate a series of hyperlinked maps showing expanded views. The linked maps can be explored using any web browser, allowing rapid genome browsing, and facilitating data sharing. The feature labels in maps can be hyperlinked to external resources, allowing CGView maps to be integrated with existing web site content or databases.
Developed by: Paul Stothard.
Availability: http://bioinformatics.org/cgview/
CGView Comparison Tool (CCT) - a package for visually comparing bacterial, plasmid, chloroplast, or mitochondrial sequences of interest to existing genomes or sequence collections. The comparisons are conducted using BLAST, and the BLAST results are presented in the form of graphical maps that can also show sequence features, gene and protein names, COG category assignments, and sequence composition characteristics. CCT can generate maps in a variety of sizes, including 400 Megapixel maps suitable for posters. Comparisons can be conducted within a particular species or genus, or all available genomes can be used. The entire map creation process, from downloading sequences to redrawing zoomed maps, can be completed easily using scripts included with the CCT. User-defined features or analysis results can be included on maps, and maps can be extensively customized. To simplify program setup, a CCT virtual machine that includes all dependencies preinstalled is available. Detailed tutorials illustrating the use of CCT are included with the CCT documentation.
Developed by: Paul Stothard and Jason Grant.
Availability: http://stothard.afns.ualberta.ca/downloads/CCT/
cgview_xml_builder.pl - this Perl script accepts a variety of input files pertaining to circular genomes and generates an XML file for the CGView genome drawing program. This script can create the XML to display a variety of sequence composition plots, gene expression data, COG information, BLAST results, and more. See the included README file for additional information.
Developed by: Paul Stothard.
Availability: cgview_xml_builder.zipcgview_xml_builder.tar.gz.
combine_output_files.pl - this Perl script combines files that are part of a file series (created by split_records.pl for example). Several options are avialable for controlling how comments and header lines are handled. This script can be used to combine results files generated on a computer cluster.
Developed by: Paul Stothard.
Availability: combine_output_files.zipcombine_output_files.tar.gz.
genome_pattern_search.pl - a Perl program that reads a genomic sequence in FASTA format and searches for the patterns you specify using regular expressions. A summary is generated for each sequence match, including: the sequence fragment that matched the pattern; the position of the first base; the position of the last base; the strand on which the match was found; the name of the gene containing the match or "not in gene"; the name of the nearest downstream gene; a description of the gene; the distance to the nearest downstream gene; the total times this exact sequence was found; the percentage of the instances of this exact sequence that were found inside of genes; and the average number of base pairs to the downstream gene for this exact sequence.
Developed by: Paul Stothard.
Availability: genome_pattern_search.zipgenome_pattern_search.tar.gz.
get_cds.pl - this Perl script accepts a GenBank or EMBL file and extracts the protein translations or the DNA coding sequences and writes them to a new file in FASTA format. Information indicating the reading frame and position of the coding sequence relative to the source sequence is added to the titles.
Developed by: Paul Stothard.
Availability: get_cds.zipget_cds.tar.gz.
get_genes_in_area.pl - this Perl script accepts as input a position or list of positions in a genome and returns descriptions of nearby genes. The descriptions include position and function information, along with identifiers that can be used to access related records in other databases.
Developed by: Paul Stothard.
Availability: Included in the NGS-SNP package.
get_orfs.pl - this Perl script accepts a sequence file as input and extracts the open reading frames (ORFs) greater than or equal to the size you specify. The resulting ORFs can be returned as DNA sequences or as protein sequences. The titles of the sequences include start, stop, strand, and reading frame information. The sequence numbering includes the stop codon (when encountered) but the translations do not include a stop codon character.
Developed by: Paul Stothard.
Availability: get_orfs.zipget_orfs.tar.gz.
get_snps_by_gene_ontology.pl - this Perl script accepts a species name and a Gene Ontology (GO) accession number, and returns a list of SNPs located in or nearby genes associated with the GO accession. Several fields of information are provided for each SNP, including ID, location, flanking sequence, and alleles. Gene and transcript identifiers and descriptions of gene function are also provided.
Developed by: Paul Stothard.
Availability: get_snps_by_gene_ontology.zipget_snps_by_gene_ontology.tar.gz.
local_blast_client.pl - this Perl script accepts a FASTA file containing multiple sequences as input. It then submits each sequence to a locally installed version of the blastall program. For each of the hits obtained, the script retrieves a descriptive title by performing a separate Entrez search of NCBI's databases. Each BLAST hit and its descriptive title are written to a single tab-delimited output file.
Developed by: Paul Stothard.
Availability: local_blast_client.ziplocal_blast_client.tar.gz.
maq_pipeline.sh - this bash script processes short sequence reads from Illumina's Genome Analyzer (Solexa) system, using the Maq package. The script automates the entire analysis process, and parallelizes the most intensive analysis step if run on a computing cluster with Sun Grid Engine.
Developed by: Paul Stothard.
Availability: maq_pipeline.zipmaq_pipeline.tar.gz.
md5_sums.pl - this Perl script accepts a list of directories and recursively generates a list of the files in the directories and their MD5 values. An optional list of directories and files to exclude from the calculation can also be supplied. The MD5 calculation can be skipped for large files, using the optional size parameter.
Developed by: Jason Grant.
Availability: md5_sums.zipmd5_sums.tar.gz.
ncbi_monitor.pl - this Perl script performs NCBI Entrez searches to identify publications related to genomic regions of interest in a species of interest. More specifically, this script accepts an organism name, chromosome name, and base position as input. It then retrieves the IDs for all Entrez Gene records located within a certain distance of the base position (the distance can be adjusted using the -f option). For each Gene record the script obtains IDs of PubMed records identified by NCBI as being related to the Gene record. If the script has previously written output to the specified output directory (i.e. the directory supplied using the -o option), it examines the previously obtained PubMed IDs to see which IDs are new. An email message describing the newly obtained records is then sent to the email address supplied using the -e option. The PubMed results are also written to a file in the output directory. If the -h option is specified, NCBI's HomoloGene database is also queried for each Gene record, in an attempt to obtain additional PubMed records, linked to the HomoloGene hits. These PubMed records may describe results obtained in other species, but could be relevant nonetheless.
Developed by: Paul Stothard.
Availability: Included in the NGS-SNP package.
ncbi_search.pl - this Perl script uses NCBI's Entrez Programming Utilities to perform searches of NCBI databases. The script can return complete database records, or sequence IDs.
Developed by: Paul Stothard.
Availability: ncbi_search.zipncbi_search.tar.gz.
NGS-SNP - this collection of scripts annotates raw SNP lists returned from programs such as Maq. SNPs are classified as synonymous, nonsynonymous, 3' UTR, etc. regardless of whether or not they match existing SNP records. Included among the annotations, several of which are not available from any existing SNP annotation tools, are the results of detailed comparisons with orthologous sequences. These comparisons allow, for example, SNPs to be sorted or filtered based on how drastically the SNP changes the score of a protein alignment. Other fields indicate whether or not the SNP-altered residue exhibits co-evolution with other residues in the protein, the names of overlapping protein domains or features, and the conservation of both the SNP site and flanking regions. NCBI, Ensembl, and Uniprot IDs are provided for genes, transcripts, and proteins when applicable, along with Gene Ontology terms, a gene description, phenotypes linked to the gene, and an indication of whether the SNP is novel or known. A "Model_Annotations" field provides several annotations obtained by transferring in silico the SNP to an orthologous gene, typically in a well-characterized species.
Developed by: Paul Stothard, Jason Grant, and Xiaoping Liao.
Availability: NGS-SNP.
obtain_reference_transcripts.pl - this Perl script builds a FASTA file consisting of the canonical transcripts for all the genes in Ensembl for a given organism. The canonical transcript is defined as either the longest CDS (if the gene encodes a protein), or as the longest cDNA. Ensembl gene entries can be associated with many transcripts--this script aims to get the "best" single transcript for each gene. The resulting FASTA file is suitable for sequence searches and for mapping sequence reads derived from cDNAs. The -a option can be used to specify that all transcripts should be downloaded, not just the canonical ones.
Developed by: Paul Stothard.
Availability: Included in the NGS-SNP package.
obtain_reference_chromosomes.pl - this Perl script builds a FASTA file consisting of the chromosome sequences in Ensembl for a given organism. The resulting FASTA file is suitable for sequence searches and for mapping sequence reads.
Developed by: Paul Stothard.
Availability: Included in the NGS-SNP package.
random_sequence_reads.pl - this Perl script generates simulated sequence reads from a file of sequences in FASTA format. The starting position of each read is chosen at random. The length of the reads is specified using the -L option. Reads truncated because the end of a sequence is encountered are discarded if they are shorter than the length specified using the -m option. Sampling is continued until the desired number of reads is obtained.
Developed by: Paul Stothard.
Availability: random_sequence_reads.ziprandom_sequence_reads.tar.gz.
random_sequence_sample.pl - this Perl script selects a random sample of sequences from a FASTA file containing multiple sequences. The sample is written to a new text file. Sampling can be performed with or without replacement.
Developed by: Paul Stothard.
Availability: random_sequence_sample.ziprandom_sequence_sample.tar.gz.
remote_blast_client.pl - this Perl script accepts a FASTA file containing multiple sequences as input. It submits each sequence to NCBI's BLAST servers, to identify related sequences in a database of interest. An optional 'limit by entrez query' value can be supplied to restrict the search. For each BLAST hit a descriptive title is obtained using a separate Entrez search. Each BLAST hit and its descriptive title are written to a single tab-delimited output file.
Developed by: Paul Stothard.
Availability: remote_blast_client.zipremote_blast_client.tar.gz.
remote_in_silico_pcr.rb - this Ruby script accepts as input a list of primer sequences and uses the remote "UCSC In-Silico PCR" site to perform in silico PCR on the specified genome. By default only the top hit is returned for each primer pair--all the hits can be returned by using the '-m' option.
Developed by: Jason Grant.
Availability: remote_in_silico_pcr.zipremote_in_silico_pcr.tar.gz.
sequence_to_multi_fasta.pl - this Perl script reads a file consisting of a single DNA sequence (in raw, FASTA, GenBank, or EMBL format) and then divides the sequence into smaller sequences of the size you specify. The new sequences are written to a single output file with a modified title giving the position of the subsequence in relation to the original sequence. The new sequences are written in FASTA format.
Developed by: Paul Stothard.
Availability: sequence_to_multi_fasta.zipsequence_to_multi_fasta.tar.gz.
space_check.sh - this shell script monitors hard drive space and sends an email when space becomes scarce. On the first day of each month the script sends an email report of hard drive space.
Developed by: Paul Stothard.
Availability: space_check.zipspace_check.tar.gz.
split_records.pl - this Perl script splits an input file into multiple output files, to allow analysis jobs to be divided among nodes in a computer cluster Several options are included for handling header lines, for specifying the record separator, and for controlling how files are named.
Developed by: Jason Grant and Paul Stothard.
Availability: split_records.zipsplit_records.tar.gz.

没有评论:

发表评论