Showing posts labeled "labs".

Friday, November 8, 2013

Lior Pachter's lab

http://math.berkeley.edu/~lpachter/software.html

Software developed in the Pachter group and still under active development in the group
  • eXpress (2012) Streaming quantification for high-throughput sequencing.
  • SysCall (2011) Distinguishing heterozygous sites from systematic error in high-throughput sequenced reads.
  • Cufflinks (2010) Transcript assembly and abundance estimation for RNA-Seq (now a joint effort together with Cole Trapnell and the John Rinn Lab at Harvard University)
  • MetMap (2010) Analysis of Methyl-Seq experiments
Software developed in the Pachter group but now maintained/developed elsewhere
  • ReadSpy (2012) Assessment of uniformity in RNA-Seq reads (now supported by Valerie Hower and her group at the University of Miami)
  • TopHat (2009) Splice junction mapper for short RNA-seq reads (now supported by Steven Salzberg and his group at Johns Hopkins University)
  • FSA (2009) Fast Statistical Alignment (now supported by Robert Bradley and his group at FHCRC)
  • MERCATOR (2004) Homology mapping (now supported by Colin Dewey and his group at the University of Wisconsin)
  • VISTA (2000) Visualization tool for global alignments (now supported by Inna Dubchak and her group at the JGI)
Retired Software
These programs, originally developed in the Pachter group, are no longer under active development and are not being supported.
  • AMAP (2007) Protein multiple alignment (recommended instead: FSA)
  • GENEMAPPER (2006) Reference based gene annotation (recommended instead: an RNA-Seq experiment)
  • MJOIN (2006) Neighbor joining with subtree weights (archived here)
  • PARALIGN (2006) Alignment polytope construction (archived here)
  • SLIM (2003) Minimum network design for optimizing the search space for pair hidden Markov models (archived here)
  • SLAM (2003) Pairwise simultaneous alignment and gene finding (recommended instead: an RNA-Seq experiment)
  • MAVID (2003) Multiple alignment of large genomic sequences (recommended instead: FSA)
################################################################################
Submitted
L. Pachter, Models for transcript quantification from RNA-Seq, submitted.
In press
A. Roberts, L. Schaeffer and L. Pachter, Updating RNA-Seq analyses after re-annotation, in press.
M. Singer and L. Pachter, Bayesian networks in the study of genomewide DNA methylation, in press.
2013
A. Rahman and L. Pachter, CGAL: computing genome assembly likelihoods, Genome Biology, 14 (2013), R8.
2012
C. Trapnell, D.G. Hendrickson, M. Sauvageau, L. Goff, J.L. Rinn and L. Pachter, Differential analysis of gene regulation at transcript resolution with RNA-seq, Nature Biotechnology, advance online publication (2012).
S.A. Mortimer, C. Trapnell, S. Aviran, L. Pachter and J.B. Lucks, SHAPE-Seq: High throughput RNA structure analysis, Current Protocols in Chemical Biology, advance online publication.
A. Kleinman, M. Harel and L. Pachter, Affine and projective tree metric theorems, Annals of Combinatorics, advance online publication (2012).
A. Roberts and L. Pachter, Streaming fragment assignment for real-time analysis of sequencing experiments, Nature Methods, advance online publication (2012).
V. Hower, R. Starfield, A. Roberts, and L. Pachter, Quantifying uniformity in mapped reads, Bioinformatics, 28 (2012), 2680--2682.
L. Pachter, A closer look at RNA editing, Nature Biotechnology, 30 (2012), 246--247.
C. Trapnell, A. Roberts, L. Goff, G. Pertea, D. Kim, D.R. Kelley, H. Pimentel, S.L. Salzberg, J.L. Rinn and L. Pachter, Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks, Nature Protocols, 7 (2012), 562--578.

Sunday, August 4, 2013

Rannala Research Group

http://www.rannala.org/

The Rannala research group is located in the UC Davis Genome Center and Department of Evolution and Ecology. Research in the group focuses on developing theory and computational methods for interpreting patterns of molecular evolution and population genomic variation.

Yun S. Song lab

http://www.eecs.berkeley.edu/~yss/

Demographic Inference
  • diCal Version 1 [ Link ]
    Software accompaniment to "Sheehan, S.*, Harris, K.*, Song, Y.S. Estimating variable effective population sizes from multiple genomes: A sequentially Markov conditional sampling distribution approach. Genetics, 194 (2013) 647-662." "Chan, A.H., Jenkins, P.A., and Song, Y.S. 

    diCal Version 1 is a scalable demographic inference method based on the sequentially Markov conditional sampling distribution framework. At present, diCal can infer a piecewise-constant population size history from the genomes of multiple individuals sampled from a single population. We are currently working on extending the method to handle more complex demography, incorporating multiple populations, population splits, migration, admixture, etc. 

Estimating Recombination Rates
  • LDhelmet [ Link ]
    Software accompaniment to "Chan, A.H., Jenkins, P.A., and Song, Y.S. Genome-wide fine-scale recombination rate variation in Drosophila melanogaster. PLoS Genetics, vol. 8 no. 12 (2012) e1003090." 

    LDhelmet is a statistical method based on reversible jump MCMC and composite likelihood. It samples piecewise constant recombination maps from a posterior distribution. 
  • Overpaint [ Link ]
    Software accompaniment to "Yin, J., Jordan, M. I., and Song, Y. S. Joint estimation of gene conversion rates and mean conversion tract lengths from population SNP data, Proceedings of ISMB 2009, Bioinformatics, 25 (2009) i231-i239." 

    Overpaint is a C++ package that jointly estimates crossover rates, gene conversion rates, and mean conversion tract lengths from population SNP data.

Short-Read Error Correction
  • ECHO [ Link ]
    Software accompaniment to
    "Kao, W.-C., Chan, A. H., and Song, Y. S. ECHO: A reference-free short-read error correction algorithm, Genome Research, 21 (2011) 1181-1192."

De novo Assembly
  • Telescoper [ Link ]
    Bresler, M., Sheehan, S., Chan, A.H., and Song, Y.S. Telescoper: De novo Assembly of Highly Repetitive Regions. ECCB'12 Special Issue, Bioinformatics, 28 (2012) i311-i317. 

    Telescoper is a local assembly algorithm designed for short reads from NGS platforms such as Illumina. The reads must come from two libraries: one short insert and one long insert. The algorithm begins with a user-given seed string, assembles a graph of possible extensions, and prints one path of extensions as a FASTA file. The software is still a beta version. We have not yet tested it extensively, and envision many improvements down the line. 

Basecaller for the Illumina Platform
  • (naive)BayesCall [ Link ]
    Software accompaniment to
    "Kao, W.C., Stevens, K. and Song, Y.S. BayesCall: A model-based basecalling algorithm for high-throughput short-read sequencing. Genome Research, 19 (2009) 1884-1895."

    Kao, W.C. and Song, Y.S. naiveBayesCall: An efficient model-based base-calling algorithm for high-throughput sequencing. Proc. 14th Annual Intl. Conf. on Research in Computational Molecular Biology (RECOMB 2010), Lecture Notes in Computer Science 6044, pages 233--247, 2010.
    (A new base-calling algorithm that builds on our previous method BayesCall to achieve scalability.) 

Likelihoods under the Coalescent with Recombination
  • ASF [ Link ]
    Software accompaniment to "Jenkins, P.A. and Song, Y.S. Closed-form two-locus sampling distributions: accuracy and universality. Genetics, 183 (2009) 1087-1103."
  • COB [ Link ]
    Software accompaniment to "Lyngsø, R., Song, Y.S., and Hein, J. Accurate computation of likelihoods in the coalescent with recombination via parsimony. Proc. 12th Annual Intl. Conf. on Research in Computational Molecular Biology (RECOMB 2008), Lecture Notes in Computer Science 4955, pages 463--477." 

    COB is a parsimony-based method of computing likelihoods accurately under the coalescent with recombination. 

Multi-locus Match Probability
  • Wright_Fisher_MP and Moran_MP [ Link ]
    Software accompaniment to "Bhaskar, A. and Song, Y.S. Multi-locus match probability in a finite population: A fundamental difference between the Moran and Wright-Fisher models. Proceedings of ISMB 2009, Bioinformatics, 25 (2009) i187-i195." 

Whole-Genome Association Mapping
  • BLOSSOC [ Link ]
    Software accompaniment to "Ding, Z., Mailund, T., and Song, Y.S. Efficient whole-genome association mapping using local phylogenies for unphased genotype data. Bioinformatics, 24 (2008) 2215-2221." 

    This program combines a recently found linear-time algorithm for phasing genotypes on trees with a tree-based method for association mapping. From unphased genotype data, our algorithm builds local phylogenies along the genome, and scores each tree according to the clustering of cases and controls. 

Algorithms for Detecting Recombination
  • HapBound and SHRUB [ Link ]
    Software accompaniment to "Song, Y.S., Wu, Y. and Gusfield, D. Efficient computation of close lower and upper bounds on the minimum number of recombinations in biological sequence evolution, Proceedings of ISMB 2005. Bioinformatics, 21, Suppl.1, (2005) i413-i422."

    HapBound and SHRUB respectively compute lower and upper bounds on the minimum number of crossover recombinations. SHRUB constructs an ancestral recombination graph for the input data. 
  • HapBound-GC and SHRUB-GC [ Link ]
    Software accompaniment to "Song, Y.S., Ding, Z., Gusfield, D., Langley, C.H., and Wu, Y. Algorithms to Distinguish the Role of Gene-Conversion from Single-Crossover Recombination in the Derivation of SNP Sequences in Populations. Proceedings of RECOMB 2006. Lecture Notes in Computer Science 3909, (2006) 231-245."

    HapBound-GC and SHRUB-GC respectively compute lower and upper bounds on the minimum combined number of crossover and gene-conversion recombinations. SHRUB-GC constructs a graphical representation of evolutionary history involving coalescent, mutation, crossover and gene-conversion events. 
  • Beagle [ Link ] 
    Software accompaniment to "Lyngsø, R., Song, Y.S., and Hein, J. Minimum Recombination Histories by Branch and Bound. Proceedings of WABI 2005, Lecture Notes in Computer Science, 3692, pp. 239-250." 

    Beagle computes the minimum number of crossover recombinations. It also produces an ancestral recombination graph.

Thursday, August 1, 2013

Andreas Hamann's lab

http://www.ualberta.ca/~ahamann/index.html

Research Program Overview

My primary research field is ecological genetics. I ask: How are tree species and their populations adapted to the environments in which they occur? How are natural populations affected by observed and projected climate change? How should we manage our forest genetic resources under changing environments?

Tuesday, July 30, 2013

Barton group

We study diverse topics in evolutionary genetics, but focus on the evolution of populations that are distributed through space, and that experience natural selection on many genes. Understanding how species adapt, and how they split into new species, requires knowledge of the effects of spatial subdivision; conversely, spatial patterns can tell us about the strengths of evolutionary processes that are hard to measure directly. Interactions between large numbers of genes are important in species formation, in the response to natural and artificial selection, and in the net effects of selection on the whole genome. The recent development of techniques for assaying large numbers of genetic markers, and indeed complete sequences, makes analysis of the interactions amongst large numbers of genes essential.

http://ist.ac.at/research-groups-pages/barton-group/

Sunday, July 28, 2013

Identifying differential alternative splicing events from RNA sequencing data

http://www.mimg.ucla.edu/faculty/xing/index.html


  • Zhao K., Lu ZX., Park JW., Zhou Q., Xing Y. (2013) GLiMMPS: Robust statistical model for regulatory variation of alternative splicing using RNA-Seq data, Genome Biology, 14:R74. [journal] [GLiMMPS software]
  • Park JW., Tokheim C., Shen S., Xing Y. (2013) Identifying differential alternative splicing events from RNA sequencing data using RNASeq-MATS. Methods in Molecular Biology: Deep Sequencing Data Analysis, Invited Book Chapter, 1038:171-179. [book] [PubMed]
    Friday, April 26, 2013

    Yandell lab - on human population genetics

    http://www.yandell-lab.org/software/index.html


     VAAST

    VAAST (the Variant Annotation, Analysis & Search Tool) is a probabilistic search tool for identifying damaged genes and their disease-causing variants in personal genome sequences. VAAST builds upon existing amino acid substitution (AAS) and aggregative approaches to variant prioritization, combining elements of both into a single unified likelihood-framework that allows users to identify damaged genes and deleterious variants with greater accuracy, and in an easy-to-use fashion. VAAST can score both coding and non-coding variants, evaluating the cumulative impact of both types of variants simultaneously. VAAST can identify rare variants causing rare genetic diseases, and it can also use both rare and common variants to identify genes responsible for common diseases. VAAST thus has a much greater scope of use than any existing methodology.

     MAKER 2 (updated 07-22-2012)

    MAKER is a portable and easily configurable genome annotation pipeline. Its purpose is to allow smaller eukaryotic and prokaryotic genome projects to independently annotate their genomes and to create genome databases. MAKER identifies repeats, aligns ESTs and proteins to a genome, produces ab-initio gene predictions, and automatically synthesizes these data into gene annotations having evidence-based quality values. MAKER is also easily trainable: outputs of preliminary runs can be used to automatically retrain its gene prediction algorithm, producing higher quality gene models on subsequent runs. MAKER's inputs are minimal and its outputs can be directly loaded into a GMOD database. They can also be viewed in the Apollo genome browser; this feature of MAKER provides an easy means to annotate, view, and edit individual contigs and BACs without the overhead of a database. MAKER should prove especially useful for emerging model organism projects with minimal bioinformatics expertise and computer resources.

     RepeatRunner

    RepeatRunner is a CGL-based program that integrates RepeatMasker with BLASTX to provide a comprehensive means of identifying repetitive elements. Because RepeatMasker identifies repeats by means of similarity to a nucleotide library of known repeats, it often fails to identify highly divergent repeats and divergent portions of repeats, especially near repeat edges. To remedy this problem, RepeatRunner uses BLASTX to search a database of repeat-encoded proteins (reverse transcriptases, gag, env, etc.). Because protein homologies can be detected across larger phylogenetic distances than nucleotide similarities, this BLASTX search allows RepeatRunner to identify divergent protein-coding portions of retro-elements and retroviruses not detected by RepeatMasker. RepeatRunner merges its BLASTX and RepeatMasker results to produce a single, comprehensive XML-based output. It also masks the input sequence appropriately. In practice RepeatRunner has been shown to greatly improve the efficacy of repeat identification. RepeatRunner can also be used in conjunction with PILER-DF - a program designed to identify novel repeats - and RepeatMasker to produce a comprehensive system for repeat identification, characterization, and masking in newly sequenced genomes.

    ImagePlane

    ImagePlane is Python-based image analysis software designed for the automated analysis of images of the animal S. mediterranea. This software allows the animal's neoblasts to be quantified and tested for asymmetries along its vertical and horizontal axes. ImagePlane also allows simple morphology categorizations to be made based on the overall shape of the animal.

     CGL

    CGL is a software library designed to facilitate the use of genome annotations as substrates for computation and experimentation; we call it "CGL", an acronym for Comparative Genomics Library, and pronounce it "Seagull". The purpose of CGL is to provide an informatics infrastructure for a laboratory, department, or research institute engaged in the large-scale analysis of genomes and their annotations.

    Tuesday, April 16, 2013

    Stothard Research Group

    http://www.ualberta.ca/~stothard/software.html


    align_learn.pl - this Perl script converts a multiple sequence alignment into a format that can be readily analyzed using common machine learning algorithms. Specifically, the program accepts a sequence alignment in FASTA format and converts it into an ARFF (attribute-relation file format) file containing data attributes and data instances. The ARFF format can then be read by the Weka machine learning software, which provides implementations of many machine learning algorithms.
    Developed by: Paul Stothard.
    Availability: align_learn.zip, align_learn.tar.gz.
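
The FASTA-to-ARFF conversion described for align_learn.pl can be illustrated with a short Python sketch. This is not the script's actual code: the function names, the `posN` attribute naming, and the choice to emit one quoted nominal attribute per alignment column (with no class attribute) are all assumptions made for illustration.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (title, sequence) pairs."""
    records, title, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if title is not None:
                records.append((title, "".join(chunks)))
            title, chunks = line[1:], []
        else:
            chunks.append(line)
    if title is not None:
        records.append((title, "".join(chunks)))
    return records


def alignment_to_arff(records, relation="alignment"):
    """Emit one nominal attribute per column and one data row per sequence.

    Hypothetical layout: the real script may name attributes differently
    and may add a class attribute.
    """
    length = len(records[0][1])
    if any(len(seq) != length for _, seq in records):
        raise ValueError("all aligned sequences must have the same length")
    lines = ["@RELATION " + relation, ""]
    for col in range(length):
        # The symbols observed in this column become the nominal values.
        symbols = sorted({seq[col] for _, seq in records})
        values = ",".join("'%s'" % s for s in symbols)
        lines.append("@ATTRIBUTE pos%d {%s}" % (col + 1, values))
    lines += ["", "@DATA"]
    for _, seq in records:
        lines.append(",".join("'%s'" % c for c in seq))
    return "\n".join(lines) + "\n"
```

Calling `alignment_to_arff(read_fasta(open("aln.fa").read()))` on a hypothetical file `aln.fa` would produce text that can be saved with an `.arff` extension.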
    annotate_SNPs.pl - this Perl script annotates SNPs identified by the next-generation sequencing of genomic DNA or transcripts. It is designed to accept SNPs from the AB diBayes SNP package, Maq, or any other program that can provide a reference sequence identifier, a SNP position, the base found at that position in the reference sequence, and the bases found at that position among the sequencing reads. The script examines each SNP record provided as input and uses Ensembl, NCBI, and EBI to provide a detailed description of its expected functional significance, regardless of whether or not the SNP has been previously described. Included in this description is a functional class ('NON_SYNONYMOUS_CODING' for example), and information about the affected transcript, if applicable, including transcript ID, position of the SNP, the alleles as they would appear in the transcript sequence, and Gene Ontology information. For protein-altering SNPs, each of the resulting protein sequences is compared to orthologous proteins, to determine which of the alleles most drastically changes the protein. Using a locally installed version of Ensembl this script can annotate 4,000,000 SNPs in about two days on a standard desktop system. Thus this script is suitable for annotating the SNPs arising from genome resequencing projects.
    Developed by: Paul Stothard.
    Availability: Included in the NGS-SNP package.
    backup.sh - this shell script archives directories of interest on a Linux-based system. When it is first run, and on the first of each month, this script generates a full backup of the files and directories listed in the include.conf file. Files and directories listed in the exclude.conf file are not included in the archive. These full backups are not overwritten by future backups. Each Sunday the script performs a full backup that is overwritten the following Sunday. Every day the script performs an incremental backup, storing the files that have changed since the last full backup. These incremental backup files are named after the day of the week they are performed, and are overwritten each week. The script sends an email on the first of each month, and whenever any backup fails. The script splits each full backup into a series of smaller files, suitable for burning to CD or DVD. When a full backup is generated, the MD5 hash value of the complete backup file is written to a README file in the same directory as the backup files. Included in the README are directions for assembling the split backup files into the original file.
    Developed by: Paul Stothard.
    Availability: backup_script.zip, backup_script.tar.gz.
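
The rotation schedule described for backup.sh can be summarized as a small decision function. The Python sketch below is only an illustration of that schedule, not the shell script itself: the function name and archive labels are hypothetical, and it assumes a single backup per day, with the monthly full backup taking priority over the weekly one and the weekly one over the daily incremental.

```python
import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def backup_plan(day, first_run=False):
    """Return (kind, archive_label) for the backup taken on `day`.

    Assumption: one backup per day, chosen in the order
    monthly full > weekly full > daily incremental.
    """
    if first_run or day.day == 1:
        # Monthly full backups get a unique label, so they are never overwritten.
        return ("full", "monthly-%04d-%02d" % (day.year, day.month))
    if day.weekday() == 6:  # Monday is 0, so Sunday is 6
        # The weekly full backup reuses one label and is overwritten each Sunday.
        return ("full", "weekly")
    # Incrementals are named after the weekday and overwritten a week later.
    return ("incremental", DAY_NAMES[day.weekday()])
```

For example, `backup_plan(datetime.date(2013, 4, 16))` falls on a Tuesday that is not the first of the month, so it selects an incremental backup labeled with that weekday.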
    blast_hit_features.pl - this Perl script accepts BLAST results obtained from local_blast_client.pl or remote_blast_client.pl. The results must have been obtained using blastn, tblastn, or tblastx searches (i.e. nucleotide databases), since this script uses GenBank files to obtain feature information for sequence hits. For each entry in the BLAST results, the GI number of the hit, if available, is used to obtain the corresponding sequence record from NCBI, in GenBank format. The features in the GenBank file are compared to the coordinates of the HSP, and features overlapping with the HSP are added to the existing BLAST results. The modified results are written to a new file. The three nearest features preceding the HSP (located to the left of the HSP) and the three nearest features located after the HSP (located to the right of the HSP) are also added to the output.
    Developed by: Paul Stothard.
    Availability: blast_hit_features.zip, blast_hit_features.tar.gz.
    blast_hit_flanking_sequence.pl - this Perl script accepts blastn search results obtained from local_blast_client.pl or remote_blast_client.pl. In addition to the BLAST results file, the script requires the query sequences and database sequences in FASTA format. For each BLAST result, the script constructs a modified query sequence, in which the query is extended using sequence extracted from the hit sequence. The amount of hit sequence added to the ends of the query can be specified using the -u and -d options.
    Developed by: Paul Stothard.
    Availability: blast_hit_flanking_sequence.zip, blast_hit_flanking_sequence.tar.gz.
    blast_hits_in_ucsc_genome_browser.pl - this Perl script accepts BLAST results obtained from local_blast_client.pl or remote_blast_client.pl. The results must have been obtained using blastn, tblastn, or tblastx searches (i.e. nucleotide databases) and database sequences downloaded from the UCSC Genome Browser site (http://hgdownload.cse.ucsc.edu/downloads.html). The BLAST results are converted to annotation files for the UCSC Genome Browser, and a separate HTML file containing links to each feature in the annotation files is created. Clicking on a link in the HTML file loads the genome region involving the BLAST HSP into the UCSC Genome Browser and passes the annotations in the relevant annotation file to the browser for inclusion in the view.
    Developed by: Paul Stothard.
    Availability: blast_hits_in_ucsc_genome_browser.zip, blast_hits_in_ucsc_genome_browser.tar.gz.
    build_cluster_script.pl - this Perl script creates an executable shell script in which the specified command is repeated n times. Every occurrence of the '$' symbol in the command is replaced by a number, from 1 to n. Alternatively, the "-l" option causes letters to be used in place of numbers (e.g. 'aa' instead of '1', 'ab' instead of '2'). This script can be used to generate scripts for batch processing on a computer cluster.
    Developed by: Jason Grant and Paul Stothard.
    Availability: build_cluster_script.zip, build_cluster_script.tar.gz.
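
The '$' substitution performed by build_cluster_script.pl can be sketched in a few lines of Python. The function names are hypothetical, `my_tool` in the usage note is a made-up command, and the handling of letter labels beyond 'zz' is not described in the source, so this sketch simply rejects indices above 676.

```python
import string


def letter_label(index):
    """Map 1 -> 'aa', 2 -> 'ab', ..., 676 -> 'zz'.

    Assumption: two-letter labels only; larger indices are rejected.
    """
    if not 1 <= index <= 26 * 26:
        raise ValueError("two-letter labels cover 1..676 only")
    first, second = divmod(index - 1, 26)
    return string.ascii_lowercase[first] + string.ascii_lowercase[second]


def build_cluster_script(command, n, use_letters=False):
    """Return shell script text with `command` repeated n times.

    Every '$' in the command is replaced by the repeat's number
    (or by its letter label when use_letters is True).
    """
    lines = ["#!/bin/sh"]
    for i in range(1, n + 1):
        label = letter_label(i) if use_letters else str(i)
        lines.append(command.replace("$", label))
    return "\n".join(lines) + "\n"
```

For instance, `build_cluster_script("my_tool part_$.fa > out_$.txt", 3)` yields a three-command script operating on `part_1.fa` through `part_3.fa`.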
    cDNA_library_entropy.pl - this Perl script accepts a directory containing one or more sequence files in multi-FASTA format. Typically each file will contain the sequences obtained from a single EST library or tissue type. By default the script looks for a UniGene annotation identifier in each sequence title, for example 'Bt.22094'. A different ID type can be specified using the -m option. The number of sequences present for each ID is determined. The script uses these counts to calculate the information entropy of the library in bits. This value increases as the number of distinct sequences in a library increases, and decreases as the number of replicates of a particular sequence increases. The -d option can be used to obtain the information entropy of combinations of libraries. For example, specifying '-d 2' causes all possible combinations of two libraries to be evaluated. This script is intended to aid in the selection of tissues for SNP discovery by mRNA sequencing.
    Developed by: Paul Stothard.
    Availability: Included in the NGS-SNP package.
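
The entropy calculation described for cDNA_library_entropy.pl corresponds to the Shannon entropy of the per-ID sequence counts. The Python sketch below illustrates that formula under two assumptions: the annotation IDs have already been extracted from the sequence titles, and libraries are combined by pooling their IDs (the source does not say how the -d option combines them). All function names are hypothetical.

```python
import itertools
import math
from collections import Counter


def library_entropy(ids):
    """Shannon entropy, in bits, of the ID counts in one library."""
    counts = Counter(ids)
    total = sum(counts.values())
    # H = -sum(p * log2(p)) over the fraction p of sequences carrying each ID.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def combined_entropies(libraries, d):
    """Entropy of every combination of d libraries.

    Assumption: a combination is evaluated by pooling its libraries' IDs.
    """
    result = {}
    for names in itertools.combinations(sorted(libraries), d):
        pooled = [i for name in names for i in libraries[name]]
        result[names] = library_entropy(pooled)
    return result
```

Four sequences with four distinct IDs give 2 bits, while four replicates of one ID give 0 bits, matching the behavior described above: entropy rises with the number of distinct sequences and falls as replicates accumulate.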
    CGView - a Java package for generating high quality, zoomable maps of circular genomes. Its primary purpose is to serve as a component of sequence annotation pipelines. Feature information and rendering options are supplied to the program using an XML file, a tab delimited file, or an NCBI ptt file. CGView converts the input into a graphical map (PNG, JPG, or Scalable Vector Graphics format), complete with labels, a title, and legends. In addition to the default full view map, the program can generate a series of hyperlinked maps showing expanded views. The linked maps can be explored using any web browser, allowing rapid genome browsing, and facilitating data sharing. The feature labels in maps can be hyperlinked to external resources, allowing CGView maps to be integrated with existing web site content or databases.
    Developed by: Paul Stothard.
    Availability: http://bioinformatics.org/cgview/
    CGView Comparison Tool (CCT) - a package for visually comparing bacterial, plasmid, chloroplast, or mitochondrial sequences of interest to existing genomes or sequence collections. The comparisons are conducted using BLAST, and the BLAST results are presented in the form of graphical maps that can also show sequence features, gene and protein names, COG category assignments, and sequence composition characteristics. CCT can generate maps in a variety of sizes, including 400 Megapixel maps suitable for posters. Comparisons can be conducted within a particular species or genus, or all available genomes can be used. The entire map creation process, from downloading sequences to redrawing zoomed maps, can be completed easily using scripts included with the CCT. User-defined features or analysis results can be included on maps, and maps can be extensively customized. To simplify program setup, a CCT virtual machine that includes all dependencies preinstalled is available. Detailed tutorials illustrating the use of CCT are included with the CCT documentation.
    Developed by: Paul Stothard and Jason Grant.
    Availability: http://stothard.afns.ualberta.ca/downloads/CCT/
    cgview_xml_builder.pl - this Perl script accepts a variety of input files pertaining to circular genomes and generates an XML file for the CGView genome drawing program. This script can create the XML to display a variety of sequence composition plots, gene expression data, COG information, BLAST results, and more. See the included README file for additional information.
    Developed by: Paul Stothard.
    Availability: cgview_xml_builder.zip, cgview_xml_builder.tar.gz.
    combine_output_files.pl - this Perl script combines files that are part of a file series (created by split_records.pl for example). Several options are available for controlling how comments and header lines are handled. This script can be used to combine results files generated on a computer cluster.
    Developed by: Paul Stothard.
    Availability: combine_output_files.zip, combine_output_files.tar.gz.
    genome_pattern_search.pl - a Perl program that reads a genomic sequence in FASTA format and searches for the patterns you specify using regular expressions. A summary is generated for each sequence match, including: the sequence fragment that matched the pattern; the position of the first base; the position of the last base; the strand on which the match was found; the name of the gene containing the match or "not in gene"; the name of the nearest downstream gene; a description of the gene; the distance to the nearest downstream gene; the total times this exact sequence was found; the percentage of the instances of this exact sequence that were found inside of genes; and the average number of base pairs to the downstream gene for this exact sequence.
    Developed by: Paul Stothard.
    Availability: genome_pattern_search.zip, genome_pattern_search.tar.gz.
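
The core of a search like the one genome_pattern_search.pl performs, matching a regular expression on both strands and reporting forward-strand coordinates, might look like the following Python sketch. It covers only the first four summary fields (fragment, first base, last base, strand); the gene-related fields need an annotation source and are omitted. The function names are hypothetical, and matches found by `finditer` are non-overlapping.

```python
import re

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def search_both_strands(sequence, pattern):
    """Return (fragment, first_base, last_base, strand) for each match.

    Positions are 1-based and always refer to the forward strand.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    for m in regex.finditer(sequence):
        hits.append((m.group(), m.start() + 1, m.end(), "+"))
    n = len(sequence)
    for m in regex.finditer(reverse_complement(sequence)):
        # Convert reverse-strand offsets back to forward-strand positions.
        hits.append((m.group(), n - m.end() + 1, n - m.start(), "-"))
    return hits
```

Searching `"ATGCCC"` for `"GGG"` finds nothing on the forward strand but reports a reverse-strand hit covering forward positions 4 to 6.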
    get_cds.pl - this Perl script accepts a GenBank or EMBL file and extracts the protein translations or the DNA coding sequences and writes them to a new file in FASTA format. Information indicating the reading frame and position of the coding sequence relative to the source sequence is added to the titles.
    Developed by: Paul Stothard.
    Availability: get_cds.zip, get_cds.tar.gz.
    get_genes_in_area.pl - this Perl script accepts as input a position or list of positions in a genome and returns descriptions of nearby genes. The descriptions include position and function information, along with identifiers that can be used to access related records in other databases.
    Developed by: Paul Stothard.
    Availability: Included in the NGS-SNP package.
    get_orfs.pl - this Perl script accepts a sequence file as input and extracts the open reading frames (ORFs) greater than or equal to the size you specify. The resulting ORFs can be returned as DNA sequences or as protein sequences. The titles of the sequences include start, stop, strand, and reading frame information. The sequence numbering includes the stop codon (when encountered) but the translations do not include a stop codon character.
    Developed by: Paul Stothard.
    Availability: get_orfs.zip, get_orfs.tar.gz.
    get_snps_by_gene_ontology.pl - this Perl script accepts a species name and a Gene Ontology (GO) accession number, and returns a list of SNPs located in or near genes associated with the GO accession. Several fields of information are provided for each SNP, including ID, location, flanking sequence, and alleles. Gene and transcript identifiers and descriptions of gene function are also provided.
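
A minimal Python sketch of ORF extraction in the spirit of get_orfs.pl follows. Several choices here are assumptions rather than the script's documented behavior: ORFs start at ATG, only the three forward frames are scanned, the minimum size is counted in codons excluding the stop, and an ORF with no stop codon runs to the last complete codon. As in the description above, the reported end position includes the stop codon when one is encountered.

```python
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=1):
    """Return (start, end, frame, dna) for forward-strand ORFs.

    Positions are 1-based; `end` includes the stop codon when present.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOPS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start + 1, pos + 3, frame + 1, seq[start:pos + 3]))
                start = None
        if start is not None:
            # No stop codon encountered: keep the complete codons up to the end.
            end = start + ((len(seq) - start) // 3) * 3
            if (end - start) // 3 >= min_codons:
                orfs.append((start + 1, end, frame + 1, seq[start:end]))
    return orfs
```

Scanning the reverse complement in the same way would extend the sketch to both strands.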
    Developed by: Paul Stothard.
    Availability: get_snps_by_gene_ontology.zip, get_snps_by_gene_ontology.tar.gz.
    local_blast_client.pl - this Perl script accepts a FASTA file containing multiple sequences as input. It then submits each sequence to a locally installed version of the blastall program. For each of the hits obtained, the script retrieves a descriptive title by performing a separate Entrez search of NCBI's databases. Each BLAST hit and its descriptive title are written to a single tab-delimited output file.
    Developed by: Paul Stothard.
    Availability: local_blast_client.zip, local_blast_client.tar.gz.
    maq_pipeline.sh - this bash script processes short sequence reads from Illumina's Genome Analyzer (Solexa) system, using the Maq package. The script automates the entire analysis process, and parallelizes the most intensive analysis step if run on a computing cluster with Sun Grid Engine.
    Developed by: Paul Stothard.
    Availability: maq_pipeline.zip, maq_pipeline.tar.gz.
    md5_sums.pl - this Perl script accepts a list of directories and recursively generates a list of the files in the directories and their MD5 values. An optional list of directories and files to exclude from the calculation can also be supplied. The MD5 calculation can be skipped for large files, using the optional size parameter.
    Developed by: Jason Grant.
    Availability: md5_sums.zip, md5_sums.tar.gz.
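
The recursive MD5 listing that md5_sums.pl produces can be approximated with the Python standard library. This sketch is not the script's implementation: the function names and parameters are hypothetical, exclusions are matched by absolute path, and files above the size limit are recorded with `None` instead of a checksum.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Compute a file's MD5 hex digest, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5_listing(directories, exclude=(), max_bytes=None):
    """Map each file path under `directories` to its MD5 value.

    Paths listed in `exclude` are skipped; files larger than
    `max_bytes` are listed with None instead of a checksum.
    """
    excluded = {os.path.abspath(p) for p in exclude}
    results = {}
    for top in directories:
        for root, dirs, files in os.walk(top):
            # Prune excluded subdirectories so os.walk never descends into them.
            dirs[:] = sorted(
                d for d in dirs
                if os.path.abspath(os.path.join(root, d)) not in excluded
            )
            for name in sorted(files):
                path = os.path.join(root, name)
                if os.path.abspath(path) in excluded:
                    continue
                if max_bytes is not None and os.path.getsize(path) > max_bytes:
                    results[path] = None
                    continue
                results[path] = md5_of_file(path)
    return results
```

Reading in chunks keeps memory use flat for large files, which is also why a size cutoff is a cheap way to skip the slowest checksums.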
    ncbi_monitor.pl - this Perl script performs NCBI Entrez searches to identify publications related to genomic regions of interest in a species of interest. More specifically, this script accepts an organism name, chromosome name, and base position as input. It then retrieves the IDs for all Entrez Gene records located within a certain distance of the base position (the distance can be adjusted using the -f option). For each Gene record the script obtains IDs of PubMed records identified by NCBI as being related to the Gene record. If the script has previously written output to the specified output directory (i.e. the directory supplied using the -o option), it examines the previously obtained PubMed IDs to see which IDs are new. An email message describing the newly obtained records is then sent to the email address supplied using the -e option. The PubMed results are also written to a file in the output directory. If the -h option is specified, NCBI's HomoloGene database is also queried for each Gene record, in an attempt to obtain additional PubMed records, linked to the HomoloGene hits. These PubMed records may describe results obtained in other species, but could be relevant nonetheless.
    Developed by: Paul Stothard.
    Availability: Included in the NGS-SNP package.
    ncbi_search.pl - this Perl script uses NCBI's Entrez Programming Utilities to perform searches of NCBI databases. The script can return complete database records, or sequence IDs.
    Developed by: Paul Stothard.
    Availability: ncbi_search.zip, ncbi_search.tar.gz.
    NGS-SNP - this collection of scripts annotates raw SNP lists returned from programs such as Maq. SNPs are classified as synonymous, nonsynonymous, 3' UTR, etc. regardless of whether or not they match existing SNP records. Included among the annotations, several of which are not available from any existing SNP annotation tools, are the results of detailed comparisons with orthologous sequences. These comparisons allow, for example, SNPs to be sorted or filtered based on how drastically the SNP changes the score of a protein alignment. Other fields indicate whether or not the SNP-altered residue exhibits co-evolution with other residues in the protein, the names of overlapping protein domains or features, and the conservation of both the SNP site and flanking regions. NCBI, Ensembl, and Uniprot IDs are provided for genes, transcripts, and proteins when applicable, along with Gene Ontology terms, a gene description, phenotypes linked to the gene, and an indication of whether the SNP is novel or known. A "Model_Annotations" field provides several annotations obtained by transferring in silico the SNP to an orthologous gene, typically in a well-characterized species.
    Developed by: Paul Stothard, Jason Grant, and Xiaoping Liao.
    Availability: NGS-SNP.
    obtain_reference_transcripts.pl - this Perl script builds a FASTA file consisting of the canonical transcripts for all the genes in Ensembl for a given organism. The canonical transcript is defined as either the longest CDS (if the gene encodes a protein), or as the longest cDNA. Ensembl gene entries can be associated with many transcripts--this script aims to get the "best" single transcript for each gene. The resulting FASTA file is suitable for sequence searches and for mapping sequence reads derived from cDNAs. The -a option can be used to specify that all transcripts should be downloaded, not just the canonical ones.
    Developed by: Paul Stothard.
    Availability: Included in the NGS-SNP package.
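The canonical-transcript rule stated for obtain_reference_transcripts.pl (longest CDS for protein-coding genes, otherwise longest cDNA) can be sketched in Python as follows. The record layout (dicts with `id`, `cdna`, and `cds` keys) is an assumption made for illustration and does not reflect the Ensembl API or the script's internals.

```python
def canonical_transcript(transcripts):
    """Pick one transcript per gene using the rule described above.

    transcripts -- list of dicts with keys 'id', 'cdna' (sequence string)
                   and 'cds' (sequence string, or None if non-coding).
                   This layout is hypothetical.
    """
    if not transcripts:
        raise ValueError("gene has no transcripts")
    coding = [t for t in transcripts if t.get("cds")]
    if coding:
        # Protein-coding gene: the longest CDS wins.
        return max(coding, key=lambda t: len(t["cds"]))
    # Non-coding gene: fall back to the longest cDNA.
    return max(transcripts, key=lambda t: len(t["cdna"]))


gene = [
    {"id": "T1", "cdna": "A" * 900, "cds": "A" * 300},
    {"id": "T2", "cdna": "A" * 700, "cds": "A" * 450},
    {"id": "T3", "cdna": "A" * 1200, "cds": None},
]
print(canonical_transcript(gene)["id"])  # T2: longest CDS, not longest cDNA
```

When two transcripts tie, `max` keeps the first one encountered; how the original script breaks ties is not stated in the description.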
    obtain_reference_chromosomes.pl - this Perl script builds a FASTA file consisting of the chromosome sequences in Ensembl for a given organism. The resulting FASTA file is suitable for sequence searches and for mapping sequence reads.
    Developed by: Paul Stothard.
    Availability: Included in the NGS-SNP package.
    random_sequence_reads.pl - this Perl script generates simulated sequence reads from a file of sequences in FASTA format. The starting position of each read is chosen at random. The length of the reads is specified using the -L option. Reads truncated because the end of a sequence is encountered are discarded if they are shorter than the length specified using the -m option. Sampling is continued until the desired number of reads is obtained.
    Developed by: Paul Stothard.
    Availability: random_sequence_reads.zip, random_sequence_reads.tar.gz.
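The sampling loop described for random_sequence_reads.pl might look roughly like the Python sketch below. The parameters `read_length` and `min_length` mirror the -L and -m options; choosing the source sequence in proportion to its length is an assumption, since the description does not say how a sequence is picked.

```python
import random


def random_reads(sequences, read_length, min_length, num_reads, seed=None):
    """Sample simulated reads from a dict of {name: sequence}.

    A read that runs past the end of its sequence is truncated; truncated
    reads shorter than min_length are discarded, and sampling continues
    until num_reads reads have been collected.
    """
    if not 0 < min_length <= read_length:
        raise ValueError("need 0 < min_length <= read_length")
    usable = {n: s for n, s in sequences.items() if len(s) >= min_length}
    if not usable:
        raise ValueError("no sequence is long enough to yield a read")
    rng = random.Random(seed)
    names = list(usable)
    # Assumption: longer sequences contribute proportionally more reads.
    weights = [len(usable[n]) for n in names]
    reads = []
    while len(reads) < num_reads:
        name = rng.choices(names, weights=weights)[0]
        seq = usable[name]
        start = rng.randrange(len(seq))  # random 0-based start position
        read = seq[start:start + read_length]
        if len(read) >= min_length:
            reads.append((name, start, read))
    return reads
```

Sequences shorter than `min_length` are filtered out up front so the loop cannot spin forever on input that can never produce an acceptable read.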
    random_sequence_sample.pl - this Perl script selects a random sample of sequences from a FASTA file containing multiple sequences. The sample is written to a new text file. Sampling can be performed with or without replacement.
    Developed by: Paul Stothard.
    Availability: random_sequence_sample.zip, random_sequence_sample.tar.gz.
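The difference between sampling with and without replacement, as offered by random_sequence_sample.pl, can be shown in a few lines of Python. The function name `sample_records` and its parameters are hypothetical; the records are assumed to have been parsed into a list already.

```python
import random


def sample_records(records, k, replace=False, seed=None):
    """Return k records drawn at random from a list of FASTA records."""
    rng = random.Random(seed)
    if replace:
        # With replacement: the same record may be drawn more than once.
        return rng.choices(records, k=k)
    if k > len(records):
        raise ValueError("cannot draw more records than exist without replacement")
    # Without replacement: every drawn record is distinct.
    return rng.sample(records, k)
```

Without replacement the sample size is capped by the number of input sequences, whereas with replacement any sample size is possible.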
    remote_blast_client.pl - this Perl script accepts a FASTA file containing multiple sequences as input. It submits each sequence to NCBI's BLAST servers, to identify related sequences in a database of interest. An optional 'limit by entrez query' value can be supplied to restrict the search. For each BLAST hit a descriptive title is obtained using a separate Entrez search. Each BLAST hit and its descriptive title are written to a single tab-delimited output file.
    Developed by: Paul Stothard.
    Availability: remote_blast_client.zip, remote_blast_client.tar.gz.
    remote_in_silico_pcr.rb - this Ruby script accepts as input a list of primer sequences and uses the remote "UCSC In-Silico PCR" site to perform in silico PCR on the specified genome. By default only the top hit is returned for each primer pair--all the hits can be returned by using the '-m' option.
    Developed by: Jason Grant.
    Availability: remote_in_silico_pcr.zip, remote_in_silico_pcr.tar.gz.
    sequence_to_multi_fasta.pl - this Perl script reads a file consisting of a single DNA sequence (in raw, FASTA, GenBank, or EMBL format) and then divides the sequence into smaller sequences of the size you specify. The new sequences are written to a single output file with a modified title giving the position of the subsequence in relation to the original sequence. The new sequences are written in FASTA format.
    Developed by: Paul Stothard.
    Availability: sequence_to_multi_fasta.zip, sequence_to_multi_fasta.tar.gz.
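As a rough illustration of what sequence_to_multi_fasta.pl is described as doing, the Python sketch below cuts one sequence into fixed-size pieces and writes them as FASTA. The title format (`name_start-end`, 1-based) is an assumption; the description only says the title gives the position of each subsequence.

```python
def split_sequence(title, sequence, size):
    """Yield (title, chunk) pairs for consecutive chunks of the given size.

    Each new title carries the 1-based start and end positions of the
    chunk within the original sequence (hypothetical format).
    """
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(sequence), size):
        chunk = sequence[start:start + size]
        yield f"{title}_{start + 1}-{start + len(chunk)}", chunk


def to_fasta(records, width=60):
    """Format (title, sequence) pairs as FASTA text with wrapped lines."""
    lines = []
    for title, seq in records:
        lines.append(f">{title}")
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


print(to_fasta(split_sequence("seq", "ACGTACGTAC", 4)), end="")
# >seq_1-4
# ACGT
# >seq_5-8
# ACGT
# >seq_9-10
# AC
```

The last chunk is simply shorter when the sequence length is not a multiple of the chunk size.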
    space_check.sh - this shell script monitors hard drive space and sends an email when space becomes scarce. On the first day of each month the script sends an email report of hard drive space.
    Developed by: Paul Stothard.
    Availability: space_check.zip, space_check.tar.gz.
    split_records.pl - this Perl script splits an input file into multiple output files, to allow analysis jobs to be divided among nodes in a computer cluster. Several options are included for handling header lines, for specifying the record separator, and for controlling how files are named.
    Developed by: Jason Grant and Paul Stothard.
    Availability: split_records.zip, split_records.tar.gz.
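The idea behind split_records.pl (dividing records among cluster jobs while keeping header lines with each piece) can be sketched in Python as below. This works on an in-memory string rather than files, and the parameter names `num_parts`, `separator`, and `header_lines` are hypothetical rather than the script's actual options.

```python
def split_records(text, num_parts, separator="\n", header_lines=0):
    """Split text into at most num_parts contiguous pieces.

    Records are delimited by separator. The first header_lines records
    are treated as a header and repeated at the top of every piece.
    """
    if num_parts <= 0:
        raise ValueError("num_parts must be positive")
    pieces = text.split(separator)
    if pieces and pieces[-1] == "":
        pieces.pop()  # drop the empty string left by a trailing separator
    header, records = pieces[:header_lines], pieces[header_lines:]
    # Ceiling division so that no more than num_parts pieces are produced.
    per_part = max(1, -(-len(records) // num_parts))
    parts = []
    for start in range(0, len(records), per_part):
        chunk = header + records[start:start + per_part]
        parts.append(separator.join(chunk) + separator)
    return parts


table = "id\tvalue\na\t1\nb\t2\nc\t3\n"
for part in split_records(table, 2, header_lines=1):
    print(repr(part))
# 'id\tvalue\na\t1\nb\t2\n'
# 'id\tvalue\nc\t3\n'
```

Keeping records contiguous preserves the input order within each piece, which makes it easy to concatenate results once the jobs finish.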

    Sebas Ramos-Onsins lab


    http://bioinformatics.cragenomica.es/numgenomics/people/sebas/software/software.html

    Analysis of Nucleotide Variability on Natural and Domesticated Populations:

    Our focus is mainly on the species Sus scrofa (the pig) and closely related species. We are interested in explaining the evolutionary processes involved in the history of this species and the effect of natural and artificial selection on wild and domestic breeds.

    We are also interested in analyzing the variability in autopolyploid species using statistics adapted for pooled lineages.

    Study and development of Neutrality tests and methods for statistical inference of evolutionary models:

    A useful way to obtain more interpretable information from raw sequencing data is to use summary statistics and neutrality tests. They summarize the observed information and make evolutionary patterns easier to understand. We study and develop neutrality tests for use in the inference of evolutionary parameters.

    Development of tools for the analysis of nucleotide variability:

    Until now we have been developing tools for the analysis of nucleotide variability in multilocus data. Analyses of multilocus data can be much more precise than single-locus analyses because they reduce the variance of the inferred parameters. We develop tools for multilocus population genetic analysis.

    Now we are focusing on population genomics: massively parallel sequencing is revolutionizing the study of population genetics in many ways. We are developing bioinformatic tools for the analysis of variability at the genomic level.

    Sunday, April 7, 2013

    Eckertlab - on evolutionary genomics of trees

    http://eckertlab.blogspot.ca/

    The Eckert Lab is a plant evolutionary genomics lab located in the Department of Biology at Virginia Commonwealth University.

    Research topics range from the dissection of adaptive plant phenotypes into their genetic components to inferences of genome-wide patterns of polymorphism, divergence and natural selection.

    http://evolfri.blogspot.ca/

    Sunday, March 31, 2013

    PGML - plant genome mapping lab

    http://www.plantgenome.uga.edu/personnel.html


    MULTIPLE COLLINEARITY SCAN - MCSCAN


    MCScanX-transposed is a software package that detects transposed gene duplications that occurred within different epochs, by applying MCScanX within and between related genomes. It is also useful for integrative analysis of gene duplication modes and for annotating a gene family of interest with gene duplication modes.

    MCScan is an algorithm to scan multiple genomes or subgenomes to identify putative homologous chromosomal regions, then align these regions using genes as anchors. The MCScanX toolkit implements an adjusted MCScan algorithm for detection of synteny and collinearity and extends the software by incorporating 15 utility programs for display and further analyses. Compared with MCScan version 0.8, MCScanX has the following new features:

    Friday, March 29, 2013

    The Santos Lab


    >> mapping_pe_reads_w_bwa_bowtie2.sh - Shell script for mapping Illumina reads to scaffold(s) in FASTA format. Needs working installations of BWA, SAMtools, and Bowtie 2.
    >> MPI-enabled MrBayes and PhyML - Precompiled binaries of the phylogenetic programs MrBayes and PhyML capable of utilizing multiple CPUs simultaneously (built for Apple Intel systems).
    >> remote_blast_client.prl - Performs various BLAST searches against NCBI's databases.
    >> blast_parse_all.prl - Parses BLAST reports for all HSPs with BioPerl's Bio::SearchIO module.
    >> blast_parse_single.prl - Parses BLAST reports for single best HSP with BioPerl's Bio::SearchIO module.
    >> blast2ps.prl - Creates a graphical representation of BLAST reports as a Postscript file.
    >> blast2table.prl - Parses BLAST reports using BioPerl's Bio::Tools::Blast.pm; writes the data from each HSP in tabular form in a variety of formats.
    >> bp_embl2picture.prl - Renders a GenBank or EMBL file into a PNG or GIF image.
    >> compare_library.prl - Accepts two files (i and j) containing multiple DNA sequences in FASTA format and compares each sequence in file i to that in file j using a local BLAST installation.
    >> count_types.sh - Counts how many files there are of each type in a directory.
    >> NCBI_accession_retrieval.sh - Downloads sequences from NCBI in FASTA format when provided with a file containing accession numbers.
    >> NCBI_condense_names.prl - Replaces entry names in downloaded FASTA sequences from NCBI with simpler names.
    >> NCBI_retrieval.prl - Uses NCBI's Entrez Programming Utilities to perform interactive batch requests to NCBI Entrez.
    >> split_fasta.prl - Accepts a file consisting of multiple FASTA formatted sequence records and splits them into multiple files.
    >> nanorc.txt - Customized configuration file for use with the GNU Nano 2.0.7 text editor. Allows nucleotide highlighting in FASTA and NEXUS files. Save to your home directory as .nanorc and it will be sourced by Nano at start-up.
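For the split_fasta.prl entry in the list above, a minimal Python sketch of splitting a multi-record FASTA text into one output per record is given here. It is not the Santos Lab script: whether that script writes one record per file is not stated, and the file-naming scheme (a running index plus the first word of the header, with a `.fasta` suffix) is purely an assumption.

```python
def split_fasta(text):
    """Return {file_name: fasta_text}, one entry per FASTA record.

    File names are built from a running index and the first word of the
    header line (hypothetical naming scheme).
    """
    outputs = {}
    name, lines = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if name is not None:
                outputs[name] = "\n".join(lines) + "\n"
            words = line[1:].split()
            stem = words[0] if words else "unnamed"
            name = f"{len(outputs) + 1:04d}_{stem}.fasta"
            lines = [line]
        elif name is not None and line.strip():
            # Sequence line belonging to the current record.
            lines.append(line)
    if name is not None:
        outputs[name] = "\n".join(lines) + "\n"
    return outputs
```

Prefixing each name with an index keeps records with identical identifiers from overwriting one another.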