Sunday, March 31, 2013

PGML - plant genome mapping lab

http://www.plantgenome.uga.edu/personnel.html


MULTIPLE COLLINEARITY SCAN - MCSCAN


MCScanX-transposed is a software package that detects transposed gene duplications that occurred within different epochs by applying MCScanX within and between related genomes. It is also useful for integrative analysis of gene duplication modes and for annotating a gene family of interest with gene duplication modes.

MCScan is an algorithm to scan multiple genomes or subgenomes to identify putative homologous chromosomal regions, then align these regions using genes as anchors. The MCScanX toolkit implements an adjusted MCScan algorithm for detection of synteny and collinearity and extends the software by incorporating 15 utility programs for display and further analyses. Compared with MCScan version 0.8, MCScanX has the following new features:

Friday, March 29, 2013

The Santos Lab


>> mapping_pe_reads_w_bwa_bowtie2.sh - Shell script for mapping Illumina reads to scaffold(s) in FASTA format. Needs working installations of BWA, SAMtools and Bowtie 2.
>> MPI-enabled MrBayes and PhyML - Precompiled binaries of the phylogenetic programs MrBayes and PhyML capable of utilizing multiple CPUs simultaneously (built for Apple Intel systems).
>> remote_blast_client.prl - Performs various BLAST searches against NCBI's databases.
>> blast_parse_all.prl - Parses BLAST reports for all HSPs with BioPerl's Bio::SearchIO module.
>> blast_parse_single.prl - Parses BLAST reports for single best HSP with BioPerl's Bio::SearchIO module.
>> blast2ps.prl - Creates a graphical representation of BLAST reports as a Postscript file.
>> blast2table.prl - Parses BLAST reports using BioPerl's Bio::Tools::Blast.pm; writes the data from each HSP in tabular form in a variety of formats.
>> bp_embl2picture.prl - Renders a GenBank or EMBL file into a PNG or GIF image.
>> compare_library.prl - Accepts two files (i and j) containing multiple DNA sequences in FASTA format and compares each sequence in file i to that in file j using a local BLAST installation.
>> count_types.sh - Counts how many files there are of each type in a directory.
>> NCBI_accession_retrieval.sh - Downloads sequences from NCBI in FASTA format when provided with a file containing accession numbers.
>> NCBI_condense_names.prl - Replaces entry names in downloaded FASTA sequences from NCBI with simpler names.
>> NCBI_retrieval.prl - Uses NCBI's Entrez Programming Utilities to perform interactive batch requests to NCBI Entrez.
>> split_fasta.prl - Accepts a file consisting of multiple FASTA formatted sequence records and splits them into multiple files.
>> nanorc.txt - Customized configuration file for use with the GNU Nano 2.0.7 text editor. Allows nucleotide highlighting in FASTA and NEXUS files. Save to your home directory as .nanorc and it will be sourced by Nano at start-up.
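
The split_fasta.prl script listed above is not reproduced here. As a rough illustration of the same idea (one output file per FASTA record), here is a minimal awk sketch. The function name split_fasta and the seq_NNN.fa naming scheme are assumptions made for this example, not the Santos Lab script's behaviour.

```shell
# split_fasta INPUT OUTDIR: write each record of a multi-FASTA file to its
# own file, OUTDIR/seq_001.fa, OUTDIR/seq_002.fa, ...
# (hypothetical helper; the output naming is an assumption)
split_fasta() (
    mkdir -p "$2" || exit 1
    awk -v dir="$2" '
        # A header line starts a new record: close the previous file, open the next.
        /^>/ { if (out) close(out); n++; out = sprintf("%s/seq_%03d.fa", dir, n) }
        # Copy every line to the current output file once a header has been seen.
        out  { print > out }
    ' "$1"
)

# Example: split_fasta scaffolds.fa split_out
```

Closing each file before opening the next keeps the number of open file handles at one, which matters for inputs with many records.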

FaBox (1.41) - an online fasta sequence toolbox

1. FaBox,
http://onlinelibrary.wiley.com/doi/10.1111/j.1471-8286.2007.01821.x/abstract

FaBox is a collection of simple and intuitive web services that enable biologists to quickly perform typical tasks with sequence data. The services make it easy to extract, edit, and replace sequence headers and join or divide data sets based on header information. Other services include collapsing a set of sequences into haplotypes and automated formatting of input files for a number of population genetics and phylogenetic programs, such as Arlequin, TCS and MrBayes. The toolbox is expected to grow on the basis of requests for particular services and converters in the future.

2. download
http://users-birc.au.dk/biopv/php/fabox/faq.php

Kent source - bioinformatic operation on fasta and more

1. http://www.biostars.org/p/1852/

2. compile Kent source in Ubuntu
http://genomewiki.ucsc.edu/index.php/Source_tree_compilation_on_Debian/Ubuntu

Thursday, March 28, 2013

labs in evolutionary genetics

1. Barker lab, on genome duplication

http://barkerlab.net/

2. Yaniv Brandvain

Population genetics of speciation and mating system evolution


http://yanivbrandvain.wordpress.com/publications/

3. Coop Lab
Population and evolutionary genetics

http://gcbias.org/publications/

visualization of biological data in Google Earth using R2G2, an R CRAN package



Arrigo, N., Albert, L. P., Mickelson, P. G. and Barker, M. S. (2012), Quantitative visualization of biological data in Google Earth using R2G2, an R CRAN package. Molecular Ecology Resources. doi: 10.1111/1755-0998.12012


Set Environment variables - PATH, CLASSPATH, JAVA_HOME, ANT_HOME in Ubuntu


Setting environment variables in Ubuntu can be tricky. Ubuntu ships with OpenOffice, which requires OpenJDK, so the default java path points to OpenJDK. The command

$ java -version

works but uses OpenJDK. Once you have installed the Sun JDK, its location should be something like "/usr/lib/jvm/java-6-sun-1.6.0.20", depending on the version installed. You may then want to update the PATH and CLASSPATH variables and create environment variables such as JAVA_HOME and ANT_HOME.

The location to set up the environment variables in Ubuntu is /home/Your_User_Name/.bashrc. You will need to make entries like:

JAVA_HOME=/usr/lib/jvm/java-6-sun-1.6.0.20
ANT_HOME=/home/harkiran/javaTools/apache-ant-1.8.1
PATH=$PATH:$ORACLE_HOME/bin:$JAVA_HOME/bin:$ANT_HOME/bin
CLASSPATH=.:/usr/lib/jvm/java-6-sun-1.6.0.20/lib
export JAVA_HOME
export ANT_HOME
export CLASSPATH
export PATH

Linux separates PATH entries with ":" (colon). Windows uses ";" (semicolon).
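
Because .bashrc is read by every new interactive shell, and again whenever you source it by hand, a plain PATH=$PATH:... line can append the same directory more than once. A minimal sketch of a guard against that follows; path_append is a hypothetical helper name, not something bash or Ubuntu provides.

```shell
# path_append DIR: add DIR to PATH only if it is not already one of the
# colon-separated entries. (path_append is a hypothetical helper name.)
path_append() {
    case ":$PATH:" in
        *":$1:"*) ;;              # already present, leave PATH alone
        *) PATH="$PATH:$1" ;;     # otherwise append with a colon separator
    esac
}

# Example use in ~/.bashrc, with JAVA_HOME and ANT_HOME set as above:
# path_append "$JAVA_HOME/bin"
# path_append "$ANT_HOME/bin"
# export PATH
```

Wrapping PATH in extra colons before matching means the first and last entries are handled the same way as the ones in the middle.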

Once you have set this up you will also need to create symbolic links in the "/etc/alternatives" directory. You will need administrative privileges to do so.

$ sudo ln -s /usr/lib/jvm/java-6-sun-1.6.0.20/bin/java /etc/alternatives/java
$ sudo ln -s /usr/lib/jvm/java-6-sun-1.6.0.20/bin/javac /etc/alternatives/javac

Once done, run java -version, javac -version and ant to check that everything has been set up properly.
##########################################################
# what I used
# find / -name libjava.so 2> /dev/null
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64/
export ANT_HOME=/usr/share/ant
export PATH=$PATH:${JAVA_HOME}/bin:${ANT_HOME}/bin
export CLASSPATH=.:/usr/lib/jvm/java-6-sun-1.6.0.20/lib
########
# sudo needed
$ ln -s /usr/lib/jvm/java-6-openjdk-amd64/bin/java /etc/alternatives/java
$ ln -s /usr/lib/jvm/java-6-openjdk-amd64/bin/javac /etc/alternatives/javac


Tuesday, March 26, 2013

a bash script that loops through chromosomes

#!/bin/bash
# A bash script that loops through chromosomes 1-22 and submits one plink job per chromosome.
for i in {1..22}
do
   # --out uses a per-chromosome prefix (an assumed name); with a fixed prefix every job would write to the same files
   cbatch "plink --bfile /home/GWAS/GeneralRelease/Imputed/Release3/CleanPlink/GRclean${i} --extract c${i}.keep1 --maf .05 --out GRclean2_chr${i} --make-bed"
done
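
Before submitting 22 jobs it can help to print the commands and inspect them first. Here is a sketch of such a dry run, written with a counter loop so that it also runs under plain sh. The function name print_jobs, the shortened --bfile path and the out_chr prefix are placeholders chosen for this example.

```shell
# print_jobs: echo one plink command per autosome instead of submitting it.
# (hypothetical helper; file prefixes below are placeholders)
print_jobs() {
    i=1
    while [ "$i" -le 22 ]; do
        echo "plink --bfile GRclean${i} --extract c${i}.keep1 --maf .05 --out out_chr${i} --make-bed"
        i=$((i + 1))
    done
}

# Inspect the commands, then swap echo for the real submission command:
# print_jobs
```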

Friday, March 22, 2013

making maps with R

http://www.molecularecologist.com/2012/09/making-maps-with-r/



plot lm and glm models

1. http://strengejacke.wordpress.com/2013/03/22/plotting-lm-and-glm-models-with-ggplot-rstats/



2. http://www.surefoss.org/visualisation/plotting-odds-ratios-aka-a-forrestplot-with-ggplot2/


SPAms - help to build up ms simulation


SPAms (Simulation Program for the Analysis of ms)

SPAms is a user-friendly interface for simulating genetic data under several demographic scenarios. It uses the ms program, developed by Richard Hudson (2002), as an engine for simulating the genetic data. The program ms can be downloaded below or from Hudson's webpage. SPAms was written in MATLAB, so the files needed to run it differ depending on whether you have MATLAB installed, and the downloadable package is divided into several files. You do not need MATLAB to run SPAms. IF YOU HAVE MATLAB: you should NOT need the MCR Installer file. There is a set of examples for which we provide the R scripts (to analyse the outputs). We recommend that you read the user guide before starting to use the program. Do not start with a large number of simulations and large data sets before you understand how much memory and time your simulations will need.

Thursday, March 21, 2013

Pattern Matching - cheat sheet


Pattern Matching

Shell globbing
Pattern matching in the shell against filenames defines its metacharacters differently from the rest of the Unix pattern matching programs. * matches any string of zero or more characters, and ? matches exactly one character; by default neither matches a leading dot in a filename. So *.c matches any filename ending with the two characters .c (this will list all C source files in the directory, assuming the directory's owner is sane).
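
A small sketch of the two globbing metacharacters in action. It builds a scratch directory so nothing in the current directory is touched; the function name glob_demo and the file names are made up for illustration.

```shell
# glob_demo: create a scratch directory with a few files and print what the
# patterns *.c and ?.h expand to. Runs in a subshell, so the cd is local.
glob_demo() (
    dir=$(mktemp -d) || exit 1
    cd "$dir" || exit 1
    touch main.c util.c "my notes.c" a.h ab.h
    printf 'star: %s\n' *.c     # every name ending in .c, spaces included
    printf 'qmark: %s\n' ?.h    # exactly one character before .h, so not ab.h
    cd / && rm -rf "$dir"
)

# glob_demo
```

Note that "my notes.c" is matched by *.c: the shell expands the pattern to whole filenames, so whitespace inside a name does not stop the match.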
grep, sed
Table of metacharacters:
  1. ^ (caret) match beginning of line. Anchors match.
  2. $ (dollar sign) match end of line. Anchors match.
  3. . (dot) match any character. Beware, command line globbing uses ? instead.
  4. * (star) matches zero or more of the preceding character. Beware, the shell's * behaves like the regex .* instead.
  5. [] (square braces) set of characters inside braces, match any one of.
  6. [^ ] (caret as first character inside braces), match any character except those inside braces
  7. [a-z] (use of dash inside braces) match a range. If - is to be matched, must be first character, to avoid misinterpretation as range operator.
  8. () (parentheses, must be escaped with backslash), save match for later use with \n, where n is a number.
  9. {m}, {m,} and {m,n} (braces, which must be escaped with a backslash), match exactly m, m or more, or between m and n repetitions of the preceding character.
  10. & (ampersand) expands to the matched string, used in sed.
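
The table above can be tried out on a few sample lines. The sketch below assumes a POSIX grep and sed using basic regular expressions; the function names and the sample words are invented for the example.

```shell
# Hypothetical sample lines shared by the examples below.
words() { printf '%s\n' 'abccba' 'abcabc' 'fo' 'foo' 'fooo' 'x-ray'; }

# Anchors: lines that start with f and end with o.
anchored() { words | grep '^f.*o$'; }

# Bracket set with a literal dash placed first: lines containing - or y.
dash_set() { words | grep '[-y]'; }

# Escaped braces: two or more consecutive o characters.
repeated() { words | grep 'o\{2,\}'; }

# Escaped parentheses with back-references: a three-letter reversal.
mirrored() { words | grep '\(.\)\(.\)\(.\)\3\2\1'; }

# & in a sed replacement stands for the whole matched string.
bracketed() { echo 'gene42' | sed 's/[0-9][0-9]*/<&>/'; }
```

Each function prints only the lines that match, so running them one by one shows which sample words each pattern picks out.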
Flags for grep of note:
  • -i, case insensitive
  • -v, invert, select non-matching lines
  • -c, give count of matching lines.
Flags for sed of note:
  • -n, print the line only if forced to
  • -f, commands from a file
Sed commands,
  • form is [address][,address][!]command [arguments]. You tend to have to enclose this in single quotes or the shell will demolish it. Or double quotes if you want shell variables expanded inside the mess.
  • No address: all lines; one address: lines matching address are processed; two addresses: first address starts processing, second address ends processing.
  • Addresses can be line numbers, the dollar sign or a reg. exp enclosed in //.
  • example: s/a/b/g, substitute b for a, globally. Drop the g and you only substitute the first occurrence of a on a line. Add p with the g to print out the line, especially if you are using sed -n.
  • example: /but/d, delete any line that contains "but", no buts allowed!
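
The sed commands above can be checked against a few lines of input. This is a minimal sketch assuming a POSIX sed; the function names and sample text are made up for illustration.

```shell
# Hypothetical sample input shared by the examples below.
sample() { printf '%s\n' 'aaa' 'but no' 'banana' 'xyz'; }

# Delete lines matching /but/, then replace every a with b on the rest.
delete_then_substitute() { sample | sed -e '/but/d' -e 's/a/b/g'; }

# Without the g flag only the first a on each line is replaced.
first_only() { sample | sed 's/a/b/'; }

# -n suppresses the default output; the p flag prints only lines that changed.
changed_only() { sample | sed -n 's/a/b/gp'; }

# Two addresses: print lines 2 through 3 only.
range_only() { sample | sed -n '2,3p'; }
```

The single quotes keep the shell from touching the sed program, as the first bullet warns.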
Examples
Match three letter reversal patterns:
grep '\(.\)\(.\)\(.\)\3\2\1' web2
Substitution using sed:
sed 's/^.*:\*:\([^:]*\).*$/\1/' /etc/passwd
Try to save old files in a subdirectory.

Wednesday, March 20, 2013

A Geospatial Modelling Approach Integrating Archaeobotany and Genetics to Trace the Origin and Dispersal of Domesticated Plants


Background

The study of the prehistoric origins and dispersal routes of domesticated plants is often based on the analysis of either archaeobotanical or genetic data. As more data become available, spatially explicit models of crop dispersal can be used to combine different types of evidence.

Methodology/Principal Findings

We present a model in which a crop disperses through a landscape that is represented by a conductance matrix. From this matrix, we derive least-cost distances from the geographical origin of the crop and use these to predict the age of archaeological crop remains and the heterozygosity of crop populations. We use measures of the overlap and divergence of dispersal trajectories to predict genetic similarity between crop populations. The conductance matrix is constructed from environmental variables using a number of parameters. Model parameters are determined with multiple-criteria optimization, simultaneously fitting the archaeobotanical and genetic data. The consilience reached by the model is the extent to which it converges around solutions optimal for both archaeobotanical and genetic data. We apply the modelling approach to the dispersal of maize in the Americas.

Conclusions/Significance

The approach makes possible the integrative inference of crop dispersal processes, while controlling model complexity and computational requirements.

van Etten J, Hijmans RJ (2010) A Geospatial Modelling Approach Integrating Archaeobotany and Genetics to Trace the Origin and Dispersal of Domesticated Plants. PLoS ONE 5(8): e12060. doi:10.1371/journal.pone.0012060

Saturday, March 16, 2013

Wolf lab at Uppsala University


http://www.ebc.uu.se/Research/IEG/evbiol/research/Wolf/club/


it is still hard to separate demography from selection in genomic inference

Joint analysis of demography and selection in population genetics: where do we stand and where could we go?

Teasing apart the effects of selection and demography on genetic polymorphism remains one of the major challenges in the analysis of population genomic data. The traditional approach has been to assume that demography would leave a genome-wide signature, whereas the effect of selection would be local. In the light of recent genomic surveys of sequence polymorphism, several authors have argued that this approach is questionable based on the evidence of the pervasive role of positive selection and that new approaches are needed. In the first part of this review, we give a few empirical and theoretical examples illustrating the difficulty in teasing apart the effects of selection and demography on genomic polymorphism patterns. In the second part, we review recent efforts to detect recent positive selection. Most available methods still rely on an a priori classification of sites in the genome but there are many promising new approaches. These new methods make use of the latest developments in statistics, explore aspects of the data that had been neglected hitherto or take advantage of the emerging population genomic data. A current and promising approach is based on first estimating demographic and genetic parameters, using, e.g., a likelihood or approximate Bayesian computation framework, focusing on extreme outlier regions, and then using an independent method to confirm these. Finally, especially for species where evidence of natural selection has been limited, more experimental and versatile approaches that contrast populations under varied environmental constraints might be more successful compared with species-wide genome scans in search of specific signatures.

Olivier Francois lab

http://membres-timc.imag.fr/Olivier.Francois/index.html


I am the head of the Computational and Mathematical Biology group in Grenoble. My research interests are in statistical population genetics and evolutionary genomics. I develop computational and statistical methods for the inference of population structure, demography and local adaptation from genomic data. My approaches are mainly based on MCMC, machine learning and approximate Bayesian approaches, including geographic and environmental data. My goal is to provide improved statistical estimation procedures under equilibrium and non-equilibrium processes in population genetics. I also study applications in coalescent theory, the mathematical properties of genealogies and the shape of phylogenetic trees. 
My group distributes the computer programs TESS and POPS, which compute individual cluster membership probabilities and admixture coefficients using multilocus genotypes and geographical or environmental variables.


TESS 2.3: Bayesian Clustering using tessellations and Markov models for spatial population genetics


The POPS program performs inference of ancestry distribution models. It uses a TESS-like interface to compute individual cluster membership and admixture proportions based on multilocus genotype data and their correlation with environmental and geographical variables


LFMM: A program for testing association between loci and environmental gradients using latent factor mixed models

Friday, March 15, 2013

contact zone and population structure

http://mbe.oxfordjournals.org/content/26/9/1963.full

Genetic admixture of distinct gene pools is the consequence of complex spatiotemporal processes that could have involved massive migration and local mating during the history of a species. However, current methods for estimating individual admixture proportions lack the incorporation of such a piece of information. Here, we extend Bayesian clustering algorithms by including global trend surfaces and spatial autocorrelation in the prior distribution on individual admixture coefficients. We test our algorithm by using spatially explicit and realistic coalescent simulations of colonization followed by secondary contact. By coupling our multiscale spatial analyses with a Bayesian evaluation of model complexity and fit, we show that the algorithm provides a correct description of smooth clinal variation, while still detecting zones of sharp variation when they are present in the data. We also apply our approach to understand the population structure of the killifish, Fundulus heteroclitus, for which the algorithm uncovers a presumed contact zone in the Atlantic coast of North America.


Tuesday, March 12, 2013

reports : An R package to assist in the workflow of writing academic articles and other reports

http://trinkerrstuff.wordpress.com/2013/03/12/reports-0-1-2-released/

The reports package assists in writing reports and presentations by providing a framework that brings together existing R, LaTeX/.docx and Pandoc tools. The package is designed to be used with RStudio, MiKTeX/TeX Live/LibreOffice, knitr, knitcitations, Pandoc and pander (and installr for Windows users). The user will want to download these free programs/packages to maximize the effectiveness of the reports package. Functions with two letter names are general text formatting functions for copying text from articles for inclusion as a citation.

Monday, March 11, 2013

De novo genomic analyses for non-model organisms


Scripts are provided for de novo genomic analyses in studies of non-model organisms.

High-throughput sequencing (HTS) is revolutionizing biological research by enabling scientists to quickly and cheaply query variation at a genomic scale. Despite the increasing ease of obtaining such data, using these data effectively still poses notable challenges, especially for those working with organisms without a high-quality reference genome. For every stage of analysis – from assembly to annotation to variant discovery – researchers have to distinguish technical artifacts from the biological realities of their data before they can make inference. In this work, I explore these challenges by generating a large de novo comparative transcriptomic dataset for a clade of lizards and constructing a pipeline to analyze these data. Then, using a combination of novel metrics and an externally validated variant data set, I test the efficacy of my approach, identify areas of improvement, and propose ways to minimize these errors. I find that with careful data curation, HTS can be a powerful tool for generating genomic data for non-model organisms.

Friday, March 8, 2013

proTRAC - a software for probabilistic piRNA cluster detection, visualization and analysis


Background

Throughout the metazoan lineage, typically gonadal expressed Piwi proteins and their guiding piRNAs (~26-32nt in length) form a protective mechanism of RNA interference directed against the propagation of transposable elements (TEs). Most piRNAs are generated from genomic piRNA clusters. Annotation of experimentally obtained piRNAs from small RNA/cDNA-libraries and detection of genomic piRNA clusters are crucial for a thorough understanding of the still enigmatic piRNA pathway, especially in an evolutionary context. Currently, detection of piRNA clusters relies on bioinformatics rather than detection and sequencing of primary piRNA cluster transcripts and the stringency of the methods applied in different studies differs considerably. Additionally, not all important piRNA cluster characteristics were taken into account during bioinformatic processing. Depending on the applied method this can lead to: i) an accidental underrepresentation of TE related piRNAs, ii) overlooking duplicated clusters harboring few or no single-copy loci and iii) false positive annotation of clusters that are in fact just accumulations of multi-copy loci corresponding to frequently mapped reads, but are not transcribed to piRNA precursors.

Results

We developed a software which detects and analyses piRNA clusters (proTRAC, probabilistic TRacking and Analysis of Clusters) based on quantifiable deviations from a hypothetical uniform distribution regarding the decisive piRNA cluster characteristics. We used piRNA sequences from human, macaque, mouse and rat to identify piRNA clusters in the respective species with proTRAC and compared the obtained results with piRNA cluster annotation from piRNABank and the results generated by different hitherto applied methods.
proTRAC identified clusters not annotated at piRNABank and rejected annotated clusters based on the absence of important features like strand asymmetry. We further show, that proTRAC detects clusters that are passed over if a minimum number of single-copy piRNA loci are required and that proTRAC assigns more sequence reads per cluster since it does not preclude frequently mapped reads from the analysis.

Conclusions

With proTRAC we provide a reliable tool for detection, visualization and analysis of piRNA clusters. Detected clusters are well supported by comprehensible probabilistic parameters and retain a maximum amount of information, thus overcoming the present conflict of sensitivity and specificity in piRNA cluster detection.

MIReStruC – an algorithm searching for miRNA structural clusters along a genome


Large scale chromosomal mapping of human microRNA structural clusters


MicroRNAs (miRNAs) can group together along the human genome to form stable secondary structures made of several hairpins hosting miRNAs in their stems. The few known examples of such structures are all involved in cancer development. A large scale computational analysis of human chromosomes crossing sequence analysis and deep sequencing data revealed the presence of >400 structural clusters of miRNAs in the human genome. An a posteriori analysis validates predictions as bona fide miRNAs. A functional analysis of structural clusters position along the chromosomes co-localizes them with genes involved in several key cellular processes like immune systems, sensory systems, signal transduction and development. Immune systems diseases, infectious diseases and neurodegenerative diseases are characterized by genes that are especially well organized around structural clusters of miRNAs. Target genes functional analysis strongly supports a regulatory role of most predicted miRNAs and, notably, a strong involvement of predicted miRNAs in the regulation of cancer pathways. This analysis provides new fundamental insights on the genomic organization of miRNAs in human chromosomes.

VPA: an R tool for analyzing sequencing variants with user-specified frequency pattern


Background

The massive amounts of genetic variants generated by next-generation sequencing systems demand the development of effective computational tools for variant prioritization.

Findings

VPA (Variant Pattern Analyzer) is an R tool for prioritizing variants with a specified frequency pattern from multiple study subjects in a next-generation sequencing study. The tool starts from individual files of variant and sequence calls and extracts variants with a user-specified frequency pattern across the study subjects of interest. Several position-level quality criteria can be incorporated into the variant extraction. It can be used in studies with a matched pair design as well as studies with multiple groups of subjects.

Conclusions

VPA can be used as an automatic pipeline to prioritize variants for further functional exploration and hypothesis generation. The package is implemented in the R language and is freely available from http://vpa.r-forge.r-project.org.

NGS-SNP: In-depth annotation of SNPs arising from resequencing projects


NGS-SNP - Overview

Citing NGS-SNP

Grant JR, Arantes AS, Liao X, Stothard P (2011) In-depth annotation of SNPs arising from resequencing projects using NGS-SNP. Bioinformatics 27:2300-2301.

Description

NGS-SNP is a collection of command-line scripts for providing rich annotations for SNPs identified by the sequencing of transcripts or whole genomes from organisms with reference sequences in Ensembl. Included among the annotations, several of which are not available from any existing SNP annotation tools, are the results of detailed comparisons with orthologous sequences. These comparisons allow, for example, SNPs to be sorted or filtered based on how drastically the SNP changes the score of a protein alignment. Other fields indicate the names of overlapping protein domains or features, and the conservation of both the SNP site and flanking regions. NCBI, Ensembl, and Uniprot IDs are provided for genes, transcripts, and proteins when applicable, along with Gene Ontology terms, a gene description, phenotypes linked to the gene, and an indication of whether the SNP is novel or known. A “Model_Annotations” field provides several annotations obtained by transferring in silico the SNP to an orthologous gene, typically in a well-characterized species.

NGS-SNP scripts

  • annotate_SNPs.pl - used to annotate SNPs identified by the sequencing of genomic DNA or transcripts.
  • annotate_INDELs.pl - used to annotate INDELs identified by the sequencing of genomic DNA.
  • merge_and_sort_SNP_lists.pl - used to filter, merge, and sort SNP lists annotated using NGS-SNP.
  • cDNA_library_entropy.pl - used to choose the best tissues for SNP discovery by mRNA sequencing.
  • obtain_reference_chromosomes.pl - used to obtain reference chromosome sequences from Ensembl that can be supplied to SNP discovery tools such as Maq.
  • obtain_reference_transcripts.pl - used to obtain reference transcript sequences from Ensembl that can be supplied to SNP discovery tools such as Maq.
  • get_genes_in_area.pl - used to obtain information about genes located within or nearby CNVs or other variants supplied as input.
  • ncbi_monitor.pl - used to obtain publications related to genome regions supplied as input.