# SNPMeta

SNPMeta is a Python and BioPython-based tool to generate "metadata" for single nucleotide polymorphisms (SNPs) for easy filtering, or submission to SNP databases. Information reported includes gene name, whether the SNP is coding or noncoding, and whether the SNP is synonymous or nonsynonymous. SNPMeta outputs in either a dbSNP submission report format, or a tab-delimited format.
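The synonymous versus nonsynonymous distinction can be illustrated with a minimal sketch. This is not SNPMeta's own code: the function name `classify_snp`, the 0-based position convention, and the assumption that the coding sequence is in frame and on the forward strand are all choices made for this example. It uses the standard genetic code only.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_snp(cds, position, alt_base):
    """Classify a coding SNP (hypothetical helper, not part of SNPMeta).

    cds       in-frame coding sequence (assumption: forward strand, starts at codon 1)
    position  0-based index of the SNP within cds
    alt_base  the alternate nucleotide
    """
    offset = position % 3
    start = position - offset
    ref_codon = cds[start:start + 3].upper()
    alt_codon = ref_codon[:offset] + alt_base.upper() + ref_codon[offset + 1:]
    if CODON_TABLE[ref_codon] == CODON_TABLE[alt_codon]:
        return "synonymous"
    return "nonsynonymous"


# ATG GCT TAA: changing the third base of GCT keeps alanine,
# changing its first base to A gives ACT (threonine).
third_position = classify_snp("ATGGCTTAA", 5, "C")
first_position = classify_snp("ATGGCTTAA", 3, "A")
print(third_position, first_position)
```

A real annotation also has to locate the SNP inside an annotated coding region first, which is the part SNPMeta derives from BLAST hits.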

## Companion Scripts

These helper scripts are provided to help with running SNPMeta, though they may have uses outside of that context.

- Blast_SNPs.sh - A shell script to run BLAST on SNPs and save the reports as XML. Requires an installation of NCBI's BLAST executables and a Bash shell. Edit the script in a text editor so the variables match your system. Requires a directory of FASTA files, with one sequence per file. The script creates a new file for each FASTA in the directory, ending in '.xml', containing the BLAST report.
- Convert_Illumina.py - A Python script to convert from the Illumina contextual sequence format to FASTA, for input to SNPMeta. Accepts a text file with two tab-separated fields: the SNP name and the SNP contextual sequence. Outputs a FASTA file with IUPAC ambiguities to stdout.
- GBSContextualSeq.py - A Python script to build SNP contextual sequences from a reference sequence and a VCF file. Generates a separate FASTA file for each sample listed in the VCF file. This is useful for generating contextual sequences from genotyping-by-sequencing (GBS) data, as the SNPs will be stored as a VCF. Requires BioPython. Also requires Argparse if using Python < 2.7.
- Split_FASTA.py - A Python script to split a large FASTA file into smaller files. Takes a FASTA file and a positive integer as arguments. Requires BioPython.
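As an illustration of the kind of conversion Convert_Illumina.py is described as doing, here is a minimal sketch. It is not the script itself: it assumes the contextual sequence marks a biallelic SNP as `[A/G]` between the flanking bases, and the function name `illumina_to_fasta` is made up for this example.

```python
import re

# IUPAC ambiguity codes for the six possible two-base combinations.
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}


def illumina_to_fasta(line):
    """Turn 'name<TAB>FLANK[A/G]FLANK' into a two-line FASTA record.

    Assumption: each bracketed SNP lists exactly two different bases.
    """
    name, context = line.rstrip("\n").split("\t")

    def to_iupac(match):
        alleles = frozenset(match.group(1).upper() + match.group(2).upper())
        return IUPAC[alleles]

    sequence = re.sub(r"\[([ACGTacgt])/([ACGTacgt])\]", to_iupac, context)
    return ">{}\n{}".format(name, sequence)


record = illumina_to_fasta("snp_001\tACGT[A/G]TTGA")
print(record)
```

Keying the lookup on a `frozenset` means `[A/G]` and `[G/A]` map to the same code without listing both orders.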

## Wednesday, February 27, 2013

### Structure-Pipeline

1. Structure-Pipeline. This is a really good tool.

Users of Structure (Pritchard et al., 2000) may be familiar with the interface of the BioHPC cluster at Cornell. Unfortunately, guest access was discontinued in May 2011. If you have Structure installed on your SGE supercomputing cluster, several features of the web-based BioHPC cluster interface can be replaced with a pipeline of qsub and Python scripts. This pipeline will guide you through setting up your data file and parameter settings, running Structure efficiently at many values of K, summarizing those results using CLUMPP, and visualizing the results using custom R scripts.

2. Structure Harvester
http://taylor0.biology.ucla.edu/structureHarvester/example/summary.html

3. A simple guide
https://wiki.duke.edu/display/SCSCusers/Using+Structure

## Sunday, February 24, 2013

### Testing for local adaptation and analyzing the performance of hybrids relative to native parental plants

A tool and a reference for analyzing transplant-experiment data.

1. Geyer, C. J., S. Wagenius, and R. G. Shaw. 2007. Aster models for life history analysis. Biometrika 94:415-426.

2. http://onlinelibrary.wiley.com/doi/10.1111/j.1558-5646.2008.00457.x/full

### genomics of ecological speciation - some cases

1. http://onlinelibrary.wiley.com/doi/10.1111/j.1461-0248.2010.01546.x/full
A guide to the genomics of ecological speciation in natural animal populations

Interest in ecological speciation is growing, as evidence accumulates showing that natural selection can lead to rapid divergence between subpopulations. However, whether and how ecological divergence can lead to the buildup of reproductive isolation remains under debate. What is the relative importance of natural selection vs. neutral processes? How does adaptation generate reproductive isolation? Can ecological speciation occur despite homogenizing gene flow? These questions can be addressed using genomic approaches, and with the rapid development of genomic technology, will become more answerable in studies of wild populations than ever before. In this article, we identify open questions in ecological speciation theory and suggest useful genomic methods for addressing these questions in natural animal populations. We aim to provide a practical guide for ecologists interested in incorporating genomic methods into their research programs. An increased integration between ecological research and genomics has the potential to shed novel light on the origin of species.

2. http://www.sciencedirect.com/science/article/pii/S0169534712001863

# What is needed for next-generation ecological and evolutionary genomics?

Ecological and evolutionary genomics (EEG) aims to link gene functions and genomic features to phenotypes and ecological factors. Although the rapid development of technologies allows central questions to be addressed at an unprecedented level of molecular detail, they do not alleviate one of the major challenges of EEG, which is that a large fraction of genes remains without any annotation. Here, we propose two solutions to this challenge. The first solution is in the form of a database that regroups associations between genes, organismal attributes and abiotic and biotic conditions. This database would result in an ecological annotation of genes by allowing cross-referencing across studies and taxa. Our second solution is to use new functional techniques to characterize genes implicated in the response to ecological challenges.

Divergent selection and heterogeneous genomic divergence

Levels of genetic differentiation between populations can be highly variable across the genome, with divergent selection contributing to such heterogeneous genomic divergence. For example, loci under divergent selection and those tightly physically linked to them may exhibit stronger differentiation than neutral regions with weak or no linkage to such loci. Divergent selection can also increase genome-wide neutral differentiation by reducing gene flow (e.g. by causing ecological speciation), thus promoting divergence via the stochastic effects of genetic drift. These consequences of divergent selection are being reported in recently accumulating studies that identify: (i) ‘outlier loci’ with higher levels of divergence than expected under neutrality, and (ii) a positive association between the degree of adaptive phenotypic divergence and levels of molecular genetic differentiation across population pairs [‘isolation by adaptation’ (IBA)]. The latter pattern arises because as adaptive divergence increases, gene flow is reduced (thereby promoting drift) and genetic hitchhiking increased. Here, we review and integrate these previously disconnected concepts and literatures. We find that studies generally report 5–10% of loci to be outliers. These selected regions were often dispersed across the genome, commonly exhibited replicated divergence across different population pairs, and could sometimes be associated with specific ecological variables. IBA was not infrequently observed, even at neutral loci putatively unlinked to those under divergent selection. Overall, we conclude that divergent selection makes diverse contributions to heterogeneous genomic divergence. Nonetheless, the number, size, and distribution of genomic regions affected by selection varied substantially among studies, leading us to discuss the potential role of divergent selection in the growth of regions of differentiation (i.e. genomic islands of divergence), a topic in need of future investigation.

## Saturday, February 23, 2013

### MultiGeneBlast: Combined BLAST searches for operons and gene clusters

MultiGeneBlast is an open source tool for identification of homologs of multigene modules such as operons and gene clusters. It is based on a reformatting of the FASTA headers of NCBI GenBank protein entries, using which it can track down their source nucleotide and coordinates.

Oftentimes when studying such genetic loci, much can be learned from their evolutionary context. Furthermore, MultiGeneBlast can aid in the detection of such multigene parts for synthetic biology projects; a synthetic library of operons can be created based on its output to identify those operons whose function is closest to the one desired by the user.
The tool makes it possible to identify all homologous genomic regions by combining the results of single BlastP runs on each gene, and sorting genomic regions from any GenBank entry by the number of hits, synteny conservation and cumulative Blast bit score. The basic algorithm behind this was previously used in our antiSMASH software.
Additionally, architecture searches can be performed to find any genomic regions with Blast hits to any user-specified combination of amino acid sequences.
The tool comes with a pre-configured database containing the most recent version of all relevant GenBank divisions. Moreover, you can easily make your own databases from local files or online GenBank entries or divisions.
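The idea of combining per-gene BlastP results and ranking by hit count and cumulative bit score can be sketched as follows. This is an illustration only, not MultiGeneBlast's algorithm: the hit tuples are invented, each database entry is treated as a single region, the sort is a simple lexicographic one, and synteny conservation is left out.

```python
from collections import defaultdict

# Hypothetical single-gene BlastP results: (query gene, database entry, bit score).
blast_hits = [
    ("geneA", "entry1", 200.0),
    ("geneB", "entry1", 150.0),
    ("geneA", "entry2", 300.0),
    ("geneB", "entry2", 120.0),
    ("geneC", "entry2", 90.0),
    ("geneA", "entry3", 500.0),
]


def rank_entries(hits):
    """Rank entries by number of distinct query genes hit, then by summed bit score."""
    genes_hit = defaultdict(set)
    total_score = defaultdict(float)
    for query, entry, bit_score in hits:
        genes_hit[entry].add(query)
        total_score[entry] += bit_score
    return sorted(
        genes_hit,
        key=lambda entry: (len(genes_hit[entry]), total_score[entry]),
        reverse=True,
    )


ranking = rank_entries(blast_hits)
print(ranking)
```

Here `entry3` has the single strongest hit but ranks last, because only one of the three query genes matches it.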

## Friday, February 22, 2013

### bigcor: Large correlation matrices in R

http://rmazing.wordpress.com/2013/02/22/bigcor-large-correlation-matrices-in-r/

It has been shown that by calculating the Pearson correlation between genes, one can identify (by high $\varrho$ values, i.e. > 0.9) genes that share a common regulation mechanism, such as being induced or repressed by the same transcription factors.
I had an idea: how about using my microarray data on the expression of 40000 genes in 28 samples to calculate the correlation between all 40000 genes (variables)?
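To make the scale of that idea concrete: a full 40000 x 40000 matrix of 8-byte doubles takes 40000 * 40000 * 8 bytes, about 12.8 GB, which is why a specialised approach is needed. The sketch below only shows the underlying computation on a toy scale, in plain Python rather than R. The expression values and gene names are invented, and this is not the bigcor implementation.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Hypothetical expression values: one list per gene, one value per sample.
expression = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.5],
    "g3": [4.0, 3.0, 2.0, 1.0],
}

# Gene pairs whose correlation exceeds the 0.9 threshold mentioned above.
coregulated = [
    (a, b)
    for a, b in combinations(expression, 2)
    if pearson(expression[a], expression[b]) > 0.9
]

# Memory needed to hold the full matrix for 40000 genes, in gigabytes.
full_matrix_gb = 40000 * 40000 * 8 / 1e9
print(coregulated, full_matrix_gb)
```

Note that `g1` and `g3` are perfectly anti-correlated and so are not reported; using the absolute value instead would be a different design choice.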

### Inferring Population Histories Using Genome-Wide Allele Frequency Data

The recent development of high-throughput genotyping technologies has revolutionized the collection of data in a wide range of both model and nonmodel species. These data generally contain huge amounts of information about the demographic history of populations. In this study, we introduce a new method to estimate divergence times on a diffusion time scale from large single-nucleotide polymorphism (SNP) data sets, conditionally on a population history that is represented as a tree. We further assume that all the observed polymorphisms originate from the most ancestral (root) population; that is, we neglect mutations that occur after the split of the most ancestral population. This method relies on a hierarchical Bayesian model, based on Kimura’s time-dependent diffusion approximation of genetic drift. We implemented a Metropolis–Hastings within Gibbs sampler to estimate the posterior distribution of the parameters of interest in this model, which we refer to as the Kimura model. Evaluating the Kimura model on simulated population histories, we found that it provides accurate estimates of divergence time. Assessing model fit using the deviance information criterion (DIC) proved efficient for retrieving the correct tree topology among a set of competing histories. We show that this procedure is robust to low-to-moderate gene flow, as well as to ascertainment bias, providing that the most distantly related populations are represented in the discovery panel. As an illustrative example, we finally analyzed published human data consisting in genotypes for 452,198 SNPs from individuals belonging to four populations worldwide. Our results suggest that the Kimura model may be helpful to characterize the demographic history of differentiated populations, using genome-wide allele frequency data.

http://mbe.oxfordjournals.org/content/30/3/654.full

### modeler4simcoal2 (m4s2) - a modeler for coalescent processes

modeler4simcoal2 (m4s2) is a modeler for coalescent processes. It allows the modeling of both demographies and chromosomes (i.e., markers with linkage relationships in multiple chromosome blocks).

m4s2 generates files for usage with Simcoal2 which can easily be analyzed with Arlequin3. m4s2 can be run standalone or can directly call and control Simcoal2. Arlequin3 can also be called after the simulations are run.

m4s2 is a Java Web Start application (requiring Java 1.4, available for Windows, Mac and Linux among others). It requires no installation and can be run directly from the web. m4s2 can be run on more platforms than those supported by Simcoal2 and Arlequin3 (in this case only in standalone mode).

The purpose of m4s2 is to allow biologists to concentrate more on biology and the underlying models used in the analysis (and less on having to learn new computer simulation tools). We expect that m4s2 will lower the barrier to coalescent simulator use.

m4s2 has full expressive power with regards to chromosome modeling (i.e., it can model all that Simcoal2 supports).

Regarding demographies, m4s2 includes a set of models that covers the vast majority of those found in the literature (e.g., island, stepping-stone). An extension system is also provided, allowing new models to be created. A simple extension language is provided; if that language is not enough, the full expressive power of Python (Jython) can be used to create new models. New models can be made available online, as m4s2 can import them directly from the web. We make available an external model of the expansion of humans and domesticated species after the Neolithic as hierarchically structured.

Before using m4s2 we recommend reading the user's guide, or at least the first few lines. You can run m4s2 directly from here.

# Evolutionary Genomics

## Statistical and Computational Methods, Volume 2

### Evolutionary Biology for the 21st Century

#### Evolutionary Processes That Shape Genomic and Phenotypic Variation

The availability of genomic data from a remarkable range of species has allowed the alignment and comparison of whole genomes. These comparative approaches have been used to characterize the relative importance of fundamental evolutionary processes that cause genomic evolution and to identify particular regions of the genome that have experienced recent positive selection, recurrent adaptive evolution, or extreme sequence conservation [72]–[75]. Yet more recently, resequencing of additional individuals or populations is also allowing genome-wide population genetic analyses within species [76]–[82]. Such population-level comparisons will allow even more powerful study of the relative importance of particular evolutionary processes in molecular evolution as well as the identification of candidate genomic regions that are responsible for key evolutionary changes (e.g., sticklebacks [83], butterflies [84], Arabidopsis [85]). These data, combined with theoretical advances, should provide insight into long-standing questions such as the prevalence of balancing selection, the relative frequency of strong versus weak directional selection, the role of hybridization, and the importance of genetic drift. A key challenge will be to move beyond documenting the action of natural selection on the genome to understanding the importance of particular selective agents. For example, what proportion of selection on genomes results from adaptation to the abiotic environment, coevolution of species, sexual selection, or genetic conflict? Finally, as sequencing costs continue to drop and analytical tools improve, these same approaches may be applied to organisms that present intriguing evolutionary questions but were not tractable methodologically just a few years ago. The nonmodel systems of today may well become the model systems of tomorrow [86].

#### Understanding Biological Diversification

A major and urgent challenge is to improve knowledge of the identity and distribution of species globally. While we need to retain the traditional focus on phenotypes, powerful new capabilities to obtain and interpret both genomic and spatial data can and should revolutionize the science of biodiversity. Building on momentum from single-locus “barcoding” efforts, new genome-level data can build bridges from population biology to systematics [91]. By establishing a comprehensive and robust “Tree of Life,” we will improve understanding of both the distribution of diversity and the nature and timing of the evolutionary processes that have shaped it.

### pandas - a Python package for working with DataFrames

pandas is the utility belt for data analysts using Python. The package centers around the pandas `DataFrame`, a two-dimensional data structure with indexable rows and columns. It has effectively taken the best parts of base R and of R packages like `plyr` and `reshape2`, and consolidated them into a single library. It has lots of features (see library highlights). pandas gets its name from panel data, an econometrics term for multidimensional structured datasets (McKinney 5., 2013).

2. http://pandas.pydata.org/pandas-docs/stable/index.html

pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way toward this goal.
pandas is well suited for many different kinds of data:
• Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
• Ordered and unordered (not necessarily fixed-frequency) time series data.
• Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
• Any other form of observational / statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure
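A small sketch of the first kind of data in that list, a table with heterogeneously typed columns, might look like this. The column names and values are invented for illustration, and the example assumes pandas is installed.

```python
import pandas as pd

# Hypothetical field data: string, float and boolean columns in one table.
df = pd.DataFrame({
    "sample": ["s1", "s2", "s3", "s4"],
    "population": ["north", "north", "south", "south"],
    "height_cm": [10.0, 12.0, 20.0, 22.0],
    "flowered": [True, False, True, True],
})

# Split-apply-combine: mean height per population.
mean_height = df.groupby("population")["height_cm"].mean()

# Boolean indexing: keep only the rows taller than 11 cm.
tall = df[df["height_cm"] > 11]

print(mean_height)
print(tall)
```

The `groupby` step is the part that plays the role of `plyr` in R, while the row filter is plain label-aware indexing.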

## Sunday, February 17, 2013

### Overview

SHIPS (Spectral Hierarchical clustering for the Inference of Population Structure) is a non-parametric clustering algorithm that clusters individuals from a population into genetically homogeneous sub-populations from genotype data. After computing a pairwise distance matrix, the algorithm progressively divides the original population into two sub-populations by means of a spectral clustering algorithm. The process is then iterated in each of the two sub-populations, and so on. This leads to the construction of a binary tree, where each node represents a group of individuals. To determine the final clusters, a tree-pruning procedure is applied together with a gap statistic, which estimates the optimal number of clusters. In such an approach, both the final clustering of the individuals and the number of clusters are estimated by the method.
The SHIPS algorithm is implemented in R, which can be downloaded from the CRAN web page, and is divided into several functions:
• ships.cluster, which constructs the tree and provides several clustering possibilities
• ships.gap, which estimates the final number of clusters
• ships.plotCluster, which provides a graphical representation of the clustering
• ships.plotGap, which plots the criterion used to estimate the final number of clusters
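The recursive two-way splitting described above can be sketched in plain Python. This is a simplified illustration, not the SHIPS R code: it uses one-dimensional toy points with a Gaussian similarity instead of genotype distances, finds the second eigenvector of the normalized similarity matrix by power iteration, stops at a hypothetical `min_size`, and omits the pruning and gap-statistic steps entirely.

```python
import math


def similarity_matrix(points):
    """Gaussian similarity between 1-D points (stand-in for a genetic similarity)."""
    return [[math.exp(-(a - b) ** 2) for b in points] for a in points]


def spectral_bipartition(weights, iterations=500):
    """Split items in two by the sign of the second eigenvector of D^-1/2 W D^-1/2."""
    n = len(weights)
    root = [math.sqrt(sum(row)) for row in weights]      # sqrt of each degree
    norm = math.sqrt(sum(r * r for r in root))
    top = [r / norm for r in root]                       # known leading eigenvector

    def step(v):
        # Multiply by (I + M) / 2 so that every eigenvalue is non-negative.
        mv = [sum(weights[i][j] * v[j] / (root[i] * root[j]) for j in range(n))
              for i in range(n)]
        return [(v[i] + mv[i]) / 2 for i in range(n)]

    def deflate(v):
        # Remove the leading eigenvector's component, then normalize.
        dot = sum(a * b for a, b in zip(v, top))
        v = [a - dot * b for a, b in zip(v, top)]
        length = math.sqrt(sum(a * a for a in v))
        return [a / length for a in v]

    v = deflate([float(i + 1) for i in range(n)])        # deterministic start vector
    for _ in range(iterations):
        v = deflate(step(v))
    left = [i for i in range(n) if v[i] / root[i] < 0]
    right = [i for i in range(n) if v[i] / root[i] >= 0]
    return left, right


def build_tree(indices, points, min_size=4):
    """Recursively bipartition; groups smaller than min_size become leaves."""
    if len(indices) < min_size:
        return indices
    subset = [points[i] for i in indices]
    left, right = spectral_bipartition(similarity_matrix(subset))
    if not left or not right:
        return indices
    return [build_tree([indices[i] for i in left], points, min_size),
            build_tree([indices[i] for i in right], points, min_size)]


points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
tree = build_tree(list(range(len(points))), points)
print(tree)
```

With two well-separated groups of three points, the first split separates them and each half becomes a leaf. Deciding where to stop splitting is exactly what the pruning and gap-statistic steps address in the method itself.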

### SHIPS resources

Documentation: Documentation.pdf

## Friday, February 15, 2013

### R package for IBD

gdsfmt and SNPRelate are high-performance computing R packages for multi-core symmetric multiprocessing computer architectures. They are used to accelerate two key computations in GWAS: principal component analysis (PCA) and relatedness analysis using identity-by-descent (IBD) measures. The kernels of our algorithms are written in C/C++, and have been highly optimized. Benchmarks show the uniprocessor implementations of PCA and IBD are ~8 to 50 times faster than the implementations provided by the popular EIGENSTRAT (v3.0) and PLINK (v1.07) programs respectively, and can be sped up 30- to 300-fold by using eight cores. SNPRelate can analyze tens of thousands of samples, with millions of SNPs.
to identify pairs of closely-related subjects based on genetic marker data from single-nucleotide polymorphisms (SNPs). The package is able to accommodate SNPs in linkage disequilibrium (LD), without the need to thin the markers so that they are approximately independent in the population. Sample pairs are identified by superposing their estimated identity-by-descent (IBD) coefficients on plots of IBD coefficients for pairs of simulated subjects from one of several common close relationships. The methods are particularly relevant to candidate-gene association studies, in which dependent SNPs cluster in a relatively small number of genes spread throughout the genome. The accommodation of LD allows the use of all available genetic data, a desirable property when working with a modest number of dependent SNPs within candidate genes.

1.

# The Role of GC-Biased Gene Conversion in Shaping the Fastest Evolving Regions of the Human Genome

GC-biased gene conversion (gBGC) is a recombination-associated evolutionary process that accelerates the fixation of guanine or cytosine alleles, regardless of their effects on fitness. gBGC can increase the overall rate of substitutions, a hallmark of positive selection. Many fast-evolving genes and noncoding sequences in the human genome have GC-biased substitution patterns, suggesting that gBGC—in contrast to adaptive processes—may have driven the human changes in these sequences.

2.

# Sequencing, Mapping, and Analysis of 27,455 Maize Full-Length cDNAs

#### Analysis

Contaminated FLcDNAs were found by comparing them against the maize, rice and Arabidopsis rRNA sequences with a BLAST e-value≤1e-50, which identified 26 rRNAs. An additional 110 FLcDNAs were identified that encoded proteins highly similar to bacteria (16 cDNAs), fungus (76 cDNAs) and vertebrate (18 cDNAs) and did not show similarity with plant proteins.
The ORFs were computed using the software GETORF in the EMBOSS package [50] with parameters “-minsize 150, -find 1, -methionine, -noreverse”. TE and SSR analyses were performed using RepeatMasker (repeatmasker.org). For TE analysis, the Poaceae (grass family) TE database was downloaded from the Genetic Information Research Institute (www.girinst.org), and the FLcDNAs that had a masked sequence length of ≥100 bp were used for the TE insertion analysis. SSRs with length ≥20 bp and divergence ≤10% were selected for SSR location analysis. Putative transcription factors were analyzed using BLASTx with e-value≤1e-10 against rice and Arabidopsis transcription factor proteins downloaded from PlantTFDB (planttfdb.cbi.pku.edu.cn). Any maize cDNAs showing positive matches in both rice and Arabidopsis were assigned to TF families using the PlantTFDB nomenclature.
Plant homolog analysis was conducted using BLASTx (e-value≤1e-10) to compare rice, sorghum, Arabidopsis and poplar protein sequences downloaded from the following sites: 67,393 rice (MSU release 6.0; rice.plantbiology.msu.edu), 35,899 sorghum (www.phytozome.net/sorghum), 32,615 Arabidopsis (TAIR v8.0; www.arabidopsis.org) and 45,555 poplar (genome.jgi-psf.org). The maize FLcDNAs that did not have a homolog were compared with the plant UniProt database [29], where another 147 rice, sorghum, Arabidopsis or poplar homologs were identified and removed. Then the 1,475 putative unique maize FLcDNAs were mapped to GO annotated maize gene models with ≥95% ID and ≥90% alignment length using BLAT. GO over- and under-representation analyses were performed using Cytoscape [51] with the BiNGO (Biological Networks Gene Ontology, [25]) plug-in, activating a hypergeometric distribution statistical test (p-value ≤0.05) with Benjamini and Hochberg false discovery rate (FDR) correction [52] relative to GO annotated maize gene models.
For annotation of all EST and FLcDNA assemblies, the unitrans were searched against the UniProt plants database (2009-06-17) using BLASTx with e-value≤1e-20. The GO [24] annotations were extracted from the UniProt file and gene association file (ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT), which were mapped to plant GO Slim [32]. Some of the results were computed by custom Perl scripts, and the rest were obtained from the website, as follows: Table 6 was copied from the “Advanced Summary/Example Queries” page. The number of UniProt matches for the 27k were from the “UniTrans Search”, where “Non-maize UniProt Match” was set to ‘yes’; for the non-putative, the “Match Description” was set to “not putative”. Table 8, Table 9, and the top of Table 10 can all be verified from the PAVE query system.
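The Benjamini and Hochberg FDR correction mentioned in the GO analysis can be sketched in a few lines. This is a generic illustration of the procedure, not the BiNGO implementation, and the p-values below are invented. The hypergeometric test that would produce the raw p-values is not shown.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])   # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four GO terms.
raw = [0.01, 0.04, 0.03, 0.5]
adjusted = benjamini_hochberg(raw)

# Terms that pass the 0.05 threshold after correction.
significant = [i for i, p in enumerate(adjusted) if p <= 0.05]
print(adjusted, significant)
```

In this toy case three of the four raw p-values are below 0.05, but only the first survives the correction.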

### DAVID and WebGestalt for pathway analysis

For pathway analysis you can use:

DAVID (http://david.abcc.ncifcrf.gov/),

Gene Set Analysis Toolkit (http://bioinfo.vanderbilt.edu/webgestalt/)

# Genomic consequences of transitions from cross- to self-fertilization on the efficacy of selection in three independently derived selfing plants

#### Background

Transitions from cross- to self-fertilization are associated with increased genetic drift rendering weakly selected mutations effectively neutral. The effect of drift is predicted to reduce selective constraints on amino acid sequences of proteins and relax biased codon usage. We investigated patterns of nucleotide variation to assess the effect of inbreeding on the accumulation of deleterious mutations in three independently evolved selfing plants. Using high-throughput sequencing, we assembled the floral transcriptomes of four individuals of Eichhornia (Pontederiaceae); these included one outcrosser and two independently derived selfers of E. paniculata, and E. paradoxa, a selfing outgroup. The dataset included ~8000 loci totalling ~3.5 Mb of coding DNA.

#### Results

Tests of selection were consistent with purifying selection constraining evolution of the transcriptome. However, we found an elevation in the proportion of non-synonymous sites that were potentially deleterious in the E. paniculata selfers relative to the outcrosser. Measurements of codon usage in high versus low expression genes demonstrated reduced bias in both E. paniculata selfers.

#### Conclusions

Our findings are consistent with a small reduction in the efficacy of selection on protein sequences associated with transitions to selfing, and reduced selection in selfers on synonymous changes that influence codon usage.

## Wednesday, February 13, 2013

### Softberry programs for genomics

Softberry programs are available to academic users at no charge for occasional use in research projects.