Rapid speciation with gene flow following the formation of Mount Etna


Environmental or geological changes can create new niches which drive ecological species divergence without the immediate cessation of gene flow. However, few such cases have been characterised. On the recently formed volcano, Mt. Etna, Senecio aethnensis and S. chrysanthemifolius inhabit contrasting environments of high and low altitude, respectively. They have very distinct phenotypes, despite hybridising promiscuously, and thus may represent an important example of ecological speciation ‘in action’, possibly as a response to the rapid geological changes which Mt. Etna has recently undergone. To elucidate the species' evolutionary history, and help establish the species as a study system for speciation genomics, we sequenced the transcriptomes of the two Etnean species, and the outgroup, S. vernalis, using Illumina sequencing. Despite the species' substantial phenotypic divergence, synonymous divergence between the high- and low-altitude species was low (dS = 0.016 ± 0.017 [SD]). A comparison of species divergence models with and without gene flow provided unequivocal support in favor of the former and demonstrated a recent time of species divergence (153,080 ya ± 11,470 [SE]) that coincides with the growth of Mount Etna to the altitudes which separate the species today. Analysis of dN/dS revealed wide variation in selective constraint between genes, and evidence that highly expressed genes, more ‘multifunctional’ genes and those with more paralogues were under elevated purifying selection. Taken together, these results are consistent with a model of ecological speciation, potentially as a response to the emergence of a new, high altitude niche as the volcano grew.


Alternative forms for genomic clines


Understanding factors regulating hybrid fitness and gene exchange is a major research challenge for evolutionary biology. Genomic cline analysis has been used to evaluate alternative patterns of introgression, but only two models have been used widely and the approach has generally lacked a hypothesis testing framework for distinguishing effects of selection and drift. I propose two alternative cline models, implement multivariate outlier detection to identify markers associated with hybrid fitness, and simulate hybrid zone dynamics to evaluate the signatures of different modes of selection. Analysis of simulated data shows that previous approaches are prone to false positives (multinomial regression) or relatively insensitive to outlier loci affected by selection (Barton's concordance). The new, theory-based logit-logistic cline model is generally best at detecting loci affecting hybrid fitness. Although some generalizations can be made about different modes of selection, there is no one-to-one correspondence between pattern and process. These new methods will enhance our ability to extract important information about the genetics of reproductive isolation and hybrid fitness. However, much remains to be done to relate statistical patterns to particular evolutionary processes. The methods described here are implemented in a freely available package “HIest” for the R statistical software (CRAN; http://cran.r-project.org/).

Theoretical Evolutionary Genetics - draft text

1. http://evolution.genetics.washington.edu/pgbook/pgbook.html

This would be a very good book on population genetics.

2. Evolution and Selection of Quantitative Traits by Bruce Walsh and Michael Lynch. While this book is in draft form it is available from Bruce Walsh's web page at: http://nitro.biosci.arizona.edu/zbook/NewVolume_2/newvol2.html (Bruce Walsh's web page is in general a fantastic source of information on all things population/quantitative genetics).

3. From Whitlock at UBC


perl tutorial


hybrid zone

Hybrid zones allow us:
(1) to quantify the genetic differences responsible for speciation,
(2) to measure the diffusion of genes between diverging taxa,
(3) to understand the spread of alternative adaptations.

The genomic impacts of drift and selection for hybrid performance

1. http://arxiv.org/abs/1307.7313

Modern maize breeding relies upon selection in inbreeding populations to improve the performance of cross-population hybrids. The United States Department of Agriculture - Agricultural Research Service reciprocal recurrent selection experiment between the Iowa Stiff Stalk Synthetic (BSSS) and the Iowa Corn Borer Synthetic No. 1 (BSCB1) populations represents one of the longest standing models of selection for hybrid performance. To investigate the genomic impact of this selection program, we used the Illumina MaizeSNP50 high-density SNP array to determine genotypes of progenitor lines and over 600 individuals across multiple cycles of selection. Consistent with previous research (Messmer et al., 1991; Labate et al., 1997; Hagdorn et al., 2003; Hinze et al., 2005), we found that genetic diversity within each population steadily decreases, with a corresponding increase in population structure. High marker density also enabled the first view of haplotype ancestry, fixation and recombination within this historic maize experiment. Extensive regions of haplotype fixation within each population are visible in the pericentromeric regions, where large blocks trace back to single founder inbreds. Simulation attributes most of the observed reduction in genetic diversity to genetic drift. Signatures of selection were difficult to observe in the background of this strong genetic drift, but heterozygosity in each population has fallen more than expected. Regions of haplotype fixation represent the most likely targets of selection, but as observed in other germplasm selected for hybrid performance (Feng et al., 2006), there is no overlap between the most likely targets of selection in the two populations. We discuss how this pattern is likely to occur during selection for hybrid performance, and how it poses challenges for dissecting the impacts of modern breeding and selection on the maize genome.

How do I match orthologues in one species to another, genome scale


How to detect orthologs among species.
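
One common first-pass heuristic for this is reciprocal best hits (RBH): gene a in species A and gene b in species B are treated as putative orthologs when each is the other's top-scoring BLAST hit. The sketch below is only an illustration under stated assumptions: the inputs are reduced to three columns (query, subject, bit score), all gene names and scores are invented, and ties and paralogs are not handled. Dedicated orthology tools are the better choice at genome scale.

```shell
# Toy hits, species A genes searched against species B
# (query, subject, bit score). All names and scores are made up.
a_vs_b=$(printf 'a1\tb1\t200\na1\tb2\t150\na2\tb2\t180\na3\tb1\t90\n')
# Toy hits, species B genes searched against species A.
b_vs_a=$(printf 'b1\ta1\t210\nb2\ta2\t170\nb2\ta1\t140\n')

rbh=$(
  {
    # Tag each row with its search direction so one awk pass can read both.
    printf '%s\n' "$a_vs_b" | awk '{print "AB\t" $0}'
    printf '%s\n' "$b_vs_a" | awk '{print "BA\t" $0}'
  } | awk -F '\t' '
    # Track the highest-scoring subject for every query, per direction.
    $1 == "AB" && (!($2 in score_ab) || $4 + 0 > score_ab[$2]) {
      score_ab[$2] = $4 + 0; best_ab[$2] = $3
    }
    $1 == "BA" && (!($2 in score_ba) || $4 + 0 > score_ba[$2]) {
      score_ba[$2] = $4 + 0; best_ba[$2] = $3
    }
    # A pair is reciprocal when each gene is the best hit of the other.
    END {
      for (a in best_ab)
        if ((best_ab[a] in best_ba) && best_ba[best_ab[a]] == a)
          print a "\t" best_ab[a]
    }
  ' | sort
)
echo "$rbh"
```

In this toy input a3 also hits b1, but b1's best hit is a1, so a3 is not reported.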

Subset of heat-shock transcription factors required for the early response of Arabidopsis to excess light

1. http://www.sciencedaily.com/releases/2013/08/130806132939.htm

2. http://www.pnas.org/content/early/2013/07/31/1311632110

How Increasing CO2 and Temperatures Affect Plant Development

1. http://www.sciencedaily.com/releases/2013/07/130731225931.htm

2. http://www.nature.com/ncomms/2013/130731/ncomms3145/full/ncomms3145.html

Elevated levels of CO2 and temperature can both affect plant growth and development, but the signalling pathways regulating these processes are still obscure. MicroRNAs function to silence gene expression, and environmental stresses can alter their expressions. Here we identify, using the small RNA-sequencing method, microRNAs that change significantly in expression by either doubling the atmospheric CO2 concentration or by increasing temperature 3–6 °C. Notably, nearly all CO2-influenced microRNAs are affected inversely by elevated temperature. Using the RNA-sequencing method, we determine strongly correlated expression changes between miR156/157 and miR172, and their target transcription factors under elevated CO2 concentration. Similar correlations are also found for microRNAs acting in auxin-signalling, stress responses and potential cell wall carbohydrate synthesis. Our results demonstrate that both CO2 and temperature alter microRNA expression to affect Arabidopsis growth and development, and miR156/157- and miR172-regulated transcriptional network might underlie the onset of early flowering induced by increasing CO2.


Workflow of gene family evolution study

Workflow of gene family evolution study:

1. Analysis Method
Sequence Collection
PlantTribe, PlantGDB, GenBank, Conifer DBMagic assemblies
25 taxa comprising 71 sequences

2. Phylogenetic analysis
Maximum Likelihood: RAxML (Stamatakis et al.)
Bayesian Method: MrBayes (Huelsenbeck et al.)
Tree reconciliation: NOTUNG 2.6 (Chen et al.)

3. A case study

What use is a reference genome sequence . . . in applied tree breeding

• As a reference for re-sequencing elite individuals to identify
functional alleles or haplotypes, and consequently, to provide
superior estimates of kinship.
• As a physical map of marker locations, to guide imputation of
missing genotype data
• Essential for matrix-based methods of analysis
• Allows accurate imputation of progeny from structured
mating design based on known parental haplotypes
• As the fundamental framework for knowledge of conifer
genes and regulatory elements, to enable future advances in
MAS strategies as technology develops.



What use is a reference genome sequence to applied tree breeding


• Primary goal: Produce improved genetic material for
deployment as planting stock, while maintaining sufficient
genetic diversity to manage risk.
  - Understanding biological mechanisms is not a goal, but it can
    be a tool.
• Primary tool: Modeling the genetic basis of phenotypic variation
in breeding populations.
  - Phenotypes measured in field tests of progeny from
    structured mating designs.
  - Genetic information primarily based on pedigree records
  - BLUP (best linear unbiased predictor) relies heavily on kinship


fermi - Exploring single-sample SNP and INDEL calling with whole-genome de novo assembly

Motivation: Eugene Myers in his string graph paper (Myers, 2005) suggested that in a string graph or equivalently a unitig graph, any path spells a valid assembly. As a string/unitig graph also encodes every valid assembly of reads, such a graph, provided that it can be constructed correctly, is in fact a lossless representation of reads. In principle, every analysis based on whole-genome shotgun sequencing (WGS) data, such as SNP and insertion/deletion (INDEL) calling, can also be achieved with unitigs.
Results: To explore the feasibility of using de novo assembly in the context of resequencing, we developed a de novo assembler, fermi, that assembles Illumina short reads into unitigs while preserving most of information of the input reads. SNPs and INDELs can be called by mapping the unitigs against a reference genome. By applying the method on 35-fold human resequencing data, we showed that in comparison to the standard pipeline, our approach yields similar accuracy for SNP calling and better results for INDEL calling. It has higher sensitivity than other de novo assembly based methods for variant calling. Our work suggests that variant calling with de novo assembly can be a beneficial complement to the standard variant calling pipeline for whole-genome resequencing. In the methodological aspects, we propose FMD-index for forward-backward extension of DNA sequences, a fast algorithm for finding all super-maximal exact matches and one-pass construction of unitigs from an FMD-index.

some useful awk lines

# - SAM files - #
#Count number of reads aligning to each contig/chromosome and print the total and the percentage (header lines starting with @ are skipped)
awk '!/^@/{c[$3]++; n++}END{for(j in c) print j,c[j],(c[j]/n*100),"%"}' Aligned.sam
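
A small self-contained check of the per-reference count, run on a made-up SAM fragment. The read and contig names are invented, and the alignment records are truncated to four columns for brevity. Header lines starting with @ are skipped so they do not inflate the total.

```shell
# Toy SAM input: two header lines plus four truncated alignment records.
sam_counts=$(
  printf '@HD\tVN:1.6\n@SQ\tSN:contig1\tLN:1000\nr1\t0\tcontig1\t10\nr2\t0\tcontig1\t50\nr3\t0\tcontig1\t70\nr4\t0\tcontig2\t5\n' |
  awk '!/^@/ {c[$3]++; n++}
       END {for (j in c) printf "%s %d %.1f%%\n", j, c[j], c[j] / n * 100}' |
  sort
)
echo "$sam_counts"
```

Using printf inside awk fixes the number of decimals, which makes the percentages easier to compare between runs.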

# - Blast files - #
#remove self hits
awk '$1!=$2' blast_all_vs_all/blast.tab > blast_all_vs_all/blast_no_self.tab
# how many matches are 200 bp or longer?
awk '$4>=200' blast_all_vs_all/blast_no_self.tab | wc -l
# of those, how many have at least 80% identity?
awk '$4>=200 && $3>=80' blast_all_vs_all/blast_no_self.tab | wc -l

#random awking
#Show lines where column 6 exceeds column 5 by more than 0.3
awk '($6-$5)>0.3' myfile.tab | less -S
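
The BLAST filters above can be tried on a toy table before running them on real output. This sketch assumes the standard tabular BLAST layout, where column 3 is percent identity and column 4 is alignment length. Only the first four columns are included, and the query and subject names are invented. Both conditions go into one awk call so that the identity filter applies only to hits that already passed the length filter.

```shell
# Toy tabular BLAST rows: query, subject, %identity, alignment length.
blast_rows=$(printf 'q1\tq1\t100.0\t500\nq1\ts2\t92.5\t350\nq2\ts3\t75.0\t400\nq3\ts4\t99.0\t120\n')

# Drop self hits, then count hits of at least 200 bp.
long_hits=$(printf '%s\n' "$blast_rows" |
  awk '$1 != $2 && $4 >= 200' | wc -l | tr -d ' ')

# Of those, count hits with at least 80% identity.
good_hits=$(printf '%s\n' "$blast_rows" |
  awk '$1 != $2 && $4 >= 200 && $3 >= 80' | wc -l | tr -d ' ')

echo "long hits: $long_hits; long hits with >=80% identity: $good_hits"
```

The tr call only strips the padding that some wc implementations add around the count.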

Genotype By Sequencing (GBS) Barcodes - a good exploratory post




Cronn Lab:Protocols


Illumina GA Data Management

Short read toolbox. Many of our projects use short-read data from Illumina Genome Analyzer and HiSeq. Brian Knaus from our lab developed a number of scripts for managing and analyzing short-read files and data for the GA1 and GA2 platforms.

Illumina GA DNA-Seq

DNA_Seq Prep. Our research group has developed several methods for sequencing small genomes (mitochondria, chloroplasts, BACs) in multiplex using Illumina GA2. This page provides details on DNA-Seq library construction.

Illumina GA RNA-Seq

RNA_Seq Prep. We do mRNA-sequencing using methods developed by Todd Mockler's group at Oregon State University. This page provides details on RNA-Seq library construction.

Illumina GA Hyb-Seq

Hyb_Seq Prep. Like many groups, we've developed customized approaches to enrich rare genomic targets for high-throughput sequencing. Our method for isolating chloroplast genomes by Hyb-Seq is detailed here.

Whole Genome Amplification

WGA Prep. We use phi29-based whole-genome amplification in a variety of different applications. Our standard phi29 WGA method is detailed here.

Purifying DNA with Agencourt AMPure Beads

AMPure_Mods. By altering the ratio of DNA:AMPure beads, it's possible to alter the size of the retained bands. We use AMPure beads to clean DNA bands, as well as reduce or eliminate the abundance of small DNAs (oligos, double-stranded adapters, primer dimers).

Random Lab Methods

Random Lab Methods. RNA extraction, DNA extraction, gels, short cuts... find it here

Rapid Isolation of RNA from Conifer Needles

Conifer_RNA_prep. Conifers join a long list of 'recalcitrant' plants that are difficult for RNA extraction, and fail using "traditional" RNA extraction kits. We use a modification of the method by Tai et al., 2004.

Isolation of poly(A) mRNA with Sera-Mag Oligo(dT) beads

mRNA_Prep. There are many methods for isolating mRNA from total RNA. We have had excellent results with Sera-Mag Oligo(dT) beads. We use this approach for constructing Illumina mRNA-Seq libraries, but it should be useful for any application that demands rRNA-depleted mRNA.

Preparation of strand-specific mRNA-Seq libraries with the Illumina TruSeq RNA Sample kit

directional-RNAseq_Prep. There are many methods available for making strand-specific mRNA libraries. We chose to adapt one of the most reliable methods identified by Levin et al. (2010) - dUTP labelling followed by dUTP degradation - for use in the Illumina TruSeq mRNA kit. This method produces mRNA-Seq libraries that are highly enriched for the complementary sequence of the native mRNA.

Rapid Isolation of DNA from Conifer Needles

DNA_Seq Prep. Conifers join a long list of 'recalcitrant' plants that are difficult for DNA extraction. We use a modification of the Fast-Prep method.

short read toolbox

1. http://brianknaus.com/software/srtoolbox/

2. http://openwetware.org/wiki/Short_read_toolbox_Botany2012

Why open source software?

Rocchini and Neteler 2012 Four Freedoms - An article which explains the importance of open source software in science.


Currently available platforms:
  • Illumina - Illumina (formerly Solexa).
  • 454 - 454/Roche.

Sequence format information

  • Short Read Toolbox - Descriptions and examples of qseq, scarf, fastq and fasta formats. Includes scripts to translate these formats to the fastq format standard.
  • FASTQ - Wikipedia's FASTQ page.
  • FASTA - Wikipedia's FASTA page.
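
As a rough illustration of how these formats relate, a four-line FASTQ record can be reduced to a two-line FASTA record with a short awk filter. This is a minimal sketch, not one of the Short Read Toolbox scripts. It assumes strictly four lines per record (no wrapped sequences), and the read name and sequence are made up.

```shell
# One toy FASTQ record: identifier, sequence, separator, quality.
fastq=$(printf '@read1\nACGTTGCA\n+\nIIIIHHHH\n')

# Keep lines 1 and 2 of every 4-line record; swap the leading @ for >.
fasta=$(printf '%s\n' "$fastq" |
  awk 'NR % 4 == 1 {print ">" substr($0, 2)}
       NR % 4 == 2 {print}')

echo "$fasta"
```

The quality line is discarded, so the conversion cannot be reversed.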

Alignment format information

Short-read quality control software

  • TileQC - Requires R, RMySQL and MySQL.
  • FastQC - A quality control tool for high throughput sequence data. A Java application.
  • Short Read Toolbox - Scripts for quality control of Illumina data.

Open source de novo genome assemblers

  • Velvet - Implements De Bruijn Graphs in C. Requires 64 bit Linux OS.
  • ABySS - Multi-threaded de novo assembly.

Open source de novo transcriptome assemblers

  • Trinity - De novo assembler designed specifically for transcriptomes.
  • Rnnotator - Uses multiple calls to velvet (see de novo genome assemblers).
  • Trans-ABySS - Uses multiple calls to ABySS (see de novo genome assemblers).
  • Oases - Post-processes velvet output (see de novo genome assemblers) for transcriptomic work.

Hybrid assemblers (reference guided & de novo)

Open source reference guided assemblers

  • SOAP - Short Oligonucleotide Analysis Package.
  • MAQ - Mapping and Assembly with Qualities.
  • Bowtie - An ultrafast, memory-efficient short read aligner.
  • BWA - Burrows-Wheeler aligner.

SNP discovery and calling

Assembly viewers

  • Tablet - Visualizes ACE, AFG, MAQ, SOAP, SAM and BAM formats.
  • SAMtools - SAMtools.

Sequence query programs

  • PLAN - A web application for conducting, organizing, and mining large-scale BLAST searches (limited to 1,000 queries).
  • BLAT - BLAT.


A very brief example to demonstrate file input/output.
use strict;
use warnings;
my (@temp, $in, $out);
my $inf = "data.fq";
my $outf = "data_out.fq";
open($in, "<", $inf) or die "Can't open $inf: $!";
open($out, ">", $outf) or die "Can't open $outf: $!";
# Read one four-line FASTQ record per iteration.
while (<$in>) {
  chomp($temp[0]=$_); # First line is an identifier.
  chomp($temp[1]=<$in>); # Second line is sequence.
  chomp($temp[2]=<$in>); # Third line is an identifier.
  chomp($temp[3]=<$in>); # Fourth line is quality.
  # Write the record as a single tab-delimited line.
  print $out join("\t", @temp)."\n";
}
close $in or die "$inf: $!";
close $out or die "$outf: $!";
  • perlintro - Introduction to perl with links to other documentation.
  • BioPerl beginners - Introduction to BioPerl (be prepared for object oriented code).

R project

Computing resources

  • Galaxy - Web-based front end for popular bioinformatic tools.
  • Atmosphere - Virtual computing at iPlant.
  • XSEDE portal - Extreme Science and Engineering Discovery Environment.

Useful links