PICRUSt is a bioinformatics software package. The name is an abbreviation for Phylogenetic Investigation of Communities by Reconstruction of Unobserved States.
The tool is used in metagenomic analysis, where it infers the functional profile of a microbial community from a marker gene survey of one or more samples. In essence, PICRUSt takes a user-supplied operational taxonomic unit table (typically referred to as an OTU table), representing the marker gene sequence clusters (most commonly 16S clusters), each accompanied by its relative abundance in each of the samples. The output of PICRUSt is a sample-by-functional-gene count matrix, giving the count of each functional gene in each of the samples surveyed. The ability of PICRUSt to estimate the functional gene profile of a given sample relies on a set of known sequenced genomes. PICRUSt can also be thought of as an automated alternative to manually researching the gene families likely to be present in organisms whose sequences are found in a 16S ribosomal RNA amplicon library. The description below corresponds to the original version of PICRUSt, but a major update to this tool is currently being developed.
In an initial preprocessing phase, PICRUSt constructs confidence intervals and point predictions for the number of copies of each gene family in each bacterial and archaeal strain in a reference tree, using organisms with sequenced genomes as a reference. More specifically, for each gene family, PICRUSt maps known gene copy numbers (from completely sequenced genomes) onto a reference tree of life. These gene family copy numbers are treated as continuous traits, and an evolutionary model is constructed under the assumption of Brownian motion. These evolutionary models can be constructed with maximum likelihood, relaxed maximum likelihood, or Wagner parsimony. The evolutionary model is then used to predict both a point estimate and a confidence interval for the copy number in microorganisms without sequenced genomes. This ‘genome prediction’ step produces a large table of bacterial types (specifically operational taxonomic units, or OTUs) vs. gene family copy numbers. This table is distributed to end users. It is important to note that this prediction method is not the same as a nearest neighbor approach (i.e. simply looking up the nearest sequenced genome), and it was shown to give a small but significant improvement in accuracy over that strategy. However, nearest neighbor prediction is available as an option in PICRUSt.
Notably, while this functionality is typically used for prediction of gene copy numbers in bacteria, it could, in principle, be used for prediction of any other continuous trait given trait data for diverse organisms and a reference phylogeny.
Langille et al. tested the accuracy of this genome prediction step using leave-one-out cross-validation on the input set of sequenced genomes. Additional tests examined sensitivity to errors in phylogenetic inference, sensitivity to a lack of genomic data, and the accuracy of the confidence intervals on gene content.
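The leave-one-out idea can be illustrated with a minimal sketch. It is not PICRUSt's code: for simplicity it uses nearest neighbor lookup as the predictor rather than the Brownian motion model, and the function names, genome labels, distances, and copy numbers are all hypothetical. Each genome is held out in turn, its copy number is predicted from the remaining genomes, and the prediction is compared with the known value.

```python
def nearest_neighbor_predict(target, distances, traits):
    """Predict the trait of `target` from its closest other genome.

    `distances` maps frozenset({a, b}) to a phylogenetic distance and
    `traits` maps genome name to a known gene copy number (both hypothetical).
    """
    candidates = [g for g in traits if g != target]
    nearest = min(candidates, key=lambda g: distances[frozenset((target, g))])
    return traits[nearest]


def leave_one_out_errors(distances, traits):
    """Hold out each genome in turn and record the absolute prediction error."""
    errors = {}
    for held_out in traits:
        predicted = nearest_neighbor_predict(held_out, distances, traits)
        errors[held_out] = abs(predicted - traits[held_out])
    return errors


# Toy data: three sequenced genomes with known copy numbers of one gene family.
traits = {"A": 2, "B": 2, "C": 5}
distances = {
    frozenset(("A", "B")): 0.1,
    frozenset(("A", "C")): 0.9,
    frozenset(("B", "C")): 0.8,
}
errors = leave_one_out_errors(distances, traits)
print(errors)
```

In this toy case the closely related genomes A and B predict each other exactly, while the distant genome C is predicted poorly, which is the kind of dependence on phylogenetic distance that such a test is meant to expose.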
A similar step predicts the copy number of 16S rRNA genes.
When applying PICRUSt to a 16S rRNA gene library, PICRUSt matches reference operational taxonomic units against the tables, and retrieves a predicted 16S rRNA copy number and gene copy number for each gene family. The abundance of each OTU is divided by its predicted copy number (if a bacterium has multiple 16S copies, its apparent abundance in 16S rRNA data will be inflated), and then multiplied by the copy number of the gene family. This gives a prediction for the contribution of each OTU to the overall gene content of the sample (the metagenome). Finally, these individual contributions are summed together to produce an estimate of the genes present in the metagenome.
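The arithmetic described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not PICRUSt's implementation: the function name, the dictionary layout, and the OTU and gene family labels are hypothetical. Each OTU count is divided by the predicted 16S copy number, multiplied by the predicted copy number of each gene family, and the contributions are summed per sample.

```python
def predict_metagenome(otu_table, copies_16s, gene_copies):
    """Estimate gene family counts per sample from an OTU table.

    otu_table:   {otu: [count in sample 0, count in sample 1, ...]}
    copies_16s:  {otu: predicted 16S rRNA gene copy number}
    gene_copies: {otu: {gene_family: predicted copy number}}
    Returns one {gene_family: estimated count} dict per sample.
    """
    n_samples = len(next(iter(otu_table.values())))
    metagenome = [dict() for _ in range(n_samples)]
    for otu, counts in otu_table.items():
        for sample, count in enumerate(counts):
            # Correct for inflated abundance caused by multiple 16S copies.
            normalized = count / copies_16s[otu]
            for family, copies in gene_copies[otu].items():
                # Contribution of this OTU to the sample's gene content.
                contribution = normalized * copies
                metagenome[sample][family] = (
                    metagenome[sample].get(family, 0.0) + contribution
                )
    return metagenome


# Toy input: two OTUs, two samples, two gene families (all values invented).
otu_table = {"otu1": [10, 0], "otu2": [6, 4]}
copies_16s = {"otu1": 2.0, "otu2": 1.0}
gene_copies = {
    "otu1": {"geneA": 1.0},
    "otu2": {"geneA": 2.0, "geneB": 3.0},
}
result = predict_metagenome(otu_table, copies_16s, gene_copies)
print(result)
```

For the first sample, otu1 contributes 10 / 2 × 1 = 5 copies of geneA and otu2 contributes 6 / 1 × 2 = 12, giving 17 in total.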
Langille et al. (2013) tested the accuracy of this metagenome prediction step by using previously reported datasets in which the same biological sample was subjected to both 16S rRNA gene amplification and shotgun metagenomics. In these cases, the shotgun metagenomic results were taken as a representation of the ‘true’ community, and the 16S rRNA gene amplicon libraries were fed into PICRUSt to attempt to predict those data. Test datasets included human microbiome samples from the Human Microbiome Project, soil samples, diverse mammalian samples, and samples from the Guerrero Negro microbial mats.