PCoA for microbial sequencing data — seqPCoA • seqgroup

A wrapper around various PCoA-based analyses implemented in vegan. The wrapper can handle groups and metadata. PCoA is carried out sample-wise. The na.action is set to na.omit, however envfit cannot deal with missing values, therefore if metadata are provided, they should be free of missing values.

seqPCoA(
  abundances,
  reference = NULL,
  rarefyRef = FALSE,
  addToRefStepwise = FALSE,
  refName = "ref",
  metadata = NULL,
  groupAttrib = "",
  groups = c(),
  groupColors = NULL,
  colors = c(),
  clusters = c(),
  labels = c(),
  sizes = c(),
  size.legend = "",
  time = c(),
  hiddenTaxa = c(),
  hiddenSamples = c(),
  dis = "bray",
  rda = FALSE,
  scale = FALSE,
  doScree = FALSE,
  topTaxa = 10,
  topMetadata = 10,
  arrowFactor = 0.5,
  metadataFactor = 1,
  centroidFactor = 1,
  taxonColor = "brown",
  metadataColor = "blue",
  hideGroupLegend = FALSE,
  drawEllipse = FALSE,
  ellipseOnClusters = FALSE,
  ellipseColorMap = NULL,
  clusterQualityIndex = "silhouette",
  groupDispersion = FALSE,
  xlim = NULL,
  ylim = NULL,
  permut = 1000,
  env.permut = 1000,
  pAdjMethod = "BH",
  qvalThreshold = 0.05,
  ellipseConf = 0.95,
  dimensions = c(1, 2),
  ...
)

Arguments

abundances	a matrix with taxa as rows and samples as columns
reference	an optional reference data set on which abundances are mapped; data are merged by matching row names (nonmatching ones are kept as sum); cannot be combined with rda or topMetadata (topMetadata needs to be set to zero)
rarefyRef	rarefy abundance and reference samples to the minimum total count found in any of the samples; recommended when the total counts differ
addToRefStepwise	compute ordination coordinates for each sample added to the reference separately; cannot be combined with metadata, rda or drawEllipse
refName	group name for reference samples
metadata	an optional data frame with metadata items as columns, where samples are in the same order as in abundances and data types (factor vs numeric) are supposed to be correct; if provided and rda is FALSE, envfit is carried out
groupAttrib	optional: the name of a metadata item that refers to a vector that provides for each sample its group membership
groups	an optional vector that provides for each sample its group membership and which is overridden by groupAttrib, if provided
groupColors	an optional map of predefined colors for groups that matches names in groups (which should be strings); adds a color legend to the plot; if reference is provided, refName is added if absent
colors	an optional vector of colors with entries in the same order as samples to be colored; it overrides groupColors if provided
clusters	an optional vector that provides for each sample its cluster membership (cluster membership is visualized through shape, up to 10 different shapes are possible)
labels	an optional vector that provides for each sample a label to display
sizes	a vector of numeric values that will be displayed as varying sample sizes (sizes will be shifted into positive range if necessary and scaled between 0.5 and 2.5)
size.legend	a string displayed as a legend for size
time	an optional vector with as many time points as samples, adds arrows between consecutive time points (time points should be in ascending order)
hiddenTaxa	an optional vector with names of taxa to be hidden (they will be taken into account for PCoA/RDA/envfit, but are not displayed among top co-varying taxa)
hiddenSamples	an optional vector with indices of samples to be hidden (they will be taken into account for PCoA/RDA/envfit, but are not displayed)
dis	dissimilarity or distance supported by vegan's vegdist function (if set to cor, a PCA is carried out using vegan's function rda with scale set to true)
rda	carry out an RDA instead of a PCoA using vegan's capscale function
scale	scale numeric metadata (subtract the mean and divide by standard deviation)
doScree	do a Scree plot
topTaxa	if larger than zero: show the top N taxa most strongly covarying with principal components as arrows in the PCoA if they are significant according to a permutation test
topMetadata	if larger than zero, metadata provided and rda false: show the top N most significant numeric metadata as arrows and the top N most significant factor metadata as text in the PCoA
arrowFactor	the length of taxon arrows (determined by scaled covariance) is multiplied with this factor
metadataFactor	the length of numeric metadata arrows (determined by Pearson correlation) is multiplied with this factor
centroidFactor	centroid positions (representing categoric metadata) are multiplied with this factor
taxonColor	the color of the taxon arrows and text
metadataColor	the color of the metadata arrows and text
hideGroupLegend	do not show the group legend
drawEllipse	if groups or groupAttrib given, draw polygons encapsulating groups using vegan's ordiellipse function (kind is sd, conf given via ellipseConf); print adonis R2 and p-value (permutation number given via env.permut)
ellipseOnClusters	draw ellipse around cluster members (indicated by shape) instead of group members (indicated by color)
ellipseColorMap	color ellipses according to this color list; entries have to be group names or, if ellipseOnClusters is true, cluster names; overrides groupColors for ellipses
clusterQualityIndex	if groups or groupAttrib given, report cluster quality according to silhouette function or any criterium supported by package clusterCrit (default: silhouette, set to none to disable computation of cluster quality)
groupDispersion	if groups or groupAttrib given, report Tukey's HSD test on differences in group dispersions (avg distance of group members to centroid) and do boxplot (wraps vegan's betadisper)
xlim	range shown on the x axis, by default the minimum and maximum of the first selected component
ylim	range shown on the y axis, by default the minimum and maximum of the second selected component
permut	number of permutations for top-covarying taxa; if NA, NULL or smaller than 1, no permutation test is carried out
env.permut	number of permutations for envfit, if drawEllipse is true, for adonis
pAdjMethod	method for multiple testing correction supported by p.adjust for top-covarying taxon and envfit p-values
qvalThreshold	threshold on multiple-testing corrected top-covarying taxon and envfit p-values
ellipseConf	confidence limit for drawEllipse
dimensions	the principal components used for plotting, by default the first and second
\dots	Additional arguments passed to plot()

Details

When a reference and groups are provided and the number of group memberships does not equal the number of samples in the combined abundance table, groups are automatically extended such that reference samples are assigned to a single group with name refName, which is colored in gray. The color vector is likewise extended if provided. If a clusters, time and/or labels vector is provided together with a referene, it has to refer to both data sets. Samples in the abundance matrix are appended after the reference samples, so cluster memberships, time points and/or labels have to be provided in the same order. Different total counts in abundances and reference samples may bias the result, therefore rarefyRef allows to rarefy both to the same total count after matching. RarefyRef will rarefy all samples to the lowest total count found in any sample. Rows with zero counts after rarefaction are removed. When topTaxa is set larger zero, significant top-varying taxa are shown. The permutation test is carried out by shuffling the selected number of top-covarying taxa. Multiple testing correction on parameter-free p-values is then only applied to these top-covarying taxa. The strength of covariance is determined as the norm of the vectors resulting from multiplying the standardized eigen vectors with the taxa. In contrast, envfit p-values are computed for all metadata and multiple-testing correction is consequently applied to all metadata provided, though only the selected number of most significant metadata are shown. Thus, topTaxa ranks taxa by covariance with significance as a filter, whereas topMetadata ranks metadata by significance. The number in the axis label brackets refers to the proportion of variance explained as computed with vegan's eigenvals function. Note that ordination values in plots are not scaled (equivalent to plot.cca(scaling=0)). If groups are provided, drawEllipse can be enabled to draw ellipses with vegan's ordiellipse and to run vegan's adonis (PERMANOVA) to assess whether group composition differs significantly. Adonis is sensitive to differences in within-group variation. To assess whether within-group variation differs, betadisper can be carried out by enabling groupDispersion. In addition, if groups are provided, seqPCoA computes cluster quality indices to assess group separation with package clusterCrit. Concerning interpretation of cluster quality indices: The silhouette index ranges between -1 and 1, a large Dunn's, Silhouette or Calinski-Harabasz index and a small Davies-Bouldin or C-index indicate well-defined clusters, respectively.

Examples

data("ibd_taxa")
data("ibd_metadata")
ibd_metadata=assignMetadataTypes(ibd_metadata,categoric=c("SRA_metagenome_name","Diagnosis"))
#> [1] "Metadata SRA_metagenome_name is categoric"
#> [1] "Metadata Age is numeric"
#> [1] "Metadata Diagnosis is categoric"
#> [1] "Metadata Fecal.Calprotectin is numeric"
#> [1] "Metadata antibiotic is binary"
#> [1] "Metadata immunosuppressant is binary"
#> [1] "Metadata mesalamine is binary"
#> [1] "Metadata steroids is binary"
seqPCoA(ibd_taxa,groups=as.vector(ibd_metadata$Diagnosis),topTaxa=30, drawEllipse=TRUE)
#> [1] "Adonis to test for significant difference in group compositions"
#> [1] "Adonis R2: 0.0604, p-value: 0.001"
#> [1] "Cluster quality index silhouette"
#> [1] -0.02240243
#> [1] "Among the top 30 covarying taxa, 2 are significant."
#> [1] "Subdoligranulum_unclassified"
#> [1] "Escherichia_unclassified"
# remove 65 samples with missing calprotectin measurements or other missing values in the metadata
na.indices=unique(which(is.na(ibd_metadata),arr.ind=TRUE)[,1])
indices.to.keep=setdiff(1:nrow(ibd_metadata),na.indices)
ibd_metadata=ibd_metadata[indices.to.keep,]
ibd_taxa=ibd_taxa[,indices.to.keep]
seqPCoA(ibd_taxa,metadata=ibd_metadata,groups=as.vector(ibd_metadata$Diagnosis),topTaxa=30)
#> [1] "1 significant numeric metadata found, in order of significance:"
#> [1] "Fecal.Calprotectin"
#> [1] "4 significant categoric metadata found, in order of significance:"
#> [1] "Diagnosis"
#> [1] "antibiotic"
#> [1] "immunosuppressant"
#> [1] "steroids"
#> [1] "Cluster quality index silhouette"
#> [1] -0.01835481
#> [1] "Among the top 30 covarying taxa, 4 are significant."
#> [1] "Faecalibacterium_prausnitzii"
#> [1] "Ruminococcus_bromii"
#> [1] "Streptococcus_thermophilus"
#> [1] "Bacteroides_uniformis"