Cluster taxon- or sample-wise — findClusters • seqgroup

Cluster sequencing data taxon- or sample-wise. By default, data are clustered sample-wise. The default method is Dirichlet-Multinomial mixtures (DMM), as implemented in the DirichletMultinomial package. DMM expects counts and fails if any taxon is absent across all samples, so abundances are scaled to counts and taxa absent across all samples are removed. Counts should also not be too large for DMM (not above 10000); larger counts are scaled (divided by a constant and rounded). In addition, DMM expects the total number of counts to be the same across samples. Partitioning around medoids (PAM) does not expect counts and can deal with absent taxa and large counts. The cluster number can be omitted (k = NA), in which case the best cluster number in the range minK to maxK is determined using either the silhouette or a cluster quality index implemented in clusterCrit's function intCriteria.
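The DMM preprocessing described above can be illustrated with a small base R sketch. This is not seqgroup's actual code: the helper name prepare_for_dmm, the max_count argument, and the choice of scaling constant (largest value divided by the limit) are assumptions made for illustration, and the input is assumed to already hold non-negative counts with taxa as rows.

```r
# Hypothetical helper, not part of seqgroup: prepare a taxa-by-samples
# count matrix along the lines described for DMM.
prepare_for_dmm <- function(abundances, max_count = 10000) {
  largest <- max(abundances)
  # Assumption: scale by a single constant so the largest value equals max_count.
  if (largest > max_count) {
    abundances <- abundances / (largest / max_count)
  }
  counts <- round(abundances)
  # Drop taxa (rows) that are zero in every sample, including any that
  # became all-zero through rounding.
  counts[rowSums(counts) > 0, , drop = FALSE]
}

example_counts <- matrix(
  c(0,     0,     0,
    50000, 10,    0,
    5,     20000, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("taxonA", "taxonB", "taxonC"), c("s1", "s2", "s3"))
)
prepared <- prepare_for_dmm(example_counts)
```

Here taxonA is removed because it is absent from all samples, and the remaining values are divided by 5 and rounded so that none exceeds 10000. The sketch does not equalise sample totals.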

findClusters(
  abundances,
  method = "dmm",
  k = 3,
  minK = 2,
  maxK = 10,
  qualityIndex = "silhouette"
)

Arguments

abundances	a matrix with taxa as rows and samples as columns
method	clustering method; supported are dmm and pam
k	cluster number; set to NA to determine the cluster number with a quality index
minK	the minimum cluster number to test
maxK	the maximum cluster number to test
qualityIndex	quality index used to select the cluster number; by default silhouette, computed with the silhouette function on Bray-Curtis dissimilarities; can also be CH, Dunn, or another quality index supported by clusterCrit's intCriteria function, or none
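To make the interplay of k, minK, maxK and the silhouette index concrete, the following sketch scans a range of cluster numbers with PAM on Bray-Curtis dissimilarities and keeps the one with the highest average silhouette width. It is an illustration under assumptions, not seqgroup's implementation: the helpers bray_curtis and pick_k_by_silhouette and the toy data are hypothetical, and pam comes from the cluster package (a recommended package bundled with most R installations).

```r
library(cluster)  # provides pam()

# Hypothetical helper: Bray-Curtis dissimilarities between samples (columns).
bray_curtis <- function(abundances) {
  samples <- t(abundances)
  numerator <- as.matrix(dist(samples, method = "manhattan"))
  denominator <- outer(rowSums(samples), rowSums(samples), "+")
  as.dist(numerator / denominator)
}

# Hypothetical helper: try each k in minK..maxK and keep the k with the
# highest average silhouette width.
pick_k_by_silhouette <- function(abundances, minK = 2, maxK = 10) {
  d <- bray_curtis(abundances)
  ks <- minK:maxK
  widths <- sapply(ks, function(k) pam(d, k = k)$silinfo$avg.width)
  best_k <- ks[which.max(widths)]
  list(
    k = best_k,
    widths = setNames(widths, ks),
    membership = pam(d, k = best_k)$clustering
  )
}

# Toy data: 6 taxa, 20 samples, two groups dominated by different taxa.
set.seed(1)
group_a <- matrix(rpois(60, lambda = rep(c(50, 50, 50, 1, 1, 1), 10)), nrow = 6)
group_b <- matrix(rpois(60, lambda = rep(c(1, 1, 1, 50, 50, 50), 10)), nrow = 6)
toy <- cbind(group_a, group_b)
dimnames(toy) <- list(paste0("taxon", 1:6), paste0("sample", 1:20))

result <- pick_k_by_silhouette(toy, minK = 2, maxK = 5)
```

The membership element holds one cluster label per sample, in the column order of the input matrix.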

Value

a cluster membership vector

Examples

data("ibd_taxa")
# unequal sample sums in the example data set are due to the removal of low-prevalence taxa
clusters <- findClusters(ibd_taxa, k = NA, method = "pam", qualityIndex = "CH")
#> [1] "k equal 2 has quality value 16.0904817118765"
#> [1] "k equal 3 has quality value 15.6880504982704"
#> [1] "k equal 4 has quality value 12.8737754811385"
#> [1] "k equal 5 has quality value 13.7168831544949"
#> [1] "k equal 6 has quality value 13.9996390736813"
#> [1] "k equal 7 has quality value 14.0650955457399"
#> [1] "k equal 8 has quality value 13.7596433029095"
#> [1] "k equal 9 has quality value 13.7172891077008"
#> [1] "k equal 10 has quality value 13.65667748236"
#> [1] "The optimal cluster number according to Calinski_Harabasz is 2 with value 16.0904817118765"
table(clusters)
#> clusters
#>   1   2 
#> 147   8