SSU Ref NR (non-redundant)

Why a non-redundant version of the SILVA SSU Ref database?

For users interested in representative (rRNA) sequence collections, the rapid growth of the data sets has led to immense hardware requirements and significantly longer analysis times. With large databases such as the current rRNA data sets, ARB in particular requires large amounts of main memory (RAM) to load the database (see also box below).

Basically, there are two options to face this problem: (1) a hardware upgrade to provide the amount of RAM required or (2) a reduction of the number of sequences in the ARB database to bring the RAM requirements down to the current hardware specifications.

For multiple reasons, the second option should be preferred as long as the resulting data set is still "representative", a very important parameter in environmental microbiology. The SILVA project has therefore addressed this task, resulting in a "non-redundant" (NR) SSU Ref dataset built by dereplicating the full SSU Ref (including HSM/MWM/GNHM datasets) using a 99% identity criterion.

As of SILVA release 119, the SSU Ref NR is the only SSU dataset with a manually curated guide tree. SSU Ref is still provided as an ARB dataset, but without the guide tree.

Background information for current release (SSU Ref NR 119, July 2014)

The SSU Ref NR 119 dataset is based on the full SSU Ref 119 dataset (see SILVA 119 documentation), in total encompassing 1,583,968 sequences.

By applying a 99% identity criterion with the UCLUST tool to remove highly similar sequences, the number of sequences in the SSU Ref NR 119 dataset was reduced to 534,968, just about 34% of the database entries in the full SSU Ref 119. The results of the UCLUST calculation (the reference sequences of each cluster) have been mapped to the original SSU Ref tree and were spot-checked for consistency by expert eye. Sequences from cultivated species have been preserved in all cases. Please note that, due to this preservation and to additional technical limitations (clustering of large datasets), the dataset can still contain sequences with an identity of >99%.
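
The dereplication idea described above can be illustrated with a minimal sketch. It is not the UCLUST algorithm, which uses its own heuristics. It is a simple greedy centroid clustering, assuming that all sequences come from the same alignment (equal length, gaps written as "-" or "."). The names identity and dereplicate and the keep parameter (standing in for sequences from cultivated species) are hypothetical and chosen for illustration only.

```python
GAPS = frozenset("-.")


def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two aligned sequences.

    Assumption: both sequences have the same length because they come
    from the same alignment. Columns where both have a gap are ignored.
    """
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")
    compared = 0
    matches = 0
    for x, y in zip(a, b):
        if x in GAPS and y in GAPS:
            continue
        compared += 1
        if x == y:
            matches += 1
    return matches / compared if compared else 0.0


def dereplicate(seqs: dict, threshold: float = 0.99, keep=frozenset()) -> list:
    """Greedy centroid clustering (illustrative, not UCLUST).

    seqs maps sequence names to aligned sequences and is processed in
    insertion order. A sequence becomes a new cluster reference if it is
    below the threshold to every existing reference. Names listed in
    keep are retained even when they are redundant.
    """
    centroids = []
    retained = []
    for name, seq in seqs.items():
        redundant = any(identity(seq, seqs[c]) >= threshold for c in centroids)
        if not redundant:
            centroids.append(name)
            retained.append(name)
        elif name in keep:
            # Preserved regardless of clustering, so it may be more than
            # 99% identical to another retained sequence.
            retained.append(name)
    return retained
```

In this sketch the result depends on the input order, and any name passed in keep survives even if it falls inside an existing cluster. This mirrors why the NR dataset can still contain sequence pairs with an identity of >99%.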

The final dataset can be used as a representative environmental dataset for classification, phylogenetic analysis and probe design (for probe matching, however, you should use a comprehensive database).

To make the quality of the clustering analysis fully transparent, we have uploaded a complete SSU Ref 104 dataset from February 2011 to which the clustering results (the reference sequence of each cluster) have been mapped (link to file). The mapping can be accessed via the ARB configurations: mark all sequences of the configuration "NR_cluster_references", and all sequences of the NR dataset will be highlighted in the tree (the remaining sequences were removed in the NR). Please keep in mind, however, that for the final NR all cultivated organisms have been preserved independently of the clustering results, and that the tree has been manually curated.

Downloads

The SSU Ref NR 119 dataset, or subsets thereof, can be downloaded via the Browser and as an ARB database file in the common .arb format.

FASTA exports of the NR dataset are also available in the SILVA Archive (release_119/Exports). The archive also contains older (smaller!) versions of the SSU Ref NR dataset (ARB database files and FASTA exports). SSU Ref NR files have been produced since SILVA release 102.
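
After downloading a FASTA export, a quick sanity check is to count its records and compare the result with the number of sequences reported for the release. The following is a minimal sketch. The function name count_fasta_records is hypothetical, and the only assumption about the file is that each record starts with a ">" header line and that the file is either plain text or gzip-compressed (recognised by a ".gz" suffix).

```python
import gzip
from pathlib import Path


def count_fasta_records(path) -> int:
    """Count records in a FASTA file by counting '>' header lines.

    Assumption: files ending in '.gz' are gzip-compressed, all others
    are plain text.
    """
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt") as handle:
        return sum(1 for line in handle if line.startswith(">"))
```

Called with the path of a downloaded export, the function returns the number of sequences in the file, which can then be compared with the figure given in the release documentation.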

How to integrate the Ref NR in your workflow and hardware requirements

The SSU Ref NR is intended as a starting point for your ARB/SILVA work with only moderate hardware requirements. It is a representative set of sequences with all features of the full SSU Ref database (same alignment, navigation tree containing all sequences, new SILVA taxonomy etc.); only the total number of sequences is reduced by dereplication using a 99% identity criterion.

Once downloaded as an ARB file, the database can be supplemented with additional sequences using the Browse and Search functions of the SILVA webpage and, afterwards, the ARB merge tool. This is useful, for example, if you want the full diversity represented in the SILVA Parc/Ref databases for selected groups/clusters. Of course, you can also delete selected groups/clusters from the SSU Ref NR data set to compensate for sequence additions.

Since ARB is a so-called "in-memory" database, the larger a dataset is, the more main memory (RAM) ARB requires to handle it. This is the only significant hardware requirement of ARB. Currently, however, it represents a severe bottleneck for many users due to the rapid growth of the rRNA datasets.

The following table, provided by Ribocon, gives concrete numbers on the ARB hardware requirements for the SILVA SSU Ref NR dataset and a general idea of the correlation between dataset size and ARB memory usage: ARB/SILVA Memory Requirements v1.3 (pdf, 34 kb).