SSU Ref NR (non-redundant)

Why a non-redundant version of the SILVA SSU Ref database?

For users interested in representative (rRNA) sequence collections, the rapid growth of the data sets has led to immense hardware requirements and significantly longer analysis times. With large databases such as the current rRNA data sets, ARB in particular requires large amounts of main memory (RAM) to load the database (see also box below).

There are two ways to address this problem: (1) a hardware upgrade to provide the amount of RAM required or (2) a reduction of the number of sequences in the ARB database to bring the RAM requirements down to the current hardware specifications.

For several reasons, the second option should be preferred as long as the resulting data set is still "representative", a very important parameter in environmental microbiology. The SILVA project has therefore addressed this task, resulting in a "non-redundant" (NR) SSU Ref dataset built by dereplicating the SSU Ref plus HSM/MWM/GNHM datasets using a 98% identity criterion (before SILVA release 111 it was 99%).

Background information for current release (SSU Ref NR 111, July 2012)

The SSU Ref NR 111 dataset is based on the full SSU Ref 111 dataset plus the separated HSM/MWM/GNHM dataset (see SILVA 111 documentation), in total encompassing 1,230,763 sequences.

By applying a 98% identity criterion with the UCLUST tool to remove highly similar sequences (the previous release used 99%), the number of sequences in the SSU Ref NR 111 dataset was reduced to 286,858, about 23% of the original database entries. The results of the UCLUST calculation (the reference sequences of each cluster) have been mapped to the original SSU Ref tree and checked at random for consistency by experts. Sequences from cultivated species have been preserved in all cases. Please note that due to this preservation and additional technical limitations (clustering of large datasets), the dataset can still contain sequences that are more than 98% identical to one another.
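
The interplay of an identity threshold and the preservation of cultivated sequences can be illustrated with a minimal Python sketch. This is not the UCLUST algorithm and not the SILVA pipeline: it assumes pre-aligned sequences of equal length, a simple greedy pass in input order, and an identity measure (matching positions over the columns where both sequences carry a base) chosen purely for illustration. The names aligned_identity, dereplicate and always_keep are hypothetical.

```python
from typing import Dict, Iterable, List

GAP_CHARS = {"-", "."}  # assumed gap symbols in the alignment


def aligned_identity(a: str, b: str) -> float:
    """Fraction of identical positions among columns where both sequences have a base."""
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")
    compared = 0
    matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in GAP_CHARS or y in GAP_CHARS:
            continue  # skip columns where either sequence has a gap
        compared += 1
        if x == y:
            matches += 1
    return matches / compared if compared else 0.0


def dereplicate(
    seqs: Dict[str, str],
    threshold: float = 0.98,
    always_keep: Iterable[str] = (),
) -> List[str]:
    """Greedy dereplication: keep a sequence unless it reaches the identity
    threshold against an already kept representative. Names listed in
    always_keep (e.g. cultivated species) are retained unconditionally."""
    protected = set(always_keep)
    representatives: List[str] = []
    for name, seq in seqs.items():
        if name in protected:
            representatives.append(name)
            continue
        redundant = any(
            aligned_identity(seq, seqs[rep]) >= threshold
            for rep in representatives
        )
        if not redundant:
            representatives.append(name)
    return representatives
```

Because protected sequences are retained even when they fall within the threshold of an existing representative, the result of such a procedure can still contain pairs above the threshold, which matches the note above about remaining sequences with more than 98% identity.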

The final dataset can be used as a representative environmental dataset for classification, phylogenetic analysis and probe design. For probe matching, however, you should use a comprehensive database.

To make the quality of the clustering analysis fully transparent, we have uploaded a complete SSU Ref 104 dataset from February 2011 to which the clustering results (the reference sequence of each cluster) have been mapped (link to file). The mapping can be accessed via the ARB configurations: mark all sequences of the configuration "NR_cluster_references" and all sequences of the NR dataset will be highlighted in the tree (the rest has been removed in the NR). Please keep in mind, however, that in the final NR all cultivated organisms have been preserved independently of the clustering results, that it also contains sequences from the HSM project, and that the tree has been manually curated.

Downloads

The SSU Ref NR 111 dataset is an ARB database file in the common .arb format and can be downloaded in the SILVA Download section via ARB Files.

 

FASTA exports of the NR dataset are also available in the SILVA Archive (release_111/Exports). The archive also contains older (smaller!) versions of the SSU Ref NR dataset (ARB database files and FASTA exports). The project started with SILVA release 102.
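
After downloading a FASTA export, a quick sanity check is to count its records and compare the result with the number of sequences stated for the release. The following Python sketch assumes an uncompressed FASTA file in which every record starts with a ">" header line; the function name count_fasta_records is hypothetical, and the file path is supplied by the user on the command line.

```python
import sys
from typing import Iterable


def count_fasta_records(lines: Iterable[str]) -> int:
    """Count FASTA records by counting header lines that start with '>'."""
    return sum(1 for line in lines if line.startswith(">"))


if __name__ == "__main__" and len(sys.argv) == 2:
    # Path to a downloaded, uncompressed FASTA export (supplied by the user).
    with open(sys.argv[1], "r", encoding="utf-8") as handle:
        print(count_fasta_records(handle))
```

Reading the file line by line keeps memory usage low regardless of the size of the export.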

 

The main memory (RAM) requirements of the ARB/SILVA databases including the SSU Ref NR 106 are shown in this table (pdf, 34 kb - see also below).

How to integrate the Ref NR in your workflow and hardware requirements

The SSU Ref NR is intended as a starting point for your ARB/SILVA work with only moderate hardware requirements. It is a representative set of sequences with all features of the full SSU Ref database (same alignment, navigation tree containing all sequences, new SILVA taxonomy etc.). Only the total number of sequences is reduced, by dereplication using a 98% identity criterion (before SILVA release 111 it was 99%).

Once downloaded as an ARB file, the database can be supplemented with additional sequences using the Browse and Search functions of the SILVA webpage followed by the ARB merge tool, e.g. if you want the full diversity represented in the SILVA Parc/Ref databases for selected groups/clusters. Of course, you can also delete selected groups/clusters from the SSU Ref NR data set to compensate for sequence additions.

Since ARB is a so-called "in-memory" database, the larger a dataset is, the more main memory (RAM) ARB requires to handle it. This is the only significant hardware requirement of ARB. Due to the rapid growth of the rRNA datasets, however, it currently represents a severe bottleneck for many users.

The following table, provided by Ribocon, gives concrete numbers on the ARB hardware requirements for the SILVA SSU Ref NR dataset and a general idea of the correlation between dataset size and ARB memory usage: ARB/SILVA Memory Requirements v1.3 (pdf, 34 kb).