07.07.2011 14:47

Growing, growing, and growing …


By Jörg Peplies and Frank Oliver Glöckner for the SILVA Team


There has been a lot of activity within the SILVA project during the last few weeks. First of all, the new SILVA database release 106 became available, now offering more than two million rRNA gene sequence entries! Additionally, the increasingly popular non-redundant version of the SSU Ref dataset has been updated, and a new high-quality type strain dataset provided by the Living Tree project was released. Finally, a completely new function is available on the webpage: the TestProbe tool, which allows you to evaluate probe and primer coverage and specificity in a comprehensive way.

Now it is time to take a breath, step back, recap the first four years of the SILVA project, and have a look into the future.

The first SILVA database release was published in February 2007, offering 353k SSU and 47k LSU rRNA gene sequences. Now, in July 2011, we stand at 1,962k SSU entries and 231k LSU entries. In other words, in just over four years, the SILVA SSU and LSU datasets have grown by a factor of 5.6 and 4.9, respectively, both showing exponential growth (SSU graph).
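
To make these numbers concrete, here is a small sketch that recomputes the growth factors from the figures quoted above and estimates a doubling time for each dataset. The doubling time is not stated in this article; it follows only if growth is assumed to be purely exponential over the whole period, and the elapsed time (February 2007 to July 2011, taken as 4 years and 5 months) is likewise an assumption. The function names are illustrative.

```python
import math

# Figures quoted in the text, in thousands of sequences.
first_release = {"SSU": 353, "LSU": 47}    # February 2007
release_106 = {"SSU": 1962, "LSU": 231}    # July 2011

# Assumption: elapsed time between the two releases, in years.
YEARS = 4 + 5 / 12


def growth_factor(start, end):
    """Ratio of the final to the initial dataset size."""
    return end / start


def doubling_time(start, end, years):
    """Doubling time in years, assuming purely exponential growth."""
    return years * math.log(2) / math.log(end / start)


for name in ("SSU", "LSU"):
    factor = growth_factor(first_release[name], release_106[name])
    t_double = doubling_time(first_release[name], release_106[name], YEARS)
    print(f"{name}: factor {factor:.1f}, doubling time about {t_double:.1f} years")
```

Under these assumptions, both datasets double in a little under two years.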

This tendency is neither new nor unexpected. Everybody in the field is aware of the fast-growing nucleotide sequence databases and the impact of new sequencing technology, which is also highlighted by the general INSDC database statistics. By now, many people use the term “data tsunami” to picture the situation. And if you look at the SILVA database statistics, you can indeed see the wave coming, caused by Next Generation Sequencing (NGS) techniques such as Roche 454 (SSU graph). The read lengths of these new technologies are getting longer and longer, and eventually they will cross the magic 300-base threshold of the SILVA quality cut-off and enter the SILVA databases. The question is no longer if but when.

On the one hand, this increase in data brings an increase in potential information or even knowledge, which is without doubt a gold mine for scientists. On the other hand, technical infrastructure is required to handle this deluge of data; that is the other side of the coin. This challenge hits database providers and users alike. The best example is the latest SILVA SSU Parc dataset for the software package ARB. Just to open (!) this file in ARB, about 15 GB of main memory are required, probably a rather rare hardware configuration in the scientific world. And again, it is only a question of time until the requirements of the SSU Ref ARB download exceed the capacities of most users.

So, what does the future of rRNA gene-based sequence analysis look like, especially if you are working in the field of environmental microbiology? The SILVA project started to react to this as early as 2010 by also providing a non-redundant (NR) version of the SSU Ref dataset. By using a 99% identity criterion for clustering, more than 50% (!) of the sequence entries could be removed. Considering 1-2% sequencing errors in the sequences, the NR dataset provides a de-noised and higher-quality dataset.

This is exactly what the future of the SILVA project will look like: more quality control, for higher quality, for more knowledge, with the positive side effect of providing reference datasets that remain usable on “wet-lab compatible” computer infrastructures.
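
To illustrate the idea behind a 99% identity criterion, here is a minimal sketch of greedy, identity-based redundancy removal on pre-aligned sequences. It is not the procedure used to build the SILVA NR dataset, which this article does not describe. The function names `identity` and `nonredundant`, the gap handling, and the toy sequences are all assumptions made for illustration.

```python
def identity(a, b):
    """Fraction of identical positions between two aligned sequences.

    Assumption: both sequences come from the same alignment and have
    equal length; columns where both carry a gap ('-') are ignored.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = matches = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue
        compared += 1
        if x == y:
            matches += 1
    return matches / compared if compared else 0.0


def nonredundant(seqs, threshold=0.99):
    """Greedy filter: keep a sequence only if no representative kept
    so far is at least `threshold` identical to it."""
    representatives = []
    for seq in seqs:
        if not any(identity(seq, rep) >= threshold for rep in representatives):
            representatives.append(seq)
    return representatives


# Toy data: 200-base sequences, purely illustrative.
base = "ACGT" * 50
near_copy = "T" + base[1:]               # one mismatch: 99.5% identical
distant = base[:100] + "TGCA" * 25       # 100 mismatches: 50% identical

kept = nonredundant([base, near_copy, distant])
print(len(kept), "of 3 sequences kept")
```

A greedy filter like this depends on input order and compares every sequence against all representatives, so it does not scale to millions of entries without further optimisation. It only shows why near-identical entries, including those that differ merely by sequencing errors, collapse into a single representative.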

This will leave room to concentrate on the analysis of the sequence data, especially when that analysis takes the environmental context into account. Standardisation of such contextual data, or metadata, is an emerging field, and SILVA is a lead member of the recently established Genomic Standards Consortium, which provides checklists to optimize data exchange (Commentary in ISME Journal).

Finally, we would like to thank all of you for four years of positive and useful feedback and for your ongoing interest in the SILVA database project. We will do our best to provide attractive solutions during the next four years as well. And we wish you all the best for your projects based on rRNA gene sequence analysis!

Have a nice summer break; we will be back with SILVA 108 ...