News

27.03.2009 17:33

Editorial

The data tsunami has reached us – will it also drown us?

As you can see from the SILVA 98 statistics, we have crossed the magical border of 1,000,000 (one million) SSU and LSU rRNA sequences within the SILVA web database (the “Parc”). This development is not unexpected if you have followed the growth of rRNA sequence databases over time: in summary, the number of publicly available rRNA sequences doubles every 15 to 18 months.
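To get a feeling for what such a doubling time means, the growth can be written as N(t) = N0 * 2^(t/T), where T is the doubling time in months. The following Python lines are only a minimal sketch of this arithmetic. It assumes that the doubling time stays constant, which is a simplification, and the function names are purely illustrative.

```python
import math


def projected_count(current: float, months: float, doubling_months: float) -> float:
    """Extrapolate a sequence count, assuming a constant doubling time."""
    return current * 2 ** (months / doubling_months)


def months_until(current: float, target: float, doubling_months: float) -> float:
    """Months needed to grow from `current` to `target` at a constant doubling time."""
    return doubling_months * math.log2(target / current)


if __name__ == "__main__":
    start = 1_000_000  # the one-million mark mentioned above
    for doubling in (15, 18):  # the two ends of the stated range
        in_five_years = projected_count(start, 60, doubling)
        print(f"doubling every {doubling} months: "
              f"about {in_five_years:,.0f} sequences after 5 years, "
              f"10x after {months_until(start, 10 * start, doubling):.0f} months")
```

The two ends of the range already give clearly different five-year projections, which is why any such extrapolation should be read as an order of magnitude rather than a forecast.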

From user reactions, support requests and statements on the ARB/SILVA mailing list, we know that the situation has become problematic for many of you over the last months and years. High-performance software and hardware are required to handle this massive amount of data. This is costly and requires skills in administering a growing amount of hardware. Some limitations can, at least temporarily, be circumvented by simply reducing your everyday working database. If you are interested in, e.g., the phylogeny of selected Eukarya, there is no need to also host all Bacteria and Archaea on your local system. Eventually, however, the problem will return, and new sequencing technologies and decreasing sequencing costs will only aggravate the situation in the future.

With the SILVA 98 release, we have not only crossed the border of one million rRNA sequences but also exceeded the ability of ARB to build PT-servers with the complete SILVA SSU Ref database (368,368 sequences) provided in the ARB format. This is a severe bottleneck because the ARB PT-server is required for sequence-based searches as part of the alignment process as well as for probe design and probe match. The explanation is simple: the whole PT-server is held in the main memory (RAM) of your computer, which is limited to a maximum of 4 GB for standard 32-bit operating systems and software. Obviously, we have now exceeded the number of sequences that can be handled by standard systems.
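The 4 GB figure follows directly from the pointer width: a 32-bit address can distinguish 2^32 bytes. The short Python sketch below only reproduces this arithmetic and divides the address space by the number of sequences in SSU Ref. It is an upper bound under the assumption that the entire address space were available to the PT-server; it says nothing about the actual memory layout of the PT-server, which is not described here, and the function names are illustrative.

```python
def address_space_bytes(pointer_bits: int) -> int:
    """Number of distinct byte addresses a pointer of this width can express."""
    return 2 ** pointer_bits


def budget_per_sequence(pointer_bits: int, n_sequences: int) -> int:
    """Upper bound on bytes per sequence if the whole address space were usable.

    Assumption: operating system, program code and other data use no memory,
    so the real budget is smaller than this value.
    """
    return address_space_bytes(pointer_bits) // n_sequences


if __name__ == "__main__":
    ssu_ref_sequences = 368_368  # SILVA SSU Ref 98, as stated above
    for bits in (32, 64):
        total = address_space_bytes(bits)
        per_seq = budget_per_sequence(bits, ssu_ref_sequences)
        print(f"{bits}-bit: {total / 1024**3:,.0f} GiB addressable, "
              f"at most {per_seq:,} bytes per sequence")
```

With a 32-bit address space, the budget per sequence is on the order of ten kilobytes at best, and it shrinks with every new release.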

Again, this is not unexpected. But the current absence of a sustainable solution already indicates the dimension of the challenge. In any case, ARB has done a perfect job so far. Keep in mind that the core of this software was designed at a time when only a few hundred rRNA sequences were available, and it is still working today with databases of hundreds of thousands of sequences!

Let’s have a look at potential solutions. Since the PT-server bottleneck is caused by the limited amount of RAM that can be addressed by a 32-bit system, one solution is a 64-bit version of ARB, which can address a nearly unlimited amount of RAM. A fully operational 64-bit version of ARB does not yet exist, but work is in progress. Adapting (and testing!) such a complex software system is not an easy task. In conclusion, this approach can solve the current limitations if you are willing to invest in hardware (RAM), but ultimately it represents a race between growing databases and computational power.

A more elegant option is to reliably filter down the databases provided for local usage with ARB. Think about all the redundancy represented by nearly identical sequences. But this, too, is a challenging task. The SILVA SSU Ref database already represents a filtered subset of the corresponding SILVA Parc database, obtained mainly through quality management. However, simply using more stringent settings does not work because whole clusters start to disappear, and the tree is then no longer representative. Highly complex solutions are required to ensure that information content is reduced in a reasonable way.
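To illustrate the idea of removing redundancy while keeping every cluster represented, here is a deliberately naive Python sketch. It is not the method used for SILVA; it is a simple greedy selection that assumes the sequences are already aligned to equal length, and the threshold and function names are hypothetical. A sequence is kept only if it is less similar than the threshold to every representative kept so far, so each group of near-identical sequences contributes exactly one member instead of vanishing completely.

```python
def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two aligned sequences.

    Columns where both sequences have a gap ('-') are ignored.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = matches = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue
        compared += 1
        if x == y:
            matches += 1
    return matches / compared if compared else 0.0


def greedy_representatives(seqs: dict[str, str], threshold: float) -> list[str]:
    """Keep a sequence only if no kept representative reaches the identity threshold."""
    representatives: list[str] = []
    for name, seq in seqs.items():  # input order decides which member represents a cluster
        if all(identity(seq, seqs[rep]) < threshold for rep in representatives):
            representatives.append(name)
    return representatives


if __name__ == "__main__":
    toy = {
        "a": "ACGTACGTAC",
        "b": "ACGTACGTAT",  # nearly identical to "a"
        "c": "TTTTGGGGCC",  # clearly different
    }
    print(greedy_representatives(toy, threshold=0.85))
```

The pairwise comparison scales quadratically with the number of representatives, so a sketch like this is only useful to show the principle and not to process a full release.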

Looking into the crystal ball, we clearly see a combination of both approaches, because even work on filtered data sets will profit from a powerful 64-bit version of ARB. Computer hardware is also becoming cheaper and cheaper. In the near future, current high-end hardware will become standard, as has been the case for decades.

For the moment, smart handling of the problem by the user is required. As always in times of limited resources, creativity must stand in for solutions that do not yet exist. Potential approaches are (i) the elimination of domains, groups or clusters that are not required if your focus is on classification/phylogeny, as already mentioned above, or (ii) the separate calculation of PT-servers for each domain if you aim for a comprehensive probe match.
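Both approaches start from the same step: splitting the data set by domain. The Python sketch below shows only this partitioning on exported records and does not call ARB in any way. The record layout (accession, semicolon-separated taxonomy path, sequence) and the accessions in the example are assumptions made for illustration.

```python
from collections import defaultdict


def partition_by_domain(records):
    """Group (accession, taxonomy_path, sequence) records by their first taxonomic rank.

    Assumption: taxonomy_path looks like "Bacteria;Proteobacteria;..." with the
    domain as the first field. Records without a path go to "Unclassified".
    """
    subsets = defaultdict(list)
    for accession, taxonomy_path, sequence in records:
        domain = taxonomy_path.split(";", 1)[0].strip() or "Unclassified"
        subsets[domain].append((accession, sequence))
    return dict(subsets)


if __name__ == "__main__":
    # Hypothetical toy records, not taken from any SILVA release.
    example = [
        ("seq1", "Bacteria;Proteobacteria", "ACGU"),
        ("seq2", "Archaea;Euryarchaeota", "GGCU"),
        ("seq3", "Bacteria;Firmicutes", "AUGC"),
        ("seq4", "Eukaryota;Fungi", "CCGA"),
    ]
    for domain, members in partition_by_domain(example).items():
        print(domain, len(members))
```

For approach (i), only the subset of interest would be kept in the working database; for approach (ii), each subset would be used to build its own PT-server, and a probe would then be matched against each server in turn.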

Perhaps our statements can serve as a starting point for a fruitful discussion of such tips and tricks (e.g. on the ARB/SILVA user group). Think positive! A few years ago, the main bottleneck of the rRNA ARB community was small and often outdated data sets. In this context, the current challenges are much more positive. And while you are reading these lines, we are already back at work preparing solutions. Hang in there …

The SILVA Database Project
Bremen, March 2009
