After the release of SILVA 132, we decided to change how we generate the SILVA Ref NR 99 datasets. Previously, the order of the sequences for clustering was based only on sequence length. For identity-based clustering tools, the order of the sequences is important because the first sequence of a cluster becomes its reference. Although sequence length is an important quality criterion for phylogenetic reconstruction, it is not the only one. Therefore, we now also consider additional sequence quality values for the sorting order. Additionally, to keep the SILVA Ref NR 99 more stable in future releases, we put all members of the SILVA Ref NR 99 from the previous releases at the top of the order. In this process, we also changed the clustering tool from the original uclust to vsearch. Changing the clustering tool, of course, made the reference sequences for this release less stable in comparison to previous releases, but it was a necessary step to be able to provide better and more stable results in the future.
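The ordering idea can be illustrated with a minimal Python sketch. It is not the SILVA pipeline and does not call vsearch. The `Candidate` fields, the single combined `quality` score, and the choice to rank quality ahead of length are assumptions made for illustration only; the text above states only that previous Ref NR 99 members come first and that quality values are considered in addition to length.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    accession: str        # hypothetical identifier
    length: int           # sequence length in bases
    quality: float        # hypothetical combined quality score, higher is better
    in_previous_nr: bool  # True if the sequence was a member of the previous Ref NR 99


def clustering_order(candidates):
    """Return candidates in the order they would be handed to the clustering tool.

    Assumed priority: previous Ref NR members first, then higher quality,
    then longer sequences; the accession breaks remaining ties deterministically.
    """
    return sorted(
        candidates,
        key=lambda c: (not c.in_previous_nr, -c.quality, -c.length, c.accession),
    )


candidates = [
    Candidate("A", 1500, 0.80, False),
    Candidate("B", 1450, 0.95, False),
    Candidate("C", 1400, 0.70, True),
    Candidate("D", 1520, 0.95, False),
]

ordered = [c.accession for c in clustering_order(candidates)]
print(ordered)  # ['C', 'D', 'B', 'A']
```

In this toy example, sequence C is placed first although it is the shortest and has the lowest quality score, because it was a member of the previous dataset. An order-dependent clustering tool would therefore see it before any new candidate, which is what keeps existing references stable.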
When we started to analyze the EMBL/ENA release 138, we faced a new challenge. For the SSU dataset, more than 2.9 million RNA sequences predicted by RNAmmer were not included in any of the other RNA sequence databases and are, therefore, unique to the SILVA database. For the LSU dataset, the total number was lower, at about 350,000 sequences. Proportionally, for both datasets, these predicted candidates constitute about 20% of all candidates. In combination with the changed clustering approach and tool, the large number of predicted candidates led to a large number of sequences changing in the Ref NR datasets for both subunits. Having to add many new sequences to the guide trees made the process more time-consuming and also increased the effort for our taxonomic curator(s).
For the SILVA 138 release, Pablo Yarza from Ribocon GmbH is, for the first time, the main curator of the SILVA taxonomy, replacing Pelin Yilmaz, who stepped down from her role in the SILVA team to work as a consultant in industry. Pablo is also a member of the LTP team and recently started a collaboration with the LPSN.
SILVA 138 is also the first release for which the taxonomies from the Genome Taxonomy Database (GTDB) and UniEuk were used for the classification and taxonomic curation of sequences. These new sources of taxonomic information further increased the curation burden and led to changes at the higher taxonomic levels in two-thirds of the sequence classifications. The manual curation is still ongoing, and the adoption of the new taxonomic sources will continue in the next SILVA releases.
Judging from our experience with the SSU Ref NR 99, we expect the curation of the LSU Ref NR 99 (which has not started yet) to take some additional months. For this reason, we decided, for the first time in SILVA history, to do a split release and release the SSU datasets before the LSU datasets.
Estimated release dates:
You can access the preliminary release information here.