Release information: SILVA 123

Release information of the SILVA SSU and LSU databases 123 as of July 23, 2015

 

             SSU 123                  LSU 123
Parc         4,985,791 (+ 639,424)    563,332 (+ 116,334)
Ref          1,756,783 (+ 172,915)    96,642 (+ 39,096)
Ref NR 99    597,607 (+ 62,639)       -

Since release 115, only SSU Ref NR 99 contains a guide tree.

Information about former releases can be found here.

Sequence Retrieval and Processing

 

                             SSU 123      LSU 123
candidates (total)           7,168,241    870,342
RNAmmer                      84,098       52,820
< 300 bases                  1,650,058    225,190
> 2% ambiguities             25,713       8,228
> 2% homopolymers            122,621      18,799
> 2% vector contamination    2,054        764
low alignment identity       481,981      77,347
total rejected by QC         2,182,450    306,872

Sequences have been retrieved from EMBL-EBI/ENA Release 123 (March 2015) using a complex keyword search procedure and a sequence-based search with RNAmmer profiles. Cross-checks with RDP II indicated no loss of primary data. Manual inspection showed that most of the sequences rejected because of a low identity value after alignment with SINA were not ribosomal RNA sequences.
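The per-sequence rejection criteria in the table can be sketched as a simple check. This is a minimal illustration, not the SILVA pipeline: the function name `qc_reject_reasons` is hypothetical, an ambiguity is assumed to be any character other than A, C, G, T or U, and a homopolymer is assumed to be a run of more than four identical bases. Vector contamination and alignment identity need external tools and are left out.

```python
import re

MIN_LENGTH = 300            # sequences shorter than this are rejected
MAX_AMBIGUITY_PCT = 2.0     # maximum percentage of ambiguous bases
MAX_HOMOPOLYMER_PCT = 2.0   # maximum percentage of bases inside homopolymers

# Assumption: a homopolymer is a run of more than four identical bases.
HOMOPOLYMER_RUN = re.compile(r"([ACGTU])\1{4,}")


def qc_reject_reasons(sequence):
    """Return the list of criteria a raw sequence fails (empty list = passes)."""
    seq = sequence.upper()
    reasons = []
    if len(seq) < MIN_LENGTH:
        reasons.append("< 300 bases")
    if not seq:
        return reasons  # nothing left to measure on an empty sequence

    # Assumption: everything that is not A, C, G, T or U counts as ambiguous.
    ambiguous = sum(1 for base in seq if base not in "ACGTU")
    if 100.0 * ambiguous / len(seq) > MAX_AMBIGUITY_PCT:
        reasons.append("> 2% ambiguities")

    # Count every base that lies inside a homopolymer run.
    in_runs = sum(len(m.group(0)) for m in HOMOPOLYMER_RUN.finditer(seq))
    if 100.0 * in_runs / len(seq) > MAX_HOMOPOLYMER_PCT:
        reasons.append("> 2% homopolymers")
    return reasons
```

A sequence can fail several criteria at once, which is why the function returns a list rather than a single label.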

Basic statistics for the SILVA databases

 

               SSU Ref      SSU Ref NR    LSU Parc    LSU Ref
Version        123          123           123         123
Total          1,756,783    597,607       563,332     96,642
Bacteria       1,575,088    513,311       108,541     78,944
Archaea        60,993       22,913        874         718
Eukaryota      120,702      61,383        453,917     16,980
Cultured       37,459       37,459        24,593      8,202
Typestrains    22,174       22,174        5,809       4,663
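As a quick consistency check on these statistics, the three domain counts add up to the reported total for each dataset. The sketch below verifies this and derives the percentage share of each domain. The numbers are copied from the table; the dictionary layout and the function name `domain_shares` are illustrative choices.

```python
# Counts copied from the basic statistics table of release 123.
STATS = {
    "SSU Ref":    {"Total": 1_756_783, "Bacteria": 1_575_088, "Archaea": 60_993, "Eukaryota": 120_702},
    "SSU Ref NR": {"Total": 597_607,   "Bacteria": 513_311,   "Archaea": 22_913, "Eukaryota": 61_383},
    "LSU Parc":   {"Total": 563_332,   "Bacteria": 108_541,   "Archaea": 874,    "Eukaryota": 453_917},
    "LSU Ref":    {"Total": 96_642,    "Bacteria": 78_944,    "Archaea": 718,    "Eukaryota": 16_980},
}

DOMAINS = ("Bacteria", "Archaea", "Eukaryota")


def domain_shares(counts):
    """Return the percentage of each domain, after checking the column sum."""
    if sum(counts[d] for d in DOMAINS) != counts["Total"]:
        raise ValueError("domain counts do not add up to the total")
    return {d: round(100.0 * counts[d] / counts["Total"], 1) for d in DOMAINS}


if __name__ == "__main__":
    for name, counts in STATS.items():
        print(name, domain_shares(counts))
```

The shares make the different composition of the datasets explicit: SSU Ref is roughly 90% Bacteria, whereas LSU Parc is roughly 80% Eukaryota.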

Growth of the ribosomal RNA databases since 1992

Length Distribution (SSU & LSU)

New in Release 123

  • Webpage
    • SSU Ref can now be selected in Sequence search/classify, TestProbe and TestPrime
    • SSU LTP can now be selected in Search, TestProbe and TestPrime
    • All SILVA sequences are automatically taxonomically classified based on LSURef and SSURef NR 99
    • LTP updated to release 121.
  • ARB files  
    • The guide tree is only available for SSU Ref NR 99 and LSU Ref.
  • Exports
    • Mapping files for the SILVA taxonomy and ranks are now available for SSU and LSU
    • Megan compatible files are now available.
  • Pipeline
    • Several improvements and bug fixes.
  • Seed
    • The SSU Seed was extended with latest LTP version (121).
    • Cleaning up the Seed has resulted in the removal of 1986 sequences.
  • SILVA Taxonomy
    • New consistency checks have been implemented; several errors in the bacterial, archaeal and eukaryotic taxonomy have been corrected.
    • All eukaryotic taxa have been significantly revised, see Eukaryotic Taxonomy.
    • SSURef NR 99 guide tree:
      • Dark matter phyla from SAG (single amplified genome) studies were added to Bacteria & Archaea.
      • Rumen microbial community data, generated as part of the Global Rumen Census and Hungate1000 projects (PIs Gemma Henderson, Bill Kelly, Sinead Leahy, and Peter Janssen, Rumen Microbiology Team, AgResearch Ltd, New Zealand), were mapped onto the SSURef NR guide tree. Clusters that contained rumen sequence data, but were taxonomically poorly defined, were given improved taxonomic strings and names, reflective of the types of microbes found within (e.g., strain names or designated as an uncultured group (UCG)). As a result, the taxonomic framework better reflects the diversity of microbes found in the rumen and can be used to assign a more precise taxonomic identity to rumen microbial sequence data. This work was funded by the New Zealand Government as part of its support for the Global Research Alliance on Agricultural Greenhouse Gases.
      • Archaeplastida (Plants and Algae) taxonomy was improved. Embryophyta taxonomy was adapted from ENA. Algae and other remaining groups were adapted from AlgaeBase.
      • Metazoa (Animals) taxonomic groups were added. Depending on how well the 18S rDNA resolves these groups, taxonomic ranks up to family were curated, but genera are not yet available. Taxonomy was adapted from ENA, WoRMS and ITIS.
    • LSURef guide tree has been completely migrated to the SILVA taxonomy, including the Eukaryota:
      • Protists: ranks up to genus were curated, reflecting the Adl et al. 2012 framework
      • Fungi: ranks up to order were curated
      • Plants and animals: ranks up to phylum/class were curated

Known Bugs

  • SSUParc: 63,000 sequences have no Pintail values

Small Subunit rRNA Database

SSU Parc (Web database only) contains all aligned sequences with an alignment identity value of 50 or above, an alignment quality value of 40 or above, and a base-pair score or sequence quality of 30 or above. All sequences with a Pintail value < 50 or an alignment quality value < 75 have been assigned to color group 1 in ARB (red). All Living Tree Project or StrainInfo typestrains have been assigned to color group 2 in ARB (light blue). No further sequence curation has been applied.

To create SSU Ref (Web database & ARB file), all sequences shorter than 1,200 bases for Bacteria and Eukarya or shorter than 900 bases for Archaea, and all sequences with an alignment identity below 70 or an alignment quality value below 50, have been removed from SSU Parc. All sequences with a Pintail value < 50 or an alignment quality value < 75 have been assigned to color group 1 in ARB (red). All Living Tree Project or StrainInfo typestrains have been assigned to color group 2 in ARB (light blue).
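The SSU Ref selection and the rule for color group 1 can be written as two predicates. This is a sketch under assumptions: the record class `SsuRecord` and its attribute names are hypothetical and do not correspond to actual ARB field names, and a missing Pintail value is assumed not to trigger the red flag. Only the thresholds come from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SsuRecord:
    """Hypothetical container for the values the criteria refer to."""
    domain: str                      # "Bacteria", "Archaea" or "Eukaryota"
    length: int                      # number of bases
    align_identity: float            # alignment identity value
    align_quality: float             # alignment quality value
    pintail: Optional[float] = None  # Pintail value, may be missing


# Minimum length per domain for SSU Ref.
MIN_LENGTH = {"Bacteria": 1200, "Eukaryota": 1200, "Archaea": 900}


def passes_ssu_ref(rec: SsuRecord) -> bool:
    """True if an SSU Parc record also qualifies for SSU Ref."""
    return (
        rec.length >= MIN_LENGTH[rec.domain]
        and rec.align_identity >= 70
        and rec.align_quality >= 50
    )


def is_flagged_red(rec: SsuRecord) -> bool:
    """True if the record would be assigned to color group 1 (red)."""
    # Assumption: a missing Pintail value does not flag the record by itself.
    low_pintail = rec.pintail is not None and rec.pintail < 50
    return low_pintail or rec.align_quality < 75
```

Note that the two checks are independent: a record can qualify for SSU Ref and still be flagged red, for example with an alignment quality of 60.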

To create SSU Ref NR 99 (Web database & ARB file), a 99% identity criterion was applied with the UCLUST tool to remove highly identical sequences. Sequences from cultivated species have been preserved in all cases. A guide tree was calculated by adding all sequences to the SSU Ref tree of SILVA release 119. For tree calculation, highly variable positions were removed for Bacteria, Archaea, and Eukarya with the respective position variability filters. Position variability filters for Bacteria, Archaea and Eukarya have been calculated and added to the dataset. The tree was extensively manually curated, taking into account the latest taxonomic information. More information about the SILVA and LTP taxonomic frameworks can be found in the respective paper. Detailed information about the SSU Ref NR dataset is available here.
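The idea behind the 99% criterion, including the exception for cultivated species, can be illustrated with a small greedy dereplication over already aligned sequences. This is a simplified stand-in and not UCLUST itself: the identity measure, the longest-first ordering and the function names `pairwise_identity` and `dereplicate` are assumptions made for the example.

```python
GAP_CHARS = "-."


def pairwise_identity(a, b):
    """Fraction of identical bases over columns where both sequences have a base."""
    compared = matches = 0
    for x, y in zip(a, b):
        if x in GAP_CHARS or y in GAP_CHARS:
            continue
        compared += 1
        if x == y:
            matches += 1
    return matches / compared if compared else 0.0


def ungapped_length(seq):
    return sum(1 for base in seq if base not in GAP_CHARS)


def dereplicate(records, threshold=0.99):
    """Greedy dereplication.

    records: iterable of (name, aligned_sequence, is_cultured) tuples.
    Returns the names of the sequences that are kept.
    """
    # Assumption: longer sequences are considered first as representatives.
    ordered = sorted(records, key=lambda r: ungapped_length(r[1]), reverse=True)
    kept = []
    for name, seq, cultured in ordered:
        redundant = any(pairwise_identity(seq, k[1]) >= threshold for k in kept)
        # Sequences from cultivated organisms are kept in all cases.
        if cultured or not redundant:
            kept.append((name, seq, cultured))
    return [k[0] for k in kept]
```

With four sequences, where B differs from A in one of 200 positions, C is identical to A but cultured, and D is unrelated, the call keeps A, C and D and drops only B.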

Remark: Before using the alignment for extensive phylogenetic reconstructions, all sequences should be checked carefully.

Large Subunit rRNA Databases

LSU Parc (Web database & ARB file) contains all aligned sequences with an alignment identity value of 40 or above and an alignment quality value, a base-pair score or a sequence quality of 30 or above. All sequences with an alignment quality value < 75 have been assigned to color group 1 in ARB (red). All Living Tree Project or StrainInfo typestrains have been assigned to color group 2 in ARB (light blue). No further curation has been applied.

Additionally, for LSU Ref (Web database & ARB file), all sequences shorter than 1,900 bases or with an alignment identity below 60 have been removed, a guide tree was calculated based on the LSU Ref tree of SILVA release 119, and basic filters have been added. The tree was manually curated, taking into account the latest taxonomic information. More information about the SILVA and LTP taxonomic frameworks can be found in the respective paper.

Please take into account that the LSU SEED consists of only around 2,800 sequences and there is no guarantee that well-aligned close relatives have always been available. We recommend additional manual curation before using it for extensive phylogenetic reconstructions.

Taxonomy

Since SILVA release 102, the default taxonomy shown on the webpage (browser/search) is the SILVA taxonomy. Briefly, the tree for Bacteria and Archaea has been organized based on Bergey's Taxonomic Outline, LPSN and the literature. Starting with SILVA release 111, extensive care has been taken to also improve the eukaryotic taxonomy. Based on the curated SILVA Ref taxonomy, all sequences in SILVA (Parc) have been automatically classified.

Alternative Taxonomies

Besides the SILVA and EMBL-EBI/ENA taxonomies, alternative classifications taken from the greengenes, RDP II and LTP projects are also available in SILVA. On the webpage, the user can switch between them using the taxonomy menu. In ARB, the different taxonomies can be found in the fields tax_slv, tax_embl, tax_gg, tax_rdp and tax_ltp for SILVA, EMBL-EBI/ENA, greengenes, RDP II and LTP, respectively. The corresponding *_name fields show the respective sequence name for each entry. Please take into account that greengenes, RDP II and LTP provide only a subset of the sequences hosted by SILVA. If no taxonomic mapping to greengenes, RDP II or LTP is available, the sequence is assigned as "unclassified" in that taxonomy and the respective sequence name equals the EMBL-EBI/ENA name. For the LSU datasets, only the SILVA, LTP and EMBL-EBI/ENA taxonomies are available.
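When records are processed outside ARB, the "unclassified" placeholder has to be handled for each taxonomy field. A minimal sketch, assuming that a record is available as a plain dictionary keyed by the field names above and that taxonomy paths are semicolon-separated strings; both are assumptions about the export format, and `available_taxonomies` is a hypothetical helper.

```python
# ARB field name -> taxonomy source, as listed in the text.
TAX_FIELDS = {
    "tax_slv": "SILVA",
    "tax_embl": "EMBL-EBI/ENA",
    "tax_gg": "greengenes",
    "tax_rdp": "RDP II",
    "tax_ltp": "LTP",
}


def available_taxonomies(record):
    """Return {source: [rank, ...]} for every taxonomy that classifies the record."""
    result = {}
    for field, source in TAX_FIELDS.items():
        path = record.get(field, "").strip().strip(";")
        # Skip missing fields and the "unclassified" placeholder.
        if not path or path.lower() == "unclassified":
            continue
        # Assumption: ranks are separated by semicolons.
        result[source] = [rank.strip() for rank in path.split(";") if rank.strip()]
    return result
```

For an LSU record only the SILVA, LTP and EMBL-EBI/ENA keys can be expected in the result, since the other taxonomies are not provided for LSU.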

Alternative Names

All names of validly described species in the SSU and LSU databases have been checked for changes (basonyms, synonyms and orthographical corrections) against the DSMZ "Nomenclature up to date" catalogue as of April 2015.

Cultured and Type strains

Whether a sequence originates from a cultured strain or a type strain is recorded in the field strain and indicated by [C] or [T], respectively. Several sources have been used to compile the information: the StrainInfo.net bioportal (May 2015), the Ribosomal Database Project II (September 2014) and the Living Tree Project (release 121), which provides manually curated information compliant with Euzéby's "List of Prokaryotic names with Standing in Nomenclature".

Strain Identifiers

Source                 Information              Tag     Datasets
EMBL-EBI/ENA           Typestrains              (t)     SSU, LSU
EMBL-EBI/ENA           Genomes                  e[G]    SSU, LSU
Straininfo.net         Cultured                 s[C]    SSU, LSU
Straininfo.net         Typestrains              s[T]    SSU, LSU
Living Tree Project    Typestrains (curated)    l[T]    SSU
RDP II                 Typestrains              r[T]    SSU


The identifiers can be used for data retrieval by searching in the strain field; see the FAQ.
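To illustrate how the tags listed above could be used programmatically, the sketch below extracts them from the content of a strain field. The exact layout of that field is not specified here, so the example values and the assumption that tags appear as literal substrings such as `s[T]` or `(t)` are illustrative only; `parse_strain_tags` is a hypothetical helper.

```python
import re

# Tag -> (source, information), copied from the strain identifier table.
TAGS = {
    "(t)": ("EMBL-EBI/ENA", "Typestrains"),
    "e[G]": ("EMBL-EBI/ENA", "Genomes"),
    "s[C]": ("Straininfo.net", "Cultured"),
    "s[T]": ("Straininfo.net", "Typestrains"),
    "l[T]": ("Living Tree Project", "Typestrains (curated)"),
    "r[T]": ("RDP II", "Typestrains"),
}

# Assumption: a tag starts at a word boundary so that strain names ending
# in one of the prefix letters are not misread as tags.
TAG_PATTERN = re.compile(r"\(t\)|\b[eslr]\[[GCT]\]")


def parse_strain_tags(strain_field):
    """Return (tag, source, information) tuples in order of first appearance."""
    found = []
    for match in TAG_PATTERN.finditer(strain_field):
        tag = match.group(0)
        if tag in TAGS and tag not in found:
            found.append(tag)
    return [(tag, *TAGS[tag]) for tag in found]
```

For a made-up field value such as `"DSM 30083 s[T] l[T] e[G]"`, the function reports a StrainInfo type strain, a curated LTP type strain and a genome sequence.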

Genomes

Whether a sequence originates from a genome project has been taken from EMBL-EBI/ENA and added to the field strain. It is indicated by e[G].

Detailed information about the corresponding identifiers and target databases can be found in the Strain Identifiers table above.


Quality Values

The length and colours of the bars give a first indication of the sequence and alignment quality as well as the risk of sequence anomalies based on Pintail analysis. After downloading the sequences as an ARB file, sequences that need attention can be selected by searching for low quality (alignment, sequence) or Pintail values in the corresponding ARB database fields. A full description of the colour code and all database fields available in the ARB files can be found in the FAQ section. Given the rich set of sequence-associated information that comes with every SILVA sequence, user-designed sub-databases can be easily generated.

SEEDs

All rRNA sequences have been aligned based on a completely manually re-checked SEED alignment of 69,207 rRNA sequences for SSU and 2,868 rRNA sequences for LSU. The SSU alignment is based on the official ssu_jan04 release of the ARB Project. The SSU SEED alignment has been considerably improved for Archaea by manual addition of more than 1,000 sequences as well as Fungi (10,000 sequences). All SSU Eukaryotic sequences (18S) have been cross-checked by Wolfgang Ludwig before their addition to the SEED. Most of the bacterial sequences have also undergone a curation process carried out by the SILVA Team. We would rate our SSU SEED alignment for all Bacteria and Archaea as good and for Eukarya as reasonable.

The LSU alignment was provided by Wolfgang Ludwig and has not been released before SILVA. It was cross-checked by the SILVA Team before using it as the SEED for automatic alignment. Bacteria and Archaea can be rated as good. The Eukaryotes definitely need further attention.

RNAmmer

RNAmmer is a computational predictor for the major rRNA species (SSU, LSU) from all three domains of life. The program uses hidden Markov models trained on data from the European ribosomal RNA database project. SILVA runs the profiles of RNAmmer on all sequence entries of the EMBL-EBI/ENA archive to complement the existing predictions. All predictions are marked with RNAmmer in the ann_src_field. More information about RNAmmer can be found in the paper.

Citations

Quast C, Pruesse E, Yilmaz P, Gerken J, Schweer T, Yarza P, Peplies J, Glöckner FO (2013) The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucl. Acids Res. 41 (D1): D590-D596.

Yilmaz P, Parfrey LW, Yarza P, Gerken J, Pruesse E, Quast C, Schweer T, Peplies J, Ludwig W, Glöckner FO (2014) The SILVA and "All-species Living Tree Project (LTP)" taxonomic frameworks. Nucl. Acids Res. 42 (D1): D643-D648.

If you use SINA please cite:

Pruesse E, Peplies J, Glöckner FO (2012) SINA: accurate high-throughput multiple sequence alignment of ribosomal RNA genes. Bioinformatics 28: 1823-1829.