Release information: SILVA 111

Release 111 of the SILVA SSU and LSU databases was released on July 27, 2012.

 

        SSU 111                  LSU 111
Parc    3,194,778 (+ 702,125)    288,717 (+ 19,219)
Ref     739,633 (+ 121,191)      29,306 (+ 8,347)

Nearly full length sequences removed from SSURef: 491,130 (+ 89,109)

HSM     Human skin microbiome
MWM     Mouse wound microbiota
GNHM    Guerrero Negro hypersaline microbial mat

Information about former releases can be found here.

Sequence Retrieval and Processing

 

                             SSU 111      LSU 111
candidates (total)           4,556,364    485,703
RNAmmer                      53,950       17,828
< 300 bases                  1,041,574    152,870
> 2% ambiguities             17,873       5,464
> 2% homopolymers            56,132       10,879
> 2% vector contamination    2,423        496
low alignment identity       287,198      42,196
total rejected by QC         1,361,586    196,986

Sequences have been retrieved from EMBL Release 111 (March 12) using a complex keyword search procedure and a sequence-based search with RNAmmer profiles. Cross-checks with RDP II indicated no loss of primary data. Most of the sequences rejected because of a low identity value after alignment with SINA were found not to be ribosomal RNA sequences on manual inspection.
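The counts above can be checked with simple arithmetic: subtracting the sequences rejected by quality control from the total candidates gives the Parc sizes reported for this release. The sketch below reproduces that check using only numbers from the tables above; the variable and function names are illustrative. It also shows that the individual rejection counts add up to more than the reported total. That would be consistent with some sequences failing more than one check, although the release notes do not state this.

```python
# Counts copied from the release tables (SSU 111 and LSU 111).
# "rejected_by_reason" lists, in order: < 300 bases, > 2% ambiguities,
# > 2% homopolymers, > 2% vector contamination, low alignment identity.
counts = {
    "SSU": {
        "candidates": 4_556_364,
        "rejected_total": 1_361_586,
        "parc": 3_194_778,
        "rejected_by_reason": [1_041_574, 17_873, 56_132, 2_423, 287_198],
    },
    "LSU": {
        "candidates": 485_703,
        "rejected_total": 196_986,
        "parc": 288_717,
        "rejected_by_reason": [152_870, 5_464, 10_879, 496, 42_196],
    },
}


def remaining_after_qc(entry):
    """Candidates that were not rejected by quality control."""
    return entry["candidates"] - entry["rejected_total"]


def surplus_of_reasons(entry):
    """How far the per-reason counts exceed the reported total."""
    return sum(entry["rejected_by_reason"]) - entry["rejected_total"]


for name, entry in counts.items():
    kept = remaining_after_qc(entry)
    print(name, "kept:", kept, "matches Parc:", kept == entry["parc"])
    print(name, "per-reason surplus:", surplus_of_reasons(entry))
```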

Basic statistics for the SILVA databases

 

                 SSU Parc     SSU Ref    LSU Parc    LSU Ref
Version          111          111        111         111
Total            3,194,778    739,633    288,717     29,306
Bacteria         2,651,771    629,125    25,961      19,580
Archaea          129,147      38,721     393         405
Eukaryota        246,322      71,787     259,851     932
Cultured #       36,305       28,632     17,918      5,267
Typestrains #    20,103       17,795     8,123       3,146

# according to straininfo.net and the Living Tree Project

Growth of the ribosomal RNA databases since 1992

Length Distribution (SSU & LSU)

New in Release 111

  • Webpage
  • ARB files  
    • The 491,130 full length (Ref) sequences of the Human skin microbiome (HSM), Mouse wound microbiota (MWM) and Guerrero Negro hypersaline microbial mat (GNHM) projects have been separated from the SSURef dataset.
  • Pipeline
    • Improved importer and parser. The new importer is now significantly faster and much more specific.
    • Full integration of the new SINA aligner
    • Calculation of alignment identities (align_ident_slv) introduced to better control the specificity and quality of the alignment with respect to the Seed.
    • Threshold definition for Parc and Ref with respect to alignment identity refined.
  • Seed
    • The SSU Seed was extended with the latest LTP version integrated.
  • SINA (SILVA INcremental Aligner)  
  • Eukaryotic Taxonomy

Known Bugs

  • SSUParc: 15,008 sequences have no Pintail values

Small Subunit rRNA Database

SSU Parc (Web database & ARB file) contains all aligned sequences with an alignment identity value of 50 or above, an alignment quality value of 40 or above, and a base-pair score or sequence quality of 30 or above. All sequences with a Pintail value < 50 or an alignment quality value < 75 have been assigned to color group 1 in ARB (red). All Living Tree Project or StrainInfo typestrains have been assigned to color group 2 in ARB (light blue). No further sequence curation has been applied.

To create SSU Ref (Web database & ARB file), all sequences shorter than 1,200 bases for Bacteria and Eukarya or shorter than 900 bases for Archaea, or with an alignment identity below 70 or an alignment quality value below 50, have been removed from SSU Parc. Additionally, sequences of large-scale submissions from a single project or habitat, such as HSM, MWM and GNHM (see table above), have been removed from SSU Ref but remain the basis for creating SSURef NR. A guide tree was calculated by adding all sequences to the tree_1200 of SILVA release 108, which is based on tree_1000 from the ssujan04 release. For tree calculation, highly variable positions were removed for Bacteria, Archaea, and Eukarya with the respective position variability filters. Position variability filters for Bacteria, Archaea and Eukarya have been calculated and added to the dataset. All sequences with a Pintail value < 50 or an alignment quality value < 75 have been assigned to color group 1 in ARB (red). All Living Tree Project or StrainInfo typestrains have been assigned to color group 2 in ARB (light blue). Before using the alignment for extensive phylogenetic reconstructions, all sequences should be checked carefully.
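The SSU thresholds described above can be summarized as a small set of predicates. This is a minimal sketch, not the SILVA pipeline itself: the record class and its attribute names are hypothetical, and only the numeric thresholds are taken from the text. It assumes that the base-pair score and sequence quality condition is satisfied when either value reaches 30, which is one possible reading of the wording.

```python
from dataclasses import dataclass


@dataclass
class SsuRecord:
    # Hypothetical attribute names; only the thresholds come from the text.
    domain: str               # "Bacteria", "Archaea" or "Eukaryota"
    length: int               # sequence length in bases
    align_identity: float
    align_quality: float
    basepair_score: float
    seq_quality: float
    pintail: float
    large_scale_project: bool = False  # e.g. part of HSM, MWM or GNHM


# Minimum length for SSU Ref, per domain.
MIN_REF_LENGTH = {"Bacteria": 1200, "Eukaryota": 1200, "Archaea": 900}


def in_ssu_parc(r: SsuRecord) -> bool:
    # Assumption: either base-pair score or sequence quality >= 30 suffices.
    return (
        r.align_identity >= 50
        and r.align_quality >= 40
        and (r.basepair_score >= 30 or r.seq_quality >= 30)
    )


def in_ssu_ref(r: SsuRecord) -> bool:
    return (
        in_ssu_parc(r)
        and r.length >= MIN_REF_LENGTH[r.domain]
        and r.align_identity >= 70
        and r.align_quality >= 50
        and not r.large_scale_project
    )


def in_color_group_1(r: SsuRecord) -> bool:
    # Color group 1 (red) flags sequences that need attention.
    return r.pintail < 50 or r.align_quality < 75


archaeon = SsuRecord("Archaea", 950, 72, 60, 35, 35, 80)
bacterium = SsuRecord("Bacteria", 950, 72, 60, 35, 35, 80)
print(in_ssu_ref(archaeon), in_ssu_ref(bacterium), in_color_group_1(archaeon))
```

In this example the 950-base archaeal record passes the Ref length cut-off while the bacterial record of the same length does not, and both fall into color group 1 because the alignment quality is below 75.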

Large Subunit rRNA Databases

LSU Parc (Web database & ARB file) contains all aligned sequences with an alignment identity value of 40 or above and an alignment quality value, a base-pair score or a sequence quality of 30 or above. All sequences with an alignment quality value < 75 have been assigned to color group 1 in ARB (red). All Living Tree Project or StrainInfo typestrains have been assigned to color group 2 in ARB (light blue). No further curation has been applied.

Additionally, for LSU Ref (Web database & ARB file), all sequences shorter than 1,900 bases or with an alignment identity below 60 have been removed, a guide tree was calculated based on the tree_1900 of SILVA release 108, and basic filters have been added.

Please take into account that the LSU SEED consists of only around 2,800 sequences, and there is no guarantee that well-aligned close relatives have always been available. We recommend additional manual curation before using it for extensive phylogenetic reconstructions.

Taxonomy

Since SILVA release 102, the default taxonomy shown on the webpage (browser/search) is the SILVA taxonomy. Briefly, the tree for Bacteria and Archaea has been organized based on Bergey's taxonomic outline, LPSN and the literature. Starting with SILVA release 111, extensive care has been taken to also improve the eukaryotic taxonomy. The SILVA taxonomy is only available for the sequences that are part of the Ref(erence) datasets (SSU Ref & LSU Ref). To show the classification of all sequences (Parc) in the SILVA databases, you have to switch to the EMBL taxonomy.

Alternative Taxonomies

Besides the SILVA and EMBL taxonomies, alternative classifications taken from the greengenes, RDP II and LTP projects are also available in SILVA. On the webpage, the user can switch using the taxonomy menu. In ARB, the different taxonomies can be found in the fields tax_slv, tax_embl, tax_gg, tax_rdp and tax_ltp for SILVA, EMBL, greengenes, RDP II and LTP, respectively. The corresponding *_name fields show the respective sequence name for each entry. Please take into account that greengenes, RDP II and LTP provide only a subset of the sequences hosted by SILVA. If no taxonomic mapping to greengenes, RDP II or LTP is available, the entry is assigned as "unclassified" and the respective sequence name equals the EMBL name. For the LSU datasets, only the SILVA, LTP and EMBL taxonomies are available.
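The mapping between taxonomies and ARB fields can be written down as a small lookup. The sketch below is illustrative only: the field names are the ones listed above, while the function names and the dictionary-based record are assumptions made for the example and not part of ARB or SILVA.

```python
# ARB field for each taxonomy, as listed in the text.
TAX_FIELDS = {
    "SILVA": "tax_slv",
    "EMBL": "tax_embl",
    "greengenes": "tax_gg",
    "RDP II": "tax_rdp",
    "LTP": "tax_ltp",
}

# Only these taxonomies are available for the LSU datasets.
LSU_TAXONOMIES = {"SILVA", "LTP", "EMBL"}


def taxonomy_field(taxonomy: str, dataset: str = "SSU") -> str:
    """Return the ARB field holding the given taxonomy (hypothetical helper)."""
    if taxonomy not in TAX_FIELDS:
        raise KeyError(f"unknown taxonomy: {taxonomy}")
    if dataset == "LSU" and taxonomy not in LSU_TAXONOMIES:
        raise ValueError(f"{taxonomy} taxonomy is not available for LSU")
    return TAX_FIELDS[taxonomy]


def classification(record: dict, taxonomy: str, dataset: str = "SSU") -> str:
    """Look up a classification in a record modelled as a plain dict.

    Entries without a mapping are reported as "unclassified", mirroring
    the behaviour described in the text.
    """
    return record.get(taxonomy_field(taxonomy, dataset)) or "unclassified"


example = {"tax_slv": "Bacteria;Proteobacteria", "tax_embl": "Bacteria"}
print(classification(example, "SILVA"))
print(classification(example, "RDP II"))
```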

Alternative Names

All names of validly described species in the SSU and LSU databases have been checked for changes (basonyms, synonyms and orthographical corrections) against the DSMZ "Nomenclature up to date" catalogue (http://www.dsmz.de/download/bactnom/names.txt) released in April 2012.

Cultured and Type strains

Whether a sequence originates from a cultured or type strain is recorded in the field strain and indicated by [C] and [T], respectively. Several sources have been used to compile the information: the StrainInfo.net bioportal, the Ribosomal Database Project II (10.28) and the Living Tree Project, which provides manually curated information compliant with Euzéby's "List of Prokaryotic names with Standing in Nomenclature".

Strain Identifiers

Source                 Information              Tag     Datasets
EMBL                   Typestrains              (t)     SSU, LSU
EMBL                   Genomes                  e[G]    SSU, LSU
Straininfo.net         Cultured                 s[C]    SSU, LSU
Straininfo.net         Typestrains              s[T]    SSU, LSU
Living Tree Project    Typestrains (curated)    l[T]    SSU
RDP II                 Typestrains              r[T]    SSU


The identifiers can be used for data retrieval by searching in the strain field (see FAQ).
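As an illustration of how the identifiers could be used outside ARB, the sketch below scans a strain field value for the tags listed in the table. It assumes the tags appear as literal substrings of the strain field; the exact layout of that field is not described here, and the example strings are invented for the demonstration.

```python
import re

# Tag -> (source, information), taken from the Strain Identifiers table.
TAG_MEANING = {
    "(t)": ("EMBL", "type strain"),
    "e[G]": ("EMBL", "genome"),
    "s[C]": ("Straininfo.net", "cultured"),
    "s[T]": ("Straininfo.net", "type strain"),
    "l[T]": ("Living Tree Project", "type strain (curated)"),
    "r[T]": ("RDP II", "type strain"),
}

# Match "(t)" or a source letter followed by [G], [C] or [T].
# The lookbehind avoids matching a tag glued to the end of a word.
TAG_PATTERN = re.compile(r"\(t\)|(?<![A-Za-z])[eslr]\[[GCT]\]")


def strain_tags(strain_field: str) -> list:
    """Return the known tags found in a strain field, in order of appearance."""
    return [t for t in TAG_PATTERN.findall(strain_field) if t in TAG_MEANING]


def is_type_strain(strain_field: str) -> bool:
    """True if any source marks the entry as a type strain."""
    return any("type strain" in TAG_MEANING[t][1] for t in strain_tags(strain_field))


print(strain_tags("DSM 30083 s[T] l[T]"))  # invented example value
print(is_type_strain("K-12 e[G]"))          # invented example value
```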

Genomes

Whether a sequence originates from a genome project has been taken from EMBL and recorded in the field strain. It is indicated by e[G].

Detailed information about the corresponding identifiers and target databases can be found in the Strain Identifiers table above.

The identifiers can be used for data retrieval by searching in the strain field (see FAQ).

Quality Values

The length and colours of the bars give a first indication of the sequence and alignment quality as well as the risk of sequence anomalies based on Pintail analysis. After downloading the sequences as an ARB file, sequences that need attention can be selected by searching for low quality (alignment, sequence) or Pintail values in the corresponding ARB database fields. A full description of the colour code and all database fields available in the ARB files can be found in the FAQ section. Given the rich set of sequence-associated information that comes with every SILVA sequence, user-designed sub-databases can be easily generated.

SEEDs

All rRNA sequences have been aligned based on a completely manually re-checked SEED alignment of 58,795 rRNA sequences for SSU and 2,868 rRNA sequences for LSU. The SSU alignment is based on the official ssu_jan04 release of the ARB Project. The SSU SEED alignment has been considerably improved for Archaea by manual addition of more than 1,000 sequences by Katrin Knittel. All SSU Eukaryotic sequences (18S) have been cross-checked by Wolfgang Ludwig before their addition to the SEED. Most of the bacterial sequences have also undergone a curation process carried out by the SILVA Team. We would rate our SSU SEED alignment for all Bacteria and Archaea as good and for Eukarya as reasonable.

The LSU alignment was provided by Wolfgang Ludwig and had not been released before SILVA. It was cross-checked by the SILVA Team before being used as the SEED for automatic alignment. Bacteria and Archaea can be rated as good. The Eukaryotes definitely need further attention.

RNAmmer

RNAmmer is a computational predictor for the major rRNA species (SSU, LSU) from all three domains of life. The program uses hidden Markov models trained on data from the European ribosomal RNA database project. SILVA runs the profiles of RNAmmer on all sequence entries of the EMBL archive to complement the existing predictions. All predictions are marked with RNAmmer in the ann_src_field. More information about RNAmmer can be found in the paper.

Update Files

Update files are no longer provided. Because the SILVA pipeline is constantly being improved, we recommend always taking the latest version of SILVA and updating it with your personal sequences. The difference between SILVA and your own database can be easily determined using the ARB Merge Tool.