New Solutions for Contextual Data Integration

Organization & Storage of Contextual Data

For maximum usability, we have prepared an Excel-based solution for organizing and storing your contextual data.

+++ Download version 1.03 of the rRNA Contextual Data Spreadsheet +++

Detailed documentation is included, and a pre-filled example file is also available.

The file is under constant development and new fields will be added in the future. Feedback is welcome! Please send an email to contact (at) arb-silva.de.

 

Integration of Contextual Data

Primary sequence information and the corresponding contextual data are recorded independently and should be merged as early as possible during sequence preparation and analysis. Both kinds of data are then available as a single file throughout the whole workflow.

This requires extended FASTA files, called "Metadata-FASTA" files.

Full files for ARB/SILVA import contain all information from the rRNA Contextual Data Spreadsheet (example 1). INSDC-compliant files contain only the INSDC fields and are intended for direct submission (example 2).

How can such a Metadata-FASTA file be produced from the rRNA Contextual Data Spreadsheet and standard FASTA files?

  1. The first steps in rRNA sequence preparation and analysis (assembly, editing, quality checks) are typically carried out with commercial software tools. We have therefore convinced the provider of a professional, reasonably priced sequence assembler and chromatogram editor to build in extensions for contextual data integration. The result is the RNA Baser, offered by HeracleSoftware. More information and downloads, including fully operational local trials and test files, are available at www.rnabaser.com. The RNA Baser exports both kinds of Metadata-FASTA files (full and INSDC compliant).
  2. A free web-based solution for users who do not want to use the RNA Baser is planned ...
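
The merge step itself can be illustrated with a minimal Python sketch. It assumes the spreadsheet has been exported as CSV with one row per sequence. The "sequence_id" column name, the function names and the bracketed [field=value] header layout are illustrative assumptions, not the Metadata-FASTA layout actually written by the RNA Baser or read by the ARB filters. The insdc_only switch mimics the difference between full and INSDC-compliant files, using only the three INSDC fields named on this page.

```python
import csv
import io

# INSDC fields named on this page; the real list of qualifiers is longer.
INSDC_FIELDS = ("lat_lon", "collection_date", "isolation_source")


def read_fasta(handle):
    """Yield (identifier, sequence) pairs from a standard FASTA file."""
    identifier, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if identifier is not None:
                yield identifier, "".join(chunks)
            # The first word of the header serves as the sequence identifier.
            words = line[1:].split()
            identifier, chunks = (words[0] if words else ""), []
        elif line:
            chunks.append(line)
    if identifier is not None:
        yield identifier, "".join(chunks)


def merge_to_metadata_fasta(fasta_handle, csv_handle,
                            id_column="sequence_id", insdc_only=False):
    """Return extended FASTA text with contextual data in each header.

    The header layout ">id [field=value] ..." is an assumption made for
    this sketch only.
    """
    rows = {row[id_column]: row for row in csv.DictReader(csv_handle)}
    lines = []
    for identifier, sequence in read_fasta(fasta_handle):
        row = rows.get(identifier, {})
        fields = [
            (key, value)
            for key, value in row.items()
            if key not in (None, id_column) and value
            and (not insdc_only or key in INSDC_FIELDS)
        ]
        header = ">" + identifier + "".join(f" [{k}={v}]" for k, v in fields)
        lines.extend([header, sequence])
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Made-up example data: two sequences, contextual data for one of them.
    fasta = io.StringIO(">clone_A\nACGU\nGGCC\n>clone_B\nUUAA\n")
    table = io.StringIO(
        "sequence_id,lat_lon,collection_date,ph\n"
        "clone_A,47.94 N 28.12 W,03-Nov-2008,7.8\n"
    )
    print(merge_to_metadata_fasta(fasta, table))
```

Sequences without a matching spreadsheet row are written with a plain header, so no sequence is lost if its contextual data are still missing.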

 

ARB/SILVA import & export filters

New extended FASTA import and export filters for ARB have been set up.

+++ The respective filters (version 1.03) can be downloaded from the SILVA Archive +++

  • "contextual_data_fasta_rnabaser.ift" is the import filter for Metadata-FASTA files created by, for example, the RNA Baser. It is fully compatible with the Excel-based rRNA Contextual Data Spreadsheet and also includes the additional fields that are not INSDC compliant, all of which are fully supported by the ARB/SILVA databases.
  • "contextual_data_fasta_sequin.eft" is the export filter that writes a Metadata-FASTA file with all INSDC-compliant fields for easy submission using, for example, Sequin.

For installation in ARB, please refer to the README file.

Last update: November 3, 2008

Background

What are contextual data?

Contextual data (also called "metadata") are secondary data (information) attached to primary sequence data. Simply put, they are "data about data".

They describe aspects like:

  •  the origin of a sample and the corresponding environmental parameters
    • site characteristics (longitude, latitude, depth, altitude, temperature ...)
    • chemical parameters (pH, nitrate, phosphate ...)
    • ...
  •  the processing of a sample resulting in sequence information
    • PCR primers
    • cloning vectors
    • ...

 

Why are contextual data of outstanding importance?

Because only these additional data make it possible to turn primary sequence information into sound biological knowledge. An example:

A 16S rRNA sequence deposited in the public databases annotated as "uncultured bacterium" but without any additional information (contextual data) is of limited use only.

In contrast, if it were supplemented with just the sampling location and time (lat, lon, depth, time), it could already be used to:

  • investigate the geography of the sequence
  • link the sequence with specific habitat conditions
  • integrate the diversity data with functional data from (meta)genomic surveys
  • supplement the sequence information with data from remote sensing surveys

Current status of contextual data and reasons for their limited availability

The INSDC databases already offer a number of specific fields for storing contextual data.

Examples are:

  • "lat_lon" (= geographic position of sampling site)
  • "collection_date" (= date of sampling)
  • "isolation_source" (= physical geography of the sampling site)

More information on the INSDC fields currently available and the standards for completing them can be found in the INSDC Feature Table Document.
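
As an illustration of how such a field becomes usable for analysis, the following minimal Python sketch converts a "lat_lon" value into signed decimal degrees. It assumes the decimal-degree notation with hemisphere letters, for example "47.94 N 28.12 W"; the function name and the error handling are choices made for this sketch, and the INSDC Feature Table Document remains the authoritative reference for the format.

```python
import re

# Matches values such as "47.94 N 28.12 W": decimal degrees plus hemisphere.
LAT_LON = re.compile(
    r"^\s*(\d+(?:\.\d+)?)\s+([NS])\s+(\d+(?:\.\d+)?)\s+([EW])\s*$"
)


def parse_lat_lon(value):
    """Convert a lat_lon string to (latitude, longitude) in signed degrees."""
    match = LAT_LON.match(value)
    if match is None:
        raise ValueError(f"not a recognised lat_lon value: {value!r}")
    lat, north_south, lon, east_west = match.groups()
    # Southern latitudes and western longitudes become negative numbers.
    latitude = float(lat) * (1 if north_south == "N" else -1)
    longitude = float(lon) * (1 if east_west == "E" else -1)
    if not (-90 <= latitude <= 90 and -180 <= longitude <= 180):
        raise ValueError(f"coordinates out of range: {value!r}")
    return latitude, longitude


if __name__ == "__main__":
    print(parse_lat_lon("47.94 N 28.12 W"))
```

Signed coordinates of this kind are what mapping tools and habitat or remote sensing data sets usually expect, which is why a consistently filled "lat_lon" field is so valuable.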

However, a search of, for example, SILVA 96 SSURef (based on EMBL release 96) shows how rarely these fields are filled in.

Of the 324,342 entries in total, only a small portion contains a geographic position or a sampling date; "isolation_source" is the only one of these fields present in the majority of entries.

  • "lat_lon": 11,768 (3.6%)
  • "collection_date": 21,088 (6.5%)
  • "isolation_source": 229,180 (71%)
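
The percentages follow directly from the counts and the total number of entries quoted above. A minimal Python sketch of the arithmetic (the variable and function names are illustrative):

```python
TOTAL_ENTRIES = 324_342  # entries in SILVA 96 SSURef, as quoted above

# Number of entries in which the respective field is filled in.
ENTRIES_WITH_FIELD = {
    "lat_lon": 11_768,
    "collection_date": 21_088,
    "isolation_source": 229_180,
}


def coverage(count, total):
    """Return the share of entries carrying a field, in percent."""
    return 100.0 * count / total


if __name__ == "__main__":
    for field, count in ENTRIES_WITH_FIELD.items():
        share = coverage(count, TOTAL_ENTRIES)
        print(f"{field}: {count:,} of {TOTAL_ENTRIES:,} entries ({share:.1f}%)")
```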

 

What is the reason for this unsatisfactory situation?

Researchers are simply not submitting their contextual data together with the primary sequence data to the INSDC databases.

This is mainly due to a lack of software solutions for integrating contextual data with primary sequence information.

To resolve this limitation, we have prepared solutions for

  • standardized contextual data storage and organization (Excel-based)
  • easy contextual data integration using standard tools
  • easy submission of sequence and basic contextual data using standard INSDC submission tools

MIENS - standards for the future

Besides the limitations caused by missing solutions for integrating, storing and organizing contextual data locally, many contextual data fields cannot yet be submitted to the INSDC databases.

In October 2008, the MIENS (Minimum Information about an ENvironmental Sequence) working group was formed within the Genomic Standards Consortium to work on two main issues:

  1. Defining which attributes of environmental sequences are most relevant for the community
  2. Determining how to handle data integration and sequence submission to the INSDC effectively

 

Survey results, the current list of fields and additional information are available on the wiki page of the Genomic Standards Consortium.

Contributions are always welcome; just contact the SILVA team at contact (at) arb-silva.de.