MIMARKS - the Marker Gene Standard for the Future

In October 2008, the MIMARKS (Minimum Information about a MARKer gene Sequence) working group was formed within the Genomic Standards Consortium to work on three main issues:

  1. Define which attributes of marker gene sequences are most relevant for the community
  2. Determine how to handle data integration and sequence submission to the INSDC effectively
  3. Define an overall concept for the Minimum Information checklists, MIxS (Minimum Information about any (x) Sequence)


Status of MIMARKS/MIxS:

In November 2010, the first version of the contextual (meta)data standard MIMARKS was released after two years of discussion with almost 100 experts.

In December 2010, the manuscript describing MIMARKS (formerly MIENS) was made available for community voting at Nature Precedings.


In May 2011, the MIMARKS/MIxS paper was published in Nature Biotechnology.



Further information about MIMARKS is available on the Wiki page of the Genomic Standards Consortium.

MIMARKS is a living standard, and changes can be requested on the MIxS Trac page.

You can also subscribe to the MIMARKS mailing list to actively participate in the discussions.

The software tool CDinFusion

CDinFusion (Contextual Data and FASTA infusion) is a submission preparation tool for integrating contextual data (CD) with sequence data. The software enriches uploaded multi-FASTA files with contextual data in compliance with the Genomic Standards Consortium (GSC) specifications MIGS/MIMS/MIMARKS (MIxS). The resulting files, enriched with contextual data, can be used for submission to the databases of the International Nucleotide Sequence Database Collaboration (INSDC). The tool aims to give scientists in all life science disciplines a means to increase the quantity and quality of contextual data in the INSDC databases.

CDinFusion can be accessed at http://www.megx.net/cdinfusion. Have a look at the Video Tutorial.
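
To make the idea of enriching a multi-FASTA file with contextual data more concrete, here is a minimal Python sketch. It is not CDinFusion's actual implementation or output format: the function name enrich_fasta, the [key=value] header layout, and the sample values are assumptions chosen purely for illustration.

```python
from typing import Dict, Iterable, Iterator


def enrich_fasta(lines: Iterable[str], contextual: Dict[str, str]) -> Iterator[str]:
    """Append contextual data to every header line of a multi-FASTA file.

    Assumption: contextual data are written as [key=value] tags after the
    original header text. This layout is hypothetical and only illustrates
    the principle of attaching contextual data to sequence records.
    """
    tags = " ".join(f"[{key}={value}]" for key, value in contextual.items())
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">") and tags:
            # Header line: keep the original text and add the tags.
            yield f"{line} {tags}"
        else:
            # Sequence line (or header with nothing to add): pass through.
            yield line


# Made-up example records and contextual values.
fasta = [
    ">seq1 uncultured bacterium",
    "ACGT",
    ">seq2 uncultured bacterium",
    "GGCC",
]
context = {"lat_lon": "54.18 7.90", "depth": "1 m"}

enriched = list(enrich_fasta(fasta, context))
for line in enriched:
    print(line)
```

In this sketch the same contextual data are applied to all sequences in the file, which matches the common case of many sequences originating from a single sample. Per-sequence values would need a mapping keyed by sequence identifier instead.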


What are contextual data?

Contextual data (also called "metadata") are secondary data (information) attached to primary sequence data. Simply put, they are "data about data".

They describe aspects like:

  •  the origin of a sample and the corresponding environmental parameters
    • site characteristics (longitude, latitude, depth, altitude, temperature ...)
    • chemical parameters (pH, nitrate, phosphate ...)
    • ...
  •  the processing of a sample resulting in sequence information
    • PCR primers
    • cloning vectors
    • ...
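
The aspects listed above can be pictured as one record per sample. The following Python sketch is only an illustration under stated assumptions: the class ContextualData, its field names, and the example values are hypothetical and are not the official MIMARKS/MIxS field definitions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ContextualData:
    """Hypothetical contextual data record for one sample."""

    # Origin of the sample and site characteristics.
    latitude: float
    longitude: float
    depth_m: Optional[float] = None
    temperature_c: Optional[float] = None
    # Chemical parameters.
    ph: Optional[float] = None
    # Processing of the sample.
    pcr_primers: Optional[str] = None
    cloning_vector: Optional[str] = None

    def __post_init__(self) -> None:
        # Reject coordinates outside the valid geographic range.
        if not -90.0 <= self.latitude <= 90.0:
            raise ValueError(f"latitude out of range: {self.latitude}")
        if not -180.0 <= self.longitude <= 180.0:
            raise ValueError(f"longitude out of range: {self.longitude}")

    def missing_fields(self) -> List[str]:
        """Return the names of optional fields that were not provided."""
        return [name for name, value in vars(self).items() if value is None]


# Made-up example values.
record = ContextualData(latitude=54.18, longitude=7.90, depth_m=1.0, ph=8.1)
missing = record.missing_fields()
print(missing)
```

A helper such as missing_fields shows how a checklist could flag incomplete records before submission.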

Why are contextual data of outstanding importance?

Because only these additional data make it possible to turn primary sequence information into sound biological knowledge. An example:

A 16S rRNA sequence deposited in the public databases and annotated as "uncultured bacterium", but without any additional information (contextual data), is of limited use. In contrast, if it is supplemented with just the sample location (latitude, longitude, time, depth), it can already be used to:

  • investigate the geography of the sequence
  • link the sequence with specific habitat conditions
  • integrate the diversity data with functional data from (meta)genomic surveys
  • supplement the sequence information with data from remote sensing surveys