Conceptual Description of the Level 1 C2M2¶
Authors: Rick Wagner
Maintainers: Rick Wagner
Version: 0.1
License: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication
Objectives¶
This is a conceptual and narrative description of the Level 1 Crosscut Metadata Model (C2M2). It covers the things (proper nouns) in the Level 1 C2M2 and their relationships, and describes the tables used to represent them. The last section covers the internal controlled vocabularies used for a few attributes. These notes do not go heavily into things like the format (syntax) of the columns or the specific primary key and foreign key relationships.
Things (Proper Nouns) Described¶
The Level 1 C2M2 includes tables to describe the following things (entities), and the relationships among them.
- Namespaces
- Project
- File
- Subject
- Biosample
- Collection
This section has descriptions of each thing and a list of its attributes (fields).
Namespaces¶
A namespace is a logical groupings of things, used to avoid collisions among the names used by different Data Coordination Centers (DCCs). We assume that each DCC assigns a unique local name to each thing that it manages. (If this assumption is violated--if, for example, biosamples and files may be assigned the same local name--then additional local structure may be needed.) Then, anything from any DCC can be given a unique global name by concatenating the namespace id for the DCC from which the thing originates with the local name assigned to the thing by that DCC. Thus, for example, two things originating from DCC1 and DCC2, and each assigned a local name Sample1, will have distinct C2M2 names: DCC1:Sample1 and DCC2:Sample1.
Attributes¶
- namespace A globally unique ID representing this namespace
- abbreviation A short display label for this namespace
- name A short, human-readable, machine-read-friendly label for this namespace
- description A human-readable description of this namespace
Project¶
There can be a single project for each DCC, or things like studies can be represented as subprojects. The field persistent id could be a website for project, or a DOI for a paper. When we get to collections to describe datasets or cohorts, we'll show what project they were part of.
- abbreviation A short display label for this project
- name A short, human-readable, machine-read-friendly label for this project
- description A human-readable description of this project
- persistent id A persistent, resolvable (not nec. retrievable) URI generated by a DCC and attached to this project
File¶
- file id The unique name for this file, compromised of:
- namespace Namespace for the DCC or file creator
- id An ID representing this file, unique within this namespace
- project Which project or subproject created this file
- persistent id A persistent, resolvable (not nec. retrievable) URI generated by a DCC (using, e.g., our minid server) and attached to this file
- creation_time An ISO 8601 -- RFC 3339 (subset)-compliant timestamp documenting this file's creation time
Ex. YYYY-MM-DDTHH:MM:SS±NN:NN
- size_in_bytes The size of a file in bytes
- sha256 The output of the SHA-256 cryptographic hash function after being run on this file: one or both of sha256 and md5 is required; sha256 is preferred
- md5 The output of the MD5 message-digest algorithm after being run as a cryptographic hash function on this file: one or both of
- filename A filename with no prepended PATH information
- file_format An EDAM CV term ID identifying the digital format of this file
Ex. TSV or FASTQ
- data_type An EDAM CV term ID identifying the type of information stored in this file
Ex. RNA sequence reads
Subject¶
- subject id The unique name for this subject, compromised of:
- namespace Namespace for the DCC or subject provider
- id An ID representing this subject, unique within this namespace
- project Which project or subproject created this file
- persistent id A persistent, resolvable (not nec. retrievable) URI generated by a DCC and attached to this subject
- creation_time An ISO 8601 -- RFC 3339 (subset)-compliant timestamp documenting this subject record's creation time
Ex. YYYY-MM-DDTHH:MM:SS±NN:NN
- granularity A CFDE CV term categorizing this subject by multiplicity (see Subject Granularity under Controlled Vocabularies). One of:
- single organism
- symbiont system
- host-pathogen system
- microbiome
- cell line
- synthetic
Biosample¶
- biosample id The unique name for this biosample, compromised of:
- namespace Namespace for the DCC or biosample owner
- id An ID representing this biosample, unique within this namespace
- project Which project or subproject created this biosample
- persistent id A persistent, resolvable (not nec. retrievable) URI generated by a DCC and attached to this biosample
- creation_time An ISO 8601 -- RFC 3339 (subset)-compliant timestamp documenting this biosample's creation time
Ex. YYYY-MM-DDTHH:MM:SS±NN:NN
- assay_type An OBI CV term ID describing the type of material represented by this biosample
- anatomy An UBERON CV term ID used to locate the origin of this biosample within the physiology of its source or host organism
Collection¶
Like projects, collections can have subcollections. Collections can hold files, biosamples, or subjects, which is done using a relationship.
- collection id The unique name for this collection compromised of:
- namespace Namespace for the DCC or collection creator
- id An ID representing this collection, unique within this namespace
- persistent id A persistent, resolvable (not nec. retrievable) URI generated by a DCC and attached to this collection
- abbreviation A very short display label for this collection
- name A short, human-readable, machine-read-friendly label for this collection
- description A human-readable description of this collection
Relationships¶
There are several relationships between things that can be described, like which subject a biosample comes from. These a often mapping tables between the unique names (namespace, id) of different things.
Things in Collections¶
Collections can contain one or more files, biosamples, or subjects. A collection may contain a combination of different types. There are tables for each type that map the items into their collections. The item is identified by its namespace and id, so is the collection. Effectively, the tables look like the following:
Attributes¶
Files in collection * subject id The unique name (namespace, id) of the subject * collection id The unique name (namespace, id) of the collection
Biosamples in collection * biosample id The unique name (namespace, id) of the biosample * collection id The unique name (namespace, id) of the collection
Subjects in collection * subject id The unique name (namespace, id) of the subject * collection id The unique name (namespace, id) of the collection
Biosamples and Subjects¶
To allow for multiple subjects to be represented in a single biosample and vice versa, there is a mapping table between biosamples and subjects.
Attributes¶
- biosample id The unique name (namespace, id) of the biosample
- subject id The unique name (namespace, id) of the subject
Files Describing Subjects and Biosamples¶
To show a relationship between a file and a subject a or biosample, like a sequence file generated from a biosample, there are two more mapping tables.
Attributes¶
Files describing biosamples * file id The unique name (namespace, id) of the file * biosample id The unique name (namespace, id) of the biosample
Files describing subjects * file id The unique name (namespace, id) of the file * subject id The unique name (namespace, id) of the subject
Subject Role and Taxonomy¶
A table linking a subject, a subject_role (a named organism-level constituent component of a subject, like 'host', 'pathogen', 'endosymbiont', 'taxon detected inside a microbiome subject', etc.) and a taxonomic label (which is hereby assigned to this particular subject_role within this particular subject)".
Attributes¶
- subject
- namespace The namespaec of the subject
- id The ID of this subject
- role The role assigned to this organism-level constituent component of this subject (see Subject Role under Controlled Vocabularies). One of:
- single organism
- host
- symbiont
- pathogen
- microbiome taxon
- cell line ancestor
- synthetic
- taxonomy_id An NCBI Taxonomy Database ID identifying this taxon
CFDE Controlled Vocabularies¶
Subject Granularity¶
Term | Description |
---|---|
single organism | One organism |
symbiont system | A mixed system of consisting of two or more organisms (symbionts) in symbiosis (living colocated in time and space): one such symbiont may optionally be identified as a host |
host-pathogen system | A special case of a symbiont system consisting of one symbiont, designated as a host, plus one or more other symbionts acting to create or sustain disease within the host organism |
microbiome | A symbiont system consisting of a collection of (potentially unknown or partially characterized) taxa, where the environment in which the system resides is well-characterized, but the taxonomic composition of the system may be unknown; optionally contains one symbiont specially identified as a host |
cell line | A cell line derived from one or more species or strains |
synthetic | A synthetic biological entity |
Subject Role¶
Term | Description |
---|---|
single organism | The organism represented by a subject in the 'single organism' granularity category |
host | Any organism identified as a host for a subject assigned to the 'symbiont system', 'host-pathogen system', or 'microbiome' granularity categories |
microbiome taxon | A constituent taxon of either (a) a subject assigned to the 'environmental microbiome' granularity category or (b) the microbiome (non-host) portion of a subject assigned to the 'host-associated microbiome' granularity category [NB: This role is probably not appropriate for Level 1, because it necessitates the post-facto attachment of downstream analysis procedures (subject -> sample -> library prep -> sequencing -> bioinformatics -> taxonomic classification results) to a subject which was originally uncharacterized at this level] |
symbiont | An organism identified as a symbiont within a subject assigned to the 'symbiont system' granularity category |
pathogen | An organism identified as a pathogen symbiont in a subject assigned to the 'host-pathogen system' granularity category |
cell line ancestor | A taxon identified as a source organism for a subject assigned to the 'cell line' granularity category |
synthetic | A synthetic biological entity |
Conclusions¶
This section provides a concise overview of the key objects and concepts covered by the C2M2 model and should be viewed as an initial contact point for anyone interested in mapping data into the C2M2 model, thereby getting ready for a full ETL process.
What to read next?¶