Crosscut Metadata Model (C2M2) Documentation¶
Important note: in the case of any discrepancy between this document and the C2M2 technical wiki, DCC staff preparing current C2M2 submissions should rely on the latter for the most up-to-date C2M2 table definitions & usage guidance.
Quick Links¶
- For DCC contributors: C2M2 technical wiki
- Full C2M2 technical specification (skip introduction)
- Quick start guide
- C2M2 ER diagram
- Current release notes
- Upcoming C2M2 features (next release)
- Download the C2M2 JSON Schema
- Download a blank reference set of all C2M2 TSV files (column-header lines only)
The Common Fund Data Ecosystem's Crosscut Metadata Model (CFDE C2M2)¶
This document introduces the Crosscut Metadata Model (C2M2), a flexible metadata standard for describing experimental resources in biomedicine and related fields. The Common Fund Data Ecosystem group is creating a new federated search infrastructure, with C2M2 as its organizing principle, to offer the health research community access to an unprecedented array of intersectional data and tools. The C2M2 system will connect researchers with scale-powered statistical analysis methods; deep, seamless searching across experimental data generated by different projects and organizations; and new ways to aggregate and integrate experimental data from different sources to facilitate scientific replication and to drive new discoveries.
Using this new infrastructure, Common Fund Data Coordinating Centers (DCCs) will share structured, detailed information (metadata) about their experimental resources with the research community at large, widening and deepening access to usable observational data. One of C2M2's primary missions is to support search across cross-disciplinary datasets without moving or warehousing data. C2M2 is also designed to support large-scale analytic research, for example by substantially reducing the effort needed to perform meta-analysis of results produced by multiple independent teams studying similar health-related phenomena.
DCC metadata submissions¶
DCCs collect and provide metadata submissions (C2M2 instances) to CFDE describing experimental resources within their purview. Each submission is a set of tab-separated value files (TSVs); precise formatting requirements for these filesets are specified by JSON Schema documents, each of which is an instance of the Data Package meta-specification published by the Frictionless Data group. The Data Package meta-specification is a platform-agnostic toolkit for defining format and content requirements for files so that automatic validation can be performed on those files, just as a database management system stores definitions for database tables and automatically validates incoming data based on those definitions. Using this toolkit, the C2M2 JSON Schema specification defines foreign-key relationships between metadata fields (TSV columns), rules governing missing data, required content types and formats for particular fields, and other similar database management constraints. These architectural rules help to guarantee the internal structural integrity of each C2M2 submission, while also serving as a baseline standard to create compatibility across multiple submissions received from different DCCs. During the C2M2 submission process, the CFDE software infrastructure uses these schematic specifications to automatically validate format compliance and submission integrity prior to loading C2M2 metadata into its central database. Once loaded, C2M2 metadata will be used to fuel downstream services like web and API searching, customized statistical summaries, dynamic display graphics, asset browsing within experimental resource collections, and the automated forwarding of stable, accessible experimental data files (inventoried and annotated as part of a C2M2 metadata submission) to analytic workflow environments.
C2M2 overview¶
C2M2 offers DCCs a fairly sparse set of minimum structural benchmarks to meet when building a submission. The general idea is that DCC resource collections can initially be represented quickly (and thus begin driving downstream applications quickly) using metadata meeting minimal richness requirements -- enough to provide a basic level of harmonization with biomedical experimental metadata coming from other C2M2 sources (DCCs). Over time, DCC data managers can and should upgrade their C2M2 metadata submissions by adding more detailed descriptive information to their resource records; by elaborating on provenance, timing and other relationships between resources; and by working with the CFDE to expand C2M2 itself to better fit models and automation requirements already in production elsewhere.
A C2M2 submission (an "instance" or "datapackage") is a collection of data tables encoded as tab-separated value files (TSVs). Only three metadata records (three rows, across three C2M2 tables) are strictly required, so most of these tables can optionally be left empty in a minimally compliant submission. The three required records are
- a short contact sheet (name, email, etc.) referencing the DCC technical contact responsible for the submission,
- a single `project` record representing the submitting DCC itself (for resource attribution), and
- at least one identifier namespace, registered in advance with the CFDE (to protect IDs used by the submitting DCC to represent files, samples, etc. from potential conflicts with identifiers generated independently by other DCCs -- see the section on IDs for a full discussion of identifiers and namespaces in C2M2).
A minimally compliant submission -- containing just the three required records and no more -- would clearly not be of much use. The simplest usable submission configuration will also contain at least one nonempty data table (TSV) representing a flat inventory of experimental resources (like data files or biosamples). A more complex variant might similarly inventory a few different resources like `biosamples`, `files` and `subjects`, and then also encode basic associative relationships between those resources: for example, asserting which `biosamples` were materially descended from which `subjects`, or listing which `files` have been analytically derived from which `biosamples`.
Beyond the single mandatory "this DCC owns this submission" record (the second item in the list above), DCCs can also attach a hierarchy of `project` records to their experimental metadata, to group resources by research purview. Future submission variants will allow submitters to model events and timing (both for provenance and to describe time-indexed data), among other anticipated extensions. Core structures (that is, C2M2 core entities: fundamental types of experimental resource) currently include `files`, `biosamples` and `subjects`; more (including `gene` and `chemical substance`) are scheduled to appear in the coming months, based on direct collaboration with DCCs to model other relevant, usable experimental metadata while keeping a continual eye on maximizing harmonization and interoperability across the whole C2M2 metadata space.
A foundational purpose of the C2M2 system is to facilitate metadata harmonization: finding ways wherever possible to represent comparable things in standard ways, without compromising meaning, context or accuracy (although precision may occasionally be weakened so as to preserve the robustness of the rest). In addition to building bridges and crosswalks between disparate but related resources, C2M2 is also meant to facilitate the graded introduction of metadata into the CFDE system, as discussed above. The paradigm of gradually increasing submission complexity is by design a (roughly) staged process: new layers of metadata can be added according to increasing complexity and harmonization difficulty, ranging from basic flat asset inventories to well-decorated networks of relationships between resources that are described in finer operational detail. In addition to flattening the learning curve for onboarding DCC data managers into the CFDE ecosystem, the ability to submit C2M2 metadata in managed stages of complexity lets DCC data managers test and see how downstream functionality is interacting with their C2M2 metadata -- and, critically, to provide feedback to CFDE to investigate and create any needed changes -- before investing more heavily in creating more complex C2M2 submissions.
The full C2M2 technical specification, found below, explains all available submission structures, constraints and requirements in detail.
DCC integration and the evolution of C2M2¶
Most DCCs already have some form of internal metadata model in use for their own curation operations. C2M2 representation of similar but distinct packages of important information, taken from multiple independently-developed custom DCC metadata systems (e.g. metadata describing people and organizations, data provenance, experimental protocols, or detailed event sequences), requires ongoing, iterative, case-based design and consensus-driven decision-making, coordinated across multiple research groups. Design and decision-making in such contexts requires long-term planning, testing and execution. New metadata difficult to integrate and harmonize will be handled by the creation of generalizable, well-defined extensions to C2M2 if possible, and by pruning (at least pro tem) if not. The core of the C2M2 data space is tasked first with harmonizing relatively universal and uncontroversial metadata concepts -- to be made stable and available according to FAIRness principles -- for streamlined submission construction and usable deployment of DCC metadata. C2M2's second (longer-term) priority takes a slower road to make robust decisions about integration of less immediately tractable information, in concert with the Common Fund community and an awareness of global standards.
With the flexible (but still well-defined) design of C2M2, we seek to split the difference between the ease of evolution inherent in a simple model and the operational power provided to downstream applications by more complicated and difficult-to-maintain extended frameworks.
This flexibility is also intended to simultaneously address the needs of DCCs at widely different scales of data complexity or funding depth, which will differ based on organization life-cycle phases, scope of research purview, etc. DCCs with advanced, operationalized metadata modeling systems of their own should not encounter arbitrary barriers to C2M2 support for more extensive relational modeling of their metadata if they want it; newer or smaller DCCs, by contrast, may not have enough readily-available information to feasibly describe their experimental resources beyond giving basic asset lists and project attributions. By committing both to developing modular C2M2 extensions for the most advanced DCC metadata and to offering simpler but well-structured model options (already harmonized across C2M2 metadata from other DCCs), we aim to minimize barriers to rapid entry into the C2M2 ecosystem and its downstream applications.
A C2M2 topic requiring special attention is the use of identifiers.
C2M2 identifiers¶
C2M2 is designed to be a framework for sharing information with the global research community about useful experimental resources. To be scientifically useful, this information (metadata) should be well-described and self-contained: enough, at least, to direct unambiguous future replication of the experiments involved. More to the point, C2M2 metadata should also be directly reusable in new experiments wherever possible.
C2M2 metadata is managed and curated by Common Fund DCCs to standardize and stabilize it for future research use. CFDE's explicit mission for C2M2 is to create an information archive that can usefully serve researchers who work without guaranteed access to follow-up information, including (among other scenarios) for future work done after the funding lifecycle of each managing DCC has ended.
C2M2 metadata is created at different times by different DCCs working independently of one another. The first requirement for any system trying to integrate such information is to provide a standard way to unambiguously attach identifiers (IDs: formal names or labels) to resources described by the C2M2 metadata submitted by each DCC. As a minimum promise of structural integrity, C2M2 requirements guarantee that C2M2 IDs used by each DCC will not clash with any others in the system (present or future).
Beyond basic structural integrity, C2M2 also offers support for optional citation-stable IDs which encode actionable information that users or automated software can follow to further interact with the resource named by the ID.
Resources represented as C2M2 entities (`file`, `biosample`, `project`, etc.; see the C2M2 technical specification for scope and detail) must be identified with a C2M2 ID, and may also be identified with a `persistent_id`. C2M2 IDs ensure the basic structural integrity of the overall C2M2 system. Optional `persistent_id` identifiers are meant to be stable enough to be scientifically cited, and to provide for further investigation by accessing related resolver services.
To be used as a C2M2 `persistent_id`, an ID
- will represent an explicit commitment by the managing DCC that the attachment of the ID to the resource it represents is permanent and final
- must be a format-compliant URI or a compact identifier, where the protocol (the "scheme" or "prefix") specified in the ID is registered with at least one of the following (see the given lists for examples of URIs and compact identifiers)
    - the IANA (list of registered schemes)
        - scheme used must be assigned either "Permanent" or "Provisional" status
    - Identifiers.org (list of registered prefixes)
    - N2T (Name-To-Thing) (list of registered prefixes)
    - protocols not appearing in the above registries but explicitly approved by the CFDE-CC. Currently, this list is limited to one protocol, namely `drs://` URIs identifying GA4GH Data Repository Service resources.
- if representing a `file`, an ID used as a `persistent_id` cannot be a direct-download URL for that `file`: it must instead be an identifier permanently attached to the `file` and only indirectly resolvable (through the scheme or prefix specified within the ID) to the `file` itself
These requirements constitute a minimal set of rules to ensure that C2M2 resources can be stably cited in scientific literature and automatically reused in future research. Clearly, though, the production and maintenance of `persistent_id`s represents a substantial investment of time, thought and effort, and we also emphasize that not every C2M2 resource record that can receive a `persistent_id` will necessarily ever need one. These IDs -- while representing a gold standard for stability and long-term access -- are strictly optional.
DCCs should also note that without `persistent_id`s, digital file assets represented in C2M2 will serve only as inventory items and annotated search results: permanent, indirected `persistent_id`s are required in order to enable any automated interoperability between actual data files referenced by C2M2 records and external software systems (including direct download access to files).
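As a rough illustration of the `persistent_id` rules above, the following sketch checks a candidate ID for a `file` record. The allow-list `EXAMPLE_REGISTERED_PREFIXES`, the function names, and the direct-download heuristic are all assumptions made for this example; they are not part of C2M2 or of any CFDE tool, and a real check would consult the IANA, Identifiers.org and N2T registries.

```python
from urllib.parse import urlsplit

# Hypothetical, deliberately tiny allow-list for illustration only. The real
# registries (IANA, Identifiers.org, N2T) are much larger and change over time.
EXAMPLE_REGISTERED_PREFIXES = {"https", "doi", "drs", "ark"}


def persistent_id_prefix(candidate: str) -> str:
    """Return the lowercased scheme or prefix before the first ':' ('' if none)."""
    prefix, sep, _rest = candidate.partition(":")
    return prefix.lower() if sep else ""


def looks_like_direct_download(candidate: str, filename: str) -> bool:
    """Heuristic assumed by this sketch: the URL path ends in the file's own name."""
    return urlsplit(candidate).path.endswith("/" + filename)


def check_file_persistent_id(candidate: str, filename: str) -> list[str]:
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    if persistent_id_prefix(candidate) not in EXAMPLE_REGISTERED_PREFIXES:
        problems.append("scheme or prefix is not in the example allow-list")
    if looks_like_direct_download(candidate, filename):
        problems.append("looks like a direct-download URL")
    return problems
```

A check like this cannot confirm the DCC's commitment to permanence, which is an organizational promise rather than a property of the string.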
Since `persistent_id` is always optional, C2M2 provides a separate structure to provide for universal identification: the basic C2M2 ID is a two-part label comprised of a prefix (`id_namespace`) and a suffix (`local_id`) which, concatenated, make up the ID.
C2M2 IDs fall into categories described by three main cases:
[1] A `persistent_id` already exists for the object being named.
- if the `persistent_id` is a URI, then that URI should be split to form a C2M2 ID (see the URI reference for precise definitions of terms like "scheme" and "path" in this context):
    - `id_namespace` (prefix): `scheme://authority/`
    - `local_id` (suffix): `path`
    - Example: an SRA accession URI `https://www.ncbi.nlm.nih.gov/sra/SRX000007` stored in C2M2 as a `persistent_id` would be split, to form a corresponding C2M2 ID, into
        - an `id_namespace` prefix of `https://www.ncbi.nlm.nih.gov/sra/`
        - and a `local_id` suffix of `SRX000007`
- if the existing `persistent_id` is not a URI but instead is a compact identifier, it should be split similarly, with the details determined according to the particular format specification for the prefix being used: a scheme label and a reference to the issuing or owning authority (plus a delimiter) should constitute the `id_namespace` prefix, and the ID of the particular thing being referenced should be stored in the `local_id` suffix.
    - Example: the DOI compact identifier `doi:10.1006/jmbi.1998.2354` would be split into
        - an `id_namespace` prefix of `doi:10.1006/`, specifying the identifier type (`doi`) and the registered owner of the object (`10.1006`)
        - and a `local_id` suffix of `jmbi.1998.2354`
[2] A DCC already uses URIs to identify things that correspond to C2M2 entities (`files`, `biosamples`, etc.), but those URIs don't meet all the criteria to be C2M2 `persistent_id`s (e.g. they're not guaranteed to be permanent). Such URIs can still be split into an `id_namespace` prefix (containing a reference to the controlling authority, e.g. the DCC or one of its organizational data sources) and a `local_id` suffix (describing the object being identified) to form a C2M2 ID. (For records with IDs built like this, `persistent_id` would be left blank.)
[3] A DCC only has local identifiers for such entities. In this case, each local identifier will be the corresponding C2M2 `local_id` suffix (sanitized as necessary for URI safety), and the `id_namespace` prefix can be constructed according to the 'tag' URI proposal.
- Example: The tag-URI-based `id_namespace`/`local_id` C2M2 ID for a C2M2 `biosample` record representing Sample A-867-5309 at the Flerbiger's Disease Project (FDP) -- a non-permanent, strictly local sample ID assigned by the FDP for their C2M2 submission built at the end of the first quarter of 2021 -- might be (an email address would also work in place of 'flerbiger.org' below)
    - `id_namespace`: `tag:flerbiger.org,2021-03-31:`
    - `local_id`: `A-867-5309`
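The splitting and construction rules in the three cases above can be sketched in a few lines. This is a minimal illustration, not a CFDE utility: the function names are invented here, and splitting at the last `/` is an assumption chosen because it reproduces the SRA and DOI examples given above. Other compact-identifier formats may need a different delimiter.

```python
def split_c2m2_id(identifier: str) -> tuple[str, str]:
    """Split an ID at its last '/', keeping the delimiter with the prefix.

    Returns (id_namespace, local_id). Assumes the last '/' separates the
    issuing authority from the local name, as in the examples above.
    """
    prefix, sep, suffix = identifier.rpartition("/")
    if not sep or not suffix:
        raise ValueError(f"cannot split {identifier!r} into prefix and suffix")
    return prefix + sep, suffix


def tag_id_namespace(authority: str, date: str) -> str:
    """Build a 'tag' URI prefix of the form tag:<authority>,<date>:"""
    return f"tag:{authority},{date}:"


print(split_c2m2_id("https://www.ncbi.nlm.nih.gov/sra/SRX000007"))
# ('https://www.ncbi.nlm.nih.gov/sra/', 'SRX000007')
print(split_c2m2_id("doi:10.1006/jmbi.1998.2354"))
# ('doi:10.1006/', 'jmbi.1998.2354')
print(tag_id_namespace("flerbiger.org", "2021-03-31") + "A-867-5309")
# tag:flerbiger.org,2021-03-31:A-867-5309
```

In every case, concatenating the two parts reproduces the full identifier, which is the property the composite key relies on.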
C2M2 technical specification¶
The CFDE Crosscut Metadata Model (C2M2) is a relational database model designed to encode scientifically usable metadata that describes collections of accessible, standardized, and ideally replicable resources and results relevant to research in biomedicine.
Things are represented in C2M2 as data tables: specifically, rectangular data matrices, each row of which is a data record comprised of a small collection of named bits of data (fields). Each field has an agreed-upon meaning that helps to describe whatever "thing" the table represents; "things" (and the tables describing them) can refer to physical objects, like numbered biosamples; virtual objects, like digital files; or abstract concepts, like a project or a standard name for "salmon louse."
Relationships between things are represented as associations, whereby data records describing things are linked together in predefined ways so as to indicate that the linked things are meaningfully connected in some way. One might for example express the fact that
beard biopsy number "BB44" was collected from a billy goat named Abner
by
- creating a record (representing the biopsy material BB44) in a `biosample` table (representing all biosamples being described), which might look like

`biosample`

id | name | ... |
---|---|---|
BB44 | goat beard biopsy number BB44 | ... |

- creating a record (representing Abner) in a `subject` table (representing all creatures from which biosamples have been taken), for example

`subject`

id | species | age | ... |
---|---|---|---|
Abner | goat | 8 | ... |

- using an association table called "`biosample_from_subject`" to link the two records together: "the given biopsy `biosample` (BB44) was sampled from the given `subject` (Abner)", or

`biosample_from_subject`

biosample | subject |
---|---|
BB44 | Abner |
Our hypothetical goat biosample annotation describes an association linking records in two different data tables representing different types of thing (`biosample` and `subject`) in a well-defined way. Associations can also connect data records of the same type of thing (i.e. within the same data table). As an example,

project RNA_17.2 is a subproject of project RNA_17

could be expressed by
- creating a record in a `project` table to represent `RNA_17`
- creating another record in the same `project` table to represent `RNA_17.2`
- using an association like `project_in_project` to link the two records together: `project` record `RNA_17.2` is a subproject (`project_in_project`) of `project` record `RNA_17`.
(The example sketches above are intended only to illustrate the use of association tables and do not precisely represent any particular C2M2 tables or fields.)
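To make the association idea concrete, here is a small sketch that stores the three illustrative goat tables as lists of dictionaries and follows the association from a biosample to its subject. Like the tables above, the field names (`id`, `name`, `species`, `age`) are simplified stand-ins and do not match the real C2M2 column names; the function name is also hypothetical.

```python
# Illustrative tables only: real C2M2 tables use composite
# (id_namespace, local_id) keys rather than a single "id" column.
biosample = [{"id": "BB44", "name": "goat beard biopsy number BB44"}]
subject = [{"id": "Abner", "species": "goat", "age": 8}]
biosample_from_subject = [{"biosample": "BB44", "subject": "Abner"}]


def subjects_for_biosample(biosample_id: str) -> list[dict]:
    """Follow the association table from one biosample to its subject records."""
    subjects_by_id = {row["id"]: row for row in subject}
    return [
        subjects_by_id[link["subject"]]
        for link in biosample_from_subject
        if link["biosample"] == biosample_id
    ]


print(subjects_for_biosample("BB44"))
# [{'id': 'Abner', 'species': 'goat', 'age': 8}]
```

Keeping the link in its own table means that neither the `biosample` row nor the `subject` row has to change when a new relationship is recorded.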
Following the literature we will be calling "things" entities. The next graphic is an entity-relationship (ER) diagram describing C2M2. Entities (things) are drawn as full tables: boxes with descriptions of named fields. Associations (relationships between entities) are named inside small boxes: arrows are drawn connecting each association with the entities that participate in the relationship that the association represents.
[Figure: C2M2 model diagram]
Color key:
- Black: Core entities (basic experimental resources): `file`, `biosample` and `subject`
- Dark red: Association relationships between entities
- Blue: Container entities (`project` and `collection`) and their containment relationships
- Green: Term entities recording all standardized controlled-vocabulary terms submitted as C2M2 annotation metadata, plus extra descriptive information to facilitate user searching and web displays (see below for details)
- Gold: Administrative entities giving basic contact information for DCC creators of C2M2 submissions and describing CFDE-registered, DCC-controlled identifier namespaces
- Yellow: `subject_role_taxonomy`: a special association relationship optionally linking each `subject` record with
    - (possibly multiple) NCBI Taxonomy IDs
    - a user-supplied group of sub-entities of a `subject` -- like "host," "pathogen," or "microbiome constituent" -- identified according to roles describing components of commonly observed biosystem types. ("Single organism (with no further subdivisions)" is the default role).
Each valid C2M2 submission will contain 26 tab-separated value (TSV) files: one for each rectangle (entity or association) in the ER diagram above. Formats for all 26 files and their constituent fields are given in the C2M2 JSON Schema, and blank example files are provided for reference here. This schema document is an instance of the Data Package meta-specification published by the Frictionless Data group. It is a precise, machine-readable and (patient) human-readable JSON document that explicitly describes and explains all the structural components of C2M2 as drawn in the diagram above.
Each TSV file in a C2M2 submission will be a plain-text file representing a tabular data matrix, with rows delimited by newlines and fields (columns) delimited by tab characters. Field values in TSV files must conform to all formatting constraints specified in the C2M2 schema document; other common relational database constraints (unique columns, non-nullable fields, foreign key relationships) are also given in that document, which defines the required relational data structure of a valid C2M2 submission.
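As a minimal sketch of what such a relational constraint means in practice, the snippet below parses two tiny TSV tables with Python's standard `csv` module and reports `file` rows whose project key has no match in the `project` table. The column subsets, sample values and helper names are assumptions made for illustration; actual validation is driven by the C2M2 JSON Schema rather than by hand-written code like this.

```python
import csv
import io

# Reduced, hypothetical column subsets and made-up rows, for illustration only.
PROJECT_TSV = (
    "id_namespace\tlocal_id\tname\n"
    "tag:example.org,2021-03-31:\tP1\tExample project\n"
)
FILE_TSV = (
    "id_namespace\tlocal_id\tproject_id_namespace\tproject_local_id\tfilename\n"
    "tag:example.org,2021-03-31:\tF1\ttag:example.org,2021-03-31:\tP1\ta.txt\n"
    "tag:example.org,2021-03-31:\tF2\ttag:example.org,2021-03-31:\tP9\tb.txt\n"
)


def read_tsv(text: str) -> list[dict]:
    """Parse tab-delimited text whose first line is a header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def dangling_project_keys(file_rows: list[dict], project_rows: list[dict]) -> list[tuple]:
    """Return the project keys used by file rows that match no project row."""
    known = {(row["id_namespace"], row["local_id"]) for row in project_rows}
    return [
        (row["project_id_namespace"], row["project_local_id"])
        for row in file_rows
        if (row["project_id_namespace"], row["project_local_id"]) not in known
    ]


print(dangling_project_keys(read_tsv(FILE_TSV), read_tsv(PROJECT_TSV)))
# [('tag:example.org,2021-03-31:', 'P9')]
```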
A minimal set of additional content requirements -- not expressible as relational database constraints, but still required to support downstream C2M2-driven automation -- complete the definition of a fully valid C2M2 submission; these are given in this document, alongside prose descriptions of the terse technical expressions in the C2M2 JSON Schema.
Only three metadata records (three rows, across three C2M2 tables) are strictly required for a valid submission, so most TSV tables can optionally be left empty in a minimally compliant submission. The three required records are
- a short contact sheet (name, email, etc.) referencing the primary DCC technical contact responsible for the submission,
- a single `project` row representing the submitting DCC itself (for unambiguous resource attribution), and
- at least one identifier namespace row in the `id_namespace` TSV file, registered in advance with the CFDE (to protect IDs used by the submitting DCC to represent files, samples, etc. from potential conflicts with identifiers generated independently by other DCCs -- see the section on IDs for a full discussion of identifiers and namespaces in C2M2).
More information than this will clearly be required for a C2M2 metadata submission to be of any use, but regardless of individual metadata configuration, one TSV file must be created to represent each table, whether or not the table has any row data in it. Any blank table will be represented by a TSV file containing just one tab-separated header line which lists the (empty) table's field names. Instead of just omitting files for tables with no data, this requirement helps us differentiate "by design, no data is being submitted" from "this table was left out by mistake."
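The "one file per table, header-only if empty" rule can be sketched as follows. The `EXAMPLE_TABLES` mapping lists only two tables, using the field names given in this document, and `write_submission` is a hypothetical helper rather than a CFDE tool; a real submission would take all of its table and field definitions from the C2M2 JSON Schema.

```python
import csv
from pathlib import Path

# Illustrative subset of tables; field names follow the tables in this document.
COMMON = ["id_namespace", "local_id", "project_id_namespace",
          "project_local_id", "persistent_id", "creation_time"]
EXAMPLE_TABLES = {
    "biosample": COMMON + ["sample_prep_method", "anatomy"],
    "subject": COMMON + ["granularity", "sex", "ethnicity", "age_at_enrollment"],
}


def write_submission(directory, tables: dict, rows_by_table: dict) -> None:
    """Write one TSV per table; tables with no rows get a header line only."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    for table, fields in tables.items():
        path = directory / f"{table}.tsv"
        with open(path, "w", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(handle, fieldnames=fields,
                                    delimiter="\t", lineterminator="\n")
            writer.writeheader()
            # Fields missing from a row are written as empty strings.
            writer.writerows(rows_by_table.get(table, []))
```

Calling `write_submission("out", EXAMPLE_TABLES, {})` would produce `out/biosample.tsv` and `out/subject.tsv`, each containing a single header line.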
Common entity fields¶
The following fields all have the same meaning and serve the same function across the various entity tables that include them (`file`, `biosample`, `project`, etc.).
The phrase "this entity," in field descriptions below, refers to the particular entity (`file`, `biosample`, etc.) record stored in the row containing the field in question. For example: when we describe the `persistent_id` field as a "URI permanently attached to this entity," we mean "for any particular row R in any C2M2 entity table, if R contains a `persistent_id` value, then that `persistent_id` value will be a URI uniquely and permanently associated with the particular thing described by R." So if R is a row (describing one particular file, say F) in a C2M2 `file` table, then any value present in R's `persistent_id` field must be an identifier that's permanently attached to the file F (and to nothing else).
In this document the terms "record" and "row" are generally synonymous: one row in a C2M2 entity table represents a single metadata record -- an ordered group of values organized according to named fields -- describing exactly one entity of the type represented by that table. Example: one row in the `file` entity table is a record describing a single file.
field(s) | required? | description |
---|---|---|
`id_namespace` | required: primary key | URI-prefix identifier devised by the DCC managing this entity and pre-registered with CFDE-CC. The value of this field will be used together with `local_id` as a composite key structure formally identifying C2M2 entities within the total C2M2 data space. The concatenation of `id_namespace` + `local_id` must form a valid URI. (See C2M2 identifiers for discussion, examples and content restrictions.) |
`local_id` | required: primary key | URI-suffix identifier identifying this entity: a string that uniquely identifies each entity within the scope defined by the accompanying `id_namespace` value. The value of this field will be used together with `id_namespace` as a composite key structure formally identifying C2M2 entities within the total C2M2 data space. The concatenation of `id_namespace` + `local_id` must form a valid URI. (See C2M2 identifiers for discussion, examples and content restrictions.) |
`persistent_id` | optional | An optional, resolvable URI permanently attached to this entity: a permanent address which must resolve (via some service like identifiers.org) to some network-retrievable object describing the entity, like a landing page with basic descriptive information, or a direct-download URL. Actual network locations (e.g. bare download URLs) must not be embedded directly within this identifier: one level of indirection (the resolver service) is required in order to protect `persistent_id` values from changes in network location over time as data is moved around. (See C2M2 identifiers for discussion, examples and content restrictions.) |
`creation_time` | optional | An ISO 8601 / RFC 3339 (subset)-compliant timestamp documenting this entity's creation time (or, in the case of a `subject` entity, the time at which the subject was first documented by the primary project under which the subject was first observed): `YYYY-MM-DDTHH:MM:SS±NN:NN`, where `YYYY` is the year, `MM` the month, `DD` the day, `HH` the hour, `MM` the minute, `SS` the second, and `±NN:NN` the time zone offset from UTC. Apart from the time zone segment of `creation_time` (`±NN:NN`) and the year (`YYYY`) segment, all other constituent segments of `creation_time` named here may be rendered as `00` to indicate a lack of available data at the corresponding precision. |
`abbreviation`, `name` and `description` | optional*† | Text describing this entity, to be used in C2M2 user interface displays showing row-level data. Final length limits on these fields have not yet been established, but will be soon, so content in these fields should be kept as terse as possible. Expect a rough maximum of 10 characters for abbreviations, 25 characters for names and the length of a typical paper abstract for descriptions. |
`project_id_namespace`, `project_local_id` | required: `project` foreign key | This pair of fields stores a required foreign key into this submission's `project` table. The row in the `project` table identified by this key represents the primary project under which this entity was first created, observed, documented or otherwise encountered. (See the section on the `project` table for more on the meaning of `project` and usage details, including options for constructing simplified default values for these required fields.) |
*`primary_dcc_contact.dcc_abbreviation` and `primary_dcc_contact.dcc_name` are required fields, as is the value of `project.abbreviation` for one special `project` record representing the submitting DCC: see the `primary_dcc_contact` table and `project` table sections, respectively, for details. `project.name` is also universally required (and must be unique to each project).
†the `name` field is required for all term tables (the green tables in the diagram above); note that this information is automatically generated from existing ontology references (see below for details on how these tables are built)
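Because `creation_time` allows `00` in segments such as month or day, a strict date parser would reject some permitted values. The sketch below therefore checks only the overall `YYYY-MM-DDTHH:MM:SS±NN:NN` shape with a regular expression. The pattern and function name are assumptions for illustration; they do no range checking and do not reproduce the official validation logic.

```python
import re

# Shape check only: four-digit year, two-digit remaining segments, and a
# mandatory signed time zone offset. No range checks (so "00" month passes).
CREATION_TIME_PATTERN = re.compile(
    r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}[+-]\d{2}:\d{2}"
)


def is_creation_time_shaped(value: str) -> bool:
    """Return True if value has the YYYY-MM-DDTHH:MM:SS±NN:NN layout."""
    return CREATION_TIME_PATTERN.fullmatch(value) is not None


print(is_creation_time_shaped("2021-03-31T14:05:09-05:00"))  # True
print(is_creation_time_shaped("2021-00-00T00:00:00+00:00"))  # True
print(is_creation_time_shaped("2021-03-31"))                 # False
```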
Core C2M2 entities¶
The file entity: a stable digital asset¶
field(s) | required? | description |
---|---|---|
id_namespace, local_id, project_id_namespace, project_local_id, persistent_id, creation_time | (see above) | (See Common entity fields section) |
size_in_bytes | optional | The size of this file in bytes. (integer) |
uncompressed_size_in_bytes | optional | The total decompressed size in bytes of the contents of this file. (integer) |
sha256 | required if md5 is null | CFDE-preferred file checksum string: the output of the SHA-256 cryptographic hash function after being run on this file. One or both of sha256 and md5 is required. |
md5 | required if sha256 is null | Permitted file checksum string: the output of the MD5 message-digest algorithm after being run as a cryptographic hash function on this file. One or both of sha256 and md5 is required. |
filename | required | A filename with no prepended PATH information. (e.g. example.txt and not /usr/foo/example.txt) |
file_format | optional | An EDAM CV term ID identifying the digital format of this file (e.g. format:3475 for "TSV", or format:1930 for "FASTQ"). |
compression_format | optional | An EDAM CV term ID identifying the compression format of this file (e.g. gzip or bzip2): null if this file is not compressed |
data_type | optional | An EDAM CV term ID identifying the type of information stored in this file (e.g. data:3495 for "RNA sequence reads"). |
assay_type | optional | An OBI CV term ID describing the type of experiment that generated the results summarized by this file. |
analysis_type | optional | An OBI CV term ID describing the type of analytic operation that generated this file |
mime_type | optional | A MIME type (or "IANA media type") describing this file, e.g. "text/plain" or "application/octet-stream". See this page for a tutorial introduction and this list for a complete reference. |
bundle_collection_id_namespace | optional | If this file is a bundle encoding more than one sub-file, this field gives the id_namespace of a collection listing the bundle's sub-file contents; null otherwise |
bundle_collection_local_id | optional | If this file is a bundle encoding more than one sub-file, this field gives the local_id of a collection listing the bundle's sub-file contents; null otherwise |
dbgap_study_id | optional | The name of a dbGaP study ID governing access control for this file, compatible for comparison to RAS user-level access control metadata |
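The size, checksum and filename fields above can all be derived from a file on disk. The following is a minimal sketch, not part of any CFDE-provided tooling: the helper name `describe_file` and the shape of the returned dictionary are assumptions made for illustration, with keys chosen to mirror the C2M2 column names.

```python
import hashlib
import os


def describe_file(path, chunk_size=1 << 20):
    """Return size_in_bytes, sha256, md5 and filename for one file.

    Hypothetical helper: the keys mirror C2M2 `file` column names, but this
    function is an illustration only, not a CFDE-provided utility.
    """
    sha256 = hashlib.sha256()
    md5 = hashlib.md5()
    size = 0
    with open(path, "rb") as handle:
        # Read in chunks so large files are never loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            sha256.update(chunk)
            md5.update(chunk)
            size += len(chunk)
    return {
        "size_in_bytes": size,
        "sha256": sha256.hexdigest(),
        "md5": md5.hexdigest(),
        # C2M2 expects the bare filename, with no prepended path information.
        "filename": os.path.basename(path),
    }
```

Computing both digests in a single pass avoids reading large files twice; a submission only needs one of the two, with sha256 preferred.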
The biosample entity: a tissue sample or other physical specimen¶
field(s) | required? | description |
---|---|---|
id_namespace, local_id, project_id_namespace, project_local_id, persistent_id, creation_time | (see above) | (See Common entity fields section) |
sample_prep_method | optional | An OBI CV term ID (from the "planned process" branch of the vocabulary, excluding the "assay" subtree) describing the preparation method that produced this biosample |
anatomy | optional | An UBERON CV term ID used to locate the origin of this biosample within the physiology of a source organism. |
The subject entity: a biological entity from which a C2M2 biosample can be generated¶
field(s) | required? | description |
---|---|---|
id_namespace, local_id, project_id_namespace, project_local_id, persistent_id, creation_time | (see above) | (See Common entity fields section) |
granularity | required | A CFDE-controlled vocabulary categorizing broad classes of possible biosample sources. |
sex | optional | A CFDE CV category characterizing the physiological sex of this subject |
ethnicity | optional | A CFDE CV category characterizing the self-reported ethnicity of this subject |
age_at_enrollment | optional | The age in years (with a fixed precision of two digits past the decimal point) of this subject when they were first enrolled in the primary project within which they were studied |
The C2M2 subject entity is a generic data type meant to represent any biological entity from which a biosample can be generated. (The notion of a biosample being derived from another biosample will be modeled explicitly in future C2M2 versions; please see the biosample section above for more.)
In addition to the common entity fields, C2M2 metadata includes two details specific to subject entities: a structural configuration called granularity, and taxonomic labels.
A required granularity field is included in each subject row and contains one of a fixed list of categorical value codes. These codes characterize each C2M2 subject record in the broadest possible terms:
subject.granularity field value | name | description |
---|---|---|
cfde_subject_granularity:0 | single organism | One organism. |
cfde_subject_granularity:1 | symbiont system | A mixed system consisting of two or more organisms (symbionts) in symbiosis (living colocated in time and space): one such symbiont may optionally be identified as a host. |
cfde_subject_granularity:2 | host-pathogen system | A special case of a symbiont system consisting of one symbiont, designated as a host, plus one or more other symbionts acting to create or sustain disease within the host organism. |
cfde_subject_granularity:3 | microbiome | A symbiont system consisting of a collection of (potentially unknown or partially characterized) taxa, where the environment in which the system resides is well-characterized, but the taxonomic composition of the system may be unknown; optionally contains one symbiont specially identified as a host. |
cfde_subject_granularity:4 | cell line | A cell line derived from one or more species or strains. |
cfde_subject_granularity:5 | synthetic | A synthetic biological entity. |
This is a draft list for granularity: we do not imagine it to be in its final form.
A reference table describing current granularity values and descriptions can be found here.
Taxonomic labels can also be attached to subject records. For the most basic "single organism" granularity, this will be a one-to-one relationship: "this subject was a member of Speciesella exemplarensis." On the other hand, since a C2M2 subject entity can represent more complex biosample sources, like an entire microbiome or a multi-organism symbiont system, more complex granularity types may require that multiple parallel taxonomic labels be attached to a single subject record.
The C2M2 subject_role_taxonomy table provides a way to do this by linking taxonomic labels to subject records while also specifying, for each such label, which sort of subject subcomponent (categorized as a "role"; see below) the label should be attached to. For example, the information "host: Homo sapiens; pathogen: Francisella tularensis" could be attached to a single subject record representing a biopsy from a (human-based) "host-pathogen system". Please see the section below on subject_role_taxonomy for all the details on how to use this table.
Association tables: inter-entity linkages¶
C2M2 association tables codify relationships between specific entities of different types: in database terms, they spell out relationships between particular rows across different tables.
file_describes_subject
Each row in the C2M2 file_describes_subject association table consists of two identifiers, used as foreign keys: one for a file record (a row in the file table describing one particular file), and one for a subject record (a row in the subject table describing one particular subject or source organism). Since C2M2 identifiers each have two parts -- id_namespace and local_id -- this gives a total of four fields for this table:
file_id_namespace | file_local_id | subject_id_namespace | subject_local_id |
Each row of file_describes_subject declares that some particular file F contains data describing a particular subject S. The file F is identified by C2M2 ID (file_id_namespace + file_local_id) as a specific row in the C2M2 file table; the subject S is similarly referenced (subject_id_namespace + subject_local_id) as a particular row in the subject table.
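As an illustration of the four-field layout described above, here is a minimal sketch that writes file_describes_subject rows as tab-separated text using Python's standard csv module. The function name and the identifiers `FILE_0001` and `SUBJ_24601` are hypothetical; the authoritative column order is the one given in the C2M2 JSON Schema.

```python
import csv
import io

# Column names as listed above; check the C2M2 JSON Schema for the
# authoritative order before preparing a real submission.
FIELDS = [
    "file_id_namespace",
    "file_local_id",
    "subject_id_namespace",
    "subject_local_id",
]


def write_file_describes_subject(rows, handle):
    """Write association rows (dicts keyed by FIELDS) as tab-separated text."""
    writer = csv.DictWriter(
        handle, fieldnames=FIELDS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical identifiers, for illustration only.
buffer = io.StringIO()
write_file_describes_subject(
    [
        {
            "file_id_namespace": "DCC_X_namespace",
            "file_local_id": "FILE_0001",
            "subject_id_namespace": "DCC_X_namespace",
            "subject_local_id": "SUBJ_24601",
        }
    ],
    buffer,
)
tsv_text = buffer.getvalue()
```

The same pattern applies to the other association tables listed in this section, with the field names swapped accordingly.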
The following tables work in the same way:
- file_describes_biosample
- biosample_from_subject
- collection_defined_by_project
- file_describes_collection (for any file describing an entire C2M2 collection)
These association tables are optional: valid C2M2 submissions do not need to express all (or indeed any) of these relationships. If included, association table information about relationships between entities can be used to power smarter downstream discovery than would be possible if limited only to manifests of isolated, unlinked resources.
Each association table's name defines the relationship it represents, and these are generally nonspecific by design, to facilitate harmonization along basic conceptual lines across the federated C2M2 metadata space. collection_defined_by_project optionally attaches a primary generating project to a C2M2 collection: this relationship is more specific than the others given here, and is meant to express the same relationship between a project and a collection as is (for example) expressed by the project foreign key in the file table: "collection C was defined under the auspices of project P," just as "file F was created by project P," or "biosample B was obtained and catalogued during project P." Not every collection will have a well-defined DCC-modeled project under which it was created, so the collection_defined_by_project association is optional.
Please see the relevant sections of the C2M2 JSON Schema to find all field names and foreign-key constraints for these tables.
Container entities¶
C2M2 offers two ways -- project and collection -- to define groups of related entities (file, subject, biosample, etc.). All valid C2M2 submissions must provide at least minimal information describing a project hierarchy, with each metadata record attached to a well-defined project space. Operations essential to data discovery (sorting, searching and binning) depend on this information, so that as data is discovered by users, it can be more easily associated with its proper research context. The C2M2 collection container, a generalization of "dataset", is optional, with its scope and complexity of usage generally left to the submitting DCC.
project¶
The C2M2 project entity models an unambiguous, unique, named, most-proximate research/administrative sphere of operations that first generates or observes experimental resources represented by core C2M2 entities (file, subject, etc.). The concept of project is loosely rooted in -- but not necessarily mapped one-to-one from -- some corresponding hierarchy of grants, contracts or other important administrative subdivisions of primary research purview. Specifically what that means -- exactly what these administrative subdivisions are, and how metadata is to be allocated to them -- is left to the submitting DCC, subject to the structural constraints we impose on project in anticipation of using it consistently across metadata spaces from different DCCs.
The project hierarchy described by any valid C2M2 submission must represent a directed, rooted, acyclic graph (a directed tree). Nodes (vertices) on this tree are represented as rows in the C2M2 project table. Edges between pairs of nodes, representing parent/child relationships between projects (or more awkwardly "containing project"/"subproject" relationships), are expressed as rows in the project_in_project association table (see below). Each row in project_in_project lists one parent project and one child project; taken together, all of the rows in project_in_project represent the entire project tree hierarchy within the submission.
Regardless of whether a DCC has a natural "top-level project" under which to nest all other project records, C2M2 requires by convention that one artificial project row be created and identified as the root node of the project hierarchy: this node will represent the DCC itself. This row is referenced via foreign key by the contact entry in the primary_dcc_contact table (see below), and serves as an anchor point for creating roll-up summaries or other aggregations of C2M2 metadata arranged according to submitting DCC.
A unique project attribution is required for each row of all core entity types: foreign keys (discussed above) are provided for this purpose in the file, biosample and subject tables. In a truly minimal case, a DCC's project hierarchy can consist of just the artificial root node representing the DCC itself, with all resources attributed to that one node. This forgoes the downstream advantages of more fine-grained accounting, but still yields a valid submission.
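The tree constraint on the project hierarchy described above (exactly one root, at most one parent per project, no cycles) can be checked mechanically before submission. The sketch below is an illustration under assumed inputs, not the validation logic used by CFDE: the function name `find_project_root` is hypothetical, and projects are represented as (id_namespace, local_id) tuples.

```python
def find_project_root(project_ids, parent_child_edges):
    """Return the root of a project hierarchy, or raise ValueError.

    project_ids: set of (id_namespace, local_id) tuples, one per row of the
        project table.
    parent_child_edges: iterable of (parent_id, child_id) pairs, one per row
        of project_in_project.
    """
    parent_of = {}
    for parent, child in parent_child_edges:
        if parent not in project_ids or child not in project_ids:
            raise ValueError(f"edge references unknown project: {parent} -> {child}")
        if child in parent_of:
            raise ValueError(f"project {child} has more than one parent")
        parent_of[child] = parent

    # The root is the only project that never appears as a child.
    roots = [p for p in project_ids if p not in parent_of]
    if len(roots) != 1:
        raise ValueError(f"expected exactly one root project, found {len(roots)}")
    root = roots[0]

    # Every project must reach the root by following parent links; a cycle
    # would revisit a node before the root is reached.
    for start in project_ids:
        seen = set()
        node = start
        while node != root:
            if node in seen:
                raise ValueError(f"cycle detected involving {node}")
            seen.add(node)
            node = parent_of[node]
    return root
```

In the truly minimal case, `find_project_root({dcc_root}, [])` simply returns the single artificial root node.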
collection¶
The C2M2 collection entity is a generalization of "dataset." Elements of a collection can be data resources like the C2M2 file entity, but a collection can also contain non-data entities (e.g. biosample or subject). It is meant to serve as a generic container for grouping related core C2M2 entities; no semantic context, superstructure, or usage assumptions are built in. The collection entity should be used to represent all relevant, extant groupings of experimental resources represented as C2M2 core entities, especially those with already-created permanent and citation-ready identifiers (e.g. datasets for which DOIs have been registered). Eventually we expect to offer researchers using C2M2 metadata the opportunity to define and cite collection entities federating existing C2M2 resources across multiple source DCCs: providing the ability to stably cite groups of C2M2 resources is in fact the central purpose of collection. As a structural support for FAIRness principles within CFDE, the C2M2 collection entity is designed to facilitate reliable reuse and reanalysis of Common Fund data and metadata.
In terms of minimal valid C2M2 submissions, collection is entirely optional: DCC metadata need not necessarily include any collection records or attributions. Membership of core C2M2 entities in a collection is expressed with the relevant ("X_in_collection") association tables; nested collection entities are listed in the collection_in_collection association table. (See below for complete usage details.)
We emphasize that no relationship is assumed between project and collection. A collection may optionally be attributed to a primary (defining or generating) C2M2 project -- via the collection_defined_by_project association table (see above) -- but collection-project associations will not always be well-defined, and are not required. We expect eventually to extend the ability to define new collection entities, on an ongoing basis, to interested community members (beyond DCC data managers) whose work may not be related to any C2M2 project records and whose new collection entities won't be attributable in this way.
Association tables: expressing containment relationships¶
- project_in_project
- collection_in_collection
- file_in_collection
- subject_in_collection
- biosample_in_collection
These tables are used to express basic containment relationships like "this file is in this collection" or "this project is a sub-project of this other project." Rows in these tables consist of four fields:
- two (an id_namespace and a local_id) comprising a foreign key representing the containing project or collection, and
- two (a second {id_namespace, local_id} pair) acting as a foreign key referencing the contained resource (file, biosample, etc.) or group (e.g., a "child project" or subproject).
Example set of fields (from project_in_project):
parent_project_id_namespace | parent_project_local_id | child_project_id_namespace | child_project_local_id |
Another example (from file_in_collection):
file_id_namespace | file_local_id | collection_id_namespace | collection_local_id |
Please see the relevant sections of the C2M2 JSON Schema to find all field names and foreign-key constraints for each of these association tables.
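Because each containment row holds two foreign keys, a simple pre-submission check is to confirm that both keys resolve to rows in the referenced tables. The following is a minimal sketch under assumed inputs: the function name `dangling_references` and the sample identifiers are hypothetical, and rows are represented as plain dictionaries keyed by column name.

```python
def dangling_references(association_rows, key_fields, known_ids):
    """Return the association rows whose foreign key is not in known_ids.

    association_rows: list of dicts, one per TSV row.
    key_fields: (namespace_column, local_id_column) naming one foreign key.
    known_ids: set of (id_namespace, local_id) tuples from the referenced table.
    """
    ns_col, id_col = key_fields
    return [
        row for row in association_rows
        if (row[ns_col], row[id_col]) not in known_ids
    ]


# Hypothetical data, for illustration only.
files = {("DCC_X_namespace", "FILE_0001")}
collections = {("DCC_X_namespace", "COLL_A")}
file_in_collection = [
    {"file_id_namespace": "DCC_X_namespace", "file_local_id": "FILE_0001",
     "collection_id_namespace": "DCC_X_namespace", "collection_local_id": "COLL_A"},
    {"file_id_namespace": "DCC_X_namespace", "file_local_id": "FILE_9999",
     "collection_id_namespace": "DCC_X_namespace", "collection_local_id": "COLL_A"},
]

missing_files = dangling_references(
    file_in_collection, ("file_id_namespace", "file_local_id"), files
)
missing_collections = dangling_references(
    file_in_collection, ("collection_id_namespace", "collection_local_id"), collections
)
```

Here the second row refers to a file that is absent from the file table, so it is the only entry in `missing_files`, while `missing_collections` is empty.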
Taxonomy and the subject entity: the subject_role_taxonomy association table¶
In the subject section above, we introduced the idea of flexibly attaching (possibly multiple) taxonomic labels to subject records. For the most basic "single organism" granularity, such a labeling will be a straightforward one-to-one map: "subject S-24601 was a member of Speciesella exemplarensis." On the other hand, since a C2M2 subject entity can represent more complex types of biosample sources -- like a multi-organism symbiont system -- more complex granularity types may require that several taxonomic labels be attached to a single subject record in parallel, to directly describe the various constituent components of the subject. The subject_role_taxonomy table is a ternary association table; each row contains three identifiers:
- the C2M2 ID (id_namespace + local_id) of a subject record
- a role_id drawn from a preset list of subject sub-component types:
subject_role_taxonomy.role_id field value | name | description |
---|---|---|
cfde_subject_role:0 | single organism | The organism represented by a subject in the 'single organism' granularity category |
cfde_subject_role:1 | host | Any organism identified as a host for a subject assigned to the 'symbiont system', 'host-pathogen system', or 'microbiome' granularity categories |
cfde_subject_role:2 | symbiont | An organism identified as a symbiont within a subject assigned to the 'symbiont system' granularity category |
cfde_subject_role:3 | pathogen | An organism identified as a pathogen symbiont in a subject assigned to the 'host-pathogen system' granularity category |
cfde_subject_role:4 | microbiome taxon | A constituent taxon of a subject assigned to the 'microbiome' granularity category |
cfde_subject_role:5 | cell line ancestor | A taxon identified as a source organism for a subject assigned to the 'cell line' granularity category |
cfde_subject_role:6 | synthetic | A synthetic biological entity |
(This is a draft list for role_id: we do not imagine it to be in its final form.)
A reference table describing current subject_role values and descriptions can be found here.
- a taxonomic label (specifically, an identifier of the form NCBI:txid######, where ###### is the numeric ID of the desired label in the NCBI Taxonomy database)
Each row in subject_role_taxonomy thus attaches one taxonomic label to one subject via one specified role_id subcomponent type.
For example, the information "host: Homo sapiens; pathogen: Francisella tularensis" could be attached to a single subject record -- representing a biopsy from a (human-based) "host-pathogen system" -- by doing the following three things (|| symbols represent tab characters separating TSV fields, in the example rows below):
- setting the granularity of the subject record (in the subject table) to cfde_subject_granularity:2 ("host-pathogen system": see here for the complete list)
id_namespace: DCC_X_namespace || local_id: SUBJ_24601 || [...] || granularity: cfde_subject_granularity:2
- adding one row to subject_role_taxonomy specifying that the host of the subject system is Homo sapiens
subject_id_namespace: DCC_X_namespace || subject_local_id: SUBJ_24601 || role_id: cfde_subject_role:1 || taxonomy_id: NCBI:txid9606
- adding one more row to subject_role_taxonomy specifying F. tularensis as a pathogen in the subject system
subject_id_namespace: DCC_X_namespace || subject_local_id: SUBJ_24601 || role_id: cfde_subject_role:3 || taxonomy_id: NCBI:txid263
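The two subject_role_taxonomy rows above can also be generated programmatically. This is a minimal sketch rather than CFDE tooling: the function name `taxonomy_rows` is hypothetical, the role codes are copied from the draft role_id table above, and the column names follow the example rows.

```python
# Role codes copied from the draft role_id table above; the list is subject
# to change, so treat this mapping as illustrative.
ROLE_IDS = {
    "single organism": "cfde_subject_role:0",
    "host": "cfde_subject_role:1",
    "symbiont": "cfde_subject_role:2",
    "pathogen": "cfde_subject_role:3",
    "microbiome taxon": "cfde_subject_role:4",
    "cell line ancestor": "cfde_subject_role:5",
    "synthetic": "cfde_subject_role:6",
}


def taxonomy_rows(subject_id_namespace, subject_local_id, role_taxid_pairs):
    """Build one subject_role_taxonomy row per (role name, NCBI taxon number).

    A list of pairs (not a dict) is used so that one role, such as
    "microbiome taxon", can be attached to several taxa.
    """
    return [
        {
            "subject_id_namespace": subject_id_namespace,
            "subject_local_id": subject_local_id,
            "role_id": ROLE_IDS[role],
            "taxonomy_id": f"NCBI:txid{int(taxid)}",
        }
        for role, taxid in role_taxid_pairs
    ]


# Host: Homo sapiens (9606); pathogen: Francisella tularensis (263).
rows = taxonomy_rows(
    "DCC_X_namespace", "SUBJ_24601", [("host", 9606), ("pathogen", 263)]
)
```

The resulting `rows` list holds the same two records shown in the worked example, ready to be written out as TSV.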
Controlled vocabularies and term entity tables¶
Support for the detailed description of C2M2 metadata with terms from standard scientific ontologies is a key component of cross-collection metadata harmonization within the CFDE. C2M2 currently provides a small number of (relatively uncontroversial) fields through which controlled (standardized, curated) scientific vocabulary terms can be attached to core C2M2 entities.
At present all C2M2 controlled vocabulary annotations are optional. Curated ontologies currently supplying supported term sets are
- the Disease Ontology (DO)
- the Ontology for Biomedical Investigations (OBI) (a good browser is available at EMBL-EBI)
- the Uber-Anatomy Ontology (UBERON)
- the NCBI Taxonomy
- EDAM, an ontology for bioinformatics concepts including data types and formats (browser)
The following table lists all current C2M2 controlled vocabulary fields: each CV field is listed as C2M2_entity_table.field_name. We list the source ontology for each field along with a general description of that field's intended annotation context within C2M2.
CV field | ontology | description |
---|---|---|
file.assay_type | OBI | the type of experiment that produced a file |
file.file_format | EDAM | the digital format or encoding of a file (e.g. "FASTQ") |
file.compression_format | EDAM | the compression format of a file (e.g. gzip or bzip2), if it is compressed |
file.data_type | EDAM | the type of information contained in a file (e.g. "sequence data") |
file.analysis_type | OBI | the type of analytic operation that generated a file |
biosample.sample_prep_method | OBI | the preparation method used to produce a biosample |
biosample.anatomy | UBERON | the physiological source location in or on the subject from which a biosample was derived |
ncbi_taxonomy.id | NCBI Taxonomy | a taxonomic name associated with a subject record (usage details discussed above) |
subject.granularity | CFDE CV | the multiplicity of a subject (see details above) |
subject.sex | CFDE CV | the physiological sex of a subject |
subject.ethnicity | CFDE CV | the self-reported ethnicity of a subject |
In addition to these fields, (possibly multiple) diseases can be optionally associated with each biosample or subject record via the biosample_disease and subject_disease tables, which connect biosamples or subjects (via their C2M2 IDs) to Disease Ontology terms.
Submitters should use bare CV terms in the relevant fields (e.g. file.file_format might be populated with format:1930 to express that the containing record represents a FASTQ-formatted file). In the case of NCBI Taxon IDs, which are integers in their "barest" form, values should be encoded as NCBI:txid####, where the #### suffix represents the integer ID of the taxon in question: Homo sapiens would thus be represented as NCBI:txid9606.
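The NCBI encoding rule above is easy to apply uniformly when source data mixes bare integers with already-prefixed IDs. The helper below is a minimal sketch; the function name `encode_ncbi_taxon` is an assumption for illustration and is not part of any CFDE-provided tooling.

```python
import re

# Accepts a bare integer ID or one already carrying the NCBI:txid prefix.
_TAXON_ID = re.compile(r"(?:NCBI:txid)?(\d+)")


def encode_ncbi_taxon(value):
    """Return `NCBI:txid<integer>` for an integer or a bare/prefixed string."""
    match = _TAXON_ID.fullmatch(str(value).strip())
    if match is None:
        raise ValueError(f"not an NCBI Taxonomy ID: {value!r}")
    return f"NCBI:txid{int(match.group(1))}"
```

For example, `encode_ncbi_taxon(9606)`, `encode_ncbi_taxon("9606")` and `encode_ncbi_taxon("NCBI:txid9606")` all produce the same encoded value for Homo sapiens, while a taxon name such as "Homo sapiens" is rejected with a ValueError.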
Note again that all CV fields are optional: if any of these annotations is unavailable or inappropriate for particular C2M2 records, or if the supported ontologies prevent proper encoding of the relevant information, then the associated CV fields (or records in CV association tables like biosample_disease) can and should be left blank.
If sufficiently specific terms cannot be found in the supported ontologies, we encourage DCC data managers to provisionally include more general ancestor terms (as available), and simultaneously to contact the CFDE Ontology Working Group with descriptions of any needed additions to the supported controlled vocabularies. CFDE has established direct update channels with the curation authorities for each supported ontology, and the Ontology WG explicitly aims to expedite the addition of any missing terms on behalf of Common Fund DCCs, rather than forcing bad choices.
For each controlled vocabulary supported by C2M2, a term table must be included as part of any valid submission. (These are the green tables in the ER diagram above.) Each such table will contain one row for each (unique) CV term used anywhere in the containing C2M2 submission, along with basic descriptive information for each term (to empower both downstream user searches and automated display interfaces). All term metadata is to be automatically loaded directly from the ontology reference data: green term-tracker tables should not be built by hand by DCC staff preparing submissions. Instead, the other C2M2 entity and association tables should be prepared first, with CV terms included in the appropriate fields. Once these are built, a CFDE-provided script must be used to automatically scan the prepared tables. The script will find all CV terms used throughout the submission, validate them against externally-provided ontology reference files, and then combine this information with descriptive data drawn directly from the reference files, to automatically build all necessary (green) term tables. These automatically-generated term tables (TSVs) are then to be bundled along with the rest of the C2M2 submission (as prepared directly by the DCC data management team).
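To make the scanning step concrete, the sketch below shows how unique CV terms might be gathered from one prepared TSV table. It is emphatically not the CFDE-provided script, which must still be used to build the actual term tables: the function name `collect_terms` and the sample rows are hypothetical, and no validation against ontology reference files is attempted here.

```python
import csv
import io


def collect_terms(tsv_handle, cv_fields):
    """Return {field: set of non-empty term IDs} for one TSV table."""
    found = {field: set() for field in cv_fields}
    for row in csv.DictReader(tsv_handle, delimiter="\t"):
        for field in cv_fields:
            # Blank CV fields are permitted by C2M2, so skip empty values.
            value = (row.get(field) or "").strip()
            if value:
                found[field].add(value)
    return found


# Hypothetical excerpt of a file table, restricted to a few columns.
sample = io.StringIO(
    "local_id\tfile_format\tdata_type\n"
    "FILE_0001\tformat:1930\tdata:3495\n"
    "FILE_0002\tformat:1930\t\n"
)
terms = collect_terms(sample, ["file_format", "data_type"])
```

Each term appears once per field regardless of how many rows use it, which matches the one-row-per-unique-term shape of the term tables described above.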
The primary_dcc_contact table¶
field(s) | required? | description |
---|---|---|
contact_email | required: primary key | Email address of the primary DCC contact for this C2M2 submission. (Format: valid email address) |
contact_name | required | Name of this DCC contact. |
project_id_namespace, project_local_id | required: project foreign key | The id_namespace and local_id fields of the project row representing this contact's DCC. |
dcc_abbreviation | required | A short label for this contact's DCC. (Pattern: [a-zA-Z0-9_]+) |
dcc_name | required | A short, human-readable, machine-read-friendly label for this contact's DCC. |
dcc_description | optional | A paragraph-length description of this contact's DCC. |
dcc_url | required | URL of the front page of the website for this contact's DCC. |
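The required-field and pattern constraints in the table above lend themselves to a quick self-check before submission. This is a minimal sketch, assuming a contact row has been loaded as a dictionary keyed by column name; the function name `contact_row_problems` is hypothetical, and the check does not replace validation against the C2M2 JSON Schema.

```python
import re

# Pattern taken from the dcc_abbreviation row of the table above.
ABBREVIATION_PATTERN = re.compile(r"[a-zA-Z0-9_]+")

REQUIRED_FIELDS = [
    "contact_email",
    "contact_name",
    "project_id_namespace",
    "project_local_id",
    "dcc_abbreviation",
    "dcc_name",
    "dcc_url",
]


def contact_row_problems(row):
    """Return a list of human-readable problems found in one contact row."""
    problems = [
        f"missing required field: {name}"
        for name in REQUIRED_FIELDS
        if not row.get(name)
    ]
    abbreviation = row.get("dcc_abbreviation") or ""
    # fullmatch ensures the whole value, not just a prefix, fits the pattern.
    if abbreviation and not ABBREVIATION_PATTERN.fullmatch(abbreviation):
        problems.append("dcc_abbreviation must match [a-zA-Z0-9_]+")
    return problems
```

An empty list means the row passed these particular checks; dcc_description is optional and is therefore not inspected.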
C2M2 release details¶
MAY 2023 RELEASE
- Make file.filename a required field.
- Phenotypes can now use Mammalian Phenotype Ontology terms if no appropriate HPO term exists.
DECEMBER 2022 RELEASE
- Remove biosample.assay_type (though the file table retains its assay_type field).
- Add biosample.sample_prep_method, which references a new sample_prep_method controlled vocabulary table.
- Add several Interlex terms to the data_type and file_format controlled vocabulary tables.
- Remove two unused terms from the controlled vocabulary for the subject.sex field.
- Make the following changes to the controlled vocabulary for the subject.race field:
  - The existing term "Black" is renamed to "Black or African American"
  - The existing term "American Indian or Alaskan Native" is renamed to "American Indian or Alaska Native"
  - The existing term "Asian or Pacific Islander" will be deprecated and likely removed in a future release. It is superseded by the two new terms mentioned below.
  - The new term "Asian" is added.
  - The new term "Native Hawaiian or Other Pacific Islander" is added.
APRIL 2022 RELEASE
- collection.name will be required and must be unique (within each Program's submission)
- Boolean collection.has_time_series_data has been added to offer basic annotation to reach users looking for time series datasets
  - possible values are true (collection contains time-series data), false (collection doesn't contain time-series data) and null (no information provided)
- New field file.dbgap_study_id to support initial stand-up integrating file access control metadata with RAS user authentication metadata within the CFDE UI
- Add collection_gene, collection_compound, collection_substance, collection_taxonomy, collection_anatomy and collection_protein to provide associations between C2M2 collections and concepts in relevant controlled vocabularies
- Extend usage of C2M2 compound table to include partial-knowledge glycans not tracked by PubChem
- Add subject_role category for expression system (e.g. E. coli modified to express nonnative gene products)
- Add a protein CV (UniProtKB IDs & descriptions)
  - add protein_gene
    - auto-populated from existing reference metadata, as with e.g. phenotype_disease
FEBRUARY 2022 RELEASE
- add file.analysis_type (OBI)
- clarifications for subject.sex minimal enum values
- add negative assertion option to disease associations
  - subject_disease can now document "disease ruled out" in addition to "disease detected"
  - biosample_disease similarly augmented
- add phenotype observation data (new CV: Human Phenotype Ontology)
  - subject_phenotype (positive & negative assertions permitted)
  - reference tables phenotype_disease and phenotype_gene (imported directly from HPO)
- add collection_disease and collection_phenotype to provide associations between C2M2 collections and disease and phenotype terms
- updated term builder script to handle new components
- HPO integration is currently being tested for final deployment: please get in touch with us if you wish to use the phenotype ontology before this process is complete
NOVEMBER 2021 RELEASE
- new entities: compound, substance
  - simplified mirror of same-named PubChem structures
  - meant to model "small molecules" of various types (e.g. drugs)
  - not proteins or genes or other abstract structures with more baggage than "unambiguous chemical compound" -- such will be modeled separately later on
- new associations: biosample_substance, subject_substance
- new entity: gene
  - prototype: list of Ensembl IDs as a CV with metadata describing names, synonyms and source organisms
- biosample_gene association added as a limited prototype for knockout data
- new fields on subject for (public) clinical metadata: sex, race (multi-select-capable), ethnicity, age_at_enrollment, age_at_sampling (as biosample_from_subject.age_at_sampling)
- primary_dcc_contact renamed to dcc; id field added; mirrors portal registry data
- new bundle_collection foreign key from file into collection
  - allows enumeration of contents of archive files containing multiple subfiles (e.g. TAR archive files)
- new file.compression_format CV (EDAM) for clearer expression of file compression configurations
- updated term builder script to handle new components
SEPTEMBER 2021 RELEASE
- removed regex pattern constraints on included CV terms
  - OBI, for example, imports terms from other ontologies: being pattern-proscriptive about this proved impossible. We now just directly check each detected term against the ontology's reference store.
- added disease (support for Disease Ontology terms) plus biosample_disease and subject_disease associations
- added (auto-built) synonyms fields (JSON arrays) to CV term tables
- added file_describes_collection
- added biosample.assay_type
- restored id_namespace foreign key assertions to file, biosample, subject, project, collection
- tightened constraints on most name fields
  - required, unique: see above for full usage details
- updated term builder script
  - now handles disease
  - now handles ncbi_taxonomy
  - other stability improvements
  - construction of all (green) CV-usage tables now fully automated as of this release
Upcoming C2M2 features¶
No C2M2 updates are planned at this time.