
The Common Fund Data Ecosystem's Crosscut Metadata Model (CFDE C2M2)

This document introduces the Crosscut Metadata Model (C2M2), a flexible metadata standard for describing experimental resources in biomedicine and related fields. The Common Fund Data Ecosystem group is creating a new computing infrastructure, with C2M2 as its organizing principle, to offer the health research community an unprecedented array of intersectional data tools. The C2M2 system will connect researchers with scale-powered statistical analysis methods; deep, seamless searching across experimental data generated by different projects and organizations; and new ways to aggregate and integrate experimental data from different sources to facilitate scientific replication and to drive new discoveries.

Using this new infrastructure, Common Fund data coordinating centers (DCCs) will share structured, detailed information (metadata) about their experimental resources with the research community at large, widening and deepening access to usable observational data. One immediate consequence will be a drastic simplification in the effort required to perform meta-analysis of results from multiple independent teams studying similar health-related phenomena.

DCC Metadata Submissions

DCCs collect and provide metadata submissions (C2M2 instances) to CFDE describing experimental resources within their purview. Each submission is a set of tab-separated value files (TSVs); precise formatting requirements for these filesets are specified by JSON Schema documents, each of which is an instance of the Data Package meta-specification published by the Frictionless Data group. The Data Package meta-specification is a toolkit for defining format and content requirements for files so that automatic validation can be performed on those files, just as a database management system stores definitions for database tables and automatically validates incoming data based on those definitions. Using this toolkit, the C2M2 JSON Schema specifications lay out foreign-key relationships between metadata fields (TSV columns), rules governing missing data, required content types for particular fields, and other similar database management constraints to define basic structural integrity for C2M2 metadata submissions. During the C2M2 ingestion process, the C2M2 software infrastructure uses these specifications to automatically validate format compliance and submission integrity, prior to loading metadata into its central database. Once loaded, metadata are used to fuel downstream services like search results, customized statistical summaries, dynamic display graphics, and asset browsing within experimental resource collections.
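
As a rough illustration of the kind of constraint these specifications express, the sketch below pairs a small Table Schema-style descriptor with a hand-written checker. The descriptor keys (fields, constraints, missingValues, primaryKey, foreignKeys) follow the Frictionless Table Schema vocabulary; the table name, the project_local_id column, and the check_table helper are assumptions made for illustration and are not the published C2M2 schema or the CFDE validator.

```python
import csv
import io

# Illustrative descriptor only: table and column names are assumptions,
# not the published C2M2 JSON Schema.
FILE_TABLE = {
    "name": "file",
    "schema": {
        "fields": [
            {"name": "id_namespace", "type": "string", "constraints": {"required": True}},
            {"name": "local_id", "type": "string", "constraints": {"required": True}},
            {"name": "project_local_id", "type": "string"},
        ],
        "missingValues": [""],
        "primaryKey": ["id_namespace", "local_id"],
        "foreignKeys": [
            {"fields": "project_local_id",
             "reference": {"resource": "project", "fields": "local_id"}},
        ],
    },
}


def check_table(descriptor, tsv_text, referenced_keys):
    """Return a list of human-readable problems found in one TSV table.

    referenced_keys maps a referenced table name to the set of key values
    that exist in that table.
    """
    schema = descriptor["schema"]
    missing = set(schema.get("missingValues", [""]))
    problems = []
    seen_keys = set()
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        # Rule 1: required fields must not hold a missing-data marker.
        for field in schema["fields"]:
            required = field.get("constraints", {}).get("required", False)
            if required and (row.get(field["name"]) or "") in missing:
                problems.append(f"line {line_no}: {field['name']} is required")
        # Rule 2: the primary key must be unique within the table.
        key = tuple(row.get(name) or "" for name in schema["primaryKey"])
        if key in seen_keys:
            problems.append(f"line {line_no}: duplicate key {key}")
        seen_keys.add(key)
        # Rule 3: non-missing foreign keys must point at an existing record.
        for fk in schema.get("foreignKeys", []):
            value = row.get(fk["fields"]) or ""
            target = fk["reference"]["resource"]
            if value not in missing and value not in referenced_keys[target]:
                problems.append(f"line {line_no}: {value!r} not found in {target}")
    return problems
```

Calling check_table(FILE_TABLE, tsv_text, {"project": {...}}) on a submission's file table would return an empty list when all three rules hold.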

C2M2 Levels

CFDE offers DCCs three alternative metadata submission formats (C2M2 Levels 0, 1 and 2), each of which is automatically interoperable with the entire C2M2 software ecosystem. Levels are tiered primarily according to increasing complexity. The general idea is that DCC resource collections can be represented quickly (and thus begin driving downstream applications quickly) using metadata encoded at lower (simpler) C2M2 levels: over time, and as feasible, DCC data managers can upgrade their C2M2 metadata submissions by expanding into higher levels.

C2M2 Levels 0, 1 and 2 are increasingly large and complex variants of the same metadata model (C2M2): each level is a defined collection of data tables (encoded as TSVs: see above on submissions). Level 0 defines just one short file table; Level 1 provides a larger file table than Level 0 (more fields), adds more tables describing other basic biomedical resource concepts including biosample, subject, and project, and introduces ways to express simple relationships between records in different tables; Level 2 is an ever-growing mature metadata interchange standard, customized to advanced DCC metadata needs: it contains all Level 1 information, plus a library of more detailed metadata objects and extensions to existing objects. Lower levels are strict subsets of higher ones: Level 1's file table contains all Level 0 file fields and more; Level 2's biosample is an expanded superset of the Level 1 biosample, etc. Upgrades from one level to the next are therefore limited by design to be done only by expansion: moving up a level will not require DCCs to make changes to any metadata already provided.

A foundational purpose of the C2M2 system is to facilitate metadata harmonization: finding ways wherever possible to represent comparable things in standard ways, without compromising meaning, context or accuracy. In addition to complexity management, C2M2 levels are also intended to roughly encapsulate groups of concepts according to increasing harmonization difficulty.

Some examples, sorted by increasingly heavy challenges to harmonization:

  • All DCCs have file resources, describable (at a very high level) in a standard, noncontroversial way (size + filename: Level 0).
  • In addition to data files, virtually all DCCs deal in some way with biosamples and/or subjects: metadata describing basic aspects of these common concepts can be fairly (if still quite broadly) expressed by a shared model (Level 1), thereby enabling the beginnings of cross-dataset search and analysis.
  • Many DCCs have at least some specialized data, unique to their own spheres of operation, which (at least for some time) will not be meaningful candidates for system-level harmonization (Level 2 includes specialized extension modeling).

Most DCCs already have some form of internal metadata model in use for their own curation operations. C2M2 integration of similar but distinct packages of important information, taken from multiple independently-developed custom DCC metadata systems (including e.g. metadata describing people and organizations, data provenance relationships, experimental protocols, protected data, or detailed event sequences), will require ongoing, iterative, case-based design and consensus-driven decision-making, often coordinated across multiple independent research groups. Design and decision-making in such contexts will require long-term planning, testing and execution. Metadata difficult (or even impossible) to integrate and harmonize is thus handled as part of the ongoing evolution and expansion of Level 2, leaving Level 1 tasked with supporting relatively universal, simple and uncontroversial metadata concepts to maintain streamlined development and deployment of important core metadata packages without unnecessarily blocking feasible tasks to wait on more expensive custom integration.

With the design of C2M2, we are splitting the difference between the ease of evolution inherent in a simple model and the operational power provided to downstream applications by more complicated and difficult-to-maintain extended frameworks.

Modeling and data wrangling are always difficult, even for experts. Part of the goal of the level system is to compartmentalize the C2M2 model so as to maintain flexibility -- especially during developmental phases -- in order to best accommodate mutual learning between DCCs and CFDE as the construction of this federated metadata system progresses. It is generally far more expensive and error-prone to repeatedly change a complex, over-built, inseparable, monolithic model than it is to build one gradually from a simpler core of agreement which can be relatively quickly stabilized while more specialized branches are built in parallel and transitioned into more general use.

At any given moment, participating Common Fund DCCs will span a broad range of experience and available funding, based on mission details and lifecycle phases. DCCs with advanced, operationalized metadata modeling systems of their own should not encounter arbitrary barriers to C2M2 support for more extensive relational modeling of their metadata if they want it; CFDE will maintain such support by iteratively refining Level 2 according to needs identified while working with DCCs already wielding complex metadata models. Newer or smaller DCCs, by contrast, may not have enough readily-available information to feasibly describe their experimental resources using Level 2 structures (either existing or proposed): C2M2 Level 1 thus also aims to actively support such cases by offering simpler but still well-structured metadata models where metadata has already been harmonized across other DCCs, lowering barriers to rapid entry into the data ecosystem and meaningful participation in downstream services.

A C2M2 topic requiring special attention is the use of identifiers.

C2M2 identifiers

  • Two complementary identifier slots for DCC-issued records
  • persistent_id: persistent and resolvable ID (ideal, but optional)
  • id_namespace, local_id: 2-part key as the least-common denominator
  • The optional persistent identifier is a DOI, ARK, MINID, etc.
  • The 2-element composite identifier conveys fragments of a record URI
  • local_id bears a name from some namespace, e.g. an accession ID
  • id_namespace specifies which namespace, i.e. left-hand side of a URI
  • concatenation of namespace + local yields the full record URI
  • URIs are cheap and easy: low barrier to entry, no hosting requirements
  • To steer users directly to data, we require persistent_id!
  • Otherwise, we can steer users towards the DCC contact.
  • For consistency, we repeat this scheme on several entity tables.

If every DCC arrived having already adopted robust, persistent and resolvable identifiers, we could simply mandate their use to identify all records. Not all have, so we need to support different DCC identifier maturity levels, and that means keeping persistent_id optional.

We add a 2-part composite identifier as something which can lower the barrier to entry: this is a URI that has been split into two parts. The web rules for forming URIs are very flexible: it takes essentially no cost and no real infrastructure to generate URIs, which are just rules for namespace hygiene laying out how a DCC can claim a namespace out of the ether, ideally rooted in some other basic name they own such as a DNS name (website domain) or even just an email address.

(For those who don’t know, the difference between a URI and a URL is that a URI is just an identifier. It doesn’t have to address any actual web server which responds to messages, so you can produce URIs all day long and never worry about hosting costs, availability, etc.)

The core idea is that most DCCs will already have some accession ID or other locally-unique key for their assets. They can copy that value into the local_id part, or define some other mapping of their own choosing. Then, they can either define a new id_namespace or pass through an applicable, existing URI prefix as the id_namespace component. As long as each DCC follows the web rules to use an id_namespace they “own”, collisions (where two DCCs might try to use the same composite identifier) are automatically avoided.
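
A minimal sketch of the composite key in action, assuming records are held as dictionaries keyed by column name. The function names and the example.org namespaces are hypothetical.

```python
from collections import Counter


def record_uri(id_namespace, local_id):
    """Concatenate the two key parts to recover the full record URI."""
    return id_namespace + local_id


def find_collisions(records):
    """Return every (id_namespace, local_id) pair used by more than one record."""
    counts = Counter((r["id_namespace"], r["local_id"]) for r in records)
    return sorted(key for key, count in counts.items() if count > 1)


# Two hypothetical DCCs reuse the same accession string, but each claims
# its own namespace, so the composite keys stay distinct.
records = [
    {"id_namespace": "https://dcc-a.example.org/files/", "local_id": "F0001"},
    {"id_namespace": "https://dcc-b.example.org/files/", "local_id": "F0001"},
]
```

Here find_collisions(records) finds nothing to report, because the two records differ in id_namespace even though their local_id values match.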

We also allow for the use of persistent, resolvable identifiers for things which are not really data. We suspect that many of the same practical benefits of these identifiers for data might apply to other entity types, even if resolution might lead to landing pages and contact info which at most guides users to documentation or other possible actions they might take in the real world or in the parallel bureaucratic world. (Attempting to directly download subjects or biosamples might trigger some unexpected and unpleasant -- although possibly comical -- downstream side effects.)

  • For a DCC already issuing persistent, resolvable IDs
  • persistent_id: fill with canonical ID
  • id_namespace, local_id: fill with (split) copy of canonical ID
  • For a DCC already issuing relatively stable URIs
  • persistent_id: leave blank until ready
  • id_namespace, local_id: fill with (split) copy of URI
  • For a DCC already issuing local accession IDs
  • persistent_id: leave blank until ready
  • local_id: fill with local accession ID
  • id_namespace: choose an appropriate namespace URI-prefix
  • For a DCC without local ID stability
  • Need to invent something approximating accession ID and proceed as above

If a DCC already uses persistent identifiers such as DOIs, ARKs, or other short identifiers resolvable by some name-to-thing service, they can just put that value into all the identifier fields:

  • verbatim in the persistent_id field so that consumers know a persistent_id is available for this record
  • split and replicated into the id_namespace and local_id fields so that the core C2M2 requirement for a record key is met
  • there is no need for the DCC to issue and juggle two separate identifier formats, but the compromise C2M2 format requires the fields to be populated

If a DCC already uses URIs or URLs for entities having a one-to-one correspondence to C2M2 record concepts, they can split that URI or URL into a namespace prefix part and a final local part and use those values to fill the 2-part composite record identifier. They would probably leave persistent_id blank.

If a DCC only has local identifiers for such entities, they can put those in the local_id part and then fabricate a new id_namespace representing their DCC or the sub-organizational scope where these local identifiers are indeed unique. The same id_namespace should be reused for all peer records with local parts coming from the same DCC naming system.
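
The three cases above can be sketched as a single helper. The rule used here for splitting an identifier (cut after the last "/", "#" or ":") is an assumption chosen for illustration, since this document does not prescribe where the split must fall. The function names and sample identifiers are likewise hypothetical.

```python
import re


def split_identifier(identifier):
    """Split so that namespace + local part reproduces the original string.

    Assumption: the split falls after the last '/', '#' or ':'.
    """
    match = re.fullmatch(r"(.*[/#:])([^/#:]+)", identifier)
    if match is None:
        raise ValueError(f"cannot split {identifier!r} into namespace and local parts")
    return match.group(1), match.group(2)


def identifier_fields(persistent_id=None, uri=None, accession=None, namespace=None):
    """Fill the three C2M2 identifier fields for one record."""
    if persistent_id:
        # Case 1: canonical persistent ID, copied verbatim and also split.
        id_namespace, local_id = split_identifier(persistent_id)
        return {"persistent_id": persistent_id,
                "id_namespace": id_namespace, "local_id": local_id}
    if uri:
        # Case 2: stable URI or URL; persistent_id stays blank.
        id_namespace, local_id = split_identifier(uri)
        return {"persistent_id": "", "id_namespace": id_namespace, "local_id": local_id}
    if accession and namespace:
        # Case 3: local accession ID plus a namespace chosen by the DCC.
        return {"persistent_id": "", "id_namespace": namespace, "local_id": accession}
    raise ValueError("need a persistent_id, a uri, or an accession with a namespace")
```

For example, identifier_fields(uri="https://dcc.example.org/samples/S001") leaves persistent_id blank and yields the namespace "https://dcc.example.org/samples/" with local_id "S001".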

Having stable identifier management is, unavoidably, basic “table stakes” for a DCC to produce reusable data. Archiving and indexing options like those provided by C2M2 are extremely limited in the absence of some method of persistent bookkeeping to track inventory: persistent identifiers are a core requirement precisely because without them, federated views of resource metadata from multiple DCCs cannot be maintained.

Level 0

C2M2 Level 0 defines a minimal valid C2M2 instance. Data submissions at this level of metadata richness will be the easiest to produce, and will support the simplest available functionality implemented by downstream applications.

Level 0 submission process: overview

Metadata submissions at Level 0 will consist of a single TSV file describing a collection of digital files owned or managed by a DCC. The properties listed for the Level 0 file entity (see below for diagram and definitions) will serve as the TSV's column headers; each TSV row will represent a single file. The Level 0 TSV itself thus represents a manifest or inventory of digital files that a DCC wants to introduce into the C2M2 metadata ecosystem.

This level encodes the most basic file metadata: its use by downstream applications will be limited to informing the least specific level of data accounting, querying and reporting.

Level 0 model diagram

Level 0 technical specification: properties of the file entity

Required: id_namespace local_id sha256|md5

property description
id_namespace String identifier devised by the DCC managing this file (cleared by CFDE-CC to avoid clashes with any preexisting id_namespace values). The value of this property will be used together with local_id as a composite key structure formally identifying Level 0 file entities within the total C2M2 data space. (See C2M2 identifiers for discussion and examples.)
local_id Unrestricted-format string identifying this file: can be any string as long as it uniquely identifies each file within the scope defined by the accompanying id_namespace value. (See C2M2 identifiers for discussion and examples.)
persistent_id A permanent, resolvable URI permanently attached to this file, meant to serve as a permanent address to which landing pages (which summarize metadata associated with this file) and other relevant annotations and functions can optionally be attached, including information enabling resolution to a network location from which the file can be downloaded. Actual network locations must not be embedded directly within this identifier: one level of indirection is required in order to protect persistent_id values from changes in network location over time as files are moved around. (See C2M2 identifiers for discussion and examples.)
size_in_bytes The size of this file in bytes. This varies (even for "copies" of the same file) across differences in storage hardware and operating system. CFDE does not require any particular method of byte computation: precise, reproducible file size integrity metadata will be provided in the form of checksum data in the sha256 and/or md5 properties. size_in_bytes will instead underpin automatic reporting of approximate storage statistics across different C2M2 collections of DCC metadata.
sha256 CFDE-preferred file checksum string: the output of the SHA-256 cryptographic hash function after being run on this file. One or both of sha256 and md5 is required.
md5 Permitted file checksum string: the output of the MD5 message-digest algorithm after being run as a cryptographic hash function on this file. One or both of sha256 and md5 is required. (CFDE recommends SHA-256 if feasible, but we recognize the nontrivial overhead involved in recomputing these hash values for large collections of files, so if MD5 values have already been generated, CFDE will accept them.)
filename A filename with no prepended PATH information.
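
A minimal sketch of producing one Level 0 row per file and writing the manifest as a TSV, using only the Python standard library. The column order, the helper names, and the choice to leave md5 blank when sha256 is supplied are assumptions; the authoritative column definitions are those in the Level 0 JSON Schema.

```python
import csv
import hashlib
from pathlib import Path

# Column order is an assumption; defer to the Level 0 JSON Schema.
LEVEL0_COLUMNS = ["id_namespace", "local_id", "persistent_id",
                  "size_in_bytes", "sha256", "md5", "filename"]


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def level0_row(path, id_namespace, local_id, persistent_id=""):
    """Build one manifest row describing a single file."""
    path = Path(path)
    return {
        "id_namespace": id_namespace,
        "local_id": local_id,
        "persistent_id": persistent_id,       # optional: may stay blank
        "size_in_bytes": path.stat().st_size,
        "sha256": sha256_of(path),
        "md5": "",                            # not needed when sha256 is given
        "filename": path.name,                # no prepended path information
    }


def write_level0_tsv(rows, handle):
    """Write the header line followed by one tab-separated line per file."""
    writer = csv.DictWriter(handle, fieldnames=LEVEL0_COLUMNS,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
```

A DCC that already holds MD5 values could fill the md5 column instead and leave sha256 blank, since only one of the two is required.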

Level 0 technical specification: JSON Schema and example TSVs

A JSON Schema document (implementing Frictionless Data's "Data Package" container meta-specification) defining the Level 0 TSV can be found here; an example Level-0-compliant TSV submission collection can be found here (just the file.tsv portion) and here (as a packaged BDBag archive).

Level 1

C2M2 Level 1 models basic experimental resources and associations between them. This level of metadata richness is more difficult to produce than Level 0's flat inventory of digital file assets. As a result, Level 1 metadata offers users more powerful downstream tools than are available for Level 0, including

  • faceted searches on a (small) set of biologically relevant features (like anatomy and taxonomy) of experimental resources like biosample and subject
  • organization of summary displays using subdivisions of experimental metadata collections by project (grant or contract) and collection (any scientifically relevant grouping of resources)
  • basic reporting on changes in metadata over time, tracking (for example) creation times for file and biosample

C2M2 Level 1 is designed to offer an intermediate tier of difficulty, in terms of preparing metadata submissions, between Level 0's basic digital inventory and the full intricacy of Level 2. Accordingly, we have reserved several modeling concepts -- requiring the most effort to produce and maintain -- for Level 2. The following are not modeled at Level 1:

  • any and all protected data
  • documentation of experimental protocols
  • event-based resource generation/provenance networks
  • detailed information on organizations and people governing the research being documented
  • a comprehensive suite of options to model scientific attributes of experimental resources
    • full collection of features like anatomy, taxonomy, and assay type, plus formal vocabularies to describe them
    • prerequisite to offering research users deep and detailed search possibilities

Level 1 submission process: overview

Build the core C2M2 entity tables (black) and the C2M2 container tables (blue) shown in the diagram below, and fill out the DCC contact sheet (grey). Once you've built the core entity tables, the green tables can be built automatically using our term-scanner script, which will collect all relevant CV terms used throughout your core entity tables and will create the corresponding green term-index tables, using data loaded from versioned, whole-CV reference documents (like OBO files).

In the case of any unpopulated tables (no collection records, for example, are required for model compliance), please create the relevant TSV files anyway, with just one tab-separated header line containing the empty table's column names. (In contrast to simply omitting the blank table file, the recommended practice instead explicitly distinguishes the case in which no data is being submitted for a given table from the case in which a table has been omitted by mistake.)
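
A header-only table can be produced in a couple of lines. In this sketch the helper name is hypothetical, and the column list passed in must come from the Level 1 specification for the table in question.

```python
import csv
import io


def empty_table_tsv(column_names):
    """Return the text of a TSV holding only its tab-separated header line."""
    buffer = io.StringIO()
    csv.writer(buffer, delimiter="\t", lineterminator="\n").writerow(column_names)
    return buffer.getvalue()
```

Writing the returned string to, say, collection.tsv records explicitly that the table was considered and is empty.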

Color key:

  • #000000 Black: C2M2 Level 1 core entities: files, biosamples and subjects
  • #a52a2a Dark red: Associative relationships between Level 1 core entities
  • #0000ff Blue: Level 1 container entities (projects and collections) and their containment relationships
  • #1e7a1e Green: Tables recording all third-party ontology or controlled-vocabulary terms used within a Level 1 submission, including extra information about UI display labels
  • #8b6914 Gold: Single-record table listing basic contact information for DCC staff managing a Level 1 submission
  • #ffa500 Yellow: Association table optionally annotating each Level 1 subject record with
    • (possibly multiple) NCBI Taxonomy ID attributions
    • specification (and individual annotation) of subject sub-entities based on generic roles in observational ecosystems, like "host," "pathogen," "site-specific microbiome," "basic single organism" (default), etc.
Level 1 model diagram

Level 1 technical specification

Core entities
  • file revisited (superset additions: cf. below, §"Common entity fields" and also §"Controlled vocabularies and term tables")
  • biosample introduced (also cf. below, §"Common entity fields" and §"Controlled vocabularies and term tables")
    • Level 1 models biosamples as abstract materials that are directly consumed by one or more analytic processes. Simple provenance relationships -- between each such biosample and the subject from which it was originally derived, as well as between each biosample and any files analytically derived from it -- are represented using association tables, with one such table dedicated to each relationship type (cf. below, §"Association tables: inter-entity linkages"). Actual DCC-managed provenance metadata will sometimes (maybe always) represent more complex and detailed provenance networks: in such situations, chains of "this produced that" relationships too complex to model at Level 1 will need to be transitively collapsed. As an example: let's say a research team collects a cheek-swab sample from a hospital patient; subjects that swab sample to several successive preparatory treatments like centrifugation, chemical ribosomal-RNA depletion and targeted amplification; then runs the final fully-processed remnant material through a sequencing machine, generating a FASTQ sequence file as the output of the sequencing process. In physical terms our team will have created a series of distinct material samples, connected one to another by (directed) "X derived_from Y" relationships, represented as a (possibly branching) graph path (in fully general terms, a directed acyclic graph) running from a starting node set (here, our original cheek-swab sample) through intermediate nodes (one for each coherent material product of each individual preparatory process) to some terminal node set (in our case, the final-stage, immediately-pre-sequencer library preparation material). C2M2 Level 2 will offer metadata structures to model this entire process in full detail, including representational support for all intermediate biosamples, and for the various preparatory processes involved. 
For the purposes envisioned to be served by Level 1 C2M2 metadata, on the other hand, only subject <-> some_monolithic_stuff <-> (FASTQ) file can and should be explicitly represented.
      • The simplifications here are partially necessitated by the fact that event modeling has been deliberately deferred to C2M2 Level 2: as a result, the notion of a well-defined "chain of provenance" is not modeled at this C2M2 Level. (More concretely: Level 1 does not represent inter-biosample relationships.)
      • The modeling of details describing experimental processes has also been assigned to Level 2.
      • With both of these (more complex) aspects of experimental metadata masked at C2M2 Level 1, the most appropriate granularity at which a Level 1 biosample entity should be modeled is as an abstract "material phase" (possibly masking what is in reality a chain of multiple distinct materials) that enables an analytic (or observational or other scientific) process (which originates at a subject) to move forward and ultimately produce one or more files.
    • In practice, a Level 1 C2M2 instance builder facing such a situation might reasonably create one record for the originating subject; create one biosample entity record; create a file record for the FASTQ file produced by the sequencing process; and hook up subject <-> biosample and biosample <-> file relationships via the corresponding association tables (cf. below, §"Association tables: inter-entity linkages").
      • In terms of deciding (in a well-defined way) specifically which native DCC metadata should be attached to this Level 1 biosample record, one might for example choose to import metadata (IDs, etc.) describing the final pre-sequencer material. The creation of specific rules governing maps from native DCC data to (simplified, abstracted) Level 1 entity records is of necessity left up to the best judgment of the serialization staff creating each DCC's Level 1 C2M2 ETL instance; we recommend consistency, but beyond that, custom solutions will have to be developed to handle different data sources. Real-life examples of solution configurations will be published (as they are collected) to help inform decision-making, and CFDE staff will be available as needed to help create mappings between the native details of DCC sample metadata and the approximation that is the C2M2 Level 1 biosample entity.
      • Note in particular that this example doesn't preclude attaching multiple biosamples to a single originating subject; nor does it preclude modeling a single biosample that produces multiple files.
      • Note also that the actual end-stage material prior to the production of a file might not always prove to be the most appropriate metadata source from which to populate a corresponding biosample entity. Let's say a pre-sequencing library preparation material M is divided in two to produce derivative materials M1 and M2, with M1 and M2 then amplified separately and sequenced under separate conditions producing files M1.fastq and M2.fastq. In such a case -- depending on experimental context -- the final separation and amplification processes producing M1 and M2 might reasonably be ignored for the purposes of Level 1 modeling, instead attaching a single (slightly upstream) biosample entity -- based on metadata describing M -- to both M1.fastq and M2.fastq. As above, final decisions regarding detailed rules mapping native DCC data to Level 1 entities are necessarily left to DCC-associated investigators and serialization engineers; CFDE staff will be available as needed to offer feedback and guidance when navigating mapping issues.
  • subject introduced (also cf. below, §"Common entity fields" and §"Taxonomy and the subject entity")
    • The Level 1 subject entity is a generic container meant to represent any biological entity from which a Level 1 biosample can be generated (the notion of biosamples being generated by other biosamples is more appropriately modeled at C2M2 Level 2: cf. §"biosample introduced", immediately above)
    • Alongside shared metadata fields (cf. below, §"Common entity fields") and inter-entity associations (cf. below, §"Association tables: inter-entity linkages"), C2M2 Level 1 models two additional details specific to subject entities:
      • internal structural configuration (referred to as subject_granularity and included in each subject record as one of an enumerated list of categorical value codes for concepts like, e.g., "single organism," "microbiome," "cell line"): the reference list of granularity terms (with descriptions) is given here
      • taxonomic assignments attached to subcomponents ("roles," another ontological enumeration listed here for reference) of subject entities, e.g. "cell line ancestor -> NCBI:txid9606" or "host (of host-pathogen symbiont system) -> NCBI:txid10090": this is accomplished via the subject_role_taxonomy categorical association table (cf. below, §"Association table: taxonomy and the subject entity: the subject_role_taxonomy table")
    • all other subject-specific metadata -- including any protected data -- is deferred by design to Level 2
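
The transitive collapse described for the cheek-swab example can be sketched as follows, under one possible policy: the Level 1 biosample is taken to be the material each file was directly generated from, and it is linked to whatever subject sits at the top of its native derivation chain. The function name, the input shapes, and the single-parent simplification (real provenance may be a general directed acyclic graph) are all assumptions for illustration. As noted above, the actual mapping rules are left to each DCC.

```python
def collapse_provenance(derived_from, sample_subject, file_sample):
    """Collapse a native sample chain into Level 1 association pairs.

    derived_from:   dict mapping each derived sample to its parent sample
    sample_subject: dict mapping each original sample to its subject
    file_sample:    dict mapping each file to the sample it was generated from
    Returns (biosample-subject pairs, file-biosample pairs), both sorted.
    """
    def originating_subject(sample):
        # Walk "X derived_from Y" links upward until an original sample is reached.
        visited = set()
        while sample in derived_from:
            if sample in visited:
                raise ValueError(f"cycle in derivation chain at {sample!r}")
            visited.add(sample)
            sample = derived_from[sample]
        return sample_subject[sample]

    biosample_subject = set()
    file_biosample = set()
    for file_id, sample in file_sample.items():
        # Intermediate materials are masked: only the endpoints are recorded.
        biosample_subject.add((sample, originating_subject(sample)))
        file_biosample.add((file_id, sample))
    return sorted(biosample_subject), sorted(file_biosample)


# Hypothetical native records for the cheek-swab example.
derived_from = {"centrifuged": "swab", "depleted": "centrifuged", "library": "depleted"}
sample_subject = {"swab": "patient_1"}
file_sample = {"reads.fastq": "library"}
```

With these inputs the three intermediate materials disappear from the Level 1 view, leaving one biosample ("library") tied to patient_1 on one side and to reads.fastq on the other.
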
Common entity fields

The following properties all have the same meaning and function across the various entities they describe (file, biosample, project, etc.).

property description
id_namespace String identifier devised by the DCC managing this entity (cleared by CFDE-CC to avoid clashes with any preexisting id_namespace values). The value of this property will be used together with local_id as a composite key structure formally identifying Level 1 C2M2 entities within the total C2M2 data space. (See C2M2 identifiers for discussion and examples.)
local_id Unrestricted-format string identifying this entity: can be any string as long as it uniquely identifies each entity within the scope defined by the accompanying id_namespace value. (See C2M2 identifiers for discussion and examples.)
persistent_id A permanent, resolvable URI permanently attached to this entity, meant to serve as a permanent address to which landing pages (which summarize metadata associated with this entity) and other relevant annotations and functions can optionally be attached, including information enabling resolution to a network location from which the entity can be viewed, downloaded, or otherwise directly investigated. Actual network locations must not be embedded directly within this identifier: one level of indirection is required in order to protect persistent_id values from changes in network location over time as entity data is moved around. (See C2M2 identifiers for discussion and examples.)
creation_time An ISO 8601 / RFC 3339 (subset)-compliant timestamp documenting this entity's creation time (or, in the case of a subject entity, the time at which the subject was first documented by the project under which the subject was first observed): YYYY-MM-DDTHH:MM:SS±NN:NN, where
  • YYYY is a four-digit Gregorian year
  • MM is a zero-padded, one-based, two-digit month between 01 and 12, inclusive
  • DD is a zero-padded, one-based, two-digit day of the month between 01 and 31, inclusive
  • HH is a zero-padded, zero-based, two-digit hour label between 00 and 23, inclusive (12-hour time encoding is specifically prohibited)
  • MM and SS represent zero-padded, zero-based integers between 00 and 59, inclusive, denoting Babylonian-sexagesimal minutes and seconds, respectively
  • ± denotes exactly one of + or -, indicating the direction of the offset from GMT (Zulu) to the local time zone (or - in the special case encoded as -00:00, in which the local time zone is unknown or not asserted)
  • NN:NN represents the hours:minutes differential between GMT/Zulu and the local time zone context of this creation_time (qualified by the preceding + or - to indicate offset direction), with -00:00 encoding the special case in which time zone is unknown or not asserted (+00:00, by contrast, denotes the GMT/UTC/Zulu time zone itself)

Apart from the time zone segment of creation_time (±NN:NN, just described) and the year (YYYY) segment, all other constituent segments of creation_time named here may be rendered as 00 to indicate a lack of available data at the corresponding precision.
  • We are aware (and unconcerned) that this technically renders one particular HH:MM:SS value -- "00:00:00" -- ambiguous. Forestalling this ambiguity (by allowing select omissions of constituent sub-segments of creation_time string values as an alternative mechanism to denote missing data, or by introducing nonstandard and increasingly artificial special-case encodings like "99:99:99") was determined to be of less immediate concern than maintaining the technical advantages conferred by the (stronger) constraint of requiring a fixed-length creation_time string that remains fully conformant with (a constrained subset of) the RFC 3339 standard. The canonical C2M2 interpretation of "00:00:00" is thus explicitly defined to be "HH:MM:SS information unknown" and not "exactly midnight."
abbreviation, name and description Values which will be used, unmodified, for contextual display throughout portal and dashboard user interfaces: severely restricted, whitespace-free abbreviation (must match /[a-zA-Z0-9_]*/); terse but flexible name; abstract-length description

C2M2 Level 1 offers two ways -- project and collection -- to denote groups of related metadata entity records representing core (file/subject/biosample) experimental resources.

  • project
    • unambiguous, unique, named, most-proximate research/administrative sphere of operations that first generates each experimental resource (file/subject/biosample) record
    • conceptually rooted in -- but not necessarily mapped one-to-one from -- a corresponding hierarchy of grants, contracts or other important administrative subdivisions of primary research funding
    • project attribution is required for core resource entity types: use FK specified in file/biosample/subject entity records to encode these attributions
    • projects can be nested (via the project_in_project association table: cf. below, §"Association tables: expressing containment relationships") into a hierarchical (directed, acyclic) network, but one and only one project node in one and only one project hierarchy can be attached to each core entity record.
    • by convention, for C2M2 Level 1, one artificial project node must be created and identified as the root (topmost ancestor) node of each DCC's project hierarchy. This node represents the DCC itself; it is referenced directly (via foreign key) by the primary_dcc_contact table and serves as an anchor point for creating roll-up summaries or other aggregations of C2M2 metadata arranged according to managing DCC.
  • collection
    • contextually unconstrained: a generalization of the "dataset" concept which additionally and explicitly supports the inclusion of elements (C2M2 metadata entities) representing subjects and biosamples
    • wholly optional: Level 1 C2M2 serialization of DCC metadata need not necessarily include any collection records or attributions
    • membership of C2M2 entities in collections is encoded using the relevant association tables (cf. below, §"Association tables: expressing containment relationships")
    • used to describe the federation of any set of core resource entities (and, recursively, other collections) across inter-project boundaries (or across inter-DCC boundaries, or across any other structural boundaries used to delimit or partition areas of primary purview or provenance, or crossing no such boundaries at all)
    • unconstrained with respect to "defining entity"
      • may optionally be attributed to a (defining/generating) C2M2 project record
        • this attribution is optional and (when null) will not always even be well-defined: the power to define new collections, on an ongoing basis, will be offered to all (approved? registered?) members of the interested research community at large, without being specifically restricted to researchers or groups already operating under the auspices of a well-defined project entity in the C2M2 system.
      • this configuration is meant to facilitate data/metadata reuse and reanalysis, as well as to provide a specific and consistent anchoring structure through which authors anywhere can create (and study and cite) newly-defined groupings of C2M2 resources, independently of their original provenance associations. (FAIRness is generally increased by provisioning for consistent reference frameworks.)
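
The project constraints above (a directed, acyclic hierarchy with a single artificial root node per DCC) can be checked mechanically. The following is a minimal sketch, not part of the CFDE tooling: the function name check_project_hierarchy is hypothetical, projects are assumed to be identified by (id_namespace, local_id) tuples, and the containment pairs are assumed to have been read from project_in_project as (parent, child).

```python
from collections import defaultdict, deque

def check_project_hierarchy(projects, containment):
    """Return the root project key, or raise ValueError (hypothetical helper).

    projects:    iterable of (id_namespace, local_id) keys
    containment: iterable of (parent_key, child_key) pairs
    """
    projects = set(projects)
    children = defaultdict(set)
    indegree = {key: 0 for key in projects}
    for parent, child in containment:
        if parent not in projects or child not in projects:
            raise ValueError(f"containment row cites an unknown project: {parent} -> {child}")
        if child not in children[parent]:
            children[parent].add(child)
            indegree[child] += 1

    # The root is the only project that is not contained in any other project.
    roots = [key for key, degree in indegree.items() if degree == 0]
    if len(roots) != 1:
        raise ValueError(f"expected exactly one root project, found {len(roots)}")

    # Kahn's algorithm: if some node is never released, a cycle exists.
    remaining = dict(indegree)
    queue = deque(roots)
    visited = 0
    while queue:
        node = queue.popleft()
        visited += 1
        for child in children[node]:
            remaining[child] -= 1
            if remaining[child] == 0:
                queue.append(child)
    if visited != len(projects):
        raise ValueError("project hierarchy contains a cycle")
    return roots[0]
```

The sketch deliberately does not require each project to have a single parent, since the text above only requires the network to be directed and acyclic.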
Association tables: expressing containment relationships
  • project_in_project
  • collection_in_collection
  • file_in_collection
  • subject_in_collection
  • biosample_in_collection

These tables are used to express basic containment relationships like "this file is in this collection" or "this project is a sub-project of this other project." The record format for all of these tables specifies four fields:

  • two (an id_namespace and a local_id) encoding a foreign key representing the containing project or collection, and
  • two (another {id_namespace, local_id} pair) acting as a foreign key referencing the table describing the contained resource (or subcollection).

Please see the relevant sections of the Level 1 JSON Schema to find all table-specific field names and foreign-key constraints.
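
As an illustration of the four-field record format described above, the sketch below serializes file_in_collection rows to TSV text. The column names are hypothetical placeholders (the Level 1 JSON Schema is the authority on actual field names and ordering), and the function name containment_rows_to_tsv is likewise invented for this example.

```python
import csv
import io

# HYPOTHETICAL column names: consult the Level 1 JSON Schema for the real ones.
FIELDS = [
    "collection_id_namespace",  # foreign key, part 1: the containing collection
    "collection_local_id",      # foreign key, part 2: the containing collection
    "file_id_namespace",        # foreign key, part 1: the contained file
    "file_local_id",            # foreign key, part 2: the contained file
]

def containment_rows_to_tsv(rows):
    """Render (container_key, contained_key) pairs as TSV text.

    Each key is an (id_namespace, local_id) tuple.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(FIELDS)
    for (container_ns, container_id), (contained_ns, contained_id) in rows:
        writer.writerow([container_ns, container_id, contained_ns, contained_id])
    return buffer.getvalue()

if __name__ == "__main__":
    example = [(("example.org/ns", "col-1"), ("example.org/ns", "file-9"))]
    print(containment_rows_to_tsv(example), end="")
```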

Association tables: inter-entity linkages
  • file_describes_subject
  • file_describes_biosample
  • biosample_from_subject
  • collection_defined_by_project

As with the containment association tables, records in these tables will contain four fields, encoding two foreign keys: one (composite id_namespace+local_id) key per entity involved in the particular relationship being asserted by each record.

Table names define relationship types, and are (with the exception of collection_defined_by_project) somewhat nonspecific by design. Note in particular that relationships between core entities represented here may mask transitively-collapsed versions of more complex relationship networks in the native DCC metadataset. The specification of precise rules governing native-to-C2M2 metadata mappings (or approximations) is left to DCC serialization staff and relevant investigators; CFDE staff will be available as needed to offer feedback and guidance when navigating these issues.

Please see the relevant sections of the Level 1 JSON Schema to find all table-specific field names and foreign-key constraints.
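
Because each linkage record is just a pair of composite foreign keys, a simple pre-submission check is to confirm that both keys resolve to existing entity records. The sketch below shows one way to do that; it is an assumption-laden illustration rather than the ingestion pipeline's actual validation, and the function name find_dangling_links and the four-tuple row layout are hypothetical.

```python
def find_dangling_links(links, left_keys, right_keys):
    """Return 1-based row numbers whose foreign keys do not resolve.

    links:      iterable of (ns_a, local_id_a, ns_b, local_id_b) rows,
                e.g. from biosample_from_subject (layout assumed here)
    left_keys:  (id_namespace, local_id) keys of the first entity table
    right_keys: (id_namespace, local_id) keys of the second entity table
    """
    left_keys = set(left_keys)
    right_keys = set(right_keys)
    dangling = []
    for row_number, (ns_a, id_a, ns_b, id_b) in enumerate(links, start=1):
        if (ns_a, id_a) not in left_keys or (ns_b, id_b) not in right_keys:
            dangling.append(row_number)
    return dangling

if __name__ == "__main__":
    biosamples = [("example.org/ns", "bs-1")]
    subjects = [("example.org/ns", "subj-1")]
    rows = [
        ("example.org/ns", "bs-1", "example.org/ns", "subj-1"),
        ("example.org/ns", "bs-2", "example.org/ns", "subj-1"),  # unknown biosample
    ]
    print(find_dangling_links(rows, biosamples, subjects))
```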

Association table: taxonomy and the subject entity: the subject_role_taxonomy table

The subject_role_taxonomy "categorical association" table enables the attachment of taxonomic labels (NCBI Taxonomy Database identifiers, of the form /^NCBI:txid[0-9]+$/ and stored for reference locally in the C2M2 ncbi_taxonomy table) to C2M2 subject entities in a variety of ways, depending on subject_granularity, using subject_role values to specify the qualifying semantic or ontological context that should be applied to each taxonomic label.

  • subject_granularity: subject multiplicity specifier:
    • for each subject record, pick one of these values and include its id in the granularity field in that record.
  • subject_role: constituent relationship to intra-subject system:
    • each subject_granularity corresponds to a subset of these values, each of which can be labeled independently with NCBI Taxonomy Database IDs via subject_role_taxonomy.
  • subject_role_taxonomy: Putting it all together: this association table stores three items per record, connecting components of subject entities (subject_roles) to taxonomic assignments:
    • A (binary: { subject.id_namespace, subject.local_id }) key identifying a C2M2 subject entity record
    • An enumerated category code (the id field in this table) denoting a subject_role contextual qualifier
    • A (unitary: { }) ID denoting an NCBI Taxonomy Database entry classifying the given subject by way of the given subject_role

Please refer to the definition of subject_role_taxonomy in the Level 1 JSON Schema to find all technical details (field names and foreign-key constraints).
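
To make the three-part record structure concrete, here is a minimal sketch of a record check. The taxonomy ID pattern is the one quoted above; everything else is assumed for illustration: the granularity and role codes are invented placeholders (the real enumerations are defined by C2M2, not here), and the function name check_role_taxonomy and the four-field tuple layout are hypothetical.

```python
import re

# Pattern quoted in the text: /^NCBI:txid[0-9]+$/ (applied here with fullmatch).
TAXON_ID = re.compile(r"NCBI:txid[0-9]+")

# HYPOTHETICAL placeholder codes: the real subject_granularity and
# subject_role enumerations are defined by the C2M2 specification.
ROLES_BY_GRANULARITY = {
    "granularity:example_single": {"role:example_single"},
    "granularity:example_pair": {"role:example_first", "role:example_second"},
}

def check_role_taxonomy(record, granularity):
    """Return a list of problems found in one subject_role_taxonomy record.

    record: (subject_id_namespace, subject_local_id, role_id, taxonomy_id)
    granularity: the granularity code stored on the referenced subject
    """
    subject_ns, subject_local_id, role_id, taxonomy_id = record
    problems = []
    if not subject_ns or not subject_local_id:
        problems.append("incomplete subject key")
    if role_id not in ROLES_BY_GRANULARITY.get(granularity, set()):
        problems.append(f"role {role_id!r} is not valid for granularity {granularity!r}")
    if TAXON_ID.fullmatch(taxonomy_id) is None:
        problems.append(f"malformed NCBI Taxonomy ID: {taxonomy_id!r}")
    return problems
```

A record that passes returns an empty list, so a caller can collect problems across a whole table before reporting them.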

Controlled vocabularies and term tables
  • CVs in use at Level 1:
    • assay_type (OBI): used to describe the types of experiment that can produce files
    • anatomy (Uberon): used to specify the physiological source location in or on the subject from which a biosample was derived
    • data_type (EDAM): used to categorize the abstract information content of a file (e.g. "this is sequence data")
    • file_format (EDAM): used to denote the digital format or encoding of a file (e.g. "this is a FASTQ file")
    • ncbi_taxonomy (NCBI Taxonomy): used to link subject entity records to taxonomic labels (cf. above, §"Association table: taxonomy and the subject entity: the subject_role_taxonomy table")
  • general guidance on usage:
    • store bare CV terms (conforming to the pattern constraints specified for each CV's term set in the ER diagram, above) in the relevant entity-table fields (represented in the diagram as the sources of dotted green arrows)
    • for the moment -- with respect to deciding how to select terms -- just do the best you can by picking through the term sets provided by the given CVs
    • feel free to use more general ancestor terms if sufficiently-specific terms aren't available in a particular ontological CV
    • aggressively leave blank CV-field values for any records that wind up causing you the slightest bit of trouble.
    • #c2m2-internal-note: See the wish list (two bullets below) for initial notes on improving the engineering solutions for this topic after the demo. Too many important issues remain to be studied, argued, decided, implemented and tested for us to make CV management more than a quick and intellectually unsatisfying kludge until the next round of active model development. (Note that now that we've actually built a few C2M2 metadata instance collections, we can begin systematically comparing notes to help us all get a much better collective grip on some of the problems that will need resolution in this area, and we're going to want to hear from DCCs which haven't yet submitted metadata as well, so they can help us anticipate difficulties.)
  • CV term scanner script: auto-builds (green) CV term tables
    • executed during BDBag-preparation stage, after core TSVs have been built
    • inflates (bare) CV terms cited in core-entity table fields into corresponding CV term-usage tables
    • auto-loads and populates display-layer term-decorator data (name, description) from relevant (versioned) CV reference files
    • usage:
      • change "USER-DEFINED PARAMETERS" section to match your local directory configuration
      • make sure the prerequisite files are in the right directories
      • then just run the script without arguments
  • non-optional wish list: Everything listed in this segment must be carefully addressed and drafted in order to produce a mature policy on controlled vocabulary usage.
    • explicit version control policy for reference CVs
    • detailed plan for handling app-layer aggregations of CV-term query results to best serve users' search requests:
      • LCA computation and implicit matching of terms via shared ontological lineage
      • keyword-set association/tagging/decoration
      • synonym handling
      • etc.
    • policy specifying (or standardizing or prohibiting or ...?) a term-addition request process between CFDE and CV owners (active and ongoing between HMP and OBI, e.g.: terms are being added on request; CV managers are responsive), driven by usage needs identified by DCC clients
    • are URIs better than bare CV terms in terms of C2M2 field values?
      • what sort of URI support do CVs already provide?
      • how deeply can we leverage their own preexisting constructs without having to handle maintenance, synchrony, version, etc., issues ourselves?
      • can we establish a uniform URI policy to cover all C2M2-referenced CVs, or will we need to establish multiple policies for different CVs?
    • establish and execute some sort of survey process to create consensus on which particular CVs look like the best final selections to serve as sanctioned C2M2 reference sets (e.g. OBI vs. BAO); criteria:
      • how comprehensive is a CV's coverage of the relevant ontological space?
      • how responsive are the CV owners to change requests?
    • detailed ETL-construction usage plan: should we pre-select sub-vocabularies of sanctioned CVs to distribute to ETL generators, updating these CFDE-blessed CV subsets on an ongoing basis (as new term requirements roll in from client metadata sources (DCCs) as they try to model their respective datasets)?
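
The term-scanner behavior described in the list above (collect the bare CV terms cited in a core-entity table, then decorate each with a name and description from a reference file) can be sketched as follows. This is not the actual CFDE script: the function name build_term_table, the in-memory row format, and the example term IDs are all hypothetical, and loading the versioned CV reference files is left out.

```python
def build_term_table(entity_rows, cv_column, reference):
    """Build a term-usage table for one CV field (hypothetical helper).

    entity_rows: list of dicts, one per core-entity record
    cv_column:   name of the field holding bare CV terms (may be blank)
    reference:   dict mapping term ID -> (name, description), assumed to be
                 loaded beforehand from a versioned CV reference file
    """
    # Blank values are skipped: CV fields may legitimately be left empty.
    used_terms = sorted({row.get(cv_column) for row in entity_rows if row.get(cv_column)})
    table = []
    for term in used_terms:
        # Terms missing from the reference keep empty decorations
        # instead of being dropped.
        name, description = reference.get(term, ("", ""))
        table.append({"id": term, "name": name, "description": description})
    return table

if __name__ == "__main__":
    # Placeholder term IDs, not real CV entries.
    files = [
        {"local_id": "f1", "file_format": "EXAMPLE:0002"},
        {"local_id": "f2", "file_format": ""},
        {"local_id": "f3", "file_format": "EXAMPLE:0001"},
        {"local_id": "f4", "file_format": "EXAMPLE:0002"},
    ]
    reference = {"EXAMPLE:0001": ("example format A", "placeholder description")}
    for record in build_term_table(files, "file_format", reference):
        print(record)
```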

Level 1 metadata submission examples: Data Package JSON Schema and example TSVs

A JSON Schema document -- implementing Frictionless Data's "Data Package" container meta-specification -- defining the Level 1 TSV collection is here; an example Level-1-compliant TSV submission collection can be found here for inspection in two alternative forms: (1) a bare collection of TSV files, and (2) a single packaged BDBag archive file containing those TSVs along with some packaging/manifest metadata. (DCCs will package each C2M2 submission as one of these BDBags: we provide a valid one here for reference.)
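
For readers unfamiliar with the Data Package container format, the sketch below builds a resource descriptor of the general kind such a schema contains, using Frictionless Table Schema constructs (fields, constraints, missingValues, primaryKey, foreignKeys). It is illustrative only: the published Level 1 schema is authoritative, and the column names, the helper name association_resource, and the specific constraint choices here are assumptions.

```python
import json

def association_resource(name, container, contained):
    """Build a Data Package resource descriptor for an association table.

    All column names are HYPOTHETICAL; the published C2M2 Level 1 schema
    defines the real ones.
    """
    def key_fields(prefix):
        return [f"{prefix}_id_namespace", f"{prefix}_local_id"]

    def foreign_key(target):
        # Composite foreign key into the target entity table.
        return {
            "fields": key_fields(target),
            "reference": {"resource": target, "fields": ["id_namespace", "local_id"]},
        }

    all_fields = key_fields(container) + key_fields(contained)
    return {
        "name": name,
        "path": f"{name}.tsv",
        "dialect": {"delimiter": "\t"},  # tab-separated values
        "schema": {
            "fields": [
                {"name": field, "type": "string", "constraints": {"required": True}}
                for field in all_fields
            ],
            "missingValues": [""],
            "primaryKey": all_fields,
            "foreignKeys": [foreign_key(container), foreign_key(contained)],
        },
    }

descriptor = association_resource("file_in_collection", "collection", "file")

if __name__ == "__main__":
    print(json.dumps(descriptor, indent=2))
```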

Level 2

C2M2 Level 2 is currently being drafted: publication of a complete specification is expected by the end of 2020.

  1. New modeling concept checklist:
    • clinical visit data
    • modular experimental flow (protocol)
    • resource (entity) provenance ([data|material]_event network)
    • structured addressbook for documenting and linking organizations (common_fund_program), roles/personae and actual people to C2M2 metadata
    • protected data
    • full elaboration of scientific attributes of C2M2 entities using controlled-vocabulary metadata decorations
      • i.e., substrate data for facet-search targets
      • (e.g., Level 1's anatomy, assay_type, ncbi_taxonomy, etc.)
    • [enumerate requirements and scope for more complex modeling of scientific metadata]