The Common Fund Data Ecosystem's Crosscut Metadata Model (CFDE C2M2)¶
This document introduces the Crosscut Metadata Model (C2M2), a flexible metadata standard for describing experimental resources in biomedicine and related fields. The Common Fund Data Ecosystem group is creating a new computing infrastructure, with C2M2 as its organizing principle, to offer the health research community an unprecedented array of intersectional data tools. The C2M2 system will connect researchers with scale-powered statistical analysis methods; deep, seamless searching across experimental data generated by different projects and organizations; and new ways to aggregate and integrate experimental data from different sources to facilitate scientific replication and to drive new discoveries.
Using this new infrastructure, Common Fund data coordinating centers (DCCs) will share structured, detailed information (metadata) about their experimental resources with the research community at large, widening and deepening access to usable observational data. One immediate consequence will be a drastic simplification in the effort required to perform meta-analysis of results from multiple independent teams studying similar health-related phenomena.
DCC Metadata Submissions¶
DCCs collect and provide metadata submissions (C2M2 instances) to CFDE describing experimental resources within their purview. Each submission is a set of tab-separated value files (TSVs); precise formatting requirements for these filesets are specified by JSON Schema documents, each of which is an instance of the Data Package meta-specification published by the Frictionless Data group. The Data Package meta-specification is a toolkit for defining format and content requirements for files so that automatic validation can be performed on those files, just as a database management system stores definitions for database tables and automatically validates incoming data based on those definitions. Using this toolkit, the C2M2 JSON Schema specifications lay out foreign-key relationships between metadata fields (TSV columns), rules governing missing data, required content types for particular fields, and other similar database management constraints to define basic structural integrity for C2M2 metadata submissions. During the C2M2 ingestion process, the C2M2 software infrastructure uses these specifications to automatically validate format compliance and submission integrity, prior to loading metadata into its central database. Once loaded, metadata are used to fuel downstream services like search results, customized statistical summaries, dynamic display graphics, and asset browsing within experimental resource collections.
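The Data Package idea (a declared table spec plus automatic validation) can be sketched in a few lines of standard-library Python. This is an illustrative toy, not the CFDE validator; the column list and rules are hypothetical stand-ins for what a real C2M2 JSON Schema declares:

```python
import csv
import io

# Hypothetical, pared-down stand-in for one table definition from a
# C2M2 Data Package schema: expected columns plus non-nullable fields.
FILE_TABLE_SPEC = {
    "columns": ["id_namespace", "local_id", "filename"],
    "required": {"id_namespace", "local_id"},
}

def validate_tsv(tsv_text, spec):
    """Return a list of integrity errors (empty list means the TSV passes)."""
    errors = []
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    if reader.fieldnames != spec["columns"]:
        errors.append(f"bad header: {reader.fieldnames}")
    for line_num, row in enumerate(reader, start=2):
        for field in spec["required"]:
            if not (row.get(field) or "").strip():
                errors.append(f"line {line_num}: missing required field {field!r}")
    return errors

good = "id_namespace\tlocal_id\tfilename\nhttps://example.org/\tF001\treads.fastq\n"
bad = "id_namespace\tlocal_id\tfilename\nhttps://example.org/\t\treads.fastq\n"
assert validate_tsv(good, FILE_TABLE_SPEC) == []
assert validate_tsv(bad, FILE_TABLE_SPEC) != []
```

A real submission is validated against the published Frictionless Data Package schema rather than an ad hoc spec like this one, but the mechanism (declared structure, automatic checking before database load) is the same.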
C2M2 Levels¶
CFDE offers DCCs three alternative metadata submission formats (C2M2 Levels 0, 1 and 2), each of which is automatically interoperable with the entire C2M2 software ecosystem. Levels are tiered primarily according to increasing complexity. The general idea is that DCC resource collections can be represented quickly (and thus begin driving downstream applications quickly) using metadata encoded at lower (simpler) C2M2 levels: over time, and as feasible, DCC data managers can upgrade their C2M2 metadata submissions by expanding into higher levels.
C2M2 Levels 0, 1 and 2 are increasingly large and complex variants of the same metadata model (C2M2): each level is a defined collection of data tables (encoded as TSVs: see above on submissions). Level 0 defines just one short `file` table; Level 1 provides a larger `file` table than Level 0 (more fields), adds more tables describing other basic biomedical resource concepts including `biosample`, `subject`, and `project`, and introduces ways to express simple relationships between records in different tables; Level 2 is an ever-growing mature metadata interchange standard, customized to advanced DCC metadata needs: it contains all Level 1 information, plus a library of more detailed metadata objects and extensions to existing objects. Lower levels are strict subsets of higher ones: Level 1's `file` table contains all Level 0 `file` fields and more; Level 2's `biosample` is an expanded superset of the Level 1 `biosample`, and so on. Upgrades from one level to the next are therefore limited by design to expansion only: moving up a level will not require DCCs to make changes to any metadata already provided.
A foundational purpose of the C2M2 system is to facilitate metadata harmonization: finding ways wherever possible to represent comparable things in standard ways, without compromising meaning, context or accuracy. In addition to complexity management, C2M2 levels are also intended to roughly encapsulate groups of concepts according to increasing harmonization difficulty.
Some examples, in order of increasing harmonization difficulty:
- All DCCs have file resources, describable (at a very high level) in a standard, noncontroversial way (size + filename: Level 0).
- In addition to data files, virtually all DCCs deal in some way with biosamples and/or subjects: metadata describing basic aspects of these common concepts can be fairly (if still quite broadly) expressed by a shared model (Level 1), thereby enabling the beginnings of cross-dataset search and analysis.
- Many DCCs have at least some specialized data, unique to their own spheres of operation, which (at least for some time) will not be meaningful candidates for system-level harmonization (Level 2 includes specialized extension modeling).
Most DCCs already have some form of internal metadata model in use for their own curation operations. C2M2 integration of similar but distinct packages of important information, drawn from multiple independently-developed custom DCC metadata systems (including e.g. metadata describing people and organizations, data provenance relationships, experimental protocols, protected data, or detailed event sequences), will require ongoing, iterative, case-based design and consensus-driven decision-making, often coordinated across multiple independent research groups. Design and decision-making in such contexts will require long-term planning, testing and execution. Metadata that is difficult (or even impossible) to integrate and harmonize is thus handled as part of the ongoing evolution and expansion of Level 2. Level 1 is instead tasked with supporting relatively universal, simple and uncontroversial metadata concepts, maintaining streamlined development and deployment of important core metadata packages without forcing feasible tasks to wait on more expensive custom integration.
With the design of C2M2, we are splitting the difference between the ease of evolution inherent in a simple model and the operational power provided to downstream applications by more complicated and difficult-to-maintain extended frameworks.
Modeling and data wrangling are always difficult, even for experts. Part of the goal of the level system is to compartmentalize the C2M2 model so as to maintain flexibility -- especially during developmental phases -- in order to best accommodate mutual learning between DCCs and CFDE as the construction of this federated metadata system progresses. It is generally far more expensive and error-prone to repeatedly change a complex, over-built, inseparable, monolithic model than it is to build one gradually from a simpler core of agreement which can be relatively quickly stabilized while more specialized branches are built in parallel and transitioned into more general use.
At any given moment, participating Common Fund DCCs will span a broad range of experience and available funding, based on mission details and lifecycle phases. DCCs with advanced, operationalized metadata modeling systems of their own should not encounter arbitrary barriers to C2M2 support for more extensive relational modeling of their metadata if they want it; CFDE will maintain such support by iteratively refining Level 2 according to needs identified while working with DCCs already wielding complex metadata models. Newer or smaller DCCs, by contrast, may not have enough readily-available information to feasibly describe their experimental resources using Level 2 structures (either existing or proposed): C2M2 Level 1 thus also aims to actively support such cases by offering simpler but still well-structured metadata models covering concepts that have already been harmonized across other DCCs, lowering barriers to rapid entry into the data ecosystem and meaningful participation in downstream services.
A C2M2 topic requiring special attention is the use of identifiers.
C2M2 identifiers¶
- Two complementary identifier slots for DCC-issued records
    - `persistent_id`: persistent and resolvable ID (ideal, but optional)
    - `id_namespace`, `local_id`: 2-part key as the least-common denominator
- The optional persistent identifier is a DOI, ARK, MINID, etc.
- The 2-element composite identifier conveys fragments of a record URI
    - `local_id` bears a name from some namespace, e.g. an accession ID
    - `id_namespace` specifies which namespace, i.e. the left-hand side of a URI
    - concatenation of namespace + local yields the full record URI
    - URIs are cheap and easy: low barrier to entry, no hosting requirements
- To steer users directly to data, we require `persistent_id`!
    - Otherwise, we can steer users towards the DCC contact.
- For consistency, we repeat this scheme on several entity tables.
Not all DCCs arrive having adopted robust, persistent and resolvable identifiers; if they all did, we could simply mandate their use to identify all records. We need to support different DCC identifier maturity levels, and so we keep `persistent_id` optional.
We add a 2-part composite identifier to lower the barrier to entry: this is simply a URI that has been split into two parts. The web rules for forming URIs are very flexible: generating URIs costs essentially nothing and requires no real infrastructure. They are just rules for namespace hygiene, laying out how a DCC can claim a namespace out of the ether, ideally rooted in some other basic name it owns, such as a DNS name (website domain) or even just an email address.
(For those who don’t know, the difference between a URI and a URL is that a URI is just an identifier. It doesn’t have to address any actual web server which responds to messages, so you can produce URIs all day long and never worry about hosting costs, availability, etc.)
The core idea is that most DCCs will already have some accession ID or other locally-unique key for their assets. They can copy that value into the `local_id` part, or define some other mapping of their own choosing. Then they can either define a new `id_namespace` or pass through an applicable, existing URI prefix as the `id_namespace` component. As long as each DCC follows the web rules and uses an `id_namespace` it "owns", collisions (where two DCCs might try to use the same composite identifier) are automatically avoided.
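The concatenation rule can be sketched directly; the namespace and accession IDs below are invented for illustration:

```python
def record_uri(id_namespace: str, local_id: str) -> str:
    """Concatenate the two composite-key parts into the full record URI.

    C2M2 treats id_namespace as the left-hand side of a URI, so the
    full identifier is simply the namespace followed by the local part.
    """
    return id_namespace + local_id

# Hypothetical DCC namespace rooted in a DNS name the DCC owns:
ns = "https://dcc.example.org/samples/"
assert record_uri(ns, "ACC-0001") == "https://dcc.example.org/samples/ACC-0001"

# Two DCCs using namespaces they each "own" can never collide,
# even if their local accession IDs happen to match:
assert record_uri("https://other-dcc.example.edu/id/", "ACC-0001") != record_uri(ns, "ACC-0001")
```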
We also allow for the use of persistent, resolvable identifiers for things which are not really data. We suspect that many of the same practical benefits of these identifiers for data might apply to other entity types, even if resolution might lead to landing pages and contact info which at most guides users to documentation or other possible actions they might take in the real world or in the parallel bureaucratic world. (Attempting to directly download subjects or biosamples might trigger some unexpected and unpleasant -- although possibly comical -- downstream side effects.)
- For a DCC already issuing persistent, resolvable IDs
    - `persistent_id`: fill with canonical ID
    - `id_namespace`, `local_id`: fill with (split) copy of canonical ID
- For a DCC already issuing relatively stable URIs
    - `persistent_id`: leave blank until ready
    - `id_namespace`, `local_id`: fill with (split) copy of URI
- For a DCC already issuing local accession IDs
    - `persistent_id`: leave blank until ready
    - `local_id`: fill with local accession ID
    - `id_namespace`: choose an appropriate namespace URI-prefix
- For a DCC without local ID stability
    - Need to invent something approximating an accession ID and proceed as above
If a DCC already uses persistent identifiers such as DOIs, ARKs, or other short identifiers resolvable by some name-to-thing service, then they can just put that value into all the identifier fields:

- verbatim in the `persistent_id` field, so that consumers know a `persistent_id` is available for this record
- split and replicated into the `id_namespace` and `local_id` fields, so that the core C2M2 requirement for a record key is met
- there is no need for the DCC to issue and juggle two separate identifier formats, but the compromise C2M2 format requires the fields to be populated
If a DCC already uses URIs or URLs for entities having a one-to-one correspondence to C2M2 record concepts, they can split that URI or URL into a namespace prefix part and a final local part and use those values to fill the 2-part composite record identifier. They would probably leave `persistent_id` blank.
If a DCC only has local identifiers for such entities, they can put those in the `local_id` part and then fabricate a new `id_namespace` representing their DCC or the sub-organizational scope within which these local identifiers are indeed unique. The same `id_namespace` should be reused for all peer records whose local parts come from the same DCC naming system.
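The reverse operation, carving a 2-part composite key out of an existing URI, might look like this sketch (the split-at-last-separator convention is an assumption for illustration, not a C2M2 mandate):

```python
def split_uri(uri: str):
    """Split an existing record URI into (id_namespace, local_id).

    One reasonable convention: cut after the last '/' or '#', keeping the
    separator with the namespace so the two parts re-concatenate to the
    original URI.
    """
    cut = max(uri.rfind("/"), uri.rfind("#")) + 1
    return uri[:cut], uri[cut:]

ns, local = split_uri("https://dcc.example.org/files/F-123")
assert ns == "https://dcc.example.org/files/"
assert local == "F-123"
assert ns + local == "https://dcc.example.org/files/F-123"
```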
Stable identifier management is, unavoidably, basic "table stakes" for a DCC to produce reusable data. Archiving and indexing options like those provided by C2M2 are extremely limited in the absence of some method of persistent bookkeeping to track inventory: stable identifiers are a core requirement precisely because without them, federated views of resource metadata from multiple DCCs cannot be maintained.
Level 0¶
C2M2 Level 0 defines a minimal valid C2M2 instance. Data submissions at this level of metadata richness will be the easiest to produce, and will support the simplest available functionality implemented by downstream applications.
Level 0 submission process: overview¶
Metadata submissions at Level 0 will consist of a single TSV file describing a collection of digital files owned or managed by a DCC.

The properties listed for the Level 0 `file` entity (see below for diagram and definitions) will serve as the TSV's column headers; each TSV row will represent a single file. The Level 0 TSV itself thus represents a manifest or inventory of digital files that a DCC wants to introduce into the C2M2 metadata ecosystem.
This level encodes the most basic file metadata: its use by downstream applications will be limited to informing the least specific level of data accounting, querying and reporting.
*Level 0 model diagram*
Level 0 technical specification: properties of the `file` entity¶
Required: `id_namespace`, `local_id`, `sha256`|`md5`
| property | description |
| --- | --- |
| `id_namespace` | String identifier devised by the DCC managing this `file` (cleared by CFDE-CC to avoid clashes with any preexisting `id_namespace` values). The value of this property will be used together with `local_id` as a composite key structure formally identifying Level 0 `file` entities within the total C2M2 data space. (See C2M2 identifiers for discussion and examples.) |
| `local_id` | Unrestricted-format string identifying this `file`: can be any string as long as it uniquely identifies each `file` within the scope defined by the accompanying `id_namespace` value. (See C2M2 identifiers for discussion and examples.) |
| `persistent_id` | A persistent, resolvable URI attached to this `file`, meant to serve as a permanent address to which landing pages (which summarize metadata associated with this `file`) and other relevant annotations and functions can optionally be attached, including information enabling resolution to a network location from which the `file` can be downloaded. Actual network locations must not be embedded directly within this identifier: one level of indirection is required in order to protect `persistent_id` values from changes in network location over time as files are moved around. (See C2M2 identifiers for discussion and examples.) |
| `size_in_bytes` | The size of this `file` in bytes. This can vary (even for "copies" of the same `file`) across differences in storage hardware and operating system. CFDE does not require any particular method of byte computation: precise, reproducible file-size integrity metadata will be provided in the form of checksum data in the `sha256` and/or `md5` properties. `size_in_bytes` will instead underpin automatic reporting of approximate storage statistics across different C2M2 collections of DCC metadata. |
| `sha256` | CFDE-preferred `file` checksum string: the output of the SHA-256 cryptographic hash function after being run on this `file`. One or both of `sha256` and `md5` is required. |
| `md5` | Permitted `file` checksum string: the output of the MD5 message-digest algorithm after being run as a cryptographic hash function on this `file`. One or both of `sha256` and `md5` is required. (CFDE recommends SHA-256 where feasible, but we recognize the nontrivial overhead involved in recomputing these hash values for large collections of files, so if MD5 values have already been generated, CFDE will accept them.) |
| `filename` | A filename with no prepended PATH information. |
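Assembling one Level 0 row for a file on disk takes only the standard library. The namespace and local ID below are invented, and chunked hashing is just one reasonable way to compute the required checksum:

```python
import hashlib
import os
import tempfile

def level0_row(id_namespace: str, local_id: str, path: str) -> dict:
    """Build one minimal Level 0 file-table row for a file on disk."""
    sha256 = hashlib.sha256()
    with open(path, "rb") as handle:
        # Hash in chunks so arbitrarily large files need constant memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            sha256.update(chunk)
    return {
        "id_namespace": id_namespace,
        "local_id": local_id,
        "persistent_id": "",           # optional: blank until ready
        "size_in_bytes": os.path.getsize(path),
        "sha256": sha256.hexdigest(),  # CFDE-preferred checksum
        "md5": "",                     # one of sha256/md5 suffices
        "filename": os.path.basename(path),
    }

# Demonstrate on a throwaway file:
content = b"@read1\nACGT\n+\nIIII\n"
with tempfile.NamedTemporaryFile(suffix=".fastq", delete=False) as tmp:
    tmp.write(content)
row = level0_row("https://dcc.example.org/", "F001", tmp.name)
assert row["size_in_bytes"] == len(content)
assert row["sha256"] == hashlib.sha256(content).hexdigest()
os.unlink(tmp.name)
```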
Level 0 technical specification: JSON Schema and example TSVs¶
A JSON Schema document (implementing
Frictionless Data's
"Data Package"
container meta-specification) defining the Level 0 TSV can be found
here;
an example Level-0-compliant TSV submission collection can be found
here
(just the file.tsv
portion) and
here
(as a packaged BDBag archive).
Level 1¶
C2M2 Level 1 models basic experimental resources and associations between them. This level of metadata richness is more difficult to produce than Level 0's flat inventory of digital file assets. In return, Level 1 metadata offers users more powerful downstream tools than are available for Level 0, including

- faceted searches on a (small) set of biologically relevant features (like anatomy and taxonomy) of experimental resources like `biosample` and `subject`
- organization of summary displays using subdivisions of experimental metadata collections by `project` (grant or contract) and `collection` (any scientifically relevant grouping of resources)
- basic reporting on changes in metadata over time, tracking (for example) creation times for `file` and `biosample`
C2M2 Level 1 is designed to offer an intermediate tier of difficulty, in terms of preparing metadata submissions, between Level 0's basic digital inventory and the full intricacy of Level 2. Accordingly, we have reserved several modeling concepts -- requiring the most effort to produce and maintain -- for Level 2. The following are not modeled at Level 1:
- any and all protected data
- documentation of experimental protocols
- event-based resource generation/provenance networks
- detailed information on organizations and people governing the research being documented
- a comprehensive suite of options to model scientific attributes of experimental resources
    - a full collection of features like anatomy, taxonomy, and assay type, plus formal vocabularies to describe them
    - prerequisite to offering research users deep and detailed search possibilities
Level 1 submission process: overview¶
Build the core C2M2 entity tables (black) and the C2M2 container tables (blue) shown in the diagram below, and fill out the DCC contact sheet (grey). Once you've built the core entity tables, the green tables can be built automatically using our term-scanner script, which will collect all relevant CV terms used throughout your core entity tables and will create the corresponding green term-index tables, using data loaded from versioned, whole-CV reference documents (like OBO files).
If any tables are unpopulated (no `collection` records, for example, are required for model compliance), please create the relevant TSV files anyway, with just one tab-separated header line containing the empty table's column names. (In contrast to simply omitting the blank table file, this practice explicitly distinguishes the case in which no data is being submitted for a given table from the case in which a table has been omitted by mistake.)
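Writing such a header-only placeholder is a one-liner; the column list shown is illustrative rather than the authoritative `collection` schema:

```python
import os
import tempfile

# Illustrative column list; use the column names from the relevant
# C2M2 table schema for a real submission.
collection_columns = ["id_namespace", "local_id", "persistent_id",
                      "abbreviation", "name", "description"]

path = os.path.join(tempfile.gettempdir(), "collection.tsv")
with open(path, "w", newline="") as out:
    # One tab-separated header line, no data rows: an explicitly empty table.
    out.write("\t".join(collection_columns) + "\n")

with open(path) as check:
    header_only = check.read()
assert header_only == "\t".join(collection_columns) + "\n"
```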
Color key:
Black: C2M2 Level 1 core entities: files, biosamples and subjects
Dark red: Associative relationships between Level 1 core entities
Blue: Level 1 container entities (projects and collections) and their containment relationships
Green: Tables recording all third-party ontology or controlled-vocabulary terms used within a Level 1 submission, including extra information about UI display labels
Gold: Single-record table listing basic contact information for DCC staff managing a Level 1 submission
Yellow: Association table optionally annotating each Level 1 subject record with
- (possibly multiple) NCBI Taxonomy ID attributions
- specification (and individual annotation) of subject sub-entities based on generic roles in observational ecosystems, like "host," "pathogen," "site-specific microbiome," "basic single organism" (default), etc.
*Level 1 model diagram*
Level 1 technical specification¶
Core entities¶
- `file` revisited (superset additions: cf. below, §"Common entity fields" and also §"Controlled vocabularies and term tables")
- `biosample` introduced (also cf. below, §"Common entity fields" and §"Controlled vocabularies and term tables")
    - Level 1 models `biosample`s as abstract materials that are directly consumed by one or more analytic processes. Simple provenance relationships -- between each such `biosample` and the `subject` from which it was originally derived, as well as between each `biosample` and any `file`s analytically derived from it -- are represented using association tables, with one such table dedicated to each relationship type (cf. below, §"Association tables: inter-entity linkages").
    - Actual DCC-managed provenance metadata will sometimes (maybe always) represent more complex and detailed provenance networks: in such situations, chains of "`this` produced `that`" relationships too complex to model at Level 1 will need to be transitively collapsed. As an example: let's say a research team collects a cheek-swab sample from a hospital patient; subjects that swab sample to several successive preparatory treatments like centrifugation, chemical ribosomal-RNA depletion and targeted amplification; then runs the final fully-processed remnant material through a sequencing machine, generating a FASTQ sequence file as the output of the sequencing process. In physical terms our team will have created a series of distinct material samples, connected one to another by (directed) "`X` `derived_from` `Y`" relationships, represented as a (possibly branching) graph path (in fully general terms, a directed acyclic graph) running from a starting node set (here, our original cheek-swab sample) through intermediate nodes (one for each coherent material product of each individual preparatory process) to some terminal node set (in our case, the final-stage, immediately-pre-sequencer library preparation material). C2M2 Level 2 will offer metadata structures to model this entire process in full detail, including representational support for all intermediate `biosample`s, and for the various preparatory processes involved. For the purposes envisioned to be served by Level 1 C2M2 metadata, on the other hand, only `subject` <-> `some_monolithic_stuff` <-> `(FASTQ) file` can and should be explicitly represented.
        - The simplifications here are partially necessitated by the fact that event modeling has been deliberately deferred to C2M2 Level 2: as a result, the notion of a well-defined "chain of provenance" is not modeled at this C2M2 Level. (More concretely: Level 1 does not represent inter-`biosample` relationships.)
        - The modeling of details describing experimental processes has also been assigned to Level 2.
        - With both of these (more complex) aspects of experimental metadata masked at C2M2 Level 1, the most appropriate granularity at which a Level 1 `biosample` entity should be modeled is as an abstract "material phase" (possibly masking what is in reality a chain of multiple distinct materials) that enables an analytic (or observational or other scientific) process (which originates at a `subject`) to move forward and ultimately produce one or more `file`s.
    - In practice, a Level 1 C2M2 instance builder facing such a situation might reasonably create one record for the originating `subject`; create one `biosample` entity record; create a `file` record for the FASTQ file produced by the sequencing process; and hook up `subject` <-> `biosample` and `biosample` <-> `file` relationships via the corresponding association tables (cf. below, §"Association tables: inter-entity linkages").
        - In terms of deciding (in a well-defined way) specifically which native DCC metadata should be attached to this Level 1 `biosample` record, one might for example choose to import metadata (IDs, etc.) describing the final pre-sequencer material. The creation of specific rules governing maps from native DCC data to (simplified, abstracted) Level 1 entity records is of necessity left up to the best judgment of the serialization staff creating each DCC's Level 1 C2M2 ETL instance; we recommend consistency, but beyond that, custom solutions will have to be developed to handle different data sources. Real-life examples of solution configurations will be published (as they are collected) to help inform decisionmaking, and CFDE staff will be available as needed to help create mappings between the native details of DCC sample metadata and the approximation that is the C2M2 Level 1 `biosample` entity.
        - Note in particular that this example doesn't preclude attaching multiple `biosample`s to a single originating `subject`; nor does it preclude modeling a single `biosample` that produces multiple `file`s.
        - Note also that the actual end-stage material prior to the production of a `file` might not always prove to be the most appropriate metadata source from which to populate a corresponding `biosample` entity. Let's say a pre-sequencing library preparation material `M` is divided in two to produce derivative materials `M1` and `M2`, with `M1` and `M2` then amplified separately and sequenced under separate conditions producing `file`s `M1.fastq` and `M2.fastq`. In such a case -- depending on experimental context -- the final separation and amplification processes producing `M1` and `M2` might reasonably be ignored for the purposes of Level 1 modeling, instead attaching a single (slightly upstream) `biosample` entity -- based on metadata describing `M` -- to both `M1.fastq` and `M2.fastq`. As above, final decisions regarding detailed rules mapping native DCC data to Level 1 entities are necessarily left to DCC-associated investigators and serialization engineers; CFDE staff will be available as needed to offer feedback and guidance when navigating mapping issues.
- `subject` introduced (also cf. below, §"Common entity fields" and §"Taxonomy and the `subject` entity")
    - The Level 1 `subject` entity is a generic container meant to represent any biological entity from which a Level 1 `biosample` can be generated (the notion of `biosample`s being generated by other `biosample`s is more appropriately modeled at C2M2 Level 2: cf. §"`biosample` introduced", immediately above)
    - Alongside shared metadata fields (cf. below, §"Common entity fields") and inter-entity associations (cf. below, §"Association tables: inter-entity linkages"), C2M2 Level 1 models two additional details specific to `subject` entities:
        - internal structural configuration (referred to as `subject_granularity` and included in each `subject` record as one of an enumerated list of categorical value codes for concepts like, e.g., "single organism," "microbiome," "cell line") -- a reference list of granularity terms (with descriptions) is given here
        - taxonomic assignments attached to subcomponents ("roles," another ontological enumeration listed here for reference) of `subject` entities, e.g. "cell line ancestor -> NCBI:txid9606" or "host (of host-pathogen symbiont system) -> NCBI:txid10090": this is accomplished via the `subject_role_taxonomy` categorical association table (cf. below, §"Association table: taxonomy and the `subject` entity: the `subject_role_taxonomy` table")
    - all other `subject`-specific metadata -- including any protected data -- is deferred by design to Level 2
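The transitive collapse described above can be sketched with a toy provenance graph. All IDs are invented, and the two output tuples are labeled with plausible association-table names used here only for illustration:

```python
# Invented toy provenance graph: child -> parent ("derived_from") edges,
# assuming (for this sketch) each node has a single parent.
derived_from = {
    "swab": "patient",              # swab sample derived from subject
    "depleted": "swab",             # intermediate preparatory materials...
    "library_prep": "depleted",
    "run1.fastq": "library_prep",   # file derived from final material
}
subjects = {"patient"}

def collapse(file_id):
    """Collapse a derived_from chain to (subject, representative biosample, file)."""
    chain = [file_id]
    while chain[-1] in derived_from:
        chain.append(derived_from[chain[-1]])
    subject = chain[-1]
    assert subject in subjects, "chain must terminate at a subject"
    # One convention (a choice, not a mandate): represent the whole
    # "material phase" by the final pre-file material in the chain.
    biosample = chain[1]
    return subject, biosample, file_id

subj, bio, f = collapse("run1.fastq")
# Level 1 keeps only these two association rows:
biosample_from_subject_row = (bio, subj)
file_describes_biosample_row = (f, bio)
assert (subj, bio, f) == ("patient", "library_prep", "run1.fastq")
```

The intermediate materials (`swab`, `depleted`) vanish from the Level 1 view; Level 2 structures would retain them.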
Common entity fields¶
The following properties all have the same meaning and function across
the various entities they describe (file
, biosample
, project
, etc.).
| property | description |
| --- | --- |
| `id_namespace` | String identifier devised by the DCC managing this entity (cleared by CFDE-CC to avoid clashes with any preexisting `id_namespace` values). The value of this property will be used together with `local_id` as a composite key structure formally identifying Level 1 C2M2 entities within the total C2M2 data space. (See C2M2 identifiers for discussion and examples.) |
| `local_id` | Unrestricted-format string identifying this entity: can be any string as long as it uniquely identifies each entity within the scope defined by the accompanying `id_namespace` value. (See C2M2 identifiers for discussion and examples.) |
| `persistent_id` | A persistent, resolvable URI attached to this entity, meant to serve as a permanent address to which landing pages (which summarize metadata associated with this entity) and other relevant annotations and functions can optionally be attached, including information enabling resolution to a network location from which the entity can be viewed, downloaded, or otherwise directly investigated. Actual network locations must not be embedded directly within this identifier: one level of indirection is required in order to protect `persistent_id` values from changes in network location over time as entity data is moved around. (See C2M2 identifiers for discussion and examples.) |
| `creation_time` | An ISO 8601 / RFC 3339 (subset)-compliant timestamp documenting this entity's creation time (or, in the case of a `subject` entity, the time at which the `subject` was first documented by the project under which the `subject` was first observed): `YYYY-MM-DDTHH:MM:SS±NN:NN`. Apart from the time zone segment of `creation_time` (`±NN:NN`, just described) and the year (`YYYY`) segment, all other constituent segments of `creation_time` may be rendered as `00` to indicate a lack of available data at the corresponding precision. |
| `abbreviation`, `name` and `description` | Values which will be used, unmodified, for contextual display throughout portal and dashboard user interfaces: severely restricted, whitespace-free `abbreviation` (must match `/[a-zA-Z0-9_]*/`); terse but flexible `name`; abstract-length `description` |
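The `creation_time` shape (though not calendar validity) can be checked with a simple pattern; this deliberately mirrors the `±NN:NN` zone form given above rather than full RFC 3339, which also permits `Z`:

```python
import re

# Shape-only check for the creation_time format described above: segments
# other than the year and time zone may be zeroed to signal missing precision.
CREATION_TIME = re.compile(
    r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}[+-]\d{2}:\d{2}$"
)

assert CREATION_TIME.match("2019-06-01T14:30:00-05:00")
assert CREATION_TIME.match("2019-00-00T00:00:00+00:00")  # only the year is known
assert not CREATION_TIME.match("2019-06-01 14:30:00")    # missing T and time zone
```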
Containers¶
C2M2 Level 1 offers two ways -- project
and collection
-- to denote groups of
related metadata entity records representing core (file
/subject
/biosample
)
experimental resources.
`project`

- unambiguous, unique, named, most-proximate research/administrative sphere of operations that first generates each experimental resource (`file`/`subject`/`biosample`) record
- conceptually rooted in -- but not necessarily mapped one-to-one from -- a corresponding hierarchy of grants, contracts or other important administrative subdivisions of primary research funding
- `project` attribution is required for core resource entity types: use the FK specified in `file`/`biosample`/`subject` entity records to encode these attributions
- `project`s can be nested (via the `project_in_project` association table: cf. below, §"Association tables: expressing containment relationships") into a hierarchical (directed, acyclic) network, but one and only one `project` node in one and only one `project` hierarchy can be attached to each core entity record.
- by convention, for C2M2 Level 1, one artificial `project` node must be created and identified as the root (topmost ancestor) node of each DCC's `project` hierarchy: this node will represent the DCC itself: it is referenced directly (via foreign key) by the `primary_dcc_contact` table, and serves as an anchor point for creating roll-up summaries or other aggregations of C2M2 metadata arranged according to managing DCC.
- `collection` - contextually unconstrained: a generalization of the "dataset" concept which additionally and explicitly supports the inclusion of elements (C2M2 metadata entities) representing `subject`s and `biosample`s
    - wholly optional: a Level 1 C2M2 serialization of DCC metadata need not necessarily include any `collection` records or attributions
    - membership of C2M2 entities in `collection`s is encoded using the relevant association tables (cf. below, §"Association tables: expressing containment relationships")
    - used to describe the federation of any set of core resource entities (and, recursively, other `collection`s) across inter-`project` boundaries (or across inter-DCC boundaries, or across any other structural boundaries used to delimit or partition areas of primary purview or provenance, or crossing no such boundaries at all)
    - unconstrained with respect to "defining entity"
        - may optionally be attributed to a (defining/generating) C2M2 `project` record
            - this attribution is optional and (when null) will not always even be well-defined: the power to define new `collection`s, on an ongoing basis, will be offered to all (approved? registered?) members of the interested research community at large, without being specifically restricted to researchers or groups already operating under the auspices of a well-defined `project` entity in the C2M2 system.
        - this configuration is meant to facilitate data/metadata reuse and reanalysis, as well as to provide a specific and consistent anchoring structure through which authors anywhere can create (and study and cite) newly-defined groupings of C2M2 resources, independently of their original provenance associations. (FAIRness is generally increased by provisioning for consistent reference frameworks.)
Association tables: expressing containment relationships¶
- `project_in_project`
- `collection_in_collection`
- `file_in_collection`
- `subject_in_collection`
- `biosample_in_collection`
These tables are used to express basic containment relationships like "this `file` is in this `collection`" or "this `project` is a sub-project of this other `project`." The record format for all of these tables specifies four fields:

- two (an `id_namespace` and a `local_id`) encoding a foreign key representing the containing `project` or `collection`, and
- two (another { `id_namespace`, `local_id` } pair) acting as a foreign key referencing the table describing the contained resource (or subcollection).
Please see the relevant sections of the Level 1 JSON Schema to find all table-specific field names and foreign-key constraints.
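As an illustration of the four-field record format, the sketch below writes one `file_in_collection` association row as tab-separated values. The header names follow the two-composite-key description above but are assumptions for illustration (the Level 1 JSON Schema is authoritative), and the identifier values are hypothetical:

```python
import csv
import io

# Illustrative column names: two { id_namespace, local_id } composite keys,
# one for the contained file and one for the containing collection.
HEADER = [
    "file_id_namespace", "file_local_id",
    "collection_id_namespace", "collection_local_id",
]

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(HEADER)
# Hypothetical identifiers: place file FILE.0001 inside collection COLL.0001.
writer.writerow([
    "cfde_id_namespace:example_dcc", "FILE.0001",
    "cfde_id_namespace:example_dcc", "COLL.0001",
])

tsv_text = buffer.getvalue()  # one header row plus one association record
```

The other containment tables differ only in which entity tables the two composite keys reference.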
Association tables: inter-entity linkages¶
- `file_describes_subject`
- `file_describes_biosample`
- `biosample_from_subject`
- `collection_defined_by_project`
As with the containment association tables, records in these tables will contain four fields, encoding two foreign keys: one (composite `id_namespace` + `local_id`) key per entity involved in the particular relationship being asserted by each record. Table names define relationship types, and are (with the exception of `collection_defined_by_project`) somewhat nonspecific by design.
Note in particular that relationships between core entities represented
here may mask transitively-collapsed versions of more complex
relationship networks in the native DCC metadataset. The specification
of precise rules governing native-to-C2M2 metadata mappings (or
approximations) is left to DCC serialization staff and relevant
investigators; CFDE staff will be available as needed to offer
feedback and guidance when navigating these issues.
Please see the relevant sections of the Level 1 JSON Schema to find all table-specific field names and foreign-key constraints.
Association table: taxonomy and the `subject` entity: the `subject_role_taxonomy` table¶
The `subject_role_taxonomy` "categorical association" table enables the attachment of taxonomic labels (NCBI Taxonomy Database identifiers, of the form `/^NCBI:txid[0-9]+$/`, stored for reference locally in the C2M2 `ncbi_taxonomy` table) to C2M2 `subject` entities in a variety of ways, depending on `subject_granularity`, using `subject_role` values to specify the qualifying semantic or ontological context that should be applied to each taxonomic label.
- `subject_granularity`: `subject` multiplicity specifier:
    - for each `subject` record, pick one of these values and include its `id` in the `granularity` field in that record.
- `subject_role`: constituent relationship to intra-`subject` system:
    - each `subject_granularity` corresponds to a subset of these values, each of which can be labeled independently with NCBI Taxonomy Database IDs via `subject_role_taxonomy`.
- `subject_role_taxonomy`: putting it all together: this association table stores three items per record, connecting components of `subject` entities (`subject_role`s) to taxonomic assignments:
    - A (binary: `{ subject.id_namespace, subject.local_id }`) key identifying a C2M2 `subject` entity record
    - An enumerated category code (the `id` field in this table) denoting a `subject_role` contextual qualifier
    - A (unitary: `{ ncbi_taxonomy.id }`) ID denoting an NCBI Taxonomy Database entry classifying the given `subject` by way of the given `subject_role`
Please refer to the definition of `subject_role_taxonomy` in the Level 1 JSON Schema to find all technical details (field names and foreign-key constraints).
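The three-item record described above can be sketched as a plain dictionary. This is an illustrative mock-up, not the schema-defined layout: the field names and the `subject_role` category value are assumptions, while the taxonomic-label pattern is the one documented above:

```python
import re

# Pattern documented for NCBI Taxonomy Database identifiers.
NCBI_TAXON_RE = re.compile(r"^NCBI:txid[0-9]+$")

# One subject_role_taxonomy record: a composite subject key, a subject_role
# category code, and an NCBI Taxonomy label. Names and values are hypothetical
# except the taxon ID format (NCBI:txid9606 is Homo sapiens).
record = {
    "subject_id_namespace": "cfde_id_namespace:example_dcc",  # hypothetical
    "subject_local_id": "SUBJECT.0001",                       # hypothetical
    "role_id": "cfde_subject_role:0",                         # hypothetical code
    "taxonomy_id": "NCBI:txid9606",
}

taxon_ok = bool(NCBI_TAXON_RE.match(record["taxonomy_id"]))
```

The composite `{ id_namespace, local_id }` pair mirrors the foreign-key convention used by the other association tables.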
Controlled vocabularies and term tables¶
- CVs in use at Level 1:
    - `assay_type` (OBI): used to describe the types of experiment that can produce `file`s
    - `anatomy` (Uberon): used to specify the physiological source location in or on the `subject` from which a `biosample` was derived
    - `data_type` (EDAM): used to categorize the abstract information content of a `file` (e.g. "this is sequence data")
    - `file_format` (EDAM): used to denote the digital format or encoding of a `file` (e.g. "this is a FASTQ file")
    - `ncbi_taxonomy` (NCBI Taxonomy): used to link `subject` entity records to taxonomic labels (cf. above, §"Association table: taxonomy and the `subject` entity: the `subject_role_taxonomy` table")
- general guidance on usage:
- store bare CV terms (conforming to the pattern constraints specified for each CV's term set in the ER diagram, above) in the relevant entity-table fields (represented in the diagram as the sources of dotted green arrows)
- for the moment -- with respect to deciding how to select terms -- just do the best you can by picking through the term sets provided by the given CVs
- feel free to use more general ancestor terms if sufficiently-specific terms aren't available in a particular ontological CV
- aggressively leave blank CV-field values for any records that wind up causing you the slightest bit of trouble.
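The "store bare terms, leave troublesome values blank" guidance can be sketched as a small filter. The per-CV patterns below are assumptions for illustration (the authoritative pattern constraints live in the ER diagram / JSON Schema, not here):

```python
import re

# Assumed pattern constraints, one per CV-backed field; illustrative only.
CV_PATTERNS = {
    "assay_type": re.compile(r"^OBI:[0-9]+$"),        # assumed OBI term shape
    "ncbi_taxonomy": re.compile(r"^NCBI:txid[0-9]+$"),  # documented above
}

def clean_cv_value(field: str, value: str) -> str:
    """Keep a bare CV term only if it matches the field's pattern; else blank it."""
    pattern = CV_PATTERNS.get(field)
    if pattern and pattern.match(value):
        return value
    return ""  # per the guidance: aggressively leave troublesome values blank

kept = clean_cv_value("ncbi_taxonomy", "NCBI:txid9606")     # well-formed bare term
blanked = clean_cv_value("assay_type", "sequencing assay")  # free text, not a term
```

A real ETL pass would apply such a filter per-column while writing the core entity TSVs.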
#c2m2-internal-note: See the wish list (two bullets below) for initial notes on improving the engineering solutions for this topic after the demo. Too many important issues remain to be studied, argued, decided, implemented and tested for us to make CV management more than a quick and intellectually unsatisfying kludge until the next round of active model development. (Note that now that we've actually built a few C2M2 metadata instance collections, we can begin systematically comparing notes to help us all get a much better collective grip on some of the problems that will need resolution in this area, and we're going to want to hear from DCCs which haven't yet submitted metadata as well, so they can help us anticipate difficulties.)
- CV term scanner script: auto-builds (green) CV term tables
    - executed during BDBag-preparation stage, after core TSVs have been built
    - inflates (bare) CV terms cited in core-entity table fields into corresponding CV term-usage tables
    - auto-loads and populates display-layer term-decorator data (name, description) from relevant (versioned) CV reference files
    - usage:
        - change the "USER-DEFINED PARAMETERS" section to match your local directory configuration
        - make sure the prerequisite files are in the right directories
        - then just run the script without arguments
- non-optional wish list: Everything listed in this segment must be
carefully addressed and drafted in order to produce a mature policy on controlled vocabulary usage.
- explicit version control policy for reference CVs
- detailed plan for handling app-layer aggregations of CV-term query
results to best serve users' search requests:
- LCA computation and implicit matching of terms via shared ontological lineage
- keyword-set association/tagging/decoration
- synonym handling
- etc.
- policy specifying (or standardizing or prohibiting or ...?) a term-addition request process between CFDE and CV owners (active and ongoing between HMP and OBI, e.g.: terms are being added on request; CV managers are responsive), driven by usage needs identified by DCC clients
- are URIs better than bare CV terms in terms of C2M2 field values?
- what sort of URI support do CVs already provide?
- how deeply can we leverage their own preexisting constructs without having to handle maintenance, synchrony, version, etc., issues ourselves?
- can we establish a uniform URI policy to cover all C2M2-referenced CVs, or will we need to establish multiple policies for different CVs?
- establish and execute some sort of survey process to create
consensus on which particular CVs look like the best final selections
to serve as sanctioned C2M2 reference sets (e.g. OBI vs. BAO); criteria:
- how comprehensive is a CV's coverage of the relevant ontological space?
- how responsive are the CV owners to change requests?
- detailed ETL-construction usage plan: should we pre-select sub-vocabularies of sanctioned CVs to distribute to ETL generators, updating these CFDE-blessed CV subsets on an ongoing basis (as new term requirements roll in from client metadata sources (DCCs) as they try to model their respective datasets)?
Level 1 metadata submission examples: Data Package JSON Schema and example TSVs¶
A JSON Schema document -- implementing Frictionless Data's "Data Package" container meta-specification -- defining the Level 1 TSV collection is here; an example Level-1-compliant TSV submission collection can be found here for inspection in two alternative forms: (1) a bare collection of TSV files, and (2) a single packaged BDBag archive file containing those TSVs along with some packaging/manifest metadata. (DCCs will package each C2M2 submission as one of these BDBags: we provide a valid one here for reference.)
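Actual submissions should be validated with Frictionless Data tooling against the published Level 1 schema; as a lightweight illustration of what a Data Package descriptor buys you, the stdlib-only sketch below checks that a TSV's header row matches the field list declared for its resource. The descriptor and TSV content are toy stand-ins, not C2M2 tables:

```python
import csv
import io
import json

# Toy Data Package descriptor: one resource with a three-field table schema.
descriptor = json.loads("""
{
  "resources": [
    {
      "name": "collection",
      "path": "collection.tsv",
      "schema": {
        "fields": [
          {"name": "id_namespace", "type": "string"},
          {"name": "local_id", "type": "string"},
          {"name": "name", "type": "string"}
        ]
      }
    }
  ]
}
""")

# Toy TSV content standing in for the file named by the resource's "path".
tsv_text = "id_namespace\tlocal_id\tname\nexample:ns\tCOLL.0001\tDemo collection\n"

def header_matches(resource: dict, tsv: str) -> bool:
    """True if the TSV's header row equals the schema's declared field names."""
    expected = [field["name"] for field in resource["schema"]["fields"]]
    header = next(csv.reader(io.StringIO(tsv), delimiter="\t"))
    return header == expected

ok = header_matches(descriptor["resources"][0], tsv_text)
```

Full validation (types, foreign keys, missing-value rules) is what the real JSON Schema / Data Package machinery adds on top of this kind of structural check.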
Level 2¶
C2M2 Level 2 is currently being drafted: publication of a complete specification is expected by the end of 2020.
- New modeling concept checklist:
    - clinical visit data
    - modular experimental flow (`protocol`)
    - resource (entity) provenance (`[data|material]_event` network)
    - structured addressbook for documenting and linking organizations (`common_fund_program`), roles/personae and actual people to C2M2 metadata
    - protected data
    - full elaboration of scientific attributes of C2M2 entities using controlled-vocabulary metadata decorations
        - i.e., substrate data for facet-search targets
        - e.g., Level 1's `anatomy`, `assay_type`, `ncbi_taxonomy`, etc.
    - [enumerate requirements and scope for more complex modeling of scientific metadata]