Acute to Chronic Pain Signatures program. The most comprehensive study to date to investigate the connections of peripheral biology, brain, psychological, and bio-behavioral risk factors. Composed of a consortium of organizations and scientists throughout the U.S. and Canada, A2CPS is part of the multi-pronged NIH HEAL (Helping to End Addiction Long-Term) initiative, an aggressive effort to speed scientific solutions to stem the national opioid public health crisis.
Application Programming Interface. Allows developers to manipulate (query, update) remote data sources through specific protocols or specific standards for communication (e.g., REST, SOAP). An important element of the ecosystem will be the standardization and publishing of an API that can be used by data consumers to retrieve the inventories, the data asset specification, and additional metadata associated with the assets. This will allow consumers of these inventories to programmatically interrogate the federated system for information that may be relevant to a consuming service.
American Society of Human Genetics. They hold an annual conference that is the largest human genetics and genomics meeting and exposition in the world.
sample or a file, which are also known as a material asset or a digital asset, respectively.
an asset inventory is a collection of digital assets distributed by the DCCs through a portal.
A two-dimensional matrix encoding a specified associative relationship within groups of two or more entities. The columns of an association table encode two or more entity identifiers, which reference entity tables as foreign keys. Each row lists the IDs of two or more entities, asserting that those entities are related to one another in the context of the associative relationship attached to the table.
A list of groups of two or more entities, where all the entities in each listed group are understood to be mutually related to one another according to some specified relationship.
Example: 20 martial artists in team A are picked to pair off with 20 people in team B for one-on-one sparring. The associative relationship here might be called "is presently sparring with," and could be represented as a list like this: (team_A_person_12, team_B_person_4) (team_A_person_3, team_B_person_17)
No ordering or hierarchy among related entities is implied by default. In the example above, this means that it doesn't signify anything if I write the person from team A first or the one from team B first in each pair: an associative relationship among any group of entities doesn't in general depend on the order in which they're listed, so "A is related to B" always means "B is also related to A."
A semantic hierarchy, direction, or other structured ordering may be overlaid on entities related by an associative relationship by design convention.
For example: the "file X describes biosample Y" associative relationship in C2M2 Level 1 (encoded by the `file_describes_biosample` association table) clearly implies a more structured relationship: it's not meaningful to say that "'file X describes biosample Y' implies that 'biosample Y describes file X'".
Regardless of whether a particular associative relationship comes with an implied "subject/predicate" relationship or (by contrast) simply represents "A and B are in the same bucket", it's important to note that in any associative relationship, "A is related to B" always implies that "B is related to A" -- translation to English is generally the confusing factor, here, e.g. "file X describes biosample Y" comes guaranteed with the converse "biosample Y is described by file X".
Multiplicity is not constrained: extending our C2M2 example above, one biosample entity can be "described by" multiple file entities (expressed by writing one record for each associated biosample/file pair in the `file_describes_biosample` association table, repeating the biosample entity identifier as needed), and similarly one file entity can "describe" multiple biosample entities (encoded by creating multiple records in the `file_describes_biosample` association table, one for each of the biosamples described by the one file).
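The many-to-many behavior described above can be sketched in a few lines of Python. This is only an illustration: the table name `file_describes_biosample` comes from the text, while the identifiers (`file_1`, `biosample_A`, and so on) and the two helper functions are hypothetical and not part of any C2M2 specification.

```python
# Each record is one (file ID, biosample ID) pair: one row of the association table.
# All identifiers below are made up for illustration.
file_describes_biosample = [
    ("file_1", "biosample_A"),
    ("file_2", "biosample_A"),  # biosample_A is described by two files
    ("file_2", "biosample_B"),  # file_2 describes two biosamples
]


def files_describing(biosample_id, table):
    """Return the sorted IDs of all files linked to the given biosample."""
    return sorted(f for f, b in table if b == biosample_id)


def biosamples_described_by(file_id, table):
    """Return the sorted IDs of all biosamples linked to the given file."""
    return sorted(b for f, b in table if f == file_id)
```

Because every record stores both identifiers, the same table answers the question in either direction, which is the sense in which "file X describes biosample Y" always comes with "biosample Y is described by file X".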
Bioinformatics Center. The BIC is responsible for the task of harmonizing the large quantity and diversity of data and metadata being generated by the consortium and performing meaningful integrative analyses across these omics data types.
a physical object composed of biological material
material asset, collected from an organism, a cell culture, or a material containing organisms, such as an environmental material. syn: sample
C2M2 Creation Flow
the process which generated the C2M2 model.
C2M2 Extractor Flow
C2M2 Ingestion Process
the process through which metadata is brought in, matched, and merged into the CFDE.
a collection of metadata structurally conformant with one of the C2M2 level specifications (0, 1, or 2).
C2M2 Metadata Specification
Concentric, canonical subsets of C2M2 that are benchmarked at increasing levels of model complexity and detail, wherein each successive modeling level is a value-added superset of all of the metadata encompassed by the previous (less complex) level
C2M2 Richness Levels
Common Fund Data Ecosystem. A data ecosystem is a collection of data silos or commons joined together by a set of standards and services that facilitate findability, accessibility, reuse, and interoperability of datasets between silos/commons. A data ecosystem is focused on enabling multi-way connectivity between datasets, in a horizontal fashion, rather than deeper vertical analysis within each dataset. The goal of an ecosystem is to enable use cases between data silos, not within.
CFDE Asset Manifest
a collection of assets described by the CFDE Asset Specification. The ecosystem will support the concept of a manifest that describes a collection of files. The manifests enable bundling lists of CFDE data assets into a machine-readable file using a common format. Manifests will also be used to publish the complete inventories of data from each DCC, and will enable uniform collection of asset metadata to support indexing of the assets in the CFDE portal.
a type of Metadata Manifest describing an asset inventory
CFDE Asset Specification
defines the set of attributes used to characterize an Asset. The specification simplifies the discovery of assets hosted at the DCCs with a minimal set of descriptors for each of these files. The types of files that are referenced (e.g., genomic sequence, metagenomic, RNA-Seq, physiological and metabolic data) are flexible, and their descriptors contain a small number of essential elements such as a GUID, originating institution (e.g., Broad Institute), assay type (e.g., whole genome/exome, transcriptome, epigenome), file type (e.g., fastq, alignment, vcf, counts), and tissue source and species name for the sample.
specifies a minimal set of attributes (metadata) related to an asset.
CFDE Core ER Diagram
CFDE Core Entity Relationship diagram corresponds to a Level 0 representation richness and is available from the following GitHub repository
CFDE Core Table Schema
CFDE Data Dashboard
CFDE Data Dashboard is an interface that monitors DCC data upload to the cloud and usage statistics to support cross-DCC search and ecosystem integration.
CFDE Entity-Relation Diagram
CFDE Core Entity Relationship diagram corresponds to a Level 2 representation richness and is available from the following GitHub repository
CFDE Query Portal
a portal enabling users and administrators to search all the federated data assets at each Common Fund Program. The CFDE portal increases a user's ability to find these important resources, as well as mix and match sets of data from each site to use in subsequent analysis. Administrators and Program Officers at Common Fund can use a single website to view the growth of data from their program over time, review objective FAIR metrics for these assets, understand download statistics and geographic distribution, and view the degree of harmonization of these data in comparison to other sites.
Common Fund Programs
the intent of the CF programs is to provide a strategic and nimble approach to address key roadblocks in biomedical research that impede basic scientific discovery and its translation into improved human health. In addition, these programs capitalize on emerging opportunities to catalyze the rate of progress across multiple biomedical fields. The CF programs include:
4D Nucleome program's goal is to study the three-dimensional organization of the nucleus in space and time (the 4th dimension).
Genotype-Tissue Expression project is an ongoing effort to build a comprehensive public resource to study tissue-specific gene expression and regulation.
Human Microbiome Project's mission is to generate resources to facilitate characterization of the human microbiota to further our understanding of how the microbiome impacts human health and disease.
Kids First Pediatric Research Program Data Resource Center enables researchers, clinicians, and patients to work together to accelerate research and promote new discoveries for children affected with cancer and structural birth defects.
Library of Integrated Network-based Cellular Signatures project is based on the premise that disrupting any one of the many steps of a given biological process will cause related changes in the molecular and cellular characteristics, behavior, and/or function of the cell – the observable composite of which is known as the cellular phenotype. Observing how and when a cell’s phenotype is altered by specific stressors can provide clues about the underlying mechanisms involved in perturbation and, ultimately, disease.
Metabolomics program was developed with the goal of increasing national capacity in metabolomics by supporting the development of next generation technologies, providing training and mentoring opportunities, increasing the inventory and availability of high quality reference standards, and promoting data sharing and collaboration.
Stimulating Peripheral Activity to Relieve Conditions program seeks to accelerate development of therapeutic devices that modulate electrical activity in nerves to improve organ function.
a combination of two or more columns in a table that can be used to uniquely identify each row in the table. Uniqueness is only guaranteed when the columns are combined; when taken individually the columns do not guarantee uniqueness.
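As a minimal sketch of that uniqueness rule, the Python snippet below checks whether a chosen set of columns identifies every row. The column names `id_namespace` and `name`, the sample rows, and the helper `is_unique` are assumptions made up for this example; only the general idea of combining columns comes from the definition above.

```python
# Hypothetical rows: neither column is unique on its own, but the pair is.
rows = [
    {"id_namespace": "dcc_one", "local_id": "0001", "name": "first"},
    {"id_namespace": "dcc_one", "local_id": "0002", "name": "second"},
    {"id_namespace": "dcc_two", "local_id": "0001", "name": "third"},
]


def is_unique(rows, columns):
    """Return True if the combined values of `columns` differ for every row."""
    keys = [tuple(row[col] for col in columns) for row in rows]
    return len(keys) == len(set(keys))
```

Here `is_unique(rows, ["id_namespace"])` and `is_unique(rows, ["local_id"])` are both false, while `is_unique(rows, ["id_namespace", "local_id"])` is true, so only the two columns together can serve as a key.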
an organized arrangement of words and phrases used to index content and/or to retrieve content through browsing or searching. It typically includes preferred and variant terms and has a defined scope or describes a specific domain. For example, the DCIC curates an internal 4DN controlled vocabulary to provide definitions for emerging technologies and techniques and for metadata terms, and to capture important data features not defined by previous ontologies.
Data Citation Rubric
a process, such as a data acquisition or a data transformation, resulting in the creation of a file (digital asset).
a region of storage that contains a value or group of values. Each value can be accessed using its identifier or a more complex expression that refers to the object. In addition, each object has a unique data type.
collection of data, published or curated by a single agent, and available for access or download in one or more formats.
Data Article Tag Suite is a data model for representing key information about datasets with an emphasis on data discovery and data findability, which has inspired the creation of the NIH-C2M2 model. The DATS model is expressed as a JSON schema. Associated JSON-LD context files support search engine optimization because they map into schema.org and DCAT. Mappings into biological entities are also available via OBO Foundry resources.
Database of Genotypes and Phenotypes. dbGaP was developed to archive and distribute the data and results from studies that have investigated the interaction of genotype and phenotype in humans.
a Common Fund Data Coordinating/Resource Center.
DCC Data Ingestion Process
Data Coordination and Integration Center. Data generated by 4DN partner institutions are integrated, curated, analyzed, and disseminated here.
Discovery Environment for Relational Information and Versioned Assets is a suite of tools and services that are designed to significantly reduce the overhead and complexity of creating and managing complex, big datasets. DERIVA provides a digital asset management system for scientific data to streamline the acquisition, modeling, management, and sharing of complex, big data, and provides interfaces so that these data can be delivered to diverse external tools for big-data analysis and analytic tools.
a group of assets that may be digital objects (i.e., files) or references to physical objects. DERIVA uses an entity-relationship data model to catalog and organize these assets.
Domain Name System is the internet database that maps domain names (the host part of a URL) to their IP addresses.
Data and Tools Cores. The Metabolomics Program consortium consists of six RCMRCs and seven DTCs that are overseen by the Metabolomics Consortium Coordinating Center at the University of Florida.
a component of the entity relationship model
Entity Relationship Model
Entity Relationship Model (or ER model) describes interrelated things of interest in a specific domain of knowledge. A basic ER model is composed of entity types (which classify the things of interest) and specifies relationships that can exist between entities (instances of those entity types). In software engineering, an ER model is commonly formed to represent things a business needs to remember in order to perform business processes. Consequently, the ER model becomes an abstract data model that defines a data or information structure which can be implemented in a database, typically a relational database (source: Wikipedia).
a database table representing an entity
electronic Research Administration is an online interface where signing officials, principal investigators, trainees, and post-docs at institutions/organizations can access and share administrative information relating to research grants. It is the designated ID provider for the whitelist of DCCs.
Specific instances of data gathering for a specific patient, as in a specific surgery or appointment
any object modeled as a C2M2 entity: a file, biosample, subject, collection, or project
Extract Transform Load Process (ETL)
Findable, Accessible, Interoperable, and Reusable. Making data FAIR is the main goal of the DCCs.
a grid of colored squares developed to visually communicate FAIRness level. The FAIR insignia identifies areas of strength and weakness in the FAIRness level of digital objects, guiding digital object producers on how to improve the FAIRness of their products. The FAIR insignia is an output of FAIRshake.
a tool produced for carrying out FAIR Assessment. Under the CFDE, each data center’s inventory will be evaluated consistently based on FAIRshake, and the Coordinating Center will work with the individual Common Fund programs to adjust FAIR measures to meet the needs of the Common Fund. FAIRshake includes evaluating the FAIRness of digital objects including datasets, tools, and repositories.
an online open access repository where researchers can preserve and share their research outputs, including figures, datasets, images, and videos.
a digital asset, that is, a type of digital object that each of the DCCs hosts, such as genomic, metagenomic, or RNA-Seq sequence data, physiological and metabolic data, or a generic metadata electronic document.
A field in a database table that references a particular target record in some other ("foreign") database table by storing a copy of the identifier ("key") field from the foreign table assigned to the target record. A foreign key can also comprise multiple fields, if it encodes a multi-part identifier from the foreign table: cf. e.g. paired key.
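To make the idea concrete, here is a small sketch using Python's built-in `sqlite3` module. The `project` and `file` tables, their columns, and the helper functions are hypothetical and chosen only for illustration; they are not taken from the C2M2 schema. Note that SQLite enforces foreign keys only after `PRAGMA foreign_keys = ON` is issued on the connection.

```python
import sqlite3


def build_demo():
    """Create an in-memory database with one foreign-key relationship."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # enforcement is off by default
    conn.execute("CREATE TABLE project (id TEXT PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE file ("
        " id TEXT PRIMARY KEY,"
        " project_id TEXT REFERENCES project(id))"  # the foreign key field
    )
    conn.execute("INSERT INTO project VALUES ('project_1')")
    conn.execute("INSERT INTO file VALUES ('file_1', 'project_1')")
    return conn


def try_insert_orphan(conn):
    """Try to store a file whose project_id matches no project record."""
    try:
        conn.execute("INSERT INTO file VALUES ('file_2', 'no_such_project')")
    except sqlite3.IntegrityError:
        return False  # rejected: the referenced target record does not exist
    return True
```

The second insert is rejected because `file.project_id` must hold a copy of an `id` value that actually exists in the foreign `project` table.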
a format specification produced by the Open Knowledge Foundation and supported by the Frictionlessdata.io organization. It aims to shorten the path from data to insight with a collection of specifications and software for the publication, transport, and consumption of data. This supports the cycle of find/improve/share that makes for a dynamic and productive data ecosystem.
an online hub for storing and sharing computer programs and other plain text files. The CFDE team uses it for storage, hosting websites, communications, and project management.
a distributed research automation platform that addresses the problem of securely and reliably automating, for many thousands of scientists, sequences of data management tasks that may span locations, storage systems, administrative domains, and timescales, and integrate both mechanical and human inputs.
Globus Automate Flow
a process which relies on Globus Automate, a software service created and based at the University of Chicago, to help scientists simplify their workflow by automating data transfer and synchronization tasks. Users can create automated sequences initiated by events, known as Globus Automate Flows.
Globally Unique IDentifier. A string of characters, usually representing a 128-bit value, used to uniquely identify an entity. Modern hashing functions used to generate GUIDs make identifier collisions (the event of the function producing the same sequence twice) highly unlikely (but not impossible), hence GUIDs can only be nearly guaranteed to be unique.
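As an illustration, Python's standard `uuid` module produces 128-bit identifiers of this kind. The sketch below is not the CFDE's own method for minting GUIDs: `uuid4` draws random bits rather than hashing, `uuid5` hashes a namespace plus a name with SHA-1, and the URL passed to `uuid5` is a made-up example.

```python
import uuid


def new_random_guid():
    """Return a random (version 4) 128-bit identifier."""
    return uuid.uuid4()


def guid_from_name(name):
    """Return a name-based (version 5, SHA-1) identifier; same name, same GUID."""
    return uuid.uuid5(uuid.NAMESPACE_URL, name)


random_guid = new_random_guid()
named_guid = guid_from_name("https://example.org/hypothetical/file_1")
```

Two calls to `new_random_guid()` are overwhelmingly unlikely to collide, while `guid_from_name` is deterministic, so different parties hashing the same name arrive at the same identifier.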
Human Microbiome Project.
see C2M2 Instance
JSON for Linking Data is based on the already successful JSON format and provides a way to help JSON data interoperate at Web-scale. JSON-LD is a JSON-based syntax for expressing a generalized RDF dataset.
a web-based interactive environment for organizing data, performing computation, and visualizing output.
a webpage containing information identifying and describing a particular resource
Lightweight Directory Access Protocol
a string assigned to a C2M2 resource by its managing DCC, identifying that resource uniquely within the context of a (mandatory) accompanying namespace. These strings are stored in the fields named `local_id` listed in the C2M2 specification.
Metabolomics Consortium Coordinating Center at the University of Florida, which handles overall coordination for the Metabolomics program.
Metabolomics Workbench Metabolite Database
a PostgreSQL database containing over 65,000 structures and annotations of biologically relevant metabolites collected from public repositories (e.g., LIPID MAPS, ChEBI, HMDB, BMRB, PubChem, KEGG). Users can search for metabolites in the database by substructure, text, or mass (m/z ratio). Each entry contains key information about the metabolite, including structure, molecular weight, common and systematic names, PubChem compound ID, and classification. Entries also contain cross references to external databases and repositories (e.g., HMDB, ChEBI, LIPID MAPS, METLIN, ChemSpider, KEGG, etc.) as well as links to the MoNA MS spectra and human metabolic pathways containing the metabolite. Additionally, the open-source chemistry cartridge enables substructure searching, generation of chemistry-centric attributes (formula, exact mass), and interconversion of molecular formats.
a type of information entity usually defined as data about the data, understood as descriptors to understand the context of a dataset. For example, metadata about a FASTQ file may be file size or file creator. Metadata is often classified into administrative metadata and provenance metadata, all of which provide context to the actual data/dataset.
a process which assigns identifiers to the objects, then extracts or creates metadata for these objects and persists them.
a file that includes metadata for a group of accompanying files that are part of a coherent unit (manifest), such as name, version, background scripts, and browser actions.
Human Metabolome Gene/Protein Database. Developed by the Common Fund Metabolomics program and part of Metabolomics Workbench. A database of metabolome-related genes and proteins containing over 7,300 genes and over 15,500 proteins. Users can search by gene (name, symbol, entrez ID, etc.), HMDB Pathway, or Reactome Pathway. MGP displays genes/proteins and metabolites associated with a pathway of interest. Searching by gene displays information about the gene’s associated proteins and pathways, including a summary of the function and metabolites involved in the pathway.
Molecular Transducers of Physical Activity Consortium. MoTrPAC is a national research consortium designed to discover and perform preliminary characterization of the range of molecular transducers (the "molecular map") that underlie the effects of physical activity in humans. The program's goal is to study the molecular changes that occur during and after exercise and ultimately to advance the understanding of how physical activity improves and preserves health. The six-year program is the largest targeted NIH investment of funds into the mechanisms of how physical activity improves health and prevents disease.
Metabolomics Workbench. The MW is an online interface to the NMDR developed at UCSD by the Common Fund Metabolomics program. It allows users to manage and upload studies as well as browse and search available studies. Using the MW interface, submitters upload data and results, including metadata, targeted data measurements, protocols/methods files, untargeted data measurements, and raw data (MS/NMR files, etc.). Other researchers can then use the MW website to browse, search, analyze, and download data as well as view summary figures of key study search parameters (e.g., bubble chart showing studies by sample source). For example, studies can be filtered by study metadata (disease, sample source, species, instrumentation) or metabolite information (metabolite classification, biochemical pathways, retention time, etc.) to identify data relevant to the user’s needs. Additionally, it provides analysis tools and access to metabolite standards, protocols, tutorials, training, and other resources to support metabolomic researchers (such as RefMet, Metabolomics Workbench Metabolite Database).
National Heart, Lung, and Blood Institute
National Institutes of Health
National Metabolomics Data Repository. Responsible for collating, analyzing, and distributing the data gathered by the RCMRCs and hosting the tools and methods created by the DTCs.
Open Biomedical Ontology
an entity comprising multiple people, such as an institution or an association, that has a particular purpose.
URI or CURIE attached to a C2M2 resource; cannot be changed after attachment to the resource so identified; must resolve, either via some IANA scheme (e.g. http[s], i.e. the URI is a URL) or a resolver like identifiers.org, to a landing page describing the attached resource.
Proof of Concept (POC)
a process or realization of a certain method or idea in order to demonstrate its feasibility.
an entity describing the administrative/funding/contract/etc. hierarchy governing ownership/management/purview/responsibility of/for subcollections of experimental resources and metadata
Researcher Authorization Service. A service under development by the NIH's Center for Information Technology that will facilitate access to controlled data assets and repositories.
Regional Comprehensive Metabolomics Resource Cores. The Metabolomics Program consortium consists of six RCMRCs, also called the Compound Identification Cores (CIDCs), and seven Data and Tools Cores (DTCs) that are overseen by the Metabolomics Consortium Coordinating Center at the University of Florida.
Resource Description Framework. A standard model for semantic data interchange on the Web. RDF has features that facilitate data merging even if the underlying schemas differ, and it specifically supports the evolution of schemas over time without requiring all the data consumers to be changed.
Reference list of Metabolite names. Developed by the Common Fund Metabolomics program. Effectively, a large spreadsheet that provides a standard nomenclature for over 95,500 chemical species. From the Metabolomics Workbench website, it can be browsed and searched directly or a user can input a list of metabolite names and have them automatically converted to RefMet nomenclature. A user can also directly download the data, either in whole or after filtering as one would with a simple Excel sheet. Or the entire dataset can be downloaded as part of a Shiny R app and queried locally.
a resolvable address, attached to a given resource, is any string which can be used (as input to some intermediary service or scheme, e.g. http or identifiers.org) to obtain information about that resource
a retrievable C2M2 resource can be directly obtained by an interested/authorized party, e.g. via download
a qualifier indicative of the depth and granularity of an object model or Entity-Relationship model. In the context of CFDE, the C2M2 model is described by the following increasing richness levels: [Level-0, Level-1, Level-2]
one or more fields (columns) in a database table row which, taken together, uniquely identify that row within its containing table.
a material collected from an organism, a cell culture, or a material containing organisms, such as an environmental material. syn: biospecimens
Any well-defined, replicable process that transforms a (not necessarily inherently sequential) collection of data into an ordered sequence of information in such a way that one can then reliably reconstruct the original data from the encoded sequence.
("Inherently sequential" data is naturally ordered, first-bit-to-last-bit. Regarding data that isn't inherently sequential, one example is a dataset describing a social network that connects multiple people to one another via inter-person links. There's no obvious "first" person, among other things, so there's no obvious way to unambiguously describe that dataset as an ordered sequence of things.)
The basic purpose of serialization is to allow people to create, save, and share files (which are necessarily sequential: first byte, second byte, etc., through to the end of the file) that encode non-sequential data in a reproducible way according to a shared method, so that three different people don't wind up encoding the same data in three different ways. This helps ensure reliability when automatically processing complex datasets, especially in an environment where data is being shared and processed across multiple independent teams using different information systems.
In the context of CFDE, serialization refers specifically to the process of transforming a collection of biomedical metadata managed by a DCC from its native format(s) into a group of files conforming to one of the C2M2 metadata specifications, so that unstructured data from multiple independent sources can be reliably ingested into core CFDE database systems via a single standard automated process.
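The general idea (not the C2M2 serialization itself) can be sketched with the social network example above. In this illustration the network is a Python dictionary mapping each person to a set of linked people; the names and the two helper functions are hypothetical, and JSON with sorted keys is just one convenient shared method among many.

```python
import json


def serialize(network):
    """Encode a person -> set-of-links mapping as one canonical JSON string."""
    # Sets have no inherent order, so sort them to get a reproducible sequence.
    canonical = {person: sorted(links) for person, links in network.items()}
    return json.dumps(canonical, sort_keys=True)


def deserialize(text):
    """Rebuild the original person -> set-of-links mapping from the string."""
    return {person: set(links) for person, links in json.loads(text).items()}


# The same network, written down in two different orders.
network_a = {"bo": {"al", "cy"}, "al": {"bo"}, "cy": {"bo"}}
network_b = {"cy": {"bo"}, "al": {"bo"}, "bo": {"cy", "al"}}
```

Both `network_a` and `network_b` serialize to the same string, and `deserialize` recovers the original structure, which is what lets independent teams exchange the data without each inventing its own encoding.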
Single Sign-On: An authentication scheme that allows a user to log in with a single ID and password to any of several related, yet independent, software systems. It is often accomplished by using the Lightweight Directory Access Protocol (LDAP) and stored LDAP databases on (directory) servers. A simple version of single sign-on can be achieved over IP networks using cookies but only if the sites share a common DNS parent domain.
Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability initiative by the NIH. Common Fund leadership has partnered with the STRIDES initiative, which provides lower-cost cloud services to NIH projects.
a source from which some biological material (a biosample) was obtained.
a short list of general concepts -- defined and maintained as part of the C2M2 metadata specification space -- to partition and categorize subject entities into broad groups or types (generally corresponding to organism multiplicity).
C2M2 subjects are modeled quite generally by default, defined only as sources from which some biological material (a biosample) was obtained. Subject granularity allows further refinement of this definition by categorizing subjects into concept groups or types. These include the most basic "single organism" (e.g., one human subject); the more complex "symbiont system" (in which multiple co-located organisms are modeled as comprising an indivisible biological environment which can, at the time of sampling, only be characterized as a mixture of organisms, because physical separation, DNA-based differentiation, etc., are all pending downstream); and "cell line" (not properly an "organism" as such, but a common source of biological material, further classified within its own ontological system).
Specification of subject granularity also lets the application layer handle subjects with different granularities in different ways, according to ontological context. For more detail you can see the C2M2 specification here.
a set of study subjects sharing some characteristics or undergoing the same type of study intervention.
A short list of general concepts -- defined and maintained as part of the C2M2 metadata specification space -- to support the systematic subdivision of subject entities into multiple constituent organisms, phylogenetic clades, or other reasonable subdivisions, depending on subject granularity.
For any subject entity categorized by a subject granularity which represents a collection of multiple co-occurring organisms, subject roles allow the attachment of descriptive metadata to different constituent subcomponents of the biological system represented by the overall subject record.
Such attachments are (for example) represented by records in the `subject_role_taxonomy` association table. Records in this table link some subject role (e.g. "host", "symbiont", "pathogen", "cell line ancestor") to (on the one hand) a particular subject entity record of which that role represents a part and also (on the other hand) to a taxonomic label classifying that particular role in that particular subject. Multiple roles (constituent organisms or taxa) within the same subject can thus be independently classified.
Table Schema to DERIVA Translation
a metadata ingest process which uses tabular formatted data, such as a Frictionless data package, to persist information in the DERIVA system. The code for such a process is available on GitHub in this repository
Training Coordination Center. This center is staffed by experts in bioinformatics curriculum development, teaching, and community building. It provides support and resources for the development of DCC-specific training programs as well as end-user training on CFDE products and general topics of interest to the Common Fund research community. The TCC can help with logistical support for hosting workshops, as well as providing guidance on how to grow and build a sustainable training program. The TCC provides instructor training for the DCCs and assists with creating useful qualitative and quantitative feedback and assessment tools. In addition to site-specific training, the TCC offers training on CFDE products as they become available, and pilots a general bioinformatics workshop curriculum on topics of broad interest within the Common Fund.
Trans-Omics for Precision Medicine. The Trans-Omics for Precision Medicine (TOPMed) program, sponsored by the National Institutes of Health (NIH) National Heart, Lung and Blood Institute (NHLBI), is part of a broader Precision Medicine Initiative, which aims to provide disease treatments tailored to an individual’s unique genes and environment. TOPMed contributes to this Initiative through the integration of whole-genome sequencing (WGS) and other omics (e.g., metabolic profiles, epigenomics, protein and RNA expression patterns) data with molecular, behavioral, imaging, environmental, and clinical data.
Tab-Separated Values file
Whole Genome Sequencing