Human MSigDB Collections: Details and Acknowledgments
General notes
Beginning in MSigDB 7.0,
we use Ensembl as the platform annotation authority. Gene identifiers are mapped to their HGNC-approved gene symbol and
NCBI Gene ID through annotations extracted from Ensembl's BioMart data service, and are updated at each MSigDB release to the
latest available version of Ensembl.
H collection: hallmark gene sets
We envision this collection as the starting point for your exploration of the MSigDB resource and GSEA. Hallmark gene sets summarize and represent specific well-defined biological states or processes and display coherent expression. These gene sets were generated by a computational methodology based on identifying gene set overlaps and retaining genes that display coordinate expression. The hallmarks reduce noise and redundancy and provide a better delineated biological space for GSEA.
We refer to the original overlapping gene sets, from which a hallmark is derived, as its 'founder' sets. Hallmark gene set pages provide links to the corresponding founder sets for more in-depth exploration. In addition, hallmark gene set pages include links to the microarray data that were used to refine and validate the hallmark signatures.
To cite your use of the collection, and for further information, please refer to
Liberzon A, Birger C, Thorvaldsdóttir H, Ghandi M, Mesirov JP, Tamayo P. The Molecular Signatures
Database (MSigDB) hallmark gene set collection. Cell Syst. 2015 Dec 23;1(6):417-425.
C1 collection: positional gene sets
Gene annotations for this collection are derived from the Chromosome and Karyotype band tracks from Ensembl
BioMart (see MSigDB release notes for the current version) and reflect the gene architecture as represented
on the primary assembly. Decimals in cytogenetic bands were ignored. For example, 5q31.1 was considered 5q31.
Therefore, genes annotated as 5q31.2 and those annotated as 5q31.3 were both placed in the same set, 5q31. These
gene sets can be helpful in identifying effects related to chromosomal deletions or amplifications, dosage
compensation, epigenetic silencing, and other regional effects.
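The band-collapsing rule described above can be illustrated with a minimal Python sketch. The function names, the regular expression, and the gene names are hypothetical placeholders chosen for illustration; this is not the code used to build the collection.

```python
import re
from collections import defaultdict

# Matches a cytogenetic band such as "5q31.2": chromosome, arm (p or q),
# band digits, and an optional decimal sub-band.
BAND_PATTERN = re.compile(r"^(\d{1,2}|X|Y)([pq])(\d+)(?:\.\d+)?$")


def collapse_band(band):
    """Drop the decimal sub-band, e.g. '5q31.2' -> '5q31'."""
    match = BAND_PATTERN.match(band)
    if match is None:
        raise ValueError(f"unrecognized band: {band!r}")
    chromosome, arm, major = match.groups()
    return f"{chromosome}{arm}{major}"


def group_by_band(gene_to_band):
    """Group gene symbols into positional sets keyed by the collapsed band."""
    sets = defaultdict(set)
    for gene, band in gene_to_band.items():
        sets[collapse_band(band)].add(gene)
    return dict(sets)


# Hypothetical placeholder gene names, not real annotations.
example = {
    "GENE_A": "5q31.1",
    "GENE_B": "5q31.2",
    "GENE_C": "5q31.3",
    "GENE_D": "Xp11.23",
}
positional_sets = group_by_band(example)
```

Here the three genes annotated to 5q31.1, 5q31.2, and 5q31.3 all land in the single set keyed by 5q31.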
C2 collection: curated gene sets
Gene sets in this collection are curated from various sources, including online pathway databases and the
biomedical literature. Many sets are also contributed by individual domain experts. The gene set page for
each gene set lists its source. The C2 collection is divided into the following two subcollections:
Chemical and genetic perturbations (CGP) and Canonical pathways (CP).
> C2 subcollection CGP: Chemical and genetic perturbations
Gene sets that represent expression signatures of genetic and chemical perturbations.
The majority of the CGP subcollection represents data curated from the biomedical literature. Microarray and
sequencing studies have identified signatures of many important biological and clinical states (e.g.
cancer metastasis, stem cell characteristics, drug resistance).
Unlike a pathway database, which is designed to represent a generic accounting of cellular
processes, CGP aims to provide specific, targeted signatures, largely from perturbation experiments.
A number of these gene sets come in pairs: xxx_UP (and xxx_DN) gene sets representing genes induced
(and repressed) by the perturbation.
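When working with a list of CGP set names, it can be convenient to recover these pairs programmatically. The following is a small sketch that assumes the pairing is expressed purely through the _UP and _DN name suffixes; the function name and the set names are made up for illustration.

```python
def pair_up_dn(set_names):
    """Return {stem: (up_name, dn_name)} for stems that have both an _UP and a _DN set."""
    up = {name[:-3]: name for name in set_names if name.endswith("_UP")}
    dn = {name[:-3]: name for name in set_names if name.endswith("_DN")}
    # Only stems present in both dictionaries form a pair.
    return {stem: (up[stem], dn[stem]) for stem in sorted(up.keys() & dn.keys())}


# Hypothetical set names used only to exercise the function.
names = [
    "STUDY_A_TREATMENT_UP",
    "STUDY_A_TREATMENT_DN",
    "STUDY_B_KNOCKDOWN_UP",
    "STUDY_C_SIGNATURE",
]
pairs = pair_up_dn(names)
```

Sets without a partner (here STUDY_B_KNOCKDOWN_UP) and sets without a direction suffix are simply left out of the result.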
The majority of CGP sets were curated from publications and include links to the
PubMed citation, the exact source of the set
(e.g., Table 1), and links to any corresponding raw data in GEO
or ArrayExpress repositories. When the gene
set involves a genetic perturbation, the set's brief description includes a link to the gene's entry in the
NCBI (Entrez) Gene database. When the gene set
involves a chemical perturbation, the set's brief description includes a link to the chemical's entry in the NCBI
PubChem Compound database.
Other CGP gene sets include:
- MatrisomeDB (http://matrisomeproject.mit.edu): Naba A,
Clauser KR, Hoersch S, Liu H, Carr SA, Hynes RO. The matrisome: in silico definition and in vivo
characterization by proteomics of normal and tumor extracellular matrices. Mol Cell Proteomics. 2012
Apr;11(4):M111.014647; and the 2023 release of MatrisomeDB (https://matrisomedb.org/):
Shao X, Gomez CD, Kapoor N, Considine JM, Gao Y, Naba A. MatrisomeDB: 2023
updates of the ECM protein knowledge database. Nucleic Acids Research, 2022, gkac1009. doi.org/10.1093/nar/gkac1009.
- Gene sets contributed by the L2L database of published microarray gene expression data at
University of Washington. See Newman JC, Weiner AM. L2L: a simple tool for discovering the hidden
significance in microarray expression data. Genome Biol. 2005;6(9):R81. See also
http://depts.washington.edu/l2l.
- Gene sets curated by Dr. Chi Dang from the
MYC Target Gene Database
at Johns Hopkins University School of Medicine. See Zeller KI, Jegga AG, Aronow BJ, O'Donnell KA, Dang CV.
An integrated database of genes responsive to the Myc oncogenic transcription factor: identification of
direct genomic targets. Genome Biol. 2003;4(10):R69.
- A number of individuals have contributed gene sets to this collection. The gene set annotation
includes a "contributor" field that acknowledges the contributor by name/affiliation.
Note that sets deposited in the C2:CGP subcollection reflect a diverse array of species origins (largely human
and mouse, but also rat, macaque, etc.); all non-human-derived sets are mapped to human orthologs. Mouse-derived
sets that were deposited in MSigDB using mouse gene identifiers (i.e. not pre-converted to orthologs prior
to deposition) are also provided in the mouse namespace in the M2:CGP subcollection of the Mouse MSigDB collections.
> C2 subcollection CP: Canonical pathways
The pathway gene sets are curated from the following online databases:
C3 collection: regulatory target gene sets
Gene sets representing potential targets of regulation by transcription factors or microRNAs. The sets consist
of genes grouped by their shared regulatory element. The motifs represent known or likely cis-regulatory
elements in promoters and 3'-UTRs. These gene sets make it possible to link changes in an expression profiling
experiment to a putative cis-regulatory element. The C3 collection is divided into two
subcollections: microRNA targets (MIR) and transcription factor targets (TFT).
> C3 subcollection MIR: microRNA targets
> C3 subcollection TFT: Transcription factor targets
- GTRD: Sets of human genes predicted to contain transcription factor binding sites in their
promoter regions (-1000,+100 bp around the transcription start site) for the indicated transcription factor. Gene sets are derived
from the Gene Transcription Regulation Database (GTRD, gtrd.biouml.org)
uniform processing pipeline and represent a candidate list of potential regulatory targets for each transcription factor (see
MSigDB release notes for the current included GTRD version).
GTRD: an integrated view of transcription regulation. Kolmykov S, Yevshin I, Kulyashov M,
Sharipov R, Kondrakhin Y, Makeev VJ, Kulakovskiy IV, Kel A, Kolpakov F.
Nucleic Acids Res. 2021 Jan 8;49(D1):D104-D111. doi: 10.1093/nar/gkaa1057.
- TFT_LEGACY: (These are older gene sets that formerly represented the C3:TFT
subcollection prior to MSigDB v7.1). Gene sets that share upstream cis-regulatory motifs which can function
as potential transcription factor binding sites. We used two approaches to generate these motif gene sets.
- Gene sets of "conserved instances" consist of the inferred target genes for each motif m of 174
motifs highly conserved in promoters of four mammalian species (human, mouse, rat and dog). The motifs represent
potential transcription factor binding sites and are catalogued in Xie X, Lu J, Kulbokas EJ, Golub TR, Mootha V, Lindblad-Toh K, Lander ES, Kellis M. Systematic discovery of regulatory motifs in human promoters and 3' UTRs by comparison of several mammals. Nature. 2005 Mar 17;434(7031):338-45. Each gene set consists of all human genes whose promoters contain at least one conserved
instance of motif m, where a promoter is defined as the non-coding sequence contained
within a 4-kilobase window centered at the transcription start site (TSS).
- Mammalian transcriptional regulatory motifs were extracted from the TRANSFAC database, v7.4 (see
supplementary data of Xie et al). Each gene set consists of all human genes whose promoters contain
at least one conserved instance of the TRANSFAC motif, where a promoter is defined as the non-coding
sequence contained within a 4-kilobase window centered at the transcription start site (TSS).
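The two promoter definitions used in this subcollection (-1000 to +100 bp for GTRD, and a 4-kilobase window centered at the TSS for the legacy sets) can be compared with a short Python sketch. The function below is hypothetical, and its strand handling (mirroring the window for minus-strand genes) is an assumption; the text above does not state how strand was treated.

```python
def promoter_window(tss, strand, upstream, downstream):
    """Return (start, end) genomic coordinates of a window around a TSS.

    upstream and downstream are distances in bp relative to the direction of
    transcription, so they are mirrored for minus-strand genes (an assumption).
    """
    if strand == "+":
        return (tss - upstream, tss + downstream)
    if strand == "-":
        return (tss - downstream, tss + upstream)
    raise ValueError("strand must be '+' or '-'")


# GTRD-style window: -1000 to +100 bp around the TSS.
gtrd_plus = promoter_window(50_000, "+", upstream=1000, downstream=100)
gtrd_minus = promoter_window(50_000, "-", upstream=1000, downstream=100)

# Legacy window: 4 kb centered at the TSS, i.e. 2 kb on each side.
legacy = promoter_window(50_000, "+", upstream=2000, downstream=2000)
```

The TSS coordinate of 50,000 is an arbitrary example value. The centered legacy window is symmetric, so strand does not change its coordinates.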
C4 collection: computational gene sets
Computational gene sets defined by mining large collections of cancer-oriented expression data. This
collection is divided into three subcollections: Curated Cancer Cell Atlas (3CA), Cancer gene neighborhoods (CGN), and Cancer modules (CM).
> C4 subcollection 3CA: Curated Cancer Cell Atlas
Gene sets defined by Gavish et al. Hallmarks of Transcriptional Intratumour Heterogeneity across a Thousand Tumours. Nature 1-9 (2023).
doi:10.1038/s41586-023-06130-4. Briefly, the authors curated, annotated and integrated the data from 77 different studies to reveal
the patterns of transcriptional intra-tumoral heterogeneity across 1,163 tumour samples covering 24 tumor types. They identified 41 consensus
meta-programs among the malignant cells (gene sets named with prefix GAVISH_3CA_MALIGNANT_METAPROGRAM_), each consisting of genes which were
coordinately upregulated in subpopulations of cells within many tumors. The meta-programs cover diverse cellular processes including both
generic (for example, cell cycle and stress) and lineage-specific patterns. The authors further extended the meta-program analysis to six
common non-malignant cell types (gene sets named with prefix GAVISH_3CA_METAPROGRAM_). See also
https://www.weizmann.ac.il/sites/3CA/.
> C4 subcollection CGN: Cancer gene neighborhoods
In our GSEA paper, Subramanian,
Tamayo et al. 2005, PNAS 102, 15545-15550, we mined 4 expression
compendia datasets for correlated gene sets, starting with a list of 380 cancer-associated genes curated
from internal resources and
Brentani, Caballero et al. Human Cancer Genome Project/Cancer Genome
Anatomy Project Annotation Consortium.; Human Cancer Genome Project Sequencing Consortium. The
generation and utilization of a cancer-oriented representation of the human transcriptome by using
expressed sequence tags. Proc Natl Acad Sci U S A. 2003 Nov 11;100(23):13418-23. Using the profile of a
given gene as a template, we ordered every other gene in the data set by its Pearson correlation coefficient.
We applied a cutoff of R ≥ 0.85 to extract correlated genes. The calculation of neighborhoods was done
independently in each compendium. In this way, a given oncogene may have up to four "types" of
neighborhoods according to the correlation present in each compendium. Neighborhoods with fewer than 25 genes at
this threshold were omitted, yielding the final 427 sets.
- GNF2: Human tissue compendium (Novartis).
Gene expression profiles from the Novartis normal tissue compendium, as published in Su AI, Wiltshire T, Batalov S,
Lapp H, Ching KA, Block D, Zhang J, Soden R, Hayakawa M, Kreiman G, Cooke MP, Walker JR, Hogenesch JB. A gene atlas of the
mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci U S A. 2004 Apr 20;101(16):6062-7.
- CAR: Novartis carcinoma compendium (Novartis).
Gene expression profiles from the Novartis carcinoma tissue compendium, as published in Su AI, Welsh JB, Sapinoso LM,
Kern SG, Dimitrov P, Lapp H, Schultz PG, Powell SM, Moskaluk CA, Frierson HF Jr, Hampton GM. Molecular classification of human
carcinomas by use of gene expression signatures. Cancer Res. 2001 Oct 15;61(20):7388-93.
- GCM: Global Cancer Map (Broad Institute).
Gene expression profiles from the global cancer map, as published in Ramaswamy S, Tamayo P, Rifkin R, Mukherjee S, Yeang CH,
Angelo M, Ladd C, Reich M, Latulippe E, Mesirov JP, Poggio T, Gerald W, Loda M, Lander ES, Golub TR. Multiclass cancer diagnosis
using tumor gene expression signatures. Proc Natl Acad Sci U S A. 2001 Dec 18;98(26):15149-54.
- MORF: An unpublished compendium of gene expression data sets,
including many of Broad Institute's Cancer Program in-house Affymetrix HG-U95 cancer samples (1,693 in all) from a variety of cancer
projects representing many different tissue types, mainly primary tumors, such as prostate, breast,
lung, lymphoma, leukemia, etc.
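The neighborhood procedure described above (rank genes by Pearson correlation with a template gene, keep those at R ≥ 0.85, and discard neighborhoods with fewer than 25 genes) can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the original implementation: the function names and toy data are hypothetical, and excluding the template gene from its own neighborhood is an assumption.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no correlation information
    return cov / math.sqrt(var_x * var_y)


def neighborhood(template_gene, profiles, r_cutoff=0.85, min_size=25):
    """Genes correlated with the template at R >= r_cutoff, ordered by decreasing R.

    Returns None when fewer than min_size genes pass the cutoff.
    Assumption: the template gene itself is excluded from the result.
    """
    template = profiles[template_gene]
    scored = [
        (pearson(template, profile), gene)
        for gene, profile in profiles.items()
        if gene != template_gene
    ]
    neighbors = [gene for r, gene in sorted(scored, reverse=True) if r >= r_cutoff]
    return neighbors if len(neighbors) >= min_size else None


# Toy compendium with made-up values.
toy = {
    "TEMPLATE": [1.0, 2.0, 3.0, 4.0],
    "GENE_A": [2.0, 4.0, 6.0, 8.0],  # perfectly correlated
    "GENE_B": [1.0, 2.1, 2.9, 4.2],  # strongly correlated
    "GENE_C": [4.0, 3.0, 2.0, 1.0],  # anti-correlated
}
# min_size is lowered here only so the toy data yields a neighborhood.
result = neighborhood("TEMPLATE", toy, min_size=2)
```

Running the same function once per compendium would give up to four neighborhoods for a given template gene, matching the description above.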
> C4 subcollection CM: Cancer modules
Gene sets defined by Segal E, Friedman N, Koller D, Regev A. A module map showing conditional
activity of expression modules in cancer. Nat Genet. 2004 Oct;36(10):1090-8. Briefly, the authors
compiled gene sets ('modules') from a variety of resources such as KEGG, GO, and others. By mining a
large compendium of cancer-related microarray data, they identified 456 such modules as significantly changed in
a variety of cancer conditions. See also
http://robotics.stanford.edu/~erans/cancer.
C5 collection: ontology gene sets
Gene sets in this collection are derived from ontology resources. The collection is divided into four subcollections derived from ontology annotations. Ontology annotations were curated from databases maintained by their respective authorities.
Ontology terms for very broad categories that would produce extremely large gene sets (greater than 2000 members) and ontology terms that produced gene sets with fewer than 5 members have been omitted. Additionally, each subcollection goes through a redundancy filtering procedure to ensure there are no identical or highly similar sets. (See
MSigDB release notes for the current versions, and more information on specific procedures.)
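The size limits stated above amount to keeping only sets with between 5 and 2000 members, inclusive. A minimal sketch, with hypothetical names and toy data, might look like this. The redundancy filtering step is not shown because its procedure is described only in the release notes.

```python
def filter_by_size(gene_sets, min_size=5, max_size=2000):
    """Keep gene sets whose member count satisfies min_size <= n <= max_size."""
    return {
        name: genes
        for name, genes in gene_sets.items()
        if min_size <= len(genes) <= max_size
    }


# Toy sets with placeholder gene names, sized to sit on each side of the limits.
toy_sets = {
    "TERM_TOO_SMALL": {f"G{i}" for i in range(4)},
    "TERM_KEPT": {f"G{i}" for i in range(5)},
    "TERM_AT_LIMIT": {f"G{i}" for i in range(2000)},
    "TERM_TOO_BROAD": {f"G{i}" for i in range(2001)},
}
kept = filter_by_size(toy_sets)
```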
Note to GSEA users: Gene set enrichment analysis identifies gene sets consisting of co-regulated genes;
GO gene sets are based on ontologies and do not necessarily comprise co-regulated genes.
> C5 subcollection GO: Gene Ontology
The C5:GO subcollection is divided into three components (BP, CC, and MF) derived from
Gene Ontology (GO),
which represent GO terms belonging to one of the three root GO ontologies: biological process (BP), cellular component (CC), or
molecular function (MF), respectively.
GO is a collaborative effort to develop and use ontologies to support biologically meaningful annotation of genes and
their products. A GO annotation consists of a GO term associated with a specific reference that describes the work or
analysis upon which the association between a specific GO term and gene product is based. Each annotation also includes
an evidence code to indicate how the annotation to a particular term is supported
(http://geneontology.org/page/guide-go-evidence-codes).
Gene sets in this subcollection are prefixed with "GOBP" (Biological Process), "GOMF" (Molecular Function), or "GOCC"
(Cellular Component) to indicate their source ontology.
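As a small illustration, the prefix can be used to recover the root ontology of a gene set from its name. The sketch below assumes the prefix is separated from the rest of the name by an underscore; the function name and the example set names are hypothetical.

```python
GO_PREFIXES = {
    "GOBP": "biological process",
    "GOMF": "molecular function",
    "GOCC": "cellular component",
}


def source_ontology(set_name):
    """Return the root ontology for a GO gene set name, or None if no prefix matches."""
    # Assumption: the prefix is the text before the first underscore.
    prefix = set_name.split("_", 1)[0]
    return GO_PREFIXES.get(prefix)


# Hypothetical set names used only to exercise the function.
labels = {
    name: source_ontology(name)
    for name in ["GOBP_EXAMPLE_TERM", "GOMF_EXAMPLE_TERM", "GOCC_EXAMPLE_TERM", "OTHER_SET"]
}
```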
> C5 subcollection HPO: Human Phenotype Ontology
The Human Phenotype Ontology (HPO) provides a standardized vocabulary of phenotypic abnormalities encountered in human
disease (https://hpo.jax.org/). An HPO annotation consists of these
phenotypic abnormalities and the association between the abnormality and a set of genes known to be involved in
development of said abnormality, developed using the medical literature, Orphanet, DECIPHER, and OMIM.
Gene sets in this subcollection are prefixed with "HP" to indicate their source ontology.
Köhler S, Gargano M, Matentzoglu N, Carmody LC, Lewis-Smith D, Vasilevsky NA, Danis D, Balagura G, Baynam G, Brower AM, Callahan TJ, Chute CG, Est JL, Galer PD, Ganesan S, Griese M, Haimel M, Pazmandi J, Hanauer M, Harris NL, Hartnett MJ, Hastreiter M, Hauck F, He Y, Jeske T, Kearney H, Kindle G, Klein C, Knoflach K, Krause R, Lagorce D, McMurry JA, Miller JA, Munoz-Torres MC, Peters RL, Rapp CK, Rath AM, Rind SA, Rosenberg AZ, Segal MM, Seidel MG, Smedley D, Talmy T, Thomas Y, Wiafe SA, Xian J, Yüksel Z, Helbig I, Mungall CJ, Haendel MA, Robinson PN. The Human Phenotype Ontology in 2021. Nucleic Acids Res. 2021 Jan 8;49(D1):D1207-D1217.
C6 collection: oncogenic signature gene sets
Gene sets represent signatures of cellular pathways that are often dysregulated in cancer. The majority of
signatures were generated directly from microarray data from NCBI GEO or from internal unpublished
profiling experiments which involved perturbation of known cancer genes. In addition, a small number of
oncogenic signatures were curated from scientific publications.
C7 collection: immunologic signature gene sets
Gene sets in this collection represent cell states and perturbations within the immune system. It consists of two subcollections:
- ImmuneSigDB, which was previously the complete C7 and represents a broad curation effort of signatures of immune perturbations and states
- VAX, a targeted subcollection that focuses specifically on curation of published studies of human responses to various vaccines.
> C7 subcollection ImmuneSigDB
ImmuneSigDB is composed of gene sets that represent a broad curation effort of cell
types, states, and perturbations within the immune system. The signatures were generated by manual
curation of published studies in human and mouse immunology.
We first captured relevant microarray datasets published in the immunology literature that have raw data
deposited to Gene Expression Omnibus (GEO).
For each published study, the relevant comparisons were
identified (e.g. WT vs. KO; pre- vs. post-treatment etc.) and brief, biologically meaningful descriptions were
created. All data was processed and normalized the same way to identify the gene sets, which correspond to
the top or bottom genes (FDR < 0.02 or maximum of 200 genes) ranked by mutual information for each
assigned comparison.
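One possible reading of the selection rule above is sketched below in Python: keep genes that pass the FDR cutoff, order them by the strength of their score, and cap the list at 200. This is an interpretation, not the pipeline that produced ImmuneSigDB. In particular, the signed score (positive for "top", negative for "bottom"), the input layout, and the function name are all assumptions made for illustration.

```python
def select_signature(scored_genes, direction, fdr_cutoff=0.02, max_genes=200):
    """Select up to max_genes genes passing the FDR cutoff for one direction.

    scored_genes: iterable of (gene, signed_score, fdr) tuples (assumed layout).
    direction: "top" keeps positive scores, "bottom" keeps negative scores.
    """
    if direction not in ("top", "bottom"):
        raise ValueError("direction must be 'top' or 'bottom'")
    want_positive = direction == "top"
    passing = [
        (gene, score)
        for gene, score, fdr in scored_genes
        if fdr < fdr_cutoff and ((score > 0) if want_positive else (score < 0))
    ]
    # Strongest scores first, regardless of sign.
    passing.sort(key=lambda item: abs(item[1]), reverse=True)
    return [gene for gene, _ in passing[:max_genes]]


# Made-up scores and FDR values for placeholder genes.
scored = [
    ("G1", 0.9, 0.001),
    ("G2", 0.5, 0.01),
    ("G3", 0.8, 0.5),  # fails the FDR cutoff
    ("G4", -0.7, 0.005),
    ("G5", -0.9, 0.015),
]
top_genes = select_signature(scored, "top")
bottom_genes = select_signature(scored, "bottom")
```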
The immunologic signatures collection was generated as part of our collaboration with the
Haining Lab at Dana-Farber Cancer Institute and the
Human Immunology Project Consortium (HIPC).
To cite your use of the collection, and for further information, please refer to
Godec J, Tan Y, Liberzon A, Tamayo P, Bhattacharya S, Butte A, Mesirov JP, Haining WN,
Compendium of Immune Signatures Identifies Conserved and Species-Specific Biology in Response
to Inflammation, 2016, Immunity 44(1), 194-206.
> C7 subcollection VAX: vaccine response gene sets
The immune response signatures in this collection result from the curation of gene expression results from 62
publications covering on the order of 50 vaccines by the Human Immunology Project Consortium (HIPC). The initial
list of publications to curate was selected from a PubMed search of papers matching the terms "Vaccine [AND]
Signatures" or "Vaccine [AND] Gene expression". For inclusion, each gene list was required to show statistically
significant differential gene expression. Detailed metadata was collected including vaccine, cohort, comparison,
age, and type of expression change, e.g. up, down, positively correlated etc. The signatures were subject to
extensive quality control and proofreading. Some manual gene symbol updating and screening occurred following
curation. Response signatures that were identical were merged into a single gene set before performing MSigDB's
universal symbol remapping procedure.
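The merging of identical response signatures can be pictured with a short Python sketch that groups signatures by their exact gene membership. The function name and toy signatures are hypothetical, and how merged sets were actually named is not covered here.

```python
from collections import defaultdict


def merge_identical(signatures):
    """Group signature names whose gene membership is exactly the same.

    Returns a list of (sorted_names, genes) tuples, one per distinct gene list.
    """
    by_members = defaultdict(list)
    for name, genes in signatures.items():
        # frozenset ignores gene order and duplicates within a signature.
        by_members[frozenset(genes)].append(name)
    return [(sorted(names), set(members)) for members, names in by_members.items()]


# Toy signatures with placeholder gene names.
toy_signatures = {
    "SIG_A": ["G1", "G2"],
    "SIG_B": ["G2", "G1"],  # same genes as SIG_A, different order
    "SIG_C": ["G1", "G3"],
}
merged = merge_identical(toy_signatures)
```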
C8 collection: cell type signature gene sets
Gene sets that contain cluster marker genes for cell types identified in single-cell sequencing studies of human
tissue. These gene sets have been curated from the literature and represent signature genes and cell type
identifications as represented in their respective originating publications. The gene sets present in this
collection cover a number of cell types from Heart, GI Tract, Pancreas, Kidney, Liver, the Immune system, Retina,
Olfactory tissue, and the Brain. These gene sets are intended to facilitate the assignment of cell types in datasets
such as those from experiments developing organoid models.
Funding for development of these gene sets was provided in part by the
Collaborative Computational Tools for the Human Cell Atlas program
sponsored by the Chan Zuckerberg Initiative.