Human MSigDB Collections: Details and Acknowledgments

General notes

Beginning with MSigDB 7.0, we use Ensembl as the platform annotation authority. Gene identifiers are mapped to their HGNC-approved gene symbol and NCBI Gene ID through annotations extracted from Ensembl's BioMart data service, and the mappings are updated at each MSigDB release with the latest available version of Ensembl.

H collection: hallmark gene sets

We envision this collection as the starting point for your exploration of the MSigDB resource and GSEA. Hallmark gene sets summarize and represent specific well-defined biological states or processes and display coherent expression. These gene sets were generated by a computational methodology based on identifying gene set overlaps and retaining genes that display coordinate expression. The hallmarks reduce noise and redundancy and provide a better delineated biological space for GSEA.

We refer to the original overlapping gene sets, from which a hallmark is derived, as its 'founder' sets. Hallmark gene set pages provide links to the corresponding founder sets for more in-depth exploration. In addition, hallmark gene set pages include links to microarray data that served for refining and validation of the hallmark signatures.

To cite your use of the collection, and for further information, please refer to Liberzon A, Birger C, Thorvaldsdóttir H, Ghandi M, Mesirov JP, Tamayo P. The Molecular Signatures Database (MSigDB) hallmark gene set collection. Cell Syst. 2015 Dec 23;1(6):417-425.

C1 collection: positional gene sets

Gene annotations for this collection are derived from the Chromosome and Karyotype band tracks from Ensembl BioMart (see MSigDB release notes for the current version) and reflect the gene architecture as represented on the primary assembly. Decimals in cytogenetic bands were ignored. For example, 5q31.1 was considered 5q31. Therefore, genes annotated as 5q31.2 and those annotated as 5q31.3 were both placed in the same set, 5q31. These gene sets can be helpful in identifying effects related to chromosomal deletions or amplifications, dosage compensation, epigenetic silencing, and other regional effects.
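The band-truncation rule described above can be sketched as follows. This is a minimal illustration, not the MSigDB build code: the gene names and the input mapping are hypothetical placeholders, and the sets are keyed by the truncated band label purely for illustration.

```python
from collections import defaultdict

# Hypothetical input: gene symbol -> cytogenetic band annotation.
gene_bands = {
    "GENE_A": "5q31.1",
    "GENE_B": "5q31.2",
    "GENE_C": "5q31.3",
    "GENE_D": "17p13.1",
}


def strip_band_decimal(band):
    """Ignore everything after the decimal point, e.g. '5q31.1' -> '5q31'."""
    return band.split(".", 1)[0]


def build_positional_sets(gene_to_band):
    """Group genes by their truncated cytogenetic band."""
    sets = defaultdict(set)
    for gene, band in gene_to_band.items():
        sets[strip_band_decimal(band)].add(gene)
    return dict(sets)


positional_sets = build_positional_sets(gene_bands)
for band in sorted(positional_sets):
    print(band, sorted(positional_sets[band]))
```

Here the genes annotated as 5q31.1, 5q31.2, and 5q31.3 all land in the single set 5q31.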

C2 collection: curated gene sets

Gene sets in this collection are curated from various sources, including online pathway databases and the biomedical literature. Many sets are also contributed by individual domain experts. The gene set page for each gene set lists its source. The C2 collection is divided into the following two subcollections: Chemical and genetic perturbations (CGP) and Canonical pathways (CP).

> C2 subcollection CGP: Chemical and genetic perturbations

Gene sets that represent expression signatures of genetic and chemical perturbations.

The majority of the CGP subcollection represents data curated from the biomedical literature. Microarray and sequencing studies have identified signatures of many important biological and clinical states (e.g., cancer metastasis, stem cell characteristics, drug resistance). Unlike a pathway database, which is designed to provide a generic accounting of cellular processes, CGP aims to provide specific, targeted signatures, largely from perturbation experiments. A number of these gene sets come in pairs: xxx_UP and xxx_DN gene sets representing genes induced and repressed, respectively, by the perturbation. Sets curated from publications include links to the PubMed citation, the exact source of the set (e.g., Table 1), and links to any corresponding raw data in the GEO or ArrayExpress repositories. When the gene set involves a genetic perturbation, the set's brief description includes a link to the gene's entry in the NCBI (Entrez) Gene database. When the gene set involves a chemical perturbation, the set's brief description includes a link to the chemical's entry in the NCBI PubChem Compound database.
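When working with the paired xxx_UP / xxx_DN sets, it can be convenient to match the two halves of each pair by name. The sketch below assumes only the suffix convention described above; the set names are hypothetical examples, and `pair_up_dn` is an illustrative helper, not part of any MSigDB tool.

```python
def pair_up_dn(names):
    """Return {base: (up_name, dn_name)} for sets that have both an _UP and a _DN member."""
    available = set(names)
    pairs = {}
    for name in sorted(available):
        if name.endswith("_UP"):
            base = name[: -len("_UP")]
            partner = base + "_DN"
            if partner in available:
                pairs[base] = (name, partner)
    return pairs


# Hypothetical gene set names, for illustration only.
example_names = [
    "EXAMPLE_DRUG_RESPONSE_UP",
    "EXAMPLE_DRUG_RESPONSE_DN",
    "EXAMPLE_KNOCKDOWN_UP",  # no matching _DN set
    "EXAMPLE_UNPAIRED_SIGNATURE",
]

print(pair_up_dn(example_names))
```

Sets without a partner (not every CGP set comes in a pair) are simply left out of the result.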

Other CGP gene sets include:
  • MatrisomeDB: http://matrisomeproject.mit.edu, Naba A, Clauser KR, Hoersch S, Liu H, Carr SA, Hynes RO. The matrisome: in silico definition and in vivo characterization by proteomics of normal and tumor extracellular matrices. Mol Cell Proteomics. 2012 Apr;11(4):M111.014647, and the 2023 release of https://matrisomedb.org/ Shao X, Gomez CD, Kapoor N, Considine JM, Gao Y, Naba A. MatrisomeDB: 2023 updates of the ECM protein knowledge database. Nucleic Acids Research, 2022, gkac1009. doi.org/10.1093/nar/gkac1009.
  • Gene sets contributed by the L2L database of published microarray gene expression data at University of Washington. See Newman JC, Weiner AM. L2L: a simple tool for discovering the hidden significance in microarray expression data. Genome Biol. 2005;6(9):R81. See also http://depts.washington.edu/l2l.
  • Gene sets curated by Dr. Chi Dang from the MYC Target Gene Database at Johns Hopkins University School of Medicine. See Zeller KI, Jegga AG, Aronow BJ, O'Donnell KA, Dang CV. An integrated database of genes responsive to the Myc oncogenic transcription factor: identification of direct genomic targets. Genome Biol. 2003;4(10):R69.
  • A number of individuals have contributed gene sets to this collection. The gene set annotation includes a "contributor" field that acknowledges the contributor by name/affiliation.


Note that sets deposited in the C2:CGP subcollection reflect a diverse array of species origins (largely human and mouse, but also rat, macaque, etc.); all sets derived from non-human species are mapped to human orthologs. Mouse-derived sets that were deposited in MSigDB using mouse gene identifiers (i.e., not converted to orthologs prior to deposition) are also provided in the mouse namespace in the M2:CGP subcollection of the Mouse MSigDB collections.

> C2 subcollection CP: Canonical pathways

The pathway gene sets are curated from the following online databases:

C3 collection: regulatory target gene sets

Gene sets representing potential targets of regulation by transcription factors or microRNAs. The sets consist of genes grouped by their shared regulatory element. The motifs represent known or likely cis-regulatory elements in promoters and 3'-UTRs. These gene sets make it possible to link changes in an expression profiling experiment to a putative cis-regulatory element. The C3 collection is divided into two subcollections: microRNA targets (MIR) and transcription factor targets (TFT).

> C3 subcollection MIR: microRNA targets

> C3 subcollection TFT: Transcription factor targets

C4 collection: computational gene sets

Computational gene sets defined by mining large collections of cancer-oriented expression data. This collection is divided into three subcollections: Curated Cancer Cell Atlas (3CA), Cancer gene neighborhoods (CGN), and Cancer modules (CM).

> C4 subcollection 3CA: Curated Cancer Cell Atlas

Gene sets defined by Gavish et al. Hallmarks of Transcriptional Intratumour Heterogeneity across a Thousand Tumours. Nature 1-9 (2023). doi:10.1038/s41586-023-06130-4. Briefly, the authors curated, annotated and integrated the data from 77 different studies to reveal the patterns of transcriptional intra-tumoral heterogeneity across 1,163 tumour samples covering 24 tumour types. They identified 41 consensus meta-programs among the malignant cells (gene sets named with prefix GAVISH_3CA_MALIGNANT_METAPROGRAM_), each consisting of genes that were coordinately upregulated in subpopulations of cells within many tumours. The meta-programs cover diverse cellular processes including both generic (for example, cell cycle and stress) and lineage-specific patterns. The authors further extended the meta-program analysis to six common non-malignant cell types (gene sets named with prefix GAVISH_3CA_METAPROGRAM_). See also https://www.weizmann.ac.il/sites/3CA/.

> C4 subcollection CGN: Cancer gene neighborhoods

In our GSEA paper, Subramanian, Tamayo et al. 2005, PNAS 102, 15545-15550, we mined four expression compendia for correlated gene sets, starting with a list of 380 cancer-associated genes curated from internal resources and Brentani, Caballero et al. Human Cancer Genome Project/Cancer Genome Anatomy Project Annotation Consortium; Human Cancer Genome Project Sequencing Consortium. The generation and utilization of a cancer-oriented representation of the human transcriptome by using expressed sequence tags. Proc Natl Acad Sci U S A. 2003 Nov 11;100(23):13418-23. Using the profile of a given gene as a template, we ordered every other gene in the data set by its Pearson correlation coefficient with that template. We applied a cutoff of R ≥ 0.85 to extract correlated genes. The neighborhoods are calculated independently in each compendium. In this way, a given oncogene may have up to four "types" of neighborhoods according to the correlation present in each compendium. Neighborhoods with <25 genes at this threshold were omitted, yielding the final 427 sets. The four compendia are:
  • GNF2: Human tissue compendium (Novartis). Gene expression profiles from the Novartis normal tissue compendium, as published in Su AI, Wiltshire T, Batalov S, Lapp H, Ching KA, Block D, Zhang J, Soden R, Hayakawa M, Kreiman G, Cooke MP, Walker JR, Hogenesch JB. A gene atlas of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci U S A. 2004 Apr 20;101(16):6062-7.
  • CAR: Novartis carcinoma compendium (Novartis). Gene expression profiles from the Novartis carcinoma tissue compendium, as published in Su AI, Welsh JB, Sapinoso LM, Kern SG, Dimitrov P, Lapp H, Schultz PG, Powell SM, Moskaluk CA, Frierson HF Jr, Hampton GM. Molecular classification of human carcinomas by use of gene expression signatures. Cancer Res. 2001 Oct 15;61(20):7388-93.
  • GCM: Global Cancer Map (Broad Institute). Gene expression profiles from the global cancer map, as published in Ramaswamy S, Tamayo P, Rifkin R, Mukherjee S, Yeang CH, Angelo M, Ladd C, Reich M, Latulippe E, Mesirov JP, Poggio T, Gerald W, Loda M, Lander ES, Golub TR. Multiclass cancer diagnosis using tumor gene expression signatures. Proc Natl Acad Sci U S A. 2001 Dec 18;98(26):15149-54.
  • MORF: An unpublished compendium of gene expression data sets, including many of Broad Institute's Cancer Program in-house Affymetrix HG-U95 cancer samples (1,693 in all) from a variety of cancer projects representing many different tissue types, mainly primary tumors, such as prostate, breast, lung, lymphoma, leukemia, etc.
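The neighborhood construction described above (rank every other gene by Pearson correlation with a template gene, keep those with R ≥ 0.85, and drop neighborhoods with fewer than 25 genes) can be sketched with NumPy. This is a minimal illustration under stated assumptions, not the original analysis code: the function name, the toy matrix, and the decision to exclude the template gene from its own neighborhood are all assumptions.

```python
import numpy as np


def gene_neighborhood(expr, genes, seed, r_cutoff=0.85, min_size=25):
    """Return genes correlated with `seed` in one compendium, or None if too few.

    expr  : (n_genes, n_samples) expression matrix for a single compendium
    genes : list of gene symbols, one per row of expr
    """
    idx = genes.index(seed)
    # Center each gene profile so the normalized dot product equals Pearson's r.
    centered = expr - expr.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(centered, axis=1)
    with np.errstate(divide="ignore", invalid="ignore"):
        r = centered @ centered[idx] / (norms * norms[idx])
    # Constant profiles have undefined correlation; treat them as uncorrelated.
    r = np.nan_to_num(r, nan=0.0)
    order = np.argsort(-r)  # most correlated first
    # Assumption: the template gene itself is not counted as a neighbor.
    neighbors = [genes[i] for i in order if i != idx and r[i] >= r_cutoff]
    return neighbors if len(neighbors) >= min_size else None


# Toy compendium: three genes track the seed exactly, one is anti-correlated.
base = np.arange(10.0)
toy_genes = ["SEED", "G1", "G2", "G3", "NEG"]
toy_expr = np.vstack([base, 2 * base + 1, 0.5 * base, base + 3, -base])

# With the default minimum size of 25 this toy neighborhood would be omitted.
print(gene_neighborhood(toy_expr, toy_genes, "SEED", min_size=3))
print(gene_neighborhood(toy_expr, toy_genes, "SEED"))
```

Running the function once per compendium for the same template gene would give the up to four neighborhoods per gene mentioned above.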

> C4 subcollection CM: Cancer modules

Gene sets defined by Segal E, Friedman N, Koller D, Regev A. A module map showing conditional activity of expression modules in cancer. Nat Genet. 2004 Oct;36(10):1090-8. Briefly, the authors compiled gene sets ('modules') from a variety of resources such as KEGG, GO, and others. By mining a large compendium of cancer-related microarray data, they identified 456 such modules as significantly changed in a variety of cancer conditions. See also http://robotics.stanford.edu/~erans/cancer.

C5 collection: ontology gene sets

Gene sets in this collection are derived from ontology resources. The collection is divided into four subcollections derived from ontology annotations, which were curated from databases maintained by their respective authorities.

Ontology terms for very broad categories that would produce extremely large gene sets (greater than 2000 members) and ontology terms that produced gene sets with fewer than 5 members have been omitted. Additionally, each subcollection goes through a redundancy filtering procedure to ensure there are no identical or highly similar sets. (See MSigDB release notes for the current versions, and more information on specific procedures.)
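The size limits stated above amount to a simple filter. The sketch below covers only that size rule (keep sets with 5 to 2000 members); the redundancy filtering is documented in the release notes and is not reproduced here. The function name and the placeholder data are hypothetical.

```python
def filter_by_size(gene_sets, min_size=5, max_size=2000):
    """Keep gene sets whose member count lies within [min_size, max_size]."""
    return {
        name: genes
        for name, genes in gene_sets.items()
        if min_size <= len(set(genes)) <= max_size
    }


# Hypothetical ontology-derived sets with placeholder gene names.
candidate_sets = {
    "TOO_SMALL_TERM": [f"GENE{i}" for i in range(4)],
    "SMALLEST_KEPT_TERM": [f"GENE{i}" for i in range(5)],
    "LARGEST_KEPT_TERM": [f"GENE{i}" for i in range(2000)],
    "TOO_BROAD_TERM": [f"GENE{i}" for i in range(2001)],
}

print(sorted(filter_by_size(candidate_sets)))
```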

Note to GSEA users: Gene set enrichment analysis identifies gene sets consisting of co-regulated genes; GO gene sets are based on ontologies and do not necessarily comprise co-regulated genes.

> C5 subcollection GO: Gene Ontology

The C5:GO subcollection is divided into three components (BP, CC, and MF) derived from the Gene Ontology (GO). Each component represents GO terms belonging to one of the three root GO ontologies: biological process (BP), cellular component (CC), or molecular function (MF), respectively.

GO is a collaborative effort to develop and use ontologies to support biologically meaningful annotation of genes and their products. A GO annotation consists of a GO term associated with a specific reference that describes the work or analysis upon which the association between a specific GO term and gene product is based. Each annotation also includes an evidence code to indicate how the annotation to a particular term is supported (http://geneontology.org/page/guide-go-evidence-codes). Gene sets in this subcollection are prefixed with "GOBP" (Biological Process), "GOMF" (Molecular Function), or "GOCC" (Cellular Component) to indicate their source ontology.

> C5 subcollection HPO: Human Phenotype Ontology

The Human Phenotype Ontology (HPO) provides a standardized vocabulary of phenotypic abnormalities encountered in human disease (https://hpo.jax.org/). An HPO annotation consists of a phenotypic abnormality and its association with a set of genes known to be involved in the development of that abnormality, developed using the medical literature, Orphanet, DECIPHER, and OMIM. Gene sets in this subcollection are prefixed with "HP" to indicate their source ontology.

Köhler S, Gargano M, Matentzoglu N, Carmody LC, Lewis-Smith D, Vasilevsky NA, Danis D, Balagura G, Baynam G, Brower AM, Callahan TJ, Chute CG, Est JL, Galer PD, Ganesan S, Griese M, Haimel M, Pazmandi J, Hanauer M, Harris NL, Hartnett MJ, Hastreiter M, Hauck F, He Y, Jeske T, Kearney H, Kindle G, Klein C, Knoflach K, Krause R, Lagorce D, McMurry JA, Miller JA, Munoz-Torres MC, Peters RL, Rapp CK, Rath AM, Rind SA, Rosenberg AZ, Segal MM, Seidel MG, Smedley D, Talmy T, Thomas Y, Wiafe SA, Xian J, Yüksel Z, Helbig I, Mungall CJ, Haendel MA, Robinson PN. The Human Phenotype Ontology in 2021. Nucleic Acids Res. 2021 Jan 8;49(D1):D1207-D1217.
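The name prefixes described for the GO and HPO subcollections make it straightforward to recover the source ontology of a C5 gene set from its name. The sketch below assumes that the prefix is separated from the rest of the name by an underscore; that separator, the helper name, and the example set names are assumptions for illustration.

```python
PREFIX_TO_ONTOLOGY = {
    "GOBP": "GO biological process",
    "GOCC": "GO cellular component",
    "GOMF": "GO molecular function",
    "HP": "Human Phenotype Ontology",
}


def source_ontology(set_name):
    """Map a gene set name to its source ontology via its prefix, or None if unknown."""
    # Assumption: the prefix is everything before the first underscore.
    prefix = set_name.split("_", 1)[0]
    return PREFIX_TO_ONTOLOGY.get(prefix)


# Hypothetical set names, for illustration only.
for name in ["GOBP_EXAMPLE_PROCESS", "GOMF_EXAMPLE_ACTIVITY", "HP_EXAMPLE_PHENOTYPE"]:
    print(name, "->", source_ontology(name))
```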

C6 collection: oncogenic signature gene sets

Gene sets in this collection represent signatures of cellular pathways that are often dysregulated in cancer. The majority of signatures were generated directly from microarray data from NCBI GEO or from internal unpublished profiling experiments that involved perturbation of known cancer genes. In addition, a small number of oncogenic signatures were curated from scientific publications.

C7 collection: immunologic signature gene sets

Gene sets in this collection represent cell states and perturbations within the immune system. It consists of two subcollections:
  • ImmuneSigDB, which was previously the complete C7 and represents a broad curation effort of signatures of immune perturbations and states
  • VAX, a targeted subcollection that focuses specifically on curation of published studies of human responses to various vaccines.

> C7 subcollection ImmuneSigDB

ImmuneSigDB is composed of gene sets that represent a broad curation effort of cell types, states, and perturbations within the immune system. The signatures were generated by manual curation of published studies in human and mouse immunology.

We first captured relevant microarray datasets published in the immunology literature that have raw data deposited to Gene Expression Omnibus (GEO). For each published study, the relevant comparisons were identified (e.g. WT vs. KO; pre- vs. post-treatment etc.) and brief, biologically meaningful descriptions were created. All data was processed and normalized the same way to identify the gene sets, which correspond to the top or bottom genes (FDR < 0.02 or maximum of 200 genes) ranked by mutual information for each assigned comparison.
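One possible reading of the selection rule above (genes passing FDR < 0.02, capped at 200 genes per set) is sketched below. This is an interpretation, not the published pipeline: the input is assumed to be already ranked for one comparison, the mutual-information ranking itself is not computed, and the function and gene names are hypothetical.

```python
def select_signature(ranked, fdr_cutoff=0.02, max_genes=200):
    """Pick signature genes from a ranked list.

    ranked : list of (gene, fdr) tuples ordered from most to least extreme
             for one end of the comparison (top or bottom).
    Keeps genes with FDR below the cutoff, up to max_genes.
    """
    passing = [gene for gene, fdr in ranked if fdr < fdr_cutoff]
    return passing[:max_genes]


# Hypothetical ranked list: 250 significant genes followed by one that is not.
ranked_top = [(f"GENE{i}", 0.001) for i in range(250)] + [("WEAK_GENE", 0.5)]
top_set = select_signature(ranked_top)
print(len(top_set))
```

The "bottom" set for the same comparison would be obtained the same way from the list ranked in the opposite direction.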

The immunologic signatures collection was generated as part of our collaboration with the Haining Lab at Dana-Farber Cancer Institute and the Human Immunology Project Consortium (HIPC). To cite your use of the collection, and for further information, please refer to Godec J, Tan Y, Liberzon A, Tamayo P, Bhattacharya S, Butte A, Mesirov JP, Haining WN, Compendium of Immune Signatures Identifies Conserved and Species-Specific Biology in Response to Inflammation, 2016, Immunity 44(1), 194-206.

> C7 subcollection VAX: vaccine response gene sets

The immune response signatures in this collection result from the curation of gene expression results from 62 publications covering on the order of 50 vaccines by the Human Immunology Project Consortium (HIPC). The initial list of publications to curate was selected from a PubMed search of papers matching the terms "Vaccine [AND] Signatures" or "Vaccine [AND] Gene expression". For inclusion, each gene list was required to show statistically significant differential gene expression. Detailed metadata was collected, including vaccine, cohort, comparison, age, and type of expression change (e.g., up, down, positively correlated). The signatures were subject to extensive quality control and proofreading. Some manual gene symbol updating and screening occurred following curation. Response signatures that were identical were merged into a single gene set before performing MSigDB's universal symbol remapping procedure.
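The merging of identical response signatures can be illustrated by grouping signatures on their gene membership. This is a minimal sketch assuming "identical" means the same set of gene symbols regardless of order; the signature names, gene names, and helper function are hypothetical.

```python
def merge_identical(signatures):
    """Group signature names that share exactly the same gene membership.

    signatures : dict mapping signature name -> iterable of gene symbols
    Returns a dict mapping frozenset(genes) -> list of signature names.
    """
    merged = {}
    for name, genes in signatures.items():
        merged.setdefault(frozenset(genes), []).append(name)
    return merged


# Hypothetical curated signatures with placeholder gene symbols.
curated = {
    "STUDY1_VACCINE_X_DAY7_UP": ["GENE_A", "GENE_B", "GENE_C"],
    "STUDY2_VACCINE_X_DAY7_UP": ["GENE_C", "GENE_A", "GENE_B"],  # same genes, different order
    "STUDY3_VACCINE_Y_DAY3_DN": ["GENE_D", "GENE_E"],
}

for genes, names in merge_identical(curated).items():
    print(sorted(genes), "<-", names)
```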

C8 collection: cell type signature gene sets

Gene sets that contain cluster marker genes for cell types identified in single-cell sequencing studies of human tissue. These gene sets have been curated from the literature and represent signature genes and cell type identifications as represented in their respective originating publications. The gene sets present in this collection cover a number of cell types from Heart, GI Tract, Pancreas, Kidney, Liver, the Immune system, Retina, Olfactory tissue, and the Brain. These gene sets are intended to facilitate the assignment of cell types in datasets such as those from experiments developing organoid models. Funding for development of these gene sets was provided in part by the Collaborative Computational Tools for the Human Cell Atlas program sponsored by the Chan Zuckerberg Initiative.