Characterizing Systematic Anomalies in eXpression Data (CSAX)


Abstract

Methods for translating gene expression signatures into clinically relevant information have so far relied upon having many samples from patients with similar molecular phenotypes. Here, we address the question of what can be done when it is relatively easy to obtain normal (i.e., healthy) patient samples, but when abnormalities corresponding to disease states may be rare and one-of-a-kind. The associated computational challenge, anomaly detection, is a well-studied machine learning problem. However, due to the dimensionality and variability of expression data, existing methods based on feature space analysis or individual anomalously-expressed genes are insufficient. We present a novel approach, CSAX, that identifies pathways in which the normal expression relationships are disrupted in an individual sample. To evaluate our approach, we have compiled and released a compendium of microarray data sets, reformulated to create a testbed for anomaly detection. We demonstrate the accuracy of CSAX on the data sets in our compendium, compare it to other leading anomaly-detection methods, and show that CSAX aids both in identifying anomalies and in explaining their underlying biology. We note the potential for the use of such methods in identifying subclasses of disease. We also describe an approach to characterizing the difficulty of specific expression anomaly detection tasks and discuss how one can estimate the feasibility of a specific task. Our approach provides an important step towards identification of individual disease patterns in the era of personalized medicine.

Overview of CSAX

CSAX uses GSEA[GSEA1,2] to measure the extent to which gene sets have anomalous expression, as measured by FRaC[FRaC].

FRaC stands for feature regression and classification. It works by learning mathematical relationships among gene expression values in normal samples and flagging possible anomalies when those relationships fail to hold. CSAX uses FRaC to compute an anomaly score for each gene, and uses GSEA to see which gene sets are associated with genes whose expression is particularly surprising. GSEA stands for gene set enrichment analysis. It is typically used to measure consistent up- or down-regulation among related genes, but we use it to measure consistent unexpected expression among related genes.
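The two steps described above can be illustrated with a small sketch. This is not the CSAX, FRaC, or GSEA implementation. It assumes a ridge-regularized linear model per gene, a Gaussian error model estimated from in-sample training residuals, and an unweighted running-sum enrichment statistic. The function names (`gene_surprisal`, `enrichment_score`), the `ridge` parameter, and the toy data are all hypothetical.

```python
import numpy as np

def gene_surprisal(train, sample, ridge=1.0):
    """Per-gene anomaly score (assumed simplification of FRaC): how poorly
    each gene of `sample` is predicted from the other genes by a model
    fit on normal `train` samples (rows = samples, columns = genes)."""
    n_samples, n_genes = train.shape
    scores = np.empty(n_genes)
    for g in range(n_genes):
        others = np.delete(np.arange(n_genes), g)
        X = np.column_stack([train[:, others], np.ones(n_samples)])
        y = train[:, g]
        # Ridge-regularized least squares: (X'X + ridge*I) w = X'y
        w = np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ y)
        # Error model: Gaussian with the spread of the training residuals
        sigma = max(np.std(y - X @ w), 1e-6)
        resid = sample[g] - np.append(sample[others], 1.0) @ w
        # Surprisal = negative log-likelihood of the observed residual
        scores[g] = 0.5 * np.log(2 * np.pi * sigma**2) + resid**2 / (2 * sigma**2)
    return scores

def enrichment_score(scores, gene_set):
    """Unweighted running-sum statistic over genes ranked by surprisal.
    `gene_set` holds column indices; it must be a non-empty proper subset."""
    order = np.argsort(-scores)              # most surprising genes first
    in_set = np.isin(order, list(gene_set))
    n_hit = in_set.sum()
    n_miss = len(scores) - n_hit
    running = np.cumsum(np.where(in_set, 1.0 / n_hit, -1.0 / n_miss))
    return running[np.argmax(np.abs(running))]  # largest deviation from zero

# Toy data: genes 0-3 are co-expressed in normal samples, genes 4-11 are noise.
rng = np.random.default_rng(0)
factor = rng.normal(size=(40, 1))
train = rng.normal(size=(40, 12))
train[:, :4] = factor + 0.1 * rng.normal(size=(40, 4))

# Test sample in which the co-expression among genes 0-3 is broken.
sample = rng.normal(size=12)
sample[:4] = [2.0, -2.0, 2.0, -2.0]

scores = gene_surprisal(train, sample)
es_disrupted = enrichment_score(scores, [0, 1, 2, 3])
es_other = enrichment_score(scores, [8, 9, 10, 11])
print("disrupted set:", es_disrupted, "unrelated set:", es_other)
```

In this toy setting, none of the values in genes 0-3 is extreme on its own. They score highly because they contradict one another under the relationships learned from the normal samples, which is the kind of disruption the gene-set step is meant to surface.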

This approach has the important advantage of identifying the gene sets that may best explain an anomaly, which is one of the primary goals of our research. However, applying this method to test-set microarrays that come from the normal class will also identify gene sets that are statistically enriched, even though these sets are effectively random and depend on how accurately the training set represents the true distribution of the normal class. We use bagging to address this effect: over multiple iterations, we take a random subset of the training set and run FRaC and GSEA on it.
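The bagging loop can be sketched as follows. The text above does not say how the per-iteration results are combined, so this sketch makes several assumptions: the subset is sampled without replacement, the per-iteration scores are combined by their median, and the iteration count and subset fraction are arbitrary. The scoring function `toy_set_scores` is a hypothetical stand-in for running FRaC and GSEA, and `GENE_SETS` is invented for illustration.

```python
import numpy as np

# Hypothetical gene sets, given as column indices into the expression matrix.
GENE_SETS = {"set_a": [0, 1, 2], "set_b": [3, 4, 5]}

def toy_set_scores(train_subset, sample):
    """Stand-in for 'run FRaC, then GSEA': returns one score per gene set.
    Here: mean absolute z-score of the set's genes (an assumption)."""
    mu = train_subset.mean(axis=0)
    sd = train_subset.std(axis=0) + 1e-9
    z = np.abs((sample - mu) / sd)
    return np.array([z[idx].mean() for idx in GENE_SETS.values()])

def bagged_scores(train, sample, score_fn, n_iter=25, frac=0.8, seed=0):
    """Score `sample` against many random subsets of the normal training set
    and combine the per-iteration gene-set scores by their median."""
    rng = np.random.default_rng(seed)
    n = train.shape[0]
    k = max(2, int(round(frac * n)))
    runs = []
    for _ in range(n_iter):
        idx = rng.choice(n, size=k, replace=False)  # random training subset
        runs.append(score_fn(train[idx], sample))
    runs = np.asarray(runs)                          # shape: (n_iter, n_sets)
    return np.median(runs, axis=0), runs

# Toy data: 40 normal samples, 6 genes; the test sample is shifted in set_a only.
rng = np.random.default_rng(1)
train = rng.normal(size=(40, 6))
sample = np.zeros(6)
sample[:3] = 4.0

combined, runs = bagged_scores(train, sample, toy_set_scores)
for name, score in zip(GENE_SETS, combined):
    print(name, round(float(score), 2))
```

The intent is that a gene set which scores highly only because of the particular training samples drawn will vary across subsets, while a set that reflects a real disruption in the test sample should score highly in most iterations.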

For an excellent detailed description of the algorithms, see Appendix A in our manuscript.

Compendium of Microarray Anomaly Detection Data Sets

Supplementary Material

Two Additional Data Sets from our JCB paper[JCB] results on DFLAT gene sets[DFLAT].

Source Code

References

[CSAX]
Keith Noto, Carla Brodley, Saeed Majidi, Diana W. Bianchi, and Donna K. Slonim.
CSAX: Characterizing Systematic Anomalies in eXpression Data.
RECOMB, 2014.
[JCB]
Keith Noto, Saeed Majidi, Andrea G. Edlow, Heather C. Wick, Diana W. Bianchi, and Donna K. Slonim.
CSAX: Characterizing Systematic Anomalies in eXpression Data.
To appear, Journal of Computational Biology, 2014.
[DFLAT]
Wick, H. C., Drabkin, H., Ngu, H., Sackman, M., Fournier, C., Haggett, J., Blake, J. A., Bianchi, D. W., and Slonim, D. K.
DFLAT: functional annotation for human development.
BMC Bioinformatics 15, 45 (2014).
[FRaC]
K. Noto, C. E. Brodley, and D. Slonim.
FRaC: A Feature-Modeling Approach for Semi-Supervised and Unsupervised Anomaly Detection.
Data Mining and Knowledge Discovery, 25(1), pp. 109-133, 2011.
[GSEA1]
Aravind Subramanian, Pablo Tamayo, Vamsi K. Mootha, Sayan Mukherjee, Benjamin L. Ebert, Michael A. Gillette, Amanda Paulovich, Scott L. Pomeroy, Todd R. Golub, Eric S. Lander, and Jill P. Mesirov.
Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles.
PNAS, vol. 102 no. 43, 2005.
[GSEA2]
Vamsi K Mootha, Cecilia M Lindgren, Karl-Fredrik Eriksson, Aravind Subramanian, Smita Sihag, Joseph Lehar, Pere Puigserver, Emma Carlsson, Martin Ridderstråle, Esa Laurila, Nicholas Houstis, Mark J Daly, Nick Patterson, Jill P Mesirov, Todd R Golub, Pablo Tamayo, Bruce Spiegelman, Eric S Lander, Joel N Hirschhorn, David Altshuler, and Leif C Groop.
PGC-1α-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes.
Nature Genetics 34, 267-273, 2003.
[LOF]
M.M. Breunig, H.P. Kriegel, R.T. Ng and J. Sander.
LOF: identifying density-based local outliers.
ACM SIGMOD Record 29(2). 2000.
[LIBSVM]
C-C Chang and C-J Lin.
LIBSVM: A library for support vector machines. 2001.