Characterizing Systematic Anomalies in eXpression Data (CSAX)
Abstract
Methods for translating gene expression signatures into clinically relevant
information have so far relied upon having many samples from patients with
similar molecular phenotypes. Here, we address the question of what can be
done when it is relatively easy to obtain normal (i.e., healthy) patient
samples, but when abnormalities corresponding to disease states may be rare and
one-of-a-kind. The associated computational challenge, anomaly detection, is a
well-studied machine learning problem. However, due to the dimensionality and
variability of expression data, existing methods based on feature space
analysis or individual anomalously-expressed genes are insufficient.
We present a novel approach, CSAX, that identifies pathways in which the normal
expression relationships are disrupted in an individual sample. To evaluate
our approach, we have compiled and released a compendium of microarray data
sets, reformulated to create a testbed for anomaly detection. We demonstrate
the accuracy of CSAX on the data sets in our compendium, compare it to other
leading anomaly-detection methods, and show that CSAX aids both in identifying
anomalies and in explaining their underlying biology. We note the potential
for the use of such methods in identifying subclasses of disease. We also
describe an approach to characterizing the difficulty of specific expression
anomaly detection tasks and discuss how one can estimate the feasibility of a
specific task. Our approach provides an important step towards identification
of individual disease patterns in the era of personalized medicine.
Overview of CSAX
CSAX uses GSEA [GSEA1,2] to measure the extent to which gene sets have anomalous expression, as measured by FRaC [FRaC].
FRaC stands for feature regression and classification. It works by learning mathematical relationships among normal gene expression levels
and identifying possible anomalies when those relationships fail to hold.
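The idea of scoring each gene by how badly a learned relationship fails can be sketched as follows. This is a simplified illustration, not the FRaC implementation: the function name `per_gene_surprisal`, the use of ridge regression, and the Gaussian error model fitted to training residuals are all assumptions made for this example.

```python
import numpy as np

def per_gene_surprisal(train, test, ridge=1.0):
    """Return an (n_test, n_genes) array of surprisal scores.

    train: (n_train, n_genes) normal samples; test: (n_test, n_genes).
    Each gene is predicted from all other genes with ridge regression
    (an illustrative model choice, not necessarily the one FRaC uses).
    """
    train = np.asarray(train, dtype=float)
    test = np.asarray(test, dtype=float)
    n_genes = train.shape[1]
    scores = np.zeros((test.shape[0], n_genes))
    for g in range(n_genes):
        others = np.delete(np.arange(n_genes), g)
        # Design matrices with an intercept column
        # (the intercept is penalized too, for brevity).
        X = np.column_stack([np.ones(len(train)), train[:, others]])
        Xt = np.column_stack([np.ones(len(test)), test[:, others]])
        # Closed-form ridge solution: (X'X + ridge*I)^-1 X'y
        A = X.T @ X + ridge * np.eye(X.shape[1])
        w = np.linalg.solve(A, X.T @ train[:, g])
        # Gaussian error model estimated from training residuals.
        sigma = max(np.std(train[:, g] - X @ w), 1e-6)
        z = (test[:, g] - Xt @ w) / sigma
        # Surprisal: negative log density of the observed value.
        scores[:, g] = 0.5 * z**2 + np.log(sigma * np.sqrt(2 * np.pi))
    return scores
```

A gene whose observed value is far from what the other genes predict receives a high score, even if its expression level is unremarkable on its own.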
CSAX uses FRaC to compute an anomaly score for each gene, and uses GSEA to see which gene sets are associated with genes whose expression is particularly surprising.
GSEA stands for gene set enrichment analysis and is used to measure consistent up- or down-regulation among related genes,
but we use it to measure consistent unexpected expression among related genes.
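To make the enrichment step concrete, here is a minimal sketch of a GSEA-style running-sum statistic applied to per-gene anomaly scores instead of differential expression. The function name `enrichment_score` and the representation of a gene set as a set of column indices are assumptions for this example, and significance estimation (for instance by permutation) is omitted.

```python
import numpy as np

def enrichment_score(scores, gene_set, p=1.0):
    """Weighted Kolmogorov-Smirnov-like running sum over a ranked gene list.

    scores: one anomaly score per gene; gene_set: indices of member genes.
    Returns the running-sum value with the largest absolute deviation from 0.
    """
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores)              # most surprising genes first
    in_set = np.isin(order, list(gene_set))  # membership along the ranking
    n_hit = int(in_set.sum())
    n_miss = len(scores) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must be a proper, non-empty subset")
    # Hits step up in proportion to |score|^p; misses step down uniformly.
    weights = np.abs(scores[order]) ** p * in_set
    if weights.sum() == 0:
        weights = in_set.astype(float)
    hit_cdf = np.cumsum(weights) / weights.sum()
    miss_cdf = np.cumsum(~in_set) / n_miss
    running = hit_cdf - miss_cdf
    return running[np.argmax(np.abs(running))]
```

A set whose members cluster at the top of the anomaly ranking scores near 1, while a set whose members cluster at the bottom scores near -1.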
This approach has the important advantage of identifying the gene
sets that may best explain an anomaly, which is one of the primary goals of our research.
However, applying this method to
test set microarrays that come from the normal class
will also identify gene sets that are statistically enriched,
even though these sets are effectively random, and depend on how accurately the
training set represents the true distribution of the normal class.
We use bagging to address this effect: over multiple iterations, we take a random subset of the training set and run FRaC and GSEA on it.
For a detailed description of the algorithms, see Appendix A in our manuscript.
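The bagging loop described above might be organized as in the sketch below. The text does not say how results from the iterations are combined, so averaging the per-iteration scores is an assumption here, as are the names `bagged_scores` and `score_fn` and the default subset fraction. `score_fn` stands in for the FRaC-plus-GSEA step and is supplied by the caller.

```python
import numpy as np

def bagged_scores(train, test_sample, score_fn, n_iter=10, frac=0.5, seed=0):
    """Average score_fn(train_subset, test_sample) over random training subsets.

    score_fn is a hypothetical callable that returns one score per gene set
    (or per gene) as a 1-D array. Averaging is an assumed aggregation rule.
    """
    rng = np.random.default_rng(seed)
    train = np.asarray(train, dtype=float)
    n_sub = max(1, int(round(frac * len(train))))
    runs = []
    for _ in range(n_iter):
        # Random subset of the normal training samples, without replacement.
        idx = rng.choice(len(train), size=n_sub, replace=False)
        runs.append(score_fn(train[idx], test_sample))
    return np.mean(runs, axis=0)
```

Scores that are high only because of one unrepresentative training subset tend to be damped by the averaging, while scores that reflect a real disruption stay high across iterations.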
Compendium of Microarray Anomaly Detection Data Sets
- Download the compendium (.tar.gz).
- Each data set in the compendium consists of:
  - An expression matrix called matrix (click for example)
  - A labels file identifying which samples are anomalous called metadata (click for example)
  - A file with references and notes called README (click for example)
- Click here for details about each data set.
Supplementary Material
- Supplementary file #1: Top gene sets identified for each compendium test set (Excel .xls)
- Supplementary file #2: RAAD scatter plots for each data set in the compendium (PDF)
- Supplementary file #3: Average AUC for all anomaly detection experiments on our compendium (HTML, tab-delimited)
- Supplementary file #4: CSAX AUC scores for various values of γ (tab-delimited, Excel)
Two additional data sets from our JCB paper [JCB], with results on DFLAT gene sets [DFLAT]:
- preterm: preterm birth complications (bronchopulmonary dysplasia, retinopathy of prematurity, periventricular leukomalacia)
- af2014: second trimester amniotic fluid supernatant samples
Source Code
References
[CSAX] Keith Noto, Carla Brodley, Saeed Majidi, Diana W. Bianchi, and Donna K. Slonim.
CSAX: Characterizing Systematic Anomalies in eXpression Data.
RECOMB, 2014.
[JCB] Keith Noto, Saeed Majidi, Andrea G. Edlow, Heather C. Wick, Diana W. Bianchi, and Donna K. Slonim.
CSAX: Characterizing Systematic Anomalies in eXpression Data.
To appear, Journal of Computational Biology, 2014.
[DFLAT] Heather C. Wick, H. Drabkin, H. Ngu, M. Sackman, C. Fournier, J. Haggett, J. A. Blake, Diana W. Bianchi, and Donna K. Slonim.
DFLAT: functional annotation for human development.
BMC Bioinformatics, 15, 45, 2014.
[FRaC] Keith Noto, Carla E. Brodley, and Donna K. Slonim.
FRaC: A Feature-Modeling Approach for Semi-Supervised and Unsupervised Anomaly Detection.
Data Mining and Knowledge Discovery, 25(1), pp. 109-133, 2011.
[GSEA1] Aravind Subramanian, Pablo Tamayo, Vamsi K. Mootha, Sayan Mukherjee, Benjamin L. Ebert, Michael A. Gillette, Amanda Paulovich, Scott L. Pomeroy, Todd R. Golub, Eric S. Lander, and Jill P. Mesirov.
Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles.
PNAS, 102(43), 2005.
[GSEA2] Vamsi K. Mootha, Cecilia M. Lindgren, Karl-Fredrik Eriksson, Aravind Subramanian, Smita Sihag, Joseph Lehar, Pere Puigserver, Emma Carlsson, Martin Ridderstråle, Esa Laurila, Nicholas Houstis, Mark J. Daly, Nick Patterson, Jill P. Mesirov, Todd R. Golub, Pablo Tamayo, Bruce Spiegelman, Eric S. Lander, Joel N. Hirschhorn, David Altshuler, and Leif C. Groop.
PGC-1α-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes.
Nat. Genet., 34, pp. 267-273, 2003.
[LOF] M. M. Breunig, H. P. Kriegel, R. T. Ng, and J. Sander.
LOF: identifying density-based local outliers.
ACM SIGMOD Record, 29(2), 2000.
[LIBSVM] C.-C. Chang and C.-J. Lin.
LIBSVM: A library for support vector machines.
2001.