DOWNLOAD R CODE FILE FOR THIS TUTORIAL
First DOWNLOAD SIMEM R CODE zip bundle containing the R code files loaded below. Also note that, while the code below uses relative paths, you should edit these to point to where you’ve stored the code, annotation or data files.
### To install required packages, uncomment and run this
# source("http://www.bioconductor.org/biocLite.R")
# biocLite(c("Biobase", "preprocessCore", "genefilter"))
# install.packages(c("blme", "doParallel", "ggplot2", "locfit", "MASS", "plyr", "reshape"))
suppressPackageStartupMessages(library("Biobase"))
suppressPackageStartupMessages(library("blme"))
suppressPackageStartupMessages(library("doParallel"))
suppressPackageStartupMessages(library("genefilter"))
suppressPackageStartupMessages(library("ggplot2"))
suppressPackageStartupMessages(library("locfit"))
suppressPackageStartupMessages(library("MASS"))
suppressPackageStartupMessages(library("plyr"))
suppressPackageStartupMessages(library("preprocessCore"))
suppressPackageStartupMessages(library("reshape"))
source("../../R/data_format_lib.R")
source("../../R/model_lib.R")
source("../../R/simem_lib.R")
The Cowley et al. 2014 Project Achilles screen data, including over 200 screens from multiple cancer types formatted as an ExpressionSet
, and including measurement weights, can be downloaded HERE.
The Cowley 2014 Achilles study contains all the screens published in the Cheung 2011 Achilles study, but is measured using sequencing rather than microarrays as was the case for Cheung 2011.
load("../../data/shrna/achilles_screens_expanded_with_weights.eset")
fdat = fData(achilles_screens_expanded)
pheno = pData(achilles_screens_expanded)
achilles_screens_expanded
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 50986 features, 799 samples
## element names: exprs, groupMeans, groupVars, varWeights
## protocolData: none
## phenoData
## sampleNames: 22rv1_a 22rv1_b ... zr7530_d (799 total)
## varLabels: cell_line sample_id ... replicate_group (22 total)
## varMetadata: labelDescription
## featureData
## featureNames: AAAAATGGCATCAACCACCAT AAAAGGATAACCCAGGTGTTT ...
## TTTGTGACTACGGGAAGGCTC (50986 total)
## fvarLabels: reagent symbol gene_id
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
Project Achilles screens were performed using a pool of 50000+ hairpins/reagents mapping to ~12000 genes. While for the Cheung 2011 study, we specifically excluded reagents with low abundance in the universal sample and included this information in a reagent information table, we have not performed this filtering yet for the Cowley 2014 data.
Furthermore, the shRNA reagents in the Cowley 2014 Achilles screens are identified directly by their unique nucleotide sequence, which is specified in the reagent
column. This differs from the Cheung 2011 study, which identified reagents using a unique identifier specified in the probeset_id
column. Furthermore, while the sequences for shRNAs included in the Cheung 2011 and Cowley 2014 studies largely overlap, there are still thousands of reagents with sequences specific to each study. We make no attempt here to reconcile these reagent differences.
As noted in other Achilles examples, by default the simem()
function assumes shRNA/CRISPR reagents identified by a trcn_id
column. We’ll specify below how to alter the simem()
function to recognize the reagent
as the column specifying the unique reagent identifier (the nucleotide sequence) for the Cowley 2014 data.
We’ve also included the extensive cell line/screen annotations (including cancer type, subtype, QC, etc…) available for the Cowley 2014 data in the achilles_screens_expanded
ExpressionSet
loaded above. We’ll extract the cancer type information to be used in the analysis.
cancerTypes = unique(pheno[,c("cell_line", "type")])
cancerTypes$tissue = ifelse(cancerTypes$type == "ovarian", "ovarian", "other")
# Remove the types column, since it also exists in the phenoData annotations of the Achilles annotations
cancerTypes = cancerTypes[,-2]
Analogously with the Marcotte 2016 HER2+ analysis example, we specify the covariate="tissue"
and annotationsPerCellLine=cancerTypes
input parameters for the simem()
function.
For the purposes of this example, to reduce computation time, we’ll specify a few known oncogenes whose essentiality increases with expression.
testIds = c(7849, #PAX8
6656 #SOX1
)
By default, the simem()
function assumes that columns identifying gene ids, symbols, reagent ids, cell lines, replicates and time-points are specified in the screen data ExpressionSet
. The default column names assumed by the simem()
function are obtained using getDefaultVariableMap()
:
t(getDefaultVariableMap())
## [,1]
## geneId "gene_id"
## symbol "symbol"
## reagentId "trcn_id"
## cellLineId "cell_line"
## sampleId "sample_id"
## replicateId "replicate"
## timeGroup "timepointStd"
## timeNum "timepointIndex"
## condition "condition"
## screenId "screen_id"
## replicateGroupId "replicate_group"
As noted, the Cowley 2014 Achilles reagents are identified using the fdat$reagent
column. We’ll specify this in the simem()
function by adding a variableMap=vars
input parameter, as follows:
vars = getDefaultVariableMap()
vars$reagentId = "reagent"
# These values currently need to be specified, but are not used in the single time-point Achilles analysis
# The need for these will hopefully be removed in the case of single time-point analyses in a future update
vars$timeNum = "cell_doublings"
vars$timeGroup = "doubling_time_hrs"
If you want to perform a genome-wide analysis, simply omit the geneIds = testIds
parameter. This analysis typically takes several hours but can be greatly reduced using the parallelNodes
parameter on multi-core systems using functionality available through the doParallel
R package.
Since the Cowley 2014 Achilles data is an end-point assay, we must specify the endPoint=TRUE
parameter for the simem()
function. Furthermore, signal-noise measurement weights do not apply to this analysis case. To specify an analysis using precision weights (inverseVarianceWeights = TRUE
), we must ensure that we’ve added these weights to the ExpressionSet
beforehand (DETAILED HERE). These weights are already present in the achilles_screens_expanded
ExpressionSet
linked to/loaded at the beginning of this tutorial.
We’re now ready to predict genes whose essentiality is significantly different in Ovarian screens.
results = simem(achilles_screens_expanded,
geneIds=testIds,
covariate="tissue",
annotationsPerCellLine=cancerTypes,
analyzeReagents=TRUE,
inverseVarianceWeights=TRUE,
endPoint=TRUE,
parallelNodes=1,
variableMap=vars)
For illustration purposes, we’ll rerun the same code using multiple (2) parallel nodes (yes, one gene per node).
results = simem(achilles_screens_expanded,
geneIds=testIds,
covariate="tissue",
annotationsPerCellLine=cancerTypes,
analyzeReagents=TRUE,
inverseVarianceWeights=TRUE,
endPoint=TRUE,
parallelNodes=2,
variableMap=vars)
## automatically exporting the following variables from the local environment:
## analysisType, analyzeReagents, annotationsPerCellLine, annotationsPerId, calculateRelativeDropout, cellLineWeights, covariate, covariateFactorOrder, endPoint, formulas, inverseVarianceWeights, modelEngine, pvalueType, reagentWeights, screens, signalProbWeights, variableMap
## explicitly exporting variables(s): addPrecisionWeights, addSignalProbWeights, applyFDR, assignWeightsToObservations, combineResults, extractData, fitAnova, fitIsogenicCovariateGene, fitIsogenicCovariateRg, fitModel, fitModelAndFormatOutput, fitPanelCovariateGene, fitPanelCovariateRg, fitSimplerRandomEffects, formatOutput, getDefaultVariableMap, getMeanSd, getMeanVar, getMinimalSummary, getModelFormulas, getModelPValues, getRelativeDropout, getRelativeDropoutRates, getReplicateMeanVar, getSignalProb, getVarianceWeights, loadMeanVar, pvaluesNorm, pvaluesT, resetFactorLevels, simem, simemAnalysis, simemChunk, simemIsogenic, standardizeColumnNames, summaryAnova, summaryFitEffectCoefs, summaryFitLogLik, tryCatchWarningsErrors, withinBetweenDF
## numValues: 2, numResults: 0, stopped: TRUE
## got results for task 1
## numValues: 2, numResults: 1, stopped: TRUE
## returning status FALSE
## got results for task 2
## numValues: 2, numResults: 2, stopped: TRUE
## first call to combine function
## evaluating call object to combine results:
## fun(result.1, result.2)
## returning status TRUE
Here are the gene-level summaries (more detailed gene-level results can be obtained using results$gene_detailed
). The difference
parameter can be interpreted as the log2 measurement difference in Ovarian (compared to Other Cancers). A negative value indicates more dropout/essentiality in Ovarian, while a positive value indicates less essentiality in Ovarian.
options(width=120)
results$gene
## gene_id symbol difference_ovarian pvalue_ovarian fdr_ovarian warnings errors
## 1 7849 PAX8 -0.5302633 1.883183e-08 3.766366e-08
## 2 6656 SOX1 -0.5658423 8.210383e-06 8.210383e-06
Since analyzeReagents = TRUE
parameter was specified, per-reagent context-specific essentiality predictions are also available (more detailed gene-level results can be obtained using results$reagent_detailed
)
options(width=120)
reagent = results$reagent
reagent[order(reagent$symbol, reagent$reagent), ]
## reagent gene_id symbol difference_ovarian pvalue_ovarian fdr_ovarian warnings errors
## 1 CCCAGTGTCAGCTCCATTAAT 7849 PAX8 -0.4660553 3.171645e-02 6.343289e-02
## 2 CCGACTAAGCATTGACTCACA 7849 PAX8 -1.2582402 4.502977e-06 2.444116e-05
## 3 CCTTCGCCATAAAGCAGGAAA 7849 PAX8 -0.3110906 8.120610e-02 1.082748e-01
## 4 CTCTTTATCTAGCTCCGCCTT 7849 PAX8 -0.2248674 1.454872e-01 1.662711e-01
## 5 GCAACCATTCAACCTCCCTAT 7849 PAX8 -0.3777883 5.808373e-02 9.293396e-02
## 6 CATCTCCAACTCGCAGGGCTA 6656 SOX1 -0.4767102 1.049589e-02 2.798905e-02
## 7 GCCAACCAGGACCGGGTCAAA 6656 SOX1 -0.1205118 5.870765e-01 5.870765e-01
## 8 GCGGAGTTATATTCTGGGTTT 6656 SOX1 -1.0898745 6.110289e-06 2.444116e-05