Essentiality associated with continuous expression: Project Achilles (Cheung et al. 2011) per-gene continuous expression vs. essentiality

 

DOWNLOAD R CODE FILE FOR THIS TUTORIAL

 

First DOWNLOAD SIMEM R CODE zip bundle containing the R code files loaded below.

  Also note that, while the code below uses relative paths, you should edit these to point to where you’ve stored the code, annotation or data files.

### To install required packages, uncomment and run this
# source("http://www.bioconductor.org/biocLite.R")
# biocLite(c("Biobase", "preprocessCore", "genefilter"))
# install.packages(c("blme", "doParallel", "ggplot2", "locfit", "MASS", "plyr", "reshape"))

suppressPackageStartupMessages(library("Biobase"))
suppressPackageStartupMessages(library("blme"))
suppressPackageStartupMessages(library("doParallel"))
suppressPackageStartupMessages(library("genefilter"))
suppressPackageStartupMessages(library("ggplot2"))
suppressPackageStartupMessages(library("locfit"))
suppressPackageStartupMessages(library("MASS"))
suppressPackageStartupMessages(library("plyr"))
suppressPackageStartupMessages(library("preprocessCore"))
suppressPackageStartupMessages(library("reshape"))


source("../../R/data_format_lib.R")
source("../../R/model_lib.R")
source("../../R/simem_lib.R")

 

The Cheung et al. 2011 Project Achilles screen data formatted as an ExpressionSet, including measurement weights, can be downloaded HERE.

load("../../data/shrna/achilles_screens_with_weights.eset")
fdat = fData(achilles_screens)
pheno = pData(achilles_screens)

achilles_screens
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 54020 features, 389 samples 
##   element names: exprs, groupMeans, varWeights 
## protocolData: none
## phenoData
##   sampleNames: 786o_a 786o_b ... u251mg_d (389 total)
##   varLabels: sample_id cell_line ... replicate_group (7 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: TRCN0000072178cA_st TRCN0000072178cB_st ...
##     TRCN0000116697m_st (54020 total)
##   fvarLabels: probeset_id symbol gene_id description
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:

 

Project Achilles screens were performed using a pool of ~54000 hairpins/reagents mapping to ~12000 genes. The reagent annotation table we’ll load, including the weight column which will be used to exclude some reagents, can be downloaded HERE. In this case, any reagents with a mean intensity measurement below 5 (on the log2 scale) in the universal T0 sample are excluded in the analysis. Furthermore, the shRNA reagents in the Achilles screens are identified using the probeset_id column. We’ll specify later how to alter the simem() function to recognize this fact, since by default simem() assumes reagents identified by a trcn_id column.

hp = read.delim("../../data/annotations/achilles_hairpin_annotations.txt", header=T, as.is=T, check.names=F)
hpWeights = hp[,c("probeset_id", "gene_id", "weight")]

 

The formatted CCLE microarray-based expression data for these genes can be downloaded HERE.

expr = read.delim("../../data/expression/ccle_achilles_expression.txt", header=T, as.is=T, check.names=F)

# Perform the analysis for ids contained in both the expression matrix and the shRNA data.
commonIds = intersect(expr$gene_id, fdat$gene_id)

# Subset the shRNA data to only include the Achilles screens profiled by CCLE expressions
samplesToInclude = which(pheno$cell_line %in% colnames(expr))
achilles = achilles_screens[,samplesToInclude]

 

When we want to specify a different set of covariate values (such as expression) for each gene, we do not specify the covariate and annotationsPerCellLine input parameters for the simem() function. These two parameters are used only when the same covariate values (such as cancer subtype) apply to all genes. Instead, we’ll specify the annotationsPerId = expr input parameter. In order for the expr data frame to be properly formatted, the first column of expr needs to be gene_id, containing the integer Entrez gene ids. The second and subsequent columns need to be screen/cell line names.

For the purposes of this example, to reduce computation time, we’ll specify a few known oncogenes whose essentiality increases with expression.

testIds = c(598, #BCL2L1
            3845, #KRAS
            7849, #PAX8
            6663 #SOX10
)

 

By default, the simem() function assumes that columns identifying gene ids, symbols, reagent ids, cell lines, replicates and time-points are specified in the screen data ExpressionSet. The default column names assumed by the simem() function are obtained using getDefaultVariableMap():

t(getDefaultVariableMap())
##                  [,1]             
## geneId           "gene_id"        
## symbol           "symbol"         
## reagentId        "trcn_id"        
## cellLineId       "cell_line"      
## sampleId         "sample_id"      
## replicateId      "replicate"      
## timeGroup        "timepointStd"   
## timeNum          "timepointIndex" 
## condition        "condition"      
## screenId         "screen_id"      
## replicateGroupId "replicate_group"

 

As noted, the Achilles reagents are identified using the fdat$probeset_id column. We’ll specify this in the simem() function by adding a variableMap=vars input parameter, as follows:

vars = getDefaultVariableMap()
vars$reagentId = "probeset_id"
vars$timeNum = "timeNum"
vars$timeGroup = "timepointIndex"

 

If you want to perform a genome-wide analysis, simply omit the geneIds = commonIds parameter. This analysis typically takes several hours but can be greatly reduced using the parallelNodes parameter on multi-core Linux or OS X systems.

 

Since the Achilles data is an end-point assay, we must specify the endPoint=TRUE parameter for the simem() function. Furthermore, signal-noise measurement weights do not apply to this analysis case. To specify an analysis using precision weights (inverseVarianceWeights = TRUE), we must ensure that we’ve added these weights to the ExpressionSet beforehand (DETAILED HERE).

 

We’re now ready to predict genes whose essentiality is significantly associated with expression.

results = simem(achilles,
               geneIds=testIds,
               reagentWeights=hpWeights,
               annotationsPerId=expr, 
               analyzeReagents=TRUE,
               inverseVarianceWeights=TRUE,
               endPoint=TRUE,
               parallelNodes=1,
               variableMap=vars)

 

For illustration purposes, we’ll rerun the same code using multiple (2) parallel nodes.

results = simem(achilles,
               geneIds=testIds,
               reagentWeights=hpWeights,
               annotationsPerId=expr, 
               analyzeReagents=TRUE,
               inverseVarianceWeights=TRUE,
               endPoint=TRUE,
               parallelNodes=2,
               variableMap=vars)
## automatically exporting the following variables from the local environment:
##   analysisType, analyzeReagents, annotationsPerCellLine, annotationsPerId, calculateRelativeDropout, cellLineWeights, covariate, covariateFactorOrder, endPoint, formulas, inverseVarianceWeights, modelEngine, pvalueType, reagentWeights, screens, signalProbWeights, variableMap 
## explicitly exporting variables(s): addPrecisionWeights, addSignalProbWeights, applyFDR, assignWeightsToObservations, combineResults, extractData, fitAnova, fitIsogenicCovariateGene, fitIsogenicCovariateRg, fitModel, fitModelAndFormatOutput, fitPanelCovariateGene, fitPanelCovariateRg, fitSimplerRandomEffects, formatOutput, getDefaultVariableMap, getMeanSd, getMeanVar, getMinimalSummary, getModelFormulas, getModelPValues, getRelativeDropout, getRelativeDropoutRates, getReplicateMeanVar, getSignalProb, getVarianceWeights, loadMeanVar, pvaluesNorm, pvaluesT, resetFactorLevels, simem, simemAnalysis, simemChunk, simemIsogenic, standardizeColumnNames, summaryAnova, summaryFitEffectCoefs, summaryFitLogLik, tryCatchWarningsErrors, withinBetweenDF
## numValues: 2, numResults: 0, stopped: TRUE
## got results for task 1
## numValues: 2, numResults: 1, stopped: TRUE
## returning status FALSE
## got results for task 2
## numValues: 2, numResults: 2, stopped: TRUE
## first call to combine function
## evaluating call object to combine results:
##   fun(result.1, result.2)
## returning status TRUE

 

Here are the gene-level summaries (more detailed gene-level results can be obtained using results$gene_detailed). The difference parameter can be interpreted as the magnitude of the average change in dropout slope associated with each unit increase in the gene’s log2 microarray intensity value. In other words, on average the dropout is faster (more negative), indicating increased essentiality, in cell lines with higher expression.

options(width=100)

results$gene
##   gene_id symbol  difference       pvalue          fdr warnings errors
## 1     598 BCL2L1 -0.45103977 1.140391e-10 2.280781e-10                
## 2    3845   KRAS -0.22064620 6.397026e-08 8.529368e-08                
## 3    6663  SOX10 -0.46751461 1.098756e-11 4.395024e-11                
## 4    7849   PAX8 -0.09270574 2.590732e-06 2.590732e-06

 

Since analyzeReagents = TRUE parameter was specified, per-reagent context-specific essentiality predictions are also available (more detailed gene-level results can be obtained using results$reagent_detailed)

options(width=100)
reagent = results$reagent
reagent[order(reagent$symbol, reagent$probeset_id), ]
##           probeset_id gene_id symbol   difference       pvalue          fdr warnings errors
## 7  TRCN0000033499m_st     598 BCL2L1 -1.084601359 1.328498e-07 7.970986e-07                
## 8  TRCN0000033500m_st     598 BCL2L1 -0.570923603 2.085288e-05 1.000938e-04                
## 9  TRCN0000033501m_st     598 BCL2L1 -0.225331047 8.695563e-02 1.739113e-01                
## 10 TRCN0000033502m_st     598 BCL2L1 -0.253058519 1.634092e-01 2.385993e-01                
## 11 TRCN0000033503m_st     598 BCL2L1 -0.127568563 1.417070e-01 2.267312e-01                
## 1  TRCN0000010369m_st    3845   KRAS -0.182540262 5.705246e-02 1.244781e-01                
## 2  TRCN0000018337m_st    3845   KRAS -0.101807686 1.234060e-01 2.115532e-01                
## 3  TRCN0000033259m_st    3845   KRAS -0.704177541 9.220785e-09 7.376628e-08                
## 4  TRCN0000033260m_st    3845   KRAS -0.389424485 3.798133e-03 1.139440e-02                
## 5  TRCN0000033261m_st    3845   KRAS -0.069402072 4.609164e-01 5.530996e-01                
## 6  TRCN0000033263m_st    3845   KRAS  0.006208942 9.613397e-01 9.613397e-01                
## 12 TRCN0000040150m_st    3845   KRAS -0.129222825 1.690078e-01 2.385993e-01                
## 13 TRCN0000040151m_st    3845   KRAS  0.048893350 7.293661e-01 7.610777e-01                
## 14 TRCN0000040152m_st    3845   KRAS -0.449105840 1.390005e-02 3.706681e-02                
## 20 TRCN0000021274m_st    7849   PAX8 -0.089940851 4.842104e-02 1.162105e-01                
## 21 TRCN0000021275m_st    7849   PAX8  0.040427702 2.591113e-01 3.454817e-01                
## 22 TRCN0000021276m_st    7849   PAX8 -0.020082596 5.432624e-01 6.159332e-01                
## 23 TRCN0000021277m_st    7849   PAX8 -0.058332023 1.200286e-01 2.115532e-01                
## 24 TRCN0000021278m_st    7849   PAX8 -0.335958596 2.230537e-10 2.676644e-09                
## 15 TRCN0000018984m_st    6663  SOX10 -1.126964461 6.338370e-14 1.521209e-12                
## 16 TRCN0000018985m_st    6663  SOX10 -0.435469574 2.512846e-04 9.216024e-04                
## 17 TRCN0000018986m_st    6663  SOX10 -0.078974328 5.646055e-01 6.159332e-01                
## 18 TRCN0000018987m_st    6663  SOX10 -0.120424704 4.226760e-01 5.339065e-01                
## 19 TRCN0000018988m_st    6663  SOX10 -0.693519719 2.688007e-04 9.216024e-04