DOWNLOAD R CODE FILE FOR THIS TUTORIAL
We previously worked through a basic 2 class differential essentiality analysis using cell line HER2+ status as an example. Here we see how to predict genes whose essentiality is associated (positively or negatively) with a continuous variable, such as RNA-Seq expression of a gene. We’ll use the expression of well-known Breast oncogene CCND1 as a case study.
First DOWNLOAD SIMEM R CODE zip bundle containing the R code files loaded below.
Also note that, while the code below uses relative paths, you should edit these to point to where you’ve stored the code, annotation or data files.
### To install required packages, uncomment and run this
# source("http://www.bioconductor.org/biocLite.R")
# biocLite(c("Biobase", "preprocessCore", "genefilter"))
# install.packages(c("blme", "doParallel", "ggplot2", "locfit", "MASS", "plyr", "reshape"))
suppressPackageStartupMessages(library("Biobase"))
suppressPackageStartupMessages(library("blme"))
suppressPackageStartupMessages(library("doParallel"))
suppressPackageStartupMessages(library("genefilter"))
suppressPackageStartupMessages(library("ggplot2"))
suppressPackageStartupMessages(library("locfit"))
suppressPackageStartupMessages(library("MASS"))
suppressPackageStartupMessages(library("plyr"))
suppressPackageStartupMessages(library("preprocessCore"))
suppressPackageStartupMessages(library("reshape"))
source("../../R/data_format_lib.R")
source("../../R/model_lib.R")
source("../../R/simem_lib.R")
The screen data formatted as an ExpressionSet
, including measurement weights, can be downloaded HERE.
load("../../data/shrna/breast_screens_with_weights.eset")
#breast_screens
The reagent annotation table we’re about to load, including the weight
column which will be used to exclude some reagents, can be downloaded HERE.
hp = read.delim("../../data/annotations/hairpin_annotations.txt", header=T, as.is=T, check.names=F)
hpWeights = hp[,c("trcn_id", "gene_id", "weight")]
The breast RNA-Seq expression file, from which we’ll extract CCND1 log2-FPKM expression values, can be downloaded HERE.
options(width=100)
rnaseq = read.delim("../../data/rnaseq/breast_rnaseq_qn.txt", header=T, as.is=T, check.names=F)
# Get the row index for CCND1 expression
ccnd1Index = which(rnaseq$symbol == "CCND1")
# Remove all identifier columns, and drop the columns dimension, resulting in a cell-line-named vector of log2-FPKM values
ccnd1Expr = unlist(rnaseq[ccnd1Index,-grep("gene_id|symbol|ensembl_id", colnames(rnaseq))])
ccnd1Expr
## 184a1 184b5 600mpe au565 bt20 bt474 bt483 bt549
## 4.845 5.481 7.756 6.255 4.346 5.855 6.853 2.300
## cal120 cal148 cal51 cal851 cama1 du4475 efm19 efm192a
## 6.601 -1.627 6.540 5.338 7.981 2.870 7.333 7.039
## evsat hbl100 hcc1008 hcc1143 hcc1187 hcc1395 hcc1419 hcc1428
## 3.654 1.670 6.476 9.054 5.051 5.409 7.099 7.352
## hcc1500 hcc1569 hcc1599 hcc1806 hcc1937 hcc1954 hcc202 hcc2185
## 9.495 -1.764 7.520 6.337 6.712 8.416 4.122 6.823
## hcc2218 hcc2688 hcc3153 hcc38 hcc70 hcc712 hdqp1 hs578t
## 6.965 4.411 3.626 5.846 5.333 8.130 7.402 6.433
## jimt1 kpl1 ly2 macls2 mb157 mcf10a mcf12a mcf7
## 6.435 7.021 7.290 6.288 7.873 4.964 4.317 8.044
## mdamb134vi mdamb157 mdamb175vii mdamb231 mdamb330 mdamb361 mdamb415 mdamb436
## 10.783 4.955 8.378 6.927 10.540 6.868 8.606 5.258
## mdamb453 mdamb468 mfm223 mx1 ocubm skbr3 skbr5 skbr7
## 9.097 5.966 7.336 4.108 2.271 5.702 6.838 4.382
## sum102 sum1315 sum149 sum159 sum185 sum190 sum225 sum229
## 5.812 4.665 4.266 6.623 5.341 7.542 6.844 3.509
## sum44 sum52 sw527 t47d uacc3199 uacc812 uacc893 zr751
## 6.577 5.705 7.586 6.412 5.646 5.615 7.171 7.696
## zr7530 zr75b
## 7.897 7.703
Once we have the CCND1 log2-FPKM expression values, we want to construct a data frame with 2 columns: cell_line
as the 1st column, and a second column to contain the expression values, which we’ll creatively name ccnd1
. We will then specify the covariate = "ccnd1"
and annotationsPerCellLine = ccnd1Values
parameters in the simem()
function.
ccnd1Values = cbind.data.frame(cell_line=names(ccnd1Expr),
ccnd1=ccnd1Expr,
stringsAsFactors=FALSE)
ccnd1Values[1:5,]
## cell_line ccnd1
## 184a1 184a1 4.845
## 184b5 184b5 5.481
## 600mpe 600mpe 7.756
## au565 au565 6.255
## bt20 bt20 4.346
For the purposes of this example, to reduce computation time, we’ll perform the analysis for CCND1, which is expected, based on the breast cancer literature, to be more essential in cell lines with higher CCND1 expression. We’ll also examine CDK4 and CDK6 essentiality, since these are known to directly form protein-protein complexes with CCND1 and are known to be required for CCND1’s cell cycle functions in different contexts.
genesOfInterest = c(595, #CCND1
1019, #CDK4
1021) #CDK6
If you want to perform a genome-wide analysis, simply omit the geneIds = genesOfInterest
parameter. As noted, this analysis typically takes ~18 hours but can be greatly reduced using the parallelNodes
parameter on systems with multiple processor cores.
If we specify an analysis using both precision and signal-noise measurement weights (inverseVarianceWeights = TRUE
and signalProbWeights = TRUE
parameters, respectively), we must ensure that we’ve added these weights to the ExpressionSet
beforehand (DETAILED HERE).
We’re now ready to predict genes whose essentiality is significantly associated with CCND1 log2-FPKM values.
results = simem(screens = breast_screens,
geneIds = genesOfInterest,
covariate = "ccnd1",
reagentWeights = hpWeights,
annotationsPerCellLine = ccnd1Values,
inverseVarianceWeights = TRUE,
signalProbWeights = TRUE,
analyzeReagents = TRUE,
parallelNodes = 1
)
Here are the gene-level summaries (more detailed gene-level results can be obtained using results$gene_detailed
). As expected, CCND1 and CDK4 show strongly significant increases in essentiality in cell lines with higher CCND1 expression. This is consistent with the knockdown of either member of the CCND1-CDK4 complex disrupting the function of the complex. CDK6 is not found to be significantly associated with CCND1 expression, suggesting that in Breast Cancer contexts, CDK4 rather than CDK6 is the primary complex partner for CCND1.
The difference
parameter can be interpreted as the magnitude of the average change in dropout slope associated with each unit increase in log2-FPKM CCND1 expression. In other words, on average the dropout is faster (more negative), indicating increased essentiality, in cell lines with higher CCND1 expression.
options(width=100)
results$gene
## gene_id symbol difference pvalue fdr warnings errors
## 1 595 CCND1 -0.069142031 2.108795e-13 3.163193e-13
## 2 1019 CDK4 -0.098144306 9.259274e-24 2.777782e-23
## 3 1021 CDK6 -0.006342742 1.509899e-01 1.509899e-01
Since analyzeReagents = TRUE
parameter was specified, per-reagent context-specific essentiality predictions are also available (more detailed gene-level results can be obtained using results$reagent_detailed
)
options(width=100)
reagent = results$reagent
reagent[order(reagent$symbol, reagent$trcn_id), ]
## trcn_id gene_id symbol difference pvalue fdr warnings errors
## 13 TRCN0000010315 595 CCND1 0.0043898353 7.554301e-01 8.183826e-01
## 14 TRCN0000010317 595 CCND1 -0.1245646522 5.426349e-06 2.351418e-05
## 21 TRCN0000040038 595 CCND1 -0.1298592714 8.398346e-07 1.068391e-05
## 22 TRCN0000040039 595 CCND1 -0.1264545296 1.232759e-06 1.068391e-05
## 23 TRCN0000040040 595 CCND1 0.0006743605 9.617795e-01 9.617795e-01
## 24 TRCN0000040041 595 CCND1 0.0372417472 4.209963e-01 6.755768e-01
## 25 TRCN0000040042 595 CCND1 -0.0803004670 2.374300e-05 8.818829e-05
## 1 TRCN0000000362 1019 CDK4 -0.1486592957 1.036791e-03 2.695657e-03
## 2 TRCN0000000363 1019 CDK4 -0.1151053469 4.804589e-06 2.351418e-05
## 3 TRCN0000000364 1019 CDK4 -0.0517594494 1.520418e-04 4.392318e-04
## 4 TRCN0000000365 1019 CDK4 -0.0509033521 1.002917e-02 2.370530e-02
## 9 TRCN0000009876 1019 CDK4 0.0090458961 4.825386e-01 6.970002e-01
## 15 TRCN0000010472 1019 CDK4 -0.1686939532 3.232102e-06 2.100866e-05
## 17 TRCN0000010520 1019 CDK4 -0.1148474472 7.583172e-05 2.464531e-04
## 18 TRCN0000018364 1019 CDK4 -0.2472135162 3.969374e-09 1.032037e-07
## 5 TRCN0000000485 1021 CDK6 0.0040365171 4.384718e-01 6.755768e-01
## 6 TRCN0000000486 1021 CDK6 -0.0071061656 4.417233e-01 6.755768e-01
## 7 TRCN0000000487 1021 CDK6 0.0033456048 5.704377e-01 7.105954e-01
## 8 TRCN0000000488 1021 CDK6 -0.0059697800 6.012731e-01 7.105954e-01
## 10 TRCN0000009877 1021 CDK6 -0.0302379973 4.811596e-02 8.935821e-02
## 11 TRCN0000010074 1021 CDK6 0.0059725757 5.945774e-01 7.105954e-01
## 12 TRCN0000010082 1021 CDK6 -0.0243369962 3.146235e-02 6.816842e-02
## 16 TRCN0000010473 1021 CDK6 0.0145013231 5.720802e-01 7.105954e-01
## 19 TRCN0000039745 1021 CDK6 -0.0181547237 4.497263e-02 8.935821e-02
## 20 TRCN0000039747 1021 CDK6 0.0153839690 6.909228e-01 7.810431e-01
## 26 TRCN0000055435 1021 CDK6 -0.0037305824 8.303432e-01 8.635569e-01
The above analysis tests whether the essentiality of each specified gene is associated with CCND1 expression. In the next section we will see how to test whether the essentiality of a gene is significantly associated with that gene’s own mRNA expression level.