Title: | Replicability Analysis for Multiple Studies of High Dimension |
---|---|
Description: | Estimation of Bayes and local Bayes false discovery rates for replicability analysis (Heller & Yekutieli, 2014 <doi:10.1214/13-AOAS697> ; Heller at al., 2015 <doi: 10.1093/bioinformatics/btu434>). |
Authors: | Ruth Heller [cre, aut], Shachar Kaufman [aut], Shay Yaacoby [aut], David Israeli [aut], Barak Brill [aut], Daniel Yekutieli [aut] |
Maintainer: | Ruth Heller <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.2.3 |
Built: | 2025-03-07 05:34:50 UTC |
Source: | https://github.com/barakbri/repfdr |
This data was created from the zmat
matrix (see SNPlocations
) using ztobins
function. It contain two objects to be input to the main function repfdr
.
The file includes two objects - a matrix and 3d array:
bz
is a matrix of binned 249024 z-scores (in rows) in each of the 3 studies (columns).
pbz
is a 3-dimensional array which contains for each study (first dimension), the probabilities of a z-score to fall in the bin (second dimension), under each hypothesis status (third dimension).
## Not run: download.file('http://www.math.tau.ac.il/~ruheller/repfdr_RData/binned_zmat.RData', destfile = "binned_zmat.RData") load(file = "binned_zmat.RData") bz[1:5,] pbz[,1:5,] ## End(Not run)
## Not run: download.file('http://www.math.tau.ac.il/~ruheller/repfdr_RData/binned_zmat.RData', destfile = "binned_zmat.RData") load(file = "binned_zmat.RData") bz[1:5,] pbz[,1:5,] ## End(Not run)
This data was created from the zmat_sim
matrix using ztobins
function. It contain two objects to be input to the main function repfdr
.
data(binned_zmat_sim)
data(binned_zmat_sim)
The file includes two objects - a matrix and 3d array:
bz_sim
is a matrix of binned 10000 z-scores (in rows) in each of the 3 studies (columns).
pbz_sim
is a 3-dimensional array which contains for each study (first dimension), the probabilities of a z-score to fall in the bin (second dimension), under each hypothesis status (third dimension).
data(binned_zmat_sim) bz_sim[1:5,] pbz_sim[,1:5,]
data(binned_zmat_sim) bz_sim[1:5,] pbz_sim[,1:5,]
Input parameters for the EM algorithm.
em.control(pi.initial = NULL, max.iter = 10000, tol = 1e-12, nr.threads = 0, verbose = TRUE)
em.control(pi.initial = NULL, max.iter = 10000, tol = 1e-12, nr.threads = 0, verbose = TRUE)
pi.initial |
Initial guess for the probabilities of the vectors of associations status. If |
max.iter |
Maximum number of EM iterations. |
tol |
Tolerance (in maximum absolute difference between two EM iterations in estimated probabilities) before declaring convergence and stopping. |
nr.threads |
Number of processing threads to use. If zero (the default), will automatically detect the number of compute cores available and spawn one thread per core. |
verbose |
An indicator of whether to report progress (running iteration number) during computation. |
The function is used inside the control
argument in repfdr
and piem
.
A list with the input values.
## Not run: download.file('http://www.math.tau.ac.il/~ruheller/repfdr_RData/binned_zmat.RData', destfile = "binned_zmat.RData") load(file = "binned_zmat.RData") out <- repfdr(pbz,bz,"replication", control = em.control(pi.initial = c(0.48,rep(0.02,26)), verbose = TRUE, nr.threads = 1)) # iterations are printed; run bit slower (1 thread) ## End(Not run)
## Not run: download.file('http://www.math.tau.ac.il/~ruheller/repfdr_RData/binned_zmat.RData', destfile = "binned_zmat.RData") load(file = "binned_zmat.RData") out <- repfdr(pbz,bz,"replication", control = em.control(pi.initial = c(0.48,rep(0.02,26)), verbose = TRUE, nr.threads = 1)) # iterations are printed; run bit slower (1 thread) ## End(Not run)
The function generates a matrix with all possible vectors of association status (in rows), given the number of studies and number of possible association status states in each study (2 or 3).
hconfigs(n.studies, n.association.status = 3, studies.names = NULL)
hconfigs(n.studies, n.association.status = 3, studies.names = NULL)
n.studies |
Number of studies in the analysis. |
n.association.status |
either 2 for no-association\association or 3 for no-association\negative-association\positive-association. |
studies.names |
Optional study names to display. |
This matrix should be used when selecting the rows indices for the association status vectors that are in the non-null set, specified by the used in non.null.rows
in the function repfdr
.
Matrix with rows indicating all the possible vectors of association status.
(H <- hconfigs(n.studies = 3)) # in replication analysis the non-null vectors are: H[apply(H,1,function(y){ sum(y==1)>1 | sum(y==-1)>1 }),] # in meta-analysis there is only one null vector (c(0,0,0)): H[rowSums(abs(H))!=0,] hconfigs(n.studies = 3, n.association.status= 2)
(H <- hconfigs(n.studies = 3)) # in replication analysis the non-null vectors are: H[apply(H,1,function(y){ sum(y==1)>1 | sum(y==-1)>1 }),] # in meta-analysis there is only one null vector (c(0,0,0)): H[rowSums(abs(H))!=0,] hconfigs(n.studies = 3, n.association.status= 2)
A matrix of size 10000x3 of indicators of whether each z-score from zmat_sim
belongs to a non-null hypothesis for the feature in the study (1) or to a null hypothesis for the feature in the study (0).
data(hmat_sim)
data(hmat_sim)
hmat_sim
is a matrix of 10000 rows, each row a vector of the true association status from which the z-scores in the same row in zmat_sim
was generated. Specifically, for a zero entry in hmat_sim the corresponding z-score in zmat_sim
was generated from the standard normal distribution, and for a unit entry in hmat_sim the corresponding z-score in zmat_sim
was generated from the normal distribution with mean 3 and variance one.
#### use hmat_sim to generate the simulated z-scores: data(hmat_sim) m <- nrow(hmat_sim) set.seed(12) zmat_sim1 <- matrix(rnorm(n=3*m,mean=hmat_sim*3),nrow=m,ncol=3) rm(m,H) data(zmat_sim) stopifnot(all.equal(zmat_sim1,zmat_sim)) #### hmat_sim was generated by the following code: H <- hconfigs(n.studies= 3, n.association.status=2) f <- c(0.895,0.005,0.005,0.02,0.005,0.02,0.02,0.03) # frequencies for the association status vectors m = 10000 # number of tests in each study hmat_sim1 <- matrix(rep(x = H, times = m*cbind(f,f,f)),ncol=3) data(hmat_sim) stopifnot(all.equal(hmat_sim1,hmat_sim)) # the simulation design cbind(H,f) sum(f) # all sum to 1?
#### use hmat_sim to generate the simulated z-scores: data(hmat_sim) m <- nrow(hmat_sim) set.seed(12) zmat_sim1 <- matrix(rnorm(n=3*m,mean=hmat_sim*3),nrow=m,ncol=3) rm(m,H) data(zmat_sim) stopifnot(all.equal(zmat_sim1,zmat_sim)) #### hmat_sim was generated by the following code: H <- hconfigs(n.studies= 3, n.association.status=2) f <- c(0.895,0.005,0.005,0.02,0.005,0.02,0.02,0.03) # frequencies for the association status vectors m = 10000 # number of tests in each study hmat_sim1 <- matrix(rep(x = H, times = m*cbind(f,f,f)),ncol=3) data(hmat_sim) stopifnot(all.equal(hmat_sim1,hmat_sim)) # the simulation design cbind(H,f) sum(f) # all sum to 1?
The function finds the posterior probabilities ofeach vector of association status for each feature, given the feature's vector of binned z-scores.
ldr(pdf.binned.z, binned.z.mat, Pi, h.vecs = NULL)
ldr(pdf.binned.z, binned.z.mat, Pi, h.vecs = NULL)
pdf.binned.z |
Same input as in |
binned.z.mat |
Same input as in |
Pi |
The estimated prior probabilities for each association status vector. Can be extracted from the output of |
h.vecs |
The row indices in |
A subset of features (e.g most significant) can be specified as the rows in binned.z.mat
, so the posterior probabilities of the vectors of association status are computed for this subset of features. See Example section.
Matrix with rows that contain for each of the vectors of association status the posterior probabilities. The columns are the different feature.
## Not run: download.file('http://www.math.tau.ac.il/~ruheller/repfdr_RData/binned_zmat.RData', destfile = "binned_zmat.RData") load(file = "binned_zmat.RData") data(Pi) # Fdr calculation: output3 <- repfdr(pbz, bz, "replication",Pi.previous.result = Pi) BayesFdr <- output3$mat[,"Fdr"] sum(BayesFdr <= 0.05) # The posterior probabilities for the the first five features with Bayes FDR at most 0.05: post <- ldr(pbz,bz[which(BayesFdr <= 0.05)[1:5],],Pi) round(post,4) # posteriors for a subset of the association status vectors can also be reported, # here the subset is the four first association status vectors: post <- ldr(pbz,bz[which(BayesFdr <= 0.05)[1:5],],Pi,h.vecs= 1:4) round(post,4) ## End(Not run)
## Not run: download.file('http://www.math.tau.ac.il/~ruheller/repfdr_RData/binned_zmat.RData', destfile = "binned_zmat.RData") load(file = "binned_zmat.RData") data(Pi) # Fdr calculation: output3 <- repfdr(pbz, bz, "replication",Pi.previous.result = Pi) BayesFdr <- output3$mat[,"Fdr"] sum(BayesFdr <= 0.05) # The posterior probabilities for the the first five features with Bayes FDR at most 0.05: post <- ldr(pbz,bz[which(BayesFdr <= 0.05)[1:5],],Pi) round(post,4) # posteriors for a subset of the association status vectors can also be reported, # here the subset is the four first association status vectors: post <- ldr(pbz,bz[which(BayesFdr <= 0.05)[1:5],],Pi,h.vecs= 1:4) round(post,4) ## End(Not run)
The function calls an expectation-maximization (EM) algorithm to estimate the prior probabilities of each association status vector. It is also used internally in repfdr
.
piem(pdf.binned.z, binned.z.mat, control = em.control())
piem(pdf.binned.z, binned.z.mat, control = em.control())
pdf.binned.z |
Same input as in |
binned.z.mat |
Same input as in |
control |
List of control parameters to pass to the EM algorithm. See |
The implementation of the EM algorithm is in C, and allows paralel processing. By default, the software automatically detects the number of available processing threads. See em.control
for the option of providing the number of threads to use, as well as for the additional control parameters.
all.iterations |
Matrix with number of columns equal to the number of EM iterations, and each column is the estimated probability distribution of the vector of association status. |
last.iteration |
Matrix of the vectors of association status along with the column vector of the last EM iteration, which contains the estimated probabilities of the vectors of association status. |
C
implementation by Shachar Kaufman.
Heller, Ruth, and Daniel Yekutieli. "Replicability analysis for Genome-wide Association studies." arXiv preprint arXiv:1209.2829 (2012).
## Not run: download.file('http://www.math.tau.ac.il/~ruheller/repfdr_RData/binned_zmat.RData', destfile = "binned_zmat.RData") load(file = "binned_zmat.RData") #binned_zmat can also be generated via output_piem <- piem(pbz, bz) # extract the last iteration to use it in repfdr (see help(repfdr)): Pi1 <- output_piem$last.iteration data(Pi) stopifnot(all.equal(Pi,Pi1)) # simulation data: data(binned_zmat_sim) output_piem_sim <- piem(pbz_sim, bz_sim) Pi_sim <- output_piem_sim$last.iteration # following are the true proportions in the data: (see help(hmat_sim) for data generation details.) f <- c(0.895,0.005,0.005,0.02,0.005,0.02,0.02,0.03) # the estimation vs the true proportions: cbind(round(Pi_sim,6),f) ## End(Not run)
## Not run: download.file('http://www.math.tau.ac.il/~ruheller/repfdr_RData/binned_zmat.RData', destfile = "binned_zmat.RData") load(file = "binned_zmat.RData") #binned_zmat can also be generated via output_piem <- piem(pbz, bz) # extract the last iteration to use it in repfdr (see help(repfdr)): Pi1 <- output_piem$last.iteration data(Pi) stopifnot(all.equal(Pi,Pi1)) # simulation data: data(binned_zmat_sim) output_piem_sim <- piem(pbz_sim, bz_sim) Pi_sim <- output_piem_sim$last.iteration # following are the true proportions in the data: (see help(hmat_sim) for data generation details.) f <- c(0.895,0.005,0.005,0.02,0.005,0.02,0.02,0.03) # the estimation vs the true proportions: cbind(round(Pi_sim,6),f) ## End(Not run)
Estimate Bayes and local Bayes false discovery rates (FDRs) from multiple studies, for replicability analysis and for meta-analysis, as presented in Heller and Yekutieli (see reference below).
repfdr(pdf.binned.z, binned.z.mat, non.null = c("replication", "meta-analysis", "user.defined"), non.null.rows = NULL,Pi.previous.result = NULL, control = em.control(), clusters = NULL, clusters.ldr.report=NULL, clusters.verbose=T)
repfdr(pdf.binned.z, binned.z.mat, non.null = c("replication", "meta-analysis", "user.defined"), non.null.rows = NULL,Pi.previous.result = NULL, control = em.control(), clusters = NULL, clusters.ldr.report=NULL, clusters.verbose=T)
pdf.binned.z |
A 3-dimensional array which contains for each study (first dimension), the probabilities of a z-score to fall in the bin (second dimension), under each hypothesis status (third dimension). The third dimension can be of size 2 or 3, depending on the number of association states: if the association can be either null or non-null (e.g. only in one direction), the dimension is 2; if the association can be either null, or positive, or negative, the dimension is 3.
Element |
binned.z.mat |
A matrix of the bin numbers for each of the z-scores (rows) in each study (columns).
Element |
non.null |
Indicates the desired analysis: |
non.null.rows |
Vector of row indices in |
Pi.previous.result |
An optional Vector of probabilities for each association status. If |
control |
List of control parameters to pass to the EM algorithm. See |
clusters |
Used for performing analysis in each cluster, and than aggregating results together.Default value is |
clusters.ldr.report |
Sets whether local fdr values (available through the function |
clusters.verbose |
if set to |
For N
studies, each examining the same M
features, the binned z-scores and the (estimated) probabilities under the null and non-null states in each study are given as input.
These inputs can be produced from the z-scores using the function ztobins
.
The function calls piem
for the computation of the probabilities for each vector of association status. The number of probabilies estimated is x^N
, where x=2,3
is the number of possible association states in each study.
The function calls ldr
for the computation of the conditional probability of each of the vectors of association status in the null set given the binned z-scores. The null set contains the rows in hconfigs(N,x)
that: are excluded from non.null.rows
if non.null
is user.defined
; that are non-zero if non.null
is meta-analysis
; that contain at most one 1 if non.null
is replication
and x=2
; that contain at most one 1 or one -1 if non.null
is replication
and x=3
.
The local Bayes FDR is estimated to be the sum of conditional probabilities in the null set for each feature. The empirical Bayes FDR is the average of all local Bayes FDRs that are at most the value of the local Bayes FDR for each feature. The list of discoveries at level q are all features with empirical Bayes FDR at most q.
If many studies are available, one may not be able to compute RepFDR directly at the original data. If however, different groups of studies are known to be independent (e.g., if a SNP is non null for studies 1,2 is independent of the SNP being non null in studies 3,4) one may Run RepFDR in each cluster seperatly and then aggregate the results. This is done by providing a vector for the clusters
argument, with an integer value stating the cluster membership for each study.
See the values section below for the results returned from this function, when partitioning the data to clusters.
See vignette('RepFDR') for a complete example, on how to run RepFDR in clusters.
mat |
An |
Pi |
Vector of the estimated probabilities for each of the |
repfdr.mat.percluster |
Returned if |
repfdr.Pi.percluster |
Returned if |
ldr |
Returned if |
Ruth Heller, Shachar Kaufman, Shay Yaacoby, Barak Brill, Daniel Yekutieli.
Heller, R., & Yekutieli, D. (2014). Replicability analysis for genome-wide association studies. The Annals of Applied Statistics, 8(1), 481-498.
Heller, R., Yaacoby, S., & Yekutieli, D. (2014). repfdr: a tool for replicability analysis for genome-wide association studies. Bioinformatics, btu434.
#### Example 1: a simulation; each feature in each study has two association states, #### null and positive, prior is known #This example generates the Z scores for two studies, with 0.05 probability to have # non - null signal in each study. # The prior matrix is being pregenerated to show the optimal values. # if this matrix was not supplied, the repfdr method would estimate it # using an EM algorithm. See the next examples for estimating the prior as well using repfdr. set.seed(1) n = 2 #two studies m=10000 # ten thounsand, SNPs H_Study_1 = rbinom(m,1,prob = 0.05) #signal of 1, for SNPS with association in the first study H_Study_2 = rbinom(m,1,prob = 0.05) #signal of 1, for SNPS with association in the second study Zmat = matrix(rnorm(n*m),nrow = m) #generate matrix #insert signal (mean shift of 3) for the first study Zmat[which(H_Study_1==1),1] = Zmat[which(H_Study_1==1),1] + 4 #insert signal to the second study Zmat[which(H_Study_2==1),2] = Zmat[which(H_Study_2==1),2] + 4 #estimate densities via ztobins: ztobins_res = ztobins(Zmat,n.association.status = 2,plot.diagnostics = FALSE,n.bin= 100) #writing out the prior explicitly. If this was not supplied, #the repfdr would try to estimate this prior from the data. Precomputed_Pi = matrix(NA,ncol = 3,nrow = 4) Precomputed_Pi[,1] = c(0,1,0,1) Precomputed_Pi[,2] = c(0,0,1,1) Precomputed_Pi[,3] = c(0.95^2,0.95*0.05,0.95*0.05,0.05^2) colnames(Precomputed_Pi) = c('Study 1','Study 2','Pi') #run repfdr repfdr_res = repfdr(ztobins_res$pdf.binned.z, ztobins_res$binned.z.mat, non.null = 'replication', Pi.previous.result = Precomputed_Pi) #The precomputed prior matrix. if this would not repfdr_res$Pi #local fdr0 and Fdr for each SNP head(repfdr_res$mat) Non_Null = which(H_Study_1 ==1 & H_Study_2 == 1) Reported = which(repfdr_res$mat[,2] <= 0.05) TP = length(intersect(Reported, Non_Null)) TP FP = length(Reported) - TP FP FN = length(Non_Null - TP) FN #### Example 2: a simulation; each feature in each study has two association states, #### null and positive, prior is estimated ## Not run: # a) Replicablity analysis: data(binned_zmat_sim) # this loads the binned z-scores as well as the (estimated) probabilities # in each bin for each state output.rep <- repfdr(pbz_sim, bz_sim, "replication") BayesFdr.rep <- output.rep$mat[,"Fdr"] Rej <- (BayesFdr.rep <= 0.05) sum(Rej) # which of the tests are true replicability findings? (we know this since the data was simulated) data(hmat_sim) true.rep <- apply(hmat_sim,1,function(y){ sum(y==1)>1 }) # Compute the false discovery proportion (FDP) for replicability: sum(Rej * !true.rep) / sum(true.rep) # we can use the previously calculated Pi for further computations (e.g meta-analysis): Pi_sim <- output.rep$Pi # b) meta-analysis: output.meta <- repfdr(pbz_sim, bz_sim, "meta-analysis", Pi.previous.result = Pi_sim) BayesFdr.meta <- output.meta$mat[,"Fdr"] Rej <- (BayesFdr.meta <= 0.05) sum(Rej) # which of the tests are true association findings? (we know this since the data was simulated) true.assoc <- rowSums(hmat_sim) >= 1 # Compute the false discovery proportion (FDP) for association: sum(Rej * !true.assoc) / sum(true.assoc) ## End(Not run) ## Not run: #### Example 3: SNPs data; each SNP in each study has three association states, #### negative, null, or positive: # load the bins of the z-scores and their probabilities. download.file('http://www.math.tau.ac.il/~ruheller/repfdr_RData/binned_zmat.RData', destfile = "binned_zmat.RData") load(file = "binned_zmat.RData") # can also be generated from SNPlocation - see ztobins documentation. # load the prior probabilities for each association status vector. data(Pi) Pi # the proportions vector was computed using piem() # with the following command: Pi <- piem(pbz, bz)$last.iteration # a) replicablity analysis: output.rep <- repfdr(pbz, bz, "replication",Pi.previous.result=Pi) BayesFdr.rep <- output.rep$mat[,"Fdr"] Rej <- sum(BayesFdr.rep <= 0.05) sum(Rej) # The posterior probabilities for the first five features with Bayes FDR at most 0.05: post <- ldr(pbz,bz[order(BayesFdr.rep)[1:5],],Pi) round(post,4) # posteriors for a subset of the association status vectors can also be reported: H <- hconfigs( dim(bz)[2], 3) h.replicability = apply(H, 1, function(y) {sum(y == 1)> 1 | sum(y == -1) >1}) post <- ldr(pbz,bz[order(BayesFdr.rep)[1:5],],Pi,h.vecs= which(h.replicability==1)) round(post,4) # b) meta-analysis: output.meta <- repfdr(pbz, bz, "meta-analysis", Pi.previous.result = Pi) BayesFdr.meta <- output.meta$mat[,"Fdr"] Rej <- sum(BayesFdr.meta <= 0.05) sum(Rej) ## End(Not run) ## manhattan plot (ploting can take a while): # code for manhattan plot by Stephen Turner (see copyrights at the source code manhattan.r) ## Not run: data(SNPlocations) par(mfrow=c(2,1)) # Replication manhattan(dataframe=cbind(SNPlocations,P=BayesFdr.rep),ymax=10.5,pch=20, limitchromosomes=1:4,suggestiveline=-log(0.05,10),genomewideline=F,cex=0.25, annotate=SNPlocations$SNP[BayesFdr.rep<=0.05],main="Replication") # Association manhattan(dataframe=cbind(SNPlocations,P=BayesFdr.meta),ymax=10.5,cex=0.25, limitchromosomes=1:4,suggestiveline=-log(0.05,10),genomewideline=F,pch=20, annotate=SNPlocations$SNP[BayesFdr.rep<=0.05],main="Meta-analysis") par(mfrow=c(1,1)) ## End(Not run)
#### Example 1: a simulation; each feature in each study has two association states, #### null and positive, prior is known #This example generates the Z scores for two studies, with 0.05 probability to have # non - null signal in each study. # The prior matrix is being pregenerated to show the optimal values. # if this matrix was not supplied, the repfdr method would estimate it # using an EM algorithm. See the next examples for estimating the prior as well using repfdr. set.seed(1) n = 2 #two studies m=10000 # ten thounsand, SNPs H_Study_1 = rbinom(m,1,prob = 0.05) #signal of 1, for SNPS with association in the first study H_Study_2 = rbinom(m,1,prob = 0.05) #signal of 1, for SNPS with association in the second study Zmat = matrix(rnorm(n*m),nrow = m) #generate matrix #insert signal (mean shift of 3) for the first study Zmat[which(H_Study_1==1),1] = Zmat[which(H_Study_1==1),1] + 4 #insert signal to the second study Zmat[which(H_Study_2==1),2] = Zmat[which(H_Study_2==1),2] + 4 #estimate densities via ztobins: ztobins_res = ztobins(Zmat,n.association.status = 2,plot.diagnostics = FALSE,n.bin= 100) #writing out the prior explicitly. If this was not supplied, #the repfdr would try to estimate this prior from the data. Precomputed_Pi = matrix(NA,ncol = 3,nrow = 4) Precomputed_Pi[,1] = c(0,1,0,1) Precomputed_Pi[,2] = c(0,0,1,1) Precomputed_Pi[,3] = c(0.95^2,0.95*0.05,0.95*0.05,0.05^2) colnames(Precomputed_Pi) = c('Study 1','Study 2','Pi') #run repfdr repfdr_res = repfdr(ztobins_res$pdf.binned.z, ztobins_res$binned.z.mat, non.null = 'replication', Pi.previous.result = Precomputed_Pi) #The precomputed prior matrix. if this would not repfdr_res$Pi #local fdr0 and Fdr for each SNP head(repfdr_res$mat) Non_Null = which(H_Study_1 ==1 & H_Study_2 == 1) Reported = which(repfdr_res$mat[,2] <= 0.05) TP = length(intersect(Reported, Non_Null)) TP FP = length(Reported) - TP FP FN = length(Non_Null - TP) FN #### Example 2: a simulation; each feature in each study has two association states, #### null and positive, prior is estimated ## Not run: # a) Replicablity analysis: data(binned_zmat_sim) # this loads the binned z-scores as well as the (estimated) probabilities # in each bin for each state output.rep <- repfdr(pbz_sim, bz_sim, "replication") BayesFdr.rep <- output.rep$mat[,"Fdr"] Rej <- (BayesFdr.rep <= 0.05) sum(Rej) # which of the tests are true replicability findings? (we know this since the data was simulated) data(hmat_sim) true.rep <- apply(hmat_sim,1,function(y){ sum(y==1)>1 }) # Compute the false discovery proportion (FDP) for replicability: sum(Rej * !true.rep) / sum(true.rep) # we can use the previously calculated Pi for further computations (e.g meta-analysis): Pi_sim <- output.rep$Pi # b) meta-analysis: output.meta <- repfdr(pbz_sim, bz_sim, "meta-analysis", Pi.previous.result = Pi_sim) BayesFdr.meta <- output.meta$mat[,"Fdr"] Rej <- (BayesFdr.meta <= 0.05) sum(Rej) # which of the tests are true association findings? (we know this since the data was simulated) true.assoc <- rowSums(hmat_sim) >= 1 # Compute the false discovery proportion (FDP) for association: sum(Rej * !true.assoc) / sum(true.assoc) ## End(Not run) ## Not run: #### Example 3: SNPs data; each SNP in each study has three association states, #### negative, null, or positive: # load the bins of the z-scores and their probabilities. download.file('http://www.math.tau.ac.il/~ruheller/repfdr_RData/binned_zmat.RData', destfile = "binned_zmat.RData") load(file = "binned_zmat.RData") # can also be generated from SNPlocation - see ztobins documentation. # load the prior probabilities for each association status vector. data(Pi) Pi # the proportions vector was computed using piem() # with the following command: Pi <- piem(pbz, bz)$last.iteration # a) replicablity analysis: output.rep <- repfdr(pbz, bz, "replication",Pi.previous.result=Pi) BayesFdr.rep <- output.rep$mat[,"Fdr"] Rej <- sum(BayesFdr.rep <= 0.05) sum(Rej) # The posterior probabilities for the first five features with Bayes FDR at most 0.05: post <- ldr(pbz,bz[order(BayesFdr.rep)[1:5],],Pi) round(post,4) # posteriors for a subset of the association status vectors can also be reported: H <- hconfigs( dim(bz)[2], 3) h.replicability = apply(H, 1, function(y) {sum(y == 1)> 1 | sum(y == -1) >1}) post <- ldr(pbz,bz[order(BayesFdr.rep)[1:5],],Pi,h.vecs= which(h.replicability==1)) round(post,4) # b) meta-analysis: output.meta <- repfdr(pbz, bz, "meta-analysis", Pi.previous.result = Pi) BayesFdr.meta <- output.meta$mat[,"Fdr"] Rej <- sum(BayesFdr.meta <= 0.05) sum(Rej) ## End(Not run) ## manhattan plot (ploting can take a while): # code for manhattan plot by Stephen Turner (see copyrights at the source code manhattan.r) ## Not run: data(SNPlocations) par(mfrow=c(2,1)) # Replication manhattan(dataframe=cbind(SNPlocations,P=BayesFdr.rep),ymax=10.5,pch=20, limitchromosomes=1:4,suggestiveline=-log(0.05,10),genomewideline=F,cex=0.25, annotate=SNPlocations$SNP[BayesFdr.rep<=0.05],main="Replication") # Association manhattan(dataframe=cbind(SNPlocations,P=BayesFdr.meta),ymax=10.5,cex=0.25, limitchromosomes=1:4,suggestiveline=-log(0.05,10),genomewideline=F,pch=20, annotate=SNPlocations$SNP[BayesFdr.rep<=0.05],main="Meta-analysis") par(mfrow=c(1,1)) ## End(Not run)
SNPlocations
includes the locations of SNPs in chromosomes 1 to 4.
Data was simulated to the SNPs with HAPGEN2 for three studies and a sample of it was taken (Chromosomes 1 to 4) for the examples. The data is summarized as z-scores(transformed p-values, with inverse standard normal cumulative distribution).
The z-scores matrix can be download from the web (see example).
data(SNPlocations)
data(SNPlocations)
SNPlocations
data.frame of 249024 SNPs' names, chromosome number and location on the chromosomes.
zmat
Matrix of 249024 SNPs' z-scores (in rows) in each of the 3 studies (columns).
See: Su, Zhan, Jonathan Marchini, and Peter Donnelly. "HAPGEN2: simulation of multiple disease SNPs." Bioinformatics 27.16 (2011): 2304-2305.
data(SNPlocations) head(SNPlocations) ## Not run: download.file('http://www.math.tau.ac.il/~ruheller/repfdr_RData/zmat.RData',destfile = "zmat.RData") load(file = "zmat.RData") input.to.repfdr <- ztobins(zmat, 3, df= 15) pbz <- input.to.repfdr$pdf.binned.z bz <- input.to.repfdr$binned.z.mat ## End(Not run)
data(SNPlocations) head(SNPlocations) ## Not run: download.file('http://www.math.tau.ac.il/~ruheller/repfdr_RData/zmat.RData',destfile = "zmat.RData") load(file = "zmat.RData") input.to.repfdr <- ztobins(zmat, 3, df= 15) pbz <- input.to.repfdr$pdf.binned.z bz <- input.to.repfdr$binned.z.mat ## End(Not run)
For each study, the function discretizes two sided P-values into bins and estimates the probabilities in each bin for the null and non-null states.
The function can plot diagnostic plots (disabled by default) for model fit. These should be monitored for misfit of model to data, before using function output in repfdr
. See description of diagnostic plots below.
twosided.PValues.tobins(pval.mat, n.bins = 120, type = 0, df = 7, central.prop = 0.5, pi0=NULL,plot.diagnostics = FALSE, trim.z=FALSE,trim.z.upper = 8, trim.z.lower = -8, force.bin.number = FALSE, pi.plugin.lambda = 0.05)
twosided.PValues.tobins(pval.mat, n.bins = 120, type = 0, df = 7, central.prop = 0.5, pi0=NULL,plot.diagnostics = FALSE, trim.z=FALSE,trim.z.upper = 8, trim.z.lower = -8, force.bin.number = FALSE, pi.plugin.lambda = 0.05)
pval.mat |
Matrix of two sided P-Values of the features (in rows) in each study (columns). |
n.bins |
Number of bins in the discretization of the z-score axis (the number of bins is |
type |
Type of fitting used for f; 0 is a natural spline, 1 is a polynomial, in either case with degrees of freedom |
df |
Degrees of freedom for fitting the estimated density f(z). |
central.prop |
Central proportion of the z-scores used like the area of zero-assumption to estimate pi0. |
pi0 |
Sets argument for estimation of proportion of null hypotheses. Default value is NULL (automatic estimation of pi0) for every study. Second option is to supply vector of values between 0 and 1 (with length of the number of studies/ columns of |
plot.diagnostics |
If set to A second plot is the Normal Q-Q plot of Zscores, converted using Misfit in these two plots should be investigated by the user, before using output in Default value is |
trim.z |
If set to |
trim.z.upper |
Upper bound for trimming Z scores. Default value is 8 |
trim.z.lower |
Lower bound for trimming Z scores. Default value is -8 |
force.bin.number |
Set to |
pi.plugin.lambda |
The function makes use of the plugin estimator for the estimation of the proportion of null hypotheses. The plugin estimator is |
This utility function outputs the first two arguments to be input in the main function repfdr
.
A list with:
pdf.binned.z |
A 3-dimensional array which contains for each study (first dimension), the probabilities of a z-score to fall in the bin (second dimension), under each hypothesis status (third dimension). The third dimension can be of size 2 or 3, depending on the number of association states: if the association can be either null or only in one direction, the dimension is 2; if the association can be either null, or positive, or negative, the dimension is 3. |
binned.z.mat |
A matrix of the bin numbers for each the z-scores (rows) in each study (columns). |
breaks.matrix |
A matrix with |
df |
Number of degrees of freedom, used for spline fitting of density. |
proportions |
Matrix with |
PlotWarnings |
Vector of size |
# we generate a dataset with p=10000 pvalues for two studies, # p1=300 of which are non null: set.seed(1) p = 10000 p1 = 300 z1 = (rnorm(p)) z2 = (rnorm(p)) temp = rnorm(p1, 3.5,0.5) z1[1:p1] = temp + rnorm(p1,0,0.2) z2[1:p1] = temp + rnorm(p1,0,0.2) zmat.example = cbind(z1,z2) pmat.example = 1-(pnorm(abs(zmat.example)) - pnorm(-1*abs(zmat.example))) twosided.pval.res = twosided.PValues.tobins(pmat.example, plot.diagnostics = TRUE) twosided.pval.res$proportions
# we generate a dataset with p=10000 pvalues for two studies, # p1=300 of which are non null: set.seed(1) p = 10000 p1 = 300 z1 = (rnorm(p)) z2 = (rnorm(p)) temp = rnorm(p1, 3.5,0.5) z1[1:p1] = temp + rnorm(p1,0,0.2) z2[1:p1] = temp + rnorm(p1,0,0.2) zmat.example = cbind(z1,z2) pmat.example = 1-(pnorm(abs(zmat.example)) - pnorm(-1*abs(zmat.example))) twosided.pval.res = twosided.PValues.tobins(pmat.example, plot.diagnostics = TRUE) twosided.pval.res$proportions
A simulated data set from three studies, with 10000 "features" in each study, each of which yielded a z-score. The data comprises 10000x3 z-scores.
See hmat_sim
for the indicators of association status matrix.
data(zmat_sim)
data(zmat_sim)
zmat_sim
is a matrix of 10000 z-scores (in rows) in each of the 3 studies (columns).
data(zmat_sim) head(zmat_sim) ## Not run: input.to.repfdr <- ztobins(zmat_sim, 2 ) pbz_sim1 <- input.to.repfdr$pdf.binned.z bz_sim1 <- input.to.repfdr$binned.z.mat data(binned_zmat_sim) stopifnot(all.equal(pbz_sim1,pbz_sim)) stopifnot(all.equal(bz_sim1,bz_sim)) ## End(Not run) #### zmat_sim was generated by the following code: data(hmat_sim) set.seed(12) m <- nrow(hmat_sim) zmat_sim1 <- matrix(rnorm(n=3*m,mean=hmat_sim*3),nrow=m,ncol=3) data(zmat_sim) stopifnot(all.equal(zmat_sim1,zmat_sim))
data(zmat_sim) head(zmat_sim) ## Not run: input.to.repfdr <- ztobins(zmat_sim, 2 ) pbz_sim1 <- input.to.repfdr$pdf.binned.z bz_sim1 <- input.to.repfdr$binned.z.mat data(binned_zmat_sim) stopifnot(all.equal(pbz_sim1,pbz_sim)) stopifnot(all.equal(bz_sim1,bz_sim)) ## End(Not run) #### zmat_sim was generated by the following code: data(hmat_sim) set.seed(12) m <- nrow(hmat_sim) zmat_sim1 <- matrix(rnorm(n=3*m,mean=hmat_sim*3),nrow=m,ncol=3) data(zmat_sim) stopifnot(all.equal(zmat_sim1,zmat_sim))
For each study, the function discretizes the z-scores into bins and estimates the probabilities in each bin for the null and non-null states.
The function can plot diagnostic plots (disabled by default) for model fit. These should be monitored for misfit of model to data, before using function output in repfdr
. See description of diagnostic plots below.
ztobins(zmat, n.association.status = 3, n.bins = 120, type = 0, df = 7, central.prop = 0.5, pi0=NULL,plot.diagnostics = FALSE, trim.z=FALSE,trim.z.upper = 8,trim.z.lower = -8, force.bin.number = FALSE, pi.using.plugin = FALSE, pi.plugin.lambda = 0.05)
ztobins(zmat, n.association.status = 3, n.bins = 120, type = 0, df = 7, central.prop = 0.5, pi0=NULL,plot.diagnostics = FALSE, trim.z=FALSE,trim.z.upper = 8,trim.z.lower = -8, force.bin.number = FALSE, pi.using.plugin = FALSE, pi.plugin.lambda = 0.05)
zmat |
Matrix of z-scores of the features (in rows) in each study (columns). |
n.association.status |
either 2 for no-association\association or 3 for no-associtation\negative-association\positive-association. |
n.bins |
Number of bins in the discretization of the z-score axis (the number of bins is |
type |
Type of fitting used for f; 0 is a natural spline, 1 is a polynomial, in either case with degrees of freedom |
df |
Degrees of freedom for fitting the estimated density f(z). |
central.prop |
Central proportion of the z-scores used like the area of zero-assumption to estimate pi0. |
pi0 |
Sets argument for estimation of proportion of null hypotheses. Default value is NULL (automatic estimation of pi0) for every study. Second option is to supply vector of values between 0 and 1 (with length of the number of studies/ columns of |
plot.diagnostics |
If set to A second plot is the Normal Q-Q plot of Zscores, converted using Misfit in these two plots should be investigated by the user, before using output in Default value is |
trim.z |
If set to |
trim.z.upper |
Upper bound for trimming Z scores. Default value is 8 |
trim.z.lower |
Lower bound for trimming Z scores. Default value is -8 |
force.bin.number |
Set to |
pi.using.plugin |
Logical flag indicating whether estimation of the number of null hypotheses should be done using the plugin estimator.(Default is |
pi.plugin.lambda |
Parameter used for estimation of proportion of null hypotheses, for one sided tests. Default value is 0.05. This should be set to the type 1 error used for hypothesis testing. |
This utility function outputs the first two arguments to be input in the main function repfdr
.
A list with:
pdf.binned.z |
A 3-dimensional array which contains for each study (first dimension), the probabilities of a z-score to fall in the bin (second dimension), under each hypothesis status (third dimension). The third dimension can be of size 2 or 3, depending on the number of association states: if the association can be either null or only in one direction, the dimension is 2; if the association can be either null, or positive, or negative, the dimension is 3. |
binned.z.mat |
A matrix of the bin numbers for each the z-scores (rows) in each study (columns). |
breaks.matrix |
A matrix with |
df |
Number of degrees of freedom, used for spline fitting of density. |
proportions |
Matrix with |
PlotWarnings |
Vector of size |
# Simulated example using both the central proportion estimator # and the plug in estimator for the proportion of null hypotheses: set.seed(1) p = 10000 p1 = 300 z1 = (rnorm(p)) z2 = (rnorm(p)) temp = rnorm(p1, 3.5,0.5) z1[1:p1] = temp + rnorm(p1,0,0.2) z2[1:p1] = temp + rnorm(p1,0,0.2) z1.abs = abs(z1) z2.abs = abs(z2) plot(z1,z2) hist(z1) hist(z2) zmat.example = cbind(z1,z2) ztobins.res = ztobins(zmat.example, plot.diagnostics = TRUE) ztobins.res$proportions ztobins.res.plugin.estimator = ztobins(zmat.example, pi.using.plugin = TRUE, plot.diagnostics = TRUE) ztobins.res.plugin.estimator$proportions ## Not run: # three association states case (H in {-1,0,1}): download.file('http://www.math.tau.ac.il/~ruheller/repfdr_RData/zmat.RData',destfile = "zmat.RData") load(file = "zmat.RData") input.to.repfdr3 <- ztobins(zmat, 3, df = 15) pbz <- input.to.repfdr3$pdf.binned.z bz <- input.to.repfdr3$binned.z.mat # two association states case (H in {0,1}): data(zmat_sim) input.to.repfdr <- ztobins(zmat_sim, 2, n.bins = 100 ,plot.diagnostics = T) pbz_sim <- input.to.repfdr$pdf.binned.z bz_sim <- input.to.repfdr$binned.z.mat ## End(Not run)
# Simulated example using both the central proportion estimator # and the plug in estimator for the proportion of null hypotheses: set.seed(1) p = 10000 p1 = 300 z1 = (rnorm(p)) z2 = (rnorm(p)) temp = rnorm(p1, 3.5,0.5) z1[1:p1] = temp + rnorm(p1,0,0.2) z2[1:p1] = temp + rnorm(p1,0,0.2) z1.abs = abs(z1) z2.abs = abs(z2) plot(z1,z2) hist(z1) hist(z2) zmat.example = cbind(z1,z2) ztobins.res = ztobins(zmat.example, plot.diagnostics = TRUE) ztobins.res$proportions ztobins.res.plugin.estimator = ztobins(zmat.example, pi.using.plugin = TRUE, plot.diagnostics = TRUE) ztobins.res.plugin.estimator$proportions ## Not run: # three association states case (H in {-1,0,1}): download.file('http://www.math.tau.ac.il/~ruheller/repfdr_RData/zmat.RData',destfile = "zmat.RData") load(file = "zmat.RData") input.to.repfdr3 <- ztobins(zmat, 3, df = 15) pbz <- input.to.repfdr3$pdf.binned.z bz <- input.to.repfdr3$binned.z.mat # two association states case (H in {0,1}): data(zmat_sim) input.to.repfdr <- ztobins(zmat_sim, 2, n.bins = 100 ,plot.diagnostics = T) pbz_sim <- input.to.repfdr$pdf.binned.z bz_sim <- input.to.repfdr$binned.z.mat ## End(Not run)