This vignette shows some examples of using GxEScanR to perform genome-wide association study (GWAS) and genome-wide by environment interaction study (GWEIS) scans using all the options available to the user.
With growing number of SNPs that can be imputed it is necessary to have software that can efficiently perform GWAS and GWEIS scans. GxEScanR can do this using files that were saved in the BinaryDosage format. The BinaryDosage package can convert VCF and GEN files into the BinaryDosage format. The BinaryDosage format was designed to keep the file with the genetic data small with fast read times. GxEScanR uses this and efficient large scale regression routines to perform GWAS and GWEIS scans quickly.
The examples below use three sample files. The first contains a data frame that has subject data. The second file is a genetic data file in the BinaryDosage format. The last file contains the information returned by the BinaryDosage::getbdinfo routine that returns information about the binary dosage file that makes reading it fast.
covdatafile <- system.file("extdata", "covdata.rds", package = "GxEScanR")
covdata <- readRDS(covdatafile)| sid | y | e | 
|---|---|---|
| I1 | 0 | 0 | 
| I2 | 0 | 0 | 
| I3 | 0 | 0 | 
| I4 | 0 | 0 | 
| I5 | 0 | 0 | 
To load the binary dosage information file, it is necessary to update the file name since the file has been moved from its original location during the installation process. The following loads the binary dosage information and updates the file name.
bdinfofile <- system.file("extdata", "pdata_4_1.bdinfo", package = "GxEScanR")
bdinfo <- readRDS(bdinfofile)
bdinfo$filename <- system.file("extdata", "pdata_4_1.bdose", package = "GxEScanR")| model | outcome | predictors | subjects | 
|---|---|---|---|
| D|E,G | Phenotype | All covariates, gene | All | 
| D|E,G,GxE | Phenotype | All covariates, gene, gene x last covariate | All | 
| E|G | Last covariate | All other covariates, gene | All | 
| E|G,D=1, case only | Last covariate | All other covariates, gene | Cases | 
| E|G,D=0 control only | Last covariate | All other covariates, gene | Controls | 
The simplest scan to do is a linear regression GWAS. The following model is first when doing a linear regression GWAS.
| model | outcome | predictors | subjects | 
|---|---|---|---|
| D|E,G | Phenotype | All covariates, gene | All | 
In the example data set, the phenotype, y, is coded 0,1. When GxEScanR sees the phenotype codes this way it assumes the outcome is binary and uses logistic regression. To perform a linear regression GWAS the binary option needs to be set to FALSE. The following shows how to do a linear GWAS along with the results.
The routine outputs the number of subjects used in the analysis and returns a data frame with the results.
lingwas1 <- gwas(data = covdata,
                 bdinfo = bdinfo,
                 binary = FALSE)
#> [1] "200 subjects have complete data"| snp | betag | lrtg | 
|---|---|---|
| 1:10001 | 0.1605708 | 9.0532054 | 
| 1:10002 | -0.0220380 | 0.1219035 | 
| 1:10003 | -0.0732687 | 1.3812032 | 
| 1:10004 | 0.0326160 | 0.2355224 | 
| 1:10005 | 0.0627953 | 1.2072107 | 
The output can be redirected to output file that produces a plain test version of the results in a tab delimited file that can be read into R using the read.table routine. In this case, the gwas routine returns a value of 0.
outfile <- tempfile()
lingwas2 <- gwas(data = covdata,
                 bdinfo = bdinfo,
                 outfile = outfile,
                 binary = FALSE)
#> [1] "200 subjects have complete data"
lingwas2
#> [1] 0
lingwas2 <- read.table(outfile, header = TRUE, sep ='\t')| SNPID | betag | lrtg | 
|---|---|---|
| 1:10001 | 0.1605708 | 9.0532054 | 
| 1:10002 | -0.0220380 | 0.1219035 | 
| 1:10003 | -0.0732687 | 1.3812032 | 
| 1:10004 | 0.0326160 | 0.2355224 | 
| 1:10005 | 0.0627953 | 1.2072107 | 
The gweis routine takes the same parameters as the gwas function but performs additional tests. The models fit for a linear regression GWAS are.
| model | outcome | predictors | subjects | 
|---|---|---|---|
| D|E,G | Phenotype | All covariates, gene | All | 
| D|E,G,GxE | Phenotype | All covariates, gene, gene x last covariate | All | 
Note: When doing a GWEIS the interaction covariate is in the last column of the subject data frame.
In this test the minmaf option was used. When minmaf is specified the minor allele for a SNP must exceed minmaf to be test. Notice that only 5 SNPs are in the output data frame. This is because one of the SNPs has a minor allele frequency below 0.2.
lingweis1 <- gweis(data = covdata,
                   bdinfo = bdinfo,
                   minmaf = 0.2,
                   binary = FALSE)
#> [1] "200 subjects have complete data"| snp | betadg | lrtdg | betagxe | lrtgxe | lrt2df | 
|---|---|---|---|---|---|
| 1:10001 | 0.1605708 | 9.0532054 | 0.0777308 | 0.5402899 | 9.593495 | 
| 1:10002 | -0.0220380 | 0.1219035 | 0.1484303 | 1.3479068 | 1.469810 | 
| 1:10004 | 0.0326160 | 0.2355224 | 0.1968781 | 2.0769149 | 2.312437 | 
| 1:10005 | 0.0627953 | 1.2072107 | -0.0442079 | 0.1401332 | 1.347344 | 
If the user is interested in see what happened to SNPs that weren’t included in the data frame, the skipfile option can be used. The skipfile value is the name of a file to write the skipped SNPs to. The skipfile option can be used along with the outfile option. The skip file is in the same format as the output file. Below is an example using the skipfile option.
skipfile = tempfile()
lingweis2 <- gweis(data = covdata,
                   bdinfo = bdinfo,
                   skipfile = skipfile,
                   minmaf = 0.2,
                   binary = FALSE)
#> [1] "200 subjects have complete data"| snp | betadg | lrtdg | betagxe | lrtgxe | lrt2df | 
|---|---|---|---|---|---|
| 1:10001 | 0.1605708 | 9.0532054 | 0.0777308 | 0.5402899 | 9.593495 | 
| 1:10002 | -0.0220380 | 0.1219035 | 0.1484303 | 1.3479068 | 1.469810 | 
| 1:10004 | 0.0326160 | 0.2355224 | 0.1968781 | 2.0769149 | 2.312437 | 
| 1:10005 | 0.0627953 | 1.2072107 | -0.0442079 | 0.1401332 | 1.347344 | 
| SNPID | reasondg | reasongxe | 
|---|---|---|
| 1:10003 | 2 | 2 | 
The following table lists the reasons SNPs were skipped given the skipped value.
| code | reason | 
|---|---|
| 1 | Excluded by user | 
| 2 | Minor allele frequency below threshold | 
| 3 | X matrix singular | 
| 4 | Singular matrix updating beta top | 
| 5 | Singular matrix updating beta bottom | 
| 6 | Maximum iterations exceeded | 
In this example, the phenotype is coded (0,1). The gwas and gweis routines check for this an will run logistic regressions if the outcome is coded (0,1) unless binary is set to FALSE. If the use wants to make sure the outcome is coded (0,1), the user may set binary to TRUE. In this case, if the outcome is not coded (0,1) an error is produced.
The following model is fit when doing a logistic regression GWAS.
| model | outcome | predictors | subjects | 
|---|---|---|---|
| D|E,G | Phenotype | All covariates, gene | All | 
loggwas1 <- gwas(data = covdata,
                 bdinfo = bdinfo,
                 blksize = 2,
                 binary = TRUE)
#> [1] "200 subjects have complete data"| snp | betag | lrtg | 
|---|---|---|
| 1:10001 | 0.7129 | 9.0230 | 
| 1:10002 | -0.0943 | 0.1219 | 
| 1:10003 | -0.3144 | 1.3777 | 
| 1:10004 | 0.1397 | 0.2356 | 
| 1:10005 | 0.2680 | 1.2003 | 
In this example, the option blksize is used. When an analysis is run several SNPs are read in at one time. This saves disk time. The following are the default values for given the number of subjects. These values were chosen to keep the program running using less than 4GB of RAM. The user is allowed to specify a value up to twice the default value. Little performance gain is seen going with larger values. If the user enters 0 for blksize, the default value is used.
| subjects | blksize | 
|---|---|
| less than 10,000 | 5000 | 
| 10,000 to 24,999 | 2000 | 
| 25,000 to 49,999 | 1000 | 
| 50,000 to 99,999 | 500 | 
| 100,000 to 249,999 | 200 | 
| 250,000 to 499,999 | 100 | 
| 500,000 or greater | 50 | 
A logistic regression GWEIS fits an additional 4 models that produce 7 more tests. 3 of these models use the the interaction covariate as the outcome. The following show all the models fit in a logistic regression GWEIS.
The following model is fit when doing a logistic regression GWAS.
| model | outcome | predictors | subjects | 
|---|---|---|---|
| D|E,G | Phenotype | All covariates, gene | All | 
| D|E,G,GxE | Phenotype | All covariates, gene, gene x last covariate | All | 
| E|G | Last covariate | All other covariates, gene | All | 
| E|G,D=1, case only | Last covariate | All other covariates, gene | Cases | 
| E|G,D=0 control only | Last covariate | All other covariates, gene | Controls | 
Note: When doing a GWEIS the interaction covariate is in the last column of the data frame.
In the example subject data, the covariate is coded (0,1). In this case, the gweis routine will use logistic regression to fit the last 3 models.
loggweis1 <- gweis(data = covdata,
                   bdinfo = bdinfo,
                   snps = 1:2,
                   binary = TRUE)
#> [1] "200 subjects have complete data"| snp | betadg | lrtdg | betagxe | lrtgxe | lrt2df | betaeg | lrteg | lrt3df | betacase | lrtcase | betactrl | lrtctrl | 
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1:10001 | 0.7129 | 9.0230 | 0.4931 | 0.9520 | 9.9750 | 0.4115 | 3.3606 | 13.3356 | 0.4290 | 2.0147 | -0.0477 | 0.0147 | 
| 1:10002 | -0.0943 | 0.1219 | 0.6508 | 1.3569 | 1.4789 | 0.0798 | 0.0910 | 1.5699 | 0.3882 | 1.1018 | -0.2562 | 0.3735 | 
In this example the snps options was used. The snps option can either be a vector of indices indicating what SNPs to include or a list of SNPs by SNP ID. A vector of indices was used in this example.
In the example subject data, the covariate is coded (0,1) which the gweis routine sees a binary covariate to make the routine do a linear regression 1 can be added to the interaction covariate. This will change the coding to (1,2) which the routine sees as a continuous covariate.
covdata2 <- covdata
covdata2$e <- covdata2$e + 1
loggweis2 <- gweis(data = covdata2,
                   bdinfo = bdinfo,
                   snps = c("1:10001", "1:10002"))
#> [1] "200 subjects have complete data"| snp | betadg | lrtdg | betagxe | lrtgxe | lrt2df | betaeg | lrteg | lrt3df | betacase | lrtcase | betactrl | lrtctrl | 
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1:10001 | 0.7129 | 9.0230 | 0.4931 | 0.9520 | 9.9750 | 0.1001 | 3.4031 | 13.3781 | 0.1044 | 2.0128 | -0.0098 | 0.0147 | 
| 1:10002 | -0.0943 | 0.1219 | 0.6508 | 1.3569 | 1.4789 | 0.0194 | 0.0912 | 1.5700 | 0.0947 | 1.0967 | -0.0516 | 0.3688 | 
In this example the snps options was used with the SNP IDs.