Simulating data
Simulation without family history
genstats offers different methods of simulating data. First, we show how to simulate the genotypes of N subjects together with their genetic and full liabilities. We use a small-scale example here, but if the data are going to be used for actual statistical analysis, we recommend simulating at least \(10^4 \times 10^4\) data points, i.e. around \(10^4\) subjects and \(10^4\) SNPs. The data are simulated by the chunk below; further information about the parameters can be found in ?gen_sim.
gen_sim('simple_sim_example', N=1000, M=1000, K = 0.05, h2 = 0.5, C=50, block_size = 100, fam=FALSE)
There is now an .rds file in the working directory containing the data, which we will load.
simple_data = snp_attach('simple_sim_example.rds')
class(simple_data)
#> [1] "list"
Looking at the data, we see that the file contains a list in which the genotypes (genotypes), information for each SNP (map) and the liabilities (fam) can be found.
d1 = simple_data$genotypes[1:10,1:5]
d2 = simple_data$map[1:10,]
d3 = simple_data$fam[1:10,]
knitr::kables(
list(knitr::kable(d1, col.names = c(1,2,3,4,5), caption = 'Genotypes'),
knitr::kable(d2, align = "c", digits = 3, caption = 'Map'),
knitr::kable(d3, align = "c", digits = 2, caption = 'Fam')))
Each SNP \(j\) is sampled from the distribution \(\text{Binom}(2, \text{MAF}_j)\), the effect of each causal SNP is sampled from the distribution \(\mathcal{N}\left(0, \frac{h^2}{C}\right)\), and each minor allele frequency (MAF) is sampled from the distribution \(\text{Unif}(0.01, 0.49)\). The liabilities are computed as \(l_g = X\beta\) and \(l_f = l_g + l_e\), with the distributions \(l_g \sim \mathcal{N}(0, h^2),\ l_e \sim \mathcal{N}(0, 1 - h^2)\ \text{and}\ l_f \sim \mathcal{N}(0, 1)\). To clarify, \(h^2\) is defined as the variance of the genetic liability. A sketch of this sampling scheme is shown at the end of this section.

Using the function dist_check() we can see whether the distribution of the full liabilities is as expected. Since we are using a small data set, we cannot expect the mean and variance to be close to 0 and 1.
dist_check(simple_data)
#> mean_l0 variance_l0
#> -0.552989 1.634673
We can see that the mean and variance differ quite a lot from 0 and 1. As stated, this is a very small data set; with a larger simulation the distribution will be much closer to the expected values. The QQ plot, however, resembles a normal distribution. It is now possible to perform a GWAS for association analysis (see vignette("GWAS")) and/or use the genotypes for prediction (see vignette("Prediction")).
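For intuition, the sampling scheme described above can be reproduced in a few lines of base R. The following is only a minimal sketch under the stated distributions, not the implementation used by gen_sim(); in particular, the genotype standardisation step is an assumption of ours, and all variable names are hypothetical.

N <- 1000; M <- 1000; C <- 50; h2 <- 0.5          # same scale as the example above

maf <- runif(M, 0.01, 0.49)                       # MAF_j ~ Unif(0.01, 0.49)
X   <- sapply(maf, function(p) rbinom(N, 2, p))   # genotype of SNP j ~ Binom(2, MAF_j)

beta <- numeric(M)
causal <- sample(M, C)                            # pick C causal SNPs
beta[causal] <- rnorm(C, 0, sqrt(h2 / C))         # causal effects ~ N(0, h2 / C)

X_std <- scale(X)                                 # assumption: standardise genotypes so that var(X_std %*% beta) is roughly h2
l_g <- as.vector(X_std %*% beta)                  # genetic liability, approx. N(0, h2)
l_e <- rnorm(N, 0, sqrt(1 - h2))                  # environmental liability, N(0, 1 - h2)
l_f <- l_g + l_e                                  # full liability, approx. N(0, 1)

c(mean = mean(l_f), variance = var(l_f))          # compare with the output of dist_check()
qqnorm(l_f); qqline(l_f)                          # QQ plot of the full liability

With a larger N and M, the empirical mean and variance of l_f will, as for dist_check() above, land much closer to 0 and 1.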
Simulation with family history
Simulating with family history is very similar to simulating without it.
gen_sim('genetic_data', N=10000, M=10000, C=100, block_size = 1000, n_sib=2)
family_data = snp_attach('genetic_data.rds')
knitr::kable(family_data$fam[1:10,2:16], align = "c", digits = 2)
l_f_0 | l_g_0 | pheno_0 | l_f_p1 | l_g_p1 | pheno_p1 | l_f_p2 | l_g_p2 | pheno_p2 | l_f_s3 | l_g_s3 | pheno_s3 | l_f_s4 | l_g_s4 | pheno_s4 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
-0.34 | -0.71 | 0 | 0.87 | -0.49 | 0 | 0.10 | 0.28 | 0 | 0.51 | 0.21 | 0 | -1.50 | -0.11 | 0 |
1.43 | 0.49 | 0 | 0.78 | 1.10 | 0 | 0.61 | 0.46 | 0 | 2.06 | 1.31 | 1 | 1.36 | 1.00 | 0 |
-1.08 | -0.34 | 0 | -0.09 | 1.03 | 0 | -0.23 | -0.76 | 0 | -0.11 | -0.16 | 0 | -0.73 | -0.79 | 0 |
1.27 | 0.58 | 0 | -1.24 | -0.34 | 0 | 0.86 | 1.21 | 0 | 1.51 | 0.46 | 0 | 0.46 | 0.08 | 0 |
0.31 | 1.01 | 0 | 1.75 | 1.38 | 1 | 0.88 | 0.33 | 0 | 0.44 | 0.94 | 0 | 0.70 | 1.14 | 0 |
-0.51 | 0.13 | 0 | -0.42 | -0.15 | 0 | -0.03 | -0.60 | 0 | -0.41 | -0.58 | 0 | 0.10 | 0.91 | 0 |
0.32 | 0.10 | 0 | 0.21 | -0.17 | 0 | 0.64 | -0.18 | 0 | -1.17 | -0.39 | 0 | -0.13 | 0.27 | 0 |
0.58 | 0.77 | 0 | 2.17 | 2.78 | 1 | 0.27 | 0.83 | 0 | 1.76 | 2.33 | 1 | 2.31 | 1.23 | 1 |
0.23 | -0.26 | 0 | 0.52 | 0.66 | 0 | -1.55 | -1.73 | 0 | -1.04 | -0.34 | 0 | -1.29 | -0.96 | 0 |
-1.47 | -0.98 | 0 | 0.11 | -0.06 | 0 | -1.33 | -0.44 | 0 | -0.53 | -0.54 | 0 | 0.65 | 0.18 | 0 |
The fam object now also contains the family history: for each family member the same three values are stored as for the subject (full liability, genetic liability and phenotype), with suffixes _0 for the subject, _p1 and _p2 for the parents, and _s3 and _s4 for the two siblings. Again, we can check the distribution of each full liability using the same function as before.
dist_check(family_data)
#> mean_l0 variance_l0
#> 0.007328894 0.944272064
#> mean_l1 variance_l1
#> 0.005552209 1.046033033
#> mean_l2 variance_l2
#> 0.01255945 1.01357212
#> mean_l3 variance_l3
#> -0.02114777 0.96889338
#> mean_l4 variance_l4
#> -0.0031872 0.9506856
Now that a larger data set has been simulated, the means and variances are much closer to the expected values of the liabilities, and the QQ plots for each liability clearly resemble a normal distribution. The data are therefore ready to be analysed using LT-FH (see vignette("LT-FH")) and/or used in predictive analysis (see vignette("Prediction")).
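As an additional sanity check (not one performed by dist_check()), the observed case prevalence among the subjects can be compared with the prevalence parameter K, under the usual liability-threshold assumption that a subject is a case when its full liability exceeds \(\Phi^{-1}(1 - K)\). The value of K below is illustrative only, since the call above relied on the default; see ?gen_sim for the actual default.

K   <- 0.05                               # illustrative prevalence; see ?gen_sim for the default used above
thr <- qnorm(1 - K)                       # liability threshold implied by K

mean(family_data$fam$pheno_0)             # observed prevalence among subjects, expected to be close to K
mean(family_data$fam$l_f_0 > thr)         # proportion of subjects whose full liability exceeds the threshold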
Note on parallelisation, block size and memory use
The simulation functions have been optimised with parallelisation and by default use all available cores. Depending on the computer running the simulation, the size of the simulation, the block size and the amount of RAM, this can consume a lot of memory. If a user-specified number of cores is preferred, it can be set by running options(mc.cores = core_amount), where core_amount is the number of cores R is allowed to use. Furthermore, depending on the specified block size, R may need to allocate quite a bit of memory, which can cause an error on some computers depending on the available RAM. In that case it could be relevant to check https://www.rdocumentation.org/packages/utils/versions/3.6.2/topics/memory.size. If this does not fix the error, decreasing the block size and/or the number of cores is recommended.
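As an illustration, a memory-friendlier run of the example above might look like the sketch below. The specific values are arbitrary; only options(mc.cores = ...) and the block_size argument are taken from the note above.

options(mc.cores = 4)                     # allow R to use at most 4 cores

# A smaller block size than before reduces peak memory use, at the cost of some speed
gen_sim('low_memory_example', N = 10000, M = 10000, C = 100, block_size = 500, n_sib = 2)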