Simulating data
Simulation without family history
genstats offers different methods of simulating data. First, we show how to simulate the genotypes of N subjects together with their genetic and full liabilities. We use a small-scale example here, but if the data are going to be used for actual statistical analysis, we recommend simulating at least \(10^4 \times 10^4\) data points, i.e. around \(10^4\) subjects and \(10^4\) SNPs. The data are simulated by the chunk below; further information about the parameters can be found in ?gen_sim.
gen_sim('simple_sim_example', N=1000, M=1000, K = 0.05, h2 = 0.5, C=50, block_size = 100, fam=FALSE)
There is now an .rds file in the working directory containing the data, which we will load.
simple_data = snp_attach('simple_sim_example.rds')
class(simple_data)
#> [1] "list"
Looking at the data, we see that the file contains a list in which the genotypes (genotypes), information for each SNP (map) and the liabilities (fam) can be found.
d1 = simple_data$genotypes[1:10,1:5]
d2 = simple_data$map[1:10,]
d3 = simple_data$fam[1:10,]
knitr::kables(
list(knitr::kable(d1, col.names = c(1,2,3,4,5), caption = 'Genotypes'),
knitr::kable(d2, align = "c", digits = 3, caption = 'Map'),
knitr::kable(d3, align = "c", digits = 2, caption = 'Fam')))
Each SNP \(j\) is sampled from the distribution \(\text{Binom}(2, \text{MAF}_j)\), the effect of each causal SNP is sampled from the distribution \(\mathcal{N}\left(0, \frac{h^2}{C}\right)\), and each minor allele frequency (MAF) is sampled from the distribution \(\text{Unif}(0.01, 0.49)\). The liabilities are computed as \(l_g = X\beta\) and \(l_f = l_g + l_e\), with the distributions \(l_g \sim \mathcal{N}(0, h^2),\ l_e \sim \mathcal{N}(0, 1 - h^2)\ \text{and}\ l_f \sim \mathcal{N}(0, 1)\). To clarify, \(h^2\) is defined as the variance of the genetic liability. A sketch of this sampling scheme is shown at the end of this section.

Using the function dist_check() we can see whether the distribution of the full liabilities is as expected. Since we are using a small data set, we cannot expect the mean and variance to be close to 0 and 1.
dist_check(simple_data)
#> mean_l0 variance_l0
#> -0.552989 1.634673
We can see that the mean and variance differ quite a lot from 0 and 1. As stated, this is a very small data set; with a larger simulation the distribution will be much closer to the expected values. The QQ plot, however, resembles a normal distribution. It is now possible to perform a GWAS for association analysis (see vignette("GWAS")) and/or use the genotypes for prediction (see vignette("Prediction")).
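For intuition, the sampling scheme described above can be reproduced in a few lines of base R. The following is only a minimal sketch under the stated distributions, not the implementation used by gen_sim(); in particular, the genotype standardisation step is an assumption of ours, and all variable names are hypothetical.

N <- 1000; M <- 1000; C <- 50; h2 <- 0.5          # same scale as the example above

maf <- runif(M, 0.01, 0.49)                       # MAF_j ~ Unif(0.01, 0.49)
X   <- sapply(maf, function(p) rbinom(N, 2, p))   # genotype of SNP j ~ Binom(2, MAF_j)

beta <- numeric(M)
causal <- sample(M, C)                            # pick C causal SNPs
beta[causal] <- rnorm(C, 0, sqrt(h2 / C))         # causal effects ~ N(0, h2 / C)

X_std <- scale(X)                                 # assumption: standardise genotypes so that var(X_std %*% beta) is roughly h2
l_g <- as.vector(X_std %*% beta)                  # genetic liability, approx. N(0, h2)
l_e <- rnorm(N, 0, sqrt(1 - h2))                  # environmental liability, N(0, 1 - h2)
l_f <- l_g + l_e                                  # full liability, approx. N(0, 1)

c(mean = mean(l_f), variance = var(l_f))          # compare with the output of dist_check()
qqnorm(l_f); qqline(l_f)                          # QQ plot of the full liability

With a larger N and M, the empirical mean and variance of l_f will, as for dist_check() above, land much closer to 0 and 1.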
Simulation with family history
Simulating with family history is very similar to simulating without it.
gen_sim('genetic_data', N=10000, M=10000, C=100, block_size = 1000, n_sib=2)
family_data = snp_attach('genetic_data.rds')
knitr::kable(family_data$fam[1:10,2:16], align = "c", digits = 2)
l_f_0 | l_g_0 | pheno_0 | l_f_p1 | l_g_p1 | pheno_p1 | l_f_p2 | l_g_p2 | pheno_p2 | l_f_s3 | l_g_s3 | pheno_s3 | l_f_s4 | l_g_s4 | pheno_s4 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
-0.34 | -0.71 | 0 | 0.87 | -0.49 | 0 | 0.10 | 0.28 | 0 | 0.51 | 0.21 | 0 | -1.50 | -0.11 | 0 |
1.43 | 0.49 | 0 | 0.78 | 1.10 | 0 | 0.61 | 0.46 | 0 | 2.06 | 1.31 | 1 | 1.36 | 1.00 | 0 |
-1.08 | -0.34 | 0 | -0.09 | 1.03 | 0 | -0.23 | -0.76 | 0 | -0.11 | -0.16 | 0 | -0.73 | -0.79 | 0 |
1.27 | 0.58 | 0 | -1.24 | -0.34 | 0 | 0.86 | 1.21 | 0 | 1.51 | 0.46 | 0 | 0.46 | 0.08 | 0 |
0.31 | 1.01 | 0 | 1.75 | 1.38 | 1 | 0.88 | 0.33 | 0 | 0.44 | 0.94 | 0 | 0.70 | 1.14 | 0 |
-0.51 | 0.13 | 0 | -0.42 | -0.15 | 0 | -0.03 | -0.60 | 0 | -0.41 | -0.58 | 0 | 0.10 | 0.91 | 0 |
0.32 | 0.10 | 0 | 0.21 | -0.17 | 0 | 0.64 | -0.18 | 0 | -1.17 | -0.39 | 0 | -0.13 | 0.27 | 0 |
0.58 | 0.77 | 0 | 2.17 | 2.78 | 1 | 0.27 | 0.83 | 0 | 1.76 | 2.33 | 1 | 2.31 | 1.23 | 1 |
0.23 | -0.26 | 0 | 0.52 | 0.66 | 0 | -1.55 | -1.73 | 0 | -1.04 | -0.34 | 0 | -1.29 | -0.96 | 0 |
-1.47 | -0.98 | 0 | 0.11 | -0.06 | 0 | -1.33 | -0.44 | 0 | -0.53 | -0.54 | 0 | 0.65 | 0.18 | 0 |
The fam object now also contains the family history: for each family member the same three values are stored as for the subject (full liability, genetic liability and phenotype), with suffixes _0 for the subject, _p1 and _p2 for the parents, and _s3 and _s4 for the two siblings. Again, we can check the distribution of each full liability using the same function as before.
dist_check(family_data)
#> mean_l0 variance_l0
#> 0.007328894 0.944272064
#> mean_l1 variance_l1
#> 0.005552209 1.046033033
#> mean_l2 variance_l2
#> 0.01255945 1.01357212
#> mean_l3 variance_l3
#> -0.02114777 0.96889338
#> mean_l4 variance_l4
#> -0.0031872 0.9506856
Now that a larger data set has been simulated, the means and variances are much closer to the expected values of the liabilities, and the QQ plots for each liability clearly resemble a normal distribution. The data are therefore ready to be analysed using LT-FH (see vignette("LT-FH")) and/or used in predictive analysis (see vignette("Prediction")).
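As an additional sanity check (not one performed by dist_check()), the observed case prevalence among the subjects can be compared with the prevalence parameter K, under the usual liability-threshold assumption that a subject is a case when its full liability exceeds \(\Phi^{-1}(1 - K)\). The value of K below is illustrative only, since the call above relied on the default; see ?gen_sim for the actual default.

K   <- 0.05                               # illustrative prevalence; see ?gen_sim for the default used above
thr <- qnorm(1 - K)                       # liability threshold implied by K

mean(family_data$fam$pheno_0)             # observed prevalence among subjects, expected to be close to K
mean(family_data$fam$l_f_0 > thr)         # proportion of subjects whose full liability exceeds the threshold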
Note on parallelisation, block size and memory use
The simulation functions have been optimised with parallelisation and by default use all available cores. Depending on the computer running the simulation, the size of the simulation, the block size and the amount of RAM, this can consume a lot of memory. If a user-specified number of cores is preferred, it can be set by running options(mc.cores = core_amount), where core_amount is the number of cores R is allowed to use. Furthermore, depending on the specified block size, R may need to allocate quite a bit of memory, which can cause an error on some computers depending on the available RAM. In that case it could be relevant to check https://www.rdocumentation.org/packages/utils/versions/3.6.2/topics/memory.size. If this does not fix the error, decreasing the block size and/or the number of cores is recommended.
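As an illustration, a memory-friendlier run of the example above might look like the sketch below. The specific values are arbitrary; only options(mc.cores = ...) and the block_size argument are taken from the note above.

options(mc.cores = 4)                     # allow R to use at most 4 cores

# A smaller block size than before reduces peak memory use, at the cost of some speed
gen_sim('low_memory_example', N = 10000, M = 10000, C = 100, block_size = 500, n_sib = 2)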