R包-GCA

GCA 包可以计算群体间遗传关联。

安装包

1
2
remotes::install_github("QGresources/GCA")
library(GCA)

加载网页需要的包

1
2
3
4
5
library(BGLR)
library(corrplot)
library(GCA)
library(ggplot2)
library(gridExtra)

准备数据

模拟数据 GCcattle包含两个对象 cattle.pheno and cattle.W,分别为表型和基因组信息,具体信息可见

这是模拟数据,6个世代,基因型应该没有缺失。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
?GCcattle

# Load cattle dataset
data(GCcattle)

# Check phenotype, fixed covariates and pedigree information
str(cattle.pheno)
'data.frame': 2500 obs. of 6 variables:
$ Progeny : int 1 2 3 4 5 6 7 8 9 10 ...
$ Sire : int 0 0 0 0 0 0 0 0 0 0 ...
$ Dam : int 0 0 0 0 0 0 0 0 0 0 ...
$ Sex : chr "M" "M" "M" "M" ...
$ Unit : Factor w/ 8 levels "U1","U2","U3",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Phenotype: num -0.614 0.803 1.79 0.117 0.399 ...

# Check marker information
str(cattle.W)

1
2
> dim(cattle.W) # marker matrix
[1] 2500 10000

遗传力设为 0.6

1
2
sigma2a <- 0.6 # additive genetic variance
sigma2e <- 0.4 # residual variance

我们接下来构建固定效应的系数矩阵和基因组关系矩阵

1
2
X2 <- model.matrix(~ -1 + factor(cattle.pheno$Unit) + factor(cattle.pheno$Sex)) # incidence matrix of unit effect and sex
G <- computeG(cattle.W, maf = 0.05) # genomic relationship matrix; markers with minor allele frequency (maf) less than 0.05 are removed

查看一下 computeG 这个函数,基因型缺失为 NA ,缺失值默认用随机抽样得到。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
Compute genomic relationship matrix

Description:

Use single nucleotide polymorphisms markers to derive an additive
genomic relationship matrix. Missing markers are allowed, but
should be coded as NA.

Usage:

computeG(snpmatrix, maf = 0.05, impute = "rbinom", method = "G1")

Arguments:

snpmatrix: A marker matrix with the dimension of n by m, where the
elements are coded as 0, 1, 2, or NA, where n and m are the
total number of individuals and markers, respectively.

maf: A minor allele frequency cutoff for quality control. The
default minor allele frequency is 0.05.

impute: Perform genotype imputation for missing markers if
applicable. Two methods of 'mean' and 'rbinom' are available,
where the 'mean' imputes missing markers using mean and
'rbinom' imputes the missing markers by random sampling from
a binomial distribution. The default method is 'rbinom'. This
argument will be ignored if the ‘snpmatrix’ does not include
any missing markers.

method: A type of genomic relationship matrix including 'G1' and 'G2'
(VanRaden 2008). The default method is 'G1'.

计算群体关联的例子

选项说明

1
2
gca(Kmatrix, Xmatrix, sigma2a, sigma2e, MUScenario, statistic, NumofMU, 
Uidx = NULL, scale = TRUE, diag = TRUE)
  • Kmatrix: Genetic relationship matrix constructed from either pedigree or genomics.
  • Xmatrix: Fixed effects incidence matrix excluding intercept. The first column of the Xmatrix should start with unit effects followed by other fixed effects if applicable.
  • sigma2a and sigma2e: Estimates of additive genetic and residual variances, respectively.
  • MUScenario: A vector of fixed factor units.
  • statistic: Choice of connectedness statistic. Available options include
    1. PEV-derived functions: PEVD_IdAve, PEVD_GrpAve, PEVD_contrast, CD_IdAve, CD_GrpAve, CD_contrast, r_IdAve, r_GrpAve, and r_contrast.
    2. VE-derived functions: VED0, VED1, VED2, CDVED0, CDVED1, CDVED2, CR0, CR1, and CR2.
  • NumofMU: Return either pairwise unit connectedness (Pairwise) or overall connectedness across all units (Overall).
  • Uidx: An integer indicating the last column number of units in the Xmatrix. This Uidx is required for VED2, CDVED2, and CR2 statistics. The default is NULL.
  • scale (logical): Should sigma2a be used to scale statistic (i.e., PEVD_IdAve, PEVD_GrpAve, PEVD_contrast, VED0, VED1, and VED2) so that connectedness is independent of measurement unit? The default is TRUE.
  • diag (logical): Should the diagonal elements of the PEV matrix (i.e., PEVD_GrpAve, CD_GrpAve, and r_GrpAve) or the K matrix (CDVED0, CDVED1, and CDVED2) be included? The default is TRUE.

PEVD_GrpAve

下面我们以 PEVD_GrpAve 这个指标作为例子,其范围为 0 到 1

1
2
3
4
PEVD_GrpAve <- gca(Kmatrix = G, Xmatrix = X2, sigma2a = sigma2a, sigma2e = sigma2e,
MUScenario = as.factor(cattle.pheno$Unit), statistic = 'PEVD_GrpAve',
NumofMU = 'Pairwise')

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
> PEVD_GrpAve
U1 U2 U3 U4 U5 U6 U7
U1 NA 0.01668301 0.02314068 0.02845543 0.02468162 0.02908540 0.01920005
U2 0.01668301 NA 0.03679673 0.04326783 0.03956293 0.04428925 0.03464098
U3 0.02314068 0.03679673 NA 0.04837241 0.04439133 0.04987349 0.03808480
U4 0.02845543 0.04326783 0.04837241 NA 0.05041112 0.05712596 0.04589734
U5 0.02468162 0.03956293 0.04439133 0.05041112 NA 0.05062642 0.04002046
U6 0.02908540 0.04428925 0.04987349 0.05712596 0.05062642 NA 0.04647375
U7 0.01920005 0.03464098 0.03808480 0.04589734 0.04002046 0.04647375 NA
U8 0.01559363 0.03069310 0.03658017 0.04175780 0.03786799 0.04314658 0.03401520
U8
U1 0.01559363
U2 0.03069310
U3 0.03658017
U4 0.04175780
U5 0.03786799
U6 0.04314658
U7 0.03401520
U8 NA


移除对角线元素,并绘图

1
2
3
4
5
6
diag(PEVD_GrpAve) <- 0 
corrplot(PEVD_GrpAve, is.corr = FALSE, method ="circle", type = "upper",
diag = F, number.cex = 7 / ncol(PEVD_GrpAve), col = cm.colors(10), cl.lim = c(0, 0.08),
number.digits = 4, mar = c(0,1,1,1), addCoef.col = "black",
tl.col = "black", tl.srt = 90, tl.cex = 1.2)

如果设置 NumofMU = 'Overall' ,那么我们会得到所有配对的群体间PEVD 的均值

1
2
3
4
5
6
7
> PEVD_GrpAve_Overall <- gca(Kmatrix = G, Xmatrix = X2, sigma2a = sigma2a,
sigma2e = sigma2e, MUScenario = as.factor(cattle.pheno$Unit),
statistic = 'PEVD_GrpAve', NumofMU = 'Overall')
PEVD_GrpAve_Overall
[1] 0.03752627


当然,这里的统计量也可以换为 'PEVD_IdAve''PEVD_contrast',

PEVD_IdAve

1
2
3
4
5
6
7
8
9
10
11
PEVD_IdAve <- gca(Kmatrix = G, Xmatrix = X2, sigma2a = sigma2a, sigma2e = sigma2e,
MUScenario = as.factor(cattle.pheno$Unit), statistic = 'PEVD_IdAve',
NumofMU = 'Pairwise')
diag(PEVD_IdAve) <- 0
pdf("test2.pdf")
corrplot(PEVD_IdAve, is.corr = FALSE, method ="circle", type = "upper",
diag = F, number.cex = 7 / ncol(PEVD_GrpAve), col = cm.colors(10), cl.lim = c(0, 0.08),
number.digits = 4, mar = c(0,1,1,1), addCoef.col = "black",
tl.col = "black", tl.srt = 90, tl.cex = 1.2)
dev.off()

下图为两种方法结果的比对,左边是 PEVD_GrpAve ,右边是 PEVD_IdAve ,趋势是一致的。

v7aeDU.png

CD_GrpAve

CD 越大越好,下面的统计量同样可以改为 'CD_IdAve''CD_contrast'

1
2
3
4
5
6
7
8
9
10
11
12
CD_GrpAve <- gca(Kmatrix = G, Xmatrix = X2, sigma2a = sigma2a, sigma2e = sigma2e,
MUScenario = as.factor(cattle.pheno$Unit), statistic = 'CD_GrpAve',
NumofMU = 'Pairwise')
# replace NAs in diagnol to make plot
diag(CD_GrpAve) <- min(CD_GrpAve, na.rm = T)
pdf("CD_GrpAve.pdf")
corrplot(CD_GrpAve, is.corr = FALSE, method ="circle", type = "upper",
diag = F, number.cex = 7 / ncol(CD_GrpAve), col = cm.colors(10),
number.digits = 4, mar = c(0,1,1,1), addCoef.col = "black",
tl.col = "black", tl.srt = 90, tl.cex = 1.2, cl.lim = c(0.5, 0.8))
dev.off()

v7wpfs.png

结果和 PEVD 不一致。

r_GrpAve

1
2
3
4
5
6
7
8
9
10
11
r_GrpAve <- gca(Kmatrix = G, Xmatrix = X2, sigma2a = sigma2a, sigma2e = sigma2e,
MUScenario = as.factor(cattle.pheno$Unit),
statistic = 'r_GrpAve', NumofMU = 'Pairwise')
diag(r_GrpAve) <- 0
png("r_GrpAve.png")
corrplot(r_GrpAve, is.corr = FALSE, method ="circle", type = "upper",
diag = F, number.cex = 7 / ncol(PEVD_GrpAve), col = cm.colors(10), cl.lim = c(0, 0.08),
number.digits = 4, mar = c(0,1,1,1), addCoef.col = "black",
tl.col = "black", tl.srt = 90, tl.cex = 1.2)
dev.off()

全是负数

v7yP1J.png

VED

VED 越小越好

1
2
3
4
5
6
7
8
9
10
11
12
13
VED0  <- gca(Kmatrix = G, Xmatrix = X2, sigma2a = sigma2a, sigma2e = sigma2e,
MUScenario = as.factor(cattle.pheno$Unit), statistic = 'VED0',
NumofMU = 'Pairwise')
VED1 <- gca(Kmatrix = G, Xmatrix = X2, sigma2a = sigma2a, sigma2e = sigma2e,
MUScenario = as.factor(cattle.pheno$Unit), statistic = 'VED1',
NumofMU = 'Pairwise')
VED2 <- gca(Kmatrix = G, Xmatrix = X2, sigma2a = sigma2a, sigma2e = sigma2e,
MUScenario = as.factor(cattle.pheno$Unit), statistic = 'VED2',
NumofMU = 'Pairwise', Uidx = 8)
# 剔除对角线的缺失
diag(VED0) <- diag(VED1) <- diag(VED2) <- floor(min(VED0, na.rm = T))


分别绘图

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
png("VED0.png")
corrplot(VED0, is.corr =FALSE, method="circle", type = "upper",
diag = F, number.cex = 7 / ncol(VED0), col = cm.colors(10),
number.digits = 4, addCoef.col = "black", tl.col = "black",
tl.srt = 90, tl.cex = 1.2, cl.lim = c(0, 0.08))
dev.off()

png("VED1.png")
corrplot(VED1, is.corr =FALSE, method="circle", type = "upper",
diag = F, number.cex = 7 / ncol(VED0), col = cm.colors(10),
number.digits = 4, addCoef.col = "black", tl.col = "black",
tl.srt = 90, tl.cex = 1.2, cl.lim = c(0, 0.08))
dev.off()

png("VED2.png")
corrplot(VED2, is.corr =FALSE, method="circle", type = "upper",
diag = F, number.cex = 7 / ncol(VED0), col = cm.colors(10),
number.digits = 4, addCoef.col = "black", tl.col = "black",
tl.srt = 90, tl.cex = 1.2, cl.lim = c(0, 0.08))
dev.off()

三个趋势一致,并且和 PEVD 趋势一致。

只是 VED1 和 VED2 结果完全一致。

CR

CR 越大越好

1
2
3
4
5
6
7
8
9
10
11
12
13
CR0  <- gca(Kmatrix = G, Xmatrix = X2, sigma2a = sigma2a, sigma2e = sigma2e,
MUScenario = as.factor(cattle.pheno$Unit), statistic = 'CR0',
NumofMU = 'Pairwise')
CR1 <- gca(Kmatrix = G, Xmatrix = X2, sigma2a = sigma2a, sigma2e = sigma2e,
MUScenario = as.factor(cattle.pheno$Unit), statistic = 'CR1',
NumofMU = 'Pairwise')
CR2 <- gca(Kmatrix = G, Xmatrix = X2, sigma2a = sigma2a, sigma2e = sigma2e,
MUScenario = as.factor(cattle.pheno$Unit), statistic = 'CR2',
NumofMU = 'Pairwise', Uidx = 8)
# 剔除对角线的缺失
diag(CR0) <- diag(CR1) <- diag(CR2) <- floor(min(CR0, na.rm = T))


分别绘图

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
png("CR0.png")
corrplot(CR0, is.corr =FALSE, method="circle", type = "upper",
diag = F, number.cex = 7 / ncol(CR0), col = cm.colors(10),
number.digits = 4, addCoef.col = "black", tl.col = "black",
tl.srt = 90, tl.cex = 1.2, cl.lim = c(0, 0.08))
dev.off()

png("CR1.png")
corrplot(CR1, is.corr =FALSE, method="circle", type = "upper",
diag = F, number.cex = 7 / ncol(CR0), col = cm.colors(10),
number.digits = 4, addCoef.col = "black", tl.col = "black",
tl.srt = 90, tl.cex = 1.2, cl.lim = c(0, 0.08))
dev.off()

png("CR2.png")
corrplot(CR2, is.corr =FALSE, method="circle", type = "upper",
diag = F, number.cex = 7 / ncol(CR0), col = cm.colors(10),
number.digits = 4, addCoef.col = "black", tl.col = "black",
tl.srt = 90, tl.cex = 1.2, cl.lim = c(0, 0.08))
dev.off()

趋势和 PEVD 完全相反。

下图为 CR0

v7yAn1.png

不同统计量的关系

PEVD_GrPAve 和 VED0, VED1, and VED2 强相关,基本为 1

CD_GrpAve 和 CDVED0, CDVED1, and CDVED2 强相关,基本为1

r_GrPAve 和 CR0, CR1, and CR2 相关很强,但是只有0.94-0.99

关联率和GS准确性

在全基因组评估中,参考群和验证群的关系会影响预测准确性。

在模拟数据中,模拟了5种情况,通过交换 0, 140, 210, 280 和 350 个个体来逐步增加参考群和验证群的群体关联, 计算 PEVD_GrpAve 和 CD_GrpAve

从下面可以看到,从 MUSC1 到 MUSC5 ,PEVD_GrpAve 逐级递减,说明群体关联越来越高,而 CD_GrpAve 先高后低,这是由于 CD penalizes the connectedness measure when the genetic variability among population is reduced

1
2
3
4
5
6
7
8
9
10
11
12
13
# PEVD_GrpAve
PEVD_GrpAve <- apply(CattleSIM[, 7: 11], 2, function(x) GC(x,
Kmatrix = G, sigma2a, sigma2e, stat = 'PEVD_GrpAve'))
print(PEVD_GrpAve)
## MUSC1 MUSC2 MUSC3 MUSC4 MUSC5
## 0.0039430647 0.0018066760 0.0013236613 0.0010374891 0.0009192205
# CD_GrpAve
CD_GrpAve <- apply(CattleSIM[, 7: 11], 2, function(x) GC(x,
Kmatrix = G, sigma2a, sigma2e, stat = 'CD_GrpAve'))
print(CD_GrpAve)
## MUSC1 MUSC2 MUSC3 MUSC4 MUSC5
## 0.6679398 0.7586556 0.7626448 0.7503289 0.7140141

通过交叉验证计算模拟数据的准确性(cor(gebv, tbv))

群体关联和GS准确性的关系见下图。

v76ern.png

参考文献

  1. https://qgresources.github.io/GCA_Vignette/GCA.html
  • 版权声明: 本博客所有文章除特别声明外,著作权归作者所有。转载请注明出处!
  • Copyrights © 2019-2024 Vincere Zhou
  • 访问人数: | 浏览次数:

请我喝杯茶吧~

支付宝
微信