软件学习-HIBLUP

2025-12-29

数据分析

字数统计: 8.2k | 阅读时长≈ 36 分钟

HIBLUP 是一个育种分析软件。

输入数据

HIBLUP 基本上所有输入文件都要求标题，下面就不再赘述这一点。

常规输入文件

表型数据（--pheno）：有标题，第一列必须是个体号 （这样就不用特殊指定个体号是加性效应或显性效应了），缺失用NA
系谱数据（--pedigree）：有标题，缺失用NA
基因型数据（--bfile）：二进制格式，样本名称需要提前更改为个体号。

非常规输入文件

SNP权重文件：有标题，第一列是SNP名称，其它列是SNP权重（加性，显性，必须是正数，必须是加性在前显性在后，标题不重要）

通过 --snp-weight 选项来使用这个文件，示例如下
1
2
3
4
SNPid weight
SNP1 42
SNP2 0.3
SNP3 100
群体分类文件：有标题，2列，第一列是样本名称，第二列是所属群体

通过 --pop-class 选项来使用这个文件，示例如下
1
2
3
4
5
6
ID Breed
Ind1 DD
Ind2 YY
Ind3 LL
Ind4 YY
Ind5 LL
亲缘关系矩阵文件：二进制文件，举例为 demo.GA.id 和 demo.GA.bin ，其中 demo.GA.id是一列个体号，而 demo.GA.bin 是按照密集矩阵的方式存储的矩阵下三角元素的内容（一个元素接着一个元素，每个元素占4个字节，也就是单精度），而 demo.GA.sparse.bin 是按照稀疏矩阵方式存储的下三角矩阵的元素（行，列，值）。

这个文件通过 --make-xrm 输出得到，通过 --xrm [prefix] 来使用，举例如下
1
./hiblup ... --xrm demo.GA ...
如果是多个亲缘关系矩阵，则用逗号隔开，如下
1
./hiblup ... --xrm demo.GA,demo.GD ...
估计育种值结果文件：第一列是个体号，其他列是加性或显性育种值（如果同时使用加性和显性效应，则必须满足加性在前，显性在后的顺序。HIBLUP 应该就是通过列号来判断是哪个效应，标题不重要）。

通过 --gebv 来使用这个文件，举例如下
1
2
3
4
id add
Ind1 -1.1345
Ind2 0.3245
Ind3 0.0234
SNP效应文件：SNP效应值用于预测个体的育种值，或者选配。至少需要5列。

通过 --score 来使用这个文件，举例如下（同样是加性在前显性在后）
1
2
3
4
SNPid a1 a2 freq_a1 add dom
SNP1 A C 0.1243 0.30134 0.00000
SNP2 G T 0.0345 -0.06324 0.00124
SNP3 A G 0.3635 0.15425 -0.00913

GWAS结果文件（Summary data file）：GWAS的结果文件，格式见 GCTA 软件的 COJO 格式

通过 --sumstat 来使用这个文件，举例如下（其它不解释，这里NMISS应该是非缺失的样本数。这里A1必须是用于计数的碱基，即计算标记效应的碱基；同样“FREQ”必须是A1碱基的基因频率）。

SNP A1 A2 FREQ BETA SE P NMISS
M1 G T 0.5181 -1.565 1.155 0.1762 500
M2 A G 0.145 -1.77 1.519 0.2445 500
M3 G A 0.3206 1.498 1.583 0.3445 500
M4 C G 0.5356 0.3366 1.003 0.7374 500
M5 C G 0.0975 1.27 1.755 0.4695 500

基因组窗口文件（Genome windows file）：通过这个文件可以用于LD分析等。

通过 --window-file 来使用这个文件，举例如下（注意窗口之间不能重叠）

chr    start    stop
1    10583    1577084
1    1577084    2364990
1    2364990    3150345
1    3150345    4284187
1    4284187    4854314
2    10133    341834
2    341834    1161563
2    1161563    1688845
2    1688845    2829810
2    2829810    3389305

样本和位点过滤文件：一列样本名称或位点名称，无标题（注意这里没有标题）

样本过滤通过 --keep 和 --remove 来保留和剔除样本。

位点过滤通过 --extract 和 --exclude 来保留和剔除位点。

其实这里就和 PLINK 一样。

基本选项

除了上面的输入文件之外，如 --pheno ，--pedigree ，--bfile 之外，还有一些描述因子和模型的基本选项如下。

--pheno-pos n1 n2 ：指定性状所在列，如果是多列用空格隔开
--qcovar n1,n2: 指定协变量所在列（quantitative covariates），多列用逗号隔开
--dcovar n1,n2: 指定固定因子所在列 (discrete covariates)，多列用逗号隔开
-rand n1,n2: 指定环境随机因子所在列，多列用逗号隔开

上面这种用逗号分隔开的写法是用于单性状模型的。对于多性状模型，此时要对所有性状的因子进行设定，使用逗号作为性状内因子的分隔符号，不同因子之间用空格隔开，0表示对于这个性状没有这类因子。

举个例子，下面是一个3性状模型，这里第1个性状有2个固定因子（2,3），第2个性状没有固定因子（0），第3个性状有3个固定因子（2,3,6）

1	./hiblup ... --dcovar 2,3 0 2,3,6 ...

其它选项

--threads 32 ：设置多线程数目，如果不设置则会从 standard OpenMP environment variable 使用最大的线程数，并且使用所有可用资源来提高计算效率（和 beagle 一样）。

分析

构建亲缘关系矩阵

输出的亲缘关系矩阵有2种格式，密集矩阵形式和稀疏矩阵形式（具体见上面的”非常规输入文件“）。

A阵

构建 A 阵和A逆矩阵命令如下（--make-xrm --pedigree file --add --add-inv），会生成 demo.PA.bin, demo.PA.id, demo.PAinv.sparse.bin, demo.PAinv.sparse.id 。（对于HIBLUP而言，其分析只需要亲缘关系矩阵，逆矩阵是不需要的，所以可以不加 --add-inv ）

# construct A and its inverse simultaneously
./hiblup --make-xrm 
         --pedigree demo.ped
         --add 
         --add-inv   #wether to construct its inverse matrix
         --thread 32 
         --out demo

因为 *.bin 是二进制文件，正常不能打开，如果想要阅读内容的话，可以添加一个参数 --write-txt ，就会输出文本文件。

有些时候可能会遇到矩阵不可逆的报错，如下，原因是矩阵非正定，此时可以通过对 --ridge-value 指定一个很小的数通过岭回归的方式来解决（就是对对角线添加一个很小的数）。

1	Error: matrix is not invertible,…

Ga 和 Gd矩阵

构建G阵的算法

HIBLUP 计算G阵的公式如下，其中是矩阵的迹。

矩阵每一行是一个样本，每一列是一个位点的基因型编码。在 HIBLUP 中，我们有4种方式来构建加性效应和显性效应的矩阵，如下表

其中等是基因频率，而等是基因型频率。

使用 --code-method n 来选择使用那种基因型编码方式，4中基因型编码方式的参考文献见下

–code-method 1, (default) Su, Guosheng, et al. “Estimating additive and non-additive genetic variances and predicting genetic merits using genome-wide dense single nucleotide polymorphism markers.” Plos one (2012): e45293.
–code-method 2, Zeng, Zhao-Bang, Tao Wang, and Wei Zou. “Modeling quantitative trait loci and interpretation of models.” Genetics 169.3 (2005): 1711-1725.
–code-method 3, Yang, Jian, et al. “Common SNPs explain a large proportion of the heritability for human height.” Nature genetics 42.7 (2010): 565-569.
–code-method 4, Vitezica, Zulma G., et al. “Orthogonal estimates of variances for additive, dominance, and epistatic effects in populations.” Genetics 206.3 (2017): 1297-1307.

构建G阵

构建 G 阵的命令如下（答疑说--step是用来控制内存使用的，值越小内存使用越小，但是还是没有说清楚），这里就是使用基因型数据替换了系谱数据

# construct A and D simultaneously
./hiblup --make-xrm --code-method 2
         --bfile demo 
         --add --dom 
         --step 10000 
         --thread 32 
         --out demo
         #--write-txt  #to output a readable ".txt" file

这个命令会生成 demo.GA.bin, demo.GA.id, demo.GD.bin, demo.GD.id 。

如果需要生成逆矩阵，可以添加选项 --add-inv 和 --dom-inv 。

多个群体的G阵

对于多个群体（如多个品种）的基因型数据，HIBLUP 可以构建一个混合的G阵，公式如下

其中和是每个群体用它自身的等位基因频率计算得到的矩阵（因为每个群体的基因频率不同），和是每个群体的样本数目。

运行命令如下，单纯就是增加了一个 --pop-class 选项，即群体分类文件。

# construct mixed GRM
./hiblup --make-xrm --code-method 1
         --bfile demo 
         --pop-class demo.popclass.txt
         --step 10000 
         --thread 32 
         --out demo

加权G阵

命令如下，增加一个SNP权重文件

./hiblup --make-xrm --code-method 1
         --bfile demo 
         --add --dom 
         --snp-weight snp.wt.txt 
         --thread 32 
         --out demo

H 阵

H 矩阵构建公式如下，这个公式就是经典公式。

这里我们对按照非基因型个体在前, 有基因型在后的顺序排列,并且划分为 4 个子矩阵。其中和是 A 矩阵中对于非基因型个体和基因型个体的子矩阵，而和是另外2个分块子矩阵。

这里有一点和 blupf90 不一样，blupf90 是先 blending (A 阵和原始G 阵的权重比例) ，再 tuning （使得 A 阵和 G阵的取值范围相同），这里相反。

首先这里是对原始 G 阵调整元素取值范围，使得其对角线元素均值和非对角线均值与 A阵相同，得到矩阵（tuning）。然后设置 G 阵和 A 阵的比例，公式如下，得到，最后就是计算 H 阵。

命令如下，这里就是同时使用系谱和基因型数据即可。得到文件 demo.HA.bin, demo.HA.id, demo.HD.bin, demo.HD.id 。

./hiblup --make-xrm 
         --pedigree demo.ped 
         --bfile demo 
         --alpha 0.05 
         --add --dom 
         --out demo

同样这里使用 --snp-weight 可以构建加权 G 矩阵，或者使用 --pop-class 可以构建混合G矩阵，然后再用于构建 H 矩阵。

但是这里你可以提前构建好的 G 阵来构建 H 阵，这里 G 阵文件必须是二进制格式，命令如下

./hiblup --make-xrm 
         --pedigree demo.ped 
         --xrm yourGRM     # yourGRM.bin, yourGRM.id
         --alpha 0.05 
         --add             # or '--dom' is supported
         --out demo

同样，你可以用 --add-inv 和 --dom-inv 计算H矩阵的逆矩阵，得到 demo.HAinv.sparse.bin, demo.HAinv.sparse.id, demo.HDinv.sparse.bin, demo.HD.sparse.id 。

环境随机因子的亲缘关系矩阵

此时需要提供表型文件，命令如下（正常环境随机因子的亲缘关系矩阵不是单位矩阵嘛）

./hiblup --make-xrm
         --pheno demo.phe
         --rand 6,7
         --out demo

输出 demo.loc.sparse.bin, demo.lco.sparse.id, demo.dam.sparse.bin, demo.dam.sparse.id 。

备注

HIBLUP 生成的密集矩阵格式的亲缘关系矩阵文件，均可以处理后直接被 GCTA, LDAK 使用，处理过程如下

awk '{print $1,$1}' demo.GA.id > gcta.grm.id
cp demo.GA.bin gcta.grm.bin
# then use it by GCTA
./gcta64 ... --grm gcta ...

但是 GCTA, LDAK 生成的亲缘关系矩阵文件可以直接被 HIBLUP 使用，而不用转换格式。

文件格式转换

亲缘关系矩阵，从二进制文件格式转为文本文件格式，命令如下

./hiblup --trans-xrm 
         --xrm demo.GA 
         --out demo
#        --triangle

输出 demo.txt 和 demo.id.txt ，这里 demo.txt 就是矩阵文本格式，以3个个体为例，其内容如下

1
2
3

1.55755    1.04847    0.64622
1.04847    1.58144    0.66971
0.64622    0.66971    1.56794

如果添加选项 --triangle ，那么会输出下三角矩阵的的三元组格式（稀疏矩阵格式），此时 demo.txt 内容如下

1    1    1.55755
2    1    1.04847
2    2    1.58144
3    1    0.646224
3    2    0.669715
3    3    1.56794

相反，将文本格式转为二进制格式命令如下

./hiblup --trans-xrm 
         --xrm-txt demo.txt
         --xrm-id demo.id.txt
         --out demo

这里的 demo.txt 是矩阵格式，如果是三元组格式则需要加上选项 --triangle 。

注意：HIBLUP 可以直接将 GCTA, LDAK 生成的二进制格式转为文本格式。

近交系数和亲缘系数

通过 --ibc 获得近交系数，命令如下

# derived from pedigree
./hiblup --ibc --pedigree demo.ped --thread 32 --out demo
# or derived from prior calculated PA matrix
./hiblup --ibc --xrm demo.PA --out demo

# derived from genotype
./hiblup --ibc --bfile demo --thread 32 --out demo
# or derived from prior calculated GA matrix
./hiblup --ibc --xrm demo.GA --out demo

输出近交系数文件 demo.ibc

id        ibc
IND0702  0.00290048
IND0703  0.00262272
IND0704  0.00262272

通过 --rc 命令获得亲缘系数（从结果来看，亲缘系数就是对加性遗传相关做了处理后的），命令如下

这里基因型数据可以通过 --code-method 设置不同的构建G阵的方法。

# derived from pedigree
./hiblup --rc --pedigree demo.ped --thread 32 --out demo
# or derived from prior calculated PA matrix
./hiblup --rc --xrm demo.PA --out demo

# derived from genotype
./hiblup --rc --bfile demo --thread 32 --out demo
# or derived from prior calculated GA matrix
./hiblup --rc --xrm demo.GA --out demo

输出亲缘系数文件 demo.rc

id1     id2     rc
IND0701 IND0701 1
IND0701 IND0702 0.472652
IND0701 IND0703 0.0721178
IND0701 IND0704 0.0721178

PCA

通过基因型数据进行PCA分析如下，可以使用 --npc n来控制输出主成分的数目，而不是所有主成分

1	./hiblup --pca --bfile demo --out demo

从G阵进行PCA分析

1	./hiblup --pca --xrm demo.GA --npc 5 --out demo

输出 demo.pc 文件如下，就是PCA主成分结果

id       PC1     PC2     PC3     PC4     PC5
IND0701  0.0413326       -0.00664808     -0.00795809     0.0218009       -0.0156735
IND0702  0.00856316      0.0378125       0.000503819     0.0205049       0.00972465
IND0703  -0.00793242     0.0458023       -0.00438572     -0.0173733      -0.0254701
IND0704  -0.00793248     0.0458023       -0.00438575     -0.0173733      -0.0254701

输出 demo.pcp 如下，包含了解释方差的比例和累积比例

Components      PC1     PC2     PC3     PC4     PC5
Standard deviation      4.23976    4.10211    3.76024    3.66821    3.60725
Proportion of Variance  0.0224695  0.0210342  0.0176743  0.0168197  0.0162653
Cumulative Proportion   0.0224695  0.0435037  0.0611779  0.0779976  0.094263

估计方差组分-单性状模型

命令

命令如下，这里都有相应注释，不再解释。

这里默认就是会估计方差组分的。

./hiblup --single-trait
         --pheno demo.phe
         --pheno-pos 8
         --dcovar 2,3       #fixed effect
         --qcovar 4,5       #covariates
         --rand 6,7         #non-genetic (environmental) random effect
         --pedigree demo.ped    #genetic random effect
         --bfile demo           #genetic random effect
         --add --dom 
         --threads 32 
         --out demo

这里有 5 种可选的估计方差组分的算法，如下，你可以通过 --vc-method 来选择其中一种方法，并且通过 --ai-maxit 和 --em-maxit 来设置 AIREML 和 EMREML 算法的最大迭代次数。

对于 HE 算法，如果模型中存在协变量，HIBLUP 会先使用表型对这些协变量做回归，再用得到的残差执行 HE 算法。

AI
EM
HE : He regression
EMAI : 这里没有解释这种方法
HI : 即 HE + AI ，将 HE 的估计值作为 AIREML 的初始值

但是对于多个性状而言，每次都重复构建 H 阵或 G阵可能比较耗时，此时我们可以只构建一次 G阵，然后使用 --xrm 命令来使用G阵文件（替换 --bfile ），得到相同的结果。

举例如下

# Step 1: construct XRMs
./hiblup --make-xrm
         --pedigree demo.ped
         --bfile demo
         --add --dom
         --out demo

# Step 2: link XRMs to fit model
./hiblup --single-trait 
         --pheno demo.phe 
         --pheno-pos 8 
         --dcovar 2,3 
         --qcovar 4,5 
         --rand 6,7 
         --xrm demo.HA,demo.HD 
         --add --dom 
         --threads 32 
         --out demo

我们也可以通过 --vc-priors 来提供方差组分初始值来加快收敛，具体使用方式见下面”遗传评估-已知方差组分“部分。

结果文件

demo.vars ：方差和每个方差组分的”遗传力“

    Var    Var_SE    h2    h2_SE
loc    12.2127    8.9109    0.1035    0.0682
dam    10.6268    4.8605    0.0901    0.041
HA    59.2007    11.7641    0.5019    0.0807
HD    28.9946    8.9294    0.2458    0.083
e    6.9232    1.6957    0.0587    0.0161

demo.beta ：固定因子和协变量的估计值

Levels    Estimation    SE
mu    32.5285    3.4758
sex_F    6.2933    1.7593
sex_M    26.2353    1.7634
season_Autumn    18.0905    0.9753
season_Spring    -1.9168    0.9904
season_Summer    7.8471    1.0066
season_Winter    8.5078    0.9944
day    0.1547    0.0574
bornweight    1.5703    0.4614

demo.anova ：所有固定因子和协变量的方差分析的表格

Factors    Df    SumSq    MeanSq    F    Pr(>F)
sex    1    49710.3845    49710.3845    7180.311    <2e-16
season    3    24515.0941    8171.698    1180.344    <2e-16
day    1    925.3352    925.3352    133.658    <2e-16
bornweight    1    437.3432    437.3432    63.171    1.30E-14
e    493    3413.1141    6.9232

demo.rand ：随机因子的预测值（育种值会命名为 “PA”, “GA”, or “HA” 等）

单形状重复力模型

命令如下，没什么区别，就是增加一个 --rand 1 ，即将第一列作为环境随机因子而已，即永久环境效应。

./hiblup --single-trait 
         --pheno demo.repeat.phe
         --pheno-pos 2
         --rand 1
         --bfile demo
         --threads 32
         --out demo

注意：HIBLUP 软件估计方差组分时会使用表型数据的方差协方差矩阵，因此如果数据重复次数低（如1000个个体具有2000行数据，平均重复2次），此时 HIBLUP 会比 MME-based 软件更快。但是如果数据重复次数高（如200个个体具有2000行数据，平均重复10次），此时 MME-based 软件会更快。

估计方差组分-多性状模型

命令

这里 HIBLUP还支持2中情况：同一个群体的多个性状，或多个群体的同一个性状。而且支持不同性状具有完全不同的固定因子和随机因子。

以 GBLUP 为例，命令如下（这里性状内的因子用逗号分隔，性状间用空格分隔）

./hiblup --multi-trait
         --pheno demo.phe
         --pheno-pos 8 9 10
         --qcovar 4,5 5 4
         --dcovar 2,3 0 2
         --rand 6,7 7 0
         --xrm demo.GA,demo.GD  # same with --bfile demo --add --dom
         --vc-method AI
         --ai-maxit 30
         --threads 32
         --out demo

默认情况下 HIBLUP 认为性状间的残差协方差存在，如果添加选项 --ignore-cove 可以将性状间的残差协方差视为不存在（看来这里没有更灵活的设置方式，只能都存在或都不存在）。

输出文件

demo.vars ：每个性状的方差

Item    Var    Var_SE    h2    h2_SE
tr1_loc    13.73854    9.9309    0.11739    0.07538
tr1_dam    2.29288    3.18932    0.01959    0.02731
tr1_demo.GA    61.35173    9.57171    0.52423    0.06964
tr1_demo.GD    29.93077    6.27996    0.25575    0.06022
tr1_e    9.71917    1.67088    0.08305    0.01696
tr2_dam    3.42172    3.80424    0.02971    0.03308
tr2_demo.GA    63.59    10.52058    0.55219    0.06017
tr2_demo.GD    38.14332    7.27491    0.33122    0.06696
tr2_e    10.00395    1.63578    0.08687    0.01578
tr3_demo.GA    21.03906    6.94184    0.2203    0.0668
tr3_demo.GD    15.25593    10.56861    0.15974    0.10919
tr3_e    59.20689    9.04084    0.61996    0.09801

demo.covars ：每个性状的遗传效应和残差的协方差，相关系数和SE

Item    COVar    COVar_SE    r    r_SE
tr1:tr2_demo.GA    46.6899    8.56852    0.747524    0.171364
tr1:tr3_demo.GA    16.3376    5.96333    0.454749    0.185599
tr2:tr3_demo.GA    12.3489    6.21401    0.337612    0.157671
tr1:tr2_demo.GD    26.9273    4.65452    0.796922    0
tr1:tr3_demo.GD    -2.25039    4.95673    -0.105301    0.23413
tr2:tr3_demo.GD    8.23124    5.21494    0.341194    0.259133
tr1:tr2_e    0.24039    1.17428    0.024379    0.119333
tr1:tr3_e    -1.10713    2.87487    -0.0461537    0.11985
tr2:tr3_e    -4.80127    2.85341    -0.197284    0.128375

demo.*.anova ：同上

demo.*.beta: 同上

demo.*.rand: 同上

注意事项

用户可以通过--vc-priors 和 --covc-priors 来设置方差和协方差的初始值，具体使用方式见下面”遗传评估-已知方差组分“部分。
HIBLUP 不会估计环境因子的协方差组分（这是个缺陷啊，不清楚原因，我感觉HIBLUP也就是跑单性状合适）。如果一定要估计环境因子的协方差组分，可以通过 --make-xrm 和 --xrm 的方式通过创建亲缘关系矩阵的方式来计算（还是不太懂）。

遗传评估-已知方差组分

命令

这里 HIBLUP 是--pheno-pos 中参数的数目自动判断性状数目的，因此不需要像上面那样特殊指定是单性状模型还是多性状模型（其实我感觉上面其实也不需要特殊指定啊）。

命令举例如下

# (1) additive effect for single trait model
./hiblup --mme
         --pheno demo.phe
         --pheno-pos 8
         --rand 6
         --xrm demo.GA   # can be replaced by --bfile demo --add
         --vc-priors v1,v2,v3
         --pcg
         --threads 32
         --out demo

# (2) additive and dominant effect for multiple traits model
./hiblup --mme
         --pheno demo.phe
         --pheno-pos 8 9 10
         --rand 6,7 7 0
         --bfile demo   # can be replaced by --xrm demo.GA,demo.GD
         --add --dom
         --vc-priors t1_v1,t1_v2,t1_v3,t1_v4,t1_v5 t2_v1,
                     t2_v2,t2_v3,t2_v4 t3_v1,t3_v2,t3_v3
         --covc-priors cov1,cov2,cov3 cov4,cov5,cov6 cov7,cov8,cov9
         --pcg
         --threads 32
         --out demo

默认情况下，HIBLUP 会直接计算矩阵来求解，用户也可以通过添加 --pcg 改为使用 PCG 算法来求解。

方差

这里 --vc-priors 是方差部分，包括环境随机因子，加性方差，残差（必须按照这个顺序）。

单性状模型内容解释如下，性状内用逗号分隔

v1: the variance of the environmental random effect located
    at the 6th column in phenotype file.
v2: the additive genetic variance.
v3: the residual variance.

多性状模型，方差组分是一个性状一个性状排列，性状内用逗号分隔，性状间用空格分隔，解释如下

t1_v1: the variance of the environmental random effect located 
       at the 6th column in phenotype file for the first trait.
t1_v2: the variance of the environmental random effect located 
       at the 7th column in phenotype file for the first trait.
t1_v3: the variance of additive genetic effect for the first trait.
t1_v4: the variance of dominant genetic effect for the first trait.
t1_v5: the variance of residuals for the first trait.
t2_v1: the variance of the environmental random effect located 
       at the 7th column in phenotype file for the second trait.
t2_v2: the variance of additive genetic effect for the second trait.
t2_v3: the variance of dominant genetic effect for the second trait.
t2_v4: the variance of residuals for the second trait.
t3_v1: the variance of additive genetic effect for the third trait.
t3_v2: the variance of dominant genetic effect for the third trait.
t3_v3: the variance of residuals for the third trait.

协方差

协方差组分包括遗传效应和残差，这里是按照性状的顺序进行排列，单个因子内使用逗号分隔，不同因子之间使用空格分隔，顺序如下图

因子间的互作

E by E

2个环境因子之间的互作效应使用 ”:“ 号表示，可以作为固定因子，也可以作为随机因子

作为固定因子示例如下，表示固定因子为 2 和 2:3:6 （3个因子之间的互作，这里互作效应应该就是将多列合并为一列）

1	./hiblup ... --dcovar 2,2:3:6

同理，作为随机因子示例如下

1	./hiblup ... --rand 6,6:7,7

多性状模型同理如下

1	./hiblup ... --dcovar 2:3 0 2,3,2:3 --rand 6 7,6:7 0

G by G

HIBLUP 可以拟合遗传效应之间的互作，如加性和显性，显性和显性等。用于互作的因子数目没有限制，个遗传因子的互作效应可以通过下面的式子表示：

这里是样本数，是 Hadamard 积（对应元素乘积，两个矩阵的对应位置元素直接相乘），是迹函数。

加入 G by G 效应的唯一做法是通过 --xrm 来使用不同的 G 阵，用冒号分隔，举例如下

# fit A by D only
./hiblup ... --xrm demo.GA:demo.GD
#fit A, D, and A by D
./hiblup ... --xrm demo.GA,demo.GD,demo.GA:demo.GD
#fit A by A, A by D and D by D simultaneously
./hiblup ... --xrm demo.GA:demo.GA,demo.GA:demo.GD,demo.GD:demo.GD

G by E

G by E 效应会作为随机因子加入到模型中，这里需要使用随机因子在表型中的列号，以及G阵前缀，并使用 --rand-gxe 选项。

对于单性状模型，在互作效应内部使用冒号，示例如下

./hiblup --single-trait ... --rand-gxe 6:demo.GA,6:demo.GD
#interaction among multiple effects
./hiblup --single-trait ... --rand-gxe 6:demo.GA:demo.GD
./hiblup --single-trait ... --rand-gxe 6:7:demo.GA

对于多性状模型，以3性状模型示例如下（0表示这个性状没有G by E效应，不同性状之间用空格分隔）

1	./hiblup --multi-traits ... --rand-gxe 6:demo.GA 6:demo.GA,7:demo.GA 0

可靠性和PEV

你可以通过添加 --r2 来计算可靠性，以单性状模型举例如下

./hiblup --single-trait
         --pheno demo.phe
         --pheno-pos 8
         --dcovar 2,3       #fixed effect
         --qcovar 4,5       #covariates
         --rand 7           #non-genetic (environmental) random effect
         --bfile demo       #genetic random effect
         --r2               #to calculate the reliability
         --threads 32
         --out demo

此时随机因子结果中会包含一些附加列，包括标准误，PEV ，R2(可靠性)，举例如下

ID      dam       dam_SE   dam_PEV  dam_R2    GA        GA_SE    GA_PEV   GA_R2     residuals  residuals_SE residuals_PEV residuals_R2
IND1001 0.496645  3.75992  14.137   0.255863  6.66127   4.27597  18.2839  0.774767  -1.45194   3.18313      10.1323       0.55248
IND1002 0.496645  3.75992  14.137   0.255863  6.66127   4.27597  18.2839  0.774767  2.04383    3.17906      10.1064       0.553624
IND1003 0.674833  3.53678  12.5088  0.341566  2.08201   4.62091  21.3528  0.736963  -0.104053  4.14567      17.1866       0.24091
IND1004 0.674833  3.53678  12.5088  0.341566  -0.682836 4.70199  22.1087  0.727651  0.9083     4.17824      17.4577       0.228936

计算SNP效应

通过 --snp-effect 选项来计算SNP效应，正常做法举例如下

# compute additive SNP effect by SSGBLUP of single trait model
./hiblup --single-trait
         --pheno demo.phe
         --pheno-pos 8
         --bfile demo
         --pedigree demo.ped
         --add
         --snp-effect
         --thread 32
         --out demo

第二种方式，也可以从之前计算好的GEBV结果中来推导SNP效应值，举例如下

# compute dominant SNP effect
./hiblup --snp-effect
         --gebv demo.gebv.d.txt
         --bfile demo
         --dom
         --thread 32
         --out demo

此时这里通过 --bfile 来构建G阵和计算SNP效应，因此为了节约时间使用 --xrm 选项提供 G 阵，示例如下

# compute additive and dominant SNP effect
./hiblup --snp-effect
         --gebv demo.gebv.ad.txt
         --bfile demo
         --xrm demo.GA,demo.GD
         --add --dom
         --thread 32
         --out demo

输出 demo.snpeff ，即 SNP 效应值。

注意：如果 GEBV 值是通过加权G阵计算得到的，此时必须添加SNP权重文件 --snp-weight 。

预测GEBV/PRS

HIBLUP 可以用之前计算好的SNP效应值和基因型数据来预测GEBV/PRS ，这和 PLINK 软件的 --score 一样，但是速度更快。

命令如下，通过 --score 命令输入SNP效应值文件

./hiblup --pred
         --bfile demo        #the binary genotype data
         --score demo.snpeff #the pre-computed SNP effects
         --threads 10
         --out demo

默认情况下，HIBLUP 会按照加性效应的编码方式来编码基因型（如 0 1 2 for AA Aa aa；这里有一个疑问，到底要不要中心化，这里为什么没有中心化？）。用户可以通过添加 --dom 来按照显性效应来编码基因型数据（如 0 1 0 for AA Aa aa）。如果同时有加性和显性效应，则同时使用 --add 和 --dom ，此时 SNP 效应文件中需要同时具有加性和显性效应。

输出 demo.bv 文件，内容如下，第一列是个体号，其它列是育种值或者说是PRS

id    trait1    trait2
Ind2    -0.305403    2.6644
Ind5    0.00897198    -1.36166
Ind11    0.392148    -0.653216
Ind17    0.00232218    -0.213599
Ind22    -0.359507    -2.12692
Ind45    -0.232806    0.269005

选配

这里其实使用计算一对候选亲本对其后代的预测基因组育种值。这里需要事先计算得到所有SNP的标记效应值，示例命令如下，就是增加一个选项 --mating

./hiblup --mating
         --bfile demo
         --score demo.snpeff
         --thread 32
         --out demo

输出文件 demo.mating ，内容如下，就是每一对候选亲本对的后代预测育种值。

sir      dam      g
Ind704      Ind710      -0.0675817
Ind704      Ind711      1.06056
Ind704      Ind712      2.00997
Ind704      Ind713      -0.0879987
Ind704      Ind714     -0.492576

注意事项：

基因型文件和SNP效应值文件必须具有相同的 A1 和 A2 碱基（那就要求生成二进制数据格式的时候要按照 demo.snpeff 文件生成）。
demo.fam 中需要有清晰的性别信息（第5列，1=male; 2=female; other=unknown），如果个体没有性别信息会从这次分析中剔除。

LD

HIBLUP 支持基于碱基计数数据对一对SNP数据计算LD相关系数 ® ，命令如下

./hiblup --ld
         --bfile demo 
         --window-bp 1e6
         --threads 16
         --out demo_ldm

会生成2个输出文件，demo_ldm.info, demo_ldm.bin ，这里由于二进制文件 *.bin 无法直接打开，因此可以通过添加参数 --write-txt 来输出 txt 文件。

可参数说明如下，感觉没啥用

--window-bp ：指定非重叠的窗口大小，默认1Mb，单位bp。
--window-num : 指定一个窗口内的SNP数目为一个固定值，此时窗口大小就不固定了（和--window-bp应该是二选一的关系）。
--window-geno : 将整个基因组视为一个窗口，此时内存和时间可能较长
--window-file : 指定一个基因组窗口文件，见”非常规输入文件“章节

LD scores

LD scores 定义为某个SNP与某个区域所有SNP的LD () 之和，这反应了这个SNP与其它SNP的连锁程度，这个值越高说明这个位点与其它位点的连锁程度越高。LD scores 通常用在 LD scores regression ，目的是估计性状的遗传力或性状间的遗传相关。

输入文件为基因型数据，命令如下

./hiblup --ldscore
         --bfile demo   
         --window-bp 1000000
         --threads 10
         --out test

设置窗口的命令同上，输出文件 test.ldsc ，示例如下，分别为SNP名称，maf ，LD scores 。

id    maf    ldscore
M1    0.481875    1
M2    0.145    1
M3    0.320625    1
.    .    .
.    .    .
.    .    .
M991    0.089375    1.044
M992    0.11375    1.33231
M993    0.31875    1.2885
M994    0.103125    1
M995    0.115625    1

LD scores regression

估计遗传力

LD scores regression 用于估计性状遗传力 (Bulik-Sullivan, Po-Ru Loh, et al. 2015) 和估计遗传相关 (Bulik-Sullivan, Finucane, et al. 2015) 。

这里输入不需要基因型数据，只需要 GWAS 结果文件，通过参考面板计算得到的 LD scores 文件。

估计遗传力命令如下

./hiblup --ldreg
         --sumstat demo.ma   #the summary data
         --lds demo.ldsc     #the pre-computed LD scores
         --out demo

这里GWAS 结果文件和 LD scores 文件中的SNP数目必须保持一致。

输出文件 demo.ldsr.h2，内容如下

1 2	Item Intercept Intercept_SE h2 h2_SE h2_Pval demo 1.08285 0.011433 0.122826 0.00393122 2.71554e-214

这里 Intercept 与群体结构有关，越接近1说明群体越不分层。h2 是估计遗传力，h2_Pval 是遗传力卡方分布统计检验的p值。

估计遗传相关

估计遗传相关类似于估计遗传力，如果提供多个GWAS结果，HIBLUP 就会同时估计遗传力和遗传相关。

./hiblup --ldreg
         --sumstat demo1.ma demo2.ma demo3.ma   #the summary data of multiple traits, use space as separator
         --lds demo.ldsc     #the pre-computed LD scores
         --out demo

此时输出2个文件，test.ldsr.h2 是遗传力结果，同上；test.ldsr.rg 是遗传相关结果，举例如下

Item    CovG    CovG_SE    Intercept    Intercept_SE    rG    rG_SE    rG_Pval
demo1:demo2    0.0252414    0.00716841    0.0166108    0.00819384    0.141956    0.0403148    0.000429607
demo1:demo3    0.0744506    0.00892225    0.102821    0.00671784    0.296832    0.0355727    7.15917e-17
demo2:demo3    0.262124    0.0340515    0.315166    0.0112682    0.608769    0.0790827    1.38348e-14

这里 CovG 是遗传协方差，rG 是遗传相关，rG_Pval 是遗传相关卡方检验的P值。

其它参数

其它可用参数如下

--M : 指定使用的SNP数目，默认情况下会按照 Bulik-Sullivan 建议的 MAF 大于等于5% 的SNP
--chisq-max : to specify the maximum threshold of X2 for the first step estimator of intercept, the default is 30.
--intercept-h2 : to constrain the intercept with a constant rather than estimating it from data for heritability estimation.
--intercept-gencov: to constrain the intercept with a constant rather than estimating it from data for genetic correlation estimation.

Summary levle BLUP (SBLUP)

SBLUP 模型是通过GWAS结果文件和 LD 矩阵文件来估计SNP标记效应。

使用基因型数据

第一种方式是直接使用基因型数据（而不是LD矩阵文件），命令如下

./hiblup --sblup
         --sumstat demo.ma   #the summary data
         --bfile demo 
         --window-bp 1e6
         --h2 0.3234
         --threads 16
         --out demo

这里 --h2 是性状遗传力，可以使用 REML 方法估计，也可以使用 LD scores regression 估计。

如果单个窗口的SNP数目很多，如超过10k，那么建议加上 --pcg选项来加速计算 SNP效应。

使用LD相关矩阵

使用LD相关矩阵更加直接，注意这里假设所有SNP服从哈温平衡 ，不然估计的SNP效应会有偏差。

命令如下

./hiblup --sblup
         --sumstat demo.ma   #the summary data
         --ldm demo_ldm      #the pre-computed LD correlation matrix
         --h2 0.3234
         --threads 10
         --out demo

这里 --ldm 是 LD 相关矩阵文件（二进制文件），见 “LD” 章节。

输出文件

输出文件 demo.snpeff ，举例如下，最后一列就是估计的标记效应。

id a1 a2 freq_a1 demo
M1 A G 0.1285 -0.000963937
M2 T C 0.1285 -0.00108931
M3 A G 0.1062 0.00588629
M4 G A 0.1285 -0.00164344
M5 A C 0.2459 -0.00100206

MT-SBLUP

多性状 SBLUP 比单性状的预测准确性更高 (Robert, Zhihong, et al. 2018) 。

和上面一样也是有2种方式，但是这里我们只显示使用LD相关矩阵的做法，如下

./hiblup --sblup
         --sumstat demo1.ma demo2.ma demo3.ma demo4.ma   #the summary data, use space as separator
         --ldm demo_ldm      #the pre-computed LD correlation matrix
         --h2 0.3234 0.1256 0.6345 0.3536
         --rg 0.1336 0.5567 0.2345 0.8454 0.3446 0.4633
       # --pcg               #use PCG for fast computing
         --threads 10
         --out demo

这里遗传相关的输入顺序是按照下三角矩阵的顺序（如 1-2 1-3 1-4 2-3 2-4 3-4；这个顺序不对啊）

输出 demo.snpeff ，内容如下

id a1 a2 freq_a1 demo1 demo2 demo3 demo4
M1 A G 0.1285 -0.000963937 -0.000577569 -0.000792698 0.000175215
M2 T C 0.1285 -0.00108931 -0.000597102 -0.000825137 0.000177501
M3 A G 0.1062 0.00588629 0.00155157 0.00270818 0.000154987
M4 G A 0.1285 -0.00164344 -0.000557257 -0.000874613 0.000155528
M5 A C 0.2459 -0.00100206 -0.000456737 -0.000855748 -0.000422206

计算显著位点解释表型方差的比例（PVE）

在答疑部分，计算 the proportion of phenotypic variance explained (PVE) for significant SNPs ，这里 HIBLUP 建议是用显著位点和其它位点构建2个G阵，然后计算方差组分，代码如下

#step1: construct GRM1 using the significant SNPs
./hiblup --make-xrm 
         --bfile demo 
         --extract snp.txt 
         --out grm1 
         --threads 32    # grm1.GA.id, grm1.GA.bin will be generated
#step2: construct GRM2 using the SNPs excluding the significant SNPs
./hiblup --make-xrm 
         --bfile demo 
         --exclude snp.txt 
         --out grm2 
         --threads 32    # grm2.GA.id, grm2.GA.bin will be generated
#step3: compute the variance components for GRM1 and GRM2 by single trait model
./hiblup --single-trait 
         --pheno demo.phe 
         --pheno-pos X 
         --xrm grm1.GA,grm2.GA 
         --out vc 
         --threads 32    
#step4: calculate the PVE for significant SNPs by following equation:
         pve = V(grm1) / ( V(grm1)+V(grm2)+V(e) )

参考文献

https://www.hiblup.com/tutorials

版权声明： 本博客所有文章除特别声明外，著作权归作者所有。转载请注明出处！