使用shapeit和impute软件定相填充

2023-09-22

字数统计: 3.9k | 阅读时长≈ 16 分钟

使用 shapeit 和 impute 软件进行填充。

shapeit5

用芯片数据实际使用报错如下。AC(Allele Count) 表示该Allele的数目，这个应该只有测序的 VCF 才有，芯片只有 GT 字段，因此芯片数据用不了 shapeit5，只有 shapeit3 及更低版本才能使用 bed/bim/fam 作为输入文件。

1	ERROR: AC field is needed in file

下载安装

官网见shapeit5

首先尝试使用 conda 安装，不然自己安装貌似很麻烦。

$ conda search shapeit5
Loading channels: done
# Name                       Version           Build  Channel
shapeit5                       1.0.0      h0c8ee15_0  anaconda/cloud/bioconda
shapeit5                       5.1.1      hb60d31d_0  anaconda/cloud/bioconda

$ conda install shapeit5=5.1.1

安装完之后查看一下

1 2	$ conda list \| grep "shapeit" shapeit5 5.1.1

在 conda 的 bin 目录下出现了以下文件 ( -iname 不区分大小写)

$ find /mnt/data/zhouziwen/lib/miniconda3/bin -iname "*shapeit*"
/mnt/data/zhouziwen/lib/miniconda3/bin/SHAPEIT5_ligate
/mnt/data/zhouziwen/lib/miniconda3/bin/SHAPEIT5_phase_common
/mnt/data/zhouziwen/lib/miniconda3/bin/SHAPEIT5_phase_rare
/mnt/data/zhouziwen/lib/miniconda3/bin/SHAPEIT5_switch

使用

参考官网文档

定相常见变异

phase_common 这个工具对常见变异（common variant sites）进行定相，例如 SNP 数据。这个工具是对大型测序数据定相的第一步，可以视为 SHAPEIT4 的升级版本（芯片定相只需要使用这一个工具）。

第一种使用方式：对无关个体定相，其中：

--input ：验证群基因型文件，vcf/bcf 格式
--thread 8 : 使用 8 个线程
--region 1 : 使用 1 号染色体的数据
--map ：遗传图片文件，可选
--output target.phased.bcf : 输出文件

1	phase_common --input array/target.unrelated.bcf --region 1 --map info/chr1.gmap.gz --output target.phased.bcf --thread 8

第二种使用方式：对有亲缘关系的个体定相，这里 --pedigree 就是系谱文件（缺失记为 NA）。

这里的原理是用亲本数据对子代定相（当不存在孟德尔错误，并且至少有一个亲本是纯合子）时，此时后代的第一个单倍型是父本的单倍型，第二个单倍型是母本的单倍型。

1	phase_common --input array/target.family.bcf --pedigree info/target.family.fam --region 1 --map info/chr1.gmap.gz --output target.phased.bcf --thread 8

第三种使用方式：对 X 染色体定相，其中 --haploids info/target.haploid.txt 中的文件指定单倍体的样本（一行一个样本号，例如雄性）。

这里的原理其实是见单倍体在输入或输出文件中当作完全纯合的二倍体来编码。对于单倍体个体，所有杂合子首先会改为缺失，然后再填充；对于双倍体个体，按照正常流程填充。

1	phase_common --input array/target.haploid.bcf --haploids info/target.haploid.txt --region 1 --map info/chr1.gmap.gz --output target.phased.bcf --thread 8

第四种使用方式：使用参考面板填充，--reference 后面跟着 vcf/bcf 文件。

1	phase_common --input array/target.unrelated.bcf --reference array/reference.bcf --region 1 --map info/chr1.gmap.gz --output target.phased.bcf --thread 8

第五种使用方式：使用 scaffold 填充，--scaffold array/target.scaffold.bcf 后面跟着的 vcf/bcf 文件就是输入文件部分定相的结果（部分样本部分位点，不能有缺失并且全部定相）。

1	phase_common --input array/target.unrelated.bcf --scaffold array/target.scaffold.bcf --region 1 --map info/chr1.gmap.gz --output target.phased.bcf --thread 8

ligate

如果你将一条染色体分为了多个片段进行定相，那么可以使用 ligate 来将合并成一整条染色体（无用）。

首先，我们对同一条染色体两个重叠的片段进行定相。

1
2

phase_common --input array/target.unrelated.bcf --region 1:1-5500000 --map info/chr1.gmap.gz --output target.chunk1.bcf --thread 8
phase_common --input array/target.unrelated.bcf --region 1:4500000-10000000 --map info/chr1.gmap.gz --output target.chunk2.bcf --thread 8

然后我们将这两个 bcf 文件名称写入一个文件（需要排序好）

1 2	echo target.chunk1.bcf > chunks.txt echo target.chunk2.bcf >> chunks.txt

然后，我们执行下面的命令

1	ligate --input chunks.txt --output target.phased.bcf --thread 2

在整合的过程中，会输出两个统计量（前提是多个 bcf 文件之间存在重叠的区域）

Switch rate: the percentage of samples for which haplotypes have been switched,
Avg phaseQ: the agreement of the phasing in the overlapping region between two successive chunks of data. We usually expect values above 80 or 90.

注意，如果定相过程中使用了系谱，那么这里也需要使用 --pedigree 选项。

定相稀有变异

这里建议只对于样本数目超过2000的数据使用 phase_rare ，不然就可以正常使用 phase_common 。

第一种使用方式：定相无关个体。

这里我们分多步来定相，首先我们对常见变异（MAF>0.1%）进行定相

1	phase_common --input wgs/target.unrelated.bcf --filter-maf 0.001 --region 1 --map info/chr1.gmap.gz --output tmp/target.scaffold.bcf --thread 8

然后，我们使用常见变异定相结果作为 scaffold 来定相稀有变异。

没太看懂，暂时不管它。

while read LINE; do
	CHK=$(echo $LINE | awk '{ print $1; }')
	SRG=$(echo $LINE | awk '{ print $3; }')
	IRG=$(echo $LINE | awk '{ print $4; }')
	phase_rare --input wgs/target.unrelated.bcf --scaffold tmp/target.scaffold.bcf --map info/chr1.gmap.gz --input-region $IRG --scaffold-region $SRG --output tmp/target.phased.chunk$CHK\.bcf  --thread 8
done < info/chunks.coordinates.txt

shapeit4

下载安装

官网见shapeit4

首先尝试使用 conda 安装，发现安装了 shapeit5 以后就没法安装 shapeit4 了（因为附属软件的版本冲突），但是在一个虚拟环境中安装成功了。

手动编译安装教程参考shapeit4 安装

1 2	$ conda search shapeit4 $ conda install shapeit4=4.2.2 # 安装失败

安装完之后查看一下

1 2	$ conda list \| grep "shapeit" shapeit4 4.2.2

安装后的 shapeit4 在虚拟环境下面的目录

1 2	$ which shapeit4 /mnt/data/zhouziwen/lib/miniconda3/envs/R4.1.0/bin/shapeit4

我尝试在 base 环境下回退版本失败，懒得卸载重装 conda 了，于是给上面的 shapeit4 创建一个软链接

1	ln -s /mnt/data/zhouziwen/lib/miniconda3/envs/R4.1.0/bin/shapeit4 /mnt/data/zhouziwen/lib/bin

使用

参考shapeit4

类似 impute5 ，首先也需要对输入的基因型数据进行 index

1 2	bgzip chr_1.vcf bcftools index chr_1.vcf.gz

基本使用方式类似，如下

1	shapeit4 --reference reference.vcf.gz --input unphased.vcf.gz --map chr20.b37.gmap.gz --region 20 --output phased.vcf.gz

--reference ：参考面板（index），vcf.gz 或 bcf 格式，经过测试可以使用全部染色体的基因型文件。注意参考面板和验证群的chrom, position, ref, alt 必须一致，shapeit4 只会使用参考面板和验证群的共同位点。
--input ：验证群基因型文件（index），vcf.gz 或 bcf 格式，经过测试可以使用全部染色体的基因型文件

shapeit4 默认使用芯片数据，如果使用测序数据，增加选项 –sequencing 。

shapeit2

官网网址

定相命令

从shapeit2下载二进制文件，解压即可

无参定相如下 (首先需要按照染色体拆分开，一次定相一条染色体)，其中

--input-bed ：输入的二进制文件前缀。输入文件也可以使用 vcf （--input-vcf file）
--output-max：输出文件，这里也可以只使用前缀（--output-max gwas.phased）
--thread 8 : 多进程

shapeit2 --input-bed gwas \
        --output-max gwas.phased.haps gwas.phased.sample \
        --thread 8 \
        --force

shapeit2 运行过程中会对数据进行检查，如果存在缺失率超过 5% 的样本或位点，或者单态的位点（无变异）都会提出 warnings 。如果存在个体的缺失率为 100% ，则会报错。

实际运行过程中发现以下报错，个体缺失率超过 10% 就会报错，结果文件为空文件。加入 --force 选项后就可以正常运行。

1	ERROR: 3 individuals with high rates of missing data (>10%). These individuals should be removed. You can disable this error with --force (at your own risk).

有参考群填充命令如下，加入选项 --input-ref ，这里使用的是 HAP/LEGEND/SAMPLE 格式，因此需要转格式。

shapeit2 --input-bed gwas \
        --output-max gwas.phased \
        --input-ref reference.haplotypes reference.legend reference.sample \
        --thread 8 \
        --force

注意，验证群的位点必须均在参考群中，位点的染色体，物理位置， ref/alt 必须保持一致（因此输入文件只能使用二进制格式或VCF，不能使用ped/map），不然会报错。

输出文件如下

shapeit_[date]_[time]_UUID.log ：日志文件，可以通过 --output-log gwas.phased.log指定日志文件名称
haps 文件
sample 文件

可以将定相结果文件从 HAPS/SAMPLE 转化为两种不同的格式，第一种是转为 HAP/LEGEND/SAMPLE 格式（这是 impute2 的参考面板的输入格式，注意 --output-ref 不能只提供前缀）

1
2
3

shapeit2 -convert \
        --input-haps gwas.phased \
        --output-ref gwas.phased.hap gwas.phased.leg gwas.phased.sam

第二种格式是转化为 VCF 格式

1
2
3

shapeit2 -convert \
        --input-haps gwas.phased \
        --output-vcf gwas.phased.vcf

下面具体介绍一下不同的输出文件格式

HAPS/SAMPLE 格式

假设我们有 4 个样本和 3 个 SNP ，SAMPLE 文件用于描述样本情况，举例如下，前两行是标题行，之后每一行表示一个个体，前3列分别为样本1 - 样本2 - 缺失率。

ID_1 ID_2 missing
0    0    0
UNR1 UNR1 0
UNR2 UNR2 0
UNR3 UNR3 0
UNR4 UNR4 0

HAPS 用于描述 SNP 情况和单倍型，如下所示，每一行代表一个 SNP ，前 5 列分别为染色体 - SNP - 物理位置 - 第一个碱基 - 第二个碱基；之后的列每两列组成一对，这两列表示一个个体的单倍型。

举个例子，第 6 列和第 7 列内容分别是 000 和 011 ，这里 0 代表第一个碱基，1代表第二个碱基，因此也就是说第一个个体的单倍型为 ATA 和 ACT 。

1
2
3

7 SNP1 123 A G 0 0 1 0 0 0 1 1
7 SNP2 456 T C 0 1 1 0 0 1 0 1
7 SNP3 789 A T 0 1 1 0 1 1 1 1

HAP/LEGEND/SAMPLE

SAMPLE 文件如下所示，前 4 类分别为个体号 - Group1 ID - Group2 ID - Sex 。其中个体号必须只能包含字母和数字。

sample population group sex
CEU1 CEU EUR 1
CEU2 CEU EUR 2
GBR1 GBR EUR 2
YRI1 YRI AFR 2

LEGEND 文件描述样本情况，每一行代表一个 SNP ，前 4 列分别为 SNP - 物理位置 - 第一个碱基 - 第二个碱基；

id position a0 a1
SNP1 123 A G
SNP2 456 T C
SNP3 789 A T

HAP 文件包含单倍型结果，每两列表示一个个体的单倍型。

1
2
3

0 0 1 0 0 0 1 1
0 1 1 0 0 1 0 1
0 1 1 0 1 1 1 1

使用 xargs 多进程

这里我们可以使用 xargs 实现多进程，首先我们生成所有的命令行

1
2
3

for i in $(seq 1 22); do
  echo "-B gwas.chr$i -M gmap.chr$i\.txt -O gwas.chr$i\.phased -T T" >> myCommands.txt
done

第二步，定相如下，这里 J 就是并行数目，2 是线程数目。

1	cat myCommands.txt \| xargs -PJ -n2 shapeit2 &

加入系谱数据

如果存在系谱，那么直接在 plink 或二进制文件中的父本列和母本列写入即可。

impute2

下载及使用说明见impute2官网

impute2 也可以定相，也可以进行自填充。

填充命令

shapeit 和 impute2 的搭配如下：

参考群构建（ wang 2022 文献中使用 shapeit 定相， minmac4 填充；验证群也是 shapeit + minmac4 , 填充后提出 R^2 小于 0.4 （应该是 minmac4 的统计指标）的位点）
使用 shapeit 对验证群进行无参定相（不过有参定相更准）。
使用 impute2 对 shapeit 结果文件进行有参填充。

参考面板构建仅仅是第二步改为无参填充即可。

填充命令如下（参考 https://mathgen.stats.ox.ac.uk/impute/impute_v2.html#required_args ）

-use_prephased_g ：说明验证群已经定相了。
-known_haps_g : 验证群定相后的基因型文件 (必选项，或者提供 -g参数) 。
-int 9.1e6 9.6e6 : 填充的区域（必选项，默认长度不超过 7MB，除非添加 -allow_large_regions 参数允许更大的长度）
-m ：遗传图谱，必选项（三列，第一列是物理位置，第二列是当前位置与map文件下一个位置的重组率(in cM/Mb)，第三列是遗传图谱位置（in cM），列标题必须为 “position COMBINED_rate(cM/Mb) Genetic_Map(cM)” ）（由于遗传图谱是必选项，因此很难创建自己的参考面板）

impute2 -use_prephased_g \
        -known_haps_g gwas.phased.haps \
        -h reference.haplotypes.gz \
        -l reference.legend.gz \
        -m genetic_map.txt \
        -int 9.1e6 9.6e6 \
        -Ne 20000 \
        -o gwas.from9.1e6_to9.6e6.imputed

如何填充整条染色体

参考 https://mathgen.stats.ox.ac.uk/impute/impute_v2.html#whole_chroms

原则上 impute 可以一次填充一整条染色体，但是我们更加推荐将整条染色体切分为更小的区块来分析，第一是填充更准确，第二是可以使用多个CPU同时填充，节约计算时间和内存。

举个例子，我们可以将整条染色体按照 5MB 来拆分开，使用 “-int 1 5000000” 用于第一次分析, “-int 5000001 10000000” 用于第二次分析，以此类推。impute2 分析时会在区域两端使用一个 250kb 的 buffer 区域来避免边缘效应（就是说为了避免区域两端的位点结果不准，它会在两边延长 250kb 的长度，但是这两个 buffer 区域的位点不会出现在最终结果中，这和 beagle 的窗口的 overlap 原理一样），因此你在提供拆分片段时不用重叠（因为 impute2 已经考虑重叠了，overlap 长度可以通过 -buffer 选项设置）。

拆分后并填充完毕后，你可以直接将所有合并结果合并

1	cat chr16_chunk1.impute2 chr16_chunk2.impute2 chr16_chunk3.impute2 > chr16_chunkAll.impute2

小结

impute2 只是适用于重测序数据，而且参考面板需要遗传图谱，分析时需要将整条染色体按照 5MB 来拆分开，比较麻烦。

impute5

下载安装

从impute-5下载得到二进制文件和说明文档，将二进制文件放在环境变量的文件夹中即可，添加执行权限。

1	impute5_1.1.5_static --help

使用前需要对参考面板和验证群创建 index ，使用 bcftools 可以完成这一点（需要实现使用 bgzip 压缩为vcf.gz 格式）。

bgzip -c reference.vcf > reference.vcf.gz
bcftools index reference.vcf.gz

bgzip target.vcf # 不保留未压缩文件
bcftools index target.vcf.gz

填充命令如下，其中

--h : 参考面板（indexed, phased, non-missing），必须无缺失且定相（经过测试，可以使用全部染色体的基因型文件）
--g: 需要填充的基因型（indexed, phased, non-missing），和参考面板的 chr,pos,ref,alt 必须保持一致
--r: 指定染色体
--o: 输出文件
--out-gp-field: 输出 genotype probabilities
--threads 2: 多线程（原理是将验证群逐个样本去填充）

1 2	impute5_1.1.5_static --h reference.vcf --g target.vcf --r 20 --o imputed.vcf.gz --out-gp-field --threads 2

填充结果文件中 INFO 列的 INFO 字段为该位点的 IMPUTE INFO score ，用来评估填充位点的填充质量，越大说明填充准确性越好。从Genotype_Imputation.html 中找到解释如下

Imputation quality score (INFO)

INFO scores measure the imputation accuracy of each variant. Variants with low INFO scores, typically less than 0.3, are often removed.

We can use genotype imputation software tools, such as (IMPUTE2, MaCH, Beagle, SHAPEIT), to compute the imputation quality score (INFO score) to indicate the imputation accuracy of each variant. The INFO score ranges from 0 to 1. A score of 1 indicates perfect imputation accuracy, while 0 indicates the variant was not imputed.

The INFO score is calculated using a selected imputation algorithm and the reference panel used for imputation. The algorithm calculates the probability of the imputed genotype based on the available genotype data and the reference panel. The INFO score is calculated as the ratio of the difference between the probability of the imputed genotype and the probability of the most likely observed genotype to the maximum possible difference between the two probabilities. The INFO score represents the amount of information gained by imputing the missing genotype and indicates the accuracy of the imputed data.

Researchers typically set a threshold for the INFO score, below which the imputed variant is considered unreliable and removed from the dataset. The threshold may vary depending on the research question, the studied population, and the imputation software used.

版权声明： 本博客所有文章除特别声明外，著作权归作者所有。转载请注明出处！