两个Plink文件同一SNP基因型比对代码实现

2021-08-16

字数统计: 2.4k | 阅读时长≈ 11 分钟

本软件是两个Plink文件样本间基因型比对代码实现的另一种比对角度，前面的思路是在比较两个基因型文件中，同一个样本之间有多少位点不一致，这里是比较同一个SNP有多少样本的分型不一致。在比对同一个群体的两个plink文件时，如果存在不一致，这两种比对角度都是必需的。

举个例子，比对某个群体基因型文件填充前后的基因型变化，查看样本间不一致率，发现所有样本都有几十个位点不一致，前三行如下：

202884940104_R14C04	202884940104_R14C04	15	34653	0.0004
202884940104_R11C01	202884940104_R11C01	17	34665	0.0005
202888800049_R23C04	202888800049_R23C04	17	34658	0.0005

而通过相同的 SNP 间的不一致率，发现大部分 SNP 在两个基因型文件中完全一致，只有一小部分SNP存在不一致，前三行如下：

8_118011832	3508	3529	0.994
5_39432167	3459	3524	0.9816
14_133523132	3428	3518	0.9744

前提条件

两个 plink 文件中的 map 文件的染色体、SNP名称及物理位置必须保持一致，样本名称保持一致（两个plink 文件中同一样本的芯片号相同，可能需要自己手动转plink文件的样本名称）。

使用软件

python 3.0以上版本，事先安装 numpy 模块。
plink 1.9 , 并将 plink 添加到环境变量中。

输入文件格式

plink 文件：前缀相同的一对文件，后缀分别为.ped ，.map 。

输出文件说明

cmp_common_samples.txt : 第一列为SNP名称，第二列为两个plink文件中共同样本中不一致的样本数目，第三列为共同样本总数，第四列为前两列相除得到的不一致率。

示例如下：

8_118011832	3508	3529	0.994
5_39432167	3459	3524	0.9816
14_133523132	3428	3518	0.9744

运行代码

代码文件见：compair_snps.py

将输入文件和本程序放在同一文件夹下，运行命令示范为

1	python compair_snps.py --plink1 plink1 --plink2 plink2

可用参数说明：

--plink1: 第一个plink文件前缀，必选。

--plink2: 第二个plink文件前缀，可选。

--exclude-miss : 基因型比对是否先剔除缺失值，可选，只能为yes和no，默认为yes。如果该选项设置为no，比对时不会剔除缺失样本对，如果存在同一个样本，在一个基因型文件中有基因型，另一个基因型文件中则为缺失，该样本同样会被视为不一致位点；如果该选项设置为 yes ，比对时会剔除至少一个样本为缺失的位点，只比对两个样本均不为缺失的位点。

--save-temp : 是否保留中间文件，可选，只能为yes和no，默认为 no。

--out : 结果文件名称，可选，默认为 cmp_common_samples.txt

代码说明

检验参数

检验输入的各个参数是否合规

#检验参数
legal_argument_set = set(["yes", "no"])

if not os.path.exists(f"{args.plink1}.ped"):
    print(f"Error: not exist {args.plink1}.ped\n\n")
    sys.exit(1)

if not os.path.exists(f"{args.plink1}.map"):
    print(f"Error: not exist {args.plink1}.map\n\n")
    sys.exit(1)

if not os.path.exists(f"{args.plink2}.ped"):
    print(f"Error: not exist {args.plink2}.ped\n\n")
    sys.exit(1)

if not os.path.exists(f"{args.plink2}.map"):
    print(f"Error: not exist {args.plink2}.map\n\n")
    sys.exit(1)

if args.exclude_miss not in legal_argument_set:
    print("Error: illegal input, --exclude-miss must be yes or no\n\n")
    sys.exit(1)
    
if args.save_temp not in legal_argument_set:
    print("Error: illegal input, --save-temp must be yes or no\n\n")
    sys.exit(1)

复制基因型文件

这里根据原始plink文件重新生成两套新的plink文件，其中芯片号使用数字代替。

这里会提取共同位点，共同样本。

plink_prefix1 = args.plink1
plink_prefix2 = args.plink2

# 复制 map
os.system(f"cp {plink_prefix1}.map {plink_prefix1}_temp.map")
os.system(f"cp {plink_prefix2}.map {plink_prefix2}_temp.map")

# 重新生成plink 文件
ped1 = open(f"{plink_prefix1}.ped",'r')
map1 = open(f"{plink_prefix1}.map",'r')

ped2 = open(f"{plink_prefix2}.ped",'r')
map2 = open(f"{plink_prefix2}.map",'r')

re_ped1 = open(f"{plink_prefix1}_temp.ped",'w')
re_ped2 = open(f"{plink_prefix2}_temp.ped",'w')
common_snp_file = open(f'{plink_prefix1}_{plink_prefix2}_common_snp_temp.txt','w')

error_status = False 

#提取共同位点
j = 0
temp_id_set = set()
temp_line_set = set()
map1_dick = {}  # SNP名称：染色体_物理位置
for i in map1:
    j += 1
    f = i.split()
    if (len(f) == 4):
        if '\t'.join(f) not in temp_line_set:
            temp_line_set.add('\t'.join(f))
            if f[1] not in temp_id_set:
                temp_id_set.add(f[1])
                map1_dick[f[1]] = f[0]+"_"+f[3] # SNP名称：“染色体_物理位置”
            else:
                print(f"Error: duplicated id {f[1]} in different rows in {plink_prefix1}.map\n") 
                error_status = True
        else:
            print(f"Error: duplicted row {j} in {plink_prefix1}.map\n")
            error_status = True

    else:
        print(f"Error: {j} row in {plink_prefix1}.map has less or more than 4 coloums\n")
        error_status = True

j = 0
temp_id_set = set()
temp_line_set = set()
common_snp_num = 0
for i in map2:
    j += 1
    f = i.split()
    if (len(f) == 4):
        if '\t'.join(f) not in temp_line_set:
            temp_line_set.add('\t'.join(f))
            if f[1] not in temp_id_set:
                temp_id_set.add(f[1])
                if f[1] in map1_dick:
                    common_snp_num += 1
                    common_snp_file.write(f[1]+'\n')
            else:
                print(f"Error: duplicated id {f[1]} in different rows in {plink_prefix2}.map\n") 
                error_status = True
        else:
            print(f"Error: duplicted row {j} in {plink_prefix2}.map\n")
            error_status = True

    else:
        print(f"Error: {j} row in {plink_prefix2}.map has less or more than 4 coloums\n")
        error_status = True

print("\nNumber of common SNPs in two plink files: {}\n".format(common_snp_num))
   
#重命名
# newid_dick = {} #chipid/sampleid:newid
# sampleid_dick = {} #newid:chipid/sampleid


#ped重编码为数字(1-indexed)
# 第一次结果不对
# 我懂了，plink 的 merge 命令会重新排列样本顺序，所以不是按照ped1,ped2的顺序

# 修改程序，严格按照想要的总文件的顺序，重编码芯片号。然后合并时先合并ped2,再合并ped1
# 重编码 ped 文件，第一列家系号替换为0，个体号重编码为数字。因为 merge 命令会按照 家系号 + 个体号进行排序。

ped1_dick = {}
for i in ped1:
    f = i.split()
    if f[1] not in ped1_dick:
        ped1_dick[f[1]] = '\t'.join(f[2:])+'\n'
    else:
        error_status = True
        print(f"Error: duplicated chip id {f[1]} in {plink_prefix1}.ped\n")

        
sample1_num = len(ped1_dick)

print("Number of records present in plink1 file: {}\n".format(sample1_num))

common_sample_list = []
common_sample_num = 0

j=0
temp_id_set = set()
for i in ped2:
    j += 1
    f = i.split()
    if f[1] not in temp_id_set:
        temp_id_set.add(f[1])
        if f[1] in ped1_dick:
            common_sample_num += 1
            common_sample_list.append(f[1]) #两个样本重复样的芯片号
            re_ped2.write('0'+'\t'+str(j)+'\t'+'\t'.join(f[2:])+'\n')
    else:
        error_status = True
        print(f"Error: duplicated chip id {f[1]} in {plink_prefix2}.ped\n")

sample2_num = j
print("Number of records present in plink2 file: {}\n".format(sample2_num))
print("Number of common samples in two plink file: {}\n".format(common_sample_num))

## re_ped1 重排个体顺序，使得与 re_ped2 一致
## 这里的 j 沿着上面递增，没有初始化 j = 0

for i in common_sample_list:
    j += 1
    re_ped1.write('0'+'\t'+str(j)+'\t'+ped1_dick[i])

ped1.close()
map1.close()
ped2.close()
map2.close()
re_ped1.close()
re_ped2.close()
common_snp_file.close()

合并基因型

# 如果没有共同个体，程序中止
if common_sample_num == 0: #如果没有共同个体，报错
    print("Error: two plink file has no same samples, check your data again!")
    error_status = True

# 如果读取文件过程中出现错误退出
if error_status:
    print("Please check your file again!")
    sys.exit(1)


# extract merge recodeA
# 合并文件是第二个plink文件的共同样本在上方，第一个plink的共同样本在下方，二者样本的顺序一致。

merge_status = os.system(f"plink --allow-extra-chr --chr-set 95 --file {plink_prefix2}_temp --merge {plink_prefix1}_temp --extract {plink_prefix1}_{plink_prefix2}_common_snp_temp.txt --recodeA --out {plink_prefix1}_{plink_prefix2}_merge_temp > /dev/null")

if merge_status != 0:
    print("Error in merge two plink data\n")
    print("Maybe some common SNP from Complementary strands, check your data again")
    sys.exit(1)

#raw文件处理
os.system(f"sed 's/NA/3/g' {plink_prefix1}_{plink_prefix2}_merge_temp.raw > {plink_prefix1}_{plink_prefix2}_merge_temp_sed.raw")
raw_file = open(f"{plink_prefix1}_{plink_prefix2}_merge_temp_sed.raw",'r')
out_file = open(args.out,'w')

# 处理标题，得到 SNP 名称
snp_list = []
title = raw_file.readline().split() #剔除标题
for i in title[6:]:
    snp_name = i[:-2]
    snp_list.append(snp_name) # SNP名称的列表

raw_list=[]
for i in raw_file:
    f = i.split()
    raw_list.append(f[6:])

array1 = np.array(raw_list, dtype=np.int8)

total_sample_num = array1.shape[0]
snp_num = array1.shape[1]

print("Number of records present in merged ped file: {}\n".format(total_sample_num))
print("Number of SNPs in merged plink files: {}\n".format(snp_num))

if (total_sample_num != (2*common_sample_num)):
    print("Error: total samples number of merged plink file is not 2 fold of common samples in two plink files\n")
    sys.exit(1)

比对同一个SNP的基因型

#计算不一致率
out_list = []
  
if args.exclude_miss == 'yes':
    for i in range(snp_num):
        first_index = range(0,common_sample_num)
        second_index = range(common_sample_num,total_sample_num)
        v1 = array1[first_index,i]
        v2 = array1[second_index,i]
        non_miss_index = np.where((v1!=3) & (v2!=3))
        total_num = non_miss_index[0].shape[0]
        v1_new = v1[non_miss_index]
        v2_new = v2[non_miss_index]
        no_match_num = np.count_nonzero(v1_new!=v2_new)
        if total_num > 0:
            no_match_rate = no_match_num/total_num
        else:
            no_match_rate = 0
        out_list.append([snp_list[i],f"{no_match_num}",f"{total_num}",f"{no_match_rate:.4f}"])
elif args.exclude_miss == 'no':
    for i in range(snp_num):
        first_index = range(0,common_sample_num)
        second_index = range(common_sample_num,total_sample_num)
        v1 = array1[first_index,i]
        v2 = array1[second_index,i]
        total_num = common_sample_num
        no_match_num = np.count_nonzero(v1!=v2)
        no_match_rate = no_match_num/total_num
        out_list.append([snp_list[i],f"{no_match_num}",f"{total_num}",f"{no_match_rate:.4f}"])



sorted_out_list = sorted(out_list, key = lambda s:float(s[3]), reverse = True) #不一致率从大到小
for i in sorted_out_list:
    out_file.write('\t'.join(i)+'\n')

raw_file.close()
out_file.close()

删除中间文件

#删除中间文件
if args.save_temp == "yes":
    pass
elif args.save_temp == "no":
    os.system(f"rm *temp*")

版权声明： 本博客所有文章除特别声明外，著作权归作者所有。转载请注明出处！