调整plink文件SNP顺序代码实现

2021-09-12

数据分析 / python

字数统计: 1.4k | 阅读时长≈ 6 分钟

一般来说 Plink 文件中的SNP顺序并不影响任何操作，如果有必要的话，可以通过下面这个代码以指定SNP顺序调整plink文件。

前言

经过 plink 软件处理的基因型文件，位点会按照物理位置的顺序进行排序，而且性染色体会重命名为常染色体+1 或常染色体+2。这里我们同样可以利用 plink 软件的这个特性来实现以我们指定的顺序排序SNP的目的。

使用软件

python 3.8 及以上版本
plink 软件 v1.90 版本

输入输出文件格式

plink 文件

格式说明见根据芯片号拆分plink文件的代码实现

运行代码

代码文件见：snp_order.py

将所有输入文件和本程序放在同一文件夹下，运行命令示范如下：

1	python snp_order.py cn1.map plink_order cn1_2

参数说明：

cn1.map : 参考map文件，即为预期调整好SNP顺序之后的map 文件

plink_order : 输入plink文件前缀

cn1_2 : 输出plink文件前缀

注意，参考map文件与输入plink文件的map文件的内容一致，仅仅是SNP顺序变动了。最终会将输入的plink文件的SNP顺序调整为参考map文件的SNP顺序。

代码说明

输入参数

import os,sys

ref_map_name = sys.argv[1]
in_plink_prefix = sys.argv[2]
out_plink_prefix = sys.argv[3]

判断参数和文件内容

这里我们先用一个变量 error_status 作为总的判断异常的逻辑值，默认是 False ，如果运行过程中出现问题，则修改为 True

1	error_status = False

写了一个函数，判断文件名是否存在且非空

# 判断输入文件是否存在且非空
def Judge_file(file_path):
    import os
    if os.path.exists(file_path):
        if os.path.getsize(file_path):
            return True
    return False

之后就是对输入输出文件进行判断，判断三个输入文件是否存在；判断输入plink前缀是否等于输出plink前缀。如果有问题，程序中止。

if not Judge_file(ref_map_name):
    print(f"Error: {ref_map_name} not exist or is empty file\n")
    error_status = True
    
if not Judge_file(f"{in_plink_prefix}.ped"):
    print(f"Error: {in_plink_prefix}.ped not exist or is empty file\n")
    error_status = True
elif not Judge_file(f"{in_plink_prefix}.map"):
    print(f"Error: {in_plink_prefix}.map not exist or is empty file\n")
    error_status = True

if in_plink_prefix == out_plink_prefix:
    print("Error: second argument is same with third argument\n")
    error_status = True

if error_status:
    sys.exit(1)

之后判断参考map文件和输入plink文件的map文件内容是否有问题，两个map文件的SNP名称是否完全一致。如果有问题，程序中止。

# 判断两个map的SNP名称是否完全一致
ref_map_file = open(ref_map_name,'r')
in_map_file = open(f"{in_plink_prefix}.map",'r')

ref_snp_list = []
ref_snp_dick = {}
j = 0
temp_id_set = set()
for i in ref_map_file:
    j += 1
    f = i.split()
    if len(f) == 4:
        if f[1] not in temp_id_set:
            temp_id_set.add(f[1])
            ref_snp_list.append(f[1])
            ref_snp_dick[f[1]] = str(j) # snp: 序号
        else:
            print(f"Error: duplicated id {f[1]} in {ref_map_name}\n")
            error_status = True
    else:
        print(f"Error: {j} row in {ref_map_name} has less or more than 4 columns\n")
        error_status = True

in_snp_list = []
not_in_snp_list = [] # 不在参考map中的snp
j = 0
temp_id_set = set()
for i in in_map_file:
    j += 1
    f = i.split()
    if len(f) == 4:
        if f[1] not in temp_id_set:
            temp_id_set.add(f[1])
            in_snp_list.append(f[1])
            if f[1] not in ref_snp_dick:
                not_in_snp_list.append(f[1])
                error_status = True
        else:
            print(f"Error: duplicated id {f[1]} in {in_plink_prefix}.map\n")
            error_status = True
    else:
        print(f"Error: {j} row in {in_plink_prefix}.map has less or more than 4 columns\n")
        error_status = True


ref_map_file.close()
in_map_file.close()

if not_in_snp_list:
    temp_str = ", ".join(not_in_snp_list)
    print(f"Error: snp in {in_plink_prefix}.map but not in {ref_map_name}: {temp_str}\n")

if (not error_status) and (len(ref_snp_list) != len(in_snp_list)):
    print(f"Error: snp numbers of {in_plink_prefix}.map and {ref_map_name} are not same\n")
    error_status = True

if error_status:
    sys.exit(1)

SNP顺序重排

下面这一步是重点，我生成一个临时map文件，其中第二列与输入文件的map文件一致，但是物理位置却是参考map文件的SNP序号（染色体都设为1）。

# 生成临时map文件，染色体均设为1，物理位置设置为参考map的序号
temp_map_file = open(f"{in_plink_prefix}_temp.map",'w')
for i in in_snp_list:
    temp_map_file.write('1'+'\t'+i+'\t'+'0'+'\t'+ref_snp_dick[i]+'\n')

temp_map_file.close()

然后再用这个临时map文件去过一遍plink软件，plink 软件就会傻憨憨地按照第四列的”物理位置“进行SNP重新排序，实际就是按照参考map文件的SNP序号排序。说白了，这里就是我骗了一下plink软件，用plink软件自动按物理位置排序的性质，实现了以任何一种给定SNP顺序排序的目的。

1
2
3

# 重排
os.system(f"plink --allow-extra-chr --chr-set 95 --ped {in_plink_prefix}.ped --map {in_plink_prefix}_temp.map --recode --out {out_plink_prefix} > /dev/null")

然后我们使用参考map文件替换结果map文件

1 2	# 使用参考map替换结果map os.system(f"cp {ref_map_name} {out_plink_prefix}.map")

最后删除中间文件，这里就是那个临时map文件，过河拆桥了属于是。

1 2	# 删除中间文件 os.system("rm temp")

闲话

我写这个脚本时的需求，就是为了将plink软件处理后的基因型数据调整成plink软件处理前的顺序。也就是，就是用了 plink 软件导致SNP顺序改变了，plink 软件是这个问题的源头。然后，我现在又通过给plink 软件提供一个虚假的map文件，用 plink 软件又把SNP顺序调整回来了，也就是现在 plink 软件又成了这个问题的解药。感觉有点哲学的意思，就像金庸先生说的 “毒物旁边往往生长着解药” 。

版权声明： 本博客所有文章除特别声明外，著作权归作者所有。转载请注明出处！