软件学习-GCTB

GCTB 是用于分析众多贝叶斯方法的软件。

基本参数

输入输出

–bfile test

Input PLINK binary PED files, e.g. test.fam, test.bim and test.bed (see PLINK user manual for details).

–pheno test.phen

Input phenotype data from a plain text file, e.g. test.phen. (前2列是家系号和个体号,第3列是表型列)

–out test

Specify output root filename.

数据管理

–keep test.indi.list

Specify a list of individuals to be included in the analysis.

–chr 1

Include SNPs on a specific chromosome in the analysis, e.g. chromosome 1.

–extract test.snplist

Specify a list of SNPs to be included in the analysis.

–exclude test.snplist

Specify a list of SNPs to be excluded from the analysis.

–mpheno 2

If the phenotype file contains more than one trait, by default, GCTB takes the first trait for analysis (the third column of the file) unless this option is specified. For example, –mpheno 2 tells GCTB to take the second trait for analysis (the fourth column of the file).

–covar test.qcovar

Input quantitative covariates from a plain text file, e.g. test.qcovar. Each quantitative covariate is recognized as a continuous variable.

–random-covar test.randcovar

Input quantitative covariates from a plain text file which will be fitted as random effects in the model. For a categorical variable with k levels, create a matrix of 0/1 with k columns to indicate the presence of each level in a column.(没有特别理解这个选项)

MCMC选项

–seed 123

Specify the seed for random number generation, e.g. 123. Note that giving the same seed value would result in exactly the same results between two runs.

–chain-length 21000

Specify the total number of iterations in MCMC, e.g. 21000 (default). (总迭代次数)

–burn-in 1000

Specify the number of iterations to be discarded, e.g. 1000 (default).

–out-freq 100

Display the intermediate results for every 100 iterations (default).

–thin 10

Output the sampled values for SNP effects and genetic architecture parameters for every 10 iterations (default). Only non-zero sampled values of SNP effects are written into a binary file.

–no-mcmc-bin

Suppress the output of MCMC samples of SNP effects. (没有太看懂,貌似是说不输出SNP效应的抽样结果,那就是只输出方差组分的抽样结果)

贝叶斯方法选项

选项

–bayes S

Specify the Bayesian alphabet for the analysis, e.g. SS. Different alphabet launch different models, which differ in the prior specification for the SNP effects. The available alphabet include

  • B: Each SNP effect is assumed to have an i.i.d. mixture prior of a t-distribution t(0,,ν) with a probability π and a point mass at zero with a probability 1−π. ( 注意:π 是有效应的SNP比例,下同)
  • C: Each SNP effect is assumed to have an i.i.d. mixture prior of a normal distribution N(0,)with a probability π and a point mass at zero with a probability 1−π.
  • S: Similar to C but the variance of SNP effects is related to minor allele frequency () through a parameter , i.e. .
  • N: nested BayesC. SNPs within a 0.2 Mb non-overlapping genomic region are collectively considered as a window (specify the distance by –wind 0.2). This nested approach speeds up the analysis by skipping over windows with zero effect.
  • NS: nested BayesS.
  • R: BayesR. Each SNP effect is assumed to have an i.i.d. mixture prior of multiple normal distributions with a probability and a point mass at zero with a probability , where is a given constant.

–fix-pi

An option to fix to a constant (the value is specified by the option --pi below). The default setting is to treat π as random and estimate it from the data. (将 设定为一个固定值;不然 --pi 只是一个初始值)

–pi 0.05

A starting value for the sampling of π when it is estimated from the data, or a given value for when it is fixed. The default value is 0.05. When BayesR is used, it is a string seperated by comma where the number of values defines the number of mixture components and each value defines the starting value for each component (the first value is reserved for the zero component); the default values are 0.95,0.03,0.01,0.01.

–gamma 0,0.01,0.1,1

When BayesR is used, this speficies the gamma values seperated by comma, each representing the scaling factor for the variance of a mixture component. Note that the number of values should match that in --pi.

–hsq 0.5

A starting value for the sampling of SNP-based heritability, which may improve the mixing of MCMC algorithm if it starts with a good estimate. The default value is 0.5. (遗传力的初始值)

–S 0

A starting value for the sampling of the parameter S (relationship between MAF and variance of SNP effects) in BayesS, which may improve the mixing of MCMC algorithm if it starts with a good estimate. The default value is 0.

–wind 0.2

Specify the window width in Mb for the non-overlapping windows in the nested models, e.g. 0.2 Mb. The default value is 1 Mb.

–var-random 0.05

Specify the prior knowledge on the proportion of variance explained by the non-SNP random effects. This value will be used to specify the prior distribution for the non-SNP random effect variance variable.

举例

首先是 BayesA 和 BayesB (先验分布为 Scaled-t density), BayesA 可以理解为 固定为1 情况下的 BayesB

1
2
3
4
5
6
7
8
9
# BayesB
~/software/gctb --bfile DHA_${j}_SNP --pheno yhat_EPA.txt --keep ./CV/ID${i}.txt --bayes B \
--pi 0.05 --fix-pi --hsq 0.164 \
--chain-length 50000 --burn-in 20000 --out ./CV/${j}/BayesB/DHA_B${i}

# BayesA
~/software/gctb --bfile DHA_${j}_SNP --pheno yhat_EPA.txt --keep ./CV/ID${i}.txt --bayes B \
--pi 1.0 --fix-pi --hsq 0.164 \
--chain-length 50000 --burn-in 20000 --out ./CV/${j}/BayesA/DHA_A${i}

然后是 BayesC 和 BayesRR (先验分布为正态分布), BayesRR 可以理解为 固定为1 情况下的 BayesC

1
2
3
4
5
6
7
8
9
# BayesC
~/software/gctb --bfile DHA_${j}_SNP --pheno yhat_EPA.txt --keep ./CV/ID${i}.txt --bayes C \
--pi 0.05 --hsq 0.164 \
--chain-length 50000 --burn-in 20000 --out ./CV/${j}/BayesC/DHA_C${i}

# BayesRR
gctb --bfile ./G2/DHA_${j}_SNP --pheno yhat_DHA.txt --keep ./CV/ID${i}.txt --bayes C \
--pi 1.0 --fix-pi --hsq 0.752 \
--chain-length 50000 --burn-in 20000 --out ./CV/${j}/BayesRR/DHA_RR${i} >/dev/null

然后是 BayesS

1
2
3
4
# BayesS
~/software/gctb --bfile DHA_${j}_SNP --pheno yhat_EPA.txt --keep ./CV/ID${i}.txt --bayes S \
--pi 0.05 --hsq 0.164 \
--chain-length 50000 --burn-in 20000 --out ./CV/${j}/BayesS/DHA_S${i}\

然后是 BayesR (没有看懂,大概是说把SNP分为四类,零效应、小、中、大方差)

1
2
3
4
# BayesR
~/software/gctb --bfile DHA_${j}_SNP --pheno yhat_EPA.txt --keep ./CV/ID${i}.txt --bayes R \
--pi 0.95,0.03,0.01,0.01 --gamma 0,0.01,0.1,1 --hsq 0.164 \
--chain-length 50000 --burn-in 20000 --out ./CV/${j}/BayesR/DHA_R${i}

输出文件

我这只有下面前4个输出文件

test.log: a text file of running status, intermediate output and final results;

test.snpRes: a text file of posterior statistics of SNP effects;

test.covRes: a text file of posterior statistics of covariates;

test.parRes: a text file of posterior statistics of key model parameters;

test.mcmcsamples.CovEffects: a text file of MCMC samples for the covariates fitted in the model;

test.mcmcsamples.SnpEffects: a binary file of MCMC samples for the SNP effects;

test.mcmcsamples.Par: a text file of MCMC samples for the key model parameters;

这里重点结果文件就是 test.snpRes ,重点在于第2列(Name, 即SNP名称),第5列(A1),第8列(A1Effect, SNP的效应大小)。

然后用碱基计数(0,1,2;必须是A1碱基的计数) 乘以SNP效应值,然后累积,得到个体的GEBV。

贝叶斯方法参考文献

GCTB software and BayesS method
Zeng et al. (2018) Signatures of negative selection in the genetic architecture of human complex traits. Nature Genetics, doi: 10.1038/s41588-018-0101-4.

BayesR
Moser et al. (2015) Simultaneous discovery, estimation and prediction analysis of complex traits using a Bayesian mixture model. PLoS Genetics, 11: e1004969.

BayesN
Zeng et al. (2018) A nested mixture model for genomic prediction using whole-genome SNP genotypes. PLoS One, 13: e0194683.

BayesCπ
Habier et al. (2011) Extension of the Bayesian alphabet for genomic selection. BMC Bioinformatics, 12: 186.

BayesB
Meuwissen et al. (2001) Prediction of total genetic value using genome-wide dense marker maps. Genetics, 157: 1819-1829.

参考文献

  1. https://cnsgenomics.com/software/gctb/#Overview
  • 版权声明: 本博客所有文章除特别声明外,著作权归作者所有。转载请注明出处!
  • Copyrights © 2019-2025 Vincere Zhou
  • 访问人数: | 浏览次数:

请我喝杯茶吧~

支付宝
微信