专栏名称: 生信媛

生信媛，从1人分享，到8人同行。坚持分享生信入门方法与课程，持续记录生信相关的分析pipeline, python和R在生物信息学中的利用。内容涵盖服务器使用、基因组转录组分析以及群体遗传。

Lecture 21 - 使用samtools进行变异的calling

生信媛 · 公众号 · 生物 · 2017-08-24 23:42

正文

翻译：生物女博士

注：本系列课程翻译已获得原作者授权。

# 为作者的注释

#* 为译者的注释

Lecture 21 - 使用samtools进行变异的calling

# 建一个转换和排序的BAM文件的“快捷方式”。

alias tobam='samtools view -b - | samtools sort -o - tmp '

# 安装bcftools.

cd ~/src

curl -OL http://sourceforge.net/projects/samtools/files/samtools/1.1/bcftools-1.1.tar.bz2

tar jxvf bcftools-1.1.tar.bz2

cd bcftools-1.1

make

ln -s ~/src/bcftools-1.1/bcftools ~/bin

# 使用Lecture18的比对脚本

bash compare.sh

# 有参考基因组地Pileup

samtools mpileup -f ~/refs/852/NC.fa -Q 0 bwa.bam | more

# 生成基因型，然后把变异call出来。

# Samtools是为二倍体的人类基因组设计的，可能会有与之对应的隐含假设包含其中。

samtools mpileup -Q 0 -ugf ~/refs/852/NC.fa bwa.bam | bcftools call -vc > samtools.vcf

# 查看call出来的snp。

# snp calling是如何工作的？

# 以我写的一个DIY的call snp程序作为简单的演示（见网站）

#* 代码在本章节结尾

samtools mpileup -f ~/refs/852/NC.fa -Q 0 bwa.bam > bwa.pileup

cat bwa.pileup | python snpcaller.py > diy.vcf

# 我简单版的程序跟samtools用默认设置时得到一样的结果——感觉如何？

# 欢迎来到你的下一个文件格式。

# 按列分割文件。

cat samtools.vcf| grep -v "##" | cut -f 1,2,3,4,5 | head

# 安装Freebayes

# Mac上我们需要另一个程序来编译freebayes。

brew install cmake

# 现在来安装FreeBayes

#* 使用git clone需要先安装git。

#*也可以直接到 https:// github.com/ekg/freebayes.git 上下载到本地，解压，然后将文件夹上传到服务器的~/src目录下进行安装。

cd ~/src

git clone --recursive git://github.com/ekg/freebayes.git

cd freebayes

# 在Mac上，下边的命令会报错。

make

# 但第二次运行就正常了，暂不清楚什么原因。程序似乎也是可以用的——但是可能有些功能是缺失的。

make

ln -sf ~/src/freebayes/bin/freebayes ~/bin

# 现在回到主目录，用FreeBayes来call SNP。

cd ~/edu/lec21/

# 程序该怎么用？

freebayes

# 用它来call snps。

freebayes -f ~/refs/852/NC.fa bwa.bam > freebayes.vcf

# 可视化结果

#*可以导入IGV查看。

DIY snp caller

#*这是一个python编写的程序，python用缩进来标明成块的代码，如果缩进有问题，程序也会出错。而微信公众号排版有时候会有一些问题，故建议查看原英文网站，确认自己的这个python脚本没问题。

# DIY snip caller
import sys
from collections import defaultdict
# This code was written in about 20 minutes to demonstrate
# a really simple snp caller. As it is it calls only
# homozygous mutation defined as over 50% of calls.

print "##fileformat=VCFv4.2"
print "\t".join("#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT bwa.bam".split())

for line in sys.stdin:
    # Strip the newline then split by tabs.
    column = line.strip().split("\t")
    
    # The fifth base contains the basecalls# (zero based counting).
    bases = column[4]
    
    # The quals column will tell us how many bases are called.
    quals = column[5]
    
    # Upper case everything.
    # We do lose strand information with that.
    bases = bases.upper()
    count = defaultdict(int)  
    for c in bases:
        count[c] += 1

    # How many bases cover the site
    depth = float(len(quals))
    
    for m in "ATGC":
        if count[m]/depth > 0.50:
        # We got a homozygous mutations
        # produce the CVF records
        # Collect output into an array.

        info = "DP=%s" % (int(depth))
        format = "GT:PL"
        values = "1/1:30"    
        out = [column[0], column[1], ".", column[2], m, 30, ".", info, format, values]
        
        # Make all elements of the array a string.
        out = map(str, out)
        
        # Join the fields of the output array by tabs
        # Print the joined fields.
        print "\t".join(out)

Lecture 22 - 预测变异的影响

# 安装、创建以及链接bioawk

cd ~/src

#* 使用git clone需要先安装git。

git clone https://github.com/lh3/bioawk.git

#* 也可以直接到 https://github.com/lh3/bioawk.git 上下载到本地，解压，然后将文件夹上传到服务器的~/src目录下进行安装。

cd bioawk

#* 译者下载后的名字为bioawk-master，在此步骤将命令对应更改为cd bioawk-master/ 即可，其他一样。

make

ln -sf ~/src/bioawk/bioawk ~/bin

# bioawk有自带的一个手册。

man ~/src/bioawk/awk.1

# Bioawk知道怎样处理生物信息类型数据

# 它不仅能按列分割数据，还可以识别列名

# 通过运行compare.sh脚本生成模拟reads并比对后（详情参见之前的课程），现在用bioawk来处理它们。

bioawk -c help

cat r1.fq | bioawk -c fastx ' { print $seq } ' | head -1

cat mutations.gff | bioawk -c gff ' { print $seqname, $start } ' | head -1

# 跟下边写法比起来，bioawk使得代码变得更有可读性：

cat mutations.gff | grep -v "#" | awk -F '\t' -v OFS='\t' ' { print $1, $5-$4+1 } ' | head -1

Lecture 21 - 使用samtools进行变异的calling

正文

请到「今天看啥」查看全文