专栏名称: 生信媛

生信媛，从1人分享，到8人同行。坚持分享生信入门方法与课程，持续记录生信相关的分析pipeline, python和R在生物信息学中的利用。内容涵盖服务器使用、基因组转录组分析以及群体遗传。

linux的日常·调整GFF文件中的坐标位置

生信媛 · 公众号 · 生物 · 2020-04-11 07:00

正文

请到「今天看啥」查看全文

周二洲更在编辑群@我，好玩的题目又来了~

数据详见：生信练习题：调整GFF文件中的坐标位置——by徐洲更

一直拖到今天周五，正好娱乐一下。

先大概看一下数据吧。数据大概分为2个部分：注释部分和数据主体部分（也就是我们需要处理的部分。）

因此：

如果是#开头的注释部分，不处理
如果非#开头的部分，则需要将chr8后边的起始位置数值提取出来，实现将该值与后边两个位置列进行相加-1。具体对于第一行而言，就是使得第4列的30280+25234310-1，第5列的30951+25234310-1。并且在数据处理之后，去掉chr8后边这一串 :25234310-25266151 .

对于需要处理的数据，思路也比较简单，先按照冒号(:)、短横杠(-)和tab分隔符(\t)将数据分列。分列之后，我们需要加上的数值在第2列，需要被加的值则在第6列和第7列。此外，需要排除第2列和第3列进行输出。

我们一步步来看。

首先，我们先将注释文件挑出来不处理，直接打印：

awk '{if(/^#/){print $0}}' target.gff|head

$ awk '{if(/^#/){print $0}}' target.gff|more# This output was generated with AUGUSTUS (version 3.2.3).# AUGUSTUS is a gene prediction tool written by M. Stanke ([email protected]),# O. Keller, S. König, L. Gerischer and L. Romoth.# Please cite: Mario Stanke, Mark Diekhans, Robert Baertsch, David Haussler (2008),# Using native and syntenically mapped cDNA alignments to improve de novo gene finding# Bioinformatics 24: 637-644, doi 10.1093/bioinformatics/btn013# No extrinsic information on sequences given.# arabidopsis version. Using default transition matrix.# We have hints for 0 sequences and for 0 of the sequences in the input set.## ----- prediction on sequence number 1 (length = 31842, name = chr8:25234310-25266151) -----## Constraints/Hints:# (none)# Predicted genes for sequence number 1 on both strands# start gene g1# protein sequence = [MERRKVEIKRIEKKSIRQVTFSKRRNGLMEKARQLSILCESSIAVLVVSDSGKLYNSTSGDKAFCSLQISCDLLLVFS# MKRPDACLEEAKSDNVSIDFLKSLEEQLKTALSITRDKKTELMMEFVKTLQEKVSVLVFIYWPFQAMLTLESSKTPSLEITIPS]# end gene g1###

简单解释一下：

//：表示匹配
^#：表示开头为#的
因此/^#/表示匹配开头为#号
$0表示整行

if(/^#/){print $0} 串起来就是: 如果是以#开头，那么打印整行。

好，我们继续。

对于不是#开头的行，我们需要对其进行分列，并对6、7列与第2列相加后减1。

awk 'BEGIN {FS="[-:\t]";OFS="\t"}{if(/^#/){print $0}else{$6=$6+$2-1;$7=$7+$2-1;print $0}}' target.gff

可以看到，分列成功并进行了相关的数值计算。

# This output was generated with AUGUSTUS (version 3.2.3).# AUGUSTUS is a gene prediction tool written by M. Stanke ([email protected]),# O. Keller, S. König, L. Gerischer and L. Romoth.# Please cite: Mario Stanke, Mark Diekhans, Robert Baertsch, David Haussler (2008),# Using native and syntenically mapped cDNA alignments to improve de novo gene finding# Bioinformatics 24: 637-644, doi 10.1093/bioinformatics/btn013# No extrinsic information on sequences given.# arabidopsis version. Using default transition matrix.




    
# We have hints for 0 sequences and for 0 of the sequences in the input set.## ----- prediction on sequence number 1 (length = 31842, name = chr8:25234310-25266151) -----## Constraints/Hints:# (none)# Predicted genes for sequence number 1 on both strands# start gene g1chr8	25234310	25266151	AUGUSTUS	gene	25237205	25239574	0.03	+	.	g1chr8	25234310	25266151	AUGUSTUS	transcript	25237205	25239574	0.03	+	.	g1.t1chr8	25234310	25266151	AUGUSTUS	tss	25237205	25237205	.	+	.	transcript_id "g1.t1"; gene_id "g1";chr8	25234310	25266151	AUGUSTUS	exon	25237205	25237684	.	+	.	transcript_id "g1.t1"; gene_id "g1";chr8	25234310	25266151	AUGUSTUS	start_codon	25237500	25237502	.	+	0	transcript_id "g1.t1"; gene_id "g1";chr8	25234310	25266151	AUGUSTUS	intron	25237685	25237767	0.62	+	.	transcript_id "g1.t1"; gene_id "g1";chr8	25234310	25266151	AUGUSTUS	intron	25237838	25239047	0.47	+	.	transcript_id "g1.t1"; gene_id "g1";chr8	25234310	25266151	AUGUSTUS	intron	25239144	25239268	0.27	+	.	transcript_id "g1.t1"; gene_id "g1";chr8	25234310	25266151	AUGUSTUS	CDS	25237500	25237684	0.69	+	0	transcript_id "g1.t1"; gene_id "g1";chr8	25234310	25266151	AUGUSTUS	CDS	25237768	25237837	0.57	+	1	transcript_id "g1.t1"; gene_id "g1";chr8	25234310	25266151	AUGUSTUS	exon	25237768	25237837	.	+	.	transcript_id "g1.t1"; gene_id "g1";chr8	25234310	25266151	AUGUSTUS	CDS	25239048	25239143	0.5	+	0	transcript_id "g1.t1"; gene_id "g1";chr8	25234310	25266151	AUGUSTUS	exon	25239048	25239143	.	+	.	transcript_id "g1.t1"; gene_id "g1";chr8	25234310	25266151	AUGUSTUS	CDS	25239269	25239406	0.21	+	0	transcript_id "g1.t1"; gene_id "g1";chr8	25234310	25266151	AUGUSTUS	exon	25239269	25239574	.	+	.	transcript_id "g1.t1"; gene_id "g1";chr8	25234310	25266151	AUGUSTUS	stop_codon	25239404	25239406	.	+	0	transcript_id "g1.t1"; gene_id "g1";chr8	25234310	25266151	AUGUSTUS	tts	25239574	25239574	.	+	.	transcript_id "g1.t1"; gene_id "g1";

FS是输入字段分隔符，中括号里边的分隔符的关系是“或”。即遇到 - 或者 : 或者 \t ，都作为分隔符分开。
OFS是输出字段分隔符。这里指定的是tab分隔。
BEGIN是为awk提供启动的操作。具体在本案例而言，就是告诉awk，我后边的FS是 -:\t ，OFS是"\t"。
else后边一串就是将第6列重新赋值为$ $2-1，第7列重新赋值为$ $2-1。然后输出整行。

下一个任务就是将第2、3列去掉。

awk 'BEGIN {FS="[-:\t]";OFS="\t"}{if(/^#/){print $0}else{$6=$6+$2-1;$7=$7+$2-1;$2=$3="";print $0}}' target.gff|more

# This output was generated with AUGUSTUS (version 3.2.3).# AUGUSTUS is a gene prediction tool written by M. Stanke ([email protected]),# O. Keller, S. König, L. Gerischer and L. Romoth.# Please cite: Mario Stanke, Mark Diekhans, Robert Baertsch, David Haussler (2008),# Using native and syntenically mapped cDNA alignments to improve de novo gene finding# Bioinformatics 24: 637-644, doi 10.1093/bioinformatics/btn013# No extrinsic information on sequences given.# arabidopsis version. Using default transition matrix.




    
# We have hints for 0 sequences and for 0 of the sequences in the input set.## ----- prediction on sequence number 1 (length = 31842, name = chr8:25234310-25266151) -----## Constraints/Hints:# (none)# Predicted genes for sequence number 1 on both strands# start gene g1chr8			AUGUSTUS	gene	25237205	25239574	0.03	+	.	g1chr8			AUGUSTUS	transcript	25237205	25239574	0.03	+	.	g1.t1chr8			AUGUSTUS	tss	25237205	25237205	.	+	.	transcript_id "g1.t1"; gene_id "g1";chr8			AUGUSTUS	exon	25237205	25237684	.	+	.	transcript_id "g1.t1"; gene_id "g1";chr8			AUGUSTUS	start_codon	25237500	25237502	.	+	0	transcript_id "g1.t1"; gene_id "g1";chr8			AUGUSTUS	intron	25237685	25237767	0.62	+	.	transcript_id "g1.t1"; gene_id "g1";chr8			AUGUSTUS	intron	25237838	25239047	0.47	+	.	transcript_id "g1.t1"; gene_id "g1";chr8			AUGUSTUS	intron	25239144	25239268	0.27	+	.	transcript_id "g1.t1"; gene_id "g1";chr8			AUGUSTUS	CDS	25237500	25237684	0.69	+	0	transcript_id "g1.t1"; gene_id "g1";chr8			AUGUSTUS	CDS	25237768	25237837	0.57	+	1	transcript_id "g1.t1"; gene_id "g1";chr8			AUGUSTUS	exon	25237768	25237837	.	+	.	transcript_id "g1.t1"; gene_id "g1";chr8			AUGUSTUS	CDS	25239048	25239143	0.5	+	0	transcript_id "g1.t1"; gene_id "g1";chr8			AUGUSTUS	exon	25239048	25239143	.	+	.	transcript_id "g1.t1"; gene_id "g1";chr8			AUGUSTUS	CDS	25239269	25239406	0.21	+	0	transcript_id "g1.t1"; gene_id "g1";chr8			AUGUSTUS	exon	25239269	25239574	.	+	.	transcript_id "g1.t1"; gene_id "g1";chr8			AUGUSTUS	stop_codon	25239404	25239406	.	+	0	transcript_id "g1.t1"; gene_id "g1";chr8			AUGUSTUS	tts	25239574	25239574	.	+	.	transcript_id "g1.t1"; gene_id "g1";

看起来似乎第2、3列似乎是没了。但是其实它们只是为空，我们看不到而已。

我们来仔细看看：

$ awk 'BEGIN {FS="[-:\t]";OFS="\t"}{if(/^#/){print $0}else{$6=$6+$2-1;$7=$7+$2-1;$2=$3="";print $0}}' target.gff|awk '!/^#/{print $0}'|head|cat -Achr8^I^I^IAUGUSTUS^Igene^I25237205^I25239574^I0.03^I+^I.^Ig1$chr8^I^I^IAUGUSTUS^Itranscript^I25237205^I25239574^I0.03^I+^I.^Ig1.t1$chr8^I^I^IAUGUSTUS^Itss^I25237205^I25237205^I.^I+^I.^Itranscript_id "g1.t1"; gene_id "g1";$chr8^I^I^IAUGUSTUS^Iexon^I25237205^I25237684^I.^I+^I.^Itranscript_id "g1.t1"; gene_id "g1";$chr8^I^I^IAUGUSTUS^Istart_codon^I25237500^I25237502^I.^I+^I0^Itranscript_id "g1.t1"; gene_id "g1";$chr8^I^I^IAUGUSTUS^Iintron^I25237685^I25237767^I0.62^I+^I.^Itranscript_id "g1.t1"; gene_id "g1";$chr8^I^I^IAUGUSTUS^Iintron^I25237838^I25239047^I0.47^I+^I.^Itranscript_id "g1.t1"; gene_id "g1";$chr8^I^I^IAUGUSTUS^Iintron^I25239144^I25239268^I0.27^I+^I.^Itranscript_id "g1.t1"; gene_id "g1";$chr8^I^I^IAUGUSTUS^ICDS^I25237500^I25237684^I0.69^I+^I0^Itranscript_id "g1.t1"; gene_id "g1";$chr8^I^I^IAUGUSTUS^ICDS^I25237768^I25237837^I0.57^I+^I1^Itranscript_id "g1.t1"; gene_id "g1";$

为了避免注释行对我们的干扰。我这里输出的时候用 awk '!/^#/{print $0}' 只输出非注释行。

可以看到，chr8后边有3个 ^I ，这个其实就是tab分隔符的真身。

为了结果更为令人愉悦，我还是决定把这个小刺拔掉。

$ awk 'BEGIN {FS="[-:\t]";OFS="\t"}{if(/^#/){print $0}else{$6=$6+$2-1;$7=$7+$2-1;$2=$3="";sub(/\t+/,"\t");print}}' target.gff|awk '!/^#/{print $0}'|head|cat -A

chr8^IAUGUSTUS^Igene^I25237205^I25239574^I0.03^I+^I.^Ig1$chr8^IAUGUSTUS^Itranscript^I25237205^I25239574^I0.03^I+^I.^Ig1.t1$chr8^IAUGUSTUS^Itss^I25237205^I25237205^I.^I+^I.^Itranscript_id "g1.t1"; gene_id "g1";$chr8^IAUGUSTUS^Iexon^I25237205^I25237684^I.^I+^I.^Itranscript_id "g1.t1"; gene_id "g1";$chr8^IAUGUSTUS^Istart_codon^I25237500^I25237502^I.^I+^I0^Itranscript_id "g1.t1"; gene_id "g1";$chr8^IAUGUSTUS^Iintron^I25237685^I25237767^I0.62^I+^I.^Itranscript_id "g1.t1"; gene_id "g1";$chr8^IAUGUSTUS^Iintron^I25237838^I25239047^I0.47^I+^I.^Itranscript_id "g1.t1"; gene_id "g1";$chr8^IAUGUSTUS^Iintron^I25239144^I25239268^I0.27^I+^I.^Itranscript_id "g1.t1"; gene_id "g1";$chr8^IAUGUSTUS^ICDS^I25237500^I25237684^I0.69^I+^I0^Itranscript_id "g1.t1"; gene_id "g1";$chr8^IAUGUSTUS^ICDS^I25237768^I25237837^I0.57^I+^I1^Itranscript_id "g1.t1"; gene_id "g1";$

sub(/\t+/,"\t") 就是使用sub进行字符串替换。将多个tab替换成单个tab。

最后方案

$ awk 'BEGIN {FS="[-:\t]";OFS="\t"}{if(/^#/){print $0}else{$6=$6+$2-1;$7=$7+$2-1;$2=$3="";sub(/\t+/,"\t");print}}' target.gff  > output.gff