
Everyday Linux: converting single-line records to FASTA

生信媛 · WeChat official account · Biology · 2020-03-18 07:00



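Forgive me, I really could not think of a better title for this one...

A friend asked me how to use Linux tools to convert data that stores the names and the sequence of each record on a single line into FASTA format.

An example shows it most directly: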

The original file looks like this, one record per line:

name1 [tab] name2 [tab] name3 [tab] sequence [tab] null

It should become:

>name1|name2|name3
sequence
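
To make the target mapping concrete before going through the steps, here is a minimal sketch that does the conversion field by field with awk. It is not the approach used in this post, only a reference point. The helper names make_sample and tsv_to_fasta are made up for illustration, and the sequences are truncated to their first 12 residues to keep the example short.

```shell
# Hypothetical two-record sample in the layout described above
# (name1, name2, name3, sequence, null; tab-separated; sequences truncated).
make_sample() {
    printf 'MetaCYC\tG-12237\tQ7X281\tMKVACIGAGPGG\tnull\n'
    printf 'MetaCYC\tEG10460\tP05719\tMSAGKLPEGWVI\tnull\n'
}

# Fields 1-3 become the header line, field 4 is the sequence;
# the trailing "null" field is simply never printed.
tsv_to_fasta() {
    awk -F'\t' '{ print ">" $1 "|" $2 "|" $3; print $4 }'
}

make_sample | tsv_to_fasta
```

On the real file this would be run as tsv_to_fasta < temp, assuming every line really has the five tab-separated fields shown above.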

OK, the first step is to turn every tab separator into a line break:

$ cat temp |sed 's/\t/\n/g'|head -8
MetaCYC
G-12237
Q7X281
MKVACIGAGPGGLFFATLLKRSRPDAEVVVFERNRPDDTFGFGVVFSDATLDAIDAADPVLSEALEKHGRHWDDIEVRVHGERERVGGMGMAAVVRKTLLSLLQERARAEGVQMRFQDEVRDPAELDDFDLVVVCDGANSRFRTLFADDFGPTAEVASAKFIWFGTTYMFDGLTFVHQDGPHGVFAAHAYPISDSLSTFIVETDADSWARAGLDAFDPATPLGMSDEKTKSYLEDLFRAQIDGHPLVGNNSRWANFATRRARSWRSGKWVLLGDAAHTAHFSVGSGTKMAMEDAVALAETLGEASRSVPEALDLYEERRRPKVERIQNSARPSLSWWEHFGRYVRSFDAPTQFAFHFLTRSIPRGKLAVRDAAYVDRVDGWWLRHHEAGPLKTPFRVGPYRLPTRRVTVGDDLLTGTDGTGIPMVPFSGQPFGAGVWIDAPDTEEGLPLALDQVRETAEAGVLLVGVRGGTALTRVLVAEEARLAHSLPAAIVGAYDDDTATTLVLSGRADLVGGTK
null
MetaCYC
EG10460
P05719

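A brief explanation:

  • s : the substitute command, which replaces the text matched by a pattern
  • \t : the tab character
  • \n : the newline character
  • g : global replacement. Without it, only the first match on each line is replaced; with it, every match is replaced.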
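
The effect of the g flag is easy to see on a toy line. The sketch below uses a comma instead of a tab so that it does not depend on how a particular sed handles \t and \n. Those escapes work in GNU sed but may not in other implementations, so tr is shown as a portable way to do the same split. The function names are made up for illustration.

```shell
# Without g: only the first match on each line is replaced.
first_only() { sed 's/,/|/'; }

# With g: every match on the line is replaced.
all_matches() { sed 's/,/|/g'; }

# Portable alternative for the tab-to-newline step, using tr instead of sed.
split_tabs() { tr '\t' '\n'; }

printf 'a,b,c\n' | first_only     # prints a|b,c
printf 'a,b,c\n' | all_matches    # prints a|b|c
printf 'x\ty\tz\n' | split_tabs   # prints x, y and z on separate lines
```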

Next, use grep -v to drop the null lines:

$ cat temp |sed 's/\t/\n/g'|grep -v "null"|head -8
MetaCYC
G-12237
Q7X281
MKVACIGAGPGGLFFATLLKRSRPDAEVVVFERNRPDDTFGFGVVFSDATLDAIDAADPVLSEALEKHGRHWDDIEVRVHGERERVGGMGMAAVVRKTLLSLLQERARAEGVQMRFQDEVRDPAELDDFDLVVVCDGANSRFRTLFADDFGPTAEVASAKFIWFGTTYMFDGLTFVHQDGPHGVFAAHAYPISDSLSTFIVETDADSWARAGLDAFDPATPLGMSDEKTKSYLEDLFRAQIDGHPLVGNNSRWANFATRRARSWRSGKWVLLGDAAHTAHFSVGSGTKMAMEDAVALAETLGEASRSVPEALDLYEERRRPKVERIQNSARPSLSWWEHFGRYVRSFDAPTQFAFHFLTRSIPRGKLAVRDAAYVDRVDGWWLRHHEAGPLKTPFRVGPYRLPTRRVTVGDDLLTGTDGTGIPMVPFSGQPFGAGVWIDAPDTEEGLPLALDQVRETAEAGVLLVGVRGGTALTRVLVAEEARLAHSLPAAIVGAYDDDTATTLVLSGRADLVGGTK
MetaCYC
EG10460
P05719
MSAGKLPEGWVIAPVSTVTTLIRGVTYKKEQAINYLKDDYLPLIRANNIQNGKFDTTDLVFVPKNLVKESQKISPEDIVIAMSSGSKSVVGKSAHQHLPFECSFGAFCGVLRPEKLIFSGFIAHFTKSSLYRNKISSLSAGANINNIKPASFDLINIPIPPLAEQKIIAEKLDTLLAQVDSTKARFEQIPQILKRFRQAVLGGAVNGKLTEKWRNFEPQHSVFKKLNFESILTELRNGLSSKPNESGVGHPILRISSVRAGHVDQNDIRFLECSESELNRHKLQDGDLLFTRYNGSLEFVGVCGLLKKLQHQNLLYPDKLIRARLTKDALPEYIEIFFSSPSARNAMMNCVKTTSGQKGISGKDIKSQVVLLPPVKEQAEIVRRVEQLFAYADTIEKQVNNALARVNNLTQSILAKAFRGELTAQWRAENPDLISGENSAAALLEKIKAERAASGGKKASRKKS

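grep -v "null" removes every line that contains the string "null". It can be written more strictly as grep -v "^null$" , which removes only lines that consist of exactly "null":

  • ^ matches the start of a line
  • $ matches the end of a line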
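
The difference between the loose and the anchored pattern only shows up when some other line happens to contain "null". A small sketch, using a made-up name (nullA_gene) and made-up function names:

```shell
# Hypothetical input: a name that merely contains "null", a sequence line,
# and a real "null" placeholder line.
sample_lines() { printf 'nullA_gene\nMKVACIG\nnull\n'; }

# Drops every line that contains "null" anywhere.
loose_filter() { grep -v 'null'; }

# Drops only lines that are exactly "null".
strict_filter() { grep -v '^null$'; }

sample_lines | loose_filter     # prints MKVACIG only
sample_lines | strict_filter    # prints nullA_gene and MKVACIG
```

grep -vx 'null' should behave the same as the anchored form, since -x requires the whole line to match.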

Next, join the first three lines of every group of four into a single line, separated by | .

$ cat temp |sed 's/\t/\n/g'|grep -v "null"|awk '{if(NR%4!=0){printf "|%s",$0}else{printf "\n" $0 "\n"}}'|head -4
|MetaCYC|G-12237|Q7X281
MKVACIGAGPGGLFFATLLKRSRPDAEVVVFERNRPDDTFGFGVVFSDATLDAIDAADPVLSEALEKHGRHWDDIEVRVHGERERVGGMGMAAVVRKTLLSLLQERARAEGVQMRFQDEVRDPAELDDFDLVVVCDGANSRFRTLFADDFGPTAEVASAKFIWFGTTYMFDGLTFVHQDGPHGVFAAHAYPISDSLSTFIVETDADSWARAGLDAFDPATPLGMSDEKTKSYLEDLFRAQIDGHPLVGNNSRWANFATRRARSWRSGKWVLLGDAAHTAHFSVGSGTKMAMEDAVALAETLGEASRSVPEALDLYEERRRPKVERIQNSARPSLSWWEHFGRYVRSFDAPTQFAFHFLTRSIPRGKLAVRDAAYVDRVDGWWLRHHEAGPLKTPFRVGPYRLPTRRVTVGDDLLTGTDGTGIPMVPFSGQPFGAGVWIDAPDTEEGLPLALDQVRETAEAGVLLVGVRGGTALTRVLVAEEARLAHSLPAAIVGAYDDDTATTLVLSGRADLVGGTK
|MetaCYC|EG10460|P05719
MSAGKLPEGWVIAPVSTVTTLIRGVTYKKEQAINYLKDDYLPLIRANNIQNGKFDTTDLVFVPKNLVKESQKISPEDIVIAMSSGSKSVVGKSAHQHLPFECSFGAFCGVLRPEKLIFSGFIAHFTKSSLYRNKISSLSAGANINNIKPASFDLINIPIPPLAEQKIIAEKLDTLLAQVDSTKARFEQIPQILKRFRQAVLGGAVNGKLTEKWRNFEPQHSVFKKLNFESILTELRNGLSSKPNESGVGHPILRISSVRAGHVDQNDIRFLECSESELNRHKLQDGDLLFTRYNGSLEFVGVCGLLKKLQHQNLLYPDKLIRARLTKDALPEYIEIFFSSPSARNAMMNCVKTTSGQKGISGKDIKSQVVLLPPVKEQAEIVRRVEQLFAYADTIEKQVNNALARVNNLTQSILAKAFRGELTAQWRAENPDLISGENSAAALLEKIKAERAASGGKKASRKKS

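awk '{if(NR%4!=0){printf "|%s",$0}else{printf "\n" $0 "\n"}}' means: if the line number (NR) modulo 4 is not 0 (that is, the line number is not a multiple of 4), print | followed by the whole line ($0) without a trailing newline; otherwise print a newline, the whole line, and another newline.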
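
The same logic can be tried on a tiny input. One detail worth knowing: printf "\n" $0 "\n" concatenates the line into the format string, so a % inside the data would be read as a format specifier. Protein sequences should not contain one, but the sketch below passes $0 as an argument instead, which avoids the issue. The sample lines and function names are placeholders.

```shell
# Eight hypothetical lines: three name lines plus one sequence line, twice.
four_line_records() {
    printf 'db\nid1\nacc1\nSEQ1\ndb\nid2\nacc2\nSEQ2\n'
}

# Lines 1-3 of each group are printed as "|line" without a newline;
# every 4th line is printed on its own line.
join_names() {
    awk '{ if (NR % 4 != 0) { printf "|%s", $0 } else { printf "\n%s\n", $0 } }'
}

four_line_records | join_names
# |db|id1|acc1
# SEQ1
# |db|id2|acc2
# SEQ2
```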

The last step is simple. Replace the leading | of each header with > :

$ cat temp |sed 's/\t/\n/g'|grep -v "null"|awk '{if(NR%4!=0){printf "|%s",$0}else{printf "\n" $0 "\n"}}'|sed 's/|MetaCYC/>MetaCYC/'|head -4
>MetaCYC|G-12237|Q7X281
MKVACIGAGPGGLFFATLLKRSRPDAEVVVFERNRPDDTFGFGVVFSDATLDAIDAADPVLSEALEKHGRHWDDIEVRVHGERERVGGMGMAAVVRKTLLSLLQERARAEGVQMRFQDEVRDPAELDDFDLVVVCDGANSRFRTLFADDFGPTAEVASAKFIWFGTTYMFDGLTFVHQDGPHGVFAAHAYPISDSLSTFIVETDADSWARAGLDAFDPATPLGMSDEKTKSYLEDLFRAQIDGHPLVGNNSRWANFATRRARSWRSGKWVLLGDAAHTAHFSVGSGTKMAMEDAVALAETLGEASRSVPEALDLYEERRRPKVERIQNSARPSLSWWEHFGRYVRSFDAPTQFAFHFLTRSIPRGKLAVRDAAYVDRVDGWWLRHHEAGPLKTPFRVGPYRLPTRRVTVGDDLLTGTDGTGIPMVPFSGQPFGAGVWIDAPDTEEGLPLALDQVRETAEAGVLLVGVRGGTALTRVLVAEEARLAHSLPAAIVGAYDDDTATTLVLSGRADLVGGTK
>MetaCYC|EG10460|P05719
MSAGKLPEGWVIAPVSTVTTLIRGVTYKKEQAINYLKDDYLPLIRANNIQNGKFDTTDLVFVPKNLVKESQKISPEDIVIAMSSGSKSVVGKSAHQHLPFECSFGAFCGVLRPEKLIFSGFIAHFTKSSLYRNKISSLSAGANINNIKPASFDLINIPIPPLAEQKIIAEKLDTLLAQVDSTKARFEQIPQILKRFRQAVLGGAVNGKLTEKWRNFEPQHSVFKKLNFESILTELRNGLSSKPNESGVGHPILRISSVRAGHVDQNDIRFLECSESELNRHKLQDGDLLFTRYNGSLEFVGVCGLLKKLQHQNLLYPDKLIRARLTKDALPEYIEIFFSSPSARNAMMNCVKTTSGQKGISGKDIKSQVVLLPPVKEQAEIVRRVEQLFAYADTIEKQVNNALARVNNLTQSILAKAFRGELTAQWRAENPDLISGENSAAALLEKIKAERAASGGKKASRKKS
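
The substitution above is tied to the name MetaCYC. If the first column could hold other database names, anchoring on the leading | instead should work for any of them. A minimal sketch, where OtherDB, X1 and Y1 are invented values and the sequences are shortened:

```shell
# Hypothetical output of the previous step with two different database names.
piped_headers() {
    printf '|MetaCYC|G-12237|Q7X281\nMKVACIG\n|OtherDB|X1|Y1\nMSAGKLP\n'
}

# Replace a "|" at the start of a line with ">", whatever follows it.
# Sequence lines do not start with "|" and are left untouched.
add_marker() { sed 's/^|/>/'; }

piped_headers | add_marker
# >MetaCYC|G-12237|Q7X281
# MKVACIG
# >OtherDB|X1|Y1
# MSAGKLP
```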

That is really all there is to it. But it still felt too long-winded to me, so I tried another approach:

$ cat temp |sed 's/\t/\n/g'|grep -v "null"|sed 'N;N;s/\n/|/g;s/^/>/;N'|head -4
>MetaCYC|G-12237|Q7X281
MKVACIGAGPGGLFFATLLKRSRPDAEVVVFERNRPDDTFGFGVVFSDATLDAIDAADPVLSEALEKHGRHWDDIEVRVHGERERVGGMGMAAVVRKTLLSLLQERARAEGVQMRFQDEVRDPAELDDFDLVVVCDGANSRFRTLFADDFGPTAEVASAKFIWFGTTYMFDGLTFVHQDGPHGVFAAHAYPISDSLSTFIVETDADSWARAGLDAFDPATPLGMSDEKTKSYLEDLFRAQIDGHPLVGNNSRWANFATRRARSWRSGKWVLLGDAAHTAHFSVGSGTKMAMEDAVALAETLGEASRSVPEALDLYEERRRPKVERIQNSARPSLSWWEHFGRYVRSFDAPTQFAFHFLTRSIPRGKLAVRDAAYVDRVDGWWLRHHEAGPLKTPFRVGPYRLPTRRVTVGDDLLTGTDGTGIPMVPFSGQPFGAGVWIDAPDTEEGLPLALDQVRETAEAGVLLVGVRGGTALTRVLVAEEARLAHSLPAAIVGAYDDDTATTLVLSGRADLVGGTK
>MetaCYC|EG10460|P05719
MSAGKLPEGWVIAPVSTVTTLIRGVTYKKEQAINYLKDDYLPLIRANNIQNGKFDTTDLVFVPKNLVKESQKISPEDIVIAMSSGSKSVVGKSAHQHLPFECSFGAFCGVLRPEKLIFSGFIAHFTKSSLYRNKISSLSAGANINNIKPASFDLINIPIPPLAEQKIIAEKLDTLLAQVDSTKARFEQIPQILKRFRQAVLGGAVNGKLTEKWRNFEPQHSVFKKLNFESILTELRNGLSSKPNESGVGHPILRISSVRAGHVDQNDIRFLECSESELNRHKLQDGDLLFTRYNGSLEFVGVCGLLKKLQHQNLLYPDKLIRARLTKDALPEYIEIFFSSPSARNAMMNCVKTTSGQKGISGKDIKSQVVLLPPVKEQAEIVRRVEQLFAYADTIEKQVNNALARVNNLTQSILAKAFRGELTAQWRAENPDLISGENSAAALLEKIKAERAASGGKKASRKKS

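It barely reads like human language any more... though it does look very professional.

So what does sed 'N;N;s/\n/|/g;s/^/>/;N' mean?

When sed starts a cycle, the current line (MetaCYC) is already in the pattern space, which is sed's working buffer. The first N appends the next line (G-12237) to it, so the pattern space now holds two lines separated by a newline (MetaCYC\nG-12237, where \n stands for the newline). A second N appends the line after that (the third line), so three lines are stored (MetaCYC\nG-12237\nQ7X281). Then s/\n/|/g replaces the newlines with | (MetaCYC|G-12237|Q7X281), and s/^/>/ puts > in front of the result (>MetaCYC|G-12237|Q7X281). The final N appends the next line, the sequence, after this finished header line, and the two lines are printed together.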
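
What N does is easier to see on a toy input. In the sketch below, the first function runs a single N and replaces the embedded newline with + so the join becomes visible; the second is the full command from this post applied to two small records. The sample lines and function names are placeholders, and both sketches assume the number of input lines is a multiple of the group size.

```shell
# Eight hypothetical lines: three name lines plus one sequence line, twice.
toy_records() {
    printf 'db\nid1\nacc1\nSEQ1\ndb\nid2\nacc2\nSEQ2\n'
}

# One N: the pattern space holds two lines; the newline between them becomes "+".
join_pairs() { sed 'N;s/\n/+/'; }

# The full command: gather three lines, join them with "|", prefix ">",
# then append the sequence line before printing.
names_to_header() { sed 'N;N;s/\n/|/g;s/^/>/;N'; }

toy_records | join_pairs
# db+id1
# acc1+SEQ1
# db+id2
# acc2+SEQ2

toy_records | names_to_header
# >db|id1|acc1
# SEQ1
# >db|id2|acc2
# SEQ2
```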

OK, that's all for today.

Here is the test data for this post:

Link: https://pan.baidu.com/s/1w31KXlvyGOJqeS0dOfYsjA
Extraction code: rpag








