专栏名称: 生信媛
生信媛,从1人分享,到8人同行。坚持分享生信入门方法与课程,持续记录生信相关的分析pipeline, python和R在生物信息学中的利用。内容涵盖服务器使用、基因组转录组分析以及群体遗传。
目录
相关文章推荐
51好读  ›  专栏  ›  生信媛

不用命令行,用R也能提取基因的启动子序列

生信媛  · 公众号  · 生物  · 2019-12-28 18:39

正文

请到「今天看啥」查看全文


对于提取序列来说,我们一般会选择用samtools、bedtools等,因为他们有非常完善的操作 序列 的功能,比如取两个bed文件的交集,提取某段序列等等。但实际上,他们对于 区间 的操作是不如R方便的,比如取基因body,取第一个外显子,取启动子区域等等。所以,我们有时候就会想能否在R里面,花式找了些区间之后,提取他们的序列,实际上,这也是可以的,需要分成几步。

首先我们需要先加载一些包

  1. # BSgenome.Athaliana.TAIR.TAIR9是拟南芥的基因组序列包

  2. # TAIR9和TAIR10是一样的,无非是TAIR10自身的基因注释丰富了些

  3. library(BSgenome.Athaliana.TAIR.TAIR9)


  4. # TxDb.Athaliana.BioMart.plantsmart28是拟南芥的Txdb包

  5. # 就是我们ChIPSeeker做peak注释的那个Txdb

  6. library(TxDb.Athaliana.BioMart.plantsmart28)

  7. Txdb_package TxDb.Athaliana.BioMart.plantsmart28


  8. # 也可以用自己的gtf做

  9. # 所以这些操作也可以针对非模式物种

  10. Txdb_gtf makeTxDbFromGFF("~/reference/annoation/Athaliana/Araport11/Araport11_GFF3_genes_transposons.201606.gtf")

然后是提取基因的区间

  1. > gene_package genes(Txdb_package)

  2. > gene_package

  3. GRanges object with 33602 ranges and 1 metadata column:

  4. seqnames ranges strand | gene_id

  5. <Rle> <IRanges> <Rle> |

  6. AT1G01010 1 3631-5899 + | AT1G01010

  7. AT1G01020 1 5928-8737 - | AT1G01020

  8. AT1G01030 1 11649-13714 - | AT1G01030

  9. AT1G01040 1 23146-31227 + | AT1G01040

  10. AT1G01046 1 28500-28706 + | AT1G01046

  11. ... ... ... ... . ...

  12. ATMG01370 Mt 360717-361052 - | ATMG01370

  13. ATMG01380 Mt 361062-361179 - | ATMG01380

  14. ATMG01390 Mt 361350-363284 - | ATMG01390

  15. ATMG01400 Mt 363725-364042 + | ATMG01400

  16. ATMG01410 Mt 366086-366700 - | ATMG01410

  17. -------

  18. seqinfo: 7 sequences (1 circular) from an unspecified genome


  19. # 注意这里有个no seqlengths

  20. > gene_gtf genes(Txdb_gtf )

  21. > gene_gtf

  22. GRanges object with 37330 ranges and 1 metadata column:

  23. seqnames ranges strand | gene_id

  24. <Rle> <IRanges> <Rle> |

  25. AT1G01010 Chr1 3631-5899 + | AT1G01010

  26. AT1G01020 Chr1 6788-9130 - | AT1G01020

  27. AT1G01030 Chr1 11649-13714 - | AT1G01030

  28. AT1G01040 Chr1 23121-31227 + | AT1G01040

  29. AT1G01050 Chr1 31170- 33171 - | AT1G01050

  30. ... ... ... ... . ...

  31. ATMG09730 ChrM 181268-181347 + | ATMG09730

  32. ATMG09740 ChrM 191954-192025 - | ATMG09740

  33. ATMG09950 ChrM 333651-333725 - | ATMG09950

  34. ATMG09960 ChrM 347266-347348 + | ATMG09960

  35. ATMG09980 ChrM 359666-359739 - | ATMG09980

  36. -------

  37. seqinfo: 7 sequences (2 circular) from an unspecified genome; no seqlengths

然后根据基因区域来提取启动子序列

  1. > promoter_package promoters(gene_package,upstream = 4000,downstream = 500)

  2. Warning message:

  3. In valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE) :

  4. GRanges object contains 11 out-of-bound ranges located on sequences 1, 2, 3, 4, and Pt. Note that

  5. ranges located on a sequence whose length is unknown (NA) or on a circular sequence are not

  6. considered out-of-bound (use seqlengths() and isCircular() to get the lengths and circularity flags

  7. of the underlying sequences). You can use trim() to trim these ranges. See

  8. ?`trim,GenomicRanges-method` for more information.


  9. > promoter_gtf promoters(gene_gtf,upstream = 4000,downstream = 500)

这里就出现了一个坑了,就是你在用Txdb包的时候,出现了warning,但用自建gtf构建的Txdb没有出现了warning。实际上,这种warning是因为你的区间取过界了。

  1. > promoter_package

  2. GRanges object with 33602 ranges and 1 metadata column:

  3. seqnames ranges strand | gene_id

  4. <Rle> <IRanges> <Rle> |

  5. AT1G01010 1 -369-4130 + | AT1G01010

  6. AT1G01020 1 8238-12737 - | AT1G01020

  7. AT1G01030 1 13215-17714 - | AT1G01030

  8. AT1G01040 1 19146-23645 + | AT1G01040

  9. AT1G01046 1 24500-28999 + | AT1G01046

  10. ... ... ... ... . ...

  11. ATMG01370 Mt 360553-365052 - | ATMG01370

  12. ATMG01380 Mt 360680-365179 - | ATMG01380

  13. ATMG01390 Mt 362785-367284 - | ATMG01390

  14. ATMG01400 Mt 359725-364224 + | ATMG01400

  15. ATMG01410 Mt 366201-370700 - | ATMG01410

  16. -------

  17. seqinfo: 7 sequences (1 circular) from an unspecified genome

而promoter_gtf没有报错是因为他不知道 你的染色体的长度范围在哪里

所以下面我的操作都是基于promoter_gtf,这样还可以应用在 非模式物种 里面。首先你在构建Txdb的时候就需要给他传进去染色体长度了。但你怎么知道每条染色体是多长呢,这就需要你的fai文件了。fai文件是给基因组fa做索引的时候出现的,没有的话你可以直接用samtools faidx做下。

  1. $ samtools faidx Athaliana.fa


  2. # 这里的第2列是染色体长度

  3. Chr1 30427671 6 79 80

  4. Chr2 19698289 30812844 79 80

  5. Chr3 23459830 50760485 79 80

  6. Chr4 18585056 74517281 79 80

  7. Chr5 26975502 93337597 79 80

  8. ChrM 366924 120654568 79 80

  9. ChrC 154478 121026144 79 80

然后我们就可以在构建的时候手动输进对应的长度了,但这只是拟南芥这个组装完整的基因组,如果是一些非模式物种,scaffold级别的话,就会有数百条染色体长度了,根本不可能手动输。所以我们还需要读进R里面进行一些操作。

  1. # 注意不要变成因子

  2. > seq_length read.table("~/reference/genome/TAIR10/Athaliana.fa.fai",

  3. + stringsAsFactors = F)[,1:2]

  4. > seq_length$V1

  5. [1] "Chr1" "Chr2" "Chr3" "Chr4" "Chr5" "ChrM" "ChrC"

  6. > seq_length$V2

  7. [1] 30427671 19698289 23459830 18585056 26975502 366924 154478

  8. > Txdb_gtf makeTxDbFromGFF("~/reference/annoation/Athaliana/Araport11/Araport11_GFF3_genes_transposons.201606.gtf",

  9. + chrominfo = Seqinfo(seqnames = seq_length$V1,

  10. + seqlengths = seq_length$V2))

这样我们提取启动子序列的时候就会报错了

  1. > gene_gtf genes(Txdb_gtf)


  2. # 此时no seqlengths消失了

  3. > gene_gtf

  4. GRanges object with 37330 ranges and 1 metadata column:

  5. seqnames ranges strand | gene_id

  6. <Rle> <IRanges> <Rle> |

  7. AT1G01010 Chr1 3631-5899 + | AT1G01010

  8. AT1G01020 Chr1 6788-9130 - | AT1G01020

  9. AT1G01030 Chr1 11649-13714 - | AT1G01030

  10. AT1G01040 Chr1 23121-31227 + | AT1G01040

  11. AT1G01050 Chr1 31170-33171 - | AT1G01050

  12. ... ... ... ... . ...

  13. ATMG09730 ChrM 181268-181347 + | ATMG09730

  14. ATMG09740 ChrM 191954-192025 - | ATMG09740

  15. ATMG09950 ChrM 333651-333725 - | ATMG09950

  16. ATMG09960 ChrM 347266-347348 + | ATMG09960

  17. ATMG09980 ChrM 359666-359739 - | ATMG09980

  18. -------

  19. seqinfo: 7 sequences from an unspecified genome


  20. # 出现了报错

  21. > promoter_gtf promoters(gene_gtf,upstream = 4000,downstream = 500)

  22. Warning message:

  23. In valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE) :

  24. GRanges object contains 15 out-of-bound ranges located on sequences Chr1, Chr2, Chr3, Chr4, Chr5, ChrC, and ChrM.

  25. Note that ranges located on a sequence whose length is unknown (NA) or on a circular sequence are not considered

  26. out-of-bound (use seqlengths() and isCircular() to get the lengths and circularity flags of the underlying

  27. sequences). You can use trim() to trim these ranges. See ?`trim,GenomicRanges-method` for more information.


  28. # 我们用trim函数去掉越过界的区间

  29. # 原本过界的就变成了1了

  30. > promoter_gtf trim(promoter_gtf)

  31. > promoter_gtf

  32. GRanges object with 37330 ranges and 1 metadata column:

  33. seqnames ranges strand | gene_id

  34. <Rle> <IRanges> <Rle> |

  35. AT1G01010 Chr1 1-4130 + | AT1G01010

  36. AT1G01020 Chr1 8631-13130 - | AT1G01020

  37. AT1G01030 Chr1 13215-17714 - | AT1G01030

  38. AT1G01040 Chr1 19121-23620 + | AT1G01040

  39. AT1G01050 Chr1 32672-37171 - | AT1G01050

  40. ... ... ... ... . ...

  41. ATMG09730 ChrM 177268-181767 + | ATMG09730

  42. ATMG09740 ChrM 191526-196025 - | ATMG09740

  43. ATMG09950 ChrM 333226-337725 - | ATMG09950

  44. ATMG09960 ChrM 343266-347765 + | ATMG09960

  45. ATMG09980 ChrM 359240-363739 - | ATMG09980

  46. -------

  47. seqinfo: 7 sequences from an unspecified genome

然后我们只取前2个基因出来看看效果

  1. # 这边的GRange是带名字的GRange,所以我们取区间的时候可以用基因名字

  2. > promoter_gtf_part promoter_gtf[c("AT1G01010","AT1G01020")]

  3. > promoter_gtf_part

  4. GRanges object with 2 ranges and 1 metadata column:

  5. seqnames ranges strand | gene_id

  6. <Rle> <IRanges> <Rle> |

  7. AT1G01010 Chr1 1-4130 + | AT1G01010

  8. AT1G01020 Chr1 8631-13130 - | AT1G01020

  9. -------

  10. seqinfo: 7 sequences from an unspecified genome


  11. # 用的是Biostrings包的getSeq

  12. # 注意你这里的Grange的seqnames要和genome的seqnames要一致

  13. # 有时候Grange是1,而Genome是Chr1

  14. > getSeq(BSgenome.Athaliana.TAIR.TAIR9,promoter_gtf_part)

  15. A DNAStringSet instance of length 2

  16. width seq names

  17. [1] 4130 CCCTAAACCCTAAACCCTAAACCCTAAACCTCTGAATCCTTAATCCC...GGTTTCTGGTAAATGGAAGCTTACCGGAGAATCTGTTGAGGTCAAGG AT1G01010

  18. [2] 4500 ATCCGCAACAATTCACCAATTGAAGAACAAGAGAAAGGTTTAAACTT...GAGAGAGAGCAATGGCGGCGAGTGAACACAGATGCGTGGGATGTGGT AT1G01020


  19. promoter_gtf_part_seq getSeq(BSgenome.Athaliana.TAIR.TAIR9,promoter_gtf_part)


  20. # 然后用writeXStringSet写成fa

  21. writeXStringSet(promoter_gtf_part_seq,

  22. filepath = "test.fasta",

  23. format = "fasta")

  1. >AT1G01010

  2. CCCTAAACCCTAAACCCTAAACCCTAAACCTCTGAATCCTTAATCCCTAAATCCCTAAATCTTTAAATCCTACATCCATG

  3. AATCCCTAAATACCTAATTCCCTAAACCCGAAACCGGTTTCTCTGGTTGAAAATCATTGTGTATATAATGATAATTTTAT

  4. CGTTTTTATGTAATTGCTTATTGTTGTGTGTAGATTTTTTAAAAATATCATTTGAGGTCAATACAAATCCTATTTCTTGT

  5. GGTTTTCTTTCCTTCACTTAGCTATGGATGGTTTATCTTCATTTGTTATATTGGATACAAGCTTTGCTACGATCTACATT

  6. TGGGAATGTGAGTCTCTTATTGTAACCTTAGGGTTGGTTTATCTCAAGAATCTTATTAATTGTTTGGACTGTTTATGTTT

  7. GGACATTTATTGTCATTCTTACTCCTTTGTGGAAATGTTTGTTCTATCAATTTATCTTTTGTGGGAAAATTATTTAGTTG








请到「今天看啥」查看全文