前面已经给出了4个GEO芯片数据挖掘分析点,详见:
也是时候进阶到转录组测序了,它跟表达量芯片技术类似的,都是基于基因在不同样品的表达量矩阵做各式各样的分析,比如2022的文章:《Aberrant MYCN expression drives oncogenic hijacking of EZH2 as a transcriptional activator in peripheral T-cell lymphoma》,对应的数据集是:GSE198126
GSM5939531 m34-Thymus GSM5939532 m53-thymus GSM5939533 m56-thymus GSM5939534 m70-thymus GSM5939535 m72-thymus GSM5939536 m86-thymus GSM5939537 CD4-Tcells-1 GSM5939538 CD4-Tcells-2 GSM5939539 CD4-Tcells-3
研究者们发现了在 Peripheral T cell lymphoma (PTCL)疾病里面有 recurrent overexpression of MYCN的现象,所以在做了一个小鼠模型:Inducible expression of MYCN in lymphoid cells in a mouse model ,做一个简单的转录组看看差异,如下所示:
简单的转录组看看差异 (A) Heat map showing the top differentially expressed genes in MYCN-driven TCL compared with normal CD4 T cells and depicting the transcription factor binding motifs of the top transcription factors identified by i-CisTarget in the up- and downregulated genes. NES = normalized enrichment score. (B) NESs for different hallmark gene sets in MYCN-driven TCL compared with normal CD4 T cells. (C) GSEA showing enrichment of an embryonic stem cell–like signature as described by Wong et al
值得注意是,作者不仅仅是做了差异分析和生物学功能数据库注释(gsea方法,癌症的50个hallmark基因集 ),而且还搭配了转录因子调控分析。
常规的差异分析 只需要读取作者在geo界面给出来了的表达量矩阵文件即可:
data"GSE198124_all.counts.csv.gz", data.table = F )# data$V1=str_split(data$V1,'_',simplify = T)[,2] # head(data) data=data[!duplicated(data$V2),] mat3:ncol(data))] rownames(mat)=data[,2 ] mat[1 :4 ,1 :4 ] keep_feature 1 ) > 1 ;table(keep_feature) ensembl_matrix rownames(ensembl_matrix)=rownames(mat)[keep_feature] ensembl_matrix[1 :4 ,1 :4 ] symbol_matrix = ceiling(ensembl_matrix) symbol_matrix[1 :4 ,1 :4 ] colnames(symbol_matrix)library (AnnoProbe) head(rownames(symbol_matrix)) ids=annoGene(rownames(symbol_matrix),'SYMBOL' ,'mouse' ) head(ids) tail(sort(table(ids$biotypes))) ids=ids[ids$biotypes=='protein_coding' ,] symbol_matrix=symbol_matrix[rownames(symbol_matrix) %in % ids$SYMBOL,] colnames(symbol_matrix) group_list=ifelse(grepl('CD4' , colnames(symbol_matrix) ), 'control' ,'case' ) group_list group_list = factor(group_list,levels = c('control' ,'case' )) save(symbol_matrix,group_list,file = 'symbol_matrix.Rdata' )
然后使用转录组测序的金标准算法(DESeq2,edgeR,limma-voom)计算统计学显著的上下调基因即可,并且做如下所示的标准可视化:
统计学显著的上下调基因 精华当然是生物学功能数据库注释(gsea方法, 癌症的50个hallmark基因集 ),但是我一般来说会默认走一下kegg的gsea注释,代码如下所示:
rm(list = ls()) load( file = 'DEG_deseq2.Rdata' ) load( file = 'DEG_limma_voom.Rdata' ) load( file = 'DEG_edgeR.Rdata' ) colnames(DEG_deseq2) colnames(DEG_limma_voom)# FoxO signaling pathway nrDEG=DEG_deseq2[,c("log2FoldChange" , "padj" )] nrDEG=DEG_limma_voom[,c("logFC" , "adj.P.Val" )] head(nrDEG) colnames(nrDEG)=c('logFC' ,'P.Value' )library (org.Mm.eg.db)library (clusterProfiler) gene "SYMBOL", toType = "ENTREZID" , OrgDb = org.Mm.eg.db) head(gene) head(nrDEG) nrDEG = nrDEG[rownames(nrDEG) %in % gene$SYMBOL,] nrDEG$ENTREZID = gene$ENTREZID[match(rownames(nrDEG) , gene$SYMBOL)] # https://www.ncbi.nlm.nih.gov/gene/?term=SPARCL1 geneList=nrDEG$logFC names(geneList)=nrDEG$ENTREZID geneList=sort(geneList,decreasing = T ) head(geneList)library (clusterProfiler) str(geneList) kk_gse organism = 'mmu' ,#按需替换 nPerm = 1000 , minGSSize = 10 , pvalueCutoff = 0.99 , verbose = FALSE ) tmp=kk_gse@result kk=DOSE::setReadable(kk_gse, OrgDb='org.Mm.eg.db' , keyType='ENTREZID' )#按需替换 kk@result$Description=gsub(' - Mus musculus \\(house mouse\\)' , '' ,kk@result$Description ) head(kk@result$Description) tmp=kk@result pro='test' write.csv(kk@result,file = file.path(paste0(pro,'_kegg.gsea.csv' ))) save(kk,file = file.path( 'gsea_kk.Rdata' ))
可以看到,虽然说没有选择癌症的50个hallmark基因集,但是结果是差不多的,上调的仍然是增殖相关通路,以及一个代谢通路。
kegg_top_gsea 常规转录组数据分析流程里面是没有转录因子调控分析的 我们上面的转录组测序的金标准算法(DESeq2,edgeR,limma-voom)能拿到统计学显著的上下调基因,但是没有这些基因集对应的转录因子,有很多计算工具可以实现,比如BART、i-cisTarget和Enrichr和cistrome的Lisa,其中这个文章选择了i-cisTarget。
但是按照我一直以来的代码,我会使用RcisTarget转录因子分析。它可以从一组基因中识别潜在的转录因子(Transcription Factors, TFs)调控网络,并通过 motif 分析 推测转录因子的作用机制。该包主要被应用于转录调控研究,特别是基因共表达网络或差异表达基因集的上游调控因素预测。需要看一些基本的文档:
SCENIC: single-cell regulatory network inference and clustering. Nat Methods. 2017 Nov;14(11):1083-1086. bioconductor: https://bioconductor.org/packages/release/bioc/vignettes/RcisTarget/inst/doc/RcisTarget_MainTutorial.R motif2annotaion: https://rdrr.io/github/aertslab/RcisTarget/man/motifAnnotations.html github: https://github.com/aertslab/RcisTarget RcisTarget database: https://resources.aertslab.org/cistarget/ 也可以直接使用我们的代码,只需要一句话代码cisTarget完成富集分析 ,如下所示:
library (RcisTarget)# Load gene sets to analyze. e.g.: geneList1 'examples', package='RcisTarget' ), "hypoxiaGeneSet.txt" ), stringsAsFactors=FALSE )[,1 ] geneLists geneLists## 重点在于拿到了上下调基因各自的列表哦: length(gene_up);length(gene_down) head(gene_up);head(gene_down) geneLists = list( gene_up=gene_up, gene_down=gene_down )# Select motif database to use (i.e. organism and distance around TSS) data(motifAnnotations_hgnc)# 这个RcisTarget包内置的motifAnnotations_hgnc是16万行 # 可以看到每个转录因子基因有多个motif,但是不到2000个转录因子 # 取决于R包“RcisTarget”的版本,确定是否运行下面的一行代码: if (F ){ motifAnnotations_hgnc = motifAnnotations } dim(motifAnnotations_hgnc) motifAnnotations_hgnc[1 :4 ,1 :4 ] length(unique(motifAnnotations_hgnc$TF)) motifRankings "~/database/cisTarget_databases/hg19-tss-centered-10kb-7species.mc9nr.feather")#motifRankings dim(motifRankings@rankings) motifRankings@rankings[1 :4 ,1 :4 ] lapply(geneLists, length) geneLists = lapply(geneLists, function (x){ x[x %in % colnames(motifRankings@rankings) ] }) lapply(geneLists, length) # Motif enrichment analysis: # 输入数据可以是一个list,这里是geneLists,那就分别对list每一个元素,分别做TF的聚类 motifEnrichmentTable_wGenes motifAnnot=motifAnnotations_hgnc)# 这个 cisTarget 函数等价于 下面的3个步骤: # 1. Motif enrichment analysis (calcAUC) # 2. Motif-TF annotation (addMotifAnnotation) # 3. Selection of significant genes (addSignificantGenes) dim(motifEnrichmentTable_wGenes) motifEnrichmentTable_wGenes[1 :4 ,1 :4 ]# 设置阈值,相应的TF对应富集到8个基因以上才被留下 kp =motifEnrichmentTable_wGenes$nEnrGenes > 8 table( kp ) motifEnrichmentTable_wGenes=motifEnrichmentTable_wGenes[kp,] write.csv(motifEnrichmentTable_wGenes, file = paste0(d,'/' ,'motifEnrichmentTable_wGenes.csv' ))
上面的代码依赖于一个数据库文件,而且不同的物种不一样:
1.0G 11 19 2020 hg19-500bp-upstream-7species.mc9nr.feather 1.0G 11 19 2020 hg19-tss-centered-10kb-7species.mc9nr.feather 1.0G 11 19 2020 mm9-500bp-upstream-7species.mc9nr.feather 1.0G 11 19 2020 mm9-tss-centered-10kb-7species.mc9nr.feather