作为生物信息学教学队伍的财务一名,我旁观了大量代码实战技巧,也勉强是学会了一下R语言,恰好看到朋友圈单细胞比较火爆,而且群主的CNS图表复现超级容易理解,我也跟着学习了一下,目录如下:
所以我恳请群主也给了我一个作业题,就是数据集GSE129516,通常读取10x数据需要三个文件:barcodes.tsv, genes.tsv, matrix.mtx,但是这个研究者仅仅是上传了matrix文件。本来呢,我之前似乎看到过教程,将表达矩阵逆转为10x的标准输出三个文件,还是直接用readMM()读入稀疏矩阵然后转为普通矩阵然后构建seurat对象。但是感觉太麻烦了,很明显我的技术是hold不住的啊!
这个数据集GSE129516,就是拿到了如下所示的数据文件:
GEO下载的
我首先读取了一个文件,看了看,就是表达矩阵,所以直接CreateSeuratObject即可,都省去了3个文件的组合命令。
表达矩阵例子
首先批量读取每个10x样品的表达矩阵
保证当前工作目录下面有后缀是matrices.csv.gz的文件,就是前面下载的6个文件:
rm(list=ls()) options(stringsAsFactors = F )library (Seurat) fs=list.files(pattern = 'matrices.csv.gz' ) fs sceList <- lapply(fs, function (x){ a=read.csv( x ) a[1 :4 ,1 :4 ] raw.data=a[,-1 ] rownames(raw.data)=a[,1 ] library (stringr) p=str_split(x,'_' ,simplify = T )[,2 ] sce <- CreateSeuratObject(counts = raw.data,project = p ) })
每个matrices.csv.gz文件都读取后,提供CreateSeuratObject构建成为对象。如果是读取10x数据需要三个文件:barcodes.tsv, genes.tsv, matrix.mtx,那个更简单哦!
然后使用seurat最出名的多个10x合并算法
多个单细胞对象的整合,这里直接使用标准代码即可:
pro='integrated' for (i in 1 :length(sceList)) { sceList[[i]] <- NormalizeData(sceList[[i]], verbose = FALSE ) sceList[[i]] <- FindVariableFeatures(sceList[[i]], selection.method = "vst" , nfeatures = 2000 , verbose = FALSE ) } sceList sce.anchors <- FindIntegrationAnchors(object.list = sceList, dims = 1 :30 ) sce.integrated <- IntegrateData(anchorset = sce.anchors, dims = 1 :30 )
因为是6个10X样品,所以这个步骤会略微有点耗费时间哦!
接着走标准的降维聚类分群
因为是构建好的对象,所以后续分析都是常规代码:
library (ggplot2)library (cowplot)# switch to integrated assay. The variable features of this assay are automatically # set during IntegrateData DefaultAssay(sce.integrated) <- "integrated" # Run the standard workflow for visualization and clustering sce.integrated <- ScaleData(sce.integrated, verbose = FALSE ) sce.integrated <- RunPCA(sce.integrated, npcs = 30 , verbose = FALSE ) sce.integrated <- RunUMAP(sce.integrated, reduction = "pca" , dims = 1 :30 ) p1 <- DimPlot(sce.integrated, reduction = "umap" , group.by = "orig.ident" ) p2 <- DimPlot(sce.integrated, reduction = "umap" , group.by = "orig.ident" , label = TRUE , repel = TRUE ) + NoLegend() plot_grid(p1, p2) p2 ggsave(filename = paste0(pro,'_umap.pdf' ) ) sce=sce.integrated DimHeatmap(sce, dims = 1 :12 , cells = 100 , balanced = TRUE ) ElbowPlot(sce) sce <- FindNeighbors(sce, dims = 1 :15 ) sce <- FindClusters(sce, resolution = 0.2 ) table([email protected] $integrated_snn_res.0.2) sce <- FindClusters(sce, resolution = 0.8 ) table([email protected] $integrated_snn_res.0.8) DimPlot(sce, reduction = "umap" ) ggsave(filename = paste0(pro,'_umap_seurat_res.0.8.pdf' ) ) DimPlot(sce, reduction = "umap" ,split.by = 'orig.ident' ) ggsave(filename = paste0(pro,'_umap_seurat_res.0.8_split.pdf' ) ) save(sce,file = 'integrated_after_seurat.Rdata' )
最后对聚类的不同细胞亚群进行注释
前面呢是标准的聚类分群,每个细胞亚群仅仅是一个编号,实际上还需要给予它们一定的生物学意义,我们这里采用SingleR的标准代码:
rm(list=ls()) options(stringsAsFactors = F )library (Seurat) load(file = 'integrated_after_seurat.Rdata' ) DefaultAssay(sce) <- "RNA" # for B cells : cluster, 1,21 VlnPlot(object = sce, features ='Cd19' ,log =T ) VlnPlot(object = sce, features ='Cd79a' ,log =T ) library (SingleR) sce_for_SingleR <- GetAssayData(sce, slot="data" )[email protected] $seurat_clusters mouseImmu <- ImmGenData() pred.mouseImmu <- SingleR(test = sce_for_SingleR, ref = mouseImmu, labels = mouseImmu$label.main, method = "cluster" , clusters = clusters, assay.type.test = "logcounts" , assay.type.ref = "logcounts" ) mouseRNA <- MouseRNAseqData() pred.mouseRNA <- SingleR(test = sce_for_SingleR, ref = mouseRNA, labels = mouseRNA$label.fine , method = "cluster"