百万细胞舍我其谁（一晚上解决战斗）

生信菜鸟团 · 公众号 · 生物 · 2024-12-29 15:33

正文

前些日子，我们提到了：《扎克伯格背刺基于R语言的Seurat单细胞生态》，很多单细胞转录组数据集公开的时候仅仅是提供了Python编程语言的.h5ad 而不是r编程语言的 .rds 文件，但是也给出来了：《数据分析思维之分而治之》的解决方案，因为：《在R编程环节有所限制未必不是好事》，正是因为“缚手缚脚”所以我们才会技艺高超！

而且，在为了，百万级别的单细胞转录组数据集会越来越多，首先是现在的10x单细胞转录组一个样品普遍只需要不到1万的人民币，而且国产是半价，只需要区区100万人民币就可以拿到100万甚至更多的细胞数量。这个经费对于张泽民院士来说，那就是毛毛雨了，他老人家的绝大部分cns文章都是标配百万级别单细胞数量。而且现在越来越多的多个单细胞数据整合文章，反正是公共数据集，多整合一些就很容易达到百万的数据量！

随之而来的问题就是计算资源，和耗时的问题了。比如张泽民老师的2024的文章：《Spatiotemporal single-cell analysis decodes cellular dynamics underlying different responses to immunotherapy in colorectal cancer》，

对应的数据集是：GSE236581 ，在geo界面可以下载：

4.5M 12 25 22:40 barcodes.tsv.gz
244K 12 25 22:40 features.tsv.gz
3.9G 12 25 22:51 matrix.mtx.gz

我的Mac电脑配置是64g内存以及4t的样品，基本上60万左右的细胞数量的单细胞转录组项目的降维聚类分群可以hold住，就是会很慢很慢。超过60万，比如细胞数量到达100万，在我的mac里面的使用r来处理就会报错了。毫无疑问，这个时候就需要借助于服务器了，很容易达到256线程、2Tb运行内存的配置，硬盘更加夸张了，几百个t的硬盘都很容易。比如我们读取上面的文件：

Sys.time()
ct=Read10X('inputs/',gene.column = 1)
dim(ct)
Sys.time()
phe=data.table::fread('GSE236581_CRC-ICB_metadata.txt.gz',data.table = F) 
head(phe)
rownames(phe)=phe[,1]
phe=phe[,-1]
table(phe$Ident)
length(table(phe$Ident))
pdf('design.pdf')
gplots::balloonplot(
  table(phe$MajorCellType,phe$Tissue)
)
dev.off()

作者给出来了很清晰的第一层次降维聚类分群后的各个单细胞亚群的生物学名字，并且接下来细致的分群。总计数 975,275 high-quality single cells，第一层次是6个大的亚群：six major cell types: T cells, B cells, innate lymphoid cells (ILCs), myeloid cells, stromal cells, and epithelial，然后可以细分成为 91 fine-grained cell subsets 都是作者耗费了大量的时间整理和取舍的。

细胞数量到达100万

如果说大家拿到了自己的服务器，只需要简单的安装必备的R包，然后就可以一劳永逸的使用下面的代码对任意单细胞转录组数据集进行处理了。

rm(list=ls())
options(stringsAsFactors = F) 
source('scRNA_scripts/lib.R') 
ct=Read10X('inputs/',gene.column = 1) 
phe=data.table::fread('GSE236581_CRC-ICB_metadata.txt.gz',data.table = F)  
rownames(phe)=phe[,1] 
sce.all=CreateSeuratObject(counts = ct,meta.data = phe) 
sce.all$orig.ident=sce.all$Ident 
sp='human' 
dir.create("./1-QC")
setwd("./1-QC")
# 如果过滤的太狠，就需要去修改这个过滤代码
source('../scRNA_scripts/qc.R')
sce.all.filt = basic_qc(sce.all)
print(dim(sce.all))
print(dim(sce.all.filt))
setwd('../')
getwd() 

Sys.time()
###### step3: harmony整合多个单细胞样品 ######
 
if(T){
  dir.create("2-harmony")
  getwd()
  setwd("2-harmony")
  source('../scRNA_scripts/harmony.R')
  # 默认 ScaleData 没有添加"nCount_RNA", "nFeature_RNA"
  # 默认的
  sce.all.int = run_harmony(sce.all.filt)
  setwd('../')
  
}
   
Sys.time()
###### step4:  看标记基因库 ######
# 原则上分辨率是需要自己肉眼判断，取决于个人经验
# 为了省力，我们直接看  0.1和0.8即可

table(Idents(sce.all.int))
table(sce.all.int$seurat_clusters)
table(sce.all.int$RNA_snn_res.0.1) 
table(sce.all.int$RNA_snn_res.0.8) 

getwd()
dir.create('check-by-0.1')
setwd('check-by-0.1')
sel.clust = "RNA_snn_res.0.1"
sce.all.int table([email protected]) 

source('../scRNA_scripts/check-all-markers.R')
setwd('../') 
getwd()

dir.create('check-by-0.5')
setwd('check-by-0.5')
sel.clust = "RNA_snn_res.0.5"
sce.all.int table([email protected]) 
source('../scRNA_scripts/check-all-markers.R')
setwd('../') 
getwd()

dir.create('check-by-0.8')
setwd('check-by-0.8')
sel.clust = "RNA_snn_res.0.8"
sce.all.int table([email protected]) 
source('../scRNA_scripts/check-all-markers.R')
setwd('../') 
getwd()

last_markers_to_check


Sys.time()
###### step5: 确定单细胞亚群生物学名字 ######
# 一般来说，为了节省工作量，我们选择0.1的分辨率进行命名
# 因为命名这个步骤是纯人工 操作
# 除非0.1确实分群太粗狂了，我们就选择0.8 
  
 
sce.all.int$celltype=sce.all.int$MajorCellType
table(sce.all.int$celltype)  
Sys.time()
if("MajorCellType" %in% colnames([email protected] ) ){
  
  sel.clust = "MajorCellType"
  sce.all.int   table([email protected]) 
  
  dir.create('check-by-MajorCellType')
  setwd('check-by-MajorCellType')
  source('../scRNA_scripts/check-all-markers.R')
  setwd('../') 
  getwd()
  [email protected]
  save(phe,file = 'phe.Rdata')
  pdf('MajorCellType-vs-orig.ident.pdf',width = 10)
  gplots::balloonplot(table(sce.all.int$MajorCellType,sce.all.int$orig.ident))
  dev.off()
}
Sys.time() 
id = unique(sce.all.int$MajorCellType)
lapply(id, function(x){
  sce=sce.all.int[,sce.all.int$MajorCellType %in% x]
  sce.all=CreateSeuratObject(
    counts = sce@assays$RNA$counts,
    meta.data = [email protected]
  )
  save(sce.all,file = paste0(x,'.sce.all.Rdata'))
  Sys.time()
})

上面的r代码是需要自己保证准确无误的，然后写入到文件：step1-load-by-Seurat-v5.R，接下来就可以nohup的在后台运行它啦~

在我们的服务器的Linux的诊断运行：nohup Rscript step1-load-by-Seurat-v5.R &

[1] "2024-12-26 01:21:51 CST"
[1] "2024-12-26 01:35:48 CST"
[1] "2024-12-26 01:37:28 CST"
[1] "2024-12-26 01:41:08 CST" ## harmony以及降维聚类分群耗时
[1] "2024-12-26 06:11:03 CST" ## 多种分辨率看各个亚群的特异性基因耗时
[1] "2024-12-26 09:34:09 CST" ## 作者的大亚群看各个亚群的特异性基因耗时
[1] "2024-12-26 10:01:45 CST"

总体上来说，非常丝滑，代码提交到后台后等半天就ok了，全程都是代码自动化处理而已，不需要自己干啥子就可以拿到海量的结果！

一切的难点，就在于不同的单细胞公共数据集的读取，只需要成为了r编程语言里面的Seurat对象，就可以全流程自动化处理~

拿到了结果简单的肉眼看看，作者定义的b细胞里面很明显是有plasma的亚群，在第一层次可以分出来。然后作者定义的t细胞和ILC在我们的第一层次很难区分，这个也是合理的。作者的髓系免疫细胞里面很明显是有mast细胞的一小撮在外面，同理stromal细胞里面有成纤维有内皮细胞等等，可以继续细分，这些知识点可以看：

而且我们的第一层次降维聚类分群里面的7是一个混合体，就是我们通常说的处于细胞增殖状态的单细胞亚群，它可以是上皮细胞也可以是免疫细胞或者其它！

你距离cns文章就差一个服务器账号啦

还等什么呢，赶快扫码吧！