对单细胞表达矩阵做gsea分析

生信菜鸟团 · 公众号 · 生物 · 2021-01-03 21:38

正文

gsea分析这方面教程我在《生信技能树》公众号写了不少了，不管是芯片还是测序的表达矩阵，都是一样的，把基因排序即可。那在单细胞分析里面也是如此，首先对指定的单细胞亚群可以做差异分析，然后就有了基因排序，后面gsea分析全部的代码无需修改，我这里演示一个简单的例子给大家哈！

安装seurat-data包获取测试数据

代码其实一句话即可：

 devtools::install_github('satijalab/seurat-data')

但是因为是在GitHub，所以中国大陆地区的小伙伴有时候会遇到：

> devtools::install_github('satijalab/seurat-data')
Error: Failed to install 'SeuratData' from GitHub:
  Timeout was reached: [api.github.com] Operation timed out after 10002 milliseconds with 0 out of 0 bytes received

正常情况下应该是：

* installing *source* package ‘SeuratData’ ...
** using staged installation
** R
** exec
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded from temporary location
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* DONE (SeuratData)

查看以及认识测试数据

library(SeuratData)
AvailableData()
InstallData("pbmc3k") #  (89.4 MB) 
#  trying URL 'http://seurat.nygenome.org/src/contrib/pbmc3k.SeuratData_3.1.4.tar.gz'
# 上面的 InstallData 代码只需要运行一次即可哈
data("pbmc3k") 
exp  <- pbmc3k@assays$RNA@data
dim(exp)
exp[1:4,1:4]
table(is.na(pbmc3k$seurat_annotations ))

这个测试数据pbmc3k，是已经做好了降维聚类分群和注释啦，这个数据集超级出名的，大家可以自行去了解它的背景知识哈。不幸的是元旦节假期我在中国大陆，所以这个下载简直是可怕

> InstallData("pbmc3k") #  (89.4 MB) 
trying URL 'http://seurat.nygenome.org/src/contrib/pbmc3k.SeuratData_3.1.4.tar.gz'
Content type 'application/octet-stream' length 93780025 bytes (89.4 MB)

downloaded 55 KB

Error in download.file(url, destfile, method, mode = "wb", ...) : 
  download from 'http://seurat.nygenome.org/src/contrib/pbmc3k.SeuratData_3.1.4.tar.gz' failed
In addition: Warning messages:
1: In download.file(url, destfile, method, mode = "wb", ...) :
  downloaded length 56996 != reported length 93780025
2: In download.file(url, destfile, method, mode = "wb", ...) :
  URL 'https://seurat.nygenome.org/src/contrib/pbmc3k.SeuratData_3.1.4.tar.gz': status was 'Failure when receiving data from the peer'

虽然这个数据集失败了，我还有后手！

另外一个例子

可以看看我们前面的例子：人人都能学会的单细胞聚类分群注释，你必须要理解这个例子里面的降维聚类分群和注释。前面的seurat-data包的测试数据pbmc3k就是我们这个例子里面的 sce

rm(list=ls())
options(stringsAsFactors = F)
library(Seurat)
load(file = 'last_sce.Rdata')
sce

如下所示:

> exp  <- sce@assays$RNA@data
> dim(exp)
[1] 19349 33168
> exp[1:4,1:4]
4 x 4 sparse Matrix of class "dgCMatrix"
       AAACCTGAGAGTAATC_1 AAACCTGGTAAATGTG_1 AAACGGGAGAAACGAG_1 AAACGGGAGGAGTACC_1
Sox17            1.663735                  .                  .           1.916002
Mrpl15           .                         .                  .           .       
Lypla1           .                         .                  .           .       
Tcea1            .                         .                  .           .       
> table( sce$singleR  ) 

              B cells     Endothelial cells           Fibroblasts Fibroblasts activated 
                 4034                 10578                    94                   309 
          Hepatocytes           Macrophages             Monocytes              NK cells 
                 1563                  5105                  4428                  1180 
              T cells 
                 5877

可以看到这个数据集是小鼠的哈，后面我们为了简单一点，直接采用把小鼠基因名字直接大写的方式，来转化为人类基因名字哦。

我们后面的分析也针对这个例子：人人都能学会的单细胞聚类分群注释的数据吧。

针对指定的群做差异分析

我看了看，绝大部分都是免疫细胞，都被研究的太多了，比如我们看T细胞和B细胞的差异，应该是没有太大意思了。所以我就看看 Fibroblasts 和 Fibroblasts activated 这两个单细胞亚群吧，反正细胞数量也少，94个细胞对比309给，运算起来也很快哈。

指定亚群做单细胞分析，代码如下；

Idents(sce)='singleR'
deg=FindMarkers(object = sce, ident.1 = 'Fibroblasts',ident.2 = 'Fibroblasts activated',
                min.pct = 0.01,   logfc.threshold = 0.01,
                thresh.use = 0.99)
head(deg)

批量做gsea分析

在msigdb数据库网页可以下载全部的基因集，我这里方便起见，仅仅是下载 h.all.v7.2.symbols.gmt文件：

### 对 MsigDB中的全部基因集 做GSEA分析。