Column: 生信菜鸟团

Drawing a word cloud from the titles of the TCGA project's CNS-level papers

生信菜鸟团  · WeChat official account  · Biology  · 2020-12-29 07:12


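The official list of TCGA project publications is at: https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga/publications

The English titles are easy to extract and tidy up. Here they all are, one per line: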

Comprehensive genomic characterization defines human glioblastoma genes and core pathways
Integrated genomic analyses of ovarian carcinoma
Comprehensive molecular characterization of human colon and rectal cancer
Comprehensive molecular portraits of human breast tumours
Comprehensive genomic characterization of squamous cell lung cancers
Integrated genomic characterization of endometrial carcinoma
Genomic and epigenomic landscapes of adult de novo acute myeloid leukemia
Comprehensive molecular characterization of clear cell renal cell carcinoma
The Cancer Genome Atlas Pan-Cancer analysis project
The somatic genomic landscape of glioblastoma
Comprehensive molecular characterization of urothelial bladder carcinoma
Comprehensive molecular profiling of lung adenocarcinoma
Multiplatform analysis of 12 cancer types reveals molecular classification within and across tissues of origin
The Somatic Genomic Landscape of Chromophobe Renal Cell Carcinoma
Comprehensive molecular characterization of gastric adenocarcinoma
Integrated genomic characterization of papillary thyroid carcinoma
Comprehensive genomic characterization of head and neck squamous cell carcinomas
Genomic Classification of Cutaneous Melanoma
Comprehensive, Integrative Genomic Analysis of Diffuse Lower-Grade Gliomas
Comprehensive Molecular Portraits of Invasive Lobular Breast Cancer
The Molecular Taxonomy of Primary Prostate Cancer
Comprehensive Molecular Characterization of Papillary Renal-Cell Carcinoma
Comprehensive Pan-Genomic Characterization of Adrenocortical Carcinoma
Distinct patterns of somatic genome alterations in lung adenocarcinomas and squamous cell carcinomas
Integrated genomic characterization of oesophageal carcinoma
Comprehensive Molecular Characterization of Pheochromocytoma and Paraganglioma
Integrated Molecular Characterization of Uterine Carcinosarcoma
Integrative Genomic Analysis of Cholangiocarcinoma Identifies Distinct IDH-Mutant Molecular Profiles
Integrated genomic and molecular characterization of cervical cancer
Comprehensive and Integrative Genomic Characterization of Hepatocellular Carcinoma
Integrative Analysis Identifies Four Molecular and Clinical Subsets in Uveal Melanoma
Integrated Genomic Characterization of Pancreatic Ductal Adenocarcinoma
Comprehensive Molecular Characterization of Muscle-Invasive Bladder Cancer
Comprehensive and Integrated Genomic Characterization of Adult Soft Tissue Sarcomas
The Integrated Genomic Landscape of Thymic Epithelial Tumors
Pan-cancer Alterations of the MYC Oncogene and Its Proximal Network across the Cancer Genome Atlas
Scalable Open Science Approach for Mutation Calling of Tumor Exomes Using Multiple Genomic Pipelines
Molecular Characterization and Clinical Relevance of Metabolic Expression Subtypes in Human Cancers
Systematic Analysis of Splice-Site-Creating Mutations in Cancer
Somatic Mutational Landscape of Splicing Factor Genes and Their Functional Consequences across 33 Cancer Types
The Cancer Genome Atlas Comprehensive Molecular Characterization of Renal Cell Carcinoma
Pan-Cancer Analysis of lncRNA Regulation Supports Their Targeting of Cancer Genes in Each Tumor Context
Spatial Organization and Molecular Correlation of Tumor-Infiltrating Lymphocytes Using Deep Learning on Pathology Images
Machine Learning Detects Pan-cancer Ras Pathway Activation in The Cancer Genome Atlas
Genomic and Molecular Landscape of DNA Damage Repair Deficiency across The Cancer Genome Atlas
Driver Fusions and Their Implications in the Development and Treatment of Human Cancers
Genomic, Pathway Network, and Immunologic Features Distinguishing Squamous Carcinomas
Integrated Genomic Analysis of the Ubiquitin Pathway across Cancer Types
SnapShot: TCGA-Analyzed Tumors
The Cancer Genome Atlas: Creating Lasting Value beyond Its Data
Machine Learning Identifies Stemness Features Associated with Oncogenic Dedifferentiation
Oncogenic Signaling Pathways in The Cancer Genome Atlas
Perspective on Oncogenic Processes at the End of the Beginning of Cancer Genomics
Comprehensive Characterization of Cancer Driver Genes and Mutations
An Integrated TCGA Pan-Cancer Clinical Data Resource to Drive High-Quality Survival Outcome Analytics
Pathogenic Germline Variants in 10,389 Adult Cancers
A Pan-Cancer Analysis of Enhancer Expression in Nearly 9000 Patient Samples
Genomic and Functional Approaches to Understanding Cancer Aneuploidy
A Comprehensive Pan-Cancer Molecular Study of Gynecologic and Breast Cancers
Comparative Molecular Analysis of Gastrointestinal Adenocarcinomas
lncRNA Epigenetic Landscape Analysis Identifies EPIC1 as an Oncogenic lncRNA that Interacts with MYC and Promotes Cell-Cycle Progression in Cancer
The Immune Landscape of Cancer
Integrated Molecular Characterization of Testicular Germ Cell Tumors
Comprehensive Analysis of Alternative Splicing Across Tumors from 8,705 Patients
A Pan-Cancer Analysis Reveals High-Frequency Genetic Alterations in Mediators of Signaling by the TGF-β Superfamily
Integrative Molecular Characterization of Malignant Pleural Mesothelioma
The chromatin accessibility landscape of primary human cancers
Comprehensive Molecular Characterization of the Hippo Signaling Pathway in Cancer
Before and After: Comparison of Legacy and Harmonized TCGA Genomic Data Commons’ Data
Comprehensive Analysis of Genetic Ancestry and Its Molecular Correlates in Cancer

A quick Bing search for the keywords "word cloud in r" turns up a solution. The first result is the tutorial below, which breaks the code into 5 steps: http://www.sthda.com/english/wiki/text-mining-and-word-cloud-fundamentals-in-r-5-simple-steps-you-should-know

  • Step 1: Create a text file
  • Step 2: Install and load the required packages
  • Step 3: Text mining
  • Step 4: Build a term-document matrix
  • Step 5: Generate the word cloud

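Generally, anyone with basic R knowledge will follow this easily. If you do not know R yet, the suggested reading is:

Work through the R knowledge roadmap, which is as follows: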

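  • Understand the concepts of constants and variables
  • Arithmetic such as addition, subtraction, multiplication, and division (R as a calculator)
  • The various data types (numeric, character, logical, factor)
  • The various data structures (vector, matrix, array, data frame, list)
  • Reading and writing files
  • Simple statistical visualization
  • An open-ended supply of functions to keep learning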
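For readers new to R, here is a minimal base-R sketch that touches each item on the roadmap above. All variable names and values are made up for illustration; none of them come from the original post.

```r
# Constants and variables
journal <- "Cell"            # a character value bound to a variable
# Arithmetic (R as a calculator)
mean_len <- (10 + 12 + 8) / 3
# Data types: numeric, character, logical, factor
x_num <- 3.14
x_chr <- "TCGA"
x_lgl <- TRUE
x_fct <- factor(c("Cell", "Nature", "Cell"))
# Data structures: vector, matrix, array, data frame, list
v   <- c(25, 24, 23)
m   <- matrix(1:6, nrow = 2)
a   <- array(1:24, dim = c(2, 3, 4))
df  <- data.frame(word = c("molecular", "genomic"), freq = c(25, 24))
lst <- list(name = x_chr, counts = v)
# Reading and writing files (a temporary file, so nothing is left behind)
tmp <- tempfile(fileext = ".csv")
write.csv(df, tmp, row.names = FALSE)
df2 <- read.csv(tmp)
# Simple statistics; uncomment the second line for a quick bar chart
avg_freq <- mean(df2$freq)
# barplot(df2$freq, names.arg = df2$word)
```

The data frame with a `word` column and a `freq` column is the same shape of input that the word-cloud code later in this post builds.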

The core of the code is the wordcloud function, but the input data it requires has to be prepared carefully.

# Installing the R packages should need no further explanation
library("tm")
library("SnowballC")
library("wordcloud")
library("RColorBrewer")
# Here we read the data straight from this computer's clipboard
# Before running the next line, make sure you have copied the article titles prepared above!
text <- readLines(pipe("pbpaste"))
# pbpaste is a macOS command; this step seems to differ slightly on Windows, so adapt it to your own system
# Load the data as a corpus
docs <- Corpus(VectorSource(text))
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")
# Convert the text to lower case
docs <- tm_map(docs, content_transformer(tolower))
# Remove numbers
docs <- tm_map(docs, removeNumbers)
# Remove English common stopwords
docs <- tm_map(docs, removeWords, stopwords("english"))
# Remove your own stop words
# specify your stopwords as a character vector
docs <- tm_map(docs, removeWords, c("blabla1", "blabla2"))
# Remove punctuation
docs <- tm_map(docs, removePunctuation)
# Eliminate extra white spaces
docs <- tm_map(docs, stripWhitespace)
# Text stemming
# docs <- tm_map(docs, stemDocument)
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
head(d, 10)
set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 1, max.words = 200, random.order = FALSE, rot.per = 0.35, colors = brewer.pal(8, "Dark2"))
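The clipboard line in the code above relies on `pbpaste`, which exists only on macOS. The following is a hedged sketch of one way to make that step portable. The helper names `clipboard_command` and `read_clipboard_lines` are hypothetical, and the Linux branch assumes the `xclip` utility is installed; on Windows it falls back to base R's special `"clipboard"` connection.

```r
# Hypothetical helper: choose how to read the clipboard for a given OS name.
# Returns a shell command for pipe(), or NA on Windows, where base R can
# open the clipboard directly as a connection.
clipboard_command <- function(sysname = Sys.info()[["sysname"]]) {
  switch(sysname,
    Darwin  = "pbpaste",                         # macOS
    Linux   = "xclip -o -selection clipboard",   # assumes xclip is installed
    Windows = NA_character_,
    stop("unsupported operating system: ", sysname)
  )
}

# Hypothetical helper: return the clipboard contents, one element per line.
read_clipboard_lines <- function() {
  cmd <- clipboard_command()
  if (is.na(cmd)) {
    return(readLines("clipboard", warn = FALSE))  # Windows
  }
  con <- pipe(cmd)
  on.exit(close(con))
  readLines(con, warn = FALSE)
}
```

With these defined, `text <- read_clipboard_lines()` could stand in for the `pbpaste` line. Saving the titles to a text file and calling `readLines()` on that file avoids the clipboard entirely.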

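Word placement in the cloud is random, so the layout can differ from run to run unless the random seed is fixed, which is what the set.seed(1234) call in the code is for. The result looks like this:

[Image: word cloud of the TCGA paper titles]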
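As a small illustration of why fixing the seed matters, the base-R sketch below shows that resetting the seed reproduces the same random draws. It assumes, as the tutorial's use of `set.seed` suggests, that wordcloud takes its placement randomness from R's random number generator; the variable names are illustrative only.

```r
set.seed(1234)
first <- runif(3)    # three random numbers after seeding

set.seed(1234)
second <- runif(3)   # same seed again, so the same three numbers

third <- runif(3)    # no reseed: the stream continues, so these differ

identical(first, second)
identical(first, third)
```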

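All the word cloud really does is visualize the word frequencies:

> head(d, 10)
               word freq
1  characterization   25
2         molecular   25
3           genomic   24
4            cancer   23
5     comprehensive   22
6          analysis   13
7        integrated   12
8         carcinoma   11
9              cell    8
10           genome    8

Words that occur more often are drawn larger in the word cloud, and that is all there is to it.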
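To make the frequency-counting idea concrete without the tm package, here is a minimal base-R sketch. It uses three of the titles listed earlier and a tiny hand-written stop-word list that is purely illustrative; it is not the method used in the post, only a plain restatement of the same counting step.

```r
# Three titles from the TCGA list, used as a toy input
titles <- c(
  "Integrated genomic analyses of ovarian carcinoma",
  "Integrated genomic characterization of endometrial carcinoma",
  "The somatic genomic landscape of glioblastoma"
)

# Lower-case, then split on anything that is not a letter
words <- unlist(strsplit(tolower(titles), "[^a-z]+"))
words <- words[nchar(words) > 0]

# Illustrative stop-word list (tm's stopwords("english") is far longer)
stop_words <- c("of", "the", "and", "in")
words <- words[!words %in% stop_words]

# Count occurrences and sort from most to least frequent
freq <- sort(table(words), decreasing = TRUE)
d_small <- data.frame(word = names(freq), freq = as.integer(freq),
                      stringsAsFactors = FALSE)
head(d_small, 3)
```

A data frame shaped like `d_small` (a `word` column plus a `freq` column) is what gets passed to `wordcloud()` as `words` and `freq`.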

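Apprentice homework

Study the code above and draw the same kind of word cloud for the titles of two batches of CNS-level TCGA data-mining papers, from 2018 and 2020.

The first batch: the 2018 TCGA pan-cancer project papers, all published in Cell and its sister journals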

https://www.cell.com/pb-assets/consortium/pancanceratlas/pancani3/index.html

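The second batch: https://www.nature.com/collections/afdejfafdb/

The 22 pan-cancer whole-genome analysis papers published in 2020 in Nature and its sister journals (Pan-Cancer Analysis of Whole Genomes)

Three years ago I compiled and recorded a video tutorial, the TCGA tumor database knowledge map, and a year and a half ago I released it for free on the 生信技能树 Bilibili channel. By now it has just about reached 20,000 views.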

  • Video: https://www.bilibili.com/video/av49363776

  • Code: https://github.com/jmzeng1314/tcga_example

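The view counts are as follows:

The video table of contents is:

  • P1-TCGA-101-Course introduction-What background knowledge is needed

  • P2-TCGA-102-Course guide-How to use my GitHub code

  • P3-TCGA-103--The TCGA database is genuinely useful-Not just for padding out papers

  • P4-TCGA-201-Background and a roundup of web tools

  • P5-TCGA-202-Introduction to other databases

  • P6-TCGA-203-Using the Xena web tool

  • P7-TCGA-204-Using the firehose web tool

  • P8-TCGA-205-Patterns in the published papers

  • P9-TCGA-301-Introduction to data download methods

  • P10-TCGA-302-Downloading data from GDC in practice

  • P11-TCGA-303-Organizing GDC data

  • P12-TCGA-304-Downloading data from GDC, continued

  • P13-TCGA-305-Downloading and extracting data with the R-TCGA package

  • P14-TCGA-306-Using GDC and firehose to download methylation data for TCGA stomach cancer

  • P15-TCGA-307-Using GDC and Xena to download RNA-Seq expression matrices and compare them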

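《小洁》, an excellent R instructor on our 生信技能树 team, has also worked through my full video series and, building on her own understanding, has contributed a set of notes for everyone:

小洁's notes

Counting them up, I have written 17 TCGA-related notes. They are now fully organized here, and so an annual showcase post is born. To restate: this series is my record of learning TCGA by following the 生信技能树 Bilibili course, and it is published with authorization. Course link: https://www.bilibili.com/video/av49363776

1. Data download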






