▲ Authors: Justin Jee, Christopher Fong et al.
▲ Link:
https://www.nature.com/articles/s41586-024-08167-5
▲ Summary:
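Here, we combine natural language processing annotations with structured medication data, patient-reported demographics, tumour registry data and tumour genomic data from 24,950 patients at Memorial Sloan Kettering Cancer Center to build a clinicogenomic, harmonized oncologic real-world dataset (MSK-CHORD).
MSK-CHORD covers non-small-cell lung cancer (n = 7,809), breast cancer (n = 5,368), colorectal cancer (n = 5,543), prostate cancer (n = 3,211) and pancreatic cancer (n = 3,109), and makes it possible to find clinicogenomic relationships that are not apparent in smaller datasets. Using MSK-CHORD to train machine learning models that predict overall survival, we find that models including features derived from natural language processing (such as sites of disease) outperform models based on genomic data or stage alone, as tested by cross-validation and on an external, multi-institution dataset.
By annotating 705,241 radiology reports, MSK-CHORD also uncovers predictors of metastasis to specific organ sites, including a relationship between SETD2 mutation and lower metastatic potential in immunotherapy-treated lung adenocarcinoma, which was corroborated in independent datasets.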
▲ Abstract:
Here we combine natural language processing annotations with structured medication, patient-reported demographic, tumour registry and tumour genomic data from 24,950 patients at Memorial Sloan Kettering Cancer Center to generate a clinicogenomic, harmonized oncologic real-world dataset (MSK-CHORD). MSK-CHORD includes data for non-small-cell lung (n = 7,809), breast (n = 5,368), colorectal (n = 5,543), prostate (n = 3,211) and pancreatic (n = 3,109) cancers and enables discovery of clinicogenomic relationships not apparent in smaller datasets. Leveraging MSK-CHORD to train machine learning models to predict overall survival, we find that models including features derived from natural language processing, such as sites of disease, outperform those based on genomic data or stage alone as tested by cross-validation and an external, multi-institution dataset. By annotating 705,241 radiology reports, MSK-CHORD also uncovers predictors of metastasis to specific organ sites, including a relationship between SETD2 mutation and lower metastatic potential in immunotherapy-treated lung adenocarcinoma corroborated in independent datasets.