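The 17th China-R Conference, the 2024 X Intelligence Conference (X 智能大会), and the 2024 International Forum on Data Science will be held jointly at Renmin University of China on July 20-22, 2024. The conference is hosted by the Center for Applied Statistics of Renmin University of China, the School of Statistics of Renmin University of China, Capital of Statistics (统计之都), and the Artificial Intelligence Branch of the China Business Statistics Association (中国商业统计学会人工智能分会). It is organized by the editorial office of the Journal of Data Science and the Department of Data Science and Big Data Statistics of Renmin University of China, and sponsored by 宽德投资, 明汯投资, 和鲸科技, and 子博设计.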
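The dates, venues, and formats of the conference are as follows:
July 20-21, 2024 (9:00-17:30):
In person: Yifu Building and Lide Building, Renmin University of China
Online: 学说 (Xueshuo) live-streaming platform
July 22, 2024 (19:00-21:00):
Online: 学说 (Xueshuo) live-streaming platform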
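Registration for in-person attendance has closed. Registered attendees should bring a valid ID to enter the campus and attend on site.
To attend online, scan the QR code below and join the meeting room at the scheduled time: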
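Below is an introduction to two July 22 sessions of the joint China-R Conference, 2024 X Intelligence Conference, and 2024 International Forum on Data Science:
the Software Development in the Pharmaceutical Industry session and the Advancements in Statistical Testing, Estimation, and Design of Experiments session.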
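Chairs:
Yang Li (李扬, Renmin University of China) & Zhijun Wei (魏志军, Novartis)
Time:
July 22, 2024, 19:00-21:00
Platform:
学说 (Xueshuo) live-streaming platform
Session Program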
What happens when your validated ecosystem is a Graph?
Xiaoli Duan (段晓丽)
Xiaoli Duan has been a Data Scientist in Roche PD Data Sciences since she received her Ph.D. degree in Industrial Engineering in 2022, with a research focus on statistical machine learning in healthcare. She is an R developer of the NEST project (chevron family) and a Python developer of automatic tumor segmentation algorithms. She is a product owner of the R interface to Roche’s distributed ecosystem across multiple semantic platforms.
Abstract:
A validated environment is a must for using R to develop clinical trial reporting tools and to deliver reproducible analysis results (i.e., table, listing, and figure outputs) for regulatory submission. The Comprehensive R Archive Network (CRAN), which sets the highest standard for validating a new or upgraded package, assesses the package's reverse dependencies upon submission and evaluates whether the package continues to work as expected as a dependency in the current validated ecosystem. Evaluating how heavy a package's dependencies are, and the risk that inter-dependencies affect reproducibility, is a complex process, because active package up-versioning and data standards publications make our Auto-validation R Submission Portal a system/network that changes daily.
Our goal is to make a comprehensive review of all package dependencies in a validated ecosystem straightforward by representing them as a directed graph, a non-linear data structure from graph theory, and to reduce the computational complexity of the validation workflow. We will
(1) traverse/search package dependencies in linear time and visualize them within a user-defined scope,
(2) linearly order/schedule the packages pending validation in the queue and automatically trigger the next one in the validation pipelines, to minimize newly broken package behaviors caused by package upgrades, and
(3) automatically notify package owners/maintainers when another package's request upgrades one of their dependencies to a certain version, which causes their package to be re-submitted for validation (this also serves as a heads-up about potential test failures due to package up-versioning).
Our demos will cover three CRAN-released clinical trial analysis tools: tern (Roche), tidytlg (J&J), and forestly (Merck). Note that our proposed framework can be generalized to any complex dataflow system, regardless of the tasks performed, programming languages, package managers, etc. The dynamic QC process for results (and data dependencies) can also be supported if we provide an end-to-end R solution for clinical reporting in a centralized platform.
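The three steps in this abstract map onto standard graph operations, which can be sketched as follows. This is a minimal illustration and not the speaker's implementation: the package names, the `DEPENDS_ON` table, and the helper functions `reverse_graph`, `affected_by`, and `validation_order` are all hypothetical. A breadth-first search over reverse dependencies finds the packages within a chosen scope that an upgrade may affect (and whose owners would be notified), and a topological sort gives a validation order in which every package comes after its dependencies.

```python
from collections import deque
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical toy ecosystem: package -> packages it depends on.
DEPENDS_ON = {
    "reportA": {"tablesB", "utilsC"},
    "tablesB": {"utilsC", "formatD"},
    "utilsC": set(),
    "formatD": set(),
    "plotsE": {"utilsC"},
}


def reverse_graph(depends_on):
    """Build package -> packages that depend on it (reverse dependencies)."""
    rev = {pkg: set() for pkg in depends_on}
    for pkg, deps in depends_on.items():
        for dep in deps:
            rev.setdefault(dep, set()).add(pkg)
    return rev


def affected_by(upgraded, depends_on, max_depth=None):
    """Breadth-first search over reverse dependencies, O(V + E).

    Returns {package: distance from the upgraded package} for every
    package within max_depth (the user-defined scope); None = no limit.
    """
    rev = reverse_graph(depends_on)
    seen = {upgraded: 0}
    queue = deque([upgraded])
    while queue:
        pkg = queue.popleft()
        if max_depth is not None and seen[pkg] >= max_depth:
            continue  # do not expand beyond the requested scope
        for parent in sorted(rev.get(pkg, ())):
            if parent not in seen:
                seen[parent] = seen[pkg] + 1
                queue.append(parent)
    del seen[upgraded]  # the upgraded package itself is not "affected"
    return seen


def validation_order(pending, depends_on):
    """Order pending packages so each one follows its pending dependencies."""
    pending = set(pending)
    sub = {pkg: depends_on.get(pkg, set()) & pending for pkg in pending}
    return list(TopologicalSorter(sub).static_order())


if __name__ == "__main__":
    print(affected_by("formatD", DEPENDS_ON))
    print(validation_order(["reportA", "tablesB", "utilsC"], DEPENDS_ON))
```

Here the keys of the dictionary returned by `affected_by` are the packages whose maintainers would receive a notification in this toy setup. `TopologicalSorter` raises `CycleError` on a circular dependency, which is a useful check in its own right.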
Integrating LLM Coding Capabilities in End-to-End Data Science: Challenges and Reflections
Ding Cheng (程鼎)
Time:
July 22, online
Bio:
Ding Cheng is currently working at AbbVie - Allergan Aesthetics, where he is responsible for commercial and business-related data analysis and modeling. With extensive experience in clinical research development, IT and innovation, and business intelligence, Ding is passionate about integrating advanced digital technologies with medical practices to drive improvements in the healthcare industry.
Abstract:
This presentation will explore the integration of large language model (LLM) coding capabilities within the end-to-end data science workflow. Using a case study of constructing a Chat Dashboard, we will delve into the challenges, insights, and reflections encountered throughout the process. The focus will be on development within the R programming environment, highlighting the application of statistical models to enhance data analysis and decision-making. The presentation will cover technical implementation details and share experiences in project management and interdisciplinary collaboration, providing practical guidance for professionals looking to leverage LLM advantages in the data science field.
Patient Narrative Generation in R
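Xinyi Cao (曹心怡)
Time:
July 22, online
Bio:
Statistical programmer at 先声再明医药有限公司 (Simcere Zaiming).
She graduated from the University of British Columbia, where she majored in statistics and economics.
Abstract: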
The Patient Narrative, or Adverse Event narrative, is critical in clinical trials for providing detailed safety data. Its distinctive features include patient-generated content and presentation in chronological order. However, its creation involves tedious tasks such as data retrieval and event timeline linking. Using R to automate the generation of patient narrative reports saves considerable resources in data collection and repetitive writing, and offers a notable improvement in accuracy compared with manual methods. This presentation will focus on how to generate narratives using R and on the usage of currently popular R packages. Moreover, it will explore the potential for further automating narrative generation in R.
Two Swords as One: Building Data Applications with R and Python Together
Jie Wang (王杰), Xiaochang Liu (刘晓畅)
Time:
July 22, online
Bio:
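Jie Wang is a data engineer on the technical solutions team of the clinical and statistical programming department at Johnson & Johnson Innovative Medicine, China R&D. A skilled statistical programmer, he focuses on identifying opportunities, driving optimization and innovation, and applying both traditional and cutting-edge methods to clinical data analysis. He has 12 years of experience in biopharmaceutical data analysis. Before joining Johnson & Johnson, he spent more than five years at Pfizer R&D working on clinical data analysis.
Xiaochang Liu currently works as a Data Engineer in the Clinical and Statistical Programming department at Johnson & Johnson. He is skilled in R, Python, and other programming languages and tools, and has extensive experience in processing large-scale clinical data, data mining, machine learning, data visualization, and generative AI applications. His goal is to use data-driven solutions to support clinical research and decision-making. He received a bachelor's degree in pharmacy from Shandong University in 2018 and a master's degree in Drug Discovery and Translational Biology from the University of Edinburgh in 2019.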
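Abstract:
R and Python are both indispensable tools for building data science applications, and each has its own strengths: R is known for its powerful statistical analysis and data visualization, while Python has earned its place in data processing and machine learning through its readability and broad library support. In a complete data science project, R and Python are not mutually exclusive. By combining their respective strengths, we can achieve synergy during development. Starting from the requirements and choosing tools flexibly greatly speeds up the development of data applications and gives them a degree of robustness. We will describe this collaborative practice in detail, along with how to make the most of both R and Python, to offer data scientists a fresh perspective on building data applications.
For most data cleaning, visualization, and front-end interaction needs, we chose R. Specifically, we used R Shiny as the front-end interaction framework and the R package golem as the framework for developing the Shiny application. For certain specific features, we drew on Python's strengths and used FastAPI to build API endpoints that serve functionality to the front-end application. We also used the Microsoft Graph API to enrich the application's features.
In practice, tools and frameworks should be chosen according to the project's requirements and the team's technical skills. Used together sensibly, R and Python make it possible to build efficient, flexible, and feature-rich data applications, opening up more possibilities and room for innovation in data science work.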
Advancements in Statistical Testing, Estimation, and Design of Experiments Session
Chair:
Chunyan Wang (王春燕, Renmin University of China)
Time:
July 22, 2024, 19:00-21:00
Platform:
学说 (Xueshuo) live-streaming platform
Session Program
Simultaneous jump detection for multiple sequences via screening and multiple testing
Chunming Zhang (张春明)
Time:
July 22, 19:00-21:00
Bio:
Chunming Zhang is a Professor in the Department of Statistics at the University of Wisconsin-Madison. She earned her Ph.D. in Statistics from the University of North Carolina at Chapel Hill under the guidance of Jianqing Fan. She completed her B.S. in mathematical statistics at Nankai University, Tianjin, China, and an M.S. in Computational Mathematics from Academia Sinica, Beijing, China. Her research interests range from statistical learning and data mining, statistical methods with applications to imaging data, neuroinformatics, and bioinformatics, multiple testing, large-scale simultaneous inference and applications, statistical methods in financial econometrics, non- and semi-parametric estimation and inference, to functional and longitudinal data analysis. Her current research topics include new developments in the area of large-scale structure learning tasks and statistical inference procedures, with applications in neuroscience, biology, machine learning, and causal inference. She is an elected Fellow (2016) of the American Statistical Association (ASA) and an elected Fellow (2011) of the Institute of Mathematical Statistics (IMS), and was honored as an IMS Medallion Award recipient and Lecturer (2024).
Abstract:
Estimating a nonparametric discontinuous regression function is fundamental in many applied fields, but challenges arise when the number of jumps (or discontinuities) is large and unknown. We propose a new jump detection method, via the consecutive screening and multiple testing (SaMT) algorithm, for estimating the unknown jump points in a flexible non-parametric regression model while guaranteeing the desired accuracy. The initial jump candidates are obtained in the consecutive screening procedure combined with a locally-linear smoothing method. To further assess the significance of an individual jump candidate, we develop a novel test based on profile likelihood inference. The final selection of relevant jump points is conducted in a multiple testing procedure, which rules out from the jump candidates the irrelevant points that show large variations due to heteroscedastic errors. Moreover, we generalize the proposed SaMT algorithm to detect the common jump points shared across multiple aligned sequences. The proposed method is easy to implement, enjoys flexibility in the choice of bandwidth parameter and threshold quantity in screening, and is illustrated through simulations and real data examples, in comparison with existing methods.
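To give a feel for what a screening step for jump candidates looks like, here is a deliberately simplified sketch. It is not the SaMT algorithm: it uses plain one-sided local means instead of locally-linear smoothing, and a fixed threshold instead of the profile-likelihood test and multiple testing procedure described above. The function names `one_sided_gap` and `screen_jumps`, the window size `h`, and the `threshold` value are all assumptions made for illustration.

```python
def one_sided_gap(y, i, h):
    """Mean of the h points from index i onward minus mean of the h points before i."""
    left = y[i - h:i]
    right = y[i:i + h]
    return sum(right) / h - sum(left) / h


def screen_jumps(y, h, threshold):
    """Return candidate jump indices i (a jump between y[i-1] and y[i]).

    A point is kept when the absolute one-sided gap reaches the threshold
    and is the largest gap within h - 1 positions on either side.
    """
    n = len(y)
    # Gaps are only defined where both one-sided windows fit in the sequence.
    gaps = {i: abs(one_sided_gap(y, i, h)) for i in range(h, n - h + 1)}
    candidates = []
    for i, g in gaps.items():
        if g < threshold:
            continue
        window = [gaps[j] for j in range(i - h + 1, i + h) if j in gaps]
        if g < max(window):
            continue  # a nearby point has a larger gap
        if candidates and i - candidates[-1] < h:
            continue  # tie with an already selected neighbor
        candidates.append(i)
    return candidates


if __name__ == "__main__":
    # Noise-free toy sequence with jumps at indices 10 and 20.
    y = [0.0] * 10 + [2.0] * 10 + [-1.0] * 10
    print(screen_jumps(y, h=3, threshold=1.0))
```

In this toy version the window size `h` plays the role of a bandwidth, and the threshold controls how many candidates survive screening. With noisy data, a fixed threshold would let spurious candidates through, which is the problem the testing stage in the abstract is meant to address.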
Common Odds Ratio Test and Interval Estimation for Stratified Bilateral and Unilateral Data
Changxing Ma (马长兴)
Time:
July 22, 19:00-21:00
Bio:
Changxing Ma, PhD, is an Associate Professor and Co-Director for Master of Public Health (MPH) Biostatistics in the Department of Biostatistics at the University at Buffalo. He graduated from Nankai University in 1997. Before joining the Department of Biostatistics at the University at Buffalo, he worked in the Department of Statistics at Nankai University from 1992 to 2002 and worked with longitudinal and birth cohort data at the University of Florida for five years, from 2000 to 2005. He has published more than 130 peer-reviewed papers in a wide range of statistical and biomedical journals. His Google Scholar h-index is 46 and his i10-index is 95.
Abstract:
In clinical research, data are commonly collected bilaterally from paired organs or bodily parts within individual subjects. However, unilateral data arise when constraints or limiting factors impede the collection of complete bilateral data. In this paper, we propose three large-sample tests and five confidence interval methods for making inferences on the common treatment effect, measured by the odds ratio, in a stratified design under integrated bilateral and unilateral data. Our simulation results show that the likelihood ratio-based and score-based tests, along with their associated confidence interval methods, demonstrate robust control of type I error and close-to-nominal coverage probabilities. We apply the proposed methods to real-world datasets of acute otitis media and myopic eyes to showcase their validity and applicability in clinical practice.
Assessing heterogeneous causal effects across clusters in partially nested designs
Xiao Liu (刘笑)
Time:
July 22, 19:00-21:00
Bio:
Xiao Liu is an assistant professor in the quantitative methods program of the Department of Educational Psychology at UT Austin. She is interested in causal inference methods, quasi-experimental methods (e.g., propensity score), causal mediation analysis, and longitudinal data analysis.
Abstract:
Partially nested designs are common in studies of psychological or behavioral interventions. In this type of design, after participants are assigned to study arms, participants in a treatment arm are subsequently assigned to clusters (e.g., teachers, therapy groups) to receive treatment, whereas participants in a control arm are unclustered (e.g., a wait-list control). As participants in the treatment arm receive treatment in clusters, it is often of interest to examine heterogeneity of treatment effects across the clusters, but this is challenging in the partially nested design. In particular, in defining a causal effect of treatment for a specific cluster (e.g., a specific therapist), it is unclear how the treatment and control outcomes should be compared, as the control arm has no clustering (e.g., no therapists). It may be tempting to compare outcomes of a specific cluster to outcomes of the entire control arm; however, this comparison may not represent a causal effect even when the treatment assignment is randomized, because the cluster assignment in the treatment arm may be nonrandomized (elaborated in this talk). In this talk, I will describe our study that extends the principal stratification framework and the principal score approach to assessing heterogeneous cluster-specific treatment effects in the partially nested design. Besides the effect definition and identification, our study obtains various estimators for the cluster-specific treatment effects, including a multiply-robust estimator that can provide more robustness to parametric model misspecification. In addition to simulation results, I will present an empirical example applying our methods to estimating the heterogeneous treatment effects across clusters in a partially nested design. I will end this talk with a discussion of the implications of our study and potential future directions.