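The joint meeting of the 17th China R Conference, the 2024 X Intelligence Conference, and the 2024 International Forum on Data Science will be held at Renmin University of China on July 20-22, 2024. The meeting is hosted by the Center for Applied Statistics of Renmin University of China, the School of Statistics of Renmin University of China, Capital of Statistics (统计之都), and the Artificial Intelligence Branch of the China Business Statistics Association (中国商业统计学会人工智能分会). It is organized by the editorial office of the Journal of Data Science and the Department of Data Science and Big Data Statistics of Renmin University of China, and sponsored by 宽德投资, 明汯投资, 和鲸科技, and 子博设计.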
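The schedule, venues, and formats of the meeting are as follows:
July 20-21, 2024 (9:00-17:30):
In person: Yifu Building and Lide Building, Renmin University of China
Online: 学说 live-streaming platform
July 22, 2024 (19:00-21:00):
Online: 学说 live-streaming platform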
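To attend in person, scan the QR code below to register:
Registration deadline: July 16
Notes: 1. Submission of contributed talks has closed; no further submissions are accepted.
2. Those who have already registered do not need to register again.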
To attend online, scan the QR code below and join the meeting room at the scheduled time:
Below are introductions to three sessions held on July 21 at this joint meeting: New Statistical Methods I, New Statistical Methods II, and Frontier Methods in Biostatistics.
New Statistical Methods I Session
Chair: 成慧敏 (Capital of Statistics)
Time: July 21, 2024, 13:00-15:00
Venue: Room 812, Lide Building, Renmin University of China
Session Program
Network Tight Community Detection
成慧敏
Time: July 21, 13:00-13:30
Speaker bio:
I am an Assistant Professor in the Department of Biostatistics at Boston University. I am affiliated with the Rafik B. Hariri Institute for Computing and Computational Science & Engineering and the Nanotechnology Innovation Center at Boston University. I received my Ph.D. in statistics from the University of Georgia in 2023.
Abstract:
Conventional community detection methods often categorize all nodes into clusters. However, the presumed community structure of interest may only be valid for a subset of nodes (called “tight nodes”), while the rest of the network may consist of noninformative “scattered nodes”. For example, a protein-protein network often contains proteins that do not belong to specific biological functional modules but are involved in more general processes, or act as bridges between different functional modules. Forcing each of these proteins into a single cluster introduces unwanted biases and obscures the underlying biological implication. To address this issue, we propose a tight community detection (TCD) method to identify tight communities excluding scattered nodes. The algorithm enjoys a strong theoretical guarantee of tight node identification accuracy and is scalable for large networks. The superiority of the proposed method is demonstrated by various synthetic and real experiments.
Two variable screening procedures with restrictions on the positive or negative effects
赵博娟
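Time: July 21, 13:30-14:00
Speaker bio:
The speaker received a Ph.D. in mathematical statistics from the Department of Mathematics at Nankai University, conducted postdoctoral research in the United States at Southern Methodist University and at the Harvard School of Public Health, and worked at Meharry Medical College in the United States. The speaker is now a professor and doctoral supervisor at Tianjin University of Finance and Economics.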
Abstract:
In this paper, two variable screening procedures, the local significant forward and backward procedure with restrictions on the positive or negative effects (FBRPN) and the backward procedure with restrictions on the positive or negative effects (BRPN), are proposed to obtain meaningful protective and risk factors in fast and sequential ways in models with a linear component such as the Generalized Linear Models to avoid multicollinearity. The two fitted models from the two procedures are compared to obtain the most efficient model and the representative variables of the original predictive variables. The new procedures are not prediction-driven, and are compared with traditional prediction-driven procedures including the forward, backward, stepwise and best subsets regression in three illustration examples. Simulation studies are carried out to show the effectiveness of the new procedures. Finally, practical issues are discussed, and applications of the new procedures in big data analysis are envisioned.
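Time: July 21, 13:50-14:15
Speaker bio:
I am currently pursuing a Ph.D. in statistics at the School of Statistics and Management, Shanghai University of Finance and Economics, advised by Associate Professor 王绍立. I am passionate about the theory of statistics and machine learning, particularly distributed computing. I previously received my bachelor's degree from Nanchang University.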
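Abstract:
This talk examines distributed estimation and support recovery for high-dimensional linear quantile regression. Quantile regression is widely used as an alternative to least squares regression that is robust to outliers and data heterogeneity. However, the non-smoothness of its check loss function poses major challenges for distributed computation and theoretical analysis. To overcome these difficulties, we propose a new transformation strategy that converts the quantile regression problem into a least squares optimization problem. We adopt a double-smoothing technique to extend previous Newton-type distributed methods, removing the strict assumption that the error term is independent of the covariates.
The resulting algorithm is efficient in both computation and communication. In theory, after a certain number of iterations, the proposed distributed estimator attains a near-optimal convergence rate and achieves highly accurate support recovery.
Extensive experiments on synthetic and real datasets further validate the proposed method. The results show that, for high-dimensional data, the method provides accurate estimates and effectively recovers the key support structure in the data.
Overall, this talk offers a new perspective on, and a solution for, distributed estimation and support recovery in high-dimensional quantile regression, with both theoretical and practical value.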
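To make the non-smoothness mentioned in the abstract concrete, the following minimal Python sketch evaluates the standard check loss and, for comparison, one common smooth surrogate obtained by convolving it with a Gaussian kernel. The Gaussian-kernel smoothing, the bandwidth name `h`, and both function names are assumptions chosen for illustration; this is not the double-smoothing method presented in the talk.

```python
import math


def check_loss(u, tau):
    """Check (pinball) loss: rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def smoothed_check_loss(u, tau, h):
    """Check loss convolved with a N(0, h^2) kernel (illustrative assumption).

    Uses rho_tau(u) = |u| / 2 + (tau - 1/2) * u together with the closed
    form of E|u + h * Z| for a standard normal Z.
    """
    mean_abs = (h * math.sqrt(2.0 / math.pi) * math.exp(-u * u / (2.0 * h * h))
                + u * math.erf(u / (h * math.sqrt(2.0))))
    return 0.5 * mean_abs + (tau - 0.5) * u


if __name__ == "__main__":
    tau = 0.25  # quantile level, chosen only for the demonstration
    for u in (-1.0, 0.0, 1.0):
        print(u, check_loss(u, tau), smoothed_check_loss(u, tau, h=0.5))
```

The check loss has a kink at zero, so its gradient is discontinuous there. The smoothed version is differentiable everywhere and approaches the check loss as the bandwidth `h` shrinks, which is why surrogates of this kind are convenient for gradient-based or Newton-type updates.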
New Statistical Methods II Session
Chair: 李雪曈 (Guanghua School of Management, Peking University)
Time: July 21, 2024, 15:30-17:30
Venue: Room 812, Lide Building, Renmin University of China
Session Program
Mixture Conditional Regression with Ultrahigh Dimensional Text Data for Estimating Extralegal Factor Effects
师佳鑫
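Time: July 21, 15:30-16:00
Speaker bio:
师佳鑫 is a Ph.D. student in the Department of Business Statistics and Econometrics, Guanghua School of Management, Peking University. Research interests include latent structure analysis for high-dimensional data, factor models, computational law, and complex network data analysis. A research paper has been accepted by the Annals of Applied Statistics.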
Abstract:
Testing judicial impartiality is a problem of fundamental importance in empirical legal studies, for which standard regression methods have been popularly used to estimate the extralegal factor effects. However, those methods cannot handle control variables with ultrahigh dimensionality, such as those found in judgment documents recorded in text format. To solve this problem, we develop a novel mixture conditional regression (MCR) approach, assuming that the whole sample can be classified into a number of latent classes. Within each latent class, a standard linear regression model can be used to model the relationship between the response and a key feature vector, which is assumed to be of a fixed dimension. Meanwhile, ultrahigh dimensional control variables are then used to determine the latent class membership, where a naive Bayes type model is used to describe the relationship. Hence, the dimension of control variables is allowed to be arbitrarily high. A novel expectation-maximization algorithm is developed for model estimation. Therefore, we are able to estimate the key parameters of interest as efficiently as if the true class membership were known in advance. Simulation studies are presented to demonstrate the proposed MCR method. A real dataset of Chinese burglary offenses is analyzed for illustration purposes.
A Gaussian Mixture Model for Multiple Instance Learning with Partially Subsampled Instances
余柏辰
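Time: July 21, 16:00-16:30
Speaker bio:
余柏辰 is a Ph.D. student in the Department of Business Statistics and Econometrics, Guanghua School of Management, Peking University, advised by Professor 王汉生, and holds a bachelor's degree from the School of Statistics, East China Normal University. Research interests include image data analysis and high-dimensional data analysis.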
Abstract:
Multiple instance learning is a powerful machine learning technique, which is found useful when numerous instances can be naturally grouped into different bags. Accordingly, a bag-level label can be created for each bag according to whether the instances contained in the bag are all negative or not. Thereafter, how to train a statistical model with bag-level labels, with or without partially labeled instances, becomes a problem of great interest. To this end, we develop a Gaussian mixture model (GMM) framework to describe the stochastic behavior of the instance-level feature vectors. Both the instance-based maximum likelihood estimator (IMLE) and the bag-based maximum likelihood estimator (BMLE) are theoretically investigated. We found that the statistical efficiency of the IMLE could be much better than that of the BMLE, if the instance-level labels are relatively hard to predict. To fix the problem, we develop here a subsampling-based maximum likelihood estimation (SMLE) approach, where the instance-level labels are partially provided through careful subsampling. This leads to a significantly reduced labeling cost with little sacrifice in terms of statistical efficiency. To demonstrate the finite sample performance, extensive simulation studies are presented. A real data example using whole-slide images (WSIs) to diagnose metastatic breast cancer is illustrated.
Gaussian Mixture Model with Rare Event
李雪曈
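Time: July 21, 16:30-17:00
Speaker bio:
李雪曈 is a Ph.D. student in the Department of Business Statistics and Econometrics, Guanghua School of Management, Peking University, advised by Professor 王汉生. Research interests include imbalanced data analysis, network-structured data analysis, and distributed computing. Research papers have been published in Statistica Sinica and Electronic Journal of Statistics.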
Abstract:
We study here a Gaussian Mixture Model (GMM) with rare events data. In this case, the commonly used Expectation-Maximization (EM) algorithm exhibits extremely slow numerical convergence rate. To theoretically understand this phenomenon, we formulate the numerical convergence problem of the EM algorithm with rare events data as a problem about a contraction operator. Theoretical analysis reveals that the spectral radius of the contraction operator in this case could be arbitrarily close to 1 asymptotically. This theoretical finding explains the empirical slow numerical convergence of the EM algorithm with rare events data. To overcome this challenge, a Mixed EM (MEM) algorithm is developed, which utilizes the information provided by partially labeled data. As compared with the standard EM algorithm, the key feature of the MEM algorithm is that it requires additionally labeled data. We find that the MEM algorithm significantly improves the numerical convergence rate as compared with the standard EM algorithm. The finite sample performance of the proposed method is illustrated by both simulation studies and a real-world dataset of Swedish traffic signs.
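As a rough illustration of the mechanics discussed in this abstract, the sketch below runs textbook EM on a two-component, one-dimensional Gaussian mixture in which one component is rare, and optionally accepts partial labels. It is a minimal sketch under stated assumptions, not the MEM algorithm from the talk: the function names, the initialization, the `min_sigma` floor, and the simple rule that a known label fixes that point's responsibility are all hypothetical choices made for illustration.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def em_two_gaussians(x, labels=None, n_iter=100, min_sigma=1e-3):
    """EM for a two-component 1-D Gaussian mixture (illustrative sketch).

    labels[i] may be 0, 1, or None; a known label fixes the responsibility.
    Returns (pi1, (mu0, mu1), (sigma0, sigma1)), where pi1 is the weight
    of component 1.
    """
    n = len(x)
    if labels is None:
        labels = [None] * n
    mean = sum(x) / n
    sd = max(math.sqrt(sum((v - mean) ** 2 for v in x) / n), min_sigma)
    pi1, mu, sigma = 0.5, [min(x), max(x)], [sd, sd]
    for _ in range(n_iter):
        # E-step: posterior probability that each point belongs to component 1.
        resp = []
        for v, lab in zip(x, labels):
            if lab is not None:
                resp.append(float(lab))
                continue
            p0 = (1.0 - pi1) * normal_pdf(v, mu[0], sigma[0])
            p1 = pi1 * normal_pdf(v, mu[1], sigma[1])
            total = p0 + p1
            resp.append(p1 / total if total > 0.0 else 0.5)
        # M-step: weighted updates of the mixing weight, means, and scales.
        weights = ([1.0 - r for r in resp], resp)
        pi1 = sum(resp) / n
        for k in (0, 1):
            nk = max(sum(weights[k]), 1e-12)
            mu[k] = sum(w * v for w, v in zip(weights[k], x)) / nk
            var = sum(w * (v - mu[k]) ** 2 for w, v in zip(weights[k], x)) / nk
            sigma[k] = max(math.sqrt(var), min_sigma)
    return pi1, tuple(mu), tuple(sigma)


if __name__ == "__main__":
    # 20 "common" points near 0 and 3 "rare" points near 10 (toy data).
    data = [-1.0, -0.5, 0.0, 0.5, 1.0] * 4 + [9.5, 10.0, 10.5]
    print(em_two_gaussians(data))
    print(em_two_gaussians(data, labels=[None] * 20 + [1, 1, None]))
```

In this toy example the two clusters are well separated, so plain EM converges quickly. The slow convergence described in the abstract concerns the regime where the rare component's weight is very small, which this sketch does not attempt to reproduce.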
Frontier Methods in Biostatistics Session
Chair: 周静 (Renmin University of China)
Time: July 21, 2024, 13:00-15:00
Venue: Room 807, Lide Building, Renmin University of China
Session Program
Functional Adaptive Double-Sparsity Estimator for High-Dimensional Sensor Data Analysis
李忻月
Time: July 21, 13:00-13:30
Speaker bio:
Prof. Li received her PhD in Biostatistics from Yale University. Prior to Yale University, she spent one year at Peking University and three years at the University of Chicago, receiving her B.A. and M.S. in Statistics from the University of Chicago. Prof. Li’s research focuses on statistical methods for wearable device data, medical imaging data, large population studies, and precision medicine. Her research papers were published in high-impact journals, such as The Lancet, JAMA Network Open, Advanced Science, IEEE Internet of Things Journal, NPJ Digital Medicine, and Statistica Sinica. Prof. Li has established collaboration with China, Europe and US to join international efforts in developing statistical methods for analyzing wearable sensor data in large population health studies.
Abstract:
Wearable sensors have been increasingly used in health monitoring and early anomaly detection. Wearable devices can collect objective and continuous information on physical activity and vital signs and have great potential in studying the association with health outcomes. However, how to effectively analyze high-frequency multi-dimensional sensor data is challenging. In this talk, we propose a new Functional Adaptive Double-Sparsity Estimator (FadDoS) based on functional regularization of sparse group lasso with multiple functional predictors, which can achieve global sparsity via functional variable selection and local sparsity via zero-subinterval identification within coefficient functions. We prove that the FadDoS estimator converges at a bounded rate and satisfies the oracle property under mild conditions. Extensive simulation studies confirm the theoretical properties and exhibit excellent performance compared to existing approaches. We applied FadDoS to a Kinect sensor study that used an advanced motion sensing device to track multiple human joint movements among community-dwelling elderly people, and we demonstrated how FadDoS can effectively characterize the detailed association between joint movements and physical health assessments. The proposed method is not only effective in Kinect sensor analysis but also applicable to broader fields where multi-dimensional sensor signals are collected simultaneously. The R code for FadDoS is available at https://github.com/Cheng-0621/FadDoS.
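The idea of global versus local sparsity can be illustrated with the ordinary (finite-dimensional) sparse group lasso penalty. The Python sketch below is only a scalar analogue under stated assumptions: the talk's estimator works with coefficient functions, while here each group is a plain list of numbers, and the `lam`/`alpha` parametrization and both function names are hypothetical. For the authors' implementation, see the R code linked in the abstract.

```python
import math


def sparse_group_lasso_penalty(groups, lam=1.0, alpha=0.5):
    """lam * (alpha * ||beta||_1 + (1 - alpha) * sum_g sqrt(p_g) * ||beta_g||_2)."""
    l1 = sum(abs(b) for g in groups for b in g)
    l2 = sum(math.sqrt(len(g)) * math.sqrt(sum(b * b for b in g)) for g in groups)
    return lam * (alpha * l1 + (1.0 - alpha) * l2)


def sparsity_summary(groups):
    """Return (indices of all-zero groups, zero positions within active groups)."""
    zero_groups = [i for i, g in enumerate(groups) if all(b == 0 for b in g)]
    local_zeros = {i: [j for j, b in enumerate(g) if b == 0]
                   for i, g in enumerate(groups) if i not in zero_groups}
    return zero_groups, local_zeros


if __name__ == "__main__":
    # Group 0 is dropped entirely (global sparsity); group 1 is active but
    # has a zero entry inside it (local sparsity).
    beta = [[0.0, 0.0, 0.0], [3.0, 0.0, 4.0]]
    print(sparse_group_lasso_penalty(beta))
    print(sparsity_summary(beta))
```

The group-norm term can zero out a whole predictor, while the L1 term can zero out individual entries within a retained predictor. In the functional setting described in the abstract, the analogue of a zero entry is a subinterval on which a coefficient function vanishes.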
Bayesian Integrative Region Segmentation in Spatially Resolved Transcriptomic Studies
罗翔宇
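Time: July 21, 13:30-14:00
Speaker bio:
罗翔宇 joined the Institute of Statistics and Big Data, Renmin University of China, in September 2018 and is currently a tenure-track associate professor. He received his Ph.D. from the Department of Statistics, The Chinese University of Hong Kong, in 2018. His research interests include Bayesian statistics, nonparametric Bayes, bioinformatics, and statistical computing. He is keen on developing new statistical models to solve practical biological problems. His specific research directions include building gene regulatory or co-expression networks with statistical graphical models, correcting batch effects in high-throughput data, deconvolving bulk-level gene expression or DNA methylation data, discovering individual heterogeneity at single-cell resolution, and integrative analysis of spatial transcriptomics and multi-omics data.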
Abstract:
The spatially resolved transcriptomic study is a recently developed biological experiment that can measure gene expressions and retain spatial information simultaneously, opening a new avenue to characterize fine-grained tissue structures. In this article, we propose a nonparametric Bayesian method named BINRES to carry out the region segmentation for a tissue section by integrating all the three types of data generated during the study: gene expressions, spatial coordinates, and the histology image. BINRES is able to capture more subtle regions than existing statistical partitioning models that only partially make use of the three data modes and is more interpretable than neural-network-based region segmentation approaches. Specifically, due to a nonparametric spatial prior, BINRES does not require a prespecified region number and can learn it automatically. BINRES also combines the image and the gene expressions in the Bayesian consensus clustering framework and thus flexibly adjusts their label alignment contribution weights in a data-adaptive manner. A computationally scalable extension is developed for large-scale studies. Both simulation studies and the real application to three mouse spatial transcriptomic datasets demonstrate that BINRES outperforms the competing methods and easily achieves the uncertainty quantification of the integrative partition.