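Nowadays, top international economics journals (such as AER, QJE, JPE, and the AEJ series) generally require authors to provide the original data and code for their papers, and they make the uploaded data and code public. This both deters academic misconduct and protects authors' intellectual property. Although making data and code available to other scholars helps the research community advance, it also creates a problem for submitting authors, especially when the data they use are confidential or covered by an agreement that prohibits releasing them.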
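To deal with this, we need to process the original data, for example by constructing a synthetic dataset that satisfies all privacy constraints while preserving the important structure of the original data, so that other scholars can use it to roughly reproduce the paper's main conclusions. With this in mind, we can use multiple imputation; the steps below follow How to come public, with private data.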
2. Stata Example
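To illustrate how the method works, we use a readily available online dataset: an excerpt from the Swiss Labor Market Survey 1998, which is provided as example data with the Stata command oaxaca (Jann, 2008).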
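Here we assume that you have signed a confidentiality agreement to work with the Swiss survey data and are preparing to submit a paper, but the target journal requires you to provide the paper's data and code. Because the agreement forbids releasing the dataset, the suggestion here is to provide five artificially synthesized datasets, with which others can run your code to reproduce the paper's empirical results. The Stata steps are as follows: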
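. * Assumption: the listing does not show how the data are loaded or how seed is first created; the next two lines fill that gap
. use http://fmwww.bc.edu/RePEc/bocode/o/oaxaca.dta, clear
. gen seed = runiform(0,100)
. expand 1648 in 1, gen(tag)
(1,647 observations created)
. local vlist "lnwage educ exper tenure isco female lfp age single married divorced kids6 kids714 wt"
. foreach i of varlist `vlist' {
      replace `i' = . if tag==1
  }
. replace seed = runiform(0,100) if tag==1
(1,647 real changes made)
. replace lfp = runiform()<.87 if tag==1
(1,647 real changes made)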
The next step is to generate the synthetic datasets by multiple imputation. Here we use the mi impute chain command, and we believe the best approach is pmm, that is, predictive mean matching:
. mi set wide
. mi register impute lnwage educ exper tenure ///
      isco female age single married ///
      kids6 kids714 wt
. mi impute chain ///
      (pmm, knn(100)) educ female age single married kids6 kids714 wt ///
      (pmm if lfp==1, knn(100)) lnwage exper tenure isco ///
      = seed lfp, add(5)
note: missing-value pattern is monotone; no iteration performed

Conditional models (monotone):
       educ: pmm educ seed lfp , knn(100)
     female: pmm female educ seed lfp , knn(100)
        age: pmm age female educ seed lfp , knn(100)
     single: pmm single age female educ seed lfp , knn(100)
    married: pmm married single age female educ seed lfp , knn(100)
      kids6: pmm kids6 married single age female educ seed lfp , knn(100)
    kids714: pmm kids714 kids6 married single age female educ seed lfp , knn(100)
         wt: pmm wt kids714 kids6 married single age female educ seed lfp , knn(100)
     lnwage: pmm lnwage wt kids714 kids6 married single age female educ seed lfp if lfp==1, knn(100)
      exper: pmm exper lnwage wt kids714 kids6 married single age female educ seed lfp if lfp==1, knn(100)
     tenure: pmm tenure exper lnwage wt kids714 kids6 married single age female educ seed lfp if lfp==1, knn(100)
       isco: pmm isco tenure exper lnwage wt kids714 kids6 married single age female educ seed lfp if lfp==1, knn(100)

       educ: predictive mean matching
     female: predictive mean matching
        age: predictive mean matching
     single: predictive mean matching
    married: predictive mean matching
      kids6: predictive mean matching
    kids714: predictive mean matching
         wt: predictive mean matching
     lnwage: predictive mean matching
      exper: predictive mean matching
     tenure: predictive mean matching
       isco: predictive mean matching

----------------------------------------------------------
           |        Observations per m
           |----------------------------------------------
  Variable |   Complete   Incomplete   Imputed |     Total
-----------+-----------------------------------+----------
      educ |       1647         1647      1647 |      3294
    female |       1647         1647      1647 |      3294
       age |       1647         1647      1647 |      3294
    single |       1647         1647      1647 |      3294
   married |       1647         1647      1647 |      3294
     kids6 |       1647         1647      1647 |      3294
   kids714 |       1647         1647      1647 |      3294
        wt |       1647         1647      1647 |      3294
    lnwage |       1434         1458      1458 |      2892
     exper |       1434         1458      1458 |      2892
    tenure |       1434         1458      1458 |      2892
      isco |       1434         1458      1458 |      2892
----------------------------------------------------------
(complete + incomplete = total; imputed is the minimum across m of the number of filled-in observations.)
. forvalues i = 1/5 {
      preserve
      keep if tag==1
      keep _`i'_* lfp
      ren _`i'_* *
      save fake_oaxaca_`i', replace
      restore
  }
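To make the predictive mean matching idea behind the pmm method and its knn() option more concrete, here is a minimal sketch in Python rather than Stata, so that it can be run on its own. It is not Stata's implementation: it uses a single predictor, and it skips the step in which a full implementation would typically also draw the regression coefficients at random before predicting. The function names fit_ols and pmm_impute and the toy data are hypothetical, introduced only for illustration.

```python
import random


def fit_ols(x, y):
    """Return (intercept, slope) of a one-predictor least-squares fit."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def pmm_impute(x_obs, y_obs, x_mis, knn=5, rng=None):
    """Simplified predictive mean matching (illustrative assumption, not Stata's code).

    For each row with missing y, predict its mean from x, find the knn
    observed rows whose predicted means are closest, pick one of them at
    random, and copy that donor's observed y.
    """
    if rng is None:
        rng = random.Random()
    intercept, slope = fit_ols(x_obs, y_obs)
    pred_obs = [intercept + slope * xi for xi in x_obs]
    imputed = []
    for xi in x_mis:
        target = intercept + slope * xi
        # Observed rows ordered by distance between predicted means
        order = sorted(range(len(y_obs)), key=lambda j: abs(pred_obs[j] - target))
        donor = rng.choice(order[:knn])
        imputed.append(y_obs[donor])
    return imputed


# Toy data: x plays the role of a random predictor, y of a confidential variable
rng = random.Random(12345)
x_obs = [rng.uniform(0, 100) for _ in range(200)]
y_obs = [1.0 + 0.05 * xi + rng.gauss(0, 0.5) for xi in x_obs]
x_mis = [rng.uniform(0, 100) for _ in range(200)]

# Every synthetic value is a real observed value borrowed from a nearby donor
y_syn = pmm_impute(x_obs, y_obs, x_mis, knn=10, rng=rng)
```

Because every imputed value is copied from an observed donor, the synthetic variable stays within the range and support of the original one. A larger number of donors spreads the draws over more observed rows, which is presumably why the listing above uses knn(100).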
We now check how well the synthetic datasets work by estimating a simple linear regression, a quantile regression, and a Heckman two-step model:
frame create test
* Estimates from the original data
frame test: {
    use http://fmwww.bc.edu/RePEc/bocode/o/oaxaca.dta, clear
    qui: reg lnwage educ exper tenure female
    est sto m1
    qui: qreg lnwage educ exper tenure female, q(10)
    est sto m2
    qui: heckman lnwage educ exper tenure female age, ///
        selec(lfp = educ female age single married kids6 kids714) two
    est sto m3
}
* The same three models on each of the five synthetic datasets
forvalues i = 1/5 {
    frame test: {
        use fake_oaxaca_`i', clear
        qui: reg lnwage educ exper tenure female
        est sto m1`i'
        qui: qreg lnwage educ exper tenure female, q(10)
        est sto m2`i'
        qui: heckman lnwage educ exper tenure female age, ///
            selec(lfp = educ female age single married kids6 kids714) two
        est sto m3`i'
    }
}
. frame test: {
. use http://fmwww.bc.edu/RePEc/bocode/o/oaxaca.dta, clear
(Excerpt from the Swiss Labor Market Survey 1998)
. mean lnwage exper tenure educ female age single married kids6 kids714
. }
. forvalues i = 1/2 {
      frame test: {
          use fake_oaxaca_`i', clear
          mean lnwage exper tenure educ female age single married kids6 kids714
          corr lnwage exper tenure educ female age single married kids6 kids714, cov
      }
  }