Privacy Risks, Regulatory Dilemmas and Improvement Approaches of Synthetic Data
Wu Zongze1,2, Ren Baiyu3
1. School of Public Policy and Management, Tsinghua University, Beijing 100084, China; 2. Institute of Artificial Intelligence International Governance, Tsinghua University, Beijing 100084, China; 3. School of Juridical Science, China University of Political Science and Law, Beijing 100088, China
Abstract: With the rapid development of data-driven technologies such as artificial intelligence, the scarcity of real-world data has become an increasingly severe problem, and the steady tightening of privacy regulation in many countries has further constrained its supply. Against this background, synthetic data, which is virtual yet can be fitted to real-world data, is widely regarded as a promising solution. However, synthetic data cannot completely eliminate privacy risks, a point that has received insufficient attention in current privacy protection theory and practice. As a result, the privacy regulation of synthetic data faces challenges such as unclear regulatory positioning and ambiguous responsibility for re-identification. To fully realize the practical utility of synthetic data, its privacy regulation should be improved in three respects: clarifying its regulatory positioning, promoting the formulation of technical standards, and ensuring full-process supervision. Supporting the innovative application of synthetic data while effectively protecting personal privacy rights and interests will help to better realize the value of data elements.