Methodological Framework for the Architecture of Data Anonymization for Machine Learning Tasks
DOI:
https://doi.org/10.22213/2410-9304-2026-1-4-12
Keywords:
anonymization, privacy, architecture, de-identification, security, k-anonymity
Abstract
This paper presents a methodological framework for the architecture of a tabular-data anonymization system embedded into the lifecycle of corporate machine learning projects and data preparation workflows. We propose a process- and stage-based approach to designing an anonymization pipeline that establishes a unified terminology, requirements, and constraints, and formalizes rule profiles for pseudonymization, generalization, masking, and suppression across different attribute classes: direct identifiers, quasi-identifiers, and sensitive attributes. Building on the k-anonymity, l-diversity, and t-closeness models, we introduce "privacy checkpoints" at which attainment of target metric values, suppression rates, and the level of generalization are evaluated. At each checkpoint, a privacy report is generated containing the observed k, l, and t values, warnings, and explanatory notes, enabling an informed decision on whether a dataset can be admitted into the ML pipeline. The paper also shows how to pre-validate profiles and parameters on representative anonymized samples without accessing actual production datasets, thereby reducing disclosure risks at early approval stages. The framework further specifies roles and responsibility boundaries (data owner, data engineer, analyst/data scientist, ML engineer, information security specialist, and system administrator) and a three-tier system architecture with a web interface and an API suitable for integration with pipeline orchestrators. Treating rule profiles as versioned artifacts, alongside dataset versions, run parameters, metadata storage, operation logging, and periodic auditing, ensures reproducibility of training data preparation and end-to-end traceability of anonymization impacts on model quality. The framework can serve as a reference model for an initial pilot implementation and subsequent expansion to other data classes and privacy governance practices in ML projects.
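As an illustration of the "privacy checkpoint" idea described in the abstract, the following is a minimal Python sketch, not the authors' implementation. The function name `privacy_checkpoint`, its parameters, the report fields, and the sample records are all hypothetical. It assumes already-anonymized tabular records held as dictionaries, uses distinct l-diversity (the number of distinct sensitive values per equivalence class), and measures t as the total variation distance between the class-level and overall distributions of a categorical sensitive attribute, which is one simple choice of distance for unordered categories.

```python
from collections import Counter, defaultdict


def privacy_checkpoint(rows, quasi_identifiers, sensitive,
                       k_target, l_target, t_target, original_count=None):
    """Hypothetical checkpoint: compute observed k, l, t and an admit decision."""
    if not rows:
        raise ValueError("checkpoint needs at least one record")

    # Group records into equivalence classes by their quasi-identifier values.
    classes = defaultdict(list)
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        classes[key].append(row[sensitive])

    # Overall distribution of the sensitive attribute (reference for t).
    total = len(rows)
    overall = Counter(row[sensitive] for row in rows)

    def distance(values):
        # Total variation distance between class-level and overall distributions.
        local = Counter(values)
        return 0.5 * sum(abs(local[v] / len(values) - overall[v] / total)
                         for v in overall)

    observed = {
        "k": min(len(v) for v in classes.values()),       # smallest class size
        "l": min(len(set(v)) for v in classes.values()),  # fewest distinct values
        "t": max(distance(v) for v in classes.values()),  # worst-case distance
    }
    # Share of records removed by suppression, if the original size is known.
    suppression_rate = (None if original_count is None
                        else 1 - total / original_count)

    warnings = []
    if observed["k"] < k_target:
        warnings.append(f"k={observed['k']} is below target {k_target}")
    if observed["l"] < l_target:
        warnings.append(f"l={observed['l']} is below target {l_target}")
    if observed["t"] > t_target:
        warnings.append(f"t={observed['t']:.3f} exceeds target {t_target}")

    return {**observed, "suppression_rate": suppression_rate,
            "warnings": warnings, "admitted": not warnings}


# Illustrative, made-up records after generalization of age and postal code.
rows = [
    {"age": "30-39", "zip": "426**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "426**", "diagnosis": "asthma"},
    {"age": "30-39", "zip": "426**", "diagnosis": "flu"},
    {"age": "40-49", "zip": "427**", "diagnosis": "flu"},
    {"age": "40-49", "zip": "427**", "diagnosis": "flu"},
]
report = privacy_checkpoint(rows, ["age", "zip"], "diagnosis",
                            k_target=2, l_target=2, t_target=0.3,
                            original_count=6)
print(report)
```

In this sketch the dataset is admitted only when no warning is raised. A real checkpoint would presumably also record the rule-profile version and run parameters alongside the report so that the decision stays traceable.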
References
Slijepčević D., Henzl M., Klausner L.D., Dam T., Kieseberg P., Zeppelzauer M. k-Anonymity in Practice: How Generalisation and Suppression Affect Machine Learning Classifiers // Computers & Security. 2021. Vol. 111. Art. 102488. DOI: 10.1016/j.cose.2021.102488.
Ni C., Cang L.S., Gope P., Min G. Data Anonymization Evaluation for Big Data and IoT Environment // Information Sciences. 2022. Vol. 605. P. 381-392. DOI: 10.1016/j.ins.2022.05.040.
Sweeney L. k-Anonymity: A Model for Protecting Privacy // International Journal on Uncertainty, Fuzziness and Knowledge-Based Systems. 2002. Vol. 10, No. 5. P. 557-570. DOI: 10.1142/S0218488502001648.
Machanavajjhala A., Kifer D., Gehrke J., Venkitasubramaniam M. ℓ-Diversity: Privacy Beyond k-Anonymity // ACM Transactions on Knowledge Discovery from Data. 2007. Vol. 1, No. 3. Art. 3. DOI: 10.1145/1217299.1217302.
Li N., Li T., Venkatasubramanian S. t-Closeness: Privacy Beyond k-Anonymity and ℓ-Diversity // Proceedings of the 23rd IEEE International Conference on Data Engineering. 2007. P. 106-115. DOI: 10.1109/ICDE.2007.367856.
Majeed A., Lee S. Anonymization Techniques for Privacy Preserving Data Publishing: A Comprehensive Survey // IEEE Access. 2021. Vol. 9. P. 8512-8545. DOI: 10.1109/ACCESS.2020.3045700.
El Mestari S.Z., Lenzini G., Demirci H. Preserving Data Privacy in Machine Learning Systems // Computers & Security. 2024. Vol. 137. Art. 103605. DOI: 10.1016/j.cose.2023.103605.
Domingo-Ferrer J., Mateo-Sanz J.M. Practical Data-Oriented Microaggregation for Statistical Disclosure Control // IEEE Transactions on Knowledge and Data Engineering. 2002. Vol. 14, No. 1. P. 189-201. DOI: 10.1109/69.979982.
Caruccio L., Desiato D., Polese G., Tortora G., Zannone N. A Decision-Support Framework for Data Anonymization with Application to Machine Learning Processes // Information Sciences. 2022. Vol. 613. P. 1-32. DOI: 10.1016/j.ins.2022.09.004.
Gadotti A., Rocher L., Houssiau F., Crețu A.-M., de Montjoye Y.-A. Anonymization: The Imperfect Science of Using Data While Preserving Privacy // Science Advances. 2024. Vol. 10, No. 29. Art. eadn7053. DOI: 10.1126/sciadv.adn7053.
Borisov R.S., Efimenko A.A. A Passport for Datasets and Research Results for Publication in Open Sources // Pravovaya informatika. 2022. No. 2. P. 66-79. (In Russian.) DOI: 10.21681/1994-1404-2022-2-66-79.
Borisov R.S., Efimenko A.A. A Protocol for Anonymizing Datasets for Publication in Open Sources // Pravovaya informatika. 2023. No. 2. P. 54-66. (In Russian.) DOI: 10.21681/1994-1404-2023-2-54-66.
Borisov S.A., Bosov A.A., Ivanov D.E. Application of Computer Simulation Modeling to the Problem of Personal Data Anonymization. State-of-the-Art Assessment and Main Provisions // Programmirovanie. 2023. No. 4. P. 58-74. (In Russian.) DOI: 10.31857/S0132347423040040.
Borisov S.A., Bosov A.A., Ivanov D.E. Application of Computer Simulation Modeling to the Problem of Personal Data Anonymization. Model and Algorithm for Anonymization by the Synthesis Method // Programmirovanie. 2023. No. 5. P. 19-34. (In Russian.) DOI: 10.31857/S0132347423050023.
Lovtsov D.A. Theory of Information Security in Ergasystems: A Monograph. Moscow: RGUP, 2021. 276 p. (In Russian.)
License
Copyright (c) 2026 Э М Дюкина, Ю В Силаев, О М Перминова

This work is licensed under a Creative Commons Attribution 4.0 International License.