Modern technology is evolving at an exponential pace, driven by Artificial Intelligence (hereafter, “AI”) and data-driven solutions. AI occupies a significant space in technological advances, and applications of machine learning and computer vision algorithms continue to grow. While AI and machine learning applications are recognised as the way forward, it is also well known that they require access to large amounts of data. This technological development runs up against data protection and privacy regulations. Beyond the General Data Protection Regulation (GDPR), several other data protection and privacy laws around the world require companies to comply with strict obligations. This raises the question of whether AI is compatible with such laws, with the GDPR as the standard. There is a clear tension between AI and traditional data protection principles (e.g., purpose limitation, data minimization, and the limitation of automated decisions). A more flexible interpretation of the GDPR has been adopted for the development of AI and big data applications. However, this approach may be short-sighted in the long term and may hinder the development of such technologies. The question is how to justify feeding massive volumes of data to train AI systems (high data consumption) without violating data protection legislation (i.e., while respecting its principles to the fullest extent). Synthetic data has emerged as a privacy-friendly solution. We intend to launch and contribute to the debate about the advantages of synthetic data in the training of AI models, with a special focus on the interpretation of the GDPR.

What is Artificial Intelligence?

There is no consensus within the community on the definition of artificial intelligence (hereafter “AI”) systems. Oversimplifying the concept, we can state that AI entails the ability or capacity of a machine to act purposefully, think rationally and deal effectively with its environment.[1] In more depth, AI can be divided into different subsets, such as symbolic learning, Machine Learning (hereafter “ML”) and reinforcement learning.

Symbolic learning is connected to explicit knowledge structures such as ontologies and rule learning, and therefore to reasoning. This subfield was popular before the “recent” dawn of neural networks.[2]

Machine learning is characterised by statistical learning, benefiting from large volumes of data to learn models, and is the present paradigm for AI.[3]

Lastly, we can also consider reinforcement learning as a third subcategory of AI. The algorithm learns an optimal set of actions (a control policy) to achieve a certain goal. It is not told how to perform a certain action; it only receives positive or negative rewards that, in an iterative cycle, bring its decisions closer to or further away from the primary goal.[4]
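The reward-driven cycle described above can be illustrated with a minimal sketch. It assumes a hypothetical environment with two possible actions that return a positive reward with different probabilities; the function name `run_bandit`, the probabilities and the exploration rate are illustrative choices, not taken from the cited literature.

```python
import random

def run_bandit(reward_probs, steps=5000, epsilon=0.1, seed=0):
    """Learn which action pays off best from +1 / -1 rewards alone."""
    rng = random.Random(seed)
    n_actions = len(reward_probs)
    value = [0.0] * n_actions   # running estimate of each action's average reward
    count = [0] * n_actions     # how often each action has been tried

    for _ in range(steps):
        # Occasionally explore a random action, otherwise exploit the best estimate.
        if rng.random() < epsilon:
            action = rng.randrange(n_actions)
        else:
            action = max(range(n_actions), key=lambda a: value[a])

        # The environment never says how to act; it only returns a reward.
        reward = 1.0 if rng.random() < reward_probs[action] else -1.0

        # Incremental update of the average reward observed for this action.
        count[action] += 1
        value[action] += (reward - value[action]) / count[action]

    return value, count

# Hypothetical environment: action 1 is rewarded far more often than action 0.
values, counts = run_bandit([0.2, 0.8])
best_action = max(range(len(values)), key=lambda a: values[a])
```

After enough iterations the estimated value of the frequently rewarded action dominates, so the learned policy selects it most of the time without ever having been told which action was correct.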

Full AI is thus characterized by a set of operations that are typical of human intelligence, such as learning, adaptation, interaction, reasoning, problem-solving, knowledge representation, predicting and planning, autonomy, perception, movement and manipulation.[5]

The characteristics of this technology enhance human capability (reducing human shortcomings), with the potential to reduce task execution time, increase the productivity and efficiency of general actions, and improve the reading and analysis of mass data volumes (identifying and correlating patterns).[6] Such capabilities arise because AI is developed to discover correlations between data and build models, linking inputs to presumably correct responses (“predictions”). However, these predictions are only possible after the system is trained on vast sets of examples. AI systems thus become “data-hungry” over time, and this requirement encourages data collection in a self-reinforcing spiral.[7]

AI data-driven technology

We can thus state that AI is a data-driven technology that is hugely dependent on data. AI systems require huge amounts of structured, semi-structured or unstructured data (oversimplifying the matter, let us call this the need for “big data”).[8] The emergence of new technologies adds to this reality, as mobile networks, social media and the internet of things generate even more data.[9]

The potential benefits are diverse and important for society.[10] However, as stated, these applications are typically developed with data-hungry models, sometimes even when data for a specific purpose is not available. Ensuring data availability and abundance can thus be a challenge for the deployment of these technologies. Amassing the necessary data is not only technically demanding but can also entail several risks for privacy.[11]

The implementation of big data technologies can be framed as a set of advanced data-driven strategies, based on the analysis and interpretation of verifiable and reliable data. Although data-driven management is a major challenge for entities, it enables study and interpretation that yield specific answers and efficient solutions. Vast data sets are, however, difficult to manage using standard techniques because of their special characteristics, commonly defined by several V’s: huge Volume, high Velocity, great Variety, low Veracity, and high Value.[12] Understanding how big data technologies can be balanced in practice is thus not a simple task[13], especially when considering the emergence of new technologies in the context of IoT, social platforms, driverless cars, or smart cities. The indiscriminate collection of data[14], analytics and decision-making based on AI techniques carry inherent risks for individuals.

We can thus foresee that these data-driven technologies might contradict the underlying intention of data protection and privacy laws. How can we ensure purpose limitation when data is collected indiscriminately, with no assurance of data minimization or accuracy? It should also be considered that, although such data can be created by individuals, it is most often collected automatically by the system from the physical world or from computer-mediated activities.[15]

Current challenges: what about the data?

While it is recognised that AI can significantly improve human analytics and have a high and transversal impact on several sectors, it is also well known that AI systems require access to a large amount of data. While offering clear economic and societal benefits[16], they entail challenging risks to the fundamental rights and interests of individuals. We are thus confronted with a two-sided coin: where there are improvements or technological advances to be made, there are also a number of inherent risks that must be taken into consideration.[17]

Overarching ethical and legal challenges arise, along with questions of ethics, fairness, transparency, accountability and explicability.[18] Aiming to tackle these challenges, the European Commission (hereafter “EC”) and the European Parliament (hereafter “EP”) have repeatedly expressed the need for legislative action to ensure a well-functioning internal market for AI in which benefits and risks are adequately addressed. This approach fundamentally aims to ensure the development of secure, trustworthy and ethical artificial intelligence[19] and the protection of ethical principles.[20]

Seven key requirements for a trusted ecosystem for regulating AI have been proposed, in which “privacy and data governance” is highlighted.[21] It is understandable that if AI systems are to be designed around human rights, privacy and data protection should be taken as a priority.[22]

A proposal for a regulation laying down harmonized rules on AI[23] was adopted in the EU, establishing a balanced and proportionate horizontal regulatory approach limited to the minimum requirements necessary to address the inherent risks. The regulation contains specific rules on the protection of individuals with regard to the processing of personal data and, notably, sets out the legal requirements for high-risk AI systems in relation to data and data governance, documentation and record keeping, transparency and provision of information to users, human oversight, robustness, accuracy and security. However, the regulation should be seen as complementary to existing data protection and privacy laws.

It is well known that building, training and testing AI models requires access to large and diverse data. The proposal does not provide any direct legal solution to this question, and this reality may conflict with data protection and privacy laws, notably the GDPR. It becomes clear that there is a tension between AI development and traditional data protection principles, for example purpose limitation, data minimization, and the limitation of automated decisions.[24] It is relevant to note that, at the time of the entry into force of the GDPR, AI systems were not the reality they are today, and this piece of regulation mainly considered internet development.

It should be considered that data subjects should always be guaranteed full control over their own data, and that such data may not be processed freely and without a lawful basis. Imagine autonomous driving, where the algorithm collects data from the driver but also from sensors scanning the surrounding environment (e.g. pedestrians or other vehicles). Can we consider that data subjects have control over their data in these cases, when they are not even aware that their data is being processed?

While it is not our intention to conduct an in-depth analysis of all the issues surrounding this topic, the main idea that should be retained is that the development of such technologies is destined to raise data protection and privacy issues within the applicable EU legal framework. The European Data Protection Board already had the opportunity to emphasize the very high risk related to excessive data collection and the storage of such data over a long period of time, considering the development of new functionalities and, more specifically, those based on AI algorithms.[25]

Protecting the rights and interests of data subjects is therefore an essential part of developing AI systems. It is also recognized that the promotion of AI-driven innovation is closely linked to other data policy strategies, which establish trusted mechanisms and services for re-use, sharing and pooling of data for the development of high-quality models.[26] That is why some ethical challenges inherent to AI are very similar to those raised by other technologies relying on big data, for example the abuse of data.[27] One might not realize that the collection and analysis of large, often centralised, datasets may lead to the re-identification of individuals, by linking datasets or inferring new data from existing ones, nor that the collected data is being used in a context other than the one in which it was originally collected.
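The re-identification risk created by linking datasets can be made concrete with a small sketch. All records, names and field names below are invented for illustration: a dataset from which names were removed is joined with a hypothetical public register on the attributes both share.

```python
# Hypothetical "de-identified" records: names removed, quasi-identifiers kept.
health_records = [
    {"zip": "1000-001", "birth_year": 1980, "sex": "F", "diagnosis": "asthma"},
    {"zip": "1000-001", "birth_year": 1975, "sex": "M", "diagnosis": "diabetes"},
    {"zip": "4000-123", "birth_year": 1980, "sex": "F", "diagnosis": "hypertension"},
]

# Hypothetical public register that still contains names.
public_register = [
    {"name": "Alice Example", "zip": "1000-001", "birth_year": 1980, "sex": "F"},
    {"name": "Bruno Example", "zip": "1000-001", "birth_year": 1975, "sex": "M"},
    {"name": "Carla Example", "zip": "4000-123", "birth_year": 1980, "sex": "F"},
    {"name": "Diana Example", "zip": "4000-123", "birth_year": 1980, "sex": "F"},
]

QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")

def link(records, register, keys=QUASI_IDENTIFIERS):
    """Return {name: diagnosis} for every record matching exactly one person."""
    index = {}
    for person in register:
        index.setdefault(tuple(person[k] for k in keys), []).append(person["name"])

    reidentified = {}
    for rec in records:
        candidates = index.get(tuple(rec[k] for k in keys), [])
        if len(candidates) == 1:  # a unique match singles out one individual
            reidentified[candidates[0]] = rec["diagnosis"]
    return reidentified

reidentified = link(health_records, public_register)
```

In this toy example two of the three records are singled out, while the third remains ambiguous because two people in the register share the same combination of attributes.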

Such systems should therefore be underpinned by a logic of data ethics, outlining principles that converge with data protection and privacy laws: the purpose of the data processing is specifically identified; data is processed with respect for integrity; data sets are understandable and transparent, and their use is accountable; data should be open; and individuals should have control over their own personal data.[28]

Synthesis as a possible solution

Synthetic data (hereafter, “SD”) is an artificially generated set of data, conceptually generated from a sample of real data, preserving its statistical properties without reproducing specific data records (the process can then be called “synthesis”).[29] The data is generated by computer programs rather than by real-world events. However, SD is not generated by any run-of-the-mill computer program, but by algorithms that model the original statistical distributions and structure of real datasets. Even though the output data might be considered “fake”, it still maintains utility for training models.[30]

Synthetic datasets can be tailored to specific needs, such as testing new tools or simply sharing data. Synthesis is thus an emerging privacy-enhancing technology that can enable inexpensive access to unlimited new samples of realistic data, based on a model trained on real data.[31] More recently, advances in AI have enabled the creation of even more realistic synthetic datasets, with deep learning models leading innovation in this field. These models can be trained automatically on available data and then used to generate new, unlimited SD samples.
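The idea of training a model on real data and then sampling new records from it can be sketched with the simplest possible generator: a bivariate Gaussian fitted to two numeric attributes. This is a deliberately basic stand-in for the deep learning generators mentioned above, and the "real" dataset (age and income), the function names and the parameters are all hypothetical.

```python
import math
import random
import statistics

def fit_bivariate_gaussian(rows):
    """Estimate means, standard deviations and correlation of two columns."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.pstdev(xs), statistics.pstdev(ys)
    cov = statistics.fmean([(x - mx) * (y - my) for x, y in rows])
    return {"mx": mx, "my": my, "sx": sx, "sy": sy, "rho": cov / (sx * sy)}

def sample_synthetic(model, n, seed=0):
    """Draw n new records from the fitted distribution, not from the real rows."""
    rng = random.Random(seed)
    residual = math.sqrt(max(0.0, 1.0 - model["rho"] ** 2))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = model["mx"] + model["sx"] * z1
        # Mixing z1 into y reproduces the correlation observed in the real data.
        y = model["my"] + model["sy"] * (model["rho"] * z1 + residual * z2)
        rows.append((x, y))
    return rows

# Hypothetical "real" dataset of (age, income) pairs, simulated for this sketch.
rng = random.Random(42)
real = []
for _ in range(2000):
    age = rng.gauss(45, 12)
    real.append((age, 800 + 30 * age + rng.gauss(0, 300)))

model = fit_bivariate_gaussian(real)
synthetic = sample_synthetic(model, 2000, seed=1)
synthetic_model = fit_bivariate_gaussian(synthetic)
```

Comparing `model` with `synthetic_model` shows that means and correlation are closely preserved, although no synthetic row is a copy of a real one. Real-world generators must capture far richer structure, and preserving statistics alone does not by itself prove that no individual can be singled out.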

Synthetic generators, such as those based on the celebrated Generative Adversarial Network architectures,[32] are a flexible and easy-to-use AI-based solution that enables organizations to generate highly realistic synthetic events. SD can thus play a key role in the adoption of new technologies, especially considering that privacy and data protection laws have made it even more difficult to access and use real data for training AI models.

Defining types of synthetic data

Conceptually, SD is not real data, but data that is generated from real data and maintains the same statistical properties.[33] There are three types of SD:[34] those generated from real (actual) data sets, those that do not use real data, and hybrids of the first two types.

Data synthesis based on real data is the most commonly used, since the applied models are often domain agnostic, their development requires less domain knowledge, and they adapt directly to the context from which the data is collected, often regardless of its complexity.

Best in class: synthesis as technical measure

In contrast to other privacy-enhancing technologies that, to some extent, render information non-personal (de-identified data),[35] SD is not real data relating to real individuals, and there is no link between records in an SD set and records in the original real data set.

Recital 26 of the GDPR states that the regulation does not apply to “information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable”.[36] Data synthesis would then benefit from the same reasoning that applies to anonymized data, which does not need to comply with further obligations under the GDPR.[37] However, depending on the synthesis process applied, the initial data creation and testing must comply with applicable laws. Notably, the process of anonymization is itself a data processing activity that needs to comply with data protection laws.[38]

Anonymization is an important measure to ensure that data subjects are not re-identified, limiting the risks to their rights and freedoms while enabling the information to be made available to the public. However, successful anonymization may be practically impossible for complex datasets.[39]

We should also consider risk-based de-identification, as the process used to prevent personal identifiers from being connected with information.[40] In some cases, de-identification may be a solution, in particular for AI training models, notably because training data need not include directly identifying data.[41]

Considering the above, the identified advantage is that SD uses characteristics of the real data (or not) to generate new “fake” data. Because it models the statistical distributions and structure of the real datasets, SD is similar to the original data in terms of granularity. Nevertheless, SD should not be regarded as personal data, since the information is not linked to actual data subjects.

Tackling current challenges

Among the many challenges that were identified, the dependency on data is one of the greatest.

Applying synthesis as a privacy measure can help tackle some issues when it comes to data and data protection laws, mainly the compliance with fundamental data protection principles.[42] This solution could avoid the need to promote a flexible interpretation of the GDPR so as not to hamper the use of personal data for AI purposes.

Repurposing data may be incompatible with the purpose limitation principle where the new purposes differ from those for which the data were originally collected.[43] The constant analysis and balancing of rights and interests that this requires may be difficult or even impossible with the development of AI, even more so when we consider the learning capabilities of such systems.

There is also a tension with the idea of data minimisation, according to which personal data should be adequate, relevant and limited to what is necessary in relation to the purposes for which they are processed.[44] The principle of data minimization implies that entities should identify the minimum amount of personal data needed to fulfill their purposes, and hold no more.[45] There is thus a clear tension between this principle and the concept of big data and data analytics. Considerations of ways in which this tension can be reduced are bound to involve long-term practical issues.[46] Constant assessments of proportionality in data inclusion, limiting processing to statistics, and avoiding decisions about individuals can become too much of a burden when ensuring the evolution of training models and the development of the system. Closely connected with the data minimisation principle is the lawful ground for processing, notably where AI prediction is provided as a service.[47] It should be considered that using personal data for performing or entering into a particular contract does not cover the subsequent use of such data for business analytics, or as possible inputs to a predictive-decisional model concerning the data subject.[48] Data minimisation should not, however, be confused with anonymization. Although there are conceptual and technical similarities between the concepts, privacy-preserving techniques normally mean that certain data used in AI are rendered pseudonymous or anonymous.[49]

The risk-based approach of data protection laws requires entities to assess their risk appetite for AI development, for example by considering the implementation of technical measures in the context of particular circumstances.[50] This obligation can in some cases hinder the evolution of such systems, forcing governments to create special regimes such as “sandboxes” for high-risk technologies. These regulatory frameworks are expanding to tackle the current challenges of these “new” technologies. Within this environment of regulatory tension, companies may enjoy extensive interaction with regulators for the controlled development of the services and products they offer. In this way, at a later and safer moment, regulatory agencies will be able to choose “if”, “how” and “when” to grant definitive authorizations for the entry and actual operation of these companies in the consumer market.

The Portuguese Government established, in Council of Ministers Resolution No. 29/2020, the bases for the creation of the so-called “Technological Free Zones” (ZLT) “for the testing and experimentation in real environment in the country of any new technologies and solutions”. In generic terms, this space is intended to be a general and intersectoral structure for experimenting with innovative technologies. In other words, it corresponds to the concept of the regulatory sandbox regime. The legal framework for the ZLTs was later approved by DL 67/2021, of 30 July, which established the regime and governance model for promoting technology-based innovation through the creation of these zones.

Compliance with applicable laws should not be sacrificed to promote the creation of training sets and the construction of algorithmic models in which the resulting AI systems could pose a risk to the rights and interests of data subjects. Since privacy regulations may not apply to SD, violation of the legislation or even the threat of a data breach would not be an issue, at least for the associated use cases.[51]

SD may enable the use of datasets not restricted by legal and privacy concerns, allowing the development of new innovative solutions without risking the re-identification of data subjects. This opportunity means that organizations can become compliant with data protection and privacy laws while being innovative.[52]

Leveraging SD can also boost the data economy. Organizations are able to share their data without incurring the risk of exposing personal data.[53] Its use accelerates and eases the data sharing process between entities, helping to create a trustworthy global data economy.[54] Sharing data may empower crowd intelligence, leading to new research and innovation.[55]

Developing and testing software solutions also benefits from this approach. SD eases the creation of high volumes of realistic data for other environments, such as development or testing environments.[56] SD generators are conceived to be privacy-preserving by design. They allow sampling of rich datasets, which boosts the speed and accuracy of software development, mitigates the risk of future bugs, improves the testing process and makes the integration flow less prone to errors.[57]

SD can also support AI research and new methods to predict a certain disease. SD generated from real patients’ data can solve some privacy concerns while keeping the characteristics and features of the datasets that allow us to identify a certain disease or support a certain diagnosis.[58]

Conclusions

We can conclude that the use of SD may allow entities to occlude personal data from generated datasets with less impact on their utility and more effectiveness when compared to other technical measures. Classical anonymization techniques are often applied to real data but are limited in blocking re-identification events, are laborious to implement, and require some degree of expertise to apply. Differential privacy techniques may help avoid these pitfalls and can be integrated directly into synthesizer training pipelines, leading to synthesizer models with a controlled privacy guarantee and the least impact on utility.
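How a privacy guarantee can be built into a synthesizer is easiest to see in a very small case. The sketch below is not the training pipeline of any particular product: it assumes a single categorical attribute with categories fixed in advance and one record per person, adds Laplace noise of scale 1/epsilon to each category count (the standard Laplace mechanism for counting queries), and samples synthetic values from the noisy counts. The function names, the dataset and the value of epsilon are illustrative.

```python
import random

def laplace_noise(scale, rng):
    """Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def dp_histogram_synthesizer(values, categories, epsilon, n_samples, seed=0):
    """Sample synthetic values from a histogram protected with Laplace noise."""
    rng = random.Random(seed)
    noisy_counts = []
    for category in categories:
        true_count = sum(1 for v in values if v == category)
        # One person changes one count by at most 1, hence a noise scale of 1/epsilon.
        noisy = true_count + laplace_noise(1.0 / epsilon, rng)
        noisy_counts.append(max(0.0, noisy))  # clipping is harmless post-processing
    if sum(noisy_counts) == 0:
        noisy_counts = [1.0] * len(categories)  # fall back to a uniform choice
    # Sampling uses only the noisy counts, never the original records.
    return rng.choices(categories, weights=noisy_counts, k=n_samples)

# Hypothetical attribute with three categories.
categories = ["A", "B", "C"]
real_values = ["A"] * 700 + ["B"] * 250 + ["C"] * 50
synthetic_values = dp_histogram_synthesizer(real_values, categories, epsilon=1.0, n_samples=1000)
```

A smaller epsilon means more noise and a stronger guarantee at the cost of fidelity. Python's `random` module is used here only for readability; a production mechanism would need a vetted differential privacy library and careful handling of floating-point and randomness issues.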

This unique feature can potentially lay grounds for compliance with applicable law in data protection and privacy, notably the GDPR. In some cases, it would not be necessary to promote a flexible interpretation of the law so as not to undermine the use of personal data for AI purposes.

Besides the positive outlook regarding privacy protection, SD can open up AI models to the possibility of approximating real distributions in a way that would not otherwise be possible, or economically feasible, when relying solely on limited real-world samples. Data generators can interpolate from real observations, and in some cases even a capacity for extrapolation can be observed. The usual constraints of collecting and labeling real-world data are not hindrances, since unlimited samples can be obtained from the modeled distributions without incurring additional costs. These tools can also be used to correct imbalance in the available real-world samples, which can help remove bias and ensure that production models respond correctly to rare catastrophic events and other important challenges behind data applications.
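Interpolation as a way of correcting imbalance can be sketched as follows. This is a simplified scheme in the spirit of SMOTE-style oversampling (it interpolates between random pairs of rare observations rather than nearest neighbours), and the two classes, their values and the function name are invented for illustration.

```python
import random

def interpolate_minority(minority, n_new, seed=0):
    """Create n_new points on segments between random pairs of minority points."""
    rng = random.Random(seed)
    new_points = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)   # two distinct real observations
        t = rng.random()                 # position along the segment from a to b
        new_points.append(tuple(ai + t * (bi - ai) for ai, bi in zip(a, b)))
    return new_points

# Hypothetical imbalanced dataset: 95 common observations, 5 rare ones.
majority = [(float(i % 10), float(i % 7)) for i in range(95)]
minority = [(20.0, 20.0), (22.0, 21.0), (21.0, 23.0), (23.0, 22.0), (20.5, 21.5)]

synthetic_minority = interpolate_minority(minority, len(majority) - len(minority))
balanced_minority = minority + synthetic_minority
```

The rare class now has as many examples as the common one, and every new point lies between real rare observations. Note that points interpolated this way stay close to real records, so this particular scheme addresses balance, not privacy.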

The ability to produce virtually infinite synthetic samples synergizes with privacy in a best-of-both-worlds scenario: a stronger privacy guarantee in the privacy-utility tradeoff penalizes utility; however, with the unlimited potential to create new data points, this drawback can be effectively mitigated in AI application development.

In the context of this paper, SD can open several interesting doors. However, entities should be aware that, like any other emerging technology, it is subject to possible limitations, and they should not privilege its exclusive use over other privacy-friendly technological measures.


[1] Rembrandt Devillé, Nico Sergeyssels and Catherine Middag, “Chapter 1, Basic Concepts of AI, for Legal Scholars”, Artificial Intelligence and the Law, Jan De Bruyne and Cedric Vanleenhove (eds.), Intersentia, p.2.

[2] Minsky, M. L., “Logical versus analogical or symbolic versus connectionist or neat versus scruffy”, AI magazine, 1991, 12(2), p.34; Blazek, P., How Aristotle is Fixing Deep Learning’s Flaws, 2022; and Marcus, G., “The next decade in AI: four steps towards robust artificial intelligence”, 2020.

[3] Dinsmore, J., “The symbolic and connectionist paradigms: closing the gap”, Psychology Press, 2014.

[4] Rembrandt Devillé, Nico Sergeyssels and Catherine Middag, “Chapter 1, Basic Concepts of AI, for Legal Scholars”, Artificial Intelligence and the Law, Jan De Bruyne and Cedric Vanleenhove (eds.), Intersentia, p.6-7.

[5] Agência para a Modernização Administração (ama), “GuIA, Guia para uma inteligência artificial ética, transparente e responsável na administração pública”, p. 10.

[6] For some successful applications of ML see Wang, W., Ye, Z., Gao, H., & Ouyang, D., “Computational pharmaceutics-A new paradigm of drug delivery”, Journal of Controlled Release, 2021, 338, p.119-136.

[7] European Parliament Research Service (EPRS), “The impact of the General Data Protection Regulation (GDPR) on artificial intelligence”, Scientific Foresight Unit (STOA), PE 641.530 – June 2020, Section I.

[8] Sarker, I.H., “Machine Learning: Algorithms, Real-World Applications and Research Directions”, SN COMPUT. SCI. 2, 2021, p. 160.

[9] Bernard Marr, “How Much Data Do We Create Every Day? The Mind-Blowing Stats Everyone Should Read”, Forbes, 2018.

[10] Information Commissioner’s Office, “Big data, artificial intelligence, machine learning and data protection”, Data Protection Act and General Data Protection Regulation, 20170904, Version:2.2, pp. 15 to 18.

[11] Ericsson, “Privacy in mobile networks – How to embrace privacy by design”.

[12] European Parliament Research Service (EPRS) (…) p.4. See also, Agência para a Modernização Administração (ama) (…) p. 26.

[13] Vestoso, Margherita. 2018. “The GDPR beyond Privacy: Data-Driven Challenges for Social Scientists, Legislators and Policy-Makers” Future Internet 10, no. 7: 62.

[14] European Parliament Research Service (EPRS) (…) p.18.

[15] Ibidem, p.4.

[16] European Commission, “White Paper: on artificial Intelligence – A European approach to excellence and trust”, COM(2020) 65 final, Brussels, 18.2.2020, p.2.

[17] Concerning this idea see European Commission, “White Paper On Artificial Intelligence -A European approach to excellence and trust”, Brussels, 19.02.2020, COM(2020) 65 final.

[18] Chapter 3 Setting the scene: on AI ethics and regulations, Artificial Intelligence and the Law, Jan De Bruyne and Cedric Vanleenhove (eds.), Intersentia, p.49.

[19] European Council, Special meeting of the European Council (1 and 2 October 2020) – Conclusions, EUCO 13/20, 2020, p. 6.

[20] As a general reference, see European Commission, “White Paper On Artificial Intelligence -A European approach to excellence and trust”, Brussels, 19.02.2020, COM(2020) 65 final.

[21] The High-level Expert Group on Artificial Intelligence defined seven key requirements for a trustworthy AI: (i) human agency and oversight; (ii) technical robustness and safety; (iii) privacy and data governance, (iv) transparency; (v) diversity, non-discrimination and fairness; (vi) societal and environmental wellbeing; and (vii) accountability (cf. High-Level Expert Group on Artificial Intelligence, set up by the European Commission, “Ethics Guidelines for Trustworthy AI, 201). See also European Parliament Research Service (EPRS) (…) p.31. And also the Communication from the Commission to the European Parliament, the Council, the European Economic and Social Committee and the Committee of the Regions, “Building Trust in Human-Centric Artificial Intelligence”, Brussels, 8.4.2019, COM(2019) 168 final.

[22] Trustworthy AI systems should thus entail the following five dimensions: Accountability, ensuring that the systems are accountable, secure and auditable. Transparency, ensuring the visualization of their components and the procedures applied. Explicability, ensuring that the systems can be understood by the explanation provided. Fairness, ensuring that the systems are fair and non-discriminatory. Ethics, ensuring that the system offers mitigations to deal with ethical bias (Cf. Agência para a Modernização Administração (ama) (…) p. 33.)

[23] Proposal for a Regulation of the European Parliament and of the Council Laying Down Harmonised Rules on Artificial Intelligence (Artificial Intelligence ACT) and amending certain union legislative acts {SEC(2021) 167 final} – {SWD(2021) 84 final} – {SWD(2021) 85 final}, COM(2021)206 final, 2021/0106 (COD), p.3.

[24] Zarsky, Tal Z., “Incompatible: The GDPR in the Age of Big Data”, Seton Hall Law Review, 2017. Vol. 47:995, pp. 1003 – 1004.

[25] European Data Protection Board (EDPB), “Guidelines 01/2020 on processing personal data in the context of connected vehicles and mobility related applications”, version 2.0, adopted on 9 March 2021.

[26] Proposal for a Regulation (…) p.5.

[27] For a deeper understanding of these ethical challenges, especially on data abuse, it is recommended to consult Michiel Fierens, Stephanie Rossello and Ellen Wauters, “Chapter 3 Setting the scene: on AI ethics and regulations”, Artificial Intelligence and the Law, Jan De Bruyne and Cedric Vanleenhove (eds.), Intersentia, p.51.

[28] Agência para a Modernização Administração (ama) (…) pp. 22-23.

[29] Khaled El Emam, Lucy Mosquera & Richard Hoptroff, “Practical Synthetic Data Generation, Balancing Privacy and the Broad Availability of Data”, 2020, p. 1.

[30] Li, H., Xiong, L., & Jiang, X. (2014). Differentially Private Synthesization of Multi-Dimensional Data using Copula Functions. Advances in database technology: proceedings. International Conference on Extending Database Technology, 2014, 475–486.

[31] The generation of synthetic data has, however, been known in the market for decades, mainly through techniques such as modeling a multivariate probability distribution for a given data set and then sampling new data.

[32] A. Creswell, T. White, V. Dumoulin, K. Arulkumaran, B. Sengupta and A. A. Bharath, “Generative Adversarial Networks: An Overview,” in IEEE Signal Processing Magazine, vol. 35, no. 1, pp. 53-65, Jan. 2018 (doi: 10.1109/MSP.2017.2765202).

[33] Khaled El Emam, Lucy Mosquera & Richard Hoptroff, “Practical Synthetic Data Generation, Balancing Privacy and the Broad Availability of Data”, 2020, p. 1.

[34] Ibidem.

[35] An example is the removal of private health information from medical records (Uzuner, Ö., Luo, Y., & Szolovits, P., “Evaluating the state-of-the-art in automatic de-identification”. Journal of the American Medical Informatics Association, 14(5), 2007, pp. 550-563)

[36] Randy Koch, GDPR, CCPA and beyond: How synthetic data can reduce the scope of stringent regulations, 2020.

[37] El Emam, “Accelerating AI with synthetic data”.

[38] Barta, G. (2018). Challenges in the compliance with the General Data Protection Regulation: anonymization of personally identifiable information and related information security concerns. Knowledge–economy–society: business, finance and technology as protection and support for society, chapter 11.

[39] Rocher, L., Hendrickx, J. M., & De Montjoye, Y. A., “Estimating the success of re-identifications in incomplete datasets using generative models”, Nature communications, 10(1), 2019, p.1-9

[40] El Emam, K. (2010). Risk-based de-identification of health data. IEEE Security & Privacy, 8(3), p.64-67

[41] For a discussion on this matter, and mitigation strategies, see Google’s Considerations for Sensitive Data within Machine Learning Datasets.

[42] Charline Daelman, “Chapter 6, AI through a Human Right Lens. The Role of Human Rights in Fulfilling AIs Potential”, Artificial Intelligence and the Law, Jan De Bruyne and Cedric Vanleenhove (eds.), Intersentia, p.123.

[43] European Parliament Research Service (EPRS) (…) pp.45 to 47. Also, see WP29, Opinion 03/2013 on purpose limitation, adopted on 2 April 2013, 00569/13/EN, WP 203.

[44] Ibidem, pp.47 to 48.

[45] Information Commissioner’s Office (ICO), Guide to the General Data Protection Regulation (GDPR), 2021, pp. 28-31.

[46] European Parliament Research Service (EPRS) (…) pp.47 to 48.

[47] Information Commissioner’s Office (ICO), Guidance on the AI auditing framework, Draft guidance for consultation, 2019, p. 23.

[48] Ibidem p. 50

[49] Ibidem p. 83.

[50] Ibidem p. 13.

[51] Randy Koch, GDPR, CCPA and beyond: How synthetic data can reduce the scope of stringent regulations, 2020.

[52] Anderson, J. W., Kennedy, K. E., Ngo, L. B., Luckow, A., & Apon, A. W. (2014, October). Synthetic data generation for the internet of things. In 2014 IEEE International Conference on Big Data (Big Data) (pp. 171-176). IEEE

[53] El Emam, K., & Hoptroff, R., “The synthetic data paradigm for using and sharing data”, Cutter Executive Update, 19(6), 2019.

[54] Steinhoff, J. Towards a Political Economy of Synthetic Data: The Possibility of a Data-intensive Capitalism That is Not a Surveillance Capitalism

[55] Piwowar, H. A., Becich, M. J., Bilofsky, H., Crowley, R. S., & caBIG, “Data Sharing and Intellectual Capital Workspace”, Towards a data sharing culture: recommendations for leadership from academic health centers. PLoS medicine, 5(9), 2008, p. 183.

[56] Kamaev, A. N., Smagin, S. I., Sukhenko, V. A., & Karmanov, D. A., “Synthetic data for AUV technical vision systems testing”, In CEUR Workshop Proceedings Vol. 1839, 2017, p. 126-140.

[57] Whiting, M. A., Haack, J., & Varley, C., “Creating realistic, scenario-based synthetic data for test and evaluation of information analytics software”, In Proceedings of the 2008 Workshop on beyond time and errors: novel evaluation methods for Information Visualization, 2008, April, p. 1-9.

[58] This dataset is composed of 3 different types of input features: (i) objective, based in factual information; (ii) examination, translating the results of medical examination; and subjective, that is the information given by the patient.