Pseudonymization, anonymization and data protection with AI 

Companies that want to develop and use artificial intelligence need suitable training data. However, this data is often personal and therefore falls within the scope of the General Data Protection Regulation (GDPR). With CIB PoP, the valuable information can nevertheless be used for AI training, thanks to pseudonymization and anonymization.

Data protection and progress: a contradiction?

Who will be successful in the future global digital competition depends largely on who holds their own in the field of artificial intelligence. The leading players are currently the major U.S. corporations Alphabet (Google), Amazon, Meta (Facebook) and Microsoft. But China is also making great strides. In addition to military applications, the main investment here is in AI-supported surveillance of the population. A so-called social scoring system is currently under construction, in which every citizen receives points for positive behavior while points are deducted for negative behavior. The data comes from myriad sources – including the ubiquitous video cameras equipped with facial recognition software.

The advances in the U.S. and China are possible not only because of technical know-how, but also because of lax data protection laws. This is different in Europe, where personal data is very comprehensively protected by the General Data Protection Regulation (GDPR). The successes of AI have therefore so far been limited to areas that do not involve personal data. This includes, for example, the use of machine-generated data from industry. According to a Bitkom survey, however, many companies would like to see more far-reaching projects: 66 percent of companies involved in AI stated that personal data must be used in order to obtain usable analysis results.

Pseudonymization and anonymization enable GDPR-compliant machine learning

Artificial intelligence has made its major breakthrough through the discipline of “Machine Learning” (ML). Here, it is not rules but data that determine the behavior of AI. For example, if an algorithm is to be developed that recognizes cats and dogs in pictures, it is not necessary to define the distinguishing features of the animals in the form of rules. Rather, the ML algorithm analyzes large sets of example images of the two animal species. Over time, this produces a generalized model that can be used to classify images that the AI has not yet seen.

At the heart of machine learning, then, is extensive training data. On the one hand, this data must be suitable for training an ML model from an economic and technical point of view. On the other hand, however, its use must not lead to a violation of the General Data Protection Regulation. If you still want to use personal data, you basically have two options:

  • Disguise personal reference (pseudonymization)
  • Remove the personal reference (anonymization)

Pseudonymization: The identification of the person is made more difficult

With pseudonymization, direct identifiers such as names are replaced by pseudonyms. For example, “Bernhard” becomes “Heinrich”. It is important that the assignment is unambiguous: if “Bernhard” occurs more than once in a data record, it must be replaced by “Heinrich” throughout. Some applications require that the pseudonymization is reversible. This is the case if it is possible to derive the original value from the pseudonym – even if a separate key is required for this.
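The consistency and reversibility requirements described above can be sketched in a few lines of Python. This is a minimal illustration, not the way CIB PoP works: the class name `Pseudonymizer`, its methods and the `PERSON_0001` naming scheme are assumptions made for this example. The reverse lookup table plays the role of the separate key and would have to be stored apart from the pseudonymized data.

```python
class Pseudonymizer:
    """Hypothetical helper: maps each original value to one stable pseudonym."""

    def __init__(self):
        self._forward = {}  # original value -> pseudonym
        self._reverse = {}  # pseudonym -> original value (the separate "key")

    def pseudonymize(self, value: str) -> str:
        # Reuse the existing pseudonym so the assignment stays unambiguous
        if value not in self._forward:
            pseudonym = f"PERSON_{len(self._forward) + 1:04d}"
            self._forward[value] = pseudonym
            self._reverse[pseudonym] = value
        return self._forward[value]

    def reidentify(self, pseudonym: str) -> str:
        # Only possible for whoever holds the reverse table
        return self._reverse[pseudonym]


p = Pseudonymizer()
records = ["Bernhard", "Anna", "Bernhard"]
pseudonymized = [p.pseudonymize(name) for name in records]
print(pseudonymized)  # ['PERSON_0001', 'PERSON_0002', 'PERSON_0001']
```

Both occurrences of “Bernhard” receive the same pseudonym, and `p.reidentify("PERSON_0001")` returns the original name for anyone who has access to the table.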

Pseudonymization does not prevent the re-identification of individuals, but merely makes corresponding inferences more difficult. Pseudonymized data is therefore still subject to the GDPR. For example, just like the original data, it must be deleted when the retention obligation expires and there are no other reasons to retain it.

Anonymization: Re-identification is impossible

If you want to break free of the GDPR straitjacket, you have to resort to anonymization. This is because anonymized data does not allow any conclusions to be drawn about individuals from a technical perspective. To meet this requirement, all information that would allow re-identification must be deleted, redacted or replaced by digits, for example.
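As a rough illustration of redacting and replacing by digits, the following Python sketch masks e-mail addresses and overwrites every digit with a zero. The function name `redact`, the patterns and the sample text are assumptions chosen for this example. Pattern matching of this kind only catches well-structured identifiers; on its own it does not guarantee that re-identification is impossible.

```python
import re

# Simple pattern for e-mail addresses (an assumption, not exhaustive)
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
DIGIT = re.compile(r"\d")


def redact(text: str) -> str:
    """Hypothetical helper: mask e-mail addresses and overwrite all digits."""
    text = EMAIL.sub("[REDACTED]", text)
    # Keep the length of numbers but destroy their value
    return DIGIT.sub("0", text)


sample = "Contact: max@example.com, customer no. 48213"
redacted = redact(sample)
print(redacted)  # Contact: [REDACTED], customer no. 00000
```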

Procedure depends on the purpose of the processing

An important component of the General Data Protection Regulation is the principle of data minimization. In essence, it states that personal data may only be stored and processed to the extent required by the purpose. For AI projects, this means that companies must always check in advance whether the processing purpose can be achieved with anonymized data. If this is the case and the data is still not anonymized, the principle is violated. Pseudonymization may only be used if the purpose cannot be achieved by anonymization.

CIB PoP recognizes and deletes personal data

Of course, it is not practical to remove or replace personal data in documents or data sets manually, especially not in AI projects where large amounts of data are processed. For this reason, CIB has launched the project PoP (Protect our Privacy) in collaboration with Fraunhofer IAIS. Within this framework, an AI-based solution was developed that automatically detects personal data in documents and removes or pseudonymizes it.


For image-based documents, CIB PoP first recognizes the text. The text content is then passed to a language model (NLP model) that has been trained to identify all GDPR-relevant content independently. In the next step, this content can optionally be anonymized or blacked out. In the latter case, every trace is actually removed from the documents. The “Realistic removal” function is also worth highlighting: here, the AI reconstructs the background, so that, for example, the scan of a completed form becomes a blank form again.
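The step that follows detection can be sketched as follows. The example assumes that a detector has already returned character spans with a label; the `Span` format, the function `apply_spans`, its two modes and the hard-coded spans are hypothetical and do not describe the actual CIB PoP interface.

```python
from typing import List, Tuple

# Assumed detector output: (start, end, label), with end exclusive
Span = Tuple[int, int, str]


def apply_spans(text: str, spans: List[Span], mode: str = "blackout") -> str:
    """Hypothetical helper: black out detected spans or replace them by a label."""
    # Work from right to left so that earlier offsets stay valid
    for start, end, label in sorted(spans, reverse=True):
        if mode == "blackout":
            replacement = "#" * (end - start)
        else:
            replacement = f"[{label}]"
        text = text[:start] + replacement + text[end:]
    return text


text = "Applicant: Bernhard Maier, born 01.02.1980"
spans = [(11, 25, "NAME"), (32, 42, "DATE")]  # stand-in for model output

blacked_out = apply_spans(text, spans, mode="blackout")
labelled = apply_spans(text, spans, mode="label")
print(blacked_out)
print(labelled)  # Applicant: [NAME], born [DATE]
```

Processing the spans from the end of the text backwards is a deliberate choice: replacements of a different length would otherwise shift the offsets of spans that have not been handled yet.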

CIB PoP has been available since June 2022 as part of the CIB document viewer CIB doXiview and opens up numerous new possibilities for document-based processes. Among other things, the solution is suitable for using documents as training data for AI tasks such as document classification and text recognition. With CIB PoP, for example, the following AI scenarios can be implemented in a GDPR-compliant manner:

  • Identification of business processes based on document content
  • Extraction of process data from documents (e.g. invoices)
  • Completeness check of incoming forms, application documents and scans

By the way, the research project has shown that tasks of this kind can be handled very well with anonymization. There is usually no need for an absolutely realistic replacement of the texts. In some cases, it is even sufficient to remove the texts.

Conclusion: CIB PoP opens up new opportunities for medium-sized businesses

For medium-sized companies in particular, it has been difficult, if not impossible, to use document content for training AI applications due to the high GDPR hurdles. With CIB PoP, this is now changing: the original personal reference can be removed from a document very easily, and the remaining content can be used for secure, GDPR-compliant AI development.

Want to learn more about the technology? Talk with us!

Florian Deuring

Specialist author for software and digitalization