
[Deep Dive] Anonymization and pseudonymization

The use of artificial intelligence, or more specifically, machine learning algorithms, is becoming quite common. The key to more accurate predictive modeling is reliable and well-defined data.

Extracting key data and generating predictions in real time enable better decision-making and process automation, not only in advanced high-tech companies but anywhere the key data needed for modeling and classification is available.

In every company, whether it runs advanced relational databases or works mainly with letters and other unstructured documents, there are processes that can be automated or even predictively modeled to make employees' work easier and to support them when making important decisions.

The importance of data protection

Companies that are in the middle of digitalization, or only just moving toward it, usually do not have in-house development teams that can competently automate processes or build, train and apply predictive models, so they hire experienced external contractors.

However, complications tend to arise when the client realizes that they need to share their key data with the external contractor, because without key data about the business process the client wants to model, the external contractor cannot do their job.

There are several ways to resolve this, but first a non-disclosure agreement is a must. This obliges the contractor to carefully protect sensitive data which they are not allowed to disclose. However, some concern may remain that the data might end up with a competitor one way or another. What can we do in this situation?

Well-protected data

There are of course additional measures, and one of the most compelling is to anonymize or pseudonymize the key modeling data. In this case, the modeler develops the model directly on data that does not contain, for example, the true purchase price of a product, the actual selling price of a product or service, or the real customer information. Instead, all this key data is encoded in such a way that it is very difficult to decode.

The essential difference between anonymization and pseudonymization is that anonymized data cannot be restored to its original form, whereas with pseudonymization a key exists that allows both encoding and decoding. The anonymizer can advise you which data should be pseudonymized and which should be anonymized. As a result, the contractor is aware of the type of data, but not of its true content.
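The distinction can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the `Pseudonymizer` class, the `anonymize` function and the token format are hypothetical names chosen for this example, and the lookup table plays the role of the key that stays with the client.

```python
class Pseudonymizer:
    """Reversible replacement: the lookup table acts as the key and stays with the client."""

    def __init__(self, prefix="ID"):
        self._prefix = prefix
        self._forward = {}   # original value -> token
        self._reverse = {}   # token -> original value

    def encode(self, value):
        # The same original value always receives the same token.
        if value not in self._forward:
            token = f"{self._prefix}_{len(self._forward) + 1:04d}"
            self._forward[value] = token
            self._reverse[token] = value
        return self._forward[value]

    def decode(self, token):
        # Only the holder of the lookup table can recover the original value.
        return self._reverse[token]


def anonymize(value, label="CUSTOMER"):
    """Irreversible replacement: nothing is stored, so the original cannot be recovered."""
    return f"[{label}]"


customers = Pseudonymizer(prefix="CUST")
token = customers.encode("Janez Novak")
print(token)                      # token handed to the contractor
print(customers.decode(token))    # original value, recoverable only with the table
print(anonymize("Janez Novak"))   # type of data is visible, content is not
```

In this sketch the contractor would receive only the tokens or placeholders. Sequential tokens reveal the order in which values were first seen, so random tokens may be preferable where that matters.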

Our solution

For structured (e.g. relational, tabular) data, the process of anonymization or pseudonymization is quite simple. It is just a small extra step when exporting the data using a dedicated programming interface. Some databases already offer a feature for anonymizing data before export.
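As a rough sketch of such an export step, assuming the rows are already available as Python dictionaries, the example below replaces the customer name with a keyed hash, scales the price by a secret factor and leaves the contact name out entirely. The column names, `SECRET_KEY`, `PRICE_FACTOR` and all function names are hypothetical. A keyed hash is one possible choice here: the client can re-link a token by recomputing it for a known value, but it is not a decryption in the strict sense.

```python
import csv
import hashlib
import hmac
import io

SECRET_KEY = b"client-held-secret"  # hypothetical; never shared with the contractor
PRICE_FACTOR = 2.5                  # hypothetical secret scaling factor, kept by the client


def pseudonymize_id(value):
    """Keyed hash: equal inputs give equal tokens, so joins and grouping still work."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


def prepare_row(row):
    """Build the export row; columns not listed here (e.g. contact_name) are dropped."""
    return {
        "customer": pseudonymize_id(row["customer"]),
        "price": float(row["price"]) * PRICE_FACTOR,
    }


def export_csv(rows):
    """Write the prepared rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["customer", "price"])
    writer.writeheader()
    for row in rows:
        writer.writerow(prepare_row(row))
    return buffer.getvalue()


rows = [
    {"customer": "Acme d.o.o.", "contact_name": "Janez Novak", "price": "10.00"},
    {"customer": "Acme d.o.o.", "contact_name": "Janez Novak", "price": "4.00"},
]
print(export_csv(rows))
```

Scaling by a single factor keeps price ratios intact, which is often enough for modeling, but anyone who learns one true price can recover the factor, so this should be treated as a light measure rather than strong protection.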

The challenge is to anonymize or pseudonymize unstructured data such as texts, contracts, and even administrative and judicial decisions or judgements. In such cases, we want to hide in particular any personal data, company names and specific amounts. For this reason we need more advanced anonymization or pseudonymization systems based on natural language processing and machine learning.
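To give a feel for the task, here is a deliberately simplified Python sketch that masks e-mail addresses and euro amounts with regular expressions. It is a stand-in, not the NLP-based approach described above: the patterns, labels and the `mask_text` function are hypothetical, and names of people or companies cannot be found reliably this way, which is exactly why language-specific models are needed.

```python
import re

# Hypothetical patterns for illustration; they cover only two easy categories.
PATTERNS = [
    ("EMAIL", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    ("AMOUNT", re.compile(r"\d+(?:[.,]\d+)*\s?(?:EUR|€)")),
]


def mask_text(text):
    """Replace each match with a bracketed label; return the masked text and match counts."""
    counts = {}
    for label, pattern in PATTERNS:
        text, n = pattern.subn(f"[{label}]", text)
        counts[label] = n
    return text, counts


sample = "The buyer (info@example.com) shall pay 12.500,00 EUR within 30 days."
masked, counts = mask_text(sample)
print(masked)
print(counts)
```

The returned counts are a small aid for manual review: a document in which nothing was masked is not necessarily clean and still deserves a look before it is shared.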

Such systems are language-dependent and usually do not allow 100% anonymization, so it is always necessary to verify the files before transmitting such data; however, such a system still greatly facilitates the anonymization or pseudonymization process.

Specifically for the Slovenian language, Medius developed such a tool called D.A.T.E. We will present it in more detail in a future article.
