Anonymize like a Rock Star! (or: What’s New on Dat...

former_member301504 · ‎04-15-2019

The SAP HANA data anonymization features released with SPS 03 in 2018 already made you look cool: they let you build completely new applications that use sensitive data without the risk of exposing individual secrets. For instance, you can anonymize sensitive travel and expense data and combine it with experience data to build an exciting new application for travel agents, as explained in this blog.

With this spring's SAP HANA innovations, the development team has extended the functionality in ways that will finally make you an anonymization rock star. First, SAP HANA 2.0 SPS 04 provides new ways to keep the utility of the anonymized data set. Second, configuring complex scenarios has become easier and more flexible. And third, SAP HANA 2.0 SPS 04 lets you measure KPIs of the anonymization process, which makes the process transparent for Data Protection Officers and reflects data quality.

That is everything you need to deliver anonymized data sets faster and in better quality – just like a rock star! And don’t worry: you don’t have to grow your hair or learn to play the guitar. According to the famous Urban Dictionary, a rock star is “A person who always delivers the goods. If they say they are going to do something they do.”

Case

Table 1: Original Data

Imagine being in charge of creating a salary analytics application at ACME. The application should make the salary structure of ACME transparent to third parties, i.e., hiring managers. The goal is to reveal information such as where ACME pays less than its competitors, which leads to a gap in hiring new talent. For this task, we get access to a salary table like the one in Table 1. Salary data is unquestionably sensitive, and no individual salary must be revealed. (Disclaimer: for the sake of this example we only use generated data.)

Of course, for our analytical application we only need columns with attributes like start year, gender, zip and, obviously, the salary (see Table 2). However, even those could be unique in combination and potentially identify someone. For instance, if Gene is the only employee who started in 1996 in APJ, a colleague who knows this can pick out his record and learn his salary. To overcome this risk and make our salary analysis possible, we decide to use k-Anonymity in the HANA Data Anonymization feature: this anonymization method ensures that at least ‘k’ persons are indistinguishable with respect to gender, region, zip, career level, education and start year. Gene is hidden in a crowd of at least k colleagues and we cannot single him out anymore. If you are not familiar with concepts like re-identification and k-Anonymity, I recommend reading this article.

Table 2: Selected columns for the analytical application
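The idea behind k-Anonymity can be illustrated outside of SAP HANA with a minimal Python sketch. It is not how SAP HANA implements the method; the sample rows, the column order and the helper names `group_sizes` and `is_k_anonymous` are assumptions made up for this illustration. The sketch counts how many rows share each combination of quasi-identifier values and checks whether every combination occurs at least k times.

```python
from collections import Counter

# Hypothetical sample rows: (gender, region, start_year, salary).
# All values are invented for illustration only.
ROWS = [
    ("m", "APJ", 1996, 71000),
    ("f", "EMEA", 2004, 65000),
    ("m", "EMEA", 2004, 58000),
    ("f", "EMEA", 2004, 62000),
    ("m", "EMEA", 2004, 60000),
]


def group_sizes(rows, quasi_identifier_indexes):
    """Count how many rows share each combination of quasi-identifier values."""
    return Counter(
        tuple(row[i] for i in quasi_identifier_indexes) for row in rows
    )


def is_k_anonymous(rows, quasi_identifier_indexes, k):
    """True if every quasi-identifier combination occurs at least k times."""
    sizes = group_sizes(rows, quasi_identifier_indexes)
    return all(size >= k for size in sizes.values())


if __name__ == "__main__":
    # Region (index 1) and start year (index 2) act as quasi-identifiers.
    print(group_sizes(ROWS, [1, 2]))
    print(is_k_anonymous(ROWS, [1, 2], k=2))
```

In this sample, the combination ("APJ", 1996) occurs only once, so the data is not even 2-anonymous: whoever knows these two facts about a person finds exactly one record and with it the salary.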

Rock stars preserve data quality (and privacy)

Applying anonymization methods in SAP HANA 2.0 SPS 04 has become easier: it is now possible to enhance SQL views with anonymization methods. For our ACME salary analytics application, we create an anonymized view ‘SALARYANON’ of the ‘SALARY’ source table (Figure 1). SAP HANA may generalize some values according to the defined hierarchies so that at least k=10 persons are indistinguishable, as k-Anonymity guarantees. This is the CREATE statement for the anonymized view:

CREATE VIEW SALARYANON (ID, GENDER, REGION, ZIPCODE, CAREER_LEVEL, EDUCATION, START_YEAR, SALARY) AS
SELECT ID, GENDER, REGION, ZIPCODE, CAREER_LEVEL, EDUCATION, START_YEAR, SALARY FROM SALARY
WITH ANONYMIZATION (ALGORITHM 'K-ANONYMITY' PARAMETERS '{"k": 10}'
  COLUMN ID PARAMETERS '{"is_sequence":true}'
  COLUMN GENDER PARAMETERS '{"is_quasi_identifier":true,"hierarchy":{"embedded":[["f"],["m"]]}}'
  COLUMN REGION PARAMETERS '{"is_quasi_identifier":true,"hierarchy":{"embedded":[["APJ"],["EMEA"]…]}}'
  COLUMN CAREER_LEVEL PARAMETERS '{"is_quasi_identifier":true,"hierarchy":{"embedded":[["C1",…C5"]]}}'
  COLUMN EDUCATION PARAMETERS '{"is_quasi_identifier":true,"hierarchy":{"embedded":[["Assoc-acdm","Professional Ed",…],…]}}'
  COLUMN ZIPCODE PARAMETERS '{"is_quasi_identifier":true,"hierarchy":{"embedded":[["5004","50xx"],…]}}'
  COLUMN START_YEAR PARAMETERS '{"is_quasi_identifier":true, "hierarchy":{"embedded":[["1987","1986-1990","1986-1995"],…]}}');

However, the result looks like the example in Table 3: the start year was replaced by an asterisk ‘*’, which renders the data set unusable for the analytics application. Without the start year we cannot find out how well ACME pays young talent. One of the new features in SAP HANA 2.0 SPS 04 is the ‘max’ parameter for a hierarchy, which defines the highest level of generalization that may be applied, so that the utility of the anonymized data set is not ruined. As anonymization rock stars, we set it to ‘1’ in this scenario, which limits the generalization of the start year to five-year buckets. The result is compelling (see Table 4).

COLUMN START_YEAR PARAMETERS '{"is_quasi_identifier":true, "max":1, "hierarchy":{"embedded":[["1987","1986-1990",…],…]}}');

The anonymization preserves five-year buckets of the start year but shows fewer digits of the ZIP code in order to remain k-anonymous with k=10.

With the help of the ‘max’ parameter, we keep the utility in a simple way. At the same time, the privacy guarantee of k-Anonymity stays untouched. In addition to ‘max’, SAP HANA 2.0 SPS 04 also introduces further parameters, among them ‘min’, which defines the minimum generalization level. See the documentation for the full reference.
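To make the effect of a ‘max’-style cap more tangible, here is a small Python sketch of the idea. It only models the concept and is not SAP HANA's algorithm: the helper names `generalize` and `allowed_levels` are hypothetical, and treating the asterisk as an implicit level above the last hierarchy entry is an assumption based on the result described for Table 3.

```python
# One row of an embedded hierarchy, taken from the START_YEAR column above:
# index 0 is the original value, each further entry is one level higher.
START_YEAR_ROW = ["1987", "1986-1990", "1986-1995"]


def generalize(hierarchy_row, level):
    """Return the value at the given level; beyond the last entry use '*'."""
    if level < len(hierarchy_row):
        return hierarchy_row[level]
    return "*"


def allowed_levels(hierarchy_row, max_level=None):
    """Levels an anonymizer may pick from, optionally capped by max_level."""
    top = len(hierarchy_row)  # assumed implicit top level that yields '*'
    if max_level is not None:
        top = min(top, max_level)
    return list(range(top + 1))


if __name__ == "__main__":
    # Without a cap, the start year may end up fully suppressed.
    print([generalize(START_YEAR_ROW, lvl) for lvl in allowed_levels(START_YEAR_ROW)])
    # With a cap of 1, five-year buckets are the coarsest possible outcome.
    print([generalize(START_YEAR_ROW, lvl) for lvl in allowed_levels(START_YEAR_ROW, max_level=1)])
```

With the cap in place, the anonymization has to reach the required group size by generalizing other columns instead, which matches the shortened ZIP codes described above.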

Table 3: Anonymization, 1st approach - Asterisk '*' in start year

Table 4: Anonymization, 2nd approach with 'max' parameter - Start year as required

Rock stars tackle hierarchy configurations easily

Creating a hierarchy configuration can be complicated. Think of the example code in Figure 1: the hierarchy definition of ZIPCODE explicitly lists the shortened version for each ZIP code, such as ‘62xx’ for ‘6204’. This is tedious and error-prone. With SAP HANA 2.0 SPS 04 we can define SQLScript functions that solve this rather mechanical task elegantly. First, we define a SQLScript function like this:

create or replace function HierarchyFunctionZip(value varchar(255), level integer)
returns outValue varchar(255)
as
  charsToShow integer default 0;
begin
  charsToShow := length(value) - level;
  outValue := rpad(substring(value, 1, charsToShow), charsToShow+level,'x');
end;

Instead of describing the hierarchy in detail, the definition of the zip code column now looks like this:

COLUMN ZIPCODE PARAMETERS '{"is_quasi_identifier":true,"hierarchy":{"schema":"SALARYANALYTICS", "function":"HierarchyFunctionZip", "levels":3}}'

This is way more classy than the lengthy description of every possible zip code. Additionally, it is possible to refer to SQL hierarchies defined on an SAP HANA system as well. See the documentation for examples.
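To see what HierarchyFunctionZip returns for each level without an SAP HANA system at hand, the same logic can be mirrored in a few lines of Python. This is only a sketch for illustration: the name `hierarchy_function_zip` is made up, and clamping the number of visible characters at zero for levels longer than the value is an assumption that the SQLScript version above does not spell out.

```python
def hierarchy_function_zip(value: str, level: int) -> str:
    """Keep the leading characters and mask the last `level` ones with 'x'."""
    # Number of leading characters that stay visible (assumed to be >= 0).
    chars_to_show = max(len(value) - level, 0)
    # Pad the visible prefix with 'x' back to the original length.
    return value[:chars_to_show].ljust(len(value), "x")


if __name__ == "__main__":
    # Level 0 is the original value; "levels": 3 would allow up to level 3.
    for level in range(4):
        print(level, hierarchy_function_zip("6204", level))
```

Each additional level hides one more trailing digit, so a single function replaces the hand-written list of shortened values for every ZIP code.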

Rock stars measure privacy (and much more)

So far, we have managed to fix the obvious data quality issue, define the hierarchy in an easy way and are ready to consume the salary data in the analytical application for ACME. But of course, we want to and must inform the Data Protection and Privacy Officer (DPPO) of ACME to be on the safe side in protecting employee data.

The DPPO can log into the SAP HANA cockpit and see all the configuration parameters, such as the configured ‘k’. This is important but does not necessarily reflect the current condition of the data. SAP HANA 2.0 SPS 04 introduces additional indicators, called KPIs, including ones that reflect the anonymized data in real time. In this case, the DPPO is fine with a group size of ‘k=10’, but thanks to the new KPIs, the DPPO finds out that the effective ‘k’ is 31. Remember that the configured group size is only the minimum number of indistinguishable persons. With a much higher effective ‘k’, and therefore a much lower effective risk than required (see Figure 3), the DPPO can surely sleep better. Besides privacy KPIs there are also KPIs reflecting utility and much more. For a full list, refer to the documentation.
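Why a higher effective ‘k’ means a lower risk can be shown with simple arithmetic. The Python sketch below assumes the common convention that the worst-case re-identification risk in a group of k indistinguishable persons is 1/k; how SAP HANA defines its risk KPI exactly is not stated here, so treat the function `reidentification_risk` as a hypothetical illustration.

```python
def reidentification_risk(k: int) -> float:
    """Worst-case chance of picking the right person among k look-alikes.

    Assumes the usual 1/k convention; not taken from SAP HANA itself.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    return 1 / k


if __name__ == "__main__":
    configured_k = 10  # the group size the DPPO asked for
    effective_k = 31   # the smallest group actually present in the data
    print(f"required:  {reidentification_risk(configured_k):.1%}")
    print(f"effective: {reidentification_risk(effective_k):.1%}")
```

Under this assumption, the configured ‘k’ of 10 corresponds to a risk of at most 10%, while the effective ‘k’ of 31 brings it down to roughly 3.2%.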

Summary

Now our task is completed: ACME can launch the salary analytics application displayed in Figure 3 and hiring managers can find out that young talents receive a lower salary in APJ than in other regions. The possibilities of data anonymization are quite incredible. The application outlined provides real-time insights into highly sensitive salary data while simultaneously ensuring the privacy of the individuals. All the new features described enable a productivity boost: you can deliver anonymized data sets with increased data quality faster than ever before. Life as an anonymization rock star has never been easier.

Finally, there are even more new data anonymization features in SAP HANA 2.0 SPS 04. These include l-diversity, an extension of k-Anonymity, and new ways of dealing with changes in the original data. You can find more information here.

[caption id="attachment_1014604" align="aligncenter" width="300"]