
Masking Personally Identifiable Information with Salt using Cloud Data Fusion

July 11, 2020

Written by Bruno Araujo at Kasna

 

The first time I heard about Cloud Data Fusion (CDF), I thought of it as a tool that would have helped me in past projects where we had to create Spark jobs on Dataproc and orchestrate them. At the time, we used an external tool to create the cluster, run the jobs and delete the cluster once they finished. With Data Fusion, I could do everything in a single tool and build pipelines through an easy-to-use interface.

More recently, I got hands-on with CDF on a client project. In this post I'll share my experience with it and how we used it to secure Personally Identifiable Information (PII).

Overview

Cloud Data Fusion is the fully managed data integration service provided by Google Cloud that helps you build and manage ETL data pipelines in a code-free way. It can be used in hybrid and multi-cloud environments and has both batch and streaming capabilities (streaming requires the Enterprise edition).

The Cask Data Application Platform (CDAP) is the open-source project behind Data Fusion. It makes data pipelines portable, which means no vendor lock-in.


Cloud Data Fusion Studio – https://cloud.google.com/data-fusion/docs/tutorials/reusable-pipeline

As Data Fusion is relatively expensive to experiment with (1.80 USD/hour for the instance, plus Dataproc prices), I used and recommend Qwiklabs to practise pipeline creation and get a good overview of the interface. Qwiklabs provisions an environment for you, so you only pay for the lab itself, not for the underlying resources.

However, it isn't just about pushing data from one side to another. Data Engineers need to be careful with the type of data going through their pipelines. Personally Identifiable Information (PII) requires additional care around data security and storage location. Always question who should have access to it, what type of access they need and how sensitive the data is, and always follow the principle of least privilege, granting only the minimum access required.

The Problem

It is common nowadays to ingest PII and use it in analysis to help decision making. But how can we secure the PII in our Data Warehouse and protect it against identification?

The Design

Hashing or masking data means one-way transforming the PII into something non-identifiable to users while trying to preserve its value. Different algorithms and methods can be used for this purpose; the general idea is illustrated below:

Suppose we receive the following data through our pipelines:

id | email             | credit_card
1  | john@mydomain.com | 326847623784
2  | mary@mydomain.com | 316287612852

After hashing the PII, we would transform it into something like:

id | email_hashed | credit_card_hashed
1  | 30bb0141a0fde54ff52fa777f1227b5d71253ee89a13be8e9f8944ca66b86dbb | c5bc66470347b65fa7b12e9a970ff6dfa2756c282458539e0a90ebb291e12d4c
2  | 07a1beff41dcccd09339b3e97e4366c84dbfe135390b8a722c67a9a76630b990 | 1e8e3c4335216805b0ea66cbe9f61857b20bb60a01d9ae94afbc96ad9043abf3

 

And that’s it! Sensitive information is hidden while being ingested.
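To make the transformation concrete outside the pipeline, here is a minimal Python sketch of the same idea. The hash_pii helper and the sample row are illustrative only, not part of the Data Fusion pipeline itself:

import hashlib

def hash_pii(value: str) -> str:
    # One-way transformation: SHA-256 hex digest of the raw value
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

row = {"id": 1, "email": "john@mydomain.com", "credit_card": "326847623784"}

masked_row = {
    "id": row["id"],
    "email_hashed": hash_pii(row["email"]),               # replaces the raw email
    "credit_card_hashed": hash_pii(row["credit_card"]),   # replaces the raw card number
}
print(masked_row)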

An additional security step is to append a “salt” to the data before applying the hash function. The salt can be a randomly generated string that we keep secret, making it much harder for anyone trying to break the hashed data.

PII = john@mydomain.au

Salt = e4s112agz4

Salted data (PII + Salt) = john@mydomain.aue4s112agz4

Hash value = 8edf5fbf54c4283639a40b6c7fa8582637dcd1c7a4ab99c83c6107f537c98ea1

 

In that way, we extend the length and complexity of the input before hashing, making the result much harder to crack with dictionary or precomputed-table attacks.
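The salting step itself is just string concatenation before the hash. A minimal sketch, using the illustrative salt value from above (in the real pipeline the salt lives in secure storage, never in code):

import hashlib

def hash_with_salt(value: str, salt: str) -> str:
    # Append the salt to the value before hashing (value + salt -> SHA-256)
    return hashlib.sha256((value + salt).encode("utf-8")).hexdigest()

salt = "e4s112agz4"  # illustrative only; generate randomly and keep it secret
print(hash_with_salt("john@mydomain.au", salt))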

So, our approach is to hash the data, store the real value separately and replace the original value with its hash in the Data Warehouse. Confusing? It is as simple as the diagram below:

If the PII is not used, we shouldn't keep it at all. In some cases, however, we are required to store it. For example, we might need to send an email back to the user, but only the application should have access to the user's email address, not the analysts themselves. That is why in this example we store the PII separately.

The Proposed Solution

In this solution we assume for simplicity that data lands on Google Cloud Storage (GCS) and we’ll use it as our initial data source. It could also have originated from many other sources supported by Data Fusion. Also, we will use BigQuery as our Data Warehouse for data analysis.

Cloud Data Fusion offers a Wrangler component that helps clean, filter and transform data using simple directives. Everything you do on its interface translates to a recipe. You could, for example, add/remove columns, parse data and sum columns. It is a good visualisation tool for Data Cleaning.

Wrangler Example – https://cloud.google.com/data-fusion/docs/tutorials/targeting-campaign-pipeline

In our solution we will use the SHA-256 function to hash the data. The generated recipe will be:

set-column email_hashed email                          # duplicates the email column into email_hashed
hash email_hashed SHA-256 true                         # hashes the copy with SHA-256 (true = encode the digest as a string)
set-column credit_card_hashed credit_card              # duplicates the credit_card column
hash credit_card_hashed SHA-256 true                   # hashes the copy with SHA-256

For additional security, we will append a salt stored in CDAP Secure Storage.

set-column email_hashed email + '${email_salt}'                      # appends the email salt before hashing
hash email_hashed SHA-256 true
set-column credit_card_hashed credit_card + '${credit_card_salt}'    # appends the credit card salt before hashing
hash credit_card_hashed SHA-256 true

CDAP exposes APIs for various operations, such as managing and scheduling pipelines. One of them is Secure Storage: an internal store that encrypts data upon submission and is controlled via RESTful APIs. That is where we will keep our salts, storing one per PII type (email, credit card) by making PUT requests to the Create Secure Key endpoint. Example:

PUT <cdap-endpoint>/api/v3/namespaces/<my-namespace>/securekeys/<my-new-salt>

With the body { "description": "My new salt", "data": "my-new-salt-value" }

So, we will do it for both email_salt and credit_card_salt:

PUT <cdap-endpoint>/api/v3/namespaces/<my-namespace>/securekeys/email_salt

With the body { "description": "PII email salt", "data": "<my-email-salt-value>" }

 

PUT <cdap-endpoint>/api/v3/namespaces/<my-namespace>/securekeys/credit_card_salt

With the body { "description": "PII credit card salt", "data": "<my-credit-card-salt-value>" }

And now we can use them in our Wrangler recipe through the ${email_salt} and ${credit_card_salt} variables.
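For reference, those PUT calls can also be scripted. The sketch below uses Python's requests library against the same securekeys path shown above; the endpoint placeholder, namespace and the way you obtain an access token are assumptions that depend on how your Data Fusion/CDAP instance is exposed and authenticated:

import requests

# Hypothetical placeholders – replace with your instance's API endpoint, namespace
# and a valid access token (for Data Fusion, e.g. from `gcloud auth print-access-token`)
CDAP_ENDPOINT = "<cdap-endpoint>"
NAMESPACE = "<my-namespace>"
ACCESS_TOKEN = "<access-token>"

def create_secure_key(key_name: str, description: str, value: str) -> None:
    # PUT /api/v3/namespaces/<namespace>/securekeys/<key> stores an encrypted secret
    url = f"{CDAP_ENDPOINT}/api/v3/namespaces/{NAMESPACE}/securekeys/{key_name}"
    response = requests.put(
        url,
        headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
        json={"description": description, "data": value},
    )
    response.raise_for_status()

create_secure_key("email_salt", "PII email salt", "<my-email-salt-value>")
create_secure_key("credit_card_salt", "PII credit card salt", "<my-credit-card-salt-value>")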

Note: this is one proposed solution. We could also use Cloud Data Loss Prevention (DLP) to automatically detect PII and mask the data for us, but we opted for Wrangler directives and the internal Secure Storage to save costs.

The Drawbacks

When architecting and designing solutions, you will find that it is impossible to design a perfect architecture without any drawbacks. There is always a trade-off in choosing one component over another. The same applies to the solution proposed in this article, and it is important to think through what can go wrong and what its limitations are. Don't get too attached to what you design.

Some of the drawbacks with this design are:

  • PII still sits on GCS: Even though we hash the PII in the pipeline, the raw data is still stored in GCS. We should avoid storing personal information at all, but if we must, the bucket should be locked down so that only service accounts have access to it.
  • Less value for Analysts: If we need to count the different email domains in our data, that information is no longer available once the whole address has been hashed. We need to understand the business needs to see how we can preserve value; in this case, we could hash just the username part of the email (see the sketch after this list). For example:
    • 8edf5fbf54c4283639a40b6c7fa8582637dcd1c7a4ab99c83c6107f537c98ea1@gmail.com
  • Cost: Data Fusion charges per hour for its instance, which generally needs to keep running all the time. In our example, data is ingested in batch and the instance will sit idle most of the time.
  • Salt management: Using the CDAP internal Secure Storage requires additional management of the salts, such as controlling who has access to them and how their values are generated.
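As mentioned in the “Less value for Analysts” drawback, one way to retain some analytical value is to hash only the username part of an email address and leave the domain readable. A minimal sketch of that idea (the sample address and salt are illustrative):

import hashlib

def hash_email_local_part(email: str, salt: str) -> str:
    # Hash only the part before the '@', keeping the domain visible for analysis
    local_part, domain = email.split("@", 1)
    hashed = hashlib.sha256((local_part + salt).encode("utf-8")).hexdigest()
    return f"{hashed}@{domain}"

print(hash_email_local_part("john@gmail.com", "e4s112agz4"))
# -> <64-character hash>@gmail.com, so analysts can still count domains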

Conclusion

Cloud Data Fusion proved to be a good tool for democratising Data Engineering for non-technical people, who can perform simple data engineering tasks without deep knowledge of the field. CDF is recommended for users with little coding experience and is ideal for creating and orchestrating simple pipelines.

Nothing is perfect, and the easy user interface comes with trade-offs. Highly specific or customised pipelines can be troublesome to create, and Cloud Data Fusion's price can surprise small companies trying to invest in data engineering. Additionally, CDF only recently went GA (21 November 2019), making it relatively new in the market, but with promising potential and a great feature roadmap ahead.

As I venture into other cloud areas and products, if I come back someday to check how Cloud Data Fusion is going, I would like to see it provisioning other types of jobs as well, such as Dataflow and BigQuery jobs.