
Masking Personally Identifiable Information with Salt using Cloud Data Fusion




The first time I heard about Cloud Data Fusion (CDF), I saw it as a tool that would have helped me on past projects where we had to create Spark jobs on Dataproc and orchestrate them. At the time, we used an external tool to create the cluster, run the jobs and delete the cluster once they finished. With Data Fusion, I could do all of that in a single tool and build pipelines through an easy-to-use interface.

More recently, I got hands-on with CDF while working on a client project. In this article I’ll share my experience with it and how we used it to secure Personally Identifiable Information (PII).

Cloud Data Fusion is the fully-managed integration service provided by Google Cloud that helps build and manage ETL data pipelines in a code-free manner. It can be used in hybrid and multi-cloud environments and has both batch and streaming capabilities (Enterprise Edition).

Cask Data Application Platform (CDAP) is the open source project behind Data Fusion. It makes data pipelines portable, which means no vendor lock-in.

Studio – https://cloud.google.com/data-fusion/docs/tutorials/reusable-pipeline

As Data Fusion is relatively expensive to experiment with (1.80 USD/hour plus Dataproc prices), I used and recommend Qwiklabs to practise pipeline creation and get a good overview of the interface. Qwiklabs provisions an environment for you, so you pay only for the lab itself and not for the underlying resources.

However, building pipelines isn’t just about pushing data from one place to another. Data Engineers need to be careful with the type of data going through their pipelines. Personally Identifiable Information (PII) requires additional care regarding data security and storage location. Always ask who should have access to it, what type of access they need and how sensitive the data is. Always follow the principle of least privilege, granting only the minimum access required.

The problem

It is common nowadays to ingest PII and use it in analysis to help decision making. But how can we secure the PII in our Data Warehouse and protect it against identification?

The design

Hashing or masking data means applying a one-way transformation that turns the PII into something users cannot identify, while trying to preserve its analytical value. Different algorithms and methods can be used for this purpose. The general idea is illustrated below:

Suppose we receive the following data through our pipelines:

After hashing the PII, we would transform it into something like:

And that’s it! Sensitive information is hidden while being ingested.
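As a minimal sketch of the idea in Python (not the Data Fusion pipeline itself), the snippet below hashes the PII field of a record with SHA-256. The record, its field names and the `mask_record` helper are hypothetical and only serve as illustration.

```python
import hashlib


def sha256_hex(value: str) -> str:
    """Return the SHA-256 digest of a string as 64 hexadecimal characters."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


# Hypothetical record: field names and values are made up for illustration.
record = {"user_id": 1, "email": "jane.doe@example.com", "country": "AU"}

# Assumption: only the email column is treated as PII in this sketch.
PII_FIELDS = ("email",)


def mask_record(rec: dict) -> dict:
    """Return a copy of the record with every PII field replaced by its hash."""
    masked_rec = dict(rec)
    for field in PII_FIELDS:
        masked_rec[field] = sha256_hex(str(masked_rec[field]))
    return masked_rec


masked = mask_record(record)
print(masked)
```

The non-PII columns pass through unchanged, while the email becomes a fixed-length digest that can still be used for joins or distinct counts.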

An additional security step is to append a “salt” to the end of the data before applying the hash function. We can use a randomly generated string, for example, and keep it secret from anyone trying to break the hashed data.

That way, we increase the length and complexity of the input to the hash function, making the hashed values harder to crack.
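A small Python sketch of salting, under the assumption that the salt is a random hexadecimal string appended to the value before hashing with SHA-256. The helper names `generate_salt` and `salted_sha256` and the sample email are hypothetical.

```python
import hashlib
import secrets


def generate_salt(num_bytes: int = 16) -> str:
    """Generate a random salt as a hex string (two characters per byte)."""
    return secrets.token_hex(num_bytes)


def salted_sha256(value: str, salt: str) -> str:
    """Append the salt to the end of the value, then hash with SHA-256."""
    return hashlib.sha256((value + salt).encode("utf-8")).hexdigest()


# Hypothetical salt and value, for illustration only.
email_salt = generate_salt()
digest = salted_sha256("jane.doe@example.com", email_salt)

# The same value hashed without the salt gives a different digest, so a
# precomputed table of unsalted hashes does not match the stored values.
unsalted = hashlib.sha256(b"jane.doe@example.com").hexdigest()
print(digest != unsalted)
```

Because the same salt is reused for every value of a given PII type, equal inputs still produce equal hashes, which keeps the column usable for joins and counts.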

So, our approach is to hash the data, store the real value separately, and use the hashed value in place of the original data in the Data Warehouse. Confusing? It is as simple as the diagram below:

If the PII is not used we shouldn’t keep it. However, in some cases we are required to store it. For example, we might need to send an email back to the user but only the application should have access to the user’s email and not the analysts themselves. That is why in this example we are storing the PII separately.
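The split described above can be sketched in Python as follows. This is an illustration of the design under stated assumptions, not the pipeline we built: the `split_record` helper, the record and the salt value are hypothetical, and the two outputs stand in for the Data Warehouse table and the separately secured PII store.

```python
import hashlib


def split_record(rec: dict, pii_fields: tuple, salts: dict) -> tuple:
    """Split a record into a warehouse row (hashed PII) and PII lookup rows.

    The lookup rows map each hash back to its original value and are meant
    to live in a separate, tightly restricted store.
    """
    warehouse_row = dict(rec)
    pii_rows = []
    for field in pii_fields:
        original = str(rec[field])
        salted = (original + salts[field]).encode("utf-8")
        hashed = hashlib.sha256(salted).hexdigest()
        warehouse_row[field] = hashed
        pii_rows.append({"field": field, "hash": hashed, "value": original})
    return warehouse_row, pii_rows


# Hypothetical input and salt, for illustration only.
record = {"user_id": 1, "email": "jane.doe@example.com", "plan": "premium"}
salts = {"email": "not-a-real-salt"}

warehouse_row, pii_rows = split_record(record, ("email",), salts)
print(warehouse_row)  # what analysts would see
print(pii_rows)       # what only the application should be able to read
```

Analysts work only with the warehouse row, while an application that needs to email the user can look up the original value by its hash.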

The proposed solution

In this solution we assume for simplicity that data lands on Google Cloud Storage (GCS) and we’ll use it as our initial data source. It could also have originated from many other sources supported by Data Fusion. Also, we will use BigQuery as our Data Warehouse for data analysis.

Wrangler Example – https://cloud.google.com/data-fusion/docs/tutorials/targeting-campaign-pipeline

In our solution we will use the SHA-256 function to hash the data. The generated recipe will be:

For additional security, we will append a salt stored in CDAP Secure Storage.

CDAP exposes APIs for a range of operations, such as managing and scheduling pipelines. One of them is Secure Storage, an internal store that encrypts data upon submission and is controlled via RESTful APIs. That is where we will keep our salts, storing one for every PII type (email, credit card) by making a PUT request to the Create Secure endpoint. Example:

So, we will do it for both email_salt and credit_card_salt:
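A sketch of what these requests could look like in Python, using only the standard library. It assumes the CDAP Secure Storage endpoint `PUT /v3/namespaces/<namespace>/securekeys/<key-name>` with a JSON body containing `description`, `data` and `properties`, and a bearer token for authentication. The endpoint URL, token and the `build_secure_key_request` helper are placeholders, and the requests are only built here, not sent.

```python
import json
import secrets
import urllib.request


def build_secure_key_request(api_endpoint: str, namespace: str, key_name: str,
                             salt: str, access_token: str) -> urllib.request.Request:
    """Build (but do not send) a PUT request that stores a salt as a secure key."""
    url = f"{api_endpoint.rstrip('/')}/v3/namespaces/{namespace}/securekeys/{key_name}"
    body = json.dumps({
        "description": f"Salt for {key_name}",
        "data": salt,
        "properties": {},
    }).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
    )


# Placeholders: replace with the API endpoint of your instance and a real token.
API_ENDPOINT = "https://example.invalid/api"
ACCESS_TOKEN = "replace-with-a-real-token"

requests_to_send = [
    build_secure_key_request(API_ENDPOINT, "default", name,
                             secrets.token_hex(16), ACCESS_TOKEN)
    for name in ("email_salt", "credit_card_salt")
]

for req in requests_to_send:
    # To actually store the salt: urllib.request.urlopen(req)
    print(req.get_method(), req.full_url)
```

Generating the salt inside the script means it never has to be typed or saved anywhere outside Secure Storage.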

And now we can use them in our Wrangler recipe through the ${email_salt} or ${credit_card_salt} variables.

Note: This is a proposed solution. We could also use Google Data Loss Prevention (DLP) to automatically detect PII and mask the data for us, but we opted for Wrangler directives and the internal Secure Storage to save costs.

The drawbacks

When architecting and designing solutions, you will find that it is impossible to design a perfect architecture without any drawbacks. There is always a tradeoff in choosing one component over another. The same applies to the design proposed in this article, and it is important to examine what can go wrong and what its limitations are. Don’t be too attached to what you design.


Some of the drawbacks with this design are:


PII still sits on GCS

Even though we hash the PII during ingestion, the original data is still stored in GCS. We should avoid storing personal information, but if we do, the bucket should ideally be locked down so that only service accounts have access to it.

Less value for Analysts

If we need to count the different email domains in our data, that information is no longer available once the emails are hashed. We need to understand our business needs to see how we can preserve the data's value. In this case, we could hash just the username part of the email. Example: 8edf5fbf54c4283639a40b6c7fa8582637dcd1c7a4ab99c83c6107f537c98ea1@gmail.com
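A Python sketch of hashing only the username part of an email so that the domain stays available for analysis. The `hash_email_local_part` helper, the sample addresses and the salt are hypothetical, and the output is not meant to reproduce the hash shown above.

```python
import hashlib
from collections import Counter


def hash_email_local_part(email: str, salt: str = "") -> str:
    """Hash the part before the last '@' and keep the domain readable."""
    local, sep, domain = email.rpartition("@")
    if not sep:
        # Not a well-formed email: hash the whole value to stay on the safe side.
        return hashlib.sha256((email + salt).encode("utf-8")).hexdigest()
    hashed = hashlib.sha256((local + salt).encode("utf-8")).hexdigest()
    return f"{hashed}@{domain}"


# Hypothetical addresses and salt, for illustration only.
emails = ["jane.doe@gmail.com", "john@gmail.com", "ana@example.com"]
masked_emails = [hash_email_local_part(e, salt="not-a-real-salt") for e in emails]

# Domain counts still work on the masked values.
domain_counts = Counter(m.rpartition("@")[2] for m in masked_emails)
print(domain_counts)
```

Keep in mind that the domain itself can be identifying for small or personal domains, so this tradeoff should be weighed against the business need.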


Cost

Data Fusion charges per hour for its instance, which generally has to be kept running all the time. In our example, data is ingested in batches, so the Data Fusion instance will sit idle most of the time.

Salt manipulation

Using CDAP's internal Secure Storage means additional management of the salts: controlling who has access to them and deciding how their values are generated.
