The General Data Protection Regulation (GDPR) is a regulation intended to strengthen data protection for all individuals within the European Union. It becomes enforceable on May 25, 2018.

One of the key requirements of the GDPR is the right to erasure, which entitles individuals to request the deletion of personal data relating to them.

As a result, enterprises face a hard deadline of May 25, 2018 by which they must implement the right to erasure across all of their enterprise data stores. However, many enterprises lack the technology to identify which data stores contain an individual’s personal data, making it impossible to erase that data when an erasure request arrives.

There are two technical problems that need to be solved to support the right to erasure:

  1. Identify data stores containing sensitive data. We will call them sensitive data locations.
  2. Once sensitive data locations are identified, build sensitive data erasure functionality. This is particularly challenging for Hadoop due to its largely immutable files. Therefore, we will provide a blueprint for supporting data erasure in Hadoop.

Let’s tackle the two problems one by one.

Identifying data stores containing sensitive data

A common misconception is that IT teams can manually compile a list of sensitive data locations. However, manual sensitive data discovery is impractical for the following reasons:

  • In traditional relational databases such as Oracle, sensitive data could be lurking in opaque varchar/clob/blob columns.
  • In NoSQL databases such as MongoDB, the lack of a predefined schema makes it challenging to manually assess what sensitive data might be stored in the database.
  • Application log files also sometimes contain sensitive data. This is usually the result of developer error, but, unfortunately, it does happen.
  • Traditional data warehouses, cloud data stores, and Hadoop often lead to the same (sensitive) data getting replicated to many data stores (Vertica, Amazon S3, Hadoop, etc.).
  • Sensitive data, like other data, is often not deleted, but archived. This creates one more location where sensitive data is stored.
  • Scanned documents may also contain sensitive data.

The sheer number of data stores potentially containing sensitive data makes manual discovery impractical. Clearly, automation is needed.

Add to this the fact that, in today’s dynamic IT environments, sensitive data locations keep changing over time, and automated sensitive data discovery becomes almost a must-have.

The Kogni Discovery Engine is the ideal solution to this problem. It scans an enterprise’s data stores and automatically builds a sensitive data catalog. The sensitive data catalog can be explored in Kogni’s intuitive, interactive dashboard, and it can also be accessed through an API to build higher-level functionality such as data-erasure logic.

Kogni’s Interactive Sensitive Data Dashboard

Highlights of the Kogni Discovery Engine:

  • Discovers sensitive data stored in text and images
  • Inspects Hadoop, S3, NoSQL, and RDBMS
  • Purpose-built classifiers for sensitive data like credit card numbers, SSNs, emails, phone numbers, and more
  • User-defined classifiers to identify sensitive data types unique to your enterprise
  • Dashboard view of sensitive data across enterprise data sources
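To illustrate how pattern-based classifiers like these can work, here is a minimal sketch of sensitive data detection using regular expressions. This is not Kogni’s actual implementation; real discovery engines use far more robust patterns plus validation steps (for example, Luhn checks for card numbers) and handle images and binary formats as well:

```python
import re

# Simplified, illustrative classifiers for a few common sensitive data types.
CLASSIFIERS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){15}\d\b"),
}

def classify(text):
    """Return a mapping of sensitive-data type -> matches found in the text."""
    hits = {}
    for label, pattern in CLASSIFIERS.items():
        matches = pattern.findall(text)
        if matches:
            hits[label] = matches
    return hits
```

A scanner built on classifiers like these would run `classify` over sampled records from each data store and record which stores produced hits, yielding a catalog of sensitive data locations.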

Erasing sensitive data in Hadoop’s immutable files

Because Hadoop files are largely immutable, erasure requests are best supported using the Command Pattern. When an individual requests erasure of their personal data, create and store a command object encapsulating all the information needed to execute the request later. Periodically, a batch job picks up all pending commands and executes them by following three high-level steps:

  1. Identify the files from which sensitive data needs to be erased. Note: these files are typically immutable.
  2. Generate new files by copying data from the original files, while omitting (erasing) the requested sensitive data.
  3. Replace old files with new files.
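The Command Pattern approach and the three steps above can be sketched as follows. This is an illustrative simplification, not Kogni’s actual API: the `ErasureCommand` fields, the substring-based record check, and the use of the local filesystem are all assumptions; a production implementation would match records precisely and operate against HDFS:

```python
import os
from dataclasses import dataclass

@dataclass
class ErasureCommand:
    """Encapsulates everything needed to execute an erasure request later."""
    subject_id: str   # identifier of the individual requesting erasure
    files: list       # sensitive data locations (e.g., from a data catalog)

def execute(command):
    """Run the three high-level erasure steps for one pending command."""
    for path in command.files:                        # step 1: identified files
        tmp_path = path + ".erased"
        with open(path) as src, open(tmp_path, "w") as dst:
            for record in src:                        # step 2: copy data,
                if command.subject_id not in record:  # omitting sensitive records
                    dst.write(record)
        os.replace(tmp_path, path)                    # step 3: swap in new file
```

A batch job would periodically load all pending `ErasureCommand` objects from durable storage and call `execute` on each, marking commands complete only after the replacement succeeds.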

Of course, the implementation of these high-level steps depends heavily on enterprise-specific business logic. Kogni offers pre-built, flexible workflows that can be rapidly tailored to capture enterprise-specific data-erasure business logic.

Next Steps

Enterprises can accelerate their GDPR compliance journey by leveraging automated sensitive data discovery tools. Kogni, with its automated sensitive data discovery engine and customizable data-erasure workflows, is an invaluable aid in implementing the right to erasure.

Learn how Kogni can help your enterprise with GDPR compliance: Request Demo