January 5, 2019

Monitoring sensitive data using Anomaly Detection

Over the years, the amount of data stored by organizations has grown exponentially, and with it the concern over data privacy. With the constant threat of data breaches, organizations face the challenge of protecting users' sensitive information such as phone numbers and email addresses. They also need to stay in compliance with international regulations, and as these regulations grow in number and complexity, keeping up with them becomes a daunting task.

Kogni is a unique tool that not only redacts sensitive information but also discovers all the places where it is stored. It uses machine learning to strengthen data security and ensures that organizations stay in compliance with industry guidelines and international regulations such as NIST, GDPR, and PCI.

Given the sheer volume of data being stored, it is very difficult for organizations to monitor all of it. Kogni discovers all the places where sensitive information is stored. A natural extension is to monitor those places as well, thereby adding an extra layer of security to all the sensitive data.

In this blog, we present a method to monitor sensitive data. We compare different ways of achieving this and the advantages of each.

Approach: One plausible way of detecting security breaches is to monitor all user actions that access sensitive data. This can be done by analyzing the various audit logs generated in the cluster; by detecting anomalies in these logs, we can detect security breaches on private data.
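
To make the approach concrete, here is a minimal sketch (illustrative only, not Kogni's actual implementation) that flags users whose hourly access counts to sensitive resources deviate sharply from their own historical baseline. It assumes the audit events have already been parsed into dictionaries with a username and an eventTime in epoch milliseconds, and already filtered down to sensitive resources.

from collections import defaultdict
from statistics import mean, pstdev

def flag_anomalous_users(events, threshold=3.0):
    # Bucket events per user per hour.
    hourly = defaultdict(lambda: defaultdict(int))
    for e in events:
        hour = e["eventTime"] // (3600 * 1000)
        hourly[e["username"]][hour] += 1

    anomalies = []
    for user, buckets in hourly.items():
        counts = list(buckets.values())
        if len(counts) < 2:
            continue  # not enough history to establish a baseline
        mu, sigma = mean(counts), pstdev(counts)
        for hour, count in buckets.items():
            # Flag hours where the count is far above the user's own average.
            if sigma > 0 and (count - mu) / sigma > threshold:
                anomalies.append((user, hour, count))
    return anomalies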

Several audit logs are generated by various processes within the Hadoop environment. Some of the processes that generate audit logs are:

  1. Cloudera Navigator
  2. HDFS Audit logs
  3. MapReduce Audit logs
  4. Hive Audit logs
  5. Impala Audit logs
  6. YARN Audit logs
  7. HBase Audit logs
  8. Sentry Audit logs

Cloudera Navigator Audit logs: Cloudera Navigator provides a unified auditing service for all the major services (see all the services supported by the Cloudera Navigator auditing service here). The big catch is that relying on it exclusively imposes an extra infrastructural requirement for running the anomaly detection. Therefore, we will not rely on it alone, but we will use it to improve our pipeline.

Advantages:

  1. Easy to query using API interface.
  2. Easy to filter related logs for a query.
  3. Can query logs for a specific timeframe.
  4. Well structured.

Disadvantages:

  1. Does not support Spark SQL and Spark file access.
  2. When using Impala queries with Hue, Hue does not terminate its connection after the query is executed. As a result, the related audit logs are missing.
  3. HiveCLI is not supported.
  4. Imposes some infrastructural needs.

Sample Cloudera Navigator Audit log:

{
     "timestamp" : "2018-02-05T23:40:59.000Z",
     "service" : "CD-IMPALA-SpjIIzkY",
     "username" : "root",
     "ipAddress" : "10.0.0.58",
     "command" : "CREATE_TABLE",
     "resource" : "test:test3",
     "operationText" : "create table test3(id int, phone String)",
     "allowed" : true,
     "serviceValues" : {
       "operation_text" : "create table test3(id int, phone String)",
       "database_name" : "test",
       "query_id" : "204880ee6fccc9e8:9502f66e00000000",
       "object_type" : "TABLE",
       "session_id" : "374a675274aa2524:eba8aa57e61ca69a",
       "privilege" : "CREATE",
       "table_name" : "test3",
       "status" : ""
     }
}
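
Since the Navigator audit events are exposed through a REST API, pulling them for a specific timeframe is straightforward. The snippet below is a minimal sketch in Python: the host, port, API version, credentials, and exact query syntax are placeholders and may differ between Navigator releases.

import requests

# Placeholder endpoint; adjust host, port and API version for your deployment.
NAV_URL = "http://navigator-host:7187/api/v9/audits"

def fetch_navigator_audits(start_ms, end_ms, service=None, limit=1000):
    params = {
        "startTime": start_ms,   # epoch milliseconds
        "endTime": end_ms,
        "limit": limit,
        "offset": 0,
    }
    if service:
        # Filter to a single service, e.g. the Impala service shown above.
        params["query"] = "service==%s" % service
    resp = requests.get(NAV_URL, params=params, auth=("admin", "admin"))
    resp.raise_for_status()
    return resp.json()  # a list of audit events shaped like the sample above

# Example: events = fetch_navigator_audits(start_ms, end_ms, service="CD-IMPALA-SpjIIzkY")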

HDFS Audit logs: HDFS audit logs record every file access on HDFS. Many of the relational data stores are ultimately backed by files on HDFS, so accesses to those files are logged here as well.

Advantages:

  1. No infrastructural needs.
  2. Every single file access is logged.

Disadvantages:

  1. For services like Impala, if the data being accessed is already cached, the related HDFS audit logs don’t show up.
  2. Difficult to access.
  3. Need to handle log rollover.
  4. Difficult to filter.
  5. Very verbose: even a small MapReduce job generates a large number of HDFS audit logs.
  6. Difficult to trace which queries/processes/services produced a given audit log.

Sample HDFS Audit log:

{
  "allowed": true,
  "serviceName": "hdfs",
  "username": "centos",
  "src": "/user/centos",
  "eventTime": 1512538718000,
  "ipAddress": "10.0.15.3",
  "operation": "listStatus",
  "dest": null,
  "permissions": null,
  "impersonator": null,
  "delegationTokenId": null
}
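
The sample above is shown after conversion to JSON; the raw entries in hdfs-audit.log on the NameNode are typically single lines of tab-separated key=value pairs. The parser below is a sketch under that assumption (field names and layout can vary slightly across Hadoop versions) and produces a dictionary similar to the sample.

def parse_hdfs_audit_line(line):
    # Convert a raw hdfs-audit.log line into a structure like the sample above.
    marker = "FSNamesystem.audit:"
    if marker not in line:
        return None  # not an HDFS audit entry
    body = line.split(marker, 1)[1].strip()
    fields = {}
    for token in body.split("\t"):  # fields assumed to be tab-separated key=value pairs
        if "=" in token:
            key, value = token.split("=", 1)
            fields[key] = None if value == "null" else value
    return {
        "allowed": fields.get("allowed") == "true",
        "serviceName": "hdfs",
        "username": (fields.get("ugi") or "").split(" ")[0],  # drop the "(auth:...)" suffix
        "src": fields.get("src"),
        "ipAddress": (fields.get("ip") or "").lstrip("/"),
        "operation": fields.get("cmd"),
        "dest": fields.get("dst"),
        "impersonator": None,
    }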

Impala Audit logs:

Impala audit logs are generated on every node where the Impala daemon is running.

Advantages:

  1. Well structured.
  2. Gives the query being executed.
  3. No infrastructural needs.

Disadvantages:

  1. Need to monitor all the nodes where Impala daemon is running.
  2. Since this only covers Impala audit logs, we need to find similar auditing sources for the other services as well.

Sample Impala Audit log:

{  
  "1518626522508":{  
     "query_id":"51488b4f90313d18:2860e81700000000",
     "session_id":"b4fd37e314dd6a6:c822b00988ff0086",
     "start_time":"2018-02-14 16:42:02.475475000",
     "authorization_failure":false,
     "status":"",
     "user":"root",
     "impersonator":null,
     "statement_type":"CREATE_TABLE",
     "network_address":"10.0.0.58:36646",
     "sql_statement":"create table test5 (name String)",
     "catalog_objects":[  
        {  
           "name":"test.test5",
           "object_type":"TABLE",
           "privilege":"CREATE"
        }
     ]
  }
}

Hive Audit logs:

Hive uses the Hive metastore for service logging. Hive audit logs are written under the logger name org.apache.hadoop.hive.metastore.HiveMetaStore.audit, so they need to be filtered out of the metastore log before further processing.

Advantages:

  1. All the Hive audit logs are stored on the master node.
  2. No infrastructural needs.

Disadvantages:

  1. No information about the query being executed.
  2. Logs need to be parsed.

Sample Hive Audit log:

2017-12-05 05:10:56,239 INFO org.apache.hadoop.hive.metastore.HiveMetaStore.audit: [pool-5-thread-1]: ugi=hue/ip-10-0-15-3.us-west-2.compute.internal@CLAIRVOYANT ip=/10.0.15.3 cmd=get_functions: db=cloudera_manager_metastore_canary_test_db_hive_hivemetastore_6b039170b5491248c0b846d658d38e58 pat=*
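
Because these entries live inside the general metastore log, they have to be matched and parsed. Below is a rough sketch of such a parser; the pattern is kept loose since the exact layout (thread name, ip prefix, trailing cmd text) can differ between releases.

import re

# Matches HiveMetaStore.audit lines like the sample above and pulls out
# the timestamp, ugi, ip and cmd fields.
HIVE_AUDIT_RE = re.compile(
    r"^(?P<ts>\S+ \S+) INFO org\.apache\.hadoop\.hive\.metastore\.HiveMetaStore\.audit: "
    r"\[.*?\]: ugi=(?P<ugi>\S+)\s+ip=(?P<ip>\S+)\s+cmd=(?P<cmd>.*)$"
)

def parse_hive_audit_line(line):
    match = HIVE_AUDIT_RE.match(line.strip())
    if not match:
        return None  # not a HiveMetaStore.audit line
    return {
        "timestamp": match.group("ts"),
        "username": match.group("ugi"),
        "ipAddress": match.group("ip").lstrip("/"),
        "command": match.group("cmd").strip(),
    }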

Some factors affecting Audit logs: We considered some common factors that may affect the audit logs.

Kerberos: Kerberos is a network authentication protocol that grants users access based on the principal they belong to, by issuing tickets that are valid for a limited time.
Here we compare in detail how the resulting audit logs are affected with and without Kerberos installed.
HDFS Audit logs on a Non-Kerberized Cluster:

{
  "allowed": true,
  "serviceName": "hdfs",
  "username": "centos",
  "src": "/user/centos",
  "eventTime": 1512538718000,
  "ipAddress": "10.0.15.3",
  "operation": "listStatus",
  "dest": null,
  "permissions": null,
  "impersonator": null,
  "delegationTokenId": null
}

HDFS Audit logs on a Kerberized Cluster:

{
  "allowed": true,
  "serviceName": "hdfs",
  "username": "centos@CLAIRVOYANT",
  "src": "/user/centos",
  "eventTime": 1512429082151,
  "ipAddress": "10.0.15.3",
  "operation": "listStatus",
  "dest": null,
  "permissions": null,
  "impersonator": null,
  "delegationTokenId": null
}

Hive on a Non-Kerberized Cluster:

2017-11-30 17:33:50,912 INFO org.apache.hadoop.hive.metastore.HiveMetaStore.audit: [pool-5-thread-21]: ugi=centos ip=10.0.15.3 cmd=source:10.0.15.3 get_table : db=test tbl=movies

Hive on a Kerberized Cluster:

2017-12-05 04:52:32,209 INFO org.apache.hadoop.hive.metastore.HiveMetaStore.audit: [pool-5-thread-21]: ugi=centos@CLAIRVOYANT ip=10.0.15.3 cmd=source:10.0.15.3 get_table : db=test tbl=movies
Impala Audit logs on a Non-Kerberized Cluster:

[Figure: Impala Audit log]

Effect of Kerberos: The username field is affected after installing Kerberos. On a Kerberized cluster, the username field contains the full Kerberos principal, including the realm the user belongs to.
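
Since the only observed difference is that the username becomes a full principal, a simple way to keep the rest of the pipeline identical on Kerberized and non-Kerberized clusters is to normalize the user field, for example:

def normalize_username(username):
    # Strip the Kerberos realm (and any host component) from a principal so
    # that "centos@CLAIRVOYANT" and "centos" are treated as the same user.
    if not username:
        return username
    return username.split("@", 1)[0].split("/", 1)[0]

# normalize_username("centos@CLAIRVOYANT") -> "centos"
# normalize_username("centos")             -> "centos"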

Connection Types: The type of connection used to reach a service also affects the audit logs. For example, Hive can be accessed through Hue, HiveCLI, and Beeline; the same applies to other services.

HDFS Audit log accessed from Hadoop FsShell:

{
  "allowed": true,
  "serviceName": "hdfs",
  "username": "centos",
  "src": "/user/centos",
  "eventTime": 1512538718000,
  "ipAddress": "10.0.15.3",
  "operation": "listStatus",
  "dest": null,
  "permissions": null,
  "impersonator": null,
  "delegationTokenId": null
}

HDFS Audit log accessed from Hue:

{
  "allowed": true,
  "serviceName": "hdfs",
  "username": "centos",
  "src": "/user/centos",
  "eventTime": 1512538602501,
  "ipAddress": "10.0.15.3",
  "operation": "listStatus",
  "dest": null,
  "permissions": null,
  "impersonator": "hue",
  "delegationTokenId": null
}

Effect of Connections: For many services, the impersonator field is affected. We have performed similar experiments for JDBC connections and other connection types.
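
For our purposes, this means the impersonator field can be used to record how the data was reached. A small illustrative helper (based only on the samples above) might look like:

def access_channel(event):
    # Direct access (e.g. Hadoop FsShell) leaves impersonator empty, while
    # access through a proxy service such as Hue fills it in ("hue").
    impersonator = event.get("impersonator")
    return impersonator if impersonator else "direct"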

Analyzing and Comparing various Audit logs:

As discussed above, various processes within the big data environment generate audit logs. Here we analyze and compare them for simple queries.

Audit logs for Impala:

Query: “show databases”

[Figure: Audit logs for "show databases"]

There are no related HDFS audit logs. This is one of the scenarios where data is accessed without any files being read from HDFS.

Query: “select * from test2”, which reads the contents of the test2 table in the test database.

[Figure: Audit logs for "select * from test2"]

Here, we look more closely at the audit logs and compare some of them.

Query: “create table test5 (name String)”

[Figure: Audit logs for "create table test5 (name String)"]

After executing and analyzing simple Impala queries (typically involving one table or none), we find that the Navigator and Impala audit logs are similarly structured and give almost the same information.

Navigator has a “service” field that can be used to filter the relevant logs, while Impala audit logs do not need to be filtered since they only cover Impala.

The query being executed is available in the “operationText” field of the Navigator logs and in the “sql_statement” field of the Impala audit logs.

Let’s do the same comparison for a little more complex queries like joins, where multiple tables are involved.

Query: “select t1.id, t1.name, t1.email, t2.phone from test4 t1 join test3 t2 on t1.id = t2.id”

[Figure: Audit logs for the join on test4 and test3]

Though Impala audit logs offer more information, such as port addresses, we are currently not considering the port addresses in order to simplify development.

Also, all the audit entries for a query are stored under a single JSON object in the Impala logs, while Navigator produces multiple records. Multiple audit entries are similarly produced for cases like invalidating metadata, other joins, or queries that access multiple tables.
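
For downstream processing it helps to bring both sources to a common shape. The sketch below flattens one Impala audit object (keyed by a millisecond timestamp, as in the sample earlier) into one record per catalog object, using Navigator-style field names; it only covers the fields used in this comparison.

def impala_audit_to_records(audit_obj):
    # Flatten {"<timestamp>": {...}} into a list of Navigator-like records,
    # one per catalog object touched by the query.
    records = []
    for timestamp, entry in audit_obj.items():
        for obj in entry.get("catalog_objects") or [{}]:
            records.append({
                "timestamp": int(timestamp),
                "username": entry.get("user"),
                "command": entry.get("statement_type"),
                "operationText": entry.get("sql_statement"),
                "resource": obj.get("name"),
                "allowed": not entry.get("authorization_failure", False),
            })
    return records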

Upcoming:

In this blog, we discussed the need for and advantages of adding an extra layer of security around sensitive data. We also discussed various audit logs and the effect of Kerberos on them.

In the next blog, we will discuss Hive audit logs, the anomaly detection pipeline, using the Kogni API to identify sensitive information, and the method for identifying anomalies.