1 of 8

Datastore Protection

This section consists of various documents that assist you in scanning various popular data stores using Nightfall APIs.

Airtable DLP Tutorial
Amazon Kinesis DLP Tutorial
Amazon RDS DLP Tutorial - Full Scan
Amazon RDS DLP Tutorial
Amazon S3 DLP Tutorial
Elasticsearch DLP Tutorial
Snowflake DLP Tutorial

Airtable DLP Tutorial

How to scan for sensitive data in Airtable

Airtable is a popular cloud collaboration tool that lands somewhere between a spreadsheet and a database. As such, it can house all sorts of sensitive data that you may not want to surface in a shared environment.

By utilizing Airtable's API in conjunction with Nightfall AI’s scan API, you can discover, classify, and remediate sensitive data within your Airtable bases.

Prerequisites

You will need a few things to follow along with this tutorial:

An Airtable account and API key
A Nightfall API key
An existing Nightfall Detection Rule
A Python 3 environment (version 3.7 or later)
The most recent version of Python Nightfall SDK

Installation

Install the Nightfall SDK and the requests library using pip.

pip install nightfall=1.2.0
pip install requests

Creating the Example

To start, import all the libraries we will be using.

The JSON, OS, and CSV libraries are part of Python so we don't need to install them.

import requests
import json
import os
import csv
from nightfall import Nightfall

We've configured the Airtable and Nightfall API keys as environment variables so they are not written directly into the code.

nightfall_api_key = os.environ.get('NIGHTFALL_API_KEY')
airtable_api_key = os.environ.get('AIRTABLE_API_KEY')

Next, we define the Detection Rule with which we wish to scan our data.

The Detection Rule can be pre-made in the Nightfall web app and referenced by UUID.

Also, we abstract a nightfall class from the SDK, for our API key.

detectionRuleUUID = os.environ.get('DETECTION_RULE_UUID')

nightfall = Nightfall(nightfall_api_key)

The Airtable API doesn't list all bases in a workspace or all tables in a base; instead, you must specifically call each table to get its contents.

In this example, we have set up a config.json file to store that information for the Airtable My First Workspace bases. You may also wish to consider setting up a separate Base and Table that stores your schema and retrieves that information with a call to the Airtable API.

[
    {
        "base_id": "appp4vxoDwgURFwYp",
        "base_name": "Product Planning",
        "tables": [
            "Stories", 
            "Epics", 
            "Sprints", 
            "Release Milestones", 
            "Facets", 
            "App Sections"
        ]
    },
    {
        "base_id": "appwWnUfLVJhltYQv",
        "base_name": "Product Launch",
        "tables": [
            "Features",
            "Product Themes",
            "Monthly Newsletters"
        ]
    }
  ]

As an extension of this exercise, you could write Nightfall findings back to another table within that Base.

Now we set up the parameters we will need to call the Airtable API using the previously referenced API key and config file.

airtable_config = json.load(open('config.json', 'r'))
airtable_base_url = 'https://api.airtable.com/v0'
airtable_headers = {
  "Authorization": f"Bearer {airtable_api_key}"
}

We will now call the Airtable API to retrieve the contents of our Airtable workspace. The data hierarchy in Airtable goes Workspace > Base > Table. We will need to perform a GET request on each table in turn.

As we go along, we will convert each data field into its string enriched with identifying metadata so that we can locate and remediate the data later should sensitive findings occur.

🚧Warning
If you are sending more than 50,000 items or more than 500KB, consider using the file API. You can learn more about how to use the file API in the Using the File Scanning Endpoint with Airtable section below.

all_airtable = []

for base in airtable_config:
    base_id = base['base_id']
    req_tables = [i.replace(' ', '%20') for i in base['tables']]

    for table in req_tables:
        airtable_url = f"{airtable_base_url}/{base_id}/{table}"
        airtable_response = requests.get(airtable_url, headers=airtable_headers)
        airtable_content = json.loads(airtable_response.text)

        for i in airtable_content['records']:
            # We enrich each datum with metadata so it can be easily located later
            cur_str = f"BaseName: {base['base_name']} -|- BaseID: {base_id} -|- Table: {table} -|- Record: {i['id']} -|- Field: "

            for j in i['fields']:
                str_to_send = f"{cur_str}{j} -|- Content: {i['fields'][j]}"
                all_airtable.append(str_to_send)

Before moving on we will define a helper function to use later so that we can unpack the metadata from the strings we send to the Nightfall API.

def str_parser(sent_str):
    split_str = sent_str.split(' -|- ')
    split_dict = {i[:i.find(': ')]: i[i.find(': ')+2:] for i in split_str[:5]}
    findertext = f" -|- Field: {split_dict['Field']} -|- Content: "
    split_dict['Content'] = sent_str[sent_str.find(findertext)+len(findertext):]
    return split_dict

We will begin constructing an all_findings object to collect our results. The first row of our all_findings object will constitute our headers since we will dump this object to a CSV file later.

This example will include the full finding below. As the finding might be a piece of sensitive data, we recommend using the Redaction feature of the Nightfall API to mask your data.

all_findings = []
all_findings.append(
  [
    'base_name', 'base_id', 'table_name', 'record_id', 'field',
    'detector', 'confidence', 
    'finding_start', 'finding_end', 'finding'
  ]
)

Now we call the Nightfall API on content retrieved from Airtable. For every sensitive data finding we receive, we strip out the identifying metadata from the sent string and store it with the finding in all_findings so we can analyze it later.

findings, redactions = nightfall.scan_text(
    all_airtable,
    detection_rule_uuids=[detectionRuleUUID]
)

# This level of loop corresponds to each list item sent to the Nightfall API
for field_idx, field_findings in enumerate(findings):
    
    sent_str = all_airtable[field_idx]
    # We call the helper function we defined earlier to help us parse the string sent to the Nightfall API
    parsed_str = str_parser(sent_str)
    offset = len(sent_str) - len(parsed_str['Content'])

    # This loop corresponds to each finding within an item sent to the Nightfall API
    for finding in field_findings:

        # If a finding is returned within the metadata for the content, we discount it
        if finding.byte_range.start < offset:
            continue

        # Add finding data to all_findings
        all_findings.append([
            parsed_str['BaseName'],
            parsed_str['BaseID'],
            parsed_str['Table'],
            parsed_str['Record'],
            parsed_str['Field'],
            finding.detector_name,
            finding.confidence.value,
            finding.byte_range.start,
            finding.byte_range.end,
            finding.finding
        ])

Finally, we export our results to a CSV so they can be easily reviewed.

if len(all_findings) > 1:
    with open('output_file.csv', 'w') as output_file:
        csv_writer = csv.writer(output_file, delimiter = ',')
        csv_writer.writerows(all_findings)
else:
    print('No sensitive data detected. Hooray!')

That's it! You now have insight into all of the sensitive data stored within your Airtable workspace!

As a next step, you could write your findings to a separate 'Nightfall Findings' Airtable base for review, or you could update and redact confirmed findings in situ using the Airtable API.

Using the File Scanning Endpoint with Airtable

The example above is specific to the Nightfall Text Scanning API. To scan files, we can use a similar process as we did the text scanning endpoint. The process is broken down into the sections below, as the file scanning process is more intensive.

File Scan Prerequisites

To utilize the File Scanning API you need the following:

An active API Key authorized for file scanning passed via the header Authorization: Bearer — see Authentication and Security
A Nightfall Detection Policy associated with a webhook URL
A web server configured to listen for file scanning results (more information below)

File Scan Implementation

Retrieve Airtable Data

Similar to the process in the beginning of this tutorial for the text scanning endpoint, we will now initialize and retrieve the data we want to retrieve from Airtable.

airtable_config = json.load(open('config.json', 'r'))
airtable_base_url = 'https://api.airtable.com/v0'
airtable_headers = {
  "Authorization": f"Bearer {airtable_api_key}"
}

all_airtable = []
all_airtable.append(
  ['base_name', 'base_id', 'table_name', 'record_id', 'field', 'content']
)

Now we go through writing the data to a .csv file.

filename = "nf_airtable_input-" + str(int(time.time())) + ".csv"

for base in airtable_config:
    base_id = base['base_id']
    req_tables = [i.replace(' ', '%20') for i in base['tables']]

    for table in req_tables:
        airtable_url = f"{airtable_base_url}/{base_id}/{table}"
        airtable_response = requests.get(airtable_url, headers=airtable_headers)
        airtable_content = json.loads(airtable_response.text)

        for i in airtable_content['records']:
            for j in i['fields']:
                # We enrich each datum with metadata so it can be easily located later
                # BaseName, BaseID, Table, Record, Field, Content
                row = [base['base_name'], base_id, table, i['id'], j, i['fields'][j]]
                all_airtable.append(row)

with open(filename, 'w') as output_file:
    csv_writer = csv.writer(output_file, delimiter=',')
    csv_writer.writerows(all_airtable)

print("Airtable Data Written to: ", filename)

Upload to Scan API

Using the above .csv file, begin the Scan API file upload process.

scan_id, message = nightfall.scan_text(
    filename,
    webhook_url=WEBHOOK_URL,
    detection_rule_uuids=[detectionRuleUUID],
)

Using the Scan Endpoint

Once the files have been uploaded, use the scan endpoint.

A webhook server is required for the scan endpoint to submit its results. See our example webhook server.

The scanning endpoint will work asynchronously for the files uploaded, so you can monitor the webhook server to see the API responses and file scan findings as they come in.

Amazon Kinesis DLP Tutorial

Amazon Kinesis allows you to collect, process, and analyze real-time streaming data. In this tutorial, we will set up Nightfall DLP to scan Kinesis streams for sensitive data. An overview of what we are going to build is shown in the diagram below.

We will send data to Kinesis using a simple producer written in Python. Next, we will use an AWS Lambda function to send data from Kinesis to Nightfall. Nightfall will scan the data for sensitive information. If there are any findings returned by Nightfall, the Lambda function will write the findings to a DynamoDB table.

Prerequisites

To complete this tutorial you will need the following:

An AWS Account with access to Kinesis, Lambda, and DynamoDB
The AWS CLI installed and configured on your local machine.
A Nightfall API Key
An existing Nightfall Detection Rule which contains at least one detector for email addresses.
Local copy of the companion repository for this tutorial.

Before continuing, you should clone the companion repository locally.

git clone https://github.com/nightfallai/nightfall-kinesis-demo

Configuring AWS Services

First, we will configure all of our required Services on AWS.

Create Execution Role

Open the IAM roles page in the AWS console.
Choose Create role.
Create a role with the following properties:
1. Lambda as the trusted entity
2. Permissions
  - AWSLambdaKinesisExecutionRole
  - AmazonDynamoDBFullAccess
3. Role name: nightfall-kinesis-role

Create Kinesis Data Stream

Open the Kinesis page and select Create Data Stream
Enter nightfall-demo as the Data stream name
Enter 1 as the Number of open shards
Select Create data stream

Create Lambda Function

Open the Lambda page and select Create function
Choose Author from scratch and add the following Basic information:
1. nightfall-lambda as the Function name
2. Python 3.8 as the Runtime
3. Select Change default execution role, Use an existing role, and select the previously created nightfall-kinesis-role

Once the function has been created, in the Code tab of the Lambda function select Upload from and choose .zip file. Select the local nightfall-lambda-package.zip file that you cloned earlier from the companion repository and upload it to AWS Lambda.

You should now see the previous sample code replaced with our Nightfall-specific Lambda function.

Next, we need to configure environment variables for the Lambda function.

Within the same Lambda view, select the Configuration tab and then select Environment variables.

Add the following environment variables that will be used during the Lambda function invocation.

NIGHTFALL_API_KEY : your Nightfall API Key
DETECTION_RULE_UUID : your Nightfall Detection Rule UUID.

🚧Detection Rule Requirements
This tutorial uses a data set that contains a name, email, and random text. In order to see results, please make sure that the Nightfall Detection Rule you choose contains at least one detector for email addresses.

Lastly, we need to create a trigger that connects our Lambda function to our Kinesis stream.

In the function overview screen on the top of the page, select Add trigger.
Choose Kinesis as the trigger.
Select the previously created nightfall-demo Kinesis stream.
Select Add

Create DynamoDB Table

The last step in creating our demo environment is to create a DynamoDB table.

Open the DynamoDB page and select Create table
Enter nightfall-findings as the Table Name
Enter KinesisEventID as the Primary Key

Be sure to also run the following before the Lambda function is created:

This is to ensure that the required version of the Python SDK for Nightfall has been installed. We also need to install boto3.

pip install nightfall=1.2.0
pip install boto3

Lambda Function Overview

Before we start processing the Kinesis stream data with Nightfall, we will provide a brief overview of how the Lambda function code works. The entire function is shown below:

import os
import base64
import boto3
from nightfall import Nightfall


def lambda_handler(event, context):
    dynamodb = boto3.resource('dynamodb')
    table = dynamodb.Table('nightfall-findings')

    records = []
    for record in event['Records']:
        # Kinesis data is base64 encoded so decode here
        payload = base64.b64decode(record["kinesis"]["data"])
        records.append(payload.decode("utf-8"))

    nightfall = Nightfall(
        os.environ.get('NIGHTFALL_API_KEY')
    )

    findings, redactions = nightfall.scan_text(
        records,
        detection_rule_uuids=[os.environ.get('DETECTION_RULE_UUID')]
    )

    for record_i, record_findings in enumerate(findings):
        if record_findings:
            formatted_findings = []
            for finding in record_findings:
                formatted_findings.append({
                    'Finding': finding.finding,
                    'BeforeContext': finding.before_context,
                    'AfterContext': finding.after_context,
                    'DetectorName': finding.detector_name,
                    'DetectorUUID': finding.detector_uuid,
                    'ByteStart': finding.byte_range.start,
                    'ByteStop': finding.byte_range.stop,
                    'Confidence': finding.confidence.value,
                })

            table.put_item(
                Item={
                    'KinesisEventID': event['Records'][record_i]['eventID'],
                    'KinesisRecord': records[record_i],
                    'NightfallFindings': formatted_findings,
                }
            )

This is a relatively simple function that does four things.

Create a DynamoDB client using the boto3 library.

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('nightfall-findings')

Extract and decode data from the Kinesis stream and add it to a single list of strings.

records = []
for record in event['Records']:
    # Kinesis data is base64 encoded so decode here
    payload = base64.b64decode(record["kinesis"]["data"])
    records.append(payload.decode("utf-8"))

Create a Nightfall client using the nightfall library and scan the records that were extracted in the previous step.

nightfall = Nightfall(
    os.environ.get('NIGHTFALL_API_KEY')
)
    
findings, redactions = nightfall.scan_text(
    records,
    detection_rule_uuids=[os.environ.get('DETECTION_RULE_UUID')]
)

Iterate through the response from Nightfall, if there is are findings for a record we copy the record and findings metadata into a DynamoDB table. We need to process the list of Finding objects into a list of dicts before passing them to DynamoDB.

for record_i, record_findings in enumerate(findings):
    if record_findings:
        formatted_findings = []
        for finding in record_findings:
            formatted_findings.append({
                'Finding': finding.finding,
                'BeforeContext': finding.before_context,
                'AfterContext': finding.after_context,
                'DetectorName': finding.detector_name,
                'DetectorUUID': finding.detector_uuid,
                'ByteStart': finding.byte_range.start,
                'ByteStop': finding.byte_range.stop,
                'Confidence': finding.confidence.value,
            })

        table.put_item(
            Item={
                'KinesisEventID': event['Records'][record_i]['eventID'],
                'KinesisRecord': records[record_i],
                'NightfallFindings': formatted_findings,
           }
        )

Sending Data to Kinesis

Now that you've configured all of the required AWS services, and understand how the Lambda function works, you're ready to start sending data to Kinesis and scanning it with Nightfall.

We've included a sample script in the companion repository that allows you to send fake data to Kinesis. The data that we are going to be sending looks like this:

'id': fake.uuid4(),
'name': fake.name(),
'email': fake.email(),
'message': fake.paragraph()

The script will send one record with the data shown above every 10 seconds.

Sample Data Script Usage Instructions

Before running the script, make sure that you have the AWS CLI installed and configured locally. The user that you are logged in with should have the appropriate permissions to add records to the Kinesis stream. This script uses the Boto3 library which handles authentication based on the credentials file that is created with the AWS CLI.

You can start sending data with the following steps:

Open the companion repo that you cloned earlier in a terminal.
Create and Activate a new Python Virutalenv

python3 -m venv venv
source venv/bin/activate

Install Dependencies

pip install -r requirements.txt

Start sending data

If everything worked, you should see output similar to this in your terminal:

SENT TO KINESIS: {'id': '8a69f3f5-432e-4ec1-8295-e8b79236e36e', 'name': 'Jessica Henderson', 'email': '[email protected]', 'message': 'Eye evening ahead field. With energy all personal soon sense. Method decision TV that.'}
SENT TO KINESIS: {'id': 'd4a90b48-cbcd-45ca-a231-3edbbc0c4792', 'name': 'Thomas Cuevas', 'email': '[email protected]', 'message': 'People write from season. Upon drive before summer exactly tonight practice expert. Actually news reason particularly in should.'}
SENT TO KINESIS: {'id': '084083bc-114a-4cc5-8cd6-2e15fd26b6db', 'name': 'Nathan Ward', 'email': '[email protected]', 'message': 'Add school air visit physical range. Child that company late. Boy than remain. Early ability economy thought event option.'}

View Nightfall Findings in DynamoDB

As the data starts to get sent to Kinesis, the Lambda function that we created earlier will begin to process each record and check for sensitive data using the Nightfall Detection Rule that we specified in the configuration.

If Nightfall detects a record with sensitive data, the Lambda function will copy that record and additional metadata from Nightfall to the DynamoDB table that we created previously.

Conclusion

Clean Up

If you'd like to clean up the created resources in AWS after completing this tutorial you should remove the following resources:

nightfall-kinesis-role IAM Role
nightfall-demo Kinesis data stream
nightfall-lambda Lambda Function
nightfall-findings DynamoDB Table

Using Redaction to Mask Findings

With the Nightfall API, you are also able to redact and mask your Kinesis findings. You can add a Redaction Config, as part of your Detection Rule, as a section within the lambda function. For more information on how to use redaction with the Nightfall API, and its specific options, please refer to the guide here.

Amazon RDS DLP Tutorial

RDS is a service for managing relational databases and can contain databases from several different varieties. This tutorial demonstrates connectivity with a postgresSQL database but could be modified to support other database options.

This tutorial allows you to scan your RDS managed databases using the Nightfall API/SDK.

You will need a few things first to use this tutorial:

An AWS account with at least one RDS database (this example uses postgres but could be modified to support other varieties of SQL)
A Nightfall API key
An existing Nightfall Detection Rule
A Python 3 environment (version 3.6 or later)
Python Nightfall SDK

To accomplish this, we will install the version required of the Nightfall SDK:

pip install nightfall=0.6.0

We will be using Python and importing the following libraries:

import requests
import psycopg2
import os
import sys
import json
from nightfall import Nightfall

We will set the size and length limits for data allowed by the Nightfall API per request. Also, we extract our API Key, and abstract a nightfall class from the SDK, for it.

size_limit = 500000
length_limit = 50000

Next we extract our API Key, and abstract a nightfall class from the SDK, for it.

nightfall = Nightfall(os.environ['NIGHTFALL_API_KEY'])

Next we define the Detection Rule with which we wish to scan our data. The Detection Rule can be pre-made in the Nightfall web app and referenced by UUID.

detectionRuleUUID = os.environ.get('DETECTION_RULE_UUID')

First we will set up the connection with the Postgres table, in RDS, and get the data to be scanned from there.

Note, we are setting the RDS authentication information as the below environment variables, and referencing the values from there:

'RDS_ENDPOINT'
'RDS_USER'
'RDS_PASSWORD'
'RDS_DATABASE'
'RDS_TABLE'
'RDS_PRIMARYKEY'

connection = psycopg2.connect(
        host = os.environ.get('RDS_ENDPOINT'),
        port = 5432,
        user = os.environ.get('RDS_USER'),
        password = os.environ.get('RDS_PASSWORD'),
        database = os.environ.get('RDS_DATABASE')
    )
table_name = os.environ.get('RDS_TABLE')
primary_key = os.environ.get('RDS_PRIMARYKEY')
cursor = connection.cursor()

sql = f"""
SELECT *
FROM {table_name}
"""

cursor.execute(sql)
connection.commit()

cols = [i.name for i in cursor.description]
data = cursor.fetchall()

We can then check the data size, and as long as it is below the aforementioned limits, can be ran through the API.

If the data payloads are larger than the size or length limits of the API, extra code will be required to further chunk the data into smaller bits that are processable by the Nightfall scan API.

This can be seen in the second and third code panes below:

primary_key_col = []

if len(data) == 0:
  raise Exception('Table is empty! No data to scan.')

all_findings = []
for col_idx, col in enumerate(columns):
    payload = [str(i[col_idx]) for i in data]
    if col == primary_key:
      primary_key_col = payload
      col_size = sys.getsizeof(payload)

    if col_size < size_limit:
   	 resp = nightfall.scanText(
        [payload],
        detection_rule_uuids=[detectionRuleUUID])
    
     col_resp = json.loads(resp)
      
for item_idx, item in enumerate(col_resp):
  if item != None:
    for finding in item:
      finding['column'] = col
      try:
        finding['index'] = primary_key_col[item_idx]
      except:
          finding['index'] = item_idx
      all_findings.append(finding)

col_resp = []
chunks = []
chunk = []
running_size = 0
big_items = []

for item_idx, item in enumerate(payload):
  item_size = sys.getsizeof(item)
  if (running_size + item_size < size_limit) and (len(chunk) < length_limit):
    chunk.append(item)
    running_size += item_size
  elif item_size < size_limit:
    chunks.append(chunk)
    chunk = [item]
    running_size = item_size
  else:
    if len(chunk) < length_limit:
      chunk.append('')
    else:
      chunks.append(chunk)
      chunk = ['']
      big_items.append(item_idx)
      chunks.append(chunk)

chunk_cursor = 0

for chunk in chunks:
  resp = nightfall.scanText(
        [chunk],
        detection_rule_uuids=[conditionSetUUID])
  col_resp.extend(json.loads(resp.text))
  chunk_cursor += len(chunk)

for item_idx, item in enumerate(col_resp):
  if item != None:
    for finding in item:
      finding['column'] = col
      try:
        finding['index'] = primary_key_col[item_idx]
      except:
          finding['index'] = item_idx
      all_findings.append(finding)

for big in big_items:
  item_size = sys.getsizeof(big)
    chunks_req = (item_size // size_limit) + 1
    chunk_len = len(item) // chunks_req
    cursor = 0
    item_findings = []
    for _ in range(chunks_req):
        p = item[cursor : min(cursor + chunk_len, len(item))]
        resp = nightfall.scanText({
        [[p]],
        detection_rule_uuids=[conditionSetUUID])
        item_findings.extend(json.loads(resp.text))
        cursor += chunk_len
  
  if item_findings == []:
    raise Exception(f"Error while scanning large item at column {col}, Index {primary_key_col[big]}")
  for find_chunk in item_resp:
      if find_chunk != None:
        for finding in find_chunk:
          finding['column'] = col
          try:
            finding['index'] = primary_key_col[big]
          except:
            finding['index'] = big
          all_findings.append(finding)

To review the results, we will print the number of findings, and write the findings to an output file:

printf("{len(all_findings)} sensitive findings in {os.environ.get('RDS_TABLE')}")
with open('rds_findings.json', 'w') as output_file:
  json.dump(all_findings, output_file)

Please find the full script together below, broken into functions that can be ran in full:

import requests
import psycopg2
import os
import sys
import json
from nightfall.api import Nightfall

size_limit = 500000
length_limit = 50000

nightfall = Nightfall(os.environ['NIGHTFALL_API_KEY'])

detectionRuleUUID = os.environ.get('DETECTION_RULE_UUID')

def get_from_rds():
    '''
    Gets data to be scanned from postgres table in RDS.
    Inputs: 
        None (all required info should be stored as environment variables)
    Returns: 
        data, list[tuple()] - data from postgres table as a list of tuples
        cols, list[str] - list of column names
        primary_key, str - name of primary key column
    '''
    connection = psycopg2.connect(
        host = os.environ.get('RDS_ENDPOINT'),
        port = 5432,
        user = os.environ.get('RDS_USER'),
        password = os.environ.get('RDS_PASSWORD'),
        database = os.environ.get('RDS_DATABASE')
    )
    table_name = os.environ.get('RDS_TABLE')
    primary_key = os.environ.get('RDS_PRIMARYKEY')
    cursor = connection.cursor()

    sql = f"""
    SELECT *
    FROM {table_name}
    """

    cursor.execute(sql)
    connection.commit()

    cols = [i.name for i in cursor.description]
    data = cursor.fetchall()

    return data, cols, primary_key

def nightfall_scan(payload):
    '''
    Calls the Nightfall scan API on input text
    Inputs:
        payload, list[str] - list of strings to be scanned
    Returns:
        resp - http response from the Nightfall API containing scan results
    '''
    return nightfall.scanText(
        [payload],
        detection_rule_uuids=[detectionRuleUUID])

def craft_chunks(payload, size_limit, length_limit):
    '''
    Chunks a payload into smaller bits processable by the Nightfall scan API
    Inputs:
        payload, list[str] - list of strings to be scanned
        size_limit, int - maximum data size allowed by Nightfall API per request
        length_limit, int - maximum no. of items allowed by Nightfall API per request
    Returns:
        chunks, list[list[str]] - list of lists of strings to be scanned
        big_items, list[int] - list of indices of items that exceed the size_limit on their own
    '''
    chunks = []
    chunk = []
    running_size = 0
    big_items = []
    for item_idx, item in enumerate(payload):
        item_size = sys.getsizeof(item)
        if (running_size + item_size < size_limit) and (len(chunk) < length_limit):
            chunk.append(item)
            running_size += item_size
        elif item_size < size_limit:
            chunks.append(chunk)
            chunk = [item]
            running_size = item_size
        else:
            if len(chunk) < length_limit:
                chunk.append('')
            else:
                chunks.append(chunk)
                chunk = ['']
            big_items.append(item_idx)
    chunks.append(chunk)
    return chunks, big_items

def scan_big_item(item, size_limit):
    '''
    Chunks a single large block of text and sends it to the Nightfall API
    in processable bits
    Inputs:
        item, str - a single string to be scanned by the Nightfall API
        size_limit, int - maximum data size allowed by Nightfall API per request
    Returns:
        item_findings, list[list[dict]] - findings from the Nightfall API for the entire item
    '''
    item_size = sys.getsizeof(item)
    chunks_req = (item_size // size_limit) + 1
    chunk_len = len(item) // chunks_req
    cursor = 0
    item_findings = []
    for _ in range(chunks_req):
        p = item[cursor : min(cursor + chunk_len, len(item))]
        resp = nightfall.scanText(
          [[p]],
          detection_rule_uuids=[detectionRuleUUID])
        item_findings.extend(json.loads(resp.text))
        cursor += chunk_len
    return item_findings

if __name__ == '__main__':
    # This script will be for Postgres but other SQL varieties 
    # will work with modifications
    data, columns, primary_key = get_from_rds()
    primary_key_col = []
    if len(data) == 0:
        raise Exception('Table is empty! No data to scan.')

    all_findings = []
    for col_idx, col in enumerate(columns):
        payload = [str(i[col_idx]) for i in data]
        if col == primary_key:
            primary_key_col = payload
        col_size = sys.getsizeof(payload)

        if col_size < size_limit:
            resp = resp = nightfall.scanText({
              [[p]],
              detection_rule_uuids=[detectionRuleUUID])
            col_resp = json.loads(resp)
        else:
            col_resp = []
            chunks, big_items = craft_chunks(payload, size_limit, length_limit)
            chunk_cursor = 0
            for chunk in chunks:
                resp = nightfall.scanText(
                  [chunk],
                  detection_rule_uuids=[detectionRuleUUID])
                col_resp.extend(json.loads(resp))
                chunk_cursor += len(chunk)
            
            for big in big_items:
                item_resp = scan_big_item(payload[big], size_limit)
                if item_resp == None:
                    raise Exception(f"Error while scanning large item at column {col}, Index {primary_key_col[big]}")
                for find_chunk in item_resp:
                    if find_chunk != None:
                        for finding in find_chunk:
                            finding['column'] = col
                            try:
                                finding['index'] = primary_key_col[big]
                            except:
                                finding['index'] = big
                            all_findings.append(finding)

        # Add location within source Table
        for item_idx, item in enumerate(col_resp):
            if item != None:
                for finding in item:
                    finding['column'] = col
                    try:
                        finding['index'] = primary_key_col[item_idx]
                    except:
                        finding['index'] = item_idx
                    all_findings.append(finding)

    print(f"{len(all_findings)} sensitive findings in {os.environ.get('RDS_TABLE')}")
    with open('rds_findings.json', 'w') as output_file:
        json.dump(all_findings, output_file)

The following are potential ways to continue building upon this service:

Writing Nightfall results to a database and reading that into a visualization tool
Adding to this script to support other varieties of SQL
Redacting sensitive findings in place once they are detected, either automatically or as a follow-up script once findings have been reviewed

Using Redaction to Mask Findings

With the Nightfall API, you are also able to redact and mask your RDS findings. You can add a Redaction Config, as part of your Detection Rule. For more information on how to use redaction, and its specific options, please refer to the guide here.

Using the File Scanning Endpoint with RDS

The example above is specific for the Nightfall Text Scanning API. To scan files, we can use a similar process as we did the text scanning endpoint. The process is broken down in the sections below, as the file scanning process is more intensive.

Prerequisites

To utilize the File Scanning API you need the following:

An active API Key authorized for file scanning passed via the header Authorization: Bearer — see Authentication and Security
A Nightfall Detection Policy associated with a webhook URL
A web server configured to listen for file scanning results (more information below)

Steps to use the Endpoint

Retrieve data from RDS

Similar to the process in the beginning of this tutorial for the text scanning endpoint, we will now initialize our AWS RDS Connection. Once the session is established, we can query from RDS.

connection = psycopg2.connect(
        host = os.environ.get('RDS_ENDPOINT'),
        port = 5432,
        user = os.environ.get('RDS_USER'),
        password = os.environ.get('RDS_PASSWORD'),
        database = os.environ.get('RDS_DATABASE')
    )
table_name = os.environ.get('RDS_TABLE')
primary_key = os.environ.get('RDS_PRIMARYKEY')
cursor = connection.cursor()

sql = f"""
SELECT *
FROM {table_name}
"""

cursor.execute(sql)
connection.commit()

cols = [i.name for i in cursor.description]
data = cursor.fetchall()

Now we go through the data and write to a .csv file.

primary_key_col = []

if len(data) == 0:
  raise Exception('Table is empty! No data to scan.')

filename = "nf_rds_input-" + str(int(time.time())) + ".csv"  

for col_idx, col in enumerate(columns):
    payload = [str(i[col_idx]) for i in data]   
    with open(filename, 'w') as output_file:
      csv_writer = csv.writer(output_file, delimiter=',')
      csv_writer.writerows(payload)
     
print("RDS Data Written to: ", filename)

Begin the file upload process to the Scan API, with the above written .csv file, as shown here.
Once the files have been uploaded, begin using the scan endpoint mentioned here. Note: As can be seen in the documentation, a webhook server is required for the scan endpoint, to which it will send the scanning results. An example webhook server setup can be seen here.
The scanning endpoint will work asynchronously for the files uploaded, so you can monitor the webhook server to see the API responses and file scan findings as they come in.

Amazon RDS DLP Tutorial - Full Scan

How to run a full scan of an Amazon database

To scan an Amazon database instance (i.e. mySQL, Postgres) you must create a snapshot of that instance and export the snapshot to S3.

The export process runs in the background and doesn't affect the performance of your active DB instance. Exporting RDS snapshots can take a while depending on your database type and size

Once the snapshot has been exported you will be able to scan the resulting parquet files with Nightfall like another file. You can do this using our endpoints for uploading files or using our Amazon S3 Python integration.

Prerequisites

In addition to having created your RDS instance, you will need to define the following to export your snapshots so they can later be scanned by Nightfall:

Amazon S3 bucket

To perform this scan, you will need to configure an Amazon S3 bucket to which you will export a snapshot.

📘S3 Bucket Requirements
This bucket must have snapshot permissions and the bucket to export must be in the same AWS Region as the the snapshot being exported.

If you have not already created a designated S3 bucket, in the AWS console select Services > Storage > S3

Click the "Create bucket" button and give your bucket a unique name as per the instructions.

For more information please see Amazon's documentation on identifying an Amazon S3 bucket for export.

Identity and Access Management (IAM) Role

You need an Identity and Access Management (IAM) Role to perform the transfer for a snapshot to your S3 bucket.

This role may be defined at the time of backup and it will be given the proper specific permissions.

You may also create the role under Services > Security, Identity, & Compliance > IAM and select “Roles” from under the “Access Management” section of the left-hand navigation.

From there you can click the “Create role” button and create a role where “AWS Service” is the trusted entity type.

For more information see Identity and Access Management in Amazon RDS and Providing access to an Amazon S3 bucket using an IAM role

AWS KMS Key

You must create a symmetric encryption AWS Key using the Key Management Service (KMS).

From your AWS console, select the Services > Security, Identity, & Compliance > Key Management Service from the adjacent submenu.

From there you can click the “Create key” button and follow the instructions.

Walkthrough

To do this task manually, go to Amazon RDS Service (Services > Database > RDS) and select the database to export from your list of databases.

Select the “Maintenance & backups” tab. Go to the “Snapshots” section.

You can select an existing automated snapshot or manually create a new snapshot with the “Take snapshot” button

Once the snapshot is complete, click the snapshot’s name.

From the “Actions” menu in the upper right select “Export to Amazon S3"

Enter a unique export identifier
Choose whether you want to export all or part of your data (You will be exporting to Parquet)
Choose the S3 bucket
Choose or create your designated IAM role for backup
Choose your AWS KMS Key
Click the Export button

Once the Status column of export is "Complete", you can click the link to the export under the S3 bucket column.

Within the export in the S3 bucket, you will find a series of folders corresponding to the different database entities that were exported.

Exported data for specific tables is stored in the format base_prefix/files, where the base prefix is the following:

export_identifier/database_name/schema_name.table_name/

For example:

export-1234567890123-459/rdststdb/rdststdb.DataInsert_7ADB5D19965123A2/

The current convention for file naming is as follows:

partition_index/part-00000-random_uuid.format-based_extension

For example:

1/part-00000-c5a881bb-58ff-4ee6-1111-b41ecff340a3-c000.gz.parquet
2/part-00000-d7a881cc-88cc-5ab7-2222-c41ecab340a4-c000.gz.parquet
3/part-00000-f5a991ab-59aa-7fa6-3333-d41eccd340a7-c000.gz.parquet

You may download these parquet files and upload them to Nightfall to scan as you would any other parquet file.

📘Obtaining file size
You can obtain the value for fileSizeBytes you can run the command wc -c

#Start the upload
curl --location --request POST 'https://api.nightfall.ai/v3/upload' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer NF-<Your API Key>' \
--data-raw '{
    "fileSizeBytes": <Your File Size>,
    "mimeType" : "application/zip"
}'

#Resulting payload
{"id":"<Your File Upload ID>","fileSizeBytes":3693,"chunkSize":10485760,"mimeType":"application/zip"}

#Post the file using the 'id' from the returned payload in your path
curl --location --request PATCH 'https://api.nightfall.ai/v3/upload/<Your File Upload ID>' \
--header 'X-Upload-Offset: 0' \
--header 'Content-Type: application/octet-stream' \
--header 'Authorization: Bearer NF-<Your API Key>' \
--data-binary '@///Users/myuser/yourfilepath/userdata1.parquet'

#Finish the upload

curl --location --request POST 'https://api.nightfall.ai/v3/upload/<Your File Upload ID>/finish' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer  NF-<Your API Key>'

# Scan the file using an alert config

curl --request POST \
     --url https://api.nightfall.ai/v3/upload/<Your File Upload ID>/scan \
     --header 'Accept: application/json' \
     --header 'Authorization: Bearer NF-<Your API Key>' \
     --header 'Content-Type: application/json' \
     --data '
{
     "policy": {
          "detectionRuleUUIDs": [
               "<Your detection Rule UUID>"
          ],
          "alertConfig": {
               "email": {
                    "address": "<Your Email Address>"
               }
          }
     },
     "requestMetadata": "scan of parquet file"
}
'

In the above sequence of curl invocations, we upload the file and then initiate the file scan with a policy that uses pre-configured detection rule as well as an alertConfig that send the results to an email address.

Note that results you receive in this case will be an attachment with a JSON payload as follows:

{
   "errors":null,
   "findingsPresent":true,
   "findingsURL":"https://files.nightfall.ai/18b8b6b8-59c9-4891-9f92-9c357acc19cd.json?Expires=1655306887&Signature=ASDFcCfQpEush3QrWmdZX9A9RePQjNHZTfRkBlgPwdPf~RcNnPYgYzt3G4AAzkI8IDbUdc4CzBbAROTx0oYOtNODCTdoKHKB7Q0a7~hzRNx3BHYHH1msdhkS1qTl3z82RCh6DZi~nk~Oa~yt-XZvAf3ui4MyNU0wyfqjbKO9o79Ec9YWqMdUmTOP1Ss39YmA71e6ky0VOdjdN4baoQV5VElTQ1rHrkdgYHz-95Dnzd3YK3IxGQR92AU7KA3X-rrcmpIJwMUIJSsl8~or0WIg5ar4U9Ood1BFSE~GmlQsKclEo1LEaX2KclWaQtjmN9~3IQnxOmkhPeAhEt-5n~Hbug__&Key-Pair-Id=ASDFOPZ1EKX0YC",
   "requestMetadata":"scan of go sdk for URLs sdk",
   "uploadID":"18b8b6b8-59c9-4891-9f92-9c357acc19cd",
   "validUntil":"2022-06-15T15:28:07.86163221Z"
}

The findings themselves will be available at the URL specified in findingsURL until the date-time stamp contained in the validUntil property.

When parquet files are analyzed, as with other tabular data, not only will the the location of the finding be shown within a given byte range, but also column and row data as well.

Below is a SQL script small table of generated data containing example personal data, including phone numbers and email addresses.

DROP TABLE IF EXISTS `myTable`;

CREATE TABLE `myTable` (
  `id` mediumint(8) unsigned NOT NULL auto_increment,
  `name` varchar(255) default NULL,
  `phone` varchar(100) default NULL,
  `email` varchar(255) default NULL,
  `address` varchar(255) default NULL,
  `postalZip` varchar(10) default NULL,
  `region` varchar(50) default NULL,
  `country` varchar(100) default NULL,
  `alphanumeric` varchar(255),
  `text` TEXT default NULL,
  PRIMARY KEY (`id`)
) AUTO_INCREMENT=1;

INSERT INTO `myTable` (`name`,`phone`,`email`,`address`,`postalZip`,`region`,`country`,`alphanumeric`,`text`)
VALUES
  ("Malcolm Mcgee","1-831-777-4886","[email protected]","109-1617 Augue Av.","873766","Delta","Ukraine","HPJ88FSI6HJ","in faucibus orci luctus et ultrices posuere cubilia Curae Phasellus"),
  ("Harrison Dudley","(645) 987-7967","[email protected]","P.O. Box 311, 4823 Odio Street","81398-37524","Nordland","Germany","INL95CND6TF","egestas nunc sed libero. Proin sed turpis nec mauris blandit"),
  ("Driscoll Callahan","1-598-623-3631","[email protected]","689-226 Eu St.","4534 JV","California","Chile","QQI55BTP0CS","velit dui, semper et, lacinia vitae, sodales at, velit. Pellentesque"),
  ("Anne Rollins","(558) 943-1159","[email protected]","5361 Enim, Street","176814","Minas Gerais","Brazil","EVO25RST5RM","cursus vestibulum. Mauris magna. Duis dignissim tempor arcu. Vestibulum ut"),
  ("Noah Townsend","(514) 311-3416","[email protected]","797-7375 Consectetuer Ave","5177","Tây Ninh","Chile","CNG87EJF4EK","libero at auctor ullamcorper, nisl arcu iaculis enim, sit amet");

Below is an example finding when a scan is done of the resulting parquet exported to S3 where the Detection Rule use Nightfall's built in Detectors for matching phone numbers and emails. In this example shows a match in the 1st row and and 4th column. This is what we would expect based on our table structure.

{
   "findings":[
      {
         "detector":{
            "id":"89f810aa-64a5-4269-b0a0-110d250d55ee",
            "name":"email address"
         },
         "finding":"[email protected]",
         "confidence":"LIKELY",
         "location":{
            "byteRange":{
               "start":36,
               "end":51
            },
            "codepointRange":{
               "start":36,
               "end":51
            },
            "lineRange":{
               "start":1,
               "end":1
            },
            "rowRange":{
               "start":1,
               "end":1
            },
            "columnRange":{
               "start":4,
               "end":4
            },
            "commitHash":""
         },
         "matchedDetectionRuleUUIDs":[
            "950833c9-8608-4c66-8a3a-0734eac11157"
         ],
         "matchedDetectionRules":[
            
         ]
      },

similarly, it also finds phone numbers in the 3rd column.

{
         "detector":{
            "id":"d08edfc4-b5e2-420a-a5fe-3693fb6276c4",
            "name":"Phone number",
            "version":1
         },
         "finding":"514) 311-3416",
         "confidence":"POSSIBLE",
         "location":{
            "byteRange":{
               "start":814,
               "end":827
            },
            "codepointRange":{
               "start":814,
               "end":827
            },
            "lineRange":{
               "start":5,
               "end":5
            },
            "rowRange":{
               "start":5,
               "end":5
            },
            "columnRange":{
               "start":3,
               "end":3
            },
            "commitHash":""
         },
         "matchedDetectionRuleUUIDs":[
            "950833c9-8608-4c66-8a3a-0734eac11157"
         ],
         "matchedDetectionRules":[
            
         ]
      }

You may also use our tutorial for Integrating with Amazon S3 (Python) to scan through the S3 objects.

For more information please see the Amazon documentation Exporting DB snapshot data to Amazon S3

Amazon S3 DLP Tutorial

AWS S3 is a popular tool for storing your data in the cloud, however, it also has huge potential for unintentionally leaking sensitive data. By utilizing AWS SDKs in conjunction with Nightfall’s Scan API, you can discover, classify, and remediate sensitive data within your S3 buckets.

Prerequisites

You will need the following for this tutorial:

AWS credentials
A Nightfall API key
An existing Nightfall Detection Rule
A Python 3 environment
most recent version of the Nightfall Python SDK

We will use boto3 as our AWS client in this demo. If you are using another language, check this page for AWS's recommended SDKs.

Installation

To install boto3 and the Nightfall SDK, run the following command.

pip install boto3
pip install nightfall=1.2.0

Implementation

In addition to boto3, we will be utilizing the following Python libraries to interact with the Nightfall SDK and to process the data.

import boto3
import requests
import json
import csv
import os
from nightfall import Nightfall

We've configured our AWS credentials, as well as our Nightfall API key, as environment variables so they don't need to be committed directly into our code.

aws_session_token = os.environ.get('AWS_SESSION_TOKEN')
aws_access_key_id = os.environ.get('AWS_ACCESS_KEY_ID')
aws_secret_access_key = os.environ.get('AWS_SECRET_ACCESS_KEY')

nightfall_api_key = os.environ.get('NIGHTFALL_API_KEY')

Next we define the Detection Rule with which we wish to scan our data. The Detection Rule can be pre-made in the Nightfall web app and referenced by UUID. Also, we extract our API Key, and abstract a nightfall class from the SDK, for it.

detectionRuleUUID = os.environ.get('DETECTION_RULE_UUID')

nightfall = Nightfall(os.environ['NIGHTFALL_API_KEY'])

Now we create an iterable of scannable objects in our target S3 buckets, and specify a maximum file size to pass to the Nightfall API (500 KB). In practice, you could add additional code to chunk larger files across multiple API requests.

We will also create an all_findings object to store Nightfall Scan results. The first row of our all_findings object will constitute our headers, since we will dump this object to a CSV file later.

This example will include the full finding below. As the finding might be a piece of sensitive data, we recommend using the Redaction feature of the Nightfall API to mask your data.

objects_to_scan = []
size_limit = 475000

all_findings = []
all_findings.append(
  [
    'bucket', 'object', 'detector', 'confidence', 
    'finding_byte_start', 'finding_byte_end',
    'finding_codepoint_start', 'finding_codepoint_end', 'fragment'
  ]
)

We will now initialize our AWS S3 Session. Once the session is established, we get a handle for the S3 resource.

my_session = boto3.session.Session(
  aws_session_token = aws_session_token,
  aws_access_key_id = aws_access_key_id,
  aws_secret_access_key = aws_secret_access_key
)

s3 = my_session.resource('s3')

Now we go through each bucket and retrieve the scannable objects, adding their text contents to objects_to_scan as we go.

In this tutorial, we assume that all files are text-readable. In practice, you may wish to filter out un-scannable file types such as images with the object.get()['ContentType'] property.

for bucket in s3.buckets.all():
  for obj in bucket.objects.all():
    temp_object = obj.get()
    size = temp_object['ContentLength']

    if size < size_limit:
      objects_to_scan.append((obj, temp_object['Body'].read().decode()))

For each object content we find in our S3 buckets, we send it as a payload to the Nightfall Scan API with our previously configured detectors.

request-responseOn receiving the request-response, we break down each returned finding and assign it a new row in the CSV we are constructing.

In this tutorial, we scope each object to be scanned with its API request. At the cost of granularity, you may combine multiple smaller files into a single call to the Nightfall API.

for obj, data in objects_to_scan:
    findings, redactions = nightfall.scan_text(
        [data],
        detection_rule_uuids=[detectionRuleUUID]
    )

    for finding in findings[0]:
        row = [
            obj.bucket_name,
            obj.key,
            finding.detector_name,
            finding.confidence.value,
            finding.byte_range.start,
            finding.byte_range.end,
            finding.codepoint_range.start,
            finding.codepoint_range.end,
            finding.finding,
        ]
        all_findings.append(row)

Now that we have finished scanning our S3 buckets and collated the results, we are ready to export them to a CSV file for further review.

if len(all_findings) > 1:
    with open('output_file.csv', 'w') as output_file:
        csv_writer = csv.writer(output_file, delimiter = ',')
        csv_writer.writerows(all_findings)
else:
      print('No sensitive data detected. Hooray!')

That's it! You now have insight into all of the sensitive data inside your data stored inside your organization's AWS S3 buckets.

As a next step, you could attempt to delete or redact your files in which sensitive data has been found by further utilizing boto3.

Using the File Scanning Endpoint with S3

The example above is specific to the Nightfall Text Scanning API. To scan files, we can use a similar process as we did the text scanning endpoint. The process is broken down in the sections below, as the file scanning process is more intensive.

Prerequisites for File Scanning

To utilize the File Scanning API you need the following:

An active API Key authorized for file scanning passed via the header Authorization: Bearer — see Authentication and Security
A Nightfall Detection Policy associated with a webhook URL
A web server configured to listen for file scanning results (more information below)

File Scan Implementation

Retrieve a File List

The first step is to get a list of files in your S3 buckets/objects

Similar to the process at the beginning of this tutorial for the text scanning endpoint, we will now initialize our AWS S3 Session. Once the session is established, we get a handle for the S3 resource.

my_session = boto3.session.Session(
  aws_session_token = aws_session_token,
  aws_access_key_id = aws_access_key_id,
  aws_secret_access_key = aws_secret_access_key
)

s3 = my_session.resource('s3')

Now we go through each bucket and retrieve the scannable objects.

for bucket in s3.buckets.all():
  for obj in b.objects.all():
    # here we can call the file-scanning endpoints

For each object content we find in our S3 buckets, we send it as an argument to the Nightfall File Scan API with our previously configured detectors.

Iterate through a list of files and begin the file upload process.

Once the files have been uploaded, begin using the scan endpoint.

A webhook server is required for the scan endpoint to submit its results. See our example webhook server.

The scanning endpoint will work asynchronously for the files uploaded, so you can monitor the webhook server to see the API responses and file scan findings as they come in.

Elasticsearch DLP Tutorial

Elasticsearch is a popular tool for storing, searching, and analyzing all kinds of structured and unstructured data, especially as a part of the larger ELK stack. However, along with all data storage tools, there is huge potential for unintentionally leaking sensitive data. By utilizing Elastic's own REST APIs in conjunction with Nightfall AI’s Scan API, you can discover, classify, and remediate sensitive data within your Elastic stack.

You can follow along with your own instance or spin up a sample instance with the commands listed below. By default, you will be able to download and interact with sample datasets from the elk instance at localhost:5601. Your data can be queried from localhost:9200. The "Add sample data" function can be found underneath the Observability section on the Home page; in this tutorial we reference the "Sample Web Logs" dataset..

docker pull sebp/elk

docker run -p 5601:5601 -p 9200:9200 -p 5044:5044 -it --name elk sebp/elk

You will need a few things to follow along with this tutorial:

An Elasticsearch instance with data to query
A Nightfall API key
An existing Nightfall Detection Rule
A Python 3 environment (version 3.7 or later)
Python Nightfall SDK

We will need to install the nightfall sdk library and the requests library using pip.

pip install nightfall=1.2.0
pip install requests

We will be using Python and importing the following libraries:

import requests
import json
import os
import csv
from nightfall import Nightfall

We first configure the URLs to communicate with. If you are following along with the Sample Web Logs dataset alluded to at the beginning of this article, you can copy this Elasticsearch URL. If not, your URL will probably take the format http://<hostname>/<index_name>/_search.

elasticsearch_base_url = 'http://localhost:9200/kibana_sample_data_logs/_search'

Next we define the Detection Rule with which we wish to scan our data. The Detection Rule can be pre-made in the Nightfall web app and referenced by UUID.

Also, we abstract a nightfall class from the SDK, from our API key.

nightfall_api_key = os.environ.get('NIGHTFALL_API_KEY')

detectionRuleUUID = os.environ.get('DETECTION_RULE_UUID')

nightfall = Nightfall(nightfall_api_key)

We now construct the payload and headers for our call to Elasticsearch. The payload represents whichever subset of data you wish to query. In this example, we are querying all results from the previous hour.

We then make our call to the Elasticsearch data store and save the resulting response.

elasticsearch_payload = {"query": {"range": {"timestamp": {"gte" : "now-1h"}}}}

elasticsearch_headers = {
  'Content-Type': 'application/json'
}

elasticsearch_response = requests.get(
  url = elasticsearch_base_url,
  headers = elasticsearch_headers,
  params = elasticsearch_payload
)

logs = elasticsearch_response.json()['hits']['hits']

Now we send our Elasticsearch query results to the Nightfall SDK for scanning.

findings, redactions = nightfall.scan_text(
    [json.dumps(l) for l in logs],
    detection_rule_uuids=[detectionRuleUUID]
)

We will create an all_findings object to store Nightfall Scan results. The first row of our all_findings object will constitute our headers, since we will dump this object to a CSV file later.

This example will include the full finding below. As the finding might be a piece of sensitive data, we would recommend using the Redaction feature of the Nightfall API to mask your data. More information can be seen in the 'Using Redaction to Mask Findings' section below.

all_findings = []
all_findings.append(
    [
        'index_name', 'log_id', 'detector', 'confidence', 
        'finding_start', 'finding_end', 'finding'
    ]
)

Next we go through our findings from the Nightfall Scan API and match them to the identifying fields from the Elasticsearch index so we can find them and remediate them in situ.

Finding locations here represent the location within the log as a string. Finding locations can also be found in byteRange.

# Each top level item in findings corresponds to one log
for log_idx, log_findings in enumerate(findings):
    for finding_idx, finding in enumerate(log_findings):

        index_name = logs[log_idx]['_index']
        log_id = logs[log_idx]['_id']

        row = [
            index_name,
            log_id,
            finding.detector_name,
            finding.confidence.value,
            finding.byte_range.start,
            finding.byte_range.end,
            finding.finding,
        ]
        all_findings.append(row)

Finally, we export our results to a csv so they can be easily reviewed.

if len(all_findings) > 1:
    with open('output_file.csv', 'w') as output_file:
        csv_writer = csv.writer(output_file, delimiter = ',')
        csv_writer.writerows(all_findings)
else:
    print('No sensitive data detected. Hooray!')

That's it! You now have insight into all sensitive data shared inside your Elasticsearch instance within the past hour.

However, in use-cases such as this where the data is well-structured, it can be more informative to call out which fiels are found to contain sensitive data, as opposed to the location of the data. While the above script is easy to implement without modifying the queried data, it does not provide insight into these fields.

Using Redaction to Mask Findings

the Nightfall API, you are also able to redact and mask your Elasticsearch findings. You can add a Redaction Config, as part of your Detection Rule. For more information on how to use redaction, and its specific options, please refer to the guide here.

Using the File Scanning Endpoint with Elasticsearch

Prerequisites:

To utilize the File Scanning API you need the following:

An active API Key authorized for file scanning passed via the header Authorization: Bearer — see Authentication and Security
A Nightfall Detection Policy associated with a webhook URL
A web server configured to listen for file scanning results (more information below)

Steps to use the Endpoint

Retrieve data from Elasticsearch

Similar to the process in the beginning of this tutorial for the text scanning endpoint, we will now initialize our and retrieve the data we like, from Elasticsearch:

elasticsearch_base_url = 'http://localhost:9200/kibana_sample_data_logs/_search'

elasticsearch_payload = {"query": {"range": {"timestamp": {"gte" : "now-1h"}}}}

elasticsearch_headers = {
    'Content-Type': 'application/json'
}

elasticsearch_response = requests.get(
    url = elasticsearch_base_url,
    headers = elasticsearch_headers,
    params = elasticsearch_payload
)

logs = elasticsearch_response.json()['hits']['hits']

Now we go through write the logs to a .csv file.

filename = "nf_elasticsearch_input-" + str(int(time.time())) + ".csv"  

with open(filename, 'w') as output_file:
    csv_writer = csv.writer(output_file, delimiter=',')
    csv_writer.writerows([json.dumps(l) for l in logs])
     
print("Elasticsearch Data Written to: ", filename)

Begin the file upload process to the Scan API, with the above written .csv file, as shown here.
Once the files have been uploaded, begin using the scan endpoint mentioned here. Note: As can be seen in the documentation, a webhook server is required for the scan endpoint, to which it will send the scanning results. An example webhook server setup can be seen here.
The scanning endpoint will work asynchronously for the files uploaded, so you can monitor the webhook server to see the API responses and file scan findings as they come in.

Snowflake DLP Tutorial

Snowflake is a data warehouse built on top of the Amazon Web Services or Microsoft Azure cloud infrastructure. This tutorial demonstrates how to use the Nightfall API for scanning a Snowflake database.

This tutorial allows you to scan your Snowflake databases using the Nightfall API/SDK.

You will need a few things first to use this tutorial:

A Snowflake account with at least one database
A Nightfall API key
An existing Nightfall Detection Rule
Most recent version of Python Nightfall SDK

We will first install the required Snowflake Python connector modules and the Nightfall SDK that we need to work with:

pip install snowflake-connector-python
pip install nightfall=0.6.0

To accomplish this, we will be using Python and importing the following libraries:

import requests
import snowflake.connector
import os
import sys
import json
from nightfall import Nightfall

We will set the size and length limits for data allowed by the Nightfall API per request. Also, we extract our API Key, and abstract a nightfall class from the SDK, for it.

size_limit = 500000
length_limit = 50000

Next we extract our API Key, and abstract a nightfall class from the SDK, for it.

nightfall = Nightfall(os.environ['NIGHTFALL_API_KEY'])

Next we define the Detection Rule with which we wish to scan our data. The Detection Rule can be pre-made in the Nightfall web app and referenced by UUID.

detectionRuleUUID = os.environ.get('DETECTION_RULE_UUID')

First we will set up the connection with Snowflake, and get the data to be scanned from there.

Note, we are setting the Snowflake authentication information as the below environment variables, and referencing the values from there:

SNOWFLAKE_USER
SNOWFLAKE_PASSWORD
SNOWFLAKE_ACCOUNT
SNOWFLAKE_DATABASE
SNOWFLAKE_SCHEMA
SNOWFLAKE_TABLE
SNOWFLAKE_PRIMARY_KEY

connection = snowflake.connector.connect(
  user=os.environ.get('SNOWFLAKE_USER'),
  password=os.environ.get('SNOWFLAKE_PASSWORD'),
  account=os.environ.get('SNOWFLAKE_ACCOUNT'),
  schema=os.environ.get('SNOWFLAKE_SCHEMA'),
  database=os.environ.get('SNOWFLAKE_DATABASE')
)
table_name = os.environ.get('SNOWFLAKE_TABLE')
primary_key = os.environ.get('SNOWFLAKE_PRIMARY_KEY')

cursor = connection.cursor()

sql = f"""
        SELECT *
        FROM {table_name}
        LIMIT 1000;
        """

cursor.execute(sql)

cols = [i[0] for i in cursor.description]
data = cursor.fetchall()

We can then check the data size, and as long as it is below the aforementioned limits, can be ran through the API.

If the data payloads are larger than the size or length limits of the API, extra code will be required to further chunk the data into smaller bits that are processable by the Nightfall scan API.

This can be seen in the second and third code panes below:

primary_key_col = []

if len(data) == 0:
  raise Exception('Table is empty! No data to scan.')

all_findings = []
for col_idx, col in enumerate(columns):
    payload = [str(i[col_idx]) for i in data]
    if col == primary_key:
      primary_key_col = payload
      col_size = sys.getsizeof(payload)

    if col_size < size_limit:
   	 resp = nightfall.scanText(
        [payload],
        detection_rule_uuids=[detectionRuleUUID])
    
     col_resp = json.loads(resp)

for item_idx, item in enumerate(col_resp):
  if item != None:
    for finding in item:
      finding['column'] = col
      try:
        finding['index'] = primary_key_col[item_idx]
      except:
          finding['index'] = item_idx
      all_findings.append(finding)

col_resp = []
chunks = []
chunk = []
running_size = 0
big_items = []

for item_idx, item in enumerate(payload):
  item_size = sys.getsizeof(item)
  if (running_size + item_size < size_limit) and (len(chunk) < length_limit):
    chunk.append(item)
    running_size += item_size
  elif item_size < size_limit:
    chunks.append(chunk)
    chunk = [item]
    running_size = item_size
  else:
    if len(chunk) < length_limit:
      chunk.append('')
    else:
      chunks.append(chunk)
      chunk = ['']
      big_items.append(item_idx)
      chunks.append(chunk)

chunk_cursor = 0

for chunk in chunks:
  resp = nightfall.scanText({
        "text": [chunk],
        "detectionRuleUUIDs": [conditionSetUUID]})
  col_resp.extend(json.loads(resp.text))
  chunk_cursor += len(chunk)
  
for item_idx, item in enumerate(col_resp):
  if item != None:
    for finding in item:
      finding['column'] = col
      try:
        finding['index'] = primary_key_col[item_idx]
      except:
          finding['index'] = item_idx
      all_findings.append(finding)

for big in big_items:
  item_size = sys.getsizeof(big)
    chunks_req = (item_size // size_limit) + 1
    chunk_len = len(item) // chunks_req
    cursor = 0
    item_findings = []
    for _ in range(chunks_req):
        p = item[cursor : min(cursor + chunk_len, len(item))]
        resp = nightfall.scanText({
        "text": [[p]],
        "detectionRuleUUIDs": [conditionSetUUID]})
        item_findings.extend(json.loads(resp.text))
        cursor += chunk_len
  
  if item_findings == []:
    raise Exception(f"Error while scanning large item at column {col}, Index {primary_key_col[big]}")
  for find_chunk in item_resp:
      if find_chunk != None:
        for finding in find_chunk:
          finding['column'] = col
          try:
            finding['index'] = primary_key_col[big]
          except:
            finding['index'] = big
          all_findings.append(finding)

To review the results, we will print the number of findings, and write the findings to an output file:

print(f"{len(all_findings)} sensitive findings in {os.environ.get('SNOWFLAKE_TABLE')}")
with open('snowflake_findings.json', 'w') as output_file:
  json.dump(all_findings, output_file)

The following are potential ways to continue building upon this service:

Writing Nightfall results to a database and reading that into a visualization tool
Redacting sensitive findings in place once they are detected, either automatically or as a follow-up script once findings have been reviewed

Using Redaction to Mask Findings

With the Nightfall API, you are also able to redact and mask your Snowflake findings. You can add a Redaction Config, as part of your Detection Rule. For more information on how to use redaction, and its specific options, please refer to the guide here.

Using the File Scanning Endpoint with Snowflake

Prerequisites

To utilize the File Scanning API you need the following:

An active API Key authorized for file scanning passed via the header Authorization: Bearer — see Authentication and Security
A Nightfall Detection Policy associated with a webhook URL
A web server configured to listen for file scanning results (more information below)

Steps to use the Endpoint

Retrieve data from Snowflake

Similar to the process in the beginning of this tutorial for the text scanning endpoint, we will now initialize our Snowflake Connection. Once the session is established, we can query from Snowflake.

connection = snowflake.connector.connect(
  user=os.environ.get('SNOWFLAKE_USER'),
  password=os.environ.get('SNOWFLAKE_PASSWORD'),
  account=os.environ.get('SNOWFLAKE_ACCOUNT'),
  schema=os.environ.get('SNOWFLAKE_SCHEMA'),
  database=os.environ.get('SNOWFLAKE_DATABASE')
)
table_name = os.environ.get('SNOWFLAKE_TABLE')
primary_key = os.environ.get('SNOWFLAKE_PRIMARY_KEY')

cursor = connection.cursor()

sql = f"""
        SELECT *
        FROM {table_name}
        LIMIT 1000;
        """

cursor.execute(sql)

cols = [i[0] for i in cursor.description]
data = cursor.fetchall()

Now we go through the data and write to a .csv file.

primary_key_col = []

if len(data) == 0:
  raise Exception('Table is empty! No data to scan.')

filename = "nf_snowflake_input-" + str(int(time.time())) + ".csv"  

for col_idx, col in enumerate(columns):
    payload = [str(i[col_idx]) for i in data]   
    with open(filename, 'w') as output_file:
      csv_writer = csv.writer(output_file, delimiter=',')
      csv_writer.writerows(payload)
     
print("Snowflake Data Written to: ", filename)

Begin the file upload process to the Scan API, with the above written .csv file, as shown here.
Once the files have been uploaded, begin using the scan endpoint mentioned here. Note: As can be seen in the documentation, a webhook server is required for the scan endpoint, to which it will send the scanning results. An example webhook server setup can be seen here.
The scanning endpoint will work asynchronously for the files uploaded, so you can monitor the webhook server to see the API responses and file scan findings as they come in.