Amazon S3 DLP Tutorial

Amazon S3 is a popular service for storing your data in the cloud; however, it also carries significant potential for unintentionally leaking sensitive data. By using the AWS SDKs in conjunction with Nightfall’s Scan API, you can discover, classify, and remediate sensitive data within your S3 buckets.

Prerequisites

You will need the following for this tutorial:

  • An active Nightfall API key

  • A Nightfall Detection Rule created in the Nightfall web app, and its UUID

  • AWS credentials (access key ID, secret access key, and session token) with read access to the S3 buckets you want to scan

  • A Python 3 environment with pip

We will use boto3 as our AWS client in this demo. If you are using another language, check this page for AWS's recommended SDKs.

Installation

To install boto3 and the Nightfall SDK, run the following commands.

pip install boto3
pip install nightfall==1.2.0

Implementation

In addition to boto3, we will be utilizing the following Python libraries to interact with the Nightfall SDK and to process the data.

import boto3
import requests
import json
import csv
import os
from nightfall import Nightfall

We've configured our AWS credentials, as well as our Nightfall API key, as environment variables so they don't need to be committed directly into our code.

aws_session_token = os.environ.get('AWS_SESSION_TOKEN')
aws_access_key_id = os.environ.get('AWS_ACCESS_KEY_ID')
aws_secret_access_key = os.environ.get('AWS_SECRET_ACCESS_KEY')

nightfall_api_key = os.environ.get('NIGHTFALL_API_KEY')

Next we define the Detection Rule with which we wish to scan our data. The Detection Rule can be pre-made in the Nightfall web app and referenced by its UUID. We also instantiate a Nightfall client from the SDK using the API key we read above.

detectionRuleUUID = os.environ.get('DETECTION_RULE_UUID')

nightfall = Nightfall(nightfall_api_key)

Now we create an iterable of scannable objects in our target S3 buckets, and specify a maximum size for the content we pass to the Nightfall API (475 KB, comfortably under the API's 500 KB payload limit). In practice, you could add code to chunk larger files across multiple API requests, as sketched after the bucket iteration below.

We will also create an all_findings object to store Nightfall Scan results. The first row of our all_findings object will constitute our headers, since we will dump this object to a CSV file later.

This example includes the full finding text in the results. As a finding may itself be sensitive data, we recommend using the Redaction feature of the Nightfall API to mask it; a sketch of configuring redaction follows the code below.

objects_to_scan = []
size_limit = 475000

all_findings = []
all_findings.append(
  [
    'bucket', 'object', 'detector', 'confidence', 
    'finding_byte_start', 'finding_byte_end',
    'finding_codepoint_start', 'finding_codepoint_end', 'fragment'
  ]
)
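
If you would rather not store findings verbatim, the SDK also lets you attach a redaction configuration to an inline Detection Rule, so that scan_text returns masked findings and a redacted copy of each payload. The snippet below is a minimal sketch assuming the SDK's Detector, DetectionRule, RedactionConfig, and MaskConfig classes as documented for version 1.2.0; the credit card detector and masking choices are illustrative only.

from nightfall import Confidence, DetectionRule, Detector, RedactionConfig, MaskConfig

# Illustrative inline rule: mask credit card findings, leaving 4 characters visible.
masked_cc_rule = DetectionRule([
  Detector(
    min_confidence=Confidence.LIKELY,
    nightfall_detector="CREDIT_CARD_NUMBER",
    display_name="Credit Card Number",
    redaction_config=RedactionConfig(
      remove_finding=False,
      mask_config=MaskConfig(
        masking_char="X",
        num_chars_to_leave_unmasked=4
      )
    )
  )
])

# The second return value holds a redacted copy of each payload.
findings, redacted_payloads = nightfall.scan_text(
  ["4916-6734-7572-5015 is my credit card"],
  detection_rules=[masked_cc_rule]
)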

We will now initialize our AWS S3 Session. Once the session is established, we get a handle for the S3 resource.

my_session = boto3.session.Session(
  aws_session_token = aws_session_token,
  aws_access_key_id = aws_access_key_id,
  aws_secret_access_key = aws_secret_access_key
)

s3 = my_session.resource('s3')

Now we go through each bucket and retrieve the scannable objects, adding their text contents to objects_to_scan as we go.

In this tutorial, we assume that all files are text-readable. In practice, you may wish to filter out un-scannable file types such as images by checking the object.get()['ContentType'] property, as in the sketch after this loop.

for bucket in s3.buckets.all():
  for obj in bucket.objects.all():
    temp_object = obj.get()
    size = temp_object['ContentLength']

    if size < size_limit:
      objects_to_scan.append((obj, temp_object['Body'].read().decode()))
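
If you want to filter by content type and handle objects larger than the size limit, a hedged variation of the loop above might look like the following. The text/ prefix check and the character-based chunking are illustrative assumptions, not requirements of the API.

for bucket in s3.buckets.all():
  for obj in bucket.objects.all():
    temp_object = obj.get()

    # Skip objects that are not text-readable, e.g. images or binaries.
    if not temp_object.get('ContentType', '').startswith('text/'):
      continue

    body = temp_object['Body'].read().decode()

    # Split oversized objects into size_limit-sized pieces so each piece fits
    # in a single request. Slicing by characters approximates the byte limit
    # for mostly-ASCII text; adjust if your data contains many multi-byte characters.
    for start in range(0, len(body), size_limit):
      objects_to_scan.append((obj, body[start:start + size_limit]))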

For each object content we find in our S3 buckets, we send it as a payload to the Nightfall Scan API with our previously configured detectors.

On receiving the response, we break down each returned finding and append it as a new row to the CSV we are constructing.

In this tutorial, we send one API request per object. At the cost of per-object granularity, you could combine multiple smaller payloads into a single call to the Nightfall API, as sketched after the loop below.

for obj, data in objects_to_scan:
    findings, redactions = nightfall.scan_text(
        [data],
        detection_rule_uuids=[detectionRuleUUID]
    )

    for finding in findings[0]:
        row = [
            obj.bucket_name,
            obj.key,
            finding.detector_name,
            finding.confidence.value,
            finding.byte_range.start,
            finding.byte_range.end,
            finding.codepoint_range.start,
            finding.codepoint_range.end,
            finding.finding,
        ]
        all_findings.append(row)
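
Because scan_text accepts a list of payloads and returns findings at the same index as the payload that produced them, you could instead batch several small objects into a single request. A minimal sketch, assuming objects_to_scan is non-empty and the combined payloads stay within the API's request size limits:

# Batch all collected payloads into one request instead of one request per object.
objs, payloads = zip(*objects_to_scan)
findings, _ = nightfall.scan_text(
    list(payloads),
    detection_rule_uuids=[detectionRuleUUID]
)

# findings[i] holds the findings for payloads[i], so results can still be
# attributed to the right object.
for obj, payload_findings in zip(objs, findings):
    for finding in payload_findings:
        all_findings.append([
            obj.bucket_name, obj.key,
            finding.detector_name, finding.confidence.value,
            finding.byte_range.start, finding.byte_range.end,
            finding.codepoint_range.start, finding.codepoint_range.end,
            finding.finding,
        ])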

Now that we have finished scanning our S3 buckets and collated the results, we are ready to export them to a CSV file for further review.

if len(all_findings) > 1:
    with open('output_file.csv', 'w') as output_file:
        csv_writer = csv.writer(output_file, delimiter = ',')
        csv_writer.writerows(all_findings)
else:
    print('No sensitive data detected. Hooray!')

That's it! You now have insight into the sensitive data stored in your organization's AWS S3 buckets.

As a next step, you could use boto3 further to delete, quarantine, or redact the files in which sensitive data was found.
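
For instance, a minimal sketch of deleting the flagged objects with boto3 follows; deletion is destructive, so in practice you would likely move objects to a quarantine bucket or require review first.

# Collect the (bucket, key) pairs that produced at least one finding.
# all_findings[0] is the header row; row[0] is the bucket, row[1] the key.
flagged = {(row[0], row[1]) for row in all_findings[1:]}

for bucket_name, key in flagged:
  s3.Object(bucket_name, key).delete()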

Using the File Scanning Endpoint with S3

The example above is specific to the Nightfall Text Scanning API. To scan files, we can use a process similar to the one we used with the text scanning endpoint. It is broken down into the sections below, as the file scanning process is more involved.

Prerequisites for File Scanning

To utilize the File Scanning API you need the following:

  • An active API Key authorized for file scanning passed via the header Authorization: Bearer — see Authentication and Security

  • A Nightfall Detection Policy associated with a webhook URL

  • A web server configured to listen for file scanning results (more information below)

File Scan Implementation

Retrieve a File List

The first step is to get a list of the objects in your S3 buckets.

Similar to the process at the beginning of this tutorial for the text scanning endpoint, we will now initialize our AWS S3 Session. Once the session is established, we get a handle for the S3 resource.

my_session = boto3.session.Session(
  aws_session_token = aws_session_token,
  aws_access_key_id = aws_access_key_id,
  aws_secret_access_key = aws_secret_access_key
)

s3 = my_session.resource('s3')

Now we go through each bucket and retrieve the scannable objects.

for bucket in s3.buckets.all():
  for obj in bucket.objects.all():
    # here we can call the file scanning endpoints for each object
    pass

For each object we find in our S3 buckets, we upload its contents to the Nightfall File Scan API and scan it with our previously configured detectors.

Iterate through a list of files and begin the file upload process.

Once the files have been uploaded, begin using the scan endpoint.
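
A minimal sketch of this flow using the SDK's scan_file helper is shown below. It assumes scan_file accepts a local file path and a webhook_url and returns an identifier plus a message, as in SDK version 1.2.0; the WEBHOOK_URL environment variable is our own convention and should point at the listener described next. Results are delivered asynchronously to the webhook rather than returned by the call.

webhook_url = os.environ.get('WEBHOOK_URL')

for bucket in s3.buckets.all():
  for obj in bucket.objects.all():
    # Download the object to a local temporary path so it can be uploaded
    # to the file scanning API.
    local_path = os.path.join('/tmp', os.path.basename(obj.key))
    bucket.download_file(obj.key, local_path)

    # scan_file uploads the file in chunks and kicks off an asynchronous scan.
    scan_id, message = nightfall.scan_file(local_path, webhook_url=webhook_url)
    print(f'Started scan {scan_id} for s3://{obj.bucket_name}/{obj.key}: {message}')

    os.remove(local_path)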

A webhook server is required for the scan endpoint to submit its results. See our example webhook server.

The scanning endpoint will work asynchronously for the files uploaded, so you can monitor the webhook server to see the API responses and file scan findings as they come in.
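
As a rough illustration (not a substitute for the example webhook server linked above), a minimal Flask listener might look like the sketch below. The /ingest route name is arbitrary, and the JSON challenge handling and payload field names reflect our reading of Nightfall's webhook documentation; verify them, along with the request signature checks, against the current docs.

from flask import Flask, request

app = Flask(__name__)

@app.route('/ingest', methods=['POST'])
def ingest():
  payload = request.get_json(silent=True) or {}

  # Nightfall validates a newly registered webhook URL by sending a challenge
  # that must be echoed back in the response body.
  if 'challenge' in payload:
    return payload['challenge']

  # Otherwise the payload describes the results of an asynchronous file scan.
  if payload.get('findingsPresent'):
    print('Sensitive findings available at:', payload.get('findingsURL'))

  return '', 200

if __name__ == '__main__':
  app.run()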
