Nightfall Sample Data Sets
These data sets include multiple examples of various data types, the expected result a DLP solution should produce for each, and a justification where appropriate. The data is de-identified and can be used to test detection in any DLP platform.
Nightfall testing notes: If you use the following test data in Nightfall, make sure your detection rules are set to a confidence level at or below the listed result. Detection rules with confidence thresholds higher than the listed result will not trigger.
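The threshold behavior described in the testing note can be sketched as a comparison on an ordered scale. This is a minimal illustration, not Nightfall's implementation: the level names and the `rule_triggers` helper are assumptions made for the example.

```python
# Assumed ordered confidence scale, lowest to highest (illustrative names).
CONFIDENCE_LEVELS = ["VERY_UNLIKELY", "UNLIKELY", "POSSIBLE", "LIKELY", "VERY_LIKELY"]


def rule_triggers(finding_confidence: str, rule_minimum: str) -> bool:
    """Return True when a finding meets or exceeds the rule's minimum confidence."""
    finding_rank = CONFIDENCE_LEVELS.index(finding_confidence)
    minimum_rank = CONFIDENCE_LEVELS.index(rule_minimum)
    return finding_rank >= minimum_rank


# A sample whose listed result is POSSIBLE will not trigger a rule set to LIKELY.
print(rule_triggers("POSSIBLE", "LIKELY"))
print(rule_triggers("LIKELY", "POSSIBLE"))
```

Under this model, a rule configured at the same level as the listed result still triggers; only a strictly higher threshold suppresses the finding.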
Detection engine efficacy can be broken down into two categories:
- Recall: Does the solution find the most sensitive data? When deployed in production, a detection engine must be able to find the most relevant or sensitive data amidst vast amounts of structured and unstructured data. A strong detection engine will be able to extract text from many different file types and handle each source appropriately (OCR for images, row/column context for tabular data). It should also take the context surrounding each finding into account and produce a confidence score, or likelihood of occurrence, so that recall can be tuned against precision. The quality of recall can be determined by confirming True Positives.
- Precision: While it is important that a DLP service accurately recognizes sensitive data such as credit cards or Social Security numbers, it is equally important that the solution weeds out findings that are not true positives. If there is too much noise, it becomes impossible to act on alerts and, if automated blocking is in place, end users can be significantly disrupted. The quality of precision can be determined by confirming True Negatives (e.g., a credit card number that does not start with the correct digits should not trigger).
In addition to these two factors, it is important to evaluate whether a detection engine can be trained and tuned over time. An engine that improves with training will become stronger as more production data passes through it.
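The two categories above can be turned into numbers once a test run has been scored against the expected results. The sketch below uses the standard definitions of recall and precision; the function name and the counts in the usage lines are hypothetical. Note that precision is computed from false positives: a benign sample that correctly does not trigger is a true negative, while one that does trigger counts as a false positive.

```python
def precision_recall(true_positives: int, false_positives: int, false_negatives: int):
    """Compute (precision, recall) from counts gathered in a detection test run."""
    detected = true_positives + false_positives   # everything the engine flagged
    actual = true_positives + false_negatives     # everything it should have flagged
    precision = true_positives / detected if detected else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


# Hypothetical run: 8 sensitive samples found, 2 benign samples flagged, 4 sensitive samples missed.
precision, recall = precision_recall(true_positives=8, false_positives=2, false_negatives=4)
print(f"precision={precision:.2f} recall={recall:.2f}")
```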
API keys pose a unique risk to organizations. The leak of a single API key can expose thousands of records or critical intellectual property. A strong detection engine will detect API keys in both text and files, using machine learning so that context informs the confidence level of each finding. It should also recognize the most common online services and check whether these keys are active, as active keys require the fastest remediation.
Nightfall testing note: If a finding is marked ‘Unverified’, it means that Nightfall was unable to access the service. If Nightfall was able to access it, then the key would be marked as ‘Active’.
1) API Key Screenshot
U.S. Social Security Numbers have a fairly well-defined format, but specific ranges are excluded and contextual information can be important to reduce false positives at low confidence levels.
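As a rough illustration of the format and excluded ranges mentioned above, the sketch below applies the publicly documented structural rules: an area number of 000, 666, or 900-999, a group number of 00, or a serial number of 0000 is never issued. The function name is hypothetical, and a real detection engine would combine a check like this with contextual signals rather than rely on it alone.

```python
import re

# Nine digits, optionally separated consistently by dashes or spaces (AAA-GG-SSSS).
SSN_PATTERN = re.compile(r"^(\d{3})([- ]?)(\d{2})\2(\d{4})$")


def is_plausible_ssn(candidate: str) -> bool:
    """Structural check only: rejects values that fall in never-issued ranges."""
    match = SSN_PATTERN.match(candidate.strip())
    if not match:
        return False
    area, _separator, group, serial = match.groups()
    if area in ("000", "666") or area.startswith("9"):
        return False  # excluded area numbers
    if group == "00" or serial == "0000":
        return False  # excluded group and serial numbers
    return True


print(is_plausible_ssn("123-45-6789"))
print(is_plausible_ssn("666-45-6789"))
```

A value that passes this check is only structurally plausible, which is why surrounding context still matters for reducing false positives.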
2) SSN (and CC) Screenshot
Credit card numbers can range from 14 to 19 digits, with 16 being the most common. The grouping of the digits can also vary. Card numbers must pass the Luhn check and begin with an appropriate Bank Identification Number, or BIN. Beyond that, contextual clues such as dashes or spaces and words such as ‘credit card’ should influence the confidence score of the results.
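The length, Luhn, and BIN checks described above can be sketched as follows. The prefix table is a small, non-exhaustive illustration (a few widely known Visa, Mastercard, American Express, and Discover prefixes), and the function names are hypothetical; a production engine would use a full BIN list and weigh contextual clues as well.

```python
# Illustrative, non-exhaustive leading-digit prefixes for a few major networks.
KNOWN_PREFIXES = ("4", "51", "52", "53", "54", "55", "34", "37", "6011", "65")


def luhn_valid(number: str) -> bool:
    """Luhn check: double every second digit from the right, sum, test mod 10."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return len(digits) > 0 and total % 10 == 0


def looks_like_card(candidate: str) -> bool:
    """Combine the length range from the text, a prefix check, and the Luhn check."""
    digits = "".join(ch for ch in candidate if ch.isdigit())
    if not 14 <= len(digits) <= 19:
        return False
    if not digits.startswith(KNOWN_PREFIXES):
        return False
    return luhn_valid(digits)


# 4111 1111 1111 1111 is a widely used test number, not a real account.
print(looks_like_card("4111 1111 1111 1111"))
print(looks_like_card("0000 0000 0000 0000"))
```

The second call shows why the prefix check matters: a string of zeros passes the Luhn check but does not begin with a recognized prefix, so it should not trigger.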
3) Credit Card File
4) Credit Card Screenshot