
Nightfall Sample Data Sets

These data sets include multiple examples of various data types, the expected result a DLP solution should produce, and a justification where appropriate. This data is de-identified and can be used to test detection in any DLP platform.
Nightfall testing note: If using the following test data in Nightfall, ensure your detection rules have a confidence setting at or below the listed result. Detection rules with confidence thresholds higher than the listed result will not trigger.
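
The threshold behavior in the note above can be sketched as a simple ordered comparison. This is only an illustration: the three levels are the ones named in this document, while the `Confidence` enum and the `rule_triggers` helper are hypothetical names, not part of any Nightfall API.

```python
from enum import IntEnum


class Confidence(IntEnum):
    """Confidence levels named in this document, ordered lowest to highest."""
    POSSIBLE = 1
    LIKELY = 2
    VERY_LIKELY = 3


def rule_triggers(finding: Confidence, minimum: Confidence) -> bool:
    """A rule fires only when the finding meets or exceeds the rule's minimum."""
    return finding >= minimum


# A rule set to "Likely" ignores a "Possible" finding but catches "Very Likely".
print(rule_triggers(Confidence.POSSIBLE, Confidence.LIKELY))
print(rule_triggers(Confidence.VERY_LIKELY, Confidence.LIKELY))
```

When a test row lists "Possible" as the expected result, a rule in this sketch would need its minimum set to `Confidence.POSSIBLE` to fire.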

Detection Success Criteria

Detection engine efficacy can be broken down into two categories:
  • Recall: Does the solution find the most sensitive data? When deployed in production, it is important that a detection engine is able to find the most relevant or sensitive data amidst vast amounts of structured and unstructured data. A strong detection engine will be able to extract text from many different file types and extract the content differently for each of those sources (OCR for images, row/column context for tabular data). Also, the context surrounding findings must be taken into consideration and should produce a confidence score or likelihood of occurrence so that one can tune the levels of recall vs. precision. The quality of recall can be determined by confirming True Positives.
  • Precision: While it is important that a DLP service accurately recognizes sensitive data such as credit cards or Social Security numbers, it is equally important that the solution weeds out findings that are not true positives. If there is too much noise, it becomes impossible to act on alerts and, if automated blocking is in place, end users can face significant disruption. The quality of precision can be determined by confirming True Negatives (i.e. a credit card number that does not start with the correct digits should not trigger).
In addition to these two factors, it is important to evaluate whether or not a detection engine can be trained and tuned over time. An improving detection engine will become even stronger the more production data passes through its systems.
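
When scoring a test run against the data sets below, the two measures can be computed from simple counts. The sketch below uses the standard definitions; the function names and the example counts are hypothetical. A negative sample that triggers anyway counts as a false positive, and a positive sample that is missed counts as a false negative.

```python
def precision(true_pos: int, false_pos: int) -> float:
    """Share of reported findings that were actually sensitive data."""
    flagged = true_pos + false_pos
    return true_pos / flagged if flagged else 0.0


def recall(true_pos: int, false_neg: int) -> float:
    """Share of the sensitive data present that was actually reported."""
    present = true_pos + false_neg
    return true_pos / present if present else 0.0


# Hypothetical run: 8 correct findings, 2 false alarms, 1 missed item.
print(f"precision = {precision(8, 2):.2f}")
print(f"recall    = {recall(8, 1):.2f}")
```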

API Keys

API keys represent unique vulnerabilities to organizations. The leak of a single API key can expose thousands of records or critical intellectual property. A strong detection engine detects API keys in both text and files, using machine learning so that the surrounding context informs the confidence level. It should also recognize keys for the most common online services and check whether they are active, since active keys require the fastest remediation.
Nightfall testing note: If a finding is marked ‘Unverified’, it means that Nightfall was unable to access the service. If Nightfall was able to access it, then the key would be marked as ‘Active’.
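
As a much simpler stand-in for the machine-learning approach described above, the sketch below pairs service-specific key prefixes with a check for nearby context words. The prefixes are taken from the sample inputs in this section; the character classes, minimum lengths, context word list, and the `find_api_keys` helper are all assumptions made for illustration, and the sketch does not check whether a key is active.

```python
import re

# Illustrative patterns only: the prefixes match the samples in this section,
# but the allowed characters and lengths are assumptions.
KEY_PATTERNS = {
    "Google": re.compile(r"\bAIza[0-9A-Za-z_-]{35}"),
    "GitHub": re.compile(r"\bgho_[0-9A-Za-z]{36,}"),
    "Stripe": re.compile(r"\bsk_live_[0-9A-Za-z]{24,}"),
}

# Words that should raise confidence when they appear near a candidate key.
CONTEXT_WORDS = re.compile(r"api|key|token|oauth", re.IGNORECASE)


def find_api_keys(text: str) -> list[tuple[str, str, bool]]:
    """Return (service, matched key, has_context) for each candidate key."""
    findings = []
    for service, pattern in KEY_PATTERNS.items():
        for match in pattern.finditer(text):
            # Search for context words in the text outside the key itself.
            surrounding = text[:match.start()] + text[match.end():]
            has_context = bool(CONTEXT_WORDS.search(surrounding))
            findings.append((service, match.group(), has_context))
    return findings


sample = 'service.api_key = "AIzaSyBJm7CoylH_zhLfmnLNoWDljBtVesNmSrw"'
for service, key, has_context in find_api_keys(sample):
    print(service, key, "with context" if has_context else "no context")
```

A prefix-based check like this produces no finding for the Jira-like JSON snippet in the test data, but a production engine would need many more patterns and a real confidence model.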
| Input Data | Expected result | Comments |
| --- | --- | --- |
| service.api_key = "AIzaSyBJm7CoylH_zhLfmnLNoWDljBtVesNmSrw" | Positive API key finding with high degree of confidence. Nightfall confidence: Very Likely (Google - Unverified) | Contextual clues including references to ‘API’ and ‘key’ should lead to increased confidence by the solution. Ideally, the solution recognizes the finding as a potential Google API key. |
| OAuth token= "gho_28axLS98lwZQr9wixIO3k7MstW2sD9Xp8saB1eO" | Positive API key finding with high degree of confidence. Nightfall confidence: Likely (GitHub - Unverified) | Contextual clues including references to ‘OAuth’ and ‘token’ should lead to increased confidence by the solution. Ideally, the solution recognizes the finding as a potential GitHub API key. |
| import stripe stripe.api_key = "sk_live_4eC39HqLyjWDarjtT1zdp7dcTYooMQauvdEDq54NiTphI7jx" | Positive API key finding with high degree of confidence. Nightfall confidence: Very Likely (Stripe - Unverified) | Contextual clues including references to ‘API’ and ‘key’ should lead to increased confidence by the solution. Ideally, the solution recognizes the finding as a potential Stripe API key. |
| A screenshot with an API key, available below (#1). | Positive API key finding with high degree of confidence. Nightfall confidence: Very Likely (Stripe - Unverified) | Solution should successfully extract text from the image file and evaluate it in the same way as the preceding examples. |
| { "_id" : { "$oid" : "5328e6f265f3f4d17089a11d" }, "domain" : "jira.mongodb.org", "banned_by" : null, "media_embed" : {}, "subreddit" : "mongodb"} | No finding. | This JSON snippet contains a string that matches the pattern for a Jira API key and even has “Jira” in close proximity, but is not a valid Jira API key. |
1) API Key Screenshot

U.S. Social Security Numbers

U.S. Social Security Numbers have a fairly well-defined format, but specific ranges are excluded and contextual information can be important to reduce false positives at low confidence levels.
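
The format and excluded ranges can be checked with a few lines of code before any contextual scoring is applied. The sketch below is an illustration only (the function name `is_plausible_ssn` is hypothetical): it rejects area numbers 000, 666, and 900-999, group number 00, and serial number 0000, none of which are issued.

```python
import re

# Three digits, two digits, four digits, separated by dashes.
SSN_PATTERN = re.compile(r"(\d{3})-(\d{2})-(\d{4})")


def is_plausible_ssn(candidate: str) -> bool:
    """Check the dashed SSN format and the ranges that are never issued."""
    match = SSN_PATTERN.fullmatch(candidate.strip())
    if match is None:
        return False
    area, group, serial = (int(part) for part in match.groups())
    if area in (0, 666) or area >= 900:
        return False  # excluded area numbers
    if group == 0 or serial == 0:
        return False  # group 00 and serial 0000 are never issued
    return True


for sample in ("285-25-9002", "914-25-9002", "285-00-9002"):
    print(sample, is_plausible_ssn(sample))
```

A structural check like this only separates possible SSNs from impossible ones; the confidence level still depends on the surrounding context.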
| Input Data | Expected result | Comments |
| --- | --- | --- |
| Hi - I applied for an insurance policy online with my SSN 285-25-9002, however the app says an unexpected error occurred. Could you take a look on your end? | Positive SSN finding with high degree of confidence. Nightfall confidence: Very Likely | Format and content match. Contextual clues provide additional confidence. |
| My social is 285-25-9002. | Positive SSN finding with lower degree of confidence. Nightfall confidence: Possible | Format and content match, but fewer contextual clues reduce confidence. |
| 285-25-9002 | Positive SSN finding with low degree of confidence. Nightfall confidence: Possible | Format and content match, but lack of contextual clues leads to minimal confidence. |
| Hi - I applied for an insurance policy online with my SSN 914-25-9002, however the app says an unexpected error occurred. Could you take a look on your end? | No finding. | Format matches and there are a number of contextual clues, but the number is invalid as an SSN: area numbers (the first three digits) from 900 to 999 are not issued. |
| Hi - I applied for an insurance policy online with my SSN 285-00-9002, however the app says an unexpected error occurred. Could you take a look on your end? | No finding. | Format matches and there are a number of contextual clues, but the number is invalid as an SSN: the group number (the middle two digits) 00 is not issued. |
| A screenshot with an SSN, available below (#2). | Positive SSN finding with high degree of confidence. Nightfall confidence: Very Likely | Solution should successfully extract text from the image file and evaluate it in the same way as the preceding examples. |
2) SSN (and CC) Screenshot

Credit cards

Credit card numbers can range from 14 to 19 digits, with 16 being the most common. The grouping of the digits can also vary. Card numbers must pass the Luhn algorithm and begin with an appropriate Bank Identification Number, or BIN. Beyond that, contextual clues such as dashes or spaces and words such as ‘credit card’ should influence the confidence score of the results.
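
The Luhn check mentioned above is easy to reproduce when verifying test data. The sketch below strips dashes and spaces, applies the 14 to 19 digit length range stated in this section, and then runs the standard Luhn checksum. The function name `luhn_valid` is hypothetical, and the BIN check is deliberately left out because it requires an issuer prefix list that this document does not provide.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the card number has a valid length and Luhn checksum."""
    compact = number.replace("-", "").replace(" ", "")
    if not (compact.isascii() and compact.isdigit()):
        return False
    if not 14 <= len(compact) <= 19:
        return False
    total = 0
    # Walk the digits right to left, doubling every second one.
    for index, char in enumerate(reversed(compact)):
        digit = int(char)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as adding the two digits of the product
        total += digit
    return total % 10 == 0


for sample in ("6771-8979-6102-7961", "7373-8979-6102-7961", "6771-8979-6102-7931"):
    print(sample, luhn_valid(sample))
```

Note that a number can pass this check and still be rejected for an invalid BIN, which is why a detection engine needs both tests.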
| Input Data | Expected result | Comments |
| --- | --- | --- |
| I tried to pay with my credit card, the number is 6771-8979-6102-7961. Could you take a look on your end? | Positive credit card finding with high degree of confidence. Nightfall confidence: Very Likely | Finding passes appropriate checks for a credit card and the contextual data provides additional confidence. |
| 6771-8979-6102-7961 | Positive credit card finding with medium degree of confidence. Nightfall confidence: Likely | Finding passes appropriate checks for a credit card. Confidence should only be mid-level due to the minimal contextual clue of dashes or spaces. |
| 6771897961027961 | Positive credit card finding with low degree of confidence. Nightfall confidence: Possible | Finding passes appropriate checks for a credit card. Confidence should only be low due to the lack of contextual clues. |
| I tried to pay with my credit card, the number is 7373-8979-6102-7961. Could you take a look on your end? | No finding. | Contextual clues are present, the number is the appropriate length and passes the Luhn algorithm, but the BIN is invalid. |
| I tried to pay with my credit card, the number is 6771-8979-6102-7931. Could you take a look on your end? | No finding. | Contextual clues are present, the number is the appropriate length and the BIN is valid, but the number does not pass the Luhn algorithm. |
| A file containing credit cards, available below (#3). | 10 positive credit card findings with high degree of confidence. Nightfall confidence: 10 Very Likely | Solution should identify all credit cards. Ideally, the solution will use the contextual clue of the column header for all entries in that column. |
| A screenshot of credit cards, available below (#4). | 10 positive credit card findings with medium to high degree of confidence. Nightfall confidence: 1 Very Likely, 9 Likely | Solution should extract text from the image and successfully recognize valid card numbers. |
3) Credit Card File
CC File.csv (361 B, text)
4) Credit Card Screenshot