Nightfall Documentation
  • Data Detection and Response
  • Posture Management
  • Data Exfiltration Prevention
  • Data Encryption
  • Developer APIs
  • Data Classification and Discovery
  • Welcome to Developer APIs Documentation
  • Introduction to Developer APIs
    • Overview
    • Quickstart
    • Use Cases
    • Authentication and Security
  • Key Concepts
    • Entities and Terms to Know
    • Setting Up Nightfall
      • Creating API Key
      • Creating Detectors
      • Creating Detection Rules
      • Creating Policies
    • Alerting
    • Scanning Text
    • Scanning Files
      • Supported File Types
      • File Scanning and Webhooks
      • Uploading and Scanning API Calls
      • Special File Types
      • Specialized File Detectors
      • Webhooks and Asynchronous Notifications
        • Accessing Your Webhook Signing Key
        • Creating a Webhook Server
    • Scanning Features
      • Using Pre-Configured Detection Rules
        • Scanning Images for patterns using Custom Regex Detectors
      • Creating an Inline Detection Rule
      • Using Exclusion Rules
      • Using Context Rules
      • Using Redaction
      • Using Policies to Send Alerts
      • Detecting Secrets
      • PHI Detection Rules
    • Detector Glossary
    • Test Datasets
    • Errors
    • Nightfall Playground
  • Nightfall APIs
    • DLP APIs - Firewall for AI Platform
      • Rate Limits for Firewall APIs
    • DLP APIs - Native SaaS Apps
      • Policy User Scope Update API
      • Rate Limits for Native SaaS app APIs
  • Exfiltration Prevention APIs
    • Default
    • Models
  • Posture Management APIs
    • Default
    • Models
  • Nightfall Software Development Kit (SDK)
    • Overview
    • Java SDK
    • Python SDK
    • Go SDK
    • Node.JS SDK
  • Language Specific Guides
    • Overview
    • Python
    • Ruby
    • Java
  • Tutorials
    • GenAI Protection
      • OpenAI Prompt Sanitization Tutorial
      • Anthropic Prompt Sanitization Tutorial
      • LangChain Prompt Sanitization Tutorial
    • SaaS Protection
      • HubSpot DLP Tutorial
      • Zendesk DLP Tutorial
    • Observability Protection
      • Datadog DLP Tutorial
      • New Relic DLP Tutorial
    • Datastore Protection
      • Airtable DLP Tutorial
      • Amazon Kinesis DLP Tutorial
      • Amazon RDS DLP Tutorial
      • Amazon RDS DLP Tutorial - Full Scan
      • Amazon S3 DLP Tutorial
      • Elasticsearch DLP Tutorial
      • Snowflake DLP Tutorial
  • Nightfall Use Cases
    • Overview
    • GenAI Content Filtering-How to prevent exposure of sensitive data
    • Redacting Sensitive Data in 4 Lines of Code
    • Detecting Sensitive Data in SMS Automations
    • Building Endpoint DLP to Detect PII on Your Machine in Real-Time
    • Deploy a File Scanner for Sensitive Data in 40 Lines of Code
    • Using Scan API (with Python)
  • FAQs
    • What Can I do with the Firewall for AI
    • How quickly can I get started with Firewall for AI?
    • What types of data can I scan with API?
    • What types of detectors are supported out of the box?
    • Can I customize or bring my own detectors?
    • What is the pricing model?
    • How do I know my data is secure?
    • How do I get in touch with you?
    • Can I test out the detection and my own detection rules before writing any code?
    • How does Nightfall support custom data types?
    • How does Nightfall's Firewall for AI differs from other solutions?
  • Nightfall Playground
  • Login to Nightfall
  • Contact Us
Powered by GitBook
On this page
  • Accepted Text and Derivatives
  • Accepted Office Formats
  • Accepted Archive and Compressed File Types
  • Accepted Image File Types
  • Rejected MIME Types
  • Spreadsheets and Tabular Data
  • Redacting CSV Files
  • Git Repositories
  • File Scanning Limitations

Was this helpful?

Export as PDF
  1. Key Concepts
  2. Scanning Files

Supported File Types

PreviousScanning FilesNextFile Scanning and Webhooks

Last updated 5 months ago

Was this helpful?

The file scan API has first-class support for text extraction and scanning on all MIME types enumerated below.

Certain file types receive special handling, such as and , that results in more precise information about the location of findings within the source file.

Handling of MIME Types Not Listed

Files with a MIME type not listed below are processed using an unoptimized text extractor. As a result, the quality of the text extraction for unrecognized types may vary.

Accepted Text and Derivatives

  • application/json

  • application/x-ndjson

  • application/x-php

  • text/calendar

  • text/css

  • text/csv (treated as and may be )

  • text/html

  • text/javascript

  • text/plain

  • text/tab-separated-values (treated as )

  • text/tsv (treated as )

  • text/x-php

Accepted Office Formats

  • application/pdf

  • application/vnd.openxmlformats-officedocument.presentationml.presentation

  • application/vnd.openxmlformats-officedocument.wordprocessingml.document

Accepted Archive and Compressed File Types

  • application/bzip2

  • application/ear

  • application/gzip

  • application/jar

  • application/java-archive

  • application/tar+gzip

  • application/vnd.android.package-archive

  • application/war

  • application/x-bzip2

  • application/x-gzip

  • application/x-rar-compressed

  • application/x-tar

  • application/x-webarchive

  • application/x-zip-compressed

  • application/x-zip

  • application/zip

Accepted Image File Types

  • image/apng

  • image/avif

  • image/gif

  • image/jpeg

  • image/jpg

  • image/png

  • image/svg+xml

  • image/tiff

  • image/webp

Rejected MIME Types

The file scan API explicitly rejects requests with MIME types that are not conducive to extracting or scanning text. Sample rejected MIME types include:

  • application/photoshop

  • audio/midi

  • audio/wav

  • video/mp4

  • video/quicktime

Spreadsheets and Tabular Data

File scans of Microsoft Office, Apache parquet, csv, and tab separated files will provide additional properties to locate findings within the document beyond the standard byteRange, codepointRange, and lineRange properties.

Findings will contain a columnRange and a rowRange that will allow you to identify the specific row and column within the tabular data wherein the finding is present.

This functionality is applicable to the following mime types:

  • text/csv

  • text/tab-separated-values

  • text/tsv

  • application/vnd.openxmlformats-officedocument.spreadsheetml.sheet

  • application/vnd.ms-excel

Below is a sample match of a spreadsheet containing dummy PII where a SSN was detected in the 2nd column and 55th row.

{
   "findings":[
      {
         "path":"Sheet1 (5)",
         "detector":{
            "id":"e30d9a87-f6c7-46b9-a8f4-16547901e069",
            "name":"US social security number (SSN)",
            "version":1
         },
         "finding":"624-84-9182",
         "confidence":"LIKELY",
         "location":{
            "byteRange":{
               "start":2505,
               "end":2516
            },
            "codepointRange":{
               "start":2452,
               "end":2463
            },
            "lineRange":{
               "start":55,
               "end":55
            },
            "rowRange":{
               "start":55,
               "end":55
            },
            "columnRange":{
               "start":2,
               "end":2
            },
            "commitHash":""
         },
         "matchedDetectionRuleUUIDs":[
            "950833c9-8608-4c66-8a3a-0734eac11157"
         ],
         "matchedDetectionRules":[
            
         ]
      },
...

Redacting CSV Files

Findings within csv files may be redacted.

To enable redaction in files, set the enableFileRedaction flag of your policy to "true"

The csv file will be redacted based on the configuration of the defaultRedactionConfig of the policy

curl --request POST \
     --url https://api.nightfall.ai/v3/upload/02a0c5e1-c950-4e28-a988-f6fffefc4205/scan \
     --header 'Accept: application/json' \
     --header 'Authorization: Bearer NF-<Your API Key>' \
     --header 'Content-Type: application/json' \
     --data '
{
     "policy": {
          "detectionRuleUUIDs": [
               "950833c9-8608-4c66-8a3a-0734eac11157"
          ],
          "alertConfig": {
               "email": {
                    "address": "<your email addres>"
               }
          },
          "defaultRedactionConfig": {
               "maskConfig": {
                    "charsToIgnore": [
                         "-",
                         "@"
                    ],
                    "maskingChar": "*"
               }
          },
          "enableFileRedaction": true

     },
     "requestMetadata": "csv redaction test"
}
'

When results are sent to the location specified in the alertConfig (in this case an email address) a redactedFile property will be set with a fileURL in addition the findingsURL

{
   "errors":null,
   "findingsPresent":true,
   "findingsURL":"https://files.nightfall.ai/asdfc5e1-c950-4e28-a988-f6fffefc4205.json?Expires=1655324479&Signature=zjo1nT-PECHC-fiTvAgdA8aDnceoY~6iGfzOBCcBjscKqOHnIar8hoH4gGufffiulBw5BpfJuvWwBW~lXO~ZNhN139LDwoTsfLJswJiQCB2Hj-Az0Em6go~1j8WBqCS8G0Gk17M-zcPedHGX3z~1pw8nm5sh6Pa-jJwfw9NIEiqmBb3Vdcj3J-~Wzag~ENV4499rnG299ee-ig5Ms1oVlzycb4YxzgTMrTL5Q07ozNenwFZcGDNQre1inLXmV-m8teLX-K3boklenp9KXiNDDV0wi74ADN-QfIR1q1oU7mEI1f3aVC3kju0QRErp2lsfs08EtZKLE3C4N17jDJdYcw__&Key-Pair-Id=K24YOPZ1EKX0YC",
   "redactedFile":{
      "fileURL":"https://files.nightfall.ai/asdfc5e1-c950-4e28-a988-f6fffefc4205-redacted.csv?Expires=1655324479&Signature=Hx8kRh88maLeStysy3fsLbFVG9VELEtfemtQe2lWUnFjAMd9HqlEksTmirqAWFWV4zPVUB73izlMj5cSer8v2N5ZCcnD3dz~nnwR4P5LewGJ2CQzGnDnXgh70HW5qp04gnUD-pYWp~bGPVspkJKCkl1zH-EoGonvcNVq3SNsVzOlsVIjep7Y7otQKEEyAZ7JmHiVfuBxrvn8pleuC5lEJ3f9miPyoRqH9DyPlNTJTIuijqe9q32Qcui2RsDR6IT-foFX52dy6rRa01ZV0gZMDWJokMlCr8Iu5An~qnhxC49bqTtI82oz9FcBaP-Yea8cq1TiAfGxX7CJ0~JeTLvr6g__&Key-Pair-Id=K24YOPZ1EKX0YC",
      "validUntil":"2022-06-15T20:21:19.750990823Z"
   },
   "requestMetadata":"csv redaction test",
   "uploadID":"02a0c5e1-c950-4e28-a988-f6fffefc4205",
   "validUntil":"2022-06-15T20:21:19.723045787Z"
}

This redacted file will be a modified version of the original csv file.

Below is an example of a redacted csv file.

name,email,phone,alphanumeric
Ulric Burton,*****@*************,*-***-***-****,TEL82EBM1GQ
Wade Jones,******************@***********,(********-****,VVF64PJV2EF
Molly Mccullough,*****************@**********,(********-****,OHO41SFZ2BR
Raja Riggs,************@**********,(********-****,UVD51JTE5NZ
Colin Carter,**********************@*********,(********-****,LNI34LLC5WV// Some code

Git Repositories

Nightfall provides special handling for archives of Git repositories.

Nightfall will scan the repository history to discover findings in particular checkin, returning the hash for the checkin.

In order to scan the repository, you will need to create a clone, i.e.

git clone https://github.com/nightfallai/nightfall-go-sdk.git

This creates a clone of the Nightfall go SDK.

You will then need to create an archive that can be uploaded using Nightfall's file scanning sequence.

zip -r directory.zip directory

Note that in order to work, the hidden directory .github must be included in the archive.

Using the Nightfall go SDK archive created above, a simple example would be to scan for URLs (i.e. strings starting with http:// or https://), which will send results such as the following:

{
   "findings":[
      {
         "path":"f607a067..53e59684/nightfall.go",
         "detector":{
            "id":"6123060e-2d9f-4f35-a7a1-743379ea5616",
            "name":"URL"
         },
         "finding":"https://api.nightfall.ai/\"",
         "confidence":"LIKELY",
         "location":{
            "byteRange":{
               "start":142,
               "end":168
            },
            "codepointRange":{
               "start":142,
               "end":168
            },
            "lineRange":{
               "start":16,
               "end":16
            },
            "rowRange":{
               "start":0,
               "end":0
            },
            "columnRange":{
               "start":0,
               "end":0
            },
            "commitHash":"53e59684d9778ceb0f0ed6a4b949c464c24d35ce"
         },
         "beforeContext":"tp\"\n\t\"os\"\n\t\"time\"\n)\n\nconst (\n\tAPIURL = \"",
         "afterContext":"\n\n\tDefaultFileUploadConcurrency = 1\n\tDef",
         "matchedDetectionRuleUUIDs":[
            "cda0367f-aa75-4d6a-904f-0311209b3383"
         ],
         "matchedDetectionRules":[
            
         ]
      },
 ...

Support for Large Repositories

Currently, processing is limited to repositories with a total number of commits lower than 5000.

Large repositories result in a large volume of data sent at once. We are working on changes to allow these and other large surges of data to be processed in a more controlled manner, and will increase the limit or remove it altogether once those changes are complete.

Sensitive Data in GitHub Repositories

If the finding in a GitHub repository is considered to be sensitive, it should be considered compromised and appropriate mitigation steps (i.e. secrets should be rotated).

To retrieve the specific checkout, you will need to clone the repository, i.e.

git clone https://github.com/nightfallai/nightfall-go-sdk.git

You can then checkout the specific commit using the commit hash returned by Nightfall.

cd nightfall-go-sdk
git checkout 53e59684d9778ceb0f0ed6a4b949c464c24d35ce

File Scanning Limitations

  • CSV Files: Only the first 250,000 rows will be scanned.

  • Spreadsheet Files: Up to 100,000 rows per sheet will be scanned, with a maximum of 1 million rows across all tabs in multi-sheet spreadsheets.

  • PDF Files: Scanning is limited to the first 100 pages, including a maximum of 50 images within those pages.

  • Images: Images smaller than 5KB or larger than 50MB will be excluded from scanning.

  • Archive Files: A maximum of 1,000 files will be extracted and scanned. Files larger than 100MB requiring extraction will not be scanned.

application/vnd.openxmlformats-officedocument.spreadsheetml.sheet (treated as )

application/vnd.ms-excel (treated as )

data files are also accepted.

Below is an example curl request for a csv file that has already been .

When you initiate the with this file, you will receive scan results that contain the commitHash property filled in.

Note that you are in a when workin with this sort of check out of a repository.

Apache parquet
uploaded
file upload sequence
'detached HEAD' state
tabular data
archives of Git repositories
tabular data
redacted
tabular data
tabular data
tabular data
tabular data