The file scan API has first-class support for text extraction and scanning on all MIME types enumerated below.
Certain file types, such as tabular data and archives of Git repositories, receive special handling that yields more precise information about the location of findings within the source file.
application/json
application/x-ndjson
application/x-php
text/calendar
text/css
text/csv (treated as tabular data and may be redacted)
text/html
text/javascript
text/plain
text/tab-separated-values (treated as tabular data)
text/tsv (treated as tabular data)
text/x-php
application/pdf
application/vnd.openxmlformats-officedocument.presentationml.presentation
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet (treated as tabular data)
application/vnd.openxmlformats-officedocument.wordprocessingml.document
application/vnd.ms-excel (treated as tabular data)
application/bzip2
application/ear
application/gzip
application/jar
application/java-archive
application/tar+gzip
application/vnd.android.package-archive
application/war
application/x-bzip2
application/x-gzip
application/x-rar-compressed
application/x-tar
application/x-webarchive
application/x-zip-compressed
application/x-zip
application/zip
image/apng
image/avif
image/gif
image/jpeg
image/jpg
image/png
image/svg+xml
image/tiff
image/webp
The file scan API explicitly rejects requests with MIME types that are not conducive to extracting or scanning text. Sample rejected MIME types include:
application/photoshop
audio/midi
audio/wav
video/mp4
video/quicktime
File scans of Microsoft Office, Apache Parquet, CSV, and tab-separated files provide additional properties for locating findings within the document, beyond the standard byteRange, codepointRange, and lineRange properties.
Findings will contain a columnRange and a rowRange that allow you to identify the specific row and column within the tabular data where the finding is present.
This functionality is applicable to the following MIME types:
text/csv
text/tab-separated-values
text/tsv
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
application/vnd.ms-excel
Apache Parquet data files are also accepted.
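To illustrate how rowRange and columnRange might be used, here is a minimal Python sketch that turns a finding's location into a spreadsheet-style cell reference. The shape of the location mapping (start and end keys, 1-based indices) is an assumption made for illustration and is not confirmed by this page. The helper names is_tabular, column_letters, and cell_reference are hypothetical.

```python
from typing import Mapping

# MIME types this page lists as tabular (row and column locations available).
TABULAR_MIME_TYPES = frozenset({
    "text/csv",
    "text/tab-separated-values",
    "text/tsv",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "application/vnd.ms-excel",
})


def is_tabular(mime_type: str) -> bool:
    """Return True if the MIME type is one this page lists as tabular."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    base = mime_type.split(";", 1)[0].strip().lower()
    return base in TABULAR_MIME_TYPES


def column_letters(index: int) -> str:
    """Convert a 1-based column index to spreadsheet letters (1 -> A, 27 -> AA)."""
    if index < 1:
        raise ValueError("column index must be >= 1")
    letters = ""
    while index > 0:
        index, remainder = divmod(index - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


def cell_reference(location: Mapping[str, Mapping[str, int]]) -> str:
    """Build an A1-style reference from a hypothetical location mapping.

    Assumed shape (not confirmed by this page):
    {"rowRange": {"start": 55, "end": 55}, "columnRange": {"start": 2, "end": 2}}
    with 1-based indices.
    """
    row = location["rowRange"]["start"]
    column = location["columnRange"]["start"]
    return f"{column_letters(column)}{row}"
```

Under these assumptions, a finding in the 2nd column and 55th row maps to cell B55. If the real indices turn out to be 0-based, add 1 to each before converting.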
Below is a sample match of a spreadsheet containing dummy PII where an SSN was detected in the 2nd column and 55th row.
Findings within CSV files may be redacted.
To enable redaction in files, set the enableFileRedaction flag of your policy to "true".
The CSV file will be redacted based on the configuration of the defaultRedactionConfig of the policy.
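As a rough sketch of how these two policy settings might sit together in a scan request body, the Python below assembles a payload and serializes it to JSON. Only enableFileRedaction, defaultRedactionConfig, and alertConfig are named on this page. The nesting under a policy key, the detectionRuleUUIDs field, the email address layout, and the contents of the redaction config are all assumptions for illustration, and build_scan_policy is a hypothetical helper.

```python
import json
from typing import Any, Dict, List


def build_scan_policy(
    detection_rule_uuids: List[str],
    alert_email: str,
    redaction_config: Dict[str, Any],
) -> Dict[str, Any]:
    """Assemble a request body that turns on file redaction.

    The field layout is assumed, not taken from this page; check it
    against the API reference before sending it anywhere.
    """
    return {
        "policy": {
            # Assumed field: which detection rules to apply.
            "detectionRuleUUIDs": list(detection_rule_uuids),
            # Assumed layout: deliver results to an email address.
            "alertConfig": {"email": {"address": alert_email}},
            # Named on this page: controls how findings are redacted.
            "defaultRedactionConfig": redaction_config,
            # Named on this page: enables redaction of the file itself.
            "enableFileRedaction": True,
        }
    }


if __name__ == "__main__":
    body = build_scan_policy(
        detection_rule_uuids=["00000000-0000-0000-0000-000000000000"],  # placeholder
        alert_email="alerts@example.com",  # placeholder
        redaction_config={"maskConfig": {"maskingChar": "*"}},  # assumed shape
    )
    print(json.dumps(body, indent=2))
```

The flag is emitted as a JSON boolean rather than the string "true". Which form the service expects is worth confirming.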
Below is an example curl request for a CSV file that has already been uploaded.
When results are sent to the location specified in the alertConfig (in this case an email address), a redactedFile property will be set with a fileURL in addition to the findingsURL.
This redacted file will be a modified version of the original CSV file.
Below is an example of a redacted csv file.
Nightfall provides special handling for archives of Git repositories.
Nightfall will scan the repository history to discover findings in particular commits, returning the hash of each commit in which a finding appears.
In order to scan the repository, you will need to create a clone, for example:
git clone https://github.com/nightfallai/nightfall-go-sdk.git
This creates a clone of the Nightfall Go SDK.
You will then need to create an archive that can be uploaded using Nightfall's file scanning sequence.
zip -r directory.zip directory
Note that in order for this to work, the hidden directory .git, which holds the repository history, must be included in the archive.
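Git keeps a repository's commit history in the hidden .git directory, so it can be worth checking that the archive actually contains it before uploading. The following is a minimal Python sketch using the standard zipfile module. The function name archive_contains_dir is hypothetical, and the directory to look for is a parameter that defaults to .git.

```python
import zipfile
from pathlib import PurePosixPath


def archive_contains_dir(zip_path: str, dir_name: str = ".git") -> bool:
    """Return True if any entry in the zip archive lives under `dir_name`.

    The match is on whole path components, so ".github" does not
    count as ".git".
    """
    with zipfile.ZipFile(zip_path) as archive:
        for entry in archive.namelist():
            # Zip entry names always use forward slashes.
            if dir_name in PurePosixPath(entry).parts:
                return True
    return False


if __name__ == "__main__":
    # "directory.zip" is the archive name used in the zip command above.
    if not archive_contains_dir("directory.zip"):
        print("Warning: no .git directory found in the archive")
```

Matching on path components rather than substrings avoids a false positive from similarly named directories.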
When you initiate the file upload sequence with this file, you will receive scan results that have the commitHash property filled in.
Using the Nightfall Go SDK archive created above, a simple example would be to scan for URLs (that is, strings starting with http:// or https://), which will send results such as the following:
Large repositories result in a large volume of data sent at once. We are working on changes to allow these and other large surges of data to be processed in a more controlled manner, and will increase the limit or remove it altogether once those changes are complete.
To retrieve the specific commit, you will need to clone the repository, for example:
git clone https://github.com/nightfallai/nightfall-go-sdk.git
You can then check out the specific commit using the commit hash returned by Nightfall.
Note that you are in a 'detached HEAD' state when working with this sort of checkout of a repository.
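When a scan returns many findings, it may help to group them by commit and generate one checkout command per commit. The Python sketch below assumes each finding is a mapping with a top-level commitHash key. That placement is an assumption, since the real payload may nest it elsewhere. It also assumes SHA-1 style hashes of 7 to 40 hexadecimal characters. Both function names are hypothetical.

```python
import re
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Mapping

# Assumes SHA-1 style hashes: 7 to 40 lowercase hex characters.
_HASH_PATTERN = re.compile(r"[0-9a-f]{7,40}")


def group_by_commit(
    findings: Iterable[Mapping[str, Any]],
) -> Dict[str, List[Mapping[str, Any]]]:
    """Group findings by their commitHash, skipping findings without one."""
    grouped: Dict[str, List[Mapping[str, Any]]] = defaultdict(list)
    for finding in findings:
        commit_hash = finding.get("commitHash")  # assumed top-level key
        if commit_hash:
            grouped[commit_hash].append(finding)
    return dict(grouped)


def checkout_command(commit_hash: str) -> List[str]:
    """Return the git command that checks out a commit in detached HEAD state."""
    if not _HASH_PATTERN.fullmatch(commit_hash):
        raise ValueError(f"not a valid commit hash: {commit_hash!r}")
    return ["git", "checkout", "--detach", commit_hash]


if __name__ == "__main__":
    sample = [  # made-up findings for illustration only
        {"finding": "https://example.com", "commitHash": "0123abc"},
        {"finding": "http://example.org", "commitHash": "0123abc"},
    ]
    for commit in group_by_commit(sample):
        print(" ".join(checkout_command(commit)))
```

Validating the hash before building the command keeps unexpected strings from being passed to git. Passing the returned list to subprocess.run inside the cloned repository would perform the checkout.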
CSV Files: Only the first 250,000 rows will be scanned.
Spreadsheet Files: Up to 100,000 rows per sheet will be scanned, with a maximum of 1 million rows across all tabs in multi-sheet spreadsheets.
PDF Files: Scanning is limited to the first 100 pages, including a maximum of 50 images within those pages.
Images: Images smaller than 5KB or larger than 50MB will be excluded from scanning.
Archive Files: A maximum of 1,000 files will be extracted and scanned. Files larger than 100MB requiring extraction will not be scanned.
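The limits above can be expressed as a few client-side checks, for example to warn before uploading a file that will only be partially scanned. This is a sketch only: it assumes decimal units (1KB = 1,000 bytes, 1MB = 1,000,000 bytes), which this page does not specify, and every function and constant name is hypothetical.

```python
from typing import List

KB = 1_000      # assumption: decimal kilobytes
MB = 1_000_000  # assumption: decimal megabytes

CSV_MAX_ROWS = 250_000
SHEET_MAX_ROWS = 100_000
WORKBOOK_MAX_ROWS = 1_000_000
IMAGE_MIN_BYTES = 5 * KB
IMAGE_MAX_BYTES = 50 * MB
ARCHIVE_MAX_FILES = 1_000
ARCHIVE_MEMBER_MAX_BYTES = 100 * MB


def csv_rows_scanned(total_rows: int) -> int:
    """Rows of a CSV file that fall within the scanned range."""
    return min(total_rows, CSV_MAX_ROWS)


def spreadsheet_rows_scanned(rows_per_sheet: List[int]) -> int:
    """Rows scanned in a workbook: capped per sheet, then capped overall."""
    capped = sum(min(rows, SHEET_MAX_ROWS) for rows in rows_per_sheet)
    return min(capped, WORKBOOK_MAX_ROWS)


def image_will_be_scanned(size_bytes: int) -> bool:
    """Images smaller than 5KB or larger than 50MB are excluded."""
    return IMAGE_MIN_BYTES <= size_bytes <= IMAGE_MAX_BYTES


def archive_member_will_be_scanned(size_bytes: int) -> bool:
    """Archive members larger than 100MB are not scanned."""
    return size_bytes <= ARCHIVE_MEMBER_MAX_BYTES
```

For instance, a workbook with sheets of 150,000 and 50,000 rows would have 150,000 rows scanned under these rules: 100,000 from the first sheet and all 50,000 from the second.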