Key Outcomes
- 15,000 Suspicious Activity Reports (SARs) processed annually
- 100% extraction accuracy on a manually validated sample of 100+ reports
- Critical information that was previously trapped in complex PDF formats is now easily accessible
- Fully automated process from file upload to structured database
Client Overview
Our client is a leading Banking-as-a-Service provider in the U.S., serving fintech partners with deposit and credit products for millions of customers. Its assets have grown rapidly to $4.2 billion in recent years, and the business has expanded significantly. As the client's partner and vendor network grew, the traditional approach of scaling teams in step with the volume of manual work was no longer viable. To keep pace with this expansion, they turned to Cavallo Technologies to implement an advanced, centralized data and AI platform built on Databricks.
The Challenge
Suspicious Activity Reports (SARs) are vital documents that financial institutions use to report potentially suspicious activities, such as fraud or money laundering, to regulatory bodies like the Financial Crimes Enforcement Network (FinCEN). However, these reports are often stored in PDF format, which makes their contents difficult to extract and analyze for the following reasons:
- Unstructured Data: SARs can contain repeating sections, which makes them hard to parse consistently.
- Dynamic Content: The length of certain sections varies across different reports.
- Checkboxes and Non-Text Elements: Extracting information from checkboxes or images requires specialized tools beyond basic text extraction.
With over 100,000 SAR documents in PDF format, our client faced significant difficulty analyzing and querying these reports. They needed an effective solution that would allow them to extract valuable data for compliance purposes.
The Solution
Cavallo Technologies developed a custom SAR parsing tool capable of handling the complex, unstructured nature of the documents. The tool uses a combination of techniques, including:
- Static and Dynamic Fields Extraction: We leveraged predefined coordinates for fixed-position data, while a keyword-based search identified dynamically changing sections.
- Checkbox and Graphical Element Extraction: We developed methods to extract values from checkboxes and other graphical elements.
- Fast Processing: Each SAR document is processed in under 1.5 seconds, enabling the client to scale their operations and ensure timely analysis of critical compliance data.
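As an illustration only, and not the production tool, the two text-extraction techniques above can be sketched in Python. The sketch assumes that a PDF text extractor has already produced words with bounding boxes; the `Word` type, the box coordinates, the field names, and the sample values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One extracted word with its bounding box in page coordinates (hypothetical shape)."""
    text: str
    x0: float
    top: float
    x1: float
    bottom: float


def words_in_box(words, box):
    """Join the words whose centers fall inside box = (x0, top, x1, bottom)."""
    bx0, btop, bx1, bbottom = box
    hits = [
        w for w in words
        if bx0 <= (w.x0 + w.x1) / 2 <= bx1
        and btop <= (w.top + w.bottom) / 2 <= bbottom
    ]
    # Approximate reading order: top to bottom, then left to right.
    hits.sort(key=lambda w: (round(w.top), w.x0))
    return " ".join(w.text for w in hits)


def extract_after_keyword(words, keyword, dx, dy, width, height):
    """Locate a keyword, then read a box placed at an offset from its top-left corner."""
    for w in words:
        if w.text == keyword:
            box = (w.x0 + dx, w.top + dy, w.x0 + dx + width, w.top + dy + height)
            return words_in_box(words, box)
    return None  # keyword not present on this page


# Made-up sample page content.
words = [
    Word("Filing", 40, 50, 70, 60),
    Word("Date", 74, 50, 98, 60),
    Word("2024-01-15", 120, 50, 180, 60),
    Word("Amount", 40, 300, 85, 310),
    Word("12,500.00", 120, 300, 175, 310),
]

# Static field: the value always sits in the same place on the page.
FILING_DATE_BOX = (110, 45, 200, 65)  # assumed coordinates
filing_date = words_in_box(words, FILING_DATE_BOX)

# Dynamic field: the label can move vertically, so the box is anchored to the keyword.
amount = extract_after_keyword(words, "Amount", dx=70, dy=-5, width=100, height=20)
```

Anchoring the dynamic box to the keyword rather than to the page is what lets the same rule work when earlier sections push the field further down.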

Development and Validation
The process began by identifying and mapping the various sections within each SAR document. Our approach included:
- Section Recognition: We identified the different sections of the report and mapped them to specific pages. For repeating sections, the system consolidated page numbers to maintain accuracy.
- Static Information Extraction: For fields that appear in fixed positions (like checkboxes), we defined coordinate points for efficient data extraction.
- Dynamic Fields Handling: Sections with varying content lengths were managed using keyword searches and offset coordinates, ensuring the system could extract relevant information regardless of its location.
- Narrative Section Extraction: The narrative section, which spans multiple pages, was handled by extracting the text from each page and combining it into a single continuous passage.
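A minimal sketch of section recognition and narrative stitching follows, assuming each page has already been reduced to plain text. The header strings, the rule that a page without a header continues the previous section, and the sample pages are assumptions made for illustration; they are not taken from the actual SAR layout or from the delivered tool.

```python
from collections import defaultdict

# Hypothetical section headers; a real SAR layout may use different wording.
SECTION_HEADERS = [
    "Subject Information",
    "Suspicious Activity Information",
    "Narrative",
]


def map_sections_to_pages(page_texts):
    """Map each section header to the 1-based page numbers that belong to it.

    A page without a header is treated as a continuation of the previous
    section. Repeating sections accumulate all of their page numbers.
    """
    section_pages = defaultdict(list)
    current = None
    for page_number, text in enumerate(page_texts, start=1):
        header = next((h for h in SECTION_HEADERS if h in text), None)
        if header is not None:
            current = header
        if current is not None:
            section_pages[current].append(page_number)
    return dict(section_pages)


def join_narrative(page_texts, section_pages, header="Narrative"):
    """Concatenate the narrative text across all of its pages."""
    parts = []
    for page_number in section_pages.get(header, []):
        text = page_texts[page_number - 1]
        # Drop the header itself so only the narrative body remains.
        parts.append(text.replace(header, "", 1).strip())
    return " ".join(part for part in parts if part)


# Made-up five-page report with a repeated section and a two-page narrative.
pages = [
    "Subject Information\nName: Jane Doe",
    "Suspicious Activity Information\nAmount: 12,500.00",
    "Subject Information\nName: John Roe",
    "Narrative\nThe subject made repeated cash deposits",
    "just under the reporting threshold.",
]

section_pages = map_sections_to_pages(pages)
narrative = join_narrative(pages, section_pages)
```

In this sample, the repeated "Subject Information" section is consolidated to pages 1 and 3, and the narrative is read from pages 4 and 5 as one passage.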
To ensure the highest level of accuracy, we ran multiple rounds of validation using both synthetic and real SAR data. We worked closely with both the technical team and business stakeholders to review and validate over 100 reports, achieving 100% extraction accuracy.
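One way to express the accuracy check is a field-by-field comparison of extracted values against manually verified values. The sketch below is an assumption about how such a check could look, not a description of the validation harness that was used; the report IDs, field names, and values are invented.

```python
def field_accuracy(extracted, expected):
    """Compare extracted fields with verified values.

    Both arguments map report_id -> {field_name: value}.
    Returns (accuracy, mismatches), where each mismatch is
    (report_id, field_name, extracted_value, expected_value).
    """
    total = 0
    mismatches = []
    for report_id, truth in expected.items():
        got = extracted.get(report_id, {})
        for field, value in truth.items():
            total += 1
            if got.get(field) != value:
                mismatches.append((report_id, field, got.get(field), value))
    accuracy = (total - len(mismatches)) / total if total else 0.0
    return accuracy, mismatches


# Invented example: four verified fields, one of which was extracted incorrectly.
expected = {
    "SAR-001": {"filing_date": "2024-01-15", "amount": "12,500.00"},
    "SAR-002": {"filing_date": "2024-02-03", "amount": "9,900.00"},
}
extracted = {
    "SAR-001": {"filing_date": "2024-01-15", "amount": "12,500.00"},
    "SAR-002": {"filing_date": "2024-02-03", "amount": "9,000.00"},
}

accuracy, mismatches = field_accuracy(extracted, expected)
```

Returning the mismatches alongside the score makes each validation round actionable: every failing field points to a specific coordinate or keyword rule to adjust.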
The Outcome
Efficiency at Scale
- The solution enabled the client to process thousands of SAR documents programmatically, eliminating the need for manual intervention.
- Parsing logic is optimized for speed, handling each document in less than 1.5 seconds.
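To put the per-document figure in context, a rough upper bound on total processing time can be computed from the numbers quoted in this case study. The calculation assumes purely sequential processing at the 1.5-second ceiling, which is a simplification; parallel execution would shorten the wall-clock time.

```python
def sequential_hours(documents, seconds_per_document=1.5):
    """Upper bound on processing time in hours, assuming one document at a time."""
    return documents * seconds_per_document / 3600


annual_hours = sequential_hours(15_000)    # yearly volume quoted above
backlog_hours = sequential_hours(100_000)  # existing archive quoted above
```

Under these assumptions, a year of reports takes at most 6.25 hours and the full archive at most roughly 42 hours.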
Accuracy in Data Extraction
- By combining text extraction with image processing for non-text elements like checkboxes, the tool ensures comprehensive data capture.
- We validated the results manually on 100+ documents and found the extraction to be 100% accurate.
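Checkbox detection can be illustrated with a simple fill-ratio test on a rendered page region. This is a simplified sketch of the general idea, not the image-processing method used in the project: the grayscale page is assumed to be a list of pixel rows with values from 0 (black) to 255 (white), and both thresholds are assumed values that would need tuning on real scans.

```python
def is_checked(gray, box, dark_threshold=128, fill_ratio=0.2):
    """Return True if enough pixels inside the box are dark.

    gray: 2D list of grayscale pixel values (rows of ints, 0 to 255).
    box:  (x0, y0, x1, y1) in pixel coordinates, end-exclusive, covering
          the inside of the checkbox.
    """
    x0, y0, x1, y1 = box
    region = [row[x0:x1] for row in gray[y0:y1]]
    total = sum(len(row) for row in region)
    if total == 0:
        return False
    dark = sum(1 for row in region for px in row if px < dark_threshold)
    return dark / total >= fill_ratio


def blank_page(size):
    """Create an all-white square image for the example."""
    return [[255] * size for _ in range(size)]


# Draw an "X" inside a 4x4 checkbox interior on a 6x6 page.
checked_page = blank_page(6)
for i in range(1, 5):
    checked_page[i][i] = 0
    checked_page[i][5 - i] = 0

empty_page = blank_page(6)
CHECKBOX = (1, 1, 5, 5)  # assumed pixel coordinates of the checkbox interior

checked = is_checked(checked_page, CHECKBOX)
unchecked = is_checked(empty_page, CHECKBOX)
```

Because the rule is a fixed threshold on pixel counts, the same input always yields the same answer, which fits the deterministic approach described in this case study.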
Compliance and Security
- The solution adheres to strict security frameworks (e.g., Databricks AI Security Framework) and does not employ probabilistic models, so extraction is deterministic and avoids the risks associated with probabilistic AI output.
The Result
Our client now has a fully automated SAR parsing system that processes thousands of reports in under 1.5 seconds each, with 100% extraction accuracy on the validated sample. This solution has unlocked critical compliance data, enabling the client to meet regulatory requirements efficiently while freeing up resources that were previously tied up in manual processes.