
Unlocking Unstructured Data with Text Analysis

Technology

The millions of documents FINRA receives are both a challenge and an opportunity. On the one hand, unstructured data isn’t easy to digest or manipulate, especially when a single document can run hundreds of pages. On the other hand, it represents a huge resource for analysts and the potential for deeper insights.

The Text Analysis team in the Enterprise Solutions department wanted to help unlock the potential in these documents with text analysis. With AWS and Natural Language Processing (NLP) tools advancing, they believed they finally had the means to do it.

A time-consuming process

They started their work with Private Placements Review (PPR), a unit within FINRA’s Corporate Financing Department. PPR provides regulatory oversight for private funding rounds. The PPR team is small, with only a dozen users. Still, this group deals with a huge range of documents, from simple two-page term sheets to memoranda that are hundreds of pages long.

PPR’s Investigators review these documents for actual or potential violations of SEC and FINRA rules. One potential violation, for instance, is when an individual who has committed certain regulatory violations participates in a private placement offering. Investigators go through these documents page by page to find those individuals. From there, they search other systems to see if the individuals committed any violations or had connections to other private placements. This is a time-intensive manual process.

The Text Analysis team began looking at ways to use text analysis tools to automate this process and make it easier for users. They created a proof of concept (POC) with multiple functionalities, including extracting names of individuals and organizations, matching them to Central Registration Depository (CRD) records, highlighting the occurrences of names in the text, and even providing a summary of the most important information in the documents. All of this showed PPR that the project had real potential.
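As a rough illustration of the highlighting piece, the sketch below shows one simple way such a feature could work: take the names produced by an extraction step and wrap each occurrence in a marker tag so it stands out when the document is displayed. The function name, tag, and sample text are hypothetical, not FINRA’s implementation.

```python
import re

def highlight_names(text: str, names: list[str]) -> str:
    """Wrap every occurrence of each extracted name in <mark> tags.

    `names` would come from an entity-extraction step; here it is just
    a hand-written list for illustration.
    """
    # Longest names first, so a longer name is not partially marked by a shorter one
    for name in sorted(names, key=len, reverse=True):
        pattern = re.compile(re.escape(name), flags=re.IGNORECASE)
        text = pattern.sub(lambda m: f"<mark>{m.group(0)}</mark>", text)
    return text

page = "The offering is managed by Jane Doe of Acme Capital. Jane Doe also serves as CFO."
print(highlight_names(page, ["Jane Doe", "Acme Capital"]))
```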

Figuring out the system

Creating this system wasn’t straightforward. Some requests were clear, such as the need to find names and other entities in the documents. Others weren’t, and only came into focus through working closely with PPR. For instance, the team saw users wading through documents hundreds of pages long and recognized a potential use of text analysis: give users a 2-5 page summary that captures the essence of the document.

Another idea they added to the system was showing relationships between different filings that involve the same people. “Many Investigators had been doing this by hand, which could be time consuming and limited because of the number of possible permutations,” said Gerald Portante, Product Manager. “Now with AWS, we have the technology to do tribe analysis in a systemic manner.” By “leveraging AWS,” he explained, “we can say with more certainty that these people have a relationship.”
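As a minimal sketch of the idea, assuming each filing arrives with the list of individuals extracted from it, linking filings through shared people comes down to inverting that mapping and pairing up filings that share a name. The filing IDs and names below are invented; the production analysis on AWS is considerably richer.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical example data: filing ID -> individuals extracted from that filing
filings = {
    "PPR-001": {"Jane Doe", "John Roe"},
    "PPR-002": {"Jane Doe", "Ann Smith"},
    "PPR-003": {"John Roe", "Ann Smith"},
}

# Invert the mapping: person -> filings they appear in
appearances = defaultdict(set)
for filing_id, people in filings.items():
    for person in people:
        appearances[person].add(filing_id)

# Any pair of filings sharing a person is a candidate relationship
related = defaultdict(set)
for person, ids in appearances.items():
    for a, b in combinations(sorted(ids), 2):
        related[(a, b)].add(person)

for pair, shared in sorted(related.items()):
    print(pair, "linked by", ", ".join(sorted(shared)))
```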

Finding the right solution

With such a variety of requirements, the team realized there was no single prescription. “Every problem needed research,” Dmytro Dolgopolov, Senior Director of Enterprise Solutions, explained. They searched for the best fit for each requirement, which sometimes meant combining multiple tools.

For entity extraction, for instance, they tried various products. They started with the open source Stanford CoreNLP library. While it gives a rough pick of the names found in a document, the team still needed to string the information together. From there, they had to build their own logic to resolve those names and entities against CRD records.
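The article doesn’t detail that logic, but the two steps can be sketched roughly as below, using Stanza (the Stanford NLP Group’s Python library) as a stand-in for the CoreNLP pipeline and Python’s difflib for fuzzy name matching against a made-up list of CRD-style records. Everything beyond the library calls is an assumption for illustration.

```python
import stanza
from difflib import SequenceMatcher

stanza.download("en")  # one-time model download
nlp = stanza.Pipeline(lang="en", processors="tokenize,ner")

# Hypothetical CRD-style reference names; the real system resolves against CRD itself
crd_records = ["JANE DOE", "JOHN Q ROE", "ANN SMITH"]

def best_crd_match(name: str, threshold: float = 0.7):
    """Return the closest reference record by simple string similarity, or None."""
    normalized = name.upper()
    score, record = max((SequenceMatcher(None, normalized, rec).ratio(), rec) for rec in crd_records)
    return record if score >= threshold else None

doc = nlp("The offering is managed by Jane Doe of Acme Capital.")
for ent in doc.ents:
    if ent.type in ("PERSON", "ORG"):
        print(ent.type, ent.text, "->", best_crd_match(ent.text))
```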

For document summarization, they also used multiple tools. The solution ended up being a mix of open source libraries and proprietary logic specific to the business case.
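The exact mix isn’t described, but the general shape of extractive summarization is easy to sketch: score each sentence by how many frequent content words it contains and keep the top-scoring sentences in their original order. The snippet below is a deliberately simplified stand-in for that idea, not the team’s algorithm.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on", "by", "with"}

def summarize(text: str, max_sentences: int = 3) -> str:
    """Tiny extractive summarizer: keep the sentences richest in frequent content words."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]
    freq = Counter(words)
    scored = sorted(
        enumerate(sentences),
        key=lambda pair: sum(freq[w] for w in re.findall(r"[a-z']+", pair[1].lower())),
        reverse=True,
    )
    keep = sorted(idx for idx, _ in scored[:max_sentences])
    return " ".join(sentences[i] for i in keep)
```

A production version would also have to handle section structure, tables, and domain terms, which is presumably where the business-specific logic comes in.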

Benefitting business users

Though PPR is a small department, text analysis is already having an impact on its users’ work. The program makes individuals in private placement offerings immediately visible. Because it is connected to CRD, users also get an automatic note if someone is on a statutory list. What once took multiple reads is now easy to see. In addition, PPR is now using text analysis to put eyes on every single new offering filed with FINRA. Before, PPR could only do this with about 2 out of every 5 filings. That coverage ensures nothing gets missed.

The system also helps new examiners work more quickly. “I used to be an examiner for years. A PPM takes me 15 minutes, because I know where to go. A new person could take days,” Portante said.

Moving forward

As PPR starts to see the enhancements text analysis brings to its work, the Text Analysis team isn’t stopping. They hope to go from a plug-in model to full integration with the systems currently in use. In addition, they’re working on expanding functionality for PPR, including identification at scale to help users prioritize documents.

One functionality they’re working on is document classification. Currently, they’re trying multiple approaches to find the best solution. One is a rule-based approach: sitting down with subject matter experts to document every rule and then building the program with as many rules as possible. This process can be thorough but is also time consuming.
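Here is a minimal sketch of what the rule-based route can look like, assuming the rules collected from subject matter experts can be expressed as keyword or phrase patterns; the classes and patterns below are invented for illustration.

```python
import re

# Hypothetical rules captured from subject matter experts: class -> indicative phrases
RULES = {
    "term_sheet": [r"\bterm sheet\b", r"\bsummary of terms\b"],
    "private_placement_memorandum": [r"\bprivate placement memorandum\b", r"\bconfidential offering\b"],
    "subscription_agreement": [r"\bsubscription agreement\b"],
}

def classify_by_rules(text: str) -> str:
    """Return the first class whose patterns appear in the document, else 'unknown'."""
    lowered = text.lower()
    for label, patterns in RULES.items():
        if any(re.search(p, lowered) for p in patterns):
            return label
    return "unknown"

print(classify_by_rules("This confidential offering is made pursuant to the private placement memorandum..."))
```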

The other approach is based on machine learning: training the system on many sample documents so that, with each iteration, it learns how to classify them. This approach is faster at the start, but it might not pick up all the rules PPR needs. Commercial and open source solutions are also in the mix. To figure out which model is better, the team is developing the approaches in parallel and will eventually run them side by side, weighing not only the quality of the results but also the time invested in each.
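For the machine-learning route, a common baseline is a bag-of-words model: vectorize labeled sample documents with TF-IDF and fit a linear classifier, as sketched below with scikit-learn. The samples and labels are toy data; the article doesn’t say which models or features the team is actually evaluating.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labeled samples standing in for the training corpus
docs = [
    "summary of terms for the proposed series A financing",
    "this confidential private placement memorandum describes the offering",
    "the undersigned subscriber agrees to purchase units under this subscription agreement",
    "term sheet outlining valuation and investor rights",
]
labels = ["term_sheet", "ppm", "subscription_agreement", "term_sheet"]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000))
model.fit(docs, labels)

print(model.predict(["confidential memorandum for a private placement offering"]))
```

Each new round of labeled samples retrains the model, which is how such a system “learns” with every iteration.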

The Text Analysis team is also looking beyond PPR to help other groups working with unstructured data. Not only would this enhance current processes, it would also improve analysis throughout the organization. As unstructured data becomes structured, information becomes not only more accessible but also easier for more analysts to process, strengthening analysis with more cohesive data.