How to extract data from documents based on keywords?

Most of the data we need to extract comes with its own keywords or labels that identify it. For example, Account Number, Invoice #, ID, Statement Date, Purchase Order ID, Vendor, and Page are all keywords for data we might need to extract from our documents. For this reason, AlgoDocs offers flexible, easy-to-use approaches for extracting data using keyword-based search.

There are two keyword-based data extraction approaches in AlgoDocs.

  • Text match [before|after|including] filters
  • Advanced keyword-based search

Keyword-based data extraction using Text match filters

The Text match filters approach is typically used when a keyword sits on the same line as the data we want to extract, to its left. Consider our sample invoice below and assume we would like to extract the Phone and INVOICE # fields.

To extract the Phone field, we click the ‘Extract’ button to get all the text extracted from the document. Once the text is extracted, two default filters are already added: ‘Specify Start Position’ and ‘Specify End Position’.

For ‘Specify Start Position’ we use the ‘Text match after’ filter option, which lets us enter one or more keywords that precede the data we want to extract. Let’s enter the keyword ‘phone:’, which is enough in our case.

For ‘Specify End Position’ we use the ‘Text match before’ filter option, which lets us enter one or more keywords as stopping criteria, i.e. as soon as one of the defined keywords is found, AlgoDocs stops reading the data at that point. Let’s enter the keyword ‘invoice’, which works as a stopping criterion here because ‘INVOICE #’ is located on the same line.

As a final step, we might need to remove the blank spaces around the extracted phone number. To do this, click ‘Add Filter’ → ‘Format Text’ → ‘Remove blank spaces’ and select the ‘Trailing Blank Spaces’ option.
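The start/end logic of the steps above can be illustrated outside AlgoDocs with a short Python sketch. This is only an approximation of the idea, not AlgoDocs’ implementation: the function names (first_match, extract_between), the case-insensitive matching, and the sample line are assumptions made for illustration, and strip() stands in for the ‘Remove blank spaces’ filter by trimming both ends.

```python
import re


def first_match(text, keywords, begin=0):
    """Return the leftmost case-insensitive match of any keyword, or None."""
    if not keywords:
        return None
    pattern = re.compile("|".join(re.escape(kw) for kw in keywords), re.IGNORECASE)
    return pattern.search(text, begin)


def extract_between(text, start_keywords, end_keywords):
    """Mimic 'Text match after' (start) and 'Text match before' (end)."""
    start = first_match(text, start_keywords)
    if start is None:
        return None  # no start keyword found, nothing to extract
    # Look for a stopping keyword only after the start keyword.
    end = first_match(text, end_keywords, start.end())
    stop = end.start() if end else len(text)
    # strip() removes leading and trailing blank spaces.
    return text[start.end():stop].strip()


# Hypothetical invoice line, invented for this example.
line = "Phone: (555) 010-2345    INVOICE # 10482"
phone = extract_between(line, ["phone:"], ["invoice"])
print(phone)  # (555) 010-2345
```

Passing several start keywords, for example ["phone:", "tel:"], would cover documents that label the same field differently.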

Similar steps for creating an extraction rule for the INVOICE # field are shown below.

Keyword-based data extraction using Advanced Keyword-Based Search

AlgoDocs has an advanced keyword-based search that allows you to define a set of keywords or phrases for AlgoDocs to search for. This approach differs from ‘Text match’ filters by offering a more flexible and detailed search, and it is not limited to inline keywords, as is the case with ‘Text match’ filters. For example, keywords are often positioned above the data we want to extract, in which case ‘Text match’ filters might not perform as well as we need. Moreover, some documents have complex layouts for which advanced keyword-based search is the only approach that works well. Consider the example portion of such a document below, which has Account Number, Meter Number, Total Amount Due, and Bill Date fields that we would like to extract. As you can see, the keywords are located above their corresponding values.

To apply advanced keyword-based search, we select ‘Advanced Keyword-Based Search’ as the data field type at the start, when creating an extraction rule. Another way to apply it is to click ‘Add Filter’ → ‘Advanced Keyword-Based Search’.

Next, we click the ‘Edit Keywords’ button to define the keywords for the data we need to extract. The power of AlgoDocs’ advanced keyword-based search is that it searches for the keywords in the order we specify and, once it finds a keyword, looks for the relevant value around it, i.e. in the area to the right of the keyword and in the area below it, depending on our settings. To illustrate keyword-based search, we will enter all possible keywords for Account Number. Imagine that we have various documents that might use different keywords for the account number, as shown below.

The output will look as follows.

Data extraction for dates, amounts, and other fields is set up in the same way as for the account number. Please feel free to contact our support team if you have questions about advanced keyword-based search.
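As a mental model of the search described above, here is a rough Python sketch that tries keywords in the given order and then reads the value either to the right of the keyword or below it. It is a toy approximation under stated assumptions, not AlgoDocs’ algorithm: the page is modeled as fixed-width text lines with columns separated by two or more spaces, and the names find_value and first_cell, the direction parameter, and all sample values are hypothetical.

```python
import re

# Assumption: columns in this toy layout are separated by 2+ spaces.
GAP = re.compile(r"\s{2,}")


def first_cell(text):
    """Return the first column cell of text, cut at the next wide gap."""
    return GAP.split(text.strip(" :"), maxsplit=1)[0].strip()


def find_value(lines, keywords, direction="below"):
    """Try keywords in order; return the value right of or below the first hit."""
    for keyword in keywords:  # keywords are tried in the order given
        pattern = re.compile(re.escape(keyword), re.IGNORECASE)
        for row, line in enumerate(lines):
            match = pattern.search(line)
            if match is None:
                continue
            if direction == "right":
                # Value follows the keyword on the same line.
                value = first_cell(line[match.end():])
            else:
                # Value starts at the keyword's column on the next line.
                below = lines[row + 1] if row + 1 < len(lines) else ""
                value = first_cell(below[match.start():])
            if value:
                return value
    return None


# Hypothetical bill header with keywords above their values.
page = [
    "Account Number".ljust(18) + "Meter Number".ljust(16) + "Bill Date",
    "1234567890".ljust(18) + "M-445566".ljust(16) + "03/15/2024",
]
account_keywords = ["Acct No", "Account #", "Account Number"]
print(find_value(page, account_keywords))  # 1234567890
```

Listing several keyword variants, as in account_keywords, lets one rule handle documents from different issuers; the first variant that yields a value wins.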