Rule-Based Data Extraction: Multipage Invoice

Master multi-page invoice data extraction with AlgoDocs! This comprehensive tutorial dives deep into rule-based extraction, guiding you through the process of setting up an extractor to capture specific fields like invoice number, date, purchase order, bill-to details, and itemized tables.

Learn how to:

1) Create a new extractor for multi-page documents

2) Utilize advanced keyword search filters for precise data capture

3) Apply various filters, like “trailing blank spaces” and “remove line breaks” for clean data

4) Extract table data using column separators and refine captured information

5) Customize table layout and column headers for organized presentation

6) Download extracted data in Excel format with hidden headers (optional)

7) Process multiple documents with a single extractor

8) Effortlessly streamline your invoice processing workflow and eliminate manual data entry with AlgoDocs!

Start using the free plan of AlgoDocs today (50 pages per month)!



Hi! Welcome to another screencast video of AlgoDocs. Today, I will walk you through the process of creating extractors for extracting data from a multi-page invoice document using a rule-based extraction method.

Here is my sample document. I’ll be extracting six fields: invoice date, invoice number, purchase order number, the Bill-To field, the items table, and the invoice total.

So let’s jump right into the extractors page to create a new extractor.

I input a name and select a sample document.

I head over to the extractor editor to begin adding fields.

I start with the invoice number field. Remember, I am using the rule-based data extraction method.

I do not need these default filters, so I take them out.

I will be using the advanced keyword-based search filter, so I copy this keyword.

I need to select the data type for my search phrase, which is invoice number.

There are different data types to choose from. For value position, I select below the phrase.

With my data captured, I can remove spaces. I select the trailing blank spaces option, then I save this field.
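For readers who think in code, the logic of this field can be sketched in Python. This is only an illustration of the idea (find a phrase, read the value directly below it, trim trailing blanks), not how AlgoDocs implements the filter. The function name `value_below_phrase`, the two-space column gap rule, and the sample lines are all assumptions made up for the example.

```python
import re

def value_below_phrase(lines, phrase):
    """Return the value sitting directly below `phrase`, without trailing blanks."""
    for i, line in enumerate(lines[:-1]):
        col = line.find(phrase)
        if col == -1:
            continue
        # Read the next line starting at the column where the phrase begins.
        below = lines[i + 1][col:].lstrip()
        # Assumption: a run of two or more spaces marks the gap to the next column.
        cell = re.split(r"\s{2,}", below, maxsplit=1)[0]
        # Equivalent of the "trailing blank spaces" clean-up.
        return cell.rstrip()
    return None

# Hypothetical text lines, as they might look after the PDF is converted to text.
sample = [
    "Invoice Number    Invoice Date",
    "INV-00042         03/15/2024  ",
]
print(value_below_phrase(sample, "Invoice Number"))
```

The same helper works for any other field whose value sits under its label; only the search phrase changes.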

I can duplicate the invoice number field to create the invoice date field.

I need to make some adjustments to the filters for invoice date.

So I change the data type to date.

And I can adjust the output format. I want it to display as day, month, and year.

You can choose any output format you want.
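Changing the output format of a date amounts to parsing the captured text and printing it again in a different layout. A minimal Python sketch of that idea follows. The input format (month/day/year) is an assumption, since the format in the sample invoice is not stated here, and `reformat_date` is a hypothetical helper, not an AlgoDocs function.

```python
from datetime import datetime

def reformat_date(raw, input_format="%m/%d/%Y", output_format="%d/%m/%Y"):
    """Parse `raw` using `input_format` and return it in `output_format`."""
    return datetime.strptime(raw.strip(), input_format).strftime(output_format)

# Hypothetical captured value, displayed as day, month, and year.
print(reformat_date("03/15/2024"))  # 15/03/2024
```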

I don’t need the trailing blank spaces filter, so I save this field.

I will duplicate the invoice number field for the purchase order number.

I just need to change the search phrase, because they are of the same data type.

Now for the Bill-to field, I will use a different method to capture the data.

Under the rule-based data extraction method, I select the field to text option.

I right-click and drag my mouse to select the area I want to capture.

Then I use the crop text filter and specify the start position to capture the text after Bill-To.

Then I specify the end position as a text match before Remit.

With my data captured, I can remove line breaks.

Then I use the remove blank spaces filter to remove multiple blank spaces.

I can save this field.
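The Bill-To steps (crop after a start phrase, stop before an end phrase, remove line breaks, collapse repeated spaces) can be sketched in Python as below. This is an illustration under assumptions, not the tool's implementation: the helper names, the exact phrases "Bill To" and "Remit", and the address text are made up for the example.

```python
import re

def crop_between(text, start_phrase, end_phrase):
    """Keep the text after `start_phrase` and before `end_phrase`."""
    start = text.find(start_phrase)
    if start == -1:
        return ""
    start += len(start_phrase)
    end = text.find(end_phrase, start)
    return text[start:] if end == -1 else text[start:end]

def clean(text):
    """Replace line breaks with spaces, then collapse runs of spaces."""
    no_breaks = text.replace("\r", " ").replace("\n", " ")
    return re.sub(r" {2,}", " ", no_breaks).strip()

# Hypothetical text captured from the selected area of the page.
region = "Bill To\nAcme Corp\n12  Example Street\nSpringfield\nRemit To\nOther Co"
print(clean(crop_between(region, "Bill To", "Remit")))
```

The order matters: cropping first keeps the clean-up from touching text that is thrown away anyway.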

To capture the value of Total, I use a different approach. For page selection, I need to select “apply to all remaining pages”, because the total might be on a page other than page two if the document has more pages due to additional table items.

I will use the specify start position filter to capture all the raw data after Subtotal, because Subtotal always appears before the value of Total.

Then I apply the specify start position filter again to capture the value after Total.

With the end of line filter, I can keep just the value after Total.

I can use the find numbers filter to format the number, then I save.
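The chain of filters for Total can be read as four small text operations applied one after another. The Python sketch below mirrors that chain; it is an illustration only, and the helper names, the number pattern, and the sample page text (including the "Line Total" column header) are assumptions rather than anything taken from the sample invoice.

```python
import re

def text_after(text, phrase):
    """Return everything that follows the first occurrence of `phrase`."""
    idx = text.find(phrase)
    return "" if idx == -1 else text[idx + len(phrase):]

def extract_total(page_text):
    # 1) Start after "Subtotal" so that earlier mentions of "Total" are skipped.
    after_subtotal = text_after(page_text, "Subtotal")
    # 2) Start after "Total" within what is left.
    after_total = text_after(after_subtotal, "Total")
    # 3) Keep only the rest of that line.
    first_line = after_total.split("\n", 1)[0]
    # 4) Find the number and convert it.
    match = re.search(r"\d[\d,]*(?:\.\d+)?", first_line)
    return float(match.group().replace(",", "")) if match else None

# Hypothetical page text; a column header also contains the word "Total".
page = (
    "Description  Qty  Line Total\n"
    "Widget  2  10.00\n"
    "Subtotal  1,200.00\n"
    "Tax  96.00\n"
    "Total  $1,296.00\n"
    "Thank you"
)
print(extract_total(page))  # 1296.0
```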

Now I have to create my items field.

Under the rule-based data extraction method, I’ll be using the table extraction option.

I add column separators to define the table columns. To add more lines, I use the plus icon.

With the columns separated, I change the page selection to “apply to all remaining pages”, because I might receive another document in the same format whose table items extend to page 50, 100, or 1000.
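Column separators behave like fixed cut positions applied to every text line of the table. A rough Python sketch of that idea follows, assuming the page has already been turned into fixed-width text lines. The function name, the separator positions, and the sample row are hypothetical.

```python
def split_at_separators(line, separators):
    """Cut `line` at each separator position and trim every cell."""
    bounds = [0] + sorted(separators) + [len(line)]
    return [line[a:b].strip() for a, b in zip(bounds, bounds[1:])]

# Hypothetical table row laid out in fixed-width columns (widths 4, 16, 5, 8).
line = f"{'1':<4}{'Widget blue':<16}{'2':<5}{'5.00':<8}{'10.00'}"
separators = [4, 20, 25, 33]
print(split_at_separators(line, separators))

# Applying the same separators to all remaining pages is then just a loop
# over the lines of each page, however many pages there are.
pages = [[line], [line, line]]
table = [split_at_separators(l, separators) for page in pages for l in page]
```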

With that done, I can start refining my captured data.

I have captured some data that is not part of the table.

To keep only the data that is part of the table, I’ll start by making sure each description in column two stays in a single row. If we look at the original file,

each description is kept in one row, but here it is split across different rows.

To do that, I’m going to use the “Alter rows” filter.

I select the merge rows option and specify where column one has a value.

Now you can see the data in column two is properly captured.
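One way to read "merge rows where column one has a value" is: a row with a value in column one starts a new record, and any following rows with an empty column one are folded into it. The Python sketch below implements that reading. It is an interpretation for illustration, not the tool's documented behavior, and the sample rows are made up.

```python
def merge_rows(rows, key_col=0):
    """Fold rows with an empty key column into the previous row that has one."""
    merged = []
    for row in rows:
        if row[key_col].strip() or not merged:
            # A value in the key column starts a new record.
            merged.append(list(row))
        else:
            # Continuation row: append its non-empty cells to the last record.
            last = merged[-1]
            for i, cell in enumerate(row):
                if cell.strip():
                    last[i] = f"{last[i]} {cell.strip()}".strip()
    return merged

# Hypothetical rows: the description of item 1 wraps onto a second line.
rows = [
    ["1", "Widget", "2", "10.00"],
    ["", "blue, large", "", ""],
    ["2", "Gadget", "1", "4.50"],
]
print(merge_rows(rows))
```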

To take out the other data that is not part of the table,

I’m going to use the keep rows option with column four as the criterion: I keep the rows where column four has a value that is a float number.

My table is properly extracted. I have 11 rows in my table, and my document also has 11 rows.
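Keeping only the rows whose fourth column holds a float is a simple way to drop page headers, footers, and notes that were swept up with the table. A small Python sketch of such a check follows. The pattern used to recognize a float (digits, optional thousands commas, a decimal part) is an assumption, as are the helper name and the sample rows.

```python
import re

# Assumed shape of a float: optional sign, digits (commas allowed), decimals.
FLOAT_RE = re.compile(r"-?(?:\d{1,3}(?:,\d{3})+|\d+)\.\d+")

def keep_rows_with_float(rows, col):
    """Keep rows whose cell at index `col` (0-based) looks like a float."""
    return [
        row for row in rows
        if len(row) > col and FLOAT_RE.fullmatch(row[col].strip())
    ]

# Hypothetical captured rows; only the first and fourth are real table items.
captured = [
    ["1", "Widget", "2", "10.00"],
    ["Page 1 of 2", "", "", ""],
    ["Terms", "Net 30", "", "due"],
    ["2", "Gadget", "1", "1,250.50"],
]
print(keep_rows_with_float(captured, 3))
```

Comparing the number of rows kept against the number of rows in the original document, as done above with 11 and 11, is a quick sanity check for any rule like this.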

I can further customize the look of my table.

I want invoice number, invoice date, purchase order number, Bill-To, and total to be columns in this table, with their values filling the rows.

To do that, I’ll use the “alter columns” filter and the “add column” option. It has added a new column before column 1, which is where I wanted it. You can specify whatever position you want for a new column.

With the “fill cells of column” option, I can fill the value of the invoice number into the new column.
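Adding a column and filling its cells with one field value is, in effect, repeating that value on every item row so each row carries its invoice details. A minimal Python sketch, with a hypothetical helper name and made-up values:

```python
def add_filled_column(rows, position, value):
    """Insert a new column at `position` (0-based) and fill every cell with `value`."""
    return [row[:position] + [value] + row[position:] for row in rows]

# Hypothetical items table and invoice number.
items = [["1", "Widget", "10.00"], ["2", "Gadget", "4.50"]]
with_number = add_filled_column(items, 0, "INV-00042")
print(with_number)
```

Passing a position equal to the row length appends the column at the end instead of the beginning.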

I am going to repeat the same process for invoice date, purchase order number, the Bill-To field, and total.

I have added the columns and populated the rows of “invoice date” and “purchase order number”, but for the Bill-To and total fields I will do something different: I want them to be added after column 14.

After adding the Bill-To and total columns, I can now set the column headers.

With the “alter columns” option, I can change the column headers.

My column headers have been renamed. I can save and exit the extractor editor.

I head over to the extracted data page.

Here I can download the Excel file of my extracted data.

In my Excel sheet, you can see the “Invoice number”, “Invoice date”, “Purchase order number”, “Bill to”, and “Total” fields as headers, but I can hide them.

I return to my extracted data page and select the fields/table option to uncheck the header fields I want to hide.

I return to my AlgoDocs account and, in the Extractors page, I select the fields & table icon of my extractor, then I uncheck the fields I want to hide.

In my extracted data page, only the items table can be seen.

If I open the Excel file, you can see that those fields are hidden as headers but remain visible as columns in my items table. We can always go back to our extractor and check those fields again to display them. Now I can upload numerous documents to my extractor for data extraction.

This brings us to the end of this video.

I hope you enjoyed this tutorial and that you continue to use the numerous features that AlgoDocs offers. Remember, you can always send us your questions or support requests, and we will give you a quick response.


AlgoDocs, Invoice Processing, Data Extraction, Rule-Based Extraction, Multi-Page Documents, Invoice Data, Keyword Filters, Table Extraction, Excel Download, FreePlan, and Tutorial