Rule-Based Extractor for Single-Page Documents

In this demo video, we create an extractor to extract fields and line items from single-page invoices. Additionally, we demonstrate how you can add header fields of an invoice to the line items as new columns.

See the following sample document and the corresponding Excel output produced by the created extractor.

The sample document used in this video is: Invoice.pdf

The output generated by the created extractor is: AlgoDocs-20231107032119.xlsx

============================================================================================

Feel free to start a free subscription right now and parse your PDF documents. You can use AlgoDocs for free forever, with 50 pages per month. If you need to process a higher number of pages, then please see our affordable pricing plans.

If you have specific requirements and need a custom solution, please contact us.

============================================================================================

In this video, we will demonstrate rule-based data extraction from a single-page invoice document. In particular, we will extract the invoice number, invoice date, bill to, account number, customer purchase order number, and all the line items. So, let’s begin.

As the first step, we create the extractor. Let’s name it “1 Page Invoice” and choose a sample document.

We start by adding a field for the invoice number. Since Invoice Number is a field, we select the “Field or Text to Table” option, which is a rule-based method for extracting data. In the extractor editor, the “Specify Start Position” and “Specify End Position” filters are added by default. We remove them because we will use another filter, called “Advanced Keyword-Based Search.” This filter allows us to search for the value using phrases and to indicate the position of the value. As the last filter, we add one that removes trailing blank spaces. Finally, we name the field and save it.

We continue with the invoice date field, for which we can duplicate the invoice number field. Now, we click on the “Invoice Date” field to adjust it to capture the invoice date. We need to apply some changes in the “Advanced Keyword-Based Search” filter. Since the data we need to capture is a date, we change the data type to date and change the search phrase from invoice number to invoice date. By default, the format of the captured date is year-month-day. We can change it to any other format we need, for example, day-month-year. Here, we don’t need the “Remove blank spaces” filter, so we remove it.

Next, we continue with the Account Number field, for which we again duplicate the Invoice Number field.
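AlgoDocs applies these filters in its editor without any code. Purely as an illustration of the logic behind a keyword-based search followed by trimming and date formatting, here is a minimal Python sketch. The sample text, the function names, and the assumed input date format (month/day/year) are all hypothetical and are not taken from AlgoDocs or from the sample invoice.

```python
import re
from datetime import datetime

# Hypothetical text of an invoice header; the real layout may differ.
SAMPLE_TEXT = (
    "Invoice Number: INV-1001   \n"
    "Invoice Date: 11/07/2023\n"
    "Account Number: AC-2044\n"
)


def value_after_phrase(text, phrase):
    """Return the rest of the line that follows `phrase`, or None if absent."""
    pattern = re.escape(phrase) + r"\s*:?\s*(.*)"
    match = re.search(pattern, text, flags=re.IGNORECASE)
    if match is None:
        return None
    # Counterpart of a "remove trailing blank spaces" step.
    return match.group(1).rstrip()


def date_after_phrase(text, phrase,
                      input_format="%m/%d/%Y",    # assumed source format
                      output_format="%Y-%m-%d"):  # year-month-day
    """Find the value after `phrase` and re-format it as a date."""
    raw = value_after_phrase(text, phrase)
    if raw is None:
        return None
    return datetime.strptime(raw.strip(), input_format).strftime(output_format)


invoice_number = value_after_phrase(SAMPLE_TEXT, "Invoice Number")
invoice_date = date_after_phrase(SAMPLE_TEXT, "Invoice Date")
# Same field, re-formatted as day-month-year.
invoice_date_dmy = date_after_phrase(
    SAMPLE_TEXT, "Invoice Date", output_format="%d-%m-%Y"
)
```

As with duplicating a field in the editor, capturing another header value in this sketch only requires changing the search phrase, for example `value_after_phrase(SAMPLE_TEXT, "Account Number")`.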

For the Account Number field, we change only the search phrase in the “Advanced Keyword-Based Search” filter. For the customer purchase order number, we can duplicate either the Invoice Number or the Account Number field. Just like in the case of the account number, it is enough to change only the search phrase.

Now, we continue with the “Bill To” field, for which we add a new field. Note that we don’t duplicate any of the existing fields because the “Bill To” field takes a slightly different approach and uses different filters. To capture the “Bill To” part, we can select an approximate region for it. Then, when we move to the filters section, we can see that the “Raw Data” section contains data from the selected region only. In order to get the “Bill To” field properly, we need to add some filters. We begin by adding the “Specify Start Position” filter to get everything after the “Bill To” phrase. Then, we add the “Specify End Position” filter to get everything before the “Account” keyword. Now, we can remove line breaks and then remove multiple blank spaces.

Next, we will add a table for extracting line items. This time, we select the “Table” option, which lets us place column separators on our table.
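The “Bill To” filter chain described above (start position, end position, remove line breaks, remove multiple blank spaces) can be pictured as a sequence of plain string operations. The following Python sketch is only an illustration under assumed data: the region text and the function names are hypothetical and do not reflect how AlgoDocs implements its filters.

```python
import re

# Hypothetical raw text of the approximate region selected around "Bill To".
RAW_REGION = (
    "Ship Via: Ground\n"
    "Bill To: Acme  Corp\n"
    "12   Main Street\n"
    "Springfield\n"
    "Account Number: AC-2044\n"
)


def text_between(raw, start_phrase, end_keyword):
    """Keep everything after `start_phrase` and before `end_keyword`."""
    start = raw.find(start_phrase)
    # If the start phrase is missing, begin at the start of the text.
    begin = start + len(start_phrase) if start != -1 else 0
    end = raw.find(end_keyword, begin)
    # If the end keyword is missing, keep everything up to the end.
    if end == -1:
        end = len(raw)
    return raw[begin:end]


def remove_line_breaks(text):
    """Replace line breaks with single spaces."""
    return text.replace("\r", " ").replace("\n", " ")


def remove_multiple_spaces(text):
    """Collapse runs of spaces into one and trim both ends."""
    return re.sub(r" {2,}", " ", text).strip()


bill_to = remove_multiple_spaces(
    remove_line_breaks(text_between(RAW_REGION, "Bill To:", "Account"))
)
```

The order of the steps matters in this sketch: line breaks are replaced by spaces first, so the runs of spaces they leave behind are collapsed by the final step.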

We remove the “Keep Rows” filter, which is added by default. As we can see, the “Raw Data” section contains tabular data divided into columns according to the column separators we placed in the previous step. Now, we add the “Keep Section” filter with the “With Condition” option, which allows us to define the beginning and the end of our table rows. Our table begins with the keyword “product” in the first column and ends with the phrase “thank you”, also in the first column. We exclude the conditional rows since we don’t need them. Then, we add the “Keep Rows” filter again, this time to get rid of empty rows.
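To make the “Keep Section” and “Keep Rows” steps more concrete, here is a minimal Python sketch of the same idea applied to a list of rows. It assumes that “conditional rows” means the rows that match the start and end conditions; that reading, the sample rows, and the function names are assumptions for illustration, not AlgoDocs behavior.

```python
# Hypothetical rows, as they might look after the column separators are applied.
RAW_ROWS = [
    ["Invoice", "", "", ""],
    ["Product", "Qty", "Unit Price", "Amount"],
    ["Widget A", "2", "10.00", "20.00"],
    ["", "", "", ""],
    ["Widget B", "1", "5.50", "5.50"],
    ["Thank you for your business", "", "", ""],
    ["Page 1 of 1", "", "", ""],
]


def keep_section(rows, start_keyword, end_phrase, column=0,
                 include_condition_rows=False):
    """Keep the rows between the row containing `start_keyword` and the
    row containing `end_phrase`, both looked up in the given column."""
    section = []
    inside = False
    for row in rows:
        cell = row[column].strip().lower()
        if not inside:
            if start_keyword.lower() in cell:
                inside = True
                if include_condition_rows:
                    section.append(row)
            continue
        if end_phrase.lower() in cell:
            if include_condition_rows:
                section.append(row)
            break
        section.append(row)
    return section


def drop_empty_rows(rows):
    """Keep only rows that have at least one non-blank cell."""
    return [row for row in rows if any(cell.strip() for cell in row)]


line_items = drop_empty_rows(keep_section(RAW_ROWS, "product", "thank you"))
```

With the sample rows above, the header row, the closing “Thank you” row, and the blank row are dropped, and only the two item rows remain.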

Additionally, we can add another filter to remove trailing blank spaces. Now, assume we wish to include all the header fields as new columns for every line item. To do this, we first add a new column to the table, and then we add another filter to fill the cells of the newly added column with the invoice number field. We repeat this for the remaining fields. When the Auto-Fetch option is checked, filters are updated on every adjustment; otherwise, filters are updated only when a new filter is added. We have now added the Invoice Number, Invoice Date, Account Number, and Customer Purchase Order as new columns at the beginning of every item row. We can add the “Bill To” field to the end of the table. Next, we can format the number columns in the table. When this filter is added, the corresponding cells in the exported Excel or JSON output will be formatted as numbers. Additionally, we can change the number of decimal places. Finally, we can set the column headers.
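The effect of filling new columns with header fields, formatting number columns, and setting column headers can be sketched in a few lines of Python. This is an illustration only: the field values, the column names, and the function names are hypothetical, and rounding to a fixed number of decimals stands in for whatever number formatting AlgoDocs applies.

```python
# Hypothetical line items and header fields captured by the earlier steps.
ITEMS = [
    ["Widget A", "2", "10.00", "20.00"],
    ["Widget B", "1", "5.50", "5.50"],
]
LEADING_FIELDS = ["INV-1001", "2023-11-07", "AC-2044", "PO-77"]
TRAILING_FIELDS = ["Acme Corp 12 Main Street Springfield"]
COLUMN_HEADERS = [
    "Invoice Number", "Invoice Date", "Account Number", "Customer PO",
    "Product", "Qty", "Unit Price", "Amount", "Bill To",
]


def add_header_columns(rows, leading, trailing):
    """Repeat the header field values on every row: `leading` values go
    before the item columns and `trailing` values after them."""
    return [list(leading) + list(row) + list(trailing) for row in rows]


def format_number_columns(rows, columns, decimals=2):
    """Convert the given column indexes from text to rounded numbers."""
    formatted = []
    for row in rows:
        new_row = list(row)
        for index in columns:
            text = new_row[index].replace(",", "")  # drop thousands separators
            new_row[index] = round(float(text), decimals)
        formatted.append(new_row)
    return formatted


rows = add_header_columns(ITEMS, LEADING_FIELDS, TRAILING_FIELDS)
rows = format_number_columns(rows, columns=[5, 6, 7])
table = [COLUMN_HEADERS] + rows
```

Each item row in `table` now carries the invoice-level values, so the table can be exported on its own without losing the header information.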

Let’s go to the “Extracted Data” section and see the output of the created extractor. As we can see, all created fields are captured successfully, along with the Items table. We can export the extracted data to Excel. The header fields at the top are now redundant since we added them to the items table. Therefore, let’s hide them. This can be done in the settings of the extractor. When we return to the “Extracted Data” section, those fields are no longer shown, but they remain in the items table. This time, our exported Excel contains only the items table. Since our extractor is ready, we can upload hundreds or thousands of files in File Manager or use integrations to import documents into AlgoDocs automatically.