A guide on extracting tables from low quality scanned documents

Many companies deal with thousands of documents every month, and as that number grows, document workflow automation becomes vital. One of the most frequent and most tedious operations in document processing is reading data from tables, especially when the documents are scanned PDFs or images.

Extracting tables from scanned documents automatically and exporting them to Excel or JSON within seconds is a dream for every company that deals with manual data entry. It reduces operational costs and saves a lot of time.

In this article, we will talk about table extraction from low-quality scanned documents and images.

You have most likely come across online tools that can extract tabular data from documents. However, only a few of them really work with low-quality scanned documents or images taken with a mobile device.

Optical Character Recognition (OCR) is the technology used for converting scanned images into text. However, standard OCR tools require you to apply certain image processing operations to the images before you can run OCR on them. Without this manual pre-processing, OCR fails in most cases and accuracy is low. Unfortunately, free OCR tools often perform poorly even with pre-processing.

How to extract tables from scanned PDFs and images with low quality?

AlgoDocs has an advanced AI-powered OCR engine that automatically handles low-quality scanned PDFs and images. AlgoDocs accepts colour scans, black-and-white scans and scans made with any other settings, and extracts data from them with high accuracy. It can process scanned images with a resolution as low as 75 dpi.

If you have low-quality scanned PDFs or images, then AlgoDocs is the right solution for you. We offer a free subscription (forever) with 50 pages per month, so you can start right now and test your own scanned documents. If you need to process a higher number of pages, please see our affordable pricing plans.

For the basic steps of table extraction from documents, please read our article:

Extract tables from PDF and scanned documents

AlgoDocs: the best software tool to extract tables from scanned PDFs and images

Consider the portions of the scanned documents below and the tables that AlgoDocs extracted from them.

Example #1

Sample low-quality scanned image (black and white).
Table extracted by AlgoDocs.

Example #2

Sample low-quality scanned image.

Table extracted by AlgoDocs.

As you can see, AlgoDocs extracted both tables accurately even though the scans are of low quality. However, there are cases when a scanned image may cause AlgoDocs to make mistakes with small characters such as punctuation marks and other symbols (points, commas, date separators, etc.).

Let’s have a look at the example below with a scanned image and see what AlgoDocs could extract from it.

The table extracted from the scanned image above is shown below. As you can see, some numbers were extracted with the wrong decimal separator (circled in red), i.e. a decimal point was mistakenly recognized as a comma. This is caused by the dark background that some rows have in the image.

With the flexible extracting rules of AlgoDocs, the workaround is quick and simple. Whenever you have low-quality scanned PDFs or images, we advise you to follow the steps explained below.

Step 1. Remove all points and commas from the numbers

We apply the ‘Search & Replace’ filter in AlgoDocs with regular expressions as the search type. In the example below we apply this rule to all columns, but you can restrict it to a specific column when needed. To find all dots and commas we use \.|, as the search term, and we leave the second field (replace by this) empty, since we simply want to remove them.
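Outside AlgoDocs, the effect of this search term can be reproduced with a few lines of Python. This is only a sketch of what the regular expression does, not the filter's actual implementation; the helper name strip_separators and the sample cell values are assumptions made up for illustration.

```python
import re

# Same pattern as the search term above: a literal dot OR a comma.
SEPARATOR_PATTERN = re.compile(r"\.|,")


def strip_separators(cell: str) -> str:
    """Remove every dot and comma from a cell (hypothetical helper)."""
    return SEPARATOR_PATTERN.sub("", cell)


# Illustrative cells: one read correctly, one with the decimal point
# misread as a comma. Both end up as the same digit string.
cells = ["2,378.63", "2,378,63"]
cleaned = [strip_separators(c) for c in cells]
print(cleaned)
```

Because every separator is removed, it no longer matters whether the OCR engine read a given mark as a point or as a comma.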

Step 2. Convert all numbers back to their original format

Removing all points and commas makes every number 100 times larger (for example, 2,378.63 becomes 237863). Since we know that our numbers had 2 decimal places, we can divide them by 100 to get the original values. The ‘Arithmetic Operation’ filter does exactly this. Note that this only works when every value in the column has exactly 2 decimal places.

We divide numbers by 100 in the last column as shown in the example below. You may apply this filter to other columns too.
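The two steps together can be sketched in Python as follows. This is an illustration of the arithmetic only, under the same assumption as above that every value has exactly 2 decimal places; the function name restore_two_decimals is hypothetical, and Decimal is used instead of float so that the division by 100 stays exact.

```python
import re
from decimal import Decimal


def restore_two_decimals(cell: str) -> Decimal:
    """Strip all dots and commas, then divide by 100 (hypothetical helper).

    Assumes the original value had exactly 2 decimal places.
    """
    digits = re.sub(r"\.|,", "", cell)   # Step 1: remove separators
    return Decimal(digits) / Decimal(100)  # Step 2: restore the scale


# The correctly read and the misread variant give the same result.
for raw in ["2,378.63", "2,378,63"]:
    print(raw, "->", restore_two_decimals(raw))
```

A value that was scanned with a different number of decimal places would be scaled incorrectly, so a column should only be processed this way when its format is known in advance.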

That’s it. We got numbers in their original form with 100% accuracy!

The same approach can be applied to other symbols when you work with low-quality documents.
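As one possible illustration, misread date separators can be handled in the same remove-and-restore way. The sketch below assumes, purely for the sake of example, that every date is in a fixed DD.MM.YYYY layout with two-digit day and month; the function name normalize_date and the sample value are hypothetical.

```python
import re


def normalize_date(cell: str) -> str:
    """Drop every non-digit, then re-insert separators (hypothetical helper).

    Assumes a fixed DD.MM.YYYY layout, i.e. exactly 8 digits.
    """
    digits = re.sub(r"\D", "", cell)
    if len(digits) != 8:
        raise ValueError(f"expected 8 digits, got {digits!r}")
    return f"{digits[:2]}.{digits[2:4]}.{digits[4:]}"


# A date whose first separator was misread as a comma.
print(normalize_date("12,05.2021"))
```

As with the numbers, this relies on knowing the layout beforehand; dates written with single-digit days or months would need a different rule.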

Please contact us if you need any assistance.