A Guide on Extracting Tables From Low-Quality Scanned Documents

Many companies deal with thousands of documents every month. Document workflow automation becomes vital for such companies as the number of documents increases. One of the most frequent and at the same time tedious operations when processing documents is reading data from tables, especially when documents are scanned PDFs or images.

Extracting tables from scanned documents and exporting them to Excel or JSON within seconds is a dream for every company that deals with manual data entry. Automating this step reduces operational costs and saves a lot of time.

In this article, we will talk about table extraction from low-quality scanned documents and images.

You have most probably come across online tools that can extract tabular data from documents. However, only a few of them really work with low-quality scanned documents or images taken with a mobile device.

Optical Character Recognition (OCR) is the technology used for converting scanned images into text. However, standard OCR tools require you to apply certain image-processing operations to the images before you can run OCR on them. Without this manual pre-processing, OCR fails in most cases or produces low accuracy. Unfortunately, even with pre-processing, free OCR tools often perform poorly.

How to extract tables from scanned PDFs and images with low quality?

Algodocs has an advanced AI-powered OCR engine that automatically handles low-quality scanned PDFs and images. Algodocs accepts color scans, black-and-white scans, and scans made with any other settings, and extracts data with high accuracy. It can process scanned images with a resolution as low as 75 dpi.

If you have low-quality scanned PDFs or images, then Algodocs is the right solution for you. You can start right now and test your own scanned documents, since we offer a free subscription (forever) with 50 pages per month. If you need to process more pages, please see our affordable pricing plans.

Please read our article on the basic steps for table extraction from documents here:

Extract tables from PDF and scanned documents

Algodocs: the best software tool to extract tables from scanned PDFs and images

Consider the portions of the scanned documents below and the tables that Algodocs extracted from them.

Example #1

Sample low-quality scanned image (black and white)

Extracted table by Algodocs.

Example #2

Extracted table by Algodocs

As you can see, Algodocs extracts these tables accurately even from low-quality scans. However, in some cases a scanned image may cause Algodocs to misread small characters such as punctuation marks or other symbols (points, commas, date separators, etc.).

Let’s have a look at the example below with a scanned image and see what Algodocs could extract from it.

The table extracted from the scanned image above is shown below. As you can see, some numbers are extracted with wrong decimal separators (circled in red), i.e. a decimal point is mistakenly recognized as a comma. This is due to the dark background that some rows have in the image.

With the help of the flexible extraction rules in Algodocs, the workaround is quick and simple. Whenever you have low-quality scanned PDFs or images, we advise you to follow the steps explained below.

Step 1. Remove all points and commas from the numbers

We apply the ‘Search & Replace’ filter in Algodocs, using regular expressions as the search type. In the example below we apply this rule to all the columns, but you can restrict it to a specific column when needed. To find all dots and commas, we use the search term \.|, (an escaped dot, the alternation sign |, and a comma), and we leave the second field (replace by this) empty, since we simply want to remove them.
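To see what this search term matches outside Algodocs, the same rule can be sketched with Python's standard re module. This is only an illustration of the pattern, not how Algodocs implements the filter; the helper name strip_separators and the sample cell values are assumptions made for the example.

```python
import re

# Same pattern as the search term above: a literal dot OR a comma.
SEPARATORS = re.compile(r"\.|,")

def strip_separators(cell: str) -> str:
    """Remove every dot and comma from a table cell (hypothetical helper)."""
    # Replacing with an empty string mirrors leaving the "replace by" field empty.
    return SEPARATORS.sub("", cell)

# One correctly read number and one whose decimal point was misread as a comma.
cells = ["2,378.63", "2,378,63"]
cleaned = [strip_separators(c) for c in cells]
print(cleaned)  # ['237863', '237863']
```

Both variants collapse to the same digit string, which is why the misread separator no longer matters after this step.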

Step 2. Convert all numbers to their previous format

Since we removed all points and commas from the numbers, their values actually increased: each one was effectively multiplied by 100 (2,378.63 became 237863). Because we know that our numbers had 2 decimal places, we can divide all of them by 100 to get the original values. The ‘Arithmetic Operation’ filter does exactly this.

We divide numbers by 100 in the last column as shown in the example below. You may apply this filter to other columns too.
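The arithmetic behind this step can be sketched in Python as well. The snippet below is an illustration under the same assumption the article makes, namely that every number in the column originally had exactly 2 decimal places; the helper name restore_decimals is hypothetical and is not part of Algodocs.

```python
from decimal import Decimal

def restore_decimals(digits: str, places: int = 2) -> Decimal:
    """Divide a separator-free digit string by 10**places (hypothetical helper).

    Decimal is used instead of float so the result is exact.
    """
    return Decimal(digits) / (Decimal(10) ** places)

# 237863 is what remains of 2,378.63 once dots and commas are removed.
print(restore_decimals("237863"))  # 2378.63
```

If a column mixes whole numbers with two-decimal numbers, dividing every value by 100 would shrink the whole numbers, so the rule is best restricted to columns where the number of decimal places is known.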

That’s it. We got numbers in their original form with 100% accuracy!

The same approach can be applied to other symbols when you have low-quality documents.

Please contact us if you need any assistance.
