What do PDF Parser & OCR options in extracting rules mean?

AlgoDocs lets you choose, for every extractor, which engine to apply when extracting data from a document. There are three options:

  • Both (when PDF Parser fails, then apply OCR)
  • PDF Parser only
  • OCR only

The first option is the default: AlgoDocs first tries the PDF Parser and, if it fails to extract data from the document, automatically applies OCR (Optical Character Recognition). So although there are three options, only two engines actually extract data from documents: PDF Parser and OCR.
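The fallback behaviour of the default option can be sketched in a few lines of Python. This is only an illustration, not AlgoDocs code: the `Engine` enum, the `extract` function, and the two engine callables are hypothetical names, and treating a `None` result as "PDF Parser failed" is an assumption.

```python
from enum import Enum
from typing import Callable, Optional


class Engine(Enum):
    """The three options offered in an extractor (names are hypothetical)."""
    BOTH = "Both (when PDF Parser fails, then apply OCR)"
    PDF_PARSER_ONLY = "PDF Parser only"
    OCR_ONLY = "OCR only"


# Assumed engine signature: takes a file path and returns the extracted
# text, or None when nothing could be extracted.
ExtractFn = Callable[[str], Optional[str]]


def extract(path: str, engine: Engine,
            pdf_parser: ExtractFn, ocr: ExtractFn) -> Optional[str]:
    """Run the engine(s) selected by `engine` on the file at `path`."""
    if engine is Engine.PDF_PARSER_ONLY:
        return pdf_parser(path)
    if engine is Engine.OCR_ONLY:
        return ocr(path)
    # Engine.BOTH: try the PDF Parser first, fall back to OCR on failure.
    result = pdf_parser(path)
    if result is None:
        result = ocr(path)
    return result
```

With stand-in engines, `extract("scan.pdf", Engine.BOTH, lambda p: None, lambda p: "text from OCR")` returns the OCR result, because the stand-in PDF Parser reports a failure.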

What does each of these engines do, and how do you know which one to use?

The PDF Parser engine works only on generated (non-scanned) PDF files, that is, PDF files that contain actual text rather than scanned pages. The OCR engine, by contrast, works on all types of documents: text PDF files, scanned PDF files, and images.

This raises another question: why do I need the PDF Parser engine at all if OCR handles all types of documents?

AlgoDocs keeps the PDF Parser engine because of its speed: it is much faster than the OCR engine. PDF Parser can extract data from a single page in 1-2 seconds, whereas OCR takes 10-15 seconds per page. We therefore advise you to prefer PDF Parser when your PDF files contain text only.
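To see what this difference means for a larger document, here is a rough back-of-the-envelope calculation in Python. It uses the per-page figures quoted above and assumes, purely for illustration, that pages are processed one after another and that time grows linearly with the page count; `estimate_seconds` is a hypothetical helper, not part of AlgoDocs.

```python
from typing import Tuple

# Per-page timings quoted in this article, in seconds: (fastest, slowest).
PDF_PARSER_SECONDS = (1, 2)
OCR_SECONDS = (10, 15)


def estimate_seconds(pages: int, per_page: Tuple[int, int]) -> Tuple[int, int]:
    """Return the (fastest, slowest) total time for `pages` pages."""
    fastest, slowest = per_page
    return pages * fastest, pages * slowest


print(estimate_seconds(100, PDF_PARSER_SECONDS))  # (100, 200)
print(estimate_seconds(100, OCR_SECONDS))         # (1000, 1500)
```

Under these assumptions, a 100-page text PDF would take roughly 2-3 minutes with PDF Parser and roughly 17-25 minutes with OCR.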

Please note that when a PDF file contains a mixture of text and images, PDF Parser extracts only the text sections of the document and ignores the scanned sections. To extract data from the scanned sections too, select the ‘OCR only’ option.
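The guidance in this article can be summarised as a small decision helper. The Python function below is a hypothetical sketch rather than an AlgoDocs feature: `recommend_engine` and its two flags are invented for illustration, and it assumes you already know whether your document has a text layer and whether it contains scanned content.

```python
def recommend_engine(has_text_layer: bool, has_scanned_content: bool) -> str:
    """Suggest an engine option based on what the document contains."""
    if has_scanned_content:
        # Scanned PDFs, images, and mixed text-and-scan PDFs need OCR,
        # because PDF Parser ignores the scanned sections.
        return "OCR only"
    if has_text_layer:
        # Text-only PDFs: PDF Parser is much faster.
        return "PDF Parser only"
    # Nothing known about the document: keep the default option.
    return "Both (when PDF Parser fails, then apply OCR)"
```

For example, `recommend_engine(has_text_layer=True, has_scanned_content=True)` suggests ‘OCR only’ for a PDF that mixes text and scanned pages.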

If you have trouble extracting data from your documents, please contact our support team.