OCR and Data Capture

A scanned document is a collection of digital "photos", one for each page. People can read and understand the text just by looking at it, but to a computer each page is only an image: it can display the page on screen, but it cannot search, copy or index the words.

To make use of the actual text, the software must first run the document through a process called OCR: Optical Character Recognition. This technology analyzes the scanned images, recognizes the characters in them, and converts them to real electronic text.

OCR can increase the value of your scanned documents by making content searchable and reusable.

Searchable PDFs

OCR is important when scanning documents to PDF because it makes your PDFs searchable. Your Document Management System can then index the documents, so that you can quickly search for and retrieve them later on.

PixEdit® software stores both the electronic text and the scanned image in the PDF. The text is stored as an invisible layer, which is why we call it "hidden" text. This means that documents are fully searchable and the text is reusable, while the visual appearance of the original is preserved.
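
To illustrate why recognized text makes documents retrievable, here is a minimal sketch of the kind of word index a Document Management System might build. It is not how PixEdit® or any particular system works; the function names and the sample file names and texts are hypothetical.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """Map each word to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in tokenize(text):
            index[word].add(name)
    return index

def search(index, query):
    """Return the sorted names of documents containing every query word."""
    words = tokenize(query)
    if not words:
        return []
    matches = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(matches)

# Hypothetical OCR output, keyed by file name (assumed for illustration).
ocr_text = {
    "invoice_001.pdf": "Invoice 1001 for pump maintenance",
    "drawing_017.pdf": "Pump housing assembly drawing",
}
index = build_index(ocr_text)
```

With this index, search(index, "pump invoice") finds only the invoice, while a scanned image without OCR text would contribute no words and could never be found this way.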

Reusing text

OCR is also useful for other purposes:

  • Copying text quickly from a scanned document into another application, such as Word, Excel, PowerPoint or Outlook
  • Exporting the text to a file and importing it into other applications
  • Adding PDF bookmarks more quickly
  • Processing forms and extracting metadata
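
As a rough illustration of the last point, the sketch below pulls two metadata fields out of recognized text using regular expressions. The field names, the patterns and the sample text are assumptions made for this example; real forms need patterns tuned to their own layout, and this is not a description of any product's forms engine.

```python
import re

# Hypothetical field patterns, assumed for illustration only.
FIELD_PATTERNS = {
    "invoice_number": re.compile(
        r"Invoice\s*(?:No\.?|Number)?\s*[:#]?\s*(\d+)", re.IGNORECASE
    ),
    "date": re.compile(r"Date\s*:?\s*(\d{4}-\d{2}-\d{2})", re.IGNORECASE),
}

def extract_metadata(text):
    """Return the first match for each field, or None if it is not found."""
    metadata = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        metadata[field] = match.group(1) if match else None
    return metadata

# Made-up OCR output from a scanned invoice.
sample = "ACME Ltd\nInvoice No: 10423\nDate: 2023-05-17\nTotal: 1250.00"
result = extract_metadata(sample)
```

Here result holds the invoice number and date as strings, ready to be stored as metadata alongside the document. Fields that cannot be found come back as None, so they can be flagged for manual entry.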


