The Optimal Optical Character Recognition: OCR and Digitization


Most Digital Asset Managers (DAMs) can search within the text of a document when the uploaded metadata includes Optical Character Recognition (OCR) output. For those who are new to digitization, OCR converts the text in an image into machine-readable characters. That machine-readable text is then accessible to users, adding an extra layer of discovery to your virtual collections. If you’ve ever been able to copy and paste from a scanned PDF, you’ve benefited from OCR.

Deriving the Metadata

Several tools can convert the text in an image to a text file. They examine the pixels in the image to identify characters and build a map of the letters, spaces, and special characters on the page. At Backstage, we do this at the PDF-generation stage with a product called ABBYY FineReader. This combination OCR engine and PDF creator is widely regarded as the most accurate OCR software platform available. Offered both as a standalone application and as a component integrated into third-party programs, ABBYY FineReader supports OCR in over 200 languages and has a high recognition rate.

Reliability Given Content Variables

Typically, we leave OCR uncorrected and rely on the robustness of ABBYY FineReader, which boasts an average accuracy rate of up to 99%. However, several factors can affect the quality of the conversion.

Because OCR is derived directly from the image, the following are important to consider:

What’s the contrast between the textual content and the rest of the image? High-contrast materials have a higher recognition rate than low-contrast materials.
Is the text fading? Similar to the issue of contrast, an OCR engine will have difficulty picking out pixels when the image has faded elements.
Are there stains on the page? Hard lines that intersect with text or smudges can trick OCR engines into thinking text appears differently than it does.
Does the image include decorative fonts? The more these deviate from the traditional shape of a given character, the more likely the reader is to make mistakes in its conversion.
Handwriting rarely converts correctly unless it is extremely uniform and precise. However, there are solutions for transcription which we’ll discuss later.
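
As a rough illustration of the contrast question above, here is a minimal sketch of how a scan might be screened before OCR. It is not part of our workflow or of ABBYY FineReader. It assumes the page is already available as a list of 8-bit grayscale pixel values, and the function names and the 0.15 threshold are hypothetical choices made for the example.

```python
from statistics import pstdev


def rms_contrast(pixels):
    """Return the RMS contrast of 8-bit grayscale pixels, on a 0..1 scale.

    RMS contrast is the standard deviation of the normalized intensities.
    """
    if not pixels:
        raise ValueError("pixels must not be empty")
    normalized = [p / 255 for p in pixels]
    return pstdev(normalized)


def needs_review(pixels, threshold=0.15):
    """Flag a scan whose contrast falls below a (hypothetical) threshold."""
    return rms_contrast(pixels) < threshold


# Toy data: 20% "ink" pixels and 80% "paper" pixels per page.
crisp_page = [0] * 20 + [255] * 80      # black ink on white paper
faded_page = [170] * 20 + [200] * 80    # gray ink on yellowed paper

print(round(rms_contrast(crisp_page), 3))
print(round(rms_contrast(faded_page), 3))
print(needs_review(faded_page))
```

A low score does not guarantee poor OCR, as the faded example later in this post shows. It only suggests which images may deserve a second look.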

Let’s look at some examples.

568 characters with spaces and all of them converted correctly! This example of a clear digitized image scores 100% in character recognition.

Background graphics with low contrast have made it more complicated for the reader to identify certain characters. Even so, this image scores about 95% in character recognition.

The unique font and smudged highlighting have resulted in some errors in the OCR conversion. The handwriting, too, is very irregular. With these added characters in mind, this image has around an 86% character recognition rate.

An unfortunate coffee stain and subsequent smudge have obscured some of the text. While the mug marks affected the conversion very little, the smudge resulted in this image having about a 98% conversion rate. Not bad, considering the condition!

Despite how faded this object was prior to digitization (even to the extent that it is difficult to read without adjusting the contrast on your monitor), the conversion rate is an impressive 100%. While some fading can affect the results of character recognition, these results are a testament to the quality of this technology.
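
For readers who want to score their own samples, the sketch below shows one common way to estimate a character recognition rate: compare the OCR output against a hand-checked transcription and count the single-character edits (Levenshtein distance) needed to turn one into the other. This is an assumption about method, not a description of how the percentages above were calculated, and the function names and sample strings are invented for the example.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest single-character insertions,
    deletions, or substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_accuracy(ground_truth, ocr_output):
    """Share of ground-truth characters (spaces included) recognized correctly."""
    if not ground_truth:
        raise ValueError("ground_truth must not be empty")
    errors = edit_distance(ground_truth, ocr_output)
    return max(0.0, 1 - errors / len(ground_truth))


# Invented sample: the OCR output has two wrong characters out of 29.
truth = "Optical Character Recognition"
ocr = "0ptical Charactcr Recognition"

print(f"{character_accuracy(truth, ocr):.1%}")
```

Counting spaces as characters matches the "characters with spaces" tally used in the first example.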

Added Touches to OCR Results

Some libraries want fully corrected OCR, depending on the needs of the collection or its patrons. A few companies assist with reviewing 100% of the conversion output. One such partner is Apex CoVantage, which provides data conversion and content enhancement, among other services. They can also deliver the output as METS/ALTO, a pair of XML standards that supports article-level segmentation for additional clarity and searching.
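
To give a feel for what METS/ALTO output contains, here is a minimal sketch that reads the text lines from a small ALTO fragment and lists words whose recorded confidence is low. The sample XML is invented for illustration and far simpler than a real deliverable; the ALTO version 3 namespace, the 0.95 threshold, and the function names are assumptions. The METS file that groups blocks into articles is not covered here.

```python
import xml.etree.ElementTree as ET

# Assumption: the file uses the ALTO version 3 namespace.
ALTO_NS = {"alto": "http://www.loc.gov/standards/alto/ns-v3#"}

# Invented sample; real ALTO files also carry coordinates and styles.
SAMPLE_ALTO = """
<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout><Page><PrintSpace>
    <TextBlock ID="block1">
      <TextLine>
        <String CONTENT="Library" WC="0.98"/>
        <SP/>
        <String CONTENT="News" WC="0.91"/>
      </TextLine>
      <TextLine>
        <String CONTENT="Fall" WC="0.99"/>
        <SP/>
        <String CONTENT="Schedule" WC="0.97"/>
      </TextLine>
    </TextBlock>
  </PrintSpace></Page></Layout>
</alto>
""".strip()


def text_lines(alto_xml):
    """Return the recognized text of each TextLine, words joined by spaces."""
    root = ET.fromstring(alto_xml)
    lines = []
    for line in root.iterfind(".//alto:TextLine", ALTO_NS):
        words = [s.get("CONTENT", "") for s in line.iterfind("alto:String", ALTO_NS)]
        lines.append(" ".join(words))
    return lines


def low_confidence_words(alto_xml, threshold=0.95):
    """Return words whose WC (word confidence) attribute is below the threshold."""
    root = ET.fromstring(alto_xml)
    flagged = []
    for s in root.iterfind(".//alto:String", ALTO_NS):
        wc = s.get("WC")
        if wc is not None and float(wc) < threshold:
            flagged.append(s.get("CONTENT", ""))
    return flagged


print(text_lines(SAMPLE_ALTO))
print(low_confidence_words(SAMPLE_ALTO))
```

A list of low-confidence words like this is one way a reviewer could decide where to focus correction effort.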

For handwritten materials, libraries have a few options. As we’ve seen, handwriting is almost always too irregular and idiosyncratic to be converted accurately, so recruiting community volunteers to perform manual transcription has become a very popular method. That said, the technology is always improving; Doxie.AI, for example, offers a proprietary, secure machine learning solution to better interpret some types of handwriting and convert it into structured metadata. While this is a developing technology, the conversion rates for some forms of handwriting have been extremely promising.

Questions?

Learn more about OCR and our partners by calling us at 1.800.288.1265 or visiting us online at www.bslw.com, or feel free to send an email anytime to info@bslw.com with questions or project ideas.