Most Digital Asset Managers (DAMs) can search within the text of a document if the uploaded metadata includes Optical Character Recognition (OCR) output. For those who are new to digitization, OCR is the conversion of text in images into something machine-readable. This machine-readable data is then accessible to users, adding an extra layer of discovery to your virtual collections. If you’ve ever been able to copy and paste from a scanned PDF, you’ve benefited from OCR.
Deriving the Metadata
There are several tools available for converting text in an image to a text file. These examine the pixels in the image to identify characters and create a map of letters, spaces, and special characters on the page. At Backstage, we do this at the PDF-generation stage with the assistance of a product called ABBYY FineReader. This combination OCR engine and PDF creator is widely regarded as the most accurate OCR software platform available. Whether used as a standalone application or integrated into third-party programs, ABBYY FineReader has OCR support for over 200 languages and a high recognition rate.
Reliability Given Content Variables
Typically, we leave OCR uncorrected and rely on the robustness of ABBYY FineReader, which boasts an average accuracy rate of up to 99%. However, there are factors that can affect the quality of the conversion.
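To put an accuracy figure in perspective, here is a minimal Python sketch of one common way to score OCR output: character accuracy, computed from the edit distance between the OCR text and a hand-corrected reference. This is an illustration only. It assumes accuracy is measured per character, which may not be how any particular vendor calculates its published rate, and the 2,500-character page used below is a hypothetical figure.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the fewest single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete a character from a
                curr[j - 1] + 1,     # insert a character into a
                prev[j - 1] + cost,  # substitute (or keep) a character
            ))
        prev = curr
    return prev[-1]


def character_accuracy(ocr_text: str, reference_text: str) -> float:
    """Share of reference characters the OCR got right (0.0 to 1.0).
    Assumption: accuracy is 1 - errors / reference length."""
    if not reference_text:
        return 1.0 if not ocr_text else 0.0
    errors = edit_distance(ocr_text, reference_text)
    return max(0.0, 1.0 - errors / len(reference_text))


def expected_errors(char_count: int, accuracy: float) -> int:
    """Rough number of misread characters on a page of char_count characters."""
    return round(char_count * (1.0 - accuracy))


# One wrong letter in a 19-character line.
print(character_accuracy("Tbe quick brown fox", "The quick brown fox"))

# Hypothetical page of 2,500 characters at 99% accuracy.
print(expected_errors(2500, 0.99))  # prints 25
```

Even at 99%, a dense page can carry a couple of dozen misread characters, which is why the condition of the source material matters.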
Because OCR is derived directly from the image, the following are important to consider:
- What’s the contrast between the textual content and the rest of the image? High contrast will have a higher rate of recognition than low contrast materials.
- Is the text fading? Similar to the issue of contrast, an OCR engine will have difficulty picking out pixels when the image has faded elements.
- Are there stains on the page? Hard lines that intersect with text or smudges can trick OCR engines into thinking text appears differently than it does.
- Does the image include decorative fonts? The more that these deviate from the traditional shape of a given character, the more likely that the reader may make mistakes in its conversion.
- Is the text handwritten? Handwriting rarely converts correctly unless it is extremely uniform and precise. However, there are solutions for transcription which we’ll discuss later.
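The contrast and fading questions above can be roughly checked before OCR is run. The Python sketch below computes RMS contrast (the standard deviation of pixel brightness) over a small grid of grayscale values and flags low-contrast pages for a closer look. The function names, the tiny sample grids, and the 0.2 threshold are all assumptions made for illustration. A real workflow would load pixel values from the scan with an imaging library and tune the threshold against its own materials.

```python
from statistics import pstdev


def rms_contrast(gray_pixels):
    """RMS contrast of a grayscale image given as rows of 0-255 values.
    Returns the standard deviation of brightness scaled to the 0..1 range."""
    values = [p / 255 for row in gray_pixels for p in row]
    if not values:
        raise ValueError("image has no pixels")
    return pstdev(values)


def needs_review(gray_pixels, threshold=0.2):
    """Flag a page whose contrast falls below a threshold.
    The default of 0.2 is a hypothetical cutoff, not a published standard."""
    return rms_contrast(gray_pixels) < threshold


# Hypothetical samples: 0 is black ink, 255 is white paper.
crisp_page = [[255, 255, 0, 255],
              [255, 0, 0, 255]]
faded_page = [[200, 200, 170, 200],
              [200, 170, 170, 200]]

print(needs_review(crisp_page))  # prints False
print(needs_review(faded_page))  # prints True
```

A check like this only covers contrast and fading. Stains, decorative fonts, and handwriting need a person's eye or a more specialized tool.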
Added Touches to OCR Results
Some libraries need fully corrected OCR because of the nature of the collection or the needs of its patrons. A few companies assist with reviewing 100% of the conversion output. One such partner is Apex CoVantage, which provides data conversion and content enhancement, among other services. They can also deliver the output as METS/ALTO, a pair of XML standards that record a document’s structure and the position of its text on each page, which provides article-level segmentation for additional clarity and searching.
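For a sense of what ALTO looks like, here is a short Python sketch that searches a tiny, made-up ALTO fragment for a word and reports which text block it appears in and where it sits on the page. The sample XML, its IDs and coordinates, and the find_word helper are hypothetical. The element and attribute names (TextBlock, TextLine, String, CONTENT, HPOS, VPOS) follow the ALTO version 3 schema, and real files from a vendor may use a different schema version and namespace.

```python
import xml.etree.ElementTree as ET

# Namespace for ALTO version 3; other versions use a different URI.
ALTO_NS = {"alto": "http://www.loc.gov/standards/alto/ns-v3#"}

# Hypothetical fragment: two text blocks on one page.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout>
    <Page ID="P1" WIDTH="2400" HEIGHT="3200">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="Library" HPOS="100" VPOS="120" WIDTH="310" HEIGHT="48"/>
            <SP/>
            <String CONTENT="opens" HPOS="430" VPOS="120" WIDTH="240" HEIGHT="48"/>
          </TextLine>
        </TextBlock>
        <TextBlock ID="TB2">
          <TextLine>
            <String CONTENT="New" HPOS="100" VPOS="900" WIDTH="150" HEIGHT="48"/>
            <SP/>
            <String CONTENT="library" HPOS="270" VPOS="900" WIDTH="290" HEIGHT="48"/>
            <SP/>
            <String CONTENT="hours" HPOS="580" VPOS="900" WIDTH="220" HEIGHT="48"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def find_word(alto_xml, word):
    """Return the block ID and page position of every match, ignoring case."""
    root = ET.fromstring(alto_xml)
    hits = []
    for block in root.iterfind(".//alto:TextBlock", ALTO_NS):
        for string in block.iterfind(".//alto:String", ALTO_NS):
            if string.get("CONTENT", "").lower() == word.lower():
                hits.append({
                    "block": block.get("ID"),
                    "hpos": float(string.get("HPOS")),
                    "vpos": float(string.get("VPOS")),
                })
    return hits


for hit in find_word(SAMPLE_ALTO, "library"):
    print(hit)
```

Because every word carries its coordinates, a viewer can highlight search hits directly on the page image. In a METS/ALTO package, the accompanying METS file is typically what groups blocks like these into logical units such as articles.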
For handwritten materials, libraries have a few options available. As we’ve seen, handwriting is almost always too irregular to be converted accurately. Recruiting community volunteers to perform manual transcription has become a very popular method. That said, the technology is always improving; Doxie.AI, for example, offers a proprietary, secure machine learning solution to better interpret some types of handwriting and convert it into structured metadata. While this is a developing technology, the conversion rates for some forms of handwriting have been extremely promising.
Questions?
Learn more about OCR and our partners by calling us at 1.800.288.1265, visiting us online at www.bslw.com, or emailing info@bslw.com anytime with questions or project ideas.