Scanning Text

There is software for your computer scanner that will convert text on a printed page to a computer file. Optical character recognition (OCR) software is useful because it saves time compared with re-typing the text. OCR software will save the results as a plain text file or in other formats such as a Word .doc file, an HTML page, a Rich Text file and sometimes even a PDF file. You can buy a separate OCR program, but often you will get an OCR program with a new scanner.

I bought my first OCR program, TextBridge, about 1998. I had lots of pages of cemetery records that I had typed in the 1980s and didn’t want to type again. Since then, every time I bought a new scanner, I made sure that OCR software was included. That turned out to be much cheaper than buying the software separately. I have used two versions of a program called PaperPort, and my newest scanner came with I.R.I.S. OCR.

Through the years all OCR software has gotten more accurate. There are claims that some OCR software is 98% accurate. That makes me laugh, as none of the current OCR software is that accurate. OCR software works best when you are scanning a clean, flat sheet of paper. No old printed page is that good. There are always stray marks and letters that weren’t printed well. All OCR software has problems telling the number ‘1’ from the small letter ‘l’. Then there is the problem of the number ‘0’ and the capital letter ‘O’. Depending on the font used in the original document, they can look very similar. Scanning newsprint has the problem that there is almost always bleed-through from the opposite side of the page. Scanning books has the problem that when you put the book on the scanner, the lines of text come out curved, and that throws off the OCR software.
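The ‘1’/‘l’ and ‘0’/‘O’ mix-ups can sometimes be cleaned up after the scan with a small script. What follows is only a rough sketch in Python, not something any of the OCR programs mentioned here does. The names TO_DIGIT, TO_LETTER, fix_token and fix_text are made up for this example, and the rule it uses (a word that is mostly digits should be all digits, a word that is mostly letters should be all letters) is my own assumption. It will guess wrong on things like part numbers or plot codes that really do mix letters and digits.

```python
import re

# Hypothetical lookup tables covering only the two confusions described above.
TO_DIGIT = str.maketrans({"l": "1", "O": "0"})
TO_LETTER = str.maketrans({"1": "l", "0": "O"})


def fix_token(token):
    """Guess the intended characters in one word from what surrounds them."""
    digits = sum(ch.isdigit() for ch in token)
    letters = sum(ch.isalpha() for ch in token)
    if digits and letters:
        if digits > letters:
            # Mostly digits: treat a stray 'l' or 'O' as '1' or '0'.
            return token.translate(TO_DIGIT)
        if letters > digits:
            # Mostly letters: treat a stray '1' or '0' as 'l' or 'O'.
            return token.translate(TO_LETTER)
    # All digits, all letters, or an even split: leave the word alone.
    return token


def fix_text(text):
    """Apply fix_token to every word in the text, keeping spacing as-is."""
    return re.sub(r"\w+", lambda m: fix_token(m.group()), text)


print(fix_text("Bui1t in 19O8"))
```

A word with an even split of letters and digits is left untouched on purpose, since there is no good way to guess which reading was meant. Anything the script changes should still be checked by eye against the original page.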

There are many websites that have newspapers, old books, etc. that have had OCR software run on them so you can search the text. On one website I was looking for my mother’s maiden name, ‘Wilklow’. When I looked at the “hits”, about 75% of them were for the word ‘window’. The Fulton History website is one of those where, when you do a search, you can see part of the OCR’d text in the results box (on the left of the page). If you take a good look at those results you will see that some are proper words and some are gibberish. It all depends on how good the original was. With the many millions of pages that Fulton History has, it would take hundreds of years to correct all the OCR’d text.

In more expensive OCR programs you can pick a language for the output. That will help if you are scanning text in a language that has accents, umlauts, etc. You can train some of the more expensive OCR programs to recognize unusual printed letters. One book that I was scanning had an odd-looking letter ‘w’ that always came out as ‘xv’. I am currently scanning some cemetery burial records that were printed about 1993 on a dot-matrix printer. If you magnify the text on those pages you can see the dots that make up the letters. The OCR software had trouble with the capital letter ‘M’. It decided that it was ‘l’l’ (el apostrophe el). Overall, I think OCR software is about 75% accurate. Editing can take lots of time, but in most cases it is still easier than re-typing documents.
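When the same misreading shows up again and again, like ‘xv’ for ‘w’ or ‘l’l’ for ‘M’, a short script can do the first pass of the editing. Here is a minimal sketch in Python. The CORRECTIONS table, the two function names and the sample line are all made up for illustration. The similarity score comes from Python's standard difflib module and is only a rough stand-in for "percent accurate", not the way OCR vendors measure it. A blanket replacement can also damage text that was already right (Roman numerals such as ‘xv’, for example), so the result still needs proofreading.

```python
from difflib import SequenceMatcher

# Hypothetical table of recurring misreadings, taken from the two examples
# above. Both the straight and the curly apostrophe are listed because OCR
# output may contain either.
CORRECTIONS = {
    "l'l": "M",
    "l\u2019l": "M",
    "xv": "w",
}


def apply_corrections(text, corrections=CORRECTIONS):
    """Replace each known misreading with the character it stands for."""
    for wrong, right in corrections.items():
        text = text.replace(wrong, right)
    return text


def similarity(ocr_text, corrected_text):
    """Return a score from 0.0 to 1.0 for how closely two texts match."""
    return SequenceMatcher(None, ocr_text, corrected_text).ratio()


# Made-up sample line, with the hand-typed version to compare against.
raw = "l'lary xvas born"
truth = "Mary was born"
fixed = apply_corrections(raw)

print(fixed)
print(similarity(raw, truth), similarity(fixed, truth))
```

Comparing the score before and after the replacements against a page that has been corrected by hand gives a quick idea of whether the table is helping or hurting.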