OCR (Optical Character Recognition) Programs.

Perpetual Man

Tim James
Recently I've been going through all the old drawers, as it were, and sorting everything out.

Hidden amongst it all is a lot of old material that I typed up in the dim and dusty past, and part of my current re-organisation/writing phase is to try and get some of this old stuff brought up to date and on the computer.

Obviously I will want to edit, so the best way of dealing with it, as opposed to re-typing an awful lot of stuff, is to use an OCR program.

I did use one in the olden days, and was less than enthused with the program's ability to translate typeface into documents.

But time has passed.

I managed to download a free OCR program called FreeOCR (says it all really), and for a free bit of software I have been quite impressed with its accuracy, what it can do, and the features I have not used but which look like fun!

There are certain things that seem to slow things down a bit, but as a whole it does the job.

I think there is a problem with it copying on about 1 in 8 pages, but I'm not too sure that is not down to fainter type on a given page.

So, has anyone else had experience with these programs, and just how accurate are the better ones? (And how expensive can they get?)

(At some point I realise a lot of my old material is handwritten and getting that across is going to be a lot more fun....)
 
OCR accuracy:

1. Resolution—Most OCR apps recommend hardcopy be scanned at 300 dpi or better. Using a lower resolution makes the characters less distinct.

2. Level—Character recognition works better when the text is perfectly level. Some software includes an auto-leveling feature. Don't assume that text is perfectly level just because the edge of the page is square to the scan bed. I've run into many professionally printed books where the print is skewed relative to the page.

3. Contrast—Most OCR software converts RGB or grayscale images to a pure black-and-white bitmap. This is done with an automatically applied contrast function, but the automatic might be off. If the hardcopy has wrinkles, dirt or other schmutz on the paper, you may run into a lot of errors. Manually setting the contrast for such pages might help. Sometimes "dirt," in the form of pulp and other flecks in the paper itself, may be unavoidable.

4. Typeface—A nice, clean typeface is best, but you may not have any choice in the matter. A typewriter font with lots of filled in spaces from dirty strikers or an old ribbon may make OCR nearly impossible. Multi-generation xerographic copies, or carbon copies—such as official documents—may also be very problematic.
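To make the contrast point (item 3) a bit more concrete, here is a minimal sketch of what a black-and-white conversion does, assuming the scan is already a flat list of grayscale values from 0 (black) to 255 (white). The function names are made up for illustration, and Otsu's method is just one common way to pick an automatic threshold; I don't know what any particular OCR package uses internally.

```python
def otsu_threshold(pixels):
    """Pick the gray level (0-255) that best separates dark ink from light paper."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_level, best_var = 0, -1.0
    w_dark = 0      # number of pixels at or below the candidate level
    sum_dark = 0    # sum of their gray values
    for level in range(256):
        w_dark += hist[level]
        if w_dark == 0:
            continue
        w_light = total - w_dark
        if w_light == 0:
            break
        sum_dark += level * hist[level]
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        # Between-class variance: larger means a cleaner ink/paper split.
        var_between = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var_between > best_var:
            best_var, best_level = var_between, level
    return best_level


def binarize(pixels, threshold=None):
    """Return 1 for ink and 0 for paper; pass threshold to override the automatic pick."""
    if threshold is None:
        threshold = otsu_threshold(pixels)
    return [1 if p <= threshold else 0 for p in pixels]


# Hypothetical page: dark type, light paper, and a few mid-gray flecks of dirt.
page = [20, 25, 30] * 10 + [200, 210, 220] * 30 + [120] * 3
auto = binarize(page)                  # automatic threshold
manual = binarize(page, threshold=100) # manual setting that treats the flecks as paper
```

The manual override is the part that matters for dirty pages: a threshold set below the gray level of the flecks keeps them out of the bitmap that the OCR engine sees.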

Most OCR software has some "signature" error that crops up over and over again. This may have to do with the source material, or something in the OCR software itself. Such "regular" errors are actually helpful as they are easy to fix with a find-and-replace or GREP function.
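As a rough sketch of the find-and-replace idea, here is how a batch of "signature" fixes might look using Python's standard `re` module. The error patterns below are hypothetical examples, not ones taken from any particular program; substitute whatever your own scans keep getting wrong.

```python
import re

# Hypothetical signature errors, as (pattern, replacement) pairs.
FIXES = [
    (re.compile(r"\btbe\b"), "the"),                # "h" misread as "b"
    (re.compile(r"(?<=[a-z])1(?=[a-z])"), "l"),     # digit 1 inside a word
    (re.compile(r"(?<=[A-Za-z])0(?=[a-z])"), "o"),  # digit 0 inside a word
]


def fix_signature_errors(text):
    """Apply each fix in order; return the cleaned text and the number of changes."""
    total = 0
    for pattern, replacement in FIXES:
        text, count = pattern.subn(replacement, text)
        total += count
    return text, total


cleaned, changes = fix_signature_errors("He fe1l off tbe wa1l in r0om 101.")
print(cleaned)   # He fell off the wall in room 101.
print(changes)   # 4
```

The lookbehind and lookahead keep real numbers such as "101" untouched, since the digits are only replaced when they sit between letters.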

Ultimately, nothing beats a manual read-through, which seems to be your intention since you want to rework or edit old documents. I've seen many ebooks with OCR-style errors. That's not a problem with the technology, but the methodology of the people who did the hardcopy-to-text conversion. After all, old print books got a manual read-through, so why not do the same for ebooks? Caveat: spell checkers and grammar checkers may not spot mistakes if the error is a real word.
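For the real-word caveat, one partial workaround is to flag words that are valid English but are also common OCR misreadings, so a human can check them in context. This is only a sketch: the word pairs are illustrative examples of the "rn" read as "m" confusion, and the function name is invented.

```python
import re

# Illustrative pairs: a real word OCR may produce, and the word it may have been.
CONFUSABLE = {
    "modem": "modern",
    "comer": "corner",
    "bum": "burn",
    "tum": "turn",
}


def flag_real_word_errors(text):
    """Return (line number, word found, possible intended word) for manual review."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for match in re.finditer(r"[A-Za-z]+", line):
            suggestion = CONFUSABLE.get(match.group().lower())
            if suggestion is not None:
                hits.append((line_no, match.group(), suggestion))
    return hits


sample = "The modem world\nturned the comer."
suspects = flag_real_word_errors(sample)
```

Nothing is changed automatically here, because "modem" is sometimes exactly what the author meant. The list just tells you where to look during the read-through.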

You obviously have a piece of free software that is highly accurate. Good for you. However, if you are ever in need of scan software with OCR features built-in, I highly recommend Hamrick's VueScan. All scanners come with OEM (original equipment manufacturer) software, but such software is frequently abandoned and not updated to work with newer operating systems. Perhaps this is a kind of planned obsolescence so that the manufacturer can force you to buy a new piece of hardware. In my experience, VueScan will run everything—no drivers needed. (For all of you who switched up to Windows 7 64-bit and found your old scanners bricked for lack of new drivers, I'm looking at you.)

EDIT: One last note—Sometimes scans of pictures from magazines will suffer from "print through" of whatever is on the other pages of a magazine. While this is typically not a problem with text, due to the auto-contrast applied for OCR, it brings up an important tip.

The print through from magazines, or other light-weight paper is not actually caused by the next page of the magazine, but the backside of the page you wish to scan. If the page is still in the magazine, slip a piece of black construction paper in there. This will prevent the backside of the page from "transmitting" through. This effect is similar to store front display windows—when it is dark outside and the display is lit up, you can see through the glass easily. But during bright daylight, the glass throws a lot of reflections back at you.

Most flatbed scanners have a white lid, which may be used for calibration. (Most scanners today have a calibration strip hidden from view.) If you are scanning many magazine-like pages, you might wish to tape a piece of black paper to the lid of your scanner.
 
Thanks Metryq.

The free OCR program has been quite good, but it seems that I have reached a bit from the past where I must have changed typewriters and the quality has dropped horrendously (some of it is my typing too...).

So that is not scanning so well.

I've downloaded VueScan and am running it in, I presume, trial mode, and have been quite impressed. Having an entire scanning suite in one place is more than a little handy, and the OCR seems to be doing a better job than FreeOCR.

Thanks again.
 
Just stumbled across something that might be a substitute for OCR.

Technically it is not OCR but it is close to it.

The program in question is called Nitro PDF and it is designed to give you more fundamental control over PDF files. You can use it to edit standard PDFs and make your own, and it has quite a few other features besides.

One of these is to scan things directly into the program and create PDFs from whatever document you are using, in this instance some scanned text.

Once done you have a PDF image of scanned text, but Nitro allows you to convert a PDF file into another format, say a Word file, and suddenly you have a fully transferred and accessible document.
 
I might take back slightly what I said earlier about the OCR of old newsprint, as apparently the fonts used were very embellished compared to those used today - especially f and s - and this causes the problems.
 
