“The one ‘hole’ in my workflow has been OCR. For years, people have been able to scan a document and have it converted into real text. One of my old printers even came with OCR software included – for Windows of course. But when I’ve really needed OCR, I’ve just assumed that there were no high quality packages available for Linux. Recently I decided to find out for myself (a complete OCR virgin) what is available, how to use it, and what the results are like. I installed every free OCR package I could find, and systematically tested them. They all work very differently, so I tried to design a simple test for my specific needs.”
Excellent review, very thorough and clear.
99% accuracy is still pretty bad; imagine 10 errors when you scan a 1000-word document. And 99% accuracy was only obtained by programs that are highly experimental and unavailable to the average GNU user.
Too bad, the state of open source OCR is really immature compared to what is available for Windows. I wish HP would supply good OCR software for Linux with their scanners. I used to use OCR on Windows about 10 years ago, and it was awful. Then I recently tried OCR again with the Windows software (I forget the name) that came with my HP PSC 1300 printer/scanner combo, and it was eerily accurate, truly impressive. However, it required installing a disgustingly slow, bloated, poorly designed package of unrelated software, typical HP junk. And it's Windows only.
OmniPage doesn't have 100% accuracy either, so I think 99% is very good.
99% is pretty good if you’re the type of person who can’t write one sentence without a typo.
Actually that’d be 10 errors for a 1000-character document, which is quite a bit worse.
</pedant mode>
It's striking that while we're smashing through processor speed barriers, you can now get a fairly powerful machine for under 500 bucks, and we have refrigerators that will order groceries for us when we're running low on staples, the two "holy grails" of mainstream computing haven't made nearly the same amount of progress: speech recognition and OCR have seen very small gains over the past decade compared to the rest of the computing realm.
It’s a pretty good indication of just how complex these types of technologies really are…having computers understand human types of communication is extremely difficult. This isn’t a Linux vs Windows vs whatever deal, it’s a man vs machine plight.
I look forward to the day where I can talk to my machine, and it’ll produce compilable code. Ok maybe not really, but it’s a nice thought nonetheless ;-).
A fridge that automatically orders items combines two known working products together; a computer and a fridge. Speech recognition is a much harder thing to combine because Dear Aunt, let’s set so double the killer delete select all.
I have worked in *nix-based image processing for almost 4 years, and have repeatedly beaten my head against the OCR barrier. In the end, we had to go with OCRXTR, by Vividata.
I have tried GOCR, ClaraOCR, OCRad, and Tesseract. None of the results were even close to usable for production work.
For some reason, although I could compile Tesseract on FreeBSD just fine after a little coaxing, the output was absolute garbage; mostly it looked like line noise, or Perl poetry. Not anywhere close to accurate. Maybe there are some Linux-specific assumptions in Tesseract?
GOCR produces half-decent results if all you are looking for is English ASCII output from fairly simple documents. But even then, its performance is really bad in comparison to OCRXTR: on the order of 1/10 the speed at best.
ClaraOCR is interesting, and we even considered contracting the author to extend it to allow for server-side work, but it was just not a reasonable time-investment.
Consider: Vividata OCRXTR can process 300-dpi black-and-white pages at a rate of at least two pages per second, and has a fairly linear performance curve, even with large multipage TIFF files. It also works on color images, automatically handles inverted pages, preserves images within pages, and produces nicely formatted PDF output. All in all, that is what I would consider the minimum requirements for production. There is just no comparison at this point. Unfortunately, since it is Linux-only, I have to run it under Linux compatibility mode in FreeBSD.
I had the same experience with Tesseract. We purchased OCRXTR. If anyone is interested, the engine is made by ScanSoft, so it is very accurate. I've compared it to scans by document vendors and it is right up there with them.
But like someone said earlier, it's more of a man-vs-machine fight than a Windows-vs-*nix fight.
I've been looking for proper software for converting images to text in bulk, and there simply isn't anything free that's suitable.
Now, here's how this should be done, for those interested.
First of all, neural networks and similar techniques can "teach" your box to handle handwritten material and so on, so anyone building OCR software has to be really good in this area. Secondly, a "training" stage is obviously needed for this kind of software to get better.
There used to be a proprietary package called Eyes on Hands which had some really good features (cost: $10,000+).
For instance, when scanning lots of documents containing numbers, you set up scan fields and say "here we'll have a number as input". It then matches each number or character against what it thinks it is and lists them all in a logical order in columns, saying something like "we believe these are 9s, in descending order of likelihood". Then you just look at its interpretation and can easily correct what it got wrong. By using AI/neural nets, the software can then actually get better at interpreting.
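For the curious, here's a very rough sketch in Python of that kind of loop: rank each field's best guess by confidence, let an operator correct the shaky ones, and fold the corrections back into the training data. Nothing here is from Eyes on Hands; the nearest-template "model", the confidence measure, and the ask_operator hook are all made up for illustration.

    import numpy as np

    def train(samples, labels):
        # crude "model": average the bitmaps of each digit into a template
        templates = {}
        for digit in set(labels):
            bitmaps = [s for s, l in zip(samples, labels) if l == digit]
            templates[digit] = np.mean(bitmaps, axis=0)
        return templates

    def classify(templates, bitmap):
        # distance to every template; confidence = how much the best guess
        # beats the runner-up
        dists = {d: np.linalg.norm(bitmap - t) for d, t in templates.items()}
        ranked = sorted(dists, key=dists.get)
        best, runner_up = ranked[0], ranked[1]
        confidence = 1.0 - dists[best] / (dists[runner_up] + 1e-9)
        return best, confidence

    def review(templates, fields, samples, labels):
        # list guesses in descending order of confidence, so the dubious
        # ones end up at the bottom where the operator looks hardest
        scored = [(f,) + classify(templates, f) for f in fields]
        for bitmap, guess, confidence in sorted(scored, key=lambda x: x[2], reverse=True):
            answer = ask_operator(bitmap, guess)   # hypothetical UI hook
            samples.append(bitmap)                 # corrections become new
            labels.append(answer)                  # training data...
        return train(samples, labels)              # ...so the model improves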
However, as with any software, my experience is that "niche software" which isn't of great value to sysadmins tends to have very few OSS counterparts, I'm afraid. Which is simply sad =(.
People here or elsewhere that have a problem with FOSS OCR apps actually think FOSS is free labor, free slaves working for them. That’s not the case at all.
If people really want the field to improve, they can make FOSS programs so that everyone can improve them. The fact that few people improved gocr or ocrad shows that it’s of no great interest.
As for admins and regular users, gocr and ocrad are more than sufficient. Together with FuzzyOCR they are enough to help my SpamAssassin recognize image spam, and they cover the very rare use of OCR at home.
Besides, IIRC the KDE app knows about ocrad too.
So I think they’re good enough for regular users, and admins.
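To give an idea of what "good enough" looks like in practice: a handful of lines of Python wrapped around ocrad covers the occasional bulk job at home. A minimal sketch, assuming the scans are already 1-bit PBM files in a scans/ directory (ocrad prints its recognized text on stdout by default; gocr can be dropped in the same way):

    import glob
    import subprocess

    for page in sorted(glob.glob("scans/*.pbm")):
        # run ocrad on one page and capture whatever text it recognizes
        result = subprocess.run(["ocrad", page],
                                capture_output=True, text=True, check=True)
        with open(page.replace(".pbm", ".txt"), "w") as out:
            out.write(result.stdout)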
For more specific work, of course, as always with FOSS, these apps won’t improve by magic alone.
People need to stop accusing others of wanting free labor in such discussions. Giving a factual account of the pros and cons of a piece of F/OSS software is not in itself deserving of such an insult. Now, when users of such software verbally assault the author, or demand support, then of course that crosses the line. You might notice in my case that our company was fully willing to pay serious money for one that *would* do the job. If we had found one F/OSS tool that was close to doing the job, we would have gladly paid for extra work. My main desire was to have something that I could compile and run natively in FreeBSD.
Yes, GOCR and friends will do some decent work for casual use or certain server-side tasks, but when I talk about production, I’m talking about systems that can process 10,000 scanned pages a day without choking, and produce formatted document output, preferably PDF.
To any here who are working on F/OSS OCR: there is an absolute *gold mine* waiting for you if you succeed in taking these systems to the next level. The FOSS world has the best webservers, application frameworks, operating systems and programming languages, but is lagging behind in many more specific areas like OCR.
Although, I suspect that anyone who produces such a beast will be greeted by a barrage of patent-violation cases. Not that I like the idea, but I think that’s most likely the outcome.
You might notice in my case that our company was fully willing to pay serious money for one that *would* do the job.
So far so good.
If we had found one F/OSS tool that was close to doing the job, we would have gladly paid for extra work.
But none were close to doing the job, because no one paid for the work to get them to do the specialist job. Your company, like many others, just opted to sink money into a proprietary package and leave the freedom alternative languishing.
If a few companies with interests in professional OCR formed a consortium to bring an F/OSS alternative up to snuff, the problem would be solved, and for the long run too.
Nope. Not going to happen. Invest in finished products. If the product isn’t finished, leave it by the wayside. In such a climate, F/OSS can’t reach maturity.
If we had found one F/OSS tool that was close to doing the job, we would have gladly paid for extra work.
What do you think about this?
http://www.abbyy.com/sdk/?param=59956
ABBYY are the makers of FineReader, so this is serious stuff. Yes, this is only a backend (SDK), but writing a GUI frontend for a full-featured backend is easier than recreating essential backend features requiring tremendous expertise.
I don’t see anything on that page to suggest that the FineReader Engine is available under a F/OSS license. I think if you develop anything that uses it, you’re gonna be paying them for the privilege. In the OP’s case, that means paying a licensing fee *and* hiring a coder to put a nice GUI on it.
Yeah, it’s not FOSS, but you have the API, so you don’t need the source. And it may turn out cheaper in the end. But if a FOSS license, not *nix compatibility, were an absolute requirement (why?), then of course FineReader Engine would not be suitable.
To be correct, OCR is and was developed using a few special OCR typefaces. The engine used a bitmap overlay on each character; since the spacing was not kerned it was almost easy, but very time-intensive.
Today billions of packages and addresses are read by delivery and postal machines.
They basically find the address region, then try to break it up into lines of print, and then try to split the lines into single characters. They then use confidence levels from three to six character engines. Each engine uses its own algorithm. One is a software copy of an original single-purpose hardwired computer designed almost 30 years ago; oddly, the modern engines have not been able to fully beat the old system. Most of the newer designs use either a pixel-based or a shape/curve-based mathematical approach to determine the character.
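As a toy illustration of the multi-engine part (run each engine on an already-segmented character image and keep the most confident answer), here's a sketch; the bitmap-overlay engine below is only a stand-in for the old hardwired design, and the real postal engines are of course far more sophisticated:

    import numpy as np

    def template_engine(bitmap, templates):
        # old-school bitmap overlay: the template sharing the most pixels wins
        scores = {ch: np.mean(bitmap == t) for ch, t in templates.items()}
        best = max(scores, key=scores.get)
        return best, scores[best]          # (character, confidence in 0..1)

    def vote(bitmap, engines):
        # run every engine and keep the answer given with the highest confidence
        guesses = [engine(bitmap) for engine in engines]
        return max(guesses, key=lambda g: g[1])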
99% accuracy isn’t all that bad for a typical document. Is it good enough for the U.S. Postal Service? Probably not. But it would do the trick for the vast majority of applications.
99% accuracy would be fine for the postal service. Postal OCR systems have an advantage in that there is a lot that can be done with contextual analysis (addresses are generally in a known format and there are databases which contain all of the addresses). Postal systems are always trying to find a balance between speed and accuracy in order to correctly process the most mail per unit of time.
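A tiny sketch of what that contextual step buys you: snap whatever the engine produced onto the nearest entry in the address database. The three-street "database" and the garbled input are obviously just placeholders:

    import difflib

    KNOWN_STREETS = ["MAIN ST", "MAPLE AVE", "MARKET ST"]   # stand-in for the real DB

    def snap_to_database(ocr_text):
        # fuzzy-match the raw OCR output against known addresses
        matches = difflib.get_close_matches(ocr_text.upper(), KNOWN_STREETS,
                                            n=1, cutoff=0.6)
        return matches[0] if matches else ocr_text          # fall back to raw OCR

    print(snap_to_database("MA1N 5T"))   # prints "MAIN ST" despite two misreads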
I can't help but wonder if some of the DPI performance comes down to size constraints in the code. From the review, it appears that few (if any) of the OCR engines know what the DPI is, so a higher DPI would make the characters seem larger. Incidentally, many of the cameras used in postal applications spit out images at 212 DPI.
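If anyone wants to experiment with that, here's roughly what normalizing for DPI could look like, using Python and the Pillow imaging library; the 300-dpi target, the missing-DPI fallback, and the file naming are all just examples:

    from PIL import Image

    def normalize_dpi(path, target_dpi=300):
        img = Image.open(path)
        # many scans don't record their DPI at all; assume the target if missing
        src_dpi = float(img.info.get("dpi", (target_dpi, target_dpi))[0])
        scale = target_dpi / src_dpi
        new_size = (int(img.width * scale), int(img.height * scale))
        resized = img.resize(new_size)
        resized.save(path.replace(".tif", "_norm.tif"), dpi=(target_dpi, target_dpi))

    normalize_dpi("camera_page.tif")     # e.g. a 212-dpi postal camera image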