Linked by Thom Holwerda on Wed 23rd May 2007 23:45 UTC, submitted by Austin
Linux "The one 'hole' in my workflow has been OCR. For years, people have been able to scan a document and have it converted into real text. One of my old printers even came with OCR software included - for Windows of course. But when I've really needed OCR, I've just assumed that there were no high quality packages available for Linux. Recently I decided to find out for myself (a complete OCR virgin) what is available, how to use it, and what the results are like. I installed every free OCR package I could find, and systematically tested them. They all work very differently, so I tried to design a simple test for my specific needs."
Nice review
by sb56637 on Thu 24th May 2007 00:23 UTC
sb56637
Member since:
2006-05-11

Excellent review, very thorough and clear.

99% accuracy is still pretty bad, imagine 10 errors when you scan a 1000 word document. And 99% accuracy was only obtained by programs that are highly experimental and unavailable to the average GNU user.

Too bad, the state of open source OCR is really immature compared to what is available for Windows. I wish HP would supply good OCR software for Linux with their scanners. I used to use OCR on Windows about 10 years ago, and it was awful. Then I recently tried OCR again with the Windows software (forgot the name) that came with my HP PSC 1300 printer/scanner combo, and it was eerily accurate; truly impressive. However, it required installing a disgustingly slow, bloated, poorly designed package of unrelated software, typical HP junk. And it's Windows only.

Edited 2007-05-24 00:27

Reply Score: 4

RE: Nice review
by collinm on Thu 24th May 2007 01:21 UTC in reply to "Nice review"
collinm Member since:
2005-07-15

omnipage don't have a 100% ccuracy so i think 99% is very good

Reply Score: 1

RE[2]: Nice review
by setuid_w00t on Thu 24th May 2007 03:03 UTC in reply to "RE: Nice review"
setuid_w00t Member since:
2005-10-22

99% is pretty good if you're the type of person who can't write one sentence without a typo.

Reply Score: 2

RE: Nice review
by Havin_it on Fri 25th May 2007 11:18 UTC in reply to "Nice review"
Havin_it Member since:
2006-03-10

99% accuracy is still pretty bad, imagine 10 errors when you scan a 1000 word document.


Actually that'd be 10 errors for a 1000-character document, which is quite a bit worse.

</pedant mode>
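
To put rough numbers on that, here is a minimal sketch. It assumes the 99% figure is per-character accuracy, that errors are independent of each other, and that an average word is about 5 letters plus a space; the word length and the independence are my assumptions, not figures from the review.

```python
def expected_char_errors(num_chars, char_accuracy):
    """Expected number of wrong characters, assuming independent errors."""
    return num_chars * (1.0 - char_accuracy)


def word_accuracy(char_accuracy, chars_per_word):
    """Chance that every character of a word is right (independence assumed)."""
    return char_accuracy ** chars_per_word


WORDS = 1000
CHARS_PER_WORD = 5   # assumed average word length, not from the review
CHAR_ACCURACY = 0.99

# Count one separating space per word as a character the engine must get right.
total_chars = WORDS * (CHARS_PER_WORD + 1)
char_errors = expected_char_errors(total_chars, CHAR_ACCURACY)
bad_words = WORDS * (1.0 - word_accuracy(CHAR_ACCURACY, CHARS_PER_WORD))

print(f"about {char_errors:.0f} wrong characters in {total_chars} characters")
print(f"about {bad_words:.0f} of {WORDS} words contain at least one error")
```

Under those assumptions a 1000-word page comes out at roughly 60 wrong characters and roughly 49 damaged words, not 10.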

Reply Score: 3

It's interesting...
by jayson.knight on Thu 24th May 2007 03:21 UTC
jayson.knight
Member since:
2005-07-06

That while we're smashing through processor speed barriers, and now you can get a fairly powerful machine for under 500 bucks, and we have refrigerators that will order groceries for us when we're running low on staples, the two "holy grails" of mainstream computing haven't made nearly the same amount of progress: speech recognition and OCR have seen very small gains over the past decade compared to the rest of the computing realm.

It's a pretty good indication of just how complex these types of technologies really are...having computers understand human types of communication is extremely difficult. This isn't a Linux vs Windows vs whatever deal, it's a man vs machine plight.

I look forward to the day when I can talk to my machine, and it'll produce compilable code. OK, maybe not really, but it's a nice thought nonetheless ;-).

Reply Score: 5

RE: It's interesting...
by dreamlax on Thu 24th May 2007 04:50 UTC in reply to "It's interesting..."
dreamlax Member since:
2007-01-04

A fridge that automatically orders items combines two known working products: a computer and a fridge. Speech recognition is a much harder thing to combine because Dear Aunt, let's set so double the killer delete select all.

Reply Score: 4

My experience:
by rycamor on Thu 24th May 2007 06:04 UTC
rycamor
Member since:
2005-07-18

I have worked in *nix-based image processing for almost 4 years, and have repeatedly beat my head against the OCR barrier. In the end, we had to go with OCRXTR, by Vividata.

I have tried GOCR, ClaraOCR, OCRad, and Tesseract. All my results were just not even close to realistic for production work.

For some reason, although I could compile Tesseract on FreeBSD just fine after a little coaxing, the output was absolute garbage; mostly it looked like line noise, or Perl poetry. Not anywhere close to accurate. Maybe there are some Linux-specific assumptions in Tesseract?

GOCR produces half-decent results if all you are looking for is English ASCII output from fairly simple documents. But even then, its performance is really bad in comparison to OCRXTR: on the order of 1/10 the speed at best.

ClaraOCR is interesting, and we even considered contracting the author to extend it to allow for server-side work, but it was just not a reasonable time-investment.

Consider: Vividata OCRXTR can process 300-dpi black-and-white pages at a rate on the order of two pages per second or better, and has a fairly linear performance curve, even with large multipage TIFF files. Also, it works for color images, automatically handles inverted pages, preserves images in pages, and produces nicely formatted PDF output. All in all, what I would consider minimum requirements for production. There just is no comparison at this point. Unfortunately, since it is Linux only, I have to run Linux compatibility mode in FreeBSD in order to use it.

Reply Score: 2

RE: My experience:
by jolly_rancher on Thu 24th May 2007 17:49 UTC in reply to "My experience:"
jolly_rancher Member since:
2006-10-04

I had the same experience with Tesseract. We purchased OCRXTR. If anyone is interested, the engine is made by Scansoft, so it is very accurate. I've compared it to scans by document vendors and it is right up there with them.

Reply Score: 1

OCR is indeed a problem
by Haicube on Thu 24th May 2007 06:09 UTC
Haicube
Member since:
2005-08-06

But like someone said earlier, it's more of a Man vs Machine fight than a Win vs *Nix fight.

I've been looking for proper software for converting images to text (bulk) and there simply isn't anything free which is relevant.

Now here's how this should be done, for those interested.

First of all, neural networks and similar techniques can "teach" your box to handle handwritten stuff etc. So anyone building OCR software has to have real expertise in this area. Secondly, a "training" part is obviously needed for this kind of software in order for it to get better.

There used to be a proprietary software package called Eyes on Hands which had some really good features (cost: $10,000+).

For instance, when scanning plenty of docs with numbers, you set up scan fields and say "here we'll have a number as input". Then it matches numbers or characters against what it thinks they are and lists all of them in a logical order in columns, saying something like "We believe these are 9's, in descending order of likelihood". Then you just look at its interpretation and can easily correct what it got wrong. By using AI/neural nets in the software, it can actually then get better at interpreting.

However, as with any software... my experience is that any "niche software" which isn't of great value to sysadmins has very few OSS counterparts, I'm afraid. Which is simply just sad =(.
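
A rough sketch of that review screen, as I understand the idea: group the engine's guesses by the character it picked, then list each group with the most confident guesses first so the doubtful ones stand out at the bottom. The field ids, labels, and confidence values are made up for illustration and are not taken from any real product.

```python
from collections import defaultdict


def review_columns(guesses):
    """Group (field_id, label, confidence) guesses into one column per label.

    Each column lists field ids in descending order of confidence, so an
    operator can skim the top and concentrate on the doubtful tail.
    """
    columns = defaultdict(list)
    for field_id, label, confidence in guesses:
        columns[label].append((confidence, field_id))
    return {
        label: [field_id for _, field_id in sorted(items, reverse=True)]
        for label, items in columns.items()
    }


# Hypothetical engine output: (scan field, guessed character, confidence).
guesses = [
    ("f1", "9", 0.98),
    ("f2", "4", 0.91),
    ("f3", "9", 0.55),
    ("f4", "9", 0.80),
]

columns = review_columns(guesses)
for label, field_ids in sorted(columns.items()):
    print(f"We believe these are {label}'s:", ", ".join(field_ids))
```

Whatever the operator corrects in the tail of a column is exactly the labelled data a trainable engine could learn from.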

Reply Score: 3

OCR is not a problem
by Ookaze on Thu 24th May 2007 10:27 UTC
Ookaze
Member since:
2005-11-14

People here or elsewhere that have a problem with FOSS OCR apps actually think FOSS is free labor, free slaves working for them. That's not the case at all.
If people really want the field to improve, they can make FOSS programs so that everyone can improve them. The fact that few people improved gocr or ocrad shows that it's of no great interest.
As for admins and regular users, gocr and ocrad are more than sufficient. gocr + ocrad are sufficient, with FuzzyOCR, to help my SpamAssassin recognize image spams, and it's sufficient for the very rare use of OCR at home.
Besides, IIRC the kde app knows about ocrad too.
So I think they're good enough for regular users, and admins.
For more specific work, of course, as always with FOSS, these apps won't improve by magic alone.

Reply Score: 3

RE: OCR is not a problem
by rycamor on Thu 24th May 2007 13:49 UTC in reply to "OCR is not a problem"
rycamor Member since:
2005-07-18

People need to stop accusing others of wanting free labor in such discussions. Giving a factual account of the pros and cons of a piece of F/OSS software is not in itself deserving of such an insult. Now, when users of such software verbally assault the author, or demand support, then of course that crosses the line. You might notice in my case that our company was fully willing to pay serious money for one that *would* do the job. If we had found one F/OSS tool that was close to doing the job, we would have gladly paid for extra work. My main desire was to have something that I could compile and run natively in FreeBSD.

Yes, GOCR and friends will do some decent work for casual use or certain server-side tasks, but when I talk about production, I'm talking about systems that can process 10,000 scanned pages a day without choking, and produce formatted document output, preferably PDF.

To any here who are working on F/OSS OCR: there is an absolute *gold mine* waiting for you if you succeed in taking these systems to the next level. The FOSS world has the best webservers, application frameworks, operating systems and programming languages, but is lagging behind in many more specific areas like OCR.

Although, I suspect that anyone who produces such a beast will be greeted by a barrage of patent-violation cases. Not that I like the idea, but I think that's most likely the outcome.

Reply Score: 3

RE[2]: OCR is not a problem
by r_a_trip on Thu 24th May 2007 16:45 UTC in reply to "RE: OCR is not a problem"
r_a_trip Member since:
2005-07-06

You might notice in my case that our company was fully willing to pay serious money for one that *would* do the job.

So far so good.

If we had found one F/OSS tool that was close to doing the job, we would have gladly paid for extra work.

But none were close to doing the job, because no one paid for the work to get it to do the specialist job. Your company, like many others, just opted to sink money into a proprietary package and leave the freedom alternative languishing.

If a few companies, with interests in professional OCR, would form a consortium to bring a F/OSS alternative up to snuff, it would be problem solved. For the long run too.

Nope. Not going to happen. Invest in finished products. If the product isn't finished, leave it by the wayside. In such a climate, F/OSS can't reach maturity.

Reply Score: 5

RE[2]: OCR is not a problem
by Temcat on Fri 25th May 2007 09:38 UTC in reply to "RE: OCR is not a problem"
Temcat Member since:
2005-10-18

If we had found one F/OSS tool that was close to doing the job, we would have gladly paid for extra work.

What do you think about this?

http://www.abbyy.com/sdk/?param=59956

ABBYY are the makers of FineReader, so this is serious stuff. Yes, this is only a backend (SDK), but writing a GUI frontend for a full-featured backend is easier than recreating essential backend features requiring tremendous expertise.

Reply Score: 2

RE[3]: OCR is not a problem
by Havin_it on Fri 25th May 2007 11:46 UTC in reply to "RE[2]: OCR is not a problem"
Havin_it Member since:
2006-03-10

I don't see anything on that page to suggest that the FineReader Engine is available under a F/OSS license. I think if you develop anything that uses it, you're gonna be paying them for the privilege. In the OP's case, that means paying a licensing fee *and* hiring a coder to put a nice GUI on it.

Reply Score: 1

RE[4]: OCR is not a problem
by Temcat on Sat 26th May 2007 11:24 UTC in reply to "RE[3]: OCR is not a problem"
Temcat Member since:
2005-10-18

Yeah, it's not FOSS, but you have the API, so you don't need the source. And it may turn out cheaper in the end. But if a FOSS license, not *nix compatibility, were an absolute requirement (why?), then of course FineReader Engine would not be suitable.

Reply Score: 1

plenty of linux software in real use each day
by jefro on Thu 24th May 2007 20:52 UTC
jefro
Member since:
2007-04-13

To be correct, OCR is, and was, developed using a few special OCR typefaces. The engine used a bitmap overlay on each character; since the spacing was not kerned, it was almost easy, but very time-intensive.

Today billions of packages and addresses are read by delivery and postal machines.

They basically find the address region, try to break it up into lines of print, then try to split the lines into single characters. They then use a confidence level from 3 to 6 character engines. Each engine uses its own algorithm. One is a software copy of an original single-purpose hardwired computer designed almost 30 years ago. Oddly, the modern engines have not been able to fully beat the old system. Most of the newer designs use a mathematical model based on either pixels or shapes/curves to determine the character.
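
How the engines' confidence levels get combined isn't spelled out above, so the following is only one plausible scheme: a confidence-weighted vote, with the character rejected (left for a human or a later stage) when no candidate wins a large enough share. The function name, the threshold, and the sample readings are all hypothetical.

```python
from collections import defaultdict


def vote(engine_outputs, min_share=0.5):
    """Pick one character from several engines' (char, confidence) guesses.

    Confidences are summed per candidate; the winner is accepted only if it
    holds at least `min_share` of the total confidence, otherwise None.
    """
    totals = defaultdict(float)
    for char, confidence in engine_outputs:
        totals[char] += confidence

    grand_total = sum(totals.values())
    if grand_total == 0:
        return None  # no engines, or none of them is confident at all

    best = max(totals, key=totals.get)
    return best if totals[best] / grand_total >= min_share else None


# Three hypothetical engines looking at the same character image.
clear_case = vote([("8", 0.9), ("B", 0.6), ("8", 0.7)])
unclear_case = vote([("1", 0.5), ("l", 0.5), ("I", 0.5)])

print(clear_case)    # two engines agree, so their choice wins
print(unclear_case)  # three-way split, so the character is rejected
```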

Reply Score: 1

Octopus
by tomcat on Fri 25th May 2007 00:14 UTC
tomcat
Member since:
2006-01-06

99% accuracy isn't all that bad for a typical document. Is it good enough for the U.S. Postal Service? Probably not. But it would do the trick for the vast majority of applications.

Reply Score: 2

RE: Octopus
by shadow303 on Fri 25th May 2007 15:44 UTC in reply to "Octopus"
shadow303 Member since:
2005-06-29

99% accuracy would be fine for the postal service. Postal OCR systems have an advantage in that there is a lot that can be done with contextual analysis (addresses are generally in a known format and there are databases which contain all of the addresses). Postal systems are always trying to find a balance between speed and accuracy in order to correctly process the most mail per unit of time.
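
As a toy illustration of that kind of contextual analysis, the sketch below snaps a noisy OCR reading to the closest entry in a list of known street names, using `difflib.get_close_matches` from the Python standard library. The street list and the 0.8 cutoff are invented for the example; a real postal system works against a full address database and is far more elaborate.

```python
import difflib

# Hypothetical stand-in for an address database.
KNOWN_STREETS = ["MAIN ST", "MAPLE AVE", "MARKET ST"]


def correct_street(ocr_text, known=KNOWN_STREETS, cutoff=0.8):
    """Return the closest known street name, or None if nothing is close."""
    matches = difflib.get_close_matches(ocr_text.upper(), known, n=1, cutoff=cutoff)
    return matches[0] if matches else None


fixed = correct_street("MA1N ST")    # digit 1 misread in place of the letter I
rejected = correct_street("XQZ 99")  # nothing in the list is close enough

print(fixed)
print(rejected)
```

The cutoff is the speed/accuracy knob in miniature: lower it and more mail gets an automatic answer, at the cost of more wrong corrections.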

I can't help but wonder if some of the DPI performance is based on some size constraints in the code. From the review, it appears that few (if any) of the OCR engines know what the DPI is, so a higher DPI would make the characters seem larger. Incidentally, many of the cameras used in postal applications spit out images at 212 DPI.
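
To show how much the apparent character size moves with resolution, here is a back-of-the-envelope sketch. It treats the nominal point size as the glyph height (1 point is 1/72 inch), which is only a rough approximation, and the 12-point body text is my assumption rather than something stated in the review.

```python
POINTS_PER_INCH = 72.0


def glyph_height_px(point_size, dpi):
    """Approximate height in pixels of text of a given point size at a given DPI."""
    return point_size / POINTS_PER_INCH * dpi


# Assumed 12-point body text at a few scan resolutions.
for dpi in (212, 300, 600):
    print(f"{dpi} DPI: about {glyph_height_px(12, dpi):.0f} px tall")
```

An engine tuned with fixed pixel-size limits could easily do well at one of those heights and fall apart at another.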

Edited 2007-05-25 15:45

Reply Score: 1