
OCR Follow-Up: OneNote scores in overtime

LingQ offered its annual sale and I signed up. I was excited to make lessons for myself using the OCR capabilities of Google Drive. So, I took some photos to "scan" the first few pages of my book, and Google gives me the following...

Google Drive
I suspected the error was due to including too many pages, too many pages with pictures, or photos that were too imperfect. So, I piled the PDF into OneNote, where I need to copy the text from each page individually. It can't get accurate text from the image pages, but the text pages are actually fine...


OneNote


Argh.

Still, the OneNote conversion means I have to copy and paste every page in addition to fixing the linebreaks.
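Fixing the linebreaks is the sort of chore a small script can take over. Here is a minimal Python sketch, not something I ran as part of this test. The function name `unwrap_lines` is made up for illustration, and it assumes that blank lines separate paragraphs and that a hyphen at the end of a line marks a word split across two lines (which will wrongly glue real hyphenated words like "well-known" if they happen to break there).

```python
import re


def unwrap_lines(text: str) -> str:
    """Join hard-wrapped lines into paragraphs.

    Assumptions (not guaranteed for every OCR tool):
    - blank lines separate paragraphs
    - a trailing hyphen means the word continues on the next line
    """
    # Normalize Windows line endings so the splitting below is consistent.
    text = text.replace("\r\n", "\n")

    fixed = []
    # Split into paragraphs on one or more blank (or whitespace-only) lines.
    for para in re.split(r"\n\s*\n", text):
        lines = [line.strip() for line in para.split("\n") if line.strip()]
        if not lines:
            continue
        joined = lines[0]
        for line in lines[1:]:
            if joined.endswith("-"):
                # Re-join a word that was split across two lines.
                joined = joined[:-1] + line
            else:
                joined += " " + line
        fixed.append(joined)

    return "\n\n".join(fixed)


if __name__ == "__main__":
    pasted = "This is a para-\ngraph that was\nwrapped.\n\nSecond one."
    print(unwrap_lines(pasted))
```

Paste the text copied from each page into a string (or read it from a file) and pass it through the function before making the lesson.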

So I downloaded trial versions of two paid commercial programs, FoxIt PhantomPDF and Abbyy FineReader.

Here's what Abbyy ($199) did with the PDF from GoogleScan:


Abbyy OCR editor interface
Abbyy FineReader verification tool


The Abbyy program is really nice and has lots of output formats. I just don't understand what happened in the lower half of the page, where it changed the fonts to superscripts and turned "set" into "^^ ".

At least the software makes it much easier to fix the errors. You can either bulk edit or use the verification tool to cycle through all the uncertain text. When you put the cursor in the blue highlighted text in the bulk editor, it allows formatting changes and shows which characters in the PDF the text corresponds to.

The plain text doesn't have weird spacing, which is nice.

Abbyy txt file output


Here's what FoxIt PhantomPDF ($129) did:


 It actually had me go through and check the uncertain text.

FoxIt Phantom PDF OCR Process



It's more obvious how it works than Abbyy's tool, but it's a little more restrictive. I'm not sure how to get back to the tool or go back if I made a mistake.

Here's how the plain text came out:

Phantom PDF txt file output
Then I used PhantomPDF to extract just one page from the original PDF. It was able to convert the page, but it missed phrases here and there.

I learned a couple things:

  • Take flatter scans: the bending at the edges of my pages and the perspective of my camera messed up the OCR
  • Each engine messes up different things
  • OneNote is now tied with GoogleDrive, in my opinion, for free tool capability, since GoogleDrive also needs to be done almost page by page
  • GoogleDrive conversion to GoogleDoc text works better with fewer pages at a time (not sure what the limit is; I will guess five, since the picture book converted with four pages)
  • Paid OCR programs are worth it for the editing features. 
  • Abbyy FineReader 14 has the nicest editing 
  • Abbyy FineReader 14 has the nicest plain txt output
