LingQ offered its annual sale and I signed up. I was excited to make lessons for myself using the OCR capabilities of Google Drive. So, I took some photos to "scan" the first few pages of my book, and Google gave me the following...
[Image: Google Drive]
I suspected the error was due to me including too many pages, too many pages with pictures, or photos that were too imperfect. So, I piled the PDF into OneNote, where I needed to copy the text from each page individually. It couldn't get accurate text from the image pages, but the text pages were actually fine...
[Image: OneNote]
Argh.
Still, the OneNote conversion means I have to copy and paste every page in addition to fixing the line breaks.
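The line-break cleanup, at least, can be scripted. Here is a minimal Python sketch, assuming the pasted text uses a blank line between paragraphs and a single line break wherever the page wrapped. The function name `unwrap_lines` is made up for illustration and is not part of OneNote or any OCR tool.

```python
import re


def unwrap_lines(text: str) -> str:
    """Join hard-wrapped lines into paragraphs (hypothetical helper).

    Assumes a blank line marks a paragraph break and a single
    line break is just where the printed page wrapped.
    """
    # Normalize Windows and old Mac line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Rejoin words split by a hyphen at the end of a line.
    # Caveat: this also glues real compounds ("well-\nknown" -> "wellknown").
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Split on blank lines, then collapse all whitespace inside each paragraph.
    paragraphs = re.split(r"\n\s*\n", text)
    cleaned = [" ".join(p.split()) for p in paragraphs]
    return "\n\n".join(p for p in cleaned if p)


if __name__ == "__main__":
    sample = "This is a sen-\ntence that was\nwrapped.\n\nNext para-\ngraph here."
    print(unwrap_lines(sample))
```

It won't fix OCR mistakes, only the wrapping, and the hyphen rule is a guess that will sometimes be wrong, so the result still needs a read-through.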
So I downloaded trial versions of two paid commercial programs, FoxIt PhantomPDF and Abbyy.
Here's what Abbyy ($199) did with the PDF from GoogleScan:
[Image: Abbyy OCR editor interface]
[Image: Abbyy FineReader verification tool]
The Abbyy program is really nice and has lots of output formats. I just don't understand what happened in the lower half of the page to change the fonts to superscripts and "set" to "^^ ".
At least the software makes it much easier to fix the errors. You can either bulk edit or use the verification tool to cycle through all the uncertain text. When you put the cursor in the blue highlighted text in the bulk editor, it allows formatting changes and shows which characters in the PDF the text corresponds to.
The plain text doesn't have weird spacing, which is nice.
[Image: Abbyy txt file output]
Here's what FoxIt PhantomPDF ($129) did:
It actually had me go through and check the uncertain text.
[Image: FoxIt Phantom PDF OCR Process]
It's more obvious how it works than Abbyy's tool, but it's a little more restrictive. I'm not sure how to get back to the tool or go back if I made a mistake.
Here's how the plain text came out:
[Image: Phantom PDF txt file output]
Then I used Phantom PDF to extract just one page from the original PDF. It was able to convert that page, but it missed phrases here and there.
I learned a couple things:
- Take flatter scans: the bending at the edges of my pages and the perspective of my camera messed up the OCR
- Each engine messes up different things
- OneNote is now tied with Google Drive, in my opinion, for free tool capability, since it also needs to be done almost page by page
- Google Drive's conversion to Google Doc text works better with fewer pages at a time (I'm not sure what the limit is; I'd guess five, since the picture book converted with four pages)
- Paid OCR programs are worth it for the editing features.
- Abbyy FineReader 14 has the nicest editing
- Abbyy FineReader 14 has the nicest plain txt output
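Since the free tools seem happier with a few pages at a time, one way to keep uploads small is to group the page photos into batches before converting them. This is only a sketch: the batch size of four is the one count that worked for me, not a documented Google Drive limit, and `batch_pages` and the file names are made up for illustration.

```python
from typing import List, Sequence


def batch_pages(pages: Sequence[str], batch_size: int = 4) -> List[List[str]]:
    """Split a list of page files into batches of at most batch_size.

    The default of 4 is an assumption based on one successful
    conversion, not a documented limit.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [list(pages[i:i + batch_size]) for i in range(0, len(pages), batch_size)]


if __name__ == "__main__":
    # Hypothetical file names for ten page photos.
    photos = [f"page{n}.jpg" for n in range(1, 11)]
    for number, batch in enumerate(batch_pages(photos), start=1):
        print(f"Upload {number}: {batch}")
```

Each batch would then go into its own PDF or upload, and the converted text gets pasted together afterward.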