
Comparison of OCR Options to Help Learn a Language

Learning a foreign language (especially an uncommon language) can be pretty expensive, frustrating, or boring. The price increases as the language becomes less common because there are fewer people paying for the content, meaning the creator needs to charge more to get by. I'm learning Finnish, so there's not nearly as much content as there is for English, Spanish or Mandarin.

Learners want a mix of vocabulary lists, graded readers, video, audio, writing prompts and in-person conversations. Finding someone to converse with is harder with Finnish because there are only about five and a half million speakers in the entire world. Spanish has about 480 million. I found three conversational partners on websites like Italki for about $14-40 an hour. It was OK, but it was hard to schedule because of the time-zone difference, and it required a lot of preparation to plan what we would talk about. Luckily, I also have access to native speakers locally because they are my family. But I would rather have more complex conversations. I have learned the words to play card games with my partner; that vocabulary is fairly constant, and it's a nice way to practice listening and speaking.

But I need a way to build up my vocabulary without requiring someone to talk to me in baby words, especially since most native speakers find it very difficult to adjust their word choice to use simpler vocabulary.

My preferred way to learn vocabulary at my level is to read. 

Graded readers are made especially for language learners and tell stories using basic or very common words. The grading works by making sure each level introduces only a limited set of new words. You want to know about 90% of the vocabulary already. Common languages have many of these readers available for foreigners to study with, and experts spend a good deal of time creating this content.
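As a rough way to check the 90% rule of thumb against a text, here is a minimal Python sketch. The function names, the sample sentence and the tiny word list are all made up for illustration. It matches exact word forms only, so an inflected Finnish word (kädessä, from käsi) counts as unknown unless that exact form is in the list, which means the number will probably come out lower than what you can really read.

```python
import re

def tokenize(text):
    """Lowercase the text and pull out runs of letters (including ä, ö, å)."""
    return re.findall(r"[^\W\d_]+", text.lower())

def known_word_coverage(text, known_words):
    """Return the fraction of word tokens in `text` found in `known_words`."""
    words = tokenize(text)
    if not words:
        return 0.0
    known = {w.lower() for w in known_words}
    return sum(1 for w in words if w in known) / len(words)

# Hypothetical sample text and personal word list.
sample = "Minulla on kynä. Minulla on kynä kädessä."
my_words = {"minulla", "on", "kynä"}

coverage = known_word_coverage(sample, my_words)
print(f"{coverage:.0%} of the words are known")
```

If the result is well under 90%, the text is probably above my level for comfortable reading.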

I used one set of graded readers in English to learn to read as a child (McGuffey Readers, created in the 1800s to teach children to read in the U.S.). The first level had sentences like the following: "A pen." "I have a pen." "I have a pen in my hand." I'm sure that Finnish has graded readers for native speakers learning to read, but they would be very hard to get hold of in the U.S.

I've found a few sources of Finnish reading content:

  • YLE (basically the Finnish version of BBC, NPR, Deutsche Welle...) has "easy news" written simply for learners
  • Children's bedtime books online (though I found that magical capes and fairies aren't the most useful vocabulary, so pick stories wisely)
  • Children's books I bought in Finland (Goodnight Stories for Rebel Girls!)
  • Easy readers (which I haven't tried yet) from Artemira publishing
  • Selkokirjat (plain books) by Hanna Männikkölahti

But there are so many words that I don't know, it takes me an hour to read three sentences. So, I prefer to use digital services where I can.

Combining online content with the Google Translate Chrome extension allows much faster reading because it quickly translates new words where and when I need them. It gives me a chance to guess before I look up the word, but doesn't require switching between multiple books or devices. It can create an experience very similar to using Lingq, a service that costs $12 a month.


Highlight a word in a webpage and the translate icon pops up; click on the icon and the translation appears. Doing one word at a time is actually better than a whole sentence, because Translate thought "selkouutisia" meant "back exercises" in the sentence...

tee hee
Lingq works very similarly, with more features. It lets readers import text and then highlights words that aren't in the user's database. You click to add words to the list of known words or to the list of unknown words. Translations are provided for the unknown words to help you read and learn. It makes the whole process much faster since each sentence has new words. I could import any digital text into the Lingq service and get going.

To simulate Lingq for free, I just need to turn any text into an HTML document and open it in Chrome with the extension installed (as shown in the photo above). But there's the problem: I have to digitize the text. Two of my sources, and arguably the most interesting ones, are not digital.

Here, I've compared two popular ways of digitizing text, OneNote with Lens and Google Drive.

I decided to stress test the two with a difficult document. I assume that whichever performs better on the more difficult document will perform at least as well on an easier one (not always true, but a good assumption). By better, I mean that more of the words in an image or scan will be recognized and turned into text correctly.
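I judged "better" by eye, but the same idea can be put into numbers. Here is a minimal Python sketch, assuming you type out a short passage from the page by hand as a reference and compare each tool's output against it. The function names and the three strings are made up for illustration and are not my actual results.

```python
import re
from collections import Counter

def words(text):
    """Lowercase the text and pull out runs of letters (including ä, ö, å)."""
    return re.findall(r"[^\W\d_]+", text.lower())

def word_recall(reference, ocr_output):
    """Fraction of reference words that also appear, spelled the same, in the OCR output.

    Counts are compared as multisets, so a word that occurs twice on the
    page has to be recognized twice to get full credit. Word order is ignored.
    """
    ref = Counter(words(reference))
    out = Counter(words(ocr_output))
    total = sum(ref.values())
    if total == 0:
        return 0.0
    matched = sum((ref & out).values())
    return matched / total

# Hypothetical strings, not real output from either tool.
typed_by_hand = "Hyvää yötä, pieni kettu"
tool_a = "Hyvää yötä pieni kettu"
tool_b = "Hyvaa yota pieni"

print("tool A:", word_recall(typed_by_hand, tool_a))
print("tool B:", word_recall(typed_by_hand, tool_b))
```

Note that a dropped ä or ö counts as a miss here, which is what I want, since the misspelled word won't translate properly later.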

Google Drive won by a landslide.

First, neither of these tools can do the text conversion on Android. I wanted to do all of the work from my Samsung Note 12 tablet, but the converting options just aren't there. I can't get "Copy Text From Picture" in the OneNote app for Android, and I can't get the "Open with Google Docs" option in Drive.

Here's what I did:
  1. I snapped a photo or scan of the book page twice on my Android tablet, once with the Microsoft Lens app and once with the Google Drive app.
  2. Google Drive converted the image to a PDF and saved it to Google Drive. Microsoft Lens has options to convert to a Word file and to OneNote.
  3. Synced.
  4. Switched to my laptop with synced OneNote and Drive open.
  5. In Google Drive, I found the PDF and right-clicked to open it as a Google Doc. In OneNote, I right-clicked on the image and selected "Copy Text from Picture", then pasted the text into OneNote. In OneDrive, I opened the Word document saved in Documents > Office Lens.

OneNote caught very few words and didn't transcribe well.

As you can see in the image, OneNote wasn't able to find much of the text on the page, and what text it did find wasn't transcribed accurately. Of course, the image is very difficult, with the background lines from the wallpaper, the graphics, and the unclear columns and reading order.

The conversion to a Word document just didn't produce any text at all.

That's why I was so impressed by what Google Drive returned. It also seems like Google Drive attempts to maintain the positioning of the text. It didn't do that well enough, so the line breaks and spacing just end up being a bit annoying.
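The stray line breaks and spacing can be cleaned up after copying the text out of the Google Doc. Here is a minimal Python sketch, assuming blank lines mark real paragraph breaks and every other line break is just left over from the page layout. The function name and sample text are made up, and it does not try to rejoin words that were hyphenated across lines.

```python
import re

def reflow(ocr_text):
    """Join the lines inside each paragraph and collapse runs of spaces.

    Paragraphs are assumed to be separated by one or more blank lines.
    """
    paragraphs = re.split(r"\n\s*\n", ocr_text.strip())
    cleaned = [" ".join(p.split()) for p in paragraphs]
    return "\n\n".join(c for c in cleaned if c)

# Hypothetical OCR output with layout line breaks and extra spaces.
raw = "Olipa kerran\n   pieni   tyttö.\n\n\nHän asui\nmetsässä."
print(reflow(raw))
```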

Google even caught the text in the cartoon; I added the cartoon portion to the document.

OneNote is still good enough for most books, though. I copied a page from the Finnish edition of Goodnight Stories for Rebel Girls and got the following...

OneNote still works very well for standard documents.
OneNote might even be preferable for the way it organizes and stores the text. It's just not as amazing at picture books.

OneNote isn't out of the running for OCR duties, but I'll be using Google Drive to create my digital text for reading in Chrome with the extension or for importing into a dedicated service like Lingq.

All that's left is to make an HTML file of the text for opening in Chrome.
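Here is one way that last step could look, as a minimal Python sketch rather than the one right way to do it. It assumes the OCR text has been pasted into a string with blank lines between paragraphs. The function names, the default title and the output file name are all made up for illustration.

```python
import html
from pathlib import Path

def text_to_html(text, title="Reading practice", lang="fi"):
    """Wrap plain text in a bare-bones HTML page, one <p> per paragraph."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    # html.escape keeps characters like < and & from being read as markup.
    body = "\n".join(f"<p>{html.escape(p)}</p>" for p in paragraphs)
    return (
        "<!DOCTYPE html>\n"
        f'<html lang="{lang}">\n'
        "<head>\n"
        '<meta charset="utf-8">\n'
        f"<title>{html.escape(title)}</title>\n"
        "</head>\n"
        f"<body>\n{body}\n</body>\n"
        "</html>\n"
    )

def save_page(path, text, title="Reading practice"):
    """Write the page as UTF-8 so ä and ö survive, and return the path."""
    out = Path(path)
    out.write_text(text_to_html(text, title), encoding="utf-8")
    return out

# Hypothetical text standing in for the OCR result.
page = text_to_html("Olipa kerran pieni tyttö.\n\nHän asui metsässä.")
print(page)
```

Calling save_page("reading.html", my_text) writes the file, which can then be opened in Chrome from the file manager or by dragging it into a browser window.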
