An Awesome Journey of Learning OCR

Haroon Khan
5 min read · Jun 24, 2021

It is really amazing to learn something new, and sometimes it can be challenging too. I am sharing my learning experience from a project where I got the chance to be part of a tremendous team of data engineers: Story Squad.

Introduction

Story Squad is the first EdTech app trying to get kids off devices and back into “imagination mode”, armed with only loose-leaf paper and a pencil in hand. It turns “reluctant readers” into authors and illustrators through a collaborative world-building game. The app is for 8-to-12-year-old kids, so they can read, write, draw, and compete against other students.

App Home Page

Our team’s job in this project was to focus on text transcription using Tesseract OCR, exploring possibilities and identifying value for the stakeholder. It was really exciting and also a bit challenging, as I did not have much experience working with OCR.

Optical character recognition (OCR) algorithms allow computers to analyze printed or handwritten documents automatically and convert the text into editable formats that computers can process efficiently. It is another way to extract and leverage business-critical data. Which, by the way, is really awesome.

How This All Works:

Even if you are not from the field of data science, at this point you should have an idea of what OCR is, so let me dive deeper into my experience. As mentioned above, our team was working specifically on Tesseract OCR. Our job was to build pipelines for preprocessing images (photos of kids' handwritten stories) that would later pass through Tesseract OCR to extract the text, and to come up with a metric to measure how accurately our OCR extracts that text.

A pipeline, also known as a data pipeline, is a set of data processing elements connected in series, where the output of one element is the input of the next one. The elements of a pipeline are often executed in parallel or in a time-sliced fashion.
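To make the idea of "output of one element is the input of the next" concrete, here is a minimal sketch in plain Python. The step functions (`strip_whitespace`, `collapse_spaces`, `lowercase`) and the `run_pipeline` helper are hypothetical names made up for illustration. They are not the Story Squad pipeline, which worked on images rather than strings.

```python
from functools import reduce
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def run_pipeline(data: T, steps: Iterable[Callable[[T], T]]) -> T:
    """Feed `data` through each step; each output becomes the next input."""
    return reduce(lambda value, step: step(value), steps, data)


# Hypothetical steps, used only to illustrate chaining.
def strip_whitespace(text: str) -> str:
    return text.strip()


def collapse_spaces(text: str) -> str:
    return " ".join(text.split())


def lowercase(text: str) -> str:
    return text.lower()


cleaned = run_pipeline(
    "  Hello   OCR  World ",
    [strip_whitespace, collapse_spaces, lowercase],
)
print(cleaned)  # hello ocr world
```

Keeping every step as a small function with the same input and output type makes it easy to reorder steps or drop one and compare the OCR results.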

Image preprocessing is the name for operations on images at the lowest level of abstraction. Its aim is to improve the image data by suppressing undesired distortions or enhancing image features that are important for further processing. It does not increase the image information content.
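One common preprocessing operation is binarization: turning a grayscale image into pure black and white so the ink stands out from the paper. The sketch below uses a plain list of lists as a stand-in for an image, and the fixed threshold of 128 is an assumption chosen for illustration. In a real project you would more likely reach for an image library such as OpenCV or Pillow. This is not necessarily the exact step our team used.

```python
from typing import List

Image = List[List[int]]  # grayscale pixels, 0 (black) to 255 (white)


def binarize(image: Image, threshold: int = 128) -> Image:
    """Map every pixel to pure black (0) or pure white (255)."""
    return [[255 if pixel >= threshold else 0 for pixel in row] for row in image]


# A tiny made-up "scan": dark ink strokes (low values) on greyish paper.
scan = [
    [200, 190, 60, 210],
    [195, 50, 40, 205],
    [210, 200, 70, 198],
]

clean = binarize(scan)
for row in clean:
    print(row)
```

Note that the result carries less information than the input (many grey levels collapse into two), which matches the point above: preprocessing does not add information, it only makes the useful part easier to read.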

First things first: for beginners and data science students who are using Windows machines, there is an introductory video by David Bombal on setting up WSL (Windows Subsystem for Linux) locally on your machine. It will make your life easier.

Working on images with printed or typed text was not that challenging. In fact, the results were good even with out-of-the-box OCR. Doing the same for handwritten images was the actual challenge we faced.

For example, we took an image with some handwriting on it and used three different metrics, CER (Character Error Rate), WER (Word Error Rate), and Fuzzy Ratio, to measure how accurately our OCR extracted the text without any preprocessing. The CER and WER were both 100% and the fuzzy ratio was zero, which means none of the text was identified by our OCR.

    # Assumed setup, not shown in the original: pytesseract is imported, fuzz comes from
    # a fuzzy-matching library such as fuzzywuzzy, and cer/wer return percentages.
    ground_truth = "Delicious substitute for tuna in your favorite recipes. Perfect for creative hors d'oeuvres and snacks. Great as a salad topper"

    # pytesseract, no preprocessing
    hypothesis0 = pytesseract.image_to_string(sample_img)
    print("CER, WER, Rate :", cer(ground_truth, hypothesis0), wer(ground_truth, hypothesis0), fuzz.ratio(ground_truth, hypothesis0))
    print(hypothesis0)

    CER, WER, Rate : 100.0 100.0 0

To overcome this problem, we applied some preprocessing and deskewing to the same image and tried again, which gave us better results.

    # pytesseract out of the box, with preprocessing and deskewing
    hypothesis3 = pytesseract.image_to_string("output_processed.png")
    print("CER, WER, Rate :", cer(ground_truth, hypothesis3), wer(ground_truth, hypothesis3), fuzz.ratio(ground_truth, hypothesis3))
    print(hypothesis3)

    CER, WER, Rate : 19.047619047619047 66.66666666666667 74

The text it recognized was "Delicious substi tote for tune. In your Favorite pec pes. eptect For Creative hors diocuyres and Cote CAS: Great as & Sa. lod topper"

The Word Error Rate (WER) and Character Error Rate (CER) indicate how much of the handwritten text the applied HTR (handwritten text recognition) model did not read correctly. A CER of 10% means that every tenth character (and these are not only letters but also punctuation, spaces, etc.) was not correctly identified.
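For readers curious how such numbers can be computed, here is a minimal sketch using only the Python standard library. It assumes the usual definitions: the edit (Levenshtein) distance between reference and OCR output, divided by the reference length, counted in characters for CER and in words for WER. The `similarity` function is a rough stand-in for a fuzzy ratio built on `difflib`. These are illustrative implementations, not the exact library functions used in the snippets above, so the values may differ slightly.

```python
from difflib import SequenceMatcher
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: fewest insertions, deletions, substitutions."""
    previous = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        current = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(
                previous[j] + 1,         # delete an item from the reference
                current[j - 1] + 1,      # insert an item from the hypothesis
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate in percent (reference must be non-empty)."""
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent (reference must contain a word)."""
    ref_words = reference.split()
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def similarity(reference: str, hypothesis: str) -> int:
    """Rough 0-100 similarity score based on difflib's matching ratio."""
    return round(100 * SequenceMatcher(None, reference, hypothesis).ratio())


truth = "Great as a salad topper"
ocr_output = "Great as & Sa. lod topper"
print("CER, WER, Rate :", cer(truth, ocr_output), wer(truth, ocr_output),
      similarity(truth, ocr_output))
```

With these definitions, an empty transcription scores a CER of 100, a WER of 100, and a similarity of 0, the same pattern as the first result shown earlier.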

The results were still not great at this point. We needed to train our own custom OCR model, which requires a lot of clean data, i.e. handwritten images, in order to improve our OCR.

Future of Story Squad

Currently, we have reached the point where we can measure the results and see that applying some preprocessing improves the output of our OCR. We are also working on getting more data to train the model on handwriting images. One of our data engineering teams is dedicated to collecting and cleaning that data so that we have good, trusted training data available.

As of now, the app is using Google Vision OCR, which is great for handwriting. Our goal is to replace Google Vision OCR with our custom-trained Tesseract OCR and get the same or better output. The main challenge is obtaining a huge number of handwritten image samples to train the OCR to the same level as Google Vision.

Why Am I So Excited?

Getting a chance to learn this technology and get hands-on experience is great, and it is a new tool in your pocket. Enterprises use OCR to increase operational efficiency and keep customers satisfied with the services they receive. The technology enables enterprises and organizations to improve customer experience by making unstructured content searchable. Reading text from images lets them extract value from content and use it to make better decisions. Text recognition with OCR can be applied in many different fields to ensure customers have access to the information they are looking for.
