Adventures in Optical Character Recognition (OCR): Recommendations for Paper-Loving Organizations
If your organization uses paper-based forms for operations, then keep reading — you can go from files and piles of unprocessed paper forms (stuck in a data swamp) towards actionable insights (swimming in a data lake) with computer vision and Optical Character Recognition (OCR). Take paper-based data and transform it into a format that can be analyzed and acted upon, and minimize periods of dark data going forward. With recent advances in ML models and cloud computing, it is easier than ever to reach preliminary results with a Minimum Viable Product (MVP).
What is OCR?
I get it — paper looks and feels nice. And you don’t have to train a workforce to fill out paper forms. If you want to continue using paper, or if you have paper-based historical data, consider using OCR. Optical Character Recognition (OCR) gives a computer the ability to read text that appears in an image, letting applications make sense of signs, articles, flyers, pages of text, menus, or any other place that text appears as part of an image. Naga Kiran of Data Driven Investor puts it well: “Optical character recognition is the process of classifying optical patterns contained in a digital image. The character recognition is achieved through segmentation, feature extraction and classification.” Put simply, OCR is the recognition of printed or written text characters by a computer. This involves photo-scanning of the text character by character, analysis of the scanned-in image, and then translation of the character image into character codes, such as ASCII. Gidi Shperber of Towards Data Science points out that OCR is one of the earliest computer vision tasks to be addressed; different OCR implementations existed even before the deep learning boom of 2012, and some date back as far as 1914.
Recent advances include cloud computing and improved Machine Learning models. Although the OCR problem is not yet “solved” and many challenges remain, thanks to these advances we can build a Minimum Viable Product (MVP) faster and cheaper than ever. We encountered some of these challenges in a recent project.
One of our clients recently requested a Machine Learning model that can read scans and photos of paper forms and convert the data to a CSV file. The company leases and maintains air conditioning and ventilation devices for numerous clients spread over a large area in Central Europe. In most instances, a staff member travels to a customer site and performs one or more service actions (inspects the machines, performs measurements, replaces liquids, cleans the machines). Sometimes, a faulty device is replaced with a functioning one. After the task is completed, they handwrite a paper form detailing the actions taken. The company generates approximately 5,000 forms a month and has a backlog of 60,000 to 80,000 forms waiting to be entered into the database. The forms contain data about the location, equipment serial numbers, services provided, cities and addresses, and checkboxes. Most of the fields are boxes for handwriting.
Why would they want to digitize and analyze their huge stack of forms? To glean actionable insights, of course! Currently, staff members input the forms manually. It’s a repetitive job that is a good candidate for automation.
Here are some of the challenges facing such an automation effort:
- The layout of the forms has changed over time.
- There are many different types of handwriting with varying levels of legibility.
- Some of the forms were scanned, and others photographed. Quality varies.
- Some of the forms were held up by staff members to be photographed, so the images show fingers and warped paper surfaces, and unsteady hands produced unfocused images.
- Some fields have been crossed out, and new values written around the field.
I imagine that this company’s situation is similar to problems faced by other organizations who capture data via paper and handwriting.
The Appsilon team’s approach to the problem is as follows:
- Normalize — clean up the visual “noise”, remove the background. It’s an advanced type of thresholding. The foreground should be only information, nothing else.
- Localize the information. What are the locations of the various pieces of information on the sheet?
- Find the relevant data fields, then section them off. For fields that have bounding boxes, one can leverage them to identify relevant pieces of the form.
- Use available Application Programming Interfaces (APIs) for text recognition, such as Google’s.
- Collect the scraped data in a format useful for later analysis, e.g., a CSV file.
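The normalization and localization steps can be sketched in a few lines of Python. This is a minimal illustration only, assuming dark ink on a light background, a fixed global threshold, and fields separated by blank rows; real scans need adaptive thresholding (e.g., with OpenCV) and layout analysis against the actual form template. All names here are illustrative, not from the project.

```python
import numpy as np

INK_THRESHOLD = 128  # illustrative global cutoff; real scans need adaptive thresholding


def normalize(gray):
    """Binarize a grayscale page: ink -> 1, background -> 0.
    Assumes dark ink on a light background."""
    return (gray < INK_THRESHOLD).astype(np.uint8)


def localize_fields(binary):
    """Return (top, bottom) row ranges that contain ink, splitting on
    fully blank rows -- a crude stand-in for real field localization."""
    fields, start = [], None
    ink_rows = binary.any(axis=1)
    for i, has_ink in enumerate(ink_rows):
        if has_ink and start is None:
            start = i
        elif not has_ink and start is not None:
            fields.append((start, i))
            start = None
    if start is not None:
        fields.append((start, len(ink_rows)))
    return fields

# Each localized crop would then be sent to a text-recognition API,
# and the responses collected into rows of a CSV file.
```

On a synthetic 10×10 page with two dark bands, `localize_fields` returns the two row ranges occupied by ink, e.g. `[(2, 4), (6, 8)]`, which is enough to cut the page into per-field crops for the recognition step.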
Unfortunately, current technology will not capture 100% of the handwritten responses, so the machine will still require some help from friendly humans. But with some time and adjustments it is possible to optimize the results. If your company or organization plans to leverage its paper-based data in the future, here are some recommendations to ensure a smooth transfer. Some of them are non-obvious!
- Use a form which is pale pink or green (to make the fore/background split easiest). Move away from black and white scans. Extra color channels in a picture can play to your advantage!
- Format as many of the responses as you can into checkboxes.
- Free-response entries should have boxes for individual characters where possible.
- Standardize formats – for example, ask for dates in the dd-mm-yyyy format.
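Standardized formats pay off as soon as the text comes back from recognition: they make automated validation cheap, and anything that fails validation or arrives with a low recognition confidence can be routed to the human reviewers mentioned above. Here is a minimal sketch in Python; the field names and the 0.9 confidence cutoff are illustrative assumptions, not values from the project.

```python
from datetime import datetime

CONFIDENCE_THRESHOLD = 0.9  # illustrative cutoff; tune against observed error rates


def parse_form_date(text):
    """Validate an OCR'd date against the standardized dd-mm-yyyy format.
    Returns a datetime on success, None otherwise."""
    try:
        return datetime.strptime(text.strip(), "%d-%m-%Y")
    except ValueError:
        return None


def route_fields(ocr_result):
    """Split recognized fields into auto-accepted values and a queue for
    human review, based on the recognizer's confidence score.
    ocr_result maps field name -> (recognized text, confidence)."""
    accepted, review_queue = {}, []
    for field, (text, confidence) in ocr_result.items():
        if confidence >= CONFIDENCE_THRESHOLD:
            accepted[field] = text
        else:
            review_queue.append(field)
    return accepted, review_queue
```

Note that `parse_form_date("31-02-2020")` returns `None` because February has no 31st: a fixed date format catches a whole class of recognition errors that free-form entries would let through.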
What Can They Do with the Data?
- Use decision support dashboards that display updates on the location and condition of their equipment.
- Filter data for specific types of problems with equipment.
- Improve deployment of assets and inventory.
- Trend analysis: identify which problems happen where and when. Is there a seasonal element to certain purchases and repairs?
- Minimize wasteful purchases (and storage of excess supplies)
- Check efficiency of staff and partners.
- Fraud prevention.
- Create a data lake with a 360 degree view of the organization.
So there are a lot of options for what can be done with the data once it is transferred to an accessible digital format. When the data is available to be put to work, data scientists call it a “data lake.” Once we have a data lake, then we can build decision support systems such as dashboards that can give a company real-time information and even recommendations.
You want minimal time between the completion of the paper form and its entry into the database, where the data can be put to work. In other words, you want to minimize periods of dark data.
In this post, we discussed different challenges and hacks in the OCR field. As with many problems in deep learning/computer vision, many challenges remain. On the other hand, we’ve seen it is not very hard to reach preliminary results. As with other areas of ML/AI/DS, I recommend consulting an AI team before investing in hardware and building data pipelines. Determining the highest priority business needs and then planning out the data and software requirements from there will save you a great deal of time and money in the long run.
Thanks for reading! Follow me on LinkedIn.
Follow Appsilon Data Science on Social Media