If your organization uses paper-based forms for operations, keep reading. With computer vision and Optical Character Recognition (OCR), you can go from files and piles of unprocessed paper forms (stuck in a data swamp) to actionable insights (swimming in a data lake). Take paper-based data and transform it into a format that can be analyzed and acted upon. Minimize periods of dark data going forward. With recent advances in ML models and cloud computing, it is easier than ever to reach preliminary results with a Minimum Viable Product (MVP).
I get it: paper looks and feels nice. And you don’t have to train a workforce to fill out paper forms. If you want to continue using paper, or if you have paper-based historical data, consider using OCR. Optical Character Recognition (OCR) gives a computer the ability to read text that appears in an image, letting applications make sense of signs, articles, flyers, pages of text, menus, or any other place where text appears as part of an image. Naga Kiran of Data Driven Investor puts it well: “Optical character recognition is the process of classifying optical patterns contained in a digital image. The character recognition is achieved through segmentation, feature extraction and classification.” Put simply, OCR is the recognition of printed or written text characters by a computer. This involves scanning the text character by character, analyzing the scanned image, and then translating each character image into a character code, such as ASCII. Gidi Shperber of Towards Data Science points out that OCR is one of the earliest computer vision tasks to be addressed; OCR implementations existed well before the deep learning boom of 2012, and some date back to 1914.
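To make the three stages in the quote concrete (segmentation, feature extraction, classification), here is a minimal toy sketch in plain Python. It is an illustration only, not how a production OCR engine or our project works: the "image" is a tiny text grid, the `TEMPLATES` glyphs are made up for this example, and simple nearest-template matching stands in for a trained classifier.

```python
# Toy 3x5 glyph templates, invented for this illustration ('#' = ink, '.' = background).
TEMPLATES = {
    "0": ["###", "#.#", "#.#", "#.#", "###"],
    "1": [".#.", "##.", ".#.", ".#.", "###"],
    "7": ["###", "..#", ".#.", ".#.", ".#."],
}


def segment(image):
    """Segmentation: split the image into glyphs wherever a column has no ink."""
    width = len(image[0])
    glyphs, start = [], None
    for x in range(width + 1):
        has_ink = x < width and any(row[x] == "#" for row in image)
        if has_ink and start is None:
            start = x  # a new glyph begins at this column
        elif not has_ink and start is not None:
            glyphs.append([row[start:x] for row in image])
            start = None
    return glyphs


def features(glyph):
    """Feature extraction: flatten the glyph into a vector of 0/1 pixels."""
    return tuple(1 if ch == "#" else 0 for row in glyph for ch in row)


def classify(glyph):
    """Classification: pick the template with the fewest differing pixels."""
    vec = features(glyph)

    def distance(label):
        ref = features(TEMPLATES[label])
        return sum(a != b for a, b in zip(vec, ref))

    return min(TEMPLATES, key=distance)


def recognize(image):
    """Run the full pipeline and also return the ASCII code of each character."""
    text = "".join(classify(glyph) for glyph in segment(image))
    return text, [ord(ch) for ch in text]


# A made-up "scan" containing the digits 1, 0 and 7 separated by blank columns.
IMAGE = [
    ".#..###.###",
    "##..#.#...#",
    ".#..#.#..#.",
    ".#..#.#..#.",
    "###.###..#.",
]

text, codes = recognize(IMAGE)
print(text, codes)
```

Real handwriting is far messier than this grid: characters touch, slant, and vary between writers, which is why modern systems replace the template lookup with learned models.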
Recent advances include cloud computing and improved Machine Learning models. Although the OCR problem is not yet “solved” and many challenges remain, these advances let us build a Minimum Viable Product (MVP) faster and cheaper than ever. We encountered some of the remaining challenges in a recent project.
One of our clients recently requested a Machine Learning model that can read scans and photos of paper forms and convert the data to a CSV file. The company leases and maintains air conditioning and ventilation devices for numerous clients spread over a large area in Central Europe. In most instances, a staff member travels to a customer site and performs one or more service actions (inspects the machines, performs measurements, replaces liquids, cleans the machines). Sometimes, a faulty device is replaced with a functioning one. After the task is completed, the staff member fills out a paper form by hand, detailing the actions taken. The company generates approximately 5,000 forms a month and has a backlog of 60,000 to 80,000 forms waiting to be entered into the database. The form captures the location, equipment serial numbers, services provided, cities and addresses, and a set of checkboxes. Most of the fields in the documents are boxes for handwriting.
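As a rough sketch of the last step of such a pipeline, here is how already-extracted form fields might be written out as CSV using Python's standard `csv` module. The column names in `FIELDS` and the two sample records are hypothetical placeholders, not the client's actual form layout or data.

```python
import csv
import io

# Hypothetical column names; a real form would define its own set of fields.
FIELDS = ["form_id", "city", "serial_number", "service", "device_replaced"]


def forms_to_csv(records):
    """Serialize a list of per-form dictionaries into CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for record in records:
        # Keys missing from a record are written as empty cells by default.
        writer.writerow(record)
    return buffer.getvalue()


# Made-up records standing in for the output of the recognition step.
records = [
    {"form_id": "F-0001", "city": "Example City", "serial_number": "AC-12345",
     "service": "inspection", "device_replaced": False},
    {"form_id": "F-0002", "city": "Example Town", "serial_number": "AC-67890",
     "service": "cleaning, liquid replacement", "device_replaced": True},
]

csv_text = forms_to_csv(records)
print(csv_text)
```

Using `csv.DictWriter` rather than joining strings by hand means values that contain commas, like the second service description, are quoted automatically.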
Why would they want to digitize and analyze their huge stack of forms? To glean actionable insights, of course! They currently have staff members who enter the forms manually. It’s a repetitive job that is a good candidate for automation.
Here are some of the challenges facing such an automation effort:
I imagine that this company’s situation is similar to that of other organizations that capture data via paper and handwriting.
The Appsilon team’s approach to the problem is as follows:
Unfortunately, today’s technology will not capture 100% of the handwritten responses, so the machine will still need some help from friendly humans. With time and adjustments, though, the results can be improved. If your company or organization plans to leverage its paper-based data in the future, here are some recommendations to ensure a smooth transfer. Some of them are non-obvious!
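The point about the machine still needing help from humans can be sketched in code. One common pattern, shown here as an assumption rather than a description of our project, is to accept readings the model is confident about and send the rest to a person for review. The `FieldReading` class, the confidence scores, and the 0.9 threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FieldReading:
    name: str          # which form field this reading belongs to
    text: str          # what the model thinks was written
    confidence: float  # 0.0 to 1.0, as reported by a hypothetical OCR model


def route(readings, threshold=0.9):
    """Split readings into auto-accepted ones and ones a human should check."""
    accepted, needs_review = [], []
    for reading in readings:
        if reading.confidence >= threshold:
            accepted.append(reading)
        else:
            needs_review.append(reading)
    return accepted, needs_review


# Made-up readings for a single form.
readings = [
    FieldReading("serial_number", "AC-12345", 0.97),
    FieldReading("city", "Exarnple City", 0.62),
    FieldReading("device_replaced", "yes", 0.91),
]

accepted, needs_review = route(readings)
print([r.name for r in accepted])
print([r.name for r in needs_review])
```

The threshold is a business decision: lowering it reduces manual work but lets more misread fields through, so it is worth tuning against a sample of forms that humans have already verified.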
So there are a lot of options for what can be done with the data once it is transferred to an accessible digital format. When the data is available to be put to work, data scientists call it a “data lake.” Once we have a data lake, then we can build decision support systems such as dashboards that can give a company real-time information and even recommendations.
You want to get to the point where there is minimal time between the completion of the paper form and its entry into the database, where the data can be put to work. You want to keep periods of dark data to a minimum.
In this post, we discussed different challenges and hacks in the OCR field. As with many problems in deep learning/computer vision, many challenges remain. On the other hand, we’ve seen it is not very hard to reach preliminary results. As with other areas of ML/AI/DS, I recommend consulting an AI team before investing in hardware and building data pipelines. Determining the highest priority business needs and then planning out the data and software requirements from there will save you a great deal of time and money in the long run.
Thanks for reading! Follow me on LinkedIn.