In this research paper we examine Optical Character Recognition (OCR) and its application in the accounting domain. We cover the business need to keep accounting-related documents up to date and accurate, both so that stakeholders receive the information they need for decision making and so that legal reporting requirements are met. We conclude that the accounting domain has to handle a variety of documents and that the information gained needs to be reconciled and digested by information systems regularly. We then cover basic machine learning tasks and types, classifying OCR as a supervised machine learning methodology. To further understand the machine learning terms and methods, we review earlier research comparing a variety of machine learning methodologies, which found Multilayer Perceptron (MLP) based models to be among the most effective. We cover the basic dynamics of MLPs, such as forward and back propagation, activation functions and neurons, and elaborate on how they are embedded in OCR engines. After this, we discuss a couple of application areas and challenges of OCR related to the accounting domain using one of the market-leading open-source OCR engines, Tesseract, and review the available research on the engine to understand how it works. Finally, we conclude that Tesseract is mostly capable of fulfilling the typical needs of accounting firms. However, because accounting documents come from numerous sources, ranging from paper-based, photographed and scanned documents to those created with accounting software, companies might need more pre-processing than Tesseract offers or than is usually considered part of OCR. We further note that accounting documents can contain words or information not typically found in OCR vocabularies, so invoice line items or company names can pose another challenge for OCR engines with dynamics similar to Tesseract.
We also note that if companies can overcome these challenges, OCR could greatly enhance their information processing capability.
Keywords: Optical Character Recognition, OCR, Artificial Intelligence, Accounting, Machine Learning, Tesseract
(Jain et al, 2016) researched a wide variety of Machine Learning algorithms (Naïve Bayes, Naïve Bayes with Laplace Smoothing, Sequential Minimal Optimization, C4.5 decision trees and Logistic Regression), compared their performance in Optical Character Recognition (OCR), and concluded that Logistic Regression tends to perform best on handwritten digits. Based on this, we can state that Machine Learning (ML) is embedded in OCR; in fact, a variety of ML algorithms can perform OCR.
There are sources (Mori et al, 1999) stating that OCR is used even at check-out counters in stores and in money changing machines. OCR is a technology that has become increasingly popular in many application areas, and its many practical applications include traffic control solutions, scanning machines and even postal systems. In this paper we’ll examine the business usage of Optical Character Recognition, focusing on the field of accounting and auditing, and how OCR and other types of Machine Learning are used to enhance business performance and solve problems.
Other research shows that OCR is not just a tool for character recognition that enables scalable document processing: (Junker et al, 1998) looked at applying OCR-based solutions to document classification. They examined various text representations based on n-grams and single words, and compared how well each could classify OCR texts into two topics, business messages and technical reports. They found that the n-gram method was suitable for grouping documents by their nature, which opens up a different set of application areas.
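To make the two text representations concrete, the following minimal Python sketch builds single-word tokens and character n-grams from a short OCR string and measures how much of each survives a recognition error. The function names, the choice of n = 3 and the sample strings are illustrative assumptions, not details taken from Junker and Hoch.

```python
from collections import Counter


def word_tokens(text):
    """Split text into lowercase whitespace-separated words."""
    return text.lower().split()


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of the lowercased text."""
    cleaned = " ".join(text.lower().split())  # normalise whitespace
    return [cleaned[i:i + n] for i in range(len(cleaned) - n + 1)]


def overlap(a, b):
    """Share of items in `a` (counted with multiplicity) also found in `b`."""
    ca, cb = Counter(a), Counter(b)
    shared = sum(min(count, cb[item]) for item, count in ca.items())
    return shared / max(1, sum(ca.values()))


clean = "invoice total"
noisy = "invoicc total"  # hypothetical OCR error: 'e' read as 'c'

word_overlap = overlap(word_tokens(noisy), word_tokens(clean))
ngram_overlap = overlap(char_ngrams(noisy), char_ngrams(clean))
```

In this toy case a single misread character removes one of the two words entirely, but only three of the eleven trigrams, which hints at why an n-gram representation can tolerate OCR noise better than whole words.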
With that being said, we’ll investigate possible uses of OCR in the field of financial accounting. This follows our premise that accounting relates, to some extent, to a wide range of fields and therefore needs to handle many paper-based or unstructured documents on a daily basis. In that environment, OCR engines might add considerable value to a business, enhancing both speed and operational quality, not to mention the benefit of not having expensive human resources spend time on the highly repetitive tasks that come with mining data from paper documents or unstructured electronic invoices.
OCR is of course more than just passing a couple of scanned paper invoices or electronic documents through an engine. It requires a fair amount of image pre-processing to make sure the images are in the correct format and have sufficient dots per inch (DPI) for an OCR engine to work on them effectively. In this paper, we’ll cover the process of making sure these documents are pre-formatted for OCR processing, and look at the business environment as well.
In terms of the business environment, legislative expectations can differ from region to region, but as (Dumitru, 2013) stated, each economic unit must observe the legislative regulations on the completion, storage and archiving of accounting-related paperwork, including documents that are electronic rather than paper-based. Therefore, in this paper we’ll discuss these expectations in line with Generally Accepted Accounting Principles (GAAP).
Role of OCR in Accounting
As said earlier, the accounting domain, traditionally and by function, requires significant document processing. As (Fisher et al, 2010) stated in their research paper on the role of text analytics in accounting, GAAP and the annual financial statements, along with additional reporting narratives, are important sources of analytical information that give stakeholders from the finance and investment sector the transparency they need when evaluating a company. In their paper they looked at a couple of text analytics areas; more specifically, they:
compared manual and computational text-based content processing in terms of accounting narratives and related text mining,
examined information retrieval streams that address the extraction of both text elements and quantities embedded within the text of accounting documents.
They used GAAP, in line with the Financial Accounting Standards Board (FASB), as the basis for their literature references, and concluded that future research in accounting text analytics would be beneficial.
To further emphasize the business need to keep these accounting documents up to date and accurate, we can also look at patents in addition to research citations. For example, Nowotny (2009) patented software building on the fact, as stated in the background of the patent, that reconciliation of accounting documents is one of the major tasks each business needs to conduct before closing the business cycle or the given period. Nowotny states that in this process the sub-ledger accounts need to be reconciled and, if needed, explained, for which sub-ledger entries might need revision as well; those entries usually comprise a long list of source documents.
Just by looking at a couple of earlier research articles, we noted that document processing is not just an everyday task in accounting and finance, but these documents often need to be reconciled and businesses are in constant need of making sure they provide the correct information to their stakeholders.
On the topic of accuracy, (Singh et al, 2012) surveyed applications of OCR, investigating its use in Captcha solving, Institutional Repositories and Optical Music Character Recognition. They used an enhanced image segmentation algorithm for OCR that was based on histogram equalization and could successfully convert a variety of noisy pictures, further demonstrating the power of OCR.
In the following research we’ll not strictly follow any of the earlier approaches, but instead investigate these application areas and usage opportunities with a slightly different scope, and will look at one of the market leading OCR engines called Tesseract, and also investigate the relationship between OCR and other types of Machine Learning.
Types of Machine Learning and which one is OCR
Today, it is beyond question that ML and similar analytics methods have been successfully implemented in various ERPs, and companies use them to improve their processes on a continuous basis. Digitalization is also everywhere and seems to be becoming ever more embedded in our everyday lives, raising many privacy and security concerns alongside a variety of legal questions about allowing computers to make decisions without human intervention. To better understand how Artificial Intelligence (AI) and ML can be used in the accounting domain, and how OCR fits into the picture, we need to gain insight into what ML is and what types of ML exist.
(Alpaydin, 2020) described ML in his book Introduction to Machine Learning as the process of programming computers in a way that enables them to optimize their own performance while solving specific tasks. They do that by using past experience, more specifically training data that contains earlier events and their outcomes. Alpaydin noted that a model can be predictive, in which case it makes predictions about the future, or descriptive, in which case it produces insights that were not previously known. The same source also notes that ML uses statistics to build models, which the computer fits by minimizing model error on the training set.
(Zhang, 2010) also researched ML and the related algorithm types, and concluded that the common algorithm types in ML are:
Supervised Learning, when the computer generates a function that maps inputs to desired outputs, using several input-output examples to optimize the function,
Unsupervised Learning, where the algorithm models a set of inputs, without taking any pre-defined output as an example – which can help understanding new relationships or data which is not known by the researcher prior to modelling,
Semi-supervised Learning, in which the model combines labeled and unlabeled examples to generate an appropriate function or classifier,
Reinforcement learning, which is a specific type of ML where the model is given an observation and learns how to act based on it; every action then has an impact on the environment, resulting in complex patterns of reactions and events,
Transduction, which is, as Zhang stated, very similar to supervised learning, except that a function is not explicitly created; instead, transduction models try to predict new outputs from training inputs, training outputs and new inputs,
Learning to learn, where the model learns an inductive bias based on previous experience.
The same book visualizes what are probably the two most common learning approaches, supervised and unsupervised learning, as follows.
Figure 1. Examples of Supervised and Unsupervised Learning, (Zhang, 2010)
Supervised learning is described as a common approach to classification problems because it takes a set of observations as inputs and minimizes an error function against a set of output observations using past data, as shown in Figure 1. Supervised learning, as also shown in Figure 1, assumes that the outputs follow from the inputs through a causal chain, and therefore unsupervised models tend to be more vulnerable to missing data.
Considering the above, OCR is by nature a supervised learning problem. When performing OCR, a set of labelled examples is almost always available, namely images of the characters we aim to recognize. The output layer can also be defined easily, since the output is the character depicted in the input image; the only difference is that the input can vary with font and character style.
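The supervised framing can be illustrated with a deliberately small Python sketch: labelled glyph images serve as input-output examples, and an unseen glyph is assigned the label of its closest example. A nearest-neighbour rule is swapped in here purely for brevity; it is not the MLP discussed later, and the 3x3 glyphs and function names are invented for illustration.

```python
def hamming(a, b):
    """Number of positions where two equal-length bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))


# Labelled training data: flattened 3x3 binary glyphs (1 = ink) and their characters.
TRAINING = [
    ([1, 1, 1,
      0, 1, 0,
      0, 1, 0], "T"),
    ([1, 0, 0,
      1, 0, 0,
      1, 1, 1], "L"),
    ([1, 1, 1,
      1, 0, 1,
      1, 1, 1], "O"),
]


def predict(glyph):
    """Return the label of the training glyph closest to `glyph`."""
    _, label = min(TRAINING, key=lambda pair: hamming(pair[0], glyph))
    return label


# An unseen, slightly damaged "T" (one ink pixel missing in the stem).
noisy_t = [1, 1, 1,
           0, 1, 0,
           0, 0, 0]
prediction = predict(noisy_t)
```

The damaged glyph differs from the stored "T" in one pixel and from the other two examples in six or seven, so it is still labelled "T"; the input varies while the desired output stays the same character.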
Comparing a few Machine Learning methodologies
As (Makridakis et al, 2018) noted in their research Statistical and Machine Learning forecasting methods: Concerns and ways forward, and as we pointed out earlier, Artificial Intelligence (AI) has gained significant attention over the last decade or so, and is already credited with a long list of technological advancements.
In their research, (Ahmed et al, 2010) compared a variety of ML algorithms in order to draw conclusions from a performance comparison. The comparative study was specific to predictive analysis, more precisely to time series forecasting. The models they used were:
Multi-Layer Perceptron (MLP)
Bayesian Neural Network (BNN)
Radial Basis Functions (RBF)
Generalized Regression Neural Networks (GRNN), also called kernel regression
K-Nearest Neighbor regression (KNN)
CART regression trees (CART)
Support Vector Regression (SVR), and
Gaussian Processes (GP).
The team ran this analysis after noting that only a small number of studies compare various types of ML or AI.
Figure 2. Forecasting performance (sMAPE) of the ML methods tested in the study of Ahmed et al.
The study concluded there are significant differences between various methodologies and that the best methods for time series forecasting are seemingly Multilayer Perceptrons and Gaussian Process Regressions. They also note that different pre-processing methods of the input data tend to impact performance.
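For readers unfamiliar with the sMAPE measure reported in Figure 2, the sketch below computes the symmetric mean absolute percentage error in Python. Several variants of sMAPE exist; the one shown is a commonly used definition and is an assumption here, since this text does not state which variant Ahmed et al. applied. The sample numbers are invented.

```python
def smape(actual, forecast):
    """Symmetric mean absolute percentage error, in percent.

    Uses one common definition: mean of |F - A| / ((|A| + |F|) / 2) * 100.
    Pairs where both values are zero are counted as zero error (an assumption).
    """
    if len(actual) != len(forecast) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for a, f in zip(actual, forecast):
        denom = (abs(a) + abs(f)) / 2
        total += abs(f - a) / denom if denom else 0.0
    return 100.0 * total / len(actual)


actual = [100.0, 200.0]      # hypothetical observed values
forecast = [110.0, 180.0]    # hypothetical model forecasts
score = smape(actual, forecast)
```

With these values the score is roughly 10, meaning the forecasts deviate by about ten percent of the average magnitude of forecast and actual; lower values indicate better forecasts.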
Multilayer Perceptrons, the foundation of many OCR and predictive ML engines
After covering the basic types of ML and why OCR and AI can be important in the accounting domain, in this chapter we’ll dig a little deeper and elaborate on one of the building blocks underlying ML. There are already papers looking at activation functions and their effects on a Neural Network (NN). In this chapter we’ll not focus specifically on either activation functions or the effect of propagation methodologies on network performance, but rather on what a multilayer perceptron is and what its basic dynamics are.
As per (Baum, 1988), a Multilayer Perceptron (MLP) is defined as a network with an input layer of units, followed by a definite number of successive layers of units; the number of layers and of units per layer varies from network to network. After these so-called hidden layers, the last layer of the MLP is the output layer. The units are connected to each other layer by layer, and the connections are called synapses. Each synapse holds the weight for the units it connects, and this weight is optimized while the network is learning. In other words, each unit’s output is connected to the input of each unit in the next layer, multiplied by the weight of the connecting synapse.
(Pal and Mitra, 1992) described the mechanics of MLP networks in terms of the forward and back-propagation algorithms. In their paper, an MLP network was drawn as below.
Figure 3. A neural network with three hidden layers. Neurons connected to each other via variable connection weights (Pal and Mitra, 1992)
In the figure, Layer 0 is the input layer, which takes a set of inputs; the network converts them into an output by forward propagating them through layers Lh and Lh+1. The number of layers in an MLP NN can vary and is usually customized to best solve a specific problem. The layers are connected to each other through their units, which we also call neurons. The neurons are connected by synapses, which hold their respective weights. These weights are used to produce an output: each neuron multiplies its inputs by the corresponding weights, sums them, and in most cases passes the result through an activation function. Activation functions are used to introduce non-linearity into the model, and can be any non-linear function. After creating an output, the network compares it to a reference value, derives the error between the desired output and the generated output, and updates the weight of each synapse by back-propagating the error through the network.
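The forward and backward passes described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the network of Pal and Mitra: it uses a single hidden layer of two neurons, sigmoid activations, a mean squared error, full-batch gradient descent, arbitrary fixed starting weights and a toy logical-OR data set standing in for pixel inputs.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class TinyMLP:
    """One hidden layer, sigmoid activations, trained by full-batch gradient descent."""

    def __init__(self):
        # Fixed, arbitrary starting weights (illustrative values) so runs are repeatable.
        self.w_hidden = [[0.5, -0.4], [0.3, 0.8]]  # w_hidden[j][i]: input i -> hidden j
        self.b_hidden = [0.1, -0.1]
        self.w_out = [0.7, -0.6]                   # hidden j -> output
        self.b_out = 0.05

    def forward(self, x):
        """Forward propagation: weighted sums passed through the activation."""
        hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
                  for row, b in zip(self.w_hidden, self.b_hidden)]
        output = sigmoid(sum(w * h for w, h in zip(self.w_out, hidden)) + self.b_out)
        return hidden, output

    def loss(self, data):
        """Mean squared error over the data set."""
        return sum((self.forward(x)[1] - y) ** 2 for x, y in data) / len(data)

    def train_epoch(self, data, lr):
        """Back-propagation: push the output error back to every weight."""
        n = len(data)
        g_wh = [[0.0, 0.0], [0.0, 0.0]]
        g_bh = [0.0, 0.0]
        g_wo = [0.0, 0.0]
        g_bo = 0.0
        for x, y in data:
            hidden, out = self.forward(x)
            # Error signal at the output; sigmoid'(z) equals out * (1 - out).
            d_out = 2.0 * (out - y) * out * (1.0 - out) / n
            g_bo += d_out
            for j in range(2):
                g_wo[j] += d_out * hidden[j]
                # Error signal passed back through the synapse to hidden neuron j.
                d_hid = d_out * self.w_out[j] * hidden[j] * (1.0 - hidden[j])
                g_bh[j] += d_hid
                for i in range(2):
                    g_wh[j][i] += d_hid * x[i]
        # Gradient-descent update of every weight and bias.
        self.b_out -= lr * g_bo
        for j in range(2):
            self.w_out[j] -= lr * g_wo[j]
            self.b_hidden[j] -= lr * g_bh[j]
            for i in range(2):
                self.w_hidden[j][i] -= lr * g_wh[j][i]


# Toy data: logical OR of two binary inputs (hypothetical stand-in for pixel inputs).
DATA = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]

net = TinyMLP()
loss_before = net.loss(DATA)
for _ in range(2000):
    net.train_epoch(DATA, lr=0.5)
loss_after = net.loss(DATA)
```

Comparing `loss_before` with `loss_after` shows the effect of the weight updates: the error between generated and desired outputs shrinks as training proceeds. Real OCR networks differ mainly in scale, with one input per pixel and one output per character class.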
As we said earlier, OCR problems are typically supervised learning problems where the computer generates a function that maps inputs to desired outputs, using several input-output examples to optimize the function. After looking at the dynamics of NNs and MLP networks, we should now have a much clearer picture of why that is and how it works. In the case of OCR engines, the input is usually a binary matrix generated by converting a picture of a character into its binary format. The MLP network can then be trained on such matrices to generate the desired output, which in this case is the recognized character.
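As a small illustration of the binary-matrix input, the Python sketch below turns a grayscale crop of a character into a matrix of ones and zeros and flattens it into the vector a network would receive. The fixed threshold of 128, the function names and the pixel values are assumptions chosen for the example.

```python
def binarize(gray, threshold=128):
    """Map a grayscale grid (0 = black, 255 = white) to 1 for ink, 0 for background."""
    return [[1 if pixel < threshold else 0 for pixel in row] for row in gray]


def flatten(matrix):
    """Concatenate the rows into one input vector for the network."""
    return [bit for row in matrix for bit in row]


# Hypothetical 5x4 grayscale crop of the digit "1" (dark stroke on light paper).
glyph = [
    [250, 240,  30, 245],
    [248,  40,  25, 250],
    [252, 246,  20, 249],
    [251, 243,  35, 247],
    [249,  45,  28,  38],
]

bits = binarize(glyph)
inputs = flatten(bits)
```

Here `inputs` holds 20 values, one per pixel, so a network reading this crop would need 20 input units; the label "1" would be the desired output during training.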
Applying OCR MLP networks on the accounting domain
When using MLP networks for OCR purposes, accounting is a promising field due to the large number of documents accountants have to maintain. On the other hand, it is also challenging because the documents are in many cases unstructured or vary in layout, not to mention that many invoices are still paper-based. Electronically generated documents tend to be of good quality with a resolution near 300 dots per inch (DPI), or can easily be converted into high-quality files. Paper-based documents such as scanned or photographed invoices, by contrast, can arrive in any format or quality, and bringing them to an appropriate quality can require significant work. When running OCR algorithms, image quality and pre-processing are among the most challenging aspects.
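To give the 300 DPI figure a practical meaning, the following Python sketch estimates the effective resolution of a digitized page from its pixel width and physical width, and the resampling factor needed to reach a target DPI. The function names and the example photo dimensions are hypothetical.

```python
MM_PER_INCH = 25.4


def effective_dpi(pixels, length_mm):
    """Dots per inch along one side, from pixel count and physical length."""
    return pixels / (length_mm / MM_PER_INCH)


def scale_to_target(pixels, length_mm, target_dpi=300):
    """Resampling factor needed so the image reaches `target_dpi`."""
    return target_dpi / effective_dpi(pixels, length_mm)


# Hypothetical phone photo of an A4 invoice (210 mm wide) that is 1240 px wide.
dpi = effective_dpi(1240, 210)
factor = scale_to_target(1240, 210)
```

In this example the photo resolves to about 150 DPI and would have to be enlarged roughly twofold to match a 300 DPI scan. Enlarging cannot restore detail that was never captured, which is one reason low-resolution photographs remain harder to process than native electronic documents.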
(Summers, 2003) looked into ways of automatically selecting an image enhancement method to improve OCR accuracy. Summers worked with two sets of input images: one intended to contain some noise while being in a single font of Roman or similar script, and another that is more general. The paper considers a variety of ways to improve image quality for OCR purposes, and checks whether transforming the image led to better performance. Summers noted that the results varied, and overall the system made the correct decision 64% of the time.
While this leaves a couple of questions open, (Shen and Lei, 2015) also looked into improving OCR efficiency, more specifically for recognizing characters in images. They noted that many documents (similarly to scanned invoices and other accounting-related, originally non-electronic documents) have embedded background images or noise that can prevent algorithms from correctly recognizing the characters. For example, small dots or sharp edges can be mistaken for characters, and the OCR engine would try to convert them into characters when it generates the output. In their paper, they enhanced the character recognition capabilities of the open-source Tesseract OCR engine by adjusting the brightness and contrast parameters of the images and converting the images to greyscale prior to thresholding. They state that Tesseract’s performance improved significantly after removing background images.
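The kind of pre-processing described here, greyscale conversion followed by a brightness and contrast adjustment, can be sketched in plain Python as follows. This is an illustration under assumptions rather than the procedure of Shen and Lei: the luminance weights are the widely used ITU-R BT.601 values, the linear adjustment formula and the parameter values are chosen for the example, and the function names are invented.

```python
def to_grayscale(rgb):
    """Luminance from an (R, G, B) pixel using the ITU-R BT.601 weights."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b


def adjust(value, contrast=1.0, brightness=0.0):
    """Linear contrast/brightness change around mid-grey, clipped to 0..255."""
    adjusted = contrast * (value - 128.0) + 128.0 + brightness
    return int(round(max(0.0, min(255.0, adjusted))))


def preprocess(image, contrast=1.5, brightness=20.0):
    """Grey-convert and stretch every pixel of an RGB image (list of rows)."""
    return [[adjust(to_grayscale(px), contrast, brightness) for px in row]
            for row in image]


# Hypothetical 1x3 strip: dark text pixel, pale watermark pixel, white paper.
strip = [[(20, 20, 20), (200, 210, 220), (255, 255, 255)]]
result = preprocess(strip)
```

With these settings the dark text pixel is driven to black while the pale watermark pixel is pushed to the same white as the paper, so a later thresholding step no longer sees the background pattern as ink.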
(Bhaskar et al, 2010) also used Tesseract for OCR, aiming to recognize text on business cards. They used Android mobile cameras under varying conditions in terms of image noise and backgrounds. As they summed it up, Tesseract goes through the following steps to perform OCR on pictures:
Tesseract takes a .tiff or .bmp image, but can work with a couple of other types as well on a non-native basis. The images can be either greyscale or coloured.
The engine applies Otsu’s method for adaptive thresholding, reducing a greyscale image to a binary image: assuming there are foreground and background (black and white) pixels, it automatically calculates the optimal threshold to separate them. In their paper, Bhaskar and the team visualized this conversion as shown in Figure 4.
Figure 4. Original image and Adaptively Thresholded Image, (Bhaskar et al, 2010)
This can, of course, be split into multiple sub-steps; for example, (Bangare et al, 2015) split this single step into four steps, as shown in Figure 5.
Figure 5. Flow of Otsu’s thresholding in Image Processing, (Bangare et al, 2015)
Connected-component labelling, when Tesseract goes through the binary image and marks foreground pixels as potential characters.
Line finding, when each line of text is found and identified. They note that this part of the algorithm uses a Y projection of the binary image and finds locations by looking at the pixel count, identifying potential lines and analysing them further to confirm.
Baseline fitting, when the algorithm goes through each line of text and examines it to find the approximate text height across the line. This is considered the first step in actually converting the image into characters.
Fixed pitch detection, which follows baseline fitting, is where Tesseract finds character locations. This is when it finds the width of the characters, allowing for incremental extraction.
Non-fixed pitch spacing delimiting, where characters that are not of uniform width are reclassified and reprocessed in an alternate manner.
Word recognition is the last step in the OCR process of Tesseract, at which step it goes word by word and line by line, to find words by passing everything through a contextual analyser ensuring accurate word recognition.
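Two of the steps listed above, Otsu thresholding and projection-based line finding, are simple enough to sketch in plain Python. The code below is written from scratch for illustration and is not Tesseract’s implementation: the threshold search maximises the between-class variance of the grey-level histogram, and the line finder is a simplified row projection that groups consecutive rows containing ink. The function names, the `min_ink` parameter and the sample page are assumptions.

```python
def otsu_threshold(pixels):
    """Return the grey level t (0..255) maximising between-class variance.

    Pixels with value <= t form one class, the rest the other.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # pixel count of the darker class
    sum0 = 0  # intensity sum of the darker class
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Proportional to the between-class variance (constant factor dropped).
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_var:
            best_t, best_var = t, between
    return best_t


def binarize(image, t):
    """1 for dark (ink) pixels, 0 for light background."""
    return [[1 if p <= t else 0 for p in row] for row in image]


def find_lines(binary, min_ink=1):
    """Group consecutive rows whose ink count reaches `min_ink` into (start, end) spans."""
    lines, start = [], None
    for y, row in enumerate(binary):
        has_ink = sum(row) >= min_ink
        if has_ink and start is None:
            start = y
        elif not has_ink and start is not None:
            lines.append((start, y - 1))
            start = None
    if start is not None:
        lines.append((start, len(binary) - 1))
    return lines


# Hypothetical greyscale page (6 rows, 4 columns): two text lines separated by blank paper.
page = [
    [230, 235, 240, 232],
    [ 30, 228,  25,  35],
    [ 28,  32, 231,  30],
    [233, 238, 236, 229],
    [ 40,  22,  27, 234],
    [231, 237, 230, 235],
]
t = otsu_threshold([p for row in page for p in row])
lines = find_lines(binarize(page, t))
```

On this sample the threshold falls between the dark ink values and the light paper values, and the projection reports two text lines, rows 1 to 2 and row 4. Real pages add skew, noise and touching lines, which is why the engine analyses the candidate lines further before accepting them.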
After understanding the accounting domain’s need for document processing in the first few chapters, as well as the dynamics of OCR algorithms and the challenges they typically need to overcome, it is time to draw conclusions.
We’ve investigated the accounting domain and the uses of OCR. Following in the steps of other researchers and their publications, we concluded that the accounting domain could use OCR to automate and enhance many of the tasks it is required to perform, mainly in document processing. Accounting firms may need to process thousands of invoices or more each month, and OCR could help free human labor from that repetitive work. On the other hand, we also noted that the nature of the documents varies: many invoices are paper-based, while many others are created using electronic accounting systems. Documents produced with electronic systems are relatively easy to process through an OCR engine thanks to high image quality and DPI often reaching 300. The quality of the images that accounting firms obtain from paper can vary significantly, depending on how the documents are digitized. One of the studies used Android phone cameras for digitization, while other studies refer to scanned documents; the two methods differ in DPI and image quality, not to mention image orientation. Quality can also differ from camera to camera and from scanner to scanner, while environmental factors play another significant role in photo-based digitization.
We’ve also looked at the dynamics behind OCR engines and the machine learning process behind them. Having reviewed various types of machine learning and the data inputs they need by nature, we consider OCR a supervised machine learning problem, and we elaborated on how Multilayer Perceptron networks can perform OCR. Engine-wise, we focused on Tesseract, one of the most popular open-source OCR engines on the market.
By nature, the Tesseract OCR engine is capable of extracting text from images, and can handle a variety of picture formats (.tiff, .bmp, and many more). It can also handle different picture sizes and both greyscale and coloured images. It uses Otsu’s method to threshold pictures dynamically in order to improve its performance. We also found that pre-processing pictures by reducing background noise or removing the picture background can greatly increase performance.
Finally, images across the accounting domain can vary in both structure and size because they are submitted from various sources, such as paper-based documents or electronic accounting systems. That adds further challenges to applying OCR effectively when scaling accounting-related document processing. When using Tesseract, another challenge comes from invoice line items and company names, which can contain names that the OCR engine cannot match to a dictionary of pre-taught words, reducing the chance of correct recognition. With that being said, if the accounting domain can effectively overcome these challenges, OCR could greatly enhance its capability to process invoices or information coming from a variety of documents.
Mori Shunji and Nishida Hirobumi and Yamada Hiromitsu, (1999), Optical Character Recognition, John Wiley & Sons, Inc., ISBN: 0471308196
Junker, M., Hoch, R. An experimental evaluation of OCR text representations for learning document classifiers. IJDAR 1, 116–122 (1998). https://doi.org/10.1007/s100320050012
Dumitru, Aurelia and Cotoc, Elena, Archiving, Keeping Records and Financial Accounting Documents (2013). International Journal of Education and Research, 1(11), 2013. Available at SSRN: https://ssrn.com/abstract=2737018
Ingrid E. Fisher, Margaret R. Garnsey, Sunita Goel, Kinsun Tam; The Role of Text Analytics and Information Retrieval in the Accounting Domain. Journal of Emerging Technologies in Accounting 1 December 2010; 7 (1): 1–24. doi: https://doi.org/10.2308/jeta.2010.7.1.1
Dietmar Nowotny, (2009), Reconciliation of accounting documents, patent right, US20100332360A1
S. Kotsiantis, E. Koumanakos, D. Tzelepis and V. Tampakas, (2006), International Journal Of Computational Intelligence Volume 3 Number 2 2006 Issn 1304-2386
V. Jain, A. Dubey, A. Gupta and S. Sharma, (2016), Comparative analysis of machine learning algorithms in OCR, 3rd International Conference on Computing for Sustainable Global Development (INDIACom), New Delhi, 2016, pp. 1089-1092.
Amarjot Singh, Ketan Bacchuwar, and Akshay Bhasin, (2012), Survey of OCR Applications, International Journal of Machine Learning and Computing, Vol. 2, No. 3, June 2012
Ethem Alpaydin, (2020), Introduction to Machine Learning, Massachusetts Institute of Technology, 4th edition
Yagang Zhang, (2010), New Advances in Machine Learning, Intech, ISBN: 978-953-307-034-6
Makridakis S, Spiliotis E, Assimakopoulos, (2018) Statistical and Machine Learning forecasting methods: Concerns and ways forward., PLoS ONE 13(3): e0194889. https://doi.org/10.1371/journal.pone.0194889
Nesreen K. Ahmed, Amir F. Atiya, Neamat El Gayar & Hisham El-Shishiny (2010) An Empirical Comparison of Machine Learning Models for Time Series Forecasting, Econometric Reviews, 29:5-6, 594-621, DOI: 10.1080/07474938.2010.481556
Eric B. Baum, (1988), On the capabilities of multilayer perceptrons, Journal of Complexity, Volume 4, Issue 3, September 1988, Pages 193-215
Sankar K. Pal, Sushmita Mitra, (1992), Multilayer perceptron, fuzzy sets, and classification, IEEE Transactions on Neural Networks, vol. 3
Kristen M. Summers, (2003), “Document image improvement for OCR as a classification problem”, Proc. SPIE 5010, Document Recognition and Retrieval X; https://doi.org/10.1117/12.476023
Mande Shen, Hansheng Lei, (2015), “Improving OCR performance with background image elimination,” 2015 12th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), Zhangjiajie, pp. 1566-1570, doi: 10.1109/FSKD.2015.7382178.
Bhaskar, S.A., Lavassar, N., & Green, S.A. (2010). Implementing Optical Character Recognition on the Android Operating System for Business Cards.
Bangare, Sunil & Dubal, Amruta & Bangare, Pallavi & Patil, Suhas. (2015). Reviewing otsu’s method for image thresholding. International Journal of Applied Engineering Research. 10. 21777-21783.
PhD Student, Szent István University, Doctoral School of Management and Business Administration
PhD Student, Szent István University, Doctoral School of Management and Business Administration
Márk Tóth, associate professor
Szent István University, Faculty of Economics and Social Sciences, Institute of Finance, Accounting and Controlling
@ WCTC LTD --- ISSN 2398-9491 | Established in 2009 | Economics & Working Capital