In this article, we will consider the problem of receipt digitization, i.e. extracting necessary and important information in the form of labels from hardcopy receipts such as medical invoices, tickets, etc. These kinds of models can be highly useful in real life and help users better understand data, as a large chunk of our daily work still deals with hardcopy receipts. In the world of natural language processing, this task is called sequence tagging, as we are tagging each input entity with some predefined class; for example, for ordinary grocery receipts, labels can be TOTAL_KEY, SUBTOTAL_KEY, COMPANY_NAME, COMPANY_ADDRESS, DATE, etc. The diagram below describes the general pipeline for this kind of work, which will be walked through step by step in the upcoming sections.
Motivation to use GNN/GCN?
Graphs, too, contain local patterns that need to be recognized. Similarly to the way a CNN scans the input data through a small window, recognizing local relations within that window, a GCN can start by capturing local patterns between neighboring nodes in a graph. GCNs can then greatly exploit hierarchies of such patterns.
Let us try to understand the basic common pipeline for these kinds of projects:
- The input is captured in the form of an image/video, which goes through a number of image preprocessing steps such as cropping the receipt from the image, histogram adjustment, brightness adjustment, etc. OpenCV is the industry standard for these kinds of tasks. You can learn about image segmentation, i.e. cropping a receipt from an image, from this notebook, and learn about a few common preprocessing steps from here.
- Once the image is cropped and preprocessed accordingly, we provide it to an OCR system. You can use Google's cloud APIs, Tesseract, or any OCR system of your choice, depending on your budget, needs, and required system accuracy.
- After OCR, we have a table containing the detected text and its position in the input image. Usually, an OCR system provides the coordinates of the top-left and bottom-right points of each detected text.
- Now comes the interesting part: the outputs of OCR, i.e. the bounding boxes on the receipt, are used to create the input graph that will be consumed by the graph neural network. Each text/bounding box is considered a node; the edge connections can be created in multiple ways. One such technique creates a maximum of four edges for each node, connecting each text area to its closest neighboring text area in each direction (up, down, left and right). You can get some idea of how this can be coded from here.
- The output of OCR is also used to create embeddings. To create word embeddings we can use GloVe, or we can encode text segments using a Transformer to get text embeddings. The embeddings are created for each detected text and stored in a node feature matrix. Using image embeddings is optional, but they have shown significant improvement in models such as PICK, as they can carry useful information like text font, text curvature, etc. Think of it this way: the model can predict that a text is of category STORE_NAME if its font size is large, because store name fonts are usually larger than the other text present on receipts.
- These two types of embeddings are combined to create a new fusion embedding for a better understanding of the data, which is used as the node input for the graph neural network. To better understand the use of embeddings, it is suggested to go through this paper once, along with its implementation.
- For each output text we already have its output class assigned, which will be used for learning. You can search for these kinds of receipt-based datasets; one such dataset is CORD.
- At this point we have our adjacency matrix (A), the feature matrix (x) created using the combination of word and image embeddings for each node, and finally the labels (y). Now we can treat this as a normal machine learning problem where A and x are the independent features and y is the dependent variable that needs to be learned and predicted.
- A, x and y will be used to train a graph-based neural network model that learns to classify each node into the possible classes. The GCN (Graph Convolutional Network) learns to embed each node's feature vector (the combination of the word embedding and the connection structure to other nodes) by generating a vector of real numbers that represents the input node as a point in an N-dimensional space; similar nodes are mapped to nearby points in the embedding space, allowing us to train a model able to classify the nodes. This article covers the theory related to node classification.
- Once the model provides satisfactory results on the test set in terms of accuracy, F1 score, etc., it can be used on real-world data to extract information from hardcopy receipts, i.e. extracting text and predicting its possible category.
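As an illustration of the preprocessing step above, here is a minimal sketch of grayscale conversion, contrast stretching, and binarization using plain NumPy; in practice you would use the corresponding OpenCV calls (noted in the comments), and the tiny 2x2 "image" is made up purely for illustration.

```python
import numpy as np

def to_grayscale(rgb):
    """Luminance-weighted grayscale (cv2.cvtColor uses similar weights)."""
    return rgb @ np.array([0.299, 0.587, 0.114])

def stretch_contrast(gray):
    """Simple min-max contrast stretch; cv2.equalizeHist is the histogram-based analogue."""
    lo, hi = gray.min(), gray.max()
    return (gray - lo) / (hi - lo + 1e-8) * 255.0

def binarize(gray, thresh=128):
    """Global threshold; cv2.threshold / cv2.adaptiveThreshold in practice."""
    return (gray > thresh).astype(np.uint8) * 255

# Toy 2x2 "receipt photo": one dark text pixel among bright paper pixels.
img = np.array([[[10, 10, 10], [200, 200, 200]],
                [[220, 220, 220], [240, 240, 240]]], dtype=float)
binary = binarize(stretch_contrast(to_grayscale(img)))
```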
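For the OCR step, Tesseract can emit its detections as a TSV table whose columns include the box coordinates and the recognized text. A sketch of parsing such a TSV into (text, left, top, right, bottom) records; the sample string below is hand-made for illustration, following Tesseract's column layout.

```python
import csv
import io

def parse_tesseract_tsv(tsv_text, min_conf=0):
    """Parse a Tesseract-style TSV into (text, left, top, right, bottom) records."""
    records = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        text = row["text"].strip()
        if not text or float(row["conf"]) < min_conf:
            continue  # skip empty cells and low-confidence detections
        left, top = int(row["left"]), int(row["top"])
        records.append((text, left, top,
                        left + int(row["width"]), top + int(row["height"])))
    return records

# A two-word sample in Tesseract's TSV layout (header + word rows).
sample = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num"
    "\tleft\ttop\twidth\theight\tconf\ttext\n"
    "5\t1\t1\t1\t1\t1\t10\t20\t80\t15\t96\tTOTAL\n"
    "5\t1\t1\t1\t1\t2\t120\t20\t40\t15\t93\t9.99\n"
)
words = parse_tesseract_tsv(sample)
```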
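The four-direction edge-creation technique described above can be sketched as follows. Boxes are (left, top, right, bottom); the direction of a neighbour is decided here by the dominant axis of the displacement between box centres, which is one possible convention — implementations differ in the details.

```python
import numpy as np

def build_graph(boxes):
    """Connect each box to its nearest neighbour in each of the four
    directions (up, down, left, right): at most four edges per node."""
    centers = [((l + r) / 2, (t + b) / 2) for l, t, r, b in boxes]
    n = len(boxes)
    A = np.zeros((n, n), dtype=int)
    for i, (xi, yi) in enumerate(centers):
        nearest = {}  # direction -> (squared distance, node index)
        for j, (xj, yj) in enumerate(centers):
            if i == j:
                continue
            dx, dy = xj - xi, yj - yi
            # Classify the neighbour by the dominant axis of displacement.
            if abs(dx) >= abs(dy):
                direction = "right" if dx > 0 else "left"
            else:
                direction = "down" if dy > 0 else "up"
            dist = dx * dx + dy * dy
            if direction not in nearest or dist < nearest[direction][0]:
                nearest[direction] = (dist, j)
        for _, j in nearest.values():
            A[i, j] = A[j, i] = 1  # undirected edge
    return A

# Three boxes on one text line, plus one box below the first.
boxes = [(0, 0, 10, 10), (20, 0, 30, 10), (40, 0, 50, 10), (0, 20, 10, 30)]
A = build_graph(boxes)
```

Note that the first box is linked to its right neighbour and to the box below it, but not to the far-right box, since only the closest neighbour per direction gets an edge.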
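A toy sketch of the word-embedding step: averaging GloVe-style vectors over the words of a detected text segment. The 3-dimensional vectors here are made up for illustration (real GloVe vectors are 50 to 300 dimensional), and the zero-vector fallback for out-of-vocabulary words is just one common convention.

```python
import numpy as np

# Made-up GloVe-style lookup table (illustrative values only).
glove = {
    "total": np.array([0.9, 0.1, 0.0]),
    "subtotal": np.array([0.8, 0.2, 0.1]),
    "store": np.array([0.1, 0.9, 0.3]),
}

def embed_text(segment, table, dim=3):
    """Average the word vectors of a text segment; out-of-vocabulary
    words fall back to a zero vector."""
    vectors = [table.get(w.lower(), np.zeros(dim)) for w in segment.split()]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

# One row per detected text: the node feature matrix for the graph.
node_features = np.stack([embed_text(t, glove)
                          for t in ["TOTAL", "SUBTOTAL", "store xyz"]])
```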
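The fusion step can be as simple as concatenating the text and image embeddings along the feature axis; models such as PICK learn the fusion instead, but concatenation is a common baseline. The dimensions and random values below are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
text_emb = rng.random((5, 4))   # 5 nodes, 4-d word embeddings (placeholder)
image_emb = rng.random((5, 2))  # 5 nodes, 2-d visual features, e.g. box
                                # height as a font-size proxy (placeholder)

# Concatenate per node: each fused row carries both textual and visual cues.
fused = np.concatenate([text_emb, image_emb], axis=1)
```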
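Putting A, x and y together: node classification is usually trained with boolean masks over the nodes rather than by splitting the graph itself, so that every node can still see its neighbours during message passing. A sketch with placeholder data and sanity checks that the three pieces line up.

```python
import numpy as np

n_nodes, n_feats, n_classes = 6, 8, 4
A = np.eye(n_nodes, dtype=int)                 # placeholder adjacency (self-loops only)
x = np.random.rand(n_nodes, n_feats)           # fused node features (placeholder)
y = np.random.randint(0, n_classes, n_nodes)   # one label per node (placeholder)

# Sanity checks before training: A is square, and A, x, y agree on node count.
assert A.shape == (n_nodes, n_nodes)
assert x.shape[0] == y.shape[0] == n_nodes

# Boolean masks pick which nodes contribute to the loss vs. evaluation.
train_mask = np.zeros(n_nodes, dtype=bool)
train_mask[: n_nodes // 2] = True
test_mask = ~train_mask
```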
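A single GCN propagation step, following the Kipf and Welling rule H' = ReLU(D^-1/2 (A+I) D^-1/2 H W), fits in a few lines of NumPy; stacking two such layers gives per-node class scores. The graph, features and weights below are random placeholders, for shape illustration only — in practice a library like Spektral handles this.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN step: add self-loops, symmetrically normalize the adjacency,
    aggregate neighbour features, apply the weight matrix and ReLU."""
    A_hat = A + np.eye(A.shape[0])                 # self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))  # D^-1/2 as a vector
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ H @ W, 0.0)         # ReLU activation

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)             # 3-node path graph
H = rng.random((3, 5))                             # 5-d fused node features
W1, W2 = rng.random((5, 8)), rng.random((8, 4))    # hidden dim 8, 4 classes

logits = gcn_layer(A, gcn_layer(A, H, W1), W2)     # two stacked layers
pred = logits.argmax(axis=1)                       # predicted class per node
```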
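For the evaluation step, macro-averaged F1 over the label set can be computed from scratch; the labels and predictions below are made-up examples using the receipt tag names from this article.

```python
def f1_per_class(y_true, y_pred, label):
    """F1 for one label: harmonic mean of precision and recall."""
    tp = sum(t == p == label for t, p in zip(y_true, y_pred))
    fp = sum(p == label and t != label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores over all observed labels."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_per_class(y_true, y_pred, l) for l in labels) / len(labels)

y_true = ["TOTAL_KEY", "DATE", "DATE", "COMPANY_NAME"]
y_pred = ["TOTAL_KEY", "DATE", "COMPANY_NAME", "COMPANY_NAME"]
score = macro_f1(y_true, y_pred)
```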
This is just an overview of how these systems work. I recommend learning more from the references below, and maybe this can be implemented using open-source graph learning libraries like Spektral or any other library of your choice.
- Image segmentation by OpenCV : https://www.kaggle.com/dmitryyemelyanov/receipt-ocr-part-1-image-segmentation-by-opencv
- Pre-Processing in OCR!!! : https://towardsdatascience.com/pre-processing-in-ocr-fc231c6035a7
- Optical Character Recognition : https://en.wikipedia.org/wiki/Optical_character_recognition
- Google Vision API : https://cloud.google.com/vision/docs/ocr
- Tesseract : https://github.com/tesseract-ocr/tesseract
- Efficient, Lexicon-Free OCR using Deep Learning : https://arxiv.org/abs/1906.01969
- Information Extraction from Receipts with Graph Convolutional Networks : https://nanonets.com/blog/information-extraction-graph-convolutional-networks/
- Graph Convolution on Structured Document : https://github.com/dhavalpotdar/Graph-Convolution-on-Structured-Documents/blob/master/grapher.py
- PICK : https://arxiv.org/abs/2004.07464
- PICK-pytorch : https://github.com/wenwenyu/PICK-pytorch
- CORD : https://github.com/clovaai/cord
- Automating Receipt Digitization with OCR and Deep Learning : https://nanonets.com/blog/receipt-ocr/
- Graph Convolution for Multimodal Information Extraction from Visually Rich Documents : https://arxiv.org/abs/1903.11279
- Spektral : https://graphneural.network/
- Understanding GCN for Node Classification : https://towardsdatascience.com/understanding-graph-convolutional-networks-for-node-classification-a2bfdb7aba7b
- Extracting Structured Data from Invoices : https://medium.com/analytics-vidhya/extracting-structured-data-from-invoice-96cf5e548e40