Introduction
Optical Character Recognition (OCR) technology has revolutionized the way we convert different types of documents, such as scanned paper documents, PDFs, or images captured by a digital camera, into editable and searchable data. OCR technology is crucial for automating data entry processes, improving accuracy, and saving time by eliminating the need for manual data extraction. Its applications span across various industries including banking, healthcare, logistics, and government services, making it an essential tool in the digital transformation journey.
In this tutorial, we will focus on a specific use case of OCR technology: driver's license recognition. Recognizing and extracting information from driver's licenses is a common requirement for businesses and organizations that need to verify identity, such as car rental services, financial institutions, and security agencies. Automating this process using OCR can significantly enhance operational efficiency, reduce human error, and streamline customer interactions.
For this tutorial, we will use the API4AI OCR API, a robust and versatile solution that offers high accuracy and performance for general OCR tasks. API4AI was chosen for its ease of use, comprehensive documentation, and competitive pricing. It provides a flexible API that can be integrated into various applications to perform OCR on different types of documents, including driver's licenses. You are, of course, free to use other tools instead, treating this tutorial as inspiration.
One of the key motivations behind using a general OCR API like API4AI, as opposed to specialized solutions designed specifically for driver's license recognition, is cost-effectiveness. Specialized solutions often come with higher costs and less flexibility, which can be a significant burden, especially for small to medium-sized businesses. By leveraging a general OCR API, you can achieve similar results at a fraction of the cost while maintaining the flexibility to adapt the solution to other OCR needs as well.
In the sections that follow, we will guide you through the process of setting up your environment, integrating the API4AI OCR API, and writing the necessary code to recognize and extract information from driver's licenses. Whether you're a developer looking to add OCR capabilities to your application or a business owner seeking to automate identity verification, this step-by-step tutorial will provide you with the knowledge and tools to get started.
Understanding OCR and Its Applications
Definition of Optical Character Recognition (OCR)
Optical Character Recognition (OCR) is a technology that enables the conversion of different types of documents, such as scanned paper documents, PDFs, or images, into editable and searchable text data. OCR algorithms analyze the visual patterns of characters within these documents and translate them into machine-readable text, allowing computers to understand and process the content. OCR has become an indispensable tool in the digitization of information, enabling automation and streamlining workflows across various industries.
Common Applications of OCR Technology
OCR technology finds applications in a wide range of industries and scenarios, including:
Document Digitization: Converting physical documents into digital formats for storage, retrieval, and sharing.
Data Entry Automation: Automating data entry tasks by extracting text from documents and entering it into databases or other systems.
Text Recognition in Images: Recognizing text within images captured by digital cameras or smartphones, such as signs, labels, or handwritten notes.
Translation Services: Enabling the translation of printed or handwritten text from one language to another.
Accessibility: Making printed materials accessible to visually impaired individuals by converting them into text-to-speech or braille formats.
Specific Use Cases in Driver's License Recognition
Driver's license recognition is a specialized application of OCR technology that involves extracting information from driver's licenses, such as the name of the license holder, license number, date of birth, and address. This information is commonly required for identity verification purposes in various industries, including:
Car Rental Services: Verifying the identity of customers before renting vehicles to ensure compliance with age restrictions and driver eligibility.
Financial Institutions: Authenticating customer identities for account opening, loan applications, or financial transactions.
Government Agencies: Processing driver's license renewals, registrations, and other administrative tasks efficiently.
Security and Access Control: Granting access to restricted areas or sensitive information based on verified identities.
Importance of Choosing the Right OCR API for the Task
When it comes to driver's license recognition and other OCR tasks, choosing the right OCR API is crucial for achieving accurate and reliable results. Factors to consider when selecting an OCR API include:
Accuracy: The ability of the OCR engine to accurately recognize text, even in challenging conditions such as low-quality images or distorted text.
Speed: The processing speed of the OCR API, especially when dealing with large volumes of documents or real-time applications.
Ease of Integration: The simplicity and flexibility of integrating the OCR API into existing applications or workflows.
Language Support: The support for multiple languages and character sets, especially for applications in multilingual environments.
Cost: The pricing structure of the OCR API, including any usage-based fees or subscription plans, and its affordability for the intended use case.
By carefully evaluating these factors and choosing a reliable OCR API like API4AI, you can ensure the success of your driver's license recognition project and maximize its benefits in terms of efficiency, accuracy, and cost-effectiveness.
Why Not Use Specialized Solutions for Driver's License Recognition?
Overview of Specialized Solutions for Driver's License Recognition
Specialized solutions for driver's license recognition are designed specifically to extract and verify information from driver's licenses. These solutions often come with pre-built templates and algorithms tailored for different license formats, making them seemingly convenient for businesses that require high accuracy and quick deployment. These solutions typically offer features such as automatic format detection, advanced data extraction, and integration with identity verification services.
Discussion on the High Costs Associated with Specialized Solutions
While specialized solutions offer convenience and high accuracy, they come with significant drawbacks, primarily in terms of cost. These solutions often involve:
High Licensing Fees: Specialized software typically comes with high upfront licensing costs or subscription fees that can be prohibitively expensive for small to medium-sized businesses.
Per-Transaction Costs: Many specialized solutions charge based on the number of transactions or scans, leading to escalating costs as the volume of processed licenses increases.
Maintenance and Support Fees: Ongoing costs for software maintenance, updates, and support can add up, further increasing the total cost of ownership.
Vendor Lock-In: Businesses may become dependent on a single vendor, limiting their flexibility to switch to alternative solutions without incurring additional costs or undergoing significant disruptions.
Benefits of Building a Solution on Top of General OCR APIs
Using a general OCR API, such as API4AI, for driver's license recognition offers several advantages over specialized solutions:
Cost-Effectiveness: General OCR APIs typically have lower upfront costs and more flexible pricing models, including pay-as-you-go options. This makes them more affordable, especially for businesses with varying processing volumes.
Flexibility and Customization: General OCR APIs provide the flexibility to adapt and customize the OCR process to specific needs. Developers can fine-tune the data extraction process, implement custom validation rules, and integrate with other systems without being constrained by the limitations of a specialized solution.
Scalability: General OCR APIs are designed to handle a wide range of document types and can scale easily with the growth of the business. As the volume of processed licenses increases, the solution can be scaled up without significant changes to the underlying infrastructure.
By leveraging the power of general OCR APIs, organizations can achieve significant cost savings, improve efficiency, and retain the flexibility to adapt their solutions as their needs evolve. This makes general OCR solutions an effective choice for real-world applications, reinforcing the case for their use in driver's license recognition tasks.
Writing Code to Recognize Driver's Licenses with API4AI OCR
Assumptions
In this tutorial, we will explore the application of the API4AI OCR API to recognize key information from a driver’s license. Leveraging OCR technology, we can automate the extraction of this data, making processes more efficient and reducing the potential for human error. To keep the tutorial focused and manageable, we will use a sample driver’s license from Washington, D.C. and will work with the ID and the name of the license holder. This will help us demonstrate the process clearly and effectively. However, the principles and methods we discuss can be applied to driver’s licenses from any US state. By the end of this tutorial, you should have a solid understanding of how to integrate and utilize the API4AI OCR API for driver's license recognition in your own projects.
Additionally, for our demonstration, we will use the demo API endpoint provided by API4AI, which offers a limited number of queries. This will be quite sufficient for our experimental purposes, allowing us to illustrate the capabilities of the OCR API without any cost. If you wish to implement a full-featured solution in a production environment, please refer to the API4AI documentation page for detailed instructions on obtaining an API key and understanding the full range of features available.
For testing and development, we will use the picture below.
Understanding API4AI OCR API
The OCR API can be used in two modes: “simple-text” (the default) and “simple-words”. The first mode produces text with recognized phrases separated by line breaks, along with their positions. That is not what we need here: we want the location of each individual word so that we have spatial information to work with. But first, we need to understand how the API works. As they say, one code example is worth 1024 words.
import math
import sys

import cv2
import requests

API_URL = 'https://demo.api4ai.cloud/ocr/v1/results?algo=simple-words'

# Get the image path from the 1st command-line argument.
image_path = sys.argv[1]

# Use the HTTP API to get recognized words from the specified image.
with open(image_path, 'rb') as f:
    response = requests.post(API_URL, files={'image': f})
json_obj = response.json()

for elem in json_obj['results'][0]['entities'][0]['objects']:
    box = elem['box']  # normalized x, y, width, height
    text = elem['entities'][0]['text']  # recognized text
    print(  # show every word with its bounding box
        f'[{box[0]:.4f}, {box[1]:.4f}, {box[2]:.4f}, {box[3]:.4f}], {text}'
    )
In this short program, we access the API by sending a picture in a POST request; the path to the picture is passed as the first command-line argument. The program simply displays the normalized values of the top-left coordinate, width, and height of the area containing each recognized word, along with the word itself. Here is an output fragment for the above picture:
...
[0.6279, 0.6925, 0.0206, 0.0200], All
[0.6529, 0.6800, 0.1118, 0.0300], 02/21/1984
[0.6162, 0.7175, 0.0309, 0.0200], BEURT
[0.6515, 0.7350, 0.0441, 0.0175], 4a.ISS
[0.6515, 0.7675, 0.1132, 0.0250], 02/17/2010
[0.7662, 0.1725, 0.0647, 0.1125], tomand
[0.6529, 0.8550, 0.0324, 0.0275], ♥♥
[0.6941, 0.8550, 0.0809, 0.0275], DONOR
[0.6529, 0.8950, 0.1074, 0.0300], VETERAN
[0.9000, 0.0125, 0.0691, 0.0375], USA
Let’s try to apply the obtained data to an image by drawing bounding boxes using OpenCV. To do this, we need to convert the normalized values into absolute values expressed in integer pixels. We need the exact coordinate values of the upper left corner and the lower right corner so that we can use them to draw the bounding box. To achieve this, let’s create the get_corner_coords function.
def get_corner_coords(height, width, box):
    x1 = int(box[0] * width)
    y1 = int(box[1] * height)
    obj_width = box[2] * width
    obj_height = box[3] * height
    x2 = int(x1 + obj_width)
    y2 = int(y1 + obj_height)
    return x1, y1, x2, y2
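As a quick sanity check, here is how the conversion works out for the “A9999999” box from the output above, assuming a hypothetical image of 1000×600 pixels (the arithmetic is inlined here so the snippet stands on its own):

```python
height, width = 600, 1000  # hypothetical image size (pixels)
box = [0.3059, 0.2325, 0.1059, 0.0275]  # the 'A9999999' box from the output above

x1 = int(box[0] * width)        # 0.3059 * 1000 -> 305
y1 = int(box[1] * height)       # 0.2325 * 600  -> 139
x2 = int(x1 + box[2] * width)   # 305 + 105.9   -> 410
y2 = int(y1 + box[3] * height)  # 139 + 16.5    -> 155
print(x1, y1, x2, y2)  # → 305 139 410 155
```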
The function for drawing the bounding box will be very simple:
def draw_bounding_box(image, box):
    x1, y1, x2, y2 = get_corner_coords(image.shape[0], image.shape[1], box)
    cv2.rectangle(image, (x1 - 2, y1 - 2), (x2 + 2, y2 + 2), (127, 0, 0), 2)
In this function, we widened the frame by two pixels on each side so that it does not press too closely against the words. The color (127, 0, 0) is navy blue specified in BGR format, and the frame is drawn two pixels thick.
Of course, to work with an image, it must first be read. Let’s modify the last part of our script: read the image, remove the debug output with information about frames, draw each bounding box on the read image, and then save the modified image to the file “output.png”.
image = cv2.imread(image_path)
for elem in json_obj['results'][0]['entities'][0]['objects']:
    box = elem['box']  # normalized x, y, width, height
    draw_bounding_box(image, box)  # add boundaries to image
cv2.imwrite('output.png', image)
And what do we have now:
Extracting the ID and Name of the License Holder
Earlier, we managed to call the API and extract text information from the picture of the driver’s license. That’s great! But how do we get to the license number and the name?
These are the elements we have in the area we are interested in:
[0.3059, 0.1975, 0.0500, 0.0175], 4d.DLN
[0.3059, 0.2325, 0.1059, 0.0275], A9999999
[0.3074, 0.2800, 0.0603, 0.0200], 1.FAMILY
[0.3735, 0.2800, 0.0412, 0.0175], NAME
[0.3059, 0.3150, 0.0794, 0.0300], JONES
[0.3059, 0.3675, 0.0574, 0.0225], 2.GIVEN
[0.3691, 0.3675, 0.0529, 0.0225], NAMES
[0.3074, 0.4025, 0.1191, 0.0275], ANGELINA
[0.3074, 0.4375, 0.1191, 0.0300], GABRIELA
The POST request happened to return the results in reading order, but that order is not guaranteed, so we can’t rely on it. It is safer to assume that the recognized elements always arrive in an arbitrary order.
Let’s create a list named words, so that we can easily search for words and their positions:
words = []
for elem in json_obj['results'][0]['entities'][0]['objects']:
    box = elem['box']
    text = elem['entities'][0]['text']
    words.append({'box': box, 'text': text})
Let’s call “4d.DLN,” “1.FAMILY,” and “2.GIVEN” the field names, and what’s below them in the picture the field values. The easiest approach is to search for the closest element lying below each field name. A value may sit slightly to the right or left of its label, so we should compare the Euclidean distance between text elements rather than their positions along a single axis. Let’s write some code.
First, let’s find the positions of the field names:
ID_MARK = '4d.DLN'
FAMILY_MARK = '1.FAMILY'
NAME_MARK = '2.GIVEN'

id_mark_info = {}
fam_mark_info = {}
name_mark_info = {}
for elem in words:
    if elem['text'] == ID_MARK:
        id_mark_info = elem
    elif elem['text'] == FAMILY_MARK:
        fam_mark_info = elem
    elif elem['text'] == NAME_MARK:
        name_mark_info = elem
Next, we will write a function that finds the nearest elements below the given reference element:
def find_label_below(word_info):
    x = word_info['box'][0]
    y = word_info['box'][1]
    candidate = words[0]
    candidate_dist = math.inf
    for elem in words:
        if elem['text'] == word_info['text']:
            continue  # skip the reference element itself
        curr_box_x = elem['box'][0]
        curr_box_y = elem['box'][1]
        curr_vert_dist = curr_box_y - y
        curr_horiz_dist = x - curr_box_x
        if curr_vert_dist > 0:  # we are only looking for items below
            dist = math.hypot(curr_vert_dist, curr_horiz_dist)
            if dist > candidate_dist:
                continue
            candidate_dist = dist
            candidate = elem
    return candidate
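Since the behavior of this nearest-below search is easy to get wrong, here is a quick sanity check on a tiny synthetic words list (the function body is repeated, slightly condensed, so the snippet runs on its own; the coordinates are made up for illustration):

```python
import math

# Synthetic, made-up word positions: a label and two candidates below it.
words = [
    {'box': [0.30, 0.20, 0.05, 0.02], 'text': '4d.DLN'},
    {'box': [0.30, 0.23, 0.10, 0.03], 'text': 'A9999999'},  # directly below the label
    {'box': [0.60, 0.23, 0.10, 0.03], 'text': 'CLASS'},     # below, but far to the right
]

def find_label_below(word_info):
    x = word_info['box'][0]
    y = word_info['box'][1]
    candidate = words[0]
    candidate_dist = math.inf
    for elem in words:
        if elem['text'] == word_info['text']:
            continue  # skip the reference element itself
        curr_vert_dist = elem['box'][1] - y
        curr_horiz_dist = x - elem['box'][0]
        if curr_vert_dist > 0:  # only consider items below the reference element
            dist = math.hypot(curr_vert_dist, curr_horiz_dist)
            if dist < candidate_dist:
                candidate_dist = dist
                candidate = elem
    return candidate

print(find_label_below(words[0])['text'])  # → A9999999
```

The element directly below wins because its Euclidean distance (0.03) is much smaller than that of the equally low but horizontally distant one (about 0.30).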
Let’s try to apply this function and draw the boundaries of the found elements:
id_info = find_label_below(id_mark_info)
fam_info = find_label_below(fam_mark_info)
name_info = find_label_below(name_mark_info)
name2_info = find_label_below(name_info)
canvas = image.copy()
draw_bounding_box(canvas, id_info['box'])
draw_bounding_box(canvas, fam_info['box'])
draw_bounding_box(canvas, name_info['box'])
draw_bounding_box(canvas, name2_info['box'])
cv2.imwrite('result.png', canvas)
Let's take a look at what we have accomplished so far:
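Finally, with the fields located, printing the extracted values takes one line per field. Here is a minimal sketch, assuming the lookups above succeeded (the dictionary values below are hypothetical stand-ins for the real id_info, fam_info, name_info, and name2_info results):

```python
# Hypothetical lookup results matching the specimen license above.
id_info = {'text': 'A9999999'}
fam_info = {'text': 'JONES'}
name_info = {'text': 'ANGELINA'}
name2_info = {'text': 'GABRIELA'}

print(f"DLN:         {id_info['text']}")
print(f"Family name: {fam_info['text']}")
print(f"Given names: {name_info['text']} {name2_info['text']}")
```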