OCR in the wild
DeepDetect Server and Platform come with everything needed to set up your own OCR solution. This application page describes how to get it running in minutes.
An example of the final output:
Results on Paris street sign.
This OCR solution works in two main steps, sketched in code right after this list:
- Text detection from images uses a word detection deep learning model that outputs bounding boxes around text
- OCR uses a multi-word deep learning model that takes the relevant image crops from the previous step as input and predicts a text string
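A minimal sketch of the flow, where detect_words, crop, and ocr_word are hypothetical placeholders for the DeepDetect calls detailed on this page:

def ocr_image(image):
    # Step 1: bounding boxes around text (word detection model).
    boxes = detect_words(image)
    # Crop the image around each detected word.
    crops = [crop(image, box) for box in boxes]
    # Step 2: read each crop (OCR model), keeping the box it came from.
    return [(box, ocr_word(c)) for box, c in zip(boxes, crops)]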
This setup is similar to Facebook's OCR system, called Rosetta. We borrow the image below from them; it summarizes the process:

A two-model architecture: 1/ detect words; 2/ OCR on word image crops. [Image borrowed from Rosetta], with the difference that DeepDetect uses a Single Shot Detector (SSD) and image crops instead of feature maps.
Setup
We start by setting up the DeepDetect Server. We assume a GPU setup with Docker, which can be adapted as needed.
Then we set up the required models.
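If you script the setup, service creation with dd_client (DeepDetect's Python client) may look like the sketch below. The repository path, input size, and class count are illustrative assumptions; use the values from each model's installation instructions:

from dd_client import DD

dd = DD("localhost", 8080)
dd.set_return_format(dd.RETURN_PYTHON)

# Assumed values: adjust the repository path and input dimensions to match
# the word detection model you downloaded.
model = {"repository": "/opt/models/word_detect"}
parameters_input = {"connector": "image", "width": 512, "height": 512}
parameters_mllib = {"nclasses": 2, "gpu": True}
parameters_output = {}
dd.put_service(
    "word_detect",
    model,
    "word detection model",
    "caffe",
    parameters_input,
    parameters_mllib,
    parameters_output,
)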
Word detection model
We use the DeepDetect word detection deep learning model. Follow the installation instructions, then test with the call below:
curl -X POST 'http://localhost:8080/predict' -d '{
  "data": [
    "https://deepdetect.com/img/apps/short-paris-street-signs.jpg"
  ],
  "parameters": {
    "input": {},
    "mllib": {
      "gpu": true
    },
    "output": {
      "bbox": true,
      "confidence_threshold": 0.2
    }
  },
  "service": "word_detect"
}'
or equivalently in Python using dd_client.py:
from math import ceil

from dd_client import DD
from IPython.display import Image
import cv2
import numpy as np

filepath = "https://deepdetect.com/img/apps/short-paris-street-signs.jpg"
bbox_pad = 0.015  # padding ratio applied around each box before cropping

# Fetch and decode the image (embed=True makes IPython download the URL so
# that img.data holds the raw bytes).
img = Image(filepath, embed=True)
arr = np.asarray(bytearray(img.data), dtype=np.uint8)
img_arr = cv2.imdecode(arr, cv2.IMREAD_COLOR)
ysize, xsize, _ = img_arr.shape

dd = DD("localhost", 8080)
dd.set_return_format(dd.RETURN_PYTHON)

detection = dd.post_predict(
    "word_detect",
    data=[filepath],
    parameters_input={},
    parameters_mllib={},
    parameters_output={"confidence_threshold": 0.2, "bbox": True},
)

# Optional: collect the padded bounding boxes to easily draw them later on
list_bbox = detection["body"]["predictions"][0]["classes"]
bbox_list = []
for elt in list_bbox:
    xmin = int(elt["bbox"]["xmin"])
    xmax = int(ceil(elt["bbox"]["xmax"]))
    # DeepDetect reports ymax as the top edge in image coordinates
    # (see the JSON below), hence the swap.
    ymin = int(elt["bbox"]["ymax"])
    ymax = int(ceil(elt["bbox"]["ymin"]))
    deltax = int(bbox_pad * (xmax - xmin))
    deltay = int(bbox_pad * (ymax - ymin))
    xmin, xmax = max(0, xmin - deltax), min(xsize, xmax + deltax)
    ymin, ymax = max(0, ymin - deltay), min(ysize, ymax + deltay)
    bbox_list.append((xmin, xmax, ymin, ymax))
On the shell, the curl call yields results that look like:
{
  "body": {
    "predictions": [
      {
        "classes": [
          {
            "bbox": {
              "xmax": 304.4400939941406,
              "xmin": 160.37918090820312,
              "ymax": 107.0147705078125,
              "ymin": 130.6460723876953
            },
            "cat": "1",
            "prob": 0.9976158142089844
          },
          ...
          {
            "bbox": {
              "xmax": 315.3223571777344,
              "xmin": 296.52557373046875,
              "ymax": 93.34095764160156,
              "ymin": 107.23853302001953
            },
            "cat": "1",
            "last": true,
            "prob": 0.49438440799713135
          }
        ],
        "uri": "https://deepdetect.com/img/apps/short-paris-street-signs.jpg"
      }
    ]
  },
  "head": {
    "method": "/predict",
    "service": "word_detect",
    "time": 8095.0
  },
  "status": {
    "code": 200,
    "msg": "OK"
  }
}
And visualizing the boxes with a short Python script:
import matplotlib.pyplot as plt
from matplotlib import patches

fig, ax = plt.subplots(figsize=(20, 15))
ax.imshow(img_arr[:, :, ::-1])  # OpenCV is BGR; reverse channels for display
for xmin, xmax, ymin, ymax in bbox_list:
    rec = patches.Rectangle(
        (xmin, ymin),
        xmax - xmin,
        ymax - ymin,
        linewidth=2,
        edgecolor="red",
        facecolor="none",
    )
    ax.add_patch(rec)
ax.set_xticks([])
ax.set_yticks([])

Text detection on Paris street sign.
OCR model
We use the DeepDetect multi-word deep learning model. Follow the installation instructions, then test with the call below:
curl -X POST 'http://localhost:8080/predict' -d '{
  "data": [
    "https://deepdetect.com/img/apps/ocr_1.png"
  ],
  "parameters": {
    "input": {},
    "mllib": {
      "gpu": true
    },
    "output": {
      "blank_label": 0,
      "confidence_threshold": 0,
      "ctc": true
    }
  },
  "service": "word_ocr"
}'
That calls OCR on the crop below:

{
  "body": {
    "predictions": [
      {
        "classes": [
          {
            "cat": "cambon",
            "last": true,
            "prob": 1.0
          }
        ],
        "uri": "https://deepdetect.com/img/apps/ocr_1.png"
      }
    ]
  },
  "head": {
    "method": "/predict",
    "service": "word_ocr",
    "time": 429.0
  },
  "status": {
    "code": 200,
    "msg": "OK"
  }
}
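or equivalently with dd_client, reusing the dd object from the detection script:

ocr_test = dd.post_predict(
    "word_ocr",
    data=["https://deepdetect.com/img/apps/ocr_1.png"],
    parameters_input={},
    parameters_mllib={},
    parameters_output={"ctc": True, "blank_label": 0, "confidence_threshold": 0},
)
# Top prediction for the crop, e.g. "cambon".
print(ocr_test["body"]["predictions"][0]["classes"][0]["cat"])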
Detect text and read words
Now let's put it all together efficiently: the bounding box crops are turned into base64-encoded in-memory images and passed to the OCR model in batches. This is much faster than processing boxes one at a time, especially on GPU.
The steps are as follows:
- Call the word detection model
- Crop the boxes around the detected text
- Encode the crops as base64
- Pass the crops to the OCR model
Python script for the base64 encoding and OCR processing:
from math import ceil
import base64

list_bbox = detection["body"]["predictions"][0]["classes"]
bbox_list = []
list_base64 = []
for elt in list_bbox:
    xmin = int(elt["bbox"]["xmin"])
    xmax = int(ceil(elt["bbox"]["xmax"]))
    # Same coordinate swap as before: the reported ymax is the top edge.
    ymin = int(elt["bbox"]["ymax"])
    ymax = int(ceil(elt["bbox"]["ymin"]))
    bbox_list.append((xmin, xmax, ymin, ymax))
    # Encode each crop as an in-memory PNG, then base64 (the padding from
    # the earlier snippet could be applied here as well).
    status, array = cv2.imencode(".png", img_arr[ymin:ymax, xmin:xmax])
    list_base64.append(base64.b64encode(array.tobytes()).decode())

ocr = dd.post_predict(
    "word_ocr",
    data=list_base64,
    parameters_input={},
    parameters_mllib={},
    parameters_output={"ctc": True, "blank_label": 0},
)
Results can be visualized with:
import matplotlib.pyplot as plt
from matplotlib import patches

fig, ax = plt.subplots(figsize=(20, 15))
ax.imshow(img_arr[:, :, ::-1])  # OpenCV is BGR; reverse channels for display
# Sorting by int(uri) realigns the OCR predictions with bbox_list.
for (xmin, xmax, ymin, ymax), prediction in zip(
    bbox_list,
    sorted(ocr["body"]["predictions"], key=lambda elt: int(elt["uri"])),
):
    rec = patches.Rectangle(
        (xmin, ymin),
        xmax - xmin,
        ymax - ymin,
        linewidth=2,
        edgecolor="red",
        facecolor="none",
    )
    ax.add_patch(rec)
    ax.text(
        xmin,
        ymin,
        prediction["classes"][0]["cat"],
        color="red",
        bbox=dict(facecolor="white", edgecolor="red", boxstyle="round,pad=.5"),
    )
ax.set_xticks([])
ax.set_yticks([])

Results on Paris street sign.
As a result, for every image, there’s now:
- Bounding box coordinates around each piece of text
- A text string for each of the boxes
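For downstream use, the two outputs are straightforward to merge into a single structure, for instance:

# Pair each box with its recognized string and confidence.
words = []
for (xmin, xmax, ymin, ymax), prediction in zip(
    bbox_list,
    sorted(ocr["body"]["predictions"], key=lambda elt: int(elt["uri"])),
):
    top = prediction["classes"][0]
    words.append(
        {"bbox": [xmin, ymin, xmax, ymax], "text": top["cat"], "prob": top["prob"]}
    )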
Conclusion
DeepDetect Server and Platform come with built-in models that make it easy to test and build a preliminary OCR solution. By gathering training sets for your own tasks, you can then improve on the solution above to meet your own needs and requirements.