Vision in iOS: Text detection and Tesseract recognition
Let me tell you a story. Two weeks ago I joined the Stupid Hackathon in Oslo, where people come up with stupid ideas and hack on them together. Since I had just watched the Big big numbers counting clip by Donald Trump, I thought it might be a good/stupid idea to make a fun iOS app that can recognise a number and tell whether it is big enough, all in Trump's voice.
Previously I would probably have needed a library like OpenCV to solve this text tracking challenge. Now, with the introduction of Vision in iOS 11, I have everything I need, so the implementation didn't take long; it was like playing with Lego.
In this guide, I will show you the technical details of working with Vision in iOS, as well as what I learned along the way.
Here is the final project on GitHub: BigBigNumbers. You can use it for reference while reading this guide. The project uses Swift 4.1 and iOS 11. It uses ViewController containment and multiple service classes to break down responsibilities, so it is easy to follow along.
Ah, and OCR stands for Optical Character Recognition, which is the process of converting images into readable text. We will use this abbreviation along the way. Now let's get to the code!
Camera session
First we need to set up a camera session, since we need to capture frames for text recognition. The camera logic and its preview layer are encapsulated in a custom view controller, CameraController.
Here we set up a default capture device for the back camera. Remember to set videoGravity to resizeAspectFill to get a full-screen preview layer. To get the captured buffer from the camera, our view controller needs to conform to AVCaptureVideoDataOutputSampleBufferDelegate. Every captured frame reports its buffer through the delegate method func captureOutput(_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection).
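The full controller lives in the repo; a minimal sketch of the idea (the bufferHandler callback is my own naming) could look like this:

```swift
import AVFoundation
import UIKit

final class CameraController: UIViewController {
  // Hypothetical hook so another object can consume the captured frames.
  var bufferHandler: ((CMSampleBuffer) -> Void)?

  private let session = AVCaptureSession()
  private lazy var previewLayer: AVCaptureVideoPreviewLayer = {
    let layer = AVCaptureVideoPreviewLayer(session: session)
    // Fill the whole screen with the camera preview.
    layer.videoGravity = .resizeAspectFill
    return layer
  }()

  override func viewDidLoad() {
    super.viewDidLoad()
    view.layer.addSublayer(previewLayer)

    // Default capture device for the back camera.
    guard
      let device = AVCaptureDevice.default(.builtInWideAngleCamera,
                                           for: .video,
                                           position: .back),
      let input = try? AVCaptureDeviceInput(device: device),
      session.canAddInput(input)
    else { return }
    session.addInput(input)

    // Video data output so we receive a CMSampleBuffer per frame.
    let output = AVCaptureVideoDataOutput()
    output.setSampleBufferDelegate(self, queue: DispatchQueue(label: "camera.buffer"))
    if session.canAddOutput(output) {
      session.addOutput(output)
    }

    session.startRunning()
  }

  override func viewDidLayoutSubviews() {
    super.viewDidLayoutSubviews()
    previewLayer.frame = view.bounds
  }
}

extension CameraController: AVCaptureVideoDataOutputSampleBufferDelegate {
  func captureOutput(_ output: AVCaptureOutput,
                     didOutput sampleBuffer: CMSampleBuffer,
                     from connection: AVCaptureConnection) {
    // Forward every captured frame; VisionService picks it up from here.
    bufferHandler?(sampleBuffer)
  }
}
```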
Now that we have a CMSampleBuffer, let's work with Vision in our VisionService class.
Vision
Vision was introduced at WWDC 2017 together with Core ML. It provides easy-to-use computer vision APIs with many interesting features like face detection, facial landmarks, object tracking and text tracking. Looking at the documentation, there is VNDetectTextRectanglesRequest, which is a perfect fit for our task in this guide.
An image analysis request that finds regions of visible text in an image
We need to make a request to Vision to get the detected rectangles within the captured frame. VNImageRequestHandler accepts a CVPixelBuffer, a CGImage or image Data; here we convert from CMSampleBuffer to CGImage via CVImageBuffer. Vision exposes very high-level APIs, so working with it is as easy as performing our VNDetectTextRectanglesRequest with a VNImageRequestHandler.
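Here is a sketch of what VisionService can look like; the handle method and the resultHandler callback are names I am assuming, and the orientation parameter is explained right after:

```swift
import UIKit
import Vision

final class VisionService {
  // Hypothetical callback that hands the observations to whoever draws them.
  var resultHandler: (([VNTextObservation]) -> Void)?

  func handle(cgImage: CGImage, orientation: CGImagePropertyOrientation) {
    // The request that finds regions of visible text in the image.
    let request = VNDetectTextRectanglesRequest { [weak self] request, error in
      guard error == nil,
        let observations = request.results as? [VNTextObservation]
      else { return }

      self?.resultHandler?(observations)
    }

    // We mostly need the whole box, but character boxes are available too.
    request.reportCharacterBoxes = true

    // Still-image requests go through VNImageRequestHandler.
    let handler = VNImageRequestHandler(cgImage: cgImage,
                                        orientation: orientation,
                                        options: [:])
    do {
      try handler.perform([request])
    } catch {
      print(error)
    }
  }
}
```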
Vision handles still image-based requests using a VNImageRequestHandler and assumes that images are oriented upright, so pass your image with orientation in mind. CGImage, CIImage, and CVPixelBuffer objects don't carry orientation, so provide it as part of the initializer.
We need to convert from UIImageOrientation to CGImagePropertyOrientation for Vision to work properly; Apple's sample code shows how.
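It boils down to a one-to-one switch over the orientation cases:

```swift
import ImageIO
import UIKit

extension CGImagePropertyOrientation {
  // Map each UIImageOrientation case onto its CGImagePropertyOrientation twin.
  init(_ orientation: UIImageOrientation) {
    switch orientation {
    case .up: self = .up
    case .upMirrored: self = .upMirrored
    case .down: self = .down
    case .downMirrored: self = .downMirrored
    case .left: self = .left
    case .leftMirrored: self = .leftMirrored
    case .right: self = .right
    case .rightMirrored: self = .rightMirrored
    }
  }
}
```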
The result should be an array of VNTextObservation, each of which contains region information about where the text is located within the image. For this demo, I only keep results with a high enough confidence.
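Inside the request's completion handler, that filtering is just a compactMap plus a filter; the 0.3 threshold is simply a value that worked for this demo:

```swift
// Keep only the observations Vision is reasonably confident about.
let observations = (request.results ?? [])
  .compactMap { $0 as? VNTextObservation }
  .filter { $0.confidence > 0.3 }
```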
What you get is what you should see. Let's draw the regions in BoxService, on the main queue. Vision calls its completion handler on a background thread, so always dispatch UI work like this path drawing back to the main thread: access to UIKit and AppKit resources must be serialized, and changes that affect the app's immediate appearance belong on the main thread.
Drawing the detected text region
We could draw using drawRect, but a CALayer with a custom border is easier.
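A sketch of the drawing in BoxService, assuming it receives rects that have already been converted into the view's coordinate space (more on that conversion below):

```swift
import UIKit

final class BoxService {
  private var layers: [CALayer] = []

  // Draw one bordered layer per detected text region.
  // `rects` are assumed to already be in the coordinate space of `view`.
  func drawBoxes(rects: [CGRect], in view: UIView) {
    // Remove the boxes from the previous frame.
    layers.forEach { $0.removeFromSuperlayer() }
    layers.removeAll()

    for rect in rects {
      let layer = CALayer()
      layer.frame = rect
      layer.borderColor = UIColor.green.cgColor
      layer.borderWidth = 2
      view.layer.addSublayer(layer)
      layers.append(layer)
    }
  }
}
```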
Keep in mind that VNTextObservation has an array of characterBoxes of type VNRectangleObservation. These contain bounding box information for the individual characters found within the observation's boundingBox.
That is useful for fine-grained control, but in our app we just need the whole bounding box. Since VNTextObservation subclasses VNDetectedObjectObservation, we have access to the whole boundingBox.
The coordinates are normalized to the dimensions of the processed image, with the origin at the image’s lower-left corner.
Now we can use layerRectConverted from AVCaptureVideoPreviewLayer to convert the boundingBox to a view rect. There may be more advanced calculations needed to make the rectangle show up exactly in place, but for now this simple conversion works.
Converts a rectangle in the coordinate system used for metadata outputs to one in the preview layer’s coordinate system.
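A sketch of that conversion, assuming we can reach the controller's AVCaptureVideoPreviewLayer; the y axis is flipped because Vision's boundingBox uses a lower-left origin while metadata output rects use a top-left one:

```swift
import AVFoundation
import UIKit

// Convert a normalized Vision boundingBox into the preview layer's
// coordinate space so the box can be drawn on screen.
func viewRect(for boundingBox: CGRect,
              in previewLayer: AVCaptureVideoPreviewLayer) -> CGRect {
  // Flip the y axis: Vision's origin is the lower-left corner,
  // metadata output rects have their origin at the top-left.
  let flipped = CGRect(x: boundingBox.origin.x,
                       y: 1 - boundingBox.origin.y - boundingBox.height,
                       width: boundingBox.width,
                       height: boundingBox.height)
  return previewLayer.layerRectConverted(fromMetadataOutputRect: flipped)
}
```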
If you simply want to draw the rectangle onto the captured image, you can follow Apple's sample, which uses a boundingBox helper function.
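That helper scales the normalized rect into the image's pixel space and flips the y axis; a sketch of the idea (the body is my approximation, not the sample verbatim) looks like this:

```swift
import UIKit

// Scale a normalized Vision rect (lower-left origin) into the pixel space
// of an image (upper-left origin).
func boundingBox(forRegionOfInterest region: CGRect,
                 withinImageBounds bounds: CGRect) -> CGRect {
  let imageWidth = bounds.width
  let imageHeight = bounds.height

  var rect = region
  // Scale the normalized size up to pixel size.
  rect.size.width *= imageWidth
  rect.size.height *= imageHeight
  // Reposition the origin, flipping the y axis on the way.
  rect.origin.x = bounds.origin.x + region.origin.x * imageWidth
  rect.origin.y = bounds.origin.y + (1 - region.origin.y - region.height) * imageHeight

  return rect
}
```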
Cropping the detected text region
Still within BoxService, we should crop the image to the detected rectangle for OCR (Optical Character Recognition). We compute the rect in the coordinates of the captured image and apply a small inset to take a slightly bigger crop, to accommodate the top and bottom edges. The code is adapted from Convert Vision boundingBox from VNFaceObservation to rect to draw on image.
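A sketch of the cropping, assuming we already turned the boundingBox into a rect in image coordinates; the 10-point inset is just what I eyeballed for the demo:

```swift
import UIKit

// Crop the captured image to the detected text region, with a little
// extra padding so the top and bottom of the glyphs are not cut off.
func crop(image: UIImage, to regionInImage: CGRect) -> UIImage? {
  guard let cgImage = image.cgImage else { return nil }

  // Grow the rect slightly and clamp it to the image bounds.
  let padded = regionInImage.insetBy(dx: -10, dy: -10)
    .intersection(CGRect(x: 0, y: 0,
                         width: CGFloat(cgImage.width),
                         height: CGFloat(cgImage.height)))

  guard let cropped = cgImage.cropping(to: padded) else { return nil }
  return UIImage(cgImage: cropped)
}
```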
The croppedImage should now contain the text; you can use Quick Look in Xcode to check.
Now that we have an image that's ready for text recognition, let's work with OCRService.
Text recognition
I personally like pure Swift solutions, so SwiftOCR looked like a perfect choice; it is said to perform better than Tesseract, so I gave it a try. The API couldn't be simpler.
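Going by the README, recognition is a single call with a completion handler; croppedImage is the UIImage we produced above:

```swift
import SwiftOCR

let swiftOCR = SwiftOCR()

// SwiftOCR calls back asynchronously with its best guess for the text.
swiftOCR.recognize(croppedImage) { recognizedString in
  print(recognizedString)
}
```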
For some reason I don't know, this did not work well. It might be because of the Lato font I used in Sketch (which is how I quickly tested the text detection). I read that SwiftOCR allows custom training for new fonts, but because I was lazy, I tried Tesseract instead.
Tesseract "is an optical character recognition engine for various operating systems. It is free software, released under the Apache License, Version 2.0, and development has been sponsored by Google since 2006". The iOS port is open source on GitHub and has CocoaPods support, so simply put pod 'TesseractOCRiOS' in your Podfile and you're good to go.
As noted in the README and in the TestsProject, tessdata is needed; it contains the language data Tesseract needs in order to work. Without this tessdata, the TesseractOCR framework will complain with warnings about a missing TESSDATA_PREFIX.
Strict requirement on language files existing in a referenced “tessdata” folder.
Download the tessdata from here and add it as a reference to your Xcode project. The blue color indicates that the folder is added as a reference.
You may also need to add libstdc++.dylib and CoreImage.framework to your target:
Tesseract
Using Tesseract is easy. Remember to import TesseractOCR, not TesseractOCRiOS.
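A sketch of OCRService along these lines (the handle function name is mine):

```swift
import TesseractOCR
import UIKit

final class OCRService {
  // Run Tesseract on the cropped image and return the recognised text.
  func handle(image: UIImage) -> String? {
    guard let tesseract = G8Tesseract(language: "eng") else { return nil }

    // The combined engine is the most accurate, but also the slowest.
    tesseract.engineMode = .tesseractCubeCombined
    // Our number should appear as a single uniform block of text.
    tesseract.pageSegmentationMode = .singleBlock
    // Increase contrast to make the characters easier to detect.
    tesseract.image = image.g8_blackAndWhite()
    tesseract.recognize()

    return tesseract.recognizedText
  }
}
```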
g8_blackAndWhite is a convenient filter that increases the contrast of the image for easier detection. For pageSegmentationMode I use singleBlock, as our number should appear in a uniform block of text; you can also try the singleLine mode. Lastly, we set engineMode to tesseractCubeCombined, which is the most accurate but can take some time. You can set it to tesseractOnly or cubeOnly to trade accuracy for speed.