Document Processing
Capture information from an organization's forms that are stored as PDF documents.

Introduction

In this series of posts, advanced users at Clarifai present working examples to help you kick-start your own AI solutions.

The Use Case

There is a problem facing many organizations as they attempt to modernize: digitizing documents. In order to effectively gain insights from their old paper records, organizations must transform them into a digital version. Now, simply making a digital copy of the document is actually rather easy: scan it, or even just upload a photo. The problem, though, is that while this changes how the document is stored, it doesn't give us any real improvement in accessing the data therein. For the longest time this required a laborious, manual data-entry process. Someone would have to transcribe the documents, one by one, and enter each field into the books. This presents a problem for organizations that potentially have thousands upon thousands of documents in their records: the time and cost of the effort can make it intractable. Luckily, though, there's a middle ground.
Using Clarifai's publicly available Optical Character Recognition (OCR) models, we can leverage Artificial Intelligence to do this in a quick and cost-effective manner, without sacrificing the insights that come from recording every single value.

Assumptions

Before we begin, let us make some assumptions:
  1. The form is a standard form, with static regions for fixed values, i.e. the "name" field will always appear in the same location across all forms.
  2. All of the entries will be in English, using the Roman alphabet.
  3. The organization has a simple means of converting their paper documents to PDF documents and storing them on a local file system, which is a common feature on most commercial print stations.
  4. All of the forms will be type-filled, not handwritten, so as to make generating examples easier.
These assumptions were largely made to keep this example succinct and easily digestible.

Setup

Before we get to the implementation, let's take a moment to provide an overview of it.
First off, the broad strokes have already been laid out: convert the PDF to an image, use Clarifai for OCR to extract the text, then store that text so it can be accessed later. Clearly there are some gaps that need to be filled in, though; the largest of which is just how the document will be processed.
Working backwards a bit, the way in which the information is recorded will be highly dependent on the organization's data policies. So, to simplify things, we will use Clarifai's platform to store the annotated documents.
Given assumption 1 above, we know that the fields will be in fixed locations. This means we can define them ahead of time, and here we've chosen to do so using a JSON file, in which we define the document's structure in a manner similar to:
{
    "field_1": [0.25, 0.25, 0.50, 0.50],
    "field_2": [0.50, 0.25, 0.75, 0.50],
    ...
    "field_n": [0.25, 0.75, 0.50, 1.00]
}
Each key-value pair in the JSON file corresponds to a field: the key is the field name ("field_n"), and the value is the region's coordinates in the form $[x_0, y_0, x_1, y_1]$.
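To make that concrete, here is a minimal sketch of loading and sanity-checking such a layout file, assuming it lives at assets/field_regions.json (the default path used by the full script below):

import json

# a minimal sketch: load the layout and check that every region is a well-formed box
with open('assets/field_regions.json') as f:
    layout = json.load(f)

for field, region in layout.items():
    assert len(region) == 4, f"expected [x0, y0, x1, y1] for {field}"
    assert all(0.0 <= v <= 1.0 for v in region), f"expected relative values for {field}"
    print(field, region)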
Note: All of the region coordinates on the Clarifai platform are relative values, not pixel values. This is important, as other image-processing libraries might use pixel values instead. We will address converting between these values below.
Given that we know the name of each field and where it is on the image, we can easily iterate through all of the field values and annotate the corresponding region on the image. Having the coordinate values also lets us take sub-crops of the document for the OCR model to predict on, isolating the text associated with a given field; a rough sketch of this follows.
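Here is a minimal sketch of that conversion-and-crop step, assuming doc is a PIL image of the form page and the region below is a hypothetical field:

# a minimal sketch, assuming `doc` is a PIL.Image of the form page
name_region = [0.25, 0.25, 0.50, 0.50]  # hypothetical relative [x0, y0, x1, y1]

w, h = doc.size
x0, y0, x1, y1 = name_region
pixel_box = (x0 * w, y0 * h, x1 * w, y1 * h)  # scale relative values to pixels

name_crop = doc.crop(pixel_box)  # sub-crop to send to the OCR model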
We assume that the user is already familiar with basic platform usage, and has an account. If more information is needed here, please find the appropriate section of the documentation for more in-depth information.
With this, we have a more fleshed-out plan:
  1. Convert the PDF to an image, and upload it to the Clarifai platform for storage.
  2. Read values from the JSON file where the form's fields and their locations are defined.
  3. For each field and region:
    • Extract a sub-crop for the field
    • Use Clarifai's OCR model to predict the text associated with the field
    • Write the predicted text back to the input as an annotation
Now let's dive into the implementation:
Starting with the conversion of a PDF document to an image, we can handle this with the open-source library pdf2image, which does exactly what the name suggests. In order to be a bit more defensive with our programming, we will wrap the call to the pdf2image.convert_from_path method in a separate function, and do some quick sanity checking to make sure the PDF file exists.
import os

from pdf2image import convert_from_path


def pdf_to_page_images(file_path):
    """return an iterable of images that span the pages of the document"""
    assert os.path.exists(file_path), f"file not found: {file_path}"
    pdf_images = convert_from_path(file_path)

    return pdf_images
This will return an iterable of images that correspond to the individual pages of the document.
Note: For simplicity's sake, our form only has one page.
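For example, a quick usage sketch (the file paths here are hypothetical):

pages = pdf_to_page_images('form.pdf')  # 'form.pdf' is a hypothetical path
first_page = pages[0]  # a PIL image of the document's only page
first_page.save('form_page_1.png')  # e.g. save it out for inspection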

Full implementation

intelligent_document_processing.py
#!/usr/bin/env python3
import io
import os
import json
import time
import argparse

from itertools import count

from pdf2image import convert_from_path
from clarifai_grpc.channel.clarifai_channel import ClarifaiChannel
from clarifai_grpc.grpc.api import resources_pb2, service_pb2, service_pb2_grpc
from clarifai_grpc.grpc.api.status import status_code_pb2


def pdf_to_page_images(file_path):
    """return an iterable of images that span the pages of the document"""
    assert os.path.exists(file_path), f"file not found: {file_path}"
    pdf_images = convert_from_path(file_path)

    return pdf_images


def post_image_bytes_as_input(image_bytes, stub, metadata):
    """post an image in bytes format to the platform as an input"""
    post_inputs_response = stub.PostInputs(
        service_pb2.PostInputsRequest(
            inputs=[
                resources_pb2.Input(
                    data=resources_pb2.Data(
                        image=resources_pb2.Image(
                            base64=image_bytes
                        )
                    )
                )
            ]
        ),
        metadata=metadata
    )

    return post_inputs_response


def image_to_bytes(img):
    """convert a PIL image object to a byte array"""
    byte_arr = io.BytesIO()
    img.save(byte_arr, format='PNG')
    return byte_arr.getvalue()


def pixels_to_proportions(coordinates, image):
    """
    Convert pixel coordinates to relative coordinates.

    This function expects a sequence of coordinate pairs as input, along with the image it corresponds to.
    That is, something like: $[(x_0, y_0), (x_1, y_1), ..., (x_n, y_n)]$
    """
    w, h = image.size
    output = []

    for (x, y) in coordinates:
        output.append((x / w, y / h))

    return output


def proportions_to_pixels(coordinates, image):
    """Convert relative coordinates to pixel coordinates. See the docstring for `pixels_to_proportions`."""
    w, h = image.size
    output = []
    for (x, y) in coordinates:
        output.append((x * w, y * h))

    return output


def unpack_tuple_list(a):
    """flatten a nested list. Currently fixed at a depth of k=2."""
    return [i for sub in a for i in sub]


def grouped(iterable, n):
    r"""h/t https://stackoverflow.com/a/5389547
    Given the iterable `S` and the integer `n`:
    $S \to (s_{0,0}, s_{0,1}, \dots, s_{0,n-1}), \ldots, (s_{m,0}, s_{m,1}, \dots, s_{m,n-1})$
    """
    return zip(*[iter(iterable)] * n)


def read_json_fields(json_file):
    """parse the document fields defined in json_file"""
    with open(json_file, 'rb') as f:
        d = json.load(f)

    for k, v in d.items():
        yield k, v


def _hold_for_upload(asset_id, stub, metadata, t=.5):
    """halt the program while we wait for the input to finish uploading"""
    for _ in count():
        get_inputs_response = stub.GetInput(
            service_pb2.GetInputRequest(
                input_id=asset_id,
            ),
            metadata=metadata
        )
        assert get_inputs_response.status.code == status_code_pb2.SUCCESS

        if get_inputs_response.input.status.code == status_code_pb2.INPUT_DOWNLOAD_SUCCESS:
            break
        else:
            time.sleep(t)
            continue

    return True


def predict_text(image, model_id, stub, metadata):
    """return the text value output by the specified OCR model"""
    image_bytes = image_to_bytes(image)

    post_model_outputs_response = stub.PostModelOutputs(
        service_pb2.PostModelOutputsRequest(
            model_id=model_id,
            inputs=[
                resources_pb2.Input(
                    data=resources_pb2.Data(
                        image=resources_pb2.Image(
                            base64=image_bytes
                        )
                    )
                )
            ]
        ),
        metadata=metadata
    )
    if post_model_outputs_response.status.code != status_code_pb2.SUCCESS:
        raise Exception("Post model outputs failed, status: " + post_model_outputs_response.status.description)

    predicted_text = post_model_outputs_response.outputs[0].data.text.raw

    return predicted_text


def make_concept(concept, value=1.):
    """create a concept object. Note: By default this will create a positive association - value=1. - with the concept."""
    return resources_pb2.Concept(id=concept, value=value)


def coords_to_bbox(x0, y0, x1, y1):
    """create a BoundingBox object from a set of 2d Cartesian coordinates"""
    return resources_pb2.BoundingBox(
        left_col=x0,
        top_row=y0,
        right_col=x1,
        bottom_row=y1
    )


def make_annotation(input_id, coords, body, stub, metadata, *concepts):
    """post a single region annotation at a time"""
    post_annotations_response = stub.PostAnnotations(
        service_pb2.PostAnnotationsRequest(
            annotations=[
                resources_pb2.Annotation(
                    input_id=input_id,
                    data=resources_pb2.Data(
                        regions=[
                            resources_pb2.Region(
                                region_info=resources_pb2.RegionInfo(
                                    bounding_box=coords_to_bbox(*coords),
                                    text=resources_pb2.Text(raw=body)
                                ),
                                data=resources_pb2.Data(
                                    concepts=[make_concept(concept) for concept in concepts],
                                )
                            )
                        ]
                    ),
                ),
            ]
        ),
        metadata=metadata
    )

    if post_annotations_response.status.code != status_code_pb2.SUCCESS:
        raise Exception("Post annotations failed, status: " + post_annotations_response.status.description)

    return post_annotations_response


def main(args):
    # initialize the Clarifai client
    print(args)
    channel = ClarifaiChannel.get_json_channel()
    stub = service_pb2_grpc.V2Stub(channel)

    metadata = (('authorization', f'Key {args.api_key}'),)

    # import the pdf document, and convert it to an iterable of images for the pages.
    # We know our document is only one page, so we isolate the first item in the
    # iterable; equivalent to pdf_to_page_images(args.file)[0]
    doc, *_ = pdf_to_page_images(args.file)
    doc_bytes = image_to_bytes(doc)

    # post the doc as an input
    post_input_response = post_image_bytes_as_input(doc_bytes, stub, metadata)

    doc_id = post_input_response.inputs[0].id  # we know there will only be one input, given the setup above

    print(f"[DOC] - {doc_id}")
    _ = _hold_for_upload(doc_id, stub, metadata)  # ensure that the input is uploaded, so that we can annotate the regions-of-interest

    doc_fields = read_json_fields(args.layout)

    for field, value in doc_fields:
        relative_coords = grouped(value, 2)  # xy-coords -> n=2
        pixel_coords = proportions_to_pixels(relative_coords, doc)
        pixel_coords_flat = unpack_tuple_list(pixel_coords)

        # get a crop of the region
        region = doc.crop(pixel_coords_flat)

        # predict the text in the cropped region
        predicted_text = predict_text(region, args.model_id, stub, metadata)
        print("\t-", f"{field} | {predicted_text}")

        # make_annotation raises on failure, so no extra status check is needed here
        make_annotation(doc_id, tuple(value), predicted_text, stub, metadata, field)

    print("Done.")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('-f', '--file', type=str, help="File path to the PDF document you would like to parse and annotate.")
    parser.add_argument('-k', '--api_key', type=str, help="The Clarifai API key associated with your application.")
    parser.add_argument('-m', '--model_id', type=str, help="The ID of the Clarifai model you would like to use for OCR.", default='eng-ocr')
    parser.add_argument('-l', '--layout', type=str, help="Path to the JSON file in which the document layout is defined.", default='assets/field_regions.json')

    args = parser.parse_args()

    _ = main(args)
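With the script saved as intelligent_document_processing.py, it might be invoked like so (the file paths and key below are placeholders):

python intelligent_document_processing.py \
    --file assets/form.pdf \
    --api_key YOUR_CLARIFAI_API_KEY \
    --model_id eng-ocr \
    --layout assets/field_regions.json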