Enabling OCR with AWS Textract

Updated 3 Nov 2023 Search & Document Tools

Enabling OCR with AWS Textract

The Foldr server can process and analyse image files and documents containing graphical elements to extract text that is then searchable by users. The Optical Character Recognition (OCR) process can be performed either using the Foldr server’s built-in OCR engine (Tesseract), or cloud OCR engines hosted by AWS (Textract) and Google Vision.

This article assumes that Search has already been configured on the Foldr server. This is covered by the following KB article

Creating the AWS IAM/user for Textract

Log into the Amazon AWS console with a suitable administrative account

In the Search for services bar, search for IAM – then select IAM from the results

Click Users

Click Add Users (found on the right of the screen)

Give the user a suitable Username and select Access Key – Programmatic Access as the AWS credential type

Click Next: Permissions

Click **Attach existing policies directly
**

Filter the permissions and add the following.

The AWS managed policies below are convenient for setup, but for production we recommend creating a custom least-privilege policy scoped to: textract:* on the document resources you process, s3:GetObject/PutObject/DeleteObject on the specific Foldr bucket only, and sns:Publish/Subscribe on the specific Foldr SNS topic only. Avoid *FullAccess policies on production IAM users.

**AmazonSNSFullAccess
**AmazonTextractFullAccess
AmazonS3FullAccess

Click Next: Tags

Click Next: Review

The create user and permission summary will be shown

Click Create User

A success message will be shown

Below the success message, the Access Key ID and Access Secret are shown. Click the ‘show’ link to display the secret and make a note of both values.

Create Foldr Service Account

A Service Account for the Textract OCR process should now be configured in Foldr Settings.

Select Integrations in the Foldr web admin interface

Click the Service Accounts tab

Click + Add New and configure the service account.

Type = AWS S3

Enter the Application Key (Access Key ID) and Application Secret (Secret Access Key) as shown in AWS Console.

Select the appropriate Region and click Update

Enable Textract OCR on the storage item/share

Within Foldr Settings, select Files & Storage and edit the appropraite storage item.

Selet the Search and Data tab and then select Settings

In the Crawl tab, scroll down and enable the toggle Index file contents

Now, select the OCR tab (top of screen)

Select the Enable OCR toggle, select AWS Textract as the OCR Engine and the Textract service account.

Click Save Changes

The AWS Textract integration is now complete and an initial crawl can be started by the administrator by clicking the Search and Data > Activity tab > + Crawl Now. Additionally, all files now uploaded to this storage item will be indexed by the Foldr server and processed by AWS Textract for OCR.

Note – All SMB shares will require a suitable service account to be configued and selected under Foldr Settings > Files & Storage > Access tab. The service account neesd read permission to the SMB share and files contained within and should use the account UPN ([email protected] format) for its username.