Extract text with AWS Textract using AWS Step Functions


Introduction

In this post, we will look at how to extract text from an image with AWS Textract, generate a pdf file from that text, and upload it to an S3 bucket using AWS Step Functions. We are going to deploy a serverless stack with three Lambda functions: one Lambda triggers our AWS Step Functions state machine, and the other two extract the text from the image, generate the pdf, and upload it to the S3 bucket.

To learn more about AWS Step Functions, check out the AWS Step Functions Cheatsheet.

Project setup

Our project structure will look like this:

[Image: project structure]

We need to set up a basic serverless project with a serverless.yml file and our Lambda functions. We also need to install aws-sdk to interact with AWS services. I’ll not go into details on what a serverless.yml file is or how to set up a serverless project; for that, you can check out this post.

Serverless.yml file

Let’s start by defining our serverless.yml file. We will go step by step for easier explanation.

Permissions and configuration

service: aws-step-functions

plugins:
- serverless-step-functions

custom:
  stateMachineName: newStateMachine

provider:
  name: aws
  runtime: nodejs12.x
  iamRoleStatements:
    - Effect: Allow
      Action:
        - states:StartExecution
        - textract:DetectDocumentText
        - s3:Get*
        - s3:List*
        - s3:PutObject*
      Resource: "*"

Let’s understand this code by breaking it down

plugins – Here we define the plugins (node package modules) which we want to use with our AWS serverless project. To use AWS Step Functions with the Serverless Framework, we need a plugin called serverless-step-functions.

custom – Here we define properties which we want to reference elsewhere in our serverless.yml file. In this case, we are defining the name of our state machine. We will also add it as an environment variable later in our Lambda function configuration.

provider – This block holds the provider configuration, settings, and permissions. The main thing here is the permissions: we need to allow every action which our Lambda functions will perform. In our case those are –

  • Starting the AWS Step Functions state machine.
  • Using the AWS Textract DetectDocumentText API to extract text from an image.
  • Getting the image from the S3 bucket for extracting the text.
  • Uploading the generated pdf file to the S3 bucket.

Defining step functions block

stepFunctions:
  stateMachines:
    newStateMachine:
      name: ${self:custom.stateMachineName}
      tracingConfig:
        enabled: true
      definition:
        Comment: Image extraction and pdf generation
        StartAt: extractText
        States:
          extractText:
            Type: Task
            Resource: !GetAtt extractText.Arn
            Next: generatePdf
          generatePdf:
            Type: Task
            Resource: !GetAtt generatePdf.Arn
            End: true
            Retry:
            - ErrorEquals: ['States.ALL']
              IntervalSeconds: 1
              MaxAttempts: 3
              BackoffRate: 2

This block defines the steps, settings, and configuration of our AWS Step Functions state machine. Let’s break it down.

stateMachines – Here we define all our state machines and their respective configuration. In our case, we are using a single state machine.

name – This is the name of our state machine. Notice that we are referencing the custom property which we defined previously.

tracingConfig – This controls whether AWS X-Ray tracing is turned on. It is a personal preference, and we can also turn it off.

definition – In this block, we define the actual steps of our state machine.

StartAt – This defines the starting point of the state machine, meaning the step from which execution begins.

We are defining two steps in this state machine. The first step calls the Lambda function which extracts the text from an image, and the second step calls the Lambda function which generates the pdf file with the text content of the image and uploads that pdf file to the S3 bucket.

Resource – This property defines the resource to invoke on that step. Here we set it to the ARN of a Lambda function (resolved with !GetAtt), because both of our steps call Lambda functions.

ErrorEquals – Here we define which errors should trigger a retry if the step fails. States.ALL matches every error. Note that in the definition above the Retry block is attached only to the generatePdf step, so extractText is not retried.
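To make the Retry settings more concrete, here is a minimal sketch of the wait before each retry. It assumes the documented behaviour that the first retry waits IntervalSeconds and every later wait is multiplied by BackoffRate. retryDelays is a hypothetical helper written for illustration only; it is not part of Step Functions or the plugin, and it ignores optional settings such as MaxDelaySeconds.

```javascript
// Hypothetical helper: seconds to wait before each retry attempt,
// given the fields of a Step Functions Retry block.
function retryDelays({ IntervalSeconds = 1, MaxAttempts = 3, BackoffRate = 2 }) {
  const delays = [];
  for (let attempt = 0; attempt < MaxAttempts; attempt++) {
    // First retry waits IntervalSeconds; each later wait grows by BackoffRate
    delays.push(IntervalSeconds * Math.pow(BackoffRate, attempt));
  }
  return delays;
}

// Values taken from the generatePdf step in serverless.yml
const delays = retryDelays({ IntervalSeconds: 1, MaxAttempts: 3, BackoffRate: 2 });
console.log(delays); // [ 1, 2, 4 ]
```

So with our configuration a failing generatePdf step would be retried up to three times, roughly 1, 2, and 4 seconds apart, before the execution is marked as failed.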

Defining Lambda functions

functions:
  extractText:
    handler: src/extractText/index.extractText
  
  generatePdf:
    handler: src/generatePdf/index.generatePdf

  triggerStateMachine:
    handler: src/triggerStateMachine/index.triggerStateMachine
    environment:
      stateMachineName: ${self:custom.stateMachineName}
      ACCOUNT_ID: ${aws:accountId}
    events:
      - s3:
          bucket: my-bucket-34
          event: s3:ObjectCreated:*
          existing: true

We are defining three Lambda functions

extractText – This Lambda will get the image from S3 and extract the text from the image using AWS Textract.

generatePdf – This Lambda will receive the extracted text and then it will generate the pdf file with that text and upload it to the S3 Bucket.

triggerStateMachine – We need this lambda to trigger our state machine.

events – The final thing is to attach an S3 event to our Lambda function so it gets called as soon as a new image is uploaded to the S3 bucket. The bucket property is the name of the bucket where we will upload images; we can create this bucket manually from the AWS console and then put the same name here. The existing property is set to true because the bucket is already created. If we don’t pass this flag, the template will try to create the bucket.

Putting it all together

service: aws-step-functions

plugins:
- serverless-step-functions

custom:
  stateMachineName: newStateMachine

provider:
  name: aws
  runtime: nodejs12.x
  iamRoleStatements:
    - Effect: Allow
      Action:
        - states:StartExecution
        - textract:DetectDocumentText
        - s3:Get*
        - s3:List*
        - s3:PutObject*
      Resource: "*"

stepFunctions:
  stateMachines:
    newStateMachine:
      name: ${self:custom.stateMachineName}
      tracingConfig:
        enabled: true
      definition:
        Comment: Image extraction and pdf generation
        StartAt: extractText
        States:
          extractText:
            Type: Task
            Resource: !GetAtt extractText.Arn
            Next: generatePdf
          generatePdf:
            Type: Task
            Resource: !GetAtt generatePdf.Arn
            End: true
            Retry:
            - ErrorEquals: ['States.ALL']
              IntervalSeconds: 1
              MaxAttempts: 3
              BackoffRate: 2

functions:
  extractText:
    handler: src/extractText/index.extractText
  
  generatePdf:
    handler: src/generatePdf/index.generatePdf

  triggerStateMachine:
    handler: src/triggerStateMachine/index.triggerStateMachine
    environment:
      stateMachineName: ${self:custom.stateMachineName}
      ACCOUNT_ID: ${aws:accountId}
    events:
      - s3:
          bucket: my-bucket-34
          event: s3:ObjectCreated:*
          existing: true

Extract text from an image

Let’s start with our first Lambda function, extractText, which uses AWS Textract to get the text from an image uploaded to the S3 bucket. We will break the function down into parts.

Imports

const AWS = require("aws-sdk");
const textract = new AWS.Textract();

We need aws-sdk and an instance of Textract()

Getting text from an image

const { bucket, key } = event;
try {
    const params = {
        Document: {
            S3Object: {
                Bucket: bucket,
                Name: key,
            }
        }
    };
    const response = await textract.detectDocumentText(params).promise();

First, we receive bucket and key as the state machine input. Our triggerStateMachine Lambda function passes them when it starts the execution, and that Lambda is called whenever an object is uploaded to our S3 bucket (more on this later).

We then call the DetectDocumentText API, which analyzes the image and returns the detected text as a list of blocks.

Gathering text data from response of AWS Textract

let text = '';
response.Blocks.forEach((data) => {
    if (data.BlockType === 'LINE') {
        text += `${data.Text} `;
    }
});
return { key, pdfData: text };

Here we are looping through the Blocks array in the response returned by the Textract API call. We only need the blocks where BlockType is ‘LINE’, each of which is one line of text from the processed image. We append all the lines of text to a single string.

After that, we are just returning that data so our next lambda in the state machine steps receives this data for generating the pdf and uploading it to the S3 bucket.
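The filtering step can be tried without calling Textract at all. The sketch below is only an illustration: collectLines is a hypothetical helper and sampleResponse is a hand-written, heavily trimmed stand-in for a real DetectDocumentText response. Unlike the loop in the post, it joins the lines with a single space, so there is no trailing space.

```javascript
// Hypothetical helper: keep only LINE blocks and join their text
function collectLines(blocks) {
  return blocks
    .filter((block) => block.BlockType === "LINE")
    .map((block) => block.Text)
    .join(" ");
}

// Hand-written sample; a real response carries many more fields per block
const sampleResponse = {
  Blocks: [
    { BlockType: "PAGE" },
    { BlockType: "LINE", Text: "Hello world" },
    { BlockType: "WORD", Text: "Hello" },
    { BlockType: "WORD", Text: "world" },
    { BlockType: "LINE", Text: "Second line" },
  ],
};

// Same shape as the value extractText returns to the state machine
const output = { key: "scan.png", pdfData: collectLines(sampleResponse.Blocks) };
console.log(output.pdfData); // Hello world Second line
```

The WORD blocks repeat text that is already contained in the LINE blocks, which is why only LINE blocks are kept.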

Whole function

const AWS = require("aws-sdk");
const textract = new AWS.Textract();

exports.extractText = async (event) => {
    const { bucket, key } = event;
    try {
        const params = {
            Document: {
                S3Object: {
                    Bucket: bucket,
                    Name: key,
                }
            }
        };
        const response = await textract.detectDocumentText(params).promise();
        let text = '';
        response.Blocks.forEach((data) => {
            if (data.BlockType === 'LINE') {
                text += `${data.Text} `;
            }
        })
        return { key, pdfData: text };
    }
    catch (e) {
        // Log and rethrow so the state machine sees the step as failed
        console.log(e);
        throw e;
    }
}

PDF generation and uploading it to the S3 Bucket

In this Lambda function, we will create a pdf with the data we received from image analysis and then upload that pdf to the S3 bucket.

Imports

const AWS = require("aws-sdk");
const PDFDocument = require("pdfkit")
const s3 = new AWS.S3();

We are going to use an npm package called pdfkit to write and generate our pdf file.

Writing data to the pdf file

const { key, pdfData } = event;
const fileName = 'output.pdf';
const pdfPromise = await new Promise((resolve, reject) => {
    const doc = new PDFDocument();

    // Collect the generated chunks and resolve once the file is complete
    const buffers = [];
    doc.on("data", buffers.push.bind(buffers));
    doc.on("end", () => resolve(Buffer.concat(buffers)));
    doc.on("error", reject);

    doc.text(pdfData);
    doc.end();
});

We receive the image file key and the text returned from our extractText Lambda. Let’s understand this code step by step.

doc.text() – This is just writing the data to our pdf file.

doc.end() – This is closing the writing stream.

We also use the data and end events. We need events because the file is generated asynchronously, so we have to wait until it is fully written before uploading it to S3. In the end event handler, we resolve the promise with the generated file.
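The data and end pattern can be sketched without pdfkit. In the example below a plain Node.js EventEmitter stands in for the PDFDocument stream (this is an assumption made purely for illustration), and collectChunks is a hypothetical helper that uses a callback instead of a promise so the flow is easy to follow.

```javascript
const { EventEmitter } = require("events");

// Hypothetical helper: gather every "data" chunk and hand the
// concatenated Buffer to onDone when "end" fires.
function collectChunks(source, onDone) {
  const chunks = [];
  source.on("data", (chunk) => chunks.push(chunk));
  source.on("end", () => onDone(Buffer.concat(chunks)));
}

// Stand-in for the pdf document stream
const fakeDoc = new EventEmitter();
let result = null;
collectChunks(fakeDoc, (buffer) => { result = buffer; });

// The source emits its content piece by piece, then signals completion
fakeDoc.emit("data", Buffer.from("%PDF-"));
fakeDoc.emit("data", Buffer.from("1.3"));
fakeDoc.emit("end");

console.log(result.toString()); // %PDF-1.3
```

In the Lambda, the onDone callback corresponds to resolve, which is why the upload can only start after the end event.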

Uploading pdf file to S3 bucket

const params = {
    Bucket: 'pdf-store-34',
    Key: `${key.split(".")[0]}-${fileName}`,
    Body: pdfPromise,
    ContentType: 'application/pdf',
};

const response = await s3.putObject(params).promise();
return response;

Bucket – This is the name of the bucket where you want to upload the pdf file. Put your own bucket name here.

Key – This is the name of the file you want to upload to the S3 bucket. We prepend the original image file name (without its extension) to the name of the pdf file.

Body – This is the pdf file that we generated. We pass the buffer which we resolved in the previous step.

Lastly, we call the putObject API to create the object in S3 and return the response to show the success of the last step in our AWS Step Functions state machine.
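To see what the Key expression produces, here is a small sketch. pdfKeyFirstDot mirrors the expression used in the post; pdfKeyLastExtension is a hypothetical alternative (not from the original code) that strips only the final extension, which may be preferable if your image names contain more than one dot.

```javascript
// Same rule as the post: keep everything before the first dot
function pdfKeyFirstDot(imageKey, fileName = "output.pdf") {
  return `${imageKey.split(".")[0]}-${fileName}`;
}

// Hypothetical alternative: remove only the last extension
function pdfKeyLastExtension(imageKey, fileName = "output.pdf") {
  const lastDot = imageKey.lastIndexOf(".");
  const base = lastDot > 0 ? imageKey.slice(0, lastDot) : imageKey;
  return `${base}-${fileName}`;
}

console.log(pdfKeyFirstDot("invoice.png"));               // invoice-output.pdf
console.log(pdfKeyFirstDot("scans/invoice.v2.png"));      // scans/invoice-output.pdf
console.log(pdfKeyLastExtension("scans/invoice.v2.png")); // scans/invoice.v2-output.pdf
```

With the first-dot rule, invoice.v1.png and invoice.v2.png would both map to the same pdf key, so the later upload would overwrite the earlier one.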

Full function looks like this

const AWS = require("aws-sdk");
const PDFDocument = require("pdfkit")
const s3 = new AWS.S3();

exports.generatePdf = async (event) => {
    try {
        const { key, pdfData } = event;
        const fileName = 'output.pdf';
        const pdfPromise = await new Promise((resolve, reject) => {
            const doc = new PDFDocument();

            // Collect the generated chunks and resolve once the file is complete
            const buffers = [];
            doc.on("data", buffers.push.bind(buffers));
            doc.on("end", () => resolve(Buffer.concat(buffers)));
            doc.on("error", reject);

            doc.text(pdfData);
            doc.end();
        });
        const params = {
            Bucket: 'pdf-store-34',
            Key: `${key.split(".")[0]}-${fileName}`,
            Body: pdfPromise,
            ContentType: 'application/pdf',
        };

        const response = await s3.putObject(params).promise();
        return response;
    }
    catch (e) {
        // Log and rethrow so the Retry block in the state machine can act on the failure
        console.log(e);
        throw e;
    }
}

Triggering state machine using Lambda

In our triggerStateMachine Lambda function, we are going to trigger our state machine. This Lambda gets called on the S3 object upload event.

Getting required data from event object

const bucket = event.Records[0].s3.bucket.name;
const key = decodeURIComponent(event.Records[0].s3.object.key.replace(/\+/g, " "));

const { AWS_REGION, ACCOUNT_ID, stateMachineName } = process.env;

When this Lambda gets called, it receives the bucket name and the key (file name) of the file which was uploaded to the S3 bucket. We get these details from the event object.
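The key decoding is easy to overlook, so here is a minimal sketch of it. Object keys arrive URL-encoded in the S3 event, with spaces sent as +. parseS3Record is a hypothetical helper, and sampleEvent is a hand-written event trimmed down to the fields the Lambda actually reads.

```javascript
// Hypothetical helper: pull the bucket name and decoded key from an S3 event
function parseS3Record(event) {
  const record = event.Records[0];
  return {
    bucket: record.s3.bucket.name,
    // Turn "+" back into spaces, then decode the percent-encoded characters
    key: decodeURIComponent(record.s3.object.key.replace(/\+/g, " ")),
  };
}

// Hand-written sample; a real event contains many more fields
const sampleEvent = {
  Records: [
    { s3: { bucket: { name: "my-bucket-34" }, object: { key: "my+scan%281%29.png" } } },
  ],
};

const parsed = parseS3Record(sampleEvent);
console.log(parsed); // { bucket: 'my-bucket-34', key: 'my scan(1).png' }
```

Without the decoding, the encoded key would be passed on to Textract, which would then look for an object that does not exist under that name.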

We are also fetching environment variables like region, AWS accountId, and state machine name to form the ARN for our state machine to start its execution.

Starting execution of state machine

const params = {
    stateMachineArn: `arn:aws:states:${AWS_REGION}:${ACCOUNT_ID}:stateMachine:${stateMachineName}`,
    input: JSON.stringify({ bucket, key })
};

await stepfunctions.startExecution(params).promise();

Here we are just calling the startExecution function to start an execution of our state machine.
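As a quick check of the ARN format, the template string can be isolated into a small function. buildStateMachineArn is a hypothetical helper, and the region and account ID below are placeholder values, not real ones.

```javascript
// Hypothetical helper: assemble a state machine ARN from its parts
function buildStateMachineArn({ region, accountId, name }) {
  return `arn:aws:states:${region}:${accountId}:stateMachine:${name}`;
}

// Placeholder values; in the Lambda these come from process.env
const arn = buildStateMachineArn({
  region: "us-east-1",
  accountId: "123456789012",
  name: "newStateMachine",
});

console.log(arn); // arn:aws:states:us-east-1:123456789012:stateMachine:newStateMachine

// The input must be a JSON string, and it becomes the event of the first step
const input = JSON.stringify({ bucket: "my-bucket-34", key: "scan.png" });
console.log(input); // {"bucket":"my-bucket-34","key":"scan.png"}
```

If any of the three environment variables is missing, the ARN will contain the text undefined and startExecution will fail, so it is worth logging the ARN while debugging.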

Whole code

const AWS = require("aws-sdk");
const stepfunctions = new AWS.StepFunctions()

exports.triggerStateMachine = async (event) => {
    const bucket = event.Records[0].s3.bucket.name;
    const key = decodeURIComponent(event.Records[0].s3.object.key.replace(/\+/g, " "));

    const { AWS_REGION, ACCOUNT_ID, stateMachineName } = process.env;

    try {
        const params = {
            stateMachineArn: `arn:aws:states:${AWS_REGION}:${ACCOUNT_ID}:stateMachine:${stateMachineName}`,
            input: JSON.stringify({ bucket, key })
        };

        await stepfunctions.startExecution(params).promise();
    }
    catch (e) {
        // Log and rethrow so the failed invocation is visible to Lambda
        console.log(e);
        throw e;
    }
}

Conclusion

Congratulations if you reached this point. You now have a system where, when you upload an image to your S3 bucket, AWS Step Functions extracts all the text from that image, generates a pdf file, and uploads it to another S3 bucket.

Get this code

Source code on GitHub
