Introduction
In this post, we will look at how to extract text from an image with AWS Textract, generate a pdf file from that text, and upload the file to an S3 bucket using AWS Step Functions. We are going to deploy a serverless stack with three Lambda functions: one Lambda will trigger our AWS Step Functions state machine, and the other two will extract the text from the image, and generate the pdf and upload it to the S3 bucket.
To learn more about AWS Step Functions, check out the AWS Step Functions Cheatsheet.
Project setup
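Our project structure will look like this: a serverless.yml file at the project root and one folder per Lambda function under src/ (src/extractText, src/generatePdf and src/triggerStateMachine), each containing an index.js file.
We will need to set up a basic serverless project with a serverless.yml file and our Lambda functions. We also need to install aws-sdk to interact with AWS services and pdfkit to generate the pdf. I’ll not go into detail on what a serverless.yml file is or how to set up a serverless project; for that, you can check out this post.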
Serverless.yml file
Let’s start by defining our serverless.yml file. We will go step by step for easier explanation.
Permissions and configuration
service: aws-step-functions
plugins:
  - serverless-step-functions
custom:
  stateMachineName: newStateMachine
provider:
  name: aws
  runtime: nodejs12.x
  iamRoleStatements:
    - Effect: Allow
      Action:
        - states:StartExecution
        - textract:DetectDocumentText
        - s3:Get*
        - s3:List*
        - s3:PutObject*
      Resource: "*"
Let’s understand this code by breaking it down.
plugins – Here we define all the plugins (node packages) which we want to use with our AWS serverless project. To use AWS Step Functions with the Serverless Framework we need a plugin called serverless-step-functions.
custom – Here we define all the properties which we want to reference elsewhere in our serverless.yml file. In this case, we are defining the name of our state machine; we will also add this as an environment variable later in our Lambda function configuration.
provider – This block is used to define configuration, settings and permissions. The main thing here is that we are defining the permissions for all the actions which our Lambda functions will perform. In our case those are:
- Starting the AWS Step Functions state machine.
- Using the AWS Textract DetectDocumentText API to extract text from an image.
- Getting the image from the S3 bucket for extracting the text.
- Uploading the generated pdf file to the S3 bucket.
Defining step functions block
stepFunctions:
  stateMachines:
    newStateMachine:
      name: ${self:custom.stateMachineName}
      tracingConfig:
        enabled: true
      definition:
        Comment: Image extraction and pdf generation
        StartAt: extractText
        States:
          extractText:
            Type: Task
            Resource: !GetAtt extractText.Arn
            Next: generatePdf
          generatePdf:
            Type: Task
            Resource: !GetAtt generatePdf.Arn
            End: true
            Retry:
              - ErrorEquals: ['States.ALL']
                IntervalSeconds: 1
                MaxAttempts: 3
                BackoffRate: 2
This block is used to define all our AWS Step Functions steps, settings and configuration. Let’s try to understand it by breaking it down.
stateMachines – Here we define all our state machines and their respective configuration. In our case we are only using a single state machine.
name – This is just the name of our state machine. Notice that we are referencing the custom property which we defined previously.
tracingConfig – This config defines whether we want to turn on AWS X-Ray tracing or not. This is a personal preference; we can also turn it off.
definition – In this block, we define the actual steps for our AWS Step Functions state machine.
StartAt – This defines the starting point of the state machine, meaning the step from which our state machine will start executing.
We are defining two steps in this state machine. The first step will call the Lambda function which extracts the text from an image, and the second step will call the Lambda function which generates the pdf file with the text content of the image and uploads that pdf file to the S3 bucket.
Resource – This property defines the resource which is called on that step. Here we are setting the ARN of our Lambda function (via !GetAtt), because we want to call a Lambda function on each of our steps.
Retry – Here we define what happens when a step fails. ErrorEquals lists the errors which trigger a retry, and States.ALL matches every error. IntervalSeconds is the wait before the first retry, BackoffRate multiplies that wait for each further attempt, and MaxAttempts caps the number of retries. In this definition the Retry block sits on the generatePdf step; to retry extractText as well, the same block has to be added to that step.
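To make the retry settings concrete, here is a small sketch of the wait schedule they produce. Step Functions waits IntervalSeconds before the first retry and multiplies the wait by BackoffRate for each subsequent retry. The helper name retryDelays is hypothetical and only illustrates the arithmetic; it is not part of any AWS API.

```javascript
// Hypothetical helper: computes the wait (in seconds) before each retry
// for a Step Functions Retry block.
function retryDelays({ IntervalSeconds, MaxAttempts, BackoffRate }) {
  const delays = [];
  for (let attempt = 0; attempt < MaxAttempts; attempt++) {
    delays.push(IntervalSeconds * Math.pow(BackoffRate, attempt));
  }
  return delays;
}

// Values taken from the state machine definition in this post.
const delays = retryDelays({ IntervalSeconds: 1, MaxAttempts: 3, BackoffRate: 2 });
console.log(delays); // [ 1, 2, 4 ]
```

So with these settings a failing step is retried after roughly 1, 2 and 4 seconds before the execution is marked as failed.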
Defining Lambda functions
functions:
  extractText:
    handler: src/extractText/index.extractText
  generatePdf:
    handler: src/generatePdf/index.generatePdf
  triggerStateMachine:
    handler: src/triggerStateMachine/index.triggerStateMachine
    environment:
      stateMachineName: ${self:custom.stateMachineName}
      ACCOUNT_ID: ${aws:accountId}
    events:
      - s3:
          bucket: my-bucket-34
          event: s3:ObjectCreated:*
          existing: true
We are defining three Lambda functions.
extractText – This Lambda will get the image from S3 and extract the text from the image using AWS Textract.
generatePdf – This Lambda will receive the extracted text, generate the pdf file with that text and upload it to the S3 bucket.
triggerStateMachine – We need this Lambda to trigger our state machine.
events – The final thing is to attach an S3 event to our Lambda function so it gets called as soon as a new image is uploaded to the S3 bucket. bucket is the name of the bucket where we will upload images; we can create this bucket manually from the AWS console and then put the same name here. The existing property is set to true because this bucket is already created; if we don’t pass this flag, the template will try to create the bucket.
Putting it all together
service: aws-step-functions
plugins:
  - serverless-step-functions
custom:
  stateMachineName: newStateMachine
provider:
  name: aws
  runtime: nodejs12.x
  iamRoleStatements:
    - Effect: Allow
      Action:
        - states:StartExecution
        - textract:DetectDocumentText
        - s3:Get*
        - s3:List*
        - s3:PutObject*
      Resource: "*"
stepFunctions:
  stateMachines:
    newStateMachine:
      name: ${self:custom.stateMachineName}
      tracingConfig:
        enabled: true
      definition:
        Comment: Image extraction and pdf generation
        StartAt: extractText
        States:
          extractText:
            Type: Task
            Resource: !GetAtt extractText.Arn
            Next: generatePdf
          generatePdf:
            Type: Task
            Resource: !GetAtt generatePdf.Arn
            End: true
            Retry:
              - ErrorEquals: ['States.ALL']
                IntervalSeconds: 1
                MaxAttempts: 3
                BackoffRate: 2
functions:
  extractText:
    handler: src/extractText/index.extractText
  generatePdf:
    handler: src/generatePdf/index.generatePdf
  triggerStateMachine:
    handler: src/triggerStateMachine/index.triggerStateMachine
    environment:
      stateMachineName: ${self:custom.stateMachineName}
      ACCOUNT_ID: ${aws:accountId}
    events:
      - s3:
          bucket: my-bucket-34
          event: s3:ObjectCreated:*
          existing: true
Extract text from an image
Let’s start with our first Lambda function, extractText, which will use AWS Textract to get the text from an image uploaded to the S3 bucket. We will break down the function into parts.
Imports
const AWS = require("aws-sdk");
const textract = new AWS.Textract();
We need aws-sdk and an instance of Textract().
Getting text from an image
const { bucket, key } = event;
try {
  const params = {
    Document: {
      S3Object: {
        Bucket: bucket,
        Name: key,
      }
    }
  };
  const response = await textract.detectDocumentText(params).promise();
First, we receive bucket and key from our triggerStateMachine Lambda function, which is called when an object is uploaded to our S3 bucket (more on this later).
We then call the detectDocumentText API, which extracts the information from the image and returns the data we need.
Gathering text data from response of AWS Textract
let text = '';
response.Blocks.forEach((data) => {
  if (data.BlockType === 'LINE') {
    text += `${data.Text} `;
  }
});
return { key, pdfData: text };
Here we are looping through the Blocks array returned by the Textract API call. We only need the blocks where BlockType is 'LINE', each of which is one line of text from the processed image. We append all the lines of text to a single string.
After that, we return that data so the next Lambda in the state machine receives it for generating the pdf and uploading it to the S3 bucket.
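To see the filtering in isolation, here is a minimal sketch that runs the same loop against a hand-written sample instead of a real Textract call. The helper name collectLines and the sample data are made up for illustration; a real DetectDocumentText response carries many more fields per block (geometry, confidence, ids and so on).

```javascript
// Hypothetical helper mirroring the loop in this post: keeps only LINE
// blocks and appends each line's text followed by a space.
function collectLines(blocks) {
  let text = "";
  for (const block of blocks) {
    if (block.BlockType === "LINE") {
      text += `${block.Text} `;
    }
  }
  return text;
}

// Made-up sample shaped like a DetectDocumentText response.
const sampleResponse = {
  Blocks: [
    { BlockType: "PAGE" },
    { BlockType: "LINE", Text: "Hello world" },
    { BlockType: "WORD", Text: "Hello" },
    { BlockType: "WORD", Text: "world" },
    { BlockType: "LINE", Text: "Second line" },
  ],
};

console.log(JSON.stringify(collectLines(sampleResponse.Blocks)));
// "Hello world Second line "
```

The WORD blocks repeat the text that is already inside the LINE blocks, which is why filtering on LINE avoids duplicated words in the output.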
Whole function
const AWS = require("aws-sdk");
const textract = new AWS.Textract();

exports.extractText = async (event) => {
  const { bucket, key } = event;
  try {
    const params = {
      Document: {
        S3Object: {
          Bucket: bucket,
          Name: key,
        }
      }
    };
    const response = await textract.detectDocumentText(params).promise();
    let text = '';
    response.Blocks.forEach((data) => {
      if (data.BlockType === 'LINE') {
        text += `${data.Text} `;
      }
    });
    return { key, pdfData: text };
  } catch (e) {
    console.log(e);
    // Rethrow so the state machine marks this step as failed
    // instead of passing an empty result to the next step.
    throw e;
  }
};
PDF generation and uploading it to the S3 Bucket
In this Lambda function, we will create a pdf with the data we received from image analysis and then upload that pdf to the S3 bucket.
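Before looking at the code, it may help to see how data moves between the two steps. By default, the output of one Task state becomes the input of the next one, so whatever extractText returns arrives as the event of generatePdf. The sketch below mimics only that hand-off locally; runSteps and the two stub handlers are hypothetical and do not call AWS.

```javascript
// Local stand-in for the state machine: each step's return value becomes
// the next step's input. This only mimics the data flow, not Step Functions.
async function runSteps(steps, input) {
  let data = input;
  for (const step of steps) {
    data = await step(data);
  }
  return data;
}

// Stub handlers with the same input/output shapes as the real Lambdas.
const extractTextStub = async ({ bucket, key }) => ({
  key,
  pdfData: `text from ${bucket}/${key}`,
});
const generatePdfStub = async ({ key, pdfData }) => ({
  uploadedKey: `${key.split(".")[0]}-output.pdf`,
});

runSteps([extractTextStub, generatePdfStub], { bucket: "my-bucket-34", key: "photo.jpg" })
  .then((result) => console.log(result));
// { uploadedKey: 'photo-output.pdf' }
```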
Imports
const AWS = require("aws-sdk");
const PDFDocument = require("pdfkit")
const s3 = new AWS.S3();
We are going to use an npm package called pdfkit to write and generate our pdf file.
Writing data to the pdf file
const { key, pdfData } = event;
const fileName = 'output.pdf';
const pdfPromise = await new Promise(resolve => {
  const doc = new PDFDocument();
  doc.text(pdfData);
  doc.end();
  const buffers = [];
  doc.on("data", buffers.push.bind(buffers));
  doc.on("end", () => {
    const pdfData = Buffer.concat(buffers);
    resolve(pdfData);
  });
});
We receive the image file key and the data returned from our extractText Lambda. Let’s understand this code step by step.
doc.text() – This writes the data to our pdf file.
doc.end() – This closes the writing stream.
We are also using the data and end events. We need events because we don’t know how long it will take for the file to be fully written and generated before we upload it to S3. In the end event we return the generated file by resolving the promise with it.
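The same event-based pattern can be written as a small reusable helper. This is only a sketch: streamToBuffer is a hypothetical name, and a PassThrough stream from Node’s standard library stands in for the PDFDocument so the example runs without pdfkit (the assumption being that a pdfkit document behaves as a readable stream, which the code in this post relies on). Compared to the snippet in this post, it attaches the listeners before the stream is ended and also rejects on an error event.

```javascript
const { PassThrough } = require("stream");

// Collects every chunk of a readable stream and resolves with one Buffer.
// Listeners are attached up front, and stream errors reject the promise.
function streamToBuffer(readable) {
  return new Promise((resolve, reject) => {
    const chunks = [];
    readable.on("data", (chunk) => chunks.push(chunk));
    readable.on("end", () => resolve(Buffer.concat(chunks)));
    readable.on("error", reject);
  });
}

// PassThrough is a stand-in for a PDFDocument in this sketch.
const fakeDoc = new PassThrough();
const pending = streamToBuffer(fakeDoc);
fakeDoc.write("Hello ");
fakeDoc.end("world");

pending.then((buffer) => console.log(buffer.toString())); // Hello world
```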
Uploading pdf file to S3 bucket
const params = {
  Bucket: 'pdf-store-34',
  Key: `${key.split(".")[0]}-${fileName}`,
  Body: pdfPromise,
  ContentType: 'application/pdf',
};
const response = await s3.putObject(params).promise();
return response;
Bucket – This is the bucket name; put the name of the bucket where you want to upload the pdf file here.
Key – This is the name of the file you want to upload to the S3 bucket. We are prepending the original image file name (without its extension) to the name of the pdf file.
Body – This is the pdf file that we generated; we are passing the buffer which we resolved in the previous step.
Lastly, we call the putObject API to create the object in the S3 bucket and return the response to show the success of the last step in our AWS Step Functions state machine.
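One detail worth knowing about the Key expression: key.split(".")[0] cuts the name at the first dot, so a key such as scans/receipt.v2.png would lose the ".v2" part. If that matters for your file names, one possible alternative (a sketch, not what the code in this post does) is to strip only the final extension with Node’s path module. The helper name buildPdfKey is hypothetical.

```javascript
const path = require("path");

// Hypothetical helper: derives the PDF key from the image key by
// removing only the final extension and keeping any folder prefix.
function buildPdfKey(imageKey, fileName = "output.pdf") {
  // path.posix is used because S3 keys always use "/" as separator.
  const { dir, name } = path.posix.parse(imageKey);
  return path.posix.join(dir, `${name}-${fileName}`);
}

console.log(buildPdfKey("photo.jpg"));             // photo-output.pdf
console.log(buildPdfKey("scans/receipt.v2.png"));  // scans/receipt.v2-output.pdf
console.log("scans/receipt.v2.png".split(".")[0]); // scans/receipt
```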
Full function looks like this
const AWS = require("aws-sdk");
const PDFDocument = require("pdfkit");
const s3 = new AWS.S3();

exports.generatePdf = async (event) => {
  try {
    const { key, pdfData } = event;
    const fileName = 'output.pdf';
    const pdfPromise = await new Promise(resolve => {
      const doc = new PDFDocument();
      doc.text(pdfData);
      doc.end();
      const buffers = [];
      doc.on("data", buffers.push.bind(buffers));
      doc.on("end", () => {
        const pdfData = Buffer.concat(buffers);
        resolve(pdfData);
      });
    });
    const params = {
      Bucket: 'pdf-store-34',
      Key: `${key.split(".")[0]}-${fileName}`,
      Body: pdfPromise,
      ContentType: 'application/pdf',
    };
    const response = await s3.putObject(params).promise();
    return response;
  } catch (e) {
    console.log(e);
    // Rethrow so the Retry block on this step can take effect.
    throw e;
  }
};
Triggering state machine using Lambda
In our triggerStateMachine Lambda function we are going to trigger our state machine. This Lambda will get called on the S3 object upload event.
Getting required data from event object
const bucket = event.Records[0].s3.bucket.name;
const key = decodeURIComponent(event.Records[0].s3.object.key.replace(/\+/g, " "));
const { AWS_REGION, ACCOUNT_ID, stateMachineName } = process.env;
When this Lambda gets called, it receives the bucket name and the file name (key) of the file which was uploaded to the S3 bucket; we get these details from the event object.
We are also fetching environment variables like the region, AWS account ID and state machine name to form the ARN of our state machine and start its execution.
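The replace and decodeURIComponent calls are there because S3 URL-encodes object keys in event notifications, with spaces sent as "+". Here is a minimal sketch of the same parsing against a hand-written event; parseS3Record is a hypothetical helper name and the sample event is trimmed down to the fields used in this post.

```javascript
// Hypothetical helper: pulls bucket and decoded key out of an S3 event.
function parseS3Record(event) {
  const record = event.Records[0];
  const bucket = record.s3.bucket.name;
  const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, " "));
  return { bucket, key };
}

// Trimmed-down, made-up event; real events carry many more fields.
const sampleEvent = {
  Records: [
    {
      s3: {
        bucket: { name: "my-bucket-34" },
        object: { key: "my+scan%281%29.png" },
      },
    },
  ],
};

console.log(parseS3Record(sampleEvent));
// { bucket: 'my-bucket-34', key: 'my scan(1).png' }
```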
Starting execution of state machine
const params = {
  stateMachineArn: `arn:aws:states:${AWS_REGION}:${ACCOUNT_ID}:stateMachine:${stateMachineName}`,
  input: JSON.stringify({ bucket, key })
};
await stepfunctions.startExecution(params).promise();
Here we are just calling the startExecution function to start the execution of our state machine.
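To check what actually gets sent, the parameter building can be pulled into a pure function and run locally. This is a sketch with placeholder values: buildStartExecutionParams is a hypothetical helper, and the region and account ID below are examples only. In a deployed Lambda, AWS_REGION is set by the runtime, while ACCOUNT_ID and stateMachineName come from the environment block in serverless.yml.

```javascript
// Hypothetical helper: builds the parameters passed to startExecution.
function buildStartExecutionParams(env, payload) {
  const { AWS_REGION, ACCOUNT_ID, stateMachineName } = env;
  return {
    stateMachineArn: `arn:aws:states:${AWS_REGION}:${ACCOUNT_ID}:stateMachine:${stateMachineName}`,
    // The execution input has to be a JSON string, not an object.
    input: JSON.stringify(payload),
  };
}

// Placeholder values for illustration only.
const params = buildStartExecutionParams(
  { AWS_REGION: "us-east-1", ACCOUNT_ID: "123456789012", stateMachineName: "newStateMachine" },
  { bucket: "my-bucket-34", key: "photo.jpg" }
);

console.log(params.stateMachineArn);
// arn:aws:states:us-east-1:123456789012:stateMachine:newStateMachine
console.log(params.input);
// {"bucket":"my-bucket-34","key":"photo.jpg"}
```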
Whole code
const AWS = require("aws-sdk");
const stepfunctions = new AWS.StepFunctions();

exports.triggerStateMachine = async (event) => {
  const bucket = event.Records[0].s3.bucket.name;
  const key = decodeURIComponent(event.Records[0].s3.object.key.replace(/\+/g, " "));
  const { AWS_REGION, ACCOUNT_ID, stateMachineName } = process.env;
  try {
    const params = {
      stateMachineArn: `arn:aws:states:${AWS_REGION}:${ACCOUNT_ID}:stateMachine:${stateMachineName}`,
      input: JSON.stringify({ bucket, key })
    };
    await stepfunctions.startExecution(params).promise();
  } catch (e) {
    console.log(e);
  }
};
Conclusion
Congratulations if you have made it this far. You now have a system where uploading an image to your S3 bucket extracts all the text from that image, generates a pdf file from it and uploads that file to another S3 bucket, all orchestrated by AWS Step Functions.