PDF OCR via React, Django REST Framework, and Heroku, Part 6: Upload to AWS

Joseph Cardenas
8 min read · Sep 25, 2020

Update — 4.11.21!

I think I’ve made an error regarding the use of subprocess to run OCR on our files, so while you can OCR files locally you don’t actually get an OCR’d file stored in AWS. Sorry about this, folks. 😩 I’m working on a way to correct this. If you have any suggestions on how to make things work drop me a line.

Why We Need AWS

Heroku has an ephemeral file storage system, essentially meaning that when the Heroku server is shut down, our files are lost. As such, we’ll need to connect to another service to have any sort of lasting file storage for our OCR’d PDFs. Of course, there are a variety of storage services to choose from, but I found it easiest to get my project off the ground with AWS.

Install django-storages and boto3

In order to get our back end working with AWS, we’ll need to install the django-storages and boto3 packages. A simple poetry add django-storages and poetry add boto3 is all we need to install the packages into our project, along with adding ‘storages’ to the INSTALLED_APPS section of settings.py:

INSTALLED_APPS = [
    'django.contrib.admin',
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'django.contrib.sessions',
    'django.contrib.messages',
    'django.contrib.staticfiles',
    'rest_framework',
    'corsheaders',
    'ocr',
    'whitenoise',
    'storages',
]

While django-storages will be called out explicitly in our code, boto3 is no less important as it allows back end interaction with Amazon’s S3 storage system. There are other AWS-specific variables we need to fill out in settings.py, but we need to set up our account and S3 bucket in order to do this.

Setting Up Your AWS S3 Bucket

I’ll let the reader sign up for AWS on their own, as this isn’t too complicated. What will be more complicated is setting up your S3 bucket in such a way that you are able to both post and get content.

Create a New User

For security, we’ll have you create a non-root user to manage your S3 bucket.

Go to the “Services” section in the upper right of the screen and then click on the IAM option.

The IAM option is right under the “Security, Identity, and Compliance” section, to the left of the “Internet of Things” section.

Click “Users” under “IAM Resources” and then click the big blue Add user button to start the process of making a non-root user. Creating at least one non-root user is on the list of priorities Amazon gives you when heading into the IAM section. This is done for security reasons: if an attacker compromises an AWS user and gets to their buckets, at least it isn’t the root user who controls the whole account.

Now, we’ll create a new user called “tutorial_user” and give the new user Programmatic access to the account. We need this level of access because we need access keys in order for our backend server to contact the new user’s S3 bucket.

Create a user called “tutorial_user” and check the box below giving it “Programmatic access”.

Now that we’ve done this, we need to click the “Next: Permissions” button at the bottom of the screen. This brings us to the screen where we apply user permissions. Here, we click the “Create group” button and search for the “AmazonS3FullAccess” policy. We’ll also give our group a name: “tutorial_group”:

After this is all done, hit the blue “Next: Tags” button. We’ll actually skip adding tags and go on to review our new user:

Go on to review our new user.

Now that we have completed creating our user, AWS provides us with our access key ID and secret access key. I recommend downloading the .csv file so you have a backup of both keys. Either way, make sure you have copies of these keys, as they are essential additions to settings.py for accessing the bucket we’ll create.

Create a Bucket

Now we’ll go back to the Services tab, search for “S3”, and hit the blue “Create bucket” button. We’ll name our bucket something like “django-ocr-tutorial-bucket” (once someone uses a particular bucket name, no one else can use the same name):

Make sure to set your region to wherever you may be. Now hit Next through the Properties screen until you reach the Block Public Access screen. For this tutorial, we are going to make our bucket public, so don’t use it to store any sensitive documents.

After we are done setting our permissions, we can review the bucket property and hit the blue “Create bucket” button.
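With the user’s keys and the new bucket in hand, a quick sanity check from Python can confirm everything is wired up. This is a sketch, assuming your keys are stored in the same TUTORIAL_AWS_* environment variables we’ll use later in this post; the boto3 call only runs if both keys are set.

```python
import os

BUCKET = "django-ocr-tutorial-bucket"
ACCESS_KEY = os.environ.get("TUTORIAL_AWS_ACCESS_KEY_ID")
SECRET_KEY = os.environ.get("TUTORIAL_AWS_SECRET_ACCESS_KEY")

if ACCESS_KEY and SECRET_KEY:
    import boto3  # installed earlier with `poetry add boto3`

    s3 = boto3.client(
        "s3",
        aws_access_key_id=ACCESS_KEY,
        aws_secret_access_key=SECRET_KEY,
    )
    # head_bucket raises a ClientError if the bucket is missing or unreachable
    s3.head_bucket(Bucket=BUCKET)
    print("Bucket reachable:", BUCKET)
else:
    print("AWS keys not set; skipping connectivity check")
```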

Set the Parameters for the Bucket

We need to make sure we have full access to the contents of our bucket, or else we won’t be able to post content to it or retrieve that content to view. Next, we need to alter the bucket’s CORS settings:

Hit the “CORS configuration” button under the “Permissions” tab, add the following to the text field, and then hit “Save”:

<?xml version="1.0" encoding="UTF-8"?>
<CORSConfiguration xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <CORSRule>
    <AllowedOrigin>*</AllowedOrigin>
    <AllowedMethod>GET</AllowedMethod>
    <AllowedMethod>POST</AllowedMethod>
    <AllowedMethod>PUT</AllowedMethod>
    <AllowedHeader>*</AllowedHeader>
  </CORSRule>
</CORSConfiguration>
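If you prefer setting this from code rather than clicking through the console, boto3 exposes the same configuration via put_bucket_cors. This is just a sketch: the bucket name matches the one created above, credentials are assumed to be available to boto3, and the actual API call is guarded behind a hypothetical APPLY_CORS environment variable so nothing runs by accident.

```python
import os

# Dict equivalent of the CORS XML above
cors_config = {
    "CORSRules": [
        {
            "AllowedOrigins": ["*"],
            "AllowedMethods": ["GET", "POST", "PUT"],
            "AllowedHeaders": ["*"],
        }
    ]
}

if os.environ.get("APPLY_CORS"):  # opt-in guard (hypothetical env var)
    import boto3  # installed earlier with `poetry add boto3`

    s3 = boto3.client("s3")
    # Applies the rules above to the bucket, replacing any existing CORS config
    s3.put_bucket_cors(
        Bucket="django-ocr-tutorial-bucket",
        CORSConfiguration=cors_config,
    )
```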

What we’re doing here is making sure the bucket allows all the methods we need to make our app work.

We will also want to make sure our bucket is public in order to make access to and from it easier. Go to the Permissions tab and click Bucket Policy, then enter the following:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "PublicRead",
            "Effect": "Allow",
            "Principal": "*",
            "Action": [
                "s3:GetObject",
                "s3:GetObjectVersion"
            ],
            "Resource": "arn:aws:s3:::django-ocr-tutorial-bucket/*"
        }
    ]
}
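The same policy can also be applied with boto3’s put_bucket_policy, which takes the policy as a JSON string. Again, a sketch under the same assumptions: credentials available to boto3, and the call guarded behind a hypothetical APPLY_POLICY environment variable.

```python
import json
import os

bucket = "django-ocr-tutorial-bucket"

# Python-dict version of the bucket policy shown above
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "PublicRead",
            "Effect": "Allow",
            "Principal": "*",
            "Action": ["s3:GetObject", "s3:GetObjectVersion"],
            "Resource": "arn:aws:s3:::%s/*" % bucket,
        }
    ],
}

if os.environ.get("APPLY_POLICY"):  # opt-in guard (hypothetical env var)
    import boto3

    # put_bucket_policy expects a JSON string, not a dict
    boto3.client("s3").put_bucket_policy(
        Bucket=bucket, Policy=json.dumps(policy)
    )
```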

The next step is to make sure we have access to the PDFs we upload. Go to the Access Control List tab; under the Public access heading there is a Group subhead, and under Group, hit the circle next to the Everyone category to edit its access.

This will launch a popup window where we check the top two boxes: “List objects” and “Write objects”.

Keep in mind, we don’t necessarily want to keep our S3 buckets public by default as this is how sensitive information is often discovered. That being said, please do not upload anything but generic pdfs to this public S3 bucket (I’ll have an addition on uploading to a secure bucket later, but for now let’s just get our project off the ground.)

Add AWS-specific variables to our settings.py file

Now that we have the necessary software added and our bucket set up, we need to fill out our settings.py file in order to talk to AWS. Now, remember when we set up environment variables for our app’s SECRET_KEY? We are going to do the same with our AWS access keys. (Refer back to the first lesson if you forgot how to do this. 😄)

With our access keys now safely stored as environment variables, enter the following block of code at the bottom of settings.py :

# AWS settings
AWS_ACCESS_KEY_ID = os.environ.get('TUTORIAL_AWS_ACCESS_KEY_ID')
AWS_SECRET_ACCESS_KEY = os.environ.get('TUTORIAL_AWS_SECRET_ACCESS_KEY')
AWS_STORAGE_BUCKET_NAME = 'django-ocr-tutorial-bucket'
AWS_S3_CUSTOM_DOMAIN = '%s.s3.amazonaws.com' % AWS_STORAGE_BUCKET_NAME
AWS_S3_OBJECT_PARAMETERS = {
    'CacheControl': 'max-age=86400',
}
AWS_LOCATION = 'static'
STATICFILES_DIRS = [
    os.path.join(BASE_DIR, 'mysite/static'),
]
STATIC_URL = 'https://%s/%s/' % (AWS_S3_CUSTOM_DOMAIN, AWS_LOCATION)
STATICFILES_STORAGE = 'storages.backends.s3boto3.S3Boto3Storage'
DEFAULT_FILE_STORAGE = 'storages.backends.s3boto3.S3Boto3Storage'
AWS_DEFAULT_ACL = None
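To see where static files will now be served from, here is the URL those settings compose; it’s pure string formatting, no AWS calls involved:

```python
AWS_STORAGE_BUCKET_NAME = 'django-ocr-tutorial-bucket'
AWS_S3_CUSTOM_DOMAIN = '%s.s3.amazonaws.com' % AWS_STORAGE_BUCKET_NAME
AWS_LOCATION = 'static'

# Django will prefix every static asset with this URL
STATIC_URL = 'https://%s/%s/' % (AWS_S3_CUSTOM_DOMAIN, AWS_LOCATION)
print(STATIC_URL)
# https://django-ocr-tutorial-bucket.s3.amazonaws.com/static/
```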

Just like when we stored our actual Django SECRET_KEY in our Heroku Config vars, we will do the same for our AWS keys:

Add your actual AWS access and secret access keys in your Heroku Config vars section.

Given the changes we’ve made, we can now use Postman to test the updates to our back end to see if uploaded files are sent to our S3 bucket.

After a git commit and push, let’s provide Postman our (Heroku) back end tutorial URL:

Run Postman as you did earlier to test the back end, but now the file will go to the S3 bucket.
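If you’d rather script this check than click through Postman, a small Python sketch does the same POST. The field names title, content, and file match the serializer from earlier parts; requests is a third-party package, so the network call is guarded behind a hypothetical RUN_UPLOAD_TEST environment variable.

```python
import os

BACKEND_URL = "https://django-ocr-backend.herokuapp.com/"  # your Heroku app URL

if os.environ.get("RUN_UPLOAD_TEST"):  # opt-in guard (hypothetical env var)
    import requests  # third-party: pip install requests

    with open("sample.pdf", "rb") as f:
        # Multipart POST mirroring the Postman request
        resp = requests.post(
            BACKEND_URL,
            data={"title": "sample", "content": "test upload"},
            files={"file": f},
        )
    print(resp.status_code)
```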

Now, run the back end Django server and you should see your uploaded file appear and be available for download:

You should see the back end connect to AWS and be able to download the file you just uploaded. Try it!

Great! Given that all the work we did was on the back end, our React front end doesn’t know anything has changed. That being the case, we really ought to give people a clear way to obtain the files they’ve uploaded.

Edit the Front End to Display Files

Open up your front end project in an editor. To start our edit of the front end, we’ll create a folder called components within our src directory. This allows for better organization, as the code allowing the list display is technically a separate component of the front end. We’ll take this opportunity to clean up our project by separating function components into separate JS files.

Within the components directory, create a file called FileList.js. This will hold the code that displays the files you’ve uploaded. Now create a file called UploadForm.js, which will hold the code governing, well, our basic little upload form. We’re not altering the latter in any way, so copy the UploadForm class code, along with the export default UploadForm; at the bottom, into UploadForm.js, then save. Now go back to App.js and add the following:

import React, { Component } from 'react';
import UploadForm from './components/UploadForm';
import FileList from './components/FileList';

class App extends Component {
    render() {
        return (
            <div>
                <UploadForm />
                <FileList />
            </div>
        );
    }
}

export default App;

We don’t want all our code in one App.js file, which would look messy and become unwieldy. Rather, it’s better to think of things as reusable and replaceable functional components.

Your project directory should look like this:

Your components directory should be within the src directory and ought to contain FileList.js and UploadForm.js.
The components directory should have the two files listed above. App.js still sits only within the src directory.

Now our new file list will need the react-grid package installed, so in your command line type npm install react-grid to get this done. Then enter the following into FileList.js:

import React, { Component } from 'react';

class FileList extends Component {
    state = {
        fileList: []
    };

    async componentDidMount() {
        try {
            // fetch the data from the api
            const res = await fetch('https://django-ocr-backend.herokuapp.com/');
            const fileList = await res.json();
            this.setState({
                fileList
            });
        } catch (e) {
            console.log(e);
        }
    }

    render() {
        return (
            <div>
                {this.state.fileList.map(item => (
                    <div key={item.id}>
                        <h1>{item.title}</h1>
                        <p>{item.content}</p>
                        <p><a href={item.file} download>file</a></p>
                    </div>
                ))}
            </div>
        );
    }
}

export default FileList;

Now, push your changes to GitHub and then log into Heroku to start up your new front end. You should be able to both see your recent uploads to the Django back end and download the OCR’d file you uploaded.

Yay! 😄

Now that we can reliably store, perform some work on, and then retrieve files, we technically have a full-fledged web app up and running! Now we iterate, so watch this space for more updates.
