PDF OCR via React, Django REST Framework, and Heroku — Part 1: Set up and Starting On the Back End

Joseph Cardenas
16 min read · Jul 30, 2020

I thought I would write this tutorial as a way to help myself better understand a project I was working on. What we’re trying to do here is create a system where you can upload a PDF via the web and have it OCR’d for you. I’ve chosen the Django REST Framework for running the back end and React for creating the front end. Both, for now, will be the minimum viable product for getting this project up and running.

Unlike many other Django/React tutorials, which combine the front and back end under a single folder, I’ll keep the React front end and the Django back end as completely separate projects. Why? Keeping them separate simplifies both projects, as you don’t have everything lumped together in one big directory. Each project can also move at its own pace and have its own set of contributors. Deploys for each project become simpler and more frequent as well, as neither side of the overall project is held up by the other.

Some of the basic prerequisites are a familiarity with Python and PowerShell in Windows 10 (because not everyone codes on Ubuntu 😉).

Finally, apologies if this first section runs a bit long for some folks. I wanted to cover a l

Prerequisites

First off, make sure you have at least React and Django installed on your system. Installation of both is pretty straightforward, but you’ll need both Node and Python installed first.

Node

Go to https://nodejs.org/en/ and you should see a page with the latest versions of Node.js, including the current LTS release.

Current versions of node.js as of writing.

We don’t need the most cutting-edge features for this project, so we’ll go with the Long Term Support (“LTS”) version. From this point onward you’ll download and progress through the installer, which is pretty straightforward.

Now that you have Node installed you have access to npm, which is the Node Package Manager (i.e. what allows users to download Node software). In addition to npm, you also get access to npx. Basically, npx allows you to run create-react-app and generate the front end for this project. You can read about the differences between npm and npx here.

Python

Much like the Node.js installer, Windows users will need to download the most recent version of the Python installer.

Go to https://www.python.org/downloads/ and download the latest stable Windows release. During install, add Python to PATH.
The current version of python for download, as of writing.

Also like the Node.js installer, the Python installer is pretty straightforward. Note: I would recommend allowing the installer to add Python to your system’s PATH to make running Python from your command line easier going forward.

Install Django

With Python installed, installing Django is as easy as typing pip install Django into your terminal. If you are having trouble with installing Django, see the Django documentation on getting the software installed on Windows.

Install Poetry

The dependency management and virtual environment for our project will be handled by Poetry. Poetry is an easy way to both create a virtual environment and manage project dependencies. More importantly, you can easily create the requirements.txt file you’ll need later on to upload the back end file to Heroku.

OK, now that we have our basic dependencies installed, we can get to work on our back end.

Dependency Management with Poetry

In terms of dependency management in the Django back end, I decided to go with Poetry on the advice of my mentor. Poetry is a modern dependency management tool that I’ve come to appreciate due to its simplicity and ease of use. Anyone familiar with Pipenv should be comfortable with Poetry. But why bother with a dependency management tool at all? Well, if you already are a professional developer, Poetry will either be good practice or an introduction to a new tool. If you are training to be a developer, then you definitely need to understand how to manage the dependencies of your projects. To read more on this, here is the official guide to managing Python project dependencies. The developers of Poetry give their reasons for making the tool in their official repo:

Why use Poetry? To have a single file to manage and resolve dependencies in Python.

Starting the Back End with Django

For our back end, we will use the Django REST Framework, or DRF.

Installing Django does not automatically install the DRF, so you will need to install it separately. Once that is done we can move on to creating our back end directory with django-admin startproject tutorial_backend . As we will be using Poetry for our virtual environment and dependency management, cd into your new project folder and type poetry init . This will initialize Poetry within the current project without creating any additional directories.

When you run poetry init you are prompted to fill out several values which will make up the pyproject.toml file Poetry uses to manage your dependencies:

  • Your Package name is the name of the directory you ran the poetry init command within. Leave this value as is per what’s in square brackets unless you want to change it.
  • The version is kept as-is.
  • The description is whatever you want it to be.
  • Feel free to set yourself as the author .
  • I went with MIT as my license (or whatever you prefer).
  • Go with the standard Python version noted in square brackets.
  • Now, when asked to define your main dependencies interactively I went with no , as I found it easier to add them later as a group.
  • We will not define the development environment interactively, so again answer no.
  • Poetry will show you the dependencies you’ve already specified, which should be nothing more than python = “^3.8” .
  • Finally, type yes or hit return.
Defining fields within your pyproject.toml

Completing these steps will result in Poetry creating our pyproject.toml file. We’ll now define our dependencies with poetry add . Stay within your tutorial_backend directory and type the following:

poetry add gunicorn OCRmyPDF django djangorestframework django-heroku django-cors-headers whitenoise pytesseract ghostscript

Clearly we’ll need Django and Django REST Framework, but the rest will need some explanation.

  • gunicorn is a WSGI web server. Heroku will use it to run the back end once we deploy.
  • OCRmyPDF is what we will use on the back end of the project to do the actual heavy lifting of performing optical character recognition on the documents we upload.
  • While we already needed Django installed to create our project, we’ll still need it when running commands within Poetry’s virtual environment.
  • The django-heroku package will be needed when we upload the project to Heroku. It is used in configuring Django to deploy to Heroku.
  • The django-cors-headers package adds Cross-Origin Resource Sharing (CORS) headers to responses, which is what lets a front end served from a different origin (our separate React project) call the back end.
  • The whitenoise package is used to serve what are called static files in Django.
  • The pytesseract and ghostscript packages are Python interfaces to Tesseract and Ghostscript, the tools that support the OCR work.
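Once poetry add finishes, a quick sanity check can confirm that the packages are importable from inside Poetry’s virtual environment. The sketch below is one way to do that using only the standard library. The helper name check_imports and the file name check_imports.py are hypothetical, and note that several packages are imported under a different name than the one used to install them.

```python
from importlib.util import find_spec

# Import names for the packages added with `poetry add`.
# The install name and the import name are not always the same.
IMPORT_NAMES = [
    "gunicorn",
    "ocrmypdf",        # installed as OCRmyPDF
    "django",
    "rest_framework",  # installed as djangorestframework
    "django_heroku",   # installed as django-heroku
    "corsheaders",     # installed as django-cors-headers
    "whitenoise",
    "pytesseract",
    "ghostscript",
]


def check_imports(names):
    """Map each top-level module name to True if Python can locate it."""
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_imports(IMPORT_NAMES).items():
        print(f"{name:15} {'ok' if found else 'MISSING'}")
```

Saved as check_imports.py in the project root, this could be run with poetry run python check_imports.py so that it uses the virtual environment Poetry manages rather than your system Python.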

In addition to the dependencies listed in our pyproject.toml, we also have the poetry.lock file, which records the exact versions Poetry resolved so that the poetry install command reproduces the same environment. We can leave the poetry.lock file alone for now.

Also, as we will create a separate app within our main project, type django-admin startapp ocr.

So, this is what our project structure looks like right now:

Picture of project structure presented in VS Code. Text version of the structure is below.
Current project structure

Viewed another way, via the tree /f in PowerShell:

│   manage.py
│   poetry.lock
│   pyproject.toml
│
├───ocr
│   │   admin.py
│   │   apps.py
│   │   models.py
│   │   tests.py
│   │   views.py
│   │   __init__.py
│   │
│   └───migrations
│           __init__.py
│
└───tutorial_backend
        asgi.py
        settings.py
        urls.py
        wsgi.py
        __init__.py

To start off, double check you are in the main tutorial_backend directory and run python manage.py runserver to start Django’s development server. Assuming there are no problems, head over to 127.0.0.1:8000 and you should see the following:

The default Django landing page.

This is something we are going to do pretty often to make sure we don’t break anything and have to trace back through a bunch of steps in order to repair the damage.

Set Up Our Project on GitHub

Before we go too far, we ought to make sure our project is on GitHub. This is not only a good backup of our project, it also gives us the power to revert back to a previous version of the project if we goof anything up. So, if you don’t have an account on GitHub, go and set one up now.

Protect Sensitive Variables With a .env File

Before we push our code to GitHub, we need to be aware of a basic security setting for Django even for small projects like this. The SECRET_KEY in our settings.py file is actually very important to the security of our entire app, so it’s best to start the habit of securing it. You can read some more about it here, but we need to take steps to secure it and make sure it’s safe. To do this, we’ll secure it (and other sensitive variables) in a .env file using the python-dotenv package. It is essential to enter the .env file into your .gitignore that is
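As a rough sketch of the reading side of this setup, the helper below pulls SECRET_KEY out of the environment and fails loudly when it is missing. The function name read_secret_key is hypothetical, and the python-dotenv call is shown only in comments so the snippet runs with the standard library alone. Treat it as one possible arrangement rather than the exact code used in this project.

```python
import os


def read_secret_key(environ=os.environ):
    """Return SECRET_KEY from the given environment mapping, or fail loudly."""
    key = environ.get("SECRET_KEY", "")
    if not key:
        raise RuntimeError("SECRET_KEY is not set; define it in your .env file")
    return key


# In settings.py this would replace the hard-coded key, roughly:
#
#   from dotenv import load_dotenv   # from the python-dotenv package
#   load_dotenv()                    # copies KEY=value pairs from .env into os.environ
#   SECRET_KEY = read_secret_key()
#
# The .env file itself would hold a single line such as:
#
#   SECRET_KEY=your-long-random-string
```

Failing at start-up when the key is missing is a deliberate choice here: it is easier to notice than a site that quietly runs with an empty key.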

Create Your Repo

Now that you have your account set up and SECRET_KEY secured, go create a new repository using the green button on the upper left of the main github.com page:

Make a new repo click the green “New” button in the upper left of the GitHub main page or upper right of “Repositories” page.
Use the green button in the upper left of the main page to create a new repository. You can also do this from the “Your Repositories” page.

Let’s name the repository after our main project directory, tutorial_backend:

Name and create your repository.
Name your new repo “tutorial_backend”

After you hit the “Create repository” button, the next page will outline the steps you need to take to set up your repository via the command line:

Please note that you should be connecting to your own directory, not mine. I’m simply listing the tutorial_backend on my own account for example purposes. When you run these steps in GitHub, the instructions below should be within your own account.

Steps to follow in order to set up your repo via the command line.
The series of commands for setting up a new repo via the command line.
echo "# tutorial_backend" >> README.md
git init
git add README.md
git commit -m "first commit"
git remote add origin https://github.com/[your_username]/tutorial_backend
git push -u origin master

If you follow all the steps above then reload the instruction page, you ought to be greeted with an empty repo page:

Brand new, empty repo.
Your empty repo page.

Create a .gitignore file

An extra step I’ll add here is to create a .gitignore file, which is crucial to keeping both sensitive and non-essential data out of our repository. At the root of our project (in Django, that’s where manage.py lives) we’ll create a new file and call it .gitignore. Now we’ll go here and copy the sample .gitignore and paste it into our new file. That should cover all our bases, especially the .env file we’ll use to store credentials! The reason we need to do this is that certain providers will disable our access to their services if they detect user credentials left out in the open. At best, this leads to wasted time resetting those credentials; at worst, our accounts could be thoroughly compromised.

Push All the Things 😉

To push the contents of our Django project to our new repo, type git add ., then git commit -m "commit of django project", and then git push . You should now be greeted with the following if you reload the repo’s page:

Tutorial repo after having contents pushed to it via “git add .”
I added the .gitignore section above after taking this screenshot. Sorry!

Commit from vim

If you do a git commit without the -m, you will likely be thrown into a text editor you have little experience with and may not know how to exit. While it depends on your OS and configuration, git commit will often start the Vim text editor by default:

The Vim text editor you drop into after typing just “git commit”

Commands in Vim are typed rather than chosen from a GUI. Vim has different “modes” for when you want to enter text or when you want to enter commands. For our purposes, press i to enter insert mode, then type your commit message. Once done, press Esc to return to command mode. Now to save your text and exit Vim, type :wq and hit Enter. Your commit should be recorded:

A successful commit message after exiting Vim.

You can read a great deal more about how to use GitHub from a whole range of sources, but just remember to push your work on a regular basis!

Our next step will be to go and update our settings.py file with our new app and other dependencies. Just because we have certain dependencies installed doesn’t mean Django is aware of them, so when we create new apps in Django and/or install new dependencies we often need to inform our Django app via the settings.py file. Add the following entries to the end of INSTALLED_APPS:

'rest_framework', (the Django REST Framework)

'corsheaders', (another dependency we added)

'ocr', (the app we just created)

'whitenoise', (another dependency)

Now, your INSTALLED_APPS should look like this:

INSTALLED_APPS = [
    'django.contrib.admin',
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'django.contrib.sessions',
    'django.contrib.messages',
    'django.contrib.staticfiles',
    'rest_framework',
    'corsheaders',
    'ocr',
    'whitenoise',
]

Now, in the MIDDLEWARE section, enter the following under the 'django.middleware.security.SecurityMiddleware' line:

'whitenoise.middleware.WhiteNoiseMiddleware',
'corsheaders.middleware.CorsMiddleware',

Now your MIDDLEWARE section should look like this:

MIDDLEWARE = [
    'django.middleware.security.SecurityMiddleware',
    'whitenoise.middleware.WhiteNoiseMiddleware',
    'corsheaders.middleware.CorsMiddleware',
    'django.middleware.common.CommonMiddleware',
    'django.contrib.sessions.middleware.SessionMiddleware',
    'django.middleware.csrf.CsrfViewMiddleware',
    'django.contrib.auth.middleware.AuthenticationMiddleware',
    'django.contrib.messages.middleware.MessageMiddleware',
    'django.middleware.clickjacking.XFrameOptionsMiddleware',
]

In case you’re wondering, the order of the Middleware components matters:

The order in MIDDLEWARE matters because a middleware can depend on other middleware. For instance, AuthenticationMiddleware stores the authenticated user in the session; therefore, it must run after SessionMiddleware. See Middleware ordering for some common hints about ordering of Django middleware classes.
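To make the ordering rule concrete, here is a small standalone sketch that checks two of the constraints mentioned in this section against our MIDDLEWARE list: SessionMiddleware must come before AuthenticationMiddleware, and WhiteNoiseMiddleware sits directly under SecurityMiddleware. The helper runs_before is a hypothetical name, and this is plain Python operating on a list of strings, not something Django runs for you.

```python
# The MIDDLEWARE list from settings.py, as configured in this tutorial.
MIDDLEWARE = [
    'django.middleware.security.SecurityMiddleware',
    'whitenoise.middleware.WhiteNoiseMiddleware',
    'corsheaders.middleware.CorsMiddleware',
    'django.middleware.common.CommonMiddleware',
    'django.contrib.sessions.middleware.SessionMiddleware',
    'django.middleware.csrf.CsrfViewMiddleware',
    'django.contrib.auth.middleware.AuthenticationMiddleware',
    'django.contrib.messages.middleware.MessageMiddleware',
    'django.middleware.clickjacking.XFrameOptionsMiddleware',
]


def runs_before(middleware, first, second):
    """Return True if `first` is listed before `second`.

    Middleware listed earlier handles an incoming request earlier.
    """
    return middleware.index(first) < middleware.index(second)


# AuthenticationMiddleware relies on the session, so sessions must come first.
session_ok = runs_before(
    MIDDLEWARE,
    'django.contrib.sessions.middleware.SessionMiddleware',
    'django.contrib.auth.middleware.AuthenticationMiddleware',
)

# WhiteNoise was added under the SecurityMiddleware line.
whitenoise_ok = runs_before(
    MIDDLEWARE,
    'django.middleware.security.SecurityMiddleware',
    'whitenoise.middleware.WhiteNoiseMiddleware',
)

print(session_ok, whitenoise_ok)
```

If either check came back False after a later edit to settings.py, that would be a hint to revisit where the new middleware line was pasted.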

Finally, we need to enter a few things at the bottom of the settings.py file. Because we are uploading files we need to define a media directory to hold those files, so underneath the STATIC_URL line we’ll type MEDIA_URL = '/media/' . When we start uploading files, this folder will get created. We also need to define a root directory for media files, so also type MEDIA_ROOT = os.path.join(BASE_DIR, 'media') .

Now, to make sure everything actually still works, we ought to run python manage.py runserver again. If you want, you can leave this running in a separate tab or window and the Django server will automatically reload as you save changes to files.

This is cool and all, but we need our Django app to do something. We’ll get on our way by modifying the urls.py file within the tutorial_backend directory. First, add include to the end of the line reading from django.urls import path . Then under the line reading path('admin/', admin.site.urls) add path('', include('ocr.urls')) . What this is doing is connecting us to the urls.py file within our ocr app. The file should look like this now:

from django.contrib import admin
from django.urls import path, include

urlpatterns = [
    path('admin/', admin.site.urls),
    path('', include('ocr.urls')),
]

Securing the Admin Panel

Right above we see the stock ‘admin/’ login page, which offers some degree of security but not quite the right amount we ought to have. With a few simple tweaks we can increase the security of our web app by simply obscuring the official admin page and setting up a false one for would-be attackers.

Setting Up the django-admin-honeypot package

First, we need to secure our 'admin/' page from potential attackers. One way to test whether a site is made with Django is to see if 'admin/' leads anywhere. If an attacker knows what technology a site was made with, they can narrow down possible exploits and attacks. They could also directly attack our admin page and take control of our site. Rather than give them that chance, we can use the django-admin-honeypot package to obscure our actual 'admin/' page and set up a fake admin login page that alerts us to any attempted break-ins.

Head over to the django-admin-honeypot repo page to get an overview of the project. Now, we’ll type poetry add django-admin-honeypot to add the package to our backend.

Now that the package has been installed we can add it to the INSTALLED_APPS portion of our settings.py file.

INSTALLED_APPS = [
    'django.contrib.admin',
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'django.contrib.sessions',
    'django.contrib.messages',
    'django.contrib.staticfiles',
    'rest_framework',
    'corsheaders',
    'ocr',
    'whitenoise',
    # django-admin-honeypot
    'admin_honeypot',
]

The next step is to update our URL patterns to reflect both the fake admin page and our re-named admin page. Given that the honeypot project’s own repo advises us to pick a different name than secret/ to obscure the real admin page, let’s pick something else:

# fake, honeypot admin
path('admin/', include('admin_honeypot.urls', namespace='admin_honeypot')),
# the actual admin page, using whatever name you want
path('heywhaterver/', admin.site.urls),

Make sure to pick a different custom admin name for your own personal project! 😅

While the above might seem like extra work, we ought to secure our web applications as we make them rather than run back and perform these simple tweaks later. The same logic applies to creating the .env file to contain our sensitive variables.

Updating Our OCR app urls.py file

A urls.py file is not automatically created within a new app, so we’ll go create one within the ocr directory now. Set it up per the below code:

from django.urls import path
from .views import PostViews

urlpatterns = [
    path('', PostViews.as_view(), name='file-upload'),
]

Now, when we run python manage.py runserver, another error will pop up. Now we’ll be warned that the PostViews we’re trying to import does not exist, and it doesn’t because we’ve not created it yet with our views.py file.

We get an error running the Django server, as we’re trying to import a view we’ve not created yet.

The views file is actually going to take a lot of work and itself depends on other files, such as our models.py and serializers.py files, so we’ll create the latter two files first.

Within your ocr sub directory, open up your models.py file. The models in your application define how the application accesses and manages data in the database.

We’re going to create a Python class called Post which will contain three fields: title, content, and file . These fields will contain the title of the file we’re uploading, a description of the content, and the file itself that we’re uploading.

Under where the code says # Create your models here. we first need to define our class, drawing upon Django’s models library:

from django.db import models

# Create your models here.
class Post(models.Model):
    title = models.CharField(max_length=100)
    content = models.TextField()
    file = models.FileField(upload_to='post_files', default='file')

    def __str__(self):
        return self.title

We define our title as a models.CharField with a max_length of 100. Our content variable will be a models.TextField, and the file will be a models.FileField, where we say we want our file uploaded to a subfolder called post_files. This subfolder will actually go within the media subfolder we defined in settings.py.
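To see how upload_to combines with the media settings, the sketch below works out where an uploaded file would end up on disk and the URL Django would generate for it. The BASE_DIR value and the file name scan.pdf are made up for illustration, and the helpers stored_path and public_url are hypothetical. This assumes Django’s default file storage and ignores the renaming Django may do when a file of the same name already exists.

```python
import os

# Hypothetical project location; in settings.py, BASE_DIR is computed for you.
BASE_DIR = os.path.join(os.sep, "projects", "tutorial_backend")

# The two media settings added to settings.py.
MEDIA_URL = "/media/"
MEDIA_ROOT = os.path.join(BASE_DIR, "media")

# The upload_to value from the Post model's FileField.
UPLOAD_TO = "post_files"


def stored_path(filename):
    """Filesystem location: MEDIA_ROOT, then the upload_to folder, then the file."""
    return os.path.join(MEDIA_ROOT, UPLOAD_TO, filename)


def public_url(filename):
    """URL form: MEDIA_URL followed by the upload_to folder and the file name."""
    return f"{MEDIA_URL}{UPLOAD_TO}/{filename}"


print(stored_path("scan.pdf"))
print(public_url("scan.pdf"))
```

The point to take away is that upload_to is always relative to MEDIA_ROOT, so changing MEDIA_ROOT later moves every upload folder with it.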

Open views.py and type class PostViews(APIView): and within that class, simply type the word pass. This keyword makes our new class valid without it doing anything yet. If we ran makemigrations now, we’d get another error, as we’ve not imported APIView. Below the line reading from django.shortcuts import render type from rest_framework.views import APIView . Here we start to see how the rest_framework library we added comes in handy.

from django.shortcuts import render
from rest_framework.views import APIView

# Create your views here.
class PostViews(APIView):
    pass

Now that we’ve created our models, we need to do two things. First, push our code to GitHub; then we need to do some work with the models we just created. Follow the above steps in the GitHub section to push your code, and then type python manage.py makemigrations . A migration is a change to the database schema, or structure, based on how your models are constructed. So when we run makemigrations we are telling Django to look at models.py and make new migrations based on what we’ve done there.

The creation of our Post model in the database.

Now that we’ve run makemigrations, we need to apply those migrations with python manage.py migrate.
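To get a feel for what migrate does with our Post model, the sketch below builds an approximation of the resulting table by hand in an in-memory SQLite database. This is not the exact SQL Django generates (running python manage.py sqlmigrate ocr 0001 would print the real statements). The table name ocr_post follows Django’s default app_model naming, and the sample row is invented.

```python
import sqlite3

# An in-memory database, so nothing touches the project's real db.sqlite3.
conn = sqlite3.connect(":memory:")

# Approximation of the table for the Post model:
# CharField(max_length=100) -> VARCHAR(100), TextField -> TEXT,
# FileField stores the file's relative path as text (max_length defaults to 100).
conn.execute(
    """
    CREATE TABLE ocr_post (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        title VARCHAR(100) NOT NULL,
        content TEXT NOT NULL,
        file VARCHAR(100) NOT NULL
    )
    """
)

# Invented sample row: the file column holds a path relative to MEDIA_ROOT.
conn.execute(
    "INSERT INTO ocr_post (title, content, file) VALUES (?, ?, ?)",
    ("Sample scan", "A scanned letter", "post_files/scan.pdf"),
)

row = conn.execute("SELECT id, title, file FROM ocr_post").fetchone()
print(row)
```

Notice that the database only stores the path to the uploaded file; the file itself lives on disk under the media folder.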

Now, I want to introduce a Git feature to help you better organize your commits: branches, which can be reviewed on GitHub before they are merged back into the main branch.

In your terminal, type git checkout -b run-migrations . This creates a new branch from our main branch and switches us to it in one step. The -b flag is what creates the branch; a plain git checkout run-migrations would fail with an error if no branch of that name existed, as we would have nothing but the master branch to switch to. Using branches helps you divide up work more clearly and gives greater control over what work gets merged into the main, master branch. To check what branch you’re on, you can always type git status, which will also give you a list of files that have been changed and need committing. Now that you’ve confirmed you’re on the run-migrations branch, we can continue.

Run git add and git commit as you normally would. If you try pushing this branch to GitHub with a plain git push, you’ll see that you first need to type git push --set-upstream origin run-migrations so that the new branch is created on GitHub and linked to your local one. Go to your tutorial_backend repo and you’ll see a yellow bar at the top noting the recently pushed branch and offering to open a pull request.

The commit from your new branch will automatically appear in a yellow bar at the top of your repo’s page.

Click the green button saying Compare & pull request to go to the screen where you can merge your pull request into the main branch.

Hit Create pull request. You’ll now have the option to merge your code:

Hit Merge pull request and Confirm merge and you did it! Feel free to hit the button that deletes this new branch, as we won’t be needing to run any migrations for a while.

Now we can run python manage.py runserver and head to localhost:8000 in order to make sure everything still works:

Typing “localhost:8000” into your browser bar will get you a page with the text Post Views at the top. Success!
A working web page, clearly using Django REST Framework.

Now you can switch back to the main branch by typing git checkout master, then type git pull to bring the changes you just merged down to the main branch on your computer. You’ll be all up to date!

Thank you for sticking with a rather long tutorial! Now that we have the back end set up and working, we can move on to testing our back end with Postman to see if we can upload something that gets a back end response.
