PDF OCR via React, Django REST Framework, and Heroku — Part 1: Set up and Starting On the Back End
I wrote this tutorial as a way to help myself better understand a project I was working on. What we’re trying to do here is create a system where you can upload a PDF via the web and have it OCR’d for you. I’ve chosen the Django REST Framework for running the back end and React for creating the front end. Both, for now, will be the minimum viable product for getting this project up and running.
Unlike many other tutorials combining React and Django, which put both the front and back end under a single folder, I’ll keep the React front end and the Django back end as completely separate projects. Why? Keeping them separate simplifies each project, as you don’t have everything lumped together in one big directory. Each project can also move at its own pace and have its own set of contributors. Deploys become simpler and more frequent as well, as neither side of the overall project is held up by the other.
Some of the basic prerequisites are a familiarity with Python and PowerShell in Windows 10 (because not everyone codes on Ubuntu 😉).
Finally, apologies if this first section runs a bit long for some folks. I wanted to cover a l
Prerequisites
First off, make sure you have at least React and Django installed on your system. Installation of both is pretty straightforward, but you’ll need both Node and Python installed first.
Node
Click the link above and you should see a page with the latest versions of node.js.
We don’t need the most cutting-edge features for this project, so we’ll go with the Long Term Support (“LTS”) version. From this point onward you’ll download and progress through the installer, which is pretty straightforward.
Now that you have Node installed you have access to npm, which is the Node Package Manager (i.e. what allows users to download Node software). In addition to npm, you also get access to npx. Basically, npx allows you to run create-react-app and generate the front end for this project. You can read about the differences between npm and npx here.
Python
Much like the Node.js installer, Windows users will need to download the most recent version of the Python installer.
Also like the Node.js installer, the Python installer is pretty straightforward. Note: I would recommend allowing the installer to add Python to your system’s PATH to make running Python from your command line easier going forward.
Install Django
With Python installed, installing Django is as easy as typing pip install Django into your terminal. If you are having trouble with installing Django, see the Django documentation on getting the software installed on Windows.
Install Poetry
The dependency management and virtual environment for our project will be handled by Poetry. Poetry is an easy way to both create a virtual environment and manage project dependencies. More importantly, you can easily create the requirements.txt file you’ll need later on to upload the back end to Heroku.
OK, now that we have our basic dependencies installed, we can get to work on our back end.
Dependency Management with Poetry
In terms of dependency management in the Django back end, I decided to go with Poetry on the advice of my mentor. Poetry is a modern dependency management tool that I’ve come to appreciate due to its simplicity and ease of use. Anyone familiar with Pipenv should be comfortable with Poetry. But why bother with a dependency management tool at all? Well, if you already are a professional developer, Poetry will either be good practice or an introduction to a new tool. If you are training to be a developer, then you definitely need to understand how to manage the dependencies of your projects. To read more on this, here is the official guide to managing Python project dependencies. The developers of Poetry give their reasons for making the tool in their official repo.
Starting the Back End with Django
For our back end, we will use the Django REST Framework, or DRF.
Installing Django does not automatically install the DRF, so you will need to install it separately. Once that is done we can move on to creating our back end directory with django-admin startproject tutorial_backend. As we will be using Poetry for our virtual environment and dependency management, cd into your new project folder and type poetry init. This will initialize Poetry within the current project without creating any additional directories.
When you run poetry init you are prompted to fill out several values which will make up the pyproject.toml file Poetry uses to manage your dependencies:
- Your Package name is the name of the directory you ran the poetry init command within. Leave this value as is per what’s in square brackets unless you want to change it.
- The version is kept as-is.
- The description is whatever you want it to be.
- Feel free to set yourself as the author.
- I went with MIT as my license (or whatever you prefer).
- Go with the standard Python version noted in square brackets.
- Now, when asked to define your main dependencies interactively I went with no, as I found it easier to add them later as a group.
- We will not define the development environment interactively, so again answer no.
- Poetry will show you the dependencies you’ve already specified, which should be nothing more than python = "^3.8".
- Finally, type yes or hit return.
Completing these steps will result in Poetry creating our pyproject.toml file. We’ll now define our dependencies with poetry add. Stay within your tutorial_backend directory and type the following:
poetry add gunicorn OCRmyPDF django djangorestframework django-heroku django-cors-headers whitenoise pytesseract ghostscript
Clearly we’ll need Django and Django REST Framework, but the rest will need some explanation.
- gunicorn is a WSGI web server; it is what will actually run our Django app once the back end is deployed to Heroku.
- OCRmyPDF is what we will use on the back end of the project to do the actual heavy lifting of performing optical character recognition on the documents we upload.
- While we already needed Django installed to create our project, we’ll still need it when running commands within Poetry’s virtual environment.
- The django-heroku package will be needed when we upload the project to Heroku. It is used in configuring Django to deploy to Heroku.
- The django-cors-headers package adds Cross-Origin Resource Sharing (CORS) headers to our responses, which lets the React front end, served from a different origin, make requests to our API. It does not handle authentication.
- The whitenoise package is used to serve what are called static files in Django.
- The pytesseract and ghostscript packages are dependencies needed for the OCR step.
In addition to all the dependencies in our pyproject.toml, we also have the poetry.lock file, which records the exact version of every package Poetry resolved. Running the poetry install command installs the dependencies listed in pyproject.toml at those locked versions. We can leave the poetry.lock file alone for now.
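As a quick sanity check that the packages really landed in Poetry's virtual environment, a small script along these lines can help. This is only a sketch: the file name check_deps.py and the helper installed_versions are made up for illustration, and the distribution names are assumed to match what was passed to poetry add.

```python
from importlib import metadata

# Distribution names as passed to `poetry add` earlier (assumed spelling).
EXPECTED = [
    "gunicorn", "ocrmypdf", "django", "djangorestframework",
    "django-heroku", "django-cors-headers", "whitenoise",
    "pytesseract", "ghostscript",
]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, version in installed_versions(EXPECTED).items():
        print(f"{name:22} {version or 'NOT INSTALLED'}")
```

Saved as check_deps.py next to pyproject.toml, it could be run with poetry run python check_deps.py so that it executes inside the project's virtual environment rather than your system Python.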
Also, as we will create a separate app within our main project, type django-admin startapp ocr.
So, this is what our project structure looks like right now, viewed via tree /f in PowerShell:
│   manage.py
│   poetry.lock
│   pyproject.toml
│
├───ocr
│   │   admin.py
│   │   apps.py
│   │   models.py
│   │   tests.py
│   │   views.py
│   │   __init__.py
│   │
│   └───migrations
│           __init__.py
│
└───tutorial_backend
        asgi.py
        settings.py
        urls.py
        wsgi.py
        __init__.py
To start off, double-check that you are in the main tutorial_backend directory and run python manage.py runserver to start Django’s development server. Assuming there are no problems, head over to 127.0.0.1:8000 and you should see Django’s default welcome page.
This is something we are going to do pretty often to make sure we don’t break anything and have to trace back through a bunch of steps in order to repair the damage.
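Since we'll be repeating this check often, it can also be scripted. The following is a minimal sketch using only the standard library; the function name server_status is hypothetical, and it assumes the development server is on its default address of 127.0.0.1:8000.

```python
import urllib.error
import urllib.request

# Ignore any system proxy settings; we are only talking to localhost.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def server_status(url="http://127.0.0.1:8000/", timeout=3):
    """Return the HTTP status code for url, or None if nothing answers."""
    try:
        with _opener.open(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, just not with a success status (e.g. 404 or 500).
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, and similar failures.
        return None


if __name__ == "__main__":
    status = server_status()
    if status is None:
        print("No response from the development server")
    else:
        print("Development server responded with status", status)
```

Run it from a second terminal while python manage.py runserver is active in the first.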
Set Up Our Project on GitHub
Before we go too far, we ought to make sure our project is on GitHub. This is not only a good backup of our project, it also gives us the power to revert back to a previous version of the project if we goof anything up. So, if you don’t have an account on GitHub, go and set one up now.
Protect Sensitive Variables With a .env File
Before we push our code to GitHub, we need to be aware of a basic security setting for Django, even for small projects like this. The SECRET_KEY in our settings.py file is very important to the security of our entire app, so it’s best to start the habit of protecting it. You can read some more about it here. To keep it out of our repository, we’ll store it (and other sensitive variables) in a .env file and load it using the python-dotenv package, which you can add with poetry add python-dotenv. It is essential to list the .env file in your .gitignore.
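To make the idea concrete, here is a rough sketch of what reading SECRET_KEY from a .env file amounts to. The helpers load_env_file and require_env are hypothetical names, and load_env_file is only a simplified standard-library stand-in so the example runs on its own; in the actual project you would call python-dotenv's load_dotenv() instead.

```python
import os


def load_env_file(path=".env"):
    """Copy KEY=VALUE pairs from a .env file into os.environ.

    Simplified stand-in for python-dotenv's load_dotenv(); values already
    present in the environment are left untouched.
    """
    if not os.path.exists(path):
        return
    with open(path, encoding="utf-8") as handle:
        for raw_line in handle:
            line = raw_line.strip()
            # Skip blank lines, comments, and anything that is not KEY=VALUE.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            os.environ.setdefault(key.strip(), value.strip().strip("'\""))


def require_env(name):
    """Return a required setting, failing loudly if it is missing."""
    try:
        return os.environ[name]
    except KeyError:
        raise RuntimeError(f"Set {name} in your .env file or environment") from None


# In settings.py, these two lines would replace the hard-coded SECRET_KEY:
#     load_env_file()          # or: load_dotenv() from python-dotenv
#     SECRET_KEY = require_env("SECRET_KEY")
```

Failing loudly when the variable is missing is a deliberate choice here: a misconfigured deployment stops at startup instead of silently running with an empty key.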
Create Your Repo
Now that you have your account set up and SECRET_KEY
secured, go create a new repository using the green button on the upper left of the main github.com page:
Let’s name the repository after our main project directory, tutorial_backend
:
After you hit the “Create repository” button, the next page will outline the steps you need to take to set up your repository via the command line:
Please note that you should be connecting to your own directory, not mine. I’m simply listing the tutorial_backend on my own account for example purposes. When you run these steps in GitHub, the instructions below should be within your own account.
echo "# tutorial_backend" >> README.md
git init
git add README.md
git commit -m "first commit"
git remote add origin https://github.com/[your_username]/tutorial_backend
git push -u origin master
If you follow all the steps above then reload the instruction page, you ought to be greeted with an empty repo page.
Create a .gitignore file
An extra step I’ll add here is to create a .gitignore file, which is crucial to ensuring both sensitive and non-essential data are kept out of our repository. At the root of our project (in Django, that’s where manage.py lives) we’ll create a new file and call it .gitignore. Now we’ll go here and copy the sample .gitignore and paste it into our new file. That should cover all our bases, especially the .env file we’ll use to store credentials! The reason we need to do this is that certain providers will disable our access to their services if they detect user credentials left out in the open. At best, this leads to wasted time resetting those credentials; at worst, our accounts could be thoroughly compromised.
Push All the Things 😉
To push the contents of our Django project to our new repo, type git add ., then git commit -m "commit of django project", and then git push. You should now be greeted with the following if you reload the repo’s page:
Commit from vim
If you do a git commit without the -m, you will likely be thrown into a text editor you may have little experience with and may not know how to exit. While it depends on your OS and configuration, git commit will often start the Vim text editor by default:
Commands in Vim are typed, not chosen from a GUI. Vim has different “modes” for when you want to enter text or when you want to enter commands. For our purposes, we only need to press i to enter insert mode, where text can be typed. Now, type your commit message. Once done, press ESC to return to command mode. Now, to save your text and exit Vim, type :wq and hit Enter. Your commit should be recorded:
You can read a great deal more about how to use GitHub from a whole range of sources, but just remember to push your work on a regular basis!
Our next step will be to go and update our settings.py file with our new app and other dependencies. Just because we have certain dependencies installed doesn’t mean Django is aware of them, so when we create new apps in Django and/or install new dependencies we often need to inform our Django app via the settings.py file. Add the following entries to the end of the INSTALLED_APPS list:
- 'rest_framework', (the Django REST Framework)
- 'corsheaders', (another dependency we added)
- 'ocr', (the app we just created)
- 'whitenoise', (another dependency)
Now, your INSTALLED_APPS should look like this:
INSTALLED_APPS = [
    'django.contrib.admin',
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'django.contrib.sessions',
    'django.contrib.messages',
    'django.contrib.staticfiles',
    'rest_framework',
    'corsheaders',
    'ocr',
    'whitenoise',
]
Now, in the MIDDLEWARE section, enter the following under the 'django.middleware.security.SecurityMiddleware' line:
'whitenoise.middleware.WhiteNoiseMiddleware',
'corsheaders.middleware.CorsMiddleware',
Now your MIDDLEWARE section should look like this:
MIDDLEWARE = [
    'django.middleware.security.SecurityMiddleware',
    'whitenoise.middleware.WhiteNoiseMiddleware',
    'corsheaders.middleware.CorsMiddleware',
    'django.middleware.common.CommonMiddleware',
    'django.contrib.sessions.middleware.SessionMiddleware',
    'django.middleware.csrf.CsrfViewMiddleware',
    'django.contrib.auth.middleware.AuthenticationMiddleware',
    'django.contrib.messages.middleware.MessageMiddleware',
    'django.middleware.clickjacking.XFrameOptionsMiddleware',
]
In case you’re wondering, the order of the Middleware components matters:
The order in MIDDLEWARE matters because a middleware can depend on other middleware. For instance, AuthenticationMiddleware stores the authenticated user in the session; therefore, it must run after SessionMiddleware. See Middleware ordering for some common hints about ordering of Django middleware classes.
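To see why the order matters, here is a toy model of how a middleware list wraps a view. It is plain Python, not Django's actual implementation; make_middleware and build_chain are names invented for this sketch, and the "Session" and "Authentication" labels merely stand in for the real classes.

```python
def make_middleware(name, log):
    """Build a Django-style middleware factory that records when it runs."""
    def middleware(get_response):
        def handler(request):
            log.append(f"{name}: request")
            response = get_response(request)
            log.append(f"{name}: response")
            return response
        return handler
    return middleware


def build_chain(middleware_list, view):
    """Wrap the view so the first middleware in the list ends up outermost."""
    handler = view
    for middleware in reversed(middleware_list):
        handler = middleware(handler)
    return handler


log = []
chain = build_chain(
    [make_middleware("Session", log), make_middleware("Authentication", log)],
    view=lambda request: "response body",
)
chain("GET /")
print(log)
```

The log shows the request passing through "Session" before "Authentication", and the response travelling back in the reverse order. That is why a middleware that relies on another one has to be listed below it.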
Finally, we need to enter a few things at the bottom of the settings.py file. Because we are uploading files we need to define a media directory to hold those files, so underneath the STATIC_URL line we’ll type MEDIA_URL = '/media/'. We also need to define a root directory for media files, so also type MEDIA_ROOT = os.path.join(BASE_DIR, "media"). This folder will get created when we start uploading files. If your settings.py does not already have import os at the top, add it.
Now, to make sure everything actually still works, we ought to run python manage.py runserver again. If you want, you can leave this running in a separate tab or window and the Django server will automatically reload as you save changes to files.
This is cool and all, but we need our Django app to do something. We’ll get on our way by modifying the urls.py file within the tutorial_backend directory, and later by creating one in our ocr app. First, add include to the end of the line reading from django.urls import path. Then, under the line reading path('admin/', admin.site.urls), add path('', include('ocr.urls')). What this does is connect us to the urls.py file within our ocr app. The file should look like this now:
from django.contrib import admin
from django.urls import path, include

urlpatterns = [
    path('admin/', admin.site.urls),
    path('', include('ocr.urls')),
]
Securing the Admin Panel
Right above we see the stock 'admin/' route, which leads to Django’s admin login page. That login offers some degree of security, but not quite the amount we ought to have. With a few simple tweaks we can increase the security of our web app by obscuring the real admin page and setting up a false one for would-be attackers.
Setting Up the django-admin-honeypot package
First, we need to secure our 'admin/' page from potential attackers. One way to test whether a site is made with Django is to see if 'admin/' leads anywhere. If an attacker knows what technology a site was made with, they can narrow down possible exploits and attacks. They could also directly attack our admin page and take control of our site. Rather than give them that chance, we can use the django-admin-honeypot package to both obscure our actual 'admin/' page and set up a fake admin login page to alert us to any attempted break-ins.
Head over to the django-admin-honeypot repo page to get an overview of the project. Now, we’ll type poetry add django-admin-honeypot
to add the package to our backend.
Now that the package has been installed we can add it to the INSTALLED_APPS
portion of our settings.py
file.
INSTALLED_APPS = [
    'django.contrib.admin',
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'django.contrib.sessions',
    'django.contrib.messages',
    'django.contrib.staticfiles',
    'rest_framework',
    'corsheaders',
    'ocr',
    'whitenoise',
    # django-admin-honeypot
    'admin_honeypot',
]
The next step is to update our URL patterns to reflect both the fake admin page and our re-named admin page. Given that the honeypot project’s own repo advises us to pick a different name than secret/
to obscure the real admin page, let’s pick something else:
# fake, honeypot admin
path('admin/', include('admin_honeypot.urls', namespace='admin_honeypot')),
# the actual admin page, using whatever name you want
path('heywhaterver/', admin.site.urls),
Make sure to pick a different custom admin name for your own personal project! 😅
While the above might seem like extra work, we ought to secure our web applications as we make them rather than run back and perform these simple tweaks later. The same logic applies to creating the .env
file to contain our sensitive variables.
Updating Our OCR app urls.py file
A urls.py file is not automatically created within a new app, so we’ll create one now inside the ocr directory. Set it up per the below code:
from django.urls import path
from .views import PostViews

urlpatterns = [
    path('', PostViews.as_view(), name='file-upload'),
]
Now, when we run python manage.py runserver, another error will pop up. This time we’ll be warned that the PostViews we’re trying to import does not exist, and it doesn’t, because we’ve not yet created it in our views.py file.
The views file is actually going to take a lot of work and itself depends on other files, such as our models.py
and serializers.py
files, so we’ll create the latter two files first.
Within your ocr
sub directory, open up your models.py
file. The models in your application define how the application accesses and manages data in the database.
We’re going to create a Python class called Post
which will contain three fields: title
, content
, and file
. These fields will contain the title of the file we’re uploading, a description of the content, and the file itself that we’re uploading.
Under where the code says # Create your models here.
we first need to define our class, drawing upon Django’s models
library:
from django.db import models

# Create your models here.
class Post(models.Model):
    title = models.CharField(max_length=100)
    content = models.TextField()
    file = models.FileField(upload_to='post_files', default='file')

    def __str__(self):
        return self.title
We define our title as a models.CharField with a max_length of 100. Our content variable will be a models.TextField, and the file will be a models.FileField, where we say we want our file uploaded to a subfolder called post_files. This subfolder will actually go within the media folder we defined in settings.py.
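To make the relationship between MEDIA_ROOT, MEDIA_URL and upload_to concrete, here is a small sketch of how the pieces combine for an uploaded file. The BASE_DIR value is a made-up location, the helper names are hypothetical, and POSIX-style paths are used only to keep the output predictable; on Windows the separators would differ. Django may also alter the file name if one with the same name already exists.

```python
from pathlib import PurePosixPath

# Assumed values mirroring settings.py; BASE_DIR is a made-up location.
BASE_DIR = PurePosixPath("/home/me/tutorial_backend")
MEDIA_ROOT = BASE_DIR / "media"
MEDIA_URL = "/media/"
UPLOAD_TO = "post_files"  # the upload_to argument on the Post.file field


def stored_name(filename):
    """Name recorded in the database, relative to MEDIA_ROOT."""
    return str(PurePosixPath(UPLOAD_TO) / filename)


def disk_path(filename):
    """Where the uploaded file ends up on disk."""
    return str(MEDIA_ROOT / stored_name(filename))


def public_url(filename):
    """URL under which the file would be served."""
    return MEDIA_URL + stored_name(filename)


print(disk_path("scan.pdf"))   # /home/me/tutorial_backend/media/post_files/scan.pdf
print(public_url("scan.pdf"))  # /media/post_files/scan.pdf
```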
Open views.py and type class PostViews(APIView): and within that class, simply type the word pass. This keyword allows our new class to be valid, but doesn’t do anything. We could try running makemigrations now, but we’d get another error as we’ve not imported the APIView class. Below the line reading from django.shortcuts import render type from rest_framework.views import APIView. Our views file will show us how the rest_framework library we added is going to come in handy.
from django.shortcuts import render
from rest_framework.views import APIView

# Create your views here.
class PostViews(APIView):
    pass
Now that we’ve created our models, we need to do two things. First we’ll push our code to GitHub, and then we need to do some work with the models we just created. Follow the above steps in the GitHub section to push your code, and then type python manage.py makemigrations. A migration is a change to the database schema, or structure, based on how your models are constructed. So when we run makemigrations we are telling Django to look at models.py and make new migrations based on what we have done there.
Now that we’ve run makemigrations, we need to apply those migrations with python manage.py migrate.
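To give a feel for what "a change to the database schema" means here, the sketch below creates roughly the table that the first migration for Post would produce, using the standard library's sqlite3 module. The SQL is an approximation written by hand, not Django's exact output; the table name follows Django's default app_model naming (ocr_post), and a FileField is stored as a text column holding the file's relative path.

```python
import sqlite3

# Hand-written approximation of the table created for the Post model.
CREATE_POST_TABLE = """
CREATE TABLE ocr_post (
    id      INTEGER PRIMARY KEY AUTOINCREMENT,
    title   VARCHAR(100) NOT NULL,
    content TEXT NOT NULL,
    file    VARCHAR(100) NOT NULL
)
"""

connection = sqlite3.connect(":memory:")  # throwaway in-memory database
connection.execute(CREATE_POST_TABLE)

# A row like the one Django would insert after an upload (made-up values).
connection.execute(
    "INSERT INTO ocr_post (title, content, file) VALUES (?, ?, ?)",
    ("Invoice", "Scanned invoice", "post_files/invoice.pdf"),
)

# PRAGMA table_info yields one row per column; index 1 is the column name.
columns = [row[1] for row in connection.execute("PRAGMA table_info(ocr_post)")]
print(columns)
```

makemigrations writes the description of this change to a file in ocr/migrations, and migrate is the step that actually runs it against the database.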
Now, I want to introduce a Git feature, branches, to help you better organize your commits. On GitHub, a branch can then be reviewed before it is merged back into the main branch.
In your terminal, type git checkout -b run-migrations. This will not only create a branch from our main branch but also simultaneously switch us to that new branch. The checkout keyword does the switching and the -b flag does the creating: without -b, we would receive an error, as there is no existing branch called run-migrations and nothing but the master branch to switch to. Using branches helps you divide up work more clearly and gives greater control as to what work gets merged into the main, master branch. To check what branch you’re on, you can always type git status, which will also give you a list of files that have been changed and are in need of committing. Now that you’ve confirmed you’re on the run-migrations branch, we can continue.
Run git add and git commit as you normally would. If you try pushing this branch to GitHub, you’ll see you need to type the command git push --set-upstream origin run-migrations in order for GitHub to recognize this branch as one to accept. Go to your tutorial_backend repo and you’ll see a yellow bar at the top saying you have a new pull request to approve.
Click the green button saying Compare & pull request
to go to the screen where you can merge your pull request into the main branch.
Hit Create pull request
. You’ll now have the option to merge your code:
Hit Merge pull request
and Confirm merge
and you did it! Feel free to hit the button that deletes this new branch, as we won’t be needing to run any migrations for a while.
Now we can run python manage.py runserver
and head to localhost:8000
in order to make sure everything still works:
Now you can switch back to the main branch by typing git checkout master, then type git pull (to pull the changes you made back down to the main branch on your computer), and you’ll be all up to date!
Thank you for sticking with a rather long tutorial! Now that we have the back end set up and working, we can move on to testing our back end with Postman to see if we can upload something that gets a back end response.