PDF OCR via React, Django REST Framework, and Heroku, Part 4: OCR our Documents
Now that we have our front and back ends working at the bare minimum level, we need to go back and update the back end so it will actually run OCR on the file we upload to it. After all, there’s no point in just uploading our own documents to ourselves locally! This is going to be a big change in terms of functionality more than lines of code, which is why I’m combining it with preparing for deployment of our apps.
Download and Install Prerequisites
If for any reason you didn’t install the following packages in Step 1, go back and get them added to Poetry, as these packages are going to do the heavy lifting of our OCR app:
OCRmyPDF pytesseract ghostscript
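As far as I know, OCRmyPDF relies on the Tesseract and Ghostscript programs being installed on your machine, not just on the Python packages. Here is a minimal sketch for checking that the command-line tools can be found before going further. The function name missing_tools is hypothetical, and the executable names tesseract and gs are assumptions that hold for typical Linux and macOS installs but may differ on Windows.

```python
import shutil

# Assumed executable names; adjust them for your platform if needed.
REQUIRED_TOOLS = ["ocrmypdf", "tesseract", "gs"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from the list that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All OCR command-line tools were found.")
```

If anything is reported as missing, install it with your system's package manager and run the check again inside your Poetry environment.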
Now go back to the tutorial_backend directory and open the ocr/views.py file. Add the following to your imports:
import ocrmypdf
from subprocess import Popen
Now we will add two lines to our PostViews class. They go right underneath the line reading if posts_serializer.is_valid():
uploaded = posts_serializer.save()
process = Popen(['ocrmypdf', uploaded.file.path, 'output.pdf'])
Your views.py file should now look like this:
from django.shortcuts import render
from rest_framework.views import APIView
from rest_framework.parsers import MultiPartParser, FormParser
from rest_framework.response import Response
from rest_framework import status
import ocrmypdf
from subprocess import Popen
from .serializers import FileSerializer
from .models import Post

# Create your views here.

class PostViews(APIView):
    parser_classes = (MultiPartParser, FormParser)

    def get(self, request, *args, **kwargs):
        posts = Post.objects.all()
        serializer = FileSerializer(posts, many=True)
        return Response(serializer.data)

    def post(self, request, *args, **kwargs):
        posts_serializer = FileSerializer(data=request.data)
        if posts_serializer.is_valid():
            uploaded = posts_serializer.save()
            process = Popen(['ocrmypdf', uploaded.file.path, 'output.pdf'])
            return Response(posts_serializer.data, status=status.HTTP_201_CREATED)
        else:
            print('error', posts_serializer.errors)
            return Response(posts_serializer.errors, status=status.HTTP_400_BAD_REQUEST)
That’s it. Now start your Django server and upload a file through the front end. To make sure I wasn’t using a PDF that had already been OCR’d, I converted a photo of something with clear, plain text to PDF. Whatever you use, just make sure you can’t already search for text in the file you upload.
In your terminal, you should see OCRmyPDF working on your file.
Your OCR’d output file, output.pdf, should appear inside your main project directory. Open it in a PDF reader and search for a word. Those two lines let the three packages above work their magic on a PDF and OCR its contents. To better explain this magic, we need to look at the subprocess module we’re importing here.
The subprocess module allows you to call other programs from within a Python program. In our program, the uploaded variable stores the object returned by posts_serializer.save(), which is the record we just saved. Its file.path value is then passed to Popen, a class in the subprocess module, as one of the arguments needed to call the OCRmyPDF program. Popen starts OCRmyPDF without waiting for it to finish, so we then return the response as normal.
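To make that explanation concrete, here is a minimal sketch of a variation on those two lines. It is not the code this tutorial uses: the helper names ocr_output_path, build_ocr_command, and run_and_wait are hypothetical, and the _ocr.pdf naming scheme is my own assumption. It differs from the view in two ways. It derives a separate output name for each upload instead of reusing output.pdf, and it uses subprocess.run, which waits for the program to exit, where Popen returns immediately.

```python
import subprocess
import sys
from pathlib import Path


def ocr_output_path(input_path):
    """Return a sibling path, e.g. 'scan_ocr.pdf' for 'scan.pdf' (assumed naming)."""
    source = Path(input_path)
    return source.with_name(f"{source.stem}_ocr.pdf")


def build_ocr_command(input_path, output_path):
    """Build the same argument list the view passes to Popen."""
    return ["ocrmypdf", str(input_path), str(output_path)]


def run_and_wait(args):
    """Run an external program, block until it exits, return (exit code, stderr)."""
    completed = subprocess.run(args, capture_output=True, text=True)
    return completed.returncode, completed.stderr


if __name__ == "__main__":
    # Demonstrated with the Python interpreter itself, so no OCR tools are needed.
    code, errors = run_and_wait([sys.executable, "-c", "print('hello')"])
    print("exit code:", code)
```

Inside the view, this could be called as run_and_wait(build_ocr_command(uploaded.file.path, ocr_output_path(uploaded.file.path))). The trade-off is that waiting keeps the upload request open until OCR finishes, while the Popen version responds right away but never learns whether OCR succeeded.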
Now that our apps work and actually do something, we can create a new GitHub branch and push our work.
Now that we have our front end working and hooked up to a working OCR engine, we can move on to deploying our apps to Heroku!