December 12, 2019
Kazi Ehsan Aziz
The application had to let a user upload multiple PDFs together. On each of those PDFs, an Optical Character Recognition (OCR) operation then had to be run to extract photos and some information.
We decided to use Tesseract to help us out with the OCR. Our server
application was written in Spring Boot (Java), from which a wrapper
(Tess4j) would invoke the tesseract-ocr engine.
Our original plan was to let tesseract-ocr manage its own
multithreading to get a PDF OCRed as quickly as possible, and then
move on to the next one in the queue of uploaded PDFs. But
tesseract-ocr in multithread mode was significantly slower than in
single-thread mode at the time this application was being made.
So we forced each spawned process of tesseract-ocr to use one
thread only by setting OMP_THREAD_LIMIT=1 in the environment.
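To make the environment setting concrete, here is a minimal sketch of launching the tesseract command-line tool from Java with OMP_THREAD_LIMIT=1. It is an illustration under assumptions, not the code from our application: the class name, the input file page.png and the output base name are hypothetical, and it assumes a tesseract binary is on the PATH. If the engine is instead loaded in-process through a wrapper, the variable would most likely have to be set in the environment of the JVM itself before it starts.

```java
import java.io.File;

class SingleThreadedTesseract {

    // Builds (but does not start) a tesseract CLI invocation whose
    // environment limits OpenMP to a single thread.
    // The file names passed in are hypothetical placeholders.
    static ProcessBuilder buildCommand(File input, String outputBase) {
        ProcessBuilder pb = new ProcessBuilder(
                "tesseract", input.getPath(), outputBase);
        // Applies only to the child process, not to the JVM.
        pb.environment().put("OMP_THREAD_LIMIT", "1");
        // Merge stderr into stdout so one stream carries all output.
        pb.redirectErrorStream(true);
        return pb;
    }

    public static void main(String[] args) {
        ProcessBuilder pb = buildCommand(new File("page.png"), "page-out");
        System.out.println("Command: " + pb.command());
        System.out.println("OMP_THREAD_LIMIT="
                + pb.environment().get("OMP_THREAD_LIMIT"));
        // Calling pb.start() here would actually launch the process.
    }
}
```

Because the variable is set on the ProcessBuilder rather than globally, every spawned process gets the limit while the rest of the server is unaffected.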
With each process limited to one thread, we now wanted to launch 4 of
those processes together to make the most of our available CPU cores.
Quartz allows us to create jobs and then run those jobs concurrently
if needed. So, every time a PDF was successfully uploaded
(synchronously, at the request of the user), we scheduled a job for
it. This asynchronous job would actually invoke tesseract-ocr. When
done with a PDF, the job updated a record in our database so that the
user could learn about the OCR completion.
We told Quartz to keep it to 4 concurrent jobs at maximum. This combination of single-threaded Tesseract and multi-threaded Quartz was the sweet spot for our application.
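The overall pattern (queue a job per uploaded PDF, run at most 4 at a time, record completion) can be sketched with the JDK alone. Note that this swaps in a plain fixed-size ExecutorService as a stand-in for Quartz, an in-memory map as a stand-in for the database record, and a sleep in place of the real OCR call. All class, method and status names are hypothetical. In Quartz itself, the number of worker threads is normally governed by the org.quartz.threadPool.threadCount property.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class OcrJobPool {
    static final int MAX_CONCURRENT_JOBS = 4;

    // Stand-in for the database table holding the OCR status per PDF.
    final Map<String, String> status = new ConcurrentHashMap<>();
    // Number of jobs running right now, and the highest value seen.
    final AtomicInteger running = new AtomicInteger();
    final AtomicInteger peak = new AtomicInteger();

    private final ExecutorService pool =
            Executors.newFixedThreadPool(MAX_CONCURRENT_JOBS);

    // Called after an upload has finished; returns immediately.
    void schedule(String pdfId) {
        status.put(pdfId, "QUEUED");
        pool.submit(() -> {
            peak.accumulateAndGet(running.incrementAndGet(), Math::max);
            try {
                status.put(pdfId, "RUNNING");
                runOcr(pdfId);
                status.put(pdfId, "DONE");
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                status.put(pdfId, "FAILED");
            } finally {
                running.decrementAndGet();
            }
        });
    }

    // Placeholder for the single-threaded OCR call; sleeps to simulate work.
    void runOcr(String pdfId) throws InterruptedException {
        Thread.sleep(50);
    }

    // Stops accepting jobs and waits for the queued ones to finish.
    // Returns false on timeout or if the waiting thread is interrupted.
    boolean shutdownAndWait(long seconds) {
        pool.shutdown();
        try {
            return pool.awaitTermination(seconds, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        OcrJobPool jobs = new OcrJobPool();
        for (int i = 0; i < 8; i++) {
            jobs.schedule("pdf-" + i);
        }
        jobs.shutdownAndWait(10);
        System.out.println("Statuses: " + jobs.status);
        System.out.println("Peak concurrent jobs: " + jobs.peak.get());
    }
}
```

The fixed pool size is what enforces the cap: a fifth PDF simply waits in the queue until one of the four workers is free, which mirrors the behaviour we wanted from the scheduler.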