OCRing Old Math: Do It Faster with olmOCR

Anyone working on translations of old math and science works will be familiar with spending hours typing up old documents, even when the scans are readable enough that OCR should, in principle, handle them. The problem, of course, is the math and tables. Typing up the equations in LaTeX is a major hurdle, even with a snippets-based approach like Gilles Castel's. I personally use Obsidian as a note-taking app with the LaTeX Suite plugin, with some minor tweaks to the shortcuts, and I'm quite fast at this point. But seeing a dense page of multi-line equations (e.g. by Gauss) still makes me tense up a bit. Transcribing one page is no trouble, but 60 pages turns into a multi-week endeavor, and I'm limited by an issue with my ulnar nerves that makes typing for longer than 30 minutes quite painful. Needless to say, transcribing is a burden I would very much like to lessen.

Now I think I've cracked it, thanks to olmOCR. I tried it out on their web demo, and it handled the documents I fed it quite well. Proofreading is still necessary, since things do get missed, but it is by far the most accurate OCR tool I've used, and it outputs to my preferred format, Markdown! To make real use of it, though, I'd have to leverage somebody else's hardware, because I don't have a fancy GPU or a fast computer. Here's how I did it.

How to Set Up olmOCR with DeepInfra

The README for olmOCR says:

Requirements:
  • Recent NVIDIA GPU (tested on RTX 4090, L40S, A100, H100) with at least 12 GB of GPU RAM
  • 30GB of free disk space
You will need to install poppler-utils and additional fonts for rendering PDF images.

As I mentioned, this is a non-starter for me. But they also mention that you can use an inference provider to run it remotely. That sounded like the right choice for me. Now, this comes with a lot of new terminology. What exactly is going on? 

  • olmOCR uses a large language model (LLM), specifically a vision-language model (VLM), to do the OCR
    • These "models" comprise an architecture and a giant collection of floating-point numbers (the weights), but none of those details will come up.
  • We want someone else to host this particular model for us. This is a common demand, so there are companies that run models and give you access to them. Such companies are called inference providers
  • There are numerous such providers out there, and olmOCR lists three. I settled on DeepInfra because, frankly, that's the one I could figure out. This is not an endorsement, but it seems reputable.
  • When you sign up for DeepInfra, it takes you to a homepage. Go to the Billing tab, where you can set a monthly usage limit, add billing info, and "top up" your balance (i.e. add some funds, say $5, which is good for thousands of pages). Once that's set up, you'll be able to access the service with your newly generated API key, found in the Keys tab. This key is what will allow us to access the olmOCR model.
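One small quality-of-life step (my own habit, not something DeepInfra requires): stash the key in an environment variable so you aren't pasting it into commands or leaving it in your shell history. The variable name here is arbitrary.

```shell
# Store the DeepInfra key in an environment variable (name is my own choice).
# "DfXXXXXXX" is a placeholder -- paste your actual key from the Keys tab.
export DEEPINFRA_API_KEY="DfXXXXXXX"
echo "$DEEPINFRA_API_KEY"
```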

So that's the background. What does this actually look like? 

  1. Install olmOCR locally with conda.
    1. There are various resources (the GitHub README, articles, etc.) out there for how to install olmOCR, and I won't try to cover all the possibilities. I just followed the instructions there, and for the `pip install` part I simply did `pip install olmocr`. This installs a few GB (!) of Nvidia packages, which probably aren't necessary for remote inference, so feel free to experiment here. But this worked for me.
  2. Activate the conda environment, and make a folder called `localworkspace1` where the outputs will end up.
  3. Find your DeepInfra API key by going to the Keys tab and copying it. It will be called something like "auto"
  4. Use the command listed in the DeepInfra row of the table in the README. Replace the API key DfXXXXXXX with your key from DeepInfra, add the --markdown flag before the --pdfs flag to get Markdown output, and replace the --pdfs argument with your file(s).
  5. Watch the magic happen.
When finished, the results will be in `localworkspace1/markdown` and `localworkspace1/results`. 
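Put together, a session looks roughly like the sketch below. The environment name and PDF filename are made up, and I deliberately haven't reproduced the full pipeline command: copy it from the DeepInfra row of the README's table, since its server and model arguments may change.

```shell
conda activate olmocr            # or whatever you named the environment
mkdir -p localworkspace1         # outputs will land here
# Paste the DeepInfra command from the olmOCR README here, then:
#   - replace DfXXXXXXX with your own API key,
#   - add --markdown before the --pdfs flag,
#   - point --pdfs at your file(s), e.g. --pdfs gauss_1813.pdf
```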

NOTE: I ran into a bug where it got hung up on a couple of pages in a PDF. It'll just hang, and you'll end up wasting money. Press ctrl+c to kill the process and delete the contents of the localworkspace1 folder; hopefully that takes care of it. There is a pull request for adding a timeout to help with this, because it's really annoying. Otherwise, try extracting only some of the pages of the document you're working on and see if it gets hung up again, until you've isolated the problem page.
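For bisecting a hanging PDF, the poppler-utils that olmOCR already requires include `pdfseparate` and `pdfunite`. A sketch of how I'd narrow things down (the filenames and page ranges are made up):

```shell
# Split pages 1-30 of the source into single-page files (poppler-utils).
pdfseparate -f 1 -l 30 poisson_1812.pdf page-%d.pdf
# Rejoin a suspect range into one test document, then run olmOCR on it.
pdfunite page-2{0..9}.pdf test-20s.pdf
```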

Finally, to make things work in Obsidian, I had to change the Markdown "flavor" or dialect by converting the math delimiters from brackets to dollar signs. Then it's a matter of proofreading and reformatting wherever things got messed up. But that effort is far, far less than what I'm used to, and is much easier on my arms!
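For the delimiter change, a sed one-liner did the trick for me. This assumes the output uses `\(` `\)` for inline math and `\[` `\]` for display math; check your files first, and note that macOS/BSD sed wants `sed -i ''` instead of `-i`. Shown here on a throwaway sample file; in practice, point it at `localworkspace1/markdown/*.md`.

```shell
# Create a small sample file to demonstrate the substitution.
printf '\\(x\\) and \\[y\\]\n' > sample.md
# \[ \] -> $$ (display math), \( \) -> $ (inline math), editing in place.
sed -i 's/\\\[/$$/g; s/\\\]/$$/g; s/\\(/$/g; s/\\)/$/g' sample.md
cat sample.md   # -> $x$ and $$y$$
```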

If my experience changes or I run into additional problems, I'll be sure to update this post. For now, this is just my log of getting this working, as a total newbie. 
