OCRing Old Math: Do It Faster with olmOCR
Anyone working on translations of old math and science works will be familiar with spending hours typing up old documents, even when the scans are readable enough that OCR should, in principle, handle them. The problem, of course, is the math and the tables. Typing up the equations in LaTeX is a major hurdle, even if you use a snippets-based approach like Gilles Castel's. I personally use Obsidian as a note-taking app with the LaTeX Suite plugin, plus some minor tweaks to the shortcuts, and I'm quite fast at this point. But seeing a dense page of multi-line equations (e.g. by Gauss) still makes me tense up a bit. Transcribing one page is no trouble, but 60 pages turns into a multi-week endeavor, and I'm limited by an issue with my ulnar nerves that makes typing for longer than 30 minutes quite painful. Needless to say, transcribing is a burden that I would very much like to lessen.
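For the curious, a LaTeX Suite snippet is a small trigger/replacement rule defined in the plugin's settings. The entry below is my own illustrative example, written from memory of the plugin's format (check the plugin's documentation for the exact syntax and option flags):

```javascript
// One snippet entry in LaTeX Suite's settings (format as I recall it).
// Typing "sq" inside math mode expands to \sqrt{} with the cursor at $0.
// "mA" roughly means: math mode only, auto-expand without pressing Tab.
{trigger: "sq", replacement: "\\sqrt{$0}", options: "mA"}
```

With a few dozen of these, common constructs become two or three keystrokes, which is why plain prose-plus-math pages go quickly; it's the dense multi-line derivations that remain slow.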
Now I think I've cracked it, thanks to olmOCR. I tried it out on their web demo, and it handled the documents I fed it quite well. Proofreading is still necessary, since things do get missed, but it is by far the most accurate OCR tool I've used, and it outputs to my preferred format, Markdown! To make real use of it, though, I'd have to leverage somebody else's hardware, because I don't have a fancy GPU or a fast computer. Here's how I did it.
How to Set Up olmOCR with DeepInfra
- olmOCR's README lists hefty requirements for running it locally: a recent NVIDIA GPU (tested on RTX 4090, L40S, A100, H100) with at least 12 GB of GPU RAM, plus 30 GB of free disk space. I have no such GPU, which is why we'll offload the heavy lifting to someone else's hardware.
- olmOCR uses a large language model (LLM), specifically a vision-language model (VLM), to do the actual OCR.
- These "models" comprise an architecture and a giant pile of floating-point numbers, but none of those details will come up here.
- We want someone else to host this particular model for us. This is a common demand, so there are companies that run models and sell access to them. Such companies are called inference providers.
- There are numerous such providers out there, and olmOCR's README lists three. I settled on DeepInfra because, frankly, it's the one I could figure out. This is not an endorsement, but it seems reputable.
- When you sign up for DeepInfra, it takes you to a homepage. Go to the Billing tab, where you can set a monthly usage limit, add billing info, and "top up" your balance (i.e. add some funds, say $5, which is good for thousands of pages). Once that's set up, you'll be able to access things with your newly generated API key, found in the Keys tab. This key is what will give us access to the olmOCR model.
- Install olmOCR locally with conda.
- There are various resources out there (the GitHub README, articles, etc.) on how to install olmOCR, and I won't try to cover all the possibilities. I just followed the README's instructions, and for the `pip install` part I simply did `pip install olmocr`. This pulls in a few GB (!) of NVIDIA packages, which probably aren't necessary when the model runs remotely, so feel free to experiment here. But this worked for me.
- Activate the conda environment, and make a folder called `localworkspace1` where the outputs will end up.
- Find your DeepInfra API key by going to the Keys tab and copying it. It will be called something like "auto"
- Use the command listed in the DeepInfra row of the README's table. Replace the placeholder API key `DfXXXXXXX` with your own key from DeepInfra, add the `--markdown` flag before the `--pdfs` flag to get Markdown output, and point the `--pdfs` argument at your file(s).
- Watch the magic happen.
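Putting the steps above together, the invocation looks roughly like the following. This is a sketch, not a copy of the README: `--markdown`, `--pdfs`, and the `DfXXXXXXX` placeholder are from the steps above, while the server/key/model flag names and values are my assumptions, so copy the exact command from the README's DeepInfra row rather than from here.

```shell
# Run from inside the activated conda environment.
# localworkspace1 is the output folder created earlier; the server,
# API-key, and model flags below are assumptions -- use the README's
# DeepInfra command verbatim and only swap in your own key and files.
python -m olmocr.pipeline ./localworkspace1 \
    --server <DeepInfra endpoint from the README> \
    --api_key DfXXXXXXX \
    --model <model name from the README> \
    --markdown \
    --pdfs path/to/your/scan.pdf
```

When it finishes, the Markdown transcriptions land under `localworkspace1`, ready for proofreading.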