April 21, 2023

Paper accepted!

I’m pleased to announce that our recent paper “Augraphy: A Data Augmentation Library for Document Images” was accepted to ICDAR 2023! I’ve given a lot of energy to the Augraphy project over the past couple of years, so it’s great to have finally reached this milestone and gotten formal recognition for that effort.

Augraphy is a Python framework for defining image augmentation pipelines. It is useful for producing synthetically noisy images of documents, which can then be used to train deep neural networks to remove real-world instances of that noise.

Starting with an image of a clean document, like a PNG export of your favorite arXiv PDF, Augraphy augmentations apply pixel-level transformations, producing output that simulates realistic effects on that document. For example, an office worker may print out a quarterly report for distribution in a department meeting, with the expectation that recipients will annotate their copies. One such copy jams the printer briefly, smudging some of the ink, and this copy later ends up receiving not only handwritten markings like highlighting and underlining, but also coffee stains and a slight fold in the page. Augraphy contains functions designed to produce these effects, layering them onto the clean document image, completely digitally.
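
To make the layering idea concrete, here is a toy sketch in plain Python. To be clear, this is not Augraphy's API: the function names (`smudge_ink`, `add_stain`, `add_fold`, `apply_pipeline`) and the list-of-lists grayscale image are assumptions I made up for illustration, and the real library's effects are far more realistic. The sketch only shows the general shape of the approach, where each step is a pixel-level transformation and a pipeline applies them in order.

```python
import random


def make_clean_page(width=32, height=16):
    """White page (255) with dark 'text' rows (0) on every fourth line."""
    return [
        [0 if (y % 4 == 1 and 2 <= x < width - 2) else 255 for x in range(width)]
        for y in range(height)
    ]


def smudge_ink(image, rng, probability=0.3):
    """Lighten a random subset of dark pixels, loosely imitating smeared ink."""
    return [
        [min(255, p + 120) if p < 128 and rng.random() < probability else p for p in row]
        for row in image
    ]


def add_stain(image, center, radius, darkness=60):
    """Darken every pixel inside a disc, loosely imitating a coffee stain."""
    cx, cy = center
    return [
        [
            max(0, p - darkness) if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2 else p
            for x, p in enumerate(row)
        ]
        for y, row in enumerate(image)
    ]


def add_fold(image, column, darkness=40):
    """Darken one column, loosely imitating the shadow of a crease."""
    return [
        [max(0, p - darkness) if x == column else p for x, p in enumerate(row)]
        for row in image
    ]


def apply_pipeline(image, steps):
    """Apply each transformation in order; the input image is never modified."""
    for step in steps:
        image = step(image)
    return image


rng = random.Random(0)  # fixed seed so the toy example is repeatable
clean = make_clean_page()
noisy = apply_pipeline(
    clean,
    [
        lambda img: smudge_ink(img, rng),
        lambda img: add_stain(img, center=(8, 8), radius=4),
        lambda img: add_fold(img, column=20),
    ],
)
```

Because every step returns a new image, `clean` is left untouched and can be reused as the reference copy for `noisy`.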

Datasets produced this way can be paired with the original clean document images for supervised training of models for denoising, binarization, and so on, with each clean image serving as the label for its “crappified” copies. This is what we explore in the paper.
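
As a rough illustration of that pairing, the sketch below builds (noisy input, clean label) tuples in plain Python. The `crappify` function here is a hypothetical stand-in that just grays out random pixels, not an Augraphy pipeline, and `make_training_pairs` is likewise a name I invented for this example. The setup in the paper is more involved; this only shows how several noisy copies can share one clean label.

```python
import random


def crappify(clean, rng, noise_rate=0.1):
    """Hypothetical stand-in for an augmentation pipeline: gray out random pixels."""
    return [[128 if rng.random() < noise_rate else p for p in row] for row in clean]


def make_training_pairs(clean_images, copies_per_image=3, seed=0):
    """Return (noisy_input, clean_label) pairs, several noisy copies per clean image."""
    rng = random.Random(seed)
    pairs = []
    for clean in clean_images:
        for _ in range(copies_per_image):
            pairs.append((crappify(clean, rng), clean))
    return pairs


# Two tiny 8x8 grayscale "documents": a blank page and one with a diagonal stroke.
clean_images = [
    [[255] * 8 for _ in range(8)],
    [[0 if x == y else 255 for x in range(8)] for y in range(8)],
]
pairs = make_training_pairs(clean_images)
```

A denoising model would then be trained to map the first element of each pair back to the second.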