April 21, 2023

Paper accepted!

Update (2023-08-19): it’s published in Lecture Notes in Computer Science.

I’m pleased to announce that our recent paper “Augraphy: A Data Augmentation Library for Document Images” was accepted to ICDAR 2023! I’ve given a lot of energy to the Augraphy project over the past couple of years, so it’s great to have finally reached this milestone and gotten formal recognition for that effort.

The main output is a Python framework for defining image augmentation pipelines, which is useful for producing purely synthetic document images containing simulated effects like coffee stains, folded paper, spilled ink, and so on. We then use the resulting datasets for supervised training of deep neural networks that reliably clean images of real documents.
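To make the idea concrete, here is a minimal sketch in plain Python of what an augmentation pipeline producing (noisy, clean) training pairs can look like. It does not use Augraphy's actual API: every name in it (`clean_page`, `stain`, `fold`, `run_pipeline`, `make_training_pair`) is hypothetical, and the effects are crude stand-ins for the far more realistic ones in the library.

```python
import random
from typing import Callable, List, Tuple

Image = List[List[int]]  # grayscale rows; 0 is black, 255 is white
Augmentation = Callable[[Image, random.Random], Image]


def clean_page(width: int = 64, height: int = 48) -> Image:
    """White page with dark horizontal bars standing in for lines of text."""
    page = [[255] * width for _ in range(height)]
    for y in range(6, height - 6, 6):
        for x in range(8, width - 8):
            page[y][x] = 30
    return page


def stain(img: Image, rng: random.Random) -> Image:
    """Darken a circular blob, loosely imitating a coffee stain."""
    h, w = len(img), len(img[0])
    cy, cx = rng.randrange(h), rng.randrange(w)
    r = rng.randint(5, 12)
    return [
        [max(0, p - 60) if (y - cy) ** 2 + (x - cx) ** 2 <= r * r else p
         for x, p in enumerate(row)]
        for y, row in enumerate(img)
    ]


def fold(img: Image, rng: random.Random) -> Image:
    """Darken one vertical column, loosely imitating a crease."""
    x_fold = rng.randrange(len(img[0]))
    return [
        [max(0, p - 40) if x == x_fold else p for x, p in enumerate(row)]
        for row in img
    ]


def run_pipeline(img: Image, steps: List[Augmentation], seed: int) -> Image:
    """Apply each augmentation in order; each step returns a new image."""
    rng = random.Random(seed)
    for step in steps:
        img = step(img, rng)
    return img


def make_training_pair(seed: int) -> Tuple[Image, Image]:
    """Return (noisy input, clean target) for supervised denoising."""
    target = clean_page()
    noisy = run_pipeline(target, [stain, fold], seed)
    return noisy, target


noisy, target = make_training_pair(seed=0)
changed = sum(n != t for nr, tr in zip(noisy, target) for n, t in zip(nr, tr))
print(f"{changed} pixels differ between the noisy input and the clean target")
```

Because the clean target is generated alongside its degraded copy, every training example comes with a perfect label for free, which is what makes the supervised setup possible without hand-cleaning real scans.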

This work is interesting in its own right, but points at a broader phenomenon which will surely see its day in the sun: synthetic data can be hugely powerful when care is taken to faithfully generate the right features. More on this in future writings.