Digitizing archived paintings
- Job: University Project
- Date: Feb 2016 - May 2016
- Technologies: Python, Anaconda, OpenCV, Image Processing, Layout analysis, Optical Character Recognition
The Giorgio Cini Foundation is a non-profit cultural institution located in Venice, Italy. It aims to create a cultural center on the island of San Giorgio Maggiore. As part of this goal, I worked on a semester project at EPFL (DHLAB) to automate the digitization of a large dataset of painting photos. The scans, front and back shown below, need to be processed and then associated with their paintings and textual descriptions. The final pipeline takes the front and back scans as input and returns the bounding boxes of the painting, the text area, and the barcode. It also segments each text section individually and extracts its textual content using OCR.
Using a combination of morphological operators, gradients, and Otsu thresholding, together with some dimension and position constraints, the painting and text boxes are identified, cropped from the image, and re-aligned. The painting is identified as the largest cluster remaining after processing the image, while the text area is found by taking the lowest horizontal line (detected with the Hough Transform) that lies above the painting box. The results for the previous image are shown below.
From there, the text area is processed using a layout analysis tool, Kraken, which finds the smallest bounding boxes around the lines of text in the image, giving a result like this:
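A minimal sketch of this call, assuming Kraken's legacy `pageseg.segment` API (newer Kraken releases expose a different segmentation interface, so treat the exact return shape as an assumption):

```python
from PIL import Image

def segment_text_area(path):
    """Run Kraken's bounding-box segmenter on a text-area crop.
    Assumes the legacy `kraken.pageseg.segment`, which expects a
    binarized image and returns a dict with a "boxes" entry."""
    from kraken import pageseg  # lazy import: Kraken only needed here
    im = Image.open(path).convert("1")  # "1" = bilevel (binarized) image
    segmentation = pageseg.segment(im)
    return segmentation["boxes"]  # list of [x0, y0, x1, y1] line boxes
```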
The text area is also processed using Frangi filters, which detect ridges in the image and thus the thin edges separating otherwise similar regions. The result is a segmentation that separates all the parts of the text area, as follows:
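As an illustration, the ridge-detection step might look like this using scikit-image's `frangi` filter (an assumption; any Frangi implementation would do, and the threshold value is arbitrary):

```python
def ridge_separators(gray, threshold=0.5):
    """Highlight thin ridges (e.g. the lines separating text fields).
    `gray` is a 2-D float/uint8 array; returns a boolean ridge mask."""
    from skimage.filters import frangi  # lazy import: scikit-image only needed here
    response = frangi(gray)  # strong response on thin, line-like structures
    # Normalise to [0, 1] and keep only the strongest ridges as separators
    peak = response.max()
    if peak > 0:
        response = response / peak
    return response > threshold
```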
The results from both methods are merged, then sorted vertically and horizontally in order to classify each area. For example, the top-left-most area is the City in which the painting can be found. The end result of the process looks like this:
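The sorting step above amounts to putting boxes into reading order. A small self-contained sketch (the row tolerance and sample boxes are made up for illustration):

```python
def order_boxes(boxes, row_tol=10):
    """Sort (x0, y0, x1, y1) boxes top-to-bottom, then left-to-right
    within each row. `row_tol` (pixels, an assumed value) decides when
    two boxes belong to the same row."""
    rows = []
    for box in sorted(boxes, key=lambda b: b[1]):  # sort by top edge
        for row in rows:
            if abs(row[0][1] - box[1]) <= row_tol:  # same row as its first box?
                row.append(box)
                break
        else:
            rows.append([box])  # start a new row
    # Flatten rows, sorting each one left-to-right
    return [b for row in rows for b in sorted(row, key=lambda b: b[0])]

boxes = [(210, 12, 300, 40), (10, 10, 200, 40), (10, 60, 300, 90)]
ordered = order_boxes(boxes)
# ordered[0] is the top-left-most box, i.e. the one classified as "City"
```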
After that, each identified section is cropped and parsed using Tesseract OCR, and the results are stored in a JSON file as a set of keys and values for the image, giving a result similar to the following structure:
{
  "City": "COLONIA",
  "Author": "GIOVANNI FRANCESCO da RIMINI (attr.)",
  "Attribution": "FONDO G. FIOCCO",
  "Museum": "MUSEO DIOCESANO",
  ...
}
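The OCR-and-store step could be sketched like this. The field list and its order are assumptions for illustration; the OCR call uses `pytesseract`, a common Python wrapper for Tesseract, which the project may or may not have used:

```python
import json

FIELDS = ["City", "Author", "Attribution", "Museum"]  # assumed field order

def ocr_sections(crops):
    """OCR each cropped section with Tesseract (via pytesseract)
    and pair the results with their classified field names."""
    import pytesseract  # lazy import: only needed when OCR actually runs
    return {field: pytesseract.image_to_string(crop).strip()
            for field, crop in zip(FIELDS, crops)}

def save_record(record, path):
    """Store the key/value pairs extracted for one scan as JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(record, f, ensure_ascii=False, indent=2)
```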