INTRODUCTION

Creating an archive of vintage wine catalogs gives wine economists, producers, and other interest groups access to information for future analysis, which may help them better understand how the wine market has changed over time. The Sherry Lehmann collection contains numerous catalogs with different designs, layouts, formats, and colors, dating from the 1930s to the 1980s. An individual page or series of catalogs might have a unique design, and a page might contain only images, articles, actual wine listings, or even listings for other alcoholic beverages. Extracting the information manually from the images may be simpler, but it is time-consuming and error-prone. Therefore, we needed to develop an automated system that extracts the relevant information from the scanned wine catalogs with high accuracy.

We developed an algorithm that extracts specific information (e.g., wine name, bottle price, case price) and organizes it into a data frame as output. We anticipate two audiences for the project: people who want to revise and improve our code, and people who want to use its outputs. The scope of the algorithm was shaped by the time and resources available to us.

After weighing factors such as processing time and accuracy, we decided to use Tesseract, an optical character recognition (OCR) engine, to interpret the scanned images. Because of the complexity of the problem, we broke our approach into steps: first extracting all of the prices from the Tesseract output, then extracting the wine names and matching them with their respective prices.
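
The steps described above can be illustrated with a minimal Python sketch. It is not the project's actual code. It assumes the OCR text is already available as a string (for example, from a wrapper such as pytesseract), that each wine's name and prices sit on the same line, and that the bottle price appears before the case price. The sample text, the function name `extract_rows`, and the price pattern are all hypothetical.

```python
import re

# Hypothetical OCR output: these lines are invented for illustration and
# are not taken from the actual catalogs.
SAMPLE_OCR_TEXT = """\
CHATEAU EXAMPLE ROUGE 1959 ........ 3.49 37.70
DOMAINE SAMPLE BLANC 1961 2.15 23.25
A short article paragraph with no prices in it.
VIN DE TABLE ILLUSTRATIF 1.29
"""

# Assumption: a price is digits, a dot, and exactly two decimals (e.g. 3.49).
PRICE_RE = re.compile(r"\d+\.\d{2}")


def extract_rows(ocr_text):
    """Return one dict per line that contains at least one price."""
    rows = []
    for line in ocr_text.splitlines():
        # Step 1: find every price on the line.
        matches = list(PRICE_RE.finditer(line))
        if not matches:
            continue  # no price, so treat the line as non-catalog text
        prices = [float(m.group()) for m in matches]

        # Step 2: take the text before the first price as the wine name,
        # trimming spaces and dot leaders.
        name = line[:matches[0].start()].strip(" .")
        if not name:
            continue

        # Step 3: match the name with its prices (bottle first, then case).
        rows.append({
            "wine_name": name,
            "bottle_price": prices[0],
            "case_price": prices[1] if len(prices) > 1 else None,
        })
    return rows


if __name__ == "__main__":
    for row in extract_rows(SAMPLE_OCR_TEXT):
        print(row)
```

The resulting list of dictionaries maps directly onto the rows of a data frame (for instance, by passing it to `pandas.DataFrame`). Real catalog pages vary widely in layout, so the same-line assumption made here would not hold for every page.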

Data Extraction of Vintage Wine Catalogs
