
We used supervised machine learning and statistical methods to develop an algorithm that extracts historic prices (wine name, ID, bottle price, case price) from Sherry Lehmann wine catalogs. As previously mentioned, each catalog has a different design and unique characteristics, which makes developing a single algorithm that works across all of them very challenging.
Our approach was to identify and group catalogs with similar characteristics to create a training set, which we labeled manually. We then developed our algorithm based on patterns we observed in the scanned catalogs we selected.
As a fundamental step, we extracted all of the prices from the Tesseract output and clustered them into one or more groups based on their right coordinates. Within each cluster, we then separated the rows based on their top values. Each row contains one bottle price, which may or may not be paired with a case price. We then matched each bottle price with its wine name, which usually appears on the same line as the prices. Finally, we saved the wine names, bottle prices, and case prices in a data frame.
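
The grouping steps above can be sketched as follows; the field names and pixel tolerances are illustrative assumptions, not the project's actual values:

```python
# Sketch of the clustering/row-splitting step. Each price token is assumed
# to carry the right (x) and top (y) pixel coordinates that Tesseract
# reports; the tolerance values below are invented for illustration.

def cluster_by_right(prices, tol=15):
    """Group price tokens into columns whose right edges are within tol px."""
    clusters = []
    for p in sorted(prices, key=lambda t: t["right"]):
        if clusters and p["right"] - clusters[-1][-1]["right"] <= tol:
            clusters[-1].append(p)
        else:
            clusters.append([p])
    return clusters

def rows_by_top(cluster, tol=10):
    """Within one price column, split tokens into rows by their top value."""
    rows = []
    for p in sorted(cluster, key=lambda t: t["top"]):
        if rows and p["top"] - rows[-1][-1]["top"] <= tol:
            rows[-1].append(p)
        else:
            rows.append([p])
    return rows

# Toy page: a bottle-price column near x=300 and a case-price column near x=400.
tokens = [
    {"text": "4.15", "right": 300, "top": 100},
    {"text": "45.00", "right": 402, "top": 101},
    {"text": "3.25", "right": 301, "top": 140},
]
columns = cluster_by_right(tokens)  # two columns: rights {300, 301} and {402}
```

Each row's bottle price would then be matched to the wine name sharing its top coordinate, and the result appended to the data frame.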
Furthermore, we calculated the accuracy rate using the Levenshtein distance (LD) for each feature within a scanned page. The LD method returns an integer for each cell, which we divided by the number of characters in the cell to get the error rate; subtracting the error rate from one gives the accuracy rate for each feature and the overall accuracy for each catalog. Over time we improved our algorithm to reach 72% overall accuracy on the test set, with per-feature accuracies of 61% for wine name, 81% for bottle price, and 73% for case price. These results are based only on our test set, whose pages share similar outlines and characteristics; we cannot predict what the error rate would be if we ran our algorithm on all 4,111 scanned catalogs.
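
A minimal sketch of that calculation, using a standard dynamic-programming Levenshtein distance (the example strings are invented):

```python
# Per-cell accuracy = 1 - (edit distance / number of characters in the
# ground-truth cell), as described above.

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (ca != cb))) # substitution
        prev = cur
    return prev[-1]

def cell_accuracy(predicted, truth):
    """Accuracy rate for one extracted cell against its hand-labeled value."""
    if not truth:
        return 1.0 if not predicted else 0.0
    return 1.0 - levenshtein(predicted, truth) / len(truth)
```

Averaging `cell_accuracy` over all cells of a feature gives that feature's accuracy, and averaging across features gives the catalog-level figure.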
We have identified two types of problems that decrease the accuracy of our algorithm: Tesseract software errors, which we label preprocessing errors, and errors introduced by our own algorithm, which we label post-processing errors. To improve our overall error rate, we have to address both types of errors with preprocessing and post-processing steps.
In preprocessing, we faced problems such as Tesseract missing text completely or partially (producing an incorrect format) because of page coloration, rotation, font, and design. For example, a scanned catalog can be brightly colored, slightly rotated during scanning, or set in an unfamiliar font that Tesseract cannot detect. We were able to correct rotation by using linear regression to find the best-fitted line through the right coordinates of the prices (Appendix 2).
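
The regression-based rotation estimate can be sketched in pure Python; the sample coordinates are invented, and the least-squares fit here stands in for whatever regression routine the project actually used:

```python
import math

# Fit rights ≈ slope * tops + intercept through the right edges of the
# detected prices; a perfectly scanned page gives slope ≈ 0, while a
# skewed page gives a slope whose arctangent is the correction angle.

def skew_angle_degrees(tops, rights):
    """Estimate page skew from a price column's (top, right) coordinates."""
    n = len(tops)
    mean_t = sum(tops) / n
    mean_r = sum(rights) / n
    slope = (sum((t - mean_t) * (r - mean_r) for t, r in zip(tops, rights))
             / sum((t - mean_t) ** 2 for t in tops))
    return math.degrees(math.atan(slope))

# Right edges drifting 1 px left for every 10 px down → roughly -5.7 degrees.
angle = skew_angle_degrees([100, 200, 300, 400], [500, 490, 480, 470])
```

Rotating the page image by `-angle` (e.g. with Pillow's `Image.rotate`) before re-running Tesseract would then undo the skew.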
In post-processing, we encountered issues with extracting features correctly, labeling prices incorrectly, and repairing prices from partially incorrect Tesseract output. For example, when Tesseract splits the price 41.15 into 41 and 0.15, or fails to capture the bottle price entirely, our algorithm mislabels the case price as the bottle price. We were able to fix some of these prices by using regular expressions to recognize such patterns and replace them with the correct price.
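
A sketch of one such repair; the pattern below is illustrative, not the project's actual regular expression:

```python
import re

# Rejoin a price whose integer part was split from its cents,
# e.g. "41 0.15" or "41 .15" → "41.15". Whole prices such as
# "450.00" contain no internal whitespace and are left untouched.
SPLIT_PRICE = re.compile(r"\b(\d{1,3})\s+0?\.(\d{2})\b")

def repair_prices(line):
    """Merge split price fragments back into a single price token."""
    return SPLIT_PRICE.sub(r"\1.\2", line)
```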
Future Direction:
We are aware that our algorithm has limitations, so there are still several ways to improve its accuracy. The problems range from extracting and labeling bottle prices more accurately from the Tesseract output to reducing the high wine-name error rate, which is directly correlated with the accuracy of bottle-price extraction. One solution is to separate the wine ID number from the wine name and save it in its own column, and to add a method that captures wine names spanning two lines.
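
The proposed ID/name split could look like the following sketch, under the assumption that a listing line begins with a numeric catalog ID:

```python
import re

# Assumed layout: "<catalog ID> <wine name ...>". Lines without a leading
# number are returned with no ID so the caller can handle them separately
# (for instance, the second line of a two-line wine name).
LISTING = re.compile(r"^\s*(\d+)\s+(.*\S)\s*$")

def split_id_and_name(text):
    """Split a leading numeric ID from the wine name, if one is present."""
    m = LISTING.match(text)
    if m:
        return m.group(1), m.group(2)
    return None, text.strip()
```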

Furthermore, we need to make the price-list extraction more robust even when the Tesseract output is inaccurate; for instance, when Tesseract splits a price into two parts containing commas and/or a decimal point, our algorithm cannot currently recognize it. Statistically speaking, we need a larger training and test set, which would let us train the algorithm better and calculate a more representative error rate. Finally, we could use both the ABBYY Reader and Tesseract outputs, cross-checking the text each one provides and combining them into a single text to increase our confidence in correctly extracting the text from the catalogs.
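
The cross-check could start as simply as the token-level comparison below; it assumes the two outputs are already aligned token for token, which real OCR output would not guarantee (an alignment step, e.g. with `difflib.SequenceMatcher`, would be needed first):

```python
# Toy cross-check between two OCR readings of the same line. Tokens on
# which both engines agree are kept as-is; disagreements are flagged for
# manual review rather than resolved automatically.

def cross_check(tesseract_tokens, abbyy_tokens):
    """Keep tokens both engines agree on; flag disagreements for review."""
    merged = []
    for ta, tb in zip(tesseract_tokens, abbyy_tokens):
        merged.append(ta if ta == tb else f"[{ta}|{tb}]")
    return merged
```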
