A web service for performing OCR, specifically tuned for extracting text from memes, adverts, and other images.
The service was initially built around Tesseract, an open-source OCR engine that is known to perform well but works best on standard printed material, i.e., black text on a white background. Memes and adverts are usually very different, often with white text on a highly complex coloured background, so Tesseract performs noticeably worse on them. Preliminary user testing within WeVerify showed that the results were often confusing because of both misrecognised text and a large amount of noise: shapes in the background being wrongly recognised as letters and numbers.
To address the relatively poor performance of Tesseract on the types of images we wanted to process, we investigated a number of other OCR systems. After careful consideration, we adopted a two-stage approach that uses recently released, state-of-the-art open-source libraries: CRAFT to isolate the areas of the image containing text, and a deep text recognition library to recognise the individual words.
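The two-stage idea (detect text regions first, then recognise each region separately) can be sketched roughly as follows. This is only an illustration of the control flow, not the service's actual code: `two_stage_ocr`, `crop`, and the `detect` and `recognise` callables are hypothetical names, the image is a plain list of pixel rows, and no real CRAFT or text recognition API is called.

```python
from typing import Callable, List, Sequence, Tuple

# Assumed stand-in types: an image is a list of pixel rows, and a box is
# (left, top, right, bottom) with right/bottom exclusive.
Image = Sequence[Sequence[int]]
Box = Tuple[int, int, int, int]


def crop(image: Image, box: Box) -> List[List[int]]:
    """Cut the rectangular region described by `box` out of `image`."""
    left, top, right, bottom = box
    return [list(row[left:right]) for row in image[top:bottom]]


def two_stage_ocr(
    image: Image,
    detect: Callable[[Image], List[Box]],      # stage 1: hypothetical text detector
    recognise: Callable[[Image], str],         # stage 2: hypothetical word recogniser
) -> List[Tuple[Box, str]]:
    """Detect text regions, then recognise each cropped region on its own."""
    # Sort top-to-bottom, then left-to-right, as a rough reading order.
    boxes = sorted(detect(image), key=lambda b: (b[1], b[0]))
    words = []
    for box in boxes:
        text = recognise(crop(image, box))
        if text:  # skip regions where the recogniser returned nothing
            words.append((box, text))
    return words
```

Because the recogniser only ever sees cropped regions that the detector flagged as text, shapes elsewhere in the background never reach it, which is the kind of noise described above.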
- A GATE Cloud demo interface can be found at https://cloud.gate.ac.uk/shopfront/displayItem/ocr-service.
- A separate, more image-centric UI is also available at https://ocr.cloud.gate.ac.uk/.
- A GATE Cloud REST API for easy integration with other tools is available at https://cloud-api.gate.ac.uk/process-document/ocr-service; full details of how to use the API are at https://cloud.gate.ac.uk/info/help/online-api.html.
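As a rough illustration of calling the REST endpoint from another tool, the sketch below builds (but does not send) an HTTP request using only the Python standard library. It assumes the endpoint accepts a POST whose body is the raw image bytes, with a `Content-Type` header naming the image type and an `Accept` header asking for JSON. These details, the helper name `build_ocr_request`, and the absence of any authentication are assumptions, so check them against the API help page listed above.

```python
import urllib.request

# Endpoint URL taken from the list above.
OCR_ENDPOINT = "https://cloud-api.gate.ac.uk/process-document/ocr-service"


def build_ocr_request(
    image_bytes: bytes, content_type: str = "image/png"
) -> urllib.request.Request:
    """Build a POST request carrying the image as the request body.

    Assumption: the service reads the raw body and uses Content-Type to
    decide how to decode it; authentication and quotas are not handled here.
    """
    return urllib.request.Request(
        OCR_ENDPOINT,
        data=image_bytes,
        headers={"Content-Type": content_type, "Accept": "application/json"},
        method="POST",
    )
```

The request could then be sent with `urllib.request.urlopen(build_ocr_request(data))`. The structure of the response is not described here, so any parsing should follow the API documentation.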