Extracting Text From Images Using Ruby


tl;dr: show me the code.

I’m a great hoarder of memes. Seriously, I have a huge folder full of memes and GIFs. I usually collect my memes while watching a movie or a series, by taking a screenshot of the scene with subtitles enabled. The issue is that the program I use to take screenshots names each new file by the date and time the screenshot was taken. I have a folder of 1,200 screenshots, each named by date and time, so finding a certain screenshot or meme is too much of a hassle.

I thus wanted a better way to go through my collection of memes. My goal was to rename each screenshot file to the subtitles it contains. I have, in fact, been procrastinating on this task for a long time now, and the mid-year recess was the perfect time to start a small project.

To read the subtitles out of my images, a Google search led me to Tesseract. Tesseract is a tool that recognizes the text in images using Optical Character Recognition (OCR). I first experimented with Tesseract and its Python wrapper, pytesseract, and used OpenCV for image processing, and it worked perfectly right from the start!

I then started on a small Ruby script to read my images, process them, and recognize the text in each one. Unfortunately, Ruby lacks good tooling for computer vision and image processing. I couldn’t find a good OpenCV wrapper in Ruby except for ruby-opencv, which is quite outdated and only supports OpenCV v2 (version 4 of OpenCV is now released, by the way).

After some tinkering with ImageMagick and its Ruby wrapper, MiniMagick, I could finally achieve the same results I got in Python. So, in the next few sections, I’ll demonstrate the steps the script goes through to recognize the subtitles in each screenshot.

The Script

What follows is a simpler version of the script I actually used. The full script, including the part where I process the output text and rename each file to the new name, is a gist on GitHub.

The first part of the script just loops through each file (a 1920x1080 PNG image file) in my source directory and reads it as a grayscale image.

require 'pathname'
require 'open3'
require 'mini_magick'

SRC_DIR = '/path/to/src/dir'.freeze
TMP_DIR = '/path/to/tmp/dir'.freeze

# File.exists? is deprecated (and removed in Ruby 3.2); use Dir.exist? instead
Dir.mkdir(TMP_DIR) unless Dir.exist?(TMP_DIR)

Pathname.new(SRC_DIR).children.each do |f|
  src_path = f.realpath
  tmp_path = "#{TMP_DIR}/#{f.basename}"

  img = MiniMagick::Image.open(src_path)
  img.colorspace('Gray')
  img.write(tmp_path)
end
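
Note that children also returns subdirectories and any non-image files in the source directory. Here is a minimal sketch of narrowing the loop down to PNG files only. The png_files helper is hypothetical and not part of the original script:

```ruby
require 'pathname'

# Hypothetical helper (not in the original script): returns the PNG files
# directly inside dir, sorted by name, skipping subdirectories and other files.
def png_files(dir)
  Pathname.new(dir).children
          .select { |f| f.file? && f.extname.casecmp?('.png') }
          .sort
end

# Possible usage inside the script:
#   png_files(SRC_DIR).each { |f| ... }
```

The extension check is case-insensitive, so files ending in .PNG are picked up as well.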

Let’s take the following screenshot as an example and feed it to the script. First, we load it normally:

Example screenshot

then we convert it to grayscale. This is what it looks like:

Example screenshot grayscale

Cleaning The Image

The second part is where all the image processing happens. Basically, the script negates the image, applies a threshold that turns it into a pure black-and-white image, and negates it back, which removes almost all the noise. This improves the ability of Tesseract to read the text in the image.

# ...

Pathname.new(SRC_DIR).children.each do |f|
  # ...
  MiniMagick::Tool::Magick.new do |magick|
    magick << tmp_path
    magick.negate
    magick.threshold("007%")    # I couldn't prevent myself
    magick.negate
    magick << tmp_path
  end
end

This is what the image looks like after processing. You can see how almost all the details in the image are removed except for the vivid subtitles. This file is stored under the TMP_DIR we defined earlier in the script. You can, of course, safely delete TMP_DIR afterwards or extend the script to do it for you.

Example screenshot negative
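
To get a feeling for what the negate, threshold, negate sequence does to a single pixel, here is a rough sketch in plain Ruby. It assumes 8-bit grayscale values (0 to 255) and that the threshold turns every value above the cutoff white and everything else black. The helper names are made up for illustration and are not MiniMagick calls:

```ruby
MAX = 255 # brightest value of an 8-bit grayscale pixel (assumption)

# Invert a pixel value, like negate does for every pixel.
def negate(value)
  MAX - value
end

# Values above the cutoff become white, everything else becomes black.
def threshold(value, percent)
  value > MAX * percent / 100.0 ? MAX : 0
end

# negate, threshold, negate: the same order as in the script.
def clean_pixel(value, percent = 7)
  negate(threshold(negate(value), percent))
end

[255, 240, 200, 128, 0].each do |v|
  puts "#{v} -> #{clean_pixel(v)}"
end
```

Under these assumptions, a 7% cutoff keeps only the pixels with a value of 238 or more white, which is why the bright subtitles survive while most of the scene turns black.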

Recognizing The Text

In this part, we call the tesseract command on our image and get back the recognized text. We could have used a Ruby wrapper around Tesseract, and rtesseract is a great one; however, we only need to call a simple command here, so we don’t need all the features the wrapper provides.

# ...

Pathname.new(SRC_DIR).children.each do |f|
  # ...
  # Pass the arguments separately so paths with spaces are not split by a shell
  text, _, _ = Open3.capture3(
    'tesseract', tmp_path, 'stdout', '-l', 'eng', '--oem', '0', '--psm', '3'
  )

  puts text.strip
end

And this is the final output of our script:

Oh, man. In my next life
I'm coming back as a toilet brush.
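
The renaming step lives in the full script and is not shown here. As a rough idea of how recognized text could be turned into a safe file name, here is a minimal sketch. The filename_from_text helper, the 100-character limit, and the untitled fallback are all assumptions for illustration, not what the full script does:

```ruby
# Hypothetical helper: builds a file name from recognized text by lowercasing
# it and replacing every run of non-alphanumeric characters with "_".
def filename_from_text(text, ext = '.png')
  slug = text.downcase.gsub(/[^a-z0-9]+/, '_').gsub(/\A_+|_+\z/, '')
  slug = 'untitled' if slug.empty? # nothing usable was recognized
  "#{slug[0, 100]}#{ext}"          # keep the name reasonably short
end

text = "Oh, man. In my next life\nI'm coming back as a toilet brush."
puts filename_from_text(text)
```

For the example above, this prints oh_man_in_my_next_life_i_m_coming_back_as_a_toilet_brush.png.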

Results

Speed

As I mentioned earlier, I first started with a Python script to test Tesseract. I did a small time comparison between the Ruby version and the Python version after some great discussions on /r/ruby. Unfortunately, the Python version is much faster. I believe that most of the overhead in the Ruby version comes from using ImageMagick for image processing. Here’s the comparison of running both scripts on a sample of 525 1920x1080 PNG files:

ruby ocr.rb  864.69s user 62.12s system 136% cpu 11:19.01 total

vs.

python3 ocr.py  277.92s user 31.38s system 141% cpu 3:37.85 total

Edit: /u/janko-m on Reddit pointed me to libvips, and I replaced ImageMagick with it. The updated script is also on GitHub. The run became about 3 times faster than it used to be and, in terms of CPU time, even outperformed the Python version (the wall-clock totals are roughly on par). Here’s the output of the time command:

ruby ocr.rb  234.74s user 35.64s system 120% cpu 3:45.17 total
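
For the curious, the wall-clock totals above can be compared with a few lines of Ruby. This is only a sketch of the arithmetic; the to_seconds helper is made up for this example:

```ruby
# Convert a "minutes:seconds" total, as printed by time, into seconds.
def to_seconds(total)
  minutes, seconds = total.split(':')
  minutes.to_i * 60 + seconds.to_f
end

imagemagick = to_seconds('11:19.01') # Ruby with ImageMagick
python      = to_seconds('3:37.85')  # Python with OpenCV
libvips     = to_seconds('3:45.17')  # Ruby with libvips

puts format('ImageMagick / libvips: %.2f', imagemagick / libvips)
puts format('Python / libvips:      %.2f', python / libvips)
```

The first ratio comes out at roughly 3, and the second one is slightly below 1.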

Accuracy

In the end, the text in 93% of my screenshots was recognized almost correctly (with some gibberish noise in some files). I filtered out the screenshots that don’t contain any subtitles before running the script. This means that only 7% of my screenshots were not recognized correctly or were recognized as empty text. The Tesseract wiki provides some tips to improve text recognition accuracy; they are mainly about processing the source image before feeding it to Tesseract.

Different Colors

This issue occurred with screenshots that contained subtitles in colors other than white. For example, I have a screenshot with yellow subtitles that wasn’t recognized by the original script. I had to use a lower threshold while converting the image to a black-and-white binary image so that the subtitles wouldn’t be filtered out as noise.

Duplicates

This is not related to Tesseract itself or text recognition per se; it’s more about the script. If more than one screenshot has the same subtitles, one image file will overwrite the other. I don’t handle this case in my script yet; I want to get rid of duplicates anyway.
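
If you would rather keep the duplicates, one possible approach is to append a counter to any name that is already taken. This is only a sketch. The unique_name helper and the Set of taken names are assumptions for illustration and are not part of my script:

```ruby
require 'set'

# Hypothetical helper: returns base + ext, or base_2 + ext, base_3 + ext, ...
# if the name was already handed out, and remembers the result in taken.
def unique_name(base, ext, taken)
  candidate = "#{base}#{ext}"
  counter = 1
  while taken.include?(candidate)
    counter += 1
    candidate = "#{base}_#{counter}#{ext}"
  end
  taken << candidate
  candidate
end

taken = Set.new
3.times { puts unique_name('toilet_brush', '.png', taken) }
```

This prints toilet_brush.png, toilet_brush_2.png, and toilet_brush_3.png, so no file would overwrite another.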