Introduction

This is the first post in our two-part series devoted to computer vision in Ruby.

The term “computer vision” is hard to define precisely, just as it’s hard to explain what it means to “see” something. The common thread is building some high-level representation, or structure, of images or videos depicting the world. The exact nature of this representation depends on the particular application, and consequently “computer vision” is an umbrella term that covers a wide range of specific problems.

And the applications are vast. As the computing power available to us becomes more plentiful, and freely available libraries such as OpenCV and TensorFlow grow increasingly mature, we find ourselves living in a world that would have seemed like fantasy only yesterday: a world where machines can not only drive cars, but also describe the world in words to visually impaired users.

One of the problems considered in computer vision is classification: given a predefined set of distinct categories, determining the one that best matches a given image. (Think OCR, which basically splits the image into letters and repeatedly asks the question: “what letter is this?”) Another is object detection, or discerning individual objects in the picture; this is often used to label or tag an image with a small number of one- or two-word tags understandable by humans.

This last task is especially interesting, as a number of websites have sprung up recently which offer exactly this as a service. You upload an image via a RESTful API, and you get back a JSON response containing the tags. In this post, we’ll take a look at five of these services: Clarifai, Imagga, Google Cloud Vision, Microsoft Cognitive Services, and Algorithmia. What are the pros and cons of each one? How meaningful are the results? Most importantly, which one should you go with for your own app?

Let’s find out.

The common ground

But before we delve into each service in detail, a few words are in order about what they all have in common – indeed, they are perhaps more similar to each other than they are dissimilar.

First, the APIs. If you’ve seen one, you’ve seen ’em all: you submit the URL of a picture (most services also let you POST local images directly) and say “please tag this for me”. For instance, suppose you submit a cat exploring a floor:

Cat exploring a floor

In response, you receive a JSON document like this – I’m pasting the result from Imagga, but all the services return a more or less similar structure:

{
  "results": [
    {
      "tagging_id": null,
      "image": "8bc9128a4ee08639a5ff888cfe7c7416",
      "tags": [
         { "confidence": 14.267, "tag": "cat" },
         { "confidence": 13.160, "tag": "animal" },
         { "confidence": 12.001, "tag": "black" },
         { "confidence": 11.717, "tag": "feline" },
         { "confidence": 11.3, "tag": "standard poodle" }
         // more results redacted out for brevity
      ]
    }
  ]
}

The most important thing here is the tags element, containing an array of structures, each of which describes one tag. At a minimum, this includes the name of the tag and a numeric confidence value, which describes how sure the algorithm was about assigning a given tag to the image. Typically this value ranges from 0 to 1 and can be read as a probability estimate; Imagga is the exception, reporting confidence on a percentage-like scale that extends beyond 1, as you can see above.
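To make that concrete, here is a minimal sketch of the parsing step in Ruby. It assumes the Imagga-style response shown above is already stored in a string called response_body, and the confidence threshold of 10 is an arbitrary choice for that scale:

require "json"

# `response_body` is assumed to contain the raw JSON string returned by the
# tagging service (the Imagga-style response shown above).
parsed = JSON.parse(response_body)

# Collect [name, confidence] pairs, keeping only reasonably confident tags.
# The threshold of 10 is arbitrary and depends on the service's scale.
tags = parsed["results"]
  .flat_map { |result| result["tags"] }
  .select   { |tag| tag["confidence"] >= 10 }
  .map      { |tag| [tag["tag"], tag["confidence"]] }

tags.each { |name, confidence| puts format("%-20s %6.2f", name, confidence) }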

Another thing that sometimes appears in tags is a machine-readable ID pointing to an entry in a pre-existing knowledge base, called an ontology. Ontologies link concepts and relations between them into global graphs. By tracking connections in these graphs, one can discover meaningful relationships between images.

Without further ado, we move on to reviewing each service in turn.

The giants

Google Cloud Vision

You can play around with the Google Vision APIs using the Web-based UI on their site; however, to start writing code, you’ll need to set up a billing account and connect it to your Google Cloud account. This process requires a debit/credit card number and is by far the most elaborate of the five services; everyone else’s getting-started workflow is far smoother.

Once you get past that, you can start using the APIs. There’s one method, annotate, that can perform multiple kinds of annotations at once: apart from tagging the image (called “label detection” in Google parlance), it can detect text, logos, landmarks, and faces, and attribute safe-search properties (such as whether the image contains adult content or is a spoof).

Labels assigned by Google are augmented with opaque textual IDs, called mids. Some, but not all, of these point into another Google service, the ontology known as the Google Knowledge Graph; you can query it by mid.
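For instance, a quick lookup of a mid might look like the sketch below; the Knowledge Graph Search API key and the example mid are assumptions, so treat it as an illustration rather than a recipe:

require "net/http"
require "json"

# Assumptions: GOOGLE_API_KEY has the Knowledge Graph Search API enabled,
# and the mid below is a hypothetical example returned by label detection.
mid = "/m/01yrx"
uri = URI("https://kgsearch.googleapis.com/v1/entities:search")
uri.query = URI.encode_www_form(ids: mid, key: ENV["GOOGLE_API_KEY"], limit: 1)

puts JSON.parse(Net::HTTP.get(uri))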

In addition, the service returns dominant colors characteristic of the image, along with their names, RGB values, and percentages.

Dominant colors in the cat image, according to Google Vision

There is an official Ruby gem available, google-cloud, maintained by Google themselves. It gives you access to all the Google Cloud APIs, not just Vision, handles Google’s rather intricate authentication process, and as a bonus you get a handy command-line gcloud utility.
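If you’d rather not pull in the whole gem, the underlying REST endpoint is straightforward to call by hand. The sketch below assumes authentication with a plain API key (the gem itself uses service-account credentials) and requests label detection only:

require "net/http"
require "json"

# Assumption: GOOGLE_API_KEY is a Cloud Vision API key on a billing-enabled project.
uri = URI("https://vision.googleapis.com/v1/images:annotate?key=#{ENV['GOOGLE_API_KEY']}")

body = {
  requests: [
    {
      image:    { source: { imageUri: "https://example.com/cat.jpg" } },
      features: [{ type: "LABEL_DETECTION", maxResults: 10 }]
    }
  ]
}

response = Net::HTTP.post(uri, body.to_json, "Content-Type" => "application/json")
labels   = JSON.parse(response.body).dig("responses", 0, "labelAnnotations") || []
labels.each { |label| puts "#{label['description']} (#{label['score']})" }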

Microsoft Cognitive Services

The team behind Microsoft Cognitive Services, formerly known as Project Oxford, boasts of having won the 2015 ImageNet Large Scale Visual Recognition Challenge.

Like Google, Microsoft’s service can be tried out via a Web interface. It also supports dominant color detection and can detect faces, facial expressions (via what is officially called the Emotion API), celebrities, image type (photo, clip art, or drawing), and NSFW content. Moreover, the service supports videos (by extracting and analyzing individual frames), and can assign an image to one of 86 predefined categories.

As part of Cognitive Services, Microsoft also offers APIs that are not related to computer vision and cover areas such as speech, natural language, and knowledge.

The Microsoft API is noteworthy in that it’s the only one able to generate succinct summaries of pictures in natural English. For example, the cat picture above is summarized as “a cat sitting on a hard wood floor”.

Microsoft doesn’t maintain any Ruby gem themselves; the sample snippets of Ruby code in the documentation use Net::HTTP. However, there is a third-party gem that covers the entire API. It is easy to use, but returns raw JSON responses that you need to parse yourself. The structure of the result is simple and well-documented, so this isn’t a huge obstacle.
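For illustration, a request along the lines of those samples might look like the sketch below; the westus region, the v1.0 API version, and the AZURE_VISION_KEY variable are assumptions that depend on your subscription:

require "net/http"
require "json"

# Assumptions: the subscription lives in the westus region and
# AZURE_VISION_KEY holds its Computer Vision API key.
uri = URI("https://westus.api.cognitive.microsoft.com/vision/v1.0/analyze")
uri.query = URI.encode_www_form(visualFeatures: "Tags,Description")

request = Net::HTTP::Post.new(uri)
request["Ocp-Apim-Subscription-Key"] = ENV["AZURE_VISION_KEY"]
request["Content-Type"]              = "application/json"
request.body = { url: "https://example.com/cat.jpg" }.to_json

response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(request) }
result   = JSON.parse(response.body)

puts result.dig("description", "captions", 0, "text")  # e.g. "a cat sitting on a hard wood floor"
result["tags"].to_a.each { |tag| puts "#{tag['name']} (#{tag['confidence']})" }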

The less-well-known specialists

Clarifai

We now turn our attention to services whose only purpose is to automatically tag images. Clarifai is one.

Clarifai can tag images and videos. Unlike Microsoft, it handles the latter on a whole-video basis, rather than by extracting and tagging individual frames.

The API is particularly well thought out. It is properly versioned, uses OAuth2-based authentication, and contains niceties such as individual IDs assigned to each tagged image, which can then be used in subsequent API calls.

Notably, you can specify the model to be used for tag generation. Like the API itself, models are versioned. At the time of writing, there are five to choose from: the default, general, covers a wide range of content that may appear in images, but you can also choose NSFW, weddings, travel, or food.

Of all the APIs we experimented with, Clarifai is the only one that allows submitting user feedback. If you discover anything amiss with the results, you can suggest an additional tag for a given image, or hint that a particular tag should be removed. You can also indicate that two pictures are similar or dissimilar to each other.

Clarifai is able to return tags in multiple languages. Unfortunately, they appear to be translated automatically from English. As a native speaker of Polish, I found some of the tags returned in Polish unnatural.

There are no official Ruby gems, but the community has produced two: clarifai and clarifai_ruby.
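Neither gem is hard to get started with, but the raw API is also simple enough to call directly. The sketch below is a rough approximation of a v2 prediction request; the access token, image URL, and the placeholder for the general model’s ID are all assumptions, so check Clarifai’s documentation for the current details:

require "net/http"
require "json"

# Assumptions: CLARIFAI_ACCESS_TOKEN holds an OAuth2 access token, and
# GENERAL_MODEL_ID should be replaced with the ID of the general model
# as listed in Clarifai's documentation.
uri = URI("https://api.clarifai.com/v2/models/GENERAL_MODEL_ID/outputs")

request = Net::HTTP::Post.new(uri)
request["Authorization"] = "Bearer #{ENV['CLARIFAI_ACCESS_TOKEN']}"
request["Content-Type"]  = "application/json"
request.body = {
  inputs: [{ data: { image: { url: "https://example.com/cat.jpg" } } }]
}.to_json

response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(request) }
concepts = JSON.parse(response.body).dig("outputs", 0, "data", "concepts") || []
concepts.each { |concept| puts "#{concept['name']} (#{concept['value']})" }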

Imagga

Like Clarifai, Imagga is a service dedicated to tagging images. Again, tags in multiple languages are supported.

There are two distinguishing features of Imagga. First, in addition to tagging, it can classify images using one of the available classifiers. Currently there are two: NSFW (self-explanatory) and Personal Photos, with output labels such as nature_landscape, beaches_seaside, or events_parties.

Second, a special API parameter (verbose) tells Imagga to include WordNet synsets corresponding to individual tags. An explanation is in order here. WordNet is a lexical database of English words developed at Princeton University in the 1990s. At its core, it can be thought of as an English dictionary and thesaurus. But it’s more than that: it links together words into graphs collecting different kinds of semantic relations between concepts, such as synonymy (having similar meaning) or meronymy (being a part of something). Groups of words having similar meanings are called synsets.

Why is WordNet important, one might ask? One of the reasons for the recent explosion in computer vision technologies was the advent of ImageNet, a collection of 14M+ images, each of which is linked to one or more WordNet synsets. It is the most comprehensive tagged image database in existence, and has been used to train the neural networks behind most of the services described here. (Incidentally, the fact that images can be linked together via WordNet relations is the reason why you’ll often see a cat tagged as feline or mammal.) Since both databases are freely available, Imagga can be used to search for ImageNet images similar to a given one.

There is no dedicated gem. The code samples provided in the documentation use the rest-client library.
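In the same spirit as those samples, a tagging request with rest-client might look roughly like this; the v1 endpoint matches the response shown at the beginning of the post, and the IMAGGA_API_KEY / IMAGGA_API_SECRET variables are assumptions:

require "rest-client"
require "cgi"
require "json"

# Assumptions: IMAGGA_API_KEY and IMAGGA_API_SECRET hold the credentials
# shown on your Imagga dashboard; authentication is HTTP Basic.
image_url = "https://example.com/cat.jpg"

response = RestClient::Request.execute(
  method:   :get,
  url:      "https://api.imagga.com/v1/tagging?url=#{CGI.escape(image_url)}",
  user:     ENV["IMAGGA_API_KEY"],
  password: ENV["IMAGGA_API_SECRET"]
)

tags = JSON.parse(response.body).dig("results", 0, "tags") || []
tags.each { |tag| puts "#{tag['tag']} (#{tag['confidence']})" }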

The community

Algorithmia

If Imagga and Clarifai follow the Unix principle of “doing one thing and doing it well,” Algorithmia is “a Swiss Army knife.” It has also been called “the open source app store for algorithms.”

At the heart of Algorithmia is a community of scientists and programmers. Every member can contribute and monetize their own algorithm. There are over 2,500 algorithms available as Web services, from classics such as Dijkstra’s algorithm or number factorization to machine learning and computer vision. In particular, one of the available algorithms is Illustration Tagger, by a member named deeplearning.

This approach has the benefit of being open and clearly documented: the algorithm is known to use the open-source Illustration2Vec models. Unlike the other services described here, these models were not trained on ImageNet data and are tailored to tagging drawings and paintings rather than photos. Nevertheless, we include it for completeness.

There is an official Algorithmia gem.
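Using it follows the same pattern as any other Algorithmia algorithm. The sketch below is based on the gem’s documented client/algo/pipe flow; the API key, the algorithm path, and the input format are assumptions, so consult the Illustration Tagger page for the exact current values:

require "algorithmia"

# Assumptions: ALGORITHMIA_API_KEY holds your API key, and the algorithm
# path and input format follow the Illustration Tagger page on Algorithmia.
client = Algorithmia.client(ENV["ALGORITHMIA_API_KEY"])
algo   = client.algo("deeplearning/IllustrationTagger")

result = algo.pipe({ image: "https://example.com/drawing.png" }).result
puts result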

Tests

To assess the quality of tagging, we assembled a collection of ten images depicting a wide variety of objects and sceneries, including people, animals, nature, human-built objects, and a hand-drawn picture. We then tagged each of them with each service and selected the ten tags with the highest confidence scores.

To avoid cluttering the post, we present the raw results on a separate result page. You are highly encouraged to browse them and draw your own conclusions; because there’s no “reference” tagging and any judgement of the results is necessarily subjective, it’s difficult to call any one service superior to its alternatives on the basis of tagging quality.

However, there are some observations we’d like to share:

  • Illustration Tagger’s results are clearly unsuitable for photos (not drawings or paintings). If you’re looking for general-purpose quality tags, you should look someplace else; if it’s specifically illustrations you are after, you may want to consider this one.

  • Microsoft Cognitive Services tends to assign a smaller number of well-chosen tags, rather than a lot of tags that might overlap each other or be superfluous. In doing so, however, it sometimes misses important image features. This is especially evident in the moon image, which Microsoft tagged only as “dark;” it also failed to notice the Eiffel Tower. On the other hand, it was the only service that noticed the floor underneath the cat.

  • Clarifai, Imagga and Google Cloud Vision’s results appear to be mostly on par with each other. Of the three, Imagga seems to be a little less specific; it mistook the plane for another flying object and ignored the pottery on the shelves, instead trying to guess the nature of the room they are found in.

Conclusion

Let’s summarize what we’ve said so far into a comparison table:

|   | Google Cloud Vision | Microsoft Cognitive Services | Clarifai | Imagga | Illustration Tagger (Algorithmia) |
| --- | --- | --- | --- | --- | --- |
| Tagging quality | Good | Okay | Good | Good | Mediocre (general); okay (illustrations) |
| Entry barrier | Medium | Low | Low | Low | Low |
| Ontology links | Knowledge Graph | None | None | WordNet | None |
| Multilingual tags | No | No | Yes | Yes | No |
| Feedback API | No | No | Yes | No | No |
| Video support | No | Yes | Yes | No | No |
| Image captioning | No | Yes | No | No | No |
| NSFW classification | Yes | Yes | Yes | Yes | Yes |
| Dominant color detection | Yes | Yes | Yes | Yes | No |
| Ruby gem available? | Yes, official | Yes, community | Yes, community | No | Yes, official |
| Images/month in free tier | 1000 | 5000 | 5000 | 2000 | 5000 |

Which one to go with? The answer, as always, depends on your use case. In the second part of this series, we’re going to develop a simple gallery in Rails that automatically tags submitted images, using Clarifai as a backend. Furthermore, we will make sure that it’s easy to swap Clarifai for another tagging service provider.