
Finding Bieber: Using Computers and Humans to Surface the Talent in Millions of YouTube Videos

This is a writeup of a talk given at the Media Lab last month by Hrishikesh Aradhye, Ph.D., with my own commentary sprinkled throughout.

Power to the people, at last! It’s a new hour
Now we all ain’t gon’ be American Idols
But you can ‘least grab a camera, shoot a viral

Kanye West, “Power”

An hour of video is uploaded to YouTube every second (or ten years’ worth of video every single day). Think about that for a minute. That’s a lot of content. And, as haters everywhere have pointed out already, a lot of it is crap.

The more interesting point, though, is that some of these videos are actually really good. If YouTube can get better at surfacing the good stuff, whether it’s a funny comedian, a talented singer, or a hilarious FAIL clip, we all benefit (including Google). Identifying talent has traditionally been a very subjective art, and as a result, the quantification of talent hasn’t really been discussed in published literature.

Accelerating the cream’s rise to the top is one of Google’s biggest challenges, according to David Lawee, Google’s Mergers & Acquisitions chief. To help solve this problem, Hrishikesh Aradhye, Ph.D. and his team at Google Research have built YouTube Slam. Slams are gamified, hot-or-not interfaces for comparing two videos: inspired by the success of Google Demo Slams, they pit two videos within the same genre (e.g. comedy, dance) against one another and ask the user to vote for the better one.

How to Analyze Video Content
The team computationally analyzes videos, which requires combining traditional computer vision on still images with analysis of the audio and music. Fortunately for the team, Google Research already had two teams conducting this kind of analysis. (They also have groups working on biologically-inspired vision.)

In addition, the Research team has access to valuable metadata and user behavior analytics. The various cues informing their video analysis include (a sketch of how these might be bundled follows the list):

  • Visual cues: e.g. the video’s color, texture, motion, shot grammar, faces
  • Auditory cues: e.g. the video’s volume, spectrogram, melody, beats
  • Textual cues: e.g. the video’s title, description, tags, comments (using traditional Natural Language Processing)
  • Behavioral cues: e.g. videos that were co-clicked, co-queried, co-uploaded, co-commented
  • Contextual cues: e.g. the uploader’s age/gender, time of view, location, history
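
To make the taxonomy concrete, here is a minimal sketch of how such multimodal cues might be bundled per video and flattened into a single feature vector. The field names, types, and vocabulary scheme are my own illustration, not Google’s actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class VideoCues:
    """Hypothetical per-video bundle mirroring the cue taxonomy above."""
    visual: dict = field(default_factory=dict)      # e.g. color, texture, motion
    auditory: dict = field(default_factory=dict)    # e.g. volume, spectrogram, beats
    textual: dict = field(default_factory=dict)     # e.g. title/tag/comment tokens
    behavioral: dict = field(default_factory=dict)  # e.g. co-click/co-query counts
    contextual: dict = field(default_factory=dict)  # e.g. uploader and view context


def to_feature_vector(cues, vocabulary):
    """Flatten the named cue dictionaries into one dense vector for a classifier."""
    merged = {**cues.visual, **cues.auditory, **cues.textual,
              **cues.behavioral, **cues.contextual}
    return [float(merged.get(name, 0.0)) for name in vocabulary]
```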

Towards A Populist Definition of Music
Traditional research in musicology focuses on complicated genre taxonomies, assembled by musicians and musicologists with clear understandings of sub-niches. The lay person doesn’t understand the minor variations, and instead comes up with their own terms, like “Pump-up music.” This tension led Hrish and his team to want to produce labels for, and with, the people who actually use them on YouTube.

Computing features is expensive, and YouTube serves millions of videos, so there’s a strong incentive to discover concepts that can be computed before a video is ever served to a user (I don’t know if you’ve heard, but Google hates latency). The team has been able to categorize music by format (acoustic songs, cover songs, instrumental, karaoke, live music, etc.) as well as genre (anime, Bollywood, cartoon, children, chipmunk, choir, etc.).

I don’t want a genre, I want a playlist
Taking populist enjoyment of music further, YouTube’s researchers have found that users categorize music in more subjective buckets than genre, like moods and modes. Regardless of its actual genre, a song can be filed as pump-up, smooth, inspirational, battle, workout, party, club, morning, fast, or old school. You might recognize these categories from any young person’s iTunes playlists. In a Spotified world, our traditionally defined genres matter less than how a particular song fits into the user’s life.

Pandora’s recommendations expose the elements of the music genome to tell you why a song is recommended, but the service covers only widely distributed music, because it pays music students to sit in a room and code each individual song. YouTube’s incredibly long tail demands an automated answer to this problem.

Computational video analysis looks for clear patterns to categorize music videos. Live music has tell-tale signs: stages bathed in deep blue and purple lighting, and shots of people in a crowd waving their arms in the air. Guitar and piano songs are both popular on YouTube, and they’re easy for a computer to recognize. A karaoke video has its own grammar. On one hand, there’s a clear formula: text on screen and an instrumental opening. But even within this template, there’s wild variation on YouTube, with endless combinations of fonts, colors, and goofy backgrounds.
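
As a rough illustration of how such visual grammar can become a classifier signal, here is a toy sketch that scores frames for stage-style lighting with a hand-set color rule. The thresholds and the single-feature design are my own invention; the real system learns over many features, not one heuristic.

```python
def stage_lighting_score(frame):
    """Fraction of pixels whose color resembles deep blue/purple stage lighting.

    `frame` is an iterable of (r, g, b) tuples with values in 0-255.
    """
    hits = total = 0
    for r, g, b in frame:
        total += 1
        # Heuristic: blue dominates, green stays low (deep blues and purples).
        if b > 150 and g < 100 and b > r:
            hits += 1
    return hits / total if total else 0.0


def looks_like_live_music(frames, threshold=0.3):
    """Flag a video whose average sampled frame is dominated by stage lighting."""
    scores = [stage_lighting_score(f) for f in frames]
    return bool(scores) and sum(scores) / len(scores) > threshold
```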

Compared to other genres like hip-hop, Bollywood videos feature color palettes rich with greens, oranges, and other bright colors, and their audio is quite distinct as well. YouTube’s machine classifiers can also recognize which of 25 different languages a song is sung in, more than any other published research at this scale. This feature is especially important for Indian songs, where the metadata is in English but the song is not. The system can confuse similar languages like Spanish and Portuguese, but generally works pretty well.

Chipmunk videos are made by modulating the frequencies of traditional songs, and often include a still image from one of the recent Chipmunks feature films. Sadly enough, an enormous number of people create these, to the degree that the genre bubbles up as one of the top types. It would never show up in Wikipedia’s article on musical genres, but the Finding Bieber project seeks to discover what actually matters to YouTube users in the context of music: the true populist definition of music, as it’s played and understood by real people around the world.

Cover songs are another computationally discernible genre. They usually consist of videos of people in their living rooms and bedrooms, singing a version of their favorite song in front of a camera. The grammatical pattern is obvious across a corpus of many videos: a head-and-shoulders view of a person, ambient indoor lighting, and a single dominant vocal and instrument track with nothing fancy over it. But again, there are a whole lot of people in a whole lot of living rooms singing a whole lot of songs. Generally, though, the commonalities are broad enough to categorize these videos.

Cover songs are the driving force behind the Finding Bieber project. Thousands are uploaded every day by people waiting and working to be discovered, and occasionally they are found, as Justin Bieber was. It’s more and more common for emerging artists to post their work to YouTube before being discovered, but discovery is still a very accidental process: the right person (an agent) has to find your video and share it, and then you basically have to go viral for someone in the music industry to notice you. The result is a rich-get-richer problem: the people with the most YouTube subscribers get their videos seen, because they can make the ‘Most Viewed Today’ lists that drive traffic on the site.

Why Slams Work
Despite the availability of all of YouTube’s metadata and user analytics, Slams have proven to be a very effective method of ranking otherwise subjective content. The Slam system is based on a few core concepts. First, head-to-head voting is a better method of identifying quality clips because, the team has found, people are much better at assigning relative value than absolute value. Asking people for preferential judgements (“The video on the right is funnier”) turns out to yield higher-quality answers than measuring absolute judgements, which in the context of YouTube are indicators like watching a video in its entirety, leaving a comment, or using the thumbs up and thumbs down buttons.
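
The talk didn’t specify which rating model Slam uses to turn those pairwise votes into a global ranking, but an Elo-style update is the textbook approach; the sketch below shows that generic technique, not Google’s actual formula.

```python
def elo_update(rating_winner, rating_loser, k=32.0):
    """One Elo-style rating update from a single pairwise vote.

    Each video carries a scalar rating. The expected win probability follows
    a logistic curve on the rating gap, and the winner takes points from the
    loser in proportion to how surprising the outcome was.
    """
    expected_win = 1.0 / (1.0 + 10 ** ((rating_loser - rating_winner) / 400.0))
    delta = k * (1.0 - expected_win)
    return rating_winner + delta, rating_loser - delta


# An upset (a low-rated video beating a high-rated one) moves ratings a lot.
new_low, new_high = elo_update(1400.0, 1600.0)
```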

Another core concept behind Slams is gamification. To keep the crowd interested in sorting videos, the process must be fun. Points are assigned and there are weekly leaderboards, but the team has also found other, less conventional ways to keep people engaged.

First, videos are automatically sorted into several categories: Comedy, Dance, Cute, Bizarre, and Music. These were chosen because they are the classic YouTube genres, some of the best verticals where an amateur can make a splash by picking up a camera and becoming a producer. Then, the videos are computationally filtered for a certain threshold of quality; it would be hard to keep voters engaged if half of the videos were nothing but microphone feedback. In addition, the system intentionally serves up videos it already knows are well-liked, to keep users engaged and watching. This has proven to encourage more votes over the course of a session, despite the momentary duplication of effort. (It’s sort of like an inversion of the gold data strategy we used on Amazon Mechanical Turk.)
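
A minimal sketch of that serving strategy, assuming a simple mixing ratio (the actual proportion and selection logic aren’t public):

```python
import random


def next_pair(unrated_pairs, crowd_pleasers, known_good_rate=0.25):
    """Pick the next matchup, occasionally re-serving a known-liked video.

    With probability `known_good_rate`, pair a video the system already knows
    is well-liked against a not-yet-rated one, trading a little duplicated
    effort for voter enjoyment.
    """
    if crowd_pleasers and random.random() < known_good_rate:
        favorite = random.choice(crowd_pleasers)
        challenger = random.choice([v for pair in unrated_pairs for v in pair])
        return favorite, challenger
    return random.choice(unrated_pairs)
```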

The result is the beginning of a formula for qualities we might previously have considered purely subjective. With a corpus of millions of cute cat videos, we can begin to expose the magic behind what we define as ‘cute.’ An entire world of possibilities boils down pretty quickly to laughing babies, kittens, puppies, other small animals, and big dogs performing tricks.

Crowdsourcing Discovery
As industries full of talent scouts and clip-show producers already know, there’s a lot of value in identifying the gems early on. (How many hours has Bob Saget spent reviewing crotch-shots?) American Idol has 12 singers and millions of voters, a very simple counting problem. Slams create the opposite problem: many singers and few voters. The result is that every vote must be milked for additional information.

Votes update the reputation of each video, but also the voter’s reputation vis-à-vis the majority. If you vote against the majority three times, you lose the round. You can play again, but to earn points, you must correctly predict a video’s worth; you earn points only if your guess falls in line with what others guessed. Like the GRE, the system can serve more challenging video pairs to users who have historically scored well, and easier pairs to users who have scored low. I found rating comedy videos using Slam to be frustrating, as the game punished me whenever I failed to choose the lower-brow brand of humor YouTube is famous for.
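
Here is a toy version of those two mechanics, nudging a voter’s reputation toward or away from the majority and then weighting future votes by it. The update rule and weights are my own stand-ins, not the team’s model.

```python
def update_reputation(reputation, agreed_with_majority, step=0.05):
    """Nudge a voter's reputation toward 1.0 on agreement, toward 0.0 otherwise."""
    target = 1.0 if agreed_with_majority else 0.0
    return reputation + step * (target - reputation)


def weighted_tally(votes):
    """Aggregate (choice, voter_reputation) pairs into a weighted winner.

    `choice` is 'left' or 'right'; votes from trusted voters count for more.
    """
    totals = {"left": 0.0, "right": 0.0}
    for choice, reputation in votes:
        totals[choice] += reputation
    return max(totals, key=totals.get)
```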

Timing Comedy
Comedy is very hard for a computer to judge by video content alone. The team also looks for indicators like whether the person holding the camera is laughing (a useful sign that at least one person thinks the content is funny), but comedy is harder than other genres when it comes to predicting what audiences will like.

To complicate matters, they’d like to know just how funny a video is relative to other comedy videos. In the case of comedy and bizarre videos, analyzing a video’s comments has proven even more useful than video analysis. YouTube’s comments are not usually considered a great place to study the English language, but it turns out that machines can learn a lot about how humorous a video is from how people emphasize their plain-text LOLs, including:

  • elongation – how many o’s there are (loool)
  • capitalization (LOL)
  • exclamation points (lol!!!)
  • emoticons (:-D)

In each case the computer is able to pick up cues from users making use of the limited functionality of YouTube’s comment box to express their degree of joy.
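
As a sketch, the four signals above are easy to pull out of a comment with a few regular expressions; how they would be weighted into a final funniness score is not something the talk specified, so everything here is my own illustration.

```python
import re


def lol_features(comment):
    """Extract crude humor-intensity signals from one YouTube comment."""
    lols = re.findall(r"l+o+l+", comment, flags=re.IGNORECASE)
    return {
        # Elongation: letters beyond the plain three-character "lol".
        "elongation": sum(len(lol) - 3 for lol in lols),
        # Capitalization: any all-caps LOL present.
        "capitalized": any(lol.isupper() for lol in lols),
        # Exclamation points anywhere in the comment.
        "exclamations": comment.count("!"),
        # A couple of common joyful emoticons.
        "emoticons": len(re.findall(r":-?D|x+D", comment)),
    }


print(lol_features("LOOOL!!! that ending xD"))
```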

Within each category, you also end up with amazing, only-on-YouTube subcategories, like the wide world of tree-cutting FAILs and latte art, including a printer creating some very fancy logos in pixels made of milk.

Slam Efficacy
It’s easy to forget, but YouTube’s search engine is the second most popular search engine in the world, behind only Google itself. And yet, given enough votes, Slam-style ranking of videos beats the site’s own organic search engine. It has three things going for it:

  1. It’s based on relative preferential judgements, as detailed earlier.
  2. It’s based on active solicitation. Active machine learning, on average, requires far fewer samples to learn the same concept than traditional supervised machine learning (see the sketch after this list).
  3. It’s gamified – people are more incentivized to put some thought into their rankings.
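
Point 2 is the classic active-learning argument: rather than waiting for passive signals, the system can choose the matchups whose outcomes it is least sure about. A generic sketch of that selection rule, reusing the Elo-style ratings from earlier (the actual Slam scheduler wasn’t described):

```python
import itertools


def most_informative_pair(ratings):
    """Pick the matchup an Elo-style ranker is least certain about.

    Uncertainty peaks when two videos have nearly equal ratings, so asking
    the crowd about that pair yields the most information per vote.
    """
    def win_probability(r_a, r_b):
        return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

    pairs = itertools.combinations(ratings.items(), 2)
    return min(pairs, key=lambda p: abs(win_probability(p[0][1], p[1][1]) - 0.5))


# Example: the two closest-rated videos get served next.
pair = most_informative_pair({"vid_a": 1500.0, "vid_b": 1510.0, "vid_c": 1700.0})
```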

The main disadvantage is that, as a product of Test Tube (i.e. YouTube Labs), far fewer people play Slam than search the site. As a Slam’s leaderboard fills in, at around 800 votes it starts to outperform the search engine, but Slams don’t always get that many votes. Even so, in some cases 150 users in a gamified setup were able to produce better results than YouTube’s search engine.

Civic Applications
My mind immediately began wandering to more useful applications of the amazing technologies described. Why not apply some of these methods to educational videos? Khan Academy lessons could be A/B tested against other online resources, with quizzes at the end testing retention and determining which video better conveys the subject.

The same could work for news. As with textual news, there are many news videos covering the same stories, and viewers could vote with their feet on which outlet produced the most compelling, informative take.

Hrish pushed back on these ideas, saying that Comedy and Cute and Music are the verticals the YouTube empire was built on, the genres where everyday people have the power to become famous and play producer. He also said that YouTube is primed for gamification because people usually come there for entertainment.

Even if we stick to music, these technologies open up exciting possibilities for participatory media. YouTube’s power to discover talent could surface and promote local artists and cultural experiments as an antidote to the Top 40 lists that dominate traditional radio stations. It could identify regional genres before Clear Channel ever gives the emerging artists behind them a chance.

Of course, there are real limits on this work as well. As noted in a session I blogged at ROFLcon, YouTube itself is moving away from empowering You to star on the Tube, and moving aggressively toward paying professionals to produce what are essentially television shows. Vevo’s canned music videos have surpassed some of YouTube’s biggest organic hits. And there’s the question of how tailored we want our content to be. If automatic categorization takes off, a video’s predicted success could determine its actual success, no matter how limited our range of quality indicators may be.