Large-scale video processing for YouTube

Every minute of the day, more than 100 hours of video are added to YouTube. This large inflow of footage presents several challenges. The two that I will discuss are copyright detection and recommendation generation.

YouTube must remove copyrighted material, preferably before it is ever posted, without slowing down the upload process. In parallel, more and more copyright owners are providing us with footage to protect against upload. The combination of massive numbers of uploads and years' worth of content to protect makes any pairwise comparison completely intractable. To make matters worse, we need to detect copying even when the content has been "mashed up", re-cropped, overlaid, recompressed, or color- or brightness-shifted. I will describe one version of the feature set that we have used for this task, as well as the subsequent processing that allows us to treat it efficiently as an approximate-nearest-neighbor retrieval problem. The resulting system has excellent identification capabilities for small snippets of audio or video that have been degraded in a variety of ways, including competing noise, poor recording quality, cell-phone playback (for audio), and camcorder recapture (for video). To make the system work in production, we have also explicitly developed methods to limit memory usage and computation with minimal loss in detection performance. The system is more accurate than the previous state-of-the-art system while being more efficient and flexible in its use of memory and computation.
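To give a flavor of the approximate-nearest-neighbor formulation, the sketch below indexes per-frame feature vectors with random-hyperplane locality-sensitive hashing and looks up a degraded snippet by voting over hash-bucket hits. This is a toy illustration, not the production feature set or hashing scheme: the feature dimensions, table counts, and additive-noise model of degradation are all assumptions made for the example.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)

FEATURE_DIM = 64   # dimensionality of each (synthetic) frame descriptor
NUM_BITS = 16      # bits per LSH key
NUM_TABLES = 8     # multiple tables boost recall under degradation

# Each table hashes a descriptor to the sign pattern of random projections.
table_planes = [rng.standard_normal((NUM_BITS, FEATURE_DIM))
                for _ in range(NUM_TABLES)]
tables = [defaultdict(list) for _ in range(NUM_TABLES)]

def lsh_key(planes, feature):
    """Quantize a descriptor to a compact binary key."""
    key = 0
    for bit in (planes @ feature) > 0:
        key = (key << 1) | int(bit)
    return key

def index_video(video_id, frames):
    """Add every frame descriptor of a reference video to every table."""
    for t, planes in enumerate(table_planes):
        for i, frame in enumerate(frames):
            tables[t][lsh_key(planes, frame)].append((video_id, i))

def lookup(frames):
    """Return the reference video with the most hash-bucket collisions."""
    votes = defaultdict(int)
    for t, planes in enumerate(table_planes):
        for frame in frames:
            for video_id, _ in tables[t].get(lsh_key(planes, frame), ()):
                votes[video_id] += 1
    return max(votes, key=votes.get) if votes else None

# Build a toy reference index: 5 "videos" of 50 frames each.
reference = {vid: rng.standard_normal((50, FEATURE_DIM)) for vid in range(5)}
for vid, frames in reference.items():
    index_video(vid, frames)

# A degraded 20-frame snippet of video 3; additive noise stands in for
# recompression, recapture, and similar distortions.
snippet = reference[3][10:30] + 0.05 * rng.standard_normal((20, FEATURE_DIM))
print(lookup(snippet))
```

Because each table only needs a frame's full bit pattern to survive the degradation occasionally, using several short-key tables (rather than one long key) is the standard LSH trade-off between recall and lookup cost.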

Another process that must happen with new material on YouTube is making that new content discoverable. A large part of this is correctly adding new content to the recommendation lists of established, related videos. For videos that have some previous traffic, we can do this by treating the video database as a weighted graph, with connections made between videos that were "co-viewed" in the same user session. For videos that are completely new, we must rely on connections formed between videos with similar user-provided text data and/or similar visual and audio features. I will discuss how we can use these connections to create better video recommendation lists by treating each possible recommendation as a label. As in previous work, we have found that graph-based propagation can be very effective at finding a good label distribution across nodes, starting from partial information. With videos as nodes and recommended videos as labels, this results in recommendations that are stronger than those of alternative approaches. I will also discuss how to make this graph-propagation process efficient by moving all re-normalization scaling into a pre-processing step, allowing the use of standard linear-algebra techniques (e.g., Gaussian elimination, the stabilized biconjugate gradient method) to solve the resulting system of equations. Finally, I will discuss techniques that allow incremental updates of the recommendation labels, in order to handle the characteristics of the YouTube database.
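The propagation-as-linear-system idea can be sketched on a tiny co-view graph: fold the row re-normalization into a pre-computed stochastic matrix P, then solve (I - alpha*P) F = (1 - alpha) * Y directly. All specifics here (the graph, alpha, the two candidate labels) are illustrative assumptions, not the production formulation; the dense solve stands in for the iterative solvers used at scale.

```python
import numpy as np

# Symmetric co-view weights for 5 videos; video 4 is "new" and connected
# only by content similarity to videos 2 and 3.
W = np.array([
    [0, 3, 1, 0, 0],
    [3, 0, 2, 0, 0],
    [1, 2, 0, 4, 1],
    [0, 0, 4, 0, 2],
    [0, 0, 1, 2, 0],
], dtype=float)

# Pre-processing step: fold all re-normalization into a row-stochastic P.
P = W / W.sum(axis=1, keepdims=True)

# Seed label distributions: rows = videos, columns = candidate
# recommendations. Videos 0 and 3 have known good recommendations.
Y = np.zeros((5, 2))
Y[0, 0] = 1.0   # video 0 seeds recommendation label A
Y[3, 1] = 1.0   # video 3 seeds recommendation label B

alpha = 0.8     # how strongly labels diffuse along co-view edges

# Steady state of F = alpha * P @ F + (1 - alpha) * Y, i.e. the linear
# system (I - alpha * P) F = (1 - alpha) * Y. np.linalg.solve uses an LU
# (Gaussian-elimination) factorization; at YouTube scale an iterative
# solver such as stabilized biconjugate gradients would replace it.
F = np.linalg.solve(np.eye(5) - alpha * P, (1 - alpha) * Y)

# Each row of F scores the candidate recommendations for one video; the
# new video 4 inherits label B through its strong path to video 3.
print(F.round(3))
```

The same structure explains why incremental updates are attractive: adding a node or edge perturbs only a few rows of P and Y, so the previous F is a good warm start for an iterative solver.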

If time allows, I will also discuss research into learning context-dependent visual distances, to allow better initial connections between videos.


Michele Covell received the BS in electrical engineering from the University of Michigan. She received the MS and PhD in signal processing from MIT, for her research with Jae Lim and Alan Oppenheim, respectively. She joined SRI International, working in active acoustic-noise control, and then Interval Research Corporation, where her research covered a wide range of topics in audio, image, and video processing. In 2000, she joined YesVideo and worked on faster-than-real-time video analysis. She moved to the Mobile Streaming Media group at HP Labs, as a key contributor to streaming-video services in 3G telephony networks. This work is listed as one of the top-40 accomplishments from HP Labs' 40-year history. She moved to Google's research group in 2005, where she focused for several years on large-scale audio and video fingerprinting, identification, and retrieval. For this work, she received two Google awards --- one for innovation and one for financial impact. More recently, she has been working in image and texture analysis and in large-scale graph-labeling problems. In addition to her extensive publications, she has 45 granted US patents and 10 published US-patent applications.
