The automatic detection of semantic concepts such as objects, locations, and events in video streams is becoming an urgent problem as the amount of digital video being stored and published grows rapidly. Such tagging systems are usually trained on datasets of manually annotated videos. Acquiring this training data is time-consuming and costly, which is why current standard benchmarks provide high-quality but small training sets.
In contrast, the human visual system continuously learns from a plethora of visual information, parts of which are digitized and publicly available in large-scale video archives such as YouTube. The overall goal of the MOONVID project is to exploit such web video portals for visual learning. In particular, three scientific questions of fundamental importance are addressed:
- How can appropriate features for inferring semantics from video be selected and combined?
- How can visual learning be made robust with respect to irrelevant content and weak annotations?
- Can motion segmentation, which separates objects from the background, be used to improve object detection? (See the sketch after this list.)
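
To make the last question concrete, the following is a minimal sketch of motion segmentation as a source of object hypotheses. It uses OpenCV background subtraction purely for illustration; MOONVID does not prescribe a particular segmentation method, and the video path `example.mp4` and the area threshold are hypothetical placeholders.

```python
# Minimal sketch: motion segmentation as a source of object hypotheses.
# Assumptions: OpenCV is available; "example.mp4" is a placeholder video path;
# the 500-pixel area threshold is an arbitrary illustrative value.
import cv2

cap = cv2.VideoCapture("example.mp4")
subtractor = cv2.createBackgroundSubtractorMOG2(history=200, varThreshold=25)

while True:
    ok, frame = cap.read()
    if not ok:
        break

    # Foreground mask: pixels whose appearance deviates from the background model.
    mask = subtractor.apply(frame)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)

    # Connected foreground regions become candidate object boxes that a
    # detector could score or a learner could use as weak supervision.
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    boxes = [cv2.boundingRect(c) for c in contours if cv2.contourArea(c) > 500]
    # 'boxes' now holds (x, y, w, h) hypotheses for moving objects in this frame.

cap.release()
```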