
Project | LIEREx

Language-Image Embeddings for Robotic Exploration

"Robot, bring me a cup!"

This seemingly simple instruction often poses significant problems for robots.

A robot can query a semantic map for the positions of known objects, but the requested object may not have been observed in the environment at all.

As humans, the solution to this problem is usually intuitively clear to us: we search for the desired object in a plausible location, in this case, for example, a kitchen cabinet. For a robot, however, inferences of this kind require modeling a semantic domain, e.g. by explicitly creating a rule that cups can be found in kitchen cabinets, or by providing suitable training data so that a neural network can learn the implicit relationship between cups and kitchen cabinets. Both approaches require a great deal of effort and still generally cover only a few possible cases.

An equally problematic case arises when an object is not even included in the vocabulary, i.e. the set of possible object classes, of the semantic map. Such objects cannot be recognized by the robot at all and therefore cannot be found.

In recent years, great progress has been made in combining the visual and linguistic domains through the development of large language models (LLMs) and vision transformers. By using vision-language (VL) models such as CLIP (Radford et al., 2021), joint embeddings of text and image data can be generated. These models enable the recognition of objects beyond a previously defined vocabulary and can also model relationships between object classes via their embedding space.
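The zero-shot matching that such VL models enable can be illustrated with plain cosine similarity in the joint embedding space. The vectors below are invented placeholders standing in for the outputs of a real text and image encoder such as CLIP's; a minimal sketch under that assumption:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder joint embeddings; a real VL model would produce these
# (typically with hundreds of dimensions) from its text and image encoders.
text_embeddings = {
    "a photo of a cup":   np.array([0.9, 0.1, 0.2]),
    "a photo of a chair": np.array([0.1, 0.9, 0.1]),
}
image_embedding = np.array([0.8, 0.2, 0.3])  # hypothetical camera-image embedding

# Zero-shot recognition: pick the text prompt closest to the image,
# without ever training a classifier for these specific classes.
scores = {label: cosine_similarity(vec, image_embedding)
          for label, vec in text_embeddings.items()}
best = max(scores, key=scores.get)
print(best)
```

Because the vocabulary is just a list of text prompts, new object classes can be added at query time simply by embedding new strings.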

In the LIEREx project, we are developing a new type of semantic map based on these VL models that will allow queries for arbitrary objects. By exploiting implicit relationships between related object classes and existing explicit prior knowledge, the map will also support a suitable search strategy for unknown objects. The overall system will be implemented on a mobile robot and evaluated by means of "goal-oriented exploration" of an indoor environment.

LIEREx is directly related to the objectives of the ExPrIS project. In ExPrIS, expectations are generated from prior knowledge to influence the outcome of deep learning models for computer vision problems. The approaches investigated there for embedding knowledge and representing scene context can also be used in LIEREx and supplemented with language embeddings.

Sponsors

BMBF - Federal Ministry of Education, Science, Research and Technology

01IW24004