With retrieval-augmented generation (RAG), a language model is extended so that it can draw on information outside its own training data and incorporate it into an answer. In this project, the relevant websites serve as the sources of knowledge.
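To make the idea concrete, here is a minimal sketch of the retrieval step in Python. It is not the DFKI implementation: the page snippets, the function names and the simple word-overlap scoring are all assumptions chosen for illustration, and a real system would typically use a vector index and pass the prompt to a language model.

```python
import re
from collections import Counter

# Hypothetical page snippets standing in for crawled website content.
PAGES = {
    "/team": "The speech recognition team includes researchers who studied computational linguistics.",
    "/contact": "Contact the institute by phone or email.",
    "/projects": "Current projects cover retrieval augmented generation for websites.",
}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def retrieve(question, pages, k=1):
    """Return the k page URLs whose text shares the most word occurrences with the question."""
    q = Counter(tokenize(question))
    scores = {}
    for url, text in pages.items():
        overlap = q & Counter(tokenize(text))  # multiset intersection
        scores[url] = sum(overlap.values())
    # Highest score first; ties are broken alphabetically by URL.
    return sorted(scores, key=lambda u: (-scores[u], u))[:k]

def build_prompt(question, pages, k=1):
    """Combine the retrieved text and the question into a prompt for a language model."""
    context = "\n".join(f"[{url}] {pages[url]}" for url in retrieve(question, pages, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

Calling `build_prompt("Who works on speech recognition?", PAGES)` yields a prompt whose context is taken from the `/team` page; the model then answers from that context rather than from its training data alone.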
AI determines page content and prepares information
If the project succeeds as planned, answering questions such as ‘Which countries do the employees who have studied computational linguistics and are working on speech recognition come from?’ will be a simple exercise for the DFKI technology. Among other things, the website-specific RAGs make it possible to uncover information that would otherwise be hard to see or to combine.
Another advantage: ‘The websites automatically become accessible because they can be presented in many languages, by text, voice, image, etc. and in simplified language,’ says Schmeier. At the same time, website maintenance would become much less complicated.
Real answers
Conventional search engines return documents as results to the person searching. RAGs, by contrast, provide actual answers. However, many of the problems that arise when RAGs are built from websites have not yet been solved.
The approach taken by the researchers at DFKI: ‘Through the type of indexing, i.e. the transformation of the website content into the content of the RAG, we can find general solutions for the RAGs that can also be applied to other sources,’ explains Schmeier. Links from one document to another are one example of what would make this possible.
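One way to picture link-aware indexing is an index in which every entry keeps its outgoing links alongside its text, so that a retrieval hit can be expanded to the documents it points to. The sketch below is an assumption about how this could look, not the project's actual index format; `IndexEntry`, `expand_with_links` and the three example pages are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class IndexEntry:
    """One indexed document: its text plus the URLs it links to."""
    url: str
    text: str
    links: list = field(default_factory=list)

def build_index(entries):
    """Map each URL to its entry for direct lookup."""
    return {e.url: e for e in entries}

def expand_with_links(index, hits, depth=1):
    """Return the hit URLs plus every indexed document reachable within `depth` links."""
    seen = list(hits)
    frontier = list(hits)
    for _ in range(depth):
        next_frontier = []
        for url in frontier:
            entry = index.get(url)
            if entry is None:  # hit is not part of the index
                continue
            for target in entry.links:
                if target in index and target not in seen:
                    seen.append(target)
                    next_frontier.append(target)
        frontier = next_frontier
    return seen

# Hypothetical three-page site with placeholder content.
index = build_index([
    IndexEntry("/staff/a", "A. works on speech recognition.", ["/staff/a/cv"]),
    IndexEntry("/staff/a/cv", "Studied computational linguistics.", []),
    IndexEntry("/news", "Institute news.", ["/staff/a"]),
])
```

Here a hit on `/staff/a` also brings in `/staff/a/cv`, which is the kind of combination across pages that a question spanning several facts would need.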
Difficulties within the project
Making all information accessible for corresponding search queries appears to be a mammoth task that involves a number of hurdles. Even if everything runs smoothly on the part of the AI application, the difficulty lies in the individuality of the websites.
‘When parsing the websites to create a robust textual representation of the websites, there have been application-specific challenges to date,’ report the researchers. While working on the project, Sven Schmeier and his team keep encountering new exceptions in the design and layout of websites.
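As a rough illustration of what such a textual representation involves, the following sketch uses Python's standard `html.parser` module to extract visible text and link targets while skipping script and style content. It is a simplified assumption, not the team's parser; the class name `TextAndLinks` and the function `page_to_text` are hypothetical, and real websites need far more special cases, which is exactly the difficulty described above.

```python
from html.parser import HTMLParser

class TextAndLinks(HTMLParser):
    """Collect visible text and link targets, skipping script and style content."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []       # visible text fragments in document order
        self.links = []       # href values of <a> tags
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            # Collapse runs of whitespace left over from the page layout.
            self.parts.append(" ".join(data.split()))

def page_to_text(html):
    """Return (visible_text, links) for one HTML page."""
    parser = TextAndLinks()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts), parser.links
```

The standard-library parser tolerates much malformed markup, but it knows nothing about navigation menus, cookie banners or content rendered by JavaScript, so each site layout can still call for its own handling.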
On the way to a solution
Research is currently being conducted on two fronts. On the one hand, the team is creating a benchmark data set for multi-hop information retrieval over web content, i.e. raw websites. On the other hand, it is testing the reasoning capabilities of open-source LLMs for navigating web content, using the team's own textual web representations.
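The two fronts can be sketched together: a benchmark item records a question, a start page and the page holding the answer, and a navigation loop lets a policy choose which link to follow at each hop. Everything below is a hypothetical illustration; the record fields, the `navigate` loop and the `choose_link` callback (which stands in for a zero-shot LLM call) are assumptions, not the project's data format or evaluation code.

```python
from dataclasses import dataclass

@dataclass
class BenchmarkItem:
    """A hypothetical multi-hop benchmark record."""
    question: str
    start_url: str
    answer_url: str  # page on which the answer is found
    max_hops: int    # number of link-following actions allowed

def navigate(site, item, choose_link):
    """Follow links picked by `choose_link` until the answer page is reached
    or the hop budget is spent. Returns the list of visited URLs."""
    path = [item.start_url]
    current = item.start_url
    for _ in range(item.max_hops):
        if current == item.answer_url:
            break
        links = site.get(current, [])
        if not links:
            break
        choice = choose_link(item.question, current, links)
        if choice not in links:  # reject actions not available on the page
            break
        current = choice
        path.append(current)
    return path

def success_rate(site, items, choose_link):
    """Fraction of items for which navigation ends on the answer page."""
    hits = sum(navigate(site, it, choose_link)[-1] == it.answer_url for it in items)
    return hits / len(items) if items else 0.0

# Hypothetical site graph: URL -> outgoing links.
site = {"/": ["/team", "/news"], "/team": ["/team/speech"], "/team/speech": [], "/news": []}

def first_link(question, current, links):
    """Trivial stand-in policy; a zero-shot LLM call would replace this."""
    return links[0]
```

Swapping `first_link` for a function that prompts a language model with the question and the page's textual representation makes it possible to compare models on the same items by their success rate.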
However, the current zero-shot tests show that the language models used do not select the optimal actions based on the question and the web content. In addition, the researchers have already identified significant differences between the open-source LLM Llama 2 70B and the proprietary GPT-4.
The search for a suitable language model therefore continues. In the next series of tests, Gemini Ultra 1.5 will be tested in the hope of achieving even better performance. The data set created by the researchers and the improved reasoning capabilities of the Gemini models are expected to work in tandem towards this goal.