High-quality information extraction is a major pillar of our rapidly developing global information and knowledge society. At present, either rule-based or statistical approaches are employed. Statistical methods usually deliver better results than rule-based systems and can be developed much more quickly. On the downside,
- a large repository of sample solutions is required for training purposes,
- the quality of the results can often be influenced only by expensive trial-and-error methods,
- complex entities are recognized comparatively poorly.
The goal of the project HIT is a suitable combination of rule-based and statistical approaches in order to eliminate the above disadvantages. For this purpose, the successful rule-based core technology SProUT for shallow multilingual text analysis is extended by major novel functionality:
- Flexible processing strategies that are configurable by the user,
- Integration of statistical information into the constraint-based formalism,
- Freely definable workflows combining results produced by several grammars,
- Convenient interfaces for the integration of external language technologies, such as tokenizers or morphological analysis components,
- Research results for the efficient processing of very large gazetteer data,
- Analysis of tables,
- Mutual integration of SProUT and GATE.
The project HIT will produce a new system, SProUT NG ("NG" stands for "next generation"), which will be developed further until it eventually reaches professional usability. The project presents its results in a demonstrator applied to information extraction tasks in the field of contract review. At the same time, a large number of new opportunities will arise for IE users in Berlin and elsewhere.
The project is coordinated by the application partner Leverton GmbH and carried out in Berlin.
The project has received funding from the European Regional Development Fund (EFRE) of the European Union.
Partners
Leverton GmbH, DFKI GmbH