Étude doublons et synonymes thésaurus APV (Study of duplicates and synonyms in the APV thesaurus)
- Start date: Feb. 1, 2008
- End date: Dec. 31, 2011
- Private funding by Peugeot Citroën
- Telelingua International
The goal of the PSA project is to improve vocabulary consistency in technical texts across 20 languages by removing identical and semantic duplicates from the database that is used (and regularly updated) for compiling technical documentation. The French database contains about 400,000 entries (sentences, parts of sentences and isolated words) with an average of 9 words per entry. The content of the French database has been translated to some extent into the other 19 supported languages. The language portfolio consists of French and English (the two pivot languages), German, Chinese, Croatian, Danish, Spanish, Finnish, Greek, Dutch, Hungarian, Italian, Japanese, Polish, Portuguese, Russian, Slovenian, Swedish, Czech and Turkish.
The milestones of the project are the following:
1. Automatic bilingual lexicon extraction with French as pivot language: for every language, identify tokens, lemmata, frequency information, synonym sets and context of usage (for polysemous words), and identify the concept (the most frequent synonym) of each synonym set.
2. Replacement of all tokens by the concept identified for them in milestone 1.
3. Identification of all duplicates (identical entries and semantic duplicates).
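The three milestones can be illustrated with a minimal Python sketch. It is not the project's actual implementation: the function names (`choose_concepts`, `normalize`, `find_duplicates`), the whitespace tokenization and the English sample entries are all assumptions made for illustration, and lemmatization and context-based handling of polysemous words are left out.

```python
from collections import Counter, defaultdict


def choose_concepts(synonym_sets, frequencies):
    """Milestone 1 (simplified): map each synonym to the most frequent
    member of its synonym set, which serves as the concept."""
    concept_of = {}
    for synset in synonym_sets:
        # Highest frequency wins; ties are broken alphabetically
        # so that the result is deterministic.
        concept = min(synset, key=lambda w: (-frequencies[w], w))
        for word in synset:
            concept_of[word] = concept
    return concept_of


def normalize(entry, concept_of):
    """Milestone 2 (simplified): replace every token by its concept.
    Tokens are lowercased and split on whitespace (an assumption)."""
    tokens = entry.lower().split()
    return " ".join(concept_of.get(t, t) for t in tokens)


def find_duplicates(entries, concept_of):
    """Milestone 3 (simplified): group entries whose normalized forms
    coincide; groups of size > 1 are identical or semantic duplicates."""
    groups = defaultdict(list)
    for entry in entries:
        groups[normalize(entry, concept_of)].append(entry)
    return [group for group in groups.values() if len(group) > 1]


# Made-up sample data, not taken from the project database.
entries = [
    "remove the bonnet",
    "remove the hood",
    "Remove the bonnet",
    "check the oil level",
]
synonym_sets = [{"bonnet", "hood"}]  # hypothetical synonym set
frequencies = Counter(t for e in entries for t in e.lower().split())

concept_of = choose_concepts(synonym_sets, frequencies)
duplicates = find_duplicates(entries, concept_of)
print(duplicates)
```

In this toy data "bonnet" occurs more often than "hood", so it becomes the concept, and the first three entries end up in one duplicate group. A real system would work on lemmata rather than raw tokens and would use context to decide which sense of a polysemous word applies before replacing it.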