Fernando Sánchez Rodas

SOURCE CORPUS PREPARATION

NAIMES utilises the ParlaMint 5.0 corpus, a multilingual collection of speeches from European national parliaments.

During the initial stage of NAIMES, the Czech, English, and German sub-corpora were filtered and extracted from ParlaMint 5.0. These annotated texts were curated following predefined corpus design criteria to ensure they are gender-balanced and reflect the political diversity of each country.

CREATION OF ONOMASTIC REFERENCE DATABASE

To avoid evaluating name translation in isolation, we created a multilingual reference database of institutional names in the four languages of the project (Czech, German, English, and Spanish). Taking advantage of ParlaMint’s existing Named Entity Recognition (NER) metadata, we extracted and matched proper name equivalents across the four language sub-corpora. Any unpaired proper names were then reconciled using trusted resources, including IATE (interinstitutional terminology database of the European Union), UNTERM (terminology database of the United Nations), and specialised legal dictionaries.
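For illustration only, one entry in such a reference database could be modelled as below. This is a minimal Python sketch, not the project's actual schema: the class `NameEntry`, the identifier format, the language codes, and the example values are all assumptions made for the example. It shows how an entry with a missing equivalent might be detected and then completed from an external resource.

```python
from dataclasses import dataclass, field

# Project languages as assumed two-letter codes: Czech, German, English, Spanish.
LANGUAGES = ("cs", "de", "en", "es")


@dataclass
class NameEntry:
    """One institution with its name in each project language (hypothetical schema)."""
    concept_id: str                                         # hypothetical internal identifier
    names: dict[str, str] = field(default_factory=dict)     # language code -> name
    sources: dict[str, str] = field(default_factory=dict)   # language code -> provenance

    def missing_languages(self) -> list[str]:
        """Return the languages for which no equivalent has been recorded yet."""
        return [lang for lang in LANGUAGES if lang not in self.names]


# Illustrative values only, not taken from the project database.
entry = NameEntry(
    concept_id="inst-0001",
    names={
        "cs": "Evropský parlament",
        "de": "Europäisches Parlament",
        "en": "European Parliament",
    },
    sources={"cs": "NER", "de": "NER", "en": "NER"},
)

unpaired_before = entry.missing_languages()   # the Spanish equivalent is still missing

# Reconcile the gap with an equivalent looked up in a terminology resource.
entry.names["es"] = "Parlamento Europeo"
entry.sources["es"] = "terminology database"

unpaired_after = entry.missing_languages()
print(unpaired_before, unpaired_after)
```

Recording the provenance of each equivalent alongside the name keeps automatically matched names distinguishable from those reconciled by hand.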

GENERATION OF AI-TRANSLATED SPANISH CORPUS

With the source corpus and onomastic reference database established, we began the machine translation phase. We first designed and iteratively tested translation-specific prompts to optimise the model’s output. With these refined prompts, translations were generated in ChatGPT and systematically compiled into an AI-generated corpus, ParlIAmento.

In addition to named entity annotation, this corpus incorporates metadata essential for documenting AI-generated translations: timestamps, language direction, the specific prompt strings used, and coded references to the source texts.
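A metadata record of this kind could be sketched as follows. The field names, the reference code format, and the prompt text are hypothetical and chosen only to mirror the four metadata types listed above; the corpus's real encoding may differ.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class TranslationRecord:
    """Metadata for one AI-generated translation (hypothetical field names)."""
    source_ref: str    # coded reference to the source text
    direction: str     # language direction, e.g. "en>es"
    prompt: str        # exact prompt string sent to the model
    timestamp: str     # generation time, ISO 8601 in UTC
    target_text: str   # the generated Spanish text


def make_record(source_ref, direction, prompt, target_text, now=None):
    """Build a record, stamping it with the current UTC time unless one is given."""
    now = now or datetime.now(timezone.utc)
    return TranslationRecord(source_ref, direction, prompt, now.isoformat(), target_text)


# Illustrative values only; the reference code and prompt are invented for the example.
record = make_record(
    source_ref="SRC-EXAMPLE-001",
    direction="en>es",
    prompt="Translate the following parliamentary speech into Spanish.",
    target_text="Señor presidente, ...",
    now=datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc),
)

# Serialise to JSON, keeping non-ASCII characters readable.
serialised = json.dumps(asdict(record), ensure_ascii=False, indent=2)
print(serialised)
```

Storing the full prompt string with every record, rather than a prompt identifier alone, means each translation stays reproducible even if the prompt is revised later.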

EVALUATION OF AI-TRANSLATED NAMES

Building on the onomastic reference database created in the second stage, this stage involves the design of multidimensional assessment metrics for AI-translated names. The Spanish output will be assessed on several parameters, including addition and omission rates, resemblance to existing human translations, the use of borrowings, and the degree of creativity.
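Two of these parameters can be given a rough computational reading, sketched below in Python. This is a deliberately naive illustration under stated assumptions, not the project's metrics: the function names are hypothetical, names are compared by exact match after case and whitespace normalisation, and resemblance is approximated with the character-based ratio from the standard-library `difflib` module. Under exact matching, a creative but valid rendering would count as both an omission and an addition, which is one reason why a multidimensional design is needed.

```python
from difflib import SequenceMatcher


def normalise(name: str) -> str:
    """Case-fold and collapse whitespace so trivial variants compare equal."""
    return " ".join(name.casefold().split())


def omission_rate(expected, produced) -> float:
    """Share of expected reference names that do not appear in the output."""
    exp = {normalise(n) for n in expected}
    prod = {normalise(n) for n in produced}
    return len(exp - prod) / len(exp) if exp else 0.0


def addition_rate(expected, produced) -> float:
    """Share of names in the output that have no counterpart in the reference."""
    exp = {normalise(n) for n in expected}
    prod = {normalise(n) for n in produced}
    return len(prod - exp) / len(prod) if prod else 0.0


def resemblance(candidate: str, reference: str) -> float:
    """Character-level similarity between 0.0 and 1.0 (difflib ratio)."""
    return SequenceMatcher(None, normalise(candidate), normalise(reference)).ratio()


# Illustrative values only.
expected = ["Parlamento Europeo", "Tribunal Constitucional"]
produced = ["parlamento  europeo", "Bundestag"]

print(omission_rate(expected, produced))   # one of two expected names is missing
print(addition_rate(expected, produced))   # one of two produced names is unexpected
print(resemblance("Parlamento de Europa", "Parlamento Europeo"))
```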

The resulting data will be analysed at two levels: 1) differences in quality between the three language combinations (Czech > Spanish, German > Spanish, and English > Spanish); and 2) overall performance for Spanish as a target language. To ensure the broader impact of NAIMES, the methods in this stage are designed to be language-independent, so that they can serve as a model for name translation assessment in other language pairs.
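The two levels of analysis could be computed along these lines. The sketch assumes that each assessed name has already been given a single numeric score and a language direction label; the function `summarise` and the toy scores are placeholders for illustration and do not represent project results.

```python
from collections import defaultdict
from statistics import mean


def summarise(scores):
    """Return (mean score per language direction, overall mean score)."""
    by_pair = defaultdict(list)
    for direction, score in scores:
        by_pair[direction].append(score)
    per_pair = {direction: mean(values) for direction, values in by_pair.items()}
    overall = mean([score for _, score in scores])
    return per_pair, overall


# Toy placeholder scores, not measured data.
toy_scores = [
    ("cs>es", 0.5),
    ("cs>es", 1.0),
    ("de>es", 1.0),
    ("en>es", 0.5),
]

per_pair, overall = summarise(toy_scores)
print(per_pair)   # level 1: comparison between language combinations
print(overall)    # level 2: overall figure for Spanish output
```

Because the function depends only on direction labels and scores, the same aggregation would apply unchanged to any other language pair.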

This work was supported by OP JAC Project “MSCA Fellowships at Palacký University IV.” CZ.02.01.01/00/22_010/0013054, run at Palacký University in Olomouc, Czech Republic.
