Multilingual Web Site
A Multilingual Web Site (MWS) is a site that contains multilingual parallel texts; i.e., texts that are translations of each other. For example, most of the European institutions' sites are MWS, such as Europa.
The objective is to specify a comprehensive open architecture (and not just an application) that allows the creation of high-quality, low-cost MWS. Many existing applications have some multilingual facilities and (stating the obvious) one should harvest the best techniques around.
Comprehensive here means one whole integrated architecture that addresses the full cycle of the Authorship, Translation and Publication chain (ATP-chain).
MWS are of great practical relevance: they are very important portals with many hits, and they are very complex and costly to create and maintain.
This document is a position paper for the Multilingual Web Site BOF at WWW2006.
Transparent Content Negotiation (TCN)
One URI can have several variants.
http://mysite/doc could have:
- English in HTML
- English in PDF
- Spanish in HTML
- Spanish in PDF
- Variant list: list of available variants
- Language variant list: list of available linguistic versions; a subset of variant list
Some of the HTTP header fields involved are:
In addition to language and format (MIME type), there are other dimensions. For details (and strict definitions), have a look at the RFC Transparent Content Negotiation in HTTP (RFC 2295).
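As a rough illustration of the language dimension only, the following Python sketch picks the best language variant from an Accept-Language value. The function names are hypothetical and the matching is deliberately simplified; it is not the full algorithm of the RFC, which also weighs the other dimensions and server-side quality values.

```python
def parse_accept_language(header):
    """Parse an Accept-Language value into (language-range, q) pairs, best first."""
    ranges = []
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        parts = [p.strip() for p in item.split(";")]
        tag = parts[0].lower()
        q = 1.0  # default quality when no q parameter is given
        for param in parts[1:]:
            if param.startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0  # treat a malformed q-value as "not acceptable"
        ranges.append((tag, q))
    # sorted() is stable, so header order is kept among equal q-values.
    return sorted(ranges, key=lambda r: r[1], reverse=True)


def best_language(header, available):
    """Return the best available language, or None if none is acceptable."""
    for tag, q in parse_accept_language(header):
        if q <= 0:
            continue
        for lang in available:
            candidate = lang.lower()
            # A range such as "es" matches "es" itself and subtags such as "es-MX".
            if tag == "*" or candidate == tag or candidate.startswith(tag + "-"):
                return lang
    return None
```

For example, `best_language("es;q=0.8, en", ["en", "es"])` selects `"en"`, and a request for only French and German against the same variants yields `None`, which is the case where the variant list should be offered to the user.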
Servers do not return the variant list. Apache seems to return the variant list only with 406 Not Acceptable. One can make Apache always return the variant list by changing only one line in the source code and recompiling it (thanks to K. Holtman for pointing out the line). But the requirement is for servers that can be configured to return the variant list or subsets of it.
VariantList Language MediaType
Note that this does not exist in Apache. It is just an example and a proposal.
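What such a parameter might select can be sketched in Python as a projection of the variants onto the requested dimensions. This is only an illustration of the proposal: the dictionary keys `language` and `media_type` and the function `variant_list` are hypothetical names, not part of any existing server.

```python
# Hypothetical variants of http://mysite/doc, following the example list above.
VARIANTS = [
    {"language": "en", "media_type": "text/html"},
    {"language": "en", "media_type": "application/pdf"},
    {"language": "es", "media_type": "text/html"},
    {"language": "es", "media_type": "application/pdf"},
]


def variant_list(variants, dimensions):
    """Project the variants onto the given dimensions, dropping duplicates."""
    seen = []
    for variant in variants:
        entry = tuple(variant[d] for d in dimensions)
        if entry not in seen:
            seen.append(entry)
    return seen


# The language variant list is the subset obtained with a single dimension.
language_variants = variant_list(VARIANTS, ["language"])
full_variants = variant_list(VARIANTS, ["language", "media_type"])
```

Requesting only the language dimension collapses the four variants into the two linguistic versions; requesting both dimensions returns all four.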
The greatest cost with MWS is translating:
- Original pages
- Maintenance, in particular, linguistic segments
The public expects web sites to be up to date; errors are expected to be corrected immediately. This is very different from paper publications, where the public expects errors to be corrected in the next edition.
Hence, one often has to translate many linguistic segments; a costly business, as there is a fixed overhead for each translation request, independently of the size. Indeed, most translation services are geared to the translation of full documents.
Authorship, Translation and Publication chain (ATP-chain)
The ATP-chain is the cycle for multilingual publishing. Traditionally it was a one-way path:
- Authorship: The author writes the source material
- Translation: The translator(s) translate(s) into the target language(s)
- Publication: The typographer composes the publication
For non-literary materials, nowadays this chain can be two-way; e.g., the translator could send the source material back to the author with change requests to facilitate the translation. Also, one has markup from the beginning to automate the whole process.
Multilingual parallel text
Multilingual parallel texts are translations of each other. For example, the Treaty of Rome in 22 languages.
Source and target languages
The most common case is that the author writes in one source language and the text is translated into other target languages. But it is not rare to have multilingual sources; e.g., a document with three chapters, each written in a different language. Indeed, in the case of MWS this is quite common.
From a legal point of view, one can have multilingual parallel texts where all the linguistic versions are considered source languages.
Multilingual parallel texts have several dimensions:
- Completeness: full translation, partial translation (ongoing translation), summary, etc.
- Alignment: aligned at document level, paragraph level, term or even word level.
- Resource: human, machine and anything in between.
Each of these dimensions should be considered a continuum between two extremes.
The two main aspects are:
- User side
- Webmaster side
The user should have at least the following facilities:
- The best language variant with TCN
- A mechanism to access all the language variants
- If none of the requested languages is available, an automatic offer of the available language variants with a selection mechanism
The access mechanism to the language variants can be:
- Browser side: a language button in the browser (in the same row as File, Edit, etc.), enabled when other language variants are available. When none of the requested language variants is available, it will be enabled even for one language variant.
- Server side: a language link that, when followed, will trigger the server to generate an HTML page with the language variants, if any. When none of the requested language variants is available, it will be triggered automatically.
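A minimal Python sketch of the server-side option follows: it builds the HTML page that lists the language variants of a URI. The names are hypothetical, and the variant URI scheme (the language code appended as a suffix, such as `doc.en`) is an assumption made only for this illustration.

```python
from html import escape

# Illustrative display names; a real site would cover all of its languages.
LANGUAGE_NAMES = {"en": "English", "es": "Español", "fr": "Français"}


def language_menu(uri, languages):
    """Build a minimal HTML page linking to each language variant of uri."""
    items = []
    for lang in languages:
        name = LANGUAGE_NAMES.get(lang, lang)
        # Assumed scheme: the variant URI is the neutral URI plus ".<language>".
        href = escape(f"{uri}.{lang}", quote=True)
        items.append(
            f'<li><a href="{href}" hreflang="{lang}" lang="{lang}">'
            f"{escape(name)}</a></li>"
        )
    return "<html><body><ul>\n" + "\n".join(items) + "\n</ul></body></html>"


menu = language_menu("http://mysite/doc", ["en", "es"])
```

The same function could serve both cases described above: called when the user follows the language link, and called automatically when none of the requested languages is available.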
In this context, Webmaster refers to all the aspects of the construction of MWS: author, translator, etc.
The general approach should be to generate all the linguistic versions in parallel. It is based on the following two components:
- Skeleton
- Language table
The intention is to replace each language key in the skeleton by its language value from the language table.
Skeleton
<html lang="&lang;"> <body> <p>&hello;</p> </body> </html>
Language table (as two text files)
lang=en hello=Hello world
lang=es hello=Hola mundo
Generated pages
<html lang="en"> <body> <p>Hello world</p> </body> </html>
<html lang="es"> <body> <p>Hola mundo</p> </body> </html>
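The replacement of language keys by language values can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a definitive generator: the names `SKELETON`, `LANGUAGE_TABLE` and `generate` are hypothetical, and the language table is held in memory rather than in text files.

```python
import re

SKELETON = '<html lang="&lang;"> <body> <p>&hello;</p> </body> </html>'

# One set of language values per language, mirroring the two text files.
LANGUAGE_TABLE = {
    "en": {"lang": "en", "hello": "Hello world"},
    "es": {"lang": "es", "hello": "Hola mundo"},
}

# A language key is written like an entity reference: &key;
KEY = re.compile(r"&([A-Za-z][\w.-]*);")


def generate(skeleton, values):
    """Replace each &key; in the skeleton by its language value."""
    def substitute(match):
        key = match.group(1)
        # Keys missing from the table (for example &amp;) are left untouched.
        return values.get(key, match.group(0))
    return KEY.sub(substitute, skeleton)


# Generate all the linguistic versions in parallel from the one skeleton.
pages = {lang: generate(SKELETON, values) for lang, values in LANGUAGE_TABLE.items()}
```

One design point worth noting: `lang` is also a predefined HTML entity name, so a real implementation would probably need a key syntax or naming scheme that cannot collide with standard entities.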
This construction is format (MIME type) dependent; e.g., it can be done in HTML and XML, but it might not be possible in other formats.
Given the pair (language key, language), one must obtain the corresponding language value.
The language table can be implemented in at least the following ways:
- Text files: one file per language; e.g., the line
- URI: e.g.,
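For the text-file option, the lookup from the pair (language key, language) to a language value might look like the following Python sketch. It assumes one `key=value` entry per line, and the file contents are inlined as strings so that the example is self-contained; the function names are hypothetical.

```python
def parse_language_table(text):
    """Parse 'key=value' lines into a dict; lines without '=' are skipped."""
    table = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip():
            table[key.strip()] = value.strip()
    return table


def language_value(tables, key, language):
    """Return the language value for (key, language), or None if it is missing."""
    return tables.get(language, {}).get(key)


# Assumed file contents, one file per language.
FILES = {
    "en": "lang=en\nhello=Hello world\n",
    "es": "lang=es\nhello=Hola mundo\n",
}
TABLES = {language: parse_language_table(text) for language, text in FILES.items()}
```

Returning `None` for a missing pair lets the caller decide what to do with an untranslated segment, for example falling back to the source language or flagging it for translation.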
The language key is a unique identifier.
The language keys in a skeleton could be abbreviated;
could be abbreviated to just
The HTML generator program must know how to compose the full key; e.g., a parameter or a meta declaration in the skeleton.
A language value is whatever is pointed to by a language key. It is typically a text phrase, but it could be in any format; e.g., a sound file.
In this context, phrase does not have any grammatical connotation. In the case of text, one can think of it as a string.
ATP-chain in MWS
- The author produces the skeleton and the source language values
- The translator produces the other language values
- A program generates the HTML pages
The program could be:
- Internal to the server
  - Dynamically, when the pages are requested
  - Generated the first time they are requested and kept until stale, for example because one of the entities has changed
- Separate program
  - Batch; i.e., all in one go
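The "generate on first request and keep until stale" option can be sketched in Python by comparing file modification times. This is a minimal sketch under the assumption that the skeleton and language tables are files on disk; `is_stale` and `get_page` are hypothetical names, and `generate` stands for whatever program produces the page.

```python
import os


def is_stale(page_path, source_paths):
    """True if the generated page is missing or older than any of its sources."""
    if not os.path.exists(page_path):
        return True
    page_time = os.path.getmtime(page_path)
    return any(os.path.getmtime(src) > page_time for src in source_paths)


def get_page(page_path, source_paths, generate):
    """Return the page content, regenerating the file first if it is stale."""
    if is_stale(page_path, source_paths):
        with open(page_path, "w", encoding="utf-8") as out:
            out.write(generate())
    with open(page_path, encoding="utf-8") as page:
        return page.read()
```

A batch program would simply call the generator for every skeleton and language in one go, without the staleness check.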
Multilingual Web Content Management System
The language link can be generalized as the ampersand page. In addition to the language variants, the server could also return with a good presentation:
- Other variants
- Metadata and in particular copyrights
- The Accept-Language sent to the server
Ampersand page for http://mysite/doc
This page available in: English, Spanish, French
Copyright for the site
Your preferred languages: English, Spanish
The language of the ampersand page will be as per the TCN. If the server does not know any of the languages requested, there is an intermediate step with a menu, so that the user chooses one of the languages available for the ampersand page.
The ampersand link (the href attribute) should be the same for all pages. One can use the field Referer to generate the page. Having different ampersand links for each page would make it much harder.
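A minimal Python sketch of how a single ampersand link could work: the server reads the Referer field to learn which page the user came from and builds the ampersand page for it. The function name, its parameters and the plain-text layout are assumptions made for illustration; since a browser may omit the Referer field, the sketch also covers that case.

```python
def ampersand_page(referer, variants_by_uri, accept_language, site_copyright):
    """Build the text of the ampersand page for the referring URI."""
    if not referer:
        # The Referer field is optional, so the originating page may be unknown.
        lines = ["Ampersand page (referring page unknown)"]
    else:
        lines = [f"Ampersand page for {referer}"]
        languages = variants_by_uri.get(referer, [])
        if languages:
            lines.append("This page available in: " + ", ".join(languages))
    lines.append(site_copyright)
    lines.append("Your preferred languages: " + accept_language)
    return "\n".join(lines)


page = ampersand_page(
    "http://mysite/doc",
    {"http://mysite/doc": ["English", "Spanish", "French"]},
    "English, Spanish",
    "Copyright for the site",
)
```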
Language neutral URIs
User translation request
A mechanism that allows users to request translations.
- Write a report from the BOF
- Start a working group
- It could be within an existing organization; e.g., W3C
M.T. Carrasco Benitez. Disclaimer: I speak only for myself.