Multilingual Web Site
From WWW2006
Edit only typos/grammar. Add comments in the discussion.
Introduction
Multilingual Web Site (MWS) refers to sites that contain multilingual parallel texts; i.e., texts that are translations of each other. For example, most of the sites of the European institutions are MWS, such as Europa.
The objective is to specify a comprehensive open architecture (and not just an application) that allows the creation of high-quality, low-cost MWS. Many existing applications have some multilingual facilities and (stating the obvious) one should harvest the best techniques around.
The architecture should be comprehensive in the sense of being one integrated whole that addresses the full cycle of the Authorship, Translation and Publication chain (ATP-chain).
MWS are of great practical relevance: they are often very important portals with many hits, and they are very complex and costly to create and maintain.
This document is a position paper for the Multilingual Web Site BOF at WWW2006.
Background
Transparent Content Negotiation (TCN)
One URI can have several variants.
For example, http://mysite/doc could have:
- English in HTML
- English in PDF
- Spanish in HTML
- Spanish in PDF
Informally:
- Variant list: list of available variants
- Language variant list: list of available linguistic versions; a subset of variant list
Some of the HTTP header fields involved are:
- Accept-Language
- Content-Language
- Alternates
- Referer
In addition to language and format (MIME type), there are other dimensions. For details (and strict definitions), see the RFC Transparent Content Negotiation in HTTP (RFC 2295).
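As an informal illustration of the language dimension, the following sketch picks the best language variant from an Accept-Language value. It is a simplification and not the algorithm defined in the RFC: the function names are hypothetical, the wildcard "*" is not handled, and falling back from a regional tag (es-MX) to its primary tag (es) is an assumption. Available languages are expected in lower case.

```python
def parse_accept_language(header):
    """Parse an Accept-Language value into (language, q) pairs, best first."""
    prefs = []
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        parts = item.split(";")
        lang = parts[0].strip().lower()
        q = 1.0  # a missing q parameter means full preference
        for param in parts[1:]:
            name, _, value = param.strip().partition("=")
            if name.strip() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q as "not acceptable"
        prefs.append((lang, q))
    # sorted() is stable, so equal q values keep the order of the header
    return sorted(prefs, key=lambda p: p[1], reverse=True)


def best_language(header, available):
    """Return the best available language, or None if nothing matches."""
    for lang, q in parse_accept_language(header):
        if q <= 0:
            continue
        if lang in available:
            return lang
        # Assumption: a regional tag may fall back to its primary tag
        primary = lang.split("-")[0]
        if primary in available:
            return primary
    return None
```

When the function returns None, none of the requested languages is available, which is the case where the variant list is most useful to the user.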
Often, servers do not return the variant list. For example, Apache seems to return the variant list only with 406 Not Acceptable. One can make Apache always return the variant list (in Alternates) by changing only one line in the source code and recompiling it (thanks to K. Holtman for pointing out the line).
But the requirement is for servers to be configurable to return the variant list or subsets of it.
For example:
- VariantList All
- VariantList Language
- VariantList Language MediaType
Note that this does not exist in Apache. It is just an example and a proposal.
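One possible reading of the proposed parameter is that it projects the variant list onto the named dimensions. The sketch below illustrates that reading only; the data layout, the function name and the exact semantics are assumptions, since VariantList is a proposal and not an existing server directive.

```python
# Hypothetical variant list for http://mysite/doc
VARIANTS = [
    {"language": "en", "media_type": "text/html"},
    {"language": "en", "media_type": "application/pdf"},
    {"language": "es", "media_type": "text/html"},
    {"language": "es", "media_type": "application/pdf"},
]

# Assumed mapping from directive arguments to variant fields
DIMENSIONS = {"Language": "language", "MediaType": "media_type"}


def variant_list(directive, variants):
    """Apply a 'VariantList ...' line and return the projected variant list."""
    words = directive.split()
    if len(words) < 2 or words[0] != "VariantList":
        raise ValueError("not a VariantList directive: %r" % directive)
    args = words[1:]
    if args == ["All"]:
        return [dict(v) for v in variants]
    fields = [DIMENSIONS[a] for a in args]  # KeyError on an unknown dimension
    seen, result = set(), []
    for v in variants:
        key = tuple(v[f] for f in fields)
        if key not in seen:  # keep each combination once, in original order
            seen.add(key)
            result.append(dict(zip(fields, key)))
    return result
```

With this reading, "VariantList Language" yields the language variant list (English, Spanish), while "VariantList Language MediaType" yields all four language and format combinations.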
Translation
The greatest cost with MWS is translating:
- Original pages
- Maintenance, in particular, linguistic segments
The public expects web sites to be up to date; errors are expected to be corrected immediately. This is very different from paper publications, where the public expects errors to be corrected in the next edition.
Hence, one often has to translate many linguistic segments; a costly business, as there is a fixed overhead for each translation request, independently of the size. Indeed, most translation services are geared to the translation of full documents.
Authorship, Translation and Publication chain (ATP-chain)
The ATP-chain is the cycle for multilingual publishing. Traditionally, it was a one-way path:
- Authorship: The author writes the source material
- Translation: The translator(s) translate(s) into the target language(s)
- Publication: The typographer composes the publication
For non-literary materials, nowadays this chain could be two-way; e.g., the translator could send the source material back to the author with change requests to facilitate the translation. Also, one can use markup from the beginning to automate the whole process.
Multilingual parallel text
Multilingual parallel texts are translations of each other. For example, the Treaty of Rome in 22 languages.
Source and target languages
The most common case is that the author writes in one source language and the text is translated into other target languages. But it is not rare to have multilingual sources; e.g., a document with three chapters, each written in a different language. Indeed, in the case of MWS this is quite common.
From a legal point of view, one can have multilingual parallel texts where all the linguistic versions are considered source languages.
Dimensions
Multilingual parallel texts have several dimensions:
- Completeness: full translation, partial translation (ongoing translation), summary, etc.
- Alignment: aligned at document level, paragraph level, term or even word level.
- Resource: human, machine and anything in between.
Each of these dimensions should be considered a continuum between two extremes.
MWS aspects
The two main aspects are:
- User side
- Webmaster side
User side
The user should have at least the following facilities:
- The best language variant with TCN
- A mechanism to access all the language variants
- If none of the requested languages is available, an automatic offer of the available language variants with a selection mechanism
The access mechanism to the language variants can be:
- Browser side: a language button in the browser (in the same row as File, Edit, etc.), enabled when other language variants are available. When none of the requested language variants is available, it will be enabled even for one language variant.
- Server side: a language link that, when followed, will trigger the server to generate an HTML page with the language variants, if any. When none of the requested language variants is available, it will be triggered automatically.
Webmaster side
In this context, Webmaster refers to all the aspects of the construction of MWS: author, translator, etc.
General approach
The general approach should be to generate all the linguistic versions in parallel. It is based on the following two components:
- Skeleton
- Language table
The intention is to replace each language key in the skeleton by its language value from the language table.
Example:
Input
Skeleton
<html lang="&lang;"> <body> <p>&hello;</p> </body> </html>
Language table (as two text files)
lang=en hello=Hello world
lang=es hello=Hola mundo
Output
100.en.html
<html lang="en"> <body> <p>Hello world</p> </body> </html>
100.es.html
<html lang="es"> <body> <p>Hola mundo</p> </body> </html>
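The replacement step in this example can be sketched as follows. This is a minimal illustration and not a definitive generator: it assumes that language keys are written in the skeleton in entity form (&lang;, &hello;), that the language table is already loaded into dictionaries, and the names render and generate are hypothetical.

```python
import re

SKELETON = '<html lang="&lang;"> <body> <p>&hello;</p> </body> </html>'

# Language table: one dictionary per language (assumed in-memory form)
TABLES = {
    "en": {"lang": "en", "hello": "Hello world"},
    "es": {"lang": "es", "hello": "Hola mundo"},
}

# A language key written as &name;
KEY = re.compile(r"&([A-Za-z_][\w.-]*);")


def render(skeleton, table):
    """Replace each language key in the skeleton by its language value."""
    def substitute(match):
        key = match.group(1)
        # Keys missing from the table are left untouched, so ordinary
        # entities such as &amp; survive
        return table.get(key, match.group(0))
    return KEY.sub(substitute, skeleton)


def generate(name, skeleton, tables):
    """Return {file name: page} with all linguistic versions in parallel."""
    return {
        "%s.%s.html" % (name, lang): render(skeleton, table)
        for lang, table in tables.items()
    }
```

Calling generate("100", SKELETON, TABLES) returns the two pages keyed by 100.en.html and 100.es.html. Note that a key present in the table takes precedence over an HTML entity of the same name.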
Skeleton
This construction is format (MIME type) dependent; e.g., it can be done in HTML and XML, but it might not be possible in other formats.
Language table
Given the pair (language key, language), one must obtain the corresponding language value.
The language table can be implemented in at least the following ways:
- Text files: one file per language; e.g., the line k1 in myfile.es.txt
- URI: e.g., http://es.mysite/k1 or http://mysite/es/k1
- Database
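For the text file option, a lookup could be sketched as below. The key=value line format is an assumption taken from the example language table, and the function names are hypothetical.

```python
def parse_table(text):
    """Parse the content of one language file into {language key: value}."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        key, sep, value = line.partition("=")
        if sep:  # ignore lines without '='
            table[key.strip()] = value.strip()
    return table


def lookup(tables, key, language):
    """Given the pair (language key, language), return the language value."""
    return tables[language][key]
```

A file such as myfile.es.txt could be loaded with parse_table(open("myfile.es.txt", encoding="utf-8").read()); a missing key or language raises KeyError, which makes untranslated segments easy to detect.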
Language key
The language key is a unique identifier.
The language keys in a skeleton could be abbreviated; e.g., http://es.mysite/k1 could be abbreviated to just k1.
The HTML generator program must know how to compose the full key; e.g., from a parameter or a meta declaration in the skeleton.
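Composing the full key from an abbreviated one could look like the sketch below, assuming the parameter is a base URI template with a language placeholder. The template syntax and the function name are assumptions for illustration.

```python
from urllib.parse import urljoin


def full_key(key, language, base_template="http://{lang}.mysite/"):
    """Expand an abbreviated language key into a full key (a URI).

    A key that is already a full URI is returned unchanged.
    The base must end with '/' so that the key is appended to it
    rather than replacing its last path segment.
    """
    base = base_template.format(lang=language)
    return urljoin(base, key)
```

For example, full_key("k1", "es") gives http://es.mysite/k1, and passing the template "http://mysite/{lang}/" gives http://mysite/es/k1.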
Language value
A language value is whatever is pointed to by a language key. Typically, it is a text phrase, but it could be in any format; e.g., a sound file.
In this context, phrase does not have any grammatical connotation. In the case of text, one can think of it as a string.
ATP-chain in MWS
- The author produces the skeleton and the source language values
- The translator produces the other language values
- A program generates the HTML pages
Generating techniques
The generation could be:
- Internal to the server
  - Dynamic: pages are generated when they are requested
  - Cached: pages are generated the first time they are requested and kept until stale, for example because one of the entities has changed
- Separate program
  - Batch; i.e., all in one go
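The option of generating a page the first time it is requested and keeping it until stale could be sketched with file modification times. This assumes that the skeleton and the language table are files and that "stale" means that one of them is newer than the generated page; the function names are hypothetical.

```python
import os


def is_stale(output_path, source_paths):
    """True if the page is missing or older than any of its sources."""
    if not os.path.exists(output_path):
        return True
    built = os.path.getmtime(output_path)
    return any(os.path.getmtime(src) > built for src in source_paths)


def get_page(output_path, source_paths, build):
    """Return the page text, regenerating it only when it is stale.

    build is a function without arguments that returns the page text.
    """
    if is_stale(output_path, source_paths):
        with open(output_path, "w", encoding="utf-8") as f:
            f.write(build())
    with open(output_path, encoding="utf-8") as f:
        return f.read()
```

The same is_stale check could drive a batch program, which would simply regenerate every stale page in one go.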
Multilingual Web Content Management System
Expand.
Ampersand page
The language link can be generalized as the ampersand page. In addition to the language variants, the server could also return, in a well-presented form:
- Other variants
- Metadata and in particular copyrights
- The Accept-Language sent to the server
Example:
Ampersand page for http://mysite/doc
This page available in: English, Spanish, French
Copyleft for http://mysite/doc: Somebody
Copyright for the site http://mysite: see http://mysite/copyright
Your preferred languages: English, Spanish
The language of the ampersand page will be as per the TCN. If the server does not know any of the languages requested, there is an intermediate step with a menu, so the user chooses one of the languages available for the ampersand page.
The ampersand link (the href attribute) should be the same for all pages. For example, http://mysite/amp. One can use the field Referer to generate the page. Having different ampersand links for each page would make it much harder.
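A server-side handler for the single ampersand link could be sketched as follows. The data layout, the language names table and the function name are assumptions; the sketch produces plain text lines rather than a well-presented HTML page, and it has to cope with requests that carry no Referer.

```python
# Assumed display names for the language variants
LANGUAGE_NAMES = {"en": "English", "es": "Spanish", "fr": "French"}


def ampersand_page(headers, variants_by_uri):
    """Build the ampersand page text for the page named in Referer.

    headers: request header fields as a dictionary
    variants_by_uri: {URI: list of language codes available for it}
    """
    page = headers.get("Referer")
    if page is None:
        # Without Referer the server cannot tell which page was meant
        return "No referring page: cannot list the variants."
    languages = variants_by_uri.get(page, [])
    names = ", ".join(LANGUAGE_NAMES.get(code, code) for code in languages)
    lines = [
        "Ampersand page for " + page,
        "This page available in: " + names,
        "Your Accept-Language: " + headers.get("Accept-Language", "(not sent)"),
    ]
    return "\n".join(lines)
```

Because the page is derived from Referer, every page of the site can carry the same link, such as http://mysite/amp.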
Language neutral URIs
e.g., numbers
Expand.
User translation request
Mechanism that allows users to request translations.
Expand.
Next steps
- Write a report from the BOF
- Start a working group
- It could be within an existing organization; e.g., W3C
Author
M.T. Carrasco Benitez
Disclaimer: I speak only for myself.
