University of Glamorgan

Cardiff • Pontypridd • Caerdydd

datblogu

There is an interesting snippet on the Google Reader Blog entitled “Is Your Web Truly World-Wide?”. The title is rather misleading, however: it actually refers to a new feature in Google Reader whereby you can have feeds translated into your language. Or at least you can if your language is one of the chosen few and you are happy to put up with the vagaries of machine translation.

I do get the feeling that opening up to multiple languages is rather the flavour of the moment, which is no bad thing and long overdue in my opinion. My concern is that this will only extend as far as the usual suspects, reinforcing their position and weakening the lesser used languages – again.

Posted by djcunlif

7 Responses to “Is Google Reader Truly World-Wide?”

  1. Mark Says:
    What are you proposing? That everything written for the web, ever, is translated into every conceivable language before it's posted?
  2. Daniel Cunliffe Says:

    No, I am pretty certain that isn't what I am proposing.

    In as much as I am proposing anything, it is probably something much more modest, like companies opening up their systems to allow lesser used language communities to contribute language resources that would enable translation tools such as these to operate in those languages. These resources could be crowd-sourced or provided by language organisations or similar.

    Does that sound so infeasible?

  3. Mark Says:

    Well you're right, you weren't really proposing anything, but it's difficult to see your criticism as constructive without at least telling them what you want.

    Google takes a much more open approach to this than you might be aware. First some background: their translation service is, unusually, based on statistical analysis (more information here: http://en.wikipedia.org/wiki/Statistical_machine_translation) rather than being rule-based. This usually results in a better translation but, according to Franz-Josef Och, the guru behind Google's method, it requires enormous amounts of bilingual text (over a million words) and over a billion words of monolingual text for each language so that it can infer a more complete translation.

    This is where the first problem in including Welsh arises (you don't mention Welsh but I'm assuming that's the language you're interested in). To get these vast bodies of text, Google uses UN documents, each of which must be published in the 6 official languages of the UN (of which, unsurprisingly, Welsh is not one), which gives them a useful 20 billion or so words to infer translations from.

    As you can imagine, this produces pretty good results most of the time, and if you compare the service with SYSTRAN's Babel Fish, you'll notice that the results are markedly more natural and context-intelligent.

    However, it's not perfect: for example, rare and technical words may not be included even in those 20 billion, so the system lacks a translation for some things. This brings me neatly onto the subject of opening up systems to allow user involvement.

    Google acknowledges their system, although very good, is not perfect. In fact, like you suggest, they allow users to suggest improved translations. This is useful and allows knowledgeable people to contribute to and improve the system. Even so, there are serious downsides to using contributed translations.

    Examples of abuse of this include, probably to the chagrin of Slovenian users, the translation into English of 'Slovenian coast' being given as 'Croatian coast' and the name of their capital Ljubljana, being translated into 'rape'.

    Slovenian isn't one of the UN languages. I'm speculating, but perhaps the fault here lies with a vocabulary that is too small and too readily accepts user input. Welsh would suffer from similar problems.

    So the question for Google is at what point is a translation sufficiently accurate to be allowed into their apps. Unfortunately, I suspect that due to a lack of available texts to infer from, the translation available for Welsh is simply not reliable enough.

    If you're serious about contributing to their translation, then if you can arrange for a sufficiently large bilingual Welsh-English text to be sent to Google, I'm sure they will happily integrate the language into their service. But don't imagine this is prejudice: it is purely and simply down to the fact that it is extremely difficult to source expansive and comprehensive texts for minority languages.

  4. Daniel Cunliffe Says:

    Fair point - you're not the first person to point out that I tend towards grumpy criticism rather than constructive critique!

    Thanks for the very detailed and considered response, which raises a number of interesting issues. Certainly the lack of a sufficiently large bilingual corpus is an issue for many language technologies. I'm not sure to what extent such a corpus actually exists for Welsh - perhaps someone reading this will know? This does however suggest one way in which money might be targeted effectively to promote a minority language - I am sure that the folks at Canolfan Bedwyr could think of lots of interesting things to do with such a corpus. I suspect that Welsh is relatively well placed compared with a lot of minority languages in that it has a strong literary tradition and reasonable literacy levels.

    Regardless of Google's (probably entirely good) intentions, the end result is that some languages become more mainstreamed and others become more marginalised or excluded. This then begs the question of what I expect Google to do about it, which is perhaps what you were getting at with your first comment. To be honest, I'm not really sure; in fact, I'm not sure if it is even Google who should be doing something. If Google did add Welsh it would just be slightly redrawing this particular digital divide; other languages would still be excluded. Is there something more noble we can hope to achieve rather than just the survival of our favoured minority language - I really hope so, even if I have no idea what it might be - sigh!

  5. Rhodri ap Dyfrig Says:
    Surely there is enough official bilingual corpus residing within Assembly documents by now though? As far as machine translators go, we are far from having a system that is acceptable to Welsh speakers, so I don't think that we would use something that is too imperfect. There is major mistrust of machine translation, as recent bilingual signage problems have shown. However, there is absolutely no excuse for Gmail etc. to still not be available in Welsh when the text strings for the interface were translated many moons ago. As Facebook has proven, there is demand for localised interfaces for popular services in Welsh, and Google should recognise this.
  6. Francis Tyers Says:
    You can download a pre-processed Welsh parallel corpus from the Universitat d'Alacant here: http://xixona.dlsi.ua.es/corpora/UAGT-PNAW/ Given a good knowledge of programming and the Moses SMT toolkit, you can have a system up and running in a couple of days. It has approximately 500,000 aligned sentences, although this isn't sufficient to give a reasonable coverage. Also, the domain (assembly speeches) is quite different from what you would find on the web. I'm personally interested in machine translation in Welsh and have been working on the Apertium Welsh→English system (see http://www.cymraeg.org.uk). Unfortunately, many of the resources that have been created with public money, for example terminology lists, are not available to be included in open-source machine translation systems. This means that we have to "re-invent the wheel" a lot. It also results in the rather surreal situation where there are, for a lesser-resourced language, two of everything: one proprietary, closed-source and funded with public money, and another open-source, free software, which receives no support.
  7. Te Taka Keegan Says:
    Hi Everyone, Just curious: if people were interested in seeing what kind of job MT could do with Welsh texts, approximately how much translated text do you think could be made available? Are there many more possibilities aside from what has been listed above?

