Internationalization (i18n): A Simple Definition

“Think globally, act locally.”

These famous words are attributed to Akio Morita, co-founder of Sony (1921-99). Local cultures are about much more than language. Much like how different groups in the United States have their own in-jokes, dialects, idioms and customs, different countries have their own ideas about how things should be done and how they should be presented. That’s why software, websites and apps need to be developed with internationalization (i18n) in mind.

What Is Internationalization (i18n)?

Internationalization, often written as i18n, is the process of preparing a product so that it can be taken to other countries and markets. It doesn’t just mean being able to change languages; it means being able to accept different forms of data, apply different settings to match local customs, and process different strings of data correctly.

The W3C defines it as follows: “Internationalization is the design and development of a product, application or document content that enables easy localization for target audiences that vary in culture, region, or language.”

According to the W3C, internationalization (i18n) normally includes:

  1. “Designing and developing in a way that removes barriers to localization or international deployment. This includes such things as enabling the use of Unicode, or ensuring the proper handling of legacy character encodings where appropriate, taking care over the concatenation of strings, avoiding dependance in code of user-interface string values, etc.”

  2. “Providing support for features that may not be used until localization occurs. For example, adding markup in your DTD to support bidirectional text, or for identifying language. Or adding to CSS support for vertical text or other non-Latin typographic features.”

  3. “Enabling code to support local, regional, language, or culturally related preferences. Typically this involves incorporating predefined localization data and features derived from existing libraries or user preferences. Examples include date and time formats, local calendars, number formats and numeral systems, sorting and presentation of lists, handling of personal names and forms of address, etc.”

  4. “Separating localizable elements from source code or content, such that localized alternatives can be loaded or selected based on the user’s international preferences as needed.”

What Is Localization (l10n)?

Localization is simply the act of changing a piece of software to suit a different locale. In many ways, internationalization can be thought of as building the structure of a piece of software so that it can be adjusted for different markets, and localization is the process of actually doing so for a specific market.
The W3C describes localization as follows:

“Localization refers to the adaptation of a product, application or document content to meet the language, cultural and other requirements of a specific target market (a locale).

Localization is sometimes written as l10n, where 10 is the number of letters between l and n.

Often thought of only as a synonym for translation of the user interface and documentation, localization is often a substantially more complex issue. It can entail customization related to:

  1. Numeric, date and time formats
  2. Use of currency
  3. Keyboard usage
  4. Collation and sorting
  5. Symbols, icons and colors
  6. Text and graphics containing references to objects, actions or ideas which, in a given culture, may be subject to misinterpretation or viewed as insensitive.
  7. Varying legal requirements
  8. and many more things.

Localization may even necessitate a comprehensive rethinking of logic, visual design, or presentation if the way of doing business (eg., accounting) or the accepted paradigm for learning (eg., focus on individual vs. group) in a given locale differs substantially from the originating culture.”

Why Is Internationalization (i18n) Important?

In a number of Asian countries, the family name comes first and the given name second. Your software needs to understand the difference and present fields in the right order so that users enter their data correctly. All of this is put in place by the process of internationalization.
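
A minimal sketch of the idea in PHP: store given and family names in separate fields and let the locale decide the display order. The locale list here is a deliberate simplification for illustration; a real system would consult CLDR data or the user’s own preference.

```php
<?php
// Store name parts separately and decide presentation order per locale.
// The locale list is a simplification for illustration only.
function format_name( string $given, string $family, string $locale ): string {
    $family_first = in_array( substr( $locale, 0, 2 ), [ 'ja', 'zh', 'ko', 'hu' ], true );
    return $family_first ? "$family $given" : "$given $family";
}

echo format_name( 'Akio', 'Morita', 'ja_JP' ) . PHP_EOL;  // Morita Akio
echo format_name( 'Akio', 'Morita', 'en_US' ) . PHP_EOL;  // Akio Morita
```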

Similarly, not everyone uses a ZIP code. Most countries have postcodes, and even then, the format can differ substantially. In Canada, for example, a postcode takes the form X0X 0X0, where X is a letter and 0 is a number. In the United Kingdom, however, a postcode can take the form X00 0XX, XX00 0XX, XX0 0XX or X0 0XX. In Brazil, postcodes take the form 00000-000. Appropriate internationalization creates software that can handle multiple inputs. Even better is when the software can automatically check those inputs to ensure that the right format is used for the right country.
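
As a rough illustration in PHP, per-country validation can start as a simple map of patterns. The patterns below are deliberately simplified (the UK’s real rules in particular are more involved) and would need refinement before production use.

```php
<?php
// Simplified postcode patterns per country (illustration only).
$patterns = [
    'CA' => '/^[A-Z]\d[A-Z] \d[A-Z]\d$/',           // e.g. K1A 0B1
    'GB' => '/^[A-Z]{1,2}\d[A-Z\d]? \d[A-Z]{2}$/',  // e.g. SW1A 1AA
    'BR' => '/^\d{5}-\d{3}$/',                      // e.g. 01310-000
    'US' => '/^\d{5}(-\d{4})?$/',                   // e.g. 90210 or 90210-1234
];

function is_valid_postcode( string $country, string $code, array $patterns ): bool {
    return isset( $patterns[ $country ] )
        && preg_match( $patterns[ $country ], strtoupper( trim( $code ) ) ) === 1;
}

var_dump( is_valid_postcode( 'GB', 'sw1a 1aa', $patterns ) );  // bool(true)
var_dump( is_valid_postcode( 'BR', '01310-000', $patterns ) ); // bool(true)
```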

All of these are important aspects of developing software that consumers can relate to and use appropriately. A business that cannot accept orders through its software because that software cannot render postcodes properly is not going to last in the international market for very long. Even software that gets given names and family names mixed up is going to create distance between the software and its user base. This means that code must be internationalized throughout the development process and not as an afterthought.

As a brief example, Baidu is the number one search engine in China. It reached this position because it resonates effectively with people within China, being targeted specifically for them. While this is an example of localization rather than internationalization, it’s able to do better than Google because of this targeting and understanding of local cultures, restrictions and, possibly most importantly, government requirements, such as access to user information and, reportedly, censorship. Because Baidu isn’t particularly well internationalized, however, it hasn’t been able to break into any markets outside China. Although this is unlikely to be a concern in a country with more than 1 billion potential users, it does limit potential future growth.

Google, on the other hand, has been able to break into most markets thanks to its internationalized software. Because it’s easily adaptable to a wide variety of locales, it can present relevant information that meets the searcher’s requirements, whether that person is in South Africa, the United States or Russia. In a similar vein, its Android operating system, Google Chrome browser and numerous other products are all effectively internationalized, so they can be easily converted to meet the user’s cultural and personal requirements.

How Does Internationalization (i18n) Affect Developers?

Drilling down into code a little more, it’s clear that there are several good practices that go into reliable and trustworthy internationalization. As an example, around a third of all WordPress downloads are for localized non-English versions. This means that those developing various plug-ins need to take into account localization when building them and ensure their versions are fully internationalized.

This means, for example, that they shouldn’t place PHP variables inside a translation function’s strings. Translation tooling typically scans the source, pulls out the strings designated for translation by __(), and presents them to translators. If a PHP variable sits inside such a string, the tooling pulls it out for translation because it doesn’t know any better; an accidental deletion by a translator can then render the entire line of code worthless, and this kind of error can be difficult to track down.
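
A short sketch of the difference, assuming WordPress’s gettext wrappers are loaded and using a hypothetical 'my-plugin' text domain:

```php
<?php
$count = 3; // example value

// Bad: the variable is baked into the string, so translators never see the
// complete sentence and extraction tools may mangle it.
echo __( "You have $count new messages", 'my-plugin' );

// Better: translate a complete phrase with a placeholder, then substitute.
/* translators: %d is the number of new messages. */
echo sprintf( __( 'You have %d new messages', 'my-plugin' ), $count );
```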

In addition, it’s essential to translate phrases rather than individual words. A simple example is the position of adjectives in French versus English. In English, you would say, “He is an orange man.” In French, you would say, “Il est un homme d’orange” (literally, “he is a man of orange”). If you translate word by word, the English structure won’t sound right to a native French speaker. Other languages have different pluralization rules for different numbers, Polish being a notable example. This seriously complicates translation because the system must be able to return different words for different numbers. Even in English, you use a different form for a single object than for a pair of objects (one potato, two potatoes). Worse still, some words are simply not translatable, so the translator has to create an approximation that gets the meaning across accurately.
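
In WordPress-style PHP, plural handling is delegated to the translation file via _n(), so each locale can supply as many plural forms as its grammar needs (the 'my-plugin' text domain is again hypothetical):

```php
<?php
$count = 2; // example value

// _n() picks the right plural form for the current locale; languages such
// as Polish define more than two forms in their translation files.
echo sprintf(
    _n( '%d potato', '%d potatoes', $count, 'my-plugin' ),
    $count
);
```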

Disambiguation is also essential, particularly for homonyms. A homonym is a word with multiple meanings, which describes a great many English words. Context tells us a lot, but when translators see only a list of extracted strings, that context may not be obvious. Take the word “comment.” Is it a comment on the site, or are you asking the user to make a comment? In other languages, those two senses may be entirely different words, and getting them wrong can affect the success of your software. In WordPress PHP code, use _x() to attach a context note that tells the translator what the word means.
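
A brief sketch with _x(), where the second argument is the context shown to the translator (again assuming WordPress’s wrappers and a hypothetical 'my-plugin' text domain):

```php
<?php
// The same source word, disambiguated for translators by a context string.
$noun = _x( 'Comment', 'noun: a comment left on a post', 'my-plugin' );
$verb = _x( 'Comment', 'verb: button label asking the user to comment', 'my-plugin' );

echo $noun . PHP_EOL;
echo $verb . PHP_EOL;
```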

Other languages, such as Java, are relatively easy to internationalize, but even then, Java internationalization is a tricky subject for those not used to the process. That’s why it’s best to start the process correctly from the beginning and use tools designed specifically for internationalization.

All of this also makes it easier for translators to understand and modify your text without affecting how the website, software or app functions.

Internationalization (i18n) Gone Wrong

Bad internationalization (i18n) typically means bad localization. A classic example is when only prices are localized on an e-commerce site while the product descriptions, weights and measures remain in the original language. Because a large number of websites are based in the United States, this often means European, Asian and African customers are given quantities and product descriptions in a language that is not their own. For many, pounds, feet, inches and ounces are not easily convertible, so customers turn away from the website because they don’t understand what they’re being offered. For clothing retailers, the same number can mean vastly different sizes in different countries: a size 10 in the United Kingdom is a 38 in continental Europe, but a size 38 in the United States is something else entirely; the European size 38 is actually a US size 6. Good coding allows automatic conversion of data so that it appears in the target language and cultural context — at least for the website’s prime markets — and good coding has to start at the beginning of the development process.

In addition, different parts of the world use different date formats. In the United States, January 2 would be written 1/2; in the UK, that would mean February 1. This can make a big difference to delivery dates and could be a deciding factor in whether your customer wants to purchase from you.
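
A minimal sketch using PHP’s intl extension (IntlDateFormatter), rendering the same date per locale; the exact output depends on the ICU data installed on the system:

```php
<?php
// The same calendar date formatted for different locales.
$date = new DateTime( '2024-01-02' );

foreach ( [ 'en_US', 'en_GB', 'de_DE' ] as $locale ) {
    $formatter = new IntlDateFormatter(
        $locale,
        IntlDateFormatter::SHORT,
        IntlDateFormatter::NONE
    );
    echo $locale . ': ' . $formatter->format( $date ) . PHP_EOL;
}
// Typically something like: en_US: 1/2/24, en_GB: 02/01/2024, de_DE: 02.01.24
```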

Partial translations are often the result of bad internationalization as well. In some cases, menus may be untranslated, or essential contact information may be available only in English. Similarly, the website may not be able to translate certain portions at all — which is particularly common when untranslatable JPEGs or PNGs are used instead of text. This is quite common with adverts.

Different layouts are also required for different cultures. Typically, a minimalist layout is fine for many countries, but in some, such as Japan, a much denser layout is more common. Good internationalization means that you can present different products, layouts and even colors for different audiences, whereas bad internationalization means you have to use exactly the same layout for everyone.

Bad translation is often an example of bad localization rather than internationalization, but it’s still important to recognize that the two are closely interlinked.

What Good Internationalization (i18n) Means

Ultimately, good internationalization ensures your software, app or website works across a variety of cultures and target markets. It means that every piece of text should be translatable and that no code should rely on text being input in a specific language or alphabet. The software should be able to render prices in an appropriate format, present dates in a way that makes sense for the reader, and substitute variable content correctly.
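
As a small illustration of locale-appropriate prices, PHP’s intl NumberFormatter can render the same amount for different locales; the precise output depends on the ICU version in use:

```php
<?php
// The same amount rendered as a price for three locales.
$amount = 1234.5;

foreach ( [ 'en_US' => 'USD', 'de_DE' => 'EUR', 'ja_JP' => 'JPY' ] as $locale => $currency ) {
    $formatter = new NumberFormatter( $locale, NumberFormatter::CURRENCY );
    echo $formatter->formatCurrency( $amount, $currency ) . PHP_EOL;
}
// Typically something like: $1,234.50, 1.234,50 €, ￥1,235
```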

Most importantly, good internationalization that’s accomplished with the right software means you can hand your program off to your translators, safe in the knowledge that they can simply get translating and return it to you without any code changes required. This makes development much easier, bug fixing simpler and updates even easier — if you only have to update the code and not the translation, it saves a significant amount of money in the long term.

This article was originally published on TheConversation.com.

Creating a universal language

According to linguist and political thinker Noam Chomsky, “A language is not just words. It’s a culture, a tradition, a unification of a community, a whole history that creates what a community is. It’s all embodied in a language.”

So can we have a universal, global language unifying and embodying all of us? Given the diversity of human life, is that even possible? Proto-Indo-European may have come closest, and then Hellenistic Koine, then Latin. What about CHTML or Python? After all, computers talk to each other in 1s and 0s regardless of the language used to program them, and they span the globe. It seems that wherever languages are used, the desire for some form of universal language is identified as a means of circumventing the one-to-one translation process. The idea of a bridge (a koine language) connecting a number of languages, understandable to a large population, does indeed have a strong appeal, especially when a goal of globalism is real-time, multilingual communication. Does a universal language make sense in today’s network-connected world?

Language has many functions. We do not have a universal means of communicating with each other quite simply because we do not have a universal topic to discuss — we have millions. This is excellent news for translators and localizers. Perhaps not such good news for those hoping that computer-assisted translation will be a magic bullet for cross-cultural communication. Yet the idea persists and with a growing appreciation of what characterizes a global community, it is still an idea under investigation.

As Latin fell into decline in the post-Renaissance world of European letters, many thinkers sought to replace its ability to express all manner of subjects in a widely understood form. Mathematicians René Descartes and Gottfried Leibniz attempted to formulate a means of constructing a language capable of expressing conceptual thoughts. In England, John Wilkins, among others, sought to facilitate trade and communication between international scholars using a system of “real characters,” symbols that constituted a lingua franca.

In 2001, Professor Abram de Swaan of the University of Amsterdam described how power and languages are connected in the global community in his book Words of the World: The Global Language System. His accomplishments as a social scientist enabled him to detail how a multilingual world can also be described in hierarchical terms that expose the uneven field upon which languages compete for dominance. In his model, English occupies the “hypercentral” position, whereas other languages exist more diffusely from central to peripheral positions. In the translation community, we work in this arena on a daily basis. There have been critics of de Swaan’s ideas in the academic community, but the work has been highly influential in furthering our understanding of how communication can facilitate human affairs globally.

Theoretical approaches to specifying how a universal language works are essential to understanding how the global, multilingual community might operate using a single or dominant language. But how might this work in practice? As mentioned above, thinkers in the 17th century were interested in using signs and symbols to communicate. This is still an idea being explored, with the unlikely sounding Lovers Communication System (LoCoS) devised by Yukio Ota, Professor of Design at Tama Art University in Japan. Ota is world-famous for his design of the green running man used to mark exit doors in millions of buildings. Given the proliferating use of emojis and their incorporation in Unicode, this continued interest is hardly surprising. But their use thus far has been largely confined to text messages and websites. They do, however, represent text to varying degrees, and it is premature to say just what their future is. That said, if a picture is truly worth a thousand words, then they surely must have a bright future. When actor Kyle MacLachlan was asked to explain the plot of the film Dune on Twitter, he managed to describe the entire movie with 41 emoji characters (see Figure 1).

Figure 1: The movie Dune described in emojis.

The emoji “language” is already universally recognizable, and this is due to the Unicode Consortium, which has embraced the new language and is diligently defining and approving new emoji characters. Every new version of Unicode includes recommendations for implementation, but companies are free to represent each emoji however they wish, which grows the range of expressions. With growth comes diversification, confusion and misunderstanding. With representations now covering multiple skin tones and occupations having female variants, Unicode is doing a spectacular job of providing creative solutions. For example, a gendered emoji for an occupation is a combination of the standard “man” or “woman” emoji with a second emoji representing the occupation, joined together by a special invisible character called the zero-width joiner (ZWJ). Platforms that support the new emoji recognize the ZWJ and display a single emoji, while others display two separate ones. The ability to create new emojis brings its own problems, mainly fragmentation and the inability to include them in the official Unicode version. For example, Twitter has a pirate flag, Windows has added ninja cats and WhatsApp has an Olympic rings emoji, which on other platforms is shown as five plain circles. The potential for confusion and misrepresentation across platforms can only be avoided by sticking to the official Unicode version. As the emoji language grows and increases its expressiveness, its universal nature is what appeals to people.
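
A tiny PHP illustration of a ZWJ sequence; the code points shown form the standard “woman technologist” sequence, and how it renders depends entirely on the platform’s emoji support:

```php
<?php
// "Woman" + zero-width joiner + "laptop" forms the "woman technologist"
// ZWJ sequence on platforms that support it; elsewhere it falls back to
// two separate emoji.
$woman  = "\u{1F469}"; // WOMAN
$zwj    = "\u{200D}";  // ZERO WIDTH JOINER
$laptop = "\u{1F4BB}"; // PERSONAL COMPUTER

echo $woman . $zwj . $laptop . PHP_EOL;
```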

With cultural and commercial imperatives driving the world’s need for instant and universal communication, it’s hardly surprising that many place great faith in technology to provide universal, workable solutions to common problems. However, the ATA skillfully and properly put the White House in its place when a call was made in 2009 for bigger and better automated translation. In a deftly-worded response to President Obama, ATA President Jiri Stejskal asserted that “translation software and qualified human translators are vital to your goal of achieving language security. Today all the leading proponents of computer translation recognize that human beings will always be essential, no matter how sophisticated translation programs become.” I doubt any language professional would disagree, and this tends to suggest that there is assuredly no place for a universal language in our community. But the pace of change at the cutting edge of tech is still blistering. Welcome to the brave new worlds of the Internet of Things and machine learning.

Picking up on Chomsky’s idea that all aspects of a community are embodied in its language, can we say the same for a community’s technology? The ancient Greek word τῆλε (tele, meaning afar), which we find in telephone, television, telecommunication and so on, bridges enormous distances. These devices shrink our world, but they enlarge our communities. The Internet of Things promises to connect us to an even greater degree, if we are to believe the hype. We are promised that just as our phones pack enormous computing power into a hand-held device, billions of gadgets will be similarly empowered. A recent report identified “implementation problems” as a barrier to progress in achieving this ultra-ambitious internet-connected network of devices. Implementation in what respect? Business and tech analysts will cite security and privacy as massive headaches, or the difficulty of achieving robust and reliable connectivity in a massively heterogeneous networked world. But what if your device speaks one language and you speak another? Will we need to localize smart fridges? The answer has to be yes. If Siri, Apple’s virtual assistant, is available in a growing variety of languages; if PayPal’s services are now available in over 200 languages; if Amazon has operations in at least 15 international locations; then to paraphrase H. G. Wells, I’ve seen the future and it’s multilingual.

So what exactly is powering this hugely diverse yet intimately connected world of tech? The computing community, like the language community, is made up of many smaller communities, of which artificial intelligence (AI) is one. In turn, AI is itself made up of many varied communities. AI used to be regarded by more mainstream computing communities as exotic. That, however, has changed dramatically, and AI is now truly mainstream. AI has many areas of application, but one of particular interest to the language community is natural language processing. In particular, machine learning is being applied to endow computers with the capability of “understanding” texts and, taking a further leap, of “translating” them.

With a field that draws input from computer science, cognitive psychology, neurolinguistics, data science and numerous theories of education, it is no wonder that many different approaches are taken to automating language acquisition. It would be counter-productive to even attempt to generalize efforts in the field. However, two approaches to training a computer to translate a language are worth a very brief examination: rule-based systems and statistical systems. We should note that neither of these is a cognitive approach; both involve processing rather than understanding.

Rule-based systems rely upon a set of syntactic and orthographic conventions used to analyze the content of a source text; that analysis then provides the input from which the target-language text is generated. But the problems with this approach are obvious. Word order, for example, is anything but universally the same. Indeed, the notion of core grammar just doesn’t relate to the real world of diverse language families, not to mention accommodating isolates like Basque. Clearly the problems can be overcome such that there is a way of connecting languages in pairs, but for the present we rely upon the hard work of the poor old human linguist.

The other approach involves statistical processing based on bilingual text corpora. Google Translate is the classic example of this approach, which harnesses raw processing power to detect patterns of equivalence in language pairs. Almost all of the texts mined in this way are the product of human translators in the first place, and this is what gives proponents of the approach confidence that the output in the target language will be of satisfactory quality. Another benefit is that the approach adapts readily to new language pairs, which gives some researchers hope that a monolinguistic text corpus, the engine of a universal translator, is a future possibility.

If computers are not already everywhere, they soon will be. And what of our silicon friends who speak at light-speed in 1s and 0s? Will they ever achieve consciousness as some researchers believe? AI researcher Alan Stewart, who is working on neural networks, says “I am optimistic about the future capabilities of computers and by that I mean that raw power and sophisticated logic will create amazing technology, but unless there are some startling breakthroughs, it will still fall short of nature’s biological capabilities.” However, he speculates that with the learning capabilities that computers are being given, it’s possible that they will begin to look for more efficient ways to achieve the tasks we ask them to do. That’s one of the products of learning algorithms. At the recent DEFCON in Las Vegas, a Cyber Grand Challenge was staged that pitted two computer systems against each other with the aim of discovering weaknesses in the opposing systems. The results fell short of present human standards, but this is just the beginning.

With computers able to learn large amounts of material at high speed, a new communications paradigm is a strong possibility. For example, is it possible that computers will actually create their own language? I know it sounds ridiculously far-fetched, but there was a time not so long ago that we scoffed at even quick-and-dirty machine translation. That universal language may still be out there in the future, but will we be able to understand it?

This article was originally published in Multilingual magazine, December 2016 edition.