SlatorCon 2018—Your Definitive Guide And Key Takeaways

The SlatorCon tour took in four global cities across three continents in 2018 and featured a total of 35 speakers representing the full breadth of the language industry, from startups, language technology, and language service providers to the buy side, finance, and academia.

We also added a bonus event with SlatorMeet, a new format where language industry leaders gather in a more intimate setting. The inaugural SlatorMeet was held in Hong Kong in October 2018 and more are planned for 2019. Here, we revisit the key takeaways from our events calendar in 2018:

Startups and Technology

Oya Koc, Oyraa, Founder and CEO (Tokyo):

Founder and CEO of Tokyo-based remote interpretation app Oyraa, Oya Koc described the language industry startup scene in Japan. At the time, she said that only 1% of all Japanese startups focus on language services and technology, compared with more active fields such as fintech (9%) and IoT/AR/VR/robotics (21%), which command a much bigger share of the startup pie. Nonetheless, against a backdrop of a buoyant startup environment and a fast-growing industry, the future is bright for language industry startups in Japan, she shared.

Startups face two main challenges in Japan, however. Firstly, Japanese culture tends to be risk-averse, meaning the high-risk startup scene is in danger of being shunned in favor of stable, safe, big companies. Secondly, there is less hype surrounding language services than fields such as fintech, AI, or AR/VR. Related to this, Koc said, is insufficient analysis of industry needs and too much dependency on MT, which leads to less traction for language sector startups.

Koc also observed that translation startups like Gengo and Conyac are focused on crowdsourced translation despite the worldwide focus on MT. On the interpretation side, despite the value of human interpretation, startups are focusing more on automated, instant interpretation gadgets that she says don’t work.

Andrey Schukin, Interprefy, CTO (Tokyo)

The CTO of language startup Interprefy joined the panel session and shared his views on tech-driven market disruption. Interprefy also provided remote interpretation via its app for SlatorCon Tokyo, which worked flawlessly and enabled a fluent discussion in both Japanese and English.

Jean Senellart, Systran, Global CTO (London)

The Global CTO of machine translation provider Systran took participants on a whistle-stop tour of machine translation history, from rule-based MT (RBMT) in the late 1960s through to present-day neural MT (NMT). He spoke about the rapid rise and merits of OpenNMT in fostering collaboration even among competitors.

Launched in early 2017, OpenNMT, to which Systran, Ubiqus and Harvard University all contribute, is a jointly developed NMT toolkit. At the time of the presentation, the project had gathered strong momentum with 18 major releases, 3,300 stars and 1,020 forks on GitHub, and six complete code refactorings, Senellart said. It is now the second largest open-source NMT project.

Norbert Oroszi, memoQ, CEO (San Francisco)

The CEO of memoQ, a translation software provider, does not see automation as killing off jobs, but credits it with doing something different, something that puts translators at the center of tools and will create time for linguists, project managers and company owners.

Matt Conger, Cadence Translate, CEO (San Francisco)

The Cadence Translate CEO joined the startup panel at SlatorCon San Francisco. Cadence Translate provides interpreters for the highly specialized and confidential work of facilitating investor calls for clients active in the investment space. It’s a unique challenge in a premium niche market, Conger said.

When conducting research on the size of the opportunity, “every day we found that there are around five thousand investor due diligence calls that happen around the world, and they’re paying anywhere from a thousand to USD 1,500 per phone call,” he added. And so Cadence Translate became a multilingual research partner for firms that perform overseas investments or M&A due diligence.
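Conger’s figures imply a sizeable niche. As a back-of-the-envelope sketch, the midpoint fee and the 365-day annualization below are illustrative assumptions of ours, not figures Conger cited:

```python
# Back-of-the-envelope sizing of the investor due diligence call market,
# using the per-day call count and per-call fee range from Conger's quote.
calls_per_day = 5_000
fee_low, fee_high = 1_000, 1_500          # USD per call, per Conger
midpoint_fee = (fee_low + fee_high) / 2   # assumed midpoint: USD 1,250

daily_spend = calls_per_day * midpoint_fee
annual_spend = daily_spend * 365          # assumes calls occur every day

print(f"Daily spend:  USD {daily_spend:,.0f}")   # USD 6,250,000
print(f"Annual spend: USD {annual_spend:,.0f}")  # USD 2,281,250,000
```

On those assumptions, the niche works out to roughly USD 2.3bn a year, which helps explain why Cadence treats it as a premium market.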

However, Conger said they were doing more than just interpreting—they were facilitating. “Interpreters are taking this Hippocratic oath not to insert themselves into a dialogue,” he said. “But in reality, investors…want someone who’s going to get on these calls and get to the bottom of the truth, almost like a journalist or an investigator.” In essence, “we’re focused on training both clients and more importantly interpreters to help bridge that gap,” he concluded.

Jeffrey Sandford, Wovn Technologies, Co-Founder (San Francisco)

The Co-Founder of Wovn Technologies also joined the startup panel and explained how his company is focused on resolving two main challenges. On the one hand, Wovn allows developers “to go completely outside of the localization process and not worry about how their code will affect [it],” and, on the other hand, the company enables content managers “to not worry about the details and hiccups along the way and just focus on the content.”

As a website localization service, Wovn’s solution aims to separate the tasks of development and content management, Sandford said. He was enthusiastic about inter-market opportunities between Japan and China, and said that Southeast Asia, India, and Africa are all emerging digital economies and potential markets.

Bryan Forrester, Boostlingo, Co-Founder (San Francisco)

Boostlingo provides what Forrester called an IMS: an interpretation management service. Forrester commented that despite headwinds such as compliance and immigration, “we actually think it’s a huge growth market, a greenfield market and there’s a lot of businesses out there that actually need interpreting but they don’t even know these services exist.”

Forrester said that interpreting is “about a quarter of [the entire language industry’s] addressable market, so about a USD 13bn market worldwide.” During the September presentation, he added that they estimate remote interpreting and video services to be growing at about a 30% compound annual growth rate (CAGR).
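A 30% CAGR compounds quickly; at that rate, a market roughly quadruples in five years. A minimal sketch (the USD 100m base figure is purely illustrative, not from Forrester):

```python
# How a 30% compound annual growth rate plays out over five years.
cagr = 0.30
base = 100.0  # USD millions, hypothetical starting market size

for year in range(1, 6):
    value = base * (1 + cagr) ** year
    print(f"Year {year}: USD {value:.1f}m")
# Year 5 comes to roughly USD 371m, about 3.7x the base.
```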


Boostlingo had a defining pivot moment that probably saved the business, Forrester said: “Early on when we didn’t know better, [we wanted to be] Uber for interpreting.” But then the company learned about the complexities of what they wanted to achieve. “It took us about four days [to realize] that we wanted to not be a service provider and qualify and vet interpreters,” he said. “We’d probably already be out of business if we had gone down that route,” Forrester admitted. “We decided that we wanted to be the technology layer…[and tell partners]: ‘you do what you do best—vet and qualify your interpreters and provide services to clients. Let us do the technology piece’.”

Mark Shriner, Wordbee, Director North America (San Francisco)

The Director North America at Wordbee shed some light on customer expectations for technology during the tech panel: “there are a million different features and everybody wants a subset of them,” he shared. “They want an integrated platform,” Shriner said, “and they want that platform to be easily integrated into some form of CMS or document management system.”

Aside from integration, clients also look for agile development, according to Shriner, “because all organizations have their own proprietary platforms and processes and if I come to you with a big bundle of software and say take it or leave it, it’s not gonna fit all your needs.” Those making good headway with technological innovation include customers in software and ecommerce, such as Expedia, he said. “They just developed their own proprietary processes for localizing UGC on the fly.”

Francesco Bombassei, Google, Senior Technical Program Manager (Zurich)

The Senior Technical Program Manager at Google spoke about the latest advances in Google’s machine learning capabilities and its recently launched custom NMT offering, AutoML Translation. Bombassei said that with the new system it is quick and easy for users to build their own custom NMT models by simply uploading their specific bilingual data and waiting about three hours for the engine to be ready.

The time frame in which users can get access to a custom NMT model has been significantly reduced, Bombassei said, thanks to Google’s third-generation TPUs, which are around 10x more efficient than regular GPUs. The upshot, Bombassei explained, is that users can get domain-specific NMT engines even if they don’t have people with machine learning expertise in-house to write the code, “making [NMT] more accessible to a wider audience.”

Aleš Tamchyna, Memsource, Software Engineer – Artificial Intelligence (Zurich)

Tamchyna joined the panel at SlatorCon Zurich to give his take on the current state of machine translation. He said that advances in machine translation should be taken with “a big grain of salt” and that we should not be misled into thinking that machine translation is a solved problem. For some specific domains, and on a sentence level only, there have been recorded instances of machine translation output being indistinguishable from human output, Tamchyna said. However, quoting Memsource data, he said that in reality MT output is used raw (without human post-editing) in only around 5–15% of real-world cases. The real test comes, therefore, when you begin to look at translation in context. For this reason, he said, there is a strong trend toward researchers investigating document-level translation.

Another research trend he highlighted was the area of low resource languages, language pairs for which there is little language data available. Within this field, researchers are looking at how to use “big language pairs to help the small ones,” he said. And, in the next few years, Tamchyna predicted, machine translation will be able to cover a much broader range of language pairs.

Language Service Providers

Ikuo Higashi, Honyaku Center Inc., President (Tokyo)

Higashi is President of Japan’s largest language service provider, Honyaku Center Inc. With USD 96.5m in 2017 revenues, the company was forecasting sales of USD 100.4m for 2018, Higashi said at the time of his presentation. There is room for growth, given that Japan’s language industry has a total market size of JPY 290bn (USD 2.7bn), he said.

According to Higashi, the language services market in Japan will continue to grow in the future due to a couple of high-growth verticals (local infrastructure and automobiles), ongoing globalization, and an emphasis on events and conferences. Then there is also tourism, with the Tokyo 2020 Olympics on the horizon.

Honyaku is putting its money where its mouth is with a couple of strategic investments in MT companies. In 2017, Honyaku Center acquired Media Research Inc. and took a 13% stake in MT company Mirai Translate. According to Higashi’s presentation, the company, the domestic translation industry’s largest enterprise, is now growing into a “comprehensive supplier of foreign language business.”

Stuart Green, ZOO Digital, CEO (London)

The CEO of cloud-based media localizer ZOO Digital explained that media localization is one of the fastest growing industry verticals (and one that is still relatively untouched by NMT). Green also confirmed in his presentation that “demand is outstripping supply”.

ZOO’s cloud-based model is challenging the traditional ways of providing services such as subtitling, dubbing, and voiceover. Dubbing, for example, would traditionally involve bricks-and-mortar operations and require the voice artist to physically be present at a studio. ZOO’s platform does away with that, allowing talent to work from home.

ZOO has struck big over the past two years as Netflix and other over-the-top content (OTT) providers expanded so aggressively across many international markets that the existing media localization infrastructure could barely cope. The result has been exceptional revenue growth and a skyrocketing share price.

Benjamin du Fraysseix, Technicis, CEO (London)

Unchallenged as the industry’s most active acquirer in 2018, the CEO of Paris-based Technicis explained his approach to M&A and divulged during his May 2018 presentation that he had held discussions with dozens of potential targets since the company embarked on a buy-and-build strategy backed by its private equity investors.

Du Fraysseix admitted that M&A can at times be a painful and difficult journey but stressed that in the end it is exciting and that he believes it creates value. Aside from the benefits of M&A in delivering top-line growth, he also spoke of the need for companies to drive organic growth and stay focused on sales.

Richard Glasson, Hogarth, CEO (London)

Since the company’s founding in 2008, WPP-owned Hogarth has grown far beyond its original transcreation business and has expanded into a number of areas around the “production, adaptation and supply of advertising materials.” Language services, however, continue to play a major part in Hogarth’s business, and Glasson shared that in Hogarth’s operating model, language-related services are provided at a regional level. “Language for us, primarily as an advertising business, is based around transcreation, and transcreation needs real expertise, real specialization, and very high quality talent,” Glasson said.

Glasson told participants that a real tension exists between ever more complex client requirements, as global content continues to grow, and shrinking budgets. He highlighted that while a consultative approach is what’s needed, procurement often just wants to “manage vendors.”


Romina Franceschina, Hogarth, Global Head of Language Services (London)

Hogarth’s Global Head of Language Services stood in for Glasson on the SlatorCon London panel and explained why some of her clients prefer to have linguists and project managers onsite. “They are very passionate about meeting the people that will carry their brand in their language”, Franceschina said.

Adolfo Hernandez, SDL, CEO (San Francisco)

SDL CEO Hernandez told the audience how people factors played a part in SDL’s July 2018 acquisition of Donnelley Language Solutions; to implement the company’s strategy of pursuing premium markets, SDL needed “people who have been there, done it and got the t-shirt” – enter the Donnelley team.

Hernandez spoke of the complementary nature of people, customers and locations in the Donnelley acquisition; only 3% of Donnelley Language Solutions’ revenue came from common clients. He also said that he expects consolidation to continue in the language industry, though it will not be a winner takes all market.

Mark Howorth, SDI Media, CEO (San Francisco)

The CEO of media localizer SDI Media, a company operating within one of the industry’s fastest growing verticals, explained in his September presentation that there has “never been a content producer that’s come in and grown as fast as Netflix.”

Howorth described how Netflix, among others, disrupted the media industry to the extent that they “reset the bar on quality”. So much so, he said, that SDI Media had to “put ourselves in the shoes of our customer and say ‘if we were starting from scratch and gave the customer a wishlist, what would they want?’” In doing so, SDI saw a crucial need to provide customers with speed, simplicity and transparency.

Jeremy Woan, Cyracom, CEO (San Francisco)

The Chairman and CEO of Cyracom, a US-based remote interpreting provider, said that the vertical is driven by regulation, immigration and replacement of modalities. The choice of modality, e.g. whether interpreting is done onsite, over the telephone or via video, is context dependent, the Cyracom CEO said.

Overall, he sees more of a shift from onsite to telephonic and thinks video is more complex. Asked about the startups operating in the interpretation space, Woan said that building a platform is not the tough part, but managing vendors is the real hard task, especially for newcomers.

Edward Vick, EVS Translations, CEO (Hong Kong)

The CEO of EVS Translations, who since 1991 has built his company into a force in the financial, legal, and other premium verticals, explained how the rise of AI is leading his clients to become ever more security conscious, with many asking “what’s happening to my data?” To win some of the top law firms as clients, EVS has been through two-year onboarding processes to verify the right security arrangements were in place.

For Vick, an important part of the solution is employing in-house linguists, who accounted for roughly half of EVS Translations’ more than 200 employees at the time of the presentation. “If you’re only able to use specific translators” said Vick, and “if you’re not allowed to use freelancers outside of a dedicated network, you may be able to increase the price, increase the quality, but for sure you increase the security.”

While Vick believes that the legal industry is “absolutely ready” for AI, EVS still offers clients their choice of solutions, human translation without AI and AI solutions with post-editing. For clients choosing the latter, EVS can tailor the solution down to the law firm’s client, practice area, and language pair.

Richard Delanty, into23, Founder (Hong Kong)

The Founder of Hong Kong-based startup LSP into23 detailed several areas where Asia offers huge potential for providers of language services and tech. He sees demand for translation changing quickly with the rise and globalization of Chinese companies, saying “I see Chinese as important today as a source language as English and I think that’s something that’s going to continue to be important in the future.”

Delanty broke down how the expansion in Asia of Chinese smartphone makers and fintech companies in particular was helping drive the growth of regional commerce. He saw ecommerce and travel growth in the key markets of India, Pakistan, Thailand, Indonesia, Vietnam, and Malaysia creating unique challenges for LSPs given the lack of qualified linguists in fast emerging, intra-Asian language combinations.

Delanty highlighted that Asia also had low technology adoption and remained fragmented as no major language service provider has captured significant market share. His recommendation for LSPs hoping to succeed in Asia is to innovate an MT-first supply chain, collaborate with partners to fill human-talent gaps, educate customers on workflow automation, and compete to win.

Vincent Nguyen, Ubiqus, President (Zurich)

The President of Ubiqus described the challenges of deploying machine translation in a production environment, an exercise that Ubiqus has spent the past 18 months focusing on, Nguyen shared at the end of November. The company now considers NMT to be “just another CAT [productivity] tool in the workflow,” he said.

To be able to get to the point of deploying NMT operationally, Nguyen said that it’s essential to secure buy-in from the people who actually interact with the technology: the translators and project managers. According to Nguyen, this journey is made easier if management is engaged and owns the roll out. Ultimately, though, while it is not (yet) perfect, “NMT output itself is the best advocate for NMT,” Nguyen concluded.

As well as the company’s focus on NMT, Ubiqus is also no stranger to another hot topic in 2018, M&A. Nguyen said during the speaker panel that over the past 18 months M&A in the language industry has “been heating up, especially among the top 50, top 100 [ranked providers]”.

Rasmus Lokvig, LanguageWire and CataCap, Deputy Chairman and Partner (Zurich)

Lokvig has a dual role as Deputy Chairman of LanguageWire and Partner of CataCap, the private equity firm that took a majority stake in LanguageWire in 2017. Lokvig shared the rationale behind CataCap’s initial investment and explained why they backed LanguageWire, an LSP with roughly USD 30m in revenue, in buying Xplanation, a company of roughly equal size.

Buying up a company with the same ballpark revenue can be “more tricky and [involve] more risks,” Lokvig said, but it’s worth it if it’s a good fit. And in Belgium-based Xplanation, LanguageWire saw a potential acquisition target that ticked all the right boxes, Lokvig emphasized, because it was complementary to LanguageWire in terms of size, geography, customer base, technology and people.

Finance

Dominic Emery, Raymond James, Managing Director (Zurich)

The Managing Director of investment bank Raymond James walked SlatorCon attendees through the blow-by-blow of executing an acquisition or sale. Emery highlighted the fact that “preparation is absolutely critical” and can determine “not just the [success of the] deal but all the value you will create after the deal.”

Raymond James specializes in facilitating M&A for tech-focused companies, including in the language industry, and has observed that “more and more of it is happening in this industry.” Emery commented in his November presentation that “market conditions are very hot,” which is influencing valuations.

Academia

Kayoko Takeda, Rikkyo University, Professor of Translation and Interpreting Studies (Tokyo)

The Professor of Translation and Interpreting Studies at Rikkyo University and former head of the Japanese translation and interpreting program at the Middlebury Institute of International Studies at Monterey spoke about the role of academia in preparing the next generation of linguists for a competitive market.

Takeda illustrated some of the specific challenges and opportunities facing academia in Japan. Firstly, she said, previous generations of linguists lack the academic credentials necessary to become instructors. There is also a relationship gap between academia and industry due to the short history of these programs. Not only that, but academia is missing great opportunities to work with partner universities overseas, Takeda highlighted. There is also a lack of specialization, since prospective linguists go through generic courses rather than specific training. And, lastly, language provision is limited, since most courses focus only on English and Japanese.

Takeda concluded by calling on all stakeholders in the language industry to adopt a more “holistic approach” to linguist training.

Buyers, Users, Enterprise

Chizu Tanaka, Booking.com, Team Leader Translations & Content Agency (Tokyo)

The Team Leader of Booking.com’s Translations & Content Agency in Japan outlined how the fast-growing company manages millions of words of multilingual content daily. Booking Holdings, previously Priceline, is the third largest ecommerce company behind only Amazon and Alibaba. Booking.com’s scale is a huge challenge. At the time of the presentation, Booking.com had some 15,500 employees, over 1.5 million registered properties, and more than 190 offices across more than 220 countries and regions. The company was also registering 1.5 million room nights daily.

The company’s Content Agency comprises 200 in-house Language Specialists who deal with high-value or high-impact localization. A pool of around 2,500 freelance translators handles high-volume content, such as property descriptions. Localization and translation work is done for both customer-facing and partner-facing content, with the former taking up most of the work, even extending to social media depending on the language.

Booking.com extensively A/B tests its content, with hundreds of A/B tests conducted daily. Tanaka said that the website is localized into 43 languages, but the characteristics of specific markets and users may require additional functions and features displayed only in those markets.

Eiji Sano, SAP, Director of Language Services (Tokyo)

The Director of SAP’s Language Services unit in Japan spoke in February 2018 about the company’s sourcing strategy for language service providers. With over 88,500 employees serving over 378,000 customers spread across 180 countries, SAP requires translation of all products, product services, and corporate content and translated approximately a billion words in 2017.

The SLS team of over 200 people works with a large network of over 2,800 freelancers from more than 120 language service providers (LSPs) spread across 41 countries. The SLS team manages the supplier pool through a central information system called the SAP Translation Support Portal, and also supports this network through events such as the Language Services Forum and regular roundtables. Sano explained that these suppliers are mostly single-language vendors, with some multi-language ones.

As for machine translation, Sano emphasized in his presentation that “SLS views the integration of MT into its standard processes as a key strategic driver for the coming years.”

Michaela Bartelt-Krantz, Electronic Arts, Senior Localization Director (London)

From the heart of the game localization industry, Michaela Bartelt-Krantz from Electronic Arts (EA) spoke about the company’s outsourcing strategy (direct to freelancers with some single language vendors) and of her hopes for future-state workflows where real-time, automated localization is a reality. The Senior Localization Director described EA’s internal language service as operating a supply circle, rather than a supply chain, whereby activities such as language planning, development, game design and live content are interlinked, and real-world events play a role in the localization strategy.

Ferose V R, SAP, Senior Vice President and Head of Globalization (San Francisco)

The Senior Vice President and Head of Globalization at SAP painted a bright future for human and AI interaction. Machines will never have the ability to introspect, meditate or be mindful, he said, and there will always be a place for humans in the translation workflow since “the heart of technology is human” and the role of technology is to “elevate rather than deflate humans.”

Ferose believes that the future is the “convergence of three things: translation, transcription and voice,” and while voice in particular will come with a significant security challenge, there is a big opportunity for companies in the integration of voice-based APIs.

Ferose V R also gave his take on another hot topic, that of disruption and growth markets. SAP’s Head of Globalization foresees massive expansion in previously untapped markets, such as Asia and Africa. Localizing for these multilingual geographies brings its own challenges but there is great opportunity to be found in these longtail languages, he said.

Anna Schlegel, NetApp, Head of Globalization (San Francisco)

The Head of Globalization at cloud storage company NetApp spoke of NetApp’s journey to becoming a globalized company, and underlined the importance of dealing effectively with C-suite executives to secure buy-in, gain direction, and deliver programs that help products gain number one position.

When Schlegel joined NetApp over ten years ago as the first person hired onto the localization team, she “couldn’t see much localized content,” Schlegel said in September 2018. There was a clear business case, Schlegel felt, for globalizing NetApp’s products into additional languages to reach more potential users and to make sure that the company was not “leaving money on the table.” She asked for a team and a budget.

At first, Schlegel said, NetApp was “localized in a few areas. Now we have the whole company globalized.” Schlegel is now leading nearly 200 people across the Globalization, Information Engineering and Product Portfolio teams and her team also now runs a “Globalization Forum where we have the Heads of all Departments telling us what the important countries, products, practices and goals are.” This all feeds into a content strategy for the company, with globalization at the heart.

Lupe Gervas, Quora, Localization Manager (San Francisco)

The Localization Manager for intelligent question and answer community Quora was new to the role when she spoke at SlatorCon San Francisco in September 2018. Gervas gave her take on the current hiring practices within the industry, discussing the challenges, pitfalls, and successes of hiring language professionals.

Getting hiring right is extremely important, said Gervas, all the more so because of the fast-changing nature of the work. For example, social products, platforms that serve as a vehicle for user-generated content, are rapidly expanding and “are shaping the way we are working every day,” Gervas added. Therefore, it is important to consider the future and, specifically, according to Gervas, “what are the products and the content that we are going to be working on?”

The crux of the matter from Gervas’ perspective is “are we really hiring for these new products that are coming? For the next 1 billion users?” Her response: “I question that.” Ultimately, she said, taking hiring risks, and screening candidates for adaptability rather than just experience is a good way to safeguard localization workforces.

Sonia Oliveira, GoPro, Senior Director of Globalization (San Francisco)

Joining the tech panel, the Senior Director of Globalization at GoPro said that being a video-based and edgy brand does not lend itself to using MT, but that there can be some application in the customer support and chat segments. “If you’ve seen any of our videos, we’re all about adventure; we’re a sexy brand and we want to push the envelope and our marketing reflects that,” she said. “Machines don’t do too well on that front.” Oliveira admitted that whereas “15 to 20 years ago, machine translation was laughable,” today, she thinks that “we’re gonna get even better than we have gotten up to now.”

Asked by an audience member whether the future is leaning more towards open standards or proprietary solutions, Oliveira conceded that while proprietary solutions won’t be going away, she thinks the two are not mutually exclusive. “I think it’s both, because there will be companies that are not going to be using open source,” she said. “Sometimes open source develops in a way that proprietary tools will then copy and make better for everybody.”

As for the unique challenges faced by different verticals, Oliveira said video localization has fundamental problems that are not necessarily resolved by current developments in language technology. “Let’s talk about a very basic one,” she started. “Video files are huge. It’s very difficult to send them for quality check to an outside partner, for example.”

Mike Kim, Tencent America, Localization Director (San Francisco)

The Localization Director at Tencent America described the company’s localization strategy: “in terms of what our priority language is, it’s actually where the money’s at.” He explained that the localization strategy relies on knowing which countries have a lot of spending potential for games.

For him, it is not even a matter of replacing or augmenting human translation work with MT. “For marketing content, copywriters don’t even speak the source language, they just look at what’s been translated, receive the context, and rewrite,” he said. “Copywriters will write new style guides, new characteristics for heroes, so it’s like creating a new world apart from what was originally created. That’s really hard to do with MT.”

Jie Li, Alibaba Translate, Senior Product Operations Advisor (Hong Kong)

The Senior Product Operations Advisor at Alibaba Translate explained to participants why language is a critical component in the ecommerce giant’s growth strategy. “My first task [at Alibaba],” said Li, “is how to make language into a product. And also what I focus [on is to] mainly use AI, mainly machine translation to combine with other AI technology like speech recognition.”

Ultimately, Li’s mission is to shorten and improve the connection between Alibaba’s suppliers and end-users within Alibaba’s end-to-end ecommerce ecosystem. As an example, she demoed their recently launched neural machine translation-powered live chat function, where buyers and sellers interact in real time. The underlying MT engine was trained on two decades’ worth of e-commerce content.

Alibaba’s language requirements go far beyond this, with Li sharing a slide of their technology platform and product matrix. It featured more than two dozen areas in language services and tech where they are looking for partners. “We’re looking for vendors [to] not only provide language solution[s],” said Li, “but user experience testing, user feedback, user surveys, and more.”

Jie Li, Senior Product Operation Advisor, Alibaba Translate presentation at SlatorMeet HK2018

Claudine Nick, Roche, Head of Project Management Language Services (Zurich)

The Head of Project Management for the internal language services department of multinational healthcare company Roche spoke about her team’s role in supporting the healthcare giant’s language needs. The highly sensitive nature of the content within this heavily regulated industry means that extremely stringent data privacy and compliance requirements must be adhered to throughout the translation process, Nick said. Consequently, the capacity that Nick’s team has to manage and perform human translation securely within Roche’s “center of excellence” for language is of “big value” to the company, she added.

Slator

Florian Faes, Slator, Co-Founder

The Slator Co-Founder is a permanent fixture on the SlatorCon line-up, taking the stage in Tokyo, London, San Francisco, Zurich and at SlatorMeet Hong Kong, to talk the audiences through the main trends impacting the language industry.

The language industry has remained a strong market throughout 2018, Faes said at the final SlatorCon of the year, and some specific pockets of the industry such as media localization, life sciences, e-commerce, remote interpreting and gaming have been experiencing above-average growth, boosted by their own vertical-specific growth factors.

Faes described how new services, including language data creation and curation, have sprung up to fuel the growth of the data-hungry neural machine translation (NMT) technology. It’s “probably going to be a long-term trend for this industry,” Faes predicted. The industry’s thirst for language data is here to stay, therefore, as NMT tech has stepped up a gear to enter what Faes identified as a “new phase, which is customizable machine translation.” A potential game-changer for the industry, big tech frontrunners Microsoft and Google have both begun to sell custom NMT direct to customers in the last six months, Faes said.

M&A has also ramped up throughout 2018. While it was hot but not sizzling in the summer, deal-making in the industry accelerated into Q3 and Q4. The landscape among the top players in language services looks markedly different from the start of the year, as Donnelley Language Solutions, Telelingua and Xplanation were absorbed into SDL, Technicis and LanguageWire, respectively. Technicis and LanguageWire, both backed by private equity, have now become two of the biggest players in Europe, while SDL has closed in on frontrunners TransPerfect and Lionbridge, as the largest LSPs begin to widen the gap between them and the midfield of LSPs. External buyers, including UK-based e-commerce company The Hut Group, which acquired Language Connect in 2018, have also begun to explore M&A opportunities in language services, Faes said.

Faes summed up the likely result of this increased M&A activity in his 2019 outlook, given at the final SlatorCon of the year, SlatorCon Zurich. “Big LSPs will get bigger,” he predicted, pointing to more consolidation among already sizable players in the language service space. His other forecasts? Custom NMT will continue to gain traction, which will lead to specialized solutions and niche expertise in this area, Faes said.

The Slator Co-Founder also left a question mark over the impact that the language industry’s growing band of startups will have in the near term. To date, Faes said, none have succeeded in disrupting the competitive landscape in any meaningful way. But 2018 saw more cash pumped into language industry startups, meaning many will be under increased scrutiny to deliver results, making 2019 a likely crunch time for the now well-funded startups.

Save the Date

We are pleased to announce the dates and locations for 2019 SlatorCon:

  • London, May 16, 2019
  • San Francisco, September 12, 2019
  • Amsterdam, November 28, 2019

Stay tuned for more information on SlatorMeet 2019.

This post originally appeared on Slator

Best Practices in Translation Memory Management

Document Revision History

| Version | Revision Date | Description |
| --- | --- | --- |
| v2.1 | 10 December 2018 | Markdown version |
| v2.0 | 7 December 2018 | Add Creative Commons License |
| v1.5 | 21 September 2018 | First Final Draft After Incorporating Community Feedback |
| v1.0 | 22 May 2018 | First Draft sent by the GILT Leaders’ Forum for Review by GILT community members |

TM Management Task Force Contributors

Marco Angiuoni  – VMWare
Janice Campbell – Adobe
Johann Cronin – eBay
Sankeshwari Deo – Autodesk
Michael Kuperstein – Intel
Ryan F. Lee – LDS Church
Natalia Levitina – PTC
Lynn Ma – VMWare
Silvio Picinini – eBay
Andrzej Poblocki – Veritas
Vidya Ramachandran – Adobe
Octavio Ramos – Intel

Contact

GILT Leaders’ Forum: https://github.com/GILT-Forum/TM-Mgmt-Best-Practices

License

This resource is free for you to use and share as long as you adhere to the terms of the CC license.

CC-BY-NC-SA

Best Practices in Translation Memory Management is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Introduction

The GILT Leaders Forum is a self-organized group of seasoned globalization professionals representing various companies from the “buyer” side. The group chartered a Translation Memory Management Task Force to research and produce best practices for translation memory management, as it relates to translation management systems (TMS/GMS) and machine translation engine training.

The purpose of this work has been to gather knowledge from TM experts and transfer it to those responsible for managing TMs in their own organizations, or on behalf of others.

Upon completion of the first draft, we sought feedback from practitioners and experts in the wider GILT Community, including client-side organizations, Language Service Providers, academic and research institutions, industry forums and localization technology providers.

In order to reduce redundancy, we requested that information be gathered across the organization and consolidated as a single response, except when localization services coexist with technology product or service marketers.

We received a total of 32 responses, broken down as follows:

Affiliation

In client organizations, internal technical or engineering teams may perform the TM/MT tasks, while in other cases these activities are outsourced. In the latter case, a respondent from a client organization may not have knowledge of, or visibility into, the practices carried out on their behalf. Since companies that provide technology or services have many clients with varying requirements, they were asked to respond, as much as possible, with the most common use cases.

As with any survey, it is possible we did not ask the right question in the right way. For example, the question “do you do it?” is different from “would you recommend doing it?” It is important to acknowledge this in an effort to help readers of this document not draw the wrong conclusions. Upon analysis, a negative response about a practice did not necessarily imply that it was not done because it was not a good practice. It may be that the respondent’s organization did not have the capability, resources, tools, or know-how to do so. The ability to perform the practice might also be dependent on a TMS or MT engine. Some practices may turn out to be quite complicated to carry out, and therefore the organization may avoid them and focus on the tasks delivering the highest impact. And finally, it might never have occurred to an organization to do such a task, but they would have done it if they had known about it. In some of the comments, respondents expressed that they wished they were doing a specific practice.

Throughout the best practices, we have integrated the community responses from the survey. This information is called out in a yellow box, which may include charts, in addition to a summary with statistics. Example:

Community Feedback:

It is a common practice to remove non-text segments, or remove non-text parts of the segments:

  • Remove the entire segment, 60%
  • Remove the non-word characters, 10%
  • Don’t have such practice, 40%

Our expectation is that this document will help you adopt the practices that suit your business needs and improve upon the practices that you already use. In the end, we hope this document increases your knowledge in this area and that this becomes a living document that will benefit people over time as well.

Table of Contents

Definitions/Key

| Symbol | Definition |
| --- | --- |
| ✔️ | Recommended practice. |
| ❌ | Not recommended or not valid for this use case. |
| ⚠️ | Proceed with caution. This may or may not be a good practice depending on your environment. You may want to consult with your TMS and/or MT service providers for tool-specific guidance. |
| TMS | Translation Management System (aka Globalization Management System (GMS) or Global Content Management System (GCMS)). A type of software for managing the content workflow within the human translation process. |

Best Practices

Create an Admin Role to Manage TMs

TMS: ✔️ MT: ✔️

It is recommended to assign an Admin role within your organization to oversee and manage Translation Memories. The TM Admin role would manage the organizational structure of the TMs. This might include:

  • Overseeing the cleanup of TMs
  • Grouping/hierarchy of workflows
  • Prioritization and updating of TMs
  • Definition of metadata

Specific recommended practices for this role may include:

  • Approve creation of new TM groups
  • Recommend removal of temporary (e.g. test or seed) or non-relevant TMs

Community Feedback:

83% of respondents have an admin role in one form or another, while 17% do not have it.

Define how TMs are organized

TMS: ✔️ MT: ✔️

Categorizing TM content is a challenge for every company and differs depending on each client’s strategy and portfolio.

Many considerations factor into deciding when to silo content into separate TMs. Some strategies are designed to maximize leverage, while others focus on maximizing the quality of translation leveraged.


ℹ️ Tip: It’s possible for content from multiple sources to coexist in the same TM if the translation style is the same. This strategy can help minimize the number of TMs you need to maintain.
Having more TMs increases TM maintenance effort and reduces usefulness of each TM. So maintain as few TMs as possible, but as many as necessary.


Here are some considerations that may help determine your TM hierarchy:

  • Language Separation
    • Split each language into a separate TM, or
    • Store all languages in one TM
  • Separate TMs by Product
    Group by Product family/Business unit/Code stack (i.e. keep distinct TMs for each product area).
  • Separate TMs by Translation Style
    If the same style guide is used to translate several different components, it may make sense to group all such content into one TM, regardless of the source (e.g. keep all your different product content in one TM).
  • Group by Screen Size
    Different devices impose size restrictions/constraints on translation length.

    • You may need to separate TMs based on intended use (desktop vs mobile)
  • Separate TMs based on Quality of Translation
    Sub-divide TMs based on quality of translation, or keep all in one.

    • You may elect to keep length-restricted translations separate.
    • You may elect to keep reviewed content separate to unreviewed content.
    • You may elect to keep machine-translated content that went through light post-editing (to achieve “fit for purpose” quality) separate from content translated to full human quality (using MT + full post-editing and/or CAT tools).

Other arrangements are also possible but may not bring much benefit; rather, they just bloat the number of TMs to maintain, e.g.:

  • Separate TMs by every Code base/File Type
    Group by code stack or file type (i.e. keep distinct TMs for each type of file being processed).

⇨ CASE STUDY: Example of TM separation at Company A

  • Separate TMs were created for:
    • Mobile (iOS & Android)
    • UI
      • All goes into one TM
      • One product with length-restrictions (for UI) stored in separate TM
    • Web content
    • Help content
    • Documentation
    • Customer support
    • Misc TM (Legal text, Marketing text, graphics localization, Survey docs, etc…)
  • TMs use a uniform naming convention, for consistency and to help group TMs by name
  • All TMs have detailed descriptions so everyone knows exactly what they should contain

Community Feedback:

TMs are organized by the respondents using the following criteria, in the order of popularity (multiple options allowed):

  • Product Family/Business Unit/Code Stack, 78%
  • Language, 66%
  • Content Type (UI/DOC/Web), 63%
  • Translation style, 34%
  • Quality (length-restricted, machine-translated, possibly unreviewed), 28%
  • Screen Size (desktop vs. mobile), 9%

Define metadata for TMs

TMS: ✔️ MT: ✔️

Define the metadata required to organize TMs. This makes it possible to organize the content of the TMs in different ways for specific purposes. Recommended attributes (even if not available in TMS systems) include:

  • Product
  • Vertical/Subject
  • Type of Content (UI, Help, Doc, Short/Long, User Generated or Not)
  • Style (Formal vs. Informal, or Research vs. Gaming)
  • Visibility (Internal vs. Published) – which influences the quality need
  • Quality Level (Perfect vs. Good Enough)
  • Bilingual/Multilingual TM

Community Feedback:

Most respondents are already using metadata to capture product, type of data and bilingual/multilingual information.

The most desirable metadata that some of the respondents are not using now and would like to add are:

  • Quality Level (Perfect v. Good enough)
  • Style (Formal vs. Informal, Research vs. Gaming)
  • Visibility (Internal vs. Published)
  • Vertical/Subject

TM metadata is used to maximize quality and leverage, as well as to identify the most effective MT training data. One user noted a need to track usage analytics for TM matches to facilitate TM maintenance.

Determine the plan or principles for using TMs (grouping / leveraging / updating)

TMS: ✔️ MT: ✔️

Determine the stated plan or guiding principles for grouping, priority, applying, penalization, and updating of TMs.

  • Grouping / Hierarchy
  • Priority or Sequence (list of 5 TMs in prioritized order, for example)
  • Leverage Penalty (if any) for each TM in the sequence
  • Which TM(s) is/are updated after translation

 Note: Keep your leveraging rules simple. Leveraging from a very large number of TMs has dubious benefits with a higher risk of adding complexity and potentially reducing quality. For some TMS systems, reducing the number of TMs leveraged is a recommended practice due to quality and performance concerns. Consult with your TMS provider for guidance.

It is not recommended to update to multiple TMs after translation.

It is recommended to have a structured approach to managing TMs, otherwise cleanup efforts are not as effective.


ℹ️ Group TM Considerations

Many TMS allow you to reference other TMs in order to maximize the leverage potential of your corpus. Most TMS allow you to apply penalties to any leverage you get from reference TMs, in order to ensure the content appears as a fuzzy match requiring review, before being committed to the write-to TM. Consider the options below:

  • Leverage from TMs with length restrictions?
    Translations may be compromised by the screen limitations or UI restrictions. Do you want to reference TMs that contain abridged translations?
  • Leverage from TMs with different translation style?
    Do you want to reference TMs that use a different translation style guide, e.g. formal v informal?
  • Leverage from unreviewed TMs?
    Do you want to reference TMs that contain unreviewed content, e.g. internal training content, perhaps inconsistent with your term database?
  • Leverage from Machine Translated TMs?
    Do you want to reference TMs that contain machine-translated content, (reviewed, or not)? If you reference reviewed MT content, do you want to reference only full human-quality or good-enough (fit for purpose) quality?

ℹ️ Apply Penalties

In all the above cases you should apply a penalty against any references coming from the lesser-quality TM, or from one that uses a different translation style, in order to ensure it gets properly reviewed before being used.


Community Feedback:

Most respondents are using a stated plan or guiding principles for using their TMs based on:

  • Priority or Sequence, 78%
  • Leverage Penalty (if any) for each TM in the sequence, 66%
  • Which TM(s) is/are updated after translation, 66%
  • Grouping / Hierarchy, 59%

Determine the process and criteria for cleaning TMs

TMS: ✔️ MT: ✔️

Define why, when, and how TMs should be cleaned:

  • Define separate processes for cleaning TMs for a TMS and for MT.
  • Set a schedule for cleaning.
  • The general practice for MT is to exclude all segments where the source is ‘suspicious’ or does not look like a sentence, so that non-fluent data does not affect the MT engine quality.
  • Weigh the cost/benefits of having a very dirty TM against the value of high leverage for a particular project. Consider that if you allow junk segments into the TM because they give high leverage for a particular project, those junk segments are appropriate only for that project, and may not be as suitable for other projects in general.

 Note: You may wish to tag and archive all verified wrong translations for future use in Machine Learning, as an example of a bad translation.


ℹ️ Criteria that may be considered in decisions about cleanup:

  • Age of the segment (last modified date / last used date)
  • Terminology updates
  • Duplicates
  • Changes in style (formal to informal, for ex.)
  • Segment analysis (corruption, incorrect language, mismatching count of format parameters, etc.)

(A detailed list of cleaning criteria is contained in the sections below.)


Community Feedback:

  • About half of respondents perform cleanup operations for TMS purposes by implementing terminology updates and by modifying “wrong” segments.
  • Around a third of respondents perform all the other cleanup operations for TMS maintenance purposes on a schedule (deleting “wrong” segments, based on segment age, segment analysis, changes in style and duplicates).
  • Between 30 and 45% of respondents perform cleanup operations on a schedule for MT purposes by deleting or modifying “wrong” segments and doing segment analysis. A quarter of users delete duplicates for MT training purposes.
  • Only 13-20% of respondents do not perform any cleanup tasks for either TMS or MT purposes.

General TM Housekeeping tasks

These are language-independent tasks, meaning that these tasks should be applied across all languages.

Create complete descriptions for TMs

TMS: ✔️ MT: ❌

The TM ‘Description’ field is critical in helping to differentiate and describe TM contents and use. It’s especially useful when you have multiple Admins creating TMs. Recommended practice.


⇨ Example syntax:

[User's initials] [Date] [Description of TM]


 Note: Some TMs are temporary, used to seed other languages (e.g. French to French-Canadian), so they have a shelf life. Knowing who created it, when, and having a useful description helps know how it’s used and when to delete.

Community feedback:

Community feedback indicates that a higher percentage create complete TM descriptions for TMS vs. MT.

Use consistent filenames and path normalization to avoid duplicates

TMS: ✔️ MT: ❌

Consistent filenames: Avoid submitting a project file to translation with a different filename each time the content is updated (e.g. localization\sprint1.xml, localization\sprint2.xml). Such a practice would create multiple entries in the TM for the same content because the filename is different each translation cycle (or sprint).

Path normalization: Use path normalization in conjunction with using consistent filenames, to reduce creation of TU duplicates. Recommended practice.

 Note: This practice is not relevant for all TMS or for systems that use alternatives to TMs (e.g. documents in Transit or LiveDocs in memoQ).


⇨ SDL WorldServer Example

BEFORE path normalization:

Before path normalization is introduced, the same TU is saved 4 times into a TM even when no changes have been made to the translation – original TU plus 3 duplicates with the following ‘Entry Origin’ attribute values:

/FileSystem/Projects/ProductA/Version1/batch1/foo.htm
/FileSystem/Projects/ProductA/Version1/batch2/foo.htm
/FileSystem/Projects/ProductA/Version2/batch1/foo.htm
/FileSystem/Projects/ProductA/Version2/batch2/foo.htm

AFTER path normalization:

Entry Origin for all 4 cases looks like this:

/FileSystem/Projects/foo.htm

Therefore no duplicate TUs are created.
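As an illustration, this kind of path normalization can be approximated outside the TMS with a small pre-processing step. The sketch below is hypothetical (the function name and the number of leading path components kept are assumptions), not WorldServer’s actual implementation:

```python
import posixpath

def normalize_entry_origin(path: str, keep: int = 2) -> str:
    """Collapse product/version/batch subfolders in a TU's 'Entry Origin'
    so repeated translations of the same file map to one normalized path."""
    parts = posixpath.normpath(path).strip("/").split("/")
    # Keep the first `keep` components (e.g. /FileSystem/Projects) plus the filename.
    return "/" + "/".join(parts[:keep] + [parts[-1]])

# All four origins in the example above normalize to /FileSystem/Projects/foo.htm
```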


Community Feedback:

Most respondents indicate a preference for creating consistent filenames for updated content & using path normalization.

Detect and fix technical issues in the content

TMS: ✔️ MT: ❌

If you find character corruption, escaping, or other quality issues in the source, look upstream at the source content and then the filters/parsers to investigate why those issues happened, e.g.:

  1. Encoding issues in the source can cause character corruption in the translated assets, especially if non-English accented characters are not supported by the chosen code page/encoding.
  2. Escape characters (such as \n, \r) should not be included as translatable elements since that increases the risk of translators incorrectly interpreting or omitting these. Character escaping should occur outside of the content bundle so linguists can translate content without seeing backslashes in the source, or worrying about adding escape sequences in the translated assets. Similarly, it’s not good practice to escape quotes in the source content (e.g. Can\’t). Instead, have a post-processing function to escape characters where needed.
  3. Check your filter/parser rules to ensure your content is sent correctly for translation and the untranslatable tags are identified as such.

⚠️ If the issue is with the filter/parser, it should be fixed before any effort is expended on cleaning, to avoid losing leverage. Plan parser changes carefully because parsing affects matches at every level (ICE, 100% and fuzzy matches). If the parser and the cleanup efforts are not done in a coordinated fashion, a leverage loss will occur.


⇨ Examples:

  • Character corruption needs investigation.
  • Escaped HTML entities (or double-escaped entities) often indicate an issue that should be fixed in the parser.
  • Line breaks at 80 characters may indicate a parser that should do a better job of parsing text to flow across lines.

Community Feedback:

Most respondents have a process in place to fix corruption issues. This area continues to be a challenge for 20% of those surveyed.

Remove empty segments (source or target)

TMS: ✔️ MT: ✔️

This cleanup step is applicable regardless of language. Recommended practice. Some systems or configurations prevent this from happening, but that is not always the case.
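A minimal sketch of this filter, assuming TUs have already been exported from the TMS as plain (source, target) pairs:

```python
def drop_empty_segments(tus):
    """Drop translation units whose source or target is empty or
    whitespace-only -- a language-independent cleanup step."""
    return [(src, tgt) for src, tgt in tus if src.strip() and tgt.strip()]
```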

Community Feedback:

For cleanup purposes, 64% of respondents remove empty segments for TMS and 75% for MT.

Some TMS can be configured to not produce empty segments, so for those systems this is a non-issue.

Characters

⚠️ Take extreme care with any normalization during cleanup; over-normalization can lead to loss of information and/or loss of leverage.

  • Inspect the incoming source text for escaped characters/entities, control characters, or excessive white space. If these exist, look upstream at the parsers for the source of these issues before cleaning them from the TMs.
  • When sending content to an LSP for translation, the text should be raw unescaped characters.
  • There are many exceptions to general normalization rules based on the file type, parser, and content type. Consider your specific use case and needs before taking any action.

Normalize escaped characters/entities

TMS: ⚠️ MT: ✔️

Definition: “Escaped” characters are representations of characters using only ASCII characters. For example, &#x20AC; is the escaped representation of the Euro symbol.

Recommended Practice: Replace all escaped HTML entities with the actual Unicode character, including these common ones:

  • &lt; (<)
  • &gt; (>)
  • &amp; (&)
  • &nbsp;

This enables the text to be more readable and more easily searched (e.g. during QA phase).

How: TMS systems can store non-ASCII Characters as either characters or entities. It’s recommended to store characters in the TM to make sure that an unescaped version of source content is used during the TM leveraging and translation process.

Benefits:

  • Content is much easier to read, translate, and to troubleshoot.
  • Data in the TMs is independent of source content type.
  • Normalizing to unescaped characters across all your content should produce better leverage.

Restricted characters must be escaped again before the final content is delivered.

 Note: When working with XML files, there are 5 ‘special characters’ that must be stored as entities in the XML output to avoid parser issues (i.e. & ‘ > < “). These can be stored as characters in the TM, but as entities in the translated XML output.
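In Python, for example, the round trip can be sketched with the standard library (this assumes HTML-style entities in the TM export; actual escaping rules depend on your TMS and file format):

```python
import html
from xml.sax.saxutils import escape

def to_tm_form(text: str) -> str:
    # Store real Unicode characters in the TM: "&#x20AC;" -> the Euro symbol, "&lt;" -> "<"
    return html.unescape(text)

def to_xml_output(text: str) -> str:
    # Re-escape the five XML special characters before writing translated XML.
    return escape(text, {'"': "&quot;", "'": "&apos;"})
```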

Community Feedback:

  • 50% of respondents normalize characters/entities for TMS & MT.

Normalize certain control characters

TMS: ⚠️ MT: ✔️

Definition: Control characters are codepoints that do not represent written symbols, but rather perform some other function in a document (formatting, spacing, legacy functions that are no longer used, etc.). A list of control characters can be found at: https://en.wikipedia.org/wiki/Control_character.

Recommended Practice: Remove certain control characters, such as non-printable characters.

 Notes:

  • Non-printable characters such as the BELL or Unicode Byte-Order-Mark sequence in TM data may indicate corruption from a bad parser or other data source, so consider inspecting and correcting upstream sources if you find these.
  • Unescaped non-printable control characters are not legal in XML file content, and must either be removed or escaped before further processing.
  • Tab, new line, and other whitespace characters are also considered ‘control’ characters, but we address them separately in the next sections.
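One way to implement this removal is via Unicode categories; the sketch below is an assumption-laden starting point (the keep-list and the explicit byte-order-mark handling should be adapted to your pipeline):

```python
import unicodedata

def strip_control_chars(text: str, keep: str = "\t\n\r") -> str:
    """Remove Cc (control) codepoints such as BELL, keeping tab/newline,
    which are treated separately as whitespace characters."""
    text = text.replace("\ufeff", "")  # stray byte-order marks (category Cf)
    return "".join(ch for ch in text
                   if ch in keep or unicodedata.category(ch) != "Cc")
```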

Community Feedback:

  • 43% of respondents normalize control characters for TMS and MT.

Normalize whitespaces

TMS: ⚠️ MT: ✔️

Collapse series of white spaces to one instance.

 Note: Some TMSs allow the option of storing a sequence of whitespaces as a single space.

⚠️ Caution:

  • There are 31 whitespace characters listed in the Unicode standard, with specific linguistic rules or technical uses for certain languages. Hence, removing all whitespace characters without analysis is not advisable.
  • The decision of whether to normalize whitespace should be dependent on advice from the tool vendor, analyzing the content, and the translated output.
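A conservative sketch that collapses only ASCII spaces and tabs, deliberately leaving other Unicode whitespace (such as no-break spaces) untouched:

```python
import re

def collapse_ascii_whitespace(text: str) -> str:
    # Collapse runs of ASCII space/tab to one space; Unicode whitespace
    # characters with linguistic meaning are left as-is.
    return re.sub(r"[ \t]+", " ", text).strip()
```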

Community Feedback:

  • 44% of respondents normalize whitespaces for TMS.
  • 53% of respondents normalize whitespaces for MT.

Normalize quotes

TMS: ❌ MT: ⚠️

Recommended for MT training only: in order to have consistent input for all content, it is suggested to perform various quote conversions:

  • Source & Target: Convert curly single quotes to regular single quotes.
  • Source & Target: Convert curly double-quotes to regular double-quotes.

Conversion of curly quotes to straight quotes helps introduce consistency into the MT training corpora and may correct some errors in technical content, where code was typed or copied from a word processor.
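The two conversions above can be sketched with a simple translation table (the straight-quote targets shown are the assumption here):

```python
# Map curly single and double quotes to their straight ASCII equivalents.
CURLY_TO_STRAIGHT = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
})

def straighten_quotes(text: str) -> str:
    return text.translate(CURLY_TO_STRAIGHT)
```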

Other search-and-replace quote operations include substituting non-English (e.g. Chinese, Japanese and French) quotes with English double-quotes and/or removing all double-quotes. These conversions should be performed carefully due to the risk of introducing technical or linguistic errors.

⚠️ Caution: There are different views on this practice, with some people considering that these conversions are not a recommended practice.

  • Target: Convert Japanese and Chinese single and double quotes to full-width single and double quotes, respectively.
  • Target: For French (France), convert curly single and regular single quotes to the left and right guillemet characters (« »), with a non-breaking space after the left guillemet, and a non-breaking space before the right guillemet. For French Canada, do not include the non-breaking spaces.
  • Remove all standard double quotes.

Community Feedback:

  • 70% of respondents do not normalize quotes in the SOURCE
  • 65% of respondents do not normalize quotes in the TARGET.

Tags

Retain tagged or parameterized variables

TMS: ✔️ MT: ✔️

Definition: Tagged or parameterized variable content is a variable inserted in the text to represent any possible value. For example, “{0} not found” or “<ITEMID> not found” means that ITEMID will be replaced by an actual ID for an item.

These tags should be retained as-is to avoid breaking the meaning of the sentence.

Community Feedback:

  • 88% of respondents retain tagged or parameterized variables for TMS.
  • 84% retain these variables for MT purposes.

Normalize untagged variable content

TMS: ❌ MT: ✔️

Definition: Variable content is content that can be replaced by a variable name to increase the usability of the data. The replacement of numbers is a good example. A sentence such as “I have worked here for 6 years” would only help the translation of sentences containing “6 years” but not 7 or 8. Replacing the number 6 with a variable such as $num will make this a number-neutral sentence.


⇨ Example

  • “You have 5 emails.” >> “You have 0 emails.”
  • “Edited on 10:49 AM Feb 9” >> “Edited on 0:00 AM Jan 1”

Although it is not a common practice to normalize an untagged variable, doing so during pre-processing of MT corpora (convert numbers to $num) and input text may enhance leverage. Such a practice may be dependent on the MT engine.
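A sketch of number neutralization for MT corpus pre-processing (the `$num` placeholder and the decimal pattern are assumptions; your MT engine may expect a different token):

```python
import re

def neutralize_numbers(text: str, placeholder: str = "$num") -> str:
    """Replace literal numbers (including simple decimals) with a
    placeholder so number-only variants collapse to one sentence."""
    return re.sub(r"\d+(?:[.,]\d+)?", placeholder, text)
```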

Community feedback:

  • For MT, 68% responded they do NOT normalize untagged variables.
  • For TMS, 72% do NOT do this practice.

For the most part, this is not a recommended practice, according to feedback from the GILT community.

Removing tags that don’t affect the meaning

TMS: ❌ MT: ✔️

Definition: Tags that don’t affect meaning are formatting tags.

  • HTML Formatting: <b>This is bold.</b>
  • Trados style inline markup: {\cs6\f1\cf6\lang1024 </ut><strong><ut>}
  • Trados font tags: {\f2 Le sedi del training non sono comode da raggiungere}
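For the HTML case, a simplified sketch (Trados-style inline markup like `{\f2 ...}` generally needs a format-aware filter rather than a regex; the tag list here is an assumption):

```python
import re

# Strips only basic HTML formatting tags that do not affect meaning.
HTML_FORMAT = re.compile(r"</?(?:b|i|u|em|strong)>", re.IGNORECASE)

def strip_formatting_tags(text: str) -> str:
    return HTML_FORMAT.sub("", text)
```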

Community Feedback:

  • 74% of respondents say they do remove tags not affecting the meaning for MT purposes.
  • For TMS, 46% remove these tags.


For TMS, there is not a strong consensus on the practice to remove these types of tags. It may be dependent on the TMS used or the pre-processing steps to prepare an MT corpus. This practice is followed, in general, more widely for MT purposes.

Duplicates

While some duplicate segments in the TM are needed for ICE-matching purposes, it is otherwise good practice to remove unnecessary duplicates from TMs. This improves data quality, leverage and performance of TMs.

Identify and remove duplicates with no context for MT training purposes

TMS: N/A MT: ✔️

Machine translation engines use only source and target sentences with no context from metadata. In this scenario one should remove duplicates where two segments have the same source and the same target, that is, an identical translation.
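A minimal order-preserving de-duplication over (source, target) pairs, suitable for MT training corpora where no context metadata is used, might look like this:

```python
def dedupe_pairs(pairs):
    """Remove exact (source, target) duplicates, keeping the first
    occurrence and preserving corpus order."""
    seen = set()
    out = []
    for src, tgt in pairs:
        if (src, tgt) not in seen:
            seen.add((src, tgt))
            out.append((src, tgt))
    return out
```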

Community Feedback:

  • 63% of respondents remove duplicates for MT training purposes.
  • Overall, survey results support the practice of removing duplicates.

Identify and remove In-Context Exact (ICE) duplicates for TMS

TMS: ❌ MT: N/A

In-Context Exact matches are matches that go beyond having the same source and target. They have more information in common that further guarantees that the translation is appropriate to the context. For example, besides the current segment being translated, also the previous and next segments are the same.

 Note: If you have duplicates of ICE matches you may want to apply practices of path normalization to avoid this.

Community feedback:

  • 70% of TMS respondents do not remove ICE duplicates.
  • Survey results support retaining ICE duplicates.

Age/Obsolete

Remove segments for being older than a certain age

TMS: ⚠️ MT: ⚠️

Metadata such as Usage Count, Last Used, and Last Updated can indicate if a certain segment is being actively leveraged.

 Note: This practice is not broadly recommended and should be carefully evaluated for your use case.

Reasons for removing older segments:

  • Significant changes in style (for example, from formal to informal) may indicate the need for removing content older than the date of change.
  • TMs may grow too large causing leveraging and updating to take a long time.

Reasons against removing older segments:

  • Old segments might be used years later, such as for warranty work.
  • For MT, it is generally better to keep everything.

 Note: Removing old TM entries may reduce your ICE leverage results. Recreating TMs from latest files is one solution to identify only active/relevant TM content. Consider if applicable to your use case.

Community Feedback:

The majority of respondents (70% for TMS and 76% for MT) do not remove old segments.

Respondents keep older TM entries because:

  • The content is still active despite not being updated recently, and may need to be leveraged at any time (e.g. when supporting older products or legacy code)
  • Inactive & older translations still have potential use in machine translation training
  • To maintain term history
  • No easy way to purge the TM to retain only ‘active’ terms
  • The TMS shows no data on when a term was last leveraged, so it’s unclear what can be safely deleted.
  • Fear of corrupting TMs

Respondents remove older TM entries because:

  • Size reduction
  • Terms are updated often, so older terms are deprecated
  • Old terms are moved from main TM to a secondary TM reference (sometimes with penalty)
  • Cost
  • Quality/Consistency/Usefulness of content
  • Performance

Remove low value entries

Check if a segment contains mostly non-text content

TMS: ❌ MT: ✔️

Exclude a segment (for MT) if a high percentage of its characters are non-word characters.
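One possible implementation, using a tunable non-word-character ratio (the 50% threshold below is an assumption, not a standard, and should be calibrated on your own corpus):

```python
import re

def non_word_ratio(segment):
    """Fraction of characters that are not letters, digits, or whitespace."""
    if not segment:
        return 1.0
    non_word = len(re.findall(r"[^\w\s]", segment))
    return non_word / len(segment)

def keep_for_mt(segment, threshold=0.5):
    """Keep the segment for MT training only if it is mostly text."""
    return non_word_ratio(segment) < threshold
```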

Community Feedback:

It is a common practice to remove non-text segments, or remove non-text parts of the segments:

  • Remove the entire segment, 60%
  • Remove the non-word characters, 10%
  • Don’t have such practice, 30%

Characters that do not match either the expected source or target language

TMS: ⚠️ MT: ✔️

Identify if there are characters in the source or target that should not be used in that language.

For TMS, check for it, but do not necessarily delete automatically because there may be valid mixed character sets in a segment.

For MT, consider removing any segment whose translation contains characters that don’t fit the expected charset for the target language, using mapping tables available from Unicode.org.

 Note: Corrupted characters may be identified as a result of the checks 7.1 and 7.2 above. Corrupted characters should not be in the source content bundles and should be reported to the developers for removal/fix.
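A simplified sketch of this language-fit check. The per-language character ranges below are rough illustrative assumptions; a production check should be built from the Unicode script data mentioned above:

```python
import re

# Rough "expected character" classes per target language (assumptions for
# illustration only; derive real ones from Unicode script/block data).
EXPECTED = {
    "ru": re.compile(r"[\u0400-\u04FF\u0020-\u007E]"),             # Cyrillic + printable ASCII
    "ja": re.compile(r"[\u3000-\u30FF\u4E00-\u9FFF\u0020-\u007E]"), # kana, kanji, ASCII
}

def has_unexpected_chars(text, lang):
    """True if any character falls outside the expected set for the language."""
    pattern = EXPECTED[lang]
    return any(not pattern.match(ch) for ch in text)

# For MT, a segment returning True is a candidate for removal;
# for TMS, flag it for review instead of deleting automatically.
```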

Community Feedback:

It is a common practice for the MT:

  • Remove the entire segment, 53%
  • Remove the non-word characters, 5%
  • Don’t have such practice, 42%

But not for the TM:

  • Remove the entire segment, 29%
  • Remove the non-word characters, 0%
  • Don’t have such practice, 71%

Do not remove segments where source = target

TMS: ✔️ MT: ✔️

Typically, content that should not be translated should be hidden during the translation process and should not be saved in TMs. However, some content that should not be translated cannot be automatically distinguished from content that should be translated. For example, brands, product names, or copyright statements might not be translated, but are likely to be mixed in with translatable content. In this case, it would be ideal to use a terminology management system to manage this content and ensure that it is handled correctly. This way the content is actively managed and kept up-to-date with the organization’s current standards.

However, if a robust and well-managed terminology solution is not available, an acceptable alternative is to store untranslated segments in TMs, with the target content identical to the source content.

For MT training, it is desirable to use a dictionary or terminology functionality provided by the MT system. However, if this is unavailable or undesirable for some reason, TM entries with identical source and target data can also be used for training MT systems to increase the likelihood that the MT system will learn to leave these special terms untranslated.

Community Feedback:

  • Only 4% of respondents have a practice of removing such segments from the TM.
  • Only 14% of respondents have such a practice for MT.

Most respondents have other ways of dealing with such segments (filters, terminology) and of validating that the untranslated segments are correct.

Those who do remove such segments don’t want the MT system to learn to leave segments untranslated.

Check unbalanced brackets

TMS: ✔️ MT: ✔️

If there is an opening parenthesis and no closing one, it may simply be a typo, missing from the source/translation. If the closing parenthesis appears on the next segment this can signify bad segmentation/incomplete sentences. In that case, check the cause for the sentence being split into multiple segments (e.g. abbreviation) and correct the sentence breaker rules to keep the sentence whole.

MT recommendation: compare to the source. If it does not match the source, then remove.
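A small stack-based balance check, together with a source-comparison rule along the lines of the MT recommendation above, could be sketched as:

```python
PAIRS = {"(": ")", "[": "]", "{": "}"}

def brackets_balanced(text):
    """Stack-based check that (), [], {} are properly nested and closed."""
    stack = []
    for ch in text:
        if ch in PAIRS:
            stack.append(PAIRS[ch])
        elif ch in PAIRS.values():
            if not stack or stack.pop() != ch:
                return False
    return not stack

def flag_segment(source, target):
    """Flag a pair when the target's bracket balance differs from the source's
    (a proxy for the "compare to the source" rule; refine as needed)."""
    return brackets_balanced(source) != brackets_balanced(target)
```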

Community Feedback:

It is not a common practice to validate the brackets

  • Only 22% of responders have such a practice

Remove entries consisting of only punctuation, whitespace, or tags

TMS: ✔️ MT: ✔️

Segments consisting of only punctuation, whitespace, and tags are not translatable. Entries without any translatable text are generally not useful and should be removed from TMs and MT corpora.
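A sketch of such a filter; the tag pattern is an assumption and should mirror your own markup conventions:

```python
import re

TAG = re.compile(r"<[^>]+>|\{\d+\}")  # assumed tag shapes; adjust to your data

def has_translatable_text(segment):
    """True if anything remains after removing tags, punctuation, whitespace."""
    stripped = TAG.sub("", segment)
    stripped = re.sub(r"[\W_]+", "", stripped)
    return bool(stripped)

# Entries where has_translatable_text(...) is False can be removed
# from TMs and MT corpora.
```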

Community Feedback:

  • 35% of responders have such a practice for the TM
  • 78% have such practice for MT

It is more common to have additional validation of the corpora for the MT training.

Remove segments that are too long

TMS: ❌ MT: ✔️

For MT training, some segments may be too long to be useful and should be removed from the training corpus. As an example, the open-source Moses MT system automatically removes any segments that contain more than 80 words.
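A minimal filter mirroring the 80-word Moses default (the threshold is configurable, and whether to count words or characters is a per-language decision):

```python
def filter_long_segments(pairs, max_words=80):
    """Drop (source, target) pairs where either side exceeds max_words.
    The 80-word default mirrors the Moses corpus-cleaning behavior."""
    return [
        (src, tgt) for src, tgt in pairs
        if len(src.split()) <= max_words and len(tgt.split()) <= max_words
    ]
```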

Community Feedback:

  • 45% of responders have this practice

Remove segments that are too short

TMS: ❌ MT: ✔️

For MT training, some users consider that short (one-word) segments may introduce unnecessary inconsistencies or ambiguity due to homonym usage and lack of context, and remove such segments. Other users consider that short segments contain valuable target content that helps the MT resolve ambiguities and increases vocabulary coverage, and don’t remove such segments.

Here is a general recommendation from one MT provider.

1) 1-grams: Do not include in the training data. As a general rule, 1-gram sequences are not good training data for both SMT and NMT systems. For both types of MT, the context for the use of a word is learned from the sequence/segment it belongs to. This is especially relevant in languages that are highly inflected.

2) 2-grams: Do not include in the training data. Learning is not optimal in terms of the context in which the words are used and how they relate to each other. However, including them does serve to beef up the word alignment and vocabulary of your engine.

3) 3-grams: Include in the training data. For languages that are highly inflected, 3-gram sequences offer the opportunity to learn about gendered spelling and inflections.
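Following that guidance, a simple token-count filter that drops 1- and 2-word source segments might look like this (the cut-off is tunable, and the trade-offs described above apply):

```python
def filter_short_segments(pairs, min_words=3):
    """Drop pairs whose source has fewer than min_words tokens,
    i.e. exclude 1-grams and 2-grams but keep 3-grams and longer."""
    return [(src, tgt) for src, tgt in pairs if len(src.split()) >= min_words]
```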

Community Feedback:


Please note that the response rate for this question was low, suggesting that most of the responders don’t have any practice around 1-, 2- and 3-grams.

From those who responded, it is more common to remove 1- and/or 2-grams, but not 3-grams:

  • 53% of respondents remove 1-grams and 2-grams
  • Only 8% (one respondent) also remove 3-grams

The majority of respondents agree with the practice of removing segments that are too short or too long (even if they are not doing it right now).

Inconsistency

Identify segment inconsistencies and fix if appropriate

TMS: ✔️ MT: ✔️

Analyze the TM and identify inconsistencies. A linguist should determine if a change or removal is warranted.

Examples:

  • Same source and two different targets. Some may be valid inconsistencies, for different meanings.
  • Same target for two different sources. Some may be valid inconsistencies, for similar sentences in source being translated the same, thanks to fuzzy matches. But it can also be a wrong translation, an accepted fuzzy match that does not match the source.
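Candidate inconsistencies of the first kind (same source, different targets) can be surfaced for linguist review with a small grouping pass; the symmetric check simply swaps source and target:

```python
from collections import defaultdict

def find_inconsistencies(pairs):
    """Group targets by source; sources with more than one distinct target
    are candidates for review (some may still be valid inconsistencies)."""
    by_source = defaultdict(set)
    for src, tgt in pairs:
        by_source[src].add(tgt)
    return {src: tgts for src, tgts in by_source.items() if len(tgts) > 1}
```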

Community Feedback:

There was no prevalent practice either for TM or MT maintenance.

Respondents that perform maintenance:

  • Have developed their own data normaliser and can write rules to normalise their training data.
  • Manage this process via their vendor
  • Perform the maintenance every 6 months
  • Use tools, scripts & macros to run quality checks on TMs, then fix manually (i.e. the scripts may be proprietary but updates are performed by a linguist, not programmatically)
  • Some TMS have a setting to keep only one translation for same source in TM

Tools mentioned included Oliphant and ApSIC Xbench.

Some TMS have built-in QA checks and segment filters to alert user to review same-source-different-target, different-source-same-target, fuzzy match accepted but not edited.

Terminology

Identify terminology inconsistencies

TMS: ✔️ MT: ✔️

Use a Terminology Database to identify deprecated or rejected terms, potential new terms, and incorrect terms in the TM. Review the resulting inconsistencies between the Terminology Database and the TM. This requires a linguist review, since there are likely to be false positives in this check.

 Note: Some TMS systems have mechanisms to do this for you. Consult with your TMS provider.

Community Feedback:

  • 45% of TMS respondents identify and fix terminology inconsistencies in translation memory.
  • 38% of MT respondents identify and fix terminology inconsistencies in translation memory

 Note: The results do not necessarily indicate a best practice, only that people generally aren’t taking action to identify and fix terminology, for many possible reasons. Many of the respondents who aren’t currently identifying and fixing terminology inconsistencies still consider it a best practice.

Tools mentioned as solutions for fixing inconsistencies included ApSIC Xbench, Okapi Olifant, SDL Multiterm, memoQ, XTM automatic QA checks, and various internal scripts or tools.

Misalignments

It sometimes happens that source segments are associated with targets that do not represent a valid translation of the source content. This can often happen when existing bilingual data is automatically broken into sentences, for example. It is therefore worthwhile to periodically scan your TM data and check for a few criteria that may indicate that the source and target are mismatched. In the TMS case, these segments should not be automatically deleted, but rather flagged for a linguist to review. For MT it may be acceptable to automatically remove segments that seem to be misaligned.

Check sentence length ratios

TMS: ⚠️ MT: ✔️

Inconsistent sentence length ratios between source and target may indicate a misalignment. This may be measured in terms of words or in terms of characters. Note, however, that the acceptable length ratio will depend on the language pair under consideration; for example, one would expect a Chinese translation of an English sentence to contain significantly fewer characters than a German translation of the same sentence.
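A sketch of a character-based ratio check; the bounds are illustrative and must be calibrated per language pair, as noted above:

```python
def length_ratio_suspect(source, target, low=0.5, high=2.0):
    """Flag pairs whose character-length ratio falls outside [low, high].
    The bounds are assumptions; e.g. Chinese targets are legitimately
    much shorter than German ones for the same English source."""
    if not source or not target:
        return True
    ratio = len(target) / len(source)
    return ratio < low or ratio > high
```

For TMS, segments flagged this way should be reviewed rather than deleted; for MT, automatic removal may be acceptable.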

Community Feedback:

  • For TMS, 81% of respondents responded No.
  • For MT, 52% of respondents responded No.

Survey respondents were significantly less likely to use sentence length ratios to identify bad segments for TMs.

 Note: The results do not necessarily indicate a best practice, only that people generally aren’t identifying bad segments using sentence length ratios, for many reasons. However, we still consider this a useful practice, and it is recommended by this group.

TM Maintenance

TM Maintenance Tools

TMS: ✔️ MT: ✔️

Many commercial TMS have built-in TM utilities but open-source tools are also available to perform TM processing, QA checks or other TM management tasks.

  • Okapi Framework (open-source localization & translation tools):
    • Olifant: view and process TMX files.
    • Rainbow: Generate TMX from XLIFF or bilingual files for importing into your TMS
    • CheckMate: allows you to perform various quality checks on bilingual translated documents
  • GlobalSight Suite (open-source Translation Management System)
  • Various Quality Assurance and Terminology Management tools
  • Various Machine translation platforms/tools

Many companies also opt to develop proprietary tools for workflow services, TM storage and terminology management, using TMS SDK and APIs.

TM Backups

TMS: ✔️ MT: ✔️

DAILY BACKUP OF PRODUCTION DATABASE

Your database administrator should have a robust backup plan in place so that if your database has issues or data gets corrupted, there’s an available backup to go to.

  • Production data is backed up to a ‘standby’ database, with write-delay
    (i.e. there is usually an 8–24 hour delay before copying data to the standby database).

    The delay in writing to the backup database is intentional, so that major corruption or deletion of data in the primary database doesn’t equally compromise the backup database, thus allowing you to still recover data.

TM BACKUPS VIA API

In addition to the database backup managed by your DBA, it’s also prudent to have backups of your TM managed by your TMS Administrator. You can achieve this by:

  • Manual backup (TM > Export)
  • Automated backups (via API calls)

TM exports of this nature can be more fine-grained, so you can have local TMX copies of each of your TMs, easily accessible to the team for reference and error recovery, or for use in MT training.

  • User error – TM entries accidentally deleted and need to be restored
  • MT corpus – TMX file is available for MT processing

You can set your own schedule for such backups, but a monthly cadence may suffice.


Example of TM Backup in use:

A user searches for a term in the TM and intends to export the results, but accidentally deletes all search results instead.

Solution:

The TMX backup is searched using the same search parameters as the original user query and the results are imported back into the TMS, thus restoring the TM to its original state. No DBA is required and the fix is achieved within hours.


Summary Table of Recommended Tasks for TMs

| Task | For TMS | For MT |
| --- | --- | --- |
| Create an Admin role | ✔️ | ✔️ |
| Define how TMs are organized | ✔️ | ✔️ |
| Define metadata for TMs | ✔️ | ✔️ |
| Determine the plan or principles for using TMs (grouping / leveraging / updating) | ✔️ | ✔️ |
| Determine the process and criteria for cleaning TMs | ✔️ | ✔️ |
| **General Housekeeping tasks** | | |
| Create complete descriptions for TMs | ✔️ | |
| Use consistent filenames and path normalization to avoid duplicates | ✔️ | |
| Detect and fix technical issues in the content | ✔️ | |
| Remove empty segments (source or target) | ✔️ | ✔️ |
| **Characters** | | |
| Normalize escaped characters/entities | ⚠️ | ✔️ |
| Normalize certain control characters | ⚠️ | ✔️ |
| Normalize whitespaces | ⚠️ | ✔️ |
| Normalize quotes | ⚠️ | |
| **Tags** | | |
| Retain tagged or parameterized variables | ✔️ | ✔️ |
| Normalize untagged variable content | | ✔️ |
| Removing tags that don’t affect the meaning | | ✔️ |
| **Duplicates** | | |
| Identify and remove duplicates with no context for MT training purposes | N/A | ✔️ |
| Identify and remove In-Context Exact (ICE) duplicates for TMS | | N/A |
| **Age/Obsolete** | | |
| Remove segments for being older than a certain age | ⚠️ | ⚠️ |
| **Remove low value entries** | | |
| Check if a segment contains mostly non-text content | | ✔️ |
| Characters that do not match either the expected source or target language | ⚠️ | ✔️ |
| Do not remove segments where source = target | ✔️ | ✔️ |
| Check unbalanced brackets | ✔️ | ✔️ |
| Remove entries consisting of only punctuation, whitespace, or tags | ✔️ | ✔️ |
| Remove segments that are too long | | ✔️ |
| Remove segments that are too short | | ✔️ |
| **Inconsistency** | | |
| Identify segment inconsistencies and fix if appropriate | ✔️ | ✔️ |
| **Terminology** | | |
| Identify terminology inconsistencies | ✔️ | ✔️ |
| **Misalignments** | | |
| Check sentence length ratios | ⚠️ | ✔️ |
| **TM Maintenance** | ✔️ | ✔️ |
| TM Maintenance Tools | ✔️ | ✔️ |
| TM Backups | ✔️ | ✔️ |

This post originally appeared on GitHub.

16 Best Translation & Language Influencers to Follow on Twitter

In the translation industry, there is always more to learn. Keeping up with the latest technological trends and industry standards can seem like an impossible task, especially if you’re just a beginner in the field. That’s why we at Gengo created this list of 16 active translation and language accounts to follow on Twitter to stay up to date with the latest news.

If you don’t already, give us a follow too at @GengoIt.

Translators & Interpreters

  • @integlangsbiz: Dr. Jonathan Downie is a consultant interpreter and French/English conference interpreter.

Dr Jonathan Downie (@integlangsbiz): “The gap between how machine is sold and what it can actually do is due to a fundamental misunderstanding of the task.”
  • @interpretaatioo: Henry Líu is an interpreter, translator, and 13th President of the International Federation of Translators (FIT).
  • @ContractSpeak: Richard Lackey is a legal translator and qualified member of the Institute of Translation and Interpreting (ITI).
  • @Tesstranslates: Tess Whitty is a certified English-Swedish translator and localizer, as well as marketing consultant, author, trainer, speaker, mom, and yogi.
  • @LucyWTranslator: Lucy Williams is an English/Spanish translator and blogger for tourism, leisure, food, and fashion.
  • @UweMuegge: Uwe Muegge tweets about job, internship, and event opportunities in translation, localization, interpreting, and terminology.
  • @mstranslations: Mariana Serio is an English/Spanish business translation expert from Argentina, helping clients expand to the Latin American market.
  • @cdmellinger: Chris Mellinger, Ph.D in translation studies, translator and interpreter, editor, Assistant Professor of Spanish Interpreting and Translation Studies.
  • @ceciliaenback: CEO of Swedish LSP Translator Scandinavia, co-organizer of Nordic Translation Industry Forum.
  • @DanielaZambrini: Daniela Zambrini is a freelance Italian/English translator specializing in airline industry, logistics, legal, nautical, and aerospace and defence topics.

Translation & Language News

  • @slatornews: Slator makes business sense of the language services and technology market with news on the people and deals that shape the industry.

Slator (@slatornews): “Launching the Slator 2018 Blockchain and Translation Report: We’re looking at current language industry-related blockchain projects and examine if there are any potential use cases for blockchain in translation. http://bit.ly/slatorblockchain2018”

The linked 24-page report covers a blockchain and translation industry overview, ICOs, business use cases, solution analysis, crypto chatter and further reading, and a cautionary tale (slator.com).
  • @UnitedLanguage: United Language Group is a leading global language service provider specializing in translation, localization, and interpretation.
  • @multilingualmag: MultiLingual magazine, website, and newsletter are information sources for localization, global business, translation and language technology.
  • @atanet: American Translators Association (ATA) is the largest professional association of translators and interpreters in the United States, with nearly 11,000 members in over 100 countries.
  • @ConcLangVillage: Concordia Language Villages is the premier language and culture immersion program in the United States, offering programs for youth, adults, and families in 15 languages.
  • @LitTranslate: The American Literary Translators Association (ALTA) promotes literary translation through advocacy, education, and services to literary translators.

This post originally appeared on Gengo.