Twitter needs manual language selection

Published 30 December 2021 category: thoughts

Lots of Twitterers speak languages that are not English. For people who read tweets that are not in English, it is important that these tweets are marked as such. I feel Twitter needs a feature for this.

It would be nice if, when writing a tweet, we could manually select which language the tweet is in, and if Twitter then used that information to set the appropriate lang attribute on our content:

[Screenshot: tweet creation widget with a “set language” button; the tweet says in Dutch that I would like to share my opinion on using class names for everything]
Sharing a controversial opinion on CSS frameworks in the Dutch language

Twitter is an authoring tool, for which the Authoring Tool Accessibility Guidelines recommend that “accessible content production is possible” (Guideline B.2.1).

The lang attribute

Language attributes identify which language some web content is in. They are usually set on a page level, added to the HTML element:

<html lang="en">

Most developers don’t write these attributes often: the code usually lives somewhere in a template that we don’t touch every day, or ever. But it’s an important attribute. Setting it correctly gets your page to pass one whole WCAG criterion (3.1.1 Language of page).

In some cases, we have to set language attributes on individual elements, too, for example when some of our content is not in the page’s main language. On the website I built for the British-Taiwanese band Transition, we combine content in Mandarin with content in English on one page:

[Screenshot: website with English text and Chinese text alongside it, and links to YouTube videos]
The Transition “Music” page

We picked en as the main language and set it on the <html> element. This meant we had to mark all Chinese content as zh, in this case zh-TW as it is specifically Mandarin as spoken in Taiwan. Of course, we could have written this the other way around, too. Usually we want to pick the language that’s most common on the page as the page’s language.
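As a minimal sketch of that setup: the page below declares en on the html element and marks one paragraph as zh-TW. The title and both sentences are placeholder text made up for illustration, not the actual content or markup of the Transition site.

```html
<!DOCTYPE html>
<!-- The page's main language is English -->
<html lang="en">
  <head>
    <meta charset="utf-8">
    <title>Music</title>
  </head>
  <body>
    <h1>Music</h1>
    <!-- Inherits lang="en" from the html element -->
    <p>Our new single is out now.</p>
    <!-- Placeholder Chinese text, marked as Chinese as used in Taiwan -->
    <p lang="zh-TW">我們的新單曲現已推出。</p>
  </body>
</html>
```

Elements without their own lang attribute inherit the language of their nearest ancestor that has one, so only the exceptions need marking.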

Setting a lang attribute on parts of a page is its own WCAG criterion, too (3.1.2 Language of parts), by the way.

The user need

Setting the language is important for end users, like:

The author need

There is also an author need, both for people who write content and for web developers.

Content editors

People who write content may get browser-provided spellcheckers. They will work better if they know what the content’s language is. I think Twitter.com has somehow turned browser spellcheck off, but there may be Twitter clients or indeed other authoring tools where this is relevant.

Web developers

Language attributes are important for web developers, too, as they allow them to use the :lang() pseudo-class in CSS more effectively.

Some CSS will behave differently based on languages. When you use hyphens: auto, the browser needs to look up words in a dictionary to apply hyphenation correctly. It has to know the language for this.
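A small sketch of both points, assuming a page with one English and one Dutch paragraph (the sentences and the choice of quotation marks are illustrative). Whether hyphens: auto actually hyphenates depends on the browser shipping a hyphenation dictionary for that language.

```html
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <title>Language-aware CSS</title>
    <style>
      /* The browser picks hyphenation rules based on each element's language */
      p { hyphens: auto; }

      /* :lang() matches on the element's language, including inherited values */
      :lang(en) { quotes: "“" "”" "‘" "’"; }
      :lang(nl) { quotes: "„" "”" "‚" "’"; }
    </style>
  </head>
  <body>
    <!-- Language inherited from the html element: en -->
    <p>An <q>English</q> paragraph, hyphenated with English rules.</p>
    <!-- Language overridden for this paragraph: nl -->
    <p lang="nl">Een <q>Nederlandse</q> alinea, afgebroken volgens Nederlandse regels.</p>
  </body>
</html>
```

Without the lang="nl" attribute, the second paragraph would be treated as English for both hyphenation and quotation marks.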

With appropriate language attributes, you can also use CSS features like writing modes and typographic properties more effectively. See Hui Jing Chen’s deep dive into CSS for internationalisation for more details.

Automating and lang-maybe

Identifying languages can be automated. In fact, Twitter does this. When they recognise a tweet’s language, they add the relevant lang attribute proactively. See for instance the European Commission chair’s multilingual tweets:

[Screenshot: three tweets by Ursula von der Leyen, in French, German and English, with dev tools open and each tweet pointing to the lang attribute in the markup]
Twitter’s auto-added lang attributes in action

Yay! I think this is very cool (thanks ThainBBdl for pointing this out). The advances in natural language processing are really impressive.

Having said that, any automated system makes mistakes. Vadim Makeev shared:

Yes, sometimes they take my Russian tweets and render them as Bulgarian. It’s not just the lang, they also use some Cyrillic font variation that makes them harder to read.

It is safe to assume such mistakes will skew towards minority languages and miss subtleties that matter a lot to individual people, especially in areas where language is political.

On the one hand, I think it makes sense to deploy automated language identification. As there are a lot of users, Twitter can safely assume not everyone would set a language for all of their tweets. People might not know or care (insert sad face here), and a fallback helps with that. On the other hand, if this tech exists, might it make more sense for browsers to deploy it, rather than individual websites? Why not have the browser guess the content’s language, for every website and not just Twitter?

If browsers did this, Twitter’s lang attributes might get in the way. They kind of give the impression that this information is author-provided. This makes me wonder, should there be a way for Twitter to say their declaration is a guess? lang-maybe?

Manual selection

Automated language detection probably works best if it complements manual selection. It could help provide a default choice or suggestion for manual selection, and work as a fallback. So, I’m still going to make the case for a method for users to specify a language manually.
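To make the idea concrete, here is a rough sketch of what such a picker could look like in markup. Everything in it is hypothetical: the form action, field names, ids and the list of languages are invented for illustration and are not Twitter's. It assumes automated detection guessed Dutch and preselected it, leaving the author free to override the guess.

```html
<!-- Hypothetical compose form; none of these names are Twitter's -->
<form action="/compose" method="post">
  <label for="tweet-text">Tweet</label>
  <!-- The chosen language could also be set here, to help spellcheckers -->
  <textarea id="tweet-text" name="text" lang="nl" spellcheck="true"></textarea>

  <label for="tweet-lang">Language</label>
  <select id="tweet-lang" name="lang">
    <!-- Preselected from the automated guess; the author can change it -->
    <option value="nl" lang="nl" selected>Nederlands</option>
    <option value="fy" lang="fy">Frysk</option>
    <option value="en" lang="en">English</option>
  </select>

  <button type="submit">Tweet</button>
</form>

<!-- The published tweet would then carry the author's choice -->
<p lang="nl">Placeholder tweet text</p>
```

The point of the sketch is only that the value ending up in the lang attribute would come from the author, with detection supplying the default.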

A per-tweet manual language picker would be great as it can:

Summing up

For non-English tweets to meet WCAG, they need to have their language declared with a lang attribute. Twitter currently guesses languages, which is a great step in the right direction, but is likely of little help to speakers of minority languages. A manual selector would be a great way to complement the automation.

Comments, likes & shares (27)

Bernard Nijenhuis replied:
He quickly knocks out one more blog post! Nice work! And good point, by the way. PS: "people who use want right click a word in our content to look it up in a dictionary", "I’m sitll going to" and "likely of litlte help"...
Hidde replied: thanks, turns out once again that the publish button is my spellchecker 😅
Hidde replied: Twitter nicely set this Dutch tweet of yours to lang=en, hehe
Mu-An Chiou replied: thanks for the write up. i’ve been punting on this for one of my own website. do you have a preference between one tweet having multiple translations or multiple tweets in different languages?
Bernard Nijenhuis replied: Even with the spelling mistakes in it, haha
Stefan Judis replied:
Huh! TIL Twitter auto-detects a Tweet's language and adds the proper `lang` attribute. @hdv shares pros, cons and more details on his blog. 👇 🙇‍♂️ hiddedevries.nl/en/blog/2021-1…
nic replied: it has just occured to me that i speak a language that doesnt even have a language code, and that the language code only reflects the way it is written (like charse) but not the way it is pronounced, which is obviously vital for intelligibility. 🤯
Hidde replied: interesting! One I speak (en.m.wikipedia.org/wiki/Frisian_l…) does have a code but also lots of pronounciation differences to the point that people would misunderstand each other… maybe lucky there is no text to speech for it yet
myf replied:
Coincidently Twitter served me this super viral tweet right next to yours. (Summary: some [persian] words transliterated to [latin] may resemble offensive [english] words.) I don't think that even lang="fa-Latn" markup would prevent that ignorance, though.
Gunnar Bittersmann replied:
And sometimes not the proper `lang` attribute, like "bg" instead of "ru" which causes wrong font rendering. 1. twitter.com/g16n/status/11… 2.
Adrian Roselli 🗯 replied:
IME, automatically detecting and fixing language issues (just like accessibility issues) does not work. It is a human issue, not a tech issue. So I am on board with Hidde’s suggestion that Twitter allow authors to choose the language of their tweets. hiddedevries.nl/en/blog/2021-1…