A new AI system can automatically decipher a lost language that’s no longer understood — without knowing its relationship to other languages.
Researchers at MIT CSAIL developed the algorithm in response to the rapid disappearance of human languages. Most of the languages that have existed are no longer spoken, and at least half of those remaining are predicted to vanish in the next 100 years.
The new system could help recover them. More importantly, it could preserve our understanding of the cultures and wisdom of their speakers.
The algorithm works by harnessing key principles from historical linguistics, such as the predictable ways in which languages use sound substitutions. The researchers give the example of a word with a “p” in a parent language possibly changing to a “b” in its descendent, but probably not to a “k” due to the difference in pronunciation.
These kinds of patterns are then turned into computational constraints. This allows the model to segment words from an ancient language and map them to a related language.
The algorithm can also identify different language families. For example, their method suggested that Iberian is not related to Basque, supporting recent scholarship.
The project was led by MIT Professor Regina Barzilay, who last month won a $1 million award from the world’s largest AI society for her pioneering work on drug development and breast cancer detection.
She now wants to expand the work to identifying the semantic meaning of words — even if we don’t know how to read them.
“For instance, we may identify all the references to people or locations in the document which can then be further investigated in light of the known historical evidence,” Barzilay said in a statement.
“These methods of ‘entity recognition’ are commonly used in various text processing applications today and are highly accurate, but the key research question is whether the task is feasible without any training data in the ancient language.”
You can read the full research study here.
Published October 22, 2020 — 14:12 UTC