Tartle Best Data Marketplace
July 1, 2021

Resurrecting Dead Languages with AI, Machine Learning

BY: TARTLE

Algorithms and Dead Languages

Here is your fun fact for the day – Napoleon actually broke the Rosetta Stone. Go figure. In a way, it’s a great metaphor. The Rosetta Stone has been an incredible tool for translating multiple languages in the centuries since its discovery, proving itself a valuable aid in helping put back the pieces of many languages that tend to get broken and lost over time. The value, though, is not merely in being able to translate ancient languages; it’s in all the history that comes with being able to read ancient texts for the first time. Suddenly a whole new perspective on historical events opens up, and knowledge we could never have had otherwise is unlocked. Putting an ancient language back together doesn’t just open up words, it opens up literal worlds.

Now, the geniuses over at MIT have come up with another tool that we can use to unlock a few more. The Computer Science and Artificial Intelligence Laboratory (CSAIL) has developed a new system that can actually decipher lost languages. Best of all, it doesn’t need extensive knowledge of how a lost language relates to already known languages to crack the code. The program can figure out on its own how different languages relate to one another.

So, how does that wizardry work? One of the chief insights that make CSAIL’s program possible is the recognition of certain patterns. One of these is that languages only develop in certain ways. Spellings can change in some ways but not others, because of how differently certain letters sound. A “p” may drift into a “b” over time, for example, but it is much less likely to turn into a “k”. Based on this and other insights, it was possible to develop an algorithm that can pick out a variety of correlations.
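To make the idea of "some changes are more likely than others" concrete, here is a minimal sketch in Python. It is not CSAIL’s algorithm. It is a plain edit distance in which swapping two similar-sounding letters costs less than swapping two unrelated ones. The sound groups in SOUND_CLASSES and the cost values are made up for illustration only.

```python
# Hypothetical sound groups: letters in the same group are assumed to be
# pronounced similarly, so one can plausibly drift into the other over time.
SOUND_CLASSES = [set("pb"), set("td"), set("kg"), set("fv"), set("sz"), set("aeiou")]


def substitution_cost(a: str, b: str) -> float:
    """Cost of replacing letter a with letter b (illustrative values)."""
    if a == b:
        return 0.0
    for group in SOUND_CLASSES:
        if a in group and b in group:
            return 0.5  # similar sounds: a cheap, plausible change
    return 1.0  # unrelated sounds: an expensive, unlikely change


def weighted_edit_distance(source: str, target: str) -> float:
    """Edit distance where substitutions are priced by sound similarity."""
    # prev[j] holds the cost of turning the processed prefix of source
    # into the first j letters of target.
    prev = [float(j) for j in range(len(target) + 1)]
    for i, s in enumerate(source, start=1):
        curr = [float(i)]
        for j, t in enumerate(target, start=1):
            curr.append(min(
                prev[j] + 1.0,                        # delete s
                curr[j - 1] + 1.0,                    # insert t
                prev[j - 1] + substitution_cost(s, t),  # replace s with t
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(weighted_edit_distance("pater", "bater"))
    print(weighted_edit_distance("pater", "kater"))
```

Under these assumed costs, a word pair that differs only by a “p” and a “b” scores as closer than a pair that differs by a “p” and a “k”, which is the kind of signal a decipherment system can use when guessing which words are related.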

Of course, such a thing has to be tested before it can be trusted. If you don’t test your language detector, you get bad translations. That’s probably how the whole “the Maya said the end of the world would be in 2012” thing started. One intern with a bad translator program took it from, “And then I decided I could stop chiseling the years now. I’m a few centuries ahead,” to “the earth will stop completely rotating in 2012”. Fortunately, the researchers at MIT were a bit brighter than that. They tested their program against several known languages, and it correctly identified the relationships between them and placed them in the proper language families. The team is also looking to supplement its work with historical context to help determine the meaning of completely unfamiliar words. This is similar to what most people do when they come across a word they don’t know: they look at the entire sentence and try to figure out the meaning from the surrounding context.

Led by Professor Regina Barzilay, the CSAIL team has developed an incredibly useful tool to help us understand not just the events of times gone by, but the way people thought back then. By better understanding the languages of the past, we can learn why people did what they did. We could gain valuable insight into cultures long dead to us. That knowledge will in turn help us to better understand our past and how we got to where we are. It gets us more information, information straight from the source, or at least closer to it. If TARTLE likes anything in the world, it’s getting information straight from the source. 

After all, that’s what we preach day in and day out around here: getting our information from the source and minimizing false assumptions and bias when analyzing it. It’s great to see that same spirit at work in one of the world’s premier research centers and to see it being applied to our past.

What’s your data worth?


Feature Image Credit: Envato Elements

For those who are hard of hearing – the episode transcript can be read below:

TRANSCRIPT

Announcer (00:08):

Welcome to TARTLE Cast with your hosts, Alexander McCaig and Jason Rigby, where humanity steps into the future and source data defines the path.

Alexander McCaig (00:20):

Jason, Sprechen Sie Deutsch?

Jason Rigby (00:28):

What the hell, Alex, do you have on the screen up there?

Alexander McCaig (00:31):

That's Rosetta Stone.

Jason Rigby (00:33):

The official Rosetta, not the program, but the Rosetta Stone. The official Rosetta Stone?

Alexander McCaig (00:36):

Yeah. It's the one that Napoleon broke.

Jason Rigby (00:40):

Oh, that guy.

Alexander McCaig (00:41):

That's what you get for-

Jason Rigby (00:41):

The guy did a lot of breaking things.

Alexander McCaig (00:43):

That's what you get for rampaging around. I love, love languages. I think they're so cool, and there's so much history involved in a language, especially the study of words like etymology and where they came from. It can tell you a lot about people, the area, the culture, why they operate a certain way. Why a mountain is called a specific thing, right? Or how something has been carried through history. And MIT has absolutely crushed it with this system called CSAIL. I think that's what it's called. It's a machine learning system that it can actually translate and bring back to life these dead languages.

Jason Rigby (01:22):

Think about that guys.

Alexander McCaig (01:23):

Yeah.

Jason Rigby (01:24):

So MIT's Computer Science and Artificial Intelligence Laboratory, CSAIL, can decipher a lost language.

Alexander McCaig (01:32):

Yeah, and all they need is a couple thousand words. What they found is that there are a lot of vectors of relationships, so directional relationships forward, backwards, up, down, whatever you want to call it, that would affect how this machine translates. It looks at one is the geography of a region. It looks at the history like the cultural history. Then it looks at the etymology of how those words have moved over time from what our understanding was, and also the pronunciation, the consonants, the vowels, and the changing of a P to a B. And so if it looks at these logical changes that they've become, essentially, like Anglicized over time, you can actually continue to drop back in certain languages, ones that had an Aramaic base or ones that may have like a Greek or Latin base and be like, oh, now we understand how the development of these languages actually happened from point A, which was a lost dead language, to what we have now at point B that we're currently using.

Alexander McCaig (02:24):

It's an albeit quite brilliant system, but in doing so, it's not the fact that you're bringing up this dead language back to life, it's all the history and context that comes with it. And so imagine an archeologist goes to look at something and at first the language was dead, there was no way to read it, but we have a system that can now translate it efficiently and they can be like, oh, now I understand what's actually going on here. If I'm looking at cuneiform on a clay tablet, they used to press in arrows or almost look like greater than signs into a tablet. That was language, it's early written language. And so you can bring these things back to life and really understand what was going on. And maybe you could even understand concepts. Concepts with a consciousness and how people operated, they thought in different patterns. Maybe something that we wouldn't recognize that has been lost that was truly special.

Jason Rigby (03:16):

Yeah. It said the algorithm learns to embed language sounds into a multidimensional space where differences in pronunciation are reflected in the distance between corresponding vectors.
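The description read out above, sounds placed in a space where distance reflects difference in pronunciation, can be pictured with a small Python sketch. In the actual system the embedding is learned. Here the vectors in SOUND_VECTORS are written by hand from three made-up features (voicing, place in the mouth, vowel or not), so every number is an assumption for illustration.

```python
import math

# Hypothetical, hand-written vectors: (voicing, place of articulation, vowel).
# A real system would learn these positions from data instead.
SOUND_VECTORS = {
    "p": (0.0, 0.0, 0.0),
    "b": (0.5, 0.0, 0.0),
    "t": (0.0, 1.0, 0.0),
    "d": (0.5, 1.0, 0.0),
    "k": (0.0, 2.0, 0.0),
    "g": (0.5, 2.0, 0.0),
    "a": (0.5, 1.0, 3.0),
}


def sound_distance(a: str, b: str) -> float:
    """Euclidean distance between the vectors of two sounds."""
    return math.dist(SOUND_VECTORS[a], SOUND_VECTORS[b])


def nearest_sound(sound: str) -> str:
    """The other sound whose vector lies closest to the given one."""
    others = (s for s in SOUND_VECTORS if s != sound)
    return min(others, key=lambda s: sound_distance(sound, s))


if __name__ == "__main__":
    print(sound_distance("p", "b"))
    print(sound_distance("p", "k"))
    print(nearest_sound("p"))
```

With these assumed vectors, "p" sits closer to "b" than to "k", and much farther from the vowel "a", which mirrors the claim that a small pronunciation gap shows up as a small distance.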

Alexander McCaig (03:25):

Yeah. So if you look at how the pronunciation, how it's actually being delivered for that word, and you look at the relationship historically in other languages that almost sit in that same sort of language family, it can then begin to decipher it and say, oh, I can see how these transitions logically happened from here to here and the cause and effect of the development of this language, and then from that, that's when you start to resurrect these things.

Jason Rigby (03:46):

Yeah, and it said the proposed algorithm can assess the proximity between two languages; in fact, when tested on known languages, it can even accurately identify language families. The team applied their algorithm to Iberian considering Basque, B-A-S-Q-U-E, as well as less-likely candidates from Romance, Germanic, Turkic, and Uralic families. While Basque and Latin were closer to Iberian than other languages they were still too different to be considered related.

Alexander McCaig (04:13):

Right? So we would have put them in their related category, but this algorithm with better knowingness, because we programmed it to do so, said these are actually more closely related to these two rather than the both of them together. But I really think the real magic about this thing is having all that data input, having people... So consider a data packet on TARTLE, where you're having somebody read a sentence. Imagine if you had all of that raw data input for a natural language processing algorithm? To go back, okay we have all the pronunciations, all the different variants in all the specific subcultures, from all these different geographies, all across the globe, let's pump it into this thing and then figure out how language has migrated over time, much like when you go on your ancestry.com and when you do your mouth swab to see, oh, my DNA's been here, here, and here. You can actually see the movement of that language. And that would frankly be more precise than actually trying to track DNA.

Jason Rigby (05:12):

Yeah. I want to give shout outs to these because this is amazing. MIT Professor Regina Barzilay, and then an MIT PhD student, they wrote the paper on this, Jiaming Luo, they're the ones that developed this aside from the algorithm. Here's what they want to say for their future work. Here's what they're working on. This is so awesome.

Alexander McCaig (05:29):

I love this.

Jason Rigby (05:29):

I'll let you talk about this and we'll close out. In future work, the team hopes to expand their work beyond the act of connecting texts to related words in a known language. An approach referred to as cognate-based decipherment. This paradigm assumes that such a known language exists, but the example of Iberian shows that this is not always the case. The team's new approach would involve identifying semantic meaning of the words, even if they don't know how to read them.

Alexander McCaig (05:53):

Okay, so if you don't know... So just by looking at these relationships, we know what this word is telling us, even though we can't pronounce it. So cognitively, we understand what they thought and what was trying to be projected through this writing.

Jason Rigby (06:07):

Yeah, and I think what they're talking about is they want to have known historical evidence and be able to put that in there.

Alexander McCaig (06:11):

That's what I was saying. Imagine for an-

Jason Rigby (06:12):

Then have it interpret-

Alexander McCaig (06:13):

... archeologists. If you get all the bias of how an archeologist might read hieroglyphics out of the way, now you're getting to something special. And that's why I'm like, it's more... the language is important, but what's more important is understanding how people think, the balance of their thoughts. Maybe some cultures just in that language is a more unifying culture, or something that's inherently more descriptive. Something that does a better idea of actually vocalizing how someone thinks or feels like German or Sanskrit.

Jason Rigby (06:43):

Yeah. And shout out to MIT, bro. It's like every week we're pulling up articles from them-

Alexander McCaig (06:47):

They're crushing it.

Jason Rigby (06:48):

They're crushing it.

Alexander McCaig (06:49):

So much creativity. That's all it is. It's scientific creativity there. They're not bogged down by red tape and government stuff that you find over here at the labs.

Jason Rigby (06:58):

We would love to be able to... I mean, you're from that neck of the woods, we would love to be able to go up there and just hammer out. We would set it up. We could set it up there at MIT and just hammer out podcast after podcast with professors. I would love that.

Alexander McCaig (07:11):

Yeah, what are you doing with data?

Jason Rigby (07:12):

Yeah, exactly.

Alexander McCaig (07:13):

Sit right here at the table. Let's talk.

Jason Rigby (07:14):

Yeah. These guys, love to have them on a podcast.

Alexander McCaig (07:17):

Most definitely.

Jason Rigby (07:18):

We can just sit there for a week and just do a hundred podcasts. That'd be so [crosstalk 00:07:21]-

Alexander McCaig (07:22):

I'd be so amped up to talk about data, language, the correlation of those things.

Jason Rigby (07:26):

Any of it. Artificial intelligence, machine learning. I mean, the list goes on and on.

Alexander McCaig (07:29):

Climate models, all that stuff.

Jason Rigby (07:30):

Yeah. I'm reading another article from MIT. And I'm thinking about... And it's talking about machine learning, AI and feminism.

Alexander McCaig (07:39):

Interesting.

Jason Rigby (07:40):

Yeah. That's a whole concept because you don't think about that. Most tech companies are 90% white. These white privileged guys or whatever, and it's mostly male based. So to take these two women and they're making this perspective about looking into this. That's just so creative to me.

Alexander McCaig (07:55):

But you know what's great-

Jason Rigby (07:57):

It's only going to allow it to have more and more and more information.

Alexander McCaig (07:59):

Data, it just makes things more transparent and open, and then when you start to apply it properly, it brings so much truth to light, and it's what we've needed for so long. It's now we're only just beginning to start to facilitate its use properly, effectively.

Jason Rigby (08:14):

Yes.

Alexander McCaig (08:14):

Evolutively.

Jason Rigby (08:15):

Evolutively. Great word.

Alexander McCaig (08:16):

Yeah.

Announcer (08:24):

Thank you for listening to TARTLE Cast with your hosts, Alexander McCaig and Jason Rigby, where humanity steps into the future and source data defines the path.