“The cool thing that we were doing in our lab at CMU was building this knowledge-base of everything. Of all the facts in the world.”
We are sitting in a small office with just enough space for six desks and monitors. The workplace is spotless. The whiteboard is covered with grisly splotches of math, written in colored ink – just a glimpse into the work that goes on here. As soon as I meet her, I realize Derry Wijaya is just the type of person to take on lofty endeavors – to build that “knowledge-base of everything”.
After establishing her love for research during her time as an undergraduate, she continued her work at CMU. This eventually led to her PhD thesis: the “knowledge-base of every verb.” More specifically, she was working on a machine that would automatically read texts and define relationships between verbs and other parts of the text. “Who is the CEO of what, who is the father of who, where is this company located, who is married to who, so on and so forth,” she explains.
This machine is called NELL – the Never Ending Language Learner. It reads billions of web pages and extracts information to later classify and organize.
Say NELL was given the word “cookie” to classify. It would use the surrounding words of the sentence to define a pattern and classify it as a food. Take the sentence “John eats a shortbread cookie.” The verb “eats” would help classify “cookie” as a food, and the adjective “shortbread” would indicate what type of cookie it is. NELL would learn these patterns and categorize words accordingly.
“The cool thing about NELL is that it keeps using the information that it has read to help you extract more information.”
Back to the cookie example: what happens when NELL reads that someone throws away their Internet cookies? NELL combines what it already knows about “Internet,” what it knows about “cookie,” and the relationship between the two words in the sentence to understand that an Internet cookie is a file, not a food. Just like that, the machine slowly learns how to interpret the intricacies of a language that humans have been building on for centuries. But this is no ordinary machine learning algorithm.
“In machine learning, you give it data, you run it, you get an output for training. Then you give it a test and you run the same model that you already trained and you get an output and that’s it.” NELL, however, is continually learning over time. It feeds the classifications it has already made back to the model so it never stops learning.
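The feedback loop described here can be sketched as a toy bootstrapping routine. This is only an illustration under simplifying assumptions, not NELL's actual code: the function name `bootstrap`, the subject-verb-noun triples, and the single “food” category are all hypothetical, and a real system would weigh evidence far more carefully before trusting a new label.

```python
# Toy bootstrapping loop: a hypothetical illustration, not NELL's actual code.
def bootstrap(sentences, seed_foods, rounds=3):
    """Alternate between learning verb patterns and labelling new nouns."""
    foods = set(seed_foods)
    food_verbs = set()
    for _ in range(rounds):
        # Step 1: any verb seen with a known food becomes a "food pattern".
        for _subject, verb, noun in sentences:
            if noun in foods:
                food_verbs.add(verb)
        # Step 2: any noun seen with a food pattern is labelled a food,
        # and that new label feeds the next round.
        for _subject, verb, noun in sentences:
            if verb in food_verbs:
                foods.add(noun)
    return foods, food_verbs


# Hypothetical (subject, verb, noun) triples standing in for parsed sentences.
SENTENCES = [
    ("John", "eats", "cookie"),
    ("Mary", "eats", "apple"),
    ("Mary", "bakes", "apple"),
    ("Sam", "bakes", "bread"),
    ("Ann", "deletes", "file"),
]

foods, food_verbs = bootstrap(SENTENCES, seed_foods={"cookie"})
print(sorted(foods))
print(sorted(food_verbs))
```

Starting from the single seed “cookie,” the first round learns “eats” and labels “apple”; the second round uses “apple” to learn “bakes” and then labels “bread.” The word “file” is never labelled a food because no food pattern ever appears with it.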
“In a way, it’s becoming more like humans in terms of learning.”
While she first made NELL to read English texts, Derry is onto even bigger things now: enabling machines to read texts in any language. Derry is one of Professor Chris Callison-Burch’s post-doc researchers working in a subset of the field of Natural Language Processing. According to Chris Callison-Burch (CCB), Derry is working on what he calls the Paraphrase Database. In English and many other languages, there are phrases that have the same meaning as individual words. For example, “thrown in jail” has the same meaning as “imprisoned” or “incarcerated.” The Paraphrase Database would be an automatically built log of these phrases and words with similar meanings.
They plan to identify these paraphrases through two steps of translation. First, they would translate the word “incarcerated” into another language, say Spanish. Oftentimes, once a word or phrase is translated, it takes on a different literal meaning in that language. Translating back to English captures that different literal meaning. So the word “incarcerated” may literally mean “thrown in jail” in some other language, and translating it back would make the paraphrase apparent.
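The round trip through another language can be sketched in a few lines. This is a minimal illustration, assuming a tiny hand-written English-to-Spanish phrase table; the table contents and the names `EN_TO_ES`, `invert`, and `paraphrases` are hypothetical, and a real system would learn its tables from large amounts of bilingual text and score each candidate.

```python
# Hypothetical toy phrase table; real systems learn these from bilingual text.
EN_TO_ES = {
    "incarcerated": ["encarcelado"],
    "imprisoned": ["encarcelado", "preso"],
    "thrown in jail": ["encarcelado"],
    "released": ["liberado"],
}


def invert(table):
    """Build the reverse (Spanish-to-English) table from the forward one."""
    back = {}
    for source, targets in table.items():
        for target in targets:
            back.setdefault(target, []).append(source)
    return back


def paraphrases(phrase, forward):
    """Translate out to the pivot language and back, collecting what returns."""
    backward = invert(forward)
    found = set()
    for pivot in forward.get(phrase, []):
        found.update(backward[pivot])
    found.discard(phrase)  # a phrase is not its own paraphrase
    return found


print(sorted(paraphrases("incarcerated", EN_TO_ES)))
```

Because “incarcerated,” “imprisoned,” and “thrown in jail” all share a Spanish translation in this toy table, translating back from that shared word surfaces them as paraphrases of one another, while “released” has no partners.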
The Paraphrase Database is just one component in the realm of Natural Language Processing that CCB works in. He has one PhD student working on clustering synonyms across languages. He has another student working on simplifying text for different levels of reading.
Perhaps the most interesting project they are working on is a universal translation machine. Currently, the languages that can be machine translated are those with text that already exists in human translation in at least two languages. Current machines learn the patterns that human translators establish. This new translator would theoretically not need any prior human translation in order to determine a pattern for machine translation. This process is called zero-shot learning.
The machine starts by taking a language, say English, and mapping the words into a vector space. In this space, words like “house” and “home” are close to each other based on criteria that determine they are synonyms. Then, in an alien language, the words for “house” and “home” will also be close to each other in that language’s own, isolated vector space. The crux of the algorithm would then be to take these two spaces and “learn a mapping” between them. The machine can then, in principle, translate any language to any other language with no external data.
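To make “learning a mapping” concrete, here is a deliberately tiny sketch. It rests on strong assumptions that are not from the lab's work: words live in two dimensions, the made-up alien space is built as an exact rotated copy of the English one, and the mapping is found by brute-force search over whole-degree rotations. Every word vector and alien word below is invented for illustration; real methods are far more involved.

```python
import math


def rotate(vec, degrees):
    """Rotate a 2-D vector counter-clockwise by the given angle."""
    angle = math.radians(degrees)
    x, y = vec
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))


def nearest(vec, space):
    """Return the word in `space` whose vector lies closest to `vec`."""
    return min(space, key=lambda word: math.dist(vec, space[word]))


def learn_rotation(source, target):
    """Find the rotation that best lines up the two spaces, using no dictionary."""
    def cost(degrees):
        # Total distance from each rotated source word to its closest target word.
        return sum(
            min(math.dist(rotate(v, degrees), t) for t in target.values())
            for v in source.values()
        )
    return min(range(360), key=cost)


def translate(word, source, target, degrees):
    return nearest(rotate(source[word], degrees), target)


# Invented 2-D vectors: "house" and "home" sit close together.
ENGLISH = {
    "house": (1.0, 0.1),
    "home": (0.9, 0.2),
    "dog": (0.1, 1.0),
    "river": (-0.7, -0.4),
}
# Assumption: the alien space is the English one rotated by 40 degrees.
ALIEN_NAMES = {"house": "zub", "home": "zuba", "dog": "kril", "river": "moa"}
ALIEN = {ALIEN_NAMES[w]: rotate(v, 40) for w, v in ENGLISH.items()}

angle = learn_rotation(ENGLISH, ALIEN)
print(angle)
print(translate("dog", ENGLISH, ALIEN, angle))
```

The search never sees which alien word corresponds to which English word; it only looks for the rotation under which the two clouds of points overlap. Once that rotation is found, translating a word amounts to rotating its vector and picking the nearest neighbor in the other space.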
Derry’s life’s work so far has been dedicated to pushing the boundaries of the field. I asked her if she would continue doing so. “There’s so much more work to do,” she says. “Natural language processing, machine translation, everything.” From NELL to the universal translator, she’ll be able to contribute even more to the field of NLP.