Police detectives track criminals using fingerprints.
UAB computer scientist Thamar Solorio, Ph.D., wants to do the same with words. Her research team is bringing artificial intelligence technology to bear on the field of stylometry, which aims to figure out who wrote a piece of text by analyzing word choice and other idiosyncrasies.
“Our goal is to see if we can generate a ‘writeprint’ to identify a document with its author,” Solorio says. The UAB group is developing algorithms that can sift through tiny snippets of style from Twitter updates, Facebook posts, and chat transcripts to discover common elements. Several other research teams are working on automated “authorship attribution,” Solorio notes, but her lab is one of the first to tackle social media.
Solorio’s work, funded by the National Science Foundation and the United States Office of Naval Research, among others, could help identify the authors of terrorist plots from conversations in Internet chatrooms. The same algorithms could also be used to combat cyberbullying among schoolkids and provide valuable information in many other applications, Solorio says. She and fellow UAB researcher Ragib Hasan, Ph.D., are now investigating ways to use authorship attribution techniques to combat a major problem facing Wikipedia—namely, the altering, or defacing, of pages on controversial topics by partisans supporting different sides.
The Clue’s in the Comma
Solorio’s research group, the Computational Representation and Analysis of Language (CoRAL) lab, specializes in natural language processing. This branch of artificial intelligence drives everything from Google’s sorting of search queries to the speech-recognition software used by your bank.
Whether you’re aiming to teach a computer to recognize customers’ voices or a cyberbully’s threats, “you’re trying to design a program that can generalize beyond the examples that you give it so that it can make accurate predictions about new data,” Solorio explains.
The trick is to generate useful predictions when you have only a small amount of text to study, such as the dozen or so words in a typical Twitter post. To succeed, “you need to move beyond word choice and frequency,” Solorio says. “You need to look at syntax, what kinds of word classes are being used, and the length of the sentences, for example. On the Web, you can look at emoticon use and capitalization, too.” Punctuation marks can also hold clues, Solorio says: “There are definite patterns in how people use semicolons, for instance.”
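To make the idea concrete, here is a minimal sketch of how a few of the surface cues Solorio mentions (sentence length, punctuation, capitalization, and emoticon use) might be counted for a single short message. It is an illustration only, not the CoRAL lab's method: the function name `style_features`, the emoticon pattern, and the choice of measurements are all assumptions made for this example, and word classes are left out because they would require a part-of-speech tagger.

```python
import re

# Assumed pattern for illustration: matches a few common emoticons such as :) ;-) =D
EMOTICON_RE = re.compile(r"[:;=][-']?[)(DPp]")


def style_features(text):
    """Return a dict of simple style measurements for one short text."""
    # Count emoticons first, then remove them so ":)" is not mistaken for punctuation.
    emoticons = EMOTICON_RE.findall(text)
    stripped = EMOTICON_RE.sub(" ", text)

    words = re.findall(r"[A-Za-z']+", stripped)
    # Treat runs of ., ! or ? as sentence boundaries (a rough approximation).
    sentences = [s for s in re.split(r"[.!?]+", stripped) if s.strip()]
    letters = [c for c in stripped if c.isalpha()]

    return {
        "emoticons": len(emoticons),
        "semicolons": stripped.count(";"),
        "commas": stripped.count(","),
        "words_per_sentence": len(words) / len(sentences) if sentences else 0.0,
        "uppercase_ratio": (
            sum(c.isupper() for c in letters) / len(letters) if letters else 0.0
        ),
    }


if __name__ == "__main__":
    print(style_features("I agree; it works, mostly. SO good :)"))
```

Measurements like these, gathered over many messages by the same person, could serve as inputs to a classifier. A single tweet offers very few of them, which is why such short texts are hard to attribute.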
The researchers also wonder if stylistic clues from a Twitter message or Facebook post carry over to other forms of writing, and vice versa. “If you’ve built up a profile of someone based on an essay, would that let you identify that person in a Twitter conversation?” Solorio says.
“Then you can move on to the extreme case,” she says. “What if you only have written samples, and then someone hands you a transcript of a conversation? Can you still identify common links between written and spoken communication?” Those are difficult questions, she notes, and privacy is another difficult issue. “Any technology can be adapted to bad ends, but we can’t stop research because of a fear that it might land into the wrong hands,” Solorio says. “We are developing this program because we are ultimately trying to document how humans process language. We don’t yet know the answers. But we’re trying to find out.”
Twists of the Tongue
The CoRAL lab is collaborating on several projects with the UAB Center for Information Assurance and Joint Forensics Research, but crimefighting is not its only interest. Solorio, who originally hails from Mexico, is also intrigued by the linguistic mysteries surrounding language learning. “That’s something I’ve always been interested in, perhaps because I am bilingual myself,” she says.
Funded by a grant from the National Science Foundation, Solorio is working on algorithms to detect a bilingual person’s rapid shifts between languages. “We have very good translation technology today, but it is designed to focus on one language at a time,” she says, “and multilingual speakers don’t stick to one language.”
Multilingual conversations flow back and forth “in a natural, instantaneous way,” Solorio explains. But scientists already know these transitions follow a set of general rules. Finding the signs that suggest a switch in language is ahead—and teaching them to a computer—could improve speech recognition systems and automated translation services. It also could answer a question that has baffled linguists for decades: How does the brain juggle the rules and vocabulary of multiple languages simultaneously?
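As a rough illustration of what detecting a language switch means in practice, the sketch below labels each word of a mixed Spanish-English sentence and reports where the language changes. It is a toy under stated assumptions, not the researchers' algorithm: the two word lists, the function names `tag_tokens` and `switch_points`, and the dictionary-lookup approach are all hypothetical, and a real system would learn from large amounts of annotated speech or text.

```python
# Hypothetical, tiny word lists for illustration only.
SPANISH = {"pero", "que", "yo", "muy", "es", "la", "el", "porque"}
ENGLISH = {"but", "that", "i", "not", "very", "is", "the", "because", "you", "know"}


def tag_tokens(sentence):
    """Label each word as 'es', 'en', or 'unk' (unknown) by list lookup."""
    tags = []
    for token in sentence.lower().split():
        word = token.strip(".,;:!?¿¡\"'")
        if word in SPANISH:
            lang = "es"
        elif word in ENGLISH:
            lang = "en"
        else:
            lang = "unk"
        tags.append((word, lang))
    return tags


def switch_points(tags):
    """Return token indices where the language differs from the last known one."""
    switches = []
    previous = None
    for i, (_, lang) in enumerate(tags):
        if lang == "unk":
            continue  # unknown words neither start nor end a switch
        if previous is not None and lang != previous:
            switches.append(i)
        previous = lang
    return switches


if __name__ == "__main__":
    tags = tag_tokens("I know pero es muy late")
    print(tags)
    print(switch_points(tags))
```

A lookup like this only notices a switch after it has happened. The research question described above is harder: finding cues that signal a switch before it occurs.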
The researchers are focusing on several high-interest language pairs, including Spanish-English, Mandarin-English, Arabic-English, and combinations of Arabic dialects.
“The U.S. government is very interested in these language pairs, but this is also a fascinating research question in itself,” Solorio says.
By: Matt Windsor