Academics in the US and UK have created a machine-learning tool for predicting when newly registered internet domains will be used to spread false information, in the hope these sites can be blocked or shut down before they pollute online communication channels.
In a recently published working paper, “Real-Time Prediction of Online False Information Purveyors and their Characteristics,” Anil R Doshi (UCL School of Management), Sharat Raghavan (University of California, Berkeley) and William Schmidt (Cornell University) describe how they used domain registration data, in conjunction with Mozilla web browsing data, to construct a machine-learning classifier that can anticipate which websites are likely to spew deceptive content.
“By using domain registration data, we can provide an early warning system using data that is arguably difficult for the actors to manipulate,” said Doshi in a statement. “Actors who produce false information tend to prefer remaining hidden and we use that in our model.”
“False information” is a term the boffins use to describe disinformation, misinformation, and made-up news – bogus content, crafted to look like legit news reporting, that’s produced to serve an agenda rather than the public interest. Such scurrilous fluff became a matter of major concern after the 2016 US election, the subject of what the US Office of the Director of National Intelligence described [PDF] as a Russian influence campaign that combined covert cyber operations “with overt efforts by Russian Government agencies, state-funded media, third-party intermediaries, and paid social media users or ‘trolls’.”
Since then, false information has elicited growing concern and scrutiny from academics, policy makers, technology advocates, internet users, and businesses. However, some companies that benefit from the distribution of engagement-boosting falsehoods, such as Google, Facebook, and Twitter, have been self-servingly slow to implement revenue-reducing countermeasures.
Doshi, Raghavan, and Schmidt have chosen to focus on the role that websites play in facilitating the spread of false information. Websites, they observe in their paper, are quick to set up and cost nothing to abandon. And once sites spreading lies seed the system, the makers of misinformation can rely on the viral nature of networked communication to carry their message across social media networks.
The researchers’ hope is to spot websites designed for mischief early on, before the damage is done.
“Our early-identification system can help policy makers deploy their limited resources more rapidly and effectively by prioritizing domains for potential sanction or increased monitoring,” the paper explains.
To construct their classifier, the eggheads relied on various data points available from public domain registrations, including whether there’s an individual or institutional name in the billing contact field, the domain extension, registrar, registration state, and country, and the inclusion of political terms in the domain name. This sort of analysis can be done by a skilled human, though machine learning brings automation, which is key to rapidly detecting and blocking bad sites before they go viral.
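As a rough illustration of the registration-derived inputs described above, here is a minimal Python sketch that turns one registration record into a set of features. The `Registration` record, its field names, and the `POLITICAL_TERMS` list are assumptions made up for this example; the paper's actual schema, term list, and model are not reproduced here.

```python
from dataclasses import dataclass

# Hypothetical term list for illustration; not the list used in the paper.
POLITICAL_TERMS = ("trump", "clinton", "election", "vote", "politic")


@dataclass
class Registration:
    """Hypothetical, simplified view of a public domain registration."""
    domain: str
    billing_name: str
    registrar: str
    state: str
    country: str


def extract_features(reg: Registration) -> dict:
    """Map a registration record to simple categorical and boolean features."""
    # Split "example.com" into the name ("example") and extension ("com").
    name, _, extension = reg.domain.lower().rpartition(".")
    return {
        # Is any individual or institutional name present in the billing contact?
        "has_billing_name": bool(reg.billing_name.strip()),
        "extension": extension,
        "registrar": reg.registrar.lower(),
        "state": reg.state.upper(),
        "country": reg.country.upper(),
        # Does the domain name contain any of the assumed political terms?
        "has_political_term": any(term in name for term in POLITICAL_TERMS),
    }


example = Registration(
    domain="votepatriotnews.com",  # invented domain, not from the paper
    billing_name="",
    registrar="ExampleRegistrar",
    state="ca",
    country="us",
)
features = extract_features(example)
```

A dictionary like `features` would then be encoded and fed to whatever classifier is chosen; that step is left out because the article does not say which model the researchers used.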
The researchers’ data suggests their machine-learning classifier works reasonably well, correctly identifying 92 per cent of false information domains and 96.2 per cent of legitimate information domains set up for the 2016 US election.
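Those two figures correspond to what are commonly called the true positive rate and the true negative rate. A minimal sketch of how such rates are computed follows; the labels and predictions are toy values invented for illustration, not the researchers' data, and the function name `detection_rates` is likewise hypothetical.

```python
def detection_rates(labels, predictions):
    """Return (true_positive_rate, true_negative_rate).

    labels and predictions are equal-length sequences of booleans,
    where True means "false information domain". Assumes at least one
    True and one False label, otherwise a division by zero occurs.
    """
    pairs = list(zip(labels, predictions))
    true_pos = sum(1 for actual, guess in pairs if actual and guess)
    true_neg = sum(1 for actual, guess in pairs if not actual and not guess)
    positives = sum(1 for actual in labels if actual)
    negatives = len(labels) - positives
    return true_pos / positives, true_neg / negatives


# Toy data: two false-information domains, three legitimate ones.
labels = [True, True, False, False, False]
predictions = [True, False, False, False, True]
tpr, tnr = detection_rates(labels, predictions)
```

In this toy case one legitimate domain is wrongly flagged, which is the kind of false positive that matters if a flag leads to a site being blocked.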
Sean Gallagher, a threat researcher at security biz SophosLabs, told The Register that the researchers’ technique is similar to those used by infosec professionals and cautioned that detecting disinformation is inherently unreliable because those behind it adapt to defenses.
“The machine learning technique described in this paper closely resembles work done in detecting potential phishing domains and scam websites,” he said. “The tactics of disinformation, like those of other web threats, are fluid, and the variation in tactics could make a 100 per cent accurate detection system difficult – especially given that there are other channels for disinformation.”
The academics involved appear to understand as much. They expect their machine-learning classifier will be used in conjunction with other tools, such as text-based classifiers, in the hope that “policy makers can mitigate the possibility of taking action based on possible false positive classifications, which are inherent in any machine learning system.” ®