Genetic Identity Can Be Inferred via Internet Search

Researchers at MIT have discovered that a person who anonymously donates their DNA sequence for research can be identified using only information that is publicly available on the Internet, according to a new paper published in Science. Cross-referencing genome sequencing data posted online with public genealogy data can reveal an anonymous donor’s identity. Previously, participants in DNA sequencing projects had been told that their anonymity could not be guaranteed, but that the risk of identification was minimal. This finding, however, marks the end of genomic anonymity.

In 2005, the Washington Post reported on a teenage boy who tracked down his biological father, a sperm donor, by submitting his own DNA to an online genealogy service. The service searched its sequence databases for men who shared repeat sequences with the boy’s Y chromosome and found weak matches to two men with the same surname. That surname, combined with the father’s place and date of birth, which had been given to the mother, allowed the boy to find his biological father. Researcher Yaniv Erlich and his team heard about this story and immediately recognized the threat it posed to the privacy of genome donors.

Erlich’s team replicated the method, performing online genealogy searches on Y-chromosome short tandem repeats (STRs) retrieved from whole genome sequences. They found that such searches revealed not only strong correlations between a genome and potential surnames, but also other information about patrilineage, including geography and detailed pedigree. Surnames alone, however, are a weak identifier, so the team cross-referenced the results with other data such as age and state of residence, and with other sources such as online obituaries, all of which greatly increased the chances of a correct identification. The entire process relied on publicly available information; no private databases were used. Although the team directly identified only five donors, tracing those donors’ pedigrees revealed the identities of 50 individuals in total. Erlich estimates the probability that these matches arose by chance alone at less than one in a million.
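The two-step logic described above, matching Y-STR repeat counts to candidate surnames and then narrowing by age and state, can be illustrated with a minimal Python sketch. This is not the team's actual pipeline: the records, names, distance measure, and threshold below are all invented for illustration, and only the marker names (DYS19, DYS390, and so on) correspond to real Y-STR loci.

```python
# Hypothetical genealogy entries: a surname plus Y-STR repeat counts per marker.
# All values here are made up for illustration.
GENEALOGY_DB = [
    {"surname": "Smith", "strs": {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS393": 13}},
    {"surname": "Smith", "strs": {"DYS19": 14, "DYS390": 24, "DYS391": 10, "DYS393": 13}},
    {"surname": "Jones", "strs": {"DYS19": 15, "DYS390": 22, "DYS391": 10, "DYS393": 12}},
]

# Hypothetical public records used for the second, narrowing step.
PUBLIC_RECORDS = [
    {"name": "A. Smith", "surname": "Smith", "birth_year": 1950, "state": "UT"},
    {"name": "B. Smith", "surname": "Smith", "birth_year": 1982, "state": "CA"},
    {"name": "C. Jones", "surname": "Jones", "birth_year": 1950, "state": "UT"},
]


def str_distance(a, b):
    """Sum of absolute repeat-count differences over the markers both profiles share."""
    shared = a.keys() & b.keys()
    if not shared:
        return None  # no markers in common, so the profiles cannot be compared
    return sum(abs(a[m] - b[m]) for m in shared)


def candidate_surnames(query, db, max_distance=1):
    """Surnames whose recorded profile is within max_distance of the query profile."""
    surnames = set()
    for record in db:
        distance = str_distance(query, record["strs"])
        if distance is not None and distance <= max_distance:
            surnames.add(record["surname"])
    return surnames


def narrow(surnames, records, birth_year, state):
    """Keep only people whose surname, birth year, and state all match."""
    return [
        r["name"]
        for r in records
        if r["surname"] in surnames
        and r["birth_year"] == birth_year
        and r["state"] == state
    ]


# Y-STR profile extracted from an "anonymous" genome (hypothetical values).
query = {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS393": 13}

surnames = candidate_surnames(query, GENEALOGY_DB)
matches = narrow(surnames, PUBLIC_RECORDS, birth_year=1950, state="UT")
print(matches)  # ['A. Smith']
```

In this toy data the surname alone leaves two candidates, and it is the added birth year and state that reduce the list to one person, which mirrors why the cross-referencing step mattered.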

The privacy concerns raised by this experiment are not restricted to genome donors. As the case that prompted this research shows, private citizens with the right information can run their own genetic searches. The Genetic Information Nondiscrimination Act prohibits employers and health insurance companies in the United States from discriminating on the basis of genetic information, but there is no way to track such public searches. The mere threat of a breach in anonymity may be enough to dissuade people from volunteering for genomic research projects.

The full implications of the freedom of information afforded by the Internet for biometric identification and anonymity have yet to be seen, but this research offers a glimpse of where we may be headed.