Speech2Face Turns Small Audio Clips Into Surprisingly Accurate Human Portraits

A team of researchers from MIT has published a paper detailing an algorithm that can reconstruct a human face from a short voice sample. Dubbed Speech2Face, the algorithm was trained on millions of educational YouTube clips featuring more than 100,000 different speakers.

The researchers were quick to emphasize that the images were not accurate representations of any individual, noting that the algorithm is only effective because it uses a large sample size and draws on traits that are shared by enough people to establish a pattern. The resulting images are essentially composites, with the algorithm limited to determining relatively broad categories like age, race, and gender.

However, the images were surprisingly accurate within those parameters. That makes the results quite interesting, even if the algorithm is currently more of a parlor trick than a legitimate biometric tool.

The concept could be refined further, although there are ethical concerns that any future study would need to address. As the researchers themselves pointed out, the YouTube videos used to train the algorithm offer a non-representative sample of the global population, meaning the algorithm will display an inherent bias depending on the language spoken or other criteria.

Of course, this is not the first time someone has walked the line between aesthetics and identity while dealing with biometric tech. NEC showcased an iris scan art exhibit at this year’s SXSW, while Samsung previously experimented with a VR interface built around facial recognition.

Source: Gizmodo

June 11, 2019 – by Eric Weiss