Meta has announced a new, open-source dataset for AI training that the company is hoping will reduce the kind of demographic bias that has been documented by researchers at the National Institute of Standards and Technology (NIST) and elsewhere.
The “Casual Conversations v2” dataset comprises over 26,000 video monologues depicting individuals from a number of countries: Brazil, India, Indonesia, Mexico, Vietnam, the Philippines, and the United States. In the videos, the participants describe certain of their own demographic attributes – things like race, gender, and age – which can help AI systems to properly tag and interpret demographic data.
Importantly, in recruiting the 5,567 paid participants who recorded the videos, Meta asked them to explicitly consent to having their data collected and used for AI training, which should help the company to steer clear of lawsuits under some of the disparate biometric privacy regulations around the world.
Illinois’s Biometric Information Privacy Act (BIPA), for example, is notorious for its wide scope and harsh penalties, and led to legal trouble for Amazon and Microsoft over their use of IBM’s “Diversity in Faces Dataset”, which they leveraged in an effort to reduce the demographic bias of their own facial recognition systems. While the Diversity in Faces Dataset comprised images collected through the public photo-sharing platform Flickr, the companies nevertheless faced BIPA lawsuits thanks in part to their failure to obtain consent for the use of the photos’ biometric data.
Clearly establishing a consensual basis for the collection of subjects’ faces in the Casual Conversations v2 dataset is therefore an important step toward making third-party organizations feel safe using the dataset for AI training. That, in turn, could help to reduce or eliminate the demographic bias that has tarnished the reputation of facial recognition technology in recent years.
The dataset’s inclusion of vocal samples could also help to alleviate the related but much less discussed issue of demographic bias in voice recognition. While the real-world outcomes of this algorithmic bias are likely less impactful than those of bias in facial recognition, which tends to be used in consequential law enforcement and government security applications, demographic disparities in voice recognition could potentially affect huge numbers of consumers interacting with digital devices’ voice interfaces.
The Casual Conversations v2 dataset is available through the Meta AI website.
March 14, 2023 – by Alex Perala