In a blog post yesterday, IBM announced that it is releasing two new facial image datasets as part of an effort to establish machine learning training data that is unbiased with respect to “. . .skin tones, genders, and ages . . .”:
“1) One of the biggest issues causing bias in the area of facial analysis is the lack of diverse data to train systems on. So, this fall, we intend to make publicly available the following dataset as a tool for the technology industry and research community:
- A dataset of annotations for over 1 million images to improve the understanding of bias in facial analysis being built by IBM Research scientists. Images will be annotated with attributes, leveraging geo-tags from Flickr images to balance data from multiple countries and active learning tools to reduce sample selection bias. Currently, the largest facial attribute dataset available is 200,000 images so this new dataset with a million images will be a monumental improvement.
- An annotation dataset for up to 36,000 images – equally distributed across skin tones, genders, and ages, annotated by IBM Research, to provide a more diverse dataset for people to use in the evaluation of their technologies. This will specifically help algorithm designers to identify and address bias in their facial analysis systems. The first step in addressing bias is to know there is a bias — and that is what this dataset will enable.”
A better dataset is certainly a step in the right direction, but other issues can also affect the accuracy of facial recognition across skin tones and races. For example, a device’s combination of lens, sensor, lighting, and angle, together with skin tone, determines what the final image submitted for evaluation looks like. This suggests that, if no standard for image clarity is established, some training may need to use images captured by the device itself, which is a more complex and expensive effort.
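To make the idea of "knowing there is a bias" concrete, here is a minimal Python sketch of how a balanced evaluation set might be used: compute accuracy separately for each demographic group and look at the gap between the best and worst group. The group labels, the record layout, and the sample values are all hypothetical and are not taken from IBM's datasets or tooling.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return accuracy per group.

    `records` is an iterable of (group, predicted, actual) tuples.
    This layout is an assumption made for illustration only.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for group, predicted, actual in records:
        totals[group] += 1
        if predicted == actual:
            correct[group] += 1
    return {group: correct[group] / totals[group] for group in totals}


def accuracy_gap(per_group):
    """Difference between the best and worst per-group accuracy."""
    return max(per_group.values()) - min(per_group.values())


# Hypothetical results: four evaluation images per group.
sample = [
    ("lighter", "match", "match"),
    ("lighter", "match", "match"),
    ("lighter", "no_match", "no_match"),
    ("lighter", "match", "match"),
    ("darker", "match", "match"),
    ("darker", "no_match", "match"),
    ("darker", "match", "no_match"),
    ("darker", "no_match", "no_match"),
]

per_group = accuracy_by_group(sample)
gap = accuracy_gap(per_group)
print(per_group, gap)
```

A large gap on a balanced set points to a disparity worth investigating. Repeating the same measurement on images captured by the target device would show whether lens, sensor, or lighting conditions widen it.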
Overview by Tim Sloane, VP, Payments Innovation at Mercator Advisory Group