Humans in the Loop offers various solutions for computer vision needs, such as dataset collection, annotation and model validation. As a provider of dataset collection services, we know how important dataset collection is, not only for creating accurate artificial intelligence and machine learning models, but also for making sure they are ethical and bias-free. Earlier this year we published a whitepaper on avoiding bias in computer vision AI through better data collection, which discusses how to mitigate harmful biases through ethical and inclusive dataset collection, along with our recommendations for achieving this goal. With our dataset collection services, we do our best to ensure that the datasets we collect are unbiased; we can also collect data to augment a dataset you have already completed, to the same end. Datasets are often time-consuming and difficult to obtain: you need to find the right images, with the subject matter in an optimal configuration or position, and you need data that is open source and can be used and shared. To help with this, we have published various free datasets and are even offering 1000 free labeling hours for one AI for Good project!
Datasets of people are among the most requested dataset searches on Google. Datasets of people can often be biased, so it is always important to try to get a representative group in the dataset. However, as we are well aware that finding the right datasets for your computer vision needs is difficult, we decided to share eight great free, open-source datasets involving people for you to use! Though these datasets might not be as diverse as they could be, they are a good starting point for training the first iteration of your model. A number of the datasets suggested here are hosted on Kaggle, a subsidiary of Google and an online community of data scientists and machine learning practitioners that hosts a large number of open-source public datasets and notebooks.
1. Labeled Faces in the Wild
The first dataset we are sharing is Labeled Faces in the Wild. This dataset consists of more than 13,000 images of people’s faces. It was published and is hosted by the University of Massachusetts Amherst. One of its creators, Erik Learned-Miller, a professor at the UMass College of Information and Computer Sciences (CICS), was honored in 2019 with the Mark Everingham Award for service to the Computer Vision Community for this dataset. It is one of the foremost datasets for face verification and pair matching. However, its creators have advised against using this dataset for commercial purposes for a variety of reasons including, but not limited to, underrepresented groups, such as the lower and upper age ranges, and disproportionate representation of genders and ethnicities. While not the most representative of reality, this dataset is a great starting point for a potential ethical and unbiased dataset.
2. CMU Face Images
Tom Mitchell, the E. Fredkin University Professor at Carnegie Mellon University, published an open-source dataset of 20 people with 32 images per person, hosted by Data.World. Each image of a person differs in characteristics such as pose and expression. These black and white images can be used to train and test machine learning models on various poses, expressions, eye coverings such as sunglasses, and image sizes. However, due to the limited number of subjects, this dataset has to be augmented before it can serve as a good starting point and guide in creating a dataset with more balanced demographics.
3. Human faces align crop and segment
Human faces align crop and segment is a dataset on Kaggle published by Arnaud Rougetet, a Computer Imaging Scientist, whilst he was doing his Master’s degree in Informatics at the University of Lille. This dataset is based on the Humans Faces dataset and features more than six thousand human faces that have been aligned, cropped and segmented from various angles to make them more usable. With a wide representation of demographics, this is a very useful dataset of faces.
4. BioID Face
Another very useful dataset, compiled specifically for testing face detection algorithms, is the BioID Face dataset, hosted by the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) in Germany. It can be downloaded from the university’s website. It consists of 1521 gray-level images of 23 people in portable graymap (PGM) format, along with manually set eye positions. Because it was compiled for testing face detection algorithms, it is somewhat limited and does not realistically represent real-world demographics. What is great about this dataset, however, is that it takes into account various real-world conditions, such as a large variety of illumination, backgrounds and face sizes.
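Since PGM is a less common image format, here is a minimal sketch of how a gray-level PGM file could be read in plain Python. It assumes the binary "P5" variant with 8-bit pixels, which we have not verified for every file in the dataset, and the function name parse_pgm is our own illustration rather than part of any BioID tooling.

```python
def parse_pgm(data: bytes):
    """Parse a binary (P5) PGM image with 8-bit pixels.

    Returns (width, height, rows), where rows is a list of lists of ints.
    Assumption: the file uses the P5 variant with maxval <= 255.
    """
    pos = 0
    tokens = []
    # The header has four whitespace-separated tokens:
    # magic number, width, height and maximum gray value.
    while len(tokens) < 4:
        while data[pos:pos + 1].isspace():
            pos += 1
        if data[pos:pos + 1] == b"#":
            # A comment runs to the end of the line; skip it.
            while data[pos:pos + 1] not in (b"\n", b""):
                pos += 1
            continue
        start = pos
        while pos < len(data) and not data[pos:pos + 1].isspace():
            pos += 1
        tokens.append(data[start:pos])
    pos += 1  # exactly one whitespace byte separates header and pixels

    if tokens[0] != b"P5":
        raise ValueError("not a binary (P5) PGM file")
    width, height, maxval = (int(t.decode("ascii")) for t in tokens[1:])
    if maxval > 255:
        raise ValueError("only 8-bit PGM files are handled in this sketch")

    pixels = data[pos:pos + width * height]
    if len(pixels) != width * height:
        raise ValueError("pixel data is shorter than the header claims")
    rows = [list(pixels[r * width:(r + 1) * width]) for r in range(height)]
    return width, height, rows


# A tiny synthetic 3x2 image, used here instead of a real dataset file.
sample = b"P5\n# synthetic example\n3 2\n255\n" + bytes([0, 50, 100, 150, 200, 250])
width, height, rows = parse_pgm(sample)
```

With a downloaded file, the same function could be called on the bytes read from disk in binary mode. Image libraries that support PGM will usually be more convenient for real projects; the sketch is only meant to show how simple the format is.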
5. Facial Points and Information of Faces
Kaggle user Nikheil Malakar, a Data Scientist specializing in building Artificial Intelligence, published a dataset called Facial Points and Information of Faces (Npy files). This dataset consists of around four thousand clean, unlabeled black and white images, along with the same images annotated with facial points such as the nose, the mouth and the eyes. This is an amazing dataset with some annotation! With such a large number of images, this dataset can be used to train and test models on a wide variety of faces, though the demographic representation is somewhat imbalanced.
6. Male and female faces dataset
Ashwin Gupta, an AI researcher and a student at B.M.S. College of Engineering, compiled this Male and female faces dataset from more than five thousand web-scraped images to create a dataset of gender-sorted faces, which he released on Kaggle. He compiled it for a gender-based face generator he was working on, and it can be used to train and test male and female classifier models, for face generator models, and as a starting point for a bigger project. Like the other datasets, this dataset can be used as a starting point or sub-dataset, as stated in the suggested applications on Kaggle, which suggests that it is not an ideal representation of reality and would have to be augmented or balanced.
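Before augmenting or balancing a dataset like this one, it helps to measure how uneven the classes are. The sketch below is one simple way to do that; the label values and the function names class_shares and samples_needed are hypothetical, and the labels are made up for illustration rather than taken from the actual dataset.

```python
from collections import Counter


def class_shares(labels):
    """Return the fraction of the dataset that each class makes up."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def samples_needed(labels):
    """Return how many extra samples each class needs to match the largest class."""
    counts = Counter(labels)
    largest = max(counts.values())
    return {label: largest - count for label, count in counts.items()}


# Made-up labels for illustration only, not real dataset statistics.
labels = ["male"] * 3 + ["female"] * 1
shares = class_shares(labels)
extra = samples_needed(labels)
```

The same check works for any attribute you can label, such as age group or region, and the resulting gaps show where additional data collection would help most.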
7. Silhouettes for Human Posture Recognition
Whilst a lot of datasets of people are focused on faces, Abhishek Kumar and Ebin Deni Raj’s Silhouettes for Human Posture Recognition is a useful dataset when working with models that involve recognizing human posture. It consists of almost five thousand JPEG images of people sitting, lying, bending and standing, and would be very useful in many AI and ML applications, such as tracking movement in the retail sector or even sports training and analysis.
8. Medical mask and accessory detection
Over the last couple of months, datasets involving people and medical masks have often been requested and searched for. As our contribution to the fight against COVID-19, we released a free labelled dataset: Medical mask and accessory detection. Whilst there are a number of other datasets out there, our values of creating ethical and bias-free AI shaped the dataset collection process, and hence the dataset consists of diverse people and examples that, in our opinion, represent reality better than other similar datasets online. We paid close attention to diversity during the dataset collection process, featuring people of all ethnicities, ages and regions. Our dataset consists of more than 6000 images for detecting masks and accessories, split into sections such as varying accessories, faces with a mask, faces without a mask, and faces with an incorrectly worn mask.
Whilst the above datasets are great free resources for your ML and AI projects, sometimes you might need assistance in augmenting a good dataset to create a great one that supports ethical and bias-free AI. Here at Humans in the Loop we offer dataset collection and model validation services to assist you in this way. Get in touch with us and one of our project managers will contact you to see how we can help you meet your computer vision needs. Additionally, we offer dataset annotation and labeling services for a wide range of data types!