As a member of the AI ecosystem and an important link in the AI supply chain, we at Humans in the Loop recognize our role in ensuring that computer vision solutions are built and used in an ethical way. One of our responsibilities as a supplier of dataset collection and annotation is to support and advise our clients on how to build models that are free of harmful bias as much as possible.
Here are 5 of our best tips that we always try to apply in our work in order to mitigate bias in computer vision through better data!
1. Get your taxonomy right
Taxonomy provides the machine with a hierarchical structure of the data that will be analyzed.
The hierarchy of classes which our model will use is usually settled very early in the process: when the organization is formulating the problem it wants to solve. Getting the taxonomy right is crucial for the accuracy and consistency of the results. Three prevalent problems that we have seen are invisible, overarching, and overlapping notions.
Invisible notions
To use a simple example, a model trained on a dataset with the classes “dog”, “cat” and “mouse” would be blind to other types of animals. The absence of a “rat” class might cause rats to be incorrectly detected as mice when the model is deployed. During labeling, annotators might also be confused and label rats as “mice”, thereby introducing noise in the data.
Overarching notions
In some datasets, these classes might also be grouped into one generic “animal” class. However, even though a model built on this taxonomy supposedly would recognize animals, it might perform much more poorly on cats than on dogs, depending on the composition of the training data in the “animal” class. Since there are many visually different animals grouped under this overarching notion, there will be inevitable biases against some types of animals.
Overlapping notions
Finally, if we have classes for “animal”, “pet”, “dog”, “cat”, and “mouse”, this means that there is an overlap. E.g. a dog may be classified either as a dog, as a pet, or as an animal, and all of these would be right. This might lead to confusion among labelers as to how to label the images, a lack of consistency, and a lot of false positives once a model is trained.
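One way to catch overlapping notions before labeling starts is to write the taxonomy down as a parent map and check that no class offered to annotators is an ancestor of another. The sketch below is only illustrative: the `PARENT` map, the choice of which class sits under which, and the helper names `ancestors` and `find_overlaps` are assumptions made for this example, not part of any particular labeling tool.

```python
# Hypothetical taxonomy: each class maps to its parent (None = root).
PARENT = {
    "animal": None,
    "pet": "animal",
    "dog": "pet",
    "cat": "pet",
    "mouse": "animal",
}


def ancestors(cls, parent=PARENT):
    """Return the chain of ancestors of `cls`, nearest first."""
    chain = []
    node = parent.get(cls)
    while node is not None:
        chain.append(node)
        node = parent.get(node)
    return chain


def find_overlaps(label_set, parent=PARENT):
    """List (ancestor, descendant) pairs that are both offered as labels."""
    labels = set(label_set)
    return sorted(
        (anc, cls)
        for cls in labels
        for anc in ancestors(cls, parent)
        if anc in labels
    )


# Offering all five classes at once produces overlapping pairs.
overlapping = find_overlaps(["animal", "pet", "dog", "cat", "mouse"])
# Offering only the leaf classes produces none.
flat = find_overlaps(["dog", "cat", "mouse"])
print(overlapping)
print(flat)
```

An empty result means every image has a single correct label at the chosen level of the hierarchy, which is what labelers need in order to stay consistent.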
2. Collect data which is representative
For a computer vision model to be able to generalize effectively, its training data must be representative of the real-life setting where it will be deployed. Some of the biggest issues may happen when an AI solution provider sells software which an end user (another company or institution) deploys in a new setting which the model was not trained on.
Geographical and demographic representation is one of the biggest challenges in dataset collection, especially for models which are deployed across different locations and communities.
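A simple way to keep an eye on representation is to compare the share of each group in the collected data with the share you expect in the deployment setting. The following is a minimal sketch under stated assumptions: the region tags, the counts, the expected shares, and the 5% tolerance are all invented for illustration, and `representation_gaps` is a hypothetical helper.

```python
from collections import Counter


def representation_gaps(sample_groups, target_share, tolerance=0.05):
    """Return groups whose dataset share falls more than `tolerance` below the expected share."""
    counts = Counter(sample_groups)
    total = sum(counts.values())
    gaps = {}
    for group, expected in target_share.items():
        actual = counts.get(group, 0) / total if total else 0.0
        if expected - actual > tolerance:
            gaps[group] = round(expected - actual, 3)
    return gaps


# Hypothetical metadata: one region tag per collected image.
collected = ["region_a"] * 650 + ["region_b"] * 300 + ["region_c"] * 50
# Hypothetical share of each region in the deployment setting.
expected = {"region_a": 0.4, "region_b": 0.3, "region_c": 0.3}

under_represented = representation_gaps(collected, expected)
print(under_represented)
```

Groups that show up in the result are candidates for additional data collection before any labeling budget is spent.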
Avoid public databases
Usually, it’s best to collect data directly on the ground and to avoid image repositories and publicly available datasets. These images are usually sourced from the media, stock photography, and other online sources where minorities and women have been historically underrepresented or represented in stereotypical ways. In fact, many of the canonical large-scale datasets which have traditionally been used in computer vision have been taken down or revised in recent years to make way for more balanced versions.
Avoid stereotypical representations
When collecting an image dataset, it is a common mistake to collect only images that depict the object or subject you want in a stereotypical position, angle or environment, making the dataset unrepresentative of all the ways the object/subject may be found in real life. Your dataset must contain objects/subjects depicted in a variety of conditions, such as different lighting, backgrounds and points of view, as well as extreme poses, expressions, occlusions, and closeups.
3. Annotate the most diverse samples
In order to make the most of your data labeling efforts, if you have a large-scale database (e.g. hours and hours of footage of street scenes), you need to find a way to reduce the amount of data to be labeled and to only extract the most representative and diverse samples.
Use an active learning tool
Several tools on the market today offer active learning features which can define dataset subsets and sort data by priority depending on decision boundaries set by the user. This will also help with generalization and bias, since it de-prioritizes or excludes redundant examples and focuses on the most diverse images which will bring the greatest gain to the model’s accuracy.
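To give an idea of what "focusing on the most diverse images" can mean in practice, here is a minimal sketch of one common diversity heuristic, greedy farthest-point selection over image embeddings. This is not a description of how any specific commercial tool works: the function `select_diverse` and the toy 2-D embeddings are assumptions for illustration, and in a real project the embeddings would come from a feature extractor of your choice.

```python
import math


def select_diverse(embeddings, k):
    """Greedy farthest-point selection: pick up to k indices that are spread out."""
    if not embeddings or k <= 0:
        return []
    selected = [0]  # start from the first sample (arbitrary choice)
    # Distance from every sample to its nearest already-selected sample.
    nearest = [math.dist(e, embeddings[0]) for e in embeddings]
    while len(selected) < min(k, len(embeddings)):
        idx = max(range(len(embeddings)), key=nearest.__getitem__)
        if nearest[idx] == 0:
            break  # only exact duplicates of selected samples remain
        selected.append(idx)
        nearest = [
            min(d, math.dist(e, embeddings[idx]))
            for d, e in zip(nearest, embeddings)
        ]
    return selected


# Hypothetical 2-D embeddings: three near-duplicate frames and two distinct ones.
frames = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (0.0, 9.0)]
picked = select_diverse(frames, 3)
print(picked)
```

With these toy values, the two near-duplicates of the first frame are skipped in favor of the two visually distinct frames, which is the behavior we want when the labeling budget is limited.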
Be careful with augmentation
Data augmentation techniques like random cropping, rotation, flipping, blurring, or color changes are a standard method for increasing the size of your dataset. However, they do not necessarily make the dataset more diverse or reduce its inherent biases. Rather, you get the same samples in other variations. In addition, it will be counterproductive to apply an augmentation to a feature which is important to the model (e.g. applying color changes to a model which depends on color to perform the object detection, like a fruit detector).
4. Transmit clear instructions to annotators
This is one of the most obvious pieces of advice you could ever get, but it is crucial to get the communication with your annotators right. Setting clear and precise instructions not only helps them understand the task better, but also makes them confident that they are doing the right thing.
Monitor consensus
Consensus is a widely used approach to ensuring dataset quality. Its premise is that two or more people annotate the same sample, and if they disagree, the image is sent to an additional arbiter who decides which is the right answer. However, AI thought leader Andrew Ng has proposed an alternative use of consensus as a basis of his call for data-centric AI: where annotators disagree, that means that the instructions are not clear enough. In these cases, use annotator disagreement to iteratively improve the instructions until there is no room for doubt.
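Disagreement is easy to measure once several annotators have labeled the same samples. The sketch below computes, for each majority class, the share of samples on which annotators did not all agree. It is a minimal illustration under our own assumptions: the `votes` data, the “other” label, and the helper `disagreement_by_class` are invented for this example.

```python
from collections import Counter, defaultdict


def disagreement_by_class(annotations):
    """Share of samples per majority class on which annotators did not all agree.

    `annotations` maps a sample id to the list of labels given by different annotators.
    """
    total = defaultdict(int)
    disputed = defaultdict(int)
    for labels in annotations.values():
        majority, _ = Counter(labels).most_common(1)[0]
        total[majority] += 1
        if len(set(labels)) > 1:
            disputed[majority] += 1
    return {cls: disputed[cls] / total[cls] for cls in total}


# Hypothetical labels from three annotators per image.
votes = {
    "img_001": ["dog", "dog", "dog"],
    "img_002": ["mouse", "other", "mouse"],
    "img_003": ["mouse", "mouse", "other"],
    "img_004": ["cat", "cat", "cat"],
    "img_005": ["mouse", "mouse", "mouse"],
}
rates = disagreement_by_class(votes)
print(rates)
```

A class with a high disagreement rate, like “mouse” in this toy data, is a signal that the instructions for that class need another iteration.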
Think of all the potential edge cases
If we continue with our “dog”, “cat” and “mouse” detector, we need to think about all possible edge cases and potential examples that annotators might interpret differently, including:
- Truncated, occluded, and cropped instances: how should these be annotated? E.g. should we annotate only the visible portion or try to guess where the hidden parts are?
- Too small or blurry instances: what’s the minimum size which we should annotate? A 10x10px instance may not be very helpful for our model.
- Groups: how should we annotate large groups of animals close to each other? Are we allowed to group them together?
- Depictions: if there is a painting of a dog, a statue of a dog, or a toy dog, should these be annotated or not?
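Once these questions have been answered, the answers can be written down as explicit rules and checked automatically against incoming annotations. The sketch below is one possible way to do that, not a prescribed format: the `Guidelines` fields, their default values, and the annotation dictionaries are hypothetical, and the 10 px threshold simply reuses the example size mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guidelines:
    """Hypothetical annotation rules; every value here is an example choice."""
    min_side_px: int = 10               # skip instances smaller than this
    annotate_depictions: bool = False   # paintings, statues, toys
    allow_group_boxes: bool = False     # one box around several animals


def violations(box, rules=Guidelines()):
    """Return the rules a single annotation breaks. `box` is a plain dict."""
    problems = []
    if min(box["width"], box["height"]) < rules.min_side_px:
        problems.append("too_small")
    if box.get("is_depiction", False) and not rules.annotate_depictions:
        problems.append("depiction_not_allowed")
    if box.get("is_group", False) and not rules.allow_group_boxes:
        problems.append("group_not_allowed")
    return problems


tiny_statue = {"label": "dog", "width": 8, "height": 12, "is_depiction": True}
normal_cat = {"label": "cat", "width": 120, "height": 90}
print(violations(tiny_statue))
print(violations(normal_cat))
```

Keeping the rules in one place also makes it clear to annotators and reviewers which decision was taken for each edge case.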
5. Improve your data iteratively
In line with the data-centric credo, it is much more valuable to focus on improving your data than on improving your models when building AI solutions. In the end, model tweaking can improve the results only marginally, while the biggest gains can be obtained from iteratively improving the data and making sure it’s complete and consistent. Unfortunately, this means that there is a lot more grunt work involved, but there are great tools on the market which allow for easier dataset visualization and error detection.
Go beyond the train/val/test mentality
The most common approach to evaluating a model’s performance is to split the ground truth dataset into “training”, “validation” and “testing” subsets. However, if the entire ground truth dataset is biased and unrepresentative of reality, the evaluation of the model’s performance will give falsely high results. The evaluation is therefore more trustworthy when the model is tested in real-life practical situations rather than only on a static test dataset.
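A quick way to see this effect is to compare accuracy on the static test split with accuracy on a freshly reviewed sample from the deployment setting. The numbers below are invented purely for illustration, and `accuracy` is a hypothetical helper; a large gap between the two figures suggests that the ground truth dataset does not reflect real-life conditions.

```python
def accuracy(pairs):
    """Fraction of (predicted, actual) pairs that match."""
    return sum(p == a for p, a in pairs) / len(pairs) if pairs else 0.0


# Hypothetical (predicted, actual) pairs from the static test split.
static_test = [("dog", "dog")] * 5 + [("cat", "cat")] * 4 + [("mouse", "cat")]
# Hypothetical pairs from a human-reviewed sample collected after deployment.
field_sample = (
    [("dog", "dog")] * 4
    + [("cat", "cat")] * 2
    + [("mouse", "rat")] * 3   # rats in the field are detected as mice
    + [("cat", "dog")]
)

static_acc = accuracy(static_test)
field_acc = accuracy(field_sample)
gap = static_acc - field_acc
print(f"static: {static_acc:.2f}, field: {field_acc:.2f}, gap: {gap:.2f}")
```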
Perform model error analysis
When testing the model on real-life data, it’s important to incorporate a professional human-in-the-loop to regularly validate the proposed labels. Human operators also provide useful insights to data scientists through error analysis by reviewing and classifying the errors. Error classification can be used to distinguish between model errors (localization errors, confusion with semantically similar objects, false positives on background) and ground truth errors (mislabeled data, missing labels, or incorrectly grouped objects). This will help you to see where the model performs poorly, whether it’s exhibiting any biases, and what type of data is needed in order to retrain it.
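The error categories above lend themselves to a simple tally. The sketch below counts reviewed errors by source (model or ground truth) and by class, so that the weakest classes stand out. The category identifiers, the `review_log` entries, and the `summarize` helper are assumptions made for this example rather than a standard schema.

```python
from collections import Counter

# Error types named after the categories described above (identifiers are our own).
MODEL_ERRORS = {"localization", "similar_class_confusion", "background_false_positive"}
GROUND_TRUTH_ERRORS = {"mislabeled", "missing_label", "incorrect_grouping"}


def summarize(reviewed):
    """Count reviewed errors by source and by class."""
    by_source, by_class = Counter(), Counter()
    for cls, kind in reviewed:
        if kind in MODEL_ERRORS:
            by_source["model"] += 1
        elif kind in GROUND_TRUTH_ERRORS:
            by_source["ground_truth"] += 1
        else:
            by_source["unclassified"] += 1
        by_class[cls] += 1
    return by_source, by_class


# Hypothetical review log: (class, error type) for each mistake a reviewer classified.
review_log = [
    ("mouse", "similar_class_confusion"),
    ("mouse", "similar_class_confusion"),
    ("mouse", "mislabeled"),
    ("dog", "localization"),
    ("cat", "missing_label"),
]
by_source, by_class = summarize(review_log)
print(by_source.most_common())
print(by_class.most_common(1))
```

In this toy log, most errors concern the “mouse” class, which would point towards collecting and relabeling more data for that class before retraining.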
If you are interested in learning more about dealing with bias in AI through better dataset collection and annotation, check out our whitepapers. If you are currently working on an AI project, don’t hesitate to get in touch with us. We offer dataset collection, labeling and annotation, as well as model validation services!