Dev Consultant Ashley Shorter examines the dangers of bias and the importance of ethics in Machine Learning.
In our digital era, efficiency is expected. We can instantly find the fastest route to a destination, make purchases with our voice, and get recommendations based on our purchase history. These examples of machine learning have impacted our lives positively, saving customers time and companies money.
Machine learning is the scientific study of algorithms and statistical models that enable systems to learn and improve from experience without being explicitly programmed. With so much success integrating machine learning into our everyday lives, the obvious next step is to integrate it into even more systems. Unfortunately, the data collected to train machine learning models is often riddled with bias.
Every time a dataset includes human decisions, there is bias. Bias is a learned stereotype or prejudice in favor of or against one thing or another, and it can be conscious or unconscious. Datasets are especially vulnerable to bias during data cleaning, when data is manipulated to improve overall effectiveness and accuracy. Both missing data and outliers must be handled during data cleaning; if handled incorrectly, either can introduce bias into the model and the machine.
Missing Data
There are three main types of missing data: missing completely at random (MCAR), missing at random (MAR), and missing not at random, also called non-ignorable (MNAR/NI). In MCAR, the cause of the missing data is independent of every value and variable in the dataset (e.g., a broken test tube). This is uncommon but ideal, because analysis of the remaining data stays unbiased. In MAR, the cause of the missing data is related to some of the observed values of other variables. Bias can be introduced by misjudging which mechanism produced the missing data and, therefore, how to handle it. For example, a dataset of patient information includes age and blood pressure. A data scientist notices multiple blood pressure observations missing from patient data. During data cleaning, she concludes that the missing blood pressure observations are MCAR and deletes all the patient information for anyone who is missing a blood pressure reading. What she doesn't know is that doctors are less likely to take the blood pressure of younger patients, so the missing data is MAR, not MCAR. The data scientist has now unconsciously biased her dataset, and eventually her model, toward older patients. A similar problem can happen with MNAR data, where the cause of the missingness is related to the missing values themselves (e.g., people with very high or very low incomes are less likely to report their income). It is important to correctly identify and handle missing data to prevent bias.
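To make the distinction concrete, here is a minimal sketch in Python (the column names, age threshold, and missingness rates are hypothetical, chosen to mirror the blood-pressure scenario above). It simulates data where measurements go missing more often for younger patients (MAR), then shows how deleting incomplete rows under a mistaken MCAR assumption skews the remaining sample:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1_000

# Hypothetical patient data: age and systolic blood pressure.
ages = rng.integers(18, 90, size=n)
blood_pressure = 90 + 0.5 * ages + rng.normal(0, 10, size=n)
df = pd.DataFrame({"age": ages, "blood_pressure": blood_pressure})

# MAR mechanism: doctors measure blood pressure less often for
# younger patients, so missingness depends on the *observed* age.
p_missing = np.where(df["age"] < 40, 0.6, 0.05)
df.loc[rng.random(n) < p_missing, "blood_pressure"] = np.nan

# Treating the data as MCAR and dropping incomplete rows (listwise
# deletion) biases the sample toward older patients.
complete = df.dropna(subset=["blood_pressure"])
print(f"Mean age, full sample:     {df['age'].mean():.1f}")
print(f"Mean age, after deletion:  {complete['age'].mean():.1f}")

# A quick diagnostic: does missingness correlate with another variable?
# If it does, the data is not MCAR and deletion is unsafe.
missing = df["blood_pressure"].isna()
print(f"Mean age where BP missing: {df.loc[missing, 'age'].mean():.1f}")
print(f"Mean age where BP present: {df.loc[~missing, 'age'].mean():.1f}")
```

In a real dataset the mechanism cannot be read off the data directly, but the last diagnostic, comparing observed variables between rows with and without missing values, is one practical way to catch an MCAR assumption that does not hold.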
Outliers
Sometimes there are observations that lie significantly far from the other observations in a dataset. These are called outliers, and they are caused either by errors in the data or by natural variability. While outliers can reveal valuable information about a dataset, they can also skew a model and lead to suboptimal results. It is up to the data scientist to determine whether outliers need to be removed, replaced, or kept in the dataset. If natural variability is mistakenly treated as an error, the data scientist will remove the observation and valuable information will be lost. For example, in the early stages of the Flint Water Crisis in Flint, Michigan, water samples were taken from select homes in the community and tested for lead. The results revealed that a majority of homes had safe levels of lead in their water supply, but a couple of homes came back with dangerously high levels of lead. The cause of these outliers was deemed human error, and they were consequently removed from the dataset. The revised dataset was then sent back to the appropriate authorities, and no action was taken to improve the water quality. In this case, the outliers were not dealt with appropriately and, as a result, introduced bias into the dataset, putting people's health at risk.
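A safer workflow is to flag outliers for human review rather than silently drop them. The sketch below (in Python, with hypothetical lead readings in parts per billion; Tukey's 1.5 × IQR fence is one common choice, not the only one) shows the difference the decision makes:

```python
import pandas as pd

# Hypothetical lead readings (parts per billion) from sampled homes;
# most are low, but a few homes show very high levels.
samples = pd.Series(
    [2.1, 3.4, 1.8, 2.9, 4.0, 2.5, 3.1, 2.2, 104.0, 3.6, 2.8, 158.0]
)

# Tukey's fences: flag anything beyond 1.5 * IQR from the quartiles.
q1, q3 = samples.quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (samples < q1 - 1.5 * iqr) | (samples > q3 + 1.5 * iqr)

# Flag for review instead of deleting. Whether an extreme value is an
# error or real (e.g., a genuinely contaminated home) is a judgment
# call that the statistics alone cannot make.
for idx, value in samples[is_outlier].items():
    print(f"Sample {idx}: {value} ppb flagged for review")

# Silently dropping the flagged rows changes the conclusion:
print(f"Mean with outliers:    {samples.mean():.1f} ppb")
print(f"Mean without outliers: {samples[~is_outlier].mean():.1f} ppb")
```

In the Flint example, treating the high readings as errors amounted to exactly the silent drop in the last step; keeping flagged observations visible keeps a human decision, informed by domain knowledge, in the loop.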
Bias can have dangerous consequences. Writing our unconscious bias into machine learning models can make a machine, whose task is efficiency, just as flawed as we are. And if the machine is as flawed as we are, how can we fix a problem we cannot see? While machine learning is a powerful tool that brings value to many industries and problems, it is critically important to be aware of the inherent bias humans bring to the table. This is also a key reason that ethical principles must be considered in the future of AI. To learn more about what Microsoft is doing in this space, visit Microsoft AI Principles.