Ethical Data Analysis

As we continue to wrestle with the ethical concerns of data science and technology, we probably don’t have to fear being enslaved by an evil Skynet, Matrix machines, or Isaac Asimov’s supercomputers. The problem with data ethics isn’t so much that machines lack a soul, but that as humans, we cloud data ethics with our own biases, intolerances, and discriminatory practices. 

The evils of technology may just be the ones that humanity unwittingly creates. 

The influence of data science is growing in every sphere of life, and this makes ethical data analysis more important than ever. So, it was refreshing to hear from an industry expert who has great hope for the future, and for people who will use data analysis and data science for the greater good. 

Recently, Coder Academy hosted a talk with Cornell Tech research assistant, Dan Adler. After some lamenting that Dan was no longer a teacher at Coder Academy, having been “stolen by Cornell University”, Dan and a handful of Coder Academy students launched into a fascinating Q&A session on data ethics. What follows is a brief glimpse into some of the highlights.

 

How Does Dan Bring Data Ethics into His Daily Work?

Dan Adler has been working at the intersection of data science and healthcare for much of his career. From his role as a healthcare advanced analytics associate at PricewaterhouseCoopers (PwC) to his current role as a research assistant at Cornell Tech, Dan has had to pay careful attention to the data ethics of much of his daily work and research. 

With such an extensive background in balancing the different sectors of data science and healthcare, Dan was able to talk about the ethical concerns that dominate this space and the ways in which data can be used for good. 

“We weren’t talking about ethics and data analytics fifteen years ago,” Dan says. “But it’s becoming a requirement in almost anything we do, and that’s because the community is maturing. We’re maturing. It’s now an integral part of the analysis process, and the learning process as you immerse yourself in this field.”

Dan points out that an academic couldn’t expect to conduct research today without facing questions about data ethics. He also says that it’s important to consider the implications of your work and that you should never collect data (especially sensitive data) just for the sake of it.

“Do it for a purpose,” he says. “And include other people in the conversation when you decide to tackle a problem.” 

 

These Algorithms Could Be Good… or Bad

Let’s look at some of the potential ethical pitfalls, and the ways in which anyone working in data science can consider the ethical implications of their work. 

 

Uphold Data Ethics by Learning to Address Problems of Bias 

The data you collect has big implications for what you are going to do, even before you get to the modelling stage. It’s important to collect data objectively, but as Dan points out, even when you do collect data objectively, you can still encounter issues. 

“Society might not have been great at the time period where you collected that data set,” he says.  

Because every data set exists in the context of the historical period in which it was collected, it is important to look at the data set through a framework of data ethics and to consider whether it really is as objective as we believe it to be. 

Dan used the example of a report from Amazon on a prediction algorithm that they were using for resume pre-screening. The algorithm had been trained on resumes and past hires. However, because men are overrepresented in the tech sector in the United States, the algorithm was biased towards male candidates.

This is just one example of a historical issue in our society coming out in a data prediction that’s going to impact the future. 
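One simple way to look for this kind of skew is to compare how often a model selects candidates from each group. The sketch below is illustrative only: the counts are made up, the function names are hypothetical, and it is not a description of how Amazon's system was audited.

```python
from collections import defaultdict


def selection_rates(records):
    """Return the share of positive outcomes per group.

    `records` is an iterable of (group, selected) pairs, where `selected`
    is True if the model shortlisted the candidate.
    """
    totals = defaultdict(int)
    positives = defaultdict(int)
    for group, selected in records:
        totals[group] += 1
        positives[group] += int(selected)
    return {group: positives[group] / totals[group] for group in totals}


def selection_rate_ratio(rates):
    """Ratio of the lowest selection rate to the highest (1.0 means parity)."""
    return min(rates.values()) / max(rates.values())


# Made-up screening outcomes, purely for illustration.
screened = (
    [("men", True)] * 60 + [("men", False)] * 40
    + [("women", True)] * 30 + [("women", False)] * 70
)

rates = selection_rates(screened)
ratio = selection_rate_ratio(rates)
print(rates)  # men are shortlisted at 0.6, women at 0.3
print(ratio)  # 0.5: women are shortlisted half as often as men
```

A ratio well below 1.0 does not prove that a model is unfair, but it is a prompt to look closely at the training data and at the history behind it.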

One way to counter this data ethics problem? Have diverse people analyse the data and problems too. 

“A problem that you might not think of, another team member might actually think of,” Dan suggests. “Have a good group of people that don’t look and act and are the same age as you.”

 

Understanding Data Ethics and Some Common Pitfalls 

Early on, people doing the initial work in prediction and medicine drew some quite racially sensitive conclusions in their research. Dan talked about these “naïve” conclusions that could be ethically wrong. 

The Amazon resume screening algorithm was supposed to be trained on an objective data set and to produce objective conclusions. In the same way, people don’t necessarily set out with bad intentions, but given the nature of society and our own biases, it is easy to jump to the wrong conclusion and to end up with problematic modelling as a result. 

Dan gave the example of reports of physicians who dismissed the pain of African American women more readily than that of Caucasian women, because of preconceived bias that had come from the conclusions of past research.

Drawing conclusions about demographic variables is particularly difficult. Research may appear to show a correlation between, for example, a certain ethnicity and poorer health outcomes, while ignoring the context underpinning those variables. If certain groups have historically had poorer access to healthcare, then this is more likely to be the cause than the demographic they belong to in and of itself. Problematic conclusions that ignore context then have the potential to reinforce assumptions and biases in the future. 
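A small worked example can make the point about context concrete. In the sketch below, all counts are invented and the field names (`group`, `access`, `poor`) are hypothetical. Looking at the demographic group alone suggests a large gap in outcomes; splitting the same records by access to healthcare shows the gap disappearing.

```python
from collections import defaultdict


def poor_outcome_rate(rows, keys):
    """Rate of poor outcomes, grouped by the given field names."""
    totals = defaultdict(int)
    poor = defaultdict(int)
    for row in rows:
        key = tuple(row[name] for name in keys)
        totals[key] += row["n"]
        poor[key] += row["poor"]
    return {key: poor[key] / totals[key] for key in totals}


# Made-up aggregated counts: n patients, of whom `poor` had a poor outcome.
rows = [
    {"group": "A", "access": "good", "n": 80, "poor": 8},
    {"group": "A", "access": "limited", "n": 20, "poor": 8},
    {"group": "B", "access": "good", "n": 20, "poor": 2},
    {"group": "B", "access": "limited", "n": 80, "poor": 32},
]

# By group alone, B looks far worse than A (0.34 against 0.16).
by_group = poor_outcome_rate(rows, ["group"])

# Within each access level, the two groups have identical rates.
by_group_and_access = poor_outcome_rate(rows, ["group", "access"])

print(by_group)
print(by_group_and_access)
```

In this invented data set the difference between the groups is explained entirely by access to care. Real data is rarely this clean, but checking plausible contextual variables before attributing an outcome to a demographic is the same idea.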

Getting data ethics right is important in any field, but it is especially vital in healthcare, where treatments could be advised or withheld, or a misdiagnosis made, based on the supposed wisdom of past research. 

While it may be important to encompass different demographics in the data you collect, Dan stressed that there is a difference between putting demographic variables into a model, and showing how that demographic impacts whatever data you’re showing. This is where the problematic conclusions can so easily creep in.  

As well as the implications of healthcare professionals being influenced by research outcomes, there is the possibility of patients, or even study participants, being impacted. 

Dan used the example of his own research into stress resilience and stress sensitivity in resident physicians. Dan and his team had to consider what might happen if they told a person that they were going to be stress resilient or stress sensitive. Giving a prediction back to an individual can create a harmful feedback loop: telling someone that a certain outcome is more likely may itself affect them in some way. 

This is why considerations surrounding data ethics are so important for anyone working in data analysis and data science. 

“Rigorously question the ethical implications… and how it might affect different individuals,” Dan says. 

 

Consider Data Ethics When Collecting and Using Data  

There can be a delicate balance between collecting useful data and protecting the privacy of individuals.

GPS data is considered a sensitive data type, but it can be very useful in the field of healthcare. One example Dan used was comparing an individual’s location variability with their normal routine, and using this as a predictor of depression. There is great potential for this information to be used in a helpful way, but it could also be used to identify an individual from the dataset. 

“Data is often less anonymised than you think it is,” Dan says. 
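One rough way to test how anonymous a data set really is: count how many records share each combination of seemingly harmless attributes. The sketch below uses invented records and hypothetical field names. It is a minimal illustration of the idea behind k-anonymity, not a complete privacy audit.

```python
from collections import Counter


def group_sizes(records, quasi_identifiers):
    """Count how many records share each combination of attribute values."""
    return Counter(
        tuple(record[name] for name in quasi_identifiers) for record in records
    )


def smallest_group_size(records, quasi_identifiers):
    """Size of the smallest group (the k in k-anonymity)."""
    return min(group_sizes(records, quasi_identifiers).values())


def unique_records(records, quasi_identifiers):
    """Records whose attribute combination appears exactly once."""
    sizes = group_sizes(records, quasi_identifiers)
    return [
        record for record in records
        if sizes[tuple(record[name] for name in quasi_identifiers)] == 1
    ]


# Made-up records with names already removed.
records = [
    {"postcode": "2000", "birth_year": 1985, "sex": "F"},
    {"postcode": "2000", "birth_year": 1985, "sex": "F"},
    {"postcode": "2000", "birth_year": 1990, "sex": "M"},
    {"postcode": "3000", "birth_year": 1985, "sex": "F"},
    {"postcode": "3000", "birth_year": 1972, "sex": "M"},
]

# Postcode alone: every record shares its value with at least one other.
print(smallest_group_size(records, ["postcode"]))

# All three attributes together: several records become one of a kind.
fields = ["postcode", "birth_year", "sex"]
print(smallest_group_size(records, fields))
print(len(unique_records(records, fields)))
```

Any record that is unique on attributes an outsider could look up elsewhere is a candidate for re-identification, even though no name is stored.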

There is work being undertaken by Google in the area of differential privacy and federated learning, which looks at different ways to handle data. One idea within this approach would be to only use sensitive data whilst it remains on the user’s device. In theory, a data scientist could then carry out their modelling and predictions without ever collecting the data in a centralised database. 
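The on-device idea can be sketched with a toy version of federated averaging. This is a simplified illustration under stated assumptions, not Google's implementation: the model is a single weight fitted to made-up (x, y) pairs, the function names are hypothetical, and no differential privacy noise is added.

```python
def local_update(weight, data, learning_rate=0.1, steps=20):
    """Run a few gradient-descent steps for y ~ weight * x on one device.

    The raw (x, y) pairs never leave this function; only the updated
    weight is returned to the caller.
    """
    for _ in range(steps):
        gradient = sum(2 * (weight * x - y) * x for x, y in data) / len(data)
        weight -= learning_rate * gradient
    return weight


def federated_round(global_weight, devices):
    """One round: each device trains locally, the server averages the weights.

    The average is weighted by how many samples each device holds.
    """
    total_samples = sum(len(data) for data in devices)
    return sum(
        local_update(global_weight, data) * len(data) for data in devices
    ) / total_samples


# Made-up data held on two separate devices, following y = 2x.
devices = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(1.0, 2.0), (3.0, 6.0), (2.0, 4.0)],
]

weight = 0.0
for _ in range(5):
    weight = federated_round(weight, devices)

print(round(weight, 3))  # close to 2.0, learned without pooling the raw data
```

In this sketch the server only ever sees model weights. Model updates can still reveal information about the underlying data, which is one reason federated learning is often discussed alongside differential privacy.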

Image: Screenshot of Federated Learning by the federated learning team at Google AI. Story by Lucy Bellwood and Scott McCloud. This comic is licensed under the Creative Commons Attribution Non-commercial No Derivative Works 3.0 license.

 

Dan suggests that data scientists should always question their need for sensitive information. 

“I’m always hesitant about how much data we give companies,” he says. 

While the data ethics of academics will be governed by an institutional review board (IRB), most private companies don’t have these independent ethics committees reviewing and guiding their use of data. As Dan and Coder Academy both pointed out, it’s often not until a company is hacked or taken to court that you hear about its use (or misuse) of sensitive data.   

 

Despite the potential pitfalls and conflicts, Dan is optimistic about the potential for good people to use data science in a positive way. He describes the field as a great place to be and to grow in your career.  

At the end of the day, data analytics and data science provide people with a toolkit. They can be used to perpetuate bad things in the world, but with good people considering the data ethics, and rigorously thinking through a problem with the help of a diverse team, the power of data science can be harnessed to approach any problem in a positive way.

 
