Episode 18 – Train Transparently

Welcome to the Data Science Ethics Podcast. My name is Lexy and I’m your host. This podcast is free and independent thanks to member contributions. You can help by signing up to support us at datascienceethics.com. For just $5 per month, you’ll get access to the members-only podcast, Data Science Ethics in Pop Culture. At the $10 per month level, you will also be able to attend live chats and debates with Marie and me. Plus, you’ll be helping us to
deliver more and better content. Now on with the show. Hello everybody and welcome to
the data science ethics podcast. This is Marie Weber and I am joined by
Lexy Kassan and today we are going to be talking about the concept of making sure
that you train transparently whenever you’re setting up your data science models
and doing your data science process. Lexy, when you’re working on putting together
a model – and we’ve talked a little bit about the data science process – one of
the things I know you’ve talked about before is thinking about where you
get your data. What data you use. So how does that relate to the
concept of train transparently? There are two aspects to that. I would think one is how you frame the
problem and therefore what you gather for the data. The other is whose data are
you using, and do they know? One is, generally speaking as data scientists, making sure that the companies that we’re advising know what we’re doing. And then, secondly, having some means of identifying to the people who are affected by an algorithm that there is an algorithm in play and
that their data is being used in those algorithms. One of the concepts that I kind of
think about when it comes to training transparently is really
around documentation, and it’s like a dirty word in the industry because, as you’re trying to iterate, innovate, and be agile and fast to get to the great algorithm that’s going to solve your problem, documentation just seems like a time suck. It takes forever. It’s a lot of effort. It’s not
the fun part of data science, but it’s absolutely crucial partly because
it gives you something that you can look back on so that you don’t have to
try and remember everything you’ve done, but also so that it can be the foundation
of a peer review, where either another data scientist or kind of a community of data scientists could look at what you’ve done, and potentially even at the data, and determine if what you’ve done is a reasonable process.
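To make that idea of documentation concrete, here is a minimal sketch in Python (our illustration, not code from the show; every field name, file name, and value is a hypothetical example) of a small "model card" saved alongside a trained model so a reviewer has something concrete to look back on:

    # Documentation sketch: a minimal "model card" saved alongside the model,
    # giving a peer reviewer something concrete to look back on.
    # Every field and value here is a hypothetical example.
    import json
    from datetime import date

    model_card = {
        "model_name": "churn_classifier",            # hypothetical project
        "author": "data science team",
        "date_trained": str(date.today()),
        "training_data": "customers_2018_q3.csv",    # hypothetical source
        "target": "churned_within_90_days",
        "features_used": ["tenure_months", "monthly_spend", "support_tickets"],
        "algorithm": "logistic regression",
        "evaluation": {"metric": "AUC on 20% holdout", "value": "fill in from the run"},
        "known_limitations": "only customers with 6+ months of history",
        "intended_use": "prioritize retention outreach; not for pricing decisions",
    }

    with open("model_card.json", "w") as f:
        json.dump(model_card, f, indent=2)

Even a record this small gives a peer reviewer a starting point for questioning the data, the features, and the intended use.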
And I think that documentation is also helpful from the standpoint of doing not just good work yourself, but also making sure that anybody else
that gets onboarded onto the project can understand what was done. Or say that you’re working with a client, you have more of an agency-client relationship, and they end up moving to another agency in the future. They could still have that documentation to be able to explain to another vendor how this model was set up, what it’s doing, and how it’s operating, so that, depending on what the agreement is and how they can use that algorithm in the future, it’s something that they can continue to get business value out of. And also I think about it from
the perspective of peer review. In most traditional sciences, if you submit a study to a journal
for review or for publication, it is sent out, the data is made available, the methodology is made available, and it is scrutinized by others in the field to ensure that what you’ve done is reasonable and that the conclusions you’ve reached are reasonable based on what you had. And when you talk about the idea of things being peer reviewed in science, a lot of the time the gold standard there is: can somebody replicate the same results that you got in your study in their study? So from a data science perspective, you could apply that same rigor: if somebody was able to use the same data set and basically start with the same set of inputs, would they get the same type of outputs?
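As a rough sketch of what that replication check can look like in code (again our illustration, not anything from the episode; the data here is synthetic and the seed and settings are hypothetical), pinning the randomness is what makes "same inputs in, same outputs out" verifiable:

    # Reproducibility sketch: with the data, the code, and the seed pinned,
    # a second run should reproduce the same model and the same outputs.
    # The data set here is synthetic; settings are hypothetical.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    SEED = 42

    def train(seed: int) -> LogisticRegression:
        # Generate (or load) the same data and fit with fixed randomness.
        X, y = make_classification(n_samples=1000, n_features=5, random_state=seed)
        model = LogisticRegression(max_iter=1000, random_state=seed)
        model.fit(X, y)
        return model

    # Two independent runs with identical inputs should yield identical coefficients.
    run_one = train(SEED)
    run_two = train(SEED)
    assert np.allclose(run_one.coef_, run_two.coef_), "runs did not replicate"
    print("replicated coefficients:", run_one.coef_.round(3))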
With data science, often what you share is not just the methodology. There’s a lot of sharing of code. That’s how transfer learning starts to take place. Usually there’s some sort of API that you can use, but you can also share just, here’s the code I used. At that point, you can have someone scrutinize all
the steps that you’ve gone through, everything you did to that dataset at
least as far as the starting point of that data set. Now there’s a whole other
set of processing that may
have gone on prior to that data set. In data science, we talk about the fact that 70 or 80
percent of your time is spent preparing data, cleaning it up, processing that
data, bringing data sources together. And then only 20 percent of your
time is in modeling the data, creating the algorithms and
actually interpreting the results. That 80 percent needs to be documented too, and that needs to be transparent as to what you included and what you did not include. If there were sources that were potentially available, or that you looked for and couldn’t find, what were those that you wanted to bring in or thought about bringing in but excluded for whatever reason? Those are just as valid and have to be transparent as well.
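One possible way to keep that 80 percent transparent (a sketch of ours with hypothetical column names and rules, not a prescription from the episode) is to have the preparation code log each step, including what was deliberately excluded and why:

    # Sketch of self-documenting data preparation: each step records what it did,
    # so the 80 percent of the work that happens before modeling stays transparent.
    # Column names, values, and rules are hypothetical examples.
    import json
    import pandas as pd

    prep_log = []

    def log_step(description: str, df: pd.DataFrame) -> None:
        prep_log.append({"step": description, "rows": len(df), "columns": list(df.columns)})

    # Hypothetical raw extract standing in for a real source.
    df = pd.DataFrame({
        "customer_id": [1, 2, 2, 3, 4],
        "monthly_spend": [50.0, None, 40.0, 75.0, 20.0],
        "zip_code": ["02139", "10001", "10001", "60601", "94103"],
    })
    log_step("loaded raw extract (hypothetical source)", df)

    df = df.drop_duplicates(subset="customer_id")
    log_step("dropped duplicate customer_id rows", df)

    df["monthly_spend"] = df["monthly_spend"].fillna(df["monthly_spend"].median())
    log_step("imputed missing monthly_spend with median", df)

    df = df.drop(columns=["zip_code"])
    log_step("excluded zip_code (possible proxy for protected attributes)", df)

    print(json.dumps(prep_log, indent=2))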
And that type of documentation could also help people understand the limitations that you were up against and why you made those decisions, or it could help people in the peer review process point out potential biases that maybe
you didn’t recognize as you were going through the process. Yeah, that’s very valid and there’s a tremendous
opportunity for everyone to bring in their own biases. So if you say, I
thought about bringing in this data set, but I opted not to because I didn’t think
it was either credible or pertinent or what have you.
Someone else may say, no, I think that that is pertinent and
you really ought to consider it. Those are absolutely things that can
happen depending on the context of what you’re building. You may or may not really bring this
to kind of a public community of data scientists, but at least have someone peer review what you’ve done, even if it’s within your own organization, even if it’s the business stakeholder, the person who’s asked you for the analysis that you’ve done or the algorithm that you’ve developed. Ensuring that they understand what has
gone on with that data is a crucial step because they’re the ones that are going
to use that information in whatever way. They’re going to use your algorithm
in some way, shape or form. They need to know what happened
in it so that they can say, yes, I can use this, or no, I
can’t use this. Let’s revisit. Or, in that type of situation, it might be the legal team or the compliance team being able to review it and understand what steps were taken. We’ve talked a little bit about some
industries that are more heavily regulated like health care or finance, where compliance teams are a
very real and very strict group, where you have to document absolutely everything and all of your decisions have to be justified and in compliance with whatever set of regulations you are under. The idea that there is a specter of a compliance officer is often helpful, I will say: to think that someone’s going to look at this, someone’s going to review this, so I’d better be able to justify every
single decision along the way. The other part of it that I think is
really important to point out: we talked a little bit about the biases that could have been built in and how people can point those out, but I think the other part is in the usage of the model. It’s one thing when you say, well, the data you used was or wasn’t skewed in some way, or you did or didn’t include a certain source or what have you, or you’ve over- or under-represented
segments of the population. It’s another to say you
represented them properly. However, the impacts based on the intended usage of this
model are somehow unfair. They’re going to abnormally
penalize one group over another. They’re going to have some sort of
disproportionate effect on one segment of population. The model itself
may not be doing that, but the usage of that model might. So we’ll get into some more examples
of these in future episodes. But this could be where, as an example, you’ve used some piece of information
that seemed on its face to be reasonable, seemed like it’s a fair way
of estimating something, but for one reason or another it wound
up disproportionately affecting one group or another based on how
that model then got used. So we’ll talk through some more
of those in upcoming episodes. Absolutely. And when you think
about training transparently, how does that relate back to one of
the steps we talked about in the data science process in terms of the
care and feeding of your model? In the care and feeding process – let’s
say you did this algorithm development a year ago and you’re trying to revisit
it, or you came in new, as you mentioned, to a new group and you’re now in charge
of retraining a model or reevaluating the performance of an algorithm.
How do you know how it came to be? How do you know the constraints under
which the creator of that algorithm was operating? Like you mentioned, there’s a need for that type of
information to be available so that it’s transferable or it’s just
remembered. We forget things. We go through a lot of projects in a year. Even if you’re just one person in a larger
organization and even if that’s your entire role, chances are you’re not going to
remember exactly how you came to that conclusion if you went through multiple iterations, which is very common. We talked about this as part
of the data science process. Maybe you tested
something, it didn’t work, but now you don’t even
remember that you tested it. It becomes a crucial part of revisiting
your algorithms over time to ensure that you’re not repeating yourself. You can still justify the decisions that
were made and now you can move forward enhancing a model in
ways that are meaningful. There can be things
that change over a year. There could be new compliance
rules that come out. There could be new types of data
sets that you get access to. There could be just a different scale in
terms of what the business is taking on, or they might have new
lines of business that
they’re taking on that affect your model. So all those things are considerations as you think about not just the model that you’re building today, but how that model could exist in the future. That documentation can help you say, okay, this is how the model was built, which helps you move forward with that model so much more easily than trying to reverse engineer what happened, so you can figure out if you
can use it moving forward. And I think that brings up another point
in terms of train transparently where you want to be able to describe how a
model is coming to the conclusions that it is, and sometimes with the documentation
that’s pretty straightforward. But with some of the more
advanced techniques that
you can use in data science, sometimes that’s a little bit less clear. And so being able to explain what your
model is doing and why it’s doing it can be really important in this
transparency conversation. And you need to be able to do that as
best as possible depending on what methods you’re using. Yeah. We talked in a prior episode about AIs
that are trying to explain themselves. The fact that there are deep learning
techniques out there which human beings often can’t fully explain. We can sort of see the outcomes and we
can see differences amongst the outcomes and so forth, but we don’t necessarily know all the
ways that it reached the conclusion it reached. There are some models that are trying
to explain their methodology along the way. Those types of considerations though
often lead to choosing simpler modeling techniques – choosing simpler, more explainable algorithms over
more complex ones. In compliance, they want to be able to understand
exactly what went in and exactly what came out and what happened in between. They
want to know all the steps. However, as we get further into deep learning
and people are trying to use these more advanced, more esoteric concepts in their
modeling to get more accurate results, there’s a tradeoff. That tradeoff
has to be evaluated very carefully. Even that tradeoff has to be documented. You have to be transparent about the
fact that we are using a more complex, more sort of black box algorithm because
the benefits of doing so outweigh the risks. That has to be a part of the conversation
when you’re looking at what algorithms you’re going to use. If I’m trying to build an algorithm that’s looking for a more complex outcome, I most likely need to use a more complex algorithm.
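As a small illustration of that tradeoff (our example, with hypothetical feature names, not one discussed in the episode), a simple linear model exposes exactly how it weighs each input, which is the kind of readout a compliance reviewer can follow, while a deep learning model would need additional explanation tooling:

    # Explainability sketch: a linear model's coefficients can be read directly,
    # which makes the "what went in, what came out, and what happened in between"
    # conversation much easier. Feature names here are hypothetical.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    feature_names = ["tenure_months", "monthly_spend", "support_tickets"]
    X, y = make_classification(
        n_samples=500, n_features=3, n_informative=3, n_redundant=0, random_state=0
    )

    model = LogisticRegression(max_iter=1000).fit(X, y)

    # Each coefficient is a documented, reviewable statement about the model:
    # positive pushes toward the positive class, negative pushes away.
    for name, coef in zip(feature_names, model.coef_[0]):
        print(f"{name}: {coef:+.3f}")
    # A deep network trained on the same task would not offer a readout like this;
    # choosing it anyway is a tradeoff that should be documented and justified.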
Perfect. Well, thanks so much, Lexy, for going over more details on training transparently. Again, this is Marie with the Data Science Ethics Podcast, and Lexy. Thanks so much for joining us. See you next time. We hope you’ve enjoyed listening to
this episode of the Data Science Ethics podcast. If you have, please like and
subscribe via your favorite podcast app, join in the conversation at datascienceethics.com, or find us on Facebook and Twitter at DS Ethics. Also, please consider
supporting us for just $5 per month. You can help us deliver more and
better content. See you next time, when we discuss model behavior. This podcast is copyright Alexis Kassan. All rights reserved. Music for this podcast is by DJ Shaw Money. Find him on SoundCloud or YouTube as DJ Money Beats.
