"In fact, some data scientists are
— for all practical purposes — statisticians,
while others are pretty much indistinguishable from software engineers. Some are machine-learning experts, while others couldn't machine-learn their way out of kindergarten...
In short, pretty much no matter how you define data science, you'll find practitioners for whom the definition is totally, absolutely wrong."
1. Obtain: pointing and clicking does not scale
2. Scrub: the world is a messy place
3. Explore: You can see a lot by looking
4. Models: always bad, sometimes ugly
5. iNterpret: "The purpose of computing is insight, not numbers."
• Design
  ∘ Translate a problem into a data problem
  ∘ Survey or experimental design
  ∘ Database infrastructure
• Acquire
  ∘ Survey or experiment
  ∘ Download the dataset! CSV, API, etc.
  ∘ Web scraping
Data Modeling Process
Data → Preprocess → Explore → Model → Communicate
• Wrangle
  ∘ Format
  ∘ Clean and organize
  ∘ Check data integrity
• Prepare
  ∘ Label
  ∘ Split into training and testing sets
  ∘ Normalize
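The "Prepare" steps of splitting and normalizing can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not a prescribed workflow: the helper names (`train_test_split`, `fit_standardizer`, `standardize`), the 25% test fraction, and the toy `hours` feature are all assumptions made for the example. The point it tries to show is that the normalization statistics are learned from the training set only and then reused on the test set.

```python
import random
import statistics


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_standardizer(values):
    """Learn the mean and standard deviation from training values only."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values) or 1.0  # fall back to 1.0 for constant data
    return mean, std


def standardize(values, mean, std):
    """Rescale values to zero mean and unit variance under the given statistics."""
    return [(v - mean) / std for v in values]


# Hypothetical toy feature (an assumption for illustration): hours studied.
hours = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

train, test = train_test_split(hours)
mean, std = fit_standardizer(train)
train_scaled = standardize(train, mean, std)
test_scaled = standardize(test, mean, std)  # reuse the training statistics
```

Fitting the scaler on the full dataset before splitting would leak information about the test set into training, which is why the split comes first in this sketch.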
• Visualize
  ∘ Plot and familiarize with data
  ∘ Look for and compare features visually
  ∘ Consider appropriate models
• Inspect
  ∘ Exploratory data analysis
  ∘ Descriptive statistics
  ∘ Identify features analytically
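As a small illustration of the "Descriptive statistics" step, the sketch below summarizes a single numeric feature with the Python standard library. The `describe` helper and the sample `values` are hypothetical, chosen only for the example; in practice a data-frame library would usually produce a similar summary.

```python
import statistics


def describe(values):
    """Return a small dictionary of descriptive statistics for one feature."""
    ordered = sorted(values)
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, median, q3 = statistics.quantiles(ordered, n=4)
    return {
        "count": len(ordered),
        "mean": statistics.fmean(ordered),
        "stdev": statistics.stdev(ordered),  # sample standard deviation
        "min": ordered[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": ordered[-1],
    }


# Hypothetical feature values (an assumption for illustration).
values = [2, 4, 4, 4, 5, 5, 7, 9]
summary = describe(values)

for name, value in summary.items():
    print(f"{name:>6}: {value}")
```

A large gap between mean and median, or a maximum far above the third quartile, is the kind of signal that would prompt a closer look at the plot of the same feature.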
• Model
  ∘ Try and compare multiple models
  ∘ Consider bias and variance
  ∘ Interpret model and performance
• Validate
  ∘ Assess model performance on independent test data
  ∘ Error analysis and stress-test
  ∘ Consider consequences
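The idea of comparing multiple models on independent test data can be sketched as follows. Everything here is an assumption for illustration: the data are synthetic (a noisy line), the two candidate models are a constant "mean baseline" and a hand-written least-squares line, and mean squared error is just one possible metric. It is a toy comparison, not a recommended modeling pipeline.

```python
import random
import statistics


def fit_mean_baseline(xs, ys):
    """Baseline model: always predict the training mean of y."""
    mean_y = statistics.fmean(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def mean_squared_error(model, xs, ys):
    """Average squared difference between predictions and observed values."""
    return statistics.fmean([(model(x) - y) ** 2 for x, y in zip(xs, ys)])


# Synthetic data (an assumption for illustration): y = 2x + 1 plus noise.
rng = random.Random(0)
points = [(float(x), 2.0 * x + 1.0 + rng.gauss(0.0, 1.0)) for x in range(20)]
rng.shuffle(points)
train, test = points[:15], points[15:]
train_x, train_y = zip(*train)
test_x, test_y = zip(*test)

candidates = [("mean baseline", fit_mean_baseline), ("least-squares line", fit_line)]
results = {}
for name, fit in candidates:
    model = fit(train_x, train_y)  # fit on training data only
    results[name] = {
        "train": mean_squared_error(model, train_x, train_y),
        "test": mean_squared_error(model, test_x, test_y),  # held-out data
    }

for name, scores in results.items():
    print(f"{name}: train MSE {scores['train']:.2f}, test MSE {scores['test']:.2f}")
```

Reporting train and test error side by side gives a rough handle on bias and variance: a model that is poor on both is likely too simple, while one that is much better on the training data than on the test data is likely overfitting.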
• Reflect
  ∘ Consider contexts, bias, and consequences
  ∘ Create an audit plan
  ∘ Document the data and the model
• Share
  ∘ Report documentation
  ∘ Inform policy
  ∘ Deploy in product
Data Modeling Process
Environment → Data → Preprocess → Explore → Model → Communicate →
A framework for critical analysis
Data
• Harmful data collection, lack of consent, insecure / lack of privacy, historical, representational, or measurement bias, ...
Preprocess
• Labor exploitation, labeling by non-experts, incorrect labeling, trauma experienced by labelers, ...
Explore
• Feature selection bias, bias in interpretation of data visualization, data manipulation, feature hacking, ...
Model
• Bias in model choice, model-amplified bias, environmental impact, learning bias, evaluation bias, peripheral modeling, ...
"Wu and Zhang’s sample ‘criminal’ images (top) and ‘non-criminal’ images
(bottom)." 2016
"Simplistic stereotypes is really not a basis for developing AI, and if your AI is based on this
then basically what you're doing is enshrining stereotypes in code." (11:42)
To what extent are we training the next generation of pseudoscientists?
A common misconception is that
data + compute ⟶ solutions
If the problem isn't solved yet, it's just because you haven't added enough technology yet!
This is one facet of Technosolutionism
"Not only do many of the hiring tools not work, they are based on troubling
pseudoscience and can discriminate"
Hilke Schellmann tried the myInterview tool to check her "hiring score":
Honest interview in English: 83%
Reading a random Wikipedia page in German: 73%
Getting a robot voice to read her English: 79%
"Our success,
happiness, and wellbeing are never fully of our own making. Others' decisions can profoundly affect the course of our lives...
Arbitrary, inconsistent, or faulty decision-making thus raises serious concerns..."