Welcome to Computing and Data Science
This course is about applied mathematics and leveraging computing to solve problems.
Applied Mathematics
Mathematical modeling is the process of translating a real-world problem into an idealized representation, working on the idealized problem, then translating the results back into real-world insights.
Real-world problems include all of the complexities and nuance of context. Abstraction is the de-contextualization of problems. This involves making assumptions about what to include and what to ignore in our representations, which gives us "nice" idealized versions of reality that are easier to work with. George Box frequently reiterated that:
"Essentially, all models are wrong, but some are useful."highlighting that when we model a phenomenon, our models are a tool of analysis and not the phenomenon itself. We cannot capture the full complexity of the real world, but we can model the behavior of some things to a level of accuracy that is sometimes useful.— Box & Draper, Empirical Model-Building and Response Surfaces, p. 424
Many seemingly unrelated problems have similar underlying structure. Scheduling of flights, placing packages into containers, and prioritizing processor threads for computer applications can be modeled with similar mathematical representations. Because the mathematics of the representation is common to all of the models, insights into one phenomenon can offer insights into other phenomena. The real world can inform the idealized world and vice versa, fueling a reciprocal process of discovery.
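As a rough illustration of shared structure, the sketch below treats two of these problems as the same idealized problem: fitting sized items into fixed-capacity bins. The choice of bin packing, the first-fit-decreasing heuristic, the function name `first_fit_decreasing`, and all of the numbers are assumptions made for illustration; they are not taken from the course materials.

```python
def first_fit_decreasing(sizes, capacity):
    """Greedy bin-packing heuristic: place each item, largest first,
    into the first bin with enough room, opening a new bin if none fits."""
    bins = []  # each bin is a list of item sizes
    for size in sorted(sizes, reverse=True):
        if size > capacity:
            raise ValueError(f"item of size {size} exceeds capacity {capacity}")
        for b in bins:
            if sum(b) + size <= capacity:
                b.append(size)
                break
        else:
            # No existing bin had room, so open a new one.
            bins.append([size])
    return bins


# Hypothetical data: package weights (kg) and a 10 kg limit per container.
package_weights = [4, 8, 1, 4, 2, 1]
containers = first_fit_decreasing(package_weights, capacity=10)

# Hypothetical data: task durations (ms) and a 10 ms time budget per core.
task_durations = [4, 8, 1, 4, 2, 1]
cores = first_fit_decreasing(task_durations, capacity=10)

print(containers)
print(cores)
```

The same function serves both cases because, once the context is stripped away, the two problems have the same representation. An improvement to the packing heuristic would carry over to the scheduling problem without any change.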
Data Science
Data Science is a broad field. If you want to write a definition of data science, then you might look at job postings to see what it looks like in industry. Or you might look at university programs to see how academia views it. However you try, you will run into conflicting views.
"In fact, some data scientists are — for all practical purposes — statisticians, while others are pretty much indistinguishable from software engineers. Some are machine-learning experts, while others couldn't machine-learn their way out of kindergarten...
In short, pretty much no matter how you define data science, you'll find practitioners for whom the definition is totally, absolutely wrong."
— Joel Grus, Data Science from Scratch, p. 20
Teaching data science is a way of inviting everyone to disagree with you!
"How would you design a data science textbook? It's not a well-defined body of knowledge, and there's no canonical corpus. It's popularized and celebrated in the press and media, but there's no "authority" to push back on erroneous or outrageous accounts. There's a lot of overlap with various other subjects as well; it could become redundant with a machine learning textbook, for example."— Rachel Schutt and Cathy O'Neil, Doing Data Science, p. 348
In our course of study, we will build our skills from first principles and work toward developing projects with the following data-modeling pipeline:
This framework offers a structure for both doing data science and for critically analyzing data projects. Each stage in the framework highlights points where choices are made, offering opportunities to reflect on potential consequences in the development and deployment of our models.
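One way to picture such a pipeline is as a chain of small functions, one per stage. The sketch below is a minimal illustration only: it assumes the stages are Data, Preprocess, Explore, Model, and Communicate (the stage names used in the critical-analysis table), and the toy records, the function names, and the choice of a least-squares line are all hypothetical.

```python
from statistics import mean


def load_data():
    # Data: hypothetical (x, y) records; None marks a missing measurement.
    return [(1, 2.1), (2, 3.9), (3, None), (4, 8.2), (5, 9.8)]


def preprocess(records):
    # Preprocess: drop records with missing values.
    # This is a choice; imputing the value instead would give a different model.
    return [(x, y) for x, y in records if y is not None]


def explore(records):
    # Explore: summarize the data before modeling it.
    xs, ys = zip(*records)
    return {"n": len(records), "mean_x": mean(xs), "mean_y": mean(ys)}


def fit_line(records):
    # Model: ordinary least-squares line y = slope * x + intercept.
    xs, ys = zip(*records)
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in records) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx


def communicate(summary, slope, intercept):
    # Communicate: report the result along with how much data supports it.
    return f"n={summary['n']}; fitted y = {slope:.2f}x + {intercept:.2f}"


records = preprocess(load_data())
summary = explore(records)
slope, intercept = fit_line(records)
report = communicate(summary, slope, intercept)
print(report)
```

Writing each stage as its own function makes the choices visible: the decision to drop the incomplete record lives in `preprocess`, and the decision to use a straight line lives in `fit_line`, so each can be questioned and changed on its own.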
A framework for critical analysis
Data | • Harmful data collection, lack of consent, insecure / lack of privacy, historical, representational, or measurement bias, ... |
Preprocess | • Labor exploitation, labeling by non-experts, incorrect labeling, trauma experienced by labelers, ... |
Explore | • Feature selection bias, bias in interpretation of data visualization, data manipulation, feature hacking, ... |
Model | • Bias in model choice, model-amplified bias, environmental impact, learning bias, evaluation bias, peripheral modeling, ... |
Communicate | • Biased model interpretation, ignoring variance, rejecting model, deploying harmful products, deployment bias, ... |
Meta | • "Pernicious feedback loops", runaway homogeneity, susceptibility to adversarial attack, lack of oversight or auditing, ... |
A critical look at the modeling process reveals that it is riddled with choices. Many of the technical issues along the way become ethical issues down the pipeline, and many ethical issues are fundamentally technical issues. A necessary but insufficient condition for avoiding harm with data technologies is: understand what you're doing.
Course Grades
Your grades will be based on the quality of the work that you produce. Throughout the course, we will have several problem sets and quizzes. Later in the year we will focus on increasingly large data projects. Grades will be calculated using total points (not broken down by category), where independent assessments are worth more points.
Academic Integrity
Academic integrity is paramount. The work that you submit must be your original work. It should be an expression of your understanding and a product of your problem solving. Never submit work that is not your own, and do not submit work that in any way misrepresents your contribution. Do not support others in submitting work that is not their own. Instead, support one another through collaborative curiosity and building understanding.
Use of generative AI can be particularly detrimental to an education. While there may be times later in the year when we experiment with generative AI, the use of such technologies in the completion of our work is otherwise unacceptable.
Learning is a process, and if you skip the process then you skip learning. When it comes to learning, there are effective strategies, but there are no effective shortcuts. The biggest favor you can do for yourself is to take the long way, and put the work in.
Part of having academic integrity is the respect you bring to our classroom, for each other and for the course. During our class time, it is expected that you will be working on this course of study. There is far more work than we could possibly complete in a year, so if there is a point when you feel like you don't have something to work on, let me know!
Resources
There are many high-quality, freely available resources, many of which find their way into our course.
Courses
- Practical Data Science, CMU
- Introduction to CS and Programming Using Python, MIT OCW
- Data 8, Computational and Inferential Thinking: The Foundations of Data Science
- Artificial Intelligence, MIT OCW, Fall 2010
- Counting Rocks! An Introduction to Combinatorics, CSU
- Nonlinear Dynamics and Chaos, Steven Strogatz
- Chaos, Fractals, and Dynamical Systems, UVM
Books
- Python Data Science Handbook, Jake VanderPlas
- Elements of Data Science, Allen B. Downey
- An Introduction to Statistical Learning, James, et al
- Hands-On Data Visualization, Dougherty
- Fairness and Machine Learning, Barocas, et al
- Patterns, Predictions, and Actions: A Story About Machine Learning, Hardt and Recht
- Discrete Mathematics: An Open Introduction, Levin
- Modeling and Simulation in Python, Allen B. Downey
- Think Stats: Exploratory Data Analysis, Allen B. Downey
Papers
- A Mathematical Theory of Communication. C.E. Shannon. 1948.
- Computing Machinery and Intelligence. Alan Turing. 1950.
- A Proposal for the Dartmouth Summer Research Project on Artificial Intelligence. McCarthy, et al. 1955.
- Deterministic Nonperiodic Flow. Edward N. Lorenz. 1963.
- Simple mathematical models with very complicated dynamics. Robert M. May. 1976.
- A history of chaos theory. Christian Oestreicher. 2007.
- The KDD Process for Extracting Useful Knowledge from Volumes of Data. Fayyad, et al. 1996.
- CRISP-DM: Towards a Standard Process Model for Data Mining. Wirth and Hipp. 2000.
- Statistical Modeling: The Two Cultures. Leo Breiman. 2001.
- 50 Years of Data Science. David Donoho. 2017.