Computing and Data Science

Welcome to Computing and Data Science

This course is about applied mathematics and leveraging computing to solve problems.

Applied Mathematics

Mathematical modeling is the process of translating a real-world problem into an idealized representation, working on the idealized problem, then translating the results back into real-world insights.

[Figure: the modeling cycle. Real World: Phenomenon, Insight. Idealized World: Model, Implications. Steps: Abstraction, Analysis, Interpretation, Validation.]

Real-world problems include all of the complexities and nuance of context. Abstraction is the de-contextualization of problems. This involves making assumptions about what to include and what to ignore in our representations, which gives us "nice" idealized versions of reality that are easier to work with. George Box frequently reiterated that:

"Essentially, all models are wrong, but some are useful."
— Box & Draper, Empirical Model-Building and Response Surfaces, p. 424
This highlights that a model is a tool of analysis and not the phenomenon itself. We cannot capture the full complexity of the real world, but we can model the behavior of some things to a level of accuracy that is sometimes useful.

Many seemingly unrelated problems have similar underlying structure. Scheduling of flights, placing packages into containers, and prioritizing processor threads for computer applications can be modeled with similar mathematical representations. Because the mathematics of the representation is common to all of the models, insights into one phenomenon can offer insights into other phenomena. The real world can inform the idealized world and vice versa, fueling a reciprocal process of discovery.
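As a small illustration of shared structure, the sketch below applies one routine to two different settings. It assumes both problems can be framed as bin packing and uses the first-fit decreasing heuristic; that framing, the function name `first_fit_decreasing`, and all of the data are hypothetical choices made for this example, not a formulation given in the course.

```python
from typing import Dict, List


def first_fit_decreasing(sizes: Dict[str, float], capacity: float) -> List[List[str]]:
    """Greedy bin-packing heuristic: take items largest first and put each
    one into the first bin that still has room, opening a new bin if none does."""
    if any(size > capacity for size in sizes.values()):
        raise ValueError("an item is larger than the bin capacity")

    bins: List[List[str]] = []    # names of the items placed in each bin
    remaining: List[float] = []   # unused capacity of each bin

    for name, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True):
        for i, room in enumerate(remaining):
            if size <= room:
                bins[i].append(name)
                remaining[i] -= size
                break
        else:
            # No existing bin had room, so open a new one.
            bins.append([name])
            remaining.append(capacity - size)
    return bins


# Hypothetical data: package weights (kg), containers that each hold 10 kg.
packages = {"A": 6, "B": 5, "C": 4, "D": 3, "E": 2}
containers = first_fit_decreasing(packages, capacity=10)

# Hypothetical data: job run times (ms), processor time slices of 10 ms each.
jobs = {"render": 7, "save": 3, "sync": 5, "log": 2, "scan": 3}
slices = first_fit_decreasing(jobs, capacity=10)

print(containers)
print(slices)
```

The function never refers to packages or processors. Only the interpretation of "item", "size", and "bin" changes between the two calls, which is the sense in which the representation is shared. First-fit decreasing is a heuristic, so it is not guaranteed to use the fewest bins in general.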

Data Science

Data Science is a broad field. If you want to write a definition of data science, then you might look at job postings to see what it looks like in industry. Or you might look at university programs to see how academia views it. However you try, you will run into conflicting views.

"In fact, some data scientists are — for all practical purposes — statisticians, while others are pretty much indistinguishable from software engineers. Some are machine-learning experts, while others couldn't machine-learn their way out of kindergarten...

In short, pretty much no matter how you define data science, you'll find practitioners for whom the definition is totally, absolutely wrong."
— Joel Grus, Data Science from Scratch, p. 20

Teaching data science is a way of inviting everyone to disagree with you!

"How would you design a data science textbook? It's not a well-defined body of knowledge, and there's no canonical corpus. It's popularized and celebrated in the press and media, but there's no "authority" to push back on erroneous or outrageous accounts. There's a lot of overlap with various other subjects as well; it could become redundant with a machine learning textbook, for example."
— Rachel Schutt and Cathy O'Neil, Doing Data Science, p. 348

In our course of study, we will build our skills from first principles and work toward developing projects with the following data-modeling pipeline:

Data → Preprocess → Explore → Model → Communicate
This is a common workflow that has developed over decades as multiple disciplines converged on data-modeling processes (Chapman 1999; Han 2012, Figure P.1; Schutt 2013, p. 41; Estrellado 2020, Ch. 3; Nantasenamat 2020; Kolter 2021).

This framework offers a structure for both doing data science and for critically analyzing data projects. Each stage in the framework highlights points where choices are made, offering opportunities to reflect on potential consequences in the development and deployment of our models.
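To make the stages concrete, here is a minimal sketch of the pipeline as a chain of small functions. Everything in it is an assumption made for illustration: the tiny dataset, the function names, the choice to drop incomplete records, and the choice of a least-squares line as the model. Real projects will make different choices at each stage, which is exactly where the critical questions arise.

```python
from statistics import mean

# Data: hypothetical (hours studied, quiz score) records; None marks a missing score.
raw = [(1, 52), (2, 58), (3, None), (4, 70), (5, 76)]


def preprocess(records):
    """Preprocess: keep only complete records (one of many possible choices)."""
    return [(x, y) for x, y in records if x is not None and y is not None]


def explore(records):
    """Explore: summarize each variable before modeling."""
    xs, ys = zip(*records)
    return {"n": len(records), "mean_x": mean(xs), "mean_y": mean(ys)}


def fit_line(records):
    """Model: ordinary least-squares line y = slope * x + intercept."""
    xs, ys = zip(*records)
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in records)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def communicate(summary, slope, intercept):
    """Communicate: state the result in plain language."""
    return (
        f"From {summary['n']} complete records, each extra hour is associated "
        f"with about {slope:.1f} more points (intercept {intercept:.1f})."
    )


clean = preprocess(raw)
summary = explore(clean)
slope, intercept = fit_line(clean)
print(communicate(summary, slope, intercept))
```

Even in this toy version, each function hides a decision: dropping the incomplete record instead of estimating it, summarizing with means, assuming a linear relationship, and phrasing the result as an association and not a cause.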

A framework for critical analysis

Data
• Harmful data collection, lack of consent, lack of security or privacy, historical, representational, or measurement bias, ...

Preprocess
• Labor exploitation, labeling by non-experts, incorrect labeling, trauma experienced by labelers, ...

Explore
• Feature selection bias, bias in interpretation of data visualization, data manipulation, feature hacking, ...

Model
• Bias in model choice, model-amplified bias, environmental impact, learning bias, evaluation bias, peripheral modeling, ...

Communicate
• Biased model interpretation, ignoring variance, rejecting model, deploying harmful products, deployment bias, ...

Meta
• "Pernicious feedback loops", runaway homogeneity, susceptibility to adversarial attack, lack of oversight or auditing, ...

A critical look at the modeling process reveals that it is riddled with choices. Many of the technical issues along the way become ethical issues down the pipeline, and many ethical issues are fundamentally technical issues. A necessary but insufficient condition for avoiding harm with data technologies is: understand what you're doing.

Course Grades

Your grades will be based on the quality of the work that you produce. Throughout the course, we will have several problem sets and quizzes. Later in the year we will focus on increasingly large data projects. Grades will be calculated using total points (not broken down by category), where independent assessments are worth more points.

Academic Integrity

Academic integrity is paramount. The work that you submit must be your original work. It should be an expression of your understanding and a product of your problem solving. Never submit work that is not your own, and do not submit work that in any way misrepresents your contribution. Do not support others in submitting work that is not their own. Instead, support one another through collaborative curiosity and building understanding.

Use of generative AI can be particularly detrimental to an education. While there may be times later in the year when we experiment with generative AI, the use of such technologies in the completion of our work is otherwise unacceptable.

Learning is a process, and if you skip the process then you skip learning. When it comes to learning, there are effective strategies, but there are no effective shortcuts. The biggest favor you can do for yourself is to take the long way, and put the work in.

Part of having academic integrity is the respect you bring to our classroom, for each other and for the course. During our class time, you are expected to be working on this course of study. There is far more work than we could possibly complete in a year, so if there is a point when you feel like you don't have something to work on, let me know!

Resources

There are many high-quality freely available resources, many of which find their way into our course.

Courses

Books

Papers

References

Box, G. E. P., & Draper, N. R. (1986). Empirical model-building and response surfaces. John Wiley & Sons, Inc.
Chapman, P., Clinton, J., Kerber, R., Khabaza, T., Reinartz, T., Shearer, C., & Wirth, R. (1999). CRISP-DM 1.0: Step-by-step Data Mining Guide. https://www.the-modeling-agency.com/crisp-dm.pdf.
Han, J., Kamber, M., & Pei, J. (2012). Data Mining: Concepts and Techniques. The Morgan Kaufmann Series in Data Management Systems.
Schutt, R., & O'Neil, C. (2014). Doing data science: Straight talk from the frontline. O'Reilly Media, Inc.
Estrellado, R. A., Freer, E. A., Mostipak, J., Rosenberg, J. M., & Velázquez, I. C. (2020). Data Science in Education Using R. New York, NY: Routledge.
Nantasenamat, C. (2020). The Data Science Process. Towards Data Science. https://towardsdatascience.com/the-data-science-process-a19eb7ebc41b.
Kolter, Z. (2021). Practical Data Science: Introduction. Practical Data Science. http://www.datasciencecourse.org/notes/intro/.