
Spring 2023


Welcome to Data Sci for Lang & Mind, Spring 2023!

Announcements

Schedule

All assignments are due by Wednesday at 11:59pm unless otherwise noted!

Syllabus

Welcome to Data Science for studying Language and the Mind!

Please note that the organization of the course has changed substantially. Read the syllabus carefully to see if the course will be a good fit for you!

Overview

Description

Data Sci for Lang & Mind is an entry-level course designed to teach basic principles of data science to students with little or no background in statistics or computer science. Students will learn to identify patterns in data using visualizations and descriptive statistics; make predictions from data using machine learning and optimization; and quantify the certainty of their predictions using statistical models. This course aims to help students build a foundation of critical thinking and computational skills that will allow them to work with data in all fields related to the study of the mind (e.g. linguistics, psychology, philosophy, cognitive science, neuroscience).

Prerequisites

There are no prerequisites beyond high school algebra. No prior programming or statistics experience is necessary, though you will still enjoy this course if you already have a little. Students who have taken several computer science or statistics classes should look for a more advanced course.

Sections

Materials

Readings

Statistical Modeling: A Fresh Approach 2nd Edition by Daniel Kaplan. A free e-textbook designed to make advanced statistical methods accessible to beginners by emphasizing the intuition behind statistical models.
Computational and Inferential Thinking: The Foundations of Data Science 2nd Edition by Ani Adhikari, John DeNero, David Wagner. Note that this textbook was developed for the UC Berkeley course Data 8: Foundations of Data Science and focuses on Python, but we will reference a few chapters.

Tools & Resources

Course website (you are here) for schedule, syllabus, and links to all assigned materials
Canvas site mostly used for posting grades
Ed Discussion for course announcements and online discussion and support
Gradescope for lab assignments, project checkpoints, and exams
R is an evolving summary of resources related to the R concepts we learn in the course (e.g. cheatsheets, extra learning resources), updated throughout the course
Datasets includes a collection of the datasets we use in the course
Exam study guides will include the study guides for the midterm and final

Components

Lecture

Lectures are held in MOOR 216 on Tuesdays and Thursdays at 12pm. You can also join the lectures live on Zoom. We highly recommend attending live in-person, but we will post recordings of the lectures in case you are sick or unable to attend.

At some point during each lecture, we will ask you to do the Lecture check-in survey. We will not use the lecture check-in as part of your grade. We are collecting this data to understand how student engagement styles influence course performance or enjoyment. Data science!

We may also use the survey data to determine whether to permit make-up work. If you are engaged with most lectures, we will be more likely to offer options like extensions or exam re-takes. If you rarely attend or engage with the lecture material, we will be more likely to offer options like taking an incomplete or withdrawing from the course.

Lab

Weekly in-person labs consist of working on lab assignments or project checkpoints with small groups of peers (see sections for available times and locations). Attending lab is a mandatory part of your grade. To receive credit, you must attend in person and make significant progress on the week's assignment.

Lab assignments will be released on Thursday afternoons by 1:45pm. You will be asked to submit your in-progress lab assignment to Gradescope at the end of each lab session to receive participation credit (whether you have finished or not). You will have until Wednesday night at 11:59pm to submit the final version you would like us to grade.

Lab sessions will not be recorded. Late assignments will not be accepted.

Project

Group project details will be released on the Project Guidelines page during week 3.

Exams

The midterm and final will be pen and paper exams designed to test your conceptual understanding of the material covered in the course. Both the midterm and final will be closed book, closed note, and held in person.

The midterm is scheduled during your regular lab section in week 8. The final will be scheduled during finals week (time and date TBD). Note that the final is required to pass the course.

Grading

Components

Course components contribute to your final grade according to the following table. Assignments within each category are weighted equally.

Letter grades

The table below shows the minimum score before rounding for letter grades. Grading is not on a curve: there is room for everyone to do well!

Policies

Late assignments

Late lab assignments will not be accepted. We drop your two lowest lab assignment grades. Project checkpoints will be accepted up to 24 hours late with no penalty.

Regrade requests

If you notice a grading mistake, you must make a regrade request on Gradescope before the regrade deadline. If you ask about grading in person or via email, you’ll be directed to make a formal regrade request in Gradescope.

Accommodations

We are happy to provide accommodations to anyone with documentation from Student Disability Services and to make alternate arrangements when class conflicts with a religious holiday. Please notify your lab section TA as soon as possible to make these arrangements.

Academic integrity

We will follow the rules of the University and the Code of Academic Integrity. It is your responsibility to be familiar with these policies.

Support

Asking for help is a sign of strength! We hope you’ll reach out to us if you need help. We also want you to be aware of Penn’s Academic & Wellness Resources.

Staff

Check back the first week of classes for more details including Office Hour times and locations.

Instructor

Dr. Kathryn Schuler (Katie)

Email: kschuler@sas.upenn.edu
Pronouns: She/Her
Office Hours: Tuesdays 1:30-2:30pm, Linguistics Department, Room 314C

Lead TA

June Choe

Email: yjchoe@sas.upenn.edu
Pronouns: He/Him
Office Hours: Thursdays 10:45-11:45am in 3401 Walnut 338C (Linguistics Department)
About me: I am a third year PhD student in the linguistics department. I study how children learn words over developmental time and how adults parse sentences in real time. Outside of research, I'm a data visualization enthusiast and develop open source packages in R - if you share these hobbies (or become interested in them over the semester), do let me know!

TAs

Ravi Arya

Email: raviarya@sas.upenn.edu
Pronouns: He/Him/His
About me: Hi my name is Ravi and I am a junior dual degree student studying Finance and Political Science. I am originally from NJ and am super excited to be your TA this semester. Feel free to email me with any questions about the course or Penn in general!

Avinash Goss

Email: amgoss@sas.upenn.edu
Pronouns: He/Him
About me: Avinash is a Junior in the College of Arts and Sciences majoring in Mathematical Economics, as well as pursuing minors in Statistics, Data Science, and Classical Studies. He loves to travel, having lived on 3 different continents and visited dozens of countries.

Ariana Wiltjer

Email: awiltjer@sas.upenn.edu
Pronouns: She/Her
Office Hours: Wednesdays 3:30-4:30pm in PCPSE 103C (1st floor of Perelman Center for Political Science and Economics)
About me: I am a senior from Portland, Oregon, studying Economics and minoring in Consumer Psychology. After graduation this Spring, I am planning to work in Data Analytics so I am excited to be working with this class, which has been one of my favorite and most useful classes I’ve taken at Penn so far! Outside of the classroom, I am a foodie and love exploring new restaurants, traveling, and learning languages (I have been learning Spanish and Italian and just started with Japanese). Being from the Pacific Northwest, I also love going on hikes and walks along the Schuylkill with my dog, Zelda. Looking forward to working with you all!

FAQs

What is the best way to get help?

Ask questions on Ed and come to office hours!

If I miss lab can I come to another lab section that week?

Probably! You can join any lab session that will work for your schedule provided you ask the section TA and they agree there is room.

I noticed a mistake in the grading of my assignment. How do I get this fixed?

We will look at your assignment again if you make a regrade request on Gradescope before the regrade deadline.

I missed or failed an exam. What can I do to make it up?

Discuss with your section TA to see what options are available to you. If you are engaging with the material (Lecture check-ins) we may offer a make-up. If you are not engaging with the material, you may not be permitted to make up an exam.

Can I turn in my lab assignment late?

No, lab assignments are not accepted late. However, we do drop your two lowest grades.

Can I turn in my project checkpoint late?

Yes. You can turn in project checkpoints up to one day late and receive full credit.

I missed class, how can I catch up?

All lectures will be automatically recorded and posted to Canvas a few minutes after class ends. You can watch the lectures when you are ready. We also recommend attending office hours to make sure you don’t fall behind.

I do not feel well, tested positive for COVID, and/or need to miss something. What should I do?

Please don’t attend in person if you are unwell or under quarantine. Follow the steps in the question above (under “I missed class”) to catch up. If you miss a lab, you can try the lab assignment on your own and visit office hours to catch up. We drop your two lowest lab participation scores, so a few absences will not impact your grade.
If you are sick for an exam, let us know before the exam begins and we can discuss your options.

What happens if one of the instructors or TAs does not feel well, tests positive for COVID, etc?

We will teach the course remotely if we need to, or fill in for each other if we can. We have a large teaching team, so this shouldn’t be an issue.

Resources

R

R colab notebook
Base R Cheat Sheet
dplyr full reference (we'll only use some of these functions)
dplyr vignette
dplyr & tidyr cheat sheet
ggplot full reference
ggplot2 overview and more learning resources
ggplot2 cheat sheet
ggplot2 workshop part 1 (youtube webinar, if you want to go further!)

Datasets

Human Brain Evolution

DeSilva, J. M., Traniello, J. F., Claxton, A. G., & Fannin, L. D. (2021). When and why did human brains decrease in size? A new change-point analysis and insights from brain evolution in ants. Frontiers in Ecology and Evolution, 712.

Data downloaded from the supplemental materials in DeSilva et al (2021); to use:

data <- read.csv('https://kathrynschuler.com/datasci-langmind-datasets/human-brain-evolution/data.csv')

Stanford's Wordbank

Cross-linguistic trajectories of two words: ball and dog taken from wordbank.stanford.edu. To use this dataset:

data <- read.csv('https://kathrynschuler.com/ling172/datasets/crosslinguistic-dog-ball.csv')
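
Once a dataset is loaded, a quick first look helps confirm it was read in correctly. The sketch below uses a tiny inline CSV with made-up column names and values (not the real Wordbank data) so that it runs without a network connection. The same inspection functions should work on a data frame returned by read.csv() from a URL.

```r
# Toy stand-in for a downloaded CSV. The column names and values are
# hypothetical, chosen only to illustrate the inspection steps.
csv_text <- "language,age,word,proportion
English,12,ball,0.10
English,18,ball,0.62
Spanish,12,dog,0.08
Spanish,18,dog,0.55"

data <- read.csv(text = csv_text)

dim(data)      # number of rows and columns
head(data)     # first few rows
str(data)      # type of each column
summary(data)  # quick descriptive statistics per column
```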

Nettle's Language diversity

A long format dataset that is most useful in wide format. Data taken from Appendix 1 in:

Nettle, D. (1998). Explaining Global Patterns of Language Diversity. Journal of Anthropological Archaeology, 17, 354–374.

To use this dataset, you’ll need the jvcasillas/untidydata package:

install.packages("devtools")
devtools::install_github("jvcasillas/untidydata")

Then load the package and call the dataset:

library(untidydata)
language_diversity
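
Since this dataset is described as long format but most useful in wide format, here is a minimal sketch of that reshaping step with tidyr's pivot_wider(). The toy data frame and its column names (country, measurement, value) are invented for illustration and may not match the columns of language_diversity, so check the real column names first.

```r
library(tidyr)

# Hypothetical long-format data: one row per country per measurement.
# Names and values are made up and are not taken from Nettle (1998).
long <- data.frame(
  country     = c("A", "A", "B", "B"),
  measurement = c("langs", "area", "langs", "area"),
  value       = c(10, 200, 25, 350)
)

# Spread each measurement into its own column: one row per country.
wide <- pivot_wider(long, names_from = measurement, values_from = value)
wide
```

In wide format each measurement becomes a column, which is usually the more convenient shape for plotting one measurement against another.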

Exam study guides

Project guidelines

Welcome to your data science project!

Throughout the semester, you'll be applying what you learn to a data science project that is of particular interest to you (and your group!). You'll need to select a project within the bounds of linguistics or cognitive science, but other than that the topic is up to you.

Types of projects

Projects can be one of two types:

A group project: you join a group with 1 or 2 other students in your lab section (max group size is 3 students total). You and your group replicate a classic study in linguistics or cognitive science by reconstructing the data and analysis from the published paper.
A solo project: you work alone and either (1) replicate a classic study in linguistics or cognitive science by reconstructing the data and analysis from the published paper (same as the group project version, you just work alone) OR (2) work on an original research project in which you collect data yourself.

Reasons for doing a solo project would include: (1) you are required to, usually in order to use this class in a specific way toward your major or minor; (2) you are conducting (or want to conduct!) a research project in linguistics or cognitive science either as part of a class, a thesis, or independently; and you want to use this class to help you do the data science on that project.

Classic studies

Wondering how to find a classic study? Here are a few suggestions:

Look to your previous classes! Does anything stand out? Any interesting things you heard about in class that you'd like to explore more?
Look in intro textbooks! Introductory textbooks in a given field often describe classic studies in an accessible way. That would be a great place to start to figure out what studies you might want to replicate. They will reference the original research article.
Look to social media! Have you read about anything in the news or on social media that you'd like to dive into? You can usually find the reference to the original research study somewhere in a news article.
Use our lists below! We've put together a list of possible classics in Cognitive Science and Linguistics that you might be interested in.
Ask on Ed! Describe your interest on Ed and we can try to direct you to some specific papers that way.

Linguistics

Sociolinguistics (from Dr. Meredith Tamminga, Associate Professor of Linguistics)

Tagliamonte & D'Arcy 2007 on the "be like" quotative
Hay & Drager 2010 speech perception study involving kiwis and kangaroos!

Language Evolution (from Dr. Gareth Roberts, Associate Professor of Linguistics)

Galantucci, B. (2005). An experimental study of the emergence of human communication systems. Cognitive Science, 29(5), 737-767.
Garrod, S., Fay, N., Lee, J., Oberlander, J., & MacLeod, T. (2007). Foundations of representation: where might graphical symbol systems come from? Cognitive Science, 31(6), 961-987.
Scott-Phillips, T. C., Kirby, S., & Ritchie, G. R. (2009). Signalling signalhood and the emergence of communication. Cognition, 113(2), 226-233.
Sneller, B. and Roberts, G. (2018) Why some behaviors spread while others don't: A laboratory simulation of dialect contact. Cognition 170C: 298–311.

Semantics (from Dr. Florian Schwarz, Associate Professor of Linguistics)

Bott & Noveck’s 2004 experiment on scalar implicatures (some vs. all elephants are mammals)

And from Dr. Schuler - You could look to some of the papers from my graduate seminars on topics:

Cognitive Science

List provided by Dr. Russell Richie, Associate Director of Cognitive Science and mindCORE programs

Introduction

Donders (1868) How long does it take to make a decision? (subtractive method)

Cognitive Revolution and the Computational Theory of Mind

Latent learning with rats in a maze (rats learn even if not rewarded/punished). Tolman and Honzik 1930.
Magical number seven. Miller 1956
Behrend & Bitterman (1961) – simple reinforcement learning example of fish learning food probabilities from two feeders

Modularity

Firestone and Scholl (2016): This is a BBS piece and doesn’t have original data itself, but discusses a lot of studies of alleged top down effects of cognition on perception. Students may find it interesting to replicate those.

Judgment and Decision-Making: Are we 'good' at reasoning?

Wason selection task. Wason, P. C. (1968).
People are better at logical reasoning if the problem fits into a ‘schema’ (e.g., permission schema). Cheng and Holyoak (1985)
Conjunction fallacy (aka Linda problem) -- Tversky and Kahneman 1981
Base rate neglect (Tom problem) -- Kahneman & Tversky (1973)
Availability heuristic - Tversky & Kahneman, 1973
Anchoring heuristic – Tversky and Kahneman 1974. Ariely et al 2003.

Judgment and Decision-Making: Behavioral Economics

Certainty effect – Tversky and Kahneman 1986
Loss aversion - Kahneman, D. & Tversky, A. (1979). "Prospect Theory: An Analysis of Decision under Risk".
Framing/epidemic problem. Tversky and Kahneman 1981.
Mental accounting – Heath and Soll 1996 (Journal of Consumer Research)
Decoy effect -- Huber, Joel; Payne, John W.; Puto, Christopher (1982)
Intransitive preferences – Tversky 1969

Language structure

Eimas et al 1971 on categorical perception in infants
Werker and Tees 1984 on losing ability to distinguish non-native speech category contrasts.
Bias towards hearing clicks at phrase boundaries -- Ladefoged and Broadbent (1960) and Fodor and Bever (1965)

Language comprehension

When listening to or reading words, people search for and activate candidate words in parallel, rather than serially. Tanenhaus et al 1979
Cohort theory vs TRACE model of word recognition: Allopenna et al 1998
Evidence for interactive theory of sentence processing:
i. Visual world context: Tanenhaus et al. (1995); Trueswell et al. (1999)
ii. Verb bias: Snedeker and Trueswell 2004
iii. Prosody: Snedeker and Trueswell 2003
iv. Real world knowledge: Chambers et al 2004

Language acquisition

Word segmentation
i. Conditional probabilities between syllables: Saffran et al 1996
ii. Stress patterns: Jusczyk et al 1999
Word learning
i. How to aggregate information across word-referent pairings?
Global cross-situational word learning: Yu and Smith 2007
Hypothesis testing/Propose but verify: Trueswell et al 2013
Hybrid Pursuit model, developed by Charles and others at Penn.
ii. Noun bias: Bates et al 1994
iii. Human simulation paradigm: Gleitman et al 1999
iv. Syntactic bootstrapping: Naigles, 1990; Hirsh-Pasek et al 1996; Yuan and Fisher 2009
Rule learning and regularization
i. Artificial language learning and the Tolerance Principle: Schuler et al 2021 😉
ii. Deaf children exposed to sign language early outperform their hearing parents (Newport, 1990)

Language and thought

Effect of cross-linguistic differences on color perception (or perhaps just decision-making!) – Winawer et al 2007
“Whorf hypothesis is supported in the right visual field but not the left” – Gilbert et al 2006
“Does categorical perception in the left hemisphere depend on language?” – Holmes and Wolff 2012
Can you represent spatial relations (e.g., left/right) if you don’t have words for them? See Brown and Levinson 1993 for ‘yes’.
i. But then see Li and Gleitman 2002 for counterargument with Penn undergrads!

Neuroscience – Methods

Dead salmon fMRI study, showing risks of research degrees of freedom – Bennett et al (2009, NeuroImage)
Different neurons/patches of superior temporal gyrus encode different phonetic features (evidence from direct electrode recordings) -- Mesgarani et al 2014

Neuroscience – Plasticity

Long term potentiation – Bliss and Lømo 1973
Beatrice Gelber studies on conditioning in single-celled organisms (paramecia) -- Gershman et al 2021 in eLife.
Critical periods:
i. Hubel and Wiesel 1964 studies with suturing cat eyes shut, on critical period effects in visual development
ii. Newport 1990 showing age of acquisition/critical period effects of language acquisition in Deaf people acquiring ASL (who don’t have another L1, thus suggesting critical period effects in learning additional languages as an adult are not merely interference from L1).
Ferrets with visual pathway rewired to auditory cortex can still see (i.e., brain areas have some flexibility in the inputs they can take). von Melchner et al 2000
London cab drivers have greater grey matter volume in the posterior hippocampus than London bus drivers. Maguire et al 2006 argue that needing to navigate flexibly increases hippocampal volume.

Cognitive Development - Object perception

Infants perceive objects as unitary (if two things move in tandem but their connection is occluded, infants assume a single object) -- Valenza, E., Leo, I., Gava, L., & Simion, F. (2006). Child Development, 77, 1810–1821.
Infants know objects persist when they disappear (object permanence): Wang et al 2004
Infants know objects move continuously through space and time (Aguiar & Baillargeon, 1998; Johnson et al 2003)
When objects violate physics, infants test them appropriately (drop objects that defy gravity; bang objects that pass through others)…Stahl and Feigenson 2015.
Infants detect shape changes first in development, then pattern, then color (Wilcox, 1999)

Cognitive Development - Understanding of agents

Infants can use repeated reaching to infer a goal – Woodward 1998
Infants expect goal-directed actions to be efficient – Liu et al 2019; Gergely and Csibra 2002
Infants expect successful agents to be happy – Skerry and Spelke 2014.
Do babies prefer agents that help, over agents that hinder? Hamlin et al 2007 says ‘yes’. But see failed replications, Schlingloff et al. 2020, Salvadori et al 2015! (FYI, there is an ongoing multi-site replication: https://manybabies.github.io/MB4/)

Reinforcement Learning

Blocking effect: Suppose an animal already knows that A (chime) predicts B (food). If X (light) is presented simultaneously with A, animal won’t learn association between X and B. Kamin 1968.
Dopamine spikes don’t accompany receipt of reward; they reflect changes in expectations of reward/punishments! Schultz et al 1997

Concepts and categories

Production tasks (list members of a category): when asked to list examples of birds (“robin, sparrow…parrot…ostrich, penguin”), people are relatively consistent and name typical members first.
Rating tasks (rate members of a category): people agree about what counts as a typical member.
Sentence verification: “A robin is a bird” is verified faster than “A penguin is a bird”.
Picture identification: people are faster to identify typical members of a category. Posner and Keele 1968
Missing prototype effect – Posner and Keele 1968
Odd (even) numbers show prototypicality effects in rating tasks and sentence verification tasks! Armstrong, Gleitman, and Gleitman 1983.
Blind and sighted people have similar organization of visual verbs. Bedny et al 2019.

Number cognition

Monkeys can count! Brannon and Terrace, 1998; Cantlon and Brannon, 2006.
So can infants! Feigenson et al. (2002)
Ants count their steps in order to navigate! Wittlinger et al 2006
Number words/symbols seem to be necessary to exactly represent large number concepts (like 56). Two papers:
i. Exact and Approximate Arithmetic in an Amazonian Indigene Group. Pica et al 2004.
ii. Number cognition in Deaf Nicaraguan homesigners. Spaepen et al 2011.

Collective cognition and behavior

The spread of behavior in an online social network – Centola (2010). Behavior spreads more quickly in social networks with clusters, than in social networks where people are connected randomly, even though it takes fewer steps to get between any two people in a random network!

Final project submission

Due on Gradescope by May 10th at 11:59pm (no late submissions possible: grades are due May 12!)

Your final project submission will essentially be a more formal, cleaned up report of your preregistration and project checkpoint 3. Your submission should be a Google Colab notebook (see sample!) and include the following sections:

Introduction - in the introduction, provide a brief background and summary of the research question(s) addressed in the original paper. Describe (again very briefly) how the researchers addressed this question(s) in their original work. Then describe which aspect of the paper your group chose to replicate.
Method - your method section should include at least 3 subsections
1. Participants - describe the participants in the portion of the study you are replicating and how you simulated that participant structure (if your data was not available). Include the R code that generates the participant structure (or summarizes it if your data was available) and show this in a table with R code.
2. Procedure - describe in more detail the procedure of the portion of the study you are replicating. What did the researchers do (or have their participants do) in the experiment? On each trial? What were the stimuli like? Did the researchers code or summarize the data in any way? Explain that here. Include the R code that generates the data, including the trial-by-trial data. If you imported data, include R code that generates tables to indicate the number of trials, etc.
3. Analysis - describe in more detail the analysis you and your group have opted to conduct. First, restate the research question you are addressing and its null hypothesis. You should describe your planned analysis, which should include at least a linear or logistic regression to test your null hypothesis. If you will remove outliers or trim data before analysis, describe that here. Include the R code for outlier removal and model building here.
Results - begin by recreating the figure (or creating a figure if one did not exist) that summarizes your research question. Then run the anova() function on your model in R and interpret the results in text (Chapter 15 in the book is helpful for this!). Then, use summary() to get the regression coefficients and interpret those results in the text (Chapter 7 in the book is helpful for this!). Finally, get your model's predictions with the predict() function and add the model predictions to your original figure.
Conclusions - Finally, include a few sentences to remind the reader what you set out to do, what you found, and whether the results from your model were the same or different than the original research finding. Briefly summarize what this might mean.
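
As a rough illustration of how the Method and Results steps above fit together in R, here is a minimal sketch using simulated data. The two-group design, the variable names (participants, group, rt), and all numbers are invented for this example and are not taken from any paper. Your own project will need a design and model that match the study you are replicating.

```r
set.seed(1)  # make the simulation reproducible

# Simulated participant structure: 20 hypothetical participants in 2 groups.
participants <- data.frame(
  id    = 1:20,
  group = rep(c("control", "treatment"), each = 10)
)
table(participants$group)  # participants per group

# Simulated outcome: the group means (500 vs 450) and the noise level
# are invented for illustration only.
participants$rt <- ifelse(participants$group == "control", 500, 450) +
  rnorm(nrow(participants), mean = 0, sd = 30)

# Linear regression testing the null hypothesis of no group difference.
model <- lm(rt ~ group, data = participants)

anova(model)    # F test for the group term
summary(model)  # regression coefficients

# Model predictions, stored alongside the data so they can be plotted.
participants$predicted <- predict(model)
head(participants)
```

With a single categorical predictor, predict() returns one value per group (the fitted group means), which can be drawn on top of the raw data in a figure.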

Here is a sample final project submission, based on the Petitto & Marentette 1991 article you all read for Project Checkpoint 01, plus a blank R notebook if you'd like to start there: