Welcome to Data Science for studying Language and the Mind!
Please note that the organization of the course has changed substantially. Read the syllabus carefully to see if the course will be a good fit for you!
Data Sci for Lang & Mind is an entry-level course designed to teach basic principles of data science to students with little or no background in statistics or computer science. Students will learn to identify patterns in data using visualizations and descriptive statistics; make predictions from data using machine learning and optimization; and quantify the certainty of their predictions using statistical models. This course aims to help students build a foundation of critical thinking and computational skills that will allow them to work with data in all fields related to the study of the mind (e.g. linguistics, psychology, philosophy, cognitive science, neuroscience).
There are no prerequisites beyond high school algebra. No prior programming or statistics experience is necessary, though you will still enjoy this course if you already have a little. Students who have taken several computer science or statistics classes should look for a more advanced course.
Statistical Modeling: A Fresh Approach 2nd Edition by Daniel Kaplan. A free e-textbook designed to make advanced statistical methods accessible to beginners by emphasizing the intuition behind statistical models.
Computational and Inferential Thinking: The Foundations of Data Science 2nd Edition by Ani Adhikari, John DeNero, David Wagner. Note that this textbook was developed for the UC Berkeley course Data 8: Foundations of Data Science and focuses on Python, but we will reference a few chapters.
Course website (you are here) for schedule, syllabus, and links to all assigned materials
Canvas site mostly used for posting grades
Ed Discussion for course announcements and online discussion and support
Gradescope for lab assignments, project checkpoints, and exams
R is an evolving summary of resources related to the R concepts we learn in the course (e.g. cheatsheets, extra learning resources). Content is updated throughout the course.
Datasets includes a collection of the datasets we use in the course
Exam study guides will include the study guides for the midterm and final
Lectures are held in MOOR 216 on Tuesdays and Thursdays at 12pm. You can also join the lectures live on Zoom. We highly recommend attending in person, but we will post recordings of the lectures in case you are sick or unable to attend.
At some point during each lecture, we will ask you to do the Lecture check-in survey. We will not use the lecture check-in as part of your grade. We are collecting this data to understand how student engagement styles influence course performance or enjoyment. Data science!
We may also use the survey data to determine whether to permit make-up work. If you are engaged with most lectures, we will be more likely to offer options like extensions or exam re-takes. If you rarely attend or engage with the lecture material, we will be more likely to offer options like taking an incomplete or withdrawing from the course.
Weekly in-person labs consist of working on lab assignments or project checkpoints with small groups of peers (see sections for available times and locations). Attending lab is a mandatory part of your grade. To receive credit, you must attend in person and make significant progress on the week's assignment.
Lab assignments will be released on Thursday afternoons by 1:45pm. You will be asked to submit your in-progress lab assignment to Gradescope at the end of each lab session to receive participation credit (whether you have finished or not). You will have until Wednesday night at 11:59pm to submit the final version you would like us to grade.
Lab sessions will not be recorded. Late assignments will not be accepted.
Group project details will be released on the Project Guidelines page during week 3.
The midterm and final will be pen and paper exams designed to test your conceptual understanding of the material covered in the course. Both the midterm and final will be closed book, closed note, and held in person.
The midterm is scheduled during your regular lab section in week 8. The final will be scheduled during finals week (time and date TBD). Note that the final is required to pass the course.
Course components contribute to your final grade according to the following table. Assignments within each category are weighted equally.
The table below shows the minimum score before rounding for letter grades. Grading is not on a curve: there is room for everyone to do well!
Late lab assignments will not be accepted. We drop your two lowest lab assignment grades. Project checkpoints will be accepted up to 24 hours late with no penalty.
If you notice a grading mistake, you must make a regrade request on Gradescope before the regrade deadline. If you ask about grading in person or via email, you’ll be directed to make a formal regrade request in Gradescope.
We are happy to provide accommodations to anyone with documentation from Student Disability Services and to make alternate arrangements when class conflicts with a religious holiday. Please notify your lab section TA as soon as possible to make these arrangements.
We will follow the rules of the University and the Code of Academic Integrity. It is your responsibility to be familiar with these policies.
Asking for help is a sign of strength! We hope you’ll reach out to us if you need help. We also want you to be aware of Penn’s Academic & Wellness Resources.
Section | Time | Location | Staff |
---|---|---|---|
401 LEC | TR 12-12:59 PM | | Dr. Kathryn Schuler |
402 LAB | R 1:45-2:44 PM | | June Choe |
403 LAB | R 3:30-4:29 PM | | Ariana Wiltjer |
404 LAB | F 12-12:59 PM | | Ravi Arya |
405 LAB | F 1:45-2:44 PM | | Avinash Goss |

Component | Grade |
---|---|
Lecture Check-ins | 0% - not used for grading |
Lab Participation | 10% |
Lab Assignments | 20% |
Project | 25% |
Midterm | 15% |
Final | 30% |

Letter | Minimum score (before rounding) |
---|---|
A+ | 97% |
A | 94% |
A- | 90% |
B+ | 87% |
B | 84% |
B- | 80% |
C+ | 77% |
C | 74% |
C- | 70% |
D+ | 67% |
D | 64% |
D- | 60% |
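As a rough illustration of how the component weights and letter cutoffs above combine, here is a minimal R sketch. The student scores are made up, the function name `letter_grade` is hypothetical, and treating anything below the D- cutoff as an "F" is an assumption (the table does not list a grade below D-). It also ignores dropped lab grades and the rule that the final is required to pass.

```r
# Component weights from the grade table (as proportions; they sum to 1).
weights <- c(lab_participation = 0.10, lab_assignments = 0.20,
             project = 0.25, midterm = 0.15, final = 0.30)

# Minimum score (before rounding) for each letter grade, highest first.
cutoffs <- c("A+" = 97, "A" = 94, "A-" = 90, "B+" = 87, "B" = 84, "B-" = 80,
             "C+" = 77, "C" = 74, "C-" = 70, "D+" = 67, "D" = 64, "D-" = 60)

# Hypothetical category averages for one student, in percent.
scores <- c(lab_participation = 100, lab_assignments = 92,
            project = 88, midterm = 81, final = 85)

# Weighted sum of the category averages.
course_score <- sum(weights * scores[names(weights)])

# Return the highest letter whose cutoff is met.
# Assumption: anything below the D- cutoff is an "F" (not stated in the table).
letter_grade <- function(score) {
  met <- names(cutoffs)[score >= cutoffs]
  if (length(met) == 0) "F" else met[1]
}

print(course_score)
print(letter_grade(course_score))
```

With these made-up scores the weighted sum is 10 + 18.4 + 22 + 12.15 + 25.5, or about 88.05, which clears the B+ cutoff of 87 but not the A- cutoff of 90.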
Check back the first week of classes for more details including Office Hour times and locations.
Dr. Kathryn Schuler
Email: kschuler@sas.upenn.edu
Pronouns: She/Her
Office Hours: Tuesdays 1:30-2:30pm, Linguistics Department, Room 314C
June Choe
Email: yjchoe@sas.upenn.edu
Pronouns: He/Him
Office Hours: Thursdays 10:45-11:45am in 3401 Walnut 338C (Linguistics Department)
About me: I am a third year PhD student in the linguistics department. I study how children learn words over developmental time and how adults parse sentences in real time. Outside of research, I'm a data visualization enthusiast and develop open source packages in R - if you share these hobbies (or become interested in them over the semester), do let me know!
Ravi Arya
Email: raviarya@sas.upenn.edu
Pronouns: He/Him/His
About me: Hi, my name is Ravi and I am a junior dual degree student studying Finance and Political Science. I am originally from NJ and am super excited to be your TA this semester. Feel free to email me with any questions about the course or Penn in general!
Avinash Goss
Email: amgoss@sas.upenn.edu
Pronouns: He/Him
About me: Avinash is a Junior in the College of Arts and Sciences majoring in Mathematical Economics, as well as pursuing minors in Statistics, Data Science, and Classical Studies. He loves to travel, having lived on 3 different continents and visited dozens of countries.
Ariana Wiltjer
Email: awiltjer@sas.upenn.edu
Pronouns: She/Her
Office Hours: Wednesdays 3:30-4:30pm in PCPSE 103C (1st floor of Perelman Center for Political Science and Economics)
About me: I am a senior from Portland, Oregon, studying Economics and minoring in Consumer Psychology. After graduation this Spring, I am planning to work in Data Analytics so I am excited to be working with this class, which has been one of my favorite and most useful classes I’ve taken at Penn so far! Outside of the classroom, I am a foodie and love exploring new restaurants, traveling, and learning languages (I have been learning Spanish and Italian and just started with Japanese). Being from the Pacific Northwest, I also love going on hikes and walks along the Schuylkill with my dog, Zelda. Looking forward to working with you all!
Materials for the Spring 2023 version of the course are archived here. For the most recent version of the course, visit
. Due by May 08 at 11:59pm
W | Lecture | Read | Lab Assignment |
---|
Website:
Office Hours: Fridays 3-4pm on
Office Hours: Tuesdays 3:30-4:30pm on
What is the best way to get help?
Ask questions on Ed and come to office hours!
If I miss lab can I come to another lab section that week?
Probably! You can join any lab session that will work for your schedule provided you ask the section TA and they agree there is room.
I noticed a mistake in the grading of my assignment. How do I get this fixed?
We will look at your assignment again if you make a regrade request on Gradescope before the regrade deadline.
I missed or failed an exam. What can I do to make it up?
Discuss with your section TA to see what options are available to you. If you are engaging with the material (Lecture check-ins) we may offer a make-up. If you are not engaging with the material, you may not be permitted to make up an exam.
Can I turn in my lab assignment late?
No, lab assignments are not accepted late. However, we do drop your two lowest grades.
Can I turn in my project checkpoint late?
Yes. You can turn in project checkpoints up to one day late and receive full credit.
I missed class, how can I catch up?
All lectures will be automatically recorded and posted to Canvas a few minutes after class ends. You can watch the lectures when you are ready. We also recommend attending office hours to make sure you don’t fall behind.
I do not feel well, tested positive for COVID, and/or need to miss something. What should I do?
Please don’t attend in person if you are unwell or under quarantine. Follow the steps in the question above (under “I missed class”) to catch up. If you miss a lab, you can try the lab assignment on your own and visit office hours to catch up. We drop your two lowest lab participation scores, so a few absences will not impact your grade.
If you are sick for an exam, let us know before the exam begins and we can discuss your options.
What happens if one of the instructors or TAs does not feel well, tests positive for COVID, etc?
We will teach the course remotely if we need to, or fill in for each other if we can. We have a large teaching team, so this shouldn’t be an issue.
W | Lecture | Read | Lab Assignment |
---|---|---|---|
1 | Jan 12: No lecture! Attend your lab section on R or F | | |
2 | Jan 17: Causation & experiments; Jan 19: Programming in R | | |
3 | Jan 24: Data types; Jan 26: Data tables | | |
4 | Jan 31: Tidy data; Feb 2: Data wrangling recap with June | | |
5 | Feb 7: Data visualization; Feb 9: More data visualization | | |
6 | Feb 14: ggplot recap with June; Feb 16: Confidence intervals | | |
7 | Feb 21: Language of models; Feb 23: Model formulas and coefficients | | |
8 | No class today! Mar 2: Midterm review | | No lab |
9 | Spring break! No class | | No lab |
10 | Mar 14: Fitting models to data & correlation! Mar 16: Exam Q&A with June! Nothing due this week! Study for Midterm! | | Midterm (taken in lab) |
11 | Mar 21: Total and partial change; Mar 23: Modeling randomness | | |
12 | Mar 28: Confidence in models; Mar 30: Logic of hypothesis testing | | |
13 | Apr 4: Hypothesis testing on whole models; Apr 6: Hypothesis testing on parts of models. Extended deadline for Lab 9: due Monday Apr 10 for Passover! | | |
14 | Apr 11: Advanced R TBD with June; Apr 13: Advanced R TBD with June | | |
15 | Apr 18: Non-parametric approaches; Apr 20: Logistic regression | | TBD |
16 | Apr 25: Last class! Final exam review; Apr 27: No class, reading period | | No lab |
17 | Final exam (date/time TBD) | | |

Lab assignments and project checkpoints, in the order listed: Lab 1, Lab 2, Project, Lab 3, Lab 4, Project, Lab 5, Lab 6, Lab 7, Project (released Mon Apr 10), Lab 8.
dplyr full reference (we'll only use some of these functions)
ggplot2 workshop part 1 (youtube webinar, if you want to go further!)
Data downloaded from the supplemental materials in DeSilva et al (2021); to use:
Cross-linguistic trajectories of two words, ball and dog, taken from wordbank.stanford.edu. To use this dataset:
A long format dataset that is most useful in wide format. Data taken from Appendix 1 in:
To use this dataset, you’ll need the jvcasillas/untidydata package:
Then call
Welcome to your data science project!
Throughout the semester, you'll be applying what you learn to a data science project that is of particular interest to you (and your group!). You'll need to select a project within the bounds of linguistics or cognitive science, but other than that the topic is up to you.
Projects can be one of two types:
A group project: you join a group with 1 or 2 other students in your lab section (max group size is 3 students total). You and your group replicate a classic study in linguistics or cognitive science by reconstructing the data and analysis from the published paper.
A solo project: you work alone and either (1) replicate a classic study in linguistics or cognitive science by reconstructing the data and analysis from the published paper (same as the group project version, you just work alone) OR (2) work on an original research project in which you collect data yourself.
Reasons for doing a solo project would include: (1) you are required to, usually in order to use this class in a specific way toward your major or minor; (2) you are conducting (or want to conduct!) a research project in linguistics or cognitive science either as part of a class, a thesis, or independently; and you want to use this class to help you do the data science on that project.
Wondering how to find a classic study? Here are a few suggestions:
Look to your previous classes! Does anything stand out? Any interesting things you heard about in class that you'd like to explore more?
Look in intro textbooks! Introductory textbooks in a given field often describe classic studies in an accessible way. That would be a great place to start to figure out what studies you might want to replicate. They will reference the original research article.
Look to social media! Have you read about anything in the news or on social media that you'd like to dive into? You can usually find the reference to the original research study somewhere in a news article.
Use our lists below! We've put together a list of possible classics in Cognitive Science and Linguistics that you might be interested in.
Ask on Ed! Describe your interest on Ed and we can try to direct you to some specific papers that way.
Sociolinguistics (from Dr. Meredith Tamminga, Associate Professor of Linguistics)
Tagliamonte & D'Arcy 2007 on the "be like" quotative
Hay & Drager 2010 speech perception study involving kiwis and kangaroos!
Language Evolution (from Dr. Gareth Roberts, Associate Professor of Linguistics)
Galantucci, B. (2005). An experimental study of the emergence of human communication systems. Cognitive Science, 29(5), 737-767.
Garrod, S., Fay, N., Lee, J., Oberlander, J., & MacLeod, T. (2007). Foundations of representation: where might graphical symbol systems come from? Cognitive Science, 31(6), 961-987.
Scott-Phillips, T. C., Kirby, S., & Ritchie, G. R. (2009). Signalling signalhood and the emergence of communication. Cognition, 113(2), 226-233.
Sneller, B. and Roberts, G. (2018) Why some behaviors spread while others don't: A laboratory simulation of dialect contact. Cognition 170C: 298–311.
Semantics (from Dr. Florian Schwarz, Associate Professor of Linguistics)
Bott & Noveck’s 2004 experiment on scalar implicatures (some vs. all elephants are mammals)
And from Dr. Schuler - You could look to some of the papers from my graduate seminars on topics:
List provided by Dr. Russell Richie, Associate Director of Cognitive Science and mindCORE programs
Introduction
Donders (1868) How long does it take to make a decision? (subtractive method)
Cognitive Revolution and the Computational Theory of Mind
Latent learning with rats in a maze (rats learn even if not rewarded/punished). Tolman and Honzik 1930.
Magical number seven. Miller 1956
Behrend & Bitterman (1961) – simple reinforcement learning example of fish learning food probabilities from two feeders
Modularity
Firestone and Scholl (2016): This is a BBS piece and doesn’t have original data itself, but discusses a lot of studies of alleged top down effects of cognition on perception. Students may find it interesting to replicate those.
Judgment and Decision-Making: Are we 'good' at reasoning?
Wason selection task. Wason, P. C. (1968).
People are better at logical reasoning if the problem fits into a ‘schema’ (e.g., permission schema). Cheng and Holyoak (1985)
Conjunction fallacy (aka Linda problem) -- Tversky and Kahneman 1983
Base rate neglect (Tom problem) -- Kahneman & Tversky (1973)
Availability heuristic - Tversky & Kahneman, 1973
Anchoring heuristic – Tversky and Kahneman 1974. Ariely et al 2003.
Judgment and Decision-Making: Behavioral Economics
Certainty effect – Tversky and Kahneman 1986
Loss aversion - Kahneman, D. & Tversky, A. (1979). "Prospect Theory: An Analysis of Decision under Risk".
Framing/epidemic problem. Tversky and Kahneman 1981.
Mental accounting – Heath and Soll 1996 (journal of consumer research)
Decoy effect -- Huber, Joel; Payne, John W.; Puto, Christopher (1982)
Intransitive preferences – Tversky 1969
Language structure
Eimas et al 1971 on categorical perception in infants
Werker and Tees 1984 on losing the ability to distinguish non-native speech category contrasts.
Bias towards hearing clicks at phrase boundaries -- Ladefoged and Broadbent (1960) and Fodor and Bever (1965)
Language comprehension
When listening to or reading words, people search for and activate candidate words in parallel rather than serially. Tanenhaus et al 1979
Cohort theory vs TRACE model of word recognition: Allopenna et al 1998
Evidence for interactive theory of sentence processing: i. Visual world context: Tanenhaus et al. (1995); Trueswell et al. (1999) ii. Verb bias: Snedeker and Trueswell 2004 iii. Prosody: Snedeker and Trueswell 2003 iv. Real world knowledge: Chambers et al 2004
Language acquisition
Word segmentation i. Conditional probabilities between syllables: Saffran et al 1996 ii. Stress patterns: Jusczyk et al 1999
Word learning. How to aggregate information across word-referent pairings?
Global cross-situational word learning: Yu and Smith 2007
Hypothesis testing/Propose but verify: Trueswell et al 2013
Hybrid Pursuit model, developed by Charles and others at Penn. ii. Noun bias: Bates et al 1994 iii. Human simulation paradigm: Gleitman et al 1999 iv. Syntactic bootstrapping: Naigles, 1990; Hirsh-Pasek et al 1996; Yuan and Fisher 2009
Rule learning and regularization i. Artificial language learning and the Tolerance Principle: Schuler et al 2021 😉 ii. Deaf children exposed to sign language early outperform their hearing parents (Newport, 1990)
Language and thought
Effect of cross-linguistic differences on color perception (or perhaps just decision-making!) – Winawer et al 2007
“Whorf hypothesis is supported in the right visual field but not the left” – Gilbert et al 2006
“Does categorical perception in the left hemisphere depend on language?” – Holmes and Wolff 2012
Can you represent spatial relations (e.g., left/right) if you don’t have words for them? See Brown and Levinson 1993 for ‘yes’. i. But then see Li and Gleitman 2002 for counterargument with Penn undergrads!
Neuroscience – Methods
Dead salmon fMRI study, showing risks of research degrees of freedom – Bennett et al (2009, NeuroImage)
Different neurons/patches of superior temporal gyrus encode different phonetic features (evidence from direct electrode recordings) -- Mesgarani et al 2014
Neuroscience – Plasticity
Long term potentiation – Bliss and Lømo 1973
Beatrice Gelber studies on conditioning in single-celled organisms (paramecia) -- Gershman et al 2021 in eLife.
Critical periods: i. Hubel and Wiesel 1964 studies with suturing cat eyes shut, on critical period effects in visual development… ii. Newport 1990 showing age of acquisition/critical period effects of language acquisition in Deaf people acquiring ASL (who don’t have another L1, thus suggesting critical period effects in learning additional languages as an adult are not merely interference from L1).
Ferrets with visual pathway rewired to auditory cortex can still see (i.e., brain areas have some flexibility in the inputs they can take). von Melchner et al 2000
London cab drivers have greater grey matter volume in the posterior hippocampus than London bus drivers. Maguire et al 2006 argue that needing to navigate flexibly increases hippocampal volume.
Cognitive Development - Object perception
Infants perceive objects as unitary (if two things move in tandem but their connection is occluded, infants assume a single object) -- Valenza, E., Leo, I., Gava, L., & Simion, F. (2006). Child Development, 77, 1810–1821.
Infants know objects persist when they disappear (object permanence): Wang et al 2004
Infants know objects move continuously through space and time (Aguiar & Baillargeon, 1998; Johnson et al 2003)
When objects violate physics, infants test them appropriately (drop objects that defy gravity; bang objects that pass through others)…Stahl and Feigenson 2015.
Infants detect shape changes first in development, then pattern, then color (Wilcox, 1999)
Cognitive Development - Understanding of agents
Infants can use repeated reaching to infer a goal – Woodward 1998
Infants expect goal-directed actions to be efficient – Liu et al 2019; Gergely and Csibra 2002
Infants expect successful agents to be happy – Skerry and Spelke 2014.
Do babies prefer agents that help, over agents that hinder? Hamlin et al 2007 says ‘yes’. But see failed replications, Schlingloff et al. 2020, Salvadori et al 2015! (FYI, there is an ongoing multi-site replication: https://manybabies.github.io/MB4/)
Reinforcement Learning
Blocking effect: Suppose an animal already knows that A (chime) predicts B (food). If X (light) is presented simultaneously with A, animal won’t learn association between X and B. Kamin 1968.
Dopamine spikes don’t accompany receipt of reward; they reflect changes in expectations of reward/punishments! Schultz et al 1997
Concepts and categories
Production Tasks (list members of a category). List examples of birds: “robin, sparrow…..parrot…ostrich, penguin” people are relatively consistent (typical first)
Rating tasks (rate members of a category) people in agreement about what is typical member
Sentence Verification “A robin is a bird” is faster than “A penguin is a bird”
Picture Identification: people are faster to identify typical members of a category. Posner and Keele 1968 v. Missing prototype effect – Posner and Keele 1968
Odd (even) numbers show prototypicality effects in rating tasks and sentence verification tasks! Armstrong, Gleitman, and Gleitman 1983.
Blind and sighted people have similar organization of visual verbs. Bedny et al 2019.
Number cognition
Monkeys can count! Brannon and Terrace, 1998; Cantlon and Brannon, 2006.
So can infants! Feigenson et al. (2002)
Ants count their steps in order to navigate! Wittlinger et al 2006
Number words/symbols seem to be necessary to exactly represent large number concepts (like 56). Two papers: i. Exact and Approximate Arithmetic in an Amazonian Indigene Group. Pica et al 2004. ii. Number cognition in Deaf Nicaraguan homesigners. Spaepen et al 2011.
Collective cognition and behavior
The spread of behavior in an online social network – Centola (2010). Behavior spreads more quickly in social networks with clusters, than in social networks where people are connected randomly, even though it takes fewer steps to get between any two people in a random network!
Due on Gradescope by May 10th at 11:59pm (no late submissions possible: grades are due May 12!)
Your final project submission will essentially be a more formal, cleaned up report of your preregistration and project checkpoint 3. Your submission should be a Google Colab notebook (see sample!) and include the following sections:
Introduction - in the introduction, provide a brief background and summary of the research question(s) addressed in the original paper. Describe (again very briefly) how the researchers addressed the question(s) in their original work. Then describe which aspect of the paper your group chose to replicate.
Method - your method section should include at least 3 subsections
Participants - describe the participants in the portion of the study you are replicating and how you simulated that participant structure (if your data was not available). Include the R code that generates the participant structure (or summarizes it if your data was available) and show this in a table with R code.
Procedure - describe in more detail the procedure of the portion of the study you are replicating. What did the researchers do (or have their participants do) in the experiment? On each trial? What were the stimuli like? Did the researchers code or summarize the data in any way? Explain that here. Include the R code that generates the data, including the trial-by-trial data. If you imported data, include R code that generates tables to indicate the number of trials, etc.
Analysis - describe in more detail the analysis you and your group have opted to conduct. First, restate the research question you are addressing and its null hypothesis. You should describe your planned analysis, which should include at least a linear or logistic regression to test your null hypothesis. If you will remove outliers or trim data before analysis, describe that here. Include the R code for outlier removal and model building here.
Results - begin by recreating the figure (or creating a figure if one did not exist) that summarizes your research question. Then run the anova() function on your model in R and interpret the results in text (Chapter 15 in the book is helpful for this!). Then, use summary() to get the regression coefficients and interpret those results in the text (Chapter 7 in the book is helpful for this!). Finally, get your model's predictions with the predict() function and add the model predictions to your original figure.
Conclusions - Finally, include a few sentences to remind the reader what you set out to do, what you found, and whether the results from your model were the same as or different from the original research finding. Briefly summarize what this might mean.
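To make the Method, Analysis, and Results steps above more concrete, here is a minimal sketch in base R. It is not the course's sample submission: the two-condition design, the variable names (`participant`, `condition`, `score`), and the simulated effect size are all hypothetical and chosen only for illustration. A real project would replace the simulation with the structure of the study being replicated.

```r
set.seed(1)  # make the simulated data reproducible

# Hypothetical participant structure: 40 participants, 20 per condition.
n <- 40
d <- data.frame(
  participant = seq_len(n),
  condition = rep(c("control", "treatment"), each = n / 2)
)

# Simulate one score per participant; the treatment group is shifted by 5.
d$score <- 50 + 5 * (d$condition == "treatment") + rnorm(n, sd = 10)

# Show the participant structure as a table.
print(table(d$condition))

# Linear regression; the null hypothesis is no difference between conditions.
model <- lm(score ~ condition, data = d)

print(anova(model))    # F test for the condition term
print(summary(model))  # regression coefficients

# Model predictions, one per row, ready to add to a figure.
d$predicted <- predict(model)
```

Because the only predictor is a two-level factor, `predict()` returns just two distinct values here: the fitted mean of each condition.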
Here is a sample final project submission, based on the Petitto & Marentette 1991 article you all read for Project Checkpoint 01, plus a blank R notebook if you'd like to start there: