Homework 7: Studying Crowdsourced Annotations

Due Thursday April 18, 11:59PM EST (Revision: May 13)

Links: Gradescope · Data Spreadsheet · Late Day Request Form (accepted up to 24 hours before the deadline)
In this homework assignment, you will perform a “bias audit” of an NLP dataset produced by crowdsourcing. You will attempt to measure the presence of social stereotypes in this dataset that may have harmful effects if the dataset is used to train classifiers for downstream tasks.
You will use pointwise mutual information (PMI) to find which associations are being made with identity labels. PMI can be used as a measure of word association in a corpus, i.e., how much more often two words co-occur than would be expected from their individual frequencies alone. See the PMI Wikipedia page for more details. Here we use PMI to measure which words co-occur with labels for identities, which lets us see associations that may perpetuate stereotypes.
After this analysis, you will present specific examples from the data that you speculate could be particularly biased and problematic.

Learning Goals:
Once you complete this assignment, you should:
  • Understand the ethical implications of soliciting crowdsourced data, specifically the social biases that may emerge when workers are asked to generate sentences
This assignment is connected to the following overall learning goals of the course:
  • Be able to identify pros and cons of various data collection and labeling practices that are commonly used in NLP, including data scraping and crowdsourcing
  • Be exposed to ways in which language technology can perpetuate stereotypes and biases related to minoritized groups, and inequity among different languages
  • Demonstrate your ability to engage with recent research papers in NLP
Submit these files: report.md
Leaderboard: There is no leaderboard for this assignment.
Credits: This assignment is adapted from an assignment made by Yulia Tsvetkov for her computational ethics for NLP course.

Typically, I have no way of knowing whether your Gradescope submissions are finalized, so I wait to grade them until the deadline. If you’d like to have a revision for this homework graded early, please fill out the CS 457 Revision Grading Request form. I’d like to try to get revisions graded as soon as possible as the semester wraps up so that you can dedicate time to finishing your project!

Implementation Details

This assignment doesn’t require any programming. Instead, you will focus on your report, in which you’ll reflect on stereotypes in data collected through crowdsourcing.

I have provided a spreadsheet with the ordered word association lists required for your analysis. The method for deriving these lists is written out below. You do not need to implement this method, but reading through this will be useful when you write about limitations of the method in your report!

The word associations are computed using pointwise mutual information (PMI) between unigram frequencies in the SNLI dataset. PMI is computed as follows:

Let $c(w_i)$ be the count of word $w_i$ in the corpus and $c(w_i, w_j)$ be the number of times that $w_i$ and $w_j$ occur in the same premise or hypothesis. If two words co-occur more than once within a single premise or hypothesis, that still counts as only one co-occurrence. With $N$ as the number of documents (premises or hypotheses) in the corpus, we define $P(w_i)$ as the word frequency $c(w_i) / N$. Then PMI is:

\[PMI(w_i, w_j)=\log_2\frac{P(w_i, w_j)}{P(w_i)P(w_j)}=\log_2\frac{N \cdot c(w_i, w_j)}{c(w_i)c(w_j)}\]
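
As a quick sanity check on the formula, here is a worked example with invented numbers (these are illustrative only, not drawn from SNLI): suppose the corpus has $N = 100{,}000$ premises and hypotheses, with $c(\text{woman}) = 5{,}000$, $c(\text{nurse}) = 200$, and $c(\text{woman}, \text{nurse}) = 50$. Then:

\[PMI(\text{woman}, \text{nurse}) = \log_2\frac{100{,}000 \cdot 50}{5{,}000 \cdot 200} = \log_2 5 \approx 2.32\]

Under independence we would expect the pair to co-occur in only 10 documents, so a PMI of about 2.32 says the pair appears 5 times more often than chance.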

PMI is computed between each word in the identity list in our spreadsheet and all other words in our vocabulary. The vocabulary is a filtered version of all of the words in our corpus; specifically, it includes only non-stopwords that appear at least 10 times in the corpus. All text is lowercased as part of the preprocessing, and duplicate premises and hypotheses are removed before counting. PMI is calculated separately for identity terms in the premises, which are the original captions from the Flickr30k image captioning dataset, and identity terms in the hypotheses, which were elicited in a crowdworking task. In your write-up, you will compare the associations made in each.
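
For concreteness, here is a minimal Python sketch of the pipeline just described. You do not need to run (or write) any code for this assignment; the whitespace tokenizer, stopword list, and `MIN_COUNT` threshold below are simplifying assumptions, not the exact choices used to produce the spreadsheet.

```python
import math
from collections import Counter

# Assumed stand-ins for the real preprocessing choices.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "in", "with"}
MIN_COUNT = 10

def top_pmi(documents, identity_terms, k=10):
    """Return the k highest-PMI vocabulary words for each identity term."""
    # Deduplicate documents, lowercase, and tokenize (naively, on whitespace).
    # Using a set per document means repeated words count once, as described.
    docs = [set(doc.lower().split()) for doc in set(documents)]
    n = len(docs)

    # c(w): number of documents containing each word.
    counts = Counter(w for doc in docs for w in doc)

    # Vocabulary: non-stopwords appearing at least MIN_COUNT times.
    vocab = {w for w, c in counts.items() if c >= MIN_COUNT and w not in STOPWORDS}

    # c(w_i, w_j): number of documents containing both words.
    cooc = Counter()
    for doc in docs:
        for ident in identity_terms:
            if ident in doc:
                for w in doc & vocab:
                    if w != ident:
                        cooc[ident, w] += 1

    # PMI(w_i, w_j) = log2(N * c(w_i, w_j) / (c(w_i) * c(w_j))),
    # keeping only pairs that actually co-occur.
    results = {}
    for ident in identity_terms:
        scores = [
            (w, math.log2(n * cooc[ident, w] / (counts[ident] * counts[w])))
            for w in vocab
            if cooc[ident, w] > 0
        ]
        results[ident] = sorted(scores, key=lambda s: s[1], reverse=True)[:k]
    return results
```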

The spreadsheet shows the 10 words with the highest PMI for each identity label.¹

Required Reading

Read about what crowdworkers were asked to do in constructing the SNLI corpus in the SNLI paper. You need to read Section 2.1 to complete this assignment, but you may find it useful to read other parts of the paper to learn more about NLI and the crowdsourcing process. Come up with at least one idea about how the designers of the crowdsourcing task might have mitigated any social bias you may find in your analysis. For example, are there certain topics that often led to biased hypotheses? Could the task have been structured differently or different instructions given to mitigate bias?

Resources

Data

The text data from which these scores were computed is the Stanford Natural Language Inference (SNLI) dataset. It’s very large, so I don’t recommend downloading it onto your personal computer. To see a subset of the data as you write your analysis, you can look at it in the Hugging Face dataset viewer.

On Tuesday afternoon there was an error message when searching in the Hugging Face dataset viewer. If you are seeing this error message, there is a workaround posted on edstem. Unfortunately it involves downloading the large files to your computer, sorry!
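
If you would rather inspect examples programmatically, one lightweight option (separate from the posted workaround) is to stream the dataset with the Hugging Face `datasets` library, which avoids downloading the full files. The dataset id and field names below are my assumptions based on the public SNLI release:

```python
from datasets import load_dataset

# Stream SNLI rather than downloading it in full; "snli" is the public
# dataset id on the Hugging Face Hub. Labels are integer codes
# (0 = entailment, 1 = neutral, 2 = contradiction in the public release).
snli = load_dataset("snli", split="train", streaming=True)
for example in snli.take(5):
    print(example["premise"], "|", example["hypothesis"], "|", example["label"])
```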

The other resource used is our list of identity labels, which is based on Rudinger et al. (2017), Social Bias in Elicited Natural Language Inferences.

Deliverables

Your report should answer the following questions and include the basics section. You can download a template report here.

Find specific hypotheses from the dataset where an identity label occurs together with one of its top-associated terms, whether or not the pairing shows social bias.

Look at 1–2 examples for at least 5 different identity labels. Also note the label (entailment, contradiction, neutral) and consider whether asking annotators for certain types of inference has an impact on what they write.

This example shows the expected format:

* Identity Label: thai
  * Term: thailand
  * Premise: A group of Asians are eating outside with one passing another a napkin.
  * Hypothesis: Tourists from Thailand eating Phad Thai.
  * Label: neutral

Provide your five examples in the space provided. You may give more than one example for an identity label if you would like, but make sure to include at least five different identity labels!

  • Identity Label:
    • Term:
    • Premise:
    • Hypothesis:
    • Label:
  • Identity Label:
    • Term:
    • Premise:
    • Hypothesis:
    • Label:
  • Identity Label:
    • Term:
    • Premise:
    • Hypothesis:
    • Label:
  • Identity Label:
    • Term:
    • Premise:
    • Hypothesis:
    • Label:
  • Identity Label:
    • Term:
    • Premise:
    • Hypothesis:
    • Label:

A discussion of stereotypical associations made in the SNLI dataset, or of their absence.

If you see any, discuss differences between the premises and hypotheses with respect to these stereotypes. Give specific examples of data where stereotypical associations are being made, and explain what potential harm reinforcing such a stereotype in a natural language inference dataset may have.
Expected length: at least 200 words

A brief discussion of what steps may be taken to mitigate this effect when using crowdsourcing to create and annotate NLP datasets.

For example, what instructions could be given to crowdworkers, and, more importantly, how might a crowdsourcing task be structured to elicit fewer stereotypes? Do you think asking crowdworkers for free-form generated sentences will always produce responses that reproduce stereotypes, or are there ways to provide context that may lessen that effect? How much of any bias you see is due to the original premises provided to annotators?
You may answer some or all of the questions above, but if you answer only one question, you should go into more depth. Expected length: at least 200 words.

What are some limitations of the method used to generate word lists with PMI that you noticed while performing the analysis?

Expected length: at least 25 words

Submission

Submit your report.md on Gradescope. Make sure to add your partner if you worked with one on this assignment.

¹ Identity labels that do not meet the requirements to be included in the vocabulary are excluded.