Homework 6: Dictionaries, Data Representations

Due Friday, November 10, 11:59PM EST

In this assignment, you’ll use dictionaries and iteration to solve several problems involving text analysis: the use of computational tools to learn more about bodies of text. You’ll also further explore binary representations of data.

Below is a suggested timeline to complete the assignment, which leaves plenty of time before the due date for debugging if necessary:

Date    Tasks to Complete
11/04   Read through the assignment and start problem 1
11/05   Complete problem 1, pass problem 1 tests on the autograder
11/06   Complete problem 2, pass problem 2 tests on the autograder
11/07   Complete problem 3, pass problem 3 tests on the autograder
11/08   Complete problems 4 and 5, pass problem 4 and 5 tests on the autograder
11/10   DEADLINE! Finish debugging and submit your final code. Attend drop-in hours if you’ve had problems with any of the previous parts. Make sure to review your code for code quality.

These are soft deadlines that are not part of your grade, but I encourage you to stick to this timeline if you’ve struggled to complete the homework assignments on time. Being ahead of this timeline is great!

Downloading starter file

Start by downloading the homework 6 starter file here.

Problem 1: to_binary

In class, we practiced converting integers between binary and decimal representations. In this problem, you’ll write a function called to_binary() that converts a decimal integer to a binary representation. This function should accept an integer argument and return a string of 0s and 1s. For example:

to_binary(16) 
>>> "10000"
to_binary(12345) 
>>> "11000000111001"

See slide 17 from October 26 for the algorithm for converting from decimal to binary.

HINT: This is a good application for a while-loop. It is possible to do this problem using recursion instead of a loop.
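
One way to check your results is to compare them against Python’s built-in conversions. The sketch below is a checking aid, not the algorithm from the slides: the helper name expected_binary is hypothetical, and it assumes a non-negative integer. bin() returns a string with a "0b" prefix, which the slice removes, and int(s, 2) converts a binary string back to an integer.

```python
def expected_binary(n):
    """Return the binary string Python itself produces for a non-negative int n."""
    # bin(16) is "0b10000"; slicing off the first two characters drops the "0b" prefix.
    return bin(n)[2:]

# Both examples from this problem agree with the built-in conversion.
print(expected_binary(16))
print(expected_binary(12345))

# Converting back: int() with base 2 turns a binary string into a decimal integer.
print(int("10000", 2))
```

If your to_binary() is working, to_binary(n) == expected_binary(n) should be True for any positive n you try.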

Problem 2: count_words

Introduction

To start this problem, find a large file of plain text that you would like to study. One easy way to find such a file is to visit Project Gutenberg’s list of popular books.

Click a title you’d like to use, and find the Plain Text UTF-8 version. See an example of what it should look like here.

Choose File –> Save As, and save the file with the filename story.txt. You should save it in the same directory as your homework file. You can open this file in a text editor. You might wish to do so and delete the language at the beginning and end of the file about Project Gutenberg, so that you can focus on the text.

We are going to study word frequencies in this text. By the end of this sequence of problems, you’ll be able to print a list describing the most common words in the text.

The load_words() function defined below loads your file and divides it into a list of lowercase words with punctuation removed. For example, if your file were very small and contained only the words “The cat sat on the mat”, we would have:

words = load_words("text.txt")
words
>>> ["the", "cat", "sat", "on", "the", "mat"]

import string

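def load_words(path):
    """Return the words in the file at path, lowercased and without punctuation."""
    with open(path, "r", encoding="utf8") as f:
        s = " ".join(f.readlines())
    # Delete every ASCII punctuation character listed in string.punctuation.
    for punc in string.punctuation:
        s = s.replace(punc, "")
    s = s.lower()
    # Split on whitespace (spaces and newlines) to get the individual words.
    words = s.split()
    return words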
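
To see what the cleaning steps do without opening a file, you can run the same steps on a short string. This is a minimal sketch: the helper name clean_text is hypothetical and is not in the starter file. Note that apostrophes count as punctuation, so a contraction like “Don't” becomes “dont”.

```python
import string

def clean_text(s):
    """Apply the same cleaning steps as load_words() to a string instead of a file."""
    # Remove each ASCII punctuation character, one at a time.
    for punc in string.punctuation:
        s = s.replace(punc, "")
    # Lowercase, then split on whitespace to get a list of words.
    return s.lower().split()

print(clean_text("The cat sat on the mat."))
print(clean_text("Don't move!"))
```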

Your task

Now, write a function called count_words() whose argument is a list of words and whose return value is a dictionary where each key is a word and each key’s value is the number of times that word appears in the list. This is sometimes called a concordance. For example, using the words variable from above:

count_words(words)
>>> {"the" : 2, "cat" : 1, "sat" : 1, "on" : 1, "mat" : 1}
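
If you want something to compare your output against, the standard library’s collections.Counter builds the same kind of word-to-count mapping. The sketch below is only a checking aid under that assumption; it is not part of the starter file and is not a substitute for writing your own loop.

```python
from collections import Counter

words = ["the", "cat", "sat", "on", "the", "mat"]

# Counter tallies each item in the list; dict() converts the result to a plain dictionary.
expected = dict(Counter(words))
print(expected)
```

Two dictionaries compare equal when they have the same keys and values, so count_words(words) == expected is a quick test of your function.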

HINT: This problem is closely related to Example 3 from our reading on dictionaries.

You will likely find this example to be helpful in the next several problems.

HINT: This is a problem that could be addressed using recursion, but Python places a limit on how many times a function can be called recursively. Because of this, I recommend that you use a loop instead.
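
The recursion limit mentioned in the hint can be observed directly. This is a small illustration with a hypothetical function, count_down, that makes one nested call per step; asking for far more nested calls than the limit allows raises a RecursionError. A book-length list of words would run into the same problem if each word triggered another recursive call.

```python
import sys

def count_down(n):
    """Make n nested recursive calls and return n."""
    if n == 0:
        return 0
    return 1 + count_down(n - 1)

# The current limit on nested calls (the exact value depends on your Python setup).
limit = sys.getrecursionlimit()
print(limit)

# Requesting ten times more nested calls than the limit allows fails.
try:
    count_down(10 * limit)
    hit_limit = False
except RecursionError:
    hit_limit = True
print(hit_limit)
```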

Problem 3: remove_stopwords

A stopword is a word that is considered to be uninteresting for text analysis. Examples of English stopwords include “the,” “but,” “and,” “her,” and so on. You can find a list of common stopwords in the hw6.py file (the variable STOPWORDS).

Write a function called remove_stopwords(). This function should take two arguments:

  • A dictionary of counts, such as would be returned by count_words()
  • A list of stopwords.

Your function should return a dictionary of counts, NOT INCLUDING stopwords.

With the example before:

stopwords = ["the", "on", "and"]
d = count_words(words) 
d = remove_stopwords(d, stopwords)
d 
>>> {"cat" : 1, "sat" : 1, "mat" : 1}

The counts for "the" and "on" have been removed because they are stopwords.
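
One pitfall to be aware of, in case your approach deletes keys: Python does not allow a dictionary to change size while a for-loop is iterating over it. The sketch below uses made-up data to show the error. Building a new dictionary, or looping over something other than the dictionary you are modifying, avoids it.

```python
d = {"the": 2, "cat": 1, "on": 1}
stopwords = ["the", "on"]

# Deleting from d while looping over d itself raises a RuntimeError.
try:
    for word in d:
        if word in stopwords:
            del d[word]
    raised = False
except RuntimeError:
    raised = True

print(raised)
```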

Problem 4: print_top_words

Write a function called print_top_words(). This function will print (NOT return) the words with the highest counts in the data set. This function should accept two arguments:

  • d, a dictionary of counts such as would be returned by count_words() or remove_stopwords()
  • num_words, the number of words to print

The function should print the top num_words words in the data set, in descending order of count, along with their counts. For example, I obtained these counts on the text of Grimm’s Fairy Tales:

print_top_words(d, 10)

Output:

little               388 
away                 278 
king                 264 
man                  214 
old                  201 
time                 184 
day                  181
come                 170 
home                 170 
shall                168

For full credit, PLEASE USE RECURSION. It’s also possible to do this problem using a loop; loop-based solutions will receive most but not all of the credit. Your recursion-based solution might use both recursion and a loop - that’s fine in this case!

To do this recursively:

  • if the input dictionary is not empty and num_words is positive:
    • Find the word with the highest count in the dictionary, and print it.
    • Remove that word from the dictionary.
    • Reduce num_words by 1.
    • Call print_top_words() with the modified dictionary and reduced num_words.

HINTS:

  • The implementation has some similarities with what you did for matched_min on Homework 4.
  • print(f"{word:20} {count}") will print your word and count in a pretty way, as shown above.
  • It’s ok if your function prints MORE than the specified number of words, in case some words are tied for the last spot (for example, tied for 10th when num_words is 10).
  • The largest value in the dictionary d can be found using max(d.values()).
  • d.pop() can be used to remove key-value pairs from dictionaries.
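
The tools from these hints can be tried out on a small dictionary before you combine them. This sketch uses made-up data and shows each piece in isolation; it is not a full solution.

```python
d = {"cat": 1, "the": 2, "mat": 1}

# max() over the values gives the largest count (not the word that has it).
largest = max(d.values())
print(largest)

# pop() removes a key and returns the value that was stored under it.
removed = d.pop("the")
print(removed)
print(d)

# The :20 in the f-string pads the word with spaces to a width of 20 characters.
word = "little"
count = 388
line = f"{word:20} {count}"
print(line)
```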

Problem 5: putting it all together (summarize_file)

Using the functions that you implemented for problems 2-4, along with the load_words function, write a function summarize_file that takes a string filename and a list stopwords as input. Your function should:

  • Use load_words to read the file and create a list words
  • Use count_words to convert words to a dictionary counts
  • Use remove_stopwords to create a new dictionary clean_counts without stopwords
  • Use print_top_words to print the 10 top words from clean_counts

In the end, you should be printing the 10 most frequent words from filename that are not stopwords! Your function shouldn’t return anything.

Test your function on your "story.txt" file to make sure everything is working before submitting to Gradescope!