Homework2

Regular expressions

DUE: February 8, 2013 by 1pm

Preamble

Your starting point is a set of stub programs that are in the course repository. Clone it or pull the latest and look into the applied-nlp/homework/hw2 directory. These are incomplete implementations that handle a few things for you (such as code to test your code), and have comments embedded in them to help you solve the problems.

Your submission will be a zipped up file with your implementations for each problem in it. Submit it on Blackboard as __hw2.zip. Please follow the same instructions as for Homework1 with respect how to package up the solutions before submission.

If you don’t absolutely need to have words on paper, don’t print this homework description out. It has a lot of example output that makes it lengthy, and it will be easier to do some cutting and pasting etc right from the webpage.

1. Finding syllables

This problem uses the same text passage from H.G.Wells, The War of the Worlds that was used in Homework1:

After the glimpse I had had of the Martians emerging from the cylinder in which they had come to the earth from their planet, a kind of fascination paralysed my actions. I remained standing knee-deep in the heather, staring at the mound that hid them. I was a battleground of fear and curiosity.

I did not dare to go back towards the pit, but I felt a passionate longing to peer into it. I began walking, therefore, in a big curve, seeking some point of vantage and continually looking at the sand heaps that hid these new-comers to our earth. Once a leash of thin black whips, like the arms of an octopus, flashed across the sunset and was immediately withdrawn, and afterwards a thin rod rose up, joint by joint, bearing at its apex a circular disk that spun with a wobbling motion. What could be going on there?

For this problem, you will perform series of manipulations and operations on this passage. Modify the stub program syllables.scala, looking at the comments in it for guidance. See the output at the end of this question for guidance on what your program’s output should be.

(a) Identify how many words in the War of the Worlds passage are “one syllable words.” For this problem, use the (imperfect) heuristic that one syllable words consist of zero or more consonants, followed by one or more vowels, followed by zero or more consonants (i.e., there is only one group of vowels in the word). In answering this question, you should not distinguish between capitalized and non-capitalized words. The vowels to use are a, e, i, o, and u. All other lowercase letters are to be considered consonants.

For this part, use a regular expression on the raw text String, and output the number of one syllable words. You should express the above definition of a vowel pattern, and use \b to indicate a word boundary. Don’t forget to handle capitalization (so cat, Cat and CAT are all one syllable words.) Also, remember that new-comers does indeed contain a one-syllable word since - is a boundary.

Hint: Use the method findAllIn of Regex to obtain all the matches as a List.

(b) Now, identify how many one syllable words (using the same definition as for (a) above) there are in the passage, but this time working with the passage as a List of Strings split on whitespace. You should get the same answer as for (a).

Goal output: (with ### substituted for actual numerical values)

$ scala syllables.scala

Part (a)
Number of one syllable words on full String: ###

Part (b)
Number of one syllable words on split String: ###

2. Basic regular expressions

Do problem 2.1 on pages 42-43 of Jurafsky and Martin. Put your solutions in the stub file regex.scala. As you work on them, run regex.scala and you’ll see that it is checking the accuracy of your regular expressions. See the comments in the stub file for details.

3. Regular expressions for capturing date expressions

There are a lot of different ways that you can express a particular date, e.g.:

September 7, 2011
09-07-2011
9/7/2011
the 7th of September, 2011

One thing one could do with a system that can reliably capture date expressions: to build an application that identifies meeting times in your emails and automatically transfers them to your calendar. (But note that such an application would have to be extremely reliable to be useful.)

In this problem you will use regular expressions to recognize dates, extract their components, and normalize them to a standard YEAR-MONTH-DAY format, e.g. 2011-9-7.

For this problem, modify the stub program dates.scala.

(a) Write a regex FullFormDate that covers date expressions of the form “MONTH DAY, YEAR”, where

MONTH is one of January, February, . . . , December
DAY is a number between 1 and 31 (You do not need to check that you don’t match February 31 and similar impossible dates.)
YEAR is a number between 1900 and 2099, so you only need to check years in the 1900s or 2000s

(b) Write a regex ShortDate that covers date expressions of the form M/D/Y and M-D-Y where

M is a number 1 to 12, with single digit numbers optionally preceded by 0, e.g. 01, 02.
D is a number 1 to 31, with single digit numbers optionally preceded by 0, e.g. 01, 02.
Y is a number 1900 to 2099

Note: You should not match patterns like M/D-Y or M-D/Y.

(c) Write a regex OrdinalDate that covers date expressions of the form “the ORDINAL of MONTH, YEAR” where ORDINAL is one of 1st, 2nd, 3rd, 4th to 31st (with appropriate use of ‘st’, ‘nd’, ‘rd’, ‘th’). Make sure not to accept 0th, 33rd, etc. MONTH and YEAR are as in part (a).

(d) Write a function normalizeDate that takes as input a date in any of the formats above and outputs it in normalized format YEAR-MONTH-DAY, e.g. /September 7, 2011/, /09-07-2011/ and /the 7th of September, 2011/ all are normalized to /2011-9-7/. Then, apply this function to the elements of the command line args array such that it is possible to run the program as following:

$ scala dates.scala "September 7, 2011" "12/27/1911" "4-18-2004" "the 21st of November, 1924"
2011-9-7
1911-12-27
2004-4-18
1924-11-21

You should test your program with other date expressions and make sure it does the right thing.

4. Finding names in the works of Mark Twain

This problem involves working with the Entire Works of Mark Twain from Project Gutenberg, which is available as twain.txt in the applied-nlp/data/gutenberg directory.

(a) The name of the file containing the works of Twain should be entered as the first argument on the command line, and your program should read the contents of that file and assign it to the variable book. Notice that there is material about Project Gutenberg at the beginning and end of each file, and that there are consistent strings indicating where the actual text of a book starts and ends. Write a regular expression JustTextRE that matches these delimiters and captures the text in between them (as a regex group). Apply this regular expression to the text and obtain the matched text such that you can save it to a variable text.

Tip: you should print out the text and verify that it is correct, though you should turn this output off once you’ve verified it is working.

Tip: you'll need to preface your regular expression with (?s) so that the "any" character . matches newlines.

(b) Write a regular expression for identifying names with titles in the text like Mr. John Brown, Mrs. Smith and Prof. William G. Jones. You should handle Dr, Mr., Mrs., Miss, Ms., Prof. and Rev. Apply it to the text and print out each match to standard output. You’ll need to use (?s) again.

Hint: Check the output you are getting, and make sure you aren’t getting new lines between a title and the name. If you are, fix it -- easily done by mapping a fixer function over the match results.

(c) Use names.scala to process twain.txt, piping its output directly to the Unix utilities we looked at in class to obtain a list of all the names and their counts in reverse sorted order. See the slides on Unix pipelines for help. Save the output as names_twain.txt and include it with your submission.

Tip: use a Unix redirect to send the output of the counting and sorting Unix pipeline to a file.

(d) In hw2_answers.txt, provide the ten most common names as output in part (c). Also, state what the count of Mr. Twain is.

(e) Discuss some of the obvious errors and why you think they happened. Make sure to note the error Rev. Dr being recognized as a name and explain why it happened.

5. Extracting articles from the Federalist Papers

This problem involves working with the Federalist Papers from Project Gutenberg, which is available as the file federalist.txt in the applied-nlp/data/gutenberg. You'll modify the stub file articles.scala. This will form the basis for your work on Homework3, as you’ll extend this solution to extract the values from these texts to create the input for clustering algorithms to explore authorship in the Federalist Papers.

(a) The name of the file containing the Federalist Papers should be entered as the first argument on the command line, and your program should read the contents of that file and assign it to the variable book. Use the regular expression JustTextRE from part (a) of the previous problem to pull out the text of the Federalist papers (just copy your solution to articles.scala). Then, use the split method on strings to obtain an Array of the raw articles, which you can do by splitting on the lines that have the form FEDERALIST. No.. Here’s an example that has a similar structure to it -- the key thing is to realize that the argument to split is a regular expression.

scala> val dialogue = "A simple dialogue. A: Hi. B: Hello. A: So, how are you? B: Fine. And you? A: Great. B: Goodbye. A: See you."

dialogue: java.lang.String = A simple dialogue. A: Hi. B: Hello. A: So, how are you? B: Fine. And you? A: Great. B: Goodbye. A: See you.

scala> dialogue.split("[AB]:")

res0: Array[java.lang.String] = Array("A simple dialogue. ", " Hi. ", " Hello. ", " So, how are you? ", " Fine. And you? ", " Great. ", " Goodbye. ", " See you.")

Assign the Array obtained by doing the split to the variable rawArticles, and then print the length of this Array. It should have 87 elements.

(b) There are 85 articles in the Federalist Papers. Find out why part (a) produces 87 elements and write the reasons down in hw2_answers.txt.

(c) The articles have a lot of information in a semi-structured format. For example, the beginning portion of article 34 has the article id, followed by the title, then the publication venue, the date, the author, the addressee, and then the text itself.

FEDERALIST No. 34



The Same Subject Continued

(Concerning the General Power of Taxation)

From the New York Packet.

Friday, January 4, 1788.



HAMILTON



To the People of the State of New York:

I FLATTER myself it has been clearly shown in my last number
that the particular States, under the proposed Constitution, would
have COEQUAL authority with the Union in the article of revenue,
<...snipped...>

However, there are some minor variations in how these various elements are expressed (including at times being missing). Your job is to convert this semi-regular format into a Map from attributes like 'author', 'title', 'venue', 'date', ‘addressee’ and 'text' to their values. You will do this with a regular expression ArticleRE that you will design to match the elements of each article. With that regular expression, convert the Array[String] variable rawArticles into an Array[Map[String,String]] variable called articles.

Notes and tips:

It will be useful to look at the example output in part (d) before attempting this part.
You should create the articles Array by flat mapping over rawArticles, using ArticleRE to return a Some(Map(...)) when it matches and a None when it doesn’t. Note that ArticleRE must match every one of the articles.
While building the ArticleRE regular expression, start by matching the id and letting the remaining material be matched with .+, calling the latter the text (even though initially it will contain the title and more). Then extend it to the title, then to the venue, and so on, making sure that each new element you add doesn’t start "unmatching" articles. (Check this by printing the length of articles, which should be 86.)
Obviously, you’ll need to use capture groups to extract the values that will populate the Map representing each article. Some of the values will be empty.
There are some very minor variations that are likely to cause you headaches. This is a real text, so this is representative of the actual sorts of semi-regular patterns you’ll need to deal with in the wild, so look at this as necessary character building on your way to regex ninja-hood. You can minimize headaches by following the incremental regex building strategy outlined above.
Make sure to capture the date on article 42 correctly.
When you add the values to the Map, you should remove newlines from the venue, title, date and author. You can use the replaceAll method on Strings for this.

A note on optionality and capture groups that will matter for your solution -- note the difference in the following examples:

scala> val OptionOutsideRE = """(a)(b)?(c)""".r
OptionOutsideRE: scala.util.matching.Regex = (a)(b)?(c)

scala> val OptionOutsideRE(x,y,z) = "abc"
x: String = a
y: String = b
z: String = c

scala> val OptionOutsideRE(x,y,z) = "ac"
x: String = a
y: String = null
z: String = c

scala> val OptionInsideRE = """(a)(b?)(c)""".r
OptionInsideRE: scala.util.matching.Regex = (a)(b?)(c)

scala> val OptionInsideRE(x,y,z) = "abc"
x: String = a
y: String = b
z: String = c

scala> val OptionInsideRE(x,y,z) = "ac"
x: String = a
y: String = ""
z: String = c

You'll want to use the "inside" type so that you get empty strings rather than nulls when an optional element (like a date) is missing from a given article. Note that at times you'll need to group things, so to change the regex (a)(bc)?(d) to an "inside" form, you need to use the ignore operator for capture groups, e.g. (a)((?:bc)?)(d). Make sure you understand what is going on with this, and if not, ask me right away.

Another note on building a large regular expression: you can and should break it up over multiple lines. Here’s a simple example of how you can do that:

val LowercaseThenDigitRE = (
 """([a-z]+)""" + // Get a series of lowercase characters
 """\s(\d+)"""     // Get a series of digits
).r

For this problem, I recommend having the portion of the regular expression that is responsible for each element on its own line. Add comments to remind yourself what each does.

(d) Write code that loops over the articles Array, and for each element, prints out its keys and values in the format given below. For the text value, you should only print the first 200 characters (see the substring method for Strings).

Example output:

$ scala articles.scala federalist.txt
id: 1
title: General Introduction
author: HAMILTON
venue: For the Independent Journal.
date:
addressee: To the People of the State of New York:
text:

AFTER an unequivocal experience of the inefficacy of the
subsisting federal government, you are called upon to deliberate on
a new Constitution for the United States of America. The subject
speaks i

id: 2
title: Concerning Dangers from Foreign Force and Influence
author: JAY
venue: For the Independent Journal.
date:
addressee: To the People of the State of New York:
text:

WHEN the people of America reflect that they are now called upon
to decide a question, which, in its consequences, must prove one of
the most important that ever engaged their attention, the proprie

<many articles skipped in this output>

id: 15
title: The Insufficiency of the Present Confederation to Preserve the Union
author: HAMILTON
venue: For the Independent Journal.
date:
addressee: To the People of the State of New York.
text:

IN THE course of the preceding papers, I have endeavored, my
fellow-citizens, to place before you, in a clear and convincing
light, the importance of Union to your political safety and
happiness. I

id: 16
title: The Same Subject Continued (The Insufficiency of the Present Confederation to Preserve the Union)
author: HAMILTON
venue: From the New York Packet.
date: Tuesday, December 4, 1787.
addressee: To the People of the State of New York:
text:

THE tendency of the principle of legislation for States, or
communities, in their political capacities, as it has been
exemplified by the experiment we have made of it, is equally
attested by the ev

<many articles skipped in this output>

id: 85
title: Concluding Remarkse
author: HAMILTON
venue: From MCLEAN's Edition, New York.
date:
addressee: To the People of the State of New York:
text:

ACCORDING to the formal division of the subject of these papers,
announced in my first number, there would appear still to remain for
discussion two points: "the analogy of the proposed government t

Note that in the actual output, the title for article 16 does not wrap to the next line (which could happen above if your browser window isn't wide enough), and yours also should not wrap.

EXTRA: Another dataset?

Do you have another data set that you are interested in working with? If so, try working with it to get it into a similarly structured form as was done for the previous problem. Note that your inputs might look quite different -- for example, you might have each document in its own file, in which case you’ll need to learn about listing the files in a directory using Scala code, iterating over and opening those files, etc. Note that you by no means need to stick to English, nor do you need to stick with nicely edited text (messy tweets would be fine). Feel free to ask for suggestions or help.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly