ICS 22 / CSE 22 Fall 2012
Project #5: Signal to Noise
Due date and time: Wednesday, November 28, 11:59pm
This project is to be completed individually
Introduction
The ability to generate random data that conforms to a particular format can be handy. It's not uncommon for us to want to test our programs with a large amount of data, so that we can find bugs and inefficiencies, especially those that arise from dealing with large data sets; this kind of testing is often called stress testing. Generating large amounts of random data is a tedious task for people, but a task that we can certainly imagine that a computer would be good at. (We talked a lot in lecture about how we should trust our computers to do boring, repetitive work, so that we can concentrate on the truly interesting or difficult stuff; that's why automating tests is so useful, for example.) It's not hard to see that a program that's capable of generating data in a given format would be very useful in these kinds of circumstances.
For this project, you will write a program that randomly generates data in a form described by a grammar. (Grammars, in addition to being a nice way to model this problem, are recurrent in the study of computer science; you're likely to see them again in your studies — more than once.) You will also gain practice implementing a recursive algorithm in Java, a vital skill that you will find to be useful in many ways as you continue your study of computing.
Grammars
A grammar is a collection of substitution rules that describe a set of sentences. Each sentence is a sequence of terminals. Different kinds of sentence fragments are represented by variables, with a rule for each variable specifying how it can be replaced by one of a set of possible sequences of variables and terminals. One of the variables is designated as the start variable, which means that it represents an entire sentence.
An example of a grammar follows. The start variable is A. The variables are A and B, while the terminals are 0, 1, and #.
This grammar states that the variable A can be replaced either with the sequence 0A1A or the variable B, while the variable B can only be replaced with #.
From a conceptual point of view, a grammar can be used to generate strings of terminals in the following manner. (I should point out that this isn't precisely how your program will generate its strings, but your program will do something that has an equivalent effect.)
A sequence of substitutions leading from the start variable to a string of terminals is called a derivation. When the leftmost variable is always replaced at each step, the derivation is called a leftmost derivation. The string 00#1#1# can be generated by the grammar above. The following leftmost derivation — which begins with the start variable, with one substitution made at each step — proves that it can be done.
A ⇒ 0A1A ⇒ 00A1A1A ⇒ 00B1A1A ⇒ 00#1A1A ⇒ 00#1B1A ⇒ 00#1#1A ⇒ 00#1#1B ⇒ 00#1#1#
Since 00#1#1# can be generated by the grammar, we would say that the string 00#1#1# is in the language of the grammar. In other words, the language of a grammar is the set of all strings that can be generated from it. (Many grammars, including this one, have an infinite number of strings in their languages. This grammar generates an infinite number of strings since the rule A → 0A1A can be used an arbitrary number of times in a derivation.)
The concept of a grammar will be central to our random sentence generator. A grammar will describe a set of sentences (which may indeed be infinite). Each sentence you generate will be a sequence of words, with the words being the terminals in the grammar. The variables in the grammar will describe sentence fragments, with the start variable describing an entire sentence.
The program
You will write a Java program that takes a grammar file as input and writes a specified number of randomly-generated sentences, one per line, into an output file. The name of the input file, the number of sentences, the name of the start variable, and the name of the output file should be passed as command-line arguments to the program. Your main() method should be placed into a class called Generator, so we will easily be able to figure out how to run it. An example of how your program could be executed from the command line is:
java Generator grammar.grm 10 start sentences.out
This command specifies that the grammar file is a file called grammar.grm. Ten sentences should be generated using the start variable start and written into an output file called sentences.out. Your program is required to take these four command-line arguments in this order (though, of course, their values might be different). If more or fewer than four command-line arguments are provided to your program, it should print an error message and terminate.
The format of the grammar file
One of the inputs to your program will be a grammar file containing one or more rules of the following form.
You may assume that grammar files will always be in precisely this form. You are not required to check for errors in the grammar file.
Here is an example grammar file that generates random Facile programs. Since our program generates sentences that each appear on one line, the word NL is used in place of the newline character that separates lines in a Facile program; you can use find-and-replace in a text editor to solve this program and make it possible to execute the generated Facile programs using your interpreter from Project #3. (The generated programs will not have syntax errors in them, but they may have run-time errors or other problems, such as infinite loops or division by zero.)
Do not make assumptions about variable names because of what you see in the example grammar file. You may assume that the names of variables will never have spaces or tabs in them, but you otherwise may not make assumptions about what the names of the variables will be, how many rules there will be, how many right-hand sides each rule will have, and so on. In general, any grammar file that meets the requirements above should be parsed successfully by your program.
Storing the grammar as a set of objects
From the description of the grammar file, we can come to the following conclusions:
These facts lead directly to an idea of how to design the set of objects that will be used to store a grammar in memory. I imagine the following classes comprising your design of a grammar.
I'd like you to use this, or a similar, object-oriented approach for storing the grammar in memory, as it will lead to a clean, recursive algorithm for generating sentences. This algorithm is described in the next section.
Generating random sentences from a grammar
Once you've stored your grammar as the set of objects described in the previous section, it is possible to implement a relatively straightforward recursive algorithm to generate random sentences from it. The algorithm revolves around the idea of generating sentence fragments, then putting the fragments together into a complete sentence.
The first step in the implementation is to establish the fact that all of the objects that make up a grammar — Grammars, Rules, RightHandSides, Variables, and Terminals — can have sentence fragments (or complete sentences) generated from them. This concept can be wrapped up in an interface called Generable, which might look something like this:
public interface Generable { public String generate(Grammar g, Random r); }
Grammar, Rule, RightHandSide, Variable, and Terminal should all implement this interface, though, of course, the actual implementation of the generate() method will differ from one class to another. It's necessary for generate() to take a Grammar as a parameter, because, during the process of generating a sentence, it's sometimes necessary to look up the rule in the grammar that corresponds to a particular variable.
Here is a sketch of the sentence generation algorithm:
The amazing thing about this recursive strategy is that, with relatively little code, you'll be able to ask a grammar to generate its own random sentences.
A caveat about your Symbol interface
The Symbol interface, which should be implemented by both Terminal and Variable as a way to "mark" that both Terminals and Variables are symbols, should also specify that all symbols are generable; in other words, it should be possible to generate a sentence fragment from any kind of symbol, be it a terminal or a variable.
To do this requires one additional piece of syntax that we haven't learned. It's possible for one interface to extend another. In general, if an interface J extends the interface I, J contains all of the methods of I plus the methods that are listed in J. In our case, the Symbol interface does not introduce any new methods, but it should include the generate() method from the Generable interface. We write this in Java as follows:
public interface Symbol extends Generable { }
Because you'll be implementing the Symbol interface in Variable and Terminal, both Variable and Terminal will be required to have a generate() method; it will not be necessary for you to explicitly state that Variable or Terminal implements Generable, because that's implied by the fact that they implement Symbol.
Random number generation
How does random number generation work?
Since the program must generate sentences randomly, it is necessary for us to briefly discuss how random number generators work. Computers cannot generate sequences of genuinely random numbers. Instead, they generate sequence of numbers that satisfy statistical tests of randomness; these numbers are often called pseudo-random numbers. Provided that the algorithm used to generate the sequence is well-chosen, there is little practical difference between random numbers and pseudo-random ones.
A random number generator is an object that generates a sequence of pseudo-random numbers. Here's how it works:
The selection of a seed is important. The same mathematical function is applied every time you ask for a random number. If you always picked the same number as the seed, you'd always get the same sequence of pseudo-random numbers. (This is handy when testing some programs that behave randomly, but it's not so good when you want truly random behavior.) So a randomly-selected seed is also necessary. Of course, computers cannot select numbers at random, but a common way to supply a "randomly-selected" seed is to take the current value in the system's clock and use it as the seed. (This isn't truly random, but will be essentially impossible to control precisely, since the system clock is kept in terms of milliseconds.)
In Java, the Random class in the java.util package provides an easy-to-use random number generator that you can use in your program. First, you need to create a Random object, which represents a sequence of pseudo-random numbers. When you construct the object, it initializes the seed using the system clock. Once you've created the Random object, you can ask it for the next pseudo-random number by calling methods such as nextInt(), nextLong(), nextBoolean(), nextDouble(), and so on. An example of using the nextInt() method follows.
Random r = new Random(); // this should only be done once, with the // resulting Random object being reused throughout // the execution of the program // ... int i = r.nextInt(100); // nextInt(100) returns a pseudo-random number in // the range 0..99 inclusive
One important thing to remember about random number generation as you work on this project: a single Random object represents, in essence, a sequence of random numbers. It's important to create one Random object and use it throughout the program's execution. A good technique for solving this problem is to create your Random object outside of your Grammar and pass it as a parameter in each call you make to generate(). (There are other solutions to this problem, too, but this one makes it possible to test your program even though it's supposed to behave randomly. I'll explain this further in the "Testing" section later in the write-up.)
Why infinite recursion is expected behavior (in some cases)
Grammars are permitted to be recursive. For example, the following grammar is legal:
A → Ax
If you write this grammar in our input format and pass it to your program as input, your program will recurse infinitely and terminate with a StackOverflowError — which is what happens when Java runs out of space to store the "call stack" that keeps track of information about all the methods that are currently executing.
The reason this program will cause your program to crash is simply that there's no way out of the recursion:
We won't test your program with grammars like this, and it's expected — and, in fact, very difficult to avoid! — that your program to crash when given such grammars. Another way to think about a grammar like this is that the only sentences in its language are infinitely long, so it's not surprising that a program to generate such sentences would recurse infinitely.
Note that not all recursive grammars have this problem:
A → xA | x
This grammar describes an infinite language of sentences, each of which is a sequence of one or more x's. If written in our input format and passed to your program, this grammar should cause your program to generate sentences with various numbers of x's. (Roughly half of them will have one x, roughly one-quarter of them will have two x's, roughly one-eighth of them will have three x's, and so on. Stop and think for a minute about why, on average, the sentences will be distributed this way.)
Testing
The problem with randomness, revisited
In this project, we're confronted with the problem of testing a program that has intentionally random behavior. The randomness that makes the program more useful when it's finished complicates the matter of testing, because it's difficult to know whether the difference between what you expect and what you get is caused by a program bug or (intentionally) random fluctuations in the output.
There is a solution to this problem, which we'll explore in the testing phase of this project. The solution is to set up something called a "mock object." A mock object is one that takes the place of an actual object that does some task that may behave unpredictably, or may require using resources external to the program (e.g., connect to a web site, load information from a database) that will cloud the results of our unit tests with possible failures that are unrelated to the issues we're testing for.
In this case, we'll use a "mock" random number generator called MockRandom. It will behave just like the Random object from Java, in that it will have a nextInt() method. It will also extend Random, so that a MockRandom can be used in place of a Random anywhere we'd like. The difference is that our MockRandom, rather than giving back a sequence of random numbers, will actually give back exactly the sequence of numbers I ask it to. This is useful in a test of our code here, because then we'll know exactly what to expect as output!
Here is a link to the code for MockRandom, which I'm providing for your use. Check out the comment in this file for an example of how it works.
What you'll need to test
Write a JUnit test of the generate() methods in your various grammar classes. Because so many of the objects are interrelated — Grammars contain Rules, Rules contain RightHandSides, RightHandSides contain Symbols, etc. — it's best to just write one JUnit test case class that tests sentence-generating behavior in total.
Note that you're not required to test the parsing of the input file or writing the output file. It's just necessary to write tests that set up a Grammar (and its various sub-objects) and call generate(). Pass a MockRandom object to generate() in place of an actual Random object, deliberately choosing your sequence of "random" numbers so that you'll know what the result of generating a sentence should be.
Try to think of some interesting grammars to test with; don't just use one and generate two sentences from it and leave it at that. What are some interesting cases you'd need to deal with? Different numbers of right-hand sides in a rule? Recursive rules (but not infinitely recursive rules)? Think of what you might try that will uncover a bug. It's fine to hard-code your grammars directly into your test methods. So, for example, you might write something like this. (You may need to write something slightly different, depending on the design and the names of your various methods.)
Grammar g = new Grammar(); Rule r1 = new Rule("X"); RightHandSide rhs1 = new RightHandSide(); rhs1.addSymbol(new Terminal("Alex")); rhs1.addSymbol(new Terminal("is")); rhs1.addSymbol(new Variable("Y")); r1.addRightHandSide(rhs1); // ...
Ultimately, you'll want each of your test methods to create a grammar, create a MockRandom with your chosen sequence of "random" numbers, call the generate() method, and assert that the resulting sentence is what you expected it to be.
Deliverables
You must submit all of the .java files that comprise your program. Please do not turn in the .class files, or other files generated by your development environment.
Follow this link for a discussion of how to submit your project.
Additional challenges
As discussed earlier in the write-up, a sensible design strategy is to write a Grammar class, which encapsulates a complete grammar. A grammar is a collection of rules, each of which can be represented as a Rule object. In earlier sections of the write-up, I advocate storing these Rule objects in a flat data structure, such as an ArrayList or a linked list, and using a linear search whenever you need to look up the rule corresponding to a particular variable.
However, this is actually a poor approach if the number of rules will be large and many sentences will be generated by the program. I invite you to explore these other strategies for implementing the storage of Rules in your Grammar class instead.