ICS 23 / CSE 23 Summer 2012
Project #3: Always Changing Probably

Due date and time: Friday, August 3, 11:59pm


Introduction: What is a database?

Consider a major web site, such as Amazon or Twitter. At any given time, they're storing a tremendous amount of information — inventory, in the case of Amazon, messages and interrelationships in the case of Twitter — and making it available via the web. Further, the information is fairly fluid; every minute, many thousands of requests pour in, each causing information to be accessed, changed, added, or deleted.

Companies like these store their information in databases. A database is a collection of data, often (though not always) organized into tables consisting of rows and columns. There are many important requirements guiding the handling of the data in a company's databases; some of the most important requirements are:

Naturally, this is a complex set of requirements; implementing such a system is not trivial. Fortunately, most companies that need to store large amounts of data share a similar set of requirements; correctness, low latency, persistence, fault tolerance, transactional integrity, and security are important for everybody! So, rather than every company implementing its own database system, a few companies (and a few open-source efforts) have implemented database management systems (DBMS's). A DBMS does all of the dirty work involved with managing a database: organizing the data into rows and columns, storing the data so that it can be efficiently accessed, updated, added, or removed, handling persistence, fault tolerance, and transactional integrity requirements, among others. Companies like Oracle, IBM, and Microsoft have built well-known, battle-tested DBMS systems, and sell them for sometimes as much as hundreds of thousands of dollars. This was a worthwhile investment on their part, as the core of most web-based businesses includes at least one database.

While there is a great deal of complexity in a DBMS, far more than we're equipped to handle in this course, two requirements that we can address with the tools we've learned are correctness and low latency. For this project, you'll implement a very rudimentary database, capable of storing data in tables stored in memory, consisting of rows and columns, quickly looking that data up based on a search key, updating the data, and removing it.


Tables, rows, and columns

For our purposes, a database is a collection of tables. Each table has a name, and consists of an arrangement of data items into rows and columns. Each row consists of a sequence of cells, one corresponding to each of the columns in the table. Importantly, one of the columns contains the keys, which are used to uniquely identify the rows in the table. No two rows will ever have the same key.

A university might have a table like this, called Students, in its database:

Student ID (key) First Name Last Name GPA Fees Paid?
12345678 John Doe 3.50 Yes
23456789 Jane Doe 3.75 Yes
34567890 Lazy Jones 1.12 No

Placing data into tables may seem to be a restrictive approach, but much of the expressiveness provided by a "real" DBMS lies in its adeptness at relating the data in various tables to each other. For example, the table above can be related to the data in the following Grades table, based on the appearance of a student ID in each row, indicating which student has earned the grade in that row:

Grade ID (key) Student ID Course Quarter Grade
1 12345678 ICS 23 Fall 2002 A+
2 34567890 Econ 20A Spring 2001 C-
3 12345678 ICS 22 Spring 2001 A-

By combining the data in the two tables, an actual DBMS is able to easily determine that John Doe earned an A+ in ICS 23 in Fall 2002 and an A- in ICS 22 in Spring 2001, and that Lazy Jones earned a C- in Econ 20A in Spring 2001. So it becomes simple for the system to perform operations such as "Give me a list of the names of all students who have received at least a grade of A- in ICS 23 within the last three years." Further, if Jane Doe gets married next year and opts to change her name to Jane Smith, changing her name in the Students table will effectively modify her name everywhere, since she is identified only by her student ID throughout the rest of the database.

Note, too, that many of these systems include the ability to enforce constraints, so, for example, a grade can never be entered for a student ID that does not exist. This is a powerful tool that eliminates an entire class of bugs from your overall system.


The program

To reiterate, your program for this project is not intended to be anywhere near a production-quality DBMS. Many of the ideas that I've introduced above are simply to provide you enough background to understand the larger context into which your work fits. Your program will be a prototype of a very simple database system, capable of storing data into tables in memory, looking that data up by keys, updating that data, and removing it. For simplicity, all of the data in our tables will be integers. Furthermore, your program will not be required to understand or enforce the relationship between data in different tables.

Your program will read a sequence of commands from the console (presumably using a Scanner wrapped around System.in) and print output to System.out as directed by the specification below. Your program should not print out any prompts such as "Please enter your next command." It should simply read commands blindly typed into the console, process them, and produce output. Many of the commands, in fact, will produce no output. The reason for this design decision is two-fold. Firstly, this is intended to be a prototype, meaning that it's not intended to be used by anyone who is not familiar with the details of the project, so the user interface need not be all that friendly. Secondly, directly reading commands from the console allows us to redirect input from a file into the program, then redirect output into another file, for ease of automated testing. (I'll talk more about this aspect of the project later in the write-up.)


The commands

Your program needs to support the following commands:

Command

CREATE - Creates a new table with the given name. Neither table names nor column names may have spaces in them. The new table has no rows in it and the specified list of columns. All tables have an unnamed key column automatically, so the listed columns are additional to the implicit key column.

CREATE tableName columnName1 columnName2 ...

Examples:

  • CREATE PhoneNumbers AreaCode Prefix Suffix: Creates a new table called PhoneNumbers, with three columns in addition to the key.
  • CREATE OnlyKeys: Creates a new table called OnlyKeys with only a key column in it.

This command should generate no output. Each column's name must be unique within the table, but may be the same as the name of a column in another table. The name of the new table must not be the same as the name of any existing table.

INSERT - Inserts a new row into a table, with the given key and column values. The columns need not be listed in the order they were listed in the CREATE command, and not all of the columns in the table need to be listed; those that are not listed will be given the value 0.

INSERT tableName key columnName1=columnValue1 columnName2=columnValue2 ...

Examples:

  • INSERT PhoneNumbers 1234 AreaCode=949 Prefix=824 Suffix=6624: Inserts a new row into the PhoneNumbers table, with the key 1234, an AreaCode of 949, a Prefix of 824, and a Suffix of 6624.
  • INSERT PhoneNumbers 1235 AreaCode=949: Inserts a new row into the PhoneNumbers table, with the key 1235, an AreaCode of 949, and all of the other columns set to 0.

This command should generate no output. The value placed into the key column must be unique in the table (i.e. no row may already exist with that value in the key column). If any column names listed in this command do not exist in the table, they are simply ignored.

UPDATE - Updates an existing row in a table, with the given key and new column values. Not all of the columns in the table need to be listed; those that are not listed will not be changed.

UPDATE tableName key columnName1=columnValue1 columnName2=columnValue2 ...

Examples:

  • UPDATE PhoneNumbers 1235 AreaCode=949 Prefix=824 Suffix=6624: Updates the row in the PhoneNumbers table with the key 1235, changing its AreaCode to 949, its Prefix to 824, and its Suffix to 6624.
  • UPDATE PhoneNumbers 1234 AreaCode=714: Updates the row in the PhoneNumbers tables with the key 1234, changing its AreaCode to 714, but leaving the other columns as they are.

This command should generate no output. If any column names listed in this command do not exist in the table, they are simply ignored.

LOOKUP - Retrieves the value from the listed columns of a row in a table.

LOOKUP tableName key columnName1 columnName2 ...

Example:

  • LOOKUP PhoneNumbers 1234 AreaCode Prefix: Looks up the AreaCode and Prefix columns in the row in the PhoneNumbers table whose key is 1234.

The above command generates output that looks like this:

    PhoneNumbers 1234 AreaCode=714 Prefix=824

If any column names listed in this command do not exist in the table, they are simply ignored.

DELETE - Removes a row from a table with a particular key.

DELETE tableName key

Examples:

  • DELETE PhoneNumbers 1235: Deletes the row in the PhoneNumbers table whose key is 1235.

This command should generate no output.

EXIT - Exits the program.

EXIT

This command should generate no output, and should end the program.


Starting point

I've provided a few Java source files, along with a few compiled Java .class files in a JAR, that you are required to use as a starting point. They are provided as a Zip archive.


Parsing and executing input commands

Writing the code that handles the task of parsing the commands from the input can be tedious, so I have provided a compiled class called CommandParser that does this job for you. CommandParser consists of one public static method, with the following signature:

    public static Command parseCommand(String s)

It takes a String (presumably one line of input from the console) and returns a reference to an object that represents the command. You need not be concerned with the details of how the parsing works, which is why I haven't provided the source code for this method. However, we do need to take a step back so I can explain some background about the Command objects.

The Command pattern

In lecture, we briefly discussed the concept of design patterns. Design patterns are well-understood solutions to commonly-occurring programming problems.

There is a well-known pattern that you'll need to understand for this project. Many programs take user input in the form of commands. There are different ways to pass commands to a program, of course. A console-mode user interface, like the one you will build on in this project (and have likely built in others), accepts commands by allowing the user to type them into the console. A graphical user interface accepts commands from the user by allowing the user to click buttons, select menu items, drag objects around, and so on. Some programs take commands from the user by accepting command-line arguments. But regardless of the style of user interface, the program's execution can be viewed as a sequence of commands from the user, each followed by the appropriate response from the program. This is a commonly-occurring design requirement, so it makes sense that we ought to think of a common design that addresses it.

The Command pattern is exactly that common solution. In a program that accepts input commands from the user — be it commands typed into the console, mouse clicks and menu selections, or whatever — it is sensible for each command to be represented by a command object. Different kinds of commands are represented by different kinds of objects. When it's time to execute the command, we can call an execute() method on the object. So, we can define one class for each kind of command, and derive all of them from an abstract superclass, called Command. The Command class will have an abstract execute() method (we'll make it abstract, since we don't know how to execute a command unless we know specifically what kind of command it is), and probably little or nothing else. The subclasses of Command — in this project: CreateCommand, InsertCommand, UpdateCommand, LookupCommand, DeleteCommand, and ExitCommand — will provide implementations of this execute() method.

There are a couple of advantages to using this pattern. First of all, the program's main loop becomes very simple indeed:

    while there are still lines of input to be read
    {
        read one command from the console
        parse the command, building the appropriate kind of Command object to represent it
        execute the command by calling its execute() method
    }

There is a clear separation between the code that oversees the execution of the program — you might call this the command loop or event loop — and the code that does other jobs (such as parse the input or execute the commands). Adding a new kind of command does not require touching this command loop at all, though you would obviously have to teach the parser to understand it, and you'd have to write a new subclass of Command that would know how to execute it.

Another advantage of using the Command pattern is that it becomes very easy to keep track of all of the commands that have been executed, by maintaining a collection of all of the Command objects. This would provide, among other things, an easy way to implement an "undo" feature, whereby the commands can be undone in the reverse of the order in which they were originally executed. To implement an undo feature, every time a command is executed, its corresponding command object can be pushed on to a stack. When the undo feature is used, the top object can be popped from the stack and "un-executed." To implement a corresponding "redo" feature, you could store the commands in two stacks, in much the same way that web links are stored in two stacks to implement the "back" and "forward" buttons in a web browser. Note that this program should not support an undo/redo feature!

What does this have to do with me?

The reason I've explained the Command pattern to you is that you'll be required to use it. I've provided an empty abstract class called Command, to which you'll need to add an abstract execute() method. The provided CommandParser.parseCommand() method takes a String and builds an object of one of Command's subclasses, which I've provided. You'll need, then, to add an implementation of your execute() method to each of the subclasses of Command.

Why does this seem sort of familiar?

If you took ICS 22 / CSE 22 and wrote the Facile interpreter that I assign in that course, some of this will sound familiar, because the Facile interpreter also used the Command pattern, with each Facile statement implemented by its own class, and with each class containing an execute() method. If you wrote the Facile interpreter in ICS 22 / CSE 22, you might want to run through your code from that project briefly to reacquaint yourself with this technique.


Handling erroneous commands

Since your program is intended to be a rudimentary prototype, it need not report specific error messages to indicate specific problems. Instead, any command that is not understood or does not follow the rules above should cause your program to simply print the word "ERROR" by itself on a line.

The provided CommandParser.parseCommand() method will return null for any input command that is not syntactically correct (e.g., a CREATE command with no table name, a LOOKUP command with a non-integer key, etc.). However, CommandParser will not perform any other checking on commands. So, it may return a CreateCommand with a table name that is not in the database, or a LookupCommand with a column name that doesn't exist in the database. Your execute() methods will need to check for such conditions and respond appropriately, ideally by throwing an exception. Your command loop can then catch these exceptions and print "ERROR" in response to them. Since all of the execute() methods will need to have the same signature, I suggest having them all throw the same kind of exception — for example, CommandExecutionException — when a problem occurs while executing a command. You'll need to define this exception yourself, which you can do similarly to how I've written code for classes like DuplicateKeyException.


Implementing a table

You can think of one of the tables in our database very much like the data structure which is commonly called a map (though it, too, is sometimes called a table). A map is a set of key/value pairs, where each key uniquely identifies a particular value. In our case, we can conceptually think of each row in the table as a key/value pair, where the key is the integer in the key column for that row, and the value is the collection of integers in the remaining columns.

Since each of our tables may contain a very large collection of rows, it will be necessary for us to build an efficient implementation, which will provide fast insertions, updates, lookups, and deletions. Since databases are often used in situations where the transaction throughput — the number of transactions that can be executed per second — is considered to be more important than the performance of each individual transaction, it will be tolerable for some operations to be somewhat slower than others, so long as the average time for each is fast. One rather obvious choice is a binary search tree, though we've discussed in class that the performance of binary search trees, while good in the typical case, can be very poor in certain circumstances that aren't necessarily rare (for example, if the data is inserted with the keys in ascending order). As we discussed rather extensively in class, this problem can be solved by keeping the tree balanced, so long as the time spent balancing the tree isn't so great that it negates the advantage of balancing.

While balanced binary search trees provide one acceptable solution to our problem, a different approach is to abandon the idea of using a binary search tree altogether, and instead use a skip list. A skip list is a data structure that can be used to implement the concept of a map efficiently, with each node storing one key/value pair. Some key/value pairs will be stored in the skip list more than once. Recall that a skip list uses randomization to determine which subset of the nodes should be duplicated, with the duplication providing the opportunity for significantly faster searches. (While there is no guarantee that the skip list will provide fast searches, the probability of the skip list performing poorly is low, and gets increasingly lower the larger it gets.)

Before you get started on this project, be sure to read Section 9.4 of the Goodrich textbook for a thorough explanation of how skip lists work.


Suggestions for implementing and debugging your skip list

When implementing my own solution for this project, I initially had a few bugs in my skip list; maintaining what is essentially a quadruply-linked list can be difficult to get right, and the penalty for getting it wrong is often a NullPointerException or unpredictable behavior. The difficulty in implementing infrastructure, such as underlying data structures, is that its behavior is not visible to users. A nice solution to this problem is to find a way to make the infrastructure visible while debugging.

To fix the problems I had, I wrote two methods in the SkipList class that helped a lot:

It is not required for you to build these methods, but I suspect you will find them useful if you have bugs that you're having difficulty finding and fixing. I'm happy to help when you're having problems, but you may find that I ask you "What does your verification method say in the case where you're having problems?"

The graphical debugger in Eclipse

Another approach to debugging is to use the graphical debugger in Eclipse. This is a way of telling Eclipse to stop your Java program's execution in mid-stream at a certain line, then step through the code one line at a time, letting you view the values of variables at each step. This can be a very helpful tool, especially when you're building something like a skip list where incorrect behavior isn't easily visualized.

We'd be glad to help you in the lab with learning how to use the debugger if you're new to it; what you learn will benefit you not just on this project, but on any future work you do in any programming environment that includes a debugger!


Redirection of input and output

Recall that you can execute a Java program from the command line by using the java command, like this:

    java MyProgram

where MyProgram is the name of the class that contains a main() method. Ordinarily, a Java program reads its "standard" input from the console, and writes its "standard" output to the console. In other words, when you use the System.out.println() method, the output goes to the console.

Most operating systems — Windows, Unix, and Linux, for example — allow you to redirect the standard input and output when you execute a program. The contents of an existing file may be redirected into the standard input, meaning that, rather than allowing the user to type input into the console, the program proceeds as though the user has typed the next line of the file each time the program requires input. Similarly, the standard output can be redirected into a file, meaning that all of the output to System.out will be stored in a file, rather than displayed on the console.

The typical mechanism for redirection is to use the < and > operators on the command line, like this:

    java MyProgram <my-input.txt >my-output.txt

Using the command above, every time the program needs input, it will read it from the file my-input.txt. Every time it writes output, it will write it to the file my-output.txt. It is possible to redirect the standard input without redirecting the standard output, and vice versa. Note that the operating system deals with the <my-input.txt and >my-output.txt arguments itself before executing the program, so these will not end up in the array of Strings passed to the main() method. In fact, the Java program will not even be aware of the redirection! As far as the program is concerned, it's reading input from the console and writing output to the console. The operating system handles the redirection transparently.

This powerful and simple technique will allow you to write test input and reuse it many times while testing this program, so that you can test your database with large sets of data.


Deliverables

You must turn in all of the .java files, including the ones that we provided to you. In other words, we need all of the .java files that are actually part of your program.

Follow this link for an explanation of how to turn in your project.


Limitations and advice

With the exception of ArrayList — which you'll probably want to use to store the values in each node — you may not use the predefined Java "collection" classes, such as java.util.TreeMap, in your solution. (The "collection" classes are the ones that store a collection of data, and include such classes as LinkedList, HashMap, Vector, Hashtable, and TreeMap.) Not surprisingly, java.util.concurrent.ConcurrentSkipListMap is definitely off-limits.

I can't stress enough the need to start early. The previous two projects involved a lot of conceptual thinking, but not very much coding. This project, on the other hand, will require you to write more code and do more design. It's actually not as big as you may believe, but I would allocate plenty of time to work on it, so that you can get your questions answered early on, and still have plenty of time to write the code and verify that it's working.


Additional challenges

For those of you looking for additional challenges, I suggest attempting to make your SkipList into a generic class, as I did. (I didn't make this a requirement, because it requires some rather difficult-to-master Java syntax that hasn't likely been taught in a course you've taken, and is somewhat out of the scope of this course.)

Making your SkipList generic is trickier than making something like a stack or a linked list generic, since your SkipList is a map implementation and, thus, needs two type parameters: a key type (K) and a value type (V). Furthermore, not all types can be used as keys; since the keys on each level of a SkipList must be stored in ascending order, the keys must be comparable (i.e., it must be possible to compare two keys and decide which one is smaller). This requires the use of bounded type parameters, which you use to tell Java that the key type parameter can only be filled in with certain types. While the concept is not especially difficult to understand, there are some syntactic issues to watch out for, with error messages emanating from the compiler that may not make a lot of sense initially when you misstep.

If you're interested in exploring this, talk to me; I'd be very glad to point you in the right direction, and I encourage you to take the plunge, especially if you finish the project early and decide you want to continue your work. Bear in mind that no extra credit is offered in this course, so making your SkipList generic is a task that won't improve your score. On the other hand, it will teach you some valuable Java design skills that will benefit you in the long run, both in future courses and beyond.