COSMOS Summer 2006, Lab Exercise 5: Using Your Programs to Process Real Data

Today's exercise

In the past couple of weeks, we've written a variety of programs that process DNA sequences in different ways: counting nucleotides, calculating reverse complements, finding candidate genes, and so on. Unfortunately, it hasn't been possible for us to run them on any real data, such as the data available from GenBank. Today, using the component we built in class yesterday for reading DNA or amino acid sequences from data stored in FASTA format, I'd like you to enable some of your previous programs to work on real data. Be sure to compare your results with one another's, since agreement is one way to check that your results are correct.

Useful functions

In the last two lectures, we learned about functions, which allow you to write pieces of a program separately, then assemble the pieces into complete programs. Yesterday, we wrote a function to read sequence data from a FASTA file; below is a link to that function.

Function to read FASTA data

Your previous solutions (or the "official" solutions that I provided after each lab session) will also be useful today, though you may need to make functions out of some of them.
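For reference, here is a minimal sketch of what a FASTA-reading function can look like. It is not the function we wrote in class; it assumes Python, and the name `read_fasta` and the choice to return (header, sequence) pairs are made up for illustration. It relies only on the basic FASTA layout: a header line starting with `>`, followed by one or more lines of sequence data.

```python
def read_fasta(lines):
    """Return a list of (header, sequence) pairs from FASTA-formatted lines.

    `lines` can be an open file or any list of strings.
    (Hypothetical helper; the version from class may differ.)
    """
    records = []
    header = None   # header of the record currently being read
    chunks = []     # sequence lines collected for that record

    for line in lines:
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            # A new header ends the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence data may be split across many lines; join them later.
            chunks.append(line.upper())

    # Don't forget the last record in the file.
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```

With a function shaped like this, reading a file is a matter of opening it and passing it in, for example `with open(filename) as fasta_file: records = read_fasta(fasta_file)`.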

Useful data

I visited the GenBank web site and downloaded some real DNA data for us to experiment with today. I took the liberty of copying and pasting the data into files for you, so that it can be used by your program. Links to the data files follow.

Save these files on your desktop as you prepare to work today.

Today's programs

Today, I'd like you to assemble a few programs out of pieces that we've already written. The objective is to make some of our previous programs work with real data, such as the data that I downloaded from GenBank. You may need to write some new code for each, but the difficult part was mostly done in previous labs; today's work is mainly to assemble existing parts, with a couple of modifications to some of them so that they'll fit together.

Today's programs are:

1. Assemble a program that counts and reports the number of A's, C's, T's, and G's in each DNA sequence that appears in a FASTA file. Your program should first ask the user for the filename of the FASTA file, then read the sequences out of the file, and finally process the sequences, counting the A/C/T/G nucleotides in each and reporting the counts.
   - Note that most of this work has been done previously. In the second lab session, we wrote a program to count nucleotides. Reading data from a FASTA file is done by our function from yesterday's lecture. Code for asking the user for a filename and looping over the sequences read from a file appeared in at least one program from last Thursday's lab session.
2. Assemble a program that finds and reports candidate genes on the given and reverse strands of each DNA sequence that appears in a FASTA file. Like the previous program, this program should ask the user for the filename, read the sequences out of the file, and finally process the sequences. Format your output in the same way as the similar program from last Thursday's lab session.
   - Initially, find candidate genes that are at least 20 codons long. Take note of the number of candidate genes that your program finds on the sequences in each data file. Then try raising the minimum length from 20 codons to 50, from 50 to 100, and from 100 to 200, to see whether it dramatically changes the number of candidate genes that your program finds.
3. Using today's second program as a starting point, add the component from last Thursday's lab session that translates each candidate gene to an amino acid sequence. Have your program print the amino acid sequence for each candidate gene as part of its output.
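As a reminder of the kind of piece the first program needs, here is a minimal sketch of a nucleotide-counting function, assuming Python. The name `count_nucleotides` and the dictionary it returns are illustrative choices, not necessarily what your earlier solution used. Real GenBank data sometimes contains letters other than A, C, T, and G (such as N for an unknown base), so this sketch simply ignores anything else.

```python
def count_nucleotides(sequence):
    """Count the A, C, T, and G characters in a DNA sequence.

    Characters other than A/C/T/G (for example N) are ignored.
    (Hypothetical helper; your own solution may be organized differently.)
    """
    counts = {"A": 0, "C": 0, "T": 0, "G": 0}
    for base in sequence.upper():
        if base in counts:
            counts[base] += 1
    return counts
```

A quick sanity check when comparing results with a neighbor: the four counts should add up to the length of the sequence, minus any characters that aren't A, C, T, or G.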

Run each of your programs on the provided FASTA data from GenBank. Be sure to compare your results with others' results.
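If you'd like a way to organize the minimum-length experiment from the second program, the sketch below shows one possibility, assuming Python. It makes several assumptions that may not match our earlier lab: a candidate gene here starts at ATG and runs to the first stop codon (TAA, TAG, or TGA) in the same reading frame, all three reading frames are scanned, and length is counted in codons before the stop codon. All function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(dna):
    """Return the reverse complement; unknown letters become N."""
    return "".join(COMPLEMENT.get(base, "N") for base in reversed(dna))


def find_candidate_genes(dna, min_codons):
    """Find ATG...stop stretches of at least min_codons codons (assumed definition)."""
    genes = []
    for frame in range(3):
        start = None  # index of the ATG that opened the current candidate
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                # Length in codons, not counting the stop codon itself.
                if (i - start) // 3 >= min_codons:
                    genes.append(dna[start:i + 3])
                start = None
    return genes


def count_by_threshold(dna, thresholds=(20, 50, 100, 200)):
    """Count candidate genes on both strands for each minimum length."""
    strands = (dna, reverse_complement(dna))
    return {
        t: sum(len(find_candidate_genes(strand, t)) for strand in strands)
        for t in thresholds
    }
```

Calling `count_by_threshold` on each sequence from a data file gives one count per minimum length, which makes it easy to see how quickly the number of candidates drops as the threshold rises and to compare numbers with a neighbor.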

Finished early? (Sorry, no games or videos today)

If you're done with the assignment, I'd prefer that you spend the remaining time on productive tasks. In particular, it would be good for you to spend time working on your projects, which will need to be done sooner than it seems.