Learning Objectives

Following this assignment students should be able to:

  • use, modify, and write custom functions
  • use the output of one function as the input of another
  • use for loops to automate function operations

Reading

Exercises

  1. -- Use and Modify --

    The length of an organism is typically strongly correlated with it’s body mass. This is useful because it allows us to estimate the mass of an organism even if we only know its length. This relationship generally takes the form:

    Mass = a * Lengthb

    Where the parameters a and b vary among groups. This allometric approach is regularly used to estimate the mass of dinosaurs since we cannot weigh something that is only preserved as bones.

    The following function estimates the mass of an organism in kg based on it’s length in meters for a particular set of parameter values, those for Theropoda (where a has been estimated as 0.73 and b has been estimated as 3.63; Seebacher 2001).

    get_mass_from_length_theropoda <- function(length){
      mass <- 0.73 * length ** 3.63
      return(mass)
    }
    
    1. Add a comment to this function so that you know what it does.
    2. Use this function to print out the mass of a Spinosaurus that is 16 m long based on it’s reassembled skeleton. Spinosaurus is a predator that is bigger, and therefore, by definition, cooler, than that stupid Tyrannosaurus that everyone likes so much.
    3. Create a new version of this function called get_mass_from_length() that estimates the mass of an organism in kg based on it’s length in meters by taking length, a, and b as parameters. To be clear we want to pass the function all 3 values that it needs to estimate a mass as parameters. This makes it much easier to reuse for all of the non-theropod species. Use this new function to estimate the mass of a Sauropoda (a = 214.44, b = 1.46) that is 26 m long.
    [click here for output]
  2. -- Writing Functions --

    Write a function that converts pounds to grams (there are 453.592 grams in one pound). Use that function and a built in function to print out how many grams there are in 3.75 pounds, rounded to the nearest gram.

    Don’t do any printing or rounding inside your function. You want each function to do one thing and do it well, and in this case that thing is converting pounds to grams. Have the function do the conversion and then do the rounding and printing outside of the function.

    [click here for output]
  3. -- Nested Functions --

    This is a follow up to Writing Functions.

    Measuring things using the metric system is great for us scientists, but when you call your grandmother this weekend (you do call your grandmother every weekend don’t you?) you’d like to tell her that you’re taking this awesome class where you learned how to program a computer to calculate how much dinosaurs weigh (if you phrase it like this she’ll have something really cool to brag about to her friends), and that you calculated the weight of a Stegosaurus (no need to scare grandma by talking about the totally awesome, and really scary, Spinosaurus). The problem is that your grandmother doesn’t really know what a kilogram is, she wants to know what it weighed in pounds. So, hook your grandmother up and write a function that converts kilograms into pounds. Use that function along with your dinosaur mass function to estimate the weight, in pounds, of a 12 m long Stegosaurus (12 m is about as big as they come and we want grandma’s story to to be as wild as possible). In Stegosauria, a has been estimated as 10.95 and b has been estimated as 2.64 (Seebacher 2001).

    [click here for output]
  4. -- for Loop --

    This is a follow up to Nested Functions.

    Now that you’ve impressed Grandma, it’s time to do some serious science. Take the following vector of Stegosaur lengths

    lengths <- c(10.1, 9.5, 11.2, 9.8, 10.4, 12.0, 11.5, 9.5, 9.8,
    10.0, 10.7, 10.2, 11.9, 9.7, 11.2, 11.8, 10.7)
    

    and estimate the mass in kilograms for each length using a for loop, your function for estimating mass, a = 10.95, and b = 2.64. Print the results in order.

    [click here for output]
  5. -- stringr --

    A colleague has produced a file with one DNA sequence on each line. Download the file and load it into R using read.csv(). The file has no header.

    Your colleague wants to calculate the GC content of each DNA sequence (i.e., the percentage of bases that are either G or C) and knows just a little R. They sent you the following code which will calculate the GC content for a single sequence:

    library(stringr)
    
    sequence <- "attggc"
    Gs <- str_count(sequence, "g")
    Cs <- str_count(sequence, "c")
    gc_content <- (Gs + Cs) / str_length(sequence) * 100 
    

    This code uses the excellent stringr package for working with the sequence data. You’ll need to install this package before using it.

    Convert the last three lines of this code into a function to calculate the GC content of a DNA sequence.

    Use a for loop and your function to calculate the GC content of each sequence and print them out individually. The function should work on a single sequence at a time and the for loop should repeatedly call the function and print out the result.

    You may have noticed that for Loop prints the results differently. read.csv() imports the data as a data.frame(), unlike the numeric vector in the previous exercise.

    [click here for output]
  6. -- Multiple Files --

    This is a follow-up to stringr.

    Dr. Jekyll is hard at work to perfect his serum and correct the imbalance with his alter ego, Mr. Hyde. Dr. Jekyll is convinced that some mutation in his DNA is responsible for his transformations and he’s looking in the PATRIC bacterial phytogenomic database for clues. He wants to know the GC content of all of the bacteria in the database and got started working with a handful of archaea. Sadly, his skill with a burner and pipette has not prepared him at all for work on a computer.

    Help him out by downloading the data and looping over the files to determine the GC content for each file. Unzip the the .zip file into your data directory. If you look at the data you’ll see that it’s made up of one file per species using the FASTA dna sequence format. We could try to load using read.csv, but the ShortRead package in Bioconductor already exists for parsing fasta files, so we’ll use that instead. Install Bioconductor if you haven’t already.

    source("https://bioconductor.org/biocLite.R")
    biocLite("ShortRead")
    

    The following code will then load a single sequence file:

    library(ShortRead)
    reads <- readFasta("data/archaea_dna/A-saccharovorans.fasta")
    seq <- sread(reads)
    

    You can reuse the GC content function you wrote for stringr to calculate the GC content, but you might need to modify it to accommodate the different capitalization of the bases.

    Each file in the zip represents a single archaea species. Use a for loop and your function to calculate the GC content of each file and print them out individually. You might find the list.files() function useful for working with multiple files in a for loop. The function should work on a single file at a time and the for loop should repeatedly call the function and store the results in a data frame with a row for each file and columns for both the file name and GC content.

    Optional: For a little extra challenge change your answer so that instead of printing out the file names it prints out the species name that is encoded in the file name, but without the .fasta at the end.

    [click here for output]