Learning Objectives

Following this assignment, students should be able to:

  • install and load an R package
  • understand the data manipulation functions of dplyr
  • execute a simple import-and-analyze data scenario
  • manipulate strings using the stringr package

Reading

Exercises

  1. -- dplyr --

    Install and familiarize yourself with the dplyr package. The library() step(s) should always be located at the very top of a script.

    install.packages("dplyr")
    
    library(dplyr)
    
    help(package = dplyr)
    

    This vignette is a great reference for data manipulation verbs to keep in mind.

  2. -- Shrub Volume 3 --

    This is a follow-up to Shrub Volume 2.

    Dr. Granger is interested in studying the factors controlling the size and carbon storage of shrubs. This research is part of a larger area of research trying to understand carbon storage by plants. She has conducted a small preliminary experiment looking at the effect of three different treatments on shrub volume at four different locations. She has placed two data files on the web for you to download:

    Download these into your data folder and get familiar with the data by importing the shrub dimensions data using read.csv() and then:

    1. Check the column names in the data using the function names().
    2. Use str() to show the structure of the data frame and its individual columns.
    3. Print out the first few rows of the data using the function head().

      Use dplyr to complete the remaining tasks.

    4. Select the data from the length column and print it out.
    5. Select the data from the site and experiment columns and print it out.
    6. Filter the data for all of the plants with heights greater than 5 and print out the result.
    7. This code calculates the average height of a plant at each site:

      by_site <- group_by(shrub_dims, site)
      avg_height <- summarize(by_site, avg_height = mean(height))
      

      Modify the code to calculate and print the average height of a plant in each experiment.

    8. Use max() to determine the maximum height of a plant at each site.
    9. Create a new data frame called shrub_data_w_vols that includes all of the original data and a new column containing the volumes, and display it.
    10. Import the experiments data and then use inner_join() to combine it with the shrub dimensions data to automatically add a manipulation column to the shrub data.
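As a warm-up for the dplyr tasks above, here is a minimal sketch on a small made-up data frame. The names `toy_shrubs` and `toy_experiments` and all of their values are hypothetical and are not the contents of Dr. Granger's files; the column names simply mirror the ones mentioned in this exercise.

```r
library(dplyr)

# Hypothetical toy data; NOT the real shrub dimensions file.
toy_shrubs <- data.frame(
  site = c(1, 1, 2, 2),
  experiment = c(1, 2, 1, 2),
  length = c(2.0, 3.0, 4.0, 5.0),
  width = c(1.0, 2.0, 1.0, 2.0),
  height = c(3.0, 6.0, 5.0, 7.0)
)

# select() keeps columns; filter() keeps rows that match a condition.
sites_only <- select(toy_shrubs, site)
tall <- filter(toy_shrubs, height > 5)

# mutate() adds a column computed from existing ones.
with_area <- mutate(toy_shrubs, area = length * width)

# group_by() + summarize() collapse each group to a single row.
max_by_site <- toy_shrubs %>%
  group_by(site) %>%
  summarize(max_length = max(length))

# inner_join() matches rows on a shared column (here, experiment).
toy_experiments <- data.frame(
  experiment = c(1, 2),
  manipulation = c("control", "treated")
)
joined <- inner_join(toy_shrubs, toy_experiments, by = "experiment")

print(tall)
print(max_by_site)
print(joined)
```

The same verbs apply unchanged to the imported shrub data; only the data frame and column names differ.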
  3. -- Fix the Code 1 --

    This is a follow-up to Shrub Volume 3. If you haven’t already downloaded the shrub volume data, do so now and store it in your data directory.

    The following code is supposed to import the shrub volume data and calculate the average shrub volume for each experiment and, separately, for each site:

    read.csv("data/shrub_volume_experiment.csv")
    shrub_data %>%
      mutate(volume = length * width * height) %>%
      group_by(site) %>%
      summarize(mean_volume = max(volume))
    shrub_data %>%
      mutate(volume = length * width * height)
      group_by(experiment) %>%
      summarize(mean_volume = mean(volume))
    
    1. Fix the errors in the code so that it does what it’s supposed to.
    2. Add a comment to the top of the code explaining what it does.
    3. In a text file, discuss how you know that your fixed version of the code is right and how you would try to make sure it was right if the data file were thousands of lines long.
  4. -- Link to Databases --

    Let’s access an SQL database directly from R. Install the RSQLite package (and the dbplyr package if you haven’t already).

    Either use an existing copy of the portal_mammals.sqlite database or download a new copy. You should now be able to link to the surveys table in the database using:

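    # Connect with DBI; src_sqlite() and tbl_df() are deprecated in current dplyr
    portaldb <- DBI::dbConnect(RSQLite::SQLite(), "portal_mammals.sqlite")
    surveys <- tbl(portaldb, "surveys") %>% collect()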
    

    surveys will be a tbl_df, which means that we won’t need to worry about it printing out huge numbers of rows in the answers below.

    Using this table in the database:

    1. Use the nrow() function to determine how many records are represented in this dataset.
    2. Select the year, month, day, and species_id columns in that order.
    3. Create a new data frame with the year, species_id, and weight in kilograms of each individual, with no null weights.
    4. Use the distinct() function to print the species_id for each species in the dataset that has been weighed.
    5. Calculate the average size of a Neotoma albigula (NL) in this dataset.
    6. Create a new data frame with the number of individuals counted in each year of the study. If you don’t know how to count things using dplyr, I’d recommend checking out the dplyr vignette. Vignettes can be a great way to learn how to use packages in R. If you’re in a hurry, you can also search the page for count.
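The following is a minimal sketch of the same kind of workflow against an in-memory SQLite database, so it runs without portal_mammals.sqlite. The rows in `toy_surveys` are invented, and connecting through `DBI::dbConnect()` is an assumption made for this sketch rather than something the assignment prescribes.

```r
library(dplyr)

# In-memory SQLite database standing in for portal_mammals.sqlite.
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical rows; the real surveys table has more columns and records.
toy_surveys <- data.frame(
  year = c(1977, 1977, 1978, 1978),
  species_id = c("NL", "DM", "NL", "DM"),
  weight = c(150, NA, 170, 40)
)
DBI::dbWriteTable(con, "surveys", toy_surveys)

# tbl() creates a lazy table reference (this needs dbplyr installed);
# collect() pulls the rows into R as a tibble.
surveys_toy <- tbl(con, "surveys") %>% collect()

# Drop missing weights, then count records per year.
weighed <- filter(surveys_toy, !is.na(weight))
per_year <- count(surveys_toy, year)

print(nrow(weighed))
print(per_year)

DBI::dbDisconnect(con)
```

Collecting the table first keeps the rest of the analysis in ordinary dplyr on a local tibble, which is reasonable for a table that fits in memory.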
  5. -- stringr Functions --

    Use the character functions from the package stringr to print the following strings.

    1. "atgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgc". Do this by duplicating “atgc” 15 times.
    2. " Thank goodness it's Friday" without the leading white space (i.e., without the spaces before "Thank").
    3. "gcagtctgaggattccaccttctacctgggagagaggacatactatatcgcagcagtggaggtggaatgg" with all of the occurrences of "a" replaced with "A".
    4. Print the length of this DNA sequence "gccgatgtacatggaatatacttttcaggaaacacatatctgtggagagg".
    5. Print the number of "a"s in "gccgatgtacatggaatatacttttcaggaaacacatatctgtggagagg".
    6. Print the first 20 positions of this DNA sequence "gccgatgtacatggaatatacttttcaggaaacacatatctgtggagagg".
    7. Print the last 10 positions of this DNA sequence "gccgatgtacatggaatatacttttcaggaaacacatatctgtggagagg".
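For orientation, here is a short sketch of stringr functions that are likely to be useful for the tasks above. The input strings are hypothetical and deliberately different from the ones in the exercise.

```r
library(stringr)

# Hypothetical inputs, not the exercise strings.
repeated <- str_dup("ab", 3)                    # repeat a string
trimmed <- str_trim("   hello")                 # strip surrounding white space
swapped <- str_replace_all("banana", "a", "A")  # replace every match
n_chars <- str_length("banana")                 # number of characters
n_a <- str_count("banana", "a")                 # number of matches
first3 <- str_sub("banana", 1, 3)               # positions 1 through 3
last2 <- str_sub("banana", -2, -1)              # negative indices count from the end

print(c(repeated, trimmed, swapped, first3, last2))
print(c(n_chars, n_a))
```

Note that str_replace() changes only the first match, while str_replace_all() changes every match.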
  6. -- Strings from Data 1 --

    A colleague has produced a file with one DNA sequence on each line. Download the file and load it into R using read.csv(). The file has no header and is separated by white space (" ").

    Calculate the GC content of each sequence. The GC content is the percentage of bases in a sequence that are either G or C. Print each GC content, in order, to the screen (in %).

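The GC calculation described above can be sketched on a small made-up vector. The name `toy_seqs` and its contents are hypothetical; in the exercise the sequences come from the downloaded file, and converting to upper case first is an assumption made here because the case used in that file is not stated.

```r
library(stringr)

# Hypothetical sequences; in the exercise these come from the downloaded file.
toy_seqs <- c("ATGC", "GGCC", "atat")

# Normalize case, count G and C characters, divide by the sequence
# length, and convert the fraction to a percentage.
seqs_upper <- str_to_upper(toy_seqs)
gc_percent <- str_count(seqs_upper, "[GC]") / str_length(seqs_upper) * 100

print(gc_percent)
```

Because str_count() and str_length() are vectorized, the calculation runs over every sequence at once without an explicit loop.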