Learning Objectives

After completing this assignment, students should be able to:

  • implement quality control for data entry in spreadsheets
  • import and create data in an SQL database
  • understand the basic query structure of SQL
  • execute SQL commands to select, sort, group, and aggregate data

Reading

Exercises

  1. -- Data entry validation in Excel --

    You’re starting a new study of small mammals at the NEON site at Ordway-Swisher. Create a spreadsheet in Excel for data entry. It should have four columns: Year, Site, Species, and Mass.

    Set the following data validation criteria to prevent any obviously wrong data from getting entered:

    1. Year must be an integer between 2015 and 2025.
    2. Site should be one of the following: A1, A2, B1, B2.
    3. Species should be one of the following: Dipodomys spectabilis, Dipodomys ordii, Dipodomys merriami.
    4. Mass should be a decimal greater than or equal to zero and less than or equal to 500, since mass is measured in grams in this study and nothing bigger than half a kilogram could possibly fit into your Sherman traps. Change the error message on this validation criterion to explain why the data is invalid and what the valid values are.

    Save this file as yourname_ordway_mammal_data.xlsx.

  2. -- Importing Data --

    Hand entering data is great if you’re collecting your own data and need to enter it yourself, but it’s a pretty terrible way to use already available data, especially if it’s more than a few dozen lines long. This example will walk you through how to get data that already exists into SQLite.

    1. Download the main table for the Portal LTREB mammal survey database. It’s kind of large so it might take a few seconds. This database is published as a Data Paper on Ecological Archives, which is generally a great place to look for ecology data.
    2. Create a new database by clicking on New Database in the Database drop-down menu. Select a file name, like portal_mammals.sqlite, and a location.
    3. Click on the Import icon.
    4. Click on Select File, navigate to where you saved the data file, and select it.
    5. Select CSV. Note that you can also import from SQL files and change the characters that fields are separated or enclosed by. Make sure to select First row contains column names.
    6. Click OK when it asks if you want to modify the data.
    7. Name the table that you are importing into Surveys.
    8. Identify the type for each field, using the Data Type drop-down menus. If it is not obvious if the data type is an INTEGER or VARCHAR for each variable, check the metadata. Important: if you specify the wrong data type it can cause some data to not be imported and/or prevent you from doing some kinds of data manipulations.
    9. Select recordID as the Primary Key and click OK.
    10. Click OK when it asks if you are sure you want to import the data.
    11. Now import the plots and species tables.
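The GUI steps above can also be scripted. Here is a minimal sketch using Python's built-in sqlite3 and csv modules; the two-row CSV and the column names (recordID, year, species_id, wgt) are placeholder stand-ins, not the real table's full schema, so match them to the actual downloaded file:

```python
import csv
import io
import sqlite3

# A tiny stand-in for the downloaded CSV (the real file has many more
# columns and thousands of rows).
csv_text = "recordID,year,species_id,wgt\n1,1977,DM,40\n2,1977,DS,120\n"

con = sqlite3.connect(":memory:")  # use "portal_mammals.sqlite" for a real file

# Declaring types up front matters: the wrong type can silently mangle values.
con.execute(
    "CREATE TABLE surveys ("
    "recordID INTEGER PRIMARY KEY, year INTEGER, species_id TEXT, wgt REAL)"
)

reader = csv.DictReader(io.StringIO(csv_text))  # first row holds column names
con.executemany(
    "INSERT INTO surveys VALUES (:recordID, :year, :species_id, :wgt)",
    reader,
)
con.commit()

n_rows = con.execute("SELECT COUNT(*) FROM surveys").fetchone()[0]
print(n_rows)  # prints 2
```

SQLite's type affinity converts the numeric-looking strings from the CSV into integers and reals as they land in the typed columns.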
  3. -- Creating Tables --

    Create a new table in the Portal Mammals database called FieldNotes. The table should include information on the date of sampling that the notes apply to, designed in such a way as to allow it to be linked to the main table, and a large text field for entering notes.

    A TEXT field is a good choice for the notes because you have no idea how extensive future notes might be and they will probably vary wildly in length.
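One way to design this table is to mirror the main table's date columns so the two can be joined. This is a sketch (run here through Python's sqlite3); the assumption that dates are stored as separate year, month, and day columns should be checked against your Surveys table:

```python
import sqlite3

con = sqlite3.connect(":memory:")  # use portal_mammals.sqlite in practice

# year/month/day mirror the main table so FieldNotes can link to it;
# notes is an unbounded TEXT field for the free-form entries.
con.execute(
    "CREATE TABLE FieldNotes ("
    "year INTEGER, month INTEGER, day INTEGER, notes TEXT)"
)

cols = [row[1] for row in con.execute("PRAGMA table_info(FieldNotes)")]
print(cols)  # prints ['year', 'month', 'day', 'notes']
```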

  4. -- Adding Records --

    Add the following note to the FieldNotes table in the Portal database for the date 04/01/1963:

    "Just completed the April 1963 census of the site. The region is teeming with Dipodomys spectabilis. Using the time machine to conduct trapping prior to the start of the study is working out great!"

    What’s a Dipodomys spectabilis?
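Adding the record is a plain INSERT. A sketch, assuming the FieldNotes columns from the previous exercise are year, month, day, and notes:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE FieldNotes (year INTEGER, month INTEGER, day INTEGER, notes TEXT)"
)

note = (
    "Just completed the April 1963 census of the site. The region is "
    "teeming with Dipodomys spectabilis. Using the time machine to conduct "
    "trapping prior to the start of the study is working out great!"
)
# 04/01/1963 split into year, month, day to match the table's date columns
con.execute("INSERT INTO FieldNotes VALUES (1963, 4, 1, ?)", (note,))
con.commit()

row = con.execute("SELECT year, month, day FROM FieldNotes").fetchone()
print(row)  # prints (1963, 4, 1)
```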

  5. -- Updating Records --

    Add the following note to the FieldNotes table in the Portal database for the date 10/1/2015:

    "Vegetation seems to have returned to normal for this time of year. The landscape isn't exactly green, but there is a decent amount of plant activity and there should be enough food for the rodents to make it through the winter."

    As soon as you’ve added it you realized that this was the update for October 2014, not October 2015. Update the record so that it contains the appropriate value.
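Fixing a record you have already inserted is what UPDATE is for. A sketch (same assumed year/month/day/notes schema; the WHERE clause pins down exactly which record to change):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE FieldNotes (year INTEGER, month INTEGER, day INTEGER, notes TEXT)"
)
con.execute(
    "INSERT INTO FieldNotes VALUES "
    "(2015, 10, 1, 'Vegetation seems to have returned to normal...')"
)

# The note was really for October 2014, so correct the year on that record.
con.execute(
    "UPDATE FieldNotes SET year = 2014 "
    "WHERE year = 2015 AND month = 10 AND day = 1"
)
con.commit()

fixed_year = con.execute("SELECT year FROM FieldNotes").fetchone()[0]
print(fixed_year)  # prints 2014
```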

  6. -- SELECT --

    For this and many of the following problems you will create queries that retrieve the relevant information from the Portal small mammal survey database. As you begin to familiarize yourself with the database you will need to know some details regarding what is in it in order to answer the questions. For example, you may need to know which species is associated with a two-character species ID, or the units for an individual’s weight. This kind of information associated with data is called metadata, and the metadata for this dataset is available online at Ecological Archives.

    1. Write a query that displays all of the records for all of the fields (*) in the main table. Save it as a view named all_survey_data.
    2. We want to generate data for an analysis of body size differences between males and females of each species. We have decided that we can ignore the information related to when and where the individuals were trapped. Create a query that returns all of the necessary information, but nothing else. Save this as size_differences_among_sexes_data.
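The general pattern is SELECT plus CREATE VIEW. A sketch against a one-row hypothetical stand-in for the main table (the column names species_id, sex, and wgt are assumptions; check them against the real schema):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE surveys (year INTEGER, species_id TEXT, sex TEXT, wgt REAL)"
)
con.execute("INSERT INTO surveys VALUES (1977, 'DM', 'M', 40)")

# 1. every field, every record, saved as a view
con.execute("CREATE VIEW all_survey_data AS SELECT * FROM surveys")

# 2. only what a size-by-sex analysis needs: species, sex, and weight
con.execute(
    "CREATE VIEW size_differences_among_sexes_data AS "
    "SELECT species_id, sex, wgt FROM surveys"
)

row = con.execute("SELECT * FROM size_differences_among_sexes_data").fetchone()
print(row)  # prints ('DM', 'M', 40.0)
```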
  7. -- WHERE --

    A population biologist (Dr. Undomiel) who studies the population dynamics of Dipodomys spectabilis would like to use some data from Portal, but she doesn’t know how to work with large datasets. Being the kind and benevolent person that you are, write a query to extract the data that she needs. She wants only the data for her species of interest, when each individual was trapped, and what sex it was. She doesn’t care about where it was trapped within the site because she is going to analyze the entire site as a whole, and she doesn’t care about the size of the individuals. She doesn’t need the species codes because you’re only providing her with the data for one species, and since she isn’t looking at the database itself the two-character abbreviation would probably be confusing. Save this query as a view with the name spectabilis_population_data.

    Scroll through the results of your query. Do you notice anything that might be an issue for the scientist to whom you are providing this data? You should! Think about what you should do in this situation…

    You decide that to avoid invoking her wrath, you’ll send her a short e-mail* requesting clarification regarding what she would like you to do about this complexity. Dr. Undomiel e-mails you back and asks that you create two additional queries so that she can decide what to do about this issue later. She would like you to create a query with the same data as above, but only for cases where the sex is known to be male, and an additional query with the same data, but only where the sex is known to be female. Save these as views with the names spectabilis_population_data_males and spectabilis_population_data_females.

    *Short for elven-mail
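Filtering on a value is what WHERE does; the three views differ only in their conditions. A sketch against a hypothetical miniature table (column names are assumptions; the row with sex NULL stands in for the "issue" you should notice when scrolling through the real results):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE surveys "
    "(year INTEGER, month INTEGER, day INTEGER, species_id TEXT, sex TEXT)"
)
con.executemany("INSERT INTO surveys VALUES (?, ?, ?, ?, ?)", [
    (1977, 7, 16, "DS", "M"),
    (1977, 7, 16, "DS", None),   # sex unrecorded: the complexity to ask about
    (1977, 7, 16, "DM", "F"),    # wrong species, excluded everywhere
])

con.execute(
    "CREATE VIEW spectabilis_population_data AS "
    "SELECT year, month, day, sex FROM surveys WHERE species_id = 'DS'"
)
con.execute(
    "CREATE VIEW spectabilis_population_data_males AS "
    "SELECT year, month, day, sex FROM surveys "
    "WHERE species_id = 'DS' AND sex = 'M'"
)
con.execute(
    "CREATE VIEW spectabilis_population_data_females AS "
    "SELECT year, month, day, sex FROM surveys "
    "WHERE species_id = 'DS' AND sex = 'F'"
)

n_all = con.execute("SELECT COUNT(*) FROM spectabilis_population_data").fetchone()[0]
n_males = con.execute("SELECT COUNT(*) FROM spectabilis_population_data_males").fetchone()[0]
print(n_all, n_males)  # prints 2 1
```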

  8. -- ORDER BY --

    The graduate students that work at the Portal site are hanging out late one evening drinking… soda pop… and they decide it would be an epically awesome idea to put together a list of the 100 largest rodents ever sampled at the site. Since you’re the resident computer genius they text you, and since you’re up late working and this sounds like a lot more fun than the homework you’re working on (which isn’t really saying much, if you know what I’m saying) you decide you’ll make the list for them.

    The rules that the Portal students have come up with (and they did spend a sort of disturbingly long time coming up with these rules; I guess you just had to be there) are:

    1. The data should include the species_id, year, and weight. These columns should be presented in this order.
    2. Individuals should be sorted in descending order with respect to mass.
    3. Since individuals often have the same mass, ties should be settled by sorting next by hind foot length and finally by the year of capture.

    Since you need to limit this list to the top 100 largest rodents, you’ll need to add the SQL command LIMIT 100 to the end of the query. Save the final query as 100_largest_individuals.
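The rules above map onto an ORDER BY with several sort keys plus LIMIT. A sketch on a hypothetical three-row table (wgt and hfl are assumed names for the weight and hind foot length columns, and the tie-break direction on hind foot length is an assumption; note that a view name starting with a digit must be quoted in SQLite):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE surveys (species_id TEXT, year INTEGER, hfl REAL, wgt REAL)"
)
con.executemany("INSERT INTO surveys VALUES (?, ?, ?, ?)", [
    ("DS", 1980, 50, 150),
    ("DS", 1979, 52, 150),   # same mass: hind foot length breaks the tie
    ("DM", 1985, 36, 48),
])

# quoted because SQLite identifiers cannot start with a digit unquoted
con.execute(
    'CREATE VIEW "100_largest_individuals" AS '
    "SELECT species_id, year, wgt FROM surveys "
    "ORDER BY wgt DESC, hfl DESC, year DESC "
    "LIMIT 100"
)

rows = con.execute('SELECT * FROM "100_largest_individuals"').fetchall()
print(rows)  # the 52 mm hind foot wins the 150 g tie
```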

  9. -- DISTINCT --

    Write a query that returns a list of the dates that mammal surveys took place at Portal with no duplicates. Save it as dates_sampled.
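De-duplication is what DISTINCT does: one output row per unique combination of the selected columns. A sketch on a hypothetical stand-in table (assumed year/month/day date columns):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE surveys (year INTEGER, month INTEGER, day INTEGER)")
con.executemany("INSERT INTO surveys VALUES (?, ?, ?)", [
    (1977, 7, 16),
    (1977, 7, 16),   # duplicate date: collapsed by DISTINCT
    (1977, 7, 17),
])

con.execute(
    "CREATE VIEW dates_sampled AS "
    "SELECT DISTINCT year, month, day FROM surveys"
)

dates = con.execute("SELECT * FROM dates_sampled").fetchall()
print(sorted(dates))  # prints [(1977, 7, 16), (1977, 7, 17)]
```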

  10. -- Missing Data --

    Write a query that returns the year, month, day, species_id, and mass for every record where there is no missing data in any of these fields. Save it as no_missing_data.
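Missing values in SQL are NULL, and you test for them with IS NOT NULL (not with = or !=). A sketch on a hypothetical miniature table (wgt is an assumed name for the mass column):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE surveys "
    "(year INTEGER, month INTEGER, day INTEGER, species_id TEXT, wgt REAL)"
)
con.executemany("INSERT INTO surveys VALUES (?, ?, ?, ?, ?)", [
    (1977, 7, 16, "DM", 40),
    (1977, 7, 16, "DS", None),   # missing weight: excluded
    (1977, 7, 17, None, 22),     # missing species: excluded
])

con.execute(
    "CREATE VIEW no_missing_data AS "
    "SELECT year, month, day, species_id, wgt FROM surveys "
    "WHERE year IS NOT NULL AND month IS NOT NULL AND day IS NOT NULL "
    "AND species_id IS NOT NULL AND wgt IS NOT NULL"
)

rows = con.execute("SELECT * FROM no_missing_data").fetchall()
print(rows)  # prints [(1977, 7, 16, 'DM', 40.0)]
```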

  11. -- GROUP BY --

    Using GROUP BY, write a query that returns a list of dates on which individuals of the species Dipodomys spectabilis (indicated by the DS species code) were trapped (with no duplicates). Sort the list in chronological order (from oldest to newest). Save it as dates_with_dipodomys_spectabilis.
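GROUP BY collapses all rows that share the grouped columns into one, which gives the same de-duplication as DISTINCT; ORDER BY then handles the chronological sort. A sketch on a hypothetical stand-in table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE surveys "
    "(year INTEGER, month INTEGER, day INTEGER, species_id TEXT)"
)
con.executemany("INSERT INTO surveys VALUES (?, ?, ?, ?)", [
    (1978, 1, 8, "DS"),
    (1977, 7, 16, "DS"),
    (1977, 7, 16, "DS"),   # second DS capture that day: same group
    (1977, 7, 16, "DM"),   # wrong species: filtered out by WHERE
])

con.execute(
    "CREATE VIEW dates_with_dipodomys_spectabilis AS "
    "SELECT year, month, day FROM surveys "
    "WHERE species_id = 'DS' "
    "GROUP BY year, month, day "
    "ORDER BY year, month, day"
)

dates = con.execute("SELECT * FROM dates_with_dipodomys_spectabilis").fetchall()
print(dates)  # prints [(1977, 7, 16), (1978, 1, 8)]
```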

  12. -- COUNT --

    Write a query that returns the number of individuals of all species combined (total_abundance) in each year, sorted chronologically. Include the year in the output. Save it as total_abundance_by_year.
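Counting per group is COUNT(*) combined with GROUP BY, with AS naming the computed column. A sketch on a hypothetical stand-in table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE surveys (year INTEGER, species_id TEXT)")
con.executemany("INSERT INTO surveys VALUES (?, ?)", [
    (1977, "DM"),
    (1977, "DS"),
    (1978, "DS"),
])

con.execute(
    "CREATE VIEW total_abundance_by_year AS "
    "SELECT year, COUNT(*) AS total_abundance FROM surveys "
    "GROUP BY year "
    "ORDER BY year"
)

rows = con.execute("SELECT * FROM total_abundance_by_year").fetchall()
print(rows)  # prints [(1977, 2), (1978, 1)]
```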

  13. -- SUM --

    Write a query that returns the number of individuals of each species captured in each year (total_abundance) and the total_biomass of those individuals (the sum of the weight column). The units for biomass should be in kilograms. Include the year and species_id in the output. Sort the result chronologically by year and then alphabetically by species. Save as mass_abundance_data.
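This combines COUNT and SUM in one grouped query, with the gram-to-kilogram conversion done in the SELECT. A sketch on a hypothetical stand-in table (wgt is an assumed name for the weight column, assumed to be in grams per the metadata):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE surveys (year INTEGER, species_id TEXT, wgt REAL)")
con.executemany("INSERT INTO surveys VALUES (?, ?, ?)", [
    (1977, "DM", 40),
    (1977, "DM", 60),
    (1977, "DS", 120),
    (1978, "DS", 130),
])

con.execute(
    "CREATE VIEW mass_abundance_data AS "
    "SELECT year, species_id, COUNT(*) AS total_abundance, "
    "SUM(wgt) / 1000.0 AS total_biomass "   # grams -> kilograms
    "FROM surveys "
    "GROUP BY year, species_id "
    "ORDER BY year, species_id"
)

rows = con.execute("SELECT * FROM mass_abundance_data").fetchall()
print(rows)  # prints [(1977, 'DM', 2, 0.1), (1977, 'DS', 1, 0.12), (1978, 'DS', 1, 0.13)]
```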
