Keeping Data Tidy

Learning Objectives

Following this assignment students should be able to:

implement quality control for data entry into spreadsheets

organize data files in keeping with tidy data principles

tidy a data table with redundant fields or overfilled cells

Reading

Tidy Data - Reading
Data Organization - Reading

Exercises

-- Data entry validation in spreadsheets --

You’re starting a new study of small mammals at the NEON site at Ordway-Swisher. Create a spreadsheet in Excel for data entry. It should have four columns: Year, Site, Species, and Mass.

Set the following data validation criteria to prevent any obviously wrong data from getting entered:
1. Year must be an integer between 2015 and 2025.
2. Site should be one of the following A1, A2, B1, B2.
3. Species should be one of the following Dipodomys spectabilis, Dipodomys ordii, Dipodomys merriami.
4. Mass should be a decimal greater than or equal to zero but less than or equal to 500 since mass is measured in grams in this study and nothing bigger than half a kilogram will possibly fit into your Sherman traps. Change the error message on this validation criteria to explain why data is invalid and what the valid values are.
Save this file as yourname_ordway_mammal_data.xlsx.
-- Messy Portal Data --

A lot of real data isn’t very tidy, mostly because most scientists aren’t taught about how to structure their data in a way that is easy to analyze.

Download a messy version of some of the Portal Project data. Note that there are multiple tabs in this spreadsheet.

Think about what could be improved about this data. In a text file:

1-5. Describe five things about this data that are not tidy and how you could fix each of those issues.

6. Could this data easily be imported into a database in its current form?

7. Do you think it’s a good idea to enter the data like this and clean it up later, or to have a good data structure for analysis by the time data is being entered? Why?
-- Tree Biomass --

Estimating the total amount of biomass (the total mass of all individuals) in forests is important for understanding the global carbon budget and how the earth will respond to increases in carbon dioxide emissions. Measuring the mass of whole trees is a major effort and requires destructive harvest of the tree. Fortunately, we can estimate the mass of a tree based on its diameter.

There are lots of equations for estimating the mass of a tree from its diameter, but one good option is the equation:

Mass = 0.124 * Diameter^2.53

where Mass is measured in kg of dry above-ground biomass and Diameter is in cm DBH (Brown 1997).

We’re going to estimate the total tree biomass for trees in a 96 hectare area of the Western Ghats in India. The raw data is available on Ecological Archives. Unfortunately, the data is stored in a poor database structure and using all of the tree stems would be difficult without first tidying up the data. You can have a look at the metadata to get familiar with the data structure.
1. Use tidyr to gather() the raw data into rows for each measured stem.
2. Write a function that takes a vector of tree diameters as an argument and
  returns a vector of tree masses.
3. Stems are measured in girth (or circumference) rather than diameter. Write a function that takes a vector of circumferences as an argument and returns a vector of diameters (circumference = pi * diameter).
4. Use the two functions you’ve written to estimate the total biomass (i.e., the sum of the masses) of trees in this dataset and print the result to the screen.
5. separate() the SpCode into GenusCode and SpEpCode. Technically the four letter code doesn’t uniquely identify all of the genera in the dataset, but we’ll assume it does for the purpose of this exercise.
[click here for output]
-- Applying Tidy Data Principles --

Tidy data principles and other data management concepts are a great way to think about organizing data, but they really only help us if we actually put them into practice. For this exercise you may choose one of two options:
1. The first option is to create a spreadsheet for managing your own research data. For this option you will need to create a spreadsheet or set of spreadsheets based on principles from the readings for this lesson. You will also need to create data validation rules for each column to ensure data are entered correctly.
OR
1. The second option is to write a 500-word detailed description of how you will implement data validation and tidy data principles within your thesis project.
In either case, the goal of this exercise is for you to carefully think through how to capture and organize your data so that you will be able to trust in the accuracy of your data and easily manipulate them for further analysis.

Data Science for Agriculture

Assignment

Learning Objectives

Reading

Exercises

-- Data entry validation in spreadsheets --

-- Messy Portal Data --

-- Tree Biomass --

-- Applying Tidy Data Principles --