Learning Objectives
Following this assignment students should be able to:
- use joins to combine tables in SQL
- nest the results of one query as the input for another
- understand the basic rules of tidy data
Reading
- Tidy Data - Sections 1-3
- Data Organization - Reading
- Joins - Video | Reading
- Nested Queries - Video | Reading
Exercises
-- JOIN 1 --
The
Plots
table in the Portal database can be joined to theSurveys
table by joiningplot_id
toplot_id
and theSpecies
table can be joined to theSurveys
table by joiningspecies_id
tospecies_id
.The Portal mammal data include data from a number of different experimental manipulations. You want to do a time-series analysis of the natural population dynamics of all of the rodent species at the site, so write a query that returns the
[click here for output]year
,month
,day
,genus
andspecies
of every individual captured on theControl
plots. Choose only rodent species (i.e., species for which thetaxa
field in the species table isRodent
) and exclude all individuals that have not been identified to genus (i.e., species for which thespecies_id
field in the species table isUR
). For easier viewing, sort the results alphabetically by genus and species and then chronologically. Save this query asdata_from_controls
.-- JOIN 2 --
You are curious about what other kinds of animals get caught in the Sherman traps used to census the rodents. Write a query that returns a list of the
[click here for output]genus
,species
, andtaxa
(from the species table) for non-rodent individuals that are caught on theControl
plots. Non-rodents are indicated in thetaxa
column of theSpecies
table. You are only interested in which species are captured, so make this list unique (only one line for each species). Please also sort the results by genus and species. Save this query asnon_rodents_on_controls
.-- JOIN 3 --
We want to do an analysis comparing the size of individuals on the control plots to the Long-term Krat Exclosures. Write a query that returns the
[click here for output]year
,genus
,species
,weight
and theplot_type
for all cases where the plot type is eitherControl
orLong-term Krat Exclosure
. Be sure to choose only rodents and exclude individuals that have not been identified to genus as in Joins 1. Remove any records where theweight
is missing. Save this query assize_comparison_controls_vs_krat_exclosures
.-- JOIN 4 --
Write a query that displays the total number of rodent individuals sampled on each
plot_type
. Save this query asindividuals_per_plot_type
.Note that the number of plots varies among plot types so this number isn’t very informative. We’ll learn how to take this into account soon.
[click here for output]-- Nesting Queries --
Write a query that returns the average number of individuals sampled on each plot type on a per plot basis. In other words, how many individuals are sampled on average on a plot of a given plot type? So, if 500 individuals are sampled on a plot type that has 5 plots, then the result should be 100 individuals.
To do this you will need to combine multiple queries. It will be easiest for first figure out which queries you will need, then write the first query and make sure it is working, and then add the second query.
Save this query as
[click here for output]per_plot_individuals_per_plot_type
.-- Real Portal Data --
A lot of real data isn’t very tidy, mostly because most scientists aren’t taught about how to structure their data in a way that is easy to analyze.
Download a messy version of some of the Portal Project data. Note that there are multiple tabs in this spreadsheet.
Think about what could be improved about this data. In a text file:
1-5. Describe five things about this data that are not tidy and how you could fix each of those issues.
6. Could this data easily be imported into a database in its current form?
7. Do you think it’s a good idea to enter the data like this and clean it up later, or to have a good data structure for analysis by the time data is being entered? Why?