Ecological Data Module Assignment
Day 4 Ecological Data Module Assignment
Complete the questions and record the answers as you go. When you are confident, input the answers using the quiz in LMS. You will have up to 10 attempts to input the answers, and up to 24 hours for each attempt, but you will get no feedback after each attempt, and only your final attempt will be counted. As always, remember that plagiarism is considered a serious offence at UWA – you are welcome to discuss and compare your work with other people but you must write your own script and do all the questions yourself. There is quite a lot of work involved and some parts are challenging: allow yourself plenty of time and don’t drive yourself crazy thinking you have to get 100%!
Question 1: Effect of time since fire on species abundance
This question is similar to the first example from the labs, so make sure you’ve done and understood that one first. A researcher was investigating the effect of fire on the abundance of a rare lizard species. She identified different sites within a reserve that had been last burned at different times. She then used pit traps to survey lizard abundance, recording the number of lizards caught within each site. The data is in the file ‘timesincefiredata_assignment.xlsx’.
Get the data into R and have a look at it (including a plot). Does it look like there is a relationship between time since fire and the abundance of the lizard? Do you think it looks linear, or does it look like it might be more exponential? Does it go up and down like the data in the lab example? Do you think you’ll need a parabola to model it, like we did in the lab?
We’ll use a Poisson family glm to model the data. Why is this more appropriate than a standard linear model with normal (Gaussian) errors, or a glm with binomial errors or Gamma errors?
Fit a series of models to the data, predicting abundance in terms of time since fire. Try models with the following link functions: identity, log and square root. Looking at help(family) in R will tell you how to use different link functions. Next fit a further three models using the same three link functions, but with an additional explanatory variable – time since fire squared, to see if a quadratic model is justified. Plot the predictions of the six fitted models. Which one looks like it fits the data best?
Compare the AIC values of the six models to decide which one is best. (Record the AIC values).
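As a rough sketch of the fitting and AIC comparison steps, the code below uses simulated counts rather than the assignment data. The data frame `sim` and its columns `time` and `abundance` are placeholder names, so substitute whatever your imported data uses. Only four of the six models are shown; the other quadratic models follow the same pattern.

```r
# Simulated stand-in data (NOT the assignment data); all names are placeholders
set.seed(1)
sim <- data.frame(time = rep(1:20, each = 2))
sim$abundance <- rpois(nrow(sim), lambda = 5 + 0.5 * sim$time)

# Poisson glms with the three link functions named in the question
m_identity <- glm(abundance ~ time, family = poisson(link = "identity"), data = sim)
m_log      <- glm(abundance ~ time, family = poisson(link = "log"), data = sim)
m_sqrt     <- glm(abundance ~ time, family = poisson(link = "sqrt"), data = sim)

# Quadratic version: I() protects the squared term inside the formula
m_log_quad <- glm(abundance ~ time + I(time^2),
                  family = poisson(link = "log"), data = sim)

# One row per model, with its degrees of freedom and AIC
aic_table <- AIC(m_identity, m_log, m_sqrt, m_log_quad)
aic_table
```

Predictions for plotting can be obtained with `predict(model, newdata, type = "response")`, which returns values on the scale of the counts whichever link was used.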
Check this best model for evidence of overdispersion. If needed, fit an appropriate model to account for overdispersion.
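One common way to check for overdispersion is to compare the residual deviance with the residual degrees of freedom; a ratio well above 1 suggests overdispersion. The sketch below illustrates this on simulated, deliberately overdispersed counts (not the assignment data), with placeholder names `sim`, `time` and `abundance`.

```r
# Simulated overdispersed counts (negative binomial), NOT the assignment data
set.seed(2)
sim <- data.frame(time = rep(1:20, each = 2))
sim$abundance <- rnbinom(nrow(sim), size = 1.5, mu = exp(1 + 0.08 * sim$time))

m_pois <- glm(abundance ~ time, family = poisson, data = sim)

# Residual deviance per residual degree of freedom
dispersion_ratio <- deviance(m_pois) / df.residual(m_pois)
dispersion_ratio

# Quasi-Poisson keeps the same coefficients but estimates a dispersion parameter,
# which inflates the standard errors; F tests are then used instead of Chi-squared
m_quasi <- glm(abundance ~ time, family = quasipoisson, data = sim)
summary(m_quasi)$dispersion
anova(m_quasi, test = "F")
```

The quasi-Poisson fit changes the standard errors and p-values but not the fitted coefficients.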
Record the p-value that you would report for significance of the effect of time since fire on the abundance of the lizard based on this best model.
What would you conclude from this analysis?
Question 2: Drought experiments
Find and open the ‘drought’ Excel file. This data shows the results from a drought experiment for four different species. There were ten big pots for each species. In each case, about the same amount of seed was sown in each pot, but due to variability in germination, the number of plants in each pot was quite variable.
The number of plants in each pot was counted and then a six week drought was applied. After six weeks the pots were watered again. The number of plants that survived the drought was then counted in each of the pots.
The research question was whether there were differences between species in drought tolerance (ie survival).
First put the data into an appropriate format within Excel for entry into R (a separate row for each rep, or pot in this case), convert to .csv and read it in.
(Or convert within R if you prefer)
Next calculate the percentage survival for each pot in R.
Use a boxplot to plot the percentage survival by species – what does this show?
Use a standard linear ANOVA to test for differences between species. What do you find?
There are several major problems with using a standard linear ANOVA in this case. Binomial data, like survival data, is not likely to be normally distributed, especially if there are values close to 0% or 100%. Traditionally this was dealt with by arcsin transformation. You could do this to the data easily in R, then try the ANOVA again. However, we have another problem because there are different numbers of plants in each pot. We should therefore give the pots with more plants more weight in the analysis. This can be done using the ‘lm’ function and you can look up how to do it, but it is much easier to use the ‘glm’ function with binomial error distribution, which will handle all these issues automatically.
Use a binomial glm analysis to test whether there are differences between species. Don’t forget to check whether there is evidence of overdispersion in the data, and account for it if needed.
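A minimal sketch of the binomial glm set-up follows, using simulated pots rather than the drought data. The column names `species`, `initial` and `survived` are assumptions. The two-column response `cbind(successes, failures)` is what lets `glm` weight each pot by the number of plants it contained.

```r
# Simulated stand-in data (NOT the drought file); column names are assumptions
set.seed(3)
pots <- data.frame(species = factor(rep(c("A", "B", "C", "D"), each = 10)),
                   initial = rpois(40, lambda = 20) + 5)
p_true <- unname(c(A = 0.3, B = 0.5, C = 0.55, D = 0.8)[as.character(pots$species)])
pots$survived <- rbinom(40, size = pots$initial, prob = p_true)

# Percentage survival per pot, for plotting
pots$pct <- 100 * pots$survived / pots$initial

# Binomial glm: response is (number survived, number died) per pot
m_bin <- glm(cbind(survived, initial - survived) ~ species,
             family = binomial, data = pots)
anova(m_bin, test = "Chisq")

# Overdispersion check, and a quasibinomial refit in case it is needed
deviance(m_bin) / df.residual(m_bin)
m_qbin <- update(m_bin, family = quasibinomial)
anova(m_qbin, test = "F")
```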
If you find a significant effect of species overall, then how to do pair-wise comparisons is always a good question… there is not really any super easy approach with a binomial glm. One possibility in this case to test whether two similar species are significantly different is to do a glm on a subset of the data containing only these species. You can get the subset by using the R function ‘subset’.
Another approach might be to relabel the two most similar species with one name (so they are the same level for the species factor), then fit another model (give it another name), then test whether the two models are different. (This is like what we did in labs for the germination example). If they are not significantly different, then the relabelling is ok, which means the species are not significantly different. You can continue trying to group species in this way until you know that all ungrouped combinations are significantly different.
Test for differences between species using either of these two methods above.
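Both pair-wise approaches can be sketched roughly as below, again on simulated pots with assumed column names (`species`, `initial`, `survived`) and made-up species labels. If the data turn out to be overdispersed, the same steps would apply with `family = quasibinomial` and `test = "F"`.

```r
# Simulated stand-in data (NOT the drought file); names and labels are assumptions
set.seed(3)
pots <- data.frame(species = factor(rep(c("A", "B", "C", "D"), each = 10)),
                   initial = rpois(40, lambda = 20) + 5)
p_true <- unname(c(A = 0.3, B = 0.5, C = 0.55, D = 0.8)[as.character(pots$species)])
pots$survived <- rbinom(40, size = pots$initial, prob = p_true)

# Approach 1: refit using only the two species of interest
pair <- droplevels(subset(pots, species %in% c("B", "C")))
m_pair <- glm(cbind(survived, initial - survived) ~ species,
              family = binomial, data = pair)
anova(m_pair, test = "Chisq")

# Approach 2: give the two species the same level name, then compare nested models
pots$species_merged <- pots$species
levels(pots$species_merged)[levels(pots$species_merged) %in% c("B", "C")] <- "BC"
m_full   <- glm(cbind(survived, initial - survived) ~ species,
                family = binomial, data = pots)
m_merged <- glm(cbind(survived, initial - survived) ~ species_merged,
                family = binomial, data = pots)
comparison <- anova(m_merged, m_full, test = "Chisq")
comparison
```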
Another issue in this case is the fact that plant density may have had an effect on survival, and that this effect could have depended on species as well. Fit another glm with pre-drought plant number as a covariate, and determine whether there is evidence for whether plant density may have had an effect on survival, and whether this effect could have depended on species. If you find evidence that plant density has had an effect on survival then you will need to test for differences between species again, while also accounting for this effect of plant density. In this case, plotting percentage survival against initial plant density with different colours/characters for different species will help a lot, especially if you then plot the model predictions for the different species as well.
Write down your conclusions about this experiment based on your full analysis.
Another Drought Experiment
The same drought experiment is repeated with the same species, but in a different soil, and you are happy you did the analysis of the first one in R, because now you have all the code you need to do the analysis of the second. The data for this experiment can be found in the ‘drought2’ Excel file.
Redo the analysis above for the second experiment, making sure you check for overdispersion and the effect of plant density again, and accounting for them as needed.
Write down your conclusions about this second drought experiment based on your full analysis.
Question 3: Mountain diversity example
Researchers want to test the hypothesis that plant diversity increases with altitude. They find different mountains where national parks or other protected areas cover a reasonably wide range of altitudes going up the side of the mountain – wide enough to enable them to sample sites at different altitudes along a transect up the side of the mountain. They record abundances of all plant species at each site on each mountain, and then calculate the diversity of each site. The resulting data is in the file “mountaindiversity.csv”.
How many different mountains did they sample?
Did they sample the same number of sites on each mountain?
Were the sites sampled on each particular mountain evenly spaced in altitude?
Were the sites sampled on a particular mountain always at the same altitude as the sites sampled on every other mountain?
Plot the height of all the sites against their diversity. Does it look like there is a relationship? Positive or negative? Fit a simple linear model predicting diversity from height. Is it significant? What is the p-value? Does the fitted model indicate a positive or negative relationship (whether significant or not)? Is this test valid? Why/why not?
Boxplot diversity predicted by mountain. Does it look like there are differences between mountains? Fit a simple linear model predicting diversity from mountain. Is it significant? What is the p-value? Is this test valid? Why/why not?
Calculate the mean diversity for each mountain and the mean height for each mountain. Plot these mean heights (on the x-axis) against these mean diversities (on the y-axis). Does it look like there is a relationship? Positive or negative? Fit a simple linear model predicting mean diversity from mean height. Is it significant? What is the p-value? Does the fitted model indicate a positive or negative relationship (whether significant or not)? Is this test valid? Why/why not?
Use the ‘aov’ and ‘Error’ functions to fit a linear model predicting diversity in terms of height, but with a random effect for mountain. Is height a significant predictor of diversity? What is the p-value? Does the fitted model indicate a positive or negative relationship (whether significant or not)? Is this test valid? Why/why not?
Use the ‘lme’ function from the ‘nlme’ library to fit a linear model predicting diversity in terms of height, but with a random effect for mountain. Is height a significant predictor of diversity? What is the p-value? Does the fitted model indicate a positive or negative relationship (whether significant or not)? Is this test valid? Why/why not?
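The two random-effect fits described above might look something like the following sketch. It runs on simulated data, not on “mountaindiversity.csv”, and the column names `mountain`, `height` and `diversity` as well as the height units are assumptions.

```r
library(nlme)  # recommended package, normally installed alongside R

# Simulated stand-in data (NOT mountaindiversity.csv); names and units are assumptions
set.seed(4)
mtn <- data.frame(mountain = factor(rep(paste0("M", 1:5), each = 6)))
mtn$height <- rep(c(5, 9, 13, 17, 21), each = 6) + rep(0:5, times = 5)
mountain_effect <- rnorm(5, sd = 0.5)[as.integer(mtn$mountain)]
mtn$diversity <- 2 + mountain_effect + 0.05 * mtn$height + rnorm(30, sd = 0.2)

# aov with an Error term: mountain defines a separate error stratum
fit_aov <- aov(diversity ~ height + Error(mountain), data = mtn)
summary(fit_aov)

# lme: fixed effect of height, random intercept for each mountain
fit_lme <- lme(diversity ~ height, random = ~ 1 | mountain, data = mtn)
summary(fit_lme)$tTable  # estimate, SE, df, t-value and p-value for fixed effects
```

In the `aov` output, height is tested separately within each stratum, whereas `lme` reports a single fixed-effect estimate for height.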
Plot the height of all the sites against their diversity again, but this time using a different colour and/or symbol for each mountain. Hopefully this will help you understand the results of the previous tests better. Make any other plots you think will help. Write down your conclusions from the analysis and plotting that you have done.
Question 4: Germination Data Revisited
In this question we go back to germination data, like in the lab. Lots of what you’ll need was covered in the lab examples, but there are also some fairly challenging parts to this question. If you are struggling and short of time, you may decide to skip some of the more challenging parts towards the end, and focus on doing all the other questions well. The last few percent may not be worth compromising your mental health!
The data set comes from an experiment on germinating white sapote seeds (Casimiroa edulis, also known as Aztec fruit or cochitzapotl). Sapote growers would like seeds to germinate faster, so researchers have trialled a new treatment they hope will speed up germination. They had 100 pots, each with a white sapote seed in it. 50 randomly selected pots were treated and the remainder acted as controls. The researchers checked every day and recorded the day that each seed germinated. The data is recorded in a file called ‘white sapote time to germination.csv’.
Get the data into R and call the data frame that is read in ‘ws’. Have a look at ‘ws’. The two columns correspond to the times to germination for the 50 seeds for the control/treatment as indicated.
You’ll need to get the data into standard format for the next bit ie one variable for all the times and another indicating which treatment. You can do this in R or Excel as you prefer (quicker in R if you can work it out – use the ‘c’ function to make a variable with all the times and factor(rep(c('c','t'),each=50)) will make a variable with the treatments.) Make a boxplot showing time to germination for the two treatments (ie treatment/control). Does it look like the treatment has an effect? What kind of effect?
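The reshaping step could be sketched as follows. Here `ws` is simulated rather than read from the file, and the column name `treatment` is an assumption (check the names in your own `ws` with `names(ws)`).

```r
# Simulated stand-in for the file contents; the column name 'treatment' is an assumption
set.seed(5)
ws <- data.frame(control   = ceiling(rgamma(50, shape = 6, rate = 0.2)),
                 treatment = ceiling(rgamma(50, shape = 6, rate = 0.3)))

times <- c(ws$control, ws$treatment)          # all 100 times, controls first
trt   <- factor(rep(c('c', 't'), each = 50))  # matching treatment labels
germ  <- data.frame(times = times, trt = trt)
head(germ)
```

From here, `boxplot(times ~ trt, data = germ)` would draw the boxplot. The order of the labels in `rep` has to match the order in which the columns were joined with `c`.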
Fit a linear model predicting time to germination by treatment. Plot the residuals by treatment (and/or fitted value). Does it look like we have homogeneity of variance? Plot a histogram of the residuals. Does it look like we have normal residuals/errors? Apply a Bartlett test and a Shapiro test to check these… you should find that they clearly indicate that there is a problem with both heterogeneity of variance and non-normal residuals/errors. In what way do the residuals/errors appear non-normal?
These data are actually ‘survival data’, in that they represent the time until something happened for a sample of individuals. A linear model is often likely to be inappropriate for such data, as we see above. A Poisson model is a possible option for such data in this case, because the days are all whole numbers, like counts. Fit a Poisson glm predicting time to germination by treatment, with default link function, and look at the results. Is over-dispersion a problem? How do you know? Fit a quasipoisson model to deal with this. Does this quasipoisson model indicate that the treatment has a significant impact?
A glm with a Gamma distribution for errors is often used to model ‘survival data’, at least in simple cases. Fit a Gamma glm predicting time to germination by treatment, with default link function, and look at the results. Does this model indicate that the treatment has a significant impact? Is the level of significance indicated by the Gamma glm very different to that indicated by the quasipoisson model? (remember anything between 0.01 and 0.05 gets one star * indicating significant, whereas less than 0.01 gets ** indicating highly significant, so 0.015 vs 0.0015 would be very different; 0.015 vs 0.025 would not be very different). Record the AIC of all models fitted so far… which one appears to be the best?
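As a hedged illustration of fitting and comparing these models, the sketch below uses simulated germination times, with `germ`, `times` and `trt` as placeholder names. Note that quasi families such as quasipoisson do not have a likelihood, so R reports their AIC as `NA`; they are left out of the AIC table here.

```r
# Simulated stand-in data (NOT the white sapote file); names are placeholders
set.seed(5)
germ <- data.frame(
  times = c(ceiling(rgamma(50, shape = 6, rate = 0.2)),
            ceiling(rgamma(50, shape = 6, rate = 0.3))),
  trt = factor(rep(c("c", "t"), each = 50))
)

m_lm    <- lm(times ~ trt, data = germ)
m_pois  <- glm(times ~ trt, family = poisson, data = germ)  # default link: log
m_gamma <- glm(times ~ trt, family = Gamma, data = germ)    # default link: inverse

summary(m_gamma)$coefficients
AIC(m_lm, m_pois, m_gamma)
```

Because the default Gamma link is the inverse, a positive treatment coefficient corresponds to a shorter mean time to germination.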
Now we want to go back to do something similar to the lab – using a binomial model. For that we need to modify the data to get the total number of seeds germinated at each time, for each treatment. You could probably do that in Excel, if you had nothing better to do on a Saturday afternoon. But let’s do it in R instead – much faster. Create a histogram of the times to germination for the control only, with ‘bins’ or ‘breaks’ of size 1. This code should work: hist(ws$control,breaks=0:100). This makes a plot but also creates an R object in the process. Give this object created a name, say ‘histo’ and then look at it. Note that it has a sub-object within it called ‘counts’, which is the number in each bin of the histogram, or in this case, the number of seeds germinating on each day. This sub-object can be pulled out using histo$counts, just like pulling a variable out of a data frame. Applying the ‘cumsum’ function to these counts then provides the cumulative sum of seeds that have germinated by each day, which is what we want. It should now be pretty easy to calculate this for each treatment. You’ll then need to stick these two variables together to get one long list of the total number of seeds germinated at each time, for each treatment. Remember that the ‘c’ function sticks numbers together. And then you’ll need to create a second variable for the times and a third variable for the treatment. (All of this could be done in Excel if you prefer of course.) When you have the data sorted, you should be able to plot cumulative germination by time and get a plot like the one on the next page.
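The steps just described can be sketched as below, on simulated times rather than the real file. The column name `treatment` and the names `cumgerm`, `germinated`, `times` and `trt` are assumptions. `plot = FALSE` is added here only to skip the drawing; the returned object is the same either way.

```r
# Simulated stand-in for 'ws'; times are capped at 100 so they fall inside the breaks
set.seed(5)
ws <- data.frame(
  control   = pmin(ceiling(rgamma(50, shape = 3, rate = 0.10)), 100),
  treatment = pmin(ceiling(rgamma(50, shape = 3, rate = 0.15)), 100)
)

# Bin d covers (d - 1, d], so counts[d] is the number germinating on day d
histo <- hist(ws$control, breaks = 0:100, plot = FALSE)
germ_control <- cumsum(histo$counts)
germ_treated <- cumsum(hist(ws$treatment, breaks = 0:100, plot = FALSE)$counts)

# One long data frame: cumulative germination for each day and treatment
cumgerm <- data.frame(
  germinated = c(germ_control, germ_treated),
  times      = rep(1:100, times = 2),
  trt        = factor(rep(c("c", "t"), each = 100))
)
head(cumgerm)
```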
We can now try some binomial glms. Fit binomial glms predicting germination proportion by time and treatment, with the following link functions: logistic (logit), Cauchy cdf (cauchit), Gaussian cdf (probit) and complementary log-log (cloglog). Include an interaction. Don’t worry if you get warning messages. Record the AIC of each model. Which looks best?
Plot the predictions of each model for the control treatment only onto the data if you can. You’ll need to get ‘response’ predictions, but adjust them to make them for the number out of 50 instead of a proportion out of 1. Add a vertical line to the plot at time=27. Which of the four models gives the highest prediction for germination proportion for the control treatment at time=27? Which gives the second? Third? Last?
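A rough sketch of fitting and predicting from these binomial glms is given below for two of the four link functions; the others only need the link name changed. It uses simulated data, and `cumgerm`, `germinated`, `times` and `trt` are assumed names. Warnings about fitted probabilities of 0 or 1 may appear.

```r
# Simulated cumulative germination data (NOT the real file); names are assumptions
set.seed(5)
t_control <- pmin(ceiling(rgamma(50, shape = 3, rate = 0.10)), 100)
t_treated <- pmin(ceiling(rgamma(50, shape = 3, rate = 0.15)), 100)
cumgerm <- data.frame(
  germinated = c(cumsum(hist(t_control, breaks = 0:100, plot = FALSE)$counts),
                 cumsum(hist(t_treated, breaks = 0:100, plot = FALSE)$counts)),
  times = rep(1:100, times = 2),
  trt   = factor(rep(c("c", "t"), each = 100))
)

# Response: (germinated, not yet germinated) out of 50 seeds; '*' adds the interaction
m_logit  <- glm(cbind(germinated, 50 - germinated) ~ times * trt,
                family = binomial(link = "logit"), data = cumgerm)
m_probit <- glm(cbind(germinated, 50 - germinated) ~ times * trt,
                family = binomial(link = "probit"), data = cumgerm)
AIC(m_logit, m_probit)

# Predicted proportion for the control only, rescaled to a number out of 50
new_control <- data.frame(times = 1:100,
                          trt = factor("c", levels = levels(cumgerm$trt)))
pred_logit <- 50 * predict(m_logit, newdata = new_control, type = "response")
pred_logit[27]  # prediction at time = 27
```

After plotting the data, `lines(new_control$times, pred_logit)` and `abline(v = 27)` would add the fitted curve and the vertical line.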
Now fit another four binomial glms predicting germination proportion by time and treatment, with the same four link functions, but no interaction. Record the AIC of each model. For which model(s) (ie which link function(s)) does the AIC indicate that the interaction should be included in the model? (Don’t plot the predictions for these models.)
Now fit a binomial gam predicting germination proportion by time, with a standard logistic link function. You’ll need to load the ‘mgcv’ library, which is usually already installed with R, and then use the ‘gam’ function. This works just like ‘glm’ except you can apply a ‘smoother’ to the explanatory variable, using the ‘s’ function. In this case, this would mean the explanatory variable looks something like s(time) instead of just time (if time is what you called your time variable). Great if you can get that to work without an error! But now we want to predict germination proportion by time and treatment! For that, the explanatory variable part of the model formula should look like this: s(times,by=trt). If you can get that to work, record the AIC of the fitted gam. How does it compare to the AIC of the other fitted models? Plot the predictions of the fitted gam for the control treatment only onto the data. How does the gam prediction for germination proportion at time=27 for the control treatment compare to the other plotted predictions? There is something a bit unrealistic about the gam prediction – what is it?
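A minimal sketch of the gam fit follows, on simulated data with assumed names (`cumgerm`, `germinated`, `times`, `trt`). It uses the formula form given above; a common variant also adds `trt` as a separate term, which is not shown here.

```r
library(mgcv)  # recommended package, normally installed alongside R

# Simulated cumulative germination data (NOT the real file); names are assumptions
set.seed(5)
t_control <- pmin(ceiling(rgamma(50, shape = 3, rate = 0.10)), 100)
t_treated <- pmin(ceiling(rgamma(50, shape = 3, rate = 0.15)), 100)
cumgerm <- data.frame(
  germinated = c(cumsum(hist(t_control, breaks = 0:100, plot = FALSE)$counts),
                 cumsum(hist(t_treated, breaks = 0:100, plot = FALSE)$counts)),
  times = rep(1:100, times = 2),
  trt   = factor(rep(c("c", "t"), each = 100))
)

# A separate smooth of time for each level of trt; binomial default link is logit
m_gam <- gam(cbind(germinated, 50 - germinated) ~ s(times, by = trt),
             family = binomial, data = cumgerm)
AIC(m_gam)

# Predictions for the control only, rescaled to a number out of 50
new_control <- data.frame(times = 1:100,
                          trt = factor("c", levels = levels(cumgerm$trt)))
pred_gam <- 50 * as.numeric(predict(m_gam, newdata = new_control, type = "response"))
pred_gam[27]  # prediction at time = 27
```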
The shapes of the predictions of the four glm models are also not great, are they? One doesn’t match the data well as germination approaches 100% – which is that? The other three don’t match the data well in the very early stages – which is worst?
Let’s try one last binomial glm link function, the log link function. Try to fit a binomial glm with a ‘log’ link function. Include the interaction. You should get an error message saying that no valid set of coefficients has been found. But if we now change the model so that we are predicting the proportion of seeds that HAVEN’T germinated, instead of the proportion that have, then we should get it to work. How does the AIC of this model compare to the AIC of the other four binomial glms with interactions considered so far (ie compare this one plus the other four with interactions – the gam doesn’t count)? Plot the predictions of this binomial glm with a ‘log’ link function for the control treatment only onto the data. You will need to adjust the predictions to make them predictions for the number germinated. How does the prediction for germination proportion at time=27 for the control treatment compare to the five other plotted predictions? Does the shape of this model’s predictions look better in comparison to the data than the other models?
Based on this last model, would you conclude that the treatment has a significant effect on germination of white sapote seeds?