DATA7202 Assignment 1 (Weight: 20%)
Due: 8/4/2019.
Analysis of UCI online news popularity dataset: exploratory data analysis, prediction,
evaluation and inference with generalized linear models.
The dataset and associated information can be found at
https://archive.ics.uci.edu/ml/datasets/Online+News+Popularity
We will focus on predicting the popularity of online articles published by Mashable over two
years from 7th Jan 2013, as measured through the number of times an article was shared
online.
In particular, we try to predict popularity based on explanatory variables (features) which
would be available before the publication of the article, as mentioned in the accompanying
paper by Fernandes et al. (2015) who collected the dataset. The articles have already been
heavily processed to produce a large set of numerical and categorical attributes, and these are
what you should work with. Many more types of attributes could be extracted from the
articles and related information, but we will not pursue that here.
We will try to predict popularity in two ways. The first target is the value as recorded: the
number of times an article is shared over the period. The second target is a binary variable
which discretises the above. Following Fernandes et al. (section 3.1), an article is considered
popular if it exceeds 1400 shares. You should create this binary variable. I would like you to
use multiple linear regression for the first task and binary logistic regression for the second.
No model or variable selection is to be performed as part of this assignment (except as
specifically mentioned in questions), despite the reasonable temptation. However,
transformation of variables is allowed.
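As a minimal sketch of how the binary target can be derived from the share counts (written in Python purely for illustration, since the assignment itself is to be done in R; the example values and the variable name `shares` are assumptions):

```python
# Hypothetical share counts; in practice these come from the dataset's shares variable.
shares = [593, 1400, 1401, 8500, 711]

POPULARITY_THRESHOLD = 1400  # threshold from Fernandes et al., section 3.1

# An article is popular only if it strictly exceeds the threshold,
# so an article with exactly 1400 shares is coded as not popular.
popular = [1 if s > POPULARITY_THRESHOLD else 0 for s in shares]

print(popular)
```

The strict inequality matters at the boundary: an article with exactly 1400 shares falls in the "not popular" class.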
The aim of the assignment is to develop a solid theoretical and practical understanding of the
models, rather than to achieve the best possible performance. Even within the constraints
above, a large number of models are possible, and the Fernandes et al. article indicates some
of the many other options. I would also like you to consider the aims of the study. These are
somewhat vague, but two aspects mentioned by Fernandes et al. (p536) are:
(i) “Predicting such popularity is valuable for authors, content providers, advertisers and even
activists/politicians (e.g., to understand or influence public opinion).”
Why is not stated, but it seems likely that predicted popularity or probability of being popular
could be considered when deciding whether or not to publish a new article, for example one
submitted by a freelance journalist.
(ii) “allowing (as performed in this work) to improve content prior to publication”.
This considers how to potentially improve the (predicted) appeal of an article by considering
the effect of various changes to the article, and is discussed in section 3.2 of Fernandes et al.
1. (i) Give relevant equations and assumptions to describe the general linear model and the
logistic regression model for multivariate data (assume X has dimension p and Y is a single
variable in each case). Define any notation used.
(ii) Briefly describe the algorithms used to fit these models, their mathematical basis and
conditions which will stop them from working.
(iii) Derive the maximum likelihood estimate of the beta vector in the general linear model,
and state any assumptions required for the derivation.
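For orientation, one common way of writing the two models and the likelihood argument is sketched below. The notation is an assumption of this sketch, not a prescribed one: X is the n x (p+1) design matrix with a leading column of ones, x_i is its i-th row, and mu_i is the probability that article i is popular.

```latex
% General linear model: independent, homoscedastic Gaussian errors
y = X\beta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I_n)

% Logistic regression: independent Bernoulli responses with a logit link
Y_i \sim \mathrm{Bernoulli}(\mu_i), \qquad
\log\frac{\mu_i}{1 - \mu_i} = x_i^\top \beta

% Log-likelihood of the general linear model
\ell(\beta, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2)
  - \frac{1}{2\sigma^2}(y - X\beta)^\top (y - X\beta)

% Setting the score with respect to beta to zero gives the normal equations
\frac{\partial \ell}{\partial \beta} = \frac{1}{\sigma^2} X^\top (y - X\beta) = 0
\;\Longrightarrow\; X^\top X \hat\beta = X^\top y
\;\Longrightarrow\; \hat\beta = (X^\top X)^{-1} X^\top y
```

The last step requires X to have full column rank, so that X^T X is invertible. The logistic model has no closed-form solution of this kind and is usually fitted iteratively.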
2. Perform exploratory data analysis as relevant to the construction of the regression models.
Investigate and highlight any apparent structure in the data. Consider the assumptions of the
regression models, particularly whether there are any outliers or influential points of note.
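One simple first screen for unusual share counts is the 1.5 x IQR rule applied on the log scale. The sketch below is in Python purely for illustration (the assignment itself uses R), and the share counts are made-up values. It is a univariate check only: it does not measure leverage or influence on a fitted regression.

```python
import math
import statistics

# Hypothetical share counts, heavily right-skewed like typical count data.
shares = [500, 800, 900, 1100, 1200, 1400, 1600, 2100, 3000, 843300]

# Work on the log scale so the bulk of the data is roughly symmetric.
log_shares = [math.log(s) for s in shares]

# Quartiles of the log counts (three cut points for n=4).
q1, _median, q3 = statistics.quantiles(log_shares, n=4)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Flag observations outside the fences for closer inspection, not automatic removal.
flagged = [s for s, ls in zip(shares, log_shares)
           if ls < lower_fence or ls > upper_fence]

print(flagged)
```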
3. Use a general linear model (multiple regression) to attempt to predict the number of times
an article will be shared based on the other variables which are available before publication.
(i) You should attempt to check all assumptions of the model and report on this.
(ii) Attempt to transform or re-represent one or more of the variables so that the model fits
better. However, be aware that repeated attempts to do this count as model selection and are
thus likely to incur some selection bias with respect to any performance measure, unless
efforts are made to quarantine data for evaluation. See e.g. Wood et al. 2007 for some
discussion. However, the effect here should be fairly small.
(iii) For this model, look for an interaction between two explanatory variables which might
be useful and add it into the model. Note that the variables can be of any type. Explain why
the interaction appeared promising before you tried it. (Back this up with relevant graphs
and/or tables).
(iv) What is the effect of including this interaction?
(v) Give a table including all the estimated model parameters, confidence intervals, test
statistics and p-values.
(vi) Write valid sentences to explain the effect of the two most significant slope parameters.
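The ideas in question 3 (a transformed response plus an interaction term) can be sketched as follows. This is a Python illustration under stated assumptions rather than the R workflow the assignment asks for: the variables `n_images` and `weekend` are hypothetical, the data are synthetic and noise-free so the fit recovers the generating coefficients, and the normal equations are solved directly, which is fine for a toy example but less stable numerically than the QR-based fitting used by statistical software.

```python
import math

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("X'X is singular: columns are linearly dependent")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def fit_ols(rows, y):
    """Least-squares coefficients from the normal equations X'X b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Synthetic data: log(shares) is exactly linear in the design columns.
true_beta = [7.0, 0.05, 0.3, 0.02]  # intercept, n_images, weekend, interaction
rows, log_shares = [], []
for n_images in range(10):
    for weekend in (0, 1):
        # Design row: intercept, main effects, and their product (the interaction).
        row = [1.0, n_images, weekend, n_images * weekend]
        shares = math.exp(sum(b * x for b, x in zip(true_beta, row)))
        rows.append(row)
        log_shares.append(math.log(shares))  # log-transformed response

coef = fit_ols(rows, log_shares)
print([round(c, 4) for c in coef])
```

With a log-transformed response, a slope of b corresponds to multiplying the expected number of shares by roughly exp(b) per unit increase. With the interaction in the model, the slope for `n_images` differs between weekend and weekday articles.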
4. Use a logistic regression model to attempt to predict whether or not an article will be
popular (defined here as >1400 shares).
(i) You should attempt to check all assumptions of the model and report on this.
(ii) By default, re-use any transformations of the X variables attempted with linear regression.
Comment on whether any further transformations or re-representations of the data may be
useful. Use them if you think they help.
(iii) Give a table including all the estimated model parameters, confidence intervals, test
statistics and p-values.
(iv) Write valid sentences to explain the effect of the two most significant slope parameters.
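When explaining logistic regression slopes, it often helps to move from the log-odds scale to odds ratios. The sketch below (Python for illustration only; the estimate and standard error are made-up numbers, not results from this dataset) converts a coefficient and its standard error into an odds ratio with an approximate 95% Wald interval.

```python
import math

def odds_ratio_with_ci(beta, std_error, z=1.96):
    """Return exp(beta) and an approximate 95% Wald interval on the odds-ratio scale."""
    lower = math.exp(beta - z * std_error)
    upper = math.exp(beta + z * std_error)
    return math.exp(beta), lower, upper

# Hypothetical slope estimate and standard error for one explanatory variable.
estimate, se = 0.25, 0.05
odds_ratio, ci_low, ci_high = odds_ratio_with_ci(estimate, se)

print(f"odds ratio {odds_ratio:.3f}, 95% CI ({ci_low:.3f}, {ci_high:.3f})")
```

An odds ratio above 1 means that a one-unit increase in the variable multiplies the odds of being popular by that factor, holding the other variables fixed. It does not translate into a constant change in probability.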
5. Evaluate the predictive performance of each of the above two models using two
appropriate metrics. Include details of each metric and its advantages and disadvantages.
Compare your results with logistic regression to the reported results of Fernandes et al. with
random forests.
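As a rough illustration of candidate metrics (not a statement of which two should be chosen), the sketch below computes RMSE and MAE for a numeric target, and accuracy and AUC for a binary target. It is written in Python for illustration only, and all data values are made up.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: penalises large errors heavily."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error: less sensitive to a few extreme errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def accuracy(labels, probabilities, cutoff=0.5):
    """Proportion classified correctly at a given probability cutoff."""
    predicted = [1 if p >= cutoff else 0 for p in probabilities]
    return sum(int(l == q) for l, q in zip(labels, predicted)) / len(labels)

def auc(labels, probabilities):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [p for l, p in zip(labels, probabilities) if l == 1]
    neg = [p for l, p in zip(labels, probabilities) if l == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical numeric predictions of share counts.
actual_shares = [1000, 2000, 3000]
predicted_shares = [1100, 1900, 3300]

# Hypothetical class labels and predicted probabilities of being popular.
labels = [1, 0, 1, 0]
probabilities = [0.9, 0.4, 0.35, 0.2]

print(rmse(actual_shares, predicted_shares), mae(actual_shares, predicted_shares))
print(accuracy(labels, probabilities), auc(labels, probabilities))
```

Accuracy depends on the chosen cutoff, whereas AUC summarises ranking quality across all cutoffs. To give an honest estimate of predictive performance, such metrics should be computed on data not used for fitting.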
6. For the aims of the study, as first outlined by Fernandes et al., and any related aims which you
might reasonably ascribe to Mashable, which of the two models (multiple linear regression or
logistic regression) do you think is more useful and why? This is mainly an argument about
whether or not it is worthwhile to discretise the shares variable.
7. The model used by Fernandes et al. was a random forest. You may not know this model,
but it is essentially a large ensemble of decision tree models, with generally strong predictive
capabilities, but weak interpretability. It is hoped that the regression models you used will be
more interpretable.
(i) Explain how one can determine how to improve the predicted popularity of an article with
either of the regression models considered here.
(ii) Using your two fitted regression models, identify the articles with the highest predicted
popularity and the highest predicted probability of being popular, respectively.
(iii) For each of your two fitted regression models, list the attributes of two hypothetical
articles (fake news?) which would give the highest possible predicted popularity and
predicted probability of being popular, respectively. You should give values for every
attribute, but for each variable, keep them within the range seen in the dataset.
(iv) Based on your understanding of dependence between the variables in the dataset,
comment on whether or not these hypothetical articles could be produced.
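For a purely additive linear predictor, the search in part (iii) reduces to picking one end of each variable's observed range according to the sign of its coefficient. The sketch below is in Python for illustration only; the variable names, coefficients and ranges are hypothetical. It assumes no interaction terms and no non-monotone transformations. Because the logistic link is monotone, the same rule maximises the predicted probability.

```python
def best_attribute_values(coefficients, ranges):
    """Pick, for each variable, the end of its range that raises the linear predictor."""
    best = {}
    for name, beta in coefficients.items():
        low, high = ranges[name]
        # Positive slope: use the maximum; otherwise use the minimum.
        best[name] = high if beta > 0 else low
    return best

# Hypothetical fitted slopes (intercept omitted, since it does not affect the choice).
coefficients = {"n_images": 0.04, "title_length": -0.02, "weekend": 0.30}

# Hypothetical observed (min, max) of each variable in the dataset.
ranges = {"n_images": (0, 128), "title_length": (2, 23), "weekend": (0, 1)}

hypothetical_article = best_attribute_values(coefficients, ranges)
print(hypothetical_article)
```

This rule treats every variable as freely adjustable. It ignores constraints between variables, such as mutually exclusive indicator variables, which is the issue raised in part (iv).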
8. Take a random subsample of 1000 observations from the dataset and refit your models (to
predict each of the two response variables). Explain the main differences in results and
interpretation compared to when you used the whole dataset.
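A reproducible subsample without replacement can be drawn by fixing the random seed, as in the sketch below (Python for illustration only; the seed, the number of rows and the use of row indices are assumptions of the example).

```python
import random

def subsample_indices(n_rows, size=1000, seed=7202):
    """Draw `size` distinct row indices; the fixed seed makes the draw repeatable."""
    rng = random.Random(seed)
    return rng.sample(range(n_rows), size)

# Hypothetical dataset size; replace with the actual number of rows.
n_rows = 5000
chosen = subsample_indices(n_rows)

print(len(chosen), len(set(chosen)))
```

Reporting the seed alongside the results lets a reader reproduce exactly the same 1000 observations.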
Notes:
Where possible, give reasons for your answers. Avoid thinking “you know what I mean”. Say
what you mean and don’t assume much.
Also – please name your files something like
Yourfirstname_Yoursurname_DATA7202_assn1.pdf . I don’t enjoy having to rename all
your files to remember who wrote which one.
You should de-emphasise any R commands and raw output (e.g. put it in the appendix or just
keep (& improve the formatting of) the most important information). Use figures and tables
with numbers and captions for each and refer to them from the text. Learn to make decent or
nice figures which can be fairly easily understood by your audience. Please include your
name in the filename for all files submitted. You should not generally give R commands in
your main report and should not include any raw output – i.e. just include figures from R
(each with a title, axis labels and caption below) and put any relevant numerical output in a
table or within the text.
As per http://www.uq.edu.au/myadvisor/academic-integrity-and-plagiarism, what you submit
should be your own work. Even where working from sources, you should endeavour to write
in your own words. Equations are either correct or not, but you should use consistent notation
throughout your assignment, define all of it and ensure that your report flows logically.
You are asked to use the R software environment for this assignment. This is available on all
computers in the Maths Department and is also free to install on any of your own computers.
Information and downloads are available from http://www.r-project.org/ . RStudio
https://www.rstudio.com/ is a quality free interface for R.
Submit your assignment via TurnItIn on Blackboard. In this include the report (with graphs
and tables included) and any R programs or scripts that you write to answer the assignment.
Acceptable formats for the report include a Word document or a pdf file. I may print the
reports in greyscale for marking, so please make sure that the colours in plots will be
distinguishable from each other even in greyscale format. I prefer that you combine your files
into one (e.g. pdf) file and submit that. Please make sure that the overall file size is less than
10 MB. This is because larger sizes are quite slow to print, and it is not necessary to include
every data point on a typical plot. Using just a sample of the data can be ok for most
illustrative plots. The main exception is if you want to show outliers along with all the data.
References:
A.J. Dobson and A.G. Barnett, An Introduction to Generalized Linear Models, 4th edition,
CRC Press, 2018.
K. Fernandes, P. Vinagre, and P. Cortez, A Proactive Intelligent Decision Support System for
Predicting the Popularity of Online News, in: Pereira F., Machado P., Costa E., Cardoso A.
(eds.) Progress in Artificial Intelligence, EPIA 2015, Lecture Notes in Computer Science,
vol. 9273, Springer.
J. Maindonald and J. Braun, Data Analysis and Graphics Using R – An Example-Based
Approach, 3rd edition, Cambridge University Press, 2010 (available online via UQ library).
W. N. Venables and B. D. Ripley, Modern Applied Statistics with S, Fourth Edition, Springer,
2002.
H. Wickham and G. Grolemund, R for Data Science, O’Reilly, 2017.
I.A. Wood, P.M. Visscher and K.L. Mengersen, Classification based upon gene expression
data: bias and precision of error rates, Bioinformatics 23 (11), 1363-1370, 2007.