ONLINE COURSE – Model selection and model simplification (MSMS03) This course will be delivered live
15 June 2022 - 16 June 2022
Wednesday, April 13th, 2022
This is a ‘LIVE COURSE’ – the instructor will be delivering lectures and coaching attendees through the accompanying computer practicals via video link, so a good internet connection is essential.
TIME ZONE – GMT. However, all sessions will be recorded and made available, allowing attendees in different time zones to follow.
Please email email@example.com for full details or to discuss how we can accommodate you.
About This Course
This two-day course covers the important and general topics of statistical model building, model evaluation, model selection, model comparison, model simplification, and model averaging. These topics are vitally important to almost every type of statistical analysis, yet they are often poorly or incompletely understood. We begin by considering the fundamental issue of how to measure model fit and a model’s predictive performance, and discuss a wide range of other major model fit measurement concepts, such as likelihood, log likelihood, deviance, and residual sums of squares. We then turn to nested model comparison, particularly in general and generalized linear models, and their mixed effects counterparts. We then consider the key concept of out-of-sample predictive performance, and discuss over-fitting, or how excellent fits to the observed data can lead to very poor generalization performance. As part of this discussion of out-of-sample generalization, we introduce leave-one-out cross-validation and the Akaike Information Criterion (AIC). We then cover general concepts and methods related to variable selection, including stepwise regression, ridge regression, Lasso, and elastic nets. Following this, we turn to model averaging, which is arguably always a preferable alternative to model selection. Finally, we cover Bayesian methods of model comparison. Here, we describe how Bayesian methods allow us to easily compare completely distinct statistical models using a common metric. We also describe how Bayesian methods allow us to fit all the candidate models of potential interest, including cases where traditional methods fail.
This course is aimed at anyone who is interested in using R for data science or statistics. R is widely used in all areas of academic scientific research, and also widely used throughout the public and private sectors.
Last Updated – 05/05/2022
Duration – Approx. 15 hours
ECTS – Equal to 1 ECTS credit
Language – English
This course is aimed at anyone who is interested in advanced statistical modelling as it is practiced widely throughout academic scientific research, as well as widely throughout the public and private sectors.
The course will take place online using Zoom. On each day, the live video broadcasts will occur at the following UK local times: 12pm-2pm, 3pm-5pm, and 6pm-8pm.
All sessions will be video recorded and made available to all attendees as soon as possible, hopefully soon after each 2-hour session.
If some sessions are not at a convenient time due to different time zones, attendees are encouraged to join as many of the live broadcasts as possible. For example, attendees from North America may be able to join the live sessions from 3pm-5pm and 6pm-8pm, and then catch up with the 12pm-2pm recorded session once it is uploaded. By joining live sessions attendees will be able to benefit from asking questions and having discussions, rather than just watching prerecorded sessions.
At the start of the first day, we will ensure that everyone is comfortable with how Zoom works, and we’ll discuss the procedure for asking questions and raising comments.
Although not strictly required, using a large monitor or preferably even a second monitor will make the learning experience better, as you will be able to see my RStudio and your own RStudio simultaneously.
All the sessions will be video recorded, and made available immediately on a private video hosting website. Any materials, such as slides, data sets, etc., will be shared via GitHub.
Assumed computer background
Attendees should already have experience with R and be able to read CSV files, create simple plots, and manipulate data frames. Experience with basic R spatial packages, such as sp or raster, would be beneficial.
However, if you do not have R experience but already use GIS software, have a strong understanding of geographic data types, and have some programming experience, the course may also be appropriate for you.
Equipment and software requirements
Attendees of the course will need to use a computer on which RStudio can be installed. This includes Mac, Windows, and Linux, but not tablets or other mobile devices. Instructions on how to install and configure all the required software, which is all free and open source, will be provided before the start of the course. We will also provide time during the workshops to ensure that all software is installed and configured properly.
UNSURE ABOUT SUITABILITY? THEN PLEASE ASK firstname.lastname@example.org
Cancellations are accepted up to 28 days before the course start date subject to a 25% cancellation fee. Cancellations later than this may be considered, contact email@example.com. Failure to attend will result in the full cost of the course being charged. In the unfortunate event that a course is cancelled due to unforeseen circumstances a full refund of the course fees will be credited.
Classes from 10:00 to 18:00
Topic 1: Numerical programming with numpy. Although not part of Python’s official standard library, the numpy package is part of the de facto standard library for any scientific and numerical programming. Here we will introduce numpy, especially numpy arrays and their built-in functions (i.e., “methods”). We will also consider how to speed up numpy code using the Numba just-in-time compiler.
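As a minimal sketch of the kind of array-based computation this topic covers (the values are invented for illustration, and the Numba material is not shown), numpy arrays support vectorized arithmetic and carry their own methods:

```python
import numpy as np

# A small array and some of its built-in methods ("methods" attached to
# the ndarray object itself, as opposed to free functions like np.sum).
x = np.array([2.0, 4.0, 6.0, 8.0])

total = x.sum()         # 20.0
average = x.mean()      # 5.0
rescaled = x / x.max()  # element-wise division: [0.25, 0.5, 0.75, 1.0]

# Vectorized arithmetic avoids explicit Python loops:
y = x ** 2 + 1          # [5.0, 17.0, 37.0, 65.0]
```

Operations like these are applied to whole arrays at once, which is what makes numpy code both concise and fast compared with equivalent Python loops.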
Topic 2: Data processing with pandas. The pandas library provides the means to represent and manipulate data frames. Like numpy, pandas can be seen as part of the de facto standard library for data-oriented uses of Python. Here, we will focus on data wrangling, including selecting rows and columns by name and other criteria, applying functions to the selected data, and aggregating the data. For this, we will use pandas directly, and also helper packages like siuba.
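A minimal sketch of these wrangling steps in pandas itself (the data frame and its column names are hypothetical):

```python
import pandas as pd

# A small hypothetical data frame.
df = pd.DataFrame({
    "species": ["a", "a", "b", "b", "b"],
    "length":  [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Selecting rows by a criterion and columns by name:
long_ones = df.loc[df["length"] > 2.0, ["species", "length"]]

# Applying a function to selected data:
df["length_sq"] = df["length"].apply(lambda v: v ** 2)

# Aggregating by group:
means = df.groupby("species")["length"].mean()
```

Packages like siuba wrap the same operations in a dplyr-style piped syntax, but every step above is plain pandas.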
Classes from 10:00 to 18:00
Topic 3: Data Visualization. Python provides many options for data visualization. The matplotlib library is a low-level plotting library that allows for considerable control of the plot, albeit at the price of a considerable amount of low-level code. Based on matplotlib, and providing a much higher-level interface to the plot, is the seaborn library. This allows us to produce complex data visualizations with a minimal amount of code. Similar to seaborn is ggplot, which is a direct port of the widely used R-based visualization library.
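To illustrate the low-level matplotlib style (the data are invented, and the non-interactive Agg backend is used so no display is required), each element of the plot is specified explicitly:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend: no display required
import matplotlib.pyplot as plt

# Low-level matplotlib: axes, labels, and legend are all set by hand.
fig, ax = plt.subplots()
ax.plot([1, 2, 3], [1, 4, 9], marker="o", label="y = x^2")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
# In an analysis we would now save or display the figure, e.g. with
# fig.savefig("plot.png") (the filename is illustrative).
```

Higher-level libraries like seaborn build figures such as this, including grouping and faceting, from a single function call on a data frame.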
Topic 4: Statistical data analysis. In this section, we will describe how to perform widely used statistical analyses in Python. We will start with statsmodels, which provides linear and generalized linear models as well as many other widely used statistical models. We will also cover rpy2, which is an interface from Python to R. This allows us to access all of the power of R from within Python.
Topic 5: Symbolic mathematics. Symbolic mathematics systems, also known as computer algebra systems, allow us to algebraically manipulate and solve symbolic mathematical expressions. In Python, the principal symbolic mathematics library is sympy. This allows us to simplify mathematical expressions; compute derivatives, integrals, and limits; solve equations; algebraically manipulate matrices; and more.
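A minimal sympy sketch of each of these operations (the expressions are standard textbook examples):

```python
import sympy as sp

x = sp.symbols("x")

# Simplification: sin^2(x) + cos^2(x) reduces to 1.
simplified = sp.simplify(sp.sin(x) ** 2 + sp.cos(x) ** 2)

# Differentiation: d/dx sin(x) = cos(x).
derivative = sp.diff(sp.sin(x), x)

# Limits: lim_{x -> 0} sin(x)/x = 1.
limit = sp.limit(sp.sin(x) / x, x, 0)

# Equation solving: x^2 - 4 = 0 has roots -2 and 2.
solutions = sp.solve(x ** 2 - 4, x)
```

All results are exact symbolic objects, not floating-point approximations.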
Topic 6: Parallel processing. In this section, we will cover how to parallelize code to take advantage of multiple processors. While there are many ways to accomplish this in Python, here we will focus on the multiprocessing module in the standard library.
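A minimal multiprocessing sketch: a pool of worker processes maps a function over a list of inputs. (The "fork" start method used here is POSIX-only; on Windows or macOS the default "spawn" method would require an `if __name__ == "__main__"` guard.)

```python
import multiprocessing as mp

def square(n):
    """A CPU-bound toy task to run in worker processes."""
    return n * n

# "fork" start method (POSIX only) keeps this example self-contained.
ctx = mp.get_context("fork")
with ctx.Pool(processes=2) as pool:
    # map distributes the inputs across the workers and
    # returns the results in the original input order.
    results = pool.map(square, [1, 2, 3, 4])
```

For a trivial function like this the process overhead outweighs any gain; the pattern pays off for genuinely CPU-bound tasks.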
Classes from 10:00 to 18:00
Topic 1: Measuring model fit. In order to introduce the general topics of model evaluation, selection, and comparison, it is necessary to understand the fundamental issue of how we measure model fit. Here, the concept of the conditional probability of the observed data, or of future data, is of vital importance. This is intimately related to, though distinct from, the concept of likelihood and the likelihood function, which is in turn related to the concept of the log likelihood or deviance of a model. Here, we also show how these concepts are related to residual sums of squares, root mean square error (RMSE), and deviance residuals.
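As a minimal sketch of how these quantities relate (the observations and model predictions below are invented for illustration), the RSS, RMSE, and a normal-model log likelihood can be computed directly:

```python
import math

# Hypothetical observations and predictions from a fitted model.
y    = [2.1, 3.9, 6.2, 7.8]
yhat = [2.0, 4.0, 6.0, 8.0]
n = len(y)

# Residual sum of squares and root mean square error.
rss  = sum((yi - fi) ** 2 for yi, fi in zip(y, yhat))
rmse = math.sqrt(rss / n)

# Log likelihood under a normal model, evaluated at the maximum
# likelihood estimate of the variance, sigma^2 = RSS / n.
sigma2 = rss / n
loglik = -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)

# Deviance here is simply -2 times the log likelihood.
deviance = -2 * loglik
```

Note how all of these measures are functions of the same residuals, which is why they rank models consistently in this simple normal setting.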
Topic 2: Nested model comparison. In this section, we cover how to do nested model comparison in general linear models, generalized linear models, and their mixed effects (multilevel) counterparts. First, we precisely define what is meant by a nested model. Then we show how nested model comparison can be accomplished in general linear models with F tests, which we will also discuss in relation to R^2 and adjusted R^2. In generalized linear models, and in mixed effects models, we can accomplish nested model comparison using deviance-based chi-square tests via Wilks’s theorem.
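To make the F test concrete (all numbers below are hypothetical: RSS values from two nested models fitted to the same n observations, plus a total sum of squares for the R^2 calculations):

```python
# Null model with p0 parameters, fuller model with p1 > p0 parameters.
n, p0, p1 = 20, 2, 4
rss0, rss1 = 50.0, 30.0   # hypothetical residual sums of squares
tss = 60.0                # hypothetical total sum of squares

# F statistic for the nested comparison: the reduction in RSS per extra
# parameter, relative to the fuller model's residual variance estimate.
f_stat = ((rss0 - rss1) / (p1 - p0)) / (rss1 / (n - p1))

# R^2 and adjusted R^2 for the fuller model.
r2 = 1 - rss1 / tss
adj_r2 = 1 - (rss1 / (n - p1)) / (tss / (n - 1))
```

The F statistic would then be compared against an F distribution with (p1 - p0, n - p1) degrees of freedom to obtain a p-value.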
Topic 3: Out of sample predictive performance: cross validation and information criteria. In the previous sections, the focus was largely on how well a model fits or predicts the observed data. For reasons that will be discussed in this section, related to the concept of overfitting, this can be a misleading and possibly even meaningless means of model evaluation. Here, we describe how to measure out of sample predictive performance, which measures how well a model can generalize to new data. This is arguably the gold standard for evaluating any statistical model. A practical means to measure out of sample predictive performance is cross-validation, especially leave-one-out cross-validation. Leave-one-out cross-validation can, in relatively simple models, be approximated by the Akaike Information Criterion (AIC), which can be exceptionally simple to calculate. We will discuss how to interpret AIC values, and describe other related information criteria, some of which will be used in more detail in later sections.
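A minimal sketch of both ideas for simple linear regression (the data and the `fit_line` helper are invented for illustration): leave-one-out cross-validation refits the model n times, each time predicting the held-out point, while AIC is computed once from the full fit.

```python
import math

def fit_line(xs, ys):
    """Least-squares intercept and slope for simple linear regression."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.2, 1.9, 3.2, 3.9, 5.1]  # hypothetical data, roughly linear

# Leave-one-out CV: refit without point i, then predict point i.
loo_sq_errors = []
for i in range(len(x)):
    a, b = fit_line(x[:i] + x[i+1:], y[:i] + y[i+1:])
    loo_sq_errors.append((y[i] - (a + b * x[i])) ** 2)
loo_mse = sum(loo_sq_errors) / len(x)

# AIC from the full fit: -2 log likelihood + 2k, with the normal-model
# log likelihood at sigma^2 = RSS / n, and k = 3 parameters
# (intercept, slope, sigma^2).
a, b = fit_line(x, y)
rss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
n, k = len(x), 3
aic = n * math.log(rss / n) + n * (math.log(2 * math.pi) + 1) + 2 * k
```

The out-of-sample error (`loo_mse`) is always at least as large as the in-sample error here, which is exactly the gap that AIC's penalty term approximates.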
Classes from 10:00 to 18:00
Topic 4: Variable selection. Variable selection is a type of nested model comparison. It is also one of the most widely used model selection methods, and variable selection of some kind is almost always done routinely in all data analyses. Although we will also have discussed variable selection as part of Topic 2 above, we discuss the topic in more detail here. In particular, we cover stepwise regression (and its limitations), all-subsets methods, ridge regression, Lasso, and elastic nets.
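To illustrate the shrinkage idea behind ridge regression in the simplest possible setting (the data are invented, and centred so no intercept is needed): with a single centred predictor, the ridge estimate has the closed form Sxy / (Sxx + lambda), so the penalty lambda shrinks the ordinary least-squares slope Sxy / Sxx towards zero.

```python
# Hypothetical centred data, roughly following y = 2x.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-4.1, -1.9, 0.1, 2.0, 3.9]

sxx = sum(xi * xi for xi in x)
sxy = sum(xi * yi for xi, yi in zip(x, y))

# Ordinary least squares, and ridge estimates for increasing penalties.
ols_slope = sxy / sxx
ridge_slopes = {lam: sxy / (sxx + lam) for lam in (0.0, 1.0, 10.0)}
```

Lasso and elastic nets apply a different penalty (absolute rather than squared coefficients), which can shrink coefficients exactly to zero and thereby perform variable selection automatically.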
Topic 5: Model averaging. Rather than selecting one model from a set of candidates, it is arguably always better to perform model averaging, using all the candidate models, weighted by their predictive performance. We show how to perform model averaging using information criteria.
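A minimal sketch of information-criterion-based averaging (the AIC values and per-model predictions are hypothetical): each model's Akaike weight is its relative likelihood exp(-delta/2), normalized over the candidate set, and the averaged prediction is the weighted sum of the models' predictions.

```python
import math

# Hypothetical AIC values for three candidate models.
aics = {"m1": 100.0, "m2": 102.0, "m3": 110.0}

# Akaike weights: exp(-(AIC_i - AIC_min)/2), normalized to sum to 1.
best = min(aics.values())
rel = {m: math.exp(-(a - best) / 2) for m, a in aics.items()}
total = sum(rel.values())
weights = {m: r / total for m, r in rel.items()}

# Model-averaged prediction from hypothetical per-model predictions.
preds = {"m1": 5.0, "m2": 5.4, "m3": 7.0}
averaged = sum(weights[m] * preds[m] for m in aics)
```

Note that no single model is discarded: poorly supported models simply receive weights near zero.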
Topic 6: Bayesian model comparison methods. Bayesian methods afford much greater flexibility and extensibility for model building than traditional methods. They also allow us to easily and directly compare completely unrelated statistical models of the same data using information criteria such as WAIC and LOOIC. Here, we will also discuss how Bayesian methods allow us to fit all models of potential interest to us, including cases where model fitting is computationally intractable using traditional methods (e.g., where optimization convergence fails). This therefore allows us to consider all models of potential interest, rather than just focusing on a limited subset where the traditional fitting algorithms succeed.
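To make WAIC less abstract, here is a minimal sketch of its computation (everything is invented: a tiny data set, a normal model with known sigma, and random draws standing in for a real posterior sample). WAIC combines the log pointwise predictive density (lppd) with an effective-parameter penalty given by the posterior variance of the pointwise log likelihoods.

```python
import math
import random
import statistics

random.seed(1)

# Hypothetical data; normal model with known sigma = 1 and unknown mean mu.
data = [0.1, -0.2, 0.3, 0.05]
sigma = 1.0

# Stand-in for S draws from the posterior of mu (a real analysis would
# use MCMC samples from, e.g., Stan).
draws = [random.gauss(0.0, 0.2) for _ in range(2000)]

def log_norm_pdf(y, mu, sd):
    return -0.5 * math.log(2 * math.pi * sd ** 2) - (y - mu) ** 2 / (2 * sd ** 2)

lppd, p_waic = 0.0, 0.0
for y in data:
    logliks = [log_norm_pdf(y, mu, sigma) for mu in draws]
    # lppd term: log of the posterior-mean likelihood (log-sum-exp trick).
    m = max(logliks)
    lppd += m + math.log(sum(math.exp(l - m) for l in logliks) / len(draws))
    # Penalty term: posterior variance of the pointwise log likelihood.
    p_waic += statistics.pvariance(logliks)

waic = -2 * (lppd - p_waic)
```

Because WAIC (like LOOIC) depends only on pointwise log likelihoods evaluated over posterior draws, it can be computed in exactly the same way for completely unrelated models of the same data, which is what makes cross-model comparison straightforward.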
Dr. Mark Andrews
Senior Lecturer, Psychology Department, Nottingham Trent University, England
FREE 2 day Introduction to statistics using R and Rstudio (FIRR)
Introduction to generalised linear models using R and Rstudio (IGLM)
Introduction to mixed models using R and Rstudio (IMMR)
Nonlinear regression using generalized additive models in R and Rstudio (GAMR)
Introduction to hidden Markov and state space models using R and Rstudio (HMSS)
Introduction to machine learning and deep learning using R and Rstudio (IMDL)
Model selection and model simplification using R and Rstudio (MSMS)
Data visualization using ggplot2 in R and Rstudio (DVGG)
Data wrangling using R and Rstudio (DWRS)
Reproducible data science using rmarkdown, git, R packages, docker, make & drake, and other tools (RDRP)
Introduction/fundamentals of Bayesian data analysis using R and Rstudio (FBDA)
Bayesian data analysis using R and Rstudio (BADA)
Bayesian approaches to regression and mixed effects models using R and brms (BARM)
Introduction to Stan for Bayesian data analysis (ISBD)
Introduction to Unix (UNIX01)
Introduction to Python (PYIN03)
Introduction to scientific, numerical, and data analysis programming in Python (PYSC03)
Machine learning and deep learning using Python (PYML03)
Python for data science, machine learning, and scientific computing (PDMS02)
Mark Andrews is a Senior Lecturer in the Psychology Department at Nottingham Trent University in Nottingham, England. Mark is a graduate of the National University of Ireland and obtained an MA and PhD from Cornell University in New York. Mark’s research focuses on developing and testing Bayesian models of human cognition, with particular focus on human language processing and human memory. Mark’s research also focuses on general Bayesian data analysis, particularly as applied to data from the social and behavioural sciences. Since 2015, he and his colleague Professor Thom Baguley have been funded by the UK’s ESRC funding body to provide intensive workshops on Bayesian data analysis for researchers in the social sciences.