Data visualisation and manipulation using Python (DVMP01)
4 December 2017 - 8 December 2017
£600.00 - £980.00
One of the strengths of the Python language is the availability of mature, high-quality libraries for working with scientific data. Integration between the most popular libraries has led to the concept of a “scientific Python stack”: a collection of packages which are designed to work well together. In this workshop we will see how to leverage these libraries to efficiently work with and visualize large volumes of data.
This workshop is aimed at researchers and technical workers with a background in biology and a basic knowledge of Python (if you’ve taken the Introductory Python course then you have the Python knowledge; if you’re not sure whether you know enough Python to benefit from this course then just drop us an email).
We offer two packages:
• COURSE ONLY – Includes lunch and refreshments.
• ALL INCLUSIVE – Includes breakfast, lunch, dinner, refreshments, minibus to and from meeting point and accommodation. Accommodation is multiple occupancy (max 2 or 4 people and subject to availability) single sex en-suite rooms. Arrival Sunday 3rd December and departure Friday 8th December PM.
To book ‘COURSE ONLY’ or ‘ALL INCLUSIVE’, please scroll to the bottom of this page.
Other payment options are available; please email email@example.com
Cancellation policy: Cancellations are accepted up to 28 days before the course start date subject to a 25% cancellation fee. Cancellations later than this may be considered; contact firstname.lastname@example.org. Failure to attend will result in the full cost of the course being charged. In the unfortunate event that PR~statistics must cancel this course due to unforeseen circumstances, a full refund for the course will be credited. However, PR~statistics cannot be held responsible for any travel fees, accommodation or other expenses incurred by you as a result of the cancellation.
The workshop is delivered over nine half-day sessions. Each session consists of roughly a one-hour lecture followed by two hours of practical exercises, with breaks at the organiser’s discretion. Each session uses examples and exercises that build on material from the previous one, so it’s important that students attend all sessions. The last session will be kept free for students to work on their own datasets with the assistance of the instructor.
Assumed quantitative knowledge
Students should have enough biological/bioinformatics background to appreciate the example datasets.
Assumed computer background
Students should also have some basic Python experience (the Introduction to Python course will fulfill these requirements). Students should be familiar with the use of lists, loops, functions and conditions in Python and have written at least a few small programs from scratch.
Equipment and software requirements
A laptop/personal computer with Python installed.
Students will require the scientific Python stack to be installed on their laptops before attending; instructions for this will be sent out prior to the course.
It is essential that you come with all necessary software and packages already installed (you will be sent a list of packages prior to the course), as internet access may not always be available.
UNSURE ABOUT SUITABILITY? THEN PLEASE ASK email@example.com
Sunday 3rd – Meet at Margam Discovery Centre approx. 18:30
Monday 4th – Classes from 09:00 to 17:00
Module 1: Introduction and datasets
Jupyter (formerly IPython) is a programming environment that is rapidly becoming the de facto standard for scientific data analysis. In this session we’ll learn why Jupyter is so useful, covering its ability to mix notes and code, to render inline plots, charts and tables, to use custom styles and to create polished web pages. We’ll also take a look at the datasets that we’ll be investigating during the course and discuss the different types of data we encounter in bioinformatics work.
Module 2: Introduction to pandas
In this session we introduce the first part of the scientific Python stack: the pandas data manipulation package. We’ll learn about DataFrames (the core data structure that much of the rest of the course will rely on) and how they allow us to quickly select, sort, filter and summarize large datasets. We’ll also see how to extend existing DataFrames by writing functions to create new columns, as well as how to deal with common problems like missing or inconsistent values in datasets. We’ll get our first look at data visualization by using pandas’ built-in plotting ability to investigate basic properties of our datasets.
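As a rough illustration of the kind of operations this session covers, here is a minimal sketch. The table, its column names and its values are hypothetical, made up for this example, and are not one of the course datasets.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one row per gene (not a course dataset)
genes = pd.DataFrame({
    "gene": ["abc1", "def2", "ghi3", "jkl4"],
    "species": ["mouse", "mouse", "fly", "fly"],
    "length": [1200, 450, np.nan, 3100],
    "gc_count": [600, 180, 400, 1240],
})

# Handle a missing value by filling it with the median length
genes["length"] = genes["length"].fillna(genes["length"].median())

# Extend the DataFrame with a new column computed from existing ones
genes["gc_fraction"] = genes["gc_count"] / genes["length"]

# Filter rows, then sort the result
long_genes = genes[genes["length"] > 1000].sort_values("length", ascending=False)
```

Filling with the median is only one possible choice here; dropping incomplete rows with `dropna()` may be more appropriate depending on the dataset.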
Tuesday 5th – Classes from 09:00 to 17:00
Module 3: Grouping and pivoting with pandas
This session continues our look at pandas with advanced uses of DataFrames that allow us to answer more complicated questions. We’ll look at two very powerful tools: grouping, which allows us to aggregate information in datasets, and pivoting/stacking, which allows us to flexibly rearrange data (a key step in preparing datasets for visualization). In this session we’ll also go into more detail about pandas’ indexing system.
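To give a flavour of grouping and pivoting/stacking, here is a small sketch. The measurements and column names are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical long-format measurements (not a course dataset)
df = pd.DataFrame({
    "species": ["mouse", "mouse", "fly", "fly", "mouse", "fly"],
    "tissue": ["liver", "brain", "liver", "brain", "liver", "brain"],
    "expression": [10.0, 4.0, 6.0, 8.0, 14.0, 2.0],
})

# Grouping: aggregate expression per species
mean_by_species = df.groupby("species")["expression"].mean()

# Pivoting: one row per species, one column per tissue
wide = df.pivot_table(index="species", columns="tissue",
                      values="expression", aggfunc="mean")

# Stacking: back to a long Series indexed by (species, tissue)
long_again = wide.stack()
```

The wide table is often the convenient shape for heatmaps, while the long shape tends to suit plots that split data by category.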
Module 4: Advanced manipulation with pandas
In this final session on the pandas library we’ll look at a few common types of data manipulation: binning data (very useful for working with time series), carrying out principal component analysis, and creating networks. We’ll also cover some features of pandas designed for working with specific types of data like timestamps and ordered categories.
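Binning, timestamps and ordered categories might look something like the sketch below. The observations, bin edges and labels are assumptions made for this example.

```python
import pandas as pd

# Hypothetical daily observations (not a course dataset)
obs = pd.DataFrame({
    "date": pd.to_datetime(["2017-01-02", "2017-01-03",
                            "2017-01-10", "2017-01-11"]),
    "count": [3, 5, 2, 8],
})

# Bin a time series into weekly totals using the timestamp index
weekly = obs.set_index("date")["count"].resample("W").sum()

# Bin numeric values into labelled intervals; pd.cut returns an
# ordered categorical by default when labels are given
obs["level"] = pd.cut(obs["count"], bins=[0, 2, 5, 10],
                      labels=["low", "medium", "high"])

# Ordered categories support comparisons such as "at least medium"
at_least_medium = obs[obs["level"] >= "medium"]
```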
Wednesday 6th – Classes from 09:00 to 17:00
Module 5: Introduction to seaborn
This session introduces the seaborn charting library by showing how we can use it to investigate relationships between different variables in our datasets. Initially we concentrate on showing distributions with histograms, scatter plots and regressions, as well as a few more exotic chart types like hexbins and KDE plots. We also cover heatmaps, in particular looking at how they lend themselves to displaying the type of aggregate data that we can generate with pandas.
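As a sketch of how aggregate data from pandas can feed a seaborn heatmap, consider the following. The data is hypothetical, and the `Agg` backend line is only there so the example runs without a display; in a notebook it can be omitted.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit inside a notebook
import pandas as pd
import seaborn as sns

# Hypothetical long-format measurements (not a course dataset)
df = pd.DataFrame({
    "species": ["mouse", "mouse", "fly", "fly", "mouse", "fly"],
    "tissue": ["liver", "brain", "liver", "brain", "liver", "brain"],
    "expression": [10.0, 4.0, 6.0, 8.0, 14.0, 2.0],
})

# Aggregate with pandas into a species-by-tissue table of means
table = df.pivot_table(index="species", columns="tissue",
                       values="expression", aggfunc="mean")

# Draw the table as a heatmap, writing each value into its cell
ax = sns.heatmap(table, annot=True, cmap="viridis")
ax.set_title("Mean expression")
```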
Module 6: Categories in seaborn
This session is devoted to seaborn’s primary use case: visualizing relationships across multiple categories in complex datasets. We see how we can use colour and shape to distinguish categories in single plots, and how these features work together with the pandas tools we have already seen to allow us to very quickly explore a dataset. We continue by using seaborn to build small multiple or facet plots, separating categories by rows and columns. Finally, we look at chart types that are designed to show distributions across categories: box and violin plots, and the more exotic swarm and strip plots.
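A minimal sketch of distinguishing categories by colour and marker shape, and of splitting a plot into facets, could look like this. The measurements and column names are hypothetical, and `relplot` is just one of several seaborn functions that can build facet plots.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit inside a notebook
import pandas as pd
import seaborn as sns

# Hypothetical measurements for two species and two sexes
df = pd.DataFrame({
    "length": [1.0, 2.0, 3.0, 4.0, 1.5, 2.5, 3.5, 4.5],
    "mass": [2.1, 3.9, 6.2, 8.1, 1.0, 2.2, 2.9, 4.1],
    "species": ["a"] * 4 + ["b"] * 4,
    "sex": ["f", "m", "f", "m", "f", "m", "f", "m"],
})

# hue and style map sex to colour and marker shape;
# col draws one facet (small multiple) per species
g = sns.relplot(data=df, x="length", y="mass",
                hue="sex", style="sex", col="species")
```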
Thursday 7th – Classes from 09:00 to 17:00
Module 7: Customization with seaborn
For the final session on seaborn, we go over some common types of customization that can be tricky. To achieve very fine control over the style and layout of our plots, we’ll learn how to work directly with axes and chart objects to implement things like custom heatmap labels, log axis scales, and sorted categories.
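To illustrate working directly with the axes object, here is a sketch that sorts categories and applies a log scale. The values are hypothetical and chosen only to span several orders of magnitude.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit inside a notebook
import pandas as pd
import seaborn as sns

# Hypothetical values spanning several orders of magnitude
df = pd.DataFrame({
    "species": ["fly", "fly", "mouse", "mouse", "yeast", "yeast"],
    "genome_size": [1.4e8, 1.8e8, 2.7e9, 2.5e9, 1.2e7, 1.3e7],
})

# Sort categories by their median value rather than alphabetically
order = (df.groupby("species")["genome_size"]
           .median().sort_values().index.tolist())

# seaborn returns a matplotlib Axes, which we can customize directly
ax = sns.boxplot(data=df, x="species", y="genome_size", order=order)
ax.set_yscale("log")
ax.set_ylabel("Genome size (log scale)")
```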
Module 8: Matplotlib
In the final teaching session, we look at the library that both pandas and seaborn rely on for their charting tools: matplotlib. We’ll see how by using matplotlib directly we can do things that would be impossible in pandas or seaborn, such as adding custom annotations to our charts. We’ll also look at using matplotlib to build completely new, custom visualizations by combining primitive shapes.
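A custom annotation and a primitive shape might be added along these lines. This is only a sketch with made-up data points, and the label text and rectangle position are arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit inside a notebook
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle

# Made-up data points for illustration
x = [1, 2, 3, 4, 5]
y = [2, 3, 9, 4, 5]

fig, ax = plt.subplots()
ax.plot(x, y, marker="o")

# Annotate the highest point with a label and an arrow
peak = y.index(max(y))
ax.annotate("peak", xy=(x[peak], y[peak]),
            xytext=(x[peak] + 0.5, y[peak] + 0.5),
            arrowprops={"arrowstyle": "->"})

# Add a primitive shape: a semi-transparent rectangle
ax.add_patch(Rectangle((1.5, 0), 2, 1, color="grey", alpha=0.5))
```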
Friday 8th – Classes from 09:00 to 16:00
Module 9: Data workshop
The two sessions on the final day are set aside for a data workshop. Students can practice applying the tools they’ve learned to their own datasets with the help of an instructor, or continue to work on exercises from the previous day. There may also be time for some demonstrations of topics of particular interest, such as interactive visualization tools and animations.