Introduction to Python - External Modules

In the previous two notebooks, we've explored some of the basic principles to code in Python which should give you the tools to create and use more complicated code which can be used to solve problems and speed up your existing workflow.

Python is what is known as an open-source language, in that it has developers from all around the world & from multiple organisations. Additionally, there are many, many, many different external, third-party modules which extend and enhance the existing functionality of Python.

Some of these modules are enormous, stable developments and can be used in all sorts of different use cases, while others are much more specific for one area. In this notebook, we will look at a few of the most useful modules which are required for a lot of the GIS specific work we look at in later notebooks.

We will talk about three primary modules in this notebook:

  1. Numpy
  2. Matplotlib
  3. Xarray

These should hand you most of the tools to access and use a datacube and perform other GIS operations later on. There are some other modules used as well which won't be covered in as much detail, but they will be listed with links to their websites at the end of this notebook which you can look at if you want.


Modules

First, a word about modules. Modules can be imported into the top of scripts and notebooks such as this one, which then allow for the usage of the functions contained within them in the code. Effectively, importing a module references that entire module into your code, so any functions or methods within the module can be used by your code.

There are a number of different conventions used to import modules. For example, say you have a module called 'pymodule'. The simplest way to import this module would be simply to write:

import pymodule at the top of your script or notebook. Function in this module could then be called in a similar way to the list methods we looked at before. Some modules have well known abbreviations for import which are used in importing them, using the example of pymodule again, we'd write:

import pymodule as pym. Additionally, it is possible to call individual submodules or functions from within a larger module. Sometimes it can be useful to call in a submodule for speed as this means only part of a larger module needs to be imported. It isn't recommended to import just a single function from a module as the Python interpreter still has to scan the whole module for the function & it hardcodes that module into your script/notebook. In the code below, we will import all the modules needed to run the code below:

Numpy

Numpy (short for Numerical Python) is one of the most helpful modules for any sort of data analysis in Python. It has a suite of inbuild mathematical functions, from trivial one such as np.sqrt (used for calculating square roots) and np.mean (used for calculating the mean) to complex linear algebra functions used for transformations on matrices.

Additionally, it extends and enhances the existing Python list structures, introducing numpy arrays, which can be 1, 2 or n-dimensional arrays. These are useful data structures for querying and manipulating data, but they are built upon by other modules, such as xarray (which we will look at later) and pandas.

And finally, many other modules are built on top of the data array structure Numpy uses, due to its speed and ease of use - so gaining familiarity with Numpy is an important and useful skill to develop when programming with Python.

We will start with looking at basic Numpy arrays. These arrays contain slightly more metadata than inbuilt Python lists and are also easier to manipulate, especially to do mathematical calculations. They can be initialized as empty arrays, or in many cases, are the result of converting a Python list. Run the code box below to see examples of ways to initiate Numpy arrays:

Above were some examples of different ways of generating one dimensional (1D) Numpy arrays. It is possible to initiate a list of the size and shape required, without filling it will actual data values yet, using either the np.zeros() function or the np.ones() function to create a Numpy array comprising of only the number zero or one, respectively.

Additionally, it is possible to convert a normal list to a Numpy array, in cases where you already have a normal Python list but need to perform some calculations on it which might be better served by using Numpy. In this case, using the function np.array() converts this list to a Numpy array.

The most useful instances for using Numpy arrays are for using them as two dimensional (2D) arrays. These arrays can also be initialised very similar ways, which can be shown by running the code below:

In the above section of code, all the previously defined arrays have been able to be re-defined as 2D Numpy arrays, with the numbers in the square brackets denoting the shape of the arrays (in fact it would be possible to define a 1D array this way using np.ones([1,10]) for example, but it is easier in practise to just write the number of values.

There is a wide range of uses for Numpy arrays, and they can be used to speed up almost any numerical task in Python, but to write much more about the uses of Numpy arrays in this guide would make it very long. Other uses will arise in later notebooks in this series, so you will see more advanced Numpy usage there. In the meantime, the Numpy website has a lot of documentation regarding many of the functions if you'd like to see more what they can be used for.


Matplotlib

Matplotlib is the primary plotting library in Python, and allows for a wide range of plot types and styles to be used. It has recognisable and easy to use plotting functions, but there are auxiliary functions which allow for lots of customisation of plots if required.

We will look at a few different types of plot in this guide, with plots to make line graphs, scatter plots and 2D histogram type plots (which are often used for many plots of satellite & GIS data).

Let's make a very simple line plot using matplotlib. Say we want to make a plot of the line $x = y$, we need to create the lists or arrays corresponding to the x and y values at a number of points along the line, before then telling matplotlib to plot these values and finally display (or in some cases save) the plot. Run the code below to plot this graph.

This is a very basic plot, and there are a number of "keyword arguments" (also known as kwargs) which can be specified after specifying x & y in the plot arguments, such as the line style, size, colour etc. Additionally, further arguments can be specified using plt. after creating the plot, such as adding titles, axes titles, ticks, ticklabels, logarithmic axes, just to name a few. It is also possible to use plt.plot to change the plot from a line to a scatter graph, specifying that the keyword arguments "marker" is equal to '.'. Try this in the box below, copying the final line of the previous example but adding in the kwarg, marker='.'

While using plt.plot() can be a viable way to plot scatter graphs, the dedicated function plt.scatter() has a lot more functionality for doing so. From a basic viewpoint though, plotting a scatter graph is an almost identical process to plt.plot(), simply replacing the word plot with scatter, like so:

It's also possible to plot two different types of graph on top of each other. This is done simply by calling two plt functions one after the other. In the plot below, try plotting two graphs on top of each other, first by calling plt.plot() with x and y as arguments like above and then calling plt.scatter() with x_scat and y_scat.

Hopefully, you just managed to plot a straight line denoting $x=y$ and a box between 2.0 and 4.0 in the x axis and 1.0 and 4.5 in the y axis.

The final type of plot we shall explore in this notebook is plt.imshow() which is for plotting 2D datasets, and so is very useful indeed for plotting satellite or spatial data. In order to plot something, we first need a multi-dimensional array, of which the easiest way to make one is to make a numpy array. The code below allows creates a multidimensional numpy array using the Numpy randint function, which creates an array of random numbers between 0 and the specified value, in a shape defined by "size" (in this case a 6x6 matrix). This can then be plotted using plt.imshow().

Test your understanding of the processes we've described in this notebook by plotting the scatter plot over the imshow plot, making sure to call the arrays x_scat and y_scat as arguments to plt.scatter() and the array implot as an argument to the plt.imshow() function.


Xarray

Xarray extends the functionality of Numpy arrays, with additional metadata and capabilities, making them the idea data structure for dealing with satellite data in Python. They also have inbuilt capability to create plots, like plt.imshow() shown above.

It is also easier to handle multi-dimensional data with Xarray, such as containing multiple arrays from different days or containing data with several satellite bands. As a result it is much easier to analyse satellite data in Xarray structure and any queries to the Data Cube return Xarrays, so this is the preferred method of interacting with the Data Cube.

Let us first create an Xarray. Xarray has two primary data structures - DataArrays and Datasets. DataArrays are effectively Numpy arrays with additional information needing to be defined in creation, while Datasets hold multiple DataArrays, and as such can contain multiple variables within the same Dataset. In order to use the inbuilt plotting function, it is necessary to convert a Dataset to a DataArray or select a DataArray hidden within the overall Dataset. Below is an example of how to create a DataArray from Xarray:

There is quite a lot happening here, so let's break it down a bit.

First of all, a Numpy 3D array has been created using np.random.randint() as seen previously in the matplotlib examples. So we have a Numpy array containing three 4x4 matrices. In order to create an Xarray from this, the coordinates need to be specified for each of the three axes. In this example instance, each 4x4 array is a spatial array in the x and y directions, with each array being from a different time. This is the type of Xarray which the Data Cube will return.

Additionally, the "dims" need to be defined, which are labels for each of the coordinates, and is a list of strings defining what each type of coordinate is.

DataArrays are easily convertible into Datasets and vice-versa, using the to_dataset() and to_array() commands. For example, test_array can be converted to a Dataset as a variable within the dataset using the command in the box below:

The above dataset is exactly the format of data from the Data Cube. Getting data from the Data Cube is done in a slightly different way, which will be covered in a later notebook. However, it is useful to see where these data structures come from and how they are constructed.

Now that we have created both a DataArray and a Dataset, we will look at how these data structures can be manipulated. For example, how do we actually look at the Numpy arrays underneath a Dataset in order to use Numpy functions?

The code below provides an example of this:

In this instance, we have selected the variable, test_array, using square brackets and its name as a string, to obtain just the data_array. However, this will the full DataArray with additional coordinates and dimensions as seen when we defined it. In order to get just the data values from the array, it is important to use the .values method to obtain the Numpy array.

From this point, we can use Numpy functions to perform transformations on the array. However, this isn't always necessary for simple functions, such as the mean and median, which can be performed straight onto Xarray data structures, as they have some Numpy functionality built into them.

In the below example, we have found the median value of the test_array dataset. Run the code to see this functionality:

You can also use the print statement to show the results in another way.

Try this out for yourself. In the empty box below, find the mean value of the dataset. The method is exactly same, except you are looking for the mean instead of the median.

Finally, we will investigate using Xarray to plot data, using the inbuilt Matplotlib capabilities Xarray uses. Going back to our Xarray Dataset, test_dataset, in order to plot an aspect of this we have to do a few separate things.

Unless we are plotting an RGB image, the Xarray plot function will only work if a 2D array is specified to be plotted. Additionally, only a DataArray can be plotted, not a Dataset, due to a DataArray only containing one variable. In order to plot an aspect of test_dataset then, we must first save it as a DataArray and then select just one of the 4x4 matrices (representing time in this example) which can be plotted as a 2D image. This functionality is shown below:

So we have successfully plotted the first 4x4 matrix, selecting just that one with the .isel(time=0) command. The plot itself occurs when calling the method .plot(). This is its simplest form and there are a number of kwargs which can be specified to enhance the plot. Finally the plot has been displayed using plt.show().