More Advanced Data Cube Usage

NDVI Anomaly Map

This notebook will allow you to create a pasture anomaly map comparing current pasture conditions to the long-term average. This will be done using MODIS data, but the same approach works with any sensor that has a red and a near-infrared band.

This notebook requires a number of external Python modules, which are imported at the top of the notebook, as is convention. Some of these modules (numpy, matplotlib, datacube) we have come across before. The other two are gdal, a Python library used to handle geographic data, particularly satellite data, and dcFunctions, a module written by the SIBELIUs team which contains many functions useful for manipulating data in a Data Cube. Run the code below to import these modules.
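A minimal import cell along these lines is what is expected; the exact set in your notebook may differ slightly (for example, some environments expose GDAL as `import gdal` rather than through `osgeo`).

```python
import numpy as np                 # numerical arrays
import matplotlib.pyplot as plt    # plotting
import datacube                    # Open Data Cube API
from osgeo import gdal             # GDAL's Python bindings for geographic data
import dcFunctions                 # SIBELIUs helper module for Data Cube manipulation
```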

As before, we need to initialise a Data Cube in order to have access to the Kyrgyz Data Cube (KDC), and this can be done as follows:
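A sketch of the initialisation is shown below; the app name is only a label used for logging, so the value chosen here is illustrative.

```python
# Initialise a connection to the Kyrgyz Data Cube (KDC)
dc = datacube.Datacube(app='ndvi_anomaly')
```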

For many notebooks, it is useful to write all the code you would like to compute and then run the entire notebook in one go, rather than running each cell individually. You may also want to change a few parameters each time you run the code, so it can be useful to gather all the parameters you might change into a single cell, for simplicity's sake.

In this instance, it is useful to define the start and end date of the anomaly time period, as well as the name of the GeoTIFF which will be saved out at the end of the notebook. To do this we define some variables which can then be reused later in the notebook.
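A parameters cell might look something like the sketch below; the dates and filename are purely illustrative and should be replaced with the period and output name you want.

```python
# Parameters to change between runs (values here are examples only)
start_date = '2019-06-01'                    # start of the anomaly period in the current year
end_date = '2019-06-30'                      # end of the anomaly period in the current year
output_filename = 'ndvi_anomaly_2019.tif'    # name of the GeoTIFF saved at the end
```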

In previous notebooks, the use of a "query" was discussed to specify what data needs to be loaded from the Kyrgyzstan Data Cube. This is a dictionary where a key, such as lat, is attached to a two-item list specifying the minimum and maximum latitude values to be loaded. Let's go through some of the fields which can be defined in this query:

Query arguments

Let us now try to specify a query for all of Kyrgyzstan. As the Data Cube covers the extent of Kyrgyzstan, we don't need to specify an 'x' or 'y' field; by default it will load all available data. Similarly, we have defined our time period above, so that field is not required in the query either.

We will be loading data which is already stored in the Data Cube as cloud-free 10-daily NDVI, so our measurement is just ['ndvi']. This data is stored at 250 m resolution in UTM Zone 43N, which corresponds to EPSG:32643, and we will want to load it at its native resolution and in its native coordinate system.

Finally, we might be loading in data from many time periods. This would be a very large amount of data and needs to be handled carefully to avoid the Data Cube crashing. For this we can use the 'dask_chunks' parameter, which will load the structure of the data and tell us what data is in the Data Cube, but not actually load the values in the datasets until we ask it to. There is more information about dask in the official documentation.

Try creating this query below. Check the hidden example below if you get stuck.
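As a rough guide (not the hidden solution itself), a query covering the points above could look like the sketch below; the dask chunk sizes are an assumption and can be tuned.

```python
# Query for all of Kyrgyzstan: no 'x'/'y' extents, no 'time' (handled separately)
query = {
    'measurements': ['ndvi'],            # the cloud-free 10-daily NDVI band
    'resolution': (-250, 250),           # native 250 m resolution (y is negative by convention)
    'output_crs': 'EPSG:32643',          # UTM Zone 43N, the native projection
    'dask_chunks': {'time': 1},          # lazy loading: structure only, values loaded on demand
}
```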

Now that a query has been defined, we can load the data that makes up the historical, long-term average and the current data. To do this, we load the historical data for the entire time period held in the Data Cube using our query, while the current data is loaded using the start_date and end_date variables defined above.
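A sketch of the two load calls is shown below. The product name is hypothetical; replace it with the name of the NDVI product in your Data Cube. The crs is also kept at this point, as it will be needed when writing the GeoTIFF at the end.

```python
product_name = 'ndvi_10daily'   # hypothetical: use the NDVI product name in your Data Cube

# Historical archive, loaded lazily (dask) for the whole time span in the Data Cube
historical_data = dc.load(product=product_name, **query)

# Current period only, using the start_date and end_date parameters defined above
current_data = dc.load(product=product_name, time=(start_date, end_date), **query)

# Keep the CRS of the loaded data for writing out the GeoTIFF later
crs = historical_data.crs
```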

historical_data contains all the indices for Kyrgyzstan at all times of the year. In order to keep only the indices which correspond to the time period we are interested in each year, we need to use the load_periodic_data function in dcFunctions. This function requires the start and end date of the time period in the first year of data you would like to analyse (in this case going back to 2013, as that is when Sentinel / Landsat data first appears in the Data Cube), and additionally the year the current data comes from (in this case 2019, as we want to produce an anomaly image for 2019).
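A sketch of the call is below, but note that the argument names are assumptions based on the description above; check the docstring of dcFunctions.load_periodic_data for the real signature.

```python
# Keep only the same period of each year from 2013 onwards
# (argument names here are assumed, not taken from the dcFunctions documentation)
historical_data = dcFunctions.load_periodic_data(
    historical_data,
    start_date='2013-06-01',   # start of the period in the first year of data
    end_date='2013-06-30',     # end of the period in the first year of data
    current_year=2019,         # the year the anomaly image is being produced for
)
```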

Now that historical_data contains only data from the specific time span we are interested in, we can load the actual values into memory. Earlier, when we added the dask_chunks parameter to our query, we only loaded the overall structure of the data rather than the values themselves. This was to avoid the Data Cube crashing because of how much memory would be required. Now it is possible to load the smaller datasets into memory, which can be done using the .compute() method on the dataset, as shown below.
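The variable names below follow the earlier sketches; .compute() is the standard dask/xarray method that triggers the actual read.

```python
# Pull the (now much smaller) datasets into memory
historical_data = historical_data.compute()
current_data = current_data.compute()
```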

The next step is to scale the data from the uint16 datatype it is saved in (values between 0 and 65535) to more recognisable float64 NDVI values (between -1.0 and 1.0), so that the anomaly products can be made later on. This uses the scal_2_tru function in dcFunctions.
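The exact signature of scal_2_tru is not shown here, so the call below is only a sketch of the idea: rescaling each dataset from stored uint16 values to true NDVI floats.

```python
# Assumed usage of the dcFunctions scaling helper (check its docstring for the real signature)
historical_data = dcFunctions.scal_2_tru(historical_data)
current_data = dcFunctions.scal_2_tru(current_data)
```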

At the moment, the historical dataset contains an individual layer for each year, and so is a 3D dataset. In order to create the anomaly product, it must be collapsed into a 2D dataset, which we do by taking the mean NDVI along the time axis. This is done by calling the .mean() method on the dataset, as shown below.
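A sketch using the xarray .mean() method is shown below; the variable names follow the earlier sketches, and the current period is averaged in the same way in case it contains more than one time step.

```python
# Collapse the historical stack from 3D (time, y, x) to 2D (y, x) by averaging over time
historical_mean = historical_data.ndvi.mean(dim='time')

# Do the same for the current period so both datasets are 2D
current_mean = current_data.ndvi.mean(dim='time')
```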

Finally, we are in a position to create the anomaly product, as we have a 2D dataset of NDVI values from the current time period and a 2D dataset of NDVI values representing the average NDVI for that time period from previous years.

So to create the anomaly product, we subtract the historical data from the current values and then divide by the historical data. This can be represented by the following equation:

$\text{Anomaly} = \frac{\text{Current Data} - \text{Historical Data}}{\text{Historical Data}}$

Try to create the anomaly dataset below. Look at the solution if you get stuck or get an error.
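As a guide (not the hidden solution itself), the equation above translates directly into xarray arithmetic on the two 2D datasets from the earlier sketches:

```python
# Relative NDVI anomaly: (current - historical mean) / historical mean
anomaly = (current_mean - historical_mean) / historical_mean
```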

Now that we have an anomaly dataset, it is useful to visualise it to check that it looks as we would expect an anomaly to look. We can use matplotlib to plot the data in the same way as in previous notebooks, using the RdYlGn colourmap so that low (negative) anomalies appear red and high (positive) anomalies appear green.
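A minimal plotting sketch is shown below; the colour limits of ±1 are an assumption chosen to centre the colourmap on zero.

```python
plt.figure(figsize=(12, 6))
plt.imshow(anomaly, cmap='RdYlGn', vmin=-1, vmax=1)   # red = below average, green = above average
plt.colorbar(label='NDVI anomaly')
plt.title('NDVI anomaly for the selected period')
plt.show()
```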

In order to save out the anomaly product as a GeoTIFF, it needs to be scaled so that nodata values are handled correctly. To save the anomaly image in int8 format (values between -128 and 127), any values that fall outside this range are set to a nodata value (which in this case is 127). The anomaly process shouldn't produce values above or below 100 anyway, so only bad pixels should fall outside this range.
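A sketch of this step is below, assuming the fractional anomaly is first scaled to a percentage; the exact scaling and clipping used in the notebook may differ.

```python
# Scale to a percentage, push anything that will not fit in int8 to the nodata value (127),
# then cast down to int8 ready for writing out
anomaly_pct = anomaly * 100                                                   # roughly -100 to +100
anomaly_pct = anomaly_pct.where(np.isfinite(anomaly_pct), 127)                # NaN / inf -> nodata
anomaly_pct = anomaly_pct.where((anomaly_pct > -128) & (anomaly_pct < 127), 127)  # bad pixels -> nodata
anomaly_int8 = anomaly_pct.astype('int8')
```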

The last thing to do is to save the corrected anomaly product as a GeoTIFF, using the write_geotiff function in dcFunctions. This function has a few defaults that need changing in this instance: the default data type for the image is uint16 rather than the int8 we want, so that is specified as a parameter. In the process of creating the anomaly image, the Coordinate Reference System (crs) of the image was lost from the xarray dataset. Writing out a GeoTIFF requires this attribute, so we set it as shown below using the crs variable taken from the datasets that were initially loaded.
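A sketch of the final step is below; the argument names of dcFunctions.write_geotiff are assumptions, so check its docstring for the real signature.

```python
# Re-attach the crs saved when the data was first loaded, then write the GeoTIFF
anomaly_int8.attrs['crs'] = crs
dcFunctions.write_geotiff(output_filename, anomaly_int8, dtype='int8')   # assumed signature
```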

This only works because the crs did not change while the anomaly image was being created, so it was simply a case of adding the crs object back in. If the crs had changed during the creation process, a new crs object would need to be specified.