GeoPandas is a python package that is designed to make working with spatial data in python easier. It adds to the functions used by pandas to enable it to work with spatial data.
Documentation: http://geopandas.org/index.html
We're going to work with data sets for district boundaries and miscellaneous population settlements.
Topics:
import geopandas
First we'll read in the dataset to a geodataframe. It's important to specify the encoding so that cyrillic characters are displayed properly.
.head()
shows the top part of the dataframe so we can easily view a part of the dataset without loading it completely. Note the geometry field that contains the values that make up the polygons
district_gdf = geopandas.read_file("data/kgz_admbnda_adm2_moes_20181119.shp", encoding='utf-8')
# view the attribute information
district_gdf.head()
We can view the spatial data using .plot()
# view the geometry
district_gdf.plot()
Coordinate systems are important when dealing with spatial data. Here we check the coordinate system of the dataset.
# view the coordinate reference system
district_gdf.crs
Most of the time you won’t have to set a projection as most data will include projection information. If it doesn't you will need to specify it. It uses the epsg code for the CRS. This website gives codes for many projections http://epsg.io/4326.
# set the coordinate reference system
district_gdf.crs="EPSG:4326"
Note that this is different to reprojecting the data. If we want to reproject the data into a different coordinate system we need to use .to_crs()
and specify the coordinate system we want to convert the data to.
# reproject the coordinate system to another coordinate system
district_utm_gdf = district_gdf.to_crs("EPSG:32643")
# view the reprojected geometry
district_utm_gdf.plot()
Accessing columns can be done in the same way as pandas.
# One way of accessing columns
district_gdf['ADM2_KY']
This is another way of speciying a column.
# alternate method for accessing a column
district_gdf.ADM2_KY
We can select data using attributes. The example below finds features where the Shape_Area is greater than 0.7. Other operators can be used.
# select by attribute
selected_gdf = district_gdf[district_gdf['Shape_Area'] > 0.7]
selected_gdf
Plot the selected data
selected_gdf.plot()
We can also read in csv files and create point data using the x, y coordinates. To do this we'll first use pandas to read the csv using read_csv
.
# read in the csv file to a pandas dataframe
import pandas
settlements_df = pandas.read_csv("data/kgz_misc_popp.csv")
settlements_df.head()
Here we use geopandas.GeoDataFrame()
to create a new geodataframe using the settlements dataframe and add the geometry. We include points_from_xy()
when we specify the geometry. This takes the latitude and longitude columns and uses them to create point objects that are stored in the geometry field of the geodataframe.
http://geopandas.org/reference.html#geodataframe
It's important to make sure that the coordinate system is set so that the geodataframe can be used in spatial operations.
# convert to geodataframe using the latitude and logitude columns
settlements_gdf = geopandas.GeoDataFrame(settlements_df,
geometry=geopandas.points_from_xy(settlements_df.longitude,
settlements_df.latitude))
# make sure the coordinate system is set
settlements_gdf.crs="EPSG:4326"
settlements_gdf.head()
To the right you can see the geometry column that has been created
Next, we'll write out the new settlements stations geodataframe to shapefile using to_file()
.
# write to shapefile, making sure to set the encoding method to utf-8
settlements_gdf.to_file("data/settlements_xy.shp", encoding='utf-8')
We can view the settlements using the district boundaries as a background.
First plot the district boundaries and save the axes to ax
. Note that some keyword arguments are used to specify the fill colour (color='white'
) and the colour of the edges (edgecolor='black'
).
Then we can include ax
in the arguments for the settlements so that they appear on the same plot.
# plot the met stations using the aimag boudaries as a background
ax = district_gdf.plot(color='white', edgecolor='black')
settlements_gdf.plot(ax=ax)
We can also join data using a spatial join. A spatial join joins attributes from one set of features to another based on the spatial relationship.
In the example below, we use certain keyword arguments to find the district that a settlements is located in.
how
: the type of join. inner
returns only those results that matchop
: the spatial operation. intersects
finds features that intersect the base feature.More information about the options can be found in the documentation http://geopandas.org/reference/geopandas.sjoin.html
Scrolling to the right will show the fields that have been added.
# spatial join
district_settlements_gdf = geopandas.sjoin(settlements_gdf, district_gdf, how="inner", op='intersects')
district_settlements_gdf.head()
We can analyse the data further to find the count of the settlements in each district.
First we prepare the dataset so that it contains only the columns we want to use. We do this by selecting those columns and saving that to the geodataframe object.
We then create a new column called count. When you specify a column that doesn't exist, geopandas will create a new column in that dataframe with the name specified. We set the value to 1 so that the sum of the values gives the count.
The dissolve function is then used aggregate the features.
by
: specifies the column to dissolve on. All the features with the same value in this column will be aggregated into one.aggfunc
: specifies the aggregation method. In this case we are using sum
so we will get the sum of the aggragated values.# alter the settlements dataframe so that only the relevant columns are left
district_settlements_gdf = district_settlements_gdf[['ADM2_PCODE', 'geometry']]
# create a new column called count
# this contains values that will be added together to get the total count
district_settlements_gdf['count'] = 1
# group by AIMAGCODE using sum to aggregate the other columns
settlements_dissolve_gdf = district_settlements_gdf.dissolve(by='ADM2_PCODE', aggfunc='sum')
settlements_dissolve_gdf.head()
settlements_dissolve_gdf.plot(column='count');
We can join this dissolved dataframe onto the district polygons
# convert the dissolved stations geodataframe into a pandas dataframe and drop the geometry
settlements_dissolve_gdf = pandas.DataFrame(settlements_dissolve_gdf.drop(columns='geometry'))
# merge the dissolved stations onto the aimag geodataframe using attributes
district_gdf = district_gdf.merge(settlements_dissolve_gdf, on='ADM2_PCODE', how='left')
# if the are any distircts that don't have a count, they will have NaN in the column
# make sure these are converted to zero so they show up in the map
district_gdf['count'].fillna(0, inplace=True)
district_gdf.head()
We can then plot the resulting dataset using a few more options to get a better looking map. This example uses the package matplotlib https://matplotlib.org/ to set some of the options.
column
: Count is used as column to display the values.
cmap
: The colourmap to use for the legend.
https://matplotlib.org/3.1.1/gallery/color/colormap_reference.html
import matplotlib.pyplot as plt
# set up the plot
fig, ax = plt.subplots(1, 1)
# set the size of the plot
fig.set_size_inches(10, 6)
# plot with more attributes
district_gdf.plot(column='count',
ax=ax,
cmap='YlOrRd',
legend=True,
legend_kwds={'label': "Number of settlements by district",
'orientation': "horizontal"})
We can save the plot to an image file.
# save the figure
fig.savefig('data/district_settlements.png', dpi=300, alpha=True)
From the district geodataframe, select the district with the ADM2_PCODE 'KG04220000000'
exercise_gdf = district_gdf[district_gdf['ADM2_PCODE'] == 'KG04220000000']
exercise_gdf
# enter your code here