More Advanced Python - Import data files

Excel Sheet Manipulation

This notebook will allow you to read Excel workbooks into Python, manipulate the data and then save the changed data back out to an Excel sheet (or other data storage format).

To do this, we will use the Python library Pandas, more information about which can be found here. This library is one of the most useful data analysis libraries in Python, extending the functionality of Numpy arrays while still maintaining many of the in-built, whole array functions.

However, there is a slightly different syntax used when querying Pandas dataframes/datasets as opposed to the Numpy arrays. First, we will look at how to import an Excel spreadsheet into Python, as seen below:

This is one of the easiest ways to read Excel data into Python. As shown above, we have imported pandas using the standard import pandas command, and then used the function .read_excel() to import in the Excel data. Using the .read_excel() function means we often don't need to define any arguments in the function call as the function knows what sort of data structure the Excel data will be in.

This function has created and filled something known as a Pandas Dataframe which is essentially a multi-dimensional Numpy array with labels describing aspects of it. Next we will look at slicing parts of the data in the Pandas Dataframe to return a more useful subset.

In the command above, we have selected just the column "latitude" from the df Dataframe. As you can see, the values themselves are printed, although for long lists such as this one, the middle values are hidden. At the bottom of the print out, it also summarises the name of the column printed, it's length and its datatype (dtype).

In the box below, try to select just the "nam" column from df:

Another way to select data from a DataFrame is using the .loc parameter, which is short for locate. This can be applied directly onto a multi-dimensional array or onto an already sliced array. For example, we could slice the nam array we just had to find the 3rd value in that array as follows:

We can also find the values of all the columns for the 3rd row:

You can also select the rows and columns that you want!

In the example above, we have defined both the row and the column, in that order, to find a specific cell within the Dataframe. It is also possible to print out a whole row or column using the .loc method as shown above, but replacing one of the individual values with ":", which signifies returning the whole series. For example, writing df.loc[:, 'latitude'] would return the same result as our first example, df['latitude'].

We can do more than just simple slicing using pandas Dataframes, like returning values which only fit certain parameters. For example, say we wanted to search for all values in df where the 'nam' value is "MANAS". How can this be done?

Well, using the .loc function as an if statement, we can search for all rows which satisfy the criteria of being in the nam "MANAS".

This has returned all rows where the value of "MANAS" has been present in the nam column. Try doing a similar operation, this time searching for where the tile id value is 1.

Be aware that 1 is a number! To check which type of object you have in each column use:

If you use a string it will not work!

Now try finding only nam, tile_id and longitude for settlements that have latitude < 40

Data manipulation in Pandas

In Pandas you can combine, remove, and apply mathematical functions to data! Just like with Numpy arrays.

For instance, with a simple function you can remove all the NaNs values: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html

You can change the values in a specific column

You can find max, min, mean, median for all columns.

For example:

It is possible to add new columns to df, using either new data or a combination of the data in the Dataframe. For example, let's add a new column of the ids using random numbers.

Now compute the average temperature using the data in tile_id and tile_id2. Write the results in another column called mean_tile_id.

Now let's save the data to a file!

You can save the entire Dataframe df in one line of code

Or save to file just a subset of df. For example only the latitude, longitude and tile_id

If you want to remove the Unnamed: 0 column, i.e. you do not want to save the index of the df, then use the option index=False. Here more info: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_excel.html

Write to txt file:

Read from txt file:

Write to csv file:

Read from csv file: