This notebook will show you how to read Excel workbooks into Python, manipulate the data, and then save the modified data back out to an Excel sheet (or another data storage format).
To do this, we will use the Python library Pandas; more information about it can be found here. Pandas is one of the most widely used data analysis libraries in Python, extending the functionality of Numpy arrays while still keeping the familiar whole-array operations.
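As a quick illustration of that relationship, here is a minimal sketch (using made-up values rather than the workbook data used below) showing that a DataFrame behaves like a Numpy array with labels attached:
import numpy as np
import pandas as pd

# Build a small DataFrame from a Numpy array and give the columns names
arr = np.array([[1.0, 2.0], [3.0, 4.0]])
small = pd.DataFrame(arr, columns=['a', 'b'])

print(small * 10)         # element-wise arithmetic, just like a Numpy array
print(small['a'].mean())  # columns are labelled and keep array-style methods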
However, the syntax for querying Pandas DataFrames differs slightly from that used with Numpy arrays. First, we will look at how to import an Excel spreadsheet into Python, as seen below:
import pandas as pd
df = pd.read_excel('data/kgz_misc_popp.xlsx')
df
print(df)
This is one of the easiest ways to read Excel data into Python. As shown above, we have imported pandas using the standard import pandas as pd command, and then used the function .read_excel() to read in the Excel data. Beyond the path to the file, .read_excel() usually needs no further arguments, as it works out the structure of the spreadsheet (column names, rows and data types) for itself.
This function has created and filled something known as a Pandas DataFrame, which is essentially a two-dimensional labelled array: like a Numpy array, but with named columns and an index describing the rows. Next we will look at slicing parts of the data in the DataFrame to return a more useful subset.
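Before slicing, it can help to check the overall size and labels of the DataFrame (a quick sketch using the df loaded above):
# .shape gives (number of rows, number of columns) and .index holds the row labels
print(df.shape)
print(df.index)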
# Columns of the dataframe
print(df.columns)
# Printing a specific column
print(df['latitude'])
In the command above, we have selected just the column "latitude" from the df DataFrame. As you can see, the values themselves are printed, although for long columns such as this one the middle values are hidden. At the bottom of the printout, it also summarises the name of the column printed, its length and its datatype (dtype).
In the box below, try to select just the "nam" column from df:
print(df['nam'])
# enter your code
Another way to select data from a DataFrame is the .loc indexer, which is short for "locate" and selects data by label. It can be applied directly to the whole DataFrame or to a column we have already selected. For example, we could use the nam column we just selected to find the value at row label 3 as follows:
print(df['nam'].loc[3])
We can also find the values of all the columns for the row labelled 3:
print(df.loc[3])
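Keep in mind that .loc selects by label rather than by position. With the default integer index used here the two happen to coincide, but they can differ once rows are filtered or re-ordered; the position-based counterpart is .iloc:
# .loc[3] returns the row whose index label is 3,
# while .iloc[3] returns the fourth row, counting positions from 0
print(df.loc[3])
print(df.iloc[3])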
# Showing only the first 5 lines
df.head(5)
You can also select the rows and columns that you want!
df.loc[10:15, 'nam':'latitude']
In the example above, we have defined both the rows and the columns, in that order, to return a specific block of the DataFrame. It is also possible to select a whole row or column with the .loc method by replacing one of the two selections with ":", which means "return everything along this axis". For example, writing df.loc[:, 'latitude'] would return the same result as our first example, df['latitude'].
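A quick check (a small sketch using the Series equals() method) confirms the two selections are identical:
# Both expressions return the same Series of latitudes
print(df.loc[:, 'latitude'].equals(df['latitude']))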
We can do more than just simple slicing with pandas DataFrames, such as returning only the values that meet certain criteria. For example, say we wanted to find all rows in df where the 'nam' value is "MANAS". How can this be done? By passing a condition to .loc, we can select all rows that satisfy the criterion of having a nam equal to "MANAS".
df.loc[df['nam'] == 'MANAS']
This has returned all rows where the value "MANAS" is present in the nam column. Try doing a similar operation, this time searching for where the tile_id value is 1.
df.loc[df['tile_id'] == 1]
# enter your code
Be aware that 1 is a number! To check which type of object each column holds, use:
print(df.dtypes)
If you compare against the string '1' instead, no rows will match:
df.loc[df['tile_id'] == '1']
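To see what is going on behind this filtering, the comparison itself produces a boolean Series (a mask) that .loc then uses to keep only the rows marked True:
# The comparison produces one True/False value per row
mask = df['tile_id'] == 1
print(mask.head())

# Passing the mask to .loc keeps only the rows where it is True
print(df.loc[mask].head())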
Now try finding only nam, tile_id and longitude for settlements that have latitude < 40
df.loc[df['latitude']<40, ['nam','tile_id','longitude']]
# enter your code
In Pandas you can combine, remove, and apply mathematical functions to data, just like with Numpy arrays.
For instance, a single function call removes all rows containing NaN values: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html
import numpy as np
# Insert a missing value (NaN) into the 'nam' column at row label 4
df.loc[4,'nam'] = np.nan
df['nam'].loc[4]
# dropna() returns a new DataFrame with the rows containing NaN removed;
# it does not modify df itself unless you assign the result back
df.dropna()
df
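Since dropna() above did not change df, here is a small sketch of how you would keep the cleaned result, stored under a hypothetical name df_clean so the original df stays untouched for the rest of the notebook:
# Keep the cleaned copy under a new name; df itself is unchanged
df_clean = df.dropna()
print(len(df), len(df_clean))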
You can apply arithmetic to the values in a specific column. Note that the expression below returns a new Series; the DataFrame itself only changes if you assign the result back to a column:
df['tile_id']+100
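A minimal sketch of actually changing the column, done on a copy here so that the original df used in the rest of the notebook is left untouched:
# Work on a copy so the original tile_id values are preserved for later cells
df_copy = df.copy()
df_copy['tile_id'] = df_copy['tile_id'] + 100
df_copy['tile_id'].head()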
You can find max, min, mean, median for all columns.
For example:
print('Max \n', df.max())
print('----------------------------')
print('\nMin \n', df.min())
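The mean and median only make sense for numeric columns, so it helps to restrict the calculation to them (a quick sketch using the numeric_only option):
print('Mean \n', df.mean(numeric_only=True))
print('----------------------------')
print('\nMedian \n', df.median(numeric_only=True))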
It is possible to add new columns to df, using either new data or a combination of the data already in the DataFrame.
For example, let's add a new column of ids filled with random numbers.
print(f'Length of our dataframe: {len(df)} rows.')
df[['tile_id','end_id']].max()
print(f"Max of ids: \n{df[['tile_id','end_id']].max()} \n\nMin of ids: \n{df[['tile_id','end_id']].min()}")
# Random values drawn uniformly between the min and max of tile_id
random_ids = np.random.random(len(df))*(df['tile_id'].max()-df['tile_id'].min()) + df['tile_id'].min()
df['tile_id2'] = random_ids.round(1)
df
Now compute the average of the two ids using the data in tile_id and tile_id2. Write the results in another column called mean_tile_id.
# Solution 1: element-wise arithmetic on the two columns
df['mean_tile_id'] = (df['tile_id'] + df['tile_id2'])/2.0
df
# Solution 2: mean() with axis=1 averages across the columns, row by row
df['mean_tile_id'] = df.loc[:,['tile_id','tile_id2']].mean(axis=1)
df
# enter your code
You can save the entire DataFrame df in one line of code:
df.to_excel('data/new_sheet.xlsx')
Or save just a subset of df to file. For example, only latitude, longitude and tile_id:
# Select my subset of data
df_subset = df.loc[:,['latitude','longitude','tile_id']]
# Write to excel file
df_subset.to_excel('data/new_sheet.xlsx')
# Read from excel file
df1 = pd.read_excel('data/new_sheet.xlsx')
df1
print(df1.columns)
If you want to remove the Unnamed: 0 column, i.e. you do not want to save the index of df, then use the option index=False.
More info here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_excel.html
df.to_excel('data/new_sheet.xlsx', index=False)
df1 = pd.read_excel('data/new_sheet.xlsx')
df1
Write to a txt file:
# A context manager closes the file automatically, even if an error occurs
with open('data/new_sheet.txt', 'w') as file:
    file.write(df_subset.to_string())
Read from a txt file:
# delim_whitespace=True treats any run of whitespace as the separator
# (recent pandas versions deprecate it in favour of sep='\s+')
df1 = pd.read_table('data/new_sheet.txt', delim_whitespace=True)
df1
Write to a csv file:
df.to_csv('data/new_sheet.csv', index=False)
Read from a csv file:
df1 = pd.read_csv('data/new_sheet.csv')
df1
# Read back only a subset of the columns
df1 = pd.read_csv('data/new_sheet.csv', usecols=['latitude','longitude','tile_id'])
df1