In the previous post we worked with a custom data set we had created for the purposes of demonstration. This time we are going to work with the gapminder dataset.
Viewing the gapminder dataset
To initialise the gapminder dataset in my project I will use the below import statements to prepare the project ready for the demonstration:
from gapminder import gapminder as gp import pandas as pd import numpy as np # Take a copy of the data frame df = gp.copy() print(df.head(10))
This code uses the gapminder package using from and imports only the relevant information i.e. the gapminder data and this is aliased as gp.
Then, as in the previous example we import pandas and numpy for working in the project.
The final step is to take a copy of the data and then we print the head of the dataset for inspection:
country continent year lifeExp pop gdpPercap 0 Afghanistan Asia 1952 28.801 8425333 779.445314 1 Afghanistan Asia 1957 30.332 9240934 820.853030 2 Afghanistan Asia 1962 31.997 10267083 853.100710 3 Afghanistan Asia 1967 34.020 11537966 836.197138 4 Afghanistan Asia 1972 36.088 13079460 739.981106 5 Afghanistan Asia 1977 38.438 14880372 786.113360 6 Afghanistan Asia 1982 39.854 12881816 978.011439 7 Afghanistan Asia 1987 40.822 13867957 852.395945 8 Afghanistan Asia 1992 41.674 16317921 649.341395 9 Afghanistan Asia 1997 41.763 22227415 635.341351
The dataset contains the population movements against life expectancy and the gross domestic product as a percentage of the population. To sort the data by the highest population the below functionality could be added:
#Sort by value print(df.sort_values(by='pop', ascending=False))
To sort by multiple columns this can simply be extended by using a list in Python, these always have square brackets:
#Sort by value print(df.sort_values(by=['pop', 'year'], ascending=False))
The output from the model shows it has been successful in sorting by the sort values specified:
country continent year lifeExp pop gdpPercap 299 China Asia 2007 72.961 1318683096 4959.114854 298 China Asia 2002 72.028 1280400000 3119.280896 297 China Asia 1997 70.426 1230075000 2289.234136 296 China Asia 1992 68.690 1164970000 1655.784158 707 India Asia 2007 64.698 1110396331 2452.210407 ... ... ... ... ... ... ... 1299 Sao Tome and Principe Africa 1967 54.425 70787 1384.840593 1298 Sao Tome and Principe Africa 1962 51.893 65345 1071.551119 420 Djibouti Africa 1952 34.812 63149 2669.529475 1297 Sao Tome and Principe Africa 1957 48.945 61325 860.736903 1296 Sao Tome and Principe Africa 1952 46.471 60011 879.583586
Selection – getting stuff from data frames
The below two subsections will show how to get items from a data frame and then how you can also set items.
Selecting by column
To select a specific column(s) you can use the following syntax:
# Selection by 1 x column print(df['pop']) # Selection by more than 1 column print(df[['country', 'pop']])
The first line will print out the pop column, whereas the double brackets will select both country and population:
country pop 0 Afghanistan 8425333 1 Afghanistan 9240934 2 Afghanistan 10267083 3 Afghanistan 11537966 4 Afghanistan 13079460 ... ... ... 1699 Zimbabwe 9216418 1700 Zimbabwe 10704340 1701 Zimbabwe 11404948 1702 Zimbabwe 11926563 1703 Zimbabwe 12311143
Slicing rows
To slice rows from the data frame it is as simple as the below:
# Slicing rows print(df[0:10])
The results in the first 10 rows being sliced, as Python has a zero based indexing system, like most languages:
country continent year lifeExp pop gdpPercap 0 Afghanistan Asia 1952 28.801 8425333 779.445314 1 Afghanistan Asia 1957 30.332 9240934 820.853030 2 Afghanistan Asia 1962 31.997 10267083 853.100710 3 Afghanistan Asia 1967 34.020 11537966 836.197138 4 Afghanistan Asia 1972 36.088 13079460 739.981106 5 Afghanistan Asia 1977 38.438 14880372 786.113360 6 Afghanistan Asia 1982 39.854 12881816 978.011439 7 Afghanistan Asia 1987 40.822 13867957 852.395945 8 Afghanistan Asia 1992 41.674 16317921 649.341395 9 Afghanistan Asia 1997 41.763 22227415 635.341351
Selecting by label
The driving function here is to use the loc keyword to select by location in the data frame. This implementation shows how to return all rows and only matching columns:
# Selection by label print(df.loc[:, ['lifeExp', 'pop']])
Outputting:
lifeExp pop 0 28.801 8425333 1 30.332 9240934 2 31.997 10267083 3 34.020 11537966 4 36.088 13079460 ... ... ... 1699 62.351 9216418 1700 60.377 10704340 1701 46.809 11404948 1702 39.989 11926563 1703 43.487 12311143
The : operator says select all rows and only those matching columns in the df.loc command.
To specify both rows and columns to return, you can adapt the statement slightly to easily achieve that:
print(df.loc[0:10, ['lifeExp', 'pop']])
This outputs:
lifeExp pop 0 28.801 8425333 1 30.332 9240934 2 31.997 10267083 3 34.020 11537966 4 36.088 13079460 5 38.438 14880372 6 39.854 12881816 7 40.822 13867957 8 41.674 16317921 9 41.763 22227415 10 42.129 25268405
For getting a scalar variable, you would use:
# For getting a scalar value print(df.loc[dates[0], 'pop'])
That would get the first date index on the row.
Selection by position
To select by position you can use the iloc keyword.
Select single column
This below code block will show how to slice by index position:
# Selection by position print(df.iloc[3])
This outputs:
country Colombia continent Americas year 1952 lifeExp 50.643 pop 12350771 gdpPercap 2144.12 Name: 300, dtype: object
Select multiple elements – rows and columns
This method uses integer slicing to select rows and columns:
#Integer slicing print(df.iloc[3:5, 0:2])
Outputting:
country continent 3 Afghanistan Asia 4 Afghanistan Asia
You can achieve a similar result using numpy lists:
print(df.iloc[[1, 2, 4], [0, 2]])
Slicing rows and columns explicitly
To achieve this, we make use of the magic : operator in Python:
print(df.iloc[1:9, :]
Hey presto:
country continent year lifeExp pop gdpPercap 1 Afghanistan Asia 1957 30.332 9240934 820.853030 2 Afghanistan Asia 1962 31.997 10267083 853.100710 3 Afghanistan Asia 1967 34.020 11537966 836.197138 4 Afghanistan Asia 1972 36.088 13079460 739.981106 5 Afghanistan Asia 1977 38.438 14880372 786.113360 6 Afghanistan Asia 1982 39.854 12881816 978.011439 7 Afghanistan Asia 1987 40.822 13867957 852.395945 8 Afghanistan Asia 1992 41.674 16317921 649.341395
To perform this for columns we just reverse the position in the list:
print(df.iloc[:, 1:3]) #Slicing columns explicitly
This then selects just the first 3 rows you require:
continent year lifeExp 0 Asia 1952 28.801 1 Asia 1957 30.332 2 Asia 1962 31.997 3 Asia 1967 34.020 4 Asia 1972 36.088 ... ... ... ... 1699 Africa 1987 62.351 1700 Africa 1992 60.377 1701 Africa 1997 46.809 1702 Africa 2002 39.989 1703 Africa 2007 43.487
Select a value explicitly
To do this you need to specify a single row and column index:
print(df.iloc[1, 1]) #Getting a value explicitly
This returns just one record:
Asia
To undertake this, giving faster reload, you can use the iat function:
print(df.iat[1, 1]) #Fast access to a scalar explicitly
Boolean indexing
To filter the populations greater than a certain value, we can use boolean indexing:
print(df[df['pop'] > 30000000]) #Using operators to subset
This gives you all the records that are greater than the selected value. To extend this you could use or and and operators. This outputs:
country continent year lifeExp pop gdpPercap 11 Afghanistan Asia 2007 43.828 31889923 974.580338 34 Algeria Africa 2002 70.994 31287142 5288.040382 35 Algeria Africa 2007 72.301 33333216 6223.367465 55 Argentina Americas 1987 70.774 31620918 9139.671389 56 Argentina Americas 1992 71.868 33958947 9308.418710 ... ... ... ... ... ... ... 1651 Vietnam Asia 1987 62.820 62826491 820.799445 1652 Vietnam Asia 1992 67.662 69940728 989.023149 1653 Vietnam Asia 1997 70.672 76048996 1385.896769 1654 Vietnam Asia 2002 73.017 80908147 1764.456677 1655 Vietnam Asia 2007 74.249 85262356 2441.576404
To view more subsetting examples, along with setting values a data frame, come back for session three.
I hope you are enjoying these tutorials and the code can be found by clicking the Github button.