Following on from the previous post, in this post we are going to learn about setting values, dealing with missing data and data frame operations.
Setting values
The example below shows how to use the iat accessor, which we met in the last lesson, to set values.
Setting values by position
As stated, the implementation below can be used to set values by position:
from gapminder import gapminder as gp
import pandas as pd
import numpy as np

# View the head of the data
df = gp.copy()
print(df.head(10))
df_orig = df.copy()  # keep an untouched copy for the later sections

# ------------------------------------ Setting ------------------------------------ #
# Set a value by position: row 0, column 2 (the year column).
# An integer is used so that the year column keeps its numeric dtype.
df.iat[0, 2] = 2020
print(df.iat[0, 2])
Printing this out confirms that the year (column index 2) in the first row has been set to a new value.
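As a small self-contained sketch of the same idea, the toy data frame below (invented for illustration, not the Gapminder data) sets one value by integer position with iat and another by label with at, its label-based counterpart:

```python
import pandas as pd

# Toy data frame, made up purely for illustration
toy = pd.DataFrame({
    "country": ["A", "B", "C"],
    "year": [1952, 1957, 1962],
})

toy.iat[0, 1] = 2020       # row position 0, column position 1 (the year column)
toy.at[2, "year"] = 2021   # row label 2, column label "year"

print(toy)
```

Both accessors address a single cell, which makes them a good fit for one-off corrections like this one.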
Setting by assignment with NumPy array
The next example shows how to set values by assignment, using a NumPy array to replace a whole column of a data frame:
df.loc[:, 'pop'] = np.array([5] * len(df))  # setting by assignment with a NumPy array
print(df)
print(np.array(5) * len(df))  # a single 5 multiplied by the number of rows
This builds an array that holds the value 5 once for every row, using len to match the length of the data frame, and assigns it to the population column. The last print statement is a different calculation: it multiplies 5 by the number of rows and prints one number.
1     Afghanistan      Asia  1957   30.332    5  820.853030
2     Afghanistan      Asia  1962   31.997    5  853.100710
3     Afghanistan      Asia  1967   34.020    5  836.197138
4     Afghanistan      Asia  1972   36.088    5  739.981106
...           ...       ...   ...      ...  ...         ...
1699     Zimbabwe    Africa  1987   62.351    5  706.157306
1700     Zimbabwe    Africa  1992   60.377    5  693.420786
1701     Zimbabwe    Africa  1997   46.809    5  792.449960
1702     Zimbabwe    Africa  2002   39.989    5  672.038623
1703     Zimbabwe    Africa  2007   43.487    5  469.709298
[1704 rows x 6 columns]
8520
Please note: these two approaches do modify the original data frame, so it may be best to use the .copy() function to take a copy of the data frame.
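To illustrate the point about copies, here is a minimal sketch on a made-up three-row data frame (the names original and working are hypothetical). The assignment is applied to the copy only, so the original values survive:

```python
import numpy as np
import pandas as pd

original = pd.DataFrame({"pop": [100, 200, 300]})  # toy values for illustration
working = original.copy()                          # independent copy of the data

# Overwrite the whole column on the copy only
working.loc[:, "pop"] = np.array([5] * len(working))

print(original["pop"].tolist())  # the original is unchanged
print(working["pop"].tolist())   # only the copy was modified
```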
Missing data
Our dataset contains no missing records, so we will first set rows 1 to 999 to missing values:
df[1:1000] = np.nan  # rows 1 to 999 become NaN in every column
print(df)
Printing this to the console, you will see the changes have taken effect:
0     Afghanistan      Asia  2020.0   28.801  5.0  779.445314
1             NaN       NaN     NaN      NaN  NaN         NaN
2             NaN       NaN     NaN      NaN  NaN         NaN
3             NaN       NaN     NaN      NaN  NaN         NaN
4             NaN       NaN     NaN      NaN  NaN         NaN
...           ...       ...     ...      ...  ...         ...
1699     Zimbabwe    Africa  1987.0   62.351  5.0  706.157306
1700     Zimbabwe    Africa  1992.0   60.377  5.0  693.420786
1701     Zimbabwe    Africa  1997.0   46.809  5.0  792.449960
1702     Zimbabwe    Africa  2002.0   39.989  5.0  672.038623
1703     Zimbabwe    Africa  2007.0   43.487  5.0  469.709298
Drop rows that have missing data
To drop rows that have missing data, use the how parameter of the dropna method. Here we specify how='any', which drops every row containing at least one NaN, as missing values are pesky when working with other packages such as SciPy and scikit-learn.
df2 = df.copy()  # work on a copy of df; dropna returns a new data frame and leaves df2 itself unchanged
print(df2.dropna(how='any'))
To fill NA values with a specific value, you can use the fillna method. Other approaches include more advanced imputation methods.
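The sketch below shows fillna next to dropna on a small made-up data frame (the values are illustrative, not taken from Gapminder). Filling with the column mean is just one simple choice, shown here as an assumption rather than a recommendation:

```python
import numpy as np
import pandas as pd

# Toy data frame with one missing value in each column
toy = pd.DataFrame({
    "lifeExp": [28.8, np.nan, 32.0],
    "pop": [8.4, 9.2, np.nan],
})

filled_constant = toy.fillna(0)       # every NaN becomes 0
filled_mean = toy.fillna(toy.mean())  # every NaN becomes its column's mean
dropped = toy.dropna(how="any")       # keep only rows with no NaN at all

print(filled_constant)
print(filled_mean)
print(dropped)
```

All three calls return new data frames, so toy itself still contains its NaN values afterwards.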
Search for NAs
To search for missing values, you can use the isna function to check whether each value is missing:
print(pd.isna(df))
print(pd.isna(df).values.any())  # check whether a missing value appears anywhere in df
This returns:
     country  continent   year  lifeExp    pop  gdpPercap
0      False      False  False    False  False      False
1       True       True   True     True   True       True
2       True       True   True     True   True       True
3       True       True   True     True   True       True
4       True       True   True     True   True       True
...      ...        ...    ...      ...    ...        ...
1699   False      False  False    False  False      False
1700   False      False  False    False  False      False
1701   False      False  False    False  False      False
1702   False      False  False    False  False      False
1703   False      False  False    False  False      False
[1704 rows x 6 columns]
True
The first statement returns a boolean data frame showing which cells hold NAs. The second reduces that frame to a single boolean with .values.any(), telling us whether a missing value exists anywhere in the data frame. Python's built-in any is not suitable here, because when applied to a data frame it iterates over the column names rather than the values.
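A short sketch on made-up data shows two common follow-ups: counting the missing values per column, and reducing the boolean frame to one answer. It also shows why Python's built-in any is a risky check on a data frame, since it iterates over the column names and not the values:

```python
import numpy as np
import pandas as pd

# Toy data frame with some missing values
toy = pd.DataFrame({
    "country": ["A", None, "C"],
    "pop": [1.0, np.nan, np.nan],
})

mask = toy.isna()                # boolean frame with the same shape as toy
missing_per_column = mask.sum()  # each True counts as 1
has_missing = mask.values.any()  # one boolean for the whole frame

print(missing_per_column)
print(has_missing)

# A frame with no missing values at all
complete = pd.DataFrame({"pop": [1.0, 2.0]})
builtin_any = any(complete.isna())         # iterates over the column name "pop"
values_any = complete.isna().values.any()  # looks at the actual values

print(builtin_any, values_any)
```

Here builtin_any comes out True even though complete has no missing values, while values_any is False.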
Operations on data frames
This section shows how to apply simple and lambda (anonymous) functions on a data frame.
Simple operations
Say I want the mean of each numeric column in the Gapminder dataset. This can be done as follows:
df = df_orig.copy()
print(df.mean(numeric_only=True))  # numeric_only skips the text columns, which recent pandas versions refuse to average
This gives the mean of each numeric column:
year         1.979500e+03
lifeExp      5.947444e+01
pop          2.960121e+07
gdpPercap    7.215327e+03
dtype: float64
To make this a row-wise mean, you need to specify the axis to be used:
print(df.mean(axis=1, numeric_only=True))  # axis=1 averages across the columns of each row
This applies the operation across the numerical values of each row, instead of down each column:
0       2.107023e+06
1       2.310936e+06
2       2.567483e+06
3       2.885201e+06
4       3.270552e+06
            ...
1699    2.304793e+06
1700    2.676771e+06
1701    2.851946e+06
1702    2.982319e+06
1703    3.078416e+06
Length: 1704, dtype: float64
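The difference between the two axes is easier to see on a tiny example. The data frame below is invented for illustration. The default axis gives one mean per column, and axis=1 gives one mean per row:

```python
import pandas as pd

# Toy data frame with one text column and two numeric columns
toy = pd.DataFrame({
    "country": ["A", "B"],
    "lifeExp": [30.0, 40.0],
    "gdpPercap": [700.0, 900.0],
})

col_means = toy.mean(numeric_only=True)          # one value per numeric column
row_means = toy.mean(axis=1, numeric_only=True)  # one value per row

print(col_means)
print(row_means)
```

Note that a row-wise mean mixes columns with different units, so it is mainly useful when the columns are comparable.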
Applying functions on data frame
Say I want a cumulative sum of the population. This can be implemented by:
df_sub = df.loc[:, ['pop']]   # keep only the population column
df_sub_copy = df_sub.copy()   # a real copy, not a second name for the same object
print(df_sub.apply(np.cumsum))
In practice you would want to group this by country, but we will come to that later when we look more closely at aggregation functions:
              pop
0         8425333
1        17666267
2        27933350
3        39471316
4        52550776
...           ...
1699  50394118807
1700  50404823147
1701  50416228095
1702  50428154658
1703  50440465801
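As a brief preview of the grouped version, and only as a sketch on made-up data, the cumulative sum can be restarted for each country by combining groupby with cumsum. The column name pop_cumsum is a hypothetical name chosen for this example:

```python
import pandas as pd

# Toy data: two countries with two observations each
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "pop": [10, 20, 100, 200],
})

# The running total starts again for each country
toy["pop_cumsum"] = toy.groupby("country")["pop"].cumsum()

print(toy)
```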
Applying anonymous function using Lambda expression
Anonymous functions are functions defined without a name, usually inline at the point where they are used. Because a lambda is never bound to a name, it cannot be referred to from anywhere else in the program.
A lambda can be used to get the range of the population, that is, the maximum minus the minimum:
# Lambda functions
print(df_sub_copy.apply(lambda x: x.max() - x.min()))
This prints out the range of the population:
pop    1318623085
dtype: int64
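Note that the lambda above computes the maximum minus the minimum, which is the full range of the values. An interquartile range is instead the gap between the 75th and 25th percentiles. The sketch below, on made-up data, shows both side by side using the quantile method:

```python
import pandas as pd

toy = pd.DataFrame({"pop": [1, 2, 3, 4, 5]})  # toy values for illustration

# Full range: largest value minus smallest value
value_range = toy.apply(lambda x: x.max() - x.min())

# Interquartile range: 75th percentile minus 25th percentile
iqr = toy.apply(lambda x: x.quantile(0.75) - x.quantile(0.25))

print(value_range)
print(iqr)
```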
Histogramming in Python
We will create a separate pandas Series and use the value_counts() method to emulate a histogram:
s = pd.Series(np.random.randint(0, 7, size=10))
print(s)
print(s.value_counts())
This creates a series of 10 random integers using NumPy's random module. Because the upper bound of randint is exclusive, the values lie between 0 and 6 inclusive.
The value counts function then counts the frequency:
0    6
1    3
2    3
3    4
4    6
5    0
6    5
7    1
8    4
9    5
dtype: int32
6    2
5    2
4    2
3    2
1    1
0    1
dtype: int64
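By default value_counts orders the result by frequency. For a histogram-like view it can help to order by value instead. The sketch below reuses the ten values printed above as fixed data, so the result is reproducible, and also shows relative frequencies with normalize=True:

```python
import pandas as pd

# The ten values from the output above, typed in as fixed data
s = pd.Series([6, 3, 3, 4, 6, 0, 5, 1, 4, 5])

counts = s.value_counts().sort_index()                # ordered by value, not by frequency
shares = s.value_counts(normalize=True).sort_index()  # proportions that sum to 1

print(counts)
print(shares)
```

Values that never occur, such as 2 in this sample, simply do not appear in the result.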
There is much more that can be done with operations on data frames. This just scratches the surface.
In the next tutorial we will look at how we can merge and join data frames.