Python Pandas Pro – Session Three – Setting and Operations

Following on from the previous post, this session covers setting values, dealing with missing data, and data frame operations.

Setting values

The examples below show how to set values, starting with the iat accessor we learned in the last lesson.

Setting values by position

As stated, the implementation below can be used to set values by position:

from gapminder import gapminder as gp
import pandas as pd
import numpy as np

# view the head of the data
df = gp.copy()
print(df.head(10)) 
df_orig = df.copy()

# ------------------------------------ Setting  --------------------------------------------------#

df.iat[0, 2] = 2020 # Set a value by position: row 0, column 2 (year)
print(df.iat[0, 2])

Printing this out confirms that the first row of the year column (column index 2) now holds the new value.
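Hard-coding the column index 2 is easy to get wrong, so here is a minimal sketch of looking the position up by name, alongside the label-based at accessor. The three-row data frame is a hypothetical stand-in for the gapminder data, not the real dataset.

```python
import pandas as pd

# Hypothetical three-row frame standing in for the gapminder data
df = pd.DataFrame({
    "country": ["Afghanistan", "Albania", "Algeria"],
    "continent": ["Asia", "Europe", "Africa"],
    "year": [1952, 1952, 1952],
})

# Look up the integer position of the 'year' column instead of hard-coding 2
year_pos = df.columns.get_loc("year")
df.iat[0, year_pos] = 2020   # set by position (row 0, column year_pos)

# at is the label-based counterpart of iat
df.at[1, "year"] = 2021      # set by row label and column name

print(df)
```

Use iat when you know the positions and at when you know the labels; both set a single cell.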

Setting by assignment with NumPy array

The next example replaces every value in a column by assigning a NumPy array of the same length as the data frame:

df.loc[:, 'pop'] = np.array([5] * len(df)) # Setting by assignment with a NumPy array
print(df)
print(np.array(5) * len(df)) # a single 5 multiplied by the row count, not an array of 5s

This builds an array holding one 5 per row, using len to match the length of the data frame, and assigns it to the pop column. The final print multiplies 5 by the number of rows (1704), which is why 8520 appears after the data frame:

1     Afghanistan      Asia  1957   30.332    5  820.853030
2     Afghanistan      Asia  1962   31.997    5  853.100710
3     Afghanistan      Asia  1967   34.020    5  836.197138
4     Afghanistan      Asia  1972   36.088    5  739.981106
...           ...       ...   ...      ...  ...         ...
1699     Zimbabwe    Africa  1987   62.351    5  706.157306
1700     Zimbabwe    Africa  1992   60.377    5  693.420786
1701     Zimbabwe    Africa  1997   46.809    5  792.449960
1702     Zimbabwe    Africa  2002   39.989    5  672.038623
1703     Zimbabwe    Africa  2007   43.487    5  469.709298

[1704 rows x 6 columns]
8520

Please note: both approaches modify the data frame in place, so take a copy with the .copy() method first if you need to keep the original.
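To see why the copy matters, the sketch below contrasts plain assignment (which only creates a second name for the same object) with .copy(). The small data frame and the names original, alias and safe are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; the values are illustrative only
original = pd.DataFrame({"pop": [100, 200, 300]})

alias = original          # same object under a second name, not a copy
safe = original.copy()    # independent copy

# Overwrite the column in the copy only
safe.loc[:, "pop"] = np.array([5] * len(safe))

print(original["pop"].tolist())  # [100, 200, 300]
print(safe["pop"].tolist())      # [5, 5, 5]
print(alias is original)         # True
```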

Missing data

First of all, our dataset contains no missing records, so we will create some by setting rows 1 to 999 to NaN (the slice 1:1000 excludes the end position):

df[1:1000] = np.nan
print(df)

Printing this to the console, you will see the changes have taken effect:

0     Afghanistan      Asia  2020.0   28.801  5.0  779.445314
1             NaN       NaN     NaN      NaN  NaN         NaN
2             NaN       NaN     NaN      NaN  NaN         NaN
3             NaN       NaN     NaN      NaN  NaN         NaN
4             NaN       NaN     NaN      NaN  NaN         NaN
...           ...       ...     ...      ...  ...         ...
1699     Zimbabwe    Africa  1987.0   62.351  5.0  706.157306
1700     Zimbabwe    Africa  1992.0   60.377  5.0  693.420786
1701     Zimbabwe    Africa  1997.0   46.809  5.0  792.449960
1702     Zimbabwe    Africa  2002.0   39.989  5.0  672.038623
1703     Zimbabwe    Africa  2007.0   43.487  5.0  469.709298
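Two details of the output above are worth a quick sketch: a row slice excludes its end position, and NaN is a floating point value, which is why the integer year and pop columns now print as 2020.0 and 5.0. The example uses a hypothetical float-only frame so no type conversion is needed.

```python
import numpy as np
import pandas as pd

# Hypothetical float column, so NaN fits without changing the dtype
df = pd.DataFrame({"lifeExp": [28.8, 30.3, 32.0, 34.0, 36.1]})

# Positions 1 and 2 only; the stop position 3 is excluded
df.iloc[1:3] = np.nan

print(df["lifeExp"].isna().tolist())  # [False, True, True, False, False]
print(df["lifeExp"].isna().sum())     # 2
```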

Drop rows that have missing data

To drop rows that have missing data, use the how argument of the dropna function. Here we specify how='any' to drop every row that contains at least one NaN, as missing values are pesky when working with other packages such as SciPy and scikit-learn.

df2 = df.copy() # Take a copy of df to make sure the changes are only made to df2
print(df2.dropna(how='any'))
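The how argument accepts 'any' or 'all', and the difference is easiest to see on a tiny example. The frame below is hypothetical; note that dropna returns a new data frame rather than changing the one it is called on.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 1 is partly missing, row 2 is entirely missing
df = pd.DataFrame({
    "lifeExp": [28.8, np.nan, np.nan],
    "gdpPercap": [779.4, 820.9, np.nan],
})

any_dropped = df.dropna(how="any")  # drops rows with at least one NaN
all_dropped = df.dropna(how="all")  # drops rows where every value is NaN

print(any_dropped.index.tolist())   # [0]
print(all_dropped.index.tolist())   # [0, 1]
print(len(df))                      # 3, the original is untouched
```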

To fill missing values with a specific value, you can use the fillna method. More advanced approaches, such as interpolation or imputation, are also possible.
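As a rough sketch of fillna, the example below fills with a constant and then with each column's mean. The data are made up, and filling with the mean is just one simple choice, not a recommendation for every dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing row
df = pd.DataFrame({
    "lifeExp": [28.8, np.nan, 32.0],
    "pop": [5.0, np.nan, 5.0],
})

filled_zero = df.fillna(0)          # replace every NaN with 0
filled_mean = df.fillna(df.mean())  # replace NaN with that column's mean

print(filled_zero)
print(filled_mean)
```

Passing a Series (here df.mean()) fills each column with the value under the matching column name.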

Search for NAs

To search for missing values you can simply use the isna function to check whether each value is missing:

print(pd.isna(df))
# Check to see if missing values are contained anywhere in df
# (the built-in any(df) iterates over column names, so it would always print True)
print(pd.isna(df).any().any())

This returns:

     country  continent   year  lifeExp    pop  gdpPercap
0       False      False  False    False  False      False
1        True       True   True     True   True       True
2        True       True   True     True   True       True
3        True       True   True     True   True       True
4        True       True   True     True   True       True
...       ...        ...    ...      ...    ...        ...
1699    False      False  False    False  False      False
1700    False      False  False    False  False      False
1701    False      False  False    False  False      False
1702    False      False  False    False  False      False
1703    False      False  False    False  False      False

[1704 rows x 6 columns]
True

The first statement returns a boolean matrix indicating which cells hold NAs. The second reduces that matrix with any to check whether missing values exist anywhere in the data frame.
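A word of caution when reducing the boolean matrix: Python's built-in any() iterates over a data frame's column names, so it returns True even when nothing is missing. The sketch below, on two hypothetical frames, shows the pandas .any() method and a per-column count instead.

```python
import numpy as np
import pandas as pd

# Hypothetical frames: one with gaps, one without
df = pd.DataFrame({"country": ["A", None, "C"], "lifeExp": [28.8, np.nan, 32.0]})
clean = pd.DataFrame({"country": ["A", "B"], "lifeExp": [1.0, 2.0]})

print(df.isna().sum())           # number of missing values per column
print(df.isna().any().any())     # True: at least one value is missing
print(clean.isna().any().any())  # False: nothing is missing
print(any(clean.isna()))         # True: built-in any() only sees the column names
```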

Operations on data frames

This section shows how to apply simple and lambda (anonymous) functions on a data frame.

Simple operations

Say I want the mean of every numeric column in the GapMinder dataset. This can be achieved as follows:

df = df_orig.copy()
print(df.mean(numeric_only=True)) # numeric_only skips the text columns

Giving the mean of each numeric column:

year         1.979500e+03
lifeExp      5.947444e+01
pop          2.960121e+07
gdpPercap    7.215327e+03
dtype: float64

To make this a row-wise mean, specify the axis to be used:

print(df.mean(axis=1, numeric_only=True))

This averages the numerical values across each row, instead of down each column:

0       2.107023e+06
1       2.310936e+06
2       2.567483e+06
3       2.885201e+06
4       3.270552e+06
            ...
1699    2.304793e+06
1700    2.676771e+06
1701    2.851946e+06
1702    2.982319e+06
1703    3.078416e+06
Length: 1704, dtype: float64
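The axis argument is easier to follow on a frame small enough to check by hand. The two-row frame below is hypothetical. Bear in mind that a row-wise mean over columns with different units, as in the gapminder output above, is rarely meaningful in itself.

```python
import pandas as pd

# Hypothetical two-row frame with one text column and two numeric columns
df = pd.DataFrame({
    "country": ["A", "B"],
    "lifeExp": [30.0, 50.0],
    "gdpPercap": [700.0, 900.0],
})

col_means = df.mean(numeric_only=True)          # one value per column
row_means = df.mean(axis=1, numeric_only=True)  # one value per row

print(col_means)  # lifeExp 40.0, gdpPercap 800.0
print(row_means)  # row 0: 365.0, row 1: 475.0
```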

Applying functions on data frame

Say I want a cumulative sum of the population. This can be implemented by:

df_sub = df.loc[:, ['pop']]
df_sub_copy = df_sub.copy() # .copy() gives an independent frame; plain assignment would only alias df_sub
print(df_sub.apply(np.cumsum))

In practice you would want to group this by country, but we will come to that later when we look at aggregation functions:

              pop
0         8425333
1        17666267
2        27933350
3        39471316
4        52550776
...           ...
1699  50394118807
1700  50404823147
1701  50416228095
1702  50428154658
1703  50440465801
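For common operations like this, pandas also offers a built-in method, so apply is not strictly needed. The sketch below compares the two on a hypothetical three-row frame.

```python
import numpy as np
import pandas as pd

# Hypothetical population values
df_sub = pd.DataFrame({"pop": [10, 20, 30]})

via_apply = df_sub.apply(np.cumsum)  # apply a NumPy function to each column
via_method = df_sub.cumsum()         # the built-in pandas method

print(via_apply["pop"].tolist())     # [10, 30, 60]
print(via_method["pop"].tolist())    # [10, 30, 60]
```

apply earns its keep when there is no built-in method for what you want to do.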

Applying anonymous function using Lambda expression

Anonymous (lambda) functions are functions defined inline without a name. Unless you assign one to a variable, it exists only for the expression in which it is used.

A lambda can be used to get the range (maximum minus minimum) of the population:

# Lambda functions
print(df_sub_copy.apply(lambda x: x.max() - x.min()))

This prints out the range of the population:

pop    1318623085
dtype: int64
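Maximum minus minimum gives the full range of the data. If you want the interquartile range (IQR) instead, a similar lambda can subtract the 25th percentile from the 75th. Here is a minimal sketch on hypothetical values chosen so the result is easy to verify.

```python
import pandas as pd

# Hypothetical values 1 to 5, easy to check by hand
df_sub = pd.DataFrame({"pop": [1, 2, 3, 4, 5]})

value_range = df_sub.apply(lambda x: x.max() - x.min())
iqr = df_sub.apply(lambda x: x.quantile(0.75) - x.quantile(0.25))

print(value_range["pop"])  # 4
print(iqr["pop"])          # 2.0
```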

Histogramming in Python

We will create a separate pandas Series object and use the value_counts() method to emulate a histogram:

s = pd.Series(np.random.randint(0, 7, size=10))
print(s)
print(s.value_counts())

This creates a series of 10 random integers using NumPy's random module. Because the upper bound of randint is exclusive, the values fall between 0 and 6.

The value_counts function then counts how often each value occurs:

0    6
1    3
2    3
3    4
4    6
5    0
6    5
7    1
8    4
9    5
dtype: int32
6    2
5    2
4    2
3    2
1    1
0    1
dtype: int64
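Because the numbers are random, your output will differ from the one above. The sketch below fixes a seed so the run is repeatable (the seed value 42 is arbitrary), sorts the counts by value so they read like histogram bins, and shows the normalize option for proportions.

```python
import numpy as np
import pandas as pd

# Seeded generator for repeatable results; the seed value is arbitrary
rng = np.random.default_rng(42)
s = pd.Series(rng.integers(0, 7, size=10))  # upper bound 7 is exclusive

counts = s.value_counts().sort_index()   # ordered by value, like histogram bins
shares = s.value_counts(normalize=True)  # proportions instead of counts

print(counts)
print(shares)
```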

There is much more that can be done with operations on data frames. This just scratches the surface.

In the next tutorial we will look at how we can merge and join data frames.