Python Pandas Pro – Session Three – Setting and Operations

Following on from the previous post, we are going to learn about setting values, dealing with missing data, and data frame operations.

Setting values

The example below shows how to use the iat accessor we learned in the last lesson to set values.

Setting values by position

As stated, the implementation below can be used to set values by position:

[sourcecode language="python" wraplines="false" collapse="false"]
from gapminder import gapminder as gp
import pandas as pd
import numpy as np

# view the head of the data
df = gp.copy()
print(df.head(10))
df_orig = df.copy()

# ------------------------------ Setting ------------------------------ #
df.iat[0, 2] = 2020  # Set a value by position
print(df.iat[0, 2])
[/sourcecode]

Printing this out we can confirm that we have set the year column (column index 2) to a new value.

Setting by assignment with NumPy array

The next example shows how to set values by assignment, using a NumPy array to replace an entire column of the data frame:

[sourcecode language="python" wraplines="false" collapse="false"]
# Setting by assignment with a NumPy array
df.loc[:, 'pop'] = np.array([5] * len(df))
print(df)
print(np.array(5) * len(df))
[/sourcecode]

This assigns the value 5 to every row of the pop column, using len(df) to build an array the same length as the data frame (the final print, 5 * 1704 = 8520, simply confirms that length arithmetic):

[sourcecode language="python" wraplines="false" collapse="false"]
1     Afghanistan    Asia  1957  30.332  5  820.853030
2     Afghanistan    Asia  1962  31.997  5  853.100710
3     Afghanistan    Asia  1967  34.020  5  836.197138
4     Afghanistan    Asia  1972  36.088  5  739.981106
...           ...     ...   ...     ... ..         ...
1699     Zimbabwe  Africa  1987  62.351  5  706.157306
1700     Zimbabwe  Africa  1992  60.377  5  693.420786
1701     Zimbabwe  Africa  1997  46.809  5  792.449960
1702     Zimbabwe  Africa  2002  39.989  5  672.038623
1703     Zimbabwe  Africa  2007  43.487  5  469.709298

[1704 rows x 6 columns]
8520
[/sourcecode]

Please note: both of these approaches modify the original data frame in place, so it is best to take a copy with the .copy() method first if you need to keep the original.
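As a quick illustration of this point (using a small made-up data frame here rather than the GapMinder data), taking a copy first means the original survives the edit:

```python
import pandas as pd

# a small illustrative data frame (not the GapMinder data)
df = pd.DataFrame({"year": [1952, 1957], "pop": [8425333, 9240934]})

df_safe = df.copy()       # independent copy; edits here leave df untouched
df_safe.iat[0, 0] = 2020

print(df.iat[0, 0])       # still 1952
print(df_safe.iat[0, 0])  # 2020
```

Had we written df_safe = df without .copy(), both names would point at the same object and the edit would show up in both.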

Missing data

First of all, our dataset contains no missing records, so we will set rows 1 to 999 to NaN to give ourselves some missing data to work with:

[sourcecode language="python" wraplines="false" collapse="false"]
df[1:1000] = np.nan
print(df)
[/sourcecode]

Printing this to the console, you will see the changes have taken effect:

[sourcecode language="python" wraplines="false" collapse="false"]
0     Afghanistan    Asia  2020.0  28.801  5.0  779.445314
1             NaN     NaN     NaN     NaN  NaN         NaN
2             NaN     NaN     NaN     NaN  NaN         NaN
3             NaN     NaN     NaN     NaN  NaN         NaN
4             NaN     NaN     NaN     NaN  NaN         NaN
...           ...     ...     ...     ...  ...         ...
1699     Zimbabwe  Africa  1987.0  62.351  5.0  706.157306
1700     Zimbabwe  Africa  1992.0  60.377  5.0  693.420786
1701     Zimbabwe  Africa  1997.0  46.809  5.0  792.449960
1702     Zimbabwe  Africa  2002.0  39.989  5.0  672.038623
1703     Zimbabwe  Africa  2007.0  43.487  5.0  469.709298
[/sourcecode]

Drop rows that have missing data

To drop rows that have missing data, you can use the how argument of the dropna function. Here we use how='any' to drop every row that contains at least one NaN; this matters because NaNs are pesky when working with other packages such as SciPy and scikit-learn.

[sourcecode language="python" wraplines="false" collapse="false"]
df2 = df.copy()  # Take a copy of df so the changes are only made to df2
print(df2.dropna(how='any'))
[/sourcecode]

To fill missing values with a specific value, you can use the fillna method; more advanced approaches, such as interpolation or model-based imputation, are also possible.
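As a small sketch of fillna (on a made-up frame rather than the GapMinder data), you can pass a scalar, or a per-column value such as the column means:

```python
import pandas as pd
import numpy as np

# a small illustrative data frame with some gaps
df2 = pd.DataFrame({"lifeExp": [28.801, np.nan, 31.997],
                    "pop": [8425333.0, np.nan, 10267083.0]})

filled_zero = df2.fillna(0)           # replace every NaN with 0
filled_mean = df2.fillna(df2.mean())  # replace NaNs with each column's mean
print(filled_zero)
print(filled_mean)
```

The mean-fill variant is a very simple form of imputation; it keeps the column averages unchanged but ignores any structure in the data.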

Search for NAs

To search for missing values you can simply use the pd.isna function to check whether each value is missing:

[sourcecode language="python" wraplines="false" collapse="false"]
print(pd.isna(df))
# Check whether missing values are contained anywhere in df
print(pd.isna(df).values.any())
[/sourcecode]

This returns:

[sourcecode language="python" wraplines="false" collapse="false"]
      country  continent   year  lifeExp    pop  gdpPercap
0       False      False  False    False  False      False
1        True       True   True     True   True       True
2        True       True   True     True   True       True
3        True       True   True     True   True       True
4        True       True   True     True   True       True
...       ...        ...    ...      ...    ...        ...
1699    False      False  False    False  False      False
1700    False      False  False    False  False      False
1701    False      False  False    False  False      False
1702    False      False  False    False  False      False
1703    False      False  False    False  False      False

[1704 rows x 6 columns]
True
[/sourcecode]

The first statement returns a boolean data frame indicating which cells contain NAs; the second collapses this to a single True/False answer for whether any null values exist anywhere in the data frame.
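A related check that is often more useful in practice (sketched here on a small made-up frame, not the GapMinder data) is counting the NAs per column with isna().sum():

```python
import pandas as pd
import numpy as np

# a small illustrative data frame (not the GapMinder data)
df = pd.DataFrame({"country": ["Afghanistan", np.nan, "Zimbabwe"],
                   "pop": [8425333.0, np.nan, np.nan]})

print(df.isna().sum())          # NaN count per column
print(df.isna().values.any())   # True if any NaN exists anywhere
```

This tells you not just that something is missing but where, which is usually the first question when cleaning a dataset.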

Operations on data frames

This section shows how to apply simple and lambda (anonymous) functions on a data frame.

Simple operations

Say I want the mean of all the numeric columns in the GapMinder dataset. This can be achieved as follows:

[sourcecode language="python" wraplines="false" collapse="false"]
df = df_orig.copy()
print(df.mean(numeric_only=True))  # numeric_only skips the string columns
[/sourcecode]

Giving the mean of each numeric column:

[sourcecode language="python" wraplines="false" collapse="false"]
year         1.979500e+03
lifeExp      5.947444e+01
pop          2.960121e+07
gdpPercap    7.215327e+03
dtype: float64
[/sourcecode]

To make this a row-wise mean, you need to specify the axis to be used:

[sourcecode language="python" wraplines="false" collapse="false"]
print(df.mean(axis=1, numeric_only=True))
[/sourcecode]

This computes the mean across the numeric values in each row, instead of down each column:

[sourcecode language="python" wraplines="false" collapse="false"]
0       2.107023e+06
1       2.310936e+06
2       2.567483e+06
3       2.885201e+06
4       3.270552e+06
            ...
1699    2.304793e+06
1700    2.676771e+06
1701    2.851946e+06
1702    2.982319e+06
1703    3.078416e+06
Length: 1704, dtype: float64
[/sourcecode]

Applying functions on data frame

Say I want a cumulative sum of the population column. This can be implemented with apply:

[sourcecode language="python" wraplines="false" collapse="false"]
df_sub = df.loc[:, ['pop']]
df_sub_copy = df_sub.copy()  # .copy() so df_sub_copy is independent of df_sub
print(df_sub.apply(np.cumsum))
[/sourcecode]

Obviously you would want to group this by country, but we will come to that later when we look more into aggregation functions:

[sourcecode language="python" wraplines="false" collapse="false"]
              pop
0         8425333
1        17666267
2        27933350
3        39471316
4        52550776
...           ...
1699  50394118807
1700  50404823147
1701  50416228095
1702  50428154658
1703  50440465801
[/sourcecode]
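For completeness, the grouped version can be sketched with groupby (using made-up populations here, not the real GapMinder figures), so the cumulative sum restarts for each country:

```python
import pandas as pd

# illustrative populations, not the real GapMinder figures
df = pd.DataFrame({
    "country": ["Afghanistan", "Afghanistan", "Zimbabwe", "Zimbabwe"],
    "pop": [8425333, 9240934, 11404948, 11926563],
})

# cumulative sum within each country rather than across the whole frame
df["pop_cumsum"] = df.groupby("country")["pop"].cumsum()
print(df)
```

Note that groupby("country")["pop"].cumsum() returns a Series aligned to the original index, so it can be assigned straight back as a new column.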

Applying anonymous function using Lambda expression

Anonymous functions are functions that only exist in the scope where they are defined; they are not bound to a name, so they do not persist in memory once the call that uses them is finished.

A lambda can be used to get the range (maximum minus minimum) of the population:

[sourcecode language="python" wraplines="false" collapse="false"]
# Lambda functions
print(df_sub_copy.apply(lambda x: x.max() - x.min()))
[/sourcecode]

This prints the range of the population:

[sourcecode language="python" wraplines="false" collapse="false"]
pop    1318623085
dtype: int64
[/sourcecode]

Histogramming in Python

We will create a separate Pandas Series object and use the value_counts() function to emulate a histogram:

[sourcecode language="python" wraplines="false" collapse="false"]
s = pd.Series(np.random.randint(0, 7, size=10))
print(s)
print(s.value_counts())
[/sourcecode]

This creates a series of 10 random integers from the NumPy random package, with values between 0 and 6 inclusive (the upper bound of randint is exclusive).

The value_counts function then counts the frequency of each value:

[sourcecode language="python" wraplines="false" collapse="false"]
0    6
1    3
2    3
3    4
4    6
5    0
6    5
7    1
8    4
9    5
dtype: int32
6    2
5    2
4    2
3    2
1    1
0    1
dtype: int64
[/sourcecode]

There is much more that can be done with operations on data frames. This just scratches the surface.

In the next tutorial we will look at how we can merge and join data frames.
