I first started using Python a couple of years ago, when the limitations of some of the Deep Learning libraries in R became apparent.
My aim is to create a series of tutorials that get you up to speed quickly with Pandas for data frames in Python.
This first session focuses on the basics of Pandas and how to create your own Pandas data structures in memory very quickly.
Pandas structures
Creating a series
A series in Pandas is similar to a vector in R. It essentially allows you to create a one-dimensional, indexed set of values:
import numpy as np
import pandas as pd

# Create a series by passing a list of values
series = pd.Series([1, 2, 3, np.nan, 6, 8, np.nan, 10, 11])
print(series)
print(series.size)
The snippet above uses the alias pd to refer to pandas. If the alias were omitted, you would need to write pandas.Series every time. Aliasing with the as keyword, which will be familiar to those who use SQL, makes it much easier to refer to the library when calling its functions.
The output of the above shows:
0     1.0
1     2.0
2     3.0
3     NaN
4     6.0
5     8.0
6     NaN
7    10.0
8    11.0
dtype: float64
9
This prints the contents of the series, the data type it contains and the series size.
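As a small supplementary sketch (not part of the original code), the following checks two points about the series above: that pd is simply another name for the pandas module, and that size counts every element, including the NaN values, whereas count() skips them. The variable names here are my own, chosen for illustration.

```python
import numpy as np
import pandas
import pandas as pd

# The alias and the full module name refer to the same module object
same_module = pd is pandas

series = pd.Series([1, 2, 3, np.nan, 6, 8, np.nan, 10, 11])

total_values = series.size          # counts every element, NaN included
non_missing = series.count()        # counts only the non-NaN elements
missing = int(series.isna().sum())  # number of NaN entries

print(same_module, total_values, non_missing, missing)
```

The difference between size and count() is worth keeping in mind whenever a series contains missing values.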
That is all there is to series. The next demo will show how to create a datetime index and labelled columns with a NumPy array.
Create Data Frame by passing a NumPy array with a date time index
The first step, which extends the code we wrote previously, is to create a series of dates and then use it as the index of the Pandas data frame:
dates = pd.date_range('20200101', periods=series.size)
print(dates)
This creates a date time index using the date_range function and then prints it to the console:
DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
               '2020-01-05', '2020-01-06', '2020-01-07', '2020-01-08',
               '2020-01-09'],
              dtype='datetime64[ns]', freq='D')
The date time index output shows the dates, the dtype (data type) and the frequency of the interval between the dates, i.e. freq='D' for daily.
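To make the parts of that output concrete, here is a minimal sketch that inspects the index directly. It assumes nine periods, matching the size of the series created earlier; the variable names are illustrative only.

```python
import pandas as pd

# Nine daily dates starting on 1 January 2020
dates = pd.date_range('20200101', periods=9)

first_date = dates[0]      # first entry in the index
last_date = dates[-1]      # last entry in the index
n_dates = len(dates)       # number of dates generated
frequency = dates.freqstr  # string form of the frequency

print(first_date, last_date, n_dates, frequency)
```

Because no freq argument is passed, date_range falls back to its default of calendar days.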
Next, we will create a data frame filled with values from the randn() function, using a list to specify the column names and the date index we created as the row index:
df = pd.DataFrame(np.random.randn(series.size, 4), index=dates, columns=list('1234'))
print(df)
The output is a data frame with the following structure (the values are random, so yours will differ):
                   1         2         3         4
2020-01-01 -1.001336  0.312421  0.213412  0.399853
2020-01-02 -1.160968  0.600596  0.612449  0.106262
2020-01-03  0.539773  0.457708  0.818120 -0.496321
2020-01-04  0.821966 -0.849103 -1.125686  0.816331
2020-01-05 -0.362707  1.449582  1.485910 -1.284188
2020-01-06  0.168309 -1.627923 -0.900661 -0.185069
2020-01-07  1.736149  0.820594 -0.840311  2.941485
2020-01-08 -0.560419 -0.332010 -1.256690 -1.128578
2020-01-09  0.142882 -1.151348  0.998045  1.472304
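If you want the same random values on every run, one option is a seeded generator. The sketch below swaps np.random.randn for NumPy's default_rng, which is my substitution rather than the approach used in this tutorial, and the seed of 42 is an arbitrary choice. It also shows that list('1234') produces string column labels, not integers.

```python
import numpy as np
import pandas as pd

# Seeded generator so the random values repeat between runs (seed is arbitrary)
rng = np.random.default_rng(42)
dates = pd.date_range('20200101', periods=9)

# 9 rows by 4 columns of standard normal values, indexed by date
df = pd.DataFrame(rng.standard_normal((9, 4)), index=dates, columns=list('1234'))

print(df.shape)
print(df.columns.tolist())  # the labels are the strings '1' to '4'
```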
Dictionaries and Data Frames
Dictionaries are a very useful type in Python: they map keys to values.
The example below shows how to build a data frame from a dictionary, where each key becomes a column name:
df2 = pd.DataFrame(
    {
        'A': 1.,
        'B': pd.Timestamp('20200102'),
        'C': pd.Series(1, index=list(range(4)), dtype='float32'),
        'D': np.array([3] * 4, dtype='int32'),
        'E': pd.Categorical(["test", "train", "test", "train"]),
        'F': 'Gary'
    }
)
print(df2)
Dictionaries are created using curly braces {}, and the example above uses:
- A simple float value
- The Pandas Timestamp function
- A Pandas Series
- A NumPy array
- The Pandas Categorical data type, to pass categorical values as an array
- A string literal
Printing the result shows the data frame constructed from these various Python structures and data types:
     A          B    C  D      E     F
0  1.0 2020-01-02  1.0  3   test  Gary
1  1.0 2020-01-02  1.0  3  train  Gary
2  1.0 2020-01-02  1.0  3   test  Gary
3  1.0 2020-01-02  1.0  3  train  Gary
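Notice in the output above that the single values for A, B and F are repeated on every row. The following minimal sketch isolates that behaviour with a smaller, made-up dictionary; the column names score and group are purely illustrative.

```python
import pandas as pd

small = pd.DataFrame({
    'score': [0.5, 0.7, 0.9],  # list: one value per row
    'group': 'train',          # scalar: repeated for every row
})

print(small)
print(small['group'].tolist())
```

At least one of the values needs a length (a list, array or series) so that Pandas knows how many rows to create.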
Viewing data types of all data frame objects
To check the data types of all the columns in the data frame, you can use the syntax below:
print(df2.dtypes)
This outputs all the data types of the data frame:
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
This shows the underlying data type of each column in the data frame.
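The data types can also be checked one column at a time. This is a small sketch under the assumption that you only care about a few columns; it rebuilds columns C, D and E from the example above in a separate frame, which I have named frame for illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
    'D': np.array([3] * 4, dtype='int32'),
    'E': pd.Categorical(["test", "train", "test", "train"]),
})

c_dtype = str(frame['C'].dtype)   # dtype of a single column
d_dtype = str(frame.dtypes['D'])  # same information via the dtypes series
e_dtype = str(frame['E'].dtype)

print(c_dtype, d_dtype, e_dtype)
```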
Data Frame operations
Data frame operations are accessed by placing a period after the data frame object, as in df.head().
Viewing the top and bottom of data frames
To view the top of a large data frame, use the head function:
print(df2.head(2))
Simply change the number passed to head to specify how many rows to show from the top. The statement above outputs:
     A          B    C  D      E     F
0  1.0 2020-01-02  1.0  3   test  Gary
1  1.0 2020-01-02  1.0  3  train  Gary
To perform the same for the bottom values, use the tail function with the same syntax as head.
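As a quick illustration of both functions side by side, here is a minimal sketch on a made-up ten-row data frame (the name numbers and the column n are my own):

```python
import pandas as pd

# Ten rows holding the integers 0 to 9
numbers = pd.DataFrame({'n': range(10)})

top_two = numbers.head(2)     # first two rows
bottom_two = numbers.tail(2)  # last two rows
default_top = numbers.head()  # with no argument, head returns five rows

print(top_two)
print(bottom_two)
print(len(default_top))
```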
Obtaining descriptive statistics of a data frame
To obtain the descriptive statistics of a data frame, use the describe() function:
# Get descriptive statistics
# Setting the include argument to 'all' covers every column type, not just numeric ones
print(df2.describe(include='all'))
The output of this is:
          A                    B    C    D      E     F
count   4.0                    4  4.0  4.0      4     4
unique  NaN                    1  NaN  NaN      2     1
top     NaN  2020-01-02 00:00:00  NaN  NaN  train  Gary
freq    NaN                    4  NaN  NaN      2     4
first   NaN  2020-01-02 00:00:00  NaN  NaN    NaN   NaN
last    NaN  2020-01-02 00:00:00  NaN  NaN    NaN   NaN
mean    1.0                  NaN  1.0  3.0    NaN   NaN
std     0.0                  NaN  0.0  0.0    NaN   NaN
min     1.0                  NaN  1.0  3.0    NaN   NaN
25%     1.0                  NaN  1.0  3.0    NaN   NaN
50%     1.0                  NaN  1.0  3.0    NaN   NaN
75%     1.0                  NaN  1.0  3.0    NaN   NaN
max     1.0                  NaN  1.0  3.0    NaN   NaN
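To see what the include argument changes, the sketch below compares describe() with and without it on a small made-up data frame. The column names value and label are illustrative, and the exact set of rows in the output can vary between Pandas versions.

```python
import pandas as pd

frame = pd.DataFrame({
    'value': [1.0, 2.0, 3.0, 4.0],
    'label': pd.Categorical(['test', 'train', 'test', 'train']),
})

numeric_only = frame.describe()             # default: numeric columns only
everything = frame.describe(include='all')  # adds the categorical column

print(numeric_only)
print(everything)
```

With include='all', rows such as unique, top and freq appear for the categorical column, while the numeric statistics show NaN where they do not apply.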
Displaying column names
The command to display the column headings is very simple:
print(df2.columns)
This produces:
Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')
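If you need the column names as a plain Python list, or just the number of columns, a sketch along these lines should work. The small data frame here is made up for the example rather than taken from the tutorial.

```python
import pandas as pd

example = pd.DataFrame({'A': [1.0, 1.0], 'B': ['x', 'y'], 'C': [3, 3]})

column_names = example.columns.tolist()  # Index converted to a plain list
column_count = len(example.columns)      # number of columns
n_rows, n_cols = example.shape           # shape gives (rows, columns)

print(column_names, column_count, n_rows, n_cols)
```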
Transposing a data frame
To transpose a data frame, use the T attribute:
# Transpose the data
print(df2.T)
This flips the data frame around:
                     0                    1                    2                    3
A                    1                    1                    1                    1
B  2020-01-02 00:00:00  2020-01-02 00:00:00  2020-01-02 00:00:00  2020-01-02 00:00:00
C                    1                    1                    1                    1
D                    3                    3                    3                    3
E                 test                train                 test                train
F                 Gary                 Gary                 Gary                 Gary
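A smaller numeric example makes the row and column swap easier to follow. This is only a sketch on a made-up data frame called grid; note that T is written without parentheses.

```python
import pandas as pd

# Three rows, two columns
grid = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

flipped = grid.T  # rows become columns and columns become rows

print(grid.shape, flipped.shape)
print(flipped)
```

The old column names A and B become the row index, and the old row numbers become the column labels.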
What’s next?
The next in the series is Sorting, Indexing and Slicing data frames in Python with Pandas.
Stay tuned for more tutorials on how to use Pandas.