This tutorial explores the various ways data can be encoded, using pandas and NumPy, to prepare it for a machine learning or predictive modelling pipeline.
Encoding methods
There are three main methods explored here:
- Label encoding: assigning each category an integer based on where it falls in the label order. This can suit ranked data and non-parametric methods, but tends to be used less with machine learning models
- One-hot encoding (dummy variable encoding): taking a group of categorical labels and giving each one a binary dummy column holding a 1 (belongs in that category) or 0 (doesn’t belong in that category)
- Manual encoding: using a specific condition to assign the numerical 1 or 0 encoding.
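As a rough illustration of the three methods, here is a minimal sketch using pandas and NumPy. The DataFrame, its column names (`size`, `colour`, `age`), the rank order and the age threshold of 18 are all made up for this example and are not taken from the tutorial notebook, which may approach things differently.

```python
import numpy as np
import pandas as pd

# Hypothetical example data; column names and values are illustrative only.
df = pd.DataFrame({
    "size": ["small", "large", "medium", "small"],
    "colour": ["red", "blue", "red", "green"],
    "age": [15, 42, 67, 30],
})

# 1. Label encoding: map each category to an integer code.
# An ordered Categorical makes the codes follow a chosen rank order
# (small=0, medium=1, large=2) rather than alphabetical order.
size_order = ["small", "medium", "large"]
df["size_label"] = pd.Categorical(
    df["size"], categories=size_order, ordered=True
).codes

# 2. One-hot (dummy variable) encoding: one 0/1 column per category.
colour_dummies = pd.get_dummies(df["colour"], prefix="colour", dtype=int)

# 3. Manual encoding: a condition decides whether the value is 1 or 0.
# The threshold of 18 is an assumption chosen for this sketch.
df["is_adult"] = np.where(df["age"] >= 18, 1, 0)

# Combine the original columns with the dummy columns.
encoded = pd.concat([df, colour_dummies], axis=1)
print(encoded)
```

Passing an explicit category order matters for label encoding: without it, the codes follow alphabetical order, which may not match the ranking you intend.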
The tutorial
The tutorial is a YouTube video I created to help you grasp the concepts. I built it in Jupyter and have attached a Python (.py) file to support it. Watch the tutorial below:
Where to get the content?
The supporting code files can be found on my GitHub account. These include a Jupyter notebook and a Python file. The next tutorial will look at how to do this type of encoding in scikit-learn and other Python libraries, so look out for that.
Signing off
I hope you enjoy my tutorials. Please stay tuned for the next video and subscribe to the YouTube channel.