A package of Machine Learning datasets has arrived for R – MLDataR

I am proud to announce my seventh package for you lovely R folks. This time it is a collection of datasets aimed at giving people in healthcare, and beyond, some solid examples to use with R. There are already a few excellent packages for this, such as mlbench, but they are limited in the number of datasets they offer.

Using the package

To show how the package can be used, I have put together a YouTube video that walks through the datasets, including a thorough tidymodels example that takes the thyroid disease dataset and models it with a random forest.
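For readers who prefer text to video, here is a rough sketch of what a tidymodels random forest on the thyroid data might look like. It is not the code from the video. The helper `fit_rf_sketch()` is hypothetical, and the dataset name `thyroid_disease` and outcome column `ThryroidClass` are assumptions that should be checked against your installed version of the package. It also assumes the tidymodels and ranger packages are installed.

```r
# Sketch only: assumes the tidymodels and ranger packages are installed.
library(tidymodels)

# Hypothetical helper (not part of MLDataR): fits a random forest classifier
# to `df`, predicting the column named in `outcome`, and returns the accuracy
# on a held-out test set.
fit_rf_sketch <- function(df, outcome, seed = 123) {
  df <- df[complete.cases(df), ]                 # drop incomplete rows for simplicity
  chr <- vapply(df, is.character, logical(1))
  if (any(chr)) df[chr] <- lapply(df[chr], as.factor)
  df[[outcome]] <- as.factor(df[[outcome]])      # classification needs a factor outcome

  set.seed(seed)
  split <- initial_split(df, prop = 0.8)         # 80/20 train/test split
  train <- training(split)
  test  <- testing(split)

  rf_spec <- rand_forest(trees = 500) %>%
    set_engine("ranger") %>%
    set_mode("classification")

  rf_fit <- workflow() %>%
    add_formula(as.formula(paste(outcome, "~ ."))) %>%
    add_model(rf_spec) %>%
    fit(data = train)

  preds <- predict(rf_fit, new_data = test)
  accuracy_vec(truth = test[[outcome]], estimate = preds$.pred_class)
}

# Assumed names: `thyroid_disease` and `ThryroidClass`. Adjust if they differ.
if (requireNamespace("MLDataR", quietly = TRUE)) {
  thyroid <- MLDataR::thyroid_disease
  if ("ThryroidClass" %in% names(thyroid)) {
    print(fit_rf_sketch(thyroid, "ThryroidClass"))
  }
}
```

Dropping incomplete rows keeps the sketch short; a real analysis would more likely impute missing values inside a recipe.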

The link to the video is below:

Package aims

The package aims to provide you with useful datasets for undertaking supervised machine learning in R, with the machine learning library of your choice.

The package currently has three example datasets, and more are being added every week. The first three datasets contained in the package are:

  • Diabetes disease prediction – supervised machine learning classification dataset to enable the prediction of diabetic patients
  • Heart disease prediction – supervised machine learning classification dataset to enable the prediction of heart disease using a number of key outcome features
  • Thyroid disease prediction – supervised machine learning classification dataset to allow for the prediction of thyroid disease utilising historic patient records
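
To check which datasets your installed version actually ships with, one option is to ask R directly rather than rely on a list like the one above. The helper name `list_package_datasets()` is made up for this sketch; `data(package = ...)` is base R.

```r
# Hypothetical helper: returns the names of the datasets bundled with a package.
list_package_datasets <- function(pkg) {
  info <- utils::data(package = pkg)$results
  unname(info[, "Item"])
}

# Only runs if MLDataR is installed
if (requireNamespace("MLDataR", quietly = TRUE)) {
  print(list_package_datasets("MLDataR"))
}
```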

More datasets are being added, so look out for the next version of this package.

Package GitHub

The supporting GitHub repository shows how to install the package and get started quickly, and I am hoping that the YouTube video will help you get up and running with speed and vigour.
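
As a minimal sketch of the install step, assuming the package is available from CRAN under the name MLDataR (the GitHub page remains the place to check for the current instructions), something like the following should work. `ensure_installed()` is a hypothetical convenience helper, not part of the package.

```r
# Hypothetical helper: installs a package from CRAN only if it is missing,
# then reports (invisibly) whether it can now be loaded.
ensure_installed <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# Uncomment to install and attach the package (assumes a CRAN release exists):
# ensure_installed("MLDataR")
# library(MLDataR)
```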

Additionally, you can follow along with the supporting vignette for this package.

That’s a wrap

I hope you use these datasets in earnest. Python users, please feel free to export them to CSV and use them with scikit-learn; I do intend to make this available on PyPI as well.
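
For the CSV route, a small sketch using base R is shown here. The helper `export_dataset_csv()` and the file name are made up for illustration, and the dataset name `thyroid_disease` is an assumption to check against your installed version.

```r
# Hypothetical helper: writes a data frame to CSV without row names, so the
# file reads cleanly into other tools, and returns the path invisibly.
export_dataset_csv <- function(df, path) {
  utils::write.csv(df, file = path, row.names = FALSE)
  invisible(path)
}

# Assumed dataset name; only runs if MLDataR is installed
if (requireNamespace("MLDataR", quietly = TRUE)) {
  export_dataset_csv(MLDataR::thyroid_disease, "thyroid_disease.csv")
}
```

On the Python side, the resulting file can then be read with a CSV reader such as the one in pandas before being passed to scikit-learn.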