CRAN package development is addictive

During my R career I have had the opportunity to work on some really exciting projects, and with some really clever dudes and dudettes.

After sitting through lectures, webinars and package development tutorials, I thought perhaps I should make some of my code, which was sitting in a dusty private GitHub repo, open and accessible to all. This started me on the journey to develop R packages.

OddsPlotty – the first package I created

The first package I created was called OddsPlotty and I created this as a way to easily explain the importance of the features in a model I had developed for predicting whether a patient would be admitted to ED.
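As a rough illustration of the underlying idea (not OddsPlotty's own code), odds ratios can be pulled from a logistic regression in base R by exponentiating the coefficients. The data below is simulated, and the column names `admitted`, `age` and `prior_visits` are made up for this sketch.

```r
# Simulated data: purely illustrative, not from the original ED model
set.seed(42)
n <- 500
age <- rnorm(n, mean = 60, sd = 12)
prior_visits <- rpois(n, lambda = 2)

# Log-odds used only to simulate a binary admission outcome
log_odds <- -4 + 0.04 * age + 0.5 * prior_visits
admitted <- rbinom(n, size = 1, prob = plogis(log_odds))
patients <- data.frame(admitted, age, prior_visits)

# Fit a logistic regression model
fit <- glm(admitted ~ age + prior_visits, data = patients, family = binomial())

# Odds ratios are the exponentiated coefficients;
# confint.default() gives Wald intervals on the log-odds scale
ci <- confint.default(fit)
odds_ratios <- data.frame(
  term = names(coef(fit)),
  odds_ratio = exp(coef(fit)),
  lower = exp(ci[, 1]),
  upper = exp(ci[, 2]),
  row.names = NULL
)

# Drop the intercept, which is not a feature
odds_ratios <- odds_ratios[odds_ratios$term != "(Intercept)", ]
print(odds_ratios)
```

An odds ratio above 1 suggests the feature increases the odds of admission, and an interval that excludes 1 suggests the effect is distinguishable from no effect.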

I thought – this is actually quite a nice little way to calculate and visualise the odds ratios. To learn how to use the package you can either follow the vignette, or watch the YouTube tutorial below:

Moving on from this, and because I have an interest in machine learning and statistical modelling, I was driven to build the next package.

ConfusionTableR – the next in the package line up

The next package I created was the ConfusionTableR package. This package’s aim was to collapse down a confusion matrix, derived from a classification model, so that all the values can be stored in one row of a database or spreadsheet.
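To give a feel for what "collapsing" means, here is a minimal base R sketch, not the package's implementation. The labels and the column naming scheme (`pred_<x>_actual_<y>`) are assumptions made for this example.

```r
# Toy labels: illustrative only
actual <- factor(
  c("yes", "no", "yes", "yes", "no", "no", "yes", "no"),
  levels = c("no", "yes")
)
predicted <- factor(
  c("yes", "no", "no", "yes", "no", "yes", "yes", "no"),
  levels = c("no", "yes")
)

# Standard confusion matrix: rows are predictions, columns are actuals
cm <- table(Predicted = predicted, Actual = actual)

# Turn each cell into a named value, then spread the cells across one row
cells <- as.data.frame(cm, stringsAsFactors = FALSE)
cell_names <- paste0("pred_", cells$Predicted, "_actual_", cells$Actual)
one_row <- as.data.frame(t(setNames(cells$Freq, cell_names)))

# Add a summary metric alongside the raw counts
one_row$accuracy <- sum(diag(cm)) / sum(cm)
print(one_row)
```

Each model run then produces a single record, so results from many runs can be stacked with `rbind()` and written to a table for tracking over time.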

This stemmed from my role as a Senior Data Scientist working on multiple NHS classification machine learning problems, such as stranded patient classification; LOS banding multiple classification; readmissions into ED and inpatients, etc.

There are four functions in this package: two for dealing with binary and multiple classification, one for performing dummy encoding across a number of categorical labels, and one that visualises a confusion matrix using base R graphics.

To learn how to use the package, see my vignette, or watch the supporting YouTube tutorial:

As I was still buzzing from my ML work, I decided to build a package, with easy wrappers, for feature selection.

FeatureTerminatoR – a feature selection package for ML models

As mentioned, this contains a couple of functions for feature selection, covering statistical methods for selecting features and recursive feature elimination, a technique that looks at how much weight each predictor variable has on the overall model. I need some time to add a few more functions to this, such as elastic net selection.
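The elimination idea can be sketched in a few lines of base R. This is a deliberately simplified version, not what FeatureTerminatoR does: it refits a linear model, drops the predictor with the smallest absolute t-statistic, and repeats. Fuller implementations, such as `rfe()` in the caret package, use resampling to decide how many features to keep. The helper name `eliminate_features` and the simulated data are hypothetical.

```r
# Simulated data: two informative predictors and two pure-noise predictors
set.seed(1)
n <- 200
x <- data.frame(
  x1 = rnorm(n),
  x2 = rnorm(n),
  noise1 = rnorm(n),
  noise2 = rnorm(n)
)
x$y <- 3 * x$x1 - 2 * x$x2 + rnorm(n)

# Hypothetical helper: repeatedly drop the weakest predictor
eliminate_features <- function(data, outcome, keep = 2) {
  predictors <- setdiff(names(data), outcome)
  dropped <- character(0)
  while (length(predictors) > keep) {
    # Refit the model on the predictors that are still in play
    fit <- lm(reformulate(predictors, response = outcome), data = data)
    # Rank predictors by the absolute size of their t-statistic
    t_vals <- abs(coef(summary(fit))[predictors, "t value"])
    weakest <- names(which.min(t_vals))
    dropped <- c(dropped, weakest)
    predictors <- setdiff(predictors, weakest)
  }
  list(selected = predictors, dropped = dropped)
}

result <- eliminate_features(x, outcome = "y", keep = 2)
print(result$selected)
```

Here the number of features to keep is fixed up front, which is the main shortcut compared with a resampling-based approach.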

To learn how to use this tool – see the vignette, or watch the associated YouTube tutorial:

NHSRdatasets – my first collaboration on a package

Then came the NHSRdatasets package, a collaboration between a number of excellent NHS-R community data scientists and developers. This contains datasets, for healthcare staff, for working with dplyr and statistical models.

This is the only package where I don’t have a supporting video to show you, but check out the associated vignettes on CRAN.

NHSDataDictionaRy – a package to retrieve common lookups, including web scraping functionality

This was my first funded package, supported by the NHS-R community. It contains a lot of functions to simplify scraping content off the web, such as a method for extracting data tables and an NHS-specific method that uses the NHS Data Dictionary website to get key lookups. The package also allows you to easily extract text from a website element identified by its XPath. I have a number of tutorials and workshops on this:

The official launch presentation
A workshop showing how this can be used for tasks other than NHS-R data
A session with LondonR on how this package came about and package development tips

The supporting GitHub for the workshop is here.

SangerTools – tools for working with Population Health

This was a collaboration with a friend, and mentee, Asif Laldin – who worked at Gloucestershire Clinical Commissioning Group at the time.

This package has tools for producing visualised rates of disparity across a population, sample data for working with the package and much more. You need to check this one out if you work in Population Health and use R.

Refer to the vignette for details on how to use the package.

MLDataR – datasets for supervised machine learning

This package contains a bunch of datasets for working with supervised machine learning, listed below:

  • Diabetes disease prediction – supervised machine learning classification dataset to enable the prediction of diabetic patients
  • Diabetes onset prediction – supervised machine learning regression dataset to enable prediction of the age at which a pre-diabetic will develop diabetes
  • Heart disease prediction – supervised machine learning classification dataset to enable the prediction of heart disease using a number of key outcome features
  • Long stayers prediction – supervised machine learning classification dataset to enable the prediction of a patient staying in hospital longer than 7 days
  • Thyroid disease prediction – supervised machine learning classification dataset to allow for the prediction of thyroid disease utilising historic patient records
  • Failing Care Home classification – supervised machine learning classification dataset to predict a failing care home by selected Datix incidents
  • Counter Strike Global Offensive – supervised machine learning regression and classification dataset to predict score or match outcome

This is an ‘I need you’ moment, as I would love to have a wider array of datasets for working with supervised machine learning problems in R. I use the Thyroid dataset with tidymodels in the tutorial below:

To contribute to the package – submit a pull request against the package and get started.

What have I learned from package development?

The following are tips I have learned:

  • Read the R Package development book
  • Watch online tutorials
  • Don’t be afraid
  • Expect your package checks to fail the first time you run devtools::check(cran = TRUE)
  • Don’t submit until you have fixed the errors
  • Listen to feedback from the CRAN maintainers
  • Be patient, you will get there
  • Buddy up with someone

I didn’t press a magic button and have everything work first time. I learned through repeated failure, until I got these packages on CRAN. Remember, you don’t have to get a package onto CRAN to make it useful: you can just host it on GitHub. To reach the widest audience, though, it needs to be on CRAN, where anyone can install it with a plain install.packages() call.

Remember – package development is a fun process and you are learning lots of new ways of testing and deploying as you go along, but I recommend finding someone who has done it before if you need advice. I am happy to collaborate with anyone on any new package idea they have – so let me know.

Don’t be like Moss!
