1 Loading the package

To load the package, you can use the below command:

2 Accessing the NHS data links

This function expects no return and is a way to query the NHS Data Dictionary database to get the most recent list of data elements and their associated lookups. The return of this will provide a tibble of all the links currently on the NHS Data Dictionary website:

nhs_tibble <- NHSDataDictionaRy::nhs_data_elements()
print(head(nhs_tibble))

## # A tibble: 6 x 6
##   link_name  url    full_url  xpath_nat_code  xpath_default_co~ xpath_also_known
##   <chr>      <chr>  <chr>     <chr>           <chr>             <chr>           
## 1 ABBREVIAT~ data_~ https://~ "//*[@id=\"ele~ "//*[@id=\"eleme~ "//*[@id=\"elem~
## 2 ABDOMINAL~ data_~ https://~ "//*[@id=\"ele~ "//*[@id=\"eleme~ "//*[@id=\"elem~
## 3 ABDOMINAL~ data_~ https://~ "//*[@id=\"ele~ "//*[@id=\"eleme~ "//*[@id=\"elem~
## 4 ABDOMINAL~ data_~ https://~ "//*[@id=\"ele~ "//*[@id=\"eleme~ "//*[@id=\"elem~
## 5 ABLATIVE ~ data_~ https://~ "//*[@id=\"ele~ "//*[@id=\"eleme~ "//*[@id=\"elem~
## 6 ABNORMALI~ data_~ https://~ "//*[@id=\"ele~ "//*[@id=\"eleme~ "//*[@id=\"elem~

This tibble gives a list of all lookups and their associated xpath codes i.e. a direct link to an HTML element, which is the standard way of extracting HTML DOM content. This is where the other functions in the package become powerful.

3 Get all current hyperlinks from a webpage using linkScrapR

This function can analyse a website and get all the current hyperlinks of a website. This function is used to produce the nhs_data_elements() function, as it calls this function to analyse all the current hyperlinks on the NHS Data Dictionary package, but my example shows an example of scraping the NHSR community website to access the links:

# Analyse all the links on a website
website_url <- "https://nhsrcommunity.com/home/webinars/"
results <- NHSDataDictionaRy::linkScrapeR(website_url)
print(tail(results, 20))

## # A tibble: 20 x 2
##    link_name                    url                                             
##    <chr>                        <chr>                                           
##  1 "R-Bloggers"                 http://r-bloggers.com                           
##  2 "APHA"                       https://www.aphanalysts.org                     
##  3 "APHA Forum"                 https://www.aphanalysts.org/apha-forum/         
##  4 "Improvement Academy"        http://www.improvementacademy.org/observatory/a~
##  5 "R Project Home"             http://www.r-project.org                        
##  6 "R Documentation"            http://www.r-project.org/other-docs.html        
##  7 "R Help Forum"               http://r.789695.n4.nabble.com/R-help-f789696.ht~
##  8 "R Journal"                  http://journal.r-project.org                    
##  9 "R Seek"                     http://www.rseek.org                            
## 10 "Data Science in Gastroente~ http://gastrodatascience.com                    
## 11 "\r\n\t\t\t\t\t\t\t\t"       http://www.health.org.uk/                       
## 12 "\r\n\t\t\t\t\t\t\t\t"       http://www.improvementacademy.org/              
## 13 "\r\n\t\t\t\t\t\t\t\t"       https://improvement.nhs.uk/                     
## 14 ""                           <NA>                                            
## 15 "\r\n\t\t\t\t\t\t\t\t"       https://bradford.ac.uk/                         
## 16 "\r\n\t\t\t\t\t\t\t\t"       http://www.wales.nhs.uk/                        
## 17 "\r\n\t\t\t\t\t\t\t\t"       https://www.strategyunitwm.nhs.uk/              
## 18 "\r\n\t\t\t\t\t\t\t\t"       http://www.aphanalysts.org/                     
## 19 "View our Privacy Policy"    /privacy-policy                                 
## 20 "Website by Verge Design"    https://vergedesign.co.uk

subset_results <- slice(results, 16)
print(subset_results)

## # A tibble: 1 x 2
##   link_name  url                            
##   <chr>      <chr>                          
## 1 NHS-R Blog https://nhsrcommunity.com/blog/

4 Opening a URL from R into a web browser

To navigate to the specific URL you can use the utils::browseURL command:

#This opens the 18th result of the URL
browseURL(subset_results$url)

5 Working with the NHS R Data Dictionary lookup

This package provides functionality for working with the nhs_data_elements extracted from the NHS Data Dictionary website. The two main useful function to extract elements are the tableR function and the xPathTextR function. These can work with the tibble returned to extract useful lookups.

5.1 tableR function (utilising scrapeR function)

The scrapeR function is the workhorse, but the tableR wraps the results of the function in a nice tibble output. This will show you how to utilise the return tibble and to pass the function through the tableR to scrape a tibble to be utilised for lookups:

# Filter by a specific lookup required
reduced_tibble <- 
  dplyr::filter(nhs_tibble, link_name == "ACTIVITY TREATMENT FUNCTION CODE")

#Use the tableR function to query the NHS Data Dictionary website and return the associate tibble

national_codes <- NHSDataDictionaRy::tableR(url=reduced_tibble$full_url,
                          xpath = reduced_tibble$xpath_nat_code, 
                          title = "NHS Hospital Activity Treatment Function National Codes")


# The query has returned results, if the url does not have a lookup table an error will be thrown

print(head(national_codes,10))

## # A tibble: 10 x 4
##    Code  Description              Dict_Type                  DttmExtracted      
##    <chr> <chr>                    <chr>                      <dttm>             
##  1 100   General Surgery Service  NHS Hospital Activity Tre~ 2021-04-19 11:57:49
##  2 101   Urology Service          NHS Hospital Activity Tre~ 2021-04-19 11:57:49
##  3 102   Transplant Surgery Serv~ NHS Hospital Activity Tre~ 2021-04-19 11:57:49
##  4 103   Breast Surgery Service   NHS Hospital Activity Tre~ 2021-04-19 11:57:49
##  5 104   Colorectal Surgery Serv~ NHS Hospital Activity Tre~ 2021-04-19 11:57:49
##  6 105   Hepatobiliary and Pancr~ NHS Hospital Activity Tre~ 2021-04-19 11:57:49
##  7 106   Upper Gastrointestinal ~ NHS Hospital Activity Tre~ 2021-04-19 11:57:49
##  8 107   Vascular Surgery Service NHS Hospital Activity Tre~ 2021-04-19 11:57:49
##  9 108   Spinal Surgery Service   NHS Hospital Activity Tre~ 2021-04-19 11:57:49
## 10 109   Bariatric Surgery Servi~ NHS Hospital Activity Tre~ 2021-04-19 11:57:49

Not all lookups will have associated national code tables, if they are not returned you will receive a message saying the lookup table is not available for this NHS Data Dictionary type.

5.2 Using my lookup with NHS data

There are common lookups that are needed, and this is one such mapping between specialty code, to get the description of the specialty unit description. I will show an example with a made up data frame to illustrate the use case for these lookups and to have up to date lookups:

act_aggregations <- tibble(SpecCode = as.character(c(101,102,103, 104, 105)),
                             ActivityCounts = round(rnorm(5,250,3),0), 
                             Month = rep("May", 5))

# Use dplyr to join the NHS activity by specialty code

act_aggregations %>% 
  left_join(national_codes, by = c("SpecCode"="Code"))

## # A tibble: 5 x 6
##   SpecCode ActivityCounts Month Description    Dict_Type     DttmExtracted      
##   <chr>             <dbl> <chr> <chr>          <chr>         <dttm>             
## 1 101                 250 May   Urology Servi~ NHS Hospital~ 2021-04-19 11:57:49
## 2 102                 252 May   Transplant Su~ NHS Hospital~ 2021-04-19 11:57:49
## 3 103                 250 May   Breast Surger~ NHS Hospital~ 2021-04-19 11:57:49
## 4 104                 251 May   Colorectal Su~ NHS Hospital~ 2021-04-19 11:57:49
## 5 105                 251 May   Hepatobiliary~ NHS Hospital~ 2021-04-19 11:57:49

# This easily joins the lookup on to your data

The benefit of having it in an R package is that you can instantaneously have a lookup of the most relevant and up to date NHS lookups, replacing the need to have a massive data warehouse to capture this information.

5.3 Combine into one function

The routine could be made simpiler by combining into one function:

merged_frameR <- function(filter_param){
  # Filter based on passed parameter
  reduced_tibble <- 
    dplyr::filter(nhs_tibble, link_name == filter_param)
  
  # Use NHSDataDictionaRy to return the codes needed for default and national
  national_codes <- NHSDataDictionaRy::tableR(url=reduced_tibble$full_url,
                           xpath = reduced_tibble$xpath_nat_code, 
                           title = paste(filter_param, " - NATIONAL CODES"))

}


# Use the function we have created to pass the name of the lookup
tail(merged_frameR("ACTIVITY TREATMENT FUNCTION CODE"), 10)

## # A tibble: 10 x 4
##    Code  Description               Dict_Type                 DttmExtracted      
##    <chr> <chr>                     <chr>                     <dttm>             
##  1 661   Music Therapy Service     ACTIVITY TREATMENT FUNCT~ 2021-04-19 11:57:49
##  2 662   Optometry Service         ACTIVITY TREATMENT FUNCT~ 2021-04-19 11:57:49
##  3 663   Podiatric Surgery Service ACTIVITY TREATMENT FUNCT~ 2021-04-19 11:57:49
##  4 670   Urological Physiology Se~ ACTIVITY TREATMENT FUNCT~ 2021-04-19 11:57:49
##  5 673   Vascular Physiology Serv~ ACTIVITY TREATMENT FUNCT~ 2021-04-19 11:57:49
##  6 675   Cardiac Physiology Servi~ ACTIVITY TREATMENT FUNCT~ 2021-04-19 11:57:49
##  7 677   Gastrointestinal Physiol~ ACTIVITY TREATMENT FUNCT~ 2021-04-19 11:57:49
##  8 840   Audiology Service         ACTIVITY TREATMENT FUNCT~ 2021-04-19 11:57:49
##  9 920   Diabetic Education Servi~ ACTIVITY TREATMENT FUNCT~ 2021-04-19 11:57:49
## 10 990   Retired but retained for~ ACTIVITY TREATMENT FUNCT~ 2021-04-19 11:57:49

6 xpathTextR function

This function has been provided to return elements from a website, other than html tables, as these functions predominately work with tables. The below example shows how this can be implemented, but requires the retrieval of the xpath via the Inspect command in Google Chrome (CTRL + SHIFT + I):

url <- "https://datadictionary.nhs.uk/data_elements/abbreviated_mental_test_score.html"
xpath_element <- '//*[@id="element_abbreviated_mental_test_score.description"]'

# Run the xpathTextR function to retrieve details of the element retrieved

result_list <- NHSDataDictionaRy::xpathTextR(url, xpath_element)
print(result_list)

## $result
## [1] "Description\n  \n  \n  \n    \n      ABBREVIATED MENTAL TEST SCORE\n is the \n              PERSON SCORE\n where the \n              ASSESSMENT TOOL TYPE\n is \n              'Abbreviated Mental Test Score'.        \n    The score is in the range 0 to 10.\n  \n\n"
## 
## $website_passed
## [1] "https://datadictionary.nhs.uk/data_elements/abbreviated_mental_test_score.html"
## 
## $xpath_passed
## [1] "//*[@id=\"element_abbreviated_mental_test_score.description\"]"
## 
## $html_node_result
## {html_document}
## <html xmlns="http://www.w3.org/1999/xhtml" xmlns:whc="http://www.oxygenxml.com/webhelp/components" xml:lang="en" lang="en" whc:version="21.1">
## [1] <head>\n<link rel="shortcut icon" href="../oxygen-webhelp%5Ctemplate%5Cre ...
## [2] <body class="wh_topic_page frmBody">\n        <a href="#wh_topic_body" cl ...
## 
## $datetime_access
## [1] "2021-04-19 11:57:49 BST"
## 
## $person_accessed
## [1] "GARYH - LAPTOP-GE3S96EI"

6.1 Cleaning the text example

The example below shows how the text could be cleaned once it is retrieved:

# Use the returned result and do some text processing
clean_text <- trimws(unlist(result_list$result))
clean_text <- clean_text %>% 
  gsub("[\r\n]", "", .) %>% #Remove new line and breaks
  trimws() %>% #Get rid of any white space
  as.character() #Cast to a character vector

print(clean_text)

## [1] "Description                ABBREVIATED MENTAL TEST SCORE is the               PERSON SCORE where the               ASSESSMENT TOOL TYPE is               'Abbreviated Mental Test Score'.            The score is in the range 0 to 10."

6.2 Manipulating the text with Excel like string functions

These functions have been created for users transitioning from Excel to R:

# Use the left_xl function to extract the first 11 characters
left_clean <- NHSDataDictionaRy::left_xl(clean_text, 11)
print(left_clean)

## [1] "Description"

# Use the right string to extract the score range
right_clean <- NHSDataDictionaRy::right_xl(clean_text, 8)
print(right_clean)

## [1] "0 to 10."

#Use the mid function to look in the middle of the string
mid_clean <- trimws(NHSDataDictionaRy::mid_xl(clean_text, 15, 42))
print(mid_clean)

## [1] "ABBREVIATED MENTAL TEST SCORE"

#Get the text length 
full_string <- len_xl(clean_text)
mid_clean_len <- len_xl(mid_clean)
#Print the length of both strings - the original and mid string extracted text
cat(paste0("The original string length was: ", full_string, "\n", "The mid_xl cleaned string length is: ", 
          mid_clean_len))

## The original string length was: 231
## The mid_xl cleaned string length is: 29

7 Working with OpenSafely

The OpenSafely dataset has been added to the data dictionary - courtesy of Callum Polwart. This utilises functionality in the data dictionary and works as a wrapper for the data dictionary.

os_list <- NHSDataDictionaRy::openSafely_listR("opensafely/ace-inhibitor-medications", title="Ace Inhibitor")
glimpse(os_list)

## Rows: 1,096
## Columns: 6
## $ type          <chr> "amp", "amp", "amp", "amp", "amp", "amp", "amp", "amp", ~
## $ id            <chr> "2.191211e+16", "2.192711e+16", "2.998391e+16", "2.19124~
## $ bnf_code      <chr> "0205051AAAAAAAA", "0205051AAAAAAAA", "0205051AAAAAAAA",~
## $ nm            <chr> "Perindopril tosilate 2.5mg tablets (Teva UK Ltd)", "Per~
## $ Dict_Type     <chr> "Ace Inhibitor", "Ace Inhibitor", "Ace Inhibitor", "Ace ~
## $ DttmExtracted <dttm> 2021-04-19 11:57:51, 2021-04-19 11:57:51, 2021-04-19 11~

With this simple wrapper you can pull back the relevant items from the OpenSafely dataset.

8 Wrapping up

More functionality continues to be added to this package, so make sure that you keep up to date with the new developments. CRAN will automatically pull the most recent version down from the website.

This package would not be possible without the funding and support of the NHS-R community. A big thanks to all the team there.

NHSDataDictionaRy package

Gary Hutson - Head of Advanced Analytics

21/04/2021