To load the package, use the command below:
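Assuming the package has already been installed from CRAN, loading it might look like the sketch below. Attaching dplyr and tibble as well is an assumption on my part: the later examples call tibble(), left_join(), glimpse() and the %>% pipe without a namespace prefix.

```r
# One-off install from CRAN (uncomment if the package is not yet installed)
# install.packages("NHSDataDictionaRy")

# Attach the package itself
library(NHSDataDictionaRy)

# Assumed helpers for the later examples: %>%, left_join(), glimpse() and tibble()
library(dplyr)
library(tibble)
```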
The nhs_data_elements() function takes no arguments. It queries the NHS Data Dictionary website for the most recent list of data elements and their associated lookups, and returns a tibble of all the links currently on the website:
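A minimal sketch of the call, assuming it is run with an internet connection and that the result is stored as nhs_tibble, the name the later examples rely on:

```r
# Query the NHS Data Dictionary website (needs an internet connection)
nhs_tibble <- NHSDataDictionaRy::nhs_data_elements()

# Show the first six rows of the returned tibble
print(head(nhs_tibble))
```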
## # A tibble: 6 x 6
## link_name url full_url xpath_nat_code xpath_default_co~ xpath_also_known
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 ABBREVIAT~ data_~ https://~ "//*[@id=\"ele~ "//*[@id=\"eleme~ "//*[@id=\"elem~
## 2 ABDOMINAL~ data_~ https://~ "//*[@id=\"ele~ "//*[@id=\"eleme~ "//*[@id=\"elem~
## 3 ABDOMINAL~ data_~ https://~ "//*[@id=\"ele~ "//*[@id=\"eleme~ "//*[@id=\"elem~
## 4 ABDOMINAL~ data_~ https://~ "//*[@id=\"ele~ "//*[@id=\"eleme~ "//*[@id=\"elem~
## 5 ABLATIVE ~ data_~ https://~ "//*[@id=\"ele~ "//*[@id=\"eleme~ "//*[@id=\"elem~
## 6 ABNORMALI~ data_~ https://~ "//*[@id=\"ele~ "//*[@id=\"eleme~ "//*[@id=\"elem~
This tibble lists all the lookups along with their associated XPath expressions. An XPath is a direct path to an HTML element and is the standard way of extracting content from the HTML DOM. This is where the other functions in the package become powerful.
The linkScrapeR function analyses a website and returns all of its current hyperlinks. The nhs_data_elements() function calls it internally to collect the hyperlinks on the NHS Data Dictionary website, but the example below scrapes the NHS-R Community website instead:
# Analyse all the links on a website
website_url <- "https://nhsrcommunity.com/home/webinars/"
results <- NHSDataDictionaRy::linkScrapeR(website_url)
print(tail(results, 20))
## # A tibble: 20 x 2
## link_name url
## <chr> <chr>
## 1 "R-Bloggers" http://r-bloggers.com
## 2 "APHA" https://www.aphanalysts.org
## 3 "APHA Forum" https://www.aphanalysts.org/apha-forum/
## 4 "Improvement Academy" http://www.improvementacademy.org/observatory/a~
## 5 "R Project Home" http://www.r-project.org
## 6 "R Documentation" http://www.r-project.org/other-docs.html
## 7 "R Help Forum" http://r.789695.n4.nabble.com/R-help-f789696.ht~
## 8 "R Journal" http://journal.r-project.org
## 9 "R Seek" http://www.rseek.org
## 10 "Data Science in Gastroente~ http://gastrodatascience.com
## 11 "\r\n\t\t\t\t\t\t\t\t" http://www.health.org.uk/
## 12 "\r\n\t\t\t\t\t\t\t\t" http://www.improvementacademy.org/
## 13 "\r\n\t\t\t\t\t\t\t\t" https://improvement.nhs.uk/
## 14 "" <NA>
## 15 "\r\n\t\t\t\t\t\t\t\t" https://bradford.ac.uk/
## 16 "\r\n\t\t\t\t\t\t\t\t" http://www.wales.nhs.uk/
## 17 "\r\n\t\t\t\t\t\t\t\t" https://www.strategyunitwm.nhs.uk/
## 18 "\r\n\t\t\t\t\t\t\t\t" http://www.aphanalysts.org/
## 19 "View our Privacy Policy" /privacy-policy
## 20 "Website by Verge Design" https://vergedesign.co.uk
## # A tibble: 1 x 2
## link_name url
## <chr> <chr>
## 1 NHS-R Blog https://nhsrcommunity.com/blog/
To navigate to a specific URL, you can use the utils::browseURL() function:
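A minimal sketch of both steps, filtering down to one link and then opening it. The small data frame is a stand-in built from rows of the output above so that the example runs without scraping; with the real results tibble the same subsetting applies. The names links and blog_link are illustrative only.

```r
# Stand-in for the scraped results: three rows copied from the printed output
links <- data.frame(
  link_name = c("R-Bloggers", "NHS-R Blog", "R Journal"),
  url = c("http://r-bloggers.com",
          "https://nhsrcommunity.com/blog/",
          "http://journal.r-project.org"),
  stringsAsFactors = FALSE
)

# Keep only the row for the link of interest
blog_link <- links[links$link_name == "NHS-R Blog", ]
print(blog_link)

# Open the URL in the default browser, but only in an interactive session
if (interactive()) {
  utils::browseURL(blog_link$url)
}
```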
This package provides functionality for working with the tibble that nhs_data_elements() extracts from the NHS Data Dictionary website. The two most useful functions for extracting elements are tableR and xpathTextR. Both work with the returned tibble to extract useful lookups.
The scrapeR function is the workhorse, while tableR wraps its results in a tidy tibble. The example below shows how to filter the returned tibble and pass the result to tableR to scrape a tibble that can be used for lookups:
# Filter to the specific lookup required
reduced_tibble <-
  dplyr::filter(nhs_tibble, link_name == "ACTIVITY TREATMENT FUNCTION CODE")
# Use the tableR function to query the NHS Data Dictionary website and return the associated tibble
national_codes <- NHSDataDictionaRy::tableR(url = reduced_tibble$full_url,
                                            xpath = reduced_tibble$xpath_nat_code,
                                            title = "NHS Hospital Activity Treatment Function National Codes")
# The query has returned results; if the URL does not have a lookup table an error will be thrown
print(head(national_codes, 10))
## # A tibble: 10 x 4
## Code Description Dict_Type DttmExtracted
## <chr> <chr> <chr> <dttm>
## 1 100 General Surgery Service NHS Hospital Activity Tre~ 2021-04-19 11:57:49
## 2 101 Urology Service NHS Hospital Activity Tre~ 2021-04-19 11:57:49
## 3 102 Transplant Surgery Serv~ NHS Hospital Activity Tre~ 2021-04-19 11:57:49
## 4 103 Breast Surgery Service NHS Hospital Activity Tre~ 2021-04-19 11:57:49
## 5 104 Colorectal Surgery Serv~ NHS Hospital Activity Tre~ 2021-04-19 11:57:49
## 6 105 Hepatobiliary and Pancr~ NHS Hospital Activity Tre~ 2021-04-19 11:57:49
## 7 106 Upper Gastrointestinal ~ NHS Hospital Activity Tre~ 2021-04-19 11:57:49
## 8 107 Vascular Surgery Service NHS Hospital Activity Tre~ 2021-04-19 11:57:49
## 9 108 Spinal Surgery Service NHS Hospital Activity Tre~ 2021-04-19 11:57:49
## 10 109 Bariatric Surgery Servi~ NHS Hospital Activity Tre~ 2021-04-19 11:57:49
Not all lookups have an associated national code table. If none is returned, you will receive a message saying the lookup table is not available for this NHS Data Dictionary type.
Some lookups are needed again and again, and mapping a specialty code to its specialty description is one of them. The example below uses a made-up data frame to illustrate the use case and the value of having up-to-date lookups:
library(dplyr)   # provides %>% and left_join()
library(tibble)  # provides tibble()

act_aggregations <- tibble(SpecCode = as.character(c(101, 102, 103, 104, 105)),
                           ActivityCounts = round(rnorm(5, 250, 3), 0),
                           Month = rep("May", 5))
# Use dplyr to join the NHS activity by specialty code
act_aggregations %>%
  left_join(national_codes, by = c("SpecCode" = "Code"))
## # A tibble: 5 x 6
## SpecCode ActivityCounts Month Description Dict_Type DttmExtracted
## <chr> <dbl> <chr> <chr> <chr> <dttm>
## 1 101 250 May Urology Servi~ NHS Hospital~ 2021-04-19 11:57:49
## 2 102 252 May Transplant Su~ NHS Hospital~ 2021-04-19 11:57:49
## 3 103 250 May Breast Surger~ NHS Hospital~ 2021-04-19 11:57:49
## 4 104 251 May Colorectal Su~ NHS Hospital~ 2021-04-19 11:57:49
## 5 105 251 May Hepatobiliary~ NHS Hospital~ 2021-04-19 11:57:49
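Because a left join keeps every row of the activity data, any specialty code that is missing from the lookup ends up with an NA description. A small sketch of checking for that, using base R merge() rather than dplyr, a hand-typed stand-in for national_codes (three rows copied from the output above) and a made-up code 999:

```r
# Stand-in for the scraped lookup: three rows copied from the national_codes output
lookup <- data.frame(
  Code = c("100", "101", "103"),
  Description = c("General Surgery Service", "Urology Service", "Breast Surgery Service"),
  stringsAsFactors = FALSE
)

# Made-up activity data; code 999 is deliberately absent from the lookup
activity <- data.frame(
  SpecCode = c("101", "103", "999"),
  ActivityCounts = c(250, 252, 249),
  stringsAsFactors = FALSE
)

# all.x = TRUE keeps every activity row, mirroring what left_join() does
joined <- merge(activity, lookup, by.x = "SpecCode", by.y = "Code", all.x = TRUE)

# Rows without a description point to codes that are missing from the lookup
unmatched <- joined[is.na(joined$Description), ]
print(unmatched$SpecCode)
```

Running a check like this after each refresh of the lookup is one way to spot codes that have been retired or newly introduced.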
The benefit of having it in an R package is that you can instantaneously have a lookup of the most relevant and up to date NHS lookups, replacing the need to have a massive data warehouse to capture this information.
The routine could be made simpler by combining the steps into one function:
merged_frameR <- function(filter_param) {
  # Filter based on the passed parameter
  reduced_tibble <-
    dplyr::filter(nhs_tibble, link_name == filter_param)
  # Use NHSDataDictionaRy to return the national codes for the lookup
  national_codes <- NHSDataDictionaRy::tableR(url = reduced_tibble$full_url,
                                              xpath = reduced_tibble$xpath_nat_code,
                                              title = paste(filter_param, "- NATIONAL CODES"))
  # Return the tibble explicitly rather than relying on the invisible value of the assignment
  national_codes
}
# Use the function we have created to pass the name of the lookup
tail(merged_frameR("ACTIVITY TREATMENT FUNCTION CODE"), 10)
## # A tibble: 10 x 4
## Code Description Dict_Type DttmExtracted
## <chr> <chr> <chr> <dttm>
## 1 661 Music Therapy Service ACTIVITY TREATMENT FUNCT~ 2021-04-19 11:57:49
## 2 662 Optometry Service ACTIVITY TREATMENT FUNCT~ 2021-04-19 11:57:49
## 3 663 Podiatric Surgery Service ACTIVITY TREATMENT FUNCT~ 2021-04-19 11:57:49
## 4 670 Urological Physiology Se~ ACTIVITY TREATMENT FUNCT~ 2021-04-19 11:57:49
## 5 673 Vascular Physiology Serv~ ACTIVITY TREATMENT FUNCT~ 2021-04-19 11:57:49
## 6 675 Cardiac Physiology Servi~ ACTIVITY TREATMENT FUNCT~ 2021-04-19 11:57:49
## 7 677 Gastrointestinal Physiol~ ACTIVITY TREATMENT FUNCT~ 2021-04-19 11:57:49
## 8 840 Audiology Service ACTIVITY TREATMENT FUNCT~ 2021-04-19 11:57:49
## 9 920 Diabetic Education Servi~ ACTIVITY TREATMENT FUNCT~ 2021-04-19 11:57:49
## 10 990 Retired but retained for~ ACTIVITY TREATMENT FUNCT~ 2021-04-19 11:57:49
The xpathTextR function is provided to return elements from a website other than HTML tables, since the functions above work predominantly with tables. The example below shows how it can be used; it requires retrieving the XPath of the element via the Inspect command in Google Chrome (CTRL + SHIFT + I):
url <- "https://datadictionary.nhs.uk/data_elements/abbreviated_mental_test_score.html"
xpath_element <- '//*[@id="element_abbreviated_mental_test_score.description"]'
# Run the xpathTextR function to retrieve details of the element retrieved
result_list <- NHSDataDictionaRy::xpathTextR(url, xpath_element)
print(result_list)
## $result
## [1] "Description\n \n \n \n \n ABBREVIATED MENTAL TEST SCORE\n is the \n PERSON SCORE\n where the \n ASSESSMENT TOOL TYPE\n is \n 'Abbreviated Mental Test Score'. \n The score is in the range 0 to 10.\n \n\n"
##
## $website_passed
## [1] "https://datadictionary.nhs.uk/data_elements/abbreviated_mental_test_score.html"
##
## $xpath_passed
## [1] "//*[@id=\"element_abbreviated_mental_test_score.description\"]"
##
## $html_node_result
## {html_document}
## <html xmlns="http://www.w3.org/1999/xhtml" xmlns:whc="http://www.oxygenxml.com/webhelp/components" xml:lang="en" lang="en" whc:version="21.1">
## [1] <head>\n<link rel="shortcut icon" href="../oxygen-webhelp%5Ctemplate%5Cre ...
## [2] <body class="wh_topic_page frmBody">\n <a href="#wh_topic_body" cl ...
##
## $datetime_access
## [1] "2021-04-19 11:57:49 BST"
##
## $person_accessed
## [1] "GARYH - LAPTOP-GE3S96EI"
The example below shows how the text could be cleaned once it is retrieved:
# Use the returned result and do some text processing
clean_text <- trimws(unlist(result_list$result))
clean_text <- clean_text %>%
  gsub("[\r\n]", "", .) %>% # Remove carriage returns and line breaks
  trimws() %>%              # Trim leading and trailing white space
  as.character()            # Cast to a character vector
print(clean_text)
## [1] "Description ABBREVIATED MENTAL TEST SCORE is the PERSON SCORE where the ASSESSMENT TOOL TYPE is 'Abbreviated Mental Test Score'. The score is in the range 0 to 10."
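Removing the line breaks leaves any runs of spaces between them in place. If a single space between words is wanted, one option is to collapse all whitespace in one pass. A minimal sketch on a made-up fragment shaped like the scraped description (raw_text is a stand-in, not the real return value):

```r
# Made-up fragment imitating the line breaks and padding in the scraped text
raw_text <- "Description\n   \n   ABBREVIATED MENTAL TEST SCORE\n   is the \n   PERSON SCORE"

# Replace every run of whitespace (spaces, tabs, line breaks) with one space,
# then trim the ends
tidy_text <- trimws(gsub("\\s+", " ", raw_text))
print(tidy_text)
```

Character positions shift once whitespace is collapsed, so any fixed start and length values used to slice the string afterwards would need adjusting.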
These functions have been created for users transitioning from Excel to R:
# Use the left_xl function to extract the first 11 characters
left_clean <- NHSDataDictionaRy::left_xl(clean_text, 11)
print(left_clean)
## [1] "Description"
# Use the right string to extract the score range
right_clean <- NHSDataDictionaRy::right_xl(clean_text, 8)
print(right_clean)
## [1] "0 to 10."
#Use the mid function to look in the middle of the string
mid_clean <- trimws(NHSDataDictionaRy::mid_xl(clean_text, 15, 42))
print(mid_clean)
## [1] "ABBREVIATED MENTAL TEST SCORE"
#Get the text length
full_string <- NHSDataDictionaRy::len_xl(clean_text)
mid_clean_len <- NHSDataDictionaRy::len_xl(mid_clean)
#Print the length of both strings - the original and mid string extracted text
cat(paste0("The original string length was: ", full_string, "\n", "The mid_xl cleaned string length is: ",
mid_clean_len))
## The original string length was: 231
## The mid_xl cleaned string length is: 29
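For comparison, these helpers map closely onto base R's substr() and nchar(). The sketch below assumes that left_xl, right_xl and mid_xl follow the semantics of Excel's LEFT, RIGHT and MID (with mid taking a start position and a number of characters). The *_base functions are illustrative names of my own, not part of the package:

```r
# First num_char characters, like Excel's LEFT
left_base <- function(text, num_char) {
  substr(text, 1, num_char)
}

# Last num_char characters, like Excel's RIGHT
right_base <- function(text, num_char) {
  n <- nchar(text)
  substr(text, n - num_char + 1, n)
}

# num_char characters starting at start_num, like Excel's MID
mid_base <- function(text, start_num, num_char) {
  substr(text, start_num, start_num + num_char - 1)
}

example_text <- "Description ABBREVIATED"
left_base(example_text, 11)                           # "Description"
right_base("The score is in the range 0 to 10.", 8)   # "0 to 10."
mid_base(example_text, 13, 11)                        # "ABBREVIATED"
nchar(example_text)                                   # 23
```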
Access to the OpenSafely dataset has been added to the package, courtesy of Callum Polwart. The openSafely_listR function builds on the package's existing functionality and works as a wrapper around it.
os_list <- NHSDataDictionaRy::openSafely_listR("opensafely/ace-inhibitor-medications", title = "Ace Inhibitor")
dplyr::glimpse(os_list)
## Rows: 1,096
## Columns: 6
## $ type <chr> "amp", "amp", "amp", "amp", "amp", "amp", "amp", "amp", ~
## $ id <chr> "2.191211e+16", "2.192711e+16", "2.998391e+16", "2.19124~
## $ bnf_code <chr> "0205051AAAAAAAA", "0205051AAAAAAAA", "0205051AAAAAAAA",~
## $ nm <chr> "Perindopril tosilate 2.5mg tablets (Teva UK Ltd)", "Per~
## $ Dict_Type <chr> "Ace Inhibitor", "Ace Inhibitor", "Ace Inhibitor", "Ace ~
## $ DttmExtracted <dttm> 2021-04-19 11:57:51, 2021-04-19 11:57:51, 2021-04-19 11~
With this simple wrapper you can pull back the relevant items from the OpenSafely dataset.
More functionality continues to be added to this package, so make sure that you keep up to date with new developments. Reinstalling the package from CRAN will pull down the most recent released version.
This package would not be possible without the funding and support of the NHS-R community. A big thanks to all the team there.