This vignette introduces doseminer, an R package for parsing English free-text prescriptions into a structured format.
Intended end users are researchers in pharmacoepidemiology, especially those using data provided by the Clinical Practice Research Datalink (CPRD), a source of anonymised electronic health data in the United Kingdom.
Electronic prescribing records typically include some structured data, such as the total quantity prescribed, some sort of product code to identify the drug, and the start date of the prescription.
However, the daily dosage may only be available as dispensing instructions in English (and Latin) free text, and sometimes the length of the prescription period can only be inferred from the estimated number of days needed to use up the prescribed total at a particular dose per day.
Pharmacoepidemiologists may wish to estimate the ‘exposure window’ of a particular drug (i.e. the time that the patient was taking it) to see if there is an association between taking the drug and experiencing adverse events: one example being the relationship between opioid usage and fractures.
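To make the duration calculation concrete, here is a minimal sketch in base R. The quantities and the start date are invented for illustration, and the helper name estimate_duration_days is hypothetical; it is not part of doseminer.

```r
# Hypothetical helper (not a doseminer function): number of days needed to
# use up a prescribed total at a fixed dose and daily frequency.
estimate_duration_days <- function(total_qty, dose, freq_per_day) {
  total_qty / (dose * freq_per_day)
}

# Invented example: 56 tablets, 2 tablets per dose, 4 doses per day.
duration <- estimate_duration_days(total_qty = 56, dose = 2, freq_per_day = 4)
duration
#> [1] 7

# Exposure window implied by an invented start date (inclusive of day 1).
start_date <- as.Date("2020-01-01")
end_date <- start_date + duration - 1
end_date
#> [1] "2020-01-07"
```

In practice the dose and frequency would come from the parsed free text, and the total quantity and start date from the structured part of the record.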
Data should be provided as a character (string) vector. Suppose we have the following dosage instructions (taken from a box of supermarket ibuprofen).
p1 <- 'Take 1 or 2 tablets up to 3 times a day, as required'
Load the package and use the main workhorse function, extract_from_prescription, to parse the data into a structured data frame.
library(doseminer)
extract_from_prescription(p1)
#>                                                    raw freq itvl dose unit
#> 1 Take 1 or 2 tablets up to 3 times a day, as required  0-3    1  1-2  tab
#>   optional
#> 1        1
The output includes six columns:
raw: the input string. Useful for linking with other data structures
dose: the number of units of drug to administer at once
unit: the units of dose (may be unspecified)
freq: the number of times per day that the dose should be administered
itvl: the number of days between ‘dose days’; if every day, then 1
optional: an indicator; can the dose be zero? If so, 1, else 0.
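Given these definitions, one way to combine the columns is an average daily dose, computed as dose × freq ÷ itvl. The sketch below is an illustration under assumptions, not a doseminer feature: it uses a hand-built data frame whose values copy three rows of the example output shown later in this vignette, and the column name daily_dose is invented. It only applies to rows holding single values; ranges such as "1-2" would have to be resolved first.

```r
# Hand-built stand-in for parsed output, so the example runs without doseminer.
# Values are stored as strings, as in the package output.
parsed <- data.frame(
  raw  = c("take two every three days",
           "five every week",
           "one and a half tablets every three hours"),
  freq = c("1", "1", "8"),
  itvl = c("3", "7", "1"),
  dose = c("2", "5", "1.5"),
  stringsAsFactors = FALSE
)

# Average units per day = units per dose * doses per dose-day / days between dose-days.
parsed$daily_dose <- with(
  parsed,
  as.numeric(dose) * as.numeric(freq) / as.numeric(itvl)
)

# Gives 2/3, 5/7 and 12 units per day respectively.
parsed[, c("raw", "daily_dose")]
```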
The function is vectorised: given a character vector, it should parse every element and return the results in the same order as the input.
p2 <- c('Take 1 or 2 tablets up to 3 times a day, as required',
        'Swallow 1 or 2 capsules with water, up to three times a day as required.',
        'Two to four 5ml spoonfuls up to 4 times a day')
extract_from_prescription(p2)
#>                                                                        raw freq
#> 1                     Take 1 or 2 tablets up to 3 times a day, as required  0-3
#> 2 Swallow 1 or 2 capsules with water, up to three times a day as required.  0-3
#> 3                            Two to four 5ml spoonfuls up to 4 times a day  0-4
#>   itvl  dose        unit optional
#> 1    1   1-2         tab        1
#> 2    1  <NA>         cap        1
#> 3    1 10-20 ml spoonful        1
Names of units, such as millilitre spoonfuls, tablets (tabs) and capsules (caps), are standardised. Multiplicative doses are evaluated: for example, “two 5ml spoonfuls” gives a dose of 10 with unit “ml spoonful”. Ranges of values are reported as “min-max” strings. It is left to the user to decide how to handle such interval data: for example, to parse it as a vector, split it into columns, compute an average or leave it as a string.
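As one possible way of handling these range strings in base R, the following sketch splits each value on the hyphen and returns the lower bound, upper bound and midpoint. The helper name range_bounds is hypothetical and not part of doseminer, and the sketch assumes values are non-negative, so that a hyphen only ever acts as a range separator.

```r
# Hypothetical helper: convert "min-max" strings to numeric bounds and a midpoint.
# Single values such as "2.5" are treated as a range with min == max.
range_bounds <- function(x) {
  parts <- strsplit(as.character(x), "-", fixed = TRUE)
  lo <- vapply(parts, function(p) as.numeric(p[1]), numeric(1))
  hi <- vapply(parts, function(p) as.numeric(p[length(p)]), numeric(1))
  data.frame(min = lo, max = hi, mid = (lo + hi) / 2)
}

# Missing values stay missing in all three columns.
bounds <- range_bounds(c("1-2", "10-20", "2.5", NA))
bounds
```

Whether the midpoint, the minimum or the maximum is the right summary depends on the research question, so the choice is deliberately left explicit here.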
If you do want to divide up such columns then the functions separate() and separate_rows() from the tidyr package can come in useful:
library(tidyr)
extract_from_prescription(p1) %>%
  separate_rows(freq:dose, convert = TRUE)
#> # A tibble: 2 × 6
#>   raw                                            freq  itvl  dose unit  optional
#>   <chr>                                         <int> <int> <int> <chr>    <int>
#> 1 Take 1 or 2 tablets up to 3 times a day, as …     0     1     1 tab          1
#> 2 Take 1 or 2 tablets up to 3 times a day, as …     3     1     2 tab          1
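separate_rows() yields one row per bound. If one row per prescription is preferred, separate() can split a range into lower and upper columns instead. This is a minimal sketch using a hand-built data frame in place of real doseminer output; the column names dose_min and dose_max are assumptions made for the example.

```r
library(tidyr)

# Hand-built stand-in for parsed output (values copied from this vignette).
parsed <- data.frame(
  raw  = c("Take 1 or 2 tablets up to 3 times a day, as required",
           "1 tablet to be taken daily"),
  freq = c("0-3", "1"),
  dose = c("1-2", "1"),
  stringsAsFactors = FALSE
)

# Split the dose range into two columns, keeping one row per prescription.
# fill = "right" pads single values with NA rather than issuing a warning.
wide <- separate(parsed, dose, into = c("dose_min", "dose_max"),
                 sep = "-", fill = "right", convert = TRUE)

# Treat a single value as a degenerate range (min equal to max).
wide$dose_max <- ifelse(is.na(wide$dose_max), wide$dose_min, wide$dose_max)
wide[, c("dose_min", "dose_max")]
```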
Here are some more example prescriptions and their output.
extract_from_prescription(example_prescriptions)
#> raw freq itvl dose
#> 1 1 tablet to be taken daily 1 1 1
#> 2 2.5ml four times a day when required 4 1 2.5
#> 3 1.25mls three times a day 3 1 1.25
#> 4 take 10mls q.d.s. p.r.n. 1 1 10
#> 5 take 1 or 2 4 times/day 4 1 1-2
#> 6 2x5ml spoon 4 times/day 4 1 10
#> 7 take 2 tablets every six hours max eight in twenty four hours 4 1 2
#> 8 1 tab nocte twenty eight tablets 1 1 1
#> 9 1-2 four times a day when required 4 1 1-2
#> 10 take one twice daily 2 1 1
#> 11 1 q4h prn 6 1 1
#> 12 take two every three days 1 3 2
#> 13 five every week 1 7 5
#> 14 every 72 hours 1 3 <NA>
#> 15 1 x 5 ml spoon 4 / day for 10 days 4 1 5
#> 16 two to three times a day 2-3 1 <NA>
#> 17 three times a week 1 2-3 <NA>
#> 18 three 5ml spoonsful to be taken four times a day after food 4 1 15
#> 19 take one or two every 4-6 hrs 4-6 1 1-2
#> 20 5ml 3 hrly when required 8 1 5
#> 21 one every morning to reduce bp 1 1 1
#> 22 take 1 or 2 6hrly when required 4 1 1-2
#> 23 take 1 or 2 four times a day as required for pain 4 1 1-2
#> 24 take 1 or 2 4 times/day if needed for pain 4 1 1-2
#> 25 1-2 tablets up to four times daily 0-4 1 1-2
#> 26 take one or two tablets 6-8 hrly every 2-3 days 3-4 2-3 1-2
#> 27 one and a half tablets every three hours 8 1 1.5
#> unit optional
#> 1 tab 0
#> 2 ml 1
#> 3 ml 0
#> 4 ml 1
#> 5 <NA> 0
#> 6 ml spoonful 0
#> 7 tab 0
#> 8 tab 0
#> 9 <NA> 1
#> 10 <NA> 0
#> 11 <NA> 1
#> 12 <NA> 0
#> 13 <NA> 0
#> 14 <NA> 0
#> 15 ml spoonful 0
#> 16 <NA> 0
#> 17 <NA> 0
#> 18 ml spoonful 0
#> 19 <NA> 0
#> 20 ml 1
#> 21 <NA> 0
#> 22 <NA> 1
#> 23 <NA> 1
#> 24 <NA> 1
#> 25 tab 1
#> 26 tab 0
#> 27 tab 0
While the package tries to make reasonable inferences about missing data (it’s usually fair to assume a dose interval is daily, if not otherwise specified), some variables, especially units, will be returned as NA if there are no clues in the input text.
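Rows with missing values may deserve a manual check before analysis. The following base R sketch flags them; it uses a hand-built data frame whose values copy three rows from the output above, and the column name needs_review is invented for the example.

```r
# Hand-built stand-in for parsed output, so the example runs without doseminer.
parsed <- data.frame(
  raw  = c("1 tablet to be taken daily",
           "take 1 or 2 4 times/day",
           "every 72 hours"),
  dose = c("1", "1-2", NA),
  unit = c("tab", NA, NA),
  stringsAsFactors = FALSE
)

# Flag rows where the dose or the unit could not be extracted.
parsed$needs_review <- is.na(parsed$dose) | is.na(parsed$unit)
parsed[parsed$needs_review, "raw"]
#> [1] "take 1 or 2 4 times/day" "every 72 hours"
```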
In order to parse the prescriptions, the package uses several utilities that have more general applications. Chief among these is an English number parser. This can turn individual names of numbers into numeric values:
words2number(c('one', 'two', 'three', 'forty two', 'one million'))
#>         one         two       three   forty two one million
#>           1           2           3          42     1000000
And it can find and replace such sequences within sentences:
replace_numbers(c('I have three apples',
                  'The answer is forty two',
                  'Take one and a half tablets'))
#> [1] "I have 3 apples" "The answer is 42" "Take 1.5 tablets"
However, the parser does not handle every case. Decimals and fractions other than halves are a particular weakness: their digits tend simply to be added together.
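Additive behaviour of this kind is what one would expect from a simple accumulate-as-you-go number parser. The toy function below sketches that general technique only. It is not the doseminer implementation, and the name toy_words2number and its lookup tables are hypothetical.

```r
# Toy English number parser, for illustration only (NOT doseminer's code).
small <- c(zero = 0, one = 1, two = 2, three = 3, four = 4, five = 5,
           six = 6, seven = 7, eight = 8, nine = 9, ten = 10,
           eleven = 11, twelve = 12, thirteen = 13, fourteen = 14,
           fifteen = 15, sixteen = 16, seventeen = 17, eighteen = 18,
           nineteen = 19, twenty = 20, thirty = 30, forty = 40,
           fifty = 50, sixty = 60, seventy = 70, eighty = 80, ninety = 90)
scales <- c(thousand = 1e3, million = 1e6)

toy_words2number <- function(phrase) {
  words <- strsplit(tolower(phrase), "[ -]+")[[1]]
  total <- 0    # completed groups (thousands, millions)
  current <- 0  # group currently being accumulated
  for (w in words) {
    if (w %in% names(small)) {
      current <- current + small[[w]]         # plain words are summed
    } else if (w == "hundred") {
      current <- current * 100
    } else if (w %in% names(scales)) {
      total <- total + current * scales[[w]]  # close the current group
      current <- 0
    } else if (w != "and") {
      return(NA_real_)                        # unknown word: give up
    }
  }
  total + current
}

toy_words2number("forty two")               # 42
toy_words2number("three hundred and five")  # 305
toy_words2number("two five")                # 7: the digits are added, not read as 2.5 or 25
```

The last call shows how a sequence the parser does not understand can silently collapse into a sum, which is why parsed decimals and fractions are worth checking by hand.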
Like R itself, the package doseminer is offered with no warranty. Use it at your own risk. If you are interested in helping improve the performance and features of the doseminer package, then please file issues and submit pull requests on GitHub.