There are several “helper” functions which can simplify the definition of complex patterns. First we define some functions that will help us display the patterns:
one.pattern <- function(pat){
  if(is.character(pat)){
    pat
  }else{
    nc::var_args_list(pat)[["pattern"]]
  }
}
show.patterns <- function(...){
  L <- list(...)
  str(lapply(L, one.pattern))
}
The nc::field function can be used to avoid repetition when defining patterns of the form variable: value. The example below shows three (mostly) equivalent ways to write a regex that captures the text after the colon and space; the captured text is stored in the variable group or output column:
show.patterns(
  "variable: (?<variable>.*)",      #repetitive regex string
  list("variable: ", variable=".*"),#repetitive nc R code
  nc::field("variable", ": ", ".*"))#helper function avoids repetition
#> List of 3
#> $ : chr "variable: (?<variable>.*)"
#> $ : chr "(?:variable: (.*))"
#> $ : chr "(?:variable: (?:(.*)))"
Note that the first version above has a named capture group, whereas the second and third patterns generated by nc have an un-named capture group and some non-capturing groups (but they all match the same pattern).
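The claim that all three patterns match the same text can be checked with base R alone. The sketch below is not part of nc: the pattern strings are copied from the output above, and the subject string is a hypothetical example chosen for illustration.

```r
# Pattern strings copied from the show.patterns output above.
patterns <- c(
  "variable: (?<variable>.*)",
  "(?:variable: (.*))",
  "(?:variable: (?:(.*)))")
subject <- "variable: value" # hypothetical subject, for illustration only
# regexec(perl=TRUE) reports the whole match first, then each capture group.
captured <- vapply(patterns, function(p){
  m <- regmatches(subject, regexec(p, subject, perl=TRUE))[[1]]
  m[[2]] # text matched by the first (and only) capture group
}, character(1), USE.NAMES=FALSE)
print(captured)
```

Each pattern has exactly one capture group, so the second element of each match is the text after the colon and space, whether or not the group is named.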
Another example:
show.patterns(
  "Alignment (?<Alignment>[0-9]+)",
  list("Alignment ", Alignment="[0-9]+"),
  nc::field("Alignment", " ", "[0-9]+"))
#> List of 3
#> $ : chr "Alignment (?<Alignment>[0-9]+)"
#> $ : chr "(?:Alignment ([0-9]+))"
#> $ : chr "(?:Alignment (?:([0-9]+)))"
Another example:
show.patterns(
  "Chromosome:\t+(?<Chromosome>.*)",
  list("Chromosome:\t+", Chromosome=".*"),
  nc::field("Chromosome", ":\t+", ".*"))
#> List of 3
#> $ : chr "Chromosome:\t+(?<Chromosome>.*)"
#> $ : chr "(?:Chromosome:\t+(.*))"
#> $ : chr "(?:Chromosome:\t+(?:(.*)))"
Another helper function is nc::quantifier, which makes patterns easier to read by reducing the number of parentheses required to define sub-patterns with quantifiers. For example, all three patterns below create an optional non-capturing group which contains a named capture group:
show.patterns(
  "(?:-(?<chromEnd>[0-9]+))?",               #regex string
  list(list("-", chromEnd="[0-9]+"), "?"),   #nc pattern using lists
  nc::quantifier("-", chromEnd="[0-9]+", "?"))#quantifier helper function
#> List of 3
#> $ : chr "(?:-(?<chromEnd>[0-9]+))?"
#> $ : chr "(?:(?:-([0-9]+))?)"
#> $ : chr "(?:(?:-([0-9]+))?)"
Another example with a named capture group inside an optional non-capturing group:
show.patterns(
  "(?: (?<name>[^,}]+))?",
  list(list(" ", name="[^,}]+"), "?"),
  nc::quantifier(" ", name="[^,}]+", "?"))
#> List of 3
#> $ : chr "(?: (?<name>[^,}]+))?"
#> $ : chr "(?:(?: ([^,}]+))?)"
#> $ : chr "(?:(?: ([^,}]+))?)"
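To see what the optional group means in practice, the base R sketch below embeds the nc-generated chromEnd pattern from the earlier quantifier example in a larger, anchored regex. The leading `[0-9]+`, the anchors, and the subject strings are assumptions added for illustration; they do not come from nc.

```r
# Optional group text copied from the nc-generated quantifier output,
# wrapped in "^[0-9]+" and "$" (assumed here) so the whole subject must match.
optional.end <- "^[0-9]+(?:(?:-([0-9]+))?)$"
subjects <- c("100-200", "100", "100-") # hypothetical subjects
is.match <- grepl(optional.end, subjects, perl=TRUE)
print(is.match)
```

The first two subjects match because the dash and digits are optional as a unit; the third does not, because a dash with no digits after it cannot satisfy the group.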
We also provide a helper function, nc::alternatives, for defining regex patterns with alternation. The following three patterns are equivalent.
show.patterns(
  "(?:(?<first>bar+)|(?<second>fo+))",
  list(first="bar+", "|", second="fo+"),
  nc::alternatives(first="bar+", second="fo+"))
#> List of 3
#> $ : chr "(?:(?<first>bar+)|(?<second>fo+))"
#> $ : chr "(?:(bar+)|(fo+))"
#> $ : chr "(?:(bar+)|(fo+))"
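As a quick check of the alternation, the base R sketch below applies the nc-generated pattern string to a few hypothetical subjects (the subjects are made up for illustration and are not from the nc documentation).

```r
alternation <- "(?:(bar+)|(fo+))" # copied from the output above
subjects <- c("barrr", "foooo", "baz") # hypothetical subjects
# regmatches() with regexpr() keeps only the subjects that matched.
matched <- regmatches(subjects, regexpr(alternation, subjects, perl=TRUE))
print(matched)
```

Only subjects matching one of the two alternatives are returned, so "baz" is dropped.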
nc::altlist for named alternatives

For most use cases, nc::alternatives_with_shared_groups is sufficient, but one case where it does not work is when you want to name each alternative (for example, to easily count how many matches there were to each alternative). In that case, you can instead use nc::altlist, as in the code below.
shared.groups <- nc::altlist(
  month="[a-z]{3}",
  day=list("[0-9]{2}", as.integer),
  year=list("[0-9]{4}", as.integer))
alt.args <- with(shared.groups, list(
  american=list(month, " ", day, ", ", year),
  european=list(day, " ", month, " ", year)))
pattern <- do.call(nc::alternatives, alt.args)
(match.dt <- nc::capture_first_vec(subject.vec, pattern))
#>        american  month   day  year    european
#>          <char> <char> <int> <int>      <char>
#> 1: mar 17, 1983    mar    17  1983
#> 2:                 sep    26  2017 26 sep 2017
#> 3:                 mar    17  1984 17 mar 1984
match.dt[, lapply(.SD, function(x)sum(x!="")), .SDcols=names(alt.args)]
#>    american european
#>       <int>    <int>
#> 1:        1        2
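The counts above can be cross-checked without nc by testing each date format separately in base R. In this sketch the subject vector is an assumption reconstructed from the american and european columns of the printed table, and the two regex strings are written by hand from the group definitions; neither is taken from nc itself.

```r
# Assumed subjects, reconstructed from the table printed above.
subject.vec <- c("mar 17, 1983", "26 sep 2017", "17 mar 1984")
# Hand-written equivalents of the two alternatives, without capture groups.
american <- "[a-z]{3} [0-9]{2}, [0-9]{4}"
european <- "[0-9]{2} [a-z]{3} [0-9]{4}"
counts <- c(
  american=sum(grepl(american, subject.vec, perl=TRUE)),
  european=sum(grepl(european, subject.vec, perl=TRUE)))
print(counts)
```

This gives the same per-alternative counts, but it needs one pass per alternative and does not extract month, day, or year; the named alternatives built with nc::altlist provide both in a single match.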