The nc package provides several “helper” functions that simplify the
definition of complex patterns. The examples below display patterns with
show.patterns, a small function that prints the regex string generated
from each of its arguments.
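A minimal sketch of such display functions, assuming that nc::var_args_list returns a list whose pattern element holds the generated regex string (the helper name one.pattern is an assumption, not taken from the text above):

```r
# Sketch only: assumes nc::var_args_list(...) returns a list whose
# "pattern" element is the regex string generated by nc.
one.pattern <- function(pat){
  if(is.character(pat)){
    pat # a plain regex string is shown unchanged
  }else{
    nc::var_args_list(pat)[["pattern"]]
  }
}
# Print the regex string for each argument, one list element per pattern.
show.patterns <- function(...){
  str(lapply(list(...), one.pattern))
}
```

With these definitions, each argument can be either a regex string or an nc pattern (a list or the result of a helper function).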
nc::field for reducing repetition
The nc::field function can be used to avoid repetition
when defining patterns of the form variable: value. The
example below shows three (mostly) equivalent ways to write a regex that
captures the text after the colon and space; the captured text is stored
in the capture group (or output column) named variable:
show.patterns(
  "variable: (?<variable>.*)",      #repetitive regex string
  list("variable: ", variable=".*"),#repetitive nc R code
  nc::field("variable", ": ", ".*"))#helper function avoids repetition
#> List of 3
#>  $ : chr "variable: (?<variable>.*)"
#>  $ : chr "(?:variable: (.*))"
#>  $ : chr "(?:variable: (?:(.*)))"
Note that the first version above has a named capture group, whereas
the second and third patterns, generated by nc, have an un-named capture
group and some non-capturing groups. All three match the same subject
text.
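That the three regex strings capture the same text can be checked without nc. The sketch below uses base R regexec with perl=TRUE (assuming an R version in which regexec accepts that argument); the subject string and the helper name captured.text are made up for illustration.

```r
# Regex strings copied from the output above.
regex.vec <- c(
  "variable: (?<variable>.*)",
  "(?:variable: (.*))",
  "(?:variable: (?:(.*)))")
subject <- "variable: some value" # hypothetical subject
captured.text <- function(regex){
  match.list <- regmatches(subject, regexec(regex, subject, perl=TRUE))
  # Element 1 is the full match, element 2 is the first capture group.
  match.list[[1]][[2]]
}
captured <- vapply(regex.vec, captured.text, character(1), USE.NAMES=FALSE)
print(captured)
```

Each regex has exactly one capture group, so the second element of the match is the text after the colon and space in all three cases.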
Another example:
show.patterns(
  "Alignment (?<Alignment>[0-9]+)",
  list("Alignment ", Alignment="[0-9]+"),
  nc::field("Alignment", " ", "[0-9]+"))
#> List of 3
#>  $ : chr "Alignment (?<Alignment>[0-9]+)"
#>  $ : chr "(?:Alignment ([0-9]+))"
#>  $ : chr "(?:Alignment (?:([0-9]+)))"
Another example:
show.patterns(
  "Chromosome:\t+(?<Chromosome>.*)",
  list("Chromosome:\t+", Chromosome=".*"),
  nc::field("Chromosome", ":\t+", ".*"))
#> List of 3
#>  $ : chr "Chromosome:\t+(?<Chromosome>.*)"
#>  $ : chr "(?:Chromosome:\t+(.*))"
#>  $ : chr "(?:Chromosome:\t+(?:(.*)))"
 
nc::quantifier for fewer parentheses
Another helper function is nc::quantifier, which makes
patterns easier to read by reducing the number of parentheses required
to define sub-patterns with quantifiers. For example, all three patterns
below create an optional non-capturing group which contains a
capture group for chromEnd:
show.patterns(
  "(?:-(?<chromEnd>[0-9]+))?",                #regex string
  list(list("-", chromEnd="[0-9]+"), "?"),    #nc pattern using lists
  nc::quantifier("-", chromEnd="[0-9]+", "?"))#quantifier helper function
#> List of 3
#>  $ : chr "(?:-(?<chromEnd>[0-9]+))?"
#>  $ : chr "(?:(?:-([0-9]+))?)"
#>  $ : chr "(?:(?:-([0-9]+))?)"
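To see how such an optional group behaves when matching, the sketch below embeds the first regex string in a slightly larger regex and applies it with base R. The leading digits, the anchors ^ and $, and the two subjects are assumptions added for illustration.

```r
# Digits, then an optional non-capturing group that contains the
# named capture group chromEnd; anchored to match the whole subject.
range.regex <- "^[0-9]+(?:-(?<chromEnd>[0-9]+))?$"
range.vec <- c("100-200", "100") # hypothetical subjects, with and without "-end"
is.match <- grepl(range.regex, range.vec, perl=TRUE)
# Replace each match by the text captured in group 1 (empty if the
# optional group did not take part in the match).
chromEnd <- sub(range.regex, "\\1", range.vec, perl=TRUE)
print(data.frame(range.vec, is.match, chromEnd))
```

Both subjects match; only the first one yields non-empty text for chromEnd.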
Another example with a named capture group inside an optional
non-capturing group:
show.patterns(
  "(?: (?<name>[^,}]+))?",
  list(list(" ", name="[^,}]+"), "?"),
  nc::quantifier(" ", name="[^,}]+", "?"))
#> List of 3
#>  $ : chr "(?: (?<name>[^,}]+))?"
#>  $ : chr "(?:(?: ([^,}]+))?)"
#>  $ : chr "(?:(?: ([^,}]+))?)"
 
nc::alternatives_with_shared_groups for alternatives with identical named sub-pattern groups
Sometimes each alternative is just a re-arrangement of the same
sub-patterns. For example, consider the following subjects, each of which
is a date in one of two formats.
subject.vec <- c("mar 17, 1983", "26 sep 2017", "17 mar 1984")
In each of the two formats, the month consists of three lower-case
letters, the day consists of two digits, and the year consists of four
digits. Is there a single pattern that can match each of these subjects?
Yes, such a pattern can be defined using the code below:
pattern <- nc::alternatives_with_shared_groups(
  month="[a-z]{3}",
  day=list("[0-9]{2}", as.integer),
  year=list("[0-9]{4}", as.integer),
  list(american=list(month, " ", day, ", ", year)),
  list(european=list(day, " ", month, " ", year)))
In the code above, we used
nc::alternatives_with_shared_groups, which requires two
kinds of arguments:
- named arguments (month, day, year) define sub-pattern groups that
are used in each alternative.
- un-named arguments (last two) define alternative patterns, each of
which can use the sub-pattern group names (month, day, year).
The pattern can be used for matching, and the result is a data table
with one column for each unique name:
(match.dt <- nc::capture_first_vec(subject.vec, pattern))
#>        american  month   day  year    european
#>          <char> <char> <int> <int>      <char>
#> 1: mar 17, 1983    mar    17  1983            
#> 2:                 sep    26  2017 26 sep 2017
#> 3:                 mar    17  1984 17 mar 1984
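For comparison, the sketch below parses the same subjects in base R without the helper. It is not how nc works internally; the names ending in .re, parse.one, and parsed.mat are made up for illustration. Each format needs its own regex, and the code must remember which group number holds which field.

```r
subject.vec <- c("mar 17, 1983", "26 sep 2017", "17 mar 1984")
# Shared sub-patterns, each wrapped in a capture group.
month.re <- "([a-z]{3})"
day.re <- "([0-9]{2})"
year.re <- "([0-9]{4})"
american.re <- paste0("^", month.re, " ", day.re, ", ", year.re, "$")
european.re <- paste0("^", day.re, " ", month.re, " ", year.re, "$")
parse.one <- function(subject){
  am <- regmatches(subject, regexec(american.re, subject))[[1]]
  eu <- regmatches(subject, regexec(european.re, subject))[[1]]
  if(length(am)){ # american order: month, day, year
    c(month=am[2], day=am[3], year=am[4])
  }else if(length(eu)){ # european order: day, month, year
    c(month=eu[3], day=eu[2], year=eu[4])
  }else{ # neither format matched
    c(month=NA_character_, day=NA_character_, year=NA_character_)
  }
}
# One row per subject, one column per field.
parsed.mat <- t(sapply(subject.vec, parse.one, USE.NAMES=FALSE))
print(parsed.mat)
```

The nc pattern above expresses the same idea in one call, and the group names take the place of the manual bookkeeping of group numbers.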
After having parsed the dates into these three columns, we can add a
date column:
Sys.setlocale(locale="C")#to recognize months in English.
#> [1] "LC_CTYPE=C;LC_NUMERIC=C;LC_TIME=C;LC_COLLATE=C;LC_MONETARY=C;LC_MESSAGES=fr_FR.UTF-8;LC_PAPER=fr_FR.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=fr_FR.UTF-8;LC_IDENTIFICATION=C"
match.dt[, date := data.table::as.IDate(
  paste(month, day, year), format="%b %d %Y")]
print(match.dt, class=TRUE)
#>        american  month   day  year    european       date
#>          <char> <char> <int> <int>      <char>     <IDat>
#> 1: mar 17, 1983    mar    17  1983             1983-03-17
#> 2:                 sep    26  2017 26 sep 2017 2017-09-26
#> 3:                 mar    17  1984 17 mar 1984 1984-03-17
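If changing the locale is undesirable, the month abbreviation can be converted to a number before building the date. The sketch below is an alternative that is not used in the text above; it relies on the base R constant month.abb, and it re-creates the three parsed columns by hand in a data frame named parsed (a hypothetical stand-in for match.dt).

```r
# Stand-in for the month, day and year columns shown above.
parsed <- data.frame(
  month=c("mar", "sep", "mar"),
  day=c(17L, 26L, 17L),
  year=c(1983L, 2017L, 1984L))
# month.abb holds English abbreviations ("Jan", "Feb", ...) in any locale.
month.int <- match(parsed$month, tolower(month.abb))
iso.text <- sprintf("%04d-%02d-%02d", parsed$year, month.int, parsed$day)
parsed$date <- as.Date(iso.text)
print(parsed)
```

Because the date text is numeric and in year-month-day order, the conversion does not depend on the language of month names in the current locale.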
Another example is parsing given and family names, in two different
formats:
nc::capture_first_vec(
  c("Toby Dylan Hocking","Hocking, Toby Dylan"),
  nc::alternatives_with_shared_groups(
    family="[A-Z][a-z]+",
    given="[^,]+",
    list(given_first=list(given, " ", family)),
    list(family_first=list(family, ", ", given))
  )
)
#>           given_first      given  family        family_first
#>                <char>     <char>  <char>              <char>
#> 1: Toby Dylan Hocking Toby Dylan Hocking                    
#> 2:                    Toby Dylan Hocking Hocking, Toby Dylan