Here is an index of topics which are explained in the different vignettes, along with an overview of functionality using simple examples.
Capture first is for the situation when your input is a character vector (each element is a different subject), you want to find the first match of a regex to each subject, and your desired output is a data table (one row per subject, one column per capture group in the regex).
subject.vec <- c(
  "chr10:213054000-213,055,000",
  "chrM:111000",
  "chr1:110-111 chr2:220-222")
nc::capture_first_vec(
  subject.vec, chrom="chr.*?", ":", chromStart="[0-9,]+", as.integer)
#>     chrom chromStart
#>    <char>      <int>
#> 1:  chr10  213054000
#> 2:   chrM     111000
#> 3:   chr1        110
A variant does the same thing, but takes its input subjects from the character columns of a data table/frame.
library(data.table)
subject.dt <- data.table(
  JobID = c("13937810_25", "14022192_1"),
  Elapsed = c("07:04:42", "07:04:49"))
int.pat <- list("[0-9]+", as.integer)
nc::capture_first_df(
  subject.dt,
  JobID=list(job=int.pat, "_", task=int.pat),
  Elapsed=list(hours=int.pat, ":", minutes=int.pat, ":", seconds=int.pat))
#>          JobID  Elapsed      job  task hours minutes seconds
#>         <char>   <char>    <int> <int> <int>   <int>   <int>
#> 1: 13937810_25 07:04:42 13937810    25     7       4      42
#> 2:  14022192_1 07:04:49 14022192     1     7       4      49
Capture all is for the situation when your input is a single character string or text file subject, you want to find all matches of a regex to that subject, and your desired output is a data table (one row per match, one column per capture group in the regex).
nc::capture_all_str(
  subject.vec, chrom="chr.*?", ":", chromStart="[0-9,]+", as.integer)
#>     chrom chromStart
#>    <char>      <int>
#> 1:  chr10  213054000
#> 2:   chrM     111000
#> 3:   chr1        110
#> 4:   chr2        220
Capture melt is for the situation when your input is a data table/frame that has regularly named columns, and your desired output is a data table with those columns reshaped into a taller/longer form. In that case you can use a regex to identify the columns to reshape.
(one.iris <- data.frame(iris[1,]))
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
nc::capture_melt_single(one.iris, part=".*", "[.]", dim=".*")
#>    Species   part    dim value
#>     <fctr> <char> <char> <num>
#> 1:  setosa  Sepal Length   5.1
#> 2:  setosa  Sepal  Width   3.5
#> 3:  setosa  Petal Length   1.4
#> 4:  setosa  Petal  Width   0.2
nc::capture_melt_multiple(one.iris, column=".*", "[.]", dim=".*")
#>    Species    dim Petal Sepal
#>     <fctr> <char> <num> <num>
#> 1:  setosa Length   1.4   5.1
#> 2:  setosa  Width   0.2   3.5
nc::capture_melt_multiple(one.iris, part=".*", "[.]", column=".*")
#>    Species   part Length Width
#>     <fctr> <char>  <num> <num>
#> 1:  setosa  Petal    1.4   0.2
#> 2:  setosa  Sepal    5.1   3.5
Capture glob is for the situation when you have several data files on disk, with regular names that you can match with a glob/regex. In the example below we first write one CSV file for each iris Species:
dir.create(iris.dir <- tempfile())
icsv <- function(sp)file.path(iris.dir, paste0(sp, ".csv"))
data.table(iris)[, fwrite(.SD, icsv(Species)), by=Species]
#> Empty data.table (0 rows and 1 cols): Species
dir(iris.dir)
#> [1] "setosa.csv"     "versicolor.csv" "virginica.csv"
We then use a glob and a regex to read those files in the code below:
nc::capture_first_glob(file.path(iris.dir,"*.csv"), Species="[^/]+", "[.]csv")
#>        Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#>         <char>        <num>       <num>        <num>       <num>
#>   1:    setosa          5.1         3.5          1.4         0.2
#>   2:    setosa          4.9         3.0          1.4         0.2
#>   3:    setosa          4.7         3.2          1.3         0.2
#>   4:    setosa          4.6         3.1          1.5         0.2
#>   5:    setosa          5.0         3.6          1.4         0.2
#>  ---
#> 146: virginica          6.7         3.0          5.2         2.3
#> 147: virginica          6.3         2.5          5.0         1.9
#> 148: virginica          6.5         3.0          5.2         2.0
#> 149: virginica          6.2         3.4          5.4         2.3
#> 150: virginica          5.9         3.0          5.1         1.8
Helpers describes various functions that simplify the definition of complex regex patterns. For example, nc::field helps avoid repetition below, because "child" serves as both the literal text to match and the name of the capture group:
subject.vec <- c("sex_child1", "age_child1", "sex_child2")
pattern <- list(
  variable="age|sex", "_",
  nc::field("child", "", "[12]", as.integer))
nc::capture_first_vec(subject.vec, pattern)
#>    variable child
#>      <char> <int>
#> 1:      sex     1
#> 2:      age     1
#> 3:      sex     2
It also explains how to define common sub-patterns which are used in several different alternatives.
subject.vec <- c("mar 17, 1983", "26 sep 2017", "17 mar 1984")
pattern <- nc::alternatives_with_shared_groups(
  month="[a-z]{3}", day="[0-9]{2}", year="[0-9]{4}",
  list(month, " ", day, ", ", year),
  list(day, " ", month, " ", year))
nc::capture_first_vec(subject.vec, pattern)
#>     month    day   year
#>    <char> <char> <char>
#> 1:    mar     17   1983
#> 2:    sep     26   2017
#> 3:    mar     17   1984