4.5 String Matching

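The agrep and agrepl functions perform approximate (fuzzy) matching. They search each element of x for an approximate match to pattern, allowing a limited number of insertions, deletions, and substitutions. The number of such edits is measured by the generalized Levenshtein edit distance, and max.distance sets the largest distance that still counts as a match. It can be given as an integer, as a fraction of the pattern length, or as a list that limits individual kinds of edit.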

agrep(pattern, x, max.distance = 0.1, costs = NULL,
      ignore.case = FALSE, value = FALSE, fixed = TRUE,
      useBytes = FALSE)

agrepl(pattern, x, max.distance = 0.1, costs = NULL,
       ignore.case = FALSE, fixed = TRUE, useBytes = FALSE)

agrep returns an integer vector holding the indices of the elements of x that match pattern, and agrepl returns a logical vector, just like the grep and grepl pair. Some examples:

agrep("lasy", "1 lazy 2")

## [1] 1

# sub = 0 disallows substitutions when matching
agrep("lasy", c(" 1 lazy 2", "1 lasy 2"), max.distance = list(sub = 0))

## [1] 2

# Matching is case-sensitive by default
agrep("laysy", c("1 lazy", "1", "1 LAZY"), max.distance = 2)

## [1] 1

# Return the matched values instead of their indices, like grep(..., value = TRUE)
agrep("laysy", c("1 lazy", "1", "1 LAZY"), max.distance = 2, value = TRUE)

## [1] "1 lazy"

# Ignore case
agrep("laysy", c("1 lazy", "1", "1 LAZY"), max.distance = 2, ignore.case = TRUE)

## [1] 1 3
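The examples above only call agrep. As a minimal sketch that is not part of the original examples, the same input can be passed to agrepl to get a logical vector. The call to adist from the utils package is brought in here purely for illustration; it reports the generalized Levenshtein distance between two whole strings.

```r
x <- c("1 lazy", "1", "1 LAZY")

# agrepl returns one logical value per element of x
hits <- agrepl("laysy", x, max.distance = 2)
hits      # TRUE FALSE FALSE

# A logical vector can be used directly for subsetting
x[hits]   # "1 lazy"

# adist compares whole strings and returns a matrix of distances;
# turning "lasy" into "lazy" takes a single substitution
d <- utils::adist("lasy", "lazy")
d[1, 1]   # 1
```

Note that agrep and agrepl look for the pattern anywhere inside each element of x, whereas adist as called here compares the two strings in full.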

startsWith(x, prefix)
endsWith(x, suffix)

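The startsWith and endsWith functions test whether strings begin with a given prefix or end with a given suffix, and return a logical vector. The prefix and suffix arguments are matched literally, not as regular expressions, so characters that are special in a regular expression, such as the period ., stand only for themselves and must not be escaped. For example: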

# A character vector
search()

## [1] ".GlobalEnv"        "package:stats"     "package:graphics" 
## [4] "package:grDevices" "package:utils"     "package:datasets" 
## [7] "package:methods"   "Autoloads"         "package:base"

# Match the strings that start with package:
startsWith(search(), "package:")

## [1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE

# Or, equivalently
grepl("^package:", search())

## [1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE

Flag the files in the current directory whose extension is .Rmd:

# list.files(path = ".", pattern = "\\.Rmd$") returns the file names directly
# Do not write endsWith(list.files(), "\\.Rmd"): the suffix is matched literally
endsWith(list.files(), ".Rmd")

##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE
## [13] FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE
## [25] FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE
## [37]  TRUE FALSE FALSE  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE
## [49]  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE
## [61] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE  TRUE FALSE
## [73]  TRUE  TRUE

# Or, equivalently
grepl("\\.Rmd$", list.files())

##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE
## [13] FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE
## [25] FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE
## [37]  TRUE FALSE FALSE  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE
## [49]  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE
## [61] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE  TRUE FALSE
## [73]  TRUE  TRUE
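The difference between a literal suffix and a regular expression is easy to miss when the output depends on the current directory. The following sketch uses a small vector of made-up file names (they are assumptions, not files from the original example) to show where the two approaches agree and where they differ.

```r
# Hypothetical file names, used only for illustration
files <- c("index.Rmd", "notes.md", "Rmd", "aRmd")

# Literal suffix: the dot must be an actual dot
literal <- endsWith(files, ".Rmd")
literal                      # TRUE FALSE FALSE FALSE

# Unescaped regex: the dot matches any character, so "aRmd" also matches
loose <- grepl(".Rmd$", files)
loose                        # TRUE FALSE FALSE TRUE

# Escaped regex: gives the same answer as the literal suffix test
strict <- grepl("\\.Rmd$", files)
identical(literal, strict)   # TRUE

# Use the logical vector to extract the matching names
files[literal]               # "index.Rmd"
```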

Partial String Matching

match(x, table, nomatch = NA_integer_, incomparables = NULL)
x %in% table
charmatch(x, table, nomatch = NA_integer_)
pmatch(x, table, nomatch = NA_integer_, duplicates.ok = FALSE)

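match, charmatch, and pmatch each return a vector with one entry per element of x, giving the position of its match in table, while x %in% table returns a logical vector that only says whether a match exists. match requires whole-string equality and returns the position of the first match. charmatch and pmatch also accept a partial match, in which an element of x is the beginning of an element of table. charmatch is similar to pmatch(x, table, nomatch = NA_integer_, duplicates.ok = TRUE), except that it distinguishes an ambiguous match (0) from no match (NA) and it does match empty strings. With the default duplicates.ok = FALSE, pmatch allows each element of table to be matched at most once: an element that has been matched is excluded from the search for later elements of x. An ambiguous partial match makes pmatch return NA.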

match("xx", c("abc", "xx", "xxx", "xx"))

## [1] 2

1:10 %in% c(1,3,5,9)

##  [1]  TRUE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE
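As a small sketch of how match and %in% relate, the code below uses illustrative vectors (the names x and tbl are assumptions) to show that match only accepts whole-string equality and that %in% can be expressed through match.

```r
x <- c("xx", "x", "zz")
tbl <- c("abc", "xx", "xxx", "xx")

# match needs whole-string equality, so "x" does not match "xx" or "xxx";
# for "xx" the first of the two matching positions is returned
pos <- match(x, tbl)
pos     # 2 NA NA

# %in% only reports whether a match exists
found <- x %in% tbl
found   # TRUE FALSE FALSE

# The same logical vector, written with match
identical(found, match(x, tbl, nomatch = 0L) > 0L)   # TRUE
```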

# charmatch has several special cases; for instance, it matches the empty string
charmatch("", "")                             # returns 1

## [1] 1

# Multiple exact matches, or multiple partial matches, give 0
charmatch("m",   c("mean", "median", "mode", "quantile")) # returns 0

## [1] 0

# med partially matches only the second element of table, so the result is 2
charmatch("med", c("mean", "median", "mode", "quantile")) # returns 2

## [1] 2

charmatch("xx", "xx")

## [1] 1

charmatch("xx", "xxa")

## [1] 1

charmatch("xx", "axx")

## [1] NA
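Because charmatch encodes three outcomes in one integer vector (a position, 0, or NA), a caller usually has to separate them. The following is a minimal sketch of one way to do that; the vectors tbl and abbr and the status labels are illustrative assumptions.

```r
tbl <- c("mean", "median", "mode", "quantile")
abbr <- c("me", "med", "mo", "x")

# charmatch is vectorised over its first argument
idx <- charmatch(abbr, tbl)
idx      # 0 2 3 NA

# Separate the three outcomes: no match (NA), ambiguous (0), unique match
status <- rep("no match", length(idx))
status[!is.na(idx) & idx == 0L] <- "ambiguous"
ok <- !is.na(idx) & idx > 0L
status[ok] <- tbl[idx[ok]]
status   # "ambiguous" "median" "mode" "no match"
```

Indexing tbl only at the positions flagged by ok avoids subscripting with 0 or NA, which would silently drop or blank out entries.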

# Compare these results with those of charmatch
pmatch("", "")                             # returns NA

## [1] NA

pmatch("m",   c("mean", "median", "mode")) # returns NA

## [1] NA

pmatch("med", c("mean", "median", "mode")) # returns 2

## [1] 2
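The effect of duplicates.ok only shows up when x has several elements competing for the same entry of table. The sketch below uses a two-element table (the names x and tbl are assumptions for illustration) to compare the two settings.

```r
tbl <- c("abc", "ab")
x <- c("", "ab", "ab")

# Default: each element of tbl can be used once. The first "ab" takes the
# exact match at position 2, so the second "ab" falls back to a partial
# match with "abc" at position 1. The empty string never matches.
once <- pmatch(x, tbl, duplicates.ok = FALSE)
once    # NA 2 1

# With duplicates.ok = TRUE both "ab" entries use the exact match
reuse <- pmatch(x, tbl, duplicates.ok = TRUE)
reuse   # NA 2 2
```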