3  Data Cleaning

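Extracting useful information from unstructured or semi-structured data usually requires some data cleaning, and regular expressions are one of the most important tools for the job. Base R ships with a set of functions that work with regular expressions; see ?regex for the pattern syntax.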

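This chapter uses the metadata of R packages on CRAN as the data to clean. Because the cleaning here mainly serves text analysis, five fields are selected from the metadata: Package, Maintainer, Title, Description, and Authors@R.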

pdb <- readRDS(file = "data/cran-package-db-20241231.rds")
pdb <- subset(
  x = pdb, subset = !duplicated(Package),
  select = c("Package", "Maintainer", "Title", "Description", "Authors@R")
)

3.1 Regular expressions

For simplicity, consider a few fields of the Rcpp package.

pdb[pdb$Package == "Rcpp","Description"]
[1] "The 'Rcpp' package provides R functions as well as C++ classes which\n offer a seamless integration of R and C++. Many R data types and objects can be\n mapped back and forth to C++ equivalents which facilitates both writing of new\n code as well as easier integration of third-party libraries. Documentation\n about 'Rcpp' is provided by several vignettes included in this package, via the\n 'Rcpp Gallery' site at <https://gallery.rcpp.org>, the paper by Eddelbuettel and\n Francois (2011, <doi:10.18637/jss.v040.i08>), the book by Eddelbuettel (2013,\n <doi:10.1007/978-1-4614-6868-4>) and the paper by Eddelbuettel and Balamuta (2018,\n <doi:10.1080/00031305.2017.1375990>); see 'citation(\"Rcpp\")' for details."

3.1.1 Quantifiers

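The Description field contains several newline characters (\n) and several pairs of parentheses, which makes it a convenient target for searching and replacing.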

grep(pattern = "\n", x = pdb[pdb$Package == "Rcpp", "Description"], fixed = TRUE)
[1] 1
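
As a hedged illustration of quantifiers, the sketch below uses a short made-up string (desc is hypothetical, not taken from the CRAN data) to contrast +, {n}, and the greedy and lazy forms of *. perl = TRUE is passed for the last two calls so that the lazy quantifier follows PCRE semantics.

```r
# Hypothetical toy string standing in for a Description field
desc <- "the paper (2011) and the book (2013)"

# `+` matches one or more digits; regexpr() reports the first run only
first_year <- regmatches(desc, regexpr("[0-9]+", desc))

# `{4}` matches exactly four digits; gregexpr() reports every run
all_years <- regmatches(desc, gregexpr("[0-9]{4}", desc))[[1]]

# Greedy `.*` extends to the last closing parenthesis
greedy <- regmatches(desc, regexpr("\\(.*\\)", desc, perl = TRUE))

# Lazy `.*?` stops at the first closing parenthesis
lazy <- regmatches(desc, regexpr("\\(.*?\\)", desc, perl = TRUE))

print(list(first_year, all_years, greedy, lazy))
```

The lazy form is the one used throughout this chapter whenever a pattern has to stop at the nearest closing delimiter.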

3.1.2 Concatenation

3.1.3 Assertions

Lookahead / lookbehind
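
A minimal sketch of lookaround assertions, assuming PCRE (perl = TRUE), since R's default regex engine does not support them. The string desc is an abbreviated excerpt of the Rcpp Description shown earlier; the variable names are illustrative only.

```r
# Abbreviated excerpt of the Rcpp Description (shortened for illustration)
desc <- paste(
  "site at <https://gallery.rcpp.org>,",
  "Francois (2011, <doi:10.18637/jss.v040.i08>),",
  "Eddelbuettel (2013, <doi:10.1007/978-1-4614-6868-4>)"
)

# Lookbehind (?<=<doi:) requires "<doi:" before the match,
# lookahead (?=>) requires ">" after it; neither is part of the match
dois <- regmatches(
  desc,
  gregexpr("(?<=<doi:)[^>]+(?=>)", desc, perl = TRUE)
)[[1]]

# Negative lookahead (?!doi:) keeps angle-bracketed text that is not a DOI
urls <- regmatches(
  desc,
  gregexpr("<(?!doi:)[^>]+>", desc, perl = TRUE)
)[[1]]

print(dois)
print(urls)
```

Because assertions consume no characters, the DOIs come back without the surrounding <doi: and > markers, which saves a second clean-up step.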

3.1.4 Backreferences

x <- "年末国债余额为345723.62亿元,包括内债余额342063.98亿元、外债余额3659.64亿元,控制在全国人大批准的国债余额限额352008.35亿元以内。"
str_extract <- function(text, pattern, ...) regmatches(text, regexpr(pattern, text, ...))
# Extract the first piece of text that matches the pattern
x <- str_extract(text = x, pattern = "(国债余额限额)(\\d+.*?)亿元")
# Pull the key pieces of information out of the matched text
sub(pattern = "(国债余额限额)(\\d+.*?)亿元", x = x, replacement = "\\1")
[1] "国债余额限额"
sub(pattern = "(国债余额限额)(\\d+.*?)亿元", x = x, replacement = "\\2")
[1] "352008.35"

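The replacement argument may contain backreferences to the regular expression: \\1 stands for the text matched by the first capture group, and \\2 likewise for the second.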
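
Backreferences can also reorder captured pieces or appear inside the pattern itself. The following is a small sketch on made-up ASCII strings (cite and dup are hypothetical examples, not part of the chapter's data).

```r
# Hypothetical citation string: swap author and year via \\1 and \\2
cite <- "Eddelbuettel (2013)"
swapped <- sub("([A-Za-z]+) \\(([0-9]{4})\\)", "\\2: \\1", cite)

# A backreference inside the pattern: \\1 must repeat whatever
# the first group matched, which detects a doubled word
dup <- "the the paper"
deduped <- sub("\\b([a-z]+) \\1\\b", "\\1", dup, perl = TRUE)

print(swapped)
print(deduped)
```

In the first call the backreferences live in replacement; in the second, \\1 is part of pattern and constrains what can match.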

3.1.5 Named capture

Within a regular expression, a capture group can be given a name.

x <- c(
  `2023` = "控制在全国人大批准的国债余额限额308608.35亿元以内",
  `2024` = "控制在全国人大批准的国债余额限额352008.35亿元以内"
)

m <- regexec(pattern = "(?<first>国债余额限额)(?<last>\\d+.*?)亿元", text = x, perl = TRUE)
regmatches(x = x, m = m)
$`2023`
                                                  first 
"国债余额限额308608.35亿元"              "国债余额限额" 
                       last 
                "308608.35" 

$`2024`
                                                  first 
"国债余额限额352008.35亿元"              "国债余额限额" 
                       last 
                "352008.35" 

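Of the group names, first corresponds to 国债余额限额 (the national debt balance limit) and last to the number that follows it (308608.35 for 2023 and 352008.35 for 2024). Group names may also be written in Chinese.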

m <- regexec(pattern = "(?<指标>国债余额限额)(?<数值>\\d+.*?)(?<单位>亿元)", text = x, perl = TRUE)
regmatches(x = x, m = m)
$`2023`
                                                   指标 
"国债余额限额308608.35亿元"              "国债余额限额" 
                       数值                        单位 
                "308608.35"                      "亿元" 

$`2024`
                                                   指标 
"国债余额限额352008.35亿元"              "国债余额限额" 
                       数值                        单位 
                "352008.35"                      "亿元" 

regmatches() returns a list whose elements are named character vectors. The code below extracts part of each vector by index position.

lapply(regmatches(x = x, m = m), `[`, c(2, 3, 4))
$`2023`
          指标           数值           单位 
"国债余额限额"    "308608.35"         "亿元" 

$`2024`
          指标           数值           单位 
"国债余额限额"    "352008.35"         "亿元" 

Finally, combine the extracted data into a two-dimensional array (a character matrix).

do.call(rbind, lapply(regmatches(x = x, m = m), `[`, c(2, 3, 4)))
     指标           数值        单位  
2023 "国债余额限额" "308608.35" "亿元"
2024 "国债余额限额" "352008.35" "亿元"
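
The matrix above stores the amounts as text. As a hedged follow-up, the sketch below turns such a matrix into a data frame with a numeric column. To stay self-contained it uses hypothetical English stand-ins for the two sentences and unnamed capture groups; the column names year, indicator, value, and unit are assumptions made for illustration.

```r
# Hypothetical English stand-ins for the two yearly statements
x <- c(`2023` = "limit 308608.35 bn", `2024` = "limit 352008.35 bn")

# Three capture groups: indicator, value, unit
m <- regexec("(limit) ([0-9.]+) (bn)", x)

# Element 1 of each match is the full match, so keep positions 2 to 4
parts <- do.call(rbind, lapply(regmatches(x, m), `[`, 2:4))

# Build a data frame and convert the amount from character to numeric
df <- data.frame(
  year = rownames(parts),
  indicator = parts[, 1],
  value = as.numeric(parts[, 2]),
  unit = parts[, 3],
  row.names = NULL
)
print(df)
```

Keeping the conversion as an explicit last step makes it easy to spot values that fail to parse, since as.numeric() returns NA with a warning for them.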

3.2 String operations

3.2.1 Searching

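grep() and grepl() are a pair of string-matching functions that report which strings match a pattern. The former returns an integer vector holding the indices of the matching elements; the latter returns a logical vector with one value per element.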

grep(pattern = "\n", x = pdb[pdb$Package == "Rcpp", "Description"], fixed = TRUE)
[1] 1
# With value = TRUE, return the matching strings themselves
grep(pattern = "\n", x = pdb[pdb$Package == "Rcpp", "Description"], value = TRUE, fixed = TRUE)
[1] "The 'Rcpp' package provides R functions as well as C++ classes which\n offer a seamless integration of R and C++. Many R data types and objects can be\n mapped back and forth to C++ equivalents which facilitates both writing of new\n code as well as easier integration of third-party libraries. Documentation\n about 'Rcpp' is provided by several vignettes included in this package, via the\n 'Rcpp Gallery' site at <https://gallery.rcpp.org>, the paper by Eddelbuettel and\n Francois (2011, <doi:10.18637/jss.v040.i08>), the book by Eddelbuettel (2013,\n <doi:10.1007/978-1-4614-6868-4>) and the paper by Eddelbuettel and Balamuta (2018,\n <doi:10.1080/00031305.2017.1375990>); see 'citation(\"Rcpp\")' for details."
# Equivalent to grep(..., value = TRUE); grepv() requires R >= 4.5.0
grepv(pattern = "\n", x = pdb[pdb$Package == "Rcpp", "Description"], fixed = TRUE)
[1] "The 'Rcpp' package provides R functions as well as C++ classes which\n offer a seamless integration of R and C++. Many R data types and objects can be\n mapped back and forth to C++ equivalents which facilitates both writing of new\n code as well as easier integration of third-party libraries. Documentation\n about 'Rcpp' is provided by several vignettes included in this package, via the\n 'Rcpp Gallery' site at <https://gallery.rcpp.org>, the paper by Eddelbuettel and\n Francois (2011, <doi:10.18637/jss.v040.i08>), the book by Eddelbuettel (2013,\n <doi:10.1007/978-1-4614-6868-4>) and the paper by Eddelbuettel and Balamuta (2018,\n <doi:10.1080/00031305.2017.1375990>); see 'citation(\"Rcpp\")' for details."
# grepl() returns a logical vector: TRUE where the pattern matches
grepl(pattern = "\n", x = pdb[pdb$Package == "Rcpp", "Description"], fixed = TRUE)
[1] TRUE
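
With a single string the difference between the two return types is easy to miss. The sketch below applies both functions to a hypothetical three-element vector (the names and contents of desc are made up) to show indices versus one logical value per element.

```r
# Hypothetical descriptions; two of the three contain a newline
desc <- c(
  pkgA = "classes which\n offer a seamless integration",
  pkgB = "a single line",
  pkgC = "two\nlines"
)

# grep(): positions of the elements that match
idx <- grep("\n", desc, fixed = TRUE)

# grepl(): one TRUE/FALSE per element, same length as the input
hit <- grepl("\n", desc, fixed = TRUE)

# The logical vector is convenient for subsetting and counting
multi_line <- names(desc)[hit]
n_multi <- sum(hit)

print(idx)
print(hit)
print(multi_line)
```

grepl() is usually the safer choice inside subset() or [ ], because an empty grep() result combined with negative indexing drops every element.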

3.2.2 Replacing

sub() and gsub() are a pair of string-replacement functions: the former replaces only the first match, whereas the latter replaces every match.

sub(pattern = "\n", x = pdb[pdb$Package == "Rcpp", "Description"], replacement = "", fixed = TRUE)
[1] "The 'Rcpp' package provides R functions as well as C++ classes which offer a seamless integration of R and C++. Many R data types and objects can be\n mapped back and forth to C++ equivalents which facilitates both writing of new\n code as well as easier integration of third-party libraries. Documentation\n about 'Rcpp' is provided by several vignettes included in this package, via the\n 'Rcpp Gallery' site at <https://gallery.rcpp.org>, the paper by Eddelbuettel and\n Francois (2011, <doi:10.18637/jss.v040.i08>), the book by Eddelbuettel (2013,\n <doi:10.1007/978-1-4614-6868-4>) and the paper by Eddelbuettel and Balamuta (2018,\n <doi:10.1080/00031305.2017.1375990>); see 'citation(\"Rcpp\")' for details."
# Newlines
gsub(pattern = "\n", x = pdb[pdb$Package == "Rcpp", "Description"], replacement = "", fixed = TRUE)
[1] "The 'Rcpp' package provides R functions as well as C++ classes which offer a seamless integration of R and C++. Many R data types and objects can be mapped back and forth to C++ equivalents which facilitates both writing of new code as well as easier integration of third-party libraries. Documentation about 'Rcpp' is provided by several vignettes included in this package, via the 'Rcpp Gallery' site at <https://gallery.rcpp.org>, the paper by Eddelbuettel and Francois (2011, <doi:10.18637/jss.v040.i08>), the book by Eddelbuettel (2013, <doi:10.1007/978-1-4614-6868-4>) and the paper by Eddelbuettel and Balamuta (2018, <doi:10.1080/00031305.2017.1375990>); see 'citation(\"Rcpp\")' for details."
# Single quotes
gsub(pattern = "'", x = pdb[pdb$Package == "Rcpp", "Description"], replacement = "", fixed = TRUE)
[1] "The Rcpp package provides R functions as well as C++ classes which\n offer a seamless integration of R and C++. Many R data types and objects can be\n mapped back and forth to C++ equivalents which facilitates both writing of new\n code as well as easier integration of third-party libraries. Documentation\n about Rcpp is provided by several vignettes included in this package, via the\n Rcpp Gallery site at <https://gallery.rcpp.org>, the paper by Eddelbuettel and\n Francois (2011, <doi:10.18637/jss.v040.i08>), the book by Eddelbuettel (2013,\n <doi:10.1007/978-1-4614-6868-4>) and the paper by Eddelbuettel and Balamuta (2018,\n <doi:10.1080/00031305.2017.1375990>); see citation(\"Rcpp\") for details."
# Double quotes
gsub(pattern = '\"', x = pdb[pdb$Package == "Rcpp", "Description"], replacement = "", fixed = TRUE)
[1] "The 'Rcpp' package provides R functions as well as C++ classes which\n offer a seamless integration of R and C++. Many R data types and objects can be\n mapped back and forth to C++ equivalents which facilitates both writing of new\n code as well as easier integration of third-party libraries. Documentation\n about 'Rcpp' is provided by several vignettes included in this package, via the\n 'Rcpp Gallery' site at <https://gallery.rcpp.org>, the paper by Eddelbuettel and\n Francois (2011, <doi:10.18637/jss.v040.i08>), the book by Eddelbuettel (2013,\n <doi:10.1007/978-1-4614-6868-4>) and the paper by Eddelbuettel and Balamuta (2018,\n <doi:10.1080/00031305.2017.1375990>); see 'citation(Rcpp)' for details."
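
Deleting "\n" outright works for the Rcpp Description only because each newline there is followed by a space. A slightly more defensive sketch, on a hypothetical string in which one newline has no trailing space, replaces every run of whitespace with a single space instead.

```r
# Hypothetical text: the second newline is not followed by a space
desc <- "classes which\n offer a seamless\nintegration"

# Deleting the newline fuses "seamless" and "integration"
fused <- gsub("\n", "", desc, fixed = TRUE)

# Collapsing any run of whitespace to one space keeps words apart
clean <- gsub("[[:space:]]+", " ", desc)

print(fused)
print(clean)
```

The second call uses a regular expression, so fixed = TRUE must not be set there.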

3.2.3 Extracting

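sub() and gsub() above are a pair of functions for replacing strings; regexpr() and gregexpr() are another pair, used to locate matches so that they can be extracted. The letter g prefixed to regexpr() means the same as in gsub(): the operation is global, that is, it applies to every match rather than only the first.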

x <- gsub(pattern = "\n", x = pdb[pdb$Package == "Rcpp", "Description"], replacement = "", fixed = TRUE)
x <- gsub(pattern = '\"', x = x, replacement = "", fixed = TRUE)
# Extract the URL
str_extract <- function(text, pattern, ...) regmatches(text, regexpr(pattern, text, ...))
str_extract(text = x, pattern = "(<.*?>)", perl = TRUE)
[1] "<https://gallery.rcpp.org>"
# Extract the contents of the first pair of parentheses
str_extract(text = x, pattern = "(\\(.*?\\))", perl = TRUE)
[1] "(2011, <doi:10.18637/jss.v040.i08>)"

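The Description field contains several DOI links wrapped in parentheses. The call below extracts every parenthesized group, which also picks up (Rcpp) from the citation() call.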

str_extract_g <- function(text, pattern, ...) regmatches(text, gregexpr(pattern, text, ...))
str_extract_g(text = x, pattern = "(\\(.*?\\))", perl = TRUE)
[[1]]
[1] "(2011, <doi:10.18637/jss.v040.i08>)"        
[2] "(2013, <doi:10.1007/978-1-4614-6868-4>)"    
[3] "(2018, <doi:10.1080/00031305.2017.1375990>)"
[4] "(Rcpp)"
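
To keep only the DOI links and to process several packages at once, the pattern can be tightened and the helper applied to a whole vector. This is a minimal sketch on hypothetical descriptions (pkgA, pkgB and their DOIs are made up); str_extract_g() is redefined so the block runs on its own.

```r
# Same helper as above, repeated so the example is self-contained
str_extract_g <- function(text, pattern, ...) {
  regmatches(text, gregexpr(pattern, text, ...))
}

# Hypothetical descriptions: one with two DOIs, one with none
descs <- c(
  pkgA = "See Smith (2011, <doi:10.1000/a1>) and Lee (2013, <doi:10.1000/b2>); see citation(pkgA).",
  pkgB = "No references here."
)

# "<doi:" followed by anything except ">" up to the closing ">"
dois <- str_extract_g(descs, "<doi:[^>]+>", perl = TRUE)

# One list element per input string; lengths() counts DOIs per package
n_doi <- lengths(dois)

print(dois)
print(n_doi)
```

With gregexpr() the result always has one list element per input, and strings without a match yield character(0). The regexpr()-based str_extract() behaves differently: it returns only the matched strings, so its result can be shorter than its input.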