R: gsub with wildcard and several conditions
I am in the process of cleaning a bibliographic database, and it is my first time working with R. One of the columns, the variable cr, contains the references that the reference in question has cited, and looks for instance like this:
andreosso-o'callaghan b, 2008, palgrave stud eur un, p61. alecu de flers n, 2005, int relations europe, p317. duchene francois, 1973, nation writ large fo. koh t, 2007, straits times 0808. lenz t, 2009, geopolitics geoecono. lucarelli s, 2010, routl garn ser eur w, v7, p1. manners i, 2002, j common mark stud, v40, p235, doi 10.1111/1468-5965.00353. nye j., 2004, soft power means suc. orbie j, 2010, normative power euro. portela c, 2007, 200710 rscas. rosecrance r., 1998, paradoxes european f. smith k.e., 2003, european foreign pol. song xn, 2010, rev int stud, v36, p755, doi 10.1017/s0260210510000835. tanaka t, 2008, palgrave stud eur un, p170. warleigh-lack a., 2010, comp regional integr, p43.
The problem I run into is that the same reference occurs in many different disguises. In the case above, it looks like this:
- nye j., 2004, soft power means suc.
In other cases, it looks like this:
- nye j., 2004, soft power: means success in world politics, new york: publicaffairs
There are at least 30 different unique versions of this reference. I can identify them within the database by the name of the author (nye j.), the year of publication (2004), and the mention of "the means suc". My idea is to use the gsub function to search within the delimiters in the column (which are a dot and two spaces) for these parameters, and replace the whole expression with:
- nye j., 2004, soft power: means success in world politics, new york: publicaffairs
So far, I have been able to do simple gsubs, and I managed to replace the variations of Mr. Nye with nye j., but I did this by searching for the variations manually, which is not feasible anymore. Something like this:
help2 <- within(help2, {
  values <- gsub(x = cr, pattern = "nye j., 2004,*means suc*. ", replacement = "nye j., 2004, soft power: means success in world politics, new york: publicaffairs")
})
I am aware that wildcards work differently in R, but I can't figure out what I need to change. Any idea? Many thanks! Best regards, Steffi
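As background on why the attempt above finds nothing: gsub treats its pattern as a regular expression, where * is a quantifier on the preceding token rather than a "match anything" wildcard. So ,* means "zero or more commas" and suc* means "su followed by zero or more c". The following is only a small sketch on a made-up string x (not the real cr column) to illustrate the difference:

```r
# Hypothetical sample string; the real data lives in the `cr` column.
x <- "nye j., 2004, soft power means suc. orbie j, 2010, normative power euro."

# Glob-style pattern from the question, read as a regex:
# ",*" = zero or more commas, "c*" = zero or more "c", "." = any character.
glob_like <- grepl("nye j., 2004,*means suc*. ", x)

# Regex equivalent of the intended wildcard: ".*" = any characters.
regex_any <- grepl("nye j\\., 2004,.*means suc.*", x)

# utils::glob2rx() converts a glob-style pattern into a regex for you.
glob_conv <- grepl(utils::glob2rx("nye j., 2004,*means suc*"), x)

print(c(glob_like = glob_like, regex_any = regex_any, glob_conv = glob_conv))
```

Note that a plain .* is greedy and can run past the end of one reference into the next, so for gsub on a column holding many references per cell a more restrictive pattern is the safer choice.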
Your code can look like the following:
pat <- "(?i)(^|\\. {2,})nye j\\.(?:(?!\\. {2}).)*?\\b2004\\b(?:(?!\\. {2}).)*?means suc(?:(?!\\. {2}).)*"
repl <- "\\1nye j., 2004, soft power: means success in world politics, new york: publicaffairs"
explain$cr <- gsub(pat, repl, explain$cr, perl = TRUE)
See the R demo and the regex demo.
Pattern details (shown with single backslashes; inside an R string literal each backslash is doubled):
- (?i) - a case-insensitive modifier making the whole pattern case insensitive
- (^|\. {2,}) - capturing group 1: the start of the string (^) or a dot followed by 2 or more spaces
- nye j\. - a literal nye j. substring (a dot must be escaped to match a literal dot)
- (?:(?!\. {2}).)*? - any char other than line break chars (.), 0 or more occurrences, as few as possible, that does not start a sequence of a . and 2 or more spaces
- \b2004\b - 2004 as a whole word (as \b are word boundaries)
- (?:(?!\. {2}).)*? - any char other than line break chars (.), 0 or more occurrences, as few as possible, that does not start a sequence of a . and 2 or more spaces
- means suc - a literal means suc substring
- (?:(?!\. {2}).)* - any char other than line break chars (.), 0 or more occurrences, as many as possible, that does not start a sequence of a . and 2 or more spaces
The \\1 in the replacement pattern is a backreference to the value captured in group 1, so the delimiter in front of the reference is kept.
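To see the pattern at work, here is a minimal sketch on made-up data. The vector refs and the "some other title" entry are assumptions for illustration only. The delimiter is built with strrep so that the two spaces are explicit, and the group 1 alternative is written as \\. {2,} to spell out "a dot followed by 2 or more spaces":

```r
# Delimiter between references: a dot followed by two spaces.
sep <- paste0(".", strrep(" ", 2))

# Hypothetical cells, each holding several references.
refs <- c(
  paste(c("manners i, 2002, j common mark stud, v40, p235",
          "nye j., 2004, soft power means suc",
          "orbie j, 2010, normative power euro."), collapse = sep),
  paste(c("nye j., 1990, some other title",
          "nye j., 2004, soft power means suc in world pol",
          "koh t, 2007, straits times 0808."), collapse = sep)
)

pat <- "(?i)(^|\\. {2,})nye j\\.(?:(?!\\. {2}).)*?\\b2004\\b(?:(?!\\. {2}).)*?means suc(?:(?!\\. {2}).)*"
repl <- "\\1nye j., 2004, soft power: means success in world politics, new york: publicaffairs"

# perl = TRUE is required for the lookahead (?!...) and the inline (?i) flag.
cleaned <- gsub(pat, repl, refs, perl = TRUE)
print(cleaned)
```

Only the 2004 entry inside each cell should be rewritten; the neighbouring references, including the 1990 entry by the same author, stay as they are because the tempered token (?:(?!\\. {2}).)* cannot cross a dot plus two spaces.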