R: gsub with wildcard and several conditions


I am in the process of cleaning a bibliographic database, and this is my first time working with R. One of the columns, the variable cr, contains the references that the reference in question has cited, and looks, for instance, like this:

andreosso-o'callaghan b, 2008, palgrave stud eur un, p61. alecu de flers n, 2005, int relations europe, p317. duchene francois, 1973, nation writ large fo. koh t, 2007, straits times 0808. lenz t, 2009, geopolitics geoecono. lucarelli s, 2010, routl garn ser eur w, v7, p1. manners i, 2002, j common mark stud, v40, p235, doi 10.1111/1468-5965.00353. nye j., 2004, soft power means suc. orbie j, 2010, normative power euro. portela c, 2007, 200710 rscas. rosecrance r., 1998, paradoxes european f. smith k.e., 2003, european foreign pol. song xn, 2010, rev int stud, v36, p755, doi 10.1017/s0260210510000835. tanaka t, 2008, palgrave stud eur un, p170. warleigh-lack a., 2010, comp regional integr, p43.

The problem I run into is that the same reference occurs in many different disguises. In the case above, it looks like this:

  • nye j., 2004, soft power means suc.

In other cases, it looks like this:

  • nye j., 2004, soft power: means success in world politics, new york: publicaffairs

There are at least 30 different unique versions of this reference. I can identify them within the database by the name of the author (nye j.), the year of publication (2004), and the mention of "the means suc". My idea is to use the gsub function to search, within the delimiters in the column (which are a dot and two spaces), for these parameters, and to replace the whole expression with

  • nye j., 2004, soft power: means success in world politics, new york: publicaffairs

So far, I have only been able to do simple gsubs. I managed to replace the variations of Mr. Nye with nye j., but I did this by searching for the variations manually, which is not feasible anymore. I tried this:

help2 <- within(help2, { values <- gsub (x= cr, pattern = "nye j., 2004,*means suc*.  ", replacement = "nye j., 2004, soft power: means success in world politics, new york: publicaffairs")}) 

I am aware that wildcards work differently in R, but I can't figure out what I need to change. Any idea? Many thanks! Best regards, Steffi

Your code can look like the following:

pat <- "(?i)(^|\\.  +)nye j\\.(?:(?!\\. {2}).)*?\\b2004\\b(?:(?!\\. {2}).)*?means suc(?:(?!\\. {2}).)*"
repl <- "\\1nye j., 2004, soft power: means success in world politics, new york: publicaffairs"
explain$cr <- gsub(pat, repl, explain$cr, perl=TRUE)

Pattern details:

  • (?i) - a case-insensitive modifier that makes the whole pattern case insensitive
  • (^|\.  +) - capturing group 1: either the start of the string (^) or a dot followed by 2 or more spaces
  • nye j\. - a literal nye j. substring (a dot must be escaped to match a literal dot)
  • (?:(?!\. {2}).)*? - any char other than line break chars (.), 0 or more occurrences, as few as possible, that does not start a sequence of a dot followed by 2 spaces
  • \b2004\b - 2004 as a whole word (the \b are word boundaries)
  • (?:(?!\. {2}).)*? - any char other than line break chars (.), 0 or more occurrences, as few as possible, that does not start a sequence of a dot followed by 2 spaces
  • means suc - a literal means suc substring
  • (?:(?!\. {2}).)* - any char other than line break chars (.), 0 or more occurrences, as many as possible, that does not start a sequence of a dot followed by 2 spaces.
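
To see why the (?:(?!\. {2}).)*? token is used instead of a plain .*?, here is a minimal sketch that compares the two. The sample string x is made up for illustration and is not taken from the database; it holds two references separated by a dot and two spaces, and only the second one mentions 2004.

```r
# Made-up sample: two references separated by dot + two spaces
x <- "nye j., 1990, bound to lead.  smith k., 2004, means suc"

# Plain lazy dot: free to run across the ".  " delimiter into the next reference
loose <- "nye j\\..*?\\b2004\\b"

# Tempered token: every consumed char must not start a dot + two spaces sequence,
# so the match has to stay inside one reference
strict <- "nye j\\.(?:(?!\\. {2}).)*?\\b2004\\b"

loose_hit <- grepl(loose, x, perl = TRUE)
strict_hit <- grepl(strict, x, perl = TRUE)

print(loose_hit)   # TRUE: author and year come from two different references
print(strict_hit)  # FALSE: no single reference has both nye j. and 2004
```

With the plain .*?, the author of one reference could be paired with the year of another, and the replacement would then swallow everything in between.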

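The \\1 in the replacement pattern is a backreference to the value captured in group 1, so the delimiter (or the empty start of string) matched before nye j. is put back unchanged.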
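
As a rough end-to-end check, the pattern and replacement can be tried on a single string. The value of cr below is a shortened, made-up sample modelled on the column shown in the question, with a dot and two spaces assumed as the delimiter between references.

```r
# Made-up sample value: three references separated by dot + two spaces
cr <- paste0(
  "manners i, 2002, j common mark stud, v40, p235.  ",
  "nye j., 2004, soft power means suc.  ",
  "orbie j, 2010, normative power euro."
)

pat <- "(?i)(^|\\.  +)nye j\\.(?:(?!\\. {2}).)*?\\b2004\\b(?:(?!\\. {2}).)*?means suc(?:(?!\\. {2}).)*"
repl <- "\\1nye j., 2004, soft power: means success in world politics, new york: publicaffairs"

# perl = TRUE is required: lookaheads and (?i) are PCRE features
cleaned <- gsub(pat, repl, cr, perl = TRUE)
print(cleaned)
```

Only the nye j. entry is rewritten; the neighbouring references and the delimiters around them are left as they were. Note that if the matching reference is the very last one in the string, its final dot is consumed by the last token and is not restored by the replacement.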

