Python - finding duplicate words in a string and printing them using re
I need to print the duplicated last names in a text file (lowercase and uppercase should be treated as the same). The program should not print words with numbers (i.e. if a number appears in the last name or in the first name, the whole name is ignored).
For example, given the text file:

    assaf spanier, assaf din, yo9ssi levi, yoram bibe9rman, david levi, bibi netanyahu, amnon levi, ehud spanier, barak spa7nier, sara neta4nyahu

the output should be:

    assaf
    assaf
    david
    bibi
    amnon
    ehud
    ========
    spanier
    levi

    import re

    def delete_numbers(line):
        words = re.sub(r'\w*\d\w*', '', line).strip()
        for t in re.split(r',', words):
            if len(t.split()) == 1:
                words = re.sub(t, '', words)
        words = re.sub(',,', '', words)
        return words

    fname = input("enter file name: ")
    file = open(fname, "r")
    for line in file.readlines():
        words = delete_numbers(line)
        first_name = re.findall(r"([a-zA-Z]+)\s", words)
        for i in first_name:
            print(i)
        print("***")
        a = ""
        for t in re.split(r',', words):
            a += (", ".join(t.split()[1:])) + " "
OK, first let's start with an aside: opening files in the idiomatic way. Use the with statement, which guarantees that your file will be closed. For small scripts this isn't a big deal, but if you ever start writing longer-lived programs, leaked file handles from incorrectly closed files can come back to haunt you. Since your file has everything on a single line:

    with open(fname) as f:
        data = f.read()

The file is now closed. This also encourages you to deal with your file immediately, and not leave it open, consuming resources unnecessarily. As another aside, let's suppose you did have multiple lines. Instead of using for line in f.readlines(), use the following construct:

    with open(fname) as f:
        for line in f:
            do_stuff(line)

Since you don't need to keep the whole file, and only need to inspect each line, don't use readlines(). Only use readlines() if you want to keep a list of lines around, e.g. lines = f.readlines().
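As a rough illustration of the two points above (the file is closed once the with block ends, and the file object can be iterated line by line), here is a minimal sketch. The temporary file and its contents are made up for the example; they are not part of the original question.

```python
import os
import tempfile

# Create a throwaway two-line file so the example is self-contained
# (hypothetical sample data, not the asker's real file).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("assaf spanier, assaf din\ndavid levi, amnon levi\n")
    fname = tmp.name

# Iterate over the file object directly: one line at a time, no list of lines built.
line_lengths = []
with open(fname) as f:
    for line in f:
        line_lengths.append(len(line.rstrip("\n")))

# Leaving the with block closed the file for us.
closed_after_with = f.closed

os.remove(fname)
print(line_lengths, closed_after_with)
```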
OK, so finally, the data looks like this:

    >>> print(data)
    assaf spanier, assaf din, yo9ssi levi, yoram bibe9rman, david levi, bibi netanyahu, amnon levi, ehud spanier, barak spa7nier, sara neta4nyahu

OK, if you want to use regex here, I suggest the following approach:
    >>> names_regex = re.compile(r"^(\D+)\s(\D+)$")

The pattern here, ^(\D+)\s(\D+)$, uses the non-digit group, \D (the opposite of \d, the digit group), and the white-space group, \s. It also uses anchors, ^ and $, to anchor the pattern to the beginning and end of the text respectively. The parentheses create capturing groups, which we will leverage. Try copy-pasting this into http://regexr.com/ and play around with it if you still don't understand.

One important note: use raw strings, i.e. r"this is a raw string", versus normal strings, "this is a normal string" (notice the r). This is because Python strings use some of the same escape characters as regex patterns. This will help you maintain your sanity.

OK, finally, I suggest using the grouping idiom, with a dict:
    >>> grouper = {}

Now, our loop:

    >>> for fullname in data.split(','):
    ...     match = names_regex.search(fullname.strip())
    ...     if match:
    ...         first, last = match.group(1), match.group(2)
    ...         grouper.setdefault(last.title(), []).append(first.title())
    ...

Note, I used the .title method to normalize all our names to "Titlecase". dict.setdefault takes a key as its first argument, and if the key doesn't exist, it sets the second argument as the value, and returns it. So, I am checking if the last name, in title-case, exists in the grouper dict, and if not, setting it to an empty list, [], then appending to whatever is there!
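To see which inputs the ^(\D+)\s(\D+)$ pattern accepts and how setdefault groups them, here is a small sketch. The helper name split_name and the sample list are my own, chosen only for illustration.

```python
import re

names_regex = re.compile(r"^(\D+)\s(\D+)$")

def split_name(fullname):
    """Return (first, last) in title case, or None when the pattern does not
    match (for instance because the name contains a digit).
    Hypothetical helper, not part of the original answer."""
    match = names_regex.search(fullname.strip())
    if match is None:
        return None
    return match.group(1).title(), match.group(2).title()

# Made-up samples: mixed case, stray whitespace, and names with digits.
samples = ["assaf spanier", " Ehud SPANIER", "yo9ssi levi", "barak spa7nier"]

grouper = {}
for name in samples:
    parsed = split_name(name)
    if parsed is not None:
        first, last = parsed
        # setdefault returns the existing list, or inserts and returns a new one
        grouper.setdefault(last, []).append(first)

print(grouper)  # {'Spanier': ['Assaf', 'Ehud']}
```

One thing worth knowing: \D matches any non-digit, including spaces, so a three-word name such as "mary ann smith" is split at the last whitespace, giving "Mary Ann" as the first name and "Smith" as the last name.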
Now pretty-printing for clarity:

    >>> from pprint import pprint
    >>> pprint(grouper)
    {'Din': ['Assaf'],
     'Levi': ['David', 'Amnon'],
     'Netanyahu': ['Bibi'],
     'Spanier': ['Assaf', 'Ehud']}

This is a very useful data structure. We can, for example, get all the last names with more than a single first name:
    >>> for last, firsts in grouper.items():
    ...     if len(firsts) > 1:
    ...         print(last)
    ...
    Spanier
    Levi

So, putting it all together:
    >>> grouper = {}
    >>> names_regex = re.compile(r"^(\D+)\s(\D+)$")
    >>> for fullname in data.split(','):
    ...     match = names_regex.search(fullname.strip())
    ...     if match:
    ...         first, last = match.group(1), match.group(2)
    ...         first, last = first.title(), last.title()
    ...         print(first)
    ...         grouper.setdefault(last, []).append(first)
    ...
    Assaf
    Assaf
    David
    Bibi
    Amnon
    Ehud
    >>> for last, firsts in grouper.items():
    ...     if len(firsts) > 1:
    ...         print(last)
    ...
    Spanier
    Levi

Note, I have assumed order doesn't matter, so I used a normal dict. My output happens to be in the correct order because on Python 3.6, dicts are ordered! In 3.6 that is an implementation detail of CPython and not a guarantee, so don't rely on it there; insertion order only became part of the language specification in Python 3.7. Use collections.OrderedDict if you want to guarantee order on older versions.
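If you do want the order guaranteed regardless of how the interpreter implements dict, the same logic can be sketched with collections.OrderedDict. The function name report_names and its return shape are my own choices for this example, not something the question requires.

```python
import re
from collections import OrderedDict

NAMES_REGEX = re.compile(r"^(\D+)\s(\D+)$")

def report_names(data):
    """Return (first_names, duplicated_last_names) for a comma-separated string.
    Hypothetical helper wrapping the steps shown in the session above."""
    grouper = OrderedDict()  # keeps last names in first-seen order
    first_names = []
    for fullname in data.split(","):
        match = NAMES_REGEX.search(fullname.strip())
        if match:  # names containing a digit do not match and are skipped
            first, last = match.group(1).title(), match.group(2).title()
            first_names.append(first)
            grouper.setdefault(last, []).append(first)
    duplicated = [last for last, firsts in grouper.items() if len(firsts) > 1]
    return first_names, duplicated

data = ("assaf spanier, assaf din, yo9ssi levi, yoram bibe9rman, david levi, "
        "bibi netanyahu, amnon levi, ehud spanier, barak spa7nier, sara neta4nyahu")

first_names, duplicated = report_names(data)
print("\n".join(first_names))
print("========")
print("\n".join(duplicated))
```

Returning the two lists instead of printing inside the loop keeps the parsing separate from the output format, so the separator line can be changed without touching the regex logic.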