Python - Finding duplicate words in a string and printing them using re
I need to print the duplicated last names in a text file (lowercase and uppercase should be treated as the same). The program should not print names that contain digits (i.e. if a digit appears in the last name or in the first name, the whole name is ignored).
For example, given this text file:
assaf spanier, assaf din, yo9ssi levi, yoram bibe9rman, david levi, bibi netanyahu, amnon levi, ehud spanier, barak spa7nier, sara neta4nyahu
The output should be:
assaf
assaf
david
bibi
amnon
ehud
========
spanier
levi
import re

def delete_numbers(line):
    # Blank out every word that contains a digit.
    words = re.sub(r'\w*\d\w*', '', line).strip()
    # Drop names that are left with only one word.
    for t in re.split(r',', words):
        if len(t.split()) == 1:
            words = re.sub(t, '', words)
    words = re.sub(',,', '', words)
    return words

fname = input("Enter file name: ")
file = open(fname, "r")
for line in file.readlines():
    words = delete_numbers(line)
    first_name = re.findall(r"([a-zA-Z]+)\s", words)
    for i in first_name:
        print(i)
    print("***")
    a = ""
    for t in re.split(r',', words):
        a += (", ".join(t.split()[1:])) + " "
OK, first let's start with an aside: opening files in an idiomatic way. Use the with statement, which guarantees that the file will be closed. For small scripts this isn't a big deal, but if you ever start writing longer-lived programs, resource leaks due to incorrectly closed files can come back to haunt you. Since your file has everything on a single line:
with open(fname) as f:
    data = f.read()
The file is now closed. This also encourages you to deal with your file immediately, and not leave it open consuming resources unnecessarily. As another aside, let's suppose you did have multiple lines. Instead of using for line in f.readlines(), use the following construct:
with open(fname) as f:
    for line in f:
        do_stuff(line)
Since you don't need to keep the whole file and only need to inspect each line, don't use readlines(). Only use readlines() if you want to keep a list of lines around, e.g. lines = f.readlines().
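The two reading styles above can be sketched as follows. This is a minimal example that writes a throwaway temporary file so it runs anywhere; the file contents and the names path, file_closed and lines_seen are made up for illustration.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("assaf spanier, david levi\nehud spanier, amnon levi\n")
    path = tmp.name

# Read everything at once; the file is closed as soon as the block exits.
with open(path) as f:
    data = f.read()
file_closed = f.closed

# Iterate lazily, one line at a time, without building a list of lines first.
lines_seen = []
with open(path) as f:
    for line in f:
        lines_seen.append(line.strip())

os.remove(path)
print(file_closed)
print(lines_seen)
```

Iterating over the file object directly keeps only one line in memory at a time, which is why it is usually preferred over readlines() for large files.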
OK, finally, your data looks like this:
>>> print(data)
assaf spanier, assaf din, yo9ssi levi, yoram bibe9rman, david levi, bibi netanyahu, amnon levi, ehud spanier, barak spa7nier, sara neta4nyahu
OK, so if you do want to use regex here, I suggest the following approach:
>>> names_regex = re.compile(r"^(\D+)\s(\D+)$")
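As a quick sanity check of what such a pattern accepts and rejects, here is a small sketch. It assumes the intended pattern uses the non-digit class \D (uppercase) for both name parts; the sample names come from the question's data, and the results dict is only there for illustration. It also shows why the raw-string prefix matters.

```python
import re

# Assumed pattern: two runs of non-digit characters separated by whitespace,
# anchored to the start and end of the string.
names_regex = re.compile(r"^(\D+)\s(\D+)$")

samples = ["assaf spanier", "yo9ssi levi", "barak spa7nier"]
results = {}
for s in samples:
    m = names_regex.search(s)
    # groups() returns the captured first and last name; None means no match.
    results[s] = m.groups() if m else None
print(results)

# In a normal string "\b" is a single backspace character;
# in a raw string it stays as a backslash followed by "b".
print(len("\b"), len(r"\b"))
```

Any name containing a digit fails to match, which is exactly the filtering the question asks for.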
The pattern here, ^(\D+)\s(\D+)$, uses the non-digit group, \D (the opposite of \d, the digit group), and the whitespace group, \s. It also uses the anchors ^ and $ to anchor the pattern to the beginning and end of the text respectively. The parentheses create capturing groups, which we will leverage. Try copy-pasting this into http://regexr.com/ and play around with it if you still don't understand. One important note: use raw strings, i.e. r"this is a raw string", versus normal strings, "this is a normal string" (notice the r). This is because Python strings use some of the same escape characters as regex patterns. This will help you maintain your sanity. OK, finally, I suggest using the grouping idiom, with a dict:
>>> grouper = {}
Now, our loop:
>>> for fullname in data.split(','):
...     match = names_regex.search(fullname.strip())
...     if match:
...         first, last = match.group(1), match.group(2)
...         grouper.setdefault(last.title(), []).append(first.title())
...
Note that I used the .title method to normalize our names to title case. dict.setdefault takes a key as its first argument, and if the key doesn't exist, it sets the second argument as the value and returns it. So I am checking whether the last name, in title case, exists in the grouper dict, and if not, setting it to an empty list, [], and then appending to whatever is there!
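To make the setdefault behaviour concrete, here is a minimal sketch in isolation. The names are taken from the example data; the variable firsts is only introduced for illustration.

```python
grouper = {}

# Key is missing: setdefault stores the empty list and returns it.
firsts = grouper.setdefault("Levi", [])
firsts.append("David")

# Key exists: setdefault ignores the default and returns the stored list.
grouper.setdefault("Levi", []).append("Amnon")
grouper.setdefault("Din", []).append("Assaf")

print(grouper)  # {'Levi': ['David', 'Amnon'], 'Din': ['Assaf']}
```

Because setdefault returns the very list stored in the dict, appending to its return value updates the dict in place. collections.defaultdict(list) is a common alternative to this idiom.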
Now, pretty-printing for clarity:
>>> from pprint import pprint
>>> pprint(grouper)
{'Din': ['Assaf'],
 'Levi': ['David', 'Amnon'],
 'Netanyahu': ['Bibi'],
 'Spanier': ['Assaf', 'Ehud']}
This is a useful data structure. We can, for example, get all the last names with more than a single first name:
>>> for last, firsts in grouper.items():
...     if len(firsts) > 1:
...         print(last)
...
Spanier
Levi
So, putting it all together:
>>> grouper = {}
>>> names_regex = re.compile(r"^(\D+)\s(\D+)$")
>>> for fullname in data.split(','):
...     match = names_regex.search(fullname.strip())
...     if match:
...         first, last = match.group(1), match.group(2)
...         first, last = first.title(), last.title()
...         print(first)
...         grouper.setdefault(last, []).append(first)
...
Assaf
Assaf
David
Bibi
Amnon
Ehud
>>> for last, firsts in grouper.items():
...     if len(firsts) > 1:
...         print(last)
...
Spanier
Levi
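Note that I have assumed order doesn't matter, so I used a normal dict. My output happens to be in the correct order because on Python 3.6, dicts are ordered! But don't rely on this, since in 3.6 it is an implementation detail and not a guarantee (insertion order only became part of the language specification in Python 3.7). Use collections.OrderedDict if you want to guarantee order.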
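For reference, here is a self-contained sketch of the same approach with an explicit ordering guarantee. The function name group_by_last_name and the constant NAMES_REGEX are hypothetical, introduced only for this example, and the pattern is assumed to use the non-digit class \D.

```python
import re
from collections import OrderedDict

# Assumed pattern: first and last name made of non-digit characters only.
NAMES_REGEX = re.compile(r"^(\D+)\s(\D+)$")


def group_by_last_name(text):
    """Map each title-cased last name to its first names, in input order."""
    grouper = OrderedDict()
    for fullname in text.split(","):
        match = NAMES_REGEX.search(fullname.strip())
        if match:  # names containing digits do not match and are skipped
            first, last = match.group(1).title(), match.group(2).title()
            grouper.setdefault(last, []).append(first)
    return grouper


data = (
    "assaf spanier, assaf din, yo9ssi levi, yoram bibe9rman, david levi, "
    "bibi netanyahu, amnon levi, ehud spanier, barak spa7nier, sara neta4nyahu"
)

grouper = group_by_last_name(data)
# Last names shared by more than one person, in order of first appearance.
duplicates = [last for last, firsts in grouper.items() if len(firsts) > 1]
print(duplicates)  # ['Spanier', 'Levi']
```

Wrapping the loop in a function makes it easy to apply the same logic to each line of a multi-line file.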