Python - finding duplicate words in a string and printing them using re


I need to print the duplicated last names in a text file (lower case and upper case should be treated the same). The program should not print words that contain numbers (i.e. if a number appears in the last name or in the first name, the whole name is ignored).

For example, the text file:

assaf spanier, assaf din, yo9ssi levi, yoram bibe9rman, david levi, bibi netanyahu, amnon levi, ehud spanier, barak spa7nier, sara neta4nyahu 

The output should be:

assaf  assaf  david  bibi  amnon  ehud  ========  spanier  levi 

    import re

    def delete_numbers(line):
        words = re.sub(r'\w*\d\w*', '', line).strip()
        for t in re.split(r',', words):
            if len(t.split()) == 1:
                words = re.sub(t, '', words)
                words = re.sub(',,', '', words)
        return words

    fname = input("Enter file name: ")
    file = open(fname, "r")
    for line in file.readlines():
        words = delete_numbers(line)
        first_name = re.findall(r"([a-zA-Z]+)\s", words)
        for i in first_name:
            print(i)
        print("***")

    a = ""
    for t in re.split(r',', words):
        a += (", ".join(t.split()[1:])) + " "

OK, first let's start with an aside: opening files in an idiomatic way. Use the with statement, which guarantees that the file will be closed. For small scripts this isn't a big deal, but if you ever start writing longer-lived programs, resource leaks due to incorrectly closed files can come back to haunt you. Since your file has everything on a single line:

    with open(fname) as f:
        data = f.read()

The file is now closed. This also encourages you to deal with your file immediately, and not leave it open consuming resources unnecessarily. As another aside, let's suppose you did have multiple lines. Instead of using for line in f.readlines(), use the following construct:

    with open(fname) as f:
        for line in f:
            do_stuff(line)

Since you don't need to keep the whole file, and only need to inspect each line, don't use readlines(). Only use readlines() if you want to keep a list of lines around, e.g. lines = f.readlines().
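To compare the two reading styles side by side, here is a minimal sketch. It writes a throwaway file first so that it can run on its own; the sample contents and the use of tempfile are assumptions made for illustration, not part of the original question.

```python
import os
import tempfile

# Hypothetical sample content; a real file would hold the names from the question.
sample = "assaf spanier, assaf din\ndavid levi, amnon levi\n"

# Write a throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)
    fname = tmp.name

# Lazy iteration: only one line is held in memory at a time.
line_lengths = []
with open(fname) as f:
    for line in f:
        line_lengths.append(len(line))

# readlines(): the whole file is kept around as a list of lines.
with open(fname) as f:
    lines = f.readlines()

# The with block has already closed the file by this point.
file_is_closed = f.closed

os.remove(fname)  # clean up the throwaway file
```

Both approaches see the same lines; the difference is only whether the full list is kept after the loop.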

OK, finally, data looks like this:

    >>> print(data)
    assaf spanier, assaf din, yo9ssi levi, yoram bibe9rman, david levi, bibi netanyahu, amnon levi, ehud spanier, barak spa7nier, sara neta4nyahu

OK, so if you want to use regex here, I suggest the following approach:

    >>> import re
    >>> names_regex = re.compile(r"^(\D+)\s(\D+)$")

The pattern here, ^(\D+)\s(\D+)$, uses the non-digit group, \D (the opposite of \d, the digit group), and the white-space group, \s. It also uses the anchors ^ and $ to anchor the pattern to the beginning and end of the text respectively. The parentheses create capturing groups, which we will leverage. Try copy-pasting this into http://regexr.com/ and play around with it if you still don't understand. One important note: use raw strings, i.e. r"this is a raw string", rather than normal strings, "this is a normal string" (notice the r). This is because Python strings use some of the same escape characters as regex patterns, and raw strings will help you maintain your sanity. OK, so finally, I suggest using the grouping idiom, with a dict:

>>> grouper = {} 
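Before the loop, a quick check of the non-digit pattern described above may help. This is only a sketch: the three sample names are taken from the question's data, and the results dict is a name introduced here for illustration.

```python
import re

# Two runs of non-digits separated by one whitespace character, anchored at both ends.
names_regex = re.compile(r"^(\D+)\s(\D+)$")

# Sample inputs taken from the question's data.
candidates = ["assaf spanier", "yo9ssi levi", "yoram bibe9rman"]

results = {}
for name in candidates:
    match = names_regex.search(name)
    # match is None whenever a digit appears anywhere in the name
    results[name] = match.groups() if match else None

print(results)
# {'assaf spanier': ('assaf', 'spanier'), 'yo9ssi levi': None, 'yoram bibe9rman': None}

# Raw strings keep backslashes literal: "\b" is a single backspace character,
# while r"\b" is two characters (backslash and b), which regex reads as a word boundary.
print(len("\b"), len(r"\b"))  # 1 2
```

Note that \D matches any non-digit, including spaces, so a three-word name such as "a b c" would also match, with the first group capturing "a b".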

Now, our loop:

    >>> for fullname in data.split(','):
    ...     match = names_regex.search(fullname.strip())
    ...     if match:
    ...         first, last = match.group(1), match.group(2)
    ...         grouper.setdefault(last.title(), []).append(first.title())
    ...

Note, I used the .title method to normalize all our names to "Titlecase". dict.setdefault takes a key as its first argument, and if the key doesn't exist, it sets the second argument as the value and returns it. So, I am checking whether the last name, in title case, exists in the grouper dict, and if not, setting it to an empty list, [], and then appending to whatever is there!
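To make the setdefault behaviour concrete, here is a small sketch on hand-written pairs (the pairs list is made up for illustration). It also shows collections.defaultdict, which the answer does not use, purely as a comparison for the same grouping idiom.

```python
from collections import defaultdict

# Hypothetical (last name, first name) pairs, already title-cased.
pairs = [("Spanier", "Assaf"), ("Levi", "David"), ("Levi", "Amnon")]

grouper = {}
for last, first in pairs:
    # setdefault returns the existing list for this key,
    # or inserts a new empty list and returns that.
    grouper.setdefault(last, []).append(first)

# Same grouping with defaultdict(list): missing keys get [] automatically.
grouped = defaultdict(list)
for last, first in pairs:
    grouped[last].append(first)

print(grouper)                   # {'Spanier': ['Assaf'], 'Levi': ['David', 'Amnon']}
print(dict(grouped) == grouper)  # True
```

Either form works; setdefault keeps the result a plain dict, while defaultdict avoids building a throwaway [] on every call.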

Now, pretty-printing for clarity:

    >>> from pprint import pprint
    >>> pprint(grouper)
    {'Din': ['Assaf'],
     'Levi': ['David', 'Amnon'],
     'Netanyahu': ['Bibi'],
     'Spanier': ['Assaf', 'Ehud']}

This is a useful data structure. We can, for example, get all the last names with more than a single first name:

    >>> for last, firsts in grouper.items():
    ...     if len(firsts) > 1:
    ...         print(last)
    ...
    Spanier
    Levi

So, putting it all together:

    >>> grouper = {}
    >>> names_regex = re.compile(r"^(\D+)\s(\D+)$")
    >>> for fullname in data.split(','):
    ...     match = names_regex.search(fullname.strip())
    ...     if match:
    ...         first, last = match.group(1), match.group(2)
    ...         first, last = first.title(), last.title()
    ...         print(first)
    ...         grouper.setdefault(last, []).append(first)
    ...
    Assaf
    Assaf
    David
    Bibi
    Amnon
    Ehud
    >>> for last, firsts in grouper.items():
    ...     if len(firsts) > 1:
    ...         print(last)
    ...
    Spanier
    Levi

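Note, I have assumed order doesn't matter, so I used a normal dict. My output happens to be in the correct order because on Python 3.6 dicts are ordered! But don't rely on this in 3.6, since there it is an implementation detail and not a guarantee; insertion order only became part of the language in Python 3.7. Use collections.OrderedDict if you want to guarantee order on older versions.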
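For reference, the whole approach can be sketched as a standalone script that uses collections.OrderedDict to make the ordering explicit. The function name group_names is hypothetical, the data is inlined instead of read from a file, and the names are lowercased rather than title-cased, an assumption made here only to match the output format shown in the question.

```python
import re
from collections import OrderedDict

NAMES_REGEX = re.compile(r"^(\D+)\s(\D+)$")


def group_names(text):
    """Return (first_names, grouper) for the digit-free names in text."""
    first_names = []
    grouper = OrderedDict()  # keeps last names in first-seen order
    for fullname in text.split(","):
        match = NAMES_REGEX.search(fullname.strip())
        if match:
            # Lowercasing is an assumption, chosen to match the question's output.
            first, last = match.group(1).lower(), match.group(2).lower()
            first_names.append(first)
            grouper.setdefault(last, []).append(first)
    return first_names, grouper


# Data from the question, inlined so the sketch runs without a file.
data = ("assaf spanier, assaf din, yo9ssi levi, yoram bibe9rman, david levi, "
        "bibi netanyahu, amnon levi, ehud spanier, barak spa7nier, sara neta4nyahu")

first_names, grouper = group_names(data)
duplicated = [last for last, firsts in grouper.items() if len(firsts) > 1]

print("  ".join(first_names))  # assaf  assaf  david  bibi  amnon  ehud
print("========")
print("  ".join(duplicated))   # spanier  levi
```

To read from a file instead, replace the inlined data with the contents obtained from a with open(fname) as f: block.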

