Trying to parse a large XML file in Python - Memory Errors
So I'm a beginner 'scraper' without a whole truckload of programming experience.
I'm using Python, in the Canopy environment, to scrape downloaded XML files with the xml.dom parser. For now I'm only trying to scrape the tags from the first us-bibliographic-data-grant (which is why I'm using [0]) to see how I want to parse and store the entire dataset, rather than doing it all at once. An excerpt of the XML looks like this:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v42-2006-08-23.dtd" [ ]>
<us-patent-grant lang="en" dtd-version="v4.2 2006-08-23" file="usd0606726-20091229.xml" status="production" id="us-patent-grant" country="us" date-produced="20091214" date-publ="20091229">
<us-bibliographic-data-grant>
<publication-reference>
<document-id>
<country>us</country>
<doc-number>d0606726</doc-number>
<kind>s1</kind>
<date>20091229</date>
</document-id>
</publication-reference>
<application-reference appl-type="design">
<document-id>
<country>us</country>
<doc-number>29299001</doc-number>
<date>20071217</date>
My code so far looks like this:
from xml.dom import minidom

filename = "c:/users/smolensk/documents/inventor research/xml_2009/ipg091229.xml"

f = open(filename, 'r')
doc = f.read()
f.close()

xmldata = '<root>' + doc + '</root>'

data = minidom.parse(xmldata)

us_biblio = xmldata.getElementsByTagName("us-bibliographic-data-grant")[0]
pat_num = us_biblio.getElementsByTagName("doc-number")[0]
dates = pat_num.getElementsByTagName("date")

for date in dates:
    print(date)
Now I have gotten messages about memory errors after the code ran. It has only been able to run once, and unfortunately I was unable to jot down what happened. Due to the high load of data (this file alone is 4.6 million lines), the operation crashes every time and I'm unable to replicate the errors.
Is there anything you can see that is wrong with the code? Is the code parsing the entire dataset before it starts storing each tag name, and might there be a way to parse only a certain amount? Perhaps I could make a new XML file with just the first set.
If you're wondering, I used the <root> wrapper to bypass the issue of
ExpatError: junk after document element: line xxx
that I was getting beforehand. I know my coding skills aren't amazing, so hopefully I did not make a simple and disgusting programming error.
Try:
with open(filename, 'r') as f:
    data = minidom.parse(f)
If you need the <root> tag you may need to mess around a bit, maybe:
import itertools
data = minidom.parse(itertools.chain('<root>', f, '</root>'))
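One caveat on the chain idea: minidom.parse expects a filename or a file-like object with a read() method, so an itertools.chain object would probably not be accepted as-is. If the file is really many XML documents concatenated together (an assumption, though the "junk after document element" error points that way), wrapping it in <root> may not help either, because an XML declaration is only allowed at the very start of a document. A rough sketch of another route is to stream the file line by line and parse one document at a time, so only a single document is in memory. The helper names iter_documents and first_doc_numbers are made up for illustration, and the sketch assumes every document begins with an `<?xml` declaration at the start of a line:

```python
from xml.dom import minidom


def iter_documents(lines):
    """Yield each concatenated XML document as one string.

    Assumption: every document starts with an '<?xml' declaration
    at the beginning of a line.
    """
    buffer = []
    for line in lines:
        if line.startswith("<?xml"):
            # A new declaration closes the previous document, if any.
            if buffer:
                yield "".join(buffer)
            buffer = []
        # Skip blank lines that appear before a document has started.
        if buffer or line.strip():
            buffer.append(line)
    if buffer:
        yield "".join(buffer)


def first_doc_numbers(lines, limit=1):
    """Parse at most `limit` documents and return each one's first doc-number."""
    results = []
    for count, text in enumerate(iter_documents(lines)):
        if count >= limit:
            break
        dom = minidom.parseString(text)
        biblio = dom.getElementsByTagName("us-bibliographic-data-grant")[0]
        doc_number = biblio.getElementsByTagName("doc-number")[0]
        results.append(doc_number.firstChild.data)
        dom.unlink()  # release the DOM tree before parsing the next document
    return results
```

With the question's file this could be called as `with open(filename, 'r') as f: print(first_doc_numbers(f, limit=1))`. Since the generator stops after `limit` documents, the rest of the 4.6 million lines is never read, which is close to the "parse only a certain amount" idea from the question.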