Trying to parse large XML file in Python - Memory Errors

April 15, 2013

So I'm a beginner 'scraper' without a whole truckload of programming experience.

I'm using Python, in the Canopy environment, to scrape downloaded XML files using the xml.dom parser. I'm trying to scrape the tags from only the first us-bibliographic-data-grant (which is why I'm using [0]) to see how I want to parse and store the entire dataset, rather than doing it all at once. An excerpt of the XML looks like this:

    <?xml version="1.0" encoding="utf-8"?>
    <!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v42-2006-08-23.dtd" [ ]>
    <us-patent-grant lang="en" dtd-version="v4.2 2006-08-23" file="usd0606726-20091229.xml" status="production" id="us-patent-grant" country="us" date-produced="20091214" date-publ="20091229">
      <us-bibliographic-data-grant>
        <publication-reference>
          <document-id>
            <country>us</country>
            <doc-number>d0606726</doc-number>
            <kind>s1</kind>
            <date>20091229</date>
          </document-id>
        </publication-reference>
        <application-reference appl-type="design">
          <document-id>
            <country>us</country>
            <doc-number>29299001</doc-number>
            <date>20071217</date>

My code so far looks like this:

    from xml.dom import minidom

    filename = "c:/users/smolensk/documents/inventor research/xml_2009/ipg091229.xml"

    f = open(filename, 'r')
    doc = f.read()
    f.close()

    xmldata = '<root>' + doc + '</root>'

    data = minidom.parse(xmldata)

    us_biblio = xmldata.getElementsByTagName("us-bibliographic-data-grant")[0]
    pat_num = us_biblio.getElementsByTagName("doc-number")[0]
    dates = pat_num.getElementsByTagName("date")

    for date in dates:
        print(date)

Now, I have gotten messages about memory errors after the code ran, but it has only been able to run once, and unfortunately I was unable to jot down what happened. Due to the high load of data (this file alone being 4.6 million lines), the operation crashes every time, and I'm unable to replicate the errors.

Is there anything you can see wrong with the code? Is the code parsing the entire dataset before it starts storing each tag name, and if so, might there be a way to parse only a certain amount? Perhaps make a new XML file with just the first set.

If you're wondering, I used the <root> wrapper to bypass the issue of

    ExpatError: junk after document element: line xxx

that I was getting beforehand. I know my coding skills aren't amazing, so hopefully I did not make some simple, disgusting programming error.

Try:

    with open(filename, 'r') as f:
        data = minidom.parse(f)

If you need the <root> tag, you may need to mess around a bit, maybe:

    data = minidom.parse(itertools.chain('<root>', f, '</root>'))
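Since minidom builds the whole tree in memory, another option is to avoid handing it the entire file at all. The "junk after" ExpatError suggests the file holds several XML documents concatenated one after another. Here is a minimal sketch of splitting such a file and parsing only the first document, assuming each document starts with its own `<?xml` line (an assumption about the file layout, not something confirmed above). The names `iter_documents`, `first_doc_numbers`, and `SAMPLE` are made up for illustration, and `SAMPLE` is a tiny stand-in for the real file.

```python
import io
from itertools import islice
from xml.dom import minidom


def iter_documents(lines):
    """Yield each XML document from a concatenated stream as one string.

    Assumption: every document begins with a line starting with '<?xml'.
    Only one document is held in memory at a time.
    """
    buffer = []
    for line in lines:
        # A new declaration marks the start of the next document.
        if line.startswith("<?xml") and buffer:
            yield "".join(buffer)
            buffer = []
        buffer.append(line)
    if buffer:
        yield "".join(buffer)


def first_doc_numbers(lines, limit=1):
    """Parse only the first `limit` documents and return their doc-numbers."""
    numbers = []
    for text in islice(iter_documents(lines), limit):
        dom = minidom.parseString(text)
        biblio = dom.getElementsByTagName("us-bibliographic-data-grant")[0]
        doc_number = biblio.getElementsByTagName("doc-number")[0]
        numbers.append(doc_number.firstChild.data)
        dom.unlink()  # drop the tree before moving on to the next document
    return numbers


# Hypothetical stand-in for the real file: two small concatenated documents.
SAMPLE = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    '<us-patent-grant>\n'
    '<us-bibliographic-data-grant><publication-reference><document-id>\n'
    '<doc-number>d0606726</doc-number><date>20091229</date>\n'
    '</document-id></publication-reference></us-bibliographic-data-grant>\n'
    '</us-patent-grant>\n'
    '<?xml version="1.0" encoding="utf-8"?>\n'
    '<us-patent-grant>\n'
    '<us-bibliographic-data-grant><publication-reference><document-id>\n'
    '<doc-number>d0606727</doc-number><date>20091229</date>\n'
    '</document-id></publication-reference></us-bibliographic-data-grant>\n'
    '</us-patent-grant>\n'
)

print(first_doc_numbers(io.StringIO(SAMPLE), limit=1))
```

With the real file you would pass the open file object instead of the `io.StringIO` stand-in, e.g. `with open(filename, 'r') as f: first_doc_numbers(f)`. Because `islice` stops after `limit` documents, the rest of the file is never read, and no `<root>` wrapper is needed.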
