Web-scraping JavaScript page with Python
I'm trying to develop a simple web scraper. I want to extract the text without the HTML code. In fact, I have achieved this goal, but I have seen that on some pages where JavaScript is loaded I didn't obtain good results.
For example, if some JavaScript code adds some text, I can't see it, because when I call
response = urllib2.urlopen(request)
I get the original text without the added one (because JavaScript is executed in the client).
So, I'm looking for some ideas to solve this problem.
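To make the limitation concrete, here is a minimal sketch of the situation described above, written for Python 3 (where `urllib.request.urlopen` replaces `urllib2.urlopen`) and using only the standard library. The names `TextExtractor`, `html_to_text`, `fetch_html` and the `SAMPLE` page are hypothetical, made up for illustration, and the UTF-8 decoding is an assumption about the page. It strips the markup and skips script contents, but since nothing executes the script, the text it would add never appears.

```python
from html.parser import HTMLParser
from urllib.request import urlopen  # Python 3 counterpart of urllib2.urlopen


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html):
    """Return the visible text of an HTML string, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def fetch_html(url):
    """Download the raw HTML; no JavaScript is executed here.

    Assumes the page is UTF-8 encoded (an assumption, not checked).
    """
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


# Hypothetical page: in a browser the script would append extra text.
SAMPLE = """<html><body>
<p id="msg">static text</p>
<script>document.getElementById('msg').innerHTML += ' plus added text';</script>
</body></html>"""

print(html_to_text(SAMPLE))
```

On a real page the call would be `html_to_text(fetch_html(url))`. The result is the same kind of output either way: only the text present in the downloaded source, never the text that JavaScript adds in the client.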
You can use the Python library dryscrape to scrape JavaScript-driven websites.
Example
To give an example, I created a sample page with the following HTML code (link):
<!doctype html>
<html>
<head>
  <meta charset="utf-8">
  <title>javascript scraping test</title>
</head>
<body>
  <p id='intro-text'>no javascript support</p>
  <script>
    document.getElementById('intro-text').innerHTML = 'yay! supports javascript';
  </script>
</body>
</html>
Without JavaScript it says: no javascript support
and with JavaScript: yay! supports javascript
Scraping without JS support:
>>> import requests
>>> from bs4 import BeautifulSoup
>>> response = requests.get(my_url)
>>> soup = BeautifulSoup(response.text, "html.parser")
>>> soup.find(id="intro-text")
<p id="intro-text">no javascript support</p>
Scraping with JS support:
>>> import dryscrape
>>> from bs4 import BeautifulSoup
>>> session = dryscrape.Session()
>>> session.visit(my_url)
>>> response = session.body()
>>> soup = BeautifulSoup(response, "html.parser")
>>> soup.find(id="intro-text")
<p id="intro-text">yay! supports javascript</p>
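A related check that may be useful is whether a page needs a JavaScript-capable scraper at all. One rough way is to compare an element's text in the raw HTML with its text in the rendered HTML (for example, the string returned by `session.body()`). The sketch below uses only the standard library so it runs without extra packages. `ElementTextFinder`, `element_text` and `needs_javascript` are hypothetical names, and the two HTML strings are hardcoded stand-ins taken from the sample page output, not fetched from anywhere. It is an illustration under those assumptions, not part of dryscrape.

```python
from html.parser import HTMLParser


class ElementTextFinder(HTMLParser):
    """Collect the text inside the first element that has the given id.

    Simplification: assumes the matched element has a closing tag
    (void elements such as <img> are not handled).
    """

    def __init__(self, element_id):
        super().__init__()
        self.element_id = element_id
        self._tag = None    # tag name of the matched element
        self._depth = 0     # open same-named tags, counting the match itself
        self._done = False  # True once the matched element has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self._done:
            return
        if self._tag is None:
            if dict(attrs).get("id") == self.element_id:
                self._tag = tag
                self._depth = 1
        elif tag == self._tag:
            self._depth += 1

    def handle_endtag(self, tag):
        if self._tag is not None and not self._done and tag == self._tag:
            self._depth -= 1
            if self._depth == 0:
                self._done = True

    def handle_data(self, data):
        if self._tag is not None and not self._done:
            self.parts.append(data)


def element_text(html, element_id):
    """Return the stripped text of the element with this id, or None."""
    finder = ElementTextFinder(element_id)
    finder.feed(html)
    finder.close()
    if finder._tag is None:
        return None
    return "".join(finder.parts).strip()


def needs_javascript(static_html, rendered_html, element_id):
    """True if the element's text differs between raw and rendered HTML."""
    return element_text(static_html, element_id) != element_text(
        rendered_html, element_id
    )


# Stand-ins for the two responses shown above (hardcoded, not fetched).
STATIC_HTML = "<body><p id='intro-text'>no javascript support</p></body>"
RENDERED_HTML = "<body><p id='intro-text'>yay! supports javascript</p></body>"

print(element_text(STATIC_HTML, "intro-text"))
print(element_text(RENDERED_HTML, "intro-text"))
print(needs_javascript(STATIC_HTML, RENDERED_HTML, "intro-text"))
```

If the two texts match for the elements you care about, plain `requests` is probably enough and the heavier browser-backed session can be skipped.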