Web-scraping JavaScript page with Python


I'm trying to develop a simple web scraper. I want to extract the text without the HTML code. In fact, I achieve this goal, but I have seen that on some pages where JavaScript is loaded I don't obtain good results.

For example, if some JavaScript code adds some text, I can't see it, because when I call

response = urllib2.urlopen(request) 

I get the original text without the added one (because JavaScript is executed in the client).

So, I'm looking for some ideas to solve this problem.

You can use the Python library dryscrape to scrape JavaScript-driven websites.

Example

To give an example, I created a sample page with the following HTML code:

<!doctype html>
<html>
<head>
  <meta charset="utf-8">
  <title>javascript scraping test</title>
</head>
<body>
  <p id='intro-text'>no javascript support</p>
  <script>
     document.getElementById('intro-text').innerHTML = 'yay! supports javascript';
  </script>
</body>
</html>

Without JavaScript it says: no javascript support, and with JavaScript: yay! supports javascript
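The reason a plain fetch only ever sees the first string can be sketched offline. The snippet below is a minimal illustration, not part of the original answer: it embeds a trimmed copy of the sample page as a string and parses it with the standard library's html.parser, which (like urllib2 or requests) never executes the script. The names SAMPLE_HTML, IntroTextParser and static_intro_text are made up for this example.

```python
from html.parser import HTMLParser

# Trimmed copy of the sample page (assumption: same markup as above).
SAMPLE_HTML = """<!doctype html>
<html>
<body>
  <p id='intro-text'>no javascript support</p>
  <script>
     document.getElementById('intro-text').innerHTML = 'yay! supports javascript';
  </script>
</body>
</html>"""


class IntroTextParser(HTMLParser):
    """Collects the text inside the element whose id is 'intro-text'."""

    def __init__(self):
        super().__init__()
        self.inside = False
        self.text = ""

    def handle_starttag(self, tag, attrs):
        # Start collecting when the target element opens.
        if dict(attrs).get("id") == "intro-text":
            self.inside = True

    def handle_endtag(self, tag):
        # The target element is a <p>, so stop at its closing tag.
        if tag == "p":
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.text += data


def static_intro_text(html):
    """Return the intro text as a non-JavaScript client would see it."""
    parser = IntroTextParser()
    parser.feed(html)
    return parser.text


print(static_intro_text(SAMPLE_HTML))
```

The script body is treated as inert text by the parser, so the paragraph keeps its original content; only something that actually runs the JavaScript can produce the second string.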

Scraping without JS support:

>>> import requests
>>> from bs4 import BeautifulSoup
>>> response = requests.get(my_url)
>>> soup = BeautifulSoup(response.text)
>>> soup.find(id="intro-text")
<p id="intro-text">no javascript support</p>

Scraping with JS support:

>>> import dryscrape
>>> from bs4 import BeautifulSoup
>>> session = dryscrape.Session()
>>> session.visit(my_url)
>>> response = session.body()
>>> soup = BeautifulSoup(response)
>>> soup.find(id="intro-text")
<p id="intro-text">yay! supports javascript</p>
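Since the question asks for the text without the HTML code, the rendered markup still has to be reduced to plain text. The following is a rough sketch of one way to do that with only the standard library, assuming the HTML string has already been obtained (for instance from session.body()). VisibleTextParser and visible_text are hypothetical names, and the approach is deliberately simple: it drops script and style contents and joins the remaining text chunks with spaces.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Gathers text nodes, ignoring anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a skipped element
        self.chunks = []      # non-empty text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def visible_text(html):
    """Return the text content of an HTML string with tags removed."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


rendered = ('<body><p id="intro-text">yay! supports javascript</p>'
            '<script>var x = 1;</script></body>')
print(visible_text(rendered))
```

Text in the head, such as the page title, is kept by this sketch. With BeautifulSoup already imported, soup.get_text() is a shorter alternative, though it includes script contents unless those elements are removed first.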
