python - How do I filter out .mp3 links using beautifulsoup from (possibly) broken html? (JSON) -

August 15, 2015

I want to build a small tool to help a family member download podcasts off a site.

In order to get the links to the files, I first need to filter them out (with bs4 + Python 3). The files are on this website (in Estonian): download page ("Laadi alla" = "Download").

So far my code is as follows (mostly taken from examples on Stack Overflow):

    from bs4 import BeautifulSoup
    import urllib.request
    import re

    url = urllib.request.urlopen("http://vikerraadio.err.ee/listing/mystiline_venemaa#?page=1&pagesize=902&phrase=&from=&to=&path=mystiline_venemaa&showall")
    content = url.read()
    soup = BeautifulSoup(content, "lxml")

    # Collect the href of every <a> tag whose link ends up at an .mp3 file
    links = [a['href'] for a in soup.find_all('a', href=re.compile(r'http.*\.mp3'))]
    print("links:", links)

Unfortunately, this gives only 2 results. Output:

links: ['http://heli.err.ee/helid/exp/err_raadiouudised.mp3', 'http://heli.err.ee/helid/exp/err_raadiouudised.mp3']

These are not the ones I want. My best guess is that the page has broken HTML and bs4 / the parser is not able to find anything else. I've tried different parsers, resulting in no change. Maybe I'm doing something else wrong too.

My goal is to have the individual links in a list, as in the example. I'll filter out any duplicates / unwanted entries later myself.
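
For the later de-duplication step, one possible approach is to rely on dictionaries keeping insertion order (Python 3.7+). This is only a minimal sketch: the helper name unique_links is made up for illustration, and the sample data is the pair of identical links from the output above.

```python
def unique_links(links):
    """Return the links with duplicates removed, keeping first-seen order."""
    # dict keys are unique and preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(links))


sample = [
    'http://heli.err.ee/helid/exp/err_raadiouudised.mp3',
    'http://heli.err.ee/helid/exp/err_raadiouudised.mp3',
]
print("links:", unique_links(sample))
```

Unlike converting to a set, this keeps the links in the order they appeared on the page.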

Just a quick note, just in case: this is a public radio station and the content is legally hosted.

My new code is:

    for link in soup.find_all('d2p1:downloadurl'):
        print(link.text)

I am unsure if the tag is selected correctly.

None of the examples listed in this question are working. See the answer below for working code.

Please be aware that the listings on the page are interfaced through an API. Instead of requesting the HTML page, I suggest you request the API link, which returns 200 .mp3 links.
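
A likely reason the HTML request misses the links (an assumption, not verified against the site): in the URL from the question, everything after the `#` is a fragment, which the client keeps to itself and does not send to the server. The paging parameters therefore never reach the server, and the page presumably fills in the listing through the API afterwards. A small sketch with the standard library shows the difference between the two URLs:

```python
from urllib.parse import urlsplit, parse_qs

# URL used in the question: the parameters sit behind '#'
page_url = ("http://vikerraadio.err.ee/listing/mystiline_venemaa"
            "#?page=1&pagesize=902&phrase=&from=&to=&path=mystiline_venemaa&showall")

parts = urlsplit(page_url)
print("path:    ", parts.path)      # the only part the server is asked for
print("query:   ", parts.query)     # empty, there is no real query string
print("fragment:", parts.fragment)  # stays on the client side

# API URL from the answer: the parameters are a real query string
api_url = ("http://vikerraadio.err.ee/api/listing/bypath"
           "?path=mystiline_venemaa&page=1&pagesize=200&phrase=&from=&to=&showall=false")
api_params = parse_qs(urlsplit(api_url).query)
print("pagesize:", api_params["pagesize"])
```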

Please follow the steps below:

Request the API link, not the HTML page link.
Check the response; it's JSON. Extract the fields you need.
Help your family, any time :)

Solution

    import json

    import requests

    myurl = 'http://vikerraadio.err.ee/api/listing/bypath?path=mystiline_venemaa&page=1&pagesize=200&phrase=&from=&to=&showall=false'
    r = requests.get(myurl)
    abc = json.loads(r.text)

    # Map each download URL to the header (title) of its listing.
    # Key names are written as they appear in this post; check the actual
    # response for their exact capitalization.
    all_mp3 = {}
    for lstngs in abc['listitems']:
        for asd in lstngs['podcasts']:
            all_mp3[asd['downloadurl']] = lstngs['header']

    all_mp3

all_mp3 is what you need. all_mp3 is a dictionary with the download URLs as keys and the mp3 names as values.
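
Since the goal is to download the podcasts, a dictionary shaped like all_mp3 (URL to title) could then be saved to disk roughly as follows. This is a sketch under assumptions, not part of the original answer: the names filename_for and download_all, the "podcasts" target directory, and the example entry are all hypothetical, and the download uses requests in streaming mode.

```python
import os
import re
from urllib.parse import urlsplit


def filename_for(url, title):
    """Build a filesystem-friendly file name from a title and its URL."""
    # Keep the extension from the URL path, falling back to .mp3
    ext = os.path.splitext(urlsplit(url).path)[1] or ".mp3"
    # Replace runs of characters other than word chars, space, dot, dash
    safe = re.sub(r"[^\w .-]+", "_", title).strip(" ._")
    return (safe or "episode") + ext


def download_all(all_mp3, target_dir="podcasts"):
    """Download every URL in all_mp3 (url -> title) into target_dir."""
    import requests  # third-party; imported here so filename_for works without it

    os.makedirs(target_dir, exist_ok=True)
    for url, title in all_mp3.items():
        path = os.path.join(target_dir, filename_for(url, title))
        # Stream the body so large files are not held in memory at once
        with requests.get(url, stream=True, timeout=30) as resp:
            resp.raise_for_status()
            with open(path, "wb") as fh:
                for chunk in resp.iter_content(chunk_size=65536):
                    fh.write(chunk)


# Hypothetical entry, for illustration only
example = {"http://example.com/audio/episode1.mp3": "Episode 1: Intro/Part A"}
for url, title in example.items():
    print(filename_for(url, title))
```

Calling download_all(all_mp3) would then fetch the files one by one. Note that two entries with the same title would map to the same file name, so the later one would overwrite the earlier one.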
