Python - How do I filter out .mp3 links using BeautifulSoup from (possibly) broken HTML? (JSON)


I want to build a small tool to help a family member download podcasts off a site.

In order to get the links to the files, I first need to filter them out (with bs4 + Python 3). The files are on this website (Estonian): download page. "Laadi alla" = "Download".

So far my code is as follows (mostly taken from examples on Stack Overflow):

from bs4 import BeautifulSoup
import urllib.request
import re

# Fetch the listing page and parse it with the lxml parser.
url = urllib.request.urlopen("http://vikerraadio.err.ee/listing/mystiline_venemaa#?page=1&pagesize=902&phrase=&from=&to=&path=mystiline_venemaa&showall")
content = url.read()
soup = BeautifulSoup(content, "lxml")

# Collect the href of every <a> tag whose link points to an .mp3 file.
links = [a['href'] for a in soup.find_all('a', href=re.compile(r'http.*\.mp3'))]
print("links:", links)

Unfortunately, I get only 2 results. Output:

links: ['http://heli.err.ee/helid/exp/err_raadiouudised.mp3', 'http://heli.err.ee/helid/exp/err_raadiouudised.mp3'] 

These are not the ones I want. My best guess is that the page has broken HTML and bs4 / the parser is not able to find anything else. I've tried different parsers, resulting in no change. Maybe I'm doing something else wrong too.

My goal is to have the individual links in a list, for example. I'll filter out any duplicates / unwanted entries later myself.
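Removing duplicates from such a list while keeping the original order can be sketched as follows. The `links` value simply repeats the output shown above, and `unique_links` is a name chosen for illustration.

```python
# The two identical entries from the output above.
links = [
    "http://heli.err.ee/helid/exp/err_raadiouudised.mp3",
    "http://heli.err.ee/helid/exp/err_raadiouudised.mp3",
]

# dict keys are unique and keep insertion order, so this drops
# repeated links without reordering the remaining ones.
unique_links = list(dict.fromkeys(links))
print("links:", unique_links)
```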

Just a quick note, in case: this is a public radio station and all the content is legally hosted.

My new code is:

for link in soup.find_all('d2p1:downloadurl'):
    print(link.text)

I am unsure if the tag is selected correctly.

None of the examples listed in this question are working. See the answer below for working code.

Please be aware that the listings on the page are interfaced through an API. Instead of requesting the HTML page, I suggest you request the API link, which has 200 .mp3 links.
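One way to see why the HTML request cannot return the listing: in the page URL from the question, the paging parameters come after a `#`, which makes them a URL fragment. A fragment is not sent to the server, so it is presumably only read by the page's own script in the browser. A minimal sketch using the standard library (the variable names are illustrative):

```python
from urllib.parse import urlsplit

# The page URL used in the question.
page_url = (
    "http://vikerraadio.err.ee/listing/mystiline_venemaa"
    "#?page=1&pagesize=902&phrase=&from=&to=&path=mystiline_venemaa&showall"
)

parts = urlsplit(page_url)
print("path:    ", parts.path)      # the part that is actually requested
print("query:   ", parts.query)     # empty: the "?" comes after the "#"
print("fragment:", parts.fragment)  # the paging parameters, never sent to the server
```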

Please follow the steps below:

  1. Request the API link, not the HTML page link.
  2. Check the response; it's JSON. Extract the fields you need.
  3. Help your family, any time :)

Solution

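import json
import requests

# Request the JSON API behind the listing page instead of the HTML page.
myurl = 'http://vikerraadio.err.ee/api/listing/bypath?path=mystiline_venemaa&page=1&pagesize=200&phrase=&from=&to=&showall=false'
r = requests.get(myurl)
abc = json.loads(r.text)

# Map each download URL to the header (title) of its listing.
# Key names are kept as given here; adjust their casing to match the actual response.
all_mp3 = {}
for lstngs in abc['listitems']:
    for asd in lstngs['podcasts']:
        all_mp3[asd['downloadurl']] = lstngs['header']

print(all_mp3)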

all_mp3 is what you need. all_mp3 is a dictionary with the download URLs as keys and the mp3 names as values.
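From a dictionary of that shape, the actual downloading could be sketched roughly as below. This is only an illustration under assumptions: `filename_for`, `download_all` and the `podcasts` target directory are hypothetical names, and only the standard library is used. The URL's own file name is appended to the title because several podcasts may share one header.

```python
import re
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import urlsplit


def filename_for(url, title):
    """Build a filesystem-safe file name from the title and the URL's file name."""
    url_path = Path(urlsplit(url).path)
    suffix = url_path.suffix or ".mp3"
    # Replace every run of characters other than letters, digits, "_" and "-".
    safe = re.sub(r"[^\w-]+", "_", title).strip("_")
    if not safe:
        return url_path.stem + suffix
    return f"{safe}_{url_path.stem}{suffix}"


def download_all(all_mp3, target_dir="podcasts"):
    """Download every URL in the {url: title} mapping into target_dir."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    for url, title in all_mp3.items():
        destination = target / filename_for(url, title)
        if destination.exists():
            continue  # already fetched on an earlier run
        # Stream the response to disk instead of holding it in memory.
        with urllib.request.urlopen(url) as response, open(destination, "wb") as out:
            shutil.copyfileobj(response, out)
```

Calling `download_all(all_mp3)` would then fetch each file once; nothing is downloaded until that call is made.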

