Ruby Nokogiri HTML scraping table with CSS issue
I have an issue scraping an HTML table. Here is the link: https://www.basketball-reference.com/players/c/curryst01/gamelog/2016 (yes, it's the famous introductory Ruby scraping tutorial). Here is the related code:
    doc = Nokogiri::HTML.parse(open(link))

    # biggest table
    big_table = doc.css("table").sort { |x, y| y.css("tr").count <=> x.css("tr").count }.first

    # number of rows: 87, and there are 5 header rows I want to remove
    big_table.css("tr").count

    # doesn't remove the header rows
    big_table = big_table.select { |row| row.css("th").empty? }

As I understand HTML (I know nothing about HTML, and I have only been using Ruby for 4 hours), th is a header cell, td is a standard cell, and tr is a row. My goal is to delete the header rows. Since .empty? returns true if the NodeSet is empty (is a NodeSet the content of a tag?), the last line of code should have returned the tr elements that contain no header. But it doesn't work: in fact the result is [].
Instead, I noticed that this is equal to 5:

    big_table.select { |row| row.css("td").empty? }.count

So I decided to do this:

    big_table = big_table.select { |row| row.css("td").any? }

And it worked well...

My question: why does this line work, and why did my first attempt fail? Maybe there is something in the HTML structure that I'm missing...
Thanks!
Let's take a look at big_table:
    > big_table.class
    => Nokogiri::XML::NodeSet
    > big_table.size
    => 1

So first of all, calling Enumerable#select on big_table is not doing what you expect: it iterates over the single element in that set, not over the table's rows. If you instead capture the rows:
    > rows = big_table.css("tr")
    > rows.count
    => 87

Now you can select on rows. Let's take an arbitrary row and see what it contains:
    > rows[2].css("td").count
    => 29
    > rows[2].css("th").count
    => 1

So a typical row has 29 td elements and 1 th. In fact every row has at least 1 th, which is why selecting on css("th").empty? returned nothing. Conversely, the all-header rows do not contain any td elements, which is why what you tried with td worked.