Ruby Nokogiri HTML scraping table with CSS issue

September 15, 2014

I have an issue scraping an HTML table. Here is the link: https://www.basketball-reference.com/players/c/curryst01/gamelog/2016 (yes, it's the famous introductory Ruby scraping tutorial). Here is the related code:

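    require "nokogiri"
    require "open-uri"

    # URI.open is needed on Ruby 3.0+, where a bare open(link) no longer fetches URLs
    doc = Nokogiri::HTML.parse(URI.open(link))

    # the biggest table
    big_table = doc.css("table").sort { |x, y| y.css("tr").count <=> x.css("tr").count }.first

    # the number of rows is 87; there are 5 header rows I want to remove
    big_table.css("tr").count

    # doesn't remove the header rows
    big_table = big_table.select { |row| row.css("th").empty? }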

As far as I understand HTML (I know nothing about HTML, and I have been using Ruby for only 4 hours), a th tag is a header cell, td is a standard cell, and tr is a row. My goal is to delete the header rows. Since .empty? returns true if a NodeSet (is a NodeSet the content of a tag?) is empty, the last line of code should have returned only the tr elements that contain no th. It doesn't work; in fact the result is [].
Instead, I noticed that big_table.select { |row| row.css("td").empty? }.count equals 5, so I decided to use:

    big_table = big_table.select { |row| row.css("td").any? }

and it worked well.

My question: why does this line work, and why did my first attempt fail? Maybe there is something in the HTML structure that I'm missing.

Thanks!

Let's take a look at big_table:

    > big_table.class
     => Nokogiri::XML::NodeSet
    > big_table.size
     => 1

So first of all, calling Enumerable#select on big_table is not doing what you expect: it iterates over that one-element set, not over the table's rows. If you instead capture the rows:

    > rows = big_table.css("tr")
    > rows.count
     => 87

Now you can select on rows. Let's take an arbitrary row and see what it contains:

    > rows[2].css("td").count
     => 29
    > rows[2].css("th").count
     => 1

So a typical row has 29 td elements and 1 th. In fact every row has at least 1 th, which is why selecting on css("th").empty? returned nothing. Conversely, the all-header rows do not contain any td elements, which is why the version you tried with td worked.
