Grep HTML code between HTML tags containing a keyword in R
Within a file, I want to use grep, or maybe the qdapRegex package's rm_between function, to extract the whole section of HTML code containing a keyword, let's say "discount rate" for example. Specifically, I want results that look like this code snippet:
<p>this is a paragraph containing the words discount rate, including other things.</p>
and
<table width="400"> <tr> <th>month</th> <th>savings</th> </tr> <tr> <td>discount rate</td> <td>10.0%</td> </tr> <tr> <td>february</td> <td>$80</td> </tr> </table>
- The trick here is that it must find "discount rate" first, and then pull out the rest.
- It is going to be between <p> and </p> or <table and </table>, and no other HTML tags.
A sample .txt file can be found here:
https://www.sec.gov/archives/edgar/data/66740/0000897101-04-000425.txt
You can treat the file as HTML and explore it as if you were scraping it with rvest:
    library(rvest)
    library(stringr)

    # Extract the HTML from the file
    html <- read_html('~/downloads/0000897101-04-000425.txt')

    # Get all the 'p' nodes (you can do the same for 'table')
    p_nodes <- html %>% html_nodes('p')

    # Get the text from each node
    p_nodes_text <- p_nodes %>% html_text()

    # Find the nodes that have the term you are looking for
    match_indeces <- str_detect(p_nodes_text, fixed('discount rate', ignore_case = TRUE))

    # Keep only the nodes with matches
    # Notice that I remove the first match because rvest adds a
    # 'p' node for the whole file, since it is a text file
    match_p_nodes <- p_nodes[match_indeces][-1]

    # If you want to see the results, you can print them like this
    # (or you could send them to a file)
    for (i in seq_along(match_p_nodes)) {
      cat(paste0('node #', i, ': ', as.character(match_p_nodes[i]), '\n\n'))
    }

For the <table> tags, do not remove the first match:
    # Get all the 'table' nodes and the text inside each one
    table_nodes <- html %>% html_nodes('table')
    table_nodes_text <- table_nodes %>% html_text()

    # Keep every table whose text mentions the term (no first-match removal here)
    match_indeces_table <- str_detect(table_nodes_text, fixed('discount rate', ignore_case = TRUE))
    match_table_nodes <- table_nodes[match_indeces_table]

    # Print the matching tables as HTML
    for (i in seq_along(match_table_nodes)) {
      cat(paste0('node #', i, ': ', as.character(match_table_nodes[i]), '\n\n'))
    }
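
The two passes above can also be folded into a single helper. The sketch below is one possible way to do that, not part of the original answer: the function name extract_nodes_with_keyword and the inline sample_html string are made up for illustration, and the sample is a small, well-formed stand-in for the real filing, so the first-match quirk described above does not come up here. It relies on the CSS selector group 'p, table' to pick up both kinds of node in one call.

```r
library(rvest)
library(stringr)

# Hypothetical helper (not from the original answer): return the markup of
# every node matched by `selector` whose text contains `keyword`
# (literal, case-insensitive match).
extract_nodes_with_keyword <- function(doc, keyword, selector = "p, table") {
  nodes <- html_nodes(doc, selector)
  hits <- str_detect(html_text(nodes), fixed(keyword, ignore_case = TRUE))
  as.character(nodes[hits])
}

# Small made-up document standing in for the real file
sample_html <- paste0(
  "<html><body>",
  "<p>This is a paragraph containing the words discount rate.</p>",
  "<p>This paragraph does not mention the keyword.</p>",
  "<table width=\"400\"><tr><th>Month</th><th>Savings</th></tr>",
  "<tr><td>Discount Rate</td><td>10.0%</td></tr></table>",
  "</body></html>"
)

doc <- read_html(sample_html)
snippets <- extract_nodes_with_keyword(doc, "discount rate")

# Print each matching snippet, separated by a blank line
cat(snippets, sep = "\n\n")
```

One caveat with a combined selector: if a <p> sits inside a matching <table>, both the table and the paragraph are returned, so the same text can show up twice. Running the two selectors separately, as in the answer above, keeps the results apart.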