Grep HTML code between HTML tags containing a keyword in R


Within a file, I want to use grep, or maybe the qdapRegex package's rm_between function, to extract the whole section of HTML code containing a keyword, let's say "discount rate" for this example. Specifically, I want the results to look like these code snippets:

<p>this paragraph containing words discount rate including other things.</p>

and

<table width="400">
  <tr>
    <th>month</th>
    <th>savings</th>
  </tr>
  <tr>
    <td>discount rate</td>
    <td>10.0%</td>
  </tr>
  <tr>
    <td>february</td>
    <td>$80</td>
  </tr>
</table>

  1. The trick here is that it must find "discount rate" first, and then pull out the rest.
  2. It is always going to be between <p> and </p> or <table and </table>, and no other HTML tags.
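The two requirements above can be sketched with base R regular expressions, which is close to the grep-style approach the question has in mind. This is only a rough sketch: the sample string `doc` and the names `block_pattern`, `blocks` and `hits` are made up for illustration, and it assumes that `<p>` and `<table>` blocks are not nested inside blocks of the same tag.

```r
# Hypothetical sample text standing in for the real file contents.
doc <- paste0(
  "<p>Unrelated paragraph.</p>\n",
  "<p>This paragraph mentions the discount rate and other things.</p>\n",
  "<table width=\"400\"><tr><td>Discount rate</td><td>10.0%</td></tr></table>\n",
  "<table><tr><td>February</td><td>$80</td></tr></table>"
)

# Lazily match every <p>...</p> or <table ...>...</table> block.
# (?s) lets '.' cross line breaks, (?i) ignores case,
# and \\1 requires the closing tag to match the opening one.
block_pattern <- "(?si)<(p|table)\\b[^>]*>.*?</\\1>"
blocks <- regmatches(doc, gregexpr(block_pattern, doc, perl = TRUE))[[1]]

# Keep only the blocks that contain the keyword.
hits <- blocks[grepl("discount rate", blocks, ignore.case = TRUE)]
print(hits)
```

For a real file, `doc` could be built with something like `paste(readLines(path), collapse = "\n")`. Regular expressions are fragile on nested or malformed HTML, so a parser is usually the safer choice.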

A sample .txt file can be found here:

https://www.sec.gov/archives/edgar/data/66740/0000897101-04-000425.txt

You can treat the file as HTML and explore it as if you were scraping it with rvest:

    library(rvest)
    library(stringr)

    # read the file as HTML
    html <- read_html('~/downloads/0000897101-04-000425.txt')

    # get the 'p' nodes (you can do the same for 'table')
    p_nodes <- html %>% html_nodes('p')

    # get the text of each node
    p_nodes_text <- p_nodes %>% html_text()

    # find the nodes that contain the term you are looking for
    match_indeces <- str_detect(p_nodes_text, fixed('discount rate', ignore_case = TRUE))

    # keep the nodes with matches
    # notice that I remove the first match because rvest adds a
    # 'p' node for the whole file, since it is a text file
    match_p_nodes <- p_nodes[match_indeces][-1]

    # if you want to see the results, you can print them
    # (or send them to a file)
    for (i in seq_along(match_p_nodes)) {
      cat(paste0('node #', i, ': ', as.character(match_p_nodes[i]), '\n\n'))
    }

For <table> tags, do not remove the first match:

    # same steps as above, applied to the 'table' nodes
    table_nodes <- html %>% html_nodes('table')
    table_nodes_text <- table_nodes %>% html_text()
    match_indeces_table <- str_detect(table_nodes_text, fixed('discount rate', ignore_case = TRUE))
    match_table_nodes <- table_nodes[match_indeces_table]

    for (i in seq_along(match_table_nodes)) {
      cat(paste0('node #', i, ': ', as.character(match_table_nodes[i]), '\n\n'))
    }
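Since the `<p>` and `<table>` cases repeat the same steps, they could be folded into one small function. The sketch below is only an illustration: `find_nodes_with_keyword` and the inline `sample_html` document are hypothetical names, not part of rvest, and the example uses a tiny well-formed document, so the extra whole-file 'p' node mentioned above does not appear.

```r
library(rvest)
library(stringr)

# Hypothetical helper: return the HTML source of every `tag` node
# whose text contains `keyword` (case-insensitive, literal match).
find_nodes_with_keyword <- function(doc, tag, keyword) {
  nodes <- html_nodes(doc, tag)
  has_keyword <- str_detect(html_text(nodes), fixed(keyword, ignore_case = TRUE))
  as.character(nodes[has_keyword])
}

# Small made-up document used in place of the real filing.
sample_html <- read_html(paste0(
  "<html><body>",
  "<p>No match here.</p>",
  "<p>The Discount Rate was 10%.</p>",
  "<table><tr><td>discount rate</td><td>10.0%</td></tr></table>",
  "</body></html>"
))

p_hits <- find_nodes_with_keyword(sample_html, "p", "discount rate")
table_hits <- find_nodes_with_keyword(sample_html, "table", "discount rate")

cat(p_hits, table_hits, sep = "\n\n")
```

Note that with nested tables, an outer table would also be reported whenever an inner one contains the keyword, because `html_text()` includes the text of all descendants.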
