regex - gawk RS only at beginning of line with ^
Suppose I have multi-line records with = as the record separator, but only if the = is at the start of a line:

    $ cat file
    record 1, field 1
    record 1, field 2
    with = in record 1, field 3
    = record 2, field 1
    record 2, field 2
    with = in record 2, field 3
    = final record 3, field 1
    record 3, field 2

I want to separate this file, and similar ones, into records delimited by ^=[ \t] and fields delimited by \n.
I tried:

    $ gawk -v RS="^=[ \t]" -v FS="\n" '{printf "%s\n--- NF=%s, NR=%s ---\n", $0, NF, FNR}' file

but that results in:

    record 1, field 1
    record 1, field 2
    with = in record 1, field 3
    = record 2, field 1
    record 2, field 2
    with = in record 2, field 3
    = final record 3, field 1
    record 3, field 2
    --- NF=9, NR=1 ---

i.e., the ^ does not work the way I expect, as the beginning of a line.
I know I can do:

    $ gawk -v RS="\n=[ \t]" -v FS="\n" '{printf "%s\nNF=%s, NR=%s\n", $0, NF, FNR}' file

but that feels like it could have Unix / Windows issues with line separators. It also leaves a \n attached to the final record.
I could use sed to replace ^=[ \t] with \n and then use gawk in paragraph mode:

    $ sed 's/^=[ \t]/\
    /' file | gawk -v RS="" -v FS="\n" '{printf "%s\n--- NF=%s, NR=%s ---\n", $0, NF, FNR}'
    record 1, field 1
    record 1, field 2
    with = in record 1, field 3
    --- NF=3, NR=1 ---
    record 2, field 1
    record 2, field 2
    with = in record 2, field 3
    --- NF=3, NR=2 ---
    final record 3, field 1
    record 3, field 2
    --- NF=2, NR=3 ---

which is precisely what I am looking for.
Question: Is there a way to use ^ in RS to indicate 'start of line' in gawk with multi-line records, so that I don't have to pipe through sed? I guess I am looking for the equivalent of the m flag of a PCRE regex in gawk.
^ means start of string, not start of line. There is no start-of-line character. There are only the carriage return (\r = return the cursor to the start of the line) and line feed (\n = drop the cursor to the next line) characters, which are used together or separately, depending on the tool/OS, to indicate end of line, aka newline. Windows tools tend to use \r\n to mean newline while Unix uses \n alone, which is why \n is referred to as the newline character in Unix.
Many tools, e.g. sed and grep (and awk by default), read one line at a time, so the input buffer contains a single line at a time. In that context start of string is the same as start of line, which is why you hear ^ referred to as the start-of-line character when, in general, it isn't. Likewise, $ is the end-of-string character, not the end-of-line character. It is referred to as such because it can be used to represent end of line when the string is the input buffer of a tool that reads/populates it one line at a time.
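As a small illustration of that distinction, here is a sketch that counts how often /^x/ matches in a string holding three lines, first as one string and then line by line. The function name anchor_demo and the sample string are made up for this example; it uses only POSIX awk features, so any awk should do.

```shell
anchor_demo() {
  awk 'BEGIN {
    s = "xa\nxb\nxc"                    # one string that holds three lines
    whole = gsub(/^x/, "X", s)          # ^ can only match at the start of the string

    n = split("xa\nxb\nxc", line, "\n") # now make each line its own string
    perline = 0
    for (i = 1; i <= n; i++)
      perline += gsub(/^x/, "X", line[i])

    printf "whole string: %d match(es)\n", whole
    printf "line by line: %d match(es)\n", perline
  }'
}

anchor_demo
```

The first count is 1 and the second is 3, which is the behaviour a line-at-a-time tool gives you without your noticing.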
What that means is that if your tool is not reading one line at a time, then the regexp to match a character x at the start of a line in Unix files is actually:

    (^|\n)x

and at the end of a line it is:

    x(\n|$)

but be aware that this is matching/consuming the linefeed char if present.
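To see what "consuming" means in practice, the following sketch replaces each match with Y in a two-line string. The function name consume_demo and the sample strings are invented for the illustration; only POSIX awk features are assumed.

```shell
consume_demo() {
  awk 'BEGIN {
    s = "xa\nxb"
    n = gsub(/(^|\n)x/, "Y", s)   # the second match is "\nx", so the linefeed is replaced as well
    printf "start of line: %d matches -> %s\n", n, s

    t = "ax\nbx"
    m = gsub(/x(\n|$)/, "Y", t)   # the first match is "x\n", so the linefeed is replaced as well
    printf "end of line: %d matches -> %s\n", m, t
  }'
}

consume_demo
```

Both results come out on a single line (YaYb and aYbY) because the linefeed was part of the matched text. If you need to keep it, you have to put it back in the replacement.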
On Windows you would change \n to \r\n in the above, and to work on both you can use \r?\n, unless the file was created on Windows and can contain linefeeds mid-line, e.g. CSVs exported from Excel like

    field1,"field2 part a\nfield2 part b",field3\r\n

where \n and \r are of course the literal characters. In that case you would not want the standalone \n mid-field to be misinterpreted as a newline.
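A quick way to see the difference is to split such data on both patterns and count the pieces. This is only a sketch: the function name split_demo and the second record (next1,next2,next3) are made up, and it relies on nothing beyond POSIX awk's split() with a regexp.

```shell
split_demo() {
  awk 'BEGIN {
    # Two CRLF-terminated records; the first has a bare LF inside a quoted field.
    s = "field1,\"field2 part a\nfield2 part b\",field3\r\nnext1,next2,next3"

    loose  = split(s, a, /\r?\n/)   # also splits at the LF inside the quoted field
    strict = split(s, b, /\r\n/)    # splits only at the real record terminator

    printf "split on CR?LF: %d pieces\n", loose
    printf "split on CRLF only: %d pieces\n", strict
  }'
}

split_demo
```

The tolerant pattern yields 3 pieces for 2 records, which is exactly the misinterpretation described above.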
Try this (gawk-only due to the multi-char RS and the \s shorthand for [[:space:]]):

    $ awk -v RS='\n(=\\s*|$)' -F'\n' '{printf "%s\n--- NF=%s, NR=%s ---\n", $0, NF, FNR}' file
    record 1, field 1
    record 1, field 2
    with = in record 1, field 3
    --- NF=3, NR=1 ---
    record 2, field 1
    record 2, field 2
    with = in record 2, field 3
    --- NF=3, NR=2 ---
    final record 3, field 1
    record 3, field 2
    --- NF=2, NR=3 ---
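If a regexp RS is not available, a different technique gives the same grouping: stay in the default line-at-a-time mode, where ^ really does coincide with the start of a line, and accumulate records by hand. The sketch below is one way to do that, not the approach shown above. The function name split_records, the helper emit and the sample input are all hypothetical, and it assumes Unix line endings.

```shell
split_records() {
  awk '
    # Print the record collected so far, then reset the accumulator.
    function emit() {
      if (n) printf "%s\n--- NF=%d, NR=%d ---\n", rec, n, ++nr
      rec = ""; n = 0
    }
    # A line starting with "=" plus a blank begins a new record.
    /^=[ \t]/ { emit(); sub(/^=[ \t]+/, "") }
    # Every line (minus any separator prefix) becomes one field of the record.
    { rec = (n++ ? rec "\n" : "") $0 }
    END { emit() }
  ' "$@"
}

# Illustrative input: the "=" in the second line is mid-line, so it does not split.
printf '%s\n' \
  'r1 f1' \
  'r1 f2 = not a separator' \
  '= r2 f1' \
  'r2 f2' \
  '= r3 f1' | split_records
```

Here the field count is tracked in n rather than taken from NF, since each input line is treated as one field of the record being built.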