regex - Grep for a dynamic pattern in a file and print the other lines that have both that pattern and another pattern


Suppose we have a log file that looks like this:

06/30/2015 00:17:20.716  info   06z07mjbyxfpzs matched line
06/30/2015 00:17:20.723  info   06z07mjbyxfpzs data xxyyzz
06/30/2015 00:17:20.735  info   06z07mdgc66vhc matched line
06/30/2015 00:17:20.759  info   06z07mgdq9thty data xxyyzz
06/30/2015 00:17:20.755  info   06z07mdgc66vhc matched line
06/30/2015 00:17:20.784  info   06z07mdgc66vhc data xxyyzz
06/30/2015 00:17:20.827  info   06z07n2q9s4g07 data xxyyzz
06/30/2015 00:17:20.855  info   06z07mxt44cf03 data xxyyzz
06/30/2015 00:17:20.861  info   06z07n5mxfykhg data xxyyzz
06/30/2015 00:17:20.873  info   06z07nm473brzb data xxyyzz
06/30/2015 00:17:20.902  info   06z07mm059k0tz data xxyyzz
06/30/2015 00:17:20.970  info   06z07nx2lv9wzc matched line
06/30/2015 00:17:20.974  info   06z07nx2lv9wzc data xxyyzz
06/30/2015 00:17:20.991  info   06z07ngwmw16zz matched line
06/30/2015 00:17:20.994  info   06z07ngwmw16zz data xxyyzz
06/30/2015 00:17:21.085  info   06z07n42c6qczx data xxyyzz
06/30/2015 00:17:21.094  info   06z07nmgpjppv1 matched line
06/30/2015 00:17:21.094  info   06z07mxr42tzzw data xxyyzz
06/30/2015 00:17:21.094  info   06z07mwbfvcgd3 data xxyyzz
06/30/2015 00:17:21.095  info   06z07nmgpjppv1 matched line
06/30/2015 00:17:21.100  info   06z07nmgpjppv1 data xxyyzz
06/30/2015 00:17:21.123  info   06z07p0ybwlv0b data xxyyzz
06/30/2015 00:17:21.132  info   06z07nslzf66hk matched line
06/30/2015 00:17:21.137  info   06z07nslzf66hk data xxyyzz

What I want is:

  • if any line contains "matched line", take the unique id in column 4 (e.g. 06z07mjbyxfpzs), and
  • search for the other lines that have that unique id plus the text "some data xxyyzz", and
  • print the lines that match both patterns (unique id + "some data xxyyzz") to the console as the final output.

So in this case the output should be:

06/30/2015 00:17:20.723  info   06z07mjbyxfpzs data xxyyzz
06/30/2015 00:17:20.784  info   06z07mdgc66vhc data xxyyzz
06/30/2015 00:17:20.974  info   06z07nx2lv9wzc data xxyyzz
06/30/2015 00:17:20.994  info   06z07ngwmw16zz data xxyyzz
06/30/2015 00:17:21.100  info   06z07nmgpjppv1 data xxyyzz
06/30/2015 00:17:21.137  info   06z07nslzf66hk data xxyyzz

The file I am talking about is huge (~200 GB, with millions of records) and sits on a shared server, so I cannot run scripts or commands that take a lot of time.

[Edit] - I am currently doing this with fgrep, printing the unique ids of the "matched line" lines into one file and the "some data xxyyzz" lines into another. I am looking for a single-line grep, awk or sed command (without having to create multiple files for fgrep).

[Edit 2] - The data is not in a file; rather, it is the intermediate output of a series of grep and sort commands.

[Edit 3] - Updated sample input (not in order, jumbled):

06/30/2015 00:17:21.094  info   06z07nmgpjppv1 matched line
06/30/2015 00:17:20.716  info   06z07mjbyxfpzs matched line
06/30/2015 00:17:20.735  info   06z07mdgc66vhc matched line
06/30/2015 00:17:20.759  info   06z07mgdq9thty data xxyyzz
06/30/2015 00:17:20.755  info   06z07mdgc66vhc matched line
06/30/2015 00:17:20.784  info   06z07mdgc66vhc data xxyyzz
06/30/2015 00:17:20.827  info   06z07n2q9s4g07 data xxyyzz
06/30/2015 00:17:20.855  info   06z07mxt44cf03 data xxyyzz
06/30/2015 00:17:20.861  info   06z07n5mxfykhg data xxyyzz
06/30/2015 00:17:20.873  info   06z07nm473brzb data xxyyzz
06/30/2015 00:17:20.723  info   06z07mjbyxfpzs data xxyyzz
06/30/2015 00:17:20.902  info   06z07mm059k0tz data xxyyzz
06/30/2015 00:17:20.970  info   06z07nx2lv9wzc matched line
06/30/2015 00:17:20.974  info   06z07nx2lv9wzc data xxyyzz
06/30/2015 00:17:20.991  info   06z07ngwmw16zz matched line
06/30/2015 00:17:21.085  info   06z07n42c6qczx data xxyyzz
06/30/2015 00:17:21.094  info   06z07nmgpjppv1 matched line
06/30/2015 00:17:21.094  info   06z07mxr42tzzw data xxyyzz
06/30/2015 00:17:20.994  info   06z07ngwmw16zz data xxyyzz
06/30/2015 00:17:21.094  info   06z07mwbfvcgd3 data xxyyzz
06/30/2015 00:17:21.095  info   06z07nmgpjppv1 matched line
06/30/2015 00:17:21.100  info   06z07nmgpjppv1 data xxyyzz
06/30/2015 00:17:21.123  info   06z07p0ybwlv0b data xxyyzz
06/30/2015 00:17:21.132  info   06z07nslzf66hk matched line
06/30/2015 00:17:21.137  info   06z07nslzf66hk data xxyyzz

Ordered data

The following goes through the file only once and, therefore, should be fast:

$ awk '/matched line/{id=$4;next;} id==$4' file.log
06/30/2015 00:17:20.723 info 06z07mjbyxfpzs data xxyyzz
06/30/2015 00:17:20.784 info 06z07mdgc66vhc data xxyyzz
06/30/2015 00:17:20.974 info 06z07nx2lv9wzc data xxyyzz
06/30/2015 00:17:20.994 info 06z07ngwmw16zz data xxyyzz
06/30/2015 00:17:21.100 info 06z07nmgpjppv1 data xxyyzz
06/30/2015 00:17:21.137 info 06z07nslzf66hk data xxyyzz

In the sample input (from the original question), each data line of interest follows its matched line, with no other matched line in between. This is what enables a fast and simple solution.

How to use it in a pipeline

awk works well in pipelines. If the input is not a file but, as in Edit 2, a pipeline, use it like this:

cmd1 <file.log | cmd2 | awk '/matched line/{id=$4;next;} id==$4' | cmd3 
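To make the placeholders concrete, here is a minimal sketch in which awk reads the output of an earlier stage on standard input and feeds a later one. The `grep -F ' info '` and `sort -k2,2` stages are only stand-ins for whatever cmd2 and cmd3 really are, and the "debug" line in the sample is a hypothetical addition so that the upstream grep has something to drop.

```shell
# Sample input; the "debug" line is hypothetical, added so grep filters something.
log='06/30/2015 00:17:20.716  info   06z07mjbyxfpzs matched line
06/30/2015 00:17:20.720  debug  06z07mjbyxfpzs data xxyyzz
06/30/2015 00:17:20.723  info   06z07mjbyxfpzs data xxyyzz
06/30/2015 00:17:20.735  info   06z07mdgc66vhc matched line
06/30/2015 00:17:20.784  info   06z07mdgc66vhc data xxyyzz'

# awk is given no file name, so it reads whatever the previous stage writes.
pipeline_out=$(
    printf '%s\n' "$log" |
    grep -F ' info ' |
    awk '/matched line/{id=$4;next;} id==$4' |
    LC_ALL=C sort -k2,2
)
printf '%s\n' "$pipeline_out"
```

Because this version of the awk command depends on the order of the lines, any sorting is best left to a stage after it, as here.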

How it works

  • /matched line/{id=$4;next;}

    Any time we find a line containing the text "matched line", we save its id in the variable id. Since we do not want to print the matched line itself, we tell awk to skip the rest of the commands and jump to the next line.

  • id==$4

    Any time the current line has an id (field 4) that matches our saved id, we print the line.

    (In awk terminology, id==$4 is a condition: it evaluates to true or false. When a condition is true, its action is performed. In this case, we specified no action, so awk performs the default action, which is to print the line.)
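The two rules described above can be tried on a few lines taken from the question's sample. This is only a sketch: it is the same program with the default action written out as an explicit print, and the variable names sample and ordered_out are made up for the illustration.

```shell
# A few lines from the sample log (field 4 is the id).
sample='06/30/2015 00:17:20.716  info   06z07mjbyxfpzs matched line
06/30/2015 00:17:20.723  info   06z07mjbyxfpzs data xxyyzz
06/30/2015 00:17:20.735  info   06z07mdgc66vhc matched line
06/30/2015 00:17:20.759  info   06z07mgdq9thty data xxyyzz
06/30/2015 00:17:20.784  info   06z07mdgc66vhc data xxyyzz'

# Same logic as the one-liner, one rule per line.
ordered_out=$(printf '%s\n' "$sample" | awk '
    /matched line/ { id = $4; next }   # remember the id and skip printing this line
    id == $4       { print }           # explicit form of the default print action
')
printf '%s\n' "$ordered_out"
```

Only the two data lines whose id equals the most recently saved id are printed; the 06z07mgdq9thty line is skipped because no matched line set that id.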

Partially ordered data

In Edit 3, the data lines can appear at a random location after the matched line. In that case:

$ awk '/matched line/{id[$4]=1;next;} id[$4]' file.log
06/30/2015 00:17:20.784 info 06z07mdgc66vhc data xxyyzz
06/30/2015 00:17:20.723 info 06z07mjbyxfpzs data xxyyzz
06/30/2015 00:17:20.974 info 06z07nx2lv9wzc data xxyyzz
06/30/2015 00:17:20.994 info 06z07ngwmw16zz data xxyyzz
06/30/2015 00:17:21.100 info 06z07nmgpjppv1 data xxyyzz
06/30/2015 00:17:21.137 info 06z07nslzf66hk data xxyyzz

Or, in a pipeline:

cmd1 file.log | awk '/matched line/{id[$4]=1;next;} id[$4]' 
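One possible refinement for a very large log, offered as a sketch rather than as part of the answer above: in awk, merely referencing id[$4] creates an empty array element for that key, so the array grows with every distinct id seen, not only the matched ones. Testing membership with the in operator avoids creating those elements. The variant below also checks for the text "data xxyyzz" explicitly, which the commands above do not do. The array name seen and the "other text" line in the sample are hypothetical.

```shell
# Jumbled sample; the "other text" line is hypothetical, added to show the extra text check.
jumbled='06/30/2015 00:17:21.094  info   06z07nmgpjppv1 matched line
06/30/2015 00:17:20.716  info   06z07mjbyxfpzs matched line
06/30/2015 00:17:20.759  info   06z07mgdq9thty data xxyyzz
06/30/2015 00:17:20.723  info   06z07mjbyxfpzs data xxyyzz
06/30/2015 00:17:21.095  info   06z07nmgpjppv1 other text
06/30/2015 00:17:21.100  info   06z07nmgpjppv1 data xxyyzz'

unordered_out=$(printf '%s\n' "$jumbled" | awk '
    /matched line/                 { seen[$4] = 1; next }  # record the id, skip the line
    ($4 in seen) && /data xxyyzz/  { print }               # membership test creates no new elements
')
printf '%s\n' "$unordered_out"
```

Like the commands above, this only finds data lines that come after the matched line for their id; a data line that precedes its matched line is not printed.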
