bash - Extract n levels of links from a URL using wget


I am trying to extract the URLs from a webpage, down to a user-defined n levels, using wget. I tried this:

 wget -r -l$2 --reject=gif -O out.html www.google.com | sed -n 's/.*href="\([^"]*\).*/\1/p'` " 

It displays only the first level and does not parse the deeper levels. How can I rectify it?

There are several problems in your command line:

wget -r -l$2 --reject=gif -O out.html www.google.com | sed -n 's/.*href="\([^"]*\).*/\1/p'` " 
  1. There is a superfluous ` and " at the end of the command, probably a copy-and-paste error.
  2. Running sed 's/.*href="\([^"]*\).*/\1/p' is a little naive, because a page can have many references on the same line and tags may be split across several lines.
  3. The regexp has no closing ", so the reference plus the rest of the line is printed.
  4. The output is written to the file out.html and nothing is forwarded to stdout. This can be changed with the option -O -. Unfortunately that option does not work with -r and -lX. The solution is to store the result and execute 2 commands.
  5. www.google.com may return a 302 Found with a Location header pointing to a localized Google, and wget does not recurse into the localized page.
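
To illustrate points 2 and 3: the leading .* in the original sed expression is greedy, so when a line holds several links only the last one survives. Here is a minimal sketch of an extractor that copes with several links per line and with tags split across lines. The function name extract_hrefs is made up for this example, and it assumes a grep that supports -o (GNU and BSD grep do) and double-quoted href values.

```shell
#!/bin/bash
# Hypothetical helper: print every double-quoted href value read from stdin,
# one per line.
extract_hrefs() {
    tr '\n' ' ' |                        # join lines so a tag split over lines still matches
    grep -o 'href="[^"]*"' |             # one output line per match, even with many per input line
    sed -e 's/^href="//' -e 's/"$//'     # drop the attribute name and both quotes
}

# Example: two links, the second one inside a tag split over two lines
printf '<a href="/a">x</a><a\n href="http://example.com/b">y</a>\n' | extract_hrefs
```

This is still regexp-based scraping, so single-quoted or unquoted attributes are not matched.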

So a working command could be (not tested, written on inspiration):

    $ wget -nv -r -l1 --reject=gif -O x www.google.it
    WARNING: combining -O with -r or -p will mean that all downloaded content
    will be placed in the single file you specified.

    2015-07-21 14:30:26 URL:http://www.google.it/ [18842/18842] -> "x" [1]
    2015-07-21 14:30:26 URL:http://www.google.it/robots.txt [8170] -> "x" [1]
    2015-07-21 14:30:26 URL:http://www.google.it/images/srpr/nav_logo80.png [35615/35615] -> "x" [1]
    ...

    $ cat x | sed -e 's/href="/\nhref="/g' | sed -n 's/.*href="\([^"]*"\).*/\1/p'
    /search?"
    /aa61d1355af544a297b61b2a6e00ff1c&css_id=bubble.min.css"
    http://www.google.it/imghp?hl=it&tab=wi"
    http://maps.google.it/maps?hl=it&tab=wl"
    https://play.google.com/?hl=it&tab=w8"
    http://www.youtube.com/?gl=it&tab=w1"
    ...
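
If you want the depth to be user-defined, as in the question, the two commands can be wrapped in a small function. This is only a sketch under a few assumptions: list_links is a made-up name, the \n in the sed replacement needs GNU sed, and the closing " is moved outside the capture group so the printed links do not end with a quote (a small change from the command above).

```shell
#!/bin/bash
# Hypothetical wrapper around the two-step approach:
#   list_links URL [DEPTH]
list_links() {
    local url=$1
    local depth=${2:-1}            # recursion depth, defaults to 1
    local tmp
    tmp=$(mktemp) || return 1

    # Step 1: recursive download; with -O everything lands in one file.
    # The -nv log lines go to stderr and are discarded here.
    wget -nv -r -l"$depth" --reject=gif -O "$tmp" "$url" 2>/dev/null

    # Step 2: put each href at the start of its own line (GNU sed \n),
    # print the quoted value, and drop duplicates.
    sed -e 's/href="/\nhref="/g' "$tmp" |
        sed -n 's/.*href="\([^"]*\)".*/\1/p' |
        sort -u

    rm -f "$tmp"
}

# Usage (needs network access):
#   list_links www.google.it 2
```

Point 5 still applies: if the start URL redirects to another host, wget will not recurse there unless you start from the localized URL directly.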
