regex - Parse EML text with regular expression


Could you please help me parse EML text with a regular expression?

I want to extract, separately:

1). The text between content-transfer-encoding: base64 and --=_alternative, if the line above is content-type: text/html

2). The text between content-transfer-encoding: base64 and --=_related, if the line two lines above is content-type: image/jpeg

Please take a look at this piece of PowerShell code:

    $text = @"
    --=_alternative xxxxxxxxxxxxxx_=
    content-type: text/html; charset="koi8-r"
    content-transfer-encoding: base64

    111111111111111111111111111111111111111111111111111111

    --=_alternative xxxxxxxxxxxxxx_=
    content-type: text/html; charset="koi8-r"
    content-transfer-encoding: base64

    222222222222222222222222222222222222222222222222222222
    --=_alternative xxxxxxxxxxxxxx_=--
    --=_related xxxxxxxxxxxxxx_=--_=
    content-type: image/jpeg
    content-id: <_2_xxxxxxxxxxxxxx>
    content-transfer-encoding: base64

    333333333333333333333333333333333333333333333333333333
    --=_related xxxxxxxxxxxxxx_=
    content-type: image/jpeg
    content-id: <_2_xxxxxxxxxxxxxx>
    content-transfer-encoding: base64
    444444444444444444444444444444444444444444444444444444

    --=_related xxxxxxxxxxxxxx_=
    content-type: image/jpeg
    content-id: <_2_xxxxxxxxxxxxxx>
    content-transfer-encoding: base64

    555555555555555555555555555555555555555555555555555555
    --=_related xxxxxxxxxxxxxx_=--
    "@

    $regex1 = "(?ms).+?content-transfer-encoding: base64(.+?)--=_alternative"
    $text1 = ([regex]::matches($text,$regex1) | foreach {$_.groups[1].value})
    write-host "text1 : " -fore red
    write-host  $text1

    # I want the output as elements (of an array, maybe, or one after another):
    # 1). the text between content-transfer-encoding: base64 and --=_alternative,
    #     if the line above is content-type: text/html
    # this
    #1111111111111111111111111111111111111111111111111111111
    # then
    #2222222222222222222222222222222222222222222222222222222

    $regex2 = "(?ms).+?content-transfer-encoding: base64(.+?)--=_related"
    $text2 = ([regex]::matches($text,$regex2) | foreach {$_.groups[1].value})

    # I want the output as elements (of an array, maybe, or one after another):
    # 2). the text between content-transfer-encoding: base64 and --=_related,
    #     if the line two lines above is content-type: image/jpeg
    # this
    #3333333333333333333333333333333333333333333333333333333
    # then
    #4444444444444444444444444444444444444444444444444444444
    # then
    #5555555555555555555555555555555555555555555555555555555
    write-host "text2 : " -fore red
    write-host  $text2

Thanks for your help. Have a nice day.

P.S. Based on Jessie Westlake's code, here is a slightly edited version of the regex that worked for me:

    $files = get-childitem -path "\\<server_name>\mailroot\drop"
    foreach ($file in $files){
        $text = get-content $file.fullname

        $regextext = '(?:content-type: text/html.+?content-transfer-encoding: base64(.+?)(?:--=_))'
        $regeximage = '(?:content-type: image/jpeg.+?content-transfer-encoding: base64(.+?)(?:--=_))'

        $textmatches = [regex]::matches($text, $regextext, [system.text.regularexpressions.regexoptions]::singleline)
        $imagematches = [regex]::matches($text, $regeximage, [system.text.regularexpressions.regexoptions]::singleline)

        if ($textmatches[0].success)
        {
            write-host "found $($textmatches.count) text matches:"
            write-output $textmatches.foreach({$_.groups[1].value})
        }
        if ($imagematches[0].success)
        {
            write-host "found $($imagematches.count) image matches:"
            write-output $imagematches.foreach({$_.groups[1].value})
        }
    }

TL;DR: go to the code at the bottom...

The code below is pretty ugly, so forgive me.

Essentially, I created a regular expression that matches starting with content-type: text/html. It matches anything following that, lazily, until it hits a newline \n, a carriage return \r, or a combination of one after the other, \r\n.

You have to wrap these alternatives in parentheses in order to use the or operator |. Since we don't want to capture/return any of these groups, we use the non-capturing group syntax (?:text-to-match). We use it elsewhere too, as you can see. You can place capturing and non-capturing groups inside of each other as well.
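
As a small illustration of the difference between the two kinds of group, here is a minimal sketch. The `$sample` string and the variable names are made up for the example and are not part of the code further down.

```powershell
# Hypothetical sample: a header line terminated by CRLF.
$sample = "content-type: text/html`r`nnext"

# Non-capturing group: the alternation is grouped but not returned.
$m1 = [regex]::Match($sample, 'text/html(?:\r\n|\n|\r)')
$nonCapturingCount = $m1.Groups.Count   # 1: only group 0, the whole match

# Capturing group: the same alternation is also stored as group 1.
$m2 = [regex]::Match($sample, 'text/html(\r\n|\n|\r)')
$capturingCount = $m2.Groups.Count      # 2: group 0 plus the captured line break
```

Keeping the line-break groups non-capturing means the only numbered group left in the final pattern is the one that holds the data.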

Anyway, continuing on. After matching the new line, we want to see content-transfer-encoding: base64. That seems to be required in each of your examples.

After that we want to identify the next newline, just like last time. Except this time we want to match one or more, using +. The reason we need to match more than one is that there seem to be times when the data you want to save is preceded by an extra line. Since it is not always preceded by an extra line, we need to make the match "lazy" by following the plus with a question mark: +?.
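
To show what the lazy quantifier changes, here is a minimal sketch on a made-up string. `$sample`, `$greedy` and `$lazy` are illustrative names only, and the pattern is simplified to bare \n line breaks.

```powershell
# Hypothetical sample: a header, an extra blank line, then data.
$sample = "base64`n`nDATA`n"

# Greedy: the line-break group takes every break it can.
$greedy = [regex]::Match($sample, 'base64(\n+)').Groups[1].Length    # 2

# Lazy: the group takes only as many breaks as the rest of the pattern needs.
$lazy = [regex]::Match($sample, 'base64(\n+?)').Groups[1].Length     # 1
```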

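After that comes the part where we capture the actual data. This is the first time we use an actual capturing group, versus a non-capturing group (i.e. no question mark followed by a colon).

We want to capture anything that is not a new line, because it seems that the data is sometimes followed by a new line and sometimes not. By not allowing ourselves to capture any new lines, we force our previous group to gobble up any new lines preceding our data. That capturing group is ([^(?:\n|\n\r)]+).

What we are doing there is wrapping the regex in parentheses in order to capture it. We place the expression inside of brackets because we want to create our own "class" of characters. Any of the characters inside of the brackets are what our code would be looking for. The difference with ours, though, is that we put a caret ^ as the first character inside the brackets. That means "not any of these characters". We want to match up until the next line, so we capture anything that is not a newline, once or more, as many times as possible. Note that group syntax has no special meaning inside brackets, so this class also excludes the literal characters ( ? : | and ). That is harmless for base64 data, which never contains them, and [^\r\n]+ would express the same intent more directly.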
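
Here is a minimal sketch of how a negated character class behaves, comparing the class as written above with a plain "not a line break" class. The `$line` value is invented for the comparison and deliberately contains parentheses, which real base64 data would not.

```powershell
# Hypothetical data line containing parentheses, followed by CRLF.
$line = "QUJD(REVG)`r`nrest"

# Class as written in the answer: '(' is one of the excluded characters,
# so the match stops in front of it.
$asWritten = [regex]::Match($line, '[^(?:\n|\n\r)]+').Value   # QUJD

# Plain "anything but CR or LF" class: runs to the end of the line.
$plain = [regex]::Match($line, '[^\r\n]+').Value              # QUJD(REVG)
```

For the sample data in the question both classes capture the same text, because the payload consists only of digits.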

We need to make sure our regex is anchored to some ending text, otherwise it would keep trying to match. So we continue with another newline group, matching at least one but as few as required to make our capture a success: (?:\n|\r|\r\n)+?.

Lastly, we add the anchor where we know for sure we can stop looking for our important data, and that is --=_. I wasn't sure whether we would stumble across the word "alternative" or "related", so I didn't go that far. Now it's done.

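The key to it all

We wouldn't be able to match through the new lines if we didn't put the regular expression into "singleline" mode. To enable it here, we use the .NET [regex] class to create our matches and pass it an option from the [system.text.regularexpressions.regexoptions] type (the inline (?s) flag, as used in the question, is an alternative). The options include "singleline" and "multiline".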
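
The effect of singleline mode can be seen in a minimal sketch like the following. The three-line `$sample` string and the variable names are assumptions made for the illustration.

```powershell
# Hypothetical three-line string.
$sample = "start`nmiddle`nend"

# Default options: '.' does not match a line feed, so the match fails.
$withoutOption = [regex]::Match($sample, 'start(.+?)end').Success         # False

# Singleline: '.' also matches line feeds, so the match can span lines.
$opts = [System.Text.RegularExpressions.RegexOptions]::Singleline
$withOption = [regex]::Match($sample, 'start(.+?)end', $opts).Success     # True

# The inline flag (?s) turns on the same mode from inside the pattern.
$withInlineFlag = [regex]::Match($sample, '(?s)start(.+?)end').Success    # True
```

"Multiline" is a different switch: it changes where ^ and $ match, not what the dot matches.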

I create a separate regex for the text/html and image/jpeg searches, and save the results of the matches to their respective variables.

We can test the success of the matches by indexing into the 0 index, which contains the first match object, and accessing its .success property, which returns a boolean value. The count of matches is accessible through the .count property. In order to access the specific groups and captures, we have to dot-notate into them after finding the appropriate capture group index. Since we are using only one capturing group and the rest are non-capturing, groups[0] holds our entire text match, and groups[1] should contain the match of our capture group. Because it is an object, we have to access its value property.

Obviously, the code below requires your $text variable to contain the data to search.
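
The property access described above can be sketched on a tiny made-up input. `$sample`, `$found` and `$values` are illustrative names only.

```powershell
# Hypothetical input with three matches and one capture group each.
$sample = "id=1; id=22; id=333"
$found  = [regex]::Matches($sample, 'id=(\d+)')

$count      = $found.Count                 # 3
$firstOk    = $found[0].Success            # True
$wholeMatch = $found[0].Groups[0].Value    # id=1  (entire match)
$captured   = $found[0].Groups[1].Value    # 1     (capture group)

# Collect group 1 from every match.
$values = foreach ($m in $found) { $m.Groups[1].Value }
$joined = $values -join ','                # 1,22,333
```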

    $regextext = '(?:content-type: text/html.+?(?:\n|\r|\r\n)content-transfer-encoding: base64(?:\n|\r|\r\n)+?([^(?:\n|\n\r)]+)(?:\n|\r|\r\n)+?(?:\n|\r|\r\n)(?:--=_))'
    $regeximage = '(?:content-type: image/jpeg.+?(?:\n|\r|\r\n)content-transfer-encoding: base64(?:\n|\r|\r\n)+?([^(?:\n|\n\r)]+)(?:\n|\r|\r\n)+?(?:\n|\r|\r\n)(?:--=_))'

    $textmatches = [regex]::matches($text, $regextext, [system.text.regularexpressions.regexoptions]::singleline)
    $imagematches = [regex]::matches($text, $regeximage, [system.text.regularexpressions.regexoptions]::singleline)

    if ($textmatches[0].success)
    {
        write-host "found $($textmatches.count) text matches:"
        write-output $textmatches.foreach({$_.groups[1].value})
    }
    if ($imagematches[0].success)
    {
        write-host "found $($imagematches.count) image matches:"
        write-output $imagematches.foreach({$_.groups[1].value})
    }

The code above results in the output below on screen:

    found 2 text matches:
    111111111111111111111111111111111111111111111111111111
    222222222222222222222222222222222222222222222222222222
    found 3 image matches:
    333333333333333333333333333333333333333333333333333333
    444444444444444444444444444444444444444444444444444444
    555555555555555555555555555555555555555555555555555555
