Script to

I have around 1,500 Windows folders that each contain thousands of emails in .EML file format.
I need a script that reads each of these .eml files and deletes the file if the pattern "From: "????" <????@domain.com>" cannot be found within the first 20 lines of text. The ???? marks represent varying values.

The outcome of the script should leave me with just the emails (.eml files) that were sent from "domain.com".

I appreciate the script will take a very long time to process, however it will complete this task a lot faster than me doing it manually!

Many thanks in advance!
Antonio
antoniokingAsked:

oBdACommented:
It's not in the TA list, but here's a Powershell script anyway.
About the regular expression (anything not described is taken as a verbatim character, and "\." is a literal dot; note that PowerShell regex matching is case-insensitive by default):
\AFrom: (?<From>".*" <\S+@domain\.com>)\Z
^^      ^^^^^^^^ ^^   ^^^             ^^^-- Match the end of the line
||      |||||||| ||   |||             `---- End of the named group "From"
||      |||||||| ||   ```------------------ Match at least one non-whitespace character
||      |||||||| ``------------------------ Match none or any amount of any character
||      ````````--------------------------- Capture a named group "From"
``----------------------------------------- Match the beginning of the line


If you're unsure about the RegEx matching everything you need, please provide a list of formats you want supported. Do not introduce any meta characters yourself, only replace real names with John Doe or acme.com.
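One way to check the expression against a handful of header lines before letting it loose on real files is sketched below. The sample lines are made up for illustration (John Doe, acme.com); replace them with lines copied from your own .eml files. The dot in "domain\.com" is escaped here so that it only matches a literal dot.

```powershell
$RE_FromDomain = '\AFrom: (?<From>".*" <\S+@domain\.com>)\Z'

# Hypothetical sample header lines, not taken from any real mail.
$Samples = @(
	'From: "John Doe" <john.doe@domain.com>',
	'From: "John Doe" <john.doe@acme.com>',
	'From: John Doe <john.doe@domain.com>',
	'from: "JOHN DOE" <JOHN.DOE@DOMAIN.COM>'
)

# Record for every sample line whether the expression would keep the file.
$Results = ForEach ($Line In $Samples) {
	[PSCustomObject]@{ Line = $Line; Keep = ($Line -match $RE_FromDomain) }
}
$Results | Format-Table -AutoSize
```

The third sample has no quotes around the display name, so the expression as written does not accept it; if your files contain that format, the expression would need to be extended.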

This script is in test mode; it will only display the files it would delete; remove the "-WhatIf" switch in line 16 to run it for real.
$RootFolder = "D:\Temp"
$RE_FromDomain = '\AFrom: (?<From>".*" <\S+@domain\.com>)\Z'
Get-ChildItem -Path $RootFolder -Filter *.eml -Recurse | % {
	"$($_.FullName) ... " | Write-Host -ForegroundColor White -NoNewLine
	$Keep = $False
	ForEach ($Line In (Get-Content -Path $_.FullName -TotalCount 20)) {
		If ($Line -match $RE_FromDomain) {
			$Keep = $True
			Break
		}
	}
	If ($Keep) {
		"'$($Matches['From'])'" | Write-Host -ForegroundColor Green
	} Else {
		"no match!" | Write-Host -ForegroundColor Red
		Remove-Item -Path $_.FullName -Force -WhatIf
	}
}

Bill PrewCommented:
Here's a small BAT script that should do the job.  Just edit in the MailDir where the EML files reside and give it a test.  As always test well on a sample of files before running for real on live data files.

@echo off
setlocal

REM Define file and folder locations
set MailDir=B:\EE\EE28685701\Files
set KeepList=%Temp%\_keep.txt

REM Switch to folder where email files reside
pushd "%MailDir%"

REM Build a list of all emails that contain the email address to KEEP
findstr /i /m /r /c:"From:.*<.*@domain\.com>" "*.eml" >"%KeepList%"

REM List all files in directory, remove any files from KEEP list, delete all other emails
for /f "tokens=*" %%A in ('dir /b /a-d *.eml ^| findstr /i /m /v /g:"%KeepList%"') do (
  echo del %%~A
)

REM Clean up - remove KEEP list file
if exist "%KeepList%" del "%KeepList%"

REM Return to original folder
popd

~bp
antoniokingAuthor Commented:
Thanks Bill. One issue is that the code searches the entire content of each eml for that regular expression.
I would like it to check just the first 20 lines of text, because in an eml that is where you see who sent the email. Anything past that could be part of the email conversation.


Thank you oBdA, I may test your PowerShell alternative, but I'd much prefer using a scripting language I am more familiar with.
Bill PrewCommented:
Okay, let me see if I can adjust for that...

~bp
oBdACommented:
Well, I like batch a lot, but I refrained from using it here because no "head"-like tool comes with the OS, and parsing the first few lines of unknown content with a "for /f" loop is slow and error-prone.
On top of that, Microsoft had its very own, very strange (if not to say incorrect) interpretation of regular expressions when they created findstr.
So realistically there's no way around PowerShell here; the script is not quite as easy as it gets, but pretty close.
Here's a fully commented version:
$RootFolder = "D:\Temp"
$RE_FromDomain = '\AFrom: (?<From>".*" <\S+@domain\.com>)\Z'
#                 ^^      ^^^^^^^^ ^^   ^^^             ^^^-- Match the end of the line
#                 ||      |||||||| ||   |||             `---- End of the named group "From"
#                 ||      |||||||| ||   ```------------------ Match at least one non-whitespace character
#                 ||      |||||||| ``------------------------ Match none or any amount of any character
#                 ||      ````````--------------------------- Capture a named group "From"
#                 ``----------------------------------------- Match the beginning of the line
# Anything not described is taken as a verbatim character, and "\." is a literal dot; note that PowerShell regex matching is case-insensitive by default.
#
# Get-ChildItem is PS's "dir"; so get all elements named *.eml in the root folder and all subfolders, then pass that along to the ForEach-Object cmdlet (which does what its name implies).
# Get-ChildItem (as most cmdlets) returns a file OBJECT, not just the file name.
Get-ChildItem -Path $RootFolder -Filter *.eml -Recurse | ForEach-Object {
	# Eye candy - console output.
	# $() is a "subexpression", and inside that subexpression, we're using $_.FullName to access the FullName property of the file item currently in the pipeline.
	"$($_.FullName) ... " | Write-Host -ForegroundColor White -NoNewLine
	# A boolean variable that will decide whether to keep the eml file in question; we're assuming we'll throw it away, unless we find the magic expression.
	$Keep = $False
	# This ForEach is now a PS statement, not the cmdlet from above; we're using it instead of the pipeline because we want to break out of the loop as soon as we find the magic expression.
	# Get-Content does what its name implies, and -TotalCount tells it to only get the first 20 lines.
	ForEach ($Line In (Get-Content -Path $_.FullName -TotalCount 20)) {
		# The -match operator tries to match the current line against the regular expression defined above.
		# If there's a match, the PS automatic variable "$Matches" will contain a hashtable with the captured groups, and the expression will evaluate to True.
		If ($Line -match $RE_FromDomain) {
			# We found the magic expression, so let's keep the file.
			$Keep = $True
			# Break out of the loop, we found what we were looking for.
			Break
		}
	}
	If ($Keep) {
		# Eye candy - console output.
		# In the RE, we've defined a named capturing group called "From", and we just had a match, so the variable $Matches now has an entry under the key "From".
		"'$($Matches['From'])'" | Write-Host -ForegroundColor Green
	} Else {
		# Eye candy - console output.
		"no match!" | Write-Host -ForegroundColor Red
		# We did not find the magic expression, so let's delete it.
		# Except - we're not that sure yet, so we use "WhatIf" to tell "Remove-Item" that it should only report what it would do without the "WhatIf" switch.
		Remove-Item -Path $_.FullName -Force -WhatIf
	}
}
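
The behaviour of -match and $Matches that the comments describe can be tried in isolation. The short sketch below uses made-up sample lines and also shows one detail worth knowing: a failed match returns False but leaves $Matches as it was, which is a good reason to read $Matches only right after a successful match, as the script does by breaking out of the loop.

```powershell
$RE_FromDomain = '\AFrom: (?<From>".*" <\S+@domain\.com>)\Z'

# A successful match fills $Matches; the named group is available under its name.
$First = 'From: "John Doe" <john.doe@domain.com>' -match $RE_FromDomain
$Captured = $Matches['From']

# A failed match returns False and does not overwrite $Matches.
$Second = 'Subject: Hello' -match $RE_FromDomain
$StillThere = $Matches['From']

"First match:  $First -> '$Captured'"
"Second match: $Second -> '$StillThere' (left over from the first match)"
```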

Microsoft took all the complaints about lacking security to heart, so by default PS on a Windows client will not run script files at all (execution policy "Restricted").
You can change that for your account by entering
Set-ExecutionPolicy RemoteSigned -Scope CurrentUser
which only requires a signature on scripts downloaded from remote locations, or you can use "Unrestricted" or "Bypass" to allow scripts from any location; details are here: https://technet.microsoft.com/en-us/library/hh849812.aspx.
If you leave out "-Scope", the policy is set for the machine ("-Scope LocalMachine") instead of only your account, which requires the console to be started elevated.
And to finally start a script, you always need to provide a path, even if you're already in the script's folder (in which case ".\" will suffice):
.\Whatever.ps1
This is done for security reasons, too, so that you won't accidentally start a script instead of a cmdlet.
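If you are not sure which policy currently applies, or you would rather not change the stored setting at all, something along these lines may help. It is only a sketch; "Whatever.ps1" is the same placeholder name as above, and the last line assumes you start it from a console in the script's folder.

```powershell
# Show the execution policy configured at each scope, then the effective one.
Get-ExecutionPolicy -List
Get-ExecutionPolicy

# Run one script with a relaxed policy for this invocation only;
# the stored policy stays untouched.
powershell.exe -ExecutionPolicy Bypass -File .\Whatever.ps1
```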
Bill PrewCommented:
Okay, here is an updated BAT approach.  It is currently in test mode, with an ECHO in front of each of the two DEL commands.  TEST it this way and see if it displays the DEL statements you expect.  If so, remove the two ECHO words.

@echo off
setlocal

REM Define file and folder locations
set MailDir=B:\EE\EE28685701\Files
set KeepList=%Temp%\_keep.txt
set LinesToScan=20
set Filter=*.eml

if exist "%KeepList%" del "%KeepList%"

REM Switch to folder where email files reside
pushd "%MailDir%"

REM Build a list of all emails that contain the email address to KEEP
for /f "tokens=1-2 delims=:" %%A in ('findstr /i /n /r /c:"From:.*<.*@domain\.com>" "%Filter%"') do (
  if %%B LEQ %LinesToScan% echo %%~A>>"%KeepList%"
)

if exist "%KeepList%" (

  REM List all files in directory, remove any files from KEEP list, delete all other emails
  for /f "tokens=*" %%A in ('dir /b /a-d "%Filter%" ^| findstr /i /m /v /g:"%KeepList%"') do ECHO del %%~A

  REM Clean up - remove KEEP list file
  del "%KeepList%"

) else (

  REM No files to save, remove all
  ECHO del /q "%Filter%"

)

REM Return to original folder
popd

~bp
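
For comparison, the line-number filter used in the batch file above (findstr /n, then keeping only hits on lines up to LinesToScan) might be sketched in PowerShell with Select-String, whose results carry a LineNumber property. This is only an illustration and not part of either posted script: the temporary folder, the file names keep.eml and drop.eml, and their contents are made up.

```powershell
# Build two throwaway sample files in a temporary folder (hypothetical content).
$Demo = Join-Path ([System.IO.Path]::GetTempPath()) ("eml-demo-" + [Guid]::NewGuid())
New-Item -ItemType Directory -Path $Demo | Out-Null
Set-Content -Path (Join-Path $Demo 'keep.eml') -Value @('Received: by mail', 'From: "John Doe" <john.doe@domain.com>', '', 'Hello')
# drop.eml only mentions domain.com on line 27, well past the header area.
$Late = @('From: "Jane Roe" <jane.roe@acme.com>') + (1..25 | ForEach-Object { "filler line $_" }) + 'From: "John Doe" <john.doe@domain.com>'
Set-Content -Path (Join-Path $Demo 'drop.eml') -Value $Late

$LinesToScan = 20
# Find every matching line, then keep only hits within the first $LinesToScan lines.
$KeepFiles = @(Get-ChildItem -Path $Demo -Filter *.eml |
	Select-String -Pattern 'From:.*<.*@domain\.com>' |
	Where-Object { $_.LineNumber -le $LinesToScan } |
	Select-Object -ExpandProperty Filename -Unique)

"Files to keep: $($KeepFiles -join ', ')"

# Remove the throwaway folder again.
Remove-Item -Path $Demo -Recurse -Force
```

Like findstr, Select-String reads every file completely before the line-number filter is applied, so on thousands of large mails reading only the first lines with Get-Content -TotalCount is likely to be quicker.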
antoniokingAuthor Commented:
Thanks gents, and apologies for such a delayed response.
I will endeavor to test and come back to you within the week.