Powershell - add line to millions of text files / EWS

Medium Priority
65 Views
Last Modified: 2020-05-23
I need to amend up to 5 million text files to add an extra line to each, based on a line already present. I'm going to need to process all those files with PowerShell anyway to import them into Office 365.

To explain why: I've scripted exporting about 5.2 million messages across 600 mailboxes, totalling about 800 GB, from an IBM Domino server into a folder structure like:

name@domain.com
   Folder
     Sub Folder
         abc123.eml

I then have a PowerShell script which uses EWS to create the folder structure and import each of those into everyone's mailboxes using impersonation. I've used PowerShell for ages for one-liners and amending existing scripts, but hadn't really done much development from scratch until last week.

Anyway, that seems to work great once I'd got my head around using a function recursively, passing parent folder IDs etc. to pick up existing folders or create new ones.

BUT the issue is that messages that were sent internally on that mail system, and sent messages generally, do not have any Received: headers, only Date: ones.

That should be OK, but O365/Outlook helpfully shows ones without a Received header with the date they were imported in the Received column, though the proper date appears in the "Sent" column.

So what I need to do is look for any EML files without a Received: header in the first few lines, then find the Date: header and add a Received: header after it.

e.g. this one does not show a date properly:  NOT-OK.eml

Subject: NOT OK - Approved Help Desk Request : HD077285
To: steve.knight@domain.com
Date: Wed, 6 Feb 2019 07:37:47 +0000
From: Matt.XXX@domain.com
Message-ID: <OFA64C3BC2.67F9473C-ON80258399.0029E929-80258399.0029E950@LocalDomain>
Content-type: multipart/related;
      Boundary="0__=0FBB090ADFBA6FB98f9e8a93df938690918c0FBB090ADFBA6FB9"
Content-Disposition: inline
X-Priority: 3 (Normal)
X-Mailer: IBM Notes Release 9.0.1FP8 February 24, 2017
MIME-Version: 1.0

But after adding a line like this, it does: NOW-OK---Added-received.eml


Subject: NOT OK Added Received line - Approved Help Desk Request : HD077285
To: steve.knight@domain.com
Date: Wed, 6 Feb 2019 07:37:47 +0000
Received: from internal; Wed, 6 Feb 2019 07:37:47 +0000
From: Matt.XXXXX@domain.com
Message-ID: <OFA64C3BC2.67F9473C-ON80258399.0029E929-80258399.0029E950@LocalDomain>
Content-type: multipart/related;
      Boundary="0__=0FBB090ADFBA6FB98f9e8a93df938690918c0FBB090ADFBA6FB9"
Content-Disposition: inline
X-Priority: 3 (Normal)
X-Mailer: IBM Notes Release 9.0.1FP8 February 24, 2017
MIME-Version: 1.0

So I know I could get the files needed with something like this, though I only need to check the first couple of lines of each file, so I don't know if there is a quicker way which doesn't involve reading each 5 KB to 50 MB file in full?

# Pipe the FileInfo into Select-String so it searches the file's contents
# (-InputObject would stringify the object); -Quiet just returns a Boolean.
Get-ChildItem -Recurse -File -Filter *.imported | Where-Object { -not ($_ | Select-String -Pattern "Received: " -Quiet) }



Once I've found the files, I need to search each one for the Date: header and add the Received: line, which is where I'm having problems.

Even better if I could incorporate it into the import loop. This is part of the code I am using in the recursively called function:


 
$ThisFolderMessages = Get-ChildItem $Folder.FullName -File | Where-Object { $_.Name -like "*.eml" }

# Process messages in this folder
foreach ($Email in $ThisFolderMessages) {
    $EmailToImport = Join-Path $Folder.FullName $Email.Name
    Write-Verbose "   Importing: $EmailToImport"
    $UploadEmail = New-Object Microsoft.Exchange.WebServices.Data.EmailMessage($service)

    # Read file as raw bytes
    [byte[]]$EmailInByte = Get-Content -Encoding Byte $EmailToImport -ReadCount 0
    # Set MIME content in message
    $UploadEmail.MimeContent = New-Object Microsoft.Exchange.WebServices.Data.MimeContent("us-ascii", $EmailInByte)
    # PR_MESSAGE_FLAGS (property tag 3591 / 0x0E07) = 1 marks the message as read
    $PR_Flags = New-Object Microsoft.Exchange.WebServices.Data.ExtendedPropertyDefinition(3591, [Microsoft.Exchange.WebServices.Data.MapiPropertyType]::Integer)
    $UploadEmail.SetExtendedProperty($PR_Flags, "1")
    try {
        $UploadEmail.Save($ThisFolder)
        $MessageCounter = $MessageCounter + 1
        # -replace uses regex, so escape the dot and anchor to only touch the extension
        Rename-Item -Path $EmailToImport -NewName ($Email.Name -replace '\.eml$', '.imported')
    }
    catch {
        # $($Error[0]) - without the subexpression, "$Error[0]" expands $Error and appends a literal [0]
        Write-Host "     ERROR importing $EmailToImport : $($Error[0])" -ForegroundColor Red
    }
    Write-Progress -Id 2 -ParentId 1 -Activity "Message import for $MailboxName" -Status "Started at $FolderStart" -CurrentOperation "Messages so far [$MessageCounter of $MailboxEMLCount]" -PercentComplete ($MessageCounter / $MailboxEMLCount * 100)
}

# Now process any sub folders of this folder
ProcessFolder -ThisFolderPath $Folder.FullName -MailFolderParentID $ThisFolder.UniqueID



Any clues (or just working code!) appreciated or any other suggestions.

Bonus question... do you know if I can flag a folder created via EWS to use the "Sent Items" style columns, so it shows "To" and "Sent date" as opposed to "From" and "Received"? At the moment an imported "Sent" folder just shows your own name on every row.

I'm currently creating any missing folders using this function. It would be nice if there was a way of telling Outlook that this folder should use the same columns as "Sent Items"; I'm not sure that is possible though.

function Create-Folder {
    param( 
        [Parameter(Position=2, Mandatory=$true)] [String]$NewFolderName,
        [Parameter(Position=3, Mandatory=$true)] [Microsoft.Exchange.WebServices.Data.FolderID]$EWSParentFolderID   #[Microsoft.Exchange.WebServices.Data.Folder]
    )  
    Begin
     {
        $fvFolderView = new-object Microsoft.Exchange.WebServices.Data.FolderView(1)  
        
        #Define a Search folder that is going to do a search based on the DisplayName of the folder  
        $SfSearchFilter = new-object Microsoft.Exchange.WebServices.Data.SearchFilter+IsEqualTo([Microsoft.Exchange.WebServices.Data.FolderSchema]::DisplayName,$NewFolderName)  
        #Do the Search  
        $findFolderResults = $service.FindFolders($EWSParentFolderID,$SfSearchFilter,$fvFolderView)  
        
        if ($findFolderResults.TotalCount -eq 0) {  
            Write-debug ("Folder Does not Exist - $NewFolderName")  
            $NewFolder = new-object Microsoft.Exchange.WebServices.Data.Folder($service)  
            $NewFolder.DisplayName = $NewFolderName 
            $NewFolder.FolderClass = "IPF.Note"
            $NewFolder.Save($EWSParentFolderID)
            Write-debug ("Folder Created - $NewFolderName")  
            return [Microsoft.Exchange.WebServices.Data.FolderID]$NewFolder.ID
        } else{  
            Write-debug ("Folder already exists - $NewFolderName")  
            foreach ($NewFolder in $findFolderResults) { 
                return [Microsoft.Exchange.WebServices.Data.FolderID]$NewFolder.ID
            }
        }  
     }
}



[Edit - added couple of example files]

How Outlook displays messages imported without a Received header:
CERTIFIED EXPERT
Top Expert 2014

Commented:
You've got a few too many questions here - I'd suggest breaking them out into separate posts.

You can give something like below a shot for adding the Received: header.  I also show a method of searching just the first few lines of files for the Received: header.
function Fix-File
{
    Param(
        [string]$fileName
    )

    $matchFound = $false
    $(foreach ( $line in (Get-Content -Path $fileName -ReadCount 250) )
    {
        # Once a matching line is found, pass the rest unmodified.
        If ( -not $matchFound )
        {
            If ( $line -match "^Date: (?<datetime>.*)$" )
            {
                Write-Output "$($Matches[0])`r`nReceived: from internal; $($Matches['datetime'])"
                $matchFound = $true
            }
            Else
            { $line }
        }
        Else
        { $line }
    }) | Set-Content $fileName
}

Get-ChildItem -recurse -file -filter *.eml | Where-Object { -not (($_ | Get-Content -TotalCount 5) -match "^Received: ") } | ForEach-Object `
{
    Fix-File -fileName $_.FullName
}


Edit:  Ha!  I see obda posted something more complete just a bit before I did, but I'll leave this here anyway.
Steve KnightIT Consultancy
CERTIFIED EXPERT

Author

Commented:
Wow, thanks both. That is some seriously engineered scripting there! Nearly 10pm here; I'll have a test tonight if I can, if not in the morning UK time, though it's our wedding anniversary so best not do too much!

I hadn't noticed the -TotalCount option for Get-Content; that will be useful for another script, I was going to have a search for that.

And thanks, yes I'll make a new question

thanks

Steve
Steve KnightIT Consultancy
CERTIFIED EXPERT

Author

Commented:
I've had a good test of these; thanks again for the great scripts!

I've set them to run across a sample of files, each script with its own copy of the same subset of data, on an SSD and an HDD drive so no cache gets in the way: 46,840 files over 38 folders totalling 4.19 GB. There are 26,108 files (2.15 GB) in a "Sent" folder there, so at least all of those are likely to need changing.

I have got about 70-80 MB/sec being copied from the LAN to a USB drive at the same time (copying 11 TB of data off a failing NAS RAID array), so that may be influencing the speed here. The actual data is on a VMware Windows 10 VM at a remote site, so I'll run it there for real over the weekend.

Both scripts hammer the CPU and disc a fair bit, as expected. This is a reasonable Xeon with 32 GB RAM; interesting to see how techniques of apparently similar speed diverge when repeated many times:

These tests were done in the ISE; it may be a bit quicker directly in the console?

HDD oBdA       9 min 23 sec --> est. total = 30 to 60 hours
SSD oBdA       5 min 22 sec --> est. total = 18 to 36 hours
SSD footech   26 min 20 sec --> est. total = 88 to 170 hours
HDD footech    (not run yet)

Running over the same data again with no changes needed (this time in the console) made quite a difference, for the footech method especially; both runs from the SSD drive:

oBdA         2 mins 5 secs
Footech      3 mins 25 secs

The estimated time is for the full run, 200 to 400 times this, based on this partial mailbox being 0.25% of the total by size and 0.5% by number of messages. (I don't keep massive attachments in my emails... the worst offender I found in a similar dump was a 1 GB attachment in Sent which had been sent internally to several people, then in a reply-to-all to those people, all with the attachments still on. They wondered why their phones were struggling, having set them to "download whole message and all attachments".)

@oBdA - worked first time. I've manually checked a sample of files: they have changed properly, and only the relevant files appear to have been touched.

@footech - unfortunately, as it stood I got the error below. $Matches doesn't seem to be populated by the -match operator; could that be because $line is returning multiple lines? $line[4], say, will return the Date line OK.

Changing the -ReadCount from 250 to 1 works OK, for instance, though presumably that slows it down. Do I need another foreach loop over the content of $line if I keep it at 250?

I haven't got time today to work out running with 250, but if I do I'll run it again. Do you know how to fix that?

thanks both

Steve

[DBG]: PS D:\EML_Export\Test2-footech>>
Cannot index into a null array.
At D:\EML_Export\Test2-footech\fixfile.ps1:16 char:33
+                 Write-Output "$($Matches[0])`r`nReceived: from intern ...
+                                 ~~~~~~~~~~~
    + CategoryInfo          : InvalidOperation: (:) [], RuntimeException
    + FullyQualifiedErrorId : NullArray
 


[DBG]: PS D:\EML_Export\Test2-footech>>
Cannot index into a null array.
At D:\EML_Export\Test2-footech\fixfile.ps1:16 char:76
+ ...  "$($Matches[0])`r`nReceived: from internal; $($Matches['datetime'])"
+                                                    ~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidOperation: (:) [], RuntimeException
    + FullyQualifiedErrorId : NullArray
CERTIFIED EXPERT
Top Expert 2014

Commented:
I don't know why you would be getting that error. I tested on sample files and had no issue. With the specific construct I used, $line should be only one line of the file at a time. Even though I'm using -ReadCount 250, with the foreach statement it breaks the output up into single lines - however, the same isn't true if I use the ForEach-Object cmdlet instead. In any case, I wouldn't pursue mine any further unless it's for academic interest. Using System.IO.StreamReader is faster (it just takes more code).
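A rough sketch of the StreamReader approach, for reference (untested against your files, and assuming the same Date:-header logic as above - adjust the path handling and encoding to suit your EMLs):

```powershell
# Sketch: stream each line, emit a Received: header after the first Date:
# line, write to a temp file, then replace the original.  Pass a full path -
# StreamReader resolves relative paths against the process working directory,
# not the PowerShell location.
function Fix-FileFast
{
    Param([string]$fileName)

    $tempFile = "$fileName.tmp"
    $reader = [System.IO.StreamReader]::new($fileName)
    $writer = [System.IO.StreamWriter]::new($tempFile)
    $matchFound = $false
    try
    {
        while ($null -ne ($line = $reader.ReadLine()))
        {
            $writer.WriteLine($line)
            if (-not $matchFound -and $line -match '^Date: (?<datetime>.*)$')
            {
                $writer.WriteLine("Received: from internal; $($Matches['datetime'])")
                $matchFound = $true
            }
        }
    }
    finally
    {
        $reader.Close()
        $writer.Close()
    }
    # Swap the rewritten file into place
    Move-Item -Path $tempFile -Destination $fileName -Force
}
```

Because it never holds more than one line in memory, this should behave the same on a 5 KB file as on a 50 MB one.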
Steve KnightIT Consultancy
CERTIFIED EXPERT

Author

Commented:
OK, thanks. Like you say, that way appears a bit slower, but I understand it more easily! I'll have a fiddle to see if I can get it working with the 250-line reads too, just for my own understanding.

I'll close off later.
CERTIFIED EXPERT
Most Valuable Expert 2019
Most Valuable Expert 2018

Commented:
footech,

You'll get this issue when a mail has more than 250 lines.
When -ReadCount is larger than the number of lines in the file, Get-Content works as if -ReadCount weren't there, so everything works as expected.
But if there are more lines, Get-Content returns multiple arrays, each (except maybe the last) with -ReadCount lines in it.
The foreach then breaks the array of arrays down into single arrays:
1..10 | % {"Line $_"} | sc C:\Temp\ReadCount.tmp
ForEach ($line in (gc C:\Temp\ReadCount.tmp -ReadCount 3)) {"----- <$($line.GetType().FullName)>, $($line.Count) elements ----------"; $line}
del C:\Temp\ReadCount.tmp


$line will then be the array of the first 250 lines, and -match will go into array mode and return the matching lines without setting $Matches:
If ('abc', 'def' -match '^a') {$Matches[0]}



The first 250 lines of the mail will then be replaced by:
<empty line>
Received: from internal; 


CERTIFIED EXPERT
Top Expert 2014

Commented:
Ah, man! Now I'm embarrassed that I didn't notice that earlier. It was quite some time ago that I did testing around this, and I'm surprised my testing didn't reveal the limitation (or I've since forgotten about it). It does appear to work as expected with -ReadCount 0, but that may bring its own issues. Using -ReadCount 1 (the default) of course works, but it is so much slower.
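For completeness, keeping -ReadCount 250 but adding an inner loop, so that -match only ever sees one string at a time, would look something like this (a sketch, reworking my earlier Fix-File):

```powershell
# Sketch: -ReadCount 250 makes Get-Content emit arrays of up to 250 lines.
# The inner foreach iterates each array, so -match always operates on a
# single string and populates $Matches as expected.  (If the file fits in
# one chunk, the outer foreach enumerates the lines directly and the inner
# loop runs once per line - still correct.)
function Fix-File
{
    Param([string]$fileName)

    $matchFound = $false
    $(foreach ( $chunk in (Get-Content -Path $fileName -ReadCount 250) )
    {
        foreach ( $line in $chunk )
        {
            if ( -not $matchFound -and $line -match '^Date: (?<datetime>.*)$' )
            {
                $line
                "Received: from internal; $($Matches['datetime'])"
                $matchFound = $true
            }
            else
            { $line }
        }
    }) | Set-Content $fileName
}
```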
Steve KnightIT Consultancy
CERTIFIED EXPERT

Author

Commented:
Checked earlier: with -ReadCount 0 the script took about 12 minutes instead of 26, and processing the files a second time, when no changes were needed, took 2 min 50 sec, so much closer to the other one.

I'll get this closed off and respond to other Q over weekend.

Steve KnightIT Consultancy
CERTIFIED EXPERT

Author

Commented:
@oBdA - it appears in my other question that once you select an answer you can no longer add a comment. That probably explains why, when I answer questions, you sometimes don't get any comment back; it's been a long time since I've asked a question myself.

Anyway for the other question I had written this:

Thanks, that works fine, I've decided to pre-process all the existing files though as your suggestion.

Good call on the path. This was originally checking for files in a second location too (same files and path) to exclude ones that had already been processed, so to make it more obvious I set two variables to the two filenames. I did away with that and instead renamed files to *.imported when successfully imported.

Thanks again. Once I've got those replication headers sorted out, I've got about 8,000 messages users have encrypted (out of 5.2 million) that I'll have to have them decrypt on the Notes side; then I'll re-export those messages and import them all into O365, in batches to start with.


I have tested this script on a sample and this morning started it running against the 5.2 million messages. I'll feed back and close when done.
CERTIFIED EXPERT
Most Valuable Expert 2019
Most Valuable Expert 2018

Commented:
You should be able to add comments even after a question is closed.
I just recently added a comment to a closed question in which I was not involved before: https://www.experts-exchange.com/questions/29182642/How-to-run-a-logo-script.html
The comment box is still available for me in https://www.experts-exchange.com/questions/29182958/Powershell-incorporate-find-replace-into.html
If a question was answered quite some time ago, the comment box is no longer automatically open, but you'll get a box under the last comment where you can click "leaving a comment", which will re-enable comments.
Have you tried reloading the page after accepting?
Steve KnightIT Consultancy
CERTIFIED EXPERT

Author

Commented:
Yes, I do that all the time too, but not on my own questions. I did try reloading the page, opening it again etc., and there was no (obvious) comment facility. I suspect it hides the comment box from the person who closed the question; I've checked back and it still doesn't show, actually.

Anyway, the script is running nicely. It's processed 48 GB, about 335 messages, in a little over 3 hours, so it's heading towards the 2-day mark to complete, but that really doesn't matter; the datastore for the VM is probably on SATA storage.

Shall be fun to see how long it takes to push the data into O365.