ibgadmin

Remove Spanish diacritics from UTF-8 file

We receive daily UTF-8 order files in which the names on the orders contain Spanish diacritics. The example record below contains three different accented characters. Is there a way to replace them with plain characters (i, n, u, etc.)? Maybe sed or PowerShell? I don't really know where to start with something like this.

30D0007991Aníbal           Zúñiga
Powershell

Last Comment
ibgadmin

8/22/2022 - Mon
aikimark

This seems to be an inverse problem of this question that I recently answered:
https://www.experts-exchange.com/questions/29174458/Query-Help-possible-VBA-solution-Array.html?anchorAnswerId=43042006#a43042006

In your case, we would want to create an array of [regex] objects, each of which matches all the accented (non-ASCII) variants of one plain ASCII character. For each line/object, you would loop through these regex objects and replace the matches with the associated plain ASCII character.

Note: In the linked comment, the patterns also include the plain ASCII character, because we were looking for a match, not a replacement. Your regex patterns would not include the plain ASCII characters.
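
A minimal sketch of that regex-table idea, not taken from the linked answer. The table name $accentMap, the function name Convert-AccentedLetter, and the choice of character ranges are all assumptions made for illustration; the ranges cover the Latin-1 accented vowels plus ñ/Ñ and would need extending for other characters. The patterns use \u escapes so that the script does not depend on the encoding it is saved in.

```powershell
# Hypothetical lookup table: each regex matches the accented variants
# of one plain ASCII letter (Latin-1 range only).
$accentMap = @(
    @{ Regex = [regex]'[\u00C0-\u00C5]'; Ascii = 'A' }
    @{ Regex = [regex]'[\u00E0-\u00E5]'; Ascii = 'a' }
    @{ Regex = [regex]'[\u00C8-\u00CB]'; Ascii = 'E' }
    @{ Regex = [regex]'[\u00E8-\u00EB]'; Ascii = 'e' }
    @{ Regex = [regex]'[\u00CC-\u00CF]'; Ascii = 'I' }
    @{ Regex = [regex]'[\u00EC-\u00EF]'; Ascii = 'i' }
    @{ Regex = [regex]'\u00D1';          Ascii = 'N' }
    @{ Regex = [regex]'\u00F1';          Ascii = 'n' }
    @{ Regex = [regex]'[\u00D2-\u00D6]'; Ascii = 'O' }
    @{ Regex = [regex]'[\u00F2-\u00F6]'; Ascii = 'o' }
    @{ Regex = [regex]'[\u00D9-\u00DC]'; Ascii = 'U' }
    @{ Regex = [regex]'[\u00F9-\u00FC]'; Ascii = 'u' }
)

function Convert-AccentedLetter {
    param([string]$Text)
    # Apply every regex in turn; [regex] objects are case-sensitive by default,
    # so upper and lower case are handled by separate entries.
    foreach ($entry in $accentMap) {
        $Text = $entry.Regex.Replace($Text, $entry.Ascii)
    }
    $Text
}

# The name from the sample record, built from code points (í, ú, ñ).
$sample = "An$([char]0x00ED)bal Z$([char]0x00FA)$([char]0x00F1)iga"
Convert-AccentedLetter $sample   # Anibal Zuniga
```

The trade-off of a table like this is that it only handles the characters someone remembered to list.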
oBdA

Here's a script that accepts a file path as input and removes diacritics.
Save as Remove-Diacritic.ps1 or Whatever.ps1.
By default, the output file is generated with the original name plus the suffix "_nd". You can change the suffix using the -Suffix argument, or provide a completely new name using the -NewName argument.
Examples:
# Will generate C:\Temp\SomeFile_nd.txt:
.\Remove-Diacritic.ps1 -Path C:\Temp\SomeFile.txt
# Will generate C:\Temp\NoDiacritic.txt:
.\Remove-Diacritic.ps1 -Path C:\Temp\SomeFile.txt -NewName NoDiacritic.txt


[CmdletBinding(DefaultParameterSetName='Suffix')]
Param(
	[Parameter(Position=0, Mandatory=$true, ValueFromPipeline=$true, ValueFromPipelineByPropertyName=$true)]
	[Alias('FullName')]
	[String]$Path,
	[Parameter(Position=1, Mandatory=$true, ParameterSetName='NewName')]
	$NewName,
	[Parameter(Position=1, ParameterSetName='Suffix')]
	$Suffix = '_nd',
	[Microsoft.PowerShell.Commands.FileSystemCmdletProviderEncoding]$Encoding = 'UTF8'
)
Begin {
	Function Remove-Diacritic {
	Param(
		[Parameter(Position=0, Mandatory=$true, ValueFromPipeline=$true)][AllowEmptyString()]
		[String[]]$String
	)
		Begin {
			$stringBuilder = New-Object -TypeName System.Text.StringBuilder
		}
		Process {
			$String | ForEach-Object {
				[void]$stringBuilder.Clear()
				$_.Normalize([System.Text.NormalizationForm]::FormD).ToCharArray() |
					Where-Object {[Globalization.CharUnicodeInfo]::GetUnicodeCategory($_) -ne [Globalization.UnicodeCategory]::NonSpacingMark} |
					ForEach-Object {[void]$stringBuilder.Append($_)}
				$stringBuilder.ToString()
			}
		}
	}
}
Process {
	$item = Get-Item -Path $Path
	If ($PSCmdlet.ParameterSetName -eq 'NewName') {
		$newPath = Join-Path -Path $item.DirectoryName -ChildPath $NewName
	} Else {
		$newPath = Join-Path -Path $item.DirectoryName -ChildPath "$($item.BaseName)$($Suffix)$($item.Extension)"
	}
	Write-Verbose "Removing diacritics from '$($item.FullName)' ..."
	Get-Content -Path $item.FullName -Encoding $Encoding | Remove-Diacritic | Set-Content -Path $newPath -Encoding $Encoding
	Write-Verbose "... saved as '$($newPath)'"
}
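
To see why the script above works, here is a small sketch that applies the same two steps to the surname from the sample record. The variable names are made up for illustration, and the input is assumed to be in precomposed form (one code point per accented letter), which is built here from code points to keep the example independent of file encoding.

```powershell
# "Zúñiga" with precomposed ú (U+00FA) and ñ (U+00F1).
$name = "Z$([char]0x00FA)$([char]0x00F1)iga"

# Step 1: FormD splits each accented letter into a base letter
# followed by a combining mark (u + U+0301, n + U+0303).
$decomposed = $name.Normalize([System.Text.NormalizationForm]::FormD)
$name.Length         # 6
$decomposed.Length   # 8

# Step 2: drop the combining marks (category NonSpacingMark) and rejoin.
$clean = -join ($decomposed.ToCharArray() | Where-Object {
    [Globalization.CharUnicodeInfo]::GetUnicodeCategory($_) -ne [Globalization.UnicodeCategory]::NonSpacingMark
})
$clean               # Zuniga
$clean.Length        # 6
```

For precomposed input like this, the cleaned string has the same number of characters as the original, so the column positions in a fixed-width record should stay aligned.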


ibgadmin

ASKER
It works when I run it in PowerShell ISE, where it prompts for the path. Is there a way to call it from a batch file or something similar, with the path and new name hard-coded?
oBdA

What's the exact use case here?
You said you receive files, so hard-coding the new name doesn't make much sense.
ibgadmin

ASKER
I receive the order files multiple times daily. Each file is also archived in its original folder, and is then placed into a specific processing directory. I need this to run from a batch file or something similar, so that our automation software can run it X times daily, like so:
Remove-Diacritics.ps1 -Path \\Ibgfs1\ecomedate\it_out\Amazon\SLRCNTRL\ASLRORIN -NewName \\Ibgfs1\ecomedate\it_out\Amazon\SLRCNTRL\ASLROROT
oBdA

That is not precise enough.
What is ASLRORIN? The source directory or a single file without extension?
Are there potentially multiple files in the source directory?
Do you maybe want to process all files found in the ASLRORIN directory, remove the diacritics, then save them with the original name in ASLROROT?
ibgadmin

ASKER
ASLRORIN is a file, not a directory. It is the file that needs to be cleaned of diacritics. I then name the result ASLROROT in the same directory. The file ASLROROT is then free of diacritics and ready to be processed further. Hope that helps, and sorry for the confusion on my part.
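
For the single-file case described here, one possible shape is sketched below. This is an illustration only and not the accepted solution from this thread. The function name Copy-WithoutDiacritics and its parameter names are hypothetical, both paths are assumed to be full paths, and writing the output as UTF-8 without a byte order mark is an assumption about what the downstream process expects.

```powershell
function Copy-WithoutDiacritics {
    param(
        [Parameter(Mandatory = $true)][string]$Source,       # full path of the input file
        [Parameter(Mandatory = $true)][string]$Destination   # full path of the cleaned copy
    )
    # Assumption: UTF-8 in, UTF-8 without BOM out.
    $utf8NoBom = New-Object System.Text.UTF8Encoding($false)
    $text = [System.IO.File]::ReadAllText($Source, $utf8NoBom)

    # Same technique as the script above: decompose, then skip combining marks.
    $builder = New-Object System.Text.StringBuilder
    foreach ($ch in $text.Normalize([System.Text.NormalizationForm]::FormD).ToCharArray()) {
        if ([Globalization.CharUnicodeInfo]::GetUnicodeCategory($ch) -ne [Globalization.UnicodeCategory]::NonSpacingMark) {
            [void]$builder.Append($ch)
        }
    }
    [System.IO.File]::WriteAllText($Destination, $builder.ToString(), $utf8NoBom)
}

# Hard-coded call for the paths given in this thread:
# Copy-WithoutDiacritics -Source '\\Ibgfs1\ecomedate\it_out\Amazon\SLRCNTRL\ASLRORIN' -Destination '\\Ibgfs1\ecomedate\it_out\Amazon\SLRCNTRL\ASLROROT'
```

If the function and the hard-coded call were saved together in a .ps1 file, a batch file or scheduler could start it with powershell.exe -NoProfile -ExecutionPolicy Bypass -File followed by the path to that script.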
ASKER CERTIFIED SOLUTION
oBdA

THIS SOLUTION ONLY AVAILABLE TO MEMBERS.
ibgadmin

ASKER
It is working perfectly now.  Thanks so much!!