
Remove Spanish diacritics from UTF-8 file

ibgadmin asked:
We receive daily UTF-8 order files that contain Spanish diacritics in the names on the orders. The example record below contains three different accented characters. Is there a way to replace them with plain characters, i.e. i, n, u, etc.? Maybe sed or PowerShell? I don't really know where to start with something like this.

30D0007991Aníbal           Zúñiga

CERTIFIED EXPERT
Top Expert 2014

Commented:
This seems to be an inverse problem of this question that I recently answered:
https://www.experts-exchange.com/questions/29174458/Query-Help-possible-VBA-solution-Array.html#a43042006

In your case, we would want to create an array of [regex] objects that would each identify all the upper ASCII characters for a (desired) lower ASCII character.  For each line/object, you would loop through these regex objects and replace the matches with the desired/associated lower ASCII character.

Note: In the linked comment, I am including the lower ASCII character, because we were looking for a match, not a replacement.  You would not have lower ASCII characters in your regex patterns.
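To make the regex-array idea above concrete, here is a minimal PowerShell sketch. The rule list and the function name Convert-AccentedText are assumptions made up for illustration, not code from the linked answer, and the character classes only cover a few common accented letters, so they would need extending for real data.

```powershell
# Hypothetical rule list: each entry pairs a regex matching accented characters
# with the plain ASCII character that should replace them.
# An array is used instead of a hashtable keyed by letter, because PowerShell
# hashtable keys are case-insensitive ('a' and 'A' would collide).
$rules = @(
    @{ Pattern = [regex]'[áàâä]'; Replacement = 'a' }
    @{ Pattern = [regex]'[éèêë]'; Replacement = 'e' }
    @{ Pattern = [regex]'[íìîï]'; Replacement = 'i' }
    @{ Pattern = [regex]'[óòôö]'; Replacement = 'o' }
    @{ Pattern = [regex]'[úùûü]'; Replacement = 'u' }
    @{ Pattern = [regex]'ñ';      Replacement = 'n' }
    @{ Pattern = [regex]'[ÁÀÂÄ]'; Replacement = 'A' }
    @{ Pattern = [regex]'[ÉÈÊË]'; Replacement = 'E' }
    @{ Pattern = [regex]'[ÍÌÎÏ]'; Replacement = 'I' }
    @{ Pattern = [regex]'[ÓÒÔÖ]'; Replacement = 'O' }
    @{ Pattern = [regex]'[ÚÙÛÜ]'; Replacement = 'U' }
    @{ Pattern = [regex]'Ñ';      Replacement = 'N' }
)

function Convert-AccentedText {
    param(
        [Parameter(Mandatory = $true)][AllowEmptyString()]
        [string]$Text
    )
    # Apply every rule in turn; [regex] objects are case-sensitive by default,
    # so the lowercase and uppercase rules do not interfere with each other.
    foreach ($rule in $rules) {
        $Text = $rule.Pattern.Replace($Text, $rule.Replacement)
    }
    $Text
}

# The sample name from the question:
Convert-AccentedText -Text 'Aníbal           Zúñiga'
# Anibal           Zuniga
```

Because the script itself contains accented literals, it should be saved as UTF-8 with a BOM if it is going to run under Windows PowerShell, which otherwise reads script files in the system's ANSI code page.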
CERTIFIED EXPERT
Most Valuable Expert 2019
Most Valuable Expert 2018

Commented:
Here's a script that accepts a file path as input and removes diacritics.
Save as Remove-Diacritic.ps1 or Whatever.ps1.
By default, the output file is created with the original name plus the suffix "_nd". You can change the suffix using the -Suffix argument, or provide a completely new name using the -NewName argument.
Examples:
# Will generate C:\Temp\SomeFile_nd.txt:
Remove-Diacritic.ps1 -Path C:\Temp\SomeFile.txt
# Will generate C:\Temp\NoDiacritic.txt:
Remove-Diacritic.ps1 -Path C:\Temp\SomeFile.txt -NewName NoDiacritic.txt

[CmdletBinding(DefaultParameterSetName='Suffix')]
Param(
	# File to clean; also accepts pipeline input (e.g. the FullName of Get-ChildItem results).
	[Parameter(Position=0, Mandatory=$true, ValueFromPipeline=$true, ValueFromPipelineByPropertyName=$true)]
	[Alias('FullName')]
	[String]$Path,
	# Name of the output file; it is created in the same folder as the source file.
	[Parameter(Position=1, Mandatory=$true, ParameterSetName='NewName')]
	$NewName,
	# Suffix inserted between base name and extension when -NewName is not used.
	[Parameter(Position=1, ParameterSetName='Suffix')]
	$Suffix = '_nd',
	# Encoding used for both reading and writing.
	[Microsoft.PowerShell.Commands.FileSystemCmdletProviderEncoding]$Encoding = 'UTF8'
)
Begin {
	Function Remove-Diacritic {
		Param(
			[Parameter(Position=0, Mandatory=$true, ValueFromPipeline=$true)][AllowEmptyString()]
			[String[]]$String
		)
		Begin {
			# One StringBuilder is reused for all lines.
			$stringBuilder = New-Object -TypeName System.Text.StringBuilder
		}
		Process {
			$String | ForEach-Object {
				[void]$stringBuilder.Clear()
				# FormD splits each accented character into base character + combining mark(s);
				# dropping the NonSpacingMark characters leaves only the base characters.
				$_.Normalize([System.Text.NormalizationForm]::FormD).ToCharArray() |
					Where-Object {[Globalization.CharUnicodeInfo]::GetUnicodeCategory($_) -ne [Globalization.UnicodeCategory]::NonSpacingMark} |
					ForEach-Object {[void]$stringBuilder.Append($_)}
				$stringBuilder.ToString()
			}
		}
	}
}
Process {
	$item = Get-Item -Path $Path
	# Build the target path next to the source file.
	If ($PSCmdlet.ParameterSetName -eq 'NewName') {
		$newPath = Join-Path -Path $item.DirectoryName -ChildPath $NewName
	} Else {
		$newPath = Join-Path -Path $item.DirectoryName -ChildPath "$($item.BaseName)$($Suffix)$($item.Extension)"
	}
	Write-Verbose "Removing diacritics from '$($item.FullName)' ..."
	# Read line by line, strip the diacritics, and write the result to the new file.
	Get-Content -Path $item.FullName -Encoding $Encoding | Remove-Diacritic | Set-Content -Path $newPath -Encoding $Encoding
	Write-Verbose "... saved as '$($newPath)'"
}
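
For anyone who wants to try the core technique of the script above in isolation: it relies on Unicode normalization form D, which splits an accented character into its base character plus one or more combining marks, and then drops the marks. The standalone sketch below mirrors that logic; the function name Remove-CombiningMark and the test strings are assumptions chosen for illustration.

```powershell
function Remove-CombiningMark {
    param(
        [Parameter(Mandatory = $true)][AllowEmptyString()]
        [string]$Text
    )
    # FormD decomposes e.g. 'ñ' into 'n' followed by U+0303 (combining tilde).
    $decomposed = $Text.Normalize([System.Text.NormalizationForm]::FormD)
    # Keep every character that is not a non-spacing (combining) mark.
    $kept = $decomposed.ToCharArray() | Where-Object {
        [Globalization.CharUnicodeInfo]::GetUnicodeCategory($_) -ne [Globalization.UnicodeCategory]::NonSpacingMark
    }
    # Glue the remaining characters back into a single string.
    -join $kept
}

Remove-CombiningMark -Text 'Zúñiga'
# Zuniga

# Characters that have no base + mark decomposition are left untouched:
Remove-CombiningMark -Text '¿Señor Müller, Øre?'
# ¿Senor Muller, Øre?
```

The second call shows the limit of this approach: punctuation such as ¿ and letters such as Ø are not built from a base character plus a combining mark, so they pass through unchanged and would need an explicit replacement if the downstream system requires pure ASCII.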

Author

Commented:
It works when I run in PS ISE and it prompts for the path.  Is there a way to use this in a batch file or something and have the path and newname so I can hard code that?
CERTIFIED EXPERT
Most Valuable Expert 2019
Most Valuable Expert 2018

Commented:
What's the exact use case here?
You said you receive files, so hard-coding the new name doesn't make much sense.

Author

Commented:
I receive the order files multiple times daily.  Each file is also archived in its original folder.  It is then placed into a specific processing directory.  I need it to run like so in a batch or something for our automation software to run X-times daily.
Remove-Diacritics.ps1 -Path \\Ibgfs1\ecomedate\it_out\Amazon\SLRCNTRL\ASLRORIN -NewName \\Ibgfs1\ecomedate\it_out\Amazon\SLRCNTRL\ASLROROT
CERTIFIED EXPERT
Most Valuable Expert 2019
Most Valuable Expert 2018

Commented:
That is not precise enough.
What is ASLRORIN? The source directory or a single file without extension?
Are there potentially multiple files in the source directory?
Do you maybe want to process all files found in the ASLRORIN directory, remove the diacritics, then save them with the original name in ASLROROT?

Author

Commented:
ASLRORIN is a file, not a directory.  It is the file that needs to be cleaned of diacritics.  I then name it ASLROROT in the same directory.  The file ASLROROT is then free of diacritics and ready to be processed further.  Hope that helps, and sorry for the confusion on my part.
CERTIFIED EXPERT
Most Valuable Expert 2019
Most Valuable Expert 2018
Commented:
Then this command line should do it:
powershell.exe -ExecutionPolicy Bypass -File "C:\Temp\Remove-Diacritics.ps1" -Path "\\Ibgfs1\ecomedate\it_out\Amazon\SLRCNTRL\ASLRORIN" -NewName ASLROROT
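
Since the job runs unattended, it may be worth checking now and then that the cleaned file really contains nothing but ASCII. The sketch below is one possible way to do that; the function name Find-NonAscii and the sample lines are assumptions for illustration and are not part of the script above.

```powershell
function Find-NonAscii {
    param(
        [Parameter(ValueFromPipeline = $true)][AllowEmptyString()]
        [string[]]$Line
    )
    begin { $lineNumber = 0 }
    process {
        foreach ($text in $Line) {
            $lineNumber++
            # Report every line that still contains a character outside 0x00-0x7F.
            if ($text -match '[^\x00-\x7F]') {
                [pscustomobject]@{ LineNumber = $lineNumber; Text = $text }
            }
        }
    }
}

# In-memory demo: only the second line is reported, because of the leading '¿'.
'30D0007991Anibal           Zuniga', '30D0007992¿Quien?' | Find-NonAscii
```

Against the real output it could be run as Get-Content -Path "\\Ibgfs1\ecomedate\it_out\Amazon\SLRCNTRL\ASLROROT" -Encoding UTF8 | Find-NonAscii; no output means no non-ASCII characters were found.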

Author

Commented:
It is working perfectly now.  Thanks so much!!