Introduction
Take the title of this blog post, for example: “Stripping, “Dirty” Characters With PowerShell”, and compare it with the URL of the post: http://www.tellingmachine.com/post/Stripping-Dirty-Characters-With-PowerShell.aspx. Did you notice that the comma and the quotation marks were stripped from the original title and that the spaces were replaced with dashes to form a valid URL?
This post demonstrates different PowerShell techniques to strip or replace characters that are not valid in URLs.
Figure 1: Dirty Harry is a dirty character with lots of Power in his Shells
Permalinks and Slugs
Blog posts use permalinks and slugs to provide a human-readable URL that can easily be traced back to the title of the article. In this case, the title “Stripping, “Dirty” Characters With PowerShell” maps to the slug “Stripping-Dirty-Characters-With-PowerShell”. The following collection of PowerShell functions is a toolkit that you can use to customize the conversion process.
Deleting non-word characters
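Here I use a regular expression that removes every character that is not a \w word character (letter, digit or underscore), a " " space or a hyphen. Because \w in .NET is Unicode-aware, accented letters survive this step; the diacritics functions further down deal with those.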
function Remove-IllegalCharacters([String] $DirtyString)
{
if ([String]::IsNullOrEmpty($DirtyString))
{
return $DirtyString
}
Write-Debug $DirtyString
$CleanerString = $DirtyString -replace '[^\w -]', [String]::Empty
Write-Debug $CleanerString
return $CleanerString
}
Function Test.Remove-IllegalCharactersInStringWithOnlyTheseShouldReturnEmptyString()
{
#Arrange
$DirtyString = "[`"\[\]*.'#$&+,/:;=?@]'\\"
#Act
$CleanerString = Remove-IllegalCharacters -DirtyString $DirtyString
#Assert
Assert-That -ActualValue $CleanerString -Constraint {$ActualValue -eq [String]::Empty}
}
Function Test.Remove-IllegalCharactersInStringWithWhiteSpaceCaractersShouldReturnWhiteSpaceCharacters()
{
#Arrange
$DirtyString = "[`"\[\] *.'#$&+,/: ;=?@]'\ \"
#Act
$CleanerString = Remove-IllegalCharacters -DirtyString $DirtyString
#Assert
Assert-That -ActualValue $CleanerString -Constraint {$ActualValue -eq "   "} # the three spaces of the input survive
}
Function Test.Remove-IllegalCharactersInStringWithWhiteSpaceAndLettersShouldReturnWhiteSpaceAndLetters()
{
#Arrange
$DirtyString = "[`"\[\]a *.'#$&+,/:b ;=?@]'\c \d"
#Act
$CleanerString = Remove-IllegalCharacters -DirtyString $DirtyString
#Assert
Assert-That -ActualValue $CleanerString -Constraint {$ActualValue -eq "a b c d"}
}
Function Test.Remove-IllegalCharactersInStringWithWhiteSpaceAndTabAndLettersShouldReturnWhiteSpaceAndLettersButNoTabs()
{
#Arrange
$DirtyString = "[`"\`t[\]a *.'#$&+,/:`tb ;=?@]'\c `t\d"
#Act
$CleanerString = Remove-IllegalCharacters -DirtyString $DirtyString
#Assert
Assert-That -ActualValue $CleanerString -Constraint {$ActualValue -eq "a b c d"}
}
Function Test.Remove-IllegalCharactersInStringWithHyphenShouldReturnStringWithHyphen()
{
#Arrange
$DirtyString = "[`"\`t[\]a-a *.'#$&+,/:`tb-b ;=?@]'\c-c `t\d-d"
#Act
$CleanerString = Remove-IllegalCharacters -DirtyString $DirtyString
#Assert
Assert-That -ActualValue $CleanerString -Constraint {$ActualValue -eq "a-a b-b c-c d-d"}
}
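To see what the character class does outside of a test harness, here is a minimal sketch that applies the same pattern inline. The sample strings are my own, and the accented letters are built from their code points so the result does not depend on the encoding of the script file.

```powershell
# Same pattern as Remove-IllegalCharacters: keep \w, spaces and hyphens.
$Pattern = '[^\w -]'

# The title of this post: comma and quotation marks are removed.
$Title = 'Stripping, "Dirty" Characters With PowerShell'
$CleanTitle = $Title -replace $Pattern, ''
$CleanTitle              # Stripping Dirty Characters With PowerShell

# .NET's \w is Unicode-aware, so accented letters (U+00F6, U+00E5) survive.
$Accented = "Sm$([char]0x00F6)rg$([char]0x00E5)sbord!"
$CleanAccented = $Accented -replace $Pattern, ''
$CleanAccented.Length    # 11: only the exclamation mark was removed
```

The second case is the reason the diacritics functions later in this post exist: the regex alone leaves accented letters in the slug.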
Replacing space with hyphen
This function uses a simple regex to replace each run of one or more " " spaces with a single hyphen.
function Replace-SpacesWithHyphen([string] $CleanString)
{
if ([String]::IsNullOrEmpty($CleanString))
{
return $CleanString
}
Write-Debug $CleanString
$HyphenatedString = $CleanString -replace '( )+', '-'
Write-Debug $HyphenatedString
return $HyphenatedString
}
Function Test.Replace-SpacesWithHyphenInStringWithSpacesShouldReturnStringWithHyphen()
{
#Arrange
$CleanString = "a b c d"
#Act
$HyphenatedString = Replace-SpacesWithHyphen -CleanString $CleanString
#Assert
Assert-That -ActualValue $HyphenatedString -Constraint {$ActualValue -eq "a-b-c-d"}
}
Function Test.Replace-SpacesWithHyphenInStringWithHyphenShouldReturnStringWithHyphen()
{
#Arrange
$DirtyString = "[`"\`t[\]a-a *.'#$&+,/:`tb-b ;=?@]'\c-c `t\d-d"
#Act
$CleanerString = Remove-IllegalCharacters -DirtyString $DirtyString
$CleanerString = Replace-SpacesWithHyphen -CleanString $CleanerString
#Assert
Assert-That -ActualValue $CleanerString -Constraint {$ActualValue -eq "a-a-b-b-c-c-d-d"}
}
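The two steps so far can be chained into a single slug function. The following is only a sketch: ConvertTo-Slug is a hypothetical helper of mine, not part of the downloadable script, and the Trim() call is an addition so that leading or trailing spaces do not turn into hyphens.

```powershell
# Hypothetical helper (not part of the downloadable script): strips the
# illegal characters, trims, then collapses each run of spaces to a hyphen.
function ConvertTo-Slug([string] $Title)
{
    if ([String]::IsNullOrEmpty($Title)) { return $Title }
    $Stripped = $Title -replace '[^\w -]', ''
    return $Stripped.Trim() -replace ' +', '-'
}

$Slug = ConvertTo-Slug -Title 'Stripping, "Dirty" Characters With PowerShell'
$Slug                                     # Stripping-Dirty-Characters-With-PowerShell

# Why the pattern matches a run of spaces rather than a single space:
$Naive     = 'a  b' -replace ' ', '-'     # a--b
$Collapsed = 'a  b' -replace ' +', '-'    # a-b
```

The result for the title of this post matches the slug in its URL.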
Deleting diacritics
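This function removes diacritics (accents, umlauts and similar marks) from Latin characters. It normalizes the string to Unicode form D, which splits each accented letter into a base letter plus combining marks, drops every character in the NonSpacingMark category, and recomposes the result to form C.
# The System.Globalization and System.Text types used here live in the core library that PowerShell loads by default, so no explicit assembly load is needed.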
function Remove-DiacriticsFromLatinCharacters([string] $StringToConvert)
{
if ([String]::IsNullOrEmpty($StringToConvert))
{
return $StringToConvert
}
Write-Debug $StringToConvert
[String] $Normalized = $StringToConvert.Normalize([System.Text.NormalizationForm]::FormD)
[System.Text.StringBuilder] $SB = New-Object -TypeName "System.Text.StringBuilder"
for ($i = 0; $i -lt $Normalized.Length; $i++)
{
[Char] $C = $Normalized[$i];
if ([System.Globalization.CharUnicodeInfo]::GetUnicodeCategory($C) -ne [System.Globalization.UnicodeCategory]::NonSpacingMark)
{
[void] $SB.Append($C)
}
}
$StringWithoutDiacritics = $SB.ToString().Normalize([System.Text.NormalizationForm]::FormC)
Write-Debug $StringWithoutDiacritics
return $StringWithoutDiacritics
}
Function Test.Remove-DiacriticsFromLatinCharactersShouldRemoveAccents()
{
#Arrange
$StringWithDiacritics = "SE:ÅåÄäÖö; PL:ĄąĆćĘęŁłŃńÓóŚśŹźŻż; SK:ľščťžýáíéúäôňďĽŠČŤŽÝÁÍÉÚÄÔŇĎ; HU:ëőüűŐÜŰ; ES:Ññ¿; CA:àèòçï"
#Act
$ResultString = Remove-DiacriticsFromLatinCharacters -StringToConvert $StringWithDiacritics
#Assert
Assert-That -ActualValue $ResultString -Constraint {$ActualValue -eq "SE:AaAaOo; PL:AaCcEeŁłNnOoSsZzZz; SK:lsctzyaieuaondLSCTZYAIEUAOND; HU:eouuOUU; ES:Nn¿; CA:aeoci"}
}
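The decomposition step is easier to follow on a single character. The sketch below is an alternative formulation, not the function above: it removes the combining marks with the regex Unicode category \p{Mn} instead of a loop. Remove-Marks is a name I made up for illustration, and the characters are built from code points to sidestep file-encoding issues.

```powershell
# Alternative sketch (hypothetical helper): decompose, then remove every
# non-spacing mark with the Unicode category class \p{Mn}, then recompose.
function Remove-Marks([string] $Text)
{
    if ([String]::IsNullOrEmpty($Text)) { return $Text }
    $Decomposed = $Text.Normalize([System.Text.NormalizationForm]::FormD)
    return ($Decomposed -replace '\p{Mn}', '').Normalize([System.Text.NormalizationForm]::FormC)
}

# e-acute (U+00E9) decomposes into 'e' plus a combining acute accent.
$EAcute = [string][char]0x00E9
$DecomposedLength = $EAcute.Normalize([System.Text.NormalizationForm]::FormD).Length    # 2
$Stripped = Remove-Marks $EAcute                                                        # e

# L-with-stroke (U+0141) has no decomposition, so it passes through unchanged.
$LStroke = [string][char]0x0141
$StillStroked = Remove-Marks $LStroke
```

This is why Ł and ł are still present in the expected result of the test above: the stroke is part of the letter itself rather than a separate combining mark.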
Deleting diacritics via Code Page conversion
Whatever diacritics are left after the previous step can be knocked out with the following procedure, which converts the string to code page 850 and back. Characters that the code page cannot represent, such as Ł, are replaced by their closest match (L) during the conversion.
# The System.Globalization and System.Text types used here live in the core library that PowerShell loads by default, so no explicit assembly load is needed.
function Remove-DiacriticsViaCodepageConversion([string] $StringToConvert)
{
if ([String]::IsNullOrEmpty($StringToConvert))
{
return $StringToConvert
}
Write-Debug $StringToConvert
$NonUnicodeEncoding = [System.Text.Encoding]::GetEncoding(850)
$UnicodeEncoding = [System.Text.Encoding]::Unicode
[Byte[]] $UnicodeBytes = $UnicodeEncoding.GetBytes($StringToConvert);
[Byte[]] $NonUnicodeBytes = [System.Text.Encoding]::Convert($UnicodeEncoding, $NonUnicodeEncoding , $UnicodeBytes);
[Char[]] $NonUnicodeChars = New-Object -TypeName "Char[]" -ArgumentList $($NonUnicodeEncoding.GetCharCount($NonUnicodeBytes, 0, $NonUnicodeBytes.Length))
[void] $NonUnicodeEncoding.GetChars($NonUnicodeBytes, 0, $NonUnicodeBytes.Length, $NonUnicodeChars, 0);
[String] $NonUnicodeString = New-Object String(,$NonUnicodeChars) #Tricky, Tricky, Tricky
Write-Debug $NonUnicodeString
return $NonUnicodeString
}
Function Test.Remove-DiacriticsViaCodepageConversionShouldRemoveOtherUnusualCharacters()
{
#Arrange
$UnicodeString = "SE:AaAaOo; PL:AaCcEeŁłNnOoSsZzZz; SK:lsctzyaieuaondLSCTZYAIEUAOND; HU:eouuOUU; ES:Nn¿; CA:aeoci"
#Act
$ResultString = Remove-DiacriticsViaCodepageConversion -StringToConvert $UnicodeString
#Assert
Assert-That -ActualValue $ResultString -Constraint {$ActualValue -eq "SE:AaAaOo; PL:AaCcEeLlNnOoSsZzZz; SK:lsctzyaieuaondLSCTZYAIEUAOND; HU:eouuOUU; ES:Nn¿; CA:aeoci"}
}
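The same round trip can be written more compactly by encoding to the code page and decoding straight back. This is a sketch under the assumption that code page 850 is available in your PowerShell session; Convert-ViaCodePage and its CodePage parameter are hypothetical names. What a given runtime substitutes for an unrepresentable character may differ, so the example inspects the result instead of promising one.

```powershell
# Hypothetical compact variant: encode to the code page, decode straight back.
function Convert-ViaCodePage([string] $Text, [int] $CodePage = 850)
{
    if ([String]::IsNullOrEmpty($Text)) { return $Text }
    $Encoding = [System.Text.Encoding]::GetEncoding($CodePage)
    # Characters missing from the code page are substituted while encoding.
    return $Encoding.GetString($Encoding.GetBytes($Text))
}

# Plain ASCII exists in code page 850, so it comes back unchanged.
$Plain = Convert-ViaCodePage -Text 'a-b c'

# L-with-stroke (U+0141) is not in code page 850; print what replaced it.
$Converted = Convert-ViaCodePage -Text ([string][char]0x0141)
'U+{0:X4}' -f [int][char]$Converted
```

Passing a different CodePage value is an easy way to experiment with how other code pages treat the characters you care about.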
Use the .NET UrlEncode operation
.NET also offers a generic way to quickly convert any string into a form that is valid inside a URL. The advantage: the encoding can be reversed. The disadvantage: encoded URLs are not always human-readable.
Add-Type -AssemblyName System.Web
function Encode-URL([string] $StringToEncode)
{
Write-Debug $StringToEncode
$EncodedUrl = [System.Web.HttpUtility]::UrlEncode($StringToEncode)
Write-Debug $EncodedUrl
return $EncodedUrl
}
Function Test.Encode-URLReturnsLegalURL()
{
#Arrange
$UnicodeString = "This is just a <invalid url>; but I like it!"
#Act
$ResultString = Encode-URL -StringToEncode $UnicodeString
#Assert
Assert-That -ActualValue $ResultString -Constraint {$ActualValue -eq "This+is+just+a+%3cinvalid+url%3e%3b+but+I+like+it!"}
}
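To illustrate the claim that the encoding is reversible, here is a small round-trip sketch. Note that it swaps in [System.Uri]::EscapeDataString and UnescapeDataString rather than HttpUtility, because they need no extra assembly; the output differs from the test above in that spaces become %20 instead of +.

```powershell
# Alternative to HttpUtility: System.Uri needs no extra assembly load.
$Original = 'This is just a <invalid url>; but I like it!'
$Escaped  = [System.Uri]::EscapeDataString($Original)
$Escaped                          # spaces appear as %20 here, not as +

# The encoding is reversible: unescaping restores the original string.
$RoundTrip = [System.Uri]::UnescapeDataString($Escaped)
$RoundTrip -ceq $Original         # True
```

Either way the result is safe to embed in a URL, but it is a long way from the readable slug that the earlier functions produce.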
Download
The complete script and the PSUnit unit tests can be downloaded here: Remove-DirtyCharacters.zip
Outlook
As always, PowerShell is fun, fun, fun!