Stripping, “Dirty” Characters With PowerShell

by Klaus Graefensteiner 3. December 2009 11:08

Introduction

Take the title of this blog post, for example: “Stripping, “Dirty” Characters With PowerShell”, and compare it with the URL of the post: http://www.tellingmachine.com/post/Stripping-Dirty-Characters-With-PowerShell.aspx. Did you notice that the comma and the quotation marks were stripped from the original title and that the spaces were replaced with dashes to form a valid URL?

This post demonstrates different PowerShell techniques to strip or replace characters that are not valid in URLs.

Figure 1: Dirty Harry is a dirty character with lots of Power in his Shells

Permalinks and Slugs

Blog posts use permalinks and slugs to provide a human-readable URL that can easily be mapped back to the title of the article it refers to. In this case, the title “Stripping, “Dirty” Characters With PowerShell” maps to the slug “Stripping-Dirty-Characters-With-PowerShell”. The following collection of PowerShell functions is a toolkit that readers can use to customize the conversion process.

Deleting non-word characters

Here I use a regular expression that removes every character that is not a \w word character, a " " space or a hyphen. Other whitespace, such as tabs, is removed as well.

function Remove-IllegalCharacters([String] $DirtyString)
{
    if ([String]::IsNullOrEmpty($DirtyString))
    {
        return $DirtyString
    }
    
    Write-Debug $DirtyString
    # Remove everything that is not a word character, a space or a hyphen
    $CleanerString = $DirtyString -replace '[^\w -]', [String]::Empty
    Write-Debug $CleanerString
    
    return $CleanerString
}

Function Test.Remove-IllegalCharactersInStringWithOnlyTheseShouldReturnEmptyString()
{
    #Arrange
    $DirtyString = "[`"\[\]*.'#$&+,/:;=?@]'\\"
    
    #Act
    $CleanerString = Remove-IllegalCharacters -DirtyString $DirtyString
    
    #Assert
    Assert-That -ActualValue $CleanerString -Constraint {$ActualValue -eq [String]::Empty}
}

Function Test.Remove-IllegalCharactersInStringWithWhiteSpaceCharactersShouldReturnWhiteSpaceCharacters()
{
    #Arrange
    $DirtyString = "[`"\[\] *.'#$&+,/: ;=?@]'\ \"
    
    #Act
    $CleanerString = Remove-IllegalCharacters -DirtyString $DirtyString
    
    #Assert
    Assert-That -ActualValue $CleanerString -Constraint {$ActualValue -eq "   "}
}

Function Test.Remove-IllegalCharactersInStringWithWhiteSpaceAndLettersShouldReturnWhiteSpaceAndLetters()
{
    #Arrange
    $DirtyString = "[`"\[\]a *.'#$&+,/:b ;=?@]'\c \d"
    
    #Act
    $CleanerString = Remove-IllegalCharacters -DirtyString $DirtyString
    
    #Assert
    Assert-That -ActualValue $CleanerString -Constraint {$ActualValue -eq "a b c d"}
}

Function Test.Remove-IllegalCharactersInStringWithWhiteSpaceAndTabAndLettersShouldReturnWhiteSpaceAndLettersButNoTabs()
{
    #Arrange
    $DirtyString = "[`"\`t[\]a *.'#$&+,/:`tb ;=?@]'\c `t\d"
    
    #Act
    $CleanerString = Remove-IllegalCharacters -DirtyString $DirtyString
    
    #Assert
    Assert-That -ActualValue $CleanerString -Constraint {$ActualValue -eq "a b c d"}
}

Function Test.Remove-IllegalCharactersInStringWithHyphenShouldReturnStringWithHyphen()
{
    #Arrange
    $DirtyString = "[`"\`t[\]a-a *.'#$&+,/:`tb-b ;=?@]'\c-c `t\d-d"
    
    #Act
    $CleanerString = Remove-IllegalCharacters -DirtyString $DirtyString
    
    #Assert
    Assert-That -ActualValue $CleanerString -Constraint {$ActualValue -eq "a-a b-b c-c d-d"}
}
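
To see what the character class keeps and what it drops, here is a small sketch. The sample strings are made up for illustration and are not part of the original script. Note that in .NET, \w also matches non-ASCII letters, so an accented letter such as é survives this step; that is what the diacritics functions further down are for.

```powershell
# Hypothetical sample input; the accented e is built from its code point
# so the result does not depend on the encoding of the script file.
$EAcute = [char]0x00E9
$Sample = 'Caf' + $EAcute + ', 100% fun_ok-1!'

# Same pattern as in Remove-IllegalCharacters
$Cleaned = $Sample -replace '[^\w -]', ''
$Cleaned                               # Café 100 fun_ok-1

# A tab is neither \w, a space nor a hyphen, so it is removed
$NoTab = "a`tb" -replace '[^\w -]', ''
$NoTab                                 # ab
```

Digits, the underscore and the hyphen all pass through unchanged, while the comma, the percent sign and the exclamation mark are dropped.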

Replacing space with hyphen

This function uses a simple regex to replace one or more " " spaces with only one hyphen.

function Replace-SpacesWithHyphen([string] $CleanString)
{
    if ([String]::IsNullOrEmpty($CleanString))
    {
        return $CleanString
    }
    
    Write-Debug $CleanString
    # Collapse each run of one or more spaces into a single hyphen
    $HyphenatedString = $CleanString -replace '( )+', '-'
    Write-Debug $HyphenatedString
    return $HyphenatedString
}

Function Test.Replace-SpacesWithHyphenInStringWithSpacesShouldReturnStringWithHyphen()
{
    #Arrange
    $CleanString = "a b  c   d"
    
    #Act
    $HyphenatedString = Replace-SpacesWithHyphen -CleanString $CleanString
    
    #Assert
    Assert-That -ActualValue $HyphenatedString -Constraint {$ActualValue -eq "a-b-c-d"}
}

Function Test.Replace-SpacesWithHyphenInStringWithHyphenShouldReturnStringWithHyphen()
{
    #Arrange
    $DirtyString = "[`"\`t[\]a-a *.'#$&+,/:`tb-b ;=?@]'\c-c `t\d-d"
    
    #Act
    $CleanerString = Remove-IllegalCharacters -DirtyString $DirtyString
    $CleanerString = Replace-SpacesWithHyphen -CleanString $CleanerString
    
    #Assert
    Assert-That -ActualValue $CleanerString -Constraint {$ActualValue -eq "a-a-b-b-c-c-d-d"}
}
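
Chaining the two regular expressions on the title of this post reproduces the slug from the introduction. The sketch below inlines both patterns so it runs on its own. The helper ConvertTo-TrimmedSlug is a hypothetical variant, not part of the original script, that shows one way to handle leading spaces, trailing spaces and hyphens that sit next to spaces.

```powershell
# Build the title from code points so the curly quotes survive any file encoding
$Title = 'Stripping, ' + [char]0x201C + 'Dirty' + [char]0x201D + ' Characters With PowerShell'

# Step 1: strip illegal characters, step 2: replace runs of spaces with a hyphen
$Stripped = $Title -replace '[^\w -]', ''
$Slug = $Stripped -replace '( )+', '-'
$Slug                                   # Stripping-Dirty-Characters-With-PowerShell

# Edge case: spaces around an existing hyphen and at the ends each become a hyphen
$Edge = ' a - b ' -replace '( )+', '-'
$Edge                                   # -a---b-

# Hypothetical variant: collapse mixed runs of spaces and hyphens, then trim the ends
function ConvertTo-TrimmedSlug([string] $Text)
{
    return ($Text -replace '[ -]+', '-').Trim('-')
}

ConvertTo-TrimmedSlug ' a - b '         # a-b
```

Whether the trimmed form is preferable depends on how the blog engine compares slugs, so treat the variant as a starting point.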

Deleting diacritics

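This function gets rid of all the little specks that some characters accumulate. It decomposes each character into a base letter plus combining marks (Unicode normalization form D), drops the non-spacing marks, and recomposes the result. Letters such as Ł, whose stroke is not a combining mark, pass through unchanged.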

# CharUnicodeInfo, NormalizationForm and StringBuilder are part of the core library,
# so no assembly has to be loaded explicitly (LoadWithPartialName is obsolete anyway).


function Remove-DiacriticsFromLatinCharacters([string] $StringToConvert)
{
    
    if ([String]::IsNullOrEmpty($StringToConvert))
    {
        return $StringToConvert
    }
    
    Write-Debug $StringToConvert

    # FormD decomposes each accented letter into a base letter followed by combining marks
    [String] $Normalized = $StringToConvert.Normalize([System.Text.NormalizationForm]::FormD)
    [System.Text.StringBuilder] $SB = New-Object -TypeName "System.Text.StringBuilder"

    for ($i = 0; $i -lt $Normalized.Length; $i++)
    {
        [Char] $C = $Normalized[$i];
        if ([System.Globalization.CharUnicodeInfo]::GetUnicodeCategory($C) -ne [System.Globalization.UnicodeCategory]::NonSpacingMark)
        {
            [void] $SB.Append($C)
        }
    }

    $StringWithoutDiacritics = $SB.ToString().Normalize([System.Text.NormalizationForm]::FormC)
    Write-Debug $StringWithoutDiacritics
    return $StringWithoutDiacritics     
}

Function Test.Remove-DiacriticsFromLatinCharactersShouldRemoveAccents()
{
    #Arrange
    $StringWithDiacricits = "SE:ÅåÄäÖö; PL:ĄąĆćĘꣳŃńÓóŚśŹźŻż; SK:ľščťžýáíéúäôňďĽŠČŤŽÝÁÍÉÚÄÔŇĎ; HU:ëőüűŐÜŰ; ES:Ññ¿; CA:àèòçï"
    
    #Act
    $ResultString = Remove-DiacriticsFromLatinCharacters -StringToConvert $StringWithDiacricits
    
    #Assert
    Assert-That -ActualValue $ResultString -Constraint {$ActualValue -eq "SE:AaAaOo; PL:AaCcEeŁłNnOoSsZzZz; SK:lsctzyaieuaondLSCTZYAIEUAOND; HU:eouuOUU; ES:Nn¿; CA:aeoci"}
}
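
The following sketch shows why the loop above works: after normalization to FormD, an accented letter is two characters, and the second one is reported as a NonSpacingMark. It also shows a shorter alternative that removes the marks with the regex category \p{Mn}. Remove-DiacriticsWithRegex is a hypothetical name and not part of the original script; it is meant as an equivalent of the loop, not a replacement endorsed by it.

```powershell
# Precomposed e with acute accent, built from its code point
$EAcute = [string][char]0x00E9

# FormD splits it into 'e' plus the combining acute accent U+0301
$Decomposed = $EAcute.Normalize([System.Text.NormalizationForm]::FormD)
$Decomposed.Length                      # 2
[System.Globalization.CharUnicodeInfo]::GetUnicodeCategory($Decomposed[1])   # NonSpacingMark

# Hypothetical regex-based variant: \p{Mn} matches non-spacing marks
function Remove-DiacriticsWithRegex([string] $Text)
{
    if ([String]::IsNullOrEmpty($Text))
    {
        return $Text
    }

    $Parts = $Text.Normalize([System.Text.NormalizationForm]::FormD)
    return ($Parts -replace '\p{Mn}', '').Normalize([System.Text.NormalizationForm]::FormC)
}

Remove-DiacriticsWithRegex ('Caf' + [char]0x00E9)   # Cafe
```

As with the loop, a letter like Ł (U+0141) has no decomposition and is returned unchanged.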

Deleting diacritics via Code Page conversion

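Whatever specks/diacritics are left after the previous cleaning can be knocked out with the following procedure, which converts the string to code page 850 and back. In the test below, this turns Ł and ł into L and l.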

# System.Text.Encoding is part of the core library, so no assembly has to be loaded explicitly.

function Remove-DiacriticsViaCodepageConversion([string] $StringToConvert)
{
    if ([String]::IsNullOrEmpty($StringToConvert))
    {
        return $StringToConvert
    }
    
    Write-Debug $StringToConvert
    $NonUnicodeEncoding = [System.Text.Encoding]::GetEncoding(850)
    $UnicodeEncoding = [System.Text.Encoding]::Unicode

    [Byte[]] $UnicodeBytes = $UnicodeEncoding.GetBytes($StringToConvert);

    [Byte[]] $NonUnicodeBytes = [System.Text.Encoding]::Convert($UnicodeEncoding, $NonUnicodeEncoding , $UnicodeBytes);

    [Char[]] $NonUnicodeChars = New-Object -TypeName "Char[]" -ArgumentList $($NonUnicodeEncoding.GetCharCount($NonUnicodeBytes, 0, $NonUnicodeBytes.Length))

    [void] $NonUnicodeEncoding.GetChars($NonUnicodeBytes, 0, $NonUnicodeBytes.Length, $NonUnicodeChars, 0);
    
    # Tricky: the unary comma wraps the char array so that it is passed as one constructor argument
    [String] $NonUnicodeString = New-Object String(,$NonUnicodeChars)

    Write-Debug $NonUnicodeString
    return $NonUnicodeString
}

Function Test.Remove-DiacriticsViaCodepageConversionShouldRemoveOtherUnusualCharacters()
{
    #Arrange
    $UnicodeString = "SE:AaAaOo; PL:AaCcEeŁłNnOoSsZzZz; SK:lsctzyaieuaondLSCTZYAIEUAOND; HU:eouuOUU; ES:Nn¿; CA:aeoci"
    
    #Act
    $ResultString = Remove-DiacriticsViaCodepageConversion -StringToConvert $UnicodeString
    
    #Assert
    Assert-That -ActualValue $ResultString -Constraint {$ActualValue -eq "SE:AaAaOo; PL:AaCcEeLlNnOoSsZzZz; SK:lsctzyaieuaondLSCTZYAIEUAOND; HU:eouuOUU; ES:Nn¿; CA:aeoci"}
}
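
What a lossy conversion like this produces depends on the target code page and on the fallback behavior of the encoder. As a rough sketch, and not the method used above, the hypothetical helper Convert-ToAsciiWithMarker runs the same encode/decode round trip against US-ASCII with an explicit replacement fallback. Every character without a mapping then becomes a visible marker instead of a substituted letter, which can help to spot characters that a slug would otherwise lose silently.

```powershell
# Hypothetical helper: round trip through US-ASCII, marking unmappable characters
function Convert-ToAsciiWithMarker([string] $Text, [string] $Marker = '?')
{
    if ([String]::IsNullOrEmpty($Text))
    {
        return $Text
    }

    $EncoderFallback = New-Object -TypeName System.Text.EncoderReplacementFallback -ArgumentList $Marker
    $DecoderFallback = New-Object -TypeName System.Text.DecoderReplacementFallback -ArgumentList $Marker
    $Ascii = [System.Text.Encoding]::GetEncoding('us-ascii', $EncoderFallback, $DecoderFallback)

    # Encode to bytes and decode again; anything outside ASCII is replaced by the marker
    return $Ascii.GetString($Ascii.GetBytes($Text))
}

# Sample built from code points: 'Za', z with dot above, o with acute, l with stroke, 'c'
$Sample = 'Za' + [char]0x017C + [char]0x00F3 + [char]0x0142 + 'c'
Convert-ToAsciiWithMarker $Sample       # Za???c
```

Running the diacritics removal first and this check afterwards would show which characters still need special handling.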

Use the .NET UrlEncode operation

.NET also offers a very generic way to quickly convert any string into a valid, encoded URL component. The advantage: the process can be reversed. The disadvantage: encoded URLs are not always human-readable.

# HttpUtility lives in System.Web; Add-Type replaces the obsolete LoadWithPartialName call
Add-Type -AssemblyName System.Web

function Encode-URL([string] $StringToEncode)
{
    Write-Debug $StringToEncode
    $EncodedUrl = [System.Web.HttpUtility]::UrlEncode($StringToEncode)
    Write-Debug $EncodedUrl

    return $EncodedUrl
}

Function Test.Encode-URLReturnsLegalURL()
{
    #Arrange
    $UnicodeString = "This is just a <invalid url>; but I like it!"
    
    #Act
    $ResultString = Encode-URL -StringToEncode $UnicodeString
    
    #Assert
    Assert-That -ActualValue $ResultString -Constraint {$ActualValue -eq "This+is+just+a+%3cinvalid+url%3e%3b+but+I+like+it!"}
}
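
The "can be reversed" part is easy to check with a round trip. The sketch below swaps in [System.Uri]::EscapeDataString and [System.Uri]::UnescapeDataString instead of HttpUtility, simply because they need no extra assembly; they differ from UrlEncode in that a space becomes %20 rather than +. With HttpUtility, the matching reverse call would be [System.Web.HttpUtility]::UrlDecode.

```powershell
$Original = 'This is just a <invalid url>; but I like it!'

# Percent-encode the string, then decode it again
$Escaped = [System.Uri]::EscapeDataString($Original)
$Restored = [System.Uri]::UnescapeDataString($Escaped)

$Restored -ceq $Original                # True

# Unlike UrlEncode, a space is encoded as %20
[System.Uri]::EscapeDataString('a b')   # a%20b
```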

Download

The complete script and the PSUnit unit tests can be downloaded here: Remove-DirtyCharacters.zip

Outlook

As always, PowerShell is fun, fun, fun!

