Perl Weekly Week 5 (Part 1)

Tags:

So this weeks challenge was all about Anagrams and was very interesting. Because both of the solutions were Anagram based I ended up making a library used by both (and in the case of the second test all 3 of my attempted solutions) and I'm going to cover that first.

Anagram Library

So as this is a simple script I don't plan to release as a module or anything I'm not going full out with a META6.json file just a lib folder in the same folder as the scripts and inside that a file called Anagrams.pm6. (Note Perl6 modules can have iether a .pm6 or a .p6 extension. I think the latter is now recommended... but I'm sometimes slow to catch up.)

Our module starts by declaring it's namespace. As everything in the file is considered part of the module I can just use a unit declaration to state that (if you want to have multiple namespaces in a file you need to use blocks for scoping).

unit package Anagrams;

Of course a module with just that in would be a bit pointless so let's have a simple function is-anagram-of that given two strings will return true if the first is an anagram of the second. Now I know I'm going to be doing this a lot and that are some cases where we can quickly see it's not going to be true. For instance a word is not an anagram of itself so if our two strings are the same return false :

multi sub is-anagram-of( Str \target, Str \word where * eq target ) is export is pure { 
    False; 
}

I'll go into this in a bit more detail as there's a lot here and once you get your head around it it's kind of awesome :

  • multi sub is-anagram-of : This is a multi sub which means there will be more than one code path for it with different arguments.
  • Str \target : I'm defining my variables as a immutable Strings. Immutability is important if you want to easily leverage threading.
  • where * eq target : So this where clause is True if the two strings are the same
  • is export : I want this function to be exported to the callers scope.
  • is pure : For any given pair of strings this function always returns the same result. Can help the VM to speed up the code.
  • { False; } : No need for a return. The last thing evaluated is the return value for the sub.

Another case that's going to turn up a lot is when the two strings are not the same length. Then there's now way they can be anagrams :

multi sub is-anagram-of( Str \target, Str \word where *.codes != target.codes ) is export is pure { 
    False; 
}

Here I use the .codes method to count unicode code points which is generally What you want to do TM.

With those two False cases out of the way I can now test my strings. It's at this point I want to have a small digression about English and theft.

Cafes, Faces and Cáfes

The concept of an Anagram can only have been invented by and English speaker. The English language is much like the English people, a bit light fingered when it comes to other cultures stuff. Historically we were really good at wandering to other peoples countries and going "Oh that looks nice I. It's ours now". And we do it to this day with other languages, English speakers see nothing wrong with appropriating words from other languages. (Except Americanisms, those are just wrong.)

But... technically we don't really use accents on letters. We've just got the 26 and get all confused about funny squiggles on top of letters or dangling off the bottom. And don't get me started on ß...

Anyway when I was testing my anagram code I tried looking for the anagram of cafe and got face. And then I tried looking for the anagram of face and got... nothing. Because cafe is actually spelt cáfe. Please note I can't event type á easily. As a native English speaker though I don't care about the accent, just the letters and I'd quite like my anagram checker to be the same, also it would be good if it wasn't case sensitive.

This first part is simple enough lc will give us the string in lowercase. But stripping of accents and other marks that's going to be a pain isn't it?

Itroduction samemark. Which allows you to match the accent information between two strings. And if the second string is shorter than the first then the last character is used for all matching. Which means :

samemark( lc( "Cáfe" ), "a" ) eq "cafe"

That's great and I've got a little function to do that (I call it normal).

sub normal ( Str \word ) is pure {
    samemark( lc( word ), "a" )
}

This function is only used internally so it's not exported.

Back to Anagrams

Ok so to test if a string is an anagram of another string if I take the string, split it into letters and sort the list and then compare the sorted lists they should be the same. As I'm doing that twice I'll make a function to do it :

sub order-string ( Str \word ) is export is pure {
    normal( word ).comb.sort( { $^a cmp $^b } ).join;
}

Note that we're using our normalised string. I'm also joining it again but it would be trivial to return a list and compare that... I just didn't think of it.

With that in place we've got our final is-anagram-of sub :

multi sub is-anagram-of( Str \target, Str \word ) is export is pure {
    normal( target ) ne normal( word ) && order-string( target ) ~~ order-string( word );
}

So you'll see that the first thing we do is compare the normalised strings are not the same. Otherwise cafe would be an anagram of cáfe. This isn't caught by our initial where clause. I thought about doing it there but I figured it's more of an edge case and I didn't want to normalise every string.

So if our two normal strings are different and the ordered versions are the same then they are anagrams. Awesome!

Putting it together

Now I've got this library the script (which weirdly I see I gave a .p6 extension, I'm nothing if not inconsistent).

Firstly some boilerplate :

#!/usr/bin/env perl6

use v6;
use lib 'lib';
use Anagrams;

This loads our Anagrams module and makes sure if we accidentally run this with perl5 it explodes asap.

Then I like to use Perl6's auto generation for usage text and the following gives us a help / h flag.

my %*SUB-MAIN-OPTS = :named-anywhere;

sub USAGE { say $*USAGE }

#| Display Help file
multi sub MAIN ( Bool :h($help) where *.so ) { USAGE(); }

The my %*SUB-MAIN-OPTS = :named-anywhere; allows you to mix and matched named and positional arguments on the command line. $*USAGE is an autogenerated string based on the signatures and comments on the MAIN functions. (#| being a comment tagged to the function below).

For more information why not check out the talk I gave at London Perl Mongers last year?

With that out of the way here's the meat of the script :

subset FilePath of Str where *.IO.f;

#| Find the anagrams for a given word
multi sub MAIN (
    Str $word, #= Word to check for. Case insensitive
    FilePath :$dict = "/etc/dictionaries-common/words" #= Dictionary file to use. Defaults to "/etc/dictionaries-common/words"
) {
    $dict.IO.words.grep( { is-anagram-of( $word, $_ ) } )>>.say;
}

FilePath is a simple subset to ensure that a given string is actually a file, but the main function is no quite simple. $dict.IO.words opens our files and reads string from it splitting on white space (in this case newlines) this is a Sequence that can be iterated over. grep applies our is-anagram-of function with out given word and keeps those that are then we use the hyper operator and .say to print out any that work.

Generally nice and simple. How hard could part 2 be?

Oh ho ho ho ho. Tune in later for Part two of the challenge and be ready to mock my child like ignorance.