Perl Weekly Week 5 (Part 2)

Tags:

Technical note

My blogging platform was really not happy with some of this post. I tihnk I'm going to have to go back to my earlier plan of building my own in Perl6. So I've cut up my first couple of failed attempts, one was kind of neat as it involved threading but I'm sure I'll revist it in the future. Meanwhile he's the final result.

Back to the drawing board

Ok. Brute force and ignorance, even with threads was not enough. Time to have a think about a better solution. Also... time to re-read the spec. I was trying to do too much. I already have a script that can find Anagrams for a given string, so why am I not doing what the spec says and follow the Unix paradigm? Write a script that does one thing well.

So I had a think. I needed to avoid trying to cross reference things, at this point I had an idea that was based on trees. The core idea goes like this, normalise and sort the string, then store it in a hash and keep count of numbers of times that node has been reached. Then I can keep running count of the biggest number and it's strings. Here's an example. Lets say we've just started and out first word is ant this goes into out tree which becomes :

{ 
 'a' => {
         'n' => {
                 't' => {
                         'count' => 1
                        }
                }
        }
}

So our counter is at 1 and the list of strings is ['ant']. Then we get act :

{ 
 'a' => {
         'c' => {
                 't' => {
                         'count' => 1
                        }
                }
         'n' => {
                 't' => {
                         'count' => 1
                        }
                }
        }
}

The counter stays at 1 and the list is now ['ant', 'act']. Now we get tan which becomes act and the tree is now

{ 
 'a' => {
         'c' => {
                 't' => {
                         'count' => 1
                        }
                }
         'n' => {
                 't' => {
                         'count' => 2
                        }
                }
        }
}

The counter is upped to 2 and the list is reset to ['ant'] and this continues. It's a nice simple algorithm and the code for it turns out like so :

#!/usr/bin/env perl6

use v6;
use lib 'lib';
use Anagrams;

my %*SUB-MAIN-OPTS = :named-anywhere;

sub USAGE { say $*USAGE }

subset FilePath of Str where *.IO.f;

#| Display Help file
multi sub MAIN ( Bool :h($help) where *.so ) { USAGE(); }

So here's our boilerplate again

#| Find the strings(s) which has the most anagrams in the given dictionary
#| Prints out 1 string per line to STDOUT
multi sub MAIN (
    FilePath :$dict = "/etc/dictionaries-common/words" #= Dictionary file to use. Defaults to "/etc/dictionaries-common/words"
) {
    my %store;
    my $max-length = 0;
    my @values;
    for $dict.IO.words -> $word {
        my $string = order-string( $word );
        my @keys = $string.comb;
        my $current = %store;
        for @keys -> $key {
            $current{$key} //= {};
            $current = $current{$key};
        }
        $current<count>++;

This part handles the spidering through the tree. At the end $current is a reference to the end node for the string. We increment the count value (creating it if it doesn't exists).

        given $current<count> {
            when * > $max-length {
                $max-length = $current<count>;
                @values = [$string];
            }
            when * == $max-length {
                @values.push($string);
            }
        }

Given is the Perl 6 version of a switch if our count is over the max length we up it and create a new output array. If it's the same as max length we add our new string to the output array.

    }
    @values>>.say;
}

Then once the loop is down we print out the results.

While the previous script took 3 hours and got no where this took 3 seconds. So I claim that as a win. Roll on next week.