Perl arrays for fun and profit

In my day job, I deal with a lot of perl. After really taking the time to learn it, and working with it in a production environment, I really have a new respect for perl. But that’s a topic for another day. The other day, someone presented me with a perl coding challenge, and it took me a while to come up with the answer. Granted, I’m not really a perl guru, so I’m not too concerned, but I wanted to talk about and document how I solved the problem. The challenge can be simplified a little bit, and restated almost like a homework problem:

Given 2 arrays, find the union, intersection and difference.

See, here’s the thing. Most every other high level language makes this pretty trivial. PHP has the built in functions array_intersec and array_diff, in Ruby, you can just do something like ary1 & ary2 or ary1 - ary2, etc… Perl’s arrays suck eggs. I’m serious, arrays in perl are virtually useless… Well, that’s too strong. Array’s in perl are emenintly useful, they’re just kind of dumb. In order to do most serious processing in perl, you wind up turning things into hashes. And that’s basically what you have to do to get the array intersection, difference and union. Here’s a little script that does the job.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
#! /usr/bin/perl
use warnings;
use strict; 
use Data::Dumper;
 
my @list1 = (1, 3, 5, 6, 7, 8);
my @list2 = (3, 5, 6, 9, 10);
my %union = ();
my %isec = ();
my %dif = ();
 
print STDOUT "List 1 is: " , @list1, "\n";
print STDOUT "List 2 is: " , @list2 , "\n";
 
foreach my $e (@list1,@list2){ $union{$e}++ && $isec{$e}++ }
foreach my $e (@list1, @list2){ 
    if ( !$isec{$e} ) { $dif{$e} = 1 }
}
print Dumper(\%union) . "\n";
print Dumper(\%isec) . "\n";
print Dumper(\%dif) . "\n";
my @union = keys %union;
my @isec = keys %isec;
my @diff = keys %dif;
 
print STDOUT "The union of the two lists is: ", @union , "\n" ;
print STDOUT "The intersection of the two lists is: ", @isec , "\n";
print STDOUT "The difference of the two lists is: ", @diff , "\n";

Now to explain what’s happening. The first four lines are standard perl boilerplate… use the system perl, turn on warnings, strict, and import the Data::Dumper module. No big surprises there.

The next 5 lines is just some variable initialization. We create two arrays, and 3 hashes union, isec & dif. Then we print them to STDOUT, just in case we’ve forgotten what the script does. Now, lines 15 through 24 are where the magic happens. Let’s walk through those.

foreach my $e (@list1,@list2){ $union{$e}++ && $isec{$e}++ }

This line is where perl really starts to shine. This line is just a basic foreach loop, since perl treats arrays (or lists) as basically comman separated strings, you can give the loop two separate arrays, and perl will just treat the second array as a continuation of the first one. In essence, perl concatenates the two arrays, so the effect is that perl loops through both. So, as we’re looping, we add every value to the union hash with this $union{$e}. Now, once we get on the second array, we want to start looking to see if the value we’re seeing was present in the first array. If so, we want to add it to the intersection hash. This is one of those slightly weird perl-isms, so let me try to explain what’s happening here. As we loop through the first array, for every array member, we attempt to increment the value of $union{$e}. If we have’t seen the member before, then the $union{$e}++ increments the value to 1. BUT, since the value was undefined before the increment, perl evaluates the value of that expression to be 0, which causes the second part of the expression to be skipped. This is what perl refers to as a short-circuit. If the value on the left of an && expression is false, then the entire expression MUST be false, so perl doesn’t even bother evaluating the value on the right. So the result of looping through the entire first array, is just that the union hash is populated with the members of the first array as keys, and every key should have a value of 1.

Now, perl starts looping through the second array. For each member of the array, we again try to increment it’s value in the union hash. If it’s a member that we haven’t seen before, then it just gets “incremented” from undefined to 1. However, if we have seen the member before, then the increment operation will increment from 1 to 2, and the value that perl evaluates will be 1, which is true. So this time, perl will evaluate the expression on the right hand side of the &&, which is an expression that sets a value for the array member in the “intersection” hash. Of course, it still returns 0. but we don’t really care about that now. What we do care about is that we now have a “union” hash that has a single key for every distinct member of the two arrays. We also have a hash that has a single key for every member that is a member of both arrays… the intersection.

If we go ahead and use Data::Dumper to dump the two hashes, we see this:

$union = {
          '6' => 2,
          '3' => 2,
          '7' => 1,
          '9' => 1,
          '8' => 1,
          '1' => 1,
          '10' => 1,
          '5' => 2
        };

$isec = {
          '6' => 1,
          '3' => 1,
          '5' => 1
        };

What we see there is that every distinct member of the arrays has an entry in the union hash. For members that are present in both arrays, we see that they have a value of 2. This is a result of their being incremented by both sides of the expression $union{$e}++ && $isec{$e}++. Members that were only in one of the two arrays have a value of 1. The intersection array, is the result of the right hand side of that expression, $isec{$e}++, which was only evaluated if the union array already had a value for the member. Here we can see that for each union hash key with a value of 2, there is a corresponding key in the isec hash with a value of 1. That’s our intersection.

foreach my $e (@list1, @list2){ 
    if ( !$isec{$e} ) { $dif{$e} = 1 }
}

So, finding the union and intersection seemed pretty roundabout. Finding the difference doesn’t take any fancy coding at all. All it takes is the understanding that by definition, the difference between two arrays is the set of values that appear in one array or the other, but not both. This is the direct opposite of the intersection, which is the set of values that appears in both arrays. So, taking it one step further, we can define the difference as the set of values in the union which do not appear in the intersection! So, really, we just have to do one more loop, check whether each value is present in the intersection hash, and if not, add it to the difference hash. That’s what the code above does. Another way to do it would be to just define the difference as the set of values of the union hash whose value is 1.

print Dumper(\%union) . "\n";
print Dumper(\%isec) . "\n";
print Dumper(\%dif) . "\n";
my @union = keys %union;
my @isec = keys %isec;
my @diff = keys %dif;

The last few lines just create new arrays from the keys of the various hashes. That’s just a convenience for printing them out. You could certainly do anything you like at this point. What we’ve seen is that by working with hashes, essentially as look up tables, and taking advantage of perl’s short circuit AND operator, we can sort and organize data structures in a pretty efficient way. Perl can be ugly sometimes, but she walks the walk, and she can certainly get things done.