Hadoop Basics and learnings: Coding for Anagram in Map Reduce, Hive and Pig

Anagram:
Given a list of strings, the task is to find the two or more strings having same list of characters but in different order.
For example, ACT is an Anagram to CAT.
Note: If the list of words contains same word repeated again, then those are considered as Anagrams. For example, if ACT occurs twice in the list, then first ACT will be anagram to second ACT.

Map Reduce:
Assuming Text input file:
Map input           - key is LongWritable, value is Text.
Map output         - key is Text, value is Text.
Reduce output    -   key is Text, value is Text
Map task:
Map takes each word from list and then sorts the characters in ascending in that word. Sends this sorted word as key and original word as value.

public class Anagrammap extends Mapper<LongWritable, Text, Text, Text> {
protected void map(LongWritable key, Text value,Context context)
            throws IOException, InterruptedException {
        String outvaltemp = value.toString();
        char[] chars = outvaltemp.toCharArray();
        Arrays.sort(chars);
        String outval = new String(chars);
        context.write(new Text(outval),value);
    }
}

Reduce task:
Reduce task takes the sorted chars of word as key and all words with same sorted chars as list of values. Then it concatenates all the values with comma between and produces final output.
public class Anagramreduce extends Reducer<Text    , Text, Text, Text> {
protected void reduce(Text key, Iterable<Text> value,Context context)
            throws IOException, InterruptedException {
        int count = 0;
        String temp;
        String temp1 = "";
        while(value.iterator().hasNext())
        {
            temp = value.iterator().next().toString();
            count++;

            temp1 = temp1 + temp + ",";

        }

        if (count > 1)
        {
            context.write(key, new Text(temp1));
        }
}
}

You can modify this code as not to return key. Instead u can return a NullWritable object.

Anagram in Hive:
Input is a list of words one per each line. First create a table called angword with a single column - word(String).
Create Table if not exists angword (word String)
---
--
hive> select wordsjnd from
    > (select srt as srt1, WordsJoin(word) as wordsjnd from
    > (select wordsort(word) as srt, word from angword ) wrd
    > group by srt
    > having count(srt) > 1) wrd2;

In this query, the inside most part --
select wordsort(word) as srt, word from angword -- will first query the table angword and selects word which is sorted on its characters as srt and then the actual word itself. For this a hive UDF wordsort has been used.
Next the middle query - select srt as srt1, WordsJoin(word) as wordsjnd from XXXXXXX group by srt   having count(srt) > 1
This query groups the output of inner query on sorted word srt. And then an aggregation function WordsJoin is performed on the actual words that fall under a group. This function will join the words with a comma in between. Then only those groups with count > 1 is selected. Count =1 implies there are no anagrams for that word.
Outer query - select wordsjnd from XXXXXX - selects only the joined words leaving the sorted chars of those words.

UDF - wordsort:
This function takes a String as input, collects individual characters of that String, then put those characters in sorted order and then returns those sorted characters as a String.
public class WordSort extends UDF{
    private String unsrt_word = "";
public String evaluate(String word)
    {
        unsrt_word = word;
        char[] chars = unsrt_word.toCharArray();
        Arrays.sort(chars);
        String srt_word = new String(chars);
        return srt_word;
    }

}

UDAF - WordsJoin:
This takes a group of Strings as its input and joins them with a comma in between and return that entire output as a single string.
public final class WordsJoin extends UDAF {
public static class Word2join {
        private String word;
    }
public static class WordsJoinEvaluator implements UDAFEvaluator {

        Word2join w2j;

        public WordsJoinEvaluator() {
            super();
            w2j = new Word2join();
            init();
        }
        public void init() {
            w2j.word = "";
        }
        public boolean iterate(String w) {
            if (w != null) {
                if (w2j.word == "")
                    w2j.word = w;
                else
                    w2j.word = w2j.word + ", " + w;
            }
            return true;
        }
        public Word2join terminatePartial() {
            // This is SQL standard - average of zero items should be null.
            return w2j;
        }
        public boolean merge(Word2join o) {
            if (o != null) {
                if (w2j.word == "")
                    w2j.word = o.word;
                else
                    w2j.word = w2j.word + ", " + o.word;
            }
            return true;
        }
        public String terminate() {
            // This is SQL standard - average of zero items should be null.
            return w2j.word;
        }
    }

}

Anagram in Pig:
Below script can be used to generate Anagrams from an input that contains list of Strings - one per line.

inpfile = load load '/file/location/file' as (word:chararray);(word:chararray);
inpsrted = foreach inpfile generate WordSort(word) as srt, word;
grped = group inpsrted by srt;
aftrgrp = foreach grped generate group, COUNT(inpsrted) as count1, WordsJoin(inpsrted) as jndword;
angrms = filter aftrgrp by count1 > 1;
anagrams = foreach angrms generate jndword;

Explanation
inpfile = load '/file/location/file' as (word:chararray);
This loads the file to inpfile.

inpsrted = foreach inpfile generate WordSort(word) as srt, word;
For each of the input word, its sorted order is generated and stored as srt.
WordSort is Pig user defined function:
public class WordSort extends EvalFunc<String>{

    @Override
    public String exec(Tuple input) throws IOException {
       if (input == null || input.size() == 0) {
           return null;
       }
       String unsrt_word = (String)input.get(0);
       char[] chars = unsrt_word.toCharArray();
       Arrays.sort(chars);
       String srt_word = new String(chars);
       return srt_word;
    }

}

grped = group inpsrted by srt;
Grouped on sorted word.

aftrgrp = foreach grped generate group, COUNT(inpsrted) as count1,   WordsJoin(inpsrted) as jndword;

For each group count is calcuated - stored as count1. Along with this, all the words of that group are joined using a function WordsJoin. This is an aggregate fuction:
    public String exec(Tuple input) throws IOException {
        // TODO Auto-generated method stub
        DataBag bag = (DataBag)input.get(0);
        Iterator<Tuple> it = bag.iterator();
        Tuple tup;
        String wordcomb = "";

        while(it.hasNext())
        {
            if (wordcomb == "")
            {
                tup = (Tuple)it.next();
                wordcomb = (String)tup.get(1);
            }
            else
            {
                tup = (Tuple)it.next();
                wordcomb = wordcomb + ", " + (String)tup.get(1);
            }
        }

        return wordcomb;
    }

angrms = filter aftrgrp by count1 > 1;
The grouped output is then filtered for all count > 1. This means all words that doesnot have anagrams are not selected.

anagrams = foreach angrms generate jndword;
Output only joined word that contains anagrams which are separated by comma, ignoring the sorted keyword.

Hadoop Basics and learnings

Monday, February 17, 2014

Coding for Anagram in Map Reduce, Hive and Pig

No comments:

Post a Comment