Alien Language Parser

Alien Language Parser

Given a dictionary of strings [ strings are in sorted order] you have to find the precedence of characters according to the dictionary..
eat
bxy
e is ranked above b according to the dictionary.

Solution:


Very well explained by ChrisRussell41 on careercup.
It is copied from there.

Problem restatement:
There exists an alphabet comprising some number of characters.
The order of the alphabet, i.e. the full-ordering, is not known.
You are given a list of “words” comprised of characters from the alphabet.
The list of words is given to be sorted in lexicographical order.
Write a program to deduce the order of the alphabet.For the sake of example, let’s assume that we’re using an English alphabet (i.e. we know the full order). Now assume we have a set of words sorted in ascending lexicographical order:aardvark
ant
bee
cat
cow
dog
horse
llama
sheep
zebraGiven that we know the order of the English alphabet, it’s easy to see that the list of animals comprising our input data is correctly sorted.Now forget that you know anything at all about the English alphabet. Erase from your mind the fact that you know which characters comprise the alphabet (the problem statement doesn’t bound the set of characters or make any guarantee that the set of words use all characters in the alphabet). Also, erase from your mind the fact that you know the order of the English alphabet.

Let’s take the first word “aardvark”. What does this tell us? It tells us that the characters “a”, “r”, “d”, “v”, and “k” are present in the alphabet. Does it provide any information that can be used to establish the order of these characters? NO! Remember that it’s the order of the words in the list and thus comparisons between adjacent words in the list that provides clues about the order of the alphabet.

Okay, now let’s look at the first and second words:

aardvark
ant

What does this tell us? We see two new characters “n” and “t” by inspection of “ant”. What’s more we notice that the characters in the first column are both “a”. So clearly the lexicographical ordering of these two words wasn’t decided on the basis of the first column. Looking at the second column we note that the characters are different and correctly conclude that in our alphabet “a” proceeds “n”.

How about the third column? Sorry, no more clues here. We know that the order of “aardvark” and “ant” was decided on the basis of the second column character and cannot deduce anything further by comparing the third column.

So on we go through the list of animal words… Below I’ve reproduced the lexicographically sorted list of words in the left column, and indicate the “clues” in the right column. For simplicity I’m not going to list new characters discovered in the alphabet but rather focus only on clues about the order of the characters in the alphabet.

Note that because you actually do know the order of the English alphabet, you can easily vet this information without doing insane mental gymnastics. Also note that by convention an order clue is indicated by an ordered pair representing an edge in a directed graph. For example (a,b) indicates a directed edge from tail vertex “a” to head vertex “b” indicating that “a” proceeds “b” in the alphabet.

aardvark no order clues
ant (a,n) based on column 2
bee (a,b) based on column 1
cat (b,c) based on column 1
cow (a,o) based on column 2
dog (c,d) based on column 1
horse (d,h) based on column 1
llama (h,l) based on column 1
sheep (l,s) based on column 1
zebra (s,z) based on column 1
n
/
a -> b -> c -> d -> h -> l -> s -> z
\
o

Now remember that you know the order of the English alphabet and vet this graph. Makes sense right? For example we know that “n” comes after “a” as does “b” and “o”. And clearly a -> b -> c -> d -> h -> l -> s -> z is correctly ordered.

Note that given our sorted list of animals we do not know the order of “b”, “n” and “o”. All we know is that they follow “a”.

Also note that there are bunch of characters used in our list of animal words for which no clues were found. These are not show in the ASCII graph above. But these are really in the graph too (all with zero in-degree).

I seriously question the utility of doing a topological sort… It would produce a somewhat interesting result BUT is essentially meaningless except in the case that every vertex has out-degree one.

You could imagine a system that attempts to deduce ordering on an alphabet that exposes an interface DoesXProceedY(x, y) which returns YES, NO or KNOWN. Such a system could not (in general) use a topological sort of the graph to deduce the correct answer. Instead, it would need the graph topology.

Source Code:
Github: AlienLanguageParser.java


Output:

[a, o, b, c, d, h, l, s, z, n]
[a, b, c, e, d]
Author: Hrishikesh Mishra