Assignment 2

Searching tree-like data structures.

Worth 8%;    Due: 11.45pm Sunday 3 May 2009. 

Objectives

This assignment teaches you how to write algorithms to search pointer-based tree-like structures; you will learn how to search pointer-based implementations of so-called tries.

You should submit three files using the electronic submission system:

  1. The first should be called Trie.cpp, and should contain the code to implement the unimplemented methods of the class Trie as described below;
  2. The second should be called Task1.cpp;
  3. The third should be called Task2.cpp.

The format you should use for your files, and those of the outputs of your programs are specified below.

Your submitted files and the outputs of your programs must conform to the specified formats exactly, otherwise you risk receiving zero marks for the assignment. This rule is absolute - there are no exceptions. Why? Because if a student asks to resubmit his or her assignment “with the correct names this time”, there is no way to check that the files have not been altered after the cut-off date, and that is unfair to the others. Use the automarker to make sure that this does not happen to you.

Submit your work by the due date and time stated above.

Overview: Predictive text for composing SMS messages

Creating text messages on mobile phones is slightly awkward because each key on the keypad denotes one of at least three letters. For example the key labelled '2' denotes one of the alphabetic characters 'a', 'b' or 'c'. To overcome this ambiguity most mobile phones allow users to compose text messages in one of two ways:
  1. Using multiple key presses to distinguish the letters that share a key --- for example '2' stands for 'a', "22" stands for 'b' and "222" stands for 'c'; or

  2. Using predictive text which means that a word can be completed automatically given its prefix --- for example "quisling" can be predicted from the prefix "quis" since (in English) there is only one way to finish "quis" to make a word.

The reason that the state-of-the-art phones use predictive text --- Method (2) --- rather than multiple key presses --- Method (1) --- is because from the user's point of view fewer key presses overall are required: to input "quisling" using Method (1) requires 17 key presses, whereas (as you shall see) method (2) requires at most 8. Thus the user saves approximately 50% of effort.

Predictive text can work only if the phone stores a dictionary of words (in its local memory); the difficulty (for the programmer) is to find an efficient strategy for searching that dictionary --- phones are small and run only at a tiny fraction of the speed of the average laptop, so that any searching method implemented on the phone has to be fast. Phones that offer a predictive-texting feature can find words almost as fast as the user types them in. How do they do it?

In this assignment you will find out. You are given a partially-completed class whose purpose is to provide methods allowing a dictionary of prefixes to be searched efficiently enough for use on mobile phones. The class uses a so-called trie-structure which is to be explained in lectures; you can find additional information about tries in the Week 5 lecture summaries.

For Task 1 you are asked to implement a number of methods of the Trie class to traverse the look-up trie structure, and to use the methods to look up whole words in a dictionary efficiently; for Task 2 you must implement a further method that allows words to be looked up letter-by-letter as the user keys in his message a key at a time.

Details about the class, its methods and the trie-structure for this assignment are set out below.

Key-equivalent words

This table illustrates how the keys of a typical phone keypad correspond to the letters of the alphabet.
+-----------------------------------+
|  2   3   4   5   6   7    8   9   |
| abc def ghi jkl mno pqrs tuv wxyz |
+-----------------------------------+

As mentioned above, this means that any string of numeric characters has an associated set of, possibly many alphabetic strings, each of whose individual letters correspond to the numeric string according to the above table. For example the words "wolf", "yoke" and "woke" all correspond to the numeric sequence "9653", since '9' corresponds to 'w' and 'y'; '6' corresponds to 'o'; '5' corresponds to 'l' and 'k' and '3' corresponds to 'f' and 'e'.

We say that two strings are key-equivalent if their individual letters correspond.

Normally any sequence of numeric keys of length l corresponds to about 3^l key-equivalent alphabetic sequences. In general that's a big number: but not all of those sequences are real words. The major insight was to recognise that in natural languages, words are quite short, and very few strings of characters correspond to real words. This in turn means that any numeric sequence corresponds to only very few key-equivalent words, and (more importantly for predictive text) that only very few strings of characters correspond to prefixes of real words.

The dictionary of prefixes

Mobile phones using predictive text contain a stored dictionary so that key-equivalent words can be looked up as the user keys in his message. The dictionary stored in mobile phones is not like a normal dictionary. It differs in the following two important ways:

The reason it contains all prefixes (as well as real words) is that, as a user keys in his message, the phone should be flexible enough to allow the user to "browse through" all the possibilities for completing his word --- obviously it's not practical (or required) for the user to look at all possible complete words, but it is helpful if the user can look at all the possible prefixes of complete words stored in the dictionary.

For example, if a user keyed in "2", then he might be intending to begin a word that begins with 'a', or 'b' or 'c' --- he'd possibly like to view those three possibilities, but he wouldn't like to see all words beginning 'a', and all words beginning 'b' and all words beginning 'c', as there are too many. Thus to enable a programmer to implement a program for viewing pefixes, the dictionary also includes the single letter strings "a", "b", "c" as well as strings such as "aa" (prefix of "aardvark"), and "bo" (prefix of "bottle"), and so on. (Of course it also includes the strings "aardvark" and "bottle".)

The reason the dictionary is not sorted alphabetically is that, as a user keys in words, the dictionary must be searched to find the possible words that he might mean. To make searching efficient, it turns out to be much easier to have the words (and prefixes) in the dictionary sorted in an order that reflects more precisely the key-equivalent words: namely to have all the key-equivalent words and prefixes occuring at indices close together. Thus "woke", "wolf" and "yoke" are stored as consecutive entries, yet "wombat" appears much later in the dictionary (in particular after "yoke") since it corresponds to the numeric key sequence "966228".

The important fact to remember for this assignment is that the words in the dictionary are ordered according to their corresponding numeric key sequence and that key-equivalent words appear consecutively in blocks in the dictionary. This means that if s is a numeric key sequence, and the first key-equivalent word to which it corresponds appears at index j in the dictionary, and there are n words in total which are key-equivalent to s, then all those words are found exactly at dictionary indices j, j+1, j+2.. j+n-1.

In this assignment the dictionary of prefixes is stored in a vector called "Dictionary" as part of the Trie class (see below).

Using a trie for efficient dictionary searching

Tries are tree-like data structures having the property that the data stored in the nodes is associated exactly with the path from the root to that node.

In the application to mobile phones we need a way to use the information of the numeric key sequences to find the key-equivalent words (if there are any) in the Dictionary. In this assignment the paths from the root of the trie correspond to sequences of numeric keys and the data at each node gives information about how to look up the corresponding key-equivalent strings in the Dictionary.

Each node in the trie has eight children --- one child for each of the digits between '2' and '9'. These will be used for searching the trie. The data found in the item field of the struct TrieNode (described below) contains two pieces of information:

For example the first key-equivalent prefix corresponding to numeric key sequence "463784" is at Dictionary index 85481; moreover there are four key-equivalent prefixes in total. (They are "inepti", "inequi", "inerti", "inesti" --- which are prefixes respectively of the complete words "ineptiTUDE", "inequiVALENT", "inertiA" and "inestiMABLE" for example, found later in the Dictionary.) The way the Dictionary is arranged is that "inepti" appears at index 85481, "inequi" at index 85482, "inerti" at index 85483 and "inesti" at index 85484. In the trie on the other hand, the node which occurs at the end of the path corresponding to "463784" contains all the information required to lookup those four prefixes in the Dictionary: inside the field item, the field offset is set to 85481 (the index of the first key-equivalent prefix), and the field count is set to 4 (the total number of key-equivalent prefixes).

Now instead of searching laboriously through the Dictionary looking for key-equivalent prefixes corresponding to numeric sequences, the trie is searched by following the links to the children corresponding to the numeric characters --- and at the end of the path the information for retrieving the actual key-equivalent sequences from the Dictionary is immediately available.

Note that a direct iterative search through the Dictionary would need to iterate through each word --- in the above example since the first key-equivalent prefix occurs at Dictionary index 85481, such a search would need 85481 complete iterations; other strategies such as binary search are not appropriate since the words in the dictionary are not alphabetically ordered, and in any case the search would be proportional to the size of the Dictionary rather than to the length of the input sequence. The technique of searching the trie to find indices corresponding to key-equivalent prefixes uses an algorithm that only needs to iterate n times where n is for the length of the numeric key sequence --- which in the above example is 6. That is why the Dictionary is not searched directly, and it is how mobile phones are able locate words very quickly even though the phones' computers run relatively slowly.

The class Trie

You are provided with a partially-implemented class Trie. Its header file is Trie.h . In that file you will find specifications of methods for building and searching a trie with an associated Dictionary of prefixes. Many of the methods have already been implemented for you --- your main tasks (see below) are to write implementations of the various searching methods and retrieving methods.

A node in the trie is constructed from a struct containing two fields:

struct TrieNode {
      IndexCount item;
      TrieNode* links[8];
};
  • IndexCount is a struct and records the indexing information for looking up prefixes in Dictionary.
    struct IndexCount {
          int offSet;
          int count;
    };
    

    To access the fields in a variable v of type IndexCount use v.offset and v.count.

  • links is an array of items of type TrieNode*, that is an array of pointers to TrieNodes --- each item corresponds to a pointer to one of the eight children of a node. Here each child corresponds to the eight possible ways that a numeric key sequence can be extended. Warning: Recall that the convention for indexing arrays goes from 0 through to 7, whereas the keys in phones corresponding to letters use the digits '2' through to '9', Given a pointer p to a TrieNode the child corresponding to digit '2' is found at (p -> links)[0]; the child corresponding to digit '3' is found at (p -> links)[1], and so on.
  • An object of a Trie class consists of a Dictionary (which is a vector of strings) together with a "lookup trie" whose root node is a pointer called Root. Once the trie has been initialised, do not reassign Root otherwise you risk losing all the data. Both Dictionary and Root are initialised for you.

    To search through the trie you are also provided with the following private variables:

    The partially-implemented class is: Trie.cpp . To use this class you will also need to supply it with a list of words so that it can initialise the Dictionary and build the trie Root. A test list of words is at testPrefixes . Save it, and make sure you call it "testPrefixes". This is a small dictionary of words that you can use for testing your programs. A template for using the class is given at template.cpp

    The methods for building the trie have been implemented for you --- download all the files, make sure you have Trie.cpp, Trie.h, testPrefixes and template.cpp all in the same directory. To compile use

     g++ Trie.cpp template.cpp 

    All this program does is to declare an instance of the class Trie called SearchTrie and to initialise it. (On initialisation words are read from the file "testPrefixes" and a search trie and a vector of words are constructed; SearchTrie.Root then points to the root of the search trie and and SearchTrie.Dictionary contains the list of words read in.) It is now your task to implement methods for searching the trie structure.

    Remember that to use any of the methods of SearchTrie, you must use the syntax SearchTrie.methodName(..). Remember also that if a method is protected then it can only be called within Trie.cpp for implementing methods, but it cannot be used in either of your programs Task1.cpp or Task2.cpp .

    The Tasks

  • Task 1 (20 marks) Implement the following methods in Trie.cpp.
    void FirstEquiv(string key); 
     // PRE:  key is a sequence of digits between '2' and '9'. 
     // POST: Sets KeyPointer to point to the trie node such that
     //       the path from Root to KeyPointer corresponds to key.
     //       Sets StartKey to (KeyPointer-> item).offSet.
     //       Sets equivKeys to (KeyPointer->item).count.
     //       Sets CurrentIndex to 0.
     //
     //       Sets KeyPointer to Root if there is no such node.
     
     void NextEquiv();
     // POST: If Dictionary[StartKey+ CurrentIndex+1] is key-equivalent
     //       to Dictionary[StartKey], increments CurrentIndex.
    
     bool hasMoreEquiv();
     // POST: Returns true iff Dictionary[StartKey + CurrentIndex] is
     // key-equivalent to Dictionary[StartKey].
    
     string ThisOne(); 
     // POST: Returns Dictionary[StartKey+CurrentIndex], 
     //       or "?" if CurrentIndex >= equivKeys
    
    Using the class Trie, and the above methods, write a user-program that takes a complete numeric sequence from standard input and outputs all the key-equivalent words in Dictionary. Save your program in a file called Task1.cpp. Use the file template.cpp as a guide for the format of your file Task1.cpp

    To check your results, search through the file testPrefixes (used to build Dictionary) looking for any key-equivalent word corresponding to your input string. You will find all the key-equivalent words together in a block. Your program should print out all those words.

    For example, using the file testPrefixes, if the input is

    23

    then the output is

     
    ad
    af
    be
    ce
    

    If the input is

    2243

    then the output is

     
    ache
    cage
    

    If the input is

    5689

    then the output is

    ?
    

    That's because there are no key-equivalent words corresponding to this input sequence.

  • Task 2 (15 marks)

    In real phones users key in messages one character at a time, and the phone displays the possible prefixes of complete words as you go. In this task you are asked to simulate that situation. First implement the following member function in the file Trie.cpp.

    void ExtendKey(char ch, bool& Success); 
     // PRE: ch must be a numeric character between '2' and '9';
     // POST: Moves the KeyPointer to its child corresponding to ch...
     //       Sets StartKey to (KeyPointer-> item).offSet.
     //       Sets equivKeys to (KeyPointer->item).count.
     //       Sets CurrentIndex to 0.
     //       Sets Success to true.
     //
     //       Sets KeyPointer to Root if there is no such extension, and
     //       sets Success to false
    
    Using that function write a user-program that takes separate numeric characters from standard input until EOF and prints out all the key-equivalent words corresponding to the total string input so far. The output should be printed after each character followed by a carriage return.

    For example if the input is '3' then "carriage return", the output is

     
    d
    e
    f
    

    i.e. all the prefixes key-equivalent to "3"

    If the input continues '4', then "carriage return", the output is

     
    di
    eg
    fi
    

    i.e. all the prefixes key-equivalent to "34".

    If the input continues '7', then "carriage return", the output it

    dir
    dis
    fir
    

    i.e. all the prefixes key-equivalent to "347". If the user inputs a sequence that does not correspond to the word the output should be "?"

    The format of your files

    You may download a template at template.cpp; it is an indication of the format you should use for your submission. It shows you the format you should use for imlementing Task1.cpp and Task2.cpp.

    Assessment

    This assignment will be marked electronically with marks distributed as above. You will gain credit for implementing each method in the Trie class as well as for the two programs that use that class.

    The electronic part

    An important aspect of this assignment is to make sure that your programs run fast enough to please the electronic marker. Your progam will be tested with a large dictionary --- much larger than the test dictionary you have been given above. The large dictionary used by the electronic marker can be obtained at here . If you wish to test your own programs using this dictionary save it in a file called "testPrefixes" --- this is important because the initialisation of the Trie assumes that name when it reads in the data. When you run this program you will have to wait a second or two for the SearchTrie to be created, since it will be very large.

    You may have your program "trialed" by the electronic marker before the due date. Trials will begin from April 27 onwards. Any files submitted after that date and before the deadline will be presented to the electronic marker every day until the deadline.

    The results of the trial runs will be emailed to you at your student account. These results will be completely disregarded --- the feedback is only to help you correct errors that may be lurking in your programs, and to make sure that your code is efficient enough.

    You will receive the actual results for the electronic marked part of Assignment 2 after the final deadline for submission.

    Hints for obtaining full marks in this assignment.

    There are a number of common mistakes that students make which prevent them receiving full marks, even if they think they have thoroughly tested their programs!

    Remember that there will be many opportunities to check that your programs survive the electronic marker before the final submission date.