Worth 8%; Due: 11.45pm Sunday 3 May 2009.
This assignment teaches you how to write algorithms to search pointer-based tree-like structures; you will learn how to search pointer-based implementations of so-called tries.
You should submit three files using the electronic submission system:
The format you should use for your files, and those of the outputs of your programs are specified below.
Submit your work by the due date and time stated above.
The reason that the state-of-the-art phones use predictive text --- Method (2) --- rather than multiple key presses --- Method (1) --- is because from the user's point of view fewer key presses overall are required: to input "quisling" using Method (1) requires 17 key presses, whereas (as you shall see) method (2) requires at most 8. Thus the user saves approximately 50% of effort.
Predictive text can work only if the phone stores a dictionary of words (in its local memory); the difficulty (for the programmer) is to find an efficient strategy for searching that dictionary --- phones are small and run only at a tiny fraction of the speed of the average laptop, so that any searching method implemented on the phone has to be fast. Phones that offer a predictive-texting feature can find words almost as fast as the user types them in. How do they do it?
In this assignment you will find out. You are given a partially-completed class whose purpose is to provide methods allowing a dictionary of prefixes to be searched efficiently enough for use on mobile phones. The class uses a so-called trie-structure which is to be explained in lectures; you can find additional information about tries in the Week 5 lecture summaries.
For Task 1 you are asked to implement a number of methods of the Trie class to traverse the look-up trie structure, and to use the methods to look up whole words in a dictionary efficiently; for Task 2 you must implement a further method that allows words to be looked up letter-by-letter as the user keys in his message a key at a time.
Details about the class, its methods and the trie-structure for this assignment are set out below.
+-----------------------------------+ | 2 3 4 5 6 7 8 9 | | abc def ghi jkl mno pqrs tuv wxyz | +-----------------------------------+
As mentioned above, this means that any string of numeric characters has an associated set of, possibly many alphabetic strings, each of whose individual letters correspond to the numeric string according to the above table. For example the words "wolf", "yoke" and "woke" all correspond to the numeric sequence "9653", since '9' corresponds to 'w' and 'y'; '6' corresponds to 'o'; '5' corresponds to 'l' and 'k' and '3' corresponds to 'f' and 'e'.
We say that two strings are key-equivalent if their individual letters correspond.
Normally any sequence of numeric keys of length l corresponds to about 3^l key-equivalent alphabetic sequences. In general that's a big number: but not all of those sequences are real words. The major insight was to recognise that in natural languages, words are quite short, and very few strings of characters correspond to real words. This in turn means that any numeric sequence corresponds to only very few key-equivalent words, and (more importantly for predictive text) that only very few strings of characters correspond to prefixes of real words.
The reason it contains all prefixes (as well as real words) is that, as a user keys in his message, the phone should be flexible enough to allow the user to "browse through" all the possibilities for completing his word --- obviously it's not practical (or required) for the user to look at all possible complete words, but it is helpful if the user can look at all the possible prefixes of complete words stored in the dictionary.
For example, if a user keyed in "2", then he might be intending to begin a word that begins with 'a', or 'b' or 'c' --- he'd possibly like to view those three possibilities, but he wouldn't like to see all words beginning 'a', and all words beginning 'b' and all words beginning 'c', as there are too many. Thus to enable a programmer to implement a program for viewing pefixes, the dictionary also includes the single letter strings "a", "b", "c" as well as strings such as "aa" (prefix of "aardvark"), and "bo" (prefix of "bottle"), and so on. (Of course it also includes the strings "aardvark" and "bottle".)
The reason the dictionary is not sorted alphabetically is that, as a user keys in words, the dictionary must be searched to find the possible words that he might mean. To make searching efficient, it turns out to be much easier to have the words (and prefixes) in the dictionary sorted in an order that reflects more precisely the key-equivalent words: namely to have all the key-equivalent words and prefixes occuring at indices close together. Thus "woke", "wolf" and "yoke" are stored as consecutive entries, yet "wombat" appears much later in the dictionary (in particular after "yoke") since it corresponds to the numeric key sequence "966228".
The important fact to remember for this assignment is that the words in the dictionary are ordered according to their corresponding numeric key sequence and that key-equivalent words appear consecutively in blocks in the dictionary. This means that if s is a numeric key sequence, and the first key-equivalent word to which it corresponds appears at index j in the dictionary, and there are n words in total which are key-equivalent to s, then all those words are found exactly at dictionary indices j, j+1, j+2.. j+n-1.
In this assignment the dictionary of prefixes is stored in a vector called "Dictionary" as part of the Trie class (see below).
In the application to mobile phones we need a way to use the information of the numeric key sequences to find the key-equivalent words (if there are any) in the Dictionary. In this assignment the paths from the root of the trie correspond to sequences of numeric keys and the data at each node gives information about how to look up the corresponding key-equivalent strings in the Dictionary.
Each node in the trie has eight children --- one child for each of the digits between '2' and '9'. These will be used for searching the trie. The data found in the item field of the struct TrieNode (described below) contains two pieces of information:
Now instead of searching laboriously through the Dictionary looking for key-equivalent prefixes corresponding to numeric sequences, the trie is searched by following the links to the children corresponding to the numeric characters --- and at the end of the path the information for retrieving the actual key-equivalent sequences from the Dictionary is immediately available.
Note that a direct iterative search through the Dictionary would need to iterate through each word --- in the above example since the first key-equivalent prefix occurs at Dictionary index 85481, such a search would need 85481 complete iterations; other strategies such as binary search are not appropriate since the words in the dictionary are not alphabetically ordered, and in any case the search would be proportional to the size of the Dictionary rather than to the length of the input sequence. The technique of searching the trie to find indices corresponding to key-equivalent prefixes uses an algorithm that only needs to iterate n times where n is for the length of the numeric key sequence --- which in the above example is 6. That is why the Dictionary is not searched directly, and it is how mobile phones are able locate words very quickly even though the phones' computers run relatively slowly.
A node in the trie is constructed from a struct containing two fields:
struct TrieNode {
IndexCount item;
TrieNode* links[8];
};
struct IndexCount {
int offSet;
int count;
};
To access the fields in a variable v of type IndexCount use v.offset and v.count.
To search through the trie you are also provided with the following private variables:
The partially-implemented class is: Trie.cpp . To use this class you will also need to supply it with a list of words so that it can initialise the Dictionary and build the trie Root. A test list of words is at testPrefixes . Save it, and make sure you call it "testPrefixes". This is a small dictionary of words that you can use for testing your programs. A template for using the class is given at template.cpp
The methods for building the trie have been implemented for you --- download all the files, make sure you have Trie.cpp, Trie.h, testPrefixes and template.cpp all in the same directory. To compile use
g++ Trie.cpp template.cpp
All this program does is to declare an instance of the class Trie called SearchTrie and to initialise it. (On initialisation words are read from the file "testPrefixes" and a search trie and a vector of words are constructed; SearchTrie.Root then points to the root of the search trie and and SearchTrie.Dictionary contains the list of words read in.) It is now your task to implement methods for searching the trie structure.
Remember that to use any of the methods of SearchTrie, you must use the syntax SearchTrie.methodName(..). Remember also that if a method is protected then it can only be called within Trie.cpp for implementing methods, but it cannot be used in either of your programs Task1.cpp or Task2.cpp .
void FirstEquiv(string key); // PRE: key is a sequence of digits between '2' and '9'. // POST: Sets KeyPointer to point to the trie node such that // the path from Root to KeyPointer corresponds to key. // Sets StartKey to (KeyPointer-> item).offSet. // Sets equivKeys to (KeyPointer->item).count. // Sets CurrentIndex to 0. // // Sets KeyPointer to Root if there is no such node. void NextEquiv(); // POST: If Dictionary[StartKey+ CurrentIndex+1] is key-equivalent // to Dictionary[StartKey], increments CurrentIndex. bool hasMoreEquiv(); // POST: Returns true iff Dictionary[StartKey + CurrentIndex] is // key-equivalent to Dictionary[StartKey]. string ThisOne(); // POST: Returns Dictionary[StartKey+CurrentIndex], // or "?" if CurrentIndex >= equivKeysUsing the class Trie, and the above methods, write a user-program that takes a complete numeric sequence from standard input and outputs all the key-equivalent words in Dictionary. Save your program in a file called Task1.cpp. Use the file template.cpp as a guide for the format of your file Task1.cpp
To check your results, search through the file testPrefixes (used to build Dictionary) looking for any key-equivalent word corresponding to your input string. You will find all the key-equivalent words together in a block. Your program should print out all those words.
For example, using the file testPrefixes, if the input is
23
then the output is
ad af be ce
If the input is
2243
then the output is
ache cage
If the input is
5689
then the output is
?
That's because there are no key-equivalent words corresponding to this input sequence.
In real phones users key in messages one character at a time, and the phone displays the possible prefixes of complete words as you go. In this task you are asked to simulate that situation. First implement the following member function in the file Trie.cpp.
void ExtendKey(char ch, bool& Success); // PRE: ch must be a numeric character between '2' and '9'; // POST: Moves the KeyPointer to its child corresponding to ch... // Sets StartKey to (KeyPointer-> item).offSet. // Sets equivKeys to (KeyPointer->item).count. // Sets CurrentIndex to 0. // Sets Success to true. // // Sets KeyPointer to Root if there is no such extension, and // sets Success to falseUsing that function write a user-program that takes separate numeric characters from standard input until EOF and prints out all the key-equivalent words corresponding to the total string input so far. The output should be printed after each character followed by a carriage return.
For example if the input is '3' then "carriage return", the output is
d e f
i.e. all the prefixes key-equivalent to "3"
If the input continues '4', then "carriage return", the output is
di eg fi
i.e. all the prefixes key-equivalent to "34".
If the input continues '7', then "carriage return", the output it
dir dis fir
i.e. all the prefixes key-equivalent to "347". If the user inputs a sequence that does not correspond to the word the output should be "?"
You may download a template at template.cpp; it is an indication of the format you should use for your submission. It shows you the format you should use for imlementing Task1.cpp and Task2.cpp.
An important aspect of this assignment is to make sure that your programs run fast enough to please the electronic marker. Your progam will be tested with a large dictionary --- much larger than the test dictionary you have been given above. The large dictionary used by the electronic marker can be obtained at here . If you wish to test your own programs using this dictionary save it in a file called "testPrefixes" --- this is important because the initialisation of the Trie assumes that name when it reads in the data. When you run this program you will have to wait a second or two for the SearchTrie to be created, since it will be very large.
You may have your program "trialed" by the electronic marker before the due date. Trials will begin from April 27 onwards. Any files submitted after that date and before the deadline will be presented to the electronic marker every day until the deadline.
The results of the trial runs will be emailed to you at your student account. These results will be completely disregarded --- the feedback is only to help you correct errors that may be lurking in your programs, and to make sure that your code is efficient enough.
You will receive the actual results for the electronic marked part of Assignment 2 after the final deadline for submission.
There are a number of common mistakes that students make which prevent them receiving full marks, even if they think they have thoroughly tested their programs!
Remember that there will be many opportunities to check that your programs survive the electronic marker before the final submission date.