Introduction
This report is a small portion of research being conducted with the aim of restoring both communicative and environmental control abilities to such patients. The idea is that if these patients can communicate and interact through an intermediary computing device, the harshness of their hardship can be assuaged in some way. Since these patients have no way of physically controlling a computer, the core component of the research is a device, made by Dr. Philip Kennedy of Emory University and Neural Signals, Inc., which allows patients to mentally control a computer. Indeed, patients are able to communicate with a computer by "thinking" alone. The idea is that this special device (actually an electrode) is implanted in the patient's brain. Approximately 90 days after surgery, patients are able to generate brain signals that travel through the electrode. A receiver outside the patient's body captures these signals, and through a few other transformations, these signals are translated into mouse movements on a computer monitor. [1, 9]
The Communication Aspect
The device has already been shown to be effective for communication through a "virtual keyboard", i.e., a keyboard emulated through software on a computer monitor with a patient selecting letters through his or her mouse movements. However, the process tends to be very slow and quite tedious, so other methods should be developed to expedite this process [1]. Remember, these patients are cognitively intact, so while they are thinking of conversing at normal rates of 150 to 200 words per minute (wpm) [8], they may be able only to attain rates of up to 20 wpm [1]. This is critical when communicating with anyone, and especially those who are unfamiliar with locked-in syndrome, as limited or slow speech can convey low intelligence and misunderstanding, not to mention severely constraining the possibilities of the conversation [8]. Since our patients are cognitively intact, the chance to socially interact as normally as possible can reap great benefits for their health.
Existing Methods
The idea behind scanning is that elements (such as letters from the alphabet) are arranged in a fixed fashion, such as a grid. A cursor or highlighter moves through each of the elements at a fixed rate of speed. When the intended element is selected, the user throws a switch (or clicks a mouse button, etc.) to indicate selection of that element [2]. Most scanning methods also include delete element key(s) in order to fix mistakes, and erase previous input [6]. Most arrangements of elements are in a row-column format, where the cursor first scans through an entire row of elements, and then scans the columns of the selected row to complete the input from the user [2, 6]. However, two main ideas have emerged from this concept: character-based and word-based input.
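The row-column procedure described above can be sketched as a small step-counting model. This is only an illustration under stated assumptions: the 6-column grid, the `<DEL>` and `<DELWORD>` key names, and the function names are hypothetical and do not come from any system in [2] or [6].

```python
# Hypothetical grid layout; "_" stands for the blank space key.
GRID = [
    list("ABCDEF"),
    list("GHIJKL"),
    list("MNOPQR"),
    list("STUVWX"),
    ["Y", "Z", "_", ".", "<DEL>", "<DELWORD>"],
]


def scan_steps(grid, target):
    """Return (row_steps, col_steps): highlight moves before each switch throw."""
    for r, row in enumerate(grid):
        if target in row:
            # The cursor highlights rows 0..r, then columns 0..c of that row.
            return r + 1, row.index(target) + 1
    raise ValueError(f"{target!r} is not on the grid")


def selection_cost(grid, target):
    """Total highlight steps for one selection (which takes two switch throws)."""
    row_steps, col_steps = scan_steps(grid, target)
    return row_steps + col_steps
```

Under this model the time to select an element grows with its distance from the top-left corner, so the position of each element on the grid directly affects the communication rate.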
The character-based approach is rather obvious. Each character of the alphabet, plus other elements such as a blank space (to separate words), punctuation marks, backspace, and possibly backword (i.e., remove the entire previous word), is arranged in a rectangular grid. The arrangements can vary depending on the user [2]. While an alphabetic arrangement may be easy for the user to understand, the communication rate may not be optimal compared to a different arrangement. Specifically, if the keys are arranged so that the most frequently used characters/keys are placed closer to the starting point of the cursor, a significant increase in communication rate may be seen. Since our users are cognitively intact, this would likely be a very good idea if a scanning system were implemented.
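To make the layout argument concrete, the following sketch compares an alphabetical grid with one ordered by character frequency. Everything here is an assumption made for illustration: the 5-by-6 grid, the cost model (row steps plus column steps), and the short `SAMPLE` string, which stands in for a real corpus of the user's own conversation.

```python
from collections import Counter
from itertools import product


def cells_by_cost(rows, cols):
    """Grid cells ordered from cheapest to most expensive to reach."""
    return sorted(product(range(rows), range(cols)),
                  key=lambda rc: (rc[0] + rc[1], rc))


def build_layout(symbols, rows, cols):
    """Place symbols, in the order given, onto the cheapest cells first."""
    return dict(zip(symbols, cells_by_cost(rows, cols)))


def average_cost(layout, text):
    """Mean highlight steps per character of text under a layout."""
    # A cell at (row, col) needs row + 1 row steps and col + 1 column steps.
    costs = [layout[ch][0] + layout[ch][1] + 2 for ch in text]
    return sum(costs) / len(costs)


ALPHABET = "abcdefghijklmnopqrstuvwxyz_"  # "_" stands for the space key
# Stand-in corpus (assumption); a real system would use the user's own text.
SAMPLE = "the_quick_brown_fox_jumps_over_the_lazy_dog_and_then_rests"

# Alphabetical layout, filled row by row.
alphabetical = {ch: divmod(i, 6) for i, ch in enumerate(ALPHABET)}

# Frequency layout: most common symbols go on the cheapest cells.
counts = Counter(SAMPLE)
by_frequency = sorted(ALPHABET, key=lambda ch: (-counts[ch], ch))
frequency = build_layout(by_frequency, rows=5, cols=6)
```

Calling `average_cost(alphabetical, SAMPLE)` and `average_cost(frequency, SAMPLE)` gives the two averages to compare. How large the gain is in practice depends on the corpus and on how quickly the user learns the less familiar arrangement.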
The word-based approach is perhaps less obvious. The grid actually has two layers. On the first layer, elements are subject headings. On the subsequent layer, the elements are words relating to the subject headings. In other words, the elements of the grid are a predefined vocabulary set, in which every word is grouped into a category. To select a word, the user first selects the subject heading, then the actual word contained in the subject, necessitating the use of four clicks, rather than two as in the character-based approach. [6]
As a special note for both of these methods, the research in [2] focused on reducing timing errors for people using scanning systems. It showed that when audio output accompanies the movements of the cursor, the user is far less likely to make an error in timing. Since humans associate timing more naturally with sound than with movement, this conclusion makes sense. Some caution should be taken regarding this result, however, as it was based on a very small number of patients (i.e., one). In any case, the idea is intriguing and should be considered for any implementation in our research.
There are advantages and disadvantages to each of these approaches. The character-based method allows for minimal device manipulation (or, in our case, minimal thought) in order to select an element. Moreover, the vocabulary for the user is unlimited, whereas in the word-based approach, the vocabulary is finite. While the word-based method requires double the number of switch activations per element, the element itself can be much richer than an element of the character-based technique. In either case, however, the methods are very slow relative to other techniques. Moreover, in the word-based approach, words that commonly form a sentence occur in separate categories, thus heavily increasing the cognitive load (which itself is not dedicated to the conversation). In other words, the user must think on two different levels in order to converse.
One scanning possibility is a binary arrangement, as mentioned in [1]. Here, we have the letters of the alphabet arranged in a balanced binary tree. A letter is displayed on the screen (starting with M or N), and the cursor scans through choices of "earlier" or "later" (or "select" when the proper letter is displayed), thus efficiently navigating the tree. This may help the speed at which letters are selected, yet conversation would still occur on two different levels of thought. More research or testing of such a system could be warranted.
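The binary arrangement amounts to a binary search over the alphabet. The sketch below is a hypothetical model of it, not the implementation in [1]: it assumes a 26-letter uppercase alphabet and uses the midpoint of the remaining range as the displayed letter, which happens to start at M.

```python
import string


def binary_path(target, letters=string.ascii_uppercase):
    """Return the (displayed_letter, choice) pairs that lead to target."""
    goal = letters.index(target)  # raises ValueError for unknown symbols
    lo, hi = 0, len(letters) - 1
    path = []
    while lo <= hi:
        mid = (lo + hi) // 2
        shown = letters[mid]
        if mid == goal:
            path.append((shown, "select"))
            return path
        if goal < mid:
            path.append((shown, "earlier"))
            hi = mid - 1
        else:
            path.append((shown, "later"))
            lo = mid + 1
    raise AssertionError("unreachable: target is always inside the range")
```

For example, `binary_path("A")` shows M, F, and C (each answered "earlier") before A is displayed and selected. With 26 letters no selection needs more than five choices, although each choice is itself made by scanning through the three options.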
Abbreviation Expansion
Whereas scanning specifies a method for generating input, abbreviation expansion, as the name implies, specifies a method for minimizing the amount of input needed to generate whole words or phrases. The idea here is that the user predefines a number of small sequences of characters to represent a longer commonly used word or group of words (such as a phrase). The software handling the input from the user automatically replaces (or "expands") the abbreviation to the defined grouping of words. [3]
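A minimal sketch of the expansion step, assuming a whitespace-separated input and a user-defined lookup table. The table entries and function name are hypothetical examples, not the design described in [3].

```python
# Hypothetical user-defined abbreviations.
ABBREVIATIONS = {
    "idf": "I am doing fine",
    "hru": "how are you",
    "ty": "thank you",
}


def expand_text(text, table=ABBREVIATIONS):
    """Replace every token that is a known abbreviation with its expansion."""
    return " ".join(table.get(token, token) for token in text.split())
```

Tokens that are not in the table pass through unchanged, so the user can still type any word in full.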
The obvious advantage to this strategy is that the user is required to type very little in order to maintain a dialogue. However, there are two major disadvantages. One is one we have already encountered: abbreviations can cover only part of the actual vocabulary of the user. Long words which are infrequently used will still have to be fully typed [3]. Moreover, as the number of abbreviations rises, the amount of memorization needed by the user rises, and soon the abbreviations become quite arbitrary. This can lead to many errors in input, which can be very costly in aided conversation. [7]
Word Prediction
Word prediction is another method which does not generate input but rather allows minimal input to be mapped to richer input. The concept behind this process also follows closely from the name: as a user types in the beginning of a word, the input software predicts what word the user is attempting to enter. There are many ways to accomplish this, with the most obvious being a list of possible matches displayed after a certain number of keystrokes. While it is clear that an alphabetical list of matches could be shown, this is often not helpful when the goal is minimal input. Thus many techniques are based on the frequency that words appear within the English language or normal English conversation. Some other processes rely on how recently a user used a word for input. [3]
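The frequency and recency orderings mentioned above can be sketched as follows. This is an illustrative toy, not any of the systems surveyed in [3]; the class name, its methods, and the idea of a simple counter-based clock for recency are all assumptions.

```python
from collections import Counter


class PrefixPredictor:
    """Suggest completions for a typed prefix, by frequency or by recency."""

    def __init__(self, frequencies):
        self.freq = Counter(frequencies)  # word -> how often it has been used
        self.last_used = {}               # word -> time of most recent use
        self.clock = 0

    def record(self, word):
        """Note that the user has just entered this word."""
        self.clock += 1
        self.freq[word] += 1
        self.last_used[word] = self.clock

    def suggest(self, prefix, limit=3, by="frequency"):
        """Return up to `limit` words starting with prefix, best first."""
        matches = [w for w in self.freq if w.startswith(prefix)]
        if by == "recency":
            ranked = sorted(matches, key=lambda w: (-self.last_used.get(w, 0), w))
        else:
            ranked = sorted(matches, key=lambda w: (-self.freq[w], w))
        return ranked[:limit]
```

A usage example: after building the predictor from a word-count table, `suggest("th")` returns the most frequent matches, while `suggest("th", by="recency")` puts the word the user entered most recently first. Keeping the list short matters here, since every extra entry is one more item the user has to scan.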
While these methods have been shown to be somewhat useful, none of them consider the syntax in which words lie. The research contained in [3] proposes a new word prediction method based on syntactical analysis of English sentences. Natural language has an inherent redundancy which can be quite useful for prediction (although, for our purposes, we must consider whether conversation follows this same pattern). Based on the chart-parsing method referenced in [3], a system has been built to reflect this idea. Moreover, this system adapts to the common language usage structure and vocabulary of the user (thus incorporating localized "frequency" and "recency" techniques). [3]
The advantages of this system are many. Clearly, as words become longer, word prediction aids the user in typing very little of the actual word. By definition, prediction lessens the time needed for input over a large vocabulary, given that the implementation of the method is minimally intrusive. Moreover, when the capabilities of syntax analysis are considered, the possibility of predicting entire sentences greatly lessens the effort needed to generate the input of that sentence. The disadvantage, however, is that as with word-based scanning, conversation operates at more than one level, since the user is constantly selecting from a list of words and then returning to the actual conversation. Also, while many such systems exist, a significant increase in communication rate has not been completely demonstrated. [3]
Semantic Compaction (Minspeak)
Whether denoted semantic compaction (scientifically) or Minspeak (commercially), this concept uses a fairly radical approach to limited input communication. Developed by Bruce Baker around 1980, the idea is to give the user a keyboard consisting only of a small quantity of icons (32, 64, or 128). These icons are ambiguous in that they represent concepts only, and each icon represents more than one concept. However, when the icons are placed in a logical sequence, the context of these icons generates unambiguous ideas. [4, 7]
Take for example the icon APPLE. Meanings associated with this icon can be food, eat, red or even New York (i.e. the Big Apple). Typing this icon will generate no particular output. But when combined with the RAINBOW icon, the unambiguous meaning of red is extracted from the sequence. In this manner a small number of icons, when combined, form a large set of language elements. [7]
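The APPLE example can be modeled as a lookup from icon sequences to words. In the sketch below only the APPLE plus RAINBOW pairing comes from the text; the VERB and CITY icons and their pairings are hypothetical, as is the function name, and none of this reflects the actual Minspeak vocabulary.

```python
ICON_SEQUENCES = {
    ("APPLE", "RAINBOW"): "red",    # the example given in the text
    ("APPLE", "VERB"): "eat",       # hypothetical pairing
    ("APPLE", "CITY"): "New York",  # hypothetical pairing
}


def decode(icons, table=ICON_SEQUENCES):
    """Return the word for a complete icon sequence, or None if incomplete."""
    return table.get(tuple(icons))
```

Here `decode(["APPLE"])` returns `None`, reflecting that a single ambiguous icon generates no output, while `decode(["APPLE", "RAINBOW"])` returns `"red"`.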
Baker claims that semantic compaction systematizes a natural process through which we all pass during the course of conversation. In other words, although one is searching for the icons to represent his or her thoughts, this is not a separate level of cognition (as compared to word prediction). Moreover, whereas abbreviations in abbreviation expansion become arbitrary for large quantities, icons make intuitive sense to the "speaker," and are much more easily remembered. Minspeak keyboards also contain alphanumeric keys so that users can generate unique words or add noun and verb endings as appropriate. With this in mind, Baker claims that the number of keystrokes pressed (relative to standard keyboard typing) can drop as much as 60%. [7]
There are some disadvantages to semantic compaction, however. Clearly, users must remember a large amount of sequences in order to effectively use the Minspeak system (see the next section about "icon prediction" though) [4, 6]. Moreover, the output provided by semantic compaction is usually functional, i.e., allows the user to make a point but may not be grammatically correct. Minspeak does provide keys to fix this, but then the communication rate is greatly slowed and the benefit of the system diminishes or vanishes [6]. Since we desire to have full-scale communication, with no adverse effects for our patients, this poses a problem. There is also a time consideration: at least 90 hours of training and practice are recommended to master a simple 2000 word vocabulary. [7]
As a final note, Minspeak systems are licensed exclusively to the Prentke Romich Company. They provide both physical and virtual models of various Minspeak systems. In addition, PRC provides built-in vocabularies called MAPs. This frees the user from defining his or her own icon sequences and allows for practice to begin immediately. Both PRC and Baker assert that continued practice allows for great increases in the effectiveness of communication. [7, 10]
Compansion
Compansion is a technique that sits between the word prediction and the semantic compaction approaches to communication. Formally, the concept regarding compansion is to translate a sequence of content words into a grammatically correct and relevant sentence. In other words, when the user enters the phrase "apple eat john", a compansion method would produce the output "John has eaten an apple" but never "The apple ate John." [5] Alternatively, if the input content words could be interpreted various ways, the user should be presented with a list of choices from which to select. [6]
When combined with word prediction, compansion clearly creates an even faster method for producing output. However, compansion can also be combined with Minspeak. While a standard user would enter the proper icon sequences to generate a functional sentence and then enter a different mode to correct all the tenses and other grammatical aspects, compansion can automatically make these changes. So here too we see a dramatic increase in efficiency. Furthermore, compansion also includes a concept of "icon prediction." To be fair, the described method does not predict sequences of icons. Rather, when an icon is entered, all invalid subsequent icons are disallowed from input. So the user still must type every icon himself or herself, but the chances for error are reduced to those of meaning. [5]
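The "icon prediction" idea, as described here, can be sketched as a prefix filter over the set of valid icon sequences. The sequences and names below are hypothetical and are not taken from [5].

```python
# Hypothetical set of valid icon sequences.
VALID_SEQUENCES = [
    ("APPLE", "RAINBOW"),
    ("APPLE", "VERB"),
    ("SUN", "RAINBOW"),
]


def allowed_next(entered, sequences=VALID_SEQUENCES):
    """Return the icons that may legally follow the icons entered so far."""
    prefix = tuple(entered)
    n = len(prefix)
    return {seq[n] for seq in sequences if len(seq) > n and seq[:n] == prefix}
```

An interface built on this would disable every icon outside `allowed_next(...)` after each keypress, so the only remaining errors are valid sequences with an unintended meaning.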
There is also one more advantage that is relevant to our research. Our patients are cognitively intact, and so will likely have a keen grasp of English grammar. The design of compansion specifically targets just the people whom we are aiming to help. [6]
Conversational Momentum (CHAT)
The approach described as conversational momentum is quite different from any other already mentioned. The prediction capabilities of the system are based on well-defined structures of normal conversation. So the social rules of negotiation are ingrained into this system, with no other predictive capabilities. Some have criticized this system because speech can vary so widely. However, while speech is "infinitely variable... it does not follow... that it is totally unpredictable" [8]. The core idea of this system is based on the user "participating in a satisfying way in social encounters" [8], i.e., the exact goal of our research.
That does not mean that the system is without limitations or disadvantages. Indeed, while conversation is predictable, the majority of the content of the conversation changes for every social interaction. In other words, while we can generate text before a conversation that will likely be used, that constitutes about 20% of the entire conversation. Also, in using entirely pre-generated text (which CHAT does), we make an assumption that the user has a lot of time to think about what conversations will be forthcoming in the near future. Also, we must assume that the user understands the conversation structure, as he or she will have to designate the current status of the conversation as it occurs. CHAT was also tested on only a small sample of people. However, this research showed a possible gain up to around 50 wpm, which would be a great improvement. [8]
Conversational momentum nevertheless asserts a number of advantages. CHAT does have a logging ability for conversations. This means that the user can remember personal phrases or relay a certain content of a conversation to many people without having to make input every time. Additionally, half of the 80% of main conversation is spent listening. The CHAT user has at his or her fingertips a list of "filler" remarks (such as "yeah" and "uh huh") in order to reassure the other conversation participant, just as in normal conversation. There is much promise in also perhaps combining this system with the other capabilities.
Coding
One aspect that was not explored was the possibility of a coding system for the user (such as Morse Code). Future researchers may want to examine this option.
Error handling: a missing piece
In scanning, errors will result from users throwing a switch too early or too late. Throwing a row switch too early or too late is more devastating than a column switch, as the entire column must be scanned in order to restart the scanner. With the word-based approach, an early switch throw is also very costly, since we then must scan through an entire screen of unimportant words in order to restart. In either case, multiple successive errors can be disastrous. So, we would like an optimal solution for handling errors quickly and gracefully.
For our purposes, we have at least two distinct signals that the user can generate. We will need one to act as the switch. As for the other, we should be able to customize its function for the user. One option could be to restart the scanning process from the beginning of the current row, or from the beginning entirely. Additionally, we should also place the delete key(s) near the beginning of the scanner, so that when an error is made a recovery can be quickly performed. Another possibility is that after the delete is used, we could return the user to the last place that he or she was scanning (especially if he or she has a tendency to click too early). At the very least, we should allow the "back up" features to be customizable by the users and by the trainers assisting the users.
Abbreviation expansion presents a slightly different problem. Here, we have a few letters representing longer words or a series of words. The most obvious idea is to keep a history of how each sentence was generated, and when the delete key is used, replace the expansion with the original abbreviation, unless the user makes a mistake in typing the abbreviation itself, in which case a character should be deleted. However, perhaps the user would like to intentionally embed phrases. For example, in conversation it is quite customary to inquire about general well-being. The general response is something like "I am doing fine." Let's say that our user uses the abbreviation idf for this. So, on days when one is not feeling well, one may want to use the idf abbreviation, erase the last word, and then use the tr abbreviation for "terrible". So we must carefully choose the function(s) of the delete key(s).
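One way to explore these delete-key choices is a buffer that remembers which abbreviation produced each piece of text. The sketch below is a hypothetical design, not an existing system: the class name, the three delete modes ("revert", "word", "char"), and the token-at-a-time input are assumptions; only the idf and tr abbreviations come from the example above.

```python
class ExpansionBuffer:
    """Text buffer that keeps a history of abbreviation expansions."""

    def __init__(self, table):
        self.table = table
        self.items = []  # (visible_text, source_abbreviation or None)

    def type_token(self, token):
        """Append a token, expanding it if it is a known abbreviation."""
        if token in self.table:
            self.items.append((self.table[token], token))
        else:
            self.items.append((token, None))

    def delete(self, mode="revert"):
        """Apply the delete key to the most recent item."""
        if not self.items:
            return
        text, abbreviation = self.items.pop()
        if mode == "revert" and abbreviation is not None:
            kept = abbreviation                 # undo the expansion
        elif mode == "word":
            kept = " ".join(text.split()[:-1])  # drop the last word only
        else:
            kept = text[:-1]                    # drop a single character
        if kept:
            self.items.append((kept, None))

    def text(self):
        return " ".join(visible for visible, _ in self.items)
```

With the table `{"idf": "I am doing fine", "tr": "terrible"}`, typing idf, deleting in "word" mode, and then typing tr yields "I am doing terrible", which is the embedded-phrase case described above. Which mode the delete key defaults to would presumably be one of the per-user customizations.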
In word prediction, we are also faced with many possibilities for the delete key. As with abbreviation expansion, we may want to delete on a word by word basis, or we may want to return to the point before the prediction was completed. The hard trick comes when we try to handle errors with regard to natural language processing (syntax analysis). The whole structure of a sentence can be changed by an error in input, so we must ask ourselves how far back we can go in order to catch the actual error without making the system so hard to use that there is no time savings. It appears that as the sophistication of the system grows, the more complex the error handling becomes. Again, this is an area ripe for research!
We have already seen a method for limiting errors with Minspeak called "icon prediction," where a sequence of bad input is disallowed by construction [5]. This is a clear advantage to this system. However, there is no mention of how to handle deleting the previous icon that was pressed, if it was pressed in error. Moreover, perhaps an icon sequence has been translated into something that was unexpected. How can we recover from such an error? Do we back up to the point of translation, or do we back up entirely to the beginning of the sequence? Would a buffer that contains possible forthcoming translations help the user limit his or her errors? These are all questions that should be answered.
For sentence compansion, there is a clear way of handling errors. The software should remember the transformations from content words to richer, grammatically correct sentences. The delete key would allow for a return to the content words for further editing or a different selection of translation. At this point however, we return to the question of deletions among words and characters. Since compansion can relate either to word prediction or to semantic compaction, the methods employed there should be the same as employed here.
Conversational momentum relies on input that has already been constructed. Once the user makes the word selection, output is seen by the other participant in conversation. We may want to add an "are you sure" button to ensure that the user has not made an error in selection, but this could be cumbersome, especially for users with good control of their movements. Since all other conversation will be generated on the fly, the error handling method for that input method should be deployed, again with customization by the user a necessary property.
There is another question we must ask, and hopefully answer, for any of the above methods. Many times, we only notice an error in input a few keystrokes or clicks after the error has been made. Can we allow our users to "move back" through input to fix errors while leaving other portions of input intact? The level of complexity involved will depend on the method being used, but in each case, it seems that this task is a nontrivial event. Moreover, does this capability hinder efficiency in any way? It seems that we must strike a balance between user comfort with the system and known efficiency concerns. Training and practice will be key in handling errors.
We see that there are a number of unresolved questions in the described input methods with respect to error handling. The key for any system, though, will be the option for customization on a per-user basis. Since each of our patients may have a widely varying approach to communication and manipulation of input, the ability to catch errors common to that particular user will directly affect the effectiveness of the system. Further research is definitely recommended.
Some (Original?) Thoughts
The idea behind n-tuple scanning (let us call it double scanning when n = 2) is that there are n cursors scanning through input, and the user has at least n distinct signals that can be generated. The envisioned system for our user, who has at least two controllable signals, is double scanning that begins at the top and bottom of rows and left and right of columns, and converges toward the middle of the selection grid. It seems as if this system, once practiced, will allow for a faster scanning method, as long as the cognitive load is not too heavy.
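A rough step count suggests why double scanning might be faster. The sketch below models a single row only and assumes both cursors advance one element per tick, one from each end, with one user signal tied to each cursor; the function and signal names are hypothetical.

```python
def single_scan_steps(index):
    """Highlight steps for one cursor starting at the left end."""
    return index + 1


def double_scan_steps(index, length):
    """Steps until either converging cursor reaches index, and which signal selects it."""
    from_start = index + 1      # cursor moving in from the left
    from_end = length - index   # cursor moving in from the right
    if from_start <= from_end:
        return from_start, "signal_1"
    return from_end, "signal_2"


def average_steps(length, double=False):
    """Mean steps per selection, assuming every position is equally likely."""
    if double:
        total = sum(double_scan_steps(i, length)[0] for i in range(length))
    else:
        total = sum(single_scan_steps(i) for i in range(length))
    return total / length
```

For a row of six elements this model gives an average of 3.5 steps with one cursor and 2.0 steps with two. It ignores the extra cognitive load of tracking two cursors and choosing between two signals, which is exactly the open question raised above.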
It also seems that word prediction and abbreviation expansion could be implemented in the same system. The idea here is that the prediction component not only predicts the next word or series of words, but also includes the abbreviations that might come next, with their expansion noted on the screen for easy reference. This would slip into any of the three submethods (frequency, recency, and syntax analysis). For syntax analysis, we would need to analyze the meaning of the abbreviation and not the abbreviation itself, but this should be easy for the prediction device. Through this method, we have allowed the user to type very little for common phrases through either prediction or abbreviations. Moreover, the vocabulary is unlimited, negating the problem for abbreviation expansion and word-based scanning (taken separately).
We have seen that current implementations of semantic compaction have a concept called "icon prediction," but this amounts essentially to disallowing erroneous input. A good improvement would be to actually predict what icons the user is likely to select next, based perhaps on a frequency or recency framework. Also, with this in mind, we could display the words that the predicted icon sequence will generate if selected [6]. However, this may be too much to display and may be a hindrance more than an improvement. Nevertheless, a full-scale prediction component to Minspeak may be a wonderful improvement to efficiency.
Conversational momentum requires that conversation pieces be generated before the conversation begins, or typed on the fly. So we see some improvement in the beginning and ending of conversations (and filler words for the middle). Perhaps if we integrated any of the other input systems for the middle of the conversation, then CHAT may be an even more valuable tool. We may also want to add some sort of a prediction device when sorting through the logs of past conversation, since pieces are commonly repeated. Further investigation is required.
Conclusion
References
Notes