Voice Recognition for Linux

Eric S. Johansson esj at harvee.org
Fri Feb 23 19:44:46 UTC 2007


Henrik Nilsen Omma wrote:
> Eric S. Johansson wrote:

> Looks like the original text got caught in a spam filter somewhere 
> because of the attachment (I found it in the web archives). No worries 
> about the tone. We are having a frank technical discussion and need to 
> speak directly to get our points across. So my turn :) ...

Thanks for the understanding but it always helps to be polite.
> 
> I think you are too caught up in the current working model of NS to see 
> how things can be done differently.

You haven't seen the comments I've made in the past about speech user 
interfaces and what Dragon has done wrong.  I have proposed many things 
that should be fixed, but the current command model is not one of them.
> 
> I have not studied the details of voice recognition and voice models, 
>... but I do appreciate the need for custom voice model training over time. 
> There is a need for feedback, but it does _not_ need to be real-time. 
> Personally, I would prefer it not to be real time. NS does in theory 
> tout this as a feature when they claim that you can record speech on a 
> voice recorder and dump it into NS for transcription. I have no idea 
> whether that actually works.

Okay, I should probably attempt to capture some of the user experience 
issues.

Correction of misrecognitions is something people debate a lot.  If 
you don't correct misrecognitions, you'll most likely get the same 
thing over and over again.  The output of the language and recognition 
model is probabilistic, so misrecognitions will change from time to 
time, but it'll basically be the same kind of misrecognition.  (Yes, all 
uncorrected.)

The user is then faced with a choice: do you correct the recognition 
engine or do you edit the document?  In both cases, it's painful.  But 
then you get the odd case where the misrecognition is completely 
unintelligible and you don't have any idea what the hell you said.  Then 
you have no choice but to go back, listen to what was said at that 
phrase, and make a correction.  This is a very real user experience.  I 
have spoken with people who write documents in Microsoft Word; they'll 
go back to page 5 of 20, see something that's garbled, and 
play it back so they can figure out what they said.  They usually don't 
correct heavy garbling but just say it again and get a more consistent 
recognition from that point forward, courtesy of the incremental training.

In theory, you can dictate into most applications using something called 
natural text.  It's a direct text injection with a history of what was 
said (audio and recognition).  You can do limited correction by 
Select-and-Say, and it even sort of kind of works if it's a full native 
Microsoft Windows application.  Tools like Thunderbird, gaim, and Emacs 
don't work so well.  How they feel is a discussion for later.
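
To make "a history of what was said" concrete, here is a toy Python 
sketch of the bookkeeping involved.  It is not Nuance's API or data 
format, just the shape of the idea: every injected phrase keeps a 
pointer back to its audio, so a later correction can find the text and 
replay the sound.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Utterance:
    text: str            # what was injected into the application
    audio_start_ms: int  # where this phrase lives in the session audio
    audio_end_ms: int

class DictationHistory:
    def __init__(self) -> None:
        self.utterances: List[Utterance] = []

    def add(self, text: str, start_ms: int, end_ms: int) -> None:
        self.utterances.append(Utterance(text, start_ms, end_ms))

    def find(self, spoken_words: str) -> Optional[Utterance]:
        # Select-and-Say style lookup: find the most recent phrase
        # containing the words the user named, so its audio can be
        # replayed and the correction fed back to the recognizer.
        for utt in reversed(self.utterances):
            if spoken_words.lower() in utt.text.lower():
                return utt
        return None

The point is that correction needs that text-to-audio link; blind text 
injection into an unaware application throws it away.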

But you have this nice tool, that's almost right, called the dictation 
box.  It's a little window which has full editing and correction 
capability using the voice model of NaturallySpeaking.  When you are 
done with your dictation, you can inject it into the application it's 
associated with.  The wonderful thing about the dictation box is that 
making corrections significantly improves accuracy.  If I dictated into 
nothing but the dictation box for a week, I would have a significantly 
more accurate system and a lower level of frustration with 
misrecognitions.  If I had whatever magic the dictation box uses in all 
of my applications, I would be ecstatic.  I wouldn't need to retrain 
every six months.  But it's not sufficient.  Why is, again, a 
conversation for a future time.

If you want to migrate away from incremental recognition, you'll need to 
look to NaturallySpeaking 3 or NaturallySpeaking 4 for the user 
experience.  You would probably lose one to two percent (or more) of 
accuracy, which is really significant.  Believe me, there's a huge 
difference between 99% and 99.5% recognition accuracy in actual 
operating conditions.  It's also important to note that Dragon moved 
away from the incremental correction model a couple of times.  The last 
time I was in touch with Dragon employees (before the Bakers got 
greedy), they were really convinced that incremental training, properly 
done, gave a significantly better user experience, and I would have to 
say, from what I hear and from what I have experienced, I think they 
were right.  Maybe they were drinking their own Kool-Aid, maybe they 
were onto something.  I am no stranger to figuring out interesting ways 
to get the signals you need to do something right, so I trust them.

But independent of your desire, you may not be able to turn it off.  You 
may have users who know how it works making your life uncomfortable 
because you have made their life less pleasant.  You will have me 
demanding the highest possible accuracy.  :-)

I think at this point it would be a really good idea for you to go 
purchase a copy of NaturallySpeaking 9 Preferred.  Get a really good 
headset.  The one that comes in the box is a piece of crap.  No, 
seriously, it's really bad.  I can give you some recommendations on 
headsets (VXI mostly) but I really, really love my VXI Bluetooth 
wireless headset.  It is just so sweet.  It has some flaws but it's 
really sweet too.

> I don't really want to interact with the voice engine all the time, I 
> want it to mostly stay out of my way. I don't want to look at the little 
> voice level bar when I'm speaking or read the early guesses of the voice 
> engine. I want to look out the window or look at the spreadsheet that 
> I'm writing an email about :) The fact that NS updates the voice model 
> incrementally is actually a bad feature. I don't want that. If I have a 
> cold one day or there is noise outside or the mic is a bit displaced the 
> profile gets damaged. That's probably why you have to start a fresh one 
> every six months.

Can you use your keyboard without the delete or backspace key?  Or even 
the arrow keys?  The correction dialog I'm talking about is as core to 
your daily operation as those keys are.  As for changing focus, sure, 
you can do it but only if you have an application which is sufficiently 
speech aware to record your audio track at the same time and be able to 
play back a segment you think is an error.  It's the only way you'll 
make corrections unless you have a memory which is a few orders of 
magnitude better than mine.

I should also note that if you don't have a clear and accurate 
indication of what's a misrecognition error, correcting something that 
is right can make your user model go bad quickly.  At least so I am 
told.  Of course, I've never done anything like that, no, no way.  Uh-huh.


> Instead of saving my voice profile every day, I would like to save up a 
> log of all the mistakes that were made during the week. I would then sit 
> down for a session of training to help NS cope with those words and 
> phrases better. I would first take a backup of my voice profile, then 
> say a few sample sentences to make sure everything was generally working 
> OK. I would then read passages from the log and do the needed correction 
> and re-training. I would save the profile and start using the new one 
> for the next week. I would also save profiles going back four weeks, and 
> once a month I would do a brief test with the stored up profiles to see 
> if it had degraded over time. If it had, I would roll back to an older 
> one and perhaps do some training from recent logs too. There is no 
> reason a voice profile should just automatically go bad over time.

Now you're thinking like a geek.  Ordinary users eventually learn when 
to save a profile based on the type and number of corrections they make. 
They don't test them, they just save them and count on the system to 
automatically back up every few saves.  I don't save mine every day, and 
I only save my profile when I correct really persistent misrecognitions. 
If I'm getting a cold or hay fever, I definitely don't save, but I also 
suffer from reduced recognition for a few days.
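
For what it's worth, the backup-and-rollback routine you describe is 
easy enough to script.  Here's a rough Python sketch with made-up 
paths; it says nothing about NaturallySpeaking's real on-disk layout.

import shutil
import time
from pathlib import Path

PROFILE = Path.home() / "speech-profile"          # hypothetical profile directory
BACKUPS = Path.home() / "speech-profile-backups"
KEEP = 4                                          # roughly a month of weekly backups

def backup_profile() -> Path:
    BACKUPS.mkdir(exist_ok=True)
    dest = BACKUPS / ("profile-" + time.strftime("%Y-%m-%d"))
    shutil.copytree(PROFILE, dest)
    for old in sorted(BACKUPS.iterdir())[:-KEEP]:  # prune older backups
        shutil.rmtree(old)
    return dest

def roll_back(backup: Path) -> None:
    # restore an older profile if the current one has degraded
    shutil.rmtree(PROFILE)
    shutil.copytree(backup, PROFILE)

But as I said, ordinary users won't run anything like this; they save 
when the corrections feel persistent and count on the automatic backups.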

User reluctance to put in the effort is the reason why you train on a 
document once at the beginning.  I usually choose a couple of different 
documents to train on after a month on a new model, but I am a rarity.  I 
described this behavior in a white paper I wrote called "spam filters 
are like dogs".  You have expert trainers and you have people whose dogs 
crap on the neighbors' lawns.  Same category of animal, with roughly the 
same skill potential, but very different training models.  
NaturallySpeaking is trying to take advantage of the "less formal" 
behaviors for training, and they're doing a pretty good job at 
succeeding with those signals.

Don't force the ordinary user to train at an expert level.  It won't 
work, it will just piss them off, and it will discourage if not drive 
away the moderately expert user who wants to work in the way they are 
comfortable.
> 
> The fact that you have to constantly interact with the voice engine is 
> not a feature, it's a bug! It's just that you have adapted your 
> dictation to work around it. It's not at all clear that interactive 
> correction is better that batched correction. It certainly should not be 
> seen as a blocker for a project like this going forward. I wouldn't want 
> to spend years on a project simply to replicate NS on Linux. There is 
> plenty of room for improvement in the current system.

You constantly interact with your computer and expect a bunch of 
feedback from it.  This is no different.  You're not looking at speech 
levels, but you may be looking at load averages, the time of day, alerts 
about e-mail coming in, cursor position in an editor buffer, color 
changes for syntax highlighting.  These are all forms of feedback.  
Incremental training and looking at recognition sequences are just 
different forms of feedback.  He learned to incorporate it in your 
operation.

("he learned" is a persistent misrecognition error that mostly shows up 
when using natural text, because I'm not in a place where I can correct 
it often enough, it keeps showing up if I was in dictation box right 
now, it would be mostly gone.  This is why incremental recognition 
correction is so very very important.  batch training has never made 
this go away and I've tried.  The only thing that has succeeded has been 
incremental in one context.)

> 
> OK, now for some replies:

you mean the above weren't enough?  :-)

> 
>> There is a system that already exists that does exactly what you've 
>> opposed.
> [assuming you meant 'proposed' here] Unlikely. If a system with the 
> level of usability existed it would already be in widespread use.
> 
>>   While it was technically successful, it has failed in that nobody 
>> but the originator uses it, and even he admits this model has some 
>> serious shortcomings.
>>   
> What system, where? What was the model and what were the shortcomings?

http://eepatents.com/  but the package is no longer visible.  Ed took it 
down a while ago.  His package used xinput direct injection.  He used a 
Windows application with a window to receive the dictation information 
and inject it into the virtual machine.  He was able to do straight 
injection of text, limited by what NaturallySpeaking put out.  I think 
he did some character sequence translations, but I'm not sure.  He 
couldn't control the mouse, couldn't switch windows, and had only global 
commands, not application-specific commands.  I could be wrong on some 
of these points, but that's basically what I remember.
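
For the curious, the general mechanism is synthetic X keystrokes.  Here 
is a minimal Python sketch using python-xlib's XTest support; it 
illustrates the approach, not Ed's actual code, and it punts on 
capitals, punctuation, and anything needing modifier keys.

from Xlib import X, XK, display
from Xlib.ext import xtest

def type_text(text: str) -> None:
    d = display.Display()
    for ch in text:
        name = "space" if ch == " " else ch
        keysym = XK.string_to_keysym(name)
        keycode = d.keysym_to_keycode(keysym) if keysym else 0
        if keycode == 0:
            continue  # character not mapped; a real bridge needs more work
        # capitals would also need a synthetic Shift press, ignored here
        xtest.fake_input(d, X.KeyPress, keycode)
        xtest.fake_input(d, X.KeyRelease, keycode)
    d.sync()

type_text("hello world")

The keystrokes land in whatever window happens to have focus, which is 
exactly where the limitations above come from: no Select-and-Say, no 
correction, just blind text.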

There was also a bunch of other stuff, like being complicated to set up, 
etc., but that can be fixed relatively easily, especially if you remove 
the dependency on Twisted.

To my mind, it's the same as what you're proposing.  And there is 
general agreement that it's only a starting point for the very 
committed/dedicated.

> 
>> The reason I insist on feedback is very simple.  A good speech 
>> recognition environment lets you lets you correct recognition errors 
>> and create application-specific and application neutral commands.
> Yes, we agree that you need correction. The application-specific 
> features can be implemented in this model too, in the same way that Orca 
> uses scripting.

I don't know how Orca uses scripting.  Pointers?

Seriously though, I want a grammar and the ability to associate methods 
with the grammar.  I do know I'm not the only one, because there are a 
fair number of people who have built grammars using the 
NaturallySpeaking Visual Basic environment, natpython, and a couple of 
macro packages built on top of natpython.
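
To be concrete about what I mean by a grammar with methods attached, 
here is a toy Python sketch.  It is not the natpython API; the 
fixed-string matching stands in for real recognition, and the only point 
is the mapping from spoken forms to handlers.

def send_keys(keys: str) -> None:
    # placeholder: a real implementation would inject keystrokes into
    # the focused application or talk to the editor directly
    print("would send:", keys)

class EmacsGrammar:
    # spoken form -> handler; a real grammar also allows alternatives,
    # optional words, and captured slots, not just fixed phrases
    def rules(self):
        return {
            "save buffer": self.save_buffer,
            "switch buffer": self.switch_buffer,
        }

    def save_buffer(self):
        send_keys("C-x C-s")

    def switch_buffer(self):
        send_keys("C-x b")

def dispatch(grammar, phrase: str) -> bool:
    handler = grammar.rules().get(phrase)
    if handler:
        handler()
        return True
    return False  # not a command; fall through to plain dictation

dispatch(EmacsGrammar(), "save buffer")

Even this toy shows why I want grammars and not just raw dictation: the 
command set is data, and the behavior lives in ordinary methods.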

Even if you convince me, you'll have to convince them.

> You would still have to correct the mistake at some point. I would 
> prefer to just dictate on and come back and correct all the mistakes at 
> the end. One should read through before sending in any case ;) 

Oh, I understand, but in my experience, if I don't pay attention to what 
the recognition system is saying, my speech gets sloppy and my 
recognition accuracy drops significantly, until I have something which 
is completely unrecognizable at the end.  Also, I'm probably "special" 
in this case, but even when I was typing, I continually looked back at 
the document as far as the screen permitted, searching for errors.  It 
seems to help me keep speaking written speech and identify where I'm 
using spoken speech for writing.  I know other people, like you, want to 
just dictate and not look back.  Some of them will turn their chair 
around and stare at a painting on the wall while they dictate.  But 
there are those, like me, who can't.

> And I think that is a serious design-flaw for two (related) reasons: It
> gradually corrupts your voice files AND it makes the reader constantly 
> worry about whether that is happening. You have to make sure to speak as 
> correctly and properly at all times and always make sure to stop 
> immediately and correct all the mistakes. Otherwise your profile will be 
> hosed. I repeat: that is a bug, not a feature. You end up adapting more 
> to the machine than the machine adapts to you. *That is a bug.*

It's a feature... Seriously, get NaturallySpeaking, and play with the 
dictation box as well as natural text driven applications.  When you 
have something that's Select-and-Say enabled, you don't need to pay 
attention all the time; you can go back a paragraph or two or three and 
fix your errors.  The only time you need to pay attention is when you 
are using natural text, which is one way Nuance forces you to toe the 
line when it comes to applications.  That is a bug!


> I think this is an NS bug too. I don't want natural editing, I only want 
> natural dictation. I want two completely separate modes: pure dictation 
> and pure editing. If I say 'cut that' I want the words 'cut that' to be 
> typed. To edit I want to say: 'Hal: cut that bit'. Why? because that 
> would improve overall recognition and would remove the worry that you 
> might delete a paragraph by mistake. NS would only trigger its special 
> functions on a single word, and otherwise just do its best to 
> transcribe. You would of course select that word to be one that it would 
> never get wrong. (you could argue that natural editing is a feature, but 
> the fact that you cannot easily configure it to use the modes I 
> described is a design-flaw).

A few things are very important in this paragraph.  Prefacing a command 
is something I will really fight against.  It is a horrible thing to 
impose on the user because it adds extra vocal load and cognitive load. 
VoiceCoder has a "yo" command model for certain commands, and I just 
refuse to use them.  I type rather than say them; that sequence is so 
repellent to me.  I have also had significant experience with modal 
commands in DragonDictate, which is why I have such a strong reaction 
against the command preface, and it is why Dragon Systems went away 
from them.  Remember, Dragon was a technology-dedicated company, and I 
know for a fact that some of the employees were quite smart.  If 
Dragon's research group does something and sticks with it, there's 
probably a good reason for it.

I think part of our differences comes from modal versus non-modal user 
interfaces.  I like Emacs; it's non-modal (mostly).  Other people like 
vi, which is exceptionally modal.  Non-modal user interfaces are 
preferable in circumstances where the indicator that activates some 
command or different course of action is relatively natural.  For 
example, if I say "show dictation box" in the middle of running 
dictation, I just get the text.  But if I say "show dictation box" with 
a pause before the words as well as after, up comes the dictation box.  
Same words, but the simple addition of natural-length pauses allows 
NaturallySpeaking to identify the command and activate it only when it's 
asked for.  Yes, it's training, but minimal training, and it applies 
everywhere when separating commands from text.  This works for 
NaturallySpeaking commands and my private commands.
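
If it helps, here is a toy Python illustration of the pause idea.  The 
threshold and the utterance shape are invented for the sketch; 
NaturallySpeaking's internals are obviously not this simple.

COMMANDS = {"show dictation box", "hide dictation box"}
PAUSE_MS = 250  # assumed "natural length" pause threshold

def classify(words, silence_before_ms, silence_after_ms):
    phrase = " ".join(words)
    isolated = silence_before_ms >= PAUSE_MS and silence_after_ms >= PAUSE_MS
    if isolated and phrase in COMMANDS:
        return ("command", phrase)
    return ("dictation", phrase)

# the words embedded in running speech come out as text...
print(classify(["show", "dictation", "box"], 40, 30))
# ...but set off by pauses, the same words fire the command
print(classify(["show", "dictation", "box"], 400, 600))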

There is one additional form of mode switching in NaturallySpeaking, and 
that's the switching of commands based on which program is active and 
its state (i.e. running dialog boxes or something equivalent).  That's 
why I have Emacs commands that are only active when Emacs is running.
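
The same toy style covers the context switching: command sets come and 
go with window focus, so the Emacs grammar exists only when Emacs is in 
front.  Getting the active window class is stubbed out here; on Linux it 
would have to come from the window system somehow.

def active_window_class() -> str:
    # stand-in; a real version would query the focused window's class
    return "emacs"

APP_GRAMMARS = {
    "emacs":   {"save buffer", "switch buffer"},
    "firefox": {"new tab", "close tab"},
}
GLOBAL_COMMANDS = {"show dictation box"}

def active_commands() -> set:
    return GLOBAL_COMMANDS | APP_GRAMMARS.get(active_window_class(), set())

print(active_commands())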

> Precisely. It's because they don't want to fiddle with the program, they 
> just want to dictate.

But those that just dictate get unacceptable results.  Try it.  When 
you get NaturallySpeaking running, just dictate and never ever correct 
and see what happens.  Then try it the other way around, using the 
dictation box whenever possible.
---eric


-- 
Speech-recognition in use.  It makes mistakes, I correct some.



