Voice Recognition for Linux
Henrik Nilsen Omma
henrik at ubuntu.com
Fri Feb 23 17:49:10 UTC 2007
Eric S. Johansson wrote:
> Eric S. Johansson wrote:
>> as I was constructing my response, and was almost finished when it
>> hit me about what's wrong with the model proposed. it is the
>> equivalent of raw natural text. Full function natural text sucks a
>> little bit. The broken, unable to correct consistently, natural text
>> is horrible and ruins voice models. What you're proposing has even
>> less functionality than a broken natural text.
>
> I'm sorry, that was too harsh. I was interrupted by one too many
> things while I was writing that bit and I forgot to go back and edit
> it. Again, I apologize for being careless. today might be a day to
> stay away from the keyboard unless I'm writing code. :-)
Hi Eric,
Looks like the original text got caught in a spam filter somewhere
because of the attachment (I found it in the web archives). No worries
about the tone. We are having a frank technical discussion and need to
speak directly to get our points across. So my turn :) ...
I think you are too caught up in the current working model of NS to see
how things can be done differently.
I have not studied the details of voice recognition and voice models,
but I do appreciate the need for custom voice model training over time.
There is a need for feedback, but it does _not_ need to be real-time.
Personally, I would prefer it not to be real-time. NS does, in theory,
tout this as a feature when they claim that you can record speech on a
voice recorder and dump it into NS for transcription. I have no idea
whether that actually works.
I don't really want to interact with the voice engine all the time; I
want it to mostly stay out of my way. I don't want to look at the little
voice level bar when I'm speaking or read the early guesses of the voice
engine. I want to look out the window or look at the spreadsheet that
I'm writing an email about :) The fact that NS updates the voice model
incrementally is actually a bad feature. I don't want that. If I have a
cold one day, or there is noise outside, or the mic is a bit displaced,
the profile gets damaged. That's probably why you have to start a fresh one
every six months.
Instead of saving my voice profile every day, I would like to save up a
log of all the mistakes that were made during the week. I would then sit
down for a session of training to help NS cope with those words and
phrases better. I would first take a backup of my voice profile, then
say a few sample sentences to make sure everything was generally working
OK. I would then read passages from the log and do the needed correction
and re-training. I would save the profile and start using the new one
for the next week. I would also save profiles going back four weeks, and
once a month I would do a brief test against the stored-up profiles to
see whether the current one had degraded over time. If it had, I would
roll back to an older
one and perhaps do some training from recent logs too. There is no
reason a voice profile should just automatically go bad over time.
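To make this concrete, here is a rough sketch of that weekly/monthly
routine in Python. Everything engine-specific (retraining from a log,
scoring a profile's accuracy) is an invented placeholder, not any real
NS or Linux engine API; only the backup, pruning and rollback plumbing
is meant literally.

#!/usr/bin/env python3
# Sketch only: retrain_from_log() and accuracy_score() are stand-ins for
# whatever the recognition engine actually provides.
import shutil
import time
from pathlib import Path

PROFILE = Path.home() / ".voice" / "profile.current"
ARCHIVE = Path.home() / ".voice" / "archive"
KEEP = 4  # keep profiles going back four weeks

def accuracy_score(profile):
    # Placeholder: dictate a few test sentences against this profile
    # and return the fraction recognised correctly.
    return 1.0

def retrain_from_log(profile, mistake_log):
    # Placeholder: replay the week's logged mistakes, correct them,
    # and retrain the profile from the corrections.
    pass

def backup_profile():
    # Copy the live profile into the archive and prune old copies.
    ARCHIVE.mkdir(parents=True, exist_ok=True)
    dest = ARCHIVE / time.strftime("profile-%Y-%m-%d")
    shutil.copytree(PROFILE, dest, dirs_exist_ok=True)
    for old in sorted(ARCHIVE.iterdir())[:-KEEP]:
        shutil.rmtree(old)

def weekly_session(mistake_log):
    # Back up, sanity-check, then retrain from the week's log.
    backup_profile()
    if accuracy_score(PROFILE) < 0.9:
        raise RuntimeError("profile already looks bad; roll back first")
    retrain_from_log(PROFILE, mistake_log)

def monthly_check():
    # If an archived profile outperforms the live one, roll back to it.
    best = max(ARCHIVE.iterdir(), key=accuracy_score)
    if accuracy_score(best) > accuracy_score(PROFILE):
        shutil.rmtree(PROFILE)
        shutil.copytree(best, PROFILE)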
The fact that you have to constantly interact with the voice engine is
not a feature, it's a bug! It's just that you have adapted your
dictation to work around it. It's not at all clear that interactive
correction is better than batched correction. It certainly should not be
seen as a blocker for a project like this going forward. I wouldn't want
to spend years on a project simply to replicate NS on Linux. There is
plenty of room for improvement in the current system.
OK, now for some replies:
> There is a system that art exists that does exactly what you've opposed.
>
[assuming you meant 'proposed' here] Unlikely. If a system with that
level of usability existed, it would already be in widespread use.
> While it was technically successful, it has failed in that nobody but
> the originator uses it in even he admits this model has some serious
> shortcomings.
>
What system, where? What was the model and what were the shortcomings?
> The reason I insist on feedback is very simple. A good speech
> recognition environment lets you lets you correct recognition errors and
> create application-specific and application neutral commands.
Yes, we agree that you need correction. The application-specific
features can be implemented in this model too, in the same way that Orca
uses scripting.
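Purely as an illustration (nothing below is the real Orca API, and the
application names and commands are made up), the application-specific
part could be as simple as a table of command sets keyed on whichever
application has focus:

# Illustrative only: a made-up registry of per-application commands,
# loosely in the spirit of Orca's per-application scripts.
APP_COMMANDS = {
    "gnumeric":  {"sum column": "=SUM(", "new row": "\n"},
    "evolution": {"send message": "<Ctrl>Return"},
}

def commands_for(active_app):
    # Return the command set for the application that has focus.
    return APP_COMMANDS.get(active_app.lower(), {})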
> modern systems train incrementally. This improves the user experience
> because you don't have to put up with continual misrecognition's.
>
You would still have to correct the mistakes at some point. I would
prefer to just keep dictating and come back to correct them all at the
end. One should read through before sending in any case ;)
Correction and re-training do not have to be the same thing, though
that's the way NS does it now.
> Apparently they also train incrementally on what's not corrected which
> means batch correction is not a good thing.
And I think that is a serious design flaw for two (related) reasons: it
gradually corrupts your voice files AND it makes the user constantly
worry about whether that is happening. You have to make sure to speak as
correctly as possible at all times and always stop immediately to
correct every mistake. Otherwise your profile will be hosed. I repeat:
that is a bug, not a feature. You end up adapting more to the machine
than the machine adapts to you. *That is a bug.*
> I have no problem leaving the entire user interface for correction etc.
> in Windows. The only trouble is how do you make it visible if you're
> running a virtual machine full-screen? Don't run the virtual machine
> full-screen?
Sure, in a separate correction session. Personally, I would have two
physical machines for this task, with the text going across the network.
In the correction session I would just flip the KVM switch to the
Windows box (or however you choose to organise it).
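The "text going across the network" part needs nothing fancy. A minimal
sketch, assuming an arbitrary port and leaving out how the received text
gets into your editor:

# Toy sketch: push recognised text from the dictation box to the Linux
# desktop over a plain TCP socket.
import socket

def receive_text(port=5007):
    # Run on the Linux box: accept one connection, print what arrives.
    with socket.socket() as srv:
        srv.bind(("", port))
        srv.listen(1)
        conn, _addr = srv.accept()
        with conn:
            while True:
                chunk = conn.recv(4096)
                if not chunk:
                    break
                print(chunk.decode("utf-8"), end="", flush=True)

def send_text(host, text, port=5007):
    # Run on the dictation box: send a chunk of recognised text.
    with socket.socket() as s:
        s.connect((host, port))
        s.sendall(text.encode("utf-8"))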
> no, the grammar I gave you was a custom grammar. It didn't need a
> preamble of "macro". It demonstrates how you can create a more natural
> speech user interface.
I think this is an NS bug too. I don't want natural editing; I only want
natural dictation. I want two completely separate modes: pure dictation
and pure editing. If I say 'cut that' I want the words 'cut that' to be
typed. To edit I want to say: 'Hal: cut that bit'. Why? Because that
would improve overall recognition and would remove the worry that you
might delete a paragraph by mistake. NS would only trigger its special
functions on a single word, and otherwise just do its best to
transcribe. You would of course select that word to be one that it would
never get wrong. (You could argue that natural editing is a feature, but
the fact that you cannot easily configure it to use the modes I
described is a design flaw.)
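The mode split itself is trivial to express. A toy sketch, with the
trigger word and the command names all invented for the example:

# Only utterances prefixed with the trigger word are treated as commands;
# everything else, including phrases like 'cut that', is typed literally.
TRIGGER = "hal"

COMMANDS = {  # invented command names, for illustration only
    "cut that bit": "delete-selection",
    "new paragraph": "insert-paragraph",
}

def route(utterance):
    # Return ('dictate', text) or ('command', action) for one utterance.
    lowered = utterance.strip().lower()
    if lowered.startswith(TRIGGER + ":") or lowered.startswith(TRIGGER + " "):
        phrase = lowered[len(TRIGGER):].lstrip(": ").strip()
        return "command", COMMANDS.get(phrase, "unknown")
    return "dictate", utterance  # 'cut that' is just transcribed

assert route("cut that") == ("dictate", "cut that")
assert route("Hal: cut that bit") == ("command", "delete-selection")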
> but also consider this. Ever wonder why the acceptance rate for speech
> recognition is only one user in five? Granted I only have a small
> sample but all of the doctors I've talked to about speech recognition
> tell me stories of purchasing a very expensive package only to drop it
> in a few months and go back to human transcription. Obviously
> recognition accuracy is a part of the problem but the other half is
> usability.
Precisely. It's because they don't want to fiddle with the program; they
just want to dictate.
Henrik