Voice Recognition for Linux
Henrik Nilsen Omma
henrik at ubuntu.com
Fri Feb 23 17:49:10 UTC 2007
Eric S. Johansson wrote:
> Eric S. Johansson wrote:
>> as I was constructing my response, and was almost finished when it
>> hit me about what's wrong with the model proposed. it is the
>> equivalent of raw natural text. Full function natural text sucks a
>> little bit. The broken, unable to correct consistently, natural text
>> is horrible and ruins voice models. What you're proposing has even
>> less functionality than a broken natural text.
>
> I'm sorry, that was too harsh. I was interrupted by one too many
> things while I was writing that bit and I forgot to go back and edit
> it. Again, I apologize for being careless. today might be a day to
> stay away from the keyboard unless I'm writing code. :-)
Hi Eric,
Looks like the original text got caught in a spam filter somewhere
because of the attachment (I found it in the web archives). No worries
about the tone. We are having a frank technical discussion and need to
speak directly to get our points across. So my turn :) ...
I think you are too caught up in the current working model of NS to see
how things can be done differently.
I have not studied the details of voice recognition and voice models,
but I do appreciate the need for custom voice model training over time.
There is a need for feedback, but it does _not_ need to be real-time.
Personally, I would prefer it not to be real-time. NS does, in theory,
tout this as a feature when they claim that you can record speech on a
voice recorder and dump it into NS for transcription. I have no idea
whether that actually works.
I don't really want to interact with the voice engine all the time; I
want it to mostly stay out of my way. I don't want to look at the little
voice level bar when I'm speaking or read the early guesses of the voice
engine. I want to look out the window or look at the spreadsheet that
I'm writing an email about :) The fact that NS updates the voice model
incrementally is actually a bad feature. I don't want that. If I have a
cold one day, or there is noise outside, or the mic is a bit displaced,
the profile gets damaged. That's probably why you have to start a fresh one
every six months.
Instead of saving my voice profile every day, I would like to save up a
log of all the mistakes that were made during the week. I would then sit
down for a session of training to help NS cope with those words and
phrases better. I would first take a backup of my voice profile, then
say a few sample sentences to make sure everything was generally working
OK. I would then read passages from the log and do the needed correction
and re-training. I would save the profile and start using the new one
for the next week. I would also save profiles going back four weeks, and
once a month I would do a brief test against the stored-up profiles to
see whether the current one had degraded over time. If it had, I would
roll back to an older
one and perhaps do some training from recent logs too. There is no
reason a voice profile should just automatically go bad over time.
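To make this concrete, here is a rough sketch of that weekly/monthly
routine in Python. Everything engine-specific (retraining from a log,
scoring a profile's accuracy) is an invented placeholder, not any real
NS or Linux engine API; only the backup, pruning and rollback plumbing
is meant literally.

#!/usr/bin/env python3
# Sketch only: retrain_from_log() and accuracy_score() are stand-ins for
# whatever the recognition engine actually provides.
import shutil
import time
from pathlib import Path

PROFILE = Path.home() / ".voice" / "profile.current"
ARCHIVE = Path.home() / ".voice" / "archive"
KEEP = 4  # keep profiles going back four weeks

def accuracy_score(profile):
    # Placeholder: dictate a few test sentences against this profile
    # and return the fraction recognised correctly.
    return 1.0

def retrain_from_log(profile, mistake_log):
    # Placeholder: replay the week's logged mistakes, correct them,
    # and retrain the profile from the corrections.
    pass

def backup_profile():
    # Copy the live profile into the archive and prune old copies.
    ARCHIVE.mkdir(parents=True, exist_ok=True)
    dest = ARCHIVE / time.strftime("profile-%Y-%m-%d")
    shutil.copytree(PROFILE, dest, dirs_exist_ok=True)
    for old in sorted(ARCHIVE.iterdir())[:-KEEP]:
        shutil.rmtree(old)

def weekly_session(mistake_log):
    # Back up, sanity-check, then retrain from the week's log.
    backup_profile()
    if accuracy_score(PROFILE) < 0.9:
        raise RuntimeError("profile already looks bad; roll back first")
    retrain_from_log(PROFILE, mistake_log)

def monthly_check():
    # If an archived profile outperforms the live one, roll back to it.
    best = max(ARCHIVE.iterdir(), key=accuracy_score)
    if accuracy_score(best) > accuracy_score(PROFILE):
        shutil.rmtree(PROFILE)
        shutil.copytree(best, PROFILE)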
The fact that you have to constantly interact with the voice engine is
not a feature, it's a bug! It's just that you have adapted your
dictation to work around it. It's not at all clear that interactive
correction is better than batched correction. It certainly should not be
seen as a blocker for a project like this going forward. I wouldn't want
to spend years on a project simply to replicate NS on Linux. There is
plenty of room for improvement in the current system.
OK, now for some replies:
> There is a system that art exists that does exactly what you've opposed.
>
[assuming you meant 'proposed' here] Unlikely. If a system with that
level of usability existed, it would already be in widespread use.
> While it was technically successful, it has failed in that nobody but
> the originator uses it in even he admits this model has some serious
> shortcomings.
>
What system, where? What was the model and what were the shortcomings?
> The reason I insist on feedback is very simple. A good speech
> recognition environment lets you lets you correct recognition errors and
> create application-specific and application neutral commands.
Yes, we agree that you need correction. The application-specific
features can be implemented in this model too, in the same way that Orca
uses scripting.
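Purely as an illustration (nothing below is the real Orca API, and the
application names and commands are made up), the application-specific
part could be as simple as a table of command sets keyed on whichever
application has focus:

# Illustrative only: a made-up registry of per-application commands,
# loosely in the spirit of Orca's per-application scripts.
APP_COMMANDS = {
    "gnumeric":  {"sum column": "=SUM(", "new row": "\n"},
    "evolution": {"send message": "<Ctrl>Return"},
}

def commands_for(active_app):
    # Return the command set for the application that has focus.
    return APP_COMMANDS.get(active_app.lower(), {})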
> modern systems train incrementally. This improves the user experience
> because you don't have to put up with continual misrecognition's.
>
You would still have to correct the mistakes at some point. I would
prefer to just keep dictating and come back to correct them all at the
end. One should read through before sending in any case ;)
Correction and re-training do not have to be the same thing, though
that's the way NS does it now.
> Apparently they also train incrementally on what's not corrected which
> means batch correction is not a good thing.
And I think that is a serious design flaw for two (related) reasons: it
gradually corrupts your voice files AND it makes the user constantly
worry about whether that is happening. You have to make sure to speak as
correctly as possible at all times and always stop immediately to
correct every mistake. Otherwise your profile will be hosed. I repeat:
that is a bug, not a feature. You end up adapting more to the machine
than the machine adapts to you. *That is a bug.*
> I have no problem leaving the entire user interface for correction etc.
> in Windows. The only trouble is how do you make it visible if you're
> running a virtual machine full-screen? Don't run the virtual machine
> full-screen?
Sure, in a separate correction session. Personally, I would have two
physical machines for this task, with the text going across the network.
In the correction session I would just flip the KVM switch to the
Windows box (or however you choose to organise it).
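The "text going across the network" part needs nothing fancy. A minimal
sketch, assuming an arbitrary port and leaving out how the received text
gets into your editor:

# Toy sketch: push recognised text from the dictation box to the Linux
# desktop over a plain TCP socket.
import socket

def receive_text(port=5007):
    # Run on the Linux box: accept one connection, print what arrives.
    with socket.socket() as srv:
        srv.bind(("", port))
        srv.listen(1)
        conn, _addr = srv.accept()
        with conn:
            while True:
                chunk = conn.recv(4096)
                if not chunk:
                    break
                print(chunk.decode("utf-8"), end="", flush=True)

def send_text(host, text, port=5007):
    # Run on the dictation box: send a chunk of recognised text.
    with socket.socket() as s:
        s.connect((host, port))
        s.sendall(text.encode("utf-8"))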
> no, the grammar I gave you was a custom grammar. It didn't need a
> preamble of "macro". It demonstrates how you can create a more natural
> speech user interface.
I think this is an NS bug too. I don't want natural editing; I only want
natural dictation. I want two completely separate modes: pure dictation
and pure editing. If I say 'cut that' I want the words 'cut that' to be
typed. To edit I want to say: 'Hal: cut that bit'. Why? Because that
would improve overall recognition and would remove the worry that you
might delete a paragraph by mistake. NS would only trigger its special
functions on a single word, and otherwise just do its best to
transcribe. You would of course select that word to be one that it would
never get wrong. (You could argue that natural editing is a feature, but
the fact that you cannot easily configure it to use the modes I
described is a design flaw.)
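The mode split itself is trivial to express. A toy sketch, with the
trigger word and the command names all invented for the example:

# Only utterances prefixed with the trigger word are treated as commands;
# everything else, including phrases like 'cut that', is typed literally.
TRIGGER = "hal"

COMMANDS = {  # invented command names, for illustration only
    "cut that bit": "delete-selection",
    "new paragraph": "insert-paragraph",
}

def route(utterance):
    # Return ('dictate', text) or ('command', action) for one utterance.
    lowered = utterance.strip().lower()
    if lowered.startswith(TRIGGER + ":") or lowered.startswith(TRIGGER + " "):
        phrase = lowered[len(TRIGGER):].lstrip(": ").strip()
        return "command", COMMANDS.get(phrase, "unknown")
    return "dictate", utterance  # 'cut that' is just transcribed

assert route("cut that") == ("dictate", "cut that")
assert route("Hal: cut that bit") == ("command", "delete-selection")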
> but also consider this. Ever wonder why the acceptance rate for speech
> recognition is only one user in five? Granted I only have a small
> sample but all of the doctors I've talked to about speech recognition
> tell me stories of purchasing a very expensive package only to drop it
> in a few months and go back to human transcription. Obviously
> recognition accuracy is a part of the problem but the other half is
> usability.
Precisely. It's because they don't want to fiddle with the program; they
just want to dictate.
Henrik