Voice Recognition for Linux

Eric S. Johansson esj at harvee.org
Fri Feb 23 14:55:40 UTC 2007


I was constructing my response and was almost finished when it hit me 
what's wrong with the proposed model: it is the equivalent of raw 
NaturalText.  Fully functional NaturalText sucks a little bit.  Broken 
NaturalText, where you cannot correct consistently, is horrible and 
ruins voice models.  What you're proposing has even less functionality 
than broken NaturalText.

Henrik Nilsen Omma wrote:
> Eric S. Johansson wrote:
>> This is one half of the solution needed.  Not only do you need to 
>> propagate text to Linux but you need to provide enough context back to 
>> Windows so that NaturallySpeaking can select different grammars.  It 
>> would be nice to also modify the text injected into Linux because 
>> Nuance really screwed the pooch on NaturalText.
> 
> My point is that this is actually all you need. It has the advantage of 
> being quite simple from a coding point of view: you need the transmitter 
> on the windows system that NS feeds into (a version already exists) and 
> you need a gnome or KDE app on the other end with solid usability and 
> configurability. The latter is something that our open source community 
> is quite good at making and there are good tools like Mono, python, Qt4 
> that can be used.

There is a system that already exists that does exactly what you've 
proposed.  While it was technically successful, it has failed in that 
nobody but the originator uses it, and even he admits this model has 
some serious shortcomings.
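
For concreteness, here is roughly what the Linux end of that raw-text 
relay boils down to.  This is only a sketch, not the existing system: 
the port number, the newline-per-utterance framing, and inject_text() 
are all made up for illustration, and a real receiver would synthesize 
keystrokes (XTest or similar) instead of printing.

# Sketch of the Linux-side receiver in the "raw text relay" model.
# Everything here (port, wire format, inject_text) is hypothetical.
import socket

HOST, PORT = "0.0.0.0", 8747   # made-up port for the NS transmitter

def inject_text(text):
    # Placeholder: real code would synthesize keystrokes so the text
    # lands in the currently focused window.
    print("would type:", text)

def serve():
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((HOST, PORT))
    srv.listen(1)
    while True:
        conn, _ = srv.accept()
        buf = b""
        while True:
            chunk = conn.recv(4096)
            if not chunk:
                break
            buf += chunk
            # assume one utterance per newline-terminated line
            while b"\n" in buf:
                line, buf = buf.split(b"\n", 1)
                inject_text(line.decode("utf-8", "replace"))
        conn.close()

if __name__ == "__main__":
    serve()

Notice that it is strictly one-way.  There is no channel for telling NS 
which application has focus, no way to read text back out for 
Select-and-Say, and no way to drive correction from the Linux side.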

The reason I insist on feedback is very simple.  A good speech 
recognition environment lets you correct recognition errors and create 
application-specific and application-neutral commands.

> One point I've obviously glossed over is training. You'll need to do 
> some training to improve the recognition rate. Under my proposed scheme 
> you would need to do the training natively under windows. I'm quite 
> happy to do that actually. I would rather not worry about training 
> during my daily work with using the system, but would collect the 
> mistakes over a week or so and spend an hour or two doing just training. 
> With the system I'm proposing you could make the Linux client recognise 
> a 'must-train-this-later' command, which would cause it to save the past 
> few lines to a log file.

Modern systems train incrementally.  This improves the user experience 
because you don't have to put up with continual misrecognitions. 
Apparently they also train incrementally on what's not corrected, which 
means batch correction is not a good thing.  Another example is what's 
happening with me right now.  There are a bunch of small words and 
misrecognized endings that are cropping up with increasing frequency. 
If Nuance hadn't screwed up NaturalText and it were still in a working 
state, I would be able to correct them as I dictate into this 
Thunderbird window.  But no, it's so broken that I make corrections by 
hand, and as a result the misrecognitions get cast in stone and I need 
to scrap the user profile and start retraining from scratch about every 
six months.  Do not subject users to this kind of frustration and time 
waste.  They will drop the system in a heartbeat if you do.

I have no problem leaving the entire user interface for correction, 
etc., in Windows.  The only trouble is: how do you make it visible if 
you're running a virtual machine full-screen?  Don't run the virtual 
machine full-screen?

>> This is a difficult task.  There is a very nice package called 
>> VoiceCode, spearheaded by Alain Desilets at NRC-IIT in conjunction 
>> with David Fox.
> Do you have a link to this work? I'd be interested to see.

http://voicecode.iit.nrc.ca/VoiceCode/public/ywiki.cgi

Something else you might want to see is a full Select-and-Say interface 
to Emacs:

http://emacs-vr-mode.sourceforge.net/

These two things should keep you out of trouble for a while.  :-)


> We don't need any of that. We just accept a text stream from NS, running 
> in pure dictation mode, and create our events based on that. All we are 
> after is the excellent recognition engine. The GUI we leave behind.
> 
...
> You don't. You set this all up first on the native system along with the 
> initial setup. If you notice that it's not working as it should you open 
> the VMware window or the VNC session where NS is running and make a few 
> adjustments to it directly.

The graphical user interface is an integral part of the dictation 
process.  For example, I pay attention to the little floating box which 
shows partial recognition states.  It gives me an early warning on how 
I am speaking and the quality of the recognition.  It also gives me the 
ability to terminate a recognition sequence if NaturallySpeaking loses 
its mind.  The little recognition box floats inside the window of the 
active application, so if it's not in the window and I'm not getting 
any text injected, I know it's time to reset/restart NaturallySpeaking.

If you look at a system running NaturallySpeaking over VNC, the 
dictation box is usually not visible.  If it is visible, it usually 
does not show information dynamically, because it updates far faster 
than VNC can cope with.

>> I think we'd be better off finding some way of overlaying the user 
>> interface from NaturallySpeaking on top of a Linux virtual machine 
>> screen.  It sucks, but you might get it done faster than your very 
>> desirable but overly optimistic wish.
> 
> So I disagree that this is easier or faster. It sounds very messy. You 
> would need to capture and transmit bits of the screen or something. A 
> lot of work to copy an already poor user interface.

The only part of the user interface that is really hideous is the 
training dialogue.  David has shown how to replace that with something 
more useful.  The user interface elements that are quite useful are the 
audio level indicator, the partial recognition information, and the 
ability to terminate the recognition sequence.

I've attached a very small (<6k) image showing the final recognition 
state of an utterance.  Normally there is a red dot in the upper 
left-hand corner.  Click on that red dot and the recognition sequence 
terminates.  The microphone icon in the taskbar turns red, indicating 
it's in the off state.  The little bar in the lower left-hand corner is 
the audio intensity meter.  It's yellow now, indicating that no one is 
speaking.  When it's green, I'm speaking at the right level, and when 
it's red, I'm talking too loud.  The text in the middle of the box 
changes as the recognition engine changes its evaluation.  Like I said, 
it's damned useful feedback that helps me modify how I speak and 
interact with the engine in real time.


>> In any event, take a look at the VoiceCode interface for making 
>> corrections.  I really like it.  It's the best correction interface 
>> I've seen so far.  David Fox is responsible for that wonderful creation.
> Sounds interesting. URL?

See the VoiceCode URL above.  It's in the user manual.

>> ...except you only have to say "delete line" and not "Macro delete line".
> If those are phrases that are active in NS's dictation mode then I'm 
> proposing to generally stay away from them and use your own custom 
> commands. Of course if you get them working reliably, then you can use 
> them and have the transmitter be clever enough to realise that a line 
> has just been deleted, etc.

No, the grammar I gave you was a custom grammar.  It didn't need a 
preamble of "macro".  It demonstrates how you can create a more natural 
speech user interface.  You can also overlay NaturallySpeaking commands 
with your own actions so that you can say "cut that" and "paste that" 
in applications that don't use ^C/^V for cutting and pasting.  As you 
know, this is desirable because it reduces the number of distinct 
commands a user must remember and eliminates the need for the user to 
be context-smart.  Computers are much better at being context-smart 
than we are.
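
To make that concrete, here is roughly what such a grammar looks like 
when written as a NatLink macro file in Python.  Treat it as a sketch 
rather than the grammar I actually sent you: the rule names and the 
playString() key sequences (Emacs-style C-w/C-y in this case) are 
illustrative and would be adjusted per application.

# Sketch of a NatLink macro file (dropped into the MacroSystem folder).
# Rule wording and keystrokes are illustrative only.
import natlinkutils

class CommandGrammar(natlinkutils.GrammarBase):
    gramSpec = """
        <deleteLine> exported = delete line;
        <cutThat>    exported = cut that;
        <pasteThat>  exported = paste that;
    """

    def initialize(self):
        self.load(self.gramSpec)
        self.activateAll()          # no "macro" preamble required

    def gotResults_deleteLine(self, words, fullResults):
        natlinkutils.playString('{home}{shift+down}{del}')

    def gotResults_cutThat(self, words, fullResults):
        # overlay "cut that" for an application that doesn't use ^X
        natlinkutils.playString('{ctrl+w}')

    def gotResults_pasteThat(self, words, fullResults):
        natlinkutils.playString('{ctrl+y}')

thisGrammar = CommandGrammar()
thisGrammar.initialize()

def unload():
    global thisGrammar
    if thisGrammar:
        thisGrammar.unload()
        thisGrammar = None

The point is that the spoken command stays the same everywhere and the 
grammar supplies the per-application keystrokes, which is exactly the 
context-smarts we want the computer to carry instead of the user.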

>> We have negotiated for rights to a speech recognition engine.  I don't
>> know if it's better than the Sphinx group but it is open source, and 
>> the developer is still interested in seeing it have a life.
> You have negotiated the rights to an open source speech engine? In what 
> sense? A transfer of copyright?

I'll get the details.  But if memory serves, we got the rights assigned 
back to the original developer and he is going to license it under some 
form of the GPL.  Then, if you add a whole bunch of work, you might 
have something useful.  I'd estimate the development time to be roughly 
3 to 4 years if you had five fully funded, full-time developers.  Which 
means it will probably take longer, given that I'm an optimist when it 
comes to schedules.  :-)

> But as with all these things it's important to make a start. A journey 
> of a thousand miles starts with the ground under you, and all that :)

I agree.  Let me finish up my project specification and get it signed 
off by the board of directors at OSSRI, and from there we can start 
soliciting contributions, etc.

But also consider this: ever wonder why the acceptance rate for speech 
recognition is only one user in five?  Granted, I only have a small 
sample, but all of the doctors I've talked to about speech recognition 
tell me stories of purchasing a very expensive package only to drop it 
in a few months and go back to human transcription.  Obviously 
recognition accuracy is part of the problem, but the other half is 
usability.  Can a transcriptionist detect errors and correct them 
without seriously interrupting their workflow?  Can they eliminate 
persistent errors quickly and effectively?  These are just a couple of 
the higher-level issues that will hit us as we go forward.

-- 
Speech-recognition in use.  It makes mistakes, I correct some.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: dictation_dialog.JPG
Type: image/jpeg
Size: 5408 bytes
Desc: not available
URL: <https://lists.ubuntu.com/archives/ubuntu-accessibility/attachments/20070223/6b2e09d6/attachment.jpe>

