Voice Recognition for Linux

Henrik Nilsen Omma henrik at ubuntu.com
Fri Feb 23 12:08:23 UTC 2007


Eric S. Johansson wrote:
> this is one half of the solution needed.  Not only do you need to 
> propagate text to Linux but you need to provide enough context back to 
> windows so that NaturallySpeaking can select different grammars.  It 
> would be nice to also modify the text injected into Linux because 
> Nuance really screwed the pooch on natural text.

My point is that this is actually all you need. It has the advantage of 
being quite simple from a coding point of view: you need the transmitter 
on the Windows system that NS feeds into (a version already exists) and 
you need a GNOME or KDE app on the other end with solid usability and 
configurability. The latter is something that our open source community 
is quite good at making, and there are good tools like Mono, Python and 
Qt4 that can be used.
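To make the shape of that Linux-side app concrete, here is a minimal 
sketch. Everything in it is my own assumption (the port number, the 
one-utterance-per-line protocol, the use of xdotool), not part of any 
existing transmitter: a small Python script that listens on a socket 
for text from the Windows side and types it into the focused window.

    # receiver.py -- hypothetical sketch of the Linux-side client.
    # Assumes the Windows transmitter sends one utterance per line over
    # TCP, and that xdotool is installed for injecting keystrokes.
    import socket
    import subprocess

    HOST, PORT = "0.0.0.0", 8765    # made-up port; would be configurable

    def type_text(text):
        # Inject the recognised text into the currently focused window.
        subprocess.run(["xdotool", "type", text])

    def main():
        srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        srv.bind((HOST, PORT))
        srv.listen(1)
        while True:
            conn, _ = srv.accept()
            with conn, conn.makefile("r", encoding="utf-8") as lines:
                for line in lines:
                    type_text(line.rstrip("\n"))

    if __name__ == "__main__":
        main()

A real client would obviously want configuration, panel feedback and so 
on, but the core really is that small.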

I'm making the point that you don't need to feed anything back to NS. 
One point I've obviously glossed over is training. You'll need to do 
some training to improve the recognition rate. Under my proposed scheme 
you would need to do the training natively under Windows. I'm quite 
happy to do that, actually. I would rather not worry about training 
during my daily work with the system, but would collect the mistakes 
over a week or so and spend an hour or two doing just training. 
With the system I'm proposing you could make the Linux client recognise 
a 'must-train-this-later' command, which would cause it to save the past 
few lines to a log file.
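As a rough sketch of how that could look (the trigger phrase, file name 
and buffer size below are all invented for the example), the client 
would just keep the last few utterances in memory and append them to a 
log when it hears the command:

    # Hypothetical 'must-train-this-later' handling on the Linux side:
    # remember the last few recognised lines and append them to a log
    # file to work through in a later training session under Windows.
    from collections import deque

    RECENT = deque(maxlen=5)            # last five utterances
    TRIGGER = "must train this later"   # invented spoken phrase
    LOGFILE = "training-todo.log"       # invented file name

    def handle_utterance(text):
        if text.strip().lower() == TRIGGER:
            with open(LOGFILE, "a", encoding="utf-8") as log:
                log.write("\n".join(RECENT) + "\n---\n")
        else:
            RECENT.append(text)
            # ...and hand the text on for normal injection, e.g. the
            # type_text() helper from the sketch above.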

> this is a difficult task.  There is a very nice package called voice 
> coder spearheaded by Alain Desilets up at nrc-it in conjunction with 
> David Fox.
Do you have a link to this work? I'd be interested to see it.

> They haven't gotten a whole lot of additional contributions.  People 
> with upper extremity disorders tend not to volunteer a whole lot 
> because quite frankly life is bouncing physical pain against what 
> needs to be done.

Which is one reason why I'm suggesting we aim for a generally useful 
tool, not something targeted at disabled people. That way you'll get a 
much wider range of contributions. NS is itself like that (although 
commercial).

> damn, you are the optimist.  
Of course. How else do you get stuff done in this world? :)

> Yes, user interface does need to be better but it may not be possible 
> because the recognition engine or systems around it may not expose the 
> interfaces necessary to make it better. 
We don't need any of that. We just accept a text stream from NS, running 
in pure dictation mode, and create our events based on that. All we are 
after is the excellent recognition engine. The GUI we leave behind.
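As an illustration of what "create our events" might mean in practice 
(all of the phrases and key names below are invented for the example), 
the client could check each utterance against a small table of command 
phrases and treat everything else as plain dictation:

    # Hypothetical dispatcher: known command phrases become key events,
    # anything else is treated as dictation and typed out verbatim.
    import subprocess

    COMMANDS = {                      # spoken phrase -> X key sequence
        "new paragraph": "Return Return",
        "press tab": "Tab",
        "scratch that": "ctrl+z",
    }

    def dispatch(utterance):
        keys = COMMANDS.get(utterance.strip().lower())
        if keys:
            subprocess.run(["xdotool", "key"] + keys.split())
        else:
            subprocess.run(["xdotool", "type", utterance])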

> For example, where do you get the information from to give the user 
> clear feedback that the system is hearing something and it's at the 
> right level?  
You don't. You set all of that up on the native system as part of the 
initial setup. If you notice that it's not working as it should, you open 
the VMware window or the VNC session where NS is running and make a few 
adjustments to it directly.

> Also, the whole process of adding or deleting words from your 
> dictionary, training, or testing your audio input to make sure it 
> works right?  
Again, you don't. You do all those things in separate training sessions 
on the native system.

> I'm not saying it's impossible.  I'm just saying be prepared to work 
> very very hard.
That's just what I'm trying to avoid with this keep-it-simple approach :)

> I think we'd be better off finding some way of overlaying the user 
> interface from NaturallySpeaking on top of a Linux virtual machine 
> screen.  Sucks, but you might get it done faster than with your very 
> desirable but overly optimistic wish.

So I disagree that this would be easier or faster. It sounds very messy: 
you would need to capture and transmit bits of the screen, or something 
along those lines. That's a lot of work just to copy an already poor 
user interface.

> In any event, take a look at the voice coder UI for making 
> corrections.  I really like it.  It's the best correction interface I've 
> seen so far.  David Fox is responsible for that wonderful creation.
Sounds interesting. URL?

> you mean something like this...
I mean you should be able to define whatever commands you want. Both the 
spoken version and the resulting output.
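For instance (a sketch only; the file name and format here are entirely 
made up), the Linux client could read those definitions from a plain 
text file in the user's home directory and build its command table from 
it:

    # Hypothetical user-defined commands in a file such as ~/.voicecmds:
    #
    #   delete line   = key Home shift+End BackSpace
    #   open terminal = run gnome-terminal
    #
    # Each line maps a spoken phrase to an action string that the client
    # interprets (e.g. send keys, run a program).
    def load_commands(path):
        table = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith("#") or "=" not in line:
                    continue
                phrase, action = line.split("=", 1)
                table[phrase.strip().lower()] = action.strip()
        return table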

> ...except you only have to say "delete line" and not "Macro delete line".
If those are phrases that are active in NS's dictation mode, then I'm 
proposing to generally stay away from them and use your own custom 
commands. Of course, if you get them working reliably, you can use them 
and have the transmitter be clever enough to realise that a line has 
just been deleted, etc.

Remember, you don't _have_ to do anything in particular. It's a 
computer; it should do what you tell it to do :) Well, at least in the 
open source world.

> seriously, you need to live with speech recognition before you know 
> what's the right thing to say.  
Actually, with a more flexible system you should be able to decide what 
you want to say. Incidentally, I've tested various systems on and off for 
the past 12 years. Only about 2-3 years ago did I start to find that NS 
was producing acceptable results, but then I ended up switching to Linux :)

> We have negotiated for rights to a speech recognition engine.  I don't 
> know if it's better than the Sphinx group but it is open source, and 
> the developer is still interested in seeing it have a life.
You have negotiated the rights to an open source speech engine? In what 
sense? A transfer of copyright?

>
>> Perhaps we can start this off as a Google Summer of Code project.
> perhaps but I think it's going to be much bigger than what summer of 
> code can do so we will need to dig up some alternative funding sources 
> so people can get paid.
But as with all these things it's important to make a start. A journey 
of a thousand miles starts with the ground under you, and all that :)

Henrik



