Voice Recognition for Linux
Henrik Nilsen Omma
henrik at ubuntu.com
Fri Feb 23 12:08:23 UTC 2007
Eric S. Johansson wrote:
> this is one half of the solution needed. Not only do you need to
> propagate text to Linux but you need to provide enough context back to
> Windows so that NaturallySpeaking can select different grammars. It
> would be nice to also modify the text injected into Linux because
> Nuance really screwed the pooch on natural text.
My point is that this is actually all you need. It has the advantage of
being quite simple from a coding point of view: you need the transmitter
on the Windows system that NS feeds into (a version already exists) and
you need a GNOME or KDE app on the other end with solid usability and
configurability. The latter is something our open source community is
quite good at building, and there are good tools like Mono, Python and
Qt4 that can be used.
I'm making the point that you don't need to feed anything back to NS.
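To make that concrete, here is roughly the shape of the Linux-side
receiver I have in mind (only a sketch: the port number, the
one-utterance-per-line protocol and the use of xdotool to inject
keystrokes are all assumptions on my part, not anything that exists today):

    #!/usr/bin/env python
    # Sketch of the Linux-side receiver: the Windows transmitter sends each
    # recognised utterance as one UTF-8 line over TCP, and we type it into
    # the currently focused window with xdotool.
    import socket
    import subprocess

    HOST, PORT = "0.0.0.0", 5007        # port number is an arbitrary choice

    def inject(text):
        # Type the text into whichever window has focus.
        subprocess.call(["xdotool", "type", "--delay", "0", text])

    def serve():
        server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        server.bind((HOST, PORT))
        server.listen(1)
        while True:
            conn, _ = server.accept()
            buf = b""
            while True:
                data = conn.recv(4096)
                if not data:
                    break
                buf += data
                while b"\n" in buf:
                    line, buf = buf.split(b"\n", 1)
                    inject(line.decode("utf-8"))
            conn.close()

    if __name__ == "__main__":
        serve()

A real version would of course want some authentication and a proper
GNOME/KDE front end for configuration, but the core really is that small.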
One point I've obviously glossed over is training. You'll need to do
some training to improve the recognition rate. Under my proposed scheme
you would need to do the training natively under Windows. I'm quite
happy to do that actually. I would rather not worry about training
during my daily work with using the system, but would collect the
mistakes over a week or so and spend an hour or two doing just training.
With the system I'm proposing you could make the Linux client recognise
a 'must-train-this-later' command, which would cause it to save the past
few lines to a log file.
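Something along these lines, say (the command phrase, buffer size and
log file name below are all just placeholders):

    # Sketch of the 'must-train-this-later' idea: keep the last few dictated
    # lines in memory and append them to a log when the command phrase is heard.
    from collections import deque
    from datetime import datetime

    RECENT = deque(maxlen=5)            # how much context to keep is arbitrary
    TRAIN_LOG = "train-later.log"       # placeholder file name
    COMMAND = "must train this later"   # placeholder spoken phrase

    def handle_utterance(text):
        if text.strip().lower() == COMMAND:
            with open(TRAIN_LOG, "a") as log:
                log.write("--- %s ---\n" % datetime.now().isoformat())
                for line in RECENT:
                    log.write(line + "\n")
        else:
            RECENT.append(text)
            # ...normal handling: inject the text, match commands, and so on.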
> this is a difficult task. There is a very nice package called voice
> coder spearheaded by Alain Desilets up at nrc-it in conjunction with
> David Fox.
Do you have a link to this work? I'd be interested to see.
> They haven't gotten a whole lot of additional contributions. People
> with upper extremity disorders tend not to volunteer a whole lot
> because quite frankly life is bouncing physical pain against what
> needs to be done.
Which is one reason why I'm suggesting we aim for a generally useful
tool, not something targeted at disabled people. That way you'll get a
much wider range of contributions. NS is itself like that (although
commercial).
> damn, you are the optimist.
Of course. How else do you get stuff done in this world? :)
> Yes, user interface does need to be better but it may not be possible
> because the recognition engine or systems around it may not expose the
> interfaces necessary to make it better.
We don't need any of that. We just accept a text stream from NS, running
in pure dictation mode, and create our events based on that. All we are
after is the excellent recognition engine. The GUI we leave behind.
> For example, where do you get the information from to give the user
> clear feedback that the system is hearing something and it's at the
> right level?
You don't. You set all of this up on the native system as part of the
initial setup. If you notice that it's not working as it should, you open
the VMware window or the VNC session where NS is running and make a few
adjustments to it directly.
> Also, the whole process of adding or deleting words from your
> dictionary, training, or testing your audio input to make sure it
> works right?
Again, you don't. You do all those things in separate training sessions
on the native system.
> I'm not saying it's impossible. I'm just saying be prepared to work
> very very hard.
That's just what I'm trying to avoid with this keep-it-simple approach :)
> I think we'd be better off finding some way of overlaying the user
> interface from NaturallySpeaking on top of a Linux virtual machine
> screen. Sucks but you might get done faster than your very
> desirable but overly optimistic wish.
So I disagree that this is easier or faster. It sounds very messy. You
would need to capture and transmit bits of the screen or something. A
lot of work to copy an already poor user interface.
> In any event, take a look at the voice coder UI for making
> corrections. I really like it. It's the best correction interface I've
> seen so far. David Fox is responsible for that wonderful creation.
Sounds interesting. URL?
> you mean something like this...
I mean you should be able to define whatever commands you want. Both the
spoken version and the resulting output.
> ...except you only have to say "delete line" and not "Macro delete line".
If those are phrases that are active in NS's dictation mode then I'm
proposing to generally stay away from them and use your own custom
commands. Of course if you get them working reliably, then you can use
them and have the transmitter be clever enough to realise that a line
has just been deleted, etc.
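As a rough illustration of defining both the spoken phrase and the
result (the phrases and key names below are made up, and xdotool is
again just one possible way to deliver the keystrokes):

    # Sketch of user-defined commands: each spoken phrase maps either to
    # literal text or to a key sequence; anything unmatched is typed as-is.
    import subprocess

    COMMANDS = {
        "new paragraph": ("text", "\n\n"),
        "delete line":   ("keys", "Home shift+End BackSpace"),
        "save file":     ("keys", "ctrl+s"),
    }

    def dispatch(utterance):
        kind, value = COMMANDS.get(utterance.strip().lower(), ("text", None))
        if value is None:
            subprocess.call(["xdotool", "type", "--delay", "0", utterance])
        elif kind == "text":
            subprocess.call(["xdotool", "type", "--delay", "0", value])
        else:
            subprocess.call(["xdotool", "key"] + value.split())

The point is simply that the mapping lives in a file the user owns, so
both sides of it can be whatever you like.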
Remember, you don't _have_ to do anything in particular. It's a
computer; it should do what you tell it to do :) Well, at least in the
open source world.
> seriously, you need to live with speech recognition before you know
> what's the right thing to say.
Actually, with a more flexible system you should be able to decide what
you want to say. Incidentally I've tested various systems on and off for
the past 12 years. Only about 2-3 years ago did I start to find that NS
was producing acceptable results, but then I ended up switching to Linux :)
> We have negotiated for rights to a speech recognition engine. I don't
> know if it's better than the Sphinx group but it is open source, and
> the developer is still interested in seeing it have a life.
You have negotiated the rights to an open source speech engine? In what
sense? A transfer of copyright?
>
>> Perhaps we can start this off as a Google Summer of Code project.
> perhaps but I think it's going to be much bigger than what summer of
> code can do so we will need to dig up some alternative funding sources
> so people can get paid.
But as with all these things it's important to make a start. A journey
of a thousand miles starts with the ground under you, and all that :)
Henrik