Speech Synthesis & Speech Recognition Using SAPI 4 Low Level Interfaces
Brian Long (www.blong.com)
The files associated with this article are available for download.
Introduction
This article looks at adding support for speech capabilities to Microsoft Windows
applications written in Delphi, using the Microsoft Speech API version 4 (SAPI
4). For an overview of speech technology, see the speech capabilities overview
article. For information on using SAPI 5.1 in Delphi applications, see the
SAPI 5.1 article.
The older SAPI 4 interfaces are defined in two ways. There are high
level interfaces, intended to make implementation easier, but which sacrifice
some of the control. These are intended for quick results but can be quite effective.
There are also low level interfaces, which give full control but involve more
work to get going. These are intended for the serious programmer to work with.
The high level interfaces are implemented
by Microsoft in COM objects to call the lower level interfaces, taking care
of all the nitty-gritty. The low level interfaces themselves are implemented
by the TTS and SR engines that you obtain and install.
We will look at the low level interfaces available for TTS and SR in this article.
Coverage of the high level interfaces can be found in the SAPI 4 High Level
Interfaces article.
Grammars
Part of the process of speech recognition involves deciding what words have
actually been spoken. Recognisers use a grammar to decide what has been said,
where possible.
In the case of dictation, a grammar can be used to indicate some words that
are likely to be spoken. It is not feasible to try and represent the entire
spoken English language as a grammar, so the recogniser does its best and uses
the grammar to help out. The recogniser tries to use context information from
the text to work out which words are more likely than others. At its simplest,
the Microsoft SR engine can use a dictation grammar like this:
[Grammar]
LangID=2057
;2057 = $809 = UK English
type=dictation
With Command and Control, the permitted words are limited to the supported
commands. The grammar defines various rules that dictate what will be said and
this makes the recogniser's job much easier. Rather than trying to understand
anything spoken, it only needs to recognise speech that follows the supplied
rules. A Command and Control grammar is typically referred to as a Context-Free
Grammar (CFG). A simple CFG that recognises three colours might look like this:
[Grammar]
LangID=2057
;UK English - 2057 = $809
Type=cfg
[<Start>]
<Start> = colour red
<Start> = colour green
<Start> = colour blue
Start is the root point of the grammar.
Grammars support lists to make implementing many similar commands easy. For
example:
[Grammar]
LangID=2057
;UK English - 2057 = $809
Type=cfg
[<Start>]
<Start> = colour <Colour>
[<Colour>]
<Colour> = red
<Colour> = green
<Colour> = blue
You can find more details about the supported grammar syntax in the SAPI documentation.
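As a further illustration of why lists help, the following sketch lets two commands share one list. It uses only the constructs shown above (several alternatives for a rule, and a word followed by a list reference); the words "foreground" and "background" are invented for the example. Five short rule lines cover six spoken phrases, and adding a colour needs only one more line.

```
[Grammar]
LangID=2057
;UK English - 2057 = $809
Type=cfg
[<Start>]
<Start> = foreground <Colour>
<Start> = background <Colour>
[<Colour>]
<Colour> = red
<Colour> = green
<Colour> = blue
```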
Low Level Interfaces
The low level interfaces are implemented by the TTS and SR engines installed
on your machine. These interfaces are made available as true COM interfaces
and also through ActiveX controls. There are more details to worry about with
these low-level interfaces, and it is recommended that you have the SAPI 4 documentation
to hand to help follow what's going on.
COM
The low level COM APIs are described as the DirectTextToSpeech API and
the DirectSpeechRecognition API. These are implemented in speech.dll
in the Windows speech directory, described simply as Microsoft Speech
in the version information. The pertinent interfaces are all defined in the
speech.pas SAPI 4 import unit.
DirectTextToSpeech API
The code listed in this section comes from the sample
project DirectTextToSpeech.dpr in the COM directory. The program looks like
this when running:
[Screenshot: the DirectTextToSpeech sample program running]
The following sections describe the important COM objects you work with and
how the sample program uses them.
Audio Destination Object
The first thing you need to do is initialise an audio destination object that
will be used by the speech engine object. To make sure things are on the right
track the default Wave Mapper is selected as the device to work with.
uses
  Speech, MMSystem, ...
...
TfrmDirectTTSAPI = class(TForm)
  ...
  AMMD: IAudioMultiMediaDevice;
  ...
end;
...
procedure TfrmDirectTTSAPI.FormCreate(Sender: TObject);
begin
  SendMessage(lstProgress.Handle, LB_SETHORIZONTALEXTENT, Width, 0);
  Log('About to connect to multimedia device');
  AMMD := CreateComObject(CLSID_MMAudioDest) as IAudioMultiMediaDevice;
  OleCheck(AMMD.DeviceNumSet(WAVE_MAPPER));
  Log('Connected to multimedia device');
  ...
end;
Engine Enumerator Object
A TTS engine typically offers various modes (different types of voices) and
the enumerator can loop through them all. A TTS enumerator implements the ITTSEnum
interface to allow all modes of all available engines to be listed. You can
also use the ITTSFind interface
to locate a specific mode by specifying preferred attributes (such as gender
and age). Each mode is represented by a TTTSModeInfo
record structure.
This sample application lists all supported modes in a combobox and lets the
user select a mode to use. The combobox Items
property stores the textual mode names in its Strings
array and pointers to the corresponding mode records in the Objects
array. As different modes are selected a listbox is used to display the mode
attributes stored in the mode record. This way, the user can make an informed
decision about which voice mode to use.
type
  TfrmDirectTTSAPI = class(TForm)
    ...
    TTSEnum: ITTSEnum;
    ...
  end;
...
procedure TfrmDirectTTSAPI.FormCreate(Sender: TObject);
var
  ModeInfo: TTTSModeInfo;
  PModeInfo: PTTSModeInfo;
  NumFound: DWord;
begin
  ...
  Log('About to enumerate speech engines');
  TTSEnum := CreateComObject(CLSID_TTSEnumerator) as ITTSEnum;
  OleCheck(TTSEnum.Reset);
  OleCheck(TTSEnum.Next(1, ModeInfo, @NumFound));
  while NumFound > 0 do
  begin
    //Keep a heap copy of each mode record alongside its name
    New(PModeInfo);
    PModeInfo^ := ModeInfo;
    cbEngines.Items.AddObject(String(ModeInfo.szModeName), TObject(PModeInfo));
    OleCheck(TTSEnum.Next(1, ModeInfo, @NumFound));
  end;
  if cbEngines.Items.Count > 0 then
  begin
    cbEngines.ItemIndex := 0; //Select 1st engine
    cbEngines.OnChange(cbEngines); //& ensure OnChange triggers
  end;
  Log('Enumerated speech engines');
end;

procedure TfrmDirectTTSAPI.FormDestroy(Sender: TObject);
var
  I: Integer;
begin
  chkStartStopEngine.Checked := False;
  //Clean up TTTSModeInfo memory for each speech engine
  for I := 0 to cbEngines.Items.Count - 1 do
    Dispose(PTTSModeInfo(cbEngines.Items.Objects[I]));
end;

procedure TfrmDirectTTSAPI.cbEnginesChange(Sender: TObject);
const
  Genders: array[GENDER_NEUTRAL..GENDER_MALE] of String =
    ('Neutral', 'Female', 'Male');
begin
  with lstEngineInfo.Items do
  begin
    Clear;
    PModeInfo := PTTSModeInfo(cbEngines.Items.Objects[cbEngines.ItemIndex]);
    Add(Format('Name:'#9'%s', [PModeInfo.szModeName]));
    Add(Format('Language:'#9'%s', [PModeInfo.Language.szDialect]));
    Add(Format('Manufacturer:'#9'%s', [PModeInfo.szMfgName]));
    Add(Format('Product:'#9'%s', [PModeInfo.szProductName]));
    Add(Format('Mode ID:'#9'%s', [GUIDToString(PModeInfo.gModeID)]));
    Add(Format('Engine ID:'#9'%s', [GUIDToString(PModeInfo.gEngineID)]));
    Add(Format('Speaker:'#9'%s', [PModeInfo.szSpeaker]));
    Add(Format('Style:'#9'%s', [PModeInfo.szStyle]));
    Add(Format('Gender:'#9'%s', [Genders[PModeInfo.wGender]]));
    case PModeInfo.wAge of
      TTSAGE_BABY      : Add('Age:'#9'Baby');
      TTSAGE_TODDLER   : Add('Age:'#9'Toddler');
      TTSAGE_CHILD     : Add('Age:'#9'Child');
      TTSAGE_ADOLESCENT: Add('Age:'#9'Adolescent');
      TTSAGE_ADULT     : Add('Age:'#9'Adult');
      TTSAGE_ELDERLY   : Add('Age:'#9'Elderly');
    end;
  end
end;
Once a voice mode has been chosen, it can be identified by its mode identifier
(the GUID stored in the gModeID
field of the TTTSModeInfo record).
This can be stored with your application data in order to remember which voice
was being used last. The Voice Text API also supports mode identifiers; the
IVTxtAttributes.TTSModeSet method
takes such a GUID.
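As a minimal sketch of remembering the chosen mode, the GUID can be written to an INI file as text and read back on the next run. The helper names, the section and key names and the sample GUID below are all invented for illustration, and the code assumes Delphi 6 or later, where GUIDToString and StringToGUID are declared in SysUtils (earlier versions are believed to declare them in ComObj).

```pascal
program ModeIDStore;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, IniFiles;

const
  //Hypothetical section and key names, not defined by SAPI
  SpeechSection = 'Speech';
  ModeKey = 'TTSModeID';

//Store the mode identifier as text in the given INI file
procedure SaveModeID(const IniName: string; const ModeID: TGUID);
var
  Ini: TIniFile;
begin
  Ini := TIniFile.Create(IniName);
  try
    Ini.WriteString(SpeechSection, ModeKey, GUIDToString(ModeID));
  finally
    Ini.Free;
  end;
end;

//Read the mode identifier back; returns False if none was stored
function LoadModeID(const IniName: string; out ModeID: TGUID): Boolean;
var
  Ini: TIniFile;
  S: string;
begin
  Ini := TIniFile.Create(IniName);
  try
    S := Ini.ReadString(SpeechSection, ModeKey, '');
  finally
    Ini.Free;
  end;
  Result := S <> '';
  if Result then
    ModeID := StringToGUID(S);
end;

var
  IniName: string;
  Saved, Loaded: TGUID;
begin
  //Use a full path so the file sits beside the executable
  IniName := ChangeFileExt(ParamStr(0), '.ini');
  //A made-up GUID standing in for PModeInfo.gModeID
  Saved := StringToGUID('{01234567-89AB-CDEF-0123-456789ABCDEF}');
  SaveModeID(IniName, Saved);
  if LoadModeID(IniName, Loaded) then
    WriteLn('Restored mode ID: ', GUIDToString(Loaded));
end.
```

In the sample application the value passed to SaveModeID would be the gModeID field of the selected mode record, and the restored GUID could be compared against each enumerated mode to reselect it.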
Engine Object
Once a voice has been chosen it is selected by calling ITTSEnum.Select
or ITTSFind.Select and an engine
object is created. The Select
method takes the desired mode identifier and an interface to an audio destination
object and returns the ITTSCentral
interface of the engine object in an out
parameter.
TTSCentral: ITTSCentral;
...
Log('About to start selected speech engine');
OleCheck(TTSEnum.Select(PModeInfo.gModeID, TTSCentral, AMMD));
Log('Started selected speech engine');
The engine object also (probably) implements the ITTSDialogs
interface, which allows access to the standard TTS engine dialogs and the ITTSAttributes
interface, which allows you to customise the voice attributes. You can either
use the Supports function to
see if these interfaces are supported, or check the TTTSModeInfo.dwInterfaces
mask for the TTSI_ITTSATTRIBUTES
or TTSI_ITTSDIALOGS flags.
This code gets access to both interfaces and checks the current voice pitch,
speed and volume (these details are displayed in track bars on the form).
TTSDialogs: ITTSDialogs;
TTSAttrs: ITTSAttributes;
...
if PModeInfo.dwInterfaces and TTSI_ITTSDIALOGS > 0 then
begin
  TTSDialogs := TTSCentral as ITTSDialogs;
  btnAbout.Enabled := True;
  btnGeneral.Enabled := True;
  btnLexicon.Enabled := True;
  btnTranslate.Enabled := True;
  Log('Dialogs are available');
end
else
  Log('Dialogs are not supported');
Log('About to check speech attributes');
if PModeInfo.dwInterfaces and TTSI_ITTSATTRIBUTES > 0 then
begin
  TTSAttrs := TTSCentral as ITTSAttributes;
  CheckPitch;
  CheckSpeed;
  CheckVolume;
  Log('Attributes checked');
  btnSetAttrs.Enabled := True;
  gbAttrs.Enabled := True
end
else
  Log('Attributes interface not supported');
Making Your Computer Talk
As with the Voice Text API there are different calls to start speech and to
continue paused speech, so the same approach of using a helper flag has been
employed. The text to speak is taken from a richedit control. You can see that
a TSData record is necessary
to represent the text to be added to the speech queue.
procedure TfrmDirectTTSAPI.btnPlayClick(Sender: TObject);
var
  SData: TSData;
begin
  if not BeenPaused then
  begin
    SData.dwSize := Succ(Length(reText.Text));
    SData.pData := PChar(reText.Text);
    OleCheck(TTSCentral.TextData(CHARSET_TEXT, 0, SData,
      Pointer(BufferSink), ITTSBufNotifySink));
  end
  else
  begin
    OleCheck(TTSCentral.AudioResume);
    BeenPaused := False
  end
end;

procedure TfrmDirectTTSAPI.btnPauseClick(Sender: TObject);
begin
  OleCheck(TTSCentral.AudioPause);
  BeenPaused := True
end;

procedure TfrmDirectTTSAPI.btnStopClick(Sender: TObject);
begin
  OleCheck(TTSCentral.AudioReset);
end;
Text passed to a TTS engine can include textual tags to
add in things such as emphasis, pitch changes and bookmarks. You tell the TTS
engine that you have tagged text by passing TTSDATAFLAG_TAGGED
as the second argument to TextData.
The DirectTextToSpeech API supports two notification interfaces. One is set
up each time you call TextData
(we'll come back to this one) and the other
is set up just once.
Engine Status Notifications
The notification interface that requires a one-off setup is ITTSNotifySink.
Once you have created an object that implements this interface you must register
it with the TTS engine.
EngineSinkCookie: DWord;
TTSNotifySink: ITTSNotifySink;
...
Log('About to create and register engine event sink');
TTSNotifySink := TTTSNotifySink.Create(Self);
OleCheck(TTSCentral.Register(Pointer(TTSNotifySink), ITTSNotifySink, EngineSinkCookie));
Log('Registered engine event sink')
You can also unregister it by passing the returned cookie to the UnRegister
method. The notifications supported are much the same as with the Voice Text
API and so again, the engine phonemes that represent the text to be spoken are
listed in a separate memo.
Engine Buffer Notifications
As text is added to the speech buffer (when you call TextData)
a reference to an object that implements the buffer notification interface (ITTSBufNotifySink)
is passed along, as you can see in the call listed
earlier. This object is optionally created through one of the check boxes
on the form and logs details of buffer notification methods, which are:
- TextDataStarted: the buffer
data has started being processed
- TextDataDone: the buffer
has been emptied and all text has been sent to the audio device
- BookMark: a bookmark tag
has been encountered
- WordPosition: a new word
is being processed
The WordPosition notification
is useful because the spoken text is displayed as engine phonemes, and the
notification tells us where to insert spaces between the words. Additionally we
can use this information to emphasise in the UI what text is being spoken. As
each word starts being processed to be spoken, the code emboldens the
corresponding word in the richedit control (and restores the previous word to
normal, if appropriate). This gives a nice effect:
[Screenshot: the current word shown in bold in the richedit control]
procedure TfrmDirectTTSAPI.SetWordStyle(FirstChar: Integer; Styles: TFontStyles);
var
  WordLen: Integer;
begin
  with reText do
  begin
    WordLen := 1;
    while FirstChar + WordLen < Length(Text) do //Locate end of word
    begin
      if not (Text[Succ(FirstChar) + WordLen] in ['a'..'z', 'A'..'Z', '0'..'9']) then
        Break;
      Inc(WordLen)
    end;
    Lines.BeginUpdate;
    try
      SelStart := FirstChar; //highlight word
      SelLength := WordLen;
      SelAttributes.Style := Styles; //apply requested style
      SelLength := 0; //unhighlight word
    finally
      Lines.EndUpdate
    end
  end
end;

function TTTSBufNotifySink.TextDataDone(qTimeStamp: QWORD;
  dwFlags: DWORD): HResult;
const
  EndTypes: array[Boolean] of String =
    ('all text sent to audio device', 'audio was reset');
begin
  Result := S_OK;
  FForm.Log('Buffer Event TextDataDone: timestamp %d, %s',
    [qTimeStamp, EndTypes[dwFlags = TTSBNS_ABORTED]]);
  FForm.SetWordStyle(FOldPos, []);
  FinishedTalking := True;
end;

function TTTSBufNotifySink.WordPosition(qTimeStamp: QWORD;
  dwByteOffset: DWORD): HResult;
begin
  Result := S_OK;
  FForm.Log('Buffer Event WordPosition: timestamp %d, byte offset %d',
    [qTimeStamp, dwByteOffset]);
  FForm.SetWordStyle(FOldPos, []);
  //Byte offset is offset into Ansi buffer
  FOldPos := dwByteOffset;
  FForm.SetWordStyle(FOldPos, [fsBold]);
  //If the engine events are firing then we can insert spaces
  //between the engine phoneme words made by the Visual event
  if FForm.actNotifySink.Checked then
    if Trim(FForm.memEnginePhonemes.Text) <> '' then
      FForm.memEnginePhonemes.Text := FForm.memEnginePhonemes.Text + #32;
end;
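The word-scanning loop in SetWordStyle can be tried out on its own, away from the richedit control. The following is a small sketch that lifts the same loop into a standalone function; the name WordLengthAt is made up for this example. The offset reported by WordPosition is zero-based, whereas Delphi strings are indexed from one, which is why Succ appears in the index expression.

```pascal
program WordSpan;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

//WordLengthAt is a hypothetical helper mirroring the loop in SetWordStyle.
//FirstChar is a zero-based offset to the first character of a word.
function WordLengthAt(const Text: string; FirstChar: Integer): Integer;
begin
  Result := 1;
  while FirstChar + Result < Length(Text) do
  begin
    //Stop at the first character that is not a letter or digit
    if not (Text[Succ(FirstChar) + Result] in ['a'..'z', 'A'..'Z', '0'..'9']) then
      Break;
    Inc(Result);
  end;
end;

begin
  WriteLn(WordLengthAt('Hello world', 0)); //length of "Hello"
  WriteLn(WordLengthAt('Hello world', 6)); //length of "world"
end.
```

Note that, like the original loop, this treats punctuation such as apostrophes as word terminators, so a word like "don't" would only be emboldened as far as the apostrophe.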
DirectSpeechRecognition API
Using the DirectSpeechRecognition API you can achieve Command and Control as
well as Dictation Speech Recognition depending on the grammar that you supply.
We didn't have to worry about this before as the high level objects set up a
default grammar (although you can change it).
For Command and Control you need a Context Free Grammar (CFG) and for Dictation
you need a Dictation Grammar. We'll see simple examples of these as we proceed
and you can find more information in the SAPI 4 documentation.
A sample project using the DirectSpeechRecognition
API can be found as DirectSpeechRecognitionAPI.dpr in the COM directory. The
program looks like this when running:
[Screenshot: the DirectSpeechRecognitionAPI sample program running]
The following sections describe the important COM objects you work with and
how the sample program uses them.
Audio Source Object
The first thing you need to do is initialise an audio source object that
will be used by the speech engine object, telling it which audio device to use.
uses
  Speech, MMSystem, ...
...
TfrmDirectSRAPI = class(TForm)
  ...
  AMMD: IAudioMultiMediaDevice;
  ...
end;
...
procedure TfrmDirectSRAPI.FormCreate(Sender: TObject);
begin
  SendMessage(lstProgress.Handle, LB_SETHORIZONTALEXTENT, Width, 0);
  Log('About to connect to multimedia device');
  AMMD := CreateComObject(CLSID_MMAudioSource) as IAudioMultiMediaDevice;
  OleCheck(AMMD.DeviceNumSet(WAVE_MAPPER));
  Log('Connected to multimedia device');
  ...
end;
Engine Enumerator Object
An SR engine typically offers different speech recognition modes (such as one
for microphone and one for telephone) and the enumerator can loop through them
all. An SR enumerator implements the ISREnum
interface to allow all modes of all available engines to be listed. You can
also use the ISRFind interface
to locate a specific mode by specifying preferred attributes. Each mode is represented
by a TSRModeInfo record structure.
This sample application lists all supported modes in a combobox and lets the
user browse the available modes, but automatically selects the first mode to
use. The combobox Items property
stores the textual mode names in the Strings
array and pointers to the corresponding mode records in the Objects
array. As different modes are selected a listbox is used to display the mode
attributes stored in the mode record. This way, the user can see the attributes
of the various modes on offer.
type
  TfrmDirectSRAPI = class(TForm)
    ...
    SREnum: ISREnum;
    ...
  end;
...
procedure TfrmDirectSRAPI.FormCreate(Sender: TObject);
var
  ModeInfo: TSRModeInfo;
  NumFound: DWord;
begin
  ...
  Log('About to enumerate speech engines');
  SREnum := CreateComObject(CLSID_SREnumerator) as ISREnum;
  OleCheck(SREnum.Reset);
  OleCheck(SREnum.Next(1, ModeInfo, @NumFound));
  while NumFound > 0 do
  begin
    //Keep a heap copy of each mode record alongside its name
    New(PModeInfo);
    PModeInfo^ := ModeInfo;
    cbEngines.Items.AddObject(String(ModeInfo.szModeName), TObject(PModeInfo));
    OleCheck(SREnum.Next(1, ModeInfo, @NumFound));
  end;
  if cbEngines.Items.Count > 0 then
  begin
    cbEngines.ItemIndex := 0; //Select 1st engine
    cbEngines.OnChange(cbEngines); //& ensure OnChange triggers
  end;
  Log('Enumerated speech engines');
  ...
end;

procedure TfrmDirectSRAPI.FormDestroy(Sender: TObject);
var
  I: Integer;
begin
  ...
  //Clean up TSRModeInfo memory for each speech engine
  for I := 0 to cbEngines.Items.Count - 1 do
    Dispose(PSRModeInfo(cbEngines.Items.Objects[I]));
end;

procedure TfrmDirectSRAPI.cbEnginesChange(Sender: TObject);
const
  Sequences: array[SRSEQUENCE_DISCRETE..SRSEQUENCE_CONTCFGDISCDICT]
    of String = (
    'recognises discrete utterances',
    'performs continuous recognition',
    'performs word spotting',
    'performs continuous recognition for CFGs and discrete dictation');
begin
  with lstEngineInfo.Items do
  begin
    Clear;
    PModeInfo := PSRModeInfo(cbEngines.Items.Objects[cbEngines.ItemIndex]);
    Add(Format('Name:'#9'%s', [PModeInfo.szModeName]));
    Add(Format('Language:'#9'%s', [PModeInfo.Language.szDialect]));
    Add(Format('Manufacturer:'#9'%s', [PModeInfo.szMfgName]));
    Add(Format('Product:'#9'%s', [PModeInfo.szProductName]));
    Add(Format('Mode ID:'#9'%s', [GUIDToString(PModeInfo.gModeID)]));
    Add(Format('Engine ID:'#9'%s', [GUIDToString(PModeInfo.gEngineID)]));
    Add(Format('Sequencing:'#9'%s', [Sequences[PModeInfo.dwSequencing]]));
  end
end;
A recognition mode can be identified by its mode identifier (the GUID stored in the
gModeID field of the TSRModeInfo
record). This can be stored with your application data in order to remember
which mode was being used last.
Engine Object
A recognition mode is selected by calling ISREnum.Select
or ISRFind.Select and these methods
both create an engine object. The Select
method takes the desired mode identifier and an interface to an audio source
object and returns the ISRCentral
interface of the engine object in an out
parameter.
SRCentral: ISRCentral;
...
Log('About to start engine');
OleCheck(SREnum.Select(PModeInfo.gModeID, SRCentral, AMMD));
Log('Engine started');
The engine object also (probably) implements the ISRDialogs
interface, which allows access to the standard SR engine dialogs and the ISRAttributes
interface, which allows you to check on the SR attributes. You can either use
the Supports function to see
if these interfaces are supported, or check the TSRModeInfo.dwInterfaces
mask for the SRI_ISRATTRIBUTES
or SRI_ISRDIALOGS flags.
SRDialogs: ISRDialogs;
SRAttrs: ISRAttributes;
...
Log('About to make dialogs available');
if PModeInfo.dwInterfaces and SRI_ISRDIALOGS > 0 then
begin
  SRDialogs := SRCentral as ISRDialogs;
  Log('Dialogs are available');
end
else
  Log('Dialogs are not supported');
Log('About to make speech attributes available');
if PModeInfo.dwInterfaces and SRI_ISRATTRIBUTES > 0 then
begin
  SRAttrs := SRCentral as ISRAttributes;
  Log('Attributes available');
end
else
  Log('Attributes interface not supported');
Grammar Compiler
Next a grammar compiler object is used to take a grammar definition (a simple
dictation grammar) and compile it. This compiled grammar will be passed along
to the SR engine shortly. Some brief information about grammars
is given towards the start of this article and you can find more information
in the SAPI documentation.
STGramComp: ISTGramComp;
...
const
Grammar: PChar = '[Grammar]'#13'LangID=2057'#13'Type=dictation'#13;
...
//Load grammar
STGramComp := CreateComObject(CLSID_STGramComp) as ISTGramComp;
OleCheck(STGramComp.FromMemory(Grammar, Succ(StrLen(Grammar))));
//Compile grammar
OleCheck(STGramComp.Compile(PPWideChar(nil)^, nil));
...
The grammar compiler can generate messages indicating if anything was wrong
with the grammar (or just that the grammar compiled successfully). If you wish
to see the error message you can change the code to:
uses
  ActiveX;
...
STGramComp: ISTGramComp;
Size: DWord;
GramErr: PWideChar;
...
const
  Grammar: PChar = '[Grammar]'#13'LangID=2057'#13'Type=dictation'#13;
...
//Load grammar
STGramComp := CreateComObject(CLSID_STGramComp) as ISTGramComp;
OleCheck(STGramComp.FromMemory(Grammar, Succ(StrLen(Grammar))));
//Compile grammar
GramErr := nil;
OleCheck(STGramComp.Compile(GramErr, @Size));
if Assigned(GramErr) then
begin
  ShowMessage(WideString(GramErr));
  //The message buffer is allocated by the compiler, so free it here
  CoTaskMemFree(GramErr);
end;
...
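Since the grammar handed to FromMemory is just text, it can also be assembled at run time instead of being a constant. The following sketch builds the colour CFG from the Grammars section out of a list of words, using the same #13 line separator as the listing above. BuildColourGrammar is a hypothetical helper and not part of SAPI; the resulting string would be passed to FromMemory through a PChar cast in the same way as the constant.

```pascal
program BuildGrammar;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

//BuildColourGrammar is a hypothetical helper, not part of SAPI.
//It returns CFG text with one <Colour> entry per supplied word.
function BuildColourGrammar(const Words: array of string): string;
var
  I: Integer;
begin
  Result := '[Grammar]'#13'LangID=2057'#13'Type=cfg'#13 +
    '[<Start>]'#13'<Start> = colour <Colour>'#13 +
    '[<Colour>]'#13;
  for I := Low(Words) to High(Words) do
    Result := Result + '<Colour> = ' + Words[I] + #13;
end;

var
  Grammar: string;
begin
  Grammar := BuildColourGrammar(['red', 'green', 'blue']);
  //Show the text with platform line breaks for readability
  WriteLn(StringReplace(Grammar, #13, sLineBreak, [rfReplaceAll]));
end.
```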
Grammar And Status Notifications
The grammar compiler can now load the compiled grammar into the engine object
and, whilst doing so, can set up a notification object that receives recognition-related
notifications (from the ISRGramNotifySink
interface). These include the PhraseStart,
PhraseHypothesis and PhraseFinish
notifications (among others) that we saw in the high
level Voice Dictation API.
SRGramUnk: IUnknown;
SRGramNotifySink: ISRGramNotifySink;
...
SRGramNotifySink := TSRGramNotifySink.Create(Self);
OleCheck(STGramComp.GrammarLoad(SRCentral, Pointer(SRGramNotifySink), ISRGramNotifySink, SRGramUnk));
Another notification interface is also supported (ISRNotifySink)
to inform the application of status changes. This interface has a notification
sink object set up to receive the notifications, which are logged to a listbox
on the form, like most of the other notifications in the program.
SRNotifySink: ISRNotifySink;
...
//Start a dictation session
SRNotifySink := TSRNotifySink.Create(Self);
OleCheck(SRCentral.Register(Pointer(SRNotifySink), ISRNotifySink, EngineSinkCookie));
The final job is to set up the speaker whose voice recognition profile you
wish to use (and whose profile will be modified if more SR training takes place).
Again, this example hard codes a specific speaker profile - in real applications
you may prefer to store this with your application data. Once the speaker is
set, the loaded grammar is activated through its ISRGramCommon interface so
that the engine starts listening for it.
OleCheck(SRAttrs.SpeakerSet('blong'));
if Supports(SRGramUnk, ISRGramCommon, SRGramCommon) then
  OleCheck(SRGramCommon.Activate(Handle, False, nil));
ActiveX
The ActiveX controls wrap up the DirectTextToSpeech and DirectSpeechRecognition
APIs and are described as ActiveVoice (or the Direct Speech Synthesis
Control) and ActiveListen (or the Direct Speech Recognition Control).
Ready made packages for Delphi 5 and Delphi 6 containing the ActiveX units
can be found in appropriately named subdirectories under SAPI 4 in the accompanying
files.
Direct Speech Synthesis Control (ActiveVoice)
The Microsoft Direct TextToSpeech control (or Direct Text-to-Speech control
as its type library describes it) is an ActiveX that wraps up the low level
DirectTextToSpeech API. To use it you must first import the ActiveX into Delphi;
you will find it described as Microsoft Direct Text-to-Speech (Version 1.0).
This will generate and install a type library import unit called ACTIVEVOICEPROJECTLib_TLB.pas.
The import unit contains the ActiveX component wrapper class called TDirectSS.
The ActiveX is implemented in XVoice.dll in the Windows speech directory (whose
version information describes it as the DirectSpeechSynthesis Module)
and the primary interface implemented is IDirectSS.
The control surfaces various DirectTextToSpeech interfaces such as ITTSCentral,
ITTSAttributes, ITTSDialogs
and ITTSFind. This means that
the control exposes methods and properties to speak, get and set the pitch,
speed and volume of the speech, invoke the engine dialogs, identify if the engine
is currently speaking and locate engines.
It also sends all the notifications from the ITTSNotifySink,
ITTSNotifySink2 and ITTSBufNotifySink
interfaces through ActiveX events.
You can programmatically work with this ActiveX control using the ProgID ActiveVoice.ActiveVoice
or the ClassID CLASS_DirectSS
from the ACTIVEVOICEPROJECTLib_TLB unit. The Windows registry describes this
class as the ActiveVoice Class.
Alternatively (and more typically) you can simply drop the ActiveX component
on a form as is done by the sample project DirectSSControl.dpr
in the ActiveX directory. The project does much the same as the DirectTextToSpeech
API project, but the ActiveX control provides an animated mouth that shows on
the form.
Much of the work involved in the DirectTextToSpeech API version is simplified
by the ActiveX wrapping up some of the niggly details. In this project the engine
mode is changed as you choose different items in the combobox since it is so
straightforward.
The control manages a list of all modes (the number is given by the CountEngines
property and the active one is in CurrentMode).
When the combobox is populated the Strings
property of the TStrings property
Items is filled with the descriptive
mode names, whereas the Objects
property is simply filled with the mode index.
procedure TfrmDirectSSControl.FormCreate(Sender: TObject);
var
  I: Integer;
begin
  SendMessage(lstProgress.Handle, LB_SETHORIZONTALEXTENT, Width, 0);
  DirectSS.Anchors := [akRight, akBottom];
  for I := 1 to DirectSS.CountEngines do
    cbEngines.Items.AddObject(DirectSS.ModeName(I), TObject(I));
  cbEngines.ItemIndex := cbEngines.Items.IndexOf(DirectSS.ModeName(DirectSS.CurrentMode));
  cbEngines.OnChange(cbEngines);
end;

procedure TfrmDirectSSControl.cbEnginesChange(Sender: TObject);
const
  Genders: array[GENDER_NEUTRAL..GENDER_MALE] of String =
    ('neutral', 'female', 'male');
var
  Buf: array[0..256] of Char;
begin
  DirectSS.CurrentMode := Integer(cbEngines.Items.Objects[cbEngines.ItemIndex]);
  with lstInfo.Items do
  begin
    Clear;
    Add(Format('Manufacturer: %s', [DirectSS.MfgName(DirectSS.CurrentMode)]));
    Add(Format('Product Name: %s', [DirectSS.ProductName(DirectSS.CurrentMode)]));
    Add(Format('Engine ID: %s', [DirectSS.EngineID(DirectSS.CurrentMode)]));
    Add(Format('Mode ID: %s', [DirectSS.ModeID(DirectSS.CurrentMode)]));
    Add(Format('Mode Name: %s', [DirectSS.ModeName(DirectSS.CurrentMode)]));
    Win32Check(Bool(GetLocaleInfo(
      DirectSS.LanguageID(DirectSS.CurrentMode),
      LOCALE_SENGLANGUAGE, Buf, SizeOf(Buf))));
    Add(Format('Language ID: %s', [Buf]));
    Add(Format('Dialect: %s', [DirectSS.Dialect(DirectSS.CurrentMode)]));
    Add(Format('Gender: %s', [Genders[DirectSS.Gender(DirectSS.CurrentMode)]]));
    Add(Format('Voice Style: %s', [DirectSS.Style(DirectSS.CurrentMode)]));
    Add(Format('Speaker: %s', [DirectSS.Speaker(DirectSS.CurrentMode)]));
  end;
  CheckPitch;
  CheckSpeed;
  CheckVolume;
end;

Direct Speech Recognition Control (ActiveListen)
The Microsoft Direct Speech Recognition control is an ActiveX that wraps up
the low level DirectSpeechRecognition API. To use it you must first import the ActiveX
into Delphi; you will find it described as Microsoft Direct Speech Recognition
(Version 1.0).
This will generate and install a type library import unit called ACTIVELISTENPROJECTLib_TLB.pas.
The import unit contains the ActiveX component wrapper class called TDirectSR.
The ActiveX is implemented in xlisten.dll in the Windows speech directory (whose
version information describes it as the DirectSpeechRecognition Module)
and the primary interface implemented is IDirectSR.
You can programmatically work with this ActiveX control using the ProgID ActiveListen.ActiveListen
or the ClassID CLASS_DirectSR
from the ACTIVELISTENPROJECTLib_TLB unit. The Windows registry describes this
class as the ActiveListen Class.
Alternatively (and more typically) you can simply drop the ActiveX component
on a form. That is what has been done with the sample
project DirectSpeechRecognitionControl.dpr in the ActiveX directory. This
project does much the same as the DirectSpeechRecognitionAPI.dpr
project and involves very much the same type of code.
The OnPhraseHypothesis
event does not fire with this control (though OnPhraseStart
does).
Speech Recognition Troubleshooting
If you get issues of SR stopping (or not starting) unexpectedly, or other weird
SR issues, check that your recording settings have the microphone enabled.
- Double-click the Volume icon in your Task Bar's System Tray. If no Volume
icon is present, choose Start | Programs | Accessories | Entertainment |
Volume Control.
- If you see a Microphone column, ensure it has its Mute checkbox checked.
- Choose Options | Properties, click Recording, ensure the Microphone option
is checked and press OK.
- Now ensure the Microphone column has its Select checkbox enabled, if it has
one, or that its Mute checkbox is unchecked, if it has one.
SAPI 4 Deployment
When distributing SAPI 4 applications you will need to supply the redistributable
components (available as spchapi.exe from http://www.microsoft.com/speech/download/old).
It would be advisable to also deploy the Speech Control Panel application (available
as spchcpl.exe from http://www.microsoft.com/msagent/downloads.htm),
however this Control Panel applet will not install on any version of Windows
later than Windows 2000.

The Microsoft SAPI 4 compliant TTS engine can be downloaded from various sites
(although not Microsoft's), such as http://misterhouse.net:81/public/speech
or http://www.cs.cofc.edu/~manaris/SUITEKeys.
As well as the Microsoft TTS engine, you can also download additional TTS engines
from Lernout & Hauspie (which include one that uses a British English voice)
from http://www.microsoft.com/msagent/downloads.htm.
If you plan to use any of these engines from applications running
under user accounts without administrative privileges, you need to do some registry tweaking,
described in http://www.microsoft.com/msagent/detail/tts3000deploy.htm.
You can download the Microsoft Speech Recognition engine for use with SAPI
4 from http://www.microsoft.com/msagent/downloads.htm.
References/Further Reading
The following is a list of useful articles and papers that I found on SAPI 4
development during my research on this subject.
- An
Overview of the Microsoft Speech API by Mike Rozak, November 1998.
Looks briefly at the high level and low level SR and TTS interfaces in the
SAPI 4 SDK.
- Writing
Your First Speech Recognition Program by John T. Yung.
Shows how to use the Direct Speech Recognition Control in a VB program.
About Brian Long
Brian Long used to work at Borland
UK, performing a number of duties including Technical Support on all the programming
tools. Since leaving in 1995, Brian has been providing training and consultancy
on Borland's RAD products, and is now moving into the .NET world.
Besides authoring a
Borland Pascal problem-solving book published in 1994, Brian is a regular
columnist in The
Delphi Magazine and has had numerous articles published in Developer's Review,
Computing, Delphi
Developer's Journal and EXE Magazine. He was nominated for the Spirit
of Delphi 2000 award and was voted Best Speaker at Borland's BorCon
2002 conference in Anaheim, California by the conference delegates.
There are a growing number of conference papers and articles available on Brian's
Web site, so feel free to have a browse.
In his spare time (and waiting for his C++ programs to compile) Brian has learnt
the art of juggling and
making inflatable origami
paper frogs.