WebSocket API for Streaming Transcription
Connecting
To start a new stream, first set up the connection. A WebSocket connection begins with an HTTP GET request carrying the header fields Upgrade: websocket and Connection: Upgrade, as specified in RFC 6455.
GET /asr/v0.1/stream HTTP/1.1
Host: api.myrtle.ai
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==
Sec-WebSocket-Protocol: stream.asr.api.myrtle.ai
Sec-WebSocket-Version: 13
If all is well, the server will respond in the affirmative.
HTTP/1.1 101 Switching Protocols
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Accept: s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
Sec-WebSocket-Protocol: stream.asr.api.myrtle.ai
The server will return HTTP/1.1 400 Bad Request
if the request is invalid.
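For illustration, here is a minimal Python sketch of this handshake using the third-party websockets library; the wss:// scheme and the library itself are assumptions, not part of the API definition:

```python
import websockets  # third-party: pip install websockets

ASR_URL = "wss://api.myrtle.ai/asr/v0.1/stream"  # assuming a TLS endpoint

async def open_stream():
    # The library performs the Upgrade handshake (Sec-WebSocket-Key,
    # -Version and -Accept) for us; we only request the subprotocol.
    # Request parameters (next section) are appended as a query string.
    return await websockets.connect(
        ASR_URL, subprotocols=["stream.asr.api.myrtle.ai"]
    )
```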
Request Parameters
Parameters are query-encoded in the request URL.
Content Type
Parameter | Required | Default |
---|---|---|
content_type | Yes | - |
Requests can specify the audio format with the content_type
parameter. If the content type is not
specified then the server will attempt to infer it. Currently only audio/x-raw
is supported.
Supported content types are:
audio/x-raw
: Unstructured and uncompressed raw audio data. If raw audio is used then additional parameters must be provided by adding:

format
: The format of audio samples. Only S16LE is currently supported.

rate
: The sample rate of the audio. Only 16000 is currently supported.

channels
: The number of channels. Only 1 channel is currently supported.
As a query parameter, this would look like:
content_type=audio/x-raw;format=S16LE;channels=1;rate=16000
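When assembling this URL programmatically, note that the value contains /, ; and = characters. Below is a sketch that percent-encodes it with Python's standard library; whether the server also accepts the unencoded form shown above is not stated here:

```python
from urllib.parse import urlencode

params = {"content_type": "audio/x-raw;format=S16LE;channels=1;rate=16000"}
url = "wss://api.myrtle.ai/asr/v0.1/stream?" + urlencode(params)
# content_type=audio%2Fx-raw%3Bformat%3DS16LE%3Bchannels%3D1%3Brate%3D16000
```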
Model Identifier
Parameter | Required | Default |
---|---|---|
model | No | "general" |
Requests can specify a transcription model identifier.
Model Version
Parameter | Required | Default |
---|---|---|
version | No | "latest" |
Requests can specify the transcription model version. Can be "latest"
or a specific version id.
Model Language
Parameter | Required | Default |
---|---|---|
lang | No | "en" |
The BCP47 language tag for the speech in the audio.
Max Number of Alternatives
Parameter | Required | Default |
---|---|---|
alternatives | No | 1 |
The maximum number of alternative transcriptions to provide.
Supported Models
Model id | Version | Supported Languages |
---|---|---|
general | v1 | en |
Request Frames
For audio/x-raw
audio, raw audio samples in the format specified in the
format
parameter should be sent in WebSocket Binary frames without padding.
Frames can be any length greater than zero.
A WebSocket Binary frame of length zero is treated as an end-of-stream (EOS) message.
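As a sketch, a client might stream a raw PCM file in fixed-size chunks and finish with the zero-length EOS frame; the chunk size and file handling are illustrative choices, not API requirements:

```python
CHUNK_BYTES = 3200  # 100 ms of 16 kHz, 16-bit, mono audio

async def send_audio(ws, path):
    # Any non-zero frame length is accepted; 100 ms chunks are one choice.
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_BYTES):
            await ws.send(chunk)  # bytes are sent as a Binary frame
    await ws.send(b"")            # zero-length Binary frame = EOS
```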
Response Frames
Response frames are sent as WebSocket Text frames containing JSON.
{
"start": 0.0,
"end": 2.0,
"is_provisional": false,
"alternatives": [
{
"transcript": "hello world",
"confidence": 1.0
}
]
}
API during greedy decoding
start
: The start time of the transcribed interval in seconds.

end
: The end time of the transcribed interval in seconds.

is_provisional
: Always false for greedy decoding (but can be true for beam decoding).

alternatives
: Contains at most one alternative for greedy decoding (but can be more for beam decoding).

transcript
: The model predictions for this audio interval. Not cumulative, so you can get the full transcript by concatenating all the transcript fields.

confidence
: Currently unused.
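Since transcripts are not cumulative, a greedy-decoding client can build the full transcript by concatenating the transcript fields as they arrive. A sketch, reusing the websockets connection from the earlier examples:

```python
import json

async def collect_greedy(ws):
    pieces = []
    async for message in ws:  # responses arrive as Text frames (str)
        response = json.loads(message)
        if response["alternatives"]:  # at most one entry in greedy mode
            pieces.append(response["alternatives"][0]["transcript"])
    # Concatenating the non-cumulative transcripts yields the full text.
    return "".join(pieces)
```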
API during beam decoding
When decoding with a beam search, the server returns two types of response, 'partial' and 'final', where:
- Partial responses are hypotheses that are provisional and may be removed or updated in future frames
- Final responses are hypotheses that are complete and will not change in future frames
It is recommended to use partials for low-latency streaming applications and finals for the ultimate transcription output. If latency is not a concern you can ignore the partials and concatenate the finals.
Finals are detected by checking the beam hypotheses for a shared prefix. Once this shared prefix is detected and sent as a final, all following partials and finals will have that prefix removed. Hence the full transcript can be recovered by concatenating the transcript fields of the finals.
Typically finals arrive within 1 second, and often much faster. To enforce a hard cap on final latency, beam time/depth pruning can be enabled; this discards the least likely hypotheses so that the time between final emissions is bounded.
Partial responses are marked with "is_provisional": true and finals with "is_provisional": false. A partial response may contain several entries in its "alternatives": [...] array.
During a partial response,
the alternatives are ordered from most confident to least.
During every frame other than the last one, the server will send partials and may also send a final. At the last frame, the server will not send any partials, but may send a final if there is outstanding text.
When the server sends a final and a partial for the same frame, the final will be sent first.
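Putting this together, a beam-decoding client might keep the concatenated finals as the confirmed transcript and overlay the best current partial for display. A sketch, assuming every response carries at least one alternative:

```python
import json

async def follow_beam(ws):
    final_text = ""
    async for message in ws:
        response = json.loads(message)
        best = response["alternatives"][0]["transcript"]  # most confident first
        if response["is_provisional"]:
            # Partial: provisional, so display it but do not keep it.
            print("\r" + final_text + best, end="", flush=True)
        else:
            # Final: a fixed prefix that later responses will not repeat.
            final_text += best
    print("\r" + final_text, flush=True)
    return final_text
```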
Closing the Connection
The client should not close the WebSocket connection itself; it should send an EOS message and wait for a WebSocket Close frame from the server. Closing the connection before receiving the server's Close frame may cause transcription results to be dropped.
An end-of-stream (EOS) message is sent as a zero-length binary frame.
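In terms of the sketches above, shutting down cleanly means sending the empty frame and then draining responses until the server closes:

```python
async def finish(ws):
    await ws.send(b"")  # EOS: zero-length Binary frame
    # Keep reading until the server's Close frame ends the iteration,
    # so no trailing transcription results are dropped.
    async for message in ws:
        pass  # hand any remaining responses to your normal handler
```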
Errors
If an error occurs, the server will send a WebSocket Close frame, with error details in the body.
Error Code | Details |
---|---|
400 | Invalid parameters passed. |
503 | Maximum number of simultaneous connections reached. |
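With the websockets library, an abnormal close surfaces as an exception carrying the server's Close frame; the sketch below inspects it. How the codes above map onto the WebSocket close code versus the reason text is not specified here, so check both:

```python
import websockets

async def run_safely(ws):
    try:
        async for message in ws:
            ...  # handle responses as in the sketches above
    except websockets.exceptions.ConnectionClosedError as exc:
        # exc.rcvd is the Close frame received from the server, if any.
        if exc.rcvd is not None:
            print(f"stream closed with error: {exc.rcvd.code} {exc.rcvd.reason}")
```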