6464
PROPOSED STANDARD
A Real-time Transport Protocol (RTP) Header Extension for Client-to-Mixer Audio Level Indication
Authors: J. Lennox, E. Ivov, E. Marocco
Date: December 2011
Area: rai
Working Group: avtext
Stream: IETF
Abstract
This document defines a mechanism by which packets of Real-time Transport Protocol (RTP) audio streams can indicate, in an RTP header extension, the audio level of the audio sample carried in the RTP packet. In large conferences, this can reduce the load on an audio mixer or other middlebox that wants to forward only a few of the loudest audio streams, without requiring it to decode and measure every stream that is received. [STANDARDS-TRACK]
RFC 6464
PROPOSED STANDARD
Internet Engineering Task Force (IETF) J. Lennox, Ed.
Request for Comments: 6464 Vidyo
Category: Standards Track E. Ivov
ISSN: 2070-1721 Jitsi
E. Marocco
Telecom Italia
December 2011
<span class="h1">A Real-time Transport Protocol (RTP) Header Extension for</span>
<span class="h1">Client-to-Mixer Audio Level Indication</span>
Abstract
This document defines a mechanism by which packets of Real-time
Transport Protocol (RTP) audio streams can indicate, in an RTP header
extension, the audio level of the audio sample carried in the RTP
packet. In large conferences, this can reduce the load on an audio
mixer or other middlebox that wants to forward only a few of the
loudest audio streams, without requiring it to decode and measure
every stream that is received.
Status of This Memo
This is an Internet Standards Track document.
This document is a product of the Internet Engineering Task Force
(IETF). It represents the consensus of the IETF community. It has
received public review and has been approved for publication by the
Internet Engineering Steering Group (IESG). Further information on
Internet Standards is available in <a href="./rfc5741#section-2">Section 2 of RFC 5741</a>.
Information about the current status of this document, any errata,
and how to provide feedback on it may be obtained at
<a href="https://www.rfc-editor.org/info/rfc6464">http://www.rfc-editor.org/info/rfc6464</a>.
<span class="grey">Lennox, et al. Standards Track [Page 1]</span>
<span id="page-2" ></span>
<span class="grey"><a href="./rfc6464">RFC 6464</a> Client-to-Mixer Audio Level Indication December 2011</span>
Copyright Notice
Copyright (c) 2011 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to <a href="https://www.rfc-editor.org/bcp/bcp78">BCP 78</a> and the IETF Trust's Legal
Provisions Relating to IETF Documents
(<a href="http://trustee.ietf.org/license-info">http://trustee.ietf.org/license-info</a>) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
Table of Contents
<a href="#section-1">1</a>. Introduction ....................................................<a href="#page-2">2</a>
<a href="#section-2">2</a>. Terminology .....................................................<a href="#page-3">3</a>
<a href="#section-3">3</a>. Audio Levels ....................................................<a href="#page-3">3</a>
<a href="#section-4">4</a>. Signaling (Setup) Information ...................................<a href="#page-5">5</a>
<a href="#section-5">5</a>. Considerations on Use ...........................................<a href="#page-6">6</a>
<a href="#section-6">6</a>. Security Considerations .........................................<a href="#page-6">6</a>
<a href="#section-7">7</a>. IANA Considerations .............................................<a href="#page-7">7</a>
<a href="#section-8">8</a>. References ......................................................<a href="#page-7">7</a>
<a href="#section-8.1">8.1</a>. Normative References .......................................<a href="#page-7">7</a>
<a href="#section-8.2">8.2</a>. Informative References .....................................<a href="#page-8">8</a>
<span class="h2"><a class="selflink" id="section-1" href="#section-1">1</a>. Introduction</span>
In a centralized Real-time Transport Protocol (RTP) [<a href="./rfc3550" title=""RTP: A Transport Protocol for Real-Time Applications"">RFC3550</a>] audio
conference, an audio mixer or forwarder receives audio streams from
many or all of the conference participants. It then selectively
forwards some of them to other participants in the conference. In
large conferences, it is possible that such a server might be
receiving a large number of streams, of which only a few are intended
to be forwarded to the other conference participants.
In such a scenario, in order to pick the audio streams to forward, a
centralized server needs to decode, measure audio levels, and
possibly perform voice activity detection on audio data from a large
number of streams. The need for such processing limits the size or
number of conferences such a server can support.
As an alternative, this document defines an RTP header extension
[<a href="./rfc5285" title=""A General Mechanism for RTP Header Extensions"">RFC5285</a>] through which senders of audio packets can indicate the
audio level of the packets' payload, reducing the processing load for
a server.
<span class="grey">Lennox, et al. Standards Track [Page 2]</span>
<span id="page-3" ></span>
<span class="grey"><a href="./rfc6464">RFC 6464</a> Client-to-Mixer Audio Level Indication December 2011</span>
The header extension in this document is different than, but
complementary with, the one defined in [<a href="./rfc6465" title=""A Real-time Transport Protocol (RTP) Header Extension for Mixer-to-Client Audio Level Indication"">RFC6465</a>], which defines a
mechanism by which audio mixers can indicate to clients the levels of
the contributing sources that made up the mixed audio.
<span class="h2"><a class="selflink" id="section-2" href="#section-2">2</a>. Terminology</span>
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in <a href="./rfc2119">RFC 2119</a> [<a href="./rfc2119" title=""Key words for use in RFCs to Indicate Requirement Levels"">RFC2119</a>] and
indicate requirement levels for compliant implementations.
<span class="h2"><a class="selflink" id="section-3" href="#section-3">3</a>. Audio Levels</span>
The audio level header extension carries the level of the audio in
the RTP [<a href="./rfc3550" title=""RTP: A Transport Protocol for Real-Time Applications"">RFC3550</a>] payload of the packet with which it is associated.
This information is carried in an RTP header extension element as
defined by "A General Mechanism for RTP Header Extensions" [<a href="./rfc5285" title=""A General Mechanism for RTP Header Extensions"">RFC5285</a>].
The payload of the audio level header extension element can be
encoded using either the one-byte or two-byte header defined in
[<a href="./rfc5285" title=""A General Mechanism for RTP Header Extensions"">RFC5285</a>]. Figures 1 and 2 show sample audio level encodings with
each of these header formats.
0 1
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| ID | len=0 |V| level |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Figure 1: Sample Audio Level Encoding Using the
One-Byte Header Format
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| ID | len=1 |V| level | 0 (pad) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Figure 2: Sample Audio Level Encoding Using the
Two-Byte Header Format
Note that, as indicated in [<a href="./rfc5285" title=""A General Mechanism for RTP Header Extensions"">RFC5285</a>], the length field in the one-
byte header format takes the value 0 to indicate that 1 byte follows.
In the two-byte header format, on the other hand, the length field
takes the value of 1.
<span class="grey">Lennox, et al. Standards Track [Page 3]</span>
<span id="page-4" ></span>
<span class="grey"><a href="./rfc6464">RFC 6464</a> Client-to-Mixer Audio Level Indication December 2011</span>
The magnitude of the audio level itself is packed into the seven
least significant bits of the single byte of the header extension,
shown in Figures 1 and 2. The least significant bit of the audio
level magnitude is packed into the least significant bit of the byte.
The most significant bit of the byte is used as a separate flag bit
"V", defined below.
The audio level is expressed in -dBov, with values from 0 to 127
representing 0 to -127 dBov. dBov is the level, in decibels, relative
to the overload point of the system, i.e., the highest-intensity
signal encodable by the payload format. (Note: Representation
relative to the overload point of a system is particularly useful for
digital implementations, since one does not need to know the relative
calibration of the analog circuitry.) For example, in the case of
u-law (audio/pcmu) audio [<a href="#ref-ITU.G711" title=""Pulse Code Modulation (PCM) of Voice Frequencies"">ITU.G711</a>], the 0 dBov reference would be a
square wave with values +/- 8031. (This translates to 6.18 dBm0,
relative to u-law's dBm0 definition in Table 6 of [<a href="#ref-ITU.G711" title=""Pulse Code Modulation (PCM) of Voice Frequencies"">ITU.G711</a>].)
The audio level for digital silence -- for a muted audio source, for
example -- MUST be represented as 127 (-127 dBov), regardless of the
dynamic range of the encoded audio format.
The audio level header extension only carries the level of the audio
in the RTP payload of the packet with which it is associated, with no
long-term averaging or smoothing applied. For payload formats that
contain extra error-correction bits or loss-concealment information,
the level corresponds only to the data that would result from the
payload's normal decoding process, not what it would produce under
error or packet loss concealment. The level is measured as a root
mean square of all the samples in the audio encoded by the packet.
To simplify implementation of the encoding procedures described here,
<a href="./rfc6465#appendix-A">Appendix A of [RFC6465]</a> provides a sample Java implementation of an
audio level calculator that helps obtain such values from raw linear
Pulse Code Modulation (PCM) audio samples.
In addition, a flag bit (labeled "V") optionally indicates whether
the encoder believes the audio packet contains voice activity. If
the V bit is in use, the value 1 indicates that the encoder believes
the audio packet contains voice activity, and the value 0 indicates
that the encoder believes it does not. (The voice activity detection
algorithm is unspecified and left implementation-specific.) If the V
bit is not in use, its value is unspecified and MUST be ignored by
receivers. The use of the V bit is signaled using the extension
attribute "vad", discussed in <a href="#section-4">Section 4</a>.
<span class="grey">Lennox, et al. Standards Track [Page 4]</span>
<span id="page-5" ></span>
<span class="grey"><a href="./rfc6464">RFC 6464</a> Client-to-Mixer Audio Level Indication December 2011</span>
When this header extension is used with RTP data sent using the RTP
Payload for Redundant Audio Data [<a href="./rfc2198" title=""RTP Payload for Redundant Audio Data"">RFC2198</a>], the header's data
describes the contents of the primary encoding.
Note: This audio level is defined in the same manner as is audio
noise level in the RTP Payload Comfort Noise specification
[<a href="./rfc3389" title=""Real-time Transport Protocol (RTP) Payload for Comfort Noise (CN)"">RFC3389</a>]. In [<a href="./rfc3389" title=""Real-time Transport Protocol (RTP) Payload for Comfort Noise (CN)"">RFC3389</a>], the overall magnitude of the noise level
in comfort noise is encoded into the first byte of the payload,
with spectral information about the noise in subsequent bytes.
This specification's audio level parameter is defined so as to be
identical to the comfort noise payload's noise-level byte.
<span class="h2"><a class="selflink" id="section-4" href="#section-4">4</a>. Signaling (Setup) Information</span>
The URI for declaring this header extension in an extmap attribute is
"urn:ietf:params:rtp-hdrext:ssrc-audio-level".
It has a single extension attribute, named "vad". It takes the form
"vad=on" or "vad=off". If the header extension element is signaled
with "vad=on", the V bit described in <a href="#section-3">Section 3</a> is in use, and MUST
be set by senders. If the header extension element is signaled with
"vad=off", the V bit is not in use, and its value MUST be ignored by
receivers. If the vad extension attribute is not specified, the
default is "vad=on".
An example attribute line in the Session Description Protocol (SDP)
for a conference might hence be:
a=extmap:6 urn:ietf:params:rtp-hdrext:ssrc-audio-level vad=on
The vad extension attribute only controls the semantics of this
header extension attribute, and does not make any statement about
whether the sender is using any other voice activity detection
features, such as discontinuous transmission, comfort noise, or
silence suppression.
Using the mechanisms of [<a href="./rfc5285" title=""A General Mechanism for RTP Header Extensions"">RFC5285</a>], an endpoint MAY signal multiple
instances of the header extension element, with different values of
the vad attribute, so long as these instances use different values
for the extension identifier. However, again following the rules of
[<a href="./rfc5285" title=""A General Mechanism for RTP Header Extensions"">RFC5285</a>], the semantics chosen for a header extension element
(including its vad setting) for a particular extension identifier
value MUST NOT be changed within an RTP session.
<span class="grey">Lennox, et al. Standards Track [Page 5]</span>
<span id="page-6" ></span>
<span class="grey"><a href="./rfc6464">RFC 6464</a> Client-to-Mixer Audio Level Indication December 2011</span>
<span class="h2"><a class="selflink" id="section-5" href="#section-5">5</a>. Considerations on Use</span>
Mixers and forwarders generally ought not base audio forwarding
decisions directly on packet-by-packet audio level information, but
rather ought to apply some analysis of the audio levels and trends.
This general rule applies whether audio levels are provided by
endpoints (as defined in this document), or are calculated at a
server, as would be done in the absence of this information. This
section discusses several issues that mixers and forwarders may wish
to take into account. (Note that this section provides design
guidance only, and is not normative.)
First of all, audio levels generally ought to be measured over longer
intervals than that of a single audio packet. In order to avoid
false-positives for short bursts of sound (such as a cough or a
dropped microphone), it is often useful to require that a
participant's audio level be maintained for some period of time
before considering it to be "real"; i.e., some type of low-pass
filter ought to be applied to the audio levels. Note, though, that
such filtering must be balanced with the need to avoid clipping of
the beginning of a speaker's speech.
Additionally, different participants may have their audio input set
differently. It may be useful to apply some sort of automatic gain
control to the audio levels. There are a number of possible
approaches to achieving this, e.g., by measuring peak audio levels,
by average audio levels during speech, or by measuring background
audio levels (average audio levels during non-speech).
<span class="h2"><a class="selflink" id="section-6" href="#section-6">6</a>. Security Considerations</span>
A malicious endpoint could choose to set the values in this header
extension falsely, so as to falsely claim that audio or voice is or
is not present. It is not clear what could be gained by falsely
claiming that audio is not present, but an endpoint falsely claiming
that audio is present, or falsely exaggerating its reported levels,
could perform a denial-of-service attack on an audio conference, so
as to send silence to suppress other conference members' audio, or
could dominate a conference by seizing its speaker-selection
algorithm. Thus, if a device relies on audio level data from
untrusted endpoints, it SHOULD periodically audit the level
information transmitted, taking appropriate corrective action against
endpoints that appear to be sending incorrect data. (However, as it
is valid for an endpoint to choose to measure audio levels prior to
encoding, some degree of discrepancy could be present. This would
not indicate that an endpoint is malicious.)
<span class="grey">Lennox, et al. Standards Track [Page 6]</span>
<span id="page-7" ></span>
<span class="grey"><a href="./rfc6464">RFC 6464</a> Client-to-Mixer Audio Level Indication December 2011</span>
In the Secure Real-time Transport Protocol (SRTP) [<a href="./rfc3711" title=""The Secure Real-time Transport Protocol (SRTP)"">RFC3711</a>], RTP
header extensions are authenticated but not encrypted. When this
header extension is used, audio levels are therefore visible on a
packet-by-packet basis to an attacker passively observing the audio
stream. As discussed in [<a href="#ref-SRTP-VBR-AUDIO">SRTP-VBR-AUDIO</a>], such an attacker might be
able to infer information about the conversation, possibly with
phoneme-level resolution. In scenarios where this is a concern,
additional mechanisms MUST be used to protect the confidentiality of
the header extension. This mechanism could be header extension
encryption [<a href="#ref-SRTP-ENCR-HDR">SRTP-ENCR-HDR</a>], or a lower-level security and
authentication mechanism such as IPsec [<a href="./rfc4301" title=""Security Architecture for the Internet Protocol"">RFC4301</a>].
<span class="h2"><a class="selflink" id="section-7" href="#section-7">7</a>. IANA Considerations</span>
This document defines a new extension URI in the RTP Compact Header
Extensions subregistry of the Real-Time Transport Protocol (RTP)
Parameters registry, according to the following data:
Extension URI: urn:ietf:params:rtp-hdrext:ssrc-audio-level
Description: Audio Level
Contact: [email protected]
Reference: <a href="./rfc6464">RFC 6464</a>
<span class="h2"><a class="selflink" id="section-8" href="#section-8">8</a>. References</span>
<span class="h3"><a class="selflink" id="section-8.1" href="#section-8.1">8.1</a>. Normative References</span>
[<a id="ref-RFC2119">RFC2119</a>] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", <a href="https://www.rfc-editor.org/bcp/bcp14">BCP 14</a>, <a href="./rfc2119">RFC 2119</a>, March 1997.
[<a id="ref-RFC2198">RFC2198</a>] Perkins, C., Kouvelas, I., Hodson, O., Hardman, V.,
Handley, M., Bolot, J., Vega-Garcia, A., and S. Fosse-
Parisis, "RTP Payload for Redundant Audio Data", <a href="./rfc2198">RFC 2198</a>,
September 1997.
[<a id="ref-RFC3550">RFC3550</a>] Schulzrinne, H., Casner, S., Frederick, R., and V.
Jacobson, "RTP: A Transport Protocol for Real-Time
Applications", STD 64, <a href="./rfc3550">RFC 3550</a>, July 2003.
[<a id="ref-RFC5285">RFC5285</a>] Singer, D. and H. Desineni, "A General Mechanism for RTP
Header Extensions", <a href="./rfc5285">RFC 5285</a>, July 2008.
<span class="grey">Lennox, et al. Standards Track [Page 7]</span>
<span id="page-8" ></span>
<span class="grey"><a href="./rfc6464">RFC 6464</a> Client-to-Mixer Audio Level Indication December 2011</span>
<span class="h3"><a class="selflink" id="section-8.2" href="#section-8.2">8.2</a>. Informative References</span>
[<a id="ref-ITU.G711">ITU.G711</a>] International Telecommunication Union, "Pulse Code
Modulation (PCM) of Voice Frequencies",
ITU-T Recommendation G.711, November 1988.
[<a id="ref-RFC3389">RFC3389</a>] Zopf, R., "Real-time Transport Protocol (RTP) Payload for
Comfort Noise (CN)", <a href="./rfc3389">RFC 3389</a>, September 2002.
[<a id="ref-RFC3711">RFC3711</a>] Baugher, M., McGrew, D., Naslund, M., Carrara, E., and K.
Norrman, "The Secure Real-time Transport Protocol (SRTP)",
<a href="./rfc3711">RFC 3711</a>, March 2004.
[<a id="ref-RFC4301">RFC4301</a>] Kent, S. and K. Seo, "Security Architecture for the
Internet Protocol", <a href="./rfc4301">RFC 4301</a>, December 2005.
[<a id="ref-RFC6465">RFC6465</a>] Ivov, E., Ed., Marocco, E., Ed., and J. Lennox,
"A Real-time Transport Protocol (RTP) Header Extension for
Mixer-to-Client Audio Level Indication", <a href="./rfc6465">RFC 6465</a>,
December 2011.
[<a id="ref-SRTP-ENCR-HDR">SRTP-ENCR-HDR</a>]
Lennox, J., "Encryption of Header Extensions in the Secure
Real-Time Transport Protocol (SRTP)", Work in Progress,
October 2011.
[<a id="ref-SRTP-VBR-AUDIO">SRTP-VBR-AUDIO</a>]
Perkins, C. and JM. Valin, "Guidelines for the use of
Variable Bit Rate Audio with Secure RTP", Work
in Progress, July 2011.
<span class="grey">Lennox, et al. Standards Track [Page 8]</span>
<span id="page-9" ></span>
<span class="grey"><a href="./rfc6464">RFC 6464</a> Client-to-Mixer Audio Level Indication December 2011</span>
Authors' Addresses
Jonathan Lennox (editor)
Vidyo, Inc.
433 Hackensack Avenue
Seventh Floor
Hackensack, NJ 07601
US
EMail: [email protected]
Emil Ivov
Jitsi
Strasbourg 67000
France
EMail: [email protected]
Enrico Marocco
Telecom Italia
Via G. Reiss Romoli, 274
Turin 10148
Italy
EMail: [email protected]
Lennox, et al. Standards Track [Page 9]
Annotations
Select text to annotate