Voice

Voice channels

Discord voice channels allow audio data to be sent to the voice servers over UDP. A bot can connect to at most one voice channel per guild. One websocket connection is opened and maintained for each voice channel the bot joins. The websocket connection should reconnect automatically, the same way that the main Discord gateway websocket connections do. For available voice functions and usage see the Nostrum.Voice module.

FFmpeg

Nostrum uses the ffmpeg command line utility to encode any audio (or video) file for sending to Discord's voice servers. By default Nostrum will look for the executable ffmpeg in the system path. If the executable is elsewhere, the path may be configured via config :nostrum, :ffmpeg, "/path/to/ffmpeg". The function Nostrum.Voice.play/4 allows sound to be played from files, local or remote, or from raw data that gets piped to stdin of the ffmpeg process. When playing from a URL, the URL can be the name of a file on the filesystem or the URL of a file on a remote server. ffmpeg supports many protocols, the most common of which are probably http and plain reads from the filesystem. It is also possible to send raw opus frames, bypassing ffmpeg, if desired.
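As a minimal sketch of the two input styles described above, the hypothetical PlaybackSketch module below wraps Nostrum.Voice.play/4 for a file or URL (type :url) and for in-memory data piped to ffmpeg (type :pipe). The module and function names are illustrative only. The play function is passed in as an argument, an assumption made here so the sketch can be exercised without a live voice connection.

```elixir
defmodule PlaybackSketch do
  # Hypothetical helper module; only Nostrum.Voice.play/4 comes from Nostrum.

  # Plays a local file path or a remote URL; ffmpeg opens the input itself.
  def play_file(guild_id, path_or_url, play \\ &Nostrum.Voice.play/4) do
    play.(guild_id, path_or_url, :url, [])
  end

  # Plays audio that is already in memory; the binary is piped to ffmpeg's stdin.
  def play_binary(guild_id, data, play \\ &Nostrum.Voice.play/4) when is_binary(data) do
    play.(guild_id, data, :pipe, [])
  end
end
```

In a bot, PlaybackSketch.play_binary(guild_id, File.read!(path)) would read a file into memory and pipe it, while PlaybackSketch.play_file(guild_id, path) leaves the reading to ffmpeg.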

youtube-dl

With only ffmpeg installed, Nostrum supports playing audio/video files or raw, piped data as discussed in the section above. Nostrum also has support for youtube-dl, another command line utility for downloading audio/video from online video services. Although the name implies support for YouTube only, youtube-dl supports downloading from a very long list of sites. By default Nostrum will look for the executable youtube-dl in the system path. If the executable is elsewhere, the path may be configured via config :nostrum, :youtubedl, "/path/to/youtube-dl". When Nostrum.Voice.play/4 is called with :ytdl for the type parameter, youtube-dl will be run with options -f bestaudio -q -o -, which will attempt to download the audio at the given URL and pipe it to ffmpeg.

Forks

The youtube-dl project has not been regularly maintained, and the latest release is not presently compatible with YouTube. The use of the yt-dlp fork is recommended in its place.

config :nostrum, :youtubedl, "yt-dlp"

streamlink

Nostrum also has support for streamlink, a command line utility for downloading live streams from online video streaming services. By default Nostrum will look for the executable streamlink in the system path. If the executable is elsewhere, the path may be configured via config :nostrum, :streamlink, "/path/to/streamlink". When Nostrum.Voice.play/4 is called with :stream for the type parameter, streamlink will attempt to download the live stream content and pipe it to ffmpeg.

It's recommended to use the most up-to-date version of streamlink to properly play human-readable URLs from services such as YouTube and Twitch. Version 3.x.x currently works with both of these services. If the short, human-readable URL of the streaming service doesn't work with streamlink out of the box, you may have more luck extracting the underlying raw stream URL. These are typically long URLs that end in .m3u8 or .hls. If you have youtube-dl installed, you can attempt to get this URL by running the following:

{raw_url, 0} = System.cmd("youtube-dl", ["-f", "best", "-g", url])
raw_url = raw_url |> String.trim()
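The snippet above raises a MatchError when youtube-dl exits with a non-zero status. A more defensive variant might look like the sketch below. StreamUrlSketch and raw_stream_url/2 are hypothetical names, and the command function is injectable only so the sketch can be tried without youtube-dl installed. Note that System.cmd/3 itself still raises if the executable cannot be found.

```elixir
defmodule StreamUrlSketch do
  # Hypothetical helper; it runs the same youtube-dl arguments as the snippet above.
  def raw_stream_url(url, cmd \\ &System.cmd/3) do
    case cmd.("youtube-dl", ["-f", "best", "-g", url], []) do
      # Exit status 0: the output is the raw stream URL followed by a newline.
      {output, 0} -> {:ok, String.trim(output)}
      # Any other exit status is surfaced to the caller instead of crashing.
      {_output, status} -> {:error, status}
    end
  end
end
```

The {:ok, raw_url} result can then be passed to Nostrum.Voice.play/4 with :stream or :url as the type, depending on which works for the service in question.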

Audio Timeout

Upon invoking Nostrum.Voice.play/4, the player process has a large, configurable initial window (20_000 milliseconds by default) within which it must generate audio before timing out. This allows ample time for slow networks to download large audio/video files. The configurable timeout only applies when play is initially invoked; once audio has begun transmitting, the timeout drops to 500 milliseconds. Because the ffmpeg process doesn't close when its input device is stdin, which is the case when type is set to :pipe, :ytdl, or :stream, the timeout is necessary to promptly detect end of input.

If the audio process times out within the initial window, the Nostrum.Struct.Event.SpeakingUpdate that is generated will have its timed_out field set to true. It will be false in all other cases.

If your use case does not include large, slow downloads and you wish to be notified of timeouts or errors more quickly, you may consider setting audio_timeout to a lower value. However, youtube-dl typically takes at least 2.5 seconds to begin outputting audio data, even on a fast connection. If your use case involves playing large files at a timestamp several hours in, such as play(guild_id, url, :ytdl, start_pos: "2:37:56"), you may consider setting the timeout to a higher value, as downloading a large YouTube video and having ffmpeg seek through several hours of audio may take 15-20 seconds, even with a fast network connection.
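As an illustration, the initial window could be lowered with a configuration entry like the one below. The file location (config/config.exs) and the 5_000 value are assumptions chosen for the example; the value should stay comfortably above the roughly 2.5 seconds youtube-dl needs to start producing audio.

```elixir
import Config

# Initial window, in milliseconds, in which the player must produce audio.
# 20_000 is the default; 5_000 is an illustrative lower value.
config :nostrum, :audio_timeout, 5_000
```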

Audio Frames Per Burst

The value :audio_frames_per_burst represents the number of consecutive packets to send before resting. When using Nostrum.Voice.play/4 to play audio, Nostrum collects a number of opus frames from the audio input source before sending them all to Discord as a "burst" of ordered frames. This is done to reduce the overhead of process-sleeping and setup. For reference, a single opus frame is 20 milliseconds of audio (at least for the format that Discord uses). By default, the :audio_frames_per_burst is set to 10, equivalent to 200 milliseconds of audio.

Under normal circumstances, there's no reason to change this value. However, if you attempt to play a very short piece of audio that's less than 10 frames (200ms) in length, it will time out (after the configured :audio_timeout duration has passed) as it waits to collect 10 frames to send. For those cases, configure the value to at most the minimum frame length of the audio you intend to play, or simply 1. Setting the value to 1 means that each opus frame from your audio source will be taken individually and be sent in its own "burst" with the player process sleeping between each; you likely won't notice a difference in audio playback quality compared to the default value of 10 other than that your sub-200ms audio files will play as expected.
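The arithmetic above can be sketched as follows. BurstSketch is a hypothetical module, not part of Nostrum; it only encodes the 20 millisecond frame duration stated above and picks the largest burst size that a short clip can still fill.

```elixir
defmodule BurstSketch do
  # Duration of one opus frame in the format Discord uses, per the text above.
  @frame_ms 20

  # Milliseconds of audio covered by one burst of `frames` frames.
  def burst_ms(frames) when is_integer(frames) and frames > 0 do
    frames * @frame_ms
  end

  # Largest burst size that a clip of `clip_ms` milliseconds can still fill,
  # never going below 1.
  def max_frames_per_burst(clip_ms) when is_integer(clip_ms) and clip_ms > 0 do
    max(div(clip_ms, @frame_ms), 1)
  end
end
```

For a 120 millisecond sound effect, BurstSketch.max_frames_per_burst(120) gives 6, so config :nostrum, :audio_frames_per_burst, 6 (or simply 1) would let it play without waiting for the timeout.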

Voice Events

There are a few voice-related events that bots can consume with a Nostrum.Consumer process.

Both Nostrum.Consumer.voice_state_update/0 and Nostrum.Consumer.voice_server_update/0 are sent by the shard gateway session when a bot joins a voice channel. The receipt of both of these events is required for a voice gateway session to begin, and it happens automatically when joining a channel. The Nostrum.Consumer.voice_state_update/0 event is also sent every time any user joins or leaves a voice channel, and Nostrum.Struct.Guild.voice_states/0 is automatically updated within the guild cache to reflect the current state of voice channels.

A use case for listening to both Nostrum.Consumer.voice_state_update/0 and Nostrum.Consumer.voice_server_update/0 events would be to outsource voice connections to an application outside of Nostrum. This can be done by setting the config option :voice_auto_connect to false, taking the session and token information from both events, and passing them to your external voice app. Outside of this niche use case, listening solely to the Nostrum.Consumer.voice_state_update/0 event is useful for detecting when users join or leave voice channels.
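Detecting joins and leaves from voice state updates can be sketched as a comparison between a user's previous and new channel. The sketch below assumes that the event payload carries a channel_id that is nil when the user is not in a voice channel, and that the bot tracks each user's previous channel itself. VoiceTransitionSketch and classify/2 are hypothetical names.

```elixir
defmodule VoiceTransitionSketch do
  # Classifies a change from `old_channel_id` to `new_channel_id`.
  # nil means the user is not in any voice channel (an assumption about the payload).
  def classify(nil, nil), do: :unchanged
  def classify(nil, _new), do: :joined
  def classify(_old, nil), do: :left
  # Same channel before and after, e.g. a mute or deafen change.
  def classify(same, same), do: :unchanged
  def classify(_old, _new), do: :moved
end
```

A consumer could call classify/2 from its handler for the voice state update event and react only to :joined and :left.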

The Nostrum.Consumer.voice_speaking_update/0 event is generated by Nostrum for convenience. It is sent every time the bot starts or stops speaking/sending audio. A typical use case is a queue of URLs to play: listening for Nostrum.Consumer.voice_speaking_update/0 tells the bot when the current URL has finished playing so that it can begin playing the next one in the queue. The alternative, non-event-driven approach is to periodically call Nostrum.Voice.playing?/1 and wait for it to return false as the trigger to play the next URL. Note that the third element in the event is of type Nostrum.Struct.VoiceWSState.t/0 and not Nostrum.Struct.WSState.t/0.
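The queue use case can be sketched as a pure decision function that a consumer calls when it receives a speaking update. QueueSketch is a hypothetical module, and the sketch assumes the update carries a boolean speaking field; plain maps stand in for the Nostrum.Struct.Event.SpeakingUpdate struct so the logic can be tried in isolation.

```elixir
defmodule QueueSketch do
  # The bot just started speaking: nothing to do until it stops.
  def on_speaking_update(%{speaking: true}, queue), do: {:wait, queue}

  # The bot stopped speaking and nothing is queued.
  def on_speaking_update(%{speaking: false}, []), do: {:idle, []}

  # The bot stopped speaking: hand back the next URL and the remaining queue.
  def on_speaking_update(%{speaking: false}, [next | rest]), do: {:play, next, rest}
end
```

On a {:play, url, rest} result, the consumer would call Nostrum.Voice.play/4 with the URL and store rest as the new queue. Where the queue is kept (for example in a GenServer or an Agent) is left open here.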

The Nostrum.Consumer.voice_ready/0 event is generated by Nostrum for convenience. It is sent when the bot is ready to begin sending audio data upon joining a voice channel. From the moment the bot joins a voice channel, Nostrum handles the multi-step handshaking process that is required before any audio packets can be sent or received. It is a common use case for bots to immediately begin playing audio upon joining a voice channel. Calling Nostrum.Voice.play/4 directly after calling Nostrum.Voice.join_channel/4 will always return an error as several network actions must take place before playing audio is possible. Listening for the Nostrum.Consumer.voice_ready/0 event can be used by the bot to begin playing audio as soon as it is able to. The alternative approach for this use case that is not event-driven is to periodically call Nostrum.Voice.ready?/1 and wait for it to return true as the trigger to begin playing. Another common approach is to define a try_play function as follows:

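# Retries every 100 ms until Nostrum.Voice.play/4 stops returning an error.
# Note: this retries indefinitely if the error never clears.
def try_play(guild_id, url, type, opts \\ []) do
  case Nostrum.Voice.play(guild_id, url, type, opts) do
    {:error, _msg} ->
      # Not ready yet: wait briefly, then try again.
      Process.sleep(100)
      try_play(guild_id, url, type, opts)

    _ ->
      :ok
  end
end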

Note that the third element in the event is of type Nostrum.Struct.VoiceWSState.t/0 and not Nostrum.Struct.WSState.t/0.
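Because try_play as written retries for as long as Nostrum.Voice.play/4 returns an error, a bounded variant may be preferable when the error might never clear, for example if the bot never joined a channel. The sketch below is one way to do that. BoundedPlaySketch, the retry options (:attempts, :delay_ms, :play), and the {:error, :not_ready} return value are all hypothetical choices made for this example. The :play option exists only so the sketch can be exercised without a voice connection.

```elixir
defmodule BoundedPlaySketch do
  @default_attempts 50
  @default_delay_ms 100

  # Same idea as try_play above, but gives up after a fixed number of attempts.
  def try_play(guild_id, url, type, opts \\ [], retry \\ []) do
    attempts = Keyword.get(retry, :attempts, @default_attempts)
    delay_ms = Keyword.get(retry, :delay_ms, @default_delay_ms)
    play = Keyword.get(retry, :play, &Nostrum.Voice.play/4)
    do_try(play, {guild_id, url, type, opts}, attempts, delay_ms)
  end

  # No attempts left: report failure instead of looping forever.
  defp do_try(_play, _args, 0, _delay_ms), do: {:error, :not_ready}

  defp do_try(play, {guild_id, url, type, opts} = args, attempts, delay_ms) do
    case play.(guild_id, url, type, opts) do
      {:error, _reason} ->
        Process.sleep(delay_ms)
        do_try(play, args, attempts - 1, delay_ms)

      _ok ->
        :ok
    end
  end
end
```

With the defaults this waits for at most about five seconds (50 attempts, 100 milliseconds apart) before returning {:error, :not_ready}.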

The Nostrum.Consumer.voice_incoming_packet/0 event is generated by Nostrum, but none are generated by default. You must first be connected to a voice channel, call the Nostrum.Voice.start_listen_async/1 function, then have another user in the same voice channel speak. If these conditions are met, an event will be received for each RTP packet the bot receives: 50 packets per second for each user that is actively speaking. These events are only useful if you intend to listen to incoming audio.

An alternative, non-event-driven approach to listening to incoming audio is to call Nostrum.Voice.listen/3. This function blocks until the specified number of RTP packets is received. Nostrum.Voice.listen/3 has the additional features of removing duplicate RTP packets within the set of packets returned per invocation and the option to return the raw RTP packet. In practice these features likely won't be missed when consuming incoming voice packets asynchronously. Note that the third element in the event is of type Nostrum.Struct.VoiceWSState.t/0 and not Nostrum.Struct.WSState.t/0.
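Since each actively speaking user produces 50 packets per second, the packet count to pass to Nostrum.Voice.listen/3 for a given duration is simple arithmetic. The sketch below assumes a single speaker and that the third argument of listen/3 is the raw-RTP flag. ListenSketch is a hypothetical module, and the listen function is injectable only so the sketch can be tried without a voice connection.

```elixir
defmodule ListenSketch do
  # One 20 ms opus frame per packet means 50 packets per second per speaker.
  @packets_per_second 50

  # Number of packets that make up `seconds` of audio from one speaker.
  def packet_count(seconds) when is_integer(seconds) and seconds > 0 do
    seconds * @packets_per_second
  end

  # Blocks until roughly `seconds` of audio from one speaker has been received.
  def listen_for(guild_id, seconds, listen \\ &Nostrum.Voice.listen/3) do
    listen.(guild_id, packet_count(seconds), false)
  end
end
```

With several users speaking at once, the same packet count is reached proportionally sooner, so the duration is only an approximation.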

Encryption Modes

Nostrum supports all of Discord's available encryption modes for voice channels. The encryption mode is invisible to the user, and you will likely never need to touch it.

Different encryption modes may have different performance characteristics depending on the hardware architecture your bot is running on. If you're interested, keep reading.

Encryption Mode Configuration Options

This is a runtime configuration option. Some Discord voice servers may not support your configured encryption mode, and in these cases a fallback mode will be selected.

config :nostrum, :voice_encryption_mode, :aes256_gcm # Default

Available configuration options are as follows:

:xsalsa20_poly1305
:xsalsa20_poly1305_suffix
:xsalsa20_poly1305_lite
:xsalsa20_poly1305_lite_rtpsize
:aead_xchacha20_poly1305_rtpsize
:aead_aes256_gcm
:aead_aes256_gcm_rtpsize
:xchacha20_poly1305 (alias for :aead_xchacha20_poly1305_rtpsize)
:aes256_gcm (alias for :aead_aes256_gcm_rtpsize)

The first seven are Discord's available options, while the last two are shorter aliases.

The latter four of Discord's seven modes are not yet documented, but will be soon.

Implementation Details

Of the seven supported modes, three different ciphers are used. The remaining differences are variations in how the nonce is determined and where the encrypted portion of the RTP packet begins.

Erlang's :crypto module is leveraged as much as possible as the ciphers are NIFs.

xsalsa20_poly1305

The entire Salsa20/XSalsa20 cipher is implemented in Elixir. The poly1305 MAC function is handled by the :crypto module. As a result, the xsalsa20_poly1305 modes will likely have the slowest performance.

xchacha20_poly1305

The :crypto module supports the chacha20_poly1305 AEAD cipher. The only thing implemented in Elixir is the HChaCha20 hash function that generates a sub-key from the key and the longer nonce that XChaCha20 specifies, which is then passed to the chacha20_poly1305 cipher. If your hardware doesn't have AES hardware acceleration, the chacha option may perform the best for you.

aes256_gcm

The :crypto module completely supports AES256 in GCM mode, requiring no implementation in Elixir. Many CPUs have hardware acceleration specifically for AES. For these reasons, Nostrum defaults to aes256_gcm.
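To get a feel for the :crypto AEAD primitives mentioned above, the sketch below encrypts and decrypts a payload with :crypto.crypto_one_time_aead using either :aes_256_gcm or :chacha20_poly1305. This is not Nostrum's implementation: it uses a random key and a random 12-byte nonce, and it omits the RTP framing, the nonce schemes, and the HChaCha20 sub-key step that the Discord modes require. AeadSketch is a hypothetical module name, and cipher availability depends on the OpenSSL build behind your Erlang installation.

```elixir
defmodule AeadSketch do
  @ciphers [:aes_256_gcm, :chacha20_poly1305]

  # Encrypts `plaintext`, then decrypts it again, returning the decrypted binary.
  def round_trip(cipher, plaintext, aad \\ <<>>) when cipher in @ciphers do
    key = :crypto.strong_rand_bytes(32)
    nonce = :crypto.strong_rand_bytes(12)

    # Encryption returns the ciphertext together with the authentication tag.
    {ciphertext, tag} =
      :crypto.crypto_one_time_aead(cipher, key, nonce, plaintext, aad, true)

    # Decryption verifies the tag and returns the plaintext (or :error on mismatch).
    :crypto.crypto_one_time_aead(cipher, key, nonce, ciphertext, aad, tag, false)
  end
end
```

Timing round_trip/3 for each cipher with :timer.tc/1 on your own hardware gives a rough indication of which mode is cheaper there, though a real comparison would need many iterations and realistic packet sizes.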
