Cloudbleed is a security bug discovered on February 17, 2017 affecting Cloudflare's reverse proxies, which caused their edge servers to run past the end of a buffer and return memory that contained private information such as HTTP cookies, authentication tokens, HTTP POST bodies, and other sensitive data. As a result, data from Cloudflare customers could be returned in responses for any other Cloudflare customer's site, depending on what happened to be in the server's memory at that moment. Some of this data was cached by search engines.
This bug was discovered by Tavis Ormandy from Google’s Project Zero. He was seeing corrupted web pages being returned by some HTTP requests run through Cloudflare. The bug was serious because the leaked memory could contain private information and because it had been cached by search engines. Cloudflare has not discovered any evidence of malicious exploits of the bug or other reports of its existence. The greatest period of impact was between February 13 and February 18, with around 1 in every 3,300,000 HTTP requests through Cloudflare potentially resulting in memory leakage (that’s about 0.00003% of requests).
Parsing and modifying HTML on the fly
Many of Cloudflare’s services rely on parsing and modifying HTML pages as they pass through their edge servers. For example, they can insert the Google Analytics tag, safely rewrite http:// links to https://, exclude parts of a page from bad bots, obfuscate email addresses, enable AMP, and more by modifying the HTML of a page.
To modify the page, they need to read and parse the HTML to find elements that need changing. Since the very early days of Cloudflare, they’ve used a parser written using Ragel. A single .rl file contains an HTML parser used for all the on-the-fly HTML modifications that Cloudflare performs.
About a year ago, Cloudflare decided that the Ragel-based parser had become too complex to maintain and started to write a new parser, named cf-html, to replace it. This streaming parser works correctly with HTML5 and is much faster and easier to maintain.
They first used this new parser for the Automatic HTTPS Rewrites feature and have been slowly migrating functionality that uses the old Ragel parser to cf-html.
Both cf-html and the old Ragel parser are implemented as NGINX modules compiled into their NGINX builds. These NGINX filter modules parse buffers (blocks of memory) containing HTML responses, make modifications as necessary, and pass the buffers onto the next filter.
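The idea of filters that each receive a buffer, modify it, and hand it on can be sketched in a few lines of C. This is only an illustration: `html_buf`, `html_filter`, `run_filters`, and the two toy filters are hypothetical names, not NGINX types or functions, and real NGINX filter modules are chained differently. The point it shows is that every filter has to stay within the valid length of the buffer it is given.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified response buffer; not an NGINX type. */
struct html_buf {
    char  *data;  /* start of the HTML bytes */
    size_t len;   /* number of valid bytes   */
};

typedef void (*html_filter)(struct html_buf *b);

/* Toy filter: replace '@' in place, standing in for email obfuscation. */
void mask_at_signs(struct html_buf *b)
{
    for (size_t i = 0; i < b->len; i++)
        if (b->data[i] == '@')
            b->data[i] = '#';
}

/* Toy filter: drop one trailing newline by shrinking the valid length. */
void trim_trailing_newline(struct html_buf *b)
{
    if (b->len > 0 && b->data[b->len - 1] == '\n')
        b->len--;
}

/* Run each filter in order; every filter sees the previous one's output. */
void run_filters(const html_filter *filters, size_t n, struct html_buf *b)
{
    assert(filters != NULL && b != NULL);
    for (size_t i = 0; i < n; i++)
        filters[i](b);
}
```

Both toy filters only touch bytes below `len`. A filter that reads beyond that length would be looking at whatever memory happens to follow the buffer.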
For the avoidance of doubt: the bug is not in Ragel itself. It is in Cloudflare's use of Ragel.
It turned out that the underlying bug that caused the memory leak had been present in their Ragel-based parser for many years but no memory was leaked because of the way the internal NGINX buffers were used. Introducing cf-html subtly changed the buffering which enabled the leakage even though there were no problems in cf-html itself.
Once the company knew that the bug was being caused by the activation of cf-html, they disabled the three features that caused it to be used. Every feature Cloudflare ships has a corresponding feature flag known as a ‘global kill’. They activated the Email Obfuscation global kill 47 minutes after receiving details of the problem and the Automatic HTTPS Rewrites global kill 3h05m later. The Email Obfuscation feature had been changed on February 13 and was the primary cause of the leaked memory, so disabling it quickly stopped almost all memory leaks.
Within a few seconds, those features were disabled worldwide. Cloudflare confirmed they were not seeing memory leakage via test URIs and had Google double check that they saw the same thing. Cloudflare then discovered that a third feature, Server-Side Excludes, was also vulnerable and did not have a global kill switch (it was so old it preceded the implementation of global kills). They implemented a global kill for Server-Side Excludes and deployed a patch to their fleet worldwide. From realizing Server-Side Excludes were a problem to deploying a patch took roughly three hours. However, Server-Side Excludes are rarely used and only activated for malicious IP addresses.
Root cause of the bug
The Ragel code is converted into generated C code which is then compiled. The C code uses, in the classic C manner, pointers to the HTML document being parsed, and Ragel itself gives the user a lot of control of the movement of those pointers. The underlying bug occurs because of a pointer error.
/* generated code */
if ( ++p == pe )
    goto _test_eof;
The root cause of the bug was that reaching the end of a buffer was checked using the equality operator, and a pointer was able to step past the end of the buffer. This is known as a buffer overrun. Had the check been done using >= instead of ==, jumping over the buffer end would have been caught. The equality check is generated automatically by Ragel and was not part of the code that Cloudflare wrote. This indicated that they were not using Ragel correctly.
The Ragel code Cloudflare wrote contained a bug that caused the pointer to jump over the end of the buffer, beyond the reach of an equality check.
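Why the form of the end-of-buffer test matters can be shown with a small C sketch. It is not Ragel's generated code: the function name, the `step` parameter, and the `cap` argument are assumptions made for illustration. The backing array is deliberately larger than the parsed region (`cap` bytes allocated, `len` bytes of data), so stepping past `pe` in the demo never leaves allocated memory.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper. Walks a pointer from the start of buf in
 * increments of `step`, testing for the end of the data either with
 * == (as in the generated code) or with >=. Returns how many positions
 * beyond pe the walk visits before it stops. */
size_t bytes_read_past_end(const char *buf, size_t len, size_t cap,
                           size_t step, int use_ge)
{
    const char *p = buf;
    const char *pe = buf + len;      /* logical end of the data     */
    const char *limit = buf + cap;   /* true end of the allocation  */
    size_t past = 0;

    assert(step > 0 && len > 0 && len <= cap);

    while ((size_t)(limit - p) > step) {   /* demo-only safety bound */
        p += step;
        if (use_ge ? (p >= pe) : (p == pe))
            break;                   /* end of data detected        */
        if (p > pe)
            past++;                  /* equality test was skipped   */
    }
    return past;
}
```

With `len` 8, `cap` 16 and a step of 1, the equality test stops the walk exactly at `pe`. With a step of 3 the pointer goes from 6 to 9, never equals 8, and the walk carries on into the spare bytes, while the `>=` variant stops as soon as the pointer reaches 9.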
Here’s a piece of Ragel code used to consume an attribute in an HTML <script> tag. The first line says that it should attempt to find zero or more unquoted_attr_char followed by (that’s the :>> concatenation operator) whitespace, a forward slash, or a > signifying the end of the tag.
script_consume_attr := ((unquoted_attr_char)* :>> (space|'/'|'>'))
    >{ ddctx("script consume_attr"); }
    @{ fhold; fgoto script_tag_parse; }
    $lerr{ dd("script consume_attr failed");
           fgoto script_consume_attr; };
If an attribute is well-formed, then the Ragel parser moves to the code inside the @{ } block. If the attribute fails to parse (which is the start of the bug discussed here) then the $lerr{ } block is used.
For example, in certain circumstances (detailed below) if the web page ended with a broken HTML tag like this:
<script type=
the $lerr{ } block would get used and the buffer would be overrun. In this case the $lerr does dd("script consume_attr failed"); (that’s a debug logging statement that is a nop in production) and then does fgoto script_consume_attr; (the state transitions to script_consume_attr to parse the next attribute). From Cloudflare’s statistics it appears that such broken tags at the end of the HTML occur on about 0.06% of websites.
If you have a keen eye you may have noticed that the @{ } transition also did an fgoto, but right before it did an fhold and the $lerr{ } block did not. It’s the missing fhold that resulted in the memory leakage.
Internally, the generated C code has a pointer named p that is pointing to the character being examined in the HTML document. fhold is equivalent to p-- and is essential because when the error condition occurs p will be pointing to the character that caused the script_consume_attr to fail.
And it’s doubly important because if this error condition occurs at the end of the buffer containing the HTML document then p will be after the end of the document (p will be pe + 1 internally) and a subsequent check that the end of the buffer has been reached will fail and p will run outside the buffer.
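The pointer arithmetic described above can be sketched in plain C. This is a simplified model, not the generated parser: `chars_read_past_end`, its parameters, and the `cap` bound are hypothetical, and the model assumes that the error action fires with the position at the end of the data and that re-entering a state performs the generated `++p == pe` test. Indices are used instead of raw pointers so the demo itself never leaves its bounds.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified model with hypothetical names.
 * pe  : index one past the last byte of HTML data
 * cap : size of the surrounding allocation (demo-only bound)
 * Returns how many positions beyond pe the parser would go on to
 * examine after the error action re-enters the state machine. */
size_t chars_read_past_end(size_t pe, size_t cap, int with_fhold)
{
    size_t p = pe;          /* assumed position when the error fires */
    size_t past = 0;

    assert(pe > 0 && pe <= cap);

    if (with_fhold)
        p--;                /* fhold is equivalent to p-- */

    /* fgoto script_consume_attr: the state is re-entered */
    while (p + 1 < cap) {   /* demo-only bound */
        if (++p == pe)      /* generated end-of-buffer test */
            break;          /* goto _test_eof */
        past++;             /* p is beyond pe: out-of-bounds read */
    }
    return past;
}
```

With the hold in place, the increment lands exactly on `pe` and the equality test fires. Without it, the increment produces `pe + 1`, the test never matches, and the model keeps walking until the demo bound. In the real parser the walk would end at whatever terminating condition the state machine reached next.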
Adding an fhold to the error handler fixes the problem.
Going bug hunting
Research by IBM in the 1960s and 1970s showed that bugs tend to cluster in what became known as “error-prone modules”. Since Cloudflare had identified a nasty pointer overrun in the code generated by Ragel, it was prudent to go hunting for other bugs.
Part of the information security team started fuzzing the generated code to look for other possible pointer overruns. Another team built test cases from malformed web pages found in the wild. A software engineering team began a manual inspection of the generated code looking for problems.
At that point it was decided to add explicit pointer checks to every pointer access in the generated code to prevent any future problem and to log any errors seen in the wild. The errors generated were fed to Cloudflare’s global error logging infrastructure for analysis and trending.
#define SAFE_CHAR ({\
if (!__builtin_expect(p < pe, 1)) {\
ngx_log_error(NGX_LOG_CRIT, r->connection->log, 0, "email filter tried to access char past EOF");\
RESET();\
output_flat_saved(r, ctx);\
BUF_STATE(output);\
return NGX_ERROR;\
}\
*p;\
})
And Cloudflare began seeing log lines like this:
2017/02/19 13:47:34 [crit] 27558#0: *2 email filter tried to access char past EOF while sending response to client, client: 127.0.0.1, server: localhost, request: "GET /malformed-test.html HTTP/1.1"
Every log line indicates an HTTP request that could have leaked private memory. By logging how often the problem was occurring, Cloudflare hoped to get an estimate of the number of times an HTTP request had leaked memory while the bug was present.
In order for the memory to leak the following had to be true:
The final buffer containing data had to finish with a malformed script or img tag
The buffer had to be less than 4k in length (otherwise NGINX would crash)
The customer had to either have Email Obfuscation enabled (because it uses both the old and new parsers during the transition),
… or Automatic HTTPS Rewrites/Server-Side Excludes (which use the new parser) in combination with another Cloudflare feature that uses the old parser.
… and Server-Side Excludes only execute if the client IP has a poor reputation (i.e. it does not work for most visitors).
That explains why the buffer overrun resulting in a leak of memory occurred so infrequently.
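The conditions listed above combine into a single boolean test, which a short C sketch makes explicit. The struct, its field names, and `could_leak` are hypothetical and exist only to illustrate the logic. Reading "4k" as 4096 bytes is also an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one response passing through the edge. */
struct response_conditions {
    bool   ends_with_malformed_tag;   /* final buffer ends in a broken script or img tag */
    size_t final_buffer_len;          /* length of that final buffer, in bytes           */
    bool   email_obfuscation;         /* uses both the old and the new parser            */
    bool   auto_https_rewrites;       /* uses the new parser                             */
    bool   server_side_excludes;      /* uses the new parser                             */
    bool   other_old_parser_feature;  /* another feature still on the old parser         */
    bool   client_ip_poor_reputation; /* Server-Side Excludes only runs for these        */
};

#define LEAK_BUFFER_LIMIT 4096u /* assumption: "4k" read as 4096 bytes */

bool could_leak(const struct response_conditions *c)
{
    assert(c != NULL);

    /* Server-Side Excludes only executes for poor-reputation clients. */
    bool sse_active = c->server_side_excludes && c->client_ip_poor_reputation;
    bool new_parser_active = c->auto_https_rewrites || sse_active;

    /* Both parsers must be involved in handling the response. */
    bool parsers_mixed = c->email_obfuscation ||
                         (new_parser_active && c->other_old_parser_feature);

    return c->ends_with_malformed_tag &&
           c->final_buffer_len < LEAK_BUFFER_LIMIT &&
           parsers_mixed;
}
```

Every term in the final expression has to be true at once, which is why only a tiny fraction of requests met all of the conditions.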
Additionally, the Email Obfuscation feature (which uses both parsers and would have enabled the bug to happen on the most Cloudflare sites) was only enabled on February 13 (four days before Tavis’ report).
The three features implicated were rolled out as follows. The earliest date memory could have leaked is 2016-09-22.
2016-09-22 Automatic HTTPS Rewrites enabled
2017-01-30 Server-Side Excludes migrated to new parser
2017-02-13 Email Obfuscation partially migrated to new parser
2017-02-18 Google reports problem to Cloudflare and leak is stopped
The greatest potential impact occurred for four days starting on February 13 because Automatic HTTPS Rewrites wasn’t widely used and Server-Side Excludes only activate for malicious IP addresses.
Internal impact of the bug
Cloudflare runs multiple separate processes on the edge machines and these provide process and memory isolation. The memory being leaked was from a process based on NGINX that does HTTP handling. It has a separate heap from processes doing SSL, image re-compression, and caching, which meant that they were quickly able to determine that SSL private keys belonging to their customers could not have been leaked.
However, the memory space being leaked did still contain sensitive information. One obvious piece of information that had leaked was a private key used to secure connections between Cloudflare machines.
When processing HTTP requests for customers’ web sites, Cloudflare’s edge machines talk to each other within a rack, within a data center, and between data centers for logging, caching, and to retrieve web pages from origin web servers.
In response to heightened concerns about surveillance activities against Internet companies, Cloudflare decided in 2013 to encrypt all connections between Cloudflare machines to prevent such an attack even if the machines were sitting in the same rack.
The private key leaked was the one used for this machine-to-machine encryption. A small number of secrets used internally at Cloudflare for authentication were also present.
External impact and cache clearing
More concerning was the fact that chunks of in-flight HTTP requests for Cloudflare customers were present in the dumped memory. That meant that information that should have been private could be disclosed.
This included HTTP headers, chunks of POST data (perhaps containing passwords), JSON for API calls, URI parameters, cookies and other sensitive information used for authentication (such as API keys and OAuth tokens).
Because Cloudflare operates a large, shared infrastructure, an HTTP request to a Cloudflare web site that was vulnerable to this problem could reveal information about another, unrelated Cloudflare site.
An additional problem was that Google (and other search engines) had cached some of the leaked memory through their normal crawling and caching processes. Cloudflare wanted to ensure that this memory was scrubbed from search engine caches before the public disclosure of the problem so that third parties would not be able to go hunting for sensitive information.
The information security team worked to identify URIs in search engine caches that had leaked memory and get them purged. With the help of Google, Yahoo, Bing and others, they found 770 unique URIs that had been cached and which contained leaked memory. Those 770 unique URIs covered 161 unique domains. The leaked memory has been purged with the help of the search engines. Cloudflare also undertook other search expeditions looking for potentially leaked information on sites like Pastebin and did not find anything.