Jacek Seliga

Team Leader | Senior Software Engineer

May 15, 2024

Fixing a 2-year-old Bug: A Journey with Debugging SSL Handshake

SSL Handshake

Nobody said building the world’s fastest distributed OLAP database would be easy. However, speed is not everything we aim for. One of the current hot topics at Oxla is external tools support.

Together with my team, we're responsible for supporting major data analytics tools compatible with PostgreSQL standards, such as QuickSight, Looker, PowerBI, and more.

Working with something new requires a thoughtful approach. We usually start by understanding how the tool communicates with PostgreSQL, which queries are executed, what functions are called, etc.

Since Oxla’s network protocol and SQL dialect are mostly compatible with PostgreSQL, usually it’s only a matter of providing missing DB functionalities support. We thoroughly analyse what happens  within PostgreSQL and once we identify missing features, the development process for adding tool support in Oxla begins.  

During the last sprint, we made some significant progress in our work on the Google PostgreSQL Connector for Looker Studio. We were finalizing the implementation of some missing features, and it was time to put our work to the test. Our goal was to test Looker not only with stand-alone Oxla (without SSL) but also with Oxla Cloud.

But suddenly – THE ISSUE arose:

Investigation begins

We quickly retrieved more details from our pgbouncer logs, a lightweight connection pooler for PostgreSQL, which we use in Oxla Cloud.

WARNING TLS handshake error: handshake failed: error:0A000438:SSL routines::tlsv1 alert internal error

I double-checked and confirmed that the issue still occurred even when PostgreSQL was behind pgbouncer and not Oxla. Therefore, the issue appeared to be with pgbouncer itself, which was even documented on their GitHub page.

The GitHub user who created that page precisely described a similar situation that we’ve experienced with Oxla. We were able to connect to various tools with SSL enabled (using pgbouncer) but Looker itself was causing us a hurdle. I went through all of the comments on it and found a valuable insight from another user. With all that context, I knew my way around the next steps.

Look under the hood

I decided to take a look at the communication to check the following scenarios:

  • Plain PostgreSQL that's able to connect to Looker with SSL.
  • pgbouncer with PostgreSQL backend, which isn't able to do that.

This can be achieved by using tools like tcpdump, Wireshark and a few others. By utilising them, we can see what is going on a given port, what packets are sent and received.

I decided to bet on Wireshark. It's a great open-source tool that recognizes many kinds of protocols, including PostgreSQL and TLS, and presents messages in a structured way. It can also reassemble TLS records spanning multiple TCP segments and TLS Application Data spanning multiple TLS records, thus I recommend turning those options on in the preferences for the TLS protocol.

Spot and Reduce The Differences

All messages up to 'Server Hello' appeared identical in both scenarios. This is understandable since the issue lied within pgbouncer itself, acting as our 'Server' in this context. As shown in the screenshots below, several fields differed between scenarios.

I minimised the discrepancy between these two scenarios by adjusting configuration flags in the pgbouncer config and slightly modifying its source code. While I hoped these actions would resolve the issue, their true purpose was guiding me toward the root cause by narrowing down the elements that required investigation.

Do not touch

After meticulously testing my changes one by one, I repeatedly attempted to connect Looker to pgbouncer. Suddenly, after clicking 'authenticate” to initiate yet another connection attempt, there was no activity in Wireshark. At first, I thought I had permanently broken something in pgbouncer. I tried connecting Looker directly to PostgreSQL, but still no activity.

I created a brand-new report in Looker Studio, and made another attempt, but still received no activity. Subsequently, I confirmed that I had exposed everything correctly and that there were no network configuration or connection issues. The server that I exposed was reachable from the web.

At that moment, the realization dawned on me.
Google 'banned' me for the sheer volume of connection attempts.

That was a pretty wise move from Google, as this kind of behaviour is certainly suspicious.

Suddenly, I got this idea – what if someone else could create a Looker report, add my data source, and give me access using the 'Share' functionality built into Looker? Then, I could simply replace the valid data source on my side with my version of pgbouncer (while leaving the exact URL and other settings) and hit 'refresh data' in Looker to receive a connection attempt from it.

Voilà! It worked like a charm however, hitting the refresh button provided me with yet another error message:

Unable to connect to Host: SSL error: Index 0 out of bounds for length 0

At this point, I knew that the 'Server Hello' message was missing something in the pgbouncer case, something that seemed to be there when I used PostgreSQL.

Just a few bytes

After several more attempts, the communication dump between pgbouncer and PostgreSQL was almost identical. The only difference was the size of the third message: for pgbouncer, it was 62 bytes long, while for PostgreSQL, it was 99 bytes long. Considering the last error message, and the fact that everything else, like the certificate used for SSL, username, and password is the same in both scenarios.

I was sure those missing 37 bytes were the root cause of the issue.

So, now I just needed to fix it, right? Check the difference in that message and make it match its PostgreSQL counterpart? Not so fast! It was encrypted.

Let there be light

Actually, encryption wasn't a problem at all. I had a certfile used for these connections. I could read the Client Random and Server Random, and I could check the cipher being used. None of that was necessary, though.

I knew that one of the enhancements introduced in TLS 1.3 was the encryption of handshake messages. So, what if I temporarily forced a downgrade to TLS 1.2? After making some minor changes to the configurations of both of my test scenarios and verifying that the issue still occurred, I went straight to Wireshark capture.

In the TLS 1.2 flavor, there was still just one difference between pgbouncer and PostgreSQL, but it was clearly visible that time.

When reviewing the data on screenshots above, I noticed that the "Certificate Request" was missing the Distinguished Names section. That was just it!

The End

After that moment, it turned into a simple, everyday task. I delved deep into the OpenSSL documentation and pgbouncer source (specifically, the libusual library used for SSL handling) to identify what was missing in the source code.

I quickly discovered that using SSL_CTX_set_client_CA_list would resolve the issue. After adding it to my local version of pgbouncer, I attempted to connect to Looker, and it worked! After that, I reverted all the potential not relevant changes that made pgbouncer communication similar to that of PostgreSQL, one by one, just to ensure that everything still functioned correctly. Finally, the issue was resolved.

I proceeded to create a pull request in libusual GitHub respository and responded in the original GitHub issue, sharing that it was now resolved. My changes were merged into pgbouncer repository on May 6th, 2024 – exactly two years after the issue was opened.

At this point, I'd like to thank JelteF and Microsoft – the maintainers of pgbouncer GitHub repository.


Did you enjoy this text? Want to be one of the first to try Looker with Oxla Cloud, share your feedback, or simply say hello to the author? Reach out to us at hello@oxla.com!