Debugging cryptic errors
by ananthakumaran

At a project I am working on, our backend system makes http requests to hundreds of different servers. It is sort of like webhook, http requests are made to the customer server with the payload whenever some specific events occur in our system. We were in the midst of migrating our systems from ruby to elixir.

During the test run, we started to get a cryptic error for some small number of domains.

HTTPoison.get("https://piay.iflix.com", [], [proxy: {'10.128.10.16', 3128}])
# => timestamp=2017-01-02T04:11:00.816Z level=error message= SSL: :hello:ssl_alert.erl:88:Fatal error: handshake failure

# => {:error, %HTTPoison.Error{id: nil, reason: {:tls_alert, 'handshake failure'}}}

The error points out that the handshake phase during the ssl connection is failing. But it doesn’t say why it is failing.

To get more information, I tried to use ssl module directly.

:ssl.connect('piay.iflix.com', 443, [])
# => {:ok,{:sslsocket, {:gen_tcp, #Port<0.10658>, :tls_connection, :undefined}, #PID<0.325.0>}}

To my amazement, it worked. I was able to connect to this domain, but https request failed. Due to compliance reasons, we have to use a proxy server for all our outgoing http/https requests. We use HTTPoison library, which is a wrapper for hackney library. We already had some issues due to proxy. It seems like most of the users of the hackney library don’t use the proxy option, so some of the code paths related to proxy are not well tested. To make sure proxy is the problem, I made the request without proxy option and it worked.

Hackney uses connect tunneling method, which is quite simple. The http client sends CONNECT piay.iflix.com:443 to the proxy server, which in turn opens a tcp connection to piay.iflix.com:443. The proxy server will then relay whatever data sent by http client to the destination server. In case of https request, once the connection is established, hackney initiates the ssl protocol using ssl:connect method.

This looks quite simple, but still something is going wrong. The same ssl:connect succeeds when it is established directly, but not through the proxy server.

The dbg app provides text based tracing functionality. It can be used to trace function at various granularities, from all the functions in a module to a function with specific arguments. I started to trace all the function calls in ssl module, but it resulted in too much data for me to analyze properly. Then I started to read the source code of ssl app and started to trace a single function at a time and compared the arguments and the results of successful and failed connection.

(<0.572.0>) returned from ssl_handshake:client_hello_extensions/6 ->
{hello_extensions,
 {renegotiation_info,
  {elliptic_curves,
   {1,3,
    132,
    0,
    30}]},
 undefined}


(<0.572.0>) returned from ssl_handshake:client_hello_extensions/6 ->
{hello_extensions,
 {renegotiation_info,
  {elliptic_curves,
   {1,3,
    132,
    0,
    30}]},
 {sni,
  "piay.iflix.com"}}

After multiple trials and errors, I finally came across something interesting. The sni field was present in successful connection, but not in failed connection. The rest of the deduction was easy. The sni extension allows a single server to serve different certificates for different domains. During the initial handshake, the client has to provide the domain name along with other details.

If a host name is passed as the first param of ssl:connect, the sni extension is automatically enabled. For proxy request, the connection is established to the proxy server, which relays the data to the destination server. As the ip address of the proxy server is passed as the first param, sni extension was not enabled.

HTTPoison.get("https://piay.iflix.com", [], [proxy: {'10.128.10.16', 3128}, ssl: [server_name_indication: 'piay.iflix.com']])

The fix was easy. The sni extension has to be enabled explicitly. As always, more layers introduce more points of failures.

  |   Visualization of backoff functions ⇒