Random problems with Spring Boot and Docker

We run a bunch of Java Spring Boot microservices in Docker containers. I’m not sure Java is the best choice when it comes to running a lot of containers, simply because of how the JVM allocates memory, but that’s a different topic. The issue I want to address here is that we simply cannot start many at once, or they… simply….. slow……. down…….. to ……. a ………. crawl. Startup times go from 15 seconds to 30 minutes in cases where 20 containers are starting up at once! THAT’S CRAZY!

At any rate, this morning after server patching, I decided I’d had enough, as starting them all one by one was going to take me an hour. I dove into the logs of one container that had not started within my arbitrary 90-second window (much longer than the usual 25 seconds when starting solo), and here’s what I saw:

  .   ____          _            __ _ _
 /\\ / ___'_ __ _ _(_)_ __  __ _ \ \ \ \
( ( )\___ | '_ | '_| | '_ \/ _` | \ \ \ \
 \\/  ___)| |_)| | | | | || (_| |  ) ) ) )
  '  |____| .__|_| |_|_| |_\__, | / / / /
 =========|_|==============|___/=/_/_/_/
:: Spring Boot ::        (v1.3.0.RELEASE)
 
2016-07-17 07:13:20.456  INFO 1 --- [           main] com.company.EventApplication             : Starting EventApplication on 16bad0525076 with PID 1 (/opt/java1-service-1.0.0-SNAPSHOT.jar started by root in /)
2016-07-17 07:13:20.475  INFO 1 --- [           main] com.company.EventApplication             : The following profiles are active: production
2016-07-17 07:13:20.755  INFO 1 --- [           main] ationConfigEmbeddedWebApplicationContext : Refreshing org.springframework.boot.context.embedded.AnnotationConfigEmbeddedWebApplicationContext@619ffb1: startup date [Sun Jul 17 07:13:20 CDT 2016]; root of context hierarchy
2016-07-17 07:13:23.171  INFO 1 --- [           main] o.s.b.f.s.DefaultListableBeanFactory     : Overriding bean definition for bean 'beanNameViewResolver' with a different definition: replacing [Root bean: class [null]; scope=; abstract=false; lazyInit=false; autowireMode=3; dependencyCheck=0; autowireCandidate=true; primary=false; factoryBeanName=org.springframework.boot.autoconfigure.web.ErrorMvcAutoConfiguration$WhitelabelErrorViewConfiguration; factoryMethodName=beanNameViewResolver; initMethodName=null; destroyMethodName=(inferred); defined in class path resource [org/springframework/boot/autoconfigure/web/ErrorMvcAutoConfiguration$WhitelabelErrorViewConfiguration.class]] with [Root bean: class [null]; scope=; abstract=false; lazyInit=false; autowireMode=3; dependencyCheck=0; autowireCandidate=true; primary=false; factoryBeanName=org.springframework.boot.autoconfigure.web.WebMvcAutoConfiguration$WebMvcAutoConfigurationAdapter; factoryMethodName=beanNameViewResolver; initMethodName=null; destroyMethodName=(inferred); defined in class path resource [org/springframework/boot/autoconfigure/web/WebMvcAutoConfiguration$WebMvcAutoConfigurationAdapter.class]]
2016-07-17 07:13:24.728  INFO 1 --- [           main] s.b.c.e.t.TomcatEmbeddedServletContainer : Tomcat initialized with port(s): 8080 (http)
2016-07-17 07:13:24.783  INFO 1 --- [           main] o.apache.catalina.core.StandardService   : Starting service Tomcat
2016-07-17 07:13:24.786  INFO 1 --- [           main] org.apache.catalina.core.StandardEngine  : Starting Servlet Engine: Apache Tomcat/8.0.28
2016-07-17 07:13:25.180  INFO 1 --- [ost-startStop-1] o.a.c.c.C.[Tomcat].[localhost].[/]       : Initializing Spring embedded WebApplicationContext
2016-07-17 07:13:25.181  INFO 1 --- [ost-startStop-1] o.s.web.context.ContextLoader            : Root WebApplicationContext: initialization completed in 4451 ms
2016-07-17 07:13:25.790  INFO 1 --- [ost-startStop-1] o.s.b.c.e.ServletRegistrationBean        : Mapping servlet: 'dispatcherServlet' to [/]
2016-07-17 07:13:25.797  INFO 1 --- [ost-startStop-1] o.s.b.c.embedded.FilterRegistrationBean  : Mapping filter: 'metricFilter' to: [/*]
2016-07-17 07:13:25.798  INFO 1 --- [ost-startStop-1] o.s.b.c.embedded.FilterRegistrationBean  : Mapping filter: 'characterEncodingFilter' to: [/*]
2016-07-17 07:13:25.798  INFO 1 --- [ost-startStop-1] o.s.b.c.embedded.FilterRegistrationBean  : Mapping filter: 'hiddenHttpMethodFilter' to: [/*]
2016-07-17 07:13:25.799  INFO 1 --- [ost-startStop-1] o.s.b.c.embedded.FilterRegistrationBean  : Mapping filter: 'httpPutFormContentFilter' to: [/*]
2016-07-17 07:13:25.799  INFO 1 --- [ost-startStop-1] o.s.b.c.embedded.FilterRegistrationBean  : Mapping filter: 'requestContextFilter' to: [/*]
2016-07-17 07:13:25.799  INFO 1 --- [ost-startStop-1] o.s.b.c.embedded.FilterRegistrationBean  : Mapping filter: 'webRequestLoggingFilter' to: [/*]
2016-07-17 07:13:25.799  INFO 1 --- [ost-startStop-1] o.s.b.c.embedded.FilterRegistrationBean  : Mapping filter: 'applicationContextIdFilter' to: [/*]
2016-07-17 07:14:53.243  INFO 1 --- [ost-startStop-1] o.a.c.util.SessionIdGeneratorBase        : Creation of SecureRandom instance for session ID generation using [SHA1PRNG] took [87,395] milliseconds.

WHOA. What does that last line say? “Creation of SecureRandom instance for session ID generation using [SHA1PRNG] took [87,395] milliseconds.” Wow, almost 90 seconds. I’m not a Java developer, but I do know “PRNG” = Pseudo Random Number Generator, and I know a PRNG needs to be seeded with real entropy / randomness to be effective. By default the JVM seeds SecureRandom from /dev/random, and /dev/random blocks until the kernel’s entropy pool has enough bits, so a depleted pool stalls startup. Seeing that the embedded Tomcat in Spring Boot was creating one of these at startup, I decided to run a test. But before that, I wanted a way to measure how much entropy the system had.

With a little googling, I found that the following command shows the kernel’s currently available entropy:

# cat /proc/sys/kernel/random/entropy_avail
135

I also found that collectd has a simple entropy plugin, so I enabled it and now have entropy readings sent to Graphite every 10 seconds.
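For reference, a minimal collectd setup along those lines might look like the sketch below; the plugin names are real, but the Graphite host is a placeholder for our setup:

```
# /etc/collectd/collectd.conf (sketch)
<LoadPlugin entropy>
  Interval 10                     # poll entropy_avail every 10 seconds
</LoadPlugin>

LoadPlugin write_graphite
<Plugin write_graphite>
  <Node "graphite">
    Host "graphite.example.com"   # placeholder
    Port "2003"
    Protocol "tcp"
  </Node>
</Plugin>
```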

Now for my test… I stopped and started four microservices via our microservice “swap out” script: start a new container, and when “Server started in X seconds” is shown in its log, remove the previous container of the same image while updating the vulcan proxy load balancer. Since each swap waits for the new container to be fully up, the services started one after another…

# for i in java2-service java3-service java4-service java5-service; do ./$i/swapout; done
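For context, the swap-out flow is roughly the sketch below. The `update_lb` helper, registry, and container names are hypothetical stand-ins for illustration, not our real script:

```shell
# Hypothetical sketch of the per-service "swapout" flow described above.
swapout() {
  service="$1"
  # Start the replacement container from the same image
  docker run -d --name "${service}-new" "registry.local/${service}:latest"
  # Wait until Spring Boot reports the app is up
  until docker logs "${service}-new" 2>&1 | grep -q 'seconds'; do
    sleep 2
  done
  update_lb "$service" "${service}-new"  # hypothetical: repoint the vulcan proxy
  docker rm -f "$service"                # retire the old container
}
```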

Once complete, I looked for instances of PRNG in their log files:

# for i in java2-service java3-service java4-service java5-service; do docker logs $i | grep PRNG; done
2016-07-17 08:54:39.224  INFO 1 --- [ost-startStop-1] o.a.c.util.SessionIdGeneratorBase        : Creation of SecureRandom instance for session ID generation using [SHA1PRNG] took [17,541] milliseconds.
2016-07-17 08:55:49.809  INFO 1 --- [ost-startStop-1] o.a.c.util.SessionIdGeneratorBase        : Creation of SecureRandom instance for session ID generation using [SHA1PRNG] took [19,913] milliseconds.
2016-07-17 08:57:03.131  INFO 1 --- [ost-startStop-1] o.a.c.util.SessionIdGeneratorBase        : Creation of SecureRandom instance for session ID generation using [SHA1PRNG] took [28,985] milliseconds.
2016-07-17 08:58:29.855  INFO 1 --- [ost-startStop-1] o.a.c.util.SessionIdGeneratorBase        : Creation of SecureRandom instance for session ID generation using [SHA1PRNG] took [57,045] milliseconds.

Sure enough, each service took longer and longer to gather enough entropy. You can see below that the entropy bounces back a little after the first hit, and a little again after the second, but the third and fourth hits both leave the entropy pool pinned low. Eventually it recovers.

[Chart: entropy-1 – entropy_avail during the four sequential swap-outs]

We have a couple of options to make this better:

  • Tell Tomcat to use /dev/urandom rather than /dev/random. /dev/urandom never blocks; that’s arguably weaker for anything using crypto, but it should eliminate the wait entirely, with each of the above SecureRandom creations completing almost instantly rather than taking tens of seconds.
  • Find a way to feed more entropy to the system. I generally prefer this method, since we wouldn’t want to systematically point every Tomcat instance at the less secure source and inadvertently shortchange a service where strong randomness is critical.
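For the first option, the relevant knob (as far as I can tell) is the JVM’s entropy gathering device, settable per JVM on the command line or globally in the java.security file. Note the odd /dev/./urandom spelling, which works around older JVMs silently mapping a plain /dev/urandom back to the blocking device:

```
# Per JVM, e.g. appended to the container's JAVA_OPTS:
-Djava.security.egd=file:/dev/./urandom

# Or globally, in $JAVA_HOME/jre/lib/security/java.security:
securerandom.source=file:/dev/./urandom
```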

I investigated and found the haveged daemon, which feeds the kernel pool with entropy derived from CPU timing jitter (the HAVEGE algorithm)… and WOW:

[Chart: entropy-2 – entropy_avail with haveged running]
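If you want to try it yourself, here is a sketch of the install-and-verify steps (Debian/Ubuntu package names assumed; adjust for your distro):

```shell
# Install and enable haveged first, e.g. on Debian/Ubuntu:
#   apt-get install -y haveged && systemctl enable --now haveged
# Then check the kernel pool; with haveged running it should stay near its maximum:
cat /proc/sys/kernel/random/entropy_avail
```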

I then ran the same four microservice swap-outs, and not only was there no impact on the later microservices, overall startup time was noticeably faster. (The chart above includes the microservice startups.) There was no PRNG line in any of the logs (I believe Tomcat only logs the SecureRandom creation when it is slow), so there’s nothing concrete to compare beyond the visibly faster startups. Looking at some of our Tomcat services that are not in Docker, I see they occasionally suffer from the same PRNG delay, so this should help all of our Tomcat instances start up sooner.