Following up on the conversation in playframework#8374:
Some references can’t be linked directly in this post since, as a new member, I can only post two links, but I have some observations I’d like feedback on:
- Some parts of Play have had, and probably still have, performance-impacting “bugs” that can easily go unnoticed (e.g. playframework#8374, playframework#8335)
- Other heavily used components can benefit from some low-hanging optimization fruit (e.g. playframework#8375, play-ws#251)
For the bugs, it would be nice if there were an easier way to spot them earlier. For the easy optimizations, it would be nice if someone spent a little time looking for them and implementing them.
In all of the tickets listed above, I was able to identify the problematic parts of the code because I have access to a staging service that runs a profiling agent capable of producing flame graphs. That service routinely has long periods where the only incoming traffic is “health check” traffic. In other words, for certain periods of time (weekends), I can produce flame graphs from a Play application where the only “work” being done and measured comes from framework overhead rather than the application itself.
Beyond the tickets linked above, we’ve also fixed more than a couple of issues that we created ourselves in our own middleware. Still, the observation stands: flame graphs of “framework overhead” are extremely useful for noticing unexpected work patterns, and those patterns usually point to straightforward resolution paths.
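For concreteness, here is a rough sketch of how such a capture could be triggered from inside the application during a known-quiet window. It assumes async-profiler’s Java API (`one.profiler.AsyncProfiler`) is on the classpath and that its native library can be loaded; the exact command strings vary between profiler versions, and any agent that can emit flame graphs would work just as well:

```scala
import one.profiler.AsyncProfiler

// Sketch only: capture a CPU profile while the only incoming traffic is health checks,
// so the resulting flame graph shows framework overhead rather than application work.
object QuietWindowProfiler {
  // getInstance() loads the native library from java.library.path; an explicit path
  // can be passed instead if the library lives elsewhere.
  private val profiler = AsyncProfiler.getInstance()

  // Start CPU sampling once the quiet window begins.
  def startCapture(): Unit = {
    profiler.execute("start,event=cpu")
    ()
  }

  // Stop sampling and write the result; with recent async-profiler versions an
  // .html output file is rendered directly as a flame graph.
  def stopCapture(outputPath: String): Unit = {
    profiler.execute(s"stop,file=$outputPath")
    ()
  }
}
```

In our case the trigger is effectively manual (we know when the weekend window starts), but the same two calls could just as easily be wired to a scheduled task or an admin-only endpoint.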
My question to this community is the following: how do we set up a process within the Play development cycle so that performance “bugs” and improvement opportunities are noticed earlier, without depending on the broader community to carefully monitor the performance of their own applications and submit patches upstream (although people should definitely keep doing that too)? For example:
- Could prune gather profiling data in addition to benchmark data and surface it somewhere?
- Does it make sense to set performance goals for Play itself, so that expected baseline latency and throughput targets can be used to make harder, more subjective trade-off decisions (for example, does it make sense to make akka-http the default backend)? There’s a rough sketch of this after the list.
- More broadly, should a group of engineers (either at Lightbend, via community efforts, or both) be formed to look into areas for improvement? How would such a group operate?
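To make the second bullet a bit more concrete, here is a minimal sketch of what a baseline check could look like: hit a known endpoint, compute a latency percentile, and fail if it drifts past an agreed budget. The endpoint, sample counts, and the 5 ms budget are purely illustrative assumptions rather than actual Play targets, and a real setup would presumably live inside prune or a proper load-testing tool rather than a hand-rolled loop:

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

// Sketch only: a crude latency-baseline check against a locally running application.
object BaselineCheck {
  def main(args: Array[String]): Unit = {
    val client  = HttpClient.newHttpClient()
    val request = HttpRequest.newBuilder(URI.create("http://localhost:9000/health")).build()

    // Warm up the server (and the JIT) before measuring anything.
    (1 to 500).foreach(_ => client.send(request, HttpResponse.BodyHandlers.discarding()))

    // Sample request latencies in nanoseconds.
    val samples = (1 to 2000).map { _ =>
      val start = System.nanoTime()
      client.send(request, HttpResponse.BodyHandlers.discarding())
      System.nanoTime() - start
    }.sorted

    val p99Ms    = samples((samples.size * 99) / 100 - 1) / 1e6
    val budgetMs = 5.0 // hypothetical budget, agreed on up front
    println(f"p99 = $p99Ms%.2f ms (budget $budgetMs%.1f ms)")
    if (p99Ms > budgetMs) sys.exit(1) // non-zero exit so CI can flag the regression
  }
}
```

Throughput targets could be checked the same way, and trend data over time would be even more useful than a single pass/fail threshold.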
Looking forward to hearing others’ thoughts and feedback on how we can make Play better in this regard.