Analyzing High Memory Usage With awk

I was recently doing some investigation into high memory usage for a Magento Commerce client of ours and found myself working closely with a really useful tool called awk for the first time. One of our Tech Leads had previously done some analysis of overall memory usage in the system following an outage, and noticed that their hosting company’s support had called out a few requests that appeared to be eating up a ton of memory during this time:

2019-06-12 20:52:16 GET 500 173138.361 ms 420108 kB 17.56% /banner/ajax/load/?_=1560372733339&sections=
2019-06-12 20:52:29 GET 500 162652.213 ms 452908 kB 18.59% /banner/ajax/load/?_=1560372685651&sections=
2019-06-12 20:49:02 GET 500 366376.036 ms 1191096 kB 25.09% /banner/ajax/load/?_=1560372475962&sections=

It was my responsibility to take a closer look at these to see if there was any real cause for concern here, however, I didn’t initially have a lot of information to go off of (hosting had just simply sent us the above with a note that we should take a closer look). My first goal was to figure out where these requests were even logged to understand what I was looking at. At the time of investigation the outage had occurred a couple weeks prior, so I needed to use something like zgrep to dig through the rotated logs. I ended up using the requests total execution time as the string to match as it seemed like the most “unique” string to use (that “_” value used by jQuery for cache busting ended up popping up more times than I’d expected):

$ zgrep '173138.361' /var/log/xxxxxx/xxxxxx/*log.{15..19}.gz
/var/log/xxxxxx/xxxxxx/php.access.log.17.gz:2019-06-12 20:52:16 GET 500 173138.361 ms 420108 kB 17.56% /banner/ajax/load/?_=1560372733339&sections=

$ zgrep '162652.213' /var/log/xxxxxx/xxxxxx/*log.{15..19}.gz
/var/log/xxxxxx/xxxxxx/php.access.log.17.gz:2019-06-12 20:52:29 GET 500 162652.213 ms 452908 kB 18.59% /banner/ajax/load/?_=1560372685651&sections=

$ zgrep '366376.036' /var/log/xxxxxx/xxxxxx/*log.{15..19}.gz
/var/log/xxxxxx/xxxxxx/php.access.log.17.gz:2019-06-12 20:49:02 GET 500 366376.036 ms 1191096 kB 25.09% /banner/ajax/load/?_=1560372475962&sections=

Sweet. So at this point I had a better understanding of what support forwarded us. These are from the php-fpm logs on the server, and what we’re looking at is the amount of memory that php-fpm used on each request. That’s great and all, and I could take a look at the controller for these requests to get a better understanding of what’s going on at the PHP execution level (I did in fact do this later), however I was hoping to find a way to dig into the historical data to get an understanding of why and how often these requests were performing poorly. We currently use New Relic as an application performance monitoring tool with the vast majority of our clients (I even wrote up a bit on the SD blog about getting certified here), so my first instinct was to use that for analyzing the metrics. Unfortunately, it looks like because the system was going haywire and running out of memory at this time, very little data had actually been gathered and reported to New Relic. Additionally, New Relic hadn’t traced any specific banner/ajax/load transactions prior to or after the outage, as the execution times hadn’t been terribly slow. So now what?

Well, maybe I couldn’t get a sense why the requests had used a ton of memory, but I did have access to this php-fpm logs to see just how often this was happening. But how was I supposed to look up all of this data across weeks of compressed log files, extract what I needed, then run some statistics on the data (even if it was just basic operations)?

I’d previously seen awk used as a part of more complex chained commands in some of our other post-mortem analyses (and if you’re not documenting stuff like this in your post-mortems, I’d HIGHLY encourage you to start), so I decided to take a closer look at awk and see just what I could get out of it.

Let’s start with a simple example below:

$ zgrep 'banner/ajax/load' /var/log/xxxxxx/xxxxxx/php.access.log.17.gz | awk '{ print $7 }' 

With this, I’m simply greping for every request to /banner/ajax/loadin a compressed log file, piping the output to awk and pulling the 7th column of text, “columns” here defined as strings separated by whitespace. It’s important to note that unlike almost everything else in Computer Science, the columns in awk are not 0-indexed (go figure). Simple enough, yeah?

So now we have tens of thousands of lines of memory usage to look through. But what if we just want to look at the worst cases for this day? Well then, let’s just sort and grab the last few with sort and tail

$ zgrep 'banner/ajax/load' /var/log/xxxxxx/xxxxxx/php.access.log.17.gz | awk '{ print $7 }' | sort -h | tail -5
91404
195968
420108
452908
1191096

Awesome, so that was pretty easy, and now we see that there were really 4 requests that appeared to be using an absurd amount of memory on the day of the outage. Let’s look at a wider span of time though and see if we can get a better sense of how often this is happening:

$ zgrep 'banner/ajax/load' /var/log/xxxxxx/xxxxxx/php.access.log.{1..21}.gz | awk '{ print $7 }' | sort -h | tail
85320
85320
85332
89428
91404
91404
195968
420108
452908
1191096

$ zgrep 'banner/ajax/load' /var/log/xxxxxx/xxxxxx/php.access.log.{1..16}.gz | awk '{ print $7 }' | sort -h | tail
83212
83272
83284
83284
83284
85260
85320
85320
85332
89428

So at this point I’m feeling pretty confident that this isn’t a typical issue. But is 89428 kB still a little too high for this request? Let’s try to answer that question, and again turn to awk for some help. We’re able to compute and return an average for this column by passing something like {n++;sum+=$x} END {print n?sum/n:0} to awk:

$ zgrep 'banner/ajax/load' /var/log/xxxxxx/xxxxxx/php.access.log.{1..30}.gz | awk '{n++;sum+=$7} END {print n?sum/n:0}'
8813.21
$ grep 'banner/ajax/load' /var/log/xxxxxx/xxxxxx/php.access.log | awk '{n++;sum+=$7} END {print n?sum/n:0}'
8772.13

Nope! From what I was able to gather using the above, it looked like this wasn’t any sort of recurring issue we needed to watch out for. With the above data, some additional analysis on system metrics at the time of the outage, and a little further digging into the Magento source code, we were able to successfully determine that these requests were really just the result of an overloaded system that didn’t have a great PHP memory_limit set. We’re currently engaging Magento and working with hosting to find a safer memory_limit that will make sure individual requests aren’t able to use over over 1G of memory.