Alright. Here's a bit of a stumper, at least for me. I've got a user on my server who is using up a pretty good sized chunk of bandwidth. Nothing overly fatal to being able to use the system but enough that I'm wondering what exactly is causing the drain. He's averaging 15GB over the course of a 30 day period now and it's been consistently rising since it began spiking in December of 2007. Before that timeframe his usage was average compared to everyone else.
I've been helping him try to isolate the cause but can't find anything concrete. After some Google hunting on his domain we thought maybe it was from hotlinking images off the site for use elsewhere. So I installed an anti-hotlinking snippet into the .htaccess file. Over the last 30 days it had no effect at all. In fact usage went up slightly. So we're both stumped as to a specific cause since Google isn't turning much up anymore to help explain it.
I'm not particularly good with detailed access log analysis, but is there a way to isolate problems like this down far enough to figure out what specifically is causing problems?
I'm assuming you've looked at things like how many lines are in the access log compared to other people who aren't showing spiked access? Also, is this usage web-server-only? Have you done things like frequency counts of user agents, user hosts, and which pages got hit?
It occurred to me about 5 minutes after posting that I should probably see about getting something to help analyze logs with, so for what it's worth, awstats will have to do for now. Unfortunately it seems his log file hasn't been rotated in over a year and the format was no good. When I forced rotation it all got shoved into a gz archive and now there's nothing to analyze. So I'll have to wait it out for a while until the thing gets some fresh hits to look at. You'd think with the geniuses out there that someone would have found a way to be able to comb log files without having the entire thing be in one single format. :/
Because the AWStats program said it couldn't understand the log format of the file. Somewhere along the way I guess I converted the website configs from "common" to "combined" format and for whatever reason several configs were not performing log rotation. So there's a point in the file somewhere that trips it up because of the change. Which I think is just dumb. Why not simply drop records you can't parse? But hey. It's Linux stuff. I never expect any of it to actually be well built :P
I'd tend to agree, except that SmaugMuds.org gets 20x the actual visitor traffic and only just barely clears about 1GB in bandwidth per month. It's basically just the guy's blog which isn't exactly wildly popular for content. There's something in there though that apparently caught someone's attention and neither of us has been able to figure out what yet. Though I'm assuming once the logs properly populate again we'll see at least down to the page where the bandwidth is going. He does have an unusually large number of image files, but then so does my own blog and I don't see anywhere near the bandwidth usage for that despite my image gallery sharing a lot in common with his. My visitor counts are also a lot higher than his. It could just be that the blog software he's using is terribly inefficient. But we'll see.
I still don't understand why you can't just grep the files, count lines, count frequencies, and look for entries that come up a lot. Nor do I understand why you can't convert the "bad" logs into whatever format your tool uses, or just use another tool. And by the way, if you're so unhappy with the Linux stuff, you can always run your servers on Windows or Mac. :wink:
How exactly would I grep the log for something I don't even know is there? The file is 120MB because of the lack of rotation on it. If I knew what the problem was I'd hardly need an analysis tool to tell me.
Well, you need to extract the URLs that are visited. You would need to figure out the format – it should be pretty obvious – and then either grep according to that regular expression, or write a very short perl (or php or whatever) script to extract the URLs. Then you can load up the URL list in Excel or something and get frequency counts.
Or, if you're already writing a script to get the URLs from a regex, maintain a mapping from URL to count and spit out the counts at the end. The script would still be only a few relatively simple lines.
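For what it's worth, the URL-to-count mapping described above could look something like the following. This is only a minimal sketch: the method names `url_counts` and `top` are made up for illustration, and it assumes the request line looks like "GET /path HTTP/1.1", which is true of both the "common" and "combined" Apache formats.

```ruby
# Tally how often each requested path appears in an Apache access log.
# Assumption: the request is logged as "METHOD /path HTTP/x.y".
REQUEST = /"(?:GET|POST|HEAD) (\S+) HTTP/

# Returns a Hash mapping each requested path to its hit count.
def url_counts(lines)
  counts = Hash.new(0)
  lines.each do |line|
    counts[$1] += 1 if line =~ REQUEST
  end
  counts
end

# Returns the n most requested paths as [path, hits] pairs, busiest first.
def top(counts, n = 20)
  counts.sort_by { |url, hits| [-hits, url] }.first(n)
end

# Only runs when a log file is given on the command line.
if __FILE__ == $PROGRAM_NAME && !ARGV.empty?
  # Binary mode so stray non-UTF-8 bytes in the log do not raise.
  File.open(ARGV[0], "rb") do |log|
    top(url_counts(log)).each { |url, hits| puts "#{hits}\t#{url}" }
  end
end
```

Run it with the log file as the only argument. It reads the file line by line, so a 120MB log should not be a problem.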
#!ruby
# Print the first 50 requested URLs from the access log.
File.open("/c/xampp/apache/logs/jlsysinc_access.log") do |e|
  count = 0
  e.each do |l|
    # Skip anything that is not a GET request line.
    next unless l =~ /GET (.*) HTTP/
    puts $1
    count += 1
    break if count >= 50
  end
end
This should print the URLs from the first 50 GET entries in your log. You may need to adjust the pattern depending on how URLs appear in your log. I use virtual servers, so the URLs are relative and each server has its own log file.
Pheh, Awstats has nothing to do with Linux, it runs on (and is just as dumb on) Windows hosts too, you know. :p I do know what you mean about Awstats being crap, though – it manages to magically break at least once a month on my company's servers. That said, I thought that awstats does drop records it doesn't understand, so I'm a bit confused as to what problem you're running into with it. You might just have it configured wrong; it's kind of awkward to set up. If it is crapping out at the point where the format changes, it should be telling you which line, and cutting the file at that point is not something particularly complicated to do at all.
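If the tool doesn't report the offending line, finding the point where the format changes is easy enough to script. The sketch below is an assumption-laden illustration, not anything awstats provides: `classify` and `format_changes` are hypothetical names, and the patterns assume stock Apache "common" and "combined" lines, where "combined" is "common" plus quoted referer and user-agent fields.

```ruby
# Find the line numbers where an access log switches between the Apache
# "common" and "combined" formats.
QUOTED   = /"(?:[^"\\]|\\.)*"/  # a quoted field, allowing escaped quotes
HEAD     = /\S+ \S+ \S+ \[[^\]]+\] #{QUOTED} \d{3} (?:\d+|-)/
COMMON   = /\A#{HEAD}\s*\z/
COMBINED = /\A#{HEAD} #{QUOTED} #{QUOTED}\s*\z/

# Returns :combined, :common, or :unknown for a single log line.
def classify(line)
  return :combined if line =~ COMBINED
  return :common if line =~ COMMON
  :unknown
end

# Returns [line_number, format] for the first line and for every line whose
# format differs from the line before it.
def format_changes(lines)
  changes = []
  previous = nil
  lines.each_with_index do |line, i|
    current = classify(line)
    changes << [i + 1, current] if current != previous
    previous = current
  end
  changes
end

# Only runs when a log file is given on the command line.
if __FILE__ == $PROGRAM_NAME && !ARGV.empty?
  File.open(ARGV[0], "rb") do |log|
    format_changes(log).each { |number, kind| puts "line #{number}: #{kind}" }
  end
end
```

With the line number in hand, the file can be split there and each half fed to the analyzer with the matching format.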
A quick grep for a specific date can easily give you the number of requests for a one day period (Monday and Friday are the best days to use). Compare that to other hosts. After that, you may need to start doing some more in-depth analysis of requests. The log files should show the response length, which is quite useful. Awstats will make nice charts and tables of all that for you once you get it running, of course. It should have useful data for you after just one day, so by the time you read this, it might already be able to show you the problem.
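Until awstats has data, the per-day request counts and the response lengths can be pulled out by hand. Here is a rough sketch, assuming the usual Apache layout where the status code and byte count follow the quoted request; `bandwidth_report` is a made-up name, and lines that don't fit the pattern are simply skipped.

```ruby
# Sum response bytes per URL and count requests per day from an Apache log.
# Assumption: lines contain [dd/Mon/yyyy:...] "METHOD /path ..." status bytes.
LINE = /\[(\d{2}\/\w{3}\/\d{4}):[^\]]*\] "\S+ (\S+)[^"]*" (\d{3}) (\d+|-)/

# Returns [bytes_by_url, hits_by_day], both Hashes.
def bandwidth_report(lines)
  bytes_by_url = Hash.new(0)
  hits_by_day  = Hash.new(0)
  lines.each do |line|
    m = LINE.match(line)
    next unless m
    day, url, _status, size = m.captures
    hits_by_day[day] += 1
    bytes_by_url[url] += size.to_i  # a "-" size (no body sent) counts as 0
  end
  [bytes_by_url, hits_by_day]
end

# Only runs when a log file is given on the command line.
if __FILE__ == $PROGRAM_NAME && !ARGV.empty?
  bytes_by_url, hits_by_day = File.open(ARGV[0], "rb") { |log| bandwidth_report(log) }
  hits_by_day.each { |day, hits| puts "#{day}\t#{hits} requests" }
  bytes_by_url.sort_by { |_, bytes| -bytes }.first(20).each do |url, bytes|
    puts format("%10.1f MB  %s", bytes / 1_048_576.0, url)
  end
end
```

Sorting by bytes rather than by hits matters here: one rarely requested but very large page will not stand out in a hit count, but it will in the byte totals.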
Story time! One of my company's clients recently started having huge bandwidth usage for no obvious reason. The problem turned out to be some dipstick in China having a misconfigured DNS server or something; a small range of IPs was apparently trying to access Google's website using the client's website's IP. He runs a kinda weird service on that IP that does some magic with subdomains (kinda like Livejournals); when it gets a request for a host name it doesn't recognize, the code redirects to google.com instead of showing an error or anything useful. So the broken clients were trying to go to google, loading up his service, which saw the Host: google.com header and redirected the client to… google.com. Thus he got a big redirect loop causing each request from the broken machines to actually generate like 40 requests to his site before the browser likely gave up. Even though the responses were just tiny little HTTP redirects, he was getting a gigantic volume of these requests, resulting in huge bandwidth usage. We just blocked the problem IP block in iptables and his bandwidth usage went down to normal. (We also convinced him to change the code to display an informative error instead of redirecting for unknown host requests.) The moral of that story is that there might not even be anything about his site causing the bandwidth spikes; it could just be the Chinese. :p
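A problem like the one in that story shows up quickly if you tally requests per client address. The following is just a sketch of the idea, with hypothetical names (`Stats`, `host_stats`) and the assumption that the client host is the first field on each line, as it is in the stock Apache formats.

```ruby
# Per-client totals: number of requests, how many were 3xx redirects, and bytes.
# Assumption: the host is the first field and status/bytes follow the request.
HOST_LINE = /\A(\S+) .*?" (\d{3}) (\d+|-)/
Stats = Struct.new(:requests, :redirects, :bytes)

# Returns a Hash mapping each client host to its Stats.
def host_stats(lines)
  stats = Hash.new { |hash, host| hash[host] = Stats.new(0, 0, 0) }
  lines.each do |line|
    m = HOST_LINE.match(line)
    next unless m
    entry = stats[m[1]]
    entry.requests += 1
    entry.redirects += 1 if m[2].start_with?("3")
    entry.bytes += m[3].to_i  # a "-" size counts as 0
  end
  stats
end

# Only runs when a log file is given on the command line.
if __FILE__ == $PROGRAM_NAME && !ARGV.empty?
  stats = File.open(ARGV[0], "rb") { |log| host_stats(log) }
  stats.sort_by { |_, s| -s.requests }.first(20).each do |host, s|
    puts "#{host}\t#{s.requests} requests\t#{s.redirects} redirects\t#{s.bytes} bytes"
  end
end
```

The logged byte count normally excludes response headers, so for a flood of tiny redirects the request count is probably the more telling column.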
Samson, I do not know if you have solved your issues with this yet, but I just found that from within Virtualmin you can generate reports and the like from awstats. It seems pretty trivial to get a real idea of who and what is hitting the site.
Reported period: Month Nov 2008
First visit: NA
Last visit: 27 Nov 2008 - 02:50

                  Unique visitors  Number of visits  Pages  Hits   Bandwidth
Viewed traffic *  542              1149              4263   23903  441.28 MB (393.26 KB/Visit)
I was shocked that I was getting as much traffic to my site as I currently am, and it would also explain the influx of players I have seen during this period. 14 in a month on a game that is not even open is fairly significant.
02 Feb, 2009, Rojan QDel wrote in the 15th comment:
If this is still relevant at all, it is possible to import back-logged Apache logs into awstats by first backing up the awstats database, unzipping the rotated log files, manually running the stats database generator on each log file from the oldest to the newest while specifying the proper format for each, and then restoring the backed-up database file. Though needlessly complex, this process will allow you to import logs of varying formats and dates. The awstats_update documentation should have details on what options to specify for the format and file name.
Well it took us the better part of a month to narrow things down but it turned out his blog software was making a massive one page archive link of everything he'd ever posted in the last 5 years. Every last post. Pictures included. The behemoth was several megabytes. Since the link was terminated and the page it went to deleted the bandwidth usage has dropped to normal levels. I don't think the page was something he intentionally generated either. Just one of those things you never pay much attention to. He's now down to an average of 300MB a month, which seems about right.