
Clean up spam in Google Analytics

by in Dev

For the past year or so, Google Analytics has been showing complete garbage for my site, and if you're here, it's probably been happening to you too.

What's probably going on is that some wise guys decided it would be a nice idea to spam Google Analytics by sending fake data to random property IDs.
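
To make that concrete, here is a rough sketch of what such a fake hit could look like, assuming the spam arrives through the Measurement Protocol (the plain HTTP interface that analytics.js itself talks to). The parameter names (v, tid, cid, t, dh, dp) come from that protocol; the helper name and all the values are made up for illustration. Nothing in it requires the sender to ever load your pages.

```javascript
// Builds the query string of a pageview hit in Measurement Protocol format.
// buildPageviewHit is a hypothetical helper, not part of any Google library.
function buildPageviewHit({ propertyId, clientId, hostname, path }) {
  const params = new URLSearchParams({
    v: '1',          // protocol version
    tid: propertyId, // property (tracking) ID the hit is credited to
    cid: clientId,   // anonymous client ID, chosen freely by the sender
    t: 'pageview',   // hit type
    dh: hostname,    // document hostname, also chosen freely by the sender
    dp: path,        // document path
  });
  return params.toString();
}

// A "ghost" hit: guessed property ID, invented hostname and path.
const ghostHit = buildPageviewHit({
  propertyId: 'UA-XXXXXXXX',
  clientId: '555',
  hostname: 'spam.example',
  path: '/free-offer',
});

console.log(ghostHit);
```

The sketch only builds the string and does not send anything.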

So how do you filter out this data then?

The hard way

The only solution I had found so far involved actively filtering out the spam on a case-by-case basis. That meant creating lots of rules or filter expressions, one for each spam source. I absolutely hated this solution because it involves a lot of trial and error, and it's a permanent (losing) battle.

My solution

Instead of always being one step behind the spam, I thought it might be easier to flag the real data somehow and filter out everything else.

One way of doing that is using Custom Dimensions.

Go to your property's Admin > Custom Definitions > Custom Dimensions and create a new dimension:

And here's the custom dimension in the Google Analytics script:

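// standard analytics.js loader snippet:
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','https://www.google-analytics.com/analytics.js','ga');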
ga('create', 'UA-XXXXXXXX', 'auto');
// flag real data:
ga('set', 'dimension1', 'yesplease'); // dimension1: bacon="yesplease"
ga('send', 'pageview');

The next step is to filter out everything except bacon="yesplease".

For this, I create a new View on my property and create a new Filter:

And done.
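
For anyone who prefers to see the logic spelled out, here is a minimal sketch of what an include filter on the custom dimension does. The hit records and the includeFilter helper are hypothetical and heavily simplified; Google Analytics applies the real filter on its side, so this only illustrates the idea.

```javascript
// Hypothetical, simplified hit records. Real hits carry many more fields.
const hits = [
  { page: '/blog/my-post', dimension1: 'yesplease' }, // sent by the snippet on the site
  { page: '/free-traffic-offer' },                    // ghost hit, never ran the snippet
  { page: '/about', dimension1: 'yesplease' },
  { page: '/seo-deals', dimension1: 'nope' },         // spam guessing a wrong value
];

// Mimics an "include" view filter: keep a hit only if the field matches the pattern.
function includeFilter(allHits, field, pattern) {
  return allHits.filter((hit) => pattern.test(String(hit[field] ?? '')));
}

const realHits = includeFilter(hits, 'dimension1', /^yesplease$/);

console.log(realHits.map((hit) => hit.page));
```

The flag value works a bit like a shared secret: anything that did not run the snippet on the page cannot know it, so it falls out of the view.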




Reddit user Groggie posted a very insightful comment in reply to my article on /r/analytics:

What you're doing is exactly a hostname filter. Instead of editing the code to inject an ID (like 'bacon') into the site itself, I'd recommend just using a regex match for your hostname and create a new view/filter with that.

This method will only stop "ghost" spambots - so the spambots that actually visit your site (semalt), will need to be blocked another way.

Oh, how I wish I had known about the hostname property earlier.

Indeed, whenever Analytics tracks a page view, it also records the hostname on which the page was viewed. So a better way to filter out the ghost spam seems to be to add a filter on your hostname:

I've noticed that with this method, visits that come through Google Cache or Google Translate are filtered out too, because they report a different hostname, but that's not a big deal for me.
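
As a rough illustration of the hostname approach, the sketch below tests a few hostnames against an anchored regular expression. It assumes the site lives on example.com and www.example.com; the pattern, the helper name and the sample hostnames are placeholders to adapt, not values taken from my actual filter.

```javascript
// Assumption: the site is served from example.com and www.example.com only.
// The ^ and $ anchors keep look-alikes such as example.com.spam.test from matching.
const ownHostname = /^(www\.)?example\.com$/;

// Sample hostnames as they might show up in a hostname report (illustrative).
const reportedHostnames = [
  'example.com',
  'www.example.com',
  'translate.googleusercontent.com', // page viewed through Google Translate
  'webcache.googleusercontent.com',  // page viewed through Google Cache
  '(not set)',                       // typical for ghost spam
  'example.com.spam.test',
];

// Hypothetical helper: true only for hostnames the site is really served from.
function isOwnHostname(hostname) {
  return ownHostname.test(hostname);
}

const kept = reportedHostnames.filter(isOwnHostname);

console.log(kept);
```

If the cached or translated views matter to you, those hostnames could be added to the pattern as extra alternatives, at the cost of a slightly wider net.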