The Case
This week, we’re hunting phish. Our poor citizens are having to contend not only with exorbitant utility prices and incompetent elected officials, they’re now being targeted by scam callers. What kind of city is Mayor Budskott running here?
But all is not lost, we’ve got the call logs from the city’s telco. Hmm, nothing sketchy about releasing that PII eh Mayor?
Exploring the Logs
Let’s do some initial data exploration. First, how many entries do we have?
Ok, 16.4M. Cool. Let’s see what a sample looks like.
So we have a timestamp of connect and disconnect events, a connection id and some packed json properties. Let’s see how are calls distributed over time.
Seven days of data with between 1M and 2.75M records per day. Let’s look at how we can unpack the dynamic properties. We’ll use 6 hours of data just to keep things manageable.
The various bag_*
plugins are our friends here. First I’ve made a bag of the event type and timestamp (which will allow us to unpivot in a sec). I’ve mushed this with the existing Properties
column, then summarised everything by CallConnectionId
. Unpacking this gives us a nice pivoted view of the data with one row per call.
Catching the Phish
Thinking about the problem, I’m going to look for the number which makes the most number of calls, probably gets hung up on a lot of the time and has hidden their number. Let’s see how we can summarise the data.
Oops. So much for that plan. By unpacking and reshaping the data, we’ve hit some memory issues with our small cluster. Looks like Mayor B needs to get the cheque book out. Or we can rewrite the query.
So instead of packing and unpacking, let’s do this old skool style with some joins. We’ll join all of the Connect events onto the corresponding Disconnect, then apply the same summarisation as before.
Nice. Well, there’s our scammer. But could we do this more efficiently?
Putting the predicates into the inital join / subquery, summarising by only the distinct count and returning just the top row brings our query down to about 2secs. Cool.
Notice the dcount is different between the two queries - the first is all distinct destinations (248) and the second is distinct destinations that terminated the call. For us, this isn’t really important, so case closed.