Q: How do we know what sites a person (like our son) visits?
A: (Any good answer starts with a set of questions to clarify the intent)
- Why do we need to know? [just because we want to know]
- That is not convincing; any real reason? [like protection, such as malware, internet bullying, etc.]
- OK, fine, I guess there is a need.
- Install anti-virus/anti-malware/anti-whatever software on the devices. The problem is that the client device needs to support features like privileged accounts or parental controls. It affects performance (we already dislike anti-virus software enough on the desktops). And it only works on the given device.
- K9 Web Protection came closest, given its excellent lightweight nature, but it didn't have a good mobile version for tablets, and it lacked family support across multiple devices. It also works by "prevention/blocking", instead of passive monitoring.
- A more robust way would be a network-based solution.
- Home routers could have a package providing per-device traffic monitoring with some analytics capabilities. It turned out none did (as of 2014).
The decision was to roll our own solution. Besides being cool, it is also a great opportunity to toy with different technologies (DNS logging, remote logging, Python, PHP, PowerView, etc.).
Building blocks we have at hand, and constraints:
- Firewall router running OpenWrt.
- A Synology DS101 disk station.
- Nothing else: decided NOT to build additional *nix servers.
- Shall we say -- low cost?
The solution can be shown in two views.
Functions View: The following view captures the intended functions of this solution, INDEPENDENT of specific tools and/or techniques. This is one of the standard approaches in the "Solution Architecture" practice, to show the "what" aspect of the solution.
Component Interaction: The following view shows how this solution would work.
- Steps 11, 12, 13, and 14 are the standard user experience: a user with a client device accesses the internet, including the DNS lookup and the general firewall protection. The only change is enabling "Local Logging" of DNS queries on the firewall (step 21), typically via dnsmasq's log-queries option on OpenWrt, with no perceivable user impact.
- The "DNS Query Forwarder" Python script starts as the firewall powers up (step 22), reads the kernel log using the "logread" utility (step 23), identifies DNS query entries, and forwards them to the "Synology Disk Station" syslog facility using netcat (step 24). A sketch of this script appears after this list.
- On the Synology Disk Station, the log entries are rotated into daily files. Log rotation makes the next step more efficient.
- A daily cron job is scheduled to run right before midnight; it reads the daily log entries and looks for the DNS queries (step 25).
- In step 26, the domain names are trimmed for efficiency. For example, us.www.yahoo.com and images.www.yahoo.com are both trimmed to www.yahoo.com, as both subdomains will be of the same category.
- In step 27, the domain "type or category" is determined. The job first looks in the local SQLite database for a cached copy; if the cache has expired or doesn't exist, it connects to WebRoot to determine the type, and caches the result. This is a little more complicated in practice, as the "list of categories" needs to be retrieved during setup and refreshed later as necessary.
- In step 28, the enriched query records (now with the domain type) are consolidated into a daily file, along with other information such as client host name and timestamp. A sketch of steps 25 through 28 also appears below.
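Here is a minimal sketch of what the forwarder could look like. It assumes dnsmasq-style log lines ("dnsmasq[123]: query[A] www.yahoo.com from 192.168.1.10"), a made-up Disk Station address, and a plain UDP socket in place of netcat; it is an illustration, not the exact script.

#!/usr/bin/env python
# DNS Query Forwarder sketch (steps 22-24). Assumptions: dnsmasq logs
# queries to the system log, and the Disk Station accepts plain UDP
# syslog on port 514. The address below is hypothetical.
import re
import socket
import subprocess

SYSLOG_HOST = "192.168.1.20"  # hypothetical Disk Station address
SYSLOG_PORT = 514

# dnsmasq logs queries as: "dnsmasq[123]: query[A] www.yahoo.com from 192.168.1.10"
QUERY_RE = re.compile(r"query\[\w+\]\s+\S+\s+from\s+\S+")

def main():
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # "logread -f" follows the OpenWrt system log, like "tail -f"
    proc = subprocess.Popen(["logread", "-f"], stdout=subprocess.PIPE, text=True)
    for line in proc.stdout:
        if QUERY_RE.search(line):
            # <134> = syslog priority for facility local0, severity info
            sock.sendto(("<134>" + line.strip()).encode(), (SYSLOG_HOST, SYSLOG_PORT))

if __name__ == "__main__":
    main()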
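And a sketch of the nightly enrichment job (steps 25-28). The file paths, the cache lifetime, and classify_with_webroot() are all placeholders: the real WebRoot API call and the category-list refresh are not shown. It assumes the same dnsmasq log format as above, and could be scheduled with a crontab entry like "55 23 * * * python enrich_dns_log.py".

#!/usr/bin/env python
# Nightly DNS log enrichment sketch (steps 25-28). Paths and names are
# hypothetical; classify_with_webroot() stands in for the vendor API.
import csv
import datetime
import re
import sqlite3

LINE_RE = re.compile(r"query\[\w+\]\s+(?P<domain>\S+)\s+from\s+(?P<client>\S+)")
CACHE_TTL_DAYS = 30  # assumed lifetime of a cached category

def trim_domain(domain):
    # Step 26: keep the last three labels, so images.www.yahoo.com and
    # us.www.yahoo.com both become www.yahoo.com (a naive heuristic).
    parts = domain.lower().split(".")
    return ".".join(parts[-3:])

def classify_with_webroot(domain):
    # Step 27: hypothetical stand-in for the WebRoot classification call.
    return "uncategorized"

def lookup_category(db, domain):
    # Check the SQLite cache first; refresh from WebRoot if stale or missing.
    row = db.execute("SELECT category, fetched FROM categories WHERE domain = ?",
                     (domain,)).fetchone()
    if row:
        age = datetime.date.today() - datetime.date.fromisoformat(row[1])
        if age.days < CACHE_TTL_DAYS:
            return row[0]
    category = classify_with_webroot(domain)
    db.execute("INSERT OR REPLACE INTO categories VALUES (?, ?, ?)",
               (domain, category, datetime.date.today().isoformat()))
    db.commit()
    return category

def main():
    today = datetime.date.today().isoformat()
    db = sqlite3.connect("/volume1/dnslog/categories.db")  # assumed path
    db.execute("CREATE TABLE IF NOT EXISTS categories "
               "(domain TEXT PRIMARY KEY, category TEXT, fetched TEXT)")
    # Step 28: consolidate enriched records into a daily CSV file.
    with open("/volume1/dnslog/dns-%s.log" % today) as log, \
         open("/volume1/dnslog/enriched-%s.csv" % today, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["timestamp", "client", "domain", "category"])
        for line in log:
            m = LINE_RE.search(line)
            if m:
                domain = trim_domain(m.group("domain"))
                # classic syslog timestamps are the first 15 characters
                writer.writerow([line[:15], m.group("client"), domain,
                                 lookup_category(db, domain)])

if __name__ == "__main__":
    main()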
The parents (31) can later use Excel with the PowerPivot extension (32) to pull down the file and do the analysis. Of course, a template was created to make this a repeatable chore.
A Few Limitations:
- This solution only identifies the devices, not exactly the person. Tracking at the person level would require a user identity.
- The "amount of traffic" is also not accurate. A client device may do a DNS query once for a website, then use the result for any number of visits during the TTL (time-to-live) window. The number of visits is not captured in this solution.
Afterthoughts:
- We ended up looking at the dashboard much less than we could have. The dashboard is interesting, but not really actionable. Among the mountain of data, the actual high-risk items are very few, and can hardly tell anything in isolation. For example, a "gambling" website may just be an advertisement.
- Python is a great language for prototyping, as it is an interpreted language. But I really don't like the use of indentation for grouping statements; in comparison, I find PHP closer to modern programming styles (like Java or C++).
- After six months, we disabled it so we don't pay the API fees. Now that we know how, we can reactivate it anytime we want, if ever.
Architecture Musings: This musing is more tolerable since we even bothered to put the design in a nice format. The question that must be asked is: could there be an easier/better way?
- Steps 1x, 21, 22, 23, and 24 are pretty lean. Using a Python script to read the log and forward it on is about as easy as it can be made.
- Instead of using syslog, we "could" have used Apache Flume and sink the output to a file. But that would be equivalent to syslog anyway.
- Of course, we could be more ambitious by using Flume to sink the data into Hadoop (instead of syslog), and Oozie as the scheduler (instead of cron), to create data marts that can be analyzed/viewed later. If so, we could even write a custom Flume interceptor to figure out the domain category on the fly. This is serious over-engineering.
- Another possibility is to send the log files to AWS S3 and use something like Lambda to trigger data enrichment and analysis, which could be dashboard-enabled. But that would be yet another day.

