Internet Topology Mapping

Originally posted on DataScienceCentral, by Dr. Granville. Click here to read original article and comments.

This is a component often missing, yet valuable for most systems, algorithms and architectures that are dealing with online or mobile data, known as digital data: be it transaction scoring, fraud detection, online marketing, marketing mix and advertising optimization, online search, plagiarism and spam detection, etc.

I will call it an Internet Topology Mapping. It might not be stored as a traditional database (it could be a graph database, a file system, or a set of look-up tables). It must be pre-built (e.g. as look-up tables, with regular updates) to be efficiently used.

So what is the Internet Topology Mapping?

Essentially, it is a system that matches an IP address (Internet or mobile) with a domain name (ISP). When you process a transaction in real time in production mode (e.g. an online credit card transaction, to decide whether to accept or decline it), your system only has a few milliseconds to score the transaction to make the decision. In short, you only have a few milliseconds to call and run an algorithm (sub-process), on the fly, separately for each credit card transaction, to decide on accepting/rejecting. If the algorithm involves matching the IP address with an ISP domain name (this operation is callednslookup), it won’t work: direct nslookups take between a few hundreds to a few thousands milliseconds, and they will slow the system to a grind.

Because of that, Internet Topology Mappings are missing in most systems. Yet there is a very simple workaround to build it:

Look at all the IP addresses in your database. Chances are, even if you are Google, 99.9% of your traffic is generated by fewer than 100,000,000 IP addresses. Indeed, the total number of IP addresses (the whole universe) consists of less than 256^4 = 4,294,967,296 IP addresses. That’s about 4 billion, not that big of a number in the real scheme of big data. Also, many IP addresses are clustered: 120.176.231.xxx are likely to be part of the same domain, for xxx in the range (say) 124-180. In short, you need to store a lookup table possibly as small as 20,000,000 records (IP ranges / domain mapping) to solve the nslookup issue for 99.9% of your transactions. For the remaining 0.1%, you can either assign ‘Unknown Domain’ (not recommended, since quite a few IP addresses actually have unknown domain), or ‘missing’ (better) or perform the cave-man, very slow nslookup on the fly.
Create the look-up table that maps IP ranges to domain names, for 99.9% of your traffic.

When processing a transaction, access this look-up table (stored in memory, or least with some caching available in memory) to detect the domain name. Now you can use a rule system that does incorporate domain names.

Example of rules and metrics based on domain names are: