IP Netblocks WHOIS Database Blog

Posted on October 8, 2018


Who owns the Internet? IP Netblocks WHOIS Data will tell you

Who is behind IP addresses? The use of netblocks data

The virtual world of the Internet can be linked to physical entities such as organizations or individuals via only a few techniques. One of them starts from the IP address: the unique number associated with each machine connected to the Internet. Since such an address is technically necessary for any networked machine to operate, and every Internet communication includes this information, it is a very efficient and viable starting point for revealing the ownership of the infrastructure and the hierarchy behind its definition.

As is well known, in the currently prevalent IPv4 address system, computers are addressed with 32-bit numbers. The ownership of these numbers is organized into blocks, defined in a way well-suited for routing. This means that the whole range of 32-bit numbers is subdivided into non-overlapping intervals. Each interval is assigned to some entity. The intervals are then subdivided into smaller intervals, again belonging to entities, and so on. Hence a hierarchy of intervals is formed, which reflects a hierarchy of the respective entities. The intervals themselves are termed netblocks, and these can be assigned to owners.

More specifically, the biggest blocks (those with the largest number of addresses) are assigned by the Internet Assigned Numbers Authority (IANA) to regional Internet registries (RIRs). For instance, RIPE NCC, the European RIR, administers a block with over sixteen million addresses. The RIRs, each in control of a large geographical area, assign smaller blocks (i.e. intervals) within their big blocks to local Internet registries (LIRs). These in turn may subdivide their blocks again among other entities at a lower level of the hierarchy. Finally, end users receive the smallest blocks to operate their networks. They are usually assigned their interval of addresses by their Internet Service Providers (ISPs), but there can be more complex situations, e.g. some networks are run with multiple ISPs.

So, given a set of IP addresses, you can find out which network they belong to and who is behind this network. IP Netblocks whois data contain this ownership information. This is very useful in many applications. Let’s name a few:

  • Having IPs collected from your or your clients’ firewall logs, you can identify the networks they have come from as well as their owners. Clearly this can be of fundamental importance, e.g. when investigating a cyber attack or in certain network security solutions.
  • By collecting IPs from your web server’s log and supplementing them with IP Netblocks whois data, you obtain a data set from which you can deduce the structure and dynamics of your traffic. Such information can have significant marketing implications, e.g. showing from which networks you are visited most frequently.
  • If you operate a network, you may be interested in extending it by purchasing neighboring netblocks or finding other free blocks for sale.
  • If you want to provide access to your clients via network filtering (e.g. a software license or a journal subscription), you can identify their networks from netblocks data.
  • You can address various interesting scientific research problems based on the observation of the structure and dynamics of netblocks.

There can be many other reasons why you might want to query a complete database of netblocks and their owners, that is, IP Netblocks whois data. The bad news is that, owing to the large number of entities involved, collecting accurate, up-to-date or historic IP Netblocks whois data looks like a painful challenge, and indeed it is. There is some good news though: you can obtain these data comprehensively by purchasing them from WhoisXML API, Inc. And the size of the data sets, although not small, still allows you to handle them on your own hardware by building a database from the purchased and downloaded data.

In the present blog we show some simple examples of downloading and using IP Netblocks whois data. First we are going to play around with netblocks and their notation. Then we’ll show how to load netblocks data into a NoSQL database and query them. We shall also demonstrate that, after some prefiltering, you may even analyze these data by loading them into memory with a high-level programming language. (Our language of choice will be Python, as it is free, easy to read, and equipped with many tools for analyzing data. A similar approach can be followed with other tools, too.)

Netblocks, CIDR, etc. for the newcomer

Let us first understand how a block of IP addresses is defined. (Those familiar with this will probably want to skip this section.) It is done by Classless Inter-Domain Routing (CIDR). The main idea is variable-length subnet masking (VLSM): a netblock is a set of IP addresses in which the first (i.e. most significant) n bits, the CIDR prefix bits, are kept fixed.

It is simplest to understand this through an example. (A more detailed explanation can be found on this Wikipedia page.) To play around, we use the netaddr package in Python 3. Although we need nothing more than converting some numbers to binary format and masking some bits, this package, designed to handle IP addresses and networks, does exactly what we need, and you are sure to find similar packages in your favorite environment. Further details of the package can be found in this tutorial. So, possibly after running “pip3 install netaddr” in your shell or Windows command line, start Python 3, type “import netaddr” and, for the sake of simplicity, “from netaddr import *” to follow the example (“>>>” stands for Python’s prompt). Or you may just read on without reproducing the steps.

Let’s take the netblock 104.16.0.0/12 in the CIDR notation. Here 104.16.0.0 is an IP address, while the number 12 indicates that the 12 most significant bits specify the actual netblock, while the rest can vary, yielding the available addresses. So, setting

                    
>>> example_network=IPNetwork('104.16.0.0/12')
                    
                

we will have a netblock of size

                    
>>> example_network.size
1048576
                    
                

that is, 2^(32-12)=2^20 addresses, as they are distinguished by the 20 least significant bits.

It will start with the IP

                    
>>> example_network[0]
IPAddress('104.16.0.0')
                    
                

or in binary format:

                    
>>> example_network[0].bits()
'01101000.00010000.00000000.00000000'
                    
                

and end with

                    
>>> example_network[-1]
IPAddress('104.31.255.255')
>>> example_network[-1].bits()
'01101000.00011111.11111111.11111111'
                    
                

Observe that indeed the first (most significant) 12 bits are the same, while the rest run through all possible values in the block; this defines an interval of neighboring integers, so it is indeed a block. If this block is assigned to someone, sub-blocks can be defined by taking an address within it and fixing more than 12 bits as the CIDR prefix. These can then be delegated to smaller distributors or assigned to clients. Note that this is a very logical way of subdividing the whole range of IP addresses into a hierarchy of smaller and smaller non-intersecting subintervals.
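For instance, netaddr can enumerate the sub-blocks of a given prefix length; splitting our example /12 block into /14 blocks goes like this:

>>> list(example_network.subnet(14))
[IPNetwork('104.16.0.0/14'), IPNetwork('104.20.0.0/14'), IPNetwork('104.24.0.0/14'), IPNetwork('104.28.0.0/14')]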

An alternative way to specify such a block is to give its range, i.e. its starting and ending addresses. How can we make sure that a given range is indeed a CIDR netblock? It is enough to convert the first and the last address to binary: if they have a common prefix, and the remaining bits are all zeros for the first address and all ones for the last, we have a contiguous CIDR block. Netaddr does us this favor:

                    
>>> IPRange('104.16.0.0', '104.31.255.255').cidrs()
[IPNetwork('104.16.0.0/12')]
                    
                

(The result can be a list of multiple networks, in which case our first and last addresses do not define a contiguous netblock.)
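For instance, extending our range by one more /16 block yields a list of two networks:

>>> IPRange('104.16.0.0', '104.32.255.255').cidrs()
[IPNetwork('104.16.0.0/12'), IPNetwork('104.32.0.0/16')]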

In the data available from WhoisXML API, the blocks are defined in the following form:

                    
'inetnum': '104.16.0.0 - 104.31.255.255',
'inetnumFirst': 1745879040,
'inetnumLast': 1746927615,
                    
                

So they contain the first and the last address of the block, both in dotted IPv4 and in decimal integer format. These can be converted to the CIDR notation e.g. as shown in the last line of our previous example. However, even without any conversion it is easy to check whether an address belongs to a block: the address should fall within the given interval.
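Staying with netaddr for a moment, such a containment check is a simple integer comparison; a minimal sketch with the block above and an arbitrary example address:

>>> int(IPAddress('104.27.177.3'))
1746645251
>>> first, last = 1745879040, 1746927615
>>> first <= 1746645251 <= last
True

Now let us look for some data and carry out some simple investigations.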

Downloading data from the IP Netblocks feed

The data are provided on a download server with simple password authentication. A simple interactive way is to open the feed in a browser and save the file. The URL is ftp://datafeeds.whoisxmlapi.com:21210/IP_Netblocks_WHOIS_Database/, where the data files are to be found. Use your API key as both username and password for authentication.

To download in an automated fashion, a suitable utility such as wget can be used. As an example, to download the file “ip_netblocks.2018-09-09.full.jsonl.gz” you can simply do

                    
wget --user YOUR_API_KEY --password YOUR_API_KEY ftp://datafeeds.whoisxmlapi.com:21210/IP_Netblocks_WHOIS_Database/ip_netblocks.2018-09-09.full.jsonl.gz
                    
                

which can then easily be integrated into various scripts you may write.
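If you prefer to stay within Python, here is a minimal download sketch along the same lines (YOUR_API_KEY stands for your actual API key):

#!/usr/bin/python3
# Minimal download sketch: urllib handles ftp:// URLs with inline credentials.
import urllib.request

API_KEY = 'YOUR_API_KEY'  # the API key serves as both username and password
FILENAME = 'ip_netblocks.2018-09-09.full.jsonl.gz'
URL = ('ftp://%s:%s@datafeeds.whoisxmlapi.com:21210/'
       'IP_Netblocks_WHOIS_Database/%s' % (API_KEY, API_KEY, FILENAME))
urllib.request.urlretrieve(URL, FILENAME)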

Netblocks NoSQL database example: MongoDB

For those who have not encountered it: MongoDB is a popular document database, designed to store data in flexible JSON documents and query them in a way tailored to a given application. We have just downloaded a massive set of JSONs, so MongoDB looks very suitable for handling them. Let us give it a try.

Instead of installing a vanilla MongoDB, we opt for the Percona Server for MongoDB, a product of Percona with certain enhancements over the vanilla community edition, which we found useful, especially regarding performance. This can be downloaded for a variety of Linux distributions from their web page: https://www.percona.com/downloads/percona-server-mongodb-LATEST

It is free software, so to follow this example, install it first. To keep things simple, we do not set up authentication for our DB in this example. If you do, you will need to include the authentication-related options in the command lines below.

It is very easy to import the downloaded data, especially since their format (jsonl, that is, one JSON document per line) is just the native import format of MongoDB. A MongoDB database is organized into collections; within a collection, each record is termed a “document”. We shall have a database named “netblocks”, and the downloaded file will become a collection within it. To create all this, all we need is our downloaded (and uncompressed) file and the execution of the following:

                    
mongoimport --db "netblocks" --collection netblocks_current_2018_09_09 --file ip_netblocks.2018-09-09.full.jsonl
                    
                

We shall then see a progress report starting like:

                    
2018-09-07T12:16:18.889+0000	connected to: localhost
2018-09-07T12:16:21.887+0000	[........................] netblocks.netblocks_current_2018_09_09	39.4MB/3.52GB (1.1%)
2018-09-07T12:16:24.887+0000	[........................] netblocks.netblocks_current_2018_09_09	76.5MB/3.52GB (2.1%)
                    
                

and finally, we shall have

                    
2018-09-07T12:21:09.175+0000	[########################] netblocks.netblocks_current_2018_09_09	3.52GB/3.52GB (100.0%)
2018-09-07T12:21:09.175+0000	imported 8698953 documents
                    
                

when it is done. As you can see, it took about 5 minutes to import 3.52 gigabytes, that is, 8,698,953 records (the server used has 8 Intel(R) Xeon(R) E5345 CPU cores @ 2.33GHz and 16 gigabytes of memory, not quite oversized for the task).

Let’s play around with our data. First we need to start the db shell:

                    
mongo --shell
                    
                

In the MongoDB shell we instruct MongoDB to use the appropriate database:

                    
use netblocks
                    
                

the reply will be

                    
switched to db netblocks
                    
                

Our task will be to find out how many records there are with the country code “UK”. To do so, we need to run the following query:

                    
db.getCollection("netblocks_current_2018_09_09").find({"country":"UK"}).count()
                    
                

The resulting number is 76. When omitting “.count()”, we get the actual records instead. Now let us look for the records in which the starting address begins with “206.225.82”. We can find them using regular expressions, for instance:

                    
db.getCollection("netblocks_current_2018_09_09").find({"inetnum":{$regex:"^206\.225\.82\."}})
                    
                

This quick demonstration shows how efficiently one can deal with netblocks data in MongoDB. We refer to the documentation of MongoDB for further details on how to build more sophisticated queries.
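One more teaser: the records carry the block boundaries as integers (“inetnumFirst” and “inetnumLast”), so after building a compound index, all the blocks containing a given IP can be found with a range query; a sketch in the mongo shell (1746645251 is the integer form of 104.27.177.3, as computed earlier):

db.getCollection("netblocks_current_2018_09_09").createIndex({"inetnumFirst": 1, "inetnumLast": 1})
db.getCollection("netblocks_current_2018_09_09").find({"inetnumFirst": {$lte: 1746645251}, "inetnumLast": {$gte: 1746645251}})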

Direct processing, e.g. in Python

A complete daily netblocks file is large, but at least part of it may fit into the memory of even an average computer. So one may consider writing custom software to work with the data, especially when the goal requires more complex processing than a database system can offer, so that one can benefit from the power of a general-purpose programming language. In many cases a combination of a database system (e.g. MongoDB) with the programming language is the better solution, but loading the data into some native data structure of the language and handling them in memory can be efficient as well.

For the files containing daily changes (the “daily” files), this latter approach is certainly viable with single files. Processing complete databases (the “full” files) may hit the limits of the actual machine, as we shall see in the following example.

In what follows we describe a simple example of such an application in Python; it can also be viewed as motivation to do something similar in your preferred language. In Python, the loaded records translate easily into the native dictionary type, with which we can do anything Python allows. For instance, we can put these dictionaries into a list and filter it with highly customized functions, or use the data to build graph representations revealing structural properties our research looks for. The attractive part is that the full arsenal of Python packages is at hand to process the data.

Consider the following use case: we want to write a simple program that reports all the netblocks a given IP belongs to. It is sufficient to deal with those records in which the organization the netblock belongs to is not empty. Our input file will be “ip_netblocks.2018-09-09.full.jsonl”.

The following code does the job. Instead of “netaddr” we use “ipaddress” here, which is part of Python’s standard library. Though “netaddr” has more capabilities, the present task is feasible with this simpler package, too, illustrating another option. The code reads as follows:

                    
1 #!/usr/bin/python3
2
3 import json
4 import sys
5 import re
6 import ipaddress
7 import pprint
8 import datetime
9
10 checked = 0
11 records = []
12 for rawrecord in open('ip_netblocks.2018-09-09.full.jsonl', 'r'):
13     record = json.loads(rawrecord)
14     if record['org']['org'].strip() != '':
15         records.append(record)
16     checked += 1
17     if checked % 100000 == 0:
18         sys.stderr.write('%d records checked.\n' % (checked, ))
19         sys.stderr.flush()
20
21 sys.stderr.write('%d records in memory.\n' % len(records))
22
23 def IPaddressinrange(address, iprange):
24     iprange = iprange.split(' - ')
25
26     try:
27         iprange_begin = ipaddress.IPv4Address(iprange[0])
28         iprange_end = ipaddress.IPv4Address(iprange[1])
29         query = ipaddress.IPv4Address(address)
30     except Exception:
31         return False
32     return iprange_begin <= query <= iprange_end
33
34 while True:
35     query = input('Enter IP to be queried: ')
36     starttime = datetime.datetime.now()
37     result = filter(lambda x: IPaddressinrange(query, x['inetnum']), records)
38
39     print('Results:')
40     try:
41         for record in result:
42             pprint.pprint(record)
43     except Exception:
44         print('Not found.')
45     endtime = datetime.datetime.now()
46     print('The query took %d seconds' % (endtime - starttime).total_seconds())
                    
                

Let us see how it works. In lines 10 to 21 we load those records which have an organization name into memory. Line 12 loops through the lines of the file. Each line is converted into a Python dictionary (associative array) in line 13. In lines 14-15 we append the actual record to the Python list (array) “records” if it meets our criterion of having an organization name. (Note that the JSON fields containing JSON themselves are also properly converted.) Lines 16-19 allow the user to monitor the reading of the file. Finally, in line 21 we report the number of records actually loaded.

The function in lines 23-32 decides if an IP belongs to a block defined in the way it is done in the “inetnum” field of our data, e.g. “104.16.0.0 - 104.31.255.255”. The block is given as a string argument, which is split into a pair of two IP strings in line 24. In lines 27-29 we start taking advantage of the “ipaddress” package in Python: we try converting these strings, as well as the one to be queried, into “IPv4Address”-es. If we do not manage, we simply return “False”. Finally, in line 32, the function returns whether the queried IP is between the starting and ending IPs, taking advantage of the relational operators of “IPv4Address”-type data. (You might well say that we could have used the “inetnumFirst” and “inetnumLast” fields for this purpose, which are integers. Indeed, that would have been a valid solution, but we wanted to illustrate the power of having a specialized package of the programming language at hand.)
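As a quick illustration, the comparison in line 32 works because “IPv4Address” objects support the relational operators directly:

>>> import ipaddress
>>> ipaddress.IPv4Address('104.16.0.0') <= ipaddress.IPv4Address('104.27.177.3') <= ipaddress.IPv4Address('104.31.255.255')
True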

The main loop starts in line 34. The user is asked for an IP in line 35. Lines 36, 45, and 46 are there to provide information on how much time the query took. In lines 39 to 44 we print the results, and then we go for the next query.

An example session looks like this (we omit many “...records checked” lines from the beginning):

                    
8500000 records checked.
8600000 records checked.
3426120 records in memory.
Enter IP to be queried: 104.27.177.3
Results:
{'abuseContact': [],
'adminContact': [],
'city': '',
'country': 'EU # Country field is actually all countries in the world and not '
'just EU countries',
'inetnum': '0.0.0.0 - 255.255.255.255',
'inetnumFirst': 0,
'inetnumLast': 4294967295,
'mntBy': [{'email': 'helpdesk@apnic.net', 'mntner': 'APNIC-HM'}],
'mntDomains': [],
'mntLower': [{'email': 'unread@ripe.net', 'mntner': 'RIPE-NCC-HM-MNT'}],
'mntRoutes': [{'email': 'ggm@pobox.com', 'mntner': 'MAINT-AU-APNIC-GM85-AP'}],
'modified': '2012-02-08T09:09:31Z',
'netname': 'IANA-BLK',
'org': {'email': 'unread@ripe.net',
'name': 'Internet Assigned Numbers Authority',
'org': 'ORG-IANA1-RIPE',
'phone': ''},
'source': 'RIPE',
'techContact': []}
{'abuseContact': [],
'adminContact': [],
'city': '',
'country': '',
'inetnum': '104.0.0.0 - 104.255.255.255',
'inetnumFirst': 1744830464,
'inetnumLast': 1761607679,
'mntBy': [],
'mntDomains': [],
'mntLower': [],
'mntRoutes': [],
'modified': '',
'netname': 'NET104',
'org': {'email': '',
'name': 'American Registry for Internet Numbers',
'org': 'ARIN',
'phone': ''},
'source': 'ARIN',
'techContact': []}
{'abuseContact': [{'email': 'abuse@cloudflare.com',
'id': 'ABUSE2916-ARIN',
'person': 'Abuse',
'phone': ''}],
'adminContact': [],
'city': '',
'country': '',
'inetnum': '104.16.0.0 - 104.31.255.255',
'inetnumFirst': 1745879040,
'inetnumLast': 1746927615,
'mntBy': [],
'mntDomains': [],
'mntLower': [],
'mntRoutes': [],
'modified': '',
'netname': 'CLOUDFLARENET',
'org': {'email': '',
'name': 'Cloudflare, Inc.',
'org': 'CLOUD14',
'phone': ''},
'source': 'ARIN',
'techContact': [{'email': 'rir@cloudflare.com',
'id': 'ADMIN2521-ARIN',
'person': 'Admin',
'phone': ''}]}
The query took 137 seconds.
                    
                

We have queried for the IP “104.27.177.3” which is that of www.domainwhoisdatabase.com, one of the web servers of WhoisXML API, Inc. From the results we clearly see the hierarchy of the netblocks involved.

What are the lessons to learn from this example? First of all, the data provided are suitable for reading by high-level programming languages such as Python: it was as simple as lines 12-13 of the code show. Secondly, if the data are all loaded into memory in some native format, we can use all the tools available in the given language to process them. In this simple example we exploited this when deciding if an IP is within a given range. But you may envisage far more complicated ideas, e.g. building a graph of a subset of the records according to some criterion and analyzing it with the “networkx” graph package of Python, or reading many files and performing time-series analysis of netblocks’ behavior with the data analysis library “pandas”, etc.
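To hint at what such a graph-based analysis might look like, here is a minimal sketch (assuming “records” has been loaded as in our program) which connects each organization name to the netblocks it owns, using “networkx”; node degrees then give e.g. the number of blocks per organization:

# Minimal sketch: a bipartite organization-netblock graph from the loaded records.
import networkx as nx

G = nx.Graph()
for r in records:
    orgname = r['org'].get('name', '').strip()
    if orgname:
        # an edge between the organization and each of its netblocks
        G.add_edge(orgname, r['inetnum'])

# the ten nodes (organizations or blocks) with the most connections:
print(sorted(G.degree, key=lambda node_deg: node_deg[1], reverse=True)[:10])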

Naturally, the query time of 137 seconds sounds a bit disappointing. The actual machine used for the test had 16 gigabytes of memory and Intel(R) Xeon(R) E5345 CPUs @ 2.33GHz; a single core was used, of course, as we had no threading in the code. Note that we were filtering 3,426,120 records altogether. These did fit into memory (other applications and services were also running normally), but in our experience we would have run into problems had we had to process more than about 5 million records. In that case, building a database and reading just the relevant data into Python via a database connection is the better approach. We were able to drop many records thanks to the requirement of a non-empty organization name. Extending the physical memory would have solved the mere loading of the data, but the query was already slow at the given number of records. In conclusion: using a list of dictionaries in this fashion becomes really efficient only if the data are strongly prefiltered upon loading. We did prefilter by dropping the records without an organization name, which solved the issue of the size in memory. But the query time was still relatively long, which may be acceptable in research tasks where the calculation has to be performed just once, but definitely not if this is part of, say, an online mail filter. To do it more efficiently, one either has to write more sophisticated code with supplementary index data structures and, of course, multi-threading, or to use an underlying database specialized in searching; the latter resolves the efficiency issue, but the price to be paid is losing the elegant comparison of IPs in line 32 of the code.
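To illustrate the kind of supplementary indexing meant above, here is a minimal sketch (again assuming the “records” list of our program): sorting the records once by “inetnumFirst” lets a binary search discard, in a single step, all the blocks starting above the queried address.

# Minimal indexing sketch: sort once, then bisect to narrow each query.
import bisect
import ipaddress

records.sort(key=lambda r: r['inetnumFirst'])
firsts = [r['inetnumFirst'] for r in records]

def blocks_containing(ip):
    value = int(ipaddress.IPv4Address(ip))
    # only records with inetnumFirst <= value can contain the address
    hi = bisect.bisect_right(firsts, value)
    return [r for r in records[:hi] if r['inetnumLast'] >= value]

This is still a linear scan over part of the list, but it spares the repeated string parsing; a proper interval tree, or the integer-based query shown in the MongoDB section, would do even better.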

Summary

In this blog we have introduced IP Netblocks whois data and their use through simple examples. Our aim was to motivate the reader by illustrating how many interesting, useful, and sometimes challenging applications these data have, how easy they are to obtain, and how efficiently they can be queried even with simple free tools running on average hardware. As WhoisXML API, Inc. provides both full databases and daily updates, you can set up a local database and keep it up to date with a limited amount of daily downloads. (The daily changes can in fact be interesting in themselves, e.g. when studying the dynamics of the IP address ecosystem.) Want to do it your way? Visit https://ip-netblocks-whois-database.whoisxmlapi.com and go for a subscription.
