IP Netblocks WHOIS Database Blog

Updated on February 20, 2019

Read the other articles

Who owns the Internet? IP Netblocks WHOIS Data will tell you

Who is behind an IP address? The use of netblocks data

The virtual world of the Internet can be linked to physical entities such as organizations or individuals via only a few techniques. One possibility is to start from the IP address: the unique number associated with each machine connected to the Internet. As such an address is technically essential for any networked machine to operate and for any Internet communication to take place, this is an efficient and viable approach to revealing the ownership of the infrastructure and the hierarchy behind its allocation.

As is well known, in the currently prevalent IPv4 address system, computers are addressed with 32-bit numbers. The ownership of these numbers is organized on the basis of blocks, defined in a way well-suited to routing. This means that the whole range of 32-bit numbers is subdivided into non-overlapping intervals. Each interval is assigned to some entity. The intervals are then subdivided into smaller intervals, again belonging to entities, and so on. Thus, a hierarchy of intervals is formed, which reflects a hierarchy of the respective entities. The intervals themselves are termed netblocks, and they can be assigned to owners.

More precisely: the biggest blocks (those with the largest number of addresses) are assigned by the Internet Assigned Numbers Authority (IANA) to regional Internet registries (RIRs). For instance, RIPE NCC, the European RIR, administers a block with over sixteen million addresses. The RIRs, which control large geographical areas, assign smaller blocks (i.e. intervals) within their big block to local Internet registries (LIRs), who may then subdivide their blocks among other entities at a lower level of this hierarchy. Finally, end users receive the smallest blocks to operate their networks. They are usually assigned their interval of addresses by their Internet Service Providers (ISPs), but there can be more complex situations, e.g. when a network is run with multiple ISPs.

So, given a set of IP addresses, you can find out which networks they belong to and who is behind those networks. IP Netblocks WHOIS data contain this ownership information, which is very useful in many applications. To name a few:

  • From the IPs collected in your or your clients’ firewall logs, you can identify the networks they have come from and who owns them. This can be of fundamental importance, e.g. when investigating a cyber-attack or in certain network security solutions.
  • By collecting IPs from your web server’s log and supplementing them with IP Netblocks WHOIS data, you obtain a data set from which you can deduce the structure and dynamics of your traffic. Such information can have significant marketing implications, e.g. from which networks you are visited most frequently.
  • You may be interested in extending your network by purchasing neighboring netblocks or in finding other free blocks for purchase.
  • If you want to provide access to your clients via network filtering (e.g. a software license or a journal subscription), you can identify their networks from netblocks data.
  • You can address various interesting scientific research problems based on the observation of the structure and dynamics of netblocks.

There can be many other reasons why you might want to query a complete database of netblocks and their owners, that is, IP Netblocks WHOIS data. The bad news is: owing to the large number of entities involved, collecting accurate, up-to-date or historic IP Netblocks WHOIS data looks like a big and possibly painful challenge, and indeed it is. There is good news, though: you can get hold of these data comprehensively by purchasing them from WhoisXML API, Inc. And the size of the data sets, although not small, still enables you to handle them with your own devices by building a database from the purchased and downloaded data.

In the present blog we show some simple examples of downloading and using IP Netblocks WHOIS data. First, we shall play around with netblocks and their notation. Then we show how to load netblocks data into a NoSQL database and query them. (For those who prefer good old relational databases such as MySQL, we refer to our support script creating a MySQL database from CSV-format IP Netblocks WHOIS data, which are also available for download. The script and its documentation are here: https://github.com/whois-api-llc/whois_database_download_support/tree/master/netblocks_csv_to_mysqldb)

We shall also demonstrate that, after some prefiltering, you can even analyze these data by loading them into the computer memory using a high-level programming language. (Our language of choice is Python, as it is free, easily readable, and has many appropriate tools. A similar approach can be followed with other tools, too.)

Netblocks, CIDR, etc. for the newcomer

Let us first understand how a block of IP addresses is defined. (Those who are familiar with this will probably want to skip this section.) It is done by Classless Inter-Domain Routing (CIDR). The main idea is variable-length subnet masking (VLSM): a netblock is a set of IP addresses in which the first (i.e. most significant) n bits, the CIDR prefix, are kept fixed.

The simplest way to understand it is through an example. (A more detailed explanation can be found on this Wikipedia page.) To play around, we use the netaddr package in Python 3. Although we need nothing more than converting some numbers to binary format and masking some bits, this package, as it is designed to handle IP addresses and networks, does just what we need. And you are bound to find similar packages in your favorite environment. Further details of the package can also be found in this tutorial. So, possibly after doing “pip3 install netaddr” in your shell or Windows command line, start Python 3, say “import netaddr” and, for the sake of simplicity, “from netaddr import *” to follow the example (“>>>” stands for Python’s prompt). Or you may just read on and understand without reproducing.

Let’s take the netblock 104.16.0.0/12 in the CIDR notation. Here 104.16.0.0 is an IP address, while the number 12 tells us that the first 12 most significant bits specify the actual netblock, while the rest can vary, yielding the available addresses. So, by setting

                    
>>> example_network=IPNetwork('104.16.0.0/12')
                    
                

we will have a netblock of size

                    
>>> example_network.size
1048576
                    
                

that is, 2^(32-12)=2^20 addresses, as they are distinguished by the 20 least significant bits.

It will start with the IP

                    
>>> example_network[0]
IPAddress('104.16.0.0')
                    
                

or in the binary format:

                    
>>> example_network[0].bits()
'01101000.00010000.00000000.00000000'
                    
                

and end with

                    
>>> example_network[-1]
IPAddress('104.31.255.255')
>>> example_network[-1].bits()
'01101000.00011111.11111111.11111111'
                    
                

Note that indeed, the first (most significant) 12 bits are the same, while the rest go through all the possible values in the block; this defines an interval of neighboring integers, so it is indeed a block. If this block is assigned to someone, blocks which are subsets of it can be defined by taking an address within it and assigning more than 12 bits as the CIDR prefix. These can then be delegated to smaller distributors or assigned to clients. Note that this is a very logical way of subdividing the whole range of IP addresses into a hierarchy of smaller and smaller non-intersecting subintervals.
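The subdivision itself is easy to reproduce programmatically. As a quick sketch, here is the same idea with Python’s standard “ipaddress” module (netaddr offers similar functionality):

```python
import ipaddress

# Subdividing the /12 block: each extra prefix bit doubles the number of
# sub-blocks, so asking for prefix length 14 yields four /14 networks.
block = ipaddress.ip_network('104.16.0.0/12')
subblocks = list(block.subnets(new_prefix=14))
print(subblocks)
# [IPv4Network('104.16.0.0/14'), IPv4Network('104.20.0.0/14'),
#  IPv4Network('104.24.0.0/14'), IPv4Network('104.28.0.0/14')]
```

Each of these sub-blocks could, in turn, be delegated further, reflecting exactly the hierarchy described above.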

An alternative way to specify such a block is to give the range, i.e. the starting and ending addresses. How can we make sure that this is a CIDR netblock? It is enough to convert the first and last addresses to binary. If they have a common prefix, and the rest of the bits are all zeros for the first and all ones for the last, we have a contiguous CIDR block. Netaddr does us this favor:

                    
>>> IPRange('104.16.0.0', '104.31.255.255').cidrs()
[IPNetwork('104.16.0.0/12')]
                    
                

(The result can be a list of multiple networks, in which case our first and last addresses do not define a contiguous netblock.)
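Should netaddr not be at hand, the same conversion is available in Python’s standard “ipaddress” module; a minimal sketch:

```python
import ipaddress

# Collapse a first/last address pair into the minimal list of CIDR blocks.
first = ipaddress.IPv4Address('104.16.0.0')
last = ipaddress.IPv4Address('104.31.255.255')
blocks = list(ipaddress.summarize_address_range(first, last))
print(blocks)  # [IPv4Network('104.16.0.0/12')]: a single, contiguous block
```

As with netaddr, a result with more than one network means the range is not a single contiguous CIDR block.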

In the data available from WhoisXML API, the blocks are defined in the following form:

                    
'inetnum': '104.16.0.0 - 104.31.255.255',
'inetnumFirst': 1745879040,
'inetnumLast': 1746927615,
                    
                

So they contain the first and last address, in both dotted and decimal (integer) formats. This can be converted to the CIDR notation, e.g. as shown in the last line of our example. However, even without any conversion it is easy to check whether an address belongs to a block: the address should lie within the given interval. Now let us look for some data and carry out some simple investigations.
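Since the decimal fields are just the integer values of the addresses, such an interval check is a one-liner; a sketch in Python using the record fields shown above:

```python
import ipaddress

# The record fields as they appear in the data feed.
record = {
    'inetnum': '104.16.0.0 - 104.31.255.255',
    'inetnumFirst': 1745879040,
    'inetnumLast': 1746927615,
}

def in_block(ip_string, record):
    """True if the IP falls within the record's interval (integer comparison)."""
    value = int(ipaddress.IPv4Address(ip_string))
    return record['inetnumFirst'] <= value <= record['inetnumLast']

print(in_block('104.27.154.235', record))  # True
print(in_block('8.8.8.8', record))         # False
```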

Downloading data from the IP Netblocks feed

The data are provided on a webpage with simple HTTP authentication. A simple interactive way is to open the feed URL in a browser and download the files. The URL is https://ip-netblocks-whois-database.whoisxmlapi.com/datafeeds, where the data files are to be found. Use your API key as both the username and password for the authentication.

To download in an automated fashion, a suitable utility such as wget can be used. As an example, to download the file “ip_netblocks.2019-01-04.full.jsonl.gz” you can simply do

                    
wget --user YOUR_API_KEY  --password YOUR_API_KEY https://ip-netblocks-whois-database.whoisxmlapi.com/datafeeds/ip_netblocks.2019-01-04.full.jsonl.gz
                    
                

which can then be simply integrated in the various scripts you may have.
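The same download can also be scripted in Python with the standard library alone. A sketch (YOUR_API_KEY is a placeholder for your actual key; we only construct the authenticated request here):

```python
import base64
import urllib.request

# Placeholder credentials: the API key serves as both username and password.
api_key = 'YOUR_API_KEY'
url = ('https://ip-netblocks-whois-database.whoisxmlapi.com/datafeeds/'
       'ip_netblocks.2019-01-04.full.jsonl.gz')

request = urllib.request.Request(url)
credentials = base64.b64encode(('%s:%s' % (api_key, api_key)).encode()).decode()
request.add_header('Authorization', 'Basic ' + credentials)
# urllib.request.urlopen(request) would then stream the (large) file;
# write its contents to disk in chunks rather than reading it all at once.
```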

Netblocks NoSQL database example: MongoDB

MongoDB is a popular document database. It is designed to store data in flexible JSON documents and query them in a way tailored to the given application. We have just downloaded a massive set of JSONs, so MongoDB looks very suitable for handling it. Let us give it a try.

Instead of installing a vanilla MongoDB, we opt for Percona Server for MongoDB, a product of Percona with certain enhancements over the vanilla community edition, which we found useful, especially regarding performance. It can be downloaded for a variety of Linux distributions from their web page: https://www.percona.com/downloads/percona-server-mongodb-LATEST

It is freeware, so, to follow this example, install it first. To keep things simple, we do not set up authentication for our DB in this example. If you do, you will need to include the authentication-related options in the command lines.

It is very easy to import the downloaded data, especially since their format (jsonl, that is, one JSON per line) is just the native format of MongoDB. A MongoDB database is organized into collections; within a collection, each record is termed a “document”. We shall have a database named “netblocks”, and the downloaded file will become a collection within it. To create this, all we need is our downloaded file and the execution of the following:

                    
mongoimport --db "netblocks" --collection netblocks_2019_01_04 --file ip_netblocks.2019-01-04.full.jsonl
                    
                

We shall then see a progress report starting like:

                    
2019-02-05T20:36:22.280+0100	connected to: localhost
2019-02-05T20:36:25.288+0100	[........................] netblocks.netblocks_2019_01_04	40.0MB/4.39GB (0.9%)
2019-02-05T20:36:28.282+0100	[........................] netblocks.netblocks_2019_01_04	79.2MB/4.39GB (1.8%)
2019-02-05T20:36:31.288+0100	[........................] netblocks.netblocks_2019_01_04	120MB/4.39GB (2.7%)
                    
                

and finally, we shall have

                    
2019-02-05T20:42:22.280+0100	[#######################.] netblocks.netblocks_2019_01_04	4.34GB/4.39GB (98.8%)
2019-02-05T20:42:25.279+0100	[#######################.] netblocks.netblocks_2019_01_04	4.38GB/4.39GB (99.7%)
2019-02-05T20:42:26.209+0100	[########################] netblocks.netblocks_2019_01_04	4.39GB/4.39GB (100.0%)
2019-02-05T20:42:26.210+0100	imported 8925144 documents
                    
                

when it is done. As you can see, it took about 6 minutes for me to import 4.4 gigabytes, 8 925 144 records (the virtual host I used had an Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz and 4 GB of RAM, not quite oversized for the task).

Let’s play around with our data. First we need to start the db shell:

                    
mongo --shell
                    
                

In the MongoDB shell we instruct MongoDB to use the appropriate database:

                    
use netblocks
                    
                

the reply will be

                    
switched to db netblocks
                    
                

Our task will be to find out how many records there are with the country code “UK”. To do so, we need to run the following query:

                    
db.getCollection("netblocks_2019_01_04").find({"country":"UK"}).count()
                    
                

This results in the number 74. When omitting “.count()”, we get the actual records. Now let us look for the records whose starting address begins with “206.225.82”. We can find them e.g. using regular expressions:

                    
db.getCollection("netblocks_2019_01_04").find({"inetnum":{$regex:"^206\.225\.82\."}})
                    
                

This quick demonstration shows how efficiently one can deal with netblocks data in MongoDB. We refer to the documentation of MongoDB for further details on how to build more sophisticated queries.
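For instance, the interval check discussed earlier maps directly to a MongoDB range query on the decimal fields. A sketch of building such a filter in Python (usable, e.g., with the pymongo driver, which we assume here but do not require):

```python
import ipaddress

def containing_blocks_query(ip_string):
    """Build a MongoDB filter matching every netblock that contains the IP.

    Uses the decimal inetnumFirst/inetnumLast fields, so no conversion of
    the stored records is needed.
    """
    value = int(ipaddress.IPv4Address(ip_string))
    return {'inetnumFirst': {'$lte': value}, 'inetnumLast': {'$gte': value}}

# With pymongo this could be used as, e.g.:
#   db.netblocks_2019_01_04.find(containing_blocks_query('104.27.154.235'))
print(containing_blocks_query('104.27.154.235'))
```

An index on the two fields would make such queries fast even on millions of documents.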

Direct processing, e.g. in Python

A complete daily netblocks file is large, but at least a part of it may still fit into the memory of even an average computer. So one may consider writing custom software to work with the data, especially when the goal requires more complex processing than a database system can offer, and one can thus benefit from the power of a general-purpose programming language. Of course, in many cases a combination of a database system (e.g. MongoDB) with the programming language is a better solution, but it may well be that loading the data into some native data structure of the language and handling it in memory is efficient.

In the case of data files containing daily changes (the “*.daily.jsonl.gz” files) this latter approach is surely viable for single files. Processing full daily files (“*.full.jsonl.gz”) may hit the limits of the actual machine, as we will see in the following example.
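For the daily files, streaming the gzipped JSON lines one by one keeps memory consumption flat; a minimal sketch (the file name is just an example):

```python
import gzip
import json

def iter_records(path):
    """Yield one parsed record per line of a gzipped jsonl file."""
    with gzip.open(path, 'rt') as handle:
        for line in handle:
            yield json.loads(line)

# Example usage:
# for record in iter_records('ip_netblocks.2019-01-04.daily.jsonl.gz'):
#     process(record)
```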

In what follows we describe a simple example of such an application in Python. It can also be viewed as motivation to do something similar in your preferred language. In Python, the loaded JSON records translate easily into the native dictionary type. With these we can then do anything that Python allows. For instance, we can put the dictionaries into a list and filter it with highly customized functions, or use the data for building graph representations to reveal some structural property our research looks for. The attractive part is that the full arsenal of Python packages is at hand to process the data.

Consider the following use case: we need to write a simple program that reports all the netblocks a given IP belongs to. We assume that it is sufficient to deal with those records which originate from ARIN, the American Registry for Internet Numbers. Our input file will be “ip_netblocks.2019-01-04.full.jsonl”.

The following code does the job. Here, instead of “netaddr”, we use “ipaddress”, which is part of Python’s standard library. Though “netaddr” has more capabilities, the present task is feasible with this simpler package too, illustrating another option. The code reads as follows:

                    
1	#!/usr/bin/env python3

2	import json
3	import sys
4	import re
5	import ipaddress
6	import pprint
7	import datetime

8	checked = 0
9	records = []

10	for rawrecord in open('ip_netblocks.2019-01-04.full.jsonl', 'r'):
11	    record = json.loads(rawrecord)

12	    checked += 1
13	    try:
14	        if record['source'] == 'ARIN':
15	            records.append(record)
16	    except KeyError:
17	        pass
18	    if checked % 100000 == 0:
19	        sys.stderr.write("%d records checked.\n" % checked)
20	        sys.stderr.flush()

21	sys.stderr.write('%d records in memory.\n' % (len(records)))

22	def IPaddressinrange(address, iprange):
23	    iprange = iprange.split(' - ')
24	    try:
25	        iprange_begin = ipaddress.IPv4Address(iprange[0])
26	        iprange_end = ipaddress.IPv4Address(iprange[1])
27	        query = ipaddress.IPv4Address(address)
28	    except (ValueError, IndexError):
29	        return(False)
30	    return(iprange_begin <= query <= iprange_end)


31	while True:
32	    query = input('Enter IP to be queried: ')
33	    starttime = datetime.datetime.now()
34	    result = filter(lambda x: IPaddressinrange(query, x['inetnum']), records)
35	    print('Results:')
36	    results = list(result)
37	    for record in results:
38	        pprint.pprint(record)
39	    if not results:
40	        print('Not found.')
41	    endtime = datetime.datetime.now()
42	    print('The query took %d seconds.' % ((endtime-starttime).total_seconds()))
                    
                

Let us see how it works. In lines 8 to 20, we load into memory those records which originate from ARIN. Line 10 loops through the lines of the file. Each line is converted into a Python dictionary (associative array) in line 11. In lines 12-17 we count the record and append it to the Python list (array) “records” if it meets our criterion of coming from ‘ARIN’ (records without a “source” field are skipped, hence the exception handling). Lines 18-20 allow the user to monitor the progress of reading the file. Finally, in line 21, we report the number of records actually loaded.

The function in lines 22-30 decides if an IP belongs to a block defined the way it is done in the “inetnum” field of our data, e.g. “104.16.0.0 - 104.31.255.255”. The block is given as a string argument which is split into the two IP strings in line 23. In lines 25-26 we start taking advantage of the “ipaddress” package in Python: we try converting these strings, as well as the one to be queried, to “IPv4Address”-es. If we do not manage, we simply return “False”. Finally, in line 30, the function returns True if the queried IP is between the starting and ending IPs, taking advantage of the relational operators of “IPv4Address”-type data. (You might well say that we could have used the “inetnumFirst” and “inetnumLast” fields for the purpose, which are integers. Indeed, that would have been a valid solution, but we wanted to illustrate the power of having a specialized package of the programming language at hand.)

The main loop starts in line 31. The user is asked for an IP in line 32. Lines 33, 41, and 42 provide information on how much time the query took. In lines 35 to 40 we print the results (or “Not found.” if there are none), and then we go for the next query.

An example session looks like this (we omit many “...records checked” lines from the beginning):

                    
8800000 records checked.
8900000 records checked.
3257038 records in memory.
Enter IP to be queried: 104.27.154.235
Results:
{'abuseContact': [],
 'adminContact': [],
 'as': {'asn': 0, 'domain': '', 'name': '', 'route': ''},
 'city': 'Centreville',
 'country': 'US',
 'inetnum': '104.0.0.0 - 104.255.255.255',
 'inetnumFirst': 1744830464,
 'inetnumLast': 1761607679,
 'mntBy': [],
 'mntDomains': [],
 'mntLower': [],
 'mntRoutes': [],
 'modified': '2011-02-11T00:00:00Z',
 'netname': 'NET104',
 'org': {'city': 'Centreville',
         'country': 'US',
         'email': 'hostmaster@arin.net\nnoc@arin.net',
         'name': 'American Registry for Internet Numbers',
         'org': 'ARIN',
         'phone': '+1-703-227-0660\n+1-703-227-9840'},
 'source': 'ARIN',
 'techContact': []}
{'abuseContact': [{'city': 'San Francisco',
                   'country': 'US',
                   'email': 'abuse@cloudflare.com',
                   'id': 'ABUSE2916-ARIN',
                   'phone': '+1-650-319-8930',
                   'role': 'Abuse'}],
 'adminContact': [{'city': 'San Francisco',
                   'country': 'US',
                   'email': 'noc@cloudflare.com',
                   'id': 'NOC11962-ARIN',
                   'phone': '+1-650-319-8930',
                   'role': 'NOC'}],
 'as': {'asn': 13335,
        'domain': '',
        'name': 'CLOUDFLARENET',
        'route': '104.16.0.0/12'},
 'city': 'San Francisco',
 'country': 'US',
 'inetnum': '104.16.0.0 - 104.31.255.255',
 'inetnumFirst': 1745879040,
 'inetnumLast': 1746927615,
 'mntBy': [],
 'mntDomains': [],
 'mntLower': [],
 'mntRoutes': [],
 'modified': '2017-02-17T00:00:00Z',
 'netname': 'CLOUDFLARENET',
 'org': {'city': 'San Francisco',
         'country': 'US',
         'email': 'abuse@cloudflare.com\n'
                  'noc@cloudflare.com\n'
                  'rir@cloudflare.com',
         'name': 'Cloudflare, Inc.',
         'org': 'CLOUD14',
         'phone': '+1-650-319-8930'},
 'source': 'ARIN',
 'techContact': [{'city': 'San Francisco',
                  'country': 'US',
                  'email': 'rir@cloudflare.com',
                  'id': 'ADMIN2521-ARIN',
                  'phone': '+1-650-319-8930',
                  'role': 'Admin'}]}
The query took 131 seconds.
                    
                

We have queried for the IP “104.27.154.235”, which is that of www.domainwhoisdatabase.com, one of the web servers of WhoisXML API, Inc. From the results we clearly see the hierarchy of the netblocks involved.

What are the lessons to learn from this example? First of all, the data as provided are very suitable for reading by high-level programming languages such as Python: it was as simple as line 11 of the code shows. Secondly, if the data are all loaded into memory in some native format, we can use all the tools available in the given language to process them. In this simple example we exploited this when deciding if an IP is within a given range. But you may envisage far more complicated ideas, e.g. building a graph of a subset of the records according to some criterion and analyzing it with the “networkx” graph package of Python, or reading many files and performing e.g. time-series analysis of netblocks’ behavior with the data analysis library “pandas”, etc.

All of this sounds well and good, but the query time of 131 seconds looks a bit disappointing. The actual machine used for the test had 16 gigabytes of memory and Intel(R) Xeon(R) E5345 CPUs @ 2.33GHz. A single core was used, of course, as we had no threading in the code. Note that we were filtering 3,257,038 records altogether. These did fit into the memory (other applications and services were also running normally), but according to experience, we would have run into problems if we had had to process more than about 5 million records. In that case, building a database and reading just the relevant data into Python via a database connection would be a better approach. We were able to drop many records thanks to the requirement that they come from ARIN, which solved the issue of the size in memory (extending the physical memory would also have helped with loading, but not with query speed). In conclusion: using the array of dictionaries in this fashion becomes really efficient only if the data are prefiltered to a large extent upon loading. But the query time was still relatively large: it may be acceptable in a research task where the calculation has to be done just once, but definitely not if this is part of an online mail filter. To do it more efficiently, one either has to write more sophisticated code with supplementary indexing data sets and, of course, multi-threading, or use an underlying database specialized in searching, which resolves the efficiency issue at the price of losing the elegant comparison of IPs in line 30 of the code.
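As a hint at what such supplementary indexing could look like: sorting the records once by their starting address lets a binary search discard every block that starts above the queried IP. A toy sketch with hypothetical records (not the full data set):

```python
import bisect
import ipaddress

# Toy records in the same shape as the feed's interval fields.
records = [
    {'inetnum': '104.0.0.0 - 104.255.255.255',
     'inetnumFirst': 1744830464, 'inetnumLast': 1761607679},
    {'inetnum': '104.16.0.0 - 104.31.255.255',
     'inetnumFirst': 1745879040, 'inetnumLast': 1746927615},
    {'inetnum': '8.8.8.0 - 8.8.8.255',
     'inetnumFirst': 134744064, 'inetnumLast': 134744319},
]

# Sort once by the starting address; bisect then skips every block that
# starts after the queried IP, so only a prefix of the list is scanned.
records.sort(key=lambda r: r['inetnumFirst'])
starts = [r['inetnumFirst'] for r in records]

def lookup(ip_string):
    value = int(ipaddress.IPv4Address(ip_string))
    upper = bisect.bisect_right(starts, value)
    return [r for r in records[:upper] if r['inetnumLast'] >= value]

print([r['inetnum'] for r in lookup('104.27.154.235')])
# ['104.0.0.0 - 104.255.255.255', '104.16.0.0 - 104.31.255.255']
```

This is still a linear scan over the remaining prefix, but on real data it already prunes a large share of candidates; an interval tree would prune further.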

Summary

In this blog we have introduced IP Netblocks WHOIS data and their use through simple examples. Our aim was to motivate the reader by illustrating how many interesting, useful and sometimes challenging applications these data have, how easy they are to obtain, and how efficiently they can be queried even with simple free tools running on average hardware. As WhoisXML API, Inc. provides both full databases and daily updates, you can set up a local database and keep it up-to-date with a limited amount of daily downloads. (Daily changes can in fact be interesting per se, e.g. when studying the dynamics of the IP address ecosystem.) Want to do it your way? Visit https://ip-netblocks-whois-database.whoisxmlapi.com and sign up for a subscription.
