File format

The Sypex Geo file format was designed around the following principles.

Minimize the number of file operations

Hard drives are the slowest component of a modern computer (the mass transition to faster SSDs is still a long way off). Their bottleneck is random reads of small amounts of data (several kilobytes), and most typical websites consist primarily of many small files. Sypex Geo was therefore designed to minimize the number of file operations.

Block reading is better than byte reading

Files are read from the hard drive in blocks (usually 4 KB): even if you need a single byte, the entire block is still read from the disk. The file format was therefore designed so that one block of several kilobytes is read instead of 4-8 separate reads of 6 bytes each from different parts of the file.
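
To make the difference concrete, here is a minimal sketch in Python. It is not taken from the Sypex Geo code: the function names, the 6-byte record size and the in-memory test file are assumptions chosen to mirror the figures above. It contrasts fetching several small records with one seek and read each against fetching one block and slicing the records out of memory.

```python
import io

BLOCK_SIZE = 4096   # assumed block size, matching the "usually 4 KB" figure above
RECORD_SIZE = 6     # assumed record size, matching the 6-byte reads mentioned above


def read_records_bytewise(f, offsets):
    """Fetch each record with its own seek + read: one file operation per record."""
    records = []
    for off in offsets:
        f.seek(off)
        records.append(f.read(RECORD_SIZE))
    return records


def read_records_blockwise(f, block_start, offsets):
    """Fetch one block with a single read, then slice the records out of memory."""
    f.seek(block_start)
    block = f.read(BLOCK_SIZE)
    return [block[off - block_start: off - block_start + RECORD_SIZE]
            for off in offsets]


# Hypothetical data: 1000 consecutive 6-byte records in an in-memory "file".
data = b"".join(i.to_bytes(RECORD_SIZE, "big") for i in range(1000))
f = io.BytesIO(data)
offsets = [0, 600, 1200, 2400, 3600]   # all inside the first 4 KB block

# Both strategies return the same records; the second needs only one read.
assert read_records_bytewise(f, offsets) == read_records_blockwise(f, 0, offsets)
```

Both functions return identical data; the block-wise version simply replaces five file operations with one, which is the trade the format is built around.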

Less data - higher speed

The file contains a special index keyed by the first byte (octet) of the IP address, which provides the following advantages.

  1. Only 3 bytes instead of 4 are needed to store an IP range (this may seem small, but it removes 25% of redundant data; with 2 million ranges, the savings are already about 2 MB).
  2. The portion of the main index that has to be searched is greatly reduced.
  3. Searching for the required IP in the database is faster, since each comparison involves one byte fewer (and that byte would be the same for every candidate anyway).


For example, to find IP 24.89.68.43, the following list of ranges is selected from the database:

24.54.192.0
24.56.0.0
24.57.0.0
24.58.0.0
24.64.0.0
24.72.144.0
24.76.0.0
24.88.0.0
24.89.64.0
24.89.128.0
24.89.192.0
24.90.0.0
...

As you can see, all ranges share the same first byte (on average, there are 7-9 thousand ranges for each first byte). The sample will contain several hundred such IPs. Checking the first byte in this case is therefore a waste of time: it is the same for every entry.
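
The idea of a first-byte index can be sketched as follows. This is an illustration only, not the actual Sypex Geo layout: the names `pack`, `build` and `find_range`, the dictionary-based index and the use of `bisect` are all assumptions. The range starts are the ones listed in the example; the elided entries are left out. The sketch also assumes that no range spans two different first bytes.

```python
import bisect

# Range starts copied from the example above (the elided "..." entries are not filled in).
RANGE_STARTS = [
    "24.54.192.0", "24.56.0.0", "24.57.0.0", "24.58.0.0", "24.64.0.0",
    "24.72.144.0", "24.76.0.0", "24.88.0.0", "24.89.64.0", "24.89.128.0",
    "24.89.192.0", "24.90.0.0",
]


def pack(ip):
    """Turn dotted-quad text into 4 raw bytes (most significant octet first)."""
    return bytes(int(part) for part in ip.split("."))


def build(range_starts):
    """Return (index, tails): tails keeps only the last 3 bytes of each range start,
    index maps a first byte to the half-open slice [lo, hi) of tails it owns."""
    packed = sorted(pack(ip) for ip in range_starts)
    tails = [p[1:] for p in packed]
    index = {}
    for pos, p in enumerate(packed):
        lo, _ = index.get(p[0], (pos, pos))
        index[p[0]] = (lo, pos + 1)
    return index, tails


def find_range(index, tails, ip):
    """Return the position of the last range start <= ip, or None if there is none.
    Only the slice belonging to the IP's first byte is searched."""
    p = pack(ip)
    if p[0] not in index:
        return None
    lo, hi = index[p[0]]
    pos = bisect.bisect_right(tails, p[1:], lo, hi) - 1
    return pos if pos >= lo else None


index, tails = build(RANGE_STARTS)
pos = find_range(index, tails, "24.89.68.43")
print(".".join(str(b) for b in bytes([24]) + tails[pos]))  # 24.89.64.0
```

The first byte is consulted exactly once, to pick the slice; every comparison after that works on 3-byte values only.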

Fewer data conversions

When searching for the required IP, no conversion from a binary string to an integer takes place (all competitors do this, and often perform further arithmetic on the resulting numbers).

In Sypex Geo everything is kept as simple as possible: two 3-byte strings are compared in exactly the form in which they were read from the file, with no conversions.
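
Comparing raw bytes works because, when the most significant byte comes first, byte-wise (lexicographic) order is the same as numeric order. The sketch below illustrates this in Python; the function name `last_key_not_above` and the sample block are hypothetical, and the integer conversion appears only in the final check to show that both orderings agree.

```python
def last_key_not_above(block, target, key_size=3):
    """Scan a block of fixed-size keys stored in ascending order.
    Return the index of the last key <= target, or -1 if every key is larger."""
    found = -1
    for i in range(0, len(block) - key_size + 1, key_size):
        # Raw bytes are compared as-is, with no integer conversion.
        if block[i:i + key_size] <= target:
            found = i // key_size
        else:
            break
    return found


# Hypothetical block of four 3-byte keys, as they might be read from a file.
block = bytes([54, 192, 0,   88, 0, 0,   89, 64, 0,   89, 128, 0])
target = bytes([89, 68, 43])   # last three octets of 24.89.68.43

assert last_key_not_above(block, target) == 2   # the key 89.64.0

# Cross-check: byte order and integer order agree for big-endian keys.
keys = [block[i:i + 3] for i in range(0, len(block), 3)]
assert all((k <= target) == (int.from_bytes(k, "big") <= int.from_bytes(target, "big"))
           for k in keys)
```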

Simple data storage structures

The directories of cities and regions are stored in the simplest possible form: numbers in binary form, strings terminated by a zero byte. This ensures maximum performance when retrieving data, as well as compact storage. There is no JSON or XML, which take a long time to parse.
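
As an illustration of this kind of storage, the sketch below packs and unpacks a record made of a binary number followed by a zero-terminated string. The record layout (a 4-byte big-endian ID, then a UTF-8 name) and the names `pack_city` and `unpack_city` are assumptions made for the example; the real field layout is defined by the format specification, not by this sketch.

```python
import struct


def pack_city(city_id, name):
    """Hypothetical record: 4-byte big-endian unsigned ID, then a zero-terminated name."""
    return struct.pack(">I", city_id) + name.encode("utf-8") + b"\x00"


def unpack_city(buf, offset=0):
    """Read one record starting at offset.
    Return (city_id, name, offset_of_next_record)."""
    (city_id,) = struct.unpack_from(">I", buf, offset)
    start = offset + 4
    end = buf.index(b"\x00", start)   # the zero byte marks the end of the string
    return city_id, buf[start:end].decode("utf-8"), end + 1


# Two made-up records stored back to back, with no separators or markup.
buf = pack_city(1, "Moscow") + pack_city(2, "Kyiv")

first_id, first_name, nxt = unpack_city(buf)
second_id, second_name, _ = unpack_city(buf, nxt)
assert (first_id, first_name) == (1, "Moscow")
assert (second_id, second_name) == (2, "Kyiv")
```

Reading a record takes one fixed-size unpack and one search for the terminator, with no tokenizing or parsing step.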

Openness

The format was designed from the start to be open, so that users can create their own database files in the Sypex Geo format. Accordingly, both the specification of the format itself and the tools for creating database files will be published. They will be published once the project leaves the beta stage and the format specification is finalized; until then, minor changes are possible.

Versatility

It is also possible to store arbitrary user data in the database, not just the standard information about countries and cities.

You can also read the Sypex Geo 2.2 format specification.