dimanche 24 juin 2012

GDAL/OGR using Shapefile native .sbn spatial index

This is the end of a mistery and a frustration that have lasted for about 15 years. The Shapefile format had been documented since 1998, but the documentation was limited to the minimum core, that is to say the .shp file that contains the geometries and the .shx that is an index to the geometries. However the format of the .sbn file, that was known to contain spatial index (aimed at speeding up spatial filters), has never been published.

The FOSS community came with an alternate spatial index format, the .qix format, originally used by Shapelib, MapServer and GDAL/OGR, and whose use has propagated into other software stacks, such as GeoTools.

But Joel Lawhead, Marc Pfister and another developer, Francisco, have actively started since the end of 2011 the reverse engineering of the .sbn format, and their effort finally came to a conclusion. Many kudos to them for leading this great effort !

There is not yet a formal specification of the .sbn format, but with the help of the blog entries and the debug/testing code that they made available, it was relatively easy to put pieces together. So, as an application of their work and a new step to increase interoperability between FOSS and proprietary software, I've just added support for reading and using .sbn files in the latest revision of the developement version of GDAL/OGR (GDAL 2.0dev). You can refer to ticket 4719 for the details and the code.


As this code is rather new, crowd testing is of course much appreciated. Provided that you use the latest version of GDAL 2.0dev and have python-gdal ready, you can use the testsbn.py script to check if the spatial index is correctly used with your own datasets.

Let's suppose that you have a shapefile "some_shapefile.shp", with an accompaying .sbn file "some_shapefile.sbn", you can try :
python testsbn.py some_shapefile.shp
This script will iterate over all the shapes of the shapefile, and for each shape, define a spatial filter whose extent is the bounding box of the geometry and issue a spatial request. It will check that the result of the spatial request contains the shape that served to define the spatial filter.


My initial testing of .sbn vs .qix queries would tend to show that search in .sbn is faster than in .qix (more than twice on large datasets, and with a lot of spatial filter request such as in testsbn.py), while being smaller. An explaination would be that spatial requests against .qix files use floating-point comparisons, whereas equivalent requests against .sbn files only use integer comparisons.

As far as adding write support for .sbn files, I'm a bit ambivalent for now. According to Joel's blog, there are still a few details that have to be solved, in order to understand all the subtelties of the algorithm that dispatches shapes into tree nodes, and ensure full interoperability with the software stack of the vendor at the origin of the .sbn format.

2 commentaires:

  1. It is good to know that a mistery since 15 years has been solved by some one. I am sure it is hard to make Geomatrical applications to work properly and after successful development you need to go through the software testing process.

  2. Some clarification on the last mystery of generating sbn files - in certain shapefiles, ESRI's algorithm will have some features 1 level higher in the tree than where hasbeen.py will put them. This only means that the index bin containing the shape has a larger and less optimal bounding box. But you will also find the feature sooner as you traverse down the index tree. We do need to test this more, but it seems that files generated by hasbeen should be perfectly valid, even if they are not a 1:1 match.