Spatial data on a diet: tips for file size reduction using TopoJSON

For a current project we needed to render, on the web, a geographic file with approximately six thousand features. The shapefile itself is about 11 megabytes, far too big to load quickly in a web application. We were planning to render the features using GeoServer and map tiles but decided to first investigate how far we could slim down the geographic file itself.

1. Starting Point: Large shapefile of a gridded California

The shapefile we are working with is a map of California gridded into what are called ‘townships’ as part of the Pesticide Use Reporting data system in California. Unlike a more traditional grid, this grid is not restricted to square segments; many of the townships follow natural boundaries such as rivers. This matters for file size: a square can be represented by four nodes, but these irregular shapes carry a lot of detail.

Here is a small piece of the map in Northern California.

[Image: a small piece of the township map in Northern California]

Shapefile: 11 megabytes with four attributes

2. What happens when we convert to GeoJSON?

The first step is to convert to a web-friendly format, GeoJSON. For this we use ogr2ogr (with defaults) and the following command:

ogr2ogr -f "GeoJSON" township.geojson township.shp

Unfortunately, GeoJSON is not an efficient storage container for spatial data: attribute names are repeated for every feature, and coordinates are usually stored as high-precision numbers. As a result, conversion to GeoJSON with the default settings (the command shown above) yields a file approximately three times bigger than the original shapefile. From a geographic perspective the GeoJSON file is nearly indistinguishable from the shapefile, but at 31 megabytes we are a long way from something we can render on the web.

Note that for the rest of this post we will use a map of Lake Tahoe primarily because the lake shore includes a lot of detail that we can use to compare results easily.

{ "type": "Feature", "properties": { "OBJECTID": 1, "SHAPE_area": 0.000023264785254, "SHAPE_len": 0.0305910551739, "COMTR": "01M01N03W" }, "geometry": { "type": "Polygon", "coordinates": [ [ [ -122.2430065330162, 37.882559002628312 ], [ -122.24288515908067, 37.882468439540702 ], [ -122.24251972061417, 37.882208374073528 ], [ -122.24196459265734, 37.88245925307357 ], [ -122.24111478414277, 37.88297306202999 ], [ -122.24071515774634, 37.883128066073844 ], [ -122.23925559496328, 37.883839439353174 ], [ -122.2294649065516, 37.881157126637476 ], [ -122.22907940660983, 37.880980312272321 ], [ -122.24312990706873, 37.880853626468223 ], [ -122.24300784465072, 37.882524377486888 ], [ -122.2430065330162, 37.882559002628312 ] ] ] } },

[Image: Lake Tahoe area rendered from the GeoJSON file]

GeoJSON: 31 megabytes with four attributes
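
To get a feel for where the bulk comes from, here is a minimal Python sketch (not part of the original workflow) that serializes a single feature built from the first few vertices of the snippet above. The choice of six decimal places (roughly 0.1 m at the equator) is an assumption made purely for illustration, as are the helper and variable names.

```python
import json

# First three vertices of the ring in the snippet above, closed back on itself.
ring = [
    [-122.2430065330162, 37.882559002628312],
    [-122.24288515908067, 37.882468439540702],
    [-122.24251972061417, 37.882208374073528],
    [-122.2430065330162, 37.882559002628312],
]
properties = {
    "OBJECTID": 1,
    "SHAPE_area": 0.000023264785254,
    "SHAPE_len": 0.0305910551739,
    "COMTR": "01M01N03W",
}


def feature_bytes(ring, properties):
    """Size in bytes of one compactly serialized GeoJSON feature."""
    feature = {
        "type": "Feature",
        "properties": properties,
        "geometry": {"type": "Polygon", "coordinates": [ring]},
    }
    return len(json.dumps(feature, separators=(",", ":")).encode("utf-8"))


full = feature_bytes(ring, properties)

# Assumption: six decimal places is enough for this illustration.
rounded_ring = [[round(x, 6), round(y, 6)] for x, y in ring]
rounded = feature_bytes(rounded_ring, properties)

# Bytes spent on the quoted attribute names (plus colon) in every feature.
key_bytes = sum(len(json.dumps(name)) + 1 for name in properties)

print("full precision:", full, "bytes")
print("rounded coords:", rounded, "bytes")
print("attribute names per feature:", key_bytes, "bytes")
```

Multiplied across roughly six thousand features with many vertices each, both effects add up quickly.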

3. Shapefile to TopoJSON Using Default Settings

As discussed in a previous post, Mike Bostock created a great tool for converting shapefiles (or GeoJSON files) to a more compact form of JSON that he calls TopoJSON. Most importantly, TopoJSON saves space by storing shared borders more efficiently: a border shared by two states, for example, does not need to be recorded twice. Installation of the command-line utility is straightforward, and directions can be found on the wiki. The command-line utility has a lot of useful arguments that can help you massage your data into a format that works on the web. Initially we simply accepted the defaults (with one exception) to see what size gains we would get, using the following command. Note that we add -p at the end to preserve all our attributes (by default no attributes are kept).

topojson -o township.json township.shp -p

The output looks quite different from GeoJSON. Here is a snippet:

"geometries":[{"type":"Polygon","arcs":[[0,1,2]],"properties":{"OBJECTID":1,"SHAPE_area":0.0000232647852539,"SHAPE_len":0.0305910551739,"COMTR":"01M01N03W"}},{"type":"Polygon","arcs":[[-3,3,4,5]],"properties":{"OBJECTID":2,"SHAPE_area":0.00164267700165,"SHAPE_len":0.252900787742,"COMTR":"01M01N04W"}},{"type":"Polygon","arcs":[[6,7,8,-5]],

From a geographic perspective, a TopoJSON file created using the defaults looks the same as the shapefile and the GeoJSON. The file is also much, much smaller! We had 11 MB for the shapefile and 31 MB for the GeoJSON, and we’re down to 2.5 MB with the default TopoJSON (with attributes). Unfortunately, this is still not small enough.

[Image: Lake Tahoe area rendered from the default TopoJSON file]

TopoJSON (all attributes, otherwise default settings): 2.5 megabytes
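
The "arcs" lists in the snippet above are what make this work: each polygon is a list of indexes into a shared table of arcs, and a negative index means the arc is traversed backwards (-1 is arc 0 reversed, -3 is arc 2 reversed). The Python sketch below illustrates that idea on a made-up topology of two unit squares. The function names are hypothetical, and it uses plain absolute coordinates rather than the quantized, delta-encoded form a real TopoJSON file would normally contain.

```python
def decode_arc_ref(arcs, ref):
    """Return the points of one arc; a negative ref means arc ~ref, reversed."""
    if ref >= 0:
        return arcs[ref]
    return arcs[~ref][::-1]


def stitch_ring(arcs, refs):
    """Join the referenced arcs end to end into a single ring."""
    ring = []
    for ref in refs:
        points = decode_arc_ref(arcs, ref)
        # The first point of each later arc repeats the last point so far.
        ring.extend(points if not ring else points[1:])
    return ring


# Toy topology (an assumption for illustration): two squares sharing x = 1.
arcs = [
    [(1, 0), (1, 1)],                  # arc 0: the shared border, stored once
    [(1, 1), (0, 1), (0, 0), (1, 0)],  # arc 1: the rest of the left square
    [(1, 0), (2, 0), (2, 1), (1, 1)],  # arc 2: the rest of the right square
]

left = stitch_ring(arcs, [0, 1])
right = stitch_ring(arcs, [-1, 2])  # -1 is arc 0 walked in the other direction

print(left)
print(right)
```

Both squares come back as closed rings even though the shared edge appears only once in the arc table, which is where the savings over GeoJSON come from in a grid like ours where nearly every border is shared.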

4. Not small enough but luckily some tweaks using TopoJSON arguments can help

4a. Quantization: Reducing coordinate precision

The command-line reference has a good discussion of the different arguments, many of which can be used to slim your file. There is also a nice discussion of the options by Nelson Minar here. One of the arguments controls what is called ‘quantization’, which essentially amounts to how much coordinate precision you want to retain. The default quantization factor is 10,000 (-q 1e4). For more detail you want a larger number; for less detail you want a smaller number. In our case the 10,000 default was more than we needed, so we experimented with smaller values: 100, 1000 and 5000.

Using the code below as an example, the -q is the quantization parameter:

topojson -o township_q100.json -q 100 township.shp -p

[Image: Lake Tahoe shoreline at quantization 100, 1000 and 5000]

You can see from the images that q=100 clearly does not retain enough detail (it doesn’t even come close to matching the lake boundary) unless your users don’t zoom in at all. q=1000 could be acceptable, but I see a few artifacts; q=5000 looks pretty good to me.

TopoJSON (all attributes, quantization = 100): 1.0 megabyte
TopoJSON (all attributes, quantization = 1000): 1.6 megabytes
TopoJSON (all attributes, quantization = 5000): 2.1 megabytes
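
To see why q=100 falls apart, it helps to think of quantization as snapping every coordinate to a q-by-q grid laid over the bounding box of the data. The sketch below shows the general idea in Python; it is not the topojson implementation. The bounding box and the three nearby points are rough, made-up values for California and the Tahoe area, and the function names are hypothetical.

```python
def quantize(points, q):
    """Snap points to a q-by-q integer grid over their bounding box."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x0, y0 = min(xs), min(ys)
    # Cell size per axis; fall back to 1.0 if the data has zero extent.
    kx = (max(xs) - x0) / (q - 1) or 1.0
    ky = (max(ys) - y0) / (q - 1) or 1.0
    grid = [(round((x - x0) / kx), round((y - y0) / ky)) for x, y in points]
    return grid, (kx, ky), (x0, y0)


def dequantize(grid, scale, translate):
    """Map grid indexes back to approximate coordinates."""
    (kx, ky), (x0, y0) = scale, translate
    return [(gx * kx + x0, gy * ky + y0) for gx, gy in grid]


# Assumed values: two rough bounding-box corners for California plus three
# nearby points about 0.01 degrees apart, standing in for a detailed shoreline.
points = [
    (-124.4, 32.5), (-114.1, 42.0),
    (-120.10, 39.10), (-120.11, 39.11), (-120.12, 39.09),
]

for q in (100, 1000, 5000):
    grid, scale, translate = quantize(points, q)
    print(q, "cell width (degrees):", scale[0], "distinct points:", len(set(grid)))
```

With this data, q=100 gives cells roughly a tenth of a degree wide, so the three nearby points collapse into a single grid cell, while q=5000 keeps them apart.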

4b. Simplification: Reducing the number of nodes used to represent arcs

Again, check the wiki for details, but the basic idea is that for most web mapping we don’t need extreme precision in our geographies, and we can safely reduce the size of our files by reducing the level of detail in our arcs. TopoJSON uses the Visvalingam algorithm to “progressively remove points with the least-perceptible change.” Before reading the wiki I hadn’t heard of the Visvalingam algorithm (the Douglas-Peucker, or Ramer-Douglas-Peucker, algorithm is far more common), but it’s hard to argue with the success it achieves in TopoJSON. Like quantization, the simplification argument accepts scientific notation, but be careful: in the examples I’m showing, simplification uses a negative exponent. Smaller numbers (larger negative exponents) yield more detail. By default there is no simplification. We will experiment to find the right balance.

Since we’re happy with quantization=5000 we will stick with this and layer the simplification on top. We will try:

  • 1e-8 (0.00000001) – smallest number, most detail
  • 5e-8 (0.00000005)
  • 1e-7 (0.0000001)
  • 5e-7 (0.0000005) – largest number, least detail

The code looks something like this:

topojson -o township_q5000_1e-8.json -q 5000 -s 1e-8 township.shp -p

[Image: Lake Tahoe shoreline at each simplification level]

TopoJSON (all attributes, quantization=5000, simplification=5e-7): 1.256 MB
TopoJSON (all attributes, quantization=5000, simplification=1e-7): 1.290 MB
TopoJSON (all attributes, quantization=5000, simplification=5e-8): 1.299 MB
TopoJSON (all attributes, quantization=5000, simplification=1e-8): 1.344 MB

The difference in size is small, but the difference in image quality is large. The highest simplification setting (5e-7) distorts our map, while 1e-8 seems to work well.
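
For readers who, like me, had not met the Visvalingam algorithm before, here is a minimal Python sketch of the core idea: repeatedly drop the interior point whose triangle with its two neighbours has the smallest area, until every remaining triangle is at least as large as the threshold. This is a simple quadratic-time illustration on planar coordinates, not the code topojson runs, and the areas here are in plain square units, so the thresholds are not comparable to the -s values above. The function names and the test line are made up.

```python
def triangle_area(a, b, c):
    """Planar area of the triangle formed by three (x, y) points."""
    return abs((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])) / 2.0


def visvalingam(points, min_area):
    """Remove interior points until every triangle area is >= min_area."""
    pts = list(points)
    while len(pts) > 2:
        # Area of the triangle at each interior point (endpoints are kept).
        areas = [
            triangle_area(pts[i - 1], pts[i], pts[i + 1])
            for i in range(1, len(pts) - 1)
        ]
        smallest = min(areas)
        if smallest >= min_area:
            break
        # areas[0] belongs to pts[1], hence the +1 offset.
        del pts[areas.index(smallest) + 1]
    return pts


# Hypothetical line: a barely visible bump at x=1 and a large spike at x=3.
line = [(0, 0), (1, 0.01), (2, 0), (3, 1), (4, 0)]

print(visvalingam(line, 0.1))  # small threshold: only the tiny bump goes
print(visvalingam(line, 5.0))  # large threshold: everything but the endpoints goes
```

The tiny bump is removed first because deleting it changes the shape least, which is the "least-perceptible change" the wiki describes.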

4c. Do you need all the attributes?

In many cases, the size of the geographic file is determined primarily by the geographies themselves, but attributes can also influence the size. The TopoJSON command-line tool offers an easy way to eliminate the attributes; in fact, this is the default. In our case we could eliminate three of the fields but needed to keep the ID field, which is titled ‘COMTR’. We can save a little more space by renaming it ‘id’ (COMTR=id).

topojson -o township_q5000_1e-8_onlyID.json -q 5000 --id-property COMTR=id -s 1e-8 township.shp

TopoJSON (only ID, quantization=5000, simplification=1e-8): 714 KB
TopoJSON (only ID, quantization=5000, simplification=1e-8), gzipped: 134 KB
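
As a rough illustration of why dropping attributes and gzipping help, the Python sketch below serializes only the attribute portion of about six thousand features, first with all four fields and then with just an id. The attribute values are made up to resemble the snippets above, and the 6000 count is an approximation, so the byte counts are illustrative and will not match the file sizes reported here.

```python
import gzip
import json

FEATURES = 6000  # assumption: "approximately six thousand features"


def properties_for(i):
    """Hypothetical attributes shaped like those in the snippets above."""
    return {
        "OBJECTID": i,
        "SHAPE_area": 0.0000232647852539,
        "SHAPE_len": 0.0305910551739,
        "COMTR": "01M01N%02dW" % (i % 100),
    }


def encoded(objects):
    """Compact JSON encoding as bytes."""
    return json.dumps(objects, separators=(",", ":")).encode("utf-8")


all_attrs = encoded([{"properties": properties_for(i)} for i in range(FEATURES)])
id_only = encoded([{"id": properties_for(i)["COMTR"]} for i in range(FEATURES)])
id_only_gz = gzip.compress(id_only)

print("all attributes:", len(all_attrs), "bytes")
print("id only:", len(id_only), "bytes")
print("id only, gzipped:", len(id_only_gz), "bytes")
```

Because JSON repeats the same keys and similar values for every feature, it compresses well, so it is worth checking that your web server gzips responses before spending more effort trimming the file.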

Conclusions

In our case, using TopoJSON instead of GeoJSON makes a huge difference, bringing the file from 31 MB (11 MB for the shapefile) down to about 2.5 MB. Beyond this, quantization, simplification and attribute reduction each make a sizable difference. In the end we chose a quantization of 5000, a simplification of 1e-8 and limited the attributes to just an ID variable, bringing the size down to 714 KB (134 KB gzipped). That may be small enough to let us skip GeoServer and serve the TopoJSON file directly.

Final Map Using TopoJSON, LeafletJS, D3.js and OpenStreetMap

D3 is entirely unnecessary to show the final map using TopoJSON and we have an example of the map using Leaflet and TopoJSON alone here. But for fun, the map below uses D3 based on code stolen directly from Mike Bostock at this link.