Justin Colson

Learn how to use QGIS to convert lists of place names in to geographic coordinates, allowing you to map them.

edited by

  • Adam Crymble

reviewed by

  • Adam Dennett
  • Léon Robichaud








Lesson Goals

Many types of sources used by historians are inherently spatial. For example:

In this tutorial, you will learn how to ‘geocode’ historial data containing placenames (towns, counties, countries, etc), thus making them mappable using QGIS, a digital mapping software suite. This will allow you to:

This tutorial forms part of the Mapping and GIS series on Programming Historian, and builds upon skills you will have learned in earlier tutorials, especially Installing QGIS 2.0 and Adding Layers. It presumes that you have a set of shapefiles relevant to the region for which you intend to produce a map, and data that you would like to get into those shapefiles so that it can be visualised and analysed.

About Geocoding

Mapping data such as these involves rendering spatial information which humans can understand (such as names of towns or counties) into a format that can be understood by GIS (mapping) software. This is limited to some kind of geometry (a point, line or polygon in a vector representation) relating to the coordinates of this place in two dimensional space – these might be latitude and longitude expressed in degrees, or as is often the case in the UK, eastings and northings of the British National Grid. Geocoding always depends on the use of a gazetter, or list of places and coordinates.

There is often confusion between processes of geocoding and georeferencing.

In many modern applications geocoding is completed automatically, often using the mapping tools and gazetters offered seamlessly as part of Google Maps or OpenStreetMap. When working with contemporary data, or data from relatively recent periods, and Western European or North American historical contexts, this is often sufficient. If you are using data containing place names that are consistent with the present day, you can use the QGIS web geocoder plugin detailed in the postscript to this tutorial, or the Edinburgh Geoparser.

Many historians will be working on contexts where the place names in their data do not match the present day. Remember that street names tend to change relatively frequently, either in terms of spelling or entirely. Administrative areas have changed relatively frequently and were sometimes used inconsistently in historical sources (e.g. was Bristol in Gloucestershire, Somerset, City of Bristol, Avon?) and indeed places have moved between countries, and countries have changed in name and extent. Even town names have changed and are subject to linguistic ambiguities (e.g. Lynn Episcopi, Bishop’s Lynn, Lynn, King’s Lynn, Kings Lynn). For these reasons it is often better to avoid using automated online geocoding tools and create a gazetteer to suit the historical context which you are researching. The processes described in this tutorial are manual, and can be modified and applied to almost any geographical or historical context.

Lesson Structure

This lesson is divided into two main sections:

At the end of the tutorial there is a note on using automated geocoding tools which are available to work with modern addresses, but these are of limited relevance to historians working on eras before the late nineteenth or twentieth centuries.

Getting Started

This tutorial assumes that you have installed QGIS version 2 or newer and have followed the Programming Historian tutorial Installing QGIS 2.0 and Adding Layers by Jim Clifford, Josh MacFadyen and Daniel Macfarlane. Or, at least that you are familiar with the process of adding vector layers in QGIS.

This tutorial was prepared using QGIS 2.14 ‘Essen’ on Mac OS X 10.11. Menus, windows, and options might appear slightly different on different platforms or versions, but it should not be difficult to translate any differences. At a few points in the tutorial reference is made to how these techniques could be applied using ArcGIS, which is the industry standard commercial GIS application, and is widely available at universities, but is not always superior to QGIS.

You will also need to use a relational database such as Microsoft Access or LibreOffice Base, or alternatively be very proficient with spreadsheets. The instructions in the tutorial are designed for use with LibreOffice Base, which is a free download as part of the LibreOffice suite.

NB LibreOffice requires a full installation of Java in order to use the Base application. This is achieved most easily by downloading and installing the Java 8 Development Kit for your operating system from Oracle. The Java 8 Runtime Environment does NOT work with LibreOffice on Mac OS X 10.11.

The tutorial will map the data extracted from Alumni Oxonienses in the Programming Historian lesson Using Gazetteers to Extract Sets of Keywords from Free-Flowing Texts using publically available maps of English and Welsh historic counties. If you complete that tutorial first it will help you to understand the nature of the data which is being mapped here. These data are provided as both a full dataset and also a separate file which is a summary of the numbers of Oxford alumni by their county of origin, created from the first file using an Excel PivotTable.

The Data

Part 1: Joining Tables and Maps

The simplest way to map historical data is to ‘join’ (connect) a table of data containing place names to a layer of map features with the same names. This technique is commonly used by historians to create a map depicting a set of descriptive statistics for a set of data, for instance the number of individuals within a group originating from each county, or the proportion of inhabitants of each county working in a certain industry.

However, joining tables to features in GIS only works on a one-to-one basis (or at least only one-to-one relationships can be used to define the appearance of the map). This means that each map feature can only have one value associated with each of its attributes. You can say there are 50 students from the county Essex in your data, and thus link that to the Essex polygon feature in your shapefile (Table 1). But you cannot store the data as 50 rows, each of which represents a single student that points to the Essex feature in your shapefile (Table 2). One shapefile feature, one value. For this reason, joins are best suited to representing the results of analysis completed in a spreadsheet or database (eg, when you have already calculated the sum of people from a particular area - or whatever value you want to map).

Table 1: This table would work for ‘joins’, as each shapefile has a calculated value.

Shapefile Number of Students
Essex 50
Norfolk 28
Middlesex 81

Table 2: This table would not work for a ‘table join’.

Student Name Place of Origin
Joe Smith Essex
Tom Jones Essex
Matthew Rogers Essex

In this short tutorial we will map the total numbers of early modern University of Oxford alumni from each county. The file AlumniCounties.csv contains a summary of the full dataset which has already been created using a PivotTable in Microsoft Excel. Take a look at this file using your spreadsheet software to look at the column titles and the nature of the data contained in it.

NB: QGIS is very sensitive to correct formatting of Comma Separated Values (CSV) files, specifically the type of line breaks. If you have difficulties using a CSV file created using Microsoft Excel (especially Excel 2007 or 2011 for MacOS) try re-saving the CSV file using LibreOffice Calc or Excel 2016.

Tutorial: Joining Tables and Maps

There is an important distinction between Geographic Coordinate Systems, which simply define measurement units and the datum, and Projected Coordinate Systems, which also define the way in which the globe is ‘flattened’ onto a map. OSGB is available in both variants in QGIS, so choose the ‘projected’ version to get a map in which the United Kingdom appears the shape you would expect. For more details on projections in GIS, see the Working with Projections in QGIS Tutorial.

Figure 1: The QGIS Add Vector window on MacOS (the Add Vector button is circled on the left hand toolbar)

Figure 1: The QGIS Add Vector window on MacOS (the Add Vector button is circled on the left hand toolbar)

You should now see an outline map of British counties in a random colour. If you right-click on the name of this layer in the Layers Panel (shown by default on the bottom left) you can select Open Attribute Table to view the database properties already associated with each feature on the map. Notice that the name of each county is named in three different ways, the fullest of which is in the column called NAME, as well as two ID columns. We now need to join this to the alumni data that we want to map using the fact that the values in the NAME column are the same as those in one of the columns in our spreadsheet (they must be exactly the same for this to work).

The file AlumniCounties.csv contains a summary of the alumni dataset created using a PivotTable in Microsoft Excel. Two named columns contain county names (this column is called Row Labels) and simple totals of individuals originating in those places.

Figure 2: The join fields to vector dialogue

Figure 2: The join fields to vector dialogue

This data can now be shown as a choropleth map by changing the options in the the Style tab of layer properties. QGIS offers a very wide range of style options to convey the data associated with each map element in a graphical form. The style called Graduated allows you to create a choropleth map with a colour gradient repesenting the range of numerical values in your data, while Categorised allows you to assign colours or other visual styles to specific or text values in your tables. For data such as these, with many different values within a logical range, the graduated style of representation is appropriate; if there were just a limited range of potential values, these could be displayed more effectively using the categorized option.

Figure 3: The vector layer Styles tab showing classified values based on the field joined from the table

Figure 3: The vector layer Styles tab showing classified values based on the field joined from the table

For more information on choosing the correct classification method for your data, start by looking at this article on Classification in GIS. Examine the results of your map and think about what is actually being represented. Are the raw numbers of alumni, coloured according to the same classes, for very differently sized counties, helpful? Choropleth maps should normally display data that has been normalised in some way, for example showing population density, rather than raw population.

You may wish to experiment with the Expression Builder (accessed via the ∑ symbol next to Column in Properties>Style) to normalise these values using other columns and values that are available to you. Ideally we might normalise by population, but in the absence of this data, you might experiment by using the $area property, which is intrinsic to polygon shape layers in GIS. The very simple expression needed to create a map colour ramp on this would be (note that the field name contains spaces, so needs to be contained within double quotation marks):

"AlumniCounties_Count of Place of Origin" /  $area  

When you alter any of these settings within the graduated style page you will need to click Classify again to reassign colours to the numerical ranges in your data. If you don’t reclassify, you might find that the layer becomes invisible on your map.

Part 2: Geocoding Historical Data

Geocoding is a much more powerful technique than simple table joins because each and every line of your data remains visible and able to be analysed within the GIS software as an individual point on the map (as in table 2). Fundamentally the aim is to join each item of data to a pair of coordinates. Most historical data cannot be geocoded automatically using online tools or QGIS plugins. The geocoding process must therefore be carried out manually to match each data row with a location. This is a simple database operation joining (matching) your data with a gazetteer (a list of places with coordinates). Many gazetteers are available, but relatively few are suitable for use with historical data, for example, for England:

If no gazetteer exists for the area or period that you are studying, you can make your own relatively simply from a vector map by creating a point layer containing the information that you require within QGIS (potentially by combining information from other existing layers) and exporting that complete with XY coordinates. For some parts of the world there are neither historical gazetters, nor vector maps suitable for historical periods, in these cases you will have to investigate creating your own vector and point layer; see the tutorial Creating New Vector Layers in QGIS 2.0.

Tutorial: Creating a Custom Gazetteer and Geocoding in Relational Database

If you have completed the first part, you can carry on and follow the steps below in the same project. If you did not, or you want to start a new clean project, follow the instructions from the first section to:

Using your existing project, you can now start to add more layers to create your gazetteer:

Figure 4: The Polygon Centroids dialogue and result

Figure 4: The Polygon Centroids dialogue and result

Figure 5: The save vector layer as dialog configured for CSV gazetteer export

Figure 5: The save vector layer as dialog configured for CSV gazetteer export

This data can now be matched against your existing data to complete the geocoding process.

Geocoding your Data Table

We can now create a composite table of these locations and the data from our original table. This is created by matching the name of the county in the ‘place’ field of the alumni table with its equivalent in the new gazetteer using a relational database. This tutorial assumes that you have many hundreds or thousands or rows of data (as we do in this tutorial), requiring an automated method. If you only have a few rows, or you have difficulties using these methods, it is possible to do it manually - see ‘Geocoding your own Historical Data’ below.

In simple scenarios (such as this one where we are only matching a single ‘place’ attribute – i.e. only ‘county’) it is possible to code your data to a gazetteer using the VLOOKUP function in Microsoft Excel (or equivalent spreadsheets) or even using the MMQGIS plugin within QGIS. However, in most practical scenarios you will probably wish to match on several attributes simultaneously (for instance town, county and country – you would want to distinguish between Sudbury, Suffolk, England; Sudbury, Derbyshire, England; Sudbury, Middlesex, England; and Sudbury, Ontario, Canada). This can be achieved in a somewhat cumbersome way using the INDEX function in Excel, but is more practical, and extensible, in a relational database such as Microsoft Access or LibreOffice Base.

This tutorial uses LibreOffice, which is an Open Source alternative to Microsoft Office and is available for Windows, Mac OS X and all variants of Linux etc (NB it requires a full Java installation). It includes a relational database application on all platforms, unlike Microsoft Access which is available only in the Windows version of Office. However, it is quite restricted in its functionality. If you use Microsoft Access, or are a very proficient spreadsheet user, please feel free to replicate this process using your preferred software.

Figure 6: Copying a table into LibreOffice Base

Figure 6: Copying a table into LibreOffice Base

Figure 7: The query design completed in LibreOffice Base, showing the join between the tables and the grid detailing the fields that will show in the result

Figure 7: The query design completed in LibreOffice Base, showing the join between the tables and the grid detailing the fields that will show in the result

NB While relational database queries such as this are very powerful in allowing you match multiple criteria simultaneously, they can also present misleading results if not checked carefully. Check the section Troubleshooting Database Gazetteer Joins at the end of this tutorial for hints on checking results of joins when working with your own data.

Adding Geocoded Data to QGIS

You can now return to QGIS and add the data back to your map, using the new X and Y columns to map the data onto the map.

When you add data that has been geocoded as points in this way, you only initially see a single point in each location. Because each row of data has been geocoded with exactly the same coordinates, they overlap and you can’t immediately see how many there are.

There are several ways to display this data in ways that are more meaningful, and in this regard QGIS has many advantages over the leading commercial software ArcGIS. The most basic way that this can be achieved by creating a new polygon layer containing a new column with a count of points contained within each polygon (this feature is equivalent to a spatial join in ArcGIS). The result of this method would be essentially the same as summarising your data externally (in a database or spreadsheet) and then performing a table join.

A more useful way of depicting the geocoded data is to use QGIS’s advanced display styles such as Heatmap or Point Displacement (these features are laborious to replicate in ArcGIS, and involve creating ‘representations’ in parallel to layers). Point Displacement is probably the most appropriate way to explore this particular data. The advantage of displaying your data using styles, rather than mapping a summary created externally in a spreadsheet using a table join (as in part 1 of this tutorial), or creating a copy of your polygons containing a count of points that had been inside them using the previous steps, is that the layer remains dynamic.

Figure 8: The layer properties Style tab, showing point displacement styles, depicting points that actually overlap at precisely the same location

Figure 8: The layer properties Style tab, showing point displacement styles, depicting points that actually overlap at precisely the same location

You have now completed the geocoding process, and can enjoy the advantages of being able to analyse this inherently spatial historical data in an spatial way. In a real world scenario, you would probably only geocode data which is more precise than simple county level, giving a good deal more analytical potential and making maps plotted more meaningful. Where you have data which can be geocoded to a high – and crucially consistent – level of precision, it is possible to conduct a wide range of geographical analyses such as measures of clustering or distances.

For example, you can easily tweak and refine which records are mapped by changing the definition query in the properties of your geocoded layer (Right click on GeocodedAlumni in Layers Panel and select Layer Properties>General>Provider Feature Filter>Query Builder). You can use the less than or greater than operators to define years and see if trends change over time, or use the SQL LIKE statement to query the ‘details’ column to filter particular colleges – did they tend to attract students from particular counties? These queries use standard SQL language and can be combined with AND, NOT etc. This example would select only those students who had matriculated at Magdalen College:

"Details" LIKE '%Magdalen Hall%'

While the following would select only those who matriculated before 1612:

"Matriculation Year" < 1612

Geocoding your own Historical Data

The processes outlined here – matching using external queries – should be adaptable to a wide variety of scenarios wherever you can obtain or create a suitable gazetteer. Remember that your success will depend on the consistency and accuracy of your data. Ensure that the same conventions are followed in both your data and your gazetteer, especially with regard to punctuation (e.g. ‘Devon’ or ‘Devonshire’, ‘Hay-on-Wye’, or ‘Hay on Wye’ etc.) If you are lucky enough to have data which is presented in modern format (i.e. modern countries, streets or even postcodes), it is possible to use the much easier process of automated geocoding. See the section below.

If you only have a small number of rows in your data, or if you are having difficulty standardising your location information in one field so that it can be geocoded using the methods in this tutorial, you should remember that it is possible to do this process manually. Simply use one of many online geocoding tools to manually find the X and Y coordinates for each row of your data directly into X and Y columns in your spreadsheet or database. Remember to note the coordinate system used by the tool you use to find these coordinates though (probably WGS1984)! If you have manually geocoded data like this, simply follow the instructions above from ‘Adding Geocoded Data to QGIS’

Troubleshooting Database Gazetteer Joins

While relational database queries are very powerful tools for doing customised geocoding, by allowing you to match multiple criteria simultaneously, they can also present misleading results if not checked carefully. Any data that is not matched will usually be ignored ‘silently’ (i.e. you will not see an error message, so it is important to check whether the total number of lines in your results matches that in your original data.

If there are too few results, this means some values cannot be matched. In this table, for example, Place of origin includes values such as ‘London’ and ‘Germany’, which do not match any of the places in the gazetteer that we created. You could either conclude that the lower number of results is acceptable, or try to correct it by either altering places of origin, or adding locations to your gazetteer manually. Changing the properties of the join between the two tables from Inner Join to Right Join will ensure that ALL records from the alumni table are returned, whether or not there is matching data from the gazetteer counties table (presuming that the alumni table is on the right). This is a very useful diagnostic step.

If there are too many results, then each row in one table is matching multiple rows in the other. This is actually quite common with gazetteers, as there are likely to be duplicate points with the same, or very similar, place names in many datasets. This is especially true of very high resolution gazetteers which might have many neighbourhoods within a town individually located, but it is the ‘town’ column that you might want to match against. To guard against stray duplicates like this, you can use database functions to ensure only a single result is returned from your gazetteer. If you encounter this problem you should first create a query which uses the minimum or maximum functions (called sum functions in Access) on the ID field of your gazetteer, together with the group by function on the name field of your gazetteer, to isolate only a single occurence of each place name. You can then treat this as a subquery and add it to your existing query and join the now unique ID field to the existing gazetteer field using an Inner Join to ensure only one occurrence of each place name is matched.

Postscript: Geocoding Modern Addresses

If you have data which contains present-day addresses (such as postal addresses using contemporary street names, post or ZIP codes, or higher-level descriptions such as town or county names that follow modern conventions) then geocoding is very easy, using online tools or tools that use online APIs. Remember that online geocoding is unlikely to work if any of these elements of an address are not consistent with the present day.

Major online mapping providers such as Google, Bing, and OpenStreetMap all offer API (Application Programming Interface) connections to their highly sophisticated geocoding tools. These are the same tools that power the map search features of these websites, so are very effective at making ‘best guess’ decisions on ambiguous or incomplete addresses. It is worth noting that when accessed via an API, rather than their normal website, it will be necessary to provide country names as part of the address.

Figure 9: The 'Web Service Geocode' dialog from the MMQGIS plugin

Figure 9: The ‘Web Service Geocode’ dialog from the MMQGIS plugin

About the author

Justin Colson is a lecturer of history at the University of Essex, and a scholar of early modern Britain.