Fuzzy Address Matching Algorithm using Google’s Geocoding API

Prakhar Gurawa
6 min readApr 15, 2022
Photo by Kai Wenzel on Unsplash

Problem Statement

In this article, we will try to solve a problem where we need to find whether the two addresses provided indicated the same location or different. Naturally, the simple way to do this is to perform an exact string match between the two addresses, but the problem arises when the two addresses are provided by different sources. For example, the first address might be provided by your customer who might be entering this manually using an online form and for the second address, you might be fetching addresses from the government’s land registry database. Now, these theses addresses might look similar or even not but can indicate the same place.

For example, the below two addresses if matched using an exact match algorithm will fail but actually, these are the same places :

  • 134 Ashewood Walk, Summerhill Lane, Portlaoise
  • 134 Summerhill Ln, Ashewood Walk, Summerhill, Portlaoise, Co. Laois, R32 C52X, Ireland

Approach

There can be multiple ways to solve this problem, but I used google's Geocoding API which basically converts addresses or place IDs to latitude/longitude or vice versa. You can read about the API from the below link, also remember this is not a free API and needs you to generate your own API_key to use it. But there is a limit of approx 2000 calls per 24 hours for which they don’t charge anything.

Let us start the coding and in parallel will try to explain the relevant information required. First import all the relevant libraries :

The next step will be generating your API_key for which you can refer to the Google Geocoding API website.

The next step will be to test the external API and analyze its provided content. We have provided an address ‘Bridgestreet Champs Elysees Accommodations Paris’ for which the API will provide us different properties such as street_number, route, locality, country, postal_code, administrative_area_level_1,administrative_area_level_2, latitude, and longitude. For better analysis of the content provided by API this is JSON.

Now we can write a function that will utilize the API and provide results in a clear python dictionary. These are two functions, the first just provide the latitude and longitude and the second provides all possible information that can be captured from the API result.

An example result is provided below for this function created.

Also, we will require a function that can find the distance between two coordinates i.e latitude/longitude. We won’t use Euclidean distance as it works for the flat surface like a Cartesian plain however, Earth is not flat. So we have to use a special type of formula known as Haversine Distance. Haversine Distance can be defined as the angular distance between two locations on the Earth’s surface and can be calculated as :

Source: https://en.wikipedia.org/wiki/Haversine_formula

Thanks to python’s vibrant developer community we have a dedicated library to calculate Haversine distance called haversine(one of the perks of using python). Please read this excellent post by Ashutosh Bhardwaj for a better understanding of this.

Below are two examples where the first pair of coordinated points to the same address and the second pair are coordinated of two houses that are very close to each other.

Address Matching Algorithm

Now we will start working on the main algorithm which compares the two provided addresses and return whether they represent the same place or not. The algorithms perform the below logic :

  • First, we fetch the Geocoding API’s parameters for both the addresses using the function we have already created.
  • Our logic checks that the Haversine distance between the coordinates for the first and second addresses should be less than the threshold set. If the distance is greater than the threshold we will return them as different addresses. I have set this threshold as 0.25 which means 250 meters.
  • Also, we will perform some address cleaning on the various components fetched by the API. We will remove fadas, special string characters adapted from languages of other accents. Also, remove dash(-) and period(.) and convert them in space, also remove every other nonalphanumeric character
  • The second checks are on the various parameters provided by the external Google API. All of them need to be matched, but we won’t perform exact matching for all but will perform fuzzy matching of most of them. For calculating the fuzzy distance we will use the fuzzywuzzy library of python. Fuzzy logic is a form of multi-valued logic that deals with reasoning that is approximate rather than fixed and exact. Fuzzy logic values range between 1 and 0. i.e the value may range from completely true to completely false. In contrast, Boolean Logic is a two-valued logic: true or false usually denoted 1 and 0 respectively, that deals with reasoning that is fixed and exact. Fuzzy logic tends to reflect how people think and attempts to model our decision making hence it is now leading to new intelligent systems(expert systems). Fuzzy String Matching, also known as Approximate String Matching, is the process of finding strings that approximately match a pattern. You can read more on this here :
  • The weights dictionary decided the threshold limits for various address components. Such as for street_number, postal_town, country, and postal_code it is set as 100 which means exact matching, and for others we will have the fuzzy matching threshold set as 80 i.e even if they are 80% similar we will consider them as a match. The full algorithm is displayed below:

Let us test on a few pairs of addresses, the first pair is actually pointing to the same address but looks different if compared using exact matching and which is provided as a match by our algorithm. The second pair are two addresses that are very nearly located and are marked as different addresses by our algorithm.

Full code here

To-do/Room for improvement:

  • Perform better address cleaning.
  • Maybe we can approach this problem using Google’s Autocomplete API as discussed here :
  • Can test with Dedupe library. dedupe is a python library that uses machine learning to perform fuzzy matching, deduplication, and entity resolution quickly on structured data.

Let me know any suggestions/room for improvements/ideas/faults. Thanks

--

--

Prakhar Gurawa

Data Scientist@ KPMG Applied Intelligence, Dublin| Learner | Caricaturist | Omnivorous | DC Fanboy