Hi everybody, I thought I'd give you all an update on this one.
We've done a lot of work on this over the last couple of years, but this is the sort of project that is going to be ongoing pretty far into the future.
Things that have been done:
Added the 'Report Issue' functionality to the website (in the navigation) so that users can report duplicates that they come across.
Added admin tools to merge duplicate Bands/Venues/Locations/Concerts when we find them or when a user reports them.
Created 'aliases' behind the scenes for Bands/Venues/Locations so that any duplicates that we merge won't pop up again in the future as duplicates.
Started verifying locations against the external GeoNames database to standardize them.
Implemented a location verification step after a concert is created to prompt the user to verify and standardize the location that they entered.
Reviewed all of the locations (about 25,000 at the time) roughly 2 years ago to standardize them and merge duplicates.
Created an automated system that runs daily that checks for duplicate concerts and merges them together if they have matching bands/dates/locations/venues.
Created 'nonstandard concert types' to ensure that we don't merge seemingly duplicate concerts together in cases where one was an 'Early Show', 'VIP Event', etc.
Started collecting venue data from Google Maps and Pollstar in preparation for standardizing venues.
Started collecting band data from MusicBrainz, Spotify, Last.fm, and Pollstar in preparation for standardizing bands.
Probably more things that I'm forgetting at the moment :)
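For anyone curious how the daily duplicate check, the aliases, and the 'nonstandard concert types' fit together, here is a rough Python sketch. It is only an illustration, not the actual Concert Archives code: the `Concert` fields, the `BAND_ALIASES` table, and the function names are all made up, and it assumes that two concerts count as duplicates only when band, date, venue, location, and concert type all match.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical alias table: alternate spelling -> canonical band name.
BAND_ALIASES = {"rolling stones": "the rolling stones"}


@dataclass(frozen=True)
class Concert:
    id: int
    band: str
    date: str  # ISO date string, e.g. "1999-07-04"
    venue: str
    location: str
    concert_type: str = "standard"  # e.g. "Early Show", "VIP Event"


def canonical_band(name: str) -> str:
    """Lowercase the name and map known aliases to the canonical name."""
    key = name.strip().lower()
    return BAND_ALIASES.get(key, key)


def find_duplicate_groups(concerts):
    """Group concert ids that share band, date, venue, location, and type."""
    groups = defaultdict(list)
    for c in concerts:
        key = (
            canonical_band(c.band),
            c.date,
            c.venue.strip().lower(),
            c.location.strip().lower(),
            # Keeping the type in the key stops an 'Early Show' from being
            # merged into the regular show on the same night.
            c.concert_type.strip().lower(),
        )
        groups[key].append(c.id)
    return [ids for ids in groups.values() if len(ids) > 1]
```

In this sketch, two listings for the same show that differ only by a known band alias land in the same group, while an 'Early Show' on the same date stays separate.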
Rethinking how we standardize venues:
Our first attempt at standardizing venues turned out to be much more labor-intensive than we expected.
Our original approach was to compare the venue information we have in Concert Archives to the information we gathered from Google Maps. This seemed reasonable, but we soon realized that we'd need to manually verify each Google Maps result to make sure it was actually the right venue and not a similarly named venue, a venue's new name in place of its historic name, or something similar. With over 300,000 venues, manually confirming the Google Maps data ourselves wasn't feasible. From there, we tried splitting the work up and having our Patreon members help review individual cities, but we'd still need to review their work, so it didn't really lessen the workload.
So standardizing venues has been on the back burner for a while, but we do have more ideas about how to tackle it better. One thing that we've already started doing is collecting venue data from one of our other data providers (Pollstar). By comparing our venue data against both Google Maps and Pollstar, it may be easier to confirm that the data matches, since we'd have 2 outside opinions and may be better able to make automated decisions about the accuracy of the data.
Another idea that we have is feeding all of the venue data into an AI system and having it make recommendations on duplicate listings.
So yeah, standardizing venues has been an open project for a long time. We'll solve it eventually, but we needed to move it to the back burner so that we can work on more pressing projects until we come up with a more feasible way to solve it.
Things that we'll be working on in the future:
Standardizing Venues, as I just laid out.
Standardizing Bands in a similar way to whatever we figure out for venues.
Adding 'Location Aliases' to Venues. Currently, when we create aliases for venues, the venues are only merged together when they have the same location. So for example, "Redrocks" and "Redrocks Amphitheater" will automatically merge together when there is an alias, as long as they are both associated with the "Morrison, Colorado, United States" location. The locations need to be the same to make sure we don't accidentally merge legitimate venues that have the same name but different locations (like "Hard Rock Live" in Dallas vs "Hard Rock Live" in LA). The issue comes up when users commonly enter the wrong location for a venue, like they do with Redrocks, putting it in Denver rather than Morrison, Colorado. Our plan is to create "Location Aliases" for venues so that a Denver Redrocks listing always gets automatically merged with the correct Morrison Redrocks venue.
It's been a while since we've reviewed the non-standardized Locations, so we're going to do another pass through those (currently about 27,000 non-standardized location listings).
Probably more things that I'm forgetting at the moment :)
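To make the 'Location Aliases' idea above a bit more concrete, here is a minimal Python sketch. This feature is still only planned, so everything in it is hypothetical: the two lookup tables, the `resolve_venue` function, and the exact "Denver, Colorado, United States" location string are assumptions made for illustration.

```python
MORRISON = "Morrison, Colorado, United States"
DENVER = "Denver, Colorado, United States"  # assumed format for the wrong location

# Existing behavior: a name alias only applies within one location.
# (name as entered, location) -> canonical venue name
VENUE_ALIASES = {
    ("Redrocks", MORRISON): "Redrocks Amphitheater",
}

# Planned behavior: a per-venue location alias fixes a commonly mistaken location.
# (name as entered, wrong location) -> correct location
VENUE_LOCATION_ALIASES = {
    ("Redrocks", DENVER): MORRISON,
    ("Redrocks Amphitheater", DENVER): MORRISON,
}


def resolve_venue(name: str, location: str) -> tuple[str, str]:
    """Return the canonical (venue name, location) for a user-entered pair."""
    # Fix the location first, so the name alias lookup sees the right location.
    location = VENUE_LOCATION_ALIASES.get((name, location), location)
    name = VENUE_ALIASES.get((name, location), name)
    return name, location
```

Because both tables are keyed on the venue name together with a location, a "Hard Rock Live" in Dallas is left alone while a Denver "Redrocks" entry is steered to the Morrison venue.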
So yeah, we've been working away on all of this stuff. It just takes time and a lot of nuance to handle data standardization accurately and on a large scale. Just because we don't post public updates very often doesn't mean things aren't happening behind the scenes 😅
Thanks everybody for being a part of Concert Archives!
--Justin