Hi everybody, I thought I'd give you all an update on this one.
We've done a lot of work on this over the last couple of years, but this is the sort of project that is going to be ongoing pretty far into the future.
Things that have been done:
Added the 'Report Issue' functionality to the website (in the navigation) so that users can report duplicates that they come across.
Added admin tools to merge duplicate Bands/Venues/Locations/Concerts when we find them or when a user reports them.
Created 'aliases' behind the scenes for Bands/Venues/Locations so that any duplicates that we merge won't pop up again in the future as duplicates.
We now verify locations against the external Geonames database to standardize them.
Implemented a location verification step after a concert is created to prompt the user to verify and standardize the location that they entered.
About 2 years ago, reviewed all of the locations (about 25,000 at the time) to standardize them and merge duplicates.
Created an automated system that runs daily that checks for duplicate concerts and merges them together if they have matching bands/dates/locations/venues.
Created 'nonstandard concert types' to ensure that we don't merge seemingly duplicate concerts together in cases where one was an 'Early Show', 'VIP Event', etc.
Started collecting venue data from Google Maps and Pollstar in preparation for standardizing venues.
Started collecting band data from Musicbrainz, Spotify, Last.fm, and Pollstar in preparation for standardizing bands.
Probably more things that I'm forgetting at the moment :)
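For anyone curious how the daily duplicate check and the 'nonstandard concert types' fit together, here is a rough Python sketch. It is not our actual code: the `Concert` fields, the `find_duplicate_groups` name, and the idea of including the concert type in the matching key are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Concert:
    # All field names here are hypothetical, chosen for the example.
    id: int
    band_ids: frozenset
    date: str
    location_id: int
    venue_id: int
    concert_type: Optional[str] = None  # e.g. "Early Show"; None = standard show


def find_duplicate_groups(concerts):
    """Group concert ids that share bands, date, location, venue and type."""
    groups = defaultdict(list)
    for c in concerts:
        # Including concert_type in the key keeps an 'Early Show' from
        # being merged with the standard show on the same night.
        key = (c.band_ids, c.date, c.location_id, c.venue_id, c.concert_type)
        groups[key].append(c.id)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]


concerts = [
    Concert(1, frozenset({10}), "2023-05-01", 5, 7),
    Concert(2, frozenset({10}), "2023-05-01", 5, 7),
    Concert(3, frozenset({10}), "2023-05-01", 5, 7, "Early Show"),
]
duplicate_groups = find_duplicate_groups(concerts)
print(duplicate_groups)  # [[1, 2]]
```

Concerts 1 and 2 would be merged, while the 'Early Show' listing stays separate.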
Rethinking how we standardize venues:
Our approach when we started working on standardizing venues turned out to be much more labor-intensive than we thought.
Our original approach was to compare the venue information we have in Concert Archives to the information we gathered from Google Maps. This seemed reasonable, but we soon realized that we'd need to manually verify each Google Maps result to make sure it was actually the right venue and not a similarly named venue, a newer venue name in place of the historic one, or something similar. With over 300,000 venues, manually confirming the Google Maps data ourselves wasn't feasible. From there, we tried splitting the work up and having our Patreon members help review individual cities, but we'd still need to review their work, so it didn't really lessen the workload.
So standardizing venues has been on the back burner for a while, but we do have more ideas about how to tackle it. One thing that we've already started doing is collecting venue data from one of our other data providers (Pollstar). By comparing our venue data against both Google Maps and Pollstar, it may be easier to confirm that the data matches, since we'd have two outside opinions and may be better able to make automated assumptions about the data's accuracy.
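To give a feel for the "two outside opinions" idea, here is a minimal Python sketch. The `normalize` and `review_decision` functions and the exact-match rule are assumptions for illustration only; a real comparison would likely need fuzzier matching and would also compare addresses.

```python
import re


def normalize(name):
    """Lowercase a venue name and strip punctuation and extra spaces."""
    name = name.lower().replace("&", " and ")
    name = re.sub(r"[^a-z0-9 ]", "", name)
    return " ".join(name.split())  # collapse repeated whitespace


def review_decision(our_name, google_name, pollstar_name):
    """Auto-confirm only when both outside sources agree with our listing."""
    ours = normalize(our_name)
    if ours == normalize(google_name) and ours == normalize(pollstar_name):
        return "auto-confirm"
    return "manual review"


print(review_decision("Hard Rock Live", "HARD ROCK LIVE", "Hard Rock Live!"))
print(review_decision("Hard Rock Live", "Hard Rock Live", "Hard Rock Cafe"))
```

The first call prints `auto-confirm` and the second prints `manual review`, so only the listings where the sources disagree would need a human to look at them.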
Another idea that we have is feeding all of the venue data into an AI system and having it make recommendations on duplicate listings.
So yeah, standardizing venues has been an open project for a long time. We'll solve it eventually, but we needed to move it to the back burner so that we can work on more pressing projects until we come up with a more feasible way to solve it.
Things that we'll be working on in the future:
Standardizing Venues as I just laid out.
Standardizing Bands in a similar way to what we figure out for venues.
Adding 'Location Aliases' to Venues. Currently, when we create Aliases for venues, the venues are only merged together when they have the same location. For example, "Redrocks" and "Redrocks Amphitheater" will automatically merge together when there is an alias, as long as they are both associated with the "Morrison, Colorado, United States" location. The locations need to be the same so that we don't accidentally merge legitimate venues that have the same name but different locations (like "Hard Rock Live" in Dallas vs "Hard Rock Live" in LA). The issue comes up when users commonly enter the wrong location for a venue, like they do with Redrocks, listing it under Denver rather than Morrison, Colorado. Our plan here is to create "Location Aliases" for venues so that a Denver Redrocks listing always gets automatically merged with the correct Morrison Redrocks venue.
It's been a while since we've reviewed the non-standardized Locations, so we're going to do another pass through those (currently about 27,000 non-standardized location listings).
Probably more things that I'm forgetting at the moment :)
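Here is a rough Python sketch of how the planned 'Location Aliases' for venues could work on top of the existing venue aliases. The dictionaries, the `resolve_venue` function, and the choice of "Redrocks Amphitheater" as the canonical name are all hypothetical and only illustrate the idea; the Denver location string is also made up for the example.

```python
MORRISON = "Morrison, Colorado, United States"
DENVER = "Denver, Colorado, United States"  # assumed format for the example

# Venue aliases only apply when the name AND the location match.
VENUE_ALIASES = {
    ("redrocks", MORRISON): "Redrocks Amphitheater",
}

# Location aliases fix a commonly mis-entered location for a given venue.
LOCATION_ALIASES = {
    ("redrocks", DENVER): MORRISON,
    ("redrocks amphitheater", DENVER): MORRISON,
}


def resolve_venue(name, location):
    """Return the (venue name, location) a new listing should merge into."""
    key_name = name.strip().lower()
    # Step 1: correct the location if this venue has a location alias.
    location = LOCATION_ALIASES.get((key_name, location), location)
    # Step 2: apply the venue alias, which requires a matching location.
    canonical = VENUE_ALIASES.get((key_name, location), name)
    return canonical, location


print(resolve_venue("Redrocks", DENVER))
print(resolve_venue("Hard Rock Live", "Dallas, Texas, United States"))
```

In this sketch the Denver "Redrocks" listing resolves to the Morrison venue, while "Hard Rock Live" in Dallas is left alone because no alias matches both its name and its location.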
So yeah, we've been working away on all of this stuff. It just takes time and a lot of nuance to handle data standardization accurately and on a large scale. Just because we don't post public updates very often doesn't mean things aren't happening behind the scenes 😅
Thanks everybody for being a part of Concert Archives!
--Justin
Hi Justin! I’m not a developer at all, so take this with a grain of salt, but just from a practical perspective, I think there needs to be a two-pronged approach for when people create duplicate listings for the same concert: 1) let your user base help clean this up by giving us a “merge concerts” feature rather than just a flagging feature that you/a team/AI then has to deal with, and 2) stem the flow by adding some barriers to creating a new concert from scratch! Right now, entering new concert info is the first option under “add a concert”, and I don’t think it should be. I understand this is a nice feature to have for adding historical concerts, but for recent concerts, it seems like we should just be importing from somewhere (Songkick, setlist..), never putting in our own info. What if users had to go through at least one additional step of searching for an existing concert first, and then beyond that there could be a button that says something like “Can’t find what you’re looking for? If you’re sure your concert isn’t on Songkick, add the details here!” I think if you made it a tiny bit more involved to create a concert from scratch AND let us users merge concerts as we find dupes in real time, then we could get on top of the dupe issue much more quickly! Just my two cents.
Isn’t this what the bucket list section is for?