De-dupe and standardize data
I would prefer a focus on completely de-duping and standardizing the existing data before adding any new features.
Hi everybody, I thought I'd give you all an update on this one.
We've done a lot of work on this over the last couple of years, but this is the sort of project that is going to be ongoing pretty far into the future.
Things that have been done:
- Added the 'Report Issue' functionality to the website (in the navigation) where users can report duplicates that they come across
- Added admin tools to merge duplicate Bands/Venues/Locations/Concerts when we find them or when a user reports them.
- Created 'aliases' behind the scenes for Bands/Venues/Locations so that any duplicates that we merge won't pop up again in the future as duplicates.
- We now verify locations against the external Geonames database to standardize them.
- Implemented a location verification step after a concert is created to prompt the user to verify and standardize the location that they entered.
- Reviewed all of the locations (about 25,000 at the time) about 2 years ago to standardize them and merge duplicates.
- Created an automated system that runs daily, checking for duplicate concerts and merging them together if they have matching bands/dates/locations/venues.
- Created 'nonstandard concert types' to ensure that we don't merge seemingly duplicate concerts together in cases where one was an 'Early Show', 'VIP Event', etc.
- Started collecting venue data from Google Maps and Pollstar in preparation for standardizing venues.
- Started collecting band data from Musicbrainz, Spotify, Last.fm, and Pollstar in preparation for standardizing bands.
- Probably more things that I'm forgetting at the moment :)
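For anyone curious how the daily duplicate check and the 'nonstandard concert types' fit together, here is a minimal sketch. It is not our actual code: the `Concert` fields, the `group_duplicates` function, and the choice to treat the concert type as part of the matching key are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Concert:
    # Hypothetical record shape, not the real Concert Archives schema.
    id: int
    bands: frozenset
    date: str
    venue: str
    location: str
    concert_type: Optional[str] = None  # e.g. "Early Show"; None = standard show


def group_duplicates(concerts):
    """Return lists of concert ids that look like the same event.

    Concerts match when bands, date, venue, and location are identical.
    The concert type is part of the key (an assumption), so an 'Early Show'
    is never grouped with the regular show on the same night.
    """
    groups = defaultdict(list)
    for c in concerts:
        key = (c.bands, c.date, c.venue, c.location, c.concert_type)
        groups[key].append(c.id)
    # Only groups with more than one concert are merge candidates.
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]
```

In this sketch, two plain listings of the same show end up in one group, while a third listing marked 'Early Show' stays separate.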
Rethinking how we standardize venues:
Our initial approach to standardizing venues turned out to be much more labor-intensive than we expected.
Our original approach was to compare the venue information we have in Concert Archives to the information we gathered from Google Maps. This seemed reasonable, but we soon realized that we'd need to manually verify each Google Maps result to make sure it was actually the right venue, rather than a similarly named venue, a venue's current name instead of its historic name, or something similar. With over 300,000 venues, manually confirming the Google Maps data ourselves wasn't feasible. From there, we tried splitting the work up and having our Patreon members help review individual cities, but we'd still need to review their work, so it didn't really lessen the workload.
So standardizing venues has been on the back burner for a while, but we do have more ideas about how to tackle it. One thing that we've already started doing is collecting venue data from one of our other data providers (Pollstar). By comparing our venue data against both Google Maps and Pollstar, it may be easier to confirm that the data matches, since we'd have two outside opinions and may be better able to make automated assumptions about the data's accuracy.
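A rough sketch of the "two outside opinions" idea: a venue is confirmed automatically only when both outside sources agree with our name, and everything else still goes to manual review. The `normalize` and `classify_venue` helpers, the use of Python's `difflib`, and the 0.9 threshold are assumptions for illustration, not something we have built.

```python
import re
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, drop punctuation, and collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", "", name.casefold())
    return " ".join(cleaned.split())


def classify_venue(ours, google_maps, pollstar, threshold=0.9):
    """Return 'confirmed' only if both outside names closely match ours.

    A missing result from either source, or a weak match, falls back to
    'needs_review'. The threshold value is a guess and would need tuning.
    """
    if google_maps is None or pollstar is None:
        return "needs_review"

    def similar(a, b):
        ratio = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
        return ratio >= threshold

    if similar(ours, google_maps) and similar(ours, pollstar):
        return "confirmed"
    return "needs_review"
```

Name similarity alone would not catch the historic-name problem described above, so a real version would probably also need to compare addresses or coordinates.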
Another idea that we have is feeding all of the venue data into an AI system and having it make recommendations on duplicate listings.
So yeah, standardizing venues has been an open project for a long time. We'll solve it eventually, but we needed to move it to the back burner so that we can work on more pressing projects until we come up with a more feasible way to solve it.
Things that we'll be working on in the future:
- Standardizing Venues as I just laid out.
- Standardizing Bands in a similar way to what we figure out for venues.
- Adding 'Location Aliases' to Venues. Currently, when we create an alias for a venue, listings are only merged when the venues have the same location. For example, "Redrocks" and "Redrocks Amphitheater" will automatically merge when there is an alias, as long as they are both associated with the "Morrison, Colorado, United States" location. The locations need to match so that we don't accidentally merge legitimate venues that share a name but are in different locations (like "Hard Rock Live" in Dallas vs "Hard Rock Live" in LA). The problem comes when users commonly enter the wrong location for a venue, as they do with Redrocks, listing it in Denver rather than Morrison, Colorado. Our plan is to create "Location Aliases" for venues so that a Denver Redrocks listing always gets automatically merged with the correct Morrison Redrocks venue.
- It's been a while since we've reviewed the non-standardized Locations, so we're going to do another pass through those (currently about 27,000 non-standardized location listings).
- Probably more things that I'm forgetting at the moment :)
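To make the 'Location Aliases' plan a bit more concrete, here is a small sketch of how the lookup could work. The two dictionaries, the `resolve_venue` function, and the full "Denver, Colorado, United States" and "Dallas, Texas, United States" strings are hypothetical; the real feature hasn't been built yet and may look quite different.

```python
MORRISON = "Morrison, Colorado, United States"
DENVER = "Denver, Colorado, United States"  # illustrative location string

# Existing behaviour: a name alias only applies within one location.
NAME_ALIASES = {
    ("Redrocks", MORRISON): "Redrocks Amphitheater",
}

# Planned behaviour (sketch): a per-venue alias from a commonly
# mis-entered location to the correct one.
LOCATION_ALIASES = {
    ("Redrocks", DENVER): MORRISON,
    ("Redrocks Amphitheater", DENVER): MORRISON,
}


def resolve_venue(name, location):
    """Map a user-entered (venue, location) pair to its standard form."""
    # Step 1: fix the location, but only for venues with a location alias.
    location = LOCATION_ALIASES.get((name, location), location)
    # Step 2: apply the name alias, which is still scoped to the location.
    name = NAME_ALIASES.get((name, location), name)
    return name, location


print(resolve_venue("Redrocks", DENVER))
```

Because both lookups are keyed on the venue name and the location together, a venue with no alias entry, such as "Hard Rock Live" in Dallas, passes through unchanged.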
So yeah, we've been working away on all of this stuff; it just takes time and a lot of nuance to handle data standardization accurately and at a large scale. Just because we don't post public updates very often doesn't mean things aren't happening behind the scenes 😅
Thanks everybody for being a part of Concert Archives!
--Justin
-
Sahnaa commented
I see so many duplicates of the same show… I think it would help if we had an option to flag them… is there one ?
-
noisetemple commented
Hi! Just joined, and am excited about logging shows.
I've been (what I would call) a frontline volunteer data quality checker for the FamilySearch genealogical project for nine years. Adding, editing, and a lot of dedup and standardizing are all involved. If dev(s) open up more micro-task crowdsourcing, I’d enjoy contributing to the “sweep.”
Also: I agree with the recs about an entry “merge” function, and the point made about one of the standardization perils being some user’s feelings of entry “ownership” (seen that a lot on FS.)
-
Concert Mojo commented
Masquerade (ATL) Heaven, Hell and Purgatory are different venues. They used to be in the same building but are now in different buildings (though right near each other). They must remain separate since there are sometimes concerts in all 3 on the same night. I have often seen shows in 2 or 3 on the same night. Not sure how important it is to reflect if it is OG Masquerade or new one in the Underground.
-
Sahnaa commented
Nearly done in April 2021… where are we on this project now ? Did it start ? I don’t mean to be pushy at all, I appreciate all the team’s work on this app/website, but would like updates on the changes that are made/in the making. If it’s already somewhere, could someone point me in the right direction please ? 😅
-
@Justine Baddeley, we're working on a way to separate out concerts that 'appear' to be duplicates (same band, date, venue, location) but really aren't. We'll be implementing a new option where you'll be able to mark a band as an 'Early Show', 'After Party', etc and that will keep them from being merged together. Hopefully we'll have that released in July.
-
_Just_Bad commented
Good idea to de-dupe... until I noticed it merged 2 separate concerts played at the same venue on the same day into 1. The bands played an all-ages afternoon show, then later an 18+ only show the same day. I went to both, so instead of saying that I went to 2 shows it's only counting as 1. I even updated the tour title to distinguish the 2 and they still merged.
-
Jill Steiner McCall commented
Agree with Dan Curhan below, i.e. The Masquerade in Atlanta has an old location from the '90s and a new location starting in the 2000s, plus multiple locations for the different stages, Hell & Heaven.
Granted the old and new are 'technically' valid as they do have different street addresses if those were to be used, but the Heaven and Hell stage locations are definitely redundant.
Also the band Drivin' N' Cryin' is entered something like 4-5 different times, with various apostrophe options allowed. There should be a way to force people to use only one option for them.
-
dan curhan commented
It's not just concerts that have duplicates - search "middle east" under Venues. There are THREE PAGES of results, all for what appear to be the same venue: The Middle East in Cambridge, MA, which has both an upstairs stage and a downstairs stage so should return, at most, two results.
I love the service and the website format and all, but the data is all over the place!
-
Greg Fasolino commented
I think the #1 priority should be merging all duplicates. Too many concerts have multiple entries for the exact same event. Even worse, we have some members who think their entries are "theirs" and belong to them uniquely. Each concert that occurred needs to have just one merged entry to avoid this kind of stuff.
-
Greg Fasolino commented
Love this website, the idea is fantastic, but it needs much more standardization, merging of duplicate entries, etc. Is there a way to make it so people cannot add in duplicates? (Setlist.fm does this automatically).
-
Matt Suda commented
The search has not been working lately. Using basic keywords as suggested on the search page comes back with no results found with only suggestions to import from SongKick. This has resulted in many duplicated concert listings lately.
Update: After contacting support this was fixed
-
grandpoobah commented
Your site would be much more valuable if I could look up history of a venue (Wiltern Theater in Los Angeles, as an example) and find one entry instead of the 23 that are there now. I do not have that problem on setlist.fm
-
Mark F. King commented
Consolidating multiple entries of the same concert would tidy things up. Once you figure out how exactly you are going to do this I would like to help with it. Let me know. Thanks.
-
mrblond commented
I've only been scanning the database for two days and yes, there are many duplicates. I hope the hosts of this website want it to be tidied up. I think the number of concerts should come down by about 20%, maybe more.
-
Xanthe commented
There are a lot of duplicate entries for concerts or festivals and it makes it difficult to search through the database. If there was a button to suggest combining entries, or if we were able to manually do it, that would be amazing. A way so that people won't lose access to that archived concert but everyone will be combined under one. For festivals in particular it's frustrating, because some entries are missing the majority of the artists attending, or other minor information like the location. So if it was all combined under one it would be much cleaner and easier, plus since we can pick which artists we actually saw there shouldn't be that much of an issue. AND! if someone opens up a concert archive and they can't recall most of the information, other people would be able to add to that archive rather than having 10 different entries with none of them having complete information. (Also maybe being able to add a header photo to a concert page, like the tour poster)
-
Benny62 commented
Is there any way you can stop people adding multiple venues, places, bands, etc. because they spell them differently or use commas or different grammar? It’s annoying to find multiple concerts purely because entries have been misspelt.
-
Troy C. commented
Completely agree with this. There are many concerts with 3+ entries because of slight naming differences. It also makes it tough to search by tour or venue
-
Davey Gravy commented
I wholeheartedly agree with this enhancement. There are too many duplicates and the data could use a clean sweep.
-
Jayson Hanks commented
Agreed, but I would add that you shouldn't use abbreviations such as St. Paul, but instead Saint Paul, or Saint Louis, MO.
-
frnksassbutt commented
Ability to merge a concert so that there aren't loads of different posts for the same concert