Bad Data and Data Engineering: Dissecting Google Play Music Takeout Data using Beam, go, Python, and SQL
Google Play Music joined its brethren in the Google Graveyard of cancelled products in late 2020. Since I had used Google Play Music many years ago to back up all my MP3 rips (I collect CDs and vinyl and have done so for about 17 years), Google sent me several friendly emails, asking me to transfer my music to YouTube Music and/or use Google Takeout to download a copy.
I did the latter. Little did I know that what I would receive would rival even the most horrifying “We’ve been doing reporting in Excel for 12 years!"-type datasets I’ve seen throughout the years. This is an attempt to tell the story of how I approach badly documented, rather large datasets through a Data Engineering lens, and how I turned a dump of data into something useful.
This article will pretend we’re not just organizing some music, but rather writing a semi-professional, small-sized (~240GB in total) data pipeline on bare metal, with simple tools. During this, we’ll be working with a fairly standard Data Engineering toolset (`Apache Beam`, `go`, `Python`, and `SQL`), as well as work with both structured and unstructured data.
(My little KEF LS50 in front of the CDs that started this whole journey)
 The cynic in me thinks that this is intentional (in order to push you to YouTube Premium) - but I would not be surprised if Google has a perfectly valid Engineering reason for this mess.
All code can be found on GitHub!
Making Sense of it: The Google Play Music Takeout Data
Before we even try to attempt to formalize a goal here, let’s see what we’re working with. I should add that your mileage may vary - I only have my export to work with here.
Exporting your data can be customized quite nicely:
You can select the compression format and archive size. Once you’ve done that and waited for some nondescript batch job to complete (which will take several hours), you can download those files:
You can throw that link at `wget`, so as not to be at the mercy of an X server if you want to push it to a file server directly (in my case, via …).
The resulting files are just zip archives, which you can extract.
The resulting structure looks like this:
I’ve cut out some playlists, but other than that, that’s it: three main folders - Playlists, Radio Stations, and Tracks. At the root of this folder lives an `archive_browser.html` file.
The playlists folder contains a wild mix of album titles, custom playlists, and generic folders.
Within those folders, taking Austrian Black Metal band Dornenreich’s album “In Luft Geritzt” as an example, you’ll find a bunch of CSV files like these:
The entire folder doesn’t contain a single music file, just csvs.
These CSVs have the following structure:
No music thus far. Let’s keep looking.
This one contains a single folder with a bunch of `csv` files showing me metadata about “recent” radio stations:
Not overly helpful and in a broken encoding (`text/plain; charset=us-ascii`, which causes special characters like apostrophes to be escaped). What more could you ask for?
The Tracks folder is a monolith of, in my case, 23,529 files. 11,746 are `csv` files, 11,763 are `mp3` files.

The `csv` files don’t necessarily correspond to an `mp3` (but some do!) and are also not unique: `Paschendale (0)` through `Paschendale (2)` exist as CSV files, but not as actual music.
When it comes to the actual music files, it appears that they do contain `id3v2` tags:
I’ve pointed a local Apache web server at the `archive_browser.html` file, and we can find a rough description of the exported data:
Which is not overly helpful, but gives us rough data dictionaries for all CSV files.
What we have so far
We now somewhat understand what’s going on here - the export contains all `mp3` files that were uploaded (as a backup) at one point. Music files that were never uploaded and just streamed - or not exported, for some reason - will show up as `csv` files only.

There are duplicates and there are no naming conventions kept (e.g. …). We also know that the data doesn’t contain any form of indexes, meaning we won’t have anything to uniquely match, for instance, band or album names. Last but not least, file names contain spaces and special characters, which aren’t great for use in e.g. `bash` scripts.
- Some files exist as `mp3`, others only as `csv`
- Naming conventions are non-existent
- There are duplicates
- `csv`s are not consistently encoded as `UTF-8`
- There are no keys to match to other data
- `mp3`s have some `id3v2` tags
- File names are not in a `UNIX` style, i.e. contain spaces
The End-Goal: Organization
With this out of the way, let’s see what we can do with this.
I already maintain my music - a more recent copy - in the format mentioned above, with `id3v2` tags on all files.
If we were able to…
- Re-organize all files into a logical naming convention
- Remove all duplicates, keeping only the “best” copy (based on compression/quality)
- Match them with the master data copy of music on my server and add all that are missing to it
- Match all non-existent mp3 files to the master data copy of music, and call out all that are missing
In order to achieve this, let’s pretend this is not a weekend during COVID, but rather a professional Data Engineering project - and over-engineer the hell out of it. For fun. And out of desperation.
Step 1: Build an Architecture
First, we’ll design a quick architecture that uses proven tools - we don’t need a lot, as we’re really only trying to parse and move ~100GB around.
- Takeout files are our incoming data
- Master files are our existing library
- Organize will simply move `mp3` files into their own respective directories
- Add filenames will add filenames to the `csv`s (see below)
- Index will move all `csv` files, i.e. structured data, to tables
- Extract id3v2 will extract the unstructured data’s `id3v2` tags
- Analyze will be the `sql` portion of the exercise, where we try to make sense of our data
When it comes to technology, I’ll be using `bash` for scripting, `Apache Beam` for more intense workloads, and `MariaDB` for a quick-and-easy database that can handle the number of records we can expect. Plus, I have one running on my server anyways.
Step 2: Provisioning a “Dev Environment”
Since we’re at it, we might as well grab a representative set of data in a test environment. For this, I simply created local folders on a different drive-array and copied some data:
The last line is the only funky one and simply copies 1,000 random files from the “Tracks” directory. I find it a good idea to pipe to an output file for deterministic behavior’s sake, rather than piping directly to the next command - `xargs` would be much faster, though.
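For illustration, a rough `go` equivalent of that sampling step could look like this - the paths are placeholders and the original uses a shell one-liner, so treat this purely as a sketch:

```go
package main

import (
	"io"
	"log"
	"math/rand"
	"os"
	"path/filepath"
)

func main() {
	// Hypothetical source and target paths for the "dev" sample
	src, dst := "/mnt/takeout/Tracks", "/mnt/dev/Tracks"

	entries, err := os.ReadDir(src)
	if err != nil {
		log.Fatal(err)
	}

	// Shuffle the directory listing and keep the first 1,000 entries
	rand.Shuffle(len(entries), func(i, j int) { entries[i], entries[j] = entries[j], entries[i] })
	if len(entries) > 1000 {
		entries = entries[:1000]
	}

	for _, e := range entries {
		if e.IsDir() {
			continue
		}
		if err := copyFile(filepath.Join(src, e.Name()), filepath.Join(dst, e.Name())); err != nil {
			log.Fatal(err)
		}
	}
}

// copyFile copies a single file from src to dst.
func copyFile(src, dst string) error {
	in, err := os.Open(src)
	if err != nil {
		return err
	}
	defer in.Close()

	out, err := os.Create(dst)
	if err != nil {
		return err
	}
	defer out.Close()

	_, err = io.Copy(out, in)
	return err
}
```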
Step 3: Re-Organizing files
The first step for any data project (besides figuring out what you’re trying to do) should be to get an understanding of the data we’re working with. We’ve already made the first steps by going through that >100GB dump above, but formalizing it will help us to truly understand what’s going on and to ensure we can script out everything without missing corner cases.
Given that our input file structure is all over the place (with >20,000 mixed-type files in one directory!), we can script that out to make it a repeatable process.
One word of advice here - even if you think you’ll never use the script again and you’re the only person ever using it, just assume that to be wrong and that somebody in your organization will have to use it at a later point.
This is why we woefully over-engineer this thing - argument parsing, escaping Strings, all that - because chances are, you’ll see it pop up soon. This comes from a decent amount of experience - I find scripts I wrote for myself ages ago in the wild way too often.
The script should be decently self-explanatory - it copies data into a new, flat structure. We do some pseudo-grouping to concatenate all Playlists into single CSVs. Please note that this is all single-threaded, which I don’t recommend - with `nohup` and the like, you can trivially parallelize this.
Once we run it, we get a nice and clean structure:
This dataset will be our starting point. In real data pipelines, it is not uncommon to do some basic cleanup before starting the real work - just because it makes our lives so much easier down the line.
Step 4: Creating Data Dictionaries
Based on our initial cleanup, we can now automatically profile our csv files, given that we have all of them in one place.
`Python` and `pandas` are great tools to have at hand for that -
Which gives us outputs like this:
Which we can use to build out our tables:
| Radio | Other artists on this station | STRING |
| Radio | Artists on this station | STRING |
These data dictionaries and table definitions will help us to understand the structure of the data. Documentation is boring, yes, but critically important for any data project.
Note that the `float`s here aren’t really that - they are binary indicators we’ll change later. `float` is the `pandas` default, I believe.
Step 5: Building a Database Schema
We can now turn those data dictionaries into simple `MariaDB` tables, by just abiding by the SQL engine’s naming conventions and restrictions.
Our schema should, at the very least, do the following:
- Contain all the data we have without loss
- Contain data that is close, but not necessarily identical, to the raw data
- Abide by semi-useful naming conventions
For instance, for our `csv`s, we might want something like this:
All names are clear, have a clear data type, and contain all data we have.
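For illustration, such a table could be created straight from `go` - the column names below follow the Takeout CSV headers plus the `file_name` field we add later, and the exact types are my own guess, so adjust to taste:

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/go-sql-driver/mysql" // MariaDB speaks the MySQL protocol
)

func main() {
	// Placeholder DSN - adjust user, password, host, and schema
	db, err := sql.Open("mysql", "user:password@tcp(bigiron.local:3306)/music")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// A guess at a songs table, based on the Takeout "Tracks" CSV headers
	// (Title, Album, Artist, Duration (ms), Rating, Play Count, Removed)
	// plus the file_name column we add ourselves in a later step.
	ddl := `CREATE TABLE IF NOT EXISTS songs (
		id INT AUTO_INCREMENT PRIMARY KEY,
		title VARCHAR(255),
		album VARCHAR(255),
		artist VARCHAR(255),
		duration_ms INT,
		rating INT,
		play_count INT,
		removed TINYINT(1),
		file_name VARCHAR(255)
	)`
	if _, err := db.Exec(ddl); err != nil {
		log.Fatal(err)
	}
}
```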
Step 6: Indexing all Metadata
Now to the fun part: Actually touching the data. You might have realized that until now (apart from some profiling), we haven’t even bothered to really read any data.
We’ll be using `Apache Beam` for it and, because I hate myself, we’ll be writing it in `go`.
`Beam` is useful in the sense that we can relatively quickly build a pipeline, using pre-built `PTransforms`, rather than having to manage separate `goroutines`, deal with `channels`, or deal with custom `IO` - in short, it would, in theory, be a fast and easy way to process the CSV files.
You might have noticed that we’ve added one field - `file_name` - in the previous section. Now, since we’ll be using `Apache Beam` in the next step, there are two schools of thought: force `Beam` to provide the `PCollection`’s file name (which is not something you can easily do, given the way their `SplittableDoFn`s work) or manipulate your input data to provide this information.

We’ll do the latter. So, `bash` to the rescue, once again:
Note the re-use of `IFS`, the Internal Field Separator, here. This is why you don’t use spaces in file names.
This script turns the raw `csv`s into `csv`s with an added `file_name` column. And we can use this data to create a pipeline.
This pipeline would read all our `Song` csv files and write them to our `Songs` table. It uses the `TextIO` `ParDo` to read each CSV, custom `ParDo`s to map the data to `struct`s that mirror our table, and writes it all out.
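To make that shape concrete, here’s a minimal sketch using the `go` SDK. The import paths, the DSN, the CSV column order, and the exact `databaseio.Write` signature are assumptions on my part, so treat it as an illustration rather than the actual pipeline:

```go
package main

import (
	"context"
	"flag"
	"strconv"
	"strings"

	"github.com/apache/beam/sdks/go/pkg/beam"
	"github.com/apache/beam/sdks/go/pkg/beam/io/databaseio"
	"github.com/apache/beam/sdks/go/pkg/beam/io/textio"
	"github.com/apache/beam/sdks/go/pkg/beam/x/beamx"
	_ "github.com/go-sql-driver/mysql" // MariaDB driver
)

// Song mirrors the songs table. Fields must be exported so Beam's
// reflection-based column mapping can see them - names need to line
// up with the table's columns (see the gotcha below).
type Song struct {
	Title      string
	Album      string
	Artist     string
	DurationMs int
	PlayCount  int
	FileName   string
}

var (
	input = flag.String("input", "./takeout/Tracks/*.csv", "glob of csv files")
	dsn   = flag.String("dsn", "user:password@tcp(bigiron.local:3306)/music", "MariaDB DSN")
)

func init() {
	beam.RegisterFunction(parseLine)
}

// parseLine turns one csv line into a Song. A real pipeline would use
// encoding/csv and handle quoting properly; this is just the gist.
func parseLine(line string, emit func(Song)) {
	f := strings.Split(line, ",")
	if len(f) < 8 || f[0] == "Title" {
		return // header or malformed row
	}
	duration, _ := strconv.Atoi(f[3])
	plays, _ := strconv.Atoi(f[5])
	emit(Song{Title: f[0], Album: f[1], Artist: f[2], DurationMs: duration, PlayCount: plays, FileName: f[7]})
}

func main() {
	flag.Parse()
	beam.Init()

	p := beam.NewPipeline()
	s := p.Root()

	lines := textio.Read(s, *input)          // one element per csv line
	songs := beam.ParDo(s, parseLine, lines) // map lines to Song structs
	databaseio.Write(s, "mysql", *dsn, "songs", []string{}, songs)

	if err := beamx.Run(context.Background(), p); err != nil {
		panic(err)
	}
}
```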
That being said, the `Beam` documentation remains rather… sparse.
func Write(s beam.Scope, driver, dsn, table string, columns string, col beam.PCollection)
Write writes the elements of the given PCollection to database, if columns left empty all table columns are used to insert into, otherwise selected
`PCollection<T>` is my favorite. `go` doesn’t have generics. At least not yet - the ETA for generics in `go` is the end of this year (`PCollection<T>` is `Scala` for “a `PCollection` of Type `T`”). Yes, yes - if you actually read the source, you’ll quickly find some interesting uses of `Reflect` to build some bastardized generics-type structure, but my point remains - `Beam`’s (and by proxy, Google’s `Dataflow`) documentation leaves a lot to be desired.
For instance, one error was ultimately caused by not exporting (which, in `go`, is done by using an uppercase character to start a `struct` field) and aliasing the `struct` fields to the table’s column names.
Don’t ever believe that this is documented anywhere. I found the mapping in their unit tests.
But, in any case, the pipeline above creates records as such:
We can simply extend this to also cover all three other CSV files, using the same logic and clean it up a bit.
Note: The code on GitHub uses a `utils.go` module, so we avoid copy-paste code.
In case this seems confusing, please take a look at an older article I wrote on the topic.
Once done, let’s run the pipeline as such:
 Indeed, I should raise a PR for it
Step 7: Indexing all non-structured data
What you will notice, however, is that we are thus far only touching the `csv` files, not the `mp3`s.
Here, once again, multiple schools of thought are at play: do you combine pipelines for structured and unstructured data, or do you keep them separate?
Now naturally, this all depends on your project and environment; generally speaking, I am a big fan of separation of duties for jobs and of relying on shared codebases - libraries, modules and the like - to implement specific functions that might share common attributes.
This provides several benefits during operations - jobs operating on separate data sources can be updated, scaled, and maintained independently (which is why the previous job should really have a flag that runs a single type of source data per instance!), while still having the chance to be customized to truly fit a specific use case.
For unstructured data, `Apache Beam` tends not to be the right tool, given that `Beam` usually operates on the splittable elements of files, e.g. lines in `csv` files or records in `avro` files. You can absolutely do it, but as I’ve mentioned above, I do not believe there to be anything wrong with structured data preparation prior to your core pipelines.
With that being said, we’ll simply send metadata to another `Beam` pipeline, by providing a list of all MP3 files and their corresponding paths:
With this approach, we can actually re-use our `Beam` pipeline, as we now use it to maintain metadata and references towards unstructured files, without actually touching them. In this case, I have written a separate pipeline for easier presentation.
When it comes to writing it, we can use bogem/id3v2, an `id3v2` library for `go`. `id3v2`, however, is not exactly a perfect standard to work with - all tags, similar to `EXIF` on images, are entirely optional and depend on somebody, well, adding them to the file - that is not always a given with CD rips or even the digital downloads you often get with vinyl purchases.
Hence, we’ll focus on some basic tags:
Which gives us this pipeline:
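The full pipeline is on GitHub; as a rough sketch of the core extraction step (the struct and function names here are my own, and error handling is minimal), the idea is to open each file with `bogem/id3v2` and pull out the basic frames:

```go
package main

import (
	"log"

	"github.com/bogem/id3v2"
)

// TrackMeta is a hypothetical struct holding the basic tags we care about.
type TrackMeta struct {
	Path   string
	Title  string
	Artist string
	Album  string
	Year   string
}

// extractTags opens one mp3 and pulls the basic id3v2 frames.
// In the Beam pipeline, this would be the body of a ParDo that
// receives file paths and emits TrackMeta records.
func extractTags(path string) (TrackMeta, error) {
	tag, err := id3v2.Open(path, id3v2.Options{Parse: true})
	if err != nil {
		return TrackMeta{Path: path}, err // broken or unsupported tag versions end up here
	}
	defer tag.Close()

	return TrackMeta{
		Path:   path,
		Title:  tag.Title(),
		Artist: tag.Artist(),
		Album:  tag.Album(),
		Year:   tag.Year(),
	}, nil
}

func main() {
	meta, err := extractTags("./takeout/Tracks/example.mp3") // hypothetical path
	if err != nil {
		log.Printf("could not parse tags: %v", err)
	}
	log.Printf("%+v", meta)
}
```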
Step 8: Running all pipelines
And naturally, we’ll need to run `extract_id3v2` against the master list of music as well, another ~120GB:
Here, unfortunately, we find ourselves in a bit of a pickle - the library we’re using does not support `id3v2` 2.3+. Rather than making this article even longer, I’ve simply piped out the broken records separately, knowing that we need to ignore them. About 2,100 files are affected.
Step 9: Matching
Now that we have all data in a `sql` database, we can start matching.
Note how we didn’t normalize our fields - matching on them directly, for instance, gives us matches, but not exact ones:
Again, another philosophical point - we loaded raw data into a format we can query and analyze. We did not clean it up, so as not to alter the raw data, short of changing the encoding. There are points to be made for replicating this dataset now and providing a cleaned or processed dataset, but I’ll be skipping that in the interest of time. We’ll do it live using `SQL`.
So, let’s first see if there are `csv` files that point to `mp3`s that already exist.
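A query along these lines can answer that - shown here wrapped in `go` for consistency with the rest of the code, though plain SQL works just as well. The table and column names are the ones I assumed earlier, and the normalization happens on the fly with `LOWER`/`TRIM` rather than by touching the raw data:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	db, err := sql.Open("mysql", "user:password@tcp(bigiron.local:3306)/music") // placeholder DSN
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Join the csv-derived songs table against the extracted id3v2 metadata
	// of the files we actually have, normalizing case and whitespace live.
	// Table and column names are assumptions for this sketch.
	query := `
		SELECT s.artist, s.title
		FROM songs s
		JOIN id3v2_meta m
		  ON LOWER(TRIM(s.artist)) = LOWER(TRIM(m.artist))
		 AND LOWER(TRIM(s.title))  = LOWER(TRIM(m.title))`

	rows, err := db.Query(query)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var artist, title string
		if err := rows.Scan(&artist, &title); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%s - %s already exists as an mp3\n", artist, title)
	}
}
```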
And indeed, we get matches!
So we already know to ignore those records.
Let’s try to find records that have been added through the export, i.e. don’t exist in our master list yet:
And I am glad to report that the answer to this query is: All records exist already (for both “Dev” and the whole data set). The export, being an old backup, is a subset of the music stored in the server, even though the files are named differently. :)
Step 10: Orchestration
Assuming all previous steps ran on the test data, let’s orchestrate this pipeline, with all its components, using Apache Airflow. Airflow uses `Python` to build out pipelines that orchestrate steps; if you care for details, take a look at the GitHub repository.
Or as a graph:
I will happily admit that my `Airflow` knowledge has been outdated since ca. 2017, so please keep that in mind while reading this `DAG`. The pipeline is rather simple and just runs all steps as `Bash` steps sequentially (except for the last two) - it’s essentially a glorified shell script, but you can, of course, make it “production ready” by spending more time on it. Airflow is also available on Google Cloud as `Cloud Composer`, which is pretty neat.
Now we have all music, both Takeout and existing data, organized in a nice, performant SQL environment. Nothing is stopping us from running more analytics, visualizing the data, or using it as a baseline for a homegrown `Discogs`-type alternative. We should probably also de-duplicate the data.
For instance, here are the top mentions of bands across tables in the export for me:
Which seems about right.
Naturally, we are also somewhat ignoring non-`mp3` files, such as `m4a`s, but that is a story for another time. We would have to tackle pipelines for each type, which I won’t do for this article.
Well this was a giant waste of time again - I thoroughly enjoyed it.
What we essentially did was play pretend and use a fairly standard Data Engineering toolset to tackle some real-world, bad data.
We’ve used tools many Data Engineers will be familiar with - `csv` files - but also some maybe more obscure things, such as `go` and working with unstructured files. I’ve semi-seriously made a point for using `go` instead of `Rust` for prototyping, which is just enforcing the old rule of “Use whatever tool you are most comfortable with” - and I’ve been on a streak with `go` these past months, for sure.
When it comes to `Beam`, however, please use `Java`, as these SDKs are much, much more mature. I’ve done both, and `Java` has the edge when it comes to performance and features, but I find `Python` to be more fun to write (everything winds up being a …).
We’ve also shown that you do not need a 100-node Hadoop cluster or Cloud job to waltz through ~240GB in total. This entire thing ran on my home server, the humble `bigiron.local`, which rocks a Ryzen 2200G (4 cores, 4 threads), 32GB of non-ECC memory, and 6 HDDs with `LUKS`, behind a `Mikrotik` switch and a `Mikrotik` RB4011 Ethernet router. In other words, a relatively modest machine, reached via Gigabit Ethernet. The longest-running task was by far copying data between my spinning drives!
I’ve also learned that my discipline regarding backups is good - nothing in this export wasn’t already on the server. You might find different results, if you actually care to run this.
But maybe it inspires some folks to look into the field of Data Engineering - which, sadly, is often seen as a lesser Software Engineering discipline - and for those of y’all who are already in the field, maybe you learned something new.
All development and benchmarking was done under GNU/Linux [PopOS! 20.04 on Kernel 5.8] with 12 Intel i7-9750H vCores @ 4.5Ghz and 16GB RAM on a 2019 System76 Gazelle Laptop, using bigiron.local as endpoint