How I Built My Year in Review

September 28, 2020

I released my second “Year in Review” post in January. I’ve been a Quantified Self-er for a long time and these reviews have been a nice way to make use of all of the data I create and collect about myself.

I released the data and some scripts that I used on GitHub, but I wanted to write a bit about how I pulled the data.


Some Brief Context

I had been wanting to do a recap like this since 2012 but didn’t get my act together until the end of 2018.

I have long been a fan of the Feltron Reports and wanted to replicate that design. Unfortunately, I would always get too bogged down either perfecting the code or being too ambitious with the design, and I ended up never shipping anything.

So, at the end of 2018, I decided to scale back to ensure I’d get something out… which resulted in my 2018 Year in Review! Who knew that not overcomplicating things would help me ship things 😁


The Data

I created a repo for how I pulled my data here: github.com/bdotdub/2019

A few caveats to get out of the way 😛 As mentioned in the README, it is written in Python, which is decidedly not my most comfortable language. I tried to use it as an opportunity to learn how to build a Python repo from scratch. I also wrote the code for utility, not for reusability, so it can be hacky at times 😁

Ok, now that we’ve got that out of the way, onto the scripts!

Photos

I reported the number of photos I took over the year, along with some light stats around the most photos taken in a day and various streaks.

Since I back my photos up to Google Photos, I simply had to use the Google Photos API. The script is pretty straightforward: it loops through pages of photos until it hits a photo from 2018. There are likely some time zone issues but 🤷‍♂️ .

Once I have all of my photos, I dump them out into a JSON file so that D3, the JavaScript library I use for visualization, can read it.
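For a rough idea of what that loop looks like, here is a minimal sketch (not the actual script in the repo), assuming you already have an OAuth access token with the photoslibrary.readonly scope:

# Minimal sketch of the paging loop, not the repo's actual script.
# Assumes ACCESS_TOKEN holds an OAuth token with the photoslibrary.readonly scope.
import json
import requests

ACCESS_TOKEN = "..."  # obtained separately via the usual Google OAuth flow
API_URL = "https://photoslibrary.googleapis.com/v1/mediaItems"

photos = []
page_token = None
done = False

while not done:
    params = {"pageSize": 100}
    if page_token:
        params["pageToken"] = page_token
    resp = requests.get(API_URL, params=params,
                        headers={"Authorization": f"Bearer {ACCESS_TOKEN}"})
    resp.raise_for_status()
    body = resp.json()

    # Pages are assumed to come back newest-first, which is what lets the
    # loop stop once it reaches 2018.
    for item in body.get("mediaItems", []):
        # creationTime looks like "2019-01-01T01:46:41Z"; ISO timestamps
        # compare correctly as plain strings.
        if item["mediaMetadata"]["creationTime"] < "2019-01-01":
            done = True  # we've paged back into 2018, so stop
            break
        photos.append(item)

    page_token = body.get("nextPageToken")
    if not page_token:
        done = True

with open("photos.json", "w") as f:
    json.dump(photos, f)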

Then, to get the days with the most photos, I could write another script, or simply use some Unix command-line tools to get the data (since Everything is a File):

# This could all be one line but I'm splitting it up for clarity's sake.
# You can run the below from your terminal to see it in action.
curl "https://bwong.net/data/2019-year-in-review/photos.json" 2>/dev/null | # Get the file. 2>/dev/null hides the stderr output
  grep creationTime | # Find all lines with "creationTime"
  cut -d":" -f2 | # Split each line by ":" and take the second field. Each line looks like this: "creationTime": "2019-01-01T01:46:41Z",
  sed 's#T.*##' | # Remove the stuff after "T" (the previous step outputs something like "2019-01-01T01")
  sort -n | # Sort all of the dates
  uniq -c | # Count the occurrences of each unique date
  sort -rn | # Sort the lines again, but this time numerically and in descending order
  head -20 # Take the top 20!

Getting the proportion of photos by camera is similar:

$ curl "https://bwong.net/data/2019-year-in-review/photos.json" 2>/dev/null | grep cameraMake | sort | uniq -c | sort -rn
14938         "cameraMake": "Apple",
 1890         "cameraMake": "SONY",
 1276         "cameraMake": "UNKNOWN",
   31         "cameraMake": "Google",
   30         "cameraMake": "Spectre",

Strava

Honestly, this was pretty easy. I just did a bulk data export and then made a couple of quick edits to the file in Vim to clean the data.

Then, I used Excel to export the CSV that I use in the post.
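If you’d rather script that cleanup instead of doing it by hand, here’s roughly what it could look like in Python. The column names below are my guess at the bulk export’s activities.csv header, so check your own export:

# Rough equivalent of the manual Vim/Excel cleanup, not what I actually ran.
# The column names below are assumptions about the export's activities.csv.
import csv

with open("activities.csv", newline="") as src, \
        open("activities-2019.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    fields = ["Activity Date", "Activity Type", "Distance"]
    writer = csv.DictWriter(dst, fieldnames=fields)
    writer.writeheader()
    for row in reader:
        # Keep only 2019 activities; the export's dates include the year as text.
        if "2019" in row["Activity Date"]:
            writer.writerow({field: row[field] for field in fields})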

YouTube

This was the most fun. In my post, I wanted to see not only how many YouTube videos I watched, but also how many hours of video I watched.

Google Takeout, Google’s product that allows you to export your Google-related data, provides the first data set quite easily. You simply go to the site, check off YouTube, and you’re done! As with my photo data, a quick script or a bit of jq makes it easy to count the videos by channel.
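As a rough sketch of that counting (assuming your Takeout export is the JSON flavor, where each entry has a "time" field and a "subtitles" list naming the channel):

# Count 2019 watches by channel from Takeout's watch-history.json.
# The field names ("time", "subtitles", "name") are my reading of the Takeout
# format; adjust them to whatever your export actually contains.
import json
from collections import Counter

with open("watch-history.json") as f:
    history = json.load(f)

channels = Counter()
for entry in history:
    if not entry.get("time", "").startswith("2019"):
        continue
    subtitles = entry.get("subtitles") or []
    channel = subtitles[0]["name"] if subtitles else "Unknown"
    channels[channel] += 1

for channel, count in channels.most_common(20):
    print(f"{count:5d}  {channel}")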

Unfortunately, this data does not include how long I spent watching each video. I pored over API docs and random forum posts and ultimately could not find an automated way to pull this data.

So, I went old school: HTML scraping. I noticed that YouTube’s “Watch History” page has a little progress bar showing how far you’ve made it into each video. I opened up that page, scrolled back until I had all of the videos I watched in 2019, saved the HTML, and used the trusty BeautifulSoup to parse out the progress data.

The progress is simply a percentage of the video watched, so I multiplied it by the duration I got from the Google Takeout data. The HTML scraping is here: progress_parser.py

I will note that this methodology is not perfect. Some videos I skip close to the end of (sports event replays) and some I watched multiple times (tutorials), so there is a loss of precision there. The progress percentages were also whole numbers, so I lost some precision there as well.
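For a flavor of the scraping, here’s an illustrative sketch. It is not the repo’s progress_parser.py, and the selector and style format are guesses at the saved page’s markup:

# Illustrative sketch only; YouTube's markup changes over time, so the selector
# and style format here are assumptions rather than the repo's exact code.
import re
from bs4 import BeautifulSoup

with open("watch-history.html") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

percentages = []
# Assumption: each watched video renders a progress bar whose inline style
# carries the percentage watched, e.g. style="width: 47%;".
for bar in soup.select("div#progress"):
    match = re.search(r"width:\s*(\d+)%", bar.get("style", ""))
    if match:
        percentages.append(int(match.group(1)))

# Each percentage then gets multiplied by that video's duration (from the
# Takeout data) to estimate seconds watched.
def seconds_watched(percent, duration_seconds):
    return (percent / 100) * duration_seconds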

The Visualizations

I spent (too much) time learning how to build interactive visualizations for my data. I’ve always admired the way the New York Times and FiveThirtyEight tell stories by visualizing data in illustrative ways.

D3 has been the de facto tool for visualization on the web. The creator of D3, Mike Bostock, was one of the OG visualization engineers at the NYTimes.

The two charts in the post are a simple bar chart and a line chart. I won’t go into how I built them because there are plenty of better resources, like Learn D3, or simply poking around Observable. For me, learning by doing was the most helpful.

Where it got fun was using the scrolling of the page to change and update the visuals as you browsed. I decided not to open source the visualization because it was very specific to that page. But mostly it is because it’s hacky af and I don’t want anyone to ever see it 😁

Instead, here is a post that goes in-depth on how one would implement something similar: https://towardsdatascience.com/how-i-created-an-interactive-scrolling-visualisation-with-d3-js-and-how-you-can-too-e116372e2c73

2018 Interlude: iMessage / SMS

This post has mostly focused on my 2019 post, but I did want to talk about the iMessage/SMS data from 2018.

On every Mac that has Messages set up, there is a database of your messages under ~/Library/Messages/chat.db. If you have iCloud sync on, you’ll have all of the text messages you’ve ever sent and received.

I used that database to pull out my most used emoji (😂) and number of texts sent (1,354). Here are the commands I used:

# First, get a raw log of the timestamp and text of each message.
sqlite3 chat.db "select
    *
from
    (select
        datetime(date/1000000000 + strftime('%s','2001-01-01'), 'unixepoch') as date,
        text
    from
        message
    where
        handle_id = 0) as a
where
    a.date > '2018-01-01'
    and a.date < '2018-12-31'
order by
    date" > txts

# Now, count occurrences of each non-alphanumeric character to surface the
# emoji. (This snippet is Ruby.)
`cat txts`.split("").reduce({}) { |o, c|
    o[c] ||= 0
    o[c] += 1
    o
}.to_a.sort_by(&:last).reject { |tuple|
    # Drop plain letters and digits so emoji and other symbols bubble up
    tuple.first =~ /[a-z0-9]/i
}
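The sent-message count came out of the same database. Here’s one way a number like that could be pulled in Python; is_from_me is the column chat.db uses to flag outgoing messages, and the date math mirrors the query above, though this isn’t necessarily the exact query behind the 1,354:

# One way to count outgoing texts for 2018 from chat.db. is_from_me flags
# messages I sent; the date math matches the SQL above (nanoseconds since
# 2001-01-01). Not necessarily the exact query behind the published number.
import sqlite3

conn = sqlite3.connect("chat.db")
(count,) = conn.execute(
    """
    select count(*)
    from message
    where is_from_me = 1
      and datetime(date/1000000000 + strftime('%s','2001-01-01'), 'unixepoch')
          between '2018-01-01' and '2018-12-31'
    """
).fetchone()
print(count)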


Next Up: 2020

This year has obviously been… different, to say the least. We’ve been very fortunate to have a comfortable apartment, our health (and our families’ health), and employment.

It’ll be interesting to dig into this year’s data to find the most striking differences between this year and the previous ones. There are a few things that are obvious to me (exercise, weight, etc.) but probably many more that are not. I also got a WHOOP band, so I should have a bunch more data to play with as well.


Benny Wong is a technologist born, raised, and living in NYC.