Unfurl Plugin and "Site Characteristics" Artifact Added in Hindsight

I'm happy to announce there is a new Hindsight release available! 2021.04.26 has many small improvements and fixes, including adding support for Chrome 88 - 90, but the main new features are an Unfurl plugin and parsing of the Site Characteristics Database!

Unfurl Plugin

I'm excited that this new Hindsight version has an integration with Unfurl! Unfurl takes a URL and expands ("unfurls") it into a directed graph, and is useful for exploring data encoded in URLs or other text values. Unfurl typically displays all this in an interactive graph visualization, but that doesn't fit well into Hindsight's output. Instead, this new Unfurl plugin stores the "text tree" version of the output (as seen in the Unfurl CLI tool). At this time, the Unfurl plugin only runs on Local Storage records. I chose these for a few reasons:

Timestamp Detection + Parsing

Local Storage records lack explicit timestamps (they're just a collection of key/value pairs associated with an origin). Unfurl can often translate a value into a human-readable timestamp, potentially adding some hints as to timing on these records. Hindsight had a "Generic Timestamp Converter" plugin that did this previously, but it was rather limited; Unfurl does a much better job and covers a wider variety of timestamps. Example:

origin: https://www.reddit.com
key: push-token-last-refresh-ms
value: 1615493428164
Local Storage key/value pair for reddit.com
2021-03-11 20:10:28.164 (Converted as Epoch milliseconds) [Unfurl]
Unfurl parsing a timestamp from a value in Local Storage

When Unfurl's output is rather simple (like just a timestamp conversion), the plugin reformats the "tree" into a single line summary that works better in Hindsight.
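
The conversion in the example above is easy to reproduce by hand. The following is only a minimal sketch of the epoch-milliseconds case, not Unfurl's actual detection logic; the helper name epoch_ms_to_utc is made up for illustration, and the result is assumed to be in UTC.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def epoch_ms_to_utc(value):
    """Interpret value as milliseconds since the Unix epoch (hypothetical helper)."""
    # Integer arithmetic via timedelta avoids float rounding on the milliseconds.
    moment = EPOCH + timedelta(milliseconds=int(value))
    # %f yields six digits (microseconds); drop the last three to keep milliseconds.
    return moment.strftime("%Y-%m-%d %H:%M:%S.%f")[:-3]


print(epoch_ms_to_utc("1615493428164"))  # 2021-03-11 20:10:28.164
```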

Decoding Values

Another reason is that Local Storage values are often encoded. Unfurl's chaining of multiple simple transforms can sometimes bring clarity to an obscured value. For example:

origin: http://www.metacritic.com
key: __ansync3rdp_criteo
value: eyJiSWQiOiJjcml0ZW8iLCJ1Q29kZSI6bnVsbCwidHMiOjE1MzExODAxNDYwOTh9
Local Storage key/value pair for metacritic.com

The value from above is parsed by Unfurl (using base64, JSON, and timestamp conversions), and the "text tree" output is saved in the "Interpretation" column (in the same way other Hindsight plugins save their results):

[1] eyJiSWQiOiJjcml0ZW8iLCJ1Q29kZSI6bnVsbCwidHMiOjE1MzExODAxNDYwOTh9
 └─(b64)─[2] {"bId":"criteo","uCode":null,"ts":1531180146098}
    ├─(JSON)─[3] bId: criteo
    ├─(JSON)─[4] uCode: None
    └─(JSON)─[5] ts: 1531180146098
       └─(🕓)─[6] 2018-07-09 23:49:06.098 
Unfurl parsing an encoded Local Storage value
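
To make the chain of transforms concrete, here is a rough Python sketch that walks the same value through base64, JSON, and a timestamp guess. It is not how Unfurl is implemented; decode_value is a hypothetical helper, and the "13-digit integer means epoch milliseconds" rule is a simplifying assumption for this example only.

```python
import base64
import json
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def decode_value(raw):
    """base64 -> JSON object -> timestamp hints for values that look like epoch ms."""
    decoded = json.loads(base64.b64decode(raw))
    hints = {}
    for key, value in decoded.items():
        # Assumption: a 13-digit integer is probably milliseconds since 1970.
        is_int = isinstance(value, int) and not isinstance(value, bool)
        if is_int and len(str(value)) == 13:
            moment = EPOCH + timedelta(milliseconds=value)
            hints[key] = moment.strftime("%Y-%m-%d %H:%M:%S.%f")[:-3]
    return decoded, hints


decoded, hints = decode_value(
    "eyJiSWQiOiJjcml0ZW8iLCJ1Q29kZSI6bnVsbCwidHMiOjE1MzExODAxNDYwOTh9"
)
print(decoded)  # {'bId': 'criteo', 'uCode': None, 'ts': 1531180146098}
print(hints)    # {'ts': '2018-07-09 23:49:06.098'}
```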

These are just a few examples of how Unfurl can be helpful on Local Storage values. All the parsers from the web version of Unfurl are included in the Hindsight plugin, so things like UUIDs, zlib-compressed strings, Twitter Snowflakes, and a whole lot more can be parsed. If this plugin works out well, I'll evaluate whether there are other places in Hindsight where an Unfurl integration would make sense.

Site Characteristics Database

The other new feature is added parsing of the "Site Characteristics Database". It is a part of Chrome that tracks a few different behaviors on sites, such as if the site changes the favicon or page title in the background. These behaviors aren't that interesting in and of themselves, but they can provide interesting context.

Behind the scenes, the "Site Characteristics Database" is stored in a LevelDB as a collection of key/value pairs. The key for each record is the MD5 hash of the origin and the record's value is a protobuf. Luckily, since Chromium is open source, we can find the .proto file that corresponds to that protobuf, so decoding it is easier:

// Copyright 2019 The Chromium Authors. All rights reserved.
// Use of this source code is governed by a BSD-style license that can be
// found in the LICENSE file.

syntax = "proto2";

option optimize_for = LITE_RUNTIME;

// Contains the information that we want to track about a given site feature.
// Next Id: 3
message SiteDataFeatureProto {
  // The cumulative observation time for this feature in seconds, set to 0 once
  // this feature has been observed.
  optional int64 observation_duration = 1;
  // The time at which this feature has been used (set to 0 if it hasn't been
  // used), in seconds since epoch.
  optional int64 use_timestamp = 2;
}

// Contains decaying average performance measurement estimates.
// Next Id: 4
message SiteDataPerformanceMeasurement {
  // A decaying average of the CPU usage measurements. Units: microseconds.
  optional float avg_cpu_usage_us = 1;
  // A decaying average of the process footprint measurements. Units: kilobytes.
  optional float avg_footprint_kb = 2;
  // A decaying average of the duration from navigation commit to "loaded".
  // Units: microseconds.
  optional float avg_load_duration_us = 3;
};

// Defines the data that we want to track about a given site.
// Next Id: 7
message SiteDataProto {
  // The last time this site has been in the loaded state, in seconds since
  // epoch.
  optional uint32 last_loaded = 1;

  // List of features that we're tracking.
  optional SiteDataFeatureProto updates_favicon_in_background = 2;
  optional SiteDataFeatureProto updates_title_in_background = 3;
  optional SiteDataFeatureProto uses_audio_in_background = 4;
  optional SiteDataFeatureProto deprecated_uses_notifications_in_background = 5;

  // Load time performance measurement estimates. This maintains a decaying
  // average of the resource usage of a page until shortly after it becomes
  // idle.
  optional SiteDataPerformanceMeasurement load_time_estimates = 6;
}
https://source.chromium.org/chromium/chromium/src/+/master:components/performance_manager/persistence/site_data/site_data.proto
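
To show what decoding a record value against this schema involves, here is a minimal pure-Python sketch of a protobuf wire-format reader that maps field numbers to the names in the .proto above. It is an illustration only and not necessarily how Hindsight does it; the function names are hypothetical, negative int64 values are not sign-converted, and only the wire types this schema needs are handled.

```python
import struct

FEATURES = {
    2: "updates_favicon_in_background",
    3: "updates_title_in_background",
    4: "uses_audio_in_background",
    5: "deprecated_uses_notifications_in_background",
}
MEASUREMENTS = {1: "avg_cpu_usage_us", 2: "avg_footprint_kb", 3: "avg_load_duration_us"}


def read_varint(buf, pos):
    """Decode one base-128 varint at pos; return (value, next_pos)."""
    value, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte < 0x80:  # high bit clear marks the last byte of the varint
            return value, pos
        shift += 7


def parse_fields(buf):
    """Map field number -> raw value for one serialized message."""
    fields, pos = {}, 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:    # varint: the int64/uint32 fields
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:  # length-delimited: the nested messages
            size, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + size], pos + size
        elif wire_type == 5:  # 32-bit little-endian: the float fields
            value, pos = struct.unpack_from("<f", buf, pos)[0], pos + 4
        else:
            raise ValueError(f"unhandled wire type {wire_type}")
        fields[number] = value
    return fields


def parse_site_data(buf):
    """Decode a SiteDataProto value into a plain dict (absent fields become None)."""
    top = parse_fields(buf)
    record = {"last_loaded": top.get(1)}
    for number, name in FEATURES.items():
        if number in top:
            feature = parse_fields(top[number])
            record[name] = {
                "observation_duration": feature.get(1),
                "use_timestamp": feature.get(2),
            }
    if 6 in top:
        estimates = parse_fields(top[6])
        record["load_time_estimates"] = {
            name: estimates.get(number) for number, name in MEASUREMENTS.items()
        }
    return record
```

In practice, compiling the .proto with protoc and using the generated Python class is the more usual route; the hand-rolled reader above just avoids that dependency for the sake of a self-contained example.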

To process these records, Hindsight first calculates the MD5 hashes of every origin seen in other artifacts it has already parsed, then compares each Site Characteristic key to them. If a match is found, Hindsight uses that origin in the "URL" field for the record; if not, Hindsight shows something like "MD5 of origin: 99cd2175108d157588c04758296d1cfc". For the "Value" field, Hindsight parses the site_data protobuf and stores the result (it looks similar to JSON). To order these records by time, Hindsight uses the last_loaded value from the protobuf.

Example "Site Characteristics Database" record in Hindsight XLSX output
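
The origin-matching step described above can be sketched roughly as follows. This is a simplified illustration rather than Hindsight's actual code: label_keys and md5_hex are hypothetical names, the keys are assumed to be lowercase hex strings, and the exact origin string that Chrome hashes (scheme, port, trailing slash) is an assumption you would need to verify.

```python
import hashlib


def md5_hex(origin):
    """Hex MD5 of an origin string, e.g. 'https://www.reddit.com'."""
    return hashlib.md5(origin.encode("utf-8")).hexdigest()


def label_keys(site_keys, known_origins):
    """Return {key: origin} where a match exists, else a placeholder label."""
    # Precompute hash -> origin once, so each key is a single dict lookup.
    lookup = {md5_hex(origin): origin for origin in known_origins}
    return {key: lookup.get(key, f"MD5 of origin: {key}") for key in site_keys}


# Origins gathered from other, already-parsed artifacts (history, cookies, ...).
known = ["https://www.reddit.com", "http://www.metacritic.com"]
keys = [md5_hex("https://www.reddit.com"), "99cd2175108d157588c04758296d1cfc"]

for key, label in label_keys(keys, known).items():
    print(key, "->", label)
```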

Deleted Records

Since the datastore is LevelDB, we can recover deleted data from it! For deleted records, we can only get the key (the origin MD5), not the value protobuf, so we lose some information, including any explicit timestamps. However, this recovered data can still be useful.

One potential use case for this is showing that a user visited a particular site. Looking through my own browser history, I have over 1200 records where Hindsight couldn't find the Site Characteristic origin by comparing its key to the rest of my browsing history. This means that these origins don't appear anywhere else in my Chrome history, yet there is still some (small) indication I visited them in these Site Characteristic records. If you have a site of particular importance to a case, you could calculate the MD5 of the origin and then search these records for it. Since the timestamp information in deleted records is missing, Hindsight places these records at the beginning of the timeline (at 1970-01-01), but uses a filter to hide them in the Excel output by default to avoid cluttering it.
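
Searching for a single site of interest might look something like the sketch below. Because the exact spelling of the origin that gets hashed is not something this post confirms, the sketch tries a few plausible variants (http/https, with and without a trailing slash); those variants and the function names are assumptions for illustration.

```python
import hashlib


def candidate_hashes(host):
    """MD5 hex digests for a few plausible origin spellings of host."""
    candidates = {}
    for scheme in ("https", "http"):
        for suffix in ("", "/"):
            origin = f"{scheme}://{host}{suffix}"
            candidates[hashlib.md5(origin.encode("utf-8")).hexdigest()] = origin
    return candidates


def find_origin(host, recovered_keys):
    """Return the origin spellings whose MD5 appears among the recovered keys."""
    candidates = candidate_hashes(host)
    seen = {key.lower() for key in recovered_keys}
    return sorted(candidates[key] for key in candidates.keys() & seen)
```

An empty result only means none of the tried spellings matched; it does not show the site was never visited.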

Future Research

Things to explore around "Site Characteristics Database" records in the future:

  • What effect does clearing different types of browser data have on Site Characteristics Database records? If they persist despite history being cleared, they could be even more useful in showing a particular site was visited.
  • The various observation_duration timestamps: they are relative timestamps (count of seconds), but could potentially still be useful.
  • More precise meaning of the last_loaded timestamp: In some quick testing, it looks to be updated when the page was closed: page timestamp + page visit_duration ~= last_loaded timestamp. This is interesting, as not all pages have a visit_duration value set, and it could potentially show interesting things about user behavior.

Get Hindsight

You can get Hindsight, view the code, and see the full change log on GitHub. Both the command line and web UI versions of this release are available as:

  • compiled exes attached to the GitHub release or in the dist/ folder
  • .py versions, available via pip install pyhindsight or by downloading/cloning the GitHub repo.