Unfurl Plugin and "Site Characteristics" Artifact Added in Hindsight
I'm happy to announce there is a new Hindsight release available! 2021.04.26 has many small improvements and fixes, including added support for Chrome 88 - 90, but the main new features are an Unfurl plugin and parsing of the Site Characteristics Database!
Unfurl Plugin
I'm excited that this new Hindsight version has an integration with Unfurl! Unfurl takes a URL and expands ("unfurls") it into a directed graph, and is useful for exploring data encoded in URLs or other text values. Unfurl typically displays all this in an interactive graph visualization, but that doesn't fit well into Hindsight's output. Instead, this new Unfurl plugin stores the "text tree" version of the output (as seen in the Unfurl CLI tool). At this time, the Unfurl plugin only runs on Local Storage records. I chose these for a few reasons:
Timestamp Detection + Parsing
Local Storage records lack explicit timestamps (they're just a collection of key/value pairs associated with an origin). Unfurl can often translate a value into a human-readable timestamp, potentially adding some hints as to timing on these records. Hindsight had a "Generic Timestamp Converter" plugin that did this previously, but it was rather limited and Unfurl does a much better job and covers a wider variety of timestamps. Example:
When Unfurl's output is rather simple (like just a timestamp conversion), the plugin reformats the "tree" into a single line summary that works better in Hindsight.
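To make the idea of timestamp detection concrete, here is a minimal sketch of how a bare integer might be tested against a few common epoch formats. This is not Unfurl's actual logic; the function name `guess_timestamp`, the list of formats, and the 2005 to 2035 plausibility window are all assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone

UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
WEBKIT_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)

# Assumed plausibility window: only accept results inside it, to reduce
# false positives from arbitrary integers.
EARLIEST = datetime(2005, 1, 1, tzinfo=timezone.utc)
LATEST = datetime(2035, 1, 1, tzinfo=timezone.utc)


def guess_timestamp(value):
    """Return (format_name, datetime) for the first plausible reading, else None."""
    try:
        number = int(str(value).strip())
    except ValueError:
        return None

    candidates = (
        ("Unix seconds", UNIX_EPOCH, timedelta(seconds=1)),
        ("Unix milliseconds", UNIX_EPOCH, timedelta(milliseconds=1)),
        ("Unix microseconds", UNIX_EPOCH, timedelta(microseconds=1)),
        ("WebKit microseconds", WEBKIT_EPOCH, timedelta(microseconds=1)),
    )
    for name, epoch, unit in candidates:
        try:
            moment = epoch + number * unit
        except OverflowError:
            # The value is far too large for this unit; try the next one.
            continue
        if EARLIEST <= moment <= LATEST:
            return name, moment
    return None
```

With this sketch, `guess_timestamp("1619395200")` reads the value as Unix seconds (2021-04-26 00:00:00 UTC), while a small number like `"42"` or a non-numeric string returns `None`.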
Decoding Values
Another reason is that Local Storage values are often encoded. Unfurl's chaining of multiple simple transforms can sometimes bring clarity to an obscured value. For example:
The value from above is parsed by Unfurl (using base64, JSON, and timestamp conversions), and the "text tree" output is saved in the "Interpretation" column (in the same way other Hindsight plugins save their results):
These are just a few examples of how Unfurl can be helpful on Local Storage values. All the parsers from the web version of Unfurl are included in the Hindsight plugin, so things like UUIDs, zlib-compressed strings, Twitter Snowflakes, and a whole lot more can be parsed. If this plugin works out well, I'll evaluate whether there are other places in Hindsight where an Unfurl integration would make sense.
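The chaining of simple transforms described above (base64, then JSON) can be sketched in a few lines of standard-library Python. This is only an illustration of the idea, not how Unfurl is implemented; the function name `decode_layers` and the shape of its return value are hypothetical.

```python
import base64
import binascii
import json


def decode_layers(value):
    """Try base64 first, then JSON; return a list of (transform, result) steps."""
    steps = []
    text = value

    try:
        # validate=True rejects strings containing non-base64 characters.
        decoded = base64.b64decode(text, validate=True).decode("utf-8")
        steps.append(("base64", decoded))
        text = decoded
    except (binascii.Error, UnicodeDecodeError, ValueError):
        # Not base64 (or not text once decoded): keep the original string.
        pass

    try:
        parsed = json.loads(text)
    except ValueError:
        return steps
    # Only report structured JSON; bare numbers or strings add little.
    if isinstance(parsed, (dict, list)):
        steps.append(("json", parsed))
    return steps
```

Each step records which transform applied and what it produced, which is roughly the information a "text tree" conveys. Any integers found inside the parsed JSON could then be checked for plausible timestamps as a further step.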
Site Characteristics Database
The other new feature is added parsing of the "Site Characteristics Database". It is a part of Chrome that tracks a few different behaviors on sites, such as if the site changes the favicon or page title in the background. These behaviors aren't that interesting in and of themselves, but they can provide interesting context.
Behind the scenes, the "Site Characteristics Database" is stored in a LevelDB as a collection of key/value pairs. The key for each record is the MD5 hash of the origin and the record's value is a protobuf. Luckily, since Chromium is open source, we can find the .proto file that corresponds to that protobuf, so decoding it is easier:
To process these records, Hindsight first calculates the MD5 hashes of every origin seen in other artifacts it has already parsed, then compares each Site Characteristics key to them. If a match is found, Hindsight uses that origin in the "URL" field for the record; if not, Hindsight shows something like "MD5 of origin: 99cd2175108d157588c04758296d1cfc". For the "Value" field, Hindsight parses the site_data protobuf and stores the result (it looks similar to JSON). To order these records by time, Hindsight uses the last_loaded value from the protobuf.
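The matching step described above can be sketched as a simple lookup table. This is an illustration, not Hindsight's actual code: the function names are hypothetical, and the exact string that gets hashed (full origin with scheme, or host only) is an assumption here that should be checked against the Chromium source.

```python
import hashlib


def build_origin_lookup(origins):
    """Map the hex MD5 of each known origin string back to that origin.

    Assumption: the origin strings are passed in exactly the form that was hashed.
    """
    return {hashlib.md5(o.encode("utf-8")).hexdigest(): o for o in origins}


def resolve_key(md5_key, lookup):
    """Return the matching origin, or a placeholder label when none is known."""
    return lookup.get(md5_key.lower(), f"MD5 of origin: {md5_key}")
```

Because MD5 cannot be reversed, the lookup can only ever resolve origins that were already seen elsewhere, which is why unmatched keys fall back to the placeholder label.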
Deleted Records
Since the datastore is LevelDB, we can recover deleted data from it! For deleted records, we can only get the key (the origin MD5), not the value protobuf, so we lose some information, including any explicit timestamps. However, this recovered data can still be useful.
One potential use case for this is showing that a user visited a particular site. Looking through my own browser history, I have over 1200 records where Hindsight couldn't find the Site Characteristic origin by comparing its key to the rest of my browsing history. This means that these origins don't appear anywhere else in my Chrome history, yet there is still some (small) indication I visited them in these Site Characteristic records. If you have a site of particular importance to a case, you could calculate the MD5 of the origin and then search these records for it. Since the timestamp information in deleted records is missing, Hindsight places these records at the beginning of the timeline (at 1970-01-01), but uses a filter to hide them in the Excel output by default to avoid cluttering it.
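Searching for a site of interest, as suggested above, might look like the following sketch. Since the exact form of the origin string that Chrome hashes is not spelled out here, the code tries a few plausible spellings (host only, scheme plus host, with and without a trailing slash); those candidate forms and the function names are assumptions for illustration.

```python
import hashlib
from urllib.parse import urlsplit


def candidate_hashes(url):
    """Hex MD5 digests for a few plausible 'origin' spellings of one URL."""
    parts = urlsplit(url)
    if not parts.scheme or not parts.hostname:
        raise ValueError("expected a full URL such as https://example.com/")
    # Assumed candidate forms; the form Chrome really hashes should be verified.
    forms = {
        parts.hostname,
        f"{parts.scheme}://{parts.netloc}",
        f"{parts.scheme}://{parts.netloc}/",
    }
    return {hashlib.md5(f.encode("utf-8")).hexdigest(): f for f in forms}


def find_site(url, record_keys):
    """Return {matching key: origin form that produced it} for the given keys."""
    candidates = candidate_hashes(url)
    return {k: candidates[k.lower()] for k in record_keys if k.lower() in candidates}
```

Passing in the keys of both live and recovered (deleted) records gives a quick yes/no on whether any trace of the site exists, and which spelling of the origin produced the hit.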
Future Research
Things to explore around "Site Characteristics Database" records in the future:
- What effect does clearing different types of browser data have on Site Characteristics Database records? If they persist despite history being cleared, they could be even more useful in showing a particular site was visited.
- The various observation_duration timestamps: they are relative timestamps (count of seconds), but could potentially still be useful.
- More precise meaning of the last_loaded timestamp: In some quick testing, it looks to be updated when the page was closed: page timestamp + page visit_duration ~= last_loaded timestamp. This is interesting, as not all pages have a visit_duration value set, and it could potentially show interesting things about user behavior.
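The relationship noted in the last item (page timestamp + visit_duration ~= last_loaded) could be tested across many records with a check like the one below. It assumes all three values have already been converted to Python datetime and timedelta objects; the function name and the two-second tolerance are arbitrary choices for illustration.

```python
from datetime import datetime, timedelta


def matches_last_loaded(visit_time, visit_duration, last_loaded,
                        tolerance=timedelta(seconds=2)):
    """True if visit_time + visit_duration lands within tolerance of last_loaded."""
    estimated_close = visit_time + visit_duration
    return abs(estimated_close - last_loaded) <= tolerance
```

Running this over every visit whose origin has a Site Characteristics record would show how often the pattern holds, and which records break it.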
Get Hindsight
You can get Hindsight, view the code, and see the full change log on GitHub. Both the command line and web UI versions of this release are available as:
- compiled exes attached to the GitHub release or in the dist/ folder
- .py versions, available via pip install pyhindsight or by downloading/cloning the GitHub repo.