<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[dfir.blog]]></title><description><![CDATA[Digital forensics, web browsers, visualizations, & open source tools]]></description><link>https://dfir.blog/</link><image><url>https://dfir.blog/favicon.png</url><title>dfir.blog</title><link>https://dfir.blog/</link></image><generator>Ghost 5.82</generator><lastBuildDate>Tue, 07 Apr 2026 14:08:54 GMT</lastBuildDate><atom:link href="https://dfir.blog/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[Hindsight v2026.01 Released!]]></title><description><![CDATA[Hindsight v2026.01 brings new features, including parsing Sync Data, an updated terminal interface, improved output formats, and dozens of fixes and enhancements.]]></description><link>https://dfir.blog/hindsight-v2026-01/</link><guid isPermaLink="false">69821bfc04abfd293590e01b</guid><category><![CDATA[Hindsight]]></category><category><![CDATA[Chrome]]></category><category><![CDATA[Open Source Tools]]></category><dc:creator><![CDATA[Ryan Benson]]></dc:creator><pubDate>Wed, 04 Feb 2026 18:00:00 GMT</pubDate><media:content url="https://dfir.blog/content/images/2026/02/hindsight-cli.png" medium="image"/><content:encoded><![CDATA[<h2 id="sync-data-parsing">Sync Data Parsing</h2><img src="https://dfir.blog/content/images/2026/02/hindsight-cli.png" alt="Hindsight v2026.01 Released!"><p>A new feature that I&apos;m excited about in this release is parsing of Chrome&apos;s Sync Data. When a user signs into Chrome with their Google account, Chrome can sync bookmarks, passwords, extensions, history, and more across devices. </p><p>This sync functionality stores data locally in LevelDB files, and Hindsight can now parse it - at least partially. 
Most of the LevelDB records hold data encoded in different protobufs, many of which Hindsight now parses. The <em>meaning</em> and function of these parsed records are definitely an area for further research, as there is a wealth of information in the data. Hindsight currently only parses out what devices were used for syncing and enhances the existing &quot;Source&quot; column in the timeline with details about the originating device for synced URL visits:</p><figure class="kg-card kg-image-card"><img src="https://dfir.blog/content/images/2026/02/hindsight-sync-source.png" class="kg-image" alt="Hindsight v2026.01 Released!" loading="lazy" width="943" height="576" srcset="https://dfir.blog/content/images/size/w600/2026/02/hindsight-sync-source.png 600w, https://dfir.blog/content/images/2026/02/hindsight-sync-source.png 943w" sizes="(min-width: 720px) 720px"></figure><h2 id="updated-terminal-interface">Updated Terminal Interface</h2><p>Hindsight&apos;s terminal interface has been largely unchanged for almost 10 years (!?) now, and it showed. Hindsight now uses the <code>rich</code> library to provide a much more polished command-line interface, while still keeping with the spirit and style of the original version. This is mostly a cosmetic change; the command line syntax remains the same. </p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://dfir.blog/content/images/2026/02/hindsight-rich-cli.gif" class="kg-image" alt="Hindsight v2026.01 Released!" 
loading="lazy" width="1278" height="716" srcset="https://dfir.blog/content/images/size/w600/2026/02/hindsight-rich-cli.gif 600w, https://dfir.blog/content/images/size/w1000/2026/02/hindsight-rich-cli.gif 1000w, https://dfir.blog/content/images/2026/02/hindsight-rich-cli.gif 1278w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Hindsight&apos;s updated terminal interface</span></figcaption></figure><h2 id="new-artifacts-expanded-parsing">New Artifacts &amp; Expanded Parsing</h2><p>Beyond Sync Data, v2026.01 adds parsing for several other Chrome artifacts:</p><ul><li><strong>Permission Actions</strong> from the Preferences file, showing what permission requests websites have made</li><li><strong>Login Data For Account</strong> database, used for account-specific saved credentials in recent Chrome versions</li><li><strong>Account Capabilities</strong> from Preferences, translated into human-readable descriptions</li><li><strong>Parsing for more timestamped values in Preferences, </strong>as there are many top- or second-level keys that just hold a timestamp and are easy to parse</li></ul><h2 id="improved-output-formats">Improved Output Formats</h2><p>All three output formats (XLSX, JSONL, and SQLite) received improvements in this release. The SQLite output in particular was overhauled to be more comparable to the other formats, making it easier to work with Hindsight data in your tool of choice. The JSONL output, which was introduced to make it easier to import Hindsight results into <a href="https://timesketch.org/?ref=dfir.blog" rel="noreferrer">Timesketch</a>, previously only had timestamped records. 
It now includes all records; those without any intrinsic timestamp (like various storage items) have their timestamp set to the Unix epoch and a timestamp description of &quot;Not a time&quot;.</p><h2 id="more-robust-parsing">More Robust Parsing</h2><p>There are over a dozen fixes and improvements to make Hindsight&apos;s parsing more reliable and complete:</p><ul><li>Updated parsing for changes in Chrome v142&apos;s DIPS records</li><li>New danger types and interrupt reason codes for download records</li><li>Better handling of extension version strings and preference timestamps</li><li>More tolerant File System logical path creation</li><li>Improved file-closing and resource management</li></ul><h2 id="get-hindsight">Get Hindsight!</h2><p>You can get Hindsight, view the code, and see the full change log on <a href="https://github.com/obsidianforensics/hindsight?ref=dfir.blog" rel="noopener">GitHub</a>. Both the command line and web UI versions of this release are available as:</p><ul><li>compiled exes attached to the <a href="https://hindsig.ht/release?ref=dfir.blog">GitHub release</a> or in the dist/ folder</li><li>.py versions are available by <code>pip install pyhindsight</code> or downloading/cloning the <a href="https://hindsig.ht/github?ref=dfir.blog">GitHub repo</a>.</li></ul>]]></content:encoded></item><item><title><![CDATA[Unfurl 2025.03]]></title><description><![CDATA[Unfurl v2025.03 adds new features, including 
parsing Google Search's UDM parameter, support for Mastodon forks (like Truth Social), and a utility parser to "clean up" inputs.]]></description><link>https://dfir.blog/unfurl-parses-googe-udm-and-truth-social/</link><guid isPermaLink="false">67d1001404abfd293590df33</guid><category><![CDATA[Unfurl]]></category><dc:creator><![CDATA[Ryan Benson]]></dc:creator><pubDate>Thu, 13 Mar 2025 13:30:40 GMT</pubDate><media:content url="https://dfir.blog/content/images/2025/03/unfurl-google-udm-3.png" medium="image"/><content:encoded><![CDATA[<img src="https://dfir.blog/content/images/2025/03/unfurl-google-udm-3.png" alt="Unfurl 2025.03"><p>A new Unfurl release is here! v2025.03 adds new features and some fixes, including:</p><ul><li>Parsing Google Search&apos;s UDM parameter</li><li>Recognizing Mastodon usernames and parsing Mastodon forks (like truthsocial[.]com and gab[.]com) </li><li>Utility parser to &quot;clean up&quot; inputs</li></ul><p><a href="#get-it" rel="noreferrer">Get the new version now</a>, or read on for more details about the new features!</p><h2 id="google-search-udm-parameter">Google Search UDM Parameter</h2><p>I was first made aware of the UDM query string parameter in Google Search when lots of people started posting about the &quot;udm=14 hack&quot; to turn off AI-generated content in Search results. What this parameter seems to do is control the results page type, and <strong>udm=14</strong> sets the results page to &quot;Web&quot;. </p><p>When you click on different results types in a Google Search results page, you can observe the <code>udm</code> value changing as well. 
In the screenshot below, I selected &quot;Images&quot; and the <code>udm</code> value changed to <code>2</code>.</p><figure class="kg-card kg-image-card kg-width-wide kg-card-hascaption"><img src="https://dfir.blog/content/images/2025/03/unfurl-google-udm-1.png" class="kg-image" alt="Unfurl 2025.03" loading="lazy" width="1125" height="665" srcset="https://dfir.blog/content/images/size/w600/2025/03/unfurl-google-udm-1.png 600w, https://dfir.blog/content/images/size/w1000/2025/03/unfurl-google-udm-1.png 1000w, https://dfir.blog/content/images/2025/03/unfurl-google-udm-1.png 1125w"><figcaption><span style="white-space: pre-wrap;">Google Search &quot;Images&quot; Results Page with UDM=2</span></figcaption></figure><p>I manually incremented the <code>udm</code> value in the URL and observed what type of results page was served. <code>udm</code> of <code>51</code> was the highest value I found; setting it to <code>56</code> and above results in a redirect back to the search results page with the <code>udm</code> parameter stripped off (at least up to 65, at which point I stopped testing). The results are in the table below:</p>
<!--kg-card-begin: html-->
<table>
<thead>
<tr>
<th>UDM Value</th>
<th>Google Search Results Page Type</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>All</td>
</tr>
<tr>
<td>2</td>
<td>Images</td>
</tr>
<tr>
<td>3</td>
<td>Products</td>
</tr>
<tr>
<td>6</td>
<td>Learn</td>
</tr>
<tr>
<td>7</td>
<td>Videos</td>
</tr>
<tr>
<td>8</td>
<td>Jobs</td>
</tr>
<tr>
<td>12</td>
<td>News</td>
</tr>
<tr>
<td>14</td>
<td>Web</td>
</tr>
<tr>
<td>15</td>
<td>Things to do</td>
</tr>
<tr>
<td>18</td>
<td>Forums</td>
</tr>
<tr>
<td>28</td>
<td>Shopping</td>
</tr>
<tr>
<td>36</td>
<td>Books</td>
</tr>
<tr>
<td>37</td>
<td>Products</td>
</tr>
<tr>
<td>38</td>
<td>Videos</td>
</tr>
<tr>
<td>44</td>
<td>Visual matches</td>
</tr>
<tr>
<td>47</td>
<td>Web (+&quot;Refine Results&quot; panel)</td>
</tr>
<tr>
<td>48</td>
<td>Exact matches</td>
</tr>
<tr>
<td>51</td>
<td>Homework</td>
</tr>
</tbody>
</table>
<!--kg-card-end: html-->
<h2 id="mastodon-parsing-improvements">Mastodon Parsing Improvements</h2><p>There are a few minor enhancements to the Mastodon parser in this release. Unfurl now recognizes the username section of a post URL, and splits it into local username and account domain, if applicable. </p><figure class="kg-card kg-image-card kg-width-wide"><a href="https://dfir.blog/unfurl/?url=https://infosec.exchange/@404mediaco@mastodon.social/114116259626492341"><img src="https://dfir.blog/content/images/2025/03/unfurl-mastodon-long-username.png" class="kg-image" alt="Unfurl 2025.03" loading="lazy" width="1852" height="1225" srcset="https://dfir.blog/content/images/size/w600/2025/03/unfurl-mastodon-long-username.png 600w, https://dfir.blog/content/images/size/w1000/2025/03/unfurl-mastodon-long-username.png 1000w, https://dfir.blog/content/images/size/w1600/2025/03/unfurl-mastodon-long-username.png 1600w, https://dfir.blog/content/images/2025/03/unfurl-mastodon-long-username.png 1852w" sizes="(min-width: 1200px) 1200px"></a></figure><p>I&apos;ve also added <code>truthsocial.com</code> and <code>gab.com</code> to the Mastodon parser. Even though they aren&apos;t part of the Fediverse (like most other Mastodon servers are), since they&apos;re based on Mastodon&apos;s code, Unfurl can parse them just the same. 
</p><figure class="kg-card kg-image-card kg-width-wide"><img src="https://dfir.blog/content/images/2025/03/unfurl-truth-social.png" class="kg-image" alt="Unfurl 2025.03" loading="lazy" width="1686" height="1137" srcset="https://dfir.blog/content/images/size/w600/2025/03/unfurl-truth-social.png 600w, https://dfir.blog/content/images/size/w1000/2025/03/unfurl-truth-social.png 1000w, https://dfir.blog/content/images/size/w1600/2025/03/unfurl-truth-social.png 1600w, https://dfir.blog/content/images/2025/03/unfurl-truth-social.png 1686w" sizes="(min-width: 1200px) 1200px"></figure><h2 id="input-clean-up-actions">Input &quot;Clean Up&quot; Actions</h2><p>I&apos;m always on the lookout for ways to make Unfurl more helpful and usable. Some of the most common issues I&apos;ve seen when people use Unfurl are improperly formatted inputs, like enclosing the input URL or string in quotes or including leading/trailing spaces. If this happens, Unfurl can&apos;t properly parse the inputs (as it doesn&apos;t <em>know</em> that those are errors), and so gives an unsatisfying result to the user. </p><p>I&apos;ve added a few &quot;clean up&quot; actions to fix these common issues. 
Since I, like Unfurl, can&apos;t be truly sure that these extra characters are unintentional, I wanted to make these modifications visible to the user (both for transparency and to stick with Unfurl&apos;s &quot;show your work&quot; philosophy).</p><figure class="kg-card kg-image-card kg-width-wide kg-card-hascaption"><img src="https://dfir.blog/content/images/2025/03/unfurl-quote-cleanup.png" class="kg-image" alt="Unfurl 2025.03" loading="lazy" width="1908" height="921" srcset="https://dfir.blog/content/images/size/w600/2025/03/unfurl-quote-cleanup.png 600w, https://dfir.blog/content/images/size/w1000/2025/03/unfurl-quote-cleanup.png 1000w, https://dfir.blog/content/images/size/w1600/2025/03/unfurl-quote-cleanup.png 1600w, https://dfir.blog/content/images/2025/03/unfurl-quote-cleanup.png 1908w" sizes="(min-width: 1200px) 1200px"><figcaption><span style="white-space: pre-wrap;">Unfurl &quot;Clean Up&quot; parser removing quotes</span></figcaption></figure><p>If you use Unfurl and have any other &quot;annoyances&quot; or quality-of-life type issues, please let me know! I&apos;d love to make Unfurl easy and enjoyable to use for everyone.</p><h2 id="get-it">Get it!</h2><p>Those are the major items in this Unfurl release. There are more changes that didn&apos;t make it into the blog post; check out the <a href="https://github.com/obsidianforensics/unfurl/releases/tag/v2025.03?ref=dfir.blog" rel="noreferrer">release notes</a> for more. 
To get Unfurl with these latest updates, you can:</p><ul><li>use it online at <a href="https://dfir.blog/unfurl/">dfir.blog/unfurl</a>  or <a href="https://unfurl.link/?ref=dfir.blog">unfurl.link</a></li><li>if using pip, <code>pip install dfir-unfurl -U</code> will upgrade your local Unfurl to the latest</li><li>View the release on <a href="https://github.com/obsidianforensics/unfurl/releases/tag/v2025.03?ref=dfir.blog" rel="noreferrer">GitHub</a></li></ul><p>All features work in both the web UI and command line versions.</p>]]></content:encoded></item><item><title><![CDATA[Hindsight v2025.03 Released!]]></title><description><![CDATA[Hindsight v2025.03 focuses on Extensions - parsing more activity and state records, highlighting Extension permissions, and making it easier to examine Manifests.]]></description><link>https://dfir.blog/hindsight-parses-browser-extensions/</link><guid isPermaLink="false">67cf125f04abfd293590deba</guid><category><![CDATA[Hindsight]]></category><category><![CDATA[Web Browsers]]></category><category><![CDATA[Tools]]></category><category><![CDATA[Chrome]]></category><dc:creator><![CDATA[Ryan Benson]]></dc:creator><pubDate>Tue, 11 Mar 2025 17:02:06 GMT</pubDate><content:encoded><![CDATA[<h3 id="background">Background</h3><p>I&apos;ve been following some of the news related to attacks involving browser extensions and read some great write-ups about what happened and how. 
I&apos;d encourage everyone to read the post by John Tuckner (of Secure Annex) about the Cyberhaven Extension compromise:</p><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://secureannex.com/blog/cyberhaven-extension-compromise/?ref=dfir.blog"><div class="kg-bookmark-content"><div class="kg-bookmark-title">Cyberhaven Extension Compromise</div><div class="kg-bookmark-description">How the Cyberhaven extension was compromised and what it means for your organization.</div><div class="kg-bookmark-metadata"><img class="kg-bookmark-icon" src="https://secureannex.com/assets/icon.png" alt><span class="kg-bookmark-author">Secure Annex</span><span class="kg-bookmark-publisher">John Tuckner</span></div></div><div class="kg-bookmark-thumbnail"><img src="https://secureannex.com/blog-images/cyberhaven-extension-compromise/cyberhaven.png" alt></div></a></figure><p>One of the things that&apos;s been on my radar for a long time was adding more parsing of Extension-related databases to Hindsight, and this seemed like a timely excuse!</p><h2 id="new-extension-data-section">New &quot;Extension Data&quot; Section</h2><p>Hindsight can now parse eight more databases related to Extension activity (they all use LevelDB and share a similar format). They are:</p><ul><li>Extension Rules</li><li>Extension Scripts</li><li>Extension State</li><li>Local App Settings</li><li>Local Extension Settings</li><li>Managed Extension Settings</li><li>Sync App Settings</li><li>Sync Extension Settings</li></ul><p>As these records are different from other &quot;Storage&quot; ones, I decided to put them in a new&#xA0;<code>Extension Data</code>&#xA0;output section. There aren&apos;t any explicit timestamps associated with these records (although plenty of timestamps are present inside the unstructured <code>Value</code> fields). I have some ideas on plugins and additional parsing, but that will need to wait for a subsequent release. 
For now, I think simply surfacing this data is a good place to start.</p><figure class="kg-card kg-image-card kg-width-wide kg-card-hascaption"><img src="https://dfir.blog/content/images/2025/03/hindsight-extension-data.png" class="kg-image" alt="New &quot;Extension Data&quot; Tab in XLSX Output" loading="lazy" width="1717" height="388" srcset="https://dfir.blog/content/images/size/w600/2025/03/hindsight-extension-data.png 600w, https://dfir.blog/content/images/size/w1000/2025/03/hindsight-extension-data.png 1000w, https://dfir.blog/content/images/size/w1600/2025/03/hindsight-extension-data.png 1600w, https://dfir.blog/content/images/2025/03/hindsight-extension-data.png 1717w" sizes="(min-width: 1200px) 1200px"><figcaption><span style="white-space: pre-wrap;">New &quot;Extension Data&quot; Tab in XLSX Output</span></figcaption></figure><p>Another, more minor, change in this version is to the <code>Installed Extensions</code> section of the output - I&apos;ve added <strong>Permissions</strong> and <strong>Manifest</strong> columns. The <strong>Manifest</strong> column contains the extension&apos;s entire <code>manifest.json</code> file, as lots of different parts of it are relevant for analysis, depending on the question being asked. I pulled out the <strong>Permissions</strong> section from the manifest into its own column to highlight it, as I think it&apos;s particularly important. I also think it&apos;s useful to be able to quickly scan down the list of installed extensions and see what permissions each has, in case something jumps out as a bit unusual. 
</p><figure class="kg-card kg-image-card kg-width-wide kg-card-hascaption"><img src="https://dfir.blog/content/images/2025/03/hindsight-extension-permissions-manifest.png" class="kg-image" alt="Updated &quot;Installed Extensions&quot; Tab, with Permissions and Manifest Columns" loading="lazy" width="2000" height="273" srcset="https://dfir.blog/content/images/size/w600/2025/03/hindsight-extension-permissions-manifest.png 600w, https://dfir.blog/content/images/size/w1000/2025/03/hindsight-extension-permissions-manifest.png 1000w, https://dfir.blog/content/images/size/w1600/2025/03/hindsight-extension-permissions-manifest.png 1600w, https://dfir.blog/content/images/2025/03/hindsight-extension-permissions-manifest.png 2168w" sizes="(min-width: 1200px) 1200px"><figcaption><span style="white-space: pre-wrap;">Updated &quot;Installed Extensions&quot; Tab, with Permissions and Manifest Columns</span></figcaption></figure><h2 id="get-hindsight">Get Hindsight!</h2><p>You can get Hindsight, view the code, and see the full change log on <a href="https://github.com/obsidianforensics/hindsight?ref=dfir.blog" rel="noopener">GitHub</a>. 
Both the command line and web UI versions of this release are available as:</p><ul><li>compiled exes attached to the <a href="https://hindsig.ht/release?ref=dfir.blog">GitHub release</a> or in the dist/ folder</li><li>.py versions are available by <code>pip install pyhindsight</code> or downloading/cloning the <a href="https://hindsig.ht/github?ref=dfir.blog">GitHub repo</a>.</li></ul>]]></content:encoded></item><item><title><![CDATA[Unfurl v2025.02 Released]]></title><description><![CDATA[Unfurl v2025.02 adds parsing of obfuscated IP addresses, more Bluesky timestamps, and more!]]></description><link>https://dfir.blog/unfurl-parses-obfuscated-ip-addresses/</link><guid isPermaLink="false">67b1f90804abfd293590de21</guid><category><![CDATA[Unfurl]]></category><dc:creator><![CDATA[Ryan Benson]]></dc:creator><pubDate>Wed, 19 Feb 2025 14:41:19 GMT</pubDate><media:content url="https://dfir.blog/content/images/2025/02/unfurl-deceptive-ip-address.png" medium="image"/><content:encoded><![CDATA[<img src="https://dfir.blog/content/images/2025/02/unfurl-deceptive-ip-address.png" alt="Unfurl v2025.02 Released"><p>A new Unfurl release is here! 
v2025.02 adds new features and some fixes, including:</p><ul><li>Parsing of IP addresses, including encoded or obfuscated variants</li><li>Resolving Bluesky handles to their backing identifiers (DIDs), and then looking up that DID in the plc.directory audit log to find its creation timestamp</li><li>Bug fixes and speed enhancements for bulk parsing</li></ul><p>This is a relatively small release, but in addition to the new features, it fixes a few bugs (see the full changelog on the&#xA0;<a href="https://github.com/obsidianforensics/unfurl/releases/tag/v2025.02?ref=dfir.blog" rel="noreferrer">GitHub release page</a>).&#xA0;<a href="https://dfir.blog/unfurl-parses-obfuscated-ip-addresses/#get-it" rel="noreferrer">Get it now</a>, or read on for more details about the new features!</p><h3 id="parsing-of-ip-addresses-in-many-forms">Parsing of IP Addresses (in many forms)</h3><p>Unfurl previously only parsed domain names, but can now correctly recognize IP addresses. Not just IPs as they most typically appear (like 8.8.8.8 or 10.0.0.1), but in other forms, which are often used by attackers to try to obscure the actual destination (like http://example.com@1157586937). 
Below are more supported examples (from a <a href="https://www.trustwave.com/en-us/resources/blogs/spiderlabs-blog/evasive-urls-in-spam/?ref=dfir.blog" rel="noreferrer">Trustwave report</a>); all examples point to a Google IP:</p><ul><li>Dotted decimal IP address:&#xA0;<a href="https://216.58.199.78/?ref=dfir.blog" rel="nofollow">https://216.58.199.78</a>&#xA0;(the most common)</li><li>Octal IP address:&#xA0;<a href="https://216.58.199.78/?ref=dfir.blog" rel="nofollow">https://0330.0072.0307.0116</a>&#xA0;(convert each decimal number to octal)</li><li>Hexadecimal IP address:&#xA0;<a href="https://216.58.199.78/?ref=dfir.blog" rel="nofollow">https://0xD83AC74E</a>&#xA0;(convert each decimal number to hexadecimal)</li><li>Integer or DWORD IP address:&#xA0;<a href="https://216.58.199.78/?ref=dfir.blog" rel="nofollow">https://3627730766</a>&#xA0;(convert hexadecimal IP to integer)</li></ul><figure class="kg-card kg-image-card kg-width-wide kg-card-hascaption"><img src="https://dfir.blog/content/images/2025/02/unfurl-deceptive-ip-address-1.png" class="kg-image" alt="Unfurl v2025.02 Released" loading="lazy" width="1423" height="1021" srcset="https://dfir.blog/content/images/size/w600/2025/02/unfurl-deceptive-ip-address-1.png 600w, https://dfir.blog/content/images/size/w1000/2025/02/unfurl-deceptive-ip-address-1.png 1000w, https://dfir.blog/content/images/2025/02/unfurl-deceptive-ip-address-1.png 1423w" sizes="(min-width: 1200px) 1200px"><figcaption><span style="white-space: pre-wrap;">Unfurl parsing a deceptive URL with a username and encoded IP address</span></figcaption></figure><p></p><h3 id="parsing-and-lookups-of-bluesky-handles">Parsing and Lookups of Bluesky Handles</h3><p>Unfurl added support for parsing the embedded timestamps out of Bluesky post IDs (&quot;TIDs&quot;) in the v2024.11 release; this latest release adds the ability to resolve a Bluesky handle to its underlying <code>did</code> , then consult the plc.directory audit log to see when that 
<code>did</code> was created. </p><figure class="kg-card kg-image-card kg-width-wide kg-card-hascaption"><img src="https://dfir.blog/content/images/2025/02/unfurl-bsky-timestamps.png" class="kg-image" alt="Unfurl v2025.02 Released" loading="lazy" width="1710" height="1038" srcset="https://dfir.blog/content/images/size/w600/2025/02/unfurl-bsky-timestamps.png 600w, https://dfir.blog/content/images/size/w1000/2025/02/unfurl-bsky-timestamps.png 1000w, https://dfir.blog/content/images/size/w1600/2025/02/unfurl-bsky-timestamps.png 1600w, https://dfir.blog/content/images/2025/02/unfurl-bsky-timestamps.png 1710w" sizes="(min-width: 1200px) 1200px"><figcaption><span style="white-space: pre-wrap;">Unfurl parsing a bsky.app URL, showing the handle creation and the post timestamps</span></figcaption></figure><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">&#x2139;&#xFE0F;</div><div class="kg-callout-text">Note: both the handle resolution and reading the creation timestamp from the audit log require a remote lookup, which is disabled by default in the local Python version. You can enable it by changing the <code spellcheck="false" style="white-space: pre-wrap;">unfurl.ini</code> file.</div></div><h2 id="get-it">Get it!</h2><p>Those are the major items in this Unfurl release. There are more changes that didn&apos;t make it into the blog post; check out the <a href="https://github.com/obsidianforensics/unfurl/releases/tag/v2025.02?ref=dfir.blog" rel="noreferrer">release notes</a> for more. 
To get Unfurl with these latest updates, you can:</p><ul><li>use it online at <a href="https://dfir.blog/unfurl/">dfir.blog/unfurl</a>  or <a href="https://unfurl.link/?ref=dfir.blog">unfurl.link</a></li><li>if using pip, <code>pip install dfir-unfurl -U</code> will upgrade your local Unfurl to the latest</li><li>View the release on <a href="https://github.com/obsidianforensics/unfurl/releases/tag/v2025.02?ref=dfir.blog" rel="noreferrer">GitHub</a></li></ul><p>All features work in both the web UI and command line versions.</p>]]></content:encoded></item><item><title><![CDATA[Authenticating Screenshots from Netflix's Carry-On Movie]]></title><description><![CDATA[I watch Netflix's Carry-On, notice a real Google Search URL on screen, extract lots of data points from it and "authenticate" the screenshot.]]></description><link>https://dfir.blog/authenticating-screenshots-from-netflix-carry-on-movie/</link><guid isPermaLink="false">677b39a704abfd293590dbca</guid><category><![CDATA[Unfurl]]></category><category><![CDATA[Web Browsers]]></category><dc:creator><![CDATA[Ryan Benson]]></dc:creator><pubDate>Mon, 13 Jan 2025 17:12:09 GMT</pubDate><media:content url="https://dfir.blog/content/images/2025/01/carry-on-google-search-url-1.png" medium="image"/><content:encoded><![CDATA[<img src="https://dfir.blog/content/images/2025/01/carry-on-google-search-url-1.png" alt="Authenticating Screenshots from Netflix&apos;s Carry-On Movie"><p>Over the winter holiday, I got a bit of downtime. During this, I was watching Netflix&apos;s <em>Carry-On</em> when I noticed something: an actual URL on screen! Often in movies and TV, any &quot;web browsers&quot; that appear are mock-ups (and either look awesomely futuristic or laughably bad). Not only did this appear to be a real-life web browser showing a real webpage, it was a Google Search Engine Results Page (SERP), which I know can have tons of interesting bits encoded in it. 
My wife chuckled at me as I paused the movie to take a closer look (she&apos;s used to that by now). Here it is:</p><figure class="kg-card kg-image-card kg-width-wide kg-card-hascaption"><img src="https://dfir.blog/content/images/2025/01/carry-on-google-search-url.png" class="kg-image" alt="Authenticating Screenshots from Netflix&apos;s Carry-On Movie" loading="lazy" width="2000" height="1269" srcset="https://dfir.blog/content/images/size/w600/2025/01/carry-on-google-search-url.png 600w, https://dfir.blog/content/images/size/w1000/2025/01/carry-on-google-search-url.png 1000w, https://dfir.blog/content/images/size/w1600/2025/01/carry-on-google-search-url.png 1600w, https://dfir.blog/content/images/2025/01/carry-on-google-search-url.png 2210w" sizes="(min-width: 1200px) 1200px"><figcaption><span style="white-space: pre-wrap;">A Google Search Results Page (SERP) from the Netflix movie </span><i><em class="italic" style="white-space: pre-wrap;">Carry-On</em></i></figcaption></figure><p>The next day, I went back to that scene (about 47 minutes in, if you want to see it yourself) and did my best to type out the URL. I got as far as the <code>oq</code> query string parameter, then gave up, as the image was getting blurry and I already had quite a bit. For the Google SERP URL, I was able to read the <code>q</code>, <code>rlz</code>, <code>ei</code>, <code>ved</code>, <code>uact</code>, and <code>oq</code> query string parameters. 
I put the URL into <a href="https://unfurl.link/?ref=dfir.blog" rel="noreferrer">Unfurl</a>, and got: </p><figure class="kg-card kg-image-card kg-width-wide kg-card-hascaption"><img src="https://dfir.blog/content/images/2025/01/unfurl-carry-on-google-serp-url-1.png" class="kg-image" alt="Authenticating Screenshots from Netflix&apos;s Carry-On Movie" loading="lazy" width="2000" height="589" srcset="https://dfir.blog/content/images/size/w600/2025/01/unfurl-carry-on-google-serp-url-1.png 600w, https://dfir.blog/content/images/size/w1000/2025/01/unfurl-carry-on-google-serp-url-1.png 1000w, https://dfir.blog/content/images/size/w1600/2025/01/unfurl-carry-on-google-serp-url-1.png 1600w, https://dfir.blog/content/images/2025/01/unfurl-carry-on-google-serp-url-1.png 2000w" sizes="(min-width: 1200px) 1200px"><figcaption><span style="white-space: pre-wrap;">Unfurl parsing a Google SERP that appeared in Netflix&apos;s </span><i><em class="italic" style="white-space: pre-wrap;">Carry-On</em></i></figcaption></figure><p>There&apos;s a ton of stuff here! If you don&apos;t know what all those Google Search parameters mean, no problem; Unfurl does its best to parse and explain them. I&apos;d encourage you to <a href="https://dfir.blog/unfurl/?url=https://www.google.com/search?q=nova+shock&amp;rlz=1C1RXQR_enUS928US928&amp;ei=HEBlZL-eLrOmptQP3pmK4AI&amp;ved=0ahUKEwi_nJusn_3-AhUzk4kEHd6MAiwQ4dUDCBA&amp;uact=5&amp;oq=nova+shock" rel="noreferrer">take a look at the interactive graph</a> yourself; there&apos;s useful hover text on some nodes that isn&apos;t visible in the screenshot above. 
</p><p>I&apos;ll summarize what Unfurl pulled out of each query string parameter from the <em>Carry-On</em> Google SERP URL:</p><ul><li><code>q</code>: &quot;nova shock&quot; - the terms used in the Google search <strong>q</strong>uery</li><li><code>oq</code>: &quot;nova shock&quot; - the &quot;<strong>o</strong>riginal <strong>q</strong>uery&quot; terms entered by the user.<ul><li>Sometimes auto-complete or suggestions are used to reach the actual search terms (in <code>q</code>) from the <code>oq</code> value, but that doesn&apos;t look to have happened here, since the <code>q</code> and <code>oq</code> are the same.</li></ul></li><li><code>rlz</code>: this is used for grouping promotion event signals and anonymous user cohorts (<a href="https://dfir.blog/google-search-rlz/" rel="noreferrer">more info on <code>rlz</code> in this post</a>). Interesting parsed info:<ul><li>the search was performed using Chrome Omnibox (that combination URL and search box at the top of Chrome)</li><li>the language was English</li><li>the Chrome browser used to make the search was installed in the United States the week of <strong>2020-11-16</strong>, which is also the same time period the first Google search was made from that system</li></ul></li><li><code>ei</code>: has info about when the search session started. The <em>search session</em> starting timestamp is before the actual <em>search </em>occurred; this is often seconds before, but could be many hours.  <ul><li>The search session started <strong>2023-05-17 20:59:08.757567+00:00</strong></li></ul></li><li><code>ved</code>: often appears when a user clicks a link on a Google page. 
It contains information about the link that was clicked on: position on the page, link type, and timing&#xA0;(<a href="https://dfir.blog/google-ved-versions/" rel="noreferrer">more info on <code>ved</code> in this post</a>).<ul><li>The search session started <strong>2023-05-17 20:59:08.757567+00:00</strong> (matches the <code>ei</code> timestamp)</li></ul></li></ul><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">&#x2139;&#xFE0F;</div><div class="kg-callout-text">An important note: what those of us outside Google know (or think we know) about Google Search URLs has been deduced through research and testing, and could be invalidated at any time if Google makes changes. Google doesn&apos;t publish what these query string parameters mean or how to interpret them, but a lot of people have spent a lot of time and effort trying to figure that out (for both forensic and search engine optimization reasons). </div></div><p>That&apos;s a lot of information extracted from one URL! Most of the time when I&apos;m doing this kind of analysis, I don&apos;t have a video (or screenshot) of the user performing the actions in the browser, and the data points from the URL help paint the full picture of what happened. In this instance, however, we <em>can</em> see what the user was doing, which lets us ask a different question: is what is encoded in the URL consistent with what we&apos;re seeing? Or phrased another way: <strong>has the screenshot been manipulated?</strong></p><h2 id="is-the-carry-on-screenshot-consistent-with-the-movie-setting">Is the <em>Carry-On</em> screenshot consistent with the movie setting?</h2><p>So, how did the <em>Carry-On</em> screenshot do as far as being consistent with the events around it? Let&apos;s go through each data point from the URL and see how it fits with what we see in the movie:</p><table>
<thead>
<tr>
<th>Attribute</th>
<th>On-Screen</th>
<th>Extracted Data Point</th>
<th>Match</th>
</tr>
</thead>
<tbody>
<tr>
<td>Search query</td>
<td>&quot;Nov Chuck&quot;</td>
<td>&quot;nova shock&quot;</td>
<td>&#x274C;</td>
</tr>
<tr>
<td>Browser is Chrome</td>
<td>Yes</td>
<td>Yes</td>
<td>&#x2705;</td>
</tr>
<tr>
<td>Search location</td>
<td>Google Home Page <br> or New Tab Page</td>
<td>Omnibox</td>
<td>&#x274C;</td>
</tr>
<tr>
<td>Language</td>
<td>English</td>
<td>English</td>
<td>&#x2705;</td>
</tr>
<tr>
<td>Browser install date</td>
<td>Unknown</td>
<td>2020-11-16</td>
<td>&#x2754;</td>
</tr>
<tr>
<td>Search session start</td>
<td>202?-12-24</td>
<td>2023-05-17</td>
<td>&#x274C;</td>
</tr>
</tbody>
</table>
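The <code>ei</code> decoding behind the &quot;Search session start&quot; row above can be reproduced directly. Below is a minimal Python sketch based on community research into the <code>ei</code> parameter (Google doesn&apos;t document this format, and it could change): the value is URL-safe base64, the first 4 bytes hold a little-endian Unix timestamp in seconds, and the protobuf-style varint that follows holds the microseconds.

```python
import base64
import struct
from datetime import datetime, timezone

def decode_ei(ei: str) -> datetime:
    """Decode a Google 'ei' parameter into its embedded session timestamp.

    Based on community research, not any Google documentation: the value is
    URL-safe base64; bytes 0-3 are a little-endian Unix timestamp (seconds),
    and the varint starting at byte 4 holds the microseconds.
    """
    raw = base64.urlsafe_b64decode(ei + '=' * (-len(ei) % 4))
    seconds = struct.unpack('<i', raw[0:4])[0]
    # Read a protobuf-style varint (7 bits per byte, least-significant first)
    micros, shift = 0, 0
    for b in raw[4:]:
        micros |= (b & 0x7F) << shift
        shift += 7
        if not b & 0x80:
            break
    return datetime.fromtimestamp(seconds, tz=timezone.utc).replace(microsecond=micros)

# The ei value transcribed from the Carry-On SERP URL:
print(decode_ei('HEBlZL-eLrOmptQP3pmK4AI'))  # 2023-05-17 20:59:08.757567+00:00
```

Running this on the movie&apos;s <code>ei</code> value reproduces the session timestamp Unfurl reported in the graph above.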
<p><strong>Conclusion</strong>: <em>The screenshot has been altered!</em> &#x1F632; </p><p>I know, who could have guessed a computer screen in a movie had some edits applied? The search query from the URL didn&apos;t match what was on the screen, which is the most definitive mismatch I can see. The two matching attributes, the browser being Chrome and the language being English, are so common that it would be strange if they didn&apos;t match. I don&apos;t weigh the search location mismatch (Omnibox vs Home Page) heavily, as I&apos;ve had a hard time getting a <code>rlz</code> parameter to appear in SERP URLs, so I haven&apos;t been able to verify its behavior. Likewise, the browser install date from the <code>rlz</code> is plausible, but not useful for verification in this case. </p><p>The last big mismatch is the search session timestamp. While the search session starting timestamp can be a ways before the search actually occurs, 7 months is quite a stretch (the movie is set on December 24th, while the embedded timestamp is May 17th). However, if you kind of squint at the computer&apos;s clock while the search is happening, it might resemble <code>5/1?/????</code>. So maybe the computer and the Google search agree on the date at least, but the people on-screen aren&apos;t being honest about what the timeframe is? 
&#x1F9D0;</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://dfir.blog/content/images/2025/01/carry-on-search-clock.png" class="kg-image" alt="Authenticating Screenshots from Netflix&apos;s Carry-On Movie" loading="lazy" width="558" height="283"><figcaption><span style="white-space: pre-wrap;">Blurry screenshot of the system clock onscreen in </span><i><em class="italic" style="white-space: pre-wrap;">Carry-On</em></i></figcaption></figure><h2 id="real-life-applications">Real Life Applications</h2><h3 id="evaluating-the-authenticity-of-screenshots">Evaluating the Authenticity of Screenshots</h3><p>Now, this post is just a fun exercise (no one expects screenshots from movies to match reality), but it does have more serious parallels. If you come across a screenshot, whether that&apos;s during a DFIR investigation, some OSINT research, or just on social media, if that screenshot has a URL in it, you potentially have some more data points around the veracity of that screenshot.</p><p>This post highlighted how useful something like a search engine URL can be, but all sorts of URLs can have interesting bits encoded inside them, like those from <a href="https://dfir.blog/unfurl/?url=https://twitter.com/_RyanBenson/status/1189581422685634560?s=20" rel="noreferrer">Twitter/X</a>, <a href="https://dfir.blog/unfurl/?url=https://discordapp.com/channels/427876741990711298/537760691302563843/643183730227281931" rel="noreferrer">Discord</a>, <a href="https://dfir.blog/unfurl/?url=https://www.tiktok.com/@billnye/video/6854717870488702213?lang=en" rel="noreferrer">TikTok</a>, and many more!</p><h3 id="importance-of-verification">Importance of Verification </h3><p>Above, when I said I just typed out the URL from <em>Carry-On</em> and dropped it into Unfurl to get all those results, I wasn&apos;t being completely honest. 
I did put my transcribed URL into Unfurl, but when I took a close look at the results I noticed things weren&apos;t quite right.</p><p>Some of the things we&apos;ve observed about the URLs are useful, like what we think the timestamps represent, and some are more like trivia. One of the less-useful things we&apos;ve figured out is that the last (or 3rd) value in the <code>ei</code> parameter should match the 13-3 value in the <code>ved</code> parameter. We don&apos;t know what these values mean, but after looking at enough examples, we expect them to match. And in my first transcribed example... they don&apos;t. We also expect the timestamps in the <code>ved</code> and the <code>ei</code> to match, and those don&apos;t either. What&apos;s going on?</p><figure class="kg-card kg-image-card kg-width-wide kg-card-hascaption"><img src="https://dfir.blog/content/images/2025/01/unfurl-carry-on-1.png" class="kg-image" alt="Authenticating Screenshots from Netflix&apos;s Carry-On Movie" loading="lazy" width="2000" height="563" srcset="https://dfir.blog/content/images/size/w600/2025/01/unfurl-carry-on-1.png 600w, https://dfir.blog/content/images/size/w1000/2025/01/unfurl-carry-on-1.png 1000w, https://dfir.blog/content/images/size/w1600/2025/01/unfurl-carry-on-1.png 1600w, https://dfir.blog/content/images/2025/01/unfurl-carry-on-1.png 2000w" sizes="(min-width: 1200px) 1200px"><figcaption><span style="white-space: pre-wrap;">First attempt at Unfurling the SERP URL from </span><i><em class="italic" style="white-space: pre-wrap;">Carry-On</em></i></figcaption></figure><p>This led me to experiment with the <code>ei</code> and <code>ved</code> parameters; specifically, with the characters that can be a little ambiguous (like lowercase &quot;L&quot; (l) and uppercase &quot;i&quot; (I)). After some tinkering, I found that I had initially misread two characters, both in the <code>ei</code> parameter. 
The correct value was <code>HEBlZL-eLrOmptQP3pmK4AI</code>; previously I had the 4th and last characters switched with their homoglyphs (<code>HEBIZL-eLrOmptQP3pmK4Al</code>). This helps illustrate that even &quot;trivia&quot;-type knowledge has its uses; while I don&apos;t know what those values <em>mean</em>, I was able to use them as a kind of consistency check. </p><h2 id="try-it-out">Try It Out!</h2><p>That&apos;s it for this post. If you found it interesting, I&apos;d encourage you to try it on a screenshot you find and let me know how it goes! Unfurl is useful for this, and you can use it <a href="https://unfurl.link/?ref=dfir.blog" rel="noreferrer">online </a>or <a href="https://github.com/obsidianforensics/unfurl?ref=dfir.blog" rel="noreferrer">locally</a>.</p>]]></content:encoded></item><item><title><![CDATA[Video of "What Can DFIQ Do For You?" Posted]]></title><description><![CDATA[The talk "What Can DFIQ Do For You?" that Jon Brown and I gave at the SANS DFIR Summit 2023 has been posted on YouTube!]]></description><link>https://dfir.blog/dfiq-video-at-sans-dfir-summit-2023/</link><guid isPermaLink="false">6765afa704abfd293590dbae</guid><category><![CDATA[Presentations & Interviews]]></category><category><![CDATA[Open Source Tools]]></category><dc:creator><![CDATA[Ryan Benson]]></dc:creator><pubDate>Wed, 20 Dec 2023 17:59:00 GMT</pubDate><media:content url="https://dfir.blog/content/images/2024/12/dfiq-sans-video.png" medium="image"/><content:encoded><![CDATA[<img src="https://dfir.blog/content/images/2024/12/dfiq-sans-video.png" alt="Video of &quot;What Can DFIQ Do For You?&quot; Posted"><p>The talk &quot;What Can DFIQ Do For You?&quot; that Jon Brown and I gave at the SANS DFIR Summit 2023 has been posted on YouTube! It was awesome to be able to publicly launch <a href="https://dfiq.org/?ref=dfir.blog" rel="noreferrer">DFIQ</a>; I hope this is just the start to a new DFIR community resource. 
</p><figure class="kg-card kg-embed-card"><iframe width="200" height="113" src="https://www.youtube.com/embed/oFCVREL3IDE?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen title="What Can DFIQ Do For You?"></iframe></figure>]]></content:encoded></item><item><title><![CDATA[Unfurl v2023.09 Released!]]></title><description><![CDATA[Unfurl v2023.09 adds parsing for JWTs, URLs with encoded DoH (DNS over HTTPS) requests, and more Mastodon servers. ]]></description><link>https://dfir.blog/unfurl-parsing-jwt-and-doh/</link><guid isPermaLink="false">66579e6f04abfd293590d980</guid><category><![CDATA[Unfurl]]></category><category><![CDATA[Open Source Tools]]></category><dc:creator><![CDATA[Ryan Benson]]></dc:creator><pubDate>Wed, 27 Sep 2023 13:30:00 GMT</pubDate><media:content url="https://dfir.blog/content/images/2023/09/unfurl-parse-jwt-1.png" medium="image"/><content:encoded><![CDATA[<img src="https://dfir.blog/content/images/2023/09/unfurl-parse-jwt-1.png" alt="Unfurl v2023.09 Released!"><p>A new Unfurl release is here! v2023.09 adds new features and some fixes. The release adds:</p><ul><li>Parsing of JWTs (JSON Web Tokens)</li><li>Parsing of DoH (DNS over HTTPS) URLs</li><li>More recognized Mastodon servers</li></ul><p>This is a relatively small release, but in addition to the new features, it fixes a few bugs (see the full changelog on the <a href="https://github.com/obsidianforensics/unfurl/releases/tag/v2023.09.05?ref=dfir.blog">GitHub release page</a>). <a href="#get-it">Get it now</a>, or read on for more details about the new features!</p><h2 id="parse-json-web-tokens-jwts">Parse JSON Web Tokens (JWTs)</h2><p>JSON Web Tokens (JWTs) are used frequently for authorization and signing purposes. 
I won&apos;t go into much detail about their structure here (<a href="https://jwt.io/introduction?ref=dfir.blog">check this out for an introduction</a>), but at the highest level, JWTs have three parts: header, payload, and signature. Each part is base64-encoded, and the parts are separated by a <code>.</code>. Unfurl first splits a JWT into those three components, then base64-decodes the header and payload, then parses the resulting JSON objects. While Unfurl could do all that in one step, it uses three steps to keep with the &quot;show your work&quot; spirit of the tool. </p><p>Here&apos;s Unfurl parsing a simple JWT (<a href="https://en.wikipedia.org/wiki/JSON_Web_Token?ref=dfir.blog#Structure">from Wikipedia</a>):</p><figure class="kg-card kg-image-card kg-width-wide kg-card-hascaption"><img src="https://dfir.blog/content/images/2023/09/unfurl-parse-jwt.png" class="kg-image" alt="Unfurl v2023.09 Released!" loading="lazy" width="1460" height="1061" srcset="https://dfir.blog/content/images/size/w600/2023/09/unfurl-parse-jwt.png 600w, https://dfir.blog/content/images/size/w1000/2023/09/unfurl-parse-jwt.png 1000w, https://dfir.blog/content/images/2023/09/unfurl-parse-jwt.png 1460w" sizes="(min-width: 1200px) 1200px"><figcaption>Unfurl parsing a simple JWT</figcaption></figure><p>I encounter these often when looking through links in emails. Here&apos;s another example, with many more parsers involved as well:</p><figure class="kg-card kg-image-card kg-width-wide kg-card-hascaption"><img src="https://dfir.blog/content/images/2023/09/unfurl-jwt-lnks.gd-email.png" class="kg-image" alt="Unfurl v2023.09 Released!" 
loading="lazy" width="1410" height="1116" srcset="https://dfir.blog/content/images/size/w600/2023/09/unfurl-jwt-lnks.gd-email.png 600w, https://dfir.blog/content/images/size/w1000/2023/09/unfurl-jwt-lnks.gd-email.png 1000w, https://dfir.blog/content/images/2023/09/unfurl-jwt-lnks.gd-email.png 1410w" sizes="(min-width: 1200px) 1200px"><figcaption>Unfurl parsing an email link with a JWT</figcaption></figure><p>Don&apos;t you just love how ridiculous email links have gotten? This one wasn&apos;t even malicious. </p><h2 id="dns-over-https-doh">DNS over HTTPS (DoH)</h2><p>I was reading a <a href="https://www.dshield.org/diary/Decoding+DNS+over+HTTPs+Requests/29488?ref=dfir.blog">SANS Internet Storm Center post by Johannes Ullrich</a> a while ago about decoding DoH requests in their honeypot and found it interesting. I knew a little about DoH, but hadn&apos;t seen URLs containing encoded requests before. I created an Unfurl parser for them; see an example below:</p><figure class="kg-card kg-image-card kg-width-wide kg-card-hascaption"><img src="https://dfir.blog/content/images/2023/09/unfurl-dns-doh.png" class="kg-image" alt="Unfurl v2023.09 Released!" loading="lazy" width="1609" height="940" srcset="https://dfir.blog/content/images/size/w600/2023/09/unfurl-dns-doh.png 600w, https://dfir.blog/content/images/size/w1000/2023/09/unfurl-dns-doh.png 1000w, https://dfir.blog/content/images/size/w1600/2023/09/unfurl-dns-doh.png 1600w, https://dfir.blog/content/images/2023/09/unfurl-dns-doh.png 1609w" sizes="(min-width: 1200px) 1200px"><figcaption>Unfurl parsing a URL containing an encoded DoH message</figcaption></figure><h2 id="more-mastodon-servers">More Mastodon Servers</h2><p>Unfurl has parsed timestamps from Mastodon&apos;s Toots for a long time, but it previously recognized a limited number of Mastodon servers. With the recent surge in Mastodon usage, I&apos;ve updated the list of Mastodon servers Unfurl knows about to nearly 250. 
</p><h2 id="get-it">Get it!</h2><p>Those are the major items in this Unfurl release. There are more changes that didn&apos;t make it into the blog post; check out the <a href="https://github.com/obsidianforensics/unfurl/releases/tag/v2023.09.05?ref=dfir.blog">release notes</a> for more. To get Unfurl with these latest updates, you can:</p><ul><li>use it online at <a href="https://dfir.blog/unfurl/">dfir.blog/unfurl</a> &#xA0;or <a href="https://unfurl.link/?ref=dfir.blog">unfurl.link</a></li><li>if using pip, <code>pip install dfir-unfurl -U</code> will upgrade your local Unfurl to the latest</li><li>View the release on <a href="https://github.com/obsidianforensics/unfurl/releases/tag/v2023.09.05?ref=dfir.blog">GitHub</a></li></ul><p>All features work in both the web UI and command line versions (<strong><strong>unfurl_app.py</strong></strong> &amp; <strong><strong>unfurl_cli.py</strong></strong>).</p>]]></content:encoded></item><item><title><![CDATA[Unfurl v2022.11: Social Media Edition]]></title><description><![CDATA[This "social media edition" Unfurl release includes parsing Twitter sharing codes, timestamps from Mastodon and LinkedIn IDs, expanding Substack redirects, & more!]]></description><link>https://dfir.blog/unfurl-parsing-twitter-mastodon-linkedin/</link><guid isPermaLink="false">66579e6f04abfd293590d97e</guid><category><![CDATA[Unfurl]]></category><dc:creator><![CDATA[Ryan Benson]]></dc:creator><pubDate>Thu, 10 Nov 2022 14:18:00 GMT</pubDate><media:content url="https://dfir.blog/content/images/2022/11/unfurl-2022.11-square-2.png" medium="image"/><content:encoded><![CDATA[<img src="https://dfir.blog/content/images/2022/11/unfurl-2022.11-square-2.png" alt="Unfurl v2022.11: Social Media Edition"><p>It&apos;s been a while, but a new Unfurl release is here! v2022.11 adds new features and has behind-the-scenes changes. 
With all the attention on Twitter lately, in this post I&apos;m going to highlight changes related to social media websites:</p><ul><li>Defining <strong>Twitter&apos;s</strong> sharing (<code>s</code>) parameter values (all 71 of them!)</li><li>Extracting timestamps from <strong>Mastodon</strong> IDs</li><li>Decoding multiple types of <strong>LinkedIn</strong> identifiers</li><li>Expanding <strong>Substack</strong> redirect links </li><li>Parsing common tracking/analytics query string parameters</li></ul><p><a href="#get-it">Get it now</a>, or read on for more details about the new features!</p><h2 id="twitter">Twitter</h2><p>Besides the headline-grabbing changes at Twitter, there have been some gradual, less obvious changes as well: the query string parameters. A few years ago (maybe 2018?) the <code>s</code> parameter appeared, and people (myself included) <a href="https://twitter.com/EikoFried/status/995601093001400320?ref=dfir.blog">began</a> <a href="https://twitter.com/mattnavarra/status/1044147538922803201?ref=dfir.blog">speculating</a> and trying to figure out its purpose. By experimentation, the values for <code>s</code> of 19, 20, and 21 seemed pretty clear: they meant a sharing source of Android, Twitter Web, and iOS, respectively (and Unfurl parsed them as such). </p><p>A few weeks ago, someone was poking at Twitter&apos;s JavaScript files and discovered an object with the mappings of 71 values for the sharing codes! They kindly <a href="https://github.com/obsidianforensics/unfurl/issues/162?ref=dfir.blog">shared this with me</a> (<strong>thanks <a href="https://github.com/2xyo?ref=dfir.blog">2xyo</a>!</strong>) and I added them to Unfurl. </p><p>The codes generally show the combination of device type (iOS, iPhone, Android, web browser) and method (email, WhatsApp, copy) used to share the tweet. 
I haven&apos;t personally seen the majority of these codes in use so I can&apos;t say they all are still valid, but then I also haven&apos;t shared a tweet from my iPad using LinkedIn (<code>s=71</code>)! </p><p>Here&apos;s my cleaned-up interpretation of what the <code>s</code> codes mean (links to the original .js files are in the <a href="https://github.com/obsidianforensics/unfurl/issues/162?ref=dfir.blog">GitHub issue</a> if you&apos;re curious).</p><!--kg-card-begin: markdown--><table>
<thead>
<tr>
<th><code>s</code> Parameter</th>
<th>Shared From</th>
</tr>
</thead>
<tbody>
<tr>
<td>01</td>
<td>an Android using SMS</td>
</tr>
<tr>
<td>02</td>
<td>an Android using Email</td>
</tr>
<tr>
<td>03</td>
<td>an Android using Gmail</td>
</tr>
<tr>
<td>04</td>
<td>an Android using Facebook</td>
</tr>
<tr>
<td>05</td>
<td>an Android using WeChat</td>
</tr>
<tr>
<td>06</td>
<td>an Android using Line</td>
</tr>
<tr>
<td>07</td>
<td>an Android using FBMessenger</td>
</tr>
<tr>
<td>08</td>
<td>an Android using WhatsApp</td>
</tr>
<tr>
<td>09</td>
<td>an Android using Other</td>
</tr>
<tr>
<td>10</td>
<td>iOS using Messages or SMS</td>
</tr>
<tr>
<td>11</td>
<td>iOS using Email</td>
</tr>
<tr>
<td>12</td>
<td>iOS using Other</td>
</tr>
<tr>
<td>13</td>
<td>an Android using Download</td>
</tr>
<tr>
<td>14</td>
<td>iOS using Download</td>
</tr>
<tr>
<td>15</td>
<td>an Android using Hangouts</td>
</tr>
<tr>
<td>16</td>
<td>an Android using Twitter DM</td>
</tr>
<tr>
<td>17</td>
<td>Twitter Web using Email</td>
</tr>
<tr>
<td>18</td>
<td>Twitter Web using Download</td>
</tr>
<tr>
<td>19</td>
<td>an Android using Copy</td>
</tr>
<tr>
<td>20</td>
<td>Twitter Web using Copy</td>
</tr>
<tr>
<td>21</td>
<td>iOS using Copy</td>
</tr>
<tr>
<td>22</td>
<td>iOS using Snapchat</td>
</tr>
<tr>
<td>23</td>
<td>an Android using Snapchat</td>
</tr>
<tr>
<td>24</td>
<td>iOS using WhatsApp</td>
</tr>
<tr>
<td>25</td>
<td>iOS using FBMessenger</td>
</tr>
<tr>
<td>26</td>
<td>iOS using Facebook</td>
</tr>
<tr>
<td>27</td>
<td>iOS using Gmail</td>
</tr>
<tr>
<td>28</td>
<td>iOS using Telegram</td>
</tr>
<tr>
<td>29</td>
<td>iOS using Line</td>
</tr>
<tr>
<td>30</td>
<td>iOS using Viber</td>
</tr>
<tr>
<td>31</td>
<td>an Android using Slack</td>
</tr>
<tr>
<td>32</td>
<td>an Android using Kakao</td>
</tr>
<tr>
<td>33</td>
<td>an Android using Discord</td>
</tr>
<tr>
<td>34</td>
<td>an Android using Reddit</td>
</tr>
<tr>
<td>35</td>
<td>an Android using Telegram</td>
</tr>
<tr>
<td>36</td>
<td>an Android using Instagram</td>
</tr>
<tr>
<td>37</td>
<td>an Android using Daum</td>
</tr>
<tr>
<td>38</td>
<td>iOS using Instagram</td>
</tr>
<tr>
<td>39</td>
<td>iOS using LinkedIn</td>
</tr>
<tr>
<td>40</td>
<td>an Android using LinkedIn</td>
</tr>
<tr>
<td>41</td>
<td>Gryphon using Copy</td>
</tr>
<tr>
<td>42</td>
<td>an iPhone using SMS</td>
</tr>
<tr>
<td>43</td>
<td>an iPhone using Email</td>
</tr>
<tr>
<td>44</td>
<td>an iPhone using Other</td>
</tr>
<tr>
<td>45</td>
<td>an iPhone using Download</td>
</tr>
<tr>
<td>46</td>
<td>an iPhone using Copy</td>
</tr>
<tr>
<td>47</td>
<td>an iPhone using Snapchat</td>
</tr>
<tr>
<td>48</td>
<td>an iPhone using WhatsApp</td>
</tr>
<tr>
<td>49</td>
<td>an iPhone using FBMessenger</td>
</tr>
<tr>
<td>50</td>
<td>an iPhone using Facebook</td>
</tr>
<tr>
<td>51</td>
<td>an iPhone using Gmail</td>
</tr>
<tr>
<td>52</td>
<td>an iPhone using Telegram</td>
</tr>
<tr>
<td>53</td>
<td>an iPhone using Line</td>
</tr>
<tr>
<td>54</td>
<td>an iPhone using Viber</td>
</tr>
<tr>
<td>55</td>
<td>an iPhone using Instagram</td>
</tr>
<tr>
<td>56</td>
<td>an iPhone using LinkedIn</td>
</tr>
<tr>
<td>57</td>
<td>an iPad using SMS</td>
</tr>
<tr>
<td>58</td>
<td>an iPad using Email</td>
</tr>
<tr>
<td>59</td>
<td>an iPad using Other</td>
</tr>
<tr>
<td>60</td>
<td>an iPad using Download</td>
</tr>
<tr>
<td>61</td>
<td>an iPad using Copy</td>
</tr>
<tr>
<td>62</td>
<td>an iPad using Snapchat</td>
</tr>
<tr>
<td>63</td>
<td>an iPad using WhatsApp</td>
</tr>
<tr>
<td>64</td>
<td>an iPad using FBMessenger</td>
</tr>
<tr>
<td>65</td>
<td>an iPad using Facebook</td>
</tr>
<tr>
<td>66</td>
<td>an iPad using Gmail</td>
</tr>
<tr>
<td>67</td>
<td>an iPad using Telegram</td>
</tr>
<tr>
<td>68</td>
<td>an iPad using Line</td>
</tr>
<tr>
<td>69</td>
<td>an iPad using Viber</td>
</tr>
<tr>
<td>70</td>
<td>an iPad using Instagram</td>
</tr>
<tr>
<td>71</td>
<td>an iPad using LinkedIn</td>
</tr>
</tbody>
</table>
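In practice, the table above is just a lookup. Here is a small illustrative Python sketch (the dictionary holds only a handful of the 71 codes, and the function name is my own, not Unfurl's):

```python
from urllib.parse import urlparse, parse_qs

# A few of the s-parameter mappings from the table above (71 total).
S_PARAM_SOURCES = {
    '19': 'an Android using Copy',
    '20': 'Twitter Web using Copy',
    '21': 'iOS using Copy',
    '71': 'an iPad using LinkedIn',
}

def describe_share_source(url: str) -> str:
    """Describe how a tweet link was shared, based on its s query parameter."""
    s_values = parse_qs(urlparse(url).query).get('s')
    if not s_values:
        return 'no s parameter present'
    return S_PARAM_SOURCES.get(s_values[0], 'unknown sharing code')

print(describe_share_source('https://twitter.com/_RyanBenson/status/1189581422685634560?s=20'))
# → Twitter Web using Copy
```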
<!--kg-card-end: markdown--><p>In addition to the <code>s</code> parameter, we&apos;ve seen <code>t</code> roll out gradually. I saw <code>t</code> on links shared from Android in late 2021 (<code>s=19</code>), then from Twitter Web (<code>s=20</code>) in early 2022, and finally from iOS (<code>s=21</code>) a bit later in 2022. I don&apos;t think anyone outside of Twitter knows exactly how the <code>t</code> parameter is constructed, but from my observations it appears consistent per device <em>for a time. </em>I shared tweets via numerous methods in August from my phone and the <code>t</code> was consistently the same. I did similar tests again in November, and the <code>t</code> value was again the same for different sharing methods, but it was different than from August. Maybe a software update or some other change on the device caused a change in the <code>t</code> &quot;fingerprint&quot;? With this in mind, I think seeing the same <code>t</code> values on multiple links suggests the same device was the sharing source. However, different <code>t</code> values could still be from the same device, just over a longer time period.</p><h2 id="mastodon">Mastodon</h2><p>This isn&apos;t actually a new parser (it&apos;s been in Unfurl for a few years), but I figured it would be worth mentioning with the increased interest in Mastodon. Mastodon is similar to Twitter in some respects; one of those is that the URLs of &quot;toots&quot; (Mastodon&apos;s version of tweets) contain an embedded timestamp. 
The long ID at the end of the URL is similar to a Twitter Snowflake:</p><p><strong>https://infosec.exchange/web/@RyanDFIR/109306117687853105</strong></p><figure class="kg-card kg-image-card kg-width-wide"><img src="https://dfir.blog/content/images/2022/11/unfurl-mastodon-ryandfir.png" class="kg-image" alt="Unfurl v2022.11: Social Media Edition" loading="lazy" width="1424" height="982" srcset="https://dfir.blog/content/images/size/w600/2022/11/unfurl-mastodon-ryandfir.png 600w, https://dfir.blog/content/images/size/w1000/2022/11/unfurl-mastodon-ryandfir.png 1000w, https://dfir.blog/content/images/2022/11/unfurl-mastodon-ryandfir.png 1424w" sizes="(min-width: 1200px) 1200px"></figure><p>Due to the federated nature of Mastodon, it could be running on a domain that Unfurl doesn&apos;t know about. To avoid false positives, I only have a <a href="https://github.com/obsidianforensics/unfurl/blob/master/unfurl/parsers/parse_mastodon.py?ref=dfir.blog#L57">short allowlist</a> of domains to parse as Mastodon instances. If you know of any others that you&apos;d like to be parsed, <a href="https://infosec.exchange/@RyanDFIR?ref=dfir.blog">let me know</a>. </p><h2 id="linkedin">LinkedIn</h2><p>A while ago, I did some research and discovered how to <a href="https://dfir.blog/tinkering-with-tiktok-timestamps/">dissect a TikTok identifier and extract a timestamp</a>. <a href="https://twitter.com/ollie_boyd_/status/1465340486588276739?ref=dfir.blog">Ollie Boyd</a> figured out that IDs in LinkedIn post URLs had a similar makeup and <a href="https://github.com/Ollie-Boyd/Linkedin-post-timestamp-extractor?ref=dfir.blog">made a tool</a> to extract those timestamps. 
I&apos;ve added this ability to Unfurl:</p><figure class="kg-card kg-image-card kg-width-wide kg-card-hascaption"><img src="https://dfir.blog/content/images/2022/11/unfurl-linkedin-post.png" class="kg-image" alt="Unfurl v2022.11: Social Media Edition" loading="lazy" width="1227" height="922" srcset="https://dfir.blog/content/images/size/w600/2022/11/unfurl-linkedin-post.png 600w, https://dfir.blog/content/images/size/w1000/2022/11/unfurl-linkedin-post.png 1000w, https://dfir.blog/content/images/2022/11/unfurl-linkedin-post.png 1227w" sizes="(min-width: 1200px) 1200px"><figcaption>Unfurl extracting a timestamp from a LinkedIn Post ID</figcaption></figure><h3 id="linkedin-messaging-ids">LinkedIn Messaging IDs</h3><p>It turns out these LinkedIn IDs are used in more places than posts. One place they used to appear was in Messaging threads. When viewing messages on linkedin.com, the URL for each message thread (series of messages with a user) looked like <code>https://www.linkedin.com/messaging/thread/6685980502161199104/</code>. The ID at the end has an embedded timestamp that seemed to line up with when the first message in the thread was sent. </p><p>I&apos;ve been referencing this in past tense because this isn&apos;t the case anymore; message threads now have URLs that look like <code>https://www.linkedin.com/messaging/thread/2-ZTRkNzljZjgtOTRmNC00ZGJkLWJlYTktMDFjOWU4MTgxMjhjXzAxMA==/</code>. These new IDs (which I&apos;m calling &quot;v2&quot; from the <code>2-</code> at the beginning) are base64-encoded UUIDs with a few characters appended. The above &quot;v2&quot; ID decodes to <code>e4d79cf8-94f4-4dbd-bea9-01c9e818128c_010</code>. </p><p>For those familiar with UUIDs, you may spot that this looks like a <a href="https://www.rfc-editor.org/rfc/rfc4122.html?ref=dfir.blog#section-4.4">UUIDv4 </a>(randomly-generated). 
I went back through my LinkedIn message threads, all the way back to 2009 (wow, I&apos;ve been on there a long time), and found something interesting. The older message threads had UUIDs that fit the form of <a href="https://www.rfc-editor.org/rfc/rfc4122.html?ref=dfir.blog#section-4.3">UUIDv5 </a>(name-based), while the newer ones fit UUIDv4. From my messages, the switch from UUIDv5 to UUIDv4 happened in early May 2021 (I have a UUIDv5 message on 2021-04-26 and a UUIDv4 on 2021-05-14). </p><p>Why am I going on about this? Neither version 4 nor version 5 UUIDs contain any embedded timestamp information (unlike <a href="https://dfir.blog/unfurl/?url=a28cad70-0d73-11ea-aaef-0800200c9a66">version 1</a>). However, for this particular use case, we can now infer that a LinkedIn ID based on UUIDv5 corresponds to a message thread <em>older </em>than 2021-05, while one with a UUIDv4 was created after that. It&apos;s a small, rough bit of timing information, but that&apos;s what Unfurl is all about: trying to parse all those tiny pieces of knowledge, in the hope that when put together they might paint a clearer picture. </p><h3 id="linkedin-profile-ids">LinkedIn Profile IDs</h3><p>A few months ago, <a href="https://twitter.com/jackcr?ref=dfir.blog">Jack Crook</a> showed how to decode LinkedIn Profile IDs and use their sequential nature to estimate profile creation time:</p><figure class="kg-card kg-embed-card"><blockquote class="twitter-tweet"><p lang="en" dir="ltr">All of the profiles listed in the article and this thread were created within days of each other.  <br>jennie-biller-9b631120a<br>victor-sites-40139b20a<br>charolette-pare-93b3a220a<br>vivian-christy-b1246320a<br>maryann-robles-2924b620a<br>1/4 <a href="https://t.co/N3Na6HAydN?ref=dfir.blog">https://t.co/N3Na6HAydN</a></p>&#x2014; Jack Crook (@jackcr) <a href="https://twitter.com/jackcr/status/1575915823075495936?ref_src=twsrc%5Etfw&amp;ref=dfir.blog">September 30, 2022</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
</figure><p>These &quot;profile IDs&quot; are different from the other IDs we discussed previously. I thought this technique was really interesting; I&apos;ve added parsing of the ID from base12 to Unfurl. I don&apos;t yet do anything with taking that number and estimating the creation time, but that sounds like a neat little project for when I find the time. </p><figure class="kg-card kg-image-card kg-width-wide"><img src="https://dfir.blog/content/images/2022/11/unfurl-linkedin-profile-id.png" class="kg-image" alt="Unfurl v2022.11: Social Media Edition" loading="lazy" width="1289" height="895" srcset="https://dfir.blog/content/images/size/w600/2022/11/unfurl-linkedin-profile-id.png 600w, https://dfir.blog/content/images/size/w1000/2022/11/unfurl-linkedin-profile-id.png 1000w, https://dfir.blog/content/images/2022/11/unfurl-linkedin-profile-id.png 1289w" sizes="(min-width: 1200px) 1200px"></figure><h2 id="tracking-url-parameters">Tracking URL Parameters</h2><p>Many websites add URL parameters to links to help with user tracking and analytics. This is not a new practice; we&apos;ve all seen a bunch of parameters tacked onto the end of links. As investigators, we can sometimes use these parameters to infer more information: how a user clicked on a link, what site the link was on, or even when they clicked it.</p><p>These parameters are key/value pairs; for example, in <code>utm_source=newsletter</code>, the key is <code>utm_source</code> and the value is <code>newsletter</code>. The values often contain helpful clues (in the example, I&apos;d guess that the link was from an email newsletter). Even when the values are opaque, we can glean some information from the key. For example, with <code>fbclid=IwAR3Nuy7koMAB1KyVE1NqjcVGqAExIxVjQLSx-01U_e3LHKwSOzf2NsyP0UI</code>, I have no idea (yet!) how to parse anything out of the <code>IwAR3...</code> value, but from the key I can infer the link was from Facebook. 
</p><p>I&apos;ve added parsing of some of the most common of the tracking/analytics parameters to Unfurl. If you find one you&apos;d like added, <a href="https://github.com/obsidianforensics/unfurl/issues/new/?ref=dfir.blog">please let me know</a>. </p><h2 id="substack">Substack</h2><p>I&apos;ve seen Substack increase in popularity as well. I so far only subscribe to <a href="https://grugq.substack.com/?ref=dfir.blog">&quot;The Info Op&quot; by the grugq</a>, but there is a lot of other good content there too. I typically read it via email and noticed that all the links go through Substack redirects. I added expanding of Substack&apos;s redirect links to Unfurl; since many of the links are to Twitter/Mastodon and Substack adds <code>utm_*</code> tracking parameters, this enables those parsers to run as well, making some nice Unfurl graphs:</p><figure class="kg-card kg-image-card kg-width-wide kg-card-hascaption"><img src="https://dfir.blog/content/images/2022/11/unfurl-substack.png" class="kg-image" alt="Unfurl v2022.11: Social Media Edition" loading="lazy" width="1907" height="1098" srcset="https://dfir.blog/content/images/size/w600/2022/11/unfurl-substack.png 600w, https://dfir.blog/content/images/size/w1000/2022/11/unfurl-substack.png 1000w, https://dfir.blog/content/images/size/w1600/2022/11/unfurl-substack.png 1600w, https://dfir.blog/content/images/2022/11/unfurl-substack.png 1907w" sizes="(min-width: 1200px) 1200px"><figcaption>Unfurl parsing a Substack redirect link from an email</figcaption></figure><h2 id="get-it">Get it!</h2><p>Those are the major items in this Unfurl release. There are more changes that didn&apos;t make it into the blog post; check out the <a href="https://github.com/obsidianforensics/unfurl/releases/tag/v2022.11?ref=dfir.blog">release notes</a> for more. 
To get Unfurl with these latest updates, you can:</p><ul><li>use it online at <a href="https://dfir.blog/unfurl/">dfir.blog/unfurl</a> or <a href="https://unfurl.link/?ref=dfir.blog">unfurl.link</a></li><li>if using pip, <code>pip install dfir-unfurl -U</code> will upgrade your local Unfurl to the latest</li><li>View the release on <a href="https://github.com/obsidianforensics/unfurl/releases/tag/v2022.11?ref=dfir.blog">GitHub</a></li></ul><p>All features work in both the web UI and command line versions (<strong>unfurl_app.py</strong> &amp; <strong>unfurl_cli.py</strong>).</p>]]></content:encoded></item><item><title><![CDATA[More Search URL Parsing, MISP Lists, & More in Unfurl v2022.02]]></title><description><![CDATA[Unfurl v2022.02 adds parsing for Google Search's aqs parameter, integrates MISP "warninglists", adds 3x more shortlink expansions, and more! ]]></description><link>https://dfir.blog/search-parsing-and-misp-lists-in-unfurl/</link><guid isPermaLink="false">66579e6f04abfd293590d97d</guid><category><![CDATA[Unfurl]]></category><category><![CDATA[Open Source Tools]]></category><dc:creator><![CDATA[Ryan Benson]]></dc:creator><pubDate>Wed, 02 Mar 2022 14:41:01 GMT</pubDate><media:content url="https://dfir.blog/content/images/2022/03/unfurl-misp-domain-lists-1.png" medium="image"/><content:encoded><![CDATA[<img src="https://dfir.blog/content/images/2022/03/unfurl-misp-domain-lists-1.png" alt="More Search URL Parsing, MISP Lists, &amp; More in Unfurl v2022.02"><p>A new Unfurl release is here! 
v2022.02 has been a long time coming and adds new features, including:</p><ul><li>Parsing for Google Search&apos;s <code>aqs</code> parameter</li><li>Integration of MISP&apos;s &quot;warning lists&quot; to enrich domain names</li><li>Shortlink expansion for 3x more domains</li><li>Extraction of encoded timestamps from Twitter image filenames</li><li>Parsing for Brave Search</li></ul><p><a href="#get-it">Get it now</a>, or read on for more details about the new features!</p><h2 id="google-searchs-aqs-parameter">Google Search&apos;s <code>aqs</code> Parameter</h2><p>Google Search&apos;s Assisted Query Stats (or <code>aqs</code>) parameter isn&apos;t new (it&apos;s been around <a href="https://bugs.chromium.org/p/chromium/issues/detail?id=132667&amp;ref=dfir.blog">since 2012</a> from what I can tell). Unlike many other Google Search URL parameters, it isn&apos;t a secret - it&apos;s (mostly) documented in the Chromium source. Per a <a href="https://source.chromium.org/chromium/chromium/src/+/main:components/search_engines/template_url.h;l=195?ref=dfir.blog">comment in the code</a>, AQS&apos; purpose is to log &quot;impressions of all autocomplete matches shown at the query submission time.&quot;</p><p>So what does that really mean? 
Consider the following screenshot:</p><figure class="kg-card kg-image-card kg-width-wide kg-card-hascaption"><img src="https://dfir.blog/content/images/2022/02/google_search_aqs_suggestions-.png" class="kg-image" alt="More Search URL Parsing, MISP Lists, &amp; More in Unfurl v2022.02" loading="lazy" width="868" height="250" srcset="https://dfir.blog/content/images/size/w600/2022/02/google_search_aqs_suggestions-.png 600w, https://dfir.blog/content/images/2022/02/google_search_aqs_suggestions-.png 868w"><figcaption><span style="white-space: pre-wrap;">Searching for &quot;unfurl url&quot; in Chrome&apos;s Omnibox</span></figcaption></figure><p>In the screenshot, I have typed &quot;unfurl url&quot; into Chrome&apos;s &quot;Omnibox&quot; (the address/search box). Chrome is showing me four suggestions relevant to what I have entered:</p><p><strong>Suggestion 1</strong>: Do a Google Search for the text I entered (&quot;unfurl url&quot;)<br><strong>Suggestions 2-4</strong>: Visit relevant pages from my local history - parts of the page title and URL that contain the words I entered are bolded in each suggestion</p><p>I ultimately selected the first suggestion and was sent to the Google Search Engine Results Page (SERP) for &quot;unfurl url&quot;. The URL had an <code>aqs</code> parameter: <code>aqs=chrome..69i57j69i60l3.7758j0j9</code>. 
Parsing <a href="https://dfir.blog/unfurl/?url=https://www.google.com/search?q=unfurl+url&amp;oq=unfurl+url&amp;aqs=chrome.0.69i59j69i60l3.19794j0j1&amp;sourceid=chrome&amp;ie=UTF-8">that URL with Unfurl</a> yields:</p><figure class="kg-card kg-image-card kg-width-wide kg-card-hascaption"><img src="https://dfir.blog/content/images/2022/02/unfurl_google_search_aqs.png" class="kg-image" alt="More Search URL Parsing, MISP Lists, &amp; More in Unfurl v2022.02" loading="lazy" width="2000" height="687" srcset="https://dfir.blog/content/images/size/w600/2022/02/unfurl_google_search_aqs.png 600w, https://dfir.blog/content/images/size/w1000/2022/02/unfurl_google_search_aqs.png 1000w, https://dfir.blog/content/images/size/w1600/2022/02/unfurl_google_search_aqs.png 1600w, https://dfir.blog/content/images/size/w2400/2022/02/unfurl_google_search_aqs.png 2400w" sizes="(min-width: 1200px) 1200px"><figcaption><span style="white-space: pre-wrap;">Google SERP URL containing an </span><code spellcheck="false" style="white-space: pre-wrap;"><span>aqs</span></code><span style="white-space: pre-wrap;"> parameter, parsed with Unfurl</span></figcaption></figure><p>What Unfurl parses from the <code>aqs</code> parameter can give quite a bit of insight about what I did to get to that Google SERP: </p><ul><li>I started on the &quot;New Tab Page&quot; in Chrome</li><li>I was shown four suggestions (&quot;Autocomplete Matches&quot;)</li><li>The first (index 0) was a Google Search suggestion</li><li>The second, third, and fourth (indexes 1-3) were URLs from my local history that were related to the text I entered</li><li>I selected the first suggestion</li><li>It was 19.794 seconds from when I started typing to when I went to the SERP (this seems long; evidently taking a screenshot slowed me down)</li></ul><p>The <code>aqs</code> parameter doesn&apos;t capture the <em>content</em> of the suggestions offered to me, but I think you&apos;d agree that what it does log is pretty interesting. 
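</p><p>As one hedged example (the field layout is inferred from the <code>aqs</code> values shown above; this is not a full parser), the duration component can be pulled out like this:</p><pre><code class="language-python">def aqs_duration_seconds(aqs):
    # The last dot-separated field starts with the elapsed milliseconds
    # between the first keystroke and the navigation, e.g. "19794j0j1"
    timing_field = aqs.split(".")[-1]
    milliseconds = int(timing_field.split("j")[0])
    return milliseconds / 1000

print(aqs_duration_seconds("chrome.0.69i59j69i60l3.19794j0j1"))  # 19.794</code></pre><p>That single number is where the 19.794-second figure above comes from. 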
The mechanics of unpacking the <code>aqs</code> parameter would be too much for this post, but I may come back to it in a future post. You can also take a look through <a href="https://github.com/obsidianforensics/unfurl/blob/master/unfurl/parsers/parse_google.py?ref=dfir.blog">Unfurl&apos;s code for parsing it</a> if you&apos;re curious.</p><h2 id="enrich-domain-names-using-misp-lists">Enrich Domain Names using MISP Lists</h2><p>One requested feature was to have some sort of annotation for domain names showing how popular they are. The <a href="https://www.misp-project.org/?ref=dfir.blog">open source MISP project</a> has a curated set of lists of all sorts, including domain names:</p><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://github.com/MISP/misp-warninglists?ref=dfir.blog"><div class="kg-bookmark-content"><div class="kg-bookmark-title">GitHub - MISP/misp-warninglists: Warning lists to inform users of MISP about potential false-positives or other information in indicators</div><div class="kg-bookmark-description">Warning lists to inform users of MISP about potential false-positives or other information in indicators - GitHub - MISP/misp-warninglists: Warning lists to inform users of MISP about potential fal...</div><div class="kg-bookmark-metadata"><img class="kg-bookmark-icon" src="https://github.githubassets.com/favicons/favicon.svg" alt="More Search URL Parsing, MISP Lists, &amp; More in Unfurl v2022.02"><span class="kg-bookmark-author">GitHub</span><span class="kg-bookmark-publisher">MISP</span></div></div><div class="kg-bookmark-thumbnail"><img src="https://opengraph.githubassets.com/7b36abb30607d0c2a88abf6f90e6d993289d5671c53aab6cf81a4ae78c05238c/MISP/misp-warninglists" alt="More Search URL Parsing, MISP Lists, &amp; More in Unfurl v2022.02"></div></a></figure><p>The purpose of these lists is to add context (a domain is in the top 1K/5K/1M domains, an IP address belongs to GCP, a hash is of EICAR, etc) to help in 
deciding whether something is a false positive, not to label things as &quot;good&quot; or &quot;bad&quot;.</p><p>Unfurl uses the various domain lists to annotate a domain (see below). Check out the link above to <code>misp-warninglists</code> for the full list of their lists (there are a lot). </p><figure class="kg-card kg-image-card kg-width-wide"><img src="https://dfir.blog/content/images/2022/03/unfurl-misp-domain-lists.png" class="kg-image" alt="More Search URL Parsing, MISP Lists, &amp; More in Unfurl v2022.02" loading="lazy" width="1580" height="781" srcset="https://dfir.blog/content/images/size/w600/2022/03/unfurl-misp-domain-lists.png 600w, https://dfir.blog/content/images/size/w1000/2022/03/unfurl-misp-domain-lists.png 1000w, https://dfir.blog/content/images/2022/03/unfurl-misp-domain-lists.png 1580w" sizes="(min-width: 1200px) 1200px"></figure><h2 id="more-shortlink-resolutions">More Shortlink Resolutions</h2><p>One of those MISP &quot;warninglists&quot; is of domains used for link shortening. Unfurl already supported resolving some shortlinks, but it was a list I had manually pulled together and tested. Adding MISP&apos;s list to my own triples the number of shortlink domains Unfurl supports (from 27 to 81). </p><p>One other shortlink-related improvement was parsing LinkedIn &quot;slinks&quot;, as Brian Krebs calls them:</p><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://krebsonsecurity.com/2022/02/how-phishers-are-slinking-their-links-into-linkedin/?ref=dfir.blog"><div class="kg-bookmark-content"><div class="kg-bookmark-title">How Phishers Are Slinking Their Links Into LinkedIn</div><div class="kg-bookmark-description">If you received a link to LinkedIn.com via email, SMS or instant message, would you click it? 
Spammers, phishers and other ne&#x2019;er-do-wells are hoping you will, because they&#x2019;ve long taken advantage of a marketing feature on the business networking site&#x2026;</div><div class="kg-bookmark-metadata"><span class="kg-bookmark-author">Krebs on Security</span><span class="kg-bookmark-publisher">Skip to content</span></div></div><div class="kg-bookmark-thumbnail"><img src="https://krebsonsecurity.com/wp-content/uploads/2022/02/redirect.png" alt="More Search URL Parsing, MISP Lists, &amp; More in Unfurl v2022.02"></div></a></figure><p>Unfurl already resolved LinkedIn shortlinks with the format <code>lnkd.in/xyz123</code>. This involves extracting the shortcode (<code>xyz123</code> in my fictitious example), creating the intermediary &quot;slink&quot; URL using that shortcode (<code>https://www.linkedin.com/slink?code=xyz123</code>), then finally determining the destination of that shortlink using the <code>Location</code> header. This Unfurl update adds the ability to expand &quot;slinks&quot; directly, in addition to the more typical <code>lnkd.in</code> shortlinks. 
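</p><p>Those steps can be sketched roughly as follows (a simplified illustration, not Unfurl&apos;s actual implementation):</p><pre><code class="language-python">def build_slink_url(shortcode):
    # Create the intermediary "slink" URL from the extracted shortcode
    return "https://www.linkedin.com/slink?code=" + shortcode

def expand_slink(shortcode):
    # Ask LinkedIn where the shortlink points, without following the
    # redirect to (or otherwise contacting) the destination itself
    import requests  # third-party HTTP client, assumed available
    response = requests.get(build_slink_url(shortcode), allow_redirects=False, timeout=10)
    return response.headers.get("Location")</code></pre><p>Contacting only the shortener, and never the destination, matches the external-lookup approach described in the note below.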
</p><figure class="kg-card kg-image-card kg-width-wide kg-card-hascaption"><img src="https://dfir.blog/content/images/2022/03/unfurl-slink-krebs.png" class="kg-image" alt="More Search URL Parsing, MISP Lists, &amp; More in Unfurl v2022.02" loading="lazy" width="1312" height="1174" srcset="https://dfir.blog/content/images/size/w600/2022/03/unfurl-slink-krebs.png 600w, https://dfir.blog/content/images/size/w1000/2022/03/unfurl-slink-krebs.png 1000w, https://dfir.blog/content/images/2022/03/unfurl-slink-krebs.png 1312w" sizes="(min-width: 1200px) 1200px"><figcaption><span style="white-space: pre-wrap;">A LinkedIn &quot;slink&quot; mentioned in Krebs&apos; article, parsed with Unfurl</span></figcaption></figure><blockquote><strong>A note on contacting external resources</strong>: For many different reasons, I wanted to ensure that Unfurl reached out to external domains as little as possible, but some external resources would be really useful in Unfurl (as in the case of expanding shortlinks). My &quot;middle ground&quot; was to allow Unfurl to contact an allowlist of link shortener services to get the &quot;expanded&quot; link, but <strong>not</strong> contact the destination. If this doesn&apos;t work for you and you&apos;d rather Unfurl not reach out to any external sites, there is a setting to disable all remote lookups. </blockquote><h2 id="recognize-and-parse-twitter-image-filenames">Recognize and Parse Twitter Image Filenames</h2><p>Unfurl has parsed the <a href="https://dfir.blog/unfurl/?url=https://twitter.com/_RyanBenson/status/1189581422685634560?s=20">Twitter Snowflakes in tweets</a> since its inception, but I only recently learned that the names Twitter gives to uploaded images also contain a Snowflake! It&apos;s mentioned by Dr. 
Neal Krawetz on his blog way back in 2014 (!):</p><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://www.hackerfactor.com/blog/index.php?%2Farchives%2F634-Name-Dropping.html=&amp;ref=dfir.blog"><div class="kg-bookmark-content"><div class="kg-bookmark-title">Name Dropping - The Hacker Factor Blog</div><div class="kg-bookmark-description"></div><div class="kg-bookmark-metadata"><span class="kg-bookmark-author">The Hacker Factor Blog</span><span class="kg-bookmark-publisher">Filename Ballistics</span></div></div><div class="kg-bookmark-thumbnail"><img src="https://www.hackerfactor.com/blog/templates/default/img/emoticons/smile.png" alt="More Search URL Parsing, MISP Lists, &amp; More in Unfurl v2022.02"></div></a></figure><p>It appears different from the Snowflakes used in tweets - it&apos;s base64-encoded rather than shown as a decimal (<code>EqmR8DPVEAAd5mv</code> vs <code>1344769819887865856</code>) and has three extra bytes at the end (I haven&apos;t been able to determine their purpose yet). But like tweets, the timestamp embedded in the Snowflake is consistent with when the object (tweet or image) was created - which in the case of images means the time it was uploaded to Twitter.</p><p>If we encounter one of these images elsewhere still with the name Twitter gave it, we have some hints about it: that it came from Twitter and when it was uploaded. The odds of an image having a name that can be properly decoded as a Twitter Snowflake, with a reasonable embedded timestamp, and <em>not </em>being from Twitter are vanishingly small (unless it was deliberately renamed by someone). </p><p>In this <a href="https://dfir.blog/unfurl/?url=https://dfir.blog/content/images/size/w1000/2022/03/EqmR8DPVEAAd5mv.jpeg">example below</a>, I saved an image from a tweet, then uploaded it to my site (without renaming it). 
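</p><p>The decoding described above can be sketched as follows (URL-safe base64 and big-endian byte order are my assumptions here, based on testing against known examples):</p><pre><code class="language-python">import base64
from datetime import datetime, timezone

TWITTER_EPOCH_MS = 1288834974657  # millisecond offset used by Twitter Snowflakes

def twitter_image_timestamp(name):
    # Re-pad the filename to a multiple of 4 and base64-decode it; the
    # first 8 bytes are the Snowflake, the trailing 3 bytes are the
    # extra data of (so far) unknown purpose
    raw = base64.urlsafe_b64decode(name + "=" * (-len(name) % 4))
    snowflake = int.from_bytes(raw[:8], "big")
    milliseconds = (snowflake >> 22) + TWITTER_EPOCH_MS
    return datetime.fromtimestamp(milliseconds / 1000, tz=timezone.utc)

print(twitter_image_timestamp("EqmR8DPVEAAd5mv"))</code></pre><p>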
Unfurl indicates that the image might have originally come from Twitter and shows the upload timestamp from the Snowflake. </p><figure class="kg-card kg-image-card kg-width-wide"><img src="https://dfir.blog/content/images/2022/03/unfurl-twitter-image-outside-twitter-1.png" class="kg-image" alt="More Search URL Parsing, MISP Lists, &amp; More in Unfurl v2022.02" loading="lazy" width="1695" height="1151" srcset="https://dfir.blog/content/images/size/w600/2022/03/unfurl-twitter-image-outside-twitter-1.png 600w, https://dfir.blog/content/images/size/w1000/2022/03/unfurl-twitter-image-outside-twitter-1.png 1000w, https://dfir.blog/content/images/size/w1600/2022/03/unfurl-twitter-image-outside-twitter-1.png 1600w, https://dfir.blog/content/images/2022/03/unfurl-twitter-image-outside-twitter-1.png 1695w" sizes="(min-width: 1200px) 1200px"></figure><h2 id="brave-search">Brave Search</h2><p>Lastly, this update adds the ability for Unfurl to <a href="https://dfir.blog/unfurl/?url=https://search.brave.com/search?q=unfurl&amp;source=web&amp;tf=pm">parse a Brave Search URL</a>. 
It&apos;s relatively basic, at least compared to the Google Search parser (which is massive), but I think it&apos;s a good start.</p><figure class="kg-card kg-image-card kg-width-wide kg-card-hascaption"><img src="https://dfir.blog/content/images/2022/02/unfurl-brave-search.png" class="kg-image" alt="More Search URL Parsing, MISP Lists, &amp; More in Unfurl v2022.02" loading="lazy" width="1703" height="808" srcset="https://dfir.blog/content/images/size/w600/2022/02/unfurl-brave-search.png 600w, https://dfir.blog/content/images/size/w1000/2022/02/unfurl-brave-search.png 1000w, https://dfir.blog/content/images/size/w1600/2022/02/unfurl-brave-search.png 1600w, https://dfir.blog/content/images/2022/02/unfurl-brave-search.png 1703w" sizes="(min-width: 1200px) 1200px"><figcaption><span style="white-space: pre-wrap;">Brave Search URL parsed with Unfurl</span></figcaption></figure><h2 id="get-it">Get it!</h2><p>Those are the major items in this Unfurl release. There are more changes that didn&apos;t make it into the blog post; check out the <a href="https://github.com/obsidianforensics/unfurl/releases/tag/v2022.02?ref=dfir.blog">release notes</a> for more. 
To get Unfurl with these latest updates, you can:</p><ul><li>use it online at <a href="https://dfir.blog/unfurl/">dfir.blog/unfurl</a>  or <a href="https://unfurl.link/?ref=dfir.blog">unfurl.link</a></li><li>if using pip, <code>pip install dfir-unfurl -U</code> will upgrade your local Unfurl to the latest</li><li>View the release on <a href="https://github.com/obsidianforensics/unfurl/releases/tag/v2022.02?ref=dfir.blog">GitHub</a></li></ul><p>All features work in both the web UI and command line versions (<strong>unfurl_app.py</strong> &amp; <strong>unfurl_cli.py</strong>).</p>]]></content:encoded></item><item><title><![CDATA[Hindsight v2021.12]]></title><description><![CDATA[Hindsight v2021.12 adds parsing of more preference items, site settings (including HSTS records), Session Storage, and more!]]></description><link>https://dfir.blog/hindsight-v2021-12/</link><guid isPermaLink="false">66579e6f04abfd293590d97b</guid><category><![CDATA[Hindsight]]></category><category><![CDATA[Open Source Tools]]></category><category><![CDATA[Chrome]]></category><category><![CDATA[Tools]]></category><category><![CDATA[Web Browsers]]></category><dc:creator><![CDATA[Ryan Benson]]></dc:creator><pubDate>Tue, 21 Dec 2021 14:14:00 GMT</pubDate><media:content url="https://dfir.blog/content/images/2021/12/hindsight-v2021.12.png" medium="image"/><content:encoded><![CDATA[<img src="https://dfir.blog/content/images/2021/12/hindsight-v2021.12.png" alt="Hindsight v2021.12"><p>This latest version of Hindsight adds parsing of more preference items, site settings (including HSTS records), Session Storage, and more! 
It also includes other small enhancements, bug fixes, and minor changes to support Chrome up to version 96.</p><h3 id="new-site-setting-record-type">New &quot;Site Setting&quot; Record Type</h3><p>Over time, Hindsight has gained the ability to parse more and more artifacts from Chrome, many of which are a bit different from &quot;traditional&quot; browser history items like URL visits, cookies, or cached items. Hindsight parses things like whether a site was muted, whether the user zoomed in, whether a site used HSTS, or even whether the page title changed in the background. </p><p>I had been adding these to Hindsight&apos;s timeline as &quot;Preference&quot; items (as the initial ones came from the <code>Preferences</code> file), but over time that label seemed less and less apt. I decided to add a new &quot;Site Setting&quot; record type, as most of these records pertain to a setting for the visited site. Like other record types, it can have variations (<code>zoom level</code>, <code>hsts</code>, <code>engagement</code>, &amp; more).</p><figure class="kg-card kg-image-card kg-width-wide kg-card-hascaption"><img src="https://dfir.blog/content/images/2021/12/hindsight-site-setting-records.png" class="kg-image" alt="Hindsight v2021.12" loading="lazy" width="1580" height="511" srcset="https://dfir.blog/content/images/size/w600/2021/12/hindsight-site-setting-records.png 600w, https://dfir.blog/content/images/size/w1000/2021/12/hindsight-site-setting-records.png 1000w, https://dfir.blog/content/images/2021/12/hindsight-site-setting-records.png 1580w" sizes="(min-width: 1200px) 1200px"><figcaption><span style="white-space: pre-wrap;">Examples of the new &quot;Site Setting&quot; records</span></figcaption></figure><p>I plan on adding more &quot;Site Setting&quot; records in the future - these might not be critical to every investigation, but I really like the level of detail they provide and you never know when they might come in handy. 
</p><h3 id="parsing-of-hsts-records">Parsing of HSTS Records</h3><p>HSTS is one of the new &quot;Site Setting&quot; records. We can use <a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Strict-Transport-Security?ref=dfir.blog">HTTP Strict-Transport-Security (HSTS)</a> settings to tell whether a browser has visited a particular site before, and to learn a little about the timing of that visit.</p><p>The <code>TransportSecurity</code> file holds HSTS settings, most of which look like this:</p><pre><code class="language-JSON">{
	&quot;expiry&quot;: 1671127807.687742,
	&quot;host&quot;: &quot;df0sSkr4gOg4VK8d/NNTAWFtAN/MjCgPCJ5ml+ucdZE=&quot;,
	&quot;mode&quot;: &quot;force-https&quot;,
	&quot;sts_include_subdomains&quot;: false,
	&quot;sts_observed&quot;: 1639591807.687746
}</code></pre><p>The <code>host</code> is a hashed value (according to <a href="https://source.chromium.org/chromium/chromium/src/+/main:net/http/transport_security_persister.h;l=110?ref=dfir.blog">Chromium source code</a>) &quot;so that the stored state does not trivially reveal a user&apos;s browsing history to an attacker reading the serialized state on disk.&quot; The code also shows how this hashed value is constructed. This doesn&apos;t let us reverse the hash (since that&apos;s not how hashes work), but it does let us generate hashes from known inputs and compare. Hindsight does just that, computing the hashed <code>host</code> value for every domain and subdomain seen in other browser artifacts, and comparing to <code>host</code> values in the <code>TransportSecurity</code> file. If it finds a match, Hindsight will show the domain; if not it will show the hashed version: </p><figure class="kg-card kg-image-card kg-width-wide kg-card-hascaption"><img src="https://dfir.blog/content/images/2021/12/hindsight-hsts-records.png" class="kg-image" alt="Hindsight v2021.12" loading="lazy" width="1589" height="453" srcset="https://dfir.blog/content/images/size/w600/2021/12/hindsight-hsts-records.png 600w, https://dfir.blog/content/images/size/w1000/2021/12/hindsight-hsts-records.png 1000w, https://dfir.blog/content/images/2021/12/hindsight-hsts-records.png 1589w" sizes="(min-width: 1200px) 1200px"><figcaption><span style="white-space: pre-wrap;">HSTS records in Hindsight XLSX Report</span></figcaption></figure><h3 id="parsing-additional-preference-items">Parsing Additional Preference Items</h3><p>Hindsight can also parse more from Chrome&apos;s Preferences file, including whether network prefetching is enabled, sync settings, zoom percentages (instead of raw levels), password manager usage, and the session event log. 
These all are interesting, but I especially like the session event log records: </p><figure class="kg-card kg-image-card kg-width-wide kg-card-hascaption"><img src="https://dfir.blog/content/images/2021/12/hindsight-session-event-log.png" class="kg-image" alt="Hindsight v2021.12" loading="lazy" width="1113" height="430" srcset="https://dfir.blog/content/images/size/w600/2021/12/hindsight-session-event-log.png 600w, https://dfir.blog/content/images/size/w1000/2021/12/hindsight-session-event-log.png 1000w, https://dfir.blog/content/images/2021/12/hindsight-session-event-log.png 1113w"><figcaption><span style="white-space: pre-wrap;">Session Event Log records in Hindsight XLSX Report</span></figcaption></figure><p>They give some high-level insights about usage; for example, from the above screenshot you can infer that:</p><ul><li>I have Chrome set to &quot;Continue where you left off&quot;, as seconds after each session start, a restore happens</li><li>None of these sessions ended in a crash (can sometimes happen if an exploit was attempted against the browser) - useful knowledge in some particular investigations</li><li>I tend to leave Chrome running quasi-permanently, not opening/closing a lot</li><li>I have a tab hoarding problem</li></ul><h2 id="get-hindsight">Get Hindsight</h2><p>You can get Hindsight, view the code, and see the full change log on <a href="https://github.com/obsidianforensics/hindsight?ref=dfir.blog" rel="noopener">GitHub</a>. 
Both the command line and web UI versions of this release are available as:</p><ul><li>compiled exes attached to the <a href="https://hindsig.ht/release?ref=dfir.blog">GitHub release</a> or in the dist/ folder</li><li>.py versions are available by <code>pip install pyhindsight</code> or downloading/cloning the <a href="https://hindsig.ht/github?ref=dfir.blog">GitHub repo</a>.</li></ul>]]></content:encoded></item><item><title><![CDATA[Cookies Database Moving in Chrome 96]]></title><description><![CDATA[To support stronger security for Chrome, some network-related files - including the Cookies database - are moving locations on disk. ]]></description><link>https://dfir.blog/cookies-database-moving-in-chrome-96/</link><guid isPermaLink="false">66579e6f04abfd293590d97a</guid><category><![CDATA[Chrome]]></category><category><![CDATA[Web Browsers]]></category><dc:creator><![CDATA[Ryan Benson]]></dc:creator><pubDate>Thu, 16 Dec 2021 15:28:34 GMT</pubDate><media:content url="https://dfir.blog/content/images/2021/12/chrome-network-folder.png" medium="image"/><content:encoded><![CDATA[<img src="https://dfir.blog/content/images/2021/12/chrome-network-folder.png" alt="Cookies Database Moving in Chrome 96"><p>The reason for this change is to enable sandboxing of Chrome&apos;s network service, so it can only access files on the file system that it needs. This would make it so any compromised network service can&apos;t access other files in the user&apos;s profile directory. 
Because of how ACLs work on Windows, accomplishing this meant moving the files needed by the network service from the user&apos;s profile directory into a <code>Network</code> subdirectory.</p><p>The network-related files that have been (or will be) moved are:</p><ul><li>Cookies (SQLite)</li><li>Network Persistent State (JSON) </li><li>Reporting and NEL (SQLite)</li><li>TransportSecurity (JSON)</li><li>Trust Tokens (SQLite)</li></ul><p>The &quot;Cache&quot; directory (HTTP cache) is also included in the sandbox, but it was already in its own directory so it didn&apos;t need to move. </p><p>You can use my Chrome Evolution visualization to compare files in Chrome <a href="https://dfir.blog/chrome-evolution/?ver=95">95</a> vs <a href="https://dfir.blog/chrome-evolution/?ver=96">96</a>.</p><p>This migration is starting with Windows, and is eventually planned to happen on macOS, Linux, Android, and ChromeOS. Other operating systems might be included later (but <em>not </em>iOS). </p><p>For more details on how the data is moving and why, please see <a href="https://docs.google.com/document/d/1Q7VwAsrWU45eC3Sl4bj9rj10H0pWdwBwwbKziomDCUc/edit?ref=dfir.blog#heading=h.7nki9mck5t64">Migration of Network Data</a> by Will Harris (<a href="https://twitter.com/parityzero?ref=dfir.blog">@parityzero</a>) - and thanks to Will for <a href="https://twitter.com/parityzero/status/1449033853457207302?ref=dfir.blog">pointing out this change</a>.</p><h3 id="forensic-tools-impact">Forensic Tools Impact</h3><p><strong>Plaso &amp; log2timeline - no impact.</strong> log2timeline parses every file independent of its path, so this change to Chrome has no impact.</p><p><strong>Hindsight - impacted.</strong> Hindsight currently uses file paths to find files to parse, so this change to Chrome caused problems (the Cookies database and TransportSecurity file would not be parsed). 
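</p><p>One simple way for a path-based tool to cope with the move is to check both locations (a hypothetical helper, not Hindsight&apos;s actual code):</p><pre><code class="language-python">from pathlib import Path

def find_cookies_db(profile_dir):
    # Check the Chrome 96+ location first, then fall back to the
    # pre-96 location in the profile root
    profile = Path(profile_dir)
    for candidate in (profile / "Network" / "Cookies", profile / "Cookies"):
        if candidate.is_file():
            return candidate
    return None</code></pre><p>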
<strong>A new Hindsight release (2021.12) is available now that fixes this.</strong></p><h3 id="references">References</h3><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://docs.google.com/document/d/1Q7VwAsrWU45eC3Sl4bj9rj10H0pWdwBwwbKziomDCUc/edit?ref=dfir.blog#"><div class="kg-bookmark-content"><div class="kg-bookmark-title">Migration of Network Data</div><div class="kg-bookmark-description">Migration of Network Data This Document is Public Authors: wfh@chromium.orgSep 2021 One-page overview As part of the larger Network Sandbox work, the files that the network service needs to access will be moved into a folder that the sandbox can be granted access to. This migration does not a...</div><div class="kg-bookmark-metadata"><img class="kg-bookmark-icon" src="https://ssl.gstatic.com/docs/documents/images/kix-favicon7.ico" alt="Cookies Database Moving in Chrome 96"><span class="kg-bookmark-author">Google Docs</span></div></div><div class="kg-bookmark-thumbnail"><img src="https://lh6.googleusercontent.com/FDO6vmlxXF412kTdtYof79pabhnCbS1KLZKDsB6ue66IDR0Evn1oghzGABME9wjkxRusZZovHBeDgQ=w1200-h630-p" alt="Cookies Database Moving in Chrome 96"></div></a></figure><figure class="kg-card kg-bookmark-card kg-card-hascaption"><a class="kg-bookmark-container" href="https://bugs.chromium.org/p/chromium/issues/detail?id=1173622&amp;ref=dfir.blog"><div class="kg-bookmark-content"><div class="kg-bookmark-title">1173622 - chromium - An open-source project to help move the web forward. 
- Monorail</div><div class="kg-bookmark-description"></div><div class="kg-bookmark-metadata"><img class="kg-bookmark-icon" src="https://bugs.chromium.org/static/images/monorail.ico" alt="Cookies Database Moving in Chrome 96"></div></div></a><figcaption>Issue 1173622: store files needed by network service in separate directory</figcaption></figure>]]></content:encoded></item><item><title><![CDATA[Metasploit URLs, Hash Lookups, & More in Unfurl v2021.06.15]]></title><description><![CDATA[A new Unfurl release is here! v2021.06.15 adds decoding of some Metasploit URLs, hash identification and API lookups, & more!]]></description><link>https://dfir.blog/metasploit-urls-and-hash-lookups-in-unfurl/</link><guid isPermaLink="false">66579e6f04abfd293590d979</guid><category><![CDATA[Unfurl]]></category><category><![CDATA[Open Source Tools]]></category><dc:creator><![CDATA[Ryan Benson]]></dc:creator><pubDate>Tue, 15 Jun 2021 13:19:00 GMT</pubDate><media:content url="https://dfir.blog/content/images/2021/06/unfurl_metasploit-payload-uuid-url-1.png" medium="image"/><content:encoded><![CDATA[<img src="https://dfir.blog/content/images/2021/06/unfurl_metasploit-payload-uuid-url-1.png" alt="Metasploit URLs, Hash Lookups, &amp; More in Unfurl v2021.06.15"><p>A new Unfurl release is here! v2021.06.15 adds decoding of some Metasploit URLs, hash identification and API lookups, more control over remote lookups, better UUID parsing, and a few more shortlink expansions. It also has a number of smaller fixes, code cleanups, and tests. 
</p><p><a href="#get-it">Get it now</a>, or read on for more details about the new features!</p><h2 id="metasploit-urls">Metasploit URLs</h2><p><a href="https://twitter.com/DidierStevens?ref=dfir.blog">Didier Stevens</a> has written about (<a href="https://github.com/DidierStevens/Beta/blob/master/metatool.py?ref=dfir.blog">and made a tool for!</a>) decoding different Metasploit artifacts: <a href="https://isc.sans.edu/forums/diary/Metasploits+Payload+UUID/23555/?ref=dfir.blog">payload UUIDs</a> and <a href="https://isc.sans.edu/forums/diary/Finding+Metasploit+Cobalt+Strike+URLs/27204/?ref=dfir.blog">shellcode URLs</a>. Thanks to his excellent work (which he published as open source), I was able to see how those Metasploit artifacts are constructed and build decoders into Unfurl:</p><figure class="kg-card kg-gallery-card kg-width-wide kg-card-hascaption"><div class="kg-gallery-container"><div class="kg-gallery-row"><div class="kg-gallery-image"><img src="https://dfir.blog/content/images/2021/06/unfurl_metasploit-checksum-url.png" width="1229" height="850" loading="lazy" alt="Metasploit URLs, Hash Lookups, &amp; More in Unfurl v2021.06.15" srcset="https://dfir.blog/content/images/size/w600/2021/06/unfurl_metasploit-checksum-url.png 600w, https://dfir.blog/content/images/size/w1000/2021/06/unfurl_metasploit-checksum-url.png 1000w, https://dfir.blog/content/images/2021/06/unfurl_metasploit-checksum-url.png 1229w" sizes="(min-width: 720px) 720px"></div><div class="kg-gallery-image"><img src="https://dfir.blog/content/images/2021/06/unfurl_metasploit-payload-uuid-url.png" width="1696" height="843" loading="lazy" alt="Metasploit URLs, Hash Lookups, &amp; More in Unfurl v2021.06.15" srcset="https://dfir.blog/content/images/size/w600/2021/06/unfurl_metasploit-payload-uuid-url.png 600w, https://dfir.blog/content/images/size/w1000/2021/06/unfurl_metasploit-payload-uuid-url.png 1000w, 
https://dfir.blog/content/images/size/w1600/2021/06/unfurl_metasploit-payload-uuid-url.png 1600w, https://dfir.blog/content/images/2021/06/unfurl_metasploit-payload-uuid-url.png 1696w" sizes="(min-width: 720px) 720px"></div></div></div><figcaption>Unfurl decoding two different types of URLs generated by Metasploit</figcaption></figure><p>You can read his blog posts for the details on the artifacts (<a href="https://isc.sans.edu/forums/diary/Metasploits+Payload+UUID/23555/?ref=dfir.blog">payload UUIDs</a> and <a href="https://isc.sans.edu/forums/diary/Finding+Metasploit+Cobalt+Strike+URLs/27204/?ref=dfir.blog">shellcode URLs</a>), but the super abbreviated version is that we can often extract at least the platform that was targeted (Windows in both examples above) - and sometimes more! It&apos;s another great example of extracting useful information from the way identifiers are generated.</p><p>Live Unfurl Examples:</p><ul><li><a href="https://dfir.blog/unfurl/?url=https://example.com/4PGoVGYmx8l6F3sVI4Rc8g1wms758YNVXPczHlPobpJENARSuSHb57lFKNndzVSpivRDSi5VH2U-w-pEq_CroLcB--cNbYRroyFuaAgCyMCJDpWbws/">Metasploit payload UUID URL</a></li><li><a href="https://dfir.blog/unfurl/?url=https://example.com/WsJH">Metasploit shellcode URL</a></li></ul><h2 id="hash-identification-and-remote-lookup">Hash Identification and Remote Lookup</h2><p>This release also adds the ability to identify potential hashes: MD5, SHA-1, SHA-256, &amp; SHA-512. The detection is based on characters and length, so it&apos;s not high fidelity (for example, MD5 hashes are the same length as UUIDs, so some nodes will be identified as potentially both). </p><p>To aid with determining what&apos;s an actual hash and what&apos;s not, Unfurl can query remote services to see if they&apos;ve seen that value before. 
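The charset-and-length detection described above can be sketched in a few lines of Python (an illustrative sketch, not Unfurl's actual code; the function and constant names here are made up for the example):

```python
import re

# Hex-digest lengths for the hash types Unfurl flags as candidates
HASH_LENGTHS = {32: 'MD5', 40: 'SHA-1', 64: 'SHA-256', 128: 'SHA-512'}

def possible_hash_types(value: str) -> list:
    """Return the hash types a string could be, judged only by charset and length."""
    if not re.fullmatch(r'[0-9a-fA-F]+', value or ''):
        return []
    label = HASH_LENGTHS.get(len(value))
    return [label] if label else []

print(possible_hash_types('5f4dcc3b5aa765d61d8327deb882cf99'))  # ['MD5']
```

As noted above, this is low fidelity: a 32-character hex string could just as easily be a UUID with the dashes stripped, which is why such nodes get marked as potentially both.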
At present, two services are supported: <a href="https://www.virustotal.com/?ref=dfir.blog">VirusTotal </a>and <a href="https://www.nitrxgen.net/md5db?ref=dfir.blog">Nitrxgen&apos;s MD5 lookup database</a>. </p><p>The VirusTotal integration is fairly basic; if a (free) VirusTotal API key is set in the Unfurl config file, Unfurl will query the VirusTotal API with potential file hash values and add a child node with file type &amp; name (if found). </p><p>Nitrxgen&apos;s MD5 lookup database is a bit different; it&apos;s a dataset of plaintext &#x2192; MD5 hashes with over a trillion values. Unfurl can query it with potential MD5 values to see if it corresponds with a known plaintext string. This is different than the VirusTotal lookup (which queries hashes of file content), as the Nitrxgen lookup is for hashed text strings. However, sometimes both can be true, as in the image below:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://dfir.blog/content/images/2021/06/unfurl_md5-hash-with-lookups-3.png" class="kg-image" alt="Metasploit URLs, Hash Lookups, &amp; More in Unfurl v2021.06.15" loading="lazy" width="1259" height="1042" srcset="https://dfir.blog/content/images/size/w600/2021/06/unfurl_md5-hash-with-lookups-3.png 600w, https://dfir.blog/content/images/size/w1000/2021/06/unfurl_md5-hash-with-lookups-3.png 1000w, https://dfir.blog/content/images/2021/06/unfurl_md5-hash-with-lookups-3.png 1259w" sizes="(min-width: 720px) 720px"><figcaption>Unfurl identifying an MD5 hash value and looking it up on VirusTotal and Nitrxgen</figcaption></figure><p>These remote lookups can add value to Unfurl, but they also come with risk (as Unfurl is sending out potentially-sensitive hashes to 3rd parties). To give the user control over this, Unfurl has a new <code>remote_lookups</code> setting. Users can change it (from the default, <code>false</code>) in the <code>unfurl.ini</code> file. 
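Reading such a toggle can be sketched with Python's stdlib <code>configparser</code> (the <code>[unfurl]</code> section name below is an assumption for illustration; check your own <code>unfurl.ini</code> for the real layout):

```python
import configparser

def remote_lookups_enabled(ini_text: str) -> bool:
    """Read a remote_lookups-style boolean, defaulting to off when unset."""
    cfg = configparser.ConfigParser()
    cfg.read_string(ini_text)
    # 'unfurl' as the section name is a guess for this example
    return cfg.getboolean('unfurl', 'remote_lookups', fallback=False)

print(remote_lookups_enabled('[unfurl]\nremote_lookups = true'))  # True
```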
There&apos;s also a command line option to allow lookups (<code>-l</code> or <code>--lookups</code>) from <code>unfurl_cli.py</code>. The CLI tool will fall back to the value specified in <code>unfurl.ini</code> if no command line option is set. Users <strong>need to set this option to enable any remote lookups</strong> (it&apos;s disabled by default). Shortlink resolution and MAC address vendor lookups are now also controlled by this option, as they are remote lookups as well. </p><p>Live Unfurl Examples:</p><ul><li><a href="https://dfir.blog/unfurl/?url=https://dfir.blog/?test=5f4dcc3b5aa765d61d8327deb882cf99">MD5 hash detection and lookup in both VirusTotal and Nitrxgen</a></li><li><a href="https://dfir.blog/unfurl/?url=b69049b7576687c0efed9b3cb9fa8f3beb218e31c30d200c1a67ad46bd06fcf0">SHA256 lookup on VirusTotal</a></li></ul><h2 id="uuidv1-random-node-id-detection">UUIDv1 Random Node ID Detection</h2><p>Unfurl has been able to detect and expand UUIDs since its beginning. Version 1 UUIDs have been particularly interesting, with their embedded timestamp and MAC address. This release adds the ability to determine if the Node ID contained in the UUIDv1 is an actual MAC address or a random number. 
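The check rests on a detail of RFC 4122: a UUIDv1 generator that uses a random Node ID instead of a real MAC must set the multicast bit (the least-significant bit of the node's first octet), a bit that is never set in a real unicast MAC address. A minimal sketch with Python's stdlib:

```python
import uuid

def uuidv1_node_is_random(u: uuid.UUID) -> bool:
    """RFC 4122 section 4.5: a random Node ID has the multicast bit set."""
    assert u.version == 1
    first_octet = u.node >> 40          # node is 48 bits; take the first octet
    return bool(first_octet & 0x01)     # multicast/broadcast bit

# The live example from this post: its node ff:60:1e:c8:55:b4 starts with 0xff
u = uuid.UUID('94c73940-6bd1-11e6-899a-ff601ec855b4')
print(uuidv1_node_is_random(u))  # True
```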
</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://dfir.blog/content/images/2021/06/unfurl_uuid-v1-with-random-node-id.png" class="kg-image" alt="Metasploit URLs, Hash Lookups, &amp; More in Unfurl v2021.06.15" loading="lazy" width="1147" height="955" srcset="https://dfir.blog/content/images/size/w600/2021/06/unfurl_uuid-v1-with-random-node-id.png 600w, https://dfir.blog/content/images/size/w1000/2021/06/unfurl_uuid-v1-with-random-node-id.png 1000w, https://dfir.blog/content/images/2021/06/unfurl_uuid-v1-with-random-node-id.png 1147w" sizes="(min-width: 720px) 720px"><figcaption>Unfurl parsing a UUIDv1 with a random Node ID</figcaption></figure><p>Live Unfurl Example:</p><ul><li><a href="https://dfir.blog/unfurl/?url=94c73940-6bd1-11e6-899a-ff601ec855b4">UUIDv1 with random Node ID</a></li></ul><h2 id="get-it">Get it!</h2><p>To get Unfurl with these latest updates, you can:</p><ul><li>use <a href="https://dfir.blog/unfurl/">dfir.blog/unfurl</a> online</li><li>if using pip, <code>pip install dfir-unfurl -U</code> will upgrade your local Unfurl to the latest</li><li>View the release on <a href="https://github.com/obsidianforensics/unfurl/releases/tag/v2021.06.15?ref=dfir.blog">GitHub</a></li></ul><p>All features work in both the web UI and command line versions (<strong>unfurl_app.py</strong> &amp; <strong>unfurl_cli.py</strong>). </p><p><a href="https://twitter.com/_RyanBenson?ref=dfir.blog">Let me know</a> what you think! </p>]]></content:encoded></item><item><title><![CDATA[Unfurl Plugin and "Site Characteristics" Artifact Added in Hindsight]]></title><description><![CDATA[<p>I&apos;m happy to announce there is a new Hindsight release available! 
<strong>2021.04.26 </strong>has many small improvements and fixes, including adding support for Chrome 88 - 90, but the main new features are an <strong>Unfurl plugin</strong> and parsing of the <strong>Site Characteristics Database</strong>!</p><h2 id="unfurl-plugin">Unfurl Plugin</h2><p>I&apos;m</p>]]></description><link>https://dfir.blog/unfurl-plugin-and-site-characteristics-database-added-to-hindsight/</link><guid isPermaLink="false">66579e6f04abfd293590d978</guid><category><![CDATA[Hindsight]]></category><category><![CDATA[Digital Forensics]]></category><category><![CDATA[Chrome]]></category><category><![CDATA[Unfurl]]></category><category><![CDATA[Python]]></category><category><![CDATA[Open Source Tools]]></category><dc:creator><![CDATA[Ryan Benson]]></dc:creator><pubDate>Wed, 28 Apr 2021 17:34:59 GMT</pubDate><content:encoded><![CDATA[<p>I&apos;m happy to announce there is a new Hindsight release available! <strong>2021.04.26 </strong>has many small improvements and fixes, including adding support for Chrome 88 - 90, but the main new features are an <strong>Unfurl plugin</strong> and parsing of the <strong>Site Characteristics Database</strong>!</p><h2 id="unfurl-plugin">Unfurl Plugin</h2><p>I&apos;m excited that this new Hindsight version has an integration with Unfurl! <a href="https://dfir.blog/unfurl">Unfurl </a>takes a URL and expands (&quot;unfurls&quot;) it into a directed graph, and is useful for exploring data encoded in URLs or other text values. Unfurl typically displays all this in an interactive graph visualization, but that doesn&apos;t fit well into Hindsight&apos;s output. Instead, this new Unfurl plugin stores the &quot;text tree&quot; version of the output (as seen in the Unfurl CLI tool). At this time, the only thing the Unfurl plugin runs on is Local Storage records. 
I chose these for a few reasons:</p><h3 id="timestamp-detection-parsing">Timestamp Detection + Parsing</h3><p>Local Storage records lack explicit timestamps (they&apos;re just a collection of key/value pairs associated with an origin). Unfurl can often translate a value into a human-readable timestamp, potentially adding some hints as to timing on these records. Hindsight had a &quot;Generic Timestamp Converter&quot; plugin that did this previously, but it was rather limited; Unfurl does a much better job and covers a wider variety of timestamps. Example:</p><figure class="kg-card kg-code-card"><pre><code>origin: https://www.reddit.com
key: push-token-last-refresh-ms
value: 1615493428164</code></pre><figcaption>Local Storage key/value pair for reddit.com</figcaption></figure><figure class="kg-card kg-code-card"><pre><code>2021-03-11 20:10:28.164 (Converted as Epoch milliseconds) [Unfurl]</code></pre><figcaption>Unfurl parsing a timestamp from a value in Local Storage</figcaption></figure><p>When Unfurl&apos;s output is rather simple (like just a timestamp conversion), the plugin reformats the &quot;tree&quot; into a single line summary that works better in Hindsight.</p><h3 id="decoding-values">Decoding Values</h3><p>Another reason is that Local Storage values are often encoded. Unfurl&apos;s chaining of multiple simple transforms can sometimes bring clarity to an obscured value. For example:</p><figure class="kg-card kg-code-card"><pre><code>origin: http://www.metacritic.com
key: __ansync3rdp_criteo
value: eyJiSWQiOiJjcml0ZW8iLCJ1Q29kZSI6bnVsbCwidHMiOjE1MzExODAxNDYwOTh9</code></pre><figcaption>Local Storage key/value pair for metacritic.com</figcaption></figure><p>The <code>value</code> from above is parsed by Unfurl (using base64, JSON, and timestamp conversions), and the &quot;text tree&quot; output is saved in the &quot;Interpretation&quot; column (in the same way other Hindsight plugins save their results):</p><figure class="kg-card kg-code-card"><pre><code>[1] eyJiSWQiOiJjcml0ZW8iLCJ1Q29kZSI6bnVsbCwidHMiOjE1MzExODAxNDYwOTh9
 &#x2514;&#x2500;(b64)&#x2500;[2] {&quot;bId&quot;:&quot;criteo&quot;,&quot;uCode&quot;:null,&quot;ts&quot;:1531180146098}
    &#x251C;&#x2500;(JSON)&#x2500;[3] bId: criteo
    &#x251C;&#x2500;(JSON)&#x2500;[4] uCode: None
    &#x2514;&#x2500;(JSON)&#x2500;[5] ts: 1531180146098
       &#x2514;&#x2500;(&#x1F553;)&#x2500;[6] 2018-07-09 23:49:06.098 
</code></pre><figcaption>Unfurl parsing an encoded Local Storage value</figcaption></figure><p>These are just a few examples of how Unfurl can be helpful on Local Storage values. All the parsers from the web version Unfurl are included in the Hindsight plugin, so things like UUIDs, zlib-compressed strings, Twitter Snowflakes, and a whole lot more can be parsed. If this plugin works out well, I&apos;ll evaluate if there are other places in Hindsight that an Unfurl integration would make sense. &#xA0;</p><h2 id="site-characteristics-database">Site Characteristics Database</h2><p>The other new feature is added parsing of the &quot;Site Characteristics Database&quot;. It is a part of Chrome that tracks a few different behaviors on sites, such as if the site changes the favicon or page title in the background. These behaviors aren&apos;t that interesting in and of themselves, but they can provide interesting context. </p><p>Behind the scenes, the &quot;Site Characteristics Database&quot; is stored in a LevelDB as a collection of key/value pairs. The key for each record is the MD5 hash of the origin and the record&apos;s value is a protobuf. Luckily, since Chromium is open source, we can find the <code>.proto</code> file that corresponds to that protobuf, so decoding it is easier:</p><figure class="kg-card kg-code-card"><pre><code>// Copyright 2019 The Chromium Authors. All rights reserved.
// Use of this source code is governed by a BSD-style license that can be
// found in the LICENSE file.

syntax = &quot;proto2&quot;;

option optimize_for = LITE_RUNTIME;

// Contains the information that we want to track about a given site feature.
// Next Id: 3
message SiteDataFeatureProto {
  // The cumulative observation time for this feature in seconds, set to 0 once
  // this feature has been observed.
  optional int64 observation_duration = 1;
  // The time at which this feature has been used (set to 0 if it hasn&apos;t been
  // used), in seconds since epoch.
  optional int64 use_timestamp = 2;
}

// Contains decaying average performance measurement estimates.
// Next Id: 4
message SiteDataPerformanceMeasurement {
  // A decaying average of the CPU usage measurements. Units: microseconds.
  optional float avg_cpu_usage_us = 1;
  // A decaying average of the process footprint measurements. Units: kilobytes.
  optional float avg_footprint_kb = 2;
  // A decaying average of the duration from navigation commit to &quot;loaded&quot;.
  // Units: microseconds.
  optional float avg_load_duration_us = 3;
};

// Defines the data that we want to track about a given site.
// Next Id: 7
message SiteDataProto {
  // The last time this site has been in the loaded state, in seconds since
  // epoch.
  optional uint32 last_loaded = 1;

  // List of features that we&apos;re tracking.
  optional SiteDataFeatureProto updates_favicon_in_background = 2;
  optional SiteDataFeatureProto updates_title_in_background = 3;
  optional SiteDataFeatureProto uses_audio_in_background = 4;
  optional SiteDataFeatureProto deprecated_uses_notifications_in_background = 5;

  // Load time performance measurement estimates. This maintains a decaying
  // average of the resource usage of a page until shortly after it becomes
  // idle.
  optional SiteDataPerformanceMeasurement load_time_estimates = 6;
}</code></pre><figcaption><a href="https://source.chromium.org/chromium/chromium/src/+/master:components/performance_manager/persistence/site_data/site_data.proto?ref=dfir.blog">https://source.chromium.org/chromium/chromium/src/+/master:components/performance_manager/persistence/site_data/site_data.proto</a></figcaption></figure><p>To process these records, Hindsight first calculates the MD5 hashes of every origin seen in other artifacts it has already parsed, then compares each Site Characteristic key to them. If a match is found, Hindsight uses that origin in the &quot;URL&quot; field for the record; if not, Hindsight shows something like &quot;MD5 of origin: 99cd2175108d157588c04758296d1cfc&quot;. For the &quot;Value&quot; field, Hindsight parses the <code>site_data</code> protobuf and stores the result (it looks similar to JSON). To order these records by time, Hindsight uses the <code>last_loaded</code> value from the protobuf. </p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://dfir.blog/content/images/2021/04/image-1.png" class="kg-image" alt loading="lazy" width="1392" height="362" srcset="https://dfir.blog/content/images/size/w600/2021/04/image-1.png 600w, https://dfir.blog/content/images/size/w1000/2021/04/image-1.png 1000w, https://dfir.blog/content/images/2021/04/image-1.png 1392w" sizes="(min-width: 720px) 720px"><figcaption>Example &quot;Site Characteristic Database&quot; record in Hindsight XLSX output</figcaption></figure><h3 id="deleted-records">Deleted Records</h3><p>Since the datastore is LevelDB, we can recover deleted data from it! For deleted records, we can only get the key (the origin MD5), not the value protobuf, so we lose some information, including any explicit timestamps. However, this recovered data can still be useful. </p><p>One potential use case for this is showing that a user visited a particular site. 
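For reference, the hash-and-match approach described above is easy to reproduce (a sketch; note that exactly which origin serialization Chrome hashes, bare host versus full scheme and host, is worth verifying against your own data):

```python
import hashlib

def origin_md5(origin: str) -> str:
    # Site Characteristics record keys are the MD5 hex digest of the origin
    return hashlib.md5(origin.encode('utf-8')).hexdigest()

def match_keys(known_origins, record_keys):
    """Map record keys back to origins seen in other parsed artifacts."""
    lookup = {origin_md5(o): o for o in known_origins}
    return {key: lookup.get(key) for key in record_keys}
```

A key with no match (a None value here) is exactly the "MD5 of origin: ..." case described above, and can be compared against hashes of any sites of interest to a case.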
Looking through my own browser history, I have over 1200 records where Hindsight couldn&apos;t find the Site Characteristic origin by comparing its key to the rest of my browsing history. This means that these origins don&apos;t appear anywhere else in my Chrome history, yet there is still some (small) indication I visited them in these Site Characteristic records. If you have a site of particular importance to a case, <a href="https://gchq.github.io/CyberChef/?ref=dfir.blog#recipe=MD5()&amp;input=Z2l0aHViLmNvbQ">you could calculate the MD5 of the origin</a> and then search these records for it. Since the timestamp information in deleted records is missing, Hindsight places these records at the beginning of the timeline (at 1970-01-01), but uses a filter to hide them in the Excel output by default to avoid cluttering it.</p><h3 id="future-research">Future Research</h3><p>Things to explore around &quot;Site Characteristic Database&quot; records in the future: </p><ul><li>What effect does clearing different types of browser data have on Site Characteristics Database records? If they persist despite history being cleared, they could be even more useful in showing a particular site was visited.</li><li>The various <code>observation_duration</code> timestamps: they are relative timestamps (count of seconds), but could potentially still be useful.</li><li>More precise meaning of the <code>last_loaded</code> timestamp: In some quick testing, it looks to be updated when the page was closed: page timestamp + page <code>visit_duration</code> ~= <code>last_loaded</code> timestamp. This is interesting, as not all pages have a <code>visit_duration</code> value set, and it could potentially show interesting things about user behavior.</li></ul><h2 id="get-hindsight">Get Hindsight</h2><p>You can get Hindsight, view the code, and see the full change log on <a href="https://github.com/obsidianforensics/hindsight?ref=dfir.blog" rel="noopener">GitHub</a>. 
Both the command line and web UI versions of this release are available as:</p><ul><li>compiled exes attached to the <a href="https://github.com/obsidianforensics/hindsight/releases/latest?ref=dfir.blog" rel="noopener">GitHub release</a> or in the dist/ folder</li><li>.py versions are available by <code>pip install pyhindsight</code> or downloading/cloning the GitHub repo.</li></ul>]]></content:encoded></item><item><title><![CDATA[Keystroke Flow from Chrome Omnibox]]></title><description><![CDATA[I take saved keystrokes from Chrome's Omnibox and graph them in a Sankey flow diagram.]]></description><link>https://dfir.blog/keystroke-flow-from-chrome-omnibox/</link><guid isPermaLink="false">66579e6f04abfd293590d977</guid><category><![CDATA[Visualizations]]></category><category><![CDATA[Chrome]]></category><category><![CDATA[Web Browsers]]></category><category><![CDATA[Open Source Tools]]></category><category><![CDATA[Digital Forensics]]></category><dc:creator><![CDATA[Ryan Benson]]></dc:creator><pubDate>Thu, 18 Feb 2021 13:58:00 GMT</pubDate><media:content url="https://dfir.blog/content/images/2021/02/keystroke-flow-2.gif" medium="image"/><content:encoded><![CDATA[<img src="https://dfir.blog/content/images/2021/02/keystroke-flow-2.gif" alt="Keystroke Flow from Chrome Omnibox"><p>The &quot;Network Action Predictor&quot; is an SQLite database that&apos;s long been part of Chrome (<a href="https://dfir.blog/chrome-evolution/?ver=17">since Chrome 17</a>) but hasn&apos;t gotten much attention. The (simplified) summary of its function is to help Chrome seem faster to the user by predicting the resources Chrome will need and preloading them. <a href="https://twitter.com/KevinPagano3?ref=dfir.blog">Kevin Pagano</a> wrote a blog post that does a nice job introducing the artifact and covering the basic info about it. 
I won&apos;t cover the same stuff here, so check out his <a href="https://www.stark4n6.com/2021/02/chrome-network-action-predictor.html?ref=dfir.blog">post for an introduction to Chrome&apos;s Network Action Predictor</a>. His post gave me the little kick to dust off and polish (a little) a visualization I had been playing with for this artifact a while ago. </p><p>I&apos;ve been interested in visualizations and applying them to digital forensics for a while now (<a href="https://dfir.blog/tag/visualizations/">some examples on the blog</a>). When I was exploring the Network Action Predictor data the type of chart that came to mind was a <a href="https://en.wikipedia.org/wiki/Sankey_diagram?ref=dfir.blog">Sankey diagram</a>. A Sankey is a type of flow diagram. I think the best way to explain how it works is show an example. I came across this one a few years ago and it has stuck with me as an effective use of the visualization technique:</p><figure class="kg-card kg-embed-card">
    <blockquote class="reddit-card">
      <a href="https://www.reddit.com/r/dataisbeautiful/comments/6a4pb8/how_52_ninthgraders_spell_camouflage_sankey/?ref_source=embed&amp;ref=share">How 52 ninth-graders spell &apos;camouflage&apos;, Sankey diagram [OC]</a> from
      <a href="https://www.reddit.com/r/dataisbeautiful/?ref=dfir.blog">dataisbeautiful</a>
    </blockquote>
    <script async src="https://embed.redditmedia.com/widgets/platform.js" charset="UTF-8"></script>
</figure><p>Each &quot;node&quot; (the colored bars) represents the number of items in that state and the &quot;bands&quot; (or &quot;links&quot;) connect one node to the next. Both the nodes&apos; and bands&apos; sizes are drawn in proportion to their value. It&apos;s easy to see what spelling &quot;paths&quot; were the most common, see where they diverged, and how common each end state was. It&apos;s a ton of interesting information packed in a small area! I find myself tracing different paths, making comparisons, and just generally exploring it: all hallmarks of an effective visualization. </p><p>Below is the &quot;Network Action Predictor&quot; data as Sankey (after a little massaging; read on for the details). There isn&apos;t just one starting node (<em>C </em>in the spelling example above) as there were many different starting letters. </p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://dfir.blog/content/images/2021/02/keystroke_flow_network_action_predictor.png" class="kg-image" alt="Keystroke Flow from Chrome Omnibox" loading="lazy"><figcaption>User keystroke &quot;flow&quot; as saved in Network Action Predictor DB</figcaption></figure><p>In the <em>Keystroke Flow</em> Sankey chart you can see a few different things (beyond that I visit Twitter way too much). When I visit <a href="https://dfir.blog/unfurl/">Unfurl</a>, I most often type <em>un</em> and then select the suggestion. I do the same with Twitter; <em>tw</em> then the suggestion. It&apos;s interesting that after seeing this in data, I came to realize I follow this pattern quite often when launching things: whether in the Windows Start Menu, Mac&apos;s Spotlight Search, or the Chrome Omnibox, I hit a shortcut (Windows key, Command+Space, or Ctrl+T, respectively) then the first couple letters of what I&apos;m looking for. </p><p>The chart gets a little more interesting further down with the Github and Hindsight entries. 
I access two different Github repos and two Hindsight-related sites often; these all have some common starting places (<em>g</em> or <em>h</em>) and then diverge. The edges overlap and it&apos;s a bit harder to see (in the screenshot image at least; in the actual graph you can hover, highlight, and move nodes).</p><p>I think there&apos;s a couple things that could be of value in this artifact (or visualization). I find artifacts that show what a user actually typed have value, particularly with regard to user intention. After seeing the chart, it would be hard for me to argue that I only went to Twitter or Unfurl by mistake. Conversely, if you did find a visit to a site of interest in the Network Action Predictor data, found it was only visited once, and could see what the user typed to get there, that might help inform your opinion (for or against) as to if the visit was accidental or not. </p><p>I hadn&apos;t published this before as I couldn&apos;t see a good way to integrate it with existing timeline-centric tools (Hindsight, Plaso, or Timesketch) as there isn&apos;t any timestamp information in it. I&apos;ve put it up in my <a href="https://github.com/obsidianforensics/scripts?ref=dfir.blog">scripts Github repository</a>, kind of a catch-all for one-off scripts. I still consider the visualization to be in the proof-of-concept/prototype phase, but I thought someone might find it interesting or useful. </p><h2 id="how-to-build-the-sankey-diagram">How to Build the Sankey Diagram</h2><p>The data stored in the &quot;Network Action Predictor&quot; isn&apos;t quite in the format needed for a Sankey. The <code>network_action_predictor</code> table has <code>user_text</code> and <code>url</code> columns (among others), but that doesn&apos;t give us the &quot;in-between&quot; states (C &#x2192; Cam &#x2192; Camoflau &#x2192; Camoflauge in the spelling example) and a Sankey without those is much less helpful. 
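To make those &quot;in-between&quot; states concrete, here is a toy sketch (not the script&apos;s actual code) of expanding one row into a chain of one-letter prefix links:

```python
def prefix_chain(user_text: str, url: str):
    """Expand ('you', 'youtube.com') into y -> yo -> you -> youtube.com links."""
    links = [(user_text[:i], user_text[:i + 1]) for i in range(1, len(user_text))]
    links.append((user_text, url))  # final typed text points at the chosen URL
    return links

print(prefix_chain('cam', 'camouflage'))
# [('c', 'ca'), ('ca', 'cam'), ('cam', 'camouflage')]
```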
There are multiple ways to construct the intermediate states (using SQLite&apos;s <code>rowid</code> is one option), but the way I chose to approach it in my script is laid out below. </p><h3 id="filter">Filter</h3><p>First, I wanted to filter out rows that aren&apos;t helpful for the visualization. I removed rows with <code>number_of_hits</code> == 0 (the 0-hit rows are quite numerous and are suggestions that were not correct) and rows where <code>user_text</code> == <code>url</code> (there weren&apos;t any intermediate steps; these are more rare). There are also often a lot of URLs that have only been visited a few times. These can make the graphic &quot;noisy&quot;, so I added the ability to filter out any entries that are below a user-defined &quot;threshold&quot; value (2 by default).</p><h3 id="construct-nodes-links">Construct Nodes &amp; Links</h3><p>Next, I needed to turn the rows of <code>user_text</code>, <code>url</code>, and <code>number_of_hits</code> values into nodes and links. I looped through all the rows and grouped the <code>user_text</code> entries by what URL they point at. This resulted in a dictionary for each URL with keys:values being <code>user_text</code>: <code>number_of_hits</code>. Example:</p><figure class="kg-card kg-code-card"><pre><code class="language-Python">   &apos;https://www.youtube.com/&apos;: {
       &apos;y&apos;: 5.0,
       &apos;yo&apos;: 5.0,
       &apos;you&apos;: 3.0
   }</code></pre><figcaption>Grouping <code>user_text</code>s that point to same URL</figcaption></figure><p>I then need to convert these into &quot;link&quot; form. Doing so on the same YouTube data as the above example yields these links:</p><figure class="kg-card kg-code-card"><pre><code class="language-Python"> &apos;y&apos; -&gt; youtube.com (5) 
 &apos;yo&apos; -&gt; youtube.com (5) 
 &apos;you&apos; -&gt; youtube.com (3)</code></pre><figcaption>This is pretty similar to the raw <code>network_action_predictor</code> rows</figcaption></figure><p>Now there are many <code>user_text</code> entries all pointing to a URL, not to other text items. This would result in a graph that&apos;s only two &quot;levels&quot; deep, not the multi-leveled flow graph desired. I needed to modify the links so that <code>user_text</code> entries that eventually point to the same URL <strong>and </strong>that are subsets point to each other instead, showing the flow (and not lead to overcounting the end result); something like this:</p><figure class="kg-card kg-code-card"><pre><code class="language-Python">  &apos;y&apos; -(5)-&gt; |&apos;yo&apos;| ---(2)-------------------&gt; | youtube.com |
              |&apos;yo&apos;| ---(3)--&gt; |&apos;you&apos;| --(3)--&gt; | youtube.com |</code></pre><figcaption>It&apos;s hard to illustrate in ASCII, but that&apos;s why the final product is a graph ;)</figcaption></figure><blockquote><strong>Important note</strong>: this is assuming something about the data; those 5 hits for <code>y</code> to youtube.com are actually the same 5 hits as for <code>yo</code> to youtube.com. I couldn&apos;t find confirmation that this is the case, but from looking at a bunch of different test data sets I&apos;ve collected I believe it to be true. The alternative is that there actually were 10 hits to youtube.com (5 from <code>y</code> and 5 from <code>yo</code>), not the 5 I&apos;m interpreting it as. </blockquote><p>To transform these nodes and links into the &quot;chained&quot; form I want, I go through each URL&apos;s dictionary and see if any <code>user_text</code> values are the same, but with one letter added at the end. Examples: <strong>y &amp; yo</strong> and <strong>yo &amp; you</strong>. If so, I make a new link (<strong>y &#x2192; yo</strong>) with the &quot;weight&quot; being the overlap (<strong>5</strong>).</p><p>To wrap this part up, I made links for any nodes that didn&apos;t fall into this &quot;subset&quot; pattern, then did a little massaging to save the nodes and links in a JSON file suitable for the graphing library.</p><h3 id="display-the-chart">Display the Chart</h3><p>To build the actual visualization, I used <a href="https://d3js.org/?ref=dfir.blog">d3.js</a> and a <a href="https://github.com/d3/d3-plugins/blob/master/sankey/sankey.js?ref=dfir.blog">Sankey plugin</a>. There are other Sankey options; this one is quite old, but I did start this project a long time ago. You can do incredible things with d3.js, but I am by no means a master with it and this chart is fairly spartan. 
It&apos;s mostly the example code with a few tweaks; most of the work I did was in transforming the Network Action Predictor data into a JSON in the format the library needed.</p><h2 id="run-it-yourself">Run it Yourself</h2><p>In my scripts repository, there is a <a href="https://github.com/obsidianforensics/scripts/tree/master/keystroke-flow?ref=dfir.blog">keystroke-flow directory</a>. Run <code>python3 keystroke-flow.py &quot;/path/to/Network Action Predictor&quot;</code> and it will create a JSON file. If you want to tweak the threshold value mentioned above, pass <code>-t &lt;number&gt;</code> to filter out URLs that have less than <code>&lt;number&gt;</code> incoming links. There&apos;s a <code>keystroke_flow_diagram.html</code> file in that directory that will render the JSON into the <em>Keystroke Flow</em> chart, but you can&apos;t just open it to view the results. If you do, you won&apos;t see the chart, as CORS policy won&apos;t let it load. </p><p>Fortunately, Python can help us out here. Open a command prompt and change directories into <code>keystroke-flow</code>. 
Then run <code>python -m http.server</code>, open <a href="http://localhost:8000/?ref=dfir.blog">http://localhost:8000/</a> in a browser, and click <code>keystroke_flow_diagram.html</code> to view your own <em>Keystroke Flow</em> Sankey!</p>]]></content:encoded></item><item><title><![CDATA[New Hindsight Release: Better LevelDB parsing, New Web UI View, & More!]]></title><description><![CDATA[Latest Hindsight version (2021.01.16) brings exciting new features: improved LevelDB parsing (including deleted!), viewing Hindsight results in the web UI, and more!]]></description><link>https://dfir.blog/hindsight-better-leveldb-and-new-web-ui/</link><guid isPermaLink="false">66579e6f04abfd293590d975</guid><category><![CDATA[Hindsight]]></category><category><![CDATA[Open Source Tools]]></category><category><![CDATA[Python]]></category><category><![CDATA[Web Browsers]]></category><category><![CDATA[Chrome]]></category><dc:creator><![CDATA[Ryan Benson]]></dc:creator><pubDate>Mon, 18 Jan 2021 18:19:40 GMT</pubDate><media:content url="https://dfir.blog/content/images/2021/01/hindsight-2021.01.16-banner.png" medium="image"/><content:encoded><![CDATA[<img src="https://dfir.blog/content/images/2021/01/hindsight-2021.01.16-banner.png" alt="New Hindsight Release: Better LevelDB parsing, New Web UI View, &amp; More!"><p>It&apos;s been a while, but a new Hindsight release is here! This new version (2021.01.16) brings exciting new features: improved LevelDB parsing (including deleted!), viewing Hindsight results in the web UI, and more!</p><h2 id="improved-leveldb-parsing">Improved LevelDB Parsing</h2><p>LevelDB has been used in Chrome for years... and for years I&apos;ve had difficulties parsing it. The Python support for LevelDB hasn&apos;t been great; all the Python packages acted as shims that required LevelDB to already be installed on the system. 
This worked great on Linux systems, as LevelDB was (relatively) easy to install, but proved a challenge on Windows systems.</p><p>Then <a href="https://twitter.com/kviddy?ref=dfir.blog">Alex Caithness</a> from CCL Forensics came out with a couple of fantastic <a href="https://www.cclsolutionsgroup.com/post/hang-on-thats-not-sqlite-chrome-electron-and-leveldb?ref=dfir.blog">blog</a> <a href="https://www.cclsolutionsgroup.com/post/indexeddb-on-chromium?ref=dfir.blog">posts</a> (and code!) exploring Chrome&apos;s IndexedDB. IndexedDB in Chrome is complicated in its own right, but it also uses LevelDB for data storage. In his exploration of IndexedDB, Alex created a <strong>pure Python parser for LevelDB</strong>! This code (which he <a href="https://github.com/cclgroupltd/ccl_chrome_indexeddb?ref=dfir.blog">released as open source</a>) makes reading LevelDB in Python <em>a lot</em> easier. I&apos;ve switched Hindsight over to using <a href="https://github.com/cclgroupltd/ccl_chrome_indexeddb?ref=dfir.blog">ccl_chrome_indexeddb</a> for reading LevelDB and removed the old code and dependencies, which means Hindsight should now parse LevelDB records out of the box on all platforms! </p><p>Right now, FileSystem and LocalStorage records are the only LevelDB-backed artifacts that Hindsight parses, but I&apos;ll be adding more in the coming months. Both record types appear in the &quot;Storage&quot; tab. Thanks to Alex&apos;s code, I was able to add two new columns (<em>Sequence</em> and <em>State</em>), both about LevelDB internals; I&apos;ll expand on them in a later post. The File System records also got a few additional columns, thanks to suggestions from <a href="https://twitter.com/chadtilbury?ref=dfir.blog">Chad Tilbury</a>, that help you see which files still exist on disk and a bit about them (size and type). 
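</p><p>To make the <em>Sequence</em> and <em>State</em> ideas concrete, here is a toy, pure-Python illustration of how LevelDB versioning behaves (conceptual only; this is not Hindsight or CCL code, and the records are made up): every write gets an increasing sequence number, deletes are just tombstone records, and superseded values stick around until compaction.</p>

```python
# Toy records, shaped like what a forensic parser might surface.
records = [
    {"key": "theme", "seq": 1, "state": "live",    "value": "dark"},
    {"key": "theme", "seq": 2, "state": "live",    "value": "light"},
    {"key": "token", "seq": 3, "state": "live",    "value": "abc123"},
    {"key": "token", "seq": 4, "state": "deleted", "value": None},
]

def current_view(recs):
    """What a normal LevelDB reader sees: the highest sequence number
    per key wins, and tombstoned keys disappear."""
    latest = {}
    for r in sorted(recs, key=lambda r: r["seq"]):
        latest[r["key"]] = r
    return {k: r["value"] for k, r in latest.items() if r["state"] == "live"}

def recoverable(recs):
    """What a forensic parser can additionally surface: superseded and
    deleted versions that a live reader never shows."""
    live = current_view(recs)
    return [r for r in recs
            if r["state"] == "deleted" or r["value"] != live.get(r["key"])]

print(current_view(records))      # {'theme': 'light'}
print(len(recoverable(records)))  # 3: old 'dark' value, token value, tombstone
```

<p>This is also why those two columns are useful in practice: sorting by sequence reconstructs the order of writes, and the state flag separates live records from deleted ones.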
</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://dfir.blog/content/images/2021/01/hindsight_file-system-specific.png" class="kg-image" alt="New Hindsight Release: Better LevelDB parsing, New Web UI View, &amp; More!" loading="lazy"><figcaption>New Backing Database and File System columns in &quot;Storage&quot; tab</figcaption></figure><h3 id="bonus-deleted-records-">Bonus: Deleted Records!</h3><p>One of the things that excited me when I first dug into LevelDB is that the format lends itself to keeping deleted records around for a while. I&apos;ve been using a Go program called <a href="https://github.com/golang/leveldb/tree/master/cmd/ldbdump?ref=dfir.blog">ldbdump</a> to explore deleted records, and you can find a lot of them! Another great thing about switching Hindsight to CCL Forensics&apos; code is that it parses deleted records, so Hindsight now can too! More to come on this in a later post.</p><h2 id="viewing-sqlite-results-in-hindsight-s-web-ui">Viewing SQLite Results in Hindsight&apos;s Web UI</h2><p>Since its beginning, Hindsight has been purely a parsing tool; you had to view the parsed output somewhere else (an XLSX file in Excel, or maybe a JSONL file loaded into Timesketch). Thanks to Ryne Everett, you can now view parsed records in Hindsight too! He&apos;s added the ability to view Hindsight&apos;s SQLite output in the Hindsight web UI. It uses his <a href="https://gitlab.com/ryneeverett/sqlite-view?ref=dfir.blog">sqlite-view</a> project, which is based on <a href="https://github.com/inloop/sqlite-viewer?ref=dfir.blog">sqlite-viewer</a>, to add a SQL-like view and querying interface to Hindsight. </p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://dfir.blog/content/images/2021/01/hindsight-sqlite-view.png" class="kg-image" alt="New Hindsight Release: Better LevelDB parsing, New Web UI View, &amp; More!" 
loading="lazy"><figcaption>Viewing Hindsight&apos;s output in the browser using <code>sqlite-view</code></figcaption></figure><p>After running Hindsight&apos;s web UI and processing some browser history files, there&apos;s a new button (<em>View SQLite DB in Browser</em>). Clicking it brings up a view like the screenshot above. You can select which table to view by clicking the table name at the top, and you can run SQLite queries as if you were in an external SQLite viewer. </p><p>It does require a separate <a href="https://github.com/obsidianforensics/hindsight?ref=dfir.blog#manual-installation">install step</a>, as we didn&apos;t want to bundle all the supporting JavaScript code in the Hindsight repo. If you don&apos;t have the necessary JavaScript code installed, you just won&apos;t be able to use the new functionality (the button will be grayed out); everything else in Hindsight should continue to work as normal. I&apos;ve included these supporting files in the compiled EXE version, so this feature is enabled there out of the box.</p><h2 id="parsing-media-history-artifacts">Parsing &quot;Media History&quot; Artifacts</h2><p>Chrome added a new &quot;Media History&quot; database in version 86, and this version of Hindsight adds support for parsing it. See this <a href="https://dfir.blog/media-history-database-added-to-chrome/">blog post</a> for more info on this new artifact.</p><h2 id="update-minimum-python-version-to-3-8">Update Minimum Python Version to 3.8</h2><p>The switch to using the CCL Forensics LevelDB parsing code necessitated moving Hindsight from Python 3.7 to 3.8. I hope this isn&apos;t too big an issue for anyone, as 3.7 has moved to security fixes only, and 3.8 (and 3.9) bring performance improvements as well. </p><h2 id="get-hindsight">Get Hindsight</h2><p>You can get Hindsight, view the code, and see the full change log on <a href="https://github.com/obsidianforensics/hindsight?ref=dfir.blog">GitHub</a>. 
Both the command line and web UI versions of this release are available as:</p><ul><li>compiled EXEs, attached to the <a href="https://github.com/obsidianforensics/hindsight/releases/latest?ref=dfir.blog">GitHub release</a> or in the dist/ folder</li><li>Python scripts, via <code>pip install pyhindsight</code> or by downloading/cloning the GitHub repo</li></ul><p><em>NOTE: Windows Defender has been flagging the EXEs as malware, presumably because they were packaged with PyInstaller</em>. The Python script versions are not being flagged. If you&apos;d like to build the EXEs from the Python code yourself, all I did was run <code>pyinstaller --distpath .\dist .\spec\hindsight.spec</code> from the root of the repo.</p>]]></content:encoded></item></channel></rss>