With all the news around TikTok (on political, business, and privacy fronts), I decided to take a look at it. There's been lots of coverage on the mobile apps, regarding what they can do or collect. That's not really my wheelhouse and has already been explored, so I decided to look into the URLs and see what I could find. If you're familiar with Unfurl, you might know I like to find timestamps embedded in things, so I figured I'd begin there.
Determining When a Video was Posted
Unlike many other social sites, TikTok doesn't show videos chronologically. It instead uses an algorithm to determine what videos to show the user, so a user might be shown videos in a row that were recorded at widely different times. TikTok used to not show any indicator of when a video was posted (Wired has a nice article exploring this), but has since changed and now displays the date.
As an example, I took a look at a post from Bill Nye. In the screenshot below, I see "7-28" by Bill Nye's name, which I interpret as meaning the video was posted sometime on 2020-07-28. In the page source, we find "createTime":"1595988377", which converts to the human-readable timestamp 2020-07-29 02:06:17 UTC. Given a bit of leeway for time zones, these two timestamps appear consistent. That's great, as we now have a creation or post time for the video that is precise to the second rather than to the day.
However, having to view the page source every time to find the creation date is not ideal. What if the video was deleted or set to private? What if TikTok removes the createTime from the page source? (This happened during my research; I was using uploadDate, which looked like 2019-12-04T12:01:41.000Z, and it disappeared. Luckily, createTime is a useful replacement.) What if we had a timestamp that was easier to retrieve, hard for TikTok to remove, and could be found even for deleted or private videos? If that sounds good, keep reading 😀.
Comparing TikTok to Twitter
Looking at TikTok video URLs, I noticed they had many similarities with Twitter tweet URLs:
While the basic layout is the same (domain/user/post-type/post-id), the ID schemes are clearly not identical. Both these posts are from July 2020 and the IDs are very different. Despite this, perhaps they have similarities too.
Twitter's IDs are generated by a service called Snowflake. They were designed to be unique, generated independently on many systems, and be sortable (older IDs are smaller than newer ones). The way Snowflake achieves this is by using a timestamp as the high-order ("leftmost") bits and worker and sequence numbers in the low-order ("rightmost") bits in the ID. Having the timestamp "first" lets the Snowflake IDs be sortable by time; appending the worker and sequence numbers ensures the IDs are all unique. This system works even when IDs are generated independently and at massive scale. Since we know the makeup of Twitter's Snowflake IDs, we can extract the embedded timestamp. Unfurl makes use of this to show when a tweet was sent, solely from the tweet's URL.
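To make the Snowflake layout concrete, here is a minimal Python sketch of pulling the pieces back out of an ID. The bit widths (41 timestamp bits, 10 machine bits, 12 sequence bits) and the custom epoch constant come from Twitter's public description of Snowflake, not from anything measured in this post, so treat them as assumptions. The function name parse_snowflake and the example ID are made up for illustration.

```python
from datetime import datetime, timezone

# Twitter's custom epoch in milliseconds (assumed from the public Snowflake description)
TWITTER_EPOCH_MS = 1288834974657


def parse_snowflake(snowflake_id: int) -> dict:
    """Split a Snowflake ID into its timestamp, machine, and sequence parts."""
    ms_since_epoch = snowflake_id >> 22        # high-order bits: ms since Twitter's epoch
    machine_id = (snowflake_id >> 12) & 0x3FF  # next 10 bits: which worker made the ID
    sequence = snowflake_id & 0xFFF            # low 12 bits: per-millisecond counter
    created = datetime.fromtimestamp(
        (ms_since_epoch + TWITTER_EPOCH_MS) / 1000, tz=timezone.utc
    )
    return {"created": created, "machine_id": machine_id, "sequence": sequence}


# Made-up ID for illustration only; it is not a real tweet
example = parse_snowflake(1288000000000000000)
print(example["created"].isoformat(), example["machine_id"], example["sequence"])
```

Because the timestamp sits in the high-order bits, comparing two IDs as plain integers is enough to tell which one was generated first.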
Examining TikTok IDs
A byproduct of this ID format is that Twitter IDs, when interpreted as a number, get larger over time. Like Twitter's Snowflakes, TikTok IDs increase slowly over time, giving me hope that they similarly have a timestamp component. I collected some TikTok IDs and started splitting them in different ways; it turns out they have embedded timestamps too! Here's a chart of how to extract the timestamp; I'll go into more details after:
To extract the timestamp, follow these steps:
- Find the ID (for a TikTok video post, it's the long number at the end of the URL)
- Treating the ID as the decimal representation of a 64-bit number, convert it to binary. (If you treat it as a string that happens to be all digits, the conversion will be incorrect.)
- Take the 32 "left-most" (or most significant) bits; one way to do this is a bitwise right shift of 32 bits. This shifts all the bits to the right 32 places, which discards the right-most 32 bits and keeps the left-most ones we care about.
- Convert these 32 bits to a decimal number; it should be 10 digits long and start with a 1.
- Interpret this number as a Unix timestamp (in seconds) to get the embedded timestamp!
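The steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how Unfurl implements it: the function names are hypothetical, and the URL is synthetic (the ID was built by hand so that its upper 32 bits equal the createTime value from the earlier example; it is not a real video).

```python
from datetime import datetime, timezone
from urllib.parse import urlparse


def tiktok_id_from_url(url: str) -> int:
    """Return the numeric ID at the end of a TikTok video URL."""
    last_segment = urlparse(url).path.rstrip("/").split("/")[-1]
    return int(last_segment)  # parse as a number, not as a string of characters


def tiktok_timestamp(tiktok_id: int) -> datetime:
    """Read the upper 32 bits of a TikTok ID as a Unix timestamp in seconds."""
    seconds = tiktok_id >> 32  # discard the low 32 bits, keep the high 32
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# Synthetic URL for illustration only; the user and ID are not real
url = "https://www.tiktok.com/@example_user/video/6854717884011118592"
video_id = tiktok_id_from_url(url)

print(format(video_id, "064b"))    # the ID as a 64-bit binary string
print(video_id >> 32)              # 1595988377
print(tiktok_timestamp(video_id))  # 2020-07-29 02:06:17+00:00
```

Python integers are arbitrary precision, so the shift is safe as written. In a language with fixed-width integers, the ID should be held in a 64-bit type before shifting.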
The embedded timestamp in the ID does not always exactly match the createTime value, but most are within 5 seconds (the biggest difference I've observed is 18 seconds). When they don't match, in all instances I've seen, the createTime is newer than the embedded timestamp. This suggests to me that the ID is created or assigned a short time before whatever event createTime records.
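A comparison like this is easy to script. The sketch below assumes you have both an ID and the matching createTime string from the page source. The createTime is the one from the Bill Nye example, but the ID is synthetic (constructed so its upper 32 bits are 5 seconds earlier), and seconds_between is a hypothetical helper name.

```python
from datetime import datetime, timezone


def seconds_between(tiktok_id: int, create_time: str) -> int:
    """Return createTime minus the timestamp embedded in the ID, in seconds."""
    embedded = tiktok_id >> 32  # upper 32 bits of the ID
    return int(create_time) - embedded


# createTime from the page source in the earlier example
create_time = "1595988377"
# Synthetic ID (not the real video's ID): upper 32 bits set 5 seconds earlier
synthetic_id = (1595988372 << 32) | 0x1234ABCD

lag = seconds_between(synthetic_id, create_time)
print(datetime.fromtimestamp(int(create_time), tz=timezone.utc))  # 2020-07-29 02:06:17+00:00
print(lag)  # 5
```

A positive result means createTime is newer than the embedded timestamp, which matches every case described above.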
Like Twitter, TikTok uses this ID scheme to identify more than posts. You can find these IDs all over TikTok: webId, and probably more. And now, we can tell when each of these was created by extracting the timestamp embedded in the TikTok ID!
Since this technique only uses information in the TikTok URL, it also works on videos that you can't actually watch (as long as you have the URL). This means even though you can't see a private or deleted video, you can still determine when it was posted. I did a quick search for "best TikToks 2018" and found (many, many!) lists of older posts; these lists often contain TikToks that can't be viewed anymore for whatever reason. Test the timestamp extraction process yourself: grab one of these unplayable TikTok URLs and figure out when it was posted! Another benefit to the ID being in the URL is that in order for TikTok to remove this timestamp, they would have to redo their entire ID and URL schemes. Since that would be a huge task for (probably?) small benefit, you can likely count on this timestamp sticking around.
Unfurl is an open source tool for extracting and visualizing data from URLs. It can already extract embedded timestamps from tweet URLs (and many other ID types as well), and I've added support for parsing TikTok IDs. Just take the URL of the TikTok you are interested in, put it in the text box, and click Unfurl!
I hope you enjoy using Unfurl to tinker with TikTok timestamps!
Bonus: Remaining Bits
If you followed the explanation above on how to extract the timestamp, one follow-on question may have occurred to you: what about the rest of the ID? We basically threw out half the bits when we focused on the timestamp.
From the description of Twitter's Snowflake above, the purpose of the lower bits was to ensure uniqueness, even when the IDs are generated on independent servers and very quickly. Twitter nicely explained the meaning of the lower-order "uniqueness bits" in Snowflake, but TikTok has not done the same with their ID scheme.
Unfortunately, discerning the meaning of the lower 32 bits is a bit harder than the upper timestamp bits. If they are flags and map to an internal lookup table, it's doubtful we'll ever get a concrete explanation of their meanings. If TikTok is instead using random bits in this section to provide uniqueness, we should be able to tell that by looking at enough IDs and seeing a pseudorandom distribution. Likewise, if these bits hold fractional seconds or other additional timing info, we should be able to spot that pattern. I collected 20 more TikTok URLs, from 2018 - 2020, and extracted the timestamp and lower 32 bits here:
The bits do not appear random; in fact, most of the 4-bit groupings have only a few unique combinations. An additional challenge is that we don't even know how to divide these bits; I chose to group them in fours, but it could easily be 8-bit groups or any combination of divisions that adds up to 32. This leads me to believe these "uniqueness bits" are similar to Snowflake's worker ID, rather than random or time-based. I'm going to leave this here for now, but if anyone learns more about these lower bits, I'd be very interested.
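If you want to repeat this kind of analysis on your own set of IDs, a rough Python sketch follows. It masks off the lower 32 bits, splits them into 4-bit groups (the same arbitrary grouping used above), and counts how many distinct patterns show up in each position. The function names are hypothetical, and the sample IDs are synthetic values built only to exercise the code; they say nothing about what real TikTok IDs contain.

```python
from typing import List


def lower_nibbles(tiktok_id: int) -> List[str]:
    """Return the low 32 bits of an ID as eight 4-bit strings."""
    low = tiktok_id & 0xFFFFFFFF  # mask off the timestamp half
    bits = format(low, "032b")    # zero-padded 32-character binary string
    return [bits[i:i + 4] for i in range(0, 32, 4)]


def distinct_values_per_group(ids: List[int]) -> List[int]:
    """Count how many different bit patterns appear in each 4-bit position."""
    columns = zip(*(lower_nibbles(i) for i in ids))
    return [len(set(column)) for column in columns]


# Synthetic IDs for illustration only; the lower halves are invented
sample_ids = [
    (1595988377 << 32) | 0x00A10038,
    (1577836800 << 32) | 0x00A20038,
    (1546300800 << 32) | 0x00B10038,
]
print(distinct_values_per_group(sample_ids))  # [1, 1, 2, 2, 1, 1, 1, 1]
```

With a reasonably large sample, truly random bits would push every position toward 16 distinct patterns, while worker-style identifiers would leave many positions with only one or two.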