Cross-Device Tracking as a Re-Identification Technique: Why Technology Alone Cannot Enforce Best Practices

Joseph A. Calandrino, Ph.D.

Security & Privacy

Client Bulletin

July 14, 2015

If you follow the online advertising or privacy communities closely, you have likely heard of cross-device tracking. For those less familiar with the topic, cross-device tracking attempts to link an individual’s activity as that person switches between computers, smartphones, tablets, and other smart devices (e.g., smart TVs). Its most familiar purpose is marketing, such as targeting ads or tracking conversion rates across devices. Possible applications are broader, however: for example, similar techniques could help a bank distinguish between authentic and fraudulent login attempts on previously unseen devices. Regardless of the application, this tracking seeks to connect the dots between the many devices that consumers routinely use today.

Cross-device tracking seeks to connect user activity across devices.

Cross-device tracking extends attempts to track a user’s activity on a single device. This new form of tracking falls into two categories: deterministic and probabilistic. In the deterministic case, a user logs in to the same account on two devices, directly indicating a connection between the devices. For example, assume that a user logs in to Google and Facebook accounts on both her desktop computer and her smartphone. The user’s actions would strongly suggest to those companies that she uses both devices, helping the companies to link her activity as she switches between the devices. Although room for debate exists on a variety of issues surrounding deterministic tracking, the steps facilitating it are typically visible to a user.

Logging in to a common account on different devices facilitates deterministic cross-device tracking.
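
As a rough illustration of the deterministic case, the sketch below groups devices by the account that logged in on them. The event format, the account names, and the device names are all hypothetical; a real service would work from its own login records, and nothing here reflects any particular company's implementation.

```python
from collections import defaultdict


def link_devices_by_account(login_events):
    """Group device IDs by the account that logged in on them.

    login_events is an iterable of (account_id, device_id) pairs.
    Both identifiers are hypothetical placeholders.
    """
    devices_by_account = defaultdict(set)
    for account_id, device_id in login_events:
        devices_by_account[account_id].add(device_id)
    return dict(devices_by_account)


# Synthetic login events, invented for illustration only.
events = [
    ("user-1", "desktop-A"),
    ("user-1", "phone-B"),
    ("user-2", "tablet-C"),
]

linked = link_devices_by_account(events)

# Accounts seen on more than one device reveal a cross-device link.
multi_device = {acct: devs for acct, devs in linked.items() if len(devs) > 1}
```

Here only "user-1" appears in multi_device, because that account was seen on both the desktop and the phone. No guessing is involved: the shared login is the link.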

Probabilistic cross-device tracking involves making educated guesses at the connections between devices and their users. These guesses could be based on any available information, such as configuration details of the devices/software, user behavior, network details, and more. The goal is to find some pattern that links a user’s activity on one device to the user’s activity on other devices. Consider the following hypothetical cases:

Alice regularly visits the same distinctive combination of news, sports, shopping, and social media sites from her work computer during lunchtime and her tablet in the evening. An advertiser notices the similar patterns and merges the activity seen on both devices into a more thorough combined profile. The advertiser uses this richer profile to target ads to Alice on both devices.
While at home, Bob browses the web on his laptop and his smartphone, but he rarely uses both at the same time. Based on timing, network details, etc., an analytics service infers that Bob switches between both devices. The service uses the inferred connection to check whether items that Bob viewed on his smartphone later resulted in purchases on Bob’s laptop—helping an online retailer assess the value and effectiveness of its mobile site.
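
One simple way to picture a case like Alice's is to compare the sets of sites seen on each device and link devices whose sets overlap heavily. The sketch below uses Jaccard similarity with an arbitrary threshold; the device names, site names, and the 0.6 cutoff are assumptions made for illustration, and real systems would likely combine many more signals (timing, network details, and so on).

```python
def jaccard(a, b):
    """Jaccard similarity of two collections: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # treat two empty histories as sharing nothing
    return len(a & b) / len(a | b)


def guess_links(site_visits, threshold=0.6):
    """Return (device, device, score) for pairs at or above the threshold.

    site_visits maps a device label to the set of sites seen on it.
    The threshold is an arbitrary illustrative choice.
    """
    devices = sorted(site_visits)
    links = []
    for i, first in enumerate(devices):
        for second in devices[i + 1:]:
            score = jaccard(site_visits[first], site_visits[second])
            if score >= threshold:
                links.append((first, second, score))
    return links


# Synthetic browsing data, invented for illustration only.
visits = {
    "work-pc": {"news.example", "sports.example", "shop.example", "social.example"},
    "tablet": {"news.example", "sports.example", "shop.example", "social.example",
               "video.example"},
    "other-phone": {"weather.example", "video.example"},
}

links = guess_links(visits)
```

With this data, only the work computer and the tablet are linked, since they share four of five distinct sites. The result is a guess, not a fact: two different people with similar habits would be linked in exactly the same way.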

Depending on the data used and methods of data collection, little to no evidence of probabilistic tracking may be visible to a user.

Cross-device tracking raises a wide variety of challenging questions regarding data collection and use by various parties, which is a major reason that the issue has drawn so much attention. The probabilistic form presents additional complexity not only because it could occur silently but also because it poses a unique technical challenge: take the entire universe of previously seen users and infer which user (if any) matches the user on a given device. This technical challenge—and the goal of cross-device tracking more generally—closely resembles research in the area of re-identification. As we grapple with questions surrounding cross-device tracking, re-identification may provide some guidance.

Re-Identification

Informally, de-identification attempts to remove identifying information from data, while re-identification attempts to reassociate details of de-identified data with the underlying individuals. Three well-known examples may be helpful:

During the 1990s, Massachusetts wished to release state employee medical records to researchers. Given privacy concerns, the data excluded numerous obvious identifiers, like names—but critically left gender, ZIP code, and date of birth untouched. Researcher Latanya Sweeney (who subsequently served as FTC Chief Technologist) demonstrated that these three fields are unique for many individuals, potentially allowing location of a person’s medical records given knowledge of those three attributes alone. Publicly available voter registration lists contained these three fields accompanied by the individual’s name, allowing mass reassociation of records with names. Sweeney demonstrated the threat by using voter registration lists to locate then-Governor William Weld’s medical records in the data.
In 2006, AOL released a large batch of search data intended to facilitate research. Although the company did not disclose the usernames associated with search queries, New York Times reporters used details of the queries themselves to reassociate searches with Thelma Arnold, a 62-year-old resident of Lilburn, Georgia.
Also in 2006, Netflix released a set of customer movie rating data intended for participants in a contest to improve the site’s recommendations. Like the previous cases, names and other clearly identifying details were not released. Nevertheless, researchers Arvind Narayanan and Vitaly Shmatikov discovered that a small amount of imperfect prior knowledge regarding past video viewing habits could allow them to find a person’s full set of activity in this data, potentially revealing viewing history that the individual would not have willingly shared.
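
The first example above amounts to joining two data sets on fields they share. The sketch below shows the idea with entirely synthetic records: a "de-identified" data set that retains gender, ZIP code, and date of birth, and a public list that pairs those same fields with names. The field names, values, and function are hypothetical and are not drawn from the actual studies.

```python
def reidentify(deidentified_records, public_records,
               keys=("gender", "zip", "dob")):
    """Attach names to records whose quasi-identifiers match exactly one person.

    Records matching zero or several people in the public list are skipped,
    since the match would be missing or ambiguous.
    """
    names_by_key = {}
    for person in public_records:
        key = tuple(person[field] for field in keys)
        names_by_key.setdefault(key, []).append(person["name"])

    matches = []
    for record in deidentified_records:
        key = tuple(record[field] for field in keys)
        names = names_by_key.get(key, [])
        if len(names) == 1:
            matches.append((names[0], record))
    return matches


# Synthetic data, invented for illustration only.
medical = [
    {"gender": "F", "zip": "00001", "dob": "1970-01-01", "diagnosis": "condition-x"},
    {"gender": "M", "zip": "00002", "dob": "1980-02-02", "diagnosis": "condition-y"},
]
voters = [
    {"name": "Person A", "gender": "F", "zip": "00001", "dob": "1970-01-01"},
    {"name": "Person B", "gender": "M", "zip": "00002", "dob": "1980-02-02"},
    {"name": "Person C", "gender": "M", "zip": "00002", "dob": "1980-02-02"},
]

matches = reidentify(medical, voters)
```

In this toy data, the first medical record is reassociated with "Person A" because the three fields are unique to her. The second record is left unmatched because two people share the same values, which is the kind of ambiguity that de-identification hopes for but often fails to achieve.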

Re-identification and probabilistic inferences for cross-device tracking are extremely similar. The former attempts to reassociate individuals with compiled data. The latter attempts to associate an individual’s activity on a given device with data compiled about the individual from other devices.

Applicable Lessons from Re-Identification

Lessons learned from re-identification also may apply to cross-device tracking. One overarching theme from re-identification is the difficulty of preventing it using technical measures alone. New results and related research are commonplace, and many results uncover surprising features of data that distinguish us. Certain mitigation techniques increase the difficulty of re-identification, but no existing technological solution is perfect for all cases.

These lessons suggest that engineers and software developers cannot simply build technical features into devices to enforce best practices for cross-device tracking. Even if devices were to prevent the transmission of certain forms of data altogether under certain circumstances, unexpected details could nonetheless facilitate probabilistic tracking. Consider some hypothetical ways of distinguishing users:

The angle at which a user’s finger swipes across a trackpad or touchscreen would generally be visible to a website or application, and that angle may convey whether the user is left-handed or right-handed.
Color schemes, text size, etc. may be employed creatively to infer details of a user’s vision.
Reading speed, response times, typing speed, and more can be inferred from scrolling, button clicks, etc.
Use of special key combinations or other “shortcuts” may distinguish between “power users” and casual users of a site or application.

Many more possibilities exist, and similar inferences are already in use. Each feature alone may not convey much information, but combinations may be highly distinguishing. Like re-identification generally, new techniques will emerge, and existing ones will improve over time.
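
To see how weak features can combine into a strong one, consider the idealized sketch below. It builds a synthetic population of eight users covering every combination of three two-valued traits, then measures what fraction of users are unique under a given set of traits. The traits, their values, and the uniform population are assumptions chosen to make the arithmetic obvious; real populations are neither uniform nor independent.

```python
from collections import Counter
from itertools import product

FEATURES = ("handedness", "text_size", "uses_shortcuts")


def unique_fraction(users, features):
    """Fraction of users whose values for the given features are unique."""
    keys = [tuple(user[f] for f in features) for user in users]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(users)


# Synthetic population: one user per combination of three two-valued traits.
users = [
    dict(zip(FEATURES, combo))
    for combo in product(["left", "right"], ["normal", "large"], [False, True])
]

one_feature = unique_fraction(users, ["handedness"])
two_features = unique_fraction(users, ["handedness", "text_size"])
three_features = unique_fraction(users, list(FEATURES))
```

In this toy population, no user is unique under one or two traits (each value pattern is shared by four or two users, respectively), yet every user is unique once all three are combined. Each additional trait halves the group a user can hide in, which is why a handful of individually unremarkable signals can be highly distinguishing together.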

What Can We Do?

Many of the questions raised by cross-device tracking are well known to those of us who have followed online tracking generally. What may advertisers, retailers, data brokers, employers, and others learn about users via this technology? What will those parties do with the data? Other questions are new but have a familiar ring. For example, could activity on a home television or other personal device show up on work devices? Could people make important decisions about individuals using data compiled via probabilistic (and potentially incorrect) inferences?

Although these questions stem from technology, re-identification research suggests that developers and engineers alone cannot dictate answers or broadly impose certain practices. Collaboration from industry groups, regulators, and others will be necessary to develop and enforce appropriate and meaningful best practices for data collection, use, and sharing.

Given the challenging questions raised by cross-device tracking and the need for cooperation, Elysium is looking forward to the FTC’s upcoming cross-device tracking workshop in November. This should be a wonderful opportunity for technologists, advocates, regulators, and members of industry to work towards common ground on thorny questions and effective assurances for consumers. Even if none of us can tackle these challenges on our own, perhaps we can make progress together.