Data Publishing

From MozillaWiki

Revision as of 23:09, 18 September 2020

Introduction

Mozilla’s history is steeped in openness and transparency - it’s simply core to what we do and how we see ourselves in the world. We are always looking for ways to bring our mission to life in ways that help create a healthy internet and support the Mozilla Manifesto. One of our commitments says “We are committed to an internet that elevates critical thinking, reasoned argument, shared knowledge, and verifiable facts”.

To this end, we have spent a good amount of time considering how we can publicly share our Mozilla telemetry data sets. Doing so is one of the simplest and most effective ways we can enable collaboration and share knowledge, but only if it can be done safely and in a privacy-protecting, principled way. We believe we’ve designed a way to do this and we are excited to outline our approach here.

Dataset Publishing Process

We want our data publishing review process, as well as our review decisions, to be public and understandable, similar to our Mozilla Data Collection program. To that end, our full dataset publishing policy and the considerations we weigh before determining what is safe to publish can be found below, including a summary of the critical pieces of that process.

The goal of our data publishing process is to:

  • Reduce friction for data publishing requests with low privacy risk to users;
  • Have a review system of checks and balances that considers both data aggregations and data level sensitivities to determine privacy risk prior to publishing, and;
  • Create a public record of these reviews, including making data and the queries that generate it publicly available and putting a link to the dataset + metadata on a public-facing Mozilla property.

This page defines all of the factors that must be taken into consideration before publicly sharing Mozilla’s telemetry data. It describes:

  • The levels of possible dataset aggregations using Mozilla’s data
  • The levels of publishing sensitivity
  • What dimensions are sensitive, and at which level
  • What metrics are sensitive, and at which level
  • How we characterize the levels of aggregation

How we characterize the levels of aggregation

The table below describes the various types of aggregation levels we are defining.

Level | Aggregation | Examples
1 | Statistical / ML Models | TAAR, Federated learning models, Forecasting models
2 | Dimension-level aggregation w/ minimum bucket sizes | Total page loads by country, OS, locale, channel where any combination with a count less than 5,000 are grouped into “Other”, e.g. [Canada, Linux, “Other locales”, nightly] for rare locales
3 | Dimension-level aggregation w/o minimum bucket sizes | Clientid count by country, os, locale, channel, where there could be: [Canada, Linux, PL, nightly] which has one client in it.
4 | Probabilistic Aggregates | HLL for computing approximate unique client counts, bloom filter for computing presence in a set
5 | Anonymized individual-level data | Example rows:
  • Anonymized_id, date, country, os, locale, channel
  • A, 2019-08-08, Canada, Linux, PL, nightly
  • A, 2019-08-09, Canada, Linux, PL, nightly
  • A, 2019-08-10, Canada, Linux, PL, nightly
  • B, 2019-08-10, Peru, Windows, EN, release
6 | Not-anonymized individual-level data | Example rows:
  • actual_id, date, country, os, locale, channel
  • 859c8a32-0b73-b547-a5e7-8ef4ed9c4c2d, 2019-08-08, Canada, Linux, PL, nightly
  • 859c8a32-0b73-b547-a5e7-8ef4ed9c4c2d, 2019-08-09, Canada, Linux, PL, nightly
  • 859c8a32-0b73-b547-a5e7-8ef4ed9c4c2d, 2019-08-10, Canada, Linux, PL, nightly
  • 4db8d07d-1935-9c45-93c9-6d97a790bb12, 2019-08-10, Peru, Windows, EN, release
7 | High resolution individual-level data | Raw telemetry events data, a sequence of actions in order of occurrence.
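
To make the difference between Level 3 and Level 2 concrete, here is a minimal Python sketch. It is only an illustration, not the query Mozilla runs: the function names and the row layout are hypothetical, and the choice to coarsen only the locale dimension is an assumption based on the “Other locales” example in the table.

```python
from collections import Counter

MIN_BUCKET_SIZE = 5_000  # threshold from the Level 2 example in the table


def level3_counts(rows):
    """Level 3: count rows per (country, os, locale, channel), no minimum size."""
    return Counter(
        (r["country"], r["os"], r["locale"], r["channel"]) for r in rows
    )


def level2_counts(counts, min_size=MIN_BUCKET_SIZE):
    """Level 2: fold combinations below min_size into an "Other locales" bucket.

    Assumption: only the locale dimension is coarsened. A merged bucket can
    still fall below min_size, so the result would need a further check.
    """
    merged = Counter()
    for (country, os_name, locale, channel), n in counts.items():
        if n < min_size:
            locale = "Other locales"
        merged[(country, os_name, locale, channel)] += n
    return merged
```

With a small threshold for demonstration, a single-client [Canada, Linux, PL, nightly] bucket would be merged with other rare locales rather than published on its own.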

How we characterize the sensitivity of dimensions

Based on the Data Collection Categories, most Telemetry data naturally falls within category 1 (technical data) and category 2 (interaction data), which are not considered sensitive. A notable exception, however, is geolocation: we geocode IP addresses to extract City / Region / Country, but only include cities with a population > 15,000 (according to the Geonames database).
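
The city threshold could be applied along the following lines. This is a sketch under stated assumptions: the function name, the shape of the population lookup, and the choice to blank the city while keeping region and country are all hypothetical; the source only states that cities at or below the threshold are not included.

```python
MIN_CITY_POPULATION = 15_000  # threshold stated above


def publishable_geo(city, region, country, city_population):
    """Return geo fields, dropping the city unless its population exceeds the threshold.

    city_population is a hypothetical lookup of (country, city) -> population,
    standing in for data derived from the Geonames database.
    """
    population = city_population.get((country, city))
    # Unknown cities are treated like small ones (an assumption of this sketch).
    if population is None or population <= MIN_CITY_POPULATION:
        city = None
    return {"city": city, "region": region, "country": country}
```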

Category 3 (web activity) or 4 (highly-sensitive) data should be excluded from the set of “safe” dimensions.

Matrix of aggregation safety vs. dimension sensitivity:

Category | Aggregation Level | Notes
Category 1 (Technical) and 2 (Interaction) | 1, 2, 3 | For low-sensitivity data, we may not require a minimum bucket size for aggregation.
Category 3 (Web Activity) | 1, 2 | As sensitivity increases, minimum bucket size becomes increasingly important.
Category 4 (Highly Sensitive Data) | 1, 2 | Technically, category 4 often involves highly sensitive data, such as explicit identifiers, that will be removed in the process of aggregation. We include it here for the sake of completeness.
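
The matrix above can be read as a simple lookup. The sketch below is illustrative only: the names are hypothetical, and it checks nothing beyond whether an aggregation level is listed for every category present in a dataset. It does not replace the review process.

```python
# Aggregation levels listed in the matrix for each Data Collection Category.
LISTED_LEVELS = {
    1: {1, 2, 3},  # Technical
    2: {1, 2, 3},  # Interaction
    3: {1, 2},     # Web Activity
    4: {1, 2},     # Highly Sensitive Data
}


def level_listed_for(categories, aggregation_level):
    """True if the level appears in the matrix for every category given."""
    return all(aggregation_level in LISTED_LEVELS[c] for c in categories)
```

For example, a Level 3 aggregate is listed for a dataset with only Category 1 and 2 dimensions, but not for one that also has a Category 3 dimension.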

How do we characterize the sensitivity of metrics?

Most metrics are not sensitive information, per se. That said, if a metric indicates or directly implies something about revenue, it is “sensitive”. Example: Search counts.

Dataset Publishing Process

The goal of this process is to (1) make the “easy” (that is, safe) data publishing requests relatively frictionless, (2) have guard rails in place so we don’t publish something that exposes us or our users to risk in some way, and (3) ensure that the dataset publishing request process closely matches other processes that are familiar to the data stewards.

Having a dataset published requires filing a bug. Use the nomenclature defined in the preceding sections to answer the following four questions. If the answer to all of them is “no”, you may publish. A “yes” to any of them means extra review is required.

  • Is the level of aggregation 3 or higher?
  • Are there any Data Collection Category 3 (web activity) or 4 (highly-sensitive) dimensions?
  • Do any of the dimensions or metrics include sensitive data?
  • Are there any data included that do not have a corresponding data review for collection? Please link to relevant data review(s).
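
The four questions can be sketched as a small checklist. This is a hypothetical illustration: the class and field names are not part of any Mozilla tooling, and the last question is modelled as its inverse (whether all data has a collection review). An empty result corresponds to answering “no” to every question.

```python
from dataclasses import dataclass


@dataclass
class PublishingRequest:
    aggregation_level: int                # 1-7, per the aggregation table
    dimension_categories: list            # Data Collection Categories (1-4) of the dimensions
    includes_sensitive_data: bool         # any sensitive dimension or metric
    all_data_has_collection_review: bool  # every field has a linked data review


def extra_review_reasons(req):
    """Return the questions answered "yes"; an empty list means none were."""
    reasons = []
    if req.aggregation_level >= 3:
        reasons.append("aggregation level is 3 or higher")
    if any(c in (3, 4) for c in req.dimension_categories):
        reasons.append("includes Category 3 or 4 dimensions")
    if req.includes_sensitive_data:
        reasons.append("includes sensitive dimensions or metrics")
    if not req.all_data_has_collection_review:
        reasons.append("some data lacks a data collection review")
    return reasons
```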

A data steward will then be assigned to the bug, either by the requester or as part of bug triage, to double-check that these questions are correctly answered and there are no confounding factors inherent to the publishing of the data.

Once the request is approved, data engineering will do the implementation work:

  • Write (or review) the query
  • Schedule it to update on the desired frequency
  • Plumb it into the public-facing dataset infrastructure, including metadata that links the public data back to the above review bug.
  • Once the dataset has been published, it will be announced on the new Data @ Mozilla blog. It will also be added to http://docs.telemetry.mozilla.org/datasets/.

Definitions

Metric - A metric is anything we want to measure. Examples: the number of clients that used the developer tools console, the number of active clients

Dimension - A dimension is a qualitative value such as OS, channel, or date. In practice, a dimension often defines a sub-population on which we can calculate a metric, allowing us to segment the metric for further analysis. Example: if we have an OS dimension, we can analyze the number of active clients by OS.

Aggregate - A combined value of many measurements (metric values), typically grouped by dimension or sets of dimensions. See also Aggregate Data.

Individual-level Data - Data containing a dimension which uniquely identifies a single profile, user, client, etc.

Tabular Data - Data that consists of rows (or records) and columns (or fields). Each row has the same number of columns, and each column represents a dimension or metric for that row. Think of a spreadsheet or CSV file as examples of this type of data.

Example Data

Here are some examples of data aggregated to the levels described above.

  • Level 7: raw data, with fine-grained timestamps
  • Level 6: individual-level data, aggregated to day-level time granularity
  • Level 5: anonymized individual-level data, identifiers replaced with pseudonyms
  • Level 4: probabilistic aggregates
  • Level 3: dimension-level aggregates without a minimum group size
  • Level 2: dimension-level aggregates with a minimum group size
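
As a rough illustration of how the individual-level steps relate, the Python sketch below moves from Level 7 to Level 6 by truncating timestamps to days, and from Level 6 to Level 5 by replacing identifiers with pseudonyms. The function names, the event layout, and the use of random tokens as pseudonyms are all assumptions made for this example, not a description of Mozilla's pipeline.

```python
import secrets


def to_day_level(events):
    """Level 7 -> 6: keep one row per (client, day, dimensions)."""
    seen = set()
    rows = []
    for e in events:
        row = (
            e["client_id"],
            e["timestamp"].date().isoformat(),  # drop the time of day
            e["country"], e["os"], e["locale"], e["channel"],
        )
        if row not in seen:
            seen.add(row)
            rows.append(row)
    return rows


def pseudonymize(rows):
    """Level 6 -> 5: replace each client id with a random, consistent pseudonym."""
    pseudonyms = {}
    out = []
    for client_id, *rest in rows:
        if client_id not in pseudonyms:
            pseudonyms[client_id] = secrets.token_hex(8)
        out.append((pseudonyms[client_id], *rest))
    return out
```

The pseudonym mapping exists only inside the function call here, so the same client receives a different pseudonym on each run; a real pipeline would have to decide whether pseudonyms should be stable across publications.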