API Reference

This is the complete list of functions offered by the data minimization API. All functions expect a list of dictionaries as input.

data_minimization_tools.drop_keys(data: [<class 'dict'>], keys)

Removes the data for specific keys (does not drop the key from the dictionary!).

Parameters:

data – input data as list of dicts
keys – list of keys whose values should be removed
Returns:	cleaned list of dicts
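
The distinction between removing a value and removing a key can be illustrated with a small stand-alone sketch. It does not call the library: `drop_keys_sketch` is a hypothetical re-implementation of the behavior described above, and the choice of `None` as the placeholder for a removed value is an assumption, since this reference does not state what the library writes in its place.

```python
from copy import deepcopy


def drop_keys_sketch(data, keys, placeholder=None):
    """Blank out the values for `keys` while keeping the keys themselves.

    Hypothetical illustration only; `placeholder=None` is an assumption.
    """
    cleaned = deepcopy(data)  # leave the caller's records untouched
    for record in cleaned:
        for key in keys:
            if key in record:
                record[key] = placeholder
    return cleaned


records = [{"name": "Alice", "age": 34}, {"name": "Bob", "age": 51}]
cleaned = drop_keys_sketch(records, ["name"])
print(cleaned)
# [{'name': None, 'age': 34}, {'name': None, 'age': 51}]
```

Every record keeps its `name` key, so downstream code that expects a fixed schema keeps working.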

data_minimization_tools.hash_keys(data: [<class 'dict'>], keys, hash_algorithm=<built-in function openssl_sha256>, salt=None, digest_to_bytes=False)

Hashes data for specific keys.

Parameters:

data – input data as list of dicts
keys – list of keys whose values should be hashed
hash_algorithm – the hash algorithm to apply. Can be any hashlib algorithm or any function that behaves similarly
salt – the salt to use
digest_to_bytes – whether the result should be bytes. If False, the result is of type string
Returns:	cleaned list of dicts
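
How the parameters interact can be sketched in plain Python. `hash_keys_sketch` below is a hypothetical stand-in, not the library's code. It makes three assumptions that this reference does not confirm: values are converted with `str()` before hashing, the salt is prepended to the value, and the string result is the hexadecimal digest.

```python
import hashlib


def hash_keys_sketch(data, keys, hash_algorithm=hashlib.sha256,
                     salt=None, digest_to_bytes=False):
    """Replace the values for `keys` with a (salted) hash. Illustration only."""
    cleaned = []
    for record in data:
        new_record = dict(record)  # shallow copy so the input stays unchanged
        for key in keys:
            if key not in new_record:
                continue
            payload = str(new_record[key]).encode("utf-8")
            if salt is not None:
                # Assumption: the salt is prepended to the value.
                payload = str(salt).encode("utf-8") + payload
            digest = hash_algorithm(payload)
            # bytes if requested, otherwise a hexadecimal string
            new_record[key] = digest.digest() if digest_to_bytes else digest.hexdigest()
        cleaned.append(new_record)
    return cleaned


records = [{"email": "alice@example.com", "age": 34}]
hashed = hash_keys_sketch(records, ["email"], salt="pepper")
print(hashed[0]["email"])  # 64 hexadecimal characters for SHA-256
```

Using the same salt for every record keeps equal inputs mapped to equal hashes, so records can still be joined on the hashed field.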

data_minimization_tools.reduce_to_mean(data: [<class 'dict'>], keys)

Replaces all values for the given keys with the mean across all values of the input data list.

Parameters:

data – input data as list of dicts
keys – list of keys whose values should be replaced

Returns:	cleaned list of dicts. Note that this function returns as many items as you input.
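
The note on the return value is easiest to see in a small example. The following sketch re-implements the described behavior with the standard library; `reduce_to_mean_sketch` is a hypothetical name, and the sketch assumes every record contains each requested key with a numeric value.

```python
from statistics import mean


def reduce_to_mean_sketch(data, keys):
    """Overwrite each value for `keys` with the mean over all records."""
    cleaned = [dict(record) for record in data]  # shallow copies of the input
    for key in keys:
        average = mean(record[key] for record in data)
        for record in cleaned:
            record[key] = average
    return cleaned


records = [
    {"age": 20, "city": "A"},
    {"age": 30, "city": "B"},
    {"age": 40, "city": "C"},
]
result = reduce_to_mean_sketch(records, ["age"])
print(len(result))                  # 3
print([r["age"] for r in result])   # [30, 30, 30]
```

The list is not collapsed into a single aggregate row: three records go in, three come out, and fields not named in `keys` are untouched.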

data_minimization_tools.reduce_to_median(data: [<class 'dict'>], keys)

Replaces all values for the given keys with the median across all values of the input data list.

Parameters:

data – input data as list of dicts
keys – list of keys whose values should be replaced

Returns:	cleaned list of dicts. Note that this function returns as many items as you input.
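
One reason to choose the median over the mean is that a single extreme value does not shift it. The sketch below shows this with the standard library; `reduce_to_median_sketch` is a hypothetical re-implementation of the described behavior, and the salary figures are made up for illustration.

```python
from statistics import mean, median


def reduce_to_median_sketch(data, keys):
    """Overwrite each value for `keys` with the median over all records."""
    cleaned = [dict(record) for record in data]
    for key in keys:
        middle = median(record[key] for record in data)
        for record in cleaned:
            record[key] = middle
    return cleaned


salaries = [{"salary": 30000}, {"salary": 32000}, {"salary": 250000}]
result = reduce_to_median_sketch(salaries, ["salary"])
print(result[0]["salary"])                    # 32000
print(mean(r["salary"] for r in salaries))    # 104000
```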

data_minimization_tools.reduce_to_nearest_value(data: [<class 'dict'>], keys, step_width=10)

Reduces all values for the given keys to the nearest value. Think of this as aggregating values into intervals.

Parameters:

data – input data as list of dicts
keys – list of keys whose values should be replaced
step_width – size of the intervals

Returns:	cleaned list of dicts. Note that this function returns as many items as you input.
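
The interval idea can be sketched as rounding each value to the nearest multiple of `step_width`. This is an assumption about the intended behavior: the reference does not state whether the library rounds to the nearest multiple, floors to the start of the interval, or does something else. `reduce_to_nearest_value_sketch` is a hypothetical name.

```python
def reduce_to_nearest_value_sketch(data, keys, step_width=10):
    """Round each value for `keys` to the nearest multiple of `step_width`.

    Assumption: "nearest value" means the nearest multiple of step_width.
    """
    cleaned = [dict(record) for record in data]
    for record in cleaned:
        for key in keys:
            if key in record:
                record[key] = round(record[key] / step_width) * step_width
    return cleaned


records = [{"age": 23}, {"age": 37}, {"age": 41}]
binned = reduce_to_nearest_value_sketch(records, ["age"])
print([r["age"] for r in binned])  # [20, 40, 40]
```

A larger `step_width` gives coarser intervals and therefore stronger generalization, at the cost of precision.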

data_minimization_tools.replace_with(data: [<class 'dict'>], replacements: dict)

Receives a 1:1 mapping of original value to new value and replaces the original values accordingly. This corresponds to CN-Protect’s DataHierarchy.

Parameters:

data – input data as list of dicts
replacements – 1:1 mapping of original value to new value
Returns:	cleaned list of dicts
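
A typical use of such a mapping is generalization, for example replacing cities with their countries. The sketch below is a hypothetical re-implementation, not the library's code. It assumes the mapping is applied to the values of every field and that values without an entry in the mapping are left as they are; neither detail is stated in this reference.

```python
def replace_with_sketch(data, replacements):
    """Replace every value that appears as a key in `replacements`."""
    cleaned = []
    for record in data:
        new_record = {}
        for key, value in record.items():
            try:
                # Values without a mapping entry are kept (assumption).
                new_record[key] = replacements.get(value, value)
            except TypeError:
                # Unhashable values (e.g. lists) cannot be looked up; keep them.
                new_record[key] = value
        cleaned.append(new_record)
    return cleaned


city_to_country = {"Berlin": "Germany", "Munich": "Germany", "Paris": "France"}
records = [
    {"city": "Berlin", "visits": 3},
    {"city": "Paris", "visits": 1},
    {"city": "Oslo", "visits": 2},
]
generalized = replace_with_sketch(records, city_to_country)
print([r["city"] for r in generalized])  # ['Germany', 'France', 'Oslo']
```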

data_minimization_tools.replace_with_distribution(data: [<class 'dict'>], keys, numpy_distribution_function_str='standard_normal', *distribution_args, **distribution_kwargs)

Replaces data for specific keys with data generated from a distribution.

Parameters:

data – input data as list of dicts
keys – list of keys whose values should be replaced
numpy_distribution_function_str – name of the NumPy distribution function to sample from, passed as a string. For the possible distribution functions, see the NumPy random sampling documentation
distribution_args – additional args that the chosen function requires
distribution_kwargs – additional kwargs that the chosen function requires
Returns:	cleaned list of dicts
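
Passing the function "as string" suggests that the name is looked up on NumPy's random module at call time. The sketch below works that way, but this is an assumption: the reference does not say whether the library uses `numpy.random` directly or a `Generator` instance. `replace_with_distribution_sketch` is a hypothetical name, and the sketch requires NumPy to be installed.

```python
import numpy as np


def replace_with_distribution_sketch(data, keys,
                                     numpy_distribution_function_str="standard_normal",
                                     *distribution_args, **distribution_kwargs):
    """Replace the values for `keys` with samples from a NumPy distribution."""
    # Assumption: the name refers to a function in the numpy.random namespace.
    sampler = getattr(np.random, numpy_distribution_function_str)
    cleaned = [dict(record) for record in data]
    for key in keys:
        # Draw one sample per record for this key.
        samples = sampler(*distribution_args, size=len(cleaned), **distribution_kwargs)
        for record, sample in zip(cleaned, samples):
            record[key] = float(sample)
    return cleaned


records = [{"age": 23, "city": "A"}, {"age": 37, "city": "B"}, {"age": 41, "city": "C"}]
# Positional arguments after the function name go to numpy.random.uniform(low, high).
noisy = replace_with_distribution_sketch(records, ["age"], "uniform", 18, 65)
print([round(r["age"], 1) for r in noisy])  # three random values in [18, 65)
```

The replaced values carry no information about the originals. Only the shape of the chosen distribution is preserved.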

data_minimization_tools.cvdi.anonymize_journey(data: [<class 'dict'>], original_to_cvdi_key: dict, config_overrides: dict = None) → [<class 'dict'>]

Anonymize a journey using the U.S. DoT’s Privacy Protection Application.

Some of the waypoints in the input will not be present in the output; the rest will have only their geodata (see REQUIRED_KEYS) altered. Any additional attributes of the points that were not dropped from the output will remain unchanged.

Because the de-identification algorithm relies on knowledge of the roads along a journey, a so-called quad file must be provided. Generate such a file named “quad” and place it in ./cvdi-conf/ (relative to the script’s working directory).

Parameters:

data – input data as list of dicts.
original_to_cvdi_key – Mapping of the input data’s fields to the fields required by the de-identification algorithm, e.g., {"lat": "Latitude", ...}, where lat is part of the input data. For the list of required fields, see REQUIRED_KEYS.
config_overrides – Overrides to the de-identification application’s settings. For example, to increase the length of privacy intervals to 300 m, provide {"max_direct_distance": 300, "max_manhattan_distance": 300}.

Returns:

A new, shorter, list of dictionaries representing the waypoints of the de-identified journey.

data_minimization_tools.cvdi.REQUIRED_KEYS = {'Gentime', 'Heading', 'Latitude', 'Longitude', 'Speed'}

The keys required to be present in the input data for de-identification to work.
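
Before calling anonymize_journey, it can help to check that the key mapping covers every required field. The sketch below does this without importing the library: the set of required keys is copied from the listing above, `missing_required_keys` is a hypothetical helper, and the input field names (`ts`, `heading`, `lat`, `lon`, `speed`) are made up for illustration.

```python
# Copied from the REQUIRED_KEYS listing above so the sketch runs stand-alone.
REQUIRED_KEYS = {"Gentime", "Heading", "Latitude", "Longitude", "Speed"}


def missing_required_keys(original_to_cvdi_key):
    """Return the required fields that no input field is mapped to."""
    return REQUIRED_KEYS - set(original_to_cvdi_key.values())


# Hypothetical input field names mapped to the fields the algorithm requires.
key_mapping = {
    "ts": "Gentime",
    "heading": "Heading",
    "lat": "Latitude",
    "lon": "Longitude",
    "speed": "Speed",
}
# The override values shown in the parameter description above.
config_overrides = {"max_direct_distance": 300, "max_manhattan_distance": 300}

print(missing_required_keys(key_mapping))       # set()
print(missing_required_keys({"lat": "Latitude"}))  # the four fields still unmapped

# With the library installed and a quad file in ./cvdi-conf/, the call would be:
# from data_minimization_tools.cvdi import anonymize_journey
# anonymized = anonymize_journey(waypoints, key_mapping, config_overrides)
```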