Dataset Anonymization
The anonymize subcommand lets you share RDF datasets and graphs that contain confidential information without leaking it.
The anonymization process replaces every RDF term with a random IRI, so the original values are removed while the structure of the data is preserved.
How it works
The anonymization process uses a bijective function mapping input RDF terms to random output IRIs.
The output IRIs match the following regex pattern: http://example.org/[a-zA-Z]+.
The function is implemented based on an in-memory map.
When an RDF term T is anonymized for the first time, a new random IRI I is generated and associated with the term T; subsequent occurrences of T are replaced by I.
IRIs are created using a cryptographically secure random number generator from OpenSSL.
The anonymization process takes place entirely in memory. No information about the mapping or the original data is persisted in the database, and the generated mapping is discarded when the program exits. Because the IRIs are random, different invocations of the program produce different output, even with the same input data.
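The mapping described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the rdf4cpp implementation: the class and method names are hypothetical, the input triples are placeholders, Python's secrets module stands in for the OpenSSL random number generator, and the prefix and the 16-letter local name are assumptions read off the example output on this page.

```python
import secrets
import string

# Assumptions for illustration only (taken from the example output on this page).
IRI_PREFIX = "http://example.org/"
LOCAL_NAME_LENGTH = 16


class Anonymizer:
    """Hypothetical sketch: maps each distinct RDF term to one random IRI, in memory only."""

    def __init__(self):
        self._mapping = {}  # original term -> generated IRI
        self._used = set()  # generated IRIs, so no two terms share an IRI

    def _fresh_iri(self):
        # secrets draws from the operating system's cryptographically secure RNG.
        while True:
            name = "".join(
                secrets.choice(string.ascii_letters) for _ in range(LOCAL_NAME_LENGTH)
            )
            iri = IRI_PREFIX + name
            if iri not in self._used:
                self._used.add(iri)
                return iri

    def anonymize_term(self, term):
        # First occurrence: generate and remember. Later occurrences: reuse.
        if term not in self._mapping:
            self._mapping[term] = self._fresh_iri()
        return self._mapping[term]

    def anonymize_triple(self, triple):
        return tuple(self.anonymize_term(term) for term in triple)


# Placeholder input; terms are plain strings here rather than parsed RDF terms.
triples = [
    ("ex:MonaLisa", "rdf:type", "ex:Painting"),
    ("ex:MonaLisa", "ex:creator", "ex:LeonardoDaVinci"),
    ("ex:LeonardoDaVinci", "rdf:type", "ex:Person"),
]
anonymizer = Anonymizer()
anonymized = [anonymizer.anonymize_triple(triple) for triple in triples]
for triple in anonymized:
    print(" ".join(f"<{term}>" for term in triple), ".")
```

Because the dictionary lives only inside the Anonymizer object, the mapping disappears when the process ends, and a second run draws fresh random names for the same input.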
Source code
The source code of the anonymization algorithm is publicly available in rdf4cpp. The Tentris binary simply uses this algorithm; you can also call rdf4cpp::Dataset::anonymize directly.
Examples
Anonymizing a Turtle file
tentris anonymize < secret.ttl > anon.ttl
Example Output for the Mona Lisa Graph
Below is the result of anonymizing the Mona Lisa graph from Introduction to Knowledge Graphs.
Note: Each invocation results in a different graph.
<http://example.org/IfXHRlbIuXliRwkq>
    <http://example.org/dUMZqmcIyPOCKNzL> <http://example.org/PhNocESfEftRsFVw> ;
    <http://example.org/mWocxoZHzGeOLCsn> <http://example.org/CuedkluuFuMditkC> .
<http://example.org/SEVWpxLqiSTDzQuv>
    <http://example.org/dUMZqmcIyPOCKNzL> <http://example.org/PhNocESfEftRsFVw> ;
    <http://example.org/faYhdyFbFefXFyNY> <http://example.org/XeBBndYqWyGSsedh> ;
    <http://example.org/mWocxoZHzGeOLCsn> <http://example.org/MMIZDsGzVCQhmJBx> ;
    <http://example.org/diyOPzaKxmfpETuU> <http://example.org/IfXHRlbIuXliRwkq> ;
    <http://example.org/nZXhJkVNOWPaAfGx> <http://example.org/PoNHEDJsjCikAxDM> .
<http://example.org/PoNHEDJsjCikAxDM>
    <http://example.org/dUMZqmcIyPOCKNzL> <http://example.org/phztEdzhmcqhcDkm> ;
    <http://example.org/dNEChiLJIqYVyqrm> <http://example.org/ZBdMsRjdUoOvsNeE> ;
    <http://example.org/HSJUmqxiGonkCmoF> <http://example.org/shBsTMuwaYplgzoW> .
<http://example.org/shBsTMuwaYplgzoW>
    <http://example.org/dUMZqmcIyPOCKNzL> <http://example.org/PhNocESfEftRsFVw> ;
    <http://example.org/mWocxoZHzGeOLCsn> <http://example.org/CZGbPCKoYcQeNPQS> .
<http://example.org/suCtwVxXLahzYzHe>
    <http://example.org/dUMZqmcIyPOCKNzL> <http://example.org/dprlGUMsicGOLAFo> ;
    <http://example.org/ZkyvVIcgmjMBZBZj> <http://example.org/PoNHEDJsjCikAxDM> .
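Before sharing an anonymized file, it can be useful to confirm that every IRI reference in it has the anonymized form seen in the output above. The following is a rough sketch of such a check, not part of Tentris: the function name is hypothetical, the pattern is read off the sample output, and the scan is a plain regular expression rather than a Turtle parser, so it only inspects IRIs written in angle brackets.

```python
import re

# Pattern read off the sample output above; adjust it if your output differs.
ANON_IRI = re.compile(r"http://example\.org/[a-zA-Z]+")
# Anything written between angle brackets is treated as an IRI reference.
IRI_REF = re.compile(r"<([^<>\s]*)>")


def non_anonymized_iris(turtle_text):
    """Return, sorted, every IRI reference that does not have the anonymized form."""
    found = set(IRI_REF.findall(turtle_text))
    return sorted(iri for iri in found if not ANON_IRI.fullmatch(iri))


# First subject block of the sample output above.
sample = """
<http://example.org/IfXHRlbIuXliRwkq>
    <http://example.org/dUMZqmcIyPOCKNzL> <http://example.org/PhNocESfEftRsFVw> ;
    <http://example.org/mWocxoZHzGeOLCsn> <http://example.org/CuedkluuFuMditkC> .
"""
leaks = non_anonymized_iris(sample)
print(leaks)  # an empty list means every IRI reference has the anonymized form
```

An empty result is a necessary condition only. It does not prove that the file came from the anonymize subcommand.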
Options
-I <IFMT>, --input-format <IFMT>: Specify the input RDF format.
  Options: n-triples, turtle, n-quads, tri-g, detect.
  Default: detect (detects the format based on the file extension of the input file).
-O <OFMT>, --output-format <OFMT>: Specify the output RDF format.
  Options: n-triples, turtle, n-quads, tri-g, detect.
  Default: detect (detects the format based on the file extension of the output file).