Etymology
Onion Routing
At the heart of Tor is onion routing. To achieve privacy, messages are encrypted in multiple layers, much like those of an onion. When this onionized message traverses the network, each intermediary router peels off one layer at a time then forwards the result to the next router, until the final router extracts the core (the original message) which it transfers to the destination.
Tor
Before Tor became known as such, it was a nameless implementation of onion routing. Similar projects started sprouting up, so the original was named The Onion Routing to separate it from the rest1. Despite Tor’s name stemming from an abbreviation, it is capitalized as Tor (not TOR).
My (Totally Real And True) Conspiracy Theory
Note
Skip this section if you want to learn something
I think (it would be really funny if) Tor and onion routing as a whole were inspired by Shrek. Here’s why:
- Shrek, as a character, is a recluse who seeks privacy in the comfort of his swamp; however, he is oft disturbed by a deluge of visiting villagers and inscrutable squatters. Parallels between his character and that of those who seek privacy online are obvious.
- Shrek famously stated: “Ogres are like onions. […] We both have layers” 2. It is here that privacy and onions form an association, potentially sparking the idea of onion routing in the minds of its creators.
- The timelines add up: Shrek the picture book (1990) predates onion routing (1995) 1 by five years, while Shrek the movie (2001) premiered one year before the deployment of Tor (2002) 1.
Terminology
- Circuit: A path through the Tor network
- Relay: A node in a circuit
- Entry: The first relay in a circuit
- Guard: A small set of trusted entry relays used by a client for a fixed period of time (unless using a bridge)
- Bridge: A relay that is not publicly listed (and can thus act as an entry without being identified as such)
- Exit: The last relay in a circuit
- Entry: The first relay in a circuit
- Relay Cell: The encrypted message at a specific relay
- Create Cell: A cell for creating a circuit, contains the first half of a Diffie-Hellman handshake
- Extend Cell: A cell for extending a circuit; similar to a create cell but also contains the address of the relay to extend to
- Onion Key: A public decryption key held by every relay
- Onion Sites/Services: Sites/services only available via the Tor network
Background
A Note On Notation
For the sake of this example, I will use a simplified model of the cell structure defined in 3:
|
Circuit ID
|Command
|Body
|
For clarity, commands will be referred to by their identifiers 4 rather than their true values.
Constructing A Circuit
Suppose Alice wants to create a circuit.
First, she chooses an exit node, followed by a chain of relays that constitute a path 5 such that no path constraints are violated 6. By doing so, she will have obtained the onion keys and addresses of each relay on the path. For relays, let be the keys for each relay.
Second, Alice establishes a TLS connection with the entry. Then she creates a relay create cell with a unique circuit ID () created according to 7 and as the payload, the first half of a Diffie-Hellman handshake () encrypted using a public key encryption function with the first relay’s onion key (i.e. ). This cell is as follows:
| |
CREATE
| |
Alice sends the cell to the first relay over the TLS channel. The relay decrypts the payload using the corresponding decryption function , and responds with its half of the key (), as well as a hash of the shared key () where is a cryptographic hash function. This cell is constructed like so:
| |
CREATED
| , |
Once Alice receives the response, both parties will have established the shared key . Henceforth, the circuit ID and shared key is used for all communications between Alice and relay 1.
To extend the circuit, Alice creates a relay extend cell using the same circuit ID, but with a new handshake part (), encrypted with the second relay’s onion key.
| |
EXTEND
| |
Alice sends the new cell to the first relay. Relay 1, seeing the cell is an extend cell, copies the payload directly into a relay create cell, but replaces the circuit ID with a new one () before sending the cell to relay 2:
| |
CREATE
| |
Relay 2 responds to relay 1 with its half of the handshake () and a hash of the negotiated key ():
| |
CREATED
| , |
Relay 1 then copies the payload into a new cell with the original circuit ID, and sends it to Alice:
| |
EXTENDED
| ) |
Now Alice has negotiated a shared key with relay 2 such that relay 1 cannot discover the key.
This process then repeats for all other relays to be added to the circuit. Circuits typically consist of at least three relays, while many onion services use six 3.
Summary
Step Number | Adding Relay 1 | Adding Relay 2 | … | Adding Relay n |
---|---|---|---|---|
1 | Alice → : | Alice → : | … | Alice → : |
2 | → Alice: | → : | … | → : |
3 | → : | … | → : | |
4 | → Alice: | … | → : | |
… | … | … | ||
n-1 | → : | |||
n | → : | |||
n+1 | → : | |||
… | … | |||
2n | → Alice: |
Using A Circuit
Suppose Alice now has a complete circuit, and thus the keys with all relays. To send a message through the circuit, she creates a relay cell, , like so:
and sends it to the entry. The entry peels the first layer, illustrated as follows:
The entry then sets ‘s origin to itself, then sends to relay 2. Relays repeat the same process. Relay , the exit, finally peels , and sends to the destination.
Note
Almost all traffic these days is TLS-encrypted8, so the exit does not actually see itself, but instead, . The only information the exit knows is the message’s destination, which is necessary for forwarding the message.
Throughout this process, relay only knows of relays and , hence only the entry knows the sender and only the exit knows the receiver.
Summary
The process of constructing and using a two-hop circuit is visualized in 3 as follows:
Attacks
De-anonymization is an obvious attack on an anonymization network.
Per the construction above, key recovery means breaking Diffie-Hellman (and thus the discrete log problem), and meaningful inter-relay man-in-the-middle attacks require breaking secure cryptosystems; both of which are infeasible. Hence, most de-anonymization techniques focus on weaker links: the traffic before the entry and after the exit, or human error.
1. Traffic Correlation
If an attacker controls a circuit (i.e. controls both entry and exit on the same circuit), they can see both the source and destination of the message.
Circuit Confirmation Attack
Since the relays themselves will not know which circuits they are a part of, an attacker will first have to confirm that both the entry and exit node under their control are part of the same circuit. One way for them to do this is by perturbing packets at the entry in some predictable way, then observing the same pattern at the exit node.
Correlating Traffic
Once an attacker has confirmed their control over a circuit, they must correlate traffic entering the entry relay and exiting the exit relay. This can be achieved via timing attacks or traffic analysis.
2. Website Fingerprinting
A set of methods to uniquely identify destination websites based on metadata and/or patterns in communication traffic observed between the client and entry relay. Packet sequences, lengths, order, timing information, and other seemingly innocuous features can uniquely identify a site.
Examples
- kNN9 (k-nearest neighbours): Leverages features extracted from packet sequences to distinguish web pages
- CUMUL10 (CUMULative representation): Support vector machine (SVM) using cumulated packet size to represent load behaviour
- kFP11 (k-nearest neighbours Finger Printing): Random forests and kNN trained on fingerprints of clearnet traffic between specific web pages in order to classify encrypted traffic
- DF12 (Deep Fingerprint): A high-precision deep Convolutional Neural Network (CNN) classifier
3. Browser Fingerprinting
A method to uniquely identify a user based on their browser and device setup; ex. OS, graphics card, screen dimensions, language, order of fonts installed, HTTP headers, time zone, browser plugins can identify users with 99% accuracy in some cases13.
Note
Even if the client makes small changes (installing new fonts, moving time zone, etc.), they are still highly identifiable
4. Canvas Fingerprinting
A method to uniquely identify a user by asking them (their browser, that is) to draw an image on a canvas (hidden in the DOM), then retrieves that image.
Note
This is VERY unique across different computers (anti-aliasing, how they draw colours, etc.)
Defences
1. Traffic Correlation
Use guard nodes, do not choose the same router twice for the same path, do not choose any router in the same family as another router in the same path, do not choose more than one router in a given network range6
2. Website Fingerprinting
Website fingerprinting is an active research topic within the Tor community and many defences have been presented over the years, thus there are more than I can conceivably compile here. Instead, I’ll include some techniques that I came across, categorized into two main classes: Randomization and Regularization. Others will be placed into an Other category.
1. Randomization
These defences use randomness such that no two traces from the same webpage have the same pattern.
- Adaptive padding14: introduce dummy packets into traffic to mask traffic bursts and their corresponding features
- WTF-PAD15 (Website Traffic Fingerprinting Protection with Adaptive Defence): a generalization of adaptive padding
- FRONT16: randomize the shape of distributions used for sampling the timing and number of dummy packets added, and place dummy packets near the front of a trace
2. Regularization
These defences fit traces into deterministic patterns. That is, traces from different pages become indistinguishable.
- Padding: pad packets such that all have equal size (ex. by making all packets the maximum size)
- BuFLO17 (Buffered Fixed-Length Obfuscation): send packets of a fixed size at fixed intervals, using dummy packets to both fill in and (potentially) extend the transmission
- Walkie-Talkie18: transform packet sequences of monitored sensitive pages and benign non-sensitive pages such that the packet sequences are identical (in terms of timing, length, direction, and ordering)
- Tamaraw19: an extension of BuFLO which sets packet size at 750 bytes rather than the MTU, and treats input and outgoing traffic differently (i.e. outgoing fixed at higher interval)
3. Other
- Traffic morphing20: load a web page using a packet size distribution from a different page
- Decoy pages21: load a decoy page simultaneously with the real page to hide the real packet sequence
- TrafficSilver22: split traffic over several “sub-circuits” (i.e. circuits containing distinct entry nodes) in a random manner
- Surakav23: train a generator that is able to generate various reference traces, then sample reference traces from the trained generator and send bursts of data based on the reference trace
3. Browser Fingerprinting
Give standardized answers for everything (implemented in Tor Browser). For example, always return 1920x1080 for the screen size, UTC for the time zone, and Comic Sans as the only font (jk), etc.
4. Canvas Fingerprinting
Disable canvassing (implemented in Tor Browser)
Footnotes
-
https://svn-archive.torproject.org/svn/projects/design-paper/tor-design.html#subsec:circuits ↩ ↩2 ↩3
-
https://spec.torproject.org/tor-spec/cell-packet-format.html#command ↩
-
https://spec.torproject.org/tor-spec/creating-circuits.html ↩
-
https://spec.torproject.org/path-spec/path-selection-constraints.html ↩ ↩2
-
https://spec.torproject.org/tor-spec/create-created-cells.html#choosing-circid ↩
-
https://radar.cloudflare.com/adoption-and-usage#http-vs-https ↩
-
https://www.usenix.org/conference/usenixsecurity14/technical-sessions/presentation/wang_tao ↩
-
https://arstechnica.com/information-technology/2017/02/now-sites-can-fingerprint-you-online-even-when-you-use-multiple-browsers/ ↩
-
https://www.usenix.org/system/files/conference/usenixsecurity17/sec17-wang-tao.pdf ↩
-
https://www.ndss-symposium.org/wp-content/uploads/2017/09/wright.pdf ↩
-
https://www.freehaven.net/anonbib/cache/wpes11-panchenko.pdf ↩
-
https://www.comsys.rwth-aachen.de/fileadmin/papers/2020/2020-delacadena-trafficsliver.pdf ↩