Large Language Models (LLMs) increasingly mediate socio-technical systems where privacy judgments are critical, yet they often encode biased privacy norms learned from internet-scale training data. Prior work has predominantly focused on detecting behavioral privacy biases without understanding their mechanistic origins within the model weights. This paper addresses this gap by investigating whether privacy biases are localized in specific circuits of LLMs, using the Contextual Integrity (CI) framework.
We combine CI with mechanistic interpretability (MI) techniques to identify and analyze these circuits. Our approach constructs controlled vignette pairs that isolate key CI parameters and employs Edge Attribution Patching with Integrated Gradients (EAP-IG) on instruction-tuned LLMs to discover faithful circuits influencing privacy-related decisions. Results reveal specialized attention patterns in mid-to-late layers that faithfully differentiate appropriate from inappropriate information flows, with low structural overlap between circuits indicating modular privacy mechanisms.
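To make the attribution step concrete, the following is a minimal sketch of EAP-IG-style edge scoring on a toy one-hidden-layer network standing in for a transformer's component graph. The model, its weights, and the "edges" (hidden pre-activations) are illustrative assumptions, not the paper's actual implementation: each edge is scored by the clean-minus-corrupt activation difference times the gradient averaged along the straight-line interpolation between the two runs, as in integrated gradients.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a model's computation graph (hypothetical weights).
W1 = rng.normal(size=(4, 3))   # input -> hidden "edges"
W2 = rng.normal(size=(2, 4))   # hidden -> logits

def target_logit(z, target=0):
    """Metric: logit of the target class given hidden pre-activations z."""
    return (W2 @ np.tanh(z))[target]

def grad_wrt_hidden(z, target=0):
    """Analytic d(logit)/dz for logit = (W2 @ tanh(z))[target]."""
    return W2[target] * (1.0 - np.tanh(z) ** 2)

def eap_ig_scores(x_clean, x_corrupt, target=0, steps=16):
    """EAP-IG-style edge scores: activation difference times the gradient
    averaged over the clean->corrupt interpolation path."""
    z_clean, z_corrupt = W1 @ x_clean, W1 @ x_corrupt
    avg_grad = np.zeros_like(z_clean)
    for k in range(1, steps + 1):
        alpha = k / steps
        z = z_corrupt + alpha * (z_clean - z_corrupt)
        avg_grad += grad_wrt_hidden(z, target)
    avg_grad /= steps
    return (z_clean - z_corrupt) * avg_grad

# A vignette pair: the corrupt input minimally edits one CI parameter.
x_clean = np.array([1.0, 0.5, -0.2])
x_corrupt = np.array([1.0, -0.5, -0.2])
scores = eap_ig_scores(x_clean, x_corrupt)
print(scores)  # one attribution score per hidden edge
```

Edges with the largest absolute scores are candidate circuit members; in practice the same scoring is applied over attention-head and MLP edges of the full model, and a circuit is kept only if ablating everything outside it preserves the model's behavior (faithfulness).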
Our work bridges behavioral privacy evaluation with internal model interpretability, advancing tools for targeted circuit editing to mitigate privacy biases without costly full-model retraining. Our findings provide actionable insights for developing privacy-respecting AI systems.