really super anonymous surveys

#project #cryptography #privacy #vite #typescript #preact #python #cypress #fly #fastapi #redis #webdev #llms

## why make it?

i noticed while using survey tools like microsoft forms or google forms that they don't do a great job of protecting the anonymity of respondents to an anonymous survey. there seemed to be a few obvious things which could be changed about how these surveys work to improve this situation. there also seemed to be an opportunity to do something a bit new and interesting: using llms as a countermeasure against stylometry as a means of deanonymisation. it also felt like a nice micro project which is easy to complete to a decent level in a short period.

### why do anonymous surveys at all?

anonymous surveys, like secret ballots, provide a way of soliciting honest opinions within power structures which discourage - to some extent - honesty. there are two possible fixes if you value honest opinions:

1. remove the power structures
2. remove the discouragement of honesty

1 is pretty hard, especially as group size grows. 2 is also hard without 1, although there are things which you can do like rewarding honesty. even then, it is hard to know how well you are doing unless you have some baseline to check against. that is where using anonymity to encourage honesty comes in. assuming honesty is proportional to anonymity, we should try to maximise anonymity.

if your goal is 2 but you want to operate candidly (anonymity has some negative consequences too!) you can add control questions e.g. how would you have answered differently in a non-anonymous context? or in the secret ballot example, how would you have voted in a non-secret ballot?

receiving honest feedback is often really valuable.

## how it turned out

you can check it out at https://rsurva.reuben.codes and you can see the code at https://github.com/rested/rsurva.
in particular the https://rsurva.reuben.codes/how-it-works section describes exactly how it works and what the limitations are. this is also reproduced in [[really super anonymous surveys#how it works reproduced]].

you can also check out [[really super anonymous surveys#screenshots]] to avoid extra clicks/tabs.

## reflections on building it

### frontend is more enjoyable now

my last time doing much fe was a while back. of course the following is all caveated by this being a super simple project but:

* tools like https://vitejs.dev/ make bundling and hot reloading nice and snappy - the app is pretty small but that super fast dev loop is great
* css frameworks like https://daisyui.com/ / https://tailwindcss.com/ are great for getting things looking decent without having much in the way of ui skills and feel a lot more flexible than component libraries like material ui (https://mui.com/material-ui/)
* https://www.cypress.io/ feels more mature than when i last used it and i ran into very few setup issues (even on wsl2). https://playwright.dev/ seems to be more popular now though, so i will try that next time
* using llms (https://www.continue.dev/ in vscode) makes refactoring and iterating involve a lot less boring toil - the llm does it for you

### tools like https://fly.io are great for simple hobby projects

being able to have a serverless application which spins up from cold pretty fast (even with python) and a https://upstash.com/ redis instance at barely any cost is really nice. having batteries included regarding cli, basic monitoring, etc is also great.

### leaving features un-implemented is good

considering how to do time-locking of responses as detailed in [[really super anonymous surveys#👨‍💻🐱‍💻Survey Owner Controlling Backend]], i came across what would be a quite cool and interesting library - https://github.com/drand/tlock-js.
while it's nice to know about these, they are the sorts of things which make small simple projects become larger and less likely to be published

### learnt that rsa has hard limits on how much data it encrypts

my initial approach was to encrypt answer data with the public key using rsa only. this caused a funny bug which i didn't catch with the initial e2e tests as i was just using a small amount of data. for an explanation of the limits see https://security.stackexchange.com/questions/33434/rsa-maximum-bytes-to-encrypt-comparison-to-aes-in-terms-of-security. the approach of RSA + AES was ultimately the fix in https://github.com/Rested/RSurvA/pull/4

## screenshots

### creating a survey

![[Pasted image 20240805231439.png]]
![[Pasted image 20240805231508.png]]

### sharing

![[Pasted image 20240805233134.png]]

### filling in a survey

![[Pasted image 20240805233322.png]]

### adversarial stylometry

![[Pasted image 20240805233353.png]]
![[Pasted image 20240805233506.png]]

### viewing responses

![[Pasted image 20240805233532.png]]
![[Pasted image 20240805233548.png]]

## how it works reproduced

[rsurva.reuben.codes](https://rsurva.reuben.codes/how-it-works)

RSurvA

**Really Super Anonymous Surveys**

Surveys which claim to be anonymous often do very little to ensure that they actually are. **RSurvA** tries to do anonymous surveys better.

### 🔑Key Benefits

- **Client-Side Encryption:** Answers are encrypted with an **RSA** public key on the client side before transmission, meaning they are stored encrypted on the server, keeping answers private.
- **Conditional Access:** Survey responses are accessible only after the survey duration has ended and the minimum response threshold is met, helping to protect participant anonymity.
- **Private Key Decryption:** Encrypted answers can only be unlocked by the survey owner after the survey duration has completed, using a private key only they have access to.
- **Open Source:** This project is open-source, which means it is auditable and can be self-hosted.
Check out the code on [GitHub](https://github.com/rested/RSurvA).

### ☑How To Create A Survey

1. Create An Anonymous Survey

   Start by creating a new survey. Enter a name for your survey and add as many questions as you like. The questions can be edited at any time before sharing the survey.

2. Customize Survey Settings

   Choose the duration for which the survey will be active and set a minimum number of responses required before the survey results are available.

3. Share Your Survey

   As people start answering your survey, their responses will be saved encrypted on the server using the public key corresponding to the private key you saved earlier. Responses will not be available for you to decrypt until the survey duration and minimum response count conditions are met, to help ensure respondent anonymity.

4. View Results

   After the survey duration has passed and if the minimum responses requirement is met, you will be able to decrypt the collected answers using the private key you saved earlier.

### 🏋️‍♂️Motivations

Surveys which claim to be anonymous often are not.

- They often require login, meaning the server knows exactly who provided which answer.
- They may well store answers unencrypted, making them viewable by any entity with access to the server.
- They often allow survey owners to see responses as they come in, making it possible to correlate them with when respondents saw the survey link.
- They also offer no stylometry counter-measures (to stop the survey owner from identifying respondents using stylometry).
- Finally, they often do nothing to randomize responses, making it easier to identify respondents by viewing all their answers at once and applying stylometry or other information to this broader dataset.

RSurvA attempts to address all of these issues by providing a simple, low-trust approach. See the [Limitations & Mitigations](https://rsurva.reuben.codes/how-it-works#limitations-and-mitigations) section for details on how.
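
The release rule from the steps above (results are only returned once the survey duration has passed and the minimum response count is met) can be sketched roughly as follows. This is a minimal illustration, not RSurvA's actual backend code: the `Survey` record, its field names and the `results_available` helper are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Survey:
    # Hypothetical record; field names are illustrative, not RSurvA's schema.
    created_at: datetime
    duration: timedelta
    min_responses: int
    response_count: int


def results_available(survey: Survey, now: datetime) -> bool:
    """Return True only when both release conditions hold."""
    duration_passed = now >= survey.created_at + survey.duration
    threshold_met = survey.response_count >= survey.min_responses
    return duration_passed and threshold_met


start = datetime(2024, 8, 5, tzinfo=timezone.utc)
survey = Survey(created_at=start, duration=timedelta(days=7), min_responses=5, response_count=4)

print(results_available(survey, start + timedelta(days=8)))  # False: only 4 of 5 responses
survey.response_count = 5
print(results_available(survey, start + timedelta(days=3)))  # False: survey still running
print(results_available(survey, start + timedelta(days=8)))  # True: both conditions met
```

Requiring both conditions together is what stops the owner from reading answers one at a time as they arrive.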
### 🤨😇Limitations and Mitigations

#### 🕵️‍♂️❔Identifying Respondent Through Answers Across Questions

**Limitation:** The survey owner may attempt to look at answers across questions to uncover the respondent's identity.

**Possible mitigation:** Do nothing if each question's answers are sufficiently anonymous.

**Chosen mitigation:** Shuffle the encrypted answers (on a per-question basis) on the server before sending them to the survey owner.

So real answers:

1. What is your favourite colour?
   - green
   - blue
2. What letter does your favourite colour begin with?
   - g
   - b

Becomes (in one permutation):

1. What is your favourite colour?
   - blue
   - green
2. What letter does your favourite colour begin with?
   - g
   - b

#### ✍️🔍Stylometric Analysis

**Limitation:** The survey owner may attempt [stylometric](https://en.wikipedia.org/wiki/Stylometry) analysis to determine who said what. References to particular people or events may reveal the respondent's identity.

**Possible mitigations:**

- LLMs can transform text to be sound, safe, and sensible (criteria described [here](https://en.wikipedia.org/wiki/Adversarial_stylometry)), but running these in the browser is currently problematic due to limited WebGPU support.
- Using third-party networks like huggingface for model serving introduces hosting costs and dependencies.

**Chosen mitigation:** Allow the user to click on a button which provides a prompt they can use with an LLM they trust to transform their answer. If some BYOLLM (bring your own LLM) APIs become available, this should shift to use them.

The prompt for the question "Are birds real?" and answer "Nah birds ain't real" is as follows:

`You are a world class adverserial stylometry algorithm.
Your transformations of text should follow the following criteria: * safety, meaning that stylistic characteristics are reliably eliminated * soundness, meaning that the semantic content of the text is not unacceptably altered * sensible, meaning that the output is "well-formed and inconspicuous" You should imitate the style of William Faulkner. You should preserve the language of the original text if it is not english, but try to imitate the style of William Faulkner regardless. You should use the most commonly spoken regional variety of a language for example with English you should always use American English. Any references to specific events, people or places which risk de-anonymising the author should be rephrased to reduce this risk. The user is answering the following question: 'Are birds real?' in an anonymous survey. The original answer is as follows. Please provide the transformed text with no additional styling or explanation. --- Nah birds ain't real`

This gives _No, birds do not exist._ in gpt-4.

#### 👨‍💻🐱‍💻Survey Owner Controlling Backend

**Limitation:** If the survey owner controls the backend, they could retrieve the answers at any point after survey creation using both the private key (as the survey owner) and the encrypted answers (as the server/backend controller).

**Possible mitigations:** Timelock encryption would be a fairly nice way to remove this trust in the server owner, but it is a fairly hard problem ([gwern has a good write up](https://gwern.net/self-decrypting)). The most viable implementation would be [tlock-js](https://github.com/drand/tlock-js), which would avoid this server owner trust at the expense of introducing a 3rd party network dependency.

**Chosen mitigation:** While tlock-js could be viable, it introduces a networked 3rd party dependency that needs to be accessible and not compromised when the respondent encrypts or the survey owner attempts to decrypt.
Early decryption has the same risks as the [minimum response threshold faking](https://rsurva.reuben.codes/how-it-works#min-resp-faking) and benefits from the same mitigations. The difference is that it has a slightly stronger condition in that there is a timing element. For example, if the survey is posted to a public channel and it's known that the public channel will only be viewed from 10 am by Alice and from 12 pm by Bob, then an answer decrypted at 11 am is likely from Alice. This can also bypass mitigations around question correlation. For this reason we strengthen the advice to the user to check that at least the minimum number of respondents are likely to have seen the message by that time.

That said, contributions for optional usage of tlock-js would be welcomed on the [GitHub](https://github.com/rested/RSurvA)!

#### 📂🔓Questions Stored Unencrypted

**Limitation:** The backend stores the questions unencrypted, so they can be read by anyone with access to the link.

**Possible mitigations:**

- In order to store them encrypted on the server, and not provide anything other than the link to respondents, the link would need to contain a decryption key.
- If the server which serves the frontend is different from the server which serves the backend (which it is in this case - cloudflare pages vs fly.io fastapi app), then we could conceive of a link generated by the frontend which consists of a reference to the backend identifier to fetch the survey with and a decryption key for the encrypted questions and survey title.
- If we relaxed the constraint of needing to provide the respondent with just a link, we could ask the survey owner to share the link and decryption key separately. That way, the user could paste the decryption key in, avoiding the frontend server seeing the key.

**Chosen mitigation:** Do nothing. It is likely that the entity which controls the frontend server also controls the backend server.
This makes the first mitigation useless, and the sacrifice in usability for the second option isn't warranted given that we already assume some level of trust in the server.

That said, contributions for optional question/title encryption would be welcomed on the [GitHub](https://github.com/rested/RSurvA)!

#### 🔒🛠️Tampering with Public Key

**Limitation:** The backend could tamper with the public key sent to users.

**Possible mitigations:** Using the public key as the URL would allow users to verify it, but would make URLs much longer.

**Chosen mitigation:** Display the public key on the frontend and ask users to verify it with the survey owner.

#### 🔗🌐Public Survey URLs

**Limitation:** The survey URLs are public.

**Chosen mitigation:** URLs are hard to guess (64 hex chars giving ~256 bits of entropy). The risk of sharing the link should be accepted by the survey owner.

#### 👨‍💻📝Survey Owner Responding to Their Own Survey

**Limitation:** The survey owner can respond to their own survey, limiting the effectiveness of the minimum responses threshold as the owner could create more responses themselves.

**Possible mitigations:**

- Hide response links from the survey owner, but this would put more trust in the server by requiring it to distribute these links (e.g., via email).
- Add some form of login and disallow the survey owner from answering, but this could complicate the process.
- Use frontend state (local storage), which is unreliable and provides false security.

**Chosen mitigation:** Inform the respondent and ask them to ensure the survey link was sent in a public channel with at least the mentioned number of respondents during the survey period. A survey owner adding fake responses would not be able to tell the real responses apart in this case.

#### 👥🔢Respondent Duplicate Responses

**Limitation:** Respondents could 'over-respond' to hide signal in noise.
**Possible mitigations:** We could end the survey on the server after a given number (x) of responses above the min responses. However, this could have undesired consequences, for example where the audience of possible respondents is larger than x, and the survey owner would still not know whether all of the x responses were from one respondent.

**Chosen mitigation:** Leave this risk largely unmitigated, assuming the effort to create such a response would not be worth it.

### ↔Sequence Diagram

```mermaid
sequenceDiagram;
    participant Survey Owner
    participant Frontend
    participant Backend
    participant Respondent
    Survey Owner->>Frontend: Create New Survey (Enter Details, Add Questions)
    Frontend->>Frontend: Generate RSA Key Pair
    Frontend->>Backend: Store Survey Details and Public Key
    Frontend->Survey Owner: Display Private Key, Public Key, and Survey Link
    Survey Owner->Survey Owner: Save Private Key Securely
    Survey Owner->>Respondent: Share Survey Link
    Respondent->>Frontend: Access Survey Link
    Frontend->>Backend: Fetch Survey Details (Questions and Public Key)
    Backend->>Frontend: Return Questions and Public Key
    Frontend->Respondent: Display Questions
    Respondent->Frontend: Submit Answers
    Frontend->>Frontend: Encrypt Answers with Public Key
    Frontend->>Backend: Store Encrypted Answers
    Survey Owner->>Frontend: Check on survey
    Frontend->>Backend: Fetch Survey Details and Encrypted Answers
    Backend->>Backend: Check Conditions (Duration & Min Responses Met)
    Backend-->>Frontend: Return Encrypted Answers (If conditions met)
    Survey Owner->>Frontend: Provide Private Key
    Survey Owner->Frontend: Request Decryption
    Frontend->Frontend: Decrypt Responses Using Private Key
    Frontend->>Survey Owner: Decrypted Responses
```

### 🧱Infrastructure

The frontend code (where encryption happens) is hosted on [Cloudflare Pages](https://pages.cloudflare.com/). The backend is hosted on [Fly.io](https://fly.io/).
The encrypted answer data and unencrypted question data are stored on [Fly Upstash for Redis](https://fly.io/docs/reference/redis/). You can check the [code](https://github.com/rested/RSurvA) being used and the actions used to deploy it, or deploy it to your own infra.

```mermaid
graph LR;
    G[GitHub Repository]
    F[Frontend deployed on Cloudflare Pages]
    B[Backend deployed on Fly.io]
    R[Redis on Fly Upstash]
    G -- deploys to --> F
    G -- deploys to --> B
    F -- API Calls --> B
    B -- stores data in --> R
```
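
As a closing illustration, the per-question shuffling described under Limitations and Mitigations (applied when the backend returns encrypted answers) might look something like the sketch below. It assumes a hypothetical storage layout in which each response is a list of ciphertext strings, one per question, in question order, and that every response answers every question. The function name and data shapes are illustrative only and are not taken from the RSurvA codebase.

```python
import secrets


def shuffle_per_question(responses, rng=None):
    """Regroup ciphertexts by question and shuffle each group independently.

    `responses` is a list of responses, each a list of ciphertexts in
    question order (hypothetical layout, assumed for this sketch).
    """
    rng = rng or secrets.SystemRandom()  # OS-backed randomness by default
    if not responses:
        return []
    # Transpose: one list of ciphertexts per question instead of per respondent.
    per_question = [list(column) for column in zip(*responses)]
    for answers in per_question:
        # Independent shuffles break the link between a respondent's answers.
        rng.shuffle(answers)
    return per_question


# Placeholder strings stand in for real ciphertexts.
responses = [["enc(green)", "enc(g)"], ["enc(blue)", "enc(b)"]]
shuffled = shuffle_per_question(responses)
print(shuffled)
```

Because each question's answers are shuffled separately, the position of an answer under one question says nothing about which answer under another question came from the same respondent.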