German General Personas#
This resource lets you pose questions to a representative sample of 5,246 individual personas representing the German population.
The personas are derived from the German General Social Survey (ALLBUScompact). Because ALLBUS uses two-step randomized sampling, both the survey and the German General Personas reflect a representative picture of the German population with regard to its sociodemographic attributes, norms, and values.
The persona collection was first published in the work German General Personas: A Survey-Derived Persona Prompt Collection for Population-Aligned LLM Studies.
Why German General Personas?#
GGP offers several significant advantages:
Contextual Information: Personas enrich language models with relevant contextual information, enabling them to anchor predictions for specific tasks or target variables in empirically observed associations and connections within the German population.
Representative Alignment: The ALLBUS is a probability-based survey, and the personas derived from it are designed to represent the German population accurately. While there’s growing concern about biased representations in LLMs’ survey responses, GGP can potentially help align LLMs more effectively with the demographics and attitudes of the German population.
Novel Resource: GGP stands as a novel textual resource for researchers and practitioners in Natural Language Processing (NLP) and Computational Social Science (CSS).
We show how GGP can easily be used with the QSTN framework. This tutorial is also available as an interactive Jupyter notebook.
Imports#
First, all relevant Python packages must be imported, such as pandas and QSTN.
# General Imports
import io
import zipfile
import pandas as pd
import requests
# Either local inference with vllm or remote with AsyncOpenAI
from openai import AsyncOpenAI
# qstn Imports
from qstn.inference import response_generation
from qstn.parser import parse_json, raw_responses
from qstn.prompt_builder import LLMPrompt, generate_likert_options
from qstn.survey_manager import conduct_survey_single_item
from qstn.utilities import create_one_dataframe, placeholder
SEED = 42
Preparing the Dataset#
We will load the personas from the GGP repository.
You can define how many personas you want to load and interview, up to a total of 5,246. If you choose fewer than the available number, the personas are selected at random so that the subset remains a representative sample of the GGP collection.
PERSONAS_TO_LOAD = 20
zip_url = "https://github.com/germanpersonas/German-General-Personas/raw/main/GGP_all_topk_fulltext.zip"
response = requests.get(zip_url)
response.raise_for_status() # Check for errors
with zipfile.ZipFile(io.BytesIO(response.content)) as z:
    with z.open("pc_fulltext_sociodemographics_only.jsonl") as f:
        df_personas = pd.read_json(f, lines=True)
# Optionally draw a random subset; the original row number is kept in the "index" column
if PERSONAS_TO_LOAD:
    df_personas = df_personas.sample(n=PERSONAS_TO_LOAD).reset_index(drop=False)
df_personas = df_personas.rename(columns={0: "persona"})
print(f"Loaded {len(df_personas)} rows.")
display(df_personas.head())
Loaded 20 rows.
| | index | persona |
|---|---|---|
| 0 | 3135 | Du bist eine Person, 31 Jahre alt, weiblich un... |
| 1 | 3839 | Du bist eine Person, 61 Jahre alt und wohnhaft... |
| 2 | 3032 | Du bist eine Person, 54 Jahre alt, weiblich un... |
| 3 | 3492 | Du bist eine Person, 42 Jahre alt, männlich, m... |
| 4 | 1209 | Du bist eine Person, männlich, deutscher Staat... |
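The sampling call above does not pass a seed, so each run draws a different subset of personas. The sketch below illustrates seeded sampling with the standard library only; `sample_persona_ids` is a hypothetical helper and not part of QSTN or the GGP repository. With pandas, passing `random_state=SEED` to `DataFrame.sample` should have a comparable effect.

```python
import random

SEED = 42
TOTAL_PERSONAS = 5246  # size of the full GGP collection


def sample_persona_ids(n: int, total: int = TOTAL_PERSONAS, seed: int = SEED) -> list[int]:
    """Draw n distinct persona row indices; the same seed gives the same draw."""
    if not 0 < n <= total:
        raise ValueError(f"n must be between 1 and {total}, got {n}")
    rng = random.Random(seed)  # local generator, the global random state is untouched
    return sorted(rng.sample(range(total), n))


first = sample_persona_ids(20)
second = sample_persona_ids(20)
print(first == second)  # both draws use the same seed
```

Fixing the seed makes it possible to compare runs across models or output methods on an identical set of personas.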
In addition to the personas, we load example questions and answer options to conduct the survey.
However, QSTN lets you write or choose any questions you want to pose to the sampled personas.
Create your own questions#
If you want to create your own questions, define a JSON-style dictionary in the following format: each entry holds the question text in `statement` and the possible answers in `answers`. To pose your own questions to the LLM, set `USE_OWN_QUESTIONS` to `True`.
# Adjust this if you want to use your own questions
USE_OWN_QUESTIONS: bool = False
if USE_OWN_QUESTIONS:
    json_questionnaire = {
        "controversial": {
            "task_type": "question",
            "statement": "Wie stehen Sie zu Ananas auf Pizza?",
            "answers": ["1: Lecker!", "2: Schrecklich!"],
        },
        "election": {
            "task_type": "question",
            "statement": "Welche Partei würden Sie heute wählen?",
            "answers": [
                "1: CDU/CSU",
                "2: SPD",
                "3: FDP",
                "4: Bündnis90/Die Grünen",
                "5: Die Linke",
                "6: AFD",
                "7: Sonstige",
                "8: Ich wähle nicht.",
            ],
        },
        # You can add as many questions as you want
    }
    print("Using your own questions!")
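Since the questionnaire is written by hand, a quick format check before running the survey can catch slips early. The following is a minimal sketch, assuming the `"<index>: <text>"` answer format shown in the examples; `validate_questionnaire` is a hypothetical helper, not a QSTN function.

```python
import re

# Assumed answer format, taken from the examples in this tutorial: "<index>: <text>"
ANSWER_PATTERN = re.compile(r"^(\d+): (.+)$")


def validate_questionnaire(questionnaire: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    for item_id, item in questionnaire.items():
        for field in ("statement", "answers"):
            if field not in item:
                problems.append(f"{item_id}: missing field '{field}'")
        for position, answer in enumerate(item.get("answers", []), start=1):
            match = ANSWER_PATTERN.match(answer)
            if match is None:
                problems.append(f"{item_id}: answer {answer!r} is not '<index>: <text>'")
            elif int(match.group(1)) != position:
                problems.append(f"{item_id}: answer {answer!r} should have index {position}")
            elif re.search(r"\d+: ", match.group(2)):
                # Typical symptom of a missing comma between two string literals
                problems.append(f"{item_id}: answer {answer!r} looks like two merged answers")
    return problems


example = {
    "controversial": {
        "task_type": "question",
        "statement": "Wie stehen Sie zu Ananas auf Pizza?",
        "answers": ["1: Lecker!", "2: Schrecklich!"],
    },
    "broken": {
        "task_type": "question",
        "statement": "Welche Partei würden Sie heute wählen?",
        "answers": ["1: CDU/CSU", "2: SPD" "3: FDP"],  # comma left out on purpose
    },
}
problems = validate_questionnaire(example)
for problem in problems:
    print(problem)
```

Python silently concatenates adjacent string literals, so a forgotten comma in an answer list does not raise an error; a check like this makes the merged entry visible.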
ALLBUS Example Questions#
If you don’t want to use your own questions, you can use the questions of the ALLBUS, which were used in the original paper.
if not USE_OWN_QUESTIONS:
    # Import the example ALLBUS questions
    url = "https://raw.githubusercontent.com/germanpersonas/German-General-Personas/refs/heads/main/_strat_task_question.json"
    response = requests.get(url)
    if response.status_code == 200:
        json_questionnaire = response.json()
        print("Successfully loaded as dictionary. Using ALLBUS questions.")
        # Show one example item; only possible once the download has succeeded
        print(json_questionnaire["mp18"])
    else:
        print(f"Failed to retrieve file. Status code: {response.status_code}")
Successfully loaded as dictionary. Using ALLBUS questions.
{'task_type': 'question', 'statement': 'Ergeben sich Ihrer Meinung nach wegen der Flüchtlinge in Bezug auf das Zusammenleben in der Gesellschaft mehr Chancen, mehr Risiken oder weder noch?', 'answers': ['1: RISIKO UEBERWIEGT', '2: EHER RISIKO', '3: WEDER NOCH', '4: EHER CHANCE', '5: CHANCE UEBERWIEGT']}
To use the questions in the correct format, we have to adjust them to the qstn format and add them to a DataFrame.
questionnaire_list = []
for key, value in json_questionnaire.items():
    # Build one row per question in the format QSTN expects:
    # questionnaire_item_id and question_content
    questionnaire_item = {
        "questionnaire_item_id": key,
        "question_content": value["statement"],
    }
    # Add to the list
    questionnaire_list.append(questionnaire_item)
questionnaire = pd.DataFrame(questionnaire_list)
print(questionnaire.head(3))
questionnaire_item_id question_content
0 lp04 Sind Sie bei der folgenden Aussage derselben o...
1 pe05 Inwiefern stimmen Sie der folgenden Meinung zu...
2 mp18 Ergeben sich Ihrer Meinung nach wegen der Flüc...
For the answers, we record whether each question uses a from-to scale or categorical options. We only need the plain-text answers, without the numeric prefix.
all_cleaned_answers = []
for key, value in json_questionnaire.items():
    cleaned_answers = []
    from_to_scale = False
    for answer in value["answers"]:
        # Drop the leading "<index>: " prefix and keep the full answer text
        clean_text = answer.split(": ", 1)[1]
        # Heuristic: a minus sign in the text marks a from-to scale.
        # The flag is overwritten on every pass, so it reflects the last answer.
        from_to_scale = "-" in clean_text
        cleaned_answers.append(clean_text)
    all_cleaned_answers.append(
        {"question": key, "answer": cleaned_answers, "from_to_scale": from_to_scale}
    )
all_cleaned_answers[2]
{'question': 'mp18',
'answer': ['RISIKO UEBERWIEGT',
'EHER RISIKO',
'WEDER NOCH',
'EHER CHANCE',
'CHANCE UEBERWIEGT'],
'from_to_scale': False}
System Prompt, User Prompt and Personas#
Here we define the prompt structure for the interaction.
- `system_prompt`: Instructions for the model to adopt the specific `{persona}`.
- `prompt`: The main task input, which dynamically assembles:
  - The question for each entry in our questionnaire (`PROMPT_QUESTIONS`)
  - The answer choices (`PROMPT_OPTIONS`)
  - The specific formatting rules (`PROMPT_AUTOMATIC_OUTPUT_INSTRUCTIONS`), based on the `output_method` selected in the next step.
We use the prompt from the GGP paper in this case. Again, you can adjust it as needed.
system_prompt = "Nehme die Perspektive der folgenden Person ein: {persona}"
prompt = (
    "Welche der Antwortmöglichkeiten ist die Reaktion der Person auf folgende "
    f"Frage: {placeholder.PROMPT_QUESTIONS}\n"
    f"{placeholder.PROMPT_OPTIONS}\n"
    f"{placeholder.PROMPT_AUTOMATIC_OUTPUT_INSTRUCTIONS}"
)
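To make the assembly step concrete, here is a minimal sketch of how such a template could be filled for one persona and one question. The three token strings and the `fill_prompt` helper are hypothetical stand-ins chosen for illustration; the actual placeholder values and substitution logic are defined inside QSTN and may differ.

```python
# Hypothetical placeholder tokens; the real values live in qstn.utilities.placeholder
QUESTION_TOKEN = "{{QUESTION}}"
OPTIONS_TOKEN = "{{OPTIONS}}"
INSTRUCTIONS_TOKEN = "{{OUTPUT_INSTRUCTIONS}}"

system_template = "Nehme die Perspektive der folgenden Person ein: {persona}"
user_template = (
    "Welche der Antwortmöglichkeiten ist die Reaktion der Person auf folgende "
    f"Frage: {QUESTION_TOKEN}\n"
    f"{OPTIONS_TOKEN}\n"
    f"{INSTRUCTIONS_TOKEN}"
)


def fill_prompt(template: str, question: str, options: str, instructions: str) -> str:
    """Substitute the three tokens; plain replace avoids clashes with braces in the text."""
    return (
        template.replace(QUESTION_TOKEN, question)
        .replace(OPTIONS_TOKEN, options)
        .replace(INSTRUCTIONS_TOKEN, instructions)
    )


system_prompt_filled = system_template.format(persona="Du bist eine Person, 31 Jahre alt.")
user_prompt_filled = fill_prompt(
    user_template,
    question="Wie stehen Sie zu Ananas auf Pizza?",
    options="Antwortmöglichkeiten: 1: Lecker!, 2: Schrecklich!",
    instructions="Antworte nur mit der exakten Antwort.",
)
print(system_prompt_filled)
print(user_prompt_filled)
```

The persona goes into the system prompt once, while the user prompt is rebuilt for every questionnaire item with its own question, options, and output instructions.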
Configuration: Output Method#
Select the inference technique by assigning one of the keys below to output_method.
| Method | Description |
|---|---|
| `OPEN` | Full, unconstrained text generation. |
| `RESTRICTED_CHOICE` | Logits are restricted to exact answer possibilities only. |
| `REASONING_JSON` | JSON output containing a preliminary reasoning step. |
| `VERBALIZED_DISTRIBUTION` | Uncertainty estimation via verbalized probability, as in Meister et al. |
OUTPUT_METHODS = [
"OPEN",
"RESTRICTED_CHOICE",
"REASONING_JSON",
"VERBALIZED_DISTRIBUTION",
]
# Select your method here by copying one from above
output_method = "RESTRICTED_CHOICE"
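Because the method is selected by a plain string, a misspelled key matches none of the branches in the prompt-building function further below, so no response generation method is set and the run behaves like `OPEN`. A small guard, sketched here with the hypothetical helpers `check_output_method` and `expects_json`, can surface that early.

```python
OUTPUT_METHODS = [
    "OPEN",
    "RESTRICTED_CHOICE",
    "REASONING_JSON",
    "VERBALIZED_DISTRIBUTION",
]
# Methods whose responses are parsed as JSON later in this tutorial
JSON_METHODS = {"REASONING_JSON", "VERBALIZED_DISTRIBUTION"}


def check_output_method(name: str) -> str:
    """Return name unchanged if it is a known key, otherwise raise ValueError."""
    if name not in OUTPUT_METHODS:
        raise ValueError(f"Unknown output_method {name!r}; choose one of {OUTPUT_METHODS}")
    return name


def expects_json(name: str) -> bool:
    """True if responses for this method are expected to be JSON."""
    return check_output_method(name) in JSON_METHODS


output_method = check_output_method("RESTRICTED_CHOICE")
print(output_method, expects_json(output_method))
```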
Creating the LLM Prompts#
The key class we need to create for each of our personas is the `LLMPrompt` class.
1. To initialize it, we only need the `questionnaire`, the `persona` as `system_prompt`, and the `prompt` from before.
2. We define how the model should answer depending on the `output_method` chosen.
3. We call `prepare_prompt` to update each `LLMPrompt`.
def create_llm_prompts(row: pd.Series):
    persona_index = row.name
    persona_str = row["persona"]
    # ===== 1. Creating the LLMPrompt =====
    llm_prompt = LLMPrompt(
        questionnaire_source=questionnaire,
        questionnaire_name=str(persona_index),
        system_prompt=system_prompt.format(persona=persona_str),
        prompt=prompt,
    )
    # ===== 2. Defining the Output Method =====
    answer_options = {}
    for dic in all_cleaned_answers:
        answers = dic["answer"]
        from_to_scale = dic["from_to_scale"]
        rgm: response_generation.ResponseGenerationMethod = None
        # We change the ResponseGenerationMethod here. All other code stays the same
        if output_method == "OPEN":
            pass
        elif output_method == "RESTRICTED_CHOICE":
            rgm = response_generation.ChoiceResponseGenerationMethod(
                answers, output_template="Antworte nur mit der exakten Antwort."
            )
        elif output_method == "REASONING_JSON":
            rgm = response_generation.JSONReasoningResponseGenerationMethod(
                output_template=(
                    "Antworte nur im folgenden JSON format:\n"
                    f"{placeholder.JSON_TEMPLATE}"
                )
            )
        elif output_method == "VERBALIZED_DISTRIBUTION":
            rgm = response_generation.JSONVerbalizedDistribution()
        # We can check for robustness with generate_likert_options:
        # randomized or reversed option order, different indices, etc.
        if from_to_scale:
            answer_option = generate_likert_options(
                n=len(answers),
                answer_texts=answers,
                only_from_to_scale=True,
                scale_prompt_template="Antwortmöglichkeiten: {start} bis {end}",
                response_generation_method=rgm,
            )
        else:
            answer_option = generate_likert_options(
                n=len(answers),
                answer_texts=answers,
                list_prompt_template="Antwortmöglichkeiten: {options}",
                response_generation_method=rgm,
            )
        # Since we have different answer options for different questions we create a dict
        answer_options[dic["question"]] = answer_option
    # ===== 3. Updating the Prompts with our answer options =====
    llm_prompt.prepare_prompt(answer_options=answer_options)
    return llm_prompt
# Create an LLMPrompt for every persona in our dataframe
llm_prompts: list[LLMPrompt] = df_personas.apply(create_llm_prompts, axis=1).to_list()
Here we can see our final system and user prompts depending on the method we chose.
sys_prompt, user_prompt = llm_prompts[0].get_prompt_for_questionnaire_type(
    item_id="mp18"
)
print("SYSTEM:", sys_prompt)
print("USER:", user_prompt)
SYSTEM: Nehme die Perspektive der folgenden Person ein: Du bist eine Person, 31 Jahre alt, weiblich und polnische Staatsangehörigkeit. Du lebst in einer kleinen bis mittelgroßen Stadt in Ostdeutschland. Dein höchster Schulabschluss ist die Fachhochschulreife, und du besitzt einen Bachelor-Abschluss. Derzeit bist du als Angestellte in einer Sekretariatsfachkraft-Tätigkeit beschäftigt.
USER: Welche der Antwortmöglichkeiten ist die Reaktion der Person auf folgende Frage: Ergeben sich Ihrer Meinung nach wegen der Flüchtlinge in Bezug auf das Zusammenleben in der Gesellschaft mehr Chancen, mehr Risiken oder weder noch?
Antwortmöglichkeiten: 1: RISIKO UEBERWIEGT, 2: EHER RISIKO, 3: WEDER NOCH, 4: EHER CHANCE, 5: CHANCE UEBERWIEGT
Antworte nur mit der exakten Antwort.
Inference#
First, we set up the client used for inference. We use AsyncOpenAI here, pointed at a locally running vLLM server, which can be started with the following command:
vllm serve Qwen/Qwen3-VL-4B-Instruct --max-model-len=20000
model_id = "Qwen/Qwen3-VL-4B-Instruct"
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
generator = AsyncOpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
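Before starting a longer run, it can help to confirm that the server is reachable and actually serves the expected model. The sketch below uses only the standard library and assumes the server exposes the OpenAI-compatible model listing at `<base_url>/models`; `models_endpoint` and `list_served_models` are hypothetical helper names.

```python
import json
import urllib.error
import urllib.request


def models_endpoint(base_url: str) -> str:
    """Build the model-listing URL for an OpenAI-compatible server."""
    return base_url.rstrip("/") + "/models"


def list_served_models(base_url: str, timeout: float = 2.0) -> list[str]:
    """Return the served model ids, or an empty list if the server is unreachable."""
    try:
        with urllib.request.urlopen(models_endpoint(base_url), timeout=timeout) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        return []
    return [entry["id"] for entry in payload.get("data", [])]


if __name__ == "__main__":
    served = list_served_models("http://localhost:8000/v1")
    if "Qwen/Qwen3-VL-4B-Instruct" in served:
        print("Model is being served.")
    else:
        print(f"Model not found; server reported: {served}")
```

An empty list here usually means the server has not finished loading the model yet or is listening on a different port.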
QSTN then runs inference for each question separately. We set max_tokens=2000 here to cap overly long output.
If you want previous questions to influence the answers to later questions, you can use conduct_survey_sequential instead; note that it is not included in the imports above.
results = conduct_survey_single_item(
    generator,
    llm_prompts=llm_prompts,
    client_model_name=model_id,
    max_tokens=2000,
    seed=SEED,
)
# results = conduct_survey_sequential(
#     generator,
#     llm_prompts=llm_prompts,
#     client_model_name=model_id,
#     max_tokens=2000,
#     seed=SEED,
# )
Finally, we can parse the output and get a pandas DataFrame with the answers.
# If we expect JSON output we can automatically parse it
if output_method in ("REASONING_JSON", "VERBALIZED_DISTRIBUTION"):
    parsed_results = parse_json(results)
# Otherwise we just want the raw responses
else:
    parsed_results = raw_responses(results)
# To get all responses in the same dataframe, we can use the helper method
full_results = create_one_dataframe(parsed_results)
display(full_results.head())
| | questionnaire_name | questionnaire_item_id | question | llm_response | logprobs | reasoning |
|---|---|---|---|---|---|---|
| 0 | 0 | lp04 | Sind Sie bei der folgenden Aussage derselben o... | BIN ANDERER MEINUNG | None | None |
| 1 | 0 | pe05 | Inwiefern stimmen Sie der folgenden Meinung zu... | STIMME EHER NICHT ZU | None | None |
| 2 | 0 | mp18 | Ergeben sich Ihrer Meinung nach wegen der Flüc... | EHER RISIKO | None | None |
| 3 | 0 | mm01 | Inwieweit stimmen Sie der folgenden Aussage zu... | 1 (1-7 "STIMME GAR NICHT ZU"-"STIMME VOLL+GANZ... | None | None |
| 4 | 0 | vi10 | Wie wichtig ist es für Sie persönlich 'sich po... | 4 (1-7 "UNWICHTIG"-"SEHR WICHTIG") | None | None |
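A natural next step is to summarize the answers per question across personas. The sketch below works on plain dictionaries shaped like the rows of the table above; the values are illustrative and not real results, and `normalize_response` and `answer_shares` are hypothetical helpers. It assumes that scale answers start with the chosen number, as in `4 (1-7 "UNWICHTIG"-"SEHR WICHTIG")`.

```python
import re
from collections import Counter, defaultdict

# Rows shaped like the result table; the values are illustrative, not real results
rows = [
    {"questionnaire_item_id": "mp18", "llm_response": "EHER RISIKO"},
    {"questionnaire_item_id": "mp18", "llm_response": "EHER RISIKO"},
    {"questionnaire_item_id": "mp18", "llm_response": "WEDER NOCH"},
    {"questionnaire_item_id": "mp18", "llm_response": "EHER CHANCE"},
    {"questionnaire_item_id": "vi10", "llm_response": '4 (1-7 "UNWICHTIG"-"SEHR WICHTIG")'},
    {"questionnaire_item_id": "vi10", "llm_response": '6 (1-7 "UNWICHTIG"-"SEHR WICHTIG")'},
]

LEADING_NUMBER = re.compile(r"^\s*(\d+)\b")


def normalize_response(response: str) -> str:
    """Reduce scale answers such as '4 (1-7 ...)' to '4'; keep other answers unchanged."""
    match = LEADING_NUMBER.match(response)
    return match.group(1) if match else response.strip()


def answer_shares(result_rows: list[dict]) -> dict[str, dict[str, float]]:
    """Relative frequency of each normalized answer, per questionnaire item."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for row in result_rows:
        counts[row["questionnaire_item_id"]][normalize_response(row["llm_response"])] += 1
    return {
        item_id: {answer: n / sum(counter.values()) for answer, n in counter.items()}
        for item_id, counter in counts.items()
    }


shares = answer_shares(rows)
print(shares["mp18"])
print(shares["vi10"])
```

On the real DataFrame, grouping by `questionnaire_item_id` and calling `value_counts(normalize=True)` on `llm_response` should give a comparable summary. Note that the leading-number rule would also shorten a categorical answer that happens to start with a digit.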