German General Personas#

This resource allows you to ask questions to a representative sample of 5,246 individual personas representing the German population.

The collection is derived from the German General Social Survey (ALLBUScompact). ALLBUS uses two-step randomized sampling, so both the survey and the German General Personas derived from it reflect a representative picture of the German population with regard to sociodemographic attributes, norms, and values.

The persona collection was first published in the work German General Personas: A Survey-Derived Persona Prompt Collection for Population-Aligned LLM Studies.

Why German General Personas?#

GGP offers several significant advantages:

  • Contextual Information: Personas enrich language models with relevant contextual information, enabling them to anchor predictions for specific tasks or target variables in empirically observed associations and connections within the German population.

  • Representative Alignment: The ALLBUS is a probability-based survey, and the personas derived from it are designed to represent the German population accurately. While there’s growing concern about biased representations in LLMs’ survey responses, GGP can potentially help align LLMs more effectively with the demographics and attitudes of the German population.

  • Novel Resource: GGP stands as a novel textual resource for researchers and practitioners in Natural Language Processing (NLP) and Computational Social Science (CSS).

We show how GGP can easily be used with the QSTN framework. This tutorial is also available as an interactive Jupyter notebook.

Imports#

First, all relevant Python packages must be imported, such as pandas and QSTN.

# General Imports
import io
import zipfile

import pandas as pd
import requests

# Either local inference with vllm or remote with AsyncOpenAI
from openai import AsyncOpenAI
# qstn Imports
from qstn.inference import response_generation
from qstn.parser import parse_json, raw_responses
from qstn.prompt_builder import LLMPrompt, generate_likert_options
from qstn.survey_manager import conduct_survey_single_item
from qstn.utilities import create_one_dataframe, placeholder
SEED = 42

Preparing the Dataset#

We will load the personas from the GGP repository.

You can define how many personas to load and interview, up to all 5,246. If you choose fewer than the total, personas are drawn uniformly at random so that the subset remains a representative sample of the GGP collection. Setting PERSONAS_TO_LOAD to None (or 0) loads the full collection.

PERSONAS_TO_LOAD = 20
zip_url = "https://github.com/germanpersonas/German-General-Personas/raw/main/GGP_all_topk_fulltext.zip"

response = requests.get(zip_url)
response.raise_for_status()  # Check for errors

with zipfile.ZipFile(io.BytesIO(response.content)) as z:
    with z.open("pc_fulltext_sociodemographics_only.jsonl") as f:
        df_personas = pd.read_json(f, lines=True)

# Sample only when a subset was requested; the original row number is kept as "index"
if PERSONAS_TO_LOAD:
    df_personas = df_personas.sample(n=PERSONAS_TO_LOAD).reset_index(drop=False)

df_personas = df_personas.rename(columns={0: "persona"})
print(f"Loaded {len(df_personas)} rows.")
display(df_personas.head())
Loaded 20 rows.
   index  persona
0  3135   Du bist eine Person, 31 Jahre alt, weiblich un...
1  3839   Du bist eine Person, 61 Jahre alt und wohnhaft...
2  3032   Du bist eine Person, 54 Jahre alt, weiblich un...
3  3492   Du bist eine Person, 42 Jahre alt, männlich, m...
4  1209   Du bist eine Person, männlich, deutscher Staat...
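The sampling call shown here is unseeded, so each run draws a different subset. As a minimal sketch that is not part of the original tutorial, the following standard-library snippet shows a seeded selection of persona row numbers; `sample_persona_ids` is a hypothetical helper introduced only for illustration.

```python
import random

TOTAL_PERSONAS = 5246  # size of the GGP collection, as stated above
SEED = 42


def sample_persona_ids(
    n: int, total: int = TOTAL_PERSONAS, seed: int = SEED
) -> list[int]:
    """Draw n distinct row numbers uniformly at random, reproducibly."""
    if not 0 < n <= total:
        raise ValueError(f"n must be between 1 and {total}, got {n}")
    rng = random.Random(seed)  # private generator, global random state stays untouched
    return sorted(rng.sample(range(total), n))


persona_ids = sample_persona_ids(20)
print(len(persona_ids))
```

With pandas, passing `random_state=SEED` to `DataFrame.sample` makes the draw reproducible in the same way.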

In addition to the personas, we load example questions and answer options to conduct the survey.

However, with QSTN you are free to write or choose your own questions to ask this representative sample of the German population.

Create your own questions#

If you want to create your own questions, define a JSON-style dictionary in the following format: each entry contains the question text in statement and the possible answers in answers. To ask your own questions, set USE_OWN_QUESTIONS to True.

# Adjust this if you want to use your own questions
USE_OWN_QUESTIONS: bool = False

if USE_OWN_QUESTIONS:
    json_questionnaire = {
        "controversial": {
            "task_type": "question",
            "statement": "Wie stehen Sie zu Ananas auf Pizza?",
            "answers": ["1: Lecker!", "2: Schrecklich!"],
        },
        "election": {
            "task_type": "question",
            "statement": "Welche Partei würden Sie heute wählen?",
            "answers": [
                "1: CDU/CSU",
                "2: SPD",
                "3: FDP",
                "4: Bündnis90/Die Grünen",
                "5: Die Linke",
                "6: AFD",
                "7: Sonstige",
                "8: Ich wähle nicht.",
            ],
        },
        # You can add as many questions as you want
    }
    print("Using your own questions!")
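Hand-written questionnaires are easy to get subtly wrong, for example when a comma between two answers is forgotten and Python silently joins the adjacent strings. The following is a minimal sketch of a format check, assuming the "N: text" answer convention used in this tutorial; `validate_questionnaire` is a hypothetical helper, not part of QSTN.

```python
import re

# Answers are expected to look like "<number>: <text>", e.g. "1: Lecker!"
ANSWER_PATTERN = re.compile(r"^(\d+): (.+)$")


def validate_questionnaire(questionnaire: dict) -> list[str]:
    """Return human-readable problems; an empty list means the format looks fine."""
    problems = []
    for item_id, item in questionnaire.items():
        if not item.get("statement"):
            problems.append(f"{item_id}: missing 'statement'")
        answers = item.get("answers") or []
        if not answers:
            problems.append(f"{item_id}: missing 'answers'")
        for answer in answers:
            match = ANSWER_PATTERN.match(answer)
            if match is None:
                problems.append(f"{item_id}: {answer!r} lacks an 'N: text' prefix")
            elif re.search(r"\d+: ", match.group(2)):
                # Heuristic: a second "N: " inside the text hints at a missing comma
                problems.append(f"{item_id}: {answer!r} looks like two merged answers")
    return problems


good = {
    "controversial": {
        "task_type": "question",
        "statement": "Wie stehen Sie zu Ananas auf Pizza?",
        "answers": ["1: Lecker!", "2: Schrecklich!"],
    }
}
# Adjacent string literals without a comma are concatenated by Python
bad = {
    "election": {
        "task_type": "question",
        "statement": "Welche Partei würden Sie heute wählen?",
        "answers": ["7: Sonstige" "8: Ich wähle nicht."],
    }
}

print(validate_questionnaire(good))
print(validate_questionnaire(bad))
```

The merged-answer check is only a heuristic and would also flag legitimate answer texts that happen to contain a number followed by a colon.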

ALLBUS Example Questions#

If you don’t want to use your own questions, you can use the questions of the ALLBUS, which were used in the original paper.

if not USE_OWN_QUESTIONS:
    # import example ALLBUS questions
    url = "https://raw.githubusercontent.com/germanpersonas/German-General-Personas/refs/heads/main/_strat_task_question.json"

    response = requests.get(url)
    response.raise_for_status()  # Stop here if the download failed

    json_questionnaire = response.json()
    print("Successfully loaded as dictionary. Using ALLBUS questions.")

    print(json_questionnaire["mp18"])
Successfully loaded as dictionary. Using ALLBUS questions.
{'task_type': 'question', 'statement': 'Ergeben sich Ihrer Meinung nach wegen der Flüchtlinge in Bezug auf das Zusammenleben in der Gesellschaft mehr Chancen, mehr Risiken oder weder noch?', 'answers': ['1: RISIKO UEBERWIEGT', '2: EHER RISIKO', '3: WEDER NOCH', '4: EHER CHANCE', '5: CHANCE UEBERWIEGT']}

To use the questions in the correct format, we have to adjust them to the qstn format and add them to a DataFrame.

questionnaire_list = []

for key, value in json_questionnaire.items():
    # Create a new empty dict for this row
    questionnaire_item = {}

    # Update it with the specific format: questionnaire_item_id and question_content
    questionnaire_item.update(
        {"questionnaire_item_id": key, "question_content": value["statement"]}
    )

    # Add to the list
    questionnaire_list.append(questionnaire_item)
questionnaire = pd.DataFrame(questionnaire_list)
print(questionnaire.head(3))
  questionnaire_item_id                                   question_content
0                  lp04  Sind Sie bei der folgenden Aussage derselben o...
1                  pe05  Inwiefern stimmen Sie der folgenden Meinung zu...
2                  mp18  Ergeben sich Ihrer Meinung nach wegen der Flüc...

For the answers, we record whether they form a from-to scale or are categorical. We only need the plain-text answers, without the numeric prefix.

all_cleaned_answers = []
for key, value in json_questionnaire.items():
    cleaned_answers = []
    from_to_scale = False

    for answer in value["answers"]:
        # Drop the leading "<index>: " and keep only the plain answer text
        clean_text = answer.split(": ", 1)[1]
        cleaned_answers.append(clean_text)

        # Heuristic: a minus sign in the text marks a from-to scale.
        # The flag reflects the last answer of each question.
        from_to_scale = "-" in clean_text

    all_cleaned_answers.append(
        {"question": key, "answer": cleaned_answers, "from_to_scale": from_to_scale}
    )
all_cleaned_answers[2]
{'question': 'mp18',
 'answer': ['RISIKO UEBERWIEGT',
  'EHER RISIKO',
  'WEDER NOCH',
  'EHER CHANCE',
  'CHANCE UEBERWIEGT'],
 'from_to_scale': False}
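The cleaning step can also be packaged as a small function, which makes it easier to test on individual questions. This is a sketch under two assumptions that differ slightly from the loop above: answers without a prefix are kept unchanged, and a hyphen in any answer text (not only the last one) flags a from-to scale. `clean_answers` is a hypothetical name.

```python
def clean_answers(answers: list[str]) -> tuple[list[str], bool]:
    """Strip the leading "<index>: " prefix and flag from-to scales."""
    cleaned = []
    for answer in answers:
        # partition splits on the first ": " only, so colons inside the text survive
        _prefix, separator, text = answer.partition(": ")
        cleaned.append(text if separator else answer)
    # Assumption: a hyphen in any answer text marks a from-to scale
    from_to_scale = any("-" in text for text in cleaned)
    return cleaned, from_to_scale


# Answer options of question mp18, as printed earlier in this tutorial
mp18_answers = [
    "1: RISIKO UEBERWIEGT",
    "2: EHER RISIKO",
    "3: WEDER NOCH",
    "4: EHER CHANCE",
    "5: CHANCE UEBERWIEGT",
]
cleaned, is_scale = clean_answers(mp18_answers)
print(cleaned, is_scale)
```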

System Prompt, User Prompt and Personas#

Here we define the prompt structure for the interaction.

  • system_prompt: Instructions for the model to adopt the specific {persona}.

  • prompt: The main task input which dynamically assembles:

    • The question for each entry in our questionnaire (PROMPT_QUESTIONS)

    • The answer choices (PROMPT_OPTIONS)

    • The specific formatting rules (PROMPT_AUTOMATIC_OUTPUT_INSTRUCTIONS) based on the output_method selected in the next step.

We use the prompt from the GGP paper here; you can adjust it as needed.

system_prompt = "Nehme die Perspektive der folgenden Person ein: {persona}"
prompt = (
    "Welche der Antwortmöglichkeiten ist die Reaktion der Person auf folgende "
    f"Frage: {placeholder.PROMPT_QUESTIONS}\n"
    f"{placeholder.PROMPT_OPTIONS}\n"
    f"{placeholder.PROMPT_AUTOMATIC_OUTPUT_INSTRUCTIONS}"
)
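Because the system prompt is filled in with `str.format`, any literal curly braces in a customized template have to be doubled. The sketch below illustrates this with the persona text shown in this tutorial (shortened to its first sentence); `custom_template` is a hypothetical variation and not a prompt from the GGP paper.

```python
system_prompt = "Nehme die Perspektive der folgenden Person ein: {persona}"
persona = (
    "Du bist eine Person, 31 Jahre alt, weiblich und polnische Staatsangehörigkeit."
)

# {persona} is the only replacement field, so format() just inserts the text
filled = system_prompt.format(persona=persona)
print(filled)

# Hypothetical custom template: literal braces are written as {{ and }}
custom_template = 'Person: {persona}\nAntworte im Format {{"antwort": "<Text>"}}'
custom_filled = custom_template.format(persona=persona)
print(custom_filled)
```

Braces inside the persona text itself are unproblematic, since `str.format` only interprets the template and not the inserted values.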

Configuration: Output Method#

Select the inference technique by assigning one of the keys below to output_method.

Method                   Description
OPEN                     Full, unconstrained text generation.
RESTRICTED_CHOICE        Logits are restricted to exact answer possibilities only.
REASONING_JSON           JSON output containing a preliminary reasoning step.
VERBALIZED_DISTRIBUTION  Uncertainty estimation via verbalized probability as in Meister et al.

OUTPUT_METHODS = [
    "OPEN",
    "RESTRICTED_CHOICE",
    "REASONING_JSON",
    "VERBALIZED_DISTRIBUTION",
]
# Select your method here by copying one from above
output_method = "RESTRICTED_CHOICE"
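Since the method is selected by copying a string, a typo would otherwise only surface later as a prompt without any output instructions. A minimal sketch of an early check follows; `check_output_method` and `expects_json` are hypothetical helpers, and the set of JSON-producing methods mirrors the parsing step at the end of this tutorial.

```python
OUTPUT_METHODS = (
    "OPEN",
    "RESTRICTED_CHOICE",
    "REASONING_JSON",
    "VERBALIZED_DISTRIBUTION",
)
# Methods whose responses are JSON and can be parsed automatically
JSON_METHODS = frozenset({"REASONING_JSON", "VERBALIZED_DISTRIBUTION"})


def check_output_method(method: str) -> str:
    """Normalize the method name and fail early on unknown values."""
    normalized = method.strip().upper()
    if normalized not in OUTPUT_METHODS:
        valid = ", ".join(OUTPUT_METHODS)
        raise ValueError(f"Unknown output method {method!r}; choose one of: {valid}")
    return normalized


def expects_json(method: str) -> bool:
    """Tell whether responses of this method should go through a JSON parser."""
    return check_output_method(method) in JSON_METHODS


output_method = check_output_method("RESTRICTED_CHOICE")
print(output_method, expects_json(output_method))
```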

Creating the LLM Prompts#

  1. For each persona we create an LLMPrompt, the key class of this workflow. To initialize it, we only need the questionnaire, the persona (inserted into system_prompt), and the prompt defined above.

  2. We define how the model should answer depending on the output_method chosen.

  3. We call prepare_prompt to update each LLMPrompt with the answer options.

def create_llm_prompts(row: pd.Series):
    persona_index = row.name
    persona_str = row["persona"]

    # ===== 1. Creating the LLMPrompt =====
    llm_prompt = LLMPrompt(
        questionnaire_source=questionnaire,
        questionnaire_name=str(persona_index),
        system_prompt=system_prompt.format(persona=persona_str),
        prompt=prompt,
    )

    # ====== 2. Defining the Output Method =====
    answer_options = {}

    for dic in all_cleaned_answers:
        answers = dic["answer"]
        from_to_scale = dic["from_to_scale"]
        rgm: response_generation.ResponseGenerationMethod = None

        # We change the ResponseGenerationMethod here. All other code stays the same
        if output_method == "OPEN":
            pass
        elif output_method == "RESTRICTED_CHOICE":
            rgm = response_generation.ChoiceResponseGenerationMethod(
                answers, output_template="Antworte nur mit der exakten Antwort."
            )
        elif output_method == "REASONING_JSON":
            rgm = response_generation.JSONReasoningResponseGenerationMethod(
                output_template=(
                    "Antworte nur im folgenden JSON format:\n"
                    f"{placeholder.JSON_TEMPLATE}"
                )
            )
        elif output_method == "VERBALIZED_DISTRIBUTION":
            rgm = response_generation.JSONVerbalizedDistribution()

        # We can check for robustness with generate_likert_options:
        # Randomized or reversed option order, different indices, etc.
        if from_to_scale:
            answer_option = generate_likert_options(
                n=len(answers),
                answer_texts=answers,
                only_from_to_scale=True,
                scale_prompt_template="Antwortmöglichkeiten: {start} bis {end}",
                response_generation_method=rgm,
            )
        else:
            answer_option = generate_likert_options(
                n=len(answers),
                answer_texts=answers,
                list_prompt_template="Antwortmöglichkeiten: {options}",
                response_generation_method=rgm,
            )
        
        # Since we have different answer options for different questions we create a dict
        answer_options[dic["question"]] = answer_option

    # ===== 3. Updating the Prompts with our answer options =====
    llm_prompt.prepare_prompt(answer_options=answer_options)
    return llm_prompt

# Create an LLMPrompt for every persona in our dataframe
llm_prompts: list[LLMPrompt] = df_personas.apply(create_llm_prompts, axis=1).to_list()

Here we can see our final system and user prompts depending on the method we chose.

sys_prompt, user_prompt = llm_prompts[0].get_prompt_for_questionnaire_type(
    item_id="mp18"
)
print("SYSTEM:", sys_prompt)
print("USER:", user_prompt)
SYSTEM: Nehme die Perspektive der folgenden Person ein: Du bist eine Person, 31 Jahre alt, weiblich und polnische Staatsangehörigkeit. Du lebst in einer kleinen bis mittelgroßen Stadt in Ostdeutschland. Dein höchster Schulabschluss ist die Fachhochschulreife, und du besitzt einen Bachelor-Abschluss. Derzeit bist du als Angestellte in einer Sekretariatsfachkraft-Tätigkeit beschäftigt.
USER: Welche der Antwortmöglichkeiten ist die Reaktion der Person auf folgende Frage: Ergeben sich Ihrer Meinung nach wegen der Flüchtlinge in Bezug auf das Zusammenleben in der Gesellschaft mehr Chancen, mehr Risiken oder weder noch?
Antwortmöglichkeiten: 1: RISIKO UEBERWIEGT, 2: EHER RISIKO, 3: WEDER NOCH, 4: EHER CHANCE, 5: CHANCE UEBERWIEGT
Antworte nur mit der exakten Antwort.

Inference#

First, we set up the client used for inference. Here we use AsyncOpenAI with a locally running vLLM server, started with the following command:

vllm serve Qwen/Qwen3-VL-4B-Instruct --max-model-len=20000
model_id = "Qwen/Qwen3-VL-4B-Instruct"

openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

generator = AsyncOpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

QSTN then runs inference for every question. We set max_tokens=2000 to cap overly long outputs. If you want earlier questions to influence the answers to later ones, use the method conduct_survey_sequential instead (it has to be imported as well).

results = conduct_survey_single_item(
    generator,
    llm_prompts=llm_prompts,
    client_model_name=model_id,
    max_tokens=2000,
    seed=SEED,
)

# results = conduct_survey_sequential(
#     generator,
#     llm_prompts=llm_prompts,
#     client_model_name=model_id,
#     max_tokens=2000,
#     seed=SEED,
# )

Finally, we parse the output into a pandas DataFrame with the answers.

# If we expect JSON output we can automatically parse it
if output_method == "REASONING_JSON" or output_method == "VERBALIZED_DISTRIBUTION":
    parsed_results = parse_json(results)
# Otherwise we just want the raw responses
else:
    parsed_results = raw_responses(results)

# To get all responses in the same dataframe, we can use the helper method
full_results = create_one_dataframe(parsed_results)
display(full_results.head())
   questionnaire_name  questionnaire_item_id  question                                           llm_response                                       logprobs  reasoning
0  0                   lp04                   Sind Sie bei der folgenden Aussage derselben o...  BIN ANDERER MEINUNG                                None      None
1  0                   pe05                   Inwiefern stimmen Sie der folgenden Meinung zu...  STIMME EHER NICHT ZU                               None      None
2  0                   mp18                   Ergeben sich Ihrer Meinung nach wegen der Flüc...  EHER RISIKO                                        None      None
3  0                   mm01                   Inwieweit stimmen Sie der folgenden Aussage zu...  1 (1-7 "STIMME GAR NICHT ZU"-"STIMME VOLL+GANZ...  None      None
4  0                   vi10                   Wie wichtig ist es für Sie persönlich 'sich po...  4 (1-7 "UNWICHTIG"-"SEHR WICHTIG")                 None      None
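A typical next step is to tabulate, per question, which share of personas chose each answer. The following is a minimal standard-library sketch; the first three rows are taken from the output above, while the remaining rows and the helper `answer_shares` are made up for illustration.

```python
from collections import Counter, defaultdict

# (questionnaire_item_id, llm_response) pairs
rows = [
    ("lp04", "BIN ANDERER MEINUNG"),
    ("pe05", "STIMME EHER NICHT ZU"),
    ("mp18", "EHER RISIKO"),
    ("mp18", "EHER RISIKO"),  # hypothetical second persona
    ("mp18", "WEDER NOCH"),  # hypothetical third persona
]


def answer_shares(rows: list[tuple[str, str]]) -> dict[str, dict[str, float]]:
    """Compute the relative frequency of each response per questionnaire item."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for item_id, response in rows:
        counts[item_id][response] += 1
    return {
        item_id: {
            response: n / sum(counter.values()) for response, n in counter.items()
        }
        for item_id, counter in counts.items()
    }


shares = answer_shares(rows)
for item_id, distribution in shares.items():
    print(item_id, distribution)
```

On the full_results DataFrame, the same tabulation can be obtained with pandas by grouping on questionnaire_item_id and calling `value_counts(normalize=True)` on the llm_response column.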