Commit 0a625926 authored by Jana Germies

add readme

parent 80cf448d
# MAJana-code
# Personalizing Conversational AI Using the Big Five Personality Traits
## Practical implementation of the Master's thesis submitted in partial fulfillment of the requirements for an M.A. in Computational Linguistics at Ruhr-University Bochum
### Abstract
### About
The practical part of the thesis is divided into two parts. The first part is concerned with data understanding
and an analysis of the collected dataset. It includes automatic and manual processes, i.e. cleaning the data,
analyzing the accompanying personality questionnaires and calculating speaker alignment. The second part deals with
machine learning engineering and training a language model for conditioned dialogue response generation.
### Requirements
The project was implemented in Python 3.7 and uses external libraries that can be installed via requirements.txt.
It is advised to create a virtual environment for the project.
### Steps
#### Step 1.1: Clean and assess the chat data
To automatically clean the chat data, e.g. remove special symbols and sort messages chronologically,
execute **data_understanding/process_chats.py**.
This will also provide a first statistical analysis of the chats in terms of the number of chats, the number of speakers
and the length of messages. It is advised to conduct manual cleaning of the data as well (a rough sketch of the automatic part is shown below).
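For orientation, the sketch below shows what such automatic cleaning can look like with pandas. It is only an illustration, not the contents of **process_chats.py**; the column names `message`, `chat_id` and `timestamp` follow the scripts further down in this commit.

```python
import pandas as pd

def clean_and_sort(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative only: strip special symbols and sort each chat chronologically."""
    df = df.copy()
    # keep word characters (incl. German umlauts), whitespace and basic punctuation
    df['message'] = df['message'].str.replace(r"[^\w .,!?'-]", '', regex=True).str.strip()
    # sort messages within each chat by their timestamp
    df['timestamp'] = pd.to_datetime(df['timestamp'])
    return df.sort_values(['chat_id', 'timestamp']).reset_index(drop=True)
```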
#### Step 1.2: Assess personality questionnaires
The dataset comes with raw personality scores collected via the **BFI-S questionnaire**. To calculate the final scores
and remove the scores of subjects who did not partake in the chats, execute **calculate_personality_scores.py**.
The program will also visualize the final scores, add personality labels to the chat data and split the
data into individual chats for further processing.
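The exact scoring lives in **calculate_personality_scores.py**. As a rough sketch (the item-to-trait mapping and column names below are placeholders, not the BFI-S key used in the thesis), the calculation amounts to recoding reverse-keyed items and averaging the items that belong to each trait:

```python
import pandas as pd

# placeholder item lists; the real BFI-S key assigns three items to each of the five traits
TRAIT_ITEMS = {'extraversion': ['item_2', 'item_6', 'item_9'], 'neuroticism': ['item_3', 'item_7', 'item_13']}
REVERSED_ITEMS = ['item_2', 'item_7']

def score_bfi_s(answers: pd.DataFrame, scale_max: int = 7) -> pd.DataFrame:
    """Recode reverse-keyed items and average the items of each trait per user."""
    recoded = answers.copy()
    for item in REVERSED_ITEMS:
        recoded[item] = (scale_max + 1) - recoded[item]
    scores = pd.DataFrame(index=recoded['user_id'])
    for trait, items in TRAIT_ITEMS.items():
        scores[trait] = recoded.set_index('user_id')[items].mean(axis=1)
    return scores.reset_index()
```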
#### Step 1.3: Calculate Alignment and Linguistic Style Matching
Using the files resulting from the previous step, each chat is analyzed with the **LIWC** tool (*http://liwc.wpengine.com/*).
Of interest is each user's use of function-word categories (e.g. negations, personal pronouns and auxiliary verbs).
The LIWC scores are provided in **outputs/liwc/** and can be analyzed using **calculate_liwc_results.py**.
This provides scores for the dataset overall, as well as a comparison between chats among only extroverted
speakers and chats between mixed personality pairs.
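**calculate_liwc_results.py** aggregates the per-user LIWC percentages into Linguistic Style Matching (LSM) scores. A common way to compute LSM from two speakers' function-word percentages, shown here only as a sketch and not necessarily the exact formula used in the thesis, is:

```python
def lsm_score(speaker_a: dict, speaker_b: dict,
              categories=('pronoun', 'negate', 'auxverb')) -> float:
    """Average per-category style matching between two speakers.
    Each dict maps a LIWC category to the percentage of that speaker's words in it."""
    per_category = []
    for cat in categories:
        a, b = speaker_a[cat], speaker_b[cat]
        per_category.append(1 - abs(a - b) / (a + b + 0.0001))  # small constant avoids division by zero
    return sum(per_category) / len(per_category)
```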
#### Step 2.1: Prepare the dataset
Using the annotated corpus resulting from **Step 1.2**, context and distractor phrases are added to the dialogue
data. Executing **add_context_columns.py** will output a modified CSV file that contains the reformatted data and
can be used for training the dialogue model.
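For illustration, each row of the resulting CSV pairs a gold response with its preceding turns and two randomly drawn distractor replies; the values below are made up, with the distractor phrases taken from the predefined list in **add_context_columns.py**:

```python
# one illustrative row after add_context_columns.py (values invented)
row = {
    'message':           'Mir geht es gut, danke!',   # gold response
    'context_0':         'Wie geht es dir?',          # preceding turn
    'context_1':         'Hi!',                       # turn before that
    'context_2':         ' ',                         # padded when no earlier turn exists
    'distractor_1':      'Ich mag Suppe.',            # randomly drawn distractor reply
    'distractor_2':      'Was ist dein Hobby?',
    'extraversion_pole': 'extrovert',                 # personality label added in Step 1.2
}
```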
#### Step 2.2: Preprocess the dataset
For preprocessing, the dataset is tokenized and transformed into a Tensor Dataset. The dataset contains
labels for language modeling and next-sentence prediction.
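The input names correspond to `MODEL_INPUTS` in the preprocessing module further down in this commit; schematically, one instance looks as follows (a sketch with placeholder values, not actual output):

```python
# rough shape of one preprocessed instance (see build_inputs in the preprocessing module)
instance = {
    'input_ids':      [],   # <bos> <introvert|extrovert>, speaker-prefixed history turns, then the candidate + <eos>
    'token_type_ids': [],   # one speaker token id per position of input_ids
    'mc_token_ids':   0,    # position of the last token, read by the next-sentence (multiple-choice) head
    'lm_labels':      [],   # -100 everywhere except the tokens of the gold response
}
# 'mc_labels' additionally stores which candidate (two distractors + gold response) is the real next utterance
```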
#### Step 2.3: Fine-tune the model on a multi-task objective
The dialogue model is fine-tuned on a multi-task objective using the preprocessed dataset. **train.py**
will preprocess the data and begin fine-tuning. The base model is loaded from the **HuggingFace Transformers
community model hub**.
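Since **train.py** itself is not part of this diff, the following is only a minimal sketch of such a multi-task objective. It assumes a GPT-2 double-heads model (language-modeling head plus next-sentence/multiple-choice head, e.g. `GPT2DoubleHeadsModel` from Transformers) on top of the `dbmdz/german-gpt2` checkpoint named in the preprocessing module, and it imports `get_data_loaders` from that module under an assumed file name:

```python
import torch
from transformers import AutoTokenizer, GPT2DoubleHeadsModel

from preprocess import get_data_loaders  # module name assumed; see the preprocessing file in this commit

tokenizer = AutoTokenizer.from_pretrained('dbmdz/german-gpt2')
model = GPT2DoubleHeadsModel.from_pretrained('dbmdz/german-gpt2')
train_loader, _, _, _ = get_data_loaders('../outputs2/context-chats.csv', tokenizer, model)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
lm_coef, mc_coef = 2.0, 1.0  # illustrative loss weights

model.train()
for input_ids, mc_token_ids, lm_labels, mc_labels, token_type_ids in train_loader:
    outputs = model(input_ids,
                    token_type_ids=token_type_ids,
                    mc_token_ids=mc_token_ids,
                    labels=lm_labels,      # language-modeling labels; -100 positions are ignored
                    mc_labels=mc_labels)   # index of the gold response among the candidates
    loss = lm_coef * outputs.loss + mc_coef * outputs.mc_loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```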
### Notes
As of now, the implementation still has some issues and does not run bug-free. The main issue is an
out-of-bounds error that occurs during fine-tuning.
@@ -95,17 +95,17 @@ if __name__ == '__main__':
    boxplot_lsm_scores(scores)
    print("""
*******************************************************
*** Congratulations ***
Basic cleaning and analysis of the data are done.
You should have a good understanding of the data now.
Please continue with preparing the cleaned chats for
the modeling pipeline. To do so, please switch to the
modeling directory and execute file:
add_context_columns.py
*******************************************************
""")
@@ -128,7 +128,7 @@ def get_interaction_message_lengths_scores(df_chats, df_traits, msg_lens):
if __name__ == '__main__':
    # paths
    personality_path_in = '../ttas-data/ttas-user-answers.csv'
    personality_path_out = '../outputs2/filtered-personality-scores.csv'
    chat_path_in = '../outputs2/ttas-clean-chats.csv'
    chat_path_out = '../outputs2/ttas-annotated-chats.csv'
    # read
@@ -172,26 +172,26 @@ if __name__ == '__main__':
    scatterplot_interaction(interaction)
    print("""
*******************************************************
*** Manual step required ***
To calculate Linguistic Style Matching scores, please
refer to the Linguistic Inquiry and Word Count (LIWC) tool.
Files for all individual chats have been created at
outputs/understanding/chats/ and are ready to be analyzed
using the tool. The respective software can be found at:
http://liwc.wpengine.com/
Fees may apply.
After analyzing the individual chats with the tool, save
results and continue with calculating the overall scores.
To do so, execute file:
calculate_liwc_results.py
*******************************************************
""")
"""Apply methods to clean data"""
from process_chats import (read_chat_data, filter_chats, get_n_count, get_summary, summarize_chats, clean_messages,
sort_chat_messages, concat_and_save_message_strings)
from calculate_personality_scores import (read_personality_data, remove_fake_profiles, recode_answers,
calculate_scores_per_user, map_extraversion_poles, remove_superfluous_users, get_interaction_message_lengths_scores)
from visualization.visualizations import boxplot_trait_scores, histplot_messages, scatterplot_interaction
if __name__ == '__main__':
    chats_input_path = '../ttas-data/ttas-complete-chats.csv'
    traits_input_path = '../ttas-data/trait_scores.csv'
    chat_output_path = '../outputs/ttas-filtered-chats.csv'
    traits_out_path = '../outputs/filtered_personality_scores.csv'
    # read in raw chat data
    chat_data = read_chat_data(chats_input_path)
    # filter out chats > 4
    filtered_chats = filter_chats(chat_data)
    # clean from special symbols
    clean_chats = clean_messages(filtered_chats)
    # count unique users and chats
    unique_users, unique_chats, n_users, n_chats = get_n_count(clean_chats)
    # summarize conversations
    message_lens, chat_summary, summary_messages = summarize_chats(clean_chats)
    ### ----- ###
    # extra step: manual cleaning
    ### ----- ###
    # read in raw questionnaire answers
    personality_answers = read_personality_data(traits_input_path)
    # remove test profiles
    clean_answers = remove_fake_profiles(personality_answers)
    # recode answers for calculation
    recoded_answers = recode_answers(clean_answers)
    # calculate scores
    trait_scores = calculate_scores_per_user(recoded_answers)
    # compare with cleaned chat data and remove superfluous profiles
    trait_scores.reset_index(inplace=True)  # TODO: check reset index in original method
    filtered_scores = remove_superfluous_users(trait_scores)
    # evaluate
    mean_scores = get_summary(filtered_scores.drop('user_id', axis=1))
    # map extraversion scores to pole expression labels
    extraversion_dict = map_extraversion_poles(filtered_scores)
    # annotate to cleaned chat data
    clean_chats['extraversion_pole'] = clean_chats['user_id'].map(extraversion_dict)
    # sort chats according to timestamp
    sorted_chats = sort_chat_messages(clean_chats)
    # evaluate interaction between messages and personality scores
    interaction = get_interaction_message_lengths_scores(sorted_chats, filtered_scores, message_lens)
    # save
    sorted_chats.to_csv(chat_output_path, index=False)
    filtered_scores.to_csv(traits_out_path, index=False)
    # visualize
    boxplot_trait_scores(filtered_scores)
    histplot_messages(message_lens)
    scatterplot_interaction(interaction)
    # prepare for LIWC
    concat_and_save_message_strings(sorted_chats)
@@ -130,7 +130,8 @@ if __name__ == '__main__':
to enhance quality of the data.
The corresponding file can be found under:
outputs/understanding/ttas-clean-chats.csv
After cleaning continue with calculating the respective
personality scores. To do so, execute file:
@@ -139,9 +140,3 @@ if __name__ == '__main__':
*******************************************************
""")
"""Here, methods to process the data and create context and distractor data are provided."""
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
def change_userid_to_speakerid(df):
    """Change made up user ids to speaker 1 and speaker 2 respectively"""
    uniq_user_id = df.user_id.unique()
    chat = df['chat_id']
    if len(uniq_user_id) < 2:
        print('WARNING: Conversation with only 1 speaker detected. Please remove conversation:', chat)
    else:
        # map to dict
        speaker_dict = {uniq_user_id[0]: 'speaker1', uniq_user_id[1]: 'speaker2'}
        df['user_id'] = df['user_id'].map(speaker_dict)
    return df
def clean_up_turns(df):
    """
    Find incidents where the same speaker sent two succeeding messages and concatenate into one message/turn.
    """
    # create boolean mask
    # shift(-1) compares to row beneath
    # shift(1) compares to row above
    # mask = df['user_id'].shift() == df['user_id'] #and (df['chat_id'].shift(-1) == df['chat_id']))
    mask = ((df['user_id'].shift() == df['user_id']) & (df['chat_id'].shift() == df['chat_id']))
    # get indices
    mask_ind = mask.reset_index(drop=True)
    mask_ind = mask_ind[mask_ind].index.tolist()
    concat_ind = [ind - 1 for ind in mask_ind]
    ind_tuple = zip(concat_ind, mask_ind)
    # concatenate messages / turns
    message_col = df.columns.get_loc('message')
    for tpl in ind_tuple:
        # assign via iloc on both axes; chained indexing (df.iloc[i]['message'] = ...) would write to a copy
        df.iloc[tpl[0], message_col] = df.iloc[tpl[0]]['message'] + ' ' + df.iloc[tpl[1]]['message']
    # drop redundant messages
    df_clean = df[~mask]
    return df_clean
def create_context_cols(df):
    """Create context columns in the data frame.
    Note: Distractors are picked randomly from predefined distractor_sents.
    For a larger dataset they should be picked randomly from the dataset itself"""
    distractor_sents = pd.Series(['Das tut mir leid.', 'Das hab ich nicht verstanden.', 'Super cool!', 'Wie meinst du das?',
                                  'Ich liebe Eis.', 'Ich bin vegan.', 'Was ist dein Lieblingsessen?', 'Was ist dein Hobby?',
                                  'Ich mag Suppe.', 'Was hast du morgen so vor?'])
    df['context_0'] = df['message'].shift(1, fill_value='Hi!')
    df['context_1'] = df['message'].shift(2, fill_value=' ')
    df['context_2'] = df['message'].shift(3, fill_value=' ')
    df['distractor_1'] = distractor_sents[np.random.randint(0, len(distractor_sents), len(df)).tolist()].tolist()
    df['distractor_2'] = distractor_sents[np.random.randint(0, len(distractor_sents), len(df)).tolist()].tolist()
    return df
def format_context_response_table(df):
    # concat messages
    df_turns = clean_up_turns(df)
    print(df_turns)
    # change usernames
    df_speaker = df_turns.groupby('chat_id').apply(change_userid_to_speakerid)
    # create context columns
    df_context = df_speaker.groupby('chat_id').apply(create_context_cols)
    return df_context
def table_to_nested_dict(df):
    """Create a nested dict of the data that can be saved to json file.
    The two main keys are the two extraversion trait poles, each key holds the individual messages,
    their respective chat history and distractor replies.
    Note: Not used in the final pipeline."""
    df['candidates'] = df.apply(lambda x: [x['distractor_1']] + [x['distractor_2']] + [x['message']], axis=1)
    df['context'] = df.apply(lambda x: [x['context_2']] + [x['context_1']] + [x['context_0']], axis=1)
    df['context'] = [[msg for msg in li if msg != ' '] for li in df['context']]
    keys = ['personality', 'utterances']
    data = {'train': [], 'test': []}
    grouped = df.groupby('extraversion_pole')
    for group, frame in grouped:
        train, test = train_test_split(frame, test_size=0.15)
        print(len(train), len(test))
        personality_dict = dict.fromkeys(keys)
        personality_dict['personality'] = group
        personality_dict['utterances'] = []
        for idx, row in train.iterrows():
            sub_dict = dict()
            sub_dict['candidates'] = row['candidates']
            sub_dict['history'] = row['context']
            personality_dict['utterances'].append(sub_dict)
        data['train'].append(personality_dict)
        for idx, row in test.iterrows():
            sub_dict = dict()
            sub_dict['candidates'] = row['candidates']
            sub_dict['history'] = row['context']
            personality_dict['utterances'].append(sub_dict)
        data['test'].append(personality_dict)
    return data
if __name__ == '__main__':
    # create context and distractor columns
    chats = pd.read_csv('../outputs2/ttas-annotated-chats.csv', sep=";")
    contextual_df = format_context_response_table(chats)
    contextual_df = contextual_df.drop(['timestamp'], axis=1)
    contextual_df.reset_index(drop=True, inplace=True)
    contextual_df.to_csv('../outputs2/context-chats.csv', sep=';', index=False)
    print("""
*******************************************************
*** Dataframe complete ***
Context and Distractor columns have been added to the
data frame. Please continue with preprocessing the data
and training. To do so, execute file:
train.py
*******************************************************
""")
"""Here, methods to prepare the dataset for training are provided"""
import pandas as pd
from itertools import chain
from collections import defaultdict
from sklearn.model_selection import train_test_split
import torch
from torch.utils.data import TensorDataset, DataLoader
from torch.utils.data.distributed import DistributedSampler
from transformers import (AutoModelWithLMHead, AutoTokenizer)
# define special tokens
SPECIAL_TOKENS = ['<bos>', '<eos>', '<speaker1>', '<speaker2>', '<introvert>', '<extrovert>', '<pad>']
ATTR_TO_SPECIAL_TOKEN = {'bos_token': '<bos>', 'eos_token': '<eos>', 'pad_token': '<pad>',
'additional_special_tokens': ['<speaker1>', '<speaker2>', '<introvert>', '<extrovert>']}
MODEL_INPUTS = ['input_ids', 'mc_token_ids', 'lm_labels', 'mc_labels', 'token_type_ids']
PADDED_INPUTS = ['input_ids', 'lm_labels', 'token_type_ids']
# TODO: check Trainer for modeling
def tokenize_dataset(df, tokenizer):
    """Tokenize string values in specified columns
    Note: dbmdz pre-trained tokenizer cannot be applied to batches of sentences
    tokenize: separates string into list of words and punctuation marks
    convert_tokens_to_ids: convert words into indices of vocabulary entries"""
    print('INFO: Tokenizing messages ...')
    # tokenize and encode
    cols = ['message', 'distractor_1', 'distractor_2', 'context_0', 'context_1', 'context_2']
    for name in cols:
        df[name] = df[name].apply(tokenizer.tokenize)
        df[name] = df[name].apply(tokenizer.convert_tokens_to_ids)
    return df
def split_dataframe(df):
    """Concatenate candidates and contexts after tokenization
    -> last response is ground truth
    Note: token id 255 is an empty string and should be removed
    Split into train and test set
    test_size is set to 0.15 since the dataset is quite small"""
    print('INFO: Splitting dataset ...')
    new_df = pd.DataFrame()
    new_df['trait'] = df['extraversion_pole']
    new_df['candidates'] = df.apply(lambda x: [x['distractor_1']] + [x['distractor_2']] + [x['message']], axis=1)
    new_df['context'] = df.apply(lambda x: [x['context_2']] + [x['context_1']] + [x['context_0']], axis=1)
    new_df['context'] = [[msg for msg in li if msg != [225]] for li in new_df['context']]
    # split in train and test
    train, test = train_test_split(new_df, test_size=0.15, random_state=0, stratify=new_df[['trait']])
    train.reset_index(drop=True, inplace=True)
    test.reset_index(drop=True, inplace=True)
    print('INFO: Train and test samples:', train.shape, test.shape)
    return train, test
def pad_dataset(dataset, padding=0):
    """Pad Dataset.
    Note: LM Labels are padded differently
    max length of history + response = 443 tokens
    model size = 512 for dbmdz
    model size = 1024 for GerPT"""
    print('INFO: Padding inputs ...')
    # max_l = max(len(x) for x in dataset['input_ids'])
    max_l = 512
    for name in PADDED_INPUTS:
        dataset[name] = [x + [padding if name != 'lm_labels' else -100] * (max_l - len(x)) for x in dataset[name]]
    return dataset
def add_special_token(model, tokenizer):
    """Add special tokens to model and tokenizer.
    Check with pretrained tokens."""
    n_added_tokens = tokenizer.add_special_tokens(ATTR_TO_SPECIAL_TOKEN)
    if n_added_tokens > 0:
        model.resize_token_embeddings(new_num_tokens=len(tokenizer))
def build_inputs(tokenizer, trait, history, response, lm_labels=False, with_eos=True):
    """Build modeling sequences from pole, history and response segments
    - history = list of previous utterances as list of list of token ids / words
    - response = list of token ids / words for gold or distractor response
    - trait = trait special token
    Returns dict"""
    # convert special token symbols to token ids
    bos, eos, speaker1, speaker2, introvert, extrovert = tokenizer.convert_tokens_to_ids(SPECIAL_TOKENS[:-1])
    # set trait poles to respective tokens / token ids
    if trait == 'introvert':
        pole = introvert
    elif trait == 'extrovert':
        pole = extrovert
    # create sequences
    sequence = [[bos] + [pole]] + history + [response + ([eos] if with_eos else [])]
    sequence = [sequence[0]] + [[speaker2 if (len(sequence)-i) % 2 else speaker1]
                                + s for i, s in enumerate(sequence[1:])]
    instance = dict()
    instance['input_ids'] = list(chain(*sequence))
    instance['token_type_ids'] = [speaker2 if i % 2 else speaker1 for i, s in enumerate(sequence) for _ in s]
    instance['mc_token_ids'] = len(instance['input_ids']) - 1
    instance['lm_labels'] = [-100] * len(instance['input_ids'])
    if lm_labels:
        instance['lm_labels'] = ([-100] * sum(len(s) for s in sequence[:-1])) + [-100] + sequence[-1][1:]
    return instance
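# Illustrative example (not part of the original code): for trait='extrovert', a history
# [h1, h2] and a candidate response r, the assembled instance looks roughly like
#   input_ids:      <bos> <extrovert> <spk> h1 <spk> h2 <spk> r <eos>   (speaker1/speaker2 tokens alternate)
#   token_type_ids: one speaker token id repeated over each segment
#   mc_token_ids:   index of the final token, used by the next-sentence (multiple-choice) head
#   lm_labels:      -100 everywhere except the response tokens (only when lm_labels=True)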
def build_dataset(df, tokenizer, train_set=True, distributed=False):
    """
    Transforms the input dataframe or dict into a Tensor Dataset.
    Note: Distributed Training is only supported on Linux and Windows
    For support on Mac library needs to be compiled from source
    """
    print('INFO: Building dataset')
    dataset = defaultdict(list)
    n_candidates = 3
    max_history = 2
    if not train_set:
        n_candidates = 1
    # create instance for each candidate response
    print('INFO: Building sequences ...')
    for i, row in df.iterrows():
        trait = row['trait']
        history = row['context'][-(2*3+1):]
        candidates = row['candidates']
        for j, candidate in enumerate(candidates[-n_candidates:]):  # possible error -> gold response has index 2 ?
            lm_labels = bool(j == n_candidates-1)
            instance = build_inputs(tokenizer, trait, history, candidate, lm_labels)
            for input_name, input_array in instance.items():
                dataset[input_name].append(input_array)
        dataset['mc_labels'].append(n_candidates - 1)  # label == 2?
    dataset['n_candidates'] = n_candidates
    # pad
    padded_dataset = pad_dataset(dataset, padding=tokenizer.convert_tokens_to_ids(SPECIAL_TOKENS[-1]))
    # convert to tensors
    print('INFO: Converting input sequences into tensors ...')
    tensor_set = []
    for input_name in MODEL_INPUTS:
        tensor = torch.tensor(padded_dataset[input_name])
        tensor = tensor.view((-1, dataset['n_candidates']) + tensor.shape[1:])
        tensor_set.append(tensor)
    # build tensor data set
    batchsize = 4
    tensor_dataset = TensorDataset(*tensor_set)  # TODO: resolve size mismatch error
    sampler = DistributedSampler(tensor_dataset) if distributed else None
    loader = DataLoader(tensor_dataset, sampler=sampler, batch_size=batchsize, shuffle=False)
    print('INFO: Dataset (Batch, Candidates, Seq Length):{}'.format(tensor_dataset.tensors[0].shape))
    return loader, sampler
def get_data_loaders(data_path, tokenizer, model):
    """ Load, tokenize and split data and build tensor datasets for training """
    data = pd.read_csv(data_path, sep=";")
    data = data.drop(['chat_id', 'user_id'], axis=1)
    add_special_token(model, tokenizer)
    tokenized_chats = tokenize_dataset(data, tokenizer)
    train, test = split_dataframe(tokenized_chats)
    train_loader, train_sampler = build_dataset(train, tokenizer)
    test_loader, test_sampler = build_dataset(test, tokenizer, train_set=False)
    return train_loader, train_sampler, test_loader, test_sampler
#if __name__ == '__main__':
#data = '../outputs2/context-chats.csv'
#tokenizer = AutoTokenizer.from_pretrained('dbmdz/german-gpt2')
#model = AutoModelWithLMHead.from_pretrained('dbmdz/german-gpt2')
#train_loader, train_sampler, test_loader, test_sampler = get_data_loaders(data, tokenizer, model)
model_checkpoint: checkpoints/
n_candidates: 3
max_history: 3
train_batch_size: 4
"""Here, methods to process the data and create context and distractor data are provided."""
"""Here, methods to prepare the dataset for training are provided"""
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from itertools import chain
from collections import defaultdict
def change_userid_to_speakerid(df):
"""Change made up user ids to speaker 1 and speaker 2 respectively"""
uniq_user_id = df.user_id.unique()
chat = df['chat_id']
if len(uniq_user_id) < 2:
print('WARNING: Conversation with only 1 speaker detected. Please remove conversation:', chat)
else:
# map to dict
speaker_dict = {uniq_user_id[0]: 'speaker1', uniq_user_id[1]: 'speaker2'}
df['user_id'] = df['user_id'].map(speaker_dict)
from sklearn.model_selection import train_test_split
import torch
from torch.utils.data import TensorDataset, DataLoader
from torch.utils.data.distributed import DistributedSampler
from transformers import (AutoModelWithLMHead, AutoTokenizer)
# define special tokens
SPECIAL_TOKENS = ['<bos>', '<eos>', '<speaker1>', '<speaker2>', '<introvert>', '<extrovert>', '<pad>']
ATTR_TO_SPECIAL_TOKEN = {'bos_token': '<bos>', 'eos_token': '<eos>', 'pad_token': '<pad>',
'additional_special_tokens': ['<speaker1>', '<speaker2>', '<introvert>', '<extrovert>']}
MODEL_INPUTS = ['input_ids', 'mc_token_ids', 'lm_labels', 'mc_labels', 'token_type_ids']
PADDED_INPUTS = ['input_ids', 'lm_labels', 'token_type_ids']
# TODO: check Trainer for modeling
def tokenize_dataset(df, tokenizer):
"""Tokenize string values specified columns
Note: dbmbz pre-trained tokenizer cannot be applied to batches of sentences
tokenize: separates string into list of words and punctuation marks
convert_tokens_to_ids: convert words into indices of vocabulary entries"""
print('INFO: Tokenizing messages ...')
# tokenize and encode
cols = ['message', 'distractor_1', 'distractor_2', 'context_0', 'context_1', 'context_2']
for name in cols:
df[name] = df[name].apply(tokenizer.tokenize)
df[name] = df[name].apply(tokenizer.convert_tokens_to_ids)
return df
def clean_up_turns(df):
def split_dataframe(df):
"""Concatenate candidates and contexts after tokenization
-> last response is ground truth
Note: token id 255 is an empty string and should be removed
Split into train and test set
test_size is set to 0.15 since the dataset is quite small"""
print('INFO: Splitting dataset ...')
new_df = pd.DataFrame()
new_df['trait'] = df['extraversion_pole']
new_df['candidates'] = df.apply(lambda x: [x['distractor_1']] + [x['distractor_2']] + [x['message']], axis=1)
new_df['context'] = df.apply(lambda x: [x['context_2']] + [x['context_1']] + [x['context_0']], axis=1)