VoT MMR Data Cleaning Guide
A complete walkthrough of VoT_MMR_DataCleaning_v2_Formv6.ipynb. Covers environment setup, loading KoboToolbox data, column mapping, cleaning transformations, 12 analysis modules, and exporting results.
Need the pipeline map (stages, DataFrames, folders)? Open the full user guide → Logic section.
Tech stack
- KoboToolbox (data source): collection platform for the screening forms
- pandas (library): core data wrangling and transformation
- openpyxl (library): Excel read/write and output formatting
- matplotlib / seaborn (libraries): data visualization, charts and distributions
- Python 3.10+ (requirement): minimum runtime version
- Jupyter Notebook (requirement): interactive cell-by-cell execution environment

Overview & Prerequisites
This notebook processes screening data collected from Victims of Trafficking (VoT) in Myanmar via KoboToolbox (Form v6). It performs end-to-end data cleaning, from raw Excel export to analysis-ready datasets, and generates summary tables for protection, recruitment, exploitation, and M&E reporting.
Folder structure expected
VoT_MMR_Cleaning/
├── VoT_MMR_DataCleaning_v2_Formv6.ipynb   ← this notebook
├── _VoT/
│   ├── input/    ← raw KoboToolbox export (.xlsx)
│   ├── output/   ← cleaned datasets
│   └── tables/   ← summary analysis tables
├── config/       ← column rename dictionaries
└── Working/      ← intermediate working files
Create _VoT/input/ (with the raw export inside) before running the notebook. Section 2.1 creates _VoT/output/ and _VoT/tables/ if they are missing, but later export cells write to these paths without creating them, so don't skip that cell.
Setup & Installation
Run the installation cells once per environment: the first time you use the notebook, or after setting up a fresh Python environment.
1.1 Install dependencies
# Run inside a Jupyter cell (note the ! prefix for shell commands)
!pip3 install gdown --quiet
!pip3 install -U scikit-learn scipy matplotlib --quiet
!pip3 install seaborn --quiet
!pip3 install openpyxl --quiet
1.2 Import libraries
import pandas as pd
import numpy as np
import seaborn as sbn
import matplotlib.pyplot as plt
from openpyxl.styles import Font, PatternFill, Border, Side, Alignment
from openpyxl.utils import get_column_letter
from openpyxl import load_workbook
import os, re
from pathlib import Path
from datetime import datetime
from collections import Counter
1.3 Utility functions
These helpers are defined early and reused throughout every analysis module.
def lowerCaseContent(df):
"""Lowercase all string values in object columns."""
for col in df.select_dtypes(['object']):
df[col] = df[col].str.lower().str.strip()
return df
def integerToFloat(df):
"""Convert integer columns to float for consistent numeric handling."""
for col in df.select_dtypes(['int64']):
df[col] = df[col].astype(float)
return df
def getListOfColsIsNaN(df):
"""Return columns with missing values sorted by count."""
missing = df.isnull().sum()
return missing[missing > 0].sort_values(ascending=False)
def returnUniqueValues(df):
"""Print unique values for all categorical columns."""
for col in df.select_dtypes(['object']):
print(f"\n=== {col} ===")
print(df[col].unique())
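A typical first pass chains these helpers once the data is loaded. A minimal sketch, assuming the in_df DataFrame from section 2.2:

# Normalize text casing, harmonize numeric dtypes, then inspect gaps
in_df = lowerCaseContent(in_df)
in_df = integerToFloat(in_df)
print(getListOfColsIsNaN(in_df))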
Data Loading
The raw data comes from KoboToolbox exported as Excel. The notebook reads a specific sheet and immediately lowercases all column names for consistency.
2.1 Set file paths
# Input: raw KoboToolbox export
infile = "_VoT/input/VoT_MMR_Raw.xlsx"
out_dir = "_VoT/output/"
table_dir = "_VoT/tables/"
# Create output folders if they don't exist yet
os.makedirs(out_dir, exist_ok=True)
os.makedirs(table_dir, exist_ok=True)
2.2 Load into a DataFrame
# Read specific sheet exported from KoboToolbox
in_df = pd.read_excel(infile, sheet_name="data_pxd_25")
# Normalize all column names to lowercase
in_df.columns = [col.lower() for col in in_df.columns]
# Quick sanity check
print(f"Dataset: {in_df.shape[0]} rows Γ {in_df.shape[1]} columns")
in_df.head(2)
If the sheet name differs (e.g. Sheet1 or data), update sheet_name. Check your KoboToolbox export settings to confirm the exact sheet name.
2.3 Select analysis columns
processing_cols = [
'interviewer_name', 'interviewer_org', 'interview_date',
'gender', 'age', 'citizenship', 'education_level',
'passport_owned', 'visa_type', 'travel_route', 'airline_used',
'recruitment_contact', 'expected_work_type', 'exploitation_type',
'scam_type', 'means_of_control', 'ransom_amount',
# Add or remove columns based on your Form v6 export
]
in_df = in_df[[c for c in processing_cols if c in in_df.columns]]
Data Preparation & Cleaning
3.1 Rename columns
KoboToolbox exports use group-prefixed technical names. The notebook maps these to readable labels using a dictionary.
# key = KoboToolbox column name (already lowercased)
# value = analysis-friendly name
rename_dict = {
'group_demographics/q_gender': 'gender',
'group_demographics/q_age': 'age',
'group_demographics/q_citizenship': 'citizenship',
'group_travel/q_passport': 'passport_owned',
'group_travel/q_visa': 'visa_type',
# extend with all relevant columns from your form
}
in_df.rename(
columns={k: v for k, v in rename_dict.items() if k in in_df.columns},
inplace=True
)
3.2 Drop system columns
cols_to_drop = [
'survey_start_time', 'survey_end_time', 'record_id',
'record_uuid', 'submission_time', 'validation_status',
'system_notes', 'record_status', 'submitted_by',
'form_version', 'record_tags',
]
vot_df = in_df.drop(
    columns=[c for c in cols_to_drop if c in in_df.columns]
)
in_df = vot_df  # later cells continue from in_df, so carry the trimmed frame forward
3.3 Missing values report
total = in_df.isnull().sum().sort_values(ascending=False)
pct = (total / len(in_df) * 100).round(2)
missing = pd.concat([total, pct], axis=1, keys=['Missing Count', 'Missing %'])
missing = missing[missing['Missing Count'] > 0]
with pd.ExcelWriter(
"_VoT/tables/VoT_MMR_MissingValues.xlsx", engine='openpyxl'
) as writer:
missing.to_excel(writer, sheet_name='Missing Values')
print(missing)
Open VoT_MMR_MissingValues.xlsx to identify columns with >30% missingness. These may need imputation or exclusion before analysis.
3.4 Remove duplicates
n_dup = in_df.duplicated().sum()
print(f"Duplicate rows: {n_dup}")
if n_dup > 0:
in_df = in_df.drop_duplicates()
print(f"After dedup: {in_df.shape[0]} rows")
Data Mapping
KoboToolbox stores binary responses as 0 / 1. The notebook converts these to Yes / No and maps country codes to readable nationality labels.
4.1 Binary → Yes / No
def apply_individual_mappings_notnull(df):
"""Map 0/1 binary columns to Yes/No β only on non-null rows."""
yes_no_cols = [
'passport_owned', 'visa_obtained', 'consent_given',
'debt_bondage', 'document_confiscated', 'physical_abuse',
# add all binary columns from your Form v6
]
    binary_map = {0: 'No', 1: 'Yes'}  # covers 0.0/1.0 too, since floats hash equal to ints
for col in yes_no_cols:
if col in df.columns:
mask = df[col].notna()
df.loc[mask, col] = df.loc[mask, col].map(binary_map)
return df
df = apply_individual_mappings_notnull(in_df)
4.2 Country code → Nationality
def apply_nationality_mappings(df_mapped):
nationality_map = {
'mmr': 'Myanmar', 'myanmar': 'Myanmar',
'chn': 'China', 'tha': 'Thailand',
'khm': 'Cambodia', 'lao': 'Laos',
'vnm': 'Vietnam',
# extend for all nationalities in your dataset
}
if 'citizenship' in df_mapped.columns:
df_mapped['citizenship'] = (
df_mapped['citizenship'].str.lower().str.strip()
.map(nationality_map)
.fillna(df_mapped['citizenship'])
)
return df_mapped
df_mapped_2 = apply_nationality_mappings(df)
4.3 Filter: African nationals (optional)
# UN official list of 54 African countries
african_countries = [
'Algeria', 'Angola', 'Benin', 'Botswana', 'Burkina Faso',
'Burundi', 'Cameroon', 'Central African Republic', 'Chad',
# ... all 54 (add the full list)
]
df_african = df_mapped_2[df_mapped_2['citizenship'].isin(african_countries)]
print(f"African nationals: {len(df_african)}")
Analysis Modules
The notebook is organized into 12 self-contained analysis modules (5.1–5.12 below). Each produces one or more Excel tables saved to _VoT/tables/. Run them in sequence after completing the cleaning steps above.
5.1 Gender & nationality
What: Cross-tabulates citizenship by gender; computes nationality percentages; identifies the top-10 nationalities.
Output: gender_nationality_table, a crosstab plus summary stats (total VoTs, % female, top nationality).
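A minimal sketch of the core table, assuming the mapped frame df_mapped_2 from section 4.2:

# Citizenship-by-gender counts, with row/column totals in an 'All' margin
gender_nationality_table = pd.crosstab(
    df_mapped_2['citizenship'], df_mapped_2['gender'], margins=True
)
# Nationality shares and the top-10 nationalities
nat_pct = (df_mapped_2['citizenship'].value_counts(normalize=True) * 100).round(1)
print(nat_pct.head(10))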
5.2 Travel routes
What: Parses multi-step routes from free-text or select-multiple fields. Extracts origin → transit → destination patterns using regex and Counter.
Output: Route frequency table; most common transit countries; corridor bar chart.
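A sketch of the parsing step. The separator pattern below ('>', '->', or commas) is an assumption; adjust it to how routes appear in your export:

route_counter, transit_counter = Counter(), Counter()
for route in df_mapped_2['travel_route'].dropna():
    # Split the route string into ordered stops
    stops = [s.strip() for s in re.split(r'\s*(?:->|>|,)\s*', str(route)) if s.strip()]
    if len(stops) >= 2:
        route_counter[' > '.join(stops)] += 1   # full corridor
        transit_counter.update(stops[1:-1])     # middle stops = transit countries
print(route_counter.most_common(10))
print(transit_counter.most_common(5))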
5.3 Education attainment
What: Analyses the education attainment distribution across victims. Generates a BarChart with openpyxl for embedded Excel visualization.
Output: Education breakdown table with embedded chart saved to _VoT/tables/.
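A sketch of the openpyxl chart embedding; the filename VoT_MMR_Education.xlsx is an assumed example:

from openpyxl.chart import BarChart, Reference

edu_counts = df_mapped_2['education_level'].value_counts()
edu_file = table_dir + "VoT_MMR_Education.xlsx"  # assumed filename
edu_counts.rename('count').to_frame().to_excel(edu_file, sheet_name='Education')

wb = load_workbook(edu_file)
ws = wb['Education']
chart = BarChart()
chart.title = "Education attainment"
last_row = len(edu_counts) + 1  # header row plus one row per education level
chart.add_data(Reference(ws, min_col=2, min_row=1, max_row=last_row), titles_from_data=True)
chart.set_categories(Reference(ws, min_col=1, min_row=2, max_row=last_row))
ws.add_chart(chart, "D2")
wb.save(edu_file)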
5.4 Age demographics
What: Creates age-group categories (18–24, 25–34, 35–44, 45+); cross-tabs by gender; computes median age and % by group. Handles missing or invalid age values.
Output: Age × gender pivot table; demographic summary statistics.
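A minimal sketch of the bucketing, using pd.cut with the module's age groups:

# Coerce invalid ages to NaN, then bucket into the report's age groups
age_num = pd.to_numeric(df_mapped_2['age'], errors='coerce')
df_mapped_2['age_group'] = pd.cut(
    age_num,
    bins=[18, 25, 35, 45, 120],
    labels=['18-24', '25-34', '35-44', '45+'],
    right=False,  # [18, 25) -> '18-24', and so on
)
age_gender = pd.crosstab(df_mapped_2['age_group'], df_mapped_2['gender'])
print(f"Median age: {age_num.median():.0f}")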
5.5 Recruitment channels
What: Analyses contact methods (social media, personal referral, job board) and promised employment types (factory, domestic, hospitality, scam compound).
Output: Recruitment contact frequency table; expected work type breakdown.
5.6 Recruitment platforms
What: Identifies platforms used for recruitment (Facebook, TikTok, Telegram, WeChat). Handles multi-select fields by splitting on separator characters.
Output: Platform frequency bar chart; normalized percentage table.
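A sketch of the multi-select handling. The column name recruitment_platform is hypothetical; substitute the actual Form v6 field:

# KoboToolbox select-multiple answers are space-separated by default;
# some exports use ';' or ',' instead, so adjust the pattern to match
platforms = (
    df_mapped_2['recruitment_platform']  # hypothetical column name
    .dropna()
    .str.split(r'[;,\s]+', regex=True)
    .explode()
    .str.strip()
)
platform_pct = (platforms.value_counts(normalize=True) * 100).round(1)
print(platform_pct)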
5.7 Means of control
What: Counts control mechanisms in scam compounds: document confiscation, debt bondage, physical abuse, threats, restricted movement, phone confiscation.
Output: Frequency table per control type; % of victims experiencing each.
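A minimal sketch, assuming the binary columns were already mapped to Yes/No in section 4.1:

control_cols = [
    'document_confiscated', 'debt_bondage', 'physical_abuse',
    # extend with threats, restricted movement, phone confiscation, etc.
]
present = [c for c in control_cols if c in df_mapped_2.columns]
# Count 'Yes' answers per control mechanism and the share of all victims
control_counts = df_mapped_2[present].eq('Yes').sum()
control_pct = (control_counts / len(df_mapped_2) * 100).round(1)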
5.8 Exploitation & scam typology
What: Classifies exploitation (forced labour, sexual exploitation, forced criminality) and scam sub-types (romance, crypto, investment fraud). Analyses targets, languages, and platforms.
Output: Exploitation breakdown; scam typology matrix with infrastructure details.
5.9 Fraudulent job offers
What: Extracts and categorises fraudulent job types used to recruit victims. Uses regex to parse free-text job descriptions into standard categories.
Output: Job offer category frequency table saved to Excel.
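A sketch of the regex categorisation; the patterns and category names below are illustrative assumptions:

# Keyword patterns mapped to standard job categories (illustrative)
job_patterns = {
    'customer service': r'customer|call\s*cent(?:er|re)|support',
    'it / online work': r'\bit\b|online|typing|data\s*entry',
    'hospitality': r'hotel|restaurant|casino',
}

def categorize_job(text):
    text = str(text).lower()
    for category, pattern in job_patterns.items():
        if re.search(pattern, text):
            return category
    return 'other'

df_mapped_2['job_offer_category'] = df_mapped_2['expected_work_type'].apply(categorize_job)
print(df_mapped_2['job_offer_category'].value_counts())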
5.10 Airlines, visas & passports
What: Three sub-modules: (a) airline frequency during the journey; (b) visa type obtained before departure; (c) passport ownership rates by nationality.
Output: Three separate Excel summary tables with counts and percentages.
5.11 Ransom amounts
What: Analyses ransom amounts demanded for victims' release. Handles currency normalization, outlier detection, and range bucketing via regex.
Output: Distribution table; min/max/median/mean; range breakdown (<$500, $500–$2000, >$2000).
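A simplified sketch of the extraction and bucketing, assuming amounts are already in USD (the real module also normalises currencies):

# Strip thousands separators, then keep the first number in each cell
amounts = (
    df_mapped_2['ransom_amount'].astype(str)
    .str.replace(',', '', regex=False)
    .str.extract(r'(\d+(?:\.\d+)?)')[0]
    .astype(float)
)
ransom_range = pd.cut(
    amounts,
    bins=[0, 500, 2000, float('inf')],
    labels=['<$500', '$500-$2000', '>$2000'],
)
print(amounts.agg(['min', 'max', 'median', 'mean']).round(0))
print(ransom_range.value_counts())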
5.12 M&E indicators
What: Aggregates key M&E indicators: total VoTs screened, breakdown by gender, nationality, exploitation type, referral pathway, and protection status.
Output: Master M&E workbook with one sheet per indicator domain.
Output & Export
6.1 Save the cleaned dataset
outfile = f"_VoT/output/VoT_MMR_Cleaned_{datetime.today().strftime('%Y%m%d')}.xlsx"
with pd.ExcelWriter(outfile, engine='openpyxl') as writer:
df_mapped_2.to_excel(writer, sheet_name='Cleaned_Data', index=False)
# Apply header formatting
wb = load_workbook(outfile)
ws = wb['Cleaned_Data']
header_fill = PatternFill(start_color="1F4E79", fill_type="solid")
for cell in ws[1]:
cell.font = Font(bold=True, color="FFFFFF")
cell.fill = header_fill
cell.alignment = Alignment(horizontal="center")
ws.column_dimensions[cell.column_letter].width = 18
wb.save(outfile)
print(f"Saved β {outfile}")
6.2 Copy files for downstream analysis
import shutil
def copy_files(source_folder, dest_folder, extension=".xlsx"):
os.makedirs(dest_folder, exist_ok=True)
copied = 0
for f in Path(source_folder).glob(f"*{extension}"):
shutil.copy2(f, dest_folder)
copied += 1
print(f"Copied {copied} file(s) β {dest_folder}")
copy_files(source_folder="_VoT/output/", dest_folder="Working/")
Expected output files
_VoT/output/VoT_MMR_Cleaned_<YYYYMMDD>.xlsx  ← cleaned dataset (6.1)
_VoT/tables/VoT_MMR_MissingValues.xlsx       ← missing values report (3.3)
_VoT/tables/*.xlsx                           ← summary tables from the 12 analysis modules
Working/*.xlsx                               ← copies for downstream analysis (6.2)
Run Checklist
Use this before every notebook run.
Before running: _VoT/input/, _VoT/output/, and _VoT/tables/ exist; the raw KoboToolbox export is in _VoT/input/; dependencies are installed (1.1).
Data loading: the sheet name matches your export (2.2); row and column counts look sane; processing_cols reflects your Form v6 columns (2.3).
Data cleaning: the rename dictionary covers your form's columns (3.1); system columns dropped (3.2); missing values report reviewed (3.3); duplicates removed (3.4).
Mapping & transformation: binary and nationality mappings cover every value in the data (4.1, 4.2); optional African-nationals filter applied if needed (4.3).
Output & handover: cleaned dataset saved to _VoT/output/ (6.1); analysis tables generated in _VoT/tables/ (section 5); files copied to Working/ (6.2).
UN MMR · Victims of Trafficking Data Programme
For internal training use only. Handle data per UN data protection standards.