VoT MMR Data
Cleaning & Analysis
User guide and training reference for VoT_MMR_DataCleaning_v2_Formv6.ipynb. Environment setup, KoboToolbox loading, cleaning and mapping, twelve analysis modules, exports, a runnable pre-flight checklist, and troubleshooting—styled for long-form reading.
KoboToolbox (data source)
Source data collection platform for screening forms
pandas (library)
Core library for data wrangling and transformation
openpyxl (library)
Excel read/write and output formatting
matplotlib / seaborn (library)
Data visualization — charts and distributions
Python 3.10+ (requirement)
Minimum runtime version required
Jupyter Notebook (requirement)
Interactive cell-by-cell execution environment
Overview & Prerequisites
This notebook processes screening data collected from Victims of Trafficking (VoT) in Myanmar via KoboToolbox (Form v6). It performs end-to-end data cleaning — from raw Excel export to analysis-ready datasets — and generates summary tables for protection, recruitment, exploitation, and M&E reporting.
Folder structure expected
VoT_MMR_Cleaning/
├── VoT_MMR_DataCleaning_v2_Formv6.ipynb ← this notebook
├── _VoT/
│ ├── input/ ← raw KoboToolbox export (.xlsx)
│ ├── output/ ← cleaned datasets
│ └── tables/ ← summary analysis tables
├── config/ ← column rename dictionaries
└── Working/ ← intermediate working files
Create _VoT/input/, _VoT/output/, and _VoT/tables/ before running the notebook. Many export cells write to these paths without creating them.
Notebook logic & data flow
This section explains the order in which the notebook is meant to run and what each main variable represents. Use it as a map when a cell fails or when you return to the project after a break.
Start here if you are lost. Run the notebook top to bottom. Stages 1–4 prepare one analysis-ready table; stage 5 reads that table many times; stage 6 saves files. Do not run analysis modules before mapping is done.
Six stages (what happens in order)
Each stage builds on the previous one. The names below match how teams talk about the pipeline; the notebook cells follow the same sequence.
1. Ingest — Load the KoboToolbox .xlsx, force lowercase headers, keep the screening column list.
2. Subset — Restrict to processing_cols so every later step works on the same questionnaire slice.
3. Clean — Rename long group paths, drop Kobo metadata columns, report missing values, remove duplicates.
4. Map — Turn coded answers into labels (0/1 → Yes/No, country codes → nationalities). Output = one mapped DataFrame.
5. Analyse — Twelve modules run after mapping; each writes tables into _VoT/tables/.
6. Export — Save the cleaned workbook (dated) under _VoT/output/; optional copy to Working/.
One-page diagram
Kobo Excel (.xlsx)
│
▼
┌──────────────┐
│ 1–2 Ingest │ in_df (raw rows, chosen columns)
│ + subset │
└──────┬───────┘
▼
┌──────────────┐
│ 3 Clean │ vot_df / df (no junk columns, deduped)
└──────┬───────┘
▼
┌──────────────┐
│ 4 Map │ df_mapped → df_mapped_2 (labels, not codes)
└──────┬───────┘
▼
┌──────────────┐
│ 5 Analyse │ 12 modules → files in _VoT/tables/
└──────┬───────┘
▼
┌──────────────┐
│ 6 Export │ dated workbook in _VoT/output/
└──────────────┘
Compact checklist (same six steps)
1. Ingest — Read Excel (KoboToolbox export), normalise column names to lowercase
2. Subset — Keep screening columns listed in processing_cols (drop the rest for this pass)
3. Clean — Rename group paths → short names, drop system/metadata columns, missing-value report, deduplicate
4. Map — Binary 0/1 → Yes/No; country codes → nationality labels; optional filters (e.g. African nationals)
5. Analyse — Run the 12 modules in sequence; each writes tables under _VoT/tables/
6. Export — Save dated cleaned workbook to _VoT/output/; copy to Working/ if needed
Main DataFrames (variable lineage)
Names in your copy may differ slightly; always run print(df.shape) (and optionally .head()) after a big step.
Objects you will see in the cells
in_df — raw KoboToolbox rows with lowercased headers, restricted to processing_cols
vot_df — in_df with system/metadata columns dropped
df — working frame after the binary Yes/No mapping (apply_individual_mappings_notnull)
df_mapped_2 — fully mapped frame (Yes/No plus nationality labels) used by the analysis modules and the export
df_african — optional filtered subset of df_mapped_2 (African nationals only)
Rule of thumb
Downstream = anything that needs cleaned labels and consistent types. The 12 analysis modules expect the mapping work (Yes/No, nationalities) to be finished so frequencies and crosstabs are readable.
Where outputs land on disk
_VoT/output/ — cleaned, dated Excel. _VoT/tables/ — one or more workbooks per analysis module. Working/ — optional hand-off copies. Configure paths in the load section if your folder layout differs.
If a module returns empty or odd counts, work backwards: confirm sheet name, processing_cols, and that binary columns are already mapped to Yes/No where that module expects text.
Setup & Installation
Run the installation cells once per environment. You only need to do this the first time, or after setting up a fresh Python environment.
1.1 Install dependencies
# Run inside a Jupyter cell (note the ! prefix for shell commands)
!pip3 install gdown --quiet
!pip3 install -U scikit-learn scipy matplotlib --quiet
!pip3 install seaborn --quiet
!pip3 install openpyxl --quiet
1.2 Import libraries
import pandas as pd
import numpy as np
import seaborn as sbn
import matplotlib.pyplot as plt
from openpyxl.styles import Font, PatternFill, Border, Side, Alignment
from openpyxl.utils import get_column_letter
from openpyxl import load_workbook
import os, re
from pathlib import Path
from datetime import datetime
from collections import Counter
1.3 Utility functions
These helpers are defined early and reused throughout every analysis module.
def lowerCaseContent(df):
    """Lowercase all string values in object columns."""
    for col in df.select_dtypes(['object']):
        df[col] = df[col].str.lower().str.strip()
    return df

def integerToFloat(df):
    """Convert integer columns to float for consistent numeric handling."""
    for col in df.select_dtypes(['int64']):
        df[col] = df[col].astype(float)
    return df

def getListOfColsIsNaN(df):
    """Return columns with missing values sorted by count."""
    missing = df.isnull().sum()
    return missing[missing > 0].sort_values(ascending=False)

def returnUniqueValues(df):
    """Print unique values for all categorical columns."""
    for col in df.select_dtypes(['object']):
        print(f"\n=== {col} ===")
        print(df[col].unique())
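A quick way to exercise two of these helpers on a toy frame before pointing them at the real data (the column names and values here are illustrative, not from the form):

```python
import pandas as pd

def lowerCaseContent(df):
    for col in df.select_dtypes(['object']):
        df[col] = df[col].str.lower().str.strip()
    return df

def getListOfColsIsNaN(df):
    missing = df.isnull().sum()
    return missing[missing > 0].sort_values(ascending=False)

# Toy frame with mixed case, stray spaces, and one missing value
demo = pd.DataFrame({
    'gender': [' Female', 'MALE ', None],
    'age': [22, 31, 27],
})

demo = lowerCaseContent(demo)
report = getListOfColsIsNaN(demo)

print(demo['gender'].tolist()[:2])  # ['female', 'male']
print(report.to_dict())             # {'gender': 1}
```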
Data Loading
The raw data comes from KoboToolbox exported as Excel. The notebook reads a specific sheet and immediately lowercases all column names for consistency.
2.1 Set file paths
# Input — raw KoboToolbox export
infile = "_VoT/input/VoT_MMR_Raw.xlsx"
out_dir = "_VoT/output/"
table_dir = "_VoT/tables/"
# Create output folders if they don't exist yet
os.makedirs(out_dir, exist_ok=True)
os.makedirs(table_dir, exist_ok=True)
2.2 Load into a DataFrame
# Read specific sheet exported from KoboToolbox
in_df = pd.read_excel(infile, sheet_name="data_pxd_25")
# Normalize all column names to lowercase
in_df.columns = [col.lower() for col in in_df.columns]
# Quick sanity check
print(f"Dataset: {in_df.shape[0]} rows × {in_df.shape[1]} columns")
in_df.head(2)
If the sheet name differs (e.g. Sheet1 or data), update sheet_name. Check your KoboToolbox export settings to confirm the exact sheet name.
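If you are unsure what the sheet is called, pd.ExcelFile lists the sheet names without loading the data. This sketch builds a tiny workbook on the fly so it is runnable; in practice, point ExcelFile at _VoT/input/VoT_MMR_Raw.xlsx:

```python
import pandas as pd

# Stand-in workbook so the lookup can be demonstrated end to end
demo_path = "demo_export.xlsx"
with pd.ExcelWriter(demo_path, engine='openpyxl') as writer:
    pd.DataFrame({'a': [1]}).to_excel(writer, sheet_name='data_pxd_25', index=False)

xls = pd.ExcelFile(demo_path)
print(xls.sheet_names)  # ['data_pxd_25']
```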
2.3 Select analysis columns
processing_cols = [
    'interviewer_name', 'interviewer_org', 'interview_date',
    'gender', 'age', 'citizenship', 'education_level',
    'passport_owned', 'visa_type', 'travel_route', 'airline_used',
    'recruitment_contact', 'expected_work_type', 'exploitation_type',
    'scam_type', 'means_of_control', 'ransom_amount',
    # Add or remove columns based on your Form v6 export
]

in_df = in_df[[c for c in processing_cols if c in in_df.columns]]
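The list comprehension above silently skips expected columns that the export does not contain. Before subsetting, it can help to print which ones are absent; this sketch uses short stand-in lists in place of processing_cols and in_df.columns:

```python
expected = ['gender', 'age', 'citizenship', 'ransom_amount']  # stand-in for processing_cols
actual = ['gender', 'age', 'interview_date']                  # stand-in for in_df.columns

# Columns the analysis expects but this export lacks
missing = [c for c in expected if c not in set(actual)]
print(missing)  # ['citizenship', 'ransom_amount']
```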
Data Preparation & Cleaning
3.1 Rename columns
KoboToolbox exports use group-prefixed technical names. The notebook maps these to readable labels using a dictionary.
# key = KoboToolbox column name (already lowercased)
# value = analysis-friendly name
rename_dict = {
    'group_demographics/q_gender': 'gender',
    'group_demographics/q_age': 'age',
    'group_demographics/q_citizenship': 'citizenship',
    'group_travel/q_passport': 'passport_owned',
    'group_travel/q_visa': 'visa_type',
    # extend with all relevant columns from your form
}

in_df.rename(
    columns={k: v for k, v in rename_dict.items() if k in in_df.columns},
    inplace=True
)
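After renaming, any column that still contains a "/" was not covered by rename_dict and needs an entry. A quick check (the column names here are toy values):

```python
import pandas as pd

df = pd.DataFrame(columns=[
    'gender', 'age', 'group_travel/q_route',  # the last one was missed by rename_dict
])

# Group-path columns that survived the rename step
leftover = [c for c in df.columns if '/' in c]
print(leftover)  # ['group_travel/q_route']
```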
3.2 Drop system columns
cols_to_drop = [
    'survey_start_time', 'survey_end_time', 'record_id',
    'record_uuid', 'submission_time', 'validation_status',
    'system_notes', 'record_status', 'submitted_by',
    'form_version', 'record_tags',
]

vot_df = in_df.drop(
    columns=[c for c in cols_to_drop if c in in_df.columns]
)
3.3 Missing values report
total = in_df.isnull().sum().sort_values(ascending=False)
pct = (total / len(in_df) * 100).round(2)
missing = pd.concat([total, pct], axis=1, keys=['Missing Count', 'Missing %'])
missing = missing[missing['Missing Count'] > 0]

with pd.ExcelWriter(
    "_VoT/tables/VoT_MMR_MissingValues.xlsx", engine='openpyxl'
) as writer:
    missing.to_excel(writer, sheet_name='Missing Values')

print(missing)
Open VoT_MMR_MissingValues.xlsx to identify columns with >30% missingness. These may need imputation or exclusion before analysis.
3.4 Remove duplicates
n_dup = in_df.duplicated().sum()
print(f"Duplicate rows: {n_dup}")

if n_dup > 0:
    in_df = in_df.drop_duplicates()
    print(f"After dedup: {in_df.shape[0]} rows")
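Exact-row deduplication only removes rows identical in every column. If your team also wants to flag records that match on key interview fields, a subset-based pass can be sketched; the choice of key columns below is an assumption, not something the notebook prescribes:

```python
import pandas as pd

demo = pd.DataFrame({
    'interviewer_name': ['a', 'a', 'b'],
    'interview_date':   ['2024-01-01', '2024-01-01', '2024-01-02'],
    'age':              [22, 22, 30],
})

# Assumed key: same interviewer + date + age counts as one record
deduped = demo.drop_duplicates(subset=['interviewer_name', 'interview_date', 'age'])
print(len(deduped))  # 2
```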
Data Mapping
KoboToolbox stores binary responses as 0 / 1. The notebook converts these to Yes / No and maps country codes to readable nationality labels.
4.1 Binary → Yes / No
def apply_individual_mappings_notnull(df):
    """Map 0/1 binary columns to Yes/No — only on non-null rows."""
    yes_no_cols = [
        'passport_owned', 'visa_obtained', 'consent_given',
        'debt_bondage', 'document_confiscated', 'physical_abuse',
        # add all binary columns from your Form v6
    ]
    binary_map = {0: 'No', 1: 'Yes', 0.0: 'No', 1.0: 'Yes'}
    for col in yes_no_cols:
        if col in df.columns:
            mask = df[col].notna()
            df.loc[mask, col] = df.loc[mask, col].map(binary_map)
    return df

df = apply_individual_mappings_notnull(in_df)
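After running the mapping, a value_counts() check confirms the binary columns now hold text and that nulls were left alone. This is a condensed copy of the helper applied to a toy column:

```python
import pandas as pd

def apply_individual_mappings_notnull(df):
    # Condensed version: one binary column, same masking logic as above
    binary_map = {0: 'No', 1: 'Yes'}
    for col in ['passport_owned']:
        if col in df.columns:
            mask = df[col].notna()
            df.loc[mask, col] = df.loc[mask, col].map(binary_map)
    return df

demo = pd.DataFrame({'passport_owned': [1, 0, None, 1]}, dtype=object)
demo = apply_individual_mappings_notnull(demo)

print(demo['passport_owned'].value_counts().to_dict())  # {'Yes': 2, 'No': 1}
```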
4.2 Country code → Nationality
def apply_nationality_mappings(df_mapped):
    nationality_map = {
        'mmr': 'Myanmar', 'myanmar': 'Myanmar',
        'chn': 'China', 'tha': 'Thailand',
        'khm': 'Cambodia', 'lao': 'Laos',
        'vnm': 'Vietnam',
        # extend for all nationalities in your dataset
    }
    if 'citizenship' in df_mapped.columns:
        df_mapped['citizenship'] = (
            df_mapped['citizenship'].str.lower().str.strip()
            .map(nationality_map)
            .fillna(df_mapped['citizenship'])
        )
    return df_mapped

df_mapped_2 = apply_nationality_mappings(df)
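Because unmapped codes pass through via fillna, it is worth listing which raw values the map did not cover so you can extend it. A sketch with toy values and an abridged map:

```python
import pandas as pd

nationality_map = {'mmr': 'Myanmar', 'chn': 'China'}  # abridged stand-in
citizenship = pd.Series(['MMR ', 'chn', 'uga', 'ken'])

# Normalise the same way the mapping function does, then diff against the map keys
cleaned = citizenship.str.lower().str.strip()
unmapped = sorted(set(cleaned) - set(nationality_map))
print(unmapped)  # ['ken', 'uga']
```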
4.3 Filter — African nationals (optional)
# UN official list of 54 African countries
# UN official list of 54 African countries
african_countries = [
    'Algeria', 'Angola', 'Benin', 'Botswana', 'Burkina Faso',
    'Burundi', 'Cameroon', 'Central African Republic', 'Chad',
    # ... all 54 (add the full list)
]

df_african = df_mapped_2[df_mapped_2['citizenship'].isin(african_countries)]
print(f"African nationals: {len(df_african)}")
Analysis Modules
The notebook is organized into 12 self-contained analysis modules. Each produces one or more Excel tables saved to _VoT/tables/. Run them in sequence after completing cleaning.
Module 1: Gender & nationality
What: Cross-tabulates citizenship by gender; computes nationality percentages; identifies the top-10 nationalities.
Output: gender_nationality_table — crosstab + summary stats (total VoTs, % female, top nationality).
Module 2: Travel routes
What: Parses multi-step routes from free-text or select-multiple fields. Extracts origin → transit → destination patterns using regex and Counter.
Output: Route frequency table; most common transit countries; corridor bar chart.
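The origin → transit → destination extraction described above can be sketched with re and Counter. The separator characters and route strings below are assumptions about the free-text format, not the form's actual encoding:

```python
import re
from collections import Counter

# Assumed free-text format: countries separated by '>' or '-'
routes = [
    'Myanmar > Thailand > Cambodia',
    'Myanmar - Thailand',
    'Vietnam > Thailand > Cambodia',
]

transit = Counter()
for r in routes:
    steps = [s.strip() for s in re.split(r'[>-]', r)]
    # Everything between the first and last stop counts as transit
    transit.update(steps[1:-1])

print(transit.most_common(1))  # [('Thailand', 2)]
```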
Module 3: Education
What: Analyses education attainment distribution across victims. Generates a BarChart with openpyxl for embedded Excel visualization.
Output: Education breakdown table with embedded chart saved to _VoT/tables/.
Module 4: Age demographics
What: Creates age-group categories (18–24, 25–34, 35–44, 45+); cross-tabs by gender; computes median age and % by group. Handles missing or invalid age values.
Output: Age–gender pivot table; demographic summary statistics.
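The age-group banding described above can be sketched with pd.cut; the bin edges follow the groups named in the description, and the ages are toy values:

```python
import pandas as pd

ages = pd.Series([19, 23, 31, 40, 52, None])
groups = pd.cut(
    ages,
    bins=[18, 25, 35, 45, 120],
    labels=['18-24', '25-34', '35-44', '45+'],
    right=False,  # left-closed intervals, so 25 falls in 25-34
)

# Missing ages become NaN and are excluded from the counts
print(groups.value_counts().to_dict())  # {'18-24': 2, '25-34': 1, '35-44': 1, '45+': 1}
```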
Module 5: Recruitment contacts & expected work
What: Analyses contact methods (social media, personal referral, job board) and promised employment types (factory, domestic, hospitality, scam compound).
Output: Recruitment contact frequency table; expected work type breakdown.
Module 6: Recruitment platforms
What: Identifies platforms used for recruitment (Facebook, TikTok, Telegram, WeChat). Handles multi-select fields by splitting on separator characters.
Output: Platform frequency bar chart; normalized percentage table.
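KoboToolbox select-multiple answers typically arrive as one space-separated string per row; the split-and-count step described above can be sketched with str.split and explode. The platform values are toy data:

```python
import pandas as pd

platforms = pd.Series(['facebook telegram', 'tiktok', 'facebook'])

# One row per selected option, then tally
counts = platforms.str.split().explode().value_counts()
print(counts.to_dict())  # {'facebook': 2, 'telegram': 1, 'tiktok': 1}
```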
Module 7: Means of control
What: Counts control mechanisms in scam compounds: document confiscation, debt bondage, physical abuse, threats, restricted movement, phone confiscation.
Output: Frequency table per control type; % of victims experiencing each.
Module 8: Exploitation & scam typology
What: Classifies exploitation (forced labour, sexual exploitation, forced criminality) and scam sub-types (romance, crypto, investment fraud). Analyses targets, languages, and platforms.
Output: Exploitation breakdown; scam typology matrix with infrastructure details.
Module 9: Fraudulent job offers
What: Extracts and categorises fraudulent job types used to recruit victims. Uses regex to parse free-text job descriptions into standard categories.
Output: Job offer category frequency table saved to Excel.
Module 10: Travel logistics (airline, visa, passport)
What: Three sub-modules — (a) airline frequency during journey; (b) visa type obtained before departure; (c) passport ownership rates by nationality.
Output: Three separate Excel summary tables with counts and percentages.
Module 11: Ransom amounts
What: Analyses ransom amounts demanded for victims' release. Handles currency normalization, outlier detection, and range bucketing via regex.
Output: Distribution table; min/max/median/mean; range breakdown (<$500, $500–$2000, >$2000).
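The currency normalisation and range bucketing described above can be sketched as follows. The bucket edges mirror the ranges in the output description; the regex and the raw strings are invented examples, not the notebook's exact parsing rules:

```python
import re
import pandas as pd

raw = pd.Series(['$1,200', 'USD 300', '5000', None])

def parse_amount(text):
    """Pull the first numeric amount out of a free-text ransom field."""
    if not isinstance(text, str):
        return None
    m = re.search(r'([\d,]+(?:\.\d+)?)', text)
    return float(m.group(1).replace(',', '')) if m else None

amounts = raw.map(parse_amount)
buckets = pd.cut(
    amounts,
    bins=[0, 500, 2000, float('inf')],
    labels=['<$500', '$500-$2000', '>$2000'],
)
print(buckets.value_counts().to_dict())  # {'<$500': 1, '$500-$2000': 1, '>$2000': 1}
```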
Module 12: M&E indicators
What: Aggregates key M&E indicators: total VoTs screened, breakdown by gender, nationality, exploitation type, referral pathway, and protection status.
Output: Master M&E workbook — one sheet per indicator domain.
Output & Export
6.1 Save the cleaned dataset
outfile = f"_VoT/output/VoT_MMR_Cleaned_{datetime.today().strftime('%Y%m%d')}.xlsx"

with pd.ExcelWriter(outfile, engine='openpyxl') as writer:
    df_mapped_2.to_excel(writer, sheet_name='Cleaned_Data', index=False)

# Apply header formatting
wb = load_workbook(outfile)
ws = wb['Cleaned_Data']
header_fill = PatternFill(start_color="1F4E79", fill_type="solid")

for cell in ws[1]:
    cell.font = Font(bold=True, color="FFFFFF")
    cell.fill = header_fill
    cell.alignment = Alignment(horizontal="center")
    ws.column_dimensions[cell.column_letter].width = 18

wb.save(outfile)
print(f"Saved → {outfile}")
6.2 Copy files for downstream analysis
import shutil
def copy_files(source_folder, dest_folder, extension=".xlsx"):
    os.makedirs(dest_folder, exist_ok=True)
    copied = 0
    for f in Path(source_folder).glob(f"*{extension}"):
        shutil.copy2(f, dest_folder)
        copied += 1
    print(f"Copied {copied} file(s) → {dest_folder}")

copy_files(source_folder="_VoT/output/", dest_folder="Working/")
Expected output files
_VoT/output/VoT_MMR_Cleaned_<YYYYMMDD>.xlsx (the cleaned, formatted dataset)
_VoT/tables/VoT_MMR_MissingValues.xlsx plus one or more workbooks per analysis module
Working/ copies of the output files, if copy_files was run
Run Checklist
Use this before every notebook run, ticking off each group in order before moving to the next stage.
Before running
- _VoT/input/, _VoT/output/, and _VoT/tables/ exist; the raw export sits in _VoT/input/
- Packages from 1.1 are installed; Python 3.10+ and Jupyter are available
Data loading
- infile and sheet_name match the KoboToolbox export
- in_df.shape shows the expected row and column counts
Data cleaning
- rename_dict covers this form version's group paths; system columns are dropped
- Missing-values workbook reviewed; duplicates removed
Mapping & transformation
- Binary columns read Yes/No; citizenship shows nationality labels, not codes
Output & handover
- Dated workbook saved to _VoT/output/; copies placed in Working/ if needed
Troubleshooting
Common errors when the environment, file paths, or column names do not match what the notebook expects. Resolve these before re-running the analysis blocks.
File or sheet not found
Confirm the Excel path, that the file exists in _VoT/input/, and that sheet_name matches the export (e.g. data_pxd_25 vs Sheet1).
import os
print("Exists:", os.path.exists(infile))
Column or mapping errors
KoboToolbox labels change between form versions. If rename or mapping steps fail, list columns that are still not standardised and extend rename_dict or processing_cols.
print(sorted(in_df.columns.tolist()))
Memory on very large workbooks
For huge row counts, load only the columns you need with usecols, process in chunks, or filter early. Close other applications while running heavy cells.
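The usecols option works directly on the Excel file; note that pandas reads Excel workbooks whole, so true chunked reading needs a CSV re-export and read_csv(chunksize=...). This sketch builds a stand-in file so it is runnable; in practice, point read_excel at your export:

```python
import pandas as pd

# Stand-in file for the real export
demo_path = "demo_big.xlsx"
pd.DataFrame({
    'gender': ['female', 'male'],
    'age': [25, 31],
    'free_text_notes': ['long text', 'longer text'],  # expensive column we can skip
}).to_excel(demo_path, index=False)

# Load only the columns the analysis needs
slim = pd.read_excel(demo_path, usecols=['gender', 'age'])
print(list(slim.columns))  # ['gender', 'age']
```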
Data protection
Store raw and processed files in access-controlled locations. Follow UN and programme rules for VoT data; this guide is for workflow only and does not replace organisational policies.
Next steps & support
After cleaning, typical next uses are deeper statistics, dashboards, and reporting. Keep methodology documented and version both raw and cleaned data.
Validation
Cross-check the missing-values workbook and M&E summary against expectations before sign-off.
QualityLineage
Record notebook version, form version, and any manual edits to dictionaries or paths.
DocumentationDownstream tools
Use cleaned files in Power BI, R, or Python for further analysis—read from _VoT/output/ or Working/.
HandoverGetting help
Review cell outputs and the checklist above; verify paths, sheet names, and that the packages from the Setup & Installation section are installed. For script-specific questions, use inline notebook comments and your team's data SOPs.
UN MMR · Victims of Trafficking Data Programme
For internal training use only — handle data per UN data protection standards.