VoT MMR Data Cleaning Guide
A complete walkthrough of VoT_MMR_DataCleaning_v2_Formv6.ipynb. Covers environment setup, loading KoboToolbox data, column mapping, cleaning transformations, 12 analysis modules, and exporting results.
Need the pipeline map (stages, DataFrames, folders)? Open the full user guide → Logic section.
Tech stack
- KoboToolbox (data source): collection platform for the screening forms
- pandas (library): core data wrangling and transformation
- openpyxl (library): Excel read/write and output formatting
- matplotlib / seaborn (libraries): data visualization, charts and distributions
- Python 3.10+ (requirement): minimum runtime version
- Jupyter Notebook (requirement): interactive cell-by-cell execution environment

Overview & Prerequisites
This notebook processes screening data collected from Victims of Trafficking (VoT) in Myanmar via KoboToolbox (Form v6). It performs end-to-end data cleaning, from raw Excel export to analysis-ready datasets, and generates summary tables for protection, recruitment, exploitation, and M&E reporting.
Folder structure expected
VoT_MMR_Cleaning/
├── VoT_MMR_DataCleaning_v2_Formv6.ipynb   ← this notebook
├── _VoT/
│   ├── input/    ← raw KoboToolbox export (.xlsx)
│   ├── output/   ← cleaned datasets
│   └── tables/   ← summary analysis tables
├── config/       ← column rename dictionaries
└── Working/      ← intermediate working files
Create _VoT/input/ (with the raw export inside) before running the notebook. Section 2.1 creates _VoT/output/ and _VoT/tables/ if they are missing, but later export cells write to these paths without creating them, so don't skip that cell.
Setup & Installation
Run the installation cells once per environment: the first time you use the notebook, or after setting up a fresh Python environment.
1.1 Install dependencies
# Run inside a Jupyter cell (note the ! prefix for shell commands)
!pip3 install gdown --quiet
!pip3 install -U scikit-learn scipy matplotlib --quiet
!pip3 install seaborn --quiet
!pip3 install openpyxl --quiet
1.2 Import libraries
import pandas as pd
import numpy as np
import seaborn as sbn
import matplotlib.pyplot as plt
from openpyxl.styles import Font, PatternFill, Border, Side, Alignment
from openpyxl.utils import get_column_letter
from openpyxl import load_workbook
import os, re
from pathlib import Path
from datetime import datetime
from collections import Counter
1.3 Utility functions
These helpers are defined early and reused throughout every analysis module.
def lowerCaseContent(df):
"""Lowercase all string values in object columns."""
for col in df.select_dtypes(['object']):
df[col] = df[col].str.lower().str.strip()
return df
def integerToFloat(df):
"""Convert integer columns to float for consistent numeric handling."""
for col in df.select_dtypes(['int64']):
df[col] = df[col].astype(float)
return df
def getListOfColsIsNaN(df):
"""Return columns with missing values sorted by count."""
missing = df.isnull().sum()
return missing[missing > 0].sort_values(ascending=False)
def returnUniqueValues(df):
"""Print unique values for all categorical columns."""
for col in df.select_dtypes(['object']):
print(f"\n=== {col} ===")
print(df[col].unique())
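A typical first pass chains these helpers once the data is loaded. A minimal sketch, assuming the in_df DataFrame from section 2.2:

# Normalize text casing, harmonize numeric dtypes, then inspect gaps
in_df = lowerCaseContent(in_df)
in_df = integerToFloat(in_df)
print(getListOfColsIsNaN(in_df))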
Data Loading
The raw data comes from KoboToolbox exported as Excel. The notebook reads a specific sheet and immediately lowercases all column names for consistency.
2.1 Set file paths
# Input: raw KoboToolbox export
infile = "_VoT/input/VoT_MMR_Raw.xlsx"
out_dir = "_VoT/output/"
table_dir = "_VoT/tables/"
# Create output folders if they don't exist yet
os.makedirs(out_dir, exist_ok=True)
os.makedirs(table_dir, exist_ok=True)
2.2 Load into a DataFrame
# Read specific sheet exported from KoboToolbox
in_df = pd.read_excel(infile, sheet_name="data_pxd_25")
# Normalize all column names to lowercase
in_df.columns = [col.lower() for col in in_df.columns]
# Quick sanity check
print(f"Dataset: {in_df.shape[0]} rows Γ {in_df.shape[1]} columns")
in_df.head(2)
If the sheet name differs (e.g. Sheet1 or data), update sheet_name. Check your KoboToolbox export settings to confirm the exact sheet name.
2.3 Select analysis columns
processing_cols = [
'interviewer_name', 'interviewer_org', 'interview_date',
'gender', 'age', 'citizenship', 'education_level',
'passport_owned', 'visa_type', 'travel_route', 'airline_used',
'recruitment_contact', 'expected_work_type', 'exploitation_type',
'scam_type', 'means_of_control', 'ransom_amount',
# Add or remove columns based on your Form v6 export
]
in_df = in_df[[c for c in processing_cols if c in in_df.columns]]
Data Preparation & Cleaning
3.1 Rename columns
KoboToolbox exports use group-prefixed technical names. The notebook maps these to readable labels using a dictionary.
# key = KoboToolbox column name (already lowercased)
# value = analysis-friendly name
rename_dict = {
'group_demographics/q_gender': 'gender',
'group_demographics/q_age': 'age',
'group_demographics/q_citizenship': 'citizenship',
'group_travel/q_passport': 'passport_owned',
'group_travel/q_visa': 'visa_type',
# extend with all relevant columns from your form
}
in_df.rename(
columns={k: v for k, v in rename_dict.items() if k in in_df.columns},
inplace=True
)
3.2 Drop system columns
cols_to_drop = [
'survey_start_time', 'survey_end_time', 'record_id',
'record_uuid', 'submission_time', 'validation_status',
'system_notes', 'record_status', 'submitted_by',
'form_version', 'record_tags',
]
vot_df = in_df.drop(
    columns=[c for c in cols_to_drop if c in in_df.columns]
)
in_df = vot_df  # later cells continue from in_df, so carry the trimmed frame forward
3.3 Missing values report
total = in_df.isnull().sum().sort_values(ascending=False)
pct = (total / len(in_df) * 100).round(2)
missing = pd.concat([total, pct], axis=1, keys=['Missing Count', 'Missing %'])
missing = missing[missing['Missing Count'] > 0]
with pd.ExcelWriter(
"_VoT/tables/VoT_MMR_MissingValues.xlsx", engine='openpyxl'
) as writer:
missing.to_excel(writer, sheet_name='Missing Values')
print(missing)
Open VoT_MMR_MissingValues.xlsx to identify columns with >30% missingness. These may need imputation or exclusion before analysis.
3.4 Remove duplicates
n_dup = in_df.duplicated().sum()
print(f"Duplicate rows: {n_dup}")
if n_dup > 0:
in_df = in_df.drop_duplicates()
print(f"After dedup: {in_df.shape[0]} rows")
Data Mapping
KoboToolbox stores binary responses as 0 / 1. The notebook converts these to Yes / No and maps country codes to readable nationality labels.
4.1 Binary → Yes / No
def apply_individual_mappings_notnull(df):
"""Map 0/1 binary columns to Yes/No β only on non-null rows."""
yes_no_cols = [
'passport_owned', 'visa_obtained', 'consent_given',
'debt_bondage', 'document_confiscated', 'physical_abuse',
# add all binary columns from your Form v6
]
    binary_map = {0: 'No', 1: 'Yes'}  # covers 0.0/1.0 too, since floats hash equal to ints
for col in yes_no_cols:
if col in df.columns:
mask = df[col].notna()
df.loc[mask, col] = df.loc[mask, col].map(binary_map)
return df
df = apply_individual_mappings_notnull(in_df)
4.2 Country code → Nationality
def apply_nationality_mappings(df_mapped):
nationality_map = {
'mmr': 'Myanmar', 'myanmar': 'Myanmar',
'chn': 'China', 'tha': 'Thailand',
'khm': 'Cambodia', 'lao': 'Laos',
'vnm': 'Vietnam',
# extend for all nationalities in your dataset
}
if 'citizenship' in df_mapped.columns:
df_mapped['citizenship'] = (
df_mapped['citizenship'].str.lower().str.strip()
.map(nationality_map)
.fillna(df_mapped['citizenship'])
)
return df_mapped
df_mapped_2 = apply_nationality_mappings(df)
4.3 Filter: African nationals (optional)
# UN official list of 54 African countries
african_countries = [
'Algeria', 'Angola', 'Benin', 'Botswana', 'Burkina Faso',
'Burundi', 'Cameroon', 'Central African Republic', 'Chad',
# ... all 54 (add the full list)
]
df_african = df_mapped_2[df_mapped_2['citizenship'].isin(african_countries)]
print(f"African nationals: {len(df_african)}")
Analysis Modules
The notebook is organized into 12 self-contained analysis modules (5.1–5.12 below). Each produces one or more Excel tables saved to _VoT/tables/. Run them in sequence after completing the cleaning steps above.
5.1 Gender & nationality
What: Cross-tabulates citizenship by gender; computes nationality percentages; identifies the top-10 nationalities.
Output: gender_nationality_table, a crosstab plus summary stats (total VoTs, % female, top nationality).
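A minimal sketch of the core table, assuming the mapped frame df_mapped_2 from section 4.2:

# Citizenship-by-gender counts, with row/column totals in an 'All' margin
gender_nationality_table = pd.crosstab(
    df_mapped_2['citizenship'], df_mapped_2['gender'], margins=True
)
# Nationality shares and the top-10 nationalities
nat_pct = (df_mapped_2['citizenship'].value_counts(normalize=True) * 100).round(1)
print(nat_pct.head(10))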
5.2 Travel routes
What: Parses multi-step routes from free-text or select-multiple fields. Extracts origin → transit → destination patterns using regex and Counter.
Output: Route frequency table; most common transit countries; corridor bar chart.
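A sketch of the parsing step. The separator pattern below ('>', '->', or commas) is an assumption; adjust it to how routes appear in your export:

route_counter, transit_counter = Counter(), Counter()
for route in df_mapped_2['travel_route'].dropna():
    # Split the route string into ordered stops
    stops = [s.strip() for s in re.split(r'\s*(?:->|>|,)\s*', str(route)) if s.strip()]
    if len(stops) >= 2:
        route_counter[' > '.join(stops)] += 1   # full corridor
        transit_counter.update(stops[1:-1])     # middle stops = transit countries
print(route_counter.most_common(10))
print(transit_counter.most_common(5))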
5.3 Education attainment
What: Analyses the education attainment distribution across victims. Generates a BarChart with openpyxl for embedded Excel visualization.
Output: Education breakdown table with embedded chart saved to _VoT/tables/.
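A sketch of the openpyxl chart embedding; the filename VoT_MMR_Education.xlsx is an assumed example:

from openpyxl.chart import BarChart, Reference

edu_counts = df_mapped_2['education_level'].value_counts()
edu_file = table_dir + "VoT_MMR_Education.xlsx"  # assumed filename
edu_counts.rename('count').to_frame().to_excel(edu_file, sheet_name='Education')

wb = load_workbook(edu_file)
ws = wb['Education']
chart = BarChart()
chart.title = "Education attainment"
last_row = len(edu_counts) + 1  # header row plus one row per education level
chart.add_data(Reference(ws, min_col=2, min_row=1, max_row=last_row), titles_from_data=True)
chart.set_categories(Reference(ws, min_col=1, min_row=2, max_row=last_row))
ws.add_chart(chart, "D2")
wb.save(edu_file)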
5.4 Age demographics
What: Creates age-group categories (18–24, 25–34, 35–44, 45+); cross-tabs by gender; computes median age and % by group. Handles missing or invalid age values.
Output: Age × gender pivot table; demographic summary statistics.
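A minimal sketch of the bucketing, using pd.cut with the module's age groups:

# Coerce invalid ages to NaN, then bucket into the report's age groups
age_num = pd.to_numeric(df_mapped_2['age'], errors='coerce')
df_mapped_2['age_group'] = pd.cut(
    age_num,
    bins=[18, 25, 35, 45, 120],
    labels=['18-24', '25-34', '35-44', '45+'],
    right=False,  # [18, 25) -> '18-24', and so on
)
age_gender = pd.crosstab(df_mapped_2['age_group'], df_mapped_2['gender'])
print(f"Median age: {age_num.median():.0f}")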
5.5 Recruitment channels
What: Analyses contact methods (social media, personal referral, job board) and promised employment types (factory, domestic, hospitality, scam compound).
Output: Recruitment contact frequency table; expected work type breakdown.
5.6 Recruitment platforms
What: Identifies platforms used for recruitment (Facebook, TikTok, Telegram, WeChat). Handles multi-select fields by splitting on separator characters.
Output: Platform frequency bar chart; normalized percentage table.
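A sketch of the multi-select handling. The column name recruitment_platform is hypothetical; substitute the actual Form v6 field:

# KoboToolbox select-multiple answers are space-separated by default;
# some exports use ';' or ',' instead, so adjust the pattern to match
platforms = (
    df_mapped_2['recruitment_platform']  # hypothetical column name
    .dropna()
    .str.split(r'[;,\s]+', regex=True)
    .explode()
    .str.strip()
)
platform_pct = (platforms.value_counts(normalize=True) * 100).round(1)
print(platform_pct)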
5.7 Means of control
What: Counts control mechanisms in scam compounds: document confiscation, debt bondage, physical abuse, threats, restricted movement, phone confiscation.
Output: Frequency table per control type; % of victims experiencing each.
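A minimal sketch, assuming the binary columns were already mapped to Yes/No in section 4.1:

control_cols = [
    'document_confiscated', 'debt_bondage', 'physical_abuse',
    # extend with threats, restricted movement, phone confiscation, etc.
]
present = [c for c in control_cols if c in df_mapped_2.columns]
# Count 'Yes' answers per control mechanism and the share of all victims
control_counts = df_mapped_2[present].eq('Yes').sum()
control_pct = (control_counts / len(df_mapped_2) * 100).round(1)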
5.8 Exploitation & scam typology
What: Classifies exploitation (forced labour, sexual exploitation, forced criminality) and scam sub-types (romance, crypto, investment fraud). Analyses targets, languages, and platforms.
Output: Exploitation breakdown; scam typology matrix with infrastructure details.
5.9 Fraudulent job offers
What: Extracts and categorises fraudulent job types used to recruit victims. Uses regex to parse free-text job descriptions into standard categories.
Output: Job offer category frequency table saved to Excel.
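A sketch of the regex categorisation; the patterns and category names below are illustrative assumptions:

# Keyword patterns mapped to standard job categories (illustrative)
job_patterns = {
    'customer service': r'customer|call\s*cent(?:er|re)|support',
    'it / online work': r'\bit\b|online|typing|data\s*entry',
    'hospitality': r'hotel|restaurant|casino',
}

def categorize_job(text):
    text = str(text).lower()
    for category, pattern in job_patterns.items():
        if re.search(pattern, text):
            return category
    return 'other'

df_mapped_2['job_offer_category'] = df_mapped_2['expected_work_type'].apply(categorize_job)
print(df_mapped_2['job_offer_category'].value_counts())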
5.10 Airlines, visas & passports
What: Three sub-modules: (a) airline frequency during the journey; (b) visa type obtained before departure; (c) passport ownership rates by nationality.
Output: Three separate Excel summary tables with counts and percentages.
5.11 Ransom amounts
What: Analyses ransom amounts demanded for victims' release. Handles currency normalization, outlier detection, and range bucketing via regex.
Output: Distribution table; min/max/median/mean; range breakdown (<$500, $500–$2000, >$2000).
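A simplified sketch of the extraction and bucketing, assuming amounts are already in USD (the real module also normalises currencies):

# Strip thousands separators, then keep the first number in each cell
amounts = (
    df_mapped_2['ransom_amount'].astype(str)
    .str.replace(',', '', regex=False)
    .str.extract(r'(\d+(?:\.\d+)?)')[0]
    .astype(float)
)
ransom_range = pd.cut(
    amounts,
    bins=[0, 500, 2000, float('inf')],
    labels=['<$500', '$500-$2000', '>$2000'],
)
print(amounts.agg(['min', 'max', 'median', 'mean']).round(0))
print(ransom_range.value_counts())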
5.12 M&E indicators
What: Aggregates key M&E indicators: total VoTs screened, breakdown by gender, nationality, exploitation type, referral pathway, and protection status.
Output: Master M&E workbook with one sheet per indicator domain.
Output & Export
6.1 Save the cleaned dataset
outfile = f"_VoT/output/VoT_MMR_Cleaned_{datetime.today().strftime('%Y%m%d')}.xlsx"
with pd.ExcelWriter(outfile, engine='openpyxl') as writer:
df_mapped_2.to_excel(writer, sheet_name='Cleaned_Data', index=False)
# Apply header formatting
wb = load_workbook(outfile)
ws = wb['Cleaned_Data']
header_fill = PatternFill(start_color="1F4E79", fill_type="solid")
for cell in ws[1]:
cell.font = Font(bold=True, color="FFFFFF")
cell.fill = header_fill
cell.alignment = Alignment(horizontal="center")
ws.column_dimensions[cell.column_letter].width = 18
wb.save(outfile)
print(f"Saved β {outfile}")
6.2 Copy files for downstream analysis
import shutil
def copy_files(source_folder, dest_folder, extension=".xlsx"):
os.makedirs(dest_folder, exist_ok=True)
copied = 0
for f in Path(source_folder).glob(f"*{extension}"):
shutil.copy2(f, dest_folder)
copied += 1
print(f"Copied {copied} file(s) β {dest_folder}")
copy_files(source_folder="_VoT/output/", dest_folder="Working/")
Expected output files
_VoT/output/VoT_MMR_Cleaned_<YYYYMMDD>.xlsx  ← cleaned dataset (6.1)
_VoT/tables/VoT_MMR_MissingValues.xlsx       ← missing values report (3.3)
_VoT/tables/*.xlsx                           ← summary tables from the 12 analysis modules
Working/*.xlsx                               ← copies for downstream analysis (6.2)
Run Checklist
Use this before every notebook run.
Before running: _VoT/input/, _VoT/output/, and _VoT/tables/ exist; the raw KoboToolbox export is in _VoT/input/; dependencies are installed (1.1).
Data loading: the sheet name matches your export (2.2); row and column counts look sane; processing_cols reflects your Form v6 columns (2.3).
Data cleaning: the rename dictionary covers your form's columns (3.1); system columns dropped (3.2); missing values report reviewed (3.3); duplicates removed (3.4).
Mapping & transformation: binary and nationality mappings cover every value in the data (4.1, 4.2); optional African-nationals filter applied if needed (4.3).
Output & handover: cleaned dataset saved to _VoT/output/ (6.1); analysis tables generated in _VoT/tables/ (section 5); files copied to Working/ (6.2).
UN MMR · Victims of Trafficking Data Programme
For internal training use only. Handle data per UN data protection standards.