Free training guide

VoT MMR Data
Cleaning & Analysis

User guide and training reference for VoT_MMR_DataCleaning_v2_Formv6.ipynb. Environment setup, KoboToolbox loading, cleaning and mapping, twelve analysis modules, exports, a runnable pre-flight checklist, and troubleshooting—styled for long-form reading.


Compiled from KoboToolbox Form v6 data and UN MMR VoT screening methodology.
If you want to go deeper, here are the references.

KoboToolbox

Source data collection platform for screening forms

Data Source

pandas

Core library for data wrangling and transformation

Library

openpyxl

Excel read/write and output formatting

Library

matplotlib / seaborn

Data visualization — charts and distributions

Library

Python 3.10+

Minimum runtime version required

Requirement

Jupyter Notebook

Interactive cell-by-cell execution environment

Requirement
Section 00

Overview & Prerequisites

This notebook processes screening data collected from Victims of Trafficking (VoT) in Myanmar via KoboToolbox (Form v6). It performs end-to-end data cleaning — from raw Excel export to analysis-ready datasets — and generates summary tables for protection, recruitment, exploitation, and M&E reporting.

Folder structure expected

plaintext / Project layout
VoT_MMR_Cleaning/
├── VoT_MMR_DataCleaning_v2_Formv6.ipynb   ← this notebook
├── _VoT/
│   ├── input/                              ← raw KoboToolbox export (.xlsx)
│   ├── output/                             ← cleaned datasets
│   └── tables/                             ← summary analysis tables
├── config/                                 ← column rename dictionaries
└── Working/                                ← intermediate working files
💡

Create _VoT/input/, _VoT/output/, and _VoT/tables/ before running the notebook. Many export cells write to these paths without creating them.


Section 01

Notebook logic & data flow

This section explains in what order the notebook is meant to run and what each main variable represents. Use it as a map when a cell fails or when you return to the project after a break.

📌

Start here if you are lost. Run the notebook top to bottom. Stages 1–4 prepare one analysis-ready table; stage 5 reads that table many times; stage 6 saves files. Do not run analysis modules before mapping is done.

Six stages (what happens in order)

Each stage builds on the previous one. The names below match how teams talk about the pipeline; the notebook cells follow the same sequence.

1
Ingest

Load the KoboToolbox .xlsx, force lowercase headers, keep the screening column list.

2
Subset

Restrict to processing_cols so every later step works on the same questionnaire slice.

3
Clean

Rename long group paths, drop Kobo metadata columns, report missing values, remove duplicates.

4
Map

Turn coded answers into labels (0/1 → Yes/No, country codes → nationalities). Output = one mapped DataFrame.

5
Analyse

Twelve modules run after mapping; each writes tables into _VoT/tables/.

6
Export

Save the cleaned workbook (dated) under _VoT/output/; optional copy to Working/.

One-page diagram

flow / Data movement
  Kobo Excel (.xlsx)
         │
         ▼
   ┌──────────────┐
   │ 1–2 Ingest   │  in_df  (raw rows, chosen columns)
   │   + subset   │
   └──────┬───────┘
          ▼
   ┌──────────────┐
   │ 3 Clean      │  vot_df / df  (no junk columns, deduped)
   └──────┬───────┘
          ▼
   ┌──────────────┐
   │ 4 Map        │  df_mapped → df_mapped_2  (labels, not codes)
   └──────┬───────┘
          ▼
   ┌──────────────┐
   │ 5 Analyse    │  12 modules → files in _VoT/tables/
   └──────┬───────┘
          ▼
   ┌──────────────┐
   │ 6 Export     │  dated workbook in _VoT/output/
   └──────────────┘

Compact checklist (same six steps)

flow / Stages in one block
1. Ingest  — Read Excel (KoboToolbox export), normalise column names to lowercase
2. Subset  — Keep screening columns listed in processing_cols (drop the rest for this pass)
3. Clean   — Rename group paths → short names, drop system/metadata columns, missing-value report, deduplicate
4. Map     — Binary 0/1 → Yes/No; country codes → nationality labels; optional filters (e.g. African nationals)
5. Analyse — Run the 12 modules in sequence; each writes tables under _VoT/tables/
6. Export  — Save dated cleaned workbook to _VoT/output/; copy to Working/ if needed

Main DataFrames (variable lineage)

Names in your copy may differ slightly; always run print(df.shape) (and optionally .head()) after a big step.

Objects you will see in the cells

in_df After read_excel + lowercase + column subset. Working copy of raw row content.
vot_df (or similar) After dropping system columns; sometimes used as the “clean” base before value mapping.
df, df_mapped, df_mapped_2 Intermediate and final mapped tables; the notebook reassigns df_* as mappings apply. The last mapped frame before export is the one to treat as “analysis-ready”.
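One habit keeps this lineage honest: print the shape after every reassignment. A minimal helper sketch (`log_shape` is an illustration, not a function in the notebook):

```python
import pandas as pd

def log_shape(df, label):
    """Print a one-line shape summary so stage-to-stage row loss is visible."""
    print(f"{label}: {df.shape[0]} rows x {df.shape[1]} cols")
    return df

# Toy frame standing in for a real stage output
demo = pd.DataFrame({"gender": ["f", "m"], "age": [22, 31]})
log_shape(demo, "in_df")
```

Because the helper returns the frame unchanged, it can be chained between stages without altering the pipeline.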

Rule of thumb

Downstream = anything that needs cleaned labels and consistent types. The 12 analysis modules expect the mapping work (Yes/No, nationalities) to be finished so frequencies and crosstabs are readable.

Where outputs land on disk

_VoT/output/ — cleaned, dated Excel. _VoT/tables/ — one or more workbooks per analysis module. Working/ — optional hand-off copies. Configure paths in the load section if your folder layout differs.

💡

If a module returns empty or odd counts, work backwards: confirm sheet name, processing_cols, and that binary columns are already mapped to Yes/No where that module expects text.


Section 02

Setup & Installation

Run the installation cells once per environment. You only need to do this the first time, or after setting up a fresh Python environment.

1.1 Install dependencies

bash / Jupyter cell — install packages
# Run inside a Jupyter cell (note the ! prefix for shell commands)
!pip3 install gdown --quiet
!pip3 install -U scikit-learn scipy matplotlib --quiet
!pip3 install seaborn --quiet
!pip3 install openpyxl --quiet

1.2 Import libraries

python / Core imports
import pandas as pd
import numpy as np
import seaborn as sbn
import matplotlib.pyplot as plt
from openpyxl.styles import Font, PatternFill, Border, Side, Alignment
from openpyxl.utils import get_column_letter
from openpyxl import load_workbook
import os, re
from pathlib import Path
from datetime import datetime
from collections import Counter

1.3 Utility functions

These helpers are defined early and reused throughout every analysis module.

python / Helper functions
def lowerCaseContent(df):
    """Lowercase and strip all string values in object columns.
    Caution: the .str accessor turns non-string values in an object
    column into NaN, so cast mixed columns to string first if needed."""
    for col in df.select_dtypes(['object']):
        df[col] = df[col].str.lower().str.strip()
    return df

def integerToFloat(df):
    """Convert integer columns to float for consistent numeric handling."""
    for col in df.select_dtypes(['int64']):
        df[col] = df[col].astype(float)
    return df

def getListOfColsIsNaN(df):
    """Return columns with missing values sorted by count."""
    missing = df.isnull().sum()
    return missing[missing > 0].sort_values(ascending=False)

def returnUniqueValues(df):
    """Print unique values for all categorical columns."""
    for col in df.select_dtypes(['object']):
        print(f"\n=== {col} ===")
        print(df[col].unique())
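To see what the missing-values helper returns, here is a toy run (the helper is repeated so the cell is self-contained; the toy frame is invented for illustration):

```python
import numpy as np
import pandas as pd

def getListOfColsIsNaN(df):
    """Return columns with missing values sorted by count."""
    missing = df.isnull().sum()
    return missing[missing > 0].sort_values(ascending=False)

toy = pd.DataFrame({
    "age": [22, np.nan, 31],
    "gender": ["f", "m", None],
    "citizenship": ["mmr", "chn", "tha"],  # complete, so excluded from the report
})
print(getListOfColsIsNaN(toy))
# age and gender each report 1 missing value; citizenship is omitted
```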

Section 03

Data Loading

The raw data comes from KoboToolbox exported as Excel. The notebook reads a specific sheet and immediately lowercases all column names for consistency.

2.1 Set file paths

python / Define input and output paths
# Input — raw KoboToolbox export
infile    = "_VoT/input/VoT_MMR_Raw.xlsx"
out_dir   = "_VoT/output/"
table_dir = "_VoT/tables/"

# Create output folders if they don't exist yet
os.makedirs(out_dir,   exist_ok=True)
os.makedirs(table_dir, exist_ok=True)

2.2 Load into a DataFrame

python / Read Excel and normalize column names
# Read specific sheet exported from KoboToolbox
in_df = pd.read_excel(infile, sheet_name="data_pxd_25")

# Normalize all column names to lowercase
in_df.columns = [col.lower() for col in in_df.columns]

# Quick sanity check
print(f"Dataset: {in_df.shape[0]} rows × {in_df.shape[1]} columns")
in_df.head(2)
⚠️

If the sheet name differs (e.g. Sheet1 or data), update sheet_name. Check your KoboToolbox export settings to confirm the exact sheet name.

2.3 Select analysis columns

python / Subset to trafficking-relevant columns
processing_cols = [
    'interviewer_name', 'interviewer_org', 'interview_date',
    'gender', 'age', 'citizenship', 'education_level',
    'passport_owned', 'visa_type', 'travel_route', 'airline_used',
    'recruitment_contact', 'expected_work_type', 'exploitation_type',
    'scam_type', 'means_of_control', 'ransom_amount',
    # Add or remove columns based on your Form v6 export
]

in_df = in_df[[c for c in processing_cols if c in in_df.columns]]
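Because the subset silently drops names that are absent from the export, it helps to report the gap first. A sketch with an illustrative three-column list and a toy frame standing in for `in_df`:

```python
import pandas as pd

processing_cols = ["gender", "age", "citizenship"]  # illustrative excerpt
toy_df = pd.DataFrame({"gender": ["f"], "age": [20]})  # stands in for in_df

# Columns you expect that the export does not contain
missing_expected = [c for c in processing_cols if c not in toy_df.columns]
print("Expected but absent:", missing_expected)
```

If the list is non-empty, check the Kobo export settings or the form version before continuing.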

Section 04

Data Preparation & Cleaning

3.1 Rename columns

KoboToolbox exports use group-prefixed technical names. The notebook maps these to readable labels using a dictionary.

python / Column rename dictionary
# key = KoboToolbox column name (already lowercased)
# value = analysis-friendly name
rename_dict = {
    'group_demographics/q_gender':     'gender',
    'group_demographics/q_age':        'age',
    'group_demographics/q_citizenship': 'citizenship',
    'group_travel/q_passport':         'passport_owned',
    'group_travel/q_visa':             'visa_type',
    # extend with all relevant columns from your form
}

in_df.rename(
    columns={k: v for k, v in rename_dict.items() if k in in_df.columns},
    inplace=True
)
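A quick way to spot columns the rename dictionary missed is to look for names that still contain a group-path slash. Sketch with a toy frame:

```python
import pandas as pd

# Toy columns: one already renamed, one still carrying its Kobo group path
df = pd.DataFrame(columns=["gender", "group_travel/q_route", "age"])

leftover = [c for c in df.columns if "/" in c]
print("Still group-prefixed:", leftover)
```

Extend `rename_dict` with any names this prints, then re-run the rename cell.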

3.2 Drop system columns

python / Drop metadata columns
cols_to_drop = [
    'survey_start_time', 'survey_end_time', 'record_id',
    'record_uuid', 'submission_time', 'validation_status',
    'system_notes', 'record_status', 'submitted_by',
    'form_version', 'record_tags',
]

vot_df = in_df.drop(
    columns=[c for c in cols_to_drop if c in in_df.columns]
)

3.3 Missing values report

python / Generate and save missing values report
total   = vot_df.isnull().sum().sort_values(ascending=False)
pct     = (total / len(vot_df) * 100).round(2)
missing = pd.concat([total, pct], axis=1, keys=['Missing Count', 'Missing %'])
missing = missing[missing['Missing Count'] > 0]

with pd.ExcelWriter(
    "_VoT/tables/VoT_MMR_MissingValues.xlsx", engine='openpyxl'
) as writer:
    missing.to_excel(writer, sheet_name='Missing Values')

print(missing)
💡

Open VoT_MMR_MissingValues.xlsx to identify columns with >30% missingness. These may need imputation or exclusion before analysis.

3.4 Remove duplicates

python / Detect and remove duplicate rows
n_dup = vot_df.duplicated().sum()
print(f"Duplicate rows: {n_dup}")

if n_dup > 0:
    vot_df = vot_df.drop_duplicates()
    print(f"After dedup: {vot_df.shape[0]} rows")

Section 05

Data Mapping

KoboToolbox stores binary responses as 0 / 1. The notebook converts these to Yes / No and maps country codes to readable nationality labels.

4.1 Binary → Yes / No

python / apply_individual_mappings_notnull
def apply_individual_mappings_notnull(df):
    """Map 0/1 binary columns to Yes/No — only on non-null rows."""
    yes_no_cols = [
        'passport_owned', 'visa_obtained', 'consent_given',
        'debt_bondage', 'document_confiscated', 'physical_abuse',
        # add all binary columns from your Form v6
    ]
    binary_map = {0: 'No', 1: 'Yes'}  # 0.0/1.0 hash equal to 0/1, so floats map too

    for col in yes_no_cols:
        if col in df.columns:
            mask = df[col].notna()
            df.loc[mask, col] = df.loc[mask, col].map(binary_map)
    return df

df = apply_individual_mappings_notnull(vot_df)
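Before moving on, it is worth confirming no coded values slipped past the mapping. A sketch on a toy column where one `1` escaped:

```python
import numpy as np
import pandas as pd

check_df = pd.DataFrame({"passport_owned": ["Yes", "No", 1, np.nan]})

# Non-null values that are neither Yes nor No are unmapped leftovers
bad = check_df.loc[
    check_df["passport_owned"].notna()
    & ~check_df["passport_owned"].isin(["Yes", "No"]),
    "passport_owned",
]
print("Unmapped values:", bad.tolist())
```

Any value printed here means the column was missing from `yes_no_cols` or holds an unexpected code.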

4.2 Country code → Nationality

python / apply_nationality_mappings
def apply_nationality_mappings(df_mapped):
    nationality_map = {
        'mmr': 'Myanmar', 'myanmar': 'Myanmar',
        'chn': 'China',   'tha': 'Thailand',
        'khm': 'Cambodia', 'lao': 'Laos',
        'vnm': 'Vietnam',
        # extend for all nationalities in your dataset
    }
    if 'citizenship' in df_mapped.columns:
        df_mapped['citizenship'] = (
            df_mapped['citizenship'].str.lower().str.strip()
            .map(nationality_map)
            .fillna(df_mapped['citizenship'])
        )
    return df_mapped

df_mapped_2 = apply_nationality_mappings(df)
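A similar spot-check flags citizenship values that are not yet readable labels. Sketch using an excerpt of the map (values and codes are illustrative):

```python
import pandas as pd

nationality_map = {"mmr": "Myanmar", "chn": "China"}  # excerpt of the full map

# After mapping + fillna, anything not among the map's labels is still a raw code
s = pd.Series(["Myanmar", "China", "xyz"])
unmapped = sorted(set(s.dropna()) - set(nationality_map.values()))
print("Values still unmapped:", unmapped)
```

Extend `nationality_map` with any codes this prints before running the nationality modules.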

4.3 Filter — African nationals (optional)

python / African nationals subset
# UN official list of 54 African countries
african_countries = [
    'Algeria', 'Angola', 'Benin', 'Botswana', 'Burkina Faso',
    'Burundi', 'Cameroon', 'Central African Republic', 'Chad',
    # ... all 54 (add the full list)
]

df_african = df_mapped_2[df_mapped_2['citizenship'].isin(african_countries)]
print(f"African nationals: {len(df_african)}")

Section 06

Analysis Modules

The notebook is organized into 12 self-contained analysis modules. Each produces one or more Excel tables saved to _VoT/tables/. Run them in sequence after completing cleaning.

Module 1
Gender & Nationality Analysis

What: Cross-tabulates citizenship by gender; computes nationality percentages; identifies top-10 nationalities.

Output: gender_nationality_table — crosstab + summary stats (total VoTs, % female, top nationality).

Module 2
Travel Routes Analysis

What: Parses multi-step routes from free-text or select-multiple fields. Extracts origin → transit → destination patterns using regex and Counter.

Output: Route frequency table; most common transit countries; corridor bar chart.

Module 3
Education Level Analysis

What: Analyses education attainment distribution across victims. Generates a BarChart with openpyxl for embedded Excel visualization.

Output: Education breakdown table with embedded chart saved to _VoT/tables/.

Module 4
Victim Profile — Age & Demographics

What: Creates age-group categories (18–24, 25–34, 35–44, 45+); cross-tabs by gender; computes median age and % by group. Handles missing or invalid age values.

Output: Age–gender pivot table; demographic summary statistics.
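The age-group step this module describes is usually a `pd.cut` call. A sketch with invented ages and the bins from the description (under-18 and missing ages both fall out as NaN):

```python
import numpy as np
import pandas as pd

ages = pd.Series([17, 22, 30, 41, 60, np.nan])

# right=False gives left-closed bins: [18,25), [25,35), [35,45), [45,inf)
bins   = [18, 25, 35, 45, np.inf]
labels = ["18–24", "25–34", "35–44", "45+"]
groups = pd.cut(ages, bins=bins, labels=labels, right=False)

print(groups.value_counts(dropna=False))
```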

Module 5
Victim Recruitment Analysis

What: Analyses contact methods (social media, personal referral, job board) and promised employment types (factory, domestic, hospitality, scam compound).

Output: Recruitment contact frequency table; expected work type breakdown.

Module 6
Social Media Platform Analysis

What: Identifies platforms used for recruitment (Facebook, TikTok, Telegram, WeChat). Handles multi-select fields by splitting on separator characters.

Output: Platform frequency bar chart; normalized percentage table.
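The multi-select splitting this module relies on can be sketched with `str.split` plus `explode` (the separator and platform values are illustrative; Kobo select-multiple answers often arrive space-separated):

```python
import pandas as pd

# Toy select-multiple answers
s = pd.Series(["facebook telegram", "tiktok", "facebook"])

# One row per platform mention, then count
platforms = s.str.split().explode()
print(platforms.value_counts())
# facebook appears twice; telegram and tiktok once each
```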

Module 7
Means of Control Analysis

What: Counts control mechanisms in scam compounds: document confiscation, debt bondage, physical abuse, threats, restricted movement, phone confiscation.

Output: Frequency table per control type; % of victims experiencing each.

Module 8
Exploitation Types & Scam Analysis

What: Classifies exploitation (forced labour, sexual exploitation, forced criminality) and scam sub-types (romance, crypto, investment fraud). Analyses targets, languages, and platforms.

Output: Exploitation breakdown; scam typology matrix with infrastructure details.

Module 9
Fraudulent Job Offers Analysis

What: Extracts and categorises fraudulent job types used to recruit victims. Uses regex to parse free-text job descriptions into standard categories.

Output: Job offer category frequency table saved to Excel.

Module 10
Airlines, Visa & Passport Analysis

What: Three sub-modules — (a) airline frequency during journey; (b) visa type obtained before departure; (c) passport ownership rates by nationality.

Output: Three separate Excel summary tables with counts and percentages.

Module 11
Ransom Amounts Analysis

What: Analyses ransom amounts demanded for victims' release. Handles currency normalization, outlier detection, and range bucketing via regex.

Output: Distribution table; min/max/median/mean; range breakdown (<$500, $500–$2000, >$2000).
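The parsing and bucketing this module describes might look like the following sketch (the regex, toy values, and range edges are illustrative, matching the breakdown above):

```python
import re
import pandas as pd

def parse_amount(text):
    """Pull the first number out of a free-text ransom field; None if absent."""
    if not isinstance(text, str):
        return None
    m = re.search(r"[\d,]+(?:\.\d+)?", text.replace(" ", ""))
    return float(m.group().replace(",", "")) if m else None

raw     = pd.Series(["USD 1,500", "$300", "no ransom", "2000 usd"])
amounts = pd.to_numeric(raw.map(parse_amount))

buckets = pd.cut(amounts, [0, 500, 2000, float("inf")],
                 labels=["<$500", "$500–$2000", ">$2000"])
print(pd.concat([raw, amounts, buckets], axis=1))
```

Real data needs currency normalisation on top of this; the sketch only shows the extraction and bucketing shape.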

Module 12
M&E Summary Table

What: Aggregates key M&E indicators: total VoTs screened, breakdown by gender, nationality, exploitation type, referral pathway, and protection status.

Output: Master M&E workbook — one sheet per indicator domain.


Section 07

Output & Export

6.1 Save the cleaned dataset

python / Export to formatted Excel
outfile = f"_VoT/output/VoT_MMR_Cleaned_{datetime.today().strftime('%Y%m%d')}.xlsx"

with pd.ExcelWriter(outfile, engine='openpyxl') as writer:
    df_mapped_2.to_excel(writer, sheet_name='Cleaned_Data', index=False)

# Apply header formatting
wb = load_workbook(outfile)
ws = wb['Cleaned_Data']
header_fill = PatternFill(start_color="1F4E79", fill_type="solid")

for cell in ws[1]:
    cell.font      = Font(bold=True, color="FFFFFF")
    cell.fill      = header_fill
    cell.alignment = Alignment(horizontal="center")
    ws.column_dimensions[cell.column_letter].width = 18

wb.save(outfile)
print(f"Saved → {outfile}")

6.2 Copy files for downstream analysis

python / copy_files helper
import shutil

def copy_files(source_folder, dest_folder, extension=".xlsx"):
    os.makedirs(dest_folder, exist_ok=True)
    copied = 0
    for f in Path(source_folder).glob(f"*{extension}"):
        shutil.copy2(f, dest_folder)
        copied += 1
    print(f"Copied {copied} file(s) → {dest_folder}")

copy_files(source_folder="_VoT/output/", dest_folder="Working/")

Expected output files

Cleaned dataset (dated) _VoT/output/VoT_MMR_Cleaned_YYYYMMDD.xlsx
Missing values report _VoT/tables/VoT_MMR_MissingValues.xlsx
Gender & Nationality table _VoT/tables/VoT_Gender_Nationality.xlsx
M&E summary workbook _VoT/tables/VoT_ME_Summary.xlsx
Exploitation & Scam analysis _VoT/tables/VoT_Exploitation_Scam.xlsx
African nationals subset _VoT/output/VoT_African_Nationals.xlsx

Section 08

Run Checklist

Use this before every notebook run.


Before running

Raw KoboToolbox export saved in _VoT/input/ Required
Sheet name in Excel matches the sheet_name parameter in cell 2.2 Required
Output folders _VoT/output/ and _VoT/tables/ exist Required
All packages installed: pandas, openpyxl, seaborn, gdown Required
Kernel restarted before running (clears stale variable state) Important

Data loading

in_df.shape printed — row and column count look correct Required
All column names are lowercase after loading Important
No missing expected columns in processing_cols Important

Data cleaning

Missing values report saved to _VoT/tables/ Important
Duplicate check complete — 0 duplicates (or removed) Required
Rename dictionary applied — key columns renamed correctly Important

Mapping & transformation

Binary columns show "Yes"/"No" (not 0/1) after mapping Required
Citizenship column shows nationality labels (not raw codes) Important
df_mapped_2 has same row count as in_df after mapping Verify

Output & handover

Cleaned dataset exported to _VoT/output/ with today's date in filename Required
All analysis tables saved to _VoT/tables/ Important
Files copied to Working/ for downstream analysis Verify
M&E summary reviewed and validated before sharing Required

Section 09

Troubleshooting

Common errors when the environment, file paths, or column names do not match what the notebook expects. Resolve these before re-running the analysis blocks.

File or sheet not found

Confirm the Excel path, that the file exists in _VoT/input/, and that sheet_name matches the export (e.g. data_pxd_25 vs Sheet1).

python / Check file before load
import os
print("Exists:", os.path.exists(infile))

Column or mapping errors

KoboToolbox labels change between form versions. If rename or mapping steps fail, list columns that are still not standardised and extend rename_dict or processing_cols.

python / Columns present in the frame
print(sorted(in_df.columns.tolist()))

Memory on very large workbooks

For huge row counts, load only the columns you need with usecols and filter rows early. Note that pandas' read_excel has no chunked reader; if you need true chunked processing, convert the export to CSV and use read_csv with chunksize. Close other applications while running heavy cells.

Data protection

Store raw and processed files in access-controlled locations. Follow UN and programme rules for VoT data; this guide is for workflow only and does not replace organisational policies.


Section 10

Next steps & support

After cleaning, typical next uses are deeper statistics, dashboards, and reporting. Keep methodology documented and version both raw and cleaned data.

Validation

Cross-check the missing-values workbook and M&E summary against expectations before sign-off.

Quality

Lineage

Record notebook version, form version, and any manual edits to dictionaries or paths.

Documentation

Downstream tools

Use cleaned files in Power BI, R, or Python for further analysis—read from _VoT/output/ or Working/.

Handover

Getting help

Review cell outputs and the checklist above; verify paths, sheet names, and that packages from Section 02 are installed. For script-specific questions, use inline notebook comments and your team’s data SOPs.


UN MMR · Victims of Trafficking Data Programme

For internal training use only — handle data per UN data protection standards.