VOT Data Analysis User Guide

A comprehensive guide for analyzing Victim of Trafficking (VOT) data using Python

Back

πŸ“‹ Table of Contents

1. πŸ“Š Overview

This guide will help you understand and use the VOT (Victim of Trafficking) data analysis script. The script is designed to:

This script processes sensitive data about trafficking victims. Always ensure proper data protection and privacy protocols are followed.

2. πŸ”§ Prerequisites

Required Software

Required Python Libraries

pip install pandas numpy seaborn matplotlib plotly openpyxl xlsxwriter scikit-learn

Data Requirements

3. βš™οΈ Initial Setup

Create Folder Structure

The script expects a specific folder structure:

_VoT/ β”œβ”€β”€ data/ # Input data files β”œβ”€β”€ config/ # Configuration files β”œβ”€β”€ output/ # Processed data files └── html/ # Interactive visualizations

Configure File Paths

Set up your working directories:

# Define working folders data_dir = '_VoT/data' # Input folder for raw data cfg_dir = '_VoT/config' # Configuration files out_dir = '_VoT/output' # Output folder for processed data html_dir = '_VoT/html' # Interactive visualizations # Set country and month for your analysis which_country = 'MMR' # Country code which_month = 'June' # Month of data collection

Import Required Libraries

import pandas as pd import numpy as np import seaborn as sbn import matplotlib.pyplot as plt import os from pathlib import Path

4. πŸ“₯ Data Loading

Loading Your Dataset

The script loads data from an Excel file and performs initial preprocessing:

# Load the Excel file in_df = pd.read_excel(infile) # Convert column names to lowercase for consistency in_df.columns = [col.lower() for col in in_df.columns] # Display basic information about the dataset print('Number of data points = %d' % in_df.shape[0])
Make sure your Excel file is named correctly and placed in the data folder. The file should be named: Raw_VoT_[Month]_screening_Analysis.xlsx

What This Step Does

5. 🧹 Data Cleaning

Column Renaming

The script uses a comprehensive dictionary to rename columns to more manageable names:

# Example of column renaming dictionary dictionary = { 'survery start time': 'survey_start_time', 'full name of the interviewer/organisation': 'interviewer_organization', 'type of referring entity': 'referring_entity_type', # ... many more mappings } # Apply the renaming in_df.rename(columns=dictionary, inplace=True)

Removing Unnecessary Columns

System metadata and administrative columns are removed:

# Drop system columns that aren't needed for analysis columns_to_drop = [ 'survey_start_time', 'survey_end_time', 'record_id', 'record_uuid', 'submission_time', 'validation_status', 'system_notes', 'record_status', 'submitted_by', 'form_version', 'record_tags', 'record_index' ] vot_df = in_df.drop(columns=columns_to_drop, errors='ignore')
The errors='ignore' parameter ensures the script continues even if some columns don't exist in your dataset.

6. πŸ” Missing Values Analysis

Why Analyze Missing Values?

Missing values in VOT data can indicate:

Creating Missing Values Report

def missing_zero_values_table(df): # Calculate missing values and percentages zero_val = (df == 0.00).astype(int).sum(axis=0) mis_val = df.isnull().sum() mis_val_percent = 100 * df.isnull().sum() / len(df) # Create comprehensive table mz_table = pd.concat([zero_val, mis_val, mis_val_percent], axis=1) mz_table = mz_table.rename(columns={ 0: 'Zero Values', 1: 'Missing Values', 2: '% of Total Values' }) return mz_table # Generate the missing values report missing_table = missing_zero_values_table(results)

Understanding the Output

Column Description
Zero Values Count of fields with value 0
Missing Values Count of completely empty fields
% of Total Values Percentage of missing data
Data Type Type of data (text, number, etc.)

7. πŸ—ΊοΈ Data Mapping

Converting Codes to Labels

VOT survey data often uses numeric codes (0/1) that need to be converted to meaningful labels (No/Yes):

def apply_individual_mappings_notnull(df): # List of columns that need Yes/No mapping yes_no_columns = [ 'referring_entity_ngo', 'referring_entity_embassy', 'physical_abuse', 'psychological_abuse', # ... many more fields ] # Mapping dictionary yes_no_mapping = { '1': 'Yes', '0': 'No', '1.0': 'Yes', '0.0': 'No', 'yes': 'Yes', 'no': 'No' } # Apply mappings while preserving null values for column in yes_no_columns: if column in df.columns: mask = df[column].notna() df.loc[mask, column] = df.loc[mask, column].map(yes_no_mapping) return df # Apply the mappings df_mapped = apply_individual_mappings_notnull(in_df)

What Gets Mapped

The mapping process preserves missing/null values, ensuring data integrity while making the dataset more readable.

8. πŸ’Ύ Exporting Results

Multiple Output Files

The script generates several output files for different purposes:

# Export missing values analysis with pd.ExcelWriter("missing_values.xlsx", engine='openpyxl') as writer: missing_table.to_excel(writer, sheet_name='missing_values', index=True) missing_data.to_excel(writer, sheet_name='mv_percentage', index=True) # Export clean dataset with all data with pd.ExcelWriter(outfile, engine='openpyxl') as dataset: df_mapped.to_excel(dataset, sheet_name='vot_cleaned_data', header=True, index=True, index_label='index') # Export dataset with mapped values (nulls preserved) with pd.ExcelWriter(outfileNotNull, engine='openpyxl') as dataset1: df.to_excel(dataset1, sheet_name='vot_cleaned_data', header=True, index=True, index_label='index')

Output Files Explained

File Purpose Content
missing_values.xlsx Data quality assessment Missing values analysis and percentages
Cleaned_VoT_[Month]_all.xlsx Complete clean dataset All data with readable labels
Cleaned_VoT_[Month]_NotNull_Mapped.xlsx Mapped dataset Data with Yes/No labels, nulls preserved

9. πŸ”§ Troubleshooting

Common Issues and Solutions

File Not Found Error:
β€’ Check that your Excel file is in the correct folder
β€’ Verify the filename matches the expected pattern
β€’ Ensure you have read permissions for the file
Excel Export Errors:
β€’ Install openpyxl: pip install openpyxl
β€’ Close any open Excel files with the same name
β€’ Check write permissions in the output folder
Column Not Found Errors:
β€’ Your dataset may have different column names
β€’ Check the dictionary mapping matches your data
β€’ Some columns may not exist in all datasets

Debugging Tips

# Check your dataset structure print("Dataset shape:", in_df.shape) print("Column names:", in_df.columns.tolist()) # Check for missing columns expected_columns = ['consent_data_sharing', 'victim_of_trafficking'] missing_cols = [col for col in expected_columns if col not in in_df.columns] print("Missing columns:", missing_cols) # Check data types print("Data types:") print(in_df.dtypes)

10. πŸ“š Additional Resources

Best Practices

Remember: This script is a tool to help process and understand VOT data. Always validate results with subject matter experts and follow your organization's data handling protocols.

πŸ“ž Support

If you encounter issues not covered in this guide:

This guide covers the basic usage of the VOT data analysis script. As you become more comfortable with the tool, you can explore advanced features and customizations.