# Define working folders
data_dir = '_VoT/data'      # Input folder for raw data
cfg_dir = '_VoT/config'     # Configuration files
out_dir = '_VoT/output'     # Output folder for processed data
html_dir = '_VoT/html'      # Interactive visualizations

# Set country and month for your analysis
which_country = 'MMR'       # Country code
which_month = 'June'        # Month of data collection
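If any of the working folders is missing, later read or write steps will fail. The following is a minimal sketch of creating them up front, not something the script is known to do. The `base` directory is a throwaway location used only so the example runs anywhere; in practice the folders would sit relative to wherever you run the script.

```python
import tempfile
from pathlib import Path

# Assumption: 'base' is a temporary directory used only for this demonstration.
base = Path(tempfile.mkdtemp())

# Folder names as defined in the configuration above
folders = {
    'data_dir': '_VoT/data',
    'cfg_dir': '_VoT/config',
    'out_dir': '_VoT/output',
    'html_dir': '_VoT/html',
}

for name, relative_path in folders.items():
    # parents=True creates '_VoT' if needed; exist_ok=True makes reruns harmless
    (base / relative_path).mkdir(parents=True, exist_ok=True)

created = sorted(p.name for p in (base / '_VoT').iterdir())
print(created)
```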
Import Required Libraries
import pandas as pd
import numpy as np
import seaborn as sbn
import matplotlib.pyplot as plt
import os
from pathlib import Path
4. Data Loading
Loading Your Dataset
The script loads data from an Excel file and performs initial preprocessing:
# Load the Excel file
in_df = pd.read_excel(infile)

# Convert column names to lowercase for consistency
in_df.columns = [col.lower() for col in in_df.columns]

# Display basic information about the dataset
print('Number of data points = %d' % in_df.shape[0])
Make sure your Excel file is named correctly and placed in the data folder. The file should be named: Raw_VoT_[Month]_screening_Analysis.xlsx
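The guide does not show how `infile` is built. One plausible way, assuming `[Month]` in the filename pattern is filled in with the value of `which_month`, is sketched below. The helper `expected_input_path` is hypothetical and not part of the script.

```python
from pathlib import Path

data_dir = '_VoT/data'   # input folder, as defined in the configuration
which_month = 'June'     # month of data collection

def expected_input_path(folder: str, month: str) -> Path:
    """Hypothetical helper: build the path implied by the naming pattern."""
    return Path(folder) / f'Raw_VoT_{month}_screening_Analysis.xlsx'

infile = expected_input_path(data_dir, which_month)

# Checking before loading gives a clearer message than a pandas traceback
if not infile.is_file():
    print(f'Input file not found: {infile}')
```

Checking for the file before calling `pd.read_excel` makes a misnamed or misplaced file easy to spot.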
What This Step Does
Reads Excel data into a pandas DataFrame
Standardizes column names by converting to lowercase
Reports dataset size for verification
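The same steps can be tried on a small in-memory table before running them on real data. The column names below are invented for illustration and do not come from the actual survey form.

```python
import pandas as pd

# Toy stand-in for the Excel data; column names are made up for this example
toy_df = pd.DataFrame({
    'Survey Start Time': ['09:00', '10:30'],
    'Record ID': [1, 2],
})

# Same lowercase conversion as in the loading step
toy_df.columns = [col.lower() for col in toy_df.columns]

print(list(toy_df.columns))
print('Number of data points = %d' % toy_df.shape[0])
```

If a spreadsheet header is not text (for example a bare number), `col.lower()` raises an `AttributeError`; writing `str(col).lower()` is a defensive alternative.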
5. Data Cleaning
Column Renaming
The script uses a comprehensive dictionary to rename columns to more manageable names:
# Example of column renaming dictionary
dictionary = {
    'survery start time': 'survey_start_time',
    'full name of the interviewer/organisation': 'interviewer_organization',
    'type of referring entity': 'referring_entity_type',
    # ... many more mappings
}

# Apply the renaming
in_df.rename(columns=dictionary, inplace=True)
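By default, `rename` silently skips dictionary keys that do not match any column, so a mismatch between the mapping and your data can go unnoticed. The sketch below, run on a toy table with an invented `age` column, lists both the mapping keys that matched nothing and the columns that have no mapping. It is an optional check, not part of the script.

```python
import pandas as pd

# Toy table: two columns from the example mapping plus one unmapped column
toy_df = pd.DataFrame(columns=['survery start time', 'type of referring entity', 'age'])

dictionary = {
    'survery start time': 'survey_start_time',
    'full name of the interviewer/organisation': 'interviewer_organization',
    'type of referring entity': 'referring_entity_type',
}

# Mapping keys with no matching column in the data
unmatched_keys = [key for key in dictionary if key not in toy_df.columns]
# Data columns that the mapping does not cover
unmapped_columns = [col for col in toy_df.columns if col not in dictionary]

toy_df = toy_df.rename(columns=dictionary)

print('Keys not found in data:', unmatched_keys)
print('Columns left unrenamed:', unmapped_columns)
```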
Removing Unnecessary Columns
System metadata and administrative columns are removed:
# Drop system columns that aren't needed for analysis
columns_to_drop = [
    'survey_start_time', 'survey_end_time', 'record_id',
    'record_uuid', 'submission_time', 'validation_status',
    'system_notes', 'record_status', 'submitted_by',
    'form_version', 'record_tags', 'record_index'
]
vot_df = in_df.drop(columns=columns_to_drop, errors='ignore')
The errors='ignore' parameter ensures the script continues even if some columns don't exist in your dataset.
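The difference is easy to see on a toy table where one of the listed columns is deliberately absent. The data here is invented for illustration.

```python
import pandas as pd

toy_df = pd.DataFrame({'record_id': [1, 2], 'age': [24, 31]})
columns_to_drop = ['record_id', 'record_uuid']  # 'record_uuid' is absent on purpose

# With errors='ignore', missing columns are skipped
kept = toy_df.drop(columns=columns_to_drop, errors='ignore')
print(list(kept.columns))

# Without it, pandas raises a KeyError for the missing column
try:
    toy_df.drop(columns=columns_to_drop)
    raised = False
except KeyError:
    raised = True
print('KeyError raised without errors="ignore":', raised)
```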
6. Missing Values Analysis
Why Analyze Missing Values?
Missing values in VOT data can indicate:
Sensitive information victims are reluctant to share
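A common way to quantify missing values with pandas is to count them per column and express them as a percentage. The sketch below uses invented columns and is not necessarily how the script performs this analysis.

```python
import numpy as np
import pandas as pd

# Toy data with gaps; column names and values are made up for this example
toy_df = pd.DataFrame({
    'age': [24, np.nan, 31, np.nan],
    'gender': ['F', 'M', None, 'F'],
})

missing_count = toy_df.isna().sum()                  # missing cells per column
missing_pct = (toy_df.isna().mean() * 100).round(1)  # share of rows missing

# One row per column, most incomplete first
summary = pd.DataFrame({'missing': missing_count, 'percent': missing_pct})
summary = summary.sort_values('percent', ascending=False)
print(summary)
```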
File Not Found Error:
• Check that your Excel file is in the correct folder
• Verify the filename matches the expected pattern
• Ensure you have read permissions for the file
Excel Export Errors:
• Install openpyxl: pip install openpyxl
• Close any open Excel files with the same name
• Check write permissions in the output folder
Column Not Found Errors:
• Your dataset may have different column names
• Check the dictionary mapping matches your data
• Some columns may not exist in all datasets
Data Security: Always work with anonymized data when possible
Version Control: Keep backup copies of your original data
Documentation: Document any changes or assumptions you make
Validation: Cross-check results with domain experts
Regular Updates: Keep your Python libraries updated
Remember: This script is a tool to help process and understand VOT data. Always validate results with subject matter experts and follow your organization's data handling protocols.
Support
If you encounter issues not covered in this guide:
Review the error messages carefully - they often contain helpful information
Check that all required libraries are installed
Verify your data format matches the expected structure
Consider reaching out to your technical support team
This guide covers the basic usage of the VOT data analysis script. As you become more comfortable with the tool, you can explore advanced features and customizations.