Economic Growth and Gender Equality: An Analysis Using IMF Data

Author

Jenny Xu

This data analysis project explores the relationship between economic growth and gender equality using imfp, a package for downloading data from the IMF (International Monetary Fund). imfp can be integrated with other Python tools to streamline the computational process. To demonstrate its functionality, the project experimented with a variety of visualization and analysis methods.

Executive Summary

In this project, we explored the following:

Data Fetching

Make API call to fetch 4 datasets: GII (Gender Inequality Index), Nominal GDP, GDP Deflator Index, Population series

Feature Engineering

Cleaning: Convert GDP Deflator Index to a yearly basis and variables to numeric
Dependent Variable: Percent Change of Gender Inequality Index
Independent Variable: Percent Change of Real GDP per Capita
Transform variables to display magnitude of change
Merge the datasets

Data Visualization

Scatterplot
Time Series Line Plots
Barplot
Boxplot
Heatmap

Statistical Analysis

Descriptive Statistics
Regression Analysis
Time Series Analysis

Utility Functions

The integration of other Python tools not only streamlined our computational processes but also ensured consistency across the project.

A custom module simplifies making API calls and fetching data with the imfp library. load_or_fetch_databases, load_or_fetch_parameters, and load_or_fetch_dataset load and retrieve databases, parameters, and datasets from a local cache or, failing that, from the IMF API. view_dataframe_in_browser displays a data frame in a web browser.

import os
import pickle
from tempfile import NamedTemporaryFile
import pandas as pd
import imfp
import webbrowser


# Function to display a DataFrame in a web browser
def view_dataframe_in_browser(df):
    html = df.to_html()
    with NamedTemporaryFile(delete=False, mode="w", suffix=".html") as f:
        url = "file://" + f.name
        f.write(html)
    webbrowser.open(url)


# Function to load databases from CSV or fetch from API
def load_or_fetch_databases():
    csv_path = os.path.join("data", "databases.csv")

    # Try to load from CSV
    if os.path.exists(csv_path):
        try:
            return pd.read_csv(csv_path)
        except Exception as e:
            print(f"Error loading CSV: {e}")

    # If CSV doesn't exist or couldn't be loaded, fetch from API
    print("Fetching databases from IMF API...")
    databases = imfp.imf_databases()

    # Save to CSV for future use
    os.makedirs("data", exist_ok=True)  # Ensure the data directory exists
    databases.to_csv(csv_path, index=False)
    print(f"Databases saved to {csv_path}")

    return databases


def load_or_fetch_parameters(database_name):
    pickle_path = os.path.join("data", f"{database_name}.pickle")

    # Try to load from pickle file
    if os.path.exists(pickle_path):
        try:
            with open(pickle_path, "rb") as f:
                return pickle.load(f)
        except Exception as e:
            print(f"Error loading pickle file: {e}")

    # If pickle doesn't exist or couldn't be loaded, fetch from API
    print(f"Fetching parameters for {database_name} from IMF API...")
    parameters = imfp.imf_parameters(database_name)

    # Save to pickle file for future use
    os.makedirs("data", exist_ok=True)  # Ensure the data directory exists
    with open(pickle_path, "wb") as f:
        pickle.dump(parameters, f)
    print(f"Parameters saved to {pickle_path}")

    return parameters


def load_or_fetch_dataset(database_id, indicator):
    file_name = f"{database_id}.{indicator}.csv"
    csv_path = os.path.join("data", file_name)

    # Try to load from CSV file
    if os.path.exists(csv_path):
        try:
            return pd.read_csv(csv_path)
        except Exception as e:
            print(f"Error loading CSV file: {e}")

    # If CSV doesn't exist or couldn't be loaded, fetch from API
    print(f"Fetching dataset for {database_id}.{indicator} from IMF API...")
    dataset = imfp.imf_dataset(database_id=database_id, indicator=[indicator])

    # Save to CSV file for future use
    os.makedirs("data", exist_ok=True)  # Ensure the data directory exists
    dataset.to_csv(csv_path, index=False)
    print(f"Dataset saved to {csv_path}")

    return dataset
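
The three loaders above share one cache-or-fetch pattern. The following is a minimal sketch of that pattern in isolation, not part of the project's module: `load_or_fetch_csv` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a slow API call so the example runs offline.

```python
import os
import tempfile

import pandas as pd


def load_or_fetch_csv(csv_path, fetch_fn):
    """Return the cached DataFrame at csv_path, or call fetch_fn and cache it."""
    if os.path.exists(csv_path):
        return pd.read_csv(csv_path)
    df = fetch_fn()
    # Create the cache directory if needed before writing
    os.makedirs(os.path.dirname(csv_path) or ".", exist_ok=True)
    df.to_csv(csv_path, index=False)
    return df


calls = {"n": 0}


def fake_fetch():
    # Hypothetical stand-in for a remote request; counts how often it runs
    calls["n"] += 1
    return pd.DataFrame({
        "database_id": ["IFS"],
        "description": ["International Financial Statistics (IFS)"],
    })


cache_dir = tempfile.mkdtemp()
path = os.path.join(cache_dir, "data", "demo.csv")

first = load_or_fetch_csv(path, fake_fetch)   # fetches and writes the cache
second = load_or_fetch_csv(path, fake_fetch)  # reads the cached CSV
print(calls["n"])
```

The second call never reaches the fetch function, which is the behavior the loaders rely on to avoid repeated API requests.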

Dependencies

Here is a brief introduction about the packages used:

pandas: view and manipulate data frame

matplotlib.pyplot: make plots

seaborn: make plots

numpy: computation

LinearRegression: implement linear regression

tabulate: format data into tables

statsmodels.api, adfuller, ARIMA, VAR, plot_acf, plot_pacf, mean_absolute_error, mean_squared_error, and grangercausalitytests are used specifically for time series analysis.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.linear_model import LinearRegression
from tabulate import tabulate
import statsmodels.api as sm
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.vector_ar.var_model import VAR
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from statsmodels.tsa.stattools import grangercausalitytests

Data Fetching

In this section, we extracted four datasets through API calls: Gender Inequality Index (GII), GDP Deflator, Nominal GDP, and Population.

from pathlib import Path
Path("data").mkdir(exist_ok=True)

# Load or fetch databases
databases = load_or_fetch_databases()

# Filter out databases that contain a year in the description
databases[
  ~databases['description'].str.contains(r"[\d]{4}", regex=True)
]

# view_dataframe_in_browser(databases)

	database_id	description
25	HPDD	Historical Public Debt (HPDD)
34	RAFIT2AGG	Revenue Administration Fiscal Information Tool...
47	GENDER_EQUALITY	Gender Equality
63	PGCS	Private and Public Capital Stock Dataset
69	GENDER_BUDGETING	Gender Budgeting
125	CPI	Consumer Price Index (CPI)
153	IRFCL	International Reserves and Foreign Currency Li...
191	IFS_DISCONTINUED	International Financial Statistics (IFS), Disc...
192	EQ	Export Quality
193	ED	Export Diversification
226	FISCALDECENTRALIZATION	Fiscal Decentralization
291	FDI	Financial Development Index
292	PSBSFAD	Public Sector Balance Sheet (PSBS)(FAD)
293	UNSDG_IMF_INPUTS	Sustainable Development Goals, IMF Inputs
294	CPIS	Coordinated Portfolio Investment Survey (CPIS)
295	PCTOT	Commodity Terms of Trade
296	FM	Fiscal Monitor (FM)
297	AFRREO	Sub-Saharan Africa Regional Economic Outlook (...
298	WHDREO	Western Hemisphere Regional Economic Outlook (...
299	MCDREO	Middle East and Central Asia Regional Economic...
300	APDREO	Asia and Pacific Regional Economic Outlook (AP...
301	BOPAGG	Balance of Payments (BOP), World and Regional ...
302	PCPS	Primary Commodity Price System (PCPS)
303	CDIS	Coordinated Direct Investment Survey (CDIS)
304	BOP	Balance of Payments (BOP)
305	COFER	Currency Composition of Official Foreign Excha...
306	DOT	Direction of Trade Statistics (DOTS)
307	FAS	Financial Access Survey (FAS)
308	BOPSDMXUSD	Balance of Payments (BOP), Global SDMX (US Dol...
309	NAMAIN_IDC_N	System of National Accounts (SNA), NA_MAIN
310	GFSR	Government Finance Statistics (GFS), Revenue
311	GFSSSUC	Government Finance Statistics (GFS), Statement...
312	GFSCOFOG	Government Finance Statistics (GFS), Expenditu...
313	GFSFALCS	Government Finance Statistics (GFS), Financial...
314	GFSIBS	Government Finance Statistics (GFS), Integrate...
315	GFSMAB	Government Finance Statistics (GFS), Main Aggr...
316	GFSE	Government Finance Statistics (GFS), Expense
317	IFS	International Financial Statistics (IFS)
318	MFS	Monetary and Financial Statistics (MFS)
319	RAFIT3P	RA-FIT Round3 Completion and Participation Rates
320	FSI	Financial Soundness Indicators (FSIs)
321	FSIRE	Financial Soundness Indicators: Reporting enti...
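
The filter above keeps only descriptions without a four-digit number. As a small illustration of how the negated `str.contains` mask behaves, here is a sketch on a toy table; the `DEMO2019` row is made up for the example.

```python
import pandas as pd

toy = pd.DataFrame({
    "database_id": ["CPI", "DEMO2019", "IFS"],  # "DEMO2019" is a made-up ID
    "description": [
        "Consumer Price Index (CPI)",
        "Demo Outlook, April 2019",             # made-up description
        "International Financial Statistics (IFS)",
    ],
})

# True where the description contains four consecutive digits (a year)
has_year = toy["description"].str.contains(r"\d{4}", regex=True)

# "~" negates the mask, so only undated descriptions remain
undated = toy[~has_year]
print(undated["database_id"].tolist())
```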

Two databases were used: Gender Equality and International Financial Statistics (IFS).

databases[databases['database_id'].isin(['GENDER_EQUALITY','IFS'])]

	database_id	description
47	GENDER_EQUALITY	Gender Equality
317	IFS	International Financial Statistics (IFS)

Parameters are the dictionary keys used to make requests from the databases. “freq” stands for Frequency, such as Annual, Monthly, or Quarterly. “ref_area” stands for Geographical Area, such as US (United States), JP (Japan), and GB (United Kingdom). “indicator” refers to the code representing a specific dataset in the database. For example, if we display all the indicators for the IFS database, the GDP deflator dataset has an input code of “NGDP_D_SA_IX” with the full description Gross Domestic Product, Deflator, Seasonally Adjusted, Index.

datasets = ["GENDER_EQUALITY", "IFS"]
params = {}

# Fetch valid parameters for two datasets
for dataset in datasets:
    params[dataset] = load_or_fetch_parameters(dataset)

    valid_keys = list(params[dataset].keys())
    print(f"Parameters for {dataset}: ", valid_keys)

Parameters for GENDER_EQUALITY:  ['freq', 'ref_area', 'indicator']
Parameters for IFS:  ['freq', 'ref_area', 'indicator']
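
To find an indicator code, one can search the indicator table by description. The sketch below uses a hand-built stand-in for `params["IFS"]` so that it runs offline; the column names `input_code` and `description` are assumptions about the table layout, and the three rows are the indicators used in this project.

```python
import pandas as pd

# Hypothetical stand-in for params["IFS"]; the column names are assumed
toy_params = {
    "indicator": pd.DataFrame({
        "input_code": ["NGDP_D_SA_IX", "NGDP_XDC", "LP_PE_NUM"],
        "description": [
            "Gross Domestic Product, Deflator, Seasonally Adjusted, Index",
            "Gross Domestic Product, Nominal, Domestic Currency",
            "Population, Persons, Number of",
        ],
    })
}

indicators = toy_params["indicator"]

# Case-insensitive substring search over the descriptions
matches = indicators[
    indicators["description"].str.contains("deflator", case=False)]
print(matches["input_code"].tolist())
```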

We paired each database with a specific dataset indicator to read and store the corresponding CSV file.

datasets = {}
dsets = [("GENDER_EQUALITY", "GE_GII"), 
("IFS", "NGDP_D_SA_IX"), 
("IFS", "NGDP_XDC"), 
("IFS", "LP_PE_NUM")]

for dset in dsets:
    datasets[dset[0] + "." + dset[1]] = load_or_fetch_dataset(dset[0], dset[1])

# "Gender Inequality Index"
GII = "GENDER_EQUALITY.GE_GII"

# "Gross Domestic Product, Deflator, Seasonally Adjusted, Index"
GDP_deflator = "IFS.NGDP_D_SA_IX"

# "Gross Domestic Product, Nominal, Domestic Currency"
GDP_nominal = "IFS.NGDP_XDC"

# "Population, Persons, Number of"
GDP_population = "IFS.LP_PE_NUM"

# Assign the datasets to new variables so we don't change the originals
GII_data = datasets[GII]
GDP_deflator_data = datasets[GDP_deflator]
GDP_nominal_data = datasets[GDP_nominal]
GDP_population_data = datasets[GDP_population]

Feature Engineering

Data Cleaning

Since the GDP deflator was reported on a quarterly basis, we converted it to a yearly basis by keeping each year's fourth-quarter (Q4) value.

# Keep only rows with a partial string match for "Q4" in the time_period column;
# .copy() gives an independent frame that is safe to modify afterwards
GDP_deflator_data = GDP_deflator_data[
    GDP_deflator_data['time_period'].str.contains("Q4")].copy()

# Split the time_period into year and quarter and keep the year only
GDP_deflator_data.loc[:, 'time_period'] = GDP_deflator_data['time_period'].str[0:4]
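
As a small illustration of the quarterly-to-yearly step, the sketch below applies the same two operations to a toy frame. The values are made up, and the `YYYY-Qn` period format is an assumption about how quarterly periods are labeled.

```python
import pandas as pd

toy = pd.DataFrame({
    "ref_area": ["FI"] * 5,
    "time_period": ["1990-Q1", "1990-Q2", "1990-Q3", "1990-Q4", "1991-Q4"],
    "obs_value": [70.0, 71.0, 72.0, 73.0, 74.0],  # made-up deflator values
})

# Keep one observation per year: the fourth quarter
annual = toy[toy["time_period"].str.contains("Q4")].copy()

# The first four characters of the period label are the year
annual["time_period"] = annual["time_period"].str[0:4]
print(annual[["time_period", "obs_value"]])
```

Taking the Q4 value treats the year-end reading as the annual figure; averaging the four quarters would be an alternative design choice.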

We made all the variables numeric.

datasets = [GII_data, GDP_deflator_data, GDP_nominal_data, GDP_population_data]

for i, dataset in enumerate(datasets):    
    # Use .loc to modify the columns
    datasets[i].loc[:, 'obs_value'] = pd.to_numeric(datasets[i]['obs_value'], 
    errors='coerce')
    datasets[i].loc[:, 'time_period'] = pd.to_numeric(datasets[i]['time_period'], 
    errors='coerce')
    datasets[i].loc[:, 'unit_mult'] = pd.to_numeric(datasets[i]['unit_mult'], 
    errors='coerce')
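
For reference, `errors='coerce'` turns any value that cannot be parsed as a number into NaN instead of raising an error. A minimal sketch with made-up strings:

```python
import pandas as pd

raw = pd.Series(["0.83", "1990", "n/a"])  # made-up raw values

# Unparseable entries become NaN rather than raising an exception
clean = pd.to_numeric(raw, errors="coerce")
print(clean.isna().tolist())
```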

GII Percent Change: Dependent Variable

We kept percents as decimals to make them easy to work with in calculations. Different countries have different baseline levels of economic growth and gender equality, so we calculated percent changes to make them comparable.

The Gender Inequality Index (GII) is a composite measure of gender inequality across three dimensions: reproductive health, empowerment, and the labor market. GII ranges from 0 to 1: 0 indicates full gender equality, while 1 indicates maximal inequality, that is, the worst possible outcome for one gender in all three dimensions.

# Calculate percent change for each ref_area
# Sort by country and year and reset the index on the frame itself, so that
# each computed change is assigned to the row it was computed from
GII_data = GII_data.sort_values(
    ['ref_area', 'time_period']).reset_index(drop=True)
GII_data['pct_change'] = GII_data.groupby('ref_area')['obs_value'].pct_change()

# Display the first few rows of the updated dataset
GII_data.head()

	freq	ref_area	indicator	time_format	time_period	obs_value	pct_change
0	A	AF	GE_GII	P1Y	1990	0.828244	NaN
1	A	AF	GE_GII	P1Y	1991	0.817706	-0.015156
2	A	AF	GE_GII	P1Y	1992	0.809806	-0.016783
3	A	AF	GE_GII	P1Y	1993	0.803078	-0.012651
4	A	AF	GE_GII	P1Y	1994	0.797028	-0.013718
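
To make the per-country percent change concrete, here is a minimal sketch on made-up numbers. It sorts the frame by country and year before calling `pct_change`, and computes the change on the same frame it assigns to, so that each value stays aligned with its row.

```python
import pandas as pd

toy = pd.DataFrame({
    "ref_area": ["B", "A", "A", "B"],          # deliberately unsorted
    "time_period": [2001, 2001, 2000, 2000],
    "obs_value": [0.45, 0.72, 0.80, 0.50],     # made-up index values
})

# Sort within country by year, then compute year-over-year change per country
toy = toy.sort_values(["ref_area", "time_period"]).reset_index(drop=True)
toy["pct_change"] = toy.groupby("ref_area")["obs_value"].pct_change()
print(toy)
```

The first year of each country has no previous value, so its percent change is NaN; both countries fall by 10% in the second year.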

We subset the data frame to keep only the columns we want:

# Create a new dataframe with only the required columns
GII_data = GII_data[['ref_area', 'time_period', 'obs_value', 'pct_change']].copy()

GII_data = GII_data.rename(columns = {
    'ref_area': 'Country',
    'time_period': 'Time',
    'obs_value': 'GII',
    'pct_change': 'GII_change'
})

# Display the first few rows of the new dataset
GII_data.head()

	Country	Time	GII	GII_change
0	AF	1990	0.828244	NaN
1	AF	1991	0.817706	-0.015156
2	AF	1992	0.809806	-0.016783
3	AF	1993	0.803078	-0.012651
4	AF	1994	0.797028	-0.013718

GDP Percent Change: Independent Variable

Real GDP per capita is a measure of a country’s economic welfare or standard of living, and it is useful for comparing a country’s economic development with that of other economies. Because of a dataset access issue, we calculated Real GDP per capita from the GDP Deflator, Nominal GDP, and Population data using the following formulas:

\(\text{Real GDP} = \frac{\text{Nominal GDP}}{\text{GDP Deflator Index}}\times 100\)

\(\text{Real GDP per capita} = \frac{\text{Real GDP}}{\text{Population}}\)

The GDP deflator measures price inflation or deflation relative to a specific base year, in which the deflator equals 100. A value of 200 indicates inflation: prices are on average twice their base-year level. A value of 50 indicates deflation: prices are on average half their base-year level. For the GDP-related datasets, we kept only the columns we needed to make merging easier.

# GDP Deflator Dataset
# Create a new dataframe with only the required columns
GDP_deflator_data = GDP_deflator_data[
    ['ref_area', 'time_period', 'unit_mult', 'obs_value']].copy()

# Display the first few rows of the new dataset
GDP_deflator_data.head()

	ref_area	time_period	obs_value
3	FI	1990	73.200623
7	FI	1991	73.984068
11	FI	1992	74.654309
15	FI	1993	75.619254
19	FI	1994	77.937293

Nominal GDP is the total value of all goods and services produced in a given time period, measured at current prices. It exceeds Real GDP whenever prices are above their base-year level, and it does not account for differences in the cost of living across countries or for price changes due to inflation or deflation.

# GDP Nominal Data
# Create a new dataframe with only the required columns
GDP_nominal_data = GDP_nominal_data[
    ['ref_area', 'time_period', 'unit_mult','obs_value']].copy()

# Display the first few rows of the new dataset
GDP_nominal_data.head()

	ref_area	time_period	unit_mult	obs_value
0	NE	2005	6	2418864.0
1	NE	2006	6	2596972.0
2	NE	2007	6	2762961.0
3	NE	2008	6	3247835.0
4	NE	2009	6	3403683.0

Population is the total number of people living in a country at a given time. This is where the “per capita” comes from. Real GDP is the total value of all goods and services produced in a country adjusted for inflation. Real GDP per capita is the total economic output per person in a country.

# GDP Population Data 
# Create a new dataframe with only the required columns
GDP_population_data = GDP_population_data[
    ['ref_area', 'time_period', 'unit_mult','obs_value']].copy()

# Display the first few rows of the new dataset
GDP_population_data.head()

	ref_area	time_period	unit_mult	obs_value
0	GA	1950	3	473.296
1	GA	1951	3	476.381
2	GA	1952	3	478.655
3	GA	1953	3	480.536
4	GA	1954	3	482.332

# Combine all the datasets above for further calculation
merged_df = pd.merge(pd.merge(GDP_deflator_data,GDP_nominal_data, 
on=['time_period', 'ref_area'], 
suffixes=('_index', '_nominal'), 
how='inner'), 
GDP_population_data, 
on=['time_period', 'ref_area'], 
how='inner')

We adjusted the GDP data by the unit multiplier, which is the power of ten by which the value column must be multiplied. For example, in 1950, the observed population value for country GA (Gabon) was 473.296. With a unit multiplier of 3, the adjusted population is 473.296 × 10^3 = 473,296.
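
The arithmetic of the unit multiplier can be checked in a few lines; this sketch simply reproduces the population example with plain Python numbers.

```python
obs_value = 473.296  # reported population value for GA in 1950
unit_mult = 3        # unit multiplier: the power of ten to scale by

# Scale the reported value by 10 raised to the unit multiplier
adjusted = obs_value * 10 ** unit_mult
print(round(adjusted))
```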

merged_df['adjusted_index'] = (
    merged_df['obs_value_index'] * 10 ** merged_df['unit_mult_index'])
merged_df['adjusted_nominal'] = (
    merged_df['obs_value_nominal'] * 10 ** merged_df['unit_mult_nominal'])
merged_df['adjusted_population'] = (
    merged_df['obs_value'] * 10 ** merged_df['unit_mult'])

# Merged dataset
# Create a new dataframe with only the required columns
merged_df = merged_df[['ref_area', 'time_period',
'adjusted_nominal', 'adjusted_index', 'adjusted_population']].copy()

# Display the first few rows of the dataset
merged_df.head()

	ref_area	time_period	adjusted_nominal	adjusted_index	adjusted_population
0	FI	1990	9.096400e+10	73.200623	4996220.0
1	FI	1991	8.691300e+10	73.984068	5019134.0
2	FI	1992	8.478600e+10	74.654309	5044928.0
3	FI	1993	8.561000e+10	75.619254	5071782.0
4	FI	1994	9.064600e+10	77.937293	5097090.0

We then computed Real GDP per capita in two steps.

# Step 1: Real GDP = (Nominal GDP / GDP Deflator Index)*100
merged_df['Real_GDP_domestic'] = (merged_df['adjusted_nominal'] / merged_df[
    'adjusted_index'])*100

# Step 2: Real GDP per Capita = Real GDP / Population
merged_df['Real_GDP_per_capita'] = merged_df['Real_GDP_domestic'] / merged_df[
    'adjusted_population']

# Rename columns
merged_df = merged_df.rename(columns= {
    "ref_area": "Country",
    "time_period": "Time",
    "adjusted_nominal": "Nominal",
    "adjusted_index": "Deflator",
    "adjusted_population": "Population",
    "Real_GDP_domestic": "Real GDP",
    "Real_GDP_per_capita": "Real GDP per Capita"
}
)
# Check the results
merged_df.head()

	Country	Time	Nominal	Deflator	Population	Real GDP	Real GDP per Capita
0	FI	1990	9.096400e+10	73.200623	4996220.0	1.242667e+11	24872.143699
1	FI	1991	8.691300e+10	73.984068	5019134.0	1.174753e+11	23405.490395
2	FI	1992	8.478600e+10	74.654309	5044928.0	1.135715e+11	22512.011198
3	FI	1993	8.561000e+10	75.619254	5071782.0	1.132119e+11	22321.919259
4	FI	1994	9.064600e+10	77.937293	5097090.0	1.163063e+11	22818.181344
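
As a sanity check, the two formulas can be applied by hand to the inputs of the first row of the table (FI, 1990). This is a minimal sketch with the displayed values typed in as plain numbers, so small rounding differences from the table are expected.

```python
nominal = 9.0964e10     # adjusted nominal GDP, domestic currency (FI, 1990)
deflator = 73.200623    # GDP deflator index (base year = 100)
population = 4996220.0  # adjusted population

# Step 1: deflate nominal GDP to base-year prices
real_gdp = nominal / deflator * 100

# Step 2: divide by population
real_gdp_per_capita = real_gdp / population
print(f"{real_gdp:.6e}  {real_gdp_per_capita:.2f}")
```

The results should agree with the Real GDP and Real GDP per Capita columns of that row up to rounding.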

We calculated the percentage change in Real GDP per capita and put it in a new column.

# Calculate percent change for each ref_area
merged_df['GDP_change'] = merged_df.sort_values(['Country', 'Time']).groupby(
    'Country')['Real GDP per Capita'].pct_change()

# Rename dataset
GDP_data = merged_df

# Display the first few rows of the dataset
GDP_data.head()

	Country	Time	Nominal	Deflator	Population	Real GDP	Real GDP per Capita	GDP_change
0	FI	1990	9.096400e+10	73.200623	4996220.0	1.242667e+11	24872.143699	NaN
1	FI	1991	8.691300e+10	73.984068	5019134.0	1.174753e+11	23405.490395	-0.058968
2	FI	1992	8.478600e+10	74.654309	5044928.0	1.135715e+11	22512.011198	-0.038174
3	FI	1993	8.561000e+10	75.619254	5071782.0	1.132119e+11	22321.919259	-0.008444
4	FI	1994	9.064600e+10	77.937293	5097090.0	1.163063e+11	22818.181344	0.022232

# GII and GDP
# Merge the datasets
combined_data = pd.merge(GII_data, GDP_data, 
on=["Country", "Time"], 
how = "inner")

# Check the combined dataset
combined_data.head()

	Country	Time	GII	GII_change	Nominal	Deflator	Population	Real GDP	Real GDP per Capita	GDP_change
0	AL	2009	0.246238	-0.006176	1.143936e+12	95.997230	2973044.0	1.191635e+12	400813.060656	NaN
1	AL	2010	0.240877	-0.009627	1.239645e+12	100.758353	2948029.0	1.230314e+12	417334.584116	0.041220
2	AL	2011	0.240131	-0.009771	1.300624e+12	103.924160	2928601.0	1.251513e+12	427341.491205	0.023978
3	AL	2012	0.236440	-0.009977	1.332811e+12	103.230605	2914091.0	1.291101e+12	443054.328283	0.036769
4	AL	2013	0.223407	-0.009001	1.350053e+12	102.584604	2903788.0	1.316038e+12	453214.302235	0.022932

Data Visualization

Scatterplot

A scatterplot uses dots to represent the values of two numeric variables. The horizontal axis was the percent change in Real GDP per capita. The vertical axis was the percent change in the Gender Inequality Index (GII). Different colors represented different countries. We used a linear regression line to display the overall pattern.

Based on the scatterplot, there seemed to be a slight positive relationship between GDP change and GII change, as shown by the nearly flat regression line. Gender inequality was decreasing (gender equality was improving) a little faster in country-years with low GDP growth and a little slower in country-years with high GDP growth.

# Convert numeric columns to float
numeric_columns = [
    'GII', 'GII_change', 'Nominal', 'Deflator', 'Population', 
    'Real GDP', 'Real GDP per Capita', 'GDP_change'
]
for col in numeric_columns:
    combined_data[col] = pd.to_numeric(combined_data[col], errors='coerce')

# Count NAs
print(f"Dropping {combined_data[numeric_columns].isna().sum()} rows with NAs")

# Drop NAs
combined_data = combined_data.dropna(subset=numeric_columns)

# Plot the data points
plt.figure(figsize=(8, 6))
for country in combined_data['Country'].unique():
    country_data = combined_data[combined_data['Country'] == country]
    plt.scatter(country_data['GDP_change'], country_data['GII_change'],
             marker='o',linestyle='-', label=country)
plt.title('Country-Year Analysis of GDP Change vs. GII Change')
plt.xlabel('Percent Change in Real GDP per Capita (Country-Year)')
plt.ylabel('Percent Change in GII (Country-Year)')
plt.grid(True)

# Prepare data for linear regression
X = combined_data['GDP_change'].values.reshape(-1, 1)
y = combined_data['GII_change'].values

# Perform linear regression
reg = LinearRegression().fit(X, y)
y_pred = reg.predict(X)

# Plot the regression line
plt.plot(combined_data['GDP_change'], y_pred, color='red', linewidth=2)

plt.show()

Dropping GII                     0
GII_change             40
Nominal                 0
Deflator                0
Population              0
Real GDP                0
Real GDP per Capita     0
GDP_change             38
dtype: int64 rows with NAs

Time Series Line Plot

We created separate line plots of GDP change and GII change over time for a few key countries to show the trends more clearly.

US: United States

JP: Japan

GB: United Kingdom

FR: France

MX: Mexico

Based on the line plots, GDP change and GII change followed different patterns. For example, in Mexico, when there was a big change in real GDP per capita in 1995, the change in GII remained fairly stable.

# Time Series Line plot for a few key countries
selected_countries  = ['US', 'JP', 'GB', 'FR', 'MX']
combined_data_selected = combined_data[combined_data['Country'].isin(selected_countries)]

# Set up the Plot Structure
fig, ax = plt.subplots(2, 1, figsize=(8, 6), sharex=True)

# Plot change in real GDP per capita over time
sns.lineplot(data = combined_data_selected, 
x = "Time", 
y = "GDP_change", 
hue = "Country", 
ax = ax[0])
ax[0].set_title("Percent Change in Real GDP per Capita Over Time")
ax[0].set_ylabel("Percent Change in Real GDP per Capita")

# Plot change in GII over time
sns.lineplot(data = combined_data_selected, 
x = "Time", 
y = "GII_change", 
hue = "Country", 
ax = ax[1])
ax[1].set_title("Percent Change in GII over Time")
ax[1].set_xlabel("Time")
ax[1].set_ylabel("Percent Change in GII")

plt.tight_layout()
plt.show()

Barplot

We used a barplot to show average changes in GII and GDP percent change for each country to visualize regions where inequality was improving or worsening.

This plot supported our previous observation that GII change did not seem to be correlated with GDP change. We also saw that country SI (Slovenia) appeared to have a large reduction in gender inequality.

# Barplot using average GII and GDP change
# Calculate average change for each country
combined_data_avg = combined_data.groupby('Country')[
    ['GII_change','GDP_change']].mean().reset_index()

# Set up the figure and axes
fig, ax = plt.subplots(figsize=(18, 10))

# Create the barplot on the prepared axes so that the figure size takes effect
combined_data_avg.plot(kind='bar', x='Country', ax=ax)
ax.set_ylabel('Average Change')
ax.set_xlabel('Country')
ax.legend(['GII change', 'GDP change'])
ax.grid(axis='y')

# Show the plot
plt.show()

Boxplot

We used a boxplot to visualize the distribution of GDP and GII change by country, which provides information about spread, median, and potential outliers. To make the view more informative, we ordered countries in ascending order by the median of percent change in GDP.

The boxplot showed at most a slight upward trend, with no obvious pattern between GDP and GII change. Countries with a higher median GDP change also tended to have a larger spread of GDP change. The median of GII change remained stable regardless of the magnitude of GDP change, implying a weak or no association between GDP and GII change. We observed a potential outlier for country SI (Slovenia), which may explain its large average improvement in gender inequality.

# Box plot for GII and GDP change
# Melt the dataframe to long format for combined boxplot
combined_data_melted = combined_data.melt(id_vars=['Country'], 
value_vars=['GII_change', 'GDP_change'], 
var_name='Change_Type', 
value_name='Value')

gdp_medians = combined_data.groupby('Country')['GDP_change'].median().sort_values()

combined_data_melted['Country'] = pd.Categorical(combined_data_melted['Country'], 
categories=gdp_medians.index, 
ordered= True)

# Prepare the plot structure
plt.figure(figsize=(8, 6))
sns.boxplot(data = combined_data_melted, 
x = "Country", 
y = 'Value', 
hue = 'Change_Type')
plt.title('Distribution of GII and GDP change by Country')
plt.xlabel('Country')
plt.ylabel('Change')
plt.legend(title = 'Change Type')

# Show the plot
plt.show()

Correlation Matrix

We created a heatmap to show the relationship between GII and GDP change.

A positive correlation coefficient indicates a positive relationship: the larger the GDP change, the larger the GII change. A negative correlation coefficient indicates a negative relationship: the larger the GDP change, the smaller the GII change. A correlation coefficient closer to 0 indicates there is weak or no relationship.

Based on the numeric values in the plot, there was a moderately strong positive correlation between GII and GDP change for Estonia (EE) and Ireland (IE).

# Calculate the correlation
country_correlation = combined_data.groupby('Country')[
    ['GII_change', 'GDP_change']].corr().iloc[0::2, -1].reset_index(name='Correlation')

# Put the correlation value in a matrix format
correlation_matrix = country_correlation.pivot(index='Country', 
columns='level_1', 
values='Correlation')

# Check for NaN values in the correlation matrix
# Replace NaNs with 0 or another value as appropriate
correlation_matrix.fillna(0, inplace=True)  

# Set up the plot structure
# Adjust height to give more space for y-axis labels
plt.figure(figsize=(8, 12))  

# Plot the heatmap
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, 
cbar_kws={"shrink": .8}, 
linewidths=.5)

# Enhance axis labels and title
plt.title('Heatmap for GII and GDP Change', fontsize=20)
plt.xlabel('Variables', fontsize=16)
plt.ylabel('Country', fontsize=16)

# Improve readability of y-axis labels
plt.yticks(fontsize=12)  # Adjust the font size for y-axis labels

# Show the plot
plt.show()
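
The per-country correlation can also be computed one group at a time, which some readers may find easier to follow than slicing the stacked correlation matrix. This is a minimal sketch on made-up data with hypothetical country codes, not the project's code.

```python
import pandas as pd

toy = pd.DataFrame({
    "Country": ["AA"] * 4 + ["BB"] * 4,  # made-up country codes
    "GDP_change": [0.01, 0.02, 0.03, 0.04, 0.01, 0.02, 0.03, 0.04],
    "GII_change": [-0.04, -0.03, -0.02, -0.01, -0.01, -0.02, -0.03, -0.04],
})

# Pearson correlation between the two change series within each country
corr_by_country = {
    country: group["GII_change"].corr(group["GDP_change"])
    for country, group in toy.groupby("Country")
}
print(corr_by_country)
```

In this toy data, GII change rises with GDP change in country AA (correlation of +1) and falls with it in country BB (correlation of -1).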

Statistical Analysis

Descriptive Statistics

There were 896 data points in total. The mean GII change was -0.021738, so the grand mean percent change in the Gender Inequality Index was -2.17%. The mean GDP change was 0.023793, so the grand mean percent change in real GDP per capita was 2.38%.

# Generate summary statistics
combined_data.describe()

	GII	GII_change	Nominal	Deflator	Population	Real GDP	Real GDP per Capita	GDP_change
count	896.000000	896.000000	8.960000e+02	896.000000	8.960000e+02	8.960000e+02	8.960000e+02	896.000000
mean	0.238270	-0.021738	7.841185e+13	85.362016	4.507063e+07	7.581775e+13	9.919523e+05	0.023793
std	0.149157	0.041293	6.063567e+14	21.125956	1.255392e+08	5.570436e+14	3.854128e+06	0.041931
min	0.011690	-0.552535	2.187139e+09	3.606364	3.963240e+05	5.217326e+09	1.798422e+03	-0.285847
25%	0.131004	-0.030203	1.101968e+11	73.831673	5.041634e+06	1.500708e+11	1.717820e+04	0.004223
50%	0.184532	-0.011172	7.808172e+11	88.700623	1.032114e+07	1.015802e+12	3.545761e+04	0.022361
75%	0.332871	-0.003554	2.497352e+12	100.127034	4.444512e+07	2.914933e+12	1.212319e+05	0.043195
max	0.788954	0.210491	9.546134e+15	207.890742	1.280842e+09	7.920071e+15	3.145315e+07	0.241984

Regression Analysis

Simple linear regression, as a foundational approach, gives us a basic understanding of the relationship between GDP change and GII change.

Based on the summary, we concluded the following:

Because the p-value was 0.739, well above a significance level (alpha) of 0.05, we failed to reject the null hypothesis: we found no statistically significant relationship between the percent change in real GDP per capita and the percent change in the Gender Inequality Index.
R-squared = 0.000. Essentially none of the variance in GII change could be explained by GDP change.
We were 95% confident that the interval from -0.054 to 0.076 captured the true slope on GDP change. Because the interval included 0, we could not determine the direction of the effect of GDP change on GII change.

# Get column data type summaries of combined_data
combined_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 896 entries, 1 to 973
Data columns (total 10 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Country              896 non-null    object 
 1   Time                 896 non-null    object 
 2   GII                  896 non-null    float64
 3   GII_change           896 non-null    float64
 4   Nominal              896 non-null    float64
 5   Deflator             896 non-null    float64
 6   Population           896 non-null    float64
 7   Real GDP             896 non-null    float64
 8   Real GDP per Capita  896 non-null    float64
 9   GDP_change           896 non-null    float64
dtypes: float64(8), object(2)
memory usage: 77.0+ KB

# Define independent and dependent variables
X = combined_data['GDP_change']
y = combined_data['GII_change']

# Add a constant to the independent variable to include an intercept
X = sm.add_constant(X)

# Fit a simple linear regression model and print out the summary
model = sm.OLS(y, X).fit()
model.summary()

OLS Regression Results
Dep. Variable:	GII_change	R-squared:	0.000
Model:	OLS	Adj. R-squared:	-0.001
Method:	Least Squares	F-statistic:	0.1114
Date:	Mon, 30 Jun 2025	Prob (F-statistic):	0.739
Time:	18:37:19	Log-Likelihood:	1584.8
No. Observations:	896	AIC:	-3166.
Df Residuals:	894	BIC:	-3156.
Df Model:	1
Covariance Type:	nonrobust

	coef	std err	t	P>|t|	[0.025	0.975]
const	-0.0220	0.002	-13.862	0.000	-0.025	-0.019
GDP_change	0.0110	0.033	0.334	0.739	-0.054	0.076

Omnibus:	872.466	Durbin-Watson:	1.570
Prob(Omnibus):	0.000	Jarque-Bera (JB):	62725.209
Skew:	-4.277	Prob(JB):	0.00
Kurtosis:	43.087	Cond. No.	23.9

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Time Series Analysis

Time series analysis allows us to explore how the relationship between GII and GDP change varies across time periods, accounting for lagged effects.

Here is a quick summary of the results:

Both the GII and GDP change time series were stationary.
Past GII change values significantly influenced current GII change values.
The VAR model performed well at forecasting future values from historical data.
Changes in GDP did not cause/precede the changes in GII.

ADF Test: Stationarity Assumption Check

We used the Augmented Dickey-Fuller (ADF) test to check whether each time series was stationary, which is an assumption of many time series models.

Stationarity implies a constant mean and variance over time, which makes a series more predictable and stable for forecasting.

Based on the ADF test output, both GII and GDP change time series were stationary. We proceeded to the time series modeling section.

# Augmented Dickey-Fuller (ADF) test for stationarity check
# Create melted datasets
combined_data_time = combined_data.melt(
    id_vars=['Time', 'Country'],
    value_vars=['GII_change', 'GDP_change'],
    var_name='Change_Type',
    value_name='Value',
)
GII = combined_data_time[combined_data_time['Change_Type'] == 'GII_change']
GDP = combined_data_time[combined_data_time['Change_Type'] == 'GDP_change']

# Stationarity check
def adf_test(series):
    result = adfuller(series.dropna())
    print(f'ADF Statistic: {result[0]}')
    print(f'p-value: {result[1]}')
    if result[1] < 0.05:
        print("Series is stationary")
    else:
        print("Series is not stationary")

# Output the result
adf_test(GII['Value'])
adf_test(GDP['Value'])

ADF Statistic: -13.258508444109324
p-value: 8.491362672400665e-25
Series is stationary
ADF Statistic: -13.53669426905047
p-value: 2.559492685554561e-25
Series is stationary

VAR model: Examine variables separately

We fitted a VAR (Vector Autoregression) model to examine the relationship between GII change and GDP change. VAR is particularly useful for multivariate time series data because it lets us examine the interdependence between variables.

Based on the summary, we can make several interpretations:

We used AIC as the criterion for model selection; a lower value suggests a better fit.
Given that we wanted to predict GII change, we focused on the first set, “Results for equation GII_change.”
Past GII_change values significantly influenced current GII_change, as shown by the small p-values at lags 1 and 2.
None of the GDP_change lags was statistically significant; the lowest p-value (0.263, at lag 3) was still well above 0.05.

# Split the dataset into training and testing sets
split_ratio = 0.7
split_index = int(len(combined_data) * split_ratio)

# Training set is used to fit the model
train_data = combined_data.iloc[:split_index]

# Testing set is used for validation
test_data = combined_data.iloc[split_index:]

print(f"Training data: {train_data.shape}")
print(f"Test data: {test_data.shape}")

Training data: (627, 10)
Test data: (269, 10)

# Fit a VAR model 
time_model = VAR(train_data[['GII_change', 'GDP_change']])
time_model_fitted = time_model.fit(maxlags = 15, ic="aic")

# Print out the model summary
time_model_fitted.summary()

  Summary of Regression Results   
==================================
Model:                         VAR
Method:                        OLS
Date:           Mon, 30, Jun, 2025
Time:                     18:37:19
--------------------------------------------------------------------
No. of Equations:         2.00000    BIC:                   -12.6388
Nobs:                     624.000    HQIC:                  -12.6996
Log likelihood:           2217.52    FPE:                2.93645e-06
AIC:                     -12.7383    Det(Omega_mle):     2.87166e-06
--------------------------------------------------------------------
Results for equation GII_change
================================================================================
                   coefficient       std. error           t-stat            prob
--------------------------------------------------------------------------------
const                -0.014401         0.002527           -5.698           0.000
L1.GII_change         0.206183         0.040171            5.133           0.000
L1.GDP_change         0.008941         0.037668            0.237           0.812
L2.GII_change         0.147039         0.040539            3.627           0.000
L2.GDP_change        -0.038093         0.037741           -1.009           0.313
L3.GII_change         0.071829         0.040373            1.779           0.075
L3.GDP_change         0.041997         0.037485            1.120           0.263
================================================================================

Results for equation GDP_change
================================================================================
                   coefficient       std. error           t-stat            prob
--------------------------------------------------------------------------------
const                 0.017922         0.002668            6.719           0.000
L1.GII_change        -0.021878         0.042400           -0.516           0.606
L1.GDP_change         0.132221         0.039758            3.326           0.001
L2.GII_change         0.124973         0.042789            2.921           0.003
L2.GDP_change         0.025632         0.039835            0.643           0.520
L3.GII_change        -0.063330         0.042614           -1.486           0.137
L3.GDP_change         0.146934         0.039565            3.714           0.000
================================================================================

Correlation matrix of residuals
              GII_change  GDP_change
GII_change      1.000000   -0.033903
GDP_change     -0.033903    1.000000

VAR Model: Forecasting

We applied the model fitted above to the test data. Based on the plots, the forecast values seem to follow the actual data well, suggesting that the model captures the underlying trends.

# Number of steps to forecast (length of the test set)
n_steps = len(test_data)

# Get the last values from the training set for forecasting
forecast_input = train_data[
    ['GII_change', 'GDP_change']].values[-time_model_fitted.k_ar:]

# Forecasting
forecast = time_model_fitted.forecast(y=forecast_input, steps=n_steps)

# Create a DataFrame for the forecasted values, indexed like the test set
forecast_df = pd.DataFrame(
    forecast,
    index=test_data.index,
    columns=['GII_forecast', 'GDP_forecast'],
)

plt.figure(figsize=(8, 6))
plt.plot(train_data['GII_change'], label='Training GII', color='blue')
plt.plot(test_data['GII_change'], label='Actual GII', color='orange')
plt.plot(forecast_df['GII_forecast'], label='Forecasted GII', color='green')
plt.title('GII Change Forecast vs Actual')
plt.legend()
plt.show()

plt.figure(figsize=(8, 6))
plt.plot(train_data['GDP_change'], label='Training GDP', color='blue')
plt.plot(test_data['GDP_change'], label='Actual GDP', color='orange')
plt.plot(forecast_df['GDP_forecast'], label='Forecasted GDP', color='green')
plt.title('GDP Change Forecast vs Actual')
plt.legend()
plt.show()

VAR Model: Model Performance

Both MAE and RMSE are small in absolute terms (about 0.02 to 0.04 on the scale of the change variables), indicating small average prediction errors.

mae_gii = mean_absolute_error(test_data['GII_change'], forecast_df['GII_forecast'])
mae_gdp = mean_absolute_error(test_data['GDP_change'], forecast_df['GDP_forecast'])

print(f'Mean Absolute Error for GII: {mae_gii}')
print(f'Mean Absolute Error for GDP: {mae_gdp}')

Mean Absolute Error for GII: 0.02163427421854603
Mean Absolute Error for GDP: 0.027874530660148857

rmse_gii = np.sqrt(
    mean_squared_error(test_data['GII_change'], forecast_df['GII_forecast'])
)
rmse_gdp = np.sqrt(
    mean_squared_error(test_data['GDP_change'], forecast_df['GDP_forecast'])
)

print(f'RMSE for GII: {rmse_gii}')
print(f'RMSE for GDP: {rmse_gdp}')

RMSE for GII: 0.0400826273209931
RMSE for GDP: 0.03867925511599023
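
Error metrics are easier to judge next to a reference point. The sketch below computes MAE and RMSE by hand and compares a forecast against a naive baseline that always predicts a constant. The functions `mae` and `rmse`, the arrays, and the baseline value are all hypothetical, made-up numbers used only to show the arithmetic, not results from this project.

```python
import numpy as np


def mae(actual, predicted):
    """Mean absolute error: the average size of the errors."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(actual - predicted)))


def rmse(actual, predicted):
    """Root mean squared error: penalizes large errors more than MAE."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


# Made-up values for illustration (NOT the project's data)
actual = np.array([-0.03, -0.01, -0.02, 0.00, -0.04])
model_forecast = np.array([-0.02, -0.02, -0.02, -0.01, -0.03])

# Naive baseline: always predict a constant, e.g. a hypothetical training mean
baseline = np.full_like(actual, -0.02)

print(f"model    MAE={mae(actual, model_forecast):.4f}  "
      f"RMSE={rmse(actual, model_forecast):.4f}")
print(f"baseline MAE={mae(actual, baseline):.4f}  "
      f"RMSE={rmse(actual, baseline):.4f}")
```

If a model's errors are not clearly below those of such a baseline, low absolute values of MAE and RMSE mostly reflect the small scale of the variable rather than forecasting skill.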

VAR Model: Granger causality test

The Granger causality test evaluates whether past values of one time series help predict another.

Based on the output, the lowest p-value occurs at lag = 3 (p ≈ 0.56). Because every p-value is greater than 0.05, we fail to reject the null hypothesis and conclude that GDP_change does not Granger-cause GII_change.

# Perform the Granger causality test
max_lag = 3
test_result = grangercausalitytests(
    train_data[['GII_change', 'GDP_change']], max_lag, verbose=True
)

Granger Causality
number of lags (no zero) 1
ssr based F test:         F=0.1149  , p=0.7348  , df_denom=623, df_num=1
ssr based chi2 test:   chi2=0.1154  , p=0.7340  , df=1
likelihood ratio test: chi2=0.1154  , p=0.7341  , df=1
parameter F test:         F=0.1149  , p=0.7348  , df_denom=623, df_num=1

Granger Causality
number of lags (no zero) 2
ssr based F test:         F=0.4787  , p=0.6198  , df_denom=620, df_num=2
ssr based chi2 test:   chi2=0.9652  , p=0.6172  , df=2
likelihood ratio test: chi2=0.9644  , p=0.6174  , df=2
parameter F test:         F=0.4787  , p=0.6198  , df_denom=620, df_num=2

Granger Causality
number of lags (no zero) 3
ssr based F test:         F=0.6797  , p=0.5647  , df_denom=617, df_num=3
ssr based chi2 test:   chi2=2.0623  , p=0.5596  , df=3
likelihood ratio test: chi2=2.0589  , p=0.5603  , df=3
parameter F test:         F=0.6797  , p=0.5647  , df_denom=617, df_num=3

Conclusion

In wrapping up our analysis, we found no evidence of a significant relationship between the change in real GDP per capita and the change in the Gender Inequality Index (GII). This suggests that economic growth may not have a direct impact on gender equality. However, our findings open the door to questions for future research.

Future Directions

First, we must consider what other factors might influence the relationship between GDP and GII change. The GII is a composite index, shaped by a myriad of social factors, including cultural norms, legal frameworks, and environmental shifts. Future studies could benefit from incorporating additional predictors into the analysis and exploring the interaction between economic growth and gender equality within specific country contexts.

Second, there’s potential to enhance the predictive power of our Vector Autoregression (VAR) time series model. While we found no evidence that GDP change Granger-causes GII change, our model forecast both variables with low error. In practice, policymakers may want to forecast GII trends independently of GDP if they are implementing gender-focused policies. Future research could investigate time series modeling to further unravel the dynamics of GII and GDP changes.

So, as we wrap up this chapter, let’s keep our curiosity alive and our questions flowing. After all, every end is just a new beginning in the quest for knowledge!

About the Author

Hi there! My name is Jenny, and I’m a third-year student at the University of California, Davis, double majoring in Statistics and Psychology. I’ve always been interested in becoming a data analyst in the tech, internet, or research industries. Interning at Promptly Technologies helped me learn a ton. A quick fun fact about me is that my MBTI is ISFJ (Defender)!

Email LinkedIn GitHub