Economic Growth and Gender Equality: An Analysis Using IMF Data

Author

Jenny Xu

This data analysis project aims to explore the relationship between economic growth and gender equality using imfp, which allows us to download data from IMF (International Monetary Fund). imfp can be integrated with other python tools to streamline the computational process. To demonstrate its functionality, the project experimented with a variety of visualization and analysis methods.

Executive Summary

In this project, we explored the following:

  1. Data Fetching
  • Make API call to fetch 4 datasets: GII (Gender Inequality Index), Nominal GDP, GDP Deflator Index, Population series
  1. Feature Engineering
  • Cleaning: Convert GDP Deflator Index to a yearly basis and variables to numeric
  • Dependent Variable: Percent Change of Gender Inequality Index
  • Independent Variable: Percent Change of Real GDP per Capita
  • Transform variables to display magnitude of change
  • Merge the datasets
  1. Data Visualization
  • Scatterplot
  • Time Series Line Plots
  • Barplot
  • Boxplot
  • Heatmap
  1. Statistical Analysis
  • Descriptive Statistics
  • Regression Analysis
  • Time Series Analysis

Utility Functions

The integration of other Python tools not only streamlined our computational processes but also ensured consistency across the project.

A custom module is written to simplify the process of making API calls and fetching information with imfp library. load_or_fetch_databases, load_or_fetch_parameters load_or_fetch_dataset load and retreive database, parameters, and dataset from a local or remote source. view_dataframe_in_browser displays dataframe in a web browser.

import os
import pickle
from tempfile import NamedTemporaryFile
import pandas as pd
import imfp
import webbrowser


# Function to display a DataFrame in a web browser
def view_dataframe_in_browser(df):
    html = df.to_html()
    with NamedTemporaryFile(delete=False, mode="w", suffix=".html") as f:
        url = "file://" + f.name
        f.write(html)
    webbrowser.open(url)


# Function to load databases from CSV or fetch from API
def load_or_fetch_databases():
    csv_path = os.path.join("data", "databases.csv")

    # Try to load from CSV
    if os.path.exists(csv_path):
        try:
            return pd.read_csv(csv_path)
        except Exception as e:
            print(f"Error loading CSV: {e}")

    # If CSV doesn't exist or couldn't be loaded, fetch from API
    print("Fetching databases from IMF API...")
    databases = imfp.imf_databases()

    # Save to CSV for future use
    databases.to_csv(csv_path, index=False)
    print(f"Databases saved to {csv_path}")

    return databases


def load_or_fetch_parameters(database_name):
    pickle_path = os.path.join("data", f"{database_name}.pickle")

    # Try to load from pickle file
    if os.path.exists(pickle_path):
        try:
            with open(pickle_path, "rb") as f:
                return pickle.load(f)
        except Exception as e:
            print(f"Error loading pickle file: {e}")

    # If pickle doesn't exist or couldn't be loaded, fetch from API
    print(f"Fetching parameters for {database_name} from IMF API...")
    parameters = imfp.imf_parameters(database_name)

    # Save to pickle file for future use
    os.makedirs("data", exist_ok=True)  # Ensure the data directory exists
    with open(pickle_path, "wb") as f:
        pickle.dump(parameters, f)
    print(f"Parameters saved to {pickle_path}")

    return parameters


def load_or_fetch_dataset(database_id, indicator):
    file_name = f"{database_id}.{indicator}.csv"
    csv_path = os.path.join("data", file_name)

    # Try to load from CSV file
    if os.path.exists(csv_path):
        try:
            return pd.read_csv(csv_path)
        except Exception as e:
            print(f"Error loading CSV file: {e}")

    # If CSV doesn't exist or couldn't be loaded, fetch from API
    print(f"Fetching dataset for {database_id}.{indicator} from IMF API...")
    dataset = imfp.imf_dataset(database_id=database_id, indicator=[indicator])

    # Save to CSV file for future use
    os.makedirs("data", exist_ok=True)  # Ensure the data directory exists
    dataset.to_csv(csv_path, index=False)
    print(f"Dataset saved to {csv_path}")

    return dataset

Dependencies

Here is a brief introduction about the packages used:

pandas: view and manipulate data frame

matplotlib.pyplot: make plots

seaborn: make plots

numpy: computation

LinearRegression: implement linear regression

tabulate: format data into tables

statsmodels.api, adfuller, ARIMA,VAR,plot_acf,plot_pacf,mean_absolute_error,mean_squared_error, andgrangercausalitytests are specifically used for time series analysis.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.linear_model import LinearRegression
from tabulate import tabulate
import statsmodels.api as sm
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.vector_ar.var_model import VAR
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from statsmodels.tsa.stattools import grangercausalitytests

Data Fetching

In this section, we extracted four datasets through API calls: Gender Inequality Index(GII), GDP Deflator, Nominal GDP, and Population.

from pathlib import Path
Path("data").mkdir(exist_ok=True)
# Load or fetch databases
databases = load_or_fetch_databases()

# Filter out databases that contain a year in the description
databases[
  ~databases['description'].str.contains(r"[\d]{4}", regex=True)
]

# view_dataframe_in_browser(databases)
database_id description
25 HPDD Historical Public Debt (HPDD)
34 RAFIT2AGG Revenue Administration Fiscal Information Tool...
47 GENDER_EQUALITY Gender Equality
63 PGCS Private and Public Capital Stock Dataset
69 GENDER_BUDGETING Gender Budgeting
125 CPI Consumer Price Index (CPI)
153 IRFCL International Reserves and Foreign Currency Li...
191 IFS_DISCONTINUED International Financial Statistics (IFS), Disc...
192 EQ Export Quality
193 ED Export Diversification
226 FISCALDECENTRALIZATION Fiscal Decentralization
291 FDI Financial Development Index
292 PSBSFAD Public Sector Balance Sheet (PSBS)(FAD)
293 UNSDG_IMF_INPUTS Sustainable Development Goals, IMF Inputs
294 CPIS Coordinated Portfolio Investment Survey (CPIS)
295 PCTOT Commodity Terms of Trade
296 FM Fiscal Monitor (FM)
297 AFRREO Sub-Saharan Africa Regional Economic Outlook (...
298 WHDREO Western Hemisphere Regional Economic Outlook (...
299 MCDREO Middle East and Central Asia Regional Economic...
300 APDREO Asia and Pacific Regional Economic Outlook (AP...
301 BOPAGG Balance of Payments (BOP), World and Regional ...
302 PCPS Primary Commodity Price System (PCPS)
303 CDIS Coordinated Direct Investment Survey (CDIS)
304 BOP Balance of Payments (BOP)
305 COFER Currency Composition of Official Foreign Excha...
306 DOT Direction of Trade Statistics (DOTS)
307 FAS Financial Access Survey (FAS)
308 BOPSDMXUSD Balance of Payments (BOP), Global SDMX (US Dol...
309 NAMAIN_IDC_N System of National Accounts (SNA), NA_MAIN
310 GFSR Government Finance Statistics (GFS), Revenue
311 GFSSSUC Government Finance Statistics (GFS), Statement...
312 GFSCOFOG Government Finance Statistics (GFS), Expenditu...
313 GFSFALCS Government Finance Statistics (GFS), Financial...
314 GFSIBS Government Finance Statistics (GFS), Integrate...
315 GFSMAB Government Finance Statistics (GFS), Main Aggr...
316 GFSE Government Finance Statistics (GFS), Expense
317 IFS International Financial Statistics (IFS)
318 MFS Monetary and Financial Statistics (MFS)
319 RAFIT3P RA-FIT Round3 Completion and Participation Rates
320 FSI Financial Soundness Indicators (FSIs)
321 FSIRE Financial Soundness Indicators: Reporting enti...

Two databases were used: Gender Equality and International Financial Statistics (IFS).

databases[databases['database_id'].isin(['GENDER_EQUALITY','IFS'])]
database_id description
47 GENDER_EQUALITY Gender Equality
317 IFS International Financial Statistics (IFS)

Parameters are dictionary key names to make requests from the databases. “freq” stands for Frequency, such as Annual, Monthly, or Quarterly. “ref_area” stands for Geogrpahical Area, such as US (United States), JP (Japan), and GB (United Kindom). “indicator” refers to the code representing a specific dataset in the database. For example, if we display all the indicators for IFS database, the GDP deflator dataset has an input code of “NGDP_D_SA_IX” with a full name description of Gross Domestic Product, Deflator, Seasonally Adjusted, Index.

datasets = ["GENDER_EQUALITY", "IFS"]
params = {}

# Fetch valid parameters for two datasets
for dataset in datasets:
    params[dataset] = load_or_fetch_parameters(dataset)

    valid_keys = list(params[dataset].keys())
    print(f"Parameters for {dataset}: ", valid_keys)
Parameters for GENDER_EQUALITY:  ['freq', 'ref_area', 'indicator']
Parameters for IFS:  ['freq', 'ref_area', 'indicator']

We paired the database with the specific dataset indicator to read and store the csv file.

datasets = {}
dsets = [("GENDER_EQUALITY", "GE_GII"), 
("IFS", "NGDP_D_SA_IX"), 
("IFS", "NGDP_XDC"), 
("IFS", "LP_PE_NUM")]

for dset in dsets:
    datasets[dset[0] + "." + dset[1]] = load_or_fetch_dataset(dset[0], dset[1])
# "Gender Inequality Index"
GII = "GENDER_EQUALITY.GE_GII"

# "Gross Domestic Product, Deflator, Seasonally Adjusted, Index"
GDP_deflator = "IFS.NGDP_D_SA_IX"

# "Gross Domestic Product, Nominal, Domestic Currency"
GDP_nominal = "IFS.NGDP_XDC"

# "Population, Persons, Number of"
GDP_population = "IFS.LP_PE_NUM"

# Assign the datasets to new variables so we don't change the originals
GII_data = datasets[GII]
GDP_deflator_data = datasets[GDP_deflator]
GDP_nominal_data = datasets[GDP_nominal]
GDP_population_data = datasets[GDP_population]

Feature Engineering

Data Cleaning

Since the GDP deflator was reported on a quarterly basis, we converted it to a yearly basis.

# Keep only rows with a partial string match for "Q4" in the time_period column
GDP_deflator_data = GDP_deflator_data[GDP_deflator_data
['time_period'].str.contains("Q4")]
# Split the time_period into year and quarter and keep the year only
GDP_deflator_data.loc[:, 'time_period'] = GDP_deflator_data['time_period'].str[0:4]

We made all the variables numeric.

datasets = [GII_data, GDP_deflator_data, GDP_nominal_data, GDP_population_data]

for i, dataset in enumerate(datasets):    
    # Use .loc to modify the columns
    datasets[i].loc[:, 'obs_value'] = pd.to_numeric(datasets[i]['obs_value'], 
    errors='coerce')
    datasets[i].loc[:, 'time_period'] = pd.to_numeric(datasets[i]['time_period'], 
    errors='coerce')
    datasets[i].loc[:, 'unit_mult'] = pd.to_numeric(datasets[i]['unit_mult'], 
    errors='coerce')

GII Percent Change: Dependent Variable

We kept percents as decimals to make them easy to work with for calculation. Different countries have different baseline level of economic growth and gender equality. We calculated the percent change to make them comparable.

Gender Inequality Index (GII) is a composite measure of gender inequality using three dimensions: reproducitve health, empowerment, and labor market. GII ranges from 0 to 1. While 0 indicates gender equality, 1 indicates gender inequality, possibly the worst outcome for one gender in all three dimensions.

# Calculate percent change for each ref_area
# First, create a copy and reset the index to avoid duplicate index issues
GII_data_sorted = GII_data.sort_values(
    ['ref_area', 'time_period']).reset_index(drop=True)
GII_data['pct_change'] = GII_data_sorted.groupby('ref_area')['obs_value'].pct_change()

# Display the first few rows of the updated dataset
GII_data.head()
freq ref_area indicator unit_mult time_format time_period obs_value pct_change
0 A AF GE_GII 0 P1Y 1990 0.828244 NaN
1 A AF GE_GII 0 P1Y 1991 0.817706 -0.015156
2 A AF GE_GII 0 P1Y 1992 0.809806 -0.016783
3 A AF GE_GII 0 P1Y 1993 0.803078 -0.012651
4 A AF GE_GII 0 P1Y 1994 0.797028 -0.013718

We subset the data frame to keep only the columns we want:

# Create a new dataframe with only the required columns
GII_data = GII_data[['ref_area', 'time_period', 'obs_value', 'pct_change']].copy()

GII_data = GII_data.rename(columns = {
    'ref_area': 'Country',
    'time_period': 'Time',
    'obs_value': 'GII',
    'pct_change': 'GII_change'
})

# Display the first few rows of the new dataset
GII_data.head()
Country Time GII GII_change
0 AF 1990 0.828244 NaN
1 AF 1991 0.817706 -0.015156
2 AF 1992 0.809806 -0.016783
3 AF 1993 0.803078 -0.012651
4 AF 1994 0.797028 -0.013718

GDP Percent Change: Independent Variable

Real GDP per capita is a measure of a country’s economic welfare or standard of living. It is a great tool comparing a country’s economic development compared to other economies. Due to dataset access issue, we calculated Real GDP per capita by the following formula using GDP Deflator, Nominal GDP, and Population data:

\(\text{Real GDP} = \frac{\text{Nominal GDP}}{\text{GDP Deflator Index}}\times 100\)

\(\text{Real GDP per capita} = \frac{\text{Real GDP}}{\text{Population}}\)

GDP Deflator is a measure of price inflation and deflation with respect to a specific base year. The GDP deflator of a base year is equal to 100. A number of 200 indicates price inflation: the current year price of the good is twice its base year price. A number of 50 indicates price deflation: the current year price of the good is half its base year price. We kept the columns we want only for GDP-related datasets for easier table merging.

# GDP Deflator Dataset
# Create a new dataframe with only the required columns
GDP_deflator_data = GDP_deflator_data[
    ['ref_area', 'time_period', 'unit_mult', 'obs_value']].copy()

# Display the first few rows of the new dataset
GDP_deflator_data.head()
ref_area time_period unit_mult obs_value
3 FI 1990 0 73.200623
7 FI 1991 0 73.984068
11 FI 1992 0 74.654309
15 FI 1993 0 75.619254
19 FI 1994 0 77.937293

Nominal GDP is the total value of all goods and services produced in a given time period. It is usually higher than Real GDP and does not take into account cost of living in different countries or price change due to inflation/deflation.

# GDP Nominal Data
# Create a new dataframe with only the required columns
GDP_nominal_data = GDP_nominal_data[
    ['ref_area', 'time_period', 'unit_mult','obs_value']].copy()

# Display the first few rows of the new dataset
GDP_nominal_data.head()
ref_area time_period unit_mult obs_value
0 NE 2005 6 2418864.0
1 NE 2006 6 2596972.0
2 NE 2007 6 2762961.0
3 NE 2008 6 3247835.0
4 NE 2009 6 3403683.0

Population is the total number of people living in a country at a given time. This is where the “per capita” comes from. Real GDP is the total value of all goods and services produced in a country adjusted for inflation. Real GDP per capita is the total economic output per person in a country.

# GDP Population Data 
# Create a new dataframe with only the required columns
GDP_population_data = GDP_population_data[
    ['ref_area', 'time_period', 'unit_mult','obs_value']].copy()

# Display the first few rows of the new dataset
GDP_population_data.head()
ref_area time_period unit_mult obs_value
0 GA 1950 3 473.296
1 GA 1951 3 476.381
2 GA 1952 3 478.655
3 GA 1953 3 480.536
4 GA 1954 3 482.332
# Combine all the datasets above for further calculation
merged_df = pd.merge(pd.merge(GDP_deflator_data,GDP_nominal_data, 
on=['time_period', 'ref_area'], 
suffixes=('_index', '_nominal'), 
how='inner'), 
GDP_population_data, 
on=['time_period', 'ref_area'], 
how='inner')

We want to adjust GDP data based on unit multiplier. Unit multiplier stands for the number of zeroes we need to add to the value column. For example, in 1950, the observed population data for country GA (Georgia) was 473.296. With a unit muliplier of 3, the adjusted population would be 473296.

merged_df['adjusted_index'] = merged_df['obs_value_index'] * (10 ** (merged_df
['unit_mult_index']))
merged_df['adjusted_nominal'] = merged_df['obs_value_nominal'] * (10 ** (merged_df
['unit_mult_nominal']))
merged_df['adjusted_population'] = merged_df['obs_value'] * (10 ** (merged_df
['unit_mult']))
# Merged dataset
# Create a new dataframe with only the required columns
merged_df = merged_df[['ref_area', 'time_period',
'adjusted_nominal', 'adjusted_index', 'adjusted_population']].copy()

# Display the first few rows of the dataset
merged_df.head()
ref_area time_period adjusted_nominal adjusted_index adjusted_population
0 FI 1990 9.096400e+10 73.200623 4996220.0
1 FI 1991 8.691300e+10 73.984068 5019134.0
2 FI 1992 8.478600e+10 74.654309 5044928.0
3 FI 1993 8.561000e+10 75.619254 5071782.0
4 FI 1994 9.064600e+10 77.937293 5097090.0

We wanted to compute the Real GDP per capita.

# Step 1: Real GDP = (Nominal GDP / GDP Deflator Index)*100
merged_df['Real_GDP_domestic'] = (merged_df['adjusted_nominal'] / merged_df[
    'adjusted_index'])*100

# Step 2: Real GDP per Capita = Real GDP / Population
merged_df['Real_GDP_per_capita'] = merged_df['Real_GDP_domestic'] / merged_df[
    'adjusted_population']

# Rename columns
merged_df = merged_df.rename(columns= {
    "ref_area": "Country",
    "time_period": "Time",
    "adjusted_nominal": "Nominal",
    "adjusted_index": "Deflator",
    "adjusted_population": "Population",
    "Real_GDP_domestic": "Real GDP",
    "Real_GDP_per_capita": "Real GDP per Capita"
}
)
# Check the results
merged_df.head()
Country Time Nominal Deflator Population Real GDP Real GDP per Capita
0 FI 1990 9.096400e+10 73.200623 4996220.0 1.242667e+11 24872.143699
1 FI 1991 8.691300e+10 73.984068 5019134.0 1.174753e+11 23405.490395
2 FI 1992 8.478600e+10 74.654309 5044928.0 1.135715e+11 22512.011198
3 FI 1993 8.561000e+10 75.619254 5071782.0 1.132119e+11 22321.919259
4 FI 1994 9.064600e+10 77.937293 5097090.0 1.163063e+11 22818.181344

We calculated the percentage change in Real GDP per capita and put it in a new column.

# Calculate percent change for each ref_area
merged_df[f'GDP_change'] = merged_df.sort_values(['Country', 'Time']).groupby(
    'Country')['Real GDP per Capita'].pct_change()

# Rename dataset
GDP_data = merged_df

# Display the first few rows of the dataset
GDP_data.head()
Country Time Nominal Deflator Population Real GDP Real GDP per Capita GDP_change
0 FI 1990 9.096400e+10 73.200623 4996220.0 1.242667e+11 24872.143699 NaN
1 FI 1991 8.691300e+10 73.984068 5019134.0 1.174753e+11 23405.490395 -0.058968
2 FI 1992 8.478600e+10 74.654309 5044928.0 1.135715e+11 22512.011198 -0.038174
3 FI 1993 8.561000e+10 75.619254 5071782.0 1.132119e+11 22321.919259 -0.008444
4 FI 1994 9.064600e+10 77.937293 5097090.0 1.163063e+11 22818.181344 0.022232
# GII and GDP
# Merge the datasets
combined_data = pd.merge(GII_data, GDP_data, 
on=["Country", "Time"], 
how = "inner")

# Check the combined dataset
combined_data.head()
Country Time GII GII_change Nominal Deflator Population Real GDP Real GDP per Capita GDP_change
0 AL 2009 0.246238 -0.006176 1.143936e+12 95.997230 2973044.0 1.191635e+12 400813.060656 NaN
1 AL 2010 0.240877 -0.009627 1.239645e+12 100.758353 2948029.0 1.230314e+12 417334.584116 0.041220
2 AL 2011 0.240131 -0.009771 1.300624e+12 103.924160 2928601.0 1.251513e+12 427341.491205 0.023978
3 AL 2012 0.236440 -0.009977 1.332811e+12 103.230605 2914091.0 1.291101e+12 443054.328283 0.036769
4 AL 2013 0.223407 -0.009001 1.350053e+12 102.584604 2903788.0 1.316038e+12 453214.302235 0.022932

Data Visualization

Scatterplot

Scatterplot use dots to represent values of two numeric variables. The horizontal axis was the percent change in Real GDP per capita. The vertical axis was the percent change in Gender Inequality Index(GII). Different colors represented different countries. We used a linear regression line to display the overall pattern.

Based on the scatterplot, it seemed like there was a slight positive relationship between GDP change and GII change as shown by the flat regression line. Gender inequality was decreasing (gender equality was improving) a little faster in country-years with low GDP growth and a little slower in country-years with high GDP growth.

# Convert numeric columns to float
numeric_columns = [
    'GII', 'GII_change', 'Nominal', 'Deflator', 'Population', 
    'Real GDP', 'Real GDP per Capita', 'GDP_change'
]
for col in numeric_columns:
    combined_data[col] = pd.to_numeric(combined_data[col], errors='coerce')

# Count NAs
print(f"Dropping {combined_data[numeric_columns].isna().sum()} rows with NAs")

# Drop NAs
combined_data = combined_data.dropna(subset=numeric_columns)

# Plot the data points
plt.figure(figsize=(8, 6))
for country in combined_data['Country'].unique():
    country_data = combined_data[combined_data['Country'] == country]
    plt.scatter(country_data['GDP_change'], country_data['GII_change'],
             marker='o',linestyle='-', label=country)
plt.title('Country-Year Analysis of GDP Change vs. GII Change')
plt.xlabel('Percent Change in Real GDP per Capita (Country-Year)')
plt.ylabel('Percent Change in GII (Country-Year)')
plt.grid(True)

# Prepare data for linear regression
X = combined_data['GDP_change'].values.reshape(-1, 1)
y = combined_data['GII_change'].values

# Perform linear regression
reg = LinearRegression().fit(X, y)
y_pred = reg.predict(X)

# Plot the regression line
plt.plot(combined_data['GDP_change'], y_pred, color='red', linewidth=2)

plt.show()
Dropping GII                     0
GII_change             40
Nominal                 0
Deflator                0
Population              0
Real GDP                0
Real GDP per Capita     0
GDP_change             38
dtype: int64 rows with NAs

Time Series Line Plot

We created separate line plots for GDP change and GII change over time for a few key countries might show the trends more clearly.

US: United States

JP: Japan

GB: United Kindom

FR: France

MX: Mexico

Based on the line plots, we saw GDP change and GII change have different patterns. For example, in Mexico, when there was a big change in real GDP per captia in 1995, the change in GII was pretty stable.

# Time Series Line plot for a few key countries
selected_countries  = ['US', 'JP', 'GB', 'FR', 'MX']
combined_data_selected = combined_data[combined_data['Country'].isin(selected_countries)]

# Set up the Plot Structure
fig, ax = plt.subplots(2, 1, figsize=(8, 6), sharex=True)

# Plot change in real GDP per capita over time
sns.lineplot(data = combined_data_selected, 
x = "Time", 
y = "GDP_change", 
hue = "Country", 
ax = ax[0])
ax[0].set_title("Percent Change in Real GDP per Capita Over Time")
ax[0].set_ylabel("Percent Change in Real GDP per Capita")

# Plot change in GII over time
sns.lineplot(data = combined_data_selected, 
x = "Time", 
y = "GII_change", 
hue = "Country", 
ax = ax[1])
ax[1].set_title("Percent Change in GII over Time")
ax[1].set_xlabel("Time")
ax[1].set_ylabel("GII")

plt.tight_layout
plt.show()

Barplot

We used a barplot to show average changes in GII and GDP percent change for each country to visualize regions where inequality was improving or worsening.

This plot supported our previous observation how GII change seemed to be not be correlated with GDP change. We also saw that, for country SI, Solvenia, there seems to be a large improvement in gender inequality.

# Barplot using average GII and GDP change
# Calculate average change for each country
combined_data_avg = combined_data.groupby('Country')[
    ['GII_change','GDP_change']].mean().reset_index()

# Prepare to plot structure 
plt.figure(figsize = (18,10))

# Create the barplot
combined_data_avg.plot(kind = 'bar', x = 'Country')
plt.ylabel('Average Change')
plt.xlabel('Country')
plt.legend(['GII change', 'GDP change'])
plt.grid(axis = 'y')

# Show the plot
plt.show()
<Figure size 1728x960 with 0 Axes>

Boxplot

We used boxplot to visualize the distribution of GDP and GII change by country, providing information about spread, median, and potential outliers. To provide a more informative view, we sequenced countries in an ascending order by the median of percent change in GDP.

The boxplot displayed a slight upward trend with no obvious pattern between GDP and GII change. In coutries with higher GDP change median, they also tend to have a larger spread of the GDP change. The median of GII change remained stable regardless of the magnitude of GDP change, implying weak or no association between GDP and GII change. We observed a potential outlier for country SI, Solvenia, which may explained its large improvement in Gender inequality.

# Box plot for GII and GDP change
# Melt the dataframe to long format for combined boxplot
combined_data_melted = combined_data.melt(id_vars=['Country'], 
value_vars=['GII_change', 'GDP_change'], 
var_name='Change_Type', 
value_name='Value')

gdp_medians = combined_data.groupby('Country')['GDP_change'].median().sort_values()

combined_data_melted['Country'] = pd.Categorical(combined_data_melted['Country'], 
categories=gdp_medians.index, 
ordered= True)

# Prepare the plot structure
plt.figure(figsize=(8, 6))
sns.boxplot(data = combined_data_melted, 
x = "Country", 
y = 'Value', 
hue = 'Change_Type')
plt.title('Distribution of GII and GDP change by Country')
plt.xlabel('Country')
plt.ylabel('Change')
plt.legend(title = 'Change Type')

# Show the plot
plt.show()

Correlation Matrix

We created a heatmap to show the relationship between GII and GDP change.

A positive correlation coefficient indicates a positive relationship: the larger the GDP change, the larger the GII change. A negative correlation coefficient indicates a negative relationship: the larger the GDP change, the smaller the GII change. A correlation coefficient closer to 0 indicates there is weak or no relationship.

Based on the numeric values in the plot, there was a moderately strong positive correlation between GII and GDP change for country Estonia(EE) and Ireland(IE).

# Calculate the correlation
country_correlation = combined_data.groupby('Country')[
    ['GII_change', 'GDP_change']].corr().iloc[0::2, -1].reset_index(name='Correlation')

# Put the correlation value in a matrix format
correlation_matrix = country_correlation.pivot(index='Country', 
columns='level_1', 
values='Correlation')

# Check for NaN values in the correlation matrix
# Replace NaNs with 0 or another value as appropriate
correlation_matrix.fillna(0, inplace=True)  

# Set up the plot structure
# Adjust height to give more space for y-axis labels
plt.figure(figsize=(8, 12))  

# Plot the heatmap
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, 
cbar_kws={"shrink": .8}, 
linewidths=.5)

# Enhance axis labels and title
plt.title('Heatmap for GII and GDP Change', fontsize=20)
plt.xlabel('Variables', fontsize=16)
plt.ylabel('Country', fontsize=16)

# Improve readability of y-axis labels
plt.yticks(fontsize=12)  # Adjust the font size for y-axis labels

# Show the plot
plt.show()

Statistical Analysis

Descriptive Statistics

There was a total of 915 data points. The mean of the GII change in -0.0314868, which indicated the overall grand mean percent change in gender inequality index is -3.15%. The mean of the GDP change was 0.0234633, showing the overall grand mean percent change in real GDP per capita was 2.35%.

# Generate summary statistics
combined_data.describe()
GII GII_change Nominal Deflator Population Real GDP Real GDP per Capita GDP_change
count 896.000000 896.000000 8.960000e+02 896.000000 8.960000e+02 8.960000e+02 8.960000e+02 896.000000
mean 0.238270 -0.021738 7.841185e+13 85.362016 4.507063e+07 7.581775e+13 9.919523e+05 0.023793
std 0.149157 0.041293 6.063567e+14 21.125956 1.255392e+08 5.570436e+14 3.854128e+06 0.041931
min 0.011690 -0.552535 2.187139e+09 3.606364 3.963240e+05 5.217326e+09 1.798422e+03 -0.285847
25% 0.131004 -0.030203 1.101968e+11 73.831673 5.041634e+06 1.500708e+11 1.717820e+04 0.004223
50% 0.184532 -0.011172 7.808172e+11 88.700623 1.032114e+07 1.015802e+12 3.545761e+04 0.022361
75% 0.332871 -0.003554 2.497352e+12 100.127034 4.444512e+07 2.914933e+12 1.212319e+05 0.043195
max 0.788954 0.210491 9.546134e+15 207.890742 1.280842e+09 7.920071e+15 3.145315e+07 0.241984

Regression Analysis

Simple linear regression as a foundational approach provide us with a basic understanding of the relationship between GDP change and GII change.

Based on the summary, we concluded the following:

  • Becasue p-value = 0.057, if we set alpha, the significance level, to be 0.05, we failed to reject the null hypothesis and conclude there was no significant relationship between percent change in real GDP per capita and gender inequality index.

  • R-squared = 0.004. Only 0.4% of the variance in GII change could be explained by GDP change.

  • We were 95% confident that the interval from -0.003 to 0.169 captured the true slope of GDP change. Because 0 was included, we are uncertain about the effect of GDP change on GII chnage.

# Get column data type summaries of combined_data
combined_data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 896 entries, 1 to 973
Data columns (total 10 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Country              896 non-null    object 
 1   Time                 896 non-null    object 
 2   GII                  896 non-null    float64
 3   GII_change           896 non-null    float64
 4   Nominal              896 non-null    float64
 5   Deflator             896 non-null    float64
 6   Population           896 non-null    float64
 7   Real GDP             896 non-null    float64
 8   Real GDP per Capita  896 non-null    float64
 9   GDP_change           896 non-null    float64
dtypes: float64(8), object(2)
memory usage: 77.0+ KB
# Define independent and depenent variables
X = combined_data['GDP_change']
y = combined_data['GII_change']

# Add a constant to indepdent variable to include an intercept
X = sm.add_constant(X)

# Fit a simple linear regresion model and print out the summary
model = sm.OLS(y, X).fit()
model.summary()
OLS Regression Results
Dep. Variable: GII_change R-squared: 0.000
Model: OLS Adj. R-squared: -0.001
Method: Least Squares F-statistic: 0.1114
Date: Wed, 22 Jan 2025 Prob (F-statistic): 0.739
Time: 16:59:36 Log-Likelihood: 1584.8
No. Observations: 896 AIC: -3166.
Df Residuals: 894 BIC: -3156.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const -0.0220 0.002 -13.862 0.000 -0.025 -0.019
GDP_change 0.0110 0.033 0.334 0.739 -0.054 0.076
Omnibus: 872.466 Durbin-Watson: 1.570
Prob(Omnibus): 0.000 Jarque-Bera (JB): 62725.209
Skew: -4.277 Prob(JB): 0.00
Kurtosis: 43.087 Cond. No. 23.9


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Time Series Analysis

Time series analysis allows us to explore how the relationship between GII and GDP change vary across different time periods, accounting for lagged effects.

Here was a quick summary of the result: * Both GII and GDP change time series were stationary.

  • Past GII change values significantly influenced cuurent GII change values.

  • VAR model had good model performance on forecasting future values based on historical data.

  • Changes in GDP did not cause/precde the changes in GII.

ADF Test: Stationality Assumption Check

We wanted to use Augmented Dickey-Fuller (ADF) test to check whether a time series was stationary, which was the model assumption for many time series models.

Stationarity implied constant mean and variance over time, making it more predictable and stable for forecasting.

Based on the ADF test output, both GII and GDP change time series were stationary. We proceeded to the time series modeling section.

# Augmented Dickey-Fuller (ADF) test for stationarity check
# Create melted datasets
combined_data_time = combined_data.melt(id_vars=['Time', 'Country'], 
value_vars=['GII_change','GDP_change'], 
var_name = 'Change_Type', 
value_name = 'Value')
GII = combined_data_time[(combined_data_time['Change_Type'] == 'GII_change')]                         

GDP = combined_data_time[(combined_data_time['Change_Type'] == 'GDP_change')]

# Stationary Check
def adf_test(series):
    result = adfuller(series.dropna())
    print(f'ADF Statistic: {result[0]}')
    print(f'p-value: {result[1]}')
    if result[1] < 0.05:
        print("Series is stationary")
    else:
        print("Series is not stationary")

# Output the result
adf_test(GII['Value'])
adf_test(GDP['Value'])
ADF Statistic: -13.258508444109287
p-value: 8.491362672402006e-25
Series is stationary
ADF Statistic: -13.536694269050457
p-value: 2.5594926855546896e-25
Series is stationary

VAR model: Examine variables separately

We fitted a VAR (Vector Autoreression) model to see the relationship between GII and GDP change. VAR is particularly useful when dealing with multivariate time series data and allows us to examine the interdependence between variables.

Based on summary, here were several interpretations we could make:

  • We used AIC as the criteria for model selection. Lower value suggests a better fit.

  • Given that we wanted to predict GII change, we focused on the first set “Results for equation GII_change.”

  • Past GII_change values significantly influenced current GII_change, as shown in the small p-values of lags 1 and 2.

  • Lag 2 of GDP_change had a relatively low p-value but is not statistically significant.

# Split the dataset into training and testing sets
split_ratio = 0.7
split_index = int(len(combined_data) * split_ratio)

# Training set is used to fit the model
train_data = combined_data.iloc[:split_index]

# Testing set is used for validation
test_data = combined_data.iloc[split_index:]

print(f"Training data: {train_data.shape}")
print(f"Test data: {test_data.shape}")
Training data: (627, 10)
Test data: (269, 10)
# Fit a VAR model 
time_model = VAR(train_data[['GII_change', 'GDP_change']])
time_model_fitted = time_model.fit(maxlags = 15, ic="aic")

# Print out the model summary
time_model_fitted.summary()
  Summary of Regression Results   
==================================
Model:                         VAR
Method:                        OLS
Date:           Wed, 22, Jan, 2025
Time:                     16:59:36
--------------------------------------------------------------------
No. of Equations:         2.00000    BIC:                   -12.6388
Nobs:                     624.000    HQIC:                  -12.6996
Log likelihood:           2217.52    FPE:                2.93645e-06
AIC:                     -12.7383    Det(Omega_mle):     2.87166e-06
--------------------------------------------------------------------
Results for equation GII_change
================================================================================
                   coefficient       std. error           t-stat            prob
--------------------------------------------------------------------------------
const                -0.014401         0.002527           -5.698           0.000
L1.GII_change         0.206183         0.040171            5.133           0.000
L1.GDP_change         0.008941         0.037668            0.237           0.812
L2.GII_change         0.147039         0.040539            3.627           0.000
L2.GDP_change        -0.038093         0.037741           -1.009           0.313
L3.GII_change         0.071829         0.040373            1.779           0.075
L3.GDP_change         0.041997         0.037485            1.120           0.263
================================================================================

Results for equation GDP_change
================================================================================
                   coefficient       std. error           t-stat            prob
--------------------------------------------------------------------------------
const                 0.017922         0.002668            6.719           0.000
L1.GII_change        -0.021878         0.042400           -0.516           0.606
L1.GDP_change         0.132221         0.039758            3.326           0.001
L2.GII_change         0.124973         0.042789            2.921           0.003
L2.GDP_change         0.025632         0.039835            0.643           0.520
L3.GII_change        -0.063330         0.042614           -1.486           0.137
L3.GDP_change         0.146934         0.039565            3.714           0.000
================================================================================

Correlation matrix of residuals
              GII_change  GDP_change
GII_change      1.000000   -0.033903
GDP_change     -0.033903    1.000000

VAR Model: Forecasting

We applied the model learned above to the test data. Based on the plot, the forecast values seem to follow the actual data well, indicating a good model fit caputuring the underlying trends.

# Number of steps to forecast (length of the test set)
n_steps = len(test_data)

# Get the last values from the training set for forecasting
forecast_input = train_data[
    ['GII_change', 'GDP_change']].values[-time_model_fitted.k_ar:]

# Forecasting
forecast = time_model_fitted.forecast(y=forecast_input, steps=n_steps)

# Create a DataFrame for the forecasted values
forecast_df = pd.DataFrame(forecast, index=test_data.index, 
columns=['GII_forecast', 'GDP_forecast'])

# Ensure the index of the forecast_df matches the test_data index
forecast_df.index = test_data.index
plt.figure(figsize=(8, 6))
plt.plot(train_data['GII_change'], label='Training GII', color='blue')
plt.plot(test_data['GII_change'], label='Actual GII', color='orange')
plt.plot(forecast_df['GII_forecast'], label='Forecasted GII', color='green')
plt.title('GII Change Forecast vs Actual')
plt.legend()
plt.show()

plt.figure(figsize=(8, 6))
plt.plot(train_data['GDP_change'], label='Training GDP', color='blue')
plt.plot(test_data['GDP_change'], label='Actual GDP', color='orange')
plt.plot(forecast_df['GDP_forecast'], label='Forecasted GDP', color='green')
plt.title('GDP Change Forecast vs Actual')
plt.legend()
plt.show()

VAR Model: Model Performance

Low values of both MAE and RMSE indicate good model performance with small average errors in predictions.

mae_gii = mean_absolute_error(test_data['GII_change'], forecast_df['GII_forecast'])
mae_gdp = mean_absolute_error(test_data['GDP_change'], forecast_df['GDP_forecast'])

print(f'Mean Absolute Error for GII: {mae_gii}')
print(f'Mean Absolute Error for GDP: {mae_gdp}')
Mean Absolute Error for GII: 0.021634274218546037
Mean Absolute Error for GDP: 0.027874530660148864
rmse_gii = np.sqrt(mean_squared_error(test_data['GII_change'], 
forecast_df['GII_forecast']))
rmse_gdp = np.sqrt(mean_squared_error(test_data['GDP_change'], 
forecast_df['GDP_forecast']))

print(f'RMSE for GII: {rmse_gii}')
print(f'RMSE for GDP: {rmse_gdp}')
RMSE for GII: 0.0400826273209931
RMSE for GDP: 0.03867925511599023

VAR Model: Granger causality test

Granger causality test evaluates whether one time series can predict another.

Based on the output, the lowest p-value is when lag = 2. However, because p-value > 0.05, we fail to reject the null hypothesis and conclude the GDP_change does not Granger-cause the GII_change.

# Perform the Granger causality test
max_lag = 3
test_result = grangercausalitytests(train_data[['GII_change', 'GDP_change']], max_lag,
 verbose=True)

Granger Causality
number of lags (no zero) 1
ssr based F test:         F=0.1149  , p=0.7348  , df_denom=623, df_num=1
ssr based chi2 test:   chi2=0.1154  , p=0.7340  , df=1
likelihood ratio test: chi2=0.1154  , p=0.7341  , df=1
parameter F test:         F=0.1149  , p=0.7348  , df_denom=623, df_num=1

Granger Causality
number of lags (no zero) 2
ssr based F test:         F=0.4787  , p=0.6198  , df_denom=620, df_num=2
ssr based chi2 test:   chi2=0.9652  , p=0.6172  , df=2
likelihood ratio test: chi2=0.9644  , p=0.6174  , df=2
parameter F test:         F=0.4787  , p=0.6198  , df_denom=620, df_num=2

Granger Causality
number of lags (no zero) 3
ssr based F test:         F=0.6797  , p=0.5647  , df_denom=617, df_num=3
ssr based chi2 test:   chi2=2.0623  , p=0.5596  , df=3
likelihood ratio test: chi2=2.0589  , p=0.5603  , df=3
parameter F test:         F=0.6797  , p=0.5647  , df_denom=617, df_num=3

Conclusion

In wrapping up our analysis, we found no evidence to support a significant relationship between the Change in Real GDP per capita and the Change in the Gender Inequality Index (GII). This suggests that economic growth may not have a direct impact on gender equality. However, our findings open the door to questions for future research.

Future Directions

First, we must consider what other factors might influence the relationship between GDP and GII change. The GII is a composite index, shaped by a myriad of social factors, including cultural norms, legal frameworks, and environmental shifts. Future studies could benefit from incorporating additional predictors into the analysis and exploring the interaction between economic growth and gender equality within specific country contexts.

Second, there’s potential to enhance the predictive power of our Vector Autoregression (VAR) time series model. While we established that GDP change does not cause GII change, our model performed well in forecasting trends for both variables independently. In practice, policymakers may want to forecast GII trends independently of GDP if they are implementing gender-focused policies. Future research could investigate time series modeling to further unravel the dynamics of GII and GDP changes.

So, as we wrap up this chapter, let’s keep our curiosity alive and our questions flowing. After all, every end is just a new beginning in the quest for knowledge!

About the Author

Jenny Xu

Hi there! My name is Jenny, and I’m a third-year student at University of California, Davis, double majoring in Statistics and Psychology. I’ve always been interested in becoming a data analyst working in tech, internet, or research industries. Interning at Promptly Technologies helped me learn a ton. A quick fun fact for me is that my MBTI is ISFJ (Defender)!