import os
import pickle
from tempfile import NamedTemporaryFile
import pandas as pd
import imfp
import webbrowser
# Function to display a DataFrame in a web browser
def view_dataframe_in_browser(df):
    html = df.to_html()
    with NamedTemporaryFile(delete=False, mode="w", suffix=".html") as f:
        url = "file://" + f.name
        f.write(html)
    webbrowser.open(url)
# Function to load databases from CSV or fetch from API
def load_or_fetch_databases():
    csv_path = os.path.join("data", "databases.csv")

    # Try to load from CSV
    if os.path.exists(csv_path):
        try:
            return pd.read_csv(csv_path)
        except Exception as e:
            print(f"Error loading CSV: {e}")

    # If CSV doesn't exist or couldn't be loaded, fetch from API
    print("Fetching databases from IMF API...")
    databases = imfp.imf_databases()

    # Save to CSV for future use
    databases.to_csv(csv_path, index=False)
    print(f"Databases saved to {csv_path}")
    return databases
def load_or_fetch_parameters(database_name):
    pickle_path = os.path.join("data", f"{database_name}.pickle")

    # Try to load from pickle file
    if os.path.exists(pickle_path):
        try:
            with open(pickle_path, "rb") as f:
                return pickle.load(f)
        except Exception as e:
            print(f"Error loading pickle file: {e}")

    # If pickle doesn't exist or couldn't be loaded, fetch from API
    print(f"Fetching parameters for {database_name} from IMF API...")
    parameters = imfp.imf_parameters(database_name)

    # Save to pickle file for future use
    os.makedirs("data", exist_ok=True)  # Ensure the data directory exists
    with open(pickle_path, "wb") as f:
        pickle.dump(parameters, f)
    print(f"Parameters saved to {pickle_path}")
    return parameters
def load_or_fetch_dataset(database_id, indicator):
    file_name = f"{database_id}.{indicator}.csv"
    csv_path = os.path.join("data", file_name)

    # Try to load from CSV file
    if os.path.exists(csv_path):
        try:
            return pd.read_csv(csv_path)
        except Exception as e:
            print(f"Error loading CSV file: {e}")

    # If CSV doesn't exist or couldn't be loaded, fetch from API
    print(f"Fetching dataset for {database_id}.{indicator} from IMF API...")
    dataset = imfp.imf_dataset(database_id=database_id, indicator=[indicator])

    # Save to CSV file for future use
    os.makedirs("data", exist_ok=True)  # Ensure the data directory exists
    dataset.to_csv(csv_path, index=False)
    print(f"Dataset saved to {csv_path}")
    return dataset
Economic Growth and Gender Equality: An Analysis Using IMF Data
This data analysis project aims to explore the relationship between economic growth and gender equality using imfp, which allows us to download data from the IMF (International Monetary Fund). imfp can be integrated with other Python tools to streamline the computational process. To demonstrate its functionality, the project experimented with a variety of visualization and analysis methods.
Executive Summary
In this project, we explored the following:
- Data Fetching
  - Make API calls to fetch 4 datasets: GII (Gender Inequality Index), Nominal GDP, GDP Deflator Index, and Population series
- Feature Engineering
  - Cleaning: Convert GDP Deflator Index to a yearly basis and variables to numeric
  - Dependent Variable: Percent Change of Gender Inequality Index
  - Independent Variable: Percent Change of Real GDP per Capita
  - Transform variables to display magnitude of change
  - Merge the datasets
- Data Visualization
  - Scatterplot
  - Time Series Line Plots
  - Barplot
  - Boxplot
  - Heatmap
- Statistical Analysis
  - Descriptive Statistics
  - Regression Analysis
  - Time Series Analysis
Utility Functions
The integration of other Python tools not only streamlined our computational processes but also ensured consistency across the project.
A custom module was written to simplify making API calls and fetching information with the imfp library. load_or_fetch_databases, load_or_fetch_parameters, and load_or_fetch_dataset load the databases, parameters, and datasets from a local file when one exists and otherwise retrieve them from the IMF API and save them locally. view_dataframe_in_browser displays a DataFrame in a web browser.
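The load-or-fetch pattern shared by these helpers can be sketched generically. This is a minimal illustration rather than the project's actual module: `load_or_fetch` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a slow API call.

```python
import os
import pickle
import tempfile


def load_or_fetch(path, fetch):
    """Return the object cached at `path`, or call `fetch()` and cache its result."""
    if os.path.exists(path):
        try:
            with open(path, "rb") as f:
                return pickle.load(f)
        except Exception as e:
            print(f"Error loading cache: {e}")
    result = fetch()
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result


# Hypothetical stand-in for a slow API call; records how often it runs
calls = []


def fake_fetch():
    calls.append(1)
    return {"freq": ["A", "Q", "M"]}


cache_path = os.path.join(tempfile.mkdtemp(), "data", "example.pickle")
first = load_or_fetch(cache_path, fake_fetch)   # fetches, then writes the cache
second = load_or_fetch(cache_path, fake_fetch)  # served from the cache file
print(first == second, len(calls))
```

The second call never reaches `fake_fetch`, which is the point of the pattern: repeated runs of the analysis do not repeat the API requests.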
Dependencies
Here is a brief introduction to the packages used:
- pandas: view and manipulate data frames
- matplotlib.pyplot: make plots
- seaborn: make plots
- numpy: computation
- LinearRegression: implement linear regression
- tabulate: format data into tables
statsmodels.api, adfuller, ARIMA, VAR, plot_acf, plot_pacf, mean_absolute_error, mean_squared_error, and grangercausalitytests are specifically used for time series analysis.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.linear_model import LinearRegression
from tabulate import tabulate
import statsmodels.api as sm
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.vector_ar.var_model import VAR
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from statsmodels.tsa.stattools import grangercausalitytests
Data Fetching
In this section, we extracted four datasets through API calls: Gender Inequality Index (GII), GDP Deflator, Nominal GDP, and Population.
from pathlib import Path
Path("data").mkdir(exist_ok=True)

# Load or fetch databases
databases = load_or_fetch_databases()

# Filter out databases that contain a year in the description
databases[~databases['description'].str.contains(r"[\d]{4}", regex=True)]
# view_dataframe_in_browser(databases)
| | database_id | description |
---|---|---|
25 | HPDD | Historical Public Debt (HPDD) |
34 | RAFIT2AGG | Revenue Administration Fiscal Information Tool... |
47 | GENDER_EQUALITY | Gender Equality |
63 | PGCS | Private and Public Capital Stock Dataset |
69 | GENDER_BUDGETING | Gender Budgeting |
125 | CPI | Consumer Price Index (CPI) |
153 | IRFCL | International Reserves and Foreign Currency Li... |
191 | IFS_DISCONTINUED | International Financial Statistics (IFS), Disc... |
192 | EQ | Export Quality |
193 | ED | Export Diversification |
226 | FISCALDECENTRALIZATION | Fiscal Decentralization |
291 | FDI | Financial Development Index |
292 | PSBSFAD | Public Sector Balance Sheet (PSBS)(FAD) |
293 | UNSDG_IMF_INPUTS | Sustainable Development Goals, IMF Inputs |
294 | CPIS | Coordinated Portfolio Investment Survey (CPIS) |
295 | PCTOT | Commodity Terms of Trade |
296 | FM | Fiscal Monitor (FM) |
297 | AFRREO | Sub-Saharan Africa Regional Economic Outlook (... |
298 | WHDREO | Western Hemisphere Regional Economic Outlook (... |
299 | MCDREO | Middle East and Central Asia Regional Economic... |
300 | APDREO | Asia and Pacific Regional Economic Outlook (AP... |
301 | BOPAGG | Balance of Payments (BOP), World and Regional ... |
302 | PCPS | Primary Commodity Price System (PCPS) |
303 | CDIS | Coordinated Direct Investment Survey (CDIS) |
304 | BOP | Balance of Payments (BOP) |
305 | COFER | Currency Composition of Official Foreign Excha... |
306 | DOT | Direction of Trade Statistics (DOTS) |
307 | FAS | Financial Access Survey (FAS) |
308 | BOPSDMXUSD | Balance of Payments (BOP), Global SDMX (US Dol... |
309 | NAMAIN_IDC_N | System of National Accounts (SNA), NA_MAIN |
310 | GFSR | Government Finance Statistics (GFS), Revenue |
311 | GFSSSUC | Government Finance Statistics (GFS), Statement... |
312 | GFSCOFOG | Government Finance Statistics (GFS), Expenditu... |
313 | GFSFALCS | Government Finance Statistics (GFS), Financial... |
314 | GFSIBS | Government Finance Statistics (GFS), Integrate... |
315 | GFSMAB | Government Finance Statistics (GFS), Main Aggr... |
316 | GFSE | Government Finance Statistics (GFS), Expense |
317 | IFS | International Financial Statistics (IFS) |
318 | MFS | Monetary and Financial Statistics (MFS) |
319 | RAFIT3P | RA-FIT Round3 Completion and Participation Rates |
320 | FSI | Financial Soundness Indicators (FSIs) |
321 | FSIRE | Financial Soundness Indicators: Reporting enti... |
Two databases were used: Gender Equality and International Financial Statistics (IFS).
databases[databases['database_id'].isin(['GENDER_EQUALITY','IFS'])]
| | database_id | description |
---|---|---|
47 | GENDER_EQUALITY | Gender Equality |
317 | IFS | International Financial Statistics (IFS) |
Parameters are the dictionary key names used to make requests from the databases. “freq” stands for Frequency, such as Annual, Monthly, or Quarterly. “ref_area” stands for Geographical Area, such as US (United States), JP (Japan), and GB (United Kingdom). “indicator” refers to the code representing a specific dataset in the database. For example, if we display all the indicators for the IFS database, the GDP deflator dataset has an input code of “NGDP_D_SA_IX” with a full description of Gross Domestic Product, Deflator, Seasonally Adjusted, Index.
datasets = ["GENDER_EQUALITY", "IFS"]
params = {}

# Fetch valid parameters for two datasets
for dataset in datasets:
    params[dataset] = load_or_fetch_parameters(dataset)

    valid_keys = list(params[dataset].keys())
    print(f"Parameters for {dataset}: ", valid_keys)
Parameters for GENDER_EQUALITY: ['freq', 'ref_area', 'indicator']
Parameters for IFS: ['freq', 'ref_area', 'indicator']
We paired each database with the specific dataset indicator to read and store the CSV file.
datasets = {}
dsets = [("GENDER_EQUALITY", "GE_GII"),
         ("IFS", "NGDP_D_SA_IX"),
         ("IFS", "NGDP_XDC"),
         ("IFS", "LP_PE_NUM")]

for dset in dsets:
    datasets[dset[0] + "." + dset[1]] = load_or_fetch_dataset(dset[0], dset[1])

# "Gender Inequality Index"
GII = "GENDER_EQUALITY.GE_GII"

# "Gross Domestic Product, Deflator, Seasonally Adjusted, Index"
GDP_deflator = "IFS.NGDP_D_SA_IX"

# "Gross Domestic Product, Nominal, Domestic Currency"
GDP_nominal = "IFS.NGDP_XDC"

# "Population, Persons, Number of"
GDP_population = "IFS.LP_PE_NUM"

# Assign the datasets to new variables so we don't change the originals
GII_data = datasets[GII]
GDP_deflator_data = datasets[GDP_deflator]
GDP_nominal_data = datasets[GDP_nominal]
GDP_population_data = datasets[GDP_population]
Feature Engineering
Data Cleaning
Since the GDP deflator was reported on a quarterly basis, we converted it to a yearly basis.
# Keep only rows with a partial string match for "Q4" in the time_period column
GDP_deflator_data = GDP_deflator_data[GDP_deflator_data
    ['time_period'].str.contains("Q4")]

# Split the time_period into year and quarter and keep the year only
GDP_deflator_data.loc[:, 'time_period'] = GDP_deflator_data['time_period'].str[0:4]
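The quarterly-to-yearly step can be tried on a small example. The sketch below uses made-up values and assumes that quarterly periods are labeled in the form `1990-Q4`, which is consistent with the `"Q4"` match and the `str[0:4]` slice used above.

```python
import pandas as pd

# Toy quarterly data; the values are made up for illustration
quarterly = pd.DataFrame({
    "ref_area": ["FI"] * 8,
    "time_period": ["1990-Q1", "1990-Q2", "1990-Q3", "1990-Q4",
                    "1991-Q1", "1991-Q2", "1991-Q3", "1991-Q4"],
    "obs_value": [71.0, 72.0, 72.5, 73.2, 73.4, 73.6, 73.8, 74.0],
})

# Keep the fourth-quarter observation as the yearly value
yearly = quarterly[quarterly["time_period"].str.contains("Q4")].copy()

# Keep only the year part of the period label
yearly["time_period"] = yearly["time_period"].str[0:4]
print(yearly)
```

Taking the Q4 value is one possible convention; averaging the four quarters would be another, and the two generally give slightly different yearly series.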
We made all the variables numeric.
datasets = [GII_data, GDP_deflator_data, GDP_nominal_data, GDP_population_data]

for i, dataset in enumerate(datasets):
    # Use .loc to modify the columns
    datasets[i].loc[:, 'obs_value'] = pd.to_numeric(datasets[i]['obs_value'],
                                                    errors='coerce')
    datasets[i].loc[:, 'time_period'] = pd.to_numeric(datasets[i]['time_period'],
                                                      errors='coerce')
    datasets[i].loc[:, 'unit_mult'] = pd.to_numeric(datasets[i]['unit_mult'],
                                                    errors='coerce')
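As a small illustration of what `errors='coerce'` does, the sketch below converts a made-up column of strings: values that cannot be parsed as numbers become `NaN` instead of raising an error.

```python
import pandas as pd

# Made-up raw values, as they might appear after reading text data
raw = pd.Series(["0.828", "1990", "n/a", ""])

# Unparseable entries are turned into NaN rather than raising an error
clean = pd.to_numeric(raw, errors="coerce")
print(clean)
```

Because coercion is silent, counting the resulting missing values (for example with `clean.isna().sum()`) is a useful check after the conversion.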
GII Percent Change: Dependent Variable
We kept percentages as decimals to make them easy to work with in calculations. Different countries have different baseline levels of economic growth and gender equality, so we calculated the percent change to make them comparable.
The Gender Inequality Index (GII) is a composite measure of gender inequality across three dimensions: reproductive health, empowerment, and the labor market. GII ranges from 0 to 1. A value of 0 indicates gender equality, while a value of 1 indicates maximal gender inequality, that is, the worst possible outcome for one gender in all three dimensions.
# Calculate percent change for each ref_area
# First, reset the index so that row labels are unique
GII_data = GII_data.reset_index(drop=True)
# Sort within each ref_area by time_period; the result keeps the original row
# labels, so the assignment aligns each change with its own row
GII_data['pct_change'] = GII_data.sort_values(
    ['ref_area', 'time_period']).groupby('ref_area')['obs_value'].pct_change()
# Display the first few rows of the updated dataset
GII_data.head()
| | freq | ref_area | indicator | unit_mult | time_format | time_period | obs_value | pct_change |
---|---|---|---|---|---|---|---|---|
0 | A | AF | GE_GII | 0 | P1Y | 1990 | 0.828244 | NaN |
1 | A | AF | GE_GII | 0 | P1Y | 1991 | 0.817706 | -0.015156 |
2 | A | AF | GE_GII | 0 | P1Y | 1992 | 0.809806 | -0.016783 |
3 | A | AF | GE_GII | 0 | P1Y | 1993 | 0.803078 | -0.012651 |
4 | A | AF | GE_GII | 0 | P1Y | 1994 | 0.797028 | -0.013718 |
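The per-country percent change can be checked on a toy frame. The sketch below uses made-up values for two hypothetical countries and relies on the fact that pandas aligns an assigned Series by row label, so the rows may be stored in any order.

```python
import pandas as pd

# Made-up index values for two hypothetical countries, deliberately unsorted
toy = pd.DataFrame({
    "ref_area": ["BB", "AA", "AA", "BB", "AA"],
    "time_period": [2001, 2000, 2001, 2000, 2002],
    "obs_value": [0.45, 0.80, 0.76, 0.50, 0.722],
})

# Sort within each country by year, then take the change from the prior year.
# The result keeps the original row labels, so assignment lines up row by row.
toy["pct_change"] = (
    toy.sort_values(["ref_area", "time_period"])
       .groupby("ref_area")["obs_value"]
       .pct_change()
)
print(toy.sort_values(["ref_area", "time_period"]))
```

The first year of each country has no prior value, so its percent change is `NaN`; these rows are the ones dropped later when missing values are removed.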
We subset the data frame to keep only the columns we want:
# Create a new dataframe with only the required columns
GII_data = GII_data[['ref_area', 'time_period', 'obs_value', 'pct_change']].copy()

GII_data = GII_data.rename(columns = {
    'ref_area': 'Country',
    'time_period': 'Time',
    'obs_value': 'GII',
    'pct_change': 'GII_change'
})
# Display the first few rows of the new dataset
GII_data.head()
| | Country | Time | GII | GII_change |
---|---|---|---|---|
0 | AF | 1990 | 0.828244 | NaN |
1 | AF | 1991 | 0.817706 | -0.015156 |
2 | AF | 1992 | 0.809806 | -0.016783 |
3 | AF | 1993 | 0.803078 | -0.012651 |
4 | AF | 1994 | 0.797028 | -0.013718 |
GDP Percent Change: Independent Variable
Real GDP per capita is a measure of a country’s economic welfare or standard of living. It is a useful tool for comparing a country’s economic development with that of other economies. Because of a dataset access issue, we calculated Real GDP per capita with the following formulas, using GDP Deflator, Nominal GDP, and Population data:
\(\text{Real GDP} = \frac{\text{Nominal GDP}}{\text{GDP Deflator Index}}\times 100\)
\(\text{Real GDP per capita} = \frac{\text{Real GDP}}{\text{Population}}\)
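As a worked example of the two formulas, the sketch below plugs in the Finland (FI) 1990 values that appear in the merged table later in this analysis. The variable names are chosen here for illustration only.

```python
# Finland (FI), 1990, values as shown in the merged table of this analysis
nominal_gdp = 9.096400e10   # nominal GDP in domestic currency
deflator = 73.200623        # GDP deflator index (base year = 100)
population = 4_996_220      # number of persons

# Real GDP = Nominal GDP / GDP Deflator Index * 100
real_gdp = nominal_gdp / deflator * 100

# Real GDP per capita = Real GDP / Population
real_gdp_per_capita = real_gdp / population

print(f"Real GDP: {real_gdp:.4e}")
print(f"Real GDP per capita: {real_gdp_per_capita:.0f}")
```

Because the 1990 deflator is below 100, real GDP comes out larger than nominal GDP for that year: prices in 1990 were lower than in the base year.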
The GDP Deflator is a measure of price inflation and deflation relative to a specific base year. The GDP deflator of the base year is equal to 100. A value of 200 indicates price inflation: current-year prices are twice their base-year level. A value of 50 indicates price deflation: current-year prices are half their base-year level. For the GDP-related datasets, we kept only the columns we needed, to make merging the tables easier.
# GDP Deflator Dataset
# Create a new dataframe with only the required columns
GDP_deflator_data = GDP_deflator_data[
    ['ref_area', 'time_period', 'unit_mult', 'obs_value']].copy()

# Display the first few rows of the new dataset
GDP_deflator_data.head()
| | ref_area | time_period | unit_mult | obs_value |
---|---|---|---|---|
3 | FI | 1990 | 0 | 73.200623 |
7 | FI | 1991 | 0 | 73.984068 |
11 | FI | 1992 | 0 | 74.654309 |
15 | FI | 1993 | 0 | 75.619254 |
19 | FI | 1994 | 0 | 77.937293 |
Nominal GDP is the total value of all goods and services produced in a given time period, measured at current prices. When prices have risen since the base year, it is higher than Real GDP. It does not take into account differences in the cost of living between countries or price changes due to inflation or deflation.
# GDP Nominal Data
# Create a new dataframe with only the required columns
GDP_nominal_data = GDP_nominal_data[
    ['ref_area', 'time_period', 'unit_mult','obs_value']].copy()

# Display the first few rows of the new dataset
GDP_nominal_data.head()
| | ref_area | time_period | unit_mult | obs_value |
---|---|---|---|---|
0 | NE | 2005 | 6 | 2418864.0 |
1 | NE | 2006 | 6 | 2596972.0 |
2 | NE | 2007 | 6 | 2762961.0 |
3 | NE | 2008 | 6 | 3247835.0 |
4 | NE | 2009 | 6 | 3403683.0 |
Population is the total number of people living in a country at a given time. This is where the “per capita” comes from. Real GDP is the total value of all goods and services produced in a country adjusted for inflation. Real GDP per capita is the total economic output per person in a country.
# GDP Population Data
# Create a new dataframe with only the required columns
GDP_population_data = GDP_population_data[
    ['ref_area', 'time_period', 'unit_mult','obs_value']].copy()

# Display the first few rows of the new dataset
GDP_population_data.head()
| | ref_area | time_period | unit_mult | obs_value |
---|---|---|---|---|
0 | GA | 1950 | 3 | 473.296 |
1 | GA | 1951 | 3 | 476.381 |
2 | GA | 1952 | 3 | 478.655 |
3 | GA | 1953 | 3 | 480.536 |
4 | GA | 1954 | 3 | 482.332 |
# Combine all the datasets above for further calculation
merged_df = pd.merge(pd.merge(GDP_deflator_data, GDP_nominal_data,
                              on=['time_period', 'ref_area'],
                              suffixes=('_index', '_nominal'),
                              how='inner'),
                     GDP_population_data,
                     on=['time_period', 'ref_area'],
                     how='inner')
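The column names produced by this nested merge can be seen on a toy example. The frames below are made up for illustration (they only loosely resemble the real data) and show why the population columns keep their plain names while the deflator and nominal columns receive suffixes.

```python
import pandas as pd

keys = ["time_period", "ref_area"]

# Made-up frames that share the same non-key column names
deflator = pd.DataFrame({"ref_area": ["FI", "FI"], "time_period": [1990, 1991],
                         "unit_mult": [0, 0], "obs_value": [73.2, 74.0]})
nominal = pd.DataFrame({"ref_area": ["FI", "FI"], "time_period": [1990, 1991],
                        "unit_mult": [6, 6], "obs_value": [90964.0, 86913.0]})
population = pd.DataFrame({"ref_area": ["FI"], "time_period": [1990],
                           "unit_mult": [3], "obs_value": [4996.22]})

# First merge: overlapping columns receive the '_index' and '_nominal' suffixes
step1 = pd.merge(deflator, nominal, on=keys,
                 suffixes=("_index", "_nominal"), how="inner")

# Second merge: nothing overlaps any more, so the population columns keep
# their plain names 'unit_mult' and 'obs_value'
merged = pd.merge(step1, population, on=keys, how="inner")
print(merged.columns.tolist())
print(len(merged))
```

With `how='inner'`, only country-years present in all three frames survive, which is why the toy result has a single row.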
We adjusted the GDP data by the unit multiplier. The unit multiplier is the power of ten by which the value column must be scaled, that is, the number of zeroes we need to add to the value. For example, in 1950, the observed population value for country GA (Gabon) was 473.296. With a unit multiplier of 3, the adjusted population is 473296.
merged_df['adjusted_index'] = merged_df['obs_value_index'] * (10 ** (merged_df
    ['unit_mult_index']))
merged_df['adjusted_nominal'] = merged_df['obs_value_nominal'] * (10 ** (merged_df
    ['unit_mult_nominal']))
merged_df['adjusted_population'] = merged_df['obs_value'] * (10 ** (merged_df
    ['unit_mult']))
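The unit-multiplier adjustment is a single multiplication by a power of ten. The sketch below reproduces it for two rows taken from the tables above (the GA population in 1950 and the NE nominal GDP in 2005); the variable names are illustrative.

```python
# Population for GA in 1950: reported value 473.296 with unit multiplier 3
population_reported = 473.296
population_mult = 3
population_adjusted = population_reported * 10 ** population_mult

# Nominal GDP for NE in 2005: reported value 2418864.0 with unit multiplier 6
nominal_reported = 2418864.0
nominal_mult = 6
nominal_adjusted = nominal_reported * 10 ** nominal_mult

print(round(population_adjusted))
print(f"{nominal_adjusted:.6e}")
```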
# Merged dataset
# Create a new dataframe with only the required columns
merged_df = merged_df[['ref_area', 'time_period',
    'adjusted_nominal', 'adjusted_index', 'adjusted_population']].copy()
# Display the first few rows of the dataset
merged_df.head()
| | ref_area | time_period | adjusted_nominal | adjusted_index | adjusted_population |
---|---|---|---|---|---|
0 | FI | 1990 | 9.096400e+10 | 73.200623 | 4996220.0 |
1 | FI | 1991 | 8.691300e+10 | 73.984068 | 5019134.0 |
2 | FI | 1992 | 8.478600e+10 | 74.654309 | 5044928.0 |
3 | FI | 1993 | 8.561000e+10 | 75.619254 | 5071782.0 |
4 | FI | 1994 | 9.064600e+10 | 77.937293 | 5097090.0 |
We wanted to compute the Real GDP per capita.
# Step 1: Real GDP = (Nominal GDP / GDP Deflator Index)*100
merged_df['Real_GDP_domestic'] = (merged_df['adjusted_nominal'] / merged_df[
    'adjusted_index'])*100

# Step 2: Real GDP per Capita = Real GDP / Population
merged_df['Real_GDP_per_capita'] = merged_df['Real_GDP_domestic'] / merged_df[
    'adjusted_population']

# Rename columns
merged_df = merged_df.rename(columns= {
    "ref_area": "Country",
    "time_period": "Time",
    "adjusted_nominal": "Nominal",
    "adjusted_index": "Deflator",
    "adjusted_population": "Population",
    "Real_GDP_domestic": "Real GDP",
    "Real_GDP_per_capita": "Real GDP per Capita"
    }
)

# Check the results
merged_df.head()
| | Country | Time | Nominal | Deflator | Population | Real GDP | Real GDP per Capita |
---|---|---|---|---|---|---|---|
0 | FI | 1990 | 9.096400e+10 | 73.200623 | 4996220.0 | 1.242667e+11 | 24872.143699 |
1 | FI | 1991 | 8.691300e+10 | 73.984068 | 5019134.0 | 1.174753e+11 | 23405.490395 |
2 | FI | 1992 | 8.478600e+10 | 74.654309 | 5044928.0 | 1.135715e+11 | 22512.011198 |
3 | FI | 1993 | 8.561000e+10 | 75.619254 | 5071782.0 | 1.132119e+11 | 22321.919259 |
4 | FI | 1994 | 9.064600e+10 | 77.937293 | 5097090.0 | 1.163063e+11 | 22818.181344 |
We calculated the percentage change in Real GDP per capita and put it in a new column.
# Calculate percent change for each ref_area
merged_df['GDP_change'] = merged_df.sort_values(['Country', 'Time']).groupby(
    'Country')['Real GDP per Capita'].pct_change()

# Rename dataset
GDP_data = merged_df
# Display the first few rows of the dataset
GDP_data.head()
| | Country | Time | Nominal | Deflator | Population | Real GDP | Real GDP per Capita | GDP_change |
---|---|---|---|---|---|---|---|---|
0 | FI | 1990 | 9.096400e+10 | 73.200623 | 4996220.0 | 1.242667e+11 | 24872.143699 | NaN |
1 | FI | 1991 | 8.691300e+10 | 73.984068 | 5019134.0 | 1.174753e+11 | 23405.490395 | -0.058968 |
2 | FI | 1992 | 8.478600e+10 | 74.654309 | 5044928.0 | 1.135715e+11 | 22512.011198 | -0.038174 |
3 | FI | 1993 | 8.561000e+10 | 75.619254 | 5071782.0 | 1.132119e+11 | 22321.919259 | -0.008444 |
4 | FI | 1994 | 9.064600e+10 | 77.937293 | 5097090.0 | 1.163063e+11 | 22818.181344 | 0.022232 |
# GII and GDP
# Merge the datasets
combined_data = pd.merge(GII_data, GDP_data,
                         on=["Country", "Time"],
                         how="inner")

# Check the combined dataset
combined_data.head()
| | Country | Time | GII | GII_change | Nominal | Deflator | Population | Real GDP | Real GDP per Capita | GDP_change |
---|---|---|---|---|---|---|---|---|---|---|
0 | AL | 2009 | 0.246238 | -0.006176 | 1.143936e+12 | 95.997230 | 2973044.0 | 1.191635e+12 | 400813.060656 | NaN |
1 | AL | 2010 | 0.240877 | -0.009627 | 1.239645e+12 | 100.758353 | 2948029.0 | 1.230314e+12 | 417334.584116 | 0.041220 |
2 | AL | 2011 | 0.240131 | -0.009771 | 1.300624e+12 | 103.924160 | 2928601.0 | 1.251513e+12 | 427341.491205 | 0.023978 |
3 | AL | 2012 | 0.236440 | -0.009977 | 1.332811e+12 | 103.230605 | 2914091.0 | 1.291101e+12 | 443054.328283 | 0.036769 |
4 | AL | 2013 | 0.223407 | -0.009001 | 1.350053e+12 | 102.584604 | 2903788.0 | 1.316038e+12 | 453214.302235 | 0.022932 |
Data Visualization
Scatterplot
A scatterplot uses dots to represent the values of two numeric variables. The horizontal axis was the percent change in Real GDP per capita. The vertical axis was the percent change in the Gender Inequality Index (GII). Different colors represented different countries. We used a linear regression line to display the overall pattern.
Based on the scatterplot, there seemed to be at most a slight positive relationship between GDP change and GII change, as shown by the nearly flat regression line. Gender inequality was decreasing (gender equality was improving) a little faster in country-years with low GDP growth and a little slower in country-years with high GDP growth.
# Convert numeric columns to float
numeric_columns = [
    'GII', 'GII_change', 'Nominal', 'Deflator', 'Population',
    'Real GDP', 'Real GDP per Capita', 'GDP_change'
]
for col in numeric_columns:
    combined_data[col] = pd.to_numeric(combined_data[col], errors='coerce')

# Count NAs
print(f"Dropping {combined_data[numeric_columns].isna().sum()} rows with NAs")

# Drop NAs
combined_data = combined_data.dropna(subset=numeric_columns)

# Plot the data points
plt.figure(figsize=(8, 6))
for country in combined_data['Country'].unique():
    country_data = combined_data[combined_data['Country'] == country]
    plt.scatter(country_data['GDP_change'], country_data['GII_change'],
                marker='o',linestyle='-', label=country)
plt.title('Country-Year Analysis of GDP Change vs. GII Change')
plt.xlabel('Percent Change in Real GDP per Capita (Country-Year)')
plt.ylabel('Percent Change in GII (Country-Year)')
plt.grid(True)

# Prepare data for linear regression
X = combined_data['GDP_change'].values.reshape(-1, 1)
y = combined_data['GII_change'].values

# Perform linear regression
reg = LinearRegression().fit(X, y)
y_pred = reg.predict(X)

# Plot the regression line
plt.plot(combined_data['GDP_change'], y_pred, color='red', linewidth=2)

plt.show()
Dropping GII 0
GII_change 40
Nominal 0
Deflator 0
Population 0
Real GDP 0
Real GDP per Capita 0
GDP_change 38
dtype: int64 rows with NAs
Time Series Line Plot
We created separate line plots of GDP change and GII change over time for a few key countries, which might show the trends more clearly.
US: United States
JP: Japan
GB: United Kingdom
FR: France
MX: Mexico
Based on the line plots, GDP change and GII change followed different patterns. For example, in Mexico, when there was a big change in real GDP per capita in 1995, the change in GII remained fairly stable.
# Time Series Line plot for a few key countries
selected_countries = ['US', 'JP', 'GB', 'FR', 'MX']
combined_data_selected = combined_data[combined_data['Country'].isin(selected_countries)]

# Set up the Plot Structure
fig, ax = plt.subplots(2, 1, figsize=(8, 6), sharex=True)

# Plot change in real GDP per capita over time
sns.lineplot(data = combined_data_selected,
             x = "Time",
             y = "GDP_change",
             hue = "Country",
             ax = ax[0])
ax[0].set_title("Percent Change in Real GDP per Capita Over Time")
ax[0].set_ylabel("Percent Change in Real GDP per Capita")

# Plot change in GII over time
sns.lineplot(data = combined_data_selected,
             x = "Time",
             y = "GII_change",
             hue = "Country",
             ax = ax[1])
ax[1].set_title("Percent Change in GII over Time")
ax[1].set_xlabel("Time")
ax[1].set_ylabel("GII")

# Call tight_layout (with parentheses) so the subplot spacing is adjusted
plt.tight_layout()
plt.show()
Barplot
We used a barplot of the average GII change and average GDP change for each country to visualize where inequality was improving or worsening.
This plot supported our previous observation that GII change did not seem to be correlated with GDP change. We also saw that Slovenia (SI) appeared to have a large reduction in gender inequality.
# Barplot using average GII and GDP change
# Calculate average change for each country
combined_data_avg = combined_data.groupby('Country')[
    ['GII_change','GDP_change']].mean().reset_index()

# Create the barplot; DataFrame.plot opens its own figure, so the size is
# passed here rather than through a separate plt.figure call
combined_data_avg.plot(kind = 'bar', x = 'Country', figsize = (18,10))
plt.ylabel('Average Change')
plt.xlabel('Country')
plt.legend(['GII change', 'GDP change'])
plt.grid(axis = 'y')

# Show the plot
plt.show()
Boxplot
We used a boxplot to visualize the distribution of GDP change and GII change by country, providing information about spread, median, and potential outliers. To provide a more informative view, we ordered the countries in ascending order of their median percent change in GDP.
The boxplot displayed a slight upward trend in GDP change across the ordered countries, with no obvious pattern linking GDP change and GII change. Countries with a higher median GDP change also tended to have a larger spread in GDP change. The median GII change remained stable regardless of the magnitude of GDP change, implying a weak association or none between GDP change and GII change. We observed a potential outlier for Slovenia (SI), which may explain its large average reduction in gender inequality.
# Box plot for GII and GDP change
# Melt the dataframe to long format for combined boxplot
combined_data_melted = combined_data.melt(id_vars=['Country'],
                                          value_vars=['GII_change', 'GDP_change'],
                                          var_name='Change_Type',
                                          value_name='Value')

gdp_medians = combined_data.groupby('Country')['GDP_change'].median().sort_values()

combined_data_melted['Country'] = pd.Categorical(combined_data_melted['Country'],
                                                 categories=gdp_medians.index,
                                                 ordered= True)

# Prepare the plot structure
plt.figure(figsize=(8, 6))
sns.boxplot(data = combined_data_melted,
            x = "Country",
            y = 'Value',
            hue = 'Change_Type')
plt.title('Distribution of GII and GDP change by Country')
plt.xlabel('Country')
plt.ylabel('Change')
plt.legend(title = 'Change Type')

# Show the plot
plt.show()
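The ordering of countries by median GDP change relies on an ordered categorical column. The sketch below shows the idea on made-up values for three hypothetical countries.

```python
import pandas as pd

# Made-up GDP changes for three hypothetical countries
toy = pd.DataFrame({
    "Country": ["AA", "AA", "BB", "BB", "CC", "CC"],
    "GDP_change": [0.05, 0.07, 0.01, 0.02, 0.03, 0.04],
})

# Median per country, sorted from smallest to largest
medians = toy.groupby("Country")["GDP_change"].median().sort_values()

# An ordered categorical makes plotting libraries follow this order
toy["Country"] = pd.Categorical(toy["Country"],
                                categories=medians.index,
                                ordered=True)
order = list(toy["Country"].cat.categories)
print(order)
```

Plotting functions such as `sns.boxplot` lay out a categorical axis in category order, so the boxes appear from the lowest to the highest median.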
Correlation Matrix
We created a heatmap to show the relationship between GII and GDP change.
A positive correlation coefficient indicates a positive relationship: the larger the GDP change, the larger the GII change. A negative correlation coefficient indicates a negative relationship: the larger the GDP change, the smaller the GII change. A correlation coefficient closer to 0 indicates there is weak or no relationship.
Based on the numeric values in the plot, there was a moderately strong positive correlation between GII change and GDP change for Estonia (EE) and Ireland (IE).
# Calculate the correlation
country_correlation = combined_data.groupby('Country')[
    ['GII_change', 'GDP_change']].corr().iloc[0::2, -1].reset_index(name='Correlation')

# Put the correlation value in a matrix format
correlation_matrix = country_correlation.pivot(index='Country',
                                               columns='level_1',
                                               values='Correlation')

# Check for NaN values in the correlation matrix
# Replace NaNs with 0 or another value as appropriate
correlation_matrix.fillna(0, inplace=True)

# Set up the plot structure
# Adjust height to give more space for y-axis labels
plt.figure(figsize=(8, 12))

# Plot the heatmap
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0,
            cbar_kws={"shrink": .8},
            linewidths=.5)

# Enhance axis labels and title
plt.title('Heatmap for GII and GDP Change', fontsize=20)
plt.xlabel('Variables', fontsize=16)
plt.ylabel('Country', fontsize=16)

# Improve readability of y-axis labels
plt.yticks(fontsize=12)  # Adjust the font size for y-axis labels

# Show the plot
plt.show()
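In the code above, `.corr()` returns a 2×2 matrix per country, and `.iloc[0::2, -1]` keeps the `GII_change` row and the `GDP_change` column of each one. An equivalent and arguably more explicit way to get one coefficient per country is sketched below on made-up data; it is an alternative formulation, not the code used in this analysis.

```python
import pandas as pd

# Made-up changes: AA moves in opposite directions, BB moves together
toy = pd.DataFrame({
    "Country": ["AA"] * 4 + ["BB"] * 4,
    "GII_change": [-0.01, -0.02, -0.03, -0.04, -0.01, -0.02, -0.03, -0.04],
    "GDP_change": [0.01, 0.02, 0.03, 0.04, 0.04, 0.03, 0.02, 0.01],
})

# One Pearson correlation coefficient per country
per_country = (
    toy.groupby("Country")[["GII_change", "GDP_change"]]
       .apply(lambda g: g["GII_change"].corr(g["GDP_change"]))
)
print(per_country)
```

With only a handful of yearly observations per country, individual coefficients can be large by chance, which is worth keeping in mind when reading the heatmap.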
Statistical Analysis
Descriptive Statistics
There were a total of 896 data points. The mean GII change was -0.021738, which indicated that the overall grand mean percent change in the Gender Inequality Index was -2.17%. The mean GDP change was 0.023793, showing that the overall grand mean percent change in real GDP per capita was 2.38%.
# Generate summary statistics
combined_data.describe()
| | GII | GII_change | Nominal | Deflator | Population | Real GDP | Real GDP per Capita | GDP_change |
---|---|---|---|---|---|---|---|---|
count | 896.000000 | 896.000000 | 8.960000e+02 | 896.000000 | 8.960000e+02 | 8.960000e+02 | 8.960000e+02 | 896.000000 |
mean | 0.238270 | -0.021738 | 7.841185e+13 | 85.362016 | 4.507063e+07 | 7.581775e+13 | 9.919523e+05 | 0.023793 |
std | 0.149157 | 0.041293 | 6.063567e+14 | 21.125956 | 1.255392e+08 | 5.570436e+14 | 3.854128e+06 | 0.041931 |
min | 0.011690 | -0.552535 | 2.187139e+09 | 3.606364 | 3.963240e+05 | 5.217326e+09 | 1.798422e+03 | -0.285847 |
25% | 0.131004 | -0.030203 | 1.101968e+11 | 73.831673 | 5.041634e+06 | 1.500708e+11 | 1.717820e+04 | 0.004223 |
50% | 0.184532 | -0.011172 | 7.808172e+11 | 88.700623 | 1.032114e+07 | 1.015802e+12 | 3.545761e+04 | 0.022361 |
75% | 0.332871 | -0.003554 | 2.497352e+12 | 100.127034 | 4.444512e+07 | 2.914933e+12 | 1.212319e+05 | 0.043195 |
max | 0.788954 | 0.210491 | 9.546134e+15 | 207.890742 | 1.280842e+09 | 7.920071e+15 | 3.145315e+07 | 0.241984 |
Regression Analysis
Simple linear regression, as a foundational approach, provided a basic understanding of the relationship between GDP change and GII change.
Based on the summary, we concluded the following:
- Because the p-value was 0.739, at a significance level (alpha) of 0.05 we failed to reject the null hypothesis: there was no statistically significant relationship between the percent change in real GDP per capita and the percent change in the Gender Inequality Index.
- R-squared = 0.000. Essentially none of the variance in GII change could be explained by GDP change.
- We were 95% confident that the interval from -0.054 to 0.076 captured the true slope for GDP change. Because 0 was included, we were uncertain about the effect of GDP change on GII change.
# Get column data type summaries of combined_data
combined_data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 896 entries, 1 to 973
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Country 896 non-null object
1 Time 896 non-null object
2 GII 896 non-null float64
3 GII_change 896 non-null float64
4 Nominal 896 non-null float64
5 Deflator 896 non-null float64
6 Population 896 non-null float64
7 Real GDP 896 non-null float64
8 Real GDP per Capita 896 non-null float64
9 GDP_change 896 non-null float64
dtypes: float64(8), object(2)
memory usage: 77.0+ KB
# Define independent and dependent variables
X = combined_data['GDP_change']
y = combined_data['GII_change']

# Add a constant to the independent variable to include an intercept
X = sm.add_constant(X)

# Fit a simple linear regression model and print out the summary
model = sm.OLS(y, X).fit()
model.summary()
Dep. Variable: | GII_change | R-squared: | 0.000 |
Model: | OLS | Adj. R-squared: | -0.001 |
Method: | Least Squares | F-statistic: | 0.1114 |
Date: | Wed, 22 Jan 2025 | Prob (F-statistic): | 0.739 |
Time: | 16:59:36 | Log-Likelihood: | 1584.8 |
No. Observations: | 896 | AIC: | -3166. |
Df Residuals: | 894 | BIC: | -3156. |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
const | -0.0220 | 0.002 | -13.862 | 0.000 | -0.025 | -0.019 |
GDP_change | 0.0110 | 0.033 | 0.334 | 0.739 | -0.054 | 0.076 |
Omnibus: | 872.466 | Durbin-Watson: | 1.570 |
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 62725.209 |
Skew: | -4.277 | Prob(JB): | 0.00 |
Kurtosis: | 43.087 | Cond. No. | 23.9 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
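The t-statistic and confidence interval in the summary follow directly from the coefficient and its standard error. The sketch below redoes that arithmetic from the rounded values in the table; using 1.96 as the critical value is an assumption (a normal approximation, whereas statsmodels uses the t distribution), and the difference is negligible with 894 residual degrees of freedom.

```python
coef = 0.0110     # GDP_change coefficient, as rounded in the summary
std_err = 0.033   # its standard error, as rounded in the summary
z = 1.96          # assumed critical value: normal approximation for 95%

# t-statistic: how many standard errors the estimate lies from zero
t_stat = coef / std_err

# 95% confidence interval for the slope
ci_low = coef - z * std_err
ci_high = coef + z * std_err

print(round(t_stat, 2), round(ci_low, 3), round(ci_high, 3))
```

The interval reproduces the bounds shown in the summary to three decimals, and the t-statistic matches the reported 0.334 up to the rounding of the inputs. Because the interval contains 0, the slope is not statistically distinguishable from zero.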
Time Series Analysis
Time series analysis allows us to explore how the relationship between GII and GDP change varies across time periods, accounting for lagged effects.
Here is a quick summary of the results:
* Both the GII and GDP change time series were stationary.
* Past GII change values significantly influenced current GII change values.
* The VAR model performed well when forecasting future values from historical data.
* Changes in GDP did not Granger-cause (that is, did not systematically precede) changes in GII.
ADF Test: Stationarity Assumption Check
We used the Augmented Dickey-Fuller (ADF) test to check whether each time series was stationary, an assumption of many time series models.
Stationarity implies a constant mean and variance over time, which makes a series more predictable and stable for forecasting.
Based on the ADF test output, both the GII and GDP change time series were stationary (both p-values are far below 0.05), so we proceeded to the time series modeling section.
# Augmented Dickey-Fuller (ADF) test for stationarity check
# Create melted datasets
combined_data_time = combined_data.melt(id_vars=['Time', 'Country'],
                                        value_vars=['GII_change', 'GDP_change'],
                                        var_name='Change_Type',
                                        value_name='Value')
GII = combined_data_time[(combined_data_time['Change_Type'] == 'GII_change')]

GDP = combined_data_time[(combined_data_time['Change_Type'] == 'GDP_change')]

# Stationarity check
def adf_test(series):
    result = adfuller(series.dropna())
    print(f'ADF Statistic: {result[0]}')
    print(f'p-value: {result[1]}')
    if result[1] < 0.05:
        print("Series is stationary")
    else:
        print("Series is not stationary")

# Output the result
adf_test(GII['Value'])
adf_test(GDP['Value'])
ADF Statistic: -13.258508444109287
p-value: 8.491362672402006e-25
Series is stationary
ADF Statistic: -13.536694269050457
p-value: 2.5594926855546896e-25
Series is stationary
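To make the mechanics of the test more concrete, here is a minimal sketch of the plain Dickey-Fuller regression that the ADF test builds on. It is a simplification, not a reimplementation of `adfuller`: it omits the lagged-difference (augmentation) terms and does not compute a p-value. The helper name `dickey_fuller_tstat`, the synthetic series, and the approximate 5% critical value of -2.86 (constant, no trend, large sample) are assumptions made for illustration.

```python
import numpy as np

def dickey_fuller_tstat(series):
    """t-statistic on gamma in: diff(y)_t = alpha + gamma * y_{t-1} + e_t."""
    y = np.asarray(series, dtype=float)
    dy = np.diff(y)                                   # y_t - y_{t-1}
    X = np.column_stack([np.ones(len(dy)), y[:-1]])   # constant and lagged level
    beta, *_ = np.linalg.lstsq(X, dy, rcond=None)
    resid = dy - X @ beta
    sigma2 = resid @ resid / (len(dy) - X.shape[1])   # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)             # OLS coefficient covariance
    return beta[1] / np.sqrt(cov[1, 1])

# Synthetic series (illustration only, unrelated to the GII/GDP data)
rng = np.random.default_rng(0)
white_noise = rng.normal(size=500)                # stationary by construction
random_walk = np.cumsum(rng.normal(size=500))     # has a unit root by construction

# Assumed approximate asymptotic 5% critical value (constant, no trend)
CRIT_5PCT = -2.86

for name, s in [("white noise", white_noise), ("random walk", random_walk)]:
    t = dickey_fuller_tstat(s)
    verdict = "reject unit root" if t < CRIT_5PCT else "cannot reject unit root"
    print(f"{name}: t = {t:.2f} -> {verdict}")
```

A t-statistic below the critical value rejects the unit-root null hypothesis, which corresponds to the "Series is stationary" branch of `adf_test` above.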
VAR Model: Examine Variables Separately
We fitted a VAR (Vector Autoregression) model to examine the relationship between GII and GDP change. VAR is particularly useful for multivariate time series data because it lets us examine the interdependence between variables.
Based on the summary, we can make several interpretations:
We used AIC as the criterion for lag selection; a lower value suggests a better fit. The selected model includes three lags.
Given that we wanted to predict GII change, we focused on the first set, “Results for equation GII_change.”
Past GII_change values significantly influenced current GII_change, as shown by the small p-values of lags 1 and 2.
None of the GDP_change lags was statistically significant in the GII_change equation; the smallest p-value (0.263, at lag 3) is still well above 0.05.
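As a rough illustration of what a VAR estimates, the sketch below simulates a two-variable VAR(1) with made-up coefficients and recovers them by ordinary least squares, one equation per variable. The coefficient values, variable names, and sample size are hypothetical and unrelated to the GII/GDP data; the fitted model in this section uses more lags and is estimated by statsmodels.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical "true" VAR(1): y_t = c + A @ y_{t-1} + noise
A_true = np.array([[0.5, 0.0],    # variable 0 depends only on its own past
                   [0.3, 0.2]])   # variable 1 also depends on lagged variable 0
c_true = np.array([0.1, -0.2])

n = 5000
y = np.zeros((n, 2))
for t in range(1, n):
    y[t] = c_true + A_true @ y[t - 1] + rng.normal(scale=0.1, size=2)

# Regress y_t on a constant and y_{t-1}; each column of B is one equation
X = np.column_stack([np.ones(n - 1), y[:-1]])
Y = y[1:]
B, *_ = np.linalg.lstsq(X, Y, rcond=None)

c_hat = B[0]        # estimated intercepts
A_hat = B[1:].T     # estimated lag matrix, rows = equations

print("Estimated intercepts:", np.round(c_hat, 3))
print("Estimated lag matrix:\n", np.round(A_hat, 3))
```

The off-diagonal entries of the lag matrix carry the cross-variable effects. In this toy setup the entry in row 1, column 0 is nonzero while the entry in row 0, column 1 is zero, so lagged variable 0 helps predict variable 1 but not the other way around. The cross-variable rows of the summary table (for example `L1.GDP_change` in the GII_change equation) are read the same way.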
# Split the dataset into training and testing sets
split_ratio = 0.7
split_index = int(len(combined_data) * split_ratio)

# Training set is used to fit the model
train_data = combined_data.iloc[:split_index]

# Testing set is used for validation
test_data = combined_data.iloc[split_index:]

print(f"Training data: {train_data.shape}")
print(f"Test data: {test_data.shape}")
Training data: (627, 10)
Test data: (269, 10)
# Fit a VAR model
time_model = VAR(train_data[['GII_change', 'GDP_change']])
time_model_fitted = time_model.fit(maxlags=15, ic="aic")

# Print out the model summary
time_model_fitted.summary()
Summary of Regression Results
==================================
Model: VAR
Method: OLS
Date: Wed, 22, Jan, 2025
Time: 16:59:36
--------------------------------------------------------------------
No. of Equations: 2.00000 BIC: -12.6388
Nobs: 624.000 HQIC: -12.6996
Log likelihood: 2217.52 FPE: 2.93645e-06
AIC: -12.7383 Det(Omega_mle): 2.87166e-06
--------------------------------------------------------------------
Results for equation GII_change
================================================================================
coefficient std. error t-stat prob
--------------------------------------------------------------------------------
const -0.014401 0.002527 -5.698 0.000
L1.GII_change 0.206183 0.040171 5.133 0.000
L1.GDP_change 0.008941 0.037668 0.237 0.812
L2.GII_change 0.147039 0.040539 3.627 0.000
L2.GDP_change -0.038093 0.037741 -1.009 0.313
L3.GII_change 0.071829 0.040373 1.779 0.075
L3.GDP_change 0.041997 0.037485 1.120 0.263
================================================================================
Results for equation GDP_change
================================================================================
coefficient std. error t-stat prob
--------------------------------------------------------------------------------
const 0.017922 0.002668 6.719 0.000
L1.GII_change -0.021878 0.042400 -0.516 0.606
L1.GDP_change 0.132221 0.039758 3.326 0.001
L2.GII_change 0.124973 0.042789 2.921 0.003
L2.GDP_change 0.025632 0.039835 0.643 0.520
L3.GII_change -0.063330 0.042614 -1.486 0.137
L3.GDP_change 0.146934 0.039565 3.714 0.000
================================================================================
Correlation matrix of residuals
GII_change GDP_change
GII_change 1.000000 -0.033903
GDP_change -0.033903 1.000000
VAR Model: Forecasting
We applied the model fitted above to the test data. Based on the plots, the forecast values seem to follow the actual data well, suggesting that the model captures the underlying trends.
# Number of steps to forecast (length of the test set)
n_steps = len(test_data)

# Get the last values from the training set for forecasting
forecast_input = train_data[
    ['GII_change', 'GDP_change']].values[-time_model_fitted.k_ar:]

# Forecasting
forecast = time_model_fitted.forecast(y=forecast_input, steps=n_steps)

# Create a DataFrame for the forecasted values
forecast_df = pd.DataFrame(forecast, index=test_data.index,
                           columns=['GII_forecast', 'GDP_forecast'])

# Ensure the index of the forecast_df matches the test_data index
forecast_df.index = test_data.index

plt.figure(figsize=(8, 6))
plt.plot(train_data['GII_change'], label='Training GII', color='blue')
plt.plot(test_data['GII_change'], label='Actual GII', color='orange')
plt.plot(forecast_df['GII_forecast'], label='Forecasted GII', color='green')
plt.title('GII Change Forecast vs Actual')
plt.legend()
plt.show()

plt.figure(figsize=(8, 6))
plt.plot(train_data['GDP_change'], label='Training GDP', color='blue')
plt.plot(test_data['GDP_change'], label='Actual GDP', color='orange')
plt.plot(forecast_df['GDP_forecast'], label='Forecasted GDP', color='green')
plt.title('GDP Change Forecast vs Actual')
plt.legend()
plt.show()
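A multi-step VAR forecast is built recursively: each predicted point is fed back in as a lag for the next step. The sketch below spells out that recursion with NumPy, assuming the coefficients are already known. The function `var_forecast` and the toy numbers are hypothetical; in the analysis above the fitted model's `forecast` method does this work.

```python
import numpy as np

def var_forecast(const, coefs, history, steps):
    """Iterate y_t = const + sum_i coefs[i] @ y_{t-1-i} forward `steps` times.

    const:   shape (k,)      intercepts
    coefs:   shape (p, k, k) coefs[i] multiplies the (i+1)-th lag
    history: shape (>=p, k)  past observations, most recent last
    """
    p = coefs.shape[0]
    window = [row for row in np.asarray(history, dtype=float)[-p:]]
    out = []
    for _ in range(steps):
        y_next = const.copy()
        for i in range(p):
            y_next = y_next + coefs[i] @ window[-1 - i]
        out.append(y_next)
        window.append(y_next)   # the forecast becomes a lag for the next step
    return np.array(out)

# Toy VAR(1) with a zero constant and lag matrix 0.5 * identity
const = np.zeros(2)
coefs = np.array([0.5 * np.eye(2)])
history = np.array([[4.0, 8.0]])

print(var_forecast(const, coefs, history, steps=3))
# rows: [2, 4], [1, 2], [0.5, 1]
```

Because every step after the first depends on earlier forecasts rather than on observed values, the toy forecasts shrink toward zero, which is the long-run mean implied by a zero constant.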
VAR Model: Model Performance
Low values of both MAE and RMSE (roughly 0.02 to 0.04 for both series) indicate small average prediction errors and good model performance.
mae_gii = mean_absolute_error(test_data['GII_change'], forecast_df['GII_forecast'])
mae_gdp = mean_absolute_error(test_data['GDP_change'], forecast_df['GDP_forecast'])

print(f'Mean Absolute Error for GII: {mae_gii}')
print(f'Mean Absolute Error for GDP: {mae_gdp}')
Mean Absolute Error for GII: 0.021634274218546037
Mean Absolute Error for GDP: 0.027874530660148864
rmse_gii = np.sqrt(mean_squared_error(test_data['GII_change'],
                                      forecast_df['GII_forecast']))
rmse_gdp = np.sqrt(mean_squared_error(test_data['GDP_change'],
                                      forecast_df['GDP_forecast']))

print(f'RMSE for GII: {rmse_gii}')
print(f'RMSE for GDP: {rmse_gdp}')
RMSE for GII: 0.0400826273209931
RMSE for GDP: 0.03867925511599023
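For reference, both metrics are simple to compute by hand. The sketch below uses made-up numbers and also scores a naive constant forecast (the training mean). That baseline is an assumption added here as one possible yardstick for judging whether an error is small; the analysis above does not include it.

```python
import numpy as np

def mae(actual, predicted):
    """Mean absolute error: average size of the forecast errors."""
    return float(np.mean(np.abs(np.asarray(actual) - np.asarray(predicted))))

def rmse(actual, predicted):
    """Root mean squared error: penalizes large errors more than MAE does."""
    return float(np.sqrt(np.mean((np.asarray(actual) - np.asarray(predicted)) ** 2)))

# Made-up values for illustration only
train = np.array([0.02, -0.01, 0.03, 0.00])
actual = np.array([0.01, 0.03, -0.02])
model_forecast = np.array([0.02, 0.02, -0.01])

# Hypothetical baseline: always predict the training mean
baseline_forecast = np.full(len(actual), train.mean())

print(f"Model    MAE={mae(actual, model_forecast):.4f}  RMSE={rmse(actual, model_forecast):.4f}")
print(f"Baseline MAE={mae(actual, baseline_forecast):.4f}  RMSE={rmse(actual, baseline_forecast):.4f}")
```

If a model's errors are not clearly smaller than those of such a baseline, small absolute values of MAE and RMSE may simply reflect the small scale of the series.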
VAR Model: Granger causality test
The Granger causality test evaluates whether past values of one time series help predict another.
Based on the output, the lowest p-value occurs at lag = 3 (p ≈ 0.56). Because every p-value is well above 0.05, we fail to reject the null hypothesis and conclude that GDP_change does not Granger-cause GII_change.
# Perform the Granger causality test
max_lag = 3
test_result = grangercausalitytests(train_data[['GII_change', 'GDP_change']], max_lag,
                                    verbose=True)
Granger Causality
number of lags (no zero) 1
ssr based F test: F=0.1149 , p=0.7348 , df_denom=623, df_num=1
ssr based chi2 test: chi2=0.1154 , p=0.7340 , df=1
likelihood ratio test: chi2=0.1154 , p=0.7341 , df=1
parameter F test: F=0.1149 , p=0.7348 , df_denom=623, df_num=1
Granger Causality
number of lags (no zero) 2
ssr based F test: F=0.4787 , p=0.6198 , df_denom=620, df_num=2
ssr based chi2 test: chi2=0.9652 , p=0.6172 , df=2
likelihood ratio test: chi2=0.9644 , p=0.6174 , df=2
parameter F test: F=0.4787 , p=0.6198 , df_denom=620, df_num=2
Granger Causality
number of lags (no zero) 3
ssr based F test: F=0.6797 , p=0.5647 , df_denom=617, df_num=3
ssr based chi2 test: chi2=2.0623 , p=0.5596 , df=3
likelihood ratio test: chi2=2.0589 , p=0.5603 , df=3
parameter F test: F=0.6797 , p=0.5647 , df_denom=617, df_num=3
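The "ssr based F test" rows compare two regressions: one that predicts the target from its own lags only, and one that also includes lags of the candidate series. The sketch below reproduces that comparison with NumPy on synthetic data. The function names and the simulated series are hypothetical, and the sketch returns only the F statistic, not a p-value.

```python
import numpy as np

def ssr(X, y):
    """Sum of squared residuals from an OLS fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def granger_f(target, candidate, lags):
    """F statistic for H0: lags of `candidate` add nothing beyond lags of `target`."""
    target = np.asarray(target, dtype=float)
    candidate = np.asarray(candidate, dtype=float)
    n = len(target) - lags
    y = target[lags:]
    own = np.column_stack([target[lags - i:len(target) - i] for i in range(1, lags + 1)])
    other = np.column_stack([candidate[lags - i:len(candidate) - i] for i in range(1, lags + 1)])
    const = np.ones((n, 1))
    ssr_restricted = ssr(np.hstack([const, own]), y)        # own lags only
    X_full = np.hstack([const, own, other])                  # plus candidate lags
    ssr_full = ssr(X_full, y)
    df_denom = n - X_full.shape[1]
    return ((ssr_restricted - ssr_full) / lags) / (ssr_full / df_denom)

# Synthetic data in which x leads y by one period (illustration only)
rng = np.random.default_rng(1)
n_obs = 400
x = rng.normal(size=n_obs)
y = np.zeros(n_obs)
for t in range(1, n_obs):
    y[t] = 0.3 * y[t - 1] + 0.8 * x[t - 1] + rng.normal(scale=0.5)

f_x_to_y = granger_f(y, x, lags=1)
f_y_to_x = granger_f(x, y, lags=1)
print(f"x -> y: F = {f_x_to_y:.2f}")
print(f"y -> x: F = {f_y_to_x:.2f}")
```

With 627 training rows and one lag, this setup gives 626 usable observations and 3 estimated parameters, so the denominator degrees of freedom are 623, matching `df_denom=623` in the output above. A large F statistic means the candidate's lags reduce the residual error by more than chance would explain.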
Conclusion
In wrapping up our analysis, we found no evidence to support a significant relationship between the Change in Real GDP per capita and the Change in the Gender Inequality Index (GII). This suggests that economic growth may not have a direct impact on gender equality. However, our findings open the door to questions for future research.
Future Directions
First, we must consider what other factors might influence the relationship between GDP and GII change. The GII is a composite index, shaped by a myriad of social factors, including cultural norms, legal frameworks, and environmental shifts. Future studies could benefit from incorporating additional predictors into the analysis and exploring the interaction between economic growth and gender equality within specific country contexts.
Second, there’s potential to enhance the predictive power of our Vector Autoregression (VAR) time series model. While we found no evidence that GDP change Granger-causes GII change, our model performed well in forecasting trends for both variables. In practice, policymakers may want to forecast GII trends independently of GDP if they are implementing gender-focused policies. Future research could investigate alternative time series models to further unravel the dynamics of GII and GDP changes.
So, as we wrap up this chapter, let’s keep our curiosity alive and our questions flowing. After all, every end is just a new beginning in the quest for knowledge!