# imfp


# imfp

[![Tests](https://github.com/Promptly-Technologies-LLC/imfp/actions/workflows/test.yml/badge.svg)](https://github.com/Promptly-Technologies-LLC/imfp/actions/workflows/test.yml)
[![PyPI Version](https://img.shields.io/pypi/v/imfp.svg)](https://pypi.python.org/pypi/imfp)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)

`imfp`, created and maintained by [Promptly Technologies](https://promptlytechnologies.com), is a Python package for downloading data from the [International Monetary Fund's](http://data.imf.org/) [RESTful JSON API](http://datahelp.imf.org/knowledgebase/articles/667681-using-json-restful-web-service).

## Installation

To install the stable version of imfp from PyPi, use pip.

```bash
pip install --upgrade imfp
```

To load the library, use `import`:

``` {python}
import imfp
```

## Workflow

The `imfp` package introduces four core functions: `imf_databases`, `imf_parameters`, `imf_parameter_defs`, and `imf_dataset`. The function for downloading datasets is `imf_dataset`, but you will need the other functions to determine what arguments to supply to your `imf_dataset` function call.

### Fetching a List of Databases with `imf_databases`

For instance, all calls to `imf_dataset` require a `database_id`. This is because the IMF serves many different databases through its API, and the API needs to know which of these many databases you're requesting data from.

To fetch a list of available databases, use:

``` {python}
# Fetch list of available databases
databases = imfp.imf_databases()
```

See [Working with Databases](docs/databases.qmd) for more information.

### Fetching a List of Parameters and Input Codes with `imf_parameters`

Requests to fetch data from IMF databases are complicated by the fact that each database uses a different set of parameters when making a request. (At last count, there were 43 unique parameters used in making API requests from the various databases!) You also have to have the list of valid input codes for each parameter. See [Working with Parameters](docs/parameters.qmd) for a more detailed explanation of parameters and input codes and how they work.

To obtain the full list of parameters and valid input codes for a given database, use:

``` {python}
# Fetch list of valid parameters and input codes for commodity price database
params = imfp.imf_parameters("PCPS")
```

The `imf_parameters` function returns a dictionary of data frames. Each dictionary key name corresponds to a parameter used in making requests from the database:

``` {python}
# Get key names from the params object
params.keys()
```

Each named list item is a data frame containing the valid input codes (and their descriptions) that can be used with the named parameter.

To access the data frame containing valid values for each parameter, subset the `params` dict by the parameter name:

``` {python}
# View the data frame of valid input codes for the frequency parameter
params['freq']
```

### Supplying Parameter Arguments to `imf_dataset`

To make a request to fetch data from the IMF API, just call `imfp.imf_dataset` with the database ID and keyword arguments for each parameter, where the keyword argument name is the parameter name and the value is the list of codes you want.

For instance, on exploring the `freq` parameter of the Primary Commodity Price System database above, we found that the frequency can take one of three values: "A" for annual, "Q" for quarterly, and "M" for monthly. Thus, to request annual data, we can call `imfp.imf_dataset` with `freq = ["A"]`.

Similarly, we might search the dataframes of valid input codes for the `commodity` and `unit_measure` parameters to find the input codes for coal and index:

``` {python}
# Find the 'commodity' input code for coal
params['commodity'].loc[
    params['commodity']['description'].str.contains("Coal")
]
```

``` {python}
# Find the 'unit_measure' input code for index
params['unit_measure'].loc[
    params['unit_measure']['description'].str.contains("Index")
]

```

Finally, we can use the information we've gathered to make the request to fetch the data:

``` {python}
# Request data from the API
df = imfp.imf_dataset(database_id = "PCPS",
         freq = ["A"], commodity = ["PCOAL", "PCOALAU", "PCOALSA"],
         unit_measure = ["IX"],
         start_year = 2000, end_year = 2015)

# Display the first few entries in the retrieved data frame
df.head()
```

The returned data frame has a `time_format` column that contains ISO 8601 duration codes. In this case, “P1Y” means “periods of 1 year.” The `unit_mult` column represents the power of 10 to which the value column should be raised. For instance, if value is in millions, then the unit multiplier will be 6 (meaning 10^6). If in billions, then the unit multiplier will be 9 (meaning 10^9). For more information on interpreting the returned data frame, see [Understanding the Data Frame](docs/usage.qmd#understanding-the-data-frame).

## Working with the Returned Data Frame

Note that all columns in the returned data frame are string objects, and to plot the series we will need to convert to valid numeric or date formats:

``` {python}
# Convert obs_value to numeric and time_period to integer year
df = df.astype({"time_period" : int, "obs_value" : float})
```

Then, using `seaborn` with `hue`, we can plot different indicators in different colors:

``` {python}
import seaborn as sns

# Plot prices of different commodities in different colors with seaborn
sns.lineplot(data=df, x='time_period', y='obs_value', hue='commodity');
```

## Contributing

We welcome contributions to improve `imfp`! Here's how you can help:

1. If you find a bug, please open [a Github issue](https://github.com/Promptly-Technologies-LLC/imfp/issues)
2. To fix a bug:
   - Fork and clone the repository and open a terminal in the repository directory
   - Install [uv](https://astral.sh/setup-uv/) with `curl -LsSf https://astral.sh/uv/install.sh | sh`
   - Install the dependencies with `uv sync`
   - Install a git hook to enforce conventional commits with `curl -o- https://raw.githubusercontent.com/tapsellorg/conventional-commits-git-hook/master/scripts/install.sh | sh`
   - Create a fix, commit it with an ["Angular-style Conventional Commit"](https://www.conventionalcommits.org/en/v1.0.0-beta.4/) message, and push it to your fork
   - Open a pull request to our `main` branch

Note that if you want to change and preview the documentation, you will need to install the [Quarto CLI tool](https://quarto.org/docs/download/).

Version incrementing, package building, testing, changelog generation, documentation rendering, publishing to PyPI, and Github release creation is handled automatically by the GitHub Actions workflow based on the commit messages.

## Working with LLMs

In line with the [llms.txt standard](https://llmstxt.org/), we have exposed the full Markdown-formatted project documentation as a [single text file](docs/static/llms.txt) to make it more usable by LLM agents.

``` {python}
#| echo: false
#| include: false
import re
from pathlib import Path

def extract_file_paths(quarto_yml_path):
    """
    Extract href paths from _quarto.yml file.
    Returns a list of .qmd file paths.
    """
    with open(quarto_yml_path, 'r', encoding='utf-8') as f:
        content = f.read()

    # Find all href entries that point to .qmd files
    pattern = r'href:\s*(.*?\.qmd)'
    matches = re.findall(pattern, content, re.MULTILINE)
    return matches


def process_qmd_content(file_path):
    """
    Process a .qmd file by converting YAML frontmatter to markdown heading.
    Returns the processed content as a string.
    """
    with open(file_path, 'r', encoding='utf-8') as f:
        content = f.read()

    # Replace YAML frontmatter with markdown heading
    pattern = r'^---\s*\ntitle:\s*"([^"]+)"\s*\n---'
    processed_content = re.sub(pattern, r'# \1', content)
    return processed_content


# Get the current working directory
base_dir = Path.cwd()
quarto_yml_path = base_dir / '_quarto.yml'

print(quarto_yml_path)

# Extract file paths from _quarto.yml
qmd_files = extract_file_paths(quarto_yml_path)
print(qmd_files)

# Process each .qmd file and collect contents
processed_contents = []
for qmd_file in qmd_files:
    file_path = base_dir / qmd_file
    if file_path.exists():
        processed_content = process_qmd_content(file_path)
        processed_contents.append(processed_content)

# Concatenate all contents with double newline separator
final_content = '\n\n'.join(processed_contents)

# Ensure the output directory exists
output_dir = base_dir / 'docs' / 'static'
output_dir.mkdir(parents=True, exist_ok=True)

# Write the concatenated content to the output file
output_path = output_dir / 'llms.txt'
with open(output_path, 'w', encoding='utf-8') as f:
    f.write(final_content)
```


# Installation

## Prerequisites

To install the latest version of `imfp`, you will need to have [Python 3.10 or later](https://www.python.org/downloads/) installed on your system.

If you don't already have Python, we recommend installing [the `uv` package manager](https://astral.sh/setup-uv/) and installing Python with `uv python install`.

## Installation

To install the latest stable `imfp` wheel from PyPi using pip:

``` bash
pip install --upgrade imfp
```

Alternatively, to install from the source code on Github, you can use the following command:

``` bash
pip install --upgrade git+https://github.com/Promptly-Technologies-LLC/imfp.git
```

You can then import the package in your Python script:

``` python
import imfp
```

## Suggested Dependencies for Data Analysis

`imfp` outputs data in a `pandas` data frame, so you will want to use the `pandas` package (which is installed with `imfp`) for its functions for viewing and manipulating this object type. For data visualization, we recommend installing these additional packages:

``` bash
pip install -q matplotlib seaborn
```

You can then import these packages in your Python script:

``` python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```

## Development Installation

To get started with development of `imfp`,

1. Fork and clone the repository
2. Install [uv](https://astral.sh/setup-uv/) with `curl -LsSf https://astral.sh/uv/install.sh | sh`
3. Install the dependencies with `uv sync`
4. Install a git pre-commit hook to enforce conventional commits:
   ``` bash
   curl -o- https://raw.githubusercontent.com/tapsellorg/conventional-commits-git-hook/master/scripts/install.sh | sh
   ```

To edit and preview the documentation, you'll also want to install the [Quarto CLI tool](https://quarto.org/docs/download/).


# Working with Databases

## Understanding IMF Databases

The IMF serves many different databases through its API, and the API needs to know which of these many databases you're requesting data from. Before you can fetch any data, you'll need to:

1. Get a list of available databases
2. Find the database ID for the data you want

Then you can use that database ID to fetch the data.

## Fetching the Database List

### Fetching an Index of Databases with the `imf_databases` Function

To obtain the list of available databases and their corresponding IDs, use `imf_databases`:

``` {python}
import imfp

#Fetch the list of databases available through the IMF API
databases = imfp.imf_databases()
databases.head()
```


This function returns the IMF’s listing of 259 databases available through the API. (In reality, a few of the listed databases are defunct and not actually available. The databases FAS_2015, GFS01, FM202010, APDREO202010, AFRREO202010, WHDREO202010, BOPAGG_2020, and DOT_2020Q1 were unavailable as of last check.)

## Exploring the Database List

To view and explore the database list, it’s possible to explore subsets of the data frame by row number with `databases.loc`:

``` {python}
# View a subset consisting of rows 5 through 9
databases.loc[5:9]
```


Or, if you already know which database you want, you can fetch the corresponding code by searching for a string match using `str.contains` and subsetting the data frame for matching rows. For instance, here’s how to search for commodities data:

``` {python}
databases[databases['description'].str.contains("Commodity")]
```

See also [Working with Large Data Frames](usage.qmd#working-with-large-data-frames) for sample code showing how to view the full contents of the data frame in a browser window. 

## Best Practices

1. **Cache the Database List**: The database list rarely changes. Consider saving it locally if you'll be making multiple queries. See [Caching Strategy](rate_limits.qmd#caching-strategy) for sample code.

2. **Search Strategically**: Use specific search terms to find relevant databases. For example:

   - "Price" for price indices
   - "Trade" for trade statistics
   - "Financial" for financial data

3. **Use a Browser Viewer**: See [Working with Large Data Frames](usage.qmd#working-with-large-data-frames) for sample code showing how to view the full contents of the data frame in a browser window.

4. **Note Database IDs**: Once you find a database you'll use frequently, note its database ID for future reference.

## Next Steps

Once you've identified the database you want to use, you'll need to:

1. Get the list of parameters for that database (see [Parameters](parameters.qmd))
2. Use those parameters to fetch your data (see [Datasets](datasets.qmd))


# Working with Parameters

## Filtering IMF Dataset Requests with Parameters

Once you have a `database_id`, it’s possible to make a call to `imf_dataset` to fetch the entire database:

``` {python}
#| eval: false
import imfp
import pandas as pd

# Set float format to 2 decimal places for pandas display output
pd.set_option('display.float_format', lambda x: '%.2f' % x)

imfp.imf_dataset(database_id)
```

However, while this will succeed for a few small databases, it will fail for all of the larger ones. And even in the rare case when it succeeds, fetching an entire database can take a long time. You’re much better off supplying additional filter parameters to reduce the size of your request.

Requests to databases available through the IMF API are complicated by the fact that each database uses a different set of parameters when making a request. (At last count, there were 43 unique parameters used in making API requests from the various databases!) You also have to have the list of valid input codes for each parameter. The `imf_parameters` function solves this problem. Use the function to obtain the full list of parameters and valid input codes for a given database.

## Understanding Filter Parameters

Each database available through the IMF API has its own set of parameters that can be used to filter and specify the data you want to retrieve.

Each parameter will be a column in the data. Each row in the data will contain a value for that parameter. The parameter will always be a categorical variable, meaning that it can take only a limited set of values. We refer to these values as "input codes," because you can input them in your API request to filter the data.

What this means, though, is that before making an API request to retrieve data, you need to know what the available filtering parameters are for the database, and what codes you can use for filtering the data by each parameter.

There are two main functions for working with parameters:

- `imf_parameters()`: Get the full list of parameters and valid input codes for a database
- `imf_parameter_defs()`: Get text descriptions of what each parameter represents

## Discovering Available Parameters

To get started, you'll need to know what parameters are available for your chosen database. Use `imf_parameters()` to get this information:

``` {python}
import imfp

# Fetch list of valid parameters for the Primary Commodity Price System database
params = imfp.imf_parameters("PCPS")

# View the available parameter names
params.keys()
```

The function returns a dictionary of data frames.

Each key in the dictionary corresponds to a parameter used in making requests from the database. The value for each key is a data frame with the following columns:

- `input_code`: The valid codes you can use for that parameter
- `description`: A short text description of what each code represents

For example, to see the valid codes for the `freq` (frequency) parameter:

``` {python}
# View the data frame of valid input codes for the frequency parameter
params['freq']
```

## Parameter Definitions

If the parameter name is not self-explanatory, you can use the `imf_parameter_defs()` function to get a text description of what each parameter represents.

``` {python}
# Get descriptions of what each parameter means
params_defs = imfp.imf_parameter_defs("PCPS")

params_defs
```

## Supplying Parameters

### Basic Approach (Recommended for Most Users)

To make a request to fetch data from the IMF API, just call `imf_dataset` with the database ID and keyword arguments for each parameter, where the keyword argument name is the parameter name and the value is the list of codes you want.

For instance, on exploring the `freq` parameter of the Primary Commodity Price System database above, we found that the frequency can take one of three values: "A" for annual, "Q" for quarterly, and "M" for monthly. Thus, to request annual data, we can call `imf_dataset` with `freq = ["A"]`.

Here's a complete example that fetches annual coal prices for the years 2000 through 2015:

``` {python}
#| eval: false
# Example: Get annual coal prices
df = imfp.imf_dataset(
    database_id="PCPS",
    freq=["A"],  # Annual frequency
    commodity=["PCOAL"],  # Coal prices
    start_year=2000,
    end_year=2015
)
```

### Advanced Approaches

For more complex queries, there are two programmatic ways to supply parameters to `imf_dataset`. These approaches are particularly useful when you need to filter parameters based on their descriptions or when working with multiple parameter values.

#### 1. List Arguments with Parameter Filtering

This approach uses string matching to find the correct parameter codes before passing them to `imf_dataset`:

``` {python}
# Fetch the input code column of the freq parameter...
selected_freq = list(
    params['freq']['input_code'][
        # ...where the description contains "Annual"
        params['freq']['description'].str.contains("Annual")
    ]
)

# Fetch the input code column of the commodity parameter...
selected_commodity = list(
    params['commodity']['input_code'][
        # ...where the description contains "Coal"
        params['commodity']['description'].str.contains("Coal")
    ]
)

# Fetch the input code column of the unit_measure parameter...
selected_unit_measure = list(
    params['unit_measure']['input_code'][
        # ...where the description contains "Index"
        params['unit_measure']['description'].str.contains("Index")
    ]
)

# Request data from the API using the filtered parameter code lists
df = imfp.imf_dataset(
    database_id="PCPS",
    freq=selected_freq,
    commodity=selected_commodity,
    unit_measure=selected_unit_measure,
    start_year=2000,
    end_year=2015
)

df.head()
```

#### 2. Parameters Dictionary Approach

This approach modifies the parameters dictionary directly and passes the entire filtered dictionary to `imf_dataset` as a single `parameters` keyword argument. This is more concise but requires understanding how the parameters dictionary works:

``` {python}
# Copy the params dictionary
modified_params = params.copy()

# Overwrite the data frame for each parameter in the dictionary with filtered rows
modified_params['freq'] = params['freq'][
    # ...where the input code description for freq contains "Annual"
    params['freq']['description'].str.contains("Annual")
]
modified_params['commodity'] = params['commodity'][
    # ...where the input code description for commodity contains "Coal"
    params['commodity']['description'].str.contains("Coal")
]
modified_params['unit_measure'] = params['unit_measure'][
    # ...where the input code description for unit_measure contains "Index"
    params['unit_measure']['description'].str.contains("Index")
]

# Pass the modified dictionary to imf_dataset
df = imfp.imf_dataset(
    database_id="PCPS",
    parameters=modified_params,
    start_year=2000,
    end_year=2015
)

df.head()
```

Note that when using the parameters dictionary approach, you cannot combine it with individual parameter arguments. If you supply a `parameters` argument, any other keyword arguments for individual parameters will be ignored.


# Requesting Datasets

## Making a Request

To retrieve data from an IMF database, you'll need the database ID and any relevant [filter parameters](parameters.qmd). Here's a basic example using the Primary Commodity Price System (PCPS) database:

``` {python}
import imfp

# Get parameters and their valid codes
params = imfp.imf_parameters("PCPS")

# Fetch annual coal price index data
df = imfp.imf_dataset(
    database_id="PCPS",
    freq=["A"],  # Annual frequency
    commodity=["PCOAL"],  # Coal prices
    unit_measure=["IX"],  # Index
    start_year=2000,
    end_year=2015
)
```

This example creates two objects we'll use in the following sections:

- `params`: A dictionary of parameters and their valid codes
- `df`: The retrieved data frame containing our requested data

## Decoding Returned Data

When you retrieve data using `imf_dataset`, the returned data frame contains columns that correspond to the parameters you specified in your request. However, these columns use input codes (short identifiers) rather than human-readable descriptions. To make your data more interpretable, you can replace these codes with their corresponding text descriptions using the parameter information from `imf_parameters`, so that codes like "A" (Annual) or "W00" (World) become self-explanatory labels.

For example, suppose we want to decode the `freq` (frequency), `ref_area` (geographical area), and `unit_measure` (unit) columns in our dataset. We'll merge the parameter descriptions into our data frame:

``` {python}
# Decode frequency codes (e.g., "A" → "Annual")
df = df.merge(
    # Select code-description pairs
    params['freq'][['input_code', 'description']],
    # Match codes in the data frame
    left_on='freq',
    # ...to codes in the parameter data
    right_on='input_code',
    # Keep all data rows
    how='left'
).drop(columns=['freq', 'input_code']
).rename(columns={"description": "freq"})

# Decode geographic area codes (e.g., "W00" → "World")
df = df.merge(
    params['ref_area'][['input_code', 'description']],
    left_on='ref_area',
    right_on='input_code',
    how='left'
).drop(columns=['ref_area', 'input_code']
).rename(columns={"description":"ref_area"})

# Decode unit codes (e.g., "IX" → "Index")
df = df.merge(
    params['unit_measure'][['input_code', 'description']],
    left_on='unit_measure',
    right_on='input_code',
    how='left'
).drop(columns=['unit_measure', 'input_code']
).rename(columns={"description":"unit_measure"})

df.head()
```

After decoding, the data frame is much more human-interpretable. This transformation makes the data more accessible for analysis and presentation, while maintaining all the original information.

## Understanding the Data Frame

Also note that the returned data frame has additional mysterious-looking codes as values in some columns.

Codes in the `time_format` column are ISO 8601 duration codes. In this case, “P1Y” means “periods of 1 year.” See [Time Period Conversion](usage.qmd#time-period-conversion) for more information on reconciling time periods.

The `unit_mult` column represents the number of zeroes you should add to the value column. For instance, if value is in millions, then the unit multiplier will be 6. If in billions, then the unit multiplier will be 9. See [Unit Multiplier Adjustment](usage.qmd#unit-multiplier-adjustment) for more information on reconciling units.


# Suggestions for Usage

## Determining Data Availability

Unfortunately, **many** of the indicators listed as available in the lists of input codes returned by `imf_parameters()` are not actually available. This is a deficiency of the API rather than the library; someone at the IMF presumably intended to provide these indicators at some point, but never got around to it.

The only way to be certain whether an indicator is available is to make a request to the API and see if it succeeds. If not, you will receive an error message indicating that no data was found for your parameters. In general, if you see this message, you should try making a less restrictive version of your request. For instance, if your request returns no data for an indicator for a given country and time period, you can omit the country or time period parameter and try again. If you still get no data, that indicator is not actually available through the API.

While it is not fully predictable which indicators will be available, as a general rule you can expect to get unadjusted series but not adjusted ones. For instance, real and per capita GDP are not available (although they are listed) through the API, but nominal GDP is. The API does, however, make available all the adjustment variables you would need to adjust the data yourself. See the [Common Data Transformations](#common-data-transformations) section below for examples of how to make adjustments.

## Working with Large Data Frames

### Inspecting Data

`imfp` outputs data in `pandas` DataFrames, so you will want to use the `pandas` package for its functions for viewing and manipulating this object type.

For large datasets, you can use the `pandas` library's `info()` method to get a quick summary of the data frame, including the number of rows and columns, the count of non-missing values, the column names, and the data types.

``` {python}
import imfp
import pandas as pd

# Set float format to 2 decimal places for pandas display output
pd.set_option('display.float_format', lambda x: '%.2f' % x)

df: pd.DataFrame = imfp.imf_dataset(
    database_id="PCPS",
    commodity=["PCOAL"],
    unit_measure=["IX"],
    start_year=2000, end_year=2001
)

# Quick summary of DataFrame
df.info()
```

Alternatively, you can use the `head()` method to view the first 5 rows of the data frame.

``` {python}
# View first 5 rows of DataFrame
df.head()
```

### Cleaning Data

#### Numeric Conversion

All data is returned from the IMF API as a text (object) data type, so you will want to cast numeric columns to numeric.

``` {python}
# Numeric columns
numeric_cols = ["unit_mult", "obs_value"]

# Cast numeric columns
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric)
```

#### Categorical Conversion

You can also convert string columns to categorical types for better memory usage.

``` {python}
# Convert categorical columns like ref_area and indicator to category type
categorical_cols = [
  "freq",
  "ref_area",
  "commodity",
  "unit_measure",
  "time_format"
]

df[categorical_cols] = df[categorical_cols].astype("category")
```

#### NA Removal

After conversion, you may want to drop any rows with missing values.

``` {python}
# Drop rows with missing values
df = df.dropna()
```

#### Time Period Conversion

The `time_period` column can be more difficult to work with, because it may be differently formatted depending on the frequency of the data.

Annual data will be formatted as a four-digit year, such as "2000", which can be trivially converted to numeric.

However, quarterly data will be formatted as "2000-Q1", and monthly data will be formatted like "2000-01".

You can use the `pandas` library's `to_datetime()` method with the `format="mixed"` argument to convert this column to a datetime object in a format-agnostic way:

``` {python}
# Convert time_period to datetime
df["datetime"] = pd.to_datetime(df["time_period"], format="mixed")
df[["freq", "datetime"]].head()
```

Alternatively, you can split the `time_period` column into separate columns for year, quarter, and month, and then convert each to a numeric value:

``` {python}
# Split time_period into separate columns
df["year"] = df["time_period"].str.extract(r"(\d{4})")[0]
df["quarter"] = df["time_period"].str.extract(r"Q(\d{1})")[0]
df["month"] = df["time_period"].str.extract(r"-(\d{2})")[0]

# Convert year, quarter, and month to numeric
df["year"] = pd.to_numeric(df["year"])
df["quarter"] = pd.to_numeric(df["quarter"])
df["month"] = pd.to_numeric(df["month"])

df[["time_period", "year", "quarter", "month"]].head()
```

### Summarizing Data

After converting columns to numeric, you can use the `describe()` function to get a quick summary of the statistical properties of these, including the count of rows, the mean, the standard deviation, the minimum and maximum values, and the quartiles.

``` {python}
# Statistical summary
df.describe()
```

### Viewing Data

For large data frames, it can be useful to view the data in a browser window. To facilitate this, you can define a `View()` function as follows. This function will save the data frame to a temporary HTML file and open it in your default web browser.

``` {python}
#| eval: false
import tempfile
import webbrowser

# Define a simple function to view data frame in a browser window
def View(df: pd.DataFrame):
    html = df.to_html()
    with tempfile.NamedTemporaryFile('w', 
    delete=False, suffix='.html') as f:
        url = 'file://' + f.name
        f.write(html)
    webbrowser.open(url)

# Call the function
View(df)
```

## Common Data Transformations

The International Financial Statistics (IFS) database provides key macroeconomic aggregates that are frequently needed when working with other IMF datasets. Here, we'll demonstrate how to use three fundamental indicators—GDP, price deflators, and population statistics—to transform your data.

These transformations are essential for:

- Converting nominal to real dollar values
- Calculating per capita metrics
- Harmonizing data across different frequencies
- Adjusting for different unit scales

For a complete, end-to-end example of these transformations in a real analysis workflow, see Jenny Xu's superb [demo notebook](https://github.com/jennyxu/imfp-demo).

### Fetching IFS Adjusters

First, let's retrieve the key adjustment variables from the IFS database:

``` {python}
# Fetch GDP Deflator (Index, Annual)
deflator = imfp.imf_dataset(
    database_id="IFS",
    indicator="NGDP_D_SA_IX",
    freq="Q",
    start_year=2010
)

# Fetch Population Estimates (Annual)
population = imfp.imf_dataset(
    database_id="IFS",
    indicator="LP_PE_NUM",
    freq="A",
    start_year=2010
)

# Fetch Exchange Rate (Annual)
exchange_rate = imfp.imf_dataset(
    database_id="IFS", 
    indicator="ENDE_XDC_USD_RATE",
    freq="Q",
    # start_year=2010 currently breaks this query for some reason
)
```

We'll also retireve a nominal GDP series to be adjusted:

``` {python}
# Fetch Nominal GDP (Domestic currency, annual)
nominal_gdp = imfp.imf_dataset(
    database_id="IFS", 
    indicator="NGDP_XDC",
    freq="A",
    start_year=2010
)
```

**Key IFS Indicators**:

- `NGDP_D_SA_IX`: GDP deflator index (seasonally adjusted)
- `LP_PE_NUM`: Population estimates
- `ENDE_XDC_USD_RATE`: Exchange rate (domestic currency per USD)
- `NGDP_XDC`: Nominal GDP in domestic currency

### Harmonizing Frequencies

When working with data of different frequencies, you'll often need to harmonize them. For example, population and national GDP are available at an annual frequency, while the GDP deflator and exchange rates can only be obtained at a monthly or quarterly frequency. There are two common approaches:

1. Using Q4 values: This approach is often used for stock variables (measurements taken at a point in time) and when you want to align with end-of-year values:

```{python}
# Keep only Q4 observations for annual comparisons
deflator = deflator[deflator['time_period'].str.contains("Q4")]
exchange_rate = exchange_rate[exchange_rate['time_period'].str.contains("Q4")]

# Extract just the year from the time period for Q4 data
deflator['time_period'] = deflator['time_period'].str[:4]
exchange_rate['time_period'] = exchange_rate['time_period'].str[:4]
```

``` {python}
#| include: false
#| echo: false
# Hidden fixup to filter out exchange rate data before 2010
# Remove this line when the start_year parameter is fixed for this query
exchange_rate = exchange_rate[exchange_rate['time_period'].astype(int) > 2010]
```

2. Calculating annual averages: This approach is more appropriate for flow variables (measurements over a period) and when you want to smooth out seasonal variations:

``` {python}
#| eval: false
# Alternative: Calculate annual averages
deflator = deflator.groupby(
    ['ref_area', deflator['time_period']], 
    as_index=False
).agg({
    'obs_value': 'mean'
})
```

Choose the appropriate method based on your specific analysis needs and the economic meaning of your variables.

### Unit Multiplier Adjustment

IMF data often includes a `unit_mult` column that indicates the scale of the values (e.g., millions, billions). We can write a helper function to apply these scaling factors:

``` {python}
def apply_unit_multiplier(df):
    """Convert to numeric, adjust values using IMF's scaling factors, and drop
    missing values"""
    df['obs_value'] = pd.to_numeric(df['obs_value'])
    df['unit_mult'] = pd.to_numeric(df['unit_mult'])
    df['adjusted_value'] = df['obs_value'] * 10 ** df['unit_mult']
    df = df.dropna(subset=["obs_value"])
    return df

# Apply to each dataset
deflator = apply_unit_multiplier(deflator)
population = apply_unit_multiplier(population)
exchange_rate = apply_unit_multiplier(exchange_rate)
nominal_gdp = apply_unit_multiplier(nominal_gdp)
```

### Merging Datasets

After harmonizing unit scales, we can combine the datasets using `pd.DataFrame.merge()` with `ref_area` and `time_period` as keys:

``` {python}
merged = (
    nominal_gdp.merge(
        deflator,
        on=['ref_area', 'time_period'],
        suffixes=('_gdp', '_deflator')
    )
    .merge(
        population,
        on=['ref_area', 'time_period']
    )
    .merge(
        exchange_rate,
        on=['ref_area', 'time_period'],
        suffixes=('_population', '_exchange_rate')
    )
)
```

### Calculating Real Values

With the merged dataset, we can now calculate real GDP and per capita values:

``` {python}
# Convert nominal to real GDP
merged['real_gdp'] = (
    (merged['adjusted_value_gdp'] / merged['adjusted_value_deflator']) * 100
)

# Calculate per capita values (using population obs_value)
merged['real_gdp_per_capita'] = merged['real_gdp'] / merged['adjusted_value_population']

# Display the first 5 rows of the transformed data
merged[['time_period', 'real_gdp', 'real_gdp_per_capita']].head()
```

### Exchange Rate Adjustment

Note that this result is still in the domestic currency of the country. If you need to convert to a common currency, you can use the exchange rate data from the IFS database.

``` {python}
# Because 'adjusted_value_exrate' is local-currency-per-USD,
# dividing local-currency real GDP by it yields GDP in USD.
merged["real_gdp_usd"] = (
    merged["real_gdp"] / merged["adjusted_value_exchange_rate"]
)

# (Optional) real GDP per capita in USD
merged["real_gdp_usd_per_capita"] = (
    merged["real_gdp_usd"] / merged["adjusted_value_population"]
)

# Inspect results
merged[["time_period","ref_area","real_gdp","real_gdp_usd","real_gdp_usd_per_capita"]].head()
```


# Rate Limits

## API Rate Management

The IMF API imposes very restrictive (and incompletely documented) rate limits, not only for individual users and applications, but also globally for all users of the API. Thus, at high-traffic times, you may find that your requests fail. It's highly recommended that you set up proactive error handling, wait times, retries, and request caching to avoid hitting the API's rate limits. The `imfp` library provides some tools to help you do this (with more planned for future releases).

### Setting Application Name

The IMF API has an application-based rate limit of 50 requests per second. Each application is identified by the "user_agent" variable in the request header. By default, all `imfp` users share the same application name, which could lead to rate limit issues if many users are making requests simultaneously.

This could prove problematic if the `imfp` library became too popular and too many users tried to make simultaneous API requests using the default app name. By setting a custom application name, users can avoid hitting this rate limit and being blocked by the API. To solve this problem, `imfp` supplies the `set_imf_app_name()` function to set a custom application name.

`set_imf_app_name()` sets the application name by changing the `IMF_APP_NAME` variable in the environment. If this variable doesn't exist, `set_imf_app_name()` will create it. To set a custom application name, simply call the `set_imf_app_name()` function with your desired application name as an argument:

``` {python}
import imfp

# Set custom app name as an environment variable
imfp.set_imf_app_name("my_custom_app_name")
```

The function will throw an error if the provided name is missing, NULL, NA, not a string, or longer than 255 characters. If the provided name is "imfp" (the default) or an empty string, the function will issue a warning recommending the use of a unique app name to avoid hitting rate limits.

### Managing Request Timing

If making multiple requests in a short period of time, you may want to increase the wait time between requests to avoid hitting the API's rate limits. This is done with the `set_imf_wait_time()` function:

``` {python}
#| eval: false
# Increase wait time to 5 seconds
imfp.set_imf_wait_time(5)
```

### Retries

`imfp` automatically handles rate limits with exponential backoff:

1. Waits for specified time
2. Retries the request
3. Increases wait time exponentially on subsequent failures
4. Stops after 3 attempts (default)

You can modify retry behavior:

``` {python}
#| eval: false
# Retry 4 times rather than the default 3
df = imfp.imf_dataset("IFS", "NGDP_D_SA_IX", times=4)
```

### Caching Strategy

To reduce API calls, you can cache frequently accessed data. For instance, in a Jupyter or Quarto notebook that you run multiple times, you can wrap each `imfp` function call in an `if` statement that checks if the returned data has already been saved to a file. If it has, it loads the data from the file. If it hasn't, it fetches the data from the API and saves it to a file.

Note that to run this code, you will need to install the `pyarrow` library, which `pandas` uses as its engine for reading and writing parquet files (but which is not installed with `pandas` or `imfp` by default). Use `pip install pyarrow` to install it.

``` {python}
#| eval: false
import os
import pandas as pd

# Fetch imf databases from file if available, else from API
cache_path = f"data/imf_databases.parquet"
if os.path.exists(cache_path):
    databases = pd.read_parquet(cache_path)
else:
    databases = imfp.imf_databases()
    os.makedirs("data", exist_ok=True)
    databases.to_parquet(cache_path)
```

You can also functionalize this logic to permit reuse several times in the same script or notebook. See Jenny Xu's excellent [demo notebook](demo.qmd#utility-functions) for example caching functions.

## Performance Tips

1. **Filter Early**: Use parameters to limit data at the API level
2. **Parallelize Carefully**: Avoid running parallel API requests, even from multiple clients
3. **Use Efficient Formats**: Store cached data in parquet or feather files
4. **Validate Data**: Check for errors and empty responses


---
title: "Economic Growth and Gender Equality: An Analysis Using IMF Data"
author: "Jenny Xu"
---

This data analysis project aims to explore the relationship between economic growth and gender equality using `imfp`, which allows us to download data from IMF (International Monetary Fund). `imfp` can be integrated with other python tools to streamline the computational process. To demonstrate its functionality, the project experimented with a variety of visualization and analysis methods. 

## Executive Summary

In this project, we explored the following:

1. **Data Fetching**
* Make API call to fetch 4 datasets: GII (Gender Inequality Index), Nominal GDP, GDP Deflator Index, Population series

2. **Feature Engineering**
* Cleaning: Convert GDP Deflator Index to a yearly basis and variables to numeric
* Dependent Variable: Percent Change of Gender Inequality Index
* Independent Variable: Percent Change of Real GDP per Capita 
* Transform variables to display magnitude of change 
* Merge the datasets

3. **Data Visualization**
* Scatterplot
* Time Series Line Plots
* Barplot
* Boxplot
* Heatmap

4. **Statistical Analysis**
* Descriptive Statistics
* Regression Analysis
* Time Series Analysis

## Utility Functions
The integration of other Python tools not only streamlined our computational processes but also ensured consistency across the project.

A custom module is written to simplify the process of making API calls and fetching information with imfp library. `load_or_fetch_databases`, `load_or_fetch_parameters` `load_or_fetch_dataset` load and retreive database, parameters, and dataset from a local or remote source. `view_dataframe_in_browser` displays dataframe in a web browser.

```{python}
import os
import pickle
from tempfile import NamedTemporaryFile
import pandas as pd
import imfp
import webbrowser


# Function to display a DataFrame in a web browser
def view_dataframe_in_browser(df):
    html = df.to_html()
    with NamedTemporaryFile(delete=False, mode="w", suffix=".html") as f:
        url = "file://" + f.name
        f.write(html)
    webbrowser.open(url)


# Function to load databases from CSV or fetch from API
def load_or_fetch_databases():
    csv_path = os.path.join("data", "databases.csv")

    # Try to load from CSV
    if os.path.exists(csv_path):
        try:
            return pd.read_csv(csv_path)
        except Exception as e:
            print(f"Error loading CSV: {e}")

    # If CSV doesn't exist or couldn't be loaded, fetch from API
    print("Fetching databases from IMF API...")
    databases = imfp.imf_databases()

    # Save to CSV for future use
    databases.to_csv(csv_path, index=False)
    print(f"Databases saved to {csv_path}")

    return databases


def load_or_fetch_parameters(database_name):
    pickle_path = os.path.join("data", f"{database_name}.pickle")

    # Try to load from pickle file
    if os.path.exists(pickle_path):
        try:
            with open(pickle_path, "rb") as f:
                return pickle.load(f)
        except Exception as e:
            print(f"Error loading pickle file: {e}")

    # If pickle doesn't exist or couldn't be loaded, fetch from API
    print(f"Fetching parameters for {database_name} from IMF API...")
    parameters = imfp.imf_parameters(database_name)

    # Save to pickle file for future use
    os.makedirs("data", exist_ok=True)  # Ensure the data directory exists
    with open(pickle_path, "wb") as f:
        pickle.dump(parameters, f)
    print(f"Parameters saved to {pickle_path}")

    return parameters


def load_or_fetch_dataset(database_id, indicator):
    file_name = f"{database_id}.{indicator}.csv"
    csv_path = os.path.join("data", file_name)

    # Try to load from CSV file
    if os.path.exists(csv_path):
        try:
            return pd.read_csv(csv_path)
        except Exception as e:
            print(f"Error loading CSV file: {e}")

    # If CSV doesn't exist or couldn't be loaded, fetch from API
    print(f"Fetching dataset for {database_id}.{indicator} from IMF API...")
    dataset = imfp.imf_dataset(database_id=database_id, indicator=[indicator])

    # Save to CSV file for future use
    os.makedirs("data", exist_ok=True)  # Ensure the data directory exists
    dataset.to_csv(csv_path, index=False)
    print(f"Dataset saved to {csv_path}")

    return dataset

```

## Dependencies
Here is a brief introduction about the packages used:

`pandas`: view and manipulate data frame

`matplotlib.pyplot`: make plots

`seaborn`: make plots

`numpy`: computation

`LinearRegression`: implement linear regression

`tabulate`: format data into tables

`statsmodels.api`, `adfuller`, `ARIMA`,`VAR`,`plot_acf`,`plot_pacf`,`mean_absolute_error`,`mean_squared_error`, and`grangercausalitytests` are specifically used for time series analysis. 

```{python}
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.linear_model import LinearRegression
from tabulate import tabulate
import statsmodels.api as sm
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.vector_ar.var_model import VAR
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from statsmodels.tsa.stattools import grangercausalitytests
```

## Data Fetching
In this section, we extracted four datasets through API calls: Gender Inequality Index(GII), GDP Deflator, Nominal GDP, and Population.  

```{python}
from pathlib import Path
Path("data").mkdir(exist_ok=True)
```

```{python}
# Load or fetch databases
databases = load_or_fetch_databases()

# Filter out databases that contain a year in the description
databases[
  ~databases['description'].str.contains(r"[\d]{4}", regex=True)
]

# view_dataframe_in_browser(databases)
```

Two databases were used: Gender Equality and International Financial Statistics (IFS).

```{python}
databases[databases['database_id'].isin(['GENDER_EQUALITY','IFS'])]
```

Parameters are dictionary key names to make requests from the databases. "freq" stands for Frequency, such as Annual, Monthly, or Quarterly. "ref_area" stands for Geogrpahical Area, such as US (United States), JP (Japan), and GB (United Kindom). "indicator" refers to the code representing a specific dataset in the database. For example, if we display all the indicators for IFS database, the GDP deflator dataset has an input code of "NGDP_D_SA_IX" with a full name description of Gross Domestic Product, Deflator, Seasonally Adjusted, Index. 

```{python}
datasets = ["GENDER_EQUALITY", "IFS"]
params = {}

# Fetch valid parameters for two datasets
for dataset in datasets:
    params[dataset] = load_or_fetch_parameters(dataset)

    valid_keys = list(params[dataset].keys())
    print(f"Parameters for {dataset}: ", valid_keys)
```

We paired the database with the specific dataset indicator to read and store the csv file. 

```{python}
datasets = {}
dsets = [("GENDER_EQUALITY", "GE_GII"), 
("IFS", "NGDP_D_SA_IX"), 
("IFS", "NGDP_XDC"), 
("IFS", "LP_PE_NUM")]

for dset in dsets:
    datasets[dset[0] + "." + dset[1]] = load_or_fetch_dataset(dset[0], dset[1])
```


```{python}
# "Gender Inequality Index"
GII = "GENDER_EQUALITY.GE_GII"

# "Gross Domestic Product, Deflator, Seasonally Adjusted, Index"
GDP_deflator = "IFS.NGDP_D_SA_IX"

# "Gross Domestic Product, Nominal, Domestic Currency"
GDP_nominal = "IFS.NGDP_XDC"

# "Population, Persons, Number of"
GDP_population = "IFS.LP_PE_NUM"

# Assign the datasets to new variables so we don't change the originals
GII_data = datasets[GII]
GDP_deflator_data = datasets[GDP_deflator]
GDP_nominal_data = datasets[GDP_nominal]
GDP_population_data = datasets[GDP_population]
```

## Feature Engineering
### Data Cleaning
Since the GDP deflator was reported on a quarterly basis, we converted it to a yearly basis.
```{python}
# Keep only rows with a partial string match for "Q4" in the time_period column
GDP_deflator_data = GDP_deflator_data[GDP_deflator_data
['time_period'].str.contains("Q4")]

```

```{python}
# Split the time_period into year and quarter and keep the year only
GDP_deflator_data.loc[:, 'time_period'] = GDP_deflator_data['time_period'].str[0:4]
```

We made all the variables numeric.
```{python}
datasets = [GII_data, GDP_deflator_data, GDP_nominal_data, GDP_population_data]

for i, dataset in enumerate(datasets):    
    # Use .loc to modify the columns
    datasets[i].loc[:, 'obs_value'] = pd.to_numeric(datasets[i]['obs_value'], 
    errors='coerce')
    datasets[i].loc[:, 'time_period'] = pd.to_numeric(datasets[i]['time_period'], 
    errors='coerce')
    datasets[i].loc[:, 'unit_mult'] = pd.to_numeric(datasets[i]['unit_mult'], 
    errors='coerce')
```

### GII Percent Change: Dependent Variable
We kept percents as decimals to make them easy to work with for calculation. Different countries have different baseline level of economic growth and gender equality. We calculated the percent change to make them comparable. 

Gender Inequality Index (GII) is a composite measure of gender inequality using three dimensions: reproducitve health, empowerment, and labor market. GII ranges from 0 to 1. While 0 indicates gender equality, 1 indicates gender inequality, possibly the worst outcome for one gender in all three dimensions. 

```{python}
# Calculate percent change for each ref_area
# First, create a copy and reset the index to avoid duplicate index issues
GII_data_sorted = GII_data.sort_values(
    ['ref_area', 'time_period']).reset_index(drop=True)
GII_data['pct_change'] = GII_data_sorted.groupby('ref_area')['obs_value'].pct_change()

# Display the first few rows of the updated dataset
GII_data.head()
```

We subset the data frame to keep only the columns we want:
```{python}
# Create a new dataframe with only the required columns
GII_data = GII_data[['ref_area', 'time_period', 'obs_value', 'pct_change']].copy()

GII_data = GII_data.rename(columns = {
    'ref_area': 'Country',
    'time_period': 'Time',
    'obs_value': 'GII',
    'pct_change': 'GII_change'
})

# Display the first few rows of the new dataset
GII_data.head()
```


### GDP Percent Change: Independent Variable
Real GDP per capita is a measure of a country's economic welfare or standard of living. It is a great tool comparing a country's economic development compared to other economies. Due to dataset access issue, we calculated Real GDP per capita by the following formula using GDP Deflator, Nominal GDP, and Population data:

$\text{Real GDP} = \frac{\text{Nominal GDP}}{\text{GDP Deflator Index}}\times 100$

$\text{Real GDP per capita} = \frac{\text{Real GDP}}{\text{Population}}$

GDP Deflator is a measure of price inflation and deflation with respect to a specific base year. The GDP deflator of a base year is equal to 100. A number of 200 indicates price inflation: the current year price of the good is twice its base year price. A number of 50 indicates price deflation: the current year price of the good is half its base year price. We kept the columns we want only for GDP-related datasets for easier table merging.

```{python}
# GDP Deflator Dataset
# Create a new dataframe with only the required columns
GDP_deflator_data = GDP_deflator_data[
    ['ref_area', 'time_period', 'unit_mult', 'obs_value']].copy()

# Display the first few rows of the new dataset
GDP_deflator_data.head()
```

Nominal GDP is the total value of all goods and services produced in a given time period. It is usually higher than Real GDP and does not take into account cost of living in different countries or price change due to inflation/deflation. 

```{python}
# GDP Nominal Data
# Create a new dataframe with only the required columns
GDP_nominal_data = GDP_nominal_data[
    ['ref_area', 'time_period', 'unit_mult','obs_value']].copy()

# Display the first few rows of the new dataset
GDP_nominal_data.head()
```

Population is the total number of people living in a country at a given time. This is where the "per capita" comes from. Real GDP is the total value of all goods and services produced in a country adjusted for inflation. Real GDP per capita is the total economic output per person in a country. 

```{python}
# GDP Population Data 
# Create a new dataframe with only the required columns
GDP_population_data = GDP_population_data[
    ['ref_area', 'time_period', 'unit_mult','obs_value']].copy()

# Display the first few rows of the new dataset
GDP_population_data.head()
```


```{python}
# Combine all the datasets above for further calculation
merged_df = pd.merge(pd.merge(GDP_deflator_data,GDP_nominal_data, 
on=['time_period', 'ref_area'], 
suffixes=('_index', '_nominal'), 
how='inner'), 
GDP_population_data, 
on=['time_period', 'ref_area'], 
how='inner')
```

We want to adjust GDP data based on unit multiplier. Unit multiplier stands for the number of zeroes we need to add to the value column. For example, in 1950, the observed population data for country GA (Georgia) was 473.296. With a unit muliplier of 3, the adjusted population would be 473296.

```{python}
merged_df['adjusted_index'] = merged_df['obs_value_index'] * (10 ** (merged_df
['unit_mult_index']))
merged_df['adjusted_nominal'] = merged_df['obs_value_nominal'] * (10 ** (merged_df
['unit_mult_nominal']))
merged_df['adjusted_population'] = merged_df['obs_value'] * (10 ** (merged_df
['unit_mult']))
```


```{python}
# Merged dataset
# Create a new dataframe with only the required columns
merged_df = merged_df[['ref_area', 'time_period',
'adjusted_nominal', 'adjusted_index', 'adjusted_population']].copy()

# Display the first few rows of the dataset
merged_df.head()
```

We wanted to compute the Real GDP per capita.
```{python}
# Step 1: Real GDP = (Nominal GDP / GDP Deflator Index)*100
merged_df['Real_GDP_domestic'] = (merged_df['adjusted_nominal'] / merged_df[
    'adjusted_index'])*100

# Step 2: Real GDP per Capita = Real GDP / Population
merged_df['Real_GDP_per_capita'] = merged_df['Real_GDP_domestic'] / merged_df[
    'adjusted_population']

# Rename columns
merged_df = merged_df.rename(columns= {
    "ref_area": "Country",
    "time_period": "Time",
    "adjusted_nominal": "Nominal",
    "adjusted_index": "Deflator",
    "adjusted_population": "Population",
    "Real_GDP_domestic": "Real GDP",
    "Real_GDP_per_capita": "Real GDP per Capita"
}
)
# Check the results
merged_df.head()
```

We calculated the percentage change in Real GDP per capita and put it in a new column.

```{python}
# Calculate percent change for each ref_area
merged_df[f'GDP_change'] = merged_df.sort_values(['Country', 'Time']).groupby(
    'Country')['Real GDP per Capita'].pct_change()

# Rename dataset
GDP_data = merged_df

# Display the first few rows of the dataset
GDP_data.head()
```


```{python}
# GII and GDP
# Merge the datasets
combined_data = pd.merge(GII_data, GDP_data, 
on=["Country", "Time"], 
how = "inner")

# Check the combined dataset
combined_data.head()
```

## Data Visualization
### Scatterplot
Scatterplot use dots to represent values of two numeric variables. The horizontal axis was the percent change in Real GDP per capita. The vertical axis was the percent change in Gender Inequality Index(GII). Different colors represented different countries. We used a linear regression line to display the overall pattern. 

Based on the scatterplot, it seemed like there was a slight positive relationship between GDP change and GII change as shown by the flat regression line. Gender inequality was decreasing (gender equality was improving) a little faster in country-years with low GDP growth and a little slower in country-years with high GDP growth.

```{python}
# Convert numeric columns to float
numeric_columns = [
    'GII', 'GII_change', 'Nominal', 'Deflator', 'Population', 
    'Real GDP', 'Real GDP per Capita', 'GDP_change'
]
for col in numeric_columns:
    combined_data[col] = pd.to_numeric(combined_data[col], errors='coerce')

# Count NAs
print(f"Dropping {combined_data[numeric_columns].isna().sum()} rows with NAs")

# Drop NAs
combined_data = combined_data.dropna(subset=numeric_columns)

# Plot the data points
plt.figure(figsize=(8, 6))
for country in combined_data['Country'].unique():
    country_data = combined_data[combined_data['Country'] == country]
    plt.scatter(country_data['GDP_change'], country_data['GII_change'],
             marker='o',linestyle='-', label=country)
plt.title('Country-Year Analysis of GDP Change vs. GII Change')
plt.xlabel('Percent Change in Real GDP per Capita (Country-Year)')
plt.ylabel('Percent Change in GII (Country-Year)')
plt.grid(True)

# Prepare data for linear regression
X = combined_data['GDP_change'].values.reshape(-1, 1)
y = combined_data['GII_change'].values

# Perform linear regression
reg = LinearRegression().fit(X, y)
y_pred = reg.predict(X)

# Plot the regression line
plt.plot(combined_data['GDP_change'], y_pred, color='red', linewidth=2)

plt.show()
```

### Time Series Line Plot
We created separate line plots for GDP change and GII change over time for a few key countries might show the trends more clearly. 

US: United States

JP: Japan

GB: United Kindom

FR: France

MX: Mexico

Based on the line plots, we saw GDP change and GII change have different patterns. For example, in Mexico, when there was a big change in real GDP per captia in 1995, the change in GII was pretty stable. 
```{python}
# Time Series Line plot for a few key countries
selected_countries  = ['US', 'JP', 'GB', 'FR', 'MX']
combined_data_selected = combined_data[combined_data['Country'].isin(selected_countries)]

# Set up the Plot Structure
fig, ax = plt.subplots(2, 1, figsize=(8, 6), sharex=True)

# Plot change in real GDP per capita over time
sns.lineplot(data = combined_data_selected, 
x = "Time", 
y = "GDP_change", 
hue = "Country", 
ax = ax[0])
ax[0].set_title("Percent Change in Real GDP per Capita Over Time")
ax[0].set_ylabel("Percent Change in Real GDP per Capita")

# Plot change in GII over time
sns.lineplot(data = combined_data_selected, 
x = "Time", 
y = "GII_change", 
hue = "Country", 
ax = ax[1])
ax[1].set_title("Percent Change in GII over Time")
ax[1].set_xlabel("Time")
ax[1].set_ylabel("GII")

plt.tight_layout
plt.show()
```

### Barplot
We used a barplot to show average changes in GII and GDP percent change for each country to visualize regions where inequality was improving or worsening. 

This plot supported our previous observation how GII change seemed to be not be correlated with GDP change. We also saw that, for country SI, Solvenia, there seems to be a large improvement in gender inequality.  

```{python}
# Barplot using average GII and GDP change
# Calculate average change for each country
combined_data_avg = combined_data.groupby('Country')[
    ['GII_change','GDP_change']].mean().reset_index()

# Prepare to plot structure 
plt.figure(figsize = (18,10))

# Create the barplot
combined_data_avg.plot(kind = 'bar', x = 'Country')
plt.ylabel('Average Change')
plt.xlabel('Country')
plt.legend(['GII change', 'GDP change'])
plt.grid(axis = 'y')

# Show the plot
plt.show()
```

### Boxplot
We used boxplot to visualize the distribution of GDP and GII change by country, providing information about spread, median, and potential outliers. To provide a more informative view, we sequenced countries in an ascending order by the median of percent change in GDP.

The boxplot displayed a slight upward trend with no obvious pattern between GDP and GII change. In coutries with higher GDP change median, they also tend to have a larger spread of the GDP change. The median of GII change remained stable regardless of the magnitude of GDP change, implying weak or no association between GDP and GII change. We observed a potential outlier for country SI, Solvenia, which may explained its large improvement in Gender inequality. 

```{python}
# Box plot for GII and GDP change
# Melt the dataframe to long format for combined boxplot
combined_data_melted = combined_data.melt(id_vars=['Country'], 
value_vars=['GII_change', 'GDP_change'], 
var_name='Change_Type', 
value_name='Value')

gdp_medians = combined_data.groupby('Country')['GDP_change'].median().sort_values()

combined_data_melted['Country'] = pd.Categorical(combined_data_melted['Country'], 
categories=gdp_medians.index, 
ordered= True)

# Prepare the plot structure
plt.figure(figsize=(8, 6))
sns.boxplot(data = combined_data_melted, 
x = "Country", 
y = 'Value', 
hue = 'Change_Type')
plt.title('Distribution of GII and GDP change by Country')
plt.xlabel('Country')
plt.ylabel('Change')
plt.legend(title = 'Change Type')

# Show the plot
plt.show()
```

### Correlation Matrix
We created a heatmap to show the relationship between GII and GDP change.

A positive correlation coefficient indicates a positive relationship: the larger the GDP change, the larger the GII change. A negative correlation coefficient indicates a negative relationship: the larger the GDP change, the smaller the GII change. A correlation coefficient closer to 0 indicates there is weak or no relationship. 

Based on the numeric values in the plot, there was a moderately strong positive correlation between GII and GDP change for country Estonia(EE) and Ireland(IE). 

```{python}
# Calculate the correlation
country_correlation = combined_data.groupby('Country')[
    ['GII_change', 'GDP_change']].corr().iloc[0::2, -1].reset_index(name='Correlation')

# Put the correlation value in a matrix format
correlation_matrix = country_correlation.pivot(index='Country', 
columns='level_1', 
values='Correlation')

# Check for NaN values in the correlation matrix
# Replace NaNs with 0 or another value as appropriate
correlation_matrix.fillna(0, inplace=True)  

# Set up the plot structure
# Adjust height to give more space for y-axis labels
plt.figure(figsize=(8, 12))  

# Plot the heatmap
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, 
cbar_kws={"shrink": .8}, 
linewidths=.5)

# Enhance axis labels and title
plt.title('Heatmap for GII and GDP Change', fontsize=20)
plt.xlabel('Variables', fontsize=16)
plt.ylabel('Country', fontsize=16)

# Improve readability of y-axis labels
plt.yticks(fontsize=12)  # Adjust the font size for y-axis labels

# Show the plot
plt.show()
```

## Statistical Analysis 
### Descriptive Statistics
There was a total of 915 data points. The mean of the GII change in -0.0314868, which indicated the overall grand mean percent change in gender inequality index is -3.15%. The mean of the GDP change was 0.0234633, showing the overall grand mean percent change in real GDP per capita was 2.35%. 
```{python}
# Generate summary statistics
combined_data.describe()
```

### Regression Analysis
Simple linear regression as a foundational approach provide us with a basic understanding of the relationship between GDP change and GII change. 

Based on the summary, we concluded the following:

* Becasue p-value = 0.057, if we set alpha, the significance level, to be 0.05, we failed to reject the null hypothesis and conclude there was no significant relationship between percent change in real GDP per capita and gender inequality index. 

* R-squared = 0.004. Only 0.4% of the variance in GII change could be explained by GDP change.

* We were 95% confident that the interval from -0.003 to 0.169 captured the true slope of GDP change. Because 0 was included, we are uncertain about the effect of GDP change on GII chnage.

```{python}
# Get column data type summaries of combined_data
combined_data.info()
```

```{python}
# Define independent and depenent variables
X = combined_data['GDP_change']
y = combined_data['GII_change']

# Add a constant to indepdent variable to include an intercept
X = sm.add_constant(X)

# Fit a simple linear regresion model and print out the summary
model = sm.OLS(y, X).fit()
model.summary()
```

### Time Series Analysis
Time series analysis allows us to explore how the relationship between GII and GDP change vary across different time periods, accounting for lagged effects. 

Here was a quick summary of the result:

* Both GII and GDP change time series were stationary.

* Past GII change values significantly influenced cuurent GII change values.

* VAR model had good model performance on forecasting future values based on historical data. 

* Changes in GDP did not cause/precde the changes in GII. 

#### ADF Test: Stationality Assumption Check
We wanted to use Augmented Dickey-Fuller (ADF) test to check whether a time series was stationary, which was the model assumption for many time series models. 

Stationarity implied constant mean and variance over time, making it more predictable and stable for forecasting. 

Based on the ADF test output, both GII and GDP change time series were stationary. We proceeded to the time series modeling section. 
```{python}
# Augmented Dickey-Fuller (ADF) test for stationarity check
# Create melted datasets
combined_data_time = combined_data.melt(id_vars=['Time', 'Country'], 
value_vars=['GII_change','GDP_change'], 
var_name = 'Change_Type', 
value_name = 'Value')
GII = combined_data_time[(combined_data_time['Change_Type'] == 'GII_change')]                         

GDP = combined_data_time[(combined_data_time['Change_Type'] == 'GDP_change')]

# Stationary Check
def adf_test(series):
    result = adfuller(series.dropna())
    print(f'ADF Statistic: {result[0]}')
    print(f'p-value: {result[1]}')
    if result[1] < 0.05:
        print("Series is stationary")
    else:
        print("Series is not stationary")

# Output the result
adf_test(GII['Value'])
adf_test(GDP['Value'])
```

#### VAR model: Examine variables separately
We fitted a VAR (Vector Autoreression) model to see the relationship between GII and GDP change. VAR is particularly useful when dealing with multivariate time series data and allows us to examine the interdependence between variables. 

Based on summary, here were several interpretations we could make:

* We used AIC as the criteria for model selection. Lower value suggests a better fit. 

* Given that we wanted to predict GII change, we focused on the first set "Results for equation GII_change."

* Past GII_change values significantly influenced current GII_change, as shown in the small p-values of lags 1 and 2. 

* Lag 2 of GDP_change had a relatively low p-value but is not statistically significant.

```{python}
# Split the dataset into training and testing sets
split_ratio = 0.7
split_index = int(len(combined_data) * split_ratio)

# Training set is used to fit the model
train_data = combined_data.iloc[:split_index]

# Testing set is used for validation
test_data = combined_data.iloc[split_index:]

print(f"Training data: {train_data.shape}")
print(f"Test data: {test_data.shape}")
```

```{python}
#| warning: false
# Fit a VAR model 
time_model = VAR(train_data[['GII_change', 'GDP_change']])
time_model_fitted = time_model.fit(maxlags = 15, ic="aic")

# Print out the model summary
time_model_fitted.summary()
```

#### VAR Model: Forecasting
We applied the model learned above to the test data. Based on the plot, the forecast values seem to follow the actual data well, indicating a good model fit caputuring the underlying trends.
```{python}
# Number of steps to forecast (length of the test set)
n_steps = len(test_data)

# Get the last values from the training set for forecasting
forecast_input = train_data[
    ['GII_change', 'GDP_change']].values[-time_model_fitted.k_ar:]

# Forecasting
forecast = time_model_fitted.forecast(y=forecast_input, steps=n_steps)

# Create a DataFrame for the forecasted values
forecast_df = pd.DataFrame(forecast, index=test_data.index, 
columns=['GII_forecast', 'GDP_forecast'])

# Ensure the index of the forecast_df matches the test_data index
forecast_df.index = test_data.index
```

```{python}
plt.figure(figsize=(8, 6))
plt.plot(train_data['GII_change'], label='Training GII', color='blue')
plt.plot(test_data['GII_change'], label='Actual GII', color='orange')
plt.plot(forecast_df['GII_forecast'], label='Forecasted GII', color='green')
plt.title('GII Change Forecast vs Actual')
plt.legend()
plt.show()

plt.figure(figsize=(8, 6))
plt.plot(train_data['GDP_change'], label='Training GDP', color='blue')
plt.plot(test_data['GDP_change'], label='Actual GDP', color='orange')
plt.plot(forecast_df['GDP_forecast'], label='Forecasted GDP', color='green')
plt.title('GDP Change Forecast vs Actual')
plt.legend()
plt.show()
```

#### VAR Model: Model Performance
Low values of both MAE and RMSE indicate good model performance with small average errors in predictions. 
```{python}
mae_gii = mean_absolute_error(test_data['GII_change'], forecast_df['GII_forecast'])
mae_gdp = mean_absolute_error(test_data['GDP_change'], forecast_df['GDP_forecast'])

print(f'Mean Absolute Error for GII: {mae_gii}')
print(f'Mean Absolute Error for GDP: {mae_gdp}')
```


```{python}
rmse_gii = np.sqrt(mean_squared_error(test_data['GII_change'], 
forecast_df['GII_forecast']))
rmse_gdp = np.sqrt(mean_squared_error(test_data['GDP_change'], 
forecast_df['GDP_forecast']))

print(f'RMSE for GII: {rmse_gii}')
print(f'RMSE for GDP: {rmse_gdp}')
```

#### VAR Model: Granger causality test
Granger causality test evaluates whether one time series can predict another. 

Based on the output, the lowest p-value is when lag = 2. However, because p-value > 0.05, we fail to reject the null hypothesis and conclude the GDP_change does not Granger-cause the GII_change.
```{python}
#| warning: false
# Perform the Granger causality test
max_lag = 3
test_result = grangercausalitytests(train_data[['GII_change', 'GDP_change']], max_lag,
 verbose=True)

```

## Conclusion
In wrapping up our analysis, we found no evidence to support a significant relationship between the Change in Real GDP per capita and the Change in the Gender Inequality Index (GII). This suggests that economic growth may not have a direct impact on gender equality. However, our findings open the door to questions for future research. 

## Future Directions
First, we must consider what other factors might influence the relationship between GDP and GII change. The GII is a composite index, shaped by a myriad of social factors, including cultural norms, legal frameworks, and environmental shifts. Future studies could benefit from incorporating additional predictors into the analysis and exploring the interaction between economic growth and gender equality within specific country contexts.

Second, there's potential to enhance the predictive power of our Vector Autoregression (VAR) time series model. While we established that GDP change does not cause GII change, our model performed well in forecasting trends for both variables independently. In practice, policymakers may want to forecast GII trends independently of GDP if they are implementing gender-focused policies. Future research could investigate time series modeling to further unravel the dynamics of GII and GDP changes.

So, as we wrap up this chapter, let's keep our curiosity alive and our questions flowing. After all, every end is just a new beginning in the quest for knowledge!


## About the Author

<div style="display: flex; align-items: start; gap: 20px;">
<img src="static/Headshot.jpg" alt="Jenny Xu" style="width: 200px; border-radius: 10px;"/>
<div>
Hi there! My name is Jenny, and I'm a third-year student at University of California, Davis, double majoring in Statistics and Psychology. I've always been interested in becoming a data analyst working in tech, internet, or research industries. Interning at Promptly Technologies helped me learn a ton. A quick fun fact for me is that my MBTI is ISFJ (Defender)!

<div style="margin-top: 15px;">
<a href="mailto:yzxxu@ucdavis.edu" style="text-decoration: none; margin-right: 15px;">
  <svg xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"><path d="M4 4h16c1.1 0 2 .9 2 2v12c0 1.1-.9 2-2 2H4c-1.1 0-2-.9-2-2V6c0-1.1.9-2 2-2z"></path><polyline points="22,6 12,13 2,6"></polyline></svg>
  Email
</a>
<a href="https://www.linkedin.com/in/jenny-xu-28519a273/" style="text-decoration: none; margin-right: 15px;">
  <svg xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"><path d="M16 8a6 6 0 0 1 6 6v7h-4v-7a2 2 0 0 0-2-2 2 2 0 0 0-2 2v7h-4v-7a6 6 0 0 1 6-6z"></path><rect x="2" y="9" width="4" height="12"></rect><circle cx="4" cy="4" r="2"></circle></svg>
  LinkedIn
</a>
<a href="https://github.com/jennyyzxu" style="text-decoration: none;">
  <svg xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"><path d="M9 19c-5 1.5-5-2.5-7-3m14 6v-3.87a3.37 3.37 0 0 0-.94-2.61c3.14-.35 6.44-1.54 6.44-7A5.44 5.44 0 0 0 20 4.77 5.07 5.07 0 0 0 19.91 1S18.73.65 16 2.48a13.38 13.38 0 0 0-7 0C6.27.65 5.09 1 5.09 1A5.07 5.07 0 0 0 5 4.77a5.44 5.44 0 0 0-1.5 3.78c0 5.42 3.3 6.61 6.44 7A3.37 3.37 0 0 0 9 18.13V22"></path></svg>
  GitHub
</a>
</div>

</div>
</div>