Iron Ore Price Forecasting: A Time Series Approach with the Pmdarima Python Package

A Data Science Project for DSND Udacity

Rejane Oliveira
9 min read · Oct 17, 2021
Reference: From Bust to Boom: Visualizing the Rise in Commodity Prices (visualcapitalist.com)

Project Definition

There was a notable rise in commodity prices recently, and iron ore is one of the most important among them. Iron ore is the raw material for steel, one of the most used materials on the planet. So what is the next price of iron ore? Well… can we predict it?

The article from Visual Capitalist shows us the importance of commodities: “If you ever wonder why commodities are important, just think of an object around you and ask yourself — what’s that made of?”

What if the prices of these all-important things change? Knowing commodity prices in advance is strategic for many sectors of the economy: from daily gains in financial markets to long-term planning of capital investments in industry.

The steps to analyze, model, evaluate, and forecast iron ore prices using classical time-series approaches with the Python package pmdarima are detailed in this blog post.

The Jupyter notebook developed for this project can be found in the project's GitHub repo.

Problem Statement

The objective of this project is to forecast iron ore prices.

Methodology: A time-series approach with various classical techniques, with and without exogenous variables, will be tested. The Python packages statsmodels and pmdarima will be used for this purpose. Statistical tests for inferring stationarity and causality will be used to improve data understanding and support the modeling phase.

Metrics: To evaluate the models, the MAPE (mean absolute percentage error) will be used. MAPE is a measure of error, so low values are good and high values are bad. For business purposes, percentages are easier to understand than squared errors, so MAPE is a good metric for communicating results.
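For reference, MAPE is the mean of the absolute errors expressed as percentages of the actual values. A minimal sketch (the function name mape is ours, for illustration, not from the project notebook):

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

# Example: actuals [100, 120] vs. forecasts [90, 132] -> 10.0% MAPE
print(mape([100, 120], [90, 132]))
```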

Data Exploration and Visualization

Datasets

We investigated historical monthly data for iron ore and other possible exogenous variables from the demand perspective, such as steel prices, scrap prices (scrap can replace iron ore in steelmaking), and the GDP of China, one of the most important countries for the steel market.

The data were collected from several sources, such as the World Bank, the OECD, and stock exchange websites. The datasets are available here.

Data preprocessing involves joining all the variables (endogenous and exogenous) into the same dataframe. Before merging, however, it is necessary to put them at the same time frequency.
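A minimal sketch of this preprocessing step with pandas (the file names and original frequencies here are hypothetical, not the project's actual ones):

```python
import pandas as pd

# Hypothetical file names; each CSV has a date column and one value column
iron_ore = pd.read_csv("iron_ore.csv", index_col=0, parse_dates=True)
gdp_china = pd.read_csv("china_gdp.csv", index_col=0, parse_dates=True)

# Put everything at the same (monthly) frequency before merging
iron_ore_m = iron_ore.resample("MS").mean()  # e.g. daily -> monthly average
gdp_m = gdp_china.resample("MS").ffill()     # e.g. quarterly -> monthly, forward-filled

df = iron_ore_m.join(gdp_m, how="inner")     # one dataframe with all variables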

Figure 1 — Evolution of Iron Ore Prices and related materials over time.

Historical data on iron ore prices show an evolution over time (Figure 1), which is typical of a time series. Time series patterns involve trend, cycles, and seasonality. An excellent introduction to time series analysis can be found in this text here and in this blog post here.

Testing Stationarity

A stationary time series is one whose properties do not depend on the time at which the series is observed. Thus, time series with trends, or with seasonality, are not stationary — the trend and seasonality will affect the value of the time series at different times.

Figure 2 — Iron Ore Prices: Autocorrelation Plot (monthly, 2005–2021).

Some statistical tests were performed to assess the stationarity of the time series, including the augmented Dickey–Fuller (ADF) test and the Kwiatkowski–Phillips–Schmidt–Shin (KPSS) test. Note that their null hypotheses are opposite: the null hypothesis of the ADF test is that the series is non-stationary (it has a unit root), while the null hypothesis of the KPSS test is that the series is stationary. So for the ADF test, a p-value below the chosen significance level leads us to reject non-stationarity, whereas for the KPSS test a small p-value leads us to reject stationarity. When the series is non-stationary, a simple linear regression model cannot be applied directly, as the observations are not independent; a time-series approach, such as differencing the series to make it stationary, is more appropriate.

We found that the iron ore series is a non-stationary series.
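As an illustration, both tests are available in statsmodels. The sketch below uses a synthetic random walk as a stand-in for the real price series so that it runs on its own:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller, kpss

# Synthetic stand-in for the monthly iron ore price series:
# a random walk, which is non-stationary by construction
rng = np.random.default_rng(42)
iron_ore = pd.Series(100 + rng.normal(0, 5, 200).cumsum())

adf_p = adfuller(iron_ore)[1]                             # H0: non-stationary
kpss_p = kpss(iron_ore, regression="c", nlags="auto")[1]  # H0: stationary

print(f"ADF p-value:  {adf_p:.3f} (large p -> cannot reject non-stationarity)")
print(f"KPSS p-value: {kpss_p:.3f} (small p -> reject stationarity)")
```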

When data have a trend, the autocorrelations for small lags tend to be large and positive, because observations nearby in time are also nearby in size. So the autocorrelations of a trended time series tend to have positive values that slowly decrease as the lags increase. The autocorrelation plot in Figure 2 shows this pattern: the stronger correlation values (>0.5) extend to about 8 lags.
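A plot like Figure 2 can be produced with statsmodels (reusing the stand-in iron_ore series from the previous sketch):

```python
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf

# Trended series: expect slowly decaying, positive autocorrelations
plot_acf(iron_ore, lags=24)
plt.show()
```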

Testing Causality

Commodity prices are related to supply and demand in the markets. In the case of iron ore, prices can be influenced by the price and production of steel, but also by scrap prices, since scrap can to some extent replace iron ore. Steel production is related to the growth of the world's economies, especially China's. These variables can be considered exogenous variables to explain the evolution of iron ore prices. So we seem to have a multivariate time series: more than one variable that varies over time along with iron ore prices.

In order to check whether these exogenous variables are relevant to the problem, a causality test was carried out. The Granger causality test is an econometric hypothesis test for verifying whether one variable is useful in forecasting another in multivariate time series data at a particular lag (references here and here). The Granger causality tests applied to this case give us the following results (a sketch of the test appears after the list):
* Scrap prices do not Granger-cause iron ore prices.
* Steel prices from the 2nd and 3rd previous months Granger-cause iron ore prices.
* China GDP growth from the 1st and 6th previous months Granger-causes iron ore prices.
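The test is available in statsmodels as grangercausalitytests, which takes a two-column array and tests whether the second column Granger-causes the first. A minimal sketch with synthetic stand-in data:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import grangercausalitytests

# Synthetic stand-ins: "steel" leads "iron_ore" by two months, by construction
rng = np.random.default_rng(0)
steel = pd.Series(rng.normal(size=200)).cumsum()
iron_ore = steel.shift(2).fillna(0) + rng.normal(0, 0.5, 200)

# H0 at each lag k: "steel does NOT Granger-cause iron_ore";
# a small p-value at lag k rejects H0 for that lag
data = pd.DataFrame({"iron_ore": iron_ore, "steel": steel})
grangercausalitytests(data[["iron_ore", "steel"]], maxlag=6)
```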

Implementation

The historical data on iron ore prices and the relevant exogenous variables from the EDA step were prepared in a Python pandas DataFrame for both the EDA and modeling phases. In order to build a model to forecast iron ore prices, the classical time series approach was used.

Both ad-hoc hyperparameter selection and automatic tuning were used to build the models:

  • Simple modeling: ARIMA with the parameters selected in the EDA phase (inferred from the autocorrelation plot and the statistical tests), implemented with the statsmodels Python package.
  • Modeling with hyperparameter tuning: implemented with the auto_arima modeling function available in the pmdarima Python package.
  • Cross-validation modeling: implemented with the cross-validation functions available in the pmdarima Python package.

For the first two models, the data were divided into training and testing datasets in a proportion of 90% and 10%, respectively. For the model using cross-validation, 12 folds were created with sliding windows.
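For the train-test split, pmdarima offers a helper that preserves temporal order. A sketch with synthetic stand-ins for the price series and the exogenous regressors (the column names are ours, for illustration):

```python
import numpy as np
import pandas as pd
from pmdarima.model_selection import train_test_split

# Synthetic stand-ins for the monthly prices and the exogenous regressors
rng = np.random.default_rng(1)
n = 200
y = pd.Series(100 + rng.normal(0, 3, n).cumsum(), name="iron_ore")
X = pd.DataFrame({"steel_lag2": rng.normal(size=n), "gdp_lag1": rng.normal(size=n)})

# 90% / 10% split, keeping the temporal order (no shuffling)
y_train, y_test, X_train, X_test = train_test_split(y, X, train_size=0.9)
```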

Refinement

The first model tried was an ARIMAX model with parameters p=8 (autoregressive order) and d=1 (order of differencing required to make the time series stationary), plus the best lags for the exogenous variables. All of these settings were inferred from the EDA phase.
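A minimal sketch of how this first model might be fit with statsmodels, reusing the hypothetical y_train/X_train split from the previous sketch (the actual lag construction of the exogenous columns is not shown):

```python
from statsmodels.tsa.arima.model import ARIMA

# ARIMA(p=8, d=1, q=0) with exogenous regressors, i.e. an ARIMAX model
model = ARIMA(y_train, exog=X_train, order=(8, 1, 0)).fit()
pred = model.forecast(steps=len(y_test), exog=X_test)
```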

The evaluation metric MAPE is 25.7%.

Figure 3 shows the actuals vs. forecasts plot for this first model. As can be seen, forecasts do not follow the trend of actual observations.

Figure 3 — Iron Ore Prices: Actuals vs. Forecasts for first model — ARIMA(8,1,0).

By looking at the MAPE metric and the forecasts in Figure 3, the first model doesn't seem to be a good one. This result led us to try some improvements, such as grid search for hyperparameter tuning and cross-validation techniques. The statsmodels package was then replaced by the pmdarima package, which is more suitable for this purpose.

The pmdarima Python package, as you can read in the documentation, is essentially a Python & Cython wrapper of several different statistical and machine learning libraries (statsmodels and scikit-learn), and operates by generalizing all ARIMA models into a single class (unlike statsmodels). It does this by wrapping the respective statsmodels interfaces (ARMA, ARIMA and SARIMAX) inside the pmdarima.ARIMA class.

The documentation explains that the auto_arima function itself operates a bit like a grid search, in that it tries various sets of p and q (also P and Q for seasonal models) parameters, selecting the model that minimizes the AIC (or BIC, or whatever information criterion you select). To select the differencing terms, auto_arima uses a test of stationarity (such as an augmented Dickey-Fuller test) and seasonality (such as the Canova-Hansen test) for seasonal models.

The auto_arima function from the pmdarima package was applied to the iron ore price forecasting problem, and the best model was found to be an ARIMAX(0,1,2); a sketch of the call appears after the list below. So the parameters are:

  • p: Autoregressive order is 0
  • d: Differencing order is 1
  • q: Moving average order is 2
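A sketch of how auto_arima might be invoked here, reusing the hypothetical y_train/X_train from the split sketch above (the search bounds are ours, for illustration):

```python
import pmdarima as pm

# auto_arima searches over p and q, picks d via a stationarity test,
# and keeps the model that minimizes the chosen information criterion
auto_model = pm.auto_arima(
    y_train, X=X_train,
    start_p=0, max_p=8, start_q=0, max_q=3,
    d=None,                  # let the stationarity test choose d
    seasonal=False,
    information_criterion="aic",
    stepwise=True,
    suppress_warnings=True,
)
print(auto_model.order)      # (0, 1, 2) was the order found in this project
```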

The results are discussed in the next section.

Model Evaluation and Validation

The auto_arima function performs a grid-search-like procedure to find the best parameters. By using a train-test split of 90% and 10%, respectively, the MAPE is found to be 24%, a small improvement over the first model (MAPE = 26%).

Cross-validation techniques are also available in the pmdarima package. Cross-validation was implemented with 12 folds (sliding windows). The evaluation metric for the final model with cross-validation gave a MAPE (mean absolute percentage error) of 21%, a larger improvement over the first model.
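A sketch of how the cross-validated score might be computed with pmdarima, reusing the hypothetical y and X from the split sketch above. The window geometry here is ours: with a 200-point series, a 100-point window, a 12-step horizon and a step of 8 yield 12 sliding-window folds:

```python
import pmdarima as pm
from pmdarima.model_selection import SlidingWindowForecastCV, cross_val_score
from sklearn.metrics import mean_absolute_percentage_error

cv = SlidingWindowForecastCV(window_size=100, h=12, step=8)  # hypothetical sizes
scores = cross_val_score(
    pm.ARIMA(order=(0, 1, 2)),
    y, X=X,
    scoring=mean_absolute_percentage_error,  # returns a fraction, e.g. 0.21 for 21%
    cv=cv,
)
print(scores.mean())
```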

Table 1- Models Results Summary.

The Forecasts vs. Actuals values are shown in Figure 4. When the plots from Figure 3 and Figure 4 are contrasted, the improvement of the final model is impressive.

Figure 4 — Iron Ore Prices: Actuals x Forecasts, cross-validated.

The normality test for the residuals showed a p-value of 0.08, so we cannot reject the null hypothesis at the 5% significance level, and we consider the residuals to be normally distributed, although the evidence is weak.
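The post does not name the specific normality test used; as one possibility, scipy's D'Agostino-Pearson test could be applied to the residuals of the fitted model (auto_model here refers to the earlier sketch):

```python
from scipy import stats

# H0: the residuals are normally distributed
stat, p = stats.normaltest(auto_model.resid())
print(p)  # p > 0.05 -> cannot reject normality at the 5% level
```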

Justification

The adoption of grid search and cross-validation techniques led to an improvement over the initial approach on the chosen metric, MAPE. Grid search and cross-validation are well-known techniques for improving models, and the pmdarima package makes their implementation much easier.

In this project, cross-validation with k = 12 was used under a sliding-window split, since we are dealing with time series data. However, the best number of folds and the most appropriate split technique (rolling forecast vs. sliding windows, for example) were not studied in depth; their parameters were selected solely by the analyst.

Reflection

In this project, an iron ore price forecasting model was built using the classical time series approach. Exogenous variables, related to commodity demand, were incorporated to enrich the model. Statistical tests of stationarity and causality were used to understand the characteristics of the endogenous and exogenous variables. We started from the ARIMA(8,1,0) model, based on the analysis of the autocorrelation plot (EDA phase). Then the ARIMA model with exogenous variables was evaluated. Finally, the use of grid search and cross-validation techniques led to the final ARIMAX(0,1,2) model. The MAPE evaluation metric dropped from 26% for the first alternative to 21% for the final version of the model, i.e., an improvement of 5 percentage points.

Although the chosen metric, MAPE, improves throughout the experiments, it remains high: an error of around 20% is significant. This is characteristic of the problem, as price volatility in commodity markets is quite high and difficult to predict even for the most skilled researchers.

The exogenous variables that can explain the problem are also difficult to choose, and more input from domain experts should be sought so that other variables can be evaluated in light of domain knowledge.

Improvement

In the search for better results, in addition to the classical techniques used, experiments can be carried out using non-linear techniques such as NARMAX and ensemble methods like random forests. The XGBoost and LightGBM models might be good candidates to try next.
