Time-series forecasting

Time-series predictions can be used to:

  • Forecast cloud infrastructure expenses next quarter
  • Forecast the value of a given stock in the future
  • Forecast the number of units of a product likely to be sold next quarter
  • Forecast the remaining lifespan of an IoT device
  • Forecast the number of taxi or ride share drivers necessary for a big holiday evening

Time-series forecasting alone is a powerful tool, but time-series data joined with business data can be a competitive advantage for any developer. TimescaleDB is PostgreSQL for time-series data, so time-series data stored in TimescaleDB can easily be joined with business data in another relational database to build a more insightful forecast of how your data (and business) changes over time.

This time-series forecasting example demonstrates how to integrate TimescaleDB with R, Apache MADlib, and Python to perform various time-series forecasting methods. It uses New York City taxicab data that is also used in the Hello Timescale Tutorial. The dataset contains information about all yellow cab trips in New York City in January 2016, including pickup and dropoff times, GPS coordinates, and total price of a trip. You can extract some interesting insights from this rich dataset, build a time-series forecasting model, and explore the use of various forecasting and machine learning tools.

Prerequisites:

First, let’s create the schema and populate the tables. Download the file and execute the following command:

The file contains SQL statements that create three TimescaleDB hypertables rides_count, rides_length and rides_price. Let’s look at how to create the rides_count table as an example. Here is a portion of the code taken from forecast.sql:

    CREATE TABLE rides_count(
      one_hour TIMESTAMP WITHOUT TIME ZONE NOT NULL,
      count NUMERIC
    );
    SELECT create_hypertable('rides_count', 'one_hour');

    INSERT INTO rides_count
      SELECT time_bucket_gapfill('1 hour', pickup_datetime, '2016-01-01 00:00:00','2016-01-31 23:59:59') AS one_hour,
        COUNT(*) AS count
      FROM rides
      -- ST_MakePoint takes (longitude, latitude); this point is Times Square
      WHERE ST_Distance(pickup_geom, ST_Transform(ST_SetSRID(ST_MakePoint(-73.9851,40.7589),4326),2163)) < 400
        AND pickup_datetime < '2016-02-01'
      GROUP BY one_hour
      ORDER BY one_hour;

Notice that rides_count is a TimescaleDB hypertable, which lets you take advantage of TimescaleDB’s faster insert and query performance with time-series data. PostgreSQL aggregate functions such as COUNT and the various PostGIS functions all work as usual with TimescaleDB. Here, PostGIS selects data points from the original rides table where the pickup location is less than 400 m from the GPS location (40.7589, -73.9851), which is Times Square. Note that ST_MakePoint takes the longitude first and the latitude second.
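To make the 400 m filter concrete, here is a minimal Python sketch of the same idea outside the database. It is not what PostGIS does internally: the SQL measures planar distance in a projected coordinate system (SRID 2163), while this sketch swaps in the haversine great-circle formula on a spherical Earth. The function names `haversine_m` and `near_times_square` are made up for illustration.

```python
import math

# (latitude, longitude) of Times Square, as given in the text
TIMES_SQUARE = (40.7589, -73.9851)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres, assuming a spherical Earth (radius 6,371 km)."""
    radius_m = 6_371_000.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * radius_m * math.asin(math.sqrt(a))


def near_times_square(lat, lon, radius_m=400.0):
    """True if the point lies within radius_m of Times Square (hypothetical helper)."""
    return haversine_m(lat, lon, *TIMES_SQUARE) < radius_m


# A pickup roughly 110 m north of Times Square, and one in the Financial District
print(near_times_square(40.7599, -73.9851))
print(near_times_square(40.7075, -74.0113))
```

For distances this short, the great-circle and projected results should be close, but they are not guaranteed to agree exactly at the 400 m boundary.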

The data comes from the NYC Taxi and Limousine Commission. It is missing data points for certain hours. You can gapfill the missing values with 0. To learn more, see the documentation. A similar method is used to create rides_length and rides_price.
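The idea behind hourly bucketing with gapfilling can be sketched in plain Python. This is only an illustration of the concept, not of how time_bucket_gapfill is implemented; the function name `hourly_counts_gapfilled` and the sample pickups are invented for the example.

```python
from collections import Counter
from datetime import datetime, timedelta


def hourly_counts_gapfilled(pickups, start, end):
    """Count pickups per hour in [start, end), emitting 0 for hours with no pickups."""
    # Truncate every timestamp to the start of its hour and count occurrences
    counts = Counter(ts.replace(minute=0, second=0, microsecond=0) for ts in pickups)
    rows = []
    bucket = start
    while bucket < end:
        rows.append((bucket, counts.get(bucket, 0)))  # 0 fills the gap
        bucket += timedelta(hours=1)
    return rows


# Made-up pickups: nothing was recorded during the 01:00 hour
pickups = [
    datetime(2016, 1, 1, 0, 5),
    datetime(2016, 1, 1, 0, 40),
    datetime(2016, 1, 1, 2, 15),
]
rows = hourly_counts_gapfilled(pickups, datetime(2016, 1, 1, 0), datetime(2016, 1, 1, 3))
for bucket, count in rows:
    print(bucket, count)
```

Without the gapfilling step, the 01:00 bucket would simply be absent, and a model that expects evenly spaced observations would see a distorted series.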

Before you move onto the next few sections, check that the following tables are in your database.

    \dt
                 List of relations
     Schema |      Name       | Type  |  Owner
    --------+-----------------+-------+----------
     public | payment_types   | table | postgres
     public | rates           | table | postgres
     public | rides           | table | postgres
     public | rides_count     | table | postgres
     public | rides_length    | table | postgres
     public | rides_price     | table | postgres
     public | spatial_ref_sys | table | postgres
    (7 rows)

The ARIMA (autoregressive integrated moving average) model is a tool that is often used in time-series analysis to better understand a dataset and make predictions on future values. ARIMA models can be broadly categorized as seasonal and non-seasonal. Seasonal ARIMA models are used for datasets that have characteristics that repeat over fixed periods of time. For example, a dataset of hourly temperature values over a week has a seasonal component with a period of 1 day, since the temperature goes up during the day and down overnight every day. In contrast, the price of Bitcoin over time is (probably) non-seasonal, since there is no clear observable pattern that recurs in fixed time periods.
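One informal way to check for a seasonal component is to look at the autocorrelation of the series at the suspected period. The sketch below does this in plain Python for a synthetic hourly temperature series; the data and the `autocorrelation` helper are assumptions made for the example, not part of the tutorial’s dataset or tooling.

```python
import math


def autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    variance = sum((x - mean) ** 2 for x in series)
    covariance = sum((series[i] - mean) * (series[i + lag] - mean) for i in range(n - lag))
    return covariance / variance


# Synthetic hourly temperatures for one week with a clean daily cycle
temps = [15 + 5 * math.sin(2 * math.pi * hour / 24) for hour in range(24 * 7)]

acf_24 = autocorrelation(temps, 24)  # one full day apart: values line up
acf_12 = autocorrelation(temps, 12)  # half a day apart: values are opposed
print(round(acf_24, 2), round(acf_12, 2))
```

A strongly positive value at lag 24 and a strongly negative value at lag 12 is what a daily cycle looks like; a non-seasonal series would show no such regular pattern across lags.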

This tutorial uses R to analyze the seasonality of the number of taxicab pickups at Times Square over a week.

The table rides_count contains the data needed for this section of the tutorial. rides_count has two columns one_hour and count. The one_hour column is a TimescaleDB time_bucket for every hour from January 1 to January 31. The count column is the number of pickups from Times Square during each hourly period.

    SELECT * FROM rides_count;
          one_hour       | count
    ---------------------+-------
     2016-01-01 00:00:00 |   176
     2016-01-01 01:00:00 |   218
     2016-01-01 02:00:00 |   221
     2016-01-01 03:00:00 |   344
     2016-01-01 04:00:00 |   397
     2016-01-01 05:00:00 |   269
     2016-01-01 06:00:00 |   192
     2016-01-01 07:00:00 |   145
     2016-01-01 08:00:00 |   166
     2016-01-01 09:00:00 |   233
     2016-01-01 10:00:00 |   295
     2016-01-01 11:00:00 |   440
     2016-01-01 12:00:00 |   472
     2016-01-01 13:00:00 |   472
     2016-01-01 14:00:00 |   485
     2016-01-01 15:00:00 |   538
     2016-01-01 16:00:00 |   430
     2016-01-01 17:00:00 |   451
     2016-01-01 18:00:00 |   496
     2016-01-01 19:00:00 |   538
     2016-01-01 20:00:00 |   485
     2016-01-01 21:00:00 |   619
     2016-01-01 22:00:00 |  1197
     2016-01-01 23:00:00 |   798
     ...

Create two PostgreSQL views, rides_count_train and rides_count_test for the training and testing datasets.

    -- Make the training dataset
    CREATE VIEW rides_count_train AS
    SELECT * FROM rides_count
    WHERE one_hour <= '2016-01-21 23:59:59';

    -- Make the testing dataset
    CREATE VIEW rides_count_test AS
    SELECT * FROM rides_count
    WHERE one_hour >= '2016-01-22 00:00:00';

R has an RPostgres package which allows you to connect to your database from R. The code below establishes a connection to the PostgreSQL database nyc_data. You can connect to a different database simply by changing the parameters of dbConnect. The final line of code should print out a list of all tables in your database. This means that you have successfully connected and are ready to query the database from R.

    # Install the RPostgres package and load DBI
    install.packages("RPostgres")
    library("DBI")

    # creates a connection to the postgres database
    con <- dbConnect(RPostgres::Postgres(), dbname = "nyc_data",
          host = "localhost",
          user = "postgres")

    # list tables in database to verify connection
    dbListTables(con)

You can query the database with SQL code inside R. Putting the query result in an R data frame allows you to analyze the data using tools provided by R.

    # query the database and input the result into an R data frame
    # training dataset with data 2016/01/01 - 2016/01/21
    count_rides_train_query <- dbSendQuery(con, "SELECT * FROM rides_count_train;")
    count_rides_train <- dbFetch(count_rides_train_query)
    dbClearResult(count_rides_train_query)
    head(count_rides_train)
                 one_hour count
    1 2016-01-01 00:00:00   176
    2 2016-01-01 01:00:00   218
    3 2016-01-01 02:00:00   221
    4 2016-01-01 03:00:00   344
    5 2016-01-01 04:00:00   397
    6 2016-01-01 05:00:00   269

    # testing dataset with data 2016/01/22 - 2016/01/31
    count_rides_test_query <- dbSendQuery(con, "SELECT * FROM rides_count_test")
    count_rides_test <- dbFetch(count_rides_test_query)
    dbClearResult(count_rides_test_query)
    head(count_rides_test)
                 one_hour count
    1 2016-01-22 00:00:00   702
    2 2016-01-22 01:00:00   401
    4 2016-01-22 03:00:00   169
    5 2016-01-22 04:00:00   140
    6 2016-01-22 05:00:00   100

To feed the data into an ARIMA model, you must first convert the data frame into a time-series object in R. The xts package allows you to do this easily. You also set the frequency of the time-series object to 168. This is because the number of pickups is expected to fluctuate with a fixed pattern every week, and there are 168 hours in a week, or in other words, 168 data points in each seasonal period. If you want to model the data as having a seasonality of 1 day, you can change the frequency parameter to 24.

    # Install and load xts package
    install.packages("xts")
    library("xts")

    # convert the data frame into time-series
    xts_count_rides <- xts(count_rides_train$count, order.by = as.POSIXct(count_rides_train$one_hour, format = "%Y-%m-%d %H:%M:%S"))

    # set the frequency of series as weekly 24 * 7
    attr(xts_count_rides, 'frequency') <- 168

The forecast package in R provides a useful function auto.arima, which automatically finds the best ARIMA parameters for the dataset. Set the parameter D, which captures the seasonality of the model, to 1 to force the function to find a seasonal model. This calculation can take a while to compute (in this dataset, around five minutes). Once the computation is complete, you can save the output of the auto.arima function into fit and get a summary of the ARIMA model that has been created.

    # forecast future values using the arima model, h specifies the number of readings to forecast
    fcast <- forecast(fit, h=168)
    fcast
             Point Forecast     Lo 80    Hi 80        Lo 95    Hi 95
    4.000000       659.0645 566.71202 751.4169 517.82358229 800.3053
    4.005952       430.7339 325.02891 536.4388 269.07209741 592.3956
    4.011905       268.1259 157.28358 378.9682  98.60719504 437.6446
    4.017857       228.3024 116.08381 340.5210  56.67886523 399.9260
    4.023810       200.7340  88.25064 313.2174  28.70554423 372.7625
    4.029762       140.5758  28.04128 253.1103 -31.53088134 312.6824
    4.035714       196.1703  83.57555 308.7650  23.97150358 368.3690
    4.041667       282.6171 169.80545 395.4288 110.08657346 455.1476
    4.047619       446.6713 333.28115 560.0614 273.25604289 620.0865
    4.053571       479.9449 365.53618 594.3537 304.97184340 654.9180
    ...
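The left-hand column of the forecast output is a position in the seasonal time series rather than a date. With a frequency of 168, each step is 1/168 of a week, and three weeks of training data end just before index 4. The Python sketch below converts such an index back to a timestamp, assuming the index is 1 at the first training observation (2016-01-01 00:00), which is consistent with the forecast starting at 4.000000. The helper name `index_to_timestamp` is made up for illustration.

```python
from datetime import datetime, timedelta

SERIES_START = datetime(2016, 1, 1)  # first training observation (assumed to sit at index 1)
FREQUENCY = 168                      # observations per seasonal period (hours in a week)


def index_to_timestamp(index):
    """Map a forecast index such as 4.005952 to the hour it refers to."""
    # Rounding absorbs the six-decimal truncation in the printed index
    hours_since_start = round((index - 1) * FREQUENCY)
    return SERIES_START + timedelta(hours=hours_since_start)


for idx in (4.000000, 4.005952, 4.053571):
    print(idx, "->", index_to_timestamp(idx))
```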

The output of forecast can be hard to decipher. You can plot the forecasted values with the code below:

    # plot the values forecasted
    plot(fcast, include = 168, main="Taxicab Pickup Count in Times Square by Time", xlab="Date", ylab="Pickup Count", xaxt="n", col="red", fcol="blue")
    ticks <- seq(3, 5, 1/7)
    dates <- seq(as.Date("2016-01-15"), as.Date("2016-01-29"), by="days")
    dates <- format(dates, "%m-%d %H:%M")
    axis(1, at=ticks, labels=dates)
    legend('topleft', legend=c("Observed Value", "Predicted Value"), col=c("red", "blue"), lwd=c(2.5,2.5))

    # plot the observed values from the testing dataset
    count_rides_test$x <- seq(4, 4 + 239 * 1/168, 1/168)
    count_rides_test <- subset(count_rides_test, count_rides_test$one_hour < as.POSIXct("2016-01-29"))
    lines(count_rides_test$x, count_rides_test$count, col="red")

In the resulting graph, the grey area around the blue prediction line is the prediction interval, which shows the uncertainty of the prediction, while the red line is the actual observed pickup count. The number of pickups on Saturday, January 23 is zero because the data is missing for this period of time.
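The Lo and Hi columns of the forecast output relate to each other in a simple way if the forecast errors are assumed to be normally distributed: each bound is the point forecast plus or minus a z-score times a standard error. Under that assumption, the sketch below recovers the standard error from the 80% bounds of the first forecast row shown earlier and uses it to rebuild the 95% bounds. The `interval` helper is a made-up name for this example.

```python
from statistics import NormalDist


def interval(point, std_error, level):
    """Two-sided prediction interval under a normal error assumption."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return point - z * std_error, point + z * std_error


# First row of the forecast output shown earlier
point, lo80, hi80 = 659.0645, 566.71202, 751.4169

# Recover the standard error from the width of the 80% interval
z80 = NormalDist().inv_cdf(0.90)
std_error = (hi80 - lo80) / (2 * z80)

# Rebuild the 95% interval and compare with the Lo 95 / Hi 95 columns
lo95, hi95 = interval(point, std_error, 0.95)
print(round(lo95, 2), round(hi95, 2))
```

The rebuilt bounds agree with the Lo 95 and Hi 95 values in that row to within rounding. Note that a normal interval can extend below zero, as the Lo 95 column does for one row, even though a pickup count cannot be negative.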

You might find that the prediction for January 22 matches impressively with the observed values, but the prediction overestimates for the following days. It is clear that the model has captured the seasonality of the data, as you can see the forecasted values of the number of pickups drop dramatically overnight from 1 AM, before rising again from around 6 AM. There is a noticeable increase in the number of pickups in the afternoon compared to the morning, with a slight dip around lunchtime and a sharp peak around 6 PM when presumably people take cabs to return home after work.

While these findings do not reveal anything completely unexpected, it is still valuable to have the analysis verify your expectations. The ARIMA model is not perfect, as the anomalous prediction for January 25 shows. The model uses the previous week’s data to make predictions. January 18, 2016 was Martin Luther King Jr. Day, so the distribution of ride pickups throughout that day is slightly different from that of a standard Monday. The holiday probably affected riders’ behavior on the surrounding days too. The model does not account for anomalous data that arises from holidays, and you need to keep this in mind before reaching a conclusion. Simply removing such anomalous data, for example by using only the first two weeks of January, might have led to a more accurate prediction. This demonstrates the importance of understanding the context behind your data.

Although R offers a rich library of statistical models, it requires importing the data into R before performing calculations. With a larger dataset, marshalling and transferring all the data to the R process can become a bottleneck, and the R process itself might run out of memory and start swapping. So, let’s look at an alternative method that moves the computation into the database to improve performance.

MADlib is an open source library for in-database data analytics that provides a wide collection of popular machine learning methods and various supplementary statistical tools.

MADlib supports many of the machine learning algorithms that are available in R and Python. Because these algorithms run within the database, it can be efficient enough to process an entire dataset rather than pulling a much smaller sample into an external program.

Install MADlib following the steps outlined in its documentation.

Set up MADlib in the nyc_data database:

    /usr/local/madlib/bin/madpack -s madlib -p postgres -c postgres@localhost/nyc_data install

Warning: This command might differ depending on the directory in which you installed MADlib and the names of your PostgreSQL user, host, and database.

Now you can make use of MADlib’s library to analyze the taxicab dataset. Here, you can train an ARIMA model to predict the price of a ride from JFK to Times Square at a given time.

Let’s look at the rides_price table. The trip_price column is the average price of a trip from JFK to Times Square during each hourly period. Data points that are missing because no rides were taken during a certain hourly period are filled with the previous value. This is done by gapfilling, mentioned earlier in this tutorial.

    SELECT * FROM rides_price;
          one_hour       |    trip_price
    ---------------------+------------------
     2016-01-01 00:00:00 |            58.34
     2016-01-01 01:00:00 |            58.34
     2016-01-01 02:00:00 |            58.34
     2016-01-01 03:00:00 |            58.34
     2016-01-01 04:00:00 |            58.34
     2016-01-01 05:00:00 |            59.59
     2016-01-01 06:00:00 |            58.34
     2016-01-01 07:00:00 | 60.3833333333333
     2016-01-01 08:00:00 |          61.2575
     2016-01-01 09:00:00 |           58.435
     2016-01-01 10:00:00 |           63.952
     2016-01-01 11:00:00 | 59.9576923076923
     2016-01-01 12:00:00 |           60.462
     2016-01-01 13:00:00 |            61.65
     2016-01-01 14:00:00 |           58.342
     2016-01-01 15:00:00 |          59.8965
     2016-01-01 16:00:00 | 61.6468965517241
     2016-01-01 17:00:00 |           58.982
     2016-01-01 18:00:00 |         64.28875
     2016-01-01 19:00:00 | 60.8433333333333
     2016-01-01 20:00:00 |        61.888125
     2016-01-01 21:00:00 | 61.4064285714286
     2016-01-01 22:00:00 |  61.107619047619
     2016-01-01 23:00:00 | 57.9088888888889

You can also create two tables for the training and testing datasets. You can create tables instead of views here because you need to add columns to these datasets later in the time-series forecast analysis.

    -- Make the training dataset
    SELECT * INTO rides_price_train FROM rides_price
    WHERE one_hour <= '2016-01-21 23:59:59';

    -- Make the testing dataset
    SELECT * INTO rides_price_test FROM rides_price
    WHERE one_hour >= '2016-01-22 00:00:00';

Now you can use MADlib’s ARIMA library to make forecasts on your dataset.

MADlib does not yet offer a method that automatically finds the best parameters of the ARIMA model. So, the non-seasonal orders of the ARIMA model are obtained by using R’s auto.arima function in the same way you obtained them in the previous section with seasonal ARIMA. Here is the R code:

    # Connect to database and fetch records
    library("DBI")
    con <- dbConnect(RPostgres::Postgres(), dbname = "nyc_data",
          host = "localhost",
          user = "postgres")
    rides_price_train_query <- dbSendQuery(con, "SELECT * FROM rides_price_train;")
    rides_price_train <- dbFetch(rides_price_train_query)
    dbClearResult(rides_price_train_query)

    # convert the dataframe into a time-series
    library("xts")
    xts_rides_price <- xts(rides_price_train$trip_price, order.by = as.POSIXct(rides_price_train$one_hour, format = "%Y-%m-%d %H:%M:%S"))
    attr(xts_rides_price, 'frequency') <- 168

    # use auto.arima() to calculate the orders
    library("forecast")
    fit <- auto.arima(xts_rides_price[,1])

    # see the summary of the fit
    summary(fit)
    Series: xts_rides_price[, 1]
    ARIMA(2,1,3)

    Coefficients:
             ar1      ar2      ma1     ma2      ma3
          0.3958  -0.5142  -1.1906  0.8263  -0.5791
    s.e.  0.2312   0.1593   0.2202  0.2846   0.1130

    sigma^2 estimated as 11.06:  log likelihood=-1316.8
    AIC=2645.59   AICc=2645.76   BIC=2670.92

    Training set error measures:
                        ME    RMSE      MAE         MPE    MAPE      MASE
    Training set 0.1319955 3.30592 2.186295 -0.04371788 3.47929 0.6510487
                 ACF1

Of course, you can continue the analysis in R by following the same steps as in the previous seasonal ARIMA section. This section continues with MADlib instead.

Using the orders ARIMA(2,1,3) found with R, you can use MADlib’s arima_train and arima_forecast functions.

    -- train arima model and forecast the price of a ride from JFK to Times Square
    DROP TABLE IF EXISTS rides_price_output;
    DROP TABLE IF EXISTS rides_price_output_residual;
    DROP TABLE IF EXISTS rides_price_output_summary;
    DROP TABLE IF EXISTS rides_price_forecast_output;

    SELECT madlib.arima_train('rides_price_train', -- input table
                              'rides_price_output', -- output table
                              'one_hour', -- timestamp column
                              'trip_price', -- time-series column
                              NULL, -- grouping columns
                              TRUE, -- include_mean
                              ARRAY[2,1,3] -- non-seasonal orders
                              );

    SELECT madlib.arima_forecast('rides_price_output', -- model table
                                 'rides_price_forecast_output', -- output table
                                 240 -- steps_ahead (10 days)
                                 );

Let’s examine what values the trained ARIMA model forecasted for the next day.

The model seems to suggest that the price of a ride from JFK to Times Square remains pretty much constant on a day-to-day basis. MADlib also provides various statistical functions to evaluate the model.

    ALTER TABLE rides_price_test ADD COLUMN id SERIAL PRIMARY KEY;
    ALTER TABLE rides_price_test ADD COLUMN forecast DOUBLE PRECISION;

    UPDATE rides_price_test
    SET forecast = rides_price_forecast_output.forecast_value
    FROM rides_price_forecast_output
    WHERE rides_price_test.id = rides_price_forecast_output.steps_ahead;

    SELECT madlib.mean_abs_perc_error('rides_price_test', 'rides_price_mean_abs_perc_error', 'trip_price', 'forecast');

    SELECT * FROM rides_price_mean_abs_perc_error;
     mean_abs_perc_error
    ---------------------
      0.0423789161532639
    (1 row)

The ALTER TABLE and UPDATE statements add an id column and a forecast column to the rides_price_test table so that it fits the format expected by MADlib’s mean_abs_perc_error function. There are multiple ways to evaluate the quality of a model’s forecast values. In this case, you calculated the mean absolute percentage error and got about 4.24%.
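For intuition, the mean absolute percentage error can be sketched in a few lines of Python. This follows the standard definition (the average of |observed − forecast| / |observed|) and returns a fraction, in the same way that the value 0.0424 above corresponds to 4.24%. It is an assumption that MADlib’s function matches this definition in every detail, and the sample numbers are made up.

```python
def mean_abs_perc_error(observed, forecast):
    """Mean absolute percentage error, returned as a fraction (0.05 means 5%)."""
    if len(observed) != len(forecast):
        raise ValueError("observed and forecast must have the same length")
    if any(o == 0 for o in observed):
        raise ValueError("observed values must be non-zero")
    errors = [abs((o - f) / o) for o, f in zip(observed, forecast)]
    return sum(errors) / len(errors)


# Made-up prices: two forecasts are off by 10%, one is exact
observed = [50.0, 80.0, 100.0]
forecast = [55.0, 72.0, 100.0]
mape = mean_abs_perc_error(observed, forecast)
print(f"{mape:.4f} ({mape:.2%})")
```

Because each error is divided by the observed value, the metric is undefined when an observed value is zero, which is why the sketch rejects such input.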

What can you take away from this? The non-seasonal ARIMA model predicts that the price of a trip from the airport to Manhattan remains constant at $62 and performs well against the testing dataset. Unlike some ride hailing apps such as Uber that have surge pricing during rush hours, yellow taxicab prices stay pretty much constant all day.

From a technical standpoint, you have seen how TimescaleDB integrates seamlessly with other PostgreSQL extensions such as PostGIS and MADlib. This means that TimescaleDB users can easily take advantage of the vast PostgreSQL ecosystem.

The Holt-Winters model is another widely used tool in time-series analysis and forecasting. It can only be used for seasonal time-series data. The model extends simple exponential smoothing, in which the forecast is a weighted average of past values with more recent data points weighted more heavily than older ones, by adding trend and seasonal components. Holt-Winters is considered to be simpler than ARIMA, but there is no clear answer as to which model is superior for time-series forecasting. It is advisable to build both models for a particular dataset and compare their performance to find out which is more suitable.
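The weighted-average idea can be illustrated with simple exponential smoothing on its own. The sketch below covers only the level component; the trend and seasonal components that Holt-Winters adds are left out. The smoothing factor `alpha` and both function names are assumptions made for the example.

```python
def simple_exponential_smoothing(values, alpha):
    """One-step-ahead forecast after seeing all values (level component only)."""
    level = values[0]
    for x in values[1:]:
        # Blend the newest observation with the running level
        level = alpha * x + (1 - alpha) * level
    return level


def observation_weights(n, alpha):
    """Weight placed on each of the last n observations, newest first."""
    return [alpha * (1 - alpha) ** k for k in range(n)]


print(observation_weights(4, 0.5))
print(simple_exponential_smoothing([10.0, 20.0], 0.5))
```

Each step back in time multiplies the weight by (1 − alpha), so a larger alpha makes the forecast react faster to recent values and forget older ones sooner.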

You can use Python to analyze how long a trip from the Financial District to Times Square takes at different times of the day. You need to install these Python packages:

    pip install psycopg2
    pip install pandas
    pip install numpy
    pip install statsmodels
    pip install matplotlib

The format of the data is very similar to the previous two sections. The trip_length column in the rides_length table is the average length of a ride from the Financial District to Times Square in the given time period.

    SELECT * FROM rides_length;
         three_hour      |   trip_length
    ---------------------+-----------------
     2016-01-01 00:00:00 | 00:21:50.090909
     2016-01-01 03:00:00 | 00:17:15.8
     2016-01-01 06:00:00 | 00:13:21.666667
     2016-01-01 09:00:00 | 00:14:20.625
     2016-01-01 12:00:00 | 00:16:32.366667
     2016-01-01 15:00:00 | 00:19:16.921569
     2016-01-01 18:00:00 | 00:22:46.5
     2016-01-01 21:00:00 | 00:17:22.285714
     2016-01-02 00:00:00 | 00:19:24
     2016-01-02 03:00:00 | 00:19:24
     2016-01-02 06:00:00 | 00:12:13.5
     2016-01-02 09:00:00 | 00:17:17.785714
     2016-01-02 12:00:00 | 00:20:56.785714
     2016-01-02 15:00:00 | 00:24:41.730769
     2016-01-02 18:00:00 | 00:29:39.555556
     2016-01-02 21:00:00 | 00:20:09.6
     ...

You can also create two PostgreSQL views for the training and testing datasets.

    -- Make the training dataset
    CREATE VIEW rides_length_train AS
    SELECT * FROM rides_length
    WHERE three_hour <= '2016-01-21 23:59:59';

    -- Make the testing dataset
    CREATE VIEW rides_length_test AS
    SELECT * FROM rides_length
    WHERE three_hour >= '2016-01-22 00:00:00';

Python’s psycopg2 package allows you to query the database from Python:

    import psycopg2
    import psycopg2.extras

    # establish connection
    conn = psycopg2.connect(dbname='nyc_data', user='postgres', host='localhost')

    # cursor object allows querying of database
    # a named (server-side) cursor prevents records from being downloaded until explicitly fetched
    cursor_train = conn.cursor('train', cursor_factory=psycopg2.extras.DictCursor)
    cursor_test = conn.cursor('test', cursor_factory=psycopg2.extras.DictCursor)

    # execute SQL query
    cursor_train.execute('SELECT * FROM rides_length_train')
    cursor_test.execute('SELECT * FROM rides_length_test')

    # fetch records from database
    ride_length_train = cursor_train.fetchall()
    ride_length_test = cursor_test.fetchall()

You can now manipulate the data to feed it into the Holt-Winters model.

    import pandas as pd
    import numpy as np

    # make records into a pandas dataframe
    ride_length_train = pd.DataFrame(np.array(ride_length_train), columns = ['time', 'trip_length'])
    ride_length_test = pd.DataFrame(np.array(ride_length_test), columns = ['time', 'trip_length'])

    # convert the type of columns of dataframe to datetime and timedelta
    ride_length_train['time'] = pd.to_datetime(ride_length_train['time'], format = '%Y-%m-%d %H:%M:%S')
    ride_length_test['time'] = pd.to_datetime(ride_length_test['time'], format = '%Y-%m-%d %H:%M:%S')
    ride_length_train['trip_length'] = pd.to_timedelta(ride_length_train['trip_length'])
    ride_length_test['trip_length'] = pd.to_timedelta(ride_length_test['trip_length'])

    # set the index of dataframes to the timestamp
    ride_length_train.set_index('time', inplace = True)
    ride_length_test.set_index('time', inplace = True)

    # convert trip_length into a numeric value in seconds
    ride_length_train['trip_length'] = ride_length_train['trip_length']/np.timedelta64(1, 's')
    ride_length_test['trip_length'] = ride_length_test['trip_length']/np.timedelta64(1, 's')

This data can now be used to train a Holt-Winters model that is imported from the statsmodels package. You can expect the pattern to repeat weekly, and therefore set the seasonal_periods parameter to 56 (there are eight 3-hour periods in a day, seven days in a week). Since the seasonal variations are likely to be fairly constant over time, you can use the additive method rather than the multiplicative method, which is specified by the trend and seasonal parameters.

    from statsmodels.tsa.api import ExponentialSmoothing
    fit = ExponentialSmoothing(np.asarray(ride_length_train['trip_length']), seasonal_periods = 56, trend = 'add', seasonal = 'add').fit()

You can now use the trained model to make a forecast and compare it with the testing dataset.

Now ride_length_test has a column with the observed values and predicted values from January 22 to January 31. You can plot these values on top of each other to make a visual comparison:

    import matplotlib.pyplot as plt

    # each dataframe column (observed and predicted) is drawn as its own line
    plt.plot(ride_length_test)
    plt.title('Taxicab Ride Length from Financial District to Times Square by Time')
    plt.xlabel('Date')
    plt.ylabel('Ride Length (seconds)')
    plt.legend(['Observed', 'Predicted'])
    plt.show()

Rides Length Graph

The model predicts that the length of a trip from the Financial District to Times Square fluctuates roughly between 16 minutes and 38 minutes, with high points midday and low points overnight. The trip length is notably longer during weekdays than it is during weekends (January 23, 24, 30, 31).

This tutorial looked at different ways you can build statistical models to analyze time-series data and how you can leverage the full power of the PostgreSQL ecosystem with TimescaleDB. This tutorial also looked at integrating TimescaleDB with R, Apache MADlib, and Python. You can simply choose the option you are most familiar with from a vast number of choices that TimescaleDB inherits from PostgreSQL. ARIMA and Holt-Winters are just a couple from a wide variety of statistical models and machine learning algorithms that you can use to analyze and make predictions on time-series data in your TimescaleDB database.