Acadlore takes over the publication of IJTDI from 2025 Vol. 9, No. 4. The preceding volumes were published under a CC BY 4.0 license by the previous owner, and displayed here as agreed between Acadlore and the previous owner. ✯ : This issue/volume is not published by Acadlore.
An Analytical Methodology for Extending Passenger Counts in a Metro System
Abstract:
The planning of a rail system requires the definition of travel demand in terms of passenger (or freight) flows for sizing physical and technological elements (such as the number of trains, the type of signalling system, and the length and width of platforms). Moreover, once a system has been set up and functional elements have been acquired, system management in terms of services and related timetables requires knowledge of travel demand flows. Much has been written about the methods and techniques for estimating travel demand by means of analytical models (calibrated by surveys), statistical processing of survey data and/or correction of model results by using properly collected traffic counts. However, whatever the adopted approach, it is necessary to carry out survey campaigns to acquire experimental data. Obviously, the greater the amount of collected data (and the related acquisition costs and times), the greater the accuracy of travel demand estimations. Hence, in real cases, a fair compromise between survey costs and estimation accuracy has to be struck.
In this context, we propose an analytical methodology for identifying space–time relations between passenger counts to reduce the amount of data to be surveyed without affecting estimation accuracy. In particular, our proposal is based on defining analytical functions to provide boarding and alighting flows depending on the station (space component) and the time period (time component) in question. Finally, in order to show the feasibility of the proposed methodology and related improvements with respect to traditional approaches, we applied our proposal to the case of a real metro line in Naples (Italy) by comparing different levels of detail in passenger surveys.
1. Introduction
The promotion of public transport may be viewed as one of many useful strategies to reduce negative externalities on road networks (see, for instance, European Commission [1]). Indeed, although numerous policies have been applied to improve road system conditions in terms of safety (as shown by Dell’Acqua and Russo [2] and Dell’Acqua et al. [3]) or analysis of driver behaviour (as proposed by Bifulco et al. [4] and Pariota et al. [5]), the most successful strategies are based on changing user modal choices, which allows sharp reductions in accidents, energy consumption, traffic congestion and air and noise pollution. In this context, the adoption of interventions aimed at increasing the attractiveness of public transport (see Cascetta et al. [6, 7]) or reducing mass-transit operator costs and passenger disutilities (D’Acierno et al. [8], Gallo et al. [9] and D’Acierno et al. [10]) may represent a useful way of achieving the intended purposes.
However, whatever the adopted approach, it is necessary to estimate travel demand in terms of potential or expected passengers (or freight) with related features (i.e. starting and arrival stations, adopted time slot, trip duration, etc.). Indeed, such information is useful for both the planning and the management phases of transportation systems. This has generated an extensive literature on methodologies for estimating and forecasting travel demand, which is summarized below.
In general, current and future demand can be estimated by means of three approaches (Cascetta [11]): direct estimation, disaggregated estimation and aggregated estimation. The first approach, indicated in the literature (see, for instance, Smith [12], Brog & Ampt [13] and Ortuzar & Willumsen [14]) as direct estimation, can be adopted to determine only ‘present’ travel demand. It is based on the application of sampling theory to mobility choices. The main limits of this methodology are the large amount of information to be collected and the inability to predict future developments due to transportation network or socioeconomic variations.
The second approach, known as disaggregated estimation (see, for instance, Domencich & McFadden [15], Horowitz [16], Manski & McFadden [17] and Ben-Akiva & Lerman [18]), consists in specifying (i.e. providing the functional form and related variables), calibrating (i.e. determining numerical values of model parameters) and validating (i.e. verifying the ability of the model to reproduce original data) a model by means of proper data. These data express disaggregate information related to a sample of individuals, where the size and the sample characteristics generally differ from those used in the first approach. This methodology allows mobility choices to be simulated in current conditions (based on the ability to reproduce sampling data) and in the case of future conditions (based on the ability to simulate user reactions to transportation network or socio-economic variations). The above disaggregated approach is referred to in the literature as the revealed preference approach (Cascetta [11]) since it is based on the use of data related to real behaviour of travellers. Recently (see, for instance, Ben-Akiva & Morikawa [19] and Ortuzar [20]), the stated preference approach has been developed, based on the statements of travellers about their appropriately described and designed preferences in hypothetical scenarios. With the use of this second approach the prediction abilities of the calibrated demand models can be improved.
Finally, the last approach (see, for instance, Lo & Chan [21], Cascetta et al. [22] and Lu et al. [23]), known as aggregated estimation, is based on modifying demand model results after correcting them by means of traffic counts (i.e. vehicular or passenger flows). The aim of this approach is to identify an origin-destination (OD) matrix which is closest to its estimation by model and, once it is assigned to the network, generates flows closest to the counting data.
The brief analysis of methods for estimating travel demand shows that it is always necessary to conduct survey campaigns to acquire experimental data. In particular, the greater the number of surveyed or detected data (and related acquisition costs and times), the greater the accuracy of travel demand estimations. Hence, in real cases, a fair compromise between survey costs and estimation accuracy should be achieved.
In this context, our proposal is to provide an analytical procedure for identifying some space–time functions in order to reduce the number of data to collect without significantly affecting estimation accuracy. In the particular case of a metro system, we aim to identify analytical relations which express boarding and alighting flows of passengers depending on the station (space component) and the time period considered (time component).
The article is organized as follows: Section 2 provides general features of the proposed approach; Section 3 applies the methodology in the case of a real metro line; finally, conclusions and research prospects are summarized in Section 4.
2. Definition of Space–time Relations
The proposed methodology is based on the assumption that the spatial correlation (i.e. the correlation among different stations) and temporal correlation (i.e. the correlation among different time periods) of passenger flows are generally not null. The reason lies in the framework of travel demand and its temporal distribution.
In the literature, the definition of functions for determining travel demand can be addressed through two main approaches (Cascetta [11]): descriptive and behavioural. The former (see, for instance, Oi & Shuldiner [24] and Wilson [25]) provides relations among variables without any assumption on users’ behaviour; the latter (see, for instance, Ortuzar & Willumsen [14], Domencich & McFadden [15] and Ben-Akiva & Lerman [18]) is based on explicit assumptions of users’ choice behaviours. Our proposal is based on adopting the first approach.
In particular, in order to investigate whether it is possible to identify one or more space–time functions to calculate passenger flows with a descriptive approach, we propose the following five-step procedure:
• design and execute a survey campaign in order to acquire a sufficient amount of data for applying the following statistical analyses;
• define a partial set of data, obtained by hiding (i.e. assuming not detected) some data with a predetermined sampling rate;
• perform a mono-dimensional statistical analysis on the partial data set in order to identify the optimal functional form;
• perform a multi-dimensional statistical analysis on the partial set of data, in order to specify, calibrate and validate one or more space–time functions;
• validate the methodology by comparing the simulation results of the metro system obtained with the whole data set (assumed as ‘absolute truth’) with those obtained with the data provided by the identified space–time functions.
The first step consists in analysing the main features of the context such as the number and location of stations, the layout in terms of platforms and accesses of any station, operating hours and timetables, and rolling stock characteristics. This information allows a proper survey plan to be designed, indicating ‘when’ and ‘where’ passenger flows must be counted. Obviously, this phase includes the execution of surveys, whose appropriately noted results can be shown as in Fig. 1.
Since the aim of the procedure is to provide a tool for reducing the amount of data to be surveyed, the second phase consists in determining a partial data set to be analysed. In particular, we simulate the adoption of a prefixed sampling rate during surveys, for instance 50%, by hiding some detected data. In this way, we apply the proposed methodology in the case of a limited subset of surveyed data (see Fig. 2).
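The hiding step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the flow values are made up, the function name `hide_counts` is hypothetical, and the cells to hide are drawn at random, whereas the pattern actually adopted in the study (Fig. 2) may have been chosen differently.

```python
import random

def hide_counts(matrix, sampling_rate=0.5, seed=0):
    """Return a copy of `matrix` in which only a `sampling_rate` share of the
    cells keeps its surveyed value; hidden cells are set to None."""
    rng = random.Random(seed)
    n_rows, n_cols = len(matrix), len(matrix[0])
    cells = [(r, c) for r in range(n_rows) for c in range(n_cols)]
    kept = set(rng.sample(cells, round(sampling_rate * len(cells))))
    return [[matrix[r][c] if (r, c) in kept else None for c in range(n_cols)]
            for r in range(n_rows)]

# Hypothetical 3 x 18 matrix (time periods x stations) of made-up counts.
full = [[100 * (t + 1) + 10 * s for s in range(18)] for t in range(3)]
partial = hide_counts(full, sampling_rate=0.5)
n_known = sum(v is not None for row in partial for v in row)
print(n_known)  # 27 of the 54 cells remain visible
```

Fixing the seed keeps the simulated partial set reproducible, so that later calibration and validation steps work on the same hidden cells.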
With the information collected and properly processed, the first kind of statistical analysis can then be implemented, that is the mono-dimensional approach. It consists in checking the class of functions which best describes the simulated survey data. This procedure is indicated as mono-dimensional because the analysed functions are defined in an $\mathbb{R}^2$ space, where the abscissa is the sequence of stations or the time periods and the ordinate is the surveyed flow (as shown in Fig. 3).
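As a minimal sketch of one mono-dimensional calibration, the snippet below fits the simplest candidate class (a straight line) to the known values of a single row and uses it to estimate the hidden stations. The flows are made up and the helper `fit_line` is hypothetical; the same pattern would apply to the other function classes with a different fitting routine.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line y = slope * x + intercept."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean

# Made-up flows for one time period; None marks a hidden (not surveyed) station.
row = [40.0, None, 52.0, None, 64.0, 70.0, None, 82.0]
known = [(s + 1, v) for s, v in enumerate(row) if v is not None]
slope, intercept = fit_line([x for x, _ in known], [y for _, y in known])

# Estimate the hidden stations from the fitted line.
hidden = {s + 1: slope * (s + 1) + intercept for s, v in enumerate(row) if v is None}
print({k: round(v, 1) for k, v in hidden.items()})  # {2: 46.0, 4: 58.0, 7: 76.0}
```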



Hence, for each class of functions, it is necessary to analyse $\left(n_{\mathrm{st}} \times 2\right) \times n_{\mathrm{tp}}$ sets of data, where $n_{\mathrm{st}}$ is the number of stations (the multiplier 2 is required to consider outgoing and return trips separately) and $n_{\mathrm{tp}}$ is the number of time periods considered. Since there are two flow types (i.e. boarding and alighting flows) to be considered, assuming $n_{\mathrm{fc}}$ as the number of function classes to be analysed, the third phase requires the calibration and the validation of $n_{\mathrm{f}}$ mono-dimensional functions, where

$$n_{\mathrm{f}}=2 \times\left(n_{\mathrm{st}} \times 2\right) \times n_{\mathrm{tp}} \times n_{\mathrm{fc}}=4 \cdot n_{\mathrm{st}} \cdot n_{\mathrm{tp}} \cdot n_{\mathrm{fc}} \tag{1}$$
Obviously, for each class of function, suitable validation tests have to be performed.
Having defined the best class of mono-dimensional functions, the fourth phase consists in providing the specification, calibration and validation of four surfaces, where the independent parameters are the stations and the time periods, while the surface expresses the value of passenger flows. The number of functions to be defined, which is equal to four, is related to the fact that we have to consider two kinds of passenger flows (boarding and alighting flows) and two kinds of trips (outgoing and return trips). Obviously, also in this case the calibration set consists in the simulated (i.e. partial) surveyed data, shown in Fig. 2.
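The calibration of one such surface can be sketched as an ordinary least-squares problem. The example below is an assumption-laden illustration, not the authors' formulation: the five-term polynomial basis in `basis`, the helper names and the synthetic data are all hypothetical, and the normal equations are solved with a small hand-written Gaussian elimination so that the snippet needs only the Python standard library.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def basis(x, y):
    # Hypothetical five-term polynomial basis in time period x and station y.
    return [1.0, x, y, x * x, y * y]

def fit_surface(points):
    """Least-squares coefficients for points given as (x, y, flow) triples."""
    rows = [basis(x, y) for x, y, _ in points]
    flows = [f for _, _, f in points]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xtz = [sum(r[i] * f for r, f in zip(rows, flows)) for i in range(k)]
    return solve(xtx, xtz)

def predict(coeffs, x, y):
    return sum(c * b for c, b in zip(coeffs, basis(x, y)))

def true_flow(x, y):
    # Made-up surface used only to generate synthetic "surveyed" data.
    return 5.0 + 2.0 * x + 3.0 * y - 0.5 * x * x + 0.1 * y * y

# Keep half of the 3 x 18 cells (a checkerboard pattern) as the calibration set.
known = [(x, y, true_flow(x, y))
         for x in range(1, 4) for y in range(1, 19) if (x + y) % 2 == 0]
coeffs = fit_surface(known)
print(len(known), round(predict(coeffs, 2, 5), 3))  # 27 24.5
```

Because the synthetic data lie exactly on the assumed surface, the hidden cell (2, 5) is recovered exactly; with real counts the fit would only approximate the surveyed values, which is why the validation tests matter.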
The last phase consists in comparing application results obtained by using the whole set of the survey data and those using the data of calibrated functions properly combined with the data of calibration subsets. Three different combinations of sets may be obtained for comparison: only the calibration subset (Fig. 2), the calibration subset extended by replacing missing data with function data (Fig. 4) and only function data for all values (Fig. 5).
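The three counting sets can be assembled as in the following sketch, where `None` marks a hidden cell. The function `flow` stands in for a calibrated space–time function and, like the numerical values, is purely hypothetical.

```python
def extend_with_function(partial, func):
    """Keep surveyed values and fill hidden cells (None) with function values."""
    return [[v if v is not None else func(t + 1, s + 1)
             for s, v in enumerate(row)]
            for t, row in enumerate(partial)]

def function_only(n_periods, n_stations, func):
    """Build a counting set made entirely of function values."""
    return [[func(t + 1, s + 1) for s in range(n_stations)]
            for t in range(n_periods)]

def flow(x, y):
    # Hypothetical calibrated function of time period x and station y.
    return 100.0 + 10.0 * x + 5.0 * y

# Set 1: calibration subset only (made-up values, None = not surveyed).
partial = [[125.0, None, 130.0],
           [None, 128.0, None]]
# Set 2: calibration subset extended with function data.
extended = extend_with_function(partial, flow)
# Set 3: function data for all cells.
modelled = function_only(2, 3, flow)

print(extended)  # [[125.0, 120.0, 130.0], [125.0, 128.0, 135.0]]
print(modelled)  # [[115.0, 120.0, 125.0], [125.0, 130.0, 135.0]]
```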
3. Application of the Proposed Methodology
In order to show the feasibility of the proposed methodology, we applied it to Metro Line 1 in Naples (Italy), which is about 18 km long and consists of 18 stations (further details can be found in Botte et al. [26]) running from the suburbs (Piscinola) to the city centre (Garibaldi).
Survey activities were implemented in July 2015 to collect data related to daily flows on an average working day in summer. It is worth noting that investigations were organized to detect flows for each single access (gate, stair, elevator, etc.) which were subsequently grouped according to platforms and travel directions. Finally, data were organized by considering three time periods and 18 stations; therefore we obtained four matrices of dimensions 3 × 18, whose framework is shown in Fig. 1.
As described in the previous section, the calibration phase was divided into two sub-phases: a mono-dimensional phase and a multi-dimensional phase. Obviously, it was first necessary to define the simulated sampling rate, which was assumed equal to 50%.
The first analysis consisted in dividing each of the four survey matrices into 54 (i.e. 3 × 18) vectors and testing the goodness of fit (i.e. the discrepancy between surveyed data and function data). The following classes of functions were tested: linear, quadratic, cubic, fourth-degree polynomial, fifth-degree polynomial, power, logarithmic and exponential functions. However, due to the scarcity of data along the matrix columns (i.e. there were at most two data points), linear functions were adopted only in row analyses.
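The goodness of fit of each class of function was estimated by means of the coefficient of determination $\Re^2$, calculated as follows:

$$\Re^2=1-\frac{\sum_{\mathrm{i}=1}^n\left(\varphi_{\mathrm{i}}-f_{\mathrm{i}}\right)^2}{\sum_{\mathrm{i}=1}^n\left(\varphi_{\mathrm{i}}-\bar{\varphi}\right)^2} \tag{2}$$

with

$$\bar{\varphi}=\frac{1}{n} \sum_{\mathrm{i}=1}^n \varphi_{\mathrm{i}} \tag{3}$$

where $\varphi_{\mathrm{i}}$ is the $i$-th simulated survey datum (i.e. a known value of Fig. 3), $n$ is the number of simulated survey data $\varphi_{\mathrm{i}}$, $\bar{\varphi}$ is the mean of data $\varphi_{\mathrm{i}}$ and $f_{\mathrm{i}}$ is the $i$-th value assumed by the calibrated function.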
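For illustration, the coefficient of determination can be computed as in the sketch below, which assumes the standard definition (one minus the ratio between the residual and the total sum of squares) and uses made-up flows with two hypothetical candidate fits.

```python
def r_squared(observed, fitted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(observed) / len(observed)
    ss_res = sum((o - f) ** 2 for o, f in zip(observed, fitted))
    ss_tot = sum((o - mean) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Made-up flows at five stations and the values of two candidate functions.
phi = [120.0, 150.0, 210.0, 260.0, 310.0]
linear_fit = [110.0, 160.0, 210.0, 260.0, 310.0]
constant_fit = [210.0] * 5  # a function that always returns the mean

print(round(r_squared(phi, linear_fit), 3))  # 0.992
print(r_squared(phi, constant_fit))          # 0.0
```

A function that only reproduces the mean of the data scores zero, so values close to one indicate that the candidate class captures most of the variability of the surveyed flows.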
Polynomial functions were the class which provided the best $\Re^2$ values. Hence, we implemented the multi-dimensional phase by adopting this category of functions, obtaining the following formulations:
with
where x represents the time period; y represents the sequence of stations (y = 1 in the case of Piscinola and y = 18 in the case of Garibaldi); AG represents alighting flows (A) in the case of the outgoing trip (i.e. Garibaldi direction); BG represents boarding flows (B) in the case of the outgoing trip (i.e. Garibaldi direction); AP represents alighting flows (A) in the case of the return trip (i.e. Piscinola direction); BP represents boarding flows (B) in the case of the return trip (i.e. Piscinola direction).
It is worth noting that eqn. (4) is defined in the case of $k \in\{A G ; A P ; B P\}$, while eqn. (5) is defined only in the case of $k=B G$. Therefore, values $f_{\mathrm{i}}$ are calculated by means of eqn. (4) or eqn. (5) according to the values assumed by parameter $k$.
As global statistical tests, we adopted $\Re^2$ (formulated as eqn. (2)), adjusted $\Re^2$ (indicated as $\bar{\Re}^2$) and the $F$-test (indicated as $F$), formulated as follows:
where $p$ expresses the number of function parameters, which is equal to 5 in the case of function (4) and equal to 15 in the case of function (5). Moreover, we also performed Student's $t$-test (indicated as $t$) on the coefficients, calculated as
where $\operatorname{Var}\left(a_i^k\right)$ is the $i$-th element of the main diagonal of the variance-covariance matrix $S$, obtained as
with
Table 1 reports the results of the global statistical tests; Tables 2–5 describe the results of statistical tests of coefficients.
Table 1. Results of the global statistical tests.
Function type (k) | $\Re^2$ | $\bar{\Re}^2$ | F-test: F value | F-test: threshold | F-test: confidence level (%) |
AG | 0.764 | 0.701 | 12.273 | 8.018 | |
BG | 0.829 | 0.572 | 3.220 | 3.190 | |
AP | 0.621 | 0.521 | 6.218 | 5.967 | |
BP | 0.514 | 0.386 | 4.023 | 4.016 | |
Table 2. Results of the statistical tests of coefficients (function (4)).
Parameter | $a_1^k$ | $a_2^k$ | $a_3^k$ | $a_4^k$ | $a_5^k$ |
Value | 0.0624 | 0.0184 | -0.5150 | -0.0272 | 0.7285 |
t-Value | 489.62 | 588.89 | 706.461 | 398.376 | 588.973 |
Threshold | 5.077 | 5.077 | 5.077 | 5.077 | 5.077 |
Confidence level (%) | 99.99 | 99.99 | 99.99 | 99.99 | 99.99 |
Table 3. Results of the statistical tests of coefficients (function (4)).
Parameter | $a_1^k$ | $a_2^k$ | $a_3^k$ | $a_4^k$ | $a_5^k$ |
Value | -0.0552 | 0.0096 | 0.0677 | -0.0585 | 0.7942 |
t-Value | 375.497 | 184.120 | 69.562 | 525.132 | 468.483 |
Threshold | 5.077 | 5.077 | 5.077 | 5.077 | 5.077 |
Confidence level (%) | 99.99 | 99.99 | 99.99 | 99.99 | 99.99 |
Table 4. Results of the statistical tests of coefficients (function (5), $k = BG$).
Parameter | $b_1^k$ | $b_2^k$ | $b_3^k$ | $b_4^k$ | $b_5^k$ |
Value | 136.82 | -71.15 | 3.82 | 1.05 | 0.16 |
t-Value | 2.381·10⁶ | 908.94 | 367.44 | 2.892·10⁶ | 1.845·10⁸ |
Threshold | 6.412 | 6.412 | 6.412 | 6.412 | 6.412 |
Confidence level (%) | 99.99 | 99.99 | 99.99 | 99.99 | 99.99 |
Parameter | $b_6^k$ | $b_7^k$ | $b_8^k$ | $b_9^k$ | $b_{10}^k$ |
Value | -557.32 | 362.08 | -48.62 | -8.59 | 129.16 |
t-Value | 7,285.76 | 958.53 | 1,220.60 | 5,718.63 | 3,016.26 |
Threshold | 6.412 | 6.412 | 6.412 | 6.412 | 6.412 |
Confidence level (%) | 99.99 | 99.99 | 99.99 | 99.99 | 99.99 |
Parameter | $b_{11}^k$ | $b_{12}^k$ | $b_{13}^k$ | $b_{14}^k$ | $b_{15}^k$ |
Value | -201.01 | 172.87 | 380.50 | -970.94 | 2,035.15 |
t-Value | 342.91 | 3,513.05 | 465.52 | 1.040·10⁸ | 1.063·10⁵ |
Threshold | 6.412 | 6.412 | 6.412 | 6.412 | 6.412 |
Confidence level (%) | 99.99 | 99.99 | 99.99 | 99.99 | 99.99 |
Table 5. Results of the statistical tests of coefficients (function (4)).
Parameter | $a_1^k$ | $a_2^k$ | $a_3^k$ | $a_4^k$ | $a_5^k$ |
Value | -0.0400 | -0.0006 | 0.1864 | -0.0142 | 0.1726 |
t-Value | 332.221 | 14.512 | 234.541 | 165.639 | 124.745 |
Threshold | 5.077 | 5.077 | 5.077 | 5.077 | 5.077 |
Confidence level (%) | 99.99 | 99.99 | 99.99 | 99.99 | 99.99 |
In order to validate the proposed procedure, the daily travel demand estimated by the authors in previous studies (see D’Acierno et al. [27] and Ercolani et al. [28]) for an average working day in winter was updated with the newly surveyed data by adopting an aggregated estimation approach. Details on the a priori travel demand estimation and related improvement by means of surveyed data can be found in Botte et al. [29].
The travel demand estimated by means of the whole set and assumed as ‘absolute truth’ was compared with OD matrices obtained by adopting three different counting sets: the partial survey set (Fig. 2), the set obtained by replacing missing data with function outputs (Fig. 4) and the set obtained by using only function outputs (Fig. 5). The comparisons were implemented by identifying the optimal intervention strategies in the case of metro system failures. Details on the considered breakdown and related intervention strategies can be found in D’Acierno et al. [10].
Table 6. Objective function values for each intervention strategy and counting set.
Intervention strategy | Complete survey data | Partial survey set | Replaced missing data | Function data |
0 | € 231,034.54 | € 253,518.18 | € 208,026.81 | € 168,733.07 |
1 | € 245,446.51 | € 268,782.33 | € 243,390.79 | € 229,185.58 |
2 | € 245,446.51 | € 268,782.33 | € 243,390.79 | € 229,185.58 |
3 | € 244,296.97 | € 267,618.70 | € 241,918.38 | € 228,309.20 |
4 | € 245,378.78 | € 268,671.10 | € 243,344.24 | € 229,131.18 |
5 | € 245,378.78 | € 268,671.10 | € 243,344.24 | € 229,131.18 |
6 | € 244,362.09 | € 267,724.45 | € 241,952.16 | € 228,324.68 |
7 | € 243,316.37 | € 266,814.45 | € 243,414.95 | € 229,163.03 |
8 | € 245,107.97 | € 268,431.45 | € 242,980.84 | € 228,863.49 |
9 | € 245,031.45 | € 268,338.34 | € 242,614.05 | € 228,665.49 |
10 | € 244,295.08 | € 267,631.59 | € 241,938.04 | € 228,309.45 |
11 | € 226,969.63 | € 248,628.12 | € 203,790.44 | € 165,513.24 |
12 | € 226,969.63 | € 248,628.12 | € 203,790.44 | € 165,513.24 |
13 | € 225,819.36 | € 247,460.93 | € 202,974.86 | € 164,773.21 |
14 | € 226,901.90 | € 248,516.89 | € 203,685.02 | € 165,427.05 |
15 | € 226,901.90 | € 248,516.89 | € 203,685.02 | € 165,427.05 |
16 | € 225,885.21 | € 247,570.24 | € 203,024.01 | € 164,822.67 |
17 | € 226,839.49 | € 248,458.24 | € 203,741.70 | € 165,473.23 |
18 | € 226,631.09 | € 248,277.46 | € 203,599.50 | € 165,342.12 |
19 | € 226,554.57 | € 248,184.13 | € 203,490.71 | € 165,242.53 |
20 | € 225,818.20 | € 247,477.38 | € 202,967.43 | € 164,769.11 |
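Table 7. Variation in the objective function with respect to the ‘absolute truth’ (complete survey data).
Intervention strategy | Partial surveyed set (%) | Replaced missing data (%) | Function data (%) |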
0 | 9.73 | 9.96 | 26.97 |
1 | 9.51 | 0.84 | 6.63 |
2 | 9.51 | 0.84 | 6.63 |
3 | 9.55 | 0.97 | 6.54 |
4 | 9.49 | 0.83 | 6.62 |
5 | 9.49 | 0.83 | 6.62 |
6 | 9.56 | 0.99 | 6.56 |
7 | 10.40 | 0.04 | 5.82 |
8 | 9.52 | 0.87 | 6.63 |
9 | 9.51 | 0.99 | 6.68 |
10 | 9.55 | 0.96 | 6.54 |
11 | 9.54 | 10.21 | 27.08 |
12 | 9.54 | 10.21 | 27.08 |
13 | 9.58 | 10.12 | 27.03 |
14 | 9.53 | 10.23 | 27.09 |
15 | 9.53 | 10.23 | 27.09 |
16 | 9.60 | 10.12 | 27.03 |
17 | 9.53 | 10.18 | 27.05 |
18 | 9.55 | 10.16 | 27.04 |
19 | 9.55 | 10.18 | 27.06 |
20 | 9.59 | 10.12 | 27.03 |
Average | 9.59 | 5.71 | 17.28 |
Median | 9.54 | 9.96 | 26.97 |
Minimum | 9.49 | 0.04 | 5.82 |
Maximum | 10.40 | 10.23 | 27.09 |
Table 6 provides the objective function calculation for each different calibration set (the value in dark gray identifies the first best solution, light gray shows the second and third best strategies). Table 7 shows variation in the objective function with respect to the ‘absolute truth’.
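The percentages can be reproduced from the objective function values, as in the sketch below for strategies 0 and 13. It assumes that the variation is the absolute difference relative to the value obtained with the complete survey data, an interpretation that matches the published figures for these two rows; the helper name `pct_variation` is hypothetical.

```python
def pct_variation(value, reference):
    """Absolute percentage difference of `value` with respect to `reference`."""
    return abs(value - reference) / reference * 100.0

# Objective function values (EUR) for two intervention strategies:
# (complete survey, partial survey set, replaced missing data, function data).
objective = {
    0: (231034.54, 253518.18, 208026.81, 168733.07),
    13: (225819.36, 247460.93, 202974.86, 164773.21),
}

variation = {
    strategy: [round(pct_variation(v, values[0]), 2) for v in values[1:]]
    for strategy, values in objective.items()
}
print(variation)  # {0: [9.73, 9.96, 26.97], 13: [9.58, 10.12, 27.03]}
```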
4. Conclusions and Research Prospects
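By following the proposed methodology, the amount of data to be collected can be reduced without significantly compromising estimation accuracy. When the space–time functions are used to replace missing data, the objective function deviates on average by less than 6% (5.71% in Table 7) from the values obtained with the complete survey, compared with 9.59% for the partial survey set alone and 17.28% when only function data are used.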
In terms of future research, we plan to apply the methodology in other contexts, both for different time periods (for instance by collecting winter data) and for different networks. A further research aim would be to verify the performance of different spatial reference systems such as curvilinear abscissa or polar coordinates.
The data used to support the findings of this study are available from the corresponding author upon request.
The authors declare that they have no conflicts of interest.
