1 Introduction
Crowd counting is important for applications such as video surveillance and traffic control. Most state-of-the-art approaches rely on regressors to estimate the local crowd density in individual images, which they then integrate over portions of the images to produce people counts. The regressors typically use Random Forests [16], Gaussian Processes [7], or, more recently, Deep Nets [55, 59, 30, 34, 49, 41, 36, 26, 17, 33, 40, 22, 14, 32, 5]. When video sequences are available, some algorithms use temporal consistency to impose weak constraints on successive density estimates. One way is to use an LSTM to model the evolution of people densities from one frame to the next [49]. However, this does not explicitly enforce the fact that the number of people must be strictly conserved as they move about, except at very specific locations where they can enter or leave the field of view. Modeling this was attempted in [24] but, because expressing this constraint in terms of people densities is difficult, the constraints actually enforced were much weaker, as will be shown below.
In this paper, we propose to regress people flows, that is, the number of people moving from one location to another in the image plane, instead of densities, as depicted by Fig. 1. We will show that this enables us to model people's motion more accurately and to impose the conservation of their number much more effectively, resulting in increased performance with deep architectures that are no deeper or more complex than state-of-the-art ones.
Furthermore, regressing people flows instead of densities provides a scene description that includes the motion direction and magnitude. As shown in Fig. 1(c), (d), and (e), the people flows closely match the direction of the optical flow. This enables us to exploit the fact that people flow and optical flow should be highly correlated, which provides an additional regularization constraint on the predicted flows and further boosts performance.
We will demonstrate on five benchmark datasets that our approach to enforcing temporal consistency brings a substantial performance boost over state-of-the-art approaches. Our contribution is therefore a novel formulation of crowd counting in video sequences, based on regressing people flows, that enforces strong consistency constraints without requiring complex network architectures.
2 Related Work
Given a single image of a crowded scene, the currently dominant approach to counting people is to train a deep network to regress a people density estimate at every image location. This density is then integrated to deliver an actual count [48, 23, 27, 37, 25, 24, 15, 60, 57, 47, 18, 19, 52, 29, 21, 50, 51, 39, 42, 8, 46, 53, 54].
Enforcing Temporal Consistency.
While most methods work on individual images, a few have nonetheless been extended to encode temporal consistency. Perhaps the most popular way to do so is to use an LSTM [13]. For example, in [49], the ConvLSTM architecture [38] is used for crowd counting purposes. It is trained to enforce consistency in both the forward and the backward direction. In [58], an LSTM is used in conjunction with an FCN [28] to count vehicles in video sequences. A Locality-constrained Spatial Transformer (LST) is introduced in [11]. It takes the current density map as input and outputs density maps for the next frames. The influence of these estimates on the crowd density depends on the similarity between pixel values in pairs of neighboring frames.
While effective, these approaches have two main limitations. First, at training time, they can only be used to impose consistency across annotated frames and cannot take advantage of unannotated ones to provide self-supervision. Second, they do not explicitly enforce the fact that people numbers must be conserved over time, except at the edges of the field of view. The recent method of [24] addresses both these issues. However, as will be discussed in more detail in Section 3.1, because its people conservation constraints are expressed in terms of numbers of people in neighboring image areas, they are much weaker than they should be.
Introducing Flow Variables.
Imposing strong conservation constraints when tracking people has been a concern since long before the advent of deep learning. For example, in [3], people tracking is formulated as multi-target tracking on a grid and gives rise to a linear program that can be solved efficiently using the K-Shortest Path algorithm [44]. The key to this formulation is to use as optimization variables the people flows from one grid cell to another, instead of the actual number of people in each grid cell. In [31], a comparable people conservation constraint is enforced and the global solution is found by a greedy algorithm that sequentially instantiates tracks using shortest-path computations on a flow network [56]. These people conservation constraints have since been combined with additional ones to further boost performance, including appearance constraints [1, 10, 2] to prevent identity switches, spatio-temporal constraints to force the trajectories of different objects to be disjoint [12], and higher-order constraints [4, 9].
However, all these works predate deep learning. Such flow constraints have never been used in a deep learning context, and they were designed for scenarios in which people can still be tracked individually. In this paper, we demonstrate that this approach can also be brought to bear on a deep pipeline to handle dense crowds in which people can no longer be tracked as individuals.
3 Approach
$T$  number of time steps
$K$  number of locations in the image plane
$\mathbf{I}_t$  image at the $t$-th frame
$m_j^t$  number of people present at location $j$ at time $t$
$f_{i,j}^{t-1,t}$  number of people moving from location $i$ to location $j$ between times $t-1$ and $t$
$\mathcal{N}(j)$  neighborhood of location $j$ that can be reached within a single time step
3.1 Formalization
Let us consider a video sequence and three consecutive images $\mathbf{I}_{t-1}$, $\mathbf{I}_t$, and $\mathbf{I}_{t+1}$ from it. Let us assume that each image has been partitioned into $K$ rectangular grid locations. The main constraint we want to enforce is that the number of people present at location $j$ at time $t$ is the number of people who were already there at time $t-1$ and stayed there, plus the number of those who walked in from neighboring locations between $t-1$ and $t$. The number of people present at location $j$ at time $t$ also equals the sum of the number of people who stayed there until time $t+1$ and of the people who moved to a neighboring location between $t$ and $t+1$.
Let $m_j^t$ be the number of people present at location $j$ at time $t$, $f_{i,j}^{t-1,t}$ the number of people who move from location $i$ to location $j$ between times $t-1$ and $t$, and $\mathcal{N}(j)$ the neighborhood of location $j$ that can be reached within a single time step. These notations are illustrated by Fig. 2 and summarized in Table 1. In practice, we take $\mathcal{N}(j)$ to be the 8 neighbors of grid location $j$ plus the grid location itself, to account for people who remain at the same place, as depicted by Fig. 3. Our people conservation constraint can now be written as
$$\sum_{i \in \mathcal{N}(j)} f_{i,j}^{t-1,t} \;=\; m_j^t \;=\; \sum_{k \in \mathcal{N}(j)} f_{j,k}^{t,t+1} \qquad (1)$$
for all locations $j$ that are not on the edge of the grid, that is, locations from which people cannot appear or disappear without being seen elsewhere in the image.
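To make the constraint concrete, here is a minimal NumPy sketch, not the authors' implementation: the grid size, the flow layout as a 9-channel array (one channel per reachable neighbor, including the cell itself), and the toy flow values are illustrative assumptions. It verifies that, for a consistent flow field, the people counts implied by incoming flows over $(t-1, t)$ match those implied by outgoing flows over $(t, t+1)$:

```python
import numpy as np

# Offsets of the 9 reachable cells: the cell itself plus its 8 neighbors.
OFFSETS = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]

def incoming(flow):
    """Sum of flows arriving at each cell.

    flow[y, x, k] is the number of people moving from cell (y, x)
    to its k-th neighbor during one time step.
    """
    h, w, _ = flow.shape
    m = np.zeros((h, w))
    for k, (dy, dx) in enumerate(OFFSETS):
        for y in range(h):
            for x in range(w):
                ty, tx = y + dy, x + dx
                if 0 <= ty < h and 0 <= tx < w:
                    m[ty, tx] += flow[y, x, k]
    return m

def outgoing(flow):
    """Sum of flows leaving each cell."""
    return flow.sum(axis=2)

# Toy example: one person at cell (1, 1) moves one cell to the right.
f_prev = np.zeros((3, 3, 9))
f_prev[1, 1, OFFSETS.index((0, 1))] = 1.0   # enters (1, 2) between t-1 and t
f_next = np.zeros((3, 3, 9))
f_next[1, 2, OFFSETS.index((0, 0))] = 1.0   # stays at (1, 2) between t and t+1

m_t_in = incoming(f_prev)    # people present at time t, from incoming flows
m_t_out = outgoing(f_next)   # people present at time t, from outgoing flows
assert np.allclose(m_t_in, m_t_out)  # Eq. 1 holds for this flow field
```

Summing either side over the whole grid also shows that the total people count is conserved away from the image edges.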
Most earlier approaches [30, 59, 5, 17, 20, 25, 23] regress the values of $m_j^t$, which makes it hard to impose the constraint of Eq. 1 because many different values of the flows $f_{i,j}^{t-1,t}$ can produce the same $m_j^t$ values. For example, in [24], the equivalent constraint is
$$m_j^t \le \sum_{i \in \mathcal{N}(j)} m_i^{t-1} \quad \text{and} \quad m_j^{t+1} \le \sum_{i \in \mathcal{N}(j)} m_i^{t}. \qquad (2)$$
It only states that the number of people at location $j$ at time $t$ is less than or equal to the total number of people at neighboring locations at time $t-1$, and that the same holds between times $t$ and $t+1$. This is a much looser constraint than that of Eq. 1. It guarantees that people cannot suddenly appear, but does not account for the fact that people cannot suddenly disappear either.
Our key insight is therefore that by regressing the $f_{i,j}^{t-1,t}$ from pairs of consecutive images and computing the values of the $m_j^t$ from these, we can impose the tighter constraints of Eq. 1. Furthermore, we can formulate two additional constraints. First, all flows should be non-negative. Second, if we were to play the video sequence in reverse, the flows should have the same magnitude but the opposite direction. We can therefore write
$$\forall i, j, t: \quad f_{i,j}^{t-1,t} \ge 0, \qquad (3)$$
$$\forall i, j, t: \quad f_{i,j}^{t-1,t} = \bar{f}_{j,i}^{t,t-1}, \qquad (4)$$
where $\bar{f}_{j,i}^{t,t-1}$ denotes the flow from $j$ to $i$ obtained when playing the sequence in reverse.
We now turn to the task of training a regressor that predicts flows that correspond to what is observed while obeying the above constraints and properly handling the boundary grid cells.
3.2 Regressing the Flows
Let us denote the regressor that predicts the flows from $\mathbf{I}_{t-1}$ and $\mathbf{I}_t$ as $F(\mathbf{I}_{t-1}, \mathbf{I}_t; \Theta)$, with parameters $\Theta$ to be learned during training. In other words, $\mathbf{f}^{t-1,t} = F(\mathbf{I}_{t-1}, \mathbf{I}_t; \Theta)$ is the vector of predicted flows between all pairs of neighboring locations between times $t-1$ and $t$. In practice, $F$ is implemented by a deep network. As the flows are not directly observable, the training data comes in the form of numbers of people per grid cell, which we will refer to as ground-truth people densities and denote $\tilde{m}_j^t$. During training, our goal is therefore to find values of $\Theta$ such that
$$f_{i,j}^{t-1,t} \ge 0, \quad f_{j,k}^{t,t+1} \ge 0, \quad \bar{f}_{j,i}^{t,t-1} \ge 0, \quad \sum_{i \in \mathcal{N}(j)} f_{i,j}^{t-1,t} = \tilde{m}_j^t = \sum_{k \in \mathcal{N}(j)} f_{j,k}^{t,t+1}, \quad f_{i,j}^{t-1,t} = \bar{f}_{j,i}^{t,t-1}, \qquad (5)$$
for all $i$, $j$, and $t$, except for locations at the edges of the image plane, where people can appear from and disappear to unseen parts of the scene. In practice, we enforce non-negativity by using a final ReLU layer in the network that implements $F$, and the last two constraints by incorporating them into the loss function we minimize to learn $\Theta$.
Regressor Architecture.
Recall from the previous paragraph that $\mathbf{f}^{t-1,t} = F(\mathbf{I}_{t-1}, \mathbf{I}_t; \Theta)$ is a vector of predicted flows between neighboring locations between times $t-1$ and $t$. In practice, $F$ is implemented by the encoding/decoding architecture shown in Fig. 4. Its output has the same dimensions as the image grid, with 10 channels per location. The first nine are the flows to the 9 possible neighbors depicted by Fig. 3, and the tenth represents potential flows from outside the image and is therefore only meaningful at the edges.
To compute $\mathbf{f}^{t-1,t}$, the consecutive frames $\mathbf{I}_{t-1}$ and $\mathbf{I}_t$ are fed to the CAN encoder network of [25]. This yields deep features $\mathbf{x}_{t-1} = E(\mathbf{I}_{t-1}; \Theta_E)$ and $\mathbf{x}_t = E(\mathbf{I}_t; \Theta_E)$, where $E$ denotes the encoder with weights $\Theta_E$. These features are then concatenated and fed to a decoder network to output $\mathbf{f}^{t-1,t} = D(\mathbf{x}_{t-1}, \mathbf{x}_t; \Theta_D)$, where $D$ is the decoder with weights $\Theta_D$. $D$ comprises the back-end decoder of CAN [25] with an additional final ReLU layer to guarantee that the output is always non-negative. The encoder and decoder specifications are given in Table 2.
Loss Function and Training.
To obtain the ground-truth density maps $\tilde{m}_j^t$ of Eq. 5, we use the same approach as most previous work [30, 59, 5, 17, 20, 25, 23]. In each image $\mathbf{I}_t$, we annotate a set $\mathcal{P}_t$ of 2D points that denote the positions of the human heads in the scene. The corresponding ground-truth density map is obtained by convolving an image containing ones at these locations and zeroes elsewhere with a Gaussian kernel $G_\sigma$. We write
$$\tilde{m}_j^t = \sum_{p \in \mathcal{P}_t} G_\sigma(c_j - p), \qquad (6)$$
where $c_j$ denotes the center of location $j$. Note that this formulation preserves the constraints of Eq. 5 because we perform the same convolution across the whole image. In other words, if a person moves in a given direction by $n$ pixels, the corresponding contribution to the density map shifts in the same direction, also by $n$ pixels.
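As an illustration, the construction of Eq. 6 can be sketched as follows. This is a hypothetical stand-alone implementation: the value of $\sigma$ and the per-person normalization (so that each head contributes exactly one person to the count) are illustrative assumptions, not values taken from the text.

```python
import numpy as np

def density_map(head_points, height, width, sigma=2.0):
    """Ground-truth density: one unit-mass Gaussian bump per annotated head.

    Evaluating the Gaussian directly at every pixel is equivalent to
    convolving a binary head-annotation image with a Gaussian kernel.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    dmap = np.zeros((height, width))
    for (py, px) in head_points:
        g = np.exp(-((ys - py) ** 2 + (xs - px) ** 2) / (2.0 * sigma ** 2))
        dmap += g / g.sum()          # normalize so each person counts as 1
    return dmap

heads = [(10.0, 12.0), (30.0, 40.0)]  # hypothetical head annotations (y, x)
dmap = density_map(heads, 64, 64)
print(round(dmap.sum(), 6))          # integrates to the people count: 2.0
```

Because the map integrates to the number of annotated heads, summing it over any region yields a (soft) people count for that region.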
The final ReLU layer of the regressor guarantees that the first three constraints of Eq. 5 are always met. To enforce the remaining two, we define our combined loss function as the weighted sum of two loss terms. We write
$$L_{combi}(\Theta) = L_{flow}(\Theta) + \alpha\, L_{cycle}(\Theta), \qquad (7)$$
$$L_{flow}(\Theta) = \sum_{t}\sum_{j} \Big( \tilde{m}_j^t - \sum_{i \in \mathcal{N}(j)} f_{i,j}^{t-1,t} \Big)^2 + \Big( \tilde{m}_j^t - \sum_{k \in \mathcal{N}(j)} f_{j,k}^{t,t+1} \Big)^2,$$
$$L_{cycle}(\Theta) = \sum_{t}\sum_{j}\sum_{i \in \mathcal{N}(j)} \big( f_{i,j}^{t-1,t} - \bar{f}_{j,i}^{t,t-1} \big)^2,$$
where $\tilde{m}_j^t$ is the ground-truth crowd density value of Eq. 6 at time $t$ and location $j$, and $\alpha$ is a weight factor that we set to 1 in all our experiments.
Although $\mathbf{f}^{t-1,t}$ can be computed from only two consecutive frames, at training time we always use three to enforce the temporal consistency constraints of Eq. 1. Algorithm 1 describes our training scheme in more detail. Note that we do not assume that all training frames are annotated: only every $\Delta t$-th frame needs to be. When we evaluate the loss function for a given frame, we use it together with its two neighboring frames, one of which is annotated as long as $\Delta t \le 3$. In other words, our formulation allows us to leverage non-annotated frames and provides a degree of self-supervision.
layer  encoder  layer  decoder
1–2  3×3×64 conv-1  1  3×3×512 conv-2
  2×2 max pooling  2  3×3×512 conv-2
3–4  3×3×128 conv-1  3  3×3×512 conv-2
  2×2 max pooling  4  3×3×256 conv-2
5–7  3×3×256 conv-1  5  3×3×128 conv-2
  2×2 max pooling  6  3×3×64 conv-2
8–10  3×3×512 conv-1  7  1×1×10 conv-1
11  contextual features [25]  8  ReLU
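The overall data flow of the regressor, two shared-weight encoder passes whose features are concatenated and decoded into 10 non-negative channels, can be sketched as follows. This is only a shape-level sketch: the `encode` and `decode` stand-ins return random feature maps instead of applying the CAN [25] layers of Table 2, and the 8× downsampling factor and 512-channel feature width are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(img):
    """Stand-in for the CAN encoder: maps an H x W frame to an
    (H/8) x (W/8) x 512 feature map (weights omitted in this sketch)."""
    h, w = img.shape[:2]
    return rng.standard_normal((h // 8, w // 8, 512))

def decode(features):
    """Stand-in for the decoder: 10 output channels per grid location,
    made non-negative by a final ReLU (weights omitted; random values)."""
    h, w, _ = features.shape
    out = rng.standard_normal((h, w, 10))
    return np.maximum(out, 0.0)          # ReLU: flows cannot be negative

frame_prev = np.zeros((64, 64))
frame_cur = np.zeros((64, 64))
x_prev, x_cur = encode(frame_prev), encode(frame_cur)   # shared encoder weights
f = decode(np.concatenate([x_prev, x_cur], axis=-1))    # shape (8, 8, 10)
assert f.shape == (8, 8, 10) and (f >= 0).all()
```

The point of the sketch is the wiring: both frames go through the same encoder, and only the decoder sees the concatenated pair, so the flow prediction is a function of how the two feature maps differ.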
3.3 Exploiting Optical Flow
When the camera is static, both the people flow discussed above and the optical flow that can be computed directly from the images stem from the motion of the people. They should therefore be highly correlated. In fact, this remains true even when the camera moves, because its motion creates an apparent flow of people from one image location to another. However, there is no simple linear relationship between people flow and optical flow. To account for their correlation, we therefore introduce an additional loss function, which we define as
$$L_{optical}(\Theta) = \sum_{t}\sum_{j} \hat{m}_j^{t-1}\, \hat{m}_j^{t}\, \big\| \hat{o}_j^{t-1,t} - o_j^{t-1,t} \big\|^2, \qquad (8)$$
where $\hat{m}_j^{t-1}$ and $\hat{m}_j^t$ are density maps inferred from our predicted flows, $\hat{o}_j^{t-1,t}$ denotes the corresponding predicted optical flow in grid cell $j$, $o_j^{t-1,t}$ is the optical flow from frame $t-1$ to frame $t$ computed by a state-of-the-art optical flow network [43], and the density term ensures that the correlation is only enforced where there are people. This is especially useful when the camera moves, to discount the optical flows generated by the changing background.
Training with this term requires annotations for consecutive frames, that is, $\Delta t = 1$ in training Algorithm 1. When such annotations are available, we use this algorithm again but replace $L_{combi}$ by
$$L_{all}(\Theta) = L_{combi}(\Theta) + \beta\, L_{optical}(\Theta). \qquad (9)$$
In all our experiments, we set $\beta$ to 0.0001 to account for the fact that the optical flow values are around 4,000 times larger than the people flow values.
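One plausible reading of the optical flow term can be sketched as follows. This is hypothetical: the exact way predicted people flows are converted into a per-cell displacement, the squared norm, the density weighting, and the cell size are assumptions consistent with the description above, not the authors' exact formula.

```python
import numpy as np

# Offsets of the 9 reachable cells, stored as (dy, dx).
OFFSETS = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]

def optical_loss(f, optical, cell_size=8.0, eps=1e-8):
    """Density-weighted discrepancy between the motion implied by the
    people flows and a precomputed optical flow (pixels per frame).

    f:       predicted people flows, shape (H, W, 9)
    optical: optical flow averaged per grid cell, shape (H, W, 2) as (dy, dx)
    """
    m = f.sum(axis=2)                         # people density per cell
    disp = np.array(OFFSETS, dtype=float)     # (9, 2) cell displacements
    # Expected displacement per cell, converted from cells to pixels.
    o_hat = (f @ disp) * cell_size / (m[..., None] + eps)
    err = ((o_hat - optical) ** 2).sum(axis=2)
    return (m * err).sum()                    # empty cells contribute nothing

# Toy check: one person moving one cell (cell_size pixels) to the right.
f = np.zeros((3, 3, 9)); f[1, 1, OFFSETS.index((0, 1))] = 1.0
optical = np.zeros((3, 3, 2)); optical[1, 1] = (0.0, 8.0)
print(optical_loss(f, optical))  # ~0: people flow agrees with the optical flow
```

The density weighting is the key design point: background cells, where the optical flow may be dominated by camera motion, have zero predicted density and so contribute nothing to the loss.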
4 Experiments
In this section, we first introduce the evaluation metrics and benchmark datasets used in our experiments. We then compare our results to those of current state-of-the-art methods. Finally, we perform an ablation study to demonstrate the impact of the individual constraints.
4.1 Evaluation Metrics
Previous works in crowd density estimation use the mean absolute error (MAE) and the root mean squared error (RMSE) as evaluation metrics [59, 55, 30, 34, 49, 41]. They are defined as
$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| z_i - \hat{z}_i \right| \quad \text{and} \quad \mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( z_i - \hat{z}_i \right)^2},$$
where $N$ is the number of test images, $z_i$ denotes the true number of people inside the ROI of the $i$-th image, and $\hat{z}_i$ the estimated number of people. In the benchmark datasets discussed below, the ROI is the whole image except when explicitly stated otherwise. In practice, $\hat{z}_i$ is taken to be the integral over the image pixels or grid locations of the predicted people densities, in our case the densities obtained from the predicted people flows.
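These metrics are straightforward to compute; a minimal sketch with hypothetical per-image counts:

```python
import numpy as np

def mae_rmse(counts_true, counts_pred):
    """Standard crowd-counting metrics over N test images."""
    z = np.asarray(counts_true, dtype=float)
    z_hat = np.asarray(counts_pred, dtype=float)
    mae = np.abs(z - z_hat).mean()
    rmse = np.sqrt(((z - z_hat) ** 2).mean())
    return mae, rmse

# The predicted count of an image is the integral of its predicted density
# map, here replaced by made-up totals for three hypothetical test images.
mae, rmse = mae_rmse([10, 20, 30], [12, 18, 33])
print(mae, rmse)   # approx. 2.33 and 2.38
```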
4.2 Benchmark Datasets and Groundtruth Data
(Figure: (a) original image, (b) ground-truth density map, (c) estimated density map, (d)–(l) flow directions.)
For evaluation purposes, we use five different datasets for which the videos have been released along with recently published papers. The first is a synthetic dataset with ground-truth optical flows. The other four are real-world videos with annotated people locations but without ground-truth optical flow. To use the optical flow constraints described in Section 3.3, we therefore use PWC-Net [43] to compute the optical flow and inject it into the loss function of Eq. 8. Fig. 7 depicts one such flow.
CrowdFlow [35].
This dataset consists of five synthetic sequences of 300 to 450 frames each. Each one is rendered twice, once with a static camera and once with a moving one. The ground-truth optical flow is provided, as shown in Fig. 6. This dataset has not been used for crowd counting before, and the definition of the training and testing sets is not clearly specified in [35]. To measure the performance difference between using the ground-truth optical flow and an estimated one, we use the first three sequences of both the static- and moving-camera scenarios for training and validation, and the last two for testing.
FDST [11].
It comprises 100 videos captured in 13 different scenes, with a total of 15,000 frames and 394,081 annotated heads. The training set consists of 60 videos (9,000 frames) and the testing set contains the remaining 40 videos (6,000 frames). We follow the same setting as in [11].
UCSD [6].
This dataset contains 2,000 frames captured by surveillance cameras on the UCSD campus. The resolution of the frames is 238 × 158 pixels and the frame rate is 10 fps. The number of people in each frame varies from 11 to 46. We use the same setting as in [6], with frames 601 to 1400 as training data and the remaining 1,200 frames as testing data.
Venice [25].
It contains 4 different sequences with, in total, 167 annotated frames at a fixed 1,280 × 720 resolution. As in [25], 80 images from a single long sequence are used as training data, and the remaining 3 sequences are used for testing.
WorldExpo’10 [55].
It comprises 1,132 annotated video sequences collected from 103 different scenes. There are 3,980 annotated frames, 3,380 of which are used for training. Each scene contains a Region Of Interest (ROI) in which the people are counted. As in previous work on this dataset [55, 59, 34, 33, 17, 5, 20, 41, 36, 32, 40], we report the MAE of each scene, as well as the average over all scenes.
4.3 Comparing against Recent Techniques
Model  MAE  RMSE
MCNN [59]  172.8  216.0
CSRNet [17]  137.8  181.0
CAN [25]  124.3  160.2
OURS-COMBI  97.8  112.1
OURS-ALL-EST  96.3  111.6
OURS-ALL-GT  90.9  110.3
Model  MAE  RMSE
MCNN [59]  3.77  4.88
ConvLSTM [49]  4.48  5.82
Without LST [11]  3.87  5.16
LST [11]  3.35  4.45
OURS-COMBI  2.92  3.76
OURS-ALL-EST  2.84  3.57
Model  MAE  RMSE
Zhang et al. [55]  1.60  3.31
Hydra-CNN [30]  1.07  1.35
CNN-Boosting [45]  1.10  –
MCNN [59]  1.07  1.35
Switch-CNN [34]  1.62  2.10
ConvLSTM [49]  1.30  1.79
Bi-ConvLSTM [49]  1.13  1.43
ACSCP [36]  1.04  1.35
CSRNet [17]  1.16  1.47
SANet [5]  1.02  1.29
ADCrowdNet [23]  0.98  1.25
PACNN [37]  0.89  1.18
SANet+SPANet [8]  1.00  1.28
OURS-COMBI  0.86  1.13
OURS-ALL-EST  0.81  1.07
Model  MAE  RMSE
MCNN [59]  145.4  147.3
Switch-CNN [34]  52.8  59.5
CSRNet [17]  35.8  50.0
CAN [25]  23.5  38.9
ECAN [25]  20.5  29.9
GPC [24]  18.2  26.6
OURS-COMBI  15.0  19.6
Model  Scene1  Scene2  Scene3  Scene4  Scene5  Average
Zhang et al. [55]  9.8  14.1  14.3  22.2  3.7  12.9
MCNN [59]  3.4  20.6  12.9  13.0  8.1  11.6
Switch-CNN [34]  4.4  15.7  10.0  11.0  5.9  9.4
CP-CNN [41]  2.9  14.7  10.5  10.4  5.8  8.9
ACSCP [36]  2.8  14.05  9.6  8.1  2.9  7.5
IG-CNN [33]  2.6  16.1  10.15  20.2  7.6  11.3
ic-CNN [32]  17.0  12.3  9.2  8.1  4.7  10.3
D-ConvNet [40]  1.9  12.1  20.7  8.3  2.6  9.1
CSRNet [17]  2.9  11.5  8.6  16.6  3.4  8.6
SANet [5]  2.6  13.2  9.0  13.3  3.0  8.2
DecideNet [20]  2.0  13.14  8.9  17.4  4.75  9.23
CAN [25]  2.9  12.0  10.0  7.9  4.3  7.4
ECAN [25]  2.4  9.4  8.8  11.2  4.0  7.2
PGC-Net [52]  2.5  12.7  8.4  13.7  3.2  8.1
OURS-COMBI  2.2  10.8  8.0  8.8  3.2  6.6
We denote our model trained using the combined loss function of Section 3.2 as OURS-COMBI, and the one using the full loss function of Section 3.3 with ground-truth optical flow as OURS-ALL-GT. In other words, OURS-ALL-GT exploits the optical flow while OURS-COMBI does not. When the ground-truth optical flow is not available, we use the optical flow estimated by PWC-Net [43] and denote this model OURS-ALL-EST.
Synthetic Data.
Fig. 8 depicts a qualitative result, and we report our quantitative results on the CrowdFlow dataset in Table 3. OURS-COMBI outperforms the competing methods by a significant margin, and OURS-ALL-EST delivers a further improvement. Using the ground-truth optical flow values in our loss term yields an additional, clear performance improvement, as shown by the results of OURS-ALL-GT.
Real Data.
Fig. 5 depicts a qualitative result, and we report our quantitative results on the four real-world datasets in Tables 4, 5, 6, and 7. For FDST and UCSD, annotations in consecutive frames are available, which enables us to use the optical flow loss term of Eq. 8. We therefore report results for both OURS-COMBI and OURS-ALL-EST. By contrast, for Venice and WorldExpo’10, only a sparse subset of the frames is annotated, and we therefore only report results for OURS-COMBI.
For FDST, UCSD, and Venice, our approach again clearly outperforms the competing methods, with the optical flow constraint further boosting performance when applicable. For WorldExpo’10, the ranking of the methods depends on the scene, but ours still performs best on average and on Scene3. In short, when the crowd is dense, our approach dominates the others. By contrast, when the crowd becomes very sparse, as in Scene1 and Scene5, models that comprise a pool of different regressors, such as [40], gain an advantage. This points to a potential way to further improve our own method, namely to also use a pool of regressors to estimate the people flows.
(Figure: (a) original image, (b) ground-truth density map, (c) estimated density map, (d)–(l) flow directions.)
4.4 Ablation Study
Model  MAE  RMSE
BASELINE  124.3  160.2
OURS-FLOW  113.3  140.3
OURS-COMBI  97.8  112.1
OURS-ALL-EST  96.3  111.6
Model  MAE  RMSE
BASELINE  3.25  4.13
OURS-FLOW  3.17  4.04
OURS-COMBI  2.92  3.76
OURS-ALL-EST  2.84  3.57
Model  MAE  RMSE
BASELINE  0.98  1.26
OURS-FLOW  0.94  1.21
OURS-COMBI  0.86  1.13
OURS-ALL-EST  0.81  1.07
To confirm that the results we report stem not from the specific network architecture we use but from the fact that we infer people flows instead of density maps as most other approaches do, we performed an ablation study. Recall from Section 3.2 that we use the CAN [25] architecture to regress the flows. As in the original paper, we can use this network to directly regress the densities. We refer to this approach as BASELINE. To highlight the importance of the forward-backward constraints of Eq. 4, we also tested a simplified version of our approach in which we drop them, which we refer to as OURS-FLOW.
We compare the performance of these four approaches on CrowdFlow, FDST, and UCSD in Tables 8, 9, and 10. As expected, OURS-FLOW improves on BASELINE on all three datasets, with further performance increases for OURS-COMBI and OURS-ALL-EST. This confirms that using people flows instead of densities is a win and that the additional constraints we impose all make positive contributions.
5 Conclusion
We have shown that implementing a crowd counting algorithm in terms of estimating the people flows and then summing them to obtain people densities is more effective than attempting to directly estimate the densities. This is because it allows us to impose conservation constraints that make the estimates more robust. When optical flow data can be obtained, it also enables us to exploit the correlation between optical flow and people flow to further improve the results.
In this paper, we have performed all the computations in image space, in large part so that we can compare our results to those of other recent algorithms that also work in image space. However, this neglects perspective effects: people densities per unit of image area depend on where in the image the pixels are. In future work, we therefore intend to account for these effects by working on the ground plane instead of the image plane, which should further increase accuracy by removing a potential source of bias.
A promising application is to use drones for people counting because their internal sensors can be directly used to provide the camera registration parameters necessary to compute the homographies between the camera and ground planes. In this scenario, the drone sensors also provide a motion estimate, which can be used to correct the optical flow measurements and therefore exploit the information they provide as effectively as if the camera were static.
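The ground-plane mapping mentioned above amounts to applying a homography to image points; a generic sketch (the homography matrix itself would come from the camera registration parameters, here replaced by simple made-up matrices):

```python
import numpy as np

def to_ground_plane(points_px, H):
    """Map 2D image points to ground-plane coordinates with a 3x3
    homography H, using homogeneous coordinates."""
    pts = np.hstack([np.asarray(points_px, dtype=float),
                     np.ones((len(points_px), 1))])   # lift to homogeneous
    mapped = pts @ H.T
    return mapped[:, :2] / mapped[:, 2:3]             # back to 2D

# Sanity checks: the identity leaves points unchanged, a diagonal
# homography scales them.
pts = [(100.0, 50.0), (10.0, 20.0)]
out = to_ground_plane(pts, np.eye(3))
assert np.allclose(out, pts)
```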
Acknowledgments This work was supported in part by the Swiss National Science Foundation.
References
 [1] H. Ben-Shitrit, J. Berclaz, F. Fleuret, and P. Fua. Tracking Multiple People Under Global Appearance Constraints. In International Conference on Computer Vision, 2011.
 [2] H. Ben-Shitrit, J. Berclaz, F. Fleuret, and P. Fua. Multi-Commodity Network Flow for Tracking Multiple People. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(8):1614–1627, 2014.
 [3] J. Berclaz, F. Fleuret, E. Türetken, and P. Fua. Multiple Object Tracking Using K-Shortest Paths Optimization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(11):1806–1819, 2011.

 [4] A. Butt and R. Collins. Multi-Target Tracking by Lagrangian Relaxation to Min-Cost Network Flow. In Conference on Computer Vision and Pattern Recognition, pages 1846–1853, 2013.
 [5] X. Cao, Z. Wang, Y. Zhao, and F. Su. Scale Aggregation Network for Accurate and Efficient Crowd Counting. In European Conference on Computer Vision, 2018.
 [6] A.B. Chan, Z.S.J. Liang, and N. Vasconcelos. Privacy Preserving Crowd Monitoring: Counting People Without People Models or Tracking. In Conference on Computer Vision and Pattern Recognition, 2008.
 [7] A.B. Chan and N. Vasconcelos. Bayesian Poisson Regression for Crowd Counting. In International Conference on Computer Vision, pages 545–551, 2009.
 [8] Z. Cheng, J. Li, Q. Dai, X. Wu, and A. G. Hauptmann. Learning Spatial Awareness to Improve Crowd Counting. In International Conference on Computer Vision, 2019.
 [9] R.T. Collins. Multi-Target Data Association with Higher-Order Motion Models. In Conference on Computer Vision and Pattern Recognition, 2012.
 [10] C. Dicle, O. I Camps, and M. Sznaier. The Way They Move: Tracking Multiple Targets with Similar Appearance. In International Conference on Computer Vision, 2013.

 [11] Y. Fang, B. Zhan, W. Cai, S. Gao, and B. Hu. Locality-Constrained Spatial Transformer Network for Video Crowd Counting. In International Conference on Multimedia and Expo, 2019.
 [12] Z. He, X. Li, X. You, D. Tao, and Y. Y. Tang. Connected Component Model for Multi-Object Tracking. 2016.
 [13] S. Hochreiter and J. Schmidhuber. Long ShortTerm Memory. Neural Computation, 9(8):1735–1780, 1997.
 [14] H. Idrees, M. Tayyab, K. Athrey, D. Zhang, S. Almaadeed, N. Rajpoot, and M. Shah. Composition Loss for Counting, Density Map Estimation and Localization in Dense Crowds. In European Conference on Computer Vision, 2018.
 [15] X. Jiang, Z. Xiao, B. Zhang, and X. Zhen. Crowd Counting and Density Estimation by Trellis EncoderDecoder Networks. In Conference on Computer Vision and Pattern Recognition, 2019.
 [16] V. Lempitsky and A. Zisserman. Learning to Count Objects in Images. In Advances in Neural Information Processing Systems, 2010.

 [17] Y. Li, X. Zhang, and D. Chen. CSRNet: Dilated Convolutional Neural Networks for Understanding the Highly Congested Scenes. In Conference on Computer Vision and Pattern Recognition, 2018.
 [18] D. Lian, J. Li, J. Zheng, W. Luo, and S. Gao. Density Map Regression Guided Detection Network for RGB-D Crowd Counting and Localization. In Conference on Computer Vision and Pattern Recognition, 2019.
 [19] C. Liu, X. Weng, and Y. Mu. Recurrent Attentive Zooming for Joint Crowd Counting and Precise Localization. Conference on Computer Vision and Pattern Recognition, 2019.
 [20] J. Liu, C. Gao, D. Meng, and A.G. Hauptmann. DecideNet: Counting Varying Density Crowds through Attention Guided Detection and Density Estimation. In Conference on Computer Vision and Pattern Recognition, 2018.
 [21] L. Liu, Z. Qiu, G. Li, S. Liu, W. Ouyang, and L. Lin. Crowd Counting with Deep Structured Scale Integration Network. International Conference on Computer Vision, 2019.

 [22] L. Liu, H. Wang, G. Li, W. Ouyang, and L. Lin. Crowd Counting Using Deep Recurrent Spatial-Aware Network. In International Joint Conference on Artificial Intelligence, 2018.
 [23] N. Liu, Y. Long, C. Zou, Q. Niu, L. Pan, and H. Wu. ADCrowdNet: An Attention-Injective Deformable Convolutional Network for Crowd Understanding. In Conference on Computer Vision and Pattern Recognition, 2019.
 [24] W. Liu, K. Lis, M. Salzmann, and P. Fua. Geometric and Physical Constraints for Drone-Based Head Plane Crowd Density Estimation. In International Conference on Intelligent Robots and Systems, 2019.
 [25] W. Liu, M. Salzmann, and P. Fua. ContextAware Crowd Counting. In Conference on Computer Vision and Pattern Recognition, 2019.
 [26] X. Liu, J. van de Weijer, and A.D. Bagdanov. Leveraging Unlabeled Data for Crowd Counting by Learning to Rank. In Conference on Computer Vision and Pattern Recognition, 2018.
 [27] Y. Liu, M. Shi, Q. Zhao, and X. Wang. Point in, Box out: Beyond Counting Persons in Crowds. Conference on Computer Vision and Pattern Recognition, 2019.
 [28] J. Long, E. Shelhamer, and T. Darrell. Fully Convolutional Networks for Semantic Segmentation. In Conference on Computer Vision and Pattern Recognition, 2015.
 [29] Z. Ma, X. Wei, X. Hong, and Y. Gong. Bayesian Loss for Crowd Count Estimation with Point Supervision. International Conference on Computer Vision, 2019.
 [30] D. Oñoro-Rubio and R.J. López-Sastre. Towards Perspective-Free Object Counting with Deep Learning. In European Conference on Computer Vision, pages 615–629, 2016.
 [31] H. Pirsiavash, D. Ramanan, and C. Fowlkes. Globally-Optimal Greedy Algorithms for Tracking a Variable Number of Objects. In Conference on Computer Vision and Pattern Recognition, pages 1201–1208, 2011.
 [32] V. Ranjan, H. Le, and M. Hoai. Iterative Crowd Counting. In European Conference on Computer Vision, 2018.
 [33] D.B. Sam, N.N. Sajjan, R.V. Babu, and M. Srinivasan. Divide and Grow: Capturing Huge Diversity in Crowd Images with Incrementally Growing CNN. In Conference on Computer Vision and Pattern Recognition, 2018.
 [34] D.B. Sam, S. Surya, and R.V. Babu. Switching Convolutional Neural Network for Crowd Counting. In Conference on Computer Vision and Pattern Recognition, 2017.
 [35] G. Schröder, T. Senst, E. Bochinski, and T. Sikora. Optical Flow Dataset and Benchmark for Visual Crowd Analysis. In International Conference on Advanced Video and Signal Based Surveillance, 2018.
 [36] Z. Shen, Y. Xu, B. Ni, M. Wang, J. Hu, and X. Yang. Crowd Counting via Adversarial Cross-Scale Consistency Pursuit. In Conference on Computer Vision and Pattern Recognition, 2018.
 [37] M. Shi, Z. Yang, C. Xu, and Q. Chen. Revisiting Perspective Information for Efficient Crowd Counting. In Conference on Computer Vision and Pattern Recognition, 2019.

 [38] X. Shi, Z. Chen, H. Wang, D. Yeung, W. Wong, and W. Woo. Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting. In Advances in Neural Information Processing Systems, pages 802–810, 2015.
 [39] Z. Shi, P. Mettes, and C. G. M. Snoek. Counting with Focus for Free. In International Conference on Computer Vision, 2019.
 [40] Z. Shi, L. Zhang, Y. Liu, and X. Cao. Crowd Counting with Deep Negative Correlation Learning. In Conference on Computer Vision and Pattern Recognition, 2018.
 [41] V.A. Sindagi and V.M. Patel. Generating HighQuality Crowd Density Maps Using Contextual Pyramid CNNs. In International Conference on Computer Vision, pages 1879–1888, 2017.
 [42] V.A. Sindagi and V.M. Patel. MultiLevel BottomTop and TopBottom Feature Fusion for Crowd Counting. In International Conference on Computer Vision, 2019.
 [43] D. Sun, X. Yang, M. Liu, and J. Kautz. PWC-Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume. In Conference on Computer Vision and Pattern Recognition, 2018.
 [44] J.W. Suurballe. Disjoint Paths in a Network. Networks, 1974.
 [45] E. Walach and L. Wolf. Learning to Count with CNN Boosting. In European Conference on Computer Vision, 2016.
 [46] J. Wan and A. B. Chan. Adaptive Density Map Generation for Crowd Counting. In International Conference on Computer Vision, 2019.
 [47] J. Wan, W. Luo, B. Wu, A. B. Chan, and W. Liu. Residual Regression with Semantic Prior for Crowd Counting. In Conference on Computer Vision and Pattern Recognition, 2019.
 [48] Q. Wang, J. Gao, W. Lin, and Y. Yuan. Learning from Synthetic Data for Crowd Counting in the Wild. In Conference on Computer Vision and Pattern Recognition, 2019.
 [49] F. Xiong, X. Shi, and D. Yeung. Spatiotemporal Modeling for Crowd Counting in Videos. In International Conference on Computer Vision, pages 5161–5169, 2017.
 [50] H. Xiong, H. Lu, C. Liu, L. Liu, Z. Cao, and C. Shen. From Open Set to Closed Set: Counting Objects by Spatial DivideandConquer. In International Conference on Computer Vision, 2019.
 [51] C. Xu, K. Qiu, J. Fu, S. Bai, Y. Xu, and X. Bai. Learn to Scale: Generating Multipolar Normalized Density Maps for Crowd Counting. In International Conference on Computer Vision, 2019.
 [52] Z. Yan, Y. Yuan, W. Zuo, X. Tan, Y. Wang, S. Wen, and E. Ding. PerspectiveGuided Convolution Networks for Crowd Counting. In International Conference on Computer Vision, 2019.
 [53] A. Zhang, J. Shen, Z. Xiao, F. Zhu, X. Zhen, X. Cao, and L. Shao. Relational Attention Network for Crowd Counting. In International Conference on Computer Vision, 2019.
 [54] A. Zhang, L. Yue, J. Shen, F. Zhu, X. Zhen, X. Cao, and L. Shao. Attentional Neural Fields for Crowd Counting. In International Conference on Computer Vision, 2019.
 [55] C. Zhang, H. Li, X. Wang, and X. Yang. Cross-Scene Crowd Counting via Deep Convolutional Neural Networks. In Conference on Computer Vision and Pattern Recognition, pages 833–841, 2015.
 [56] L. Zhang, Y. Li, and R. Nevatia. Global Data Association for Multi-Object Tracking Using Network Flows. In Conference on Computer Vision and Pattern Recognition, 2008.
 [57] Q. Zhang and A. B. Chan. Wide-Area Crowd Counting via Ground-Plane Density Maps and Multi-View Fusion CNNs. In Conference on Computer Vision and Pattern Recognition, 2019.
 [58] S. Zhang, G. Wu, J.P. Costeira, and J.M.F. Moura. FCN-rLSTM: Deep Spatio-Temporal Neural Networks for Vehicle Counting in City Cameras. In International Conference on Computer Vision, 2017.
 [59] Y. Zhang, D. Zhou, S. Chen, S. Gao, and Y. Ma. Single-Image Crowd Counting via Multi-Column Convolutional Neural Network. In Conference on Computer Vision and Pattern Recognition, pages 589–597, 2016.
 [60] M. Zhao, J. Zhang, C. Zhang, and W. Zhang. Leveraging Heterogeneous Auxiliary Tasks to Assist Crowd Counting. In Conference on Computer Vision and Pattern Recognition, 2019.