CS5750F13 proposal

Trip Planning

Geocoded social network services have enabled researchers …


Imagine that you plan to visit a new city for the first time. You need to search for various kinds of information, such as places to visit, flight tickets, and hotels. With the development of the World Wide Web, most of this information is easy to get, but some of it is still difficult to obtain from the internet: which POIs (points of interest) would be interesting to you (preference estimation) and how long it will take to visit each place (visit duration estimation). This information is critical for trip planning. Before deciding which set of tourist attractions to visit, you need to know which POIs you would enjoy, and before deciding which attraction to visit on which date, you need to know how long each POI would take to visit.

However, for a traveler to a new city, this information is fairly hard to guess from what the traveler already knows. For instance, consider the two most famous museums in New York, the Metropolitan Museum of Art (MET) and the Museum of Modern Art (MoMA). Even though they share many words in their names, their focus is fairly different: MET focuses on archaeological artifacts, while MoMA focuses on modern and contemporary art. A person interested in MoMA would probably be more interested in the Guggenheim Museum than in MET. MET and MoMA also differ greatly in average visit duration. MET, one of the largest museums in the US, often takes more than a day to see all the exhibitions, while MoMA takes less than 4 hours for many visitors. For people who have already been there, this difference might sound obvious, but first-time visitors have no clue how long an attraction is expected to take.

In this project, we work on two things: first, estimating attraction co-preference by modeling POIs as a network, and second, estimating the expected visit duration of POIs from timestamped social network data. We believe this project is the first attempt to estimate preference based on a POI network and to compare visit duration estimates obtained from more than one social network.

Related Works

Point-of-interest recommendation tries to predict which POI or set of POIs a traveler will want to visit next or in the coming days. One approach is to train a model using features extracted from POIs and users. For instance, Jiang et al. (2013) inferred user preference from the user's past photos and used collaborative filtering to recommend the next POI or POIs. Ye et al. (2011) recommended POIs by combining a user preference score, obtained from collaborative filtering, with a geographical influence score. Another approach is to train a Markov model on social network log data to predict where the traveler would like to go next. Kurashima et al. (2010) used the geotags and timestamps of Flickr photos to train a Markov model that predicts the next POI based on the current location.

In this project, we plan to construct a POI network by counting the co-visits between POIs, and then run clustering on this network to estimate the co-preference of POIs. Instead of directly calculating each user's preference for each POI, we focus on calculating POI-POI similarity, because this comparison is more natural and has a computational advantage: the system can pre-calculate similarity values before a user interacts with it. Compared to other algorithms for calculating POI-POI similarity, our approach is unique in two ways. First, our method is based purely on user log data, which is easy to collect in large volume, and it directly calculates co-preference. Other approaches use features based on the metadata or description of a POI, which are often not easy to get, and they calculate POI-POI similarity rather than co-preference. Second, to our knowledge, our method is the only approach that uses network-generated features to calculate co-preference.

The second task of this project, visit duration estimation, has so far been treated as a side result of broader research such as trip planning. For instance, Kurashima et al. (2010) estimated the visit duration of POIs from Flickr data while building a complete trip planning system. In this project, we plan to use two datasets, Flickr photos and geo-tagged Twitter check-ins, and see how the Twitter-based estimates differ from the Flickr-based ones.


To estimate tourist attraction co-preference, we first need to build a network of tourist attractions. Based on the geocode, textual tags, and visual cues of the photos, we will first match each photo to a POI as in Crandall et al. (2009). Once all the photos are matched to POIs, we will calculate the co-visit count of each pair of tourist attractions by counting the number of unique users who visited both POIs. Then, we will create a graph of POIs where each node is a POI and each edge is weighted by the co-visit count. For instance, if 5 Flickr users visited both MoMA and MET, then MoMA and MET each become a node and the edge connecting them has weight 5. On this graph, we plan to run hierarchical clustering using the algorithm of Blondel et al. (2008). A simple baseline for comparison would be a random clustering algorithm. One simple evaluation metric is to compare the modularity of our clustering with that of a random clustering on the generated POI graph, though it seems obvious that our method would outperform random clustering. We are still thinking of a good way to evaluate the clustering result.
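The graph construction and the modularity comparison described above can be sketched as follows. This is a minimal illustration under assumptions, not the project's implementation: the user IDs, the toy visit list, and the two hand-made partitions are hypothetical, and the clustering step of Blondel et al. (2008) itself is not reproduced here; only the co-visit weights and the modularity score used to compare partitions are shown.

```python
from collections import Counter, defaultdict
from itertools import combinations


def covisit_weights(visits):
    """Map each unordered POI pair to the number of unique users who visited both."""
    pois_by_user = defaultdict(set)
    for user, poi in visits:
        pois_by_user[user].add(poi)
    weights = Counter()
    for pois in pois_by_user.values():
        # Sorting gives every pair a canonical (a, b) order.
        for a, b in combinations(sorted(pois), 2):
            weights[(a, b)] += 1
    return weights


def modularity(weights, community_of):
    """Modularity Q of a partition of a weighted undirected graph."""
    total = sum(weights.values())  # m: total edge weight
    if total == 0:
        return 0.0
    strength = Counter()  # weighted degree of each node
    internal = Counter()  # edge weight inside each community
    for (a, b), w in weights.items():
        strength[a] += w
        strength[b] += w
        if community_of[a] == community_of[b]:
            internal[community_of[a]] += w
    community_strength = Counter()
    for node, s in strength.items():
        community_strength[community_of[node]] += s
    # Q = sum over communities of (internal weight / m) - (strength / 2m)^2
    return sum(
        internal[c] / total - (community_strength[c] / (2 * total)) ** 2
        for c in community_strength
    )


# Hypothetical toy data: (user, POI) pairs after photos are matched to POIs.
visits = [
    ("u1", "MoMA"), ("u1", "Guggenheim"),
    ("u2", "MoMA"), ("u2", "Guggenheim"),
    ("u3", "MET"), ("u3", "AMNH"),
    ("u4", "MET"), ("u4", "AMNH"),
    ("u5", "MoMA"), ("u5", "MET"),
]
weights = covisit_weights(visits)
grouped = {"MoMA": 0, "Guggenheim": 0, "MET": 1, "AMNH": 1}
shuffled = {"MoMA": 0, "MET": 0, "Guggenheim": 1, "AMNH": 1}
print(modularity(weights, grouped), modularity(weights, shuffled))
```

A partition that keeps frequently co-visited POIs together scores higher than one that splits them, which is the comparison the random-clustering baseline relies on.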

For visit duration estimation, we plan to use both Flickr and Twitter data. We will first match each photo and tweet to a POI, as in co-preference estimation. By looking at the sequential stream of geocoded photos or tweets of a single user, we can estimate the user's rough arrival time, departure time, and visit duration at a POI. By collecting this data over many users, we hope to estimate the expected visit duration of a person at a POI. For now, we have not found a good evaluation metric that can judge the quality of our estimates. We will keep looking for one throughout the project.
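One way to read arrival and departure times off a user's stream is sketched below. It assumes each photo or tweet has already been reduced to a (user, timestamp, POI) triple, treats a maximal run of consecutive items at the same POI as one visit, and splits a run when two items are more than `max_gap` apart. The 3-hour gap, the toy events, and the choice to drop single-item visits are assumptions made for illustration only.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import median


def visit_durations(events, max_gap=timedelta(hours=3)):
    """Return {poi: [minutes, ...]} from (user, timestamp, poi) events."""
    by_user = defaultdict(list)
    for user, ts, poi in events:
        by_user[user].append((ts, poi))

    durations = defaultdict(list)

    def close_run(poi, start, end):
        # Runs with a single item give no duration information and are skipped.
        if poi is not None and end > start:
            durations[poi].append((end - start).total_seconds() / 60)

    for stream in by_user.values():
        stream.sort()  # chronological order per user
        run_poi = start = end = None
        for ts, poi in stream:
            if poi != run_poi or ts - end > max_gap:
                close_run(run_poi, start, end)
                run_poi, start = poi, ts
            end = ts
        close_run(run_poi, start, end)
    return dict(durations)


# Hypothetical toy events.
events = [
    ("u1", datetime(2013, 6, 1, 10, 0), "MoMA"),
    ("u1", datetime(2013, 6, 1, 11, 30), "MoMA"),
    ("u1", datetime(2013, 6, 1, 13, 0), "MET"),
    ("u1", datetime(2013, 6, 1, 15, 30), "MET"),
    ("u2", datetime(2013, 6, 2, 9, 0), "MoMA"),
    ("u2", datetime(2013, 6, 2, 11, 0), "MoMA"),
    ("u2", datetime(2013, 6, 3, 9, 0), "MoMA"),  # next day: starts a new visit
]
per_poi = visit_durations(events)
print({poi: median(minutes) for poi, minutes in per_poi.items()})
```

The first and last items of a run only bound the visit from below, so these durations underestimate the true stay; the median is used here because it is less sensitive to a few very long runs than the mean.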


For the last few weeks, we have been working on gathering data for the project. The following describes how we set up our data collection scripts.


Twitter does not allow downloading its full dataset. Instead, we decided to use the Twitter Streaming API, which allows you to stream up to 1% of the Twitter data that is currently being generated. You can filter the data by providing a condition to the API. If the condition is strict enough that the matching tweets amount to less than 1% of all Twitter data, then in theory you will get every tweet that matches the condition.

For this project, we decided to focus on New York City, one of the largest tourist destinations on Earth. We started streaming tweets by dividing Manhattan into three regions and opening a single stream for each region. Table 1 shows the geocodes of each region. We overlapped adjacent regions by 1 km, so that we do not have to jump between two different regions when a POI lies on a borderline.
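The overlapping split can be illustrated with a small sketch. It assumes the regions are latitude bands of one bounding box and converts the 1 km overlap to degrees using an approximate 111.32 km per degree of latitude. The coordinates below are hypothetical placeholders and are not the Table 1 values.

```python
KM_PER_DEG_LAT = 111.32  # approximate length of one degree of latitude


def split_with_overlap(south, west, north, east, n_regions=3, overlap_km=1.0):
    """Split a bounding box into latitude bands; neighbours share overlap_km."""
    half = overlap_km / KM_PER_DEG_LAT / 2  # each band grows by half the overlap
    step = (north - south) / n_regions
    boxes = []
    for i in range(n_regions):
        # Only internal borders are extended; the outer box is left unchanged.
        lo = south + i * step - (half if i > 0 else 0.0)
        hi = south + (i + 1) * step + (half if i < n_regions - 1 else 0.0)
        boxes.append((lo, west, hi, east))
    return boxes


# Hypothetical box roughly around Manhattan, as (south, west, north, east).
for box in split_with_overlap(40.70, -74.02, 40.88, -73.91):
    print(tuple(round(v, 5) for v in box))
```

With the overlap in place, any POI within half a kilometre of a border falls entirely inside at least one region, so its tweets can be read from a single stream.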

To check whether our regions are small enough to get all the tweets, we ran a simple experiment by opening another stream that covers a narrow area around MoMA, as in Table 2. If the midtown region of Table 1 is small enough, all the tweets in the MoMA area should also be covered by the tweets in the midtown region.
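The coverage experiment amounts to a set comparison between the two streams. The sketch below assumes each collected tweet carries a unique ID and that both streams were recorded over the same time window. The function name and the sample IDs are hypothetical.

```python
def coverage(narrow_ids, wide_ids):
    """Return (fraction covered, missing IDs) of the narrow stream in the wide one."""
    narrow = set(narrow_ids)
    missing = narrow - set(wide_ids)
    if not narrow:
        return 1.0, missing  # nothing to cover, so vacuously complete
    return 1 - len(missing) / len(narrow), missing


# Hypothetical tweet IDs from the MoMA-area stream and the midtown stream.
moma_ids = [101, 102, 103, 104]
midtown_ids = [100, 101, 102, 104, 105, 106]
fraction, missing = coverage(moma_ids, midtown_ids)
print(fraction, sorted(missing))
```

A fraction well below 1 would suggest that the midtown stream is hitting the rate limit and that the region should be split further.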


The Flickr API allows you to download the dataset with various conditions, and we downloaded all the photos between Jan 2006 and Jun 2013 whose geocode lies between [40.498137,-74.376283] and [41.058644,-73.519349]. The Flickr API turned out to be extremely flaky: when a large list of photos needs to be returned, the system often returned duplicate results, often a list of 10,000 identical items. To avoid this, we issued a separate query for every single date between Jan 2006 and Jun 2013, which kept Flickr from generating a long result list in the first place. This method generally worked, but about 1% of the data is still duplicated, and we are looking for a clever way to avoid this.
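The per-day querying scheme, together with one simple way to handle the remaining duplicates, can be sketched as below. The actual Flickr request is deliberately left out. The exact start and end dates, the assumption that every returned record carries a unique "id" field, and the post-hoc filter itself are illustrative choices and not necessarily what the final scripts will do.

```python
from datetime import date, timedelta


def daily_windows(start, end):
    """Yield (day, next_day) pairs covering every date from start to end inclusive."""
    day = start
    while day <= end:
        yield day, day + timedelta(days=1)
        day += timedelta(days=1)


def dedupe_by_id(photos):
    """Keep the first record seen for each photo id, preserving order."""
    seen = set()
    unique = []
    for photo in photos:
        if photo["id"] not in seen:
            seen.add(photo["id"])
            unique.append(photo)
    return unique


# One query window per day; each window would be passed to a single API query.
windows = list(daily_windows(date(2006, 1, 1), date(2013, 6, 30)))
print(len(windows), windows[0], windows[-1])

# Hypothetical records merged from several daily queries.
merged = [{"id": "a"}, {"id": "b"}, {"id": "a"}, {"id": "c"}, {"id": "b"}]
print([p["id"] for p in dedupe_by_id(merged)])
```

Filtering by ID after download removes exact repeats regardless of which query returned them, at the cost of holding the set of seen IDs in memory.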


Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. 2008. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment 2008 (10), P10008.

David J. Crandall, Lars Backstrom, Daniel Huttenlocher, and Jon Kleinberg. 2009. Mapping the world's photos. In Proceedings of the 18th international conference on World wide web (WWW '09). ACM, New York, NY, USA, 761-770. DOI=10.1145/1526709.1526812 http://doi.acm.org/10.1145/1526709.1526812

Kai Jiang, Huagang Yin, Peng Wang, and Nenghai Yu. 2013. Learning from contextual information of geo-tagged web photos to rank personalized tourism attractions. Neurocomputing 119 (7 November 2013), 17-25. ISSN 0925-2312. http://dx.doi.org/10.1016/j.neucom.2012.02.049

Takeshi Kurashima, Tomoharu Iwata, Go Irie, and Ko Fujimura. 2010. Travel route recommendation using geotags in photo sharing sites. In Proceedings of the 19th ACM international conference on Information and knowledge management (CIKM '10). ACM, New York, NY, USA, 579-588. DOI=10.1145/1871437.1871513 http://doi.acm.org/10.1145/1871437.1871513

Mao Ye, Peifeng Yin, Wang-Chien Lee, and Dik-Lun Lee. 2011. Exploiting geographical influence for collaborative point-of-interest recommendation. In Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval (SIGIR '11). ACM, New York, NY, USA, 325-334. DOI=10.1145/2009916.2009962 http://doi.acm.org/10.1145/2009916.2009962

We are trying to calculate POI co-preference based on the network structure of POIs. Previous works calculated POI preference either by directly training on user and POI features or by doing collaborative filtering. Meanwhile

Our work differs from previous work because we use network features

We first build a graph of tourist attractions to calculate the co-preference of tourist attractions, while previous approaches tried to directly calculate similarity between tourist attractions based

Calculate tourist attraction correlation by running hierarchical clustering on the POI network. We plan to construct the POI network using logs and visual and textual features of Flickr photos (Crandall et al., 2009), and do hierarchical clustering (Blondel et al., 2008).

Moonyoung Kang. 2013. Integer Programming Formulation of Finding Cheapest Ticket Combination over Multiple Tourist Attractions. Information and Communication Technologies in Tourism 2013 (Proceedings of the International Conference in Innsbruck, Austria, January 23-25, 2013). Berlin - Heidelberg: Springer

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License