Detecting Points of Interest
Implementation
2.1 WordNet
WordNet is a lexical database in which words are grouped into sets of cognitive synonyms (also referred to as synsets), each expressing a distinct concept. These sets are linked by means of conceptual-semantic and lexical relations, which yields a hierarchically organized repository of common-sense knowledge. Users can then find terms that belong to the same semantic class as the term provided.
In our case, the WordNet database is used to make the implementation less dependent on its field of application, so that the actual GPS lookup for the POIs can be automated. The user only has to provide a term that defines their interest, or an example of it, and WordNet provides other words that belong to the same class and might also be of interest to the user.
2.2 The Engine
To understand the meaning of queries such as hiding place or restaurant, the engine uses the WordNet database to look up hyponyms for the given term. For example, some hyponyms for hiding place are grave, holy place, overlook, top, solitude and hiding place. Each of these terms is then combined with the environment (in this case Maastricht) to form a query for a search engine such as Google or Yahoo, so that one query is created per coordinate term. Assuming that each query returns p results and q is the number of WordNet coordinate terms, a total of p × q results have to be analyzed.
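The query construction described above can be sketched as follows. This is a minimal illustration and not the original engine code: the class name QueryBuilder, its methods and the choice to quote each term are assumptions made for this example, and the term list is simply the one quoted in the text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original implementation.
class QueryBuilder {

    // Builds one search query per WordNet term by appending the environment name.
    static List<String> buildQueries(List<String> terms, String environment) {
        List<String> queries = new ArrayList<>();
        for (String term : terms) {
            // Quoting keeps multi-word terms such as "holy place" together (an assumption).
            queries.add("\"" + term + "\" " + environment);
        }
        return queries;
    }

    // Number of result pages to analyze: p results for each of the q terms.
    static int totalResults(int resultsPerQuery, int termCount) {
        return resultsPerQuery * termCount;
    }

    public static void main(String[] args) {
        List<String> terms = List.of(
                "grave", "holy place", "overlook", "top", "solitude", "hiding place");
        for (String query : buildQueries(terms, "Maastricht")) {
            System.out.println(query);
        }
        System.out.println(totalResults(10, terms.size()));
    }
}
```

With the six terms listed above and, for instance, p = 10 results per query, 60 result pages would have to be analyzed.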
Since the POI environment in this project is restricted to a part of the inner city of Maastricht, the retrieval process could be sped up by using a database containing several streets of the inner city. When analyzing the results, a window can then be moved over the content, looking for street names present in the database. To find more precise matches, a street name in the content has to be followed by a (house) number, which increases the probability that the result page discusses a specific place related to our query term in the environment of Maastricht. Next, each street name match is queried in a GPS search engine, resulting in GPS coordinates. These coordinates are then, if unique, added to the result list of POIs and finally stored in an XML file.
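One possible way to implement the street name matching is sketched below. It uses a regular expression per street name instead of an explicit moving window, which is a substitution made for brevity and not necessarily how the original engine works. The class name StreetMatcher and the sample street names are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original implementation.
class StreetMatcher {

    // Returns each distinct "street name + house number" found in the page content.
    static List<String> findAddresses(String content, List<String> streetNames) {
        List<String> addresses = new ArrayList<>();
        for (String street : streetNames) {
            // Pattern.quote matches the street name literally; \d+ requires a house number.
            Pattern pattern = Pattern.compile(
                    "\\b" + Pattern.quote(street) + "\\s+(\\d+)",
                    Pattern.CASE_INSENSITIVE);
            Matcher matcher = pattern.matcher(content);
            while (matcher.find()) {
                String address = street + " " + matcher.group(1);
                if (!addresses.contains(address)) { // keep every address only once
                    addresses.add(address);
                }
            }
        }
        return addresses;
    }

    public static void main(String[] args) {
        String page = "Visit us at Grote Staat 12, close to the Vrijthof.";
        System.out.println(findAddresses(page, List.of("Grote Staat", "Vrijthof")));
    }
}
```

A street name that is not followed by a number, such as the second one in the sample text, is ignored, which mirrors the stricter matching rule described above.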
The restriction to Maastricht as environment can be lifted by extending the analysis of web pages in such a way that locations are filtered out and stored locally. These locations can be detected by algorithms that read the whole web page and select candidate location terms, which are then tested for validity by performing a GPS query on them. Since any GPS engine uses a database containing locations (such as streets and city names), it can be assumed that only valid results are returned. Once all web pages have been analyzed, the most promising valid locations (e.g. the most frequently occurring ones, or those within a certain range of the query location) are selected as POIs.
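The selection of the most frequently occurring locations could look roughly like the sketch below. It assumes that the candidate terms have already been validated by a GPS query and only covers the frequency criterion, not the range criterion. The class name LocationRanker and the sample location names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the original implementation.
class LocationRanker {

    // Returns up to 'limit' locations, ordered by how often they occur in the input.
    static List<String> mostFrequent(List<String> validatedLocations, int limit) {
        // LinkedHashMap keeps first-seen order, so ties are broken by first appearance.
        Map<String, Long> counts = validatedLocations.stream()
                .collect(Collectors.groupingBy(
                        location -> location, LinkedHashMap::new, Collectors.counting()));
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(limit)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> found = List.of(
                "Vrijthof", "Markt", "Vrijthof", "Markt", "Vrijthof", "Bassin");
        System.out.println(mostFrequent(found, 2));
    }
}
```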
Most of the time of the retrieval process is consumed by connecting to the individual web pages and reading their contents. Moreover, Java only applies a timeout when connecting to a remote host, and not to the DNS lookup that precedes the connection. A module was therefore written to validate the DNS name of each search result before connecting to it. This prevents long waiting times when opening connections to hosts that do not exist.
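A DNS check with its own time limit can be sketched as follows. The original module is not shown in the text, so this is only one plausible approach: the lookup runs in a separate thread and is abandoned when it takes too long. The class name DnsValidator and the timeout values are assumptions.

```java
import java.net.InetAddress;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not the module from the original implementation.
class DnsValidator {

    // Returns true only if the host name resolves within the given time limit.
    static boolean resolves(String host, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor(runnable -> {
            Thread thread = new Thread(runnable, "dns-lookup");
            thread.setDaemon(true); // a stuck lookup must not keep the JVM alive
            return thread;
        });
        Future<InetAddress> lookup = executor.submit(() -> InetAddress.getByName(host));
        try {
            lookup.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException | ExecutionException e) {
            return false; // the lookup was too slow or the name does not resolve
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(resolves("example.com", 2000));
        System.out.println(resolves("no-such-host.invalid", 2000));
    }
}
```

Only hosts for which this check succeeds would then be opened with a regular connection, where the connect timeout already applies.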
In this implementation, Google's APIs were used to perform the search and GPS queries. However, since these APIs are aimed at web applications, some tricks were needed to allow their use:
- Sending an HTTP header to Google containing the query, the API key and some other encrypted and unencrypted information
- Catching Google's GET response containing the query results
- Parsing the GET response and extracting the desired information
Also, because in some cases Google does not find appropriate results for GPS locations, the Yahoo API was used. The benefit of this API is that it returns a GET response in XML format, which is easier to parse than Google's typical output.
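To illustrate why an XML response is convenient, the sketch below extracts a coordinate pair with the standard Java DOM parser. The element names Latitude and Longitude, the surrounding ResultSet/Result structure and the sample values are assumptions made for this example; the actual response format is not given in the text. The class name GeoResponseParser is hypothetical as well.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical helper, not part of the original implementation.
class GeoResponseParser {

    // Returns {latitude, longitude} read from the first matching elements in the XML.
    static double[] parseCoordinates(String xml) {
        try {
            Document document = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            double latitude = Double.parseDouble(firstText(document, "Latitude"));
            double longitude = Double.parseDouble(firstText(document, "Longitude"));
            return new double[] {latitude, longitude};
        } catch (Exception e) {
            throw new IllegalArgumentException("Response contains no usable coordinates", e);
        }
    }

    // Text content of the first element with the given (assumed) tag name.
    private static String firstText(Document document, String tag) {
        Node node = document.getElementsByTagName(tag).item(0);
        if (node == null) {
            throw new IllegalArgumentException("Missing element: " + tag);
        }
        return node.getTextContent().trim();
    }

    public static void main(String[] args) {
        String sample = "<ResultSet><Result><Latitude>50.8514</Latitude>"
                + "<Longitude>5.6910</Longitude></Result></ResultSet>";
        double[] coordinates = parseCoordinates(sample);
        System.out.println(coordinates[0] + ", " + coordinates[1]);
    }
}
```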