Europe Media Monitor
Europe Media Monitor is a system which tracks the current news reported by the world’s online media. Monitoring thousands of news sources in over 70 languages, the system uses advanced information extraction techniques to automatically determine what is being reported in the news, where things are happening, who is involved and what they said.
The system provides a unique and independent viewpoint of what is being reported in the world right now.
The system performs:
language detection, recognition of people and places, quote extraction, categorisation and clustering.
About this app
This application serves as an interface to navigate, customize and use the data set generated by the Europe Media Monitor (EMM) system.
How to use this app
The first screen of the application is called a ‘set’.
A set provides an overview of a list of topics.
A set is a collection of user-defined topics which the application is currently tracking.
The topics inside each set are fully customizable, and the user can create and customize new sets.
- The current set – tap to change / modify other sets
- Active languages – tap to set the languages which are being monitored
- Single or dual column display of the topics below
- A single channel which collects articles related to a specific topic
INSIDE A CHANNEL
- Add topic to set
- Article row density
- Inline translation to English
- Back to main view
More About Media Monitoring
Most large organizations have dedicated departments that monitor the media to keep up to date with relevant developments and to keep an eye on how they are represented in the news. Specialist organizations, such as those that monitor threats to public health, monitor the multilingual media continuously for early warning and information gathering purposes. In Europe and large parts of the world, these organizations look at the news in various languages because the content found across different languages is complementary. In the context of the European Union, which has 23 official languages, multilinguality is a practical and a diplomatic necessity, and so is the related ability to access information across languages.
Current approaches to link and access related textual content across languages use Machine Translation, bilingual dictionaries, or bilingual vector space representations. These approaches are limited to covering a small number of languages. We present an alternative way of tackling this challenge. It consists of aiming at a language-neutral representation of the multilingual document contents.
The European Commission’s Joint Research Centre (JRC) has developed a number of multilingual news monitoring and analysis applications, four of which are accessible to the wider public. All four systems can be accessed through the common entrance portal http://press.jrc.it/overview.html. Users of the systems are the EU Institutions, national organizations in the EU Member States (and in some non-EU countries), international organizations, as well as the public. The freely accessible websites receive between one and two million hits per day from thirty to fifty thousand distinct users.
The next section provides information on the news data that is being analyzed every day. Section 3 gives an overview of the four publicly accessible media monitoring systems and their specific functionalities. Section 4 aims to explain the specific approach which enabled the EMM family of applications to cover a high number of languages and to fuse information extracted from texts written in different languages. It also describes some of the multilingual and cross-lingual functionalities that we could only develop because we adopted this approach. The last section points to future work. Related work for each of the functionalities described will be discussed in the related sections.
A little-known feature of the EMM system is that we provide online-inline-realtime translations to English for a number of languages:
online: you can look at the translations on the website (and soon in the app)
inline: we translate everything while processing the information, so translation is not on demand but happens when we process the article
realtime: in two ways. First, the translation time is very short, so we hardly add time to the processing. Second, the translation is immediately available to you on the website or in the app
The feature is little known because we don’t make it very clear in the user interface. Something we are working on.
The translation system for 9 of these languages is home grown, based on open source software, but trained and refined by us. It remains small scale (we are not Google) but it is interesting to see that in most cases we are able to translate at least well enough to give you some idea what the original article is about.
The translation system for 2 other languages is operated under license.
The current set of languages that we translate to English is German, Italian, Spanish, French, Portuguese, Polish, Czech, Danish, Arabic, Farsi and Chinese.
Many thanks to Marco Turchi for the original work on this translation subsystem of EMM.
Another little-known feature is that we have a user interface in many different languages, including Arabic and Chinese. The Arabic version does correctly show the information right to left. Not all terms and phrases are translated, or translated equally well, but the technology is there and suggestions for translations are always welcome!
And what is EMM NOT?
EMM is NOT “yet another Internet search engine”. In fact the search capability of EMM is only a useful by-product of the system.
EMM does NOT have a big database of articles. We do not store the articles that we process, and we will always only provide the original links to the relevant articles. What we store is a minimum amount of information from the article, together with all the information that we generate about the article.
EMM is NOT a ‘product’. Our purpose was to expose as much as possible of the work that we do in a way that is useful for you. This means that although we have done our best to build a reasonable user interface we have not necessarily built the most user friendly website. Comments are always welcome.
The EMM Newsbrief is NOT political and does NOT reflect any particular view or opinion of the European Commission with respect to any of the information displayed on the site. What you see on the website is the result of our research activities that we have tried to turn into something useful for you.
EMM Under the hood
For those of you who would like to know how we do what we do in terms of technology and systems engineering.
Most of the system was developed and written by us (‘us’ in 2001 was a very small group of people), after an original idea of Niels Jørgen Thøgersen and Clive Best. System development started in November 2001. After a demo in April 2002 some re-engineering followed, and a very early version of the system was launched in July 2002, monitoring around 200 web pages and applying around 100 categories.
We adopted RSS as the format to process, and during the development of the system RSS eventually also became the format we use internally to communicate between the servlets. As there were hardly any RSS feeds available in 2001, we decided to develop our own solutions to create RSS from HTML. This eventually resulted in the creation of an HTML scanning/parsing subsystem which allows us also to extract the main article text with reasonable accuracy. Even today (2011) almost half of the feeds we monitor are HTML pages.
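The idea of turning an HTML page into RSS can be illustrated with a minimal Python sketch. This is not the EMM implementation (which is written in Java and also extracts the main article text); it only collects the links on a page and emits one RSS 2.0 item per link. The names `LinkCollector` and `html_to_rss` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
import xml.etree.ElementTree as ET


class LinkCollector(HTMLParser):
    """Collects (href, link text) pairs from anchor tags in an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (href, text) pairs
        self._href = None   # href of the anchor currently open, if any
        self._text = []     # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace in the link text.
            title = " ".join("".join(self._text).split())
            if title:
                self.links.append((self._href, title))
            self._href = None


def html_to_rss(html, page_url, channel_title):
    """Build a minimal RSS 2.0 document with one item per link on the page."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()

    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = channel_title
    ET.SubElement(channel, "link").text = page_url
    ET.SubElement(channel, "description").text = "Generated from " + page_url
    for href, title in collector.links:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        # Relative links are made absolute against the page URL.
        ET.SubElement(item, "link").text = urljoin(page_url, href)
    return ET.tostring(rss, encoding="unicode")
```

A real scanner would need per-site rules to separate article links from navigation links; this sketch treats every link with an href and some text as an item.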
Against popular opinion we decided NOT to adopt SOAP. Although this led to a fair amount of criticism (after all, how can you do webservices without SOAP…) we decided to stick with another standard, good old HTTP. Little did we know that about 8 years later this method would become known as REST.
EMM should be considered as a ‘processing chain’. Items/articles are being fed into a chain of processes that ‘do their thing’ based on the available text, the results of previous processes or both. To give an example, the process that tags the article with geographic information uses information about entities (people, organisations) in the article to increase the precision of the geographic tagging. Clinton is a common place name, but once we have detected that the article is about a person named Clinton we will not use ‘Clinton’ for the geo tagging. Similarly, Paris Hilton is not the Hilton in Paris.
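The ‘Clinton’ example can be sketched as a two-step chain in Python. This is only an illustration under simplifying assumptions: the tiny name lists `KNOWN_PEOPLE` and `KNOWN_PLACES` and the function names are hypothetical, and matching is done on plain strings rather than with real entity recognition.

```python
# Hypothetical lookup lists; a real system would use far larger resources.
KNOWN_PEOPLE = {"Bill Clinton", "Hillary Clinton", "Paris Hilton"}
KNOWN_PLACES = {"Clinton", "Paris", "Brussels"}


def tag_entities(article):
    """Step 1: record every known person name that occurs in the text."""
    article["people"] = sorted(p for p in KNOWN_PEOPLE if p in article["text"])
    return article


def tag_places(article):
    """Step 2: record place names, using the result of step 1 to skip
    any word that is part of a detected person name."""
    person_words = {word for name in article["people"] for word in name.split()}
    words = {w.strip(".,;:!?\"'") for w in article["text"].split()}
    article["places"] = sorted(
        place for place in KNOWN_PLACES
        if place in words and place not in person_words
    )
    return article


# The chain: each step sees the text plus the results of earlier steps.
CHAIN = [tag_entities, tag_places]


def process(text):
    article = {"text": text}
    for step in CHAIN:
        article = step(article)
    return article
```

With these lists, `process("Floods hit Clinton, Iowa.")` tags the place Clinton, while `process("Hillary Clinton arrived in Brussels.")` tags only Brussels. The order of the steps matters: the geographic step depends on the entity step having run first.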
The system that drives this processing chain detects new material published on the Internet by monitoring a given set of web pages or RSS feeds. It acts more or less as a giant RSS feed reader. We do not crawl the websites; all the information we need is derived from the RSS feeds or, in the case of HTML pages, by converting the HTML page into RSS first. This means that the load on the target webserver is extremely light, as most HTML pages are in the order of 20 KB to 50 KB of text. Once a link to a new article has been detected, the link is followed and information about the article enters the processing chain. The article is never stored in a database; all processing is done ‘on the fly’.
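Detecting new articles in the manner of a feed reader can be sketched in a few lines of Python. This is an assumption-laden illustration, not the EMM code: the function name `new_links` is hypothetical, the feed is assumed to be plain RSS 2.0, and the set of already-seen links is kept in memory.

```python
import xml.etree.ElementTree as ET


def new_links(rss_xml, seen):
    """Return item links from an RSS 2.0 document that are not yet in `seen`.

    `seen` is a set of already-processed links and is updated in place,
    so calling this again with the same feed returns nothing new.
    """
    root = ET.fromstring(rss_xml)
    fresh = []
    for item in root.iter("item"):
        link = (item.findtext("link") or "").strip()
        if link and link not in seen:
            seen.add(link)
            fresh.append(link)
    return fresh
```

Each polling round fetches only the feed (or the HTML page converted to RSS), and only the links returned here would then be followed and passed into the processing chain.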
At the back end of the processing chain we have built a number of systems to deliver relevant information to the users of the system. We provide category-based RSS feeds and e-mail, we have some systems where we pre-cook country-based information, the index is fed with the results of the processing, and so on. And of course we have the website to make these products accessible to you. All data to be displayed is stored as XML files in the file system.
The production system for the website is based on XML transformation using XSL stylesheets and XHTML templates. Like most of the rest of the system, we wrote this in house. We have seen Cocoon come and go but did not adopt it at the time as it was overkill for our needs. The principle however is still valid and powerful. The website regenerates hundreds of HTML pages every ten minutes and has a special dynamic subsystem to deal with other pages. Much of the layout of the site is also described in XML files and this makes the system highly configurable.
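The principle of generating static pages from XML data plus a template can be shown with a small Python sketch. Note the substitution: this uses `string.Template` from the Python standard library as a stand-in for XSL stylesheets, so it illustrates the data-plus-template idea only, not XSLT itself. The templates and the function name `render_page` are hypothetical.

```python
import xml.etree.ElementTree as ET
from html import escape
from string import Template

# Hypothetical page and row templates standing in for an XSL stylesheet.
PAGE = Template(
    "<html><head><title>$title</title></head>"
    "<body><h1>$title</h1><ul>\n$rows</ul></body></html>"
)
ROW = Template('<li><a href="$link">$title</a></li>\n')


def render_page(data_xml):
    """Render an RSS-like XML document into a static HTML page."""
    root = ET.fromstring(data_xml)
    rows = "".join(
        ROW.substitute(
            link=escape(item.findtext("link", ""), quote=True),
            title=escape(item.findtext("title", "")),
        )
        for item in root.iter("item")
    )
    return PAGE.substitute(
        title=escape(root.findtext("channel/title", "")),
        rows=rows,
    )
```

Because the page is a pure function of the XML data and the template, regenerating it on a fixed schedule is cheap, and changing the layout means editing the template rather than the code.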
From the very beginning Tomcat has acted as the webserver. We have not used Apache or IIS for the static pages, as Tomcat simply was fast enough and we do not use anything but Java (and JSP). We have had peak days of 65,000 visitors (a few years ago, admittedly) and several million hits per day running on relatively modest server hardware. Performance has always been adequate.
With hindsight the choices we have made are pretty obvious. Tomcat, XSLT/XML/RSS and Java are all extremely well supported and accepted, and have grown to be the mainstay of many large software projects.
In 2001 things were less clear: there was some pressure to use RDF; parallel processing as a set of webservices was not common (the Grid was still some time in the future); CORBA and similar technologies based on XML (XML-RPC) were alternative and accepted solutions; and SOAP was a sort of minimum standard.