Skip to content
mduering edited this page Mar 5, 2019 · 187 revisions

March 2019: Alpha release


The alpha release of the impresso interface contains the basic features you expect from a newspaper interface: Search, search facets and a viewer which lets you read and explore newspaper articles. We are also rolling out some more advanced features: Named entities, topic models and collections.

Beware however that this is our very first release and that things can still be a bit rough around the edges. Interface features, topic models and named entities have not yet reached their final shape, quality will improve significantly until the next release.

impresso Alpha Release Walkthrough

Here is what you can expect from this and the forthcoming beta and public releases:



Known issues


Have you noticed something wrong or irritating in the interface? Please email us at info@impresso-project.ch, use the Feedback Button on the lower right or use the impresso community Slack channel to let us know.


And here a list of things which are already under investigation:

  • some article appear to have no text associated to them
  • coordinates for blue highlights for articles in a page are sometimes off
  • yellow highlights for searched keywords in a page does not work yet
  • we display raw OCR output without any postprocessing whatsoever; future releases will show improvement
  • the interface is optimized for Firefox and Chrome, Safari may cause issues
  • newspaper metadata is still incomplete
  • date range filtering will be improved
  • keywords from previous searches can not be deleted in Firefox
  • keyword editing causes problems in Chrome
  • "exact", "regular", "fuzzy" search does not work yet
  • the English language articles are an artifact caused by poor OCR



Search


Start your query with the search box by typing a first keyword (for example Paris or for the movie Paris, Texas) and press enter. To add another keyword (France), repeat and press enter. Your keywords will appear below the search bar. Click on them to choose whether you want to include them or exclude them from your query. You can also choose between “exact”, “regular” and “fuzzy” precision in your search. Change the scope of the query with the filters and validate the choice of the filters by clicking the “Search” button below.

You can limit date ranges in your search in two ways:

  • On the time boxes: you can navigate in the calendar or just type in the year you aim. Don’t forget to select a day to validate the chronological limits.
  • On the timeline, the greyed area is the selected area, so change there the selection, go to the extremities of the timeline and when the mouse turns into a double arrow, move the grey area to the needed limits.

You can limit the scope of the query to one newspaper title or to front pages only. Selection of multiple newspapers as filters will be added at a later stage. Click on the title to select it. When selected it will appear below the search box.

To expand again the scope to all newspaper titles, delete the selection by clicking on the "x" in the box.

To start a new search, click on Clear at the bottom of the search panel to erase all filters you have set so far.



Corpus


For this release we focused on adding a first set Swiss newspapers to the corpus. Check the Newspaper Titles segment in the interface to get an overview of what has been included in this release.












Named entities mentions


Named entities are automatically detected names of persons and locations. Future releases will also include organisations. In the search bar you can already select whether you are interested e.g. in Paris as a person or a location. You may notice wrongly assigned entities - e.g. a location miscategorized as a person - quality will improve over time. Also note that for now entities are recognized and categorized (pers/org), but not yet disambiguated, i.e. attached to a unique referent. Consequently, the detected names Bismarck, Otto von Bismarck and Bismark (with an OCR mistake or typo) correspond to three different mentions, not yet linked all together.


To learn more about named entities, take a look at our blog post Named entity processing in a nutshell.







Topic models


Put very simply, topic models allow us to automatically identify articles which have words in common and may therefore be related to each other. Topic models can help you to get an overview of the different contexts in which for example Europe is mentioned in the corpus. You can also use topic models as filters to narrow down search results. In the Topics section you can filter all topics for keywords of interest and use the graph to explore visually how they overlap. Click on a topic to explore the underlying articles.


To learn more about topic modeling, take a look at our blog post About Topic Modeling on historical newspapers.



Collections


Collections let you store and organize articles. You are now able to add articles to your own collections. To add one article to one or more collection, use Add to Collection and here, define the name of the collection by typing it and clicking on Create New. If you have already created one or several collections, you can add the article to the collection by clicking on them. When an article has been added to a collection, the labels of these collections will appear on the article listing. To add several articles: on the side of the article list, notice the boxes that can help you select several articles at once and add them to one or several collections.









April 2019: Beta release


This release will see important additions to the corpus and to the advanced features. Also, it will be a smoother user experience since many of the rough edges from the Alpha Release will be gone.


Corpus


We will add the Luxembourgish newspaper collections to the corpus. Additional metadata will be added to allow better filtering.



Named entities


Named entities will be disambiguated, this means that you will now be able to distinguish between Paris (France) vs. Paris (Texas). We will also add an Entity page which will give you an overview of how frequently it appeared in the corpus and which other entities it was associated with.



OCR quality assessment


We will add an estimate of the quality of the OCR per article which allows to identify search problems related to poor OCR.



Collections and queries


You will be able to save your queries.





June 2019: Public release


For the first public release we add many more advanced features: a visual search tool, text reuse detection, search keyword suggestion and visual collection comparison. In addition, we will add more newspapers from Switzerland to the corpus.



Corpus


We will add more newspapers/periodocals from Switzerland among them Schweizerisches Wirtschaftsarchiv and Bundesblatt.



Search


You will now receive suggestions for your search queries to improve your results. We will have a help page with additional information about the interface in place as well as more recipes: examples for advanced searches that you can easily adapt for your own purposes.



Text reuse


You will be able to search for recurrent text segments throughout the corpus. This will allow you to trace how e.g. a press release has been altered and how it spread over time and across newspapers.



Post-OCR Corrections


Depending on the newspaper font style (Gothic/Antiqua), different techniques can lead to sometimes dramatic improvements in the quality of character recognition. This in turn will improve the quality of our text processing and yield for example better keyword searches, named entity recognition and finally more legible text.



Evaluation and confidence scores for named entity recognition


The models used to extract entities will be evaluated against a ground truth which will cover several newspapers and 20 time periods. This will help us assess the quality of entity recognition. We will attempt to add confidence scores for each extracted entity mention in the interface.



Collections


We will add a tool for the visual comparison of collections which will help you to detect where two collections overlap and differ.



Recommendations


A first recommendation service will be available which will suggest articles with similar content.





October 2019


This release will add tools which allow you to track semantic shifts of words over time by taking into account how their word-neighbors change. Other than that this release will have additional recommendation features and focus on improvements based on the feedback from our testers.



Search


We will add a visual search function based on an index of all images in the corpus. This will let you e.g. trace occurrences of a given photo within the corpus.



Collections and queries


You will be able to tag articles in your collections and use those tags to retrieve subsets of your articles across collections.



Article type distinction


We will plan to automatically identify visually distinctive content types. This will allow you to search - for example - only for mastheads, weather reports, crosswords or stock exchange listings.



After the October release we will begin working on custom visualisation case studies. Here we seek to support scholars with their projects by providing customized data visualisations which match their specific research interests. Do get in touch if you are interested in this.






Clone this wiki locally