German-English Parallel Corpus "de-news", Daily News 1996-2000

2000, 15MB, download v0.91, public domain.

This corpus has been adapted from the de-news web site. Volunteers collected about five to ten news items per day from German radio broadcasts and translated them into English. The translation quality varies, but it is very good overall. We processed the corpus into a format that is more suitable for machine translation research.

The goal of the processing was to generate sentence-aligned text. For this purpose we extracted matching news items and labeled them with corresponding document IDs. Using a preprocessor, we separated out punctuation and identified sentence boundaries. Sentence alignment was done with a tool based on the Church and Gale algorithm.
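The alignment tool itself is not described here beyond being based on that algorithm. As an illustration only, the following is a minimal sketch of length-based sentence alignment in that style: a dynamic program over character lengths that allows 1-1, 1-0, 0-1, 2-1, 1-2 and 2-2 groupings. The prior probabilities and the variance constant are the values commonly cited for the published algorithm and should be treated as assumptions; the function names `match_cost` and `align` are hypothetical and not part of the corpus tools.

```python
import math

# Prior probability of each grouping (source sentences, target sentences).
# Assumed values, as commonly cited for the length-based algorithm.
PRIORS = {(1, 1): 0.89, (1, 0): 0.0099, (0, 1): 0.0099,
          (2, 1): 0.089, (1, 2): 0.089, (2, 2): 0.011}
MEAN_RATIO = 1.0  # expected target characters per source character (assumed)
VARIANCE = 6.8    # variance per character (assumed)


def match_cost(src_len, tgt_len, prior):
    """Negative log probability of grouping spans with these character lengths."""
    if src_len == 0 and tgt_len == 0:
        return -math.log(prior)
    mean = (src_len + tgt_len / MEAN_RATIO) / 2.0
    delta = (tgt_len - src_len * MEAN_RATIO) / math.sqrt(mean * VARIANCE)
    # Two-tailed probability of a deviation at least this large under N(0, 1).
    tail = math.erfc(abs(delta) / math.sqrt(2.0))
    return -math.log(max(tail, 1e-300)) - math.log(prior)


def align(src, tgt):
    """Return a list of (source_sentences, target_sentences) groups."""
    n, m = len(src), len(tgt)
    src_len = [len(s) for s in src]
    tgt_len = [len(t) for t in tgt]
    inf = float("inf")
    # best[i][j]: minimal cost of aligning the first i source
    # sentences with the first j target sentences.
    best = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == inf:
                continue
            for (di, dj), prior in PRIORS.items():
                if i + di > n or j + dj > m:
                    continue
                cost = best[i][j] + match_cost(sum(src_len[i:i + di]),
                                               sum(tgt_len[j:j + dj]), prior)
                if cost < best[i + di][j + dj]:
                    best[i + di][j + dj] = cost
                    back[i + di][j + dj] = (di, dj)
    # Follow the back-pointers from the end to recover the groups.
    groups = []
    i, j = n, m
    while i > 0 or j > 0:
        di, dj = back[i][j]
        groups.append((src[i - di:i], tgt[j - dj:j]))
        i, j = i - di, j - dj
    groups.reverse()
    return groups


german = ["Das ist ein Test.", "Es regnet heute in Berlin sehr stark."]
english = ["This is a test.", "It is raining very hard in Berlin today."]
for de_group, en_group in align(german, english):
    print(de_group, "<->", en_group)
```

Because the cost depends only on sentence lengths, a sketch like this cannot detect news items that were paired incorrectly in the first place, which is one reason the manual check of matching news items still matters.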

We spent only about one hour per month of data manually editing the corpus to check that news items match up. There are certainly still misaligned news items and inadequate translations. Additional editing would be helpful. Please contact us if you would be willing to assist with this.

Size of the corpus

Version 0.9 covers August 1996 to January 2000. Future versions will include recent data and more data from 1995-1996.

1,118 days
9,756 news items
66,317 German / 62,475 English sentences
59,014 aligned sentences
1,017,064 German / 1,175,526 English word tokens
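A few derived figures help put these counts in perspective. The short sketch below only does arithmetic on the numbers listed above; the variable names are illustrative. It gives roughly 8.7 news items per day, which is consistent with the five to ten items per day mentioned earlier.

```python
# Figures copied from the list above.
days = 1118
news_items = 9756
sentences = {"German": 66317, "English": 62475}
tokens = {"German": 1017064, "English": 1175526}

# Average number of news items collected per day.
items_per_day = news_items / days

# Average sentence length in word tokens, per language.
tokens_per_sentence = {lang: tokens[lang] / sentences[lang] for lang in sentences}

print(f"news items per day: {items_per_day:.2f}")
for lang, value in tokens_per_sentence.items():
    print(f"{lang} tokens per sentence: {value:.2f}")
```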

Formats

The data is available in three formats.

raw text (*.txt), sample: German, English.
preprocessed text (*.pre), sample: German, English.
sentence-aligned (*.al), sample: German, English.

All formats contain document (<DOC id>), headline (<H1>), and paragraph (<P>) mark-up, each on a separate line. The data is stored in one file per day.
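A reader for this mark-up might look like the following minimal sketch. It assumes that each marker stands alone on its line, that a document line has the form `<DOC some-id>`, and that the text lines after `<H1>` or `<P>` belong to that headline or paragraph. The function name `parse_day` and the document ID in the sample are hypothetical; check them against the real files before relying on this.

```python
import re

# Assumed form of a document marker line: "<DOC some-id>".
DOC_RE = re.compile(r"^<DOC\s+(\S+)>$")


def parse_day(lines):
    """Parse the lines of one day's file into a list of documents.

    Each document is a dict with "id", "headline" (list of lines)
    and "paragraphs" (list of lists of lines).
    """
    docs = []
    target = None  # list that receives the following text lines
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        match = DOC_RE.match(line)
        if match:
            docs.append({"id": match.group(1), "headline": [], "paragraphs": []})
            target = None
        elif line == "<H1>" and docs:
            target = docs[-1]["headline"]
        elif line == "<P>" and docs:
            docs[-1]["paragraphs"].append([])
            target = docs[-1]["paragraphs"][-1]
        elif target is not None:
            target.append(line)
    return docs


# Made-up sample in the assumed layout, not taken from the corpus.
sample = """<DOC example-1>
<H1>
Example headline
<P>
First sentence .
Second sentence .
<P>
Another paragraph .
""".splitlines()

docs = parse_day(sample)
print(docs[0]["id"], docs[0]["headline"], len(docs[0]["paragraphs"]))
```

To read a real file, pass an open file object to `parse_day`. The file encoding is not stated here, so open the files with whatever encoding they actually use.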

Terms of Use

This corpus may be used free of charge. There are no restrictions, warranties, or what have you. We would appreciate knowing if you use this data; please send email to koehn@csail.mit.edu.
