DCAWD.com

Posted: **Wed Mar 19, 2008 11:17 am**

http://www.ldc.upenn.edu/Catalog/Catalo ... LDC2008T02

Note a couple of excerpts:

Sponsorship

This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.

Blogs are posts to informal web-based journals of varying topical content. GALE Phase 1 Arabic Blog Parallel Text was prepared by the LDC and consists of 102K words (222 files) of Arabic blog text and its English translation from thirty-three sources. This release was used as training data in Phase 1 of the DARPA-funded GALE program.

...

The task of preparing this corpus involved four stages of work: data scouting, data harvesting, formatting, and data selection.
Data scouting involved manually searching the web for suitable blog text. Data scouts were assigned particular topics and genres along with a production target in order to focus their web search. Formal annotation guidelines and a customized annotation toolkit helped data scouts to manage the search process and to track progress.

Of course, if they're looking for and reading/translating Arabic text, one can assume that doing the same for a language that needs no translation wouldn't require any new funding. I think I was just recently saying that I knew of a couple of organizations that could use conversation-prediction/recognition algorithms.

They're watching