They're watching
Posted: Wed Mar 19, 2008 11:17 am
http://www.ldc.upenn.edu/Catalog/Catalo ... LDC2008T02
Note a couple of excerpts:
Sponsorship
This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
Of course, if they're looking for, reading, and translating Arabic text, one can assume that doing the same for a language which does not require translation wouldn't need any new funding to accomplish. I think I was just recently saying I knew of a couple of organizations that could use conversation-prediction/recognition algorithms.
Blogs are posts to informal web-based journals of varying topical content. GALE Phase 1 Arabic Blog Parallel Text was prepared by the LDC and consists of 102K words (222 files) of Arabic blog text and its English translation from thirty-three sources. This release was used as training data in Phase 1 of the DARPA-funded GALE program.
...
The task of preparing this corpus involved four stages of work: data scouting, data harvesting, formatting, and data selection.
Data scouting involved manually searching the web for suitable blog text. Data scouts were assigned particular topics and genres along with a production target in order to focus their web search. Formal annotation guidelines and a customized annotation toolkit helped data scouts to manage the search process and to track progress.