They're watching

avriette
DCAWD Groupie
Posts: 1316
Joined: Sun Oct 01, 2006 3:48 pm
Location: Arlington, VA

Post by avriette »

http://www.ldc.upenn.edu/Catalog/Catalo ... LDC2008T02

Note a couple of excerpts:
Sponsorship

This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.

Blogs are posts to informal web-based journals of varying topical content. GALE Phase 1 Arabic Blog Parallel Text was prepared by the LDC and consists of 102K words (222 files) of Arabic blog text and its English translation from thirty-three sources. This release was used as training data in Phase 1 of the DARPA-funded GALE program.

...

The task of preparing this corpus involved four stages of work: data scouting, data harvesting, formatting, and data selection.
Data scouting involved manually searching the web for suitable blog text. Data scouts were assigned particular topics and genres along with a production target in order to focus their web search. Formal annotation guidelines and a customized annotation toolkit helped data scouts to manage the search process and to track progress.

Of course, if they're looking for, reading, and translating Arabic text, one can assume that doing the same for a language that needs no translation wouldn't require any new funding. I think I was just recently saying that I knew of a couple of organizations that could use conversation-prediction/recognition algorithms.
rocket scientist