Statistical natural language processing and corpus-based computational linguistics: An annotated list of resources
Contents
Tools
Machine Translation systems
Instructions

Freely downloadable






Free, but getting them requires hassle


Part of Speech Taggers
Freely downloadable















Free, but require registration



Usable by email or on the web, but not distributed freely






Not free

[email protected]. There is an online demo.



No longer available

NP chunking
Downloadable



Generic sequence models
Downloadable



Parsers
Information on available probabilistic parsers can be found on the FSNLP: probabilistic parsing links page.
Semantic Parsers
Downloadable



Named Entity Recognition
Downloadable



Coreference (Anaphora) Resolution
Downloadable


Language modeling toolkits
Downloadable


Downloadable, but requires registration

Not yet classified

Friendly concordancing and text analysis tools

Text summarization tools


Other
Downloadable












[email protected]. Some stuff including a multilingual text editor is downloadable.
MULTEXT EAST has parallel versions of Orwell's 1984 available free (upon registration) for a number of Central European languages.




Free, but require registration




Unsure

[email protected]
The PennTools page collects information on a variety of NLP systems, many of which are available externally.
Corpora
Large collections aimed at the NLP community
[email protected]. Provides the largest range of corpora on CD-ROM. Cost ranges from cheap (e.g., ACL-DCI disk) to pricey. CDs can be purchased individually; institutions can become members and receive discounts on CDs. There's an LDC Online service for searches over the web (mainly intended for members, but there are samplers available).
help to [email protected], by ftp to nora.hd.uib.no. Also, manuals for these corpora.
[email protected]. Focuses more on language processing tools and lexicons, but does have some corpora. As of Feb 1996, you can get most of their stuff by anonymous ftp to clr.nmsu.edu. Their catalog is available as a PostScript file.
[email protected]. Most materials are available on the web or by anonymous ftp to ota.ox.ac.uk. Some require negotiations with the providers.
Particular languages
English
English language corpora available from the sites above are not repeated here.
Chinese
English language corpora available from the sites above are not repeated here.
Multilingual
Bosnian
Czech
French
German
Russian
Slovene
Croatian
Spanish and Portuguese
Swedish
Treebanks
Treebank | Language | Size | Availability | Notes |
Penn Treebank | US English | 2 million + words | Available (distributed by LDC) | 1 million WSJ, 1 million speech, surface syntax (1970s TG) |
BLLIP WSJ corpus | US English | 30 million words | Available (distributed by LDC) | WSJ newswire. Automatically parsed, not hand checked. Same structure as Penn Treebank, except for some additional coreference marking |
ICE-GB | UK English | 1 million words (83,394 sentences) | Available; c. 500 pounds | British part of ICE, the International Corpus of English project. Tagged and parsed for function. Half spoken material. |
Bulgarian Treebank | Bulgarian | n/a | POS-tagged texts and dependency analyses are available (some are free on the web, others via a license agreement) | A Bulgarian HPSG treebank, still under construction. |
Penn Chinese Treebank | Chinese | 100,000 words | Available (LDC) | Based on Xinhua news articles. 1980s-style GB syntax. |
The Prague Dependency Treebank 1.0 | Czech | 500,000 words | Free on completion of license agreement (available through LDC). | Analyzed at the levels of parts of speech and syntactic functions (and, in the future, semantic roles) in a dependency framework. Text from newspapers and weekly magazines. |
Danish Dependency Treebank 1.0 | Danish | 100,000 words | Available free under the GPL. | Built on a portion of the Parole corpus. |
Alpino Dependency Treebank | Dutch | 150,000 words | Freely downloadable | Assorted subcorpora. By far the largest is the full cdbl (newspaper) part of the Eindhoven corpus. |
NEGRA Corpus | German | 20,000 sentences | Available free of charge to academics on completion of license agreement. | Saarland University Syntactically Annotated Corpus of German Newspaper Texts. Tagged, and with syntactic structures. |
TIGER corpus | German | 700,000 words | Available free of charge for research purposes on completion of license agreement. | German newspaper text (Frankfurter Rundschau). Semi-automatically parsed. They also have a good treebank search tool, TIGERSearch. |
Icelandic Parsed Historical Corpus (IcePaHC) | Icelandic | 1,000,000 words | Free download (LGPL) | Texts from 1150 through 2008! |
TUT: Turin University Treebank | Italian | 2,400 sentences | Free download. | Morphological analysis and dependency analysis. Penn Treebank translation. Civil law and newspaper texts. |
Floresta Sintá(c)tica | Portuguese | 168,000 words hand-corrected; 1,000,000 words automatically parsed | Hand corrected part is free web download; automatically parsed part available through email contact | Text from CETEMPúblico corpus. Phrase structure and dependency representations. Available in several formats, including Penn Treebank format. |
Talbanken05 | Swedish | 300,000 words | Free download | Resurrects and modernizes an early treebank from the 1970s. |
Treebanks
CSTBank: Cross-document Structure Theory: marking sentence functional relationships across related documents.
Resources for Word Sense Disambiguation




Literature
There are now quite large collections of online literature, available in various languages (though the majority are in English, of course). Below are pointers to some of the main collections:
Entirely or mainly English
Acquisition data
SGML/XML
Dictionaries
Dictionaries of subcategorization frames
The following dictionaries all list surface subcategorization frames (each with a different annotation scheme). They are also all available in electronic form from the publishers (not free).
Not exactly dictionaries, but other popular sources are:
See also COMLEX and CELEX available from the LDC.
Dictionaries of assorted languages on the web
Names
U.S. names, with frequency information, are available from the Census Bureau.
SGML structured dictionaries
Lexical/morphological resources
Courses, Syllabi, and other Educational Resources
"Techie"













"Corpus Linguistics"




Mailing lists
Mailing lists that have information on these topics include:

subscribe corpora
to [email protected]. Or if you want to subscribe with a different email address, send:
subscribe corpora email-address
(Note that you're now speaking to a Majordomo server, not a listserv, so you don't send your name!) Or you can subscribe on the web.

[email protected].
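For anyone who would rather script the Majordomo request described above, here is a minimal sketch in Python. It only builds the message and does not send it. The server address `majordomo@example.org`, the sender addresses, and the helper name `build_subscribe_message` are hypothetical placeholders, not the real list details; the sketch assumes that the server reads its command from the message body.

```python
from email.message import EmailMessage

# Hypothetical placeholder: substitute the real Majordomo server address.
MAJORDOMO_ADDRESS = "majordomo@example.org"


def build_subscribe_message(list_name, sender, subscribe_as=None):
    """Build (but do not send) a Majordomo-style subscribe request.

    The body is "subscribe <list>" or, to subscribe a different address,
    "subscribe <list> <email-address>". No personal name is included.
    """
    command = f"subscribe {list_name}"
    if subscribe_as:
        command += f" {subscribe_as}"
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = MAJORDOMO_ADDRESS
    msg.set_content(command)  # the command goes in the plain-text body
    return msg


if __name__ == "__main__":
    # Illustrative addresses only.
    msg = build_subscribe_message("corpora", "me@example.org")
    print(msg.get_content().strip())
```

Passing `subscribe_as="other@example.org"` produces the second form of the command, with the alternative address appended.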
Other stuff on the Web
General resources





















Information Retrieval





Information Extraction/Wrapper Induction









People's homepages
Home pages with something useful on them.




Societies/Journals


Still under construction...
http://nlp.stanford.edu/links/statnlp.html