fuliang

A list of NLP tools and resources compiled by the Stanford NLP Group

Statistical natural language processing and corpus-based computational linguistics: An annotated list of resources

Tools: Machine Translation, POS Taggers, NP chunking, Sequence models, Parsers, Semantic Parsers/SRL, NER, Coreference, Language models, Concordances, Summarization, Other

Corpora: Large collections, Particular languages, Treebanks, Discourse, WSD, Literature, Acquisition

SGML/XML

Dictionaries

Lexical/morphological resources

Courses, Syllabi, and other Educational Resources

Mailing lists

Other stuff on the Web: General, IR, IE/Wrappers, People, Societies

Tools

Machine Translation systems

Instructions

Building a baseline statistical phrase MT system

Wonderful pages about how to download a bunch of tools and some data and put them together to build a very competent baseline statistical MT system: NAACL 2006 WMT or 2009 WMT.

Freely downloadable

EGYPT system

System from 1999 JHU workshop. Mainly of historical interest.

GIZA++ and mkcls

Franz Och. C++. GPL.

Thot

Phrase-based model building kit

Phramer

An Open-Source Java Statistical Phrase-Based MT Decoder

Moses

A new open-source phrase-based MT decoder with functionality beyond Pharaoh.

Syntax Augmented Machine Translation via Chart Parsing

Andreas Zollmann and Ashish Venugopal

Free, but getting them requires hassle

Pharaoh decoder

Philipp Koehn, ISI.

MTTK

Machine Translation Tool Kit. Deng and Byrne.

Part of Speech Taggers

Freely downloadable

Stanford POS tagger

Loglinear tagger in Java (by Kristina Toutanova)

hunpos

An HMM tagger with models available for English and Hungarian. A reimplementation of TnT (see below) in OCaml. Comes with pre-compiled models. Runs on Linux, Mac OS X, and Windows.

MBT: Memory-based Tagger

Based on TiMBL

TreeTagger

A decision tree based tagger from the University of Stuttgart (Helmut Schmid). It's language independent, but comes complete with parameter files for English, German, Italian, Dutch, French, Old French, Spanish, Bulgarian, and Russian. (Linux, Sparc-Solaris, Windows, and Mac OS X versions. Binary distribution only.) Page has links to sites where you can run it online.

SVMTool

POS Tagger based on SVMs (uses SVMlight). LGPL.

ACOPOST (formerly ICOPOST)

Open source C taggers originally written by Ingo Schröder. They implement maximum entropy, HMM trigram, and transformation-based learning. C source available under the GNU General Public License.

MXPOST: Adwait Ratnaparkhi's Maximum Entropy part of speech tagger

Java POS tagger. A sentence boundary detector (MXTERMINATOR) is also included. Original version was only JDK1.1; later version worked with JDK1.3+. Class files, not source.

fnTBL

A fast and flexible implementation of Transformation-Based Learning in C++. Includes a POS tagger, but also NP chunking and general chunking models.

mu-TBL

An implementation by Torbjörn Lager of a Transformation-based Learner (a la Brill), usable for POS tagging and other things. Web demo also available. Prolog.

YamCha

SVM-based NP-chunker, also usable for POS tagging, NER, etc. C/C++ open source. Won CoNLL 2000 shared task. (Less automatic than a specialized POS tagger for an end user.)

QTAG Part of speech tagger

An HMM-based Java POS tagger from Birmingham U. (Oliver Mason). English and German parameter files. [Java class files, not source.]

The TOSCA/LOB tagger.

Currently available for MS-DOS only. But the decision to make this famous system available is very interesting from an historical perspective, and for software sharing in academia more generally. LOB tag set.

The venerable Brill's Transformation-based learning Tagger

A symbolic tagger, written in C. It's no longer available from a canonical location, but you might find a version from the Wikipedia page or you could try a reimplementation such as fnTBL.

Original Xerox Tagger

A common lisp HMM tagger available by ftp.

Lingua-EN-Tagger

Perl POS tagger by Maciej Ceglowski and Aaron Coburn. Version 0.11. (A bigram HMM tagger.)

Free, but require registration

TATOO

The ISSCO tagger. HMM tagger. Need to register to download.

PoSTech Korean morphological analyzer and tagger

Online registration.

TnT - A Statistical Part-of-Speech Tagger

Trainable for various languages, comes with English and German pre-compiled models. Runs on Solaris and Linux.

Usable by email or on the web, but not distributed freely

Memory-based tagger

From ILK group, Catholic University Brabant (Jakub Zavrel/Walter Daelemans). Does Dutch, English, Spanish, Swedish, Slovene. Other MBL demos are also available.

Birmingham tagger

Accepts only plain ASCII email message contents. The tagset used is similar to the Brown/LOB/Penn set.

CLAWS tagger

The UCREL CLAWS tagger is available for trial use on the web. (It's limited to 300 words though -- this site is more of an advertisement for licensing the real thing -- available as software for Suns or as a paid service.) You can also find info on CLAWS tagsets, though that page doesn't seem to link to the C7 tagset.

The AMALGAM tagger

The AMALGAM Project also has various other useful resources, in particular a web guide to different tag sets in common use. The tagging is actually done by a (retrained) version of the Brill tagger (q.v.).

Xerox XRCE MLTT Part Of Speech Taggers

Tags any of 14 languages (European and Arabic), online on the web.

Portuguese taggers on the web: Projecto Natura and a QTAG adaptation.

Not free

Lingsoft

Lingsoft in Finland has (symbolic) analysis tools for many European languages. More information can be obtained by emailing [email protected]. There is an online demo.

Conexor

Conexor in Finland has demonstrations of EngCG-style taggers and parsers, for English, Swedish, and Spanish.

Xerox

Xerox has morphological analyzers and taggers for many languages. There are demos of some of their tools on the web. More information can be obtained by contacting Daniella Russo.

Infogistics

Infogistics, an Edinburgh spinoff, has a tagger and NP/verb group chunker available commercially, including an evaluation version.

No longer available

LT POS and LT TTT

The Edinburgh Language Technology Group tagger and text tokenizer (and sentence splitter) were binary-only Solaris tools, which no longer seem to be available.

NP chunking

Downloadable

YamCha

SVM-based NP-chunker, also usable for POS tagging, NER, etc. C/C++ open source. Won CoNLL 2000 shared task. (Less automatic than a specialized POS tagger for an end user.)

Mark Greenwood's Noun Phrase Chunker

A Java reimplementation of Ramshaw and Marcus (1995).

fnTBL

A fast and flexible implementation of Transformation-Based Learning in C++. Includes a POS tagger, but also NP chunking and general chunking models.

Generic sequence models

Downloadable

CRF++

Generic CRF-based model in C++. Open source. By the author of YamCha.

Carafe

Generic CRF-based sequence models in OCaml. Open source. By Ben Wellner.

FreeLing

A large suite of language analyzers. Written in C++. Covers text preprocessing, morphology, NER, POS tagging, parsing.

Parsers

Information on available probabilistic parsers can be found on the FSNLP: probabilistic parsing links page.

Semantic Parsers

Downloadable

ASSERT

PropBank semantic roles (and opinions, etc.) by Sameer Pradhan.

Shalmaneser

FrameNet-based by Katrin Erk.

Tree Kernels in SVMlight by Alessandro Moschitti.

A general package, but it has particularly been used for SRL.

Named Entity Recognition

Downloadable

Stanford Named Entity Recognizer

A Java Conditional Random Field sequence model with trained models for Named Entity Recognition. Java. GPL. By Jenny Finkel.

LingPipe

Tools include statistical named-entity recognition, a heuristic sentence boundary detector, and a heuristic within-document coreference resolution engine. Java. GPL. By Bob Carpenter, Breck Baldwin and co.

YamCha

SVM-based NP-chunker, also usable for POS tagging, NER, etc. C/C++ open source. Won CoNLL 2000 shared task. (Less automatic than a specialized POS tagger for an end user.)

Coreference (Anaphora) Resolution

Downloadable

BART

A Beautiful Anaphora Resolution Toolkit. Java. By Yannick Versley and many others. Apache with GPL components.

Guitar

Java. GPL.

Language modeling toolkits

Downloadable

IRSTLM Toolkit

Compatible with SRILM, suitable for very large language models. LGPL. By Marcello Federico, Nicola Bertoldi et al.

CMU-Cambridge Statistical Language Modeling toolkit

Downloadable, but requires registration

The SRI Language Modeling toolkit

by Andreas Stolcke is another good system for building language models, freely available for research purposes.

Not yet classified

Lextools

A package of tools for creating weighted finite-state transducers (WFST) from high-level linguistic descriptions. Lextools binaries are available free for non-commercial use at: http://www.research.att.com/sw/tools/lextools/. Supported platforms are: linux (i686), sgi (mips2) and sun4. Lextools is built on top of, and requires, the AT&T WFST toolkit (version 3.6), available free for non-commercial use from: http://www.research.att.com/sw/tools/fsm/

Friendly concordancing and text analysis tools

Wordsmith Tools (Mike Scott)

The thing to get if you are working in the Windows world.

Text summarization tools

A prototype Java Summarisation applet (System Quirk)

MEAD

A public domain portable multi-document summarization system. (Dragomir Radev and others.)

Other

Downloadable

Tilburg University's TiMBL

Tilburg's Memory Based Learner by Walter Daelemans et al. A general near-neighbour-based machine learning package, but optimized for statistical NLP applications.
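
To make "near-neighbour-based" concrete, here is a minimal Python sketch of memory-based classification: store every training instance, then label a new instance by majority vote among its nearest stored neighbours. It is not TiMBL's code or API. The function names, the unweighted overlap distance, and the toy tagging data are all assumptions made for illustration.

```python
from collections import Counter

def overlap_distance(a, b):
    """Count the feature positions at which two instances differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

def knn_classify(memory, instance, k=1):
    """Label `instance` by majority vote among its k nearest stored examples.

    `memory` is a list of (features, label) pairs; nothing is abstracted
    away at training time, which is the point of memory-based learning.
    """
    nearest = sorted(memory, key=lambda ex: overlap_distance(ex[0], instance))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: (previous word, word, next word) -> POS tag.
memory = [
    (("the", "run", "was"), "NN"),
    (("to", "run", "fast"), "VB"),
    (("a", "walk", "in"), "NN"),
    (("to", "walk", "home"), "VB"),
]

print(knn_classify(memory, ("to", "run", "home"), k=3))
```

A real memory-based learner would typically weight features rather than counting every mismatch equally; this sketch leaves that out to stay short.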

splitta

Statistical sentence boundary detection by Dan Gillick.

Time Expression taggers

TIMEX2 standard taggers (site at Mitre).

NLTK

An open source Python package for NLP application development, with tools for tokenization, POS tagging, and parsing, by Ed Loper and Steven Bird.

Ted Pedersen's code

Ngram Statistics Package: Perl code that implements: Fisher's exact test, the likelihood ratio, Pearson's chi squared test, the Dice Coefficient, and Mutual Information; Duluth Senseval-2 word sense disambiguation systems; Senseval-1 data in Senseval-2 format; various other WSD datasets in Senseval formats, and semantic distances derived via WordNet.
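
For readers who have not met these association measures, the following Python sketch computes two of them for bigrams from raw counts: the Dice coefficient and pointwise mutual information. It is an illustration only, not the Ngram Statistics Package's Perl code. The function name and the simple relative-frequency estimates are assumptions, and the pointwise variant of mutual information is used here for brevity.

```python
import math
from collections import Counter

def bigram_scores(tokens):
    """Return {(w1, w2): (dice, pmi)} for every adjacent word pair."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_tokens = len(tokens)
    n_bigrams = sum(bigrams.values())
    scores = {}
    for (w1, w2), n12 in bigrams.items():
        # Dice: twice the joint count over the sum of the individual counts.
        dice = 2.0 * n12 / (unigrams[w1] + unigrams[w2])
        # PMI: log ratio of the observed bigram probability to the
        # probability expected if the two words were independent.
        p12 = n12 / n_bigrams
        p1 = unigrams[w1] / n_tokens
        p2 = unigrams[w2] / n_tokens
        pmi = math.log2(p12 / (p1 * p2))
        scores[(w1, w2)] = (dice, pmi)
    return scores

tokens = "new york is a new city in new york".split()
dice, pmi = bigram_scores(tokens)[("new", "york")]
print(round(dice, 3), round(pmi, 3))
```

Both scores are unreliable for rare pairs, which is one reason packages of this kind also offer significance tests such as Fisher's exact test and the likelihood ratio.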

ISIP tools

The main aim is a publicly available speech recognition system (alpha release available), but along the way there are also toolkits for discrete HMMs and statistical decision trees, and for various aspects of signal processing.

Mem. A Perl implementation of Generalized and Improved Iterative Scaling

by Hugo WL ter Doest.

Automorphology

A system (for Windows) for automatically learning the morphological forms of words in a corpus by John Goldsmith.

Wordnet

Wordnet is available by ftp, compiled for a variety of machine types. For money, one can also get EuroWordNet for various European languages, an Italian/English/Spanish MultiWordNet and there's now a site for Global Wordnet. (See also Mappings between WordNet versions and Perl WordNet-Similarity module by Ted Pedersen, and WordNet Domains (coarse-grained sense topic classifications).)

Penn XTAG project

A wide-coverage tree-adjoining grammar written in a mixture of C and Common Lisp. Also includes a large coverage morphological analyzer. Now includes more tools such as TCL/Tk tree viewer.

Dan Melamed's Assorted Tools

A collection of various tools including a simulated annealing program, a post-processor for English stemming for the Penn XTAG morphology system, Good-Turing smoothing software, general text processing tools, text statistics tools and bitext geometry tools (mainly written in Perl 5).

MULTEXT

Constructing corpora and tools for processing multilingual corpora. Contact: Jean Veronis [email protected]. Some stuff including a multilingual text editor is downloadable. MULTEXT EAST has parallel versions of Orwell's 1984 available free (upon registration) for a number of Central European languages.

Naive Bayes algorithm

Software from the Rainbow/Libbow software package that implements several algorithms for text categorization, including naive Bayes, TF.IDF, and probabilistic algorithms. Accompanies Tom Mitchell's ML text.
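
As a rough idea of what a naive Bayes text classifier does, here is a small Python sketch of the multinomial variant with add-one smoothing. It is not taken from Rainbow/Libbow. The function names, the smoothing choice, and the four-document training set are assumptions for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (tokens, label). Returns (class_counts, word_counts, vocab)."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def classify_nb(model, tokens):
    """Return the label with the highest log posterior for `tokens`."""
    class_counts, word_counts, vocab = model
    n_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        total = sum(word_counts[label].values())
        score = math.log(class_counts[label] / n_docs)  # log prior
        for tok in tokens:
            if tok not in vocab:
                continue  # skip words never seen in training
            # Add-one smoothed estimate of P(word | class).
            score += math.log((word_counts[label][tok] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training data.
docs = [
    ("the parser builds a tree".split(), "nlp"),
    ("the tagger labels each word".split(), "nlp"),
    ("the team won the match".split(), "sport"),
    ("the striker scored a goal".split(), "sport"),
]
model = train_nb(docs)
print(classify_nb(model, "tagger labels tree".split()))
```

Working in log space avoids numerical underflow when documents are long.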

HDDI

Text Data Mining API from Lehigh University.

Emdros: a text database engine for linguistic analysis and research

Chasen

Japanese morphological analyzer. Descendant of JUMAN.

Free, but require registration

Stuttgart's IMS Corpus Workbench (CWB)

A workbench for full-text retrieval from large corpora (with a query language and corpus indexing). Includes the Corpus Query Processor (CQP) and xkwic. Available free for research groups (currently only as Solaris 1/2 or Linux binaries), on signing a license agreement.

Gate

University of Sheffield's General Architecture for Text Engineering. Primarily an Information Extraction system.

MITRE's Alembic Workbench

A workbench for the development of tagged corpora. Includes a tagger based on Brill's TBL approach.

SNoW

SNoW is a learning program that can be used as a general purpose multi-class classifier and is specifically tailored for learning in the presence of a very large number of features. The learning architecture is a sparse network of linear units over a pre-defined or incrementally acquired feature space (Dan Roth).

Unsure

INTEX

A finite-state transducer analysis system for English, French, and Italian that runs under NextStep. Contact: Max Silberztein [email protected]

The PennTools page collects information on a variety of NLP systems, many of which are available externally.

Corpora

Large collections aimed at the NLP community

LDC (Linguistic Data Consortium) and its catalogue by year.

Email: [email protected]. Provides the largest range of corpora on CD-ROM. Cost ranges from cheap (e.g., ACL-DCI disk) to pricey. CDs can be purchased individually; institutions can become members and receive discounts on CDs. There's an LDC Online service for searches over the web (mainly intended for members, but there are samplers available).

European Language Resources Association and its catalogue.

Distribution agency is ELDA. Rapidly growing collection of materials in European languages.

ICAME (International Computer Archive of Modern English)

Sells various corpora (including Brown and London-Lund). Information on the corpora is available on the web, by sending the message help to [email protected], or by ftp to nora.hd.uib.no. Manuals for these corpora are also available.

Reuters @ NIST

Reuters corpora are now distributed by NIST.

TRACTOR

TELRI Research Archive of Computational Tools and Resources. Corpora, many multilingual, in European community languages. Small fee for joining in order to be able to get corpora (unless you have contributed corpora).

CLR (Consortium for Lexical Research)

Email: [email protected]. Focuses more on language processing tools and lexicons, but does have some corpora. As of Feb 1996, you can get most of their stuff by anonymous ftp to clr.nmsu.edu. Their catalog is available as a postscript file.

OTA (Oxford Text Archive)

Provides mainly literary texts. Has a bright new web site. Email: [email protected]. Most materials are available on the web or by anonymous ftp to ota.ox.ac.uk. Some require negotiations with the providers.

Leipzig Corpora Collection

Sentence collections in MySQL database for 17 mainly European languages.

BNC (British National Corpus)

A 100 million word corpus of British English. You can search it online from their simple web interface or via View, a much better interface by Mark Davies, and there is an index to genres by David Lee. And now, an XML edition.

European Corpus Initiative Multilingual Corpus I (ECI/MCI)

A 98 million word corpus, covering most of the major European languages, as well as Turkish, Japanese, Russian, Chinese, and Malay. Cheap. You need to sign a license agreement, available at the WWW site. Also available from the LDC.

Survey of English Usage

At the Department of English Language and Literature at University College London. Includes the British part of ICE, the International Corpus of English project. Now available tagged, and parsed for function. 83,419 sentences. Includes ICECUP, dedicated retrieval software. Also, Diachronic Corpus of Present-Day Spoken English (800,000 words, tagged and parsed, half from ICE-GB and half from London-Lund).

International Corpus of English (ICE)

Million word collections of English from various world Englishes: ICE-NZ, ICE-HK, ICE-East Africa, etc. Several of them are downloadable from this site.

Corpora held by Lancaster University

This link provides its own annotations.

The European Language Activity Network

Promises a uniform query language for accessing corpora in all EU languages -- but isn't quite there yet.

Talkbank.

Rich video and transcripts.

Particular languages

English

English language corpora available from the sites above are not repeated here.

Corpora by Geoffrey Sampson's team

The SUSANNE corpus and the CHRISTINE corpus (SUSANNE markup of a speech corpus).

Michigan Corpus of Academic Spoken English (MICASE). 1.7 million words from 1997-2001.

Penn-Helsinki Parsed Corpus of Middle English

A syntactically annotated corpus of the Middle English prose samples in the Helsinki Corpus of Historical English, with additions. 1.3 million words. $200.

Corpus of Professional, Spoken American-English (CPSA)

2 million words from faculty and committee meetings and White House press conferences (50K word sample free on internet).

Lancaster Parsed Corpus

Dialogue Diversity Corpus (Bill Mann)

American National Corpus

Chinese

Chinese language corpora available from the sites above are not repeated here.

The Lancaster Corpus of Mandarin Chinese (LCMC)

By Tony McEnery and Richard Xiao. Distinguished by being a balanced corpus, and freely available.

Multilingual

JRC-Acquis

A parallel corpus of EU documents across all member states. 8 million words or more in each of 20 languages.

EMILLE/CIIL

Monolingual written corpus data for 14 South Asian languages (Assamese, Bengali, Gujarati, Hindi, Kannada, Kashmiri, Malayalam, Marathi, Oriya, Punjabi, Sinhala, Tamil, Telugu and Urdu). Orthographically transcribed spoken data and parallel corpus data for five South Asian languages (Bengali, Gujarati, Hindi, Punjabi and Urdu). In addition, the parallel corpus contains the English originals from which the translations stored in the corpus were derived. All data in the corpus is CES and Unicode compliant. The EMILLE corpus totals some 94 million words. Downloadable.

OPUS

An open source parallel corpus, aligned, in many languages, based on free Linux etc. manuals.

World Health Organization Computer Assisted Translation page.

Also includes a good selection of links on Computer Assisted Translation. (See also the copyright page.)

Searchable Canadian Hansard French-English parallel texts (1986-1993)

From the Laboratoire de Recherche Appliquée en Linguistique Informatique, Université de Montréal

European Union web server

Parallel text in all EU languages. (In particular try European legislation.)

TELRI CD-ROMs

Parallel and other text in Central and Eastern European languages.

Bosnian

The Oslo Corpus of Bosnian Texts.

Czech

Parallel Czech-English

Literature translations in Czech and English

Czech National Corpus project: SYN2000

100 million words of contemporary Czech.

French

Association des Bibliophiles Universels

Various French literary works.

American and French Research on the Treasury of the French Language (ARTFL)

150 million word corpus of various genres of French. You have to be a member to use it (but membership is fairly cheap).

German

COSMAS Corpus

Large (over a billion words!) online-searchable German and Austrian corpora. This is the publicly available part of the 1.85 billion word Mannheimer Corpus Collection.

NEGRA Corpus

Saarland University Syntactically Annotated Corpus of German Newspaper Texts. 20,000 sentences, tagged, and with syntactic structures. Available free of charge to academics.

Russian

Russian National Corpus

150 million words, 5 million words POS-tagged, some in dependency treebank.

Library of Russian Internet Libraries

Various literary works.

Slovene

Slovene-English parallel corpus

1 M words, free to download + on-line concordances.

Coming soon: Slovene reference corpus of 100 M words

Croatian

Croatian National Corpus

100 M words

Spanish and Portuguese

TychoBrahe Parsed Corpus of Historical Portuguese

Over a million words of Portuguese from different historical periods, some of it morphologically analyzed/tagged. Free.

Information about Mark Davies' collection of (mainly historical) Spanish and Portuguese.

It's not clear what their availability is.

The CUMBRE corpus. Contact Professor Aquilino Sánchez.

The CRATER Spanish corpus

Morphosyntactically tagged telecommunication manuals. Available by ftp.

Corpus resources for Portuguese

In total about 70 million words, available free, from various sources (newswire, etc.)

Folha de S. Paulo newspaper

4 annual CDROMs with full text.

COMPARA

Portuguese-English parallel corpus. (In general, there are various resources at the Linguateca site.)

Swedish

Spraakdata, Department of Swedish, Göteborg University.

Has various searchable part-of-speech-tagged Swedish corpora (Parole, Bank of Swedish, etc.), and some material in Zimbabwean languages.

Treebanks

Name	Language	Size	Availability	Comments

Penn Treebank	US English	2 million + words	Available (distributed by LDC)	1 million WSJ, 1 million speech, surface syntax (1970s TG)
BLLIP WSJ corpus	US English	30 million words	Available (distributed by LDC)	WSJ newswire. Automatically parsed, not hand checked. Same structure as Penn Treebank, except for some additional coreference marking
ICE-GB	UK English	1 million words (83,394 sentences)	Available; c. 500 pounds	British part of ICE, the International Corpus of English project. Tagged and parsed for function. Half spoken material.
Bulgarian Treebank	Bulgarian	n/a	POS-tagged texts and dependency analyses are available (some are free on the web, others via a license agreement)	An under-construction Bulgarian HPSG treebank.
Penn Chinese Treebank	Chinese	100,000 words	Available (LDC)	Based on Xinhua news articles. 1980s-style GB syntax.
The Prague Dependency Treebank 1.0	Czech	500,000 words	Free on completion of license agreement (available through LDC).	Analyzed at the levels of parts of speech and syntactic functions (and, in the future, semantic roles) in a dependency framework. Text from newspapers and weekly magazines.
Danish Dependency Treebank 1.0	Danish	100,000 words	Available free under the GPL.	Built on a portion of the Parole corpus.
Alpino Dependency Treebank	Dutch	150,000 words	Freely downloadable	Assorted subcorpora. By far the largest is the full cdbl (newspaper) part of the Eindhoven corpus.
NEGRA Corpus	German	20,000 sentences	Available free of charge to academics on completion of license agreement.	Saarland University Syntactically Annotated Corpus of German Newspaper Texts. Tagged, and with syntactic structures.
TIGER corpus	German	700,000 words	Available free of charge for research purposes on completion of license agreement.	German newspaper text (Frankfurter Rundschau). Semi-automatically parsed. They also have a good treebank search tool, TIGERSearch.
Icelandic Parsed Historical Corpus (IcePaHC)	Icelandic	1,000,000 words	Free download (LGPL)	Texts from 1150 through 2008!
TUT: Turin University Treebank	Italian	2,400 sentences	Free download.	Morphological analysis and dependency analysis. Penn Treebank translation. Civil law and newspaper texts.
Floresta Sintá(c)tica	Portuguese	168,000 words hand-corrected; 1,000,000 words automatically parsed	Hand corrected part is free web download; automatically parsed part available through email contact	Text from CETEMPúblico corpus. Phrase structure and dependency representations. Available in several formats, including Penn Treebank format.
Talbanken05	Swedish	300,000 words	Free download	Resurrects and modernizes an early treebank from the 1970s.

Verbmobil Tübingen: under construction treebanked corpus of German, English, and Japanese sentences from Verbmobil (appointment scheduling) data

Syntactic Spanish Database (SDB), University of Santiago de Compostela. 160,000 clauses / 1.5 million words.

CKIP Chinese Treebank (Taiwan). Based on Academia Sinica corpus. (There's also a 100 sentence Chinese treebank at U. Maryland.)

LDC Korean Treebank.

Dublin-Essex Treebank project

Deriving Linguistic Resources from Treebanks.

Discourse

CSTBank: Cross-document Structure Theory: marking sentence functional relationships across related documents.

Resources for Word Sense Disambiguation

The Senseval web site

Has a comprehensive selection of resources for WSD, including a good list of WSD data resources, but not yet the new SEMCOR.

Ted Pedersen's code

Includes various WSD systems.

SenseClusters

Open source package for unsupervised discovery of word senses by clustering together instances of a word (or words) that are used in similar contexts in raw text, supporting a wide range of clustering techniques based on both context vectors and similarity matrices, and including links to SVDPACKC and CLUTO. Ted Pedersen and Amruta Purandare.

Evocation WordNet synset similarity judgments

Judgments on how similar the meanings of synsets are and how common they are in the BNC from Jordan Boyd-Graber.

Literature

There are now quite large collections of online literature, available in various languages (though the majority are in English, of course). Below are pointers to some of the main collections:

Entirely or mainly English

Alex: A Catalogue of Electronic Texts on the Internet

Seems to have one of the largest collections. Searching and browsing facilities through gopher menus. Many languages.

Wiretap Electronic Text Archive

Extensive and good quality. Still in the gopher age, though.

The On-line Books Page

The index here only covers books in English, but there are lots of links to other collections of material in all languages.

Project Gutenberg

The oldest and largest project to get out-of-copyright literature online, freely available. (Or see the mirror, Sailor's Project Gutenberg site.)

The Electronic Text Center of the University of Virginia

Large collection of SGML text, mainly in English, but also in other major languages.

Center for Electronic Texts in the Humanities

Princeton/Rutgers collaboration. They didn't have it together with their web site when I stopped by, but they may soon.

Oxford Electronic Text Library Editions

Available from Oxford University Press, 200 Madison Ave, NY, NY 10016 212-679-7300. The Complete Works of Jane Austen is $95.00, and is reviewed in Computers and the Humanities, 28:4-5 (Aug/Oct, 1994), 317-321.

Coreference annotated texts

From the University of Wolverhampton (R. Mitkov, C. Barbu et al.).

Acquisition data

CHILDES database.

Database of child language transcriptions in English and many other languages. Texts are also available by ftp. Certain usage requirements. Manuals and programs for accessing the data (the CLAN concordancer) are also available online. Now in Unicode XML.

SGML/XML

Robin Cover's SGML/XML Web Page

This is a wonderful compendium of information on SGML and XML, including information on the Text Encoding Initiative (TEI). This document is also a guide to many text collections (ones using SGML).

Information about the Text Encoding Initiative (TEI). (The Pizza Chef acts as a TEI tag set selector.)

Xaira

XML Aware Indexing and Retrieval Application. The successor of SARA.

Microsoft's XML page

W3C XML page.

The Corpus Encoding Standard.

An SGML instance designed for language engineering applications. Also the XML version.

Dictionaries

Dictionaries of subcategorization frames

The following dictionaries all list surface subcategorization frames (each with a different annotation scheme). They are also all available in electronic form from the publishers (not free).

COBUILD

Collins Cobuild English Language Dictionary. London: Collins, 1987. The COBUILD web site lets you search their Bank of English corpus (but you need to pay to get more than a trial).

LDOCE

Longman Dictionary of Contemporary English. Burnt Mill, Essex: Longman, 1978.

OALD

Oxford Advanced Learner's Dictionary of Current English. Oxford: Oxford University Press, Fourth Edition, 1989. The third edition also had information on subcategorization frames, although in a different incompatible format. However, a partial version of the third edition (with this information) is available free online from the Oxford Text Archive.

Not exactly a dictionary, but other popular sources are:

Levin (1993)

Beth Levin. 1993. English Verb Classes and Alternations: A Preliminary Investigation. Chicago. Discusses linguistic distinctions (like unergative/unaccusative verbs, dative shift, etc.) not made by the above dictionaries. The index of verbs is online.

English subcategorization evaluation resources

Gold standard data, from Cambridge University (Anna Korhonen)

See also COMLEX and CELEX available from the LDC.

Dictionaries of assorted languages on the web

The old version of Robert Beard's Web of Online Dictionaries long ago mutated into YourDictionary.com. I'm told the IPO has been delayed. Nevertheless, it's the most comprehensive index of dictionaries available on the web.

Names

U.S. names with frequency information are available from the Census Bureau.

SGML structured dictionaries

Cambridge International Dictionary of English and other products in SGML.

Lexical/morphological resources

English SENSEVAL Resources

Dictionary entries and tagged examples for 35 words.

ARIES Natural Language Tools

Lexicons and morphological analysis for Spanish. There is a free Prolog demonstrator, but the real lexicons and C/C++ access tools cost money.

Courses, Syllabi, and other Educational Resources

"Techie"

Foundations of Statistical Natural Language Processing

Some information about, and sample chapters from, Christopher Manning and Hinrich Schütze's new textbook, published in June 1999 by MIT Press. Read about courses using this book.

Corpus-based Linguistics

Christopher Manning's Fall 1994 CMU course syllabus (a postscript file).

Statistical NLP: Theory and Practice

Christopher Manning's Spring 1996 CMU course materials.

John Lafferty and Roni Rosenfeld's Spring 1997 CMU course Language and Statistics.

Boston University (John D. Burger and Lynette Hirschman)

A good course and web site, by the looks!

Draft of Data-Intensive Linguistics

By Chris Brew and Marc Moens.

Statistical Natural Language Processing course

By Joakim Nivre. ELSNET supported.

Short Course: Statistical Methods in NLP

By Philip Resnik

Linguist's Guide to Statistics by Brigitte Krenn and Christer Samuelsson.

Statistical and Corpora Based Methods for Processing Natural Languages

By Alon Itai, Technion Computer Science Department. (Don't read those old drafts of mine though ... get the real thing!)

CS 241 Statistical Models in Natural-Language Processing

Eugene Charniak, Brown University.

Michael Littman, Duke: 1997, 1998.

"Corpus Linguistics"

A tutorial on concordances and corpora by Cathy Ball

Tony Berber Sardinha's Corpus Linguistics course

Powerpoint slides in an interesting mixture of English and Portuguese (plus the rest of his homepage!)

Concordancing and corpus linguistics

Notes prepared by Phil Benson, Hong Kong University.

Computational Approaches to Collocations

Discussion of all the measures that have been used, and software for calculating them. By Evert and Krenn.

Mailing lists

Mailing lists that have information on these topics include:

Corpora

The main mailing list for info on corpus-based linguistics. Subscribe by sending the message: subscribe corpora to [email protected]. Or if you want to subscribe with a different email address, send: subscribe corpora email-address. (Note that you're now speaking to a Majordomo server, not a listserv, so you don't send your name!) Or you can subscribe on the web.

Empiricist

The empiricist list appears to be defunct now. You used to send a "subscribe" message to [email protected].

Other stuff on the Web

General resources

NIST Human Language Technology programs

Including: TREC, TIDES, ACE, ....

Text summarization

Tons of resources (tutorials, bibliographies, and software) for document summarization, maintained by Dragomir Radev.

PropositionBank @ UPenn

Statistical MT

Bookmarks for Corpus-based Linguists

An extensive annotated collection by David Lee, aimed at linguistics more than NLP (includes web-searchable corpora and concordancing options).

HLTCentral

European site aiming to increase transfer of language technologies to the commercial market. News, etc.

Linguistic annotation

A description of formats for linguistic annotation by Steven Bird.

CTI Textual Studies, University of Oxford, Guide to Digital Resources

Lists text analysis tools, corpora, and other stuff.

U. Essex W3-Corpora

Lots of teaching material, links, and online corpora.

Computational Linguistics and NLP (Kenji Kita, Tokushima U.)

A good, well-organized list of CL references, concentrating on corpus-based and statistical NLP methods. See also Software tools for NLP.

HLT Central

European Human Language Technology site

Survey of the State of the Art in Human Language Technology

ACL SIGLEX list of Lexical Resources

Online materials for a course on Learning Dynamical Systems at Brown University.

Lots of neat info.

Expert Advisory Group for Language Engineering Standards (EAGLES) home page

European standards organization.

Materials prepared for Michael Barlow's Corpus Linguistics course

Corpus Linguistics University of Birmingham

Chris Brew's Teaching Materials for statistical NLP

Not much there last time I looked; you might also try his home page.

Edinburgh LTG HelpDesk's FAQ

Many of the questions in the FAQ concern issues related to corpora and tagging.

Content Analysis Resources

Qualitative Text Analysis, Concordances, etc.

MT paper archive

Lots of papers, etc.

Information Retrieval

The SMART IR system

ACM SIGIR

Managing Gigabytes

TREC conference

Text-based Intelligent Systems (Bruce Croft)

Information Extraction/Wrapper Induction

Introduction to Information Extraction Technology. A tutorial by Douglas E. Appelt and David Israel.

IE data sets

Updated versions (i.e., now well-formed XML) of classic IE data sets: Seminar Announcements and Corporate Acquisitions.

Web -> KB. CMU World Wide Knowledge Base project (Tom Mitchell). Has a lot of the best recent probabilistic model IE work, and links to data sets.

RISE: Repository of Online Information Sources Used in Information Extraction Tasks, including links to people, papers, and many widely used data sets, etc. (Ion Muslea). Appears to not have been updated since 1999.

Message Understanding Conference (MUC) information. A US government-funded information extraction exercise (from the 1990s).

Web IR and IE (Einat Amitay). Various links on IR and IE on the web.

Web question answering system (University of Michigan)

GATE: General Architecture for Text Engineering (Sheffield)

Genia Project. Biomedical text information extraction corpus (Tsujii lab). And IE tutorial slides.

People's homepages

Home pages with something useful on them.

University of Texas at Austin Machine Learning Research Group

Steven Abney (until 1997)

Adam Berger

Various stuff on statistical MT and maximum entropy models

Alex Chengyu Fang

Provides a lot of info on the kinds of things they get up to at UCL, without actually giving you anything to play with yourself.

Societies/Journals

International Quantitative Linguistics Association/Journal of Quantitative Linguistics

Not very hip.

Association for Computational Linguistics/Computational Linguistics

Hipper

Still under construction...

http://nlp.stanford.edu/links/statnlp.html

Christopher Manning -- <[email protected]> -- Last modified: Sat Sep 10 17:36:39 PDT 2011


