Stanford CoreNLP is written in Java and provides a server for interaction. stanfordcorenlp is a Python package that wraps Stanford CoreNLP (see its GitHub repository) and is very convenient to use.
***************Update**************
Stanford has officially released a Python version that can be installed directly. See https://stanfordnlp.github.io/stanfordnlp/ for details.
pip install stanfordnlp
1: Download and install JDK 1.8 or later.
2: Download the Stanford CoreNLP package and unzip it.
3: To process Chinese, also download the Chinese models jar file and place it in the stanford-corenlp-full-2018-02-27 root directory (be sure to download this file; otherwise everything is processed as English by default).
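Step 3 is easy to miss, so a quick sanity check from Python can confirm that the Chinese models jar actually sits in the CoreNLP root before you start the server. This is just a sketch; the path below is an example and should be replaced with wherever you unzipped the package:

```python
import glob
import os

def has_chinese_models(corenlp_root):
    """Return True if a Chinese models jar is present in the CoreNLP root.

    The jar is named like 'stanford-chinese-corenlp-*-models.jar'; without
    it, the pipeline silently falls back to English processing.
    """
    pattern = os.path.join(corenlp_root, '*chinese*models*.jar')
    return bool(glob.glob(pattern))

# Example path; replace with your own unzip location.
print(has_chinese_models(r'G:\JavaLibraries\stanford-corenlp-full-2018-02-27'))
```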
The Stanford CoreNLP website lists Python interfaces for calling Stanford CoreNLP.
These packages use the Stanford CoreNLP server that we’ve developed over the last couple of years. You should probably use one of them.
Stanford has officially released a Python version of its NLP tools, so there is no longer any need to wrestle with Java.
Setup
StanfordNLP supports Python 3.6 or later. We strongly recommend that you install StanfordNLP from PyPI. If you already have pip installed, simply run
pip install stanfordnlp
this should also help resolve all of the dependencies of StanfordNLP, for instance PyTorch 1.0.0 or above.
Alternatively, you can also install from source of this git repository, which will give you more flexibility in developing on top of StanfordNLP and training your own models. For this option, run
git clone [email protected]:stanfordnlp/stanfordnlp.git
cd stanfordnlp
pip install -e .
Running StanfordNLP
Getting Started with the neural pipeline
To run your first StanfordNLP pipeline, simply follow these steps in your Python interactive interpreter:
>>> import stanfordnlp
>>> stanfordnlp.download('en') # This downloads the English models for the neural pipeline
>>> nlp = stanfordnlp.Pipeline() # This sets up a default neural pipeline in English
>>> doc = nlp("Barack Obama was born in Hawaii. He was elected president in 2008.")
>>> doc.sentences[0].print_dependencies()
The last command will print out the words in the first sentence in the input string (or Document, as it is represented in StanfordNLP), as well as the index of the word that governs each of them in the Universal Dependencies parse of that sentence (its “head”), along with the dependency relation between the words. The output should look like:
('Barack', '4', 'nsubj:pass')
('Obama', '1', 'flat')
('was', '4', 'aux:pass')
('born', '0', 'root')
('in', '6', 'case')
('Hawaii', '4', 'obl')
('.', '4', 'punct')
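Each tuple is (word, head index, dependency relation), where indices are 1-based and a head of '0' marks the root. Since these are plain Python tuples, resolving each word's head text is straightforward; a small sketch using the output above:

```python
# Dependency tuples in the shape printed by print_dependencies():
# (word, 1-based head index as a string, relation); head '0' is the root.
deps = [('Barack', '4', 'nsubj:pass'), ('Obama', '1', 'flat'),
        ('was', '4', 'aux:pass'), ('born', '0', 'root'),
        ('in', '6', 'case'), ('Hawaii', '4', 'obl'), ('.', '4', 'punct')]

words = [word for word, _, _ in deps]

def head_of(entry):
    """Return the text of the head word for one dependency tuple."""
    _, head, _ = entry
    return 'ROOT' if head == '0' else words[int(head) - 1]

for word, head, rel in deps:
    print('%s --%s--> %s' % (word, rel, head_of((word, head, rel))))
```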
Note: If you are running into issues like OSError: [Errno 22] Invalid argument, it’s very likely that you are affected by a known Python issue, and we recommend Python 3.6.8+ or 3.7.2+.
We also provide a multilingual demo script that demonstrates how to use StanfordNLP in languages other than English, for example Chinese (traditional):
python demo/pipeline_demo.py -l zh
See our getting started guide for more details.
This tutorial uses the stanfordcorenlp interface as an example (Stanford CoreNLP 3.9.1 is used throughout) to explain how to call Stanford CoreNLP from Python.
Installing stanfordcorenlp is a single command: pip install stanfordcorenlp
Or install via the USTC mirror (much faster from within China): pip install stanfordcorenlp -i http://pypi.mirrors.ustc.edu.cn/simple/ --trusted-host pypi.mirrors.ustc.edu.cn
Basic usage of stanfordcorenlp:
from stanfordcorenlp import StanfordCoreNLP
nlp = StanfordCoreNLP(r'G:\JavaLibraries\stanford-corenlp-full-2018-02-27')
sentence = 'Guangdong University of Foreign Studies is located in Guangzhou.'
print('Tokenize:', nlp.word_tokenize(sentence))
print('Part of Speech:', nlp.pos_tag(sentence))
print('Named Entities:', nlp.ner(sentence))
print('Constituency Parsing:', nlp.parse(sentence))
print('Dependency Parsing:', nlp.dependency_parse(sentence))
nlp.close()  # Do not forget to close! The backend server consumes a lot of memory.
# Tokenize
['Guangdong', 'University', 'of', 'Foreign', 'Studies', 'is', 'located', 'in', 'Guangzhou', '.']
# Part of Speech
[('Guangdong', 'NNP'), ('University', 'NNP'), ('of', 'IN'), ('Foreign', 'NNP'), ('Studies', 'NNPS'), ('is', 'VBZ'), ('located', 'JJ'), ('in', 'IN'), ('Guangzhou', 'NNP'), ('.', '.')]
# Named Entities
[('Guangdong', 'ORGANIZATION'), ('University', 'ORGANIZATION'), ('of', 'ORGANIZATION'), ('Foreign', 'ORGANIZATION'), ('Studies', 'ORGANIZATION'), ('is', 'O'), ('located', 'O'), ('in', 'O'), ('Guangzhou', 'LOCATION'), ('.', 'O')]
# Constituency Parsing
(ROOT
(S
(NP
(NP (NNP Guangdong) (NNP University))
(PP (IN of)
(NP (NNP Foreign) (NNPS Studies))))
(VP (VBZ is)
(ADJP (JJ located)
(PP (IN in)
(NP (NNP Guangzhou)))))
(. .)))
# Dependency Parsing
[('ROOT', 0, 7), ('compound', 2, 1), ('nsubjpass', 7, 2), ('case', 5, 3), ('compound', 5, 4), ('nmod', 2, 5), ('auxpass', 7, 6), ('case', 9, 8), ('nmod', 7, 9), ('punct', 7, 10)]
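The triples returned by dependency_parse() are (relation, head index, dependent index), again 1-based with 0 standing for ROOT; note that the words themselves are not included, so they have to be joined with the token list from word_tokenize(). A hypothetical helper illustrating that join:

```python
def label_dependencies(tokens, triples):
    """Join (relation, head, dependent) index triples with the token list.

    Indices are 1-based; a head of 0 means the dependent is the sentence root.
    """
    labeled = []
    for rel, head, dep in triples:
        head_word = 'ROOT' if head == 0 else tokens[head - 1]
        labeled.append((rel, head_word, tokens[dep - 1]))
    return labeled

tokens = ['Guangdong', 'University', 'of', 'Foreign', 'Studies',
          'is', 'located', 'in', 'Guangzhou', '.']
triples = [('ROOT', 0, 7), ('compound', 2, 1), ('nsubjpass', 7, 2),
           ('case', 5, 3), ('compound', 5, 4), ('nmod', 2, 5),
           ('auxpass', 7, 6), ('case', 9, 8), ('nmod', 7, 9),
           ('punct', 7, 10)]
for rel, head, dep in label_dependencies(tokens, triples):
    print('%s(%s, %s)' % (rel, head, dep))
```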
Note: you must download an additional model file and place it in the …/stanford-corenlp-full-2018-02-27 folder. For example, you should download the stanford-chinese-corenlp-2018-02-27-models.jar file if you want to process Chinese.
# _*_coding:utf-8_*_
# Other human languages support, e.g. Chinese
sentence = '清华大学位于北京。'
with StanfordCoreNLP(r'G:\JavaLibraries\stanford-corenlp-full-2018-02-27', lang='zh') as nlp:
print(nlp.word_tokenize(sentence))
print(nlp.pos_tag(sentence))
print(nlp.ner(sentence))
print(nlp.parse(sentence))
print(nlp.dependency_parse(sentence))
Since this will load all the models, which requires more memory, initialize the server with more memory; 8 GB is recommended.
# General JSON output
nlp = StanfordCoreNLP(r'path_to_corenlp', memory='8g')
print(nlp.annotate(sentence))
nlp.close()
You can specify properties:
annotators: tokenize, ssplit, pos, lemma, ner, parse, depparse, dcoref (See Detail)
pipelineLanguage: en, zh, ar, fr, de, es (English, Chinese, Arabic, French, German, Spanish) (See Annotator Support Detail)
outputFormat: json, xml, text
text = 'Guangdong University of Foreign Studies is located in Guangzhou. ' \
'GDUFS is active in a full range of international cooperation and exchanges in education. '
props = {'annotators': 'tokenize,ssplit,pos', 'pipelineLanguage': 'en', 'outputFormat': 'xml'}
print(nlp.annotate(text, properties=props))
nlp.close()
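When outputFormat is set to json instead, annotate() returns the document as a JSON string, which pairs nicely with the standard json module. The snippet below sketches the idea against a hand-trimmed sample of the structure CoreNLP returns for the annotators tokenize,ssplit,pos (the real response carries many more fields per token):

```python
import json

# A trimmed sample of the JSON structure CoreNLP returns with
# annotators 'tokenize,ssplit,pos'; in practice this string would
# come from nlp.annotate(text, properties={'outputFormat': 'json', ...}).
sample = '''{
  "sentences": [
    {"index": 0,
     "tokens": [
       {"index": 1, "word": "GDUFS", "pos": "NNP"},
       {"index": 2, "word": "is", "pos": "VBZ"},
       {"index": 3, "word": "active", "pos": "JJ"}
     ]}
  ]
}'''

doc = json.loads(sample)
for sent in doc['sentences']:
    pairs = [(tok['word'], tok['pos']) for tok in sent['tokens']]
    print(pairs)
```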
Start a CoreNLP server with the command:
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000
And then:
# Use an existing server
nlp = StanfordCoreNLP('http://localhost', port=9000)
import logging
from stanfordcorenlp import StanfordCoreNLP
# Debug the wrapper
nlp = StanfordCoreNLP(r'path_or_host', logging_level=logging.DEBUG)
# Check more info from the CoreNLP Server
nlp = StanfordCoreNLP(r'path_or_host', quiet=False, logging_level=logging.DEBUG)
nlp.close()
We use setuptools to package our project. You can build from the latest source code with the following command: