2019-11-12 Learning to use arxiv-sanity-preserver

Reference material: https://github.com/karpathy/arxiv-sanity-preserver#arxiv-sanity-preserver
This is a paper search engine.
First, an introduction:

arxiv sanity preserver

This project is a web interface that attempts to tame the overwhelming flood of papers on Arxiv. It allows researchers to keep track of recent papers, search for papers, sort papers by similarity to any paper, see recent popular papers, to add papers to a personal library, and to get personalized recommendations of (new or old) Arxiv papers. This code is currently running live at www.arxiv-sanity.com/, where it's serving 25,000+ Arxiv papers from Machine Learning (cs.[CV|AI|CL|LG|NE]/stat.ML) over the last ~3 years. With this code base you could replicate the website to any of your favorite subsets of Arxiv by simply changing the categories in fetch_papers.py.

The gist of the introduction is that this search engine is quite smart: if you want to follow the latest progress in a particular field, you just change the categories you care about in fetch_papers.py. It is a nice piece of machine-learning engineering, and so on.
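For example, the arXiv API expects category filters of the form cat:<category> joined with +OR+. The sketch below assumes fetch_papers.py exposes the query as a --search-query argument (check the argparse block at the top of the script); the category string itself is standard arXiv API syntax:

$ python fetch_papers.py --search-query "cat:cs.CV+OR+cat:stat.ML"   # assumed flag; verify in the script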

Registration takes only a few seconds, about as fast as you can type, and once you are in, the interface looks like this:


(screenshot of the arxiv-sanity.com interface)

Code layout
The code has two major parts:

Indexing code. Uses the Arxiv API to download the most recent papers in whatever categories you like, then downloads all the papers, extracts all the text, and creates tfidf vectors based on the content of each paper. This part of the code therefore deals with backend scraping and computation: building up a database of arxiv papers, computing content vectors, creating thumbnails, computing SVMs for people, and so on.

User interface. Then there is a web server (based on Flask/Tornado/sqlite) that allows searching the database, filtering papers by similarity, and so on.

Dependencies
Several: You will need numpy, feedparser (to process xml files), scikit learn (for tfidf vectorizer, training of SVM), flask (for serving the results), flask_limiter, and tornado (if you want to run the flask server in production). Also dateutil, and scipy. And sqlite3 for database (accounts, library support, etc.). Most of these are easy to get through pip, e.g.:

$ virtualenv env                # optional: use virtualenv
$ source env/bin/activate       # optional: use virtualenv
$ pip install -r requirements.txt

You may also need ImageMagick and pdftotext, which on Ubuntu can be installed with sudo apt-get install imagemagick poppler-utils. That is quite a few dependencies.

The processing pipeline is as follows, and it's best to run the steps in order:

  1. Run fetch_papers.py to query arxiv API and create a file db.p that contains all information for each paper. This script is where you would modify the query, indicating which parts of arxiv you'd like to use. Note that if you're trying to pull too many papers arxiv will start to rate limit you. You may have to run the script multiple times, and I recommend using the arg --start-index to restart where you left off when you were last interrupted by arxiv.
  2. Run download_pdfs.py, which iterates over all papers in parsed pickle and downloads the papers into folder pdf
  3. Run parse_pdf_to_text.py to export all text from pdfs to files in txt
  4. Run thumb_pdf.py to export thumbnails of all pdfs to thumb
  5. Run analyze.py to compute tfidf vectors for all documents based on bigrams. Saves tfidf.p, tfidf_meta.p and sim_dict.p pickle files.
  6. Run buildsvm.py to train SVMs for all users (if any), exports a pickle user_sim.p
  7. Run make_cache.py for various preprocessing so that the server starts faster (and make sure to run sqlite3 as.db < schema.sql if this is the very first time you're starting arxiv-sanity, which initializes an empty database; see the sketch after this list).
  8. Start the mongodb daemon in the background. Mongodb can be installed by following the instructions here - https://docs.mongodb.com/tutorials/install-mongodb-on-ubuntu/.
  • Start the mongodb server with - sudo service mongod start.
  • Verify if the server is running in the background : The last line of /var/log/mongodb/mongod.log file must be - [initandlisten] waiting for connections on port
  9. Run the flask server with serve.py. Visit localhost:5000 and enjoy sane viewing of papers!
    Optional: you can also run twitter_daemon.py in a screen session. Using your Twitter API credentials (stored in twitter.txt), it periodically searches Twitter for mentions of the papers in the database and writes the results to the pickle file twitter.p.
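A minimal sketch of the one-time setup mentioned in steps 7 and 8, assuming a stock Ubuntu mongodb install:

$ sqlite3 as.db < schema.sql              # first run only: initialize the empty sqlite database
$ sudo service mongod start               # start the mongodb daemon in the background
$ tail -n 1 /var/log/mongodb/mongod.log   # should read "... waiting for connections on port ..."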

The author mentions he has a simple shell script that runs these commands one after another; he runs it every day to fetch new papers, merge them into the database, and recompute all the tfidf vectors/classifiers. See below for more details on this process.
protip: numpy/BLAS: the script analyze.py makes numpy do a lot of heavy lifting. The author recommends carefully setting up your numpy to use BLAS (e.g. OpenBLAS), otherwise the computations will take a long time. With 25,000 papers and 5,000 users the script runs for a few hours on his machine, with numpy linked against BLAS.
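A quick way to check which BLAS backend your numpy is linked against (numpy.show_config() is part of numpy itself):

$ python -c "import numpy; numpy.show_config()"   # look for openblas/mkl entries in the output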

Running online

If you'd like to run the flask server online (e.g. AWS) run it as python serve.py --prod.
You also want to create a secret_key.txt file and fill it with random text (see top of serve.py).
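Any random string works for secret_key.txt; one possible way to generate it:

$ head -c 32 /dev/urandom | base64 > secret_key.txt   # 32 random bytes, base64-encoded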

Current workflow

The author says this operation is not fully automated yet. So how does he keep the site alive? He runs a script that performs the following update once the new arxiv papers come out (~midnight PST):

python fetch_papers.py
python download_pdfs.py
python parse_pdf_to_text.py
python thumb_pdf.py
python analyze.py
python buildsvm.py
python make_cache.py

The author runs this inside a screen session, so he first sets it up with screen -S serve (or screen -r serve to reattach to it) and then runs:

python serve.py --prod --port 80
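Put together, the screen workflow described above looks roughly like this:

$ screen -S serve                     # start a named screen session
$ python serve.py --prod --port 80    # run the server inside that session
# detach with Ctrl-a d; reattach later with: screen -r serve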

The server will load the new files and start hosting the site. Note that on some systems you cannot use port 80 without sudo. Two options are to use iptables to redirect the port, or to use setcap to give the python interpreter that runs serve.py permission to bind it. In that case the author recommends being careful with the permissions; perhaps try a virtual machine? (I don't fully understand this setting; it is probably about guarding against things like data leaks.) And so on.
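The two options mentioned above could look roughly like this (a sketch only; the redirect target 5000 assumes the default Flask port, and the python path should be the interpreter that actually runs serve.py):

# Option 1: redirect port 80 to the Flask port with iptables
$ sudo iptables -t nat -A PREROUTING -p tcp --dport 80 -j REDIRECT --to-port 5000
# Option 2: allow the python binary to bind privileged ports with setcap
$ sudo setcap 'cap_net_bind_service=+ep' $(readlink -f $(which python))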

Since I haven't studied Python systematically yet, I don't dare to experiment with this freely for now.
ImageMagick
One of the dependency tools mentioned here is something of a cheat-code for images (like Meitu plus CamScanner in one?): http://www.imagemagick.org/script/index.php
It is open-source, free software; the current version is ImageMagick 7.0.9-2. It is compatible with Linux, Windows, Mac OS X, iOS, Android OS, and others.
You can refer to the ImageMagick usage examples to accomplish tasks with ImageMagick from the command line. Also see Fred's ImageMagick Scripts, which include a large number of command-line scripts for geometric transformations, blurring, sharpening, edge detection, noise reduction, and color manipulation. You can also use Magick.NET to use ImageMagick without installing a client.

Download and installation: http://www.imagemagick.org/script/download.php
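For the thumbnail use case in this project, a typical ImageMagick call looks something like this (filenames are made up for illustration; thumb_pdf.py does the real work):

$ convert -density 100 "paper.pdf[0]" -resize 300x thumb.png   # render page 1 of the PDF as a PNG thumbnail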

The other is pdftotext, a tool that reads PDFs and converts them to plain text.
It is built on top of the open-source XpdfReader code: http://www.xpdfreader.com/
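Its basic usage is a single command (filenames here are illustrative):

$ pdftotext paper.pdf paper.txt   # extract the text of paper.pdf into paper.txt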
