How I built a web crawler from scratch to automate my job search

The story of how it began

It was midnight on a Friday, my friends were out having a good time, and yet I was nailed to my computer screen typing away.

Oddly, I didn’t feel left out.

I was working on something that I thought was genuinely interesting and awesome.

I was right out of college, and I needed a job. When I left for Seattle, I had a backpack full of college textbooks and some clothes. I could fit everything I owned in the trunk of my 2002 Honda Civic.

I didn’t like to socialize much back then, so I decided to tackle this job-finding problem the best way I knew how. I tried to build an app to do it for me, and this article is about how I did it.

Getting started with Craigslist

I was in my room, furiously building some software that would help me collect, and respond to, people who were looking for software engineers on Craigslist. Craigslist is essentially the marketplace of the Internet, where you can go and find things for sale, services, community posts, and so on.

At that point in time, I had never built a fully fledged application. Most of the things I worked on in college were academic projects that involved building and parsing binary trees, computer graphics, and simple language processing models.

I was quite the “newb.”

That said, I had always heard about this new “hot” programming language called Python. I didn’t know much Python, but I wanted to get my hands dirty and learn more about it.

So I put two and two together, and decided to build a small application using this new programming language.

The journey to build a (working) prototype

I had a used BenQ laptop my brother had given me when I left for college that I used for development.

It wasn’t the best development environment by any measure. I was using Python 2.4 and an older version of Sublime Text, yet the process of writing an application from scratch was truly an exhilarating experience.

I didn’t know what I needed to do yet. I was trying various things out to see what stuck, and my first approach was to find out how I could access Craigslist data easily.

I looked up Craigslist to find out if they had a publicly available REST API. To my dismay, they didn’t.

However, I found the next best thing.

Craigslist had an RSS feed that was publicly available for personal use. An RSS feed is essentially a computer-readable summary of updates that a website sends out. In this case, the RSS feed would allow me to pick up new job listings whenever they were posted. This was perfect for my needs.

Next, I needed a way to read these RSS feeds. I didn’t want to go through the RSS feeds manually myself, because that would be a time-sink and that would be no different than browsing Craigslist.

Around this time, I started to realize the power of Google. There’s a running joke that software engineers spend most of their time Googling for answers. I think there’s definitely some truth to that.

After a little bit of Googling, I found this useful post on StackOverflow that described how to search through a Craigslist RSS feed. It was sort of a filtering functionality that Craigslist provided for free. All I had to do was pass in a specific query parameter with the keyword I was interested in.

I was focused on searching for software-related jobs in Seattle. With that, I typed up this specific URL to look for listings in Seattle that contained the keyword “software”.

https://seattle.craigslist.org/search/sss?format=rss&query=software

And voilà! It worked beautifully.

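For anyone who wants to reproduce this step, here is a minimal sketch of reading that filtered feed from Python. It is written for a modern Python 3 setup (not the Python 2.4 environment mentioned earlier) and assumes the third-party feedparser package, which may not be what the original script used:

```python
# A minimal sketch: read a filtered Craigslist RSS feed and print each listing.
# Assumes the third-party "feedparser" package (pip install feedparser); the
# original script may have parsed the XML differently.
import feedparser

FEED_URL = "https://seattle.craigslist.org/search/sss?format=rss&query=software"

def fetch_listings(url=FEED_URL):
    """Return a list of (title, link) tuples for every entry in the feed."""
    feed = feedparser.parse(url)
    return [(entry.title, entry.link) for entry in feed.entries]

if __name__ == "__main__":
    for title, link in fetch_listings():
        print(title, "->", link)
```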

The most beautiful soup I’ve ever tasted

I wasn’t convinced, however, that my approach would work.

First, the number of listings was limited. My data didn’t contain all the available job postings in Seattle. The returned results were merely a subset of the whole. I was looking to cast as wide a net as possible, so I needed to know all the available job listings.

Second, I realized that the RSS feed didn’t include any contact information. That was a bummer. I could find the listings, but I couldn’t contact the posters unless I manually filtered through these listings.

I’m a person of many skills and interests, but doing repetitive manual work isn’t one of them. I could’ve hired someone to do it for me, but I was barely scraping by with 1-dollar ramen cup noodles. I couldn’t splurge on this side project.

That was a dead-end. But it wasn’t the end.

Continuous iteration

From my first failed attempt, I learned that Craigslist had an RSS feed that I could filter on, and each posting had a link to the actual posting itself.

Well, if I could access the actual posting, then maybe I could scrape the email address off of it? That meant I needed to find a way to grab email addresses from the original postings.

Once again, I pulled up my trusted Google, and searched for “ways to parse a website.”

With a little Googling, I found a cool little Python tool called Beautiful Soup. It’s essentially a nifty tool that allows you to parse an entire DOM Tree and helps you make sense of how a web page is structured.

My needs were simple: I needed a tool that was easy to use and would let me collect data from a webpage. BeautifulSoup checked off both boxes, and rather than spending more time picking out the best tool, I picked a tool that worked and moved on. Here’s a list of alternatives that do something similar.

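As a rough illustration of what “parsing a DOM tree” looks like in practice, here is a tiny sketch (not the code I wrote back then) that downloads a page and lists every link on it, assuming the requests and beautifulsoup4 packages:

```python
# A small sketch of basic BeautifulSoup usage: fetch a page and inspect its DOM.
# Assumes the "requests" and "beautifulsoup4" packages are installed.
import requests
from bs4 import BeautifulSoup

def list_links(url):
    """Print the text and href of every anchor tag on the page."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    for anchor in soup.find_all("a", href=True):
        print(anchor.get_text(strip=True), anchor["href"])

# Example (any listing URL pulled from the RSS feed would work here):
# list_links("https://seattle.craigslist.org/...")
```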

Side note: I found this awesome tutorial that talks about how to scrape websites using Python and BeautifulSoup. If you’re interested in learning how to scrape, then I recommend reading it.

With this new tool, my workflow was all set.

I was now ready to tackle the next task: scraping email addresses from the actual postings.

Now, here’s the cool thing about open-source technologies. They’re free and work great! It’s like getting free ice-cream on a hot summer day, and a freshly baked chocolate-chip cookie to go.

BeautifulSoup lets you search for specific HTML tags, or markers, on a web page. And Craigslist has structured their listings in such a way that it was a breeze to find email addresses. The tag was something along the lines of “email-reply-link,” which basically points out that an email link is available.

From then on, everything was easy. I relied on the built-in functionality BeautifulSoup provided, and with just some simple manipulation, I was able to pick out email addresses from Craigslist posts quite easily.

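Here is a hedged sketch of that step. The class name below is an assumption based on the “email-reply-link” marker mentioned above rather than a guaranteed selector, so the sketch also falls back to a simple regular-expression scan of the page text:

```python
# Sketch: pull an email address out of a single Craigslist posting.
# The "email-reply-link" class is an assumption based on the marker described
# in the text; Craigslist's markup has changed over time.
import re
import requests
from bs4 import BeautifulSoup

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def extract_email(posting_url):
    """Return the first email-like string found in the posting, or None."""
    html = requests.get(posting_url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")

    # Look for the reply link first (hypothetical class name), then fall back
    # to scanning the whole page text for anything that looks like an email.
    reply = soup.find("a", class_="email-reply-link")
    text = reply.get_text() if reply else soup.get_text()
    match = EMAIL_RE.search(text)
    return match.group(0) if match else None
```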

Putting things together

Within an hour or so, I had my first MVP. I had built a web scraper that could collect email addresses and respond to people looking for software engineers within a 100-mile radius of Seattle.

I added various add-ons on top of the original script to make life much easier. For example, I saved the results into a CSV and HTML page so that I could parse them quickly.

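For example, the CSV step could look something like the sketch below; the column names and file name are illustrative, not the exact ones the original script used:

```python
# Sketch: dump collected listings to a CSV file so they can be skimmed quickly.
# The column names here are illustrative, not the originals.
import csv

def save_to_csv(listings, path="listings.csv"):
    """listings: iterable of dicts with 'title', 'link', and 'email' keys."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["title", "link", "email"])
        writer.writeheader()
        for row in listings:
            writer.writerow(row)

save_to_csv([{"title": "Software engineer wanted",
              "link": "https://seattle.craigslist.org/...",
              "email": "poster@example.com"}])
```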

Of course, there were many other notable features lacking, such as:

  • the ability to log the email addresses I sent
  • fatigue rules to prevent over-sending emails to people I’d already reached out to (see the sketch after this list)
  • special cases, such as some emails requiring a Captcha before they’re displayed to deter automated bots (which I was)
  • Craigslist didn’t allow scrapers on their platform, so I would get banned if I ran the script too often. (I tried to switch between various VPNs to try to “trick” Craigslist, but that didn’t work), and
  • I still couldn’t retrieve all postings on Craigslist
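
The first two items would not have required much code. Here is a hedged sketch of one way to add them, assuming a plain text file as the “already contacted” log (the file name is arbitrary):

```python
# Sketch of a simple "fatigue rule": keep a log of addresses already emailed
# and skip them on later runs. The log file name is an arbitrary choice.
import os

LOG_PATH = "contacted.txt"

def load_contacted(path=LOG_PATH):
    """Read the set of addresses we have already emailed."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def mark_contacted(email, path=LOG_PATH):
    """Append an address to the log once an email has gone out."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(email + "\n")

def should_email(email, contacted):
    """Only email addresses we have never reached out to before."""
    return email is not None and email not in contacted
```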

The last one was a kicker. But I figured if a posting had been sitting for a while, then maybe the person who posted it was not even looking anymore. It was a trade-off I was OK with.

The whole experience felt like a game of Tetris. I knew what my end goal was, and my real challenge was fitting the right pieces together to achieve that specific end goal. Each piece of the puzzle brought me on a different journey. It was challenging, but enjoyable nonetheless and I learned something new each step of the way.

Lessons learned

It was an eye-opening experience, and I ended up learning a little bit more about how the Internet (and Craigslist) works, how various different tools can work together to solve a problem, plus I got a cool little story I can share with friends.

In a way, that’s a lot like how technologies work these days. You find a big, hairy problem that you need to solve, and you don’t see any immediate, obvious solution to it. You break down the big hairy problem into multiple different manageable chunks, and then you solve them one chunk at a time.

Looking back, my problem was this: how can I use this awesome directory on the Internet to reach people with specific interests quickly? There were no known products or solutions available to me at the time, so I broke it down into multiple pieces:

  1. Find all listings on the platform
  2. Collect contact information about each listing
  3. Send an email to them if the contact information exists (see the sketch after this list)
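
Stripped of every detail, the finished script was roughly the pipeline sketched below. It reuses the helper functions sketched earlier in this post (fetch_listings, extract_email, load_contacted, should_email, mark_contacted), and the SMTP host, credentials, and message text are placeholders rather than the real ones:

```python
# Sketch of the overall pipeline: find listings, pull contact info, send email.
# SMTP host, credentials, and message body are placeholders, not the originals.
import smtplib
from email.message import EmailMessage

def send_application(to_addr, from_addr, password, smtp_host="smtp.gmail.com"):
    """Send a short application email to one posting's contact address."""
    msg = EmailMessage()
    msg["Subject"] = "Regarding your software engineer posting"
    msg["From"] = from_addr
    msg["To"] = to_addr
    msg.set_content("Hi, I saw your posting and would love to chat. ...")

    with smtplib.SMTP_SSL(smtp_host, 465) as server:
        server.login(from_addr, password)
        server.send_message(msg)

def run_pipeline():
    contacted = load_contacted()             # fatigue log from the earlier sketch
    for title, link in fetch_listings():     # step 1: find listings
        email = extract_email(link)          # step 2: collect contact info
        if should_email(email, contacted):   # step 3: send and log
            send_application(email, "me@example.com", "app-password")
            mark_contacted(email)
```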

That’s all there was to it. Technology merely acted as a means to the end. If I could’ve used an Excel spreadsheet to do it for me, I would’ve opted for that instead. However, I’m no Excel guru, and so I went with the approach that made most sense to me at the time.

Areas of Improvement

There were many areas in which I could improve:

  • I picked a language I wasn’t very familiar with to start, and there was a learning curve in the beginning. It wasn’t too awful, because Python is very easy to pick up. I highly recommend that any beginning software enthusiast use it as a first language.
  • Relying too heavily on open-source technologies. Open-source software has its own set of problems, too. There were multiple libraries I used that were no longer in active development, so I ran into issues early on. I could not import a library, or the library would fail for seemingly innocuous reasons.
  • Tackling a project by yourself can be fun, but can also cause a lot of stress. You’d need a lot of momentum to ship something. This project was quick and easy, but it did take me a few weekends to add in the improvements. As the project went on, I started to lose motivation and momentum. After I found a job, I completely ditched the project.

Resources and Tools I used

The Hitchhiker’s Guide to Python — Great book for learning Python in general. I recommend Python as a beginner’s first programming language, and I talk about how I used it to land offers from multiple top-tier companies in my article here.

DailyCodingProblem: It’s a service that sends out daily coding problems to your email, and has some of the most recent programming problems from top-tier tech companies. Use my coupon code, zhiachong, to get $10 off!

BeautifulSoup — The nifty utility tool I used to build my web crawler

Web Scraping with Python — A useful guide to learning how web scraping with Python works.

Lean Startup - I learned about rapid prototyping and creating an MVP to test an idea from this book. I think the ideas in here are applicable across many different fields and also helped drive me to complete the project.

Evernote — I used Evernote to compile my thoughts together for this post. Highly recommend it — I use this for basically _everything_ I do.

My laptop - This is my current at-home laptop, set up as a workstation. It’s much, much easier to work with than an old BenQ laptop, but both would work for just general programming work.

Credits:

Brandon O’brien, my mentor and good friend, for proof-reading and providing valuable feedback on how to improve this article.

Leon Tager, my coworker and friend who proofreads and showers me with much-needed financial wisdom.

You can sign up here for industry news, random tidbits, and to be the first to know when I publish new articles.

Zhia Chong is a software engineer at Twitter. He works on the Ads Measurement team in Seattle, measuring ads impact and ROI for advertisers. The team is hiring!

You can find him on Twitter and LinkedIn.

Translated from: https://www.freecodecamp.org/news/how-i-built-a-web-crawler-to-automate-my-job-search-f825fb5af718/
