Original article. When reposting, please credit the source: 服务器非业余研究 http://blog.csdn.net/erlib, by Sunface
Restricting input is the simplest way to manage message queue growth in Erlang systems.
It’s the simplest approach because it basically means you’re slowing the user down (applying back-pressure), which instantly fixes the problem without any further optimization required.
On the other hand, it can lead to a really crappy experience for the user.
The most common way to restrict data input is to make calls synchronously to a process whose queue would otherwise grow in uncontrollable ways. By requiring a response before moving on to the next request, you will generally ensure that the direct source of the problem is slowed down.
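As a rough illustration (the module name and timeout are assumptions, not from the original text), a minimal sketch of this pattern is a gen_server whose public API uses a synchronous call rather than an asynchronous cast, so a slow worker naturally slows its callers down instead of letting its mailbox grow without bound:

```erlang
%% Minimal sketch: back-pressure through a synchronous call.
-module(sync_worker).
-behaviour(gen_server).
-export([start_link/0, do_work/1]).
-export([init/1, handle_call/3, handle_cast/2]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

%% Synchronous API: the caller blocks (up to 5 seconds here) until the
%% worker has actually processed the request, applying back-pressure.
do_work(Request) ->
    gen_server:call(?MODULE, {work, Request}, 5000).

init([]) -> {ok, #{}}.

handle_call({work, Request}, _From, State) ->
    Result = handle_request(Request),
    {reply, Result, State}.

handle_cast(_Msg, State) -> {noreply, State}.

%% Placeholder for the real (slow) operation, e.g. a database or disk write.
handle_request(Request) -> {ok, Request}.
```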
The difficult part for this approach is that the bottleneck causing the queue to grow is usually not at the edge of the system, but deep inside it, which you find after optimizing nearly everything that came before. Such bottlenecks will often be database operations, disk operations, or some service over the network.
This means that when you introduce synchronous behaviour deep in the system, you’ll possibly need to handle back-pressure, level by level, until you end up at the system’s edges and can tell the user, "please slow down."
Developers that see this pattern will often try to put API limits per user [8] on the system entry points. This is a valid approach, especially since it can guarantee a basic quality of service (QoS) to the system and allows one to allocate resources as fairly (or unfairly) as desired.
[8] There’s a tradeoff between slowing down all requests equally and using rate-limiting, both of which are valid. Rate-limiting per user means you’d still need to increase capacity or lower the limits of all users when more new users hammer your system, whereas a synchronous system that indiscriminately blocks should adapt to any load with more ease, but possibly unfairly.
What’s particularly tricky about applying back-pressure to handle overload via synchronous calls is determining how long a typical operation should take, or rather, at what point the system should time out.
The best way to express the problem is that the first timer to be started will be at the edge of the system, but the critical operations will be happening deep within it. This means that the timer at the edge of the system will need to have a longer wait time than those within, unless you plan on having operations reported as timing out at the edge even though they succeeded internally.
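A minimal sketch of this layering (the module, server names, and timeout values are all assumptions): the timer started at the edge is longer than the one wrapping the critical operation deeper inside, so an internally successful operation is not reported as timed out at the edge.

```erlang
%% Sketch: timeouts shrink as you go deeper into the system.
-module(timeout_layers).
-export([edge_request/1, db_query/1]).

-define(EDGE_TIMEOUT, 15000).  %% outermost timer at the system's edge
-define(DB_TIMEOUT,    5000).  %% inner timer around the critical operation

%% Called at the edge of the system; wraps everything below it.
edge_request(Request) ->
    gen_server:call(edge_handler, {request, Request}, ?EDGE_TIMEOUT).

%% Called from somewhere inside edge_handler; its shorter timeout fires
%% (and can be handled) well before the edge timer does.
db_query(Query) ->
    gen_server:call(db_worker, {query, Query}, ?DB_TIMEOUT).
```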
An easy way out of this is to go for infinite timeouts. Pat Helland [9] has an interesting answer to this:
Some application developers may push for no timeout and argue it is OK to wait indefinitely. I typically propose they set the timeout to 30 years.
That, in turn, generates a response that I need to be reasonable and not silly. Why is 30 years silly but infinity is reasonable? I have yet to see a messaging application that really wants to wait for an unbounded period of time… This is, ultimately, a case-by-case issue. In many cases, it may be more practical to use a different mechanism for that flow control. [10]
[9] Idempotence is Not a Medical Condition, April 14, 2012
[10] In Erlang, using the value infinity will avoid creating a timer, saving some resources. If you do use this, remember to at least have a well-defined timeout somewhere in the sequence of calls.
A somewhat simpler approach to back-pressure is to identify the resources we want to block on, those that cannot be made faster and are critical to your business and users. Lock these resources behind a module or procedure where a caller must ask for the right to make a request and use them.
There are plenty of variables that can be used: memory, CPU, overall load, a bounded number of calls, concurrency, response times, a combination of them, and so on.
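A minimal sketch of the "ask for the right to make a request" idea (the module name, API, and limit are assumptions, not from the original text): a gen_server that grants at most a fixed number of concurrent uses of a protected resource and refuses the rest explicitly.

```erlang
%% Sketch: a permission gate bounding concurrent access to a resource.
-module(resource_gate).
-behaviour(gen_server).
-export([start_link/0, acquire/0, release/0]).
-export([init/1, handle_call/3, handle_cast/2]).

-define(MAX, 50).  %% arbitrary bound on concurrent users of the resource

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

acquire() -> gen_server:call(?MODULE, acquire).
release() -> gen_server:cast(?MODULE, release).

init([]) -> {ok, 0}.  %% state: number of rights currently handed out

handle_call(acquire, _From, N) when N < ?MAX ->
    {reply, ok, N + 1};
handle_call(acquire, _From, N) ->
    %% Explicit refusal, so the caller can report overload rather than
    %% just becoming mysteriously slow.
    {reply, {error, overloaded}, N}.

handle_cast(release, N) when N > 0 -> {noreply, N - 1};
handle_cast(release, N)            -> {noreply, N}.
```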
The SafetyValve [11] application is a system-wide framework that can be used when you know back-pressure is what you’ll need.
For more specific use cases having to do with service or system failures, there are plenty of circuit breaker applications available. Examples include breaky [12], fuse [13], or Klarna’s circuit_breaker [14].
Otherwise, ad-hoc solutions can be written using processes, ETS, or any other tool available.
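For the ad-hoc route, here is a minimal ETS-based sketch (the table name and limit are assumptions): an in-flight counter bumped with ets:update_counter, refusing callers once the bound is reached.

```erlang
%% Sketch: an ad-hoc gate built on an ETS counter.
-module(ets_gate).
-export([init/0, acquire/0, release/0]).

-define(TAB, ets_gate).
-define(MAX, 100).  %% arbitrary bound on in-flight requests

init() ->
    ets:new(?TAB, [named_table, public, set, {write_concurrency, true}]),
    ets:insert(?TAB, {in_flight, 0}),
    ok.

acquire() ->
    case ets:update_counter(?TAB, in_flight, {2, 1}) of
        N when N =< ?MAX -> ok;
        _TooMany ->
            %% Undo the bump and tell the caller to back off.
            ets:update_counter(?TAB, in_flight, {2, -1}),
            {error, overloaded}
    end.

release() ->
    %% Decrement, never going below zero.
    ets:update_counter(?TAB, in_flight, {2, -1, 0, 0}),
    ok.
```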
The important part is that the edge of the system (or subsystem) may block and ask for the right to process data, but it is the code guarding the critical bottleneck that decides whether that right can be granted or not.
The advantage of proceeding that way is that you may just avoid all the tricky stuff about timers and making every single layer of abstraction synchronous.
You’ll instead put guards at the bottleneck and at a given edge or control point, and everything in between can be expressed in the most readable way possible.
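As a rough illustration of that idea (the gate module is the hypothetical one sketched earlier), a small wrapper can keep the permission logic at the control point so the code in between stays plain:

```erlang
%% Sketch: a guard wrapper at a control point.
-module(guarded).
-export([with_permission/1]).

%% Runs Fun only if the gate grants the right, and always releases it.
with_permission(Fun) when is_function(Fun, 0) ->
    case resource_gate:acquire() of   %% hypothetical gate from the sketch above
        ok ->
            try {ok, Fun()}
            after resource_gate:release()
            end;
        {error, _Reason} = Error ->
            Error
    end.
```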
[11] https://github.com/jlouis/safetyvalve
[12] https://github.com/mmzeeman/breaky
[13] https://github.com/jlouis/fuse
[14] https://github.com/klarna/circuit_breaker
The tricky part about back-pressure is reporting it. When back-pressure is done implicitly through synchronous calls, the only way to know it is at work due to overload is that the system becomes slower and less usable.
Sadly, this is also going to be a potential symptom of bad hardware, bad network, unrelated overload, and possibly a slow client.
Trying to figure out that a system is applying back-pressure by measuring its responsiveness is equivalent to trying to diagnose which illness someone has by observing that person has a fever.
It tells you something is wrong, but not what.
Asking for permission, as a mechanism, will generally allow you to define your interface in such a way that you can explicitly report what is going on: the system as a whole is overloaded, or you’re hitting a limit on the rate at which you can perform an operation, and you can adjust accordingly.
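A minimal sketch of such explicit reporting (the module, return values, and placeholder functions are assumptions): the edge asks for permission and turns each kind of refusal into a distinct error, so operators and clients can tell "the whole system is overloaded" apart from "this caller hit its own limit".

```erlang
%% Sketch: reporting the reason for back-pressure at the edge.
-module(edge_api).
-export([handle_request/2]).

handle_request(User, Request) ->
    case acquire(User) of
        ok ->
            try do_request(Request)
            after release(User)
            end;
        {error, overloaded}   -> {error, system_overloaded};
        {error, rate_limited} -> {error, {rate_limited, User}}
    end.

%% Placeholders standing in for a real permission module (such as the
%% resource_gate sketch above) and for the actual work; a real acquire/1
%% could return either of the errors handled in handle_request/2.
acquire(_User)  -> ok.
release(_User)  -> ok.
do_request(Req) -> {ok, Req}.
```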
There is a choice to be made when designing the system. Are your users going to have per-account limits, or are the limits going to be global to the system?
System-global or node-global limits are usually easy to implement, but will have the downside that they may be unfair. A user doing 90% of all your requests may end up making the platform unusable for the vast majority of the other users.
Per-account limits, on the other hand, tend to be very fair, and allow fancy schemes such as having premium users who can go above the usual limits. This is extremely nice, but has the downside that the more users use your system, the higher the effective global system limit tends to move. Starting with 100 users that can do 100 requests a minute gives you a global 10000 requests per minute.
Add 20 new users with that same rate allowed, and suddenly you may crash a lot more often.
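A minimal sketch of a per-account limit (the table name, window, and quota are assumptions): each user gets a fixed number of requests per minute, counted in an ETS table keyed by user and current minute.

```erlang
%% Sketch: a per-user, per-minute quota counter.
-module(user_quota).
-export([init/0, allow/1]).

-define(TAB, user_quota).
-define(PER_USER, 100).  %% requests allowed per user per minute

init() ->
    ets:new(?TAB, [named_table, public, set, {write_concurrency, true}]),
    ok.

allow(User) ->
    Minute = erlang:system_time(second) div 60,
    Key = {User, Minute},
    %% Create the counter at 0 if missing, then increment it.
    Count = ets:update_counter(?TAB, Key, {2, 1}, {Key, 0}),
    case Count =< ?PER_USER of
        true  -> ok;
        false -> {error, rate_limited}
    end.
%% Note: this sketch never sweeps old {User, Minute} keys; a real system
%% would clean them up periodically.
```

With this shape, adding premium accounts is just a matter of looking up a per-user quota instead of the fixed ?PER_USER constant, but, as noted above, the effective global limit still grows with the number of users.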
It’s important to consider the tradeoffs your business can tolerate from that point of view, because users will tend not to appreciate seeing their allowed usage go down all the time, possibly even more so than seeing the system go down entirely from time to time.