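1. A text file holds 1 billion records, stored sorted by key. Design an algorithm that quickly finds the record for a given key.
$10^9 \approx 2^{30}$ records; at 1 kB per record the file is about 1 TB in total. Split the file into 1,000 chunks of 1 GB each and keep the first key of each chunk as a small index. To look up a key, use that index to find the one chunk that can contain it, load that chunk into memory, and binary-search it.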
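The chunk-then-binary-search idea can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the in-memory lists stand in for the 1 GB file pieces, the sparse index of first keys is an assumption about how the right chunk gets chosen, and the names `build_sparse_index`, `binary_search`, and `lookup` are hypothetical.

```python
from bisect import bisect_right


def build_sparse_index(chunks):
    """Return the first key of every chunk (one small entry per chunk)."""
    return [chunk[0][0] for chunk in chunks]


def binary_search(records, key):
    """Binary search over sorted (key, value) records; None if absent."""
    lo, hi = 0, len(records)
    while lo < hi:
        mid = (lo + hi) // 2
        if records[mid][0] < key:
            lo = mid + 1
        else:
            hi = mid
    if lo < len(records) and records[lo][0] == key:
        return records[lo]
    return None


def lookup(chunks, sparse_index, key):
    """Find the record for `key` by locating its chunk, then searching it."""
    # Step 1: pick the last chunk whose first key is <= key.
    i = bisect_right(sparse_index, key) - 1
    if i < 0:
        return None  # key sorts before the very first record
    # Step 2: in the real setting this chunk would be loaded from disk here.
    return binary_search(chunks[i], key)


# Toy data: three sorted chunks standing in for the 1,000 file pieces.
chunks = [
    [("apple", 1), ("banana", 2)],
    [("cherry", 3), ("grape", 4)],
    [("melon", 5), ("peach", 6)],
]
index = build_sparse_index(chunks)
result = lookup(chunks, index, "grape")
```

With 1,000 chunks the index step costs about 10 comparisons and the in-chunk search about 20, which matches the roughly 30 steps suggested by $2^{30}$.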
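2. Design a distributed crawler system.
Configuration parameters: start_url, crawl depth, update frequency.
Features: scheduled crawling and updates, deduplication, search; whether crawl rules are supported.
Issues: distributed storage, how to deduplicate, disk I/O and network I/O; re-crawling. When data goes stale, update the index.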
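One possible approach to the deduplication question is to normalize each URL, hash it, and route the fingerprint to a shard by hash value. The sketch below is an assumption for illustration only: the normalization rules, the choice of SHA-1, and the names `normalize`, `fingerprint`, and `SeenSet` are all hypothetical, and the shards are plain in-process sets standing in for separate storage nodes.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Lowercase scheme and host, drop the fragment, default the path to '/'."""
    parts = urlsplit(url)
    path = parts.path or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def fingerprint(url):
    """Fixed-size digest of the normalized URL, cheaper to store than the URL."""
    return hashlib.sha1(normalize(url).encode("utf-8")).hexdigest()


class SeenSet:
    """URL fingerprints partitioned across shards by hash value."""

    def __init__(self, num_shards=4):
        # Each set stands in for the storage of one node (assumption).
        self.shards = [set() for _ in range(num_shards)]

    def add_if_new(self, url):
        """Return True if the URL was not seen before, and record it."""
        fp = fingerprint(url)
        shard = self.shards[int(fp, 16) % len(self.shards)]
        if fp in shard:
            return False
        shard.add(fp)
        return True


seen = SeenSet()
first = seen.add_if_new("http://Example.com/a#section")
second = seen.add_if_new("http://example.com/a")
```

Because the same URL always hashes to the same shard, each node can answer "seen before?" without consulting the others. At larger scale a Bloom filter per shard would trade a small false-positive rate for much less memory.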
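Estimate the volume up front. For example, if each page has 100 links, 4 levels gives $100^4 = 10^8$ pages; at 100 kB per page, a single crawl could produce 10 TB of data. How should duplicates be removed? Assuming deduplication removes 50%, that leaves 5 TB.
Assuming 20% of that needs periodic updates, the update volume is 1 TB.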
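The back-of-the-envelope estimate above can be checked with a few lines of arithmetic. The constant names are hypothetical, decimal units are assumed (1 kB = $10^3$ bytes, 1 TB = $10^{12}$ bytes), and only the pages at the deepest level are counted, as in the text.

```python
LINKS_PER_PAGE = 100
DEPTH = 4
PAGE_SIZE_BYTES = 100 * 10**3  # 100 kB per page
DEDUP_PERCENT = 50             # share of data removed as duplicates
UPDATE_PERCENT = 20            # share of remaining data re-crawled on a schedule
TB = 10**12

# Pages at the deepest level; shallower levels add only about 1% more.
pages = LINKS_PER_PAGE ** DEPTH

raw_bytes = pages * PAGE_SIZE_BYTES
after_dedup_bytes = raw_bytes * (100 - DEDUP_PERCENT) // 100
update_bytes = after_dedup_bytes * UPDATE_PERCENT // 100

print(f"pages: {pages}")
print(f"raw crawl: {raw_bytes // TB} TB")
print(f"after dedup: {after_dedup_bytes // TB} TB")
print(f"periodic update: {update_bytes // TB} TB")
```

Integer percentages are used so the results stay exact; changing any one assumption (depth, page size, ratios) shows quickly how sensitive the totals are.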
http://blog.sina.com.cn/s/blog_59c4ac5501017wda.html
http://www.douban.com/group/topic/38361104/
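3. Design a cloud push service for mobile phones based on long-lived connections. How do you manage connections (dropped connections, connection lookup), handle millions of long-lived connections, and provide fault tolerance?
4. News feeds.
5. A distributed caching scheme.
For system design, I think it helps to know the following: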
A load balancer is a device that acts as a reverse proxy and distributes network or application traffic across a number of servers. Load balancers are used to increase capacity (concurrent users) and reliability of applications. They improve the overall performance of applications by decreasing the burden on servers associated with managing and maintaining application and network sessions, as well as by performing application-specific tasks.
Load balancers are generally grouped into two categories: Layer 4 and Layer 7. Layer 4 load balancers act upon data found in network and transport layer protocols (IP, TCP, UDP). Layer 7 load balancers distribute requests based upon data found in application layer protocols such as HTTP.
Requests are received by both types of load balancers and they are distributed to a particular server based on a configured algorithm. Some industry standard algorithms are:
Round robin
Weighted round robin
Least connections
Least response time
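The four algorithms listed above can be sketched in a few lines each. These are simplified illustrations rather than how any particular load balancer implements them: the class and function names are hypothetical, weights are assumed to be small positive integers, and the response-time averages are assumed to be measured elsewhere.

```python
import itertools


class RoundRobin:
    """Hand out servers in a fixed rotation."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self):
        return next(self._cycle)


class WeightedRoundRobin:
    """Servers appear in the rotation in proportion to their integer weight."""

    def __init__(self, weights):
        # weights: dict mapping server name -> positive integer weight
        expanded = [s for s, w in weights.items() for _ in range(w)]
        self._cycle = itertools.cycle(expanded)

    def pick(self):
        return next(self._cycle)


class LeastConnections:
    """Send each request to the server with the fewest open connections."""

    def __init__(self, servers):
        self.active = {s: 0 for s in servers}

    def pick(self):
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        """Call when a connection to `server` closes."""
        self.active[server] -= 1


def least_response_time(avg_ms):
    """Pick the server with the lowest measured average response time."""
    return min(avg_ms, key=avg_ms.get)


rr = RoundRobin(["a", "b", "c"])
rr_picks = [rr.pick() for _ in range(4)]

wrr = WeightedRoundRobin({"a": 2, "b": 1})
wrr_picks = [wrr.pick() for _ in range(6)]
```

The naive weighted rotation sends consecutive requests to the same heavy server; production balancers typically interleave the picks more smoothly, but the long-run proportions are the same.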
Layer 7 load balancers can further distribute requests based on application specific data such as HTTP headers, cookies, or data within the application message itself, such as the value of a specific parameter.
Load balancers ensure reliability and availability by monitoring the "health" of applications and only sending requests to servers and applications that can respond in a timely manner.
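Health monitoring can be pictured as tracking when each server last passed a check and excluding servers whose last success is too old. The sketch below assumes a hypothetical `HealthTracker` with a 5-second timeout; how the checks themselves are performed (TCP connect, HTTP probe, and so on) is left out, and the injectable clock exists only to make the example deterministic.

```python
import time


class HealthTracker:
    """Track the last successful health check per server."""

    def __init__(self, servers, timeout_s=5.0, clock=time.monotonic):
        self._clock = clock
        self._timeout = timeout_s
        # None means the server has never passed a check.
        self._last_ok = {s: None for s in servers}

    def report_ok(self, server):
        """Record that `server` just answered a health check in time."""
        self._last_ok[server] = self._clock()

    def healthy(self):
        """Servers whose last successful check is within the timeout."""
        now = self._clock()
        return [
            s
            for s, t in self._last_ok.items()
            if t is not None and now - t <= self._timeout
        ]


# Deterministic demo with a fake clock instead of real time.
fake_now = [0.0]
tracker = HealthTracker(["a", "b"], timeout_s=5.0, clock=lambda: fake_now[0])
tracker.report_ok("a")
tracker.report_ok("b")
fake_now[0] = 3.0
tracker.report_ok("a")  # "b" stops answering after t = 0
fake_now[0] = 7.0
eligible = tracker.healthy()
```

A selection algorithm would then choose only among the servers returned by `healthy()`, so a server that stops responding drops out of rotation once its last success ages past the timeout.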