Parallel crawlers




















This generates a continuous stream of new URLs of documents to be downloaded, and it is clear that the associated work-load can be very large. The incoming new URLs have to be organized by a priority measure in order to download the most relevant documents first. Efficiently managing them, along with other synchronization issues such as URLs downloaded by different processing nodes forming a cluster of computers, are the matters of this paper. We propose efficient and scalable strategies which consider intra-node multi-core multi-threading and inter-node parallelism.

Categories and Subject Descriptors: Search and Retrieval—Search process

General Terms: Algorithms, Performance

Keywords: Priority Queues, Parallel and Distributed Computing

1. INTRODUCTION

Large scale crawlers are deployed on a number of clusters of computers composed of several thousand processors or processing nodes, all of them organized to focus on different segments of the Web. These systems are fully asynchronous, and their large sizes are explained by the huge size and highly dynamic nature of the Web and the requirement of retrieving web documents in the least possible time. That is, connection and document retrieval time is affected by the large latencies of the Internet, and therefore the available bandwidth can only be exploited by increasing the number of concurrent robots. This in turn is possible by executing a large number of asynchronous threads running on a given set of processors.

After a document has been downloaded a number of operations take place on it. The main ones are the extraction of links and text. A crawler is the component in charge of this downloading process; the index built from the downloaded text is used to answer user queries, and the search engine itself performs query processing upon the index structure to quickly present users the top-k documents per query. A priority queue is used to determine the order in which documents are retrieved from the Web. Priority is defined by a composition of various importance metrics for Web sites.

The relevant fact for this paper is that the fully asynchronous activity of robots generates a constant stream of operations that compete for the use of resources such as processors, inter-processor communication hardware and disk arrays. The arrival time of these on-line jobs and the amount of work demanded on the resources are in general unpredictable. Current cluster realizations of crawlers are implemented using the message passing approach to parallel computing.

URLs can be distributed onto the processors or, even better, one can distribute Web site domains onto processors rather than individual URLs. A communication action is triggered by a robot when a link extracted from a downloaded document belongs to a site assigned to another processor. This reduces communication significantly, as most web pages of a given site are expected to point to pages in the same site.

Ideally, at any moment one would download first the most relevant documents in the Web sample known (stored up to this point) by the P processors. This defines the problem as a parallel priority queue: extract the globally top-rP URLs, then pick up the next globally top-rP, and so on. Internet and communication hardware latencies make this ideal scenario unfeasible. At the inter-nodes level we instead reduce communication among processors: sending messages grouped into single blocks is more efficient than sending individual point to point messages. This is not only because of the cumulative overheads of individual messages but also because of the associated cost of heavy-threads management. In our case the number of these heavy threads can be reduced to just one and replaced by light threads (as we explain below), administered by hardware as understood in multi-core processors programmed with the OpenMP library.

We focus on the efficient implementation of an important component of parallel crawlers devised to run on clusters of computers. We believe our approach is novel to the field because of the unique type of hybrid parallelization method we promote and the data structures and parallel algorithms we propose. In this paper we are specifically interested in obtaining the best performance of the hardware in charge of processing URLs in order to efficiently feed the pairs (scheduler, r robots) distributed onto a set of P processors.
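To illustrate what is meant above by light threads, the following is a minimal sketch of our own (not code from the paper; process_url and the block layout are placeholder names): a block of URLs is handed to T OpenMP threads, so no per-task thread creation cost is paid.

    #include <omp.h>
    #include <cstdio>
    #include <string>
    #include <vector>

    // Placeholder for the per-URL work (parsing, link extraction, priority update).
    static void process_url(const std::string& url) {
        std::printf("processing %s\n", url.c_str());
    }

    // Process one block of URLs with T OpenMP "light" threads instead of
    // spawning one heavy (Posix) thread per task.
    static void process_block(const std::vector<std::string>& block, int T) {
        #pragma omp parallel for num_threads(T) schedule(static)
        for (int i = 0; i < static_cast<int>(block.size()); ++i)
            process_url(block[i]);   // each thread takes one slice of the block
    }

The only point of the sketch is that the same pool of hardware threads is reused for every block of work.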

Periodically, each processor must deal with a set of URLs to be extracted from its local priority queue, a set of URLs to be inserted in the queue, and yet another set of URLs to be sent to other processors. In this paper we propose two highly optimized data structures and algorithms to perform these tasks. Both strategies are friendly to secondary memory and multi-core parallel processing, and they can be used alternatively: one ensures a logarithmic worst case whereas the other has better average performance. The key to efficient performance in these strategies comes from the fact that they work in chunks of R URLs. For example, one of the data structures works with chunks of size R, each of which is stored in contiguous blocks of disk, and it ensures at most a logarithmic number of disk accesses per operation. When all data fits into main memory, our experiments show that this approach is efficient as well.

We show that the best performance is obtained by performing these computations in a bulk-synchronous manner, which allows the following optimizations.

Inter-nodes parallelism: overall parallel computations are performed in blocks of R URLs in each processor (R being an average value) and messages among processors are also sent in blocks. This prevents concurrency control conflicts, since grouped URLs are processed sequentially at each processor, which reduces overheads significantly. The bulk-synchronous processes feed up the queues Q in their respective processors at regular intervals, and during each of these intervals each processor essentially executes an insert-many(M) operation and an extract-top(R) operation. At the same time the new URLs discovered by the robots are stored in an input queue associated with each bulk-synchronous process; this sets the communication in both directions between the asynchronous and bulk-synchronous tasks. The configuration is scaled down to achieve the proper average R with the least possible amount of resources. This bulk-synchronism enables us to achieve near-optimal inter-node parallelism.

Intra-node parallelism: each node is a multi-core processor supporting the efficient use of T light threads; namely, the running time costs of the insert-many(M) and extract-top(R) operations can be reduced by exploiting these threads.
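The sketch below is our own illustration (not the paper's code) of the interface this implies: an input queue that the robot threads fill asynchronously, and a bulk-synchronous process that drains it with insert-many and refills the robots with extract-top. Type and function names here are invented for the example.

    #include <cstddef>
    #include <mutex>
    #include <string>
    #include <vector>

    // A URL together with its crawling priority (e.g., an OPIC-style score).
    struct Url { std::string url; double priority; };

    // Input queue written by the asynchronous robot threads and drained in
    // bulk by the synchronous process at the start of each interval.
    class InputQueue {
    public:
        void push(Url u) {
            std::lock_guard<std::mutex> lock(m_);
            q_.push_back(std::move(u));
        }
        // Take everything accumulated so far in one shot (bulk hand-over).
        std::vector<Url> drain() {
            std::lock_guard<std::mutex> lock(m_);
            std::vector<Url> out;
            out.swap(q_);
            return out;
        }
    private:
        std::mutex m_;
        std::vector<Url> q_;
    };

    // Interface of the priority queue being optimized: URLs are inserted in
    // batches of (on average) M and extracted in batches of R.
    class BlockPriorityQueue {
    public:
        virtual ~BlockPriorityQueue() = default;
        virtual void insert_many(std::vector<Url> batch) = 0;      // bulk insert
        virtual std::vector<Url> extract_top(std::size_t R) = 0;   // top-R by priority
    };

    // One interval of the bulk-synchronous process on a single node: drain the
    // robots' input queue, insert in bulk, hand back the next top-R URLs.
    std::vector<Url> run_interval(InputQueue& in, BlockPriorityQueue& pq, std::size_t R) {
        pq.insert_many(in.drain());
        return pq.extract_top(R);
    }

A concrete BlockPriorityQueue would be one of the two structures developed in the paper: the CBT-based one or the quickheap-based one described below.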

We first describe the overall process of performing crawling in parallel on a cluster of computers. We simplify the discussion by assuming the use of a standard priority queue per processor and by assuming that the Web graph is distributed onto the processors by websites. We also assume that the asynchronous and synchronous parts of the parallel crawler are both hosted by the same set of processors, which may not be the case in a production system. Each part lives as an independent set of threads in each processor.

Each processor maintains a scheduler and r robots. The asynchronous component of the crawler is composed of a set of Posix threads. The main one of these threads in each processor executes the tasks of the scheduler, and there are r additional threads for executing the robots, one thread per robot. The scheduler thread is also the reception and delivery point of messages. The function run is in charge of processing all messages, in particular messages sent to the main thread by the robots. Upon reception of a message the corresponding job is handed to an idle robot; when the run function does not find idle robots it places the operation in a pending jobs queue Q. Robots that finish their current jobs get new ones from this queue Q or, if the queue is empty, they sleep themselves on condition variables.

Processing a document involves mainly (i) performing a parsing to extract links to other documents (which can produce messages to other processors), and (ii) extracting and storing the text and ranking information for the search engine. Each URL is assigned a priority value which depends on the application; in our experiments we use OPIC [1].
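As a rough sketch of the mechanism just described (ours, not the paper's implementation, and using C++ standard library facilities rather than raw Posix calls), the pending jobs queue Q and the condition-variable sleeping can look like this:

    #include <condition_variable>
    #include <functional>
    #include <mutex>
    #include <queue>

    // A job is whatever a robot has to do next (download a URL, parse a page, ...).
    using Job = std::function<void()>;

    // Pending jobs queue Q: the scheduler pushes work when no robot is idle,
    // robots pop work when they finish their current job, and sleep on a
    // condition variable while the queue is empty.
    class PendingJobs {
    public:
        void push(Job j) {
            {
                std::lock_guard<std::mutex> lock(m_);
                q_.push(std::move(j));
            }
            cv_.notify_one();               // wake one sleeping robot
        }
        Job pop() {                         // blocks while the queue is empty
            std::unique_lock<std::mutex> lock(m_);
            cv_.wait(lock, [this] { return !q_.empty(); });
            Job j = std::move(q_.front());
            q_.pop();
            return j;
        }
    private:
        std::mutex m_;
        std::condition_variable cv_;
        std::queue<Job> q_;
    };

    // Body of one robot thread: repeatedly take the next job and execute it.
    void robot_loop(PendingJobs& Q) {
        for (;;) {
            Job j = Q.pop();
            j();
        }
    }

Here push corresponds to the run function handing over work when no robot is idle, and pop is what a robot calls when it finishes its current job.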

The overall process of crawling works in cycles: in each cycle the scheduler obtains the next most relevant URLs from its priority queue, the robots download and parse the corresponding documents, and the newly discovered URLs are inserted into the local queue or sent to other processors. The processes executing the synchronous machine are treated as a bulk-synchronous parallel (BSP) computer [15]. In BSP the computation is organized as a sequence of supersteps; the messages sent during a superstep are available for processing at their destinations by the next superstep, and each superstep is ended with the barrier synchronization of the processors. Notice that the connection between the asynchronous and synchronous processes can be made by the queue Q discussed earlier.

Messages are grouped into packages: there is a maximum number Rx of documents that are allowed to be downloaded before sending messages containing packages of URLs to other processors. The average value of Rx can be determined off-line from a previous crawling of the same Web as follows. The downloaded sample can be represented as a graph where nodes are web pages and arcs are links. For the nodes we determine the pageRank value using the standard algorithm, and a priority queue Qx is created using the pageRank values of the nodes as priority values. The algorithm that we propose for this task simulates the crawl on this sample: it repeatedly gets the next node n from the priority queue Qx, charges its download to the processor that owns its website (in the simulated parallel computer the already processed nodes are stored in W), accounts for the URLs belonging to the websites assigned to that processor, and, for all processors i containing a site pointed to by one of the links of n, records the messages sent to their respective destinations; this is repeated until Qx becomes empty. The parameter Rx can then be set by performing a binary search through several executions of this simulation, using the error measured in each superstep to tune the value for the next superstep.
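A rough sketch of how such an off-line estimation might look (entirely our own reading of the text; the Page structure, the ownership field and the way batches are counted are assumptions, not the paper's algorithm):

    #include <cstddef>
    #include <queue>
    #include <utility>
    #include <vector>

    // One page of the previously downloaded sample: its pageRank score,
    // the processor that owns its website, and the pages it links to.
    struct Page {
        double pagerank;            // priority value (higher = more relevant)
        int    owner;               // processor that owns the page's website
        std::vector<int> links;     // indices of linked pages
    };

    // Simulate crawling the sample in pageRank order and return, for one
    // processor p, the average number of documents it downloads between
    // two consecutive cross-processor messages (a rough estimate of Rx).
    double estimate_rx(const std::vector<Page>& sample, int p) {
        // Max-priority queue Qx keyed by pageRank: (score, page index).
        std::priority_queue<std::pair<double, int>> Qx;
        for (std::size_t i = 0; i < sample.size(); ++i)
            Qx.push({sample[i].pagerank, static_cast<int>(i)});

        long downloads = 0, batches = 1;
        while (!Qx.empty()) {
            int n = Qx.top().second;              // next most relevant node
            Qx.pop();
            if (sample[n].owner != p) continue;   // charged to another processor
            ++downloads;
            // If any extracted link points to a site owned by another processor,
            // a message (package of URLs) must eventually be sent: count that as
            // the end of one Rx-sized batch.
            for (int l : sample[n].links)
                if (sample[l].owner != p) { ++batches; break; }
        }
        return static_cast<double>(downloads) / batches;
    }

In the paper the estimate is further tuned by a binary search over several executions of the simulation; the sketch only computes one sample value.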

Each item stored in the priority queue is associated with one leaf of the CBT, and we use the internal nodes to maintain a continuous binary tournament among the items. The structure is composed of (i) the CBT itself, (ii) an array Prio[N] of priority values, and (iii) an array Leaf[N] of integers to map between items and leaves. The parent of a node is obtained with simple integer arithmetic and, through the array Prio, we can calculate the priority associated with a given leaf. Finally, to enable a dynamic reusing of item identifiers in the PQ, the array Leaf records the leaf currently assigned to each item.

Deletions in the CBT are performed by removing the leaf child of an internal node with two leaf children (the leaf located at the last position of the tree) and replaying the tournament. On the other hand, insertions are performed by appending a new rightmost leaf and updating the CBT: during an insert, and after setting the new leaf, let us assume that k is the position of the leaf that holds the item which contains S and that i is the new item selected to be stored in k; the priorities along the path from leaf k to the root are then updated. This cost, as shown in Figure 1 (larger numerical values indicate higher cost), can be further reduced using parallel algorithms.

Figure 1: Insertion update.

Within each processor the structure is symmetrically partitioned into T slices. The insert-many(M) operation, executed as a sequence of insertions, updates the values Prio[i] associated with every item i located along the path from a leaf k to the root of the global CBT. We also need to determine the minimum key in any global node Prio[k] and, to get the top-R priority keys, we use a quick-sort like SELECT operation (see the next subsection) to redistribute the contents of global nodes.
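As a concrete, much simplified illustration of the tournament idea, here is a minimal sequential sketch of a fixed-capacity CBT-like priority queue. It is our own code: it keeps only insertion and top (no deletions, no Leaf array for identifier reuse, no T-slice partitioning), and it uses a sentinel with infinite priority so that smaller values win, matching the text's "minimum key".

    #include <cstddef>
    #include <limits>
    #include <vector>

    // Minimal sequential sketch of the tournament behind the CBT: each leaf
    // holds one item's priority (array Prio) and every internal node records
    // which leaf wins (has the smallest priority) in its subtree.
    class TournamentPQ {
    public:
        explicit TournamentPQ(std::size_t capacity)      // capacity >= 1
            : cap_(capacity), n_(0),
              prio_(capacity + 1, std::numeric_limits<double>::infinity()),
              winner_(2 * capacity, capacity) {}          // index `capacity` = sentinel

        // Insert an item: place its priority in the next rightmost leaf and
        // replay the tournament along the path from that leaf to the root.
        // Returns the item identifier (the leaf it was assigned).
        std::size_t insert(double priority) {
            std::size_t item = n_++;                      // assumes n_ < cap_
            prio_[item] = priority;
            std::size_t pos = item + cap_;                // leaf position in the tree
            winner_[pos] = item;
            for (pos /= 2; pos >= 1; pos /= 2) {          // walk up to the root
                std::size_t l = winner_[2 * pos], r = winner_[2 * pos + 1];
                winner_[pos] = (prio_[l] <= prio_[r]) ? l : r;
            }
            return item;
        }

        // Item identifier and priority of the current overall winner (minimum).
        std::size_t top() const { return winner_[1]; }
        double top_priority() const { return prio_[winner_[1]]; }

    private:
        std::size_t cap_, n_;
        std::vector<double> prio_;         // Prio[i]: priority of item i
        std::vector<std::size_t> winner_;  // winning leaf of each subtree
    };

The structures in the paper additionally support deletions, reuse of item identifiers through the Leaf array, and bulk insert-many / extract-top operations partitioned across T light threads.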

Let us now switch to the incremental sorting problem, which can be stated as follows: given a set A of m numbers, obtain its smallest elements one by one, in increasing order, without knowing in advance how many of them will be needed. Solving this with k successive calls to Quickselect costs O(km); IQS avoids the O(km) complexity by reusing the work across calls to Quickselect. The pivots used by one call partially order the array for the following calls, and therefore IQS stores these pivots within a stack S, as they are relevant for the next extractions. To find the next minimum, we first check whether p, the top value in S, is the index of the element sought, in which case we pop it and return A[p]. Otherwise, because of previous partitionings, it holds that the elements placed before position p and not yet output are smaller than A[p], so we partition that segment again, pushing the new pivot positions onto S, until the top of S becomes the index of the element sought. Figure 3 shows algorithm IQS, which solves the problem in optimal expected time.

Figure 3: IncrementalQuicksort.

After some of this partitioning work the array is organized as follows: if we read it from right to left, we start with a pivot and at its left side there is a chunk of elements smaller than it; next, we have another pivot and another chunk, and so on, until we reach the last pivot and a last chunk. In this sense the pivots and chunks of the array are semi-ordered. In the following, we exploit this property to implement a priority queue over an array processed with algorithm IQS. We call this IQS-based priority queue a quickheap. The quickheap keeps the array heap, the stack S of pivot positions, an integer idx indicating the position of the current head, and an integer capacity to indicate the size of heap. The value of capacity must be sufficient to store simultaneously all the elements we need in the array. Since min-extractions are performed but we want the most relevant URLs first, we multiply each priority value by -1.

To extract the minimum, we first make sure that it is located in the cell heap[idx]; for this sake, we just call IQS(heap, idx, S). Next, we increase idx and pop S. To insert a new element x we must preserve the pivot invariant; to do that, it is enough to find the chunk where x belongs. We first move the fictitious pivot, updating its position in S, without comparing it with the new element x, so we have a free cell in the last chunk. Next, we compare x with the pivot at cell S[1]. If x can stay in the last chunk, we place it in the free cell. Otherwise, we move the first element at the right of pivot S[1] to the free place left by pivot S[0], move the pivot S[1] one place to the right, updating its position in S, and continue with the next pivot.

Note that if we use heap as a circular array, we can handle arbitrarily long sequences of insertions and deletions, as long as we never store more than capacity elements simultaneously in the quickheap. We add elements at the tail of the quickheap (the cell heap[S[0] mod capacity]) and perform min-extractions from the head of the quickheap (the cell heap[idx mod capacity]). In the case of circular arrays, we have to take into account that an object whose position is pos is actually located in the cell heap[pos mod capacity]; thus we have to slightly change algorithm IQS to manage the positions in the circular array. For secondary memory management, the same strategy of using R-sized nodes can be applied.
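Figure 3 is not reproduced in this text, so as a stand-in here is our own compact version of one IQS step over a plain (non-circular) array; the function names, the random pivot choice and the Lomuto partition are our choices, not necessarily those of the paper.

    #include <algorithm>
    #include <cstddef>
    #include <random>
    #include <stack>
    #include <vector>

    // Partition A[lo..hi] (inclusive) around the value at pivot_index;
    // returns the final position of the pivot (standard Lomuto scheme).
    static std::size_t partition(std::vector<double>& A, std::size_t lo,
                                 std::size_t hi, std::size_t pivot_index) {
        std::swap(A[pivot_index], A[hi]);
        double pivot = A[hi];
        std::size_t store = lo;
        for (std::size_t i = lo; i < hi; ++i)
            if (A[i] < pivot) std::swap(A[i], A[store++]);
        std::swap(A[store], A[hi]);
        return store;
    }

    // One step of Incremental QuickSort: returns the idx-th smallest element
    // of A, assuming the stack S holds the pivot positions left by previous
    // calls (initially just A.size()) and that positions before idx have
    // already been output.
    double iqs(std::vector<double>& A, std::size_t idx,
               std::stack<std::size_t>& S, std::mt19937& rng) {
        while (S.top() != idx) {
            // Partition the first chunk, A[idx .. S.top()-1], around a random pivot.
            std::uniform_int_distribution<std::size_t> pick(idx, S.top() - 1);
            std::size_t p = partition(A, idx, S.top() - 1, pick(rng));
            S.push(p);                    // remember the new pivot position
        }
        S.pop();                          // A[idx] is now the next minimum
        return A[idx];
    }

Usage: push A.size() onto an empty stack S once, then call iqs(A, 0, S, rng), iqs(A, 1, S, rng), and so on to obtain the elements in increasing order. The quickheap variant described above would additionally map every position pos to the cell heap[pos mod capacity].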



