The Research of a Spider Based on Crawling Algorithm

Xin-Yang WANG, Jian ZHANG

Abstract


This paper studies the web spider as a data mining tool in three areas: its workflow, its key technologies, and its software algorithm. It analyzes in detail the workflow and key technologies of a URL-oriented spider. It also proposes managing the URL list with several queues and sorting the URLs by document correlativity, so that HTML files can be downloaded at high speed. The aim of the paper is to design a well-balanced, fully functional software model of the spider. The software development environment consists of Sun JDK, Borland JBuilder, SQL Server, IIS, and the Bot package.
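
The abstract does not specify how the queues are organized or how document correlativity is computed, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes a correlativity score in [0, 1] supplied by the caller, a waiting queue ordered by that score, a separate queue for URLs below a threshold, and a set of already seen URLs. The class name `UrlFrontier` and all of its methods are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Comparator;
import java.util.HashSet;
import java.util.PriorityQueue;
import java.util.Queue;
import java.util.Set;

/** Hypothetical multi-queue URL manager; names and queue roles are assumptions. */
class UrlFrontier {
    /** A URL paired with a caller-supplied correlativity score (assumed range 0..1). */
    private static final class ScoredUrl {
        final String url;
        final double score;

        ScoredUrl(String url, double score) {
            this.url = url;
            this.score = score;
        }
    }

    private final double threshold;

    // Waiting queue: URLs still to download, highest score first.
    private final PriorityQueue<ScoredUrl> waiting = new PriorityQueue<>(
            Comparator.comparingDouble((ScoredUrl s) -> s.score).reversed());

    // Discard queue: URLs whose score fell below the threshold.
    private final Queue<String> discarded = new ArrayDeque<>();

    // Every URL ever offered, used to reject duplicates.
    private final Set<String> seen = new HashSet<>();

    UrlFrontier(double threshold) {
        this.threshold = threshold;
    }

    /** Returns true if the URL was queued for download, false otherwise. */
    boolean offer(String url, double score) {
        if (!seen.add(url)) {
            return false; // already offered once
        }
        if (score < threshold) {
            discarded.add(url);
            return false;
        }
        waiting.add(new ScoredUrl(url, score));
        return true;
    }

    /** Returns the most relevant waiting URL, or null if none is left. */
    String next() {
        ScoredUrl best = waiting.poll();
        return best == null ? null : best.url;
    }

    int discardedSize() {
        return discarded.size();
    }
}
```

In this sketch a crawler loop would call `offer` for each extracted link with its score and call `next` to pick the following page to download, so the most relevant pages are fetched first and links below the threshold are never fetched.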

Keywords


Spider, URL Seed, Scope First, Document Correlativity, Threshold


DOI
10.12783/dtcse/aice-ncs2016/5717
