Web Crawler


Table of Contents
1. Introduction
2. Java web crawler implementation
3. wget Command
4. Comparison
5. References
6. Source Code
7. Sample Output


1. Introduction
The purpose of this document is to outline the details of my implementation of the Java web crawler and to compare the results I receive from my crawler program with those of wget. This document is divided into three parts. The first part is a discussion of my implementation of a crawler. The second shows a list of the top 100 URLs I obtained using wget and the top 100 URLs I obtained from the Java web crawler. The last part is a comparison of the two crawlers. References are listed at the end.


2. Java web crawler implementation
The web crawler keeps track of two lists: unvisited and unique. Both are queues. The unvisited list contains the links that the crawler has yet to crawl. The unique list contains the unique links that the crawler found while crawling.

Since the crawl starts at http://ciir.cs.umass.edu/, that link is added to both the unvisited and unique queues.

The crawler then checks whether the unvisited list is empty. If it is not, an entry is taken from the unvisited list and its robots.txt file is checked. If the robots.txt file disallows crawling, we go back to the unvisited list for the next link. Otherwise, we read the web page line by line to find links that exist, are either web pages or PDFs, and are in cs.umass.edu. For each such link, the crawler checks whether it already exists in the unique queue. If it does not, it is added to both the unique and unvisited queues. This process repeats until the unique queue has 100 entries. The 100 unique entries are written to a text file called url.txt.
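
The loop described above is a breadth-first traversal with a cap on the number of unique links. Below is a minimal sketch of that idea, assuming an in-memory link graph in place of real page fetches and Java 9 or later for Map.of and List.of. The class name FrontierSketch, the crawl method, and the sample graph are hypothetical and are not part of the program in Section 6.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Hypothetical sketch: the map of page -> outgoing links stands in for
// fetching a page and extracting its links.
class FrontierSketch {

    // Returns at most `limit` unique links in the order they were discovered.
    static List<String> crawl(String start, Map<String, List<String>> links, int limit) {
        Queue<String> unvisited = new ArrayDeque<>();  // pages still to crawl
        Set<String> unique = new LinkedHashSet<>();    // every link seen so far
        unvisited.add(start);
        unique.add(start);

        while (!unvisited.isEmpty() && unique.size() < limit) {
            String page = unvisited.poll();
            for (String link : links.getOrDefault(page, List.of())) {
                if (unique.size() >= limit) {
                    break;                             // cap reached
                }
                if (unique.add(link)) {                // add() returns false for duplicates
                    unvisited.add(link);
                }
            }
        }
        return new ArrayList<>(unique);
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = Map.of(
            "a", List.of("b", "c"),
            "b", List.of("a", "d"),
            "c", List.of("d", "e"));
        System.out.println(crawl("a", links, 4)); // prints [a, b, c, d]
    }
}
```

The sketch keeps the unique links in a set, so the duplicate check does not have to scan the whole list as a queue's contains call does.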

a. processLink method:
This method takes the extracted links and puts them in the correct format. For example, links that are in the form of “.../zzz” are changed into something like “http://xxx/yyy/yyy” and “..../.../zzz” to “http://xxx/zzz”. It also deals with other link oddities: it changes “/%7Emccallum” to “/~mccallum” and removes “www” from links so that the unique list does not add both http://www.ciir.umass.edu/ and http://ciir.umass.edu/.
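
Relative links can also be resolved with the standard library instead of by string manipulation. Below is a minimal sketch, assuming java.net.URI.resolve is an acceptable stand-in for the hand-written logic. The class LinkNormalizer and the method normalize are hypothetical names. The “www.” and “%7E” rewrites mirror the ones described above.

```java
import java.net.URI;

// Hypothetical helper, not the processLink method from Section 6.
class LinkNormalizer {

    // pageUrl is the page the link was found on; href is the raw link text.
    // URI.create throws IllegalArgumentException for links that are not
    // valid URIs (for example, links containing spaces).
    static String normalize(String pageUrl, String href) {
        // resolve() handles "../x", "/x", "x" and absolute links
        String resolved = URI.create(pageUrl).resolve(href.trim()).normalize().toString();
        // same clean-up as described above: drop a leading "www." and unescape "~"
        return resolved.replace("://www.", "://").replace("%7E", "~");
    }

    public static void main(String[] args) {
        String page = "http://ciir.cs.umass.edu/about/index.html";
        System.out.println(normalize(page, "../research/mir/mir.html"));
        System.out.println(normalize(page, "/%7Emccallum"));
        System.out.println(normalize(page, "http://www.cs.umass.edu/"));
    }
}
```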

b. checkRobots method:
This method checks the robots.txt file of a URL. This is done by obtaining the hostname from the URL and reading its robots.txt page. The program reads the “Disallow” entries in robots.txt and checks whether the URL it is currently looking at falls under one of them. If it does, the URL is not crawled. Otherwise, the program will crawl the page. The “Crawl-delay” entry is also observed: if there is a stated crawl delay, the program delays the crawl by that time; otherwise it delays the crawl for 5 seconds (set as the default in this program).
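
The Disallow and Crawl-delay handling can be separated from the network access so that it is easier to test. Below is a minimal sketch, assuming the robots.txt body is already available as a string and, like my program, ignoring User-agent groups. The class RobotsRules, its members, and the sample rules are hypothetical. The sketch treats each Disallow value as a path prefix, which is the usual reading of robots.txt and is stricter than the substring test my program uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not the checkRobots method from Section 6.
class RobotsRules {
    final List<String> disallowed = new ArrayList<>();
    int crawlDelaySeconds = 5;  // same default delay as the program

    static RobotsRules parse(String robotsTxt) {
        RobotsRules rules = new RobotsRules();
        for (String raw : robotsTxt.split("\\R")) {      // split on any line break
            String line = raw.trim();
            int colon = line.indexOf(':');
            if (colon < 0 || line.startsWith("#")) {
                continue;                                // blank line, comment, or not a field
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("disallow") && !value.isEmpty()) {
                rules.disallowed.add(value);             // an empty Disallow blocks nothing
            } else if (field.equals("crawl-delay")) {
                try {
                    rules.crawlDelaySeconds = Integer.parseInt(value);
                } catch (NumberFormatException e) {
                    // non-integer delay: keep the default
                }
            }
        }
        return rules;
    }

    // path is the path part of the URL, e.g. "/private/a.html"
    boolean allows(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        RobotsRules r = parse("User-agent: *\nDisallow: /private/\nCrawl-delay: 10\n");
        System.out.println(r.allows("/private/a.html")); // prints false
        System.out.println(r.crawlDelaySeconds);         // prints 10
    }
}
```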


3. wget Command

Command:
wget -r -w 5 -A pdf -H -D cs.umass.edu ciir.cs.umass.edu 2> temp
cat temp | grep "saved" | more | awk '{print $6}' > url.txt

-r : allows the crawler to recursively crawl pages
-w 5 : wait 5 seconds after each crawl
-H : makes the crawler search outside of the initial domain that it is given
-D cs.umass.edu : tells the crawler to only crawl hosts in cs.umass.edu, which includes the starting host ciir.cs.umass.edu
2> temp : sends wget's log output to a temporary file
grep "saved" | more : keeps only the lines that contain the word "saved"
awk '{print $6}' : displays only the 6th column of those lines

4. Comparison
As you can see from the two lists above, they are not identical, though there is some overlap. The first difference you can observe is that all the links the wget command generated omit the “http://” at the beginning and have .html, .htm, or .pdf suffixes. Because the “URL” object in Java requires it, I had to include the “http://” in front of links. As for the suffix, my crawler does not automatically complete links such as “http://ciir.umass.edu/” with the “index.html” ending. You can see the discrepancy in link #9 on both lists: cs.umass.edu/index.html vs. http://cs.umass.edu/. This is definitely an issue, though fixing it could lead to other problems, for example “.pdf/index.html”, and then there is the case of checking whether “index.htm” works. Lastly, is it possible to guess what a site is referring to when it links to “http://cs.umass.edu/xx/”? Does it want “http://cs.umass.edu/xx/index.html”? “http://cs.umass.edu/xx/index.htm”? Or just the directory “http://cs.umass.edu/xx/”? For this reason, I simply left the link as is.
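
To compare the two lists despite these formatting differences, both can be reduced to a common form first. Below is a minimal sketch, assuming that a trailing “index.html” or “index.htm” refers to the same page as its directory. As noted above, that is a guess and not something the server guarantees, so the key is only meant for comparing lists, not for crawling. UrlKey and key are hypothetical names.

```java
// Hypothetical helper for comparing the wget list with the crawler list.
class UrlKey {

    // Reduces a URL from either list to a form that ignores the scheme,
    // a leading "www.", a trailing index page, and a trailing slash.
    static String key(String url) {
        String k = url.trim()
                      .replaceFirst("^https?://", "")
                      .replaceFirst("^www\\.", "");
        k = k.replaceFirst("/index\\.html?$", "/");   // index.html or index.htm
        if (k.endsWith("/")) {
            k = k.substring(0, k.length() - 1);
        }
        return k;
    }

    public static void main(String[] args) {
        // the link #9 discrepancy: both forms reduce to the same key
        System.out.println(key("cs.umass.edu/index.html"));   // prints cs.umass.edu
        System.out.println(key("http://cs.umass.edu/"));      // prints cs.umass.edu
    }
}
```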

Although the links may be displayed differently, there is a considerable amount of overlap between the two results. The links might not be in the same order, but there are chunks of links that are almost identical. For example, lines 34-48 from the wget results are links to the same pages as lines 56-70 from the Java crawler. Other chunks of links that match like this are wget 50-53 and crawler 71-74, and wget 13-25 and crawler 15-27. It is only after the top 50 in the wget list that the links start to differ greatly. For example, in the wget list, most of the links in the 90s are in the form of people.cs.umass.edu/~keikham/papers or ciir.cs.umass.edu/~dietz/, while for the crawler they are mostly cs.umass.edu/talks or publications.
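
Matching chunks like these can be found mechanically by recording where each shared link sits in both lists. Below is a minimal sketch, assuming both lists have already been reduced to the same textual form. ListOverlap, sharedPositions, and the sample entries are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for lining up two URL lists.
class ListOverlap {

    // Returns {i, j} pairs (1-based) where entry i of `first` equals entry j of `second`.
    static List<int[]> sharedPositions(List<String> first, List<String> second) {
        Map<String, Integer> indexInSecond = new HashMap<>();
        for (int j = 0; j < second.size(); j++) {
            indexInSecond.putIfAbsent(second.get(j), j + 1);  // keep the first occurrence
        }
        List<int[]> pairs = new ArrayList<>();
        for (int i = 0; i < first.size(); i++) {
            Integer j = indexInSecond.get(first.get(i));
            if (j != null) {
                pairs.add(new int[] {i + 1, j});
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<String> wgetList = List.of("a", "b", "c", "d");
        List<String> crawlerList = List.of("x", "b", "c", "y", "a");
        for (int[] p : sharedPositions(wgetList, crawlerList)) {
            System.out.println(p[0] + " " + p[1]);
        }
    }
}
```

A run of pairs whose two positions differ by a constant corresponds to one matching chunk. For wget 34-48 and crawler 56-70, for example, every pair would differ by 22.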

The differences in links are probably due to the different ways wget and the Java crawler crawl a website. My program starts with the first link it is given (ciir.cs.umass.edu), crawls the page, finds all the unique links, and adds them to the unvisited queue. It then takes the next link in the queue and repeats this process. wget, on the other hand, crawls recursively. wget also, by default, respects robots.txt and parses the HTML of web pages. Both are done by hand in the Java crawler, and many exceptional cases are not taken care of. These differences in how each crawler crawls and processes web pages cause the differences in links. This is why the first few links are identical: the crawlers start at the same point, but after that, links are visited in the order dictated by each crawler's algorithm, resulting in quite a different set of links near the end of the list.


5. References
wget flags: http://www.gnu.org/software/wget/manual/wget.html
Queue: http://docs.oracle.com/javase/7/docs/api/java/util/Queue.html
BufferedReader: http://docs.oracle.com/javase/7/docs/api/java/io/BufferedReader.html
Check if a page exists: http://stackoverflow.com/questions/1378199/how-to-check-if-a-url-exists-or-returns-404-with-java
String.matches: http://stackoverflow.com/questions/2275004/in-java-how-to-check-if-a-string-contains-a-substring-ignoring-the-case
Pausing a program: http://docs.oracle.com/javase/tutorial/essential/concurrency/sleep.html


6. Source Code

WebCrawler.java

import java.io.*;
import java.net.*;
import java.util.*;
import java.util.regex.*;


public class WebCrawler{ //a public class must have the same name as its file, WebCrawler.java

	static String processLink (URL host, String link) throws Exception{
		URL root = host; 
		link=link.replace("www.", "");//remove www. (replace is literal; replaceAll would treat '.' as any character)
		link=link.replace("%7E","~"); //%7E == ~
		root= new URL(root.toString().substring(0, root.toString().lastIndexOf("/")));

		if(link.startsWith("http"))
			return link; //is ok
		if(link.startsWith("../")){ //need host
			//System.out.println("*****"+link);
			link = link.substring(3,link.length());
			String a= processLink(root,link); //is it okay now?
			//System.out.println("****XX** "+a);
			return a;
		}
		if(link.startsWith("/")){ //need host
			link = link.substring(1,link.length());
			return root+"/"+link;
		}
		else{
			return root+"/"+link;
		}

	}

	static boolean checkRobots(URL url) throws InterruptedException{
		//getting the hostname; getHost() also works when the url has no '/' after the host
		String link=url.toString();
		String root = "https://"+url.getHost();

		System.out.println("Host "+root);
		System.out.println("Link "+link);

		int delay = 0;
		boolean crawlDelay=false;

		URL robot;
		try{
			robot = new URL(root+"/robots.txt");
		} catch (MalformedURLException e){
			return false;
		}

		System.out.println("Robots File: "+robot.toString());
		System.out.println("-------------------------------------");

		//looking at the robots.txt
		try{
			BufferedReader robotstxt = new BufferedReader(new InputStreamReader(robot.openStream())); 

			String line="";
			while(null != (line = robotstxt.readLine())){  //what if it is moved?

				//System.out.println("ROBOTS: "+line);

				if(line.startsWith("Disallow")){
					//take everything after the colon so that "Disallow:" with no value does not throw
					line = line.substring(line.indexOf(':') + 1).trim();

					//an empty Disallow means nothing is disallowed
					if(line.isEmpty())
						continue;

					//literal substring test; the path is not a regular expression
					if(line.equals("/") || link.contains(line)){
						System.out.println("ROBOTS: "+line);
						System.out.println("*STATUS: CANNOT CRAWL!");
						return false;
					}
				}

				if(line.startsWith("Crawl-delay")){
					System.out.println("ROBOTS: "+line);
					crawlDelay=true;
					delay = Integer.parseInt(line.substring(line.indexOf(':') + 1).trim());
					System.out.println("*STATUS: CRAWL DELAY FOUND: "+delay +" SEC");
				}
			}
			robotstxt.close();
		} catch (IOException e) { // no robots.txt
			System.out.println("*STATUS: NO ROBOTS. SAFE TO CRAWL");
			return true;
		}

		if(crawlDelay==true){
			System.out.println("*STATUS: DEALYING CRAWL BY: " + delay +" SEC");
			Thread.sleep(delay * 1000);
		}
		else{
			System.out.println("*STATUS: DEALYING CRAWL BY: 5 SEC");
			Thread.sleep(5 * 1000);
		}

		System.out.println("*STATUS: SAFE TO CRAWL");
		return true;

	}

	public static void main(String args[]) throws Exception{

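		//Queue<URL> visited = new LinkedList<URL>();
		//typed queues are required for "URL look = unvisited.poll()" below to compile
		Queue<URL> unvisited = new LinkedList<URL>();
		Queue<URL> unique = new LinkedList<URL>();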

		BufferedWriter writer = new BufferedWriter(new FileWriter("url.txt"));

		//start with crawling ciir.cs.umass.edu
		URL start = new URL("http://ciir.cs.umass.edu/");
		unvisited.add(start);
		unique.add(start);

		while(!unvisited.isEmpty()){ //there are unvisited links

			URL look = unvisited.poll();

			System.out.println("*STATUS: WANT TO CRAWL..." +look);
			System.out.println("*STATUS: CHECKING ROBOTS...\n");

			if(checkRobots(look)){ //crawling is allowed
				System.out.println("*STATUS: CRAWLING...");
				System.out.println("-------------------------------------");

				//visited.add(look); //add to visited links list
				BufferedReader reader = new BufferedReader(new InputStreamReader(look.openStream()));
				String line = "";

				while(null != (line = reader.readLine())){

					//System.out.println(line); //this will print out the html code for the url
					Pattern p = Pattern.compile("<a\\s+href\\s*=\\s*\"?(.*?)[\"|>]"); //group 1 captures the href value of an <a> tag
					Matcher m = p.matcher(line);


					while(m.find()){

						String link = m.group(1).trim(); //remove white space

						if(link.matches(".*php.*")){
							//System.out.println("REJECT: "+link);
							continue;}
						if (link.length()<1 || link.charAt(0)=='#' || link.startsWith("mailto") || link.endsWith("jpg") || link.endsWith("png") || link.endsWith("gif") || link.endsWith("text") || link.endsWith("txt") || link.endsWith("ps")){ //|| short-circuits, so charAt(0) is never called on an empty link
							continue; //ignore
						}
						else{
							//
							//if(link.matches(".*pdf") | link.matches(".*htm.*")){ //find only pdf, htm, html
							//System.out.println(link);
							link = processLink(look, link); //take care of ../ 
							//System.out.println(link);

							if(link.contains("cs.umass.edu")){ //literal test; in a regex each '.' would match any character

								URL found = new URL(link);

								//does the link exist?
								try{
									HttpURLConnection huc =  ( HttpURLConnection )  found.openConnection (); 
									huc.setRequestMethod ("GET");  //OR  huc.setRequestMethod ("HEAD"); 
									huc.connect () ; 
									int code = huc.getResponseCode() ;
									//System.out.println(code);
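									if (code==404){
										System.out.println("**BROKEN LINK: "+code + " " +found);
										continue; //skip this link but keep scanning the rest of the line
									}
								}catch (IOException e) {
									System.out.println("**BROKEN LINK: " +found);
									continue; //skip this link but keep scanning the rest of the line
								}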

								//if the link has not been seen before. equivalent to (!unvisited.contains(link) && !visited.contains(link))
								//also take care of things like ciir.umass.edu/~hi and cirr.umass.edu/~hi/
								if(!unique.contains(found) && !unique.contains(new URL(found+"/"))){
									System.out.println("*STATUS: UNIQUE LINK FOUND: " +found);
									unvisited.add(found);
									if(unique.size()>=100){ //if we have 100 links, break out (for the purpose of this assignment)
										System.out.println("*STATUS: 100 UNIQUE LINKS FOUND; EXITING PROGRAM.");
										writer.close();
										System.exit(0);
									}
									unique.add(found);
									//System.out.println("*STATUS: WRITING " +found + " TO url.txt");
									writer.write(found.toString());
									writer.newLine();
									//System.out.println(found);
								}
							}//if in cs.umass.edu
						}//if a page
					}//while there is a link
				}//while a line exist
				System.out.println("-------------------------------------");
				reader.close();
			}//if robot allows
			//Thread.sleep(5 * 1000);
		} //while unvisited is not empty

	}

}


7. Sample Output

*STATUS: WANT TO CRAWL...http://ciir.cs.umass.edu/
*STATUS: CHECKING ROBOTS...

Host https://ciir.cs.umass.edu
Link http://ciir.cs.umass.edu/
Robots File: https://ciir.cs.umass.edu/robots.txt
-------------------------------------
*STATUS: DEALYING CRAWL BY: 5 SEC
*STATUS: SAFE TO CRAWL
*STATUS: CRAWLING...
-------------------------------------
*STATUS: UNIQUE LINK FOUND: http://ciir.cs.umass.edu/index.html
*STATUS: UNIQUE LINK FOUND: http://ciir.cs.umass.edu/about/index.html
*STATUS: UNIQUE LINK FOUND: http://iesl.cs.umass.edu/
*STATUS: UNIQUE LINK FOUND: http://ciir.cs.umass.edu/publications/index.html
*STATUS: UNIQUE LINK FOUND: http://ciir.cs.umass.edu/membership/member.html
*STATUS: UNIQUE LINK FOUND: http://ciir.cs.umass.edu/personnel/index.html
*STATUS: UNIQUE LINK FOUND: http://ciir.cs.umass.edu/whatsnew/index.html
*STATUS: UNIQUE LINK FOUND: http://ciir.cs.umass.edu/contact/index.html
*STATUS: UNIQUE LINK FOUND: http://cs.umass.edu/
-------------------------------------
*STATUS: WANT TO CRAWL...http://ciir.cs.umass.edu/index.html
*STATUS: CHECKING ROBOTS...

Host https://ciir.cs.umass.edu
Link http://ciir.cs.umass.edu/index.html
Robots File: https://ciir.cs.umass.edu/robots.txt
-------------------------------------
*STATUS: DEALYING CRAWL BY: 5 SEC
*STATUS: SAFE TO CRAWL
*STATUS: CRAWLING...
-------------------------------------
-------------------------------------
*STATUS: WANT TO CRAWL...http://ciir.cs.umass.edu/about/index.html
*STATUS: CHECKING ROBOTS...

Host https://ciir.cs.umass.edu
Link http://ciir.cs.umass.edu/about/index.html
Robots File: https://ciir.cs.umass.edu/robots.txt
-------------------------------------
*STATUS: DEALYING CRAWL BY: 5 SEC
*STATUS: SAFE TO CRAWL
*STATUS: CRAWLING...
-------------------------------------
*STATUS: UNIQUE LINK FOUND: http://ciir.cs.umass.edu/research/irlab/index.html
*STATUS: UNIQUE LINK FOUND: http://ciir.cs.umass.edu/research/mir/mir.html
*STATUS: UNIQUE LINK FOUND: http://cs.umass.edu/~croft
*STATUS: UNIQUE LINK FOUND: http://cs.umass.edu/~allan
*STATUS: UNIQUE LINK FOUND: http://cs.umass.edu/~mccallum/
*STATUS: UNIQUE LINK FOUND: http://ciir.cs.umass.edu/~manmatha/
*STATUS: UNIQUE LINK FOUND: http://cs.umass.edu/~marlin
*STATUS: UNIQUE LINK FOUND: http://cs.umass.edu/~wallach
*STATUS: UNIQUE LINK FOUND: http://ciir.cs.umass.edu/projects/index.html
*STATUS: UNIQUE LINK FOUND: http://ciir.cs.umass.edu/research/completed.html
*STATUS: UNIQUE LINK FOUND: http://ciir.cs.umass.edu/membership/member_list.html
*STATUS: UNIQUE LINK FOUND: http://ciir.cs.umass.edu/about/history.html
-------------------------------------
*STATUS: WANT TO CRAWL...http://iesl.cs.umass.edu/
*STATUS: CHECKING ROBOTS...

Host https://iesl.cs.umass.edu
Link http://iesl.cs.umass.edu/
Robots File: https://iesl.cs.umass.edu/robots.txt
-------------------------------------
*STATUS: NO ROBOTS. SAFE TO CRAWL
*STATUS: CRAWLING...
-------------------------------------
-------------------------------------
*STATUS: WANT TO CRAWL...http://ciir.cs.umass.edu/publications/index.html
*STATUS: CHECKING ROBOTS...

Host https://ciir.cs.umass.edu
Link http://ciir.cs.umass.edu/publications/index.html
Robots File: https://ciir.cs.umass.edu/robots.txt
-------------------------------------
*STATUS: DEALYING CRAWL BY: 5 SEC
*STATUS: SAFE TO CRAWL
*STATUS: CRAWLING...
-------------------------------------
-------------------------------------
*STATUS: WANT TO CRAWL...http://ciir.cs.umass.edu/membership/member.html
*STATUS: CHECKING ROBOTS...

Host https://ciir.cs.umass.edu
Link http://ciir.cs.umass.edu/membership/member.html
Robots File: https://ciir.cs.umass.edu/robots.txt
-------------------------------------
*STATUS: DEALYING CRAWL BY: 5 SEC
*STATUS: SAFE TO CRAWL
*STATUS: CRAWLING...
-------------------------------------
-------------------------------------
*STATUS: WANT TO CRAWL...http://ciir.cs.umass.edu/personnel/index.html
*STATUS: CHECKING ROBOTS...

Host https://ciir.cs.umass.edu
Link http://ciir.cs.umass.edu/personnel/index.html
Robots File: https://ciir.cs.umass.edu/robots.txt
-------------------------------------
*STATUS: DEALYING CRAWL BY: 5 SEC
*STATUS: SAFE TO CRAWL
*STATUS: CRAWLING...
-------------------------------------
*STATUS: UNIQUE LINK FOUND: http://cs.umass.edu/~dietz/
*STATUS: UNIQUE LINK FOUND: http://cs.umass.edu/~keikham/
*STATUS: UNIQUE LINK FOUND: http://ciir.cs.umass.edu/personnel/jean.html
*STATUS: UNIQUE LINK FOUND: http://balder.cs.umass.edu/~kate/Site/Movie.html
*STATUS: UNIQUE LINK FOUND: http://ciir.cs.umass.edu/personnel/stowell.html
*STATUS: UNIQUE LINK FOUND: http://cs.umass.edu/~elif
*STATUS: UNIQUE LINK FOUND: http://cs.umass.edu/~kedarb
*STATUS: UNIQUE LINK FOUND: http://cs.umass.edu/~efcan
*STATUS: UNIQUE LINK FOUND: http://ciir.cs.umass.edu/~jdalton
*STATUS: UNIQUE LINK FOUND: http://cs.umass.edu/~vdang
*STATUS: UNIQUE LINK FOUND: http://cs.umass.edu/~sjh
*STATUS: UNIQUE LINK FOUND: https://cs.umass.edu/~kriste/
*STATUS: UNIQUE LINK FOUND: http://cs.umass.edu/~narad
*STATUS: UNIQUE LINK FOUND: http://people.cs.umass.edu/~sameer/
*STATUS: UNIQUE LINK FOUND: http://cs.umass.edu/~mwick
*STATUS: UNIQUE LINK FOUND: http://cs.umass.edu/~zeki
*STATUS: UNIQUE LINK FOUND: http://cs.umass.edu/~lmyao
-------------------------------------
*STATUS: WANT TO CRAWL...http://ciir.cs.umass.edu/whatsnew/index.html
*STATUS: CHECKING ROBOTS...

Host https://ciir.cs.umass.edu
Link http://ciir.cs.umass.edu/whatsnew/index.html
Robots File: https://ciir.cs.umass.edu/robots.txt
-------------------------------------
*STATUS: DEALYING CRAWL BY: 5 SEC
*STATUS: SAFE TO CRAWL
*STATUS: CRAWLING...
-------------------------------------
*STATUS: UNIQUE LINK FOUND: http://ciir.cs.umass.edu/whatsnew/awards.html
*STATUS: UNIQUE LINK FOUND: https://people.cs.umass.edu/~kriste/
*STATUS: UNIQUE LINK FOUND: http://ciir.cs.umass.edu/jobs
*STATUS: UNIQUE LINK FOUND: http://cs.umass.edu/news/latest-news/eric-brown-and-watson-take-jeopardy-challenge
*STATUS: UNIQUE LINK FOUND: http://ciir.cs.umass.edu/whatsnew/query-evolution-france.pdf
*STATUS: UNIQUE LINK FOUND: http://ciir.cs.umass.edu/personnel/dasmith.html
*STATUS: UNIQUE LINK FOUND: http://cs.umass.edu/csinfo/announce/msrfellows06.html
*STATUS: UNIQUE LINK FOUND: http://ciir.cs.umass.edu/research/wordspotting/
*STATUS: UNIQUE LINK FOUND: http://cs.umass.edu/csinfo/announce/croft_saltonaward.html
-------------------------------------
*STATUS: WANT TO CRAWL...http://ciir.cs.umass.edu/contact/index.html
*STATUS: CHECKING ROBOTS...

Host https://ciir.cs.umass.edu
Link http://ciir.cs.umass.edu/contact/index.html
Robots File: https://ciir.cs.umass.edu/robots.txt
-------------------------------------
*STATUS: DEALYING CRAWL BY: 5 SEC
*STATUS: SAFE TO CRAWL
*STATUS: CRAWLING...
-------------------------------------
*STATUS: UNIQUE LINK FOUND: http://cs.umass.edu/about/visiting-department
-------------------------------------
*STATUS: WANT TO CRAWL...http://cs.umass.edu/
*STATUS: CHECKING ROBOTS...

Host https://cs.umass.edu
Link http://cs.umass.edu/
Robots File: https://cs.umass.edu/robots.txt
-------------------------------------
ROBOTS: Crawl-delay: 10
*STATUS: CRAWL DELAY FOUND: 10 SEC
*STATUS: DEALYING CRAWL BY: 10 SEC
*STATUS: SAFE TO CRAWL
*STATUS: CRAWLING...
-------------------------------------
*STATUS: UNIQUE LINK FOUND: https://cs.umass.edu/
-------------------------------------
*STATUS: WANT TO CRAWL...http://ciir.cs.umass.edu/research/irlab/index.html
*STATUS: CHECKING ROBOTS...

Host https://ciir.cs.umass.edu
Link http://ciir.cs.umass.edu/research/irlab/index.html
Robots File: https://ciir.cs.umass.edu/robots.txt
-------------------------------------
*STATUS: DEALYING CRAWL BY: 5 SEC
*STATUS: SAFE TO CRAWL
*STATUS: CRAWLING...
-------------------------------------
**BROKEN LINK: 404 http://ciir.cs.umass.edu/research/research/irlab/index.html
**BROKEN LINK: 404 http://ciir.cs.umass.edu/research/research/mir/mir.html
*STATUS: UNIQUE LINK FOUND: http://ciir.cs.umass.edu/research/irlab/irlabphds.html
*STATUS: UNIQUE LINK FOUND: http://ciir.cs.umass.edu/research/irlab/irlabpapers.html
-------------------------------------
*STATUS: WANT TO CRAWL...http://ciir.cs.umass.edu/research/mir/mir.html
*STATUS: CHECKING ROBOTS...

Host https://ciir.cs.umass.edu
Link http://ciir.cs.umass.edu/research/mir/mir.html
Robots File: https://ciir.cs.umass.edu/robots.txt
-------------------------------------
*STATUS: DEALYING CRAWL BY: 5 SEC
*STATUS: SAFE TO CRAWL
*STATUS: CRAWLING...
-------------------------------------
**BROKEN LINK: 404 http://ciir.cs.umass.edu/research/research/irlab/index.html
**BROKEN LINK: 404 http://ciir.cs.umass.edu/research/research/mir/mir.html
*STATUS: UNIQUE LINK FOUND: http://ciir.cs.umass.edu/irdemo/hw-demo/
-------------------------------------
*STATUS: WANT TO CRAWL...http://cs.umass.edu/~croft
*STATUS: CHECKING ROBOTS...

Host https://cs.umass.edu
Link http://cs.umass.edu/~croft
Robots File: https://cs.umass.edu/robots.txt
-------------------------------------
ROBOTS: Crawl-delay: 10
*STATUS: CRAWL DELAY FOUND: 10 SEC
*STATUS: DEALYING CRAWL BY: 10 SEC
*STATUS: SAFE TO CRAWL
*STATUS: CRAWLING...
-------------------------------------
*STATUS: UNIQUE LINK FOUND: http://ciir.cs.umass.edu/personnel/croftbio.pdf
-------------------------------------
*STATUS: WANT TO CRAWL...http://cs.umass.edu/~allan
*STATUS: CHECKING ROBOTS...

Host https://cs.umass.edu
Link http://cs.umass.edu/~allan
Robots File: https://cs.umass.edu/robots.txt
-------------------------------------
ROBOTS: Crawl-delay: 10
*STATUS: CRAWL DELAY FOUND: 10 SEC
*STATUS: DEALYING CRAWL BY: 10 SEC
*STATUS: SAFE TO CRAWL
*STATUS: CRAWLING...
-------------------------------------
-------------------------------------
*STATUS: WANT TO CRAWL...http://cs.umass.edu/~mccallum/
*STATUS: CHECKING ROBOTS...

Host https://cs.umass.edu
Link http://cs.umass.edu/~mccallum/
Robots File: https://cs.umass.edu/robots.txt
-------------------------------------
ROBOTS: Crawl-delay: 10
*STATUS: CRAWL DELAY FOUND: 10 SEC
*STATUS: DEALYING CRAWL BY: 10 SEC
*STATUS: SAFE TO CRAWL
*STATUS: CRAWLING...
-------------------------------------
*STATUS: UNIQUE LINK FOUND: http://cs.umass.edu/~mccallum/photos
*STATUS: UNIQUE LINK FOUND: http://cs.umass.edu/faculty/faculty-directory
*STATUS: UNIQUE LINK FOUND: http://cs.umass.edu/~mccallum/contact.html
*STATUS: UNIQUE LINK FOUND: http://cs.umass.edu/~mccallum/bio.html
*STATUS: UNIQUE LINK FOUND: http://cs.umass.edu/~mccallum/mccallum-vita.pdf
*STATUS: UNIQUE LINK FOUND: http://cs.umass.edu/~mccallum/pubs.html
*STATUS: UNIQUE LINK FOUND: http://cs.umass.edu/~mccallum/talks.html
*STATUS: UNIQUE LINK FOUND: http://cs.umass.edu/~mccallum/projects.html
**BROKEN LINK: 404 http://iesl.cs.umass.edu/people
*STATUS: UNIQUE LINK FOUND: http://cs.umass.edu/~mccallum/code.html
*STATUS: UNIQUE LINK FOUND: http://cs.umass.edu/~mccallum/data.html
*STATUS: UNIQUE LINK FOUND: http://cs.umass.edu/~mccallum/teaching.html
*STATUS: UNIQUE LINK FOUND: http://cs.umass.edu/~mccallum/personal.html
*STATUS: UNIQUE LINK FOUND: http://cs.umass.edu/~mccallum/people.html
*STATUS: UNIQUE LINK FOUND: http://factorie.cs.umass.edu/
*STATUS: UNIQUE LINK FOUND: http://cs.umass.edu/~mccallum/papers/ge08note.pdf
*STATUS: UNIQUE LINK FOUND: http://cs.umass.edu/~mccallum/papers/druck08sigir.pdf
*STATUS: UNIQUE LINK FOUND: http://cs.umass.edu/~mccallum/papers/acm-queue-ie.pdf
**BROKEN LINK: 404 http://cs.umass.edu/~mimno/icml100.html
-------------------------------------
*STATUS: WANT TO CRAWL...http://ciir.cs.umass.edu/~manmatha/
*STATUS: CHECKING ROBOTS...

Host https://ciir.cs.umass.edu
Link http://ciir.cs.umass.edu/~manmatha/
Robots File: https://ciir.cs.umass.edu/robots.txt
-------------------------------------
*STATUS: DEALYING CRAWL BY: 5 SEC
*STATUS: SAFE TO CRAWL
*STATUS: CRAWLING...
-------------------------------------
**BROKEN LINK: 404 http://ciir.cs.umass.edu/~mmedia/index.html
*STATUS: UNIQUE LINK FOUND: http://ciir.cs.umass.edu/~manmatha/papers/sigir03.pdf
*STATUS: UNIQUE LINK FOUND: http://ciir.cs.umass.edu/~manmatha/papers/SenSys08.pdf
*STATUS: UNIQUE LINK FOUND: http://ciir.cs.umass.edu/~manmatha/research.html
*STATUS: UNIQUE LINK FOUND: http://ciir.cs.umass.edu/~manmatha/mmpapers.html
**BROKEN LINK: 404 http://ciir.cs.umass.edu/~mmedia/
-------------------------------------
*STATUS: WANT TO CRAWL...http://cs.umass.edu/~marlin
*STATUS: CHECKING ROBOTS...

Host https://cs.umass.edu
Link http://cs.umass.edu/~marlin
Robots File: https://cs.umass.edu/robots.txt
-------------------------------------
ROBOTS: Crawl-delay: 10
*STATUS: CRAWL DELAY FOUND: 10 SEC
*STATUS: DEALYING CRAWL BY: 10 SEC
*STATUS: SAFE TO CRAWL
*STATUS: CRAWLING...
-------------------------------------
-------------------------------------
*STATUS: WANT TO CRAWL...http://cs.umass.edu/~wallach
*STATUS: CHECKING ROBOTS...

Host https://cs.umass.edu
Link http://cs.umass.edu/~wallach
Robots File: https://cs.umass.edu/robots.txt
-------------------------------------
ROBOTS: Crawl-delay: 10
*STATUS: CRAWL DELAY FOUND: 10 SEC
*STATUS: DEALYING CRAWL BY: 10 SEC
*STATUS: SAFE TO CRAWL
*STATUS: CRAWLING...
-------------------------------------
*STATUS: UNIQUE LINK FOUND: http://cs.umass.edu/courses/f12/cmpsci691bm/
*STATUS: UNIQUE LINK FOUND: http://cs.umass.edu/courses/s12/cmpsci240/
*STATUS: UNIQUE LINK FOUND: http://cs.umass.edu/workshops/nips2011css/
*STATUS: UNIQUE LINK FOUND: http://cs.umass.edu/courses/s11/cmpsci791ss/
*STATUS: UNIQUE LINK FOUND: http://cs.umass.edu/workshops/nips2010css/
*STATUS: UNIQUE LINK FOUND: http://cs.umass.edu/research.pdf
*STATUS: UNIQUE LINK FOUND: http://people.cs.umass.edu/~aschein/
**BROKEN LINK: 404 http://cs.umass.edu/~mday/
*STATUS: UNIQUE LINK FOUND: http://cs.umass.edu/~pkrafft/
*STATUS: UNIQUE LINK FOUND: http://cs.umass.edu/~jingyi/
*STATUS: UNIQUE LINK FOUND: http://cs.umass.edu/~ravali/
*STATUS: UNIQUE LINK FOUND: http://cs.umass.edu/~jmoore/
*STATUS: UNIQUE LINK FOUND: http://cs.umass.edu/~abakalov/
*STATUS: UNIQUE LINK FOUND: http://cs.umass.edu/theses/wallach_phd_thesis.pdf
*STATUS: UNIQUE LINK FOUND: http://cs.umass.edu/publications/passos11correlations.pdf
*STATUS: UNIQUE LINK FOUND: http://cs.umass.edu/publications/mimno09polylingual.pdf
*STATUS: UNIQUE LINK FOUND: http://cs.umass.edu/publications/mimno08gibbs.pdf
*STATUS: UNIQUE LINK FOUND: http://cs.umass.edu/publications/wallach06topic.pdf
*STATUS: UNIQUE LINK FOUND: http://cs.umass.edu/talks/2013-07-03_MSR_NE.pdf
*STATUS: UNIQUE LINK FOUND: http://cs.umass.edu/talks/2013-05-01_NEML.pdf
*STATUS: UNIQUE LINK FOUND: http://cs.umass.edu/talks/ml_intro.pdf
*STATUS: UNIQUE LINK FOUND: http://cs.umass.edu/talks/2011-07-31_JSM.pdf
*STATUS: UNIQUE LINK FOUND: http://cs.umass.edu/talks/2011-04-05_JHU.pdf
*STATUS: UNIQUE LINK FOUND: http://cs.umass.edu/talks/2011-04-05_CLSP.pdf
*STATUS: UNIQUE LINK FOUND: http://cs.umass.edu/talks/learning_dbns.pdf
*STATUS: UNIQUE LINK FOUND: http://cs.umass.edu/talks/jitp.pdf
*STATUS: UNIQUE LINK FOUND: http://cs.umass.edu/talks/2010-04-05_UMass.pdf
*STATUS: 100 UNIQUE LINKS FOUND; EXITING PROGRAM.