Web Research FAQ v.1.1
Section 1 of 3
Archive-name: internet/web-research-faq
Posting-Frequency: monthly
Last-modified: variable:date
URL: http://spireproject.com
Copyright: (c) 2000 David Novak
Maintainer: David Novak

Web Research FAQ

Welcome. This FAQ introduces the concepts and tools of web research. Attention is focussed on how web research fits into the larger field of information research, but web research has many peculiarities all its own, quite obscure to the new user.

This FAQ resides at SpireProject.com/webfaq.txt, SpireProject.co.uk/webfaq.txt and http://cn.net.au/webfaq.txt

This FAQ is just a small part of a much larger effort to help you with information research. The Spire Project is available as 3 websites, mirrors, a zip-file, and 3 other FAQs (see our larger Information Research FAQ - http://spireproject.com/faq.txt). I have included here text versions of relevant webpages and sections of our other FAQs.

Enjoy,
David Novak - david@cn.net.au
The Spire Project : SpireProject.com, SpireProject.co.uk, Cn.net.au

Contents
  Web research starts with a vision.
  Don't Try to Search Everything
  Internet Information Theory
  ----- Articles from The Spire Project
  Finding A Webpage
  Discussion Groups
  ----- Excerpt from the Information Research FAQ
  What is Information Research?
  ----- Acknowledgements

___________________________________________________
Web research starts with a vision.

The web is at the heart of a dramatic change to the information we receive. Even if we never touch the web, it will still affect what we read, because it so radically opens the established channels of information to new voices and new competition.

Web research, for those who wish to learn, is about finding more information, better information, & better answers from what exists on the web.

There is a great deal of research about the relative strengths of search-engines and the techniques to maximize their use. I do not follow this discourse closely.
Consider reading the reviews offered by SearchEngineWatch (www.searchenginewatch.com) for this. My interest and excitement come from developing and transplanting search techniques of a stronger nature. This FAQ covers topics like:

 - Boolean & field searches,
 - Anticipate what exists,
 - Judge webpage quality quickly,
 - Move to relevant nexus points,
 - An awareness of larger structures in the web,
 - Ask for directions,
 - Discussion archives as a search tool,
 - Information Clumps, so seek the clumps,
 - Information Research & its similarities.

Before the web, we all had our private library, supplemented by the local public library, corporate library and perhaps local bookstore to provide our information needs. Add our daily paper and perhaps a magazine or two (and our 30 minutes of news broadcast on TV) and you have the sum total of what met most of our research needs. There were other sources available, but they were overshadowed by the popularity and strength of these resources.

Today, with the web, we have a vast new tool at our disposal. We can choose instead to follow the stockmarket live, read news direct from the newswires, browse our local library AND the British Library concurrently. We can read widely divergent views on an incident or activity. Perhaps best, we now have a vast slew of people and organizations to assist us in getting the information we seek. Each has different aims and focus and experience and bias.

Navigation of the Internet is not simple. Those who tell you so are selling you something. Navigation (and research, a close cousin) depends on your experience & practical understanding of how information is distributed on the Internet. Let's start with some theory.

___________________________________________________
Don't Try to Search Everything

If a tree falls in a forest, and no one is around, does it make a sound? This is so relevant to the web. If a webpage exists, but no-one visits, does it really exist?
This is more the norm for information on the web. People enjoy publishing information - but for most webpages, promotion is minimal at best. I may inform a search-engine or two, or tell a relative, a forum or another relevant website. Let's look at what this means for research.

I inform a search-engine, so someone skilled in searching search-engines could find my webpage - provided they search for the right words. Unfortunately, from the point of view of a publisher, this is not a useful way to find readers. There are simply too many webpages and too few skilled researchers. If your webpage is fortunate enough to appear first on a popular search term, the search-engine may drive significant readers your way. Otherwise, you can expect very few visitors.

From the researcher's point of view, the search-engine does not give you a clue which of the many webpages will be useful to you. The usual practice is to look at the top 30 and leave the rest. Thus, finding a particular webpage, let's say the best webpage on travelling Western Australia, depends a bit on luck (being in the top 30) and a bit on the skill of the reader (in choosing words which accurately describe the better webpages).

When a writer informs a relative, they may tell a few other relatives, and some of this may snowball to a few more relatives or workmates. Again, it depends on the popularity of the webpage, but you should not expect considerable traffic. From the researcher's perspective, to find this webpage you must either speak/write to one of the relatives, or look for a link in a relative's webpage (provided one of them added a link).

Mentioning a webpage to a forum works similarly, but increases the likelihood a fine webpage will be linked into a collection of relevant webpages with existing traffic (perhaps some of the historically important documents - which we will explain later) and this in turn may snowball some.
From the researcher's perspective, we could find my webpage by asking the forum, searching past forum messages, or by looking at those relevant webpages which link to fine webpages (nexus points - again explained later).

Obviously there are more ways to promote than this. If you have money, you can purchase a top position in most of the search-engines. There is banner advertising. Promote away from the Internet too. But even from this rather lengthy example you can see that no single tool will ensure you find a webpage; no single technique is enough. Many webpages will just never be found.

Defeatist, you say? No, it's realism. Theoretically, you could find out everyone's first name with a telephone. In practice, you can't.

So, to one of the first myths of internet research: You are not searching the web when you search AltaVista, Yahoo or AllTheWeb. You are only searching a database/directory - an incomplete tool that does a poor job selecting the best from the rest.

The purpose of this narrative is to set the picture. In web research, we do not attempt to search everything. There is simply no good way to do this - and if we attempt to, we lose all hope of isolating the best information. No. We search the web in ways which provide good coverage AND link to quality information. We are, after all, searching the web FOR something. Technically speaking, it is unwise to formulate your question as "search the web for everything about..."

Oh, we will remain flexible and use blunt techniques when valuable or expedient, but our focus is on quality & depth. And this focus opens up a vast range of search techniques taking advantage of internet structures like nexus points and historically important documents. This focus vastly improves results.

Let's turn our attention now to understanding structure on the web.
___________________________________________________
Internet Information Theory

Let's agree the Internet is great fun to surf, but less valuable when you have a specific question in mind. To improve our search skills, we begin by understanding how information is arranged on the Internet.

Contrary to myth, information is not disorganized but rather organized very carefully along clear patterns. Many patterns are specific to the information format (text document, webpage, email message, printed article). Further patterns match the way we become aware of information, or are specific to the information systems (mailing list, faq, peer-reviewed journal). Your understanding of the strengths and weaknesses of each pattern, each format, each system, guides your search for information.

We shall start by shattering the Internet, and commenting on the many pieces.

__ 32.1 three definitions of the Internet

Let us be careful when we use the word 'Internet'.

1_ The Internet is a physical network; more than a million computers continuously exchanging information. The Internet allows us to transfer information around the world.

2_ The Internet is a landscape of information available on almost every topic imaginable. This information appears almost chaotically distributed to the world, but holds clear patterns. For instance, linking information together are various structures like government web links, search engines and FAQ documents.

3_ The Internet is a community of 100+ million individuals. These are real people who choose to interact, discuss and share information online.

What we learn here is not so important as the technique - break the large, seemingly chaotic system into smaller pieces: pieces that hopefully make more sense. Eventually, when we've made sense of the little bits, perhaps we can comment astutely on the big picture.

In this example, let me just draw your attention to the way most of our research effort focuses on the second definition: a landscape of information.
Much of the best information originates in the third definition: the Internet is a community. Sometimes it is far more effective to ask real people than to search the information cyberspace.

Let us now illuminate more important facets of the Internet.

__ 32.2 information, transaction, entertainment

There is a triad of functions to all online activity:

Function      - Activity - Unit
----------------------------------------
Information   - Research - The Fact or Conclusion
Exchange      - Business - The Transaction
Entertainment - Play     - The Experience

Each Internet function grows at a different rate and moves in a different direction. The development of forums is firmly in the smallest segment, dealing with information. This segment is quite poorly organized and confusing. The entertainment function in contrast is well financed and graphically innovative with clear, profitable opportunities.

Much of the web is prepared with Exchange or Entertainment in mind. "Brochureware" (purely promotional webpages) is rarely required for research, but is critical to securing a transaction. Entertainment-related, or just entertaining, websites abound. Let us recognize just how few webpages are information & research related.

My own experience suggests we are just beginning to see the movements towards profiting from providing information. Direct sale of information is still chaotic and unrewarding.

__ 32.3 information formats

The way information is packaged has a great bearing on the content, quality and use of the information. This theme is evident throughout the work of The Spire Project, and is particularly applicable to Internet information.

Webpages, text files, software, email and database entries each have particular qualities. Each shapes, constrains and restricts the informative content. These particular qualities apply irrespective of the information involved.

Books are dense, factual, a little old. Articles are short, sharp, more recent. News is puff, introductory, immediate.
Each format presents its information to set standards.

Information formats on the Internet are the same. Webpages are graphical, technical to produce, and not easily updated. FAQs are easier to maintain, text only, and attract more peer review. Mailing lists are simpler still, text, short, immediate, very peer-reviewed, characterized by discussion and resource discovery. Newsgroups are characterized by extremely low costs, vulnerable to trashing, poorly managed. Email is simple to use, one-to-one discussion.

Let's look at books more closely. Books are created by authors who have something to write. Books are printed and marketed by publishers to the bookstores that then provide them to the readers. Each facet of this process defines the resource. Books have quality, editorial vetting but minimal peer-review, marketable value and a potentially lengthy preparation time.

When it comes to research, why look for a book when investigating digital money? Books would just have the wrong qualities - would present the information poorly. We need a more current format (digital money is a fast moving topic), and a more peer-reviewed format (books have editorial vetting, but not intrinsic peer-review). Why not search for a mailing list, an FAQ, or an association website? These formats have qualities more appropriate to our question.

__ 32.4 information preparation

Information flows also impress patterns on Internet information. Most information is transplanted to the web - first created elsewhere. The source of information imparts as much pattern as the eventual format the information takes. Information may appear as a webpage, and conform to our expectations for all webpages, but the information may have been prepared from the discussion on a mailing list - and thus enjoy a more topical, specific, timely and peer-reviewed quality.

Let's look at FAQs.
The best resource in the world on copyright law is the musings of a group of copyright lawyers who form the copyright mailing list. The copyright FAQ supported by this group is a logical document summarizing much of the discussion of this mailing list. FAQs are vetted by the news.answers team, then automatically mirrored around the world. From its origins in the mailing list, the FAQ is a peer-reviewed document, often full of links to further resources, topical, knowledgeable and factual. As an FAQ, the document is not immediate, graphical or financially rewarding (some FAQs stagnate).

Only some Internet information is created within the Internet environment. The concept of 'brochureware' describes the common traits of promotional webpages directly prepared from paper promotional brochures.

One of the more exciting trends is the movement of information from the dusty shelves of government offices and association libraries to their more accessible websites. The quality of information retained in your average government agency, from quality research reports, to detailed studies, to current industry monitoring, is very high. These qualities are then brought over to the web format. Such web-documents tend to be isolated (not linked to other related resources) and perhaps a little behind the times, but of a generally high quality.

An exciting holistic view of the Internet information landscape is based on these descriptions. Imagine, for a moment, information flowing through a collection of systems. At certain points, information groups together, and generates new, perhaps higher quality information, which then flows in a different system, a different direction, to different people.

The flow of information from one person to another, from one format to another, imprints qualities on the information along the way. Each organization, or subsequent re-organization, imparts specific styles and conventions and quality to the result.
__ 32.5 publishing motivation

Let us proceed to a third set of patterns. Information appears on the Internet for one very specific reason. Someone Publishes (DUH). The motivation behind publishing colours the information. These are patterns we will use to better search for answers on the web. Ask yourself who is publishing, and why.

One of the biggest publishing segments a year ago was individuals publishing documents derived from their personal expertise. A typical document would be one with minimal peer review, a list of aging links to further resources, simple graphics, variable to short length, prone to bias, but moderately reliable because the publisher knows their topic well. These pages are often located in private sub-directories (usually starting /~name/).

Commercial sites publish mainly for the promotional value. Their secondary purpose is to provide sales information to prospective clients. Rarely do commercial sites go beyond this. Commercial webpages often reside on their own domain name, as a .com, or in sub-directories - without the tilde symbol. Commercial sites also tend to age badly. They are very noticeable from their front page.

Government agencies are emerging as valued publishers. Slowly their dormant information becomes available through this new medium. Currently almost all government documents on the Internet also appear in print, meaning they are factual, exhaustively reviewed, tend to be a little old (but age well), and come from highly paid knowledgeable people who believe it is their duty to inform others. Such documents are lengthy and appear on .gov domains. These patterns are simple to see.

Grant-funded projects create brilliant research resources and hold much promise in pushing the limits of this technology. I am eager to see the results of the US Patents project, and appreciate the value of having Supreme Court rulings on the Internet. Often such projects are short on money but deeply focused on content.
Most projects reside on educational servers and are widely discussed within knowledgeable groups.

Associations publish association-kind-of-things. Most are initially just like the commercial webpages, but with time become much more factual and research-worthy. Most associations are dedicated to developing awareness of their chosen topic, albeit coloured by their chosen bias. Few associations are significant publishers yet, but this segment will begin to liberate dormant information within associations.

Let's summarize. The key is to always watch who is the publisher. We can assume a great deal, quickly. We are unlikely to find the latest changes to patent law from government or commercial publishers. Such organizations are simply not motivated to present such information.

__ 32.6 promoting information

Publishing is one achievement, but you and I will never read any information until we learn it exists. This simple fact creates even more patterns to Internet information. Knowledge of information moves through set routes on its way from writer to reader.

Promotion is not simple. It is a process that takes time, effort and perhaps money. Information without serious promotion tends not to be promoted far from the source. Another way to phrase this: you must search close to the source to find poorly promoted information.

A search engine indexes pages relatively indiscriminately. This also means a site of quality is not likely to reach your attention. The odds are not good, and from a promotion point of view, search engines generate minimal traffic to your webpage. Search engines drop you rather randomly into a website. It is often necessary to move up a directory to understand the purpose and motivation of a site you find interesting.

Information published through advertising tends to have a financial payoff for the promoter. This kind of information tends to be promotional information. Brochureware.
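Two of the address-based habits described so far lend themselves to a small sketch: moving up a directory from a page a search engine dropped you into, and guessing the kind of publisher from the URL patterns listed in 32.5 (tilde directories, .com, .gov, educational servers). The following is only an illustration in Python. The function names parent_urls and guess_publisher, the example.* addresses, and the exact matching rules are assumptions made for this sketch, not part of The Spire Project, and the guesses are rough rules of thumb that many real sites will not fit.

```python
from urllib.parse import urlsplit, urlunsplit


def parent_urls(url):
    """Return the parent directory URLs of `url`, nearest first (hypothetical helper)."""
    parts = urlsplit(url)
    # Keep only the non-empty path segments, e.g. ["~jane", "notes", "page.htm"].
    segments = [s for s in parts.path.split("/") if s]
    parents = []
    # Drop one trailing segment at a time until only the site root is left.
    while segments:
        segments.pop()
        path = "/" + "/".join(segments)
        if segments:
            path += "/"
        parents.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return parents


def guess_publisher(url):
    """Guess the publisher type from the address alone (assumed, simplified rules)."""
    parts = urlsplit(url)
    labels = (parts.hostname or "").lower().split(".")
    # A tilde directory suggests a personal page, whatever server hosts it.
    if "/~" in parts.path:
        return "personal"
    if "gov" in labels:
        return "government"
    if "edu" in labels:
        return "educational"
    if "com" in labels:
        return "commercial"
    return "unknown"


if __name__ == "__main__":
    page = "http://example.edu/~jane/notes/page.htm"
    print(guess_publisher(page))
    for parent in parent_urls(page):
        print(parent)
```

Visiting each parent in turn, nearest first, is the programmatic version of trimming the address by hand in the browser. A result of "unknown" simply means the address gives no hint and the page itself must be judged.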
The alternatives are to promote a webpage or website through one of the referral tools. Each such tool accepts links on some criterion. Each tool you use to locate information also selects particular types of information for your attention.

If you arrive at a document by recommendation through a mailing list, the document is likely to be recent, on-topic, and specific to the purpose of the mailing list. Alternatively, (for poor mailing lists) it will be wildly off topic and trash. You are unlikely to see referrals to old documents or documents of historical importance. Recency and topicality are the qualities most acceptable to the mailing list environment.

Directory trees, FAQs, guidebooks and related promotion tools all work as historically important documents. In the past, such resources listed, described and alerted people to relevant information for the field. Slowly, over time, this function becomes acknowledged, reinforced and promoted. Time is the essence of this fame.

Webpages or websites found through historically important documents, by their nature, tend to be long lasting websites with lasting importance in the field. Such documents point to other similar documents or websites that have achieved a long-lasting importance. You are unlikely to find specific documents, but rather sites that focus or bring together information. In short, there is little motivation to link to specific webpages, when a link to important websites is considered just as good.

Similar generalizations can be made of each type of promotional tool. They become important in rapidly seeking out information which matches our intention, as well as in summarizing the likely motivation - and bias - of webpages we are interested in.

__ 32.7 information clumps

Information Clumps. Information is created, nurtured, develops, gets transplanted, gets arranged and then becomes visible through a process which brings similar information together.
As we have discussed, there are factors deeply affecting all information on the Internet. Motivation, Preparation, Format and Promotion define the quality and content of any given item of information. With so many influences, we should not be surprised to learn information naturally groups together. In reality, there is nothing natural involved - it is a social phenomenon reinforced each time you and I visit or read one resource but not another.

History can explain some aspects of Internet development. As a small collection of sites become dominant in particular fields, by collecting and delivering better content to more people, new sites find it progressively more difficult to capture attention. This dynamic works for websites reaching out for visitors, and discussion groups reaching out for subscribers. In each case, seniority counts.

Seniority counts in several ways too. Promotion is directly related to quality, interest, traffic and time. The longer a site is active, the better the footpath develops, the more people visit. Secondly, quality content is directly related to access to quality content, peer review, and time/money. Important existing sites gain in every way.

This results in a grand system where the first-in, best-dressed, can capture the high ground and secure a grand lead in awareness and footpath over competitors who follow. Yahoo is a prime example of a directory tree, not even the best in most areas, which has achieved unparalleled traffic & awareness.

This competition is equally evident where no money is involved. Perhaps your association wishes to create a new referral website, or an open mailing list, or an informative guide. All sound concepts, effective projects. However, if older, established resources exist, the work will be long and arduous.

Despite the marketing message, the Internet is not a world where the best information floats to the top. The Internet will not let you reach millions.
You must compete for attention, participation, devotion and assistance in a manner very similar to building a business.

In concrete terms, information clumps on the Internet. The best resource could appear on any Internet system (webpages, email mailing lists, ftp-archives, faqs, online databases, newsgroups...) but we can be fairly certain the best information will congregate in just one or two.

Consider our article "Searching the Web" (http://cn.net.au/webpage.htm). We progressively search different web tools, looking for the most worthy. Searching the Internet is the same. You must touch each system to see which system is dominant, where the information is congregating for your topic.

__ 32.8 bringing this together

In summary, we have broken down and discussed various qualities of published information and promoted information. We have made sweeping generalizations and educated guesses about information on the Internet. Now what?

When a painter begins to paint, they have already visualized some of the image. They already have a concept of the finished result. Internet research is no different. We start by building a vision of the information we seek. Who would publish it? Where would I find it? What is its motivation? How would we find it? We now have a practical vision.

The address is the key. The URL for any item of information gives us a surprising amount of information - particularly now that we are making generalizations about information patterns. We can guess if information resides on a personal webpage, a funded university project, or a commercial project. The information resides on a .gov website? - the quality is likely to be higher and conform to our expectations of government resources.

We use this new-found experience in three ways. First, we restrict our searches to the most likely sources. Second, we quickly jump through lists of resources (such as those generated by search engines) to the sources that match our expectations.
Third, your understanding of the relative qualities of information guides your judgement of information value.

Internet newcomers often expect to have instant access to the latest information at the touch of a button, in beautiful colour and peer-reviewed quality prose. Who is publishing this? Where is this information coming from? Who would help us find this? Such a vision is fantasy.

If we were instead to look for an association website, dedicated to a certain type of research, or an informed newsgroup, maintained by people passionate about sharing this technology, then we have made four steps forward. We are clear about where to look for the answers we seek, and we will know quickly if the answers are online.

___________________________________________________
Two relevant webpages from The Spire Project, as text.

Searching the Web
-----------------

Webpages are often of unknown age, of only guessed-at quality and potentially the easiest information to retrieve. There are many points of entry to web resources, but search tools differ. Try to match your search tool to your question.

To start, you will need to learn something of the different tools - this is described below - and four basic search techniques: Boolean[1], Proximity[1], Field Searches[1] & Truncation[1].

Internet Global Search Engines

[1] Altavista[1], among other tools, has a very large, fast search engine. Allows for:
  Basic Boolean[1] AND + NOT - OR |
  Proximity[1] " " ~ (near - within 10 words of each other.)
  Several Fields[1]: title:"Spire Project" domain:gov url:edu link:cn.net.au
  and Truncation/Wildcard[1] (*)
Of import, capitals matter with Altavista. Read more here[2] and here[3].

[4] All-the-Web[4] is important because it is large - really large - with a flexible search facility.
Allows Partial Boolean[1] + -, Simple Proximity[1] " ", and Several Fields[1]:
  a title field search: normal.title:spire
  url field: url.all:.au
  link text and link url fields: normal.atext:spire link.all:cn.net.au
All-the-Web is not case sensitive. Read more here[5].

When searching for a topic with precise descriptive terms, use a broad search engine. Always place the Boolean + symbol before each search word (like this: +word1 +word2) to insist all words appear in the results. Quotes keep words together ("word1 word2"). These two simple steps dramatically improve results. Keep adding words and search limits until the number of hits is reasonable.

[48] Inktomi[48] provides its substantial web directory through other companies, in this case, Yahoo. You may need to select "Web Page Matches". Accepts Partial Boolean[1] + -, Simple Proximity[1] " ", and Two Fields[1]: title: and an index-date field through this form[6].

[53] Lycos[53] is a rapid search engine, again, one of the larger ones on the web. Accepts Partial Boolean[1] + -, Simple Proximity[1] " ", and Several Fields[1] through their advanced search form[7].

For more global search engines, consider visiting the W3 Search Engines[8] page at the University of Geneva. The Industry Research Desk also has a good search engines page[9] as does this site[10] by Paul Hopper and this page[11] from Search Engine Watch.

Meta-Search Engines & Google

If you know something of the destination already, like a title or company name or full name, try using a search tool which excels in finding named websites. There should be little difficulty in finding such sites with either Google or a Meta-Search engine, but don't get excited and use these on other occasions[1].

[2] Debriefing[2] is our meta-search engine of choice. Use this to find names & named websites. Accepts Partial Boolean[1] + - and Simple Proximity[1] " ". Capitals matter.

[12] Google[13] is a new style of search engine which ranks sites with more care and concern.
This works well for sites you know a little about in advance. Allows Partial Boolean[1] + - and Simple Proximity[1] " ".
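
The query advice given above for broad search engines (a + before every required word, quotes to keep words together, a - to exclude, and field limits such as domain:gov) can be sketched as a small Python helper. This is only an illustration: the function name build_query and its parameters are made up for this sketch, the field syntax follows the Altavista examples quoted in this FAQ, and each search engine accepts its own variations, so check the engine's help page before relying on the output.

```python
def build_query(required=(), phrases=(), excluded=(), fields=None):
    """Assemble a search string in the +word "phrase" -word field:value style.

    Hypothetical helper: the syntax is the one this FAQ describes, not a
    guarantee of what any particular search engine accepts.
    """
    # Insist that every required word appears in the results.
    terms = ["+" + word for word in required]
    # Quotes keep the words of a phrase together.
    terms += ['"' + phrase + '"' for phrase in phrases]
    # A leading minus excludes a word.
    terms += ["-" + word for word in excluded]
    # Field limits, e.g. domain:gov or title:"Spire Project".
    for name, value in (fields or {}).items():
        if " " in value:
            value = '"' + value + '"'
        terms.append(name + ":" + value)
    return " ".join(terms)


if __name__ == "__main__":
    print(build_query(
        required=["travel", "camping"],
        phrases=["Western Australia"],
        excluded=["hotel"],
        fields={"domain": "gov"},
    ))
    # prints: +travel +camping "Western Australia" -hotel domain:gov
```

Adding one more required word or field limit per attempt mirrors the advice to keep narrowing until the number of hits is reasonable.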