In this project, you will build a barebones text search engine using your knowledge of hashing. Imagine you are a developer at a company that sells thousands of products every day and receives millions of reviews as well. Your CEO wants you to retrieve all the documents that contain the word "iphone" but not the word "camera", or all the documents that contain either the word "amazing" or the word "awesome". That is what we want to build here.
Given a list of documents, you have to build the following:
You will be tested on the document matrix and the search result only.
You will be given two input files in argv[1] and argv[2]. Let's assume we have document.txt and instruction.txt as input. document.txt will contain several lines with a linebreak after each line. First, you have to build a way of indexing the document starting from 1 (and not 0), to
Next, you build a vocabulary list containing unique words only. Make sure all the words are in lowercase; if a word has uppercase characters, convert them to lowercase. Since we are not doing advanced parsing, words may often contain special characters, e.g., a period, a comma, or a semicolon. In other words, we define a "word" as any entity separated by a space. You need to make sure your words contain only alphanumerics (i.e., letters a-z and digits 0-9) and remove any other special characters. Also, make sure the vocabulary list is sorted lexicographically.
Hint: check out the function std::isalnum.
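The cleaning and sorting steps above can be sketched as follows. This is only one possible approach, not a reference solution: the function names normalizeWord, wordsOfLine and addToVocabulary are made up for illustration, tokens that become empty after cleaning are dropped (an assumption), and the vocabulary is kept sorted by hand because the algorithm header is not allowed in this project.

```cpp
#include <cctype>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Lowercase a token and drop every character that is not alphanumeric.
std::string normalizeWord(const std::string& token) {
    std::string out;
    for (unsigned char c : token) {
        if (std::isalnum(c)) out += static_cast<char>(std::tolower(c));
    }
    return out;
}

// Split a line on whitespace and return the cleaned words.
// Assumption: a token with no alphanumeric characters is skipped.
std::vector<std::string> wordsOfLine(const std::string& line) {
    std::vector<std::string> words;
    std::istringstream in(line);
    std::string token;
    while (in >> token) {
        std::string w = normalizeWord(token);
        if (!w.empty()) words.push_back(w);
    }
    return words;
}

// Insert a word into a vocabulary that is kept sorted and duplicate-free.
// Hand-written linear insertion, since <algorithm> is off limits here.
void addToVocabulary(std::vector<std::string>& vocab, const std::string& word) {
    std::size_t pos = 0;
    while (pos < vocab.size() && vocab[pos] < word) ++pos;
    if (pos < vocab.size() && vocab[pos] == word) return;  // already present
    vocab.insert(vocab.begin() + pos, word);
}
```

Calling wordsOfLine on each line of the document file and feeding every returned word to addToVocabulary would leave the vocabulary in lexicographic order. The linear insertion is simple rather than fast; a hand-written merge sort would scale better on large inputs.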
Next, we will build a dictionary using our knowledge of linear probing in hashing. For each word, we compute the hash key by summing the ASCII values of its characters and taking the remainder after dividing by the six-digit prime number 999,883, which is our "bucket size". The list of unique words will never be longer than this number, which also means you will never fail to place a word in the dictionary. Let's assume we want to find the hash value of the word "game":
A Simple Hash Function for English words with Collision Handling via Linear Probing:
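int hash = 0;
int N = 999883;                  // bucket size
string s = "game";
for (int i = 0; i < (int)s.length(); i++)
    hash = hash + (int)s[i];     // sum the ASCII values of the characters
int hash_value = hash % N;       // using modulo operator
// Linear probing: for i = 0 to N-1, try position (hash_value + i) % N
// and take the first position that is not occupied by another word.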
Example: Consider the sentence "Stop spot and post sIlent listen". We will now calculate the hash values for all the words, after converting them to lowercase and taking them in sorted vocabulary order.
hash("and") = (int('a') + int('n') + int('d') ) % N = ( 97 + 110 + 100 ) % 999883 = 307
Since this is the first word entered, there is no chance of a collision.
hash("listen") = (int('l') + int('i') + int('s') + int('t') + int('e') + int('n') ) % N = (108 + 105 + 115 + 116 + 101 + 110) % 999883 = 655
Position 655 is empty, so there is no collision. Move to the next word.
hash("post") = (int('p') + int('o') + int('s') + int('t')) % N = (112 + 111 + 115 + 116) % 999883 = 454
Position 454 is empty, so there is no collision. Move to the next word.
hash("silent") = (int('s') + int('i') + int('l') + int('e') + int('n') + int('t') ) % N = (...) % 999883 = 655
Now there IS a collision at position 655. So we look for an available position at (hash + i) % N, where i runs from 1 to N-1. At i=1, we find an available position:
hash("silent") = (655 + 1) % 999883 = 656
For the next word, "spot", we again have a collision (its hash is 454, already taken by "post"), and i=1 resolves it:
hash("spot") = (454 + 1) % 999883 = 455
Now, for the next word, "stop", we have a collision, but i=1 does not resolve it, since that position is occupied by "spot". So we check i=2 and find an available position:
hash("stop") = (454 + 2) % 999883 = 456
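The worked example above can be reproduced with a small sketch like the one below. It is an illustration under stated assumptions rather than the required implementation: the names baseHash and insertWord are hypothetical, the table is a std::vector<std::string> of size N, and an empty string is assumed to mark a free position (safe here because cleaned words are never empty).

```cpp
#include <string>
#include <vector>

const int N = 999883;  // bucket size from the assignment

// Sum of the ASCII codes of the characters, reduced modulo N.
int baseHash(const std::string& word) {
    int sum = 0;
    for (unsigned char c : word) sum += c;
    return sum % N;
}

// Return the position that holds `word`, inserting it if it is new.
// Probes (h + i) % N for i = 0 .. N-1; returns -1 only if the table is full.
// Assumption: an empty string means the position is free.
int insertWord(std::vector<std::string>& table, const std::string& word) {
    int h = baseHash(word);
    for (int i = 0; i < N; ++i) {
        int slot = (h + i) % N;
        if (table[slot].empty()) {   // free position: store the word here
            table[slot] = word;
            return slot;
        }
        if (table[slot] == word) {   // word was inserted earlier
            return slot;
        }
    }
    return -1;
}
```

With a table created as std::vector<std::string> table(N), inserting "and", "listen", "post", "silent", "spot" and "stop" in that order returns 307, 655, 454, 656, 455 and 456. Calling insertWord again with a word that is already stored returns its existing position, so the same function can serve as the lookup.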
Next, you will build a document matrix; this is the first output on which you will be evaluated. The document matrix lists the words of each document not as strings but as their hash values (the final positions after linear probing). Do not repeat values (i.e., treat each document as a "set of unique words"), and sort the hash values of each document in ascending order.
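One way to keep a document's row sorted and free of duplicates without the algorithm header is sketched below. The helper names insertSortedUnique and formatRow are hypothetical, and the output format "index->[v1, v2, ...]" is copied from the sample document matrix shown later in this handout.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Insert a hash value into a row that is kept in ascending order,
// ignoring values that are already present.
void insertSortedUnique(std::vector<int>& row, int value) {
    std::size_t pos = 0;
    while (pos < row.size() && row[pos] < value) ++pos;
    if (pos < row.size() && row[pos] == value) return;  // duplicate word
    row.insert(row.begin() + pos, value);
}

// Format one row of the document matrix, e.g. "1->[97, 220, 440]".
std::string formatRow(int docIndex, const std::vector<int>& row) {
    std::string out = std::to_string(docIndex) + "->[";
    for (std::size_t i = 0; i < row.size(); ++i) {
        if (i > 0) out += ", ";
        out += std::to_string(row[i]);
    }
    out += "]";
    return out;
}
```

For each document, you would call insertSortedUnique once per word with that word's dictionary position, then write formatRow(index, row) to the document matrix file, with index starting at 1.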
Finally, you need to implement the instructions provided in the second input (instruction.txt). You need to account for three operations: "AND", "OR", and "NOT". Each instruction will contain at most two words. Additionally, the following details must be adhered to in order to ensure correctness:
1. Use the hash function provided above with linear probing and bucket size 999,883.
The program will take a document file as input. Additionally, there will be an instruction file that tells you which search operations to perform. The sample input/output data with queries are available here. We also provide a complete walkthrough of the first example input and queries below.
For example, consider doc1.txt and ins1.txt as input, where doc1.txt contains
This is just a test.
We will build a search engine.
It will be a bit pristine, but trust me, it's fun.
Some fun but not it will still be graded.
and ins1.txt contains
a
will AND a
fun OR pristine
(NOT be) AND a
engine
Your output will consist of document_matrix.txt and instruction_output.txt. The document_matrix.txt will list each document index with the corresponding sorted list of word hash values.
1->[97, 220, 440, 448, 454]
2->[97, 222, 441, 528, 630, 631]
3->[97, 199, 210, 221, 319, 329, 331, 336, 441, 578, 878]
4->[199, 221, 329, 331, 337, 436, 441, 552, 615]
And instruction_output.txt will have the output of each instruction.
a-->[1, 2, 3]
will AND a-->[2, 3]
fun OR pristine-->[3, 4]
(NOT be) AND a-->[1, 2]
engine-->[2]
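The three operations map naturally onto set operations over sorted lists of document indices. The sketch below shows one way to write them by hand; the function names intersectDocs, uniteDocs and complementDocs are hypothetical, and it assumes that you have already collected, for each word, the ascending list of documents that contain it.

```cpp
#include <cstddef>
#include <vector>

// AND: documents that appear in both sorted lists.
std::vector<int> intersectDocs(const std::vector<int>& a, const std::vector<int>& b) {
    std::vector<int> out;
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i] < b[j]) ++i;
        else if (b[j] < a[i]) ++j;
        else { out.push_back(a[i]); ++i; ++j; }
    }
    return out;
}

// OR: documents that appear in at least one of the sorted lists.
std::vector<int> uniteDocs(const std::vector<int>& a, const std::vector<int>& b) {
    std::vector<int> out;
    std::size_t i = 0, j = 0;
    while (i < a.size() || j < b.size()) {
        if (j == b.size() || (i < a.size() && a[i] < b[j])) out.push_back(a[i++]);
        else if (i == a.size() || b[j] < a[i]) out.push_back(b[j++]);
        else { out.push_back(a[i]); ++i; ++j; }  // same document in both
    }
    return out;
}

// NOT: documents 1..docCount that do not appear in the sorted list.
std::vector<int> complementDocs(const std::vector<int>& a, int docCount) {
    std::vector<int> out;
    std::size_t j = 0;
    for (int d = 1; d <= docCount; ++d) {
        if (j < a.size() && a[j] == d) ++j;
        else out.push_back(d);
    }
    return out;
}
```

In the walkthrough, "be" occurs in documents [3, 4] and "a" in [1, 2, 3], so (NOT be) AND a becomes intersectDocs(complementDocs(be, 4), a), which yields [1, 2]. Parsing the instruction line into words and operators is left out of this sketch.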
The program takes 4 arguments. 1st arg: name of the input file with the list of documents, 2nd arg: name of the query instruction file, 3rd arg: name of the file where you write the document matrix, and 4th arg: name of the file where you write the results of the queries in the instruction file.
For example:
$ ./PROJ1 doc1.txt ins1.txt doc1_mat.txt ins1_out.txt
You are NOT allowed to use algorithm or map header files for this project.
Create a directory on the Linux server; its name must be proj1
$ mkdir proj1
Change your current directory to proj1
$ cd proj1
Run the shell script to test the program
$ sh test_cpp.sh