allQuestions.txt

What is the best way to choose the appropriate k for running k-means clustering?
How do we determine how much we should smooth?
Why does smoothing achieve discriminative weighting?
What does theta represent?
What is hierarchical categorization and how is it useful?
Can a cluster for words be compared to clusters containing larger objects, like groups of documents?
How do we use generated models to do text categorization?
What is the most efficient way to find lambda star?
What are the advantages and disadvantages of the probabilistic and similarity-based approaches, respectively?
What exactly is Hierarchical Agglomerative Clustering?
How do we choose the way to compute a group similarity based on different variations?
Should you first determine the class a text belongs to and then cluster?
What exactly are the differences between k-means and the EM algorithm?
What is the motivation for text clustering?
What is the functionality for all of text categorization in real life?
Can you use a different model than k unigram LMs?
What are examples of criteria of choosing single links over complete links and vice versa?
What is the difference between generative probabilistic model for cluster and categorization?
What is the intuition behind scoring based on a ratio rather than comparing two scores?
Why is the log of the ratio used as the weight?
How does clustering deal with outliers in a cluster of data?
What Is Good Clustering?
What Is Cluster Analysis?
Can you explain more about how these models help in text categorization?
How can we adapt the vector space retrieval model to discover paradigmatic relations?
What other types of text clustering models support a document that can cover multiple topics?
What are the benefits of probabilistic models vs other models?
How do we get P(Y) here?
Why are we adding the background probability if it is already a common word?
Why can we assume these are correct?
What is an example of how we can combine multiple methods for these kinds of problems?
What happens if we have more than two categories?
What is the optimal way to combine and use character n-grams, word n-grams, and POS tag n-grams?
When should we use precision over recall and vice versa?
What if an opinion's context, like date or time, was considered in the natural language processing, but the opinion maker did not actually make use of that context?
Why does the order of the polarity categories matter?
What is the meaning of parameter lambda?
What is meant by "share training data"?
What is the difference between macro- and micro-averaging of precision and recall?
Is human effort needed for all items in micro-averaging?
What other semi-supervised learning techniques can we use here?
How do we deal with ambiguity when conducting sentiment classification?
Will different types of supervised learning techniques change the final result?
Can sentiment classifications be treated as a categorization problem?
How to find the sweet spot?
What are examples of text data that do not adhere to the discriminative classifier?
How does the machine understand the opinion itself when its pattern-recognition model has no schema for understanding context? I am assuming it just uses ML, which is curve fitting, to understand the context.
Why do we assume B2 to be positive and B1 to be negative?
Why were there notation changes?
Are there non-linear SVMs that move along different points?
What is Text Categorization?
What is a good Text Categorization?
How is it used?
Could we do it by providing the training set as classified examples of the various sentiments and opinions we see, and then using the resulting model to extract sentiments and opinions based on how our document is classified?
How does one choose which Discriminative Classifier to use?
How do you optimize the tradeoff between exhaustivity and specificity?
Does the likelihood function always converge?
What do the parameters represent in conditional likelihood?
From the classification algorithm or from the preprocessing of the data?
What if the data cannot be separated by a line?
Can you give some examples of feature construction process and how to choose the algorithms to use under some specific conditions?
What if we have no prior knowledge about the data and we still want to make appropriate evaluation?
Why do we use a Gaussian distribution?
What techniques can we use to determine how to partition data to determine context?
How do you know if two nodes are close to each other, and how is the distance measured?
How are the term weights for the different aspect segments discovered?
Would we be able to use a negative edge weight in the instantiation of NetPLSA?
How accurate and useful is iterative causal topic modeling, and in what situations should it be used?
Would it be better to weight unbiased reviews more rather than accommodating ratings for biased reviews?
What is the purpose of the regularizer function?
How are the aspects determined?
What if we combined the idea of topic mining and opinion mining?
What is the use for this?
How do we analyze causality or correlation in a multiple-language case?
Why are the three different colors shown in the "choose a topic" section not perfect rectangles?
How do we deal with mining topics with time series supervision?
What does it mean to assume context-dependent topic coverage?
Can context be used to partition text?
What does aspect i mean exactly in Latent Rating Regression?
Where did those ratings come from?
What stops the data mining loop from producing fake data and ultimately feeding that fake data back into the loop?
How are views chosen for CPLSA?
What does this analysis do?
Can we treat partitioning a set of text documents as a text clustering problem or is that not a good approach because it might not partition documents into distinct sets such as date published but into a mixed set that is a mixture of all the variables found?
Which formulas are being used in this process?
How did we get these values?
What is the advantage of partitioning data for text mining?
What is the motivation behind Mining Topics with Social Network Context?
How does this affect the population of social media users?
How does non-text data help when we infer values of real-world variables?
What is the significance of the two stages used when solving Latent Aspect Rating Analysis?
How is the content in each subnetwork characterized by the text?
What does the number after each word indicate?
Examples have been given, but the quality of the input data plays a huge role. What are the state-of-the-art techniques for this?
Can you please go further in depth with the split words, and how they factor in to the rest of the pipeline?
Can you elaborate on how to introduce the time series data and get the biased topics?
How do we determine the initial input topic model? Is it by manual input?
How is R'(q) related to the relevant documents R(q), and what information does it provide?
Are both queries and documents elements of the vocabulary set? Also, what do the parameters i and m represent in documents?
How do we determine if the classifier is over or under constrained?
How exactly would synonyms appear on a VSM vector model and why cannot they be in different dimensions while still having the correct meaning/connotation?
How do we solve the problem of multiple documents having the same similarity score?
How do you know which model for "relevance" to use?
What is the difference between semantic & pragmatic analysis?
How many dimensions does one term in a VSM define?
Which challenge is harder to settle?
How do we know whether the theta we pick is optimal?
How can term weights in the space be seen differently than a query?
What exactly is represented by the multiple subscripts for each document?
What is the difference between POS tagging and parsing?
How are the terms chosen?
How do we value the importance of each word when the vector only checks for existence with bit vectors?
What is the difference between query and document besides their lengths?
What does it mean to say that "the query follows from the document"?
What about recommending based also on images?
Which of the following best describes a Bag of Words?
Will we be learning all of those methods listed or just BM25?
What are these random variables that queries and documents are all observed from?
What is the difference between querying and pull mode for accessing information, if both require the user to input specific keywords to search for?
What are the components of the simplest vector space model?
Why would we use each word in the vocabulary to define a dimension of the vector space?
Would this not result in many unnecessary dimensions?
When did NLP become famous?
How does the assignment of zero for every absent word in a document help in vector placement?
What does "the query follows from the document" mean?
Why is BM25 the most popular?
How is the BOW created, and would it be preferable to reduce it to the most important words?
I did not fully understand POS tagging and parsing. Could the professor talk more about them?
How will DF and TF determine relevancy more specifically?
How do we define syntactic structures?
How is the performance of each of these models benchmarked?
Do the vectors contain entries only for certain "key" words that the designer chooses, or do they contain entries for every word?
How do I provide the machine with context knowledge to comp\?
Why is similarity used to rank when queries may not exactly contain words in related documents?
Can the professor further compare those two concepts in class?
Does Google adjust such independency in its search results?
Should we also be increasing the frequency of a word of a different form?
For example, in the lecture we are counting the frequency of the word presidential, so should we also take into consideration when we see the word president as well?
What is an example of deeper NLP for complex search tasks and what does deeper NLP refer to?
What is the difference between semantic analysis and pragmatic analysis since they are both about meaning of the sentence?
What is the relationship between the state-of-the-art retrieval methods and the other retrieval methods first introduced?
Why do we not use a count for the number of matches, instead of a binary 0 or 1?
What is the particular feature that makes BM25 the most popular?
How to combine push & pull in practice?
Why do we not take the length of the document into consideration?
What is the difference between semantic analysis and pragmatic analysis?
Why can we build algorithms based on the Probability Ranking Principle, even if it does not hold in many situations?
What is the difference between probabilistic models and the probabilistic inference model?
What are other metrics to measure similarity, or other combinations of VSM, since the combination bit-vector + dot product + BOW here is considered the simplest example?
For the probability model, why are we using random number to determine relevancy?
How many dimensions do we need to produce good results?
What does N in "N terms" refer to here?
Does it refer to the total number of terms that appear in the query and all existing documents?
How to effectively break the tie?
How does Zipf's law filter out the completely unrelated results?
What is the "postings" data structure?
What is the point of compression?
Will the access times really be that impactful to the overall indexing?
What if we tokenize the terms before we index them? Is that something people do, or would it skew how the index works and potentially give wrong data to the user?
What does f subscript a mean and how is it calculated from the result of the function h?
How do we determine what type of function to use for the IDF?
How exactly does adding a constant to the TF, the way BM25+ does, limit over-penalizing?
Can we get more examples of using gamma-code?
How does the gamma-code integer compression method work?
Can you further explain why we need to compute g(t,d,q) in the scoring algorithm?
How do you do gamma coding?
How do you determine how many unary bits there are in the encoding?
What is Zipf's law used for?
How much worse is unary/delta-encoding versus gamma-encoding for inverted index compression?
What does the term "accumulators" actually refer to; if it is just the scores for the matching of query term to document, is it the same as the term frequency value?
Why are 3 and 5 encoded as they are?
What does aggressive mean?
What is the purpose of passing in q to the function g?
Why do we penalize long documents with more content less?
What is the point of +1 here?
Why is it (M+1) rather than M in the formula of calculating IDF?
How does the formula come like this?
How do we estimate or choose the appropriate value for k in BM25 formula?
How do we go about determining an actual value for k?
Would the overhead for calculating inverse document frequency for each word not be very high?
How do we normalize the case where a document is long but its relevant content within that document is very short?
What kind of data structure should the method use?
How does it speed up the search?
What exactly does "d-gap" mean and why is it useful?
What do "dictionary" and "postings" refer to?
What kind of data structure should we use to store postings then?
What is the meaning of "bag-of-words with phrases"?
Are both dictionaries and postings used to construct an inverted index, or do you use one over another based on the size of a dataset?
How does Zipf's law help avoid touching documents that are not in the query?
Can we get some more examples on how to construct an Inverted Index and how we go about utilizing it?
What are the variables in the Pivotal Length Normalizers referring to?
What exactly is meant by these "score aggregators"?
How is this block in "Local" sort being created / partitioned?
How do tuples with different doc IDs get grouped together?
What does the professor mean when he says BM25 should not be a VSM but a probabilistic model?
How did they come up with such a weighting formula?
How is a word in the dictionary mapped to its position in the postings?
Does the delta-code use gamma-code twice, recursively?
How does stemming help in increasing the coverage of documents?
What does coverage of documents mean and what could be some other benefits of stemming?
Why is a log function for weight of query words in a ranking function of TF-IDF transformation better than other functions?
Why is doc ID compression using d-gap more efficient than using the original IDs?
When and how do we compute the average document length?
What are some methods employed for language-specific and domain-specific tokenization?
What functions can we use in the Vector Space Model for scoring apart from the dot product?
Can you go over the differences between the different Integer Compression Methods?
What is the reasoning for making the first (1+logx) unary and the x-2^(logx) uniform?
How does unary code compress binary code?
Will unary code not always use more bits than binary?
Why is the inverted index the most common type of indexing used?
How could the IDF Weighting technique be improved to include synonyms of common words without significantly increasing time-complexity?
The professor talks about gamma-code. Is there a way to directly compute log of x rather than recognizing the unary code and recovering log of x from the unary code formula?
How would this formula affect popular terms that should not be penalized?
Which algorithm is more often used in industry, BM25F or BM25+?
How does the formula provided by Zipf's law help avoid touching a large number of documents that do not match a query term?
What is uniform code?
How does gamma decoding work?
What is the best way to choose the parameter k, and does it depend on the type of search or is there one preferred value?
What is the meaning of the linear relationship compared to standard IDF?
Why does b in the normalizer have to be smaller than one?
What does the position mean here?
How does this function actually work?
How does epsilon-code work?
Why is it called an inverted index?
Would it be better to look at other words often searched with that word to combine with the query to get better results?
For the first two steps of inverted index construction, are all documents sorted by doc ID first and then split into groups (such as 6 docs per group), with each group sorted by term ID?
How is the position of a term in the dictionary confirmed in the postings?
Are there any examples of d-gap, and what is the heuristic behind this method?
What is the meaning of postings?
What are examples of good MAP and gMAP values when measuring precision?
What are the most popular methods for statistical significance testing?
Why is this a special case?
How is it different from the normal method?
Why is the parameter usually set to 1, and when would it not be 1?
Where did they get the values p=1 and p=0.9375?
Why do we have to use an F measure to combine precision and recall?
Could we get more examples of Statistical Significance Testing?
Can you explain why we use 1/r with a concrete example since I am still confused?
Which distribution are we using in this model?
For example, working with homophones or something, could those irrelevant and non retrieved documents be used to measure accuracy?
How do you get a p-value from the sign test since it is only pluses and minuses?
What is the value of the denominator when calculating recall in test collection evaluation?
What is the benefit of following the pooling strategy?
When you calculate the precision in a collection, does the denominator increase each time you look at any document, even the non-relevant ones?
How does dividing by the Ideal DCG actually normalize the DCG?
What is the difference between the Wilcoxon test and the sign test?
Why do we normalise discounted cumulative gain with the log of the document rank instead of just the rank?
What is the base of the log used for nDCG?
Does the base even matter here?
What would be the effect of setting the parameter larger or smaller than 1?
Why is the standard method for evaluating a ranked list quite sensitive to a small change in the precision of a random document?
When calculating values for the F-measure, it is necessary to have non-zero precision and recall. What would happen if the system does not respond well and returns zero retrieved docs?
What is the meaning of "combine all the top-k sets"?
What is the trade-off between precision and recall while calculating F-Measure?
Why is B better?
Does it have less random fluctuations?
Why do we have @k in DCG@k?
Should the ranking system not return a ranked list of all documents?
Why did we assume we have 9 documents rated 3 and 1 rated 2 for ideal DCG if in the example for actual DCG all the documents are ranked differently?
Should we be assuming all documents but 1 are ranked 3 in every ideal situation?
What is the meaning of "return top-k document"?
When would we use binary judgements?
Why would we not just use multi-level judgements, as they allow for more flexibility?
How do you calculate the IdealDCG@10 again?
Do you put a score of 3 (very relevant) for every document except the last one?
What is the Wilcoxon method?
Why do we combine the precision and recall?
Why do we need a parameter here?
Why is the discounted cumulative gain calculated by dividing by the log of the position?
What exactly does recall refer to?
How can we assume, when computing recall, that there are 10 relevant documents in the collection?
How often do engineers use evaluation in the real world scenario?
What is the difficulty of a query?
Why does the situation affect the choice between MAP and gMAP?
Why do we not do another operation, like deciding which query vectors are likely to be more difficult and rare (like we did in L2), and then based on that choose either MAP or gMAP?
Why is precision most important for the top ten resulted documents?
Should precision affect the majority of documents affected?
Which measure should be given more priority, precision or recall?
What is meant by having variance across queries, do you really mean bias?
Would this perhaps result in increased relevance and accuracy?
Why is the ideal discounted cumulative gain based on 9 documents being relevant with 10 documents rather than all 10 documents being relevant?
Can you go into more depth of the differences between MAP and gMAP?
Why do we have to use Precision and Recall over raw accuracy?
What is the risk associated with discarding documents that are potentially relevant?
What is the difference between an F-measure and an F1-measure besides adjusting the Beta parameter?
How is the K determined for the judging of top-K documents, since it can vary from system to system?
What is the difference between MAP and gMAP?
When performing test collection, are the initial relevance judgements determined by humans or some empirical method?
How is it possible to measure the Recall Value of a system if the number of relevant documents out of a complete set is unknown?
How does the parameter beta in the F-measure work?
How is gMAP calculated?
What can nDCG do that DCG cannot?
Could you please give an example of gMAP?
Why does a non-changing recall count as zero when calculating the average position, instead of using the updated precision?
Why is the list not sorted by most relevant?
What is the purpose of making human assessors judge a collection of top-K documents?
How do you determine which estimation method would be most effective to solve a particular problem?
How do we pick the best values to use for lambda and mu when doing the smoothing?
Why is the log term of p(q|d) independent of the document?
Why can we assume all words in a query are independent?
Why do we ignore the last part of the log p(q | d) formula when ranking documents?
Could you further explain the doc length normalization? As I remember it, it should be a multiplicative factor instead of an additive one.
What is the probability exactly?
How exactly does nlog(alpha_d) relate to doc length normalization?
Why are there only n-1 parameters in the generative model?
Why would we need the statistical language model to generate words for us, instead of sequences?
Does the sequence of words matter in the query?
Which one is better, JM smoothing or Dirichlet Prior smoothing?
What is the difference between Jelinek-Mercer smoothing and Dirichlet prior smoothing since they have very similar forms?
What exactly does alpha_d mean in English?
Why do we need to log the probabilities in this formula?
What if we do not ignore the last part of ranking?
Why not have a standard smoothing language model?
What is the point of having multiple, if at the end of the day they all assign probabilities to unseen words?
Which part of the formula for the ranking function with smoothing actually implements smoothing?
Why do we assume that in our smoothing method each word that is not observed would have a different form of probability?
Can we have examples on how to rank using the smoothed ranking functions?
Why even care about query words not matched in d?
Why does P(Wi|d) represent TF weighting?
How does this probability add a limit/bound to the score of a certain word?
Why do we say that the smoothing variable 'mu' is dynamic in this case?
When exactly is the variable changing and why?
How does it behave differently than 'lambda'?
Why does taking away probability mass from observed words help us assign probability to words not seen in the document?
When is Unigram LM used?
What does the conditional probability mean?
Does it mean given the document guess the query which returns this document?
What does it mean to say a user likes a document, if the query is unknown?
How does the user click the document if he did not enter a query?
Why do we have N-1 parameters?
Since not every query can be drawn from the documents, does that mean that query generation and query likelihood eventually become better (improving doc model) as query inputs increase?
What is the assumption we make while calculating the query likelihood?
Does the smaller coefficient for longer documents i.e. lesser smoothing make it harder for models to come up with more accurate retrieval through probabilistic techniques?
Can you go into more detail as to how the query likelihood retrieval function works?
Why would we assume each word in the query is chosen independently when users often chose entire phrases for a query at once?
What is the meaning of rewriting the ranking function with smoothing?
Why does a higher lambda make the common words disappear?
Does this assumption still hold in production?
It seems like we have to use this assumption to calculate the probability.
Is it possible to glean meaningful information about probabilistic retrieval models from something other than clickthrough data?
How does one choose which language model to use for a specific set of text?
Are some LMs better for some types of texts than others?
How does lambda affect p(w|d)?
Are there any shortcomings of our assumption that a user formulates a query based on an imaginary relevant document?
Can you please explain how you get the network and mining values using the equation?
Why did you say that "The classic probabilistic model has led to the BM25 retrieval function"?
What is the probabilistic interpretation of BM25?
How do we smooth an LM?
Why do we derive the function Fdir?
How do we choose the coefficient of pseudo-counts?
Why do we need to predict the likelihood of the query?
What is the benefit of a less heuristic retrieval function?
What is the function of alpha here, and how to choose its value?
How do different types of feedback integrate into different page rank algorithms?
How are the results from PageRank and HITS integrated with the results of a vector space model or language model?
Why is the vector truncated when using the Rocchio formula in practice?
How do you determine the alpha, beta, gamma terms in the Rocchio Feedback formula?
How do we determine the alpha beta and gamma parameters when moving the query vector?
What are the counts of 1's exactly?
How do you compute the centroid in the Rocchio Feedback formula?
What are the drawbacks of using the KL Divergence Retrieval Model?
What would this new query vector mean in human readable language?
What does it mean for a document to be negative?
Can you explain why each word gets a count of 1?
Why are we not mapping each vocabulary word to its count instead?
Why can lambda=0.7 produce more noise than lambda=0.9 according to the generative mixture model?
How do we rank links and compare this or interleave this with ranking actual webpages?
How do we convert the existence of links between pages into an adjacency matrix?
How can we get Query Likelihood by plugging in Query LM to KL-divergence?
What is "topic drifting"?
What is the meaning of parameters alpha and beta?
How are Query Likelihood and KL-divergence connected with each other?
Do all search engines use only implicit feedback, as it is the most convenient for users while giving semi-reliable results?
How do we determine the constant terms alpha beta in Rocchio?
What does an inverted index mean exactly?
What does an inverted index consist of?
What exactly do hidden URLs mean?
How can a programmer with minimum effort create an application that can run a large cluster in parallel?
What are good values for alpha, beta, and gamma in the Rocchio Feedback Model?
Would this not give greater importance to the words in the background LM?
How does it being close to positive vectors work intuitively?
How did they come up with KL-divergence?
What is the theta hat here?
What is the difference between Q and q?
How do the parameters: alpha, beta and gamma control the movement we have in the concept of Rocchio feedback?
What is the significance of each of them?
Would moving the query vector closer to the rest not create an overfitting model?
Can you go over how Rocchio Feedback works?
How can pseudo feedback be useful if the top 10 documents are assumed relevant instead of actually judged by the user?
What type of heuristics can be applied to analyze links?
How does MapReduce help with the scalability of web searching?
How much does feedback in general play a role in relevance judgement?
Why does BFS balance server load, why does DFS not?
How are data like clickthroughs reliable when "users" may not be real users and instead bots?
Can you please review the KL-divergence equation on the slide and explain what alpha represents?
Which is the most common feedback retrieval method used today?
How is the Rocchio formula derived?
How do we apply the KL-divergence retrieval model in examples?
Can you compare a bit more between the PageRank and HITS algorithms?
Does a local search engine use different methods to crawl websites, or are they the same?
What happens when there are zero entries in the matrix?
Will it be a query that contains more common words among the relevant documents?
What do the parameters mean with regard to the actual words?
Can we transform the new query vector back to the actual words?
Why are the returned results more descriptive words of a certain topic when we do not rely much on the background model?
Why would the counter treat the two occurrences of "World" separately?
How do we pick which features to use in the tuples, or how do we say one feature is more useful than another?
What are the advantages of regression based learning?
Why can we not use the Cranfield evaluation methodology to train machine learning models on labeled data?
When talking about predictions in memory based approaches, what is w(a,i) and how does it help to see differences and similarities between users?
Why do we need to normalize?
How do we determine the values for beta to use in the function and how do they modify the output?
How exactly is lambda determined and how does its value affect the function?
Does using ML to rank run into the issue of black-boxing the solution, so you cannot really evaluate why the ML algorithm decided on a result?
What is x=v+n?
What do you do if you still want to recommend stuff to a user, but you do not know anything about the user?
Could we use a ternary classifier instead of binary in content-based filtering?
What do companies with more complex filtering systems (such as Google) implement?
How does a system store information about a specific user's interests if the user never indicated their initial interests in a survey-type setting, like in social media (Reddit, Instagram, etc.)?
What do the beta values represent?
What is the data used to make the initialization module?
What is the meaning of the BM25Anchor?
How is it related to BM25?
Why does 1 minus the probability of relevance yield the probability of non relevance?
How would it affect the choice of beta, gamma and the cutoff position?
What exactly does "cold start" mean?
How could we determine the tradeoff between exploitation and exploration if it is hard to reach a balance?
How exactly does the filtering predict f values for other (you,o)'s?
How do you effectively set up a filtering system with no bias?
How is the value of utility obtained?
What does it mean by normalization strategy that gets the predictor rating in the same range as these ratings?
How would the IUF impact the traditional function?
When defining a feature, do we need to constrain the frequency?
How should we decide the value of beta?
Should the beta we choose have a greater influence on alpha than gamma and N do?
When estimating the beta values, how do we know that this estimation properly works exactly?
How did they come up with the method to estimate beta?
Why does the formula use multiplication rather than addition?
What is the difference between Meta and Vertical Search Engine in the concept umbrella of recommender systems?
How are the beta parameters decided in the logistic function?
What is the biased training sample problem?
Can we rank the documents according to the maximum product of the results of the various ranking functions?
Which one has the best average performance?
Can you explain more about the differences between types of collaborative filtering algorithms?
Why do we need to use MapReduce?
How would vertical search engines detect this specialized group of users?
How is actual user feedback incorporated to the evaluation of training data?
How is Beta-gamma threshold learning effective compared to other methods?
How do we implement the recommendation so that it "delivers the decision" immediately?
How does one prevent the overlap of features themselves?
When using machine learning to rank, how do we know when to stop "learning"?
How to evaluate if the problem is in the data and not in the chosen algorithm?
Can you give an example of how a content-based filtering system works (for the graph)?
How exactly does the formula work in the example?
What is the difference between Pearson correlation and cosine when they are used to measure similarity? The two formulas look quite similar.
How does subtracting the average rating from all the ratings ensure that all ratings are fairly evaluated?
What is the functionality of alpha and beta respectively?
Why is speech act = request the hardest thing to do for this sentence, since handling requests is a relatively straightforward task for things like Amazon Alexa or Google Drive?
Why is an entity-relation graph a good way to represent the observable world?
Why is there overlap of words in paradigmatic relation mining?
How often is word association mining used, and why does it help improve the accuracy of NLP tasks such as POS tagging?
What is the relationship between text retrieval and text mining?
Why doesn't deep understanding scale?
What is preventing it?
What is the bound on the number of possible contexts?
What techniques are used to represent text in languages like Chinese as a sequence of words?
Why does EOWC favor matching one frequent term very well over matching more distinct terms?
How to address the problem of treating every word equally when calculating similarity?
Why is it the probability that two randomly picked words are identical?
What algorithm is used for the logic predicates method?
What could be a downside with using EOWC?
Can text mining be used to predict sentiments amongst documents?
How is IDF(w) defined?
What does the professor mean when he says knowledge provenance?
Does it have to do with knowledge interpretation? Representing entities seems to be a challenge with text data, so how is that going to be solved?
How are text mining and text analytics different?
Can we consider a syntagmatic relation to be a superset of a paradigmatic relation?
What kind of data structures would be used to add on additional levels of NLP to the sequence of words storage?
How can we adapt the vector space retrieval model to discover paradigmatic relations?
How can common sense reasoning be incorporated into NLP algorithms?
What statistical methods combined with machine learning models work best for text data?
What is the difference between "String" and "Words" in text representation?
Will making data and text more concise take away from the actual content?
What types of data structures are used to store this text information after its retrieval?
When does EOWC not work well?
What problems does mining non-text data pose with regard to storage space?
Would the application of the formula displayed at the bottom of the slide still work even if there is a word that is identical in another document, but is from a different part of speech?
What does "Speech Act" mean here?
How does BM25 relate to Sim?
How do we determine if a word is too similar to a word that we already picked?
What is the definition of conditional entropy?
When finding the entropy to measure the randomness of a random variable X, why do you use the log base 2 of p(X = v) and not just the probability itself?
Why are the posterior probabilities in between the likelihood and the prior distribution?
How does this make intuitive sense?
Why is word prediction a binary variable instead of a continuous probability of that word occurring?
What are theta and pi exactly?
How are homonyms handled in the mutual information model?
How would the process for discovering the topic be different if we were using Bayesian estimation instead of maximum likelihood?
Why does knowing more information never decrease the conditional entropy?
How expensive is the computation task?
Would it be better to return a variable amount of terms that represent the majority of topics in the documents rather than a selected k topical terms?
Why is it impossible to specify probability values for all the different sequences of words?
Why do we take the log of the probability in the entropy formula?
How is similarity between terms determined?
Would a dictionary/thesaurus be used for this?
What exactly does maximum a posteriori estimate mean?
When does mutual information reach its maximum in terms of reduction of entropy of Y because of knowing X?
Should this be Pointwise Mutual Information?
What exactly is a MAP estimate?
Why is the Lagrange function used?
Why are H(Xw1|Xw2) and H(Xw1|Xw3) comparable but H(Xw1|Xw2) and H(Xw2|Xw3) are not? I did not understand the explanation in the lecture very well.
Why does the estimation of probabilities depend on the data?
Do we guess the parameters at first in order to build the model?
How is the second-to-last step transformed into the last formula?
Shouldn't the design of the scoring function also be concerned with the context of some topics based on the country of their origin, the age group where that topic is popular, etc.?
How does all that dynamic information fit inside "generic statistic"?
What is the meaning of "allows for inferring any derived value from theta"?
Can we do an analysis similar to what we did to detect syntagmatic relations? That is, once we identify a group of words that frequently occur with each other through a syntagmatic relation, can we use that information as a basis for grouping words into terms, with words that rarely appear with each other being related to different terms and vice versa?
Can you explain more about how probabilistic topic models work to help analyze text?
Why is it bad to have zero probability of a word?
How is it possible to create a lower bound on the probability of a word occurring using conditional entropy?
What determines the difference between the posterior and the likelihood?
How accurate are correlated occurrences in the context of syntagmatic relations, since intuition is involved?
How does one quantitatively measure the randomness of a random variable like Xw?
Why is the word "the" like a completely biased coin?
How is mutual information used in the vector space model (VSM)?
Can you please go further in depth with the equation (listed in yellow)?
How do people come up with the formula for entropy?
Can you give an example of Bayesian inference?
What does the function f mean? It is pretty vague to me.
How do we determine the initial input topic model, by manual input?
Why can we assume that these probabilities sum to one?
Can we improve by using something other than completely random values for initialization?
Why does the formula work?
How do we mix other Language Models, perhaps two Bigram Language Models?
How does PLSA operate the same way as a component mixture model?
What is j?
Would the normalizer for the background probability estimate be that the probability of all words from the background must sum to 1 as well?
How accurate is ML Parameter Estimation and what can be done to improve it?
What is the point of having a common background word, like "the", be part of both the topic and background probability distributions?
How does assigning high probabilities to words with high frequencies maximize likelihood?
What do all the terms in the two equations mean in a general sense?
How is using a mixture model more effective than removing stop words if we want to "factor out background (common) words"?
Why does fixing one component help get rid of background words?
How to make PLSA a generative model?
How do we distinguish which component model is going to be chosen if we apply the PLSA mixture model?
Why is PLSA not a generative model?
How do we compute k parameters?
What is the point of z in PLSA?
What is the difference between the E step and the M step?
Why do different components tend to assign high probability on different words?
Can we not utilize the same method for mining K topics as we do for mining 1 topic?
Will these Zs allow us to pull some binary classification technique on these thetas?
Why do we multiply all the probabilities rather than sum them?
Could you provide examples when demonstrating the behaviors of the mixture model, especially for the 2nd and 3rd feature?
What does "avoid competition or waste of probability" mean (2nd behavior)?
What exactly does "collaboration" mean (3rd behavior)?
How did this response equation come out?
How do we guess the probabilities to ensure that it converges to the global maximum rather than a local maximum?
How can we use the entropy function to determine common words that do not provide much content or context to our document?
What does LDA do?
How does imposing the background model prior enforce a 0 probability for models that are not consistent with the prior?
Could you please explain the example use mentioned for the by-product P(z=0|w) of the EM algorithm?
What factors would help us determine the background probability of each set?
What is the advantage of having a model which contains the probability of the background words? Shouldn't all the background words be treated with the same low probability?
Why do we take the log function in the PLSA formula?
How much does the actual distribution matter in terms of the words each distribution contains?
Why can hill climbing only find a local minimum?
What are some strategies to ensure that EM does not get stuck at a local maximum?
Can you please go further in depth with the EM graph?
How do we prove that the EM algorithm will eventually converge to a local minimum?
Can you talk about why inference of these parameters using Bayes rule is intractable?
Why are we adding the background probability if it is already a common word?
Why can we assume that these probabilities are correct?
What will affect the convergence rate of EM?