Dimensionality Reduction problem #2
Labels: enhancement, good first issue, help wanted
@Samsomyajit (Somyajit Chakraborty)
As Sayan da (@sayanmondal2098) reported to me earlier, Dimensionality Reduction will play a huge (important) role in creating a well-modeled Text Summarizer, so I've been doing some research and may have come across a solution. [Though further discussion on the topic seems necessary.]
Irrespective of which of our two approaches we take for the Text Summarizer - Abstractive or Extractive - Dimensionality Reduction is a problem in its own right. Many papers I've been reading suggest different solutions to the problem altogether, which, if not explained and dealt with properly, might bring our project to a halt.
So, firstly, let us understand the nature of our problem -
Since the text files from which we will build our summarizer are high-dimensional (complex), they pose a serious problem for visualizing our dataset, and if we can't visualize the data then we cannot determine which algorithm, or even which approach, to use for creating our summarizer. So what we need is a Dimensionality Reduction algorithm (process) that reduces the complex dimensionality of our dataset to a 2-D/3-D dataset so that we can visualize it in proper plots.
What we already have and the problem:
Standard dimensionality reduction methods such as principal component analysis (PCA), locally linear embedding (LLE) (Roweis and Saul, 2000), or t-distributed stochastic neighbor embedding (t-SNE) (van der Maaten and Hinton, 2008) take as input a set of feature vectors such as bag-of-words or tf vectors. An obvious drawback is that such methods ignore the textual nature of documents and instead treat the vocabulary words v1, ..., vn as abstract orthogonal dimensions.
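For reference, here is a minimal sketch of that standard pipeline (assuming scikit-learn; the `docs` list is a made-up placeholder): raw tf vectors fed straight into PCA and t-SNE, which is exactly the setup that treats each vocabulary word as an independent orthogonal axis.

```python
# Baseline sketch: standard DR on plain tf vectors (assumes scikit-learn;
# `docs` is a hypothetical list of raw document strings).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

docs = ["the cat sat on the mat",
        "dogs chase cats",
        "stock prices fell sharply",
        "the dog sat on the rug"]

# Term-frequency (tf) vectors: each vocabulary word is an independent axis.
tf = CountVectorizer().fit_transform(docs).toarray().astype(float)

# PCA down to 2-D for plotting.
pca_2d = PCA(n_components=2).fit_transform(tf)

# t-SNE down to 2-D (perplexity must stay below the number of samples).
tsne_2d = TSNE(n_components=2, perplexity=2, init="random").fit_transform(tf)

print(pca_2d.shape, tsne_2d.shape)  # (4, 2) (4, 2)
```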
What we can do:
We focus on the following type of non-Euclidean geometry, where the distance between documents x and y is defined as
d_T(x, y) = sqrt((x − y)⊤ T (x − y)) …. (1), where T ∈ R^(n×n)
is a symmetric positive semidefinite matrix, and we assume that documents x, y are represented as term-frequency (tf) column vectors. Since T can always be written as H⊤H for some matrix H ∈ R^(n×n), an equivalent but sometimes more intuitive interpretation of (1) is to compose the mapping x ↦ Hx with the Euclidean geometry
d_T(x, y) = d_I(Hx, Hy) = ||Hx − Hy||_2 …. (2)
We can view T as encoding the semantic similarity between pairs of words and H as smoothing the tf vector by mapping observed words to related but unobserved words. Therefore, the geometry realized by (1) or (2) may be used to derive novel dimensionality reduction methods that are customized to text in general and to specific text domains in particular. The main challenge is to obtain the matrices H or T that describe the relationship amongst vocabulary words appropriately.
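To make (1) and (2) concrete, here is a minimal sketch (the similarity matrix T and the tf vectors are made-up toy values, not a real domain matrix): factor T into H, map each tf vector through H, and verify that the plain Euclidean distance on the mapped vectors matches d_T.

```python
import numpy as np

# Toy 3-word vocabulary; T encodes semantic similarity between word pairs.
# (Hypothetical values -- in practice T would come from contextual diffusion,
# Web n-grams, WordNet, etc.)
T = np.array([[1.0, 0.8, 0.0],
              [0.8, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

def d_T(x, y, T):
    """Non-Euclidean distance of equation (1): sqrt((x - y)^T T (x - y))."""
    d = x - y
    return np.sqrt(d @ T @ d)

# Equivalent view of equation (2): write T = H^T H, then d_T(x, y) = ||Hx - Hy||_2.
H = np.linalg.cholesky(T).T          # works when T is positive definite

X = np.array([[2.0, 0.0, 1.0],       # tf vectors, one row per document
              [0.0, 2.0, 1.0],
              [0.0, 0.0, 3.0]])
X_smoothed = X @ H.T                 # applies x -> Hx to every document

# The two formulations agree, e.g. for the first two documents:
assert np.isclose(d_T(X[0], X[1], T),
                  np.linalg.norm(X_smoothed[0] - X_smoothed[1]))

# PCA / t-SNE / LLE can now be run on X_smoothed with their usual Euclidean metric.
```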
[To skip all the above math: we can use unsupervised learning to solve our problem, since our data will mostly be unlabelled.]
Or, as an alternative to the above approach, we can use a clustering (another non-Euclidean) method to create cluster matrices labelled "positive sentiment words", "negative sentiment words" and "objective words", along with their intensities and reviews, and then use t-SNE or LLE.
[But this method will create its own problems, though nothing that we can't manage...]
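A rough sketch of that clustering alternative (the word vectors, cluster count, and toy documents are purely illustrative): cluster the vocabulary into sentiment-style groups first, build a block-structured smoothing matrix from the assignments, and only then hand the smoothed tf vectors to t-SNE or LLE.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

# Hypothetical clusters standing in for "positive", "negative" and "objective" words.
# In practice the assignments would come from clustering real word vectors or a lexicon.
vocab = ["good", "great", "bad", "awful", "table", "report"]
word_vectors = np.array([[0.9, 0.1], [0.8, 0.2],    # positive-ish
                         [0.1, 0.9], [0.2, 0.8],    # negative-ish
                         [0.5, 0.5], [0.5, 0.4]])   # objective-ish

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(word_vectors)

# Block-structured smoothing matrix H: weight is shared among words of one cluster.
n = len(vocab)
H = np.zeros((n, n))
for c in set(labels):
    idx = np.where(labels == c)[0]
    H[np.ix_(idx, idx)] = 1.0 / len(idx)

# Toy tf vectors (rows = documents), smoothed with H, then embedded with t-SNE.
X = np.array([[2, 1, 0, 0, 1, 0],
              [0, 0, 2, 1, 0, 1],
              [1, 0, 1, 0, 2, 2],
              [0, 1, 0, 2, 1, 0]], dtype=float)
emb = TSNE(n_components=2, perplexity=2, init="random").fit_transform(X @ H.T)
```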
Now, in total, there are four tools/methods we can use to do all of the above:
Manual Specification (applying unsupervised algorithms to the diffusion matrix, which then acts as a frequency counter; all of this is done on an unlabelled dataset).
Contextual Diffusion (based on word similarities and weights)
Use Google (Web) n-grams -- a modified form of contextual diffusion.
Use WordNet (using hierarchical labelled word clusters, hypernyms, synonyms, ...) -- a rough sketch follows this list.
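As a rough illustration of the WordNet route (assuming NLTK with the WordNet corpus downloaded; the tiny vocabulary is made up), a semantic similarity matrix T can be filled with path similarities between the first synsets of each word pair:

```python
import numpy as np
from nltk.corpus import wordnet as wn   # requires: nltk.download("wordnet")

vocab = ["car", "automobile", "dog", "bank"]   # hypothetical tiny vocabulary

def word_similarity(w1, w2):
    """Path similarity between the first synsets of two words (0 if undefined)."""
    s1, s2 = wn.synsets(w1), wn.synsets(w2)
    if not s1 or not s2:
        return 0.0
    return s1[0].path_similarity(s2[0]) or 0.0

n = len(vocab)
T = np.array([[word_similarity(vocab[i], vocab[j]) for j in range(n)]
              for i in range(n)])
print(np.round(T, 2))

# After symmetrizing and (if needed) adding a small ridge to keep it positive
# semidefinite, T can be plugged into d_T(x, y) = sqrt((x - y)^T T (x - y)).
```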
What I think we can actually do:
We can first create a counter crawler using BigQuery to size our dataset.
Then, after getting the size, we run (simulate) dummy versions of all the above methods on a part of that dataset, estimate each one's likely complexity, and go with the best possible option.
The main job in this whole proposal is to create the dummy algos and to decide how large a part of the dataset we have to choose so that the simulation run will be fruitful...
So what I am proposing is to use a Divide-and-Conquer approach: create a test run and use its result to go for the final run...
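A minimal sketch of that test-run idea (the sampling fraction, the candidate set, and the `corpus_docs` name are placeholders, and the BigQuery size count is assumed to happen elsewhere): draw a random sample of the corpus, time each candidate reduction on it, and use the timings to pick the method for the final run.

```python
import time
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, LocallyLinearEmbedding

def benchmark(docs, sample_frac=0.1, seed=0):
    """Time each candidate reduction on a random sample of the corpus."""
    rng = np.random.default_rng(seed)
    k = min(len(docs), max(5, int(len(docs) * sample_frac)))
    sample = [docs[i] for i in rng.choice(len(docs), k, replace=False)]
    X = CountVectorizer().fit_transform(sample).toarray().astype(float)
    candidates = {
        "pca": PCA(n_components=2),
        "tsne": TSNE(n_components=2, perplexity=min(5, k - 1), init="random"),
        "lle": LocallyLinearEmbedding(n_components=2, n_neighbors=min(4, k - 1)),
    }
    timings = {}
    for name, model in candidates.items():
        start = time.perf_counter()
        model.fit_transform(X)
        timings[name] = time.perf_counter() - start
    return timings

# Hypothetical usage once the corpus size is known (e.g. from the BigQuery count):
# print(benchmark(corpus_docs, sample_frac=0.05))
```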
……………………………………………………………………………………………………………….
Reference Reading:
Dimensionality Reduction for Text using Domain Knowledge → Click Here
Spaceland Embedding of Sparse Stochastic Graphs → Click Here
Semantic distance in WordNet: → https://drive.google.com/drive/folders/1SR66I4XYKoyDj3gXYyB0YQODBdm4Q3MR
Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy → https://drive.google.com/drive/folders/1SR66I4XYKoyDj3gXYyB0YQODBdm4Q3MR
Mao, Yi & Dillon, Joshua & Lebanon, Guy. (2007). Sequential Document Visualization. IEEE transactions on visualization and computer graphics. 13. 1208-15. 10.1109/TVCG.2007.70592.