Dimensionality Reduction problem #2

Open
Samsomyajit opened this issue Oct 15, 2019 · 4 comments

Samsomyajit commented Oct 15, 2019

As Sayan da (@sayanmondal2098) reported to me earlier, dimensionality reduction will play a huge (important) role in our creation of a well-modeled text summarizer, so I've been doing some research and I may have come across a solution. [Though further discussion on the topic seems necessary.]

Irrespective of which of our two candidate approaches to the text summarizer we take, abstractive or extractive, dimensionality reduction is a problem in its own right. The papers I've been reading suggest quite different solutions to it, and if it is not understood and dealt with properly it could stall our project.

So, first, let us understand the nature of our problem:

The text files from which we will build our summarizer are high-dimensional (complex), which makes it hard to visualize our dataset; and if we can't visualize the data, we cannot decide which algorithm, or even which approach, to use for the summarizer. What we need is a dimensionality reduction algorithm (process) that reduces the complex dimensionality of our dataset to 2-D/3-D, so that we can visualize it in clear plots.

What we already have and the problem:

Standard dimensionality reduction methods such as principal component analysis (PCA), locally linear embedding (LLE) (Roweis and Saul, 2000), or t-distributed stochastic neighbor embedding (t-SNE) (van der Maaten and Hinton, 2008) take as input a set of feature vectors such as bag-of-words or tf vectors. An obvious drawback is that such methods ignore the textual nature of documents and instead treat the vocabulary words v1, ..., vn as abstract orthogonal dimensions.
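For concreteness, here is a minimal sketch of that standard pipeline, assuming scikit-learn and a toy four-document corpus (both illustrative choices, not project decisions):

```python
# Minimal sketch: tf (bag-of-words) vectors reduced to 2-D with PCA and t-SNE.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock markets fell sharply today",
    "investors sold shares amid fears",
]

# Each vocabulary word becomes one (abstract, orthogonal) dimension.
tf = CountVectorizer().fit_transform(docs).toarray()

coords_pca = PCA(n_components=2).fit_transform(tf)

# perplexity must stay below the number of samples.
coords_tsne = TSNE(n_components=2, perplexity=2, random_state=0).fit_transform(tf)

print(coords_pca)
print(coords_tsne)
```

Note how nothing in this pipeline knows that, say, "cat" and "cats" are related; that is exactly the drawback described above.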

What we can do:

We focus on the following type of non-Euclidean geometry, where the distance between documents x and y is defined as

d_T(x, y) = sqrt((x − y)⊤ T (x − y)) …. (1)

Here T ∈ R^(n×n) is a symmetric positive semidefinite matrix, and we assume that documents x, y are represented as term-frequency (tf) column vectors. Since T can always be written as H⊤H for some matrix H ∈ R^(n×n), an equivalent but sometimes more intuitive interpretation of (1) is to compose the mapping x ↦ Hx with the Euclidean geometry

d_T(x, y) = d_I(Hx, Hy) = ||Hx − Hy||_2 …. (2)

We can view T as encoding the semantic similarity between pairs of words and H as smoothing the tf vector by mapping observed words to related but unobserved words. Therefore, the geometry realized by (1) or (2) may be used to derive novel dimensionality reduction methods that are customized to text in general and to specific text domains in particular. The main challenge is to obtain the matrices H or T that describe the relationship amongst vocabulary words appropriately.
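To make equations (1) and (2) concrete, here is a small numerical sketch; the 3-word similarity matrix T below is a made-up toy, not one derived from any real corpus:

```python
# Sketch of equations (1) and (2): a Mahalanobis-style document distance
# d_T(x, y) = sqrt((x - y)^T T (x - y)) with T factored as H^T H.
import numpy as np

# Toy "semantic similarity" matrix: words 0 and 1 are related, word 2 is not.
T = np.array([
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

# Factor T = H^T H. Cholesky needs strict positive definiteness,
# so a tiny ridge keeps the factorization safe.
H = np.linalg.cholesky(T + 1e-9 * np.eye(3)).T

x = np.array([2.0, 0.0, 1.0])  # tf vector of document x
y = np.array([0.0, 2.0, 1.0])  # tf vector of document y

d_direct = np.sqrt((x - y) @ T @ (x - y))  # equation (1)
d_mapped = np.linalg.norm(H @ x - H @ y)   # equation (2)

print(d_direct, d_mapped)  # the two values agree (~1.26)
```

Because words 0 and 1 are marked as similar, the two documents come out closer (≈1.26) than under plain Euclidean distance (≈2.83); that is the smoothing effect of H described above.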

[To skip all the math above: we can use unsupervised learning to solve our problem, since our data will mostly be unlabelled.]

Or, as an alternative to the approach above, we can use clustering (another non-Euclidean method) to create clusters labelled "positive sentiment words", "negative sentiment words", and "objective words", along with their intensities and reviews, and then use t-SNE or LLE (a rough sketch follows below).

[But this method will create problems of its own, though nothing we can't manage...]
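Here is one rough sketch of that clustering route, with KMeans standing in for whichever clusterer we pick, a toy corpus, and unlabelled cluster IDs in place of real sentiment labels (which would need a lexicon such as SentiWordNet):

```python
# Cluster vocabulary words by their co-occurrence profiles, then embed
# the words in 2-D with t-SNE for inspection.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

docs = [
    "great wonderful amazing product",
    "terrible awful horrible experience",
    "the report describes quarterly figures",
    "excellent fantastic great service",
    "bad poor awful quality",
]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)        # documents x words
words = vec.get_feature_names_out()
W = X.T.toarray()                  # words x documents: one row per word

# Three clusters, hoping for positive / negative / objective groupings.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(W)

coords = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(W)
for word, cluster, (px, py) in zip(words, labels, coords):
    print(f"{word}\t{cluster}\t({px:.1f}, {py:.1f})")
```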

Now, in total there are four tools/methods we can use to do all of the above (a sketch for the WordNet option follows the list):

1. Manual specification (applying unsupervised algorithms to the diffusion matrix, which then acts as a frequency counter; all of this is done on an unlabelled dataset).
2. Contextual diffusion (based on word similarities and weights).
3. Google (Web) n-grams -- a modified contextual diffusion.
4. WordNet (using hierarchically labelled word clusters, hypernyms, synonyms, ...).
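For method 4, here is one way the similarity matrix T could be derived from WordNet; the use of `path_similarity` and the toy vocabulary are my own illustrative choices (assumes nltk with the wordnet corpus downloaded):

```python
# Build a word-similarity matrix T from WordNet path similarity,
# which follows the hypernym/synonym hierarchy.
from itertools import product
import numpy as np
from nltk.corpus import wordnet as wn

vocab = ["car", "automobile", "truck", "banana"]

def word_similarity(w1, w2):
    """Max path similarity over all synset pairs; 0 if none is defined."""
    scores = [
        s1.path_similarity(s2) or 0.0
        for s1, s2 in product(wn.synsets(w1), wn.synsets(w2))
    ]
    return max(scores, default=0.0)

n = len(vocab)
T = np.eye(n)
for i in range(n):
    for j in range(i + 1, n):
        T[i, j] = T[j, i] = word_similarity(vocab[i], vocab[j])

print(T)  # car/automobile share a synset, so their entry is 1.0
```

One caveat: T built this way is not guaranteed to be positive semidefinite, so it would need to be projected onto the PSD cone (or used only heuristically) before plugging it into equation (1).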

What I think we can actually do:

We can first create a counter crawler using BigQuery to size our dataset.
Then, after getting the size, we run (simulate) dummy algorithms for all of the above methods on a part of that dataset, estimate each one's complexity, and go with the best option.
The main job in this whole proposal is to create the dummy algorithms and to decide how large a part of the dataset to use so that the simulation run will be fruitful...

So what I am proposing is to use a divide-and-conquer approach: do a test run and use its results to go for the final run...
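As a sketch of what that test run could look like, here is a timing harness over a few candidate reducers; the random matrix stands in for a sampled tf matrix, and the candidate set and sizes are placeholders, not project decisions:

```python
# Time each candidate dimensionality reducer on a small sample
# and report the cost, as a stand-in for the proposed "dummy" test run.
import time
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.manifold import TSNE, LocallyLinearEmbedding

rng = np.random.default_rng(0)
sample = rng.random((200, 50))  # placeholder for a sampled tf matrix

candidates = {
    "PCA": PCA(n_components=2),
    "LSA": TruncatedSVD(n_components=2),
    "LLE": LocallyLinearEmbedding(n_components=2, n_neighbors=10),
    "t-SNE": TSNE(n_components=2, perplexity=30, random_state=0),
}

for name, reducer in candidates.items():
    start = time.perf_counter()
    reducer.fit_transform(sample)
    print(f"{name}: {time.perf_counter() - start:.2f}s")
```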
---
Reference Reading:

- Dimensionality Reduction for Text using Domain Knowledge → Click Here
- Spaceland Embedding of Sparse Stochastic Graphs → Click Here
- Semantic distance in WordNet → https://drive.google.com/drive/folders/1SR66I4XYKoyDj3gXYyB0YQODBdm4Q3MR
- Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy → https://drive.google.com/drive/folders/1SR66I4XYKoyDj3gXYyB0YQODBdm4Q3MR
- Mao, Yi, Dillon, Joshua, & Lebanon, Guy (2007). Sequential Document Visualization. IEEE Transactions on Visualization and Computer Graphics, 13, 1208–1215. doi:10.1109/TVCG.2007.70592

@Samsomyajit added the good first issue, hacktoberfest, enhancement, and help wanted labels on Oct 15, 2019
@codecakes

Hi @Samsomyajit
I've dabbled with creating tf-idf from n-grams and have done some work on prediction using SVMs.

What prerequisite reading materials would one need to work further on this issue?

@Samsomyajit

Hi @codecakes
I would like to know what kind of further work you have in mind (its nature), and whether you want research papers or articles on tf-idf, n-grams, or SVMs?


codecakes commented Dec 20, 2019

Hi @Samsomyajit

> We can first create a counter crawler using BigQuery to size our dataset.

What is the counter crawler you are talking about? Is it a word counter? Do you want to create something like TfidfVectorizer?

> Then, after getting the size, we run (simulate) dummy algorithms for all of the above methods on a part of that dataset, estimate each one's complexity, and go with the best option.

Can you point me to the resources for these algorithms, and to the input data and expected output data?

@Samsomyajit

@codecakes
It's like the label-count algorithm for a web crawler: similar in structure but a bit different; instead of web labels or keywords, we count similar words for the weight matrix (it was just an idea)...
We are not actually going forward with the above approach...

@sayanmondal2098 pinned this issue Dec 21, 2019