
.webpages() additional tokenized columns? #402

Closed
ruebot opened this issue Jan 9, 2020 · 14 comments

@ruebot
Member

ruebot commented Jan 9, 2020

Currently .webpages() creates a DataFrame with the following columns:

  • crawl_date
  • url
  • mime_type_web_server
  • mime_type_tika
  • content

[Screenshot: .webpages() DataFrame output showing the columns listed above]

The content column is the full text of the page, with HTTP headers and HTML removed.

While experimenting with full-text analysis in parquet_text_analyis.ipynb, we add additional columns via NLTK for tokenized words and a tokenized-text word count.

[Screenshot: DataFrame with the additional tokenized-words and word-count columns]

The tokenization process is pretty intensive. It takes around 30 minutes to complete in the example notebook with the banq dataset, and it nearly exhausts the ~25 GB of RAM allotted via Colab.

So, instead of doing this after the fact, why don't we consider doing it upfront in .webpages()? Spark MLlib has a tokenizer, along with a few other options; a rough sketch follows at the end of this comment. Since I'm not a text analysis expert, and tossing in new columns would just be stabbing in the dark, let's get a sense of what would actually be useful.

Check out this, and let us know what else would be useful out of the box in the .webpages() DataFrame.
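A rough sketch of what such columns could look like with Spark MLlib's Tokenizer in PySpark, assuming the .webpages() output is available as a DataFrame named webpages; the new column names are illustrative, not part of aut:

# Sketch only: tokenized-words and word-count columns via Spark MLlib.
# Assumes `webpages` is the DataFrame returned by .webpages(); the output
# column names here are illustrative assumptions.
from pyspark.ml.feature import Tokenizer
from pyspark.sql.functions import size

tokenizer = Tokenizer(inputCol="content", outputCol="tokens")
tokenized = tokenizer.transform(webpages)

# Derive a per-page word count from the token array.
tokenized = tokenized.withColumn("token_count", size("tokens"))

tokenized.select("url", "tokens", "token_count").show(5, truncate=80)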

@ruebot ruebot self-assigned this Jan 9, 2020
@ruebot
Member Author

ruebot commented Jan 9, 2020

Might be worth giving Spark NLP a more exhaustive look again.

@ruebot
Member Author

ruebot commented Jan 9, 2020

We should probably add a language column, and we can do that pretty easily with DetectLanguage. Then, we could use that for tokenization, as Yves rightfully calls out.
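A rough Python-side sketch of what that language column could look like; DetectLanguage itself lives on the Scala side, so this uses the third-party langdetect package purely as a stand-in, and the DataFrame name is an assumption:

# Rough sketch only: deriving a `language` column in PySpark, using the
# third-party langdetect package as a stand-in for DetectLanguage.
# Assumes `webpages` is the .webpages() DataFrame with a `content` column.
from langdetect import detect
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

def guess_language(text):
    try:
        return detect(text) if text and text.strip() else None
    except Exception:  # langdetect raises on empty or undecidable text
        return None

language_udf = udf(guess_language, StringType())
webpages_with_lang = webpages.withColumn("language", language_udf(col("content")))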

@lintool
Member

lintool commented Jan 9, 2020

I don't think we should include tokenizations as an additional column. My general thinking is to be as conservative as possible - unless there are scholars clamoring for a pre-generated field, don't include it. Otherwise, the derivatives will just become larger and larger and more unwieldy over time.

@lintool
Member

lintool commented Jan 9, 2020

Another reason against - there is no such thing as a "canonical" tokenization. Every tokenizer behaves differently... so unless a scholar happens to want exactly your tokenization, it's not going to be useful...

@ruebot ruebot changed the title .webpages() addition tokenized columns .webpages() additional tokenized columns? Jan 9, 2020
@SinghGursimran
Collaborator

To reduce the time required for tokenization, if the scholar can set up a distributed environment, we could add a guide for text analysis in PySpark. Instead of plain Python, where we convert to a pandas DataFrame, we could use a PySpark DataFrame and run the analytics on it with MLlib (see the sketch below).
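As a rough starting point for such a guide, a sketch of reading the .webpages() Parquet derivative straight into PySpark and tokenizing with MLlib; the path and the output column names are assumptions:

# Sketch: text analysis on the .webpages() Parquet derivative entirely in
# PySpark, without converting to pandas. The path and added column names
# are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover

spark = SparkSession.builder.appName("webpages-text-analysis").getOrCreate()
pages = spark.read.parquet("/path/to/webpages-derivative/")

# Tokenize and drop English stop words on the executors, not in the driver.
tokenizer = RegexTokenizer(inputCol="content", outputCol="raw_tokens", pattern="\\W+")
remover = StopWordsRemover(inputCol="raw_tokens", outputCol="tokens")
tokens = remover.transform(tokenizer.transform(pages))

# Example analytic: most frequent terms across the collection.
(tokens.select(explode(col("tokens")).alias("term"))
       .groupBy("term")
       .count()
       .orderBy(col("count").desc())
       .show(20))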

@lintool
Member

lintool commented Jan 9, 2020

For basic NLP, spaCy https://spacy.io/ has become the go-to toolkit... try it from the Python end?
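A minimal sketch of spaCy tokenization over the content column on the pandas side, assuming a pandas DataFrame named pages and the small English model installed; nlp.pipe streams documents in batches, which is cheaper than calling nlp() once per row:

# Minimal sketch: spaCy tokenization on the pandas side. Assumes `pages` is
# a pandas DataFrame with a `content` column and that the small English
# model has been installed with `python -m spacy download en_core_web_sm`.
import spacy

nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser", "ner"])

# nlp.pipe processes documents in batches rather than one call per row.
pages["tokens"] = [
    [token.text for token in doc]
    for doc in nlp.pipe(pages["content"].fillna("").tolist(), batch_size=100)
]
pages["token_count"] = pages["tokens"].str.len()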

@ruebot
Member Author

ruebot commented Jan 10, 2020

spacy.io looks really promising; the memory footprint appears to be a lot smaller than NLTK's. But I'm now well over an hour into running tokenization on a DataFrame, whereas the NLTK option takes ~30 minutes.

Overall, I'm just trying to find some balance between the valid issues @lintool raises and the reality of taking a derivative from .webpages() and being stuck with a seemingly endless spinning wheel. Or, rephrased: finding the balance between those who have the capability and know-how to run aut as a library, and those who just want to take the derivative output and continue their research in a notebook on their laptop.

[Screenshot: tokenization cell still running in the Colab notebook]

Definitely lots of food for thought. Hopefully, we get some good feedback from researchers looking to use our derivative output for text analysis.

...
...
...

Another option could be to just create an .enhancedWebpages() function? 🤷‍♂️

@lintool
Member

lintool commented Jan 10, 2020

-1 on .enhancedWebpages()

I think this is a good potential scenario for the derivatives of derivatives idea we discussed with Raymie.

@organisciak

organisciak commented Jan 10, 2020

SpaCy is especially costly, but you can turn off certain modules, e.g.

doc = nlp(text, disable=['tagger', 'parser', 'ner'])

I agree with @lintool in spirit that different scholars may want different tokenization approaches, but I very much believe in giving them something - it lowers the barrier to access, and those with different needs can retokenize.

I assume Colab only gives you one process? If you have a multi-core machine you can swap out pandas for dask, and then apply will use multi-processing or multi-threading (depending on settings). I vaguely recall this not being too useful with this exact use case (SpaCy) because too much of the processing was locked from parallelizing, but I don't recall where I formed that impression! Maybe worth trying?
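A hedged sketch of that dask swap on a multi-core machine; tokenize is a hypothetical function standing in for whatever tokenizer is in play, and the partition count is illustrative:

# Hedged sketch of swapping pandas apply for dask on a multi-core machine.
# `tokenize` is a hypothetical function; the partition count is illustrative.
import dask.dataframe as dd

ddf = dd.from_pandas(pages, npartitions=8)
ddf["tokens"] = ddf["content"].apply(tokenize, meta=("tokens", "object"))
pages_tokenized = ddf.compute(scheduler="processes")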

When using apply, I expect it would be quicker to do it just on the Series comprising the column you care about. I'm not sure whether it's trivially or notably faster, but instead of pages.apply(lambda row: tokenize(row.content)), trying pages.content.apply(lambda txt: tokenize(txt)) may help, since you're not passing extra data around.

Another thing that might be worth considering is a dumb tokenize function, again in the spirit of giving something basically useful if not perfect. e.g. splitting on whitespace: pages['dumb_tokens'] = pages.content.str.split().

@organisciak

By the way, I see that you use Parquet. @bmschmidt smartly pointed out (massivetexts/htrc-feature-reader#8) that when you have repeating values in columns, like your mime type and crawl date columns, the order in which you sort on those columns affects the compressed size noticeably - even when using snappy compression, which ostensibly favors speed over compression ratio. Hot tip :)
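For example, a rough pandas illustration of sorting on those low-cardinality columns before writing Parquet; the paths are illustrative and the gain depends on the data:

# Rough illustration: sorting on the low-cardinality columns before writing
# Parquet lets long runs of repeated values compress better, even with snappy.
# Paths here are illustrative.
import pandas as pd

pages = pd.read_parquet("webpages.parquet")
(pages.sort_values(["crawl_date", "mime_type_web_server"])
      .to_parquet("webpages-sorted.parquet", compression="snappy"))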

@ruebot
Member Author

ruebot commented Jan 12, 2020

@organisciak I tried pages.content.apply, and the time difference between the two methods was negligible; both took just over 26 minutes. I'm looping back around to using PySpark and MLlib in the sample text-analysis notebook. That said, it might be useful to chat sometime. I'm curious what your experiences are with HTRC data; it'd be useful to compare them with our experience working with TBs of web archive data.

@lintool, et al., quick testing with PySpark and MLlib in Colab seems to be moving a lot quicker than plain pandas and NLTK. If researchers are going to use the CSV or Parquet output of .webpages(), I see the rationale for not including tokenized text, since we'd be assuming too much. I really see my naivety in writing up this issue now. The feedback here and on Twitter has been really great!

As I'm hacking on this notebook and thinking about the feedback: if a consumer of the output of .webpages() wants to go down the tokenization path, would it be helpful to give them one more column, the output of DetectLanguage? That way they'd at least have a decent idea of the language for a given row, and could run tokenization, or anything similar, based on it.

All this for archivesunleashed/notebooks#4 @ianmilligan1 😆

@lintool
Member

lintool commented Jan 12, 2020

+1 on language id

@ruebot
Member Author

ruebot commented Jan 23, 2020

Seeing no more discussion, I'll mark this as resolved with bc0d663

@ruebot ruebot closed this as completed Jan 23, 2020