.webpages() additional tokenized columns? #402
Comments
Might be worth giving Spark NLP a more exhaustive look again.
We should probably add a language column, and we can do that pretty easily with …
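(A hedged illustration only: the specific tool in the comment above is elided, so this sketch uses the off-the-shelf `langdetect` package and assumed file and column names to show what a language column could look like.)

```python
# Sketch: deriving a language column for the .webpages() output in pandas.
# The langdetect package, file path, and column names are assumptions, not
# necessarily the tool the comment above refers to.
import pandas as pd
from langdetect import detect

def detect_language(text):
    try:
        return detect(text)
    except Exception:
        return None  # empty or non-linguistic content raises in langdetect

df = pd.read_parquet("webpages.parquet")
df["language"] = df["content"].fillna("").apply(detect_language)
```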
I don't think we should include tokenizations as an additional column. My general thinking is to be as conservative as possible - unless there is scholarly clamoring for a pre-generated field, don't include it. Otherwise, the derivatives will just become larger and larger and more unwieldy over time.
Another reason against - there is no such thing as a "canonical" tokenization. Every tokenizer behaves differently... so unless a scholar happens to want exactly your tokenization, it's not going to be useful...
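To make the point concrete, a small sketch of two common tokenizers disagreeing on the same sentence (NLTK's `word_tokenize` requires the `punkt` data package):

```python
# Sketch: two tokenizers disagree on the same input, which is why no single
# pre-generated tokenization is "canonical".
from nltk.tokenize import word_tokenize  # needs: nltk.download("punkt")

text = "It's a web archive-derived dataset."
print(word_tokenize(text))  # ['It', "'s", 'a', 'web', 'archive-derived', 'dataset', '.']
print(text.split())         # ["It's", 'a', 'web', 'archive-derived', 'dataset.']
```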
To reduce the time required for tokenization, if the scholar can set up a distributed environment, we can add a guide for text analysis in PySpark. Instead of plain Python where we convert to a pandas DataFrame, we can use a PySpark DataFrame and perform the analysis on it with MLlib.
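A minimal sketch of that PySpark/MLlib route, assuming a local `SparkSession` and a Parquet derivative with a `content` column (file path and column names are assumptions):

```python
# Sketch: tokenizing the content column of a webpages derivative with
# Spark MLlib's RegexTokenizer. Path and column names are assumptions.
from pyspark.sql import SparkSession
from pyspark.ml.feature import RegexTokenizer

spark = SparkSession.builder.appName("webpages-tokenize").getOrCreate()

df = spark.read.parquet("webpages.parquet").na.fill({"content": ""})

tokenizer = RegexTokenizer(inputCol="content", outputCol="tokens", pattern="\\W+")
tokenized = tokenizer.transform(df)

tokenized.select("url", "tokens").show(5, truncate=80)
```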
For basic NLP, spaCy https://spacy.io/ has become the go-to toolkit... try it from the Python end?
spacy.io looks really promising. The memory footprint appears to be a lot smaller than NLTK's. But I'm now well over an hour into executing tokenization on a DataFrame, and the NLTK option takes ~30 minutes.

Overall, I'm just trying to find some balance between the valid issues @lintool raises and the reality of taking a derivative from … Definitely lots of food for thought. Hopefully we get some good feedback from researchers looking to use our derivative output for text analysis.

Another option could be to just create an …
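For context, this is roughly what spaCy tokenization over a DataFrame column might look like; `nlp.pipe` batches documents and is usually much faster than calling `nlp` row by row (model name, file path, and column names are assumptions):

```python
# Sketch: tokenizing a pandas content column with spaCy in batches.
# Model name, file path, and column names are assumptions.
import pandas as pd
import spacy

nlp = spacy.load("en_core_web_sm")
df = pd.read_parquet("webpages.parquet")

df["tokens"] = [
    [token.text for token in doc]
    for doc in nlp.pipe(df["content"].fillna(""), batch_size=100)
]
```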
-1 on … I think this is a good potential scenario for the derivatives of derivatives idea we discussed with Raymie.
SpaCy is especially costly, but you can turn off certain modules, e.g.
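A hedged sketch of that idea: disabling the heavier pipeline components when only tokenization is needed (the model name is an assumption):

```python
# Sketch: load spaCy with heavier pipeline components disabled, since only
# tokenization is needed here. The model name is an assumption.
import spacy

nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser", "ner"])
doc = nlp("Archived web pages often contain boilerplate text.")
tokens = [token.text for token in doc]
```

For a whole column, `nlp.pipe(...)` (as in the earlier sketch) keeps the batching benefit while the disabled components stay off.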
I agree with @lintool in spirit, that different scholars may want different tokenization approaches, but I very much believe in giving them something - it lowers the barrier to access, and those with different needs can retokenize.

I assume Colab only gives you one process? If you have a multi-core machine you can swap out pandas for dask, and then … When using …

Another thing that might be worth considering is a dumb tokenize function, again in the spirit of giving something basically useful if not perfect, e.g. splitting on whitespace (see the sketch below):
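A minimal sketch of that whitespace-splitting idea against a pandas column (file path and column names are assumptions):

```python
# Sketch: a deliberately simple whitespace tokenizer plus a word count,
# applied to the content column in pandas. Names are assumptions.
import pandas as pd

df = pd.read_parquet("webpages.parquet")
df["tokens"] = df["content"].fillna("").str.split()  # split on any whitespace
df["token_count"] = df["tokens"].str.len()           # tokenized text word count
```

Swapping pandas for dask here is, roughly, `import dask.dataframe as dd` plus `dd.read_parquet(...)` with the same string accessors and a final `.compute()`, so a multi-core machine can parallelize the same code.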
By the way, I see that you use Parquet. @bmschmidt smartly pointed out (massivetexts/htrc-feature-reader#8) that when you have repeating values in columns, like your mime type and crawl date columns, the order in which you sort the columns affects the compression size notably - even when using …
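One hedged reading of that tip, as a sketch: sort so that repeated values in those columns sit next to each other before writing, then compare file sizes (file paths, column names, and codec are assumptions):

```python
# Sketch: compare Parquet sizes with and without sorting on columns that
# hold many repeated values. Paths, column names, and codec are assumptions.
import os
import pandas as pd

df = pd.read_parquet("webpages.parquet")

df.to_parquet("unsorted.parquet", compression="snappy")
df.sort_values(["mime_type_web_server", "crawl_date"]).to_parquet(
    "sorted.parquet", compression="snappy"
)

print(os.path.getsize("unsorted.parquet"), os.path.getsize("sorted.parquet"))
```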
@organisciak tried the …

@lintool, et al., quick testing with PySpark and MLlib in Colab seems to be moving a lot quicker than just plain Pandas and NLTK. If researchers are going to use the CSV or Parquet output of …

As I'm hacking on this notebook, and thinking about the feedback, if a consumer of the output of …

All this for archivesunleashed/notebooks#4 @ianmilligan1 😆
+1 on language id |
Seeing no more discussion, I'll mark this as resolved with bc0d663 |
Currently, `.webpages()` creates a DataFrame with the following columns:

- `crawl_date`
- `url`
- `mime_type_web_server`
- `mime_type_tika`
- `content`

The `content` column is the full text of the page, with HTTP headers and HTML removed.

In experimenting with full-text analysis in parquet_text_analyis.ipynb, we add some additional columns via `nltk` for tokenized words and a tokenized text word count. The tokenization process is pretty intensive: it takes around 30 minutes to complete in the example notebook with the banq dataset, and it nearly exhausts the ~25G of RAM that is allotted via Colab.

So, instead of doing this post hoc, why don't we consider doing it upfront in `.webpages()`? Spark MLlib has a tokenizer, and a few other options. Since I'm not a text analysis expert, and it'd just be stabbing in the dark tossing in new columns, let's get a sense of what would actually be useful. Check out this …, and let us know what else would be useful out of the box in the `.webpages()` DataFrame.
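For reference, a hedged sketch of the post-hoc NLTK approach described above, adding a tokenized-words column and a word count to the Parquet output in pandas (file path and column names are assumptions):

```python
# Sketch: the post-hoc NLTK tokenization described in the issue, run against
# the .webpages() Parquet output in pandas. Path and column names are
# assumptions; the punkt data package must be downloaded first.
import nltk
import pandas as pd
from nltk.tokenize import word_tokenize

nltk.download("punkt")

df = pd.read_parquet("webpages.parquet")
df["tokenized_words"] = df["content"].fillna("").apply(word_tokenize)
df["word_count"] = df["tokenized_words"].apply(len)
```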