Loading fastText models using only bin file #1341

Merged: 34 commits, Jun 28, 2017
7759a95
french wiki issue resolved
May 22, 2017
c12b4fa
Merge branch 'develop' into french
prakhar2b May 22, 2017
8025710
bin and vec mismatch handled
prakhar2b May 22, 2017
7ee83d9
updating with lastest codes and resolving conflicts
May 23, 2017
041a6e9
Merge branch 'develop' of https://github.com/RaRe-Technologies/gensim…
Jun 2, 2017
22c6710
added test from bin only loading
Jun 2, 2017
61be613
[WIP] loading bin only
Jun 2, 2017
e11ac44
word vec from its ngrams
Jun 6, 2017
a63a3bc
[WIP] word vec from ngrams
Jun 6, 2017
f80410f
Merge branch 'develop' of https://github.com/RaRe-Technologies/gensim…
Jun 7, 2017
454d74e
[WIP] getting syn0 from all n-grams
Jun 7, 2017
e6b0d8b
[TDD] test comparing word vector from bin_only and default loading
Jun 7, 2017
9b03ea3
cleaned up test code
Jun 8, 2017
c496be9
added docstring for bin_only
Jun 8, 2017
2c4a8dd
Merge branch 'ft_oov_fix' of https://github.com/jayantj/gensim into f…
Jun 12, 2017
d2ab903
resolved wiki.fr issue
Jun 12, 2017
82507d1
pep8 fixes
Jun 12, 2017
c44b958
Merge branch 'develop' of https://github.com/RaRe-Technologies/gensim…
Jun 16, 2017
0fc1159
default bin file loading only
Jun 16, 2017
f421b05
logging info modified plus changes a/c review
Jun 19, 2017
68ec73b
removed unused code in fasttext.py
Jun 19, 2017
f7b372e
removed unused codes and vec files from test
Jun 19, 2017
5f7fe02
added lee_fasttext vec files again
Jun 20, 2017
8bd56cf
re-added removed files and unused codes
Jun 21, 2017
b916187
added file name in logging info
Jun 21, 2017
1a0bfc0
removing unused load_word2vec_format code
Jun 22, 2017
98e0287
updated logging info and comments
Jun 22, 2017
f3d2032
input file name with or without .bin both accepted
Jun 22, 2017
bd7e7f6
resolved typo mistake
Jun 22, 2017
800cd01
test for file name
Jun 22, 2017
a15233a
minor change to input filename handling in ft wrapper
jayantj Jun 23, 2017
431aebf
changes to logging and assert messages, pep8 fixes
jayantj Jun 23, 2017
e52fee4
removes redundant .vec files
jayantj Jun 23, 2017
cebb3fc
fixes utf8 bug in flake8_diff.sh script
jayantj Jun 28, 2017
4 changes: 4 additions & 0 deletions gensim/models/wrappers/fasttext.py
@@ -219,6 +219,10 @@ def save(self, *args, **kwargs):
        kwargs['ignore'] = kwargs.get('ignore', ['syn0norm', 'syn0_all_norm'])
        super(FastText, self).save(*args, **kwargs)

    @classmethod
    def load_word2vec_format(cls, *args, **kwargs):
        return FastTextKeyedVectors.load_word2vec_format(*args, **kwargs)
Collaborator

I believe that a load using this method only learns the full-word vectors as in the .vec file. If so, isn't it true that the resulting object doesn't have any other capabilities beyond a plain KeyedVectors? In that case, using a specialized class like FastTextKeyedVectors – that maybe is trying to do more, such as ngram-tracking, but inherently is not because that info was lost in the sequence-of-steps used to load it – seems potentially misleading. So unless I'm misunderstanding, I think this load-technique should use a plain KeyedVectors.

Contributor Author

Yes, this method is no longer used now that loading relies on the bin file only. I removed this unused code but got a strange flake8 error on Python 3+, so I re-added it for this PR. I'll try removing the unused code later, perhaps in a different PR. @gojomo

Collaborator

That is an odd error! I suspect it's not really the presence/absence of that method that triggered it, but something else either random or hidden in the whitespace.

Contributor Author

@gojomo ok, test passed this time after removing this code 😄

Contributor @jayantj, Jun 28, 2017

For reference, this was a bug in the flake8 script, fixed in cebb3fc
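
The review point above is that a .vec file carries only full-word vectors, so the character n-gram information that fastText uses for out-of-vocabulary words is not available after such a load. As a rough illustration of what that subword information looks like, here is a minimal sketch of fastText-style character n-gram extraction. The helper name `char_ngrams` is hypothetical and this is not the gensim implementation; the default lengths 3 and 6 are assumed from fastText's documented `minn`/`maxn` defaults.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, fastText style.

    The word is wrapped in '<' and '>' boundary markers first, so that
    prefixes and suffixes produce n-grams distinct from word-internal ones.
    """
    padded = "<" + word + ">"
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the padded word.
        for start in range(len(padded) - n + 1):
            ngrams.append(padded[start:start + n])
    return ngrams


if __name__ == "__main__":
    print(char_ngrams("jen", min_n=3, max_n=4))
```

A model loaded from the .bin file can combine the vectors of such n-grams to build a vector for an unseen word; a plain word-to-vector table cannot, which is why the comment suggests a plain KeyedVectors for .vec-only loading.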


    @classmethod
    def load_fasttext_format(cls, model_file, encoding='utf8'):
        """
172 changes: 172 additions & 0 deletions gensim/test/test_data/cp852_fasttext.vec
@@ -0,0 +1,172 @@
171 2
ji -0.79132 1.9605
který -0.90811 1.6411
jen -0.91547 2.0157
podle -0.64689 1.6221
zde -0.79732 2.4019
už -0.69159 1.7167
být -0.455 1.3266
více -0.75901 1.688
bude -0.71114 2.0771
již -0.73027 1.267
než -0.97888 1.8332
vás -0.72803 1.6653
by -0.75761 1.9683
které -0.68791 1.6069
co -1.0059 1.6869
nebo -0.94393 1.9611
ten -0.71975 2.124
tak -0.80566 2.0783
má -0.83065 1.3732
při -0.62158 1.8313
od -0.44113 1.7755
po -0.7059 2.2615
tipy -0.60682 1.7247
ještě -0.68854 1.7517
až -0.63201 1.4618
bez -0.52021 1.4513
také -0.67762 1.8138
pouze -0.62611 1.82
první -0.42235 1.6216
vaše -0.7407 1.5659
která -0.70914 1.7359
nás -0.38286 1.6016
nový -0.83421 1.7609
jsou -0.82699 1.9694
pokud -0.35516 1.5075
může -0.78928 1.6357
strana -0.57276 1.4149
jeho -0.78568 2.0226
své -0.44488 1.459
jiné -0.90751 1.9602
zprávy -0.90152 1.9703
nové -0.78853 1.8593
není -0.63949 1.5191
tomu -0.68126 1.8729
ona -0.74442 1.825
ono -0.78171 1.9268
oni -0.64023 2.0525
ony -0.78142 1.7097
my -0.61062 1.8857
vy -0.9356 1.8875
jí -0.44615 0.92715
mě -0.73676 1.4089
mne -0.71006 1.7072
jemu -0.92237 2.1452
on -0.71417 1.9224
těm -0.65242 1.8779
těmu -0.83376 2.054
němu -0.79287 1.8645
němuž -0.51786 1.7297
jehož -0.88721 1.7431
jíž -0.12627 0.68014
jelikož -0.61809 1.7576
jež -0.8843 1.6723
jakož -0.94336 1.827
načež -0.76919 1.8106
ze -0.8277 2.0542
jak -0.97146 1.9164
další -0.5719 1.5148
ale -0.79733 1.8867
si -0.61439 1.7134
se -0.80843 1.8957
ve -0.7186 1.7891
to -0.84494 2.3933
jako -1.1045 2.2656
za -0.7136 1.9602
zpět -0.79965 1.6329
jejich -0.49038 1.6366
do -0.69806 1.8364
pro -0.7878 2.2066
je -1.1291 3.0005
na -1.0203 2.4399
atd -0.70418 1.7405
atp -0.69278 1.5772
jakmile -0.87231 1.6896
přičemž -0.64617 1.4417
já -0.7135 1.5517
nám -0.42164 1.7603
jej -0.77603 1.9544
zda -0.76742 2.0163
proč -0.47241 1.7053
máte -0.75963 1.9814
tato -0.64318 2.0382
kam -0.45101 1.498
tohoto -0.73702 1.8305
kdo -0.80535 1.8551
kteří -0.72498 1.6669
mi -0.46791 1.7784
tyto -0.50319 1.7659
tom -0.59138 1.8657
tomuto -0.74312 1.7725
mít -0.27199 1.1315
nic -0.56441 1.8591
proto -0.6649 1.946
kterou -0.84109 1.7498
byla -0.58737 1.941
toho -0.76081 1.8002
protože -0.55749 1.6686
asi -0.51689 1.7079
budeš -0.55392 1.6052
s -0.74207 1.8989
k -0.61082 2.079
o -0.76465 1.8956
i -0.85412 1.6611
u -0.68535 1.5332
v -0.73033 1.3855
z -0.60751 1.9108
dnes -0.6001 1.7531
cz -0.59754 1.4239
tímto -0.69011 1.6643
ho -0.55961 1.6968
budem -0.54027 1.7894
byli -0.60956 1.793
jseš -0.63127 1.5972
můj -0.48904 1.2814
svým -0.48494 1.8751
ta -0.78131 2.4286
tomto -0.60948 1.7083
tohle -0.74747 1.7907
tuto -0.74687 1.9464
neg -0.60997 1.7777
pod -0.49619 1.914
téma -0.55525 1.6668
mezi -0.46979 1.3583
přes -0.5712 1.9908
ty -0.78637 2.2804
pak -0.60084 1.7026
vám -0.48545 1.4611
ani -0.65672 1.7897
když -0.42318 1.4884
však -0.60908 1.6867
či -0.36843 1.7586
jsem -0.54047 1.827
tento -0.64813 1.9799
článku -0.65578 1.9129
články -0.55868 1.8642
aby -0.80989 1.8384
jsme -0.60673 1.843
před -0.53861 2.0502
pta -0.49464 1.714
a -0.63056 2.2477
aj -0.62546 1.6357
naši -0.5915 1.6066
napište -0.50964 1.777
re -0.95733 1.9544
což -0.54673 1.6466
tím -0.70952 1.8565
takže -0.55439 1.8013
svých -0.36878 1.4883
její -0.7694 1.6612
svými -0.63149 2.1581
jste -0.68444 2.0978
byl -0.57205 1.7836
tu -0.88384 2.2256
tedy -0.62474 2.0469
teto -0.63187 1.884
bylo -0.56362 2.0282
kde -0.7308 2.0316
ke -0.60918 1.9317
pravé -0.52626 1.9058
nad -0.54689 1.8666
nejsou -0.66814 1.8323
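
The file name suggests this fixture is stored in the CP852 code page rather than UTF-8, which would make the `encoding` argument of `load_fasttext_format` matter when reading it. The following is a minimal sketch of that effect using only Python's built-in codecs. The helper `decode_word` is hypothetical and not part of gensim.

```python
def decode_word(raw, encoding="utf8"):
    """Decode raw bytes, substituting U+FFFD for bytes that are invalid in `encoding`."""
    return raw.decode(encoding, errors="replace")


# One word from the fixture, encoded the way a CP852 file would store it.
word = "který"
raw = word.encode("cp852")

# Decoding with the matching code page recovers the word exactly.
right = decode_word(raw, encoding="cp852")

# Decoding the same bytes as UTF-8 cannot map the accented letter, so a
# replacement character appears where it was.
wrong = decode_word(raw, encoding="utf8")

if __name__ == "__main__":
    print(repr(right))
    print(repr(wrong))
```

Without `errors="replace"`, the UTF-8 decode of these bytes raises `UnicodeDecodeError`, so a loader has to be told the right encoding or choose an explicit error-handling policy.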
172 changes: 172 additions & 0 deletions gensim/test/test_data/non_ascii_fasttext.vec
@@ -0,0 +1,172 @@
171 2
ji -1.5308 2.0551
který -0.99211 1.4997
jen -1.1228 1.3667
podle -1.1469 1.4473
zde -1.0191 1.4011
už -0.91921 1.3531
být -1.0086 1.4582
více -1.1058 1.3376
bude -1.2032 1.7383
již -1.3136 1.4792
než -1.0664 1.6635
vás -1.1113 1.5703
by -1.1698 1.966
které -1.1295 1.6275
co -0.93518 1.1776
nebo -1.0791 1.5071
ten -1.1881 1.415
tak -1.4548 1.8457
má -1.0658 1.5255
při -1.3464 1.6107
od -0.79486 1.5585
po -1.2758 1.9186
tipy -0.69335 1.0799
ještě -0.87116 1.1618
až -1.2688 1.6518
bez -0.99627 1.423
také -1.141 1.4808
pouze -0.94181 1.4076
první -1.1166 1.5035
vaše -0.9672 1.4975
která -1.1102 1.5806
nás -1.1328 1.5253
nový -0.85553 1.1462
jsou -1.0792 1.8008
pokud -1.0427 1.3178
může -1.1269 1.419
strana -0.84973 1.1957
jeho -1.1644 1.5879
své -1.0546 1.6185
jiné -0.95046 1.2816
zprávy -0.88762 1.3374
nové -1.0588 1.619
není -1.0321 1.5566
tomu -1.0753 1.5211
ona -1.21 1.6992
ono -1.0733 1.6574
oni -1.1153 1.643
ony -1.0926 1.5244
my -0.92689 1.6378
vy -1.3708 1.8
jí -1.205 1.6606
mě -0.96436 1.4713
mne -1.0956 1.6333
jemu -1.1181 1.4661
on -1.0062 1.4124
těm -0.90732 1.2586
těmu -0.90621 1.4096
němu -1.0823 1.4396
němuž -1.0786 1.3892
jehož -1.1649 1.4418
jíž -1.0574 1.6338
jelikož -1.0449 1.3625
jež -1.2657 1.7032
jakož -1.3373 1.6112
načež -1.0127 1.3696
ze -1.1784 1.7095
jak -1.2097 1.5224
další -0.7288 0.96256
ale -1.1029 1.4153
si -1.1097 1.5884
se -1.2981 1.7707
ve -1.256 1.7985
to -1.6894 2.2424
jako -1.2333 1.5942
za -1.0376 1.6162
zpět -0.83657 1.354
jejich -0.97548 1.4219
do -0.93685 1.4001
pro -1.4367 1.9498
je -1.9446 2.5147
na -1.5543 2.2901
atd -0.98175 1.3697
atp -0.83266 1.1085
jakmile -1.0954 1.2764
přičemž -1.0533 1.4279
já -1.1496 1.4432
nám -1.0246 1.6043
jej -1.203 1.6252
zda -0.93651 1.2363
proč -0.90395 1.3144
máte -0.99962 1.4802
tato -1.3248 1.5575
kam -0.63468 1.246
tohoto -0.9737 1.3422
kdo -0.88982 1.4152
kteří -0.92973 1.4696
mi -1.343 1.7217
tyto -0.99375 1.3067
tom -1.1636 1.608
tomuto -1.0103 1.3488
mít -1.1538 1.6326
nic -0.76497 1.0685
proto -1.1781 1.6367
kterou -1.0561 1.563
byla -0.9338 1.7033
toho -1.1263 1.5702
protože -1.1777 1.4984
asi -1.0555 1.4401
budeš -0.98208 1.5432
s -1.3733 1.6447
k -1.0223 1.6019
o -1.4531 1.879
i -1.0985 1.2956
u -0.91038 1.6173
v -1.2536 1.5998
z -0.96962 1.7437
dnes -0.92891 1.2478
cz -0.84461 1.0881
tímto -0.98475 1.3061
ho -0.74774 1.4925
budem -1.0178 1.4333
byli -0.90776 1.4799
jseš -1.0297 1.4975
můj -0.891 1.2674
svým -1.0586 1.5377
ta -1.4932 2.0156
tomto -1.1626 1.5135
tohle -1.2215 1.6529
tuto -1.0516 1.3583
neg -0.94527 1.5529
pod -1.0601 1.578
téma -0.93273 1.3456
mezi -0.96807 1.3465
přes -1.1927 1.5099
ty -1.3733 1.7374
pak -1.0392 1.5592
vám -0.89801 1.3586
ani -1.2113 1.5634
když -1.0124 1.5112
však -0.75634 1.1299
či -0.79489 1.2817
jsem -1.0435 1.4903
tento -1.0861 1.5053
článku -0.93302 1.3758
články -0.98897 1.4387
aby -1.0874 1.6114
jsme -1.0547 1.6846
před -1.0538 1.5186
pta -1.062 1.6063
a -1.3116 2.0391
aj -1.1578 1.5193
naši -1.2075 1.3714
napište -1.0436 1.4646
re -1.3115 1.5453
což -1.1731 1.3545
tím -1.0296 1.5885
takže -1.1014 1.3574
svých -0.82606 1.1187
její -1.1029 1.3696
svými -1.1052 1.4953
jste -1.1003 1.7465
byl -0.89449 1.4131
tu -1.1255 1.5505
tedy -1.1693 1.6446
teto -1.2134 1.546
bylo -0.86091 1.3805
kde -1.3468 1.7507
ke -1.0699 1.6688
pravé -0.9391 1.5172
nad -1.3404 1.7661
nejsou -0.85023 1.5033
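
Both fixtures follow the plain-text word2vec layout: a header line with the vocabulary size and the vector dimension (`171 2` here), then one word per line followed by its components. Below is a minimal sketch of a reader for that layout. The function `read_vec` is hypothetical and is not how gensim parses these files; the sample uses the first three rows of the file above with the header shortened to `3 2` so that the count matches.

```python
import io


def read_vec(stream):
    """Parse a word2vec-style text stream into (dimension, {word: [floats]})."""
    header = stream.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])

    vectors = {}
    for line in stream:
        line = line.rstrip("\n")
        if not line:
            continue  # tolerate a trailing blank line
        parts = line.split(" ")
        if len(parts) != dim + 1:
            raise ValueError("expected %d fields, got %d" % (dim + 1, len(parts)))
        vectors[parts[0]] = [float(x) for x in parts[1:]]

    if len(vectors) != vocab_size:
        raise ValueError("header promises %d words, found %d" % (vocab_size, len(vectors)))
    return dim, vectors


# Shortened sample: first three rows of non_ascii_fasttext.vec.
sample = io.StringIO(
    "3 2\n"
    "ji -1.5308 2.0551\n"
    "který -0.99211 1.4997\n"
    "jen -1.1228 1.3667\n"
)
dim, vectors = read_vec(sample)

if __name__ == "__main__":
    print(dim, sorted(vectors))
```

A table like this maps whole words to vectors and nothing else, which matches the review comment that .vec-only loading yields no n-gram information.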