Added units to lexicon 🔨 #1043

Closed
wants to merge 1 commit into from

Conversation

MarketingPip
Contributor

Added a ton of missing units to the lexicon list. All have been tested and do not currently get the "#Unit" tag.

(We can remove some in the future once there are rules for them.)

@MarketingPip
Contributor Author

@spencermountain - a ton of these can also be moved into here

@spencermountain
Owner

Hey Jared, we can't tag things like 'j' or 't' as a unit.
The current list has been created carefully to avoid false positives. If you would like to improve it, please create a clean PR that shows which new units are added.
Cheers

@MarketingPip
Contributor Author

@spencermountain just to clarify before I submit another PR: what about things like 'ha' or ['ma', 'pa'], and anything that is only two letters? Or should I just take a rough guess and remove anything that might be a false positive?

PS: if you want to make the changes yourself, feel free. Again, I'm not trying to earn a badge with commits, just trying to help improve this project for everyone.

@spencermountain
Owner

Ma and pa are often Massachusetts and Pennsylvania

@MarketingPip
Contributor Author

MarketingPip commented Sep 14, 2023

@spencermountain - well f*ck - I'm trying to avoid mentioning you in issues and blowing up your GH notifications lol, but I'll make a list and have you confirm it here. Then I will add them in.

PS: I have been stacking up verbs, prepositions, and much more, so be ready for that. (If you'd rather I make a separate project I can, but I'd prefer not to and would rather contribute to this one.) The lists are relatively small though, and every word has been checked properly to confirm whether it is a noun, verb, etc. (roughly 10k words). I am hoping to bring in tags for 30,000 words though, which, like you said in your lecture, is the average human vocabulary. It was also pretty cool watching Zipf's law come into effect while gathering things!

@MarketingPip
Contributor Author

@spencermountain - here's the updated list! For all the things I commented out, I am thinking we could possibly just put in a regex rule. For example, "28g" or "10 pa" could be marked as a unit. Let me know what you think. And feel free to hack on this.

let list = [ 'rad',
    'radian',
    'radians',
  //  'deg',
    'degree',
    'degrees',
   // 'grad',
    'gradian',
    'gradians',
    'arcmin',
    'arcminute',
    'arcminutes',
    'arcsec',
    'arcsecond',
    'arcseconds',
 //   'va',
    'volt-ampere',
    'volt-amperes',
//    'mva',
    'millivolt-ampere',
    'millivolt-amperes',
   // 'kva',
    'kilovolt-ampere',
    'kilovolt-amperes',
    'megavolt-ampere',
    'megavolt-amperes',
  //  'gva',
    'gigavolt-ampere',
    'gigavolt-amperes',
    'square millimeter',
    'square millimeters',
    'square meter',
    'square meters',
  //  'ha',
    'square kilometer',
    'square kilometers',
    'square inch',
    'square inches',
    'square yard',
    'square yards',
    'square foot',
    'square feet',
  //  'ac',
    'mi2',
    'square mile',
    'square miles',
   // 'a',
    'ampere',
    'amperes',
  //  'ma',
    'milliampere',
    'milliamperes',
   // 'ka',
    'kiloampere',
    'kiloamperes',
 //   'b',
    'kb',
    'mb',
    'gb',
    'tb',
 //   'bit',
  //  'bits',
    'kilobit',
    'kilobits',
    'megabit',
    'megabits',
    'gigabit',
    'gigabits',
    'terabit',
    'terabits',
    // 'ea',
    'each',
  //  'dz',
    'dozen',
    'dozens',
 //   'wh',
    'mwh',
    'milliwatt-hour',
    'milliwatt-hours',
    'kwh',
    'kilowatt-hour',
    'kilowatt-hours',
    'megawatt-hour',
    'megawatt-hours',
    'gwh',
    'gigawatt-hour',
    'gigawatt-hours',
   // 'j',
    //'kj',
    'kilojoule',
    'kilojoules',
    'mhz',
    'millihertz',
    'hz',
    'khz',
    'kilohertz',
    'megahertz',
    'ghz',
    'gigahertz',
    'thz',
    'terahertz',
    'rpm',
    'rotation per minute',
    'rotations per minute',
    'deg/s',
    'degree per second',
    'degrees per second',
    'rad/s',
    'radian per second',
    'radians per second',
    'ft-cd',
    'foot-candle',
    'foot-candles',
    'mm',
    'cm',
   // 'm',
    'km',
   // 'in',
    'inch',
    'inches',
    'yd',
    'yard',
    'yards',
    'ft-us',
    'us survey foot',
    'us survey feet',
   // 'ft', Ft. Knox
    'foot',
    'feet',
   // 'mi', (Michigan, common)
    'miles',
    // 'oz',
   // 'lb', country code
    'pound',
    'pounds',
   // 't',
    'tons',
    'mcg',
   // 'mg', country code
   // 'g',
    'kg',
   // 'mt', country code
    'metric tonne',
    'metric tonnes',
    'min/km',
    'minute per kilometre',
    'minutes per kilometre',
    's/m',
    'second per metre',
    'seconds per metre',
    'min/mi',
    'minute per mile',
    'minutes per mile',
    's/ft',
    'second per foot',
    'seconds per foot',
    'ppm',
    'part-per million',
    'parts-per million',
    'ppb',
    'part-per billion',
    'parts-per billion',
    'ppt',
    'part-per trillion',
    'parts-per trillion',
    'ppq',
    'part-per quadrillion',
    'parts-per quadrillion',
    'w', // keep? 
    // 'mw', // country code
    'milliwatt',
    'milliwatts',
   // 'kw', //  country code
    'kilowatt',
    'kilowatts',
    'megawatt',
    'megawatts',
    'gw', // keep? 
    'gigawatt',
    'gigawatts',
    //'pa',
    'pascal',
//    'kpa', // keep? (can be place etc.. but is valid unit)
    'kilopascal',
    'kilopascals',
   // 'mpa', // keep? (can be acronym but is valid unit) one million pascals
    'megapascal',
    'megapascals',
  //  'hpa', // keep? (can be place etc.. but is valid unit)
    'hectopascal',
    'hectopascals',
 //   'bar', // keep? (can be place etc.. but is valid unit)
    'torr',
    'psi',
    'pound per square inch',
    'pounds per square inch',
    'ksi',
    'kilopound per square inch',
    'varh',
    'volt-ampere reactive hour',
    'volt-amperes reactive hour',
    'mvarh',
    'millivolt-ampere reactive hour',
    'millivolt-amperes reactive hour',
    'kvarh',
    'kilovolt-ampere reactive hour',
    'kilovolt-amperes reactive hour',
    'megavolt-ampere reactive hour',
    'megavolt-amperes reactive hour',
    'gvarh',
    'gigavolt-ampere reactive hour',
    'gigavolt-amperes reactive hour',
    'var',
    'volt-ampere reactive',
    'volt-amperes reactive',
    'mvar',
    'millivolt-ampere reactive',
    'millivolt-amperes reactive',
    'kvar',
    'kilovolt-ampere reactive',
    'kilovolt-amperes reactive',
    'megavolt-ampere reactive',
    'megavolt-amperes reactive',
    'gvar',
    'gigavolt-ampere reactive',
    'gigavolt-amperes reactive',
    'metre',
    'metres',
    'kilometre',
    'kilometres',
    'ft/s',
   // 'c', // keep?
    'degree celsius',
    'degrees celsius',
   // 'k',  //keep?
    'degree kelvin',
    'degrees kelvin',
   // 'f',
    'degree fahrenheit',
    'degrees fahrenheit',
  //  'r', (probably should keep - gas constant)
    'degree rankine',
    'degrees rankine',
    // 'ns', NS Canada (common for Nova Scotia)
   //  'mu', country code
   // 'ms', (multiple sclerosis etc.)
    'millisecond',
    'milliseconds',
   // 's',
    'second',
    'seconds',
    'min',
    'minute',
    'minutes',
   // 'h',
    'hour',
    'hours',
    //'d', 
    'day',
    'days',
    'week',
    'weeks',
    'month',
    'months',
    'year',
    'years',
  //  'v', // keep?
    'volt',
    // 'mv', country code
    'millivolt',
    'millivolts',
    'kv', // keep?  wikipedia - defines this as unit
    'kilovolt',
    'kilovolts',
    'mm3',
    'cubic millimeter',
    'cubic millimeters',
    'cubic centimeter',
    'cubic centimeters',
   // 'ml', country code
    'millilitre',
    'millilitres',
    // 'cl', country code
    'centilitre',
    'centilitres',
    'decilitre',
    'decilitres',
   // 'kl', country code
    'kilolitre',
    'kilolitres',
    'cubic meter',
    'cubic meters',
    'km3',
    'cubic kilometer',
    'cubic kilometers',
    'krm',
    'matsked',
    'matskedar',
    'tsk',
    'tesked',
    'teskedar',
    'msk',
    'kkp',
    'kaffekopp',
    'kaffekoppar',
    'glas',
    'kanna',
    'kannor',
    'tsp',
    'teaspoon',
    'teaspoons',
    'tbs',
    'tablespoon',
    'tablespoons',
    'cubic inch',
    'cubic inches',
    'cup',
    'cups',
    'pnt',
   // 'qt', country code
    'gal',
    'cubic foot',
    'cubic feet',
    'cubic yard',
    'cubic yards',
    'mm3/s',
    'cm3/s',
    'ml/s',
    'cl/s',
    'l/s',
    'l/min',
    'l/h',
    'kl/s',
    'kl/min',
    'kl/h',
    'm3/s',
    'm3/min',
    'm3/h',
    'km3/s',
    'tsp/s',
    'tbs/s',
    'in3/s',
    'in3/min',
    'in3/h',
    'fl-oz/s',
    'fl-oz/min',
    'fl-oz/h',
    'cup/s',
    'pnt/s',
    'pnt/min',
    'pnt/h',
    'qt/s',
    'gal/s',
    'gal/min',
    'gal/h',
    'ft3/s',
    'ft3/min',
    'ft3/h',
    'yd3/s',
    'yd3/min',
    'yd3/h']
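
A minimal sketch of the regex idea floated above the list, in plain JavaScript rather than any compromise API. The names `ambiguousUnits` and `findUnits`, and the particular abbreviations chosen, are assumptions for illustration only. The point is that an abbreviation such as 'pa' or 'g' is accepted only when a number sits directly in front of it, so "Ma and Pa" stays untouched.

```javascript
// Hypothetical helper, not part of compromise.
// Abbreviations that are too ambiguous for the lexicon on their own
// (illustrative subset of the commented-out entries in the list).
const ambiguousUnits = ['g', 'kg', 'mg', 'pa', 'kpa', 'ma', 'ha', 'j', 'kj', 'ft', 'lb', 'oz']

// Longest alternatives first, so 'kpa' is tried before 'pa'.
const alternation = [...ambiguousUnits].sort((a, b) => b.length - a.length).join('|')

// A number (optional decimals), an optional space, then one abbreviation as a whole word.
const numberUnit = new RegExp(`\\b(\\d+(?:\\.\\d+)?)\\s?(${alternation})\\b`, 'gi')

// Return every number + unit pair found in the text.
function findUnits(text) {
  const found = []
  for (const m of text.matchAll(numberUnit)) {
    found.push({ value: Number(m[1]), unit: m[2].toLowerCase() })
  }
  return found
}

console.log(findUnits('Add 28g of salt at 10 pa'))
console.log(findUnits('Ma and Pa moved to PA'))
```

The first call finds the pairs 28/'g' and 10/'pa'; the second finds nothing, because no number precedes the abbreviations. A rule like this would still need a decision about cases such as "10 in" or "5 ma", where a number in front does not settle the ambiguity.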

@MarketingPip
Contributor Author

@spencermountain - hey Spencer, I saw you added some stuff into misc about 2 months ago! Just checking whether you ever got to look through the rest of those. ^

Got something cool coming up too for Compromise. Hoping it'll be of some use and bring major improvements to Compromise. 👍

@spencermountain
Owner

hey Jared - yes I think this has been changed since the PR. Some of these like 'yard' are safer as Noun, some of these with the slashes will be handled by the tokenizer, and to be honest, I'm not sure we need to add things like 'kaffekopp'.

If you wanted to mine a subset of these for lexicon candidates, you're welcome to. Adding rules for some of these could be cool too, cheers
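
One way to mine such a subset could look like the sketch below. It is an assumption-laden illustration, not a project rule: the function name `lexiconCandidates`, the two-character cutoff, and the `knownCollisions` set are all hypothetical. It drops entries containing a slash (left to the tokenizer, as mentioned above), very short abbreviations, and a few strings already flagged in this thread as colliding with places or common words.

```javascript
// Illustrative only: the cutoff and the collision list are assumptions.
const knownCollisions = new Set(['ma', 'pa', 'mi', 'ms', 'ns', 'bar', 'in'])

// Keep only the entries that look safe to propose for the lexicon.
function lexiconCandidates(units) {
  return units.filter((unit) => {
    if (unit.includes('/')) return false         // slashes are left to the tokenizer
    if (unit.length <= 2) return false           // 'j', 't', 'ha' are too ambiguous alone
    if (knownCollisions.has(unit)) return false  // place names, acronyms, common words
    return true
  })
}

console.log(lexiconCandidates(['j', 'ma', 'bar', 'rad/s', 'kilojoule', 'megahertz']))
// → [ 'kilojoule', 'megahertz' ]
```

Note that the length cutoff also holds back unambiguous short units such as 'kg' or 'hz'; those would need to be reviewed by hand or covered by a number-aware rule instead.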

@MarketingPip
Contributor Author

MarketingPip commented Feb 7, 2024

@spencermountain - didn't mean to tag you in an old PR (I know you hate me blowing up your issues, haha!). But I have actually been mining hardcore for things for you to dig through. Plus, as I said, I've got something coming up: an HMM-based tagger trained on Penn Treebank (your JSON test data, actually, in this repo). New predictions for unseen data work, etc.

Combined with the existing rule sets, tags, and transformation rules (I think you see where I am going with this), I think we should see some CRAZY performance.

We should also release models (since Penn is only 5000 lines for POS), plus another model for NER, etc. And again, rules.

Plus maybe even something for properly identifying addresses (formats, suffixes, etc.) and names, for example: Kelly, Spencer (Last Name, First Name). (Train on data, predict, then apply rules.)

I also made some functions that can link relations between entities, for example "Hi my name is Bill Gates, I work at Microsoft". It does require internet access though, plus matching and extraction of entities to find the relations between them.
