
A First Implementation of word2vec in Python


1. Preface

At first, setting up the word2vec environment looked fairly involved; I spent half a day on Cygwin without really figuring it out. Then it occurred to me: why install the C version at all? I should use a Python version instead, and that is how I found gensim. Installing the gensim package is all it takes to use word2vec. Note that gensim's Word2Vec covers the skip-gram and CBOW models; if you need other variants, you will have to look into word2vec implementations in other languages.
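If you want to check your own setup before going further, a minimal sanity check (my own sketch, not part of the original article) is simply to install gensim, for example with pip install gensim, and confirm that it imports:

# Minimal sanity check that gensim is importable (install it first, e.g. pip install gensim).
import gensim
print(gensim.__version__)  # the exact version on your machine will differ from the 2016-era release used in this article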

2. Corpus Preparation

Once you have the gensim package, you will find that most tutorials online simply pass in a txt file, but they rarely explain what that txt file looks like or what data format it should be in, and they do not provide a sample file to download. Digging a little deeper, it turns out that the txt file is just a large amount of text that has already been word-segmented. The figure below shows a corpus I built myself: I took 7,000 news articles that I had previously crawled and segmented them. Note that the words must be separated by spaces:

[Figure: a sample of the segmented corpus, with words separated by spaces]
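For concreteness, you can peek at the first few hundred characters of the segmented file. The sample shown in the comment below is a made-up illustration of the expected shape (segmented words separated by single spaces), not the author's actual corpus:

# -*- coding: utf-8 -*-
# Peek at the beginning of the segmented corpus produced in the next step.
with open('fenci_result.txt') as f:
    print(f.read(200))
# Expected shape (hypothetical example, not the real corpus):
# 北京 市 昨天 發布 了 新 的 控煙 條例 , 公共 場所 全面 禁煙 ...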

The segmentation here is done with jieba (结巴分词).

The code for this step is as follows:

import jieba

# segment each line of the raw corpus with jieba and write the words out separated by spaces
f1 = open('fenci.txt')
f2 = open('fenci_result.txt', 'a')
lines = f1.readlines()  # read the whole file
for line in lines:
    line = line.replace('\t', '').replace('\n', '').replace(' ', '')  # strip whitespace from the raw text
    seg_list = jieba.cut(line, cut_all=False)
    f2.write(' '.join(seg_list))
f1.close()
f2.close()
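As an aside, a slightly different sketch (mine, not the author's code) streams the input instead of calling readlines(), and writes one segmented document per line; the output file name fenci_result_lines.txt is made up for illustration. One document per line pairs naturally with gensim's word2vec.LineSentence reader, while the single-stream format above is what Text8Corpus expects.

# -*- coding: utf-8 -*-
# Streaming variant (illustrative sketch): one raw article per input line,
# one segmented document per output line. The output file name is hypothetical.
import jieba

with open('fenci.txt') as fin, open('fenci_result_lines.txt', 'w') as fout:
    for line in fin:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        words = jieba.cut(line, cut_all=False)
        fout.write(' '.join(words) + '\n')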

Another thing to watch out for is that the corpus must contain a lot of text; the corpora you see online are typically several gigabytes. I initially used a single news article as the corpus and the results were very poor, with every output coming back as 0. I then switched to the 7,000 news articles; the segmented fenci_result.txt comes to about 20 MB, which is still not large, but already enough to get preliminary results.
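Before training, it can be worth checking roughly how much text you actually have. A tiny sketch along these lines (the file name comes from the article, everything else is my own) makes it obvious when the corpus is too small to give meaningful vectors:

import os

# rough size check of the segmented corpus
path = 'fenci_result.txt'
print('size in MB: %.1f' % (os.path.getsize(path) / 1024.0 / 1024.0))

with open(path) as f:
    n_tokens = sum(len(line.split()) for line in f)
print('approximate token count: %d' % n_tokens)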

3. Training a Model with gensim's word2vec

The relevant code is as follows:

from gensim.models import word2vec
import logging

# main program
logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s', level=logging.INFO)
sentences = word2vec.Text8Corpus(u'fenci_result.txt')  # load the corpus
model = word2vec.Word2Vec(sentences, size=200)  # train the model (CBOW by default, window=5)

print model

# similarity between two words
try:
    y1 = model.similarity(u'國家', u'國務院')
except KeyError:
    y1 = 0
print u'Similarity between [國家] and [國務院]:', y1
print '-----\n'

# words most related to a given word
y2 = model.most_similar(u'控煙', topn=20)  # the 20 most related words
print u'Words most related to [控煙]:\n'
for item in y2:
    print item[0], item[1]
print '-----\n'

# analogy: 書 is to 不錯 as 質量 is to ...
print u'書 - 不錯, 質量 -'
y3 = model.most_similar([u'質量', u'不錯'], [u'書'], topn=3)
for item in y3:
    print item[0], item[1]
print '----\n'

# find the word that does not belong
y4 = model.doesnt_match(u'書 書籍 教材 很'.split())
print u'Odd one out:', y4
print '-----\n'

# save the model for later reuse
model.save(u'書評.model')
# to load it back:
# model_2 = word2vec.Word2Vec.load('text8.model')

# store the word vectors in a format the original C tool can read
# model.save_word2vec_format(u'書評.model.bin', binary=True)
# to load them back:
# model_3 = word2vec.Word2Vec.load_word2vec_format('text8.model.bin', binary=True)
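The script above targets Python 2 and the gensim release current at the end of 2016. In today's gensim (4.x) several names have changed: size is now vector_size, and similarity queries live on model.wv rather than on the model itself. A rough sketch of the same steps under gensim 4 and Python 3 might look like the following; the file names and query words simply mirror the ones above, and this is my adaptation, not the author's code:

# Rough gensim 4.x / Python 3 adaptation of the script above (illustrative sketch)
import logging
from gensim.models import Word2Vec
from gensim.models.word2vec import Text8Corpus

logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s', level=logging.INFO)

sentences = Text8Corpus('fenci_result.txt')                  # load the space-separated corpus
model = Word2Vec(sentences, vector_size=200, window=5, min_count=5, sg=0)  # sg=0 selects CBOW

print(model.wv.similarity('國家', '國務院'))                  # similarity between two words
print(model.wv.most_similar('控煙', topn=20))                # 20 most related words
print(model.wv.most_similar(positive=['質量', '不錯'], negative=['書'], topn=3))  # analogy query
print(model.wv.doesnt_match('書 書籍 教材 很'.split()))       # odd one out

model.save('書評.model')                                      # full model, can be trained further
model.wv.save_word2vec_format('書評.model.bin', binary=True)  # vectors in the original C format
# reload the vectors later with:
# from gensim.models import KeyedVectors
# wv = KeyedVectors.load_word2vec_format('書評.model.bin', binary=True)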

The output is as follows:

'D:\program files\python2.7.0\python.exe' 'D:/pycharm workspace/畢設/cluster_test/word2vec.py'
D:\program files\python2.7.0\lib\site-packages\gensim\utils.py:840: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
  warnings.warn('detected Windows; aliasing chunkize to chunkize_serial')
D:\program files\python2.7.0\lib\site-packages\gensim\utils.py:1015: UserWarning: Pattern library is not installed, lemmatization won't be available.
  warnings.warn('Pattern library is not installed, lemmatization won't be available.')
2016-12-12 15:37:43,331: INFO: collecting all words and their counts
2016-12-12 15:37:43,332: INFO: PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2016-12-12 15:37:45,236: INFO: collected 99865 word types from a corpus of 3561156 raw words and 357 sentences
2016-12-12 15:37:45,236: INFO: Loading a fresh vocabulary
2016-12-12 15:37:45,413: INFO: min_count=5 retains 29982 unique words (30% of original 99865, drops 69883)
2016-12-12 15:37:45,413: INFO: min_count=5 leaves 3444018 word corpus (96% of original 3561156, drops 117138)
2016-12-12 15:37:45,602: INFO: deleting the raw counts dictionary of 99865 items
2016-12-12 15:37:45,615: INFO: sample=0.001 downsamples 29 most-common words
2016-12-12 15:37:45,615: INFO: downsampling leaves estimated 2804247 word corpus (81.4% of prior 3444018)
2016-12-12 15:37:45,615: INFO: estimated required memory for 29982 words and 200 dimensions: 62962200 bytes
2016-12-12 15:37:45,746: INFO: resetting layer weights
2016-12-12 15:37:46,782: INFO: training model with 3 workers on 29982 vocabulary and 200 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
2016-12-12 15:37:46,782: INFO: expecting 357 sentences, matching count from corpus used for vocabulary survey
2016-12-12 15:37:47,818: INFO: PROGRESS: at 1.96% examples, 267531 words/s, in_qsize 6, out_qsize 0
2016-12-12 15:37:48,844: INFO: PROGRESS: at 3.70% examples, 254229 words/s, in_qsize 3, out_qsize 1
... (similar PROGRESS lines, roughly one per second up to 98.71%, omitted here) ...
2016-12-12 15:38:34,802: INFO: PROGRESS: at 98.71% examples, 288375 words/s, in_qsize 4, out_qsize 0
2016-12-12 15:38:35,286: INFO: worker thread finished; awaiting finish of 2 more threads
2016-12-12 15:38:35,286: INFO: worker thread finished; awaiting finish of 1 more threads
Word2Vec(vocab=29982, size=200, alpha=0.025)
Similarity between [國家] and [國務院]: 0.387535493256
-----
2016-12-12 15:38:35,293: INFO: worker thread finished; awaiting finish of 0 more threads
2016-12-12 15:38:35,293: INFO: training on 17805780 raw words (14021191 effective words) took 48.5s, 289037 effective words/s
2016-12-12 15:38:35,293: INFO: precomputing L2-norms of word weight vectors
Words most related to [控煙]:
禁煙 0.6038454175
防煙 0.585186183453
執行 0.530897378922
煙控 0.516572892666
廣而告之 0.508533298969
履約 0.507428050041
執法 0.494115233421
禁煙令 0.471616715193
修法 0.465247869492
該項 0.457907706499
落實 0.457776963711
控制 0.455987215042
這方面 0.450040221214
立法 0.44820779562
控煙辦 0.436062157154
執行力 0.432559013367
控煙會 0.430508673191
進展 0.430286765099
監管 0.429748386145
懲罰 0.429243773222
-----
書 - 不錯, 質量 -
生存 0.613928854465
穩定 0.595371186733
整體 0.592055797577
----
Odd one out: 很
-----
2016-12-12 15:38:35,515: INFO: saving Word2Vec object under 書評.model, separately None
2016-12-12 15:38:35,515: INFO: not storing attribute syn0norm
2016-12-12 15:38:35,515: INFO: not storing attribute cum_table
2016-12-12 15:38:36,490: INFO: saved 書評.model

Process finished with exit code 0

That is the whole of this article on a first implementation of word2vec in Python. I hope it gives you a useful reference, and I hope you will continue to support 好吧啦网.

標(biāo)簽: python