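Yesterday I got the whole project running, but I still don't understand the details very well. Day three has two main tasks: first, work out how the graph database itself relates to what is displayed; second, understand the data-import code.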
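I. Dataset
The data looks like this:
We are all fairly familiar with how a relational database organizes data. Now that data has to be converted into a graph database. Visually, the result looks like this: the circles in the figure are entities, the edges are relationships, and every entity and relationship has its own properties.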
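Nodes. A node represents an object instance. Each node has a unique ID that distinguishes it from other nodes, and nodes carry properties.
Relationships. These are the edges of the graph, each connecting two nodes. Relationships here are directed and also carry properties.
Properties. Key-value pairs that exist on both nodes and relationships, as shown in the figure.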
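To make the three concepts concrete, here is a minimal sketch in plain Python. It is not neo4j's API; the class names, the labels "Disease" and "Symptom", and the relationship type "HAS_SYMPTOM" are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Node:
    """An object instance with a unique ID, a label, and key-value properties."""
    node_id: int
    label: str
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Relationship:
    """A directed edge from `start` to `end` that also carries properties."""
    start: Node
    end: Node
    rel_type: str
    properties: Dict[str, Any] = field(default_factory=dict)


# Hypothetical example data, not taken from the project's dataset
disease = Node(1, "Disease", {"name": "example disease"})
symptom = Node(2, "Symptom", {"name": "example symptom"})
rel = Relationship(disease, symptom, "HAS_SYMPTOM", {"name": "symptom"})
```

The direction is encoded by which node is `start` and which is `end`, which mirrors the "directed, with properties" description above.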
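This case has 7 entity types in total. neo4j shows each type in a different color, so in the display they appear as circles of different colors.
Each entity has its own properties. For example, the disease entity (a circle in the figure) has 8 properties, as shown in the figure below.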
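The user's query intent is identified with a classification method based on feature words. Admittedly, it is not very smart.
Its answers really can be a headache.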
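To see why feature-word matching is limited, here is a toy sketch of the idea. It is not the project's actual classifier (which also uses trained models stored under model); the intent names and the feature words are assumptions made up for this example.

```python
# Hypothetical intents and feature words, for illustration only
FEATURE_WORDS = {
    "query_symptom": ["symptom", "sign"],
    "query_drug": ["drug", "medicine", "medication"],
    "query_department": ["department", "clinic"],
}


def classify_intent(question):
    """Return the intent whose feature words occur most often in the question."""
    text = question.lower()
    scores = {
        intent: sum(word in text for word in words)
        for intent, words in FEATURE_WORDS.items()
    }
    best = max(scores, key=scores.get)
    # No feature word matched at all: the intent cannot be determined
    return best if scores[best] > 0 else "unknown"
```

A question that uses none of the listed words falls through to "unknown", which is one reason this kind of approach can feel not very smart.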
II. Data import code
First, a recap of the project's code structure:
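data: stores the data
img: stores the images used in the readme
model: stores the trained tfidf model and the intent recognition model
build_graph.py: builds the graph; see task03
entity_extractor.py: extracts the entities in a question and identifies the intent; see task04
search_answer.py: constructs cypher queries from the entities and intent, queries the graph database, and returns the answer; see task05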
So today's task is to analyze build_graph.py.
The main program is very concise:
if __name__ == "__main__":
    handler = MedicalGraph()
    handler.create_graphNodes()
    handler.create_graphRels()
So the core is the MedicalGraph class. The class structure is as follows:
Analysis of the read_file code:
# Module-level imports this method relies on: import re; import pandas as pd
def read_file(self):
    """
    Read the data file and collect the entities and entity relationships.
    :return: entity sets, relationship pair lists, and disease property dicts
    """
    # Entity collections (deduplicated with set() on return)
    diseases = []  # never appended to below, so set(diseases) is returned empty
    aliases = []
    symptoms = []
    parts = []
    departments = []
    complications = []
    drugs = []
    # One property dict per disease
    diseases_infos = []
    # Relationship pairs, each stored as [disease, other_entity]
    disease_to_symptom = []
    disease_to_alias = []
    diseases_to_part = []
    disease_to_department = []
    disease_to_complication = []
    disease_to_drug = []
    all_data = pd.read_csv(self.data_path, encoding='gb18030').loc[:, :].values
    for data in all_data:
        disease_dict = {}  # properties of this disease node
        # Column 0: disease name
        disease = str(data[0]).replace("...", " ").strip()
        disease_dict["name"] = disease
        # Column 1: aliases, with separators normalized to spaces.
        # "未知" means "unknown". Empty cells are read as NaN and str(NaN) is
        # "nan" (truthy), so the fallback only fires for a real empty string.
        line = re.sub("[,、;,.;]", " ", str(data[1])) if str(data[1]) else "未知"
        for alias in line.strip().split():
            aliases.append(alias)
            disease_to_alias.append([disease, alias])
        # Column 2: parts. The fallback is a list so the loop yields one
        # item instead of iterating over the characters of a string.
        part_list = str(data[2]).strip().split() if str(data[2]) else ["未知"]
        for part in part_list:
            parts.append(part)
            diseases_to_part.append([disease, part])
        # Columns 3-5: plain properties of the disease
        age = str(data[3]).strip()
        disease_dict["age"] = age
        infect = str(data[4]).strip()
        disease_dict["infection"] = infect
        insurance = str(data[5]).strip()
        disease_dict["insurance"] = insurance
        # Column 6: departments
        department_list = str(data[6]).strip().split()
        for department in department_list:
            departments.append(department)
            disease_to_department.append([disease, department])
        # Column 7: checklist property
        check = str(data[7]).strip()
        disease_dict["checklist"] = check
        # Column 8: symptoms; [:-1] drops the last token
        symptom_list = str(data[8]).replace("...", " ").strip().split()[:-1]
        for symptom in symptom_list:
            symptoms.append(symptom)
            disease_to_symptom.append([disease, symptom])
        # Column 9: complications; [:-1] drops the last token
        complication_list = str(data[9]).strip().split()[:-1] if str(data[9]) else ["未知"]
        for complication in complication_list:
            complications.append(complication)
            disease_to_complication.append([disease, complication])
        # Column 10: treatment; [:-4] drops the last four characters
        treat = str(data[10]).strip()[:-4]
        disease_dict["treatment"] = treat
        # Column 11: drugs; [:-1] drops the last token
        drug_string = str(data[11]).replace("...", " ").strip()
        for drug in drug_string.split()[:-1]:
            drugs.append(drug)
            disease_to_drug.append([disease, drug])
        # Columns 12-14: plain properties of the disease
        period = str(data[12]).strip()
        disease_dict["period"] = period
        rate = str(data[13]).strip()
        disease_dict["rate"] = rate
        money = str(data[14]).strip() if str(data[14]) else "未知"
        disease_dict["money"] = money
        diseases_infos.append(disease_dict)
    return set(diseases), set(symptoms), set(aliases), set(parts), set(departments), set(complications), \
        set(drugs), disease_to_alias, disease_to_symptom, diseases_to_part, disease_to_department, \
        disease_to_complication, disease_to_drug, diseases_infos
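One detail in read_file is worth checking by hand: the `if str(data[i]) else "未知"` guards. pandas reads an empty CSV cell as NaN, and `str()` of NaN is the non-empty string "nan", so the guard does not catch missing cells. The sketch below demonstrates this with the standard library only and shows one possible way to handle it. `split_cell` is a hypothetical helper of my own, not part of the project, and its separator set is an assumption.

```python
import math
import re


def split_cell(value, default="未知"):
    """Hypothetical helper: split a cell into tokens, or return [default]
    when the cell is missing (None / NaN) or contains no tokens."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return [default]
    # Assumed separators: comma, 、, semicolon, period; then split on whitespace
    tokens = re.sub("[,、;.]", " ", str(value)).split()
    return tokens or [default]


missing = float("nan")       # what an empty CSV cell becomes after read_csv
as_text = str(missing)       # the string "nan"
guard_passes = bool(as_text) # True, so `if str(cell)` does not detect NaN
```

With a helper like this, every list-valued column could be parsed the same way and missing cells would consistently become "未知" rather than the literal text "nan". Whether that matches the intent of the original code is an assumption on my part.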