[Language Intelligence Deep Learning] Mini Project 6 (3) - BM25, SentenceTransformer tutorial

sikaro 2024. 4. 30. 11:33

Split the text into chunks by word count.

First run the search for a single sentence with BM25 and Sentence Transformer.

Then use all of the chunks and improve search performance.

Library installation and imports

!pip install -q -U transformers==4.38.2
!pip install -q -U datasets==2.18.0
!pip install -q -U bitsandbytes==0.42.0
!pip install -q -U peft==0.9.0
!pip install -q -U trl==0.7.11
!pip install -q -U accelerate==0.27.2
!pip install -q -U rank_bm25==0.2.2
!pip install -q -U sentence-transformers==2.7.0
!pip install -q -U wikiextractor==3.0.6
!pip install -q -U konlpy==0.6.0

import os
import glob
import json

import numpy as np
import pandas as pd
from tqdm.auto import tqdm

import torch
import konlpy
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

Downloading the Korean Wikipedia dump file

Download the dump file, then unpack all of it with the extractor.

!python -m wikiextractor.WikiExtractor \
        --json \
        --out {WORKSPACE}/data/kowiki \
        {WORKSPACE}/data/kowiki-latest-pages-meta-current.xml.bz2

# Parse only the first 5 lines as JSON and print them.
with open(os.path.join(WORKSPACE, "data", "kowiki", "AA", "wiki_00")) as f:
    for i, line in enumerate(f):
        line = line.strip()
        # print(line)

        data = json.loads(line)
        print(data)

        if i >= 4:
            break

Building the database

The documents should really be stored in a DB, but for now we skip that and simply write them to a json file.

def make_chunk(text, n_word=128):
    # count words line by line
    line_list = []
    total = 0
    for line in text.split('\n'):
        total += len(line.split())
        line_list.append((total, line))  # total is the cumulative word count up to and including this line
    # split into chunks of roughly n_word words
    chunk_list = []
    chunk_total, chunk_index = 0, 0
    for i, (total, line) in enumerate(line_list):
        if total - chunk_total >= n_word:  # once at least n_word (default 128) words have accumulated, close the chunk
            chunk = [line for total, line in line_list[chunk_index:i+1]]
            chunk_list.append('\n'.join(chunk))
            chunk_index = i + 1
            chunk_total = total
    # add the remaining lines (if they hold fewer than n_word words, include the previous line too)
    if total > chunk_total:  # include any lines left over at the end
        if total - chunk_total < n_word and chunk_index > 1:
            chunk_index -= 1
        chunk = [line for total, line in line_list[chunk_index:]]
        chunk_list.append('\n'.join(chunk))
    return chunk_list

    # Designed carefully, this function could also split the text by semantic units.
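
As a rough illustration of the note about splitting on meaning, the sketch below groups whole sentences into chunks instead of raw lines. The function name `make_sentence_chunks`, the naive regex sentence splitter, and the `overlap` parameter are all assumptions made for this example and are not part of the original tutorial; a proper Korean sentence splitter would likely do better.

```python
import re

def make_sentence_chunks(text, n_word=128, overlap=1):
    """Group whole sentences into chunks of roughly n_word words.

    Hypothetical helper: sentences are split naively after '.', '!' or '?'.
    The last `overlap` sentences of a chunk are repeated at the start of
    the next chunk so that context carries over.
    """
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    chunks, current, fresh = [], [], 0  # fresh = sentences not yet emitted
    for sentence in sentences:
        current.append(sentence)
        fresh += 1
        if sum(len(s.split()) for s in current) >= n_word:
            chunks.append(' '.join(current))
            # carry the last `overlap` sentences into the next chunk
            current = current[-overlap:] if overlap > 0 else []
            fresh = 0
    # emit leftover sentences only if some of them are new
    if fresh > 0:
        chunks.append(' '.join(current))
    return chunks

example = make_sentence_chunks("A b c. D e f. G h i. J k.", n_word=6, overlap=1)
```

With `overlap=1` each chunk starts with the last sentence of the previous one, which keeps a little context at the cost of some duplicated text in the index.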

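To check that it works, split a document into chunks and inspect the result.

# Assumption: fn_list is not defined in the post; here it is taken to be the
# extracted wiki files, following the path used above.
fn_list = sorted(glob.glob(os.path.join(WORKSPACE, "data", "kowiki", "*", "wiki_*")))

# To check the function, split documents into chunks and store them in row_list
row_list = []
for fn in fn_list[:1]:
    with open(fn) as f:
        for line in f:
            data = json.loads(line)
            chunk_list = make_chunk(data['text'])  # split the document text into chunks
            for i, chunk in enumerate(chunk_list):
                title = data['title']
                row = {
                    'id': data['id'],  # each row holds the document id, the chunk number, and title + chunk
                    'chunk_id': str(i + 1),
                    'chunk': f"{title}\n{chunk}"
                }
                print(row)
                row_list.append(row)
len(row_list)

# The Jimmy Carter document is now stored in chunk units.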
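
The post mentions writing the chunks to a json file instead of a DB but does not show that step. A minimal sketch of one way to do it, assuming the JSON Lines format (one row per line); the helper names `save_rows` and `load_rows`, the file name `chunk_db.jsonl`, and the sample rows are hypothetical.

```python
import json
import os
import tempfile

def save_rows(rows, path):
    """Write one JSON object per line (JSON Lines)."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            # ensure_ascii=False keeps Korean text readable in the file
            f.write(json.dumps(row, ensure_ascii=False) + "\n")

def load_rows(path):
    """Read the rows back into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Made-up rows in the same shape as row_list, for illustration only
sample_rows = [
    {"id": "1", "chunk_id": "1", "chunk": "제목\n첫 번째 청크"},
    {"id": "1", "chunk_id": "2", "chunk": "제목\n두 번째 청크"},
]
path = os.path.join(tempfile.gettempdir(), "chunk_db.jsonl")  # hypothetical file name
save_rows(sample_rows, path)
restored = load_rows(path)
```

One row per line means the file can be appended to and read back as a stream, which helps once every chunk of the dump is included.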

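Searching with BM25

Build a BM25 index with the rank_bm25 API, run queries against it, and check the results.

# Assumption: the post does not show how chunk_list, tokenizer, and tokenized_chunks
# are defined. A whitespace tokenizer is used here as a stand-in; a konlpy
# morphological analyzer could be substituted.
chunk_list = [row['chunk'] for row in row_list]

def tokenizer(text):
    return text.split()

tokenized_chunks = [tokenizer(chunk) for chunk in chunk_list]

# create the BM25 index
bm25 = BM25Okapi(tokenized_chunks)

def query_bm25(bm25, query, tokenizer, top_n=10):
    tokenized_query = tokenizer(query)
    # compute a score for every chunk
    doc_scores = bm25.get_scores(tokenized_query)
    # sort chunk indices by descending score
    rank = np.argsort(-doc_scores)
    # keep the top-n chunks that actually matched (score > 0)
    result = []
    for i in rank[:top_n]:
        if doc_scores[i] > 0:
            result.append((i, doc_scores[i]))
    return result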

while True:
    query = input('Search > ')
    query = query.strip()
    if len(query) == 0:
        break
    result = query_bm25(bm25, query, tokenizer)
    for i, score in result:
        print(f'---- score: {score} ----')
        print(chunk_list[i])
        print()
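
To make the ranking less of a black box, here is a simplified BM25 scorer in plain Python. It follows the textbook Okapi BM25 formula with a smoothed, non-negative IDF; `rank_bm25` may differ in details such as how it handles IDF, so this is a sketch for intuition rather than a reimplementation of the library. The function name `bm25_scores`, the defaults `k1=1.5` and `b=0.75`, and the toy documents are assumptions.

```python
import math
from collections import Counter

def bm25_scores(tokenized_docs, tokenized_query, k1=1.5, b=0.75):
    """Score every document against the query with a textbook BM25 formula."""
    if not tokenized_docs:
        return []
    n_docs = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs
    # document frequency: in how many documents each term appears
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    scores = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in tokenized_query:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            df = doc_freq[term]
            # smoothed IDF: rare terms count more, and the value is never negative
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            # k1 saturates term frequency, b normalizes by document length
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy, pre-tokenized documents for illustration only
docs = [
    "지미 카터 는 미국 의 제39대 대통령 이다".split(),
    "수학 은 수 와 양 을 다루는 학문 이다".split(),
    "미국 의 수도 는 워싱턴 이다".split(),
]
scores = bm25_scores(docs, "지미 카터".split())
best = max(range(len(docs)), key=lambda i: scores[i])
```

Documents that share no term with the query score exactly 0, which is why `query_bm25` can drop them with the `doc_scores[i] > 0` check.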

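Searching with Sentence Transformer

Load the following model.

https://huggingface.co/snunlp/KR-SBERT-V40K-klueNLI-augSTS

# create the SentenceTransformer model
MODEL_ID = "snunlp/KR-SBERT-V40K-klueNLI-augSTS"  # model ID taken from the URL above
model = SentenceTransformer(MODEL_ID)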

Creating the embeddings

# create the chunk embeddings
chunk_embeddings = model.encode(chunk_list)
chunk_embeddings.shape

Next, write a function that computes the scores and sorts by them.

def query_sentence_transformer(model, chunk_embeddings, query, top_n=10):
    query_embedding = model.encode([query])
    # compute scores as the dot product of each chunk embedding with the query embedding
    doc_scores = np.matmul(chunk_embeddings, query_embedding.T)
    doc_scores = doc_scores.reshape(-1)
    # sort chunk indices by descending score
    rank = np.argsort(-doc_scores)
    # top-n
    result = []
    for i in rank[:top_n]:
        result.append((i, doc_scores[i]))
    return result
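
The function above ranks by raw dot product, which matches cosine similarity only when the embeddings have unit length. The plain-Python sketch below shows how the two can disagree on made-up 2-d vectors; the helper names `dot`, `normalize`, and `cosine` are assumptions for illustration.

```python
import math

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)

def cosine(u, v):
    """Cosine similarity: dot product of the unit-length vectors."""
    return dot(normalize(u), normalize(v))

# Made-up 2-d "embeddings": the long vector wins on dot product even though
# the short one points in exactly the same direction as the query.
query = [1.0, 0.0]
short_aligned = [0.5, 0.0]
long_off_axis = [3.0, 4.0]

dot_short, dot_long = dot(query, short_aligned), dot(query, long_off_axis)
cos_short, cos_long = cosine(query, short_aligned), cosine(query, long_off_axis)
```

With numpy arrays the same normalization is a division by `np.linalg.norm(..., axis=1, keepdims=True)`. sentence-transformers also appears to offer a `normalize_embeddings` option on `encode`, which is worth checking against the installed version.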

Now run the search.

while True:
    query = input('Search > ')
    query = query.strip()
    if len(query) == 0:
        break
    result = query_sentence_transformer(model, chunk_embeddings, query)
    for i, score in result:
        print(f'---- score: {score} ----')
        print(chunk_list[i])
        print()

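Ways to improve performance

1. Build the data well. Split it into chunks that follow meaning more closely.

2. BM25 or Sentence BERT used alone often fails to return good results.

This is why RAG is hard to commercialize.

It is better to combine the two.

For example, weight the scores as bm25 * 0.4 + dpr * 0.6.

3. A better LLM helps.

4. Beyond that, change the prompt, run various experiments, and so on.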
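
The weighted combination from item 2 can be sketched as follows. BM25 scores and dense-retrieval scores live on different scales, so this sketch min-max normalizes each list before mixing them; that normalization step, the function names `min_max` and `hybrid_scores`, and the example scores are assumptions, while the 0.4 / 0.6 weights come from the note above.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    if not scores:
        return []
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def hybrid_scores(bm25_scores, dense_scores, w_bm25=0.4, w_dense=0.6):
    """Weighted sum of normalized BM25 and dense-retrieval scores per chunk."""
    if len(bm25_scores) != len(dense_scores):
        raise ValueError("both score lists must cover the same chunks")
    sparse = min_max(bm25_scores)
    dense = min_max(dense_scores)
    return [w_bm25 * s + w_dense * d for s, d in zip(sparse, dense)]

# Made-up scores for three chunks, for illustration only
bm25_example = [12.0, 0.0, 6.0]
dense_example = [0.2, 0.8, 0.5]
combined = hybrid_scores(bm25_example, dense_example)
ranking = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
```

In the tutorial's setting, the two inputs would be the full score arrays from `bm25.get_scores(...)` and from the embedding dot product, converted to lists, so that every chunk has both a sparse and a dense score. The weights are worth tuning on real queries.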
