Building a Neo4j Graph from Simple Baby Food Data

1. Data collection and structuring

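Crawl baby food product information from a chosen shopping site (e.g., JD.com).
The main fields are listed below, with each field's column position in parentheses:
- Product name (1)
- SKU (2)
- Product link (3)
- Image URL (4)
- Price (5)
- Review count (6)
- Review link (7)
- Seller shop name (8)
- Shop link (9)
- Tag information (10)
- Ad flag (11)
- Page number (12)
- Processing time (13)
- Page URL (14)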
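
The numbers in parentheses read as 1-based column positions in the crawled sheet. That reading is an assumption, inferred from the way the preprocessing code in section 2 selects columns by position. A minimal sketch of converting those positions into the 0-based indices that pandas `iloc` expects follows; the English keys are hypothetical labels chosen for illustration, not column names from the source data.

```python
# 1-based column positions as listed above. The English keys are
# hypothetical labels for this sketch, not names from the crawled file.
FIELD_POSITIONS = {
    "product_name": 1,
    "sku": 2,
    "product_url": 3,
    "image_url": 4,
    "price": 5,
    "review_count": 6,
    "review_url": 7,
    "shop_name": 8,
    "shop_url": 9,
    "tags": 10,
    "is_ad": 11,
    "page_number": 12,
    "processed_at": 13,
    "page_url": 14,
}


def iloc_indices(fields):
    """Convert 1-based field positions to 0-based indices for DataFrame.iloc."""
    return [FIELD_POSITIONS[name] - 1 for name in fields]


# The six fields the preprocessing step keeps.
WANTED = ["product_name", "product_url", "price",
          "review_count", "shop_name", "shop_url"]
print(iloc_indices(WANTED))  # [0, 2, 4, 5, 7, 8]
```

Naming the positions once makes a positional selection such as `df.iloc[:, [0, 2, 4, 5, 7, 8]]` easier to audit if the crawler's column order ever changes.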

2. Data preprocessing and extraction

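Extracting the required fields:
- Brand name: check whether the product title contains one of a predefined list of brand keywords
- Stage (1 to 3): search with a regular expression for patterns such as "1단" (stage 1) or "2단" (stage 2)
- Weight and count: handle strings such as "900g", "300g", and "1+1" with string-based rules
- All other fields are taken directly from the raw data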
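
The rules above can be sketched as a single title parser. This is an illustration under stated assumptions rather than the article's exact code: the function name `parse_title` is hypothetical, and treating "A+B" (e.g. "1+1") as a bundle of A + B units is a guess at what that marker means in the listings.

```python
import re


def parse_title(title, brands):
    """Extract brand, stage, and weight from a product title.

    Assumptions for this sketch: titles use "N단" for the stage, "g" or
    "그램" for the weight, "*N" for a pack count, and "A+B" (e.g. "1+1")
    means a bundle of A + B units.
    """
    text = title.strip().lower()

    # Brand: first predefined keyword that appears in the title
    brand = next((b for b in brands if b.lower() in text), None)

    # Stage: digits immediately followed by "단"
    stage_match = re.search(r"(\d+)단", text)
    stage = f"{stage_match.group(1)}단" if stage_match else None

    # Weight: digits followed by "g" or "그램"
    weight_match = re.search(r"(\d+)\s*(?:g|그램)", text)
    weight = f"{weight_match.group(1)}그램" if weight_match else None

    # Count: "*N" takes priority, otherwise sum an "A+B" bundle marker
    count = None
    star = re.search(r"\*\s*(\d+)", text)
    plus = re.search(r"(\d+)\s*\+\s*(\d+)", text)
    if star:
        count = int(star.group(1))
    elif plus:
        count = int(plus.group(1)) + int(plus.group(2))

    if weight and count:
        weight = f"{weight}*{count}"

    return {"brand": brand, "stage": stage, "weight": weight}


print(parse_title("프레시오 2단 380그램*2", ["a2", "프레시오"]))
```

Fields that cannot be found come back as `None`, so the caller can decide whether to skip the record or store a placeholder.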

Example of processed output (JSON format):

{
  "product_name": "a2",
  "product_stage": "1단",
  "product_weight": "900그램",
  "product_url": "https://item.jd.com/1950756.html",
  "shop_name": "a2 해외직구 공식점",
  "shope_url": "https://mall.jd.com/index-1000015026.html?from=pc",
  "product_price": 230.0,
  "product_comment_num": "32만+"
}

{
  "product_name": "프레시오",
  "product_stage": "2단",
  "product_weight": "380그램*2개",
  "product_url": "https://item.jd.com/6374127.html",
  "shop_name": "프레시오 공식몰",
  "shope_url": "https://mall.jd.com/index-1000002668.html?from=pc",
  "product_price": 170.0,
  "product_comment_num": "119만+"
}
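
The `product_comment_num` values above are display strings such as "32만+", where "만" denotes 10,000 and the trailing "+" marks a lower bound. The article stores them as text; if numeric comparison is wanted later, a conversion along these lines could be applied. The helper name `parse_review_count` is hypothetical and this step is not part of the original pipeline.

```python
import re


def parse_review_count(text):
    """Convert a display string such as "32만+" into an approximate integer.

    "만" denotes 10,000. A trailing "+" marks the value as a lower bound,
    so the result is a floor rather than an exact count. Returns None when
    the string does not match the expected shape.
    """
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*(만)?\s*\+?\s*", str(text))
    if not match:
        return None
    number = float(match.group(1))
    multiplier = 10_000 if match.group(2) else 1
    return int(round(number * multiplier))


print(parse_review_count("32만+"))   # 320000
print(parse_review_count("119만+"))  # 1190000
```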

Example preprocessing code:

import json
import re

import pandas as pd

def extract_product_data(input_file, brand_list, output_json):
    df = pd.read_excel(input_file)
    # Drop repeated SKUs, then drop sponsored listings
    # ('광고 여부' is the ad-flag column; the value '광고' marks an ad)
    df = df.drop_duplicates(subset=['SKU'], keep='first')
    df = df[df['광고 여부'] != '광고']

    # Keep only the needed columns: title, product URL, price,
    # review count, shop name, shop URL
    df = df.iloc[:, [0, 2, 4, 5, 7, 8]]

    # Write one JSON object per line (JSON Lines)
    with open(output_json, 'w', encoding='utf-8') as f_out:
        for _, row in df.iterrows():
            title = str(row.iloc[0]).strip().lower()
            brand = None
            stage = None
            weight = None
            count = None

            # Brand: first predefined keyword found in the title
            # (compare in lowercase because the title was lowercased)
            for b in brand_list:
                if b.lower() in title:
                    brand = b
                    break

            # Stage, e.g. "1단", "2단"
            match_stage = re.search(r'(\d+)단', title)
            if match_stage:
                stage = f"{match_stage.group(1)}단"

            # Weight: digits followed by 'g' or '그램'
            # (a group is needed here; [g|그램] would be a character class)
            match_weight = re.search(r'(\d+)\s*(?:g|그램)', title)
            if match_weight:
                weight = f"{match_weight.group(1)}그램"

            # Pack count, e.g. "*2"
            match_count = re.search(r'\*(\d)', title)
            if match_count:
                count = f"*{match_count.group(1)}"

            if count and weight:
                weight += count

            # Treat missing or empty prices as 0
            price = row.iloc[2]
            has_price = pd.notna(price) and price != ''

            # Build the output record ('정보 없음' means "no information")
            record = {
                "product_name": brand,
                "product_stage": stage,
                "product_weight": weight or "정보 없음",
                "product_url": row.iloc[1],
                "product_price": float(price) if has_price else 0.0,
                "product_comment_num": row.iloc[3],
                "shop_name": row.iloc[4],
                "shope_url": str(row.iloc[5]).strip()
            }

            f_out.write(json.dumps(record, ensure_ascii=False) + '\n')

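3. Building the graph with Neo4j

Node types (the label used in the code is given in parentheses):
- Shop (상점): shop name and URL
- Brand (브랜드): product brand name
- Stage (단계): classification such as stage 1 or stage 2
- Product link (상품링크): unique product page URL, with price, weight, and review count as properties

Relationship types:
- Shop → Brand (소유함, "owns")
- Shop → Stage (제품보유, "stocks products")
- Shop → Product link (링크연결, "links to")
- Brand → Stage (제품제공, "offers products")
- Product link → Stage (분류, "classified as")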
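
To make the model concrete before touching the database, the five relationships can be written out for a single record in plain Python. This is only an illustration: the helper `record_to_triples` is hypothetical, and the labels and relationship types are the ones used in this article's graph-building code.

```python
def record_to_triples(record):
    """Return the (start, relationship, end) triples implied by one record.

    Each node is written as a (label, name) pair.
    """
    shop = ("상점", record["shop_name"])
    brand = ("브랜드", record["product_name"])
    stage = ("단계", record["product_stage"])
    product = ("상품링크", record["product_url"])
    return [
        (shop, "소유함", brand),
        (shop, "제품보유", stage),
        (shop, "링크연결", product),
        (brand, "제품제공", stage),
        (product, "분류", stage),
    ]


# First sample record from section 2
sample = {
    "product_name": "a2",
    "product_stage": "1단",
    "product_url": "https://item.jd.com/1950756.html",
    "shop_name": "a2 해외직구 공식점",
}
for start, rel, end in record_to_triples(sample):
    print(start, f"-[{rel}]->", end)
```

Because shops, brands, and stages repeat across records, the same (label, name) pair shows up in many triples, which is why the loading step has to avoid creating duplicate nodes.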

Example graph-building code:

import json

from py2neo import Graph, Node, Relationship, NodeMatcher

def create_graph_from_json(graph_uri, auth, json_path):
    graph = Graph(graph_uri, auth=auth)

    matcher = NodeMatcher(graph)

    with open(json_path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            data = json.loads(line)
            brand = data['product_name']
            stage = data['product_stage']
            url = data['product_url']
            shop_name = data['shop_name']
            shop_url = data['shope_url']
            price = data['product_price']
            comment_count = data['product_comment_num']
            weight = data['product_weight']

            # Skip records whose brand or stage could not be extracted
            if not brand or not stage:
                continue

            # Build the candidate nodes
            shop_node = Node("상점", name=shop_name, url=shop_url)
            brand_node = Node("브랜드", name=brand)
            stage_node = Node("단계", name=stage)
            product_node = Node("상품링크", name=url, 가격=price, 중량=weight, 리뷰수=comment_count)

            # Save only the nodes that do not exist yet
            tx = graph.begin()
            nodes_to_create = []
            if not matcher.match("상점", name=shop_name).first():
                nodes_to_create.append(shop_node)
            if not matcher.match("브랜드", name=brand).first():
                nodes_to_create.append(brand_node)
            if not matcher.match("단계", name=stage).first():
                nodes_to_create.append(stage_node)
            if not matcher.match("상품링크", name=url).first():
                nodes_to_create.append(product_node)

            # tx.create() expects a node or subgraph, not a list
            for node in nodes_to_create:
                tx.create(node)
            tx.commit()

            # Look up the stored nodes and connect them
            shop_match = matcher.match("상점", name=shop_name).first()
            brand_match = matcher.match("브랜드", name=brand).first()
            stage_match = matcher.match("단계", name=stage).first()
            product_match = matcher.match("상품링크", name=url).first()

            relationships = [
                Relationship(shop_match, "소유함", brand_match),
                Relationship(shop_match, "제품보유", stage_match),
                Relationship(shop_match, "링크연결", product_match),
                Relationship(brand_match, "제품제공", stage_match),
                Relationship(product_match, "분류", stage_match),
            ]

            # merge() avoids duplicating a relationship that already exists
            for rel in relationships:
                graph.merge(rel)

            print(f"Processed: {url}")
    print("All records processed")
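
As an alternative to matching and creating nodes one by one, the same upsert can be expressed as a single Cypher `MERGE` statement per record. The sketch below is a swapped-in technique, not the article's method. The names `MERGE_QUERY`, `record_params`, and `merge_record` are hypothetical, and `graph` is assumed to be a connected py2neo `Graph`, whose `run` method accepts a query string and a parameter dictionary.

```python
# One statement that upserts all four nodes and five relationships.
# Labels, properties, and relationship types are backtick-quoted so the
# Korean names are passed through unchanged.
MERGE_QUERY = """
MERGE (s:`상점` {name: $shop_name})
  ON CREATE SET s.url = $shop_url
MERGE (b:`브랜드` {name: $brand})
MERGE (t:`단계` {name: $stage})
MERGE (p:`상품링크` {name: $url})
  ON CREATE SET p.`가격` = $price, p.`중량` = $weight, p.`리뷰수` = $comment_num
MERGE (s)-[:`소유함`]->(b)
MERGE (s)-[:`제품보유`]->(t)
MERGE (s)-[:`링크연결`]->(p)
MERGE (b)-[:`제품제공`]->(t)
MERGE (p)-[:`분류`]->(t)
"""


def record_params(record):
    """Map one preprocessed JSON record to the query parameters."""
    return {
        "shop_name": record["shop_name"],
        "shop_url": record["shope_url"],  # key spelled as in the JSON output
        "brand": record["product_name"],
        "stage": record["product_stage"],
        "url": record["product_url"],
        "price": record["product_price"],
        "weight": record["product_weight"],
        "comment_num": record["product_comment_num"],
    }


def merge_record(graph, record):
    """Upsert one record; `graph` is assumed to be a py2neo Graph."""
    graph.run(MERGE_QUERY, record_params(record))
```

Passing values as parameters rather than formatting them into the query string keeps shop names and URLs from being interpreted as Cypher, and `MERGE` makes the load safe to re-run on the same file.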

Tags: Neo4j, py2neo, graph database, data preprocessing, Python

Posted on May 23 at 08:18

괴물 클럽
