How can I build a model to distinguish tweets about Apple (Inc.) from tweets about apple (fruit)?

radiobox 2020. 9. 21. 07:33

See below for 50 tweets about "apple". I hand-labeled the positive matches about Apple Inc.; they are marked with 1 below.

Here are a few of the lines:

1|“@chrisgilmer: Apple targets big business with new iOS 7 features http://bit.ly/15F9JeF ”. Finally.. A corp iTunes account!
0|“@Zach_Paull: When did green skittles change from lime to green apple? #notafan” @Skittles
1|@dtfcdvEric: @MaroneyFan11 apple inc is searching for people to help and tryout all their upcoming tablet within our own net page No.
0|@STFUTimothy have you tried apple pie shine?
1|#SuryaRay #India Microsoft to bring Xbox and PC games to Apple, Android phones: Report: Microsoft Corp... http://dlvr.it/3YvbQx  @SuryaRay

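The full data set is here: http://pastebin.com/eJuEb4eB

I need to build a model that separates "Apple" (Inc.) from the rest.

I'm not looking for a general overview of machine learning; I'm looking for an actual model in code (Python preferred).

I would do it as follows:

Split each sentence into words, normalise them, and build a dictionary.
For each word, store how many times it occurred in tweets about the company and how many times it appeared in tweets about the fruit. These tweets must be labeled by a human.
When a new tweet comes in, look up every word of the tweet in the dictionary and calculate a weighted score. Words used frequently in relation to the company get a high company score, and vice versa. Words that are rare, or that are used with both the company and the fruit, contribute little to the score.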
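
The three steps above can be sketched in a few lines of Python. This is only one possible reading of "weighted score": it sums smoothed log ratios of company counts to fruit counts. The helper names (tokenize, build_counts, score) and the four training tweets are made up for illustration.

```python
import math
from collections import Counter

def tokenize(tweet):
    # Lowercase and split on whitespace; a deliberately crude normalisation.
    return tweet.lower().split()

def build_counts(labeled_tweets):
    # labeled_tweets: iterable of (text, label), label is "company" or "fruit".
    counts = {"company": Counter(), "fruit": Counter()}
    for text, label in labeled_tweets:
        counts[label].update(tokenize(text))
    return counts

def score(tweet, counts):
    # Sum of per-word log ratios with add-one smoothing.
    # Positive total: leans company. Negative total: leans fruit.
    total = 0.0
    for word in tokenize(tweet):
        c = counts["company"][word] + 1
        f = counts["fruit"][word] + 1
        total += math.log(c / f)
    return total

# Hypothetical hand-labeled tweets, for illustration only.
training = [
    ("apple releases new ios update", "company"),
    ("apple stock rises after iphone launch", "company"),
    ("baked an apple pie today", "fruit"),
    ("green apple juice is sour", "fruit"),
]
counts = build_counts(training)
print(score("new iphone from apple", counts))  # positive: leans company
print(score("apple pie recipe", counts))       # negative: leans fruit
```

Unseen words and words that occur equally often in both classes (here "apple" itself) contribute a log ratio of zero, which matches the idea that they should carry little weight.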

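What you are looking for is called Named Entity Recognition. It is a statistical technique that (most commonly) uses Conditional Random Fields to find named entities, based on a model that has been trained to learn things about named entities.

Essentially, it looks at the content and context of a word (looking back and forward a few words) to estimate the probability that the word is a named entity.

Good software can also look at other features of words, such as their length or shape (for example "Vcv" if the word starts with "vowel-consonant-vowel").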
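
To make the "shape" idea concrete, here is a small sketch of a word-shape function. The character classes used here (V/v for vowels, C/c for consonants, d for digits) are an assumption chosen to match the "Vcv" example; they are not the feature set of Stanford NER or any other particular tool.

```python
def word_shape(word):
    # Map each character to a class: V/v vowel, C/c consonant, d digit.
    # Upper-case letters give upper-case symbols; other characters are kept.
    shape = []
    for ch in word:
        if ch.isdigit():
            shape.append("d")
        elif ch.isalpha():
            symbol = "v" if ch.lower() in "aeiou" else "c"
            shape.append(symbol.upper() if ch.isupper() else symbol)
        else:
            shape.append(ch)
    return "".join(shape)

print(word_shape("Apple"))  # Vcccv
print(word_shape("apple"))  # vcccv
print(word_shape("iOS7"))   # vVCd
```

The capitalised and lower-case forms of the same word get different shapes, which is exactly the kind of signal a sequence model can pick up on.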

A very good library (GPL) is Stanford's NER.

Demo: http://nlp.stanford.edu:8080/ner/

Some sample text to try:

I was eating an apple over at Apple headquarters and I thought about Apple Martin, the daughter of the Coldplay guy.

(The 3class and 4class classifiers get it right.)

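I have a semi-working system that solves this problem, open sourced and built with scikit-learn, along with a series of blog posts describing what I'm doing. The problem I'm tackling is word-sense disambiguation (choosing one of several word-sense options), which is not the same as Named Entity Recognition. My basic approach is somewhat competitive with existing solutions and (crucially) is customisable.

There are some existing commercial NER tools (OpenCalais, DBPedia Spotlight, and AlchemyAPI) that might give you a good enough result. Do try these first!

I used some of these for a client project (I consult using NLP/ML in London), but I wasn't happy with their recall (precision and recall). Basically they can be precise (when they say "This is Apple Inc" they're typically correct), but their recall is low (they rarely say "This is Apple Inc" even when it is obvious to a human that the tweet is about Apple Inc). I figured it would be an intellectually interesting exercise to build an open source version tailored to tweets. Here's the current code: https://github.com/ianozsvald/social_media_brand_disambiguator

Note that I'm not trying to solve the generalised word-sense disambiguation problem with this approach, just brand disambiguation (companies, people, etc.) when you already have the name. That's why I believe this straightforward approach will work.

I started this six weeks ago, and it is written in Python 2.7 using scikit-learn. It uses a very basic approach. I vectorize using a binary count vectorizer (I only record whether a word appears, not how many times) with 1-3 n-grams. I don't scale with TF-IDF (TF-IDF is good when document length varies; for me the tweets are only one or two sentences, and my test results didn't improve with TF-IDF).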
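
As a rough illustration of what a binary 1-3 n-gram representation contains, here is a standard-library sketch. It is not the project's code; in scikit-learn the corresponding configuration would be roughly CountVectorizer(binary=True, ngram_range=(1, 3)), whose default tokenisation differs slightly from the regular expression assumed here.

```python
import re

def ngram_features(tweet, max_n=3):
    # Tokenise on runs of word characters, then record each 1..max_n-gram once.
    # Using a set makes the representation binary: present or absent.
    tokens = re.findall(r"\w+", tweet.lower())
    features = set()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            features.add(" ".join(tokens[i:i + n]))
    return features

feats = ngram_features("Apple pie, apple pie!")
print(sorted(feats))
# ['apple', 'apple pie', 'apple pie apple', 'pie', 'pie apple', 'pie apple pie']
```

Even though "apple pie" occurs twice in the example, it appears once in the feature set, which is the "whether, not how often" behaviour described above.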

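I use the default tokenizer, which is very basic but surprisingly useful. It ignores @ and # (so you lose some context) and of course doesn't expand URLs. I then train using logistic regression, and this problem seems to be somewhat linearly separable (lots of terms for one class don't exist for the other). Currently I'm avoiding any stemming or cleaning (I'm trying The Simplest Possible Thing That Might Work).

The code has a full README, and you should be able to ingest your tweets relatively easily and then follow my suggestions for testing.

This works for Apple because people don't eat or drink Apple computers, nor do we type on or play with fruit, so the words split easily into one category or the other. This condition may not hold for something like #definance for the TV show (where people also use #definance in relation to the Arab Spring, cricket matches, exam revision and a music band). Cleverer approaches may well be required there.

I have a series of blog posts describing this project, including a one-hour presentation I gave at the BrightonPython user group (which turned into a shorter presentation for 140 people at DataScienceLondon).

If you use something like LogisticRegression (where you get a probability for each classification), you can keep only the confident classifications, and that way you can force high precision by trading it against recall (so you get correct results, but fewer of them). You'll have to tune this to your system.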
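
The precision/recall trade can be illustrated without any library. In the sketch below the probabilities are made-up stand-ins for the P(company) values a probabilistic classifier would return, and precision_recall is a hypothetical helper, not part of the project.

```python
def precision_recall(probs, truths, threshold):
    # Label a tweet "company" only when P(company) >= threshold.
    predicted = [p >= threshold for p in probs]
    tp = sum(1 for pred, t in zip(predicted, truths) if pred and t == 1)
    fp = sum(1 for pred, t in zip(predicted, truths) if pred and t == 0)
    fn = sum(1 for pred, t in zip(predicted, truths) if not pred and t == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up model outputs and true labels (1 = Apple Inc., 0 = fruit).
probs = [0.95, 0.85, 0.70, 0.60, 0.45, 0.10]
truths = [1, 1, 1, 0, 0, 0]

print(precision_recall(probs, truths, 0.5))  # (0.75, 1.0)
print(precision_recall(probs, truths, 0.8))  # precision 1.0, recall about 0.67
```

Raising the threshold from 0.5 to 0.8 removes the one false positive but also drops a true company tweet, which is the tuning decision described above.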

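Here's a possible algorithmic approach using scikit-learn:

Use a binary CountVectorizer (I don't think term counts in short messages add much information, as most words occur only once).
Start with a decision tree classifier. It will have explainable performance (see Overfitting with a Decision Tree for an example).
Move to logistic regression.
Investigate the errors generated by the classifiers (read the DecisionTree's exported output or look at the coefficients in LogisticRegression, and work the misclassified tweets back through the vectorizer to see what the underlying bag-of-words representation looks like. There will be fewer tokens there than in the raw tweet you started with; are there enough for a classification?).
For a worked version of this approach, see the example code at https://github.com/ianozsvald/social_media_brand_disambiguator/blob/master/learn1.py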
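
A minimal sketch of the vectorize, train and inspect loop, assuming scikit-learn is installed. The six tweets are invented for illustration and this is not the code from learn1.py; it only shows the shape of the approach with a binary CountVectorizer and LogisticRegression.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Invented toy data; a real run needs far more labeled tweets.
tweets = [
    "apple unveils new iphone and ios update",
    "apple stock jumps after iphone sales report",
    "apple macbook gets faster chips",
    "baked a warm apple pie with cinnamon",
    "green apple juice and apple pie for lunch",
    "picking a ripe apple in the orchard",
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = Apple Inc., 0 = fruit

# Binary presence/absence features over 1-3 word n-grams.
vectorizer = CountVectorizer(binary=True, ngram_range=(1, 3))
X = vectorizer.fit_transform(tweets)

clf = LogisticRegression()
clf.fit(X, labels)

# Pair every n-gram with its learned coefficient to see what drives decisions.
weights = {term: clf.coef_[0][idx] for term, idx in vectorizer.vocabulary_.items()}
ranked = sorted(weights, key=weights.get)
print("most fruit-like:", ranked[:3])
print("most company-like:", ranked[-3:])

new = vectorizer.transform(["apple pie recipe", "new iphone from apple"])
print(clf.predict(new))
```

On this toy data the first new tweet should come out as fruit and the second as company. Reading the ranked n-grams is the "look at the coefficients" step from the list above.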

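Things to consider:

You need a larger dataset. I'm using 2000 labeled tweets (it took me five hours), and as a minimum you want a balanced set with more than 100 per class (see the overfitting note below).
Improve the tokeniser (very easy with scikit-learn) to keep # and @ in tokens, and maybe add a capitalised-brand detector (as user @user2425429 notes).
Consider a non-linear classifier (like @oiez's suggestion above) when things get harder. Personally I found LinearSVC to do worse than logistic regression (but that may be due to the high-dimensional feature space that I have yet to reduce).
A tweet-specific part-of-speech tagger (in my humble opinion not Stanford's, as @Neil suggests; in my experience it performs poorly on poor Twitter grammar).
Once you have lots of tokens you'll probably want some dimensionality reduction (I've not tried this yet; see my blog post on LogisticRegression l1 l2 penalisation).

Re. overfitting. In my dataset of 2000 items I have a 10-minute snapshot of 'apple' tweets from Twitter. About 2/3 of the tweets are about Apple Inc. and 1/3 are about other uses of apple. I pull out a balanced subset of each class (about 584 rows, I think) and do five-fold cross-validation for training.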
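
For readers who want to see the mechanics, here is a standard-library sketch of drawing a balanced subset and splitting it into five folds. The function names and the synthetic rows are assumptions for illustration; scikit-learn ships its own cross-validation utilities that would normally be used instead.

```python
import random

def balanced_subset(rows, seed=0):
    # rows: list of (text, label) with label 0 or 1.
    # Downsample the larger class so both classes have the same size.
    rng = random.Random(seed)
    by_class = {0: [], 1: []}
    for row in rows:
        by_class[row[1]].append(row)
    n = min(len(by_class[0]), len(by_class[1]))
    subset = rng.sample(by_class[0], n) + rng.sample(by_class[1], n)
    rng.shuffle(subset)
    return subset

def k_fold_splits(rows, k=5):
    # Yield (train, test) pairs; every row lands in exactly one test fold.
    for fold in range(k):
        test = rows[fold::k]
        train = [r for i, r in enumerate(rows) if i % k != fold]
        yield train, test

# Synthetic data with the 2:1 class ratio mentioned in the text.
rows = [("company tweet %d" % i, 1) for i in range(20)] + \
       [("fruit tweet %d" % i, 0) for i in range(10)]
subset = balanced_subset(rows)
print(len(subset))  # 20: ten rows per class
for train, test in k_fold_splits(subset):
    print(len(train), len(test))  # 16 4 for each of the five folds
```

Note that random folds do not address the time-window concern: tweets from the same ten minutes can still land in both train and test.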

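Since I only have a 10-minute time window, I have many tweets about the same topic, and this is probably why my classifier does so well relative to existing tools. It will have overfit to the training features without generalising well (whereas the existing commercial tools perform worse on this snapshot but more reliably across a wider set of data). I'll be expanding my time window to test this as a subsequent piece of work.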

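You can do the following:

Make a dict of words containing their counts of occurrence in fruit-related and company-related tweets. This can be achieved by feeding it some sample tweets whose class we already know.
Using enough previous data, we can find the probability of a word occurring in a tweet about Apple Inc.
Multiply the individual probabilities of the words to get the probability of the whole tweet.

Simplified example:

p_f = probability of fruit tweets.

p_w_f = probability of a word occurring in a fruit tweet.

p_t_f = combined probability of all the words in the tweet occurring in a fruit tweet = p_w1_f * p_w2_f * ...

p_f_t = probability of fruit given a particular tweet.

p_c, p_w_c, p_t_c, p_c_t are the respective values for company.

A Laplacian smoother of value 1 is added to eliminate the problem of zero frequency for new words that are not in our database.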
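
Written out as formulas, the quantities defined above look roughly as follows. The count symbols are notation introduced here for illustration: T_f is the number of fruit tweets, T the total number of tweets, n_{w,f} the number of times word w occurs in fruit tweets, W_f the total number of words in fruit tweets, |V| the number of known words, and K the smoothing constant. The company-side formulas are obtained by swapping f and c.

```latex
\begin{align*}
P(f) &= \frac{T_f + K}{T + 2K}, &
P(w \mid f) &= \frac{n_{w,f} + K}{W_f + K\,\lvert V \rvert}, \\
P(t \mid f) &= \prod_{i} P(w_i \mid f), &
P(f \mid t) &= \frac{P(f)\,P(t \mid f)}{P(f)\,P(t \mid f) + P(c)\,P(t \mid c)}.
\end{align*}
```

Here P(f), P(w | f), P(t | f) and P(f | t) correspond to p_f, p_w_f, p_t_f and p_f_t. With K = 1 a word never seen in fruit tweets still gets the small nonzero probability K / (W_f + K|V|), so a single unknown word cannot force the whole product to zero.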

from __future__ import division  # true division on Python 2 as well

# Labeled tweets: text -> '1' (company) or '0' (fruit)
old_tweets = {'apple pie sweet potatoe cake baby https://vine.co/v/hzBaWVA3IE3': '0', ...}
known_words = {}
total_company_tweets = total_fruit_tweets = total_company_words = total_fruit_words = 0

# Count, per word, how often it appears in company and in fruit tweets
for tweet in old_tweets:
    company = old_tweets[tweet]
    for word in tweet.lower().split(" "):
        if word not in known_words:
            known_words[word] = {"company": 0, "fruit": 0}
        if company == "1":
            known_words[word]["company"] += 1
            total_company_words += 1
        else:
            known_words[word]["fruit"] += 1
            total_fruit_words += 1

    if company == "1":
        total_company_tweets += 1
    else:
        total_fruit_tweets += 1
total_tweets = len(old_tweets)

def predict_tweet(new_tweet, K=1):
    # Class priors with Laplace smoothing
    p_f = (total_fruit_tweets + K) / (total_tweets + K * 2)
    p_c = (total_company_tweets + K) / (total_tweets + K * 2)
    new_words = new_tweet.lower().split(" ")

    p_t_f = p_t_c = 1
    for word in new_words:
        # Unknown words get zero counts; smoothing keeps their probability nonzero
        wordFound = known_words.get(word, {'fruit': 0, 'company': 0})
        p_w_f = (wordFound['fruit'] + K) / (total_fruit_words + K * len(known_words))
        p_w_c = (wordFound['company'] + K) / (total_company_words + K * len(known_words))
        # Accumulate inside the loop so that every word contributes
        p_t_f *= p_w_f
        p_t_c *= p_w_c

    # Applying Bayes' rule
    p_f_t = p_f * p_t_f / (p_t_f * p_f + p_t_c * p_c)
    p_c_t = p_c * p_t_c / (p_t_f * p_f + p_t_c * p_c)
    if p_c_t > p_f_t:
        return "Company"
    return "Fruit"

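If you don't mind using an outside library, I'd recommend scikit-learn, since it can probably do this better and faster than anything you could code yourself. I'd do something like this:

Build your corpus. I used list comprehensions for clarity, but depending on how your data is stored you might need to do something different.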

def corpus_builder(apple_inc_tweets, apple_fruit_tweets):
    # Company tweets first, then fruit tweets; labels follow the same order
    corpus = [tweet for tweet in apple_inc_tweets] + [tweet for tweet in apple_fruit_tweets]
    labels = [1 for _ in range(len(apple_inc_tweets))] + [0 for _ in range(len(apple_fruit_tweets))]
    return (corpus, labels)

The important thing is that you end up with two lists that look like this:

(['apple inc tweet i love ios and iphones', 'apple iphones are great', 'apple fruit tweet i love pie', 'apple pie is great'], [1, 1, 0, 0])

The [1, 1, 0, 0] are the positive and negative labels.

Then you create a Pipeline! Pipeline is a scikit-learn class that makes it easy to chain text-processing steps together, so you only have to call one object when training and predicting.

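from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

def train(corpus, labels):
    pipe = Pipeline([('vect', CountVectorizer(ngram_range=(1, 3), stop_words='english')),
                     ('tfidf', TfidfTransformer(norm='l2')),
                     ('clf', LinearSVC())])
    # fit, not fit_transform: the final step is a classifier and has no transform()
    pipe.fit(corpus, labels)
    return pipe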

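Inside the Pipeline there are three processing steps. The CountVectorizer tokenizes the text, splits it into words, counts them, and transforms the data into a sparse matrix. The TfidfTransformer is optional, and you might want to remove it depending on the accuracy you get (cross-validation tests and a grid search for the best parameters are a bit involved, so I won't get into them here). LinearSVC is a standard text classification algorithm.

Finally, you predict the category of a tweet: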

def predict(pipe, tweet):
    prediction = pipe.predict([tweet])
    return prediction

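Again, the tweet needs to be in a list, so I assumed it enters the function as a string.

Put all of that into a class or whatever, and you're done. At least, with this very basic example.

I didn't test this code, so it might not work if you just copy and paste it, but if you want to use scikit-learn it should give you an idea of where to start.

EDIT: tried to explain the steps in more detail.

Using a decision tree seems to work quite well for this problem. At least it produces a higher accuracy than a naive Bayes classifier with my chosen features.

If you want to play around with some possibilities, you can use the following code, which requires nltk to be installed. The nltk book is also freely available online, so you might want to read a bit about how all of this actually works: http://nltk.googlecode.com/svn/trunk/doc/book/ch06.html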

#coding: utf-8
import nltk
import random
import re

def get_split_sets():
    structured_dataset = get_dataset()
    train_set = set(random.sample(structured_dataset, int(len(structured_dataset) * 0.7)))
    test_set = [x for x in structured_dataset if x not in train_set]

    train_set = [(tweet_features(x[1]), x[0]) for x in train_set]
    test_set = [(tweet_features(x[1]), x[0]) for x in test_set]
    return (train_set, test_set)

def check_accurracy(times=5):
    s = 0
    for _ in range(times):
        train_set, test_set = get_split_sets()
        c = nltk.classify.DecisionTreeClassifier.train(train_set)
        # Uncomment to use a naive bayes classifier instead
        #c = nltk.classify.NaiveBayesClassifier.train(train_set)
        s += nltk.classify.accuracy(c, test_set)

    return s / times


def remove_urls(tweet):
    # Strip http/https links and pic.twitter.com image links
    tweet = re.sub(r'https?://[^ ]+', "", tweet)
    tweet = re.sub(r'pic\.twitter\.com/[^ ]+', "", tweet)
    return tweet

def tweet_features(tweet):
    words = [x for x in nltk.tokenize.wordpunct_tokenize(remove_urls(tweet.lower())) if x.isalpha()]
    features = dict()
    for bigram in nltk.bigrams(words):
        features["hasBigram(%s)" % ",".join(bigram)] = True
    for trigram in nltk.trigrams(words):
        features["hasTrigram(%s)" % ",".join(trigram)] = True  
    return features

def get_dataset():
    dataset = """copy dataset in here
"""
    structured_dataset = [('fruit' if x[0] == '0' else 'company', x[2:]) for x in dataset.splitlines()]
    return structured_dataset

if __name__ == '__main__':
    print(check_accurracy())

Thank you for the comments so far. Here is a working solution I prepared in PHP. I'd still be interested in hearing from others about a more algorithmic approach to this same problem.

<?php

// Confusion Matrix Init
$tp = 0;
$fp = 0;
$fn = 0;
$tn = 0;
$arrFP = array();
$arrFN = array();

// Load All Tweets to string
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://pastebin.com/raw.php?i=m6pP8ctM');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$strCorpus = curl_exec($ch);
curl_close($ch);

// Load Tweets as Array
$arrCorpus = explode("\n", $strCorpus);
foreach ($arrCorpus as $k => $v) {
    // init
    $blnActualClass = substr($v,0,1);
    $strTweet = trim(substr($v,2));

    // Score Tweet
    $intScore = score($strTweet);

    // Build Confusion Matrix and Log False Positives & Negatives for Review
    if ($intScore > 0) {
        if ($blnActualClass == 1) {
            // True Positive
            $tp++;
        } else {
            // False Positive
            $fp++;
            $arrFP[] = $strTweet;
        }
    } else {
        if ($blnActualClass == 1) {
            // False Negative
            $fn++;
            $arrFN[] = $strTweet;
        } else {
            // True Negative
            $tn++;
        }
    }
}

// Confusion Matrix and Logging
echo "
           Predicted
            1     0
Actual 1   $tp     $fn
Actual 0    $fp    $tn

";

if (count($arrFP) > 0) {
    echo "\n\nFalse Positives\n";
    foreach ($arrFP as $strTweet) {
        echo "$strTweet\n";
    }
}

if (count($arrFN) > 0) {
    echo "\n\nFalse Negatives\n";
    foreach ($arrFN as $strTweet) {
        echo "$strTweet\n";
    }
}

function LoadDictionaryArray() {
    $strDictionary = <<<EOD
10|iTunes
10|ios 7
10|ios7
10|iPhone
10|apple inc
10|apple corp
10|apple.com
10|MacBook
10|desk top
10|desktop
1|config
1|facebook
1|snapchat
1|intel
1|investor
1|news
1|labs
1|gadget
1|apple store
1|microsoft
1|android
1|bonds
1|Corp.tax
1|macs
-1|pie
-1|clientes
-1|green apple
-1|banana
-10|apple pie
EOD;

    $arrDictionary = explode("\n", $strDictionary);
    foreach ($arrDictionary as $k => $v) {
        $arr = explode('|', $v);
        $arrDictionary[$k] = array('value' => $arr[0], 'term' => strtolower(trim($arr[1])));
    }
    return $arrDictionary;
}

function score($str) {
    $str = strtolower($str);
    $intScore = 0;
    foreach (LoadDictionaryArray() as $arrDictionaryItem) {
        if (strpos($str,$arrDictionaryItem['term']) !== false) {
            $intScore += $arrDictionaryItem['value'];
        }
    }
    return $intScore;
}
?>

The above outputs:

           Predicted
            1     0
Actual 1   31     1
Actual 0    1    17


False Positives
1|Royals apple #ASGame @mlb @ News Corp Building http://instagram.com/p/bBzzgMrrIV/


False Negatives
-1|RT @MaxFreixenet: Apple no tiene clientes. Tiene FANS// error.... PAGAS por productos y apps, ergo: ERES CLIENTE.
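
For reference, the usual summary metrics follow directly from the four counts in the matrix above (31 true positives, 1 false positive, 1 false negative, 17 true negatives). The snippet below is plain arithmetic and is not part of the PHP script.

```python
# Counts taken from the confusion matrix printed above.
tp, fp, fn, tn = 31, 1, 1, 17

precision = tp / (tp + fp)                  # share of predicted-company tweets that are right
recall = tp / (tp + fn)                     # share of real company tweets that were found
accuracy = (tp + tn) / (tp + fp + fn + tn)  # share of all 50 tweets classified correctly

print(precision, recall, accuracy)  # 0.96875 0.96875 0.96
```

These figures are measured on the same 50 tweets that the keyword dictionary was written against, so they say little about how the scores would hold up on new tweets.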

In all the examples that you gave, Apple(inc) was either referred to as Apple or apple inc, so a possible way could be to search for:

a capital "A" in Apple
an "inc" after apple
words/phrases like "OS", "operating system", "Mac", "iPhone", ...
or a combination of them
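
A small sketch of how these rules could be combined, using only the re module. The way the signals are combined (treating "inc" as decisive, otherwise requiring both a capital "A" and a product word) is an assumption made here for illustration, as is the short product-word list.

```python
import re

# Product and platform words hinting at the company (illustrative list).
COMPANY_TERMS = re.compile(r"\b(os|ios|operating system|mac|iphone)\b", re.IGNORECASE)

def looks_like_company(tweet):
    capital_a = re.search(r"\bApple\b", tweet) is not None
    followed_by_inc = re.search(r"\bapple\s+inc\b", tweet, re.IGNORECASE) is not None
    product_word = COMPANY_TERMS.search(tweet) is not None
    # "inc" alone is treated as decisive; otherwise require two weaker signals,
    # because a capital "A" also appears at the start of fruit sentences.
    return followed_by_inc or (capital_a and product_word)

print(looks_like_company("Apple targets big business with new iOS 7 features"))  # True
print(looks_like_company("apple inc is searching for people to help"))           # True
print(looks_like_company("have you tried apple pie shine?"))                     # False
print(looks_like_company("Apple pie is my favourite"))                           # False
```

Rules like these are easy to read and debug, but every new product name or phrasing has to be added by hand.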

To simplify the answers based on Conditional Random Fields a bit: context is huge here. You will want to pick out the tweets that clearly show Apple the company vs. apple the fruit. Let me outline a list of features that might be useful for you to start with. For more information, look up noun phrase chunking and something called BIO labels. See (http://www.cis.upenn.edu/~pereira/papers/crf.pdf)

Surrounding words: Build a feature vector for the previous word and the next word, or if you want more features perhaps the previous 2 and next 2 words. You don't want too many words in the model or it won't match the data very well. In Natural Language Processing, you are going to want to keep this as general as possible.

Other features to get from surrounding words include the following:

Whether the first character is a capital

Whether the last character in the word is a period

The part of speech of the word (Look up part of speech tagging)

The text itself of the word

I don't advise this, but to give more examples of features specifically for Apple:

WordIs(Apple)

NextWordIs(Inc.)

You get the point. Think of Named Entity Recognition as describing a sequence, and then using some math to tell a computer how to calculate that.

Keep in mind that natural language processing is a pipeline-based system. Typically, you break things into sentences, move to tokenization, then do part-of-speech tagging or even dependency parsing.

This is all to get you a list of features you can use in your model to identify what you're looking for.
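
As a sketch of what such a feature list might look like in code, the function below builds a feature dictionary for one token and its neighbours. The feature names are made up here, and the part-of-speech feature is left out because it needs a tagger; dictionaries of this kind are a common input format for CRF toolkits.

```python
def token_features(tokens, i):
    # Features for tokens[i] plus its immediate neighbours.
    word = tokens[i]
    feats = {
        "word": word.lower(),                   # the text of the word itself
        "is_capitalized": word[:1].isupper(),   # first character is a capital
        "ends_with_period": word.endswith("."), # last character is a period
    }
    if i > 0:
        feats["prev_word"] = tokens[i - 1].lower()
    else:
        feats["BOS"] = True  # beginning of sequence
    if i < len(tokens) - 1:
        feats["next_word"] = tokens[i + 1].lower()
    else:
        feats["EOS"] = True  # end of sequence
    return feats

tokens = "I bought shares of Apple Inc. yesterday".split()
print(token_features(tokens, 4))
```

For the token "Apple" this yields word "apple", is_capitalized True, prev_word "of" and next_word "inc.", which are the surrounding-word features described above.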

There's a really good library for processing natural language text in Python called nltk. You should take a look at it.

One strategy you could try is to look at n-grams (groups of words) with the word "apple" in them. Some words are more likely to be used next to "apple" when talking about the fruit, others when talking about the company, and you can use those to classify tweets.
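
A minimal sketch of pulling out the n-grams that contain "apple", using plain string handling rather than nltk. The window size and the punctuation stripped from tokens are assumptions chosen for the example.

```python
def apple_contexts(tweet, window=1):
    # Return the word windows centred on each occurrence of "apple".
    tokens = tweet.lower().split()
    contexts = []
    for i, tok in enumerate(tokens):
        if tok.strip(".,!?#@\"'") == "apple":
            start = max(0, i - window)
            contexts.append(" ".join(tokens[start:i + window + 1]))
    return contexts

print(apple_contexts("have you tried apple pie shine?"))  # ['tried apple pie']
print(apple_contexts("Apple targets big business"))       # ['apple targets']
```

Counting how often each context appears in company tweets versus fruit tweets then gives the per-n-gram evidence this strategy relies on.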

Use LibShortText. This Python utility has already been tuned to work for short text categorization tasks, and it works well. The maximum you'll have to do is to write a loop to pick the best combination of flags. I used it to do supervised speech act classification in emails and the results were up to 95-97% accurate (during 5 fold cross validation!).

And it comes from the makers of LIBSVM and LIBLINEAR whose support vector machine (SVM) implementation is used in sklearn and cran, so you can be reasonably assured that their implementation is not buggy.

Make an AI filter to distinguish Apple Inc (the company) from apple (the fruit). Since these are tweets, define your training set with a vector of 140 fields, each field being the character written in the tweet at position X (0 to 139). If the tweet is shorter, just give a value that stands for blank.

Then build a training set big enough to get good accuracy (how good is up to you). Assign a result value to each tweet: an Apple Inc tweet gets 1 (true) and an apple (fruit) tweet gets 0. It would be a case of supervised learning with logistic regression.
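
The fixed-length encoding described above could look like this. Using Unicode code points as the field values and 0 as the blank value are assumptions made for the sketch; the answer does not say how characters should be turned into numbers.

```python
def tweet_to_vector(tweet, length=140, blank=0):
    # One field per character position; code points stand in for characters.
    # Tweets shorter than `length` are padded with the blank value.
    codes = [ord(ch) for ch in tweet[:length]]
    return codes + [blank] * (length - len(codes))

vec = tweet_to_vector("apple pie")
print(len(vec))  # 140
print(vec[:9])   # [97, 112, 112, 108, 101, 32, 112, 105, 101]
```

One consequence of this design is that the same word at a different position produces completely different features, which is a reason the word-based representations in the other answers are more common for text.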

That is machine learning; it is generally easier to code and performs better. It learns from the set you give it and is not hardcoded.

I don't know Python, so I cannot write the code for it, but if you want to spend more time on the logic and theory of machine learning, you might want to look at the class I'm following.

Try the Coursera course Machine Learning by Andrew Ng. You will learn machine learning in MATLAB or Octave, but once you get the basics you will be able to write machine learning code in almost any language, provided you understand the simple math (simple in the case of logistic regression).

That is, getting the code from someone won't make you able to understand what is going on in the machine learning code. You might want to invest a couple of hours in the subject to see what is really going on.

I would recommend avoiding the answers that suggest entity recognition, because this task is text classification first and entity recognition second (you can do it without entity recognition at all).

I think the fastest path to results will be spaCy + Prodigy. spaCy has a well-thought-out model for English, so you don't have to build your own, while Prodigy lets you quickly create training datasets and fine-tune the spaCy model for your needs.

If you have enough samples, you can have a decent model in 1 day.

Reference URL: https://stackoverflow.com/questions/17352469/how-can-i-build-a-model-to-distinguish-tweets-about-apple-inc-from-tweets-abo
