Can someone tell me the difference between corpora, corpus, and lexicon in NLTK?
What is the movie dataset?
What is WordNet?
Corpora is the plural of corpus.
Corpus means body, and in the context of natural language processing (NLP), it means a body of text.
(source: https://www.google.com.sg/search?q=corpora)
A lexicon is a vocabulary, a list of words, a dictionary (source: https://www.google.com.sg/search?q=lexicon)
In NLTK, a lexicon is considered a corpus, since a list of words is also a body of text. E.g. a list of stopwords can be found in the NLTK corpus API:
>>> from nltk.corpus import stopwords
>>> print(stopwords.words('english'))
[u'i', u'me', u'my', u'myself', u'we', u'our', u'ours', u'ourselves', u'you', u'your', u'yours', u'yourself', u'yourselves', u'he', u'him', u'his', u'himself', u'she', u'her', u'hers', u'herself', u'it', u'its', u'itself', u'they', u'them', u'their', u'theirs', u'themselves', u'what', u'which', u'who', u'whom', u'this', u'that', u'these', u'those', u'am', u'is', u'are', u'was', u'were', u'be', u'been', u'being', u'have', u'has', u'had', u'having', u'do', u'does', u'did', u'doing', u'a', u'an', u'the', u'and', u'but', u'if', u'or', u'because', u'as', u'until', u'while', u'of', u'at', u'by', u'for', u'with', u'about', u'against', u'between', u'into', u'through', u'during', u'before', u'after', u'above', u'below', u'to', u'from', u'up', u'down', u'in', u'out', u'on', u'off', u'over', u'under', u'again', u'further', u'then', u'once', u'here', u'there', u'when', u'where', u'why', u'how', u'all', u'any', u'both', u'each', u'few', u'more', u'most', u'other', u'some', u'such', u'no', u'nor', u'not', u'only', u'own', u'same', u'so', u'than', u'too', u'very', u's', u't', u'can', u'will', u'just', u'don', u'should', u'now']
The movie review dataset in NLTK (canonically known as the Movie Reviews Corpus) is a text dataset of 2k movie reviews with sentiment polarity classification (source: http://www.nltk.org/book/ch02.html).
It is often used for tutorial purposes as an introduction to NLP and sentiment analysis; see http://www.nltk.org/book/ch06.html and "nltk NaiveBayesClassifier training for sentiment analysis".
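As a rough illustration of how the Movie Reviews Corpus is typically used, here is a minimal sketch in the spirit of chapter 6 of the NLTK book. The function names `document_features` and `train_movie_review_classifier`, and the defaults of 2000 feature words and 100 held-out reviews, are my own assumptions for this example, not something the corpus prescribes. It also assumes the data has been fetched with `nltk.download('movie_reviews')`.

```python
import random


def document_features(document_words, word_features):
    """Map a document to {'contains(word)': bool} for each word in word_features."""
    present = set(document_words)
    return {"contains({})".format(w): (w in present) for w in word_features}


def train_movie_review_classifier(num_features=2000, test_size=100, seed=0):
    """Train nltk.NaiveBayesClassifier on the Movie Reviews Corpus.

    Assumes NLTK is installed and nltk.download('movie_reviews') has been run.
    Returns the classifier and its accuracy on the held-out reviews.
    """
    # Imported here so document_features() stays usable without NLTK installed.
    import nltk
    from nltk.corpus import movie_reviews

    # One (word list, label) pair per review; the labels are the corpus categories.
    documents = [
        (list(movie_reviews.words(fileid)), category)
        for category in movie_reviews.categories()
        for fileid in movie_reviews.fileids(category)
    ]
    random.Random(seed).shuffle(documents)

    # Use the most frequent words in the whole corpus as features.
    all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
    word_features = [w for w, _ in all_words.most_common(num_features)]

    featuresets = [(document_features(d, word_features), c) for d, c in documents]
    train_set, test_set = featuresets[test_size:], featuresets[:test_size]
    classifier = nltk.NaiveBayesClassifier.train(train_set)
    return classifier, nltk.classify.accuracy(classifier, test_set)
```

Calling `train_movie_review_classifier()` returns the trained classifier together with its accuracy on the held-out reviews; `classifier.show_most_informative_features(5)` then lists the words that best separate the two polarity labels.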
WordNet is a lexical database for the English language (it's like a lexicon/dictionary with word-to-word relations) (source: https://wordnet.princeton.edu/).
In NLTK, it incorporates the Open Multilingual WordNet (http://compling.hss.ntu.edu.sg/omw/), which allows you to query words in other languages.
Since it is also a list of words (in this case with many other things included: relations, lemmas, POS, etc.), it is also invoked using nltk.corpus in NLTK.
The canonical idiom to use WordNet in NLTK is as such:
>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('dog')
[Synset('dog.n.01'), Synset('frump.n.01'), Synset('dog.n.03'), Synset('cad.n.01'), Synset('frank.n.02'), Synset('pawl.n.01'), Synset('andiron.n.01'), Synset('chase.v.01')]
The easiest way to understand and learn NLP jargon and the basics is to go through the tutorials in the NLTK book: http://www.nltk.org/book/
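To make the "word-to-word relations" and multilingual points about WordNet a bit more concrete, here is a small sketch. The helper name `describe_synset` and the choice of `'jpn'` as the example language are assumptions for illustration only. It assumes the WordNet data and the Open Multilingual WordNet data have been downloaded; the download name for the latter varies by NLTK version (`'omw-1.4'` in recent releases, `'omw'` in older ones).

```python
def describe_synset(synset_name="dog.n.01", lang="jpn"):
    """Collect a few WordNet facts about one synset.

    Assumes nltk.download('wordnet') and the Open Multilingual WordNet data
    are available; lang is an ISO 639-3 code ('jpn' is just an example).
    """
    from nltk.corpus import wordnet as wn

    synset = wn.synset(synset_name)
    return {
        # Dictionary-style gloss for this sense of the word.
        "definition": synset.definition(),
        # English lemmas (words) that share this sense.
        "lemmas": synset.lemma_names(),
        # Relation to more general concepts (one kind of word-to-word relation).
        "hypernyms": [h.name() for h in synset.hypernyms()],
        # Lemmas for the same sense in another language, via Open Multilingual WordNet.
        "lemmas_in_lang": synset.lemma_names(lang),
    }
```

Running `describe_synset()` returns a dictionary with the gloss, the English lemmas, the names of the hypernym synsets, and the lemmas in the chosen language; `wn.langs()` lists the language codes available in your installation.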