n-grams
- Python Automation and Machine Learning for ICs -
- An Online Book -



=================================================================================

An n-gram is a contiguous sequence of n items (usually words) from a given sample of text. N-grams offer a useful, data-driven approach for certain aspects of natural language processing (NLP), especially when the statistical relationships between neighboring words in a sequence matter.

Some reasons why n-grams can be a helpful solution are: 

  • Statistical Patterns: N-grams capture statistical patterns of word co-occurrence in a corpus. They provide information about the likelihood of certain words appearing together. 

  • Simplicity: Compared to manually crafting context-free grammar rules, n-grams are often simpler to implement and understand. They don't require explicit rule definition and can be generated directly from the data. 

  • Flexibility: N-grams can adapt to different contexts and domains. By training on domain-specific or task-specific data, they can capture linguistic patterns relevant to a given application. 

  • Common in Language Modeling: N-grams are widely used in language modeling tasks, where the goal is to predict the next word in a sequence based on the preceding words. This is often used in applications such as machine translation, speech recognition, and text generation. 

However, note that while n-grams are effective for certain aspects of language modeling, they have limitations: 

  • Lack of Global Context: N-grams only consider local context within a fixed window, which may limit their ability to capture long-range dependencies or understand the global context of a sentence. 

  • Limited Semantics: N-grams might struggle with capturing deeper semantic meanings and relationships between words. 

  • Data Dependency: The quality of n-gram models heavily depends on the size and representativeness of the training data. Rare or unseen combinations of words may not be well-handled. 

Examples of different n-grams are: 

  • Unigram (1-gram): 

    • Text: "I visited Chicago." 

    • Unigrams: ["I", "visited", "Chicago"] 

  • Bigram (2-gram): 

    • Text: "I visited Chicago." 

    • Bigrams: [("I", "visited"), ("visited", "Chicago")] 

  • Trigram (3-gram): 

    • Text: "I visited Chicago." 

    • Trigrams: [("I", "visited", "Chicago")] 

For instance, in a bigram model, we might calculate the probability of a word given its preceding word. In the example sentence "I visited Chicago," the bigram probabilities could help predict the likelihood of "visited" given "I" or "Chicago" given "visited." A simple Python script that generates n-grams (unigrams, bigrams, and trigrams) from a given text using NLTK is shown below. 
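A minimal sketch of such a script, assuming NLTK is installed and its punkt tokenizer data has been downloaded, could look like the following; the punctuation filter is added here so the tokens match the word-only examples above:

    import nltk
    from nltk.tokenize import word_tokenize
    from nltk.util import ngrams

    # nltk.download('punkt')  # uncomment on the first run to fetch the tokenizer data

    text = "I visited Chicago."

    # Tokenize the text and keep only alphabetic tokens so punctuation is dropped
    tokens = [t for t in word_tokenize(text) if t.isalpha()]

    # Generate unigrams, bigrams, and trigrams from the token list
    unigrams = list(ngrams(tokens, 1))
    bigrams = list(ngrams(tokens, 2))
    trigrams = list(ngrams(tokens, 3))

    print("Unigrams:", unigrams)
    print("Bigrams:", bigrams)
    print("Trigrams:", trigrams)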

 

With the sketch above, the printed output would be:    
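    Unigrams: [('I',), ('visited',), ('Chicago',)]
    Bigrams: [('I', 'visited'), ('visited', 'Chicago')]
    Trigrams: [('I', 'visited', 'Chicago')]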

 

In the example above, the ngrams function from NLTK generates the unigrams, bigrams, and trigrams, and word_tokenize splits the input text into word tokens (with punctuation filtered out so the results match the word-only examples earlier in this section).
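To tie this back to the bigram-probability idea mentioned above, the hypothetical sketch below estimates P(word | preceding word) from a tiny made-up corpus using simple maximum-likelihood counts, i.e. count(preceding word, word) / count(preceding word); the corpus and function names here are assumptions for illustration only:

    from collections import Counter
    from nltk.tokenize import word_tokenize
    from nltk.util import ngrams

    # A tiny, made-up corpus purely for illustration
    corpus = "I visited Chicago . I visited Boston . I liked Chicago ."
    tokens = [t for t in word_tokenize(corpus) if t.isalpha()]

    unigram_counts = Counter(tokens)            # counts of single words
    bigram_counts = Counter(ngrams(tokens, 2))  # counts of adjacent word pairs

    def bigram_probability(prev_word, word):
        # Maximum-likelihood estimate: count(prev_word, word) / count(prev_word)
        return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

    print(bigram_probability("I", "visited"))        # 2/3: "I" is followed by "visited" in two of its three occurrences
    print(bigram_probability("visited", "Chicago"))  # 1/2: "visited" is followed by "Chicago" in one of its two occurrences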

=================================================================================