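# 1. Language Processing and Python

This chapter addresses the following questions:

1. What can we achieve by combining simple programs with large quantities of text?
2. How can we automatically extract key words and phrases that sum up the style and content of a text?
3. What tools and techniques does the Python programming language provide for such work?
4. What are some of the interesting challenges of natural language processing?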

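## 1. Computing with Language: Texts and Words

### 1.1 Getting Started with Python

Install Python 3.

### 1.2 Getting Started with NLTK

Install NLTK 3.0, which can be downloaded from nltk.org. The steps below cover installing NLTK on Mac/Unix.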

##### 1. Install NLTK: run

```shell
sudo pip install -U nltk
# or, using pip3 for Python 3
sudo pip3 install -U nltk
```
##### 2. Install NumPy/matplotlib (optional): run
```shell
sudo pip install -U numpy
# use pip3 for Python 3
sudo pip3 install -U numpy
sudo pip3 install matplotlib
```

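##### 3. Test the installation: run

Start `python` (or `python3`), then type:

```python
import nltk
# open the downloader and download the NLTK book collection
nltk.download()
# load the texts used in the book
from nltk.book import *
```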

```text
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
```
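
Before opening an interactive session, it can help to confirm that the packages are importable at all. The following is a minimal sketch that uses only the standard library. The helper name `is_installed` is made up for this illustration and is not part of NLTK.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None


# Report the status of the packages installed in the steps above
for name in ("nltk", "numpy", "matplotlib"):
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

A package reported as missing was most likely installed for a different interpreter, for example with `pip` while the script runs under `python3`.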

##### 4. Error when running `nltk.download()`: `SSL: CERTIFICATE_VERIFY_FAILED`

Fix: see [Upgrading Python 3 on Mac](https://blog.csdn.net/hjw199089/article/details/80053543).

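### 1.3 Searching Text

```python
# concordance() shows every occurrence of a word together with its context
text1.concordance("monstrous")
text2.concordance("affection")
text3.concordance("lived")

# similar() lists words that appear in similar contexts
text1.similar("monstrous")
text2.similar("monstrous")

# common_contexts() shows the contexts shared by two or more words
text2.common_contexts(["monstrous", "very"])

# dispersion_plot() plots where each word occurs across the text (requires matplotlib)
text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])

# generate() produces random text in the style of text3
text3.generate()
```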
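
To get a feel for what a concordance view does, here is a rough plain-Python sketch. It is not how NLTK implements `concordance()`. The function signature, the `width` parameter, and the sample sentence are all assumptions made for illustration.

```python
def concordance(tokens, word, width=3):
    """Return each occurrence of `word` with up to `width` tokens of context per side."""
    target = word.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}")
    return lines


# Hypothetical sample sentence, split on whitespace for simplicity
tokens = "the whale was a monstrous size and a monstrous sight".split()
for line in concordance(tokens, "monstrous"):
    print(line)
```

Each printed line puts the matched word in brackets, which makes it easy to compare the contexts in which a word appears.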

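### 1.4 Counting Vocabulary

```python
# len() gives the length of the text in tokens (words and punctuation)
len(text3)

# the vocabulary of text3: its set of distinct tokens
set(text3)

# the vocabulary in sorted order
sorted(set(text3))

# lexical richness: distinct tokens divided by total tokens
len(set(text3)) / len(text3)

# count how often a word occurs, and what percentage of the text it makes up
text3.count("smote")
100 * text3.count("smote") / len(text3)
```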

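Define a lexical richness function `lexical_diversity()` and a percentage function `percentage()`:

```python
def lexical_diversity(text):
    # ratio of distinct tokens to total tokens
    return len(set(text)) / len(text)

def percentage(count, total):
    # `count` expressed as a percentage of `total`
    return 100 * (count / total)
```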
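
As a quick check of these two functions, the sketch below applies them to a tiny hand-made token list so the results can be verified by hand. The sample tokens are an assumption for illustration and do not come from the NLTK corpora.

```python
def lexical_diversity(text):
    return len(set(text)) / len(text)


def percentage(count, total):
    return 100 * (count / total)


# Hypothetical sample: 6 tokens, 4 of them distinct
tokens = ["to", "be", "or", "not", "to", "be"]

diversity = lexical_diversity(tokens)                 # 4 distinct / 6 total
share_of_be = percentage(tokens.count("be"), len(tokens))  # 2 of 6 tokens

print(round(diversity, 3))
print(round(share_of_be, 1))
```

Both functions divide by the text length, so they raise `ZeroDivisionError` on an empty list. A caller working with arbitrary input may want to guard against that.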
 
### 2.2 Indexing Lists
```python
# find the index of the first occurrence of a word
text4.index('awaken')
# extract a slice of the text
text5[16715:16735]
```
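
The same indexing and slicing operations work on any Python list. The following sketch uses a short made-up token list (an assumption for illustration) instead of the NLTK texts, so it runs without any corpus data.

```python
# Hypothetical sample tokens
tokens = ["Call", "me", "Ishmael", ".", "Some", "years", "ago"]

position = tokens.index("Ishmael")   # index of the first occurrence
window = tokens[4:7]                 # items 4, 5 and 6; the end index is excluded
first_two = tokens[:2]               # omitting the start means "from the beginning"
last = tokens[-1]                    # negative indexes count from the end

print(position, window, first_two, last)

# index() raises ValueError when the item is absent
try:
    tokens.index("whale")
except ValueError:
    print("'whale' does not occur in the list")
```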

### 2.3 Variables

Assigning a value to a variable:

```python
set1 = ["text1", "name2", "..."]
```

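### 2.4 Strings

```python
# assign a string to a variable
name = "Monty"
# index a string: 'M'
name[0]
# slice a string: 'Mont'
name[0:4]
# multiply a string: 'MontyMonty'
name * 2
# concatenate strings: 'Monty...'
name + "..."
```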

## 3. Computing with Language: Simple Statistics

### 3.1 Frequency Distributions

Use `FreqDist` to find the 50 most common words:

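```python
# text1 is used here only as an example; any loaded text works
fdist = FreqDist(text1)
fdist.most_common(50)
```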
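
The idea behind a frequency distribution can be tried out without NLTK. The sketch below uses `collections.Counter` from the standard library, which NLTK 3's `FreqDist` builds on. The sample sentence is an assumption for illustration.

```python
from collections import Counter

# Hypothetical sample text, split on whitespace
tokens = "the quick brown fox jumps over the lazy dog and the fox".split()

fdist = Counter(tokens)          # maps each token to its count
top_two = fdist.most_common(2)   # the two most frequent tokens with their counts

print(top_two)
print(fdist["the"])
```

With a real text, passing `50` instead of `2` to `most_common()` gives the 50 most frequent words.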

### 3.2 Fine-grained Selection of Words
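
One way to select words more precisely is to filter the vocabulary by a property such as length, optionally combined with frequency. The following is a small sketch under assumed inputs. The sample sentence and the thresholds are made up for illustration.

```python
from collections import Counter

# Hypothetical sample text
tokens = ("the determination of the committee surprised the "
          "opposition and the determination held").split()

vocab = set(tokens)
fdist = Counter(tokens)

# words longer than 8 characters
long_words = sorted(w for w in vocab if len(w) > 8)

# long words that also occur more than once
frequent_long_words = sorted(w for w in vocab if len(w) > 8 and fdist[w] > 1)

print(long_words)
print(frequent_long_words)
```

Combining both conditions filters out very common short words as well as long words that appear only once.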

### 3.3 Collocations and Bigrams
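
A bigram is a pair of adjacent words. As a minimal sketch, bigrams can be built in plain Python by zipping a token list with itself shifted by one position. The helper below is written for this illustration and is not NLTK's own `bigrams()` function, and the sample tokens are assumptions.

```python
from collections import Counter


def bigrams(tokens):
    """Return the list of adjacent word pairs in `tokens`."""
    return list(zip(tokens, tokens[1:]))


pairs = bigrams(["more", "is", "said", "than", "done"])
print(pairs)

# Counting pairs in a longer token list hints at which pairs recur
text = "red wine and red wine".split()
pair_counts = Counter(bigrams(text))
print(pair_counts.most_common(1))
```

Raw pair counts are only a crude stand-in for collocation finding, which also takes into account how frequent the individual words are.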

### 3.4 Counting Other Things
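
Frequency counting is not limited to words. As one possible example, the sketch below counts word lengths instead of the words themselves. The sample tokens are an assumption for illustration.

```python
from collections import Counter

# Hypothetical sample tokens
tokens = ["Call", "me", "Ishmael", ".", "Some", "years", "ago"]

# Count how many tokens there are of each length
length_counts = Counter(len(w) for w in tokens)

most_common_length = length_counts.most_common(1)
longest = max(length_counts)

print(length_counts)
print(most_common_length, longest)
```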

### 4.2 Operating on Every Element

```python
# the length of every word in text1
[len(w) for w in text1]
# every word in text1 converted to upper case
[w.upper() for w in text1]
```
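
To see what these list comprehensions produce, the sketch below runs the same two expressions on a short made-up token list (an assumption for illustration) and compares one of them with the equivalent explicit loop.

```python
# Hypothetical sample tokens
tokens = ["Call", "me", "Ishmael", "."]

lengths = [len(w) for w in tokens]      # one length per token
shouted = [w.upper() for w in tokens]   # one upper-cased copy per token

# The first comprehension is shorthand for this loop
lengths_loop = []
for w in tokens:
    lengths_loop.append(len(w))

print(lengths)
print(shouted)
```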
    
### 4.3 Nested Code Blocks

```python
if ...:
    print(...)
```
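
Here is a concrete sketch of how indentation marks which statements belong to a block. The sample word and token list are assumptions chosen for illustration.

```python
word = "cat"
if len(word) < 5:
    # indented lines belong to the if block and run only when the condition holds
    message = "word length is less than 5"
else:
    message = "word length is 5 or more"
print(message)

lengths = []
for w in ["Call", "me", "Ishmael", "."]:
    # both indented lines form the loop body and run once per item
    lengths.append(len(w))
    print(w, len(w))
print("done")  # not indented, so it runs once after the loop ends
```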

### 4.4 Looping with Conditions

```python
# combine a for loop with an if statement
sent1 = ['...',]
for test in sent1:
    if test.endswith("1"):
        print(test)
```
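
The following sketch shows the same pattern on a concrete, made-up sentence (an assumption for illustration). It first filters tokens with a single condition, then uses `if`/`elif`/`else` to sort every token into a category.

```python
# Hypothetical sample sentence
sent = ["Call", "me", "Ishmael", "."]

# Keep only the tokens that end with the letter "l"
ends_with_l = []
for token in sent:
    if token.endswith("l"):
        ends_with_l.append(token)

# Label every token; exactly one branch runs for each token
labels = []
for token in sent:
    if token.islower():
        labels.append((token, "lowercase"))
    elif token.istitle():
        labels.append((token, "titlecase"))
    else:
        labels.append((token, "other"))

print(ends_with_l)
print(labels)
```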

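## 5. Automatic Natural Language Understanding

Language understanding technologies.

### 5.1 Word Sense Disambiguation

```python
sorted(set(w.lower() for w in text1))
sorted(w.lower() for w in set(text1))
```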
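
The two expressions above look alike but are not equivalent. The first lowercases every token and then removes duplicates. The second removes duplicates first and lowercases afterwards, so words that differed only in case show up more than once. The sketch below demonstrates the difference on a small made-up token list (an assumption for illustration).

```python
# Hypothetical sample tokens with mixed case
tokens = ["The", "the", "Whale", "whale", "sea"]

# Lowercase first, then deduplicate: each word appears once
lower_then_set = sorted(set(w.lower() for w in tokens))

# Deduplicate first, then lowercase: case variants survive as duplicates
set_then_lower = sorted(w.lower() for w in set(tokens))

print(lower_then_set)
print(set_then_lower)
```

For building a case-insensitive vocabulary, the first form is the one that gives a duplicate-free result.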