We start by installing the library we will use: !pip install textract
and then we import it:
import textract
Next, we load the text of the PDF we will work with into a variable by calling textract.process() as follows:
text = textract.process('/content/Frank Herbert - Dune-Orion Publishing Group (2020).pdf',
method='pdfminer',
encoding='ascii')
Let's take a look at the beginning of the text we have extracted:
text[0:100]
b'\x0cDUNE\n\nFrank Herbert\n\nwww.sfgateway.com\n\n\x0cEnter the SF Gateway \xe2\x80\xa6\n\nIn the last years of the t'
Note that \x0c
represents a page break (form feed), so we can split our text into pages as follows:
list_text_inicial = text.decode().split('\x0c')
where the decode() method
converts the bytes object into an ordinary string, and we then split the text at the page breaks; that is, our text is now organized by pages.
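As a quick sanity check (a sketch; the exact page count depends on the PDF, so it is not shown here), we can confirm the types involved and the number of pages obtained:
# text is a bytes object, list_text_inicial is a list of page strings
print(type(text), type(list_text_inicial), len(list_text_inicial))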
# For example, access the first page, which is empty
list_text_inicial[0]
''
# Access the page at index 4
list_text_inicial[4]
'DUNE\n\nA beginning is the time for taking the most delicate care that the balances are correct. This\nevery sister of the Bene Gesserit knows. To begin your study of the life of Muad’Dib, then,\ntake care that you first place him in his time: born in the 57th year of the Padishah\nEmperor, Shaddam IV. And take the most special care that you locate Muad’Dib in his\nplace: the planet Arrakis. Do not be deceived by the fact that he was born on Caladan and\nlived his first fifteen years there. Arrakis, the planet known as Dune, is forever his place.\n\n—from “Manual of Muad’Dib” by the Princess Irulan\n\nIn the week before their departure to Arrakis, when all the final scurrying\nabout had reached a nearly unbearable frenzy, an old crone came to visit the\nmother of the boy, Paul.\n\nIt was a warm night at Castle Caladan, and the ancient pile of stone that\nhad served the Atreides family as home for twenty-six generations bore that\ncooled-sweat feeling it acquired before a change in the weather.\n\nThe old woman was let in by the side door down the vaulted passage by\nPaul’s room and she was allowed a moment to peer in at him where he lay\nin his bed.\n\nBy the half-light of a suspensor lamp, dimmed and hanging near the floor,\nthe awakened boy could see a bulky female shape at his door, standing one\nstep ahead of his mother. The old woman was a witch shadow—hair like\nmatted spiderwebs, hooded ‘round darkness of features, eyes like glittering\njewels.\n\n“Is he not small for his age, Jessica?” the old woman asked. Her voice\n\nwheezed and twanged like an untuned baliset.\n\nPaul’s mother answered in her soft contralto: “The Atreides are known to\n\nstart late getting their growth, Your Reverence.”\n\n“So I’ve heard, so I’ve heard,” wheezed the old woman. “Yet he’s already\n\nfifteen.”\n\n“Yes, Your Reverence.”\n\n'
print(type(list_text_inicial[4]))
<class 'str'>
Next, \n\n
represents a double line break, so we can split the page at index 4 on double line breaks:
page_4 = list_text_inicial[4].split('\n\n')
page_4
['DUNE', 'A beginning is the time for taking the most delicate care that the balances are correct. This\nevery sister of the Bene Gesserit knows. To begin your study of the life of Muad’Dib, then,\ntake care that you first place him in his time: born in the 57th year of the Padishah\nEmperor, Shaddam IV. And take the most special care that you locate Muad’Dib in his\nplace: the planet Arrakis. Do not be deceived by the fact that he was born on Caladan and\nlived his first fifteen years there. Arrakis, the planet known as Dune, is forever his place.', '—from “Manual of Muad’Dib” by the Princess Irulan', 'In the week before their departure to Arrakis, when all the final scurrying\nabout had reached a nearly unbearable frenzy, an old crone came to visit the\nmother of the boy, Paul.', 'It was a warm night at Castle Caladan, and the ancient pile of stone that\nhad served the Atreides family as home for twenty-six generations bore that\ncooled-sweat feeling it acquired before a change in the weather.', 'The old woman was let in by the side door down the vaulted passage by\nPaul’s room and she was allowed a moment to peer in at him where he lay\nin his bed.', 'By the half-light of a suspensor lamp, dimmed and hanging near the floor,\nthe awakened boy could see a bulky female shape at his door, standing one\nstep ahead of his mother. The old woman was a witch shadow—hair like\nmatted spiderwebs, hooded ‘round darkness of features, eyes like glittering\njewels.', '“Is he not small for his age, Jessica?” the old woman asked. Her voice', 'wheezed and twanged like an untuned baliset.', 'Paul’s mother answered in her soft contralto: “The Atreides are known to', 'start late getting their growth, Your Reverence.”', '“So I’ve heard, so I’ve heard,” wheezed the old woman. “Yet he’s already', 'fifteen.”', '“Yes, Your Reverence.”', '']
which leaves the page at index 4 apparently split into paragraphs.
# First paragraph
page_4[0]
'DUNE'
# Second paragraph
page_4[1]
'A beginning is the time for taking the most delicate care that the balances are correct. This\nevery sister of the Bene Gesserit knows. To begin your study of the life of Muad’Dib, then,\ntake care that you first place him in his time: born in the 57th year of the Padishah\nEmperor, Shaddam IV. And take the most special care that you locate Muad’Dib in his\nplace: the planet Arrakis. Do not be deceived by the fact that he was born on Caladan and\nlived his first fifteen years there. Arrakis, the planet known as Dune, is forever his place.'
We note that our text is still a bit dirty. Next, we split every page into paragraphs:
# New text with every page split into paragraphs:
text_total = []
for page in list_text_inicial[4:]:
    text_total.append(page.split('\n\n'))
# Let's look at the first page of our new document
text_total[0]
['DUNE', 'A beginning is the time for taking the most delicate care that the balances are correct. This\nevery sister of the Bene Gesserit knows. To begin your study of the life of Muad’Dib, then,\ntake care that you first place him in his time: born in the 57th year of the Padishah\nEmperor, Shaddam IV. And take the most special care that you locate Muad’Dib in his\nplace: the planet Arrakis. Do not be deceived by the fact that he was born on Caladan and\nlived his first fifteen years there. Arrakis, the planet known as Dune, is forever his place.', '—from “Manual of Muad’Dib” by the Princess Irulan', 'In the week before their departure to Arrakis, when all the final scurrying\nabout had reached a nearly unbearable frenzy, an old crone came to visit the\nmother of the boy, Paul.', 'It was a warm night at Castle Caladan, and the ancient pile of stone that\nhad served the Atreides family as home for twenty-six generations bore that\ncooled-sweat feeling it acquired before a change in the weather.', 'The old woman was let in by the side door down the vaulted passage by\nPaul’s room and she was allowed a moment to peer in at him where he lay\nin his bed.', 'By the half-light of a suspensor lamp, dimmed and hanging near the floor,\nthe awakened boy could see a bulky female shape at his door, standing one\nstep ahead of his mother. The old woman was a witch shadow—hair like\nmatted spiderwebs, hooded ‘round darkness of features, eyes like glittering\njewels.', '“Is he not small for his age, Jessica?” the old woman asked. Her voice', 'wheezed and twanged like an untuned baliset.', 'Paul’s mother answered in her soft contralto: “The Atreides are known to', 'start late getting their growth, Your Reverence.”', '“So I’ve heard, so I’ve heard,” wheezed the old woman. “Yet he’s already', 'fifteen.”', '“Yes, Your Reverence.”', '']
# First page, paragraph 1
text_total[0][0]
'DUNE'
# First page, paragraph 2
text_total[0][1]
'A beginning is the time for taking the most delicate care that the balances are correct. This\nevery sister of the Bene Gesserit knows. To begin your study of the life of Muad’Dib, then,\ntake care that you first place him in his time: born in the 57th year of the Padishah\nEmperor, Shaddam IV. And take the most special care that you locate Muad’Dib in his\nplace: the planet Arrakis. Do not be deceived by the fact that he was born on Caladan and\nlived his first fifteen years there. Arrakis, the planet known as Dune, is forever his place.'
We see that our document still contains the line-break metacharacter. Let's clean those line breaks up:
# New document without the line-break metacharacter
text_total_new = []
# Loop over every page
for pagina in text_total:
    # Create a list to store each paragraph of the page,
    # but without \n
    page = []
    # Loop over each paragraph of the current page
    for parrafo in pagina:
        # Keep only the paragraphs that have at least one character
        if len(parrafo) > 0:
            # Replace \n with a blank space
            page.append(parrafo.replace('\n', ' '))
    # Add the new page to the new document
    text_total_new.append(page)
We note that the \n metacharacter
is no longer in our document:
# Before
text_total[0][1]
'A beginning is the time for taking the most delicate care that the balances are correct. This\nevery sister of the Bene Gesserit knows. To begin your study of the life of Muad’Dib, then,\ntake care that you first place him in his time: born in the 57th year of the Padishah\nEmperor, Shaddam IV. And take the most special care that you locate Muad’Dib in his\nplace: the planet Arrakis. Do not be deceived by the fact that he was born on Caladan and\nlived his first fifteen years there. Arrakis, the planet known as Dune, is forever his place.'
# After
text_total_new[0][1]
'A beginning is the time for taking the most delicate care that the balances are correct. This every sister of the Bene Gesserit knows. To begin your study of the life of Muad’Dib, then, take care that you first place him in his time: born in the 57th year of the Padishah Emperor, Shaddam IV. And take the most special care that you locate Muad’Dib in his place: the planet Arrakis. Do not be deceived by the fact that he was born on Caladan and lived his first fifteen years there. Arrakis, the planet known as Dune, is forever his place.'
In preprocessing we aim to keep the elements of the text that carry the most relevant information. For example, the following symbols will not contribute relevant information:
simbolos = [':',
            '=',
            ';',
            "'", '(', ')',
            '~',
            '[', ']'
            ]
Now we want to remove or omit those symbols from our document, for which we will use regular expressions as follows:
import re
re.sub(r'[\W]+', ' ',text_total_new[0][1])
'A beginning is the time for taking the most delicate care that the balances are correct This every sister of the Bene Gesserit knows To begin your study of the life of Muad Dib then take care that you first place him in his time born in the 57th year of the Padishah Emperor Shaddam IV And take the most special care that you locate Muad Dib in his place the planet Arrakis Do not be deceived by the fact that he was born on Caladan and lived his first fifteen years there Arrakis the planet known as Dune is forever his place '
Here we are replacing every character that is not a word character (\W
matches anything that is not a letter, digit, or underscore) with a blank space, in this case on our first page, second paragraph. Below we apply this cleanup to the whole document, using [\W_]+ instead of [\W]+ so that underscores are removed as well.
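A toy comparison on a hypothetical string (not taken from the book) illustrates the difference between the two patterns:
re.sub(r'[\W]+', ' ', "it's_a-test")   # "it s_a test" (the underscore survives)
re.sub(r'[\W_]+', ' ', "it's_a-test")  # 'it s a test'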
# New document without non-word characters
new_list_clean = []
# Loop over every page
for page in text_total_new:
    # List that will store the cleaned page
    page_new = []
    # Loop over each paragraph
    for paragraph in page:
        if len(paragraph.split()) > 0:
            # Perform the substitutions
            page_new.append(re.sub(r'[\W_]+', ' ', paragraph))
    # Add the cleaned page to the new document
    new_list_clean.append(page_new)
new_list_clean[0]
['DUNE', 'A beginning is the time for taking the most delicate care that the balances are correct This every sister of the Bene Gesserit knows To begin your study of the life of Muad Dib then take care that you first place him in his time born in the 57th year of the Padishah Emperor Shaddam IV And take the most special care that you locate Muad Dib in his place the planet Arrakis Do not be deceived by the fact that he was born on Caladan and lived his first fifteen years there Arrakis the planet known as Dune is forever his place ', ' from Manual of Muad Dib by the Princess Irulan', 'In the week before their departure to Arrakis when all the final scurrying about had reached a nearly unbearable frenzy an old crone came to visit the mother of the boy Paul ', 'It was a warm night at Castle Caladan and the ancient pile of stone that had served the Atreides family as home for twenty six generations bore that cooled sweat feeling it acquired before a change in the weather ', 'The old woman was let in by the side door down the vaulted passage by Paul s room and she was allowed a moment to peer in at him where he lay in his bed ', 'By the half light of a suspensor lamp dimmed and hanging near the floor the awakened boy could see a bulky female shape at his door standing one step ahead of his mother The old woman was a witch shadow hair like matted spiderwebs hooded round darkness of features eyes like glittering jewels ', ' Is he not small for his age Jessica the old woman asked Her voice', 'wheezed and twanged like an untuned baliset ', 'Paul s mother answered in her soft contralto The Atreides are known to', 'start late getting their growth Your Reverence ', ' So I ve heard so I ve heard wheezed the old woman Yet he s already', 'fifteen ', ' Yes Your Reverence ']
This leaves us with text that contains only alphanumeric characters. However, we run into a problem with contractions, for example I ve instead of I've, which loses information.
To handle contractions we need to install !pip install contractions
and then import the library:
import contractions
For example
contractions.contractions_dict
{"I'm": 'I am', "I'm'a": 'I am about to', "I'm'o": 'I am going to', "I've": 'I have', "I'll": 'I will', "I'll've": 'I will have', "I'd": 'I would', "I'd've": 'I would have', 'Whatcha': 'What are you', "amn't": 'am not', "ain't": 'are not', "aren't": 'are not', "'cause": 'because', "can't": 'cannot', "can't've": 'cannot have', "could've": 'could have', "couldn't": 'could not', "couldn't've": 'could not have', "daren't": 'dare not', "daresn't": 'dare not', "dasn't": 'dare not', "didn't": 'did not', 'didn’t': 'did not', "don't": 'do not', 'don’t': 'do not', "doesn't": 'does not', "e'er": 'ever', "everyone's": 'everyone is', 'finna': 'fixing to', 'gimme': 'give me', "gon't": 'go not', 'gonna': 'going to', 'gotta': 'got to', "hadn't": 'had not', "hadn't've": 'had not have', "hasn't": 'has not', "haven't": 'have not', "he've": 'he have', "he's": 'he is', "he'll": 'he will', "he'll've": 'he will have', "he'd": 'he would', "he'd've": 'he would have', "here's": 'here is', "how're": 'how are', "how'd": 'how did', "how'd'y": 'how do you', "how's": 'how is', "how'll": 'how will', "isn't": 'is not', "it's": 'it is', "'tis": 'it is', "'twas": 'it was', "it'll": 'it will', "it'll've": 'it will have', "it'd": 'it would', "it'd've": 'it would have', 'kinda': 'kind of', "let's": 'let us', 'luv': 'love', "ma'am": 'madam', "may've": 'may have', "mayn't": 'may not', "might've": 'might have', "mightn't": 'might not', "mightn't've": 'might not have', "must've": 'must have', "mustn't": 'must not', "mustn't've": 'must not have', "needn't": 'need not', "needn't've": 'need not have', "ne'er": 'never', "o'": 'of', "o'clock": 'of the clock', "ol'": 'old', "oughtn't": 'ought not', "oughtn't've": 'ought not have', "o'er": 'over', "shan't": 'shall not', "sha'n't": 'shall not', "shalln't": 'shall not', "shan't've": 'shall not have', "she's": 'she is', "she'll": 'she will', "she'd": 'she would', "she'd've": 'she would have', "should've": 'should have', "shouldn't": 'should not', "shouldn't've": 'should not have', "so've": 'so have', "so's": 'so is', "somebody's": 'somebody is', "someone's": 'someone is', "something's": 'something is', 'sux': 'sucks', "that're": 'that are', "that's": 'that is', "that'll": 'that will', "that'd": 'that would', "that'd've": 'that would have', 'em': 'them', "there're": 'there are', "there's": 'there is', "there'll": 'there will', "there'd": 'there would', "there'd've": 'there would have', "these're": 'these are', "they're": 'they are', "they've": 'they have', "they'll": 'they will', "they'll've": 'they will have', "they'd": 'they would', "they'd've": 'they would have', "this's": 'this is', "this'll": 'this will', "this'd": 'this would', "those're": 'those are', "to've": 'to have', 'wanna': 'want to', "wasn't": 'was not', "we're": 'we are', "we've": 'we have', "we'll": 'we will', "we'll've": 'we will have', "we'd": 'we would', "we'd've": 'we would have', "weren't": 'were not', "what're": 'what are', "what'd": 'what did', "what've": 'what have', "what's": 'what is', "what'll": 'what will', "what'll've": 'what will have', "when've": 'when have', "when's": 'when is', "where're": 'where are', "where'd": 'where did', "where've": 'where have', "where's": 'where is', "which's": 'which is', "who're": 'who are', "who've": 'who have', "who's": 'who is', "who'll": 'who will', "who'll've": 'who will have', "who'd": 'who would', "who'd've": 'who would have', "why're": 'why are', "why'd": 'why did', "why've": 'why have', "why's": 'why is', "will've": 'will have', "won't": 'will not', "won't've": 'will not 
have', "would've": 'would have', "wouldn't": 'would not', "wouldn't've": 'would not have', "y'all": 'you all', "y'all're": 'you all are', "y'all've": 'you all have', "y'all'd": 'you all would', "y'all'd've": 'you all would have', "you're": 'you are', "you've": 'you have', "you'll've": 'you shall have', "you'll": 'you will', "you'd": 'you would', "you'd've": 'you would have', 'to cause': 'to cause', 'will cause': 'will cause', 'should cause': 'should cause', 'would cause': 'would cause', 'can cause': 'can cause', 'could cause': 'could cause', 'must cause': 'must cause', 'might cause': 'might cause', 'shall cause': 'shall cause', 'may cause': 'may cause', 'jan.': 'january', 'feb.': 'february', 'mar.': 'march', 'apr.': 'april', 'jun.': 'june', 'jul.': 'july', 'aug.': 'august', 'sep.': 'september', 'oct.': 'october', 'nov.': 'november', 'dec.': 'december', 'I’m': 'I am', 'I’m’a': 'I am about to', 'I’m’o': 'I am going to', 'I’ve': 'I have', 'I’ll': 'I will', 'I’ll’ve': 'I will have', 'I’d': 'I would', 'I’d’ve': 'I would have', 'amn’t': 'am not', 'ain’t': 'are not', 'aren’t': 'are not', '’cause': 'because', 'can’t': 'cannot', 'can’t’ve': 'cannot have', 'could’ve': 'could have', 'couldn’t': 'could not', 'couldn’t’ve': 'could not have', 'daren’t': 'dare not', 'daresn’t': 'dare not', 'dasn’t': 'dare not', 'doesn’t': 'does not', 'e’er': 'ever', 'everyone’s': 'everyone is', 'gon’t': 'go not', 'hadn’t': 'had not', 'hadn’t’ve': 'had not have', 'hasn’t': 'has not', 'haven’t': 'have not', 'he’ve': 'he have', 'he’s': 'he is', 'he’ll': 'he will', 'he’ll’ve': 'he will have', 'he’d': 'he would', 'he’d’ve': 'he would have', 'here’s': 'here is', 'how’re': 'how are', 'how’d': 'how did', 'how’d’y': 'how do you', 'how’s': 'how is', 'how’ll': 'how will', 'isn’t': 'is not', 'it’s': 'it is', '’tis': 'it is', '’twas': 'it was', 'it’ll': 'it will', 'it’ll’ve': 'it will have', 'it’d': 'it would', 'it’d’ve': 'it would have', 'let’s': 'let us', 'ma’am': 'madam', 'may’ve': 'may have', 'mayn’t': 'may not', 'might’ve': 'might have', 'mightn’t': 'might not', 'mightn’t’ve': 'might not have', 'must’ve': 'must have', 'mustn’t': 'must not', 'mustn’t’ve': 'must not have', 'needn’t': 'need not', 'needn’t’ve': 'need not have', 'ne’er': 'never', 'o’': 'of', 'o’clock': 'of the clock', 'ol’': 'old', 'oughtn’t': 'ought not', 'oughtn’t’ve': 'ought not have', 'o’er': 'over', 'shan’t': 'shall not', 'sha’n’t': 'shall not', 'shalln’t': 'shall not', 'shan’t’ve': 'shall not have', 'she’s': 'she is', 'she’ll': 'she will', 'she’d': 'she would', 'she’d’ve': 'she would have', 'should’ve': 'should have', 'shouldn’t': 'should not', 'shouldn’t’ve': 'should not have', 'so’ve': 'so have', 'so’s': 'so is', 'somebody’s': 'somebody is', 'someone’s': 'someone is', 'something’s': 'something is', 'that’re': 'that are', 'that’s': 'that is', 'that’ll': 'that will', 'that’d': 'that would', 'that’d’ve': 'that would have', 'there’re': 'there are', 'there’s': 'there is', 'there’ll': 'there will', 'there’d': 'there would', 'there’d’ve': 'there would have', 'these’re': 'these are', 'they’re': 'they are', 'they’ve': 'they have', 'they’ll': 'they will', 'they’ll’ve': 'they will have', 'they’d': 'they would', 'they’d’ve': 'they would have', 'this’s': 'this is', 'this’ll': 'this will', 'this’d': 'this would', 'those’re': 'those are', 'to’ve': 'to have', 'wasn’t': 'was not', 'we’re': 'we are', 'we’ve': 'we have', 'we’ll': 'we will', 'we’ll’ve': 'we will have', 'we’d': 'we would', 'we’d’ve': 'we would have', 'weren’t': 'were not', 'what’re': 'what are', 'what’d': 'what 
did', 'what’ve': 'what have', 'what’s': 'what is', 'what’ll': 'what will', 'what’ll’ve': 'what will have', 'when’ve': 'when have', 'when’s': 'when is', 'where’re': 'where are', 'where’d': 'where did', 'where’ve': 'where have', 'where’s': 'where is', 'which’s': 'which is', 'who’re': 'who are', 'who’ve': 'who have', 'who’s': 'who is', 'who’ll': 'who will', 'who’ll’ve': 'who will have', 'who’d': 'who would', 'who’d’ve': 'who would have', 'why’re': 'why are', 'why’d': 'why did', 'why’ve': 'why have', 'why’s': 'why is', 'will’ve': 'will have', 'won’t': 'will not', 'won’t’ve': 'will not have', 'would’ve': 'would have', 'wouldn’t': 'would not', 'wouldn’t’ve': 'would not have', 'y’all': 'you all', 'y’all’re': 'you all are', 'y’all’ve': 'you all have', 'y’all’d': 'you all would', 'y’all’d’ve': 'you all would have', 'you’re': 'you are', 'you’ve': 'you have', 'you’ll’ve': 'you shall have', 'you’ll': 'you will', 'you’d': 'you would', 'you’d’ve': 'you would have'}
shows us a dictionary of English contractions (a broad, though not necessarily exhaustive, collection) mapped to their expanded forms
len(contractions.contractions_dict)
343
So, before stripping the symbols from our document we will handle the contractions, turning, for example, I've into I have. Thus:
# The fix method performs the following conversion
contractions.fix("I've")
'I have'
We apply this process to the entire text of our document:
# New document without non-word characters
new_list_clean = []
# Loop over every page
for page in text_total_new:
    # List that will store the cleaned page
    page_new = []
    # Loop over each paragraph
    for paragraph in page:
        if len(paragraph.split()) > 0:
            # Perform the substitutions
            # and expand the contractions in each paragraph
            page_new.append(re.sub(r'[\W_]+', ' ', contractions.fix(paragraph)))
    # Add the cleaned page to the new document
    new_list_clean.append(page_new)
new_list_clean[0][1]
'A beginning is the time for taking the most delicate care that the balances are correct This every sister of the Bene Gesserit knows To begin your study of the life of Muad Dib then take care that you first place him in his time born in the 57th year of the Padishah Emperor Shaddam IV And take the most special care that you locate Muad Dib in his place the planet Arrakis Do not be deceived by the fact that he was born on Caladan and lived his first fifteen years there Arrakis the planet known as Dune is forever his place '
for example, don't becomes do not. We will now remove the possessive 's: Paul's, after stripping the symbols, becomes Paul s, so we drop the stray s.
# New document without non-word characters
new_list_clean = []
# Loop over every page
for page in text_total_new:
    # List that will store the cleaned page
    page_new = []
    # Loop over each paragraph
    for paragraph in page:
        if len(paragraph.split()) > 0:
            # Perform the substitutions,
            # expand the contractions in each paragraph,
            # drop the stray possessive s and lowercase everything
            page_new.append(re.sub(r'[\W_]+', ' ',
                contractions.fix(paragraph)).replace(' s ', ' ').lower())
    # Add the cleaned page to the new document
    new_list_clean.append(page_new)
type(new_list_clean)
list
Stemming lets us reduce words to a root or stem that summarizes them; for example, for the words ran and running the stem is run.
For this we will use the nltk library:
import nltk
In particular we will work with two stemming algorithms:
nltk.stem.PorterStemmer()
nltk.stem.LancasterStemmer()
Let's see what they do.
# Consider the following paragraph
new_list_clean[110][0]
' yes my lord the duke took a deep sighing breath strode out the door he turned to his right down the hall began walking hands behind his back paying little attention to where he was there were corridors and stairs and balconies and halls people who saluted and stood aside for him '
We now build a list comparing how the two algorithms above behave:
PSt = nltk.stem.PorterStemmer()
LSt = nltk.stem.LancasterStemmer()
[(PSt.stem(y), LSt.stem(y), y) for y in new_list_clean[110][0].split()]
[('ye', 'ye', 'yes'), ('my', 'my', 'my'), ('lord', 'lord', 'lord'), ('the', 'the', 'the'), ('duke', 'duk', 'duke'), ('took', 'took', 'took'), ('a', 'a', 'a'), ('deep', 'deep', 'deep'), ('sigh', 'sigh', 'sighing'), ('breath', 'brea', 'breath'), ('strode', 'strode', 'strode'), ('out', 'out', 'out'), ('the', 'the', 'the'), ('door', 'door', 'door'), ('he', 'he', 'he'), ('turn', 'turn', 'turned'), ('to', 'to', 'to'), ('hi', 'his', 'his'), ('right', 'right', 'right'), ('down', 'down', 'down'), ('the', 'the', 'the'), ('hall', 'hal', 'hall'), ('began', 'beg', 'began'), ('walk', 'walk', 'walking'), ('hand', 'hand', 'hands'), ('behind', 'behind', 'behind'), ('hi', 'his', 'his'), ('back', 'back', 'back'), ('pay', 'pay', 'paying'), ('littl', 'littl', 'little'), ('attent', 'at', 'attention'), ('to', 'to', 'to'), ('where', 'wher', 'where'), ('he', 'he', 'he'), ('wa', 'was', 'was'), ('there', 'ther', 'there'), ('were', 'wer', 'were'), ('corridor', 'corrid', 'corridors'), ('and', 'and', 'and'), ('stair', 'stair', 'stairs'), ('and', 'and', 'and'), ('balconi', 'balcony', 'balconies'), ('and', 'and', 'and'), ('hall', 'hal', 'halls'), ('peopl', 'peopl', 'people'), ('who', 'who', 'who'), ('salut', 'salut', 'saluted'), ('and', 'and', 'and'), ('stood', 'stood', 'stood'), ('asid', 'asid', 'aside'), ('for', 'for', 'for'), ('him', 'him', 'him')]
we see, for example, that the original word yes has been turned into ye by both stemming algorithms, and the word aside has been turned into asid.
With stemming we can obtain stems that are not necessarily dictionary words. With lemmatization, on the other hand, the resulting form is always a dictionary word. For this:
from nltk.stem import WordNetLemmatizer
lemattizer = WordNetLemmatizer()
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('stopwords')
nltk.download('punkt')
[nltk_data] Downloading package wordnet to /root/nltk_data... [nltk_data] Downloading package omw-1.4 to /root/nltk_data... [nltk_data] Downloading package stopwords to /root/nltk_data... [nltk_data] Unzipping corpora/stopwords.zip. [nltk_data] Downloading package punkt to /root/nltk_data... [nltk_data] Unzipping tokenizers/punkt.zip.
True
Now we repeat the comparison we made for stemming, this time adding the lemmatizer:
[(PSt.stem(y), LSt.stem(y), lemattizer.lemmatize(y), y) for y in new_list_clean[110][0].split()]
[('ye', 'ye', 'yes', 'yes'), ('my', 'my', 'my', 'my'), ('lord', 'lord', 'lord', 'lord'), ('the', 'the', 'the', 'the'), ('duke', 'duk', 'duke', 'duke'), ('took', 'took', 'took', 'took'), ('a', 'a', 'a', 'a'), ('deep', 'deep', 'deep', 'deep'), ('sigh', 'sigh', 'sighing', 'sighing'), ('breath', 'brea', 'breath', 'breath'), ('strode', 'strode', 'strode', 'strode'), ('out', 'out', 'out', 'out'), ('the', 'the', 'the', 'the'), ('door', 'door', 'door', 'door'), ('he', 'he', 'he', 'he'), ('turn', 'turn', 'turned', 'turned'), ('to', 'to', 'to', 'to'), ('hi', 'his', 'his', 'his'), ('right', 'right', 'right', 'right'), ('down', 'down', 'down', 'down'), ('the', 'the', 'the', 'the'), ('hall', 'hal', 'hall', 'hall'), ('began', 'beg', 'began', 'began'), ('walk', 'walk', 'walking', 'walking'), ('hand', 'hand', 'hand', 'hands'), ('behind', 'behind', 'behind', 'behind'), ('hi', 'his', 'his', 'his'), ('back', 'back', 'back', 'back'), ('pay', 'pay', 'paying', 'paying'), ('littl', 'littl', 'little', 'little'), ('attent', 'at', 'attention', 'attention'), ('to', 'to', 'to', 'to'), ('where', 'wher', 'where', 'where'), ('he', 'he', 'he', 'he'), ('wa', 'was', 'wa', 'was'), ('there', 'ther', 'there', 'there'), ('were', 'wer', 'were', 'were'), ('corridor', 'corrid', 'corridor', 'corridors'), ('and', 'and', 'and', 'and'), ('stair', 'stair', 'stair', 'stairs'), ('and', 'and', 'and', 'and'), ('balconi', 'balcony', 'balcony', 'balconies'), ('and', 'and', 'and', 'and'), ('hall', 'hal', 'hall', 'halls'), ('peopl', 'peopl', 'people', 'people'), ('who', 'who', 'who', 'who'), ('salut', 'salut', 'saluted', 'saluted'), ('and', 'and', 'and', 'and'), ('stood', 'stood', 'stood', 'stood'), ('asid', 'asid', 'aside', 'aside'), ('for', 'for', 'for', 'for'), ('him', 'him', 'him', 'him')]
for example, lemmatizing the word balconies gives balcony.
Let's now work with stopwords (very common words such as articles, pronouns, and prepositions that carry little meaning on their own and are usually filtered out before analysis).
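As a side note, WordNetLemmatizer treats every word as a noun unless a part-of-speech tag is passed, which is why verb forms like turned or sighing were left unchanged above. A minimal sketch:
lemattizer.lemmatize('balconies')        # 'balcony'
lemattizer.lemmatize('turned')           # 'turned' (treated as a noun by default)
lemattizer.lemmatize('turned', pos='v')  # 'turn'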
from nltk.corpus import stopwords
set(stopwords.words('english'))
{'a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', 'her', 'here', 'hers', 'herself', 'him', 'himself', 'his', 'how', 'i', 'if', 'in', 'into', 'is', 'isn', "isn't", 'it', "it's", 'its', 'itself', 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she', "she's", 'should', "should've", 'shouldn', "shouldn't", 'so', 'some', 'such', 't', 'than', 'that', "that'll", 'the', 'their', 'theirs', 'them', 'themselves', 'then', 'there', 'these', 'they', 'this', 'those', 'through', 'to', 'too', 'under', 'until', 'up', 've', 'very', 'was', 'wasn', "wasn't", 'we', 'were', 'weren', "weren't", 'what', 'when', 'where', 'which', 'while', 'who', 'whom', 'why', 'will', 'with', 'won', "won't", 'wouldn', "wouldn't", 'y', 'you', "you'd", "you'll", "you're", "you've", 'your', 'yours', 'yourself', 'yourselves'}
len(set(stopwords.words('english')))
179
So, we will clean the stopwords out of our document, using the word_tokenize() method,
which splits a text into words:
from nltk.tokenize import word_tokenize
word_tokenize('hello how are you')
['hello', 'how', 'are', 'you']
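Combining word_tokenize with the stopword list, a quick sketch on the same toy sentence (the expected output follows from the stopword set shown above):
sw = set(stopwords.words('english'))
[w for w in word_tokenize('hello how are you') if w not in sw]
['hello']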
Putting this together for the whole document:
# New document without the stopwords
final_text = []
# Loop over every page
for page in new_list_clean:
    # List that will store the cleaned page without stopwords
    page_new = []
    # Loop over each paragraph
    for paragraph in page:
        text = []
        # Loop over each word of the paragraph
        # (obtained with word_tokenize)
        for word in word_tokenize(paragraph):
            # If the word is not a stopword, add it to the
            # list text
            if word not in stopwords.words('english'):
                text.append(word)
        # Join the surviving words of text back into a single string,
        # rebuilding the paragraph, and add it to page_new
        page_new.append(' '.join(text))
    # Add each page to the final document
    final_text.append(page_new)
final_text[0][1]
'beginning time taking delicate care balances correct every sister bene gesserit knows begin study life muad dib take care first place time born 57th year padishah emperor shaddam iv take special care locate muad dib place planet arrakis deceived fact born caladan lived first fifteen years arrakis planet known dune forever place'
We can then lemmatize our text:
Lematizador = []
for page in final_text:
    page_aux = []
    for paragraph in page:
        # Lemmatize each paragraph (the whole paragraph is passed as a single string)
        page_aux.append(lemattizer.lemmatize(paragraph))
    Lematizador.append(page_aux)
Lematizador[0][1]
'beginning time taking delicate care balances correct every sister bene gesserit knows begin study life muad dib take care first place time born 57th year padishah emperor shaddam iv take special care locate muad dib place planet arrakis deceived fact born caladan lived first fifteen years arrakis planet known dune forever place'
Finally we will organize our information in a dataframe; before that, we tag each paragraph with its page number:
total_text_lematizador = []
for num, page in enumerate(Lematizador):
    for paragraph in page:
        total_text_lematizador.append((num + 1, paragraph))
total_text_lematizador[0:15]
[(1, 'dune'), (1, 'beginning time taking delicate care balances correct every sister bene gesserit knows begin study life muad dib take care first place time born 57th year padishah emperor shaddam iv take special care locate muad dib place planet arrakis deceived fact born caladan lived first fifteen years arrakis planet known dune forever place'), (1, 'manual muad dib princess irulan'), (1, 'week departure arrakis final scurrying reached nearly unbearable frenzy old crone came visit mother boy paul'), (1, 'warm night castle caladan ancient pile stone served atreides family home twenty six generations bore cooled sweat feeling acquired change weather'), (1, 'old woman let side door vaulted passage paul room allowed moment peer lay bed'), (1, 'half light suspensor lamp dimmed hanging near floor awakened boy could see bulky female shape door standing one step ahead mother old woman witch shadow hair like matted spiderwebs hooded round darkness features eyes like glittering jewels'), (1, 'small age jessica old woman asked voice'), (1, 'wheezed twanged like untuned baliset'), (1, 'paul mother answered soft contralto atreides known'), (1, 'start late getting growth reverence'), (1, 'heard heard wheezed old woman yet already'), (1, 'fifteen'), (1, 'yes reverence'), (2, 'thump')]
We build the dataframe:
import pandas as pd
df_lematizador = pd.DataFrame(total_text_lematizador).rename(columns={0: 'Página', 1: 'Párrafos'})
df_lematizador
 | Página | Párrafos
---|---|---
0 | 1 | dune |
1 | 1 | beginning time taking delicate care balances c... |
2 | 1 | manual muad dib princess irulan |
3 | 1 | week departure arrakis final scurrying reached... |
4 | 1 | warm night castle caladan ancient pile stone s... |
... | ... | ... |
7602 | 591 | part publication may reproduced stored retriev... |
7603 | 591 | means without prior permission writing publish... |
7604 | 591 | circulated form binding cover published withou... |
7605 | 591 | condition including condition imposed subseque... |
7606 | 591 | www orionbooks co uk |
7607 rows × 2 columns
Now we tokenize each of the paragraphs:
# Define a new column for the tokenized paragraphs
df_lematizador['Párrafos tokenizados'] = df_lematizador['Párrafos'].apply(lambda x: word_tokenize(x))
df_lematizador
 | Página | Párrafos | Párrafos tokenizados
---|---|---|---
0 | 1 | dune | [dune] |
1 | 1 | beginning time taking delicate care balances c... | [beginning, time, taking, delicate, care, bala... |
2 | 1 | manual muad dib princess irulan | [manual, muad, dib, princess, irulan] |
3 | 1 | week departure arrakis final scurrying reached... | [week, departure, arrakis, final, scurrying, r... |
4 | 1 | warm night castle caladan ancient pile stone s... | [warm, night, castle, caladan, ancient, pile, ... |
... | ... | ... | ... |
7602 | 591 | part publication may reproduced stored retriev... | [part, publication, may, reproduced, stored, r... |
7603 | 591 | means without prior permission writing publish... | [means, without, prior, permission, writing, p... |
7604 | 591 | circulated form binding cover published withou... | [circulated, form, binding, cover, published, ... |
7605 | 591 | condition including condition imposed subseque... | [condition, including, condition, imposed, sub... |
7606 | 591 | www orionbooks co uk | [www, orionbooks, co, uk] |
7607 rows × 3 columns
Let's find out how many words our vocabulary has:
# Select the column 'Párrafos tokenizados' and
# convert that column to a list
parr_token_lista = df_lematizador['Párrafos tokenizados'].to_list()
parr_token_lista[0:2]
[['dune'], ['beginning', 'time', 'taking', 'delicate', 'care', 'balances', 'correct', 'every', 'sister', 'bene', 'gesserit', 'knows', 'begin', 'study', 'life', 'muad', 'dib', 'take', 'care', 'first', 'place', 'time', 'born', '57th', 'year', 'padishah', 'emperor', 'shaddam', 'iv', 'take', 'special', 'care', 'locate', 'muad', 'dib', 'place', 'planet', 'arrakis', 'deceived', 'fact', 'born', 'caladan', 'lived', 'first', 'fifteen', 'years', 'arrakis', 'planet', 'known', 'dune', 'forever', 'place']]
giving a list of lists with all the words of our document. We join all of these lists:
parr_token_lista = sum(parr_token_lista, [])
parr_token_lista[0:10]
['dune', 'beginning', 'time', 'taking', 'delicate', 'care', 'balances', 'correct', 'every', 'sister']
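As a side note, the same flattening can be done with itertools.chain, which avoids the quadratic cost that sum(list_of_lists, []) incurs on long lists. A minimal alternative to the sum call above (applied to the original list of lists, not to the already-flattened result):
from itertools import chain
parr_token_lista = list(chain.from_iterable(df_lematizador['Párrafos tokenizados'].to_list()))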
so the elements of our list are now simply all the words of our document, and we can keep only the distinct words as follows:
parr_token_lista = set(parr_token_lista)
print(len(parr_token_lista))
11978
so our vocabulary consists of 11978 distinct words.
We can count how many times each word appears in our text; for this:
from collections import Counter
# Join the tokenized paragraphs into a single list
all_words = sum(df_lematizador['Párrafos tokenizados'].to_list(), [])
# Count the number of occurrences of each word in our document
dict_frecuencia_palabras = Counter(all_words)
# Build a dataframe from the counts
df_freq=pd.DataFrame([dict_frecuencia_palabras]).transpose().rename(columns={0:'Frecuencia'}).reset_index()
df_freq
 | index | Frecuencia
---|---|---
0 | dune | 52 |
1 | beginning | 25 |
2 | time | 322 |
3 | taking | 32 |
4 | delicate | 14 |
... | ... | ... |
11973 | publisher | 1 |
11974 | subsequent | 1 |
11975 | purchaser | 1 |
11976 | orionbooks | 1 |
11977 | co | 1 |
11978 rows × 2 columns
df_freq.sort_values('Frecuencia', ascending=False)
 | index | Frecuencia
---|---|---
146 | said | 2274 |
57 | paul | 1723 |
123 | jessica | 901 |
105 | one | 684 |
533 | thought | 619 |
... | ... | ... |
7509 | reckless | 1 |
7507 | temptation | 1 |
7504 | felled | 1 |
7499 | skyward | 1 |
11977 | co | 1 |
11978 rows × 2 columns
which shows us the most frequent words. We can also find the words that consist of a single character:
df_freq[df_freq['index'].apply(lambda x: len(x)) == 1]
 | index | Frecuencia
---|---|---
974 | c | 38 |
1518 | h | 151 |
2488 | ē | 1 |
2501 | b | 14 |
2502 | g | 16 |
3313 | e | 24 |
3377 | n | 6 |
6095 | p | 4 |
6216 | 2 | 9 |
6252 | f | 2 |
6253 | r | 6 |
6866 | l | 3 |
8914 | 1 | 6 |
11086 | 4 | 3 |
11200 | 3 | 2 |
11442 | 5 | 1 |
11952 | 0 | 1 |
11955 | 9 | 1 |
What we will do now is vectorize each of the documents (rows) in the dataframe df_lematizador,
for which we first build a function:
import numpy as np
# Convert the keys of the word-frequency dictionary to a list
lista_elementos = list(dict_frecuencia_palabras.keys())
# Define the function that will vectorize a string
def vectorizador_by_word(string):
    # Create an array of zeros with 11978 entries
    lista = np.zeros((11978,))
    # Look up the input string to find its position in the list
    # lista_elementos, then place the word's frequency at that
    # same position in the array of zeros _lista_
    lista[lista_elementos.index(string)] = dict_frecuencia_palabras[string]
    vector = lista
    # Return the vectorization
    return vector
For example, if we consider
vectorizador_by_word('beginning')
array([ 0., 25., 0., ..., 0., 0., 0.])
We see that the word beginning sits at index 1 in the list lista_elementos
and its frequency is 25.
With this function we can vectorize any particular word of our vocabulary, obtaining vectors with 11977 zero entries and a single entry holding the frequency of the word in question.
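A quick check of that claim (a sketch; the index and count follow from the output above):
v = vectorizador_by_word('beginning')
print(np.count_nonzero(v), v[lista_elementos.index('beginning')])
1 25.0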
What we will do now is vectorize each tokenized paragraph of the dataframe df_lematizador,
but first we create a function to vectorize entire lists:
def vector_total(lista):
    total = np.zeros((11978,))
    # For each word in the given list,
    # vectorize that word
    for i in lista:
        # and add the resulting vectors on top of the
        # array of zeros created above
        total += vectorizador_by_word(i)
    return total
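A quick sanity check on a toy list (a sketch; note that, since vectorizador_by_word places the word's global frequency, repeating a word in the list multiplies that global count rather than counting occurrences within the paragraph):
vector_total(['dune'])[lista_elementos.index('dune')]          # 52.0
vector_total(['time', 'time'])[lista_elementos.index('time')]  # 644.0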
In short, from a list of words this function produces a vector holding the frequencies of the words that appear in the list, with zeros in the remaining entries for the vocabulary words that do not appear in the list. We now build these vectors for each tokenized paragraph:
df_lematizador['Vectorizados']=df_lematizador['Párrafos tokenizados'].apply(lambda x: vector_total(x) )
df_lematizador
 | Página | Párrafos | Párrafos tokenizados | Vectorizados
---|---|---|---|---
0 | 1 | dune | [dune] | [52.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,... |
1 | 1 | beginning time taking delicate care balances c... | [beginning, time, taking, delicate, care, bala... | [52.0, 25.0, 644.0, 32.0, 14.0, 78.0, 1.0, 12.... |
2 | 1 | manual muad dib princess irulan | [manual, muad, dib, princess, irulan] | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |
3 | 1 | week departure arrakis final scurrying reached... | [week, departure, arrakis, final, scurrying, r... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |
4 | 1 | warm night castle caladan ancient pile stone s... | [warm, night, castle, caladan, ancient, pile, ... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |
... | ... | ... | ... | ... |
7602 | 591 | part publication may reproduced stored retriev... | [part, publication, may, reproduced, stored, r... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |
7603 | 591 | means without prior permission writing publish... | [means, without, prior, permission, writing, p... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |
7604 | 591 | circulated form binding cover published withou... | [circulated, form, binding, cover, published, ... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |
7605 | 591 | condition including condition imposed subseque... | [condition, including, condition, imposed, sub... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |
7606 | 591 | www orionbooks co uk | [www, orionbooks, co, uk] | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |
7607 rows × 4 columns
Note that the vector space generated by all the words of our vocabulary has dimension
len(parr_token_lista)
11978
but the dataframe above has only 7607 rows, each with a column holding the vectorized paragraph. This is a hint that we could reduce the dimensionality of our vector space.
Based on this vectorization, we will work with the metric space induced by the Euclidean vector space, together with cosine similarity. Cosine similarity is obtained from the dot product of two vectors:
$$\lVert v\rVert \, \lVert w\rVert \cos(\theta) = v\cdot w \;\;\Rightarrow\;\; \cos(\theta)=\frac{v\cdot w}{\lVert v\rVert \, \lVert w\rVert}$$
and this is the "metric" we induce on the space of words.
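As a minimal sketch (assuming the Vectorizados column built above is still in memory as numpy arrays), cosine similarity between two paragraph vectors can be computed as:
import numpy as np

def cosine_similarity(v, w):
    # dot product divided by the product of the Euclidean norms
    norm = np.linalg.norm(v) * np.linalg.norm(w)
    return np.dot(v, w) / norm if norm > 0 else 0.0

# e.g. compare the first two vectorized paragraphs (illustrative call only)
print(cosine_similarity(df_lematizador['Vectorizados'].iloc[0],
                        df_lematizador['Vectorizados'].iloc[1]))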
This notebook was written and run in Google Colab (link to the notebook), since it cannot be run on Windows machines because the textract library is not available there.
To use the last dataframe created here in any notebook you can write:
import pandas as pd
df_lematizador = pd.read_csv('https://luisapaez.github.io/Teoria_Galois/Texto_procesado.csv')
df_lematizador
 | Página | Párrafos | Párrafos tokenizados | Vectorizados
---|---|---|---|---
0 | 1 | dune | ['dune'] | [52. 0. 0. ... 0. 0. 0.] |
1 | 1 | beginning time taking delicate care balances c... | ['beginning', 'time', 'taking', 'delicate', 'c... | [ 52. 25. 644. ... 0. 0. 0.] |
2 | 1 | manual muad dib princess irulan | ['manual', 'muad', 'dib', 'princess', 'irulan'] | [0. 0. 0. ... 0. 0. 0.] |
3 | 1 | week departure arrakis final scurrying reached... | ['week', 'departure', 'arrakis', 'final', 'scu... | [0. 0. 0. ... 0. 0. 0.] |
4 | 1 | warm night castle caladan ancient pile stone s... | ['warm', 'night', 'castle', 'caladan', 'ancien... | [0. 0. 0. ... 0. 0. 0.] |
... | ... | ... | ... | ... |
7602 | 591 | part publication may reproduced stored retriev... | ['part', 'publication', 'may', 'reproduced', '... | [0. 0. 0. ... 0. 0. 0.] |
7603 | 591 | means without prior permission writing publish... | ['means', 'without', 'prior', 'permission', 'w... | [0. 0. 0. ... 0. 0. 0.] |
7604 | 591 | circulated form binding cover published withou... | ['circulated', 'form', 'binding', 'cover', 'pu... | [0. 0. 0. ... 0. 0. 0.] |
7605 | 591 | condition including condition imposed subseque... | ['condition', 'including', 'condition', 'impos... | [0. 0. 0. ... 1. 0. 0.] |
7606 | 591 | www orionbooks co uk | ['www', 'orionbooks', 'co', 'uk'] | [0. 0. 0. ... 0. 1. 1.] |
7607 rows × 4 columns