We start by installing the library we will use: !pip install textract
and then we import it:
import textract
Next, we load the text of the PDF we will work with into a variable by calling textract.process() as follows:
text = textract.process('/content/Frank Herbert - Dune-Orion Publishing Group (2020).pdf',
method='pdfminer',
encoding='ascii')
Let's take a look at the beginning of the text we have extracted:
text[0:100]
b'\x0cDUNE\n\nFrank Herbert\n\nwww.sfgateway.com\n\n\x0cEnter the SF Gateway \xe2\x80\xa6\n\nIn the last years of the t'
Note that \x0c
represents a page break (form feed), so we can split our text into pages as follows:
list_text_inicial = text.decode().split('\x0c')
where the decode() method
converts the bytes object into an ordinary string, and we then split the text at the page breaks; that is, our text is now organized by pages.
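As a quick sanity check (a sketch; the exact page count depends on the PDF, so it is not shown here), we can confirm the types involved and the number of pages obtained:
# text is a bytes object, list_text_inicial is a list of page strings
print(type(text), type(list_text_inicial), len(list_text_inicial))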
# For example, access the first page, which is empty
list_text_inicial[0]
''
# Access the page at index 4
list_text_inicial[4]
'DUNE\n\nA beginning is the time for taking the most delicate care that the balances are correct. This\nevery sister of the Bene Gesserit knows. To begin your study of the life of Muad’Dib, then,\ntake care that you first place him in his time: born in the 57th year of the Padishah\nEmperor, Shaddam IV. And take the most special care that you locate Muad’Dib in his\nplace: the planet Arrakis. Do not be deceived by the fact that he was born on Caladan and\nlived his first fifteen years there. Arrakis, the planet known as Dune, is forever his place.\n\n—from “Manual of Muad’Dib” by the Princess Irulan\n\nIn the week before their departure to Arrakis, when all the final scurrying\nabout had reached a nearly unbearable frenzy, an old crone came to visit the\nmother of the boy, Paul.\n\nIt was a warm night at Castle Caladan, and the ancient pile of stone that\nhad served the Atreides family as home for twenty-six generations bore that\ncooled-sweat feeling it acquired before a change in the weather.\n\nThe old woman was let in by the side door down the vaulted passage by\nPaul’s room and she was allowed a moment to peer in at him where he lay\nin his bed.\n\nBy the half-light of a suspensor lamp, dimmed and hanging near the floor,\nthe awakened boy could see a bulky female shape at his door, standing one\nstep ahead of his mother. The old woman was a witch shadow—hair like\nmatted spiderwebs, hooded ‘round darkness of features, eyes like glittering\njewels.\n\n“Is he not small for his age, Jessica?” the old woman asked. Her voice\n\nwheezed and twanged like an untuned baliset.\n\nPaul’s mother answered in her soft contralto: “The Atreides are known to\n\nstart late getting their growth, Your Reverence.”\n\n“So I’ve heard, so I’ve heard,” wheezed the old woman. “Yet he’s already\n\nfifteen.”\n\n“Yes, Your Reverence.”\n\n'
print(type(list_text_inicial[4]))
<class 'str'>
Next, \n\n
represents a double line break, so we can split the page at index 4 on double line breaks:
page_4 = list_text_inicial[4].split('\n\n')
page_4
['DUNE', 'A beginning is the time for taking the most delicate care that the balances are correct. This\nevery sister of the Bene Gesserit knows. To begin your study of the life of Muad’Dib, then,\ntake care that you first place him in his time: born in the 57th year of the Padishah\nEmperor, Shaddam IV. And take the most special care that you locate Muad’Dib in his\nplace: the planet Arrakis. Do not be deceived by the fact that he was born on Caladan and\nlived his first fifteen years there. Arrakis, the planet known as Dune, is forever his place.', '—from “Manual of Muad’Dib” by the Princess Irulan', 'In the week before their departure to Arrakis, when all the final scurrying\nabout had reached a nearly unbearable frenzy, an old crone came to visit the\nmother of the boy, Paul.', 'It was a warm night at Castle Caladan, and the ancient pile of stone that\nhad served the Atreides family as home for twenty-six generations bore that\ncooled-sweat feeling it acquired before a change in the weather.', 'The old woman was let in by the side door down the vaulted passage by\nPaul’s room and she was allowed a moment to peer in at him where he lay\nin his bed.', 'By the half-light of a suspensor lamp, dimmed and hanging near the floor,\nthe awakened boy could see a bulky female shape at his door, standing one\nstep ahead of his mother. The old woman was a witch shadow—hair like\nmatted spiderwebs, hooded ‘round darkness of features, eyes like glittering\njewels.', '“Is he not small for his age, Jessica?” the old woman asked. Her voice', 'wheezed and twanged like an untuned baliset.', 'Paul’s mother answered in her soft contralto: “The Atreides are known to', 'start late getting their growth, Your Reverence.”', '“So I’ve heard, so I’ve heard,” wheezed the old woman. “Yet he’s already', 'fifteen.”', '“Yes, Your Reverence.”', '']
which leaves the page at index 4 apparently split into paragraphs.
# First paragraph
page_4[0]
'DUNE'
# Second paragraph
page_4[1]
'A beginning is the time for taking the most delicate care that the balances are correct. This\nevery sister of the Bene Gesserit knows. To begin your study of the life of Muad’Dib, then,\ntake care that you first place him in his time: born in the 57th year of the Padishah\nEmperor, Shaddam IV. And take the most special care that you locate Muad’Dib in his\nplace: the planet Arrakis. Do not be deceived by the fact that he was born on Caladan and\nlived his first fifteen years there. Arrakis, the planet known as Dune, is forever his place.'
We note that our text is still a bit dirty. Next, we split every page into paragraphs:
# New text with every page split into paragraphs:
text_total = []
for page in list_text_inicial[4:]:
    text_total.append(page.split('\n\n'))
# Let's look at the first page of our new document
text_total[0]
['DUNE', 'A beginning is the time for taking the most delicate care that the balances are correct. This\nevery sister of the Bene Gesserit knows. To begin your study of the life of Muad’Dib, then,\ntake care that you first place him in his time: born in the 57th year of the Padishah\nEmperor, Shaddam IV. And take the most special care that you locate Muad’Dib in his\nplace: the planet Arrakis. Do not be deceived by the fact that he was born on Caladan and\nlived his first fifteen years there. Arrakis, the planet known as Dune, is forever his place.', '—from “Manual of Muad’Dib” by the Princess Irulan', 'In the week before their departure to Arrakis, when all the final scurrying\nabout had reached a nearly unbearable frenzy, an old crone came to visit the\nmother of the boy, Paul.', 'It was a warm night at Castle Caladan, and the ancient pile of stone that\nhad served the Atreides family as home for twenty-six generations bore that\ncooled-sweat feeling it acquired before a change in the weather.', 'The old woman was let in by the side door down the vaulted passage by\nPaul’s room and she was allowed a moment to peer in at him where he lay\nin his bed.', 'By the half-light of a suspensor lamp, dimmed and hanging near the floor,\nthe awakened boy could see a bulky female shape at his door, standing one\nstep ahead of his mother. The old woman was a witch shadow—hair like\nmatted spiderwebs, hooded ‘round darkness of features, eyes like glittering\njewels.', '“Is he not small for his age, Jessica?” the old woman asked. Her voice', 'wheezed and twanged like an untuned baliset.', 'Paul’s mother answered in her soft contralto: “The Atreides are known to', 'start late getting their growth, Your Reverence.”', '“So I’ve heard, so I’ve heard,” wheezed the old woman. “Yet he’s already', 'fifteen.”', '“Yes, Your Reverence.”', '']
# First page, paragraph 1
text_total[0][0]
'DUNE'
# First page, paragraph 2
text_total[0][1]
'A beginning is the time for taking the most delicate care that the balances are correct. This\nevery sister of the Bene Gesserit knows. To begin your study of the life of Muad’Dib, then,\ntake care that you first place him in his time: born in the 57th year of the Padishah\nEmperor, Shaddam IV. And take the most special care that you locate Muad’Dib in his\nplace: the planet Arrakis. Do not be deceived by the fact that he was born on Caladan and\nlived his first fifteen years there. Arrakis, the planet known as Dune, is forever his place.'
We see that our document still contains the line-break metacharacter. Let's clean those line breaks up:
# New document without the line-break metacharacter
text_total_new = []
# Loop over every page
for pagina in text_total:
    # Create a list to store each paragraph of the page,
    # but without \n
    page = []
    # Loop over each paragraph of the current page
    for parrafo in pagina:
        # Keep only the paragraphs that have at least one character
        if len(parrafo) > 0:
            # Replace \n with a blank space
            page.append(parrafo.replace('\n', ' '))
    # Add the new page to the new document
    text_total_new.append(page)
We note that the \n metacharacter
is no longer in our document:
# Before
text_total[0][1]
'A beginning is the time for taking the most delicate care that the balances are correct. This\nevery sister of the Bene Gesserit knows. To begin your study of the life of Muad’Dib, then,\ntake care that you first place him in his time: born in the 57th year of the Padishah\nEmperor, Shaddam IV. And take the most special care that you locate Muad’Dib in his\nplace: the planet Arrakis. Do not be deceived by the fact that he was born on Caladan and\nlived his first fifteen years there. Arrakis, the planet known as Dune, is forever his place.'
# After
text_total_new[0][1]
'A beginning is the time for taking the most delicate care that the balances are correct. This every sister of the Bene Gesserit knows. To begin your study of the life of Muad’Dib, then, take care that you first place him in his time: born in the 57th year of the Padishah Emperor, Shaddam IV. And take the most special care that you locate Muad’Dib in his place: the planet Arrakis. Do not be deceived by the fact that he was born on Caladan and lived his first fifteen years there. Arrakis, the planet known as Dune, is forever his place.'
In preprocessing we aim to keep the elements of the text that carry the most relevant information. For example, the following symbols will not contribute relevant information:
simbolos = [':',
            '=',
            ';',
            "'", '(', ')',
            '~',
            '[', ']'
            ]
Now we want to remove or omit those symbols from our document, for which we will use regular expressions as follows:
import re
re.sub(r'[\W]+', ' ',text_total_new[0][1])
'A beginning is the time for taking the most delicate care that the balances are correct This every sister of the Bene Gesserit knows To begin your study of the life of Muad Dib then take care that you first place him in his time born in the 57th year of the Padishah Emperor Shaddam IV And take the most special care that you locate Muad Dib in his place the planet Arrakis Do not be deceived by the fact that he was born on Caladan and lived his first fifteen years there Arrakis the planet known as Dune is forever his place '
Here we are replacing every character that is not a word character (\W
matches anything that is not a letter, digit, or underscore) with a blank space, in this case on our first page, second paragraph. Below we apply this cleanup to the whole document, using [\W_]+ instead of [\W]+ so that underscores are removed as well.
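A toy comparison on a hypothetical string (not taken from the book) illustrates the difference between the two patterns:
re.sub(r'[\W]+', ' ', "it's_a-test")   # "it s_a test" (the underscore survives)
re.sub(r'[\W_]+', ' ', "it's_a-test")  # 'it s a test'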
# New document without non-word characters
new_list_clean = []
# Loop over every page
for page in text_total_new:
    # List that will store the cleaned page
    page_new = []
    # Loop over each paragraph
    for paragraph in page:
        if len(paragraph.split()) > 0:
            # Perform the substitutions
            page_new.append(re.sub(r'[\W_]+', ' ', paragraph))
    # Add the cleaned page to the new document
    new_list_clean.append(page_new)
new_list_clean[0]
['DUNE', 'A beginning is the time for taking the most delicate care that the balances are correct This every sister of the Bene Gesserit knows To begin your study of the life of Muad Dib then take care that you first place him in his time born in the 57th year of the Padishah Emperor Shaddam IV And take the most special care that you locate Muad Dib in his place the planet Arrakis Do not be deceived by the fact that he was born on Caladan and lived his first fifteen years there Arrakis the planet known as Dune is forever his place ', ' from Manual of Muad Dib by the Princess Irulan', 'In the week before their departure to Arrakis when all the final scurrying about had reached a nearly unbearable frenzy an old crone came to visit the mother of the boy Paul ', 'It was a warm night at Castle Caladan and the ancient pile of stone that had served the Atreides family as home for twenty six generations bore that cooled sweat feeling it acquired before a change in the weather ', 'The old woman was let in by the side door down the vaulted passage by Paul s room and she was allowed a moment to peer in at him where he lay in his bed ', 'By the half light of a suspensor lamp dimmed and hanging near the floor the awakened boy could see a bulky female shape at his door standing one step ahead of his mother The old woman was a witch shadow hair like matted spiderwebs hooded round darkness of features eyes like glittering jewels ', ' Is he not small for his age Jessica the old woman asked Her voice', 'wheezed and twanged like an untuned baliset ', 'Paul s mother answered in her soft contralto The Atreides are known to', 'start late getting their growth Your Reverence ', ' So I ve heard so I ve heard wheezed the old woman Yet he s already', 'fifteen ', ' Yes Your Reverence ']
This leaves us with text that contains only alphanumeric characters. However, we run into a problem with contractions, for example I ve instead of I've, which loses information.
To handle contractions we need to install !pip install contractions
and then import the library:
import contractions
For example
contractions.contractions_dict
{"I'm": 'I am', "I'm'a": 'I am about to', "I'm'o": 'I am going to', "I've": 'I have', "I'll": 'I will', "I'll've": 'I will have', "I'd": 'I would', "I'd've": 'I would have', 'Whatcha': 'What are you', "amn't": 'am not', "ain't": 'are not', "aren't": 'are not', "'cause": 'because', "can't": 'cannot', "can't've": 'cannot have', "could've": 'could have', "couldn't": 'could not', "couldn't've": 'could not have', "daren't": 'dare not', "daresn't": 'dare not', "dasn't": 'dare not', "didn't": 'did not', 'didn’t': 'did not', "don't": 'do not', 'don’t': 'do not', "doesn't": 'does not', "e'er": 'ever', "everyone's": 'everyone is', 'finna': 'fixing to', 'gimme': 'give me', "gon't": 'go not', 'gonna': 'going to', 'gotta': 'got to', "hadn't": 'had not', "hadn't've": 'had not have', "hasn't": 'has not', "haven't": 'have not', "he've": 'he have', "he's": 'he is', "he'll": 'he will', "he'll've": 'he will have', "he'd": 'he would', "he'd've": 'he would have', "here's": 'here is', "how're": 'how are', "how'd": 'how did', "how'd'y": 'how do you', "how's": 'how is', "how'll": 'how will', "isn't": 'is not', "it's": 'it is', "'tis": 'it is', "'twas": 'it was', "it'll": 'it will', "it'll've": 'it will have', "it'd": 'it would', "it'd've": 'it would have', 'kinda': 'kind of', "let's": 'let us', 'luv': 'love', "ma'am": 'madam', "may've": 'may have', "mayn't": 'may not', "might've": 'might have', "mightn't": 'might not', "mightn't've": 'might not have', "must've": 'must have', "mustn't": 'must not', "mustn't've": 'must not have', "needn't": 'need not', "needn't've": 'need not have', "ne'er": 'never', "o'": 'of', "o'clock": 'of the clock', "ol'": 'old', "oughtn't": 'ought not', "oughtn't've": 'ought not have', "o'er": 'over', "shan't": 'shall not', "sha'n't": 'shall not', "shalln't": 'shall not', "shan't've": 'shall not have', "she's": 'she is', "she'll": 'she will', "she'd": 'she would', "she'd've": 'she would have', "should've": 'should have', "shouldn't": 'should not', "shouldn't've": 'should not have', "so've": 'so have', "so's": 'so is', "somebody's": 'somebody is', "someone's": 'someone is', "something's": 'something is', 'sux': 'sucks', "that're": 'that are', "that's": 'that is', "that'll": 'that will', "that'd": 'that would', "that'd've": 'that would have', 'em': 'them', "there're": 'there are', "there's": 'there is', "there'll": 'there will', "there'd": 'there would', "there'd've": 'there would have', "these're": 'these are', "they're": 'they are', "they've": 'they have', "they'll": 'they will', "they'll've": 'they will have', "they'd": 'they would', "they'd've": 'they would have', "this's": 'this is', "this'll": 'this will', "this'd": 'this would', "those're": 'those are', "to've": 'to have', 'wanna': 'want to', "wasn't": 'was not', "we're": 'we are', "we've": 'we have', "we'll": 'we will', "we'll've": 'we will have', "we'd": 'we would', "we'd've": 'we would have', "weren't": 'were not', "what're": 'what are', "what'd": 'what did', "what've": 'what have', "what's": 'what is', "what'll": 'what will', "what'll've": 'what will have', "when've": 'when have', "when's": 'when is', "where're": 'where are', "where'd": 'where did', "where've": 'where have', "where's": 'where is', "which's": 'which is', "who're": 'who are', "who've": 'who have', "who's": 'who is', "who'll": 'who will', "who'll've": 'who will have', "who'd": 'who would', "who'd've": 'who would have', "why're": 'why are', "why'd": 'why did', "why've": 'why have', "why's": 'why is', "will've": 'will have', "won't": 'will not', "won't've": 'will not 
have', "would've": 'would have', "wouldn't": 'would not', "wouldn't've": 'would not have', "y'all": 'you all', "y'all're": 'you all are', "y'all've": 'you all have', "y'all'd": 'you all would', "y'all'd've": 'you all would have', "you're": 'you are', "you've": 'you have', "you'll've": 'you shall have', "you'll": 'you will', "you'd": 'you would', "you'd've": 'you would have', 'to cause': 'to cause', 'will cause': 'will cause', 'should cause': 'should cause', 'would cause': 'would cause', 'can cause': 'can cause', 'could cause': 'could cause', 'must cause': 'must cause', 'might cause': 'might cause', 'shall cause': 'shall cause', 'may cause': 'may cause', 'jan.': 'january', 'feb.': 'february', 'mar.': 'march', 'apr.': 'april', 'jun.': 'june', 'jul.': 'july', 'aug.': 'august', 'sep.': 'september', 'oct.': 'october', 'nov.': 'november', 'dec.': 'december', 'I’m': 'I am', 'I’m’a': 'I am about to', 'I’m’o': 'I am going to', 'I’ve': 'I have', 'I’ll': 'I will', 'I’ll’ve': 'I will have', 'I’d': 'I would', 'I’d’ve': 'I would have', 'amn’t': 'am not', 'ain’t': 'are not', 'aren’t': 'are not', '’cause': 'because', 'can’t': 'cannot', 'can’t’ve': 'cannot have', 'could’ve': 'could have', 'couldn’t': 'could not', 'couldn’t’ve': 'could not have', 'daren’t': 'dare not', 'daresn’t': 'dare not', 'dasn’t': 'dare not', 'doesn’t': 'does not', 'e’er': 'ever', 'everyone’s': 'everyone is', 'gon’t': 'go not', 'hadn’t': 'had not', 'hadn’t’ve': 'had not have', 'hasn’t': 'has not', 'haven’t': 'have not', 'he’ve': 'he have', 'he’s': 'he is', 'he’ll': 'he will', 'he’ll’ve': 'he will have', 'he’d': 'he would', 'he’d’ve': 'he would have', 'here’s': 'here is', 'how’re': 'how are', 'how’d': 'how did', 'how’d’y': 'how do you', 'how’s': 'how is', 'how’ll': 'how will', 'isn’t': 'is not', 'it’s': 'it is', '’tis': 'it is', '’twas': 'it was', 'it’ll': 'it will', 'it’ll’ve': 'it will have', 'it’d': 'it would', 'it’d’ve': 'it would have', 'let’s': 'let us', 'ma’am': 'madam', 'may’ve': 'may have', 'mayn’t': 'may not', 'might’ve': 'might have', 'mightn’t': 'might not', 'mightn’t’ve': 'might not have', 'must’ve': 'must have', 'mustn’t': 'must not', 'mustn’t’ve': 'must not have', 'needn’t': 'need not', 'needn’t’ve': 'need not have', 'ne’er': 'never', 'o’': 'of', 'o’clock': 'of the clock', 'ol’': 'old', 'oughtn’t': 'ought not', 'oughtn’t’ve': 'ought not have', 'o’er': 'over', 'shan’t': 'shall not', 'sha’n’t': 'shall not', 'shalln’t': 'shall not', 'shan’t’ve': 'shall not have', 'she’s': 'she is', 'she’ll': 'she will', 'she’d': 'she would', 'she’d’ve': 'she would have', 'should’ve': 'should have', 'shouldn’t': 'should not', 'shouldn’t’ve': 'should not have', 'so’ve': 'so have', 'so’s': 'so is', 'somebody’s': 'somebody is', 'someone’s': 'someone is', 'something’s': 'something is', 'that’re': 'that are', 'that’s': 'that is', 'that’ll': 'that will', 'that’d': 'that would', 'that’d’ve': 'that would have', 'there’re': 'there are', 'there’s': 'there is', 'there’ll': 'there will', 'there’d': 'there would', 'there’d’ve': 'there would have', 'these’re': 'these are', 'they’re': 'they are', 'they’ve': 'they have', 'they’ll': 'they will', 'they’ll’ve': 'they will have', 'they’d': 'they would', 'they’d’ve': 'they would have', 'this’s': 'this is', 'this’ll': 'this will', 'this’d': 'this would', 'those’re': 'those are', 'to’ve': 'to have', 'wasn’t': 'was not', 'we’re': 'we are', 'we’ve': 'we have', 'we’ll': 'we will', 'we’ll’ve': 'we will have', 'we’d': 'we would', 'we’d’ve': 'we would have', 'weren’t': 'were not', 'what’re': 'what are', 'what’d': 'what 
did', 'what’ve': 'what have', 'what’s': 'what is', 'what’ll': 'what will', 'what’ll’ve': 'what will have', 'when’ve': 'when have', 'when’s': 'when is', 'where’re': 'where are', 'where’d': 'where did', 'where’ve': 'where have', 'where’s': 'where is', 'which’s': 'which is', 'who’re': 'who are', 'who’ve': 'who have', 'who’s': 'who is', 'who’ll': 'who will', 'who’ll’ve': 'who will have', 'who’d': 'who would', 'who’d’ve': 'who would have', 'why’re': 'why are', 'why’d': 'why did', 'why’ve': 'why have', 'why’s': 'why is', 'will’ve': 'will have', 'won’t': 'will not', 'won’t’ve': 'will not have', 'would’ve': 'would have', 'wouldn’t': 'would not', 'wouldn’t’ve': 'would not have', 'y’all': 'you all', 'y’all’re': 'you all are', 'y’all’ve': 'you all have', 'y’all’d': 'you all would', 'y’all’d’ve': 'you all would have', 'you’re': 'you are', 'you’ve': 'you have', 'you’ll’ve': 'you shall have', 'you’ll': 'you will', 'you’d': 'you would', 'you’d’ve': 'you would have'}
shows us a dictionary of English contractions (a broad, though not necessarily exhaustive, collection) mapped to their expanded forms
len(contractions.contractions_dict)
343
So, before stripping the symbols from our document we will handle the contractions, turning, for example, I've into I have. Thus:
# The fix method performs the following conversion
contractions.fix("I've")
'I have'
We apply this process to the entire text of our document:
# New document without non-word characters
new_list_clean = []
# Loop over every page
for page in text_total_new:
    # List that will store the cleaned page
    page_new = []
    # Loop over each paragraph
    for paragraph in page:
        if len(paragraph.split()) > 0:
            # Perform the substitutions
            # and expand the contractions in each paragraph
            page_new.append(re.sub(r'[\W_]+', ' ', contractions.fix(paragraph)))
    # Add the cleaned page to the new document
    new_list_clean.append(page_new)
new_list_clean[0][1]
'A beginning is the time for taking the most delicate care that the balances are correct This every sister of the Bene Gesserit knows To begin your study of the life of Muad Dib then take care that you first place him in his time born in the 57th year of the Padishah Emperor Shaddam IV And take the most special care that you locate Muad Dib in his place the planet Arrakis Do not be deceived by the fact that he was born on Caladan and lived his first fifteen years there Arrakis the planet known as Dune is forever his place '
for example, don't becomes do not. We will now remove the possessive 's: Paul's, after stripping the symbols, becomes Paul s, so we drop the stray s.
# New document without non-word characters
new_list_clean = []
# Loop over every page
for page in text_total_new:
    # List that will store the cleaned page
    page_new = []
    # Loop over each paragraph
    for paragraph in page:
        if len(paragraph.split()) > 0:
            # Perform the substitutions,
            # expand the contractions in each paragraph,
            # drop the stray possessive s and lowercase everything
            page_new.append(re.sub(r'[\W_]+', ' ',
                contractions.fix(paragraph)).replace(' s ', ' ').lower())
    # Add the cleaned page to the new document
    new_list_clean.append(page_new)
type(new_list_clean)
list
Stemming lets us reduce words to a root or stem that summarizes them; for example, for the words ran and running the stem is run.
For this we will use the nltk library:
import nltk
In particular we will work with two stemming algorithms:
nltk.stem.PorterStemmer()
nltk.stem.LancasterStemmer()
Let's see what they do.
# Consider the following paragraph
new_list_clean[110][0]
' yes my lord the duke took a deep sighing breath strode out the door he turned to his right down the hall began walking hands behind his back paying little attention to where he was there were corridors and stairs and balconies and halls people who saluted and stood aside for him '
We now build a list comparing how the two algorithms above behave:
PSt = nltk.stem.PorterStemmer()
LSt = nltk.stem.LancasterStemmer()
[(PSt.stem(y), LSt.stem(y), y) for y in new_list_clean[110][0].split()]
[('ye', 'ye', 'yes'), ('my', 'my', 'my'), ('lord', 'lord', 'lord'), ('the', 'the', 'the'), ('duke', 'duk', 'duke'), ('took', 'took', 'took'), ('a', 'a', 'a'), ('deep', 'deep', 'deep'), ('sigh', 'sigh', 'sighing'), ('breath', 'brea', 'breath'), ('strode', 'strode', 'strode'), ('out', 'out', 'out'), ('the', 'the', 'the'), ('door', 'door', 'door'), ('he', 'he', 'he'), ('turn', 'turn', 'turned'), ('to', 'to', 'to'), ('hi', 'his', 'his'), ('right', 'right', 'right'), ('down', 'down', 'down'), ('the', 'the', 'the'), ('hall', 'hal', 'hall'), ('began', 'beg', 'began'), ('walk', 'walk', 'walking'), ('hand', 'hand', 'hands'), ('behind', 'behind', 'behind'), ('hi', 'his', 'his'), ('back', 'back', 'back'), ('pay', 'pay', 'paying'), ('littl', 'littl', 'little'), ('attent', 'at', 'attention'), ('to', 'to', 'to'), ('where', 'wher', 'where'), ('he', 'he', 'he'), ('wa', 'was', 'was'), ('there', 'ther', 'there'), ('were', 'wer', 'were'), ('corridor', 'corrid', 'corridors'), ('and', 'and', 'and'), ('stair', 'stair', 'stairs'), ('and', 'and', 'and'), ('balconi', 'balcony', 'balconies'), ('and', 'and', 'and'), ('hall', 'hal', 'halls'), ('peopl', 'peopl', 'people'), ('who', 'who', 'who'), ('salut', 'salut', 'saluted'), ('and', 'and', 'and'), ('stood', 'stood', 'stood'), ('asid', 'asid', 'aside'), ('for', 'for', 'for'), ('him', 'him', 'him')]
we see, for example, that the original word yes has been turned into ye by both stemming algorithms, and the word aside has been turned into asid.
With stemming we can obtain stems that are not necessarily dictionary words. With lemmatization, on the other hand, the resulting form is always a dictionary word. For this:
from nltk.stem import WordNetLemmatizer
lemattizer = WordNetLemmatizer()
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('stopwords')
nltk.download('punkt')
[nltk_data] Downloading package wordnet to /root/nltk_data... [nltk_data] Downloading package omw-1.4 to /root/nltk_data... [nltk_data] Downloading package stopwords to /root/nltk_data... [nltk_data] Unzipping corpora/stopwords.zip. [nltk_data] Downloading package punkt to /root/nltk_data... [nltk_data] Unzipping tokenizers/punkt.zip.
True
Now we repeat the comparison we made for stemming, this time adding the lemmatizer:
[(PSt.stem(y), LSt.stem(y), lemattizer.lemmatize(y), y) for y in new_list_clean[110][0].split()]
[('ye', 'ye', 'yes', 'yes'), ('my', 'my', 'my', 'my'), ('lord', 'lord', 'lord', 'lord'), ('the', 'the', 'the', 'the'), ('duke', 'duk', 'duke', 'duke'), ('took', 'took', 'took', 'took'), ('a', 'a', 'a', 'a'), ('deep', 'deep', 'deep', 'deep'), ('sigh', 'sigh', 'sighing', 'sighing'), ('breath', 'brea', 'breath', 'breath'), ('strode', 'strode', 'strode', 'strode'), ('out', 'out', 'out', 'out'), ('the', 'the', 'the', 'the'), ('door', 'door', 'door', 'door'), ('he', 'he', 'he', 'he'), ('turn', 'turn', 'turned', 'turned'), ('to', 'to', 'to', 'to'), ('hi', 'his', 'his', 'his'), ('right', 'right', 'right', 'right'), ('down', 'down', 'down', 'down'), ('the', 'the', 'the', 'the'), ('hall', 'hal', 'hall', 'hall'), ('began', 'beg', 'began', 'began'), ('walk', 'walk', 'walking', 'walking'), ('hand', 'hand', 'hand', 'hands'), ('behind', 'behind', 'behind', 'behind'), ('hi', 'his', 'his', 'his'), ('back', 'back', 'back', 'back'), ('pay', 'pay', 'paying', 'paying'), ('littl', 'littl', 'little', 'little'), ('attent', 'at', 'attention', 'attention'), ('to', 'to', 'to', 'to'), ('where', 'wher', 'where', 'where'), ('he', 'he', 'he', 'he'), ('wa', 'was', 'wa', 'was'), ('there', 'ther', 'there', 'there'), ('were', 'wer', 'were', 'were'), ('corridor', 'corrid', 'corridor', 'corridors'), ('and', 'and', 'and', 'and'), ('stair', 'stair', 'stair', 'stairs'), ('and', 'and', 'and', 'and'), ('balconi', 'balcony', 'balcony', 'balconies'), ('and', 'and', 'and', 'and'), ('hall', 'hal', 'hall', 'halls'), ('peopl', 'peopl', 'people', 'people'), ('who', 'who', 'who', 'who'), ('salut', 'salut', 'saluted', 'saluted'), ('and', 'and', 'and', 'and'), ('stood', 'stood', 'stood', 'stood'), ('asid', 'asid', 'aside', 'aside'), ('for', 'for', 'for', 'for'), ('him', 'him', 'him', 'him')]
for example, lemmatizing the word balconies gives balcony.
Let's now work with stopwords (very common words such as articles, pronouns, and prepositions that carry little meaning on their own and are usually filtered out before analysis).
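As a side note, WordNetLemmatizer treats every word as a noun unless a part-of-speech tag is passed, which is why verb forms like turned or sighing were left unchanged above. A minimal sketch:
lemattizer.lemmatize('balconies')        # 'balcony'
lemattizer.lemmatize('turned')           # 'turned' (treated as a noun by default)
lemattizer.lemmatize('turned', pos='v')  # 'turn'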
from nltk.corpus import stopwords
set(stopwords.words('english'))
{'a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', 'her', 'here', 'hers', 'herself', 'him', 'himself', 'his', 'how', 'i', 'if', 'in', 'into', 'is', 'isn', "isn't", 'it', "it's", 'its', 'itself', 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she', "she's", 'should', "should've", 'shouldn', "shouldn't", 'so', 'some', 'such', 't', 'than', 'that', "that'll", 'the', 'their', 'theirs', 'them', 'themselves', 'then', 'there', 'these', 'they', 'this', 'those', 'through', 'to', 'too', 'under', 'until', 'up', 've', 'very', 'was', 'wasn', "wasn't", 'we', 'were', 'weren', "weren't", 'what', 'when', 'where', 'which', 'while', 'who', 'whom', 'why', 'will', 'with', 'won', "won't", 'wouldn', "wouldn't", 'y', 'you', "you'd", "you'll", "you're", "you've", 'your', 'yours', 'yourself', 'yourselves'}
len(set(stopwords.words('english')))
179
So, we will clean the stopwords out of our document, using the word_tokenize() method,
which splits a text into words:
from nltk.tokenize import word_tokenize
word_tokenize('hello how are you')
['hello', 'how', 'are', 'you']
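Combining word_tokenize with the stopword list, a quick sketch on the same toy sentence (the expected output follows from the stopword set shown above):
sw = set(stopwords.words('english'))
[w for w in word_tokenize('hello how are you') if w not in sw]
['hello']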
Putting this together for the whole document:
# New document without the stopwords
final_text = []
# Loop over every page
for page in new_list_clean:
    # List that will store the cleaned page without stopwords
    page_new = []
    # Loop over each paragraph
    for paragraph in page:
        text = []
        # Loop over each word of the paragraph
        # (obtained with word_tokenize)
        for word in word_tokenize(paragraph):
            # If the word is not a stopword, add it to the
            # list text
            if word not in stopwords.words('english'):
                text.append(word)
        # Join the surviving words of text back into a single string,
        # rebuilding the paragraph, and add it to page_new
        page_new.append(' '.join(text))
    # Add each page to the final document
    final_text.append(page_new)
final_text[0][1]
'beginning time taking delicate care balances correct every sister bene gesserit knows begin study life muad dib take care first place time born 57th year padishah emperor shaddam iv take special care locate muad dib place planet arrakis deceived fact born caladan lived first fifteen years arrakis planet known dune forever place'
We can then lemmatize our text:
Lematizador = []
for page in final_text:
    page_aux = []
    for paragraph in page:
        # Lemmatize each paragraph (the whole paragraph is passed as a single string)
        page_aux.append(lemattizer.lemmatize(paragraph))
    Lematizador.append(page_aux)
Lematizador[0][1]
'beginning time taking delicate care balances correct every sister bene gesserit knows begin study life muad dib take care first place time born 57th year padishah emperor shaddam iv take special care locate muad dib place planet arrakis deceived fact born caladan lived first fifteen years arrakis planet known dune forever place'
Finally we will organize our information in a dataframe; before that, we tag each paragraph with its page number:
total_text_lematizador = []
for num, page in enumerate(Lematizador):
    for paragraph in page:
        total_text_lematizador.append((num + 1, paragraph))
total_text_lematizador[0:15]
[(1, 'dune'), (1, 'beginning time taking delicate care balances correct every sister bene gesserit knows begin study life muad dib take care first place time born 57th year padishah emperor shaddam iv take special care locate muad dib place planet arrakis deceived fact born caladan lived first fifteen years arrakis planet known dune forever place'), (1, 'manual muad dib princess irulan'), (1, 'week departure arrakis final scurrying reached nearly unbearable frenzy old crone came visit mother boy paul'), (1, 'warm night castle caladan ancient pile stone served atreides family home twenty six generations bore cooled sweat feeling acquired change weather'), (1, 'old woman let side door vaulted passage paul room allowed moment peer lay bed'), (1, 'half light suspensor lamp dimmed hanging near floor awakened boy could see bulky female shape door standing one step ahead mother old woman witch shadow hair like matted spiderwebs hooded round darkness features eyes like glittering jewels'), (1, 'small age jessica old woman asked voice'), (1, 'wheezed twanged like untuned baliset'), (1, 'paul mother answered soft contralto atreides known'), (1, 'start late getting growth reverence'), (1, 'heard heard wheezed old woman yet already'), (1, 'fifteen'), (1, 'yes reverence'), (2, 'thump')]
We build the dataframe:
import pandas as pd
df_lematizador = pd.DataFrame(total_text_lematizador).rename(columns={0: 'Página', 1: 'Párrafos'})
df_lematizador
 | Página | Párrafos
---|---|---
0 | 1 | dune |
1 | 1 | beginning time taking delicate care balances c... |
2 | 1 | manual muad dib princess irulan |
3 | 1 | week departure arrakis final scurrying reached... |
4 | 1 | warm night castle caladan ancient pile stone s... |
... | ... | ... |
7602 | 591 | part publication may reproduced stored retriev... |
7603 | 591 | means without prior permission writing publish... |
7604 | 591 | circulated form binding cover published withou... |
7605 | 591 | condition including condition imposed subseque... |
7606 | 591 | www orionbooks co uk |
7607 rows × 2 columns
Now we tokenize each of the paragraphs:
# Define a new column for the tokenized paragraphs
df_lematizador['Párrafos tokenizados'] = df_lematizador['Párrafos'].apply(lambda x: word_tokenize(x))
df_lematizador
 | Página | Párrafos | Párrafos tokenizados
---|---|---|---
0 | 1 | dune | [dune] |
1 | 1 | beginning time taking delicate care balances c... | [beginning, time, taking, delicate, care, bala... |
2 | 1 | manual muad dib princess irulan | [manual, muad, dib, princess, irulan] |
3 | 1 | week departure arrakis final scurrying reached... | [week, departure, arrakis, final, scurrying, r... |
4 | 1 | warm night castle caladan ancient pile stone s... | [warm, night, castle, caladan, ancient, pile, ... |
... | ... | ... | ... |
7602 | 591 | part publication may reproduced stored retriev... | [part, publication, may, reproduced, stored, r... |
7603 | 591 | means without prior permission writing publish... | [means, without, prior, permission, writing, p... |
7604 | 591 | circulated form binding cover published withou... | [circulated, form, binding, cover, published, ... |
7605 | 591 | condition including condition imposed subseque... | [condition, including, condition, imposed, sub... |
7606 | 591 | www orionbooks co uk | [www, orionbooks, co, uk] |
7607 rows × 3 columns
Let's find out how many words our vocabulary has:
# Select the column 'Párrafos tokenizados' and
# convert that column to a list
parr_token_lista = df_lematizador['Párrafos tokenizados'].to_list()
parr_token_lista[0:2]
[['dune'], ['beginning', 'time', 'taking', 'delicate', 'care', 'balances', 'correct', 'every', 'sister', 'bene', 'gesserit', 'knows', 'begin', 'study', 'life', 'muad', 'dib', 'take', 'care', 'first', 'place', 'time', 'born', '57th', 'year', 'padishah', 'emperor', 'shaddam', 'iv', 'take', 'special', 'care', 'locate', 'muad', 'dib', 'place', 'planet', 'arrakis', 'deceived', 'fact', 'born', 'caladan', 'lived', 'first', 'fifteen', 'years', 'arrakis', 'planet', 'known', 'dune', 'forever', 'place']]
giving a list of lists with all the words of our document. We join all of these lists:
parr_token_lista = sum(parr_token_lista, [])
parr_token_lista[0:10]
['dune', 'beginning', 'time', 'taking', 'delicate', 'care', 'balances', 'correct', 'every', 'sister']
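As a side note, the same flattening can be done with itertools.chain, which avoids the quadratic cost that sum(list_of_lists, []) incurs on long lists. A minimal alternative to the sum call above (applied to the original list of lists, not to the already-flattened result):
from itertools import chain
parr_token_lista = list(chain.from_iterable(df_lematizador['Párrafos tokenizados'].to_list()))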
so the elements of our list are now simply all the words of our document, and we can keep only the distinct words as follows:
parr_token_lista = set(parr_token_lista)
print(len(parr_token_lista))
11978
so our vocabulary consists of 11978 distinct words.
We can count how many times each word appears in our text; for this:
from collections import Counter
# Join the tokenized paragraphs into a single list
all_words = sum(df_lematizador['Párrafos tokenizados'].to_list(), [])
# Count the number of occurrences of each word in our document
dict_frecuencia_palabras = Counter(all_words)
# Build a dataframe from the counts
df_freq=pd.DataFrame([dict_frecuencia_palabras]).transpose().rename(columns={0:'Frecuencia'}).reset_index()
df_freq
 | index | Frecuencia
---|---|---
0 | dune | 52 |
1 | beginning | 25 |
2 | time | 322 |
3 | taking | 32 |
4 | delicate | 14 |
... | ... | ... |
11973 | publisher | 1 |
11974 | subsequent | 1 |
11975 | purchaser | 1 |
11976 | orionbooks | 1 |
11977 | co | 1 |
11978 rows × 2 columns
df_freq.sort_values('Frecuencia', ascending=False)
 | index | Frecuencia
---|---|---
146 | said | 2274 |
57 | paul | 1723 |
123 | jessica | 901 |
105 | one | 684 |
533 | thought | 619 |
... | ... | ... |
7509 | reckless | 1 |
7507 | temptation | 1 |
7504 | felled | 1 |
7499 | skyward | 1 |
11977 | co | 1 |
11978 rows × 2 columns
which shows us the most frequent words. We can also find the words that consist of a single character:
df_freq[df_freq['index'].apply(lambda x: len(x)) == 1]
 | index | Frecuencia
---|---|---
974 | c | 38 |
1518 | h | 151 |
2488 | ē | 1 |
2501 | b | 14 |
2502 | g | 16 |
3313 | e | 24 |
3377 | n | 6 |
6095 | p | 4 |
6216 | 2 | 9 |
6252 | f | 2 |
6253 | r | 6 |
6866 | l | 3 |
8914 | 1 | 6 |
11086 | 4 | 3 |
11200 | 3 | 2 |
11442 | 5 | 1 |
11952 | 0 | 1 |
11955 | 9 | 1 |
What we will do now is vectorize each of the documents (rows) in the dataframe df_lematizador,
for which we first build a function:
import numpy as np
# Convert the keys of the word-frequency dictionary to a list
lista_elementos = list(dict_frecuencia_palabras.keys())
# Define the function that will vectorize a string
def vectorizador_by_word(string):
    # Create an array of zeros with 11978 entries
    lista = np.zeros((11978,))
    # Look up the input string to find its position in the list
    # lista_elementos, then place the word's frequency at that
    # same position in the array of zeros _lista_
    lista[lista_elementos.index(string)] = dict_frecuencia_palabras[string]
    vector = lista
    # Return the vectorization
    return vector
For example, if we consider
vectorizador_by_word('beginning')
array([ 0., 25., 0., ..., 0., 0., 0.])
We see that the word beginning sits at index 1 in the list lista_elementos
and its frequency is 25.
With this function we can vectorize any particular word of our vocabulary, obtaining vectors with 11977 zero entries and a single entry holding the frequency of the word in question.
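A quick check of that claim (a sketch; the index and count follow from the output above):
v = vectorizador_by_word('beginning')
print(np.count_nonzero(v), v[lista_elementos.index('beginning')])
1 25.0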
What we will do now is vectorize each tokenized paragraph of the dataframe df_lematizador,
but first we create a function to vectorize entire lists:
def vector_total(lista):
    total = np.zeros((11978,))
    # For each word in the given list,
    # vectorize that word
    for i in lista:
        # and add the resulting vectors on top of the
        # array of zeros created above
        total += vectorizador_by_word(i)
    return total
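A quick sanity check on a toy list (a sketch; note that, since vectorizador_by_word places the word's global frequency, repeating a word in the list multiplies that global count rather than counting occurrences within the paragraph):
vector_total(['dune'])[lista_elementos.index('dune')]          # 52.0
vector_total(['time', 'time'])[lista_elementos.index('time')]  # 644.0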
In short, from a list of words this function produces a vector holding the frequencies of the words that appear in the list, with zeros in the remaining entries for the vocabulary words that do not appear in the list. We now build these vectors for each tokenized paragraph:
df_lematizador['Vectorizados']=df_lematizador['Párrafos tokenizados'].apply(lambda x: vector_total(x) )
df_lematizador
 | Página | Párrafos | Párrafos tokenizados | Vectorizados
---|---|---|---|---
0 | 1 | dune | [dune] | [52.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,... |
1 | 1 | beginning time taking delicate care balances c... | [beginning, time, taking, delicate, care, bala... | [52.0, 25.0, 644.0, 32.0, 14.0, 78.0, 1.0, 12.... |
2 | 1 | manual muad dib princess irulan | [manual, muad, dib, princess, irulan] | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |
3 | 1 | week departure arrakis final scurrying reached... | [week, departure, arrakis, final, scurrying, r... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |
4 | 1 | warm night castle caladan ancient pile stone s... | [warm, night, castle, caladan, ancient, pile, ... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |
... | ... | ... | ... | ... |
7602 | 591 | part publication may reproduced stored retriev... | [part, publication, may, reproduced, stored, r... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |
7603 | 591 | means without prior permission writing publish... | [means, without, prior, permission, writing, p... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |
7604 | 591 | circulated form binding cover published withou... | [circulated, form, binding, cover, published, ... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |
7605 | 591 | condition including condition imposed subseque... | [condition, including, condition, imposed, sub... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |
7606 | 591 | www orionbooks co uk | [www, orionbooks, co, uk] | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |
7607 rows × 4 columns
Note that the vector space generated by all the words of our vocabulary has dimension
len(parr_token_lista)
11978
but the dataframe above has only 7607 rows, each with a column holding the vectorized paragraph. This is a hint that we could reduce the dimensionality of our vector space.
Based on this vectorization, we will work with the metric space induced by the Euclidean vector space, together with cosine similarity. Cosine similarity is obtained from the dot product of two vectors:
$$\lVert v\rVert \, \lVert w\rVert \cos(\theta) = v\cdot w \;\;\Rightarrow\;\; \cos(\theta)=\frac{v\cdot w}{\lVert v\rVert \, \lVert w\rVert}$$
and this is the "metric" we induce on the space of words.
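As a minimal sketch (assuming the Vectorizados column built above is still in memory as numpy arrays), cosine similarity between two paragraph vectors can be computed as:
import numpy as np

def cosine_similarity(v, w):
    # dot product divided by the product of the Euclidean norms
    norm = np.linalg.norm(v) * np.linalg.norm(w)
    return np.dot(v, w) / norm if norm > 0 else 0.0

# e.g. compare the first two vectorized paragraphs (illustrative call only)
print(cosine_similarity(df_lematizador['Vectorizados'].iloc[0],
                        df_lematizador['Vectorizados'].iloc[1]))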
This notebook was written and run in Google Colab (link to the notebook), since it cannot be run on Windows machines because the textract library is not available there.
To use the last dataframe created here in any notebook you can write:
import pandas as pd
df_lematizador = pd.read_csv('https://luisapaez.github.io/Teoria_Galois/Texto_procesado.csv')
df_lematizador
 | Página | Párrafos | Párrafos tokenizados | Vectorizados
---|---|---|---|---
0 | 1 | dune | ['dune'] | [52. 0. 0. ... 0. 0. 0.] |
1 | 1 | beginning time taking delicate care balances c... | ['beginning', 'time', 'taking', 'delicate', 'c... | [ 52. 25. 644. ... 0. 0. 0.] |
2 | 1 | manual muad dib princess irulan | ['manual', 'muad', 'dib', 'princess', 'irulan'] | [0. 0. 0. ... 0. 0. 0.] |
3 | 1 | week departure arrakis final scurrying reached... | ['week', 'departure', 'arrakis', 'final', 'scu... | [0. 0. 0. ... 0. 0. 0.] |
4 | 1 | warm night castle caladan ancient pile stone s... | ['warm', 'night', 'castle', 'caladan', 'ancien... | [0. 0. 0. ... 0. 0. 0.] |
... | ... | ... | ... | ... |
7602 | 591 | part publication may reproduced stored retriev... | ['part', 'publication', 'may', 'reproduced', '... | [0. 0. 0. ... 0. 0. 0.] |
7603 | 591 | means without prior permission writing publish... | ['means', 'without', 'prior', 'permission', 'w... | [0. 0. 0. ... 0. 0. 0.] |
7604 | 591 | circulated form binding cover published withou... | ['circulated', 'form', 'binding', 'cover', 'pu... | [0. 0. 0. ... 0. 0. 0.] |
7605 | 591 | condition including condition imposed subseque... | ['condition', 'including', 'condition', 'impos... | [0. 0. 0. ... 1. 0. 0.] |
7606 | 591 | www orionbooks co uk | ['www', 'orionbooks', 'co', 'uk'] | [0. 0. 0. ... 0. 1. 1.] |
7607 rows × 4 columns