Data Preparation and Operations

 

Competency

Explain data storage processes and database management systems.

Scenario

Sprockets Corporation designs high-end, specialty machine parts for a variety of industries. You have been hired by Sprockets to assist them with their data analysis needs. John Sprocket, CEO, has asked you to take a number of text documents (blogs and emails) and prepare them for analysis. This requires that the documents be normalized into word vectors. For example, the quoted text “this Program is a test to see if this works, I hope we have the right program” would be represented as words: [“program”, “test”, “works”, “hope”, “right”] and counts: [2, 1, 1, 1, 1]. This output was produced by:

  • Setting all characters to lower case
  • Omitting all words under four characters
  • Eliminating ‘stop words’, that is, words that have no significance outside of the context of a sentence (e.g. ‘this’)
  • Counting all of the words left over
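The steps above can be sketched in a few lines of Python. This is only an illustration, not the required solution: the function name word_vector, the regular expression used to split the text into words, and the two-word stop list are assumptions made for this example (a fuller stop list, such as the one in NLTK, would be used in practice).

```python
import re
from collections import Counter

# Tiny stop-word list, assumed here only so the example is self-contained.
STOP_WORDS = {"this", "have"}

def word_vector(text, stop_words=STOP_WORDS, min_length=4):
    """Return (words, counts) for one piece of text."""
    # Lower-case the text, then pull out runs of letters as words.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Keep words of at least min_length characters that are not stop words.
    kept = [t for t in tokens if len(t) >= min_length and t not in stop_words]
    # Counter remembers the order in which each word was first seen.
    tally = Counter(kept)
    return list(tally.keys()), list(tally.values())

sample = "this Program is a test to see if this works, I hope we have the right program"
words, counts = word_vector(sample)
print(words)   # ['program', 'test', 'works', 'hope', 'right']
print(counts)  # [2, 1, 1, 1, 1]
```

Lower-casing comes first so that “Program” and “program” are counted as the same word.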

The prepared deliverable will be used to produce one vector over all of the email documents, one over all of the blogs, and one containing both data sets.
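One possible way to arrive at the three vectors is to count each collection separately and then add the counts together for the combined set. The sketch below assumes the documents have already been read into lists of strings; the sample sentences, the stop list, and the names count_words and to_vectors are hypothetical placeholders, not part of the Sprockets data.

```python
import re
from collections import Counter

# Placeholder stop list, assumed for illustration only.
STOP_WORDS = {"this", "have", "that", "with"}

def count_words(documents, stop_words=STOP_WORDS, min_length=4):
    """Tally the kept words across a list of document strings."""
    tally = Counter()
    for text in documents:
        tokens = re.findall(r"[a-z]+", text.lower())
        tally.update(t for t in tokens
                     if len(t) >= min_length and t not in stop_words)
    return tally

def to_vectors(tally):
    """Split a tally into a word list and a matching count list."""
    return list(tally.keys()), list(tally.values())

# Made-up stand-ins for the real email and blog documents.
emails = ["Please test this program today", "The program works"]
blogs = ["We hope the program ships", "Right now this test passes"]

email_counts = count_words(emails)
blog_counts = count_words(blogs)
combined_counts = email_counts + blog_counts  # Counter addition sums the counts

for label, tally in [("emails", email_counts),
                     ("blogs", blog_counts),
                     ("both", combined_counts)]:
    words, counts = to_vectors(tally)
    print(label, words, counts)
```

Adding the two tallies avoids processing every document a second time for the combined vector.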

Instructions

John Sprocket, CEO, has sent you the following details in an email, indicating that he would like a memo addressed to him and the leadership team at Sprockets. The memo should contain the specific deliverables for this task, with a brief explanation of each element as necessary:

  • The Python code created for this task
  • Three output vectors

Notes:

  • You may use any relevant Python library for this task, including NLTK.
  • Include the word vector (list) and the count vector (list) for each of the three deliverables.
  • Before counting the unique words, set all characters to lower case. This reduces redundancy as you normalize these Word documents from unstructured data to structured data.
  • Remove all words under the length of four characters.
  • Remove any other words that are deemed ‘stop words’, that is, words that are not significant outside the context of a sentence.
  • The Python NLTK library includes collections of ‘stop words’ that can be compared against your data sets.
  • Use the following links for additional information:
      • https://learning.rasmussen.edu/bbcswebdav/pid-5855336-dt-content-rid-151629589_1/xid-151629589_1
      • http://rasmussen.libanswers.com/faq/32874
      • https://www.geeksforgeeks.org/removing-stop-words-nltk-python/
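
As a rough illustration of the notes above, the sketch below filters the sample sentence against NLTK's English stop-word list. It assumes NLTK is installed and that its stop-word corpus has been downloaded once with nltk.download("stopwords"). The fallback list and the names load_stop_words and word_vector are assumptions added so the example still runs where NLTK is not available.

```python
import re
from collections import Counter

# Small fallback list (an assumption for illustration), used only when
# NLTK or its stop-word corpus is not available on this machine.
FALLBACK_STOP_WORDS = {"this", "that", "have", "with", "from", "they", "were", "been"}

def load_stop_words():
    """Return NLTK's English stop words, or the fallback set if unavailable."""
    try:
        # Requires a one-time nltk.download("stopwords") beforehand.
        from nltk.corpus import stopwords
        return set(stopwords.words("english"))
    except (ImportError, LookupError):
        return set(FALLBACK_STOP_WORDS)

def word_vector(text, stop_words, min_length=4):
    """Lower-case, drop short words and stop words, then count the rest."""
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [t for t in tokens if len(t) >= min_length and t not in stop_words]
    tally = Counter(kept)
    return list(tally.keys()), list(tally.values())

stop_words = load_stop_words()
sample = "this Program is a test to see if this works, I hope we have the right program"
words, counts = word_vector(sample, stop_words)
print(words)
print(counts)
```

Storing the stop words in a set keeps each membership check fast, which matters once the email and blog collections grow large.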