Option 1 - Programming option


To program a typical information extraction task using a sequence of pre-processing tasks on a corpus of Tweets.

Task Specification

  1. For this project you will use the NLTK POS tagger, which is available as a Python package.
  2. Download the data file “TweetData.txt” from AutOnline.
  3. Use the pre-processing tools and techniques that you have learnt in lectures and labs (such as tokenizing, POS tagging, stop word removal, stemming and chunking) to prepare the text for a higher-level information extraction task.
  4. The information extraction task you are to perform is to determine the 10 most common concepts being discussed in the given tweet corpus.
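
As a rough illustration of the pipeline described in the task specification, the Python sketch below tokenizes each tweet, tags the tokens, drops stop words and counts the remaining nouns. The names `tokenize`, `top_concepts` and `STOP_WORDS` are hypothetical, the stop-word set is a small placeholder, and the input is assumed to hold one tweet per string. The tagger is passed in as a parameter so that `nltk.pos_tag` (which needs its tagger model downloaded) can be supplied. This is one possible arrangement under those assumptions, not the required solution.

```python
import re
from collections import Counter

# Placeholder stop-word set (an assumption for this sketch); a real run
# would likely use a fuller list such as NLTK's English stop words.
STOP_WORDS = {"the", "a", "an", "is", "are", "to", "of", "in", "on",
              "and", "for", "at", "rt"}


def tokenize(tweet):
    """Drop URLs and @mentions, then split the tweet into alphabetic tokens."""
    cleaned = re.sub(r"https?://\S+|@\w+", " ", tweet)
    return re.findall(r"[A-Za-z]+", cleaned)


def top_concepts(tweets, tagger, n=10):
    """Return the n most frequent noun tokens across the tweets.

    `tagger` is any callable that maps a list of tokens to a list of
    (token, tag) pairs using Penn Treebank tags, e.g. nltk.pos_tag.
    """
    counts = Counter()
    for tweet in tweets:
        tokens = tokenize(tweet)
        if not tokens:
            continue
        # Tag before removing stop words so the tagger sees full context.
        for token, tag in tagger(tokens):
            if token.lower() in STOP_WORDS:
                continue
            # NN, NNS, NNP and NNPS all start with "NN".
            if tag.startswith("NN"):
                counts[token.lower()] += 1
    return counts.most_common(n)


# Example with NLTK (assumes nltk and its tagger model are installed, and
# that TweetData.txt holds one tweet per line):
#   import nltk
#   with open("TweetData.txt", encoding="utf-8") as f:
#       tweets = [line.strip() for line in f if line.strip()]
#   print(top_concepts(tweets, nltk.pos_tag))
```

Tagging before stop-word removal is a deliberate choice here: POS taggers rely on neighbouring words, so removing function words first tends to hurt tag quality. Stemming and chunking are left out to keep the sketch short.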


  1. Proper nouns and nouns are frequently used to formulate concepts.
  2. Full and partial names, such as John Key and Key, are usually merged.
  3. Text noise is usually ignored.
  4. Hash tags can also represent a concept.
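
The points above can be sketched in a few small helpers. The function names (`strip_noise`, `extract_hashtags`, `merge_partial_names`) are hypothetical, and the choice of what counts as noise (URLs, @mentions, the "RT" marker) is an assumption made for this example. The merging rule shown, folding a single word into a multi-word name only when exactly one such name contains it, is just one simple heuristic.

```python
import re
from collections import Counter


def strip_noise(tweet):
    """Remove URLs, @mentions and the retweet marker, then tidy whitespace."""
    tweet = re.sub(r"https?://\S+", " ", tweet)
    tweet = re.sub(r"@\w+", " ", tweet)
    tweet = re.sub(r"\bRT\b", " ", tweet)
    return " ".join(tweet.split())


def extract_hashtags(tweet):
    """Return the hashtags in a tweet, lower-cased and without the '#'."""
    return [tag.lower() for tag in re.findall(r"#(\w+)", tweet)]


def merge_partial_names(counts):
    """Fold counts for partial names into the matching full name.

    A single-word concept such as "Key" is merged into a multi-word
    concept such as "John Key" only when exactly one multi-word concept
    contains it; ambiguous cases are left untouched. Matching is
    case-sensitive in this sketch.
    """
    merged = Counter(counts)
    full_names = [name for name in counts if " " in name]
    for name in list(counts):
        if " " in name:
            continue
        owners = [full for full in full_names if name in full.split()]
        if len(owners) == 1:
            merged[owners[0]] += merged.pop(name)
    return merged
```

For example, `merge_partial_names(Counter({"John Key": 3, "Key": 2}))` folds the two mentions of "Key" into "John Key", giving it a count of 5.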

Write up

  1. You are required to write a report describing your programming activity in a maximum of 15 pages, excluding references and any appendices.  Your report should contain the following:
    1. An introduction describing what you set out to do.
    2. A description of the pre-processing tasks that you carried out to achieve the outcome. Your description should also include the reasons for the various pre-processing tasks.
    3. A detailed description of your concept formulation design.
    4. Your results with proof of output.
    5. An analysis of the errors and how they could be improved.
    6. A discussion of how the concept formulation could be improved.
    7. Conclusions and reflection on your learning.

Option 2 - Research option


To do a thorough literature survey in order to understand the different pre-processing techniques and their functions to achieve a higher level outcome such as determining the dominant concepts in a corpus.

Task Specification

  1. Use the topics covered in the lectures and online literature research to collate information on the various tasks that are required before higher-level information extraction can be performed on texts.
  2. Your report should include the techniques, their position in the pipeline, and the function of each task.
  3. Write a report in a maximum of 15 pages (excluding bibliography and appendices) describing your survey.

Your report should include:

  1. An introduction describing what information extraction is and the different ways in which it can be achieved.
  2. A comparison of information extraction with and without pre-processing.
  3. A detailed description of the various algorithms that are used to achieve a choice of 2 pre-processing tasks (e.g. POS tagging and chunking).
  4. Conclusion and reflection on the research activity carried out for this assignment.

Feldman (2005) describes information extraction as one of the most important pre-processing methods, one that significantly increases the potential of text mining. Pre-processing is an essential part of information extraction. Unlike traditional data mining tasks, the data does not arrive as examples whose features have already been extracted from a database. The two general techniques used in IE are Natural Language Processing (NLP) and syntactic rules. The techniques are briefly explained below.
