The aim is to program a typical information extraction task using a sequence of pre-processing tasks on a corpus of tweets.
For this project you will be using the NLTK POS tagger that comes as a Python package.
Download the data file “TweetData.txt” from AutOnline.
Use the pre-processing tools and techniques that you have learnt in lectures and labs (such as tokenizing, POS tagging, stop word removal, stemming, chunking etc.) to prepare the text for a higher-level information extraction task.
The information extraction task you are to perform is to determine the 10 most common concepts being discussed in the given tweet corpus.
Proper nouns and nouns are frequently used to formulate concepts.
Full and partial names, such as John Key and Key, are usually merged.
Text noise is usually ignored.
Hashtags can also represent a concept.
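The hints above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a model answer: it assumes each tweet has already been tokenized and POS tagged into (token, tag) pairs with Penn Treebank tags, which is the shape that nltk.pos_tag returns. The function names (is_noise, extract_concepts, merge_partial_names, top_concepts) and the noise rules are hypothetical choices made for this example.

```python
from collections import Counter

NOUN_TAGS = {"NN", "NNS"}        # common nouns (Penn Treebank tags)
PROPER_TAGS = {"NNP", "NNPS"}    # proper nouns


def is_noise(token):
    """Assumed noise rules: retweet markers, mentions, URLs, non-words."""
    return (token.upper() == "RT"
            or token.startswith(("@", "http"))
            or not token.lstrip("#").isalnum())


def extract_concepts(tagged_tweet):
    """Return candidate concepts from one tweet given as (token, tag) pairs."""
    concepts, name = [], []
    for token, tag in tagged_tweet:
        is_hashtag = token.startswith("#")
        if tag in PROPER_TAGS and not is_hashtag and not is_noise(token):
            name.append(token)               # extend the current run of proper nouns
            continue
        if name:                             # run ended: emit it as one concept
            concepts.append(" ".join(name))
            name = []
        if is_noise(token):
            continue
        if is_hashtag or tag in NOUN_TAGS:
            concepts.append(token.lower())   # hashtags and common nouns, case-folded
    if name:
        concepts.append(" ".join(name))
    return concepts


def merge_partial_names(counts):
    """Fold a one-word name into the most frequent longer name containing it."""
    merged = Counter(counts)
    multiword = [c for c in counts if " " in c]
    for concept in list(counts):
        if " " in concept or concept.startswith("#"):
            continue
        hosts = [m for m in multiword if concept in m.split()]
        if hosts:
            best = max(hosts, key=lambda m: counts[m])
            merged[best] += merged.pop(concept)
    return merged


def top_concepts(tagged_tweets, n=10):
    """Count concepts over all tweets and return the n most common."""
    counts = Counter()
    for tagged in tagged_tweets:
        counts.update(extract_concepts(tagged))
    return merge_partial_names(counts).most_common(n)
```

Matching partial names by exact, case-sensitive word keeps a proper noun such as Key separate from the common noun key. It will still merge unrelated people who share a surname, which is the kind of error the report's analysis section could discuss.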
You are required to write a report describing your programming activity in a maximum of 15 pages, excluding references and any appendices. Your report should contain the following:
An introduction describing what you set out to do.
A description of the pre-processing tasks that you carried out to achieve the outcome. Your description should also include the reasons for the various pre-processing tasks.
A detailed description of your concept formulation design.
Your results with proof of output.
An analysis of the errors and how they could be improved.
A discussion of how the concept formulation could be improved.
Conclusions and reflection on your learning.
Option 2-Research option
The aim is to do a thorough literature survey in order to understand the different pre-processing techniques and their functions in achieving a higher-level outcome, such as determining the dominant concepts in a corpus.
Use the topics covered in the lectures and online literature research to collate information on the various tasks that are required before higher-level information extraction can be performed on texts.
Your report should include the techniques, position in the pipeline, and the function of the tasks.
Write a report in a maximum of 15 pages (excluding bibliography and appendices) describing your survey.
Your report should include:
An introduction describing what information extraction is and the different ways in which it can be achieved.
A comparison of information extraction with and without pre-processing.
A detailed description of the various algorithms that are used to perform two pre-processing tasks of your choice (e.g. POS tagging and chunking).
A conclusion and reflection on the research activity carried out for this assignment.
Feldman (2005) refers to information extraction as one of the most important pre-processing methods, one that significantly increases the potential of text mining.
Pre-processing is an essential part of information extraction. In contrast to traditional data mining tasks, the data is not supplied as instances that already have their features extracted from a database.
The two general techniques being used in IE are Natural Language Processing (NLP) and syntactic rules. The techniques are briefly explained below.