```mermaid
gantt
    title Proposed Gantt chart for the assignment
    dateFormat DD-MM-YYYY
    axisFormat %d-%m-%y
    section Assignment
    Assignment : a1, 02-12-2024, 50d
    section Steps
    Tutorial 1 - Term Extraction : milestone, m1, 02-12-2024, 0d
    Topic & data 1 : a2, 02-12-2024, 14d
    Tutorial 2 - Topic Modelling : milestone, m2, 09-12-2024, 0d
    Usage Analysis : a3, after a2, 7d
    Tutorial 3 - NER : milestone, m3, 16-12-2024, 0d
    Second Data set : a4, after a3, 7d
    Tutorial 4 - Sentiment : milestone, m4, 06-01-2025, 0d
    Finalize analysis : a5, after a4, 7d
    Tutorial 5 - LLM : milestone, m5, 15-01-2025, 0d
    Finish writing : a6, after a5, 14d
```
NLP Module Assignment
For the NLP module you are tasked with preparing a short, data-driven report that must not exceed 2,000 words. The assignment is partly individual and partly group work.
The approach of this assignment is exploratory: you are not required to use a specific theory or theoretical framework. The aim is to explore a topic in a data-driven way, given different data sources. You will, however, have to specify and explain what you expect to find in specific data sources, why you used them, and how you expect to find this information.
- In groups of two, you will:
    - define a topic on which to do your assignment
    - prepare the data sets and run the different NLP tasks together
    - write the introduction, usage analysis and methodology sections together
    - write the analysis of the results
- Individually, you will:
    - write the conclusion and discussion
    - identify potential ethical issues associated with the data sets or the NLP techniques used, discuss how these issues could impact the analysis, and suggest ways to mitigate them
The use of large language models (ChatGPT, Llama, Bard) to help with programming is allowed. The main focus is on the quality of the analysis of the results. You are not allowed to use data sets that you have entirely generated yourself; you are, of course, allowed to compute additional data on top of an existing data set containing real data.
You will use at least two different data sets on a specific topic and analyse the data. The richer the data, the more value can be extracted. To make things a bit easier, I describe three types of analysis with mandatory data; this ensures that you have enough data to work with. If you have your own idea, that is always an option under the conditions described below and after consultation with me (mainly to ensure feasibility). All types of analysis are graded on the same rubric.
- Your assignment will contain:
    - At least two different textual data sources.
    - A topic model for each of the data sources, with a description of the process that led you to the final topics (the data-cleaning steps and the method used to identify the number of topics).
    - A Named Entity Recognition section in which you extract different elements from news articles and connect the identified elements to words, topics or other variables in the database in a coherent manner.
    - Sentiment analysis on the news articles; the sentiment scores can be connected to the other variables identified in both the NER exercise and the topic modelling exercise. A minimal code sketch covering these three components follows this list.
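The sketch below strings the three mandatory components together on a toy corpus. It is a minimal illustration, assuming English-language articles and the scikit-learn, spaCy and NLTK/VADER libraries; none of these libraries are prescribed by the assignment, and other tooling is equally acceptable.

```python
# Toy illustration of the three mandatory components; library choices
# (scikit-learn, spaCy, NLTK/VADER) are examples, not requirements.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import spacy  # model install: python -m spacy download en_core_web_sm
from nltk.sentiment import SentimentIntensityAnalyzer  # needs nltk.download("vader_lexicon")

articles = [
    "The new solar panel factory in Rotterdam was praised by analysts.",
    "Critics say the wind farm project harms local bird populations.",
    # ... your real corpus goes here
]

# 1) Topic model: document-term matrix + LDA. The number of topics is a
#    choice you must motivate (e.g. by comparing coherence across values).
vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(articles)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(dtm)
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top_terms = [terms[i] for i in weights.argsort()[::-1][:5]]
    print(f"Topic {k}: {top_terms}")

# 2) Named Entity Recognition with spaCy's small English model.
nlp = spacy.load("en_core_web_sm")
for doc in nlp.pipe(articles):
    print([(ent.text, ent.label_) for ent in doc.ents])

# 3) Sentiment per article with VADER; the compound score can later be
#    joined with topics and entities in one table.
sia = SentimentIntensityAnalyzer()
for text in articles:
    print(sia.polarity_scores(text)["compound"])
```

In your own work, the number of topics, the entity types you keep and the sentiment model are all choices that the rubric expects you to justify and report.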
In the sections below, some analysis types are presented from which you can choose one. You are free to come up with your own design, but be aware of the time restrictions for this assignment. Using one of the proposed analysis types ensures data availability and feasibility.
Types of analysis
Technology-focused Analysis
In this type of analysis you will focus on a technology of your own choosing. As a mandatory data source you will use patent data. Patents are chosen because they contain a lot of valuable data you can analyse: who is doing research on this technology, where they are from, who they are collaborating with, comparisons between companies or regions, quality assessment, etc.
You will complement this data with, at a minimum, articles from the Lexis Uni database. The main aim of this analysis is to study a specific technology and show how it is perceived in society. A small sketch of this kind of patent-metadata exploration follows the list of data sources below.
Some data sources to check out:
- European Environment Agency Data Hub (data related to specific technologies)
- Lexis Uni can be accessed via the UU website
- DBnomics (various databases)
- Eurostat
- Lens.org (for Patents and Scientific Publications)
- CORDIS (European research project database)
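As an illustration of the kind of questions patent metadata can answer, the sketch below tallies applicants, jurisdictions and filing years with pandas. The file name and column names are hypothetical; adapt them to the export format of the patent source you actually use (for example Lens.org).

```python
# Hypothetical exploration of patent metadata exported as CSV (e.g. from
# Lens.org). The file name and the column names "applicant", "jurisdiction"
# and "publication_year" are assumptions; adapt them to your actual export.
import pandas as pd

patents = pd.read_csv("patents_export.csv")

# Who files patents on this technology?
print(patents["applicant"].value_counts().head(10))

# Where are the filings concentrated?
print(patents.groupby("jurisdiction").size().sort_values(ascending=False).head(10))

# How does filing activity evolve over time?
print(patents.groupby("publication_year").size())
```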
Bibliographic analysis
This type of analysis allows you to tackle questions that are less technology-oriented: think about questions related to social justice, emotions and the climate transition, governance issues, etc. The idea is to perform a bibliographic analysis on a topic of your choice and connect it to a societal view with the help of articles from Lexis Uni.
You can identify who is doing research on this topic, which institutions they work for, what the main topics are, and how they evolve over time. This analysis is then connected to an analysis of news sources to examine how society perceives the chosen topic.
Some data sources to check out:
- Quality of government
- Academic Freedom
- Values Survey
- European Environment Agency Data Hub (data related to specific technologies)
- DBnomics (various databases)
- Eurostat
- Lens.org (for Patents and Scientific Publications)
- CORDIS (European research project database)
- Overton (via the UU library)
- Lexis Nexis (via the UU library)
Your own idea
If none of the above options suit you, and you have a good idea you would like to pursue, go for it! Read the rubric for the course carefully and ensure that the data you have chosen connects to the rubric. In any case, contact me to explain your idea.
- To avoid issues with the assignment, make sure of the following:
    - You can find two complementary data sets on your topic.
    - The topic relates to sustainability.
    - There are enough observations to perform an analysis on: aim for at least 40 observations in each data set, and make sure the texts are long enough (a quick check is sketched after this list).
    - Check the validity of the data and make sure you have the rights to use it; there is a lot of data available on the internet, so pay attention to the rights attached to it and to where it comes from.
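A quick way to verify these size requirements is sketched below. It assumes the data set is a CSV file with a text column; both the file name and the column name are placeholders for whatever your data actually contains.

```python
# Quick sanity check on data-set size and text length. The file name and the
# "text" column are placeholders for whatever your data actually contains.
import pandas as pd

df = pd.read_csv("my_dataset.csv")

print("Number of observations:", len(df))      # aim for at least 40
word_counts = df["text"].fillna("").str.split().str.len()
print(word_counts.describe())                   # are the texts long enough?
```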
Some data sources to check out:
- Quality of government
- Academic Freedom
- Values Survey
- European Environment Agency Data Hub (data related to specific technologies)
- DBnomics (various databases)
- Eurostat
- Lens.org (for Patents and Scientific Publications)
- CORDIS (European research project database)
- Overton (via the UU library)
- Lexis Uni (via the UU library)
Schedule
To keep up with the rapid pace of this course, you can follow the Gantt chart at the top of this document to check your progress. In each tutorial you will learn new skills that allow you to dig deeper into your data. If you keep up with this schedule, you should be able to work on your data in each tutorial and extract new information from it.
Rubric for this assignment
This assignment will be graded on different dimensions. The detailed rubric can be found on Blackboard.
| Dimension | Description |
|---|---|
| Quality of writing | Spelling, grammar, structure |
| Data Choice | How coherent the chosen data is with the topic |
| Usage Analysis | Data description: where does the data come from, which source, any biases? |
| Data Preparation | Which steps you took to prepare the data for your analysis (lemmatisation, stemming, regex, dictionaries; a sketch follows this table) |
| Data Analysis | Which indicators and methods you used to analyse your data, and how coherent the analysis is |
| Transparency in thresholds | Explain choices made in the analysis (thresholds, number of topics, filtering of documents) |
| Conclusion-Discussion | Quality of conclusions pulled from the data. |
| Code | Results from the report should be easy to find in the script. The steps should be well documented. |
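To illustrate the kind of preparation steps the rubric refers to, the sketch below combines simple regex clean-up with spaCy lemmatisation and stop-word removal. It is only an example of possible steps, not a prescribed pipeline; whichever steps you choose should be reported and motivated.

```python
# Example preparation pipeline: regex clean-up, lemmatisation and stop-word
# removal with spaCy. These are illustrative steps, not a prescribed recipe.
import re
import spacy  # model install: python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def preprocess(text: str) -> list[str]:
    text = re.sub(r"http\S+", " ", text)       # strip URLs
    text = re.sub(r"[^A-Za-z\s]", " ", text)   # keep letters only
    doc = nlp(text.lower())
    return [tok.lemma_ for tok in doc
            if not tok.is_stop and len(tok.lemma_) > 2]

print(preprocess("The EU approved 3 new solar subsidies: https://example.org"))
```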