generated from UtrechtUniversity/re-python-package-setuptools
-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
* merge all scripts into one script * add example --------- Co-authored-by: parisa-zahedi <p.zahedi@uu.nl>
- Loading branch information
1 parent
ed7d8fa
commit c87a1d3
Showing
11 changed files
with
234 additions
and
518 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,33 @@ | ||
{ | ||
"filters": [ | ||
{ | ||
"type": "AndFilter", | ||
"filters": [ | ||
{ | ||
"type": "YearFilter", | ||
"start_year": 1800, | ||
"end_year": 1910 | ||
}, | ||
{ | ||
"type": "NotFilter", | ||
"filter": { | ||
"type": "ArticleTitleFilter", | ||
"article_title": "Advertentie" | ||
}, | ||
"level": "article" | ||
}, | ||
{ | ||
"type": "KeywordsFilter", | ||
"keywords": ["dames", "liberalen"] | ||
} | ||
] | ||
} | ||
], | ||
"article_selector": | ||
{ | ||
"type": "percentage", | ||
"value": "30" | ||
}, | ||
"output_unit": "segmented_text", | ||
"sentences_per_segment": 10 | ||
} |
Binary file not shown.
Binary file not shown.
Binary file not shown.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,165 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"id": "7070b655-e16c-4b29-9a96-8a55055ebc34", | ||
"metadata": {}, | ||
"source": [ | ||
"# dataQuest pipeline\n", | ||
"\n", | ||
"This notebook illustrates the complete pipeline of dataQuest, from defining keywords and other metadata to selecting final articles and generating output.\n", | ||
"\n", | ||
"## Step0: Install dataQuest package" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 1, | ||
"id": "cd6b3982-49cd-4150-93f3-e9a55210bec5", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Run the following line to install dataQuest\n", | ||
"# %pip install dataQuest" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "f4f89a52-dcc3-42cb-8631-47d212118733", | ||
"metadata": {}, | ||
"source": [ | ||
"## Step1: Convert your corpus to the expected json format\n", | ||
"\n", | ||
"The expected format is a set of JSON files compressed in the .gz format. Each JSON file contains metadata related to a newsletter, magazine, etc., as well as a list of article titles and their corresponding bodies. These files may be organized within different folders or sub-folders.\n", | ||
"Below is a snapshot of the JSON file format:\n", | ||
"\n", | ||
"```commandline\n", | ||
"{\n", | ||
" \"newsletter_metadata\": {\n", | ||
" \"title\": \"Newspaper title ..\",\n", | ||
" \"language\": \"NL\",\n", | ||
" \"date\": \"1878-04-29\",\n", | ||
" ...\n", | ||
" },\n", | ||
" \"articles\": {\n", | ||
" \"1\": {\n", | ||
" \"title\": \"title of article1 \",\n", | ||
" \"body\": [\n", | ||
" \"paragraph 1 ....\",\n", | ||
" \"paragraph 2....\"\n", | ||
" ]\n", | ||
" },\n", | ||
" \"2\": {\n", | ||
" \"title\": \"title of article2\",\n", | ||
" \"body\": [\n", | ||
" \"text...\" \n", | ||
" ]\n", | ||
" }\n", | ||
" }\n", | ||
"} \n", | ||
"```\n", | ||
"\n", | ||
"You can find a sample of data in [data](https://github.com/UtrechtUniversity/dataQuest/blob/main/example/data/).\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "19685342-cb9f-4439-a2fb-0f22960a94ae", | ||
"metadata": {}, | ||
"source": [ | ||
"## Step2: Create a config file \n", | ||
"\n", | ||
"Create a config file to include the followings:\n", | ||
"- filters\n", | ||
"- criteria to select final articles\n", | ||
"- output format\n", | ||
"\n", | ||
"```\n", | ||
"{\n", | ||
" \"filters\": [\n", | ||
" {\n", | ||
" \"type\": \"AndFilter\",\n", | ||
" \"filters\": [\n", | ||
" {\n", | ||
" \"type\": \"YearFilter\",\n", | ||
" \"start_year\": 1800,\n", | ||
" \"end_year\": 1910\n", | ||
" },\n", | ||
" {\n", | ||
" \"type\": \"NotFilter\",\n", | ||
" \"filter\": {\n", | ||
" \"type\": \"ArticleTitleFilter\",\n", | ||
" \"article_title\": \"Advertentie\"\n", | ||
" },\n", | ||
" \"level\": \"article\"\n", | ||
" },\n", | ||
" {\n", | ||
" \"type\": \"KeywordsFilter\",\n", | ||
" \"keywords\": [\"dames\", \"liberalen\"]\n", | ||
" }\n", | ||
" ]\n", | ||
" }\n", | ||
" ],\n", | ||
" \"article_selector\":\n", | ||
" {\n", | ||
" \"type\": \"percentage\",\n", | ||
" \"value\": \"30\"\n", | ||
" },\n", | ||
" \"output_unit\": \"segmented_text\",\n", | ||
" \"sentences_per_segment\": 10\n", | ||
"}\n", | ||
"```\n", | ||
"\n", | ||
"You can find a sample of [config.json](https://github.com/UtrechtUniversity/dataQuest/blob/main/example/config.json)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "d7f423b2-4a94-409c-bbc0-ec9248cfa838", | ||
"metadata": {}, | ||
"source": [ | ||
"## Step3: Run the pipeline\n", | ||
"Run the following command:\n", | ||
"\n", | ||
"```\n", | ||
"filter-articles\n", | ||
"--input-dir \"data/\"\n", | ||
"--output-dir \"output/\"\n", | ||
"--input-type \"delpher_kranten\"\n", | ||
"--glob \"*.gz\"\n", | ||
"--config-path \"config.json\"\n", | ||
"--period-type \"decade\"\n", | ||
"```" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "ee3390dd-4e89-4a8f-90aa-0f7fe4a72bb7", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3 (ipykernel)", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.11.8" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 5 | ||
} |
Oops, something went wrong.