"    return any(char in text for char in chars_to_check)\n",
"\n",
"\n",
"# Lists to store samples with suspicious characters\n",
"question_samples = []\n",
"answer_samples = []\n",
"\n",
"# Read the jsonl file again to find examples\n",
"with open(file_path, 'r', encoding='utf-8') as f:\n",
"with open(file_path, \"r\", encoding=\"utf-8\") as f:\n",
"    for line in f:\n",
"        data = json.loads(line)\n",
"\n",
"        # Check question\n",
"        if 'question' in data and contains_suspicious_chars(data['question'].lower(), suspicious_chars):\n",
"            question_samples.append({\n",
"                'id': data.get('id', 'unknown'),\n",
"                'question': data['question'],\n",
"                'suspicious_chars': [char for char in data['question'] if char.lower() in suspicious_chars]\n",
"            })\n",
"        if \"question\" in data and contains_suspicious_chars(data[\"question\"].lower(), suspicious_chars):\n",
"            question_samples.append(\n",
"                {\n",
"                    \"id\": data.get(\"id\", \"unknown\"),\n",
"                    \"question\": data[\"question\"],\n",
"                    \"suspicious_chars\": [char for char in data[\"question\"] if char.lower() in suspicious_chars],\n",
"                }\n",
"            )\n",
"\n",
"        # Check answer\n",
"        if 'answer' in data and contains_suspicious_chars(data['answer'].lower(), suspicious_chars):\n",
"            answer_samples.append({\n",
"                'id': data.get('id', 'unknown'),\n",
"                'answer': data['answer'],\n",
"                'suspicious_chars': [char for char in data['answer'] if char.lower() in suspicious_chars]\n",
"            })\n",
"        if \"answer\" in data and contains_suspicious_chars(data[\"answer\"].lower(), suspicious_chars):\n",
"            answer_samples.append(\n",
"                {\n",
"                    \"id\": data.get(\"id\", \"unknown\"),\n",
"                    \"answer\": data[\"answer\"],\n",
"                    \"suspicious_chars\": [char for char in data[\"answer\"] if char.lower() in suspicious_chars],\n",
"                }\n",
"            )\n",
"\n",
"# Print some samples with suspicious characters in questions\n",
"print(f\"Found {len(question_samples)} samples with suspicious characters in questions\")\n",
@ -866,6 +876,103 @@
"}\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# example 4 hop - original\n",
"{\n",
" \"id\": \"4hop2__160585_14670_8987_8974\",\n",
" \"paragraphs\": [\n",
" {\n",
" \"idx\": 0,\n",
" \"title\": \"United States Army\",\n",
" \"paragraph_text\": \"Currently, the army is divided into the Regular Army, the Army Reserve, and the Army National Guard. The army is also divided into major branches such as Air Defense Artillery, Infantry, Aviation, Signal Corps, Corps of Engineers, and Armor. Before 1903 members of the National Guard were considered state soldiers unless federalized (i.e., activated) by the President. Since the Militia Act of 1903 all National Guard soldiers have held dual status: as National Guardsmen under the authority of the governor of their state or territory and, when activated, as a reserve of the U.S. Army under the authority of the President.\",\n",
" \"is_supporting\": true,\n",
" },\n",
" {\n",
" \"idx\": 1,\n",
" \"title\": \"Iron Beam\",\n",
" \"paragraph_text\": 'Iron Beam (, \"\") is an air defense system which is in development by Israeli defense contractor Rafael Advanced Defense Systems. Unveiled at the Singapore Air Show on February 11, 2014. The system is designed to destroy short-range rockets, artillery, and mortars with a range of up to , too small for the Iron Dome system to intercept effectively. In addition, the system could also intercept unmanned aerial vehicles. Iron Beam will use a \"directed high energy laser beam\" to destroy hostile targets with ranges of up to . Iron Beam will constitute the fifth element of Israel\\'s integrated air defense system, in addition to Arrow 2, Arrow 3, David\\'s Sling, and Iron Dome. However, Iron Beam is also a stand-alone system.',\n",
" \"is_supporting\": false,\n",
" },\n",
" ...\n",
" {\n",
" \"idx\": 19,\n",
" \"title\": \"Josip Broz Tito\",\n",
" \"paragraph_text\": \"In 1968, Tito offered Czechoslovak leader Alexander Dub\\u010dek to fly to Prague on three hours notice if Dub\\u010dek needed help in facing down the Soviets. In April 1969, Tito removed generals Ivan Go\\u0161njak and Rade Hamovi\\u0107 in the aftermath of the invasion of Czechoslovakia due to the unpreparedness of the Yugoslav army to respond to a similar invasion of Yugoslavia.\",\n",
" \"is_supporting\": true,\n",
" },\n",
" ],\n",
" \"question\": \"Hana Mandlikova was born in Country A that invaded Country B because the military branch the Air Defense Artillery is part of was unprepared. Country B was the only communist country to have an embassy where?\",\n",
" \"question_decomposition\": [\n",
" {\n",
" \"id\": 160585,\n",
" \"question\": \"Where was Hana Mandlikova born?\",\n",
" \"answer\": \"Czechoslovakia\",\n",
" \"paragraph_support_idx\": 15,\n",
" },\n",
" {\n",
" \"id\": 14670,\n",
" \"question\": \"The Air Defense Artillery is a branch of what?\",\n",
" \"answer\": \"the Army\",\n",
" \"paragraph_support_idx\": 0,\n",
" },\n",
" {\n",
" \"id\": 8987,\n",
" \"question\": \"What #2 was unprepared for the invasion of #1 ?\",\n",
" \"answer\": \"Yugoslavia\",\n",
" \"paragraph_support_idx\": 19,\n",
" },\n",
" {\n",
" \"id\": 8974,\n",
" \"question\": \"#3 was the only communist country to have an embassy where?\",\n",
" \"question\": \"Hana Mandlikova was born in Country A that invaded Country B because the military branch the Air Defense Artillery is part of was unprepared. Country B was the only communist country to have an embassy where?\",\n",
" \"supporting_paragraphs\": [  # texts of the paragraphs with is_supporting=True\n",
" \"Currently, the army is divided into the Regular Army, the Army Reserve, and the Army National Guard. The army is also divided into major branches such as Air Defense Artillery, Infantry, Aviation, Signal Corps, Corps of Engineers, and Armor. Before 1903 members of the National Guard were considered state soldiers unless federalized (i.e., activated) by the President. Since the Militia Act of 1903 all National Guard soldiers have held dual status: as National Guardsmen under the authority of the governor of their state or territory and, when activated, as a reserve of the U.S. Army under the authority of the President.\",\n",
" \"bla bla bla\",\n",
" \"bla bla bla\",\n",
" \"In 1968, Tito offered Czechoslovak leader Alexander Dub\\u010dek to fly to Prague on three hours notice if Dub\\u010dek needed help in facing down the Soviets. In April 1969, Tito removed generals Ivan Go\\u0161njak and Rade Hamovi\\u0107 in the aftermath of the invasion of Czechoslovakia due to the unpreparedness of the Yugoslav army to respond to a similar invasion of Yugoslavia.\",\n",
" \"question\": \"One of the actors in a Pound of Flesh is from a European county whose king Albert I lived during a major war. When did Italy enter that war?\", \"answer\": \"1915\", \"supporting_paragraphs\": [\"The Queen Elisabeth Medical Foundation (QEMF) is a Belgian non-profit organization, founded in 1926 by Elisabeth of Bavaria, wife of Albert I. She founded the organization, based on her experience with the wounded from the front-line during the First World War. The foundation wants to encourage laboratory research and contacts between researchers and clinical practitioners, with a particular focus on neurosciences. The QEMF supports seventeen university teams throughout Belgium.\", \"On 3 May 1915 Italy officially revoked the Triple Alliance. In the following days Giolitti and the neutralist majority of the Parliament opposed declaring war, while nationalist crowds demonstrated in public areas for it. (The nationalist poet Gabriele D'Annunzio called this period le radiose giornate di Maggio -- ``the sunny days of May ''). Giolitti had the support of the majority of Italian parliament so on 13 May Salandra offered his resignation to King Victor Emmanuel III, but then Giolitti learned that the London Pact was already signed: fearful of a conflict between the Crown and the Parliament and the consequences on both internal stability and foreign relationships, Giolitti accepted the fait accompli, declined to succeed as prime minister and Salandra's resignation was not accepted. On 23 May, Italy declared war on Austria - Hungary. This was followed by declarations of war on the Ottoman Empire (21 August 1915, following an ultimatum of 3 August), Bulgaria (19 October 1915) and the German Empire (28 August 1916).\", \"JCVD is a 2008 Belgian crime drama film directed by French Tunisian film director Mabrouk el Mechri, and starring Jean-Claude van Damme as a semi-fictionalized version of himself, a down and out action star whose family and career are crumbling around him as he is caught in the middle of a post office heist in his hometown of Brussels, Belgium.\", \"Pound of Flesh is a 2015 Canadian action thriller film directed by Ernie Barbarash, and starring Jean-Claude Van Damme and Darren Shahlavi. It is the third collaboration between Van Damme and Barbarash (following \\\"Assassination Games\\\" in 2011 and \\\"Six Bullets\\\" in 2012).\"]}\n",
"\n",
"\n",
"# Also needed: a script that extracts all paragraphs from the original data (deduplicated)\n",
"# across ALL splits (train, dev, test)\n",
"# into a simple CSV file like the one below.\n",
"# chunk_id is an incremental counter (good enough, since the original data has no chunk_id)\n",
"# title is the title of the paragraph\n",
"# content is the text of the paragraph\n",
"# metadata is a list of the original question ids that this paragraph supports (kept only for backward compatibility with the previous data format)\n",
"\"\"\"\n",
"chunk_id,title,content,metadata\n",
"1, Bla bla bla, bla bla bla, [2hop_xxx_xxx, ....]\n",
"\"\"\"\n",
"\n",
"# Then a FAISS index will be generated from this CSV file with the intfloat/e5-base-v2 embedding model.\n",
"\n",
"# remember the distribution of the number of hops?"
]
},
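{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A minimal sketch of the paragraph-extraction script described in the notes above, not a final implementation.\n",
"# Assumptions (not confirmed by the original pipeline): every split is a JSONL file whose rows have an \"id\" and a\n",
"# \"paragraphs\" list with \"title\", \"paragraph_text\" and \"is_supporting\", as in the 4-hop example above.\n",
"# The function name and any paths passed to it are hypothetical.\n",
"import csv\n",
"import json\n",
"\n",
"\n",
"def extract_unique_paragraphs(jsonl_paths, csv_path):\n",
"    \"\"\"Write every unique (title, text) paragraph from the given splits to a CSV file.\"\"\"\n",
"    chunks = {}  # (title, content) -> ids of the questions this paragraph supports\n",
"    for path in jsonl_paths:\n",
"        with open(path, \"r\", encoding=\"utf-8\") as f:\n",
"            for line in f:\n",
"                if not line.strip():\n",
"                    continue  # skip blank lines\n",
"                row = json.loads(line)\n",
"                for para in row.get(\"paragraphs\", []):\n",
"                    key = (para[\"title\"], para[\"paragraph_text\"])\n",
"                    question_ids = chunks.setdefault(key, [])\n",
"                    if para.get(\"is_supporting\") and row[\"id\"] not in question_ids:\n",
"                        question_ids.append(row[\"id\"])\n",
"\n",
"    with open(csv_path, \"w\", newline=\"\", encoding=\"utf-8\") as f:\n",
"        writer = csv.writer(f)\n",
"        writer.writerow([\"chunk_id\", \"title\", \"content\", \"metadata\"])\n",
"        # dicts keep insertion order, so chunk_id increases in first-seen order\n",
"        for chunk_id, ((title, content), question_ids) in enumerate(chunks.items(), start=1):\n",
"            writer.writerow([chunk_id, title, content, json.dumps(question_ids)])\n",
"    return len(chunks)\n",
"\n",
"\n",
"# Hypothetical usage; the file names are placeholders, not paths from the original data:\n",
"# extract_unique_paragraphs([\"train.jsonl\", \"dev.jsonl\", \"test.jsonl\"], \"corpus.csv\")"
]
},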
{
"cell_type": "markdown",
"metadata": {},
@ -927,8 +1034,7 @@
"question = \"\"\"\n",
"Each vertex of a regular octagon is independently colored either red or blue with equal probability. The probability that the octagon can then be rotated so that all of the blue vertices end up at positions where there were originally red vertices is $\\tfrac{m}{n}$, where $m$ and $n$ are relatively prime positive integers. What is $m+n$?\n",