Sunday, 4 January 2015

Extracting Allergen Information from HTML Recipes to XML before importing to Google Sheets.

I use Google sheets as a daily menu system for residents in a home and it works great but now the good old EU Food Information for Consumers Regulation (1169/2011) means i have to be able to show whats in each of our dishes.

The "Top 14" allergens according to them are
  1. Celery 
  2. Gluten 
  3. Crustaceans 
  4. Eggs 
  5. Fish 
  6. Lupin 
  7. Milk 
  8. Mollusc 
  9. Mustard 
  10. Nuts 
  11. Peanuts 
  12. Sesame seeds 
  13. Soya 
  14. Sulphur Dioxide 
  15. Banana <-- Extra one i added

Make sure all your recipes (.html format) include Title, Category and Allergen information.

French Onion Soup  <-- Title

Soups  <-- Category

Allergy advice
Contains:- Celery  <-- Allergens

  • Butter
  • 5lb. Spanish onions (sliced)
  • 4 litres beef stock (knor)
  • Brown sugar (Sprinkle)
  • Splash of sherry
  • 2 x tbsp thyme

  1. Fry onions in butter until golden brown (10-15 minutes)
  2. Add sugar and sherry
  3. Add thyme and stock and bring to boil
  4. Reduce and simmer for 45 minutes or until cooked
  5. Adjust seasoning and consistency if necessary

Now we only need to get the information in blue from the recipe above before automatically inserting it to our spreadsheet.
See the following PHP script which takes the information we need and inserts it into an XML file.

$xmlFileName  = "recipes.xml";
//  Creates a new SimpleXMLElement object 
$xml          = new SimpleXMLElement('<xml/>');
//  Path to recipe folder
$recipeFolder = "";
//  Find all files that end with .html and use each one as recipe
foreach (glob('*.html') as $recipeName) {
    //  Set allergen variable with empty value 
    $allergenString = "";
    $doc            = new DOMDocument();
    @$doc->loadHTMLFile($recipeFolder . $recipeName);
    $xpath                = new DOMXPath($doc);
    $category             = $xpath->query("//div[@id = 'category']")->item(0)->nodeValue;
    $title                = $xpath->query("//title")->item(0)->nodeValue;
    $allergen             = $xpath->query("//h2[@id = 'allergens']")->item(0)->nodeValue;
    $capsInsensitiveNode  = strtolower($allergen);
    $allergenArray        = array(
        "Sesame seeds",
        "Sulphur Dioxide",
    $capsInsensitiveArray = array_map('strtolower', $allergenArray);
    $stringToArrayNode    = explode(',', $capsInsensitiveNode);
    for ($i = 0; $i < count($capsInsensitiveArray); $i++) {
        if (in_array($capsInsensitiveArray[$i], $stringToArrayNode)) {
            $allergenString .= "1,";
        } //  in_array($capsInsensitiveArray[$i], $stringToArrayNode)
        else {
            $allergenString .= "0,";
    } //  $i = 0; $i < count($capsInsensitiveArray); $i++
    $dropLastCharAllegens = substr($allergenString, 0, strlen($allergenString) - 1);
    $recipe               = $xml->addChild('recipe');
    $recipe->addChild('filename', $recipeFolder . $recipeName);
    $recipe->addChild('category', $category);
    $recipe->addChild('title', $title);
    $recipe->addChild('allergens', $dropLastCharAllegens);
} //  glob('*.html') as $recipeName
//  Show xml file
Header('Content-type: text/xml');
//  Save xml file
//  Open and format xml file
$simplexml               = simplexml_load_file($xmlFileName);
$dom                     = new DOMDocument('1.0');
$dom->preserveWhiteSpace = false;
$dom->formatOutput       = true;
$xml = new SimpleXMLElement($dom->saveXML());

Example XML file.

<?xml version="1.0"?>
<title>French Onion Soup</title>

Now using IMPORTXML we can import the XML file into our spreadsheet and use it as we require.
=IMPORTXML("", "//recipe")

Job done!

No comments:

Post a Comment