Recipes Dataset

    1. Go to the download page https://recipenlg.cs.put.poznan.pl/dataset.
    2. Accept Terms and Conditions and download zip file.
    3. Unpack the zip file with . You will get the full_dataset.csv file.

    Run clickhouse-client and execute the following CREATE query:

    Run the following command:

    1. clickhouse-client --query "
    2. INSERT INTO recipes
    3. SELECT
    4. title,
    5. JSONExtract(ingredients, 'Array(String)'),
    6. JSONExtract(directions, 'Array(String)'),
    7. link,
    8. source,
    9. JSONExtract(NER, 'Array(String)')
    10. FROM input('num UInt32, title String, ingredients String, directions String, link String, source LowCardinality(String), NER String')
    11. FORMAT CSVWithNames
    12. " --input_format_with_names_use_header 0 --format_csv_allow_single_quote 0 --input_format_allow_errors_num 10 < full_dataset.csv

    This is a showcase how to parse custom CSV, as it requires multiple tunes.

    Explanation:
    - The dataset is in CSV format, but it requires some preprocessing on insertion; we use table function to perform preprocessing;
    - The structure of CSV file is specified in the argument of the table function input;
    - The field num (row number) is unneeded - we parse it from file and ignore;
    - We use FORMAT CSVWithNames but the header in CSV will be ignored (by command line parameter --input_format_with_names_use_header 0), because the header does not contain the name for the first field;
    - File is using only double quotes to enclose CSV strings; some strings are not enclosed in double quotes, and single quote must not be parsed as the string enclosing - that’s why we also add the --format_csv_allow_single_quote 0 parameter;
    - Some strings from CSV cannot parse, because they contain \M/ sequence at the beginning of the value; the only value starting with backslash in CSV can be \N that is parsed as SQL NULL. We add --input_format_allow_errors_num 10 parameter and up to ten malformed records can be skipped;
    - There are arrays for ingredients, directions and NER fields; these arrays are represented in unusual form: they are serialized into string as JSON and then placed in CSV - we parse them as String and then use JSONExtract function to transform it to Array.

    Query:

    1. SELECT count() FROM recipes;

    Result:

    In this example we learn how to use function to expand an array into a set of rows.

    Query:

    1. SELECT
    2. arrayJoin(NER) AS k,
    3. count() AS c
    4. FROM recipes
    5. GROUP BY k
    6. ORDER BY c DESC
    7. LIMIT 50
    1. ┌─k────────────────────┬──────c─┐
    2. salt 890741
    3. flour 466110
    4. eggs 401276
    5. onion 372469
    6. garlic 358364
    7. milk 346769
    8. water 326092
    9. vanilla 270381
    10. olive oil 197877
    11. pepper 179305
    12. brown sugar 174447
    13. tomatoes 163933
    14. egg 160507
    15. baking powder 148277
    16. lemon juice 146414
    17. Salt 122557
    18. cinnamon 117927
    19. sour cream 116682
    20. cream cheese 114423
    21. margarine 112742
    22. celery 112676
    23. baking soda 110690
    24. parsley 102151
    25. chicken 101505
    26. onions 98903
    27. vegetable oil 91395
    28. oil 85600
    29. mayonnaise 84822
    30. pecans 79741
    31. nuts 78471
    32. potatoes 75820
    33. carrots 75458
    34. pineapple 74345
    35. black pepper 69064
    36. thyme 68429
    37. mustard 65948
    38. chicken broth 65112
    39. bacon 64956
    40. honey 64626
    41. oregano 64077
    42. ground beef 64068
    43. unsalted butter 63848
    44. mushrooms 61465
    45. Worcestershire sauce 59328
    46. cornstarch 58476
    47. green pepper 58388
    48. Cheddar cheese 58354
    49. └──────────────────────┴────────┘
    50. 50 rows in set. Elapsed: 0.112 sec. Processed 2.23 million rows, 361.57 MB (19.99 million rows/s., 3.24 GB/s.)

    Result:

    1. ┌─title────────────────────────────────────────────────────────────┬─length(NER)─┬─length(directions)─┐
    2. Chocolate-Strawberry-Orange Wedding Cake 24 126
    3. Strawberry Cream Cheese Crumble Tart 19 47
    4. Charlotte-Style Ice Cream 11 45
    5. Sinfully Good a Million Layers Chocolate Layer Cake, With Strawb 31 45
    6. Sweetened Berries With Elderflower Sherbet 24 44
    7. Chocolate-Strawberry Mousse Cake 15 42
    8. Rhubarb Charlotte with Strawberries and Rum 20 42
    9. Chef Joey's Strawberry Vanilla Tart │ 7 │ 37 │
    10. │ Old-Fashioned Ice Cream Sundae Cake │ 17 │ 37 │
    11. │ Watermelon Cake │ 16 │ 36 │
    12. └──────────────────────────────────────────────────────────────────┴─────────────┴────────────────────┘
    13. 10 rows in set. Elapsed: 0.215 sec. Processed 2.23 million rows, 1.48 GB (10.35 million rows/s., 6.86 GB/s.)

    In this example, we involve has function to filter by array elements and sort by the number of directions.

    There is a wedding cake that requires the whole 126 steps to produce! Show that directions:

    Query:

    1. SELECT arrayJoin(directions)

    The dataset is also available in the .