Google BigQuery‎ > ‎

Load JSON data to BigQuery

BigQuery所接受的JSON格式為以段行隔開的JSON檔案,也就是資料欄之間會以段行隔開,而每一行都屬於一個完整的JSON格式資料。
使用JSON格式的好處為:
  • 單次資料上傳可以最大至1TB
  • 資料中可以夾帶另一個JSON值或是JSON Array作為階層資料格式
下面是一個模擬訂單實例格式化過的JSON範例資料(單一資料欄),其中items欄位為JSON Array格式,收容訂購的產品項目資料,一項訂單的訂購項目將有一個以上...

{

 "transaction_time": "2013-10-20T14:32:22",

 "transaction_id": 50124,

 "total_spending": 300,

 "coupon_discount": 20,

 "items": [

   {

     "name": "xt920",

     "category": "mobile",

     "quantity": 10,

     "spending": 320

   },

   {

     "name": "xt920b",

     "category": "battery",

     "quantity": 11,

     "spending": 20

   },

   {

     "name": "xt921",

     "category": "mobile",

     "quantity": 10,

     "spending": 200

   }

 ]

}


為格式化過的真實的資料(去除斷行與不必要空格)會像這樣:

{"transaction_time":"2013-10-20T14:32:22","transaction_id":50124,"total_spending":300,"coupon_discount":20,"items":[{"name":"xt920","category":"mobile","quantity":10,"spending":320},{"name":"xt920b","category":"battery","quantity":11,"spending":20},{"name":"xt921","category":"mobile","quantity":10,"spending":200}]}

Load JSON Data


對應上面的資料形態,Schema物件可以定義成下面狀態:

$ cat schema.json

[

 {"name":"transaction_time","mode":"nullable","type":"string"},

 {"name":"transaction_id","mode":"required","type":"integer"},

 {"name":"total_spending","mode":"required","type":"integer"},

 {"name":"coupon_discount","mode":"required","type":"integer"},

 {"name":"items","mode":"repeated","type":"record", "fields": [

   {"name":"name","mode":"required","type":"string"},

   {"name":"category","mode":"required","type":"string"},

   {"name":"quantity","mode":"required","type":"integer"},

   {"name":"spending","mode":"required","type":"integer"},

   {"name":"descript","mode":"nullable","type":"string"}

 ]}

]


其中items為repeated欄位,內部即夾帶了另一個資料格式,並且該欄位指定了repeated,因此該欄位可以夾帶JSON Array格式資料,以達到階層方式的資料格式。最後,透過bq指令即可仔入該JSON資料...

$ bq load --source_format NEWLINE_DELIMITED_JSON project_id:dataset_id.table_name ./sample.json ./schema.json

Waiting on bqjob_r2110b4bab734de17_00000141df2a4271_1 ... (69s) Current status: DONE


附註:網頁界面上或是指令列模式以gcloud config set project [project-id]設定過專案代號後,則可以不用帶入project_id。

Reference






上述Schema格式若有定義錯誤的地方,資料在載入的過程中會有錯誤提示,範例如下:

載入檔案時發生 “non-repeated field” 錯誤


$ cat schema.json

[

 {"name":"transaction_time","mode":"nullable","type":"string"},

 {"name":"transaction_id","mode":"required","type":"integer"},

 {"name":"total_spending","mode":"required","type":"integer"},

 {"name":"coupon_discount","mode":"required","type":"integer"},

 {"name":"items","mode":"required","type":"record", "fields": [

   {"name":"name","mode":"required","type":"string"},

   {"name":"category","mode":"required","type":"string"},

   {"name":"quantity","mode":"required","type":"integer"},

   {"name":"spending","mode":"required","type":"integer"},

   {"name":"descript","mode":"required","type":"string"}

 ]}

]


$ bq load --source_format NEWLINE_DELIMITED_JSON cp300.testdb ./sample.json ./schema.json

Waiting on bqjob_r3d39c5449048c912_00000141df277e2c_1 ... (25s) Current status: DONE

BigQuery error in load operation: Error processing job 'mitac-cp300:bqjob_r3d39c5449048c912_00000141df277e2c_1': Too many errors encountered. Limit is: 0.

Failure details:

- array specified for non-repeated field


載入檔案時發生 “missing required field(s)” 錯誤


$ cat schema.json

[

 {"name":"transaction_time","mode":"nullable","type":"string"},

 {"name":"transaction_id","mode":"required","type":"integer"},

 {"name":"total_spending","mode":"required","type":"integer"},

 {"name":"coupon_discount","mode":"required","type":"integer"},

 {"name":"items","mode":"repeated","type":"record", "fields": [

   {"name":"name","mode":"required","type":"string"},

   {"name":"category","mode":"required","type":"string"},

   {"name":"quantity","mode":"required","type":"integer"},

   {"name":"spending","mode":"required","type":"integer"},

   {"name":"descript","mode":"required","type":"string"}

 ]}

]


$ bq load --source_format NEWLINE_DELIMITED_JSON cp300.testdb ./sample.json ./schema.json

Waiting on bqjob_r128c022a9d7e9a45_00000141df299024_1 ... (16s) Current status: DONE

BigQuery error in load operation: Error processing job 'mitac-cp300:bqjob_r128c022a9d7e9a45_00000141df299024_1': Too many errors encountered. Limit is: 0.

Failure details:

- missing required field(s)


Comments